imperium-crawl 1.5.1 → 1.5.3

Files changed (83)
  1. package/README.md +367 -262
  2. package/dist/captcha/detector.js +3 -3
  3. package/dist/captcha/detector.js.map +1 -1
  4. package/dist/captcha/solver.d.ts.map +1 -1
  5. package/dist/captcha/solver.js +4 -1
  6. package/dist/captcha/solver.js.map +1 -1
  7. package/dist/cli.d.ts.map +1 -1
  8. package/dist/cli.js +11 -8
  9. package/dist/cli.js.map +1 -1
  10. package/dist/knowledge/store.d.ts.map +1 -1
  11. package/dist/knowledge/store.js +1 -9
  12. package/dist/knowledge/store.js.map +1 -1
  13. package/dist/llm/extractor.d.ts +1 -0
  14. package/dist/llm/extractor.d.ts.map +1 -1
  15. package/dist/llm/extractor.js +5 -4
  16. package/dist/llm/extractor.js.map +1 -1
  17. package/dist/llm/providers/anthropic.d.ts.map +1 -1
  18. package/dist/llm/providers/anthropic.js +34 -31
  19. package/dist/llm/providers/anthropic.js.map +1 -1
  20. package/dist/llm/providers/openai.d.ts.map +1 -1
  21. package/dist/llm/providers/openai.js +26 -23
  22. package/dist/llm/providers/openai.js.map +1 -1
  23. package/dist/llm/retry.d.ts +10 -0
  24. package/dist/llm/retry.d.ts.map +1 -0
  25. package/dist/llm/retry.js +35 -0
  26. package/dist/llm/retry.js.map +1 -0
  27. package/dist/recipes/crypto-websocket.json +11 -0
  28. package/dist/recipes/ecommerce-product.json +25 -0
  29. package/dist/recipes/github-trending.json +19 -0
  30. package/dist/recipes/hn-top-stories.json +22 -0
  31. package/dist/recipes/index.d.ts +3 -0
  32. package/dist/recipes/index.d.ts.map +1 -0
  33. package/dist/recipes/index.js +23 -0
  34. package/dist/recipes/index.js.map +1 -0
  35. package/dist/recipes/job-listings-greenhouse.json +17 -0
  36. package/dist/recipes/news-article-reader.json +9 -0
  37. package/dist/recipes/product-reviews.json +33 -0
  38. package/dist/recipes/reddit-posts.json +8 -0
  39. package/dist/recipes/seo-page-audit.json +26 -0
  40. package/dist/recipes/social-media-mentions.json +31 -0
  41. package/dist/server.d.ts.map +1 -1
  42. package/dist/server.js +11 -1
  43. package/dist/server.js.map +1 -1
  44. package/dist/skills/manager.d.ts +36 -1
  45. package/dist/skills/manager.d.ts.map +1 -1
  46. package/dist/skills/manager.js +22 -0
  47. package/dist/skills/manager.js.map +1 -1
  48. package/dist/stealth/antibot-detector.js +2 -2
  49. package/dist/stealth/antibot-detector.js.map +1 -1
  50. package/dist/stealth/chrome-profile.d.ts.map +1 -1
  51. package/dist/stealth/chrome-profile.js +6 -2
  52. package/dist/stealth/chrome-profile.js.map +1 -1
  53. package/dist/stealth/index.d.ts +7 -0
  54. package/dist/stealth/index.d.ts.map +1 -1
  55. package/dist/stealth/index.js +81 -22
  56. package/dist/stealth/index.js.map +1 -1
  57. package/dist/stealth/proxy.js +2 -2
  58. package/dist/stealth/proxy.js.map +1 -1
  59. package/dist/stealth/tls.d.ts.map +1 -1
  60. package/dist/stealth/tls.js +12 -2
  61. package/dist/stealth/tls.js.map +1 -1
  62. package/dist/tools/interact.d.ts.map +1 -1
  63. package/dist/tools/interact.js +5 -0
  64. package/dist/tools/interact.js.map +1 -1
  65. package/dist/tools/list-skills.d.ts +1 -1
  66. package/dist/tools/list-skills.d.ts.map +1 -1
  67. package/dist/tools/list-skills.js +23 -10
  68. package/dist/tools/list-skills.js.map +1 -1
  69. package/dist/tools/run-skill.d.ts +7 -1
  70. package/dist/tools/run-skill.d.ts.map +1 -1
  71. package/dist/tools/run-skill.js +288 -36
  72. package/dist/tools/run-skill.js.map +1 -1
  73. package/dist/utils/fetcher.d.ts.map +1 -1
  74. package/dist/utils/fetcher.js +8 -38
  75. package/dist/utils/fetcher.js.map +1 -1
  76. package/dist/utils/robots.d.ts.map +1 -1
  77. package/dist/utils/robots.js +5 -1
  78. package/dist/utils/robots.js.map +1 -1
  79. package/dist/utils/url.d.ts +4 -0
  80. package/dist/utils/url.d.ts.map +1 -1
  81. package/dist/utils/url.js +11 -0
  82. package/dist/utils/url.js.map +1 -1
  83. package/package.json +1 -1
package/README.md CHANGED
@@ -1,213 +1,254 @@
1
+ <div align="center">
2
+
1
3
  # imperium-crawl
2
4
 
3
- The most powerful open-source MCP server for web scraping, crawling, and data extraction. **22 tools. Zero API keys required for scraping. One `npx` command to install.**
5
+ **The most powerful open-source MCP server for web scraping, crawling, and data extraction.**
4
6
 
5
- While others charge $19+/month for basic scraping, imperium-crawl gives you **more features for free** — including capabilities that no other MCP server offers at any price.
7
+ 22 tools. Zero API keys required. One `npx` command.
6
8
 
7
- ## vs. The Competition
9
+ [![npm version](https://img.shields.io/npm/v/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)
10
+ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
11
+ [![Tests](https://img.shields.io/badge/tests-332%20passing-brightgreen.svg)]()
12
+ [![npm downloads](https://img.shields.io/npm/dm/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)
8
13
 
9
- | Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
10
- |---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
11
- | Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
12
- | Scraping tools | **6** | 3 | 1 | 1 | 1 |
13
- | Search tools | **4** | 0 | 0 | 0 | 0 |
14
- | Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
15
- | Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
16
- | TLS fingerprinting | **Yes (JA3/JA4)** | No | No | No | No |
17
- | CAPTCHA auto-solving | **Yes (2Captcha)** | No | No | No | No |
18
- | API discovery from network traffic | **Yes** | No | No | No | No |
19
- | WebSocket monitoring | **Yes** | No | No | No | No |
20
- | Direct API calls | **Yes** | No | No | No | No |
21
- | Reusable skills system | **Yes** | No | No | No | No |
22
- | Structured data extraction (JSON-LD/OG) | **Yes** | Partial | No | No | No |
23
- | Priority-based crawling | **Yes** | No | No | No | No |
24
- | Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
25
- | URL normalization (11 steps) | **Yes** | No | No | No | No |
26
- | Adaptive learning (self-improving) | **Yes** | No | No | No | No |
27
- | AI-powered data extraction | **Yes** | No | No | No | No |
28
- | Browser automation + sessions | **Yes** | No | No | No | No |
29
- | Batch processing with resume | **Yes** | No | No | No | No |
30
- | Self-hosted | **Yes** | No | N/A | Yes | No |
31
- | Requires external service | **No** | Yes | No | No | Yes |
32
- | Total tools | **22** | 5 | 2 | 2 | 4 |
33
-
34
- > **TLDR:** More tools, more features, zero cost, no external dependencies. Self-hosted, open-source, and it runs on your machine.
14
+ </div>
35
15
 
36
- ## Installation
37
-
38
- ```bash
39
- npm install -g imperium-crawl
40
- ```
16
+ ---
41
17
 
42
- Or run directly without installing:
18
+ ## Quick Start
43
19
 
44
- ```bash
45
- npx -y imperium-crawl
46
- ```
20
+ Get running in 30 seconds.
47
21
 
48
- ### MCP Client Config
49
-
50
- Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MCP client):
22
+ **MCP client** (Claude Code, Cursor, VS Code, Windsurf):
51
23
 
52
24
  ```json
53
25
  {
54
26
  "mcpServers": {
55
27
  "imperium-crawl": {
56
- "type": "stdio",
57
28
  "command": "npx",
58
- "args": ["-y", "imperium-crawl"],
59
- "env": {
60
- "BRAVE_API_KEY": "your-brave-api-key",
61
- "TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
62
- "LLM_API_KEY": "your-api-key",
63
- "LLM_PROVIDER": "anthropic",
64
- "PROXY_URL": "http://user:pass@proxy:8080",
65
- "PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
66
- }
29
+ "args": ["-y", "imperium-crawl"]
67
30
  }
68
31
  }
69
32
  }
70
33
  ```
71
34
 
72
- > **Works out of the box with zero API keys** — 16 tools are fully functional without any configuration (6 scraping + 3 skills + 3 API discovery + 4 batch). To unlock full power, add optional API keys:
73
- >
74
- > | Key | What it unlocks | Where to get it |
75
- > |-----|----------------|-----------------|
76
- > | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
77
- > | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
78
- > | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
79
- > | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
80
- > | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
35
+ **CLI** (zero install):
81
36
 
82
- ### Enable full stealth (Level 3 — headless browser)
37
+ ```bash
38
+ npx -y imperium-crawl scrape --url https://example.com
39
+ ```
40
+
41
+ **Global install:**
83
42
 
84
43
  ```bash
85
- npm i rebrowser-playwright
86
- npx playwright install chromium
44
+ npm install -g imperium-crawl
87
45
  ```
88
46
 
89
- ### AI Agent Guide (SKILL.md)
47
+ > That's it. 16 of 22 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
90
48
 
91
- imperium-crawl ships with [`SKILL.md`](./SKILL.md) — a structured guide that teaches AI agents (Claude, GPT, etc.) how to use all 22 tools effectively. It includes 9 proven workflows, decision trees, error recovery strategies, and advanced patterns like manual skill refinement.
49
+ ---
92
50
 
93
- **Without SKILL.md**, agents can call tools but won't know which tool to try first, when to fallback, or how to chain tools together optimally.
51
+ ## Power Examples
94
52
 
95
- **With SKILL.md**, agents follow battle-tested workflows — readability → scrape → extract fallback chains, auto-detect → manual refinement for skills, search → select → deep-scrape for research, and more.
53
+ Real results. Copy-paste and try.
96
54
 
97
- **Three ways to connect SKILL.md to any agent:**
55
+ ### Scrape through Cloudflare
98
56
 
99
- | Method | Setup | Works with |
100
- |--------|-------|-----------|
101
- | **MCP + SKILL.md** | Add imperium-crawl as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
102
- | **CLI + SKILL.md** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents, anything |
103
- | **TUI mode** | `imperium-crawl tui` — interactive slash-command terminal | Direct human use, demos, debugging |
57
+ ```bash
58
+ imperium-crawl scrape --url https://blog.cloudflare.com
59
+ ```
104
60
 
105
- The CLI approach is universal — any agent that can run shell commands can use all 22 tools. No MCP required.
61
+ ```
62
+ Level 1 (headers) → blocked
63
+ Level 2 (TLS fingerprint) → blocked
64
+ Level 3 (browser + stealth) → success ✅
65
+ → Full markdown content extracted, 213K characters
66
+ → Next visit: skips straight to Level 3 (learned)
67
+ ```
106
68
 
107
- | AI Agent | How to add SKILL.md |
108
- |----------|-------------------|
109
- | **Claude Code** | Copy `SKILL.md` to your project root — Claude Code reads it automatically |
110
- | **Cursor / Windsurf** | Add `SKILL.md` to project rules or include in system prompt |
111
- | **OpenClaw / custom agents** | Include SKILL.md content in your system prompt or context window |
112
- | **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
69
+ ### Discover hidden APIs on any website
113
70
 
114
- ---
71
+ ```bash
72
+ imperium-crawl discover-apis --url https://weather.com
73
+ ```
115
74
 
116
- ## CLI Mode
75
+ ```
76
+ Found 11 hidden API endpoints:
77
+ • api.weather.com — main weather API (exposed API key!)
78
+ • mParticle analytics endpoints
79
+ • Taboola content recommendation API
80
+ • OneTrust consent management API
81
+ • DAA/AdChoices opt-out endpoints
82
+ → Call any endpoint directly with query_api — 10x faster than DOM scraping
83
+ ```
117
84
 
118
- imperium-crawl works as both an **MCP server** and a **standalone CLI tool**. All 22 tools are available as subcommands:
85
+ ### AI extraction in plain English
119
86
 
120
87
  ```bash
121
- # Scrape a website to markdown
122
- imperium-crawl scrape --url https://bbc.com/news
88
+ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
89
+ --schema "extract product name, price, rating, and review count"
90
+ ```
123
91
 
124
- # Crawl with depth control
125
- imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
92
+ ```json
93
+ {
94
+ "product_name": "Apple AirPods Pro 2",
95
+ "price": "$189.99",
96
+ "rating": "4.7 out of 5",
97
+ "review_count": "45,297"
98
+ }
99
+ ```
126
100
 
127
- # Extract structured data with CSS selectors
128
- imperium-crawl extract --url https://news.ycombinator.com --selectors '{"title":".titleline a","score":".score"}' --items-selector ".athing"
101
+ ### Batch scrape with resume
129
102
 
130
- # AI-powered extraction — describe what you want in plain English
131
- imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 --schema "extract product name, price, rating, and review count"
103
+ ```bash
104
+ imperium-crawl batch-scrape \
105
+ --urls '["https://bbc.com","https://cnn.com","https://reuters.com","https://techcrunch.com"]' \
106
+ --concurrency 3
107
+ ```
132
108
 
133
- # Browser automation — interact with pages
134
- imperium-crawl interact --url https://example.com --actions '[{"type":"click","selector":"#login"},{"type":"type","selector":"#email","text":"user@example.com"}]'
109
+ ```
110
+ Scraping 4 URLs (concurrency: 3)...
111
+ ✅ bbc.com — 47K chars
112
+ ✅ cnn.com — 52K chars
113
+ ✅ reuters.com — 38K chars
114
+ ✅ techcrunch.com — 61K chars
115
+ → 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)
116
+ ```
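Reviewer note: the batch output above implies a concurrency cap with soft failure. The TypeScript sketch below shows one common way to get that behaviour. It is not taken from imperium-crawl's source; `runBatch`, `Outcome`, and the `worker` callback are hypothetical names, and resume via job ID is not modelled.

```typescript
// Hypothetical sketch: run `worker` over `items` with at most `concurrency`
// tasks in flight. A failing item is recorded, not rethrown (soft failure).
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string };

async function runBatch<I, O>(
  items: I[],
  concurrency: number,
  worker: (item: I) => Promise<O>,
): Promise<Outcome<O>[]> {
  const results: Outcome<O>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two lanes share an index
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (err) {
        results[i] = { ok: false, error: String(err) };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as `items`
}
```

Results come back in input order regardless of completion order, which keeps per-URL reporting simple.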
135
117
 
136
- # Batch scrape multiple URLs in parallel
137
- imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com","https://site3.com"]' --concurrency 3
118
+ ---
138
119
 
139
- # List batch jobs
140
- imperium-crawl list-jobs
120
+ ## Why imperium-crawl?
141
121
 
142
- # Discover hidden APIs on any website
143
- imperium-crawl discover-apis --url https://weather.com
122
+ 🔓 **Zero API Keys Required**
123
+ 16 of 22 tools work out of the box. No accounts, no tokens, no credit cards. Just `npx` and go.
144
124
 
145
- # Search the web (requires BRAVE_API_KEY)
146
- imperium-crawl search --query "latest AI news" --count 5
125
+ 🛡️ **3-Level Auto-Escalating Stealth**
126
+ Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatically escalates until it gets through.
147
127
 
148
- # Take a screenshot
149
- imperium-crawl screenshot --url https://github.com --full-page
128
+ 🧠 **Self-Improving**
129
+ Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
150
130
 
151
- # Interactive setup wizard
152
- imperium-crawl setup
153
- ```
131
+ 🧰 **22 Tools, 3 Modes**
132
+ MCP server, CLI tool, or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
154
133
 
155
- ### Output Formats
134
+ 📜 **10 Built-in Recipes**
135
+ Pre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.
156
136
 
157
- ```bash
158
- # JSON (default)
159
- imperium-crawl scrape --url https://example.com
137
+ ⚡ **Skills System**
138
+ Teach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.
160
139
 
161
- # CSV
162
- imperium-crawl extract --url https://example.com --selectors '{"title":"h1"}' --output-format csv
140
+ ---
163
141
 
164
- # Markdown
165
- imperium-crawl scrape --url https://example.com --output-format markdown
142
+ ## vs. The Competition
166
143
 
167
- # JSONL (one JSON object per line)
168
- imperium-crawl crawl --url https://example.com --output-format jsonl
144
+ | Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
145
+ |---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
146
+ | Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
147
+ | Total tools | **22** | 5 | 2 | 2 | 4 |
148
+ | Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
149
+ | Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
150
+ | TLS fingerprinting | **JA3/JA4** | No | No | No | No |
151
+ | CAPTCHA auto-solving | **Yes** | No | No | No | No |
152
+ | API discovery | **Yes** | No | No | No | No |
153
+ | WebSocket monitoring | **Yes** | No | No | No | No |
154
+ | AI-powered extraction | **Yes** | No | No | No | No |
155
+ | Adaptive learning | **Yes** | No | No | No | No |
156
+ | Batch processing | **Yes** | No | No | No | No |
157
+ | Self-hosted | **Yes** | No | N/A | Yes | No |
158
+ | Requires external service | **No** | Yes | No | No | Yes |
169
159
 
170
- # Pretty-print JSON
171
- imperium-crawl scrape --url https://example.com --pretty
160
+ ---
172
161
 
173
- # Write to file
174
- imperium-crawl scrape --url https://example.com --output result.json
162
+ ## Stealth Engine
163
+
164
+ ```
165
+ Request → [L1: Headers + UA rotation]
166
+
167
+ ├─ success → Done
168
+ ↓ fail
169
+ [L2: TLS Fingerprint (JA3/JA4)]
170
+
171
+ ├─ success → Done
172
+ ↓ fail
173
+ [L3: Browser + Fingerprint Injection + CAPTCHA]
174
+
175
+ ├─ success → Done
176
+
177
+ [Learning Engine records optimal level for next time]
175
178
  ```
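Reviewer note: the flow in the diagram can be sketched as a loop that tries each level in order and remembers the first one that succeeds. This is an illustration under stated assumptions, not the package's code: `fetchWithEscalation`, `Attempt`, and the `learned` map are hypothetical, and a `null` body stands in for "blocked".

```typescript
// Hypothetical sketch of level escalation. The per-level fetchers are
// stand-ins supplied by the caller, not imperium-crawl APIs.
type Level = 1 | 2 | 3;
type Attempt = (url: string) => Promise<string | null>; // null means "blocked"

async function fetchWithEscalation(
  url: string,
  attempts: Record<Level, Attempt>,
  learned: Map<string, Level>,
): Promise<{ level: Level; body: string } | null> {
  const host = new URL(url).hostname;
  const start: Level = learned.get(host) ?? 1; // skip levels already known to fail

  for (const level of [1, 2, 3] as Level[]) {
    if (level < start) continue;
    const body = await attempts[level](url);
    if (body !== null) {
      learned.set(host, level); // record what worked for the next visit
      return { level, body };
    }
  }
  return null; // every level was blocked
}
```

On a second call for the same host, the loop starts at the recorded level, which mirrors the "skips straight to Level 3" behaviour the README describes.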
176
179
 
177
- ### TUI Mode
180
+ ### Stealth Levels
178
181
 
179
- ```bash
180
- imperium-crawl tui
181
- ```
182
+ | Level | Method | What It Defeats |
183
+ |-------|--------|-----------------|
184
+ | **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
185
+ | **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
186
+ | **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
182
187
 
183
- Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
188
+ ### Anti-Bot System Detection
184
189
 
185
- ### Help
190
+ Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
191
+
192
+ | System | Detection Method |
193
+ |--------|-----------------|
194
+ | **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
195
+ | **Akamai** | `_abck`, `bm_sz` cookies |
196
+ | **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
197
+ | **DataDome** | `datadome` cookies, `datadome` response header |
198
+ | **Kasada** | `x-kpsdk-*` headers |
199
+ | **AWS WAF** | `aws-waf-token` cookie |
200
+ | **F5 / Shape Security** | `TS` prefix cookies |
201
+
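Reviewer note: a detector driven only by the signals in the table above might look like the sketch below. The function name, the check order, and the exact matching rules (exact name versus prefix or substring) are assumptions; the shipped `antibot-detector.js` may differ.

```typescript
// Hypothetical detector built only from the cookie and header signals
// listed in the README table.
type AntiBot =
  | "cloudflare" | "akamai" | "perimeterx" | "datadome"
  | "kasada" | "aws-waf" | "f5";

function detectAntiBot(
  headers: Record<string, string>, // response headers, any casing
  cookieNames: string[],           // names of cookies set by the response
): AntiBot | null {
  const headerNames = Object.keys(headers).map((k) => k.toLowerCase());
  const hasHeader = (test: (name: string) => boolean) => headerNames.some(test);
  const hasCookie = (test: (name: string) => boolean) => cookieNames.some(test);

  if (hasCookie((c) => c === "cf_clearance") || hasHeader((h) => h === "cf-mitigated")) return "cloudflare";
  if (hasCookie((c) => c === "_abck" || c === "bm_sz")) return "akamai";
  if (hasCookie((c) => c.startsWith("_px")) || hasHeader((h) => h.includes("_pxhd"))) return "perimeterx";
  if (hasCookie((c) => c === "datadome") || hasHeader((h) => h.includes("datadome"))) return "datadome";
  if (hasHeader((h) => h.startsWith("x-kpsdk-"))) return "kasada";
  if (hasCookie((c) => c === "aws-waf-token")) return "aws-waf";
  if (hasCookie((c) => c.startsWith("TS"))) return "f5"; // broad prefix, checked last
  return null;
}
```

The `TS` prefix is checked last because it is the least specific signal and could match unrelated cookies.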
202
+ ### Smart Rendering Cache
203
+
204
+ Once imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
205
+
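Reviewer note: a one-hour, per-domain decision cache of the kind described above can be as small as a map of expiry timestamps. `RenderDecisionCache` and its method names are hypothetical; only the one-hour lifetime comes from the README text.

```typescript
// Hypothetical per-domain cache: remembers "needs browser" for a fixed TTL.
class RenderDecisionCache {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 60 * 60 * 1000) { // default: 1 hour
    this.ttlMs = ttlMs;
  }

  // Record that `domain` required Level 3 at time `now` (ms since epoch).
  markNeedsBrowser(domain: string, now: number = Date.now()): void {
    this.expiresAt.set(domain, now + this.ttlMs);
  }

  // True while a recorded decision is still fresh; expired entries are dropped.
  needsBrowser(domain: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(domain);
    if (expiry === undefined) return false;
    if (now >= expiry) {
      this.expiresAt.delete(domain); // stale: fall back to normal escalation
      return false;
    }
    return true;
  }
}
```

Passing `now` explicitly keeps the class easy to test without mocking the clock.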
206
+ ---
207
+
208
+ ## Adaptive Learning Engine
209
+
210
+ imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — fully automatic.
211
+
212
+ Every time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it **predicts the optimal configuration** — skipping failed levels and going straight to what works.
186
213
 
187
- ```bash
188
- imperium-crawl --help # List all commands
189
- imperium-crawl scrape --help # Help for specific tool
190
- imperium-crawl --version # Show version
191
214
  ```
215
+ First visit to cloudflare.com:
216
+ Level 1 → blocked ❌
217
+ Level 2 → blocked ❌
218
+ Level 3 → success ✅ (Cloudflare detected)
219
+ → Engine records: cloudflare.com needs Level 3
220
+
221
+ Second visit to cloudflare.com:
222
+ → Engine predicts: Level 3, confidence 85%, Cloudflare
223
+ → Skips Level 1 and 2 entirely — goes straight to browser
224
+ → 3x faster than first visit
225
+ ```
226
+
227
+ ### Smart Features
228
+
229
+ - **Time decay** — Knowledge older than 7 days loses weight, adapts when sites change defenses
230
+ - **Confidence scoring** — Low data = start from level 1. High confidence = skip to optimal level
231
+ - **Auto-prune** — Domains unused for 30 days are cleaned up. Max 2,000 domains stored
232
+ - **Atomic persistence** — Knowledge saved via atomic write (tmp → rename). Never corrupts
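Reviewer note: the README does not give formulas for time decay or confidence, so the sketch below is one plausible reading and nothing more. The halving schedule after day 7, the pseudo-count damping, and the 0.6 threshold are all assumptions, as are the names `weight`, `confidence`, and `pickStartLevel`.

```typescript
// Hypothetical model of "time decay" and "confidence scoring".
type Observation = { level: 1 | 2 | 3; success: boolean; ageDays: number };

// Assumption: full weight for the first 7 days, then halve every further 7 days.
function weight(ageDays: number): number {
  return ageDays <= 7 ? 1 : Math.pow(0.5, (ageDays - 7) / 7);
}

// Assumption: confidence is the weighted success rate at a level, damped by a
// pseudo-count so that one or two samples never look certain.
function confidence(obs: Observation[], level: 1 | 2 | 3, pseudoCount = 2): number {
  let wins = 0;
  let total = 0;
  for (const o of obs) {
    if (o.level !== level) continue;
    const w = weight(o.ageDays);
    total += w;
    if (o.success) wins += w;
  }
  return wins / (total + pseudoCount);
}

// Pick the lowest level with enough confidence; with little data, start at 1.
function pickStartLevel(obs: Observation[], threshold = 0.6): 1 | 2 | 3 {
  for (const level of [1, 2, 3] as const) {
    if (confidence(obs, level) >= threshold) return level;
  }
  return 1;
}
```

Under these assumptions, five recent successes at Level 3 are enough to skip straight to it, while the same five successes from four weeks ago are not.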
192
233
 
193
- > **No arguments** = starts as MCP server (stdio). **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
234
+ > **The more you use it, the faster it gets.**
194
235
 
195
236
  ---
196
237
 
197
- ## 22 Tools
238
+ ## All 22 Tools
198
239
 
199
- ### Scraping (no API key needed)
240
+ ### 📄 Scraping (no API key needed)
200
241
 
201
242
  | Tool | What It Does |
202
243
  |------|-------------|
203
- | **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Returns structured data (JSON-LD, OpenGraph, Microdata), metadata, and links on request. |
204
- | **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. Content paths rank higher, tracking params are stripped. |
205
- | **map** | Discover all URLs on a domain via sitemap.xml parsing + page link extraction. |
244
+ | **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |
245
+ | **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |
246
+ | **map** | Discover all URLs on a domain via sitemap.xml + page link extraction. |
206
247
  | **extract** | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |
207
- | **readability** | Mozilla Readability article extraction — title, author, content, publish date. 3x faster with linkedom. |
248
+ | **readability** | Mozilla Readability article extraction — title, author, content, publish date. |
208
249
  | **screenshot** | Full-page or viewport PNG screenshots via headless Chromium. |
209
250
 
210
- ### Search (requires free Brave API key)
251
+ ### 🔍 Search (requires free Brave API key)
211
252
 
212
253
  | Tool | What It Does |
213
254
  |------|-------------|
@@ -216,131 +257,144 @@ imperium-crawl --version # Show version
216
257
  | **image_search** | Image search with thumbnails and source URLs. |
217
258
  | **video_search** | Video search across platforms. |
218
259
 
219
- ### Skills (no API key needed)
260
+ ### Skills (no API key needed)
220
261
 
221
262
  | Tool | What It Does |
222
263
  |------|-------------|
223
- | **create_skill** | Analyze any page, auto-detect repeating patterns (articles, products, listings), generate CSS selectors, and save as a reusable skill. |
224
- | **run_skill** | Run a saved skill to get fresh structured data instantly. Supports pagination. |
225
- | **list_skills** | List all saved skills with their configurations. |
264
+ | **create_skill** | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |
265
+ | **run_skill** | Run a saved skill for fresh structured data. Supports pagination. |
266
+ | **list_skills** | List all saved skills with configurations. |
226
267
 
227
- ### API Discovery & Real-Time (no API key needed, requires Playwright)
268
+ ### 🔓 API Discovery & Real-Time (no API key needed, requires Playwright)
228
269
 
229
270
  | Tool | What It Does |
230
271
  |------|-------------|
231
- | **discover_apis** | Navigate to any page, intercept all XHR/fetch calls, and map every hidden REST/GraphQL API endpoint. Auto-detects GraphQL, filters noise, returns response previews. **No other MCP server does this.** |
232
- | **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering entirely for 10x faster data access. Use after `discover_apis` to hit endpoints directly. |
233
- | **monitor_websocket** | Capture real-time WebSocket messages from any page — financial tickers, chat feeds, live dashboards. Returns connection details and message payloads. **No other MCP server does this.** |
272
+ | **discover_apis** | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |
273
+ | **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |
274
+ | **monitor_websocket** | Capture real-time WebSocket messages — financial tickers, chat feeds, live dashboards. |
234
275
 
235
- ### AI Extraction (requires LLM API key)
276
+ ### 🧠 AI Extraction (requires LLM API key)
236
277
 
237
278
  | Tool | What It Does |
238
279
  |------|-------------|
239
- | **ai_extract** | AI-powered data extraction — describe what you want in natural language or provide a JSON schema. Supports auto mode (LLM decides what to extract), 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
280
+ | **ai_extract** | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
240
281
 
241
- ### Interaction (no API key needed, requires Playwright)
282
+ ### 🖱️ Interaction (no API key needed, requires Playwright)
242
283
 
243
284
  | Tool | What It Does |
244
285
  |------|-------------|
245
- | **interact** | Browser automation with 10 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate). Session persistence saves/restores cookies across calls — build login flows and multi-step workflows. |
286
+ | **interact** | Browser automation with 10 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate). Session persistence saves/restores cookies. |
246
287
 
247
- ### Batch Processing (no API key needed)
288
+ ### 📦 Batch Processing (no API key needed)
248
289
 
249
290
  | Tool | What It Does |
250
291
  |------|-------------|
251
- | **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure (continues on errors), and resume support via job_id. Optional AI extraction per URL. |
252
- | **list_jobs** | List all batch jobs with status, progress, and timestamps. |
253
- | **job_status** | Get full results for a specific batch job including per-URL outcomes. |
292
+ | **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |
293
+ | **list_jobs** | List all batch jobs with status and progress. |
294
+ | **job_status** | Full results for a specific batch job including per-URL outcomes. |
254
295
  | **delete_job** | Clean up completed or failed batch jobs. |
255
296
 
256
297
  ---
257
298
 
258
- ## Stealth Engine
299
+ ## MCP Setup — Detailed
259
300
 
260
- imperium-crawl uses a 3-level stealth system that **auto-escalates** based on the target site's defenses:
301
+ Full configuration with all optional environment variables:
261
302
 
262
- | Level | Method | What It Defeats |
263
- |-------|--------|-----------------|
264
- | **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
265
- | **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
266
- | **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA solving | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
303
+ ```json
304
+ {
305
+ "mcpServers": {
306
+ "imperium-crawl": {
307
+ "command": "npx",
308
+ "args": ["-y", "imperium-crawl"],
309
+ "env": {
310
+ "BRAVE_API_KEY": "your-brave-api-key",
311
+ "TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
312
+ "LLM_API_KEY": "your-api-key",
313
+ "LLM_PROVIDER": "anthropic",
314
+ "PROXY_URL": "http://user:pass@proxy:8080",
315
+ "PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
316
+ }
317
+ }
318
+ }
319
+ }
320
+ ```
267
321
 
268
- ### Anti-Bot System Detection
322
+ ### API Keys
269
323
 
270
- Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
324
+ | Key | What It Unlocks | Where to Get It |
325
+ |-----|----------------|-----------------|
326
+ | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
327
+ | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
328
+ | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
329
+ | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
330
+ | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
271
331
 
272
- | System | Detection Method |
273
- |--------|-----------------|
274
- | **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
275
- | **Akamai** | `_abck`, `bm_sz` cookies |
276
- | **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
277
- | **DataDome** | `datadome` cookies, `datadome` response header |
278
- | **Kasada** | `x-kpsdk-*` headers |
279
- | **AWS WAF** | `aws-waf-token` cookie |
280
- | **F5 / Shape Security** | `TS` prefix cookies |
332
+ ### Enable Full Stealth (Level 3)
281
333
 
282
- ### Smart Rendering Cache
334
+ ```bash
335
+ npm i rebrowser-playwright
336
+ npx playwright install chromium
337
+ ```
283
338
 
284
- Once imperium-crawl determines a domain needs Level 3 (browser), it **caches that decision** for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
339
+ ### Per-Client Notes
285
340
 
286
- ---
341
+ | Client | Config Location |
342
+ |--------|----------------|
343
+ | **Claude Code** | `.mcp.json` in project root or `~/.claude/settings.json` global |
344
+ | **Cursor** | Settings → MCP Servers |
345
+ | **VS Code** | `.vscode/mcp.json` or user settings |
346
+ | **Windsurf** | `~/.codeium/windsurf/mcp_config.json` |
287
347
 
288
- ## 🧠 Adaptive Learning Engine
348
+ ---
289
349
 
290
- imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — it works automatically in the background.
350
+ ## CLI Mode
291
351
 
292
- ### How It Works
352
+ **No arguments** = starts as MCP server. **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
293
353
 
294
- Every time you scrape a website, the engine records:
295
- - Which **stealth level** worked (1, 2, or 3)
296
- - Which **anti-bot system** was detected (Cloudflare, DataDome, etc.)
297
- - Whether a **proxy** was needed
298
- - **Response time** and **HTTP status**
299
- - Whether the request was **blocked or successful**
354
+ ```bash
355
+ # Scrape a website to markdown
356
+ imperium-crawl scrape --url https://bbc.com/news
300
357
 
301
- Next time you hit the same domain, the engine **predicts the optimal configuration** — skipping failed levels and going straight to what works.
358
+ # Crawl with depth control
359
+ imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
302
360
 
303
- ### What It Learns Per Domain
361
+ # AI-powered extraction in plain English
362
+ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
363
+ --schema "extract product name, price, rating, and review count"
304
364
 
305
- | Data Point | How It's Used |
306
- |-----------|---------------|
307
- | Optimal stealth level | Skip straight to the level that works — no wasted escalation |
308
- | Anti-bot system | Remember which defense the site uses |
309
- | Proxy requirement | Auto-suggest proxy if requests keep failing without one |
310
- | Response time | Exponential moving average — adapts to site speed changes |
311
- | Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
312
- | Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
365
+ # Discover hidden APIs
366
+ imperium-crawl discover-apis --url https://weather.com
313
367
 
314
- ### Smart Features
368
+ # Batch scrape in parallel
369
+ imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com"]' --concurrency 3
315
370
 
316
- - **Time decay** — Knowledge older than 7 days loses weight, so the engine adapts when sites change defenses
317
- - **Confidence scoring** — Low data = start from level 1. High confidence = skip directly to optimal level
318
- - **Auto-prune** — Domains unused for 30 days are automatically cleaned up. Max 2,000 domains stored
319
- - **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
320
- - **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
371
+ # Interactive setup wizard
372
+ imperium-crawl setup
373
+ ```
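The Adaptive Learning Engine text removed in this diff mentions three mechanisms that are easy to misread: an exponential moving average for response time, a 30% rate reduction on 429 responses, and atomic persistence via tmp → rename. The sketch below illustrates each one in isolation. The function names, the `alpha` smoothing factor, and the file layout are assumptions for illustration, not the package's implementation.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def ema(previous: Optional[float], sample: float, alpha: float = 0.3) -> float:
    """Exponential moving average; alpha=0.3 is an assumed smoothing factor."""
    if previous is None:
        return sample
    return alpha * sample + (1 - alpha) * previous


def throttle_on_429(rate: float) -> float:
    """Reduce the request rate by 30%, as the removed table describes."""
    return rate * 0.7


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the rename only happens after the temp file is fully written and synced, a reader sees either the old file or the new one, never a partial write.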
321
374
 
322
- ### Example
375
+ ### Output Formats
323
376
 
377
+ ```bash
378
+ imperium-crawl scrape --url https://example.com # JSON (default)
379
+ imperium-crawl scrape --url https://example.com --output-format markdown # Markdown
380
+ imperium-crawl scrape --url https://example.com --output-format csv # CSV
381
+ imperium-crawl scrape --url https://example.com --pretty # Pretty JSON
382
+ imperium-crawl scrape --url https://example.com --output result.json # Write to file
324
383
  ```
325
- First visit to cloudflare.com:
326
- Level 1 → blocked ❌
327
- Level 2 → blocked ❌
328
- Level 3 → success ✅ (Cloudflare detected)
329
- → Engine records: cloudflare.com needs Level 3
330
384
 
331
- Second visit to cloudflare.com:
332
- → Engine predicts: Level 3, confidence 85%, Cloudflare
333
- → Skips Level 1 and 2 entirely — goes straight to browser
334
- 3x faster than first visit
385
+ ### TUI Mode
386
+
387
+ ```bash
388
+ imperium-crawl tui
335
389
  ```
336
390
 
337
- > **The more you use it, the faster it gets.** Zero configuration. Fully automatic.
391
+ Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
338
392
 
339
393
  ---
340
394
 
341
- ## Skills System
395
+ ## Skills & Recipes
342
396
 
343
- Skills let you teach imperium-crawl how to extract data from any website, then re-run it for fresh content whenever you want.
397
+ Skills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.
344
398
 
345
399
  **Create a skill:**
346
400
  ```
@@ -351,20 +405,36 @@ create_skill({
351
405
  })
352
406
  ```
353
407
 
354
- The tool analyzes the page, auto-detects repeating elements (articles, products, listings), generates CSS selectors for each field, and saves the skill config.
355
-
356
408
  **Run a skill:**
357
409
  ```
358
410
  run_skill({ name: "tc-ai-news" })
411
+ → Returns fresh structured data with all detected fields
359
412
  ```
360
413
 
361
- Returns fresh structured data with all detected fields. Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, and portable.
414
+ Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, portable.
415
+
416
+ ### Built-in Recipes
417
+
418
+ | Recipe | What It Does |
419
+ |--------|-------------|
420
+ | `news-extraction` | Extract article title, author, date, content from news sites |
421
+ | `ecommerce-scrape` | Product name, price, rating, reviews, images |
422
+ | `social-media` | Posts, engagement metrics, user profiles |
423
+ | `job-listings` | Title, company, salary, location, description |
424
+ | `real-estate` | Property listings with price, address, features |
425
+ | `api-reverse-engineer` | Discover → query → monitor workflow |
426
+ | `competitor-monitor` | Track pricing and product changes |
427
+ | `lead-generation` | Extract business contact info |
428
+ | `content-aggregator` | Multi-source content collection |
429
+ | `data-pipeline` | Batch scrape → extract → export workflow |
430
+
431
+ See [`SKILL/`](./SKILL/) for detailed workflow guides and agent integration.
362
432
 
363
433
  ---
364
434
 
365
435
  ## API Discovery Workflow
366
436
 
367
- This is the workflow that no other MCP server supports. Real results from actual testing:
437
+ Turn any website into an API. No documentation needed.
368
438
 
369
439
  ```
370
440
  1. discover_apis({ url: "https://weather.com" })
@@ -373,65 +443,81 @@ This is the workflow that no other MCP server supports. Real results from actual
373
443
  • mParticle analytics endpoints
374
444
  • Taboola content recommendation API
375
445
  • OneTrust consent management API
376
- • DAA/AdChoices opt-out endpoints
377
446
 
378
447
  2. query_api({ url: "https://api.weather.com/v3/...", method: "GET" })
379
- → Direct API call, bypasses DOM entirely — 10x faster, structured JSON response
448
+ → Direct API call, bypasses DOM entirely — 10x faster, structured JSON
380
449
 
381
450
  3. monitor_websocket({ url: "https://binance.com/en/trade/BTC_USDT", duration_seconds: 10 })
382
- → Captures real-time WebSocket messages — financial tickers, live data feeds
451
+ → Captures real-time WebSocket messages — live BTC price feed
383
452
  ```
384
453
 
385
- Turn any website into an API. No documentation needed.
454
+ ---
455
+
456
+ ## AI Agent Guide
457
+
458
+ imperium-crawl ships with [`SKILL/`](./SKILL/) — a structured guide that teaches AI agents how to use all 22 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.
459
+
460
+ ### Three Ways to Connect
461
+
462
+ | Method | Setup | Works With |
463
+ |--------|-------|-----------|
464
+ | **MCP + SKILL/** | Add as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
465
+ | **CLI + SKILL/** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents |
466
+ | **TUI** | `imperium-crawl tui` — interactive terminal | Direct human use, demos, debugging |
467
+
468
+ ### Per-Agent Setup
469
+
470
+ | AI Agent | How to Add SKILL/ |
471
+ |----------|-------------------|
472
+ | **Claude Code** | Copy `SKILL.md` to project root — auto-detected |
473
+ | **Cursor / Windsurf** | Add `SKILL.md` to project rules or system prompt |
474
+ | **OpenClaw / custom agents** | Include SKILL.md in system prompt or context window |
475
+ | **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
386
476
 
387
477
  ---
388
478
 
389
479
  ## Resilience
390
480
 
391
481
  - **Exponential backoff with full jitter** — AWS-recommended retry pattern, no thundering herd
392
- - **Per-domain circuit breaker** — 5 consecutive failures opens the circuit for 60s, then half-open probing with automatic recovery
393
- - **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params, normalizes encoding
394
- - **Concurrency limiting** — per-domain request throttling via p-queue
395
- - **Input validation** — all 22 tool schemas enforce strict bounds (URL length, query size, concurrency limits, body size)
396
- - **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
397
- - **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
482
+ - **Per-domain circuit breaker** — 5 failures opens circuit for 60s, then half-open probing with auto recovery
483
+ - **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
484
+ - **Proxy support** — single proxy or rotating pool with http/https/socks4/socks5
398
485
  - **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
399
- - **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
400
- - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
401
486
  - **robots.txt** — respected by default (configurable)
487
+ - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
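The first two bullets in the Resilience list combine naturally. Below is a minimal sketch of full-jitter backoff and a circuit breaker using the thresholds stated above (5 failures, 60s open). The `base` and `cap` values, the class name, and the injected `clock` and `rng` parameters are assumptions added so the behavior is easy to test. The package's own implementation may differ.

```python
import random
import time
from collections import defaultdict


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random.random) -> float:
    """Full jitter: pick a delay uniformly in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0       # consecutive failures since the last success
        self.opened_at = None   # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: reject until the cooldown elapses, then let probes through (half-open).
        # Simplification: more than one probe may pass before a result is recorded.
        return self.clock() - self.opened_at >= self.open_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Opens the circuit; a failed half-open probe restarts the cooldown.
            self.opened_at = self.clock()


# One breaker per domain, created on first use.
breakers = defaultdict(CircuitBreaker)
```

A caller would check `breakers[domain].allow_request()` before each attempt, sleep for `full_jitter_delay(attempt)` between retries, and record the outcome afterwards.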
402
488
 
403
489
  ---
404
490
 
405
- ## 🔥 Real-World Test Results
491
+ ## Real-World Test Results
406
492
 
407
493
  Every tool tested against production websites with real anti-bot defenses:
408
494
 
409
495
  | Tool | Target | Result |
410
496
  |------|--------|--------|
411
- | 📄 **scrape** | BBC News | Full markdown content, stealth level 3 auto-escalation |
412
- | 🕸️ **crawl** | Cloudflare Blog | **213K characters** crawled with depth control |
413
- | 🗺️ **map** | BBC | Full URL discovery via sitemap + page link extraction |
414
- | 🕷️ **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
415
- | 📖 **readability** | Medium article | Clean extraction — title, author, content, publish date |
416
- | 📸 **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |
417
- | 🔍 **search** | Brave Web Search | Web results with snippets and URLs |
418
- | 📰 **news_search** | Brave News Search | News results with freshness ranking |
419
- | 🖼️ **image_search** | Brave Image Search | Images with thumbnails and source URLs |
420
- | 🎬 **video_search** | Brave Video Search | Video results across platforms |
421
- | 🛠️ **create_skill** | Hacker News | Auto-detected 30 repeating stories with CSS selectors |
422
- | ▶️ **run_skill** | Saved skill | Fresh structured data from saved extraction config |
423
- | 📋 **list_skills** | — | Lists all saved skills with configurations |
424
- | 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome anti-bot, Google Maps key, internal APIs |
425
- | **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |
426
- | 📡 **monitor_websocket** | Binance BTC/USDT | **3 WebSocket connections, 23 live messages** — BTC price live |
427
- | 🧠 **ai_extract** | Amazon product page | AI extracted name, price, rating, review count — natural language schema |
428
- | 🖱️ **interact** | Login flow | Click → type email → type password → submit — session cookies persisted |
429
- | 📦 **batch_scrape** | 10 news sites | Parallel scrape with concurrency 3, soft failure, 9/10 succeeded |
430
- | 📋 **list_jobs** | — | Lists all batch jobs with status and progress |
431
- | 📊 **job_status** | Batch job | Full per-URL results with timing and extracted data |
432
- | 🗑️ **delete_job** | Completed job | Cleaned up job data from disk |
433
-
434
- > 🏆 **22/22 tools working. 58 hidden APIs discovered. Live crypto feed captured. AI extraction. Browser automation. Zero API keys needed for scraping.**
497
+ | **scrape** | BBC News | Full markdown, stealth level 3 auto-escalation |
498
+ | **crawl** | Cloudflare Blog | 213K characters crawled with depth control |
499
+ | **map** | BBC | Full URL discovery via sitemap + link extraction |
500
+ | **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
501
+ | **readability** | Medium article | Clean — title, author, content, publish date |
502
+ | **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |
503
+ | **search** | Brave Web | Web results with snippets and URLs |
504
+ | **news_search** | Brave News | News results with freshness ranking |
505
+ | **image_search** | Brave Image | Images with thumbnails and source URLs |
506
+ | **video_search** | Brave Video | Video results across platforms |
507
+ | **create_skill** | Hacker News | Auto-detected 30 stories with CSS selectors |
508
+ | **run_skill** | Saved skill | Fresh structured data from saved config |
509
+ | **list_skills** | — | Lists all skills with configurations |
510
+ | **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome, Google Maps key, internal APIs |
511
+ | **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |
512
+ | **monitor_websocket** | Binance BTC/USDT | 3 WebSocket connections, 23 live messages — BTC price live |
513
+ | **ai_extract** | Amazon product | AI extracted name, price, rating, review count |
514
+ | **interact** | Login flow | Click → type → submit — session cookies persisted |
515
+ | **batch_scrape** | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |
516
+ | **list_jobs** | — | Batch jobs with status and progress |
517
+ | **job_status** | Batch job | Full per-URL results with timing |
518
+ | **delete_job** | Completed job | Cleaned up job data from disk |
519
+
520
+ > **22/22 tools. 34 hidden APIs on Airbnb. Live BTC feed. Zero API keys for scraping.**
435
521
 
436
522
  ---
437
523
 
@@ -441,18 +527,18 @@ Every tool tested against production websites with real anti-bot defenses:
441
527
  |----------|----------|-------------|
442
528
  | `BRAVE_API_KEY` | No | Brave Search API key (enables 4 search tools) |
443
529
  | `TWOCAPTCHA_API_KEY` | No | 2Captcha API key (enables auto CAPTCHA solving) |
530
+ | `LLM_API_KEY` | No | Anthropic or OpenAI API key (enables `ai_extract`) |
531
+ | `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
532
+ | `LLM_MODEL` | No | Override default LLM model |
444
533
  | `TRANSPORT` | No | `stdio` (default) or `http` |
445
534
  | `PORT` | No | HTTP port (default: 3000) |
446
535
  | `PROXY_URL` | No | Single proxy URL (http/https/socks4/socks5) |
447
536
  | `PROXY_URLS` | No | Comma-separated proxy URLs for rotation |
448
537
  | `BROWSER_POOL_SIZE` | No | Max pooled browser instances (default: 3) |
449
538
  | `RESPECT_ROBOTS` | No | Respect robots.txt (default: `true`) |
450
- | `LLM_API_KEY` | No | Anthropic or OpenAI API key (enables `ai_extract` tool) |
451
- | `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
452
- | `LLM_MODEL` | No | Override default LLM model |
453
- | `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated browser sessions |
454
- | `NO_COLOR` | No | Disable colored output (standard convention) |
455
- | `CI` | No | Auto-detected; disables TTY features (spinners, colors) |
539
+ | `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated sessions |
540
+ | `NO_COLOR` | No | Disable colored output |
541
+ | `CI` | No | Auto-detected; disables TTY features |
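To make the defaults in the table concrete, here is a small sketch that resolves a few of the variables from an environment mapping. The `Settings` dataclass and `load_settings` function are hypothetical. How the real server parses `RESPECT_ROBOTS`, and whether `PROXY_URLS` takes precedence over `PROXY_URL`, are assumptions made for this example.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class Settings:
    transport: str = "stdio"
    port: int = 3000
    llm_provider: str = "anthropic"
    browser_pool_size: int = 3
    respect_robots: bool = True
    proxy_urls: List[str] = field(default_factory=list)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Apply the documented defaults when a variable is unset."""
    # PROXY_URLS is comma-separated; fall back to the single PROXY_URL (assumed precedence).
    proxies = [u.strip() for u in env.get("PROXY_URLS", "").split(",") if u.strip()]
    if not proxies and env.get("PROXY_URL"):
        proxies = [env["PROXY_URL"]]
    return Settings(
        transport=env.get("TRANSPORT", "stdio"),
        port=int(env.get("PORT", "3000")),
        llm_provider=env.get("LLM_PROVIDER", "anthropic"),
        browser_pool_size=int(env.get("BROWSER_POOL_SIZE", "3")),
        # Assumption: anything other than "false" keeps robots.txt handling on.
        respect_robots=env.get("RESPECT_ROBOTS", "true").lower() != "false",
        proxy_urls=proxies,
    )
```

With an empty mapping, `load_settings({})` returns the defaults listed in the table: `stdio` transport, port 3000, the `anthropic` provider, a pool of 3 browsers, and robots.txt respected.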
456
542
 
457
543
  ---
458
544
 
@@ -463,8 +549,27 @@ git clone https://github.com/ceoimperiumprojects/imperium-crawl
463
549
  cd imperium-crawl
464
550
  npm install
465
551
  npm run build
466
- npm test # 332 tests
467
- npm start
552
+ npm run dev # Watch mode (rebuild on changes)
553
+ npm test # 332 tests
554
+ npm start # Start MCP server
555
+ ```
556
+
557
+ ---
558
+
559
+ ## Contributing
560
+
561
+ Contributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.
562
+
563
+ ```bash
564
+ # Fork the repo, then:
565
+ git clone https://github.com/YOUR_USERNAME/imperium-crawl
566
+ cd imperium-crawl
567
+ npm install
568
+ git checkout -b my-feature
569
+ # Make changes...
570
+ npm test
571
+ git push origin my-feature
572
+ # Open a PR
468
573
  ```
469
574
 
470
575
  ---