imperium-crawl 1.5.2 → 2.0.0

Files changed (81)
  1. package/README.md +370 -257
  2. package/dist/constants.d.ts +2 -1
  3. package/dist/constants.d.ts.map +1 -1
  4. package/dist/constants.js +3 -1
  5. package/dist/constants.js.map +1 -1
  6. package/dist/network/interceptor.d.ts +19 -0
  7. package/dist/network/interceptor.d.ts.map +1 -0
  8. package/dist/network/interceptor.js +82 -0
  9. package/dist/network/interceptor.js.map +1 -0
  10. package/dist/network/types.d.ts +27 -0
  11. package/dist/network/types.d.ts.map +1 -0
  12. package/dist/network/types.js +2 -0
  13. package/dist/network/types.js.map +1 -0
  14. package/dist/security/action-policy.d.ts +26 -0
  15. package/dist/security/action-policy.d.ts.map +1 -0
  16. package/dist/security/action-policy.js +136 -0
  17. package/dist/security/action-policy.js.map +1 -0
  18. package/dist/security/auth-vault.d.ts +49 -0
  19. package/dist/security/auth-vault.d.ts.map +1 -0
  20. package/dist/security/auth-vault.js +133 -0
  21. package/dist/security/auth-vault.js.map +1 -0
  22. package/dist/security/domain-filter.d.ts +19 -0
  23. package/dist/security/domain-filter.d.ts.map +1 -0
  24. package/dist/security/domain-filter.js +114 -0
  25. package/dist/security/domain-filter.js.map +1 -0
  26. package/dist/security/types.d.ts +19 -0
  27. package/dist/security/types.d.ts.map +1 -0
  28. package/dist/security/types.js +2 -0
  29. package/dist/security/types.js.map +1 -0
  30. package/dist/sessions/encryption.d.ts +37 -0
  31. package/dist/sessions/encryption.d.ts.map +1 -0
  32. package/dist/sessions/encryption.js +108 -0
  33. package/dist/sessions/encryption.js.map +1 -0
  34. package/dist/sessions/index.d.ts +1 -0
  35. package/dist/sessions/index.d.ts.map +1 -1
  36. package/dist/sessions/index.js +1 -0
  37. package/dist/sessions/index.js.map +1 -1
  38. package/dist/sessions/manager.d.ts +3 -0
  39. package/dist/sessions/manager.d.ts.map +1 -1
  40. package/dist/sessions/manager.js +28 -2
  41. package/dist/sessions/manager.js.map +1 -1
  42. package/dist/snapshot/annotator.d.ts +21 -0
  43. package/dist/snapshot/annotator.d.ts.map +1 -0
  44. package/dist/snapshot/annotator.js +152 -0
  45. package/dist/snapshot/annotator.js.map +1 -0
  46. package/dist/snapshot/boundary.d.ts +7 -0
  47. package/dist/snapshot/boundary.d.ts.map +1 -0
  48. package/dist/snapshot/boundary.js +12 -0
  49. package/dist/snapshot/boundary.js.map +1 -0
  50. package/dist/snapshot/differ.d.ts +40 -0
  51. package/dist/snapshot/differ.d.ts.map +1 -0
  52. package/dist/snapshot/differ.js +194 -0
  53. package/dist/snapshot/differ.js.map +1 -0
  54. package/dist/snapshot/extractor.d.ts +27 -0
  55. package/dist/snapshot/extractor.d.ts.map +1 -0
  56. package/dist/snapshot/extractor.js +265 -0
  57. package/dist/snapshot/extractor.js.map +1 -0
  58. package/dist/snapshot/index.d.ts +8 -0
  59. package/dist/snapshot/index.d.ts.map +1 -0
  60. package/dist/snapshot/index.js +6 -0
  61. package/dist/snapshot/index.js.map +1 -0
  62. package/dist/snapshot/store.d.ts +28 -0
  63. package/dist/snapshot/store.d.ts.map +1 -0
  64. package/dist/snapshot/store.js +65 -0
  65. package/dist/snapshot/store.js.map +1 -0
  66. package/dist/snapshot/types.d.ts +42 -0
  67. package/dist/snapshot/types.d.ts.map +1 -0
  68. package/dist/snapshot/types.js +2 -0
  69. package/dist/snapshot/types.js.map +1 -0
  70. package/dist/tools/index.d.ts.map +1 -1
  71. package/dist/tools/index.js +2 -0
  72. package/dist/tools/index.js.map +1 -1
  73. package/dist/tools/interact.d.ts +194 -5
  74. package/dist/tools/interact.d.ts.map +1 -1
  75. package/dist/tools/interact.js +355 -20
  76. package/dist/tools/interact.js.map +1 -1
  77. package/dist/tools/snapshot.d.ts +53 -0
  78. package/dist/tools/snapshot.d.ts.map +1 -0
  79. package/dist/tools/snapshot.js +160 -0
  80. package/dist/tools/snapshot.js.map +1 -0
  81. package/package.json +1 -1
package/README.md CHANGED
@@ -1,213 +1,258 @@
1
+ <div align="center">
2
+
1
3
  # imperium-crawl
2
4
 
3
- The most powerful open-source MCP server for web scraping, crawling, and data extraction. **22 tools. Zero API keys required for scraping. One `npx` command to install.**
5
+ **The most powerful open-source MCP server for web scraping, crawling, and data extraction.**
4
6
 
5
- While others charge $19+/month for basic scraping, imperium-crawl gives you **more features for free** — including capabilities that no other MCP server offers at any price.
7
+ 23 tools. Zero API keys required. One `npx` command.
6
8
 
7
- ## vs. The Competition
9
+ [![npm version](https://img.shields.io/npm/v/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)
10
+ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
11
+ [![Tests](https://img.shields.io/badge/tests-332%20passing-brightgreen.svg)]()
12
+ [![npm downloads](https://img.shields.io/npm/dm/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)
8
13
 
9
- | Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
10
- |---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
11
- | Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
12
- | Scraping tools | **6** | 3 | 1 | 1 | 1 |
13
- | Search tools | **4** | 0 | 0 | 0 | 0 |
14
- | Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
15
- | Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
16
- | TLS fingerprinting | **Yes (JA3/JA4)** | No | No | No | No |
17
- | CAPTCHA auto-solving | **Yes (2Captcha)** | No | No | No | No |
18
- | API discovery from network traffic | **Yes** | No | No | No | No |
19
- | WebSocket monitoring | **Yes** | No | No | No | No |
20
- | Direct API calls | **Yes** | No | No | No | No |
21
- | Reusable skills system | **Yes** | No | No | No | No |
22
- | Structured data extraction (JSON-LD/OG) | **Yes** | Partial | No | No | No |
23
- | Priority-based crawling | **Yes** | No | No | No | No |
24
- | Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
25
- | URL normalization (11 steps) | **Yes** | No | No | No | No |
26
- | Adaptive learning (self-improving) | **Yes** | No | No | No | No |
27
- | AI-powered data extraction | **Yes** | No | No | No | No |
28
- | Browser automation + sessions | **Yes** | No | No | No | No |
29
- | Batch processing with resume | **Yes** | No | No | No | No |
30
- | Self-hosted | **Yes** | No | N/A | Yes | No |
31
- | Requires external service | **No** | Yes | No | No | Yes |
32
- | Total tools | **22** | 5 | 2 | 2 | 4 |
33
-
34
- > **TLDR:** More tools, more features, zero cost, no external dependencies. Self-hosted, open-source, and it runs on your machine.
35
-
36
- ## Installation
14
+ </div>
37
15
 
38
- ```bash
39
- npm install -g imperium-crawl
40
- ```
41
-
42
- Or run directly without installing:
16
+ ---
43
17
 
44
- ```bash
45
- npx -y imperium-crawl
46
- ```
18
+ ## Quick Start
47
19
 
48
- ### MCP Client Config
20
+ Get running in 30 seconds.
49
21
 
50
- Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MCP client):
22
+ **MCP client** (Claude Code, Cursor, VS Code, Windsurf):
51
23
 
52
24
  ```json
53
25
  {
54
26
  "mcpServers": {
55
27
  "imperium-crawl": {
56
- "type": "stdio",
57
28
  "command": "npx",
58
- "args": ["-y", "imperium-crawl"],
59
- "env": {
60
- "BRAVE_API_KEY": "your-brave-api-key",
61
- "TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
62
- "LLM_API_KEY": "your-api-key",
63
- "LLM_PROVIDER": "anthropic",
64
- "PROXY_URL": "http://user:pass@proxy:8080",
65
- "PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
66
- }
29
+ "args": ["-y", "imperium-crawl"]
67
30
  }
68
31
  }
69
32
  }
70
33
  ```
71
34
 
72
- > **Works out of the box with zero API keys** — 16 tools are fully functional without any configuration (6 scraping + 3 skills + 3 API discovery + 4 batch). To unlock full power, add optional API keys:
73
- >
74
- > | Key | What it unlocks | Where to get it |
75
- > |-----|----------------|-----------------|
76
- > | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
77
- > | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
78
- > | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
79
- > | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
80
- > | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
35
+ **CLI** (zero install):
81
36
 
82
- ### Enable full stealth (Level 3 — headless browser)
37
+ ```bash
38
+ npx -y imperium-crawl scrape --url https://example.com
39
+ ```
40
+
41
+ **Global install:**
83
42
 
84
43
  ```bash
85
- npm i rebrowser-playwright
86
- npx playwright install chromium
44
+ npm install -g imperium-crawl
87
45
  ```
88
46
 
89
- ### AI Agent Guide (SKILL.md)
47
+ > That's it. 17 of 23 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
90
48
 
91
- imperium-crawl ships with [`SKILL.md`](./SKILL.md) — a structured guide that teaches AI agents (Claude, GPT, etc.) how to use all 22 tools effectively. It includes 9 proven workflows, decision trees, error recovery strategies, and advanced patterns like manual skill refinement.
49
+ ---
92
50
 
93
- **Without SKILL.md**, agents can call tools but won't know which tool to try first, when to fallback, or how to chain tools together optimally.
51
+ ## Power Examples
94
52
 
95
- **With SKILL.md**, agents follow battle-tested workflows — readability → scrape → extract fallback chains, auto-detect → manual refinement for skills, search → select → deep-scrape for research, and more.
53
+ Real results. Copy-paste and try.
96
54
 
97
- **Three ways to connect SKILL.md to any agent:**
55
+ ### Scrape through Cloudflare
98
56
 
99
- | Method | Setup | Works with |
100
- |--------|-------|-----------|
101
- | **MCP + SKILL.md** | Add imperium-crawl as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
102
- | **CLI + SKILL.md** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents, anything |
103
- | **TUI mode** | `imperium-crawl tui` — interactive slash-command terminal | Direct human use, demos, debugging |
57
+ ```bash
58
+ imperium-crawl scrape --url https://blog.cloudflare.com
59
+ ```
104
60
 
105
- The CLI approach is universal — any agent that can run shell commands can use all 22 tools. No MCP required.
61
+ ```
62
+ Level 1 (headers) → blocked
63
+ Level 2 (TLS fingerprint) → blocked
64
+ Level 3 (browser + stealth) → success ✅
65
+ → Full markdown content extracted, 213K characters
66
+ → Next visit: skips straight to Level 3 (learned)
67
+ ```
106
68
 
107
- | AI Agent | How to add SKILL.md |
108
- |----------|-------------------|
109
- | **Claude Code** | Copy `SKILL.md` to your project root — Claude Code reads it automatically |
110
- | **Cursor / Windsurf** | Add `SKILL.md` to project rules or include in system prompt |
111
- | **OpenClaw / custom agents** | Include SKILL.md content in your system prompt or context window |
112
- | **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
69
+ ### Discover hidden APIs on any website
113
70
 
114
- ---
71
+ ```bash
72
+ imperium-crawl discover-apis --url https://weather.com
73
+ ```
115
74
 
116
- ## CLI Mode
75
+ ```
76
+ Found 11 hidden API endpoints:
77
+ • api.weather.com — main weather API (exposed API key!)
78
+ • mParticle analytics endpoints
79
+ • Taboola content recommendation API
80
+ • OneTrust consent management API
81
+ • DAA/AdChoices opt-out endpoints
82
+ → Call any endpoint directly with query_api — 10x faster than DOM scraping
83
+ ```
117
84
 
118
- imperium-crawl works as both an **MCP server** and a **standalone CLI tool**. All 22 tools are available as subcommands:
85
+ ### AI extraction in plain English
119
86
 
120
87
  ```bash
121
- # Scrape a website to markdown
122
- imperium-crawl scrape --url https://bbc.com/news
88
+ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
89
+ --schema "extract product name, price, rating, and review count"
90
+ ```
123
91
 
124
- # Crawl with depth control
125
- imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
92
+ ```json
93
+ {
94
+ "product_name": "Apple AirPods Pro 2",
95
+ "price": "$189.99",
96
+ "rating": "4.7 out of 5",
97
+ "review_count": "45,297"
98
+ }
99
+ ```
126
100
 
127
- # Extract structured data with CSS selectors
128
- imperium-crawl extract --url https://news.ycombinator.com --selectors '{"title":".titleline a","score":".score"}' --items-selector ".athing"
101
+ ### Batch scrape with resume
129
102
 
130
- # AI-powered extraction — describe what you want in plain English
131
- imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 --schema "extract product name, price, rating, and review count"
103
+ ```bash
104
+ imperium-crawl batch-scrape \
105
+ --urls '["https://bbc.com","https://cnn.com","https://reuters.com","https://techcrunch.com"]' \
106
+ --concurrency 3
107
+ ```
132
108
 
133
- # Browser automation — interact with pages
134
- imperium-crawl interact --url https://example.com --actions '[{"type":"click","selector":"#login"},{"type":"type","selector":"#email","text":"user@example.com"}]'
109
+ ```
110
+ Scraping 4 URLs (concurrency: 3)...
111
+ ✅ bbc.com — 47K chars
112
+ ✅ cnn.com — 52K chars
113
+ ✅ reuters.com — 38K chars
114
+ ✅ techcrunch.com — 61K chars
115
+ → 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)
116
+ ```
135
117
 
136
- # Batch scrape multiple URLs in parallel
137
- imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com","https://site3.com"]' --concurrency 3
118
+ ---
138
119
 
139
- # List batch jobs
140
- imperium-crawl list-jobs
120
+ ## Why imperium-crawl?
141
121
 
142
- # Discover hidden APIs on any website
143
- imperium-crawl discover-apis --url https://weather.com
122
+ 🔓 **Zero API Keys Required**
123
+ 17 of 23 tools work out of the box. No accounts, no tokens, no credit cards. Just `npx` and go.
144
124
 
145
- # Search the web (requires BRAVE_API_KEY)
146
- imperium-crawl search --query "latest AI news" --count 5
125
+ 🛡️ **3-Level Auto-Escalating Stealth**
126
+ Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Escalates automatically until a request gets through.
147
127
 
148
- # Take a screenshot
149
- imperium-crawl screenshot --url https://github.com --full-page
128
+ 🧠 **Self-Improving**
129
+ Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
150
130
 
151
- # Interactive setup wizard
152
- imperium-crawl setup
153
- ```
131
+ 🧰 **23 Tools, 3 Modes**
132
+ MCP server, CLI tool, or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
154
133
 
155
- ### Output Formats
134
+ 📜 **10 Built-in Recipes**
135
+ Pre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.
156
136
 
157
- ```bash
158
- # JSON (default)
159
- imperium-crawl scrape --url https://example.com
137
+ ⚡ **Skills System**
138
+ Teach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.
139
+
140
+ ---
160
141
 
161
- # CSV
162
- imperium-crawl extract --url https://example.com --selectors '{"title":"h1"}' --output-format csv
142
+ ## vs. The Competition
163
143
 
164
- # Markdown
165
- imperium-crawl scrape --url https://example.com --output-format markdown
144
+ | Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
145
+ |---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
146
+ | Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
147
+ | Total tools | **23** | 5 | 2 | 2 | 4 |
148
+ | Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
149
+ | Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
150
+ | TLS fingerprinting | **JA3/JA4** | No | No | No | No |
151
+ | CAPTCHA auto-solving | **Yes** | No | No | No | No |
152
+ | API discovery | **Yes** | No | No | No | No |
153
+ | WebSocket monitoring | **Yes** | No | No | No | No |
154
+ | AI-powered extraction | **Yes** | No | No | No | No |
155
+ | Adaptive learning | **Yes** | No | No | No | No |
156
+ | Batch processing | **Yes** | No | No | No | No |
157
+ | ARIA Snapshots | **Yes** | No | No | No | No |
158
+ | Session Encryption | **Yes** | No | No | No | No |
159
+ | Action Policy | **Yes** | No | No | No | No |
160
+ | Domain Sandboxing | **Yes** | No | No | No | No |
161
+ | Self-hosted | **Yes** | No | N/A | Yes | No |
162
+ | Requires external service | **No** | Yes | No | No | Yes |
166
163
 
167
- # JSONL (one JSON object per line)
168
- imperium-crawl crawl --url https://example.com --output-format jsonl
164
+ ---
169
165
 
170
- # Pretty-print JSON
171
- imperium-crawl scrape --url https://example.com --pretty
166
+ ## Stealth Engine
172
167
 
173
- # Write to file
174
- imperium-crawl scrape --url https://example.com --output result.json
168
+ ```
169
+ Request → [L1: Headers + UA rotation]
170
+
171
+ ├─ success → Done
172
+ ↓ fail
173
+ [L2: TLS Fingerprint (JA3/JA4)]
174
+
175
+ ├─ success → Done
176
+ ↓ fail
177
+ [L3: Browser + Fingerprint Injection + CAPTCHA]
178
+
179
+ ├─ success → Done
180
+
181
+ [Learning Engine records optimal level for next time]
175
182
  ```
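
The escalation flow in the diagram can be sketched roughly as below. This is an illustration only, not imperium-crawl's actual code: `escalate`, `Attempt`, and the in-memory `learnedLevel` map are hypothetical names, and the real engine performs network requests rather than calling a stub.

```typescript
// Hypothetical sketch of a 3-level escalation loop; not the package's real API.
type Level = 1 | 2 | 3;

// Stand-in for a real request: returns true when the given level gets through.
type Attempt = (domain: string, level: Level) => boolean;

// Stand-in for the learning engine: the level that last worked, per domain.
const learnedLevel = new Map<string, Level>();

function escalate(domain: string, attempt: Attempt): Level | null {
  const levels: Level[] = [1, 2, 3];
  const start = learnedLevel.get(domain) ?? 1;
  for (const level of levels) {
    if (level < start) continue; // skip levels already known to fail
    if (attempt(domain, level)) {
      learnedLevel.set(domain, level); // record for the next visit
      return level;
    }
  }
  return null; // every level was blocked
}

// Simulated site that only lets Level 3 through.
const tried: Level[] = [];
const needsBrowser: Attempt = (_domain, level) => {
  tried.push(level);
  return level === 3;
};

const firstVisit = escalate("example.com", needsBrowser); // tries 1, 2, 3
const secondVisit = escalate("example.com", needsBrowser); // tries 3 only
```

In this sketch the second call starts at the recorded level, which mirrors the "skips straight to Level 3 (learned)" behaviour described in the README.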
176
183
 
177
- ### TUI Mode
184
+ ### Stealth Levels
178
185
 
179
- ```bash
180
- imperium-crawl tui
181
- ```
186
+ | Level | Method | What It Defeats |
187
+ |-------|--------|-----------------|
188
+ | **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
189
+ | **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
190
+ | **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
182
191
 
183
- Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
192
+ ### Anti-Bot System Detection
184
193
 
185
- ### Help
194
+ Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
195
+
196
+ | System | Detection Method |
197
+ |--------|-----------------|
198
+ | **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
199
+ | **Akamai** | `_abck`, `bm_sz` cookies |
200
+ | **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
201
+ | **DataDome** | `datadome` cookies, `datadome` response header |
202
+ | **Kasada** | `x-kpsdk-*` headers |
203
+ | **AWS WAF** | `aws-waf-token` cookie |
204
+ | **F5 / Shape Security** | `TS` prefix cookies |
205
+
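
As a rough illustration of the signature matching in the table above, the sketch below checks cookie and header names only. `detectAntiBot` and `ResponseInfo` are hypothetical names, not the package's API. It assumes `_px` and `TS` are cookie-name prefixes, skips the Cloudflare challenge-title check, and tests the broad `TS` prefix last to reduce false positives.

```typescript
// Hypothetical signature matcher based on the detection table; not the real implementation.
type AntiBot =
  | "cloudflare" | "akamai" | "perimeterx" | "datadome"
  | "kasada" | "aws-waf" | "f5";

interface ResponseInfo {
  cookieNames: string[]; // names of cookies set by the response
  headerNames: string[]; // response header names, any casing
}

function detectAntiBot(res: ResponseInfo): AntiBot | null {
  const cookies = res.cookieNames;
  const headers = res.headerNames.map((h) => h.toLowerCase());

  const hasCookie = (name: string) => cookies.includes(name);
  const cookieStartsWith = (prefix: string) => cookies.some((c) => c.startsWith(prefix));
  const hasHeader = (name: string) => headers.includes(name);
  const headerStartsWith = (prefix: string) => headers.some((h) => h.startsWith(prefix));

  if (hasCookie("cf_clearance") || hasHeader("cf-mitigated")) return "cloudflare";
  if (hasCookie("_abck") || hasCookie("bm_sz")) return "akamai";
  if (cookieStartsWith("_px") || hasHeader("_pxhd")) return "perimeterx";
  if (hasCookie("datadome") || hasHeader("datadome")) return "datadome";
  if (headerStartsWith("x-kpsdk-")) return "kasada";
  if (hasCookie("aws-waf-token")) return "aws-waf";
  if (cookieStartsWith("TS")) return "f5"; // broad prefix, so checked last
  return null; // no known signature found
}
```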
206
+ ### Smart Rendering Cache
207
+
208
+ Once imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
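
A per-domain cache with a one-hour expiry could look something like the sketch below. `RenderingCache` and its methods are hypothetical names used for illustration, and the `now` parameter exists only so the expiry logic can be exercised without waiting.

```typescript
// Hypothetical TTL cache for "this domain needs browser rendering" decisions.
const ONE_HOUR_MS = 60 * 60 * 1000;

class RenderingCache {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = ONE_HOUR_MS) {
    this.ttlMs = ttlMs;
  }

  // Remember that a domain needed Level 3 at time `now` (ms since epoch).
  markNeedsBrowser(domain: string, now: number = Date.now()): void {
    this.expiresAt.set(domain, now + this.ttlMs);
  }

  // True while the cached decision is fresh; expired entries are dropped.
  needsBrowser(domain: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(domain);
    if (expiry === undefined) return false;
    if (now >= expiry) {
      this.expiresAt.delete(domain);
      return false;
    }
    return true;
  }
}
```

Once an entry expires, a caller would fall back to the normal Level 1 → 2 → 3 escalation for that domain.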
209
+
210
+ ---
211
+
212
+ ## Adaptive Learning Engine
213
+
214
+ imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — fully automatic.
215
+
216
+ Every time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it **predicts the optimal configuration** — skipping failed levels and going straight to what works.
186
217
 
187
- ```bash
188
- imperium-crawl --help # List all commands
189
- imperium-crawl scrape --help # Help for specific tool
190
- imperium-crawl --version # Show version
191
218
  ```
219
+ First visit to cloudflare.com:
220
+ Level 1 → blocked ❌
221
+ Level 2 → blocked ❌
222
+ Level 3 → success ✅ (Cloudflare detected)
223
+ → Engine records: cloudflare.com needs Level 3
192
224
 
193
- > **No arguments** = starts as MCP server (stdio). **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
225
+ Second visit to cloudflare.com:
226
+ → Engine predicts: Level 3, confidence 85%, Cloudflare
227
+ → Skips Level 1 and 2 entirely — goes straight to browser
228
+ → 3x faster than first visit
229
+ ```
230
+
231
+ ### Smart Features
232
+
233
+ - **Time decay** — Knowledge older than 7 days loses weight, adapts when sites change defenses
234
+ - **Confidence scoring** — Low data = start from level 1. High confidence = skip to optimal level
235
+ - **Auto-prune** — Domains unused for 30 days are cleaned up. Max 2,000 domains stored
236
+ - **Atomic persistence** — Knowledge saved via atomic write (tmp → rename). Never corrupts
237
+
238
+ > **The more you use it, the faster it gets.**
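
The README does not give the formulas behind time decay and confidence scoring, so the following is only one plausible sketch. The exponential 7-day half-life, the 0.7 threshold, and the names `Observation`, `decayWeight`, and `predictStartLevel` are all assumptions made for illustration.

```typescript
// Hypothetical sketch of time-decayed confidence scoring; the formulas are assumptions.
type StealthLevel = 1 | 2 | 3;

const DAY_MS = 24 * 60 * 60 * 1000;
const HALF_LIFE_MS = 7 * DAY_MS; // assumption: an observation's weight halves every 7 days
const MIN_CONFIDENCE = 0.7;      // assumption: threshold for skipping lower levels

interface Observation {
  level: StealthLevel; // stealth level that was attempted
  success: boolean;    // whether the request got through
  at: number;          // timestamp in ms
}

function decayWeight(obs: Observation, now: number): number {
  return Math.pow(0.5, (now - obs.at) / HALF_LIFE_MS);
}

// Returns the level to start from: the best-supported level if confident, else 1.
function predictStartLevel(history: Observation[], now: number): StealthLevel {
  let best: StealthLevel = 1;
  let bestConfidence = 0;
  for (const level of [1, 2, 3] as const) {
    const relevant = history.filter((o) => o.level === level);
    const total = relevant.reduce((sum, o) => sum + decayWeight(o, now), 0);
    const wins = relevant.reduce((sum, o) => sum + (o.success ? decayWeight(o, now) : 0), 0);
    // The +1 acts as one pseudo-failure, so sparse or stale data yields low confidence.
    const confidence = wins / (total + 1);
    if (confidence > bestConfidence) {
      best = level;
      bestConfidence = confidence;
    }
  }
  return bestConfidence >= MIN_CONFIDENCE ? best : 1;
}
```

With these assumed numbers, five recent successes at Level 3 give a confidence of 5/6 and skip straight to Level 3, while a single success, or five successes from four weeks ago, falls back to starting at Level 1.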
194
239
 
195
240
  ---
196
241
 
197
- ## 22 Tools
242
+ ## All 23 Tools
198
243
 
199
- ### Scraping (no API key needed)
244
+ ### 📄 Scraping (no API key needed)
200
245
 
201
246
  | Tool | What It Does |
202
247
  |------|-------------|
203
- | **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Returns structured data (JSON-LD, OpenGraph, Microdata), metadata, and links on request. |
204
- | **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. Content paths rank higher, tracking params are stripped. |
205
- | **map** | Discover all URLs on a domain via sitemap.xml parsing + page link extraction. |
248
+ | **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |
249
+ | **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |
250
+ | **map** | Discover all URLs on a domain via sitemap.xml + page link extraction. |
206
251
  | **extract** | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |
207
- | **readability** | Mozilla Readability article extraction — title, author, content, publish date. 3x faster with linkedom. |
252
+ | **readability** | Mozilla Readability article extraction — title, author, content, publish date. |
208
253
  | **screenshot** | Full-page or viewport PNG screenshots via headless Chromium. |
209
254
 
210
- ### Search (requires free Brave API key)
255
+ ### 🔍 Search (requires free Brave API key)
211
256
 
212
257
  | Tool | What It Does |
213
258
  |------|-------------|
@@ -216,131 +261,146 @@ imperium-crawl --version # Show version
216
261
  | **image_search** | Image search with thumbnails and source URLs. |
217
262
  | **video_search** | Video search across platforms. |
218
263
 
219
- ### Skills (no API key needed)
264
+ ### Skills (no API key needed)
220
265
 
221
266
  | Tool | What It Does |
222
267
  |------|-------------|
223
- | **create_skill** | Analyze any page, auto-detect repeating patterns (articles, products, listings), generate CSS selectors, and save as a reusable skill. |
224
- | **run_skill** | Run a saved skill to get fresh structured data instantly. Supports pagination. |
225
- | **list_skills** | List all saved skills with their configurations. |
268
+ | **create_skill** | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |
269
+ | **run_skill** | Run a saved skill for fresh structured data. Supports pagination. |
270
+ | **list_skills** | List all saved skills with configurations. |
226
271
 
227
- ### API Discovery & Real-Time (no API key needed, requires Playwright)
272
+ ### 🔓 API Discovery & Real-Time (no API key needed, requires Playwright)
228
273
 
229
274
  | Tool | What It Does |
230
275
  |------|-------------|
231
- | **discover_apis** | Navigate to any page, intercept all XHR/fetch calls, and map every hidden REST/GraphQL API endpoint. Auto-detects GraphQL, filters noise, returns response previews. **No other MCP server does this.** |
232
- | **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering entirely for 10x faster data access. Use after `discover_apis` to hit endpoints directly. |
233
- | **monitor_websocket** | Capture real-time WebSocket messages from any page — financial tickers, chat feeds, live dashboards. Returns connection details and message payloads. **No other MCP server does this.** |
276
+ | **discover_apis** | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |
277
+ | **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |
278
+ | **monitor_websocket** | Capture real-time WebSocket messages — financial tickers, chat feeds, live dashboards. |
234
279
 
235
- ### AI Extraction (requires LLM API key)
280
+ ### 🧠 AI Extraction (requires LLM API key)
236
281
 
237
282
  | Tool | What It Does |
238
283
  |------|-------------|
239
- | **ai_extract** | AI-powered data extraction — describe what you want in natural language or provide a JSON schema. Supports auto mode (LLM decides what to extract), 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
284
+ | **ai_extract** | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
240
285
 
241
- ### Interaction (no API key needed, requires Playwright)
286
+ ### 🖱️ Interaction (no API key needed, requires Playwright)
242
287
 
243
288
  | Tool | What It Does |
244
289
  |------|-------------|
245
- | **interact** | Browser automation with 10 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate). Session persistence saves/restores cookies across calls: build login flows and multi-step workflows. |
290
+ | **interact** | Browser automation with 18 action types (including click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. |
291
+ | **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |
246
292
 
247
- ### Batch Processing (no API key needed)
293
+ ### 📦 Batch Processing (no API key needed)
248
294
 
249
295
  | Tool | What It Does |
250
296
  |------|-------------|
251
- | **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure (continues on errors), and resume support via job_id. Optional AI extraction per URL. |
252
- | **list_jobs** | List all batch jobs with status, progress, and timestamps. |
253
- | **job_status** | Get full results for a specific batch job including per-URL outcomes. |
297
+ | **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |
298
+ | **list_jobs** | List all batch jobs with status and progress. |
299
+ | **job_status** | Full results for a specific batch job including per-URL outcomes. |
254
300
  | **delete_job** | Clean up completed or failed batch jobs. |
255
301
 
256
302
  ---
257
303
 
258
- ## Stealth Engine
304
+ ## MCP Setup — Detailed
259
305
 
260
- imperium-crawl uses a 3-level stealth system that **auto-escalates** based on the target site's defenses:
306
+ Full configuration with all optional environment variables:
261
307
 
262
- | Level | Method | What It Defeats |
263
- |-------|--------|-----------------|
264
- | **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
265
- | **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
266
- | **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA solving | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
308
+ ```json
309
+ {
310
+ "mcpServers": {
311
+ "imperium-crawl": {
312
+ "command": "npx",
313
+ "args": ["-y", "imperium-crawl"],
314
+ "env": {
315
+ "BRAVE_API_KEY": "your-brave-api-key",
316
+ "TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
317
+ "LLM_API_KEY": "your-api-key",
318
+ "LLM_PROVIDER": "anthropic",
319
+ "SESSION_ENCRYPTION_KEY": "your-64-char-hex-key",
320
+ "PROXY_URL": "http://user:pass@proxy:8080",
321
+ "PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
322
+ }
323
+ }
324
+ }
325
+ }
326
+ ```
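
The `SESSION_ENCRYPTION_KEY` placeholder asks for a 64-character hex string, which encodes 32 bytes. The helper below is a hypothetical sanity check you might run on a key before placing it in the config. It is not part of imperium-crawl and makes no claim about how the package uses the key.

```typescript
// Hypothetical validator for a 64-character hex key; not part of imperium-crawl.
function isValidSessionKey(key: string): boolean {
  return /^[0-9a-fA-F]{64}$/.test(key);
}

// Decode a validated hex key into its 32 raw bytes.
function hexToBytes(key: string): Uint8Array {
  if (!isValidSessionKey(key)) {
    throw new Error("expected exactly 64 hex characters");
  }
  const bytes = new Uint8Array(32);
  for (let i = 0; i < 32; i++) {
    bytes[i] = parseInt(key.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```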
267
327
 
268
- ### Anti-Bot System Detection
328
+ ### API Keys
269
329
 
270
- Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
330
+ | Key | What It Unlocks | Where to Get It |
331
+ |-----|----------------|-----------------|
332
+ | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
333
+ | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
334
+ | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
335
+ | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
336
+ | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
271
337
 
272
- | System | Detection Method |
273
- |--------|-----------------|
274
- | **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
275
- | **Akamai** | `_abck`, `bm_sz` cookies |
276
- | **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
277
- | **DataDome** | `datadome` cookies, `datadome` response header |
278
- | **Kasada** | `x-kpsdk-*` headers |
279
- | **AWS WAF** | `aws-waf-token` cookie |
280
- | **F5 / Shape Security** | `TS` prefix cookies |
338
+ ### Enable Full Stealth (Level 3)
281
339
 
282
- ### Smart Rendering Cache
340
+ ```bash
341
+ npm i rebrowser-playwright
342
+ npx playwright install chromium
343
+ ```
283
344
 
284
- Once imperium-crawl determines a domain needs Level 3 (browser), it **caches that decision** for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
345
+ ### Per-Client Notes
285
346
 
286
- ---
347
+ | Client | Config Location |
348
+ |--------|----------------|
349
+ | **Claude Code** | `.mcp.json` in project root or `~/.claude/settings.json` global |
350
+ | **Cursor** | Settings → MCP Servers |
351
+ | **VS Code** | `.vscode/mcp.json` or user settings |
352
+ | **Windsurf** | `~/.codeium/windsurf/mcp_config.json` |
287
353
 
288
- ## 🧠 Adaptive Learning Engine
354
+ ---
289
355
 
290
- imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — it works automatically in the background.
356
+ ## CLI Mode
291
357
 
292
- ### How It Works
358
+ **No arguments** = starts as MCP server. **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
293
359
 
294
- Every time you scrape a website, the engine records:
295
- - Which **stealth level** worked (1, 2, or 3)
296
- - Which **anti-bot system** was detected (Cloudflare, DataDome, etc.)
297
- - Whether a **proxy** was needed
298
- - **Response time** and **HTTP status**
299
- - Whether the request was **blocked or successful**
360
+ ```bash
361
+ # Scrape a website to markdown
362
+ imperium-crawl scrape --url https://bbc.com/news
300
363
 
301
- Next time you hit the same domain, the engine **predicts the optimal configuration** — skipping failed levels and going straight to what works.
364
+ # Crawl with depth control
365
+ imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
302
366
 
303
- ### What It Learns Per Domain
367
+ # AI-powered extraction with a plain-English schema
368
+ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
369
+ --schema "extract product name, price, rating, and review count"
304
370
 
305
- | Data Point | How It's Used |
306
- |-----------|---------------|
307
- | Optimal stealth level | Skip straight to the level that works — no wasted escalation |
308
- | Anti-bot system | Remember which defense the site uses |
309
- | Proxy requirement | Auto-suggest proxy if requests keep failing without one |
310
- | Response time | Exponential moving average — adapts to site speed changes |
311
- | Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
312
- | Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
371
+ # Discover hidden APIs
372
+ imperium-crawl discover-apis --url https://weather.com
313
373
 
314
- ### Smart Features
374
+ # Batch scrape in parallel
375
+ imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com"]' --concurrency 3
315
376
 
316
- - **Time decay** — Knowledge older than 7 days loses weight, so the engine adapts when sites change defenses
317
- - **Confidence scoring** — Low data = start from level 1. High confidence = skip directly to optimal level
318
- - **Auto-prune** — Domains unused for 30 days are automatically cleaned up. Max 2,000 domains stored
319
- - **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
320
- - **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
377
+ # Interactive setup wizard
378
+ imperium-crawl setup
379
+ ```
321
380
 
322
- ### Example
381
+ ### Output Formats
323
382
 
383
+ ```bash
384
+ imperium-crawl scrape --url https://example.com # JSON (default)
385
+ imperium-crawl scrape --url https://example.com --output-format markdown # Markdown
386
+ imperium-crawl scrape --url https://example.com --output-format csv # CSV
387
+ imperium-crawl scrape --url https://example.com --pretty # Pretty JSON
388
+ imperium-crawl scrape --url https://example.com --output result.json # Write to file
324
389
  ```
325
- First visit to cloudflare.com:
326
- Level 1 → blocked ❌
327
- Level 2 → blocked ❌
328
- Level 3 → success ✅ (Cloudflare detected)
329
- → Engine records: cloudflare.com needs Level 3
330
390
 
331
- Second visit to cloudflare.com:
332
- → Engine predicts: Level 3, confidence 85%, Cloudflare
333
- → Skips Level 1 and 2 entirely — goes straight to browser
334
- 3x faster than first visit
391
+ ### TUI Mode
392
+
393
+ ```bash
394
+ imperium-crawl tui
335
395
  ```
336
396
 
337
- > **The more you use it, the faster it gets.** Zero configuration. Fully automatic.
397
+ Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
338
398
 
339
399
  ---
340
400
 
341
- ## Skills System
401
+ ## Skills & Recipes
342
402
 
343
- Skills let you teach imperium-crawl how to extract data from any website, then re-run it for fresh content whenever you want.
403
+ Skills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.
344
404
 
345
405
  **Create a skill:**
346
406
  ```
@@ -351,20 +411,36 @@ create_skill({
351
411
  })
352
412
  ```
353
413
 
354
- The tool analyzes the page, auto-detects repeating elements (articles, products, listings), generates CSS selectors for each field, and saves the skill config.
355
-
356
414
  **Run a skill:**
357
415
  ```
358
416
  run_skill({ name: "tc-ai-news" })
417
+ → Returns fresh structured data with all detected fields
359
418
  ```
360
419
 
361
- Returns fresh structured data with all detected fields. Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, and portable.
420
+ Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, portable.
421
+
422
+ ### Built-in Recipes
423
+
424
+ | Recipe | What It Does |
425
+ |--------|-------------|
426
+ | `news-extraction` | Extract article title, author, date, content from news sites |
427
+ | `ecommerce-scrape` | Product name, price, rating, reviews, images |
428
+ | `social-media` | Posts, engagement metrics, user profiles |
429
+ | `job-listings` | Title, company, salary, location, description |
430
+ | `real-estate` | Property listings with price, address, features |
431
+ | `api-reverse-engineer` | Discover → query → monitor workflow |
432
+ | `competitor-monitor` | Track pricing and product changes |
433
+ | `lead-generation` | Extract business contact info |
434
+ | `content-aggregator` | Multi-source content collection |
435
+ | `data-pipeline` | Batch scrape → extract → export workflow |
436
+
437
+ See [`SKILL/`](./SKILL/) for detailed workflow guides and agent integration.
362
438
 
363
439
  ---
364
440
 
365
441
  ## API Discovery Workflow
366
442
 
367
- This is the workflow that no other MCP server supports. Real results from actual testing:
443
+ Turn any website into an API. No documentation needed.
368
444
 
369
445
  ```
370
446
  1. discover_apis({ url: "https://weather.com" })
@@ -373,65 +449,82 @@ This is the workflow that no other MCP server supports. Real results from actual
373
449
  • mParticle analytics endpoints
374
450
  • Taboola content recommendation API
375
451
  • OneTrust consent management API
376
- • DAA/AdChoices opt-out endpoints
377
452
 
378
453
  2. query_api({ url: "https://api.weather.com/v3/...", method: "GET" })
379
- → Direct API call, bypasses DOM entirely — 10x faster, structured JSON response
454
+ → Direct API call, bypasses DOM entirely — 10x faster, structured JSON
380
455
 
381
456
  3. monitor_websocket({ url: "https://binance.com/en/trade/BTC_USDT", duration_seconds: 10 })
382
- → Captures real-time WebSocket messages — financial tickers, live data feeds
457
+ → Captures real-time WebSocket messages — live BTC price feed
383
458
  ```
384
459
 
385
- Turn any website into an API. No documentation needed.
460
+ ---
461
+
462
+ ## AI Agent Guide
463
+
464
+ imperium-crawl ships with [`SKILL/`](./SKILL/) — a structured guide that teaches AI agents how to use all 23 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.
465
+
466
+ ### Three Ways to Connect
467
+
468
+ | Method | Setup | Works With |
469
+ |--------|-------|-----------|
470
+ | **MCP + SKILL/** | Add as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
471
+ | **CLI + SKILL/** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents |
472
+ | **TUI** | `imperium-crawl tui` — interactive terminal | Direct human use, demos, debugging |
473
+
474
+ ### Per-Agent Setup
475
+
476
+ | AI Agent | How to Add SKILL/ |
477
+ |----------|-------------------|
478
+ | **Claude Code** | Copy `SKILL.md` to project root — auto-detected |
479
+ | **Cursor / Windsurf** | Add `SKILL.md` to project rules or system prompt |
480
+ | **OpenClaw / custom agents** | Include SKILL.md in system prompt or context window |
481
+ | **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
386
482
 
387
483
  ---
388
484
 
389
485
  ## Resilience
390
486
 
391
487
  - **Exponential backoff with full jitter** — AWS-recommended retry pattern, no thundering herd
392
- - **Per-domain circuit breaker** — 5 consecutive failures opens the circuit for 60s, then half-open probing with automatic recovery
393
- - **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params, normalizes encoding
394
- - **Concurrency limiting** — per-domain request throttling via p-queue
395
- - **Input validation** — all 22 tool schemas enforce strict bounds (URL length, query size, concurrency limits, body size)
396
- - **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
397
- - **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
488
+ - **Per-domain circuit breaker** — 5 failures opens circuit for 60s, then half-open probing with auto recovery
489
+ - **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
490
+ - **Proxy support** — single proxy or rotating pool with http/https/socks4/socks5
398
491
  - **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
399
- - **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
400
- - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
401
492
  - **robots.txt** — respected by default (configurable)
493
+ - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
402
494
 
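The "exponential backoff with full jitter" bullet above can be illustrated with a small sketch. This is not imperium-crawl's actual retry code: the names `fullJitterDelay` and `retryWithBackoff`, and the default base and cap values, are assumptions chosen for illustration. The idea is that each retry sleeps for a random duration between 0 and an exponentially growing ceiling, so many clients that fail together do not all retry at the same instant.

```typescript
// Hypothetical helper (not part of imperium-crawl): full-jitter delay.
// The ceiling doubles per attempt and is clamped to capMs; the actual delay
// is drawn uniformly from [0, ceiling).
function fullJitterDelay(
  attempt: number,
  baseMs = 200,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Hypothetical wrapper: retry an async task, sleeping a jittered delay
// between failed attempts and rethrowing the last error when attempts run out.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 200,
  capMs = 30_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // no sleep after the final failure
      const delayMs = fullJitterDelay(attempt, baseMs, capMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A caller would wrap any flaky request, for example `retryWithBackoff(() => fetch(url))`. The `random` parameter exists only so the delay calculation can be tested deterministically.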
403
495
  ---
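
The URL normalization bullet in the Resilience list names two of the pipeline's steps: dropping tracking parameters (`utm_*`, `fbclid`, `gclid`) and sorting query parameters. The sketch below shows only those two steps using the standard `URL` API. The function name `normalizeUrl` is hypothetical, and the remaining steps of the 11-step pipeline are not reproduced because the README does not list them.

```typescript
// Exact-match tracking parameters named in the README; utm_* is handled by prefix.
const TRACKING_PARAMS = new Set<string>(["fbclid", "gclid"]);

// Hypothetical helper: strip tracking params, then sort what remains.
function normalizeUrl(input: string): string {
  const url = new URL(input);
  // Copy the keys first so deleting does not disturb the live iterator.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  // Stable sort by parameter name, so equivalent URLs compare equal.
  url.searchParams.sort();
  return url.toString();
}

console.log(normalizeUrl("https://example.com/news?utm_source=x&b=2&fbclid=abc&a=1"));
```

Normalizing before deduplication or caching means two links that differ only in tracking parameters or parameter order map to the same key.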
404
496
 
405
- ## 🔥 Real-World Test Results
497
+ ## Real-World Test Results
406
498
 
407
499
  Every tool tested against production websites with real anti-bot defenses:
408
500
 
409
501
  | Tool | Target | Result |
410
502
  |------|--------|--------|
411
- | 📄 **scrape** | BBC News | Full markdown content, stealth level 3 auto-escalation |
412
- | 🕸️ **crawl** | Cloudflare Blog | **213K characters** crawled with depth control |
413
- | 🗺️ **map** | BBC | Full URL discovery via sitemap + page link extraction |
503
+ | 📄 **scrape** | BBC News | Full markdown, stealth level 3 auto-escalation |
504
+ | 🕸️ **crawl** | Cloudflare Blog | 213K characters crawled with depth control |
505
+ | 🗺️ **map** | BBC | Full URL discovery via sitemap + link extraction |
414
506
  | 🕷️ **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
415
- | 📖 **readability** | Medium article | Clean extraction — title, author, content, publish date |
507
+ | 📖 **readability** | Medium article | Clean — title, author, content, publish date |
416
508
  | 📸 **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |
417
- | 🔍 **search** | Brave Web Search | Web results with snippets and URLs |
418
- | 📰 **news_search** | Brave News Search | News results with freshness ranking |
419
- | 🖼️ **image_search** | Brave Image Search | Images with thumbnails and source URLs |
420
- | 🎬 **video_search** | Brave Video Search | Video results across platforms |
421
- | 🛠️ **create_skill** | Hacker News | Auto-detected 30 repeating stories with CSS selectors |
422
- | ▶️ **run_skill** | Saved skill | Fresh structured data from saved extraction config |
423
- | 📋 **list_skills** | — | Lists all saved skills with configurations |
424
- | 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome anti-bot, Google Maps key, internal APIs |
509
+ | 🔍 **search** | Brave Web | Web results with snippets and URLs |
510
+ | 📰 **news_search** | Brave News | News results with freshness ranking |
511
+ | 🖼️ **image_search** | Brave Image | Images with thumbnails and source URLs |
512
+ | 🎬 **video_search** | Brave Video | Video results across platforms |
513
+ | 🛠️ **create_skill** | Hacker News | Auto-detected 30 stories with CSS selectors |
514
+ | ▶️ **run_skill** | Saved skill | Fresh structured data from saved config |
515
+ | 📋 **list_skills** | — | Lists all skills with configurations |
516
+ | 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome, Google Maps key, internal APIs |
425
517
  | ⚡ **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |
426
- | 📡 **monitor_websocket** | Binance BTC/USDT | **3 WebSocket connections, 23 live messages** — BTC price live |
427
- | 🧠 **ai_extract** | Amazon product page | AI extracted name, price, rating, review count — natural language schema |
428
- | 🖱️ **interact** | Login flow | Click → type email → type password → submit — session cookies persisted |
429
- | 📦 **batch_scrape** | 10 news sites | Parallel scrape with concurrency 3, soft failure, 9/10 succeeded |
430
- | 📋 **list_jobs** | | Lists all batch jobs with status and progress |
431
- | 📊 **job_status** | Batch job | Full per-URL results with timing and extracted data |
518
+ | 📡 **monitor_websocket** | Binance BTC/USDT | 3 WebSocket connections, 23 live messages — BTC price live |
519
+ | 🧠 **ai_extract** | Amazon product | AI extracted name, price, rating, review count |
520
+ | 🎯 **snapshot** | GitHub, Wikipedia | ARIA tree with 107/113 refs, annotated screenshots |
521
+ | 🖱️ **interact** | Login flow | Click → type → submit — ref targeting, session encryption, 18 action types |
522
+ | 📦 **batch_scrape** | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |
523
+ | 📋 **list_jobs** | | Batch jobs with status and progress |
524
+ | 📊 **job_status** | Batch job | Full per-URL results with timing |
432
525
  | 🗑️ **delete_job** | Completed job | Cleaned up job data from disk |
433
526
 
434
- > 🏆 **22/22 tools working. 58 hidden APIs discovered. Live crypto feed captured. AI extraction. Browser automation. Zero API keys needed for scraping.**
527
+ > **23/23 tools. 34 hidden APIs on Airbnb. Live BTC feed. Zero API keys for scraping.**
435
528
 
436
529
  ---
437
530
 
@@ -441,18 +534,19 @@ Every tool tested against production websites with real anti-bot defenses:
441
534
  |----------|----------|-------------|
442
535
  | `BRAVE_API_KEY` | No | Brave Search API key (enables 4 search tools) |
443
536
  | `TWOCAPTCHA_API_KEY` | No | 2Captcha API key (enables auto CAPTCHA solving) |
537
+ | `LLM_API_KEY` | No | Anthropic or OpenAI API key (enables `ai_extract`) |
538
+ | `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
539
+ | `LLM_MODEL` | No | Override default LLM model |
540
+ | `SESSION_ENCRYPTION_KEY` | No | 32-byte hex key for encrypting session files at rest |
444
541
  | `TRANSPORT` | No | `stdio` (default) or `http` |
445
542
  | `PORT` | No | HTTP port (default: 3000) |
446
543
  | `PROXY_URL` | No | Single proxy URL (http/https/socks4/socks5) |
447
544
  | `PROXY_URLS` | No | Comma-separated proxy URLs for rotation |
448
545
  | `BROWSER_POOL_SIZE` | No | Max pooled browser instances (default: 3) |
449
546
  | `RESPECT_ROBOTS` | No | Respect robots.txt (default: `true`) |
450
- | `LLM_API_KEY` | No | Anthropic or OpenAI API key (enables `ai_extract` tool) |
451
- | `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
452
- | `LLM_MODEL` | No | Override default LLM model |
453
- | `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated browser sessions |
454
- | `NO_COLOR` | No | Disable colored output (standard convention) |
455
- | `CI` | No | Auto-detected; disables TTY features (spinners, colors) |
547
+ | `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated sessions |
548
+ | `NO_COLOR` | No | Disable colored output |
549
+ | `CI` | No | Auto-detected; disables TTY features |
456
550
 
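The table describes `SESSION_ENCRYPTION_KEY` as a 32-byte hex key. One way to produce such a value is with Node's built-in `crypto` module, as sketched below. The helper names are hypothetical, and the assumption that the package expects exactly 64 hexadecimal characters (32 bytes, hex-encoded) is inferred from the table's wording, not confirmed against the package source.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: 32 cryptographically random bytes, hex-encoded (64 characters).
function generateSessionKey(): string {
  return randomBytes(32).toString("hex");
}

// Hypothetical check, assuming the key must be exactly 64 hex characters.
function isValidSessionKey(key: string): boolean {
  return /^[0-9a-fA-F]{64}$/.test(key);
}

// Print a line that can be pasted into a shell profile or .env file.
console.log(`SESSION_ENCRYPTION_KEY=${generateSessionKey()}`);
```

Treat the generated key like any other secret: keep it out of version control, and note that session files encrypted with a lost key presumably cannot be recovered.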
457
551
  ---
458
552
 
@@ -463,8 +557,27 @@ git clone https://github.com/ceoimperiumprojects/imperium-crawl
463
557
  cd imperium-crawl
464
558
  npm install
465
559
  npm run build
466
- npm test # 332 tests
467
- npm start
560
+ npm run dev # Watch mode (rebuild on changes)
561
+ npm test # 332 tests
562
+ npm start # Start MCP server
563
+ ```
564
+
565
+ ---
566
+
567
+ ## Contributing
568
+
569
+ Contributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.
570
+
571
+ ```bash
572
+ # Fork the repo, then:
573
+ git clone https://github.com/YOUR_USERNAME/imperium-crawl
574
+ cd imperium-crawl
575
+ npm install
576
+ git checkout -b my-feature
577
+ # Make changes...
578
+ npm test
579
+ git push origin my-feature
580
+ # Open a PR
468
581
  ```
469
582
 
470
583
  ---