imperium-crawl 1.5.1 → 1.5.3
- package/README.md +367 -262
- package/dist/captcha/detector.js +3 -3
- package/dist/captcha/detector.js.map +1 -1
- package/dist/captcha/solver.d.ts.map +1 -1
- package/dist/captcha/solver.js +4 -1
- package/dist/captcha/solver.js.map +1 -1
- package/dist/cli.d.ts.map +1 -1
- package/dist/cli.js +11 -8
- package/dist/cli.js.map +1 -1
- package/dist/knowledge/store.d.ts.map +1 -1
- package/dist/knowledge/store.js +1 -9
- package/dist/knowledge/store.js.map +1 -1
- package/dist/llm/extractor.d.ts +1 -0
- package/dist/llm/extractor.d.ts.map +1 -1
- package/dist/llm/extractor.js +5 -4
- package/dist/llm/extractor.js.map +1 -1
- package/dist/llm/providers/anthropic.d.ts.map +1 -1
- package/dist/llm/providers/anthropic.js +34 -31
- package/dist/llm/providers/anthropic.js.map +1 -1
- package/dist/llm/providers/openai.d.ts.map +1 -1
- package/dist/llm/providers/openai.js +26 -23
- package/dist/llm/providers/openai.js.map +1 -1
- package/dist/llm/retry.d.ts +10 -0
- package/dist/llm/retry.d.ts.map +1 -0
- package/dist/llm/retry.js +35 -0
- package/dist/llm/retry.js.map +1 -0
- package/dist/recipes/crypto-websocket.json +11 -0
- package/dist/recipes/ecommerce-product.json +25 -0
- package/dist/recipes/github-trending.json +19 -0
- package/dist/recipes/hn-top-stories.json +22 -0
- package/dist/recipes/index.d.ts +3 -0
- package/dist/recipes/index.d.ts.map +1 -0
- package/dist/recipes/index.js +23 -0
- package/dist/recipes/index.js.map +1 -0
- package/dist/recipes/job-listings-greenhouse.json +17 -0
- package/dist/recipes/news-article-reader.json +9 -0
- package/dist/recipes/product-reviews.json +33 -0
- package/dist/recipes/reddit-posts.json +8 -0
- package/dist/recipes/seo-page-audit.json +26 -0
- package/dist/recipes/social-media-mentions.json +31 -0
- package/dist/server.d.ts.map +1 -1
- package/dist/server.js +11 -1
- package/dist/server.js.map +1 -1
- package/dist/skills/manager.d.ts +36 -1
- package/dist/skills/manager.d.ts.map +1 -1
- package/dist/skills/manager.js +22 -0
- package/dist/skills/manager.js.map +1 -1
- package/dist/stealth/antibot-detector.js +2 -2
- package/dist/stealth/antibot-detector.js.map +1 -1
- package/dist/stealth/chrome-profile.d.ts.map +1 -1
- package/dist/stealth/chrome-profile.js +6 -2
- package/dist/stealth/chrome-profile.js.map +1 -1
- package/dist/stealth/index.d.ts +7 -0
- package/dist/stealth/index.d.ts.map +1 -1
- package/dist/stealth/index.js +81 -22
- package/dist/stealth/index.js.map +1 -1
- package/dist/stealth/proxy.js +2 -2
- package/dist/stealth/proxy.js.map +1 -1
- package/dist/stealth/tls.d.ts.map +1 -1
- package/dist/stealth/tls.js +12 -2
- package/dist/stealth/tls.js.map +1 -1
- package/dist/tools/interact.d.ts.map +1 -1
- package/dist/tools/interact.js +5 -0
- package/dist/tools/interact.js.map +1 -1
- package/dist/tools/list-skills.d.ts +1 -1
- package/dist/tools/list-skills.d.ts.map +1 -1
- package/dist/tools/list-skills.js +23 -10
- package/dist/tools/list-skills.js.map +1 -1
- package/dist/tools/run-skill.d.ts +7 -1
- package/dist/tools/run-skill.d.ts.map +1 -1
- package/dist/tools/run-skill.js +288 -36
- package/dist/tools/run-skill.js.map +1 -1
- package/dist/utils/fetcher.d.ts.map +1 -1
- package/dist/utils/fetcher.js +8 -38
- package/dist/utils/fetcher.js.map +1 -1
- package/dist/utils/robots.d.ts.map +1 -1
- package/dist/utils/robots.js +5 -1
- package/dist/utils/robots.js.map +1 -1
- package/dist/utils/url.d.ts +4 -0
- package/dist/utils/url.d.ts.map +1 -1
- package/dist/utils/url.js +11 -0
- package/dist/utils/url.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,213 +1,254 @@
|
|
|
1
|
+
<div align="center">
|
|
2
|
+
|
|
1
3
|
# imperium-crawl
|
|
2
4
|
|
|
3
|
-
The most powerful open-source MCP server for web scraping, crawling, and data extraction
|
|
5
|
+
**The most powerful open-source MCP server for web scraping, crawling, and data extraction.**
|
|
4
6
|
|
|
5
|
-
|
|
7
|
+
22 tools. Zero API keys required. One `npx` command.
|
|
6
8
|
|
|
7
|
-
|
|
9
|
+
[](https://www.npmjs.com/package/imperium-crawl)
|
|
10
|
+
[](./LICENSE)
|
|
11
|
+
[]()
|
|
12
|
+
[](https://www.npmjs.com/package/imperium-crawl)
|
|
8
13
|
|
|
9
|
-
|
|
10
|
-
|---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
|
|
11
|
-
| Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
|
|
12
|
-
| Scraping tools | **6** | 3 | 1 | 1 | 1 |
|
|
13
|
-
| Search tools | **4** | 0 | 0 | 0 | 0 |
|
|
14
|
-
| Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
|
|
15
|
-
| Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
|
|
16
|
-
| TLS fingerprinting | **Yes (JA3/JA4)** | No | No | No | No |
|
|
17
|
-
| CAPTCHA auto-solving | **Yes (2Captcha)** | No | No | No | No |
|
|
18
|
-
| API discovery from network traffic | **Yes** | No | No | No | No |
|
|
19
|
-
| WebSocket monitoring | **Yes** | No | No | No | No |
|
|
20
|
-
| Direct API calls | **Yes** | No | No | No | No |
|
|
21
|
-
| Reusable skills system | **Yes** | No | No | No | No |
|
|
22
|
-
| Structured data extraction (JSON-LD/OG) | **Yes** | Partial | No | No | No |
|
|
23
|
-
| Priority-based crawling | **Yes** | No | No | No | No |
|
|
24
|
-
| Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
|
|
25
|
-
| URL normalization (11 steps) | **Yes** | No | No | No | No |
|
|
26
|
-
| Adaptive learning (self-improving) | **Yes** | No | No | No | No |
|
|
27
|
-
| AI-powered data extraction | **Yes** | No | No | No | No |
|
|
28
|
-
| Browser automation + sessions | **Yes** | No | No | No | No |
|
|
29
|
-
| Batch processing with resume | **Yes** | No | No | No | No |
|
|
30
|
-
| Self-hosted | **Yes** | No | N/A | Yes | No |
|
|
31
|
-
| Requires external service | **No** | Yes | No | No | Yes |
|
|
32
|
-
| Total tools | **22** | 5 | 2 | 2 | 4 |
|
|
33
|
-
|
|
34
|
-
> **TLDR:** More tools, more features, zero cost, no external dependencies. Self-hosted, open-source, and it runs on your machine.
|
|
14
|
+
</div>
|
|
35
15
|
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
```bash
|
|
39
|
-
npm install -g imperium-crawl
|
|
40
|
-
```
|
|
16
|
+
---
|
|
41
17
|
|
|
42
|
-
|
|
18
|
+
## Quick Start
|
|
43
19
|
|
|
44
|
-
|
|
45
|
-
npx -y imperium-crawl
|
|
46
|
-
```
|
|
20
|
+
Get running in 30 seconds.
|
|
47
21
|
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MCP client):
|
|
22
|
+
**MCP client** (Claude Code, Cursor, VS Code, Windsurf):
|
|
51
23
|
|
|
52
24
|
```json
|
|
53
25
|
{
|
|
54
26
|
"mcpServers": {
|
|
55
27
|
"imperium-crawl": {
|
|
56
|
-
"type": "stdio",
|
|
57
28
|
"command": "npx",
|
|
58
|
-
"args": ["-y", "imperium-crawl"]
|
|
59
|
-
"env": {
|
|
60
|
-
"BRAVE_API_KEY": "your-brave-api-key",
|
|
61
|
-
"TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
|
|
62
|
-
"LLM_API_KEY": "your-api-key",
|
|
63
|
-
"LLM_PROVIDER": "anthropic",
|
|
64
|
-
"PROXY_URL": "http://user:pass@proxy:8080",
|
|
65
|
-
"PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
|
|
66
|
-
}
|
|
29
|
+
"args": ["-y", "imperium-crawl"]
|
|
67
30
|
}
|
|
68
31
|
}
|
|
69
32
|
}
|
|
70
33
|
```
|
|
71
34
|
|
|
72
|
-
|
|
73
|
-
>
|
|
74
|
-
> | Key | What it unlocks | Where to get it |
|
|
75
|
-
> |-----|----------------|-----------------|
|
|
76
|
-
> | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
|
|
77
|
-
> | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
|
|
78
|
-
> | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
|
|
79
|
-
> | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
|
|
80
|
-
> | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
|
|
35
|
+
**CLI** (zero install):
|
|
81
36
|
|
|
82
|
-
|
|
37
|
+
```bash
|
|
38
|
+
npx -y imperium-crawl scrape --url https://example.com
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
**Global install:**
|
|
83
42
|
|
|
84
43
|
```bash
|
|
85
|
-
npm
|
|
86
|
-
npx playwright install chromium
|
|
44
|
+
npm install -g imperium-crawl
|
|
87
45
|
```
|
|
88
46
|
|
|
89
|
-
|
|
47
|
+
> That's it. 16 of 22 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
|
|
90
48
|
|
|
91
|
-
|
|
49
|
+
---
|
|
92
50
|
|
|
93
|
-
|
|
51
|
+
## Power Examples
|
|
94
52
|
|
|
95
|
-
|
|
53
|
+
Real results. Copy-paste and try.
|
|
96
54
|
|
|
97
|
-
|
|
55
|
+
### Scrape through Cloudflare
|
|
98
56
|
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
| **CLI + SKILL.md** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents, anything |
|
|
103
|
-
| **TUI mode** | `imperium-crawl tui` — interactive slash-command terminal | Direct human use, demos, debugging |
|
|
57
|
+
```bash
|
|
58
|
+
imperium-crawl scrape --url https://blog.cloudflare.com
|
|
59
|
+
```
|
|
104
60
|
|
|
105
|
-
|
|
61
|
+
```
|
|
62
|
+
Level 1 (headers) → blocked
|
|
63
|
+
Level 2 (TLS fingerprint) → blocked
|
|
64
|
+
Level 3 (browser + stealth) → success ✅
|
|
65
|
+
→ Full markdown content extracted, 213K characters
|
|
66
|
+
→ Next visit: skips straight to Level 3 (learned)
|
|
67
|
+
```
|
|
106
68
|
|
|
107
|
-
|
|
108
|
-
|----------|-------------------|
|
|
109
|
-
| **Claude Code** | Copy `SKILL.md` to your project root — Claude Code reads it automatically |
|
|
110
|
-
| **Cursor / Windsurf** | Add `SKILL.md` to project rules or include in system prompt |
|
|
111
|
-
| **OpenClaw / custom agents** | Include SKILL.md content in your system prompt or context window |
|
|
112
|
-
| **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
|
|
69
|
+
### Discover hidden APIs on any website
|
|
113
70
|
|
|
114
|
-
|
|
71
|
+
```bash
|
|
72
|
+
imperium-crawl discover-apis --url https://weather.com
|
|
73
|
+
```
|
|
115
74
|
|
|
116
|
-
|
|
75
|
+
```
|
|
76
|
+
Found 11 hidden API endpoints:
|
|
77
|
+
• api.weather.com — main weather API (exposed API key!)
|
|
78
|
+
• mParticle analytics endpoints
|
|
79
|
+
• Taboola content recommendation API
|
|
80
|
+
• OneTrust consent management API
|
|
81
|
+
• DAA/AdChoices opt-out endpoints
|
|
82
|
+
→ Call any endpoint directly with query_api — 10x faster than DOM scraping
|
|
83
|
+
```
|
|
117
84
|
|
|
118
|
-
|
|
85
|
+
### AI extraction in plain English
|
|
119
86
|
|
|
120
87
|
```bash
|
|
121
|
-
|
|
122
|
-
|
|
88
|
+
imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
|
|
89
|
+
--schema "extract product name, price, rating, and review count"
|
|
90
|
+
```
|
|
123
91
|
|
|
124
|
-
|
|
125
|
-
|
|
92
|
+
```json
|
|
93
|
+
{
|
|
94
|
+
"product_name": "Apple AirPods Pro 2",
|
|
95
|
+
"price": "$189.99",
|
|
96
|
+
"rating": "4.7 out of 5",
|
|
97
|
+
"review_count": "45,297"
|
|
98
|
+
}
|
|
99
|
+
```
|
|
126
100
|
|
|
127
|
-
|
|
128
|
-
imperium-crawl extract --url https://news.ycombinator.com --selectors '{"title":".titleline a","score":".score"}' --items-selector ".athing"
|
|
101
|
+
### Batch scrape with resume
|
|
129
102
|
|
|
130
|
-
|
|
131
|
-
imperium-crawl
|
|
103
|
+
```bash
|
|
104
|
+
imperium-crawl batch-scrape \
|
|
105
|
+
--urls '["https://bbc.com","https://cnn.com","https://reuters.com","https://techcrunch.com"]' \
|
|
106
|
+
--concurrency 3
|
|
107
|
+
```
|
|
132
108
|
|
|
133
|
-
|
|
134
|
-
|
|
109
|
+
```
|
|
110
|
+
Scraping 4 URLs (concurrency: 3)...
|
|
111
|
+
✅ bbc.com — 47K chars
|
|
112
|
+
✅ cnn.com — 52K chars
|
|
113
|
+
✅ reuters.com — 38K chars
|
|
114
|
+
✅ techcrunch.com — 61K chars
|
|
115
|
+
→ 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)
|
|
116
|
+
```
|
|
135
117
|
|
|
136
|
-
|
|
137
|
-
imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com","https://site3.com"]' --concurrency 3
|
|
118
|
+
---
|
|
138
119
|
|
|
139
|
-
|
|
140
|
-
imperium-crawl list-jobs
|
|
120
|
+
## Why imperium-crawl?
|
|
141
121
|
|
|
142
|
-
|
|
143
|
-
|
|
122
|
+
🔓 **Zero API Keys Required**
|
|
123
|
+
16 of 22 tools work out of the box. No accounts, no tokens, no credit cards. Just `npx` and go.
|
|
144
124
|
|
|
145
|
-
|
|
146
|
-
|
|
125
|
+
🛡️ **3-Level Auto-Escalating Stealth**
|
|
126
|
+
Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatically escalates until it gets through.
|
|
147
127
|
|
|
148
|
-
|
|
149
|
-
|
|
128
|
+
🧠 **Self-Improving**
|
|
129
|
+
Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
|
|
150
130
|
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
```
|
|
131
|
+
🧰 **22 Tools, 3 Modes**
|
|
132
|
+
MCP server, CLI tool, or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
|
|
154
133
|
|
|
155
|
-
|
|
134
|
+
📜 **10 Built-in Recipes**
|
|
135
|
+
Pre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.
|
|
156
136
|
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
imperium-crawl scrape --url https://example.com
|
|
137
|
+
⚡ **Skills System**
|
|
138
|
+
Teach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.
|
|
160
139
|
|
|
161
|
-
|
|
162
|
-
imperium-crawl extract --url https://example.com --selectors '{"title":"h1"}' --output-format csv
|
|
140
|
+
---
|
|
163
141
|
|
|
164
|
-
|
|
165
|
-
imperium-crawl scrape --url https://example.com --output-format markdown
|
|
142
|
+
## vs. The Competition
|
|
166
143
|
|
|
167
|
-
|
|
168
|
-
|
|
144
|
+
| Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
|
|
145
|
+
|---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
|
|
146
|
+
| Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
|
|
147
|
+
| Total tools | **22** | 5 | 2 | 2 | 4 |
|
|
148
|
+
| Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
|
|
149
|
+
| Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
|
|
150
|
+
| TLS fingerprinting | **JA3/JA4** | No | No | No | No |
|
|
151
|
+
| CAPTCHA auto-solving | **Yes** | No | No | No | No |
|
|
152
|
+
| API discovery | **Yes** | No | No | No | No |
|
|
153
|
+
| WebSocket monitoring | **Yes** | No | No | No | No |
|
|
154
|
+
| AI-powered extraction | **Yes** | No | No | No | No |
|
|
155
|
+
| Adaptive learning | **Yes** | No | No | No | No |
|
|
156
|
+
| Batch processing | **Yes** | No | No | No | No |
|
|
157
|
+
| Self-hosted | **Yes** | No | N/A | Yes | No |
|
|
158
|
+
| Requires external service | **No** | Yes | No | No | Yes |
|
|
169
159
|
|
|
170
|
-
|
|
171
|
-
imperium-crawl scrape --url https://example.com --pretty
|
|
160
|
+
---
|
|
172
161
|
|
|
173
|
-
|
|
174
|
-
|
|
162
|
+
## Stealth Engine
|
|
163
|
+
|
|
164
|
+
```
|
|
165
|
+
Request → [L1: Headers + UA rotation]
|
|
166
|
+
│
|
|
167
|
+
├─ success → Done
|
|
168
|
+
↓ fail
|
|
169
|
+
[L2: TLS Fingerprint (JA3/JA4)]
|
|
170
|
+
│
|
|
171
|
+
├─ success → Done
|
|
172
|
+
↓ fail
|
|
173
|
+
[L3: Browser + Fingerprint Injection + CAPTCHA]
|
|
174
|
+
│
|
|
175
|
+
├─ success → Done
|
|
176
|
+
↓
|
|
177
|
+
[Learning Engine records optimal level for next time]
|
|
175
178
|
```
|
|
176
179
|
|
|
177
|
-
###
|
|
180
|
+
### Stealth Levels
|
|
178
181
|
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
+
| Level | Method | What It Defeats |
|
|
183
|
+
|-------|--------|-----------------|
|
|
184
|
+
| **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
|
|
185
|
+
| **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
|
|
186
|
+
| **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
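The escalation across the three levels in the table above can be pictured as a simple loop: try the cheapest level first and move up only on failure. The sketch below is a minimal, synchronous illustration under stated assumptions. `StealthLevel`, `Attempt`, `Strategy`, and `fetchWithEscalation` are hypothetical names, not imperium-crawl's actual API, and a real implementation would almost certainly be asynchronous.

```typescript
// Hypothetical names throughout; this is not the package's real API.
type StealthLevel = 1 | 2 | 3;

interface Attempt {
  ok: boolean;          // true when the response looks like real content
  body: string | null;  // page content on success, null otherwise
}

// One strategy per level: 1 = headers, 2 = TLS fingerprint, 3 = browser.
type Strategy = (url: string) => Attempt;

interface EscalationResult {
  level: StealthLevel | null; // level that succeeded, or null if all failed
  tried: StealthLevel[];      // levels attempted, in order
  body: string | null;
}

function fetchWithEscalation(
  url: string,
  strategies: Record<StealthLevel, Strategy>,
  startLevel: StealthLevel = 1,
): EscalationResult {
  const order: StealthLevel[] = [1, 2, 3];
  const tried: StealthLevel[] = [];
  for (const level of order) {
    if (level < startLevel) continue; // skip levels below the chosen start
    tried.push(level);
    const attempt = strategies[level](url);
    if (attempt.ok) {
      return { level, tried, body: attempt.body };
    }
  }
  // Every level from startLevel upward was blocked.
  return { level: null, tried, body: null };
}
```

Passing a higher `startLevel` models a domain that is already known to need more stealth, so the cheaper levels are never attempted.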
|
|
182
187
|
|
|
183
|
-
|
|
188
|
+
### Anti-Bot System Detection
|
|
184
189
|
|
|
185
|
-
|
|
190
|
+
Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
|
|
191
|
+
|
|
192
|
+
| System | Detection Method |
|
|
193
|
+
|--------|-----------------|
|
|
194
|
+
| **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
|
|
195
|
+
| **Akamai** | `_abck`, `bm_sz` cookies |
|
|
196
|
+
| **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
|
|
197
|
+
| **DataDome** | `datadome` cookies, `datadome` response header |
|
|
198
|
+
| **Kasada** | `x-kpsdk-*` headers |
|
|
199
|
+
| **AWS WAF** | `aws-waf-token` cookie |
|
|
200
|
+
| **F5 / Shape Security** | `TS` prefix cookies |
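As a rough illustration of the detection table above, the signals can be checked in order against a response's cookie names and header names. This is a sketch only: `detectAntiBot` and `AntiBotSystem` are hypothetical names, exact-match versus prefix-match choices are assumptions, and the "challenge page title" signal is left out because the README does not say which titles are matched.

```typescript
// Hypothetical helper; signals are taken from the table, matching rules are assumed.
type AntiBotSystem =
  | "cloudflare"
  | "akamai"
  | "perimeterx"
  | "datadome"
  | "kasada"
  | "aws-waf"
  | "f5-shape";

function detectAntiBot(
  cookieNames: string[],
  headerNames: string[],
): AntiBotSystem | null {
  // Header names are case-insensitive; cookie names are compared as-is.
  const headers = headerNames.map((h) => h.toLowerCase());
  const hasCookie = (name: string): boolean =>
    cookieNames.some((c) => c === name);
  const cookieStartsWith = (prefix: string): boolean =>
    cookieNames.some((c) => c.indexOf(prefix) === 0);
  const hasHeader = (name: string): boolean =>
    headers.some((h) => h === name);
  const headerStartsWith = (prefix: string): boolean =>
    headers.some((h) => h.indexOf(prefix) === 0);

  if (hasCookie("cf_clearance") || hasHeader("cf-mitigated")) return "cloudflare";
  if (hasCookie("_abck") || hasCookie("bm_sz")) return "akamai";
  if (cookieStartsWith("_px") || headerStartsWith("_pxhd")) return "perimeterx";
  if (hasCookie("datadome") || hasHeader("datadome")) return "datadome";
  if (headerStartsWith("x-kpsdk-")) return "kasada";
  if (hasCookie("aws-waf-token")) return "aws-waf";
  // Checked last because a bare "TS" prefix is the weakest signal here.
  if (cookieStartsWith("TS")) return "f5-shape";
  return null;
}
```

The order of the checks is a design choice in this sketch: more specific signals come first so that a generic cookie prefix does not shadow a stronger match.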
|
|
201
|
+
|
|
202
|
+
### Smart Rendering Cache
|
|
203
|
+
|
|
204
|
+
Once imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
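A per-domain cache with a one-hour expiry is enough to express the behavior described above. The following is a minimal sketch, not the package's implementation: `RenderingCache` and its method names are hypothetical, and only the 1-hour lifetime comes from the text. The clock is passed in explicitly so the behavior is easy to check.

```typescript
// Hypothetical class; only the 1-hour TTL is taken from the README text.
const ONE_HOUR_MS = 60 * 60 * 1000;

class RenderingCache {
  // Prototype-free object so domain names never collide with built-in keys.
  private expiresAt: Record<string, number> = Object.create(null);
  private ttlMs: number;

  constructor(ttlMs: number = ONE_HOUR_MS) {
    this.ttlMs = ttlMs;
  }

  // Remember that this domain needs Level 3 (browser) rendering.
  markNeedsBrowser(domain: string, nowMs: number): void {
    this.expiresAt[domain] = nowMs + this.ttlMs;
  }

  // True while the cached decision is still fresh.
  needsBrowser(domain: string, nowMs: number): boolean {
    const expiry = this.expiresAt[domain];
    if (expiry === undefined) return false;
    if (nowMs >= expiry) {
      delete this.expiresAt[domain]; // expired: go back to normal escalation
      return false;
    }
    return true;
  }
}
```

In use, a caller would check `needsBrowser(domain, Date.now())` before a request and call `markNeedsBrowser` whenever only Level 3 succeeded.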
|
|
205
|
+
|
|
206
|
+
---
|
|
207
|
+
|
|
208
|
+
## Adaptive Learning Engine
|
|
209
|
+
|
|
210
|
+
imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — fully automatic.
|
|
211
|
+
|
|
212
|
+
Every time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it **predicts the optimal configuration** — skipping failed levels and going straight to what works.
|
|
186
213
|
|
|
187
|
-
```bash
|
|
188
|
-
imperium-crawl --help # List all commands
|
|
189
|
-
imperium-crawl scrape --help # Help for specific tool
|
|
190
|
-
imperium-crawl --version # Show version
|
|
191
214
|
```
|
|
215
|
+
First visit to cloudflare.com:
|
|
216
|
+
Level 1 → blocked ❌
|
|
217
|
+
Level 2 → blocked ❌
|
|
218
|
+
Level 3 → success ✅ (Cloudflare detected)
|
|
219
|
+
→ Engine records: cloudflare.com needs Level 3
|
|
220
|
+
|
|
221
|
+
Second visit to cloudflare.com:
|
|
222
|
+
→ Engine predicts: Level 3, confidence 85%, Cloudflare
|
|
223
|
+
→ Skips Level 1 and 2 entirely — goes straight to browser
|
|
224
|
+
→ 3x faster than first visit
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
### Smart Features
|
|
228
|
+
|
|
229
|
+
- **Time decay** — Knowledge older than 7 days loses weight, adapts when sites change defenses
|
|
230
|
+
- **Confidence scoring** — Low data = start from level 1. High confidence = skip to optimal level
|
|
231
|
+
- **Auto-prune** — Domains unused for 30 days are cleaned up. Max 2,000 domains stored
|
|
232
|
+
- **Atomic persistence** — Knowledge saved via atomic write (tmp → rename). Never corrupts
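The time-decay and confidence rules in the list above could be sketched roughly as follows. The README gives only the 7-day window and the "low data means start from level 1" rule, so everything else here is an assumption made for illustration: the halving curve, the `MIN_CONFIDENCE` and `MIN_EVIDENCE` thresholds, and the names `decayWeight` and `predictStartLevel`.

```typescript
// All names and constants except the 7-day window are assumptions.
type Level = 1 | 2 | 3;

const DAY_MS = 24 * 60 * 60 * 1000;
const DECAY_AFTER_DAYS = 7; // from the text: older knowledge loses weight
const MIN_CONFIDENCE = 0.6; // assumed threshold for trusting a prediction
const MIN_EVIDENCE = 2;     // assumed minimum total weight before predicting

interface Observation {
  level: Level; // stealth level that succeeded
  atMs: number; // when it was observed
}

// Full weight for 7 days, then halved for every further 7 days (assumed curve).
function decayWeight(ageMs: number): number {
  const ageDays = ageMs / DAY_MS;
  if (ageDays <= DECAY_AFTER_DAYS) return 1;
  return Math.pow(0.5, (ageDays - DECAY_AFTER_DAYS) / DECAY_AFTER_DAYS);
}

// Pick the level to start from; fall back to 1 when evidence is thin.
function predictStartLevel(history: Observation[], nowMs: number): Level {
  const weights = { 1: 0, 2: 0, 3: 0 };
  let total = 0;
  for (const obs of history) {
    const w = decayWeight(nowMs - obs.atMs);
    weights[obs.level] += w;
    total += w;
  }
  if (total < MIN_EVIDENCE) return 1; // low data: start from level 1

  let best: Level = 1;
  for (const level of [1, 2, 3] as const) {
    if (weights[level] > weights[best]) best = level;
  }
  // High confidence: skip straight to the best-supported level.
  return weights[best] / total >= MIN_CONFIDENCE ? best : 1;
}
```

With this shape, stale observations still count but cannot outvote a few recent ones, which is one way to adapt when a site changes its defenses.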
|
|
192
233
|
|
|
193
|
-
> **
|
|
234
|
+
> **The more you use it, the faster it gets.**
|
|
194
235
|
|
|
195
236
|
---
|
|
196
237
|
|
|
197
|
-
## 22 Tools
|
|
238
|
+
## All 22 Tools
|
|
198
239
|
|
|
199
|
-
### Scraping (no API key needed)
|
|
240
|
+
### 📄 Scraping (no API key needed)
|
|
200
241
|
|
|
201
242
|
| Tool | What It Does |
|
|
202
243
|
|------|-------------|
|
|
203
|
-
| **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth.
|
|
204
|
-
| **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring.
|
|
205
|
-
| **map** | Discover all URLs on a domain via sitemap.xml
|
|
244
|
+
| **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |
|
|
245
|
+
| **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |
|
|
246
|
+
| **map** | Discover all URLs on a domain via sitemap.xml + page link extraction. |
|
|
206
247
|
| **extract** | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |
|
|
207
|
-
| **readability** | Mozilla Readability article extraction — title, author, content, publish date.
|
|
248
|
+
| **readability** | Mozilla Readability article extraction — title, author, content, publish date. |
|
|
208
249
|
| **screenshot** | Full-page or viewport PNG screenshots via headless Chromium. |
|
|
209
250
|
|
|
210
|
-
### Search (requires free Brave API key)
|
|
251
|
+
### 🔍 Search (requires free Brave API key)
|
|
211
252
|
|
|
212
253
|
| Tool | What It Does |
|
|
213
254
|
|------|-------------|
|
|
@@ -216,131 +257,144 @@ imperium-crawl --version # Show version
|
|
|
216
257
|
| **image_search** | Image search with thumbnails and source URLs. |
|
|
217
258
|
| **video_search** | Video search across platforms. |
|
|
218
259
|
|
|
219
|
-
### Skills (no API key needed)
|
|
260
|
+
### ⚡ Skills (no API key needed)
|
|
220
261
|
|
|
221
262
|
| Tool | What It Does |
|
|
222
263
|
|------|-------------|
|
|
223
|
-
| **create_skill** | Analyze any page, auto-detect repeating patterns
|
|
224
|
-
| **run_skill** | Run a saved skill
|
|
225
|
-
| **list_skills** | List all saved skills with
|
|
264
|
+
| **create_skill** | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |
|
|
265
|
+
| **run_skill** | Run a saved skill for fresh structured data. Supports pagination. |
|
|
266
|
+
| **list_skills** | List all saved skills with configurations. |
|
|
226
267
|
|
|
227
|
-
### API Discovery & Real-Time (no API key needed, requires Playwright)
|
|
268
|
+
### 🔓 API Discovery & Real-Time (no API key needed, requires Playwright)
|
|
228
269
|
|
|
229
270
|
| Tool | What It Does |
|
|
230
271
|
|------|-------------|
|
|
231
|
-
| **discover_apis** | Navigate to any page, intercept
|
|
232
|
-
| **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering
|
|
233
|
-
| **monitor_websocket** | Capture real-time WebSocket messages
|
|
272
|
+
| **discover_apis** | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |
|
|
273
|
+
| **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |
|
|
274
|
+
| **monitor_websocket** | Capture real-time WebSocket messages — financial tickers, chat feeds, live dashboards. |
|
|
234
275
|
|
|
235
|
-
### AI Extraction (requires LLM API key)
|
|
276
|
+
### 🧠 AI Extraction (requires LLM API key)
|
|
236
277
|
|
|
237
278
|
| Tool | What It Does |
|
|
238
279
|
|------|-------------|
|
|
239
|
-
| **ai_extract** |
|
|
280
|
+
| **ai_extract** | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
|
|
240
281
|
|
|
241
|
-
### Interaction (no API key needed, requires Playwright)
|
|
282
|
+
### 🖱️ Interaction (no API key needed, requires Playwright)
|
|
242
283
|
|
|
243
284
|
| Tool | What It Does |
|
|
244
285
|
|------|-------------|
|
|
245
|
-
| **interact** | Browser automation with 10 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate). Session persistence saves/restores cookies
|
|
286
|
+
| **interact** | Browser automation with 10 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate). Session persistence saves/restores cookies. |
|
|
246
287
|
|
|
247
|
-
### Batch Processing (no API key needed)
|
|
288
|
+
### 📦 Batch Processing (no API key needed)
|
|
248
289
|
|
|
249
290
|
| Tool | What It Does |
|
|
250
291
|
|------|-------------|
|
|
251
|
-
| **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure
|
|
252
|
-
| **list_jobs** | List all batch jobs with status
|
|
253
|
-
| **job_status** |
|
|
292
|
+
| **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |
|
|
293
|
+
| **list_jobs** | List all batch jobs with status and progress. |
|
|
294
|
+
| **job_status** | Full results for a specific batch job including per-URL outcomes. |
|
|
254
295
|
| **delete_job** | Clean up completed or failed batch jobs. |
|
|
255
296
|
|
|
256
297
|
---
|
|
257
298
|
|
|
258
|
-
##
|
|
299
|
+
## MCP Setup — Detailed
|
|
259
300
|
|
|
260
|
-
|
|
301
|
+
Full configuration with all optional environment variables:
|
|
261
302
|
|
|
262
|
-
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
303
|
+
```json
|
|
304
|
+
{
|
|
305
|
+
"mcpServers": {
|
|
306
|
+
"imperium-crawl": {
|
|
307
|
+
"command": "npx",
|
|
308
|
+
"args": ["-y", "imperium-crawl"],
|
|
309
|
+
"env": {
|
|
310
|
+
"BRAVE_API_KEY": "your-brave-api-key",
|
|
311
|
+
"TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
|
|
312
|
+
"LLM_API_KEY": "your-api-key",
|
|
313
|
+
"LLM_PROVIDER": "anthropic",
|
|
314
|
+
"PROXY_URL": "http://user:pass@proxy:8080",
|
|
315
|
+
"PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
|
|
316
|
+
}
|
|
317
|
+
}
|
|
318
|
+
}
|
|
319
|
+
}
|
|
320
|
+
```
|
|
267
321
|
|
|
268
|
-
###
|
|
322
|
+
### API Keys
|
|
269
323
|
|
|
270
|
-
|
|
324
|
+
| Key | What It Unlocks | Where to Get It |
|
|
325
|
+
|-----|----------------|-----------------|
|
|
326
|
+
| `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
|
|
327
|
+
| `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
|
|
328
|
+
| `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
|
|
329
|
+
| `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
|
|
330
|
+
| `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
|
|
271
331
|
|
|
272
|
-
|
|
273
|
-
|--------|-----------------|
|
|
274
|
-
| **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
|
|
275
|
-
| **Akamai** | `_abck`, `bm_sz` cookies |
|
|
276
|
-
| **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
|
|
277
|
-
| **DataDome** | `datadome` cookies, `datadome` response header |
|
|
278
|
-
| **Kasada** | `x-kpsdk-*` headers |
|
|
279
|
-
| **AWS WAF** | `aws-waf-token` cookie |
|
|
280
|
-
| **F5 / Shape Security** | `TS` prefix cookies |
|
|
332
|
+
### Enable Full Stealth (Level 3)
|
|
281
333
|
|
|
282
|
-
|
|
334
|
+
```bash
|
|
335
|
+
npm i rebrowser-playwright
|
|
336
|
+
npx playwright install chromium
|
|
337
|
+
```
|
|
283
338
|
|
|
284
|
-
|
|
339
|
+
### Per-Client Notes
|
|
285
340
|
|
|
286
|
-
|
|
341
|
+
| Client | Config Location |
|
|
342
|
+
|--------|----------------|
|
|
343
|
+
| **Claude Code** | `.mcp.json` in project root or `~/.claude/settings.json` global |
|
|
344
|
+
| **Cursor** | Settings → MCP Servers |
|
|
345
|
+
| **VS Code** | `.vscode/mcp.json` or user settings |
|
|
346
|
+
| **Windsurf** | `~/.codeium/windsurf/mcp_config.json` |
|
|
287
347
|
|
|
288
|
-
|
|
348
|
+
---
|
|
289
349
|
|
|
290
|
-
|
|
350
|
+
## CLI Mode
|
|
291
351
|
|
|
292
|
-
|
|
352
|
+
**No arguments** = starts as MCP server. **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
|
|
293
353
|
|
|
294
|
-
|
|
295
|
-
|
|
296
|
-
-
|
|
297
|
-
- Whether a **proxy** was needed
|
|
298
|
-
- **Response time** and **HTTP status**
|
|
299
|
-
- Whether the request was **blocked or successful**
|
|
354
|
+
```bash
|
|
355
|
+
# Scrape a website to markdown
|
|
356
|
+
imperium-crawl scrape --url https://bbc.com/news
|
|
300
357
|
|
|
301
|
-
|
|
358
|
+
# Crawl with depth control
|
|
359
|
+
imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
|
|
302
360
|
|
|
303
|
-
|
|
361
|
+
# AI-powered extraction — plain English
|
|
362
|
+
imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
|
|
363
|
+
--schema "extract product name, price, rating, and review count"
|
|
304
364
|
|
|
305
|
-
|
|
306
|
-
|
|
307
|
-
| Optimal stealth level | Skip straight to the level that works — no wasted escalation |
|
|
308
|
-
| Anti-bot system | Remember which defense the site uses |
|
|
309
|
-
| Proxy requirement | Auto-suggest proxy if requests keep failing without one |
|
|
310
|
-
| Response time | Exponential moving average — adapts to site speed changes |
|
|
311
|
-
| Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
|
|
312
|
-
| Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
|
|
365
|
+
# Discover hidden APIs
|
|
366
|
+
imperium-crawl discover-apis --url https://weather.com
|
|
313
367
|
|
|
314
|
-
|
|
368
|
+
# Batch scrape in parallel
|
|
369
|
+
imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com"]' --concurrency 3
|
|
315
370
|
|
|
316
|
-
|
|
317
|
-
-
|
|
318
|
-
|
|
319
|
-
- **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
|
|
320
|
-
- **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
|
|
371
|
+
# Interactive setup wizard
|
|
372
|
+
imperium-crawl setup
|
|
373
|
+
```
|
|
321
374
|
|
|
322
|
-
###
|
|
375
|
+
### Output Formats
|
|
323
376
|
|
|
377
|
+
```bash
|
|
378
|
+
imperium-crawl scrape --url https://example.com # JSON (default)
|
|
379
|
+
imperium-crawl scrape --url https://example.com --output-format markdown # Markdown
|
|
380
|
+
imperium-crawl scrape --url https://example.com --output-format csv # CSV
|
|
381
|
+
imperium-crawl scrape --url https://example.com --pretty # Pretty JSON
|
|
382
|
+
imperium-crawl scrape --url https://example.com --output result.json # Write to file
|
|
324
383
|
```
|
|
325
|
-
First visit to cloudflare.com:
|
|
326
|
-
Level 1 → blocked ❌
|
|
327
|
-
Level 2 → blocked ❌
|
|
328
|
-
Level 3 → success ✅ (Cloudflare detected)
|
|
329
|
-
→ Engine records: cloudflare.com needs Level 3
|
|
330
384
|
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
334
|
-
|
|
385
|
+
### TUI Mode
|
|
386
|
+
|
|
387
|
+
```bash
|
|
388
|
+
imperium-crawl tui
|
|
335
389
|
```
|
|
336
390
|
|
|
337
|
-
|
|
391
|
+
Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
|
|
338
392
|
|
|
339
393
|
---
|
|
340
394
|
|
|
341
|
-
## Skills
|
|
395
|
+
## Skills & Recipes
|
|
342
396
|
|
|
343
|
-
Skills let you teach imperium-crawl how to extract data from any website, then re-run
|
|
397
|
+
Skills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.
|
|
344
398
|
|
|
345
399
|
**Create a skill:**
|
|
346
400
|
```
|
|
@@ -351,20 +405,36 @@ create_skill({
|
|
|
351
405
|
})
|
|
352
406
|
```
|
|
353
407
|
|
|
354
|
-
The tool analyzes the page, auto-detects repeating elements (articles, products, listings), generates CSS selectors for each field, and saves the skill config.
|
|
355
|
-
|
|
356
408
|
**Run a skill:**
|
|
357
409
|
```
|
|
358
410
|
run_skill({ name: "tc-ai-news" })
|
|
411
|
+
→ Returns fresh structured data with all detected fields
|
|
359
412
|
```
|
|
360
413
|
|
|
361
|
-
|
|
414
|
+
Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, portable.
|
|
415
|
+
|
|
416
|
+
### Built-in Recipes
|
|
417
|
+
|
|
418
|
+
| Recipe | What It Does |
|
|
419
|
+
|--------|-------------|
|
|
420
|
+
| `news-extraction` | Extract article title, author, date, content from news sites |
|
|
421
|
+
| `ecommerce-scrape` | Product name, price, rating, reviews, images |
|
|
422
|
+
| `social-media` | Posts, engagement metrics, user profiles |
|
|
423
|
+
| `job-listings` | Title, company, salary, location, description |
|
|
424
|
+
| `real-estate` | Property listings with price, address, features |
|
|
425
|
+
| `api-reverse-engineer` | Discover → query → monitor workflow |
|
|
426
|
+
| `competitor-monitor` | Track pricing and product changes |
|
|
427
|
+
| `lead-generation` | Extract business contact info |
|
|
428
|
+
| `content-aggregator` | Multi-source content collection |
|
|
429
|
+
| `data-pipeline` | Batch scrape → extract → export workflow |
|
|
430
|
+
|
|
431
|
+
See [`SKILL/`](./SKILL/) for detailed workflow guides and agent integration.
|
|
362
432
|
|
|
363
433
|
---
|
|
364
434
|
|
|
365
435
|
## API Discovery Workflow
|
|
366
436
|
|
|
367
|
-
|
|
437
|
+
Turn any website into an API. No documentation needed.
|
|
368
438
|
|
|
369
439
|
```
|
|
370
440
|
1. discover_apis({ url: "https://weather.com" })
|
|
@@ -373,65 +443,81 @@ This is the workflow that no other MCP server supports. Real results from actual
|
|
|
373
443
|
• mParticle analytics endpoints
|
|
374
444
|
• Taboola content recommendation API
|
|
375
445
|
• OneTrust consent management API
|
|
376
|
-
• DAA/AdChoices opt-out endpoints
|
|
377
446
|
|
|
378
447
|
2. query_api({ url: "https://api.weather.com/v3/...", method: "GET" })
|
|
379
|
-
→ Direct API call, bypasses DOM entirely — 10x faster, structured JSON
|
|
448
|
+
→ Direct API call, bypasses DOM entirely — 10x faster, structured JSON
|
|
380
449
|
|
|
381
450
|
3. monitor_websocket({ url: "https://binance.com/en/trade/BTC_USDT", duration_seconds: 10 })
|
|
382
|
-
→ Captures real-time WebSocket messages —
|
|
451
|
+
→ Captures real-time WebSocket messages — live BTC price feed
|
|
383
452
|
```
|
|
384
453
|
|
|
385
|
-
|
|
454
|
+
---
|
|
455
|
+
|
|
456
|
+
## AI Agent Guide
|
|
457
|
+
|
|
458
|
+
imperium-crawl ships with [`SKILL/`](./SKILL/) — a structured guide that teaches AI agents how to use all 22 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.
|
|
459
|
+
|
|
460
|
+
### Three Ways to Connect
|
|
461
|
+
|
|
462
|
+
| Method | Setup | Works With |
|
|
463
|
+
|--------|-------|-----------|
|
|
464
|
+
| **MCP + SKILL/** | Add as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
|
|
465
|
+
| **CLI + SKILL/** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents |
|
|
466
|
+
| **TUI** | `imperium-crawl tui` — interactive terminal | Direct human use, demos, debugging |
|
|
467
|
+
|
|
468
|
+
### Per-Agent Setup
|
|
469
|
+
|
|
470
|
+
| AI Agent | How to Add SKILL/ |
|
|
471
|
+
|----------|-------------------|
|
|
472
|
+
| **Claude Code** | Copy `SKILL.md` to project root — auto-detected |
|
|
473
|
+
| **Cursor / Windsurf** | Add `SKILL.md` to project rules or system prompt |
|
|
474
|
+
| **OpenClaw / custom agents** | Include SKILL.md in system prompt or context window |
|
|
475
|
+
| **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
|
|
386
476
|
|
|
387
477
|
---
|
|
388
478
|
|
|
389
479
|
## Resilience
|
|
390
480
|
|
|
391
481
|
- **Exponential backoff with full jitter** — AWS-recommended retry pattern, no thundering herd
|
|
392
|
-
- **Per-domain circuit breaker** — 5
|
|
393
|
-
- **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
|
|
394
|
-
- **
|
|
395
|
-
- **Input validation** — all 22 tool schemas enforce strict bounds (URL length, query size, concurrency limits, body size)
|
|
396
|
-
- **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
|
|
397
|
-
- **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
|
|
482
|
+
- **Per-domain circuit breaker** — 5 failures opens circuit for 60s, then half-open probing with auto recovery
|
|
483
|
+
- **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
|
|
484
|
+
- **Proxy support** — single proxy or rotating pool with http/https/socks4/socks5
|
|
398
485
|
- **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
|
|
399
|
-
- **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
|
|
400
|
-
- **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
|
|
401
486
|
- **robots.txt** — respected by default (configurable)
|
|
487
|
+
- **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
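
The "exponential backoff with full jitter" bullet above can be sketched roughly as follows. This is a minimal illustration of the general pattern, not the package's implementation: `fullJitterDelay`, `withRetry`, and the `baseMs`, `capMs`, and `maxAttempts` values are assumptions chosen for the example.

```typescript
// Full-jitter backoff: delay = random(0, min(cap, base * 2^attempt)).
// baseMs, capMs, and both function names are illustrative assumptions,
// not values taken from imperium-crawl.
function fullJitterDelay(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retry an async operation, sleeping a jittered delay between attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // no sleep after the final try
      const delayMs = fullJitterDelay(attempt);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

Because each delay is drawn uniformly from the whole interval, clients that fail at the same moment tend to retry at different times, which is what avoids the thundering herd.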
|
|
402
488
|
|
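
The per-domain circuit breaker described in the Resilience list (5 failures open the circuit for 60s, then half-open probing) might look something like the sketch below. The thresholds come from the README text; the class name, method names, and the injectable clock are hypothetical. As a simplification, the half-open state here admits any request rather than a single probe.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface DomainRecord {
  state: BreakerState;
  failures: number;
  openedAt: number;
}

// Thresholds follow the README bullet (5 failures, 60s). The class and
// method names are hypothetical and not imperium-crawl's API.
class DomainCircuitBreaker {
  private records = new Map<string, DomainRecord>();
  private failureThreshold: number;
  private openMs: number;
  private now: () => number;

  constructor(failureThreshold = 5, openMs = 60_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now;
  }

  private recordFor(domain: string): DomainRecord {
    let record = this.records.get(domain);
    if (!record) {
      record = { state: "closed", failures: 0, openedAt: 0 };
      this.records.set(domain, record);
    }
    return record;
  }

  // Returns true when a request to this domain may go out.
  canRequest(domain: string): boolean {
    const record = this.recordFor(domain);
    if (record.state === "open" && this.now() - record.openedAt >= this.openMs) {
      record.state = "half-open"; // cool-down elapsed: allow probing
    }
    return record.state !== "open";
  }

  // A success (including a successful probe) closes the circuit.
  recordSuccess(domain: string): void {
    const record = this.recordFor(domain);
    record.state = "closed";
    record.failures = 0;
  }

  // A failed probe, or reaching the threshold, (re)opens the circuit.
  recordFailure(domain: string): void {
    const record = this.recordFor(domain);
    record.failures += 1;
    if (record.state === "half-open" || record.failures >= this.failureThreshold) {
      record.state = "open";
      record.openedAt = this.now();
    }
  }
}
```

A caller would check `canRequest(domain)` before each fetch and report the outcome with `recordSuccess` or `recordFailure`, so one failing site does not slow down requests to others.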
|
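
The URL normalization bullet names two concrete steps: removing tracking parameters (`utm_*`, `fbclid`, `gclid`) and sorting query parameters. The sketch below covers only those two, using the standard `URL` and `URLSearchParams` APIs; the remaining steps of the 11-step pipeline are not described in the README and are not guessed at here. `normalizeUrl` and `compare` are hypothetical names.

```typescript
// Click-ID parameters named in the README bullet. The real pipeline has
// more steps than this sketch covers.
const TRACKING_PARAMS = new Set(["fbclid", "gclid"]);

function normalizeUrl(rawUrl: string): string {
  const url = new URL(rawUrl);

  // Collect the parameters to keep, dropping utm_* and click IDs.
  const kept: Array<[string, string]> = [];
  url.searchParams.forEach((value, key) => {
    const lowered = key.toLowerCase();
    if (lowered.startsWith("utm_") || TRACKING_PARAMS.has(lowered)) return;
    kept.push([key, value]);
  });

  // Sort by key, then value, so equivalent URLs compare equal.
  kept.sort(([keyA, valueA], [keyB, valueB]) =>
    keyA === keyB ? compare(valueA, valueB) : compare(keyA, keyB),
  );

  url.search = new URLSearchParams(kept).toString();
  return url.toString();
}

// Plain code-unit ordering, independent of locale.
function compare(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// normalizeUrl("https://example.com/page?utm_source=x&b=2&fbclid=abc&a=1")
// returns "https://example.com/page?a=1&b=2"
```

Normalizing before caching or deduplicating means two links that differ only in tracking parameters or parameter order map to the same key.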
403
489
|
---
|
|
404
490
|
|
|
405
|
-
##
|
|
491
|
+
## Real-World Test Results
|
|
406
492
|
|
|
407
493
|
Every tool tested against production websites with real anti-bot defenses:
|
|
408
494
|
|
|
409
495
|
| Tool | Target | Result |
|
|
410
496
|
|------|--------|--------|
|
|
411
|
-
|
|
|
412
|
-
|
|
|
413
|
-
|
|
|
414
|
-
|
|
|
415
|
-
|
|
|
416
|
-
|
|
|
417
|
-
|
|
|
418
|
-
|
|
|
419
|
-
|
|
|
420
|
-
|
|
|
421
|
-
|
|
|
422
|
-
|
|
|
423
|
-
|
|
|
424
|
-
|
|
|
425
|
-
|
|
|
426
|
-
|
|
|
427
|
-
|
|
|
428
|
-
|
|
|
429
|
-
|
|
|
430
|
-
|
|
|
431
|
-
|
|
|
432
|
-
|
|
|
433
|
-
|
|
434
|
-
>
|
|
497
|
+
| **scrape** | BBC News | Full markdown, stealth level 3 auto-escalation |
|
|
498
|
+
| **crawl** | Cloudflare Blog | 213K characters crawled with depth control |
|
|
499
|
+
| **map** | BBC | Full URL discovery via sitemap + link extraction |
|
|
500
|
+
| **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
|
|
501
|
+
| **readability** | Medium article | Clean — title, author, content, publish date |
|
|
502
|
+
| **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |
|
|
503
|
+
| **search** | Brave Web | Web results with snippets and URLs |
|
|
504
|
+
| **news_search** | Brave News | News results with freshness ranking |
|
|
505
|
+
| **image_search** | Brave Image | Images with thumbnails and source URLs |
|
|
506
|
+
| **video_search** | Brave Video | Video results across platforms |
|
|
507
|
+
| **create_skill** | Hacker News | Auto-detected 30 stories with CSS selectors |
|
|
508
|
+
| **run_skill** | Saved skill | Fresh structured data from saved config |
|
|
509
|
+
| **list_skills** | — | Lists all skills with configurations |
|
|
510
|
+
| **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome, Google Maps key, internal APIs |
|
|
511
|
+
| **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |
|
|
512
|
+
| **monitor_websocket** | Binance BTC/USDT | 3 WebSocket connections, 23 live messages — BTC price live |
|
|
513
|
+
| **ai_extract** | Amazon product | AI extracted name, price, rating, review count |
|
|
514
|
+
| **interact** | Login flow | Click → type → submit — session cookies persisted |
|
|
515
|
+
| **batch_scrape** | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |
|
|
516
|
+
| **list_jobs** | — | Batch jobs with status and progress |
|
|
517
|
+
| **job_status** | Batch job | Full per-URL results with timing |
|
|
518
|
+
| **delete_job** | Completed job | Cleaned up job data from disk |
|
|
519
|
+
|
|
520
|
+
> **22/22 tools. 34 hidden APIs on Airbnb. Live BTC feed. Zero API keys for scraping.**
|
|
435
521
|
|
|
436
522
|
---
|
|
437
523
|
|
|
@@ -441,18 +527,18 @@ Every tool tested against production websites with real anti-bot defenses:
|
|
|
441
527
|
|----------|----------|-------------|
|
|
442
528
|
| `BRAVE_API_KEY` | No | Brave Search API key (enables 4 search tools) |
|
|
443
529
|
| `TWOCAPTCHA_API_KEY` | No | 2Captcha API key (enables auto CAPTCHA solving) |
|
|
530
|
+
| `LLM_API_KEY` | No | Anthropic or OpenAI API key (enables `ai_extract`) |
|
|
531
|
+
| `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
|
|
532
|
+
| `LLM_MODEL` | No | Override default LLM model |
|
|
444
533
|
| `TRANSPORT` | No | `stdio` (default) or `http` |
|
|
445
534
|
| `PORT` | No | HTTP port (default: 3000) |
|
|
446
535
|
| `PROXY_URL` | No | Single proxy URL (http/https/socks4/socks5) |
|
|
447
536
|
| `PROXY_URLS` | No | Comma-separated proxy URLs for rotation |
|
|
448
537
|
| `BROWSER_POOL_SIZE` | No | Max pooled browser instances (default: 3) |
|
|
449
538
|
| `RESPECT_ROBOTS` | No | Respect robots.txt (default: `true`) |
|
|
450
|
-
| `
|
|
451
|
-
| `
|
|
452
|
-
| `
|
|
453
|
-
| `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated browser sessions |
|
|
454
|
-
| `NO_COLOR` | No | Disable colored output (standard convention) |
|
|
455
|
-
| `CI` | No | Auto-detected; disables TTY features (spinners, colors) |
|
|
539
|
+
| `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated sessions |
|
|
540
|
+
| `NO_COLOR` | No | Disable colored output |
|
|
541
|
+
| `CI` | No | Auto-detected; disables TTY features |
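
As a rough illustration of how the defaults in this table combine, the sketch below resolves a few of the variables from an environment map. The defaults (`stdio`, `3000`, `3`, `true`, `anthropic`) are the ones listed above. The `CrawlConfig` shape, the `loadConfig` name, the rule that `PROXY_URLS` takes precedence over `PROXY_URL`, and treating only the literal string `false` as disabling robots.txt are all assumptions made for the example.

```typescript
type Transport = "stdio" | "http";

interface CrawlConfig {
  transport: Transport;
  port: number;
  browserPoolSize: number;
  respectRobots: boolean;
  llmProvider: string;
  proxyUrls: string[];
}

// Defaults mirror the table above. The CrawlConfig shape and loadConfig
// name are hypothetical, not the package's internal types.
function loadConfig(env: Record<string, string | undefined>): CrawlConfig {
  const positiveInt = (raw: string | undefined, fallback: number): number => {
    const parsed = Number.parseInt(raw ?? "", 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
  };

  // Assumption: PROXY_URLS (comma-separated) wins over the single PROXY_URL.
  const proxyUrls = (env.PROXY_URLS ?? env.PROXY_URL ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);

  return {
    transport: env.TRANSPORT === "http" ? "http" : "stdio",
    port: positiveInt(env.PORT, 3000),
    browserPoolSize: positiveInt(env.BROWSER_POOL_SIZE, 3),
    // Assumption: only the literal string "false" turns this off.
    respectRobots: env.RESPECT_ROBOTS !== "false",
    llmProvider: env.LLM_PROVIDER ?? "anthropic",
    proxyUrls,
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`; taking the map as a parameter keeps the function easy to test.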
|
|
456
542
|
|
|
457
543
|
---
|
|
458
544
|
|
|
@@ -463,8 +549,27 @@ git clone https://github.com/ceoimperiumprojects/imperium-crawl
|
|
|
463
549
|
cd imperium-crawl
|
|
464
550
|
npm install
|
|
465
551
|
npm run build
|
|
466
|
-
npm
|
|
467
|
-
npm
|
|
552
|
+
npm run dev # Watch mode (rebuild on changes)
|
|
553
|
+
npm test # 332 tests
|
|
554
|
+
npm start # Start MCP server
|
|
555
|
+
```
|
|
556
|
+
|
|
557
|
+
---
|
|
558
|
+
|
|
559
|
+
## Contributing
|
|
560
|
+
|
|
561
|
+
Contributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.
|
|
562
|
+
|
|
563
|
+
```bash
|
|
564
|
+
# Fork the repo, then:
|
|
565
|
+
git clone https://github.com/YOUR_USERNAME/imperium-crawl
|
|
566
|
+
cd imperium-crawl
|
|
567
|
+
npm install
|
|
568
|
+
git checkout -b my-feature
|
|
569
|
+
# Make changes...
|
|
570
|
+
npm test
|
|
571
|
+
git push origin my-feature
|
|
572
|
+
# Open a PR
|
|
468
573
|
```
|
|
469
574
|
|
|
470
575
|
---
|