imperium-crawl 1.5.2 → 2.0.0
- package/README.md +370 -257
- package/dist/constants.d.ts +2 -1
- package/dist/constants.d.ts.map +1 -1
- package/dist/constants.js +3 -1
- package/dist/constants.js.map +1 -1
- package/dist/network/interceptor.d.ts +19 -0
- package/dist/network/interceptor.d.ts.map +1 -0
- package/dist/network/interceptor.js +82 -0
- package/dist/network/interceptor.js.map +1 -0
- package/dist/network/types.d.ts +27 -0
- package/dist/network/types.d.ts.map +1 -0
- package/dist/network/types.js +2 -0
- package/dist/network/types.js.map +1 -0
- package/dist/security/action-policy.d.ts +26 -0
- package/dist/security/action-policy.d.ts.map +1 -0
- package/dist/security/action-policy.js +136 -0
- package/dist/security/action-policy.js.map +1 -0
- package/dist/security/auth-vault.d.ts +49 -0
- package/dist/security/auth-vault.d.ts.map +1 -0
- package/dist/security/auth-vault.js +133 -0
- package/dist/security/auth-vault.js.map +1 -0
- package/dist/security/domain-filter.d.ts +19 -0
- package/dist/security/domain-filter.d.ts.map +1 -0
- package/dist/security/domain-filter.js +114 -0
- package/dist/security/domain-filter.js.map +1 -0
- package/dist/security/types.d.ts +19 -0
- package/dist/security/types.d.ts.map +1 -0
- package/dist/security/types.js +2 -0
- package/dist/security/types.js.map +1 -0
- package/dist/sessions/encryption.d.ts +37 -0
- package/dist/sessions/encryption.d.ts.map +1 -0
- package/dist/sessions/encryption.js +108 -0
- package/dist/sessions/encryption.js.map +1 -0
- package/dist/sessions/index.d.ts +1 -0
- package/dist/sessions/index.d.ts.map +1 -1
- package/dist/sessions/index.js +1 -0
- package/dist/sessions/index.js.map +1 -1
- package/dist/sessions/manager.d.ts +3 -0
- package/dist/sessions/manager.d.ts.map +1 -1
- package/dist/sessions/manager.js +28 -2
- package/dist/sessions/manager.js.map +1 -1
- package/dist/snapshot/annotator.d.ts +21 -0
- package/dist/snapshot/annotator.d.ts.map +1 -0
- package/dist/snapshot/annotator.js +152 -0
- package/dist/snapshot/annotator.js.map +1 -0
- package/dist/snapshot/boundary.d.ts +7 -0
- package/dist/snapshot/boundary.d.ts.map +1 -0
- package/dist/snapshot/boundary.js +12 -0
- package/dist/snapshot/boundary.js.map +1 -0
- package/dist/snapshot/differ.d.ts +40 -0
- package/dist/snapshot/differ.d.ts.map +1 -0
- package/dist/snapshot/differ.js +194 -0
- package/dist/snapshot/differ.js.map +1 -0
- package/dist/snapshot/extractor.d.ts +27 -0
- package/dist/snapshot/extractor.d.ts.map +1 -0
- package/dist/snapshot/extractor.js +265 -0
- package/dist/snapshot/extractor.js.map +1 -0
- package/dist/snapshot/index.d.ts +8 -0
- package/dist/snapshot/index.d.ts.map +1 -0
- package/dist/snapshot/index.js +6 -0
- package/dist/snapshot/index.js.map +1 -0
- package/dist/snapshot/store.d.ts +28 -0
- package/dist/snapshot/store.d.ts.map +1 -0
- package/dist/snapshot/store.js +65 -0
- package/dist/snapshot/store.js.map +1 -0
- package/dist/snapshot/types.d.ts +42 -0
- package/dist/snapshot/types.d.ts.map +1 -0
- package/dist/snapshot/types.js +2 -0
- package/dist/snapshot/types.js.map +1 -0
- package/dist/tools/index.d.ts.map +1 -1
- package/dist/tools/index.js +2 -0
- package/dist/tools/index.js.map +1 -1
- package/dist/tools/interact.d.ts +194 -5
- package/dist/tools/interact.d.ts.map +1 -1
- package/dist/tools/interact.js +355 -20
- package/dist/tools/interact.js.map +1 -1
- package/dist/tools/snapshot.d.ts +53 -0
- package/dist/tools/snapshot.d.ts.map +1 -0
- package/dist/tools/snapshot.js +160 -0
- package/dist/tools/snapshot.js.map +1 -0
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,213 +1,258 @@
|
|
|
1
|
+
<div align="center">
|
|
2
|
+
|
|
1
3
|
# imperium-crawl
|
|
2
4
|
|
|
3
|
-
The most powerful open-source MCP server for web scraping, crawling, and data extraction
|
|
5
|
+
**The most powerful open-source MCP server for web scraping, crawling, and data extraction.**
|
|
4
6
|
|
|
5
|
-
|
|
7
|
+
23 tools. Zero API keys required. One `npx` command.
|
|
6
8
|
|
|
7
|
-
|
|
9
|
+
[](https://www.npmjs.com/package/imperium-crawl)
|
|
10
|
+
[](./LICENSE)
|
|
11
|
+
[]()
|
|
12
|
+
[](https://www.npmjs.com/package/imperium-crawl)
|
|
8
13
|
|
|
9
|
-
|
|
10
|
-
|---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
|
|
11
|
-
| Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
|
|
12
|
-
| Scraping tools | **6** | 3 | 1 | 1 | 1 |
|
|
13
|
-
| Search tools | **4** | 0 | 0 | 0 | 0 |
|
|
14
|
-
| Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
|
|
15
|
-
| Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
|
|
16
|
-
| TLS fingerprinting | **Yes (JA3/JA4)** | No | No | No | No |
|
|
17
|
-
| CAPTCHA auto-solving | **Yes (2Captcha)** | No | No | No | No |
|
|
18
|
-
| API discovery from network traffic | **Yes** | No | No | No | No |
|
|
19
|
-
| WebSocket monitoring | **Yes** | No | No | No | No |
|
|
20
|
-
| Direct API calls | **Yes** | No | No | No | No |
|
|
21
|
-
| Reusable skills system | **Yes** | No | No | No | No |
|
|
22
|
-
| Structured data extraction (JSON-LD/OG) | **Yes** | Partial | No | No | No |
|
|
23
|
-
| Priority-based crawling | **Yes** | No | No | No | No |
|
|
24
|
-
| Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
|
|
25
|
-
| URL normalization (11 steps) | **Yes** | No | No | No | No |
|
|
26
|
-
| Adaptive learning (self-improving) | **Yes** | No | No | No | No |
|
|
27
|
-
| AI-powered data extraction | **Yes** | No | No | No | No |
|
|
28
|
-
| Browser automation + sessions | **Yes** | No | No | No | No |
|
|
29
|
-
| Batch processing with resume | **Yes** | No | No | No | No |
|
|
30
|
-
| Self-hosted | **Yes** | No | N/A | Yes | No |
|
|
31
|
-
| Requires external service | **No** | Yes | No | No | Yes |
|
|
32
|
-
| Total tools | **22** | 5 | 2 | 2 | 4 |
|
|
33
|
-
|
|
34
|
-
> **TLDR:** More tools, more features, zero cost, no external dependencies. Self-hosted, open-source, and it runs on your machine.
|
|
35
|
-
|
|
36
|
-
## Installation
|
|
14
|
+
</div>
|
|
37
15
|
|
|
38
|
-
|
|
39
|
-
npm install -g imperium-crawl
|
|
40
|
-
```
|
|
41
|
-
|
|
42
|
-
Or run directly without installing:
|
|
16
|
+
---
|
|
43
17
|
|
|
44
|
-
|
|
45
|
-
npx -y imperium-crawl
|
|
46
|
-
```
|
|
18
|
+
## Quick Start
|
|
47
19
|
|
|
48
|
-
|
|
20
|
+
Get running in 30 seconds.
|
|
49
21
|
|
|
50
|
-
|
|
22
|
+
**MCP client** (Claude Code, Cursor, VS Code, Windsurf):
|
|
51
23
|
|
|
52
24
|
```json
|
|
53
25
|
{
|
|
54
26
|
"mcpServers": {
|
|
55
27
|
"imperium-crawl": {
|
|
56
|
-
"type": "stdio",
|
|
57
28
|
"command": "npx",
|
|
58
|
-
"args": ["-y", "imperium-crawl"]
|
|
59
|
-
"env": {
|
|
60
|
-
"BRAVE_API_KEY": "your-brave-api-key",
|
|
61
|
-
"TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
|
|
62
|
-
"LLM_API_KEY": "your-api-key",
|
|
63
|
-
"LLM_PROVIDER": "anthropic",
|
|
64
|
-
"PROXY_URL": "http://user:pass@proxy:8080",
|
|
65
|
-
"PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
|
|
66
|
-
}
|
|
29
|
+
"args": ["-y", "imperium-crawl"]
|
|
67
30
|
}
|
|
68
31
|
}
|
|
69
32
|
}
|
|
70
33
|
```
|
|
71
34
|
|
|
72
|
-
|
|
73
|
-
>
|
|
74
|
-
> | Key | What it unlocks | Where to get it |
|
|
75
|
-
> |-----|----------------|-----------------|
|
|
76
|
-
> | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
|
|
77
|
-
> | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
|
|
78
|
-
> | `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
|
|
79
|
-
> | `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
|
|
80
|
-
> | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
|
|
35
|
+
**CLI** (zero install):
|
|
81
36
|
|
|
82
|
-
|
|
37
|
+
```bash
|
|
38
|
+
npx -y imperium-crawl scrape --url https://example.com
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
**Global install:**
|
|
83
42
|
|
|
84
43
|
```bash
|
|
85
|
-
npm
|
|
86
|
-
npx playwright install chromium
|
|
44
|
+
npm install -g imperium-crawl
|
|
87
45
|
```
|
|
88
46
|
|
|
89
|
-
|
|
47
|
+
> That's it. 17 of 23 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
|
|
90
48
|
|
|
91
|
-
|
|
49
|
+
---
|
|
92
50
|
|
|
93
|
-
|
|
51
|
+
## Power Examples
|
|
94
52
|
|
|
95
|
-
|
|
53
|
+
Real results. Copy-paste and try.
|
|
96
54
|
|
|
97
|
-
|
|
55
|
+
### Scrape through Cloudflare
|
|
98
56
|
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
| **CLI + SKILL.md** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents, anything |
|
|
103
|
-
| **TUI mode** | `imperium-crawl tui` — interactive slash-command terminal | Direct human use, demos, debugging |
|
|
57
|
+
```bash
|
|
58
|
+
imperium-crawl scrape --url https://blog.cloudflare.com
|
|
59
|
+
```
|
|
104
60
|
|
|
105
|
-
|
|
61
|
+
```
|
|
62
|
+
Level 1 (headers) → blocked
|
|
63
|
+
Level 2 (TLS fingerprint) → blocked
|
|
64
|
+
Level 3 (browser + stealth) → success ✅
|
|
65
|
+
→ Full markdown content extracted, 213K characters
|
|
66
|
+
→ Next visit: skips straight to Level 3 (learned)
|
|
67
|
+
```
|
|
106
68
|
|
|
107
|
-
|
|
108
|
-
|----------|-------------------|
|
|
109
|
-
| **Claude Code** | Copy `SKILL.md` to your project root — Claude Code reads it automatically |
|
|
110
|
-
| **Cursor / Windsurf** | Add `SKILL.md` to project rules or include in system prompt |
|
|
111
|
-
| **OpenClaw / custom agents** | Include SKILL.md content in your system prompt or context window |
|
|
112
|
-
| **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
|
|
69
|
+
### Discover hidden APIs on any website
|
|
113
70
|
|
|
114
|
-
|
|
71
|
+
```bash
|
|
72
|
+
imperium-crawl discover-apis --url https://weather.com
|
|
73
|
+
```
|
|
115
74
|
|
|
116
|
-
|
|
75
|
+
```
|
|
76
|
+
Found 11 hidden API endpoints:
|
|
77
|
+
• api.weather.com — main weather API (exposed API key!)
|
|
78
|
+
• mParticle analytics endpoints
|
|
79
|
+
• Taboola content recommendation API
|
|
80
|
+
• OneTrust consent management API
|
|
81
|
+
• DAA/AdChoices opt-out endpoints
|
|
82
|
+
→ Call any endpoint directly with query_api — 10x faster than DOM scraping
|
|
83
|
+
```
|
|
117
84
|
|
|
118
|
-
|
|
85
|
+
### AI extraction in plain English
|
|
119
86
|
|
|
120
87
|
```bash
|
|
121
|
-
|
|
122
|
-
|
|
88
|
+
imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
|
|
89
|
+
--schema "extract product name, price, rating, and review count"
|
|
90
|
+
```
|
|
123
91
|
|
|
124
|
-
|
|
125
|
-
|
|
92
|
+
```json
|
|
93
|
+
{
|
|
94
|
+
"product_name": "Apple AirPods Pro 2",
|
|
95
|
+
"price": "$189.99",
|
|
96
|
+
"rating": "4.7 out of 5",
|
|
97
|
+
"review_count": "45,297"
|
|
98
|
+
}
|
|
99
|
+
```
|
|
126
100
|
|
|
127
|
-
|
|
128
|
-
imperium-crawl extract --url https://news.ycombinator.com --selectors '{"title":".titleline a","score":".score"}' --items-selector ".athing"
|
|
101
|
+
### Batch scrape with resume
|
|
129
102
|
|
|
130
|
-
|
|
131
|
-
imperium-crawl
|
|
103
|
+
```bash
|
|
104
|
+
imperium-crawl batch-scrape \
|
|
105
|
+
--urls '["https://bbc.com","https://cnn.com","https://reuters.com","https://techcrunch.com"]' \
|
|
106
|
+
--concurrency 3
|
|
107
|
+
```
|
|
132
108
|
|
|
133
|
-
|
|
134
|
-
|
|
109
|
+
```
|
|
110
|
+
Scraping 4 URLs (concurrency: 3)...
|
|
111
|
+
✅ bbc.com — 47K chars
|
|
112
|
+
✅ cnn.com — 52K chars
|
|
113
|
+
✅ reuters.com — 38K chars
|
|
114
|
+
✅ techcrunch.com — 61K chars
|
|
115
|
+
→ 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)
|
|
116
|
+
```
|
|
135
117
|
|
|
136
|
-
|
|
137
|
-
imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com","https://site3.com"]' --concurrency 3
|
|
118
|
+
---
|
|
138
119
|
|
|
139
|
-
|
|
140
|
-
imperium-crawl list-jobs
|
|
120
|
+
## Why imperium-crawl?
|
|
141
121
|
|
|
142
|
-
|
|
143
|
-
|
|
122
|
+
🔓 **Zero API Keys Required**
|
|
123
|
+
17 of 23 tools work out of the box. No accounts, no tokens, no credit cards. Just `npx` and go.
|
|
144
124
|
|
|
145
|
-
|
|
146
|
-
|
|
125
|
+
🛡️ **3-Level Auto-Escalating Stealth**
|
|
126
|
+
Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatically escalates until it gets through.
|
|
147
127
|
|
|
148
|
-
|
|
149
|
-
|
|
128
|
+
🧠 **Self-Improving**
|
|
129
|
+
Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
|
|
150
130
|
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
```
|
|
131
|
+
🧰 **23 Tools, 3 Modes**
|
|
132
|
+
MCP server, CLI tool, or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
|
|
154
133
|
|
|
155
|
-
|
|
134
|
+
📜 **10 Built-in Recipes**
|
|
135
|
+
Pre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.
|
|
156
136
|
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
137
|
+
⚡ **Skills System**
|
|
138
|
+
Teach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.
|
|
139
|
+
|
|
140
|
+
---
|
|
160
141
|
|
|
161
|
-
|
|
162
|
-
imperium-crawl extract --url https://example.com --selectors '{"title":"h1"}' --output-format csv
|
|
142
|
+
## vs. The Competition
|
|
163
143
|
|
|
164
|
-
|
|
165
|
-
|
|
144
|
+
| Feature | **imperium-crawl** | Firecrawl MCP | fetch MCP | Crawl4AI MCP | Browserbase MCP |
|
|
145
|
+
|---------|:------------------:|:-------------:|:---------:|:------------:|:---------------:|
|
|
146
|
+
| Price | **Free forever** | $19+/month | Free | Free | $0.01/min |
|
|
147
|
+
| Total tools | **23** | 5 | 2 | 2 | 4 |
|
|
148
|
+
| Stealth levels | **3 (auto-escalate)** | Cloud-based | None | 1 | Cloud-based |
|
|
149
|
+
| Anti-bot detection | **7 systems** | Partial | None | Partial | Partial |
|
|
150
|
+
| TLS fingerprinting | **JA3/JA4** | No | No | No | No |
|
|
151
|
+
| CAPTCHA auto-solving | **Yes** | No | No | No | No |
|
|
152
|
+
| API discovery | **Yes** | No | No | No | No |
|
|
153
|
+
| WebSocket monitoring | **Yes** | No | No | No | No |
|
|
154
|
+
| AI-powered extraction | **Yes** | No | No | No | No |
|
|
155
|
+
| Adaptive learning | **Yes** | No | No | No | No |
|
|
156
|
+
| Batch processing | **Yes** | No | No | No | No |
|
|
157
|
+
| ARIA Snapshots | **Yes** | No | No | No | No |
|
|
158
|
+
| Session Encryption | **Yes** | No | No | No | No |
|
|
159
|
+
| Action Policy | **Yes** | No | No | No | No |
|
|
160
|
+
| Domain Sandboxing | **Yes** | No | No | No | No |
|
|
161
|
+
| Self-hosted | **Yes** | No | N/A | Yes | No |
|
|
162
|
+
| Requires external service | **No** | Yes | No | No | Yes |
|
|
166
163
|
|
|
167
|
-
|
|
168
|
-
imperium-crawl crawl --url https://example.com --output-format jsonl
|
|
164
|
+
---
|
|
169
165
|
|
|
170
|
-
|
|
171
|
-
imperium-crawl scrape --url https://example.com --pretty
|
|
166
|
+
## Stealth Engine
|
|
172
167
|
|
|
173
|
-
|
|
174
|
-
|
|
168
|
+
```
|
|
169
|
+
Request → [L1: Headers + UA rotation]
|
|
170
|
+
│
|
|
171
|
+
├─ success → Done
|
|
172
|
+
↓ fail
|
|
173
|
+
[L2: TLS Fingerprint (JA3/JA4)]
|
|
174
|
+
│
|
|
175
|
+
├─ success → Done
|
|
176
|
+
↓ fail
|
|
177
|
+
[L3: Browser + Fingerprint Injection + CAPTCHA]
|
|
178
|
+
│
|
|
179
|
+
├─ success → Done
|
|
180
|
+
↓
|
|
181
|
+
[Learning Engine records optimal level for next time]
|
|
175
182
|
```
|
|
176
183
|
|
|
177
|
-
###
|
|
184
|
+
### Stealth Levels
|
|
178
185
|
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
186
|
+
| Level | Method | What It Defeats |
|
|
187
|
+
|-------|--------|-----------------|
|
|
188
|
+
| **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
|
|
189
|
+
| **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
|
|
190
|
+
| **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |
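The escalation flow above can be sketched as a small loop. This is only an illustration under stated assumptions: `escalate`, `Attempt`, and `EscalationResult` are hypothetical names rather than imperium-crawl's API, and the attempts are synchronous stubs where real ones would be async network calls.

```typescript
// Illustrative sketch only: names and shapes are assumptions, not imperium-crawl's API.
type Level = 1 | 2 | 3;

// One fetch attempt at a given stealth level; returns true if the page came back unblocked.
// Real attempts would be async network calls; synchronous stubs keep the sketch short.
type Attempt = (url: string) => boolean;

interface EscalationResult {
  level: Level;   // level that succeeded
  tried: Level[]; // levels attempted, in order
}

function escalate(
  url: string,
  attempts: Record<Level, Attempt>,
  startLevel: Level = 1, // a previously learned level can be passed to skip lower ones
): EscalationResult | null {
  const order: Level[] = [1, 2, 3];
  const tried: Level[] = [];
  for (const level of order) {
    if (level < startLevel) continue;
    tried.push(level);
    if (attempts[level](url)) {
      return { level, tried };
    }
  }
  return null; // every level was blocked
}

// Stub attempts that mimic the Cloudflare example: only the browser level gets through.
const stubAttempts: Record<Level, Attempt> = {
  1: () => false,
  2: () => false,
  3: () => true,
};

const firstVisit = escalate("https://example.com", stubAttempts);
const secondVisit = escalate("https://example.com", stubAttempts, 3);
console.log(firstVisit, secondVisit);
```

Passing a start level is what lets a remembered result skip the lower levels on a later visit.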
|
|
182
191
|
|
|
183
|
-
|
|
192
|
+
### Anti-Bot System Detection
|
|
184
193
|
|
|
185
|
-
|
|
194
|
+
Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:
|
|
195
|
+
|
|
196
|
+
| System | Detection Method |
|
|
197
|
+
|--------|-----------------|
|
|
198
|
+
| **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
|
|
199
|
+
| **Akamai** | `_abck`, `bm_sz` cookies |
|
|
200
|
+
| **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
|
|
201
|
+
| **DataDome** | `datadome` cookies, `datadome` response header |
|
|
202
|
+
| **Kasada** | `x-kpsdk-*` headers |
|
|
203
|
+
| **AWS WAF** | `aws-waf-token` cookie |
|
|
204
|
+
| **F5 / Shape Security** | `TS` prefix cookies |
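As a rough illustration of how the signals in this table could be checked, here is a minimal sketch. The cookie and header names are taken literally from the table; the function `detectAntiBot` and its check order are assumptions, and the Cloudflare challenge-page-title signal is left out.

```typescript
// Illustrative sketch: signal names come from the table above; the function itself is hypothetical.
type AntiBotSystem =
  | "Cloudflare"
  | "Akamai"
  | "PerimeterX / HUMAN"
  | "DataDome"
  | "Kasada"
  | "AWS WAF"
  | "F5 / Shape Security";

function detectAntiBot(
  cookieNames: string[],
  headers: Record<string, string>,
): AntiBotSystem | null {
  // Header names are case-insensitive, so compare them in lower case.
  const headerNames = Object.keys(headers).map((h) => h.toLowerCase());
  const hasCookie = (name: string) => cookieNames.includes(name);
  const hasCookiePrefix = (prefix: string) => cookieNames.some((c) => c.startsWith(prefix));
  const hasHeader = (name: string) => headerNames.includes(name);
  const hasHeaderPrefix = (prefix: string) => headerNames.some((h) => h.startsWith(prefix));

  if (hasCookie("cf_clearance") || hasHeader("cf-mitigated")) return "Cloudflare";
  if (hasCookie("_abck") || hasCookie("bm_sz")) return "Akamai";
  if (hasCookiePrefix("_px") || hasHeader("_pxhd")) return "PerimeterX / HUMAN";
  if (hasCookie("datadome") || hasHeader("datadome")) return "DataDome";
  if (hasHeaderPrefix("x-kpsdk-")) return "Kasada";
  if (hasCookie("aws-waf-token")) return "AWS WAF";
  // Checked last because a bare "TS" prefix is the loosest signal in the table.
  if (hasCookiePrefix("TS")) return "F5 / Shape Security";
  return null;
}

console.log(detectAntiBot(["cf_clearance"], {}));
```

Checking the most specific signals first is a design choice of this sketch, so that a loose prefix match cannot shadow an exact one.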
|
|
205
|
+
|
|
206
|
+
### Smart Rendering Cache
|
|
207
|
+
|
|
208
|
+
Once imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
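A one-hour decision cache like the one described here might look as follows. This is a sketch under assumptions: the class name `BrowserDecisionCache` and its methods are invented for illustration, and the clock is passed in explicitly so the expiry logic is easy to follow.

```typescript
// Hypothetical sketch of a per-domain "needs browser" cache with a 1-hour TTL.
const ONE_HOUR_MS = 60 * 60 * 1000;

class BrowserDecisionCache {
  private readonly ttlMs: number;
  private readonly expiresAt = new Map<string, number>();

  constructor(ttlMs: number = ONE_HOUR_MS) {
    this.ttlMs = ttlMs;
  }

  // Record that a domain required Level 3 at time nowMs.
  markNeedsBrowser(domain: string, nowMs: number): void {
    this.expiresAt.set(domain, nowMs + this.ttlMs);
  }

  // True while the cached decision is still fresh.
  needsBrowser(domain: string, nowMs: number): boolean {
    const expiry = this.expiresAt.get(domain);
    if (expiry === undefined) return false;
    if (nowMs >= expiry) {
      this.expiresAt.delete(domain); // stale entry: fall back to normal escalation
      return false;
    }
    return true;
  }
}

const cache = new BrowserDecisionCache();
cache.markNeedsBrowser("example.com", 0);
console.log(cache.needsBrowser("example.com", 59 * 60 * 1000));
```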
|
|
209
|
+
|
|
210
|
+
---
|
|
211
|
+
|
|
212
|
+
## Adaptive Learning Engine
|
|
213
|
+
|
|
214
|
+
imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — fully automatic.
|
|
215
|
+
|
|
216
|
+
Every time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it **predicts the optimal configuration** — skipping failed levels and going straight to what works.
|
|
186
217
|
|
|
187
|
-
```bash
|
|
188
|
-
imperium-crawl --help # List all commands
|
|
189
|
-
imperium-crawl scrape --help # Help for specific tool
|
|
190
|
-
imperium-crawl --version # Show version
|
|
191
218
|
```
|
|
219
|
+
First visit to cloudflare.com:
|
|
220
|
+
Level 1 → blocked ❌
|
|
221
|
+
Level 2 → blocked ❌
|
|
222
|
+
Level 3 → success ✅ (Cloudflare detected)
|
|
223
|
+
→ Engine records: cloudflare.com needs Level 3
|
|
192
224
|
|
|
193
|
-
|
|
225
|
+
Second visit to cloudflare.com:
|
|
226
|
+
→ Engine predicts: Level 3, confidence 85%, Cloudflare
|
|
227
|
+
→ Skips Level 1 and 2 entirely — goes straight to browser
|
|
228
|
+
→ 3x faster than first visit
|
|
229
|
+
```
|
|
230
|
+
|
|
231
|
+
### Smart Features
|
|
232
|
+
|
|
233
|
+
- **Time decay** — Knowledge older than 7 days loses weight, adapts when sites change defenses
|
|
234
|
+
- **Confidence scoring** — Low data = start from level 1. High confidence = skip to optimal level
|
|
235
|
+
- **Auto-prune** — Domains unused for 30 days are cleaned up. Max 2,000 domains stored
|
|
236
|
+
- **Atomic persistence** — Knowledge saved via atomic write (tmp → rename). Never corrupts
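The time-decay and confidence-scoring bullets above can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the halving curve after day 7, the `+1` smoothing term, the `0.6` threshold, and the names `decayWeight` and `predictLevel`. The README states only that knowledge older than 7 days loses weight and that low data means starting from level 1.

```typescript
// Illustrative sketch only; the decay curve and threshold are assumptions.
const DAY_MS = 24 * 60 * 60 * 1000;

type StealthLevel = 1 | 2 | 3;

interface Observation {
  level: StealthLevel;
  success: boolean;
  timestampMs: number;
}

// Full weight for the first 7 days, then halve for every further 7 days (assumed curve).
function decayWeight(ageMs: number): number {
  const ageDays = ageMs / DAY_MS;
  return ageDays <= 7 ? 1 : Math.pow(0.5, (ageDays - 7) / 7);
}

function predictLevel(
  observations: Observation[],
  nowMs: number,
  minConfidence = 0.6, // assumed threshold
): { level: StealthLevel; confidence: number } {
  const levels: StealthLevel[] = [1, 2, 3];
  for (const level of levels) {
    let successWeight = 0;
    let totalWeight = 0;
    for (const o of observations) {
      if (o.level !== level) continue;
      const w = decayWeight(nowMs - o.timestampMs);
      totalWeight += w;
      if (o.success) successWeight += w;
    }
    // The +1 keeps confidence low when there is little (or only stale) data.
    const confidence = successWeight / (totalWeight + 1);
    if (confidence >= minConfidence) return { level, confidence };
  }
  return { level: 1, confidence: 0 }; // not enough evidence: start from the bottom
}

const now = 100 * DAY_MS;
const recent: Observation[] = [];
for (let i = 0; i < 5; i++) {
  recent.push({ level: 1, success: false, timestampMs: now - DAY_MS });
  recent.push({ level: 3, success: true, timestampMs: now - DAY_MS });
}
console.log(predictLevel(recent, now));
```

With five recent Level 3 successes the sketch skips straight to Level 3; with a single observation, or only month-old ones, it falls back to Level 1.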
|
|
237
|
+
|
|
238
|
+
> **The more you use it, the faster it gets.**
|
|
194
239
|
|
|
195
240
|
---
|
|
196
241
|
|
|
197
|
-
##
|
|
242
|
+
## All 23 Tools
|
|
198
243
|
|
|
199
|
-
### Scraping (no API key needed)
|
|
244
|
+
### 📄 Scraping (no API key needed)
|
|
200
245
|
|
|
201
246
|
| Tool | What It Does |
|
|
202
247
|
|------|-------------|
|
|
203
|
-
| **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth.
|
|
204
|
-
| **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring.
|
|
205
|
-
| **map** | Discover all URLs on a domain via sitemap.xml
|
|
248
|
+
| **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |
|
|
249
|
+
| **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |
|
|
250
|
+
| **map** | Discover all URLs on a domain via sitemap.xml + page link extraction. |
|
|
206
251
|
| **extract** | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |
|
|
207
|
-
| **readability** | Mozilla Readability article extraction — title, author, content, publish date.
|
|
252
|
+
| **readability** | Mozilla Readability article extraction — title, author, content, publish date. |
|
|
208
253
|
| **screenshot** | Full-page or viewport PNG screenshots via headless Chromium. |
|
|
209
254
|
|
|
210
|
-
### Search (requires free Brave API key)
|
|
255
|
+
### 🔍 Search (requires free Brave API key)
|
|
211
256
|
|
|
212
257
|
| Tool | What It Does |
|
|
213
258
|
|------|-------------|
|
|
@@ -216,131 +261,146 @@ imperium-crawl --version # Show version
|
|
|
216
261
|
| **image_search** | Image search with thumbnails and source URLs. |
|
|
217
262
|
| **video_search** | Video search across platforms. |
|
|
218
263
|
|
|
219
|
-
### Skills (no API key needed)
|
|
264
|
+
### ⚡ Skills (no API key needed)
|
|
220
265
|
|
|
221
266
|
| Tool | What It Does |
|
|
222
267
|
|------|-------------|
|
|
223
|
-
| **create_skill** | Analyze any page, auto-detect repeating patterns
|
|
224
|
-
| **run_skill** | Run a saved skill
|
|
225
|
-
| **list_skills** | List all saved skills with
|
|
268
|
+
| **create_skill** | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |
|
|
269
|
+
| **run_skill** | Run a saved skill for fresh structured data. Supports pagination. |
|
|
270
|
+
| **list_skills** | List all saved skills with configurations. |
|
|
226
271
|
|
|
227
|
-
### API Discovery & Real-Time (no API key needed, requires Playwright)
|
|
272
|
+
### 🔓 API Discovery & Real-Time (no API key needed, requires Playwright)
|
|
228
273
|
|
|
229
274
|
| Tool | What It Does |
|
|
230
275
|
|------|-------------|
|
|
231
|
-
| **discover_apis** | Navigate to any page, intercept
|
|
232
|
-
| **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering
|
|
233
|
-
| **monitor_websocket** | Capture real-time WebSocket messages
|
|
276
|
+
| **discover_apis** | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |
|
|
277
|
+
| **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |
|
|
278
|
+
| **monitor_websocket** | Capture real-time WebSocket messages — financial tickers, chat feeds, live dashboards. |
|
|
234
279
|
|
|
235
|
-
### AI Extraction (requires LLM API key)
|
|
280
|
+
### 🧠 AI Extraction (requires LLM API key)
|
|
236
281
|
|
|
237
282
|
| Tool | What It Does |
|
|
238
283
|
|------|-------------|
|
|
239
|
-
| **ai_extract** |
|
|
284
|
+
| **ai_extract** | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |
|
|
240
285
|
|
|
241
|
-
### Interaction (no API key needed, requires Playwright)
|
|
286
|
+
### 🖱️ Interaction (no API key needed, requires Playwright)
|
|
242
287
|
|
|
243
288
|
| Tool | What It Does |
|
|
244
289
|
|------|-------------|
|
|
245
|
-
| **interact** | Browser automation with
|
|
290
|
+
| **interact** | Browser automation with 18 action types (including click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, and auth_login). Supports ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, and device emulation. |
|
|
291
|
+
| **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |
|
|
246
292
|
|
|
247
|
-
### Batch Processing (no API key needed)
|
|
293
|
+
### 📦 Batch Processing (no API key needed)
|
|
248
294
|
|
|
249
295
|
| Tool | What It Does |
|
|
250
296
|
|------|-------------|
|
|
251
|
-
| **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure
|
|
252
|
-
| **list_jobs** | List all batch jobs with status
|
|
253
|
-
| **job_status** |
|
|
297
|
+
| **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |
|
|
298
|
+
| **list_jobs** | List all batch jobs with status and progress. |
|
|
299
|
+
| **job_status** | Full results for a specific batch job including per-URL outcomes. |
|
|
254
300
|
| **delete_job** | Clean up completed or failed batch jobs. |
|
|
255
301
|
|
|
256
302
|
---
|
|
257
303
|
|
|
258
|
-
##
|
|
304
|
+
## MCP Setup — Detailed
|
|
259
305
|
|
|
260
|
-
|
|
306
|
+
Full configuration with all optional environment variables:
|
|
261
307
|
|
|
262
|
-
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
308
|
+
```json
|
|
309
|
+
{
|
|
310
|
+
"mcpServers": {
|
|
311
|
+
"imperium-crawl": {
|
|
312
|
+
"command": "npx",
|
|
313
|
+
"args": ["-y", "imperium-crawl"],
|
|
314
|
+
"env": {
|
|
315
|
+
"BRAVE_API_KEY": "your-brave-api-key",
|
|
316
|
+
"TWOCAPTCHA_API_KEY": "your-2captcha-api-key",
|
|
317
|
+
"LLM_API_KEY": "your-api-key",
|
|
318
|
+
"LLM_PROVIDER": "anthropic",
|
|
319
|
+
"SESSION_ENCRYPTION_KEY": "your-64-char-hex-key",
|
|
320
|
+
"PROXY_URL": "http://user:pass@proxy:8080",
|
|
321
|
+
"PROXY_URLS": "http://proxy1:8080,socks5://proxy2:1080"
|
|
322
|
+
}
|
|
323
|
+
}
|
|
324
|
+
}
|
|
325
|
+
}
|
|
326
|
+
```
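The `SESSION_ENCRYPTION_KEY` placeholder above asks for a 64-character hex string, which corresponds to 32 bytes. A key of that shape can be produced with, for example, `openssl rand -hex 32`. The sketch below shows one way such a value could be validated before use; `parseSessionKey` is a hypothetical helper, and this diff does not say which cipher the package uses or how it parses the variable.

```typescript
// Hypothetical helper: checks the 64-hex-character shape and decodes it to 32 bytes.
function parseSessionKey(hex: string): Uint8Array {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("expected exactly 64 hexadecimal characters");
  }
  const bytes = new Uint8Array(32);
  for (let i = 0; i < 32; i++) {
    // Each pair of hex digits encodes one byte.
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

const exampleKey = "00".repeat(31) + "ff"; // placeholder value, not a real secret
const keyBytes = parseSessionKey(exampleKey);
console.log(keyBytes.length);
```

Failing fast on a malformed key is usually preferable to silently truncating or padding it.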
|
|
267
327
|
|
|
268
|
-
###
|
|
328
|
+
### API Keys
|
|
269
329
|
|
|
270
|
-
|
|
330
|
+
| Key | What It Unlocks | Where to Get It |
|
|
331
|
+
|-----|----------------|-----------------|
|
|
332
|
+
| `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
|
|
333
|
+
| `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
|
|
334
|
+
| `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic or OpenAI API key |
|
|
335
|
+
| `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
|
|
336
|
+
| `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
|
|
271
337
|
|
|
272
|
-
|
|
273
|
-
|--------|-----------------|
|
|
274
|
-
| **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |
|
|
275
|
-
| **Akamai** | `_abck`, `bm_sz` cookies |
|
|
276
|
-
| **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |
|
|
277
|
-
| **DataDome** | `datadome` cookies, `datadome` response header |
|
|
278
|
-
| **Kasada** | `x-kpsdk-*` headers |
|
|
279
|
-
| **AWS WAF** | `aws-waf-token` cookie |
|
|
280
|
-
| **F5 / Shape Security** | `TS` prefix cookies |
|
|
338
|
+
### Enable Full Stealth (Level 3)
|
|
281
339
|
|
|
282
|
-
|
|
340
|
+
```bash
|
|
341
|
+
npm i rebrowser-playwright
|
|
342
|
+
npx playwright install chromium
|
|
343
|
+
```
|
|
283
344
|
|
|
284
|
-
|
|
345
|
+
### Per-Client Notes
|
|
285
346
|
|
|
286
|
-
|
|
347
|
+
| Client | Config Location |
|
|
348
|
+
|--------|----------------|
|
|
349
|
+
| **Claude Code** | `.mcp.json` in project root or `~/.claude/settings.json` global |
|
|
350
|
+
| **Cursor** | Settings → MCP Servers |
|
|
351
|
+
| **VS Code** | `.vscode/mcp.json` or user settings |
|
|
352
|
+
| **Windsurf** | `~/.codeium/windsurf/mcp_config.json` |
|
|
287
353
|
|
|
288
|
-
|
|
354
|
+
---
|
|
289
355
|
|
|
290
|
-
|
|
356
|
+
## CLI Mode
|
|
291
357
|
|
|
292
|
-
|
|
358
|
+
**No arguments** = starts as MCP server. **With subcommand** = runs as CLI tool. **`tui`** = interactive terminal.
|
|
293
359
|
|
|
294
|
-
|
|
295
|
-
|
|
296
|
-
-
|
|
297
|
-
- Whether a **proxy** was needed
|
|
298
|
-
- **Response time** and **HTTP status**
|
|
299
|
-
- Whether the request was **blocked or successful**
|
|
360
|
+
```bash
|
|
361
|
+
# Scrape a website to markdown
|
|
362
|
+
imperium-crawl scrape --url https://bbc.com/news
|
|
300
363
|
|
|
301
|
-
|
|
364
|
+
# Crawl with depth control
|
|
365
|
+
imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5
|
|
302
366
|
|
|
303
|
-
|
|
367
|
+
# AI-powered extraction — plain English
|
|
368
|
+
imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
|
|
369
|
+
--schema "extract product name, price, rating, and review count"
|
|
304
370
|
|
|
305
|
-
|
|
306
|
-
|
|
307
|
-
| Optimal stealth level | Skip straight to the level that works — no wasted escalation |
|
|
308
|
-
| Anti-bot system | Remember which defense the site uses |
|
|
309
|
-
| Proxy requirement | Auto-suggest proxy if requests keep failing without one |
|
|
310
|
-
| Response time | Exponential moving average — adapts to site speed changes |
|
|
311
|
-
| Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
|
|
312
|
-
| Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
|
|
371
|
+
# Discover hidden APIs
|
|
372
|
+
imperium-crawl discover-apis --url https://weather.com
|
|
313
373
|
|
|
314
|
-
|
|
374
|
+
# Batch scrape in parallel
|
|
375
|
+
imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com"]' --concurrency 3
|
|
315
376
|
|
|
316
|
-
|
|
317
|
-
-
|
|
318
|
-
|
|
319
|
-
- **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
|
|
320
|
-
- **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
|
|
377
|
+
# Interactive setup wizard
|
|
378
|
+
imperium-crawl setup
|
|
379
|
+
```
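The mode selection described at the top of this section (no arguments, a subcommand, or `tui`) amounts to a small dispatch on the argument list. A minimal sketch, assuming the arguments are those that follow the program name; `resolveMode` and `RunMode` are illustrative names, not the package's own code:

```typescript
// Illustrative dispatch: how the three run modes could be told apart.
type RunMode = "mcp-server" | "tui" | "cli";

function resolveMode(args: string[]): RunMode {
  if (args.length === 0) return "mcp-server"; // no arguments: start as MCP server
  if (args[0] === "tui") return "tui";        // interactive terminal
  return "cli";                               // any other subcommand or flag runs as CLI
}

console.log(resolveMode([]), resolveMode(["tui"]), resolveMode(["scrape", "--url", "https://example.com"]));
```

In this sketch, flags such as `--help` also count as CLI use, since they are not the `tui` subcommand.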
|
|
321
380
|
|
|
322
|
-
###
|
|
381
|
+
### Output Formats
|
|
323
382
|
|
|
383
|
+
```bash
|
|
384
|
+
imperium-crawl scrape --url https://example.com # JSON (default)
|
|
385
|
+
imperium-crawl scrape --url https://example.com --output-format markdown # Markdown
|
|
386
|
+
imperium-crawl scrape --url https://example.com --output-format csv # CSV
|
|
387
|
+
imperium-crawl scrape --url https://example.com --pretty # Pretty JSON
|
|
388
|
+
imperium-crawl scrape --url https://example.com --output result.json # Write to file
|
|
324
389
|
```
|
|
325
|
-
First visit to cloudflare.com:
|
|
326
|
-
Level 1 → blocked ❌
|
|
327
|
-
Level 2 → blocked ❌
|
|
328
|
-
Level 3 → success ✅ (Cloudflare detected)
|
|
329
|
-
→ Engine records: cloudflare.com needs Level 3
|
|
330
390
|
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
334
|
-
|
|
391
|
+
### TUI Mode
|
|
392
|
+
|
|
393
|
+
```bash
|
|
394
|
+
imperium-crawl tui
|
|
335
395
|
```
|
|
336
396
|
|
|
337
|
-
|
|
397
|
+
Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.
|
|
338
398
|
|
|
339
399
|
---
|
|
340
400
|
|
|
341
|
-
## Skills
|
|
401
|
+
## Skills & Recipes
|
|
342
402
|
|
|
343
|
-
Skills let you teach imperium-crawl how to extract data from any website, then re-run
|
|
403
|
+
Skills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.
|
|
344
404
|
|
|
345
405
|
**Create a skill:**
|
|
346
406
|
```
|
|
@@ -351,20 +411,36 @@ create_skill({
|
|
|
351
411
|
})
|
|
352
412
|
```
|
|
353
413
|
|
|
354
|
-
The tool analyzes the page, auto-detects repeating elements (articles, products, listings), generates CSS selectors for each field, and saves the skill config.
|
|
355
|
-
|
|
356
414
|
**Run a skill:**
|
|
357
415
|
```
|
|
358
416
|
run_skill({ name: "tc-ai-news" })
|
|
417
|
+
→ Returns fresh structured data with all detected fields
|
|
359
418
|
```
|
|
360
419
|
|
|
361
|
-
|
|
420
|
+
Skills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, portable.
|
|
421
|
+
|
|
422
|
+
### Built-in Recipes
|
|
423
|
+
|
|
424
|
+
| Recipe | What It Does |
|
|
425
|
+
|--------|-------------|
|
|
426
|
+
| `news-extraction` | Extract article title, author, date, content from news sites |
|
|
427
|
+
| `ecommerce-scrape` | Product name, price, rating, reviews, images |
|
|
428
|
+
| `social-media` | Posts, engagement metrics, user profiles |
|
|
429
|
+
| `job-listings` | Title, company, salary, location, description |
|
|
430
|
+
| `real-estate` | Property listings with price, address, features |
|
|
431
|
+
| `api-reverse-engineer` | Discover → query → monitor workflow |
|
|
432
|
+
| `competitor-monitor` | Track pricing and product changes |
|
|
433
|
+
| `lead-generation` | Extract business contact info |
|
|
434
|
+
| `content-aggregator` | Multi-source content collection |
|
|
435
|
+
| `data-pipeline` | Batch scrape → extract → export workflow |
|
|
436
|
+
|
|
437
|
+
See [`SKILL/`](./SKILL/) for detailed workflow guides and agent integration.
|
|
362
438
|
|
|
363
439
|
---
|
|
364
440
|
|
|
365
441
|
## API Discovery Workflow
|
|
366
442
|
|
|
367
|
-
|
|
443
|
+
Turn any website into an API. No documentation needed.
|
|
368
444
|
|
|
369
445
|
```
|
|
370
446
|
1. discover_apis({ url: "https://weather.com" })
|
|
@@ -373,65 +449,82 @@ This is the workflow that no other MCP server supports. Real results from actual
|
|
|
373
449
|
• mParticle analytics endpoints
|
|
374
450
|
• Taboola content recommendation API
|
|
375
451
|
• OneTrust consent management API
|
|
376
|
-
• DAA/AdChoices opt-out endpoints
|
|
377
452
|
|
|
378
453
|
2. query_api({ url: "https://api.weather.com/v3/...", method: "GET" })
|
|
379
|
-
→ Direct API call, bypasses DOM entirely — 10x faster, structured JSON
|
|
454
|
+
→ Direct API call, bypasses DOM entirely — 10x faster, structured JSON
|
|
380
455
|
|
|
381
456
|
3. monitor_websocket({ url: "https://binance.com/en/trade/BTC_USDT", duration_seconds: 10 })
|
|
382
|
-
→ Captures real-time WebSocket messages —
|
|
457
|
+
→ Captures real-time WebSocket messages — live BTC price feed
|
|
383
458
|
```
|
|
384
459
|
|
|
385
|
-
|
|
460
|
+
---
|
|
461
|
+
|
|
462
|
+
## AI Agent Guide
|
|
463
|
+
|
|
464
|
+
imperium-crawl ships with [`SKILL/`](./SKILL/) — a structured guide that teaches AI agents how to use all 23 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.
|
|
465
|
+
|
|
466
|
+
### Three Ways to Connect
|
|
467
|
+
|
|
468
|
+
| Method | Setup | Works With |
|
|
469
|
+
|--------|-------|-----------|
|
|
470
|
+
| **MCP + SKILL/** | Add as MCP server + SKILL.md in agent context | Claude Code, Cursor, Windsurf, any MCP client |
|
|
471
|
+
| **CLI + SKILL/** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — OpenClaw, ChatGPT, GPT agents, custom agents |
|
|
472
|
+
| **TUI** | `imperium-crawl tui` — interactive terminal | Direct human use, demos, debugging |
|
|
473
|
+
|
|
474
|
+
### Per-Agent Setup
|
|
475
|
+
|
|
476
|
+
| AI Agent | How to Add SKILL/ |
|
|
477
|
+
|----------|-------------------|
|
|
478
|
+
| **Claude Code** | Copy `SKILL.md` to project root — auto-detected |
|
|
479
|
+
| **Cursor / Windsurf** | Add `SKILL.md` to project rules or system prompt |
|
|
480
|
+
| **OpenClaw / custom agents** | Include SKILL.md in system prompt or context window |
|
|
481
|
+
| **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |
|
|
386
482
|
|
|
387
483
|
---
|
|
388
484
|
|
|
389
485
|
## Resilience
|
|
390
486
|
|
|
391
487
|
- **Exponential backoff with full jitter** — AWS-recommended retry pattern, no thundering herd
|
|
392
|
-
- **Per-domain circuit breaker** — 5
|
|
393
|
-
- **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
|
|
394
|
-
- **
|
|
395
|
-
- **Input validation** — all 22 tool schemas enforce strict bounds (URL length, query size, concurrency limits, body size)
|
|
396
|
-
- **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
|
|
397
|
-
- **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
|
|
488
|
+
- **Per-domain circuit breaker** — 5 failures open the circuit for 60s, then half-open probing with automatic recovery
|
|
489
|
+
- **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
|
|
490
|
+
- **Proxy support** — single proxy or rotating pool with http/https/socks4/socks5
|
|
398
491
|
- **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
|
|
399
|
-
- **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
|
|
400
|
-
- **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
|
|
401
492
|
- **robots.txt** — respected by default (configurable)
|
|
493
|
+
- **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
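
As a rough illustration of the first two bullets in this list, the sketch below shows full-jitter backoff and a per-domain circuit breaker in TypeScript. It is not taken from the imperium-crawl source: `fullJitterDelay`, `DomainCircuitBreaker`, the base delay, and the delay cap are hypothetical names and values. Only the 5-failure threshold and the 60s cooldown come from the README text.

```typescript
// Full jitter: wait a uniformly random time in [0, min(cap, base * 2^attempt)).
// baseMs and capMs are illustrative defaults, not values used by imperium-crawl.
function fullJitterDelay(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

type CircuitState = "closed" | "open" | "half-open";

// Hypothetical per-domain breaker: `threshold` failures open the circuit,
// and after `cooldownMs` a probe request is allowed again (half-open).
class DomainCircuitBreaker {
  private readonly failures = new Map<string, number>();
  private readonly openedAt = new Map<string, number>();
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, cooldownMs = 60_000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(domain: string): CircuitState {
    const opened = this.openedAt.get(domain);
    if (opened === undefined) return "closed";
    return this.now() - opened >= this.cooldownMs ? "half-open" : "open";
  }

  canRequest(domain: string): boolean {
    return this.state(domain) !== "open";
  }

  // A successful request (including a half-open probe) closes the circuit.
  recordSuccess(domain: string): void {
    this.failures.delete(domain);
    this.openedAt.delete(domain);
  }

  recordFailure(domain: string): void {
    if (this.state(domain) === "half-open") {
      // The probe failed: reopen the circuit for another cooldown period.
      this.openedAt.set(domain, this.now());
      return;
    }
    const count = (this.failures.get(domain) ?? 0) + 1;
    this.failures.set(domain, count);
    if (count >= this.threshold) this.openedAt.set(domain, this.now());
  }
}
```

A caller would check `canRequest(domain)` before each attempt and sleep for `fullJitterDelay(attempt)` milliseconds between retries. The clock and random source are injectable only to make the sketch easy to test.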
|
|
402
494
|
|
|
403
495
|
---
|
|
404
496
|
|
|
405
|
-
##
|
|
497
|
+
## Real-World Test Results
|
|
406
498
|
|
|
407
499
|
Every tool tested against production websites with real anti-bot defenses:
|
|
408
500
|
|
|
409
501
|
| Tool | Target | Result |
|
|
410
502
|
|------|--------|--------|
|
|
411
|
-
| 📄 **scrape** | BBC News | Full markdown
|
|
412
|
-
| 🕸️ **crawl** | Cloudflare Blog |
|
|
413
|
-
| 🗺️ **map** | BBC | Full URL discovery via sitemap +
|
|
503
|
+
| 📄 **scrape** | BBC News | Full markdown, stealth level 3 auto-escalation |
|
|
504
|
+
| 🕸️ **crawl** | Cloudflare Blog | 213K characters crawled with depth control |
|
|
505
|
+
| 🗺️ **map** | BBC | Full URL discovery via sitemap + link extraction |
|
|
414
506
|
| 🕷️ **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
|
|
415
|
-
| 📖 **readability** | Medium article | Clean
|
|
507
|
+
| 📖 **readability** | Medium article | Clean — title, author, content, publish date |
|
|
416
508
|
| 📸 **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |
|
|
417
|
-
| 🔍 **search** | Brave Web
|
|
418
|
-
| 📰 **news_search** | Brave News
|
|
419
|
-
| 🖼️ **image_search** | Brave Image
|
|
420
|
-
| 🎬 **video_search** | Brave Video
|
|
421
|
-
| 🛠️ **create_skill** | Hacker News | Auto-detected 30
|
|
422
|
-
| ▶️ **run_skill** | Saved skill | Fresh structured data from saved
|
|
423
|
-
| 📋 **list_skills** | — | Lists all
|
|
424
|
-
| 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome
|
|
509
|
+
| 🔍 **search** | Brave Web | Web results with snippets and URLs |
|
|
510
|
+
| 📰 **news_search** | Brave News | News results with freshness ranking |
|
|
511
|
+
| 🖼️ **image_search** | Brave Image | Images with thumbnails and source URLs |
|
|
512
|
+
| 🎬 **video_search** | Brave Video | Video results across platforms |
|
|
513
|
+
| 🛠️ **create_skill** | Hacker News | Auto-detected 30 stories with CSS selectors |
|
|
514
|
+
| ▶️ **run_skill** | Saved skill | Fresh structured data from saved config |
|
|
515
|
+
| 📋 **list_skills** | — | Lists all skills with configurations |
|
|
516
|
+
| 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome, Google Maps key, internal APIs |
|
|
425
517
|
| ⚡ **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |
|
|
426
|
-
| 📡 **monitor_websocket** | Binance BTC/USDT |
|
|
427
|
-
| 🧠 **ai_extract** | Amazon product
|
|
428
|
-
|
|
|
429
|
-
|
|
|
430
|
-
|
|
|
431
|
-
|
|
|
518
|
+
| 📡 **monitor_websocket** | Binance BTC/USDT | 3 WebSocket connections, 23 live messages — BTC price live |
|
|
519
|
+
| 🧠 **ai_extract** | Amazon product | AI extracted name, price, rating, review count |
|
|
520
|
+
| 🎯 **snapshot** | GitHub, Wikipedia | ARIA tree with 107/113 refs, annotated screenshots |
|
|
521
|
+
| 🖱️ **interact** | Login flow | Click → type → submit — ref targeting, session encryption, 18 action types |
|
|
522
|
+
| 📦 **batch_scrape** | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |
|
|
523
|
+
| 📋 **list_jobs** | — | Batch jobs with status and progress |
|
|
524
|
+
| 📊 **job_status** | Batch job | Full per-URL results with timing |
|
|
432
525
|
| 🗑️ **delete_job** | Completed job | Cleaned up job data from disk |
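
The `batch_scrape` row above mentions parallel execution with a concurrency limit and soft failure. The TypeScript sketch below shows one way such behaviour can be structured. It is an assumption-laden illustration, not the package's implementation: `batchWithLimit`, `BatchResult`, and the `task` callback are hypothetical names.

```typescript
// Hypothetical result shape: one entry per URL, success or failure.
type BatchResult<T> =
  | { url: string; ok: true; value: T }
  | { url: string; ok: false; error: string };

// Runs `task` over all URLs with at most `concurrency` in flight.
// A failing URL is recorded in the results instead of aborting the batch.
async function batchWithLimit<T>(
  urls: string[],
  concurrency: number,
  task: (url: string) => Promise<T>,
): Promise<BatchResult<T>[]> {
  const results: BatchResult<T>[] = new Array(urls.length);
  let next = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index until the list is exhausted.
    while (next < urls.length) {
      const i = next++;
      const url = urls[i];
      try {
        results[i] = { url, ok: true, value: await task(url) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[i] = { url, ok: false, error: message };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With ten URLs and `concurrency` set to 3, a single failing URL would yield nine `ok: true` entries and one `ok: false` entry, in the same order as the input.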
|
|
433
526
|
|
|
434
|
-
>
|
|
527
|
+
> **23/23 tools. 34 hidden APIs on Airbnb. Live BTC feed. Zero API keys for scraping.**
|
|
435
528
|
|
|
436
529
|
---
|
|
437
530
|
|
|
@@ -441,18 +534,19 @@ Every tool tested against production websites with real anti-bot defenses:
|
|
|
441
534
|
|----------|----------|-------------|
|
|
442
535
|
| `BRAVE_API_KEY` | No | Brave Search API key (enables 4 search tools) |
|
|
443
536
|
| `TWOCAPTCHA_API_KEY` | No | 2Captcha API key (enables auto CAPTCHA solving) |
|
|
537
|
+
| `LLM_API_KEY` | No | API key for the provider set in `LLM_PROVIDER` (enables `ai_extract`) |
|
|
538
|
+
| `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`) |
|
|
539
|
+
| `LLM_MODEL` | No | Override default LLM model |
|
|
540
|
+
| `SESSION_ENCRYPTION_KEY` | No | 32-byte hex key for encrypting session files at rest |
|
|
444
541
|
| `TRANSPORT` | No | `stdio` (default) or `http` |
|
|
445
542
|
| `PORT` | No | HTTP port (default: 3000) |
|
|
446
543
|
| `PROXY_URL` | No | Single proxy URL (http/https/socks4/socks5) |
|
|
447
544
|
| `PROXY_URLS` | No | Comma-separated proxy URLs for rotation |
|
|
448
545
|
| `BROWSER_POOL_SIZE` | No | Max pooled browser instances (default: 3) |
|
|
449
546
|
| `RESPECT_ROBOTS` | No | Respect robots.txt (default: `true`) |
|
|
450
|
-
| `
|
|
451
|
-
| `
|
|
452
|
-
| `
|
|
453
|
-
| `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated browser sessions |
|
|
454
|
-
| `NO_COLOR` | No | Disable colored output (standard convention) |
|
|
455
|
-
| `CI` | No | Auto-detected; disables TTY features (spinners, colors) |
|
|
547
|
+
| `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated sessions |
|
|
548
|
+
| `NO_COLOR` | No | Disable colored output |
|
|
549
|
+
| `CI` | No | Auto-detected; disables TTY features |
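
For `SESSION_ENCRYPTION_KEY`, a key can be generated with Node's built-in `crypto` module. The sketch below assumes that "32-byte hex key" means 32 random bytes written as 64 hexadecimal characters; the README does not spell out the exact format. `generateSessionKey` and `isValidSessionKey` are hypothetical helper names, not part of imperium-crawl.

```typescript
import { randomBytes } from "node:crypto";

// Assumption: the key is 32 random bytes encoded as 64 hex characters.
function generateSessionKey(): string {
  return randomBytes(32).toString("hex");
}

// Hypothetical sanity check for a key supplied via the environment.
function isValidSessionKey(key: string): boolean {
  return /^[0-9a-f]{64}$/i.test(key);
}

// Print a line that can be pasted into a shell profile or .env file.
console.log(`SESSION_ENCRYPTION_KEY=${generateSessionKey()}`);
```

Treat the generated value like any other secret and keep it out of version control.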
|
|
456
550
|
|
|
457
551
|
---
|
|
458
552
|
|
|
@@ -463,8 +557,27 @@ git clone https://github.com/ceoimperiumprojects/imperium-crawl
|
|
|
463
557
|
cd imperium-crawl
|
|
464
558
|
npm install
|
|
465
559
|
npm run build
|
|
466
|
-
npm
|
|
467
|
-
npm
|
|
560
|
+
npm run dev # Watch mode (rebuild on changes)
|
|
561
|
+
npm test # 332 tests
|
|
562
|
+
npm start # Start MCP server
|
|
563
|
+
```
|
|
564
|
+
|
|
565
|
+
---
|
|
566
|
+
|
|
567
|
+
## Contributing
|
|
568
|
+
|
|
569
|
+
Contributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.
|
|
570
|
+
|
|
571
|
+
```bash
|
|
572
|
+
# Fork the repo, then:
|
|
573
|
+
git clone https://github.com/YOUR_USERNAME/imperium-crawl
|
|
574
|
+
cd imperium-crawl
|
|
575
|
+
npm install
|
|
576
|
+
git checkout -b my-feature
|
|
577
|
+
# Make changes and commit them...
|
|
578
|
+
npm test
|
|
579
|
+
git push origin my-feature
|
|
580
|
+
# Open a PR
|
|
468
581
|
```
|
|
469
582
|
|
|
470
583
|
---
|