imperium-crawl 1.1.5 → 1.1.7
This diff shows the content of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- package/README.md +56 -0
- package/dist/constants.d.ts +1 -1
- package/dist/constants.js +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -23,6 +23,7 @@ While others charge $19+/month for basic scraping, imperium-crawl gives you **mo
 | Priority-based crawling | **Yes** | No | No | No | No |
 | Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
 | URL normalization (11 steps) | **Yes** | No | No | No | No |
+| Adaptive learning (self-improving) | **Yes** | No | No | No | No |
 | Self-hosted | **Yes** | No | N/A | Yes | No |
 | Requires external service | **No** | Yes | No | No | Yes |
 | Total tools | **16** | 5 | 2 | 2 | 4 |
@@ -69,6 +70,7 @@ Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MC
 > |-----|----------------|-----------------|
 > | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
 > | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
+> | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
 
 ### Enable full stealth (Level 3 — headless browser)
 
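For context, the `PROXY_URL` row added above is an environment variable that would typically sit in the MCP server's `env` block next to the API keys. A sketch of a typical MCP client config; the exact top-level shape varies by client, and the `npx` invocation and placeholder values here are assumptions:

```json
{
  "mcpServers": {
    "imperium-crawl": {
      "command": "npx",
      "args": ["-y", "imperium-crawl"],
      "env": {
        "BRAVE_API_KEY": "<your-key>",
        "PROXY_URL": "socks5://user:pass@proxy.example.com:1080"
      }
    }
  }
}
```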
@@ -207,6 +209,59 @@ Once imperium-crawl determines a domain needs Level 3 (browser), it **caches tha
 
 ---
 
+## 🧠 Adaptive Learning Engine
+
+imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — it works automatically in the background.
+
+### How It Works
+
+Every time you scrape a website, the engine records:
+- Which **stealth level** worked (1, 2, or 3)
+- Which **anti-bot system** was detected (Cloudflare, DataDome, etc.)
+- Whether a **proxy** was needed
+- **Response time** and **HTTP status**
+- Whether the request was **blocked or successful**
+
+Next time you hit the same domain, the engine **predicts the optimal configuration** — skipping failed levels and going straight to what works.
+
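The recorded fields listed above map naturally onto a per-request record. A minimal TypeScript sketch, assuming hypothetical names (`RequestObservation` and its fields are illustrative; the package does not publish its internal types):

```typescript
// Hypothetical shape of one recorded observation, based on the bullet list
// above. The real internal structure of imperium-crawl is not published.
interface RequestObservation {
  stealthLevel: 1 | 2 | 3;   // which escalation level handled the request
  antiBotSystem?: string;    // e.g. "cloudflare", "datadome", if detected
  usedProxy: boolean;        // whether a proxy was required
  responseTimeMs: number;    // wall-clock response time
  httpStatus: number;        // final HTTP status code
  blocked: boolean;          // whether the request was blocked
}

const observation: RequestObservation = {
  stealthLevel: 3,
  antiBotSystem: "cloudflare",
  usedProxy: false,
  responseTimeMs: 1840,
  httpStatus: 200,
  blocked: false,
};
```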
+### What It Learns Per Domain
+
+| Data Point | How It's Used |
+|-----------|---------------|
+| Optimal stealth level | Skip straight to the level that works — no wasted escalation |
+| Anti-bot system | Remember which defense the site uses |
+| Proxy requirement | Auto-suggest proxy if requests keep failing without one |
+| Response time | Exponential moving average — adapts to site speed changes |
+| Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
+| Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
+
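The response-time and rate-limit rows in the table above can be sketched in a few lines of TypeScript. The smoothing factor `ALPHA` and the function names are assumptions; only the 30% reduction on a 429 comes from the table:

```typescript
// Smoothing factor for the exponential moving average; the README does not
// specify one, so 0.2 is an assumed value.
const ALPHA = 0.2;

/** Exponential moving average: recent samples weigh more than old ones. */
function updateEma(prevMs: number | undefined, sampleMs: number): number {
  if (prevMs === undefined) return sampleMs; // first observation seeds the average
  return ALPHA * sampleMs + (1 - ALPHA) * prevMs;
}

/** On an HTTP 429 response, reduce the allowed request rate by 30%. */
function onRateLimited(requestsPerMin: number): number {
  return requestsPerMin * 0.7;
}
```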
+### Smart Features
+
+- **Time decay** — Knowledge older than 7 days loses weight, so the engine adapts when sites change defenses
+- **Confidence scoring** — Low data = start from level 1. High confidence = skip directly to optimal level
+- **Auto-prune** — Domains unused for 30 days are automatically cleaned up. Max 2,000 domains stored
+- **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
+- **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
+
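The atomic persistence bullet above (write a tmp file, then rename it into place) can be sketched as follows. `saveAtomically` is a hypothetical name, and the sketch writes to a temp directory rather than `~/.imperium-crawl/knowledge.json`:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write the complete contents to a sibling tmp file, then rename. rename()
// on the same filesystem is atomic on POSIX, so a reader sees either the old
// file or the new one, never a partially written file.
function saveAtomically(filePath: string, data: unknown): void {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data));
  fs.renameSync(tmpPath, filePath);
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "knowledge-"));
const file = path.join(dir, "knowledge.json");
saveAtomically(file, { "example.com": { optimalLevel: 2 } });
```

The tmp file must live on the same filesystem as the target for the rename to be atomic, which is why it is created next to the destination rather than in a system temp directory.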
+### Example
+
+```
+First visit to cloudflare.com:
+Level 1 → blocked ❌
+Level 2 → blocked ❌
+Level 3 → success ✅ (Cloudflare detected)
+→ Engine records: cloudflare.com needs Level 3
+
+Second visit to cloudflare.com:
+→ Engine predicts: Level 3, confidence 85%, Cloudflare
+→ Skips Level 1 and 2 entirely — goes straight to browser
+→ 3x faster than first visit
+```
+
+> **The more you use it, the faster it gets.** Zero configuration. Fully automatic.
+
+---
+
 ## Skills System
 
 Skills let you teach imperium-crawl how to extract data from any website, then re-run it for fresh content whenever you want.
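The time-decay and confidence bullets in the section above can be sketched as below. Treating the 7 days as a half-life and the exact scoring formula are assumptions; the README only says older knowledge loses weight and that low data means starting from level 1:

```typescript
// Assumed interpretation: an observation's weight halves every DECAY_DAYS.
const DECAY_DAYS = 7;

/** Weight of knowledge that is ageDays old. */
function decayWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / DECAY_DAYS);
}

/** Naive confidence score: success ratio, discounted by age of the data. */
function confidence(successes: number, failures: number, ageDays: number): number {
  const total = successes + failures;
  if (total === 0) return 0; // no data: fall back to starting at level 1
  return (successes / total) * decayWeight(ageDays);
}
```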
@@ -265,6 +320,7 @@ Turn any website into an API. No documentation needed.
 - **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
 - **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
 - **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
+- **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
 - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
 - **robots.txt** — respected by default (configurable)
 
package/dist/constants.d.ts
CHANGED
@@ -1,5 +1,5 @@
 export declare const PACKAGE_NAME = "imperium-crawl";
-export declare const PACKAGE_VERSION = "1.1.5";
+export declare const PACKAGE_VERSION = "1.1.7";
 export declare const DEFAULT_TIMEOUT_MS = 30000;
 export declare const DEFAULT_MAX_PAGES = 10;
 export declare const DEFAULT_MAX_DEPTH = 2;

package/dist/constants.js
CHANGED