imperium-crawl 1.1.5 → 1.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -23,6 +23,7 @@ While others charge $19+/month for basic scraping, imperium-crawl gives you **mo
  | Priority-based crawling | **Yes** | No | No | No | No |
  | Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
  | URL normalization (11 steps) | **Yes** | No | No | No | No |
+ | Adaptive learning (self-improving) | **Yes** | No | No | No | No |
  | Self-hosted | **Yes** | No | N/A | Yes | No |
  | Requires external service | **No** | Yes | No | No | Yes |
  | Total tools | **16** | 5 | 2 | 2 | 4 |
@@ -69,6 +70,7 @@ Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MC
  > |-----|----------------|-----------------|
  > | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
  > | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
+ > | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
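For reference, a minimal MCP client config wiring up the new `PROXY_URL` variable might look like this (a sketch: the `npx` invocation, server key, and proxy URL are illustrative assumptions; adjust to your client's config format):

```json
{
  "mcpServers": {
    "imperium-crawl": {
      "command": "npx",
      "args": ["-y", "imperium-crawl"],
      "env": {
        "PROXY_URL": "socks5://user:pass@proxy.example.com:1080"
      }
    }
  }
}
```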
 
  ### Enable full stealth (Level 3 — headless browser)
 
@@ -207,6 +209,59 @@ Once imperium-crawl determines a domain needs Level 3 (browser), it **caches tha
 
  ---
 
+ ## 🧠 Adaptive Learning Engine
+
+ imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — it works automatically in the background.
+
+ ### How It Works
+
+ Every time you scrape a website, the engine records:
+ - Which **stealth level** worked (1, 2, or 3)
+ - Which **anti-bot system** was detected (Cloudflare, DataDome, etc.)
+ - Whether a **proxy** was needed
+ - **Response time** and **HTTP status**
+ - Whether the request was **blocked or successful**
+
+ Next time you hit the same domain, the engine **predicts the optimal configuration** — skipping failed levels and going straight to what works.
+
+ ### What It Learns Per Domain
+
+ | Data Point | How It's Used |
+ |-----------|---------------|
+ | Optimal stealth level | Skip straight to the level that works — no wasted escalation |
+ | Anti-bot system | Remember which defense the site uses |
+ | Proxy requirement | Auto-suggest proxy if requests keep failing without one |
+ | Response time | Exponential moving average — adapts to site speed changes |
+ | Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
+ | Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
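The moving-average and throttling rows above can be sketched roughly like this (a hypothetical illustration: the field names and constants are assumptions, not imperium-crawl's actual internals):

```javascript
// Hypothetical sketch of the per-domain stats described above; field
// names and constants are illustrative, not imperium-crawl's real code.
const EMA_ALPHA = 0.3;       // weight given to the newest sample
const THROTTLE_FACTOR = 0.7; // "reduces rate by 30%" on HTTP 429

function recordResponse(stats, { ms, status }) {
  // Exponential moving average: recent samples dominate, so the
  // estimate adapts when a site's speed changes.
  stats.avgMs = stats.avgMs === undefined
    ? ms
    : EMA_ALPHA * ms + (1 - EMA_ALPHA) * stats.avgMs;

  // Auto-throttle: an HTTP 429 cuts the allowed request rate by 30%.
  if (status === 429) {
    stats.ratePerMin = Math.max(1, stats.ratePerMin * THROTTLE_FACTOR);
  }
  return stats;
}
```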
+
+ ### Smart Features
+
+ - **Time decay** — Knowledge older than 7 days loses weight, so the engine adapts when sites change defenses
+ - **Confidence scoring** — Low data = start from level 1. High confidence = skip directly to optimal level
+ - **Auto-prune** — Domains unused for 30 days are automatically cleaned up. Max 2,000 domains stored
+ - **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
+ - **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
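The atomic-write and debounce patterns named above can be sketched as follows (a simplified illustration under assumed names; not the package's actual implementation):

```javascript
// Simplified sketch of atomic persistence (tmp file + rename) and
// debounced saving; function names here are illustrative assumptions.
import { writeFileSync, renameSync, mkdirSync } from "node:fs";
import { dirname } from "node:path";

function saveKnowledge(path, data) {
  mkdirSync(dirname(path), { recursive: true });
  const tmp = `${path}.tmp`;
  // Write the whole file to a temp path, then rename it into place.
  // rename() is atomic on the same filesystem, so a crash mid-save
  // leaves either the old file or the new one, never a torn write.
  writeFileSync(tmp, JSON.stringify(data));
  renameSync(tmp, path);
}

// Debounce: collapse bursts of updates into one disk write per window.
function debounced(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

const scheduleSave = debounced(saveKnowledge, 30_000); // batch saves ~30s
```

A trailing debounce like this resets on every update; a fixed-interval flush would also match "every 30 seconds", so treat the choice here as illustrative.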
+
+ ### Example
+
+ ```
+ First visit to cloudflare.com:
+   Level 1 → blocked ❌
+   Level 2 → blocked ❌
+   Level 3 → success ✅ (Cloudflare detected)
+   → Engine records: cloudflare.com needs Level 3
+
+ Second visit to cloudflare.com:
+   → Engine predicts: Level 3, confidence 85%, Cloudflare
+   → Skips Level 1 and 2 entirely — goes straight to browser
+   → 3x faster than first visit
+ ```
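The decision flow in that example might be implemented along these lines (a sketch only: the 0.6 threshold, 0.5 decay factor, and field names are assumptions, not the real code):

```javascript
// Hypothetical sketch of prediction with confidence scoring and time
// decay; thresholds and field names are assumptions, not the real code.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function predict(entry, now = Date.now()) {
  if (!entry) return { level: 1, confidence: 0 }; // no data: start at Level 1
  const total = entry.successes + entry.failures;
  let confidence = total > 0 ? entry.successes / total : 0;
  // Time decay: stale knowledge loses weight, so the engine re-probes
  // domains that may have changed their defenses.
  if (now - entry.lastSeen > WEEK_MS) confidence *= 0.5;
  // High confidence: jump straight to the cached optimal level;
  // otherwise fall back to the normal Level 1 escalation.
  return confidence >= 0.6
    ? { level: entry.optimalLevel, confidence }
    : { level: 1, confidence };
}
```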
+
+ > **The more you use it, the faster it gets.** Zero configuration. Fully automatic.
+
+ ---
+
  ## Skills System
 
  Skills let you teach imperium-crawl how to extract data from any website, then re-run it for fresh content whenever you want.
@@ -265,6 +320,7 @@ Turn any website into an API. No documentation needed.
  - **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
  - **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
  - **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
+ - **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
  - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
  - **robots.txt** — respected by default (configurable)
 
@@ -1,5 +1,5 @@
  export declare const PACKAGE_NAME = "imperium-crawl";
- export declare const PACKAGE_VERSION = "1.1.5";
+ export declare const PACKAGE_VERSION = "1.1.7";
  export declare const DEFAULT_TIMEOUT_MS = 30000;
  export declare const DEFAULT_MAX_PAGES = 10;
  export declare const DEFAULT_MAX_DEPTH = 2;
package/dist/constants.js CHANGED
@@ -1,5 +1,5 @@
  export const PACKAGE_NAME = "imperium-crawl";
- export const PACKAGE_VERSION = "1.1.5";
+ export const PACKAGE_VERSION = "1.1.7";
  export const DEFAULT_TIMEOUT_MS = 30_000;
  export const DEFAULT_MAX_PAGES = 10;
  export const DEFAULT_MAX_DEPTH = 2;
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "imperium-crawl",
- "version": "1.1.5",
+ "version": "1.1.7",
  "description": "Open-source MCP server with Firecrawl-like scraping, crawling, search, and custom skills",
  "type": "module",
  "bin": {