imperium-crawl 1.1.5 → 1.1.7
This diff shows the content of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- package/README.md +56 -0
- package/dist/constants.d.ts +1 -1
- package/dist/constants.js +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -23,6 +23,7 @@ While others charge $19+/month for basic scraping, imperium-crawl gives you **mo
 | Priority-based crawling | **Yes** | No | No | No | No |
 | Circuit breaker + jitter backoff | **Yes** | No | No | No | No |
 | URL normalization (11 steps) | **Yes** | No | No | No | No |
+| Adaptive learning (self-improving) | **Yes** | No | No | No | No |
 | Self-hosted | **Yes** | No | N/A | Yes | No |
 | Requires external service | **No** | Yes | No | No | Yes |
 | Total tools | **16** | 5 | 2 | 2 | 4 |
@@ -69,6 +70,7 @@ Add to your MCP client config (Claude Code, Cursor, VS Code, Windsurf, or any MC
 > |-----|----------------|-----------------|
 > | `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |
 > | `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |
+> | `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |
 
 ### Enable full stealth (Level 3 — headless browser)
 
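For context, the `PROXY_URL` row added above is an environment variable that would typically sit in the MCP server's `env` block next to the API keys. A sketch of a typical MCP client config; the exact top-level shape varies by client, and the `npx` invocation and placeholder values here are assumptions:

```json
{
  "mcpServers": {
    "imperium-crawl": {
      "command": "npx",
      "args": ["-y", "imperium-crawl"],
      "env": {
        "BRAVE_API_KEY": "<your-key>",
        "PROXY_URL": "socks5://user:pass@proxy.example.com:1080"
      }
    }
  }
}
```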
@@ -207,6 +209,59 @@ Once imperium-crawl determines a domain needs Level 3 (browser), it **caches tha
 
 ---
 
+## 🧠 Adaptive Learning Engine
+
+imperium-crawl **learns from every request** and gets smarter over time. No configuration needed — it works automatically in the background.
+
+### How It Works
+
+Every time you scrape a website, the engine records:
+- Which **stealth level** worked (1, 2, or 3)
+- Which **anti-bot system** was detected (Cloudflare, DataDome, etc.)
+- Whether a **proxy** was needed
+- **Response time** and **HTTP status**
+- Whether the request was **blocked or successful**
+
+Next time you hit the same domain, the engine **predicts the optimal configuration** — skipping failed levels and going straight to what works.
+
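The recorded fields listed above map naturally onto a per-request record. A minimal TypeScript sketch, assuming hypothetical names (`RequestObservation` and its fields are illustrative; the package does not publish its internal types):

```typescript
// Hypothetical shape of one recorded observation, based on the bullet list
// above. The real internal structure of imperium-crawl is not published.
interface RequestObservation {
  stealthLevel: 1 | 2 | 3;   // which escalation level handled the request
  antiBotSystem?: string;    // e.g. "cloudflare", "datadome", if detected
  usedProxy: boolean;        // whether a proxy was required
  responseTimeMs: number;    // wall-clock response time
  httpStatus: number;        // final HTTP status code
  blocked: boolean;          // whether the request was blocked
}

const observation: RequestObservation = {
  stealthLevel: 3,
  antiBotSystem: "cloudflare",
  usedProxy: false,
  responseTimeMs: 1840,
  httpStatus: 200,
  blocked: false,
};
```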
+### What It Learns Per Domain
+
+| Data Point | How It's Used |
+|-----------|---------------|
+| Optimal stealth level | Skip straight to the level that works — no wasted escalation |
+| Anti-bot system | Remember which defense the site uses |
+| Proxy requirement | Auto-suggest proxy if requests keep failing without one |
+| Response time | Exponential moving average — adapts to site speed changes |
+| Rate limit | Auto-throttles on 429 responses (reduces rate by 30%) |
+| Success/fail ratio | Confidence scoring — high confidence = use cached strategy |
+
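The response-time and rate-limit rows in the table above can be sketched in a few lines of TypeScript. The smoothing factor `ALPHA` and the function names are assumptions; only the 30% reduction on a 429 comes from the table:

```typescript
// Smoothing factor for the exponential moving average; the README does not
// specify one, so 0.2 is an assumed value.
const ALPHA = 0.2;

/** Exponential moving average: recent samples weigh more than old ones. */
function updateEma(prevMs: number | undefined, sampleMs: number): number {
  if (prevMs === undefined) return sampleMs; // first observation seeds the average
  return ALPHA * sampleMs + (1 - ALPHA) * prevMs;
}

/** On an HTTP 429 response, reduce the allowed request rate by 30%. */
function onRateLimited(requestsPerMin: number): number {
  return requestsPerMin * 0.7;
}
```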
+### Smart Features
+
+- **Time decay** — Knowledge older than 7 days loses weight, so the engine adapts when sites change defenses
+- **Confidence scoring** — Low data = start from level 1. High confidence = skip directly to optimal level
+- **Auto-prune** — Domains unused for 30 days are automatically cleaned up. Max 2,000 domains stored
+- **Atomic persistence** — Knowledge saved to `~/.imperium-crawl/knowledge.json` via atomic write (tmp → rename). Never corrupts
+- **Debounced writes** — Batches saves every 30 seconds to avoid disk thrashing
+
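The atomic persistence bullet above (write a tmp file, then rename it into place) can be sketched as follows. `saveAtomically` is a hypothetical name, and the sketch writes to a temp directory rather than `~/.imperium-crawl/knowledge.json`:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write the complete contents to a sibling tmp file, then rename. rename()
// on the same filesystem is atomic on POSIX, so a reader sees either the old
// file or the new one, never a partially written file.
function saveAtomically(filePath: string, data: unknown): void {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data));
  fs.renameSync(tmpPath, filePath);
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "knowledge-"));
const file = path.join(dir, "knowledge.json");
saveAtomically(file, { "example.com": { optimalLevel: 2 } });
```

The tmp file must live on the same filesystem as the target for the rename to be atomic, which is why it is created next to the destination rather than in a system temp directory.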
+### Example
+
+```
+First visit to cloudflare.com:
+Level 1 → blocked ❌
+Level 2 → blocked ❌
+Level 3 → success ✅ (Cloudflare detected)
+→ Engine records: cloudflare.com needs Level 3
+
+Second visit to cloudflare.com:
+→ Engine predicts: Level 3, confidence 85%, Cloudflare
+→ Skips Level 1 and 2 entirely — goes straight to browser
+→ 3x faster than first visit
+```
+
+> **The more you use it, the faster it gets.** Zero configuration. Fully automatic.
+
+---
+
 ## Skills System
 
 Skills let you teach imperium-crawl how to extract data from any website, then re-run it for fresh content whenever you want.
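The time-decay and confidence bullets in the section above can be sketched as below. Treating the 7 days as a half-life and the exact scoring formula are assumptions; the README only says older knowledge loses weight and that low data means starting from level 1:

```typescript
// Assumed interpretation: an observation's weight halves every DECAY_DAYS.
const DECAY_DAYS = 7;

/** Weight of knowledge that is ageDays old. */
function decayWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / DECAY_DAYS);
}

/** Naive confidence score: success ratio, discounted by age of the data. */
function confidence(successes: number, failures: number, ageDays: number): number {
  const total = successes + failures;
  if (total === 0) return 0; // no data: fall back to starting at level 1
  return (successes / total) * decayWeight(ageDays);
}
```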
@@ -265,6 +320,7 @@ Turn any website into an API. No documentation needed.
 - **HTTP transport hardening** — rate limiting (100 req/min), 1MB body limit, 5min request timeout
 - **Proxy support** — single proxy (`PROXY_URL`) or rotating pool (`PROXY_URLS`) with http/https/socks4/socks5 support
 - **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size
+- **Adaptive learning** — remembers optimal stealth level per domain, gets faster with every request
 - **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes
 - **robots.txt** — respected by default (configurable)
 
package/dist/constants.d.ts
CHANGED
@@ -1,5 +1,5 @@
 export declare const PACKAGE_NAME = "imperium-crawl";
-export declare const PACKAGE_VERSION = "1.1.5";
+export declare const PACKAGE_VERSION = "1.1.7";
 export declare const DEFAULT_TIMEOUT_MS = 30000;
 export declare const DEFAULT_MAX_PAGES = 10;
 export declare const DEFAULT_MAX_DEPTH = 2;

package/dist/constants.js
CHANGED