npm - imperium-crawl - Versions diffs - 2.5.1 → 2.5.3 - Mend

imperium-crawl 2.5.1 → 2.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2)

package/README.md +53 -10
package/package.json +1 -1

package/README.md CHANGED

@@ -17,13 +17,27 @@
 ---
-## What's new in 2.5.0
+## What's new in 2.5.1
-Three new tools for document extraction and content monitoring — all zero-API-key, all native:
+**Browser-based image extraction overhaul** — 100% coverage on any website:
-- **`pdf-extract`** — Pull text, pages, tables, and metadata from any PDF (local or remote) via `pdfjs-dist`. Ideal for regulatory docs, sustainability reports, invoices. Smoke-tested on a 98-page CBAM Guidance PDF (199K chars, confidence 0.99).
-- **`watch`** — Hash-based one-shot change detector for a single URL. Cron-friendly. Fires a webhook on change.
-- **`monitor`** — Multi-URL intelligence digest. Reads a JSON config grouping URLs by topic, emits a markdown digest filtered by minimum change percentage.
+- **Full browser rendering (L3)** for image discovery — JavaScript, lazy-load, shadow DOM, same-origin iframes
+- **7 image sources**: `<img>`, `<picture>`, CSS `background-image`, shadow DOM, JSON-LD, inline scripts, iframes
+- **Precise targeting**: `--selector`, `--index`, `--alt-match`, `--min-width`, `--max-width`
+- **Auto-click** "Load more" / "Gallery" buttons with multilingual keyword matching
+- **Referer injection** fixes 403 errors on image CDN anti-hotlink protection
+- **New `auto_click` action** in `interact` tool for standalone browser automation
+```bash
+# Download ALL images from any page (100% coverage)
+imperium-crawl download <url> --images --output ./slike
+# Target exactly the 3rd image
+imperium-crawl download <url> --images --index 3
+# Auto-click "Prikaži više" + scan iframes
+imperium-crawl download <url> --images --auto-click --iframe-scan
+```
 See [CHANGELOG.md](./CHANGELOG.md) for the full release notes.
@@ -48,7 +62,7 @@ npm install -g imperium-crawl
 **Install from a local tarball** (e.g. pre-release testing):
 ```bash
-npm install -g ./imperium-crawl-2.5.0.tgz
+npm install -g ./imperium-crawl-2.5.2.tgz
 ```
 > That's it. 33 of 39 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
@@ -105,6 +119,34 @@ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
 }
 ```
+### Extract ALL images from any page (100% coverage)
+```bash
+imperium-crawl download https://www.njuskalo.hr/nekretnine/stan-Zagreb --images --output ./slike
+```
+```
+Discovered 23 unique images
+  ✅ njuskalo.hr-001.jpg — 142KB
+  ✅ njuskalo.hr-002.jpg — 89KB
+  ✅ njuskalo.hr-003.jpg — 256KB
+→ 23/23 downloaded. Total: 4.2MB
+```
+**Target a specific image:**
+```bash
+imperium-crawl download https://olx.ba/artikal/12345 \
+  --images --selector "img.gallery-main" --output ./oglas.jpg
+```
+**Auto-click "Load more" + iframe scan:**
+```bash
+imperium-crawl download https://www.leboncoin.fr/ad/12345 \
+  --images --auto-click --iframe-scan --limit 50
+```
 ### Batch scrape with resume
 ```bash
@@ -135,7 +177,7 @@ Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatic
 🧠 **Self-Improving**
 Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
-🧰 **33 Tools, 2 Modes**
+🧰 **39 Tools, 2 Modes**
 CLI tool or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
 📜 **14 Built-in Recipes**
@@ -151,7 +193,7 @@ Teach it once, run forever. Auto-detect patterns on any page, save as reusable s
 | Feature | **imperium-crawl** | Firecrawl | Crawl4AI | Browserbase | Puppeteer |
 |---------|:------------------:|:---------:|:--------:|:-----------:|:---------:|
 | Price | **Free forever** | $19+/month | Free | $0.01/min | Free |
-| Total tools | **33** | 5 | 2 | 4 | N/A |
+| Total tools | **39** | 5 | 2 | 4 | N/A |
 | Stealth levels | **3 (auto-escalate)** | Cloud-based | 1 | Cloud-based | None |
 | Anti-bot detection | **7 systems** | Partial | Partial | Partial | None |
 | TLS fingerprinting | **JA3/JA4** | No | No | No | No |
@@ -292,7 +334,7 @@ Second visit to cloudflare.com:
 | Tool | What It Does |
 |------|-------------|
-| **interact** | Browser automation with 19 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. |
+| **interact** | Browser automation with 20 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh, **auto_click**). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. **auto_click** finds and clicks "load more" / "gallery" buttons with multilingual keyword matching. |
 | **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |
 ### 📱 Social Media (no API key needed)
@@ -307,7 +349,8 @@ Second visit to cloudflare.com:
 | Tool | What It Does |
 |------|-------------|
-| **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. Auto-detects URL type and applies optimal strategy. |
+| **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. **v2.5.1**: Browser-based image extraction with 100% coverage (lazy-load, shadow DOM, iframes, JSON-LD, CSS backgrounds). Target specific images via `--selector`, `--index`, `--alt-match`. Auto-click "load more" buttons. Referer injection fixes 403 on CDNs. |
+| **batch_download** | Download multiple files (PDFs, images, documents) in parallel with session cookie support. Uses L1 HTTP fetch — 10x faster than browser-based downloads. Ideal for bulk file retrieval from authenticated sessions. |
 | **rss** | Fetch and parse RSS/Atom feeds. Filter by date, output as JSON or Markdown. |
 ### 📦 Batch Processing (no API key needed)

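Note on the "Referer injection" item in the README changes above: the diff states only that it fixes 403 errors from image CDN anti-hotlink protection and does not show how the tool does it. The general idea can be sketched in Python as below, assuming the fix amounts to sending the embedding page's URL in the `Referer` header when fetching each image. The function name `build_image_request` and the User-Agent string are hypothetical and are not taken from imperium-crawl.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_image_request(image_url: str, page_url: str) -> Request:
    """Build a GET request for an image that names the embedding page as Referer.

    Hypothetical helper for illustration only; not imperium-crawl's actual code.
    """
    # Only plain web URLs make sense here; data: or blob: URLs are not fetched over HTTP.
    if urlsplit(image_url).scheme not in ("http", "https"):
        raise ValueError(f"not an http(s) image URL: {image_url!r}")

    headers = {
        # Anti-hotlink rules commonly reject requests whose Referer is missing
        # or points at a foreign site, so send the page the image was found on.
        "Referer": page_url,
        # Placeholder value (assumption); a real client would send its own UA.
        "User-Agent": "Mozilla/5.0 (sketch)",
    }
    return Request(image_url, headers=headers)


if __name__ == "__main__":
    req = build_image_request(
        "https://cdn.example.com/photos/001.jpg",
        "https://www.example.com/listing/12345",
    )
    print(req.get_method(), req.full_url)
    print("Referer:", req.get_header("Referer"))
```

The returned request can be passed to `urllib.request.urlopen` to perform the download. Whether a given CDN accepts it depends on that CDN's rules, which this sketch does not model.
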
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "imperium-crawl",
-  "version": "2.5.1",
+  "version": "2.5.3",
   "description": "39-tool open-source CLI for web scraping, PDF extraction, content monitoring, reusable browser flows, RSS aggregation, and custom skills. Zero API keys for core tools.",
   "type": "module",
   "bin": {