imperium-crawl 2.5.1 → 2.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +53 -10
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -17,13 +17,27 @@
 
  ---
 
- ## What's new in 2.5.0
+ ## What's new in 2.5.1
 
- Three new tools for document extraction and content monitoring all zero-API-key, all native:
+ **Browser-based image extraction overhaul** 100% coverage on any website:
 
- - **`pdf-extract`** Pull text, pages, tables, and metadata from any PDF (local or remote) via `pdfjs-dist`. Ideal for regulatory docs, sustainability reports, invoices. Smoke-tested on a 98-page CBAM Guidance PDF (199K chars, confidence 0.99).
- - **`watch`** Hash-based one-shot change detector for a single URL. Cron-friendly. Fires a webhook on change.
- - **`monitor`** Multi-URL intelligence digest. Reads a JSON config grouping URLs by topic, emits a markdown digest filtered by minimum change percentage.
+ - **Full browser rendering (L3)** for image discovery JavaScript, lazy-load, shadow DOM, same-origin iframes
+ - **7 image sources**: `<img>`, `<picture>`, CSS `background-image`, shadow DOM, JSON-LD, inline scripts, iframes
+ - **Precise targeting**: `--selector`, `--index`, `--alt-match`, `--min-width`, `--max-width`
+ - **Auto-click** "Load more" / "Gallery" buttons with multilingual keyword matching
+ - **Referer injection** fixes 403 errors on image CDN anti-hotlink protection
+ - **New `auto_click` action** in `interact` tool for standalone browser automation
+
+ ```bash
+ # Download ALL images from any page (100% coverage)
+ imperium-crawl download <url> --images --output ./slike
+
+ # Target exactly the 3rd image
+ imperium-crawl download <url> --images --index 3
+
+ # Auto-click "Prikaži više" + scan iframes
+ imperium-crawl download <url> --images --auto-click --iframe-scan
+ ```
 
  See [CHANGELOG.md](./CHANGELOG.md) for the full release notes.
 
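Reviewer sketch (not part of the package): the hunk above lists the targeting flags `--index`, `--alt-match`, `--min-width` and `--max-width` without saying how they combine. The Python below shows one plausible reading, with the filters applied in sequence. The `Image` record and the `select_images` helper are hypothetical names, and the 1-based index and case-insensitive substring match on alt text are assumptions, not confirmed behaviour of imperium-crawl.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Image:
    """Hypothetical record for one discovered image."""
    url: str
    alt: str = ""
    width: int = 0


def select_images(
    images: List[Image],
    index: Optional[int] = None,
    alt_match: Optional[str] = None,
    min_width: Optional[int] = None,
    max_width: Optional[int] = None,
) -> List[Image]:
    """Filter by alt text and width, then optionally pick one image by position."""
    selected = images
    if alt_match is not None:
        # Assumption: alt matching is a case-insensitive substring test.
        needle = alt_match.lower()
        selected = [img for img in selected if needle in img.alt.lower()]
    if min_width is not None:
        selected = [img for img in selected if img.width >= min_width]
    if max_width is not None:
        selected = [img for img in selected if img.width <= max_width]
    if index is not None:
        # Assumption: index is 1-based, mirroring "--index 3" for "the 3rd image".
        selected = selected[index - 1:index] if index >= 1 else []
    return selected
```

An out-of-range index yields an empty list instead of raising, so a caller can report "no match" in its own way.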
@@ -48,7 +62,7 @@ npm install -g imperium-crawl
  **Install from a local tarball** (e.g. pre-release testing):
 
  ```bash
- npm install -g ./imperium-crawl-2.5.0.tgz
+ npm install -g ./imperium-crawl-2.5.2.tgz
  ```
 
  > That's it. 33 of 39 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
@@ -105,6 +119,34 @@ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
  }
  ```
 
+ ### Extract ALL images from any page (100% coverage)
+
+ ```bash
+ imperium-crawl download https://www.njuskalo.hr/nekretnine/stan-Zagreb --images --output ./slike
+ ```
+
+ ```
+ Discovered 23 unique images
+ ✅ njuskalo.hr-001.jpg — 142KB
+ ✅ njuskalo.hr-002.jpg — 89KB
+ ✅ njuskalo.hr-003.jpg — 256KB
+ → 23/23 downloaded. Total: 4.2MB
+ ```
+
+ **Target a specific image:**
+
+ ```bash
+ imperium-crawl download https://olx.ba/artikal/12345 \
+ --images --selector "img.gallery-main" --output ./oglas.jpg
+ ```
+
+ **Auto-click "Load more" + iframe scan:**
+
+ ```bash
+ imperium-crawl download https://www.leboncoin.fr/ad/12345 \
+ --images --auto-click --iframe-scan --limit 50
+ ```
+
  ### Batch scrape with resume
 
  ```bash
@@ -135,7 +177,7 @@ Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatic
  🧠 **Self-Improving**
  Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
 
- 🧰 **33 Tools, 2 Modes**
+ 🧰 **39 Tools, 2 Modes**
  CLI tool or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
 
  📜 **14 Built-in Recipes**
@@ -151,7 +193,7 @@ Teach it once, run forever. Auto-detect patterns on any page, save as reusable s
  | Feature | **imperium-crawl** | Firecrawl | Crawl4AI | Browserbase | Puppeteer |
  |---------|:------------------:|:---------:|:--------:|:-----------:|:---------:|
  | Price | **Free forever** | $19+/month | Free | $0.01/min | Free |
- | Total tools | **33** | 5 | 2 | 4 | N/A |
+ | Total tools | **39** | 5 | 2 | 4 | N/A |
  | Stealth levels | **3 (auto-escalate)** | Cloud-based | 1 | Cloud-based | None |
  | Anti-bot detection | **7 systems** | Partial | Partial | Partial | None |
  | TLS fingerprinting | **JA3/JA4** | No | No | No | No |
@@ -292,7 +334,7 @@ Second visit to cloudflare.com:
 
  | Tool | What It Does |
  |------|-------------|
- | **interact** | Browser automation with 19 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. |
+ | **interact** | Browser automation with 20 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh, **auto_click**). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. **auto_click** finds and clicks "load more" / "gallery" buttons with multilingual keyword matching. |
  | **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |
 
  ### 📱 Social Media (no API key needed)
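Reviewer sketch (not part of the package): the `auto_click` description above mentions "multilingual keyword matching" but the diff does not show how labels are compared. The Python below is one minimal way such matching could work: normalise the button label, then test it against a keyword list. The `KEYWORDS` tuple contains only the three phrases that appear in this diff. The helper names `normalize` and `matches_keyword` are hypothetical, and the real keyword list and matching rules in imperium-crawl may differ.

```python
import unicodedata
from typing import Iterable

# Only phrases quoted in the README changes; the tool's real list is not shown.
KEYWORDS = ("load more", "gallery", "prikaži više")


def normalize(text: str) -> str:
    """Unicode-normalise, case-fold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def matches_keyword(label: str, keywords: Iterable[str] = KEYWORDS) -> bool:
    """Return True if any keyword occurs inside the normalised button label."""
    cleaned = normalize(label)
    return any(normalize(keyword) in cleaned for keyword in keywords)
```

Case folding rather than plain lowercasing keeps labels such as "PRIKAŽI VIŠE" comparable with the stored keyword.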
@@ -307,7 +349,8 @@ Second visit to cloudflare.com:
 
  | Tool | What It Does |
  |------|-------------|
- | **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. Auto-detects URL type and applies optimal strategy. |
+ | **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. **v2.5.1**: Browser-based image extraction with 100% coverage (lazy-load, shadow DOM, iframes, JSON-LD, CSS backgrounds). Target specific images via `--selector`, `--index`, `--alt-match`. Auto-click "load more" buttons. Referer injection fixes 403 on CDNs. |
+ | **batch_download** | Download multiple files (PDFs, images, documents) in parallel with session cookie support. Uses L1 HTTP fetch — 10x faster than browser-based downloads. Ideal for bulk file retrieval from authenticated sessions. |
  | **rss** | Fetch and parse RSS/Atom feeds. Filter by date, output as JSON or Markdown. |
 
  ### 📦 Batch Processing (no API key needed)
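Reviewer sketch (not part of the package): the `download` row above says Referer injection fixes 403 responses from image CDNs. Anti-hotlink rules typically reject image requests whose `Referer` header does not point at the hosting site, so sending the page URL as the referer is the usual workaround. The Python below illustrates that idea with the standard library only. `build_image_request` is a hypothetical helper, and sending the full page URL (not just its origin) is an assumption about what the tool does.

```python
from urllib.request import Request


def build_image_request(image_url: str, page_url: str) -> Request:
    """Build a GET request for an image that names the hosting page as referer."""
    # Assumption: the full page URL is sent; some setups may need only the origin.
    return Request(image_url, headers={"Referer": page_url})
```

The returned object can be passed to `urllib.request.urlopen` to perform the actual download.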
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "imperium-crawl",
- "version": "2.5.1",
+ "version": "2.5.3",
  "description": "39-tool open-source CLI for web scraping, PDF extraction, content monitoring, reusable browser flows, RSS aggregation, and custom skills. Zero API keys for core tools.",
  "type": "module",
  "bin": {