imperium-crawl 2.5.1 → 2.5.3
- package/README.md +53 -10
- package/package.json +1 -1
package/README.md
CHANGED
@@ -17,13 +17,27 @@
 
 ---
 
-## What's new in 2.5.
+## What's new in 2.5.1
 
-
+**Browser-based image extraction overhaul** — 100% coverage on any website:
 
--
--
--
+- **Full browser rendering (L3)** for image discovery — JavaScript, lazy-load, shadow DOM, same-origin iframes
+- **7 image sources**: `<img>`, `<picture>`, CSS `background-image`, shadow DOM, JSON-LD, inline scripts, iframes
+- **Precise targeting**: `--selector`, `--index`, `--alt-match`, `--min-width`, `--max-width`
+- **Auto-click** "Load more" / "Gallery" buttons with multilingual keyword matching
+- **Referer injection** fixes 403 errors on image CDN anti-hotlink protection
+- **New `auto_click` action** in `interact` tool for standalone browser automation
+
+```bash
+# Download ALL images from any page (100% coverage)
+imperium-crawl download <url> --images --output ./slike
+
+# Target exactly the 3rd image
+imperium-crawl download <url> --images --index 3
+
+# Auto-click "Prikaži više" + scan iframes
+imperium-crawl download <url> --images --auto-click --iframe-scan
+```
 
 See [CHANGELOG.md](./CHANGELOG.md) for the full release notes.
 
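Note on the targeting flags added in this hunk: the README lists `--index`, `--alt-match`, `--min-width` and `--max-width` but does not show how they combine. The Python sketch below is one plausible reading, not the package's implementation. It assumes that `--index` is 1-based (the README calls `--index 3` "the 3rd image"), that `--alt-match` is a case-insensitive substring test, that widths are inclusive pixel bounds, and that filters run before the index is applied. The `Image` type and `select_images` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical record for one discovered image."""
    url: str
    alt: str = ""
    width: int = 0  # width in pixels (assumed unit)


def select_images(images, index=None, alt_match=None, min_width=None, max_width=None):
    """Filter first, then pick by 1-based position (assumed order of operations)."""
    selected = list(images)
    if alt_match is not None:
        needle = alt_match.lower()
        selected = [img for img in selected if needle in img.alt.lower()]
    if min_width is not None:
        selected = [img for img in selected if img.width >= min_width]
    if max_width is not None:
        selected = [img for img in selected if img.width <= max_width]
    if index is not None:
        # "--index 3" is described as "the 3rd image", so treat it as 1-based.
        selected = selected[index - 1:index] if index >= 1 else []
    return selected
```

An out-of-range index returns an empty list here rather than raising; whether the real CLI reports an error in that case is not stated in the diff.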
@@ -48,7 +62,7 @@ npm install -g imperium-crawl
 **Install from a local tarball** (e.g. pre-release testing):
 
 ```bash
-npm install -g ./imperium-crawl-2.5.
+npm install -g ./imperium-crawl-2.5.2.tgz
 ```
 
 > That's it. 33 of 39 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.
@@ -105,6 +119,34 @@ imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
 }
 ```
 
+### Extract ALL images from any page (100% coverage)
+
+```bash
+imperium-crawl download https://www.njuskalo.hr/nekretnine/stan-Zagreb --images --output ./slike
+```
+
+```
+Discovered 23 unique images
+✅ njuskalo.hr-001.jpg — 142KB
+✅ njuskalo.hr-002.jpg — 89KB
+✅ njuskalo.hr-003.jpg — 256KB
+→ 23/23 downloaded. Total: 4.2MB
+```
+
+**Target a specific image:**
+
+```bash
+imperium-crawl download https://olx.ba/artikal/12345 \
+  --images --selector "img.gallery-main" --output ./oglas.jpg
+```
+
+**Auto-click "Load more" + iframe scan:**
+
+```bash
+imperium-crawl download https://www.leboncoin.fr/ad/12345 \
+  --images --auto-click --iframe-scan --limit 50
+```
+
 ### Batch scrape with resume
 
 ```bash
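Note on the sample output added in this hunk: "Discovered 23 unique images" and file names such as `njuskalo.hr-001.jpg` imply a de-duplication step and a host-based naming scheme. The diff does not show how either works, so the Python sketch below is only an illustration. It assumes that duplicates are detected by comparing absolute URLs with the fragment removed, that a leading `www.` is dropped from the host, and that `.jpg` is the fallback extension. `unique_image_urls` and `output_names` are hypothetical helpers.

```python
from posixpath import splitext
from urllib.parse import urldefrag, urljoin, urlsplit


def unique_image_urls(page_url, candidates):
    """Resolve candidates against the page URL and drop repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for raw in candidates:
        absolute, _fragment = urldefrag(urljoin(page_url, raw.strip()))
        if absolute not in seen:
            seen.add(absolute)
            unique.append(absolute)
    return unique


def output_names(page_url, image_urls):
    """Build names like 'njuskalo.hr-001.jpg' (pattern inferred from the sample output)."""
    host = urlsplit(page_url).hostname or "page"
    if host.startswith("www."):
        host = host[4:]  # assumed: the sample output omits the 'www.' prefix
    names = []
    for position, url in enumerate(image_urls, start=1):
        ext = splitext(urlsplit(url).path)[1] or ".jpg"  # assumed fallback extension
        names.append(f"{host}-{position:03d}{ext}")
    return names
```

A real extractor that merges seven sources would probably also need to treat different `srcset` sizes of the same picture as one image; that is outside this sketch.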
@@ -135,7 +177,7 @@ Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatic
 🧠 **Self-Improving**
 Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.
 
-🧰 **
+🧰 **39 Tools, 2 Modes**
 CLI tool or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.
 
 📜 **14 Built-in Recipes**
@@ -151,7 +193,7 @@ Teach it once, run forever. Auto-detect patterns on any page, save as reusable s
 | Feature | **imperium-crawl** | Firecrawl | Crawl4AI | Browserbase | Puppeteer |
 |---------|:------------------:|:---------:|:--------:|:-----------:|:---------:|
 | Price | **Free forever** | $19+/month | Free | $0.01/min | Free |
-| Total tools | **
+| Total tools | **39** | 5 | 2 | 4 | N/A |
 | Stealth levels | **3 (auto-escalate)** | Cloud-based | 1 | Cloud-based | None |
 | Anti-bot detection | **7 systems** | Partial | Partial | Partial | None |
 | TLS fingerprinting | **JA3/JA4** | No | No | No | No |
@@ -292,7 +334,7 @@ Second visit to cloudflare.com:
 
 | Tool | What It Does |
 |------|-------------|
-| **interact** | Browser automation with
+| **interact** | Browser automation with 20 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh, **auto_click**). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. **auto_click** finds and clicks "load more" / "gallery" buttons with multilingual keyword matching. |
 | **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |
 
 ### 📱 Social Media (no API key needed)
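Note on `auto_click`: the row added in this hunk says the action matches "load more" / "gallery" buttons with multilingual keywords, without showing the matching rule. The Python sketch below shows one way such matching could work, assuming case-insensitive, accent-insensitive substring matching on the button label. The keyword tuple is invented for illustration; the package's actual list does not appear in this diff.

```python
import unicodedata

# Hypothetical keyword list; the real one is not shown in the diff.
KEYWORDS = ("load more", "show more", "gallery", "prikaži više", "voir plus", "mehr anzeigen")


def normalize(label):
    """Casefold, strip combining accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", label.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split())


def should_click(label, keywords=KEYWORDS):
    """Return True when the normalized label contains any normalized keyword."""
    text = normalize(label)
    return any(normalize(keyword) in text for keyword in keywords)
```

Stripping accents lets a label typed as "Prikazi vise" match the keyword "prikaži više", at the cost of occasional false positives between words that differ only by diacritics.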
@@ -307,7 +349,8 @@ Second visit to cloudflare.com:
 
 | Tool | What It Does |
 |------|-------------|
-| **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. Auto-
+| **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. **v2.5.1**: Browser-based image extraction with 100% coverage (lazy-load, shadow DOM, iframes, JSON-LD, CSS backgrounds). Target specific images via `--selector`, `--index`, `--alt-match`. Auto-click "load more" buttons. Referer injection fixes 403 on CDNs. |
+| **batch_download** | Download multiple files (PDFs, images, documents) in parallel with session cookie support. Uses L1 HTTP fetch — 10x faster than browser-based downloads. Ideal for bulk file retrieval from authenticated sessions. |
 | **rss** | Fetch and parse RSS/Atom feeds. Filter by date, output as JSON or Markdown. |
 
 ### 📦 Batch Processing (no API key needed)
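Note on "Referer injection": the `download` row credits this with fixing 403 responses from CDNs that block hotlinking. The usual idea is to send the image request with a `Referer` header naming the page the image was found on. The Python sketch below builds such a request with the standard library. It assumes that the origin of the page (scheme and host) is enough for the CDN check; the diff does not say whether the tool sends the origin or the full page URL. `image_request` is a hypothetical helper and the default `User-Agent` is a placeholder.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def image_request(image_url, page_url, user_agent="Mozilla/5.0"):
    """Build a request for image_url that claims page_url's origin as its referrer."""
    page = urlsplit(page_url)
    referer = f"{page.scheme}://{page.netloc}/"  # assumed: origin only, not the full path
    return Request(image_url, headers={"Referer": referer, "User-Agent": user_agent})
```

The returned object can be passed to `urllib.request.urlopen`; no network call is made until then.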
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "imperium-crawl",
-  "version": "2.5.
+  "version": "2.5.3",
   "description": "39-tool open-source CLI for web scraping, PDF extraction, content monitoring, reusable browser flows, RSS aggregation, and custom skills. Zero API keys for core tools.",
   "type": "module",
   "bin": {