smippo 0.0.1 → 0.0.5
- package/README.md +354 -27
- package/package.json +1 -1
- package/src/link-rewriter.js +24 -0
package/README.md
CHANGED
@@ -27,6 +27,45 @@
 
 📚 **[View complete documentation →](https://smippo.com)**
 
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Features](#features)
+- [Quick Start](#quick-start)
+- [Installation](#installation)
+- [Requirements](#requirements)
+- [npm (Global)](#npm-global)
+- [Homebrew (Coming soon)](#homebrew-coming-soon)
+- [Usage](#usage)
+- [Basic Usage](#basic-usage)
+- [Interactive Mode](#interactive-mode)
+- [Filtering](#filtering)
+- [Scope Control](#scope-control)
+- [Browser Options](#browser-options)
+- [Screenshots](#screenshots)
+- [Authentication](#authentication)
+- [Output Options](#output-options)
+- [Performance \& Parallelism: The Vacuum Architecture](#performance--parallelism-the-vacuum-architecture)
+- [Continue/Update](#continueupdate)
+- [Serve](#serve)
+- [Static Mode](#static-mode)
+- [Structured Output](#structured-output)
+- [Programmatic API](#programmatic-api)
+- [Contributing](#contributing)
+- [License](#license)
+- [Acknowledgments](#acknowledgments)
+
+## Features
+
+- **🚀 Vacuum Architecture** — Parallel workers consume sites rapidly, just like hippos vacuum up everything in their path
+- **📸 Structured Mirroring** — Every page, every resource, every network request captured in organized, structured output
+- **🔍 Complete Fidelity** — Gets the page exactly as you see it, including CSS-in-JS, dynamic content, and lazy-loaded images
+- **🎯 Smart Consumption** — Respects robots.txt, filters by URL patterns, MIME types, and file sizes
+- **📦 Structured Output** — Organized mirror structure preserves original paths for seamless offline browsing
+- **🎨 Beautiful CLI** — Interactive guided mode, progress bars, and elegant terminal output
+- **🌐 Built-in Server** — Serve captured sites locally with directory browsing
+- **📊 HAR Files** — Generates HTTP Archive files for debugging and replay
+
 ## Quick Start
 
 Install globally:
@@ -53,48 +92,336 @@ Or use without installing:
 npx smippo https://example.com
 ```
 
-
+> 📖 **For complete documentation, guides, and API reference, visit [smippo.com](https://smippo.com)**
 
-
+## Installation
 
-
-- **`smippo capture <url>`** — Take screenshots of web pages
-- **`smippo serve <directory>`** — Serve captured sites locally
-- **`smippo continue`** — Resume an interrupted capture
-- **`smippo update`** — Update an existing mirror
+### Requirements
 
-
+- Node.js 18 or later
+- Chromium (automatically downloaded on first install)
 
-
+### npm (Global)
 
-
-
-
-- **🌐 Built-in Server** — Serve captured sites locally with directory browsing
-- **📊 HAR Files** — Generates HTTP Archive files for debugging and replay
-- **💻 Programmatic API** — Use Smippo in your Node.js applications
+```bash
+npm install -g smippo
+```
 
-
+### Homebrew (Coming soon)
 
-
+```bash
+brew install smippo
+```
 
-
-- **[Commands Reference](https://smippo.com/commands)** — All available commands and options
-- **[Configuration](https://smippo.com/configuration)** — Filtering, scope control, performance tuning
-- **[Guides](https://smippo.com/guides)** — Output structure, link rewriting, troubleshooting
-- **[Programmatic API](https://smippo.com/api/programmatic)** — Use Smippo in your Node.js code
-- **[Examples](https://smippo.com/getting-started/examples)** — Real-world use cases
+## Usage
 
-
+### Basic Usage
 
-
-
+```bash
+# Capture a single page with all assets
+smippo https://example.com
+
+# Mirror a site with depth control
+smippo https://example.com --depth 3
+
+# Save to custom directory
+smippo https://example.com --output ./my-mirror
+```
+
+### Interactive Mode
+
+Just run `smippo` with no arguments to start the guided wizard:
+
+```bash
+smippo
+```
+
+This will walk you through:
+
+- URL to capture
+- Crawl depth
+- Scope settings
+- Asset options
+- Advanced configuration
+
+Perfect for beginners or when you want to explore options!
+
+### Filtering
+
+```bash
+# Include only specific patterns
+smippo https://example.com --include "*.html" --include "*.css"
+
+# Exclude patterns
+smippo https://example.com --exclude "*tracking*" --exclude "*ads*"
+
+# Filter by MIME type
+smippo https://example.com --mime-include "image/*" --mime-exclude "video/*"
+
+# Filter by file size
+smippo https://example.com --max-size 5MB --min-size 1KB
+```
+
+### Scope Control
+
+```bash
+# Stay on same subdomain (default)
+smippo https://www.example.com --scope subdomain
+
+# Allow all subdomains
+smippo https://www.example.com --scope domain
+
+# Go everywhere (use with caution!)
+smippo https://example.com --scope all --depth 2
+```
+
+### Browser Options
+
+```bash
+# Wait for specific condition
+smippo https://example.com --wait networkidle
+smippo https://example.com --wait domcontentloaded
+
+# Add extra wait time for slow sites
+smippo https://example.com --wait-time 5000
+
+# Custom user agent
+smippo https://example.com --user-agent "Mozilla/5.0..."
+
+# Custom viewport
+smippo https://example.com --viewport 1280x720
+
+# Emulate device
+smippo https://example.com --device "iPhone 13"
+```
+
+### Screenshots
+
+Take quick screenshots without mirroring the full site:
+
+```bash
+# Basic screenshot
+smippo capture https://example.com
+
+# Full-page screenshot (captures entire scrollable page)
+smippo capture https://example.com --full-page
+
+# Save to specific file
+smippo capture https://example.com -O ./screenshots/example.png
+
+# Mobile device screenshot
+smippo capture https://example.com --device "iPhone 13" -O mobile.png
+
+# Screenshot with dark mode
+smippo capture https://example.com --dark-mode
+
+# Capture specific element
+smippo capture https://example.com --selector ".hero-section"
+
+# JPEG format with quality
+smippo capture https://example.com --format jpeg --quality 90
+```
+
+### Authentication
+
+```bash
+# Basic auth
+smippo https://user:pass@example.com
+
+# Cookie-based auth
+smippo https://example.com --cookies cookies.json
+
+# Interactive login (opens browser window)
+smippo https://example.com --capture-auth
+```
+
+### Output Options
+
+```bash
+# Generate screenshots
+smippo https://example.com --screenshot
+
+# Generate PDFs
+smippo https://example.com --pdf
+
+# Skip HAR file
+smippo https://example.com --no-har
+
+# Output structure
+smippo https://example.com --structure original # URL paths (default)
+smippo https://example.com --structure flat # All in one directory
+smippo https://example.com --structure domain # Organized by domain
+```
+
+### Performance & Parallelism: The Vacuum Architecture
+
+Smippo's parallel worker architecture mirrors how hippos consume everything in their path—rapidly and efficiently. Multiple workers operate simultaneously, each vacuuming up pages, resources, and network requests in parallel.
+
+```bash
+# Default: 8 parallel workers (8 hippos vacuuming simultaneously)
+smippo https://example.com
+
+# Limit to 4 workers (for rate-limited sites)
+smippo https://example.com --workers 4
+
+# Single worker (sequential, safest)
+smippo https://example.com --workers 1
+
+# Maximum speed (use with caution)
+smippo https://example.com --workers 16
+
+# Limit total pages
+smippo https://example.com --max-pages 100
+
+# Limit total time
+smippo https://example.com --max-time 300 # 5 minutes
+
+# Rate limiting (delay between requests per worker)
+smippo https://example.com --rate-limit 1000 # 1 second between requests
+```
+
+**The Vacuum Architecture:**
+
+Each worker operates like an independent hippo, vacuuming up:
+
+- Fully rendered pages (after JavaScript execution)
+- All network resources (images, fonts, stylesheets, API responses)
+- Network metadata (captured in HAR files)
+- Link structures (for recursive crawling)
+
+All captured content is then **structured** into organized mirrors that preserve original paths and relationships.
+
+**Tips for optimal performance:**
+
+- Use `--workers 1` for sites with strict rate limiting
+- Use `--workers 4-8` for most sites (default: 8)
+- Use `--workers 16` only for fast servers you control
+- Combine `--workers` with `--rate-limit` for polite crawling
+
+### Continue/Update
+
+```bash
+# Continue an interrupted capture
+smippo continue
+
+# Update an existing mirror
+smippo update
+```
+
+### Serve
+
+Serve captured sites locally with a built-in web server:
+
+```bash
+# Serve with auto port detection
+smippo serve ./site
+
+# Specify port
+smippo serve ./site --port 3000
+
+# Open browser automatically
+smippo serve ./site --open
+
+# Show all requests
+smippo serve ./site --verbose
+```
+
+The server provides:
+
+- **Auto port detection** — Finds next available port if default is busy
+- **Proper MIME types** — Correct content-type headers for all file types
+- **CORS support** — Enabled by default for local development
+- **Nice terminal UI** — Shows clickable URL and request logs
+
+### Static Mode
+
+For any site, use `--static` to strip scripts for true offline viewing:
+
+```bash
+# Capture as static HTML (removes JS, keeps rendered content)
+smippo https://example.com --static --external-assets
+
+# Then serve
+smippo serve ./site --open
+```
+
+## Structured Output
+
+Smippo creates **structured mirrors** that preserve the original URL structure and relationships. Every page, every resource, every network request is organized and stored in a logical hierarchy:
+
+```
+site/
+├── example.com/
+│   ├── index.html
+│   ├── about/
+│   │   └── index.html
+│   └── assets/
+│       ├── style.css
+│       └── logo.png
+├── .smippo/
+│   ├── cache.json       # Metadata cache
+│   ├── network.har      # HAR file
+│   ├── manifest.json    # Capture manifest
+│   └── log.txt          # Capture log
+└── index.html           # Entry point
+```
+
+## Programmatic API
+
+```javascript
+import {capture, Crawler, createServer} from 'smippo';
+
+// Simple capture
+const result = await capture('https://example.com', {
+  output: './mirror',
+  depth: 2,
+});
+
+console.log(`Captured ${result.stats.pagesCapt} pages`);
+
+// Advanced usage with events
+const crawler = new Crawler({
+  url: 'https://example.com',
+  output: './mirror',
+  depth: 3,
+  scope: 'domain',
+});
+
+crawler.on('page:complete', ({url, size}) => {
+  console.log(`Captured: ${url} (${size} bytes)`);
+});
+
+crawler.on('error', ({url, error}) => {
+  console.error(`Failed: ${url} - ${error.message}`);
+});
+
+await crawler.start();
+
+// Start a server programmatically
+const server = await createServer({
+  directory: './mirror',
+  port: 8080,
+  open: true, // Opens browser automatically
+});
+
+console.log(`Server running at ${server.url}`);
+
+// Later: stop the server
+await server.close();
+```
+
+> 📖 **For complete API documentation, see the [Programmatic API guide](https://smippo.com/api-reference/programmatic-api) on smippo.com**
 
 ## Contributing
 
 Contributions are welcome! Whether it's bug reports, feature requests, or pull requests — all contributions help make Smippo better.
 
-Please read our [Contributing Guide](CONTRIBUTING.md) for details on
+Please read our [Contributing Guide](CONTRIBUTING.md) for details on:
+
+- Development setup
+- Code style guidelines
+- Pull request process
+- Testing requirements
 
 Quick start:
 
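Note on the README's "Performance & Parallelism" section: it describes a fixed number of workers draining pages in parallel, with `--rate-limit` acting as a per-worker delay between requests. The block below is a minimal JavaScript sketch of that general pattern. It is not taken from Smippo's source; `runPool`, `sleep`, `handler`, and the option names `workers` and `rateLimitMs` are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Run `handler` over `urls` with at most `workers` tasks in flight.
// Each worker pauses `rateLimitMs` between its own consecutive tasks.
// Illustrative only: this is NOT Smippo's actual crawler code.
async function runPool(urls, handler, {workers = 8, rateLimitMs = 0} = {}) {
  const queue = [...urls]; // shared work queue, drained by every worker
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        results.push({url, value: await handler(url)});
      } catch (error) {
        // A failed page is recorded and the worker keeps going.
        results.push({url, error});
      }
      if (rateLimitMs > 0 && queue.length > 0) await sleep(rateLimitMs);
    }
  }

  // Never start more workers than there are URLs, and always at least one.
  const count = Math.max(1, Math.min(workers, queue.length));
  await Promise.all(Array.from({length: count}, worker));
  return results;
}
```

Under the assumption that the delay applies per worker, as the README's comment states, a call such as `runPool(urls, fetchPage, {workers: 4, rateLimitMs: 1000})` would roughly correspond to `--workers 4 --rate-limit 1000`, where `fetchPage` stands in for whatever captures a single page.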
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "smippo",
-  "version": "0.0.
+  "version": "0.0.5",
   "description": "S.M.I.P.P.O. — Structured Mirroring of Internet Pages and Public Objects. Modern website copier that captures sites exactly as they appear in your browser.",
   "main": "src/index.js",
   "bin": {
package/src/link-rewriter.js
CHANGED
@@ -171,6 +171,30 @@ export function rewriteLinks(html, pageUrl, urlMap, _options = {}) {
     }
   });
 
+  // Rewrite SVG <image>, <use>, <feImage> xlink:href and href attributes
+  $('image[xlink\\:href], use[xlink\\:href], feImage[xlink\\:href]').each(
+    (_, el) => {
+      const href = $(el).attr('xlink:href');
+      if (shouldSkipUrl(href)) return;
+
+      const localPath = getLocalPath(href);
+      if (localPath) {
+        $(el).attr('xlink:href', localPath);
+      }
+    },
+  );
+
+  // SVG 2 uses href without namespace
+  $('image[href], use[href], feImage[href]').each((_, el) => {
+    const href = $(el).attr('href');
+    if (shouldSkipUrl(href)) return;
+
+    const localPath = getLocalPath(href);
+    if (localPath) {
+      $(el).attr('href', localPath);
+    }
+  });
+
   // Rewrite style attributes
   $('[style]').each((_, el) => {
     const style = $(el).attr('style');
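Note on the `link-rewriter.js` change: the added code rewrites `xlink:href` (SVG 1.1) and plain `href` (SVG 2) on `<image>`, `<use>`, and `<feImage>` so they point at local mirror paths. The block below is a dependency-free sketch of that behavior on a sample string. It makes several assumptions: it uses a regular expression where the package uses a cheerio-style `$` selector, `shouldSkipUrl` and `getLocalPath` are simplified stand-ins for the helpers named in the diff (whose real bodies are not shown here), `urlMap` is a hypothetical `Map`, and keeping the `#fragment` on rewritten paths is a guess about desirable behavior, not something the diff confirms.

```javascript
// Hypothetical map from remote URL to local mirror path.
const urlMap = new Map([
  ['https://example.com/icons.svg', './assets/icons.svg'],
  ['https://example.com/logo.png', './assets/logo.png'],
]);

// Simplified stand-in: skip empty values, same-document fragments, and data: URIs.
function shouldSkipUrl(href) {
  return !href || href.startsWith('#') || href.startsWith('data:');
}

// Simplified stand-in: resolve against the page URL, look up the local path,
// and re-attach any fragment (e.g. "#menu" for sprite references).
function getLocalPath(href, pageUrl) {
  let resolved;
  try {
    resolved = new URL(href, pageUrl);
  } catch {
    return null; // unparsable URL: leave the attribute untouched
  }
  const fragment = resolved.hash;
  resolved.hash = '';
  const local = urlMap.get(resolved.href);
  return local ? local + fragment : null;
}

// Regex-based illustration only. It rewrites the first href or xlink:href
// found on each <image>, <use>, or <feImage> tag with a double-quoted value.
function rewriteSvgRefs(html, pageUrl) {
  const pattern = /(<(?:image|use|feImage)\b[^>]*?\s(?:xlink:)?href=")([^"]*)(")/gi;
  return html.replace(pattern, (match, before, href, after) => {
    if (shouldSkipUrl(href)) return match;
    const localPath = getLocalPath(href, pageUrl);
    return localPath ? before + localPath + after : match;
  });
}
```

A real HTML parser, as used in the package, handles single-quoted attributes, unusual spacing, and elements carrying both attribute forms. The regex version exists only to make the before and after states easy to see.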