smippo 0.0.1 → 0.0.5

package/README.md CHANGED
@@ -27,6 +27,45 @@
 
  📚 **[View complete documentation →](https://smippo.com)**
 
+ ## Table of Contents
+
+ - [Table of Contents](#table-of-contents)
+ - [Features](#features)
+ - [Quick Start](#quick-start)
+ - [Installation](#installation)
+   - [Requirements](#requirements)
+   - [npm (Global)](#npm-global)
+   - [Homebrew (Coming soon)](#homebrew-coming-soon)
+ - [Usage](#usage)
+   - [Basic Usage](#basic-usage)
+   - [Interactive Mode](#interactive-mode)
+   - [Filtering](#filtering)
+   - [Scope Control](#scope-control)
+   - [Browser Options](#browser-options)
+   - [Screenshots](#screenshots)
+   - [Authentication](#authentication)
+   - [Output Options](#output-options)
+   - [Performance \& Parallelism: The Vacuum Architecture](#performance--parallelism-the-vacuum-architecture)
+   - [Continue/Update](#continueupdate)
+   - [Serve](#serve)
+   - [Static Mode](#static-mode)
+ - [Structured Output](#structured-output)
+ - [Programmatic API](#programmatic-api)
+ - [Contributing](#contributing)
+ - [License](#license)
+ - [Acknowledgments](#acknowledgments)
+
+ ## Features
+
+ - **🚀 Vacuum Architecture** — Parallel workers consume sites rapidly, just like hippos vacuum up everything in their path
+ - **📸 Structured Mirroring** — Every page, every resource, every network request captured in organized, structured output
+ - **🔍 Complete Fidelity** — Gets the page exactly as you see it, including CSS-in-JS, dynamic content, and lazy-loaded images
+ - **🎯 Smart Consumption** — Respects robots.txt, filters by URL patterns, MIME types, and file sizes
+ - **📦 Structured Output** — Organized mirror structure preserves original paths for seamless offline browsing
+ - **🎨 Beautiful CLI** — Interactive guided mode, progress bars, and elegant terminal output
+ - **🌐 Built-in Server** — Serve captured sites locally with directory browsing
+ - **📊 HAR Files** — Generates HTTP Archive files for debugging and replay
+
  ## Quick Start
 
  Install globally:
@@ -53,48 +92,336 @@ Or use without installing:
  npx smippo https://example.com
  ```
 
- ## Commands
+ > 📖 **For complete documentation, guides, and API reference, visit [smippo.com](https://smippo.com)**
 
- Smippo provides several commands for different use cases:
+ ## Installation
 
- - **`smippo <url>`** — Capture and mirror websites with full fidelity
- - **`smippo capture <url>`** — Take screenshots of web pages
- - **`smippo serve <directory>`** — Serve captured sites locally
- - **`smippo continue`** — Resume an interrupted capture
- - **`smippo update`** — Update an existing mirror
+ ### Requirements
 
- Run `smippo` with no arguments to start the interactive guided mode.
+ - Node.js 18 or later
+ - Chromium (automatically downloaded on first install)
 
- ## Features
+ ### npm (Global)
 
- - **🚀 Vacuum Architecture** — Parallel workers consume sites rapidly
- - **📸 Complete Fidelity** — Captures pages exactly as rendered, including CSS-in-JS, dynamic content, and lazy-loaded images
- - **🎯 Smart Filtering** — Filter by URL patterns, MIME types, and file sizes. Respects robots.txt
- - **🌐 Built-in Server** — Serve captured sites locally with directory browsing
- - **📊 HAR Files** — Generates HTTP Archive files for debugging and replay
- - **💻 Programmatic API** — Use Smippo in your Node.js applications
+ ```bash
+ npm install -g smippo
+ ```
 
- ## Documentation
+ ### Homebrew (Coming soon)
 
- For complete documentation, guides, and API reference, visit **[smippo.com](https://smippo.com)**:
+ ```bash
+ brew install smippo
+ ```
 
- - **[Installation Guide](https://smippo.com/getting-started/installation)** — Detailed installation instructions
- - **[Commands Reference](https://smippo.com/commands)** — All available commands and options
- - **[Configuration](https://smippo.com/configuration)** — Filtering, scope control, performance tuning
- - **[Guides](https://smippo.com/guides)** — Output structure, link rewriting, troubleshooting
- - **[Programmatic API](https://smippo.com/api/programmatic)** — Use Smippo in your Node.js code
- - **[Examples](https://smippo.com/getting-started/examples)** — Real-world use cases
+ ## Usage
 
- ## Requirements
+ ### Basic Usage
 
- - Node.js 18 or later
- - Chromium (automatically downloaded on first install)
+ ```bash
+ # Capture a single page with all assets
+ smippo https://example.com
+
+ # Mirror a site with depth control
+ smippo https://example.com --depth 3
+
+ # Save to custom directory
+ smippo https://example.com --output ./my-mirror
+ ```
+
+ ### Interactive Mode
+
+ Just run `smippo` with no arguments to start the guided wizard:
+
+ ```bash
+ smippo
+ ```
+
+ This will walk you through:
+
+ - URL to capture
+ - Crawl depth
+ - Scope settings
+ - Asset options
+ - Advanced configuration
+
+ Perfect for beginners or when you want to explore options!
+
+ ### Filtering
+
+ ```bash
+ # Include only specific patterns
+ smippo https://example.com --include "*.html" --include "*.css"
+
+ # Exclude patterns
+ smippo https://example.com --exclude "*tracking*" --exclude "*ads*"
+
+ # Filter by MIME type
+ smippo https://example.com --mime-include "image/*" --mime-exclude "video/*"
+
+ # Filter by file size
+ smippo https://example.com --max-size 5MB --min-size 1KB
+ ```
+
+ ### Scope Control
+
+ ```bash
+ # Stay on same subdomain (default)
+ smippo https://www.example.com --scope subdomain
+
+ # Allow all subdomains
+ smippo https://www.example.com --scope domain
+
+ # Go everywhere (use with caution!)
+ smippo https://example.com --scope all --depth 2
+ ```
+
+ ### Browser Options
+
+ ```bash
+ # Wait for specific condition
+ smippo https://example.com --wait networkidle
+ smippo https://example.com --wait domcontentloaded
+
+ # Add extra wait time for slow sites
+ smippo https://example.com --wait-time 5000
+
+ # Custom user agent
+ smippo https://example.com --user-agent "Mozilla/5.0..."
+
+ # Custom viewport
+ smippo https://example.com --viewport 1280x720
+
+ # Emulate device
+ smippo https://example.com --device "iPhone 13"
+ ```
+
+ ### Screenshots
+
+ Take quick screenshots without mirroring the full site:
+
+ ```bash
+ # Basic screenshot
+ smippo capture https://example.com
+
+ # Full-page screenshot (captures entire scrollable page)
+ smippo capture https://example.com --full-page
+
+ # Save to specific file
+ smippo capture https://example.com -O ./screenshots/example.png
+
+ # Mobile device screenshot
+ smippo capture https://example.com --device "iPhone 13" -O mobile.png
+
+ # Screenshot with dark mode
+ smippo capture https://example.com --dark-mode
+
+ # Capture specific element
+ smippo capture https://example.com --selector ".hero-section"
+
+ # JPEG format with quality
+ smippo capture https://example.com --format jpeg --quality 90
+ ```
+
+ ### Authentication
+
+ ```bash
+ # Basic auth
+ smippo https://user:pass@example.com
+
+ # Cookie-based auth
+ smippo https://example.com --cookies cookies.json
+
+ # Interactive login (opens browser window)
+ smippo https://example.com --capture-auth
+ ```
+
+ ### Output Options
+
+ ```bash
+ # Generate screenshots
+ smippo https://example.com --screenshot
+
+ # Generate PDFs
+ smippo https://example.com --pdf
+
+ # Skip HAR file
+ smippo https://example.com --no-har
+
+ # Output structure
+ smippo https://example.com --structure original  # URL paths (default)
+ smippo https://example.com --structure flat      # All in one directory
+ smippo https://example.com --structure domain    # Organized by domain
+ ```
+
+ ### Performance & Parallelism: The Vacuum Architecture
+
+ Smippo's parallel worker architecture mirrors how hippos consume everything in their path—rapidly and efficiently. Multiple workers operate simultaneously, each vacuuming up pages, resources, and network requests in parallel.
+
+ ```bash
+ # Default: 8 parallel workers (8 hippos vacuuming simultaneously)
+ smippo https://example.com
+
+ # Limit to 4 workers (for rate-limited sites)
+ smippo https://example.com --workers 4
+
+ # Single worker (sequential, safest)
+ smippo https://example.com --workers 1
+
+ # Maximum speed (use with caution)
+ smippo https://example.com --workers 16
+
+ # Limit total pages
+ smippo https://example.com --max-pages 100
+
+ # Limit total time
+ smippo https://example.com --max-time 300  # 5 minutes
+
+ # Rate limiting (delay between requests per worker)
+ smippo https://example.com --rate-limit 1000  # 1 second between requests
+ ```
+
+ **The Vacuum Architecture:**
+
+ Each worker operates like an independent hippo, vacuuming up:
+
+ - Fully rendered pages (after JavaScript execution)
+ - All network resources (images, fonts, stylesheets, API responses)
+ - Network metadata (captured in HAR files)
+ - Link structures (for recursive crawling)
+
+ All captured content is then **structured** into organized mirrors that preserve original paths and relationships.
+
+ **Tips for optimal performance:**
+
+ - Use `--workers 1` for sites with strict rate limiting
+ - Use `--workers 4-8` for most sites (default: 8)
+ - Use `--workers 16` only for fast servers you control
+ - Combine `--workers` with `--rate-limit` for polite crawling
+
+ ### Continue/Update
+
+ ```bash
+ # Continue an interrupted capture
+ smippo continue
+
+ # Update an existing mirror
+ smippo update
+ ```
+
+ ### Serve
+
+ Serve captured sites locally with a built-in web server:
+
+ ```bash
+ # Serve with auto port detection
+ smippo serve ./site
+
+ # Specify port
+ smippo serve ./site --port 3000
+
+ # Open browser automatically
+ smippo serve ./site --open
+
+ # Show all requests
+ smippo serve ./site --verbose
+ ```
+
+ The server provides:
+
+ - **Auto port detection** — Finds next available port if default is busy
+ - **Proper MIME types** — Correct content-type headers for all file types
+ - **CORS support** — Enabled by default for local development
+ - **Nice terminal UI** — Shows clickable URL and request logs
+
+ ### Static Mode
+
+ For any site, use `--static` to strip scripts for true offline viewing:
+
+ ```bash
+ # Capture as static HTML (removes JS, keeps rendered content)
+ smippo https://example.com --static --external-assets
+
+ # Then serve
+ smippo serve ./site --open
+ ```
+
+ ## Structured Output
+
+ Smippo creates **structured mirrors** that preserve the original URL structure and relationships. Every page, every resource, every network request is organized and stored in a logical hierarchy:
+
+ ```
+ site/
+ ├── example.com/
+ │   ├── index.html
+ │   ├── about/
+ │   │   └── index.html
+ │   └── assets/
+ │       ├── style.css
+ │       └── logo.png
+ ├── .smippo/
+ │   ├── cache.json      # Metadata cache
+ │   ├── network.har     # HAR file
+ │   ├── manifest.json   # Capture manifest
+ │   └── log.txt         # Capture log
+ └── index.html          # Entry point
+ ```
+
+ ## Programmatic API
+
+ ```javascript
+ import {capture, Crawler, createServer} from 'smippo';
+
+ // Simple capture
+ const result = await capture('https://example.com', {
+   output: './mirror',
+   depth: 2,
+ });
+
+ console.log(`Captured ${result.stats.pagesCapt} pages`);
+
+ // Advanced usage with events
+ const crawler = new Crawler({
+   url: 'https://example.com',
+   output: './mirror',
+   depth: 3,
+   scope: 'domain',
+ });
+
+ crawler.on('page:complete', ({url, size}) => {
+   console.log(`Captured: ${url} (${size} bytes)`);
+ });
+
+ crawler.on('error', ({url, error}) => {
+   console.error(`Failed: ${url} - ${error.message}`);
+ });
+
+ await crawler.start();
+
+ // Start a server programmatically
+ const server = await createServer({
+   directory: './mirror',
+   port: 8080,
+   open: true, // Opens browser automatically
+ });
+
+ console.log(`Server running at ${server.url}`);
+
+ // Later: stop the server
+ await server.close();
+ ```
+
+ > 📖 **For complete API documentation, see the [Programmatic API guide](https://smippo.com/api-reference/programmatic-api) on smippo.com**
 
  ## Contributing
 
  Contributions are welcome! Whether it's bug reports, feature requests, or pull requests — all contributions help make Smippo better.
 
- Please read our [Contributing Guide](CONTRIBUTING.md) for details on development setup, code style guidelines, and the pull request process.
+ Please read our [Contributing Guide](CONTRIBUTING.md) for details on:
+
+ - Development setup
+ - Code style guidelines
+ - Pull request process
+ - Testing requirements
 
  Quick start:
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "smippo",
-   "version": "0.0.1",
+   "version": "0.0.5",
    "description": "S.M.I.P.P.O. — Structured Mirroring of Internet Pages and Public Objects. Modern website copier that captures sites exactly as they appear in your browser.",
    "main": "src/index.js",
    "bin": {
@@ -171,6 +171,30 @@ export function rewriteLinks(html, pageUrl, urlMap, _options = {}) {
      }
    });
 
+   // Rewrite SVG <image>, <use>, <feImage> xlink:href and href attributes
+   $('image[xlink\\:href], use[xlink\\:href], feImage[xlink\\:href]').each(
+     (_, el) => {
+       const href = $(el).attr('xlink:href');
+       if (shouldSkipUrl(href)) return;
+
+       const localPath = getLocalPath(href);
+       if (localPath) {
+         $(el).attr('xlink:href', localPath);
+       }
+     },
+   );
+
+   // SVG 2 uses href without namespace
+   $('image[href], use[href], feImage[href]').each((_, el) => {
+     const href = $(el).attr('href');
+     if (shouldSkipUrl(href)) return;
+
+     const localPath = getLocalPath(href);
+     if (localPath) {
+       $(el).attr('href', localPath);
+     }
+   });
+
    // Rewrite style attributes
    $('[style]').each((_, el) => {
      const style = $(el).attr('style');
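
The SVG hunk above rewrites `xlink:href` and `href` on `<image>`, `<use>`, and `<feImage>` elements through the helpers `shouldSkipUrl` and `getLocalPath`, whose bodies do not appear in this diff. The following is only a rough standalone sketch of the same idea, not the package's implementation: it assumes a plain `Map` from remote URL to local path and uses a regex pass in place of the selector-based `$` traversal. The names `rewriteSvgRefs` and `sampleMap`, the stand-in `shouldSkipUrl`, and the sample URLs are all hypothetical.

```javascript
// Hypothetical stand-in: the real shouldSkipUrl is not shown in the diff.
// Assumption: empty values, same-document fragments, and data: URIs are left alone.
function shouldSkipUrl(url) {
  return !url || url.startsWith('#') || url.startsWith('data:');
}

// Rewrites href and xlink:href on <image>, <use>, and <feImage> tags only.
// urlMap is assumed to be a Map of remote URL -> local path.
function rewriteSvgRefs(html, urlMap) {
  const tagPattern = /<(?:image|use|feImage)\b[^>]*>/g;
  const attrPattern = /(\s(?:xlink:)?href=")([^"]*)"/g;

  return html.replace(tagPattern, tag =>
    tag.replace(attrPattern, (attr, prefix, href) => {
      if (shouldSkipUrl(href)) return attr; // keep fragments such as "#icon"
      const localPath = urlMap.get(href);
      // Leave the attribute untouched when no local copy is known.
      return localPath ? `${prefix}${localPath}"` : attr;
    }),
  );
}

// Usage with made-up data.
const sampleMap = new Map([
  ['https://example.com/a.png', './assets/a.png'],
]);
const sampleSvg =
  '<svg><image xlink:href="https://example.com/a.png" href="https://example.com/a.png"/>' +
  '<use href="#icon"/></svg>';
console.log(rewriteSvgRefs(sampleSvg, sampleMap));
```

Handling both attribute spellings matters because older SVG content uses the namespaced `xlink:href` while SVG 2 content uses plain `href`, and a single element may carry both. The sketch only handles double-quoted attribute values, which a real HTML parser would not be limited to.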