webpeel 0.1.2 → 0.3.0

package/README.md CHANGED
@@ -2,11 +2,12 @@
2
2
 
3
3
  [![npm version](https://img.shields.io/npm/v/webpeel.svg)](https://www.npmjs.com/package/webpeel)
4
4
  [![npm downloads](https://img.shields.io/npm/dm/webpeel.svg)](https://www.npmjs.com/package/webpeel)
5
+ [![GitHub stars](https://img.shields.io/github/stars/JakeLiuMe/webpeel.svg)](https://github.com/JakeLiuMe/webpeel/stargazers)
5
6
  [![CI](https://github.com/JakeLiuMe/webpeel/actions/workflows/ci.yml/badge.svg)](https://github.com/JakeLiuMe/webpeel/actions/workflows/ci.yml)
6
7
  [![TypeScript](https://img.shields.io/badge/TypeScript-5.6-blue.svg)](https://www.typescriptlang.org/)
7
8
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
8
9
 
9
- Turn any web page into clean markdown. Zero config. Free forever.
10
+ Turn any web page into clean markdown. **Stealth mode. Crawl mode. Zero config. Free forever.**
10
11
 
11
12
  ```bash
12
13
  npx webpeel https://news.ycombinator.com
@@ -36,17 +37,26 @@ npx webpeel https://news.ycombinator.com
36
37
  |---|:---:|:---:|:---:|:---:|
37
38
  | **Local execution** | ✅ Free forever | ❌ Cloud only | ❌ Cloud only | ✅ Free |
38
39
  | **JS rendering** | ✅ Auto-escalates | ✅ Always | ❌ No | ❌ No |
39
- | **Anti-bot handling** | ✅ Stealth mode | ✅ Yes | ⚠️ Limited | ❌ No |
40
+ | **Stealth mode** | ✅ Built-in | ✅ Yes | ⚠️ Limited | ❌ No |
41
+ | **Crawl mode** | ✅ Built-in | ✅ Yes | ❌ No | ❌ No |
40
42
  | **MCP Server** | ✅ Built-in | ✅ Separate repo | ❌ No | ✅ Yes |
41
43
  | **Zero config** | ✅ `npx webpeel` | ❌ API key required | ❌ API key required | ✅ Yes |
42
44
  | **Free tier** | ∞ Unlimited local | 500 pages (one-time) | 1000 req/month | ∞ Local only |
43
- | **Hosted API** | $9/mo (5K pages) | $16/mo (3K pages) | $200/mo (Starter) | N/A |
44
- | **Credit rollover** | Up to 1 month | ❌ Expire monthly | ❌ N/A | ❌ N/A |
45
+ | **Hosted API** | $9/mo (1,250/wk) | $16/mo (3K/mo) | $200/mo (Starter) | N/A |
46
+ | **Weekly reset** | N/A | ❌ Monthly only | ❌ Monthly only | ❌ N/A |
47
+ | **Extra usage** | N/A | ✅ Pay-as-you-go | ❌ Upgrade only | N/A |
48
+ | **Rollover** | N/A | ✅ 1 week | ❌ Expire monthly | ❌ N/A |
45
49
  | **Soft limits** | ✅ Never blocked | ❌ Hard cut-off | ❌ Rate limited | ❌ N/A |
46
50
  | **Markdown output** | ✅ Optimized for AI | ✅ Yes | ✅ Yes | ⚠️ Basic |
47
51
 
48
52
  **WebPeel gives you Firecrawl's power without the price tag.** Run locally for free, or use our hosted API when you need scale.
49
53
 
54
+ ### Highlights
55
+
56
+ 1. **🎭 Stealth Mode** — Bypass bot detection with playwright-extra stealth plugin. Works on sites that block regular scrapers.
57
+ 2. **🕷️ Crawl Mode** — Follow links and extract entire sites. Respects robots.txt and rate limits automatically.
58
+ 3. **💰 Actually Free** — Run unlimited requests locally. No API keys, no credit cards, no surprises. Open source MIT.
59
+
50
60
  ---
51
61
 
52
62
  ## Quick Start
@@ -57,6 +67,12 @@ npx webpeel https://news.ycombinator.com
57
67
  # Basic usage
58
68
  npx webpeel https://example.com
59
69
 
70
+ # Stealth mode (bypass bot detection)
71
+ npx webpeel https://protected-site.com --stealth
72
+
73
+ # Crawl a website (follow links, respect robots.txt)
74
+ npx webpeel crawl https://example.com --max-pages 20 --max-depth 2
75
+
60
76
  # JSON output with metadata
61
77
  npx webpeel https://example.com --json
62
78
 
@@ -91,9 +107,9 @@ const result = await peel('https://example.com', {
91
107
  });
92
108
  ```
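> Editor's note — a rough illustration of the new 0.3.0 options from the library side. The option names below are the ones the CLI forwards to `peel` (stealth, selector, exclude, headers); treat the exact option shape as an assumption rather than documented API.

```typescript
import { peel, cleanup } from 'webpeel';

// Stealth-fetch one page and keep only the <article> element.
// stealth / selector / exclude / headers mirror the CLI flags added in 0.3.0.
const result = await peel('https://protected-site.com/post', {
  stealth: true,                        // implies browser rendering
  selector: 'article',                  // extract just this element
  exclude: ['.sidebar', '.ads'],        // strip noisy regions
  headers: { Authorization: 'Bearer token' },
});

console.log(result.content);            // clean markdown
await cleanup();                        // close the headless browser, if one was opened
```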
93
109
 
94
- ### MCP Server (Claude Desktop, Cursor, VS Code)
110
+ ### MCP Server (Claude Desktop, Cursor, VS Code, Windsurf)
95
111
 
96
- WebPeel provides two MCP tools: `webpeel_fetch` (fetch a URL) and `webpeel_search` (DuckDuckGo search + fetch results).
112
+ WebPeel provides four MCP tools: `webpeel_fetch` (fetch a URL), `webpeel_search` (search the web), `webpeel_batch` (fetch multiple URLs), and `webpeel_crawl` (crawl a site).
97
113
 
98
114
  #### Claude Desktop
99
115
 
@@ -145,6 +161,50 @@ Or install with one click:
145
161
  [![Install in Claude Desktop](https://img.shields.io/badge/Install-Claude%20Desktop-5B3FFF?style=for-the-badge&logo=anthropic)](https://mcp.so/install/webpeel?for=claude)
146
162
  [![Install in VS Code](https://img.shields.io/badge/Install-VS%20Code-007ACC?style=for-the-badge&logo=visualstudiocode)](https://mcp.so/install/webpeel?for=vscode)
147
163
 
164
+ #### Windsurf
165
+
166
+ Add to `~/.codeium/windsurf/mcp_config.json`:
167
+
168
+ ```json
169
+ {
170
+ "mcpServers": {
171
+ "webpeel": {
172
+ "command": "npx",
173
+ "args": ["-y", "webpeel", "mcp"]
174
+ }
175
+ }
176
+ }
177
+ ```
178
+
179
+ ---
180
+
181
+ ## Use with Claude Code
182
+
183
+ One command to add WebPeel to Claude Code:
184
+
185
+ ```bash
186
+ claude mcp add webpeel -- npx -y webpeel mcp
187
+ ```
188
+
189
+ Or add to your project's `.mcp.json` for team sharing:
190
+
191
+ ```json
192
+ {
193
+ "mcpServers": {
194
+ "webpeel": {
195
+ "command": "npx",
196
+ "args": ["-y", "webpeel", "mcp"]
197
+ }
198
+ }
199
+ }
200
+ ```
201
+
202
+ This gives Claude Code access to:
203
+ - **webpeel_fetch** — Fetch any URL as clean markdown (with stealth mode for protected sites)
204
+ - **webpeel_search** — Search the web via DuckDuckGo
205
+ - **webpeel_batch** — Fetch multiple URLs concurrently
206
+ - **webpeel_crawl** — Crawl websites following links
207
+
148
208
  ---
149
209
 
150
210
  ## How It Works: Smart Escalation
@@ -156,16 +216,16 @@ WebPeel tries the fastest method first, then escalates only when needed:
156
216
  │ Smart Escalation │
157
217
  └─────────────────────────────────────────────────────────────┘
158
218
 
159
- Simple HTTP Fetch        Browser Rendering       Stealth Mode
160
-      ~200ms                  ~2 seconds              ~5 seconds
219
+ Simple HTTP Fetch        Browser Rendering       Stealth Mode
220
+      ~200ms                  ~2 seconds              ~5 seconds
161
221
        │                          │                       │
162
222
  ├─ User-Agent headers    ├─ Full JS execution     ├─ Anti-detect
163
- ├─ Cheerio parsing       ├─ Wait for content      ├─ Proxy rotation
164
- ├─ Fast & cheap          ├─ Screenshots           └─ Cloudflare bypass
165
-                              │                        │
166
-                              ▼                        ▼
167
- Works for 80%            Works for 19%            Works for 1%
168
- of websites              (JS-heavy sites)         (heavily protected)
223
+ ├─ Cheerio parsing       ├─ Wait for content      ├─ Fingerprint mask
224
+ ├─ Fast & cheap          ├─ Screenshots           ├─ Cloudflare bypass
225
+                              │                        │
226
+                              ▼                        ▼
227
+ Works for 80%            Works for 15%            Works for 5%
228
+ of websites              (JS-heavy sites)         (bot-protected)
169
229
  ```
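> Editor's note — the escalation above is, in effect, "try the cheap tier, keep the result only if it looks complete, otherwise pay for the next tier." A minimal sketch of that control flow (illustrative only; the tier functions and quality checks are placeholders, not webpeel's actual internals):

```typescript
type Fetcher = (url: string) => Promise<string>;

interface Tier {
  name: 'http' | 'browser' | 'stealth';
  fetch: Fetcher;                        // e.g. plain HTTP, headless browser, stealth browser
  looksGood: (html: string) => boolean;  // heuristic: did this tier get real content?
}

// Try each tier in order and stop at the first acceptable result, so the
// ~200ms HTTP path handles most pages and a browser only spins up when needed.
async function smartFetch(url: string, tiers: Tier[]): Promise<string> {
  let last = '';
  for (const tier of tiers) {
    last = await tier.fetch(url);
    if (tier.looksGood(last)) return last;
  }
  return last; // the last tier's output is returned rather than throwing
}
```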
170
230
 
171
231
  **Why this matters:**
@@ -257,29 +317,46 @@ curl "https://webpeel-api.onrender.com/v1/fetch?url=https://example.com" \
257
317
  -H "Authorization: Bearer wp_live_your_api_key"
258
318
  ```
259
319
 
260
- ### Pricing
320
+ ### Pricing — Weekly Reset Model
321
+
322
+ Usage resets every **Monday at 00:00 UTC**, just like Claude Code.
323
+
324
+ | Plan | Price | Weekly Fetches | Burst Limit | Stealth Mode | Extra Usage |
325
+ |------|------:|---------------:|:-----------:|:------------:|:-----------:|
326
+ | **Local CLI** | $0 | ∞ Unlimited | N/A | ✅ | N/A |
327
+ | **Cloud Free** | $0 | 125/wk (~500/mo) | 25/hr | ❌ | ❌ |
328
+ | **Cloud Pro** | $9/mo | 1,250/wk (~5K/mo) | 100/hr | ✅ | ✅ |
329
+ | **Cloud Max** | $29/mo | 6,250/wk (~25K/mo) | 500/hr | ✅ | ✅ |
330
+
331
+ **Three layers of usage control:**
332
+ 1. **Burst limit** — Per-hour cap (25/hr free, 100/hr Pro, 500/hr Max) prevents hammering
333
+ 2. **Weekly limit** — Main usage gate, resets every Monday
334
+ 3. **Extra usage** — When you hit your weekly limit, keep fetching at pay-as-you-go rates
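
> Editor's note — read together, the three layers behave like a small decision ladder. An illustrative sketch only (not the hosted API's code), with Pro-tier numbers plugged in:

```typescript
interface Usage {
  hourly: number;             // fetches in the current hour
  weekly: number;             // fetches since Monday 00:00 UTC
  extraUsageEnabled: boolean; // pay-as-you-go toggle (Pro/Max)
}

type Verdict = 'allow' | 'allow-billed' | 'throttle' | 'degrade-to-http';

function gate(u: Usage, burst = 100, weekly = 1250): Verdict {
  if (u.hourly >= burst) return 'throttle';        // 1. per-hour burst cap
  if (u.weekly < weekly) return 'allow';           // 2. main weekly gate
  if (u.extraUsageEnabled) return 'allow-billed';  // 3. pay-as-you-go overflow
  return 'degrade-to-http';                        // soft limit: never fully blocked
}
```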
261
335
 
262
- | Plan | Price | Fetches/Month | JS Rendering | Key Features |
263
- |------|------:|---------------:|:------------:|----------|
264
- | **Local CLI** | $0 | ∞ Unlimited | ✅ | Full power, your machine |
265
- | **Cloud Free** | $0 | 500 | ❌ | Soft limits — never blocked |
266
- | **Cloud Pro** | $9/mo | 5,000 | ✅ | Credit rollover, soft limits |
267
- | **Cloud Max** | $29/mo | 25,000 | ✅ | Priority queue, credit rollover |
336
+ **Extra usage rates (Pro/Max only):**
337
+ | Fetch Type | Cost |
338
+ |-----------|------|
339
+ | Basic (HTTP) | $0.002 |
340
+ | Stealth (browser) | $0.01 |
341
+ | Search | $0.001 |
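
> Editor's note — worked example from the rates above: a Pro user who runs 200 extra basic fetches and 50 extra stealth fetches in a week would be billed roughly 200 × $0.002 + 50 × $0.01 = $0.90.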
268
342
 
269
- ### Why WebPeel Pro Beats Firecrawl
343
+ ### Why WebPeel Beats Firecrawl
270
344
 
271
345
  | Feature | WebPeel Local | WebPeel Pro | Firecrawl Hobby |
272
346
  |---------|:-------------:|:-----------:|:---------------:|
273
347
  | **Price** | $0 | $9/mo | $16/mo |
274
- | **Monthly Fetches** | ∞ | 5,000 | 3,000 |
275
- | **Credit Rollover** | N/A | ✅ 1 month | ❌ Expire monthly |
348
+ | **Weekly Fetches** | ∞ | 1,250/wk | ~750/wk |
349
+ | **Rollover** | N/A | ✅ 1 week | ❌ Expire monthly |
276
350
  | **Soft Limits** | ✅ Always | ✅ Never locked out | ❌ Hard cut-off |
351
+ | **Extra Usage** | N/A | ✅ Pay-as-you-go | ❌ Upgrade only |
277
352
  | **Self-Host** | ✅ MIT | N/A | ❌ AGPL |
278
353
 
279
354
  **Key differentiators:**
280
- - **Soft limits on every tier** — When you hit your limit, we degrade to HTTP-only instead of blocking you. Even free users are never locked out.
281
- - **Credits roll over** — Unused fetches carry forward for 1 month (Firecrawl expires monthly)
282
- - **CLI is always free** — No vendor lock-in. Run unlimited locally forever.
355
+ - **Weekly resets** — Your usage refreshes every Monday, not once a month
356
+ - **Soft limits on every tier** — At 100%, we degrade to HTTP-only instead of blocking you
357
+ - **Extra usage** — Pro/Max users can toggle on pay-as-you-go with spending caps (no surprise bills)
358
+ - **Rollover** — Unused fetches carry forward 1 week
359
+ - **CLI is always free** — No vendor lock-in. Run unlimited locally forever
283
360
 
284
361
  See pricing at [webpeel.dev](https://webpeel.dev/#pricing)
285
362
 
@@ -388,12 +465,19 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
388
465
  - [x] CLI with smart escalation
389
466
  - [x] TypeScript library
390
467
  - [x] MCP server for Claude/Cursor/VS Code
391
- - [ ] Hosted API with authentication
392
- - [ ] Rate limiting and caching
393
- - [ ] Batch processing API
394
- - [ ] Screenshot capture
468
+ - [x] Hosted API with authentication and usage tracking
469
+ - [x] Rate limiting and caching
470
+ - [x] Batch processing API (`batch <file>`)
471
+ - [x] Screenshot capture (`--screenshot`)
472
+ - [x] CSS selector filtering (`--selector`, `--exclude`)
473
+ - [x] DuckDuckGo search (`search <query>`)
474
+ - [x] Custom headers and cookies
475
+ - [x] Weekly reset usage model with extra usage
476
+ - [x] Stealth mode (playwright-extra + anti-detect)
477
+ - [x] Crawl mode (follow links, respect robots.txt)
395
478
  - [ ] PDF extraction
396
479
  - [ ] Webhook notifications for monitoring
480
+ - [ ] AI CAPTCHA solving (planned)
397
481
 
398
482
  Vote on features and roadmap at [GitHub Discussions](https://github.com/JakeLiuMe/webpeel/discussions).
399
483
 
@@ -408,13 +492,13 @@ A: WebPeel runs locally for free (Firecrawl is cloud-only). We also have smart e
408
492
  A: Yes! Run `npm run serve` to start the API server. See [docs/self-hosting.md](docs/self-hosting.md) (coming soon).
409
493
 
410
494
  **Q: Does this violate websites' Terms of Service?**
411
- A: WebPeel respects `robots.txt` by default. Always check a site's ToS before scraping at scale.
495
+ A: WebPeel is a tool — how you use it is up to you. Always check a site's ToS before fetching at scale. We recommend respecting `robots.txt` in your own workflows.
412
496
 
413
497
  **Q: What about CAPTCHA and Cloudflare?**
414
- A: WebPeel handles most Cloudflare challenges automatically. For CAPTCHAs, you'll need a solving service (not included).
498
+ A: WebPeel handles most Cloudflare challenges automatically via stealth mode. AI-powered CAPTCHA solving is on our roadmap.
415
499
 
416
500
  **Q: Can I use this in production?**
417
- A: Yes, but be mindful of rate limits. The hosted API (coming soon) is better for high-volume production use.
501
+ A: Yes! The hosted API at `https://webpeel-api.onrender.com` is production-ready with authentication, rate limiting, and usage tracking.
418
502
 
419
503
  ---
420
504
 
package/dist/cli.js CHANGED
@@ -14,15 +14,18 @@
14
14
  */
15
15
  import { Command } from 'commander';
16
16
  import ora from 'ora';
17
- import { peel, cleanup } from './index.js';
17
+ import { writeFileSync } from 'fs';
18
+ import { peel, peelBatch, cleanup } from './index.js';
18
19
  const program = new Command();
19
20
  program
20
21
  .name('webpeel')
21
22
  .description('Fast web fetcher for AI agents')
22
- .version('0.1.2');
23
+ .version('0.3.0')
24
+ .enablePositionalOptions();
23
25
  program
24
26
  .argument('[url]', 'URL to fetch')
25
27
  .option('-r, --render', 'Use headless browser (for JS-heavy sites)')
28
+ .option('--stealth', 'Use stealth mode to bypass bot detection (auto-enables --render)')
26
29
  .option('-w, --wait <ms>', 'Wait time after page load (ms)', parseInt)
27
30
  .option('--html', 'Output raw HTML instead of markdown')
28
31
  .option('--text', 'Output plain text instead of markdown')
@@ -30,6 +33,12 @@ program
30
33
  .option('-t, --timeout <ms>', 'Request timeout (ms)', parseInt, 30000)
31
34
  .option('--ua <agent>', 'Custom user agent')
32
35
  .option('-s, --silent', 'Silent mode (no spinner)')
36
+ .option('--screenshot [path]', 'Take a screenshot (optionally save to file path)')
37
+ .option('--full-page', 'Full-page screenshot (use with --screenshot)')
38
+ .option('--selector <css>', 'CSS selector to extract (e.g., "article", ".content")')
39
+ .option('--exclude <selectors...>', 'CSS selectors to exclude (e.g., ".sidebar" ".ads")')
40
+ .option('-H, --header <header...>', 'Custom headers (e.g., "Authorization: Bearer token")')
41
+ .option('--cookie <cookie...>', 'Cookies to set (e.g., "session=abc123")')
33
42
  .action(async (url, options) => {
34
43
  if (!url) {
35
44
  console.error('Error: URL is required\n');
@@ -65,12 +74,35 @@ program
65
74
  console.error('Error: Wait time must be between 0 and 60000ms');
66
75
  process.exit(1);
67
76
  }
77
+ // Parse custom headers
78
+ let headers;
79
+ if (options.header && options.header.length > 0) {
80
+ headers = {};
81
+ for (const header of options.header) {
82
+ const colonIndex = header.indexOf(':');
83
+ if (colonIndex === -1) {
84
+ console.error(`Error: Invalid header format: ${header}`);
85
+ console.error('Expected format: "Key: Value"');
86
+ process.exit(1);
87
+ }
88
+ const key = header.slice(0, colonIndex).trim();
89
+ const value = header.slice(colonIndex + 1).trim();
90
+ headers[key] = value;
91
+ }
92
+ }
68
93
  // Build peel options
69
94
  const peelOptions = {
70
95
  render: options.render || false,
96
+ stealth: options.stealth || false,
71
97
  wait: options.wait || 0,
72
98
  timeout: options.timeout,
73
99
  userAgent: options.ua,
100
+ screenshot: options.screenshot !== undefined,
101
+ screenshotFullPage: options.fullPage || false,
102
+ selector: options.selector,
103
+ exclude: options.exclude,
104
+ headers,
105
+ cookies: options.cookie,
74
106
  };
75
107
  // Determine format
76
108
  if (options.html) {
@@ -87,12 +119,42 @@ program
87
119
  if (spinner) {
88
120
  spinner.succeed(`Fetched in ${result.elapsed}ms using ${result.method} method`);
89
121
  }
90
- // Output results
122
+ // Handle screenshot saving
123
+ if (options.screenshot && result.screenshot) {
124
+ const screenshotPath = typeof options.screenshot === 'string'
125
+ ? options.screenshot
126
+ : 'screenshot.png';
127
+ const screenshotBuffer = Buffer.from(result.screenshot, 'base64');
128
+ writeFileSync(screenshotPath, screenshotBuffer);
129
+ if (!options.silent) {
130
+ console.error(`Screenshot saved to: ${screenshotPath}`);
131
+ }
132
+ // Remove screenshot from JSON output if saving to file
133
+ if (typeof options.screenshot === 'string') {
134
+ delete result.screenshot;
135
+ }
136
+ }
137
+ // Output results with proper stdout flushing
91
138
  if (options.json) {
92
- console.log(JSON.stringify(result, null, 2));
139
+ const jsonStr = JSON.stringify(result, null, 2);
140
+ await new Promise((resolve, reject) => {
141
+ process.stdout.write(jsonStr + '\n', (err) => {
142
+ if (err)
143
+ reject(err);
144
+ else
145
+ resolve();
146
+ });
147
+ });
93
148
  }
94
149
  else {
95
- console.log(result.content);
150
+ await new Promise((resolve, reject) => {
151
+ process.stdout.write(result.content + '\n', (err) => {
152
+ if (err)
153
+ reject(err);
154
+ else
155
+ resolve();
156
+ });
157
+ });
96
158
  }
97
159
  // Clean up and exit
98
160
  await cleanup();
@@ -112,15 +174,279 @@ program
112
174
  process.exit(1);
113
175
  }
114
176
  });
115
- // Future commands
177
+ // Search command
178
+ program
179
+ .command('search <query>')
180
+ .description('Search using DuckDuckGo')
181
+ .option('-n, --count <n>', 'Number of results (1-10)', '5')
182
+ .option('--json', 'Output as JSON')
183
+ .option('-s, --silent', 'Silent mode')
184
+ .action(async (query, options) => {
185
+ const isJson = options.json;
186
+ const isSilent = options.silent;
187
+ const count = parseInt(options.count) || 5;
188
+ const spinner = isSilent ? null : ora('Searching...').start();
189
+ try {
190
+ // Import the search function dynamically
191
+ const { fetch: undiciFetch } = await import('undici');
192
+ const { load } = await import('cheerio');
193
+ const searchUrl = `https://html.duckduckgo.com/html/?q=${encodeURIComponent(query)}`;
194
+ const response = await undiciFetch(searchUrl, {
195
+ headers: {
196
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
197
+ },
198
+ });
199
+ if (!response.ok) {
200
+ throw new Error(`Search failed: HTTP ${response.status}`);
201
+ }
202
+ const html = await response.text();
203
+ const $ = load(html);
204
+ const results = [];
205
+ $('.result').each((_i, elem) => {
206
+ if (results.length >= count)
207
+ return;
208
+ const $result = $(elem);
209
+ const title = $result.find('.result__title').text().trim();
210
+ const rawUrl = $result.find('.result__a').attr('href') || '';
211
+ const snippet = $result.find('.result__snippet').text().trim();
212
+ if (!title || !rawUrl)
213
+ return;
214
+ // Extract actual URL from DuckDuckGo redirect
215
+ let url = rawUrl;
216
+ try {
217
+ const ddgUrl = new URL(rawUrl, 'https://duckduckgo.com');
218
+ const uddg = ddgUrl.searchParams.get('uddg');
219
+ if (uddg) {
220
+ url = decodeURIComponent(uddg);
221
+ }
222
+ }
223
+ catch {
224
+ // Use raw URL if parsing fails
225
+ }
226
+ // Validate final URL
227
+ try {
228
+ const parsed = new URL(url);
229
+ if (!['http:', 'https:'].includes(parsed.protocol)) {
230
+ return;
231
+ }
232
+ url = parsed.href;
233
+ }
234
+ catch {
235
+ return;
236
+ }
237
+ results.push({
238
+ title: title.slice(0, 200),
239
+ url,
240
+ snippet: snippet.slice(0, 500)
241
+ });
242
+ });
243
+ if (spinner) {
244
+ spinner.succeed(`Found ${results.length} results`);
245
+ }
246
+ if (isJson) {
247
+ const jsonStr = JSON.stringify(results, null, 2);
248
+ await new Promise((resolve, reject) => {
249
+ process.stdout.write(jsonStr + '\n', (err) => {
250
+ if (err)
251
+ reject(err);
252
+ else
253
+ resolve();
254
+ });
255
+ });
256
+ }
257
+ else {
258
+ for (const result of results) {
259
+ console.log(`\n${result.title}`);
260
+ console.log(result.url);
261
+ console.log(result.snippet);
262
+ }
263
+ }
264
+ process.exit(0);
265
+ }
266
+ catch (error) {
267
+ if (spinner) {
268
+ spinner.fail('Search failed');
269
+ }
270
+ if (error instanceof Error) {
271
+ console.error(`\nError: ${error.message}`);
272
+ }
273
+ else {
274
+ console.error('\nError: Unknown error occurred');
275
+ }
276
+ process.exit(1);
277
+ }
278
+ });
279
+ // Batch command
280
+ program
281
+ .command('batch <file>')
282
+ .description('Fetch multiple URLs')
283
+ .option('-c, --concurrency <n>', 'Max concurrent fetches (default: 3)', '3')
284
+ .option('-o, --output <dir>', 'Output directory (one file per URL)')
285
+ .option('--json', 'Output as JSON array')
286
+ .option('-s, --silent', 'Silent mode')
287
+ .option('-r, --render', 'Use headless browser')
288
+ .option('--selector <css>', 'CSS selector to extract')
289
+ .action(async (file, options) => {
290
+ const isJson = options.json;
291
+ const isSilent = options.silent;
292
+ const shouldRender = options.render;
293
+ const selector = options.selector;
294
+ const spinner = isSilent ? null : ora('Loading URLs...').start();
295
+ try {
296
+ const { readFileSync } = await import('fs');
297
+ // Read URLs from file
298
+ let urls;
299
+ try {
300
+ const content = readFileSync(file, 'utf-8');
301
+ urls = content.split('\n')
302
+ .map(line => line.trim())
303
+ .filter(line => line && !line.startsWith('#'));
304
+ }
305
+ catch (error) {
306
+ throw new Error(`Failed to read file: ${file}`);
307
+ }
308
+ if (urls.length === 0) {
309
+ throw new Error('No URLs found in file');
310
+ }
311
+ if (spinner) {
312
+ spinner.text = `Fetching ${urls.length} URLs (concurrency: ${options.concurrency})...`;
313
+ }
314
+ // Batch fetch
315
+ const results = await peelBatch(urls, {
316
+ concurrency: parseInt(options.concurrency) || 3,
317
+ render: shouldRender,
318
+ selector: selector,
319
+ });
320
+ if (spinner) {
321
+ const successCount = results.filter(r => 'content' in r).length;
322
+ spinner.succeed(`Completed: ${successCount}/${urls.length} successful`);
323
+ }
324
+ // Output results
325
+ if (isJson) {
326
+ const jsonStr = JSON.stringify(results, null, 2);
327
+ await new Promise((resolve, reject) => {
328
+ process.stdout.write(jsonStr + '\n', (err) => {
329
+ if (err)
330
+ reject(err);
331
+ else
332
+ resolve();
333
+ });
334
+ });
335
+ }
336
+ else if (options.output) {
337
+ const { writeFileSync, mkdirSync } = await import('fs');
338
+ const { join } = await import('path');
339
+ // Create output directory
340
+ mkdirSync(options.output, { recursive: true });
341
+ results.forEach((result, i) => {
342
+ const urlObj = new URL(urls[i]);
343
+ const filename = `${i + 1}_${urlObj.hostname.replace(/[^a-z0-9]/gi, '_')}.md`;
344
+ const filepath = join(options.output, filename);
345
+ if ('content' in result) {
346
+ writeFileSync(filepath, result.content);
347
+ }
348
+ else {
349
+ writeFileSync(filepath, `Error: ${result.error}`);
350
+ }
351
+ });
352
+ if (!isSilent) {
353
+ console.log(`\nResults saved to: ${options.output}`);
354
+ }
355
+ }
356
+ else {
357
+ // Print results to stdout
358
+ results.forEach((result, i) => {
359
+ console.log(`\n=== ${urls[i]} ===\n`);
360
+ if ('content' in result) {
361
+ console.log(result.content.slice(0, 500) + '...');
362
+ }
363
+ else {
364
+ console.log(`Error: ${result.error}`);
365
+ }
366
+ });
367
+ }
368
+ await cleanup();
369
+ process.exit(0);
370
+ }
371
+ catch (error) {
372
+ if (spinner) {
373
+ spinner.fail('Batch fetch failed');
374
+ }
375
+ if (error instanceof Error) {
376
+ console.error(`\nError: ${error.message}`);
377
+ }
378
+ else {
379
+ console.error('\nError: Unknown error occurred');
380
+ }
381
+ await cleanup();
382
+ process.exit(1);
383
+ }
384
+ });
116
385
  program
117
- .command('search')
118
- .argument('<query>', 'Search query')
119
- .description('Search using DuckDuckGo (future)')
120
- .action(() => {
121
- console.log('Search command not yet implemented');
122
- console.log('Coming soon: DuckDuckGo search integration');
123
- process.exit(1);
386
+ .command('crawl <url>')
387
+ .description('Crawl a website starting from a URL')
388
+ .option('--max-pages <number>', 'Maximum number of pages to crawl (default: 10, max: 100)', parseInt, 10)
389
+ .option('--max-depth <number>', 'Maximum depth to crawl (default: 2, max: 5)', parseInt, 2)
390
+ .option('--allowed-domains <domains...>', 'Only crawl these domains (default: same as starting URL)')
391
+ .option('--exclude <patterns...>', 'Exclude URLs matching these regex patterns')
392
+ .option('--ignore-robots', 'Ignore robots.txt (default: respect robots.txt)')
393
+ .option('--rate-limit <ms>', 'Rate limit between requests in ms (default: 1000)', parseInt, 1000)
394
+ .option('-r, --render', 'Use headless browser for all pages')
395
+ .option('--stealth', 'Use stealth mode for all pages')
396
+ .option('-s, --silent', 'Silent mode (no spinner)')
397
+ .option('--json', 'Output as JSON')
398
+ .action(async (url, options) => {
399
+ const { crawl } = await import('./core/crawler.js');
400
+ const spinner = options.silent ? null : ora('Crawling...').start();
401
+ try {
402
+ const results = await crawl(url, {
403
+ maxPages: options.maxPages,
404
+ maxDepth: options.maxDepth,
405
+ allowedDomains: options.allowedDomains,
406
+ excludePatterns: options.exclude,
407
+ respectRobotsTxt: !options.ignoreRobots,
408
+ rateLimitMs: options.rateLimit,
409
+ render: options.render || false,
410
+ stealth: options.stealth || false,
411
+ });
412
+ if (spinner) {
413
+ spinner.succeed(`Crawled ${results.length} pages`);
414
+ }
415
+ if (options.json) {
416
+ console.log(JSON.stringify(results, null, 2));
417
+ }
418
+ else {
419
+ results.forEach((result, i) => {
420
+ console.log(`\n${'='.repeat(60)}`);
421
+ console.log(`[${i + 1}/${results.length}] ${result.title}`);
422
+ console.log(`URL: ${result.url}`);
423
+ console.log(`Depth: ${result.depth}${result.parent ? ` (from: ${result.parent})` : ''}`);
424
+ console.log(`Links found: ${result.links.length}`);
425
+ console.log(`Elapsed: ${result.elapsed}ms`);
426
+ if (result.error) {
427
+ console.log(`ERROR: ${result.error}`);
428
+ }
429
+ else {
430
+ console.log(`\n${result.markdown.slice(0, 500)}${result.markdown.length > 500 ? '...' : ''}`);
431
+ }
432
+ });
433
+ }
434
+ await cleanup();
435
+ process.exit(0);
436
+ }
437
+ catch (error) {
438
+ if (spinner) {
439
+ spinner.fail('Crawl failed');
440
+ }
441
+ if (error instanceof Error) {
442
+ console.error(`\nError: ${error.message}`);
443
+ }
444
+ else {
445
+ console.error('\nError: Unknown error occurred');
446
+ }
447
+ await cleanup();
448
+ process.exit(1);
449
+ }
124
450
  });
125
451
  program
126
452
  .command('serve')