@monostate/node-scraper 1.8.1 → 2.0.0

package/README.md CHANGED
@@ -1,69 +1,47 @@
1
1
  # @monostate/node-scraper
2
2
 
3
- > **Lightning-fast web scraping with intelligent fallback system - 11.35x faster than Firecrawl**
3
+ > Intelligent web scraping with multi-tier fallback, 11x faster than traditional scrapers
4
4
 
5
- [![npm version](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
6
- [![Performance](https://img.shields.io/badge/Performance-11.35x_faster_than_Firecrawl-brightgreen)](../../test-results/)
7
- [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](../../LICENSE)
8
- [![Node](https://img.shields.io/badge/Node.js-18%2B-green)](https://nodejs.org/)
5
+ [![npm](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
6
+ [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/monostate/node-scraper/blob/main/LICENSE)
7
+ [![Node](https://img.shields.io/badge/Node.js-20%2B-green)](https://nodejs.org/)
9
8
 
10
- ## Quick Start
11
-
12
- ### Installation
9
+ ## Install
13
10
 
14
11
  ```bash
15
12
  npm install @monostate/node-scraper
16
- # or
17
- yarn add @monostate/node-scraper
18
- # or
19
- pnpm add @monostate/node-scraper
20
13
  ```
21
14
 
22
- **Fixed in v1.8.1**: Critical production fix - browser-pool.js now included in npm package.
23
-
24
- **New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
25
-
26
- **Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
27
-
28
- **New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
29
-
30
- **New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
15
+ Lightpanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
31
16
 
32
- **Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
33
-
34
- **Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
35
-
36
- ### Zero-Configuration Setup
37
-
38
- The package now automatically:
39
- - Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
40
- - Configures binary paths and permissions
41
- - Validates installation health on first use
42
-
43
- ### Basic Usage
17
+ ## Usage
44
18
 
45
19
  ```javascript
46
20
  import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
47
21
 
48
- // Simple one-line scraping
22
+ // Scrape with automatic method selection
49
23
  const result = await smartScrape('https://example.com');
50
- console.log(result.content); // Extracted content
51
- console.log(result.method); // Method used: direct-fetch, lightpanda, or puppeteer
24
+ console.log(result.content);
25
+ console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'
52
26
 
53
- // Take a screenshot
27
+ // Screenshots
54
28
  const screenshot = await smartScreenshot('https://example.com');
55
- console.log(screenshot.screenshot); // Base64 encoded image
29
+ const quick = await quickShot('https://example.com'); // optimized for speed
30
+
31
+ // PDFs are detected and parsed automatically
32
+ const pdf = await smartScrape('https://example.com/doc.pdf');
33
+ ```
56
34
 
57
- // Quick screenshot (optimized for speed)
58
- const quick = await quickShot('https://example.com');
59
- console.log(quick.screenshot); // Fast screenshot capture
35
+ ### Force a specific method
60
36
 
61
- // PDF parsing (automatic detection)
62
- const pdfResult = await smartScrape('https://example.com/document.pdf');
63
- console.log(pdfResult.content); // Extracted text, metadata, page count
37
+ ```javascript
38
+ const result = await smartScrape('https://example.com', { method: 'direct' });
39
+ // Also: 'lightpanda', 'puppeteer', 'auto' (default)
64
40
  ```
65
41
 
66
- ### Advanced Usage
42
+ No fallback occurs when a method is forced — useful for testing and debugging.
43
+
44
+ ### Advanced usage
67
45
 
68
46
  ```javascript
69
47
  import { BNCASmartScraper } from '@monostate/node-scraper';
@@ -71,575 +49,145 @@ import { BNCASmartScraper } from '@monostate/node-scraper';
71
49
  const scraper = new BNCASmartScraper({
72
50
  timeout: 10000,
73
51
  verbose: true,
74
- lightpandaPath: './lightpanda' // optional
75
52
  });
76
53
 
77
54
  const result = await scraper.scrape('https://complex-spa.com');
78
- console.log(result.stats); // Performance statistics
79
-
80
- await scraper.cleanup(); // Clean up resources
81
- ```
82
-
83
- ### Browser Pool Configuration (New in v1.8.0)
84
-
85
- The package now includes automatic browser instance pooling to prevent memory leaks:
86
-
87
- ```javascript
88
- // Browser pool is managed automatically with these defaults:
89
- // - Max 3 concurrent browser instances
90
- // - 5 second idle timeout before cleanup
91
- // - Automatic reuse of browser instances
92
-
93
- // For heavy workloads, you can manually clean up:
94
- const scraper = new BNCASmartScraper();
95
- // ... perform multiple scrapes ...
96
- await scraper.cleanup(); // Closes all browser instances
97
- ```
98
-
99
- **Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
100
-
101
- ### Method Override (New in v1.6.0)
102
-
103
- Force a specific scraping method instead of using automatic fallback:
104
-
105
- ```javascript
106
- // Force direct fetch (no browser)
107
- const result = await smartScrape('https://example.com', { method: 'direct' });
108
-
109
- // Force Lightpanda browser
110
- const result = await smartScrape('https://example.com', { method: 'lightpanda' });
111
-
112
- // Force Puppeteer (full Chrome)
113
- const result = await smartScrape('https://example.com', { method: 'puppeteer' });
114
-
115
- // Auto mode (default - intelligent fallback)
116
- const result = await smartScrape('https://example.com', { method: 'auto' });
117
- ```
118
-
119
- **Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
120
- - Testing specific methods in isolation
121
- - Optimizing for known site requirements
122
- - Debugging method-specific issues
55
+ const stats = scraper.getStats();
56
+ const health = await scraper.healthCheck();
123
57
 
124
- **Error Response for Forced Methods**:
125
- ```javascript
126
- {
127
- success: false,
128
- error: "Lightpanda scraping failed: [specific error]",
129
- method: "lightpanda",
130
- errorType: "network|timeout|parsing|service_unavailable",
131
- details: "Additional error context"
132
- }
58
+ await scraper.cleanup();
133
59
  ```
134
60
 
135
- ### Bulk Scraping (New in v1.8.0)
136
-
137
- Process multiple URLs efficiently with automatic request queueing and progress tracking:
61
+ ### Bulk scraping
138
62
 
139
63
  ```javascript
140
- import { bulkScrape } from '@monostate/node-scraper';
141
-
142
- // Basic bulk scraping
143
- const urls = [
144
- 'https://example1.com',
145
- 'https://example2.com',
146
- 'https://example3.com',
147
- // ... hundreds more
148
- ];
64
+ import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';
149
65
 
150
66
  const results = await bulkScrape(urls, {
151
- concurrency: 5, // Process 5 URLs at a time
152
- continueOnError: true, // Don't stop on failures
153
- progressCallback: (progress) => {
154
- console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
155
- }
67
+ concurrency: 5,
68
+ continueOnError: true,
69
+ progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
156
70
  });
157
71
 
158
- console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
159
- console.log(`Total time: ${results.stats.totalTime}ms`);
160
- console.log(`Average time per URL: ${results.stats.averageTime}ms`);
161
- ```
162
-
163
- #### Streaming Results
164
-
165
- For large datasets, use streaming to process results as they complete:
166
-
167
- ```javascript
168
- import { bulkScrapeStream } from '@monostate/node-scraper';
169
-
72
+ // Or stream results as they complete
170
73
  await bulkScrapeStream(urls, {
171
74
  concurrency: 10,
172
- onResult: async (result) => {
173
- // Process each successful result immediately
174
- await saveToDatabase(result);
175
- console.log(`✓ ${result.url} - ${result.duration}ms`);
176
- },
177
- onError: async (error) => {
178
- // Handle errors as they occur
179
- console.error(`✗ ${error.url} - ${error.error}`);
180
- },
181
- progressCallback: (progress) => {
182
- process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
183
- }
75
+ onResult: async (result) => await saveToDatabase(result),
76
+ onError: async (error) => console.error(error.url, error.error),
184
77
  });
185
78
  ```
186
79
 
187
- **Features:**
188
- - Automatic request queueing (no more memory errors!)
189
- - Configurable concurrency control
190
- - Real-time progress tracking
191
- - Continue on error or stop on first failure
192
- - Detailed statistics and method tracking
193
- - Browser instance pooling for efficiency
194
-
195
- For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
196
-
197
- ## How It Works
198
-
199
- BNCA uses a sophisticated multi-tier system with intelligent detection:
200
-
201
- ### 1. 🔄 Direct Fetch (Fastest)
202
- - Pure HTTP requests with intelligent HTML parsing
203
- - **Performance**: Sub-second responses
204
- - **Success rate**: 75% of websites
205
- - **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
206
-
207
- ### 2. 🐼 Lightpanda Browser (Fast)
208
- - Lightweight browser engine (2-3x faster than Chromium)
209
- - **Performance**: Fast JavaScript execution
210
- - **Fallback triggers**: SPA detection
211
-
212
- ### 3. 🔵 Puppeteer (Complete)
213
- - Full Chromium browser for maximum compatibility
214
- - **Performance**: Complete JavaScript execution
215
- - **Fallback triggers**: Complex interactions needed
216
-
217
- ### 📄 PDF Parser (Specialized)
218
- - Automatic PDF detection and parsing
219
- - **Features**: Text extraction, metadata, page count
220
- - **Smart Detection**: Works even when PDFs are served with wrong content-types
221
- - **Performance**: Typically 100-500ms for most PDFs
222
-
223
- ### 📸 Screenshot Methods
224
- - **Chrome CLI**: Direct Chrome screenshot capture
225
- - **Quickshot**: Optimized with retry logic and smart timeouts
226
-
227
- ## 📊 Performance Benchmark
228
-
229
- | Site Type | BNCA | Firecrawl | Speed Advantage |
230
- |-----------|------|-----------|----------------|
231
- | **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
232
- | **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
233
- | **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
234
-
235
- **Average**: 11.35x faster than Firecrawl with 100% reliability
236
-
237
- ## 🎛️ API Reference
238
-
239
- ### Convenience Functions
240
-
241
- #### `smartScrape(url, options?)`
242
- Quick scraping with intelligent fallback.
243
-
244
- #### `smartScreenshot(url, options?)`
245
- Take a screenshot of any webpage.
246
-
247
- #### `quickShot(url, options?)`
248
- Optimized screenshot capture for maximum speed.
249
-
250
- **Parameters:**
251
- - `url` (string): URL to scrape/capture
252
- - `options` (object, optional): Configuration options
253
-
254
- **Returns:** Promise<ScrapingResult>
255
-
256
- ### `BNCASmartScraper`
257
-
258
- Main scraper class with advanced features.
259
-
260
- #### Constructor Options
261
-
262
- ```javascript
263
- const scraper = new BNCASmartScraper({
264
- timeout: 10000, // Request timeout in ms
265
- retries: 2, // Number of retries per method
266
- verbose: false, // Enable detailed logging
267
- lightpandaPath: './lightpanda', // Path to Lightpanda binary
268
- userAgent: 'Mozilla/5.0 ...', // Custom user agent
269
- });
270
- ```
271
-
272
- #### Methods
273
-
274
- ##### `scraper.scrape(url, options?)`
275
-
276
- Scrape a URL with intelligent fallback.
277
-
278
- ```javascript
279
- const result = await scraper.scrape('https://example.com');
280
- ```
80
+ See [BULK_SCRAPING.md](./BULK_SCRAPING.md) for full documentation.
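The request queueing that the `concurrency` option implies can be sketched with a small limiter. This is an illustrative stand-in under assumed semantics, not `bulkScrape`'s actual implementation:

```javascript
// Run fn over items with at most `concurrency` calls in flight.
// Workers pull the next index cooperatively; no await sits between the
// bounds check and the increment, so index handout is race-free in Node.
async function mapLimit(items, concurrency, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```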
281
81
 
282
- ##### `scraper.screenshot(url, options?)`
82
+ ### AI-powered Q&A
283
83
 
284
- Take a screenshot of a webpage.
84
+ Ask questions about any website using OpenRouter, OpenAI, or local fallback:
285
85
 
286
86
  ```javascript
287
- const result = await scraper.screenshot('https://example.com');
288
- const img = result.screenshot; // data:image/png;base64,...
289
- ```
290
-
291
- ##### `scraper.quickshot(url, options?)`
292
-
293
- Quick screenshot capture - optimized for speed with retry logic.
87
+ import { askWebsiteAI } from '@monostate/node-scraper';
294
88
 
295
- ```javascript
296
- const result = await scraper.quickshot('https://example.com');
297
- // 2-3x faster than regular screenshot
89
+ const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
90
+ openRouterApiKey: process.env.OPENROUTER_API_KEY,
91
+ });
298
92
  ```
299
93
 
300
- ##### `scraper.getStats()`
94
+ API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
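The precedence above can be sketched as a tiny selector. `pickAIProvider` is a hypothetical helper, not part of the package, though the option names (`openRouterApiKey`, `openAIApiKey`, `apiKey`) are the ones the README documents:

```javascript
// Hypothetical helper illustrating the documented key precedence:
// OpenRouter > OpenAI > BNCA backend > local pattern matching.
function pickAIProvider(opts = {}) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.apiKey) return 'bnca-backend';
  return 'local'; // pattern matching, no key required
}
```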
301
95
 
302
- Get performance statistics.
96
+ ## How it works
303
97
 
304
- ```javascript
305
- const stats = scraper.getStats();
306
- console.log(stats.successRates); // Success rates by method
307
- ```
98
+ The scraper uses a three-tier fallback system:
308
99
 
309
- ##### `scraper.healthCheck()`
100
+ 1. **Direct fetch** — Pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
101
+ 2. **Lightpanda** — Lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
102
+ 3. **Puppeteer** — Full Chromium for maximum compatibility.
310
103
 
311
- Check availability of all scraping methods.
104
+ Additional specialized handlers:
105
+ - **PDF parser** — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
106
+ - **Screenshots** — Chrome CLI capture with retry logic and smart timeouts.
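The magic-bytes detection mentioned above works because every PDF begins with the ASCII signature `%PDF`, regardless of the content-type header it was served with. A minimal check (illustrative, not the package's code):

```javascript
// Sniff the first four bytes for the PDF signature "%PDF", which is why
// detection still works for PDFs served as application/octet-stream.
function looksLikePdf(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).toString('ascii') === '%PDF';
}
```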
312
107
 
313
- ```javascript
314
- const health = await scraper.healthCheck();
315
- console.log(health.status); // 'healthy' or 'unhealthy'
316
- ```
108
+ Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
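The tiered behavior can be sketched as a first-success chain. This is an illustrative reimplementation of the auto-mode idea described above, not the package's internals:

```javascript
// Try each method in order, returning the first success and recording
// the chain of attempts (mirrors direct-fetch -> lightpanda -> puppeteer).
async function withFallback(url, methods) {
  const attempted = [];
  for (const [name, fn] of methods) {
    attempted.push(name);
    try {
      return { success: true, method: name, content: await fn(url), attempted };
    } catch {
      // fall through to the next, heavier tier
    }
  }
  return { success: false, attempted };
}
```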
317
109
 
318
- ##### `scraper.cleanup()`
110
+ ## Performance
319
111
 
320
- Clean up resources (close browser instances).
112
+ | Site Type | node-scraper | Firecrawl | Speedup |
113
+ |-----------|-------------|-----------|---------|
114
+ | Wikipedia | 154ms | 4,662ms | 30.3x |
115
+ | Hacker News | 1,715ms | 4,644ms | 2.7x |
116
+ | GitHub | 9,167ms | 9,790ms | 1.1x |
321
117
 
322
- ```javascript
323
- await scraper.cleanup();
324
- ```
118
+ Average: **11.35x faster** than Firecrawl in these benchmarks, with a 100% success rate.
325
119
 
326
- ### AI-Powered Q&A
120
+ ## API Reference
327
121
 
328
- Ask questions about any website and get AI-generated answers:
122
+ ### Convenience functions
329
123
 
330
- ```javascript
331
- // Method 1: Using your own OpenRouter API key
332
- const scraper = new BNCASmartScraper({
333
- openRouterApiKey: 'your-openrouter-api-key'
334
- });
335
- const result = await scraper.askAI('https://example.com', 'What is this website about?');
124
+ | Function | Description |
125
+ |----------|-------------|
126
+ | `smartScrape(url, opts?)` | Scrape with intelligent fallback |
127
+ | `smartScreenshot(url, opts?)` | Full page screenshot |
128
+ | `quickShot(url, opts?)` | Fast screenshot capture |
129
+ | `bulkScrape(urls, opts?)` | Batch scrape multiple URLs |
130
+ | `bulkScrapeStream(urls, opts?)` | Stream results as they complete |
131
+ | `askWebsiteAI(url, question, opts?)` | AI Q&A about a webpage |
336
132
 
337
- // Method 2: Using OpenAI API (or compatible endpoints)
338
- const scraper = new BNCASmartScraper({
339
- openAIApiKey: 'your-openai-api-key',
340
- // Optional: Use a compatible endpoint like Groq, Together AI, etc.
341
- openAIBaseUrl: 'https://api.groq.com/openai'
342
- });
343
- const result = await scraper.askAI('https://example.com', 'What services do they offer?');
133
+ ### BNCASmartScraper options
344
134
 
345
- // Method 3: One-liner with OpenRouter
346
- import { askWebsiteAI } from '@monostate/node-scraper';
347
- const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
348
- openRouterApiKey: process.env.OPENROUTER_API_KEY
349
- });
350
-
351
- // Method 4: Using BNCA backend API (requires BNCA API key)
352
- const scraper = new BNCASmartScraper({
353
- apiKey: 'your-bnca-api-key'
354
- });
355
- const result = await scraper.askAI('https://example.com', 'What products are featured?');
356
- ```
357
-
358
- **API Key Priority:**
359
- 1. OpenRouter API key (`openRouterApiKey`)
360
- 2. OpenAI API key (`openAIApiKey`)
361
- 3. BNCA backend API (`apiKey`)
362
- 4. Local fallback (pattern matching - no API key required)
363
-
364
- **Configuration Options:**
365
- ```javascript
366
- const result = await scraper.askAI(url, question, {
367
- // OpenRouter specific
368
- openRouterApiKey: 'sk-or-...',
369
- model: 'meta-llama/llama-4-scout:free', // Default model
370
-
371
- // OpenAI specific
372
- openAIApiKey: 'sk-...',
373
- openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
374
- model: 'gpt-3.5-turbo',
375
-
376
- // Shared options
377
- temperature: 0.3,
378
- maxTokens: 500
379
- });
380
- ```
381
-
382
- **Response Format:**
383
135
  ```javascript
384
136
  {
385
- success: true,
386
- answer: "This website is about...",
387
- method: "direct-fetch", // Scraping method used
388
- scrapeTime: 1234, // Time to scrape in ms
389
- processing: "openrouter" // AI processing method used
137
+ timeout: 10000, // Request timeout (ms)
138
+ retries: 2, // Retries per method
139
+ verbose: false, // Detailed logging
140
+ lightpandaPath: './bin/lightpanda',
141
+ lightpandaFormat: 'html', // 'html' or 'markdown'
142
+ userAgent: 'Mozilla/5.0 ...',
143
+ openRouterApiKey: '...',
144
+ openAIApiKey: '...',
145
+ openAIBaseUrl: 'https://api.openai.com',
390
146
  }
391
147
  ```
392
148
 
393
- ### 📄 PDF Support
394
-
395
- BNCA automatically detects and parses PDF documents:
396
-
397
- ```javascript
398
- const pdfResult = await smartScrape('https://example.com/document.pdf');
399
-
400
- // Parsed content includes:
401
- const content = JSON.parse(pdfResult.content);
402
- console.log(content.title); // PDF title
403
- console.log(content.author); // Author name
404
- console.log(content.pages); // Number of pages
405
- console.log(content.text); // Full extracted text
406
- console.log(content.creationDate); // Creation date
407
- console.log(content.metadata); // Additional metadata
408
- ```
409
-
410
- **PDF Detection Methods:**
411
- - URL ending with `.pdf`
412
- - Content-Type header `application/pdf`
413
- - Binary content starting with `%PDF` (magic bytes)
414
- - Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
415
-
416
- **Limitations:**
417
- - Maximum file size: 20MB
418
- - Text extraction only (no image OCR)
419
- - Requires `pdf-parse` dependency (automatically installed)
149
+ ### Methods
420
150
 
421
- ## 📱 Next.js Integration
151
+ - `scraper.scrape(url, opts?)` — Scrape with fallback
152
+ - `scraper.screenshot(url, opts?)` — Take screenshot
153
+ - `scraper.quickshot(url, opts?)` — Fast screenshot
154
+ - `scraper.askAI(url, question, opts?)` — AI Q&A
155
+ - `scraper.getStats()` — Performance statistics
156
+ - `scraper.healthCheck()` — Check method availability
157
+ - `scraper.cleanup()` — Close browser instances
422
158
 
423
- ### API Route Example
159
+ ### Response shape
424
160
 
425
161
  ```javascript
426
- // pages/api/scrape.js or app/api/scrape/route.js
427
- import { smartScrape } from '@monostate/node-scraper';
428
-
429
- export async function POST(request) {
430
- try {
431
- const { url } = await request.json();
432
- const result = await smartScrape(url);
433
-
434
- return Response.json({
435
- success: true,
436
- data: result.content,
437
- method: result.method,
438
- time: result.performance.totalTime
439
- });
440
- } catch (error) {
441
- return Response.json({
442
- success: false,
443
- error: error.message
444
- }, { status: 500 });
445
- }
446
- }
447
- ```
448
-
449
- ### React Hook Example
450
-
451
- ```javascript
452
- // hooks/useScraper.js
453
- import { useState } from 'react';
454
-
455
- export function useScraper() {
456
- const [loading, setLoading] = useState(false);
457
- const [data, setData] = useState(null);
458
- const [error, setError] = useState(null);
459
-
460
- const scrape = async (url) => {
461
- setLoading(true);
462
- setError(null);
463
-
464
- try {
465
- const response = await fetch('/api/scrape', {
466
- method: 'POST',
467
- headers: { 'Content-Type': 'application/json' },
468
- body: JSON.stringify({ url })
469
- });
470
-
471
- const result = await response.json();
472
-
473
- if (result.success) {
474
- setData(result.data);
475
- } else {
476
- setError(result.error);
477
- }
478
- } catch (err) {
479
- setError(err.message);
480
- } finally {
481
- setLoading(false);
482
- }
483
- };
484
-
485
- return { scrape, loading, data, error };
486
- }
487
- ```
488
-
489
- ### Component Example
490
-
491
- ```javascript
492
- // components/ScraperDemo.jsx
493
- import { useScraper } from '../hooks/useScraper';
494
-
495
- export default function ScraperDemo() {
496
- const { scrape, loading, data, error } = useScraper();
497
- const [url, setUrl] = useState('');
498
-
499
- const handleScrape = () => {
500
- if (url) scrape(url);
501
- };
502
-
503
- return (
504
- <div className="p-4">
505
- <div className="flex gap-2 mb-4">
506
- <input
507
- type="url"
508
- value={url}
509
- onChange={(e) => setUrl(e.target.value)}
510
- placeholder="Enter URL to scrape..."
511
- className="flex-1 px-3 py-2 border rounded"
512
- />
513
- <button
514
- onClick={handleScrape}
515
- disabled={loading}
516
- className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
517
- >
518
- {loading ? 'Scraping...' : 'Scrape'}
519
- </button>
520
- </div>
521
-
522
- {error && (
523
- <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
524
- Error: {error}
525
- </div>
526
- )}
527
-
528
- {data && (
529
- <div className="p-3 bg-green-100 rounded">
530
- <h3 className="font-bold mb-2">Scraped Content:</h3>
531
- <pre className="text-sm overflow-auto">{data}</pre>
532
- </div>
533
- )}
534
- </div>
535
- );
162
+ {
163
+ success: true,
164
+ content: '...', // Extracted content (JSON string)
165
+ method: 'direct-fetch', // Method used
166
+ url: 'https://...',
167
+ performance: { totalTime: 154 },
168
+ stats: { ... }
536
169
  }
537
170
  ```
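Since `content` is documented as a JSON string, consumers should parse it defensively. A sketch following the field names in the shape above:

```javascript
// Pull the title out of a scrape result; returns null on failure or
// unparseable content rather than throwing.
function extractTitle(result) {
  if (!result || !result.success) return null;
  try {
    return JSON.parse(result.content).title ?? null;
  } catch {
    return null;
  }
}
```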
538
171
 
539
- ## ⚠️ Important Notes
540
-
541
- ### Server-Side Only
542
- BNCA is designed for **server-side use only** due to:
543
- - Browser automation requirements (Puppeteer)
544
- - File system access for Lightpanda binary
545
- - CORS restrictions in browsers
546
-
547
- ### Next.js Deployment
548
- - Use in API routes, not client components
549
- - Ensure Node.js 18+ in production environment
550
- - Consider adding Lightpanda binary to deployment
551
-
552
- ### Lightpanda Setup (Optional)
553
- For maximum performance, install Lightpanda:
554
-
555
- ```bash
556
- # macOS ARM64
557
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
558
- chmod +x lightpanda
559
-
560
- # Linux x64
561
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
562
- chmod +x lightpanda
563
- ```
564
-
565
- ## 🔒 Privacy & Security
566
-
567
- - **No external API calls** - all processing is local
568
- - **No data collection** - your data stays private
569
- - **Respects robots.txt** (optional enforcement)
570
- - **Configurable rate limiting**
571
-
572
- ## 📝 TypeScript Support
172
+ ## TypeScript
573
173
 
574
- Full TypeScript definitions included:
174
+ Full type definitions are included (`index.d.ts`).
575
175
 
576
176
  ```typescript
577
- import { BNCASmartScraper, ScrapingResult, ScrapingOptions } from '@monostate/node-scraper';
578
-
579
- const scraper: BNCASmartScraper = new BNCASmartScraper({
580
- timeout: 5000,
581
- verbose: true
582
- });
583
-
584
- const result: ScrapingResult = await scraper.scrape('https://example.com');
177
+ import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
585
178
  ```
586
179
 
587
- ## Changelog
588
-
589
- ### v1.6.0 (Latest)
590
- - **Method Override**: Force specific scraping methods with `method` parameter
591
- - **Enhanced Error Handling**: Categorized error types for better debugging
592
- - **Fallback Chain Tracking**: See which methods were attempted in auto mode
593
- - **Graceful Failures**: No automatic fallback when method is forced
594
-
595
- ### v1.5.0
596
- - **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
597
- - **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
598
- - **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
599
- - **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
600
- - **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
601
- - **Enhanced TypeScript**: Complete type definitions for all AI features
602
-
603
- ### v1.4.0
604
- - Internal release (skipped for public release)
180
+ ## Notes
605
181
 
606
- ### v1.3.0
607
- - **PDF Support**: Full PDF parsing with text extraction, metadata, and page count
608
- - **Smart PDF Detection**: Detects PDFs by URL patterns, content-type, or magic bytes
609
- - **Robust Parsing**: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
610
- - **Fast Performance**: PDF parsing typically completes in 100-500ms
611
- - **Comprehensive Extraction**: Title, author, creation date, page count, and full text
182
+ - **Server-side only** — requires filesystem access and browser automation.
183
+ - **Node.js 20+** required.
184
+ - Lightpanda binary is auto-installed. Puppeteer is optional.
185
+ - No external API calls for scraping; all processing is local.
612
186
 
613
- ### v1.2.0
614
- - **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
615
- - **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
616
- - **Improved Performance**: Enhanced binary detection and ES6 module compatibility
617
- - **Better Error Handling**: More robust installation scripts with retry logic
618
- - **Zero Configuration**: No manual setup required - works out of the box
619
-
620
- ### v1.1.1
621
- - Bug fixes and stability improvements
622
- - Enhanced Puppeteer integration
623
-
624
- ### v1.1.0
625
- - Added screenshot capabilities
626
- - Improved fallback system
627
- - Performance optimizations
628
-
629
- ## 🤝 Contributing
630
-
631
- See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
632
-
633
- ## 📄 License
634
-
635
- MIT License - see [LICENSE](../../LICENSE) file for details.
636
-
637
- ---
638
-
639
- <div align="center">
187
+ ## Changelog
640
188
 
641
- **Built with ❤️ for fast, reliable web scraping**
189
+ See [CHANGELOG.md](./CHANGELOG.md) for the full release history.
642
190
 
643
- [⭐ Star on GitHub](https://github.com/your-org/bnca-prototype) | [📖 Full Documentation](https://github.com/your-org/bnca-prototype#readme)
191
+ ## License
644
192
 
645
- </div>
193
+ MIT
package/browser-pool.js CHANGED
@@ -42,7 +42,7 @@ class BrowserPool {
42
42
  async createBrowser() {
43
43
  const puppeteer = await this.getPuppeteer();
44
44
  const instance = await puppeteer.launch({
45
- headless: 'new',
45
+ headless: true,
46
46
  args: [
47
47
  '--no-sandbox',
48
48
  '--disable-setuid-sandbox',
package/index.js CHANGED
@@ -1,11 +1,10 @@
1
- import fetch from 'node-fetch';
2
1
  import { spawn, execSync } from 'child_process';
3
2
  import fs from 'fs/promises';
4
3
  import { existsSync, statSync } from 'fs';
5
4
  import path from 'path';
6
5
  import { fileURLToPath } from 'url';
7
6
  import { promises as fsPromises } from 'fs';
8
- import pdfParse from 'pdf-parse/lib/pdf-parse.js';
7
+ import { PDFParse } from 'pdf-parse';
9
8
  import browserPool from './browser-pool.js';
10
9
 
11
10
  let puppeteer = null;
@@ -604,27 +603,41 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
604
603
  }
605
604
 
606
605
  return new Promise((resolve) => {
607
- const args = ['fetch', '--dump', url];
606
+ const format = config.lightpandaFormat || 'html';
607
+ const args = [
608
+ 'fetch',
609
+ '--dump', format,
610
+ '--with_frames',
611
+ '--http_timeout', String(config.timeout),
612
+ url
613
+ ];
608
614
  const process = spawn(this.options.lightpandaPath, args, {
609
- timeout: config.timeout + 1000 // Add buffer for process timeout only
615
+ timeout: config.timeout + 2000 // Buffer above http_timeout
610
616
  });
611
-
617
+
612
618
  let output = '';
613
619
  let errorOutput = '';
614
-
620
+
615
621
  process.stdout.on('data', (data) => {
616
622
  output += data.toString();
617
623
  });
618
-
624
+
619
625
  process.stderr.on('data', (data) => {
620
626
  errorOutput += data.toString();
621
627
  });
622
-
628
+
623
629
  process.on('close', (code) => {
624
630
  if (code === 0 && output.length > 0) {
625
- const content = this.extractContentFromHTML(output);
631
+ // Markdown output is already clean text, no HTML extraction needed
632
+ const content = format === 'markdown'
633
+ ? JSON.stringify({
634
+ title: output.match(/^#\s+(.+)$/m)?.[1] || '',
635
+ content: output,
636
+ extractedAt: new Date().toISOString()
637
+ }, null, 2)
638
+ : this.extractContentFromHTML(output);
626
639
  this.stats.lightpanda.successes++;
627
-
640
+
628
641
  resolve({
629
642
  success: true,
630
643
  content,
@@ -642,7 +655,7 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
642
655
  });
643
656
  }
644
657
  });
645
-
658
+
646
659
  process.on('error', (error) => {
647
660
  resolve({
648
661
  success: false,
@@ -847,25 +860,30 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
     };
   }

-    // Parse PDF
-    const pdfData = await pdfParse(buffer);
-
+    // Parse PDF with pdf-parse v2 API
+    const parser = new PDFParse({ data: new Uint8Array(buffer) });
+    await parser.load();
+    const textResult = await parser.getText();
+    const infoResult = await parser.getInfo();
+    parser.destroy();
+
     // Extract structured content
+    const pdfInfo = infoResult.info || {};
     const content = {
-      title: pdfData.info?.Title || 'Untitled PDF',
-      author: pdfData.info?.Author || '',
-      subject: pdfData.info?.Subject || '',
-      keywords: pdfData.info?.Keywords || '',
-      creator: pdfData.info?.Creator || '',
-      producer: pdfData.info?.Producer || '',
-      creationDate: pdfData.info?.CreationDate || '',
-      modificationDate: pdfData.info?.ModificationDate || '',
-      pages: pdfData.numpages || 0,
-      text: pdfData.text || '',
-      metadata: pdfData.metadata || null,
+      title: pdfInfo.Title || infoResult.outline?.[0]?.title || 'Untitled PDF',
+      author: pdfInfo.Author || '',
+      subject: pdfInfo.Subject || '',
+      keywords: pdfInfo.Keywords || '',
+      creator: pdfInfo.Creator || '',
+      producer: pdfInfo.Producer || '',
+      creationDate: pdfInfo.CreationDate || '',
+      modificationDate: pdfInfo.ModDate || '',
+      pages: textResult.total || 0,
+      text: textResult.text || '',
+      metadata: infoResult.metadata || null,
       url: url
     };
-
+
     this.stats.pdf.successes++;

     return {
@@ -1008,11 +1026,11 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
     });

     // Extract window state data
-    const windowDataMatch = html.match(/window\.__(?:INITIAL_STATE__|INITIAL_DATA__|NEXT_DATA__)__\s*=\s*({[\s\S]*?});/);
+    const windowDataMatch = html.match(/window\.__(INITIAL_STATE|INITIAL_DATA|NEXT_DATA)__\s*=\s*({[\s\S]*?});/);
     let windowData = null;
     if (windowDataMatch) {
       try {
-        windowData = JSON.parse(windowDataMatch[1]);
+        windowData = JSON.parse(windowDataMatch[2]);
       } catch {
         windowData = 'Found but unparseable';
       }
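The regex change in this hunk fixes two coupled bugs: the old pattern doubled the trailing underscores (so it could only ever match a nonexistent global like `window.__NEXT_DATA____`), and switching the non-capturing `(?: )` group to a capturing one shifts the JSON payload from group `[1]` to group `[2]`. A minimal sketch, run against a hypothetical HTML snippet rather than real scraper output:

```javascript
// Sketch: demonstrate the old vs. new window-state regex.
// Old pattern: name alternatives already end in "__", then "__" again,
// so it expects four trailing underscores and never matches real pages.
const oldRe = /window\.__(?:INITIAL_STATE__|INITIAL_DATA__|NEXT_DATA__)__\s*=\s*({[\s\S]*?});/;
// New pattern: single "__" suffix; the name is a capturing group,
// so the JSON object literal moves to capture group 2.
const newRe = /window\.__(INITIAL_STATE|INITIAL_DATA|NEXT_DATA)__\s*=\s*({[\s\S]*?});/;

const html = '<script>window.__NEXT_DATA__ = {"page":"/"};</script>';

console.log(oldRe.test(html));      // → false (old pattern never fires)
const m = html.match(newRe);
console.log(m[1]);                  // → "NEXT_DATA" (group 1 is the variable name)
console.log(JSON.parse(m[2]).page); // → "/" (payload is now group 2)
```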
package/package.json CHANGED
@@ -1,6 +1,6 @@
   {
     "name": "@monostate/node-scraper",
-    "version": "1.8.1",
+    "version": "2.0.0",
     "description": "Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers",
     "type": "module",
     "main": "index.js",
@@ -21,6 +21,7 @@
       "scripts/"
     ],
     "scripts": {
+      "test": "node --test test/",
       "postinstall": "node scripts/install-lightpanda.js"
     },
     "keywords": [
@@ -47,11 +48,10 @@
     "author": "BNCA Team",
     "license": "MIT",
     "dependencies": {
-      "node-fetch": "^3.3.2",
-      "pdf-parse": "^1.1.1"
+      "pdf-parse": "^2.4.5"
     },
     "peerDependencies": {
-      "puppeteer": "^24.11.2"
+      "puppeteer": "^24.38.0"
     },
     "peerDependenciesMeta": {
       "puppeteer": {
@@ -59,7 +59,7 @@
       }
     },
     "engines": {
-      "node": ">=18.0.0"
+      "node": ">=20.0.0"
     },
     "repository": {
       "type": "git",
package/scripts/install-lightpanda.js CHANGED
@@ -6,17 +6,30 @@ import path from 'path';
   import { createWriteStream } from 'fs';
   import { execSync } from 'child_process';

-  const LIGHTPANDA_VERSION = 'nightly';
+  const LIGHTPANDA_VERSION = 'v0.2.5';
   const BINARY_DIR = path.join(path.dirname(path.dirname(new URL(import.meta.url).pathname)), 'bin');
   const BINARY_NAME = 'lightpanda';
   const BINARY_PATH = path.join(BINARY_DIR, BINARY_NAME);

-  // Platform-specific download URLs (matching official Lightpanda instructions)
-  const DOWNLOAD_URLS = {
-    'darwin': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-aarch64-macos`,
-    'linux': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-x86_64-linux`,
-    'wsl': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-x86_64-linux` // WSL uses Linux binary
-  };
+  function detectArch() {
+    const arch = process.arch;
+    if (arch === 'arm64' || arch === 'aarch64') return 'aarch64';
+    if (arch === 'x64' || arch === 'x86_64') return 'x86_64';
+    return arch;
+  }
+
+  // Platform-specific download URLs (matching official Lightpanda releases)
+  function getDownloadUrls() {
+    const arch = detectArch();
+    const base = `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}`;
+    return {
+      'darwin': `${base}/lightpanda-${arch}-macos`,
+      'linux': `${base}/lightpanda-${arch}-linux`,
+      'wsl': `${base}/lightpanda-x86_64-linux`
+    };
+  }
+
+  const DOWNLOAD_URLS = getDownloadUrls();

   function detectPlatform() {
     const platform = process.platform;
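The `detectArch()` added in getDownloadUrls() above bridges a naming gap: Node's `process.arch` reports `'arm64'` and `'x64'`, while Lightpanda release assets are named with `aarch64` and `x86_64`. A standalone sketch of that mapping (the function name `normalizeArch` and the sample values are illustrative, not from the package):

```javascript
// Sketch: map Node.js process.arch values onto Lightpanda asset names.
function normalizeArch(arch) {
  if (arch === 'arm64' || arch === 'aarch64') return 'aarch64';
  if (arch === 'x64' || arch === 'x86_64') return 'x86_64';
  return arch; // anything else (e.g. 'ia32') passes through unchanged
}

const base = 'https://github.com/lightpanda-io/browser/releases/download/v0.2.5';
console.log(`${base}/lightpanda-${normalizeArch('arm64')}-macos`);
// → .../lightpanda-aarch64-macos
```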