@monostate/node-scraper 1.8.0 → 1.8.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,626 @@
1
+ # Bulk Scraping Guide
2
+
3
+ The `@monostate/node-scraper` package provides powerful bulk scraping capabilities with automatic request queueing, progress tracking, and efficient resource management.
4
+
5
+ ## Key Features
6
+
7
+ - **Automatic Request Queueing**: Never worry about "too many browser instances" errors. Requests are automatically queued when the browser pool is full.
8
+ - **Smart Browser Pooling**: Reuses browser instances for better performance while preventing memory leaks.
9
+ - **Real-time Progress Tracking**: Monitor scraping progress with customizable callbacks.
10
+ - **Streaming Support**: Process results as they complete for memory-efficient handling of large datasets.
11
+ - **Graceful Error Handling**: Continue processing even when some URLs fail, with detailed error reporting.
12
+
13
+ ## Table of Contents
14
+
15
+ - [Automatic Request Queueing](#automatic-request-queueing)
16
+ - [Basic Usage](#basic-usage)
17
+ - [Streaming Results](#streaming-results)
18
+ - [Configuration Options](#configuration-options)
19
+ - [Error Handling](#error-handling)
20
+ - [Performance Optimization](#performance-optimization)
21
+ - [Real-World Examples](#real-world-examples)
22
+ - [Best Practices](#best-practices)
23
+
24
+ ## Automatic Request Queueing
25
+
26
+ One of the most powerful features of v1.8.0 is automatic request queueing. When you make multiple concurrent requests:
27
+
28
+ ```javascript
29
+ import { bulkScrape } from '@monostate/node-scraper';
+
+ // Before v1.8.0: this would throw errors when the browser pool was full
30
+ // After v1.8.0: Requests are automatically queued!
31
+
32
+ const urls = Array.from({ length: 100 }, (_, i) => `https://example.com/page${i}`);
33
+
34
+ // Even with 100 URLs and only 3 browser instances, no errors!
35
+ const results = await bulkScrape(urls, {
36
+ concurrency: 20, // Request 20 at a time
37
+ method: 'puppeteer' // Force browser usage
38
+ });
39
+
40
+ // The browser pool (max 3 instances) automatically queues requests
41
+ // No "too many browser instances" errors!
42
+ ```
43
+
44
+ ### How It Works
45
+
46
+ 1. **Browser Pool**: Maximum of 3 browser instances by default
47
+ 2. **Request Queue**: When all browsers are busy, new requests wait in a queue
48
+ 3. **Automatic Processing**: As browsers become available, queued requests are processed
49
+ 4. **No Errors**: Instead of throwing errors, requests wait their turn
50
+
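The queueing model described in the steps above can be sketched as a small semaphore: a fixed number of slots, with overflow callers parked in a FIFO queue until a slot is handed back. This is an illustrative model only, not the package's actual `browser-pool.js` implementation.

```javascript
// Illustrative model of the pool's queueing: at most `max` tasks hold a slot
// at once; overflow callers wait in FIFO order instead of throwing.
class Semaphore {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.active < this.max) {
      this.active++;
      return;
    }
    // Pool is full: park until release() transfers a slot to us.
    await new Promise(resolve => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot straight to the oldest waiter
    } else {
      this.active--;
    }
  }
}

// Push 10 simulated scrapes through 3 slots and record peak concurrency.
const slots = new Semaphore(3);
let inFlight = 0;
let peak = 0;

async function simulatedScrape(i) {
  await slots.acquire();
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise(resolve => setTimeout(resolve, 10)); // pretend page load
  inFlight--;
  slots.release();
  return i;
}

const results = await Promise.all(
  Array.from({ length: 10 }, (_, i) => simulatedScrape(i))
);
console.log(`completed ${results.length} tasks, peak concurrency ${peak}`);
```

All 10 tasks complete, but concurrency never exceeds 3; the other callers simply wait their turn, which is the same behaviour the browser pool gives you for free.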
51
+ ### Benefits
52
+
53
+ - **No Manual Retry Logic**: The SDK handles queueing automatically
54
+ - **Memory Efficient**: Only 3 browsers maximum, preventing OOM errors
55
+ - **Optimal Performance**: Browsers are reused for faster processing
56
+ - **Graceful Degradation**: System remains stable under high load
57
+
58
+ ### Works Everywhere
59
+
60
+ The automatic queueing works with all scraping methods, not just bulk operations:
61
+
62
+ ```javascript
63
+ import { smartScrape } from '@monostate/node-scraper';
64
+
65
+ // Make 50 parallel requests with Puppeteer
66
+ // Only 3 browsers will be created; the rest will queue automatically
67
+ const promises = [];
68
+ for (let i = 0; i < 50; i++) {
69
+ promises.push(
70
+ smartScrape(`https://example.com/page${i}`, { method: 'puppeteer' })
71
+ );
72
+ }
73
+
74
+ // All 50 requests complete successfully!
75
+ // No "too many browser instances" errors
76
+ const results = await Promise.all(promises);
77
+ console.log(`Completed ${results.length} requests with only 3 browsers!`);
78
+ ```
79
+
80
+ ## Basic Usage
81
+
82
+ ### Simple Bulk Scraping
83
+
84
+ ```javascript
85
+ import { bulkScrape } from '@monostate/node-scraper';
86
+
87
+ const urls = [
88
+ 'https://example1.com',
89
+ 'https://example2.com',
90
+ 'https://example3.com'
91
+ ];
92
+
93
+ const results = await bulkScrape(urls);
94
+
95
+ // Access results
96
+ results.success.forEach(result => {
97
+ console.log(`URL: ${result.url}`);
98
+ console.log(`Method: ${result.method}`);
99
+ console.log(`Duration: ${result.duration}ms`);
100
+ console.log(`Content: ${result.content.substring(0, 100)}...`);
101
+ });
102
+
103
+ // Check failures
104
+ results.failed.forEach(failure => {
105
+ console.log(`Failed URL: ${failure.url}`);
106
+ console.log(`Error: ${failure.error}`);
107
+ });
108
+ ```
109
+
110
+ ### With Progress Tracking
111
+
112
+ ```javascript
113
+ const results = await bulkScrape(urls, {
114
+ progressCallback: (progress) => {
115
+ console.log(`Progress: ${progress.percentage.toFixed(1)}%`);
116
+ console.log(`Current: ${progress.current}`);
117
+ console.log(`Processed: ${progress.processed}/${progress.total}`);
118
+ }
119
+ });
120
+ ```
121
+
122
+ ## Streaming Results
123
+
124
+ For large datasets, streaming allows you to process results as they complete:
125
+
126
+ ### Basic Streaming
127
+
128
+ ```javascript
129
+ import { bulkScrapeStream } from '@monostate/node-scraper';
130
+
131
+ const stats = await bulkScrapeStream(urls, {
132
+ onResult: (result) => {
133
+ console.log(`Success: ${result.url}`);
134
+ // Process immediately - save to database, write to file, etc.
135
+ },
136
+ onError: (error) => {
137
+ console.log(`Failed: ${error.url} - ${error.error}`);
138
+ }
139
+ });
140
+
141
+ console.log(`Total processed: ${stats.processed}`);
142
+ console.log(`Success rate: ${(stats.successful / stats.total * 100).toFixed(1)}%`);
143
+ ```
144
+
145
+ ### Stream to File
146
+
147
+ ```javascript
148
+ import { createWriteStream } from 'fs';
149
+ import { bulkScrapeStream } from '@monostate/node-scraper';
150
+
151
+ const outputStream = createWriteStream('results.jsonl');
152
+
153
+ await bulkScrapeStream(urls, {
154
+ onResult: async (result) => {
155
+ // Write each result as a JSON line
156
+ outputStream.write(JSON.stringify(result) + '\n');
157
+ },
158
+ onError: async (error) => {
159
+ // Tag errors so they can be filtered out of the same JSONL file later
160
+ outputStream.write(JSON.stringify({ ...error, isError: true }) + '\n');
161
+ }
162
+ });
163
+
164
+ outputStream.end();
165
+ ```
166
+
167
+ ### Stream to Database
168
+
169
+ ```javascript
170
+ import { bulkScrapeStream } from '@monostate/node-scraper';
171
+ import { db } from './database.js';
172
+
173
+ await bulkScrapeStream(urls, {
174
+ concurrency: 10,
175
+ onResult: async (result) => {
176
+ await db.scraped_pages.insert({
177
+ url: result.url,
178
+ content: result.content,
179
+ method: result.method,
180
+ duration_ms: result.duration,
181
+ scraped_at: new Date(result.timestamp)
182
+ });
183
+ },
184
+ onError: async (error) => {
185
+ await db.scrape_errors.insert({
186
+ url: error.url,
187
+ error_message: error.error,
188
+ failed_at: new Date(error.timestamp)
189
+ });
190
+ }
191
+ });
192
+ ```
193
+
194
+ ## Configuration Options
195
+
196
+ ### Concurrency Control
197
+
198
+ ```javascript
199
+ // Low concurrency for rate-limited sites
200
+ const results = await bulkScrape(urls, {
201
+ concurrency: 2, // Only 2 parallel requests
202
+ timeout: 30000 // 30 second timeout per request
203
+ });
204
+
205
+ // High concurrency for your own servers
206
+ const results = await bulkScrape(internalUrls, {
207
+ concurrency: 20, // 20 parallel requests
208
+ timeout: 5000 // 5 second timeout
209
+ });
210
+ ```
211
+
212
+ ### Method Selection
213
+
214
+ ```javascript
215
+ // Force all URLs to use Puppeteer
216
+ const results = await bulkScrape(urls, {
217
+ method: 'puppeteer',
218
+ concurrency: 3 // Puppeteer is resource-intensive
219
+ });
220
+
221
+ // Force direct fetch for known static sites
222
+ const results = await bulkScrape(staticUrls, {
223
+ method: 'direct',
224
+ concurrency: 50 // Direct fetch can handle high concurrency
225
+ });
226
+ ```
227
+
228
+ ### Continue vs Stop on Error
229
+
230
+ ```javascript
231
+ // Continue processing even if some URLs fail (default)
232
+ const results = await bulkScrape(urls, {
233
+ continueOnError: true
234
+ });
235
+
236
+ // Stop immediately on first error
237
+ try {
238
+ const results = await bulkScrape(urls, {
239
+ continueOnError: false
240
+ });
241
+ } catch (error) {
242
+ console.error('Bulk scraping stopped due to error:', error);
243
+ }
244
+ ```
245
+
246
+ ## Error Handling
247
+
248
+ ### Retry Failed URLs
249
+
250
+ ```javascript
251
+ // First pass
252
+ const results = await bulkScrape(urls);
253
+
254
+ // Retry failures with different method
255
+ if (results.failed.length > 0) {
256
+ const failedUrls = results.failed.map(f => f.url);
257
+ const retryResults = await bulkScrape(failedUrls, {
258
+ method: 'puppeteer', // Try with full browser
259
+ timeout: 60000 // Longer timeout
260
+ });
261
+
+ // Fold successful retries back into the main result set
+ results.success.push(...retryResults.success);
+ }
262
+ ```
263
+
264
+ ### Custom Error Handling
265
+
266
+ ```javascript
267
+ await bulkScrapeStream(urls, {
268
+ onResult: (result) => {
269
+ // Process successful results
270
+ },
271
+ onError: async (error) => {
272
+ // Categorize and handle different error types
273
+ if (error.error.includes('timeout')) {
274
+ await logTimeoutError(error);
275
+ } else if (error.error.includes('404')) {
276
+ await handle404(error);
277
+ } else {
278
+ await logGeneralError(error);
279
+ }
280
+ }
281
+ });
282
+ ```
283
+
284
+ ## Performance Optimization
285
+
286
+ ### Dynamic Concurrency
287
+
288
+ ```javascript
289
+ // Start with low concurrency and increase based on success rate
290
+ let concurrency = 5;
291
+ const batchSize = 100;
292
+
293
+ for (let i = 0; i < urls.length; i += batchSize) {
294
+ const batch = urls.slice(i, i + batchSize);
295
+
296
+ const results = await bulkScrape(batch, { concurrency });
297
+
298
+ const successRate = results.stats.successful / batch.length;
299
+
300
+ // Adjust concurrency based on success rate
301
+ if (successRate > 0.95) {
302
+ concurrency = Math.min(concurrency + 2, 20);
303
+ } else if (successRate < 0.8) {
304
+ concurrency = Math.max(concurrency - 2, 2);
305
+ }
306
+
307
+ console.log(`Batch complete. Success rate: ${(successRate * 100).toFixed(1)}%. New concurrency: ${concurrency}`);
308
+ }
309
+ ```
310
+
311
+ ### Memory-Efficient Processing
312
+
313
+ ```javascript
314
+ // Process in chunks to avoid memory issues
315
+ async function processLargeDataset(allUrls) {
316
+ const chunkSize = 1000;
317
+ const results = {
318
+ successful: 0,
319
+ failed: 0,
320
+ totalTime: 0
321
+ };
322
+
323
+ for (let i = 0; i < allUrls.length; i += chunkSize) {
324
+ const chunk = allUrls.slice(i, i + chunkSize);
325
+ console.log(`Processing chunk ${i / chunkSize + 1} of ${Math.ceil(allUrls.length / chunkSize)}`);
326
+
327
+ const chunkResults = await bulkScrape(chunk, {
328
+ concurrency: 10,
329
+ progressCallback: (p) => {
330
+ const overallProgress = ((i + p.processed) / allUrls.length * 100).toFixed(1);
331
+ console.log(`Overall progress: ${overallProgress}%`);
332
+ }
333
+ });
334
+
335
+ results.successful += chunkResults.stats.successful;
336
+ results.failed += chunkResults.stats.failed;
337
+ results.totalTime += chunkResults.stats.totalTime;
338
+
339
+ // Optional: Process results immediately to free memory
340
+ await processChunkResults(chunkResults);
341
+ }
342
+
343
+ return results;
344
+ }
345
+ ```
346
+
347
+ ## Real-World Examples
348
+
349
+ ### E-commerce Price Monitoring
350
+
351
+ ```javascript
352
+ import { bulkScrapeStream } from '@monostate/node-scraper';
353
+
354
+ const productUrls = [
355
+ 'https://shop1.com/product/laptop-123',
356
+ 'https://shop2.com/items/laptop-123',
357
+ // ... hundreds of product URLs
358
+ ];
359
+
360
+ const priceData = [];
361
+
362
+ await bulkScrapeStream(productUrls, {
363
+ concurrency: 5,
364
+ onResult: async (result) => {
365
+ // Extract price from scraped content
366
+ const content = JSON.parse(result.content);
367
+ const price = extractPrice(content);
368
+
369
+ priceData.push({
370
+ url: result.url,
371
+ price: price,
372
+ timestamp: result.timestamp
373
+ });
374
+ },
375
+ progressCallback: (progress) => {
376
+ process.stdout.write(`\rChecking prices: ${progress.percentage.toFixed(0)}%`);
377
+ }
378
+ });
379
+
380
+ // Analyze price data
381
+ const avgPrice = priceData.reduce((sum, p) => sum + p.price, 0) / priceData.length;
382
+ console.log(`\nAverage price: $${avgPrice.toFixed(2)}`);
383
+ ```
384
+
385
+ ### News Aggregation
386
+
387
+ ```javascript
388
+ import { bulkScrape } from '@monostate/node-scraper';
389
+
390
+ const newsUrls = [
391
+ 'https://news1.com/latest',
392
+ 'https://news2.com/today',
393
+ 'https://news3.com/breaking'
394
+ ];
395
+
396
+ const results = await bulkScrape(newsUrls, {
397
+ concurrency: 3,
398
+ timeout: 10000
399
+ });
400
+
401
+ // Extract and combine articles
402
+ const allArticles = [];
403
+ results.success.forEach(result => {
404
+ const content = JSON.parse(result.content);
405
+ const articles = extractArticles(content);
406
+ allArticles.push(...articles.map(a => ({
407
+ ...a,
408
+ source: new URL(result.url).hostname
409
+ })));
410
+ });
411
+
412
+ // Sort by date and deduplicate
413
+ const uniqueArticles = deduplicateArticles(allArticles);
414
+ console.log(`Found ${uniqueArticles.length} unique articles`);
415
+ ```
416
+
417
+ ### SEO Analysis
418
+
419
+ ```javascript
420
+ import { bulkScrape } from '@monostate/node-scraper';
421
+
422
+ async function analyzeSEO(urls) {
423
+ const results = await bulkScrape(urls, {
424
+ concurrency: 10,
425
+ method: 'auto'
426
+ });
427
+
428
+ const seoData = results.success.map(result => {
429
+ const content = JSON.parse(result.content);
430
+ return {
431
+ url: result.url,
432
+ title: content.title,
433
+ metaDescription: content.metaDescription,
434
+ headings: content.headings,
435
+ loadTime: result.duration,
436
+ method: result.method,
437
+ hasStructuredData: !!content.structuredData
438
+ };
439
+ });
440
+
441
+ // Generate SEO report
442
+ const avgLoadTime = seoData.reduce((sum, d) => sum + d.loadTime, 0) / seoData.length;
443
+ const missingTitles = seoData.filter(d => !d.title).length;
444
+ const missingDescriptions = seoData.filter(d => !d.metaDescription).length;
445
+
446
+ return {
447
+ totalAnalyzed: seoData.length,
448
+ avgLoadTime: Math.round(avgLoadTime),
449
+ missingTitles,
450
+ missingDescriptions,
451
+ details: seoData
452
+ };
453
+ }
454
+ ```
455
+
456
+ ## Best Practices
457
+
458
+ ### 1. Respect Rate Limits
459
+
460
+ ```javascript
461
+ import { smartScrape } from '@monostate/node-scraper';
+
+ // Add delays between requests for external sites
462
+ async function respectfulBulkScrape(urls, delayMs = 1000) {
463
+ const results = [];
464
+
465
+ for (const url of urls) {
466
+ const result = await smartScrape(url);
467
+ results.push(result);
468
+
469
+ // Wait before next request
470
+ await new Promise(resolve => setTimeout(resolve, delayMs));
471
+ }
472
+
473
+ return results;
474
+ }
475
+ ```
476
+
477
+ ### 2. Handle Different Content Types
478
+
479
+ ```javascript
480
+ const results = await bulkScrape(mixedUrls, {
481
+ progressCallback: (progress) => {
482
+ console.log(`Processing: ${progress.current}`);
483
+ }
484
+ });
485
+
486
+ // Separate different content types
487
+ const pdfResults = results.success.filter(r => r.contentType?.includes('pdf'));
488
+ const htmlResults = results.success.filter(r => !r.contentType?.includes('pdf'));
489
+
490
+ console.log(`Found ${pdfResults.length} PDFs and ${htmlResults.length} web pages`);
491
+ ```
492
+
493
+ ### 3. Monitor Resource Usage
494
+
495
+ ```javascript
496
+ import { BNCASmartScraper } from '@monostate/node-scraper';
497
+ import browserPool from '@monostate/node-scraper/browser-pool.js';
498
+
499
+ const scraper = new BNCASmartScraper({ verbose: true });
500
+
501
+ // Monitor memory usage during bulk scraping
502
+ const memoryUsage = [];
503
+ const interval = setInterval(() => {
504
+ memoryUsage.push(process.memoryUsage());
505
+
506
+ // Log browser pool statistics
507
+ const poolStats = browserPool.getStats();
508
+ console.log('Browser Pool:', {
509
+ active: poolStats.busyCount,
510
+ idle: poolStats.idleCount,
511
+ queued: poolStats.queueLength,
512
+ totalCreated: poolStats.created,
513
+ reused: poolStats.reused
514
+ });
515
+ }, 1000);
516
+
517
+ try {
518
+ const results = await scraper.bulkScrape(urls, {
519
+ concurrency: 10,
520
+ progressCallback: (p) => {
521
+ const mem = process.memoryUsage();
522
+ console.log(`Progress: ${p.percentage.toFixed(1)}% | Memory: ${(mem.heapUsed / 1024 / 1024).toFixed(1)}MB`);
523
+ }
524
+ });
525
+ } finally {
526
+ clearInterval(interval);
527
+ await scraper.cleanup();
528
+ }
529
+ ```
530
+
531
+ ### 4. Implement Circuit Breaker
532
+
533
+ ```javascript
534
+ class CircuitBreaker {
535
+ constructor(threshold = 5, timeout = 60000) {
536
+ this.failureCount = 0;
537
+ this.threshold = threshold;
538
+ this.timeout = timeout;
539
+ this.state = 'CLOSED';
540
+ this.nextAttempt = Date.now();
541
+ }
542
+
543
+ async call(fn) {
544
+ if (this.state === 'OPEN') {
545
+ if (Date.now() < this.nextAttempt) {
546
+ throw new Error('Circuit breaker is OPEN');
547
+ }
548
+ this.state = 'HALF_OPEN';
549
+ }
550
+
551
+ try {
552
+ const result = await fn();
553
+ this.onSuccess();
554
+ return result;
555
+ } catch (error) {
556
+ this.onFailure();
557
+ throw error;
558
+ }
559
+ }
560
+
561
+ onSuccess() {
562
+ this.failureCount = 0;
563
+ this.state = 'CLOSED';
564
+ }
565
+
566
+ onFailure() {
567
+ this.failureCount++;
568
+ if (this.failureCount >= this.threshold) {
569
+ this.state = 'OPEN';
570
+ this.nextAttempt = Date.now() + this.timeout;
571
+ }
572
+ }
573
+ }
574
+
575
+ // Use with bulk scraping
576
+ const breaker = new CircuitBreaker();
577
+ const results = [];
578
+
579
+ for (const url of urls) {
580
+ try {
581
+ const result = await breaker.call(() => smartScrape(url));
582
+ results.push(result);
583
+ } catch (error) {
584
+ console.error(`Failed to scrape ${url}: ${error.message}`);
585
+ }
586
+ }
587
+ ```
588
+
589
+ ## Performance Tips
590
+
591
+ 1. **Use appropriate concurrency**: Start with 5-10 concurrent requests and adjust based on performance
592
+ 2. **Choose the right method**: Use `direct` for static sites, `puppeteer` for SPAs
593
+ 3. **Stream large datasets**: Use `bulkScrapeStream` for datasets over 1000 URLs
594
+ 4. **Monitor memory usage**: Process results immediately in streaming mode
595
+ 5. **Implement retry logic**: Some failures are temporary
596
+ 6. **Cache results**: Avoid re-scraping unchanged content
597
+ 7. **Use timeouts**: Prevent hanging requests from blocking progress
598
+
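Tip 6 ("cache results") can be sketched as a thin wrapper around any scrape function. `withCache` and the stub scraper below are illustrative, not part of the package; in real use you would pass `smartScrape` as `scrapeFn`.

```javascript
// Wrap a scrape function with an in-memory TTL cache so repeated URLs are
// served from memory instead of being re-scraped. `ttlMs` bounds staleness.
function withCache(scrapeFn, ttlMs = 5 * 60 * 1000) {
  const cache = new Map(); // url -> { value, expires }
  return async (url) => {
    const hit = cache.get(url);
    if (hit && hit.expires > Date.now()) return hit.value;
    const value = await scrapeFn(url);
    cache.set(url, { value, expires: Date.now() + ttlMs });
    return value;
  };
}

// Demo with a stub scraper that counts how many real fetches happen.
let fetches = 0;
const stubScrape = async (url) => {
  fetches++;
  return `content of ${url}`;
};
const cachedScrape = withCache(stubScrape);

const first = await cachedScrape('https://example.com/a');
const second = await cachedScrape('https://example.com/a'); // cache hit
console.log(`real fetches: ${fetches}`);
```

The second call returns the cached value without touching the network, so only one real fetch occurs.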
599
+ ## Troubleshooting
600
+
601
+ ### High Memory Usage
602
+
603
+ If you experience high memory usage:
604
+
605
+ 1. Reduce concurrency
606
+ 2. Use streaming mode
607
+ 3. Process results immediately
608
+ 4. Call `cleanup()` periodically
609
+
610
+ ### Slow Performance
611
+
612
+ If scraping is slow:
613
+
614
+ 1. Increase concurrency (if server allows)
615
+ 2. Use `direct` method when possible
616
+ 3. Reduce timeout values
617
+ 4. Check network connectivity
618
+
619
+ ### Many Failures
620
+
621
+ If many URLs fail:
622
+
623
+ 1. Check if sites require authentication
624
+ 2. Verify URLs are correct
625
+ 3. Use `puppeteer` method for JavaScript-heavy sites
626
+ 4. Implement retry logic with backoff
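Point 4 can be sketched as a generic backoff helper; `retryWithBackoff` and its parameters are illustrative, not part of the package. In real use you would pass `() => smartScrape(url)` as `fn`.

```javascript
// Retry an async operation with exponential backoff: wait baseDelayMs,
// then 2x, 4x, ... between attempts, and rethrow the last error if all fail.
async function retryWithBackoff(fn, attempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (i < attempts - 1) {
        const delay = baseDelayMs * 2 ** i; // e.g. 500ms, 1s, 2s, ...
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Demo: a flaky operation that fails twice before succeeding.
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) throw new Error('temporary failure');
  return 'ok';
};

const result = await retryWithBackoff(flaky, 3, 10);
console.log(`succeeded after ${calls} attempts: ${result}`);
```

Temporary failures (timeouts, transient 5xx responses) are absorbed by the backoff; permanent ones still surface as a thrown error after the final attempt.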
package/README.md CHANGED
@@ -19,6 +19,8 @@ yarn add @monostate/node-scraper
19
19
  pnpm add @monostate/node-scraper
20
20
  ```
21
21
 
22
+ **Fixed in v1.8.1**: Critical production fix - `browser-pool.js` is now included in the npm package.
23
+
22
24
  **New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
23
25
 
24
26
  **Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
@@ -0,0 +1,229 @@
1
+ class BrowserPool {
2
+ constructor(maxInstances = 3, idleTimeout = 5000) {
3
+ this.maxInstances = maxInstances;
4
+ this.idleTimeout = idleTimeout;
5
+ this.pool = [];
6
+ this.busyBrowsers = new Set();
7
+ this.cleanupTimer = null;
8
+ this.requestQueue = [];
9
+ this.stats = {
10
+ created: 0,
11
+ reused: 0,
12
+ queued: 0,
13
+ cleaned: 0
14
+ };
15
+ }
16
+
17
+ async getBrowser() {
18
+ // Try to get an idle browser from pool
19
+ let browser = this.pool.find(b => !this.busyBrowsers.has(b.instance));
20
+
21
+ if (browser) {
22
+ browser.lastUsed = Date.now();
23
+ this.busyBrowsers.add(browser.instance);
24
+ this.stats.reused++;
25
+ return browser.instance;
26
+ }
27
+
28
+ // Create new browser if under limit
29
+ if (this.pool.length < this.maxInstances) {
30
+ browser = await this.createBrowser();
31
+ this.pool.push(browser);
32
+ this.busyBrowsers.add(browser.instance);
33
+ this.stats.created++;
34
+ return browser.instance;
35
+ }
36
+
37
+ // Queue the request and wait for available browser
38
+ this.stats.queued++;
39
+ return this.queueRequest();
40
+ }
41
+
42
+ async createBrowser() {
43
+ const puppeteer = await this.getPuppeteer();
44
+ const instance = await puppeteer.launch({
45
+ headless: 'new',
46
+ args: [
47
+ '--no-sandbox',
48
+ '--disable-setuid-sandbox',
49
+ '--disable-dev-shm-usage',
50
+ '--disable-gpu',
51
+ '--disable-web-security',
52
+ '--disable-features=VizDisplayCompositor',
53
+ '--disable-background-timer-throttling',
54
+ '--disable-backgrounding-occluded-windows',
55
+ '--disable-renderer-backgrounding',
56
+ '--disable-extensions',
57
+ '--disable-default-apps',
58
+ '--disable-sync',
59
+ '--metrics-recording-only',
60
+ '--mute-audio',
61
+ '--no-first-run'
62
+ ]
63
+ });
64
+
65
+ const browser = {
66
+ instance,
67
+ created: Date.now(),
68
+ lastUsed: Date.now(),
69
+ pageCount: 0
70
+ };
71
+
72
+ // Handle browser disconnect
73
+ instance.on('disconnected', () => {
74
+ this.removeBrowser(browser);
75
+ this.processQueue();
76
+ });
77
+
78
+ return browser;
79
+ }
80
+
81
+ async getPuppeteer() {
82
+ try {
83
+ const puppeteer = await import('puppeteer');
84
+ return puppeteer.default || puppeteer;
85
+ } catch (error) {
86
+ throw new Error('Puppeteer is not installed. Please install it to use Puppeteer-based scraping.');
87
+ }
88
+ }
89
+
90
+ async queueRequest() {
91
+ return new Promise((resolve) => {
92
+ this.requestQueue.push({ resolve, timestamp: Date.now() });
93
+ });
94
+ }
95
+
96
+ processQueue() {
97
+ if (this.requestQueue.length === 0) return;
98
+
99
+ // Find available browser
100
+ const available = this.pool.find(b => !this.busyBrowsers.has(b.instance));
101
+ if (!available) return;
102
+
103
+ // Process oldest request in queue
104
+ const request = this.requestQueue.shift();
105
+ if (request) {
106
+ available.lastUsed = Date.now();
107
+ this.busyBrowsers.add(available.instance);
108
+ request.resolve(available.instance);
109
+ }
110
+ }
111
+
112
+ releaseBrowser(browser) {
113
+ this.busyBrowsers.delete(browser);
114
+
115
+ // Process any queued requests
116
+ this.processQueue();
117
+
118
+ // Start cleanup timer if not already running
119
+ if (!this.cleanupTimer) {
120
+ this.cleanupTimer = setTimeout(() => this.cleanup(), this.idleTimeout);
121
+ }
122
+ }
123
+
124
+ removeBrowser(browserObj) {
125
+ const index = this.pool.findIndex(b => b.instance === browserObj.instance);
126
+ if (index !== -1) {
127
+ this.pool.splice(index, 1);
128
+ this.busyBrowsers.delete(browserObj.instance);
129
+ }
130
+ }
131
+
132
+ async cleanup() {
133
+ this.cleanupTimer = null;
134
+ const now = Date.now();
135
+ const toRemove = [];
136
+
137
+ // Keep at least one browser if there are queued requests
138
+ const minBrowsers = this.requestQueue.length > 0 ? 1 : 0;
139
+
140
+ for (const browser of this.pool) {
141
+ // Skip if we need to keep minimum browsers
142
+ if (this.pool.length - toRemove.length <= minBrowsers) break;
143
+
144
+ // Remove idle browsers
145
+ const isIdle = !this.busyBrowsers.has(browser.instance);
146
+ const idleTime = now - browser.lastUsed;
147
+
148
+ if (isIdle && idleTime > this.idleTimeout) {
149
+ toRemove.push(browser);
150
+ }
151
+ }
152
+
153
+ // Close idle browsers
154
+ for (const browser of toRemove) {
155
+ try {
156
+ // Check if browser is still connected
157
+ if (browser.instance && browser.instance.isConnected()) {
158
+ await browser.instance.close();
159
+ }
160
+ this.removeBrowser(browser);
161
+ this.stats.cleaned++;
162
+ } catch (error) {
163
+ // Silently ignore protocol errors and disconnection errors
164
+ if (!error.message.includes('Protocol error') &&
165
+ !error.message.includes('Target closed') &&
166
+ !error.message.includes('Connection closed')) {
167
+ console.warn('Error closing browser:', error.message);
168
+ }
169
+ // Remove browser even if close failed
170
+ this.removeBrowser(browser);
171
+ }
172
+ }
173
+
174
+ // Schedule next cleanup if there are still browsers
175
+ if (this.pool.length > 0) {
176
+ this.cleanupTimer = setTimeout(() => this.cleanup(), this.idleTimeout);
177
+ }
178
+ }
179
+
180
+ async closeAll() {
181
+ if (this.cleanupTimer) {
182
+ clearTimeout(this.cleanupTimer);
183
+ this.cleanupTimer = null;
184
+ }
185
+
186
+ // Clear the queue
187
+ this.requestQueue = [];
188
+
189
+ const closePromises = this.pool.map(async (browser) => {
190
+ try {
191
+ // Check if browser is still connected
192
+ if (browser.instance && browser.instance.isConnected()) {
193
+ await browser.instance.close();
194
+ }
195
+ } catch (error) {
196
+ // Silently ignore protocol errors and disconnection errors
197
+ if (!error.message.includes('Protocol error') &&
198
+ !error.message.includes('Target closed') &&
199
+ !error.message.includes('Connection closed')) {
200
+ console.warn('Error closing browser:', error.message);
201
+ }
202
+ }
203
+ });
204
+
205
+ await Promise.all(closePromises);
206
+ this.pool = [];
207
+ this.busyBrowsers.clear();
208
+ }
209
+
210
+ getStats() {
211
+ return {
212
+ ...this.stats,
213
+ poolSize: this.pool.length,
214
+ busyCount: this.busyBrowsers.size,
215
+ idleCount: this.pool.length - this.busyBrowsers.size,
216
+ queueLength: this.requestQueue.length
217
+ };
218
+ }
219
+ }
220
+
221
+ // Global browser pool instance
222
+ const browserPool = new BrowserPool(3, 5000);
223
+
224
+ // Graceful shutdown
225
+ process.on('SIGTERM', () => browserPool.closeAll());
226
+ process.on('SIGINT', () => browserPool.closeAll());
227
+ process.on('beforeExit', () => browserPool.closeAll());
228
+
229
+ export default browserPool;
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@monostate/node-scraper",
3
- "version": "1.8.0",
3
+ "version": "1.8.1",
4
4
  "description": "Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers",
5
5
  "type": "module",
6
6
  "main": "index.js",
@@ -14,7 +14,9 @@
14
14
  "files": [
15
15
  "index.js",
16
16
  "index.d.ts",
17
+ "browser-pool.js",
17
18
  "README.md",
19
+ "BULK_SCRAPING.md",
18
20
  "package.json",
19
21
  "scripts/"
20
22
  ],