npm - @monostate/node-scraper - Versions diffs - 1.6.0 → 1.8.0 - Mend

@monostate/node-scraper 1.6.0 → 1.8.0


Files changed (5)

package/README.md CHANGED

@@ -7,7 +7,7 @@
 [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](../../LICENSE)
 [![Node](https://img.shields.io/badge/Node.js-18%2B-green)](https://nodejs.org/)
-## 🚀 Quick Start
+## Quick Start
 ### Installation
@@ -19,18 +19,24 @@ yarn add @monostate/node-scraper
 pnpm add @monostate/node-scraper
 ```
-**🤖 New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI. (Note: v1.4.0 was an internal release)
+**New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
-**🎉 Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
+**Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
-**✨ Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
+**New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
+**New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
+**Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
+**Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
 ### Zero-Configuration Setup
 The package now automatically:
-- 📦 Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
-- 🔧 Configures binary paths and permissions
-- ✅ Validates installation health on first use
+- Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
+- Configures binary paths and permissions
+- Validates installation health on first use
 ### Basic Usage
@@ -72,6 +78,24 @@ console.log(result.stats); // Performance statistics
 await scraper.cleanup(); // Clean up resources
 ```
+### Browser Pool Configuration (New in v1.8.0)
+The package now includes automatic browser instance pooling to prevent memory leaks:
+```javascript
+// Browser pool is managed automatically with these defaults:
+// - Max 3 concurrent browser instances
+// - 5 second idle timeout before cleanup
+// - Automatic reuse of browser instances
+// For heavy workloads, you can manually clean up:
+const scraper = new BNCASmartScraper();
+// ... perform multiple scrapes ...
+await scraper.cleanup(); // Closes all browser instances
+```
+**Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
 ### Method Override (New in v1.6.0)
 Force a specific scraping method instead of using automatic fallback:
@@ -106,7 +130,69 @@ const result = await smartScrape('https://example.com', { method: 'auto' });
 }
 ```
-## 🔧 How It Works
+### Bulk Scraping (New in v1.8.0)
+Process multiple URLs efficiently with automatic request queueing and progress tracking:
+```javascript
+import { bulkScrape } from '@monostate/node-scraper';
+// Basic bulk scraping
+const urls = [
+  'https://example1.com',
+  'https://example2.com',
+  'https://example3.com',
+  // ... hundreds more
+];
+const results = await bulkScrape(urls, {
+  concurrency: 5,  // Process 5 URLs at a time
+  continueOnError: true,  // Don't stop on failures
+  progressCallback: (progress) => {
+    console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
+  }
+});
+console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
+console.log(`Total time: ${results.stats.totalTime}ms`);
+console.log(`Average time per URL: ${results.stats.averageTime}ms`);
+```
+#### Streaming Results
+For large datasets, use streaming to process results as they complete:
+```javascript
+import { bulkScrapeStream } from '@monostate/node-scraper';
+await bulkScrapeStream(urls, {
+  concurrency: 10,
+  onResult: async (result) => {
+    // Process each successful result immediately
+    await saveToDatabase(result);
+    console.log(`✓ ${result.url} - ${result.duration}ms`);
+  },
+  onError: async (error) => {
+    // Handle errors as they occur
+    console.error(`✗ ${error.url} - ${error.error}`);
+  },
+  progressCallback: (progress) => {
+    process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
+  }
+});
+```
+**Features:**
+- Automatic request queueing (no more memory errors!)
+- Configurable concurrency control
+- Real-time progress tracking
+- Continue on error or stop on first failure
+- Detailed statistics and method tracking
+- Browser instance pooling for efficiency
+For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
+## How It Works
 BNCA uses a sophisticated multi-tier system with intelligent detection:
@@ -235,7 +321,7 @@ Clean up resources (close browser instances).
 await scraper.cleanup();
 ```
-### 🤖 AI-Powered Q&A
+### AI-Powered Q&A
 Ask questions about any website and get AI-generated answers:
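
The README hunk above shows `bulkScrape` returning separate `success` and `failed` arrays plus a `stats` object. One common follow-up is a retry pass over the failures. The following is a minimal sketch that relies only on the fields visible in this diff; `collectRetryUrls` and the sample data are hypothetical and not part of the package.

```javascript
// Hypothetical helper: not exported by @monostate/node-scraper.
// Takes an object shaped like BulkScrapeResult and returns the URLs worth retrying.
function collectRetryUrls(results, isRetryable = () => true) {
  return results.failed
    .filter((item) => isRetryable(item))
    .map((item) => item.url);
}

// Stubbed data using the field names that bulkScrape() reports in this diff.
const sample = {
  success: [{ url: 'https://example1.com', duration: 120 }],
  failed: [
    { url: 'https://example2.com', success: false, error: 'timeout', duration: 5000 },
    { url: 'https://example3.com', success: false, error: 'HTTP 404', duration: 80 },
  ],
  stats: { successful: 1, failed: 2 },
};

// Retry only timeouts; a 404 is unlikely to succeed on a second attempt.
const retry = collectRetryUrls(sample, (item) => /timeout/i.test(item.error));
console.log(retry); // [ 'https://example2.com' ]
```

The returned list could be passed to a second `bulkScrape` call, ideally with a lower `concurrency` value.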

package/index.d.ts CHANGED

@@ -139,6 +139,118 @@ export interface HealthCheckResult {
   timestamp: string;
 }
+export interface BulkScrapeOptions extends ScrapingOptions {
+  /** Number of concurrent requests (default: 5) */
+  concurrency?: number;
+  /** Progress callback function */
+  progressCallback?: (progress: BulkProgress) => void;
+  /** Continue processing on error (default: true) */
+  continueOnError?: boolean;
+}
+export interface BulkScrapeStreamOptions extends ScrapingOptions {
+  /** Number of concurrent requests (default: 5) */
+  concurrency?: number;
+  /** Callback for each successful result */
+  onResult: (result: BulkScrapeResultItem) => void | Promise<void>;
+  /** Callback for errors */
+  onError?: (error: BulkScrapeErrorItem) => void | Promise<void>;
+  /** Progress callback function */
+  progressCallback?: (progress: BulkProgress) => void;
+}
+export interface BulkProgress {
+  /** Number of URLs processed */
+  processed: number;
+  /** Total number of URLs */
+  total: number;
+  /** Percentage complete */
+  percentage: number;
+  /** Current URL being processed */
+  current: string;
+}
+export interface BulkScrapeResult {
+  /** Successfully scraped results */
+  success: BulkScrapeResultItem[];
+  /** Failed scrapes */
+  failed: BulkScrapeErrorItem[];
+  /** Total number of URLs */
+  total: number;
+  /** Start timestamp */
+  startTime: number;
+  /** End timestamp */
+  endTime: number;
+  /** Aggregate statistics */
+  stats: BulkScrapeStats;
+}
+export interface BulkScrapeResultItem extends ScrapingResult {
+  /** The URL that was scraped */
+  url: string;
+  /** Time taken in milliseconds */
+  duration: number;
+  /** Timestamp of completion */
+  timestamp: string;
+}
+export interface BulkScrapeErrorItem {
+  /** The URL that failed */
+  url: string;
+  /** Success is always false for errors */
+  success: false;
+  /** Error message */
+  error: string;
+  /** Time taken in milliseconds */
+  duration: number;
+  /** Timestamp of failure */
+  timestamp: string;
+}
+export interface BulkScrapeStats {
+  /** Number of successful scrapes */
+  successful: number;
+  /** Number of failed scrapes */
+  failed: number;
+  /** Total time taken in milliseconds */
+  totalTime: number;
+  /** Average time per URL in milliseconds */
+  averageTime: number;
+  /** Count of methods used */
+  methods: {
+    direct: number;
+    lightpanda: number;
+    puppeteer: number;
+    pdf: number;
+  };
+}
+export interface BulkScrapeStreamStats {
+  /** Total number of URLs */
+  total: number;
+  /** Number of URLs processed */
+  processed: number;
+  /** Number of successful scrapes */
+  successful: number;
+  /** Number of failed scrapes */
+  failed: number;
+  /** Start timestamp */
+  startTime: number;
+  /** End timestamp */
+  endTime: number;
+  /** Total time in milliseconds */
+  totalTime: number;
+  /** Average time per URL in milliseconds */
+  averageTime: number;
+  /** Count of methods used */
+  methods: {
+    direct: number;
+    lightpanda: number;
+    puppeteer: number;
+    pdf: number;
+  };
+}
 /**
  * BNCA Smart Scraper - Intelligent web scraping with multi-level fallback
  */
@@ -264,6 +376,27 @@ export class BNCASmartScraper {
    * @param message Message to log
    */
   private log(message: string): void;
+  /**
+   * Clean up resources - closes all browser instances
+   */
+  cleanup(): Promise<void>;
+  /**
+   * Bulk scrape multiple URLs with optimized concurrency
+   * @param urls Array of URLs to scrape
+   * @param options Bulk scraping options
+   * @returns Promise resolving to bulk scraping results
+   */
+  bulkScrape(urls: string[], options?: BulkScrapeOptions): Promise<BulkScrapeResult>;
+  /**
+   * Bulk scrape with streaming results
+   * @param urls Array of URLs to scrape
+   * @param options Bulk scraping options with callbacks
+   * @returns Promise resolving to summary statistics
+   */
+  bulkScrapeStream(urls: string[], options: BulkScrapeStreamOptions): Promise<BulkScrapeStreamStats>;
 }
 /**
@@ -306,6 +439,22 @@ export function askWebsiteAI(url: string, question: string, options?: ScrapingOp
   processing?: 'openrouter' | 'openai' | 'backend' | 'local';
 }>;
+/**
+ * Convenience function for bulk scraping multiple URLs
+ * @param urls Array of URLs to scrape
+ * @param options Bulk scraping options
+ * @returns Promise resolving to bulk scraping results
+ */
+export function bulkScrape(urls: string[], options?: BulkScrapeOptions): Promise<BulkScrapeResult>;
+/**
+ * Convenience function for bulk scraping with streaming results
+ * @param urls Array of URLs to scrape
+ * @param options Bulk scraping options with callbacks
+ * @returns Promise resolving to summary statistics
+ */
+export function bulkScrapeStream(urls: string[], options: BulkScrapeStreamOptions): Promise<BulkScrapeStreamStats>;
 /**
  * Default export - same as BNCASmartScraper class
  */
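
`BulkScrapeErrorItem` declares a `success: false` field, but the object that `bulkScrapeStream` passes to `onError` in the index.js diff carries only `url`, `error`, `duration`, and `timestamp`. Callers that discriminate on `success` may want to normalize error items first. A minimal sketch, with hypothetical helper names and made-up sample values:

```javascript
// Hypothetical helpers; neither is exported by @monostate/node-scraper.
const isErrorItem = (item) => item.success === false;
const normalizeErrorItem = (item) => ({ ...item, success: false });

// Field names copied from the onError call in the index.js diff (no `success` field).
const fromStream = {
  url: 'https://example.com',
  error: 'timeout',
  duration: 5000,
  timestamp: '2025-01-01T00:00:00.000Z', // illustrative value only
};

console.log(isErrorItem(fromStream));                     // false
console.log(isErrorItem(normalizeErrorItem(fromStream))); // true
```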

package/index.js CHANGED

@@ -6,6 +6,7 @@ import path from 'path';
 import { fileURLToPath } from 'url';
 import { promises as fsPromises } from 'fs';
 import pdfParse from 'pdf-parse/lib/pdf-parse.js';
+import browserPool from './browser-pool.js';
 let puppeteer = null;
 try {
@@ -666,23 +667,13 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
       };
     }
+    let browser = null;
+    let page = null;
     try {
-      if (!this.browser) {
-        this.browser = await puppeteer.launch({
-          headless: true,
-          args: [
-            '--no-sandbox',
-            '--disable-setuid-sandbox',
-            '--disable-dev-shm-usage',
-            '--disable-accelerated-2d-canvas',
-            '--no-first-run',
-            '--no-zygote',
-            '--disable-gpu'
-          ]
-        });
-      }
-      const page = await this.browser.newPage();
+      // Get browser from pool
+      browser = await browserPool.getBrowser();
+      page = await browser.newPage();
       // Set user agent and viewport
       await page.setUserAgent(config.userAgent);
@@ -766,7 +757,6 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
         };
       });
-      await page.close();
       this.stats.puppeteer.successes++;
       return {
@@ -782,6 +772,26 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
         error: `Puppeteer scraping failed: ${errorMsg}`,
         errorType: this.categorizeError(errorMsg)
       };
+    } finally {
+      // Always clean up page
+      if (page) {
+        try {
+          // Check if page is still connected before closing
+          if (!page.isClosed()) {
+            await page.close();
+          }
+        } catch (e) {
+          // Silently ignore protocol errors when page is already closed
+          if (!e.message.includes('Protocol error') && !e.message.includes('Target closed')) {
+            console.warn('Error closing page:', e.message);
+          }
+        }
+      }
+      // Release browser back to pool
+      if (browser) {
+        browserPool.releaseBrowser(browser);
+      }
     }
   }
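
The hunk above swaps a per-instance `puppeteer.launch()` for `browserPool.getBrowser()` and `browserPool.releaseBrowser()`, but `browser-pool.js` itself does not appear in this diff. To illustrate the general technique, here is a rough sketch of a bounded pool with an idle timeout. Everything in it is an assumption: the class name `SimplePool`, the injected `launch` function, and the internals are hypothetical; only the three method names and the README defaults (3 instances, 5 second idle timeout) come from the diff.

```javascript
// Hypothetical sketch of a bounded browser pool; NOT the package's browser-pool.js.
class SimplePool {
  constructor(launch, { maxSize = 3, idleTimeoutMs = 5000 } = {}) {
    this.launch = launch;          // async function that creates a browser
    this.maxSize = maxSize;
    this.idleTimeoutMs = idleTimeoutMs;
    this.total = 0;                // browsers launched or currently launching
    this.idle = [];                // [{ browser, timer }]
    this.busy = new Set();
    this.waiters = [];             // resolve functions of callers waiting for a browser
  }

  async getBrowser() {
    const entry = this.idle.pop();
    if (entry) {
      clearTimeout(entry.timer);   // cancel the pending idle shutdown
      this.busy.add(entry.browser);
      return entry.browser;
    }
    if (this.total < this.maxSize) {
      this.total++;                // reserve the slot before awaiting
      try {
        const browser = await this.launch();
        this.busy.add(browser);
        return browser;
      } catch (err) {
        this.total--;
        // Give the freed slot to a waiting caller so it does not hang.
        const waiter = this.waiters.shift();
        if (waiter) waiter(this.getBrowser());
        throw err;
      }
    }
    // Pool is full: wait until releaseBrowser() hands one over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  releaseBrowser(browser) {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(browser);             // direct hand-off; the browser stays busy
      return;
    }
    this.busy.delete(browser);
    const entry = { browser, timer: null };
    entry.timer = setTimeout(() => {
      this.idle = this.idle.filter((e) => e !== entry);
      this.total--;
      Promise.resolve(browser.close()).catch(() => {});
    }, this.idleTimeoutMs);
    if (entry.timer.unref) entry.timer.unref(); // do not keep the process alive
    this.idle.push(entry);
  }

  async closeAll() {
    const browsers = [...this.busy, ...this.idle.map((e) => e.browser)];
    this.idle.forEach((e) => clearTimeout(e.timer));
    this.idle = [];
    this.busy.clear();
    this.total = 0;
    await Promise.all(browsers.map((b) => b.close()));
  }
}
```

With Puppeteer installed, the pool could be constructed as `new SimplePool(() => puppeteer.launch({ headless: true }))`. The sketch does not handle `closeAll()` racing with an in-flight launch.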
@@ -1467,6 +1477,235 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
       timestamp: new Date().toISOString()
     };
   }
+  /**
+   * Clean up resources - closes all browser instances
+   */
+  async cleanup() {
+    await browserPool.closeAll();
+  }
+  /**
+   * Bulk scrape multiple URLs with optimized concurrency
+   * @param {string[]} urls - Array of URLs to scrape
+   * @param {Object} options - Scraping options
+   * @returns {Promise<Object>} Bulk scraping results
+   */
+  async bulkScrape(urls, options = {}) {
+    const {
+      concurrency = 5,
+      progressCallback = null,
+      continueOnError = true,
+      ...scrapeOptions
+    } = options;
+    const results = {
+      success: [],
+      failed: [],
+      total: urls.length,
+      startTime: Date.now(),
+      endTime: null,
+      stats: {
+        successful: 0,
+        failed: 0,
+        totalTime: 0,
+        averageTime: 0,
+        methods: {
+          direct: 0,
+          lightpanda: 0,
+          puppeteer: 0,
+          pdf: 0
+        }
+      }
+    };
+    // Process URLs in batches
+    const batches = [];
+    for (let i = 0; i < urls.length; i += concurrency) {
+      batches.push(urls.slice(i, i + concurrency));
+    }
+    let processedCount = 0;
+    for (const batch of batches) {
+      const batchPromises = batch.map(async (url) => {
+        const startTime = Date.now();
+        try {
+          const result = await this.scrape(url, scrapeOptions);
+          const endTime = Date.now();
+          const duration = endTime - startTime;
+          const successResult = {
+            url,
+            ...result,
+            duration,
+            timestamp: new Date(endTime).toISOString()
+          };
+          results.success.push(successResult);
+          results.stats.successful++;
+          // Track method usage
+          if (result.method) {
+            results.stats.methods[result.method]++;
+          }
+          return successResult;
+        } catch (error) {
+          const endTime = Date.now();
+          const duration = endTime - startTime;
+          const failedResult = {
+            url,
+            success: false,
+            error: error.message,
+            duration,
+            timestamp: new Date(endTime).toISOString()
+          };
+          results.failed.push(failedResult);
+          results.stats.failed++;
+          if (!continueOnError) {
+            throw error;
+          }
+          return failedResult;
+        } finally {
+          processedCount++;
+          if (progressCallback) {
+            progressCallback({
+              processed: processedCount,
+              total: urls.length,
+              percentage: (processedCount / urls.length) * 100,
+              current: url
+            });
+          }
+        }
+      });
+      await Promise.all(batchPromises);
+    }
+    results.endTime = Date.now();
+    results.stats.totalTime = results.endTime - results.startTime;
+    results.stats.averageTime = results.stats.totalTime / urls.length;
+    return results;
+  }
+  /**
+   * Bulk scrape with streaming results
+   * @param {string[]} urls - Array of URLs to scrape
+   * @param {Object} options - Scraping options with onResult callback
+   * @returns {Promise<Object>} Summary statistics
+   */
+  async bulkScrapeStream(urls, options = {}) {
+    const {
+      concurrency = 5,
+      onResult = null,
+      onError = null,
+      progressCallback = null,
+      ...scrapeOptions
+    } = options;
+    if (!onResult) {
+      throw new Error('onResult callback is required for streaming bulk scrape');
+    }
+    const stats = {
+      total: urls.length,
+      processed: 0,
+      successful: 0,
+      failed: 0,
+      startTime: Date.now(),
+      endTime: null,
+      methods: {
+        direct: 0,
+        lightpanda: 0,
+        puppeteer: 0,
+        pdf: 0
+      }
+    };
+    const queue = [...urls];
+    const inProgress = new Set();
+    const processNext = async () => {
+      if (queue.length === 0 || inProgress.size >= concurrency) {
+        return;
+      }
+      const url = queue.shift();
+      inProgress.add(url);
+      const startTime = Date.now();
+      try {
+        const result = await this.scrape(url, scrapeOptions);
+        const duration = Date.now() - startTime;
+        stats.successful++;
+        if (result.method) {
+          stats.methods[result.method]++;
+        }
+        await onResult({
+          url,
+          ...result,
+          duration,
+          timestamp: new Date().toISOString()
+        });
+      } catch (error) {
+        const duration = Date.now() - startTime;
+        stats.failed++;
+        if (onError) {
+          await onError({
+            url,
+            error: error.message,
+            duration,
+            timestamp: new Date().toISOString()
+          });
+        }
+      } finally {
+        inProgress.delete(url);
+        stats.processed++;
+        if (progressCallback) {
+          progressCallback({
+            processed: stats.processed,
+            total: stats.total,
+            percentage: (stats.processed / stats.total) * 100,
+            current: url
+          });
+        }
+        // Process next URL
+        if (queue.length > 0) {
+          processNext();
+        }
+      }
+    };
+    // Start initial batch
+    const initialBatch = Math.min(concurrency, queue.length);
+    const promises = [];
+    for (let i = 0; i < initialBatch; i++) {
+      promises.push(processNext());
+    }
+    // Wait for all to complete
+    await Promise.all(promises);
+    while (inProgress.size > 0) {
+      await new Promise(resolve => setTimeout(resolve, 100));
+    }
+    stats.endTime = Date.now();
+    stats.totalTime = stats.endTime - stats.startTime;
+    stats.averageTime = stats.totalTime / stats.total;
+    return stats;
+  }
 }
 // Export convenience functions
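
In the hunk above, `bulkScrape` slices the URL list into batches of `concurrency` items and awaits each batch with `Promise.all`, so every batch takes as long as its slowest URL. A sliding-window worker pool keeps `concurrency` tasks in flight at all times instead. The sketch below shows that alternative technique in isolation; `mapWithConcurrency` is a hypothetical name and is not how the package implements either method.

```javascript
// Hypothetical sketch of a sliding-window concurrency limiter.
// Each runner pulls the next index as soon as it finishes its current task.
async function mapWithConcurrency(items, concurrency, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const index = next++;        // claimed synchronously, so no two runners share an index
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const runnerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;                  // same order as `items`
}
```

A caller could pass `(url) => scraper.scrape(url, options)` as the worker. Because every runner is awaited, no polling loop is needed to detect completion.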
@@ -1514,4 +1753,28 @@ export async function askWebsiteAI(url, question, options = {}) {
   }
 }
+export async function bulkScrape(urls, options = {}) {
+  const scraper = new BNCASmartScraper(options);
+  try {
+    const result = await scraper.bulkScrape(urls, options);
+    return result;
+  } catch (error) {
+    throw error;
+  } finally {
+    await scraper.cleanup();
+  }
+}
+export async function bulkScrapeStream(urls, options = {}) {
+  const scraper = new BNCASmartScraper(options);
+  try {
+    const result = await scraper.bulkScrapeStream(urls, options);
+    return result;
+  } catch (error) {
+    throw error;
+  } finally {
+    await scraper.cleanup();
+  }
+}
 export default BNCASmartScraper;
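
Both convenience wrappers above catch an error only to rethrow it, which behaves the same as a bare `try`/`finally`. The pattern can be factored into one helper, sketched here under the assumption that the scraper object exposes an async `cleanup()` as shown in the diff; `withScraper` and the `createScraper` factory are hypothetical names.

```javascript
// Hypothetical helper: runs `task` with a fresh scraper and always cleans up.
async function withScraper(createScraper, task) {
  const scraper = createScraper();
  try {
    return await task(scraper);    // errors propagate to the caller unchanged
  } finally {
    await scraper.cleanup();       // runs on both success and failure
  }
}
```

With the package's class, a call might look like `withScraper(() => new BNCASmartScraper(options), (s) => s.bulkScrape(urls, options))`.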

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@monostate/node-scraper",
-  "version": "1.6.0",
+  "version": "1.8.0",
   "description": "Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers",
   "type": "module",
   "main": "index.js",
@@ -16,8 +16,7 @@
     "index.d.ts",
     "README.md",
     "package.json",
-    "scripts/",
-    "bin/"
+    "scripts/"
   ],
   "scripts": {
     "postinstall": "node scripts/install-lightpanda.js"
@@ -50,7 +49,7 @@
     "pdf-parse": "^1.1.1"
   },
   "peerDependencies": {
-    "puppeteer": ">=20.0.0"
+    "puppeteer": "^24.11.2"
   },
   "peerDependenciesMeta": {
     "puppeteer": {
@@ -76,4 +75,4 @@
   "publishConfig": {
     "access": "public"
   }
-}
+}
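
The peer dependency range narrows from `>=20.0.0` to `^24.11.2`. Under npm's caret rules for a major version above zero, that accepts 24.11.2 and later 24.x releases but not 25.0.0, and it no longer accepts Puppeteer 20 through 23. The sketch below hand-rolls that check for illustration only; `satisfiesCaret` is a hypothetical function that ignores prerelease tags, and real projects would normally use the `semver` package instead.

```javascript
// Hypothetical, simplified caret check for plain "x.y.z" versions with major >= 1.
function parseVersion(v) {
  return v.split('.').map(Number);
}

function satisfiesCaret(version, base) {
  const [maj, min, pat] = parseVersion(version);
  const [bMaj, bMin, bPat] = parseVersion(base);
  if (bMaj === 0) throw new Error('sketch only handles major >= 1');
  if (maj !== bMaj) return false;          // caret pins the major version
  if (min !== bMin) return min > bMin;     // any later minor is fine
  return pat >= bPat;                      // same minor: patch must not be older
}

console.log(satisfiesCaret('24.11.2', '24.11.2')); // true
console.log(satisfiesCaret('23.9.0', '24.11.2'));  // false
console.log(satisfiesCaret('25.0.0', '24.11.2'));  // false
```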

package/bin/lightpanda DELETED

Binary file