@monostate/node-scraper 1.8.1 → 2.0.0

package/README.md CHANGED
@@ -1,69 +1,47 @@
1
1
  # @monostate/node-scraper
2
2
 
3
- > **Lightning-fast web scraping with intelligent fallback system - 11.35x faster than Firecrawl**
3
+ > Intelligent web scraping with multi-tier fallback, 11x faster than traditional scrapers
4
4
 
5
- [![npm version](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
6
- [![Performance](https://img.shields.io/badge/Performance-11.35x_faster_than_Firecrawl-brightgreen)](../../test-results/)
7
- [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](../../LICENSE)
8
- [![Node](https://img.shields.io/badge/Node.js-18%2B-green)](https://nodejs.org/)
5
+ [![npm](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
6
+ [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/monostate/node-scraper/blob/main/LICENSE)
7
+ [![Node](https://img.shields.io/badge/Node.js-20%2B-green)](https://nodejs.org/)
9
8
 
10
- ## Quick Start
11
-
12
- ### Installation
9
+ ## Install
13
10
 
14
11
  ```bash
15
12
  npm install @monostate/node-scraper
16
- # or
17
- yarn add @monostate/node-scraper
18
- # or
19
- pnpm add @monostate/node-scraper
20
13
  ```
21
14
 
22
- **Fixed in v1.8.1**: Critical production fix - browser-pool.js now included in npm package.
23
-
24
- **New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
25
-
26
- **Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
27
-
28
- **New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
29
-
30
- **New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
15
+ Lightpanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
31
16
 
32
- **Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
33
-
34
- **Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
35
-
36
- ### Zero-Configuration Setup
37
-
38
- The package now automatically:
39
- - Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
40
- - Configures binary paths and permissions
41
- - Validates installation health on first use
42
-
43
- ### Basic Usage
17
+ ## Usage
44
18
 
45
19
  ```javascript
46
20
  import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
47
21
 
48
- // Simple one-line scraping
22
+ // Scrape with automatic method selection
49
23
  const result = await smartScrape('https://example.com');
50
- console.log(result.content); // Extracted content
51
- console.log(result.method); // Method used: direct-fetch, lightpanda, or puppeteer
24
+ console.log(result.content);
25
+ console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'
52
26
 
53
- // Take a screenshot
27
+ // Screenshots
54
28
  const screenshot = await smartScreenshot('https://example.com');
55
- console.log(screenshot.screenshot); // Base64 encoded image
29
+ const quick = await quickShot('https://example.com'); // optimized for speed
30
+
31
+ // PDFs are detected and parsed automatically
32
+ const pdf = await smartScrape('https://example.com/doc.pdf');
33
+ ```
56
34
 
57
- // Quick screenshot (optimized for speed)
58
- const quick = await quickShot('https://example.com');
59
- console.log(quick.screenshot); // Fast screenshot capture
35
+ ### Force a specific method
60
36
 
61
- // PDF parsing (automatic detection)
62
- const pdfResult = await smartScrape('https://example.com/document.pdf');
63
- console.log(pdfResult.content); // Extracted text, metadata, page count
37
+ ```javascript
38
+ const result = await smartScrape('https://example.com', { method: 'direct' });
39
+ // Also: 'lightpanda', 'puppeteer', 'auto' (default)
64
40
  ```
65
41
 
66
- ### Advanced Usage
42
+ No fallback occurs when a method is forced — useful for testing and debugging.
43
+
44
+ ### Advanced usage
67
45
 
68
46
  ```javascript
69
47
  import { BNCASmartScraper } from '@monostate/node-scraper';
@@ -71,575 +49,145 @@ import { BNCASmartScraper } from '@monostate/node-scraper';
71
49
  const scraper = new BNCASmartScraper({
72
50
  timeout: 10000,
73
51
  verbose: true,
74
- lightpandaPath: './lightpanda' // optional
75
52
  });
76
53
 
77
54
  const result = await scraper.scrape('https://complex-spa.com');
78
- console.log(result.stats); // Performance statistics
79
-
80
- await scraper.cleanup(); // Clean up resources
81
- ```
82
-
83
- ### Browser Pool Configuration (New in v1.8.0)
84
-
85
- The package now includes automatic browser instance pooling to prevent memory leaks:
86
-
87
- ```javascript
88
- // Browser pool is managed automatically with these defaults:
89
- // - Max 3 concurrent browser instances
90
- // - 5 second idle timeout before cleanup
91
- // - Automatic reuse of browser instances
92
-
93
- // For heavy workloads, you can manually clean up:
94
- const scraper = new BNCASmartScraper();
95
- // ... perform multiple scrapes ...
96
- await scraper.cleanup(); // Closes all browser instances
97
- ```
98
-
99
- **Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
100
-
101
- ### Method Override (New in v1.6.0)
102
-
103
- Force a specific scraping method instead of using automatic fallback:
104
-
105
- ```javascript
106
- // Force direct fetch (no browser)
107
- const result = await smartScrape('https://example.com', { method: 'direct' });
108
-
109
- // Force Lightpanda browser
110
- const result = await smartScrape('https://example.com', { method: 'lightpanda' });
111
-
112
- // Force Puppeteer (full Chrome)
113
- const result = await smartScrape('https://example.com', { method: 'puppeteer' });
114
-
115
- // Auto mode (default - intelligent fallback)
116
- const result = await smartScrape('https://example.com', { method: 'auto' });
117
- ```
118
-
119
- **Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
120
- - Testing specific methods in isolation
121
- - Optimizing for known site requirements
122
- - Debugging method-specific issues
55
+ const stats = scraper.getStats();
56
+ const health = await scraper.healthCheck();
123
57
 
124
- **Error Response for Forced Methods**:
125
- ```javascript
126
- {
127
- success: false,
128
- error: "Lightpanda scraping failed: [specific error]",
129
- method: "lightpanda",
130
- errorType: "network|timeout|parsing|service_unavailable",
131
- details: "Additional error context"
132
- }
58
+ await scraper.cleanup();
133
59
  ```
134
60
 
135
- ### Bulk Scraping (New in v1.8.0)
136
-
137
- Process multiple URLs efficiently with automatic request queueing and progress tracking:
61
+ ### Bulk scraping
138
62
 
139
63
  ```javascript
140
- import { bulkScrape } from '@monostate/node-scraper';
141
-
142
- // Basic bulk scraping
143
- const urls = [
144
- 'https://example1.com',
145
- 'https://example2.com',
146
- 'https://example3.com',
147
- // ... hundreds more
148
- ];
64
+ import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';
149
65
 
150
66
  const results = await bulkScrape(urls, {
151
- concurrency: 5, // Process 5 URLs at a time
152
- continueOnError: true, // Don't stop on failures
153
- progressCallback: (progress) => {
154
- console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
155
- }
67
+ concurrency: 5,
68
+ continueOnError: true,
69
+ progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
156
70
  });
157
71
 
158
- console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
159
- console.log(`Total time: ${results.stats.totalTime}ms`);
160
- console.log(`Average time per URL: ${results.stats.averageTime}ms`);
161
- ```
162
-
163
- #### Streaming Results
164
-
165
- For large datasets, use streaming to process results as they complete:
166
-
167
- ```javascript
168
- import { bulkScrapeStream } from '@monostate/node-scraper';
169
-
72
+ // Or stream results as they complete
170
73
  await bulkScrapeStream(urls, {
171
74
  concurrency: 10,
172
- onResult: async (result) => {
173
- // Process each successful result immediately
174
- await saveToDatabase(result);
175
- console.log(`✓ ${result.url} - ${result.duration}ms`);
176
- },
177
- onError: async (error) => {
178
- // Handle errors as they occur
179
- console.error(`✗ ${error.url} - ${error.error}`);
180
- },
181
- progressCallback: (progress) => {
182
- process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
183
- }
75
+ onResult: async (result) => await saveToDatabase(result),
76
+ onError: async (error) => console.error(error.url, error.error),
184
77
  });
185
78
  ```
186
79
 
187
- **Features:**
188
- - Automatic request queueing (no more memory errors!)
189
- - Configurable concurrency control
190
- - Real-time progress tracking
191
- - Continue on error or stop on first failure
192
- - Detailed statistics and method tracking
193
- - Browser instance pooling for efficiency
194
-
195
- For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
196
-
197
- ## How It Works
198
-
199
- BNCA uses a sophisticated multi-tier system with intelligent detection:
200
-
201
- ### 1. 🔄 Direct Fetch (Fastest)
202
- - Pure HTTP requests with intelligent HTML parsing
203
- - **Performance**: Sub-second responses
204
- - **Success rate**: 75% of websites
205
- - **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
206
-
207
- ### 2. 🐼 Lightpanda Browser (Fast)
208
- - Lightweight browser engine (2-3x faster than Chromium)
209
- - **Performance**: Fast JavaScript execution
210
- - **Fallback triggers**: SPA detection
211
-
212
- ### 3. 🔵 Puppeteer (Complete)
213
- - Full Chromium browser for maximum compatibility
214
- - **Performance**: Complete JavaScript execution
215
- - **Fallback triggers**: Complex interactions needed
216
-
217
- ### 📄 PDF Parser (Specialized)
218
- - Automatic PDF detection and parsing
219
- - **Features**: Text extraction, metadata, page count
220
- - **Smart Detection**: Works even when PDFs are served with wrong content-types
221
- - **Performance**: Typically 100-500ms for most PDFs
222
-
223
- ### 📸 Screenshot Methods
224
- - **Chrome CLI**: Direct Chrome screenshot capture
225
- - **Quickshot**: Optimized with retry logic and smart timeouts
226
-
227
- ## 📊 Performance Benchmark
228
-
229
- | Site Type | BNCA | Firecrawl | Speed Advantage |
230
- |-----------|------|-----------|----------------|
231
- | **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
232
- | **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
233
- | **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
234
-
235
- **Average**: 11.35x faster than Firecrawl with 100% reliability
236
-
237
- ## 🎛️ API Reference
238
-
239
- ### Convenience Functions
240
-
241
- #### `smartScrape(url, options?)`
242
- Quick scraping with intelligent fallback.
243
-
244
- #### `smartScreenshot(url, options?)`
245
- Take a screenshot of any webpage.
246
-
247
- #### `quickShot(url, options?)`
248
- Optimized screenshot capture for maximum speed.
249
-
250
- **Parameters:**
251
- - `url` (string): URL to scrape/capture
252
- - `options` (object, optional): Configuration options
253
-
254
- **Returns:** Promise<ScrapingResult>
255
-
256
- ### `BNCASmartScraper`
257
-
258
- Main scraper class with advanced features.
259
-
260
- #### Constructor Options
261
-
262
- ```javascript
263
- const scraper = new BNCASmartScraper({
264
- timeout: 10000, // Request timeout in ms
265
- retries: 2, // Number of retries per method
266
- verbose: false, // Enable detailed logging
267
- lightpandaPath: './lightpanda', // Path to Lightpanda binary
268
- userAgent: 'Mozilla/5.0 ...', // Custom user agent
269
- });
270
- ```
271
-
272
- #### Methods
273
-
274
- ##### `scraper.scrape(url, options?)`
275
-
276
- Scrape a URL with intelligent fallback.
277
-
278
- ```javascript
279
- const result = await scraper.scrape('https://example.com');
280
- ```
80
+ See [BULK_SCRAPING.md](./BULK_SCRAPING.md) for full documentation.
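The request queueing that the `concurrency` option implies can be sketched with a small limiter. This is an illustrative stand-in under assumed semantics, not `bulkScrape`'s actual implementation:

```javascript
// Run fn over items with at most `concurrency` calls in flight.
// Workers pull the next index cooperatively; no await sits between the
// bounds check and the increment, so index handout is race-free in Node.
async function mapLimit(items, concurrency, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```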
281
81
 
282
- ##### `scraper.screenshot(url, options?)`
82
+ ### AI-powered Q&A
283
83
 
284
- Take a screenshot of a webpage.
84
+ Ask questions about any website using OpenRouter, OpenAI, or local fallback:
285
85
 
286
86
  ```javascript
287
- const result = await scraper.screenshot('https://example.com');
288
- const img = result.screenshot; // data:image/png;base64,...
289
- ```
290
-
291
- ##### `scraper.quickshot(url, options?)`
292
-
293
- Quick screenshot capture - optimized for speed with retry logic.
87
+ import { askWebsiteAI } from '@monostate/node-scraper';
294
88
 
295
- ```javascript
296
- const result = await scraper.quickshot('https://example.com');
297
- // 2-3x faster than regular screenshot
89
+ const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
90
+ openRouterApiKey: process.env.OPENROUTER_API_KEY,
91
+ });
298
92
  ```
299
93
 
300
- ##### `scraper.getStats()`
94
+ API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
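The precedence above can be sketched as a tiny selector. `pickAIProvider` is a hypothetical helper, not part of the package, though the option names (`openRouterApiKey`, `openAIApiKey`, `apiKey`) are the ones the README documents:

```javascript
// Hypothetical helper illustrating the documented key precedence:
// OpenRouter > OpenAI > BNCA backend > local pattern matching.
function pickAIProvider(opts = {}) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.apiKey) return 'bnca-backend';
  return 'local'; // pattern matching, no key required
}
```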
301
95
 
302
- Get performance statistics.
96
+ ## How it works
303
97
 
304
- ```javascript
305
- const stats = scraper.getStats();
306
- console.log(stats.successRates); // Success rates by method
307
- ```
98
+ The scraper uses a three-tier fallback system:
308
99
 
309
- ##### `scraper.healthCheck()`
100
+ 1. **Direct fetch** — Pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
101
+ 2. **Lightpanda** — Lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
102
+ 3. **Puppeteer** — Full Chromium for maximum compatibility.
310
103
 
311
- Check availability of all scraping methods.
104
+ Additional specialized handlers:
105
+ - **PDF parser** — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
106
+ - **Screenshots** — Chrome CLI capture with retry logic and smart timeouts.
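The magic-bytes detection mentioned above works because every PDF begins with the ASCII signature `%PDF`, regardless of the content-type header it was served with. A minimal check (illustrative, not the package's code):

```javascript
// Sniff the first four bytes for the PDF signature "%PDF", which is why
// detection still works for PDFs served as application/octet-stream.
function looksLikePdf(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).toString('ascii') === '%PDF';
}
```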
312
107
 
313
- ```javascript
314
- const health = await scraper.healthCheck();
315
- console.log(health.status); // 'healthy' or 'unhealthy'
316
- ```
108
+ Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
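The tiered behavior can be sketched as a first-success chain. This is an illustrative reimplementation of the auto-mode idea described above, not the package's internals:

```javascript
// Try each method in order, returning the first success and recording
// the chain of attempts (mirrors direct-fetch -> lightpanda -> puppeteer).
async function withFallback(url, methods) {
  const attempted = [];
  for (const [name, fn] of methods) {
    attempted.push(name);
    try {
      return { success: true, method: name, content: await fn(url), attempted };
    } catch {
      // fall through to the next, heavier tier
    }
  }
  return { success: false, attempted };
}
```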
317
109
 
318
- ##### `scraper.cleanup()`
110
+ ## Performance
319
111
 
320
- Clean up resources (close browser instances).
112
+ | Site Type | node-scraper | Firecrawl | Speedup |
113
+ |-----------|-------------|-----------|---------|
114
+ | Wikipedia | 154ms | 4,662ms | 30.3x |
115
+ | Hacker News | 1,715ms | 4,644ms | 2.7x |
116
+ | GitHub | 9,167ms | 9,790ms | 1.1x |
321
117
 
322
- ```javascript
323
- await scraper.cleanup();
324
- ```
118
+ Average: **11.35x faster** than Firecrawl in these benchmarks, with a 100% success rate.
325
119
 
326
- ### AI-Powered Q&A
120
+ ## API Reference
327
121
 
328
- Ask questions about any website and get AI-generated answers:
122
+ ### Convenience functions
329
123
 
330
- ```javascript
331
- // Method 1: Using your own OpenRouter API key
332
- const scraper = new BNCASmartScraper({
333
- openRouterApiKey: 'your-openrouter-api-key'
334
- });
335
- const result = await scraper.askAI('https://example.com', 'What is this website about?');
124
+ | Function | Description |
125
+ |----------|-------------|
126
+ | `smartScrape(url, opts?)` | Scrape with intelligent fallback |
127
+ | `smartScreenshot(url, opts?)` | Full page screenshot |
128
+ | `quickShot(url, opts?)` | Fast screenshot capture |
129
+ | `bulkScrape(urls, opts?)` | Batch scrape multiple URLs |
130
+ | `bulkScrapeStream(urls, opts?)` | Stream results as they complete |
131
+ | `askWebsiteAI(url, question, opts?)` | AI Q&A about a webpage |
336
132
 
337
- // Method 2: Using OpenAI API (or compatible endpoints)
338
- const scraper = new BNCASmartScraper({
339
- openAIApiKey: 'your-openai-api-key',
340
- // Optional: Use a compatible endpoint like Groq, Together AI, etc.
341
- openAIBaseUrl: 'https://api.groq.com/openai'
342
- });
343
- const result = await scraper.askAI('https://example.com', 'What services do they offer?');
133
+ ### BNCASmartScraper options
344
134
 
345
- // Method 3: One-liner with OpenRouter
346
- import { askWebsiteAI } from '@monostate/node-scraper';
347
- const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
348
- openRouterApiKey: process.env.OPENROUTER_API_KEY
349
- });
350
-
351
- // Method 4: Using BNCA backend API (requires BNCA API key)
352
- const scraper = new BNCASmartScraper({
353
- apiKey: 'your-bnca-api-key'
354
- });
355
- const result = await scraper.askAI('https://example.com', 'What products are featured?');
356
- ```
357
-
358
- **API Key Priority:**
359
- 1. OpenRouter API key (`openRouterApiKey`)
360
- 2. OpenAI API key (`openAIApiKey`)
361
- 3. BNCA backend API (`apiKey`)
362
- 4. Local fallback (pattern matching - no API key required)
363
-
364
- **Configuration Options:**
365
- ```javascript
366
- const result = await scraper.askAI(url, question, {
367
- // OpenRouter specific
368
- openRouterApiKey: 'sk-or-...',
369
- model: 'meta-llama/llama-4-scout:free', // Default model
370
-
371
- // OpenAI specific
372
- openAIApiKey: 'sk-...',
373
- openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
374
- model: 'gpt-3.5-turbo',
375
-
376
- // Shared options
377
- temperature: 0.3,
378
- maxTokens: 500
379
- });
380
- ```
381
-
382
- **Response Format:**
383
135
  ```javascript
384
136
  {
385
- success: true,
386
- answer: "This website is about...",
387
- method: "direct-fetch", // Scraping method used
388
- scrapeTime: 1234, // Time to scrape in ms
389
- processing: "openrouter" // AI processing method used
137
+ timeout: 10000, // Request timeout (ms)
138
+ retries: 2, // Retries per method
139
+ verbose: false, // Detailed logging
140
+ lightpandaPath: './bin/lightpanda',
141
+ lightpandaFormat: 'html', // 'html' or 'markdown'
142
+ userAgent: 'Mozilla/5.0 ...',
143
+ openRouterApiKey: '...',
144
+ openAIApiKey: '...',
145
+ openAIBaseUrl: 'https://api.openai.com',
390
146
  }
391
147
  ```
392
148
 
393
- ### 📄 PDF Support
394
-
395
- BNCA automatically detects and parses PDF documents:
396
-
397
- ```javascript
398
- const pdfResult = await smartScrape('https://example.com/document.pdf');
399
-
400
- // Parsed content includes:
401
- const content = JSON.parse(pdfResult.content);
402
- console.log(content.title); // PDF title
403
- console.log(content.author); // Author name
404
- console.log(content.pages); // Number of pages
405
- console.log(content.text); // Full extracted text
406
- console.log(content.creationDate); // Creation date
407
- console.log(content.metadata); // Additional metadata
408
- ```
409
-
410
- **PDF Detection Methods:**
411
- - URL ending with `.pdf`
412
- - Content-Type header `application/pdf`
413
- - Binary content starting with `%PDF` (magic bytes)
414
- - Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
415
-
416
- **Limitations:**
417
- - Maximum file size: 20MB
418
- - Text extraction only (no image OCR)
419
- - Requires `pdf-parse` dependency (automatically installed)
149
+ ### Methods
420
150
 
421
- ## 📱 Next.js Integration
151
+ - `scraper.scrape(url, opts?)` — Scrape with fallback
152
+ - `scraper.screenshot(url, opts?)` — Take screenshot
153
+ - `scraper.quickshot(url, opts?)` — Fast screenshot
154
+ - `scraper.askAI(url, question, opts?)` — AI Q&A
155
+ - `scraper.getStats()` — Performance statistics
156
+ - `scraper.healthCheck()` — Check method availability
157
+ - `scraper.cleanup()` — Close browser instances
422
158
 
423
- ### API Route Example
159
+ ### Response shape
424
160
 
425
161
  ```javascript
426
- // pages/api/scrape.js or app/api/scrape/route.js
427
- import { smartScrape } from '@monostate/node-scraper';
428
-
429
- export async function POST(request) {
430
- try {
431
- const { url } = await request.json();
432
- const result = await smartScrape(url);
433
-
434
- return Response.json({
435
- success: true,
436
- data: result.content,
437
- method: result.method,
438
- time: result.performance.totalTime
439
- });
440
- } catch (error) {
441
- return Response.json({
442
- success: false,
443
- error: error.message
444
- }, { status: 500 });
445
- }
446
- }
447
- ```
448
-
449
- ### React Hook Example
450
-
451
- ```javascript
452
- // hooks/useScraper.js
453
- import { useState } from 'react';
454
-
455
- export function useScraper() {
456
- const [loading, setLoading] = useState(false);
457
- const [data, setData] = useState(null);
458
- const [error, setError] = useState(null);
459
-
460
- const scrape = async (url) => {
461
- setLoading(true);
462
- setError(null);
463
-
464
- try {
465
- const response = await fetch('/api/scrape', {
466
- method: 'POST',
467
- headers: { 'Content-Type': 'application/json' },
468
- body: JSON.stringify({ url })
469
- });
470
-
471
- const result = await response.json();
472
-
473
- if (result.success) {
474
- setData(result.data);
475
- } else {
476
- setError(result.error);
477
- }
478
- } catch (err) {
479
- setError(err.message);
480
- } finally {
481
- setLoading(false);
482
- }
483
- };
484
-
485
- return { scrape, loading, data, error };
486
- }
487
- ```
488
-
489
- ### Component Example
490
-
491
- ```javascript
492
- // components/ScraperDemo.jsx
493
- import { useScraper } from '../hooks/useScraper';
494
-
495
- export default function ScraperDemo() {
496
- const { scrape, loading, data, error } = useScraper();
497
- const [url, setUrl] = useState('');
498
-
499
- const handleScrape = () => {
500
- if (url) scrape(url);
501
- };
502
-
503
- return (
504
- <div className="p-4">
505
- <div className="flex gap-2 mb-4">
506
- <input
507
- type="url"
508
- value={url}
509
- onChange={(e) => setUrl(e.target.value)}
510
- placeholder="Enter URL to scrape..."
511
- className="flex-1 px-3 py-2 border rounded"
512
- />
513
- <button
514
- onClick={handleScrape}
515
- disabled={loading}
516
- className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
517
- >
518
- {loading ? 'Scraping...' : 'Scrape'}
519
- </button>
520
- </div>
521
-
522
- {error && (
523
- <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
524
- Error: {error}
525
- </div>
526
- )}
527
-
528
- {data && (
529
- <div className="p-3 bg-green-100 rounded">
530
- <h3 className="font-bold mb-2">Scraped Content:</h3>
531
- <pre className="text-sm overflow-auto">{data}</pre>
532
- </div>
533
- )}
534
- </div>
535
- );
162
+ {
163
+ success: true,
164
+ content: '...', // Extracted content (JSON string)
165
+ method: 'direct-fetch', // Method used
166
+ url: 'https://...',
167
+ performance: { totalTime: 154 },
168
+ stats: { ... }
536
169
  }
537
170
  ```
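Since `content` is documented as a JSON string, consumers should parse it defensively. A sketch following the field names in the shape above:

```javascript
// Pull the title out of a scrape result; returns null on failure or
// unparseable content rather than throwing.
function extractTitle(result) {
  if (!result || !result.success) return null;
  try {
    return JSON.parse(result.content).title ?? null;
  } catch {
    return null;
  }
}
```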
538
171
 
539
- ## ⚠️ Important Notes
540
-
541
- ### Server-Side Only
542
- BNCA is designed for **server-side use only** due to:
543
- - Browser automation requirements (Puppeteer)
544
- - File system access for Lightpanda binary
545
- - CORS restrictions in browsers
546
-
547
- ### Next.js Deployment
548
- - Use in API routes, not client components
549
- - Ensure Node.js 18+ in production environment
550
- - Consider adding Lightpanda binary to deployment
551
-
552
- ### Lightpanda Setup (Optional)
553
- For maximum performance, install Lightpanda:
554
-
555
- ```bash
556
- # macOS ARM64
557
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
558
- chmod +x lightpanda
559
-
560
- # Linux x64
561
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
562
- chmod +x lightpanda
563
- ```
564
-
565
- ## 🔒 Privacy & Security
566
-
567
- - **No external API calls** - all processing is local
568
- - **No data collection** - your data stays private
569
- - **Respects robots.txt** (optional enforcement)
570
- - **Configurable rate limiting**
571
-
572
- ## 📝 TypeScript Support
172
+ ## TypeScript
573
173
 
574
- Full TypeScript definitions included:
174
+ Full type definitions are included (`index.d.ts`).
575
175
 
576
176
  ```typescript
577
- import { BNCASmartScraper, ScrapingResult, ScrapingOptions } from '@monostate/node-scraper';
578
-
579
- const scraper: BNCASmartScraper = new BNCASmartScraper({
580
- timeout: 5000,
581
- verbose: true
582
- });
583
-
584
- const result: ScrapingResult = await scraper.scrape('https://example.com');
177
+ import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
585
178
  ```
586
179
 
587
- ## Changelog
588
-
589
- ### v1.6.0 (Latest)
590
- - **Method Override**: Force specific scraping methods with `method` parameter
591
- - **Enhanced Error Handling**: Categorized error types for better debugging
592
- - **Fallback Chain Tracking**: See which methods were attempted in auto mode
593
- - **Graceful Failures**: No automatic fallback when method is forced
594
-
595
- ### v1.5.0
596
- - **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
597
- - **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
598
- - **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
599
- - **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
600
- - **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
601
- - **Enhanced TypeScript**: Complete type definitions for all AI features
602
-
603
- ### v1.4.0
604
- - Internal release (skipped for public release)
180
+ ## Notes
605
181
 
606
- ### v1.3.0
607
- - **PDF Support**: Full PDF parsing with text extraction, metadata, and page count
608
- - **Smart PDF Detection**: Detects PDFs by URL patterns, content-type, or magic bytes
609
- - **Robust Parsing**: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
610
- - **Fast Performance**: PDF parsing typically completes in 100-500ms
611
- - **Comprehensive Extraction**: Title, author, creation date, page count, and full text
182
+ - **Server-side only** — requires filesystem access and browser automation.
183
+ - **Node.js 20+** required.
184
+ - Lightpanda binary is auto-installed. Puppeteer is optional.
185
+ - No external API calls for scraping; all processing is local.
612
186
 
613
- ### v1.2.0
614
- - **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
615
- - **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
616
- - **Improved Performance**: Enhanced binary detection and ES6 module compatibility
617
- - **Better Error Handling**: More robust installation scripts with retry logic
618
- - **Zero Configuration**: No manual setup required - works out of the box
619
-
620
- ### v1.1.1
621
- - Bug fixes and stability improvements
622
- - Enhanced Puppeteer integration
623
-
624
- ### v1.1.0
625
- - Added screenshot capabilities
626
- - Improved fallback system
627
- - Performance optimizations
628
-
629
- ## 🤝 Contributing
630
-
631
- See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
632
-
633
- ## 📄 License
634
-
635
- MIT License - see [LICENSE](../../LICENSE) file for details.
636
-
637
- ---
638
-
639
- <div align="center">
187
+ ## Changelog
640
188
 
641
- **Built with ❤️ for fast, reliable web scraping**
189
+ See [CHANGELOG.md](./CHANGELOG.md) for the full release history.
642
190
 
643
- [⭐ Star on GitHub](https://github.com/your-org/bnca-prototype) | [📖 Full Documentation](https://github.com/your-org/bnca-prototype#readme)
191
+ ## License
644
192
 
645
- </div>
193
+ MIT
package/browser-pool.js CHANGED
@@ -42,7 +42,7 @@ class BrowserPool {
42
42
  async createBrowser() {
43
43
  const puppeteer = await this.getPuppeteer();
44
44
  const instance = await puppeteer.launch({
45
- headless: 'new',
45
+ headless: true,
46
46
  args: [
47
47
  '--no-sandbox',
48
48
  '--disable-setuid-sandbox',
package/index.js CHANGED
@@ -1,11 +1,10 @@
1
- import fetch from 'node-fetch';
2
1
  import { spawn, execSync } from 'child_process';
3
2
  import fs from 'fs/promises';
4
3
  import { existsSync, statSync } from 'fs';
5
4
  import path from 'path';
6
5
  import { fileURLToPath } from 'url';
7
6
  import { promises as fsPromises } from 'fs';
8
- import pdfParse from 'pdf-parse/lib/pdf-parse.js';
7
+ import { PDFParse } from 'pdf-parse';
9
8
  import browserPool from './browser-pool.js';
10
9
 
11
10
  let puppeteer = null;
@@ -604,27 +603,41 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
604
603
  }
605
604
 
606
605
  return new Promise((resolve) => {
607
- const args = ['fetch', '--dump', url];
606
+ const format = config.lightpandaFormat || 'html';
607
+ const args = [
608
+ 'fetch',
609
+ '--dump', format,
610
+ '--with_frames',
611
+ '--http_timeout', String(config.timeout),
612
+ url
613
+ ];
608
614
  const process = spawn(this.options.lightpandaPath, args, {
609
- timeout: config.timeout + 1000 // Add buffer for process timeout only
615
+ timeout: config.timeout + 2000 // Buffer above http_timeout
610
616
  });
611
-
617
+
612
618
  let output = '';
613
619
  let errorOutput = '';
614
-
620
+
615
621
  process.stdout.on('data', (data) => {
616
622
  output += data.toString();
617
623
  });
618
-
624
+
619
625
  process.stderr.on('data', (data) => {
620
626
  errorOutput += data.toString();
621
627
  });
622
-
628
+
623
629
  process.on('close', (code) => {
624
630
  if (code === 0 && output.length > 0) {
625
- const content = this.extractContentFromHTML(output);
631
+ // Markdown output is already clean text, no HTML extraction needed
632
+ const content = format === 'markdown'
633
+ ? JSON.stringify({
634
+ title: output.match(/^#\s+(.+)$/m)?.[1] || '',
635
+ content: output,
636
+ extractedAt: new Date().toISOString()
637
+ }, null, 2)
638
+ : this.extractContentFromHTML(output);
626
639
  this.stats.lightpanda.successes++;
627
-
640
+
628
641
  resolve({
629
642
  success: true,
630
643
  content,
@@ -642,7 +655,7 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
642
655
  });
643
656
  }
644
657
  });
645
-
658
+
646
659
  process.on('error', (error) => {
647
660
  resolve({
648
661
  success: false,
@@ -847,25 +860,30 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
     };
   }

-    // Parse PDF
-    const pdfData = await pdfParse(buffer);
-
+    // Parse PDF with pdf-parse v2 API
+    const parser = new PDFParse({ data: new Uint8Array(buffer) });
+    await parser.load();
+    const textResult = await parser.getText();
+    const infoResult = await parser.getInfo();
+    parser.destroy();
+
     // Extract structured content
+    const pdfInfo = infoResult.info || {};
     const content = {
-      title: pdfData.info?.Title || 'Untitled PDF',
-      author: pdfData.info?.Author || '',
-      subject: pdfData.info?.Subject || '',
-      keywords: pdfData.info?.Keywords || '',
-      creator: pdfData.info?.Creator || '',
-      producer: pdfData.info?.Producer || '',
-      creationDate: pdfData.info?.CreationDate || '',
-      modificationDate: pdfData.info?.ModificationDate || '',
-      pages: pdfData.numpages || 0,
-      text: pdfData.text || '',
-      metadata: pdfData.metadata || null,
+      title: pdfInfo.Title || infoResult.outline?.[0]?.title || 'Untitled PDF',
+      author: pdfInfo.Author || '',
+      subject: pdfInfo.Subject || '',
+      keywords: pdfInfo.Keywords || '',
+      creator: pdfInfo.Creator || '',
+      producer: pdfInfo.Producer || '',
+      creationDate: pdfInfo.CreationDate || '',
+      modificationDate: pdfInfo.ModDate || '',
+      pages: textResult.total || 0,
+      text: textResult.text || '',
+      metadata: infoResult.metadata || null,
       url: url
     };
-
+
     this.stats.pdf.successes++;

     return {
@@ -1008,11 +1026,11 @@ ${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h
     });

     // Extract window state data
-    const windowDataMatch = html.match(/window\.__(?:INITIAL_STATE__|INITIAL_DATA__|NEXT_DATA__)__\s*=\s*({[\s\S]*?});/);
+    const windowDataMatch = html.match(/window\.__(INITIAL_STATE|INITIAL_DATA|NEXT_DATA)__\s*=\s*({[\s\S]*?});/);
     let windowData = null;
     if (windowDataMatch) {
       try {
-        windowData = JSON.parse(windowDataMatch[1]);
+        windowData = JSON.parse(windowDataMatch[2]);
       } catch {
         windowData = 'Found but unparseable';
       }
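The regex change in this hunk fixes two coupled bugs: the old pattern doubled the trailing underscores (so it could only ever match a nonexistent global like `window.__NEXT_DATA____`), and switching the non-capturing `(?: )` group to a capturing one shifts the JSON payload from group `[1]` to group `[2]`. A minimal sketch, run against a hypothetical HTML snippet rather than real scraper output:

```javascript
// Sketch: demonstrate the old vs. new window-state regex.
// Old pattern: name alternatives already end in "__", then "__" again,
// so it expects four trailing underscores and never matches real pages.
const oldRe = /window\.__(?:INITIAL_STATE__|INITIAL_DATA__|NEXT_DATA__)__\s*=\s*({[\s\S]*?});/;
// New pattern: single "__" suffix; the name is a capturing group,
// so the JSON object literal moves to capture group 2.
const newRe = /window\.__(INITIAL_STATE|INITIAL_DATA|NEXT_DATA)__\s*=\s*({[\s\S]*?});/;

const html = '<script>window.__NEXT_DATA__ = {"page":"/"};</script>';

console.log(oldRe.test(html));      // → false (old pattern never fires)
const m = html.match(newRe);
console.log(m[1]);                  // → "NEXT_DATA" (group 1 is the variable name)
console.log(JSON.parse(m[2]).page); // → "/" (payload is now group 2)
```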
package/package.json CHANGED
@@ -1,6 +1,6 @@
   {
     "name": "@monostate/node-scraper",
-    "version": "1.8.1",
+    "version": "2.0.0",
     "description": "Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers",
     "type": "module",
     "main": "index.js",
@@ -21,6 +21,7 @@
       "scripts/"
     ],
     "scripts": {
+      "test": "node --test test/",
       "postinstall": "node scripts/install-lightpanda.js"
     },
     "keywords": [
@@ -47,11 +48,10 @@
     "author": "BNCA Team",
     "license": "MIT",
     "dependencies": {
-      "node-fetch": "^3.3.2",
-      "pdf-parse": "^1.1.1"
+      "pdf-parse": "^2.4.5"
     },
     "peerDependencies": {
-      "puppeteer": "^24.11.2"
+      "puppeteer": "^24.38.0"
     },
     "peerDependenciesMeta": {
       "puppeteer": {
@@ -59,7 +59,7 @@
       }
     },
     "engines": {
-      "node": ">=18.0.0"
+      "node": ">=20.0.0"
     },
     "repository": {
       "type": "git",
package/scripts/install-lightpanda.js CHANGED
@@ -6,17 +6,30 @@ import path from 'path';
   import { createWriteStream } from 'fs';
   import { execSync } from 'child_process';

-  const LIGHTPANDA_VERSION = 'nightly';
+  const LIGHTPANDA_VERSION = 'v0.2.5';
   const BINARY_DIR = path.join(path.dirname(path.dirname(new URL(import.meta.url).pathname)), 'bin');
   const BINARY_NAME = 'lightpanda';
   const BINARY_PATH = path.join(BINARY_DIR, BINARY_NAME);

-  // Platform-specific download URLs (matching official Lightpanda instructions)
-  const DOWNLOAD_URLS = {
-    'darwin': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-aarch64-macos`,
-    'linux': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-x86_64-linux`,
-    'wsl': `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}/lightpanda-x86_64-linux` // WSL uses Linux binary
-  };
+  function detectArch() {
+    const arch = process.arch;
+    if (arch === 'arm64' || arch === 'aarch64') return 'aarch64';
+    if (arch === 'x64' || arch === 'x86_64') return 'x86_64';
+    return arch;
+  }
+
+  // Platform-specific download URLs (matching official Lightpanda releases)
+  function getDownloadUrls() {
+    const arch = detectArch();
+    const base = `https://github.com/lightpanda-io/browser/releases/download/${LIGHTPANDA_VERSION}`;
+    return {
+      'darwin': `${base}/lightpanda-${arch}-macos`,
+      'linux': `${base}/lightpanda-${arch}-linux`,
+      'wsl': `${base}/lightpanda-x86_64-linux`
+    };
+  }
+
+  const DOWNLOAD_URLS = getDownloadUrls();

   function detectPlatform() {
     const platform = process.platform;
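The `detectArch()` added in getDownloadUrls() above bridges a naming gap: Node's `process.arch` reports `'arm64'` and `'x64'`, while Lightpanda release assets are named with `aarch64` and `x86_64`. A standalone sketch of that mapping (the function name `normalizeArch` and the sample values are illustrative, not from the package):

```javascript
// Sketch: map Node.js process.arch values onto Lightpanda asset names.
function normalizeArch(arch) {
  if (arch === 'arm64' || arch === 'aarch64') return 'aarch64';
  if (arch === 'x64' || arch === 'x86_64') return 'x86_64';
  return arch; // anything else (e.g. 'ia32') passes through unchanged
}

const base = 'https://github.com/lightpanda-io/browser/releases/download/v0.2.5';
console.log(`${base}/lightpanda-${normalizeArch('arm64')}-macos`);
// → .../lightpanda-aarch64-macos
```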