@monostate/node-scraper 1.8.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,67 +1,47 @@
  # @monostate/node-scraper
 
- > **Lightning-fast web scraping with intelligent fallback system - 11.35x faster than Firecrawl**
+ > Intelligent web scraping with multi-tier fallback, 11x faster than traditional scrapers
 
- [![npm version](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
- [![Performance](https://img.shields.io/badge/Performance-11.35x_faster_than_Firecrawl-brightgreen)](../../test-results/)
- [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](../../LICENSE)
- [![Node](https://img.shields.io/badge/Node.js-18%2B-green)](https://nodejs.org/)
+ [![npm](https://img.shields.io/npm/v/@monostate/node-scraper.svg)](https://www.npmjs.com/package/@monostate/node-scraper)
+ [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/monostate/node-scraper/blob/main/LICENSE)
+ [![Node](https://img.shields.io/badge/Node.js-20%2B-green)](https://nodejs.org/)
 
- ## Quick Start
-
- ### Installation
+ ## Install
 
  ```bash
  npm install @monostate/node-scraper
- # or
- yarn add @monostate/node-scraper
- # or
- pnpm add @monostate/node-scraper
  ```
 
- **New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
-
- **Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
-
- **New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
-
- **New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
-
- **Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
+ LightPanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
 
- **Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
-
- ### Zero-Configuration Setup
-
- The package now automatically:
- - Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
- - Configures binary paths and permissions
- - Validates installation health on first use
-
- ### Basic Usage
+ ## Usage
 
  ```javascript
  import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
 
- // Simple one-line scraping
+ // Scrape with automatic method selection
  const result = await smartScrape('https://example.com');
- console.log(result.content); // Extracted content
- console.log(result.method); // Method used: direct-fetch, lightpanda, or puppeteer
+ console.log(result.content);
+ console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'
 
- // Take a screenshot
+ // Screenshots
  const screenshot = await smartScreenshot('https://example.com');
- console.log(screenshot.screenshot); // Base64 encoded image
+ const quick = await quickShot('https://example.com'); // optimized for speed
+
+ // PDFs are detected and parsed automatically
+ const pdf = await smartScrape('https://example.com/doc.pdf');
+ ```
 
- // Quick screenshot (optimized for speed)
- const quick = await quickShot('https://example.com');
- console.log(quick.screenshot); // Fast screenshot capture
+ ### Force a specific method
 
- // PDF parsing (automatic detection)
- const pdfResult = await smartScrape('https://example.com/document.pdf');
- console.log(pdfResult.content); // Extracted text, metadata, page count
+ ```javascript
+ const result = await smartScrape('https://example.com', { method: 'direct' });
+ // Also: 'lightpanda', 'puppeteer', 'auto' (default)
  ```
 
- ### Advanced Usage
+ No fallback occurs when a method is forced — useful for testing and debugging.
+
+ ### Advanced usage
 
  ```javascript
  import { BNCASmartScraper } from '@monostate/node-scraper';
@@ -69,575 +49,145 @@ import { BNCASmartScraper } from '@monostate/node-scraper';
  const scraper = new BNCASmartScraper({
  timeout: 10000,
  verbose: true,
- lightpandaPath: './lightpanda' // optional
  });
 
  const result = await scraper.scrape('https://complex-spa.com');
- console.log(result.stats); // Performance statistics
-
- await scraper.cleanup(); // Clean up resources
- ```
-
- ### Browser Pool Configuration (New in v1.8.0)
-
- The package now includes automatic browser instance pooling to prevent memory leaks:
-
- ```javascript
- // Browser pool is managed automatically with these defaults:
- // - Max 3 concurrent browser instances
- // - 5 second idle timeout before cleanup
- // - Automatic reuse of browser instances
-
- // For heavy workloads, you can manually clean up:
- const scraper = new BNCASmartScraper();
- // ... perform multiple scrapes ...
- await scraper.cleanup(); // Closes all browser instances
- ```
-
- **Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
-
- ### Method Override (New in v1.6.0)
-
- Force a specific scraping method instead of using automatic fallback:
-
- ```javascript
- // Force direct fetch (no browser)
- const result = await smartScrape('https://example.com', { method: 'direct' });
-
- // Force Lightpanda browser
- const result = await smartScrape('https://example.com', { method: 'lightpanda' });
-
- // Force Puppeteer (full Chrome)
- const result = await smartScrape('https://example.com', { method: 'puppeteer' });
-
- // Auto mode (default - intelligent fallback)
- const result = await smartScrape('https://example.com', { method: 'auto' });
- ```
-
- **Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
- - Testing specific methods in isolation
- - Optimizing for known site requirements
- - Debugging method-specific issues
+ const stats = scraper.getStats();
+ const health = await scraper.healthCheck();
 
- **Error Response for Forced Methods**:
- ```javascript
- {
- success: false,
- error: "Lightpanda scraping failed: [specific error]",
- method: "lightpanda",
- errorType: "network|timeout|parsing|service_unavailable",
- details: "Additional error context"
- }
+ await scraper.cleanup();
  ```
 
- ### Bulk Scraping (New in v1.8.0)
-
- Process multiple URLs efficiently with automatic request queueing and progress tracking:
+ ### Bulk scraping
 
  ```javascript
- import { bulkScrape } from '@monostate/node-scraper';
-
- // Basic bulk scraping
- const urls = [
- 'https://example1.com',
- 'https://example2.com',
- 'https://example3.com',
- // ... hundreds more
- ];
+ import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';
 
  const results = await bulkScrape(urls, {
- concurrency: 5, // Process 5 URLs at a time
- continueOnError: true, // Don't stop on failures
- progressCallback: (progress) => {
- console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
- }
+ concurrency: 5,
+ continueOnError: true,
+ progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
  });
 
- console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
- console.log(`Total time: ${results.stats.totalTime}ms`);
- console.log(`Average time per URL: ${results.stats.averageTime}ms`);
- ```
-
- #### Streaming Results
-
- For large datasets, use streaming to process results as they complete:
-
- ```javascript
- import { bulkScrapeStream } from '@monostate/node-scraper';
-
+ // Or stream results as they complete
  await bulkScrapeStream(urls, {
  concurrency: 10,
- onResult: async (result) => {
- // Process each successful result immediately
- await saveToDatabase(result);
- console.log(`✓ ${result.url} - ${result.duration}ms`);
- },
- onError: async (error) => {
- // Handle errors as they occur
- console.error(`✗ ${error.url} - ${error.error}`);
- },
- progressCallback: (progress) => {
- process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
- }
- });
- ```
-
- **Features:**
- - Automatic request queueing (no more memory errors!)
- - Configurable concurrency control
- - Real-time progress tracking
- - Continue on error or stop on first failure
- - Detailed statistics and method tracking
- - Browser instance pooling for efficiency
-
- For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
-
- ## How It Works
-
- BNCA uses a sophisticated multi-tier system with intelligent detection:
-
- ### 1. 🔄 Direct Fetch (Fastest)
- - Pure HTTP requests with intelligent HTML parsing
- - **Performance**: Sub-second responses
- - **Success rate**: 75% of websites
- - **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
-
- ### 2. 🐼 Lightpanda Browser (Fast)
- - Lightweight browser engine (2-3x faster than Chromium)
- - **Performance**: Fast JavaScript execution
- - **Fallback triggers**: SPA detection
-
- ### 3. 🔵 Puppeteer (Complete)
- - Full Chromium browser for maximum compatibility
- - **Performance**: Complete JavaScript execution
- - **Fallback triggers**: Complex interactions needed
-
- ### 📄 PDF Parser (Specialized)
- - Automatic PDF detection and parsing
- - **Features**: Text extraction, metadata, page count
- - **Smart Detection**: Works even when PDFs are served with wrong content-types
- - **Performance**: Typically 100-500ms for most PDFs
-
- ### 📸 Screenshot Methods
- - **Chrome CLI**: Direct Chrome screenshot capture
- - **Quickshot**: Optimized with retry logic and smart timeouts
-
- ## 📊 Performance Benchmark
-
- | Site Type | BNCA | Firecrawl | Speed Advantage |
- |-----------|------|-----------|----------------|
- | **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
- | **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
- | **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
-
- **Average**: 11.35x faster than Firecrawl with 100% reliability
-
- ## 🎛️ API Reference
-
- ### Convenience Functions
-
- #### `smartScrape(url, options?)`
- Quick scraping with intelligent fallback.
-
- #### `smartScreenshot(url, options?)`
- Take a screenshot of any webpage.
-
- #### `quickShot(url, options?)`
- Optimized screenshot capture for maximum speed.
-
- **Parameters:**
- - `url` (string): URL to scrape/capture
- - `options` (object, optional): Configuration options
-
- **Returns:** Promise<ScrapingResult>
-
- ### `BNCASmartScraper`
-
- Main scraper class with advanced features.
-
- #### Constructor Options
-
- ```javascript
- const scraper = new BNCASmartScraper({
- timeout: 10000, // Request timeout in ms
- retries: 2, // Number of retries per method
- verbose: false, // Enable detailed logging
- lightpandaPath: './lightpanda', // Path to Lightpanda binary
- userAgent: 'Mozilla/5.0 ...', // Custom user agent
+ onResult: async (result) => await saveToDatabase(result),
+ onError: async (error) => console.error(error.url, error.error),
  });
  ```
 
- #### Methods
-
- ##### `scraper.scrape(url, options?)`
-
- Scrape a URL with intelligent fallback.
-
- ```javascript
- const result = await scraper.scrape('https://example.com');
- ```
-
- ##### `scraper.screenshot(url, options?)`
-
- Take a screenshot of a webpage.
-
- ```javascript
- const result = await scraper.screenshot('https://example.com');
- const img = result.screenshot; // data:image/png;base64,...
- ```
+ See [BULK_SCRAPING.md](./BULK_SCRAPING.md) for full documentation.
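
The `concurrency` option above can be pictured as a small promise pool: a fixed number of workers drain a shared queue of pending URLs. A minimal sketch of that queueing model, not the package's internal implementation (the stub tasks stand in for scrape calls):

```javascript
// Minimal promise-pool sketch: run `tasks` with at most `limit` in flight.
// Each task is a () => Promise; failures are captured, not thrown,
// mirroring the `continueOnError: true` behavior described above.
async function runPool(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index (single-threaded, so safe)
      results[i] = await tasks[i]().catch((err) => ({ error: String(err) }));
    }
  }
  // Start `limit` workers; each pulls tasks until the queue is empty.
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Stub tasks standing in for scrapes.
const tasks = [1, 2, 3, 4, 5].map((n) => () => Promise.resolve(n * 10));
runPool(tasks, 2).then((r) => console.log(r)); // → [10, 20, 30, 40, 50]
```

Results stay in input order because each worker writes to its claimed index, even though completion order may differ.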
 
- ##### `scraper.quickshot(url, options?)`
+ ### AI-powered Q&A
 
- Quick screenshot capture - optimized for speed with retry logic.
+ Ask questions about any website using OpenRouter, OpenAI, or local fallback:
 
  ```javascript
- const result = await scraper.quickshot('https://example.com');
- // 2-3x faster than regular screenshot
- ```
-
- ##### `scraper.getStats()`
-
- Get performance statistics.
+ import { askWebsiteAI } from '@monostate/node-scraper';
 
- ```javascript
- const stats = scraper.getStats();
- console.log(stats.successRates); // Success rates by method
+ const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
+ openRouterApiKey: process.env.OPENROUTER_API_KEY,
+ });
  ```
 
- ##### `scraper.healthCheck()`
+ API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
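
The priority chain above can be expressed as a small resolver. A hypothetical sketch of the documented ordering (the real selection logic lives inside the package; `resolveAIBackend` is illustrative, though the option names match the constructor options documented below):

```javascript
// Hypothetical sketch of the documented key priority:
// OpenRouter > OpenAI > BNCA backend > local pattern matching.
function resolveAIBackend(opts = {}) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.apiKey) return 'bnca-backend';
  return 'local'; // pattern matching, no key required
}

console.log(resolveAIBackend({ openAIApiKey: 'sk-...' })); // → 'openai'
console.log(resolveAIBackend({}));                         // → 'local'
```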
 
- Check availability of all scraping methods.
+ ## How it works
 
- ```javascript
- const health = await scraper.healthCheck();
- console.log(health.status); // 'healthy' or 'unhealthy'
- ```
+ The scraper uses a three-tier fallback system:
 
- ##### `scraper.cleanup()`
+ 1. **Direct fetch** — Pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
+ 2. **LightPanda** — Lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
+ 3. **Puppeteer** — Full Chromium for maximum compatibility.
 
- Clean up resources (close browser instances).
+ Additional specialized handlers:
+ - **PDF parser** — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
+ - **Screenshots** — Chrome CLI capture with retry logic and smart timeouts.
 
- ```javascript
- await scraper.cleanup();
- ```
+ Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
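
The tiered fallback amounts to trying each method in order until one succeeds, recording which methods were attempted. A simplified illustration with stub tier functions (the package's real tiers also do SPA detection and instance pooling):

```javascript
// Simplified three-tier fallback: try each method in order, record attempts.
// The tier functions here are stubs standing in for the real scrapers.
async function fallbackScrape(url, tiers) {
  const attempted = [];
  for (const [name, fn] of tiers) {
    attempted.push(name);
    try {
      const content = await fn(url);
      return { success: true, method: name, content, attempted };
    } catch (_) {
      // fall through to the next tier
    }
  }
  return { success: false, attempted };
}

// Stub tiers: direct fetch "fails" on a JS-heavy page, LightPanda succeeds.
const tiers = [
  ['direct-fetch', async () => { throw new Error('needs JS'); }],
  ['lightpanda', async () => '<html>rendered</html>'],
  ['puppeteer', async () => '<html>rendered</html>'],
];

fallbackScrape('https://example.com', tiers).then((r) => console.log(r.method)); // → 'lightpanda'
```

Forcing a method (the `method` option above) is equivalent to running this loop with a single tier, which is why no fallback occurs on failure.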
 
- ### AI-Powered Q&A
+ ## Performance
 
- Ask questions about any website and get AI-generated answers:
+ | Site Type | node-scraper | Firecrawl | Speedup |
+ |-----------|-------------|-----------|---------|
+ | Wikipedia | 154ms | 4,662ms | 30.3x |
+ | Hacker News | 1,715ms | 4,644ms | 2.7x |
+ | GitHub | 9,167ms | 9,790ms | 1.1x |
 
- ```javascript
- // Method 1: Using your own OpenRouter API key
- const scraper = new BNCASmartScraper({
- openRouterApiKey: 'your-openrouter-api-key'
- });
- const result = await scraper.askAI('https://example.com', 'What is this website about?');
+ Average: **11.35x faster** with 100% reliability.
 
- // Method 2: Using OpenAI API (or compatible endpoints)
- const scraper = new BNCASmartScraper({
- openAIApiKey: 'your-openai-api-key',
- // Optional: Use a compatible endpoint like Groq, Together AI, etc.
- openAIBaseUrl: 'https://api.groq.com/openai'
- });
- const result = await scraper.askAI('https://example.com', 'What services do they offer?');
+ ## API Reference
 
- // Method 3: One-liner with OpenRouter
- import { askWebsiteAI } from '@monostate/node-scraper';
- const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
- openRouterApiKey: process.env.OPENROUTER_API_KEY
- });
+ ### Convenience functions
 
- // Method 4: Using BNCA backend API (requires BNCA API key)
- const scraper = new BNCASmartScraper({
- apiKey: 'your-bnca-api-key'
- });
- const result = await scraper.askAI('https://example.com', 'What products are featured?');
- ```
+ | Function | Description |
+ |----------|-------------|
+ | `smartScrape(url, opts?)` | Scrape with intelligent fallback |
+ | `smartScreenshot(url, opts?)` | Full page screenshot |
+ | `quickShot(url, opts?)` | Fast screenshot capture |
+ | `bulkScrape(urls, opts?)` | Batch scrape multiple URLs |
+ | `bulkScrapeStream(urls, opts?)` | Stream results as they complete |
+ | `askWebsiteAI(url, question, opts?)` | AI Q&A about a webpage |
 
- **API Key Priority:**
- 1. OpenRouter API key (`openRouterApiKey`)
- 2. OpenAI API key (`openAIApiKey`)
- 3. BNCA backend API (`apiKey`)
- 4. Local fallback (pattern matching - no API key required)
+ ### BNCASmartScraper options
 
- **Configuration Options:**
- ```javascript
- const result = await scraper.askAI(url, question, {
- // OpenRouter specific
- openRouterApiKey: 'sk-or-...',
- model: 'meta-llama/llama-4-scout:free', // Default model
-
- // OpenAI specific
- openAIApiKey: 'sk-...',
- openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
- model: 'gpt-3.5-turbo',
-
- // Shared options
- temperature: 0.3,
- maxTokens: 500
- });
- ```
-
- **Response Format:**
  ```javascript
  {
- success: true,
- answer: "This website is about...",
- method: "direct-fetch", // Scraping method used
- scrapeTime: 1234, // Time to scrape in ms
- processing: "openrouter" // AI processing method used
+ timeout: 10000, // Request timeout (ms)
+ retries: 2, // Retries per method
+ verbose: false, // Detailed logging
+ lightpandaPath: './bin/lightpanda',
+ lightpandaFormat: 'html', // 'html' or 'markdown'
+ userAgent: 'Mozilla/5.0 ...',
+ openRouterApiKey: '...',
+ openAIApiKey: '...',
+ openAIBaseUrl: 'https://api.openai.com',
  }
  ```
 
- ### 📄 PDF Support
-
- BNCA automatically detects and parses PDF documents:
-
- ```javascript
- const pdfResult = await smartScrape('https://example.com/document.pdf');
-
- // Parsed content includes:
- const content = JSON.parse(pdfResult.content);
- console.log(content.title); // PDF title
- console.log(content.author); // Author name
- console.log(content.pages); // Number of pages
- console.log(content.text); // Full extracted text
- console.log(content.creationDate); // Creation date
- console.log(content.metadata); // Additional metadata
- ```
-
- **PDF Detection Methods:**
- - URL ending with `.pdf`
- - Content-Type header `application/pdf`
- - Binary content starting with `%PDF` (magic bytes)
- - Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
-
- **Limitations:**
- - Maximum file size: 20MB
- - Text extraction only (no image OCR)
- - Requires `pdf-parse` dependency (automatically installed)
+ ### Methods
 
- ## 📱 Next.js Integration
+ - `scraper.scrape(url, opts?)` — Scrape with fallback
+ - `scraper.screenshot(url, opts?)` — Take screenshot
+ - `scraper.quickshot(url, opts?)` — Fast screenshot
+ - `scraper.askAI(url, question, opts?)` — AI Q&A
+ - `scraper.getStats()` — Performance statistics
+ - `scraper.healthCheck()` — Check method availability
+ - `scraper.cleanup()` — Close browser instances
 
- ### API Route Example
+ ### Response shape
 
  ```javascript
- // pages/api/scrape.js or app/api/scrape/route.js
- import { smartScrape } from '@monostate/node-scraper';
-
- export async function POST(request) {
- try {
- const { url } = await request.json();
- const result = await smartScrape(url);
-
- return Response.json({
- success: true,
- data: result.content,
- method: result.method,
- time: result.performance.totalTime
- });
- } catch (error) {
- return Response.json({
- success: false,
- error: error.message
- }, { status: 500 });
- }
- }
- ```
-
- ### React Hook Example
-
- ```javascript
- // hooks/useScraper.js
- import { useState } from 'react';
-
- export function useScraper() {
- const [loading, setLoading] = useState(false);
- const [data, setData] = useState(null);
- const [error, setError] = useState(null);
-
- const scrape = async (url) => {
- setLoading(true);
- setError(null);
-
- try {
- const response = await fetch('/api/scrape', {
- method: 'POST',
- headers: { 'Content-Type': 'application/json' },
- body: JSON.stringify({ url })
- });
-
- const result = await response.json();
-
- if (result.success) {
- setData(result.data);
- } else {
- setError(result.error);
- }
- } catch (err) {
- setError(err.message);
- } finally {
- setLoading(false);
- }
- };
-
- return { scrape, loading, data, error };
- }
- ```
-
- ### Component Example
-
- ```javascript
- // components/ScraperDemo.jsx
- import { useScraper } from '../hooks/useScraper';
-
- export default function ScraperDemo() {
- const { scrape, loading, data, error } = useScraper();
- const [url, setUrl] = useState('');
-
- const handleScrape = () => {
- if (url) scrape(url);
- };
-
- return (
- <div className="p-4">
- <div className="flex gap-2 mb-4">
- <input
- type="url"
- value={url}
- onChange={(e) => setUrl(e.target.value)}
- placeholder="Enter URL to scrape..."
- className="flex-1 px-3 py-2 border rounded"
- />
- <button
- onClick={handleScrape}
- disabled={loading}
- className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
- >
- {loading ? 'Scraping...' : 'Scrape'}
- </button>
- </div>
-
- {error && (
- <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
- Error: {error}
- </div>
- )}
-
- {data && (
- <div className="p-3 bg-green-100 rounded">
- <h3 className="font-bold mb-2">Scraped Content:</h3>
- <pre className="text-sm overflow-auto">{data}</pre>
- </div>
- )}
- </div>
- );
+ {
+ success: true,
+ content: '...', // Extracted content (JSON string)
+ method: 'direct-fetch', // Method used
+ url: 'https://...',
+ performance: { totalTime: 154 },
+ stats: { ... }
  }
  ```
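
Consumers of this shape typically branch on `success` before touching the other fields. A small illustrative helper (field names follow the shape above; `summarize` itself is hypothetical, not part of the package):

```javascript
// Illustrative consumer of the response shape documented above.
function summarize(result) {
  if (!result.success) return `failed via ${result.method ?? 'unknown'}`;
  const ms = result.performance?.totalTime ?? 0;
  return `${result.method} in ${ms}ms, ${result.content.length} chars`;
}

// Sample object mirroring the documented shape.
const sample = {
  success: true,
  content: '<html>ok</html>',
  method: 'direct-fetch',
  url: 'https://example.com',
  performance: { totalTime: 154 },
};
console.log(summarize(sample)); // → 'direct-fetch in 154ms, 15 chars'
```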
 
- ## ⚠️ Important Notes
-
- ### Server-Side Only
- BNCA is designed for **server-side use only** due to:
- - Browser automation requirements (Puppeteer)
- - File system access for Lightpanda binary
- - CORS restrictions in browsers
-
- ### Next.js Deployment
- - Use in API routes, not client components
- - Ensure Node.js 18+ in production environment
- - Consider adding Lightpanda binary to deployment
-
- ### Lightpanda Setup (Optional)
- For maximum performance, install Lightpanda:
-
- ```bash
- # macOS ARM64
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
- chmod +x lightpanda
-
- # Linux x64
- curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
- chmod +x lightpanda
- ```
-
- ## 🔒 Privacy & Security
-
- - **No external API calls** - all processing is local
- - **No data collection** - your data stays private
- - **Respects robots.txt** (optional enforcement)
- - **Configurable rate limiting**
-
- ## 📝 TypeScript Support
+ ## TypeScript
 
- Full TypeScript definitions included:
+ Full type definitions are included (`index.d.ts`).
 
  ```typescript
- import { BNCASmartScraper, ScrapingResult, ScrapingOptions } from '@monostate/node-scraper';
-
- const scraper: BNCASmartScraper = new BNCASmartScraper({
- timeout: 5000,
- verbose: true
- });
-
- const result: ScrapingResult = await scraper.scrape('https://example.com');
+ import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
  ```
 
- ## Changelog
-
- ### v1.6.0 (Latest)
- - **Method Override**: Force specific scraping methods with `method` parameter
- - **Enhanced Error Handling**: Categorized error types for better debugging
- - **Fallback Chain Tracking**: See which methods were attempted in auto mode
- - **Graceful Failures**: No automatic fallback when method is forced
-
- ### v1.5.0
- - **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
- - **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
- - **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
- - **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
- - **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
- - **Enhanced TypeScript**: Complete type definitions for all AI features
-
- ### v1.4.0
- - Internal release (skipped for public release)
+ ## Notes
 
- ### v1.3.0
- - **PDF Support**: Full PDF parsing with text extraction, metadata, and page count
- - **Smart PDF Detection**: Detects PDFs by URL patterns, content-type, or magic bytes
- - **Robust Parsing**: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
- - **Fast Performance**: PDF parsing typically completes in 100-500ms
- - **Comprehensive Extraction**: Title, author, creation date, page count, and full text
+ - **Server-side only** — requires filesystem access and browser automation.
+ - **Node.js 20+** required.
+ - LightPanda binary is auto-installed. Puppeteer is optional.
+ - No external API calls for scraping; all processing is local.
 
- ### v1.2.0
- - **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
- - **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
- - **Improved Performance**: Enhanced binary detection and ES6 module compatibility
- - **Better Error Handling**: More robust installation scripts with retry logic
- - **Zero Configuration**: No manual setup required - works out of the box
-
- ### v1.1.1
- - Bug fixes and stability improvements
- - Enhanced Puppeteer integration
-
- ### v1.1.0
- - Added screenshot capabilities
- - Improved fallback system
- - Performance optimizations
-
- ## 🤝 Contributing
-
- See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
-
- ## 📄 License
-
- MIT License - see [LICENSE](../../LICENSE) file for details.
-
- ---
-
- <div align="center">
+ ## Changelog
 
- **Built with ❤️ for fast, reliable web scraping**
+ See [CHANGELOG.md](./CHANGELOG.md) for the full release history.
 
- [⭐ Star on GitHub](https://github.com/your-org/bnca-prototype) | [📖 Full Documentation](https://github.com/your-org/bnca-prototype#readme)
+ ## License
 
- </div>
+ MIT