@monostate/node-scraper 1.8.1 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +106 -558
- package/browser-pool.js +1 -1
- package/browser-session.js +551 -0
- package/index.d.ts +97 -0
- package/index.js +50 -28
- package/lightpanda-server.js +151 -0
- package/package.json +10 -5
- package/scripts/install-lightpanda.js +20 -7
package/README.md
CHANGED
@@ -1,69 +1,47 @@
 # @monostate/node-scraper
 
->
+> Intelligent web scraping with multi-tier fallback — 11x faster than traditional scrapers
 
-[](https://nodejs.org/)
+[](https://www.npmjs.com/package/@monostate/node-scraper)
+[](https://github.com/monostate/node-scraper/blob/main/LICENSE)
+[](https://nodejs.org/)
 
-##
-
-### Installation
+## Install
 
 ```bash
 npm install @monostate/node-scraper
-# or
-yarn add @monostate/node-scraper
-# or
-pnpm add @monostate/node-scraper
 ```
 
-
-
-**New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
-
-**Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
-
-**New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
-
-**New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
+LightPanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
 
-
-
-**Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
-
-### Zero-Configuration Setup
-
-The package now automatically:
-- Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
-- Configures binary paths and permissions
-- Validates installation health on first use
-
-### Basic Usage
+## Usage
 
 ```javascript
 import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
 
-//
+// Scrape with automatic method selection
 const result = await smartScrape('https://example.com');
-console.log(result.content);
-console.log(result.method);
+console.log(result.content);
+console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'
 
-//
+// Screenshots
 const screenshot = await smartScreenshot('https://example.com');
-
+const quick = await quickShot('https://example.com'); // optimized for speed
+
+// PDFs are detected and parsed automatically
+const pdf = await smartScrape('https://example.com/doc.pdf');
+```
 
-
-const quick = await quickShot('https://example.com');
-console.log(quick.screenshot); // Fast screenshot capture
+### Force a specific method
 
-
-const
-
+```javascript
+const result = await smartScrape('https://example.com', { method: 'direct' });
+// Also: 'lightpanda', 'puppeteer', 'auto' (default)
 ```
 
-
+No fallback occurs when a method is forced — useful for testing and debugging.
+
+### Advanced usage
 
 ```javascript
 import { BNCASmartScraper } from '@monostate/node-scraper';
@@ -71,575 +49,145 @@ import { BNCASmartScraper } from '@monostate/node-scraper';
 const scraper = new BNCASmartScraper({
   timeout: 10000,
   verbose: true,
-  lightpandaPath: './lightpanda' // optional
 });
 
 const result = await scraper.scrape('https://complex-spa.com');
-
-
-await scraper.cleanup(); // Clean up resources
-```
-
-### Browser Pool Configuration (New in v1.8.0)
-
-The package now includes automatic browser instance pooling to prevent memory leaks:
-
-```javascript
-// Browser pool is managed automatically with these defaults:
-// - Max 3 concurrent browser instances
-// - 5 second idle timeout before cleanup
-// - Automatic reuse of browser instances
-
-// For heavy workloads, you can manually clean up:
-const scraper = new BNCASmartScraper();
-// ... perform multiple scrapes ...
-await scraper.cleanup(); // Closes all browser instances
-```
-
-**Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
-
-### Method Override (New in v1.6.0)
-
-Force a specific scraping method instead of using automatic fallback:
-
-```javascript
-// Force direct fetch (no browser)
-const result = await smartScrape('https://example.com', { method: 'direct' });
-
-// Force Lightpanda browser
-const result = await smartScrape('https://example.com', { method: 'lightpanda' });
-
-// Force Puppeteer (full Chrome)
-const result = await smartScrape('https://example.com', { method: 'puppeteer' });
-
-// Auto mode (default - intelligent fallback)
-const result = await smartScrape('https://example.com', { method: 'auto' });
-```
-
-**Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
-- Testing specific methods in isolation
-- Optimizing for known site requirements
-- Debugging method-specific issues
+const stats = scraper.getStats();
+const health = await scraper.healthCheck();
 
-
-```javascript
-{
-  success: false,
-  error: "Lightpanda scraping failed: [specific error]",
-  method: "lightpanda",
-  errorType: "network|timeout|parsing|service_unavailable",
-  details: "Additional error context"
-}
+await scraper.cleanup();
 ```
 
-### Bulk
-
-Process multiple URLs efficiently with automatic request queueing and progress tracking:
+### Bulk scraping
 
 ```javascript
-import { bulkScrape } from '@monostate/node-scraper';
-
-// Basic bulk scraping
-const urls = [
-  'https://example1.com',
-  'https://example2.com',
-  'https://example3.com',
-  // ... hundreds more
-];
+import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';
 
 const results = await bulkScrape(urls, {
-  concurrency: 5,
-  continueOnError: true,
-  progressCallback: (
-    console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
-  }
+  concurrency: 5,
+  continueOnError: true,
+  progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
 });
 
-
-console.log(`Total time: ${results.stats.totalTime}ms`);
-console.log(`Average time per URL: ${results.stats.averageTime}ms`);
-```
-
-#### Streaming Results
-
-For large datasets, use streaming to process results as they complete:
-
-```javascript
-import { bulkScrapeStream } from '@monostate/node-scraper';
-
+// Or stream results as they complete
 await bulkScrapeStream(urls, {
   concurrency: 10,
-  onResult: async (result) =>
-
-    await saveToDatabase(result);
-    console.log(`✓ ${result.url} - ${result.duration}ms`);
-  },
-  onError: async (error) => {
-    // Handle errors as they occur
-    console.error(`✗ ${error.url} - ${error.error}`);
-  },
-  progressCallback: (progress) => {
-    process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
-  }
+  onResult: async (result) => await saveToDatabase(result),
+  onError: async (error) => console.error(error.url, error.error),
 });
 ```
 
-
-- Automatic request queueing (no more memory errors!)
-- Configurable concurrency control
-- Real-time progress tracking
-- Continue on error or stop on first failure
-- Detailed statistics and method tracking
-- Browser instance pooling for efficiency
-
-For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
-
-## How It Works
-
-BNCA uses a sophisticated multi-tier system with intelligent detection:
-
-### 1. 🔄 Direct Fetch (Fastest)
-- Pure HTTP requests with intelligent HTML parsing
-- **Performance**: Sub-second responses
-- **Success rate**: 75% of websites
-- **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
-
-### 2. 🐼 Lightpanda Browser (Fast)
-- Lightweight browser engine (2-3x faster than Chromium)
-- **Performance**: Fast JavaScript execution
-- **Fallback triggers**: SPA detection
-
-### 3. 🔵 Puppeteer (Complete)
-- Full Chromium browser for maximum compatibility
-- **Performance**: Complete JavaScript execution
-- **Fallback triggers**: Complex interactions needed
-
-### 📄 PDF Parser (Specialized)
-- Automatic PDF detection and parsing
-- **Features**: Text extraction, metadata, page count
-- **Smart Detection**: Works even when PDFs are served with wrong content-types
-- **Performance**: Typically 100-500ms for most PDFs
-
-### 📸 Screenshot Methods
-- **Chrome CLI**: Direct Chrome screenshot capture
-- **Quickshot**: Optimized with retry logic and smart timeouts
-
-## 📊 Performance Benchmark
-
-| Site Type | BNCA | Firecrawl | Speed Advantage |
-|-----------|------|-----------|----------------|
-| **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
-| **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
-| **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
-
-**Average**: 11.35x faster than Firecrawl with 100% reliability
-
-## 🎛️ API Reference
-
-### Convenience Functions
-
-#### `smartScrape(url, options?)`
-Quick scraping with intelligent fallback.
-
-#### `smartScreenshot(url, options?)`
-Take a screenshot of any webpage.
-
-#### `quickShot(url, options?)`
-Optimized screenshot capture for maximum speed.
-
-**Parameters:**
-- `url` (string): URL to scrape/capture
-- `options` (object, optional): Configuration options
-
-**Returns:** Promise<ScrapingResult>
-
-### `BNCASmartScraper`
-
-Main scraper class with advanced features.
-
-#### Constructor Options
-
-```javascript
-const scraper = new BNCASmartScraper({
-  timeout: 10000, // Request timeout in ms
-  retries: 2, // Number of retries per method
-  verbose: false, // Enable detailed logging
-  lightpandaPath: './lightpanda', // Path to Lightpanda binary
-  userAgent: 'Mozilla/5.0 ...', // Custom user agent
-});
-```
-
-#### Methods
-
-##### `scraper.scrape(url, options?)`
-
-Scrape a URL with intelligent fallback.
-
-```javascript
-const result = await scraper.scrape('https://example.com');
-```
+See [BULK_SCRAPING.md](./BULK_SCRAPING.md) for full documentation.
 
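The queueing behavior behind `bulkScrape` follows the standard concurrency-limit pattern: at most `limit` tasks in flight, per-item errors caught so the batch continues. A self-contained sketch of that pattern (illustrative only, not the package's implementation; `runLimited` is a hypothetical helper):

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight.
// Results keep input order; a failed item yields an error record
// instead of aborting the batch (cf. continueOnError).
async function runLimited(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i).catch((err) => ({ error: String(err) }));
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, lane));
  return results;
}
```

Each "lane" pulls the next index as soon as its previous task settles, so throughput stays at the concurrency cap without chunked batching.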
-
+### AI-powered Q&A
 
-
+Ask questions about any website using OpenRouter, OpenAI, or local fallback:
 
 ```javascript
-
-const img = result.screenshot; // data:image/png;base64,...
-```
-
-##### `scraper.quickshot(url, options?)`
-
-Quick screenshot capture - optimized for speed with retry logic.
+import { askWebsiteAI } from '@monostate/node-scraper';
 
-
-
-
+const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
+  openRouterApiKey: process.env.OPENROUTER_API_KEY,
+});
 ```
 
-
+API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
 
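That priority order reduces to a first-match selection over the configured keys. A sketch (the option names `openRouterApiKey`, `openAIApiKey`, and `apiKey` match the README; the function itself is hypothetical, not the package's code):

```javascript
// Pick the first configured AI backend, mirroring the documented priority:
// OpenRouter > OpenAI > BNCA backend > local pattern matching.
function pickAIBackend(opts = {}) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.apiKey) return 'bnca-backend';
  return 'local'; // no key needed
}
```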
-
+## How it works
 
-
-const stats = scraper.getStats();
-console.log(stats.successRates); // Success rates by method
-```
+The scraper uses a three-tier fallback system:
 
-
+1. **Direct fetch** — Pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
+2. **LightPanda** — Lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
+3. **Puppeteer** — Full Chromium for maximum compatibility.
 
-
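Tiered fallback of this kind amounts to trying methods in order until one succeeds. A condensed sketch with stubbed tier functions (illustrative; the real package's tier implementations and result fields may differ):

```javascript
// Try each scraping tier in order; return the first success.
// `tiers` is a list of [name, asyncFn] pairs standing in for the real methods.
async function scrapeWithFallback(url, tiers) {
  const attempted = [];
  for (const [name, fn] of tiers) {
    attempted.push(name);
    try {
      const content = await fn(url);
      return { success: true, method: name, content, attempted };
    } catch (err) {
      // This tier failed (e.g. the page needs JavaScript); try the next one.
    }
  }
  return { success: false, attempted };
}
```

With real tiers plugged in, `method` would report `'direct-fetch'`, `'lightpanda'`, or `'puppeteer'`, matching the `result.method` values shown earlier.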
+Additional specialized handlers:
+- **PDF parser** — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
+- **Screenshots** — Chrome CLI capture with retry logic and smart timeouts.
 
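Magic-byte detection, as mentioned for the PDF parser, means checking the first bytes of the body for `%PDF` instead of trusting the URL or content-type. A minimal sketch of that check:

```javascript
// Detect a PDF by magic bytes rather than URL or content-type,
// which also catches PDFs served as application/octet-stream.
function looksLikePdf(buf) {
  return buf.length >= 4 && buf.subarray(0, 4).toString('latin1') === '%PDF';
}
```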
-
-const health = await scraper.healthCheck();
-console.log(health.status); // 'healthy' or 'unhealthy'
-```
+Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
 
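The pooling described above (a bounded number of instances, reused on release) can be sketched without timers; the idle-timeout cleanup is omitted for brevity, and this `BoundedPool` is illustrative rather than the package's `browser-pool.js`:

```javascript
// A bounded pool that reuses released instances and caps live instances.
class BoundedPool {
  constructor(max, factory) {
    this.max = max;          // e.g. 3 concurrent browser instances
    this.factory = factory;  // creates a new instance on demand
    this.idle = [];
    this.live = 0;
  }
  acquire() {
    if (this.idle.length > 0) return this.idle.pop(); // reuse an idle instance
    if (this.live >= this.max) throw new Error('pool exhausted');
    this.live += 1;
    return this.factory();
  }
  release(instance) {
    // Kept for reuse; a real pool would also start an idle timer here
    // and close the instance if it stays unused past the timeout.
    this.idle.push(instance);
  }
}
```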
-
+## Performance
 
-
+| Site Type | node-scraper | Firecrawl | Speedup |
+|-----------|-------------|-----------|---------|
+| Wikipedia | 154ms | 4,662ms | 30.3x |
+| Hacker News | 1,715ms | 4,644ms | 2.7x |
+| GitHub | 9,167ms | 9,790ms | 1.1x |
 
-
-await scraper.cleanup();
-```
+Average: **11.35x faster** with 100% reliability.
 
-
+## API Reference
 
-
+### Convenience functions
 
-
-
-
-
-
-
+| Function | Description |
+|----------|-------------|
+| `smartScrape(url, opts?)` | Scrape with intelligent fallback |
+| `smartScreenshot(url, opts?)` | Full page screenshot |
+| `quickShot(url, opts?)` | Fast screenshot capture |
+| `bulkScrape(urls, opts?)` | Batch scrape multiple URLs |
+| `bulkScrapeStream(urls, opts?)` | Stream results as they complete |
+| `askWebsiteAI(url, question, opts?)` | AI Q&A about a webpage |
 
-
-const scraper = new BNCASmartScraper({
-  openAIApiKey: 'your-openai-api-key',
-  // Optional: Use a compatible endpoint like Groq, Together AI, etc.
-  openAIBaseUrl: 'https://api.groq.com/openai'
-});
-const result = await scraper.askAI('https://example.com', 'What services do they offer?');
+### BNCASmartScraper options
 
-// Method 3: One-liner with OpenRouter
-import { askWebsiteAI } from '@monostate/node-scraper';
-const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
-  openRouterApiKey: process.env.OPENROUTER_API_KEY
-});
-
-// Method 4: Using BNCA backend API (requires BNCA API key)
-const scraper = new BNCASmartScraper({
-  apiKey: 'your-bnca-api-key'
-});
-const result = await scraper.askAI('https://example.com', 'What products are featured?');
-```
-
-**API Key Priority:**
-1. OpenRouter API key (`openRouterApiKey`)
-2. OpenAI API key (`openAIApiKey`)
-3. BNCA backend API (`apiKey`)
-4. Local fallback (pattern matching - no API key required)
-
-**Configuration Options:**
-```javascript
-const result = await scraper.askAI(url, question, {
-  // OpenRouter specific
-  openRouterApiKey: 'sk-or-...',
-  model: 'meta-llama/llama-4-scout:free', // Default model
-
-  // OpenAI specific
-  openAIApiKey: 'sk-...',
-  openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
-  model: 'gpt-3.5-turbo',
-
-  // Shared options
-  temperature: 0.3,
-  maxTokens: 500
-});
-```
-
-**Response Format:**
 ```javascript
 {
-
-
-
-
-
+  timeout: 10000,           // Request timeout (ms)
+  retries: 2,               // Retries per method
+  verbose: false,           // Detailed logging
+  lightpandaPath: './bin/lightpanda',
+  lightpandaFormat: 'html', // 'html' or 'markdown'
+  userAgent: 'Mozilla/5.0 ...',
+  openRouterApiKey: '...',
+  openAIApiKey: '...',
+  openAIBaseUrl: 'https://api.openai.com',
 }
 ```
 
-###
-
-BNCA automatically detects and parses PDF documents:
-
-```javascript
-const pdfResult = await smartScrape('https://example.com/document.pdf');
-
-// Parsed content includes:
-const content = JSON.parse(pdfResult.content);
-console.log(content.title); // PDF title
-console.log(content.author); // Author name
-console.log(content.pages); // Number of pages
-console.log(content.text); // Full extracted text
-console.log(content.creationDate); // Creation date
-console.log(content.metadata); // Additional metadata
-```
-
-**PDF Detection Methods:**
-- URL ending with `.pdf`
-- Content-Type header `application/pdf`
-- Binary content starting with `%PDF` (magic bytes)
-- Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
-
-**Limitations:**
-- Maximum file size: 20MB
-- Text extraction only (no image OCR)
-- Requires `pdf-parse` dependency (automatically installed)
+### Methods
 
-
+- `scraper.scrape(url, opts?)` — Scrape with fallback
+- `scraper.screenshot(url, opts?)` — Take screenshot
+- `scraper.quickshot(url, opts?)` — Fast screenshot
+- `scraper.askAI(url, question, opts?)` — AI Q&A
+- `scraper.getStats()` — Performance statistics
+- `scraper.healthCheck()` — Check method availability
+- `scraper.cleanup()` — Close browser instances
 
-###
+### Response shape
 
 ```javascript
-
-
-
-
-
-
-
-
-    return Response.json({
-      success: true,
-      data: result.content,
-      method: result.method,
-      time: result.performance.totalTime
-    });
-  } catch (error) {
-    return Response.json({
-      success: false,
-      error: error.message
-    }, { status: 500 });
-  }
-}
-```
-
-### React Hook Example
-
-```javascript
-// hooks/useScraper.js
-import { useState } from 'react';
-
-export function useScraper() {
-  const [loading, setLoading] = useState(false);
-  const [data, setData] = useState(null);
-  const [error, setError] = useState(null);
-
-  const scrape = async (url) => {
-    setLoading(true);
-    setError(null);
-
-    try {
-      const response = await fetch('/api/scrape', {
-        method: 'POST',
-        headers: { 'Content-Type': 'application/json' },
-        body: JSON.stringify({ url })
-      });
-
-      const result = await response.json();
-
-      if (result.success) {
-        setData(result.data);
-      } else {
-        setError(result.error);
-      }
-    } catch (err) {
-      setError(err.message);
-    } finally {
-      setLoading(false);
-    }
-  };
-
-  return { scrape, loading, data, error };
-}
-```
-
-### Component Example
-
-```javascript
-// components/ScraperDemo.jsx
-import { useScraper } from '../hooks/useScraper';
-
-export default function ScraperDemo() {
-  const { scrape, loading, data, error } = useScraper();
-  const [url, setUrl] = useState('');
-
-  const handleScrape = () => {
-    if (url) scrape(url);
-  };
-
-  return (
-    <div className="p-4">
-      <div className="flex gap-2 mb-4">
-        <input
-          type="url"
-          value={url}
-          onChange={(e) => setUrl(e.target.value)}
-          placeholder="Enter URL to scrape..."
-          className="flex-1 px-3 py-2 border rounded"
-        />
-        <button
-          onClick={handleScrape}
-          disabled={loading}
-          className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
-        >
-          {loading ? 'Scraping...' : 'Scrape'}
-        </button>
-      </div>
-
-      {error && (
-        <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
-          Error: {error}
-        </div>
-      )}
-
-      {data && (
-        <div className="p-3 bg-green-100 rounded">
-          <h3 className="font-bold mb-2">Scraped Content:</h3>
-          <pre className="text-sm overflow-auto">{data}</pre>
-        </div>
-      )}
-    </div>
-  );
+{
+  success: true,
+  content: '...',         // Extracted content (JSON string)
+  method: 'direct-fetch', // Method used
+  url: 'https://...',
+  performance: { totalTime: 154 },
+  stats: { ... }
 }
 ```
 
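A caller typically branches on `success` and inspects `method` and `performance` from a result of that shape. A small sketch using a stubbed result object (the `summarize` helper is hypothetical):

```javascript
// Summarize a scraping result of the documented shape.
function summarize(result) {
  if (!result.success) return `failed via ${result.method ?? 'unknown'}`;
  return `${result.method} in ${result.performance.totalTime}ms`;
}
```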
-##
-
-### Server-Side Only
-BNCA is designed for **server-side use only** due to:
-- Browser automation requirements (Puppeteer)
-- File system access for Lightpanda binary
-- CORS restrictions in browsers
-
-### Next.js Deployment
-- Use in API routes, not client components
-- Ensure Node.js 18+ in production environment
-- Consider adding Lightpanda binary to deployment
-
-### Lightpanda Setup (Optional)
-For maximum performance, install Lightpanda:
-
-```bash
-# macOS ARM64
-curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
-chmod +x lightpanda
-
-# Linux x64
-curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
-chmod +x lightpanda
-```
-
-## 🔒 Privacy & Security
-
-- **No external API calls** - all processing is local
-- **No data collection** - your data stays private
-- **Respects robots.txt** (optional enforcement)
-- **Configurable rate limiting**
-
-## 📝 TypeScript Support
+## TypeScript
 
-Full
+Full type definitions are included (`index.d.ts`).
 
 ```typescript
-import { BNCASmartScraper, ScrapingResult
-
-const scraper: BNCASmartScraper = new BNCASmartScraper({
-  timeout: 5000,
-  verbose: true
-});
-
-const result: ScrapingResult = await scraper.scrape('https://example.com');
+import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
 ```
 
-##
-
-### v1.6.0 (Latest)
-- **Method Override**: Force specific scraping methods with `method` parameter
-- **Enhanced Error Handling**: Categorized error types for better debugging
-- **Fallback Chain Tracking**: See which methods were attempted in auto mode
-- **Graceful Failures**: No automatic fallback when method is forced
-
-### v1.5.0
-- **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
-- **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
-- **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
-- **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
-- **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
-- **Enhanced TypeScript**: Complete type definitions for all AI features
-
-### v1.4.0
-- Internal release (skipped for public release)
+## Notes
 
-
-- **
--
--
-- **Fast Performance**: PDF parsing typically completes in 100-500ms
-- **Comprehensive Extraction**: Title, author, creation date, page count, and full text
+- **Server-side only** — requires filesystem access and browser automation.
+- **Node.js 20+** required.
+- LightPanda binary is auto-installed. Puppeteer is optional.
+- No external API calls for scraping — all processing is local.
 
-
-- **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
-- **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
-- **Improved Performance**: Enhanced binary detection and ES6 module compatibility
-- **Better Error Handling**: More robust installation scripts with retry logic
-- **Zero Configuration**: No manual setup required - works out of the box
-
-### v1.1.1
-- Bug fixes and stability improvements
-- Enhanced Puppeteer integration
-
-### v1.1.0
-- Added screenshot capabilities
-- Improved fallback system
-- Performance optimizations
-
-## 🤝 Contributing
-
-See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
-
-## 📄 License
-
-MIT License - see [LICENSE](../../LICENSE) file for details.
-
----
-
-<div align="center">
+## Changelog
 
-
+See [CHANGELOG.md](./CHANGELOG.md) for the full release history.
 
-
+## License
 
-
+MIT