@monostate/node-scraper 1.8.0 → 2.0.0
This diff shows the changes between these two publicly released versions of the package, as they appear in their public registry.
- package/BULK_SCRAPING.md +626 -0
- package/README.md +106 -556
- package/browser-pool.js +229 -0
- package/index.js +46 -28
- package/package.json +7 -5
- package/scripts/install-lightpanda.js +20 -7
package/README.md
CHANGED
@@ -1,67 +1,47 @@
 # @monostate/node-scraper
 
->
+> Intelligent web scraping with multi-tier fallback — 11x faster than traditional scrapers
 
-[](https://nodejs.org/)
+[](https://www.npmjs.com/package/@monostate/node-scraper)
+[](https://github.com/monostate/node-scraper/blob/main/LICENSE)
+[](https://nodejs.org/)
 
-##
-
-### Installation
+## Install
 
 ```bash
 npm install @monostate/node-scraper
-# or
-yarn add @monostate/node-scraper
-# or
-pnpm add @monostate/node-scraper
 ```
 
-
-
-**Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
-
-**New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
-
-**New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
-
-**Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
+LightPanda is downloaded automatically on install. Puppeteer is an optional peer dependency for full browser fallback.
 
-
-
-### Zero-Configuration Setup
-
-The package now automatically:
-- Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
-- Configures binary paths and permissions
-- Validates installation health on first use
-
-### Basic Usage
+## Usage
 
 ```javascript
 import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
 
-//
+// Scrape with automatic method selection
 const result = await smartScrape('https://example.com');
-console.log(result.content);
-console.log(result.method);
+console.log(result.content);
+console.log(result.method); // 'direct-fetch' | 'lightpanda' | 'puppeteer'
 
-//
+// Screenshots
 const screenshot = await smartScreenshot('https://example.com');
-
+const quick = await quickShot('https://example.com'); // optimized for speed
+
+// PDFs are detected and parsed automatically
+const pdf = await smartScrape('https://example.com/doc.pdf');
+```
 
-
-const quick = await quickShot('https://example.com');
-console.log(quick.screenshot); // Fast screenshot capture
+### Force a specific method
 
-
-const
-
+```javascript
+const result = await smartScrape('https://example.com', { method: 'direct' });
+// Also: 'lightpanda', 'puppeteer', 'auto' (default)
 ```
 
-
+No fallback occurs when a method is forced — useful for testing and debugging.
+
+### Advanced usage
 
 ```javascript
 import { BNCASmartScraper } from '@monostate/node-scraper';
@@ -69,575 +49,145 @@ import { BNCASmartScraper } from '@monostate/node-scraper';
 
 const scraper = new BNCASmartScraper({
   timeout: 10000,
   verbose: true,
-  lightpandaPath: './lightpanda' // optional
 });
 
 const result = await scraper.scrape('https://complex-spa.com');
-
-
-await scraper.cleanup(); // Clean up resources
-```
-
-### Browser Pool Configuration (New in v1.8.0)
-
-The package now includes automatic browser instance pooling to prevent memory leaks:
-
-```javascript
-// Browser pool is managed automatically with these defaults:
-// - Max 3 concurrent browser instances
-// - 5 second idle timeout before cleanup
-// - Automatic reuse of browser instances
-
-// For heavy workloads, you can manually clean up:
-const scraper = new BNCASmartScraper();
-// ... perform multiple scrapes ...
-await scraper.cleanup(); // Closes all browser instances
-```
-
-**Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
-
-### Method Override (New in v1.6.0)
-
-Force a specific scraping method instead of using automatic fallback:
-
-```javascript
-// Force direct fetch (no browser)
-const result = await smartScrape('https://example.com', { method: 'direct' });
-
-// Force Lightpanda browser
-const result = await smartScrape('https://example.com', { method: 'lightpanda' });
-
-// Force Puppeteer (full Chrome)
-const result = await smartScrape('https://example.com', { method: 'puppeteer' });
-
-// Auto mode (default - intelligent fallback)
-const result = await smartScrape('https://example.com', { method: 'auto' });
-```
-
-**Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
-- Testing specific methods in isolation
-- Optimizing for known site requirements
-- Debugging method-specific issues
+const stats = scraper.getStats();
+const health = await scraper.healthCheck();
 
-
-```javascript
-{
-  success: false,
-  error: "Lightpanda scraping failed: [specific error]",
-  method: "lightpanda",
-  errorType: "network|timeout|parsing|service_unavailable",
-  details: "Additional error context"
-}
+await scraper.cleanup();
 ```
 
-### Bulk
-
-Process multiple URLs efficiently with automatic request queueing and progress tracking:
+### Bulk scraping
 
 ```javascript
-import { bulkScrape } from '@monostate/node-scraper';
-
-// Basic bulk scraping
-const urls = [
-  'https://example1.com',
-  'https://example2.com',
-  'https://example3.com',
-  // ... hundreds more
-];
+import { bulkScrape, bulkScrapeStream } from '@monostate/node-scraper';
 
 const results = await bulkScrape(urls, {
-  concurrency: 5,
-  continueOnError: true,
-  progressCallback: (progress) => {
-    console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
-  }
+  concurrency: 5,
+  continueOnError: true,
+  progressCallback: (p) => console.log(`${p.percentage.toFixed(1)}%`),
 });
 
-
-console.log(`Total time: ${results.stats.totalTime}ms`);
-console.log(`Average time per URL: ${results.stats.averageTime}ms`);
-```
-
-#### Streaming Results
-
-For large datasets, use streaming to process results as they complete:
-
-```javascript
-import { bulkScrapeStream } from '@monostate/node-scraper';
-
+// Or stream results as they complete
 await bulkScrapeStream(urls, {
   concurrency: 10,
-  onResult: async (result) => {
-
-    await saveToDatabase(result);
-    console.log(`✓ ${result.url} - ${result.duration}ms`);
-  },
-  onError: async (error) => {
-    // Handle errors as they occur
-    console.error(`✗ ${error.url} - ${error.error}`);
-  },
-  progressCallback: (progress) => {
-    process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
-  }
-});
-```
-
-**Features:**
-- Automatic request queueing (no more memory errors!)
-- Configurable concurrency control
-- Real-time progress tracking
-- Continue on error or stop on first failure
-- Detailed statistics and method tracking
-- Browser instance pooling for efficiency
-
-For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
-
-## How It Works
-
-BNCA uses a sophisticated multi-tier system with intelligent detection:
-
-### 1. 🔄 Direct Fetch (Fastest)
-- Pure HTTP requests with intelligent HTML parsing
-- **Performance**: Sub-second responses
-- **Success rate**: 75% of websites
-- **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
-
-### 2. 🐼 Lightpanda Browser (Fast)
-- Lightweight browser engine (2-3x faster than Chromium)
-- **Performance**: Fast JavaScript execution
-- **Fallback triggers**: SPA detection
-
-### 3. 🔵 Puppeteer (Complete)
-- Full Chromium browser for maximum compatibility
-- **Performance**: Complete JavaScript execution
-- **Fallback triggers**: Complex interactions needed
-
-### 📄 PDF Parser (Specialized)
-- Automatic PDF detection and parsing
-- **Features**: Text extraction, metadata, page count
-- **Smart Detection**: Works even when PDFs are served with wrong content-types
-- **Performance**: Typically 100-500ms for most PDFs
-
-### 📸 Screenshot Methods
-- **Chrome CLI**: Direct Chrome screenshot capture
-- **Quickshot**: Optimized with retry logic and smart timeouts
-
-## 📊 Performance Benchmark
-
-| Site Type | BNCA | Firecrawl | Speed Advantage |
-|-----------|------|-----------|----------------|
-| **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
-| **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
-| **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
-
-**Average**: 11.35x faster than Firecrawl with 100% reliability
-
-## 🎛️ API Reference
-
-### Convenience Functions
-
-#### `smartScrape(url, options?)`
-Quick scraping with intelligent fallback.
-
-#### `smartScreenshot(url, options?)`
-Take a screenshot of any webpage.
-
-#### `quickShot(url, options?)`
-Optimized screenshot capture for maximum speed.
-
-**Parameters:**
-- `url` (string): URL to scrape/capture
-- `options` (object, optional): Configuration options
-
-**Returns:** Promise<ScrapingResult>
-
-### `BNCASmartScraper`
-
-Main scraper class with advanced features.
-
-#### Constructor Options
-
-```javascript
-const scraper = new BNCASmartScraper({
-  timeout: 10000, // Request timeout in ms
-  retries: 2, // Number of retries per method
-  verbose: false, // Enable detailed logging
-  lightpandaPath: './lightpanda', // Path to Lightpanda binary
-  userAgent: 'Mozilla/5.0 ...', // Custom user agent
+  onResult: async (result) => await saveToDatabase(result),
+  onError: async (error) => console.error(error.url, error.error),
 });
 ```
 
-
-
-##### `scraper.scrape(url, options?)`
-
-Scrape a URL with intelligent fallback.
-
-```javascript
-const result = await scraper.scrape('https://example.com');
-```
-
-##### `scraper.screenshot(url, options?)`
-
-Take a screenshot of a webpage.
-
-```javascript
-const result = await scraper.screenshot('https://example.com');
-const img = result.screenshot; // data:image/png;base64,...
-```
+See [BULK_SCRAPING.md](./BULK_SCRAPING.md) for full documentation.
 
-
+### AI-powered Q&A
 
-
+Ask questions about any website using OpenRouter, OpenAI, or local fallback:
 
 ```javascript
-
-// 2-3x faster than regular screenshot
-```
-
-##### `scraper.getStats()`
-
-Get performance statistics.
+import { askWebsiteAI } from '@monostate/node-scraper';
 
-
-
-
+const answer = await askWebsiteAI('https://example.com', 'What is this site about?', {
+  openRouterApiKey: process.env.OPENROUTER_API_KEY,
+});
 ```
 
-
+API key priority: OpenRouter > OpenAI > BNCA backend > local pattern matching (no key needed).
 
-
+## How it works
 
-
-const health = await scraper.healthCheck();
-console.log(health.status); // 'healthy' or 'unhealthy'
-```
+The scraper uses a three-tier fallback system:
 
-
+1. **Direct fetch** — Pure HTTP with HTML parsing. Sub-second, handles ~75% of sites.
+2. **LightPanda** — Lightweight browser engine, 2-3x faster than Chromium. Handles SPAs.
+3. **Puppeteer** — Full Chromium for maximum compatibility.
 
-
+Additional specialized handlers:
+- **PDF parser** — Automatic detection by URL, content-type, or magic bytes. Extracts text, metadata, and page count.
+- **Screenshots** — Chrome CLI capture with retry logic and smart timeouts.
 
-
-await scraper.cleanup();
-```
+Browser instances are pooled (max 3, 5s idle timeout) to prevent memory leaks.
 
-
+## Performance
 
-
+| Site Type | node-scraper | Firecrawl | Speedup |
+|-----------|-------------|-----------|---------|
+| Wikipedia | 154ms | 4,662ms | 30.3x |
+| Hacker News | 1,715ms | 4,644ms | 2.7x |
+| GitHub | 9,167ms | 9,790ms | 1.1x |
 
-
-// Method 1: Using your own OpenRouter API key
-const scraper = new BNCASmartScraper({
-  openRouterApiKey: 'your-openrouter-api-key'
-});
-const result = await scraper.askAI('https://example.com', 'What is this website about?');
+Average: **11.35x faster** with 100% reliability.
 
-
-const scraper = new BNCASmartScraper({
-  openAIApiKey: 'your-openai-api-key',
-  // Optional: Use a compatible endpoint like Groq, Together AI, etc.
-  openAIBaseUrl: 'https://api.groq.com/openai'
-});
-const result = await scraper.askAI('https://example.com', 'What services do they offer?');
+## API Reference
 
-
-import { askWebsiteAI } from '@monostate/node-scraper';
-const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
-  openRouterApiKey: process.env.OPENROUTER_API_KEY
-});
+### Convenience functions
 
-
-
-
-
-
-
+| Function | Description |
+|----------|-------------|
+| `smartScrape(url, opts?)` | Scrape with intelligent fallback |
+| `smartScreenshot(url, opts?)` | Full page screenshot |
+| `quickShot(url, opts?)` | Fast screenshot capture |
+| `bulkScrape(urls, opts?)` | Batch scrape multiple URLs |
+| `bulkScrapeStream(urls, opts?)` | Stream results as they complete |
+| `askWebsiteAI(url, question, opts?)` | AI Q&A about a webpage |
 
-
-1. OpenRouter API key (`openRouterApiKey`)
-2. OpenAI API key (`openAIApiKey`)
-3. BNCA backend API (`apiKey`)
-4. Local fallback (pattern matching - no API key required)
+### BNCASmartScraper options
 
-**Configuration Options:**
-```javascript
-const result = await scraper.askAI(url, question, {
-  // OpenRouter specific
-  openRouterApiKey: 'sk-or-...',
-  model: 'meta-llama/llama-4-scout:free', // Default model
-
-  // OpenAI specific
-  openAIApiKey: 'sk-...',
-  openAIBaseUrl: 'https://api.openai.com', // Or compatible endpoint
-  model: 'gpt-3.5-turbo',
-
-  // Shared options
-  temperature: 0.3,
-  maxTokens: 500
-});
-```
-
-**Response Format:**
 ```javascript
 {
-
-
-
-
-
+  timeout: 10000, // Request timeout (ms)
+  retries: 2, // Retries per method
+  verbose: false, // Detailed logging
+  lightpandaPath: './bin/lightpanda',
+  lightpandaFormat: 'html', // 'html' or 'markdown'
+  userAgent: 'Mozilla/5.0 ...',
+  openRouterApiKey: '...',
+  openAIApiKey: '...',
+  openAIBaseUrl: 'https://api.openai.com',
 }
 ```
 
-###
-
-BNCA automatically detects and parses PDF documents:
-
-```javascript
-const pdfResult = await smartScrape('https://example.com/document.pdf');
-
-// Parsed content includes:
-const content = JSON.parse(pdfResult.content);
-console.log(content.title); // PDF title
-console.log(content.author); // Author name
-console.log(content.pages); // Number of pages
-console.log(content.text); // Full extracted text
-console.log(content.creationDate); // Creation date
-console.log(content.metadata); // Additional metadata
-```
-
-**PDF Detection Methods:**
-- URL ending with `.pdf`
-- Content-Type header `application/pdf`
-- Binary content starting with `%PDF` (magic bytes)
-- Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
-
-**Limitations:**
-- Maximum file size: 20MB
-- Text extraction only (no image OCR)
-- Requires `pdf-parse` dependency (automatically installed)
+### Methods
 
-
+- `scraper.scrape(url, opts?)` — Scrape with fallback
+- `scraper.screenshot(url, opts?)` — Take screenshot
+- `scraper.quickshot(url, opts?)` — Fast screenshot
+- `scraper.askAI(url, question, opts?)` — AI Q&A
+- `scraper.getStats()` — Performance statistics
+- `scraper.healthCheck()` — Check method availability
+- `scraper.cleanup()` — Close browser instances
 
-###
+### Response shape
 
 ```javascript
-
-
-
-
-
-
-
-
-  return Response.json({
-    success: true,
-    data: result.content,
-    method: result.method,
-    time: result.performance.totalTime
-  });
-  } catch (error) {
-    return Response.json({
-      success: false,
-      error: error.message
-    }, { status: 500 });
-  }
-}
-```
-
-### React Hook Example
-
-```javascript
-// hooks/useScraper.js
-import { useState } from 'react';
-
-export function useScraper() {
-  const [loading, setLoading] = useState(false);
-  const [data, setData] = useState(null);
-  const [error, setError] = useState(null);
-
-  const scrape = async (url) => {
-    setLoading(true);
-    setError(null);
-
-    try {
-      const response = await fetch('/api/scrape', {
-        method: 'POST',
-        headers: { 'Content-Type': 'application/json' },
-        body: JSON.stringify({ url })
-      });
-
-      const result = await response.json();
-
-      if (result.success) {
-        setData(result.data);
-      } else {
-        setError(result.error);
-      }
-    } catch (err) {
-      setError(err.message);
-    } finally {
-      setLoading(false);
-    }
-  };
-
-  return { scrape, loading, data, error };
-}
-```
-
-### Component Example
-
-```javascript
-// components/ScraperDemo.jsx
-import { useScraper } from '../hooks/useScraper';
-
-export default function ScraperDemo() {
-  const { scrape, loading, data, error } = useScraper();
-  const [url, setUrl] = useState('');
-
-  const handleScrape = () => {
-    if (url) scrape(url);
-  };
-
-  return (
-    <div className="p-4">
-      <div className="flex gap-2 mb-4">
-        <input
-          type="url"
-          value={url}
-          onChange={(e) => setUrl(e.target.value)}
-          placeholder="Enter URL to scrape..."
-          className="flex-1 px-3 py-2 border rounded"
-        />
-        <button
-          onClick={handleScrape}
-          disabled={loading}
-          className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
-        >
-          {loading ? 'Scraping...' : 'Scrape'}
-        </button>
-      </div>
-
-      {error && (
-        <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
-          Error: {error}
-        </div>
-      )}
-
-      {data && (
-        <div className="p-3 bg-green-100 rounded">
-          <h3 className="font-bold mb-2">Scraped Content:</h3>
-          <pre className="text-sm overflow-auto">{data}</pre>
-        </div>
-      )}
-    </div>
-  );
+{
+  success: true,
+  content: '...', // Extracted content (JSON string)
+  method: 'direct-fetch', // Method used
+  url: 'https://...',
+  performance: { totalTime: 154 },
+  stats: { ... }
 }
 ```
 
-##
-
-### Server-Side Only
-BNCA is designed for **server-side use only** due to:
-- Browser automation requirements (Puppeteer)
-- File system access for Lightpanda binary
-- CORS restrictions in browsers
-
-### Next.js Deployment
-- Use in API routes, not client components
-- Ensure Node.js 18+ in production environment
-- Consider adding Lightpanda binary to deployment
-
-### Lightpanda Setup (Optional)
-For maximum performance, install Lightpanda:
-
-```bash
-# macOS ARM64
-curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
-chmod +x lightpanda
-
-# Linux x64
-curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
-chmod +x lightpanda
-```
-
-## 🔒 Privacy & Security
-
-- **No external API calls** - all processing is local
-- **No data collection** - your data stays private
-- **Respects robots.txt** (optional enforcement)
-- **Configurable rate limiting**
-
-## 📝 TypeScript Support
+## TypeScript
 
-Full
+Full type definitions are included (`index.d.ts`).
 
 ```typescript
-import { BNCASmartScraper, ScrapingResult
-
-const scraper: BNCASmartScraper = new BNCASmartScraper({
-  timeout: 5000,
-  verbose: true
-});
-
-const result: ScrapingResult = await scraper.scrape('https://example.com');
+import { BNCASmartScraper, ScrapingResult } from '@monostate/node-scraper';
 ```
 
-##
-
-### v1.6.0 (Latest)
-- **Method Override**: Force specific scraping methods with `method` parameter
-- **Enhanced Error Handling**: Categorized error types for better debugging
-- **Fallback Chain Tracking**: See which methods were attempted in auto mode
-- **Graceful Failures**: No automatic fallback when method is forced
-
-### v1.5.0
-- **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
-- **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
-- **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
-- **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
-- **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
-- **Enhanced TypeScript**: Complete type definitions for all AI features
-
-### v1.4.0
-- Internal release (skipped for public release)
+## Notes
 
-
-- **
--
--
-- **Fast Performance**: PDF parsing typically completes in 100-500ms
-- **Comprehensive Extraction**: Title, author, creation date, page count, and full text
+- **Server-side only** — requires filesystem access and browser automation.
+- **Node.js 20+** required.
+- LightPanda binary is auto-installed. Puppeteer is optional.
+- No external API calls for scraping — all processing is local.
 
-
-- **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
-- **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
-- **Improved Performance**: Enhanced binary detection and ES6 module compatibility
-- **Better Error Handling**: More robust installation scripts with retry logic
-- **Zero Configuration**: No manual setup required - works out of the box
-
-### v1.1.1
-- Bug fixes and stability improvements
-- Enhanced Puppeteer integration
-
-### v1.1.0
-- Added screenshot capabilities
-- Improved fallback system
-- Performance optimizations
-
-## 🤝 Contributing
-
-See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
-
-## 📄 License
-
-MIT License - see [LICENSE](../../LICENSE) file for details.
-
----
-
-<div align="center">
+## Changelog
 
-
+See [CHANGELOG.md](./CHANGELOG.md) for the full release history.
 
-
+## License
 
-
+MIT