@monostate/node-scraper 1.2.0 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -19,7 +19,11 @@ yarn add @monostate/node-scraper
  pnpm add @monostate/node-scraper
  ```
 
- **🎉 New in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
+ **🤖 New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI. (Note: v1.4.0 was an internal release)
+
+ **🎉 Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
+
+ **✨ Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
 
  ### Zero-Configuration Setup
 
@@ -45,6 +49,10 @@ console.log(screenshot.screenshot); // Base64 encoded image
  // Quick screenshot (optimized for speed)
  const quick = await quickShot('https://example.com');
  console.log(quick.screenshot); // Fast screenshot capture
+
+ // PDF parsing (automatic detection)
+ const pdfResult = await smartScrape('https://example.com/document.pdf');
+ console.log(pdfResult.content); // Extracted text, metadata, page count
  ```
 
  ### Advanced Usage
@@ -66,12 +74,13 @@ await scraper.cleanup(); // Clean up resources
 
  ## 🔧 How It Works
 
- BNCA uses a sophisticated 3-tier fallback system:
+ BNCA uses a sophisticated multi-tier system with intelligent detection:
 
  ### 1. 🔄 Direct Fetch (Fastest)
  - Pure HTTP requests with intelligent HTML parsing
  - **Performance**: Sub-second responses
  - **Success rate**: 75% of websites
+ - **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
 
  ### 2. 🐼 Lightpanda Browser (Fast)
  - Lightweight browser engine (2-3x faster than Chromium)
@@ -83,6 +92,12 @@ BNCA uses a sophisticated 3-tier fallback system:
  - **Performance**: Complete JavaScript execution
  - **Fallback triggers**: Complex interactions needed
 
+ ### 📄 PDF Parser (Specialized)
+ - Automatic PDF detection and parsing
+ - **Features**: Text extraction, metadata, page count
+ - **Smart Detection**: Works even when PDFs are served with wrong content-types
+ - **Performance**: Typically 100-500ms for most PDFs
+
  ### 📸 Screenshot Methods
  - **Chrome CLI**: Direct Chrome screenshot capture
  - **Quickshot**: Optimized with retry logic and smart timeouts
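The magic-byte detection this hunk describes is a few lines of plain Node.js: every PDF file begins with the ASCII signature `%PDF` (e.g. `%PDF-1.7`), so four bytes are enough to identify one regardless of the reported content-type. A minimal sketch — the function name is illustrative, not part of the package's API:

```javascript
// Detect a PDF by its leading magic bytes instead of trusting the
// Content-Type header. Every PDF starts with the ASCII signature "%PDF".
function looksLikePdf(buffer) {
  return Buffer.from(buffer).subarray(0, 4).toString('latin1') === '%PDF';
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n%...')));  // true
console.log(looksLikePdf(Buffer.from('<!DOCTYPE html>'))); // false
```

This is why the parser works for PDFs served as `application/octet-stream`: the check never consults the header at all.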
@@ -186,6 +201,101 @@ Clean up resources (close browser instances).
  await scraper.cleanup();
  ```
 
+ ### 🤖 AI-Powered Q&A
+
+ Ask questions about any website and get AI-generated answers:
+
+ ```javascript
+ // Method 1: Using your own OpenRouter API key
+ const scraper = new BNCASmartScraper({
+   openRouterApiKey: 'your-openrouter-api-key'
+ });
+ const result = await scraper.askAI('https://example.com', 'What is this website about?');
+
+ // Method 2: Using the OpenAI API (or compatible endpoints)
+ const scraper = new BNCASmartScraper({
+   openAIApiKey: 'your-openai-api-key',
+   // Optional: use a compatible endpoint such as Groq or Together AI
+   openAIBaseUrl: 'https://api.groq.com/openai'
+ });
+ const result = await scraper.askAI('https://example.com', 'What services do they offer?');
+
+ // Method 3: One-liner with OpenRouter
+ import { askWebsiteAI } from '@monostate/node-scraper';
+ const answer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
+   openRouterApiKey: process.env.OPENROUTER_API_KEY
+ });
+
+ // Method 4: Using the BNCA backend API (requires a BNCA API key)
+ const scraper = new BNCASmartScraper({
+   apiKey: 'your-bnca-api-key'
+ });
+ const result = await scraper.askAI('https://example.com', 'What products are featured?');
+ ```
+
+ **API Key Priority:**
+ 1. OpenRouter API key (`openRouterApiKey`)
+ 2. OpenAI API key (`openAIApiKey`)
+ 3. BNCA backend API (`apiKey`)
+ 4. Local fallback (pattern matching - no API key required)
+
+ **Configuration Options:**
+ ```javascript
+ const result = await scraper.askAI(url, question, {
+   // OpenRouter specific
+   openRouterApiKey: 'sk-or-...',
+   model: 'meta-llama/llama-4-scout:free', // default OpenRouter model
+
+   // OpenAI specific
+   openAIApiKey: 'sk-...',
+   openAIBaseUrl: 'https://api.openai.com', // or a compatible endpoint
+   model: 'gpt-3.5-turbo', // default OpenAI model (set `model` once per call)
+
+   // Shared options
+   temperature: 0.3,
+   maxTokens: 500
+ });
+ ```
+
+ **Response Format:**
+ ```javascript
+ {
+   success: true,
+   answer: "This website is about...",
+   method: "direct-fetch", // Scraping method used
+   scrapeTime: 1234,       // Time to scrape in ms
+   processing: "openrouter" // AI processing method used
+ }
+ ```
+
+ ### 📄 PDF Support
+
+ BNCA automatically detects and parses PDF documents:
+
+ ```javascript
+ const pdfResult = await smartScrape('https://example.com/document.pdf');
+
+ // Parsed content includes:
+ const content = JSON.parse(pdfResult.content);
+ console.log(content.title);        // PDF title
+ console.log(content.author);       // Author name
+ console.log(content.pages);        // Number of pages
+ console.log(content.text);         // Full extracted text
+ console.log(content.creationDate); // Creation date
+ console.log(content.metadata);     // Additional metadata
+ ```
+
+ **PDF Detection Methods:**
+ - URL ending with `.pdf`
+ - Content-Type header `application/pdf`
+ - Binary content starting with `%PDF` (magic bytes)
+ - Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
+
+ **Limitations:**
+ - Maximum file size: 20MB
+ - Text extraction only (no image OCR)
+ - Requires `pdf-parse` dependency (automatically installed)
+
  ## 📱 Next.js Integration
 
  ### API Route Example
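The key-priority list in the README hunk above amounts to a simple resolution chain. A minimal sketch of that chain, using the option and environment-variable names the README documents — `resolveProcessing` itself is illustrative, not a package export:

```javascript
// Resolve which AI processing path askAI would take, following the documented
// priority: OpenRouter > OpenAI > BNCA backend > local pattern-matching fallback.
function resolveProcessing(options = {}, env = {}) {
  if (options.openRouterApiKey || env.OPENROUTER_API_KEY) return 'openrouter';
  if (options.openAIApiKey || env.OPENAI_API_KEY) return 'openai';
  if (options.apiKey) return 'backend';
  return 'local'; // no API key required
}

console.log(resolveProcessing({ openRouterApiKey: 'sk-or-...' })); // "openrouter"
console.log(resolveProcessing({ apiKey: 'bnca-key' }));            // "backend"
console.log(resolveProcessing());                                  // "local"
```

Note that the actual `askAI` also falls through this chain at runtime: a failed OpenRouter call still drops down to OpenAI, the backend, and finally the local fallback.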
@@ -354,7 +464,14 @@ const result: ScrapingResult = await scraper.scrape('https://example.com');
 
  ## 📋 Changelog
 
- ### v1.2.0 (Latest)
+ ### v1.3.0 (Latest)
+ - 📄 **PDF Support**: Full PDF parsing with text extraction, metadata, and page count
+ - 🔍 **Smart PDF Detection**: Detects PDFs by URL patterns, content-type, or magic bytes
+ - 🚀 **Robust Parsing**: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
+ - ⚡ **Fast Performance**: PDF parsing typically completes in 100-500ms
+ - 📊 **Comprehensive Extraction**: Title, author, creation date, page count, and full text
+
+ ### v1.2.0
  - 🎉 **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
  - 🔧 **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
  - ⚡ **Improved Performance**: Enhanced binary detection and ES6 module compatibility
package/bin/lightpanda CHANGED
File without changes
package/index.d.ts CHANGED
@@ -13,6 +13,24 @@ export interface ScrapingOptions {
  lightpandaPath?: string;
  /** Custom user agent string */
  userAgent?: string;
+ /** BNCA API key for backend services */
+ apiKey?: string;
+ /** BNCA API URL (defaults to https://bnca-api.fly.dev) */
+ apiUrl?: string;
+ /** OpenRouter API key for AI processing */
+ openRouterApiKey?: string;
+ /** OpenAI API key for AI processing */
+ openAIApiKey?: string;
+ /** OpenAI base URL (for compatible endpoints) */
+ openAIBaseUrl?: string;
+ /** AI model to use */
+ model?: string;
+ /** AI temperature setting */
+ temperature?: number;
+ /** Maximum tokens for AI response */
+ maxTokens?: number;
+ /** HTTP referer for OpenRouter */
+ referer?: string;
  }
 
  export interface ScrapingResult {
@@ -146,6 +164,22 @@ export class BNCASmartScraper {
   * @returns Promise resolving to screenshot result
   */
  quickshot(url: string, options?: ScrapingOptions): Promise<ScrapingResult>;
+
+ /**
+  * Ask AI a question about a URL
+  * @param url The URL to analyze
+  * @param question The question to answer about the page
+  * @param options Optional configuration overrides
+  * @returns Promise resolving to AI answer
+  */
+ askAI(url: string, question: string, options?: ScrapingOptions): Promise<{
+   success: boolean;
+   answer?: string;
+   error?: string;
+   method?: string;
+   scrapeTime?: number;
+   processing?: 'openrouter' | 'openai' | 'backend' | 'local';
+ }>;
 
  /**
   * Get performance statistics for all methods
@@ -248,6 +282,22 @@ export function smartScreenshot(url: string, options?: ScrapingOptions): Promise
   */
  export function quickShot(url: string, options?: ScrapingOptions): Promise<ScrapingResult>;
 
+ /**
+  * Convenience function for asking AI questions about a webpage
+  * @param url The URL to analyze
+  * @param question The question to answer
+  * @param options Optional configuration
+  * @returns Promise resolving to AI answer
+  */
+ export function askWebsiteAI(url: string, question: string, options?: ScrapingOptions): Promise<{
+   success: boolean;
+   answer?: string;
+   error?: string;
+   method?: string;
+   scrapeTime?: number;
+   processing?: 'openrouter' | 'openai' | 'backend' | 'local';
+ }>;
+
  /**
   * Default export - same as BNCASmartScraper class
   */
package/index.js CHANGED
@@ -5,6 +5,7 @@ import { existsSync, statSync } from 'fs';
  import path from 'path';
  import { fileURLToPath } from 'url';
  import { promises as fsPromises } from 'fs';
+ import pdfParse from 'pdf-parse/lib/pdf-parse.js';
 
  let puppeteer = null;
  try {
@@ -42,10 +43,237 @@ export class BNCASmartScraper {
    this.stats = {
      directFetch: { attempts: 0, successes: 0 },
      lightpanda: { attempts: 0, successes: 0 },
-     puppeteer: { attempts: 0, successes: 0 }
+     puppeteer: { attempts: 0, successes: 0 },
+     pdf: { attempts: 0, successes: 0 }
    };
  }
 
+  /**
+   * Ask AI a question about a URL.
+   * Scrapes the URL and uses AI to answer the question.
+   *
+   * @param {string} url - URL to analyze
+   * @param {string} question - Question to answer
+   * @param {object} options - Additional options
+   * @returns {Promise<object>} AI response with answer
+   */
+  async askAI(url, question, options = {}) {
+    try {
+      // First scrape the content
+      const scrapeResult = await this.scrape(url, options);
+
+      if (!scrapeResult.success) {
+        return {
+          success: false,
+          error: `Failed to scrape URL: ${scrapeResult.error}`,
+          method: scrapeResult.method
+        };
+      }
+
+      // Check for OpenRouter/OpenAI API keys
+      const openRouterKey = options.openRouterApiKey || this.options.openRouterApiKey || process.env.OPENROUTER_API_KEY;
+      const openAIKey = options.openAIApiKey || this.options.openAIApiKey || process.env.OPENAI_API_KEY;
+
+      // Priority: OpenRouter > OpenAI > Backend API > Local
+      if (openRouterKey) {
+        try {
+          const answer = await this.processWithOpenRouter(question, scrapeResult.content, openRouterKey, options);
+          return {
+            success: true,
+            answer,
+            method: scrapeResult.method,
+            scrapeTime: scrapeResult.performance.totalTime, // totalTime lives on `performance`, not `stats`
+            processing: 'openrouter'
+          };
+        } catch (error) {
+          this.log(' ⚠️ OpenRouter API call failed, falling back...');
+        }
+      }
+
+      if (openAIKey) {
+        try {
+          const answer = await this.processWithOpenAI(question, scrapeResult.content, openAIKey, options);
+          return {
+            success: true,
+            answer,
+            method: scrapeResult.method,
+            scrapeTime: scrapeResult.performance.totalTime,
+            processing: 'openai'
+          };
+        } catch (error) {
+          this.log(' ⚠️ OpenAI API call failed, falling back...');
+        }
+      }
+
+      // If a BNCA API key is provided, use the backend API
+      if (this.options.apiKey) {
+        try {
+          const response = await fetch(`${this.options.apiUrl || 'https://bnca-api.fly.dev'}/aireply`, {
+            method: 'POST',
+            headers: {
+              'x-api-key': this.options.apiKey,
+              'Content-Type': 'application/json'
+            },
+            body: JSON.stringify({ url, question })
+          });
+
+          if (response.ok) {
+            const data = await response.json();
+            return {
+              success: true,
+              answer: data.answer,
+              method: scrapeResult.method,
+              scrapeTime: scrapeResult.performance.totalTime,
+              processing: 'backend'
+            };
+          }
+        } catch (error) {
+          this.log(' ⚠️ Backend API call failed, using local AI processing');
+        }
+      }
+
+      // Local AI processing fallback
+      const answer = this.processLocally(question, scrapeResult.content);
+
+      return {
+        success: true,
+        answer,
+        method: scrapeResult.method,
+        scrapeTime: scrapeResult.performance.totalTime,
+        processing: 'local'
+      };
+
+    } catch (error) {
+      return {
+        success: false,
+        error: error.message || 'AI processing failed'
+      };
+    }
+  }
+
+  /**
+   * Process with the OpenRouter API
+   * @private
+   */
+  async processWithOpenRouter(question, content, apiKey, options = {}) {
+    const parsedContent = typeof content === 'string' ? JSON.parse(content) : content;
+
+    const contentText = `
+Title: ${parsedContent.title || 'Unknown'}
+Content: ${parsedContent.content || parsedContent.bodyText || 'No content available'}
+Meta Description: ${parsedContent.metaDescription || 'None'}
+${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h => `- ${h.text || h}`).join('\n')}` : ''}
+`.trim();
+
+    const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
+      method: 'POST',
+      headers: {
+        'Content-Type': 'application/json',
+        'Authorization': `Bearer ${apiKey}`,
+        'HTTP-Referer': options.referer || 'https://github.com/monostate/node-scraper',
+        'X-Title': 'BNCA Node Scraper',
+      },
+      body: JSON.stringify({
+        model: options.model || 'meta-llama/llama-4-scout:free',
+        messages: [
+          {
+            role: 'system',
+            content: 'You are a helpful assistant that answers questions based on website content. Provide accurate, concise answers based only on the provided content.'
+          },
+          {
+            role: 'user',
+            content: `Based on the following website content, please answer this question: ${question}\n\nWebsite content:\n${contentText}`
+          }
+        ],
+        temperature: options.temperature || 0.3,
+        max_tokens: options.maxTokens || 500,
+      }),
+    });
+
+    if (!response.ok) {
+      throw new Error(`OpenRouter API error: ${response.status}`);
+    }
+
+    const data = await response.json();
+    return data.choices[0]?.message?.content || 'No response from AI';
+  }
+
+  /**
+   * Process with the OpenAI API
+   * @private
+   */
+  async processWithOpenAI(question, content, apiKey, options = {}) {
+    const parsedContent = typeof content === 'string' ? JSON.parse(content) : content;
+
+    const contentText = `
+Title: ${parsedContent.title || 'Unknown'}
+Content: ${parsedContent.content || parsedContent.bodyText || 'No content available'}
+Meta Description: ${parsedContent.metaDescription || 'None'}
+${parsedContent.headings?.length ? `\nHeadings:\n${parsedContent.headings.map(h => `- ${h.text || h}`).join('\n')}` : ''}
+`.trim();
+
+    const baseUrl = options.openAIBaseUrl || 'https://api.openai.com';
+    const response = await fetch(`${baseUrl}/v1/chat/completions`, {
+      method: 'POST',
+      headers: {
+        'Content-Type': 'application/json',
+        'Authorization': `Bearer ${apiKey}`,
+      },
+      body: JSON.stringify({
+        model: options.model || 'gpt-3.5-turbo',
+        messages: [
+          {
+            role: 'system',
+            content: 'You are a helpful assistant that answers questions based on website content. Provide accurate, concise answers based only on the provided content.'
+          },
+          {
+            role: 'user',
+            content: `Based on the following website content, please answer this question: ${question}\n\nWebsite content:\n${contentText}`
+          }
+        ],
+        temperature: options.temperature || 0.3,
+        max_tokens: options.maxTokens || 500,
+      }),
+    });
+
+    if (!response.ok) {
+      throw new Error(`OpenAI API error: ${response.status}`);
+    }
+
+    const data = await response.json();
+    return data.choices[0]?.message?.content || 'No response from AI';
+  }
+
+  /**
+   * Local AI processing (simple pattern matching)
+   * @private
+   */
+  processLocally(question, content) {
+    const parsedContent = typeof content === 'string' ? JSON.parse(content) : content;
+
+    const title = parsedContent.title || 'Unknown';
+    const text = parsedContent.content || parsedContent.bodyText || '';
+    const lowerQuestion = question.toLowerCase();
+
+    if (lowerQuestion.includes('title')) {
+      return `The page title is "${title}".`;
+    }
+
+    if (lowerQuestion.includes('about') || lowerQuestion.includes('what')) {
+      return `This page titled "${title}" contains: ${text.substring(0, 200)}...`;
+    }
+
+    if (lowerQuestion.includes('contact') || lowerQuestion.includes('email')) {
+      const emailMatch = text.match(/[\w.-]+@[\w.-]+\.\w+/);
+      return emailMatch ?
+        `Found contact: ${emailMatch[0]}` :
+        'No contact information found.';
+    }
+
+    return `Based on "${title}": ${text.substring(0, 150)}...`;
+  }
+
  /**
   * Main scraping method with intelligent fallback
   */
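The local fallback's `processLocally` is plain pattern matching over the scraped text; its contact branch, for example, boils down to a single regex scan. A standalone sketch of that branch — `findContact` is an illustrative name, not a package export:

```javascript
// The "contact" branch of the local fallback: find the first email-like
// token in scraped text with the same style of regex as processLocally.
function findContact(text) {
  const emailMatch = text.match(/[\w.-]+@[\w.-]+\.\w+/);
  return emailMatch
    ? `Found contact: ${emailMatch[0]}`
    : 'No contact information found.';
}

console.log(findContact('Reach us at support@example.com any time.'));
// "Found contact: support@example.com"
console.log(findContact('No email listed.'));
// "No contact information found."
```

This is why the local path needs no API key: it only inspects the title and body text with string and regex checks.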
@@ -60,6 +288,35 @@ export class BNCASmartScraper {
    let lastError = null;
 
    try {
+     // Check if the URL is a PDF (by extension or path pattern)
+     const isPdfUrl = url.toLowerCase().endsWith('.pdf') ||
+       url.toLowerCase().includes('.pdf?') ||
+       url.toLowerCase().includes('/pdf/');
+
+     if (isPdfUrl) {
+       this.log(' 📄 PDF detected, using PDF parser...');
+       result = await this.tryPDFParse(url, config);
+
+       if (result.success) {
+         method = 'pdf';
+         this.log(' ✅ PDF parsing successful');
+
+         const totalTime = Date.now() - startTime;
+         return {
+           ...result,
+           method,
+           performance: {
+             totalTime,
+             method
+           },
+           stats: this.getStats()
+         };
+       } else {
+         this.log(' ❌ PDF parsing failed');
+         lastError = result.error;
+       }
+     }
+
      // Step 1: Try direct fetch first (fastest)
      this.log(' 🔄 Attempting direct fetch...');
      result = await this.tryDirectFetch(url, config);
@@ -67,6 +324,29 @@ export class BNCASmartScraper {
      if (result.success && !result.needsBrowser) {
        method = 'direct-fetch';
        this.log(' ✅ Direct fetch successful');
+     } else if (result.isPdf) {
+       // Direct fetch detected a PDF, try the PDF parser
+       this.log(' 📄 Direct fetch detected PDF content, using PDF parser...');
+       result = await this.tryPDFParse(url, config);
+
+       if (result.success) {
+         method = 'pdf';
+         this.log(' ✅ PDF parsing successful');
+
+         const totalTime = Date.now() - startTime;
+         return {
+           ...result,
+           method,
+           performance: {
+             totalTime,
+             method
+           },
+           stats: this.getStats()
+         };
+       } else {
+         this.log(' ❌ PDF parsing failed');
+         lastError = result.error;
+       }
      } else {
        this.log(result.needsBrowser ? ' ⚠️ Browser rendering required' : ' ❌ Direct fetch failed');
        lastError = result.error;
@@ -152,7 +432,32 @@ export class BNCASmartScraper {
      };
    }
 
-   const html = await response.text();
+   // Check if the response is actually a PDF
+   const contentType = response.headers.get('content-type') || '';
+   if (contentType.includes('application/pdf')) {
+     return {
+       success: false,
+       error: 'Content is PDF, should use PDF parser',
+       isPdf: true
+     };
+   }
+
+   // Get the response as an ArrayBuffer to check magic bytes
+   const buffer = await response.arrayBuffer();
+   const firstBytes = new Uint8Array(buffer.slice(0, 5));
+   const signature = Array.from(firstBytes).map(b => String.fromCharCode(b)).join('');
+
+   // Check for PDF magic bytes
+   if (signature.startsWith('%PDF')) {
+     return {
+       success: false,
+       error: 'Content is PDF (detected by magic bytes), should use PDF parser',
+       isPdf: true
+     };
+   }
+
+   // Convert the buffer back to text for HTML processing
+   const html = new TextDecoder().decode(buffer);
 
    // Intelligent browser detection
    const needsBrowser = this.detectBrowserRequirement(html, url);
@@ -390,6 +695,95 @@ export class BNCASmartScraper {
    }
  }
 
+  /**
+   * PDF parsing method - handles PDF documents
+   */
+  async tryPDFParse(url, config) {
+    this.stats.pdf.attempts++;
+
+    try {
+      // Download the PDF with a timeout
+      const controller = new AbortController();
+      const timeoutId = setTimeout(() => controller.abort(), config.timeout);
+
+      const response = await fetch(url, {
+        headers: {
+          'User-Agent': config.userAgent,
+          'Accept': 'application/pdf,*/*'
+        },
+        signal: controller.signal
+      });
+
+      clearTimeout(timeoutId);
+
+      if (!response.ok) {
+        return {
+          success: false,
+          error: `HTTP ${response.status}: ${response.statusText}`
+        };
+      }
+
+      // Check content type (be lenient - accept various content types)
+      const contentType = response.headers.get('content-type') || '';
+      const acceptableTypes = ['pdf', 'octet-stream', 'binary', 'download'];
+      const isAcceptableType = acceptableTypes.some(type => contentType.includes(type));
+
+      if (!isAcceptableType && !url.toLowerCase().includes('.pdf')) {
+        return {
+          success: false,
+          error: `Not a PDF document: ${contentType}`
+        };
+      }
+
+      // Get the PDF buffer
+      const arrayBuffer = await response.arrayBuffer();
+      const buffer = Buffer.from(arrayBuffer);
+
+      // Check the size limit (20MB)
+      if (buffer.length > 20 * 1024 * 1024) {
+        return {
+          success: false,
+          error: 'PDF too large (max 20MB)'
+        };
+      }
+
+      // Parse the PDF
+      const pdfData = await pdfParse(buffer);
+
+      // Extract structured content
+      const content = {
+        title: pdfData.info?.Title || 'Untitled PDF',
+        author: pdfData.info?.Author || '',
+        subject: pdfData.info?.Subject || '',
+        keywords: pdfData.info?.Keywords || '',
+        creator: pdfData.info?.Creator || '',
+        producer: pdfData.info?.Producer || '',
+        creationDate: pdfData.info?.CreationDate || '',
+        modificationDate: pdfData.info?.ModificationDate || '',
+        pages: pdfData.numpages || 0,
+        text: pdfData.text || '',
+        metadata: pdfData.metadata || null,
+        url: url
+      };
+
+      this.stats.pdf.successes++;
+
+      return {
+        success: true,
+        content: JSON.stringify(content, null, 2),
+        size: buffer.length,
+        contentType: 'application/pdf',
+        pages: content.pages
+      };
+
+    } catch (error) {
+      return {
+        success: false,
+        error: `PDF parsing error: ${error.message}`
+      };
+    }
+  }
+
  /**
   * Intelligent detection of browser requirement
   */
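The pre-parse guards in `tryPDFParse` above can be isolated into one pure function: a lenient content-type check (PDFs are often mislabeled as `octet-stream`) plus the 20MB size cap, with a `.pdf` URL as a type override. A sketch — `pdfDownloadAllowed` is an illustrative helper, not a package export:

```javascript
// Mirror tryPDFParse's guards: accept lenient content-types or a .pdf URL,
// and reject anything over the 20MB limit.
const MAX_PDF_BYTES = 20 * 1024 * 1024;
const ACCEPTABLE_TYPES = ['pdf', 'octet-stream', 'binary', 'download'];

function pdfDownloadAllowed(url, contentType, byteLength) {
  const typeOk = ACCEPTABLE_TYPES.some(t => (contentType || '').includes(t)) ||
    url.toLowerCase().includes('.pdf');
  return typeOk && byteLength <= MAX_PDF_BYTES;
}

console.log(pdfDownloadAllowed('https://example.com/a.pdf', 'application/octet-stream', 1024)); // true
console.log(pdfDownloadAllowed('https://example.com/page', 'text/html', 1024));                 // false
console.log(pdfDownloadAllowed('https://example.com/a.pdf', 'application/pdf', 21 * 1024 * 1024)); // false
```

The leniency is deliberate: rejecting on content-type alone would break GitHub raw files and other hosts that serve PDFs as generic binary downloads.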
@@ -633,7 +1027,9 @@ export class BNCASmartScraper {
        lightpanda: this.stats.lightpanda.attempts > 0 ?
          (this.stats.lightpanda.successes / this.stats.lightpanda.attempts * 100).toFixed(1) + '%' : '0%',
        puppeteer: this.stats.puppeteer.attempts > 0 ?
-         (this.stats.puppeteer.successes / this.stats.puppeteer.attempts * 100).toFixed(1) + '%' : '0%'
+         (this.stats.puppeteer.successes / this.stats.puppeteer.attempts * 100).toFixed(1) + '%' : '0%',
+       pdf: this.stats.pdf.attempts > 0 ?
+         (this.stats.pdf.successes / this.stats.pdf.attempts * 100).toFixed(1) + '%' : '0%'
      }
    };
  }
@@ -995,4 +1391,16 @@ export async function quickShot(url, options = {}) {
    }
  }
 
+ export async function askWebsiteAI(url, question, options = {}) {
+   const scraper = new BNCASmartScraper(options);
+   try {
+     return await scraper.askAI(url, question, options);
+   } finally {
+     await scraper.cleanup();
+   }
+ }
+
  export default BNCASmartScraper;
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "@monostate/node-scraper",
-   "version": "1.2.0",
-   "description": "Intelligent web scraping with multi-level fallback system - 11.35x faster than Firecrawl",
+   "version": "1.5.0",
+   "description": "Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers",
    "type": "module",
    "main": "index.js",
    "types": "index.d.ts",
@@ -19,6 +19,9 @@
      "scripts/",
      "bin/"
    ],
+   "scripts": {
+     "postinstall": "node scripts/install-lightpanda.js"
+   },
    "keywords": [
      "web-scraping",
      "crawling",
@@ -29,6 +32,12 @@
      "data-extraction",
      "automation",
      "browser",
+     "ai-powered",
+     "question-answering",
+     "pdf-parsing",
+     "openrouter",
+     "openai",
+     "llm",
      "nextjs",
      "react",
      "performance",
@@ -37,7 +46,8 @@
    "author": "BNCA Team",
    "license": "MIT",
    "dependencies": {
-     "node-fetch": "^3.3.2"
+     "node-fetch": "^3.3.2",
+     "pdf-parse": "^1.1.1"
    },
    "peerDependencies": {
      "puppeteer": ">=20.0.0"
@@ -65,8 +75,5 @@
    },
    "publishConfig": {
      "access": "public"
-   },
-   "scripts": {
-     "postinstall": "node scripts/install-lightpanda.js"
    }
  }
package/LICENSE DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) 2025 BNCA Team
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.