@firekid/scraper 1.0.1

package/README.md ADDED
# Firekid Scraper

Advanced web scraping framework built on Playwright, with intelligent anti-detection, self-healing selectors, and distributed crawling.

**GitHub:** [Firekid-is-him/firekid-scraper-sdk](https://github.com/Firekid-is-him/firekid-scraper-sdk)

## Features

**Anti-Detection & Stealth:**
- Ghost fingerprinting system spoofs canvas, WebGL, audio, fonts, and navigator properties
- Cloudflare bypass with automatic fallback to manual solving
- Behavioral profiles that mimic human interaction patterns
- Network forensics cleaning removes tracking artifacts

**Intelligent Automation:**
- Self-healing selectors with 7 fallback strategies
- Pattern caching with SQLite storage for learned behaviors
- Smart fetch with automatic referer chain management
- Action recorder captures and replays user interactions

**Distributed Crawling:**
- Queue-based task distribution across multiple workers
- Browser worker pool with resource management
- Rate limiting with configurable windows and thresholds
- Session persistence and recovery

**Developer Experience:**
- Simple command-based scripting language
- Plugin system for extensibility
- Multiple scraping modes: auto, manual, SSR, infinite scroll, and pagination
- Built-in scheduler for recurring tasks
- Webhook notifications and database export

## Installation

```bash
npm install @firekid/scraper
npx playwright install chromium
```

### Global CLI Installation

```bash
npm install -g @firekid/scraper
firekid-scraper --help
```

### Docker Installation

```bash
docker pull firekid/scraper:latest
docker run -v $(pwd)/data:/data firekid/scraper
```

## Quick Start

### Basic Scraping

```javascript
import { FirekidScraper } from '@firekid/scraper'

const scraper = new FirekidScraper({
  headless: true,
  bypassCloudflare: true
})

await scraper.init()

const data = await scraper.scrape('https://example.com', {
  selectors: {
    title: 'h1',
    content: '.article-body',
    author: '.author-name'
  }
})

console.log(data)
await scraper.close()
```

### Command-Based Scripting

```javascript
const scraper = new FirekidScraper()
await scraper.init()

await scraper.runCommands(`
  GOTO https://example.com
  WAIT .product-list
  EXTRACT .product-title text AS titles
  EXTRACT .product-price text AS prices
  SCREENSHOT products.png
`)

await scraper.close()
```

### Auto Mode

```javascript
const scraper = new FirekidScraper()
await scraper.init()

const data = await scraper.auto('https://example.com/products', {
  depth: 2,
  extractPattern: 'product'
})

await scraper.close()
```

## Core API

### FirekidScraper

Main scraper class that orchestrates all operations.

**Constructor Options:**

```javascript
new FirekidScraper({
  headless: boolean,            // Run browser in headless mode (default: true)
  bypassCloudflare: boolean,    // Enable Cloudflare bypass (default: false)
  useGhost: boolean,            // Enable fingerprint spoofing (default: true)
  browserArgs: string[],        // Additional Chromium arguments
  timeout: number,              // Default timeout in ms (default: 30000)
  userAgent: string,            // Custom user agent
  viewport: { width, height },  // Browser viewport size
  proxy: string,                // Proxy URL (http://user:pass@host:port)
  rateLimit: {                  // Rate limiting configuration
    enabled: boolean,
    max: number,
    window: number
  }
})
```

**Methods:**

`await scraper.init()`: Initialize browser and context.

`await scraper.close()`: Close browser and clean up resources.

`await scraper.goto(url)`: Navigate to URL with anti-detection measures.

`await scraper.scrape(url, options)`: Extract data using CSS selectors.

Options:
- `selectors`: Object mapping field names to CSS selectors
- `attribute`: Extract an attribute instead of text (default: text)
- `multiple`: Return an array of all matches (default: false)
- `screenshot`: Take a screenshot after extraction

`await scraper.runCommands(script)`: Execute a command-based script.

`await scraper.auto(url, options)`: Automatically detect and extract data.

Options:
- `depth`: Maximum crawl depth (default: 1)
- `extractPattern`: Pattern hint (product, article, listing, etc.)
- `followLinks`: Follow pagination/navigation links

`await scraper.paginate(url, selector, options)`: Scrape paginated content.

Options:
- `maxPages`: Maximum pages to scrape
- `waitBetween`: Delay between pages in ms
- `nextSelector`: Selector for the next-page button

`await scraper.infiniteScroll(url, options)`: Scrape infinite scroll pages.

Options:
- `maxScrolls`: Maximum scroll iterations
- `itemSelector`: Selector for items to extract
- `scrollDelay`: Delay between scrolls in ms

### Plugin System

Extend functionality through plugins.

**Loading Plugins:**

```javascript
const scraper = new FirekidScraper()
await scraper.loadPlugin('./plugins/custom-plugin.js')
```

**Plugin Structure:**

```javascript
export default {
  name: 'custom-extractor',
  type: 'extractor',

  async execute(page, options) {
    const data = await page.evaluate(() => {
      return {
        title: document.title,
        meta: Array.from(document.querySelectorAll('meta'))
          .map(m => ({ name: m.name, content: m.content }))
      }
    })
    return data
  }
}
```

**Plugin Types:**
- `scraping`: Custom scraping logic
- `action`: Custom page actions
- `extractor`: Data extraction methods
- `filter`: Data filtering and validation
- `output`: Custom output formats
- `parser`: Data parsing and transformation

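A loader for plugins of this shape typically validates the plugin's `name` and `type` and stores it in a registry keyed by name. The sketch below illustrates that idea; `PluginRegistry` and its methods are hypothetical, not part of the published API:

```javascript
// Hypothetical sketch of a plugin registry; the names here are
// illustrative, not the package's actual internals.
const PLUGIN_TYPES = new Set([
  'scraping', 'action', 'extractor', 'filter', 'output', 'parser'
])

class PluginRegistry {
  constructor() {
    this.plugins = new Map()
  }

  register(plugin) {
    if (!plugin.name) throw new Error('plugin must have a name')
    if (!PLUGIN_TYPES.has(plugin.type)) {
      throw new Error(`unknown plugin type: ${plugin.type}`)
    }
    this.plugins.set(plugin.name, plugin)
  }

  // Run a registered plugin against a page-like object.
  async run(name, page, options = {}) {
    const plugin = this.plugins.get(name)
    if (!plugin) throw new Error(`plugin not loaded: ${name}`)
    return plugin.execute(page, options)
  }
}
```

Keying by `name` means re-registering a plugin replaces the previous version, which is convenient during development.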
### Distributed Scraping

Scale scraping across multiple workers.

```javascript
import { DistributedEngine } from '@firekid/scraper'

const engine = new DistributedEngine({
  workers: 5,
  queueSize: 100,
  retries: 3
})

await engine.init()

engine.addTask({
  id: 'task-1',
  url: 'https://example.com',
  mode: 'scrape',
  options: {
    selectors: { title: 'h1' }
  },
  priority: 10
})

engine.on('taskComplete', (result) => {
  console.log('Task completed:', result)
})

engine.on('taskFailed', (error) => {
  console.error('Task failed:', error)
})

await engine.start()
```

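Since tasks carry a `priority` field, the queue presumably dispatches higher-priority tasks first. A minimal sketch of such a priority queue (illustrative only; the engine's real data structure is not documented here):

```javascript
// Illustrative priority queue: higher `priority` values dequeue first.
// A sketch of the dispatch order, not the engine's actual implementation.
class TaskQueue {
  constructor() {
    this.tasks = []
  }

  add(task) {
    this.tasks.push(task)
    // Keep the array sorted descending by priority. Array sort is stable,
    // so tasks with equal priority stay in FIFO order.
    this.tasks.sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0))
  }

  next() {
    return this.tasks.shift() // highest-priority task, or undefined if empty
  }

  get size() {
    return this.tasks.length
  }
}
```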
## Command Reference

Commands use a simple line-based syntax for browser automation.

### Navigation Commands

`GOTO url`: Navigate to URL.
Example: `GOTO https://example.com`

`BACK`: Go back in history.

`FORWARD`: Go forward in history.

`REFRESH`: Reload current page.

### Interaction Commands

`CLICK selector`: Click element.
Example: `CLICK button.submit`

`TYPE selector text`: Type text into input.
Example: `TYPE input[name="search"] laptop`

`PRESS key`: Press keyboard key.
Example: `PRESS Enter`

`SELECT selector value`: Select dropdown option.
Example: `SELECT select[name="country"] US`

`CHECK selector`: Check checkbox.
Example: `CHECK input[type="checkbox"]`

`UPLOAD selector filepath`: Upload file.
Example: `UPLOAD input[type="file"] ./document.pdf`

### Wait Commands

`WAIT selector`: Wait for element to appear.
Example: `WAIT .product-list`

`WAITLOAD`: Wait for page load.

### Scroll Commands

`SCROLL selector`: Scroll element into view.
Example: `SCROLL .footer`

`SCROLLDOWN pixels`: Scroll down by pixels.
Example: `SCROLLDOWN 500`

### Extraction Commands

`SCAN`: Analyze page structure.

`EXTRACT selector type AS variable`: Extract data.
Types: text, html, attr:name, href, src
Example: `EXTRACT h1 text AS title`

`SCREENSHOT filename`: Take screenshot.
Example: `SCREENSHOT page.png`

### Advanced Commands

`PAGINATE selector`: Auto-paginate through results.
Example: `PAGINATE .next-page`

`INFINITESCROLL count`: Scroll and load more items.
Example: `INFINITESCROLL 10`

`FETCH url`: Fetch URL with smart referer.
Example: `FETCH https://api.example.com/data`

`DOWNLOAD url`: Download file.
Example: `DOWNLOAD https://example.com/file.pdf`

`REFERER url`: Set custom referer.
Example: `REFERER https://google.com`

`BYPASS_CLOUDFLARE`: Attempt Cloudflare bypass.

### Flow Control

`REPEAT selector`: Loop over matching elements.

```
REPEAT .product
  EXTRACT .title text AS titles
  EXTRACT .price text AS prices
```

`IF selector`: Conditional execution.

```
IF .login-button
  CLICK .login-button
  TYPE input[name="username"] admin
```

`LOOP count`: Repeat commands N times.

```
LOOP 5
  SCROLLDOWN 300
  WAIT 1000
```

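A script in this syntax can be tokenized line by line: the first word is the command, the remaining words are its arguments, and `EXTRACT`'s trailing `AS name` names the output variable. A minimal parser sketch under those assumptions (not the package's actual parser):

```javascript
// Minimal sketch of a parser for the command syntax above.
// Illustrative only; the package's real parser may differ.
function parseScript(script) {
  return script
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => {
      const parts = line.split(/\s+/)
      const command = parts[0].toUpperCase()
      let args = parts.slice(1)
      let saveAs = null
      // EXTRACT <selector> <type> AS <variable>
      const asIndex = args.findIndex(a => a.toUpperCase() === 'AS')
      if (command === 'EXTRACT' && asIndex !== -1) {
        saveAs = args[asIndex + 1]
        args = args.slice(0, asIndex)
      }
      return { command, args, saveAs }
    })
}
```

For example, `parseScript('GOTO https://example.com\nEXTRACT h1 text AS title')` yields one `GOTO` entry and one `EXTRACT` entry whose `saveAs` is `title`.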
## Configuration

### Environment Variables

- `HEADLESS`: Run in headless mode (true/false)
- `MAX_QUEUE_WORKERS`: Maximum concurrent workers (number)
- `BROWSER_TIMEOUT`: Browser timeout in ms (number)
- `CF_BYPASS`: Cloudflare bypass mode (auto/manual/skip)
- `TURNSTILE_SOLVER`: Turnstile solver (skip/manual/2captcha/capsolver)
- `CAPTCHA_API_KEY`: API key for captcha solver
- `API_ENABLED`: Enable web API (true/false)
- `API_PORT`: API server port (number)
- `API_KEY`: API authentication key
- `PROXY_ENABLED`: Enable proxy (true/false)
- `PROXY_URL`: Proxy URL
- `DATA_DIR`: Data storage directory
- `PATTERNS_DB`: Pattern cache database path
- `SESSIONS_DB`: Session storage database path
- `LOG_LEVEL`: Logging level (error/warn/info/debug)
- `RECORD_SCREENSHOTS`: Record screenshots (true/false)
- `RATE_LIMIT_ENABLED`: Enable rate limiting (true/false)
- `RATE_LIMIT_MAX`: Max requests per window (number)
- `RATE_LIMIT_WINDOW`: Rate limit window in ms (number)

### Configuration File

Create a `.env` file in the project root:

```env
HEADLESS=true
MAX_QUEUE_WORKERS=5
BROWSER_TIMEOUT=30000
CF_BYPASS=auto
LOG_LEVEL=info
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX=100
RATE_LIMIT_WINDOW=3600000
```

## Advanced Usage

### Custom Behavioral Profiles

```javascript
const scraper = new FirekidScraper()
await scraper.init()

await scraper.setProfile('human')

await scraper.goto('https://example.com')
```

Available profiles:
- `fast`: 30-60ms delays, minimal randomization
- `normal`: 80-120ms delays, moderate randomization
- `careful`: 120-180ms delays, high randomization
- `human`: 50-150ms delays, natural patterns

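Profile-driven delays like these usually boil down to sampling a pause from the profile's range before each action. A sketch of that idea, with ranges mirroring the list above (the helper itself is hypothetical, not the package's internal API):

```javascript
// Hypothetical sketch: sample an inter-action delay from a profile's range.
// Ranges mirror the profile list above.
const PROFILES = {
  fast:    { min: 30,  max: 60 },
  normal:  { min: 80,  max: 120 },
  careful: { min: 120, max: 180 },
  human:   { min: 50,  max: 150 }
}

function nextDelay(profileName) {
  const p = PROFILES[profileName]
  if (!p) throw new Error(`unknown profile: ${profileName}`)
  // Uniform sample in [min, max]; a real implementation might use a
  // skewed distribution to look more natural.
  return p.min + Math.random() * (p.max - p.min)
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Usage: await sleep(nextDelay('human')) before each click or keystroke.
```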
### Pattern Learning

```javascript
const scraper = new FirekidScraper()
await scraper.init()

await scraper.goto('https://example.com/products')

const pattern = await scraper.learnPattern('product', {
  containerSelector: '.product-card',
  fields: ['title', 'price', 'image']
})

const products = await scraper.applyPattern('product')
```

### Self-Healing Selectors

```javascript
const scraper = new FirekidScraper()
await scraper.init()

const healer = scraper.getHealer()

const element = await healer.find('.old-selector', {
  strategies: ['id', 'className', 'text', 'position'],
  savePattern: true
})
```

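The `strategies` array suggests the healer tries each lookup strategy in order until one yields an element. A generic sketch of that fallback chain (the strategy functions here are stand-ins, not the package's seven real strategies):

```javascript
// Generic fallback chain: try each named strategy in order until one
// returns an element. The lookups map holds the strategy implementations;
// everything here is an illustrative stand-in.
async function healFind(selector, strategies, lookups) {
  for (const name of strategies) {
    const lookup = lookups[name]
    if (!lookup) continue
    try {
      const element = await lookup(selector)
      if (element) return { element, strategy: name }
    } catch {
      // A failing strategy just falls through to the next one.
    }
  }
  return null // every strategy missed
}
```

Returning the winning `strategy` name alongside the element is what makes `savePattern`-style caching possible: the next lookup can start from the strategy that worked last time.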
### Webhook Integration

```javascript
const scraper = new FirekidScraper({
  webhook: {
    url: 'https://your-api.com/webhook',
    events: ['scrapeComplete', 'error']
  }
})

await scraper.init()
await scraper.scrape('https://example.com')
```

### Database Export

```javascript
const scraper = new FirekidScraper()
await scraper.init()

const data = await scraper.scrape('https://example.com', {
  selectors: { title: 'h1' }
})

await scraper.exportToDatabase(data, {
  type: 'postgresql',
  connection: {
    host: 'localhost',
    database: 'scraping',
    user: 'user',
    password: 'pass'
  },
  table: 'products'
})
```

### Scheduled Tasks

```javascript
import { FirekidScraper, TaskScheduler } from '@firekid/scraper'

const scheduler = new TaskScheduler()

// '0 0 * * *' = every day at midnight (standard cron syntax)
scheduler.schedule('daily-scrape', '0 0 * * *', async () => {
  const scraper = new FirekidScraper()
  await scraper.init()
  await scraper.scrape('https://example.com')
  await scraper.close()
})
```

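The schedule string uses standard five-field cron syntax (minute, hour, day-of-month, month, day-of-week). A simplified matcher for the subset used above, handling only `*` and plain numbers (real cron also supports ranges, lists, and steps):

```javascript
// Simplified cron matcher: supports `*` and plain numbers in the five
// standard fields (minute hour day-of-month month day-of-week).
// Illustrative; a scheduler would also need ranges, lists, and steps.
function cronMatches(expr, date) {
  const fields = expr.trim().split(/\s+/)
  if (fields.length !== 5) throw new Error('expected 5 cron fields')
  const values = [
    date.getMinutes(),
    date.getHours(),
    date.getDate(),
    date.getMonth() + 1, // cron months are 1-12
    date.getDay()        // 0 = Sunday
  ]
  return fields.every((field, i) =>
    field === '*' || Number(field) === values[i]
  )
}
```

A scheduler can poll this once per minute and fire the task whenever the current time matches.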
## Examples

### Product Scraper

```javascript
import { FirekidScraper } from '@firekid/scraper'

const scraper = new FirekidScraper({ headless: true })
await scraper.init()

const products = await scraper.paginate('https://store.example.com/products', '.next-page', {
  maxPages: 10,
  selectors: {
    title: '.product-title',
    price: '.product-price',
    image: 'img.product-image',
    rating: '.product-rating'
  }
})

await scraper.export(products, 'json', './products.json')
await scraper.close()
```

### Login and Scrape

```javascript
const scraper = new FirekidScraper()
await scraper.init()

await scraper.runCommands(`
  GOTO https://example.com/login
  TYPE input[name="username"] myuser
  TYPE input[name="password"] mypass
  CLICK button[type="submit"]
  WAITLOAD
  GOTO https://example.com/dashboard
  EXTRACT .data-table text AS tableData
`)

await scraper.close()
```

### Infinite Scroll

```javascript
const scraper = new FirekidScraper()
await scraper.init()

const items = await scraper.infiniteScroll('https://example.com/feed', {
  maxScrolls: 20,
  itemSelector: '.feed-item',
  scrollDelay: 1000,
  extractFields: {
    content: '.feed-content',
    author: '.feed-author',
    timestamp: '.feed-time'
  }
})

await scraper.close()
```

### API Hunting

```javascript
import { APIHunter } from '@firekid/scraper'

const hunter = new APIHunter()
await hunter.init()

const apis = await hunter.hunt('https://example.com', {
  captureXHR: true,
  captureFetch: true,
  captureWebSocket: true
})

console.log('Discovered APIs:', apis)
await hunter.close()
```

### Video Download

```javascript
const scraper = new FirekidScraper()
await scraper.init()

await scraper.runCommands(`
  GOTO https://video-site.com/video/123
  WAIT video
  BYPASS_CLOUDFLARE
  DOWNLOAD https://cdn.video-site.com/videos/file.mp4
`)

await scraper.close()
```

## TypeScript Support

Full TypeScript definitions included.

```typescript
import { FirekidScraper, ScraperOptions, ScrapeResult } from '@firekid/scraper'

interface Product {
  title: string
  price: number
  image: string
}

const scraper = new FirekidScraper({
  headless: true,
  bypassCloudflare: true
})

await scraper.init()

const result: ScrapeResult<Product> = await scraper.scrape('https://example.com', {
  selectors: {
    title: 'h1.product-title',
    price: '.price',
    image: 'img.main'
  }
})

await scraper.close()
```

## Docker Usage

### Using Docker Compose

```yaml
version: '3.8'
services:
  scraper:
    image: firekid/scraper:latest
    volumes:
      - ./data:/data
      - ./output:/output
    environment:
      - HEADLESS=true
      - LOG_LEVEL=info
    command: firekid-scraper run ./scripts/scrape.cmd
```

### Custom Dockerfile

```dockerfile
FROM firekid/scraper:latest

COPY ./scripts /app/scripts
COPY ./plugins /app/plugins

WORKDIR /app
CMD ["firekid-scraper", "run", "./scripts/main.cmd"]
```

## Performance Optimization

### Browser Launch Flags

Chromium flags that reduce resource usage, especially in containers:

```javascript
const scraper = new FirekidScraper({
  browserArgs: [
    '--disable-dev-shm-usage',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-gpu'
  ]
})
```

### Resource Blocking

```javascript
await scraper.optimizeRequests({
  blockImages: true,
  blockFonts: true,
  blockMedia: true
})
```

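Under the hood, resource blocking of this kind is typically done by intercepting requests and aborting those whose resource type is in a block set (in Playwright, via `page.route`). A sketch of the filtering decision, with the Playwright wiring shown in comments (the `shouldBlock` helper is hypothetical, not the package's internal code):

```javascript
// Map the blocking options to a predicate over Playwright resource types.
// Hypothetical helper; the package's actual interception logic may differ.
function shouldBlock(resourceType, { blockImages, blockFonts, blockMedia } = {}) {
  const blocked = new Set()
  if (blockImages) blocked.add('image')
  if (blockFonts) blocked.add('font')
  if (blockMedia) blocked.add('media')
  return blocked.has(resourceType)
}

// Wiring it into Playwright would look roughly like:
// await page.route('**/*', route => {
//   shouldBlock(route.request().resourceType(), opts)
//     ? route.abort()
//     : route.continue()
// })
```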
### Parallel Scraping

```javascript
import { DistributedEngine } from '@firekid/scraper'

const engine = new DistributedEngine({ workers: 10 })
await engine.init()

const urls = ['url1', 'url2', 'url3']
urls.forEach((url, i) => {
  engine.addTask({
    id: `task-${i}`,
    url,
    mode: 'scrape',
    priority: 10
  })
})

await engine.start()
```

## Troubleshooting

### Cloudflare Challenges

If automatic bypass fails, enable manual solving:

```javascript
const scraper = new FirekidScraper({
  bypassCloudflare: true,
  cloudflareMode: 'manual'
})
```

The browser then opens in headed mode so you can solve the challenge by hand.

### Memory Issues

Reduce memory usage by limiting concurrent workers and shortening timeouts:

```javascript
const scraper = new FirekidScraper({
  maxWorkers: 3,
  timeout: 15000
})
```

### Rate Limiting

Throttle requests to avoid tripping server-side limits:

```javascript
const scraper = new FirekidScraper({
  rateLimit: {
    enabled: true,
    max: 10,       // at most 10 requests...
    window: 60000  // ...per 60-second window
  }
})
```

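The `max`/`window` pair describes sliding-window semantics: a request is allowed only if fewer than `max` requests happened in the last `window` milliseconds. A self-contained sketch of that behavior (illustrative, not the package's internal limiter):

```javascript
// Sliding-window rate limiter sketch: allow at most `max` requests per
// `window` milliseconds. Illustrative only.
class RateLimiter {
  constructor({ max, window }) {
    this.max = max
    this.window = window
    this.timestamps = []
  }

  // Returns true if the request is allowed and records it; false otherwise.
  tryAcquire(now = Date.now()) {
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter(t => now - t < this.window)
    if (this.timestamps.length >= this.max) return false
    this.timestamps.push(now)
    return true
  }
}
```

A caller that gets `false` back would typically sleep and retry rather than drop the request.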
### Selector Not Found

Enable self-healing selectors:

```javascript
const element = await scraper.healSelector('.old-selector', {
  savePattern: true,
  strategies: ['id', 'className', 'text']
})
```

## Contributing

Contributions are welcome. Please read the contributing guidelines before submitting pull requests.

## License

MIT License. See LICENSE file for details.

## Support

For issues and questions:
- GitHub Repository: https://github.com/Firekid-is-him/firekid-scraper-sdk
- GitHub Issues: Report bugs and request features
- Documentation: Complete guides in the docs folder
- Examples: Sample scripts in the examples folder

## Changelog

See CHANGELOG.md for version history and release notes.