@jambudipa/spider 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70)
  1. package/README.md +117 -69
  2. package/dist/index.js +835 -114
  3. package/dist/index.js.map +1 -1
  4. package/package.json +12 -7
  5. package/dist/index.d.ts +0 -33
  6. package/dist/index.d.ts.map +0 -1
  7. package/dist/lib/BrowserEngine/BrowserEngine.service.d.ts +0 -57
  8. package/dist/lib/BrowserEngine/BrowserEngine.service.d.ts.map +0 -1
  9. package/dist/lib/Config/SpiderConfig.service.d.ts +0 -256
  10. package/dist/lib/Config/SpiderConfig.service.d.ts.map +0 -1
  11. package/dist/lib/HttpClient/CookieManager.d.ts +0 -44
  12. package/dist/lib/HttpClient/CookieManager.d.ts.map +0 -1
  13. package/dist/lib/HttpClient/EnhancedHttpClient.d.ts +0 -88
  14. package/dist/lib/HttpClient/EnhancedHttpClient.d.ts.map +0 -1
  15. package/dist/lib/HttpClient/SessionStore.d.ts +0 -82
  16. package/dist/lib/HttpClient/SessionStore.d.ts.map +0 -1
  17. package/dist/lib/HttpClient/TokenExtractor.d.ts +0 -58
  18. package/dist/lib/HttpClient/TokenExtractor.d.ts.map +0 -1
  19. package/dist/lib/HttpClient/index.d.ts +0 -8
  20. package/dist/lib/HttpClient/index.d.ts.map +0 -1
  21. package/dist/lib/LinkExtractor/LinkExtractor.service.d.ts +0 -166
  22. package/dist/lib/LinkExtractor/LinkExtractor.service.d.ts.map +0 -1
  23. package/dist/lib/LinkExtractor/index.d.ts +0 -37
  24. package/dist/lib/LinkExtractor/index.d.ts.map +0 -1
  25. package/dist/lib/Logging/FetchLogger.d.ts +0 -8
  26. package/dist/lib/Logging/FetchLogger.d.ts.map +0 -1
  27. package/dist/lib/Logging/SpiderLogger.service.d.ts +0 -34
  28. package/dist/lib/Logging/SpiderLogger.service.d.ts.map +0 -1
  29. package/dist/lib/Middleware/SpiderMiddleware.d.ts +0 -276
  30. package/dist/lib/Middleware/SpiderMiddleware.d.ts.map +0 -1
  31. package/dist/lib/PageData/PageData.d.ts +0 -28
  32. package/dist/lib/PageData/PageData.d.ts.map +0 -1
  33. package/dist/lib/Resumability/Resumability.service.d.ts +0 -176
  34. package/dist/lib/Resumability/Resumability.service.d.ts.map +0 -1
  35. package/dist/lib/Resumability/backends/FileStorageBackend.d.ts +0 -47
  36. package/dist/lib/Resumability/backends/FileStorageBackend.d.ts.map +0 -1
  37. package/dist/lib/Resumability/backends/PostgresStorageBackend.d.ts +0 -95
  38. package/dist/lib/Resumability/backends/PostgresStorageBackend.d.ts.map +0 -1
  39. package/dist/lib/Resumability/backends/RedisStorageBackend.d.ts +0 -92
  40. package/dist/lib/Resumability/backends/RedisStorageBackend.d.ts.map +0 -1
  41. package/dist/lib/Resumability/index.d.ts +0 -51
  42. package/dist/lib/Resumability/index.d.ts.map +0 -1
  43. package/dist/lib/Resumability/strategies.d.ts +0 -76
  44. package/dist/lib/Resumability/strategies.d.ts.map +0 -1
  45. package/dist/lib/Resumability/types.d.ts +0 -201
  46. package/dist/lib/Resumability/types.d.ts.map +0 -1
  47. package/dist/lib/Robots/Robots.service.d.ts +0 -78
  48. package/dist/lib/Robots/Robots.service.d.ts.map +0 -1
  49. package/dist/lib/Scheduler/SpiderScheduler.service.d.ts +0 -211
  50. package/dist/lib/Scheduler/SpiderScheduler.service.d.ts.map +0 -1
  51. package/dist/lib/Scraper/Scraper.service.d.ts +0 -123
  52. package/dist/lib/Scraper/Scraper.service.d.ts.map +0 -1
  53. package/dist/lib/Spider/Spider.service.d.ts +0 -194
  54. package/dist/lib/Spider/Spider.service.d.ts.map +0 -1
  55. package/dist/lib/StateManager/StateManager.service.d.ts +0 -68
  56. package/dist/lib/StateManager/StateManager.service.d.ts.map +0 -1
  57. package/dist/lib/StateManager/index.d.ts +0 -5
  58. package/dist/lib/StateManager/index.d.ts.map +0 -1
  59. package/dist/lib/UrlDeduplicator/UrlDeduplicator.service.d.ts +0 -58
  60. package/dist/lib/UrlDeduplicator/UrlDeduplicator.service.d.ts.map +0 -1
  61. package/dist/lib/WebScrapingEngine/WebScrapingEngine.service.d.ts +0 -77
  62. package/dist/lib/WebScrapingEngine/WebScrapingEngine.service.d.ts.map +0 -1
  63. package/dist/lib/WebScrapingEngine/index.d.ts +0 -5
  64. package/dist/lib/WebScrapingEngine/index.d.ts.map +0 -1
  65. package/dist/lib/WorkerHealth/WorkerHealthMonitor.service.d.ts +0 -39
  66. package/dist/lib/WorkerHealth/WorkerHealthMonitor.service.d.ts.map +0 -1
  67. package/dist/lib/api-facades.d.ts +0 -313
  68. package/dist/lib/api-facades.d.ts.map +0 -1
  69. package/dist/lib/errors.d.ts +0 -99
  70. package/dist/lib/errors.d.ts.map +0 -1
package/README.md CHANGED
@@ -1,14 +1,57 @@
- # @jambudipa.io/spider
-
- A powerful, Effect.js-based web crawling framework for modern TypeScript applications. Built for type safety, composability, and enterprise-scale crawling operations.
+ # @jambudipa/spider
+
+ [![CI Status](https://github.com/jambudipa/spider/workflows/Spider%20Scenario%20Tests/badge.svg)](https://github.com/jambudipa/spider/actions)
+ [![Coverage](https://codecov.io/gh/jambudipa/spider/branch/main/graph/badge.svg)](https://codecov.io/gh/jambudipa/spider)
+ [![npm version](https://badge.fury.io/js/@jambudipa%2Fspider.svg)](https://badge.fury.io/js/@jambudipa%2Fspider)
+ [![Node.js Version](https://img.shields.io/node/v/@jambudipa/spider.svg)](https://nodejs.org/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ A powerful, Effect-based web crawling framework for modern TypeScript applications. Built for type safety, composability, and enterprise-scale crawling operations.
+
+ > **⚠️ Pre-Release API**: Spider is currently in pre-release development (v0.x.x). The API may change frequently as we refine the library towards a stable v1.0.0 release. Consider this when using Spider in production environments and expect potential breaking changes in minor version updates.
+
+ ## 🏆 **Battle-Tested Against Real-World Scenarios**
+
+ **Spider successfully handles ALL 16 https://web-scraping.dev challenge scenarios** - the most comprehensive web scraping test suite available:
+
+ | ✅ Scenario | Description | Complexity |
+ |-------------|-------------|------------|
+ | **Static Paging** | Traditional pagination navigation | Basic |
+ | **Endless Scroll** | Infinite scroll content loading | Dynamic |
+ | **Button Loading** | Dynamic content via button clicks | Dynamic |
+ | **GraphQL Requests** | Background API data fetching | Advanced |
+ | **Hidden Data** | Extracting non-visible content | Intermediate |
+ | **Product Markup** | Structured data extraction | Intermediate |
+ | **Local Storage** | Browser storage interaction | Advanced |
+ | **Secret API Tokens** | Authentication handling | Security |
+ | **CSRF Protection** | Token-based security bypass | Security |
+ | **Cookie Authentication** | Session-based access control | Security |
+ | **PDF Downloads** | Binary file handling | Special |
+ | **Cookie Popups** | Modal interaction handling | Special |
+ | **New Tab Links** | Multi-tab navigation | Special |
+ | **Block Pages** | Anti-bot detection handling | Anti-Block |
+ | **Invalid Referer Blocking** | Header-based access control | Anti-Block |
+ | **Persistent Cookie Blocking** | Long-term blocking mechanisms | Anti-Block |
+
+ 🎯 **[View Live Test Results](https://github.com/jambudipa/spider/actions/workflows/ci.yml)** | 📊 **All Scenario Tests Passing** | 🚀 **Production Ready**
+
+ > **Live Testing**: Our CI pipeline runs all 16 web scraping scenarios against real websites daily, ensuring Spider remains robust against changing web technologies.
+
+ ### 🔍 **Current Status** (Updated: Aug 2025)
+ - ✅ **Core Functionality**: All web scraping scenarios working
+ - ✅ **Type Safety**: Full TypeScript compilation without errors
+ - ✅ **Build System**: Package builds successfully for distribution
+ - ✅ **Test Suite**: 92+ scenario tests passing against live websites
+ - ⚠️ **Code Quality**: 1,163 linting issues identified (technical debt - does not affect functionality)
 
  ## ✨ Key Features
 
- - **🔥 Effect.js Foundation**: Type-safe, functional composition with robust error handling
+ - **🔥 Effect Foundation**: Type-safe, functional composition with robust error handling
  - **⚡ High Performance**: Concurrent crawling with intelligent worker pool management
  - **🤖 Robots.txt Compliant**: Automatic robots.txt parsing and compliance checking
  - **🔄 Resumable Crawls**: State persistence and crash recovery capabilities
- - **🛡️ Middleware System**: Extensible middleware for rate limiting, authentication, and custom processing
+ - **🛡️ Anti-Bot Bypass**: Handles complex blocking mechanisms and security measures
+ - **🌐 Browser Automation**: Playwright integration for JavaScript-heavy sites
  - **📊 Built-in Monitoring**: Comprehensive logging and performance monitoring
  - **🎯 TypeScript First**: Full type safety with excellent IntelliSense support
 
@@ -45,22 +88,30 @@ Effect.runPromise(program.pipe(
  ))
  ```
 
- ## 🎯 What's Next?
+ ## 📚 Documentation
+
+ **Comprehensive documentation is now available** following the [Diátaxis framework](https://diataxis.fr/) for better learning and reference:
+
+ ### 🎓 New to Spider?
+ Start with our **[Tutorial](./docs/tutorial/getting-started.md)** - a hands-on guide that takes you from installation to building advanced scrapers.
+
+ ### 📋 Need to solve a specific problem?
+ Check our **[How-to Guides](./docs/how-to/)** for targeted solutions:
+ - **[Authentication](./docs/how-to/authentication.md)** - Handle logins, sessions, and auth flows
+ - **[Data Extraction](./docs/how-to/data-extraction.md)** - Extract structured data from HTML
+ - **[Resumable Operations](./docs/how-to/resumable-operations.md)** - Build fault-tolerant crawlers
 
- ### 🆕 New to Spider?
- - **[Getting Started Guide](./docs/guides/getting-started.md)** - Complete setup and first crawl
- - **[Configuration Guide](./docs/guides/configuration.md)** - Customise Spider for your needs
- - **[Basic Examples](./docs/examples/basic-crawling.md)** - Working examples to get you started
+ ### 📚 Need technical details?
+ See our **[Reference Documentation](./docs/reference/)**:
+ - **[API Reference](./docs/reference/api-reference.md)** - Complete API documentation
+ - **[Configuration](./docs/reference/configuration.md)** - All configuration options
 
- ### 🔄 Migrating from Another Library?
- - **[Migration Guide](./docs/guides/migration.md)** - Move from Puppeteer, Playwright, or Scrapy
- - **[Advanced Patterns](./docs/guides/advanced-patterns.md)** - Implement sophisticated crawling logic
- - **[Performance Guide](./docs/guides/performance.md)** - Optimise for your use case
+ ### 🧠 Want to understand the design?
+ Read our **[Explanations](./docs/explanation/)**:
+ - **[Architecture](./docs/explanation/architecture.md)** - System design and philosophy
+ - **[Web Scraping Concepts](./docs/explanation/web-scraping-concepts.md)** - Core principles
 
- ### 🏭 Building Production Systems?
- - **[Enterprise Patterns](./docs/examples/enterprise-patterns.md)** - Production-ready crawling solutions
- - **[Monitoring Guide](./docs/features/monitoring.md)** - Set up observability and alerting
- - **[API Reference](./docs/api/)** - Complete technical documentation
+ **📖 [Browse All Documentation →](./docs/README.md)**
 
  ## 🛠️ Quick Configuration
 
@@ -83,7 +134,7 @@ const config = makeSpiderConfig({
  The spider can be configured for different scraping scenarios:
 
  ```typescript
- import { makeSpiderConfig } from '@jambudipa.io/spider';
+ import { makeSpiderConfig } from '@jambudipa/spider';
 
  const config = makeSpiderConfig({
  // Basic settings
@@ -118,7 +169,7 @@ import {
  LoggingMiddleware,
  RateLimitMiddleware,
  UserAgentMiddleware
- } from '@jambudipa.io/spider';
+ } from '@jambudipa/spider';
 
  const middlewares = new MiddlewareManager()
  .use(new LoggingMiddleware({ level: 'info' }))
@@ -142,7 +193,7 @@ import {
  SpiderService,
  ResumabilityService,
  FileStorageBackend
- } from '@jambudipa.io/spider';
+ } from '@jambudipa/spider';
  import { Effect, Layer } from 'effect';
 
  // Configure resumability with file storage
@@ -191,7 +242,7 @@ const program = Effect.gen(function* () {
  Extract and process links from pages:
 
  ```typescript
- import { LinkExtractorService } from '@jambudipa.io/spider';
+ import { LinkExtractorService } from '@jambudipa/spider';
 
  const program = Effect.gen(function* () {
  const linkExtractor = yield* LinkExtractorService;
@@ -260,7 +311,7 @@ const program = Effect.gen(function* () {
  The library uses Effect for comprehensive error handling:
 
  ```typescript
- import { NetworkError, ResponseError, RobotsTxtError } from '@jambudipa.io/spider';
+ import { NetworkError, ResponseError, RobotsTxtError } from '@jambudipa/spider';
 
  const program = Effect.gen(function* () {
  const spider = yield* SpiderService;
@@ -295,7 +346,7 @@ const program = Effect.gen(function* () {
  Create custom middleware for specific needs:
 
  ```typescript
- import { SpiderMiddleware, SpiderRequest, SpiderResponse } from '@jambudipa.io/spider';
+ import { SpiderMiddleware, SpiderRequest, SpiderResponse } from '@jambudipa/spider';
  import { Effect } from 'effect';
 
  class CustomAuthMiddleware implements SpiderMiddleware {
@@ -326,7 +377,7 @@ const middlewares = new MiddlewareManager()
  Monitor scraping performance:
 
  ```typescript
- import { WorkerHealthMonitorService } from '@jambudipa.io/spider';
+ import { WorkerHealthMonitorService } from '@jambudipa/spider';
 
  const program = Effect.gen(function* () {
  const healthMonitor = yield* WorkerHealthMonitorService;
@@ -347,18 +398,6 @@ const program = Effect.gen(function* () {
  });
  ```
 
- ## Contributing
-
- 1. Fork the repository
- 2. Create a feature branch: `git checkout -b feature/new-feature`
- 3. Make your changes
- 4. Add tests for new functionality
- 5. Run tests: `npm test`
- 6. Run linting: `npm run lint`
- 7. Commit changes: `git commit -am 'Add new feature'`
- 8. Push to branch: `git push origin feature/new-feature`
- 9. Submit a pull request
-
  ## Development
 
  ```bash
@@ -368,59 +407,68 @@ npm install
  # Build the package
  npm run build
 
- # Run tests
+ # Run tests (all scenarios)
  npm test
 
  # Run tests with coverage
  npm run test:coverage
 
- # Type checking
+ # Type checking (must pass)
  npm run typecheck
 
- # Linting
+ # Validate CI setup locally
+ npm run ci:validate
+
+ # Code quality (has known issues)
+ npm run lint # Shows 1,163 issues
+ npm run format # Formats code consistently
+ ```
+
+ ### 🛠️ Contributing & Code Quality
+
+ **Current State**: The codebase is fully functional with comprehensive test coverage, but has technical debt in code style consistency.
+
+ - ✅ **Functional Changes**: All PRs must pass scenario tests
+ - ✅ **Type Safety**: TypeScript compilation must succeed
+ - ✅ **Build System**: Package must build without errors
+ - 🔄 **Code Style**: Help wanted fixing linting issues (great first contribution!)
+
+ **Contributing to Code Quality**:
+ ```bash
+ # See specific linting issues
  npm run lint
 
- # Format code
- npm run format
+ # Fix auto-fixable issues
+ npm run lint:fix
+
+ # Focus areas for improvement:
+ # - Unused variable cleanup (877 issues)
+ # - Return type annotations (286 issues)
+ # - Nullish coalescing operators
+ # - Console.log removal in production code
  ```
 
  ## License
 
  MIT License - see [LICENSE](LICENSE) file for details.
 
- ## Changelog
-
- ### 1.0.0
- - Initial standalone release
- - Migrated from monorepo structure
- - Full TypeScript support
- - Comprehensive middleware system
- - Resumable scraping functionality
- - Multiple storage backends
- - Rate limiting and performance monitoring
-
- ## 📚 Documentation
+ ## 📚 Complete Documentation
 
- Comprehensive documentation is available in the [`/docs`](./docs) directory:
+ All documentation is organized in the [`/docs`](./docs/) directory following the [Diátaxis framework](https://diataxis.fr/):
 
- ### 🚀 Quick Links
- - **[Getting Started Guide](./docs/guides/getting-started.md)** - Installation, setup, and first crawl
- - **[API Reference](./docs/api/)** - Complete API documentation
- - **[Configuration Guide](./docs/guides/configuration.md)** - Configuration options and patterns
- - **[Examples](./docs/examples/)** - Working examples for common use cases
+ - **🎓 [Tutorial](./docs/tutorial/)** - Learning-oriented lessons for getting started
+ - **📋 [How-to Guides](./docs/how-to/)** - Problem-solving guides for specific tasks
+ - **📚 [Reference](./docs/reference/)** - Technical reference and API documentation
+ - **🧠 [Explanation](./docs/explanation/)** - Understanding-oriented documentation
 
- ### 📖 Complete Documentation
- - **[Documentation Index](./docs/README.md)** - Overview of all available documentation
- - **[User Guides](./docs/guides/)** - Step-by-step tutorials and best practices
- - **[Feature Documentation](./docs/features/)** - Deep dives into key capabilities
- - **[Advanced Examples](./docs/examples/)** - Real-world usage patterns
+ **📖 [Start with the Documentation Index →](./docs/README.md)**
 
  ## Support
 
- - [GitHub Issues](https://github.com/jambudipa-io/spider/issues)
- - [Complete Documentation](./docs/)
- - [Working Examples](./docs/examples/)
+ - [GitHub Issues](https://github.com/jambudipa/spider/issues) - Bug reports and feature requests
+ - [Documentation](./docs/) - Comprehensive guides and reference material
+ - [Tutorial](./docs/tutorial/getting-started.md) - Step-by-step learning guide
 
  ---
 
- Built with ❤️ by [Jambudipa.io](https://jambudipa.io)
+ Built with ❤️ by [JAMBUDIPA](https://jambudipa.io)
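
Note on the recurring import change: most of the README code hunks above differ only in the module specifier, which moves from `@jambudipa.io/spider` to `@jambudipa/spider`. For code copied from the 0.1.0 README, that substitution could be sketched as below. This is a minimal illustration and not part of the package: the helper name `rewriteSpiderImports` is hypothetical, and it assumes the old specifier only ever appears as a quoted string.

```typescript
// Hypothetical helper, not part of @jambudipa/spider.
const NEW_SPECIFIER = '@jambudipa/spider';

// Matches the old specifier only when wrapped in matching single or double quotes.
const OLD_SPECIFIER_PATTERN = /(['"])@jambudipa\.io\/spider\1/g;

function rewriteSpiderImports(source: string): string {
  // Keep whichever quote style the original line used.
  return source.replace(
    OLD_SPECIFIER_PATTERN,
    (_match: string, quote: string) => `${quote}${NEW_SPECIFIER}${quote}`,
  );
}

const before = "import { makeSpiderConfig } from '@jambudipa.io/spider';";
console.log(rewriteSpiderImports(before));
```

Lines that already use the new specifier pass through unchanged, so the rewrite is safe to run more than once over the same file.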