@jambudipa/spider 0.1.0 → 0.2.1
- package/README.md +117 -69
- package/dist/index.js +835 -114
- package/dist/index.js.map +1 -1
- package/package.json +12 -7
- package/dist/index.d.ts +0 -33
- package/dist/index.d.ts.map +0 -1
- package/dist/lib/BrowserEngine/BrowserEngine.service.d.ts +0 -57
- package/dist/lib/BrowserEngine/BrowserEngine.service.d.ts.map +0 -1
- package/dist/lib/Config/SpiderConfig.service.d.ts +0 -256
- package/dist/lib/Config/SpiderConfig.service.d.ts.map +0 -1
- package/dist/lib/HttpClient/CookieManager.d.ts +0 -44
- package/dist/lib/HttpClient/CookieManager.d.ts.map +0 -1
- package/dist/lib/HttpClient/EnhancedHttpClient.d.ts +0 -88
- package/dist/lib/HttpClient/EnhancedHttpClient.d.ts.map +0 -1
- package/dist/lib/HttpClient/SessionStore.d.ts +0 -82
- package/dist/lib/HttpClient/SessionStore.d.ts.map +0 -1
- package/dist/lib/HttpClient/TokenExtractor.d.ts +0 -58
- package/dist/lib/HttpClient/TokenExtractor.d.ts.map +0 -1
- package/dist/lib/HttpClient/index.d.ts +0 -8
- package/dist/lib/HttpClient/index.d.ts.map +0 -1
- package/dist/lib/LinkExtractor/LinkExtractor.service.d.ts +0 -166
- package/dist/lib/LinkExtractor/LinkExtractor.service.d.ts.map +0 -1
- package/dist/lib/LinkExtractor/index.d.ts +0 -37
- package/dist/lib/LinkExtractor/index.d.ts.map +0 -1
- package/dist/lib/Logging/FetchLogger.d.ts +0 -8
- package/dist/lib/Logging/FetchLogger.d.ts.map +0 -1
- package/dist/lib/Logging/SpiderLogger.service.d.ts +0 -34
- package/dist/lib/Logging/SpiderLogger.service.d.ts.map +0 -1
- package/dist/lib/Middleware/SpiderMiddleware.d.ts +0 -276
- package/dist/lib/Middleware/SpiderMiddleware.d.ts.map +0 -1
- package/dist/lib/PageData/PageData.d.ts +0 -28
- package/dist/lib/PageData/PageData.d.ts.map +0 -1
- package/dist/lib/Resumability/Resumability.service.d.ts +0 -176
- package/dist/lib/Resumability/Resumability.service.d.ts.map +0 -1
- package/dist/lib/Resumability/backends/FileStorageBackend.d.ts +0 -47
- package/dist/lib/Resumability/backends/FileStorageBackend.d.ts.map +0 -1
- package/dist/lib/Resumability/backends/PostgresStorageBackend.d.ts +0 -95
- package/dist/lib/Resumability/backends/PostgresStorageBackend.d.ts.map +0 -1
- package/dist/lib/Resumability/backends/RedisStorageBackend.d.ts +0 -92
- package/dist/lib/Resumability/backends/RedisStorageBackend.d.ts.map +0 -1
- package/dist/lib/Resumability/index.d.ts +0 -51
- package/dist/lib/Resumability/index.d.ts.map +0 -1
- package/dist/lib/Resumability/strategies.d.ts +0 -76
- package/dist/lib/Resumability/strategies.d.ts.map +0 -1
- package/dist/lib/Resumability/types.d.ts +0 -201
- package/dist/lib/Resumability/types.d.ts.map +0 -1
- package/dist/lib/Robots/Robots.service.d.ts +0 -78
- package/dist/lib/Robots/Robots.service.d.ts.map +0 -1
- package/dist/lib/Scheduler/SpiderScheduler.service.d.ts +0 -211
- package/dist/lib/Scheduler/SpiderScheduler.service.d.ts.map +0 -1
- package/dist/lib/Scraper/Scraper.service.d.ts +0 -123
- package/dist/lib/Scraper/Scraper.service.d.ts.map +0 -1
- package/dist/lib/Spider/Spider.service.d.ts +0 -194
- package/dist/lib/Spider/Spider.service.d.ts.map +0 -1
- package/dist/lib/StateManager/StateManager.service.d.ts +0 -68
- package/dist/lib/StateManager/StateManager.service.d.ts.map +0 -1
- package/dist/lib/StateManager/index.d.ts +0 -5
- package/dist/lib/StateManager/index.d.ts.map +0 -1
- package/dist/lib/UrlDeduplicator/UrlDeduplicator.service.d.ts +0 -58
- package/dist/lib/UrlDeduplicator/UrlDeduplicator.service.d.ts.map +0 -1
- package/dist/lib/WebScrapingEngine/WebScrapingEngine.service.d.ts +0 -77
- package/dist/lib/WebScrapingEngine/WebScrapingEngine.service.d.ts.map +0 -1
- package/dist/lib/WebScrapingEngine/index.d.ts +0 -5
- package/dist/lib/WebScrapingEngine/index.d.ts.map +0 -1
- package/dist/lib/WorkerHealth/WorkerHealthMonitor.service.d.ts +0 -39
- package/dist/lib/WorkerHealth/WorkerHealthMonitor.service.d.ts.map +0 -1
- package/dist/lib/api-facades.d.ts +0 -313
- package/dist/lib/api-facades.d.ts.map +0 -1
- package/dist/lib/errors.d.ts +0 -99
- package/dist/lib/errors.d.ts.map +0 -1
package/README.md
CHANGED
@@ -1,14 +1,57 @@
-# @jambudipa
-
-
+# @jambudipa/spider
+
+[](https://github.com/jambudipa/spider/actions)
+[](https://codecov.io/gh/jambudipa/spider)
+[](https://badge.fury.io/js/@jambudipa%2Fspider)
+[](https://nodejs.org/)
+[](https://opensource.org/licenses/MIT)
+
+A powerful, Effect-based web crawling framework for modern TypeScript applications. Built for type safety, composability, and enterprise-scale crawling operations.
+
+> **⚠️ Pre-Release API**: Spider is currently in pre-release development (v0.x.x). The API may change frequently as we refine the library towards a stable v1.0.0 release. Consider this when using Spider in production environments and expect potential breaking changes in minor version updates.
+
+## 🏆 **Battle-Tested Against Real-World Scenarios**
+
+**Spider successfully handles ALL 16 https://web-scraping.dev challenge scenarios** - the most comprehensive web scraping test suite available:
+
+| ✅ Scenario | Description | Complexity |
+|-------------|-------------|------------|
+| **Static Paging** | Traditional pagination navigation | Basic |
+| **Endless Scroll** | Infinite scroll content loading | Dynamic |
+| **Button Loading** | Dynamic content via button clicks | Dynamic |
+| **GraphQL Requests** | Background API data fetching | Advanced |
+| **Hidden Data** | Extracting non-visible content | Intermediate |
+| **Product Markup** | Structured data extraction | Intermediate |
+| **Local Storage** | Browser storage interaction | Advanced |
+| **Secret API Tokens** | Authentication handling | Security |
+| **CSRF Protection** | Token-based security bypass | Security |
+| **Cookie Authentication** | Session-based access control | Security |
+| **PDF Downloads** | Binary file handling | Special |
+| **Cookie Popups** | Modal interaction handling | Special |
+| **New Tab Links** | Multi-tab navigation | Special |
+| **Block Pages** | Anti-bot detection handling | Anti-Block |
+| **Invalid Referer Blocking** | Header-based access control | Anti-Block |
+| **Persistent Cookie Blocking** | Long-term blocking mechanisms | Anti-Block |
+
+🎯 **[View Live Test Results](https://github.com/jambudipa/spider/actions/workflows/ci.yml)** | 📊 **All Scenario Tests Passing** | 🚀 **Production Ready**
+
+> **Live Testing**: Our CI pipeline runs all 16 web scraping scenarios against real websites daily, ensuring Spider remains robust against changing web technologies.
+
+### 🔍 **Current Status** (Updated: Aug 2025)
+- ✅ **Core Functionality**: All web scraping scenarios working
+- ✅ **Type Safety**: Full TypeScript compilation without errors
+- ✅ **Build System**: Package builds successfully for distribution
+- ✅ **Test Suite**: 92+ scenario tests passing against live websites
+- ⚠️ **Code Quality**: 1,163 linting issues identified (technical debt - does not affect functionality)
 
 ## ✨ Key Features
 
-- **🔥 Effect
+- **🔥 Effect Foundation**: Type-safe, functional composition with robust error handling
 - **⚡ High Performance**: Concurrent crawling with intelligent worker pool management
 - **🤖 Robots.txt Compliant**: Automatic robots.txt parsing and compliance checking
 - **🔄 Resumable Crawls**: State persistence and crash recovery capabilities
-- **🛡️
+- **🛡️ Anti-Bot Bypass**: Handles complex blocking mechanisms and security measures
+- **🌐 Browser Automation**: Playwright integration for JavaScript-heavy sites
 - **📊 Built-in Monitoring**: Comprehensive logging and performance monitoring
 - **🎯 TypeScript First**: Full type safety with excellent IntelliSense support
 
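The Key Features list in this hunk mentions robots.txt parsing and compliance checking. As a rough illustration of what such a check involves, here is a minimal sketch in plain TypeScript. It is not taken from `@jambudipa/spider`: the names `RobotsRule`, `parseRobotsTxt`, and `isAllowed` are hypothetical, wildcard patterns (`*`, `$`) are not handled, and rules from the wildcard group and any matching named group are simply merged, which is a simplification of the real protocol.

```typescript
// Hypothetical helper types and functions; not part of @jambudipa/spider.
interface RobotsRule {
  allow: boolean;
  path: string;
}

// Collect Allow/Disallow rules from the groups that apply to `userAgent`
// (case-insensitive substring match) or to the wildcard agent "*".
function parseRobotsTxt(text: string, userAgent: string): RobotsRule[] {
  const rules: RobotsRule[] = [];
  const agent = userAgent.toLowerCase();
  let groupApplies = false;
  let readingAgents = false; // true while consecutive User-agent lines are read

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === 'user-agent') {
      const matches =
        value === '*' || (value !== '' && agent.includes(value.toLowerCase()));
      // Consecutive User-agent lines share one group, so their matches are OR-ed.
      groupApplies = readingAgents ? groupApplies || matches : matches;
      readingAgents = true;
    } else if (field === 'allow' || field === 'disallow') {
      readingAgents = false;
      // An empty value carries no path restriction, so it is skipped.
      if (groupApplies && value !== '') {
        rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return rules;
}

// The longest matching path prefix wins; on a tie, Allow wins.
// A path that matches no rule is allowed.
function isAllowed(rules: RobotsRule[], path: string): boolean {
  let best: RobotsRule | undefined;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    if (
      best === undefined ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best === undefined ? true : best.allow;
}

const sampleRobots = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public-report.html',
].join('\n');

const sampleRules = parseRobotsTxt(sampleRobots, 'ExampleBot/1.0');
console.log(isAllowed(sampleRules, '/private/data.json')); // false
console.log(isAllowed(sampleRules, '/private/public-report.html')); // true
console.log(isAllowed(sampleRules, '/index.html')); // true
```

The longest-prefix rule is what lets a narrow `Allow` carve an exception out of a broader `Disallow`. A production crawler would also need wildcard matching and caching of the fetched file.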
@@ -45,22 +88,30 @@ Effect.runPromise(program.pipe(
 ))
 ```
 
-##
+## 📚 Documentation
+
+**Comprehensive documentation is now available** following the [Diátaxis framework](https://diataxis.fr/) for better learning and reference:
+
+### 🎓 New to Spider?
+Start with our **[Tutorial](./docs/tutorial/getting-started.md)** - a hands-on guide that takes you from installation to building advanced scrapers.
+
+### 📋 Need to solve a specific problem?
+Check our **[How-to Guides](./docs/how-to/)** for targeted solutions:
+- **[Authentication](./docs/how-to/authentication.md)** - Handle logins, sessions, and auth flows
+- **[Data Extraction](./docs/how-to/data-extraction.md)** - Extract structured data from HTML
+- **[Resumable Operations](./docs/how-to/resumable-operations.md)** - Build fault-tolerant crawlers
 
-###
-
-- **[
-- **[
+### 📚 Need technical details?
+See our **[Reference Documentation](./docs/reference/)**:
+- **[API Reference](./docs/reference/api-reference.md)** - Complete API documentation
+- **[Configuration](./docs/reference/configuration.md)** - All configuration options
 
-###
-
-- **[
-- **[
+### 🧠 Want to understand the design?
+Read our **[Explanations](./docs/explanation/)**:
+- **[Architecture](./docs/explanation/architecture.md)** - System design and philosophy
+- **[Web Scraping Concepts](./docs/explanation/web-scraping-concepts.md)** - Core principles
 
-
-- **[Enterprise Patterns](./docs/examples/enterprise-patterns.md)** - Production-ready crawling solutions
-- **[Monitoring Guide](./docs/features/monitoring.md)** - Set up observability and alerting
-- **[API Reference](./docs/api/)** - Complete technical documentation
+**📖 [Browse All Documentation →](./docs/README.md)**
 
 ## 🛠️ Quick Configuration
 
@@ -83,7 +134,7 @@ const config = makeSpiderConfig({
 The spider can be configured for different scraping scenarios:
 
 ```typescript
-import { makeSpiderConfig } from '@jambudipa
+import { makeSpiderConfig } from '@jambudipa/spider';
 
 const config = makeSpiderConfig({
 // Basic settings
@@ -118,7 +169,7 @@ import {
 LoggingMiddleware,
 RateLimitMiddleware,
 UserAgentMiddleware
-} from '@jambudipa
+} from '@jambudipa/spider';
 
 const middlewares = new MiddlewareManager()
 .use(new LoggingMiddleware({ level: 'info' }))
@@ -142,7 +193,7 @@ import {
 SpiderService,
 ResumabilityService,
 FileStorageBackend
-} from '@jambudipa
+} from '@jambudipa/spider';
 import { Effect, Layer } from 'effect';
 
 // Configure resumability with file storage
@@ -191,7 +242,7 @@ const program = Effect.gen(function* () {
 Extract and process links from pages:
 
 ```typescript
-import { LinkExtractorService } from '@jambudipa
+import { LinkExtractorService } from '@jambudipa/spider';
 
 const program = Effect.gen(function* () {
 const linkExtractor = yield* LinkExtractorService;
@@ -260,7 +311,7 @@ const program = Effect.gen(function* () {
 The library uses Effect for comprehensive error handling:
 
 ```typescript
-import { NetworkError, ResponseError, RobotsTxtError } from '@jambudipa
+import { NetworkError, ResponseError, RobotsTxtError } from '@jambudipa/spider';
 
 const program = Effect.gen(function* () {
 const spider = yield* SpiderService;
@@ -295,7 +346,7 @@ const program = Effect.gen(function* () {
 Create custom middleware for specific needs:
 
 ```typescript
-import { SpiderMiddleware, SpiderRequest, SpiderResponse } from '@jambudipa
+import { SpiderMiddleware, SpiderRequest, SpiderResponse } from '@jambudipa/spider';
 import { Effect } from 'effect';
 
 class CustomAuthMiddleware implements SpiderMiddleware {
@@ -326,7 +377,7 @@ const middlewares = new MiddlewareManager()
 Monitor scraping performance:
 
 ```typescript
-import { WorkerHealthMonitorService } from '@jambudipa
+import { WorkerHealthMonitorService } from '@jambudipa/spider';
 
 const program = Effect.gen(function* () {
 const healthMonitor = yield* WorkerHealthMonitorService;
@@ -347,18 +398,6 @@ const program = Effect.gen(function* () {
 });
 ```
 
-## Contributing
-
-1. Fork the repository
-2. Create a feature branch: `git checkout -b feature/new-feature`
-3. Make your changes
-4. Add tests for new functionality
-5. Run tests: `npm test`
-6. Run linting: `npm run lint`
-7. Commit changes: `git commit -am 'Add new feature'`
-8. Push to branch: `git push origin feature/new-feature`
-9. Submit a pull request
-
 ## Development
 
 ```bash
@@ -368,59 +407,68 @@ npm install
 # Build the package
 npm run build
 
-# Run tests
+# Run tests (all scenarios)
 npm test
 
 # Run tests with coverage
 npm run test:coverage
 
-# Type checking
+# Type checking (must pass)
 npm run typecheck
 
-#
+# Validate CI setup locally
+npm run ci:validate
+
+# Code quality (has known issues)
+npm run lint # Shows 1,163 issues
+npm run format # Formats code consistently
+```
+
+### 🛠️ Contributing & Code Quality
+
+**Current State**: The codebase is fully functional with comprehensive test coverage, but has technical debt in code style consistency.
+
+- ✅ **Functional Changes**: All PRs must pass scenario tests
+- ✅ **Type Safety**: TypeScript compilation must succeed
+- ✅ **Build System**: Package must build without errors
+- 🔄 **Code Style**: Help wanted fixing linting issues (great first contribution!)
+
+**Contributing to Code Quality**:
+```bash
+# See specific linting issues
 npm run lint
 
-#
-npm run
+# Fix auto-fixable issues
+npm run lint:fix
+
+# Focus areas for improvement:
+# - Unused variable cleanup (877 issues)
+# - Return type annotations (286 issues)
+# - Nullish coalescing operators
+# - Console.log removal in production code
 ```
 
 ## License
 
 MIT License - see [LICENSE](LICENSE) file for details.
 
-##
-
-### 1.0.0
-- Initial standalone release
-- Migrated from monorepo structure
-- Full TypeScript support
-- Comprehensive middleware system
-- Resumable scraping functionality
-- Multiple storage backends
-- Rate limiting and performance monitoring
-
-## 📚 Documentation
+## 📚 Complete Documentation
 
-
+All documentation is organized in the [`/docs`](./docs/) directory following the [Diátaxis framework](https://diataxis.fr/):
 
-
--
--
--
-- **[Examples](./docs/examples/)** - Working examples for common use cases
+- **🎓 [Tutorial](./docs/tutorial/)** - Learning-oriented lessons for getting started
+- **📋 [How-to Guides](./docs/how-to/)** - Problem-solving guides for specific tasks
+- **📚 [Reference](./docs/reference/)** - Technical reference and API documentation
+- **🧠 [Explanation](./docs/explanation/)** - Understanding-oriented documentation
 
-
-- **[Documentation Index](./docs/README.md)** - Overview of all available documentation
-- **[User Guides](./docs/guides/)** - Step-by-step tutorials and best practices
-- **[Feature Documentation](./docs/features/)** - Deep dives into key capabilities
-- **[Advanced Examples](./docs/examples/)** - Real-world usage patterns
+**📖 [Start with the Documentation Index →](./docs/README.md)**
 
 ## Support
 
-- [GitHub Issues](https://github.com/jambudipa
-- [
-- [
+- [GitHub Issues](https://github.com/jambudipa/spider/issues) - Bug reports and feature requests
+- [Documentation](./docs/) - Comprehensive guides and reference material
+- [Tutorial](./docs/tutorial/getting-started.md) - Step-by-step learning guide
 
 ---
 
-Built with ❤️ by [
+Built with ❤️ by [JAMBUDIPA](https://jambudipa.io)
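The new "Contributing & Code Quality" section lists return type annotations and nullish coalescing operators among the lint focus areas. The short sketch below shows, under stated assumptions, why the second one matters in practice. The `CrawlOptions` interface, its `maxDepth` field, and both functions are hypothetical and exist only for this illustration; they are not taken from the package.

```typescript
// Hypothetical option shape, used only for illustration.
interface CrawlOptions {
  maxDepth?: number;
}

// With `||`, a legitimate falsy value such as 0 is replaced by the default.
// The explicit `: number` return type is the kind of annotation the lint list asks for.
function depthWithOr(options: CrawlOptions): number {
  return options.maxDepth || 3;
}

// With `??`, only null and undefined fall back to the default, so 0 is kept.
function depthWithNullish(options: CrawlOptions): number {
  return options.maxDepth ?? 3;
}

console.log(depthWithOr({ maxDepth: 0 })); // 3
console.log(depthWithNullish({ maxDepth: 0 })); // 0
console.log(depthWithNullish({})); // 3
```

When a setting can validly be `0`, an empty string, or `false`, `??` keeps the caller's value where `||` would silently discard it.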