# @tyroneross/blog-scraper

> Powerful web scraping SDK for extracting blog articles and content. No LLM required.

[![npm version](https://img.shields.io/npm/v/@tyroneross/blog-scraper.svg)](https://www.npmjs.com/package/@tyroneross/blog-scraper)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- ✨ **No LLM needed** - Uses Mozilla Readability (92.2% F1 score) for content extraction
- 🎯 **3-tier filtering** - URL patterns → content validation → quality scoring
- ⚡ **Fast** - Extracts articles in 2-5 seconds
- 🔧 **Modular** - Use the high-level API or individual components
- 📦 **Zero config** - Works out of the box
- 🌐 **Multi-source** - RSS feeds, sitemaps, and HTML pages

## Installation

```bash
npm install @tyroneross/blog-scraper
```

## Quick Start

```typescript
import { scrape } from '@tyroneross/blog-scraper';

// Simple usage - scrape a blog
const result = await scrape('https://example.com/blog');

console.log(`Found ${result.articles.length} articles`);
console.log(`Processing time: ${result.processingTime}ms`);

// Access articles
result.articles.forEach(article => {
  console.log(article.title);
  console.log(article.url);
  console.log(article.fullContentMarkdown); // Markdown format
  console.log(article.qualityScore);        // 0-1 quality score
});
```

## API Reference

### High-Level API (Recommended)

#### `scrape(url, options?)`

Extract articles from a URL with automatic source detection.

```typescript
import { scrape } from '@tyroneross/blog-scraper';

const result = await scrape('https://example.com', {
  // Optional configuration
  sourceType: 'auto',       // 'auto' | 'rss' | 'sitemap' | 'html'
  maxArticles: 50,          // Maximum articles to return
  extractFullContent: true, // Extract full article content
  denyPaths: ['/about', '/contact'], // URL patterns to exclude
  qualityThreshold: 0.6     // Minimum quality score (0-1)
});
```

**Returns:**
```typescript
{
  url: string;
  detectedType: 'rss' | 'sitemap' | 'html';
  confidence: 'high' | 'medium' | 'low';
  articles: ScrapedArticle[];
  extractionStats: {
    attempted: number;
    successful: number;
    failed: number;
    filtered: number;
    totalDiscovered: number;
    afterDenyFilter: number;
    afterContentValidation: number;
    afterQualityFilter: number;
  };
  processingTime: number;
  errors: string[];
  timestamp: string;
}
```
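
The `extractionStats` funnel is handy for debugging why a site yields few articles. A minimal sketch using only the fields from the shape above:

```typescript
import { scrape } from '@tyroneross/blog-scraper';

const result = await scrape('https://example.com/blog');
const stats = result.extractionStats;

// Trace the filtering funnel from discovery to the final article list
console.log(`discovered:               ${stats.totalDiscovered}`);
console.log(`after deny patterns:      ${stats.afterDenyFilter}`);
console.log(`after content validation: ${stats.afterContentValidation}`);
console.log(`after quality filter:     ${stats.afterQualityFilter}`);
console.log(`extracted ${stats.successful}/${stats.attempted} (${stats.failed} failed)`);

if (result.errors.length > 0) {
  console.warn('errors:', result.errors);
}
```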

#### `quickScrape(url)`

Fast URL-only extraction (no full content).

```typescript
import { quickScrape } from '@tyroneross/blog-scraper';

const urls = await quickScrape('https://example.com/blog');
// Returns: ['url1', 'url2', 'url3', ...]
```

### Modular API (Advanced)

Use individual components for granular control.

#### Content Extraction

```typescript
import { ContentExtractor } from '@tyroneross/blog-scraper';

const extractor = new ContentExtractor();
const content = await extractor.extractContent('https://example.com/article');

console.log(content.title);
console.log(content.textContent);
console.log(content.wordCount);
console.log(content.readingTime);
```

#### Quality Scoring

```typescript
import { calculateArticleQualityScore, getQualityBreakdown } from '@tyroneross/blog-scraper';

// extractedContent is the result of ContentExtractor.extractContent(...)
const score = calculateArticleQualityScore(extractedContent);
console.log(`Quality score: ${score}`); // 0-1

// Get detailed breakdown
const breakdown = getQualityBreakdown(extractedContent);
console.log(breakdown);
// {
//   contentValidation: 0.6,
//   publishedDate: 0.12,
//   author: 0.08,
//   schema: 0.08,
//   readingTime: 0.12,
//   total: 1.0,
//   passesThreshold: true
// }
```

#### Custom Quality Configuration

```typescript
import { calculateArticleQualityScore } from '@tyroneross/blog-scraper';

const score = calculateArticleQualityScore(content, {
  contentWeight: 0.8,       // Increase content importance
  dateWeight: 0.05,         // Decrease date importance
  authorWeight: 0.05,
  schemaWeight: 0.05,
  readingTimeWeight: 0.05,
  threshold: 0.7            // Stricter threshold
});
```

#### RSS Discovery

```typescript
import { RSSDiscovery } from '@tyroneross/blog-scraper';

const discovery = new RSSDiscovery();
const feeds = await discovery.discoverFeeds('https://example.com');

feeds.forEach(feed => {
  console.log(feed.url);
  console.log(feed.title);
  console.log(feed.confidence); // 0-1
});
```

#### Sitemap Parsing

```typescript
import { SitemapParser } from '@tyroneross/blog-scraper';

const parser = new SitemapParser();
const entries = await parser.parseSitemap('https://example.com/sitemap.xml');

entries.forEach(entry => {
  console.log(entry.url);
  console.log(entry.lastmod);
  console.log(entry.priority);
});
```

#### HTML Scraping

```typescript
import { HTMLScraper } from '@tyroneross/blog-scraper';

const scraper = new HTMLScraper();
const articles = await scraper.extractFromPage('https://example.com/blog', {
  selectors: {
    articleLinks: ['article a', '.post-link'],
    titleSelectors: ['h1', '.post-title'],
    dateSelectors: ['time', '.published-date']
  },
  filters: {
    minTitleLength: 10,
    maxTitleLength: 200
  }
});
```

#### Rate Limiting

```typescript
import { ScrapingRateLimiter } from '@tyroneross/blog-scraper';

// Create custom rate limiter
const limiter = new ScrapingRateLimiter({
  maxConcurrent: 5,
  minTime: 1000 // 1 second between requests
});

// Use in your scraping logic
await limiter.execute('example.com', async () => {
  // Your scraping code here
});
```

#### Circuit Breaker

```typescript
import { CircuitBreaker } from '@tyroneross/blog-scraper';

const breaker = new CircuitBreaker('my-operation', {
  failureThreshold: 5,
  resetTimeout: 60000 // 1 minute
});

const result = await breaker.execute(async () => {
  // Your operation here
});
```

## Examples

### Example 1: Scrape with Custom Deny Patterns

```typescript
import { scrape } from '@tyroneross/blog-scraper';

const result = await scrape('https://techcrunch.com', {
  denyPaths: [
    '/',
    '/about',
    '/contact',
    '/tag/*',    // Exclude all tag pages
    '/author/*'  // Exclude all author pages
  ],
  maxArticles: 20
});
```

### Example 2: Build a Custom Pipeline

```typescript
import {
  SourceOrchestrator,
  ContentExtractor,
  calculateArticleQualityScore
} from '@tyroneross/blog-scraper';

// Step 1: Discover articles
const orchestrator = new SourceOrchestrator();
const discovered = await orchestrator.processSource('https://example.com', {
  sourceType: 'auto'
});

// Step 2: Extract content
const extractor = new ContentExtractor();
const extracted = await Promise.all(
  discovered.articles
    .slice(0, 10)
    .map(a => extractor.extractContent(a.url))
);

// Step 3: Score and filter
const scored = extracted
  .filter(Boolean)
  .map(content => ({
    content,
    score: calculateArticleQualityScore(content!)
  }))
  .filter(item => item.score >= 0.7);

console.log(`Found ${scored.length} high-quality articles`);
```

### Example 3: RSS-Only Scraping

```typescript
import { scrape } from '@tyroneross/blog-scraper';

const result = await scrape('https://example.com', {
  sourceType: 'rss',         // Only use RSS feeds
  extractFullContent: false, // Don't extract full content
  maxArticles: 100
});
```

## How It Works

### 3-Tier Filtering System

**Tier 1: URL Deny Patterns**
- Fast pattern-based filtering
- Excludes non-article pages (/, /about, /tag/*, etc.)
- Customizable patterns

**Tier 2: Content Validation** (sketched below)
- Minimum 200 characters of content
- Title length 10-200 characters
- Text-to-HTML ratio ≥ 10%
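
As a reference for these thresholds, here is a standalone predicate mirroring the checks above; it is an illustration, not the package's internal code:

```typescript
// Illustrative re-implementation of the Tier 2 thresholds listed above;
// the package applies equivalent checks internally.
function passesContentValidation(html: string, text: string, title: string): boolean {
  const ratio = html.length > 0 ? text.length / html.length : 0;
  return (
    text.length >= 200 &&   // minimum 200 characters of content
    title.length >= 10 &&   // title length 10-200 characters
    title.length <= 200 &&
    ratio >= 0.1            // text-to-HTML ratio >= 10%
  );
}
```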

**Tier 3: Quality Scoring**
- Content quality: 60% weight
- Publication date: 12% weight
- Author/byline: 8% weight
- Schema.org metadata: 8% weight
- Reading time: 12% weight
- Default threshold: 50%
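
Numerically, the score is a weighted sum of these signals. A minimal sketch, assuming each signal is normalized to 0-1 (the exported `calculateArticleQualityScore` is the real implementation):

```typescript
// Hypothetical signal shape for illustration only; the real input is the
// ExtractedContent produced by ContentExtractor.
interface QualitySignals {
  contentScore: number;     // 0-1
  dateScore: number;        // 0-1 (1 when a publication date is present)
  authorScore: number;      // 0-1
  schemaScore: number;      // 0-1
  readingTimeScore: number; // 0-1
}

function exampleQualityScore(s: QualitySignals): number {
  return (
    0.6  * s.contentScore +     // content quality: 60%
    0.12 * s.dateScore +        // publication date: 12%
    0.08 * s.authorScore +      // author/byline: 8%
    0.08 * s.schemaScore +      // schema.org metadata: 8%
    0.12 * s.readingTimeScore   // reading time: 12%
  );
}

// An article passes the default filter when the weighted sum is >= 0.5.
```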

### Auto-Detection Flow

1. Try RSS feed (highest confidence)
2. Discover RSS feeds from HTML
3. Try sitemap parsing
4. Discover sitemaps from the domain
5. Fall back to HTML link extraction
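
The outcome of this cascade is reported on the result, so you can check which source type was used and pin it explicitly when auto-detection is unreliable:

```typescript
import { scrape } from '@tyroneross/blog-scraper';

const result = await scrape('https://example.com', { sourceType: 'auto' });
console.log(result.detectedType); // 'rss' | 'sitemap' | 'html'
console.log(result.confidence);   // 'high' | 'medium' | 'low'

// If confidence is low, force a specific step of the cascade instead
if (result.confidence === 'low') {
  const retry = await scrape('https://example.com', { sourceType: 'sitemap' });
  console.log(`sitemap retry found ${retry.articles.length} articles`);
}
```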
329
+
330
+ ## TypeScript Support
331
+
332
+ Full TypeScript support with exported types:
333
+
334
+ ```typescript
335
+ import type {
336
+ ScrapedArticle,
337
+ ScraperTestResult,
338
+ ScrapeOptions,
339
+ ExtractedContent,
340
+ QualityScoreConfig
341
+ } from '@tyroneross/blog-scraper';
342
+ ```
343
+
344
+ ## Performance
345
+
346
+ - **Single article extraction:** ~2-5 seconds
347
+ - **Bundle size:** ~150 KB (gzipped)
348
+ - **Memory usage:** ~100 MB average
349
+ - **No external APIs:** Zero API costs
350
+
351
+ ## Requirements
352
+
353
+ - Node.js ≥ 18.0.0
354
+ - No environment variables needed
355
+
356
+ ## License
357
+
358
+ MIT © Tyrone Ross
359
+
360
+ ## Contributing
361
+
362
+ Contributions welcome! Please open an issue or PR.
363
+
364
+ ## Support
365
+
366
+ - [GitHub Issues](https://github.com/tyroneross/blog-content-scraper/issues)
367
+ - [Documentation](https://github.com/tyroneross/blog-content-scraper#readme)
368
+
369
+ ---
370
+
371
+ **Built with ❤️ using Mozilla Readability**