scrapex 0.5.3 → 1.0.0-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. package/LICENSE +1 -1
  2. package/README.md +551 -145
  3. package/dist/enhancer-ByjRD-t5.mjs +769 -0
  4. package/dist/enhancer-ByjRD-t5.mjs.map +1 -0
  5. package/dist/enhancer-j0xqKDJm.cjs +847 -0
  6. package/dist/enhancer-j0xqKDJm.cjs.map +1 -0
  7. package/dist/index-CDgcRnig.d.cts +268 -0
  8. package/dist/index-CDgcRnig.d.cts.map +1 -0
  9. package/dist/index-piS5wtki.d.mts +268 -0
  10. package/dist/index-piS5wtki.d.mts.map +1 -0
  11. package/dist/index.cjs +2007 -0
  12. package/dist/index.cjs.map +1 -0
  13. package/dist/index.d.cts +580 -0
  14. package/dist/index.d.cts.map +1 -0
  15. package/dist/index.d.mts +580 -0
  16. package/dist/index.d.mts.map +1 -0
  17. package/dist/index.mjs +1956 -0
  18. package/dist/index.mjs.map +1 -0
  19. package/dist/llm/index.cjs +334 -0
  20. package/dist/llm/index.cjs.map +1 -0
  21. package/dist/llm/index.d.cts +258 -0
  22. package/dist/llm/index.d.cts.map +1 -0
  23. package/dist/llm/index.d.mts +258 -0
  24. package/dist/llm/index.d.mts.map +1 -0
  25. package/dist/llm/index.mjs +317 -0
  26. package/dist/llm/index.mjs.map +1 -0
  27. package/dist/parsers/index.cjs +11 -0
  28. package/dist/parsers/index.d.cts +2 -0
  29. package/dist/parsers/index.d.mts +2 -0
  30. package/dist/parsers/index.mjs +3 -0
  31. package/dist/parsers-Bneuws8x.cjs +569 -0
  32. package/dist/parsers-Bneuws8x.cjs.map +1 -0
  33. package/dist/parsers-CwkYnyWY.mjs +482 -0
  34. package/dist/parsers-CwkYnyWY.mjs.map +1 -0
  35. package/dist/types-CadAXrme.d.mts +674 -0
  36. package/dist/types-CadAXrme.d.mts.map +1 -0
  37. package/dist/types-DPEtPihB.d.cts +674 -0
  38. package/dist/types-DPEtPihB.d.cts.map +1 -0
  39. package/package.json +79 -100
  40. package/dist/index.d.ts +0 -45
  41. package/dist/index.js +0 -8
  42. package/dist/scrapex.cjs.development.js +0 -1130
  43. package/dist/scrapex.cjs.development.js.map +0 -1
  44. package/dist/scrapex.cjs.production.min.js +0 -2
  45. package/dist/scrapex.cjs.production.min.js.map +0 -1
  46. package/dist/scrapex.esm.js +0 -1122
  47. package/dist/scrapex.esm.js.map +0 -1
package/README.md CHANGED
@@ -1,157 +1,563 @@
1
- # Overview
2
-
3
- ScrapeX provides a node library to scrape basic details of a url using the [metascraper](https://metascraper.js.org/#/), [firefox readability](https://github.com/mozilla/readability), [page-metadata-parser](https://github.com/mozilla/page-metadata-parser) libraries.
4
-
5
- ## Getting Started
6
-
7
- ### Install
8
-
9
- Install
10
-
11
- ```javascript
12
- npm i scrapex --save or yarn add scrapex
13
- ```
14
-
15
- ### Usage
16
-
17
- ```javascript
18
- import { scrape, scrapeHtml } from 'scrapex';
19
- // or
20
- const { scrape, scrapeHtml } = require('scrapex');
21
- // usage
22
- scrape(url, options)
23
- options = { metascraperRules?, timeout?, sanitizeOptions?}
24
- metascraperRules = optional array of 'audio'
25
- | 'amazon'
26
- | 'iframe'
27
- | 'media-provider'
28
- | 'soundcloud'
29
- | 'uol'
30
- | 'spotify'
31
- | 'video'
32
- | 'youtube'
33
- timeout: number = default to 60 seconds
34
- sanitizeOptions: sanitize-html options
35
-
36
- ```
37
-
38
- ```javascript
39
- // define a url
40
- const url = "https://appleinsider.com/articles/19/08/22/like-apple-music-spotify-now-offers-a-three-month-premium-trial"
41
- const data = await scrape(url, {metascraperRules:['youtube']})
42
- // or
43
- const html = `html string of the webpage extracted separately`
44
- const data = await scrape(url, html, {metascraperRules:['youtube']})
45
-
46
- // scrape result
47
- {
48
- audio: undefined,
49
- author: 'Amber Neely',
50
- logo: 'https://logo.clearbit.com/appleinsider.com',
51
- favicon: 'https://photos5.appleinsider.com/v9/images/apple-touch-icon-72.png',
52
- image: 'https://photos5.appleinsider.com/gallery/32489-55660-header-xl.jpg',
53
- publisher: 'AppleInsider',
54
- date: '2019-08-22T12:14:35.000Z',
55
- description: "Spotify has extended the free-trial period it offers for Spotify Premium from one month to three, the default length of Apple's free trial for Apple Music.",
56
- lang: 'en',
57
- url: 'https://appleinsider.com/articles/19/08/22/like-apple-music-spotify-now-offers-a-three-month-premium-trial',
58
- text: "Spotify has extended the free-trial period it offers for Spotify Premium from one month to three, the default length of Apple's free trial for Apple Music.\n" +
59
- 'Streaming giant Spotify is now offering three free months to anyone who has yet to try their service, according to a news post on their site.\n' +
60
- `"Beginning August 22, eligible users will receive the first three months on us for free when they sign up for any Spotify Premium plan," says Spotify in a statement about the new trial. "You'll unlock a world of on-demand access to millions of hours of audio content— no matter when you sign up, winter, spring, summer, or fall."\n` +
61
- "The trial period currently only extends to individual and student plans and will roll out across Duo and Family in the coming months. The trial doesn't extend to Headspace or anyone who is billed directly through their carrier, with the exception of those in Japan, Australia, China, and Germany. \n" +
62
- "Apple has been offering free three-month trials to Apple Music since it's inception, though they may begin limiting their trial to one month. Apple had learned artists are wary of lengthy trial periods when Taylor Swift protested the three-month trial by withholding her album 1989 from the service. The protest earned artists the ability to be paid for track and album streams through the free trial period.\n" +
63
- 'Like most other paid music subscriptions, Spotify Premium offers users the ability to listen ad-free, download music to their device, create playlists, skip tracks, and toggle between devices when listening. ',
64
- video: null,
65
- keywords: [
66
- 'Apple', 'Apple Inc',
67
- 'iPhone', 'iPad',
68
- 'iPod touch', 'iPod nano',
69
- 'Apple TV', 'Apple',
70
- 'iPod shuffle', 'iphone 6',
71
- 'iphone 6s', 'ios 9',
72
- 'ios9', 'iTunes',
73
- 'i mac', 'mac os x',
74
- 'mac osx', 'Apple Computer',
75
- 'Apple Computer Inc.', 'Mac OS X',
76
- 'iMac', 'iBook',
77
- 'Mac Pro', 'MacBook Pro',
78
- 'Magic Pad', 'Magic Mouse',
79
- 'iPod classic', 'App Store',
80
- 'iTunes Store', 'iBook Store',
81
- 'mac book', 'Microsoft',
82
- 'Adobe', 'Research in Motion',
83
- 'RIM', 'Nokia',
84
- 'Samsung', 'Google',
85
- 'Nvidia', 'Intel'
86
- ],
87
- tags: [],
88
- embeds: [],
89
- source: 'appleinsider.com',
90
- twitter: {
91
- site: '@appleinsider',
92
- creator: '@appleinsider',
93
- description: "Spotify has extended the free-trial period it offers for Spotify Premium from one month to three, the default length of Apple's free trial for Apple Music.",
94
- title: 'Like Apple Music, Spotify now offers a three month premium trial | AppleInsider',
95
- image: 'https://photos5.appleinsider.com/gallery/32489-55660-header-xl.jpg'
1
+ # scrapex
2
+
3
+ Modern web scraper with LLM-enhanced extraction, extensible pipeline, and pluggable parsers.
4
+
5
+ > **Beta Release**: v1.0.0 is currently in beta. The API is stable but minor changes may occur before the stable release.
6
+
7
+ ## Features
8
+
9
+ - **LLM-Ready Output** - Content extracted as Markdown, optimized for AI/LLM consumption
10
+ - **Provider-Agnostic LLM** - Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
11
+ - **Vector Embeddings** - Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
12
+ - **Extensible Pipeline** - Pluggable extractors with priority-based execution
13
+ - **Smart Extraction** - Uses Mozilla Readability for content, Cheerio for metadata
14
+ - **Markdown Parsing** - Parse markdown content, awesome lists, and GitHub repos
15
+ - **RSS/Atom Feeds** - Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with pagination support
16
+ - **TypeScript First** - Full type safety with comprehensive type exports
17
+ - **Dual Format** - ESM and CommonJS builds
18
+
19
+ ## Installation
20
+
21
+ ```bash
22
+ npm install scrapex@beta
23
+ ```
24
+
25
+ ### Optional Peer Dependencies
26
+
27
+ ```bash
28
+ # For LLM features
29
+ npm install openai # OpenAI/Ollama/LM Studio
30
+ npm install @anthropic-ai/sdk # Anthropic Claude
31
+
32
+ # For JavaScript-rendered pages
33
+ npm install puppeteer
34
+ ```
35
+
36
+ ## Quick Start
37
+
38
+ ```typescript
39
+ import { scrape } from 'scrapex';
40
+
41
+ const result = await scrape('https://example.com/article');
42
+
43
+ console.log(result.title); // "Article Title"
44
+ console.log(result.content); // Markdown content
45
+ console.log(result.textContent); // Plain text (lower tokens)
46
+ console.log(result.excerpt); // First ~300 chars
47
+ ```
48
+
49
+ ## API Reference
50
+
51
+ ### `scrape(url, options?)`
52
+
53
+ Fetch and extract metadata and content from a URL.
54
+
55
+ ```typescript
56
+ import { scrape } from 'scrapex';
57
+
58
+ const result = await scrape('https://example.com', {
59
+ timeout: 10000,
60
+ userAgent: 'MyBot/1.0',
61
+ extractContent: true,
62
+ maxContentLength: 50000,
63
+ respectRobots: false,
64
+ });
65
+ ```
66
+
67
+ ### `scrapeHtml(html, url, options?)`
68
+
69
+ Extract from raw HTML without fetching.
70
+
71
+ ```typescript
72
+ import { scrapeHtml } from 'scrapex';
73
+
74
+ const html = await fetchSomehow('https://example.com');
75
+ const result = await scrapeHtml(html, 'https://example.com');
76
+ ```
77
+
78
+ ### Result Object (`ScrapedData`)
79
+
80
+ ```typescript
81
+ interface ScrapedData {
82
+ // Identity
83
+ url: string;
84
+ canonicalUrl: string;
85
+ domain: string;
86
+
87
+ // Basic metadata
88
+ title: string;
89
+ description: string;
90
+ image?: string;
91
+ favicon?: string;
92
+
93
+ // Content (LLM-optimized)
94
+ content: string; // Markdown format
95
+ textContent: string; // Plain text
96
+ excerpt: string; // ~300 char preview
97
+ wordCount: number;
98
+
99
+ // Context
100
+ author?: string;
101
+ publishedAt?: string;
102
+ modifiedAt?: string;
103
+ siteName?: string;
104
+ language?: string;
105
+
106
+ // Classification
107
+ contentType: 'article' | 'repo' | 'docs' | 'package' | 'video' | 'tool' | 'product' | 'unknown';
108
+ keywords: string[];
109
+
110
+ // Structured data
111
+ jsonLd?: Record<string, unknown>[];
112
+ links?: ExtractedLink[];
113
+
114
+ // LLM Enhancements (when enabled)
115
+ summary?: string;
116
+ suggestedTags?: string[];
117
+ entities?: ExtractedEntities;
118
+ extracted?: Record<string, unknown>;
119
+
120
+ // Meta
121
+ scrapedAt: string;
122
+ scrapeTimeMs: number;
123
+ error?: string;
124
+ }
125
+ ```
126
+
127
+ ## LLM Integration
128
+
129
+ ### Using OpenAI
130
+
131
+ ```typescript
132
+ import { scrape } from 'scrapex';
133
+ import { createOpenAI } from 'scrapex/llm';
134
+
135
+ const llm = createOpenAI({ apiKey: 'sk-...' });
136
+
137
+ const result = await scrape('https://example.com/article', {
138
+ llm,
139
+ enhance: ['summarize', 'tags', 'entities', 'classify'],
140
+ });
141
+
142
+ console.log(result.summary); // AI-generated summary
143
+ console.log(result.suggestedTags); // ['javascript', 'web', ...]
144
+ console.log(result.entities); // { people: [], organizations: [], ... }
145
+ ```
146
+
147
+ ### Embeddings
148
+
149
+ Generate vector embeddings from scraped content for semantic search, RAG, and similarity matching:
150
+
151
+ ```typescript
152
+ import { scrape } from 'scrapex';
153
+ import { createOpenAIEmbedding } from 'scrapex/embeddings';
154
+
155
+ const result = await scrape('https://example.com/article', {
156
+ embeddings: {
157
+ provider: { type: 'custom', provider: createOpenAIEmbedding() },
158
+ model: 'text-embedding-3-small',
96
159
  },
97
- title: 'Like Apple Music, Spotify now offers a three month premium trial',
98
- links: [
99
- {
100
- href: 'https://newsroom.spotify.com/2019-08-22/5-ways-to-take-control-of-your-streaming-with-spotify-premium/',
101
- text: 'a news post on their site.'
102
- },
103
- {
104
- href: 'https://appleinsider.com/inside/apple-music',
105
- text: 'Apple Music'
106
- },
107
- {
108
- href: 'https://appleinsider.com/articles/19/07/25/apple-begins-limiting-apple-music-free-trial-period-to-one-month',
109
- text: 'one month.'
110
- },
111
- {
112
- href: 'https://appleinsider.com/articles/15/06/18/apple-music-to-miss-out-on-taylor-swifts-1989-album',
113
- text: 'Taylor Swift protested the three-month trial'
114
- }
115
- ],
116
- content:
117
- '<div><div><div><div> <p>\n\t\t\tBy <a href="undefined/cdn-cgi/l/email-protection#d6b7bbb4b3a496b7a6a6bab3bfb8a5bfb2b3a4f8b5b9bb">Amber Neely</a>\t\t\t<br />\n\t\t\tThursday, August 22, 2019, 05:14 am PT (08:14 am ET)\n\t\t</p>Spotify has extended the free-trial period it offers for Spotify Premium from one month to three, the default length of Apple\'s free trial for Apple Music.<br /><p>\nStreaming giant Spotify is now offering three free months to anyone who has yet to try their service, according to <a href="https://newsroom.spotify.com/2019-08-22/5-ways-to-take-control-of-your-streaming-with-spotify-premium/">a news post on their site.</a></p><p>\n"Beginning August 22, eligible users will receive the first three months on us for free when they sign up for any Spotify Premium plan," says Spotify in a statement about the new trial. "You\'ll unlock a world of on-demand access to millions of hours of audio content—no matter when you sign up, winter, spring, summer, or fall."</p><p>\nThe trial period currently only extends to individual and student plans and will roll out across Duo and Family in the coming months. The trial doesn\'t extend to Headspace or anyone who is billed directly through their carrier, with the exception of those in Japan, Australia, China, and Germany. </p><p>\nApple has been offering free three-month trials to Apple Music since it\'s inception, though they may begin limiting their trial to <a href="https://appleinsider.com/articles/19/07/25/apple-begins-limiting-apple-music-free-trial-period-to-one-month">one month.</a> Apple had learned artists are wary of lengthy trial periods when <a href="https://appleinsider.com/articles/15/06/18/apple-music-to-miss-out-on-taylor-swifts-1989-album">Taylor Swift protested the three-month trial</a> by withholding her album <em>1989</em> from the service. 
The protest earned artists the ability to be paid for track and album streams through the free trial period.</p><p>\nStudents who sign up for Apple Music can get a free six-month trial <a href="https://support.apple.com/en-ke/HT205928">by visiting Apple\'s Support Page.</a> After the trial ends, students pay $4.99 a month to continue their subscription until graduation, which works out to be <a href="https://appleinsider.com/articles/16/05/06/apple-begins-offering-half-price-499-apple-music-subscriptions-for-students">about half the price of a standard subscription.</a></p><p>\nLike most other paid music subscriptions, Spotify Premium offers users the ability to listen ad-free, download music to their device, create playlists, skip tracks, and toggle between devices when listening. </p></div></div></div></div>',
118
- html: ...
160
+ });
161
+
162
+ if (result.embeddings?.status === 'success') {
163
+ console.log(result.embeddings.vector); // [0.023, -0.041, ...]
119
164
  }
120
165
  ```
121
166
 
122
- ### Extracted data elements
167
+ Features include:
168
+ - **Multiple providers** - OpenAI, Azure, Cohere, HuggingFace, Ollama, Transformers.js
169
+ - **PII redaction** - Automatically redact emails, phones, SSNs before sending to APIs
170
+ - **Smart chunking** - Split long content with configurable overlap
171
+ - **Caching** - Content-addressable cache to avoid redundant API calls
172
+ - **Resilience** - Retry, circuit breaker, rate limiting
173
+
174
+ See the [Embeddings Guide](https://scrapex.dev/guides/embeddings) for full documentation.
175
+
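The smart-chunking bullet above can be illustrated with a minimal sliding-window chunker. This is a sketch of the general technique (fixed-size windows with configurable overlap), not scrapex's internal implementation; the function name and default sizes are illustrative.

```typescript
// Illustrative sliding-window chunker (not scrapex's internal code):
// splits text into fixed-size chunks, each sharing `overlap` characters
// with its predecessor so neighboring chunks keep some common context.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap keeps a little shared context between adjacent chunks, which helps embeddings of neighboring chunks stay semantically continuous across chunk boundaries.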
176
+ ## Breaking Changes (Beta)
177
+
178
+ - LLM provider classes (e.g., `AnthropicProvider`) were removed. Use preset factories like
179
+ `createOpenAI`, `createAnthropic`, `createOllama`, and `createLMStudio` instead.
180
+
181
+ ### Using Anthropic Claude
182
+
183
+ ```typescript
184
+ import { createAnthropic } from 'scrapex/llm';
185
+
186
+ const llm = createAnthropic({
187
+ apiKey: process.env.ANTHROPIC_API_KEY,
188
+ model: 'claude-3-5-haiku-20241022', // or 'claude-3-5-sonnet-20241022'
189
+ });
190
+
191
+ const result = await scrape(url, { llm, enhance: ['summarize'] });
192
+ ```
193
+
194
+ ### Using Ollama (Local)
195
+
196
+ ```typescript
197
+ import { createOllama } from 'scrapex/llm';
198
+
199
+ const llm = createOllama({ model: 'llama3.2' });
200
+
201
+ const result = await scrape(url, { llm, enhance: ['summarize'] });
202
+ ```
203
+
204
+ ### Using LM Studio (Local)
205
+
206
+ ```typescript
207
+ import { createLMStudio } from 'scrapex/llm';
208
+
209
+ const llm = createLMStudio({ model: 'local-model' });
210
+
211
+ const result = await scrape(url, { llm, enhance: ['summarize'] });
212
+ ```
213
+
214
+ ### Structured Extraction
215
+
216
+ Extract specific data using a schema:
217
+
218
+ ```typescript
219
+ const result = await scrape('https://example.com/product', {
220
+ llm,
221
+ extract: {
222
+ productName: 'string',
223
+ price: 'number',
224
+ features: 'string[]',
225
+ inStock: 'boolean',
226
+ sku: 'string?', // optional
227
+ },
228
+ });
229
+
230
+ console.log(result.extracted);
231
+ // { productName: "Widget", price: 29.99, features: [...], inStock: true }
232
+ ```
233
+
234
+ ## Custom Extractors
235
+
236
+ Create custom extractors to add domain-specific extraction logic:
237
+
238
+ ```typescript
239
+ import { scrape, type Extractor, type ExtractionContext } from 'scrapex';
240
+
241
+ const recipeExtractor: Extractor = {
242
+ name: 'recipe',
243
+ priority: 60, // Higher = runs earlier
244
+
245
+ async extract(context: ExtractionContext) {
246
+ const { $ } = context;
247
+
248
+ return {
249
+ custom: {
250
+ ingredients: $('.ingredients li').map((_, el) => $(el).text()).get(),
251
+ cookTime: $('[itemprop="cookTime"]').attr('content'),
252
+ servings: $('[itemprop="recipeYield"]').text(),
253
+ },
254
+ };
255
+ },
256
+ };
257
+
258
+ const result = await scrape('https://example.com/recipe', {
259
+ extractors: [recipeExtractor],
260
+ });
261
+
262
+ console.log(result.custom?.ingredients);
263
+ ```
264
+
265
+ ### Replacing Default Extractors
266
+
267
+ ```typescript
268
+ const result = await scrape(url, {
269
+ replaceDefaultExtractors: true,
270
+ extractors: [myCustomExtractor],
271
+ });
272
+ ```
123
273
 
124
- This is what `scrapex` will try to grab from a web page:
274
+ ## Markdown Parsing
125
275
 
126
- - `audio` — eg. <https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3>. A audio URL that best represents the article.
127
- - `title` - The document's title (from the &lt;title&gt; tag)
128
- - `date` - The document's publication date
129
- - `copyright` - The document's copyright line, if present
130
- - `author` - The document's author
131
- - `publisher` - The document's publisher (website name)
132
- - `text` - The main text of the document with all the junk thrown away
133
- - `image` - The main image for the document (what's used by facebook, etc.)
134
- - `video` - A video URL that best represents the article.
135
- - `embeds` - An array of iframe, embed, object, video that were embedded in the article.
136
- - `tags`- Any tags or keywords that could be found by checking href urls at has the following pattern `a[href*='/t/'],a[href*='/tag/'], a[href*='/tags/'], a[href*='/topic/'],a[href*='/tagged/'], a[href*='?keyword=']`.
137
- - `keywords`- Any keywords that could be found by checking &lt;rel&gt; tags or by looking at href urls.
138
- - `lang` - The language of the document, either detected or supplied by you.
139
- - `description` - The description of the document, from &lt;meta&gt; tags
140
- - `favicon` - The url of the document's [favicon](http://en.wikipedia.org/wiki/Favicon).
141
- - `links` - An array of links embedded within the main article text. (text and href for each)
142
- - `logo` — eg. <https://entrepreneur.com/favicon180x180.png>. An image URL that best represents the publisher brand.
143
- - `content` — readability view html of the article.
144
- - `html` — full html of the page.
145
- - `text` — clear text of the readable html.
146
- - `code` - code segments defined using pre > code tags
276
+ Parse markdown content:
147
277
 
148
- ## Todo
278
+ ```typescript
279
+ import { MarkdownParser, extractListLinks, groupByCategory } from 'scrapex/parsers';
149
280
 
150
- ## Contributors
281
+ // Parse any markdown
282
+ const parser = new MarkdownParser();
283
+ const result = parser.parse(markdownContent);
151
284
 
152
- <img width=150px src="https://pbs.twimg.com/profile_images/1028292150205661185/TFP8E8Fc_400x400.jpg">
153
- <p><strong>Rakesh Paul</strong> - <a href="https://xtrios.com">Xtrios</a></p>
285
+ console.log(result.data.title);
286
+ console.log(result.data.sections);
287
+ console.log(result.data.links);
288
+ console.log(result.data.codeBlocks);
289
+
290
+ // Extract links from markdown lists and group by category
291
+ const links = extractListLinks(markdownContent);
292
+ const grouped = groupByCategory(links);
293
+
294
+ grouped.forEach((categoryLinks, category) => {
295
+ console.log(`${category}: ${categoryLinks.length} links`);
296
+ });
297
+ ```
298
+
299
+ ### GitHub Utilities
300
+
301
+ ```typescript
302
+ import {
303
+ isGitHubRepo,
304
+ parseGitHubUrl,
305
+ toRawUrl,
306
+ } from 'scrapex/parsers';
307
+
308
+ isGitHubRepo('https://github.com/owner/repo');
309
+ // true
310
+
311
+ parseGitHubUrl('https://github.com/facebook/react');
312
+ // { owner: 'facebook', repo: 'react' }
313
+
314
+ toRawUrl('https://github.com/owner/repo');
315
+ // 'https://raw.githubusercontent.com/owner/repo/main/README.md'
316
+ ```
317
+
318
+ ## RSS/Atom Feed Parsing
319
+
320
+ Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:
321
+
322
+ ```typescript
323
+ import { RSSParser } from 'scrapex';
324
+
325
+ const parser = new RSSParser();
326
+ const result = parser.parse(feedXml, 'https://example.com/feed.xml');
327
+
328
+ console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
329
+ console.log(result.data.title); // Feed title
330
+ console.log(result.data.items); // Array of feed items
331
+ ```
332
+
333
+ **Supported formats:**
334
+ - `rss2` - RSS 2.0 (most common format)
335
+ - `rss1` - RSS 1.0 (RDF-based, older format)
336
+ - `atom` - Atom 1.0 (modern format with better semantics)
337
+
338
+ ### Feed Item Structure
339
+
340
+ ```typescript
341
+ interface FeedItem {
342
+ id: string;
343
+ title: string;
344
+ link: string;
345
+ description?: string;
346
+ content?: string;
347
+ author?: string;
348
+ publishedAt?: string; // ISO 8601
349
+ rawPublishedAt?: string; // Original date string
350
+ updatedAt?: string; // Atom only
351
+ categories: string[];
352
+ enclosure?: FeedEnclosure; // Podcast/media attachments
353
+ customFields?: Record<string, string>;
354
+ }
355
+ ```
356
+
357
+ ### Fetching and Parsing Feeds
358
+
359
+ ```typescript
360
+ import { fetchFeed, paginateFeed } from 'scrapex';
361
+
362
+ // Fetch and parse in one call
363
+ const result = await fetchFeed('https://example.com/feed.xml');
364
+ console.log(result.data.items);
365
+
366
+ // Paginate through feeds with rel="next" links (Atom)
367
+ for await (const page of paginateFeed('https://example.com/atom')) {
368
+ console.log(`Page with ${page.data.items.length} items`);
369
+ }
370
+ ```
371
+
372
+ ### Discovering Feeds in HTML
373
+
374
+ ```typescript
375
+ import { discoverFeeds } from 'scrapex';
376
+
377
+ const html = await fetch('https://example.com').then(r => r.text());
378
+ const feedUrls = discoverFeeds(html, 'https://example.com');
379
+ // ['https://example.com/feed.xml', 'https://example.com/atom.xml']
380
+ ```
381
+
382
+ ### Filtering by Date
383
+
384
+ ```typescript
385
+ import { RSSParser, filterByDate } from 'scrapex';
386
+
387
+ const parser = new RSSParser();
388
+ const result = parser.parse(feedXml);
389
+
390
+ const recentItems = filterByDate(result.data.items, {
391
+ after: new Date('2024-01-01'),
392
+ before: new Date('2024-12-31'),
393
+ includeUndated: false,
394
+ });
395
+ ```
396
+
397
+ ### Converting to Markdown/Text
398
+
399
+ ```typescript
400
+ import { RSSParser, feedToMarkdown, feedToText } from 'scrapex';
401
+
402
+ const parser = new RSSParser();
403
+ const result = parser.parse(feedXml);
404
+
405
+ // Convert to markdown (great for LLM consumption)
406
+ const markdown = feedToMarkdown(result.data, { maxItems: 10 });
407
+
408
+ // Convert to plain text
409
+ const text = feedToText(result.data);
410
+ ```
411
+
412
+ ### Custom Fields (Podcast/Media)
413
+
414
+ Extract custom namespace fields like iTunes podcast tags:
415
+
416
+ ```typescript
417
+ const parser = new RSSParser({
418
+ customFields: {
419
+ duration: 'itunes\\:duration',
420
+ explicit: 'itunes\\:explicit',
421
+ rating: 'media\\:rating',
422
+ },
423
+ });
424
+
425
+ const result = parser.parse(podcastXml);
426
+ const item = result.data.items[0];
427
+
428
+ console.log(item.customFields?.duration); // '10:00'
429
+ console.log(item.customFields?.explicit); // 'no'
430
+ ```
431
+
432
+ ### Security
433
+
434
+ The RSS parser enforces strict URL security:
435
+
436
+ - **HTTPS-only URLs (RSS parser only)**: The RSS/Atom parser (`RSSParser`) resolves all links to HTTPS only. Non-HTTPS URLs (http, javascript, data, file) are rejected and returned as empty strings. This is specific to feed parsing to prevent malicious links in untrusted feeds.
437
+ - **XML Mode**: Feeds are parsed with Cheerio's `{ xml: true }` mode, which disables HTML entity processing and prevents XSS vectors.
438
+
439
+ > **Note**: The public URL utilities (`resolveUrl`, `isValidUrl`, etc.) accept both `http:` and `https:` URLs. Protocol-relative URLs (e.g., `//example.com/path`) are resolved against the base URL's protocol by the standard `URL` constructor.
440
+
441
+ ## URL Utilities
442
+
443
+ ```typescript
444
+ import {
445
+ isValidUrl,
446
+ normalizeUrl,
447
+ extractDomain,
448
+ resolveUrl,
449
+ isExternalUrl,
450
+ } from 'scrapex';
451
+
452
+ isValidUrl('https://example.com');
453
+ // true
454
+
455
+ normalizeUrl('https://example.com/page?utm_source=twitter');
456
+ // 'https://example.com/page' (tracking params removed)
457
+
458
+ extractDomain('https://www.example.com/path');
459
+ // 'example.com'
460
+
461
+ resolveUrl('/path', 'https://example.com/page');
462
+ // 'https://example.com/path'
463
+
464
+ isExternalUrl('https://other.com', 'example.com');
465
+ // true
466
+ ```
467
+
468
+ ## Error Handling
469
+
470
+ ```typescript
471
+ import { scrape, ScrapeError } from 'scrapex';
472
+
473
+ try {
474
+ const result = await scrape('https://example.com');
475
+ } catch (error) {
476
+ if (error instanceof ScrapeError) {
477
+ console.log(error.code); // 'FETCH_FAILED' | 'TIMEOUT' | 'INVALID_URL' | ...
478
+ console.log(error.statusCode); // HTTP status if available
479
+ console.log(error.isRetryable()); // true for network errors
480
+ }
481
+ }
482
+ ```
483
+
484
+ Error codes:
485
+ - `FETCH_FAILED` - Network request failed
486
+ - `TIMEOUT` - Request timed out
487
+ - `INVALID_URL` - URL is malformed
488
+ - `BLOCKED` - Access denied (403)
489
+ - `NOT_FOUND` - Page not found (404)
490
+ - `ROBOTS_BLOCKED` - Blocked by robots.txt
491
+ - `PARSE_ERROR` - HTML parsing failed
492
+ - `LLM_ERROR` - LLM provider error
493
+ - `VALIDATION_ERROR` - Schema validation failed
494
+
495
+ ## Robots.txt
496
+
497
+ ```typescript
498
+ import { scrape, checkRobotsTxt } from 'scrapex';
499
+
500
+ // Check before scraping
501
+ const check = await checkRobotsTxt('https://example.com/path');
502
+ if (check.allowed) {
503
+ const result = await scrape('https://example.com/path');
504
+ }
505
+
506
+ // Or let scrape() handle it
507
+ const result = await scrape('https://example.com/path', {
508
+ respectRobots: true, // Throws if blocked
509
+ });
510
+ ```
511
+
512
+ ## Built-in Extractors
513
+
514
+ | Extractor | Priority | Description |
515
+ |-----------|----------|-------------|
516
+ | `MetaExtractor` | 100 | OG, Twitter, meta tags |
517
+ | `JsonLdExtractor` | 80 | JSON-LD structured data |
518
+ | `ContentExtractor` | 50 | Readability + Turndown |
519
+ | `FaviconExtractor` | 40 | Favicon discovery |
520
+ | `LinksExtractor` | 30 | Content link extraction |
521
+
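The priority column can be read as an execution order. The sketch below shows one plausible merge semantics (higher priority runs first and wins on key collisions); it is inferred from this table and the `// Higher = runs earlier` comment in the custom-extractor example, not taken from scrapex's source.

```typescript
// Illustrative priority-ordered pipeline (assumed semantics, not scrapex's
// actual implementation): run extractors from highest to lowest priority
// and keep the first non-undefined value written for each key.
interface PartialResult {
  [key: string]: unknown;
}

interface SimpleExtractor {
  name: string;
  priority: number;
  extract(): PartialResult;
}

function runPipeline(extractors: SimpleExtractor[]): PartialResult {
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const result: PartialResult = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract())) {
      // First writer wins: a higher-priority extractor's value is kept.
      if (!(key in result) && value !== undefined) result[key] = value;
    }
  }
  return result;
}
```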
522
+ ## Configuration
523
+
524
+ ### Options
525
+
526
+ ```typescript
527
+ interface ScrapeOptions {
528
+ timeout?: number; // Default: 10000ms
529
+ userAgent?: string; // Custom user agent
530
+ extractContent?: boolean; // Default: true
531
+ maxContentLength?: number; // Default: 50000 chars
532
+ fetcher?: Fetcher; // Custom fetcher
533
+ extractors?: Extractor[]; // Additional extractors
534
+ replaceDefaultExtractors?: boolean;
535
+ respectRobots?: boolean; // Check robots.txt
536
+ llm?: LLMProvider; // LLM provider
537
+ enhance?: EnhancementType[]; // LLM enhancements
538
+ extract?: ExtractionSchema; // Structured extraction
539
+ }
540
+ ```
541
+
542
+ ### Enhancement Types
543
+
544
+ ```typescript
545
+ type EnhancementType =
546
+ | 'summarize' // Generate summary
547
+ | 'tags' // Extract keywords/tags
548
+ | 'entities' // Extract named entities
549
+ | 'classify'; // Classify content type
550
+ ```
551
+
552
+ ## Requirements
553
+
554
+ - Node.js 20+
555
+ - TypeScript 5.0+ (for type imports)
154
556
 
155
557
  ## License
156
558
 
157
- This project is licensed under the MIT License.
559
+ MIT
560
+
561
+ ## Author
562
+
563
+ Rakesh Paul - [binaryroute](https://binaryroute.com/authors/rk-paul/)