magpie-html 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,22 @@
+ MIT License
+
+ Copyright (c) 2025 Anonyfox
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
package/README.md ADDED
@@ -0,0 +1,424 @@
+ # Magpie HTML 🦅
+
+ A modern TypeScript library for scraping web content. Isomorphic by design, it works seamlessly in both Node.js and browser environments.
+
+ ## Features
+
+ - 🎯 **Isomorphic** - Works in Node.js and browsers
+ - 📦 **Modern ESM/CJS** - Dual format support
+ - 🔒 **Type-safe** - Full TypeScript support
+ - 🧪 **Well-tested** - Built with Node.js native test runner
+ - 🚀 **Minimal dependencies** - Lightweight and fast
+ - 🔄 **Multi-Format Feed Parser** - Parse RSS 2.0, Atom 1.0, and JSON Feed
+ - 🔗 **Smart URL Resolution** - Automatic normalization to absolute URLs
+ - 🛡️ **Error Resilient** - Graceful handling of malformed data
+ - 🦅 **High-Level Convenience** - One-line functions for common tasks
+
+ ## Installation
+
+ ```bash
+ npm install magpie-html
+ ```
+
+ ## Quick Start
+
+ ```typescript
+ import { gatherWebsite, gatherArticle, gatherFeed } from 'magpie-html';
+
+ // Gather complete website metadata
+ const site = await gatherWebsite('https://example.com');
+ console.log(site.title); // Page title
+ console.log(site.description); // Meta description
+ console.log(site.image); // Featured image
+ console.log(site.feeds); // Discovered feeds
+ console.log(site.internalLinks); // Internal links
+
+ // Gather article content + metadata
+ const article = await gatherArticle('https://example.com/article');
+ console.log(article.title); // Article title
+ console.log(article.content); // Clean article text
+ console.log(article.wordCount); // Word count
+ console.log(article.readingTime); // Reading time in minutes
+
+ // Gather feed data
+ const feed = await gatherFeed('https://example.com/feed.xml');
+ console.log(feed.title); // Feed title
+ console.log(feed.items); // Feed items
+ ```
+
+ ## Usage
+
+ ### Gathering Websites
+
+ Extract comprehensive metadata from any webpage:
+
+ ```typescript
+ import { gatherWebsite } from 'magpie-html';
+
+ const site = await gatherWebsite('https://example.com');
+
+ // Basic metadata
+ console.log(site.url); // Final URL (after redirects)
+ console.log(site.title); // Best title (cleaned)
+ console.log(site.description); // Meta description
+ console.log(site.image); // Featured image URL
+ console.log(site.icon); // Site favicon/icon
+
+ // Language & region
+ console.log(site.language); // ISO 639-1 code (e.g., 'en')
+ console.log(site.region); // ISO 3166-1 alpha-2 (e.g., 'US')
+
+ // Discovered content
+ console.log(site.feeds); // Array of feed URLs
+ console.log(site.internalLinks); // Internal links (same domain)
+ console.log(site.externalLinks); // External links (other domains)
+
+ // Raw content
+ console.log(site.html); // Raw HTML
+ console.log(site.text); // Plain text (full page)
+ ```
+
+ **What it does:**
+ - Fetches the page with automatic redirect handling
+ - Extracts metadata from multiple sources (OpenGraph, Schema.org, Twitter Card, etc.)
+ - Picks the "best" value for each field (longest, highest priority, cleaned)
+ - Discovers RSS/Atom/JSON feeds linked on the page
+ - Categorizes internal vs external links
+ - Returns normalized, absolute URLs
+
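The "best value" selection described above can be pictured as a priority merge over candidate metadata sources. This is an illustrative sketch only, not the library's actual implementation, and `pickBest` is a hypothetical name:

```typescript
// Illustrative sketch only — not magpie-html's internal selection code.
// Candidates come from several metadata sources, ordered by priority.
type Candidate = { source: string; value?: string };

// Hypothetical helper: return the first non-empty value, trimmed.
function pickBest(candidates: Candidate[]): string | undefined {
  for (const c of candidates) {
    if (c.value && c.value.trim().length > 0) return c.value.trim();
  }
  return undefined;
}

const title = pickBest([
  { source: 'og:title', value: '  My Article  ' },
  { source: 'schema.org' },
  { source: '<title>', value: 'My Article | Example Site' },
]);
console.log(title); // 'My Article'
```

Ordering candidates by source priority (OpenGraph before the raw `<title>`, say) is what makes a simple first-match scan behave like a "best value" pick.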
+ ### Gathering Articles
+
+ Extract clean article content with metadata:
+
+ ```typescript
+ import { gatherArticle } from 'magpie-html';
+
+ const article = await gatherArticle('https://example.com/article');
+
+ // Core content
+ console.log(article.url); // Final URL
+ console.log(article.title); // Article title (Readability or metadata)
+ console.log(article.content); // Clean article text (formatted)
+ console.log(article.description); // Excerpt/summary
+
+ // Metrics
+ console.log(article.wordCount); // Word count
+ console.log(article.readingTime); // Est. reading time (minutes)
+
+ // Media & language
+ console.log(article.image); // Article image
+ console.log(article.language); // Language code
+ console.log(article.region); // Region code
+
+ // Links & raw content
+ console.log(article.internalLinks); // Internal links
+ console.log(article.externalLinks); // External links (citations)
+ console.log(article.html); // Raw HTML
+ console.log(article.text); // Plain text (full page)
+ ```
+
+ **What it does:**
+ - Uses Mozilla Readability to extract clean article content
+ - Falls back to metadata extraction if Readability fails
+ - Converts cleaned HTML to well-formatted plain text
+ - Calculates reading metrics (word count, reading time)
+ - Provides both cleaned content and raw HTML
+
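The Readability-then-metadata behavior amounts to a simple try-then-fallback chain. A minimal sketch of that control flow — the function names here are hypothetical, not the library's API:

```typescript
// Illustrative control flow — not magpie-html's actual code.
type Extracted = { title?: string; content?: string };

// Try the primary extractor; on failure or empty output, use the fallback.
function extractWithFallback(
  readability: () => Extracted | null,
  metadata: () => Extracted,
): Extracted {
  try {
    const result = readability();
    if (result && result.content) return result;
  } catch {
    // Swallow the error and fall through to metadata extraction.
  }
  return metadata();
}

const out = extractWithFallback(
  () => { throw new Error('Readability failed'); },
  () => ({ title: 'From metadata' }),
);
console.log(out.title); // 'From metadata'
```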
+ ### Gathering Feeds
+
+ Parse any feed format with one function:
+
+ ```typescript
+ import { gatherFeed } from 'magpie-html';
+
+ const feed = await gatherFeed('https://example.com/feed.xml');
+
+ // Feed metadata
+ console.log(feed.title); // Feed title
+ console.log(feed.description); // Feed description
+ console.log(feed.url); // Feed URL
+ console.log(feed.siteUrl); // Website URL
+
+ // Feed items
+ for (const item of feed.items) {
+   console.log(item.title); // Item title
+   console.log(item.url); // Item URL (absolute)
+   console.log(item.description); // Item description
+   console.log(item.publishedAt); // Publication date
+   console.log(item.author); // Author
+ }
+
+ // Format detection
+ console.log(feed.format); // 'rss', 'atom', or 'json-feed'
+ ```
+
+ **What it does:**
+ - Auto-detects feed format (RSS 2.0, Atom 1.0, JSON Feed)
+ - Normalizes all formats to a unified interface
+ - Resolves relative URLs to absolute
+ - Handles malformed data gracefully
+
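Format auto-detection can be pictured as a few cheap checks on the raw payload. The following is an illustrative sketch, not `detectFormat`'s actual logic:

```typescript
// Illustrative sketch — magpie-html's detectFormat may use different heuristics.
type FeedFormat = 'rss' | 'atom' | 'json-feed' | 'unknown';

function sketchDetectFormat(content: string): FeedFormat {
  const trimmed = content.trim();
  // JSON Feed documents are JSON objects with a jsonfeed.org version URL.
  if (trimmed.startsWith('{')) {
    try {
      const json = JSON.parse(trimmed);
      if (typeof json.version === 'string' && json.version.includes('jsonfeed.org')) {
        return 'json-feed';
      }
    } catch {
      return 'unknown';
    }
  }
  // XML formats: look at the root element name.
  if (/<rss[\s>]/i.test(trimmed)) return 'rss';
  if (/<feed[\s>]/i.test(trimmed)) return 'atom';
  return 'unknown';
}

console.log(sketchDetectFormat('{"version":"https://jsonfeed.org/version/1.1","items":[]}')); // 'json-feed'
console.log(sketchDetectFormat('<?xml version="1.0"?><rss version="2.0"></rss>')); // 'rss'
```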
+ ## Advanced Usage
+
+ For more control, use the lower-level modules directly:
+
+ ### Feed Parsing
+
+ ```typescript
+ import { pluck, parseFeed } from 'magpie-html';
+
+ // Fetch feed content
+ const response = await pluck('https://example.com/feed.xml');
+ const feedContent = await response.textUtf8();
+
+ // Parse with base URL for relative links
+ const result = parseFeed(feedContent, response.finalUrl);
+
+ console.log(result.feed.title);
+ console.log(result.feed.items[0].title);
+ console.log(result.feed.format); // 'rss', 'atom', or 'json-feed'
+ ```
+
+ ### Content Extraction
+
+ ```typescript
+ import { parseHTML, extractContent, htmlToText } from 'magpie-html';
+
+ // Parse HTML once
+ const doc = parseHTML(html);
+
+ // Extract article with Readability
+ const result = extractContent(doc, {
+   baseUrl: 'https://example.com/article',
+   cleanConditionally: true,
+   keepClasses: false,
+ });
+
+ if (result.success) {
+   console.log(result.title); // Article title
+   console.log(result.excerpt); // Article excerpt
+   console.log(result.content); // Clean HTML
+   console.log(result.textContent); // Plain text
+   console.log(result.wordCount); // Word count
+   console.log(result.readingTime); // Reading time
+ }
+
+ // Or convert any HTML to text
+ const plainText = htmlToText(html, {
+   preserveWhitespace: false,
+   includeLinks: true,
+   wrapColumn: 80,
+ });
+ ```
+
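As a mental model for the HTML-to-text step, a toy approximation might look like the sketch below. The real `htmlToText` handles links, entities, and line wrapping; this deliberately does not:

```typescript
// Toy approximation — not the library's converter.
function htmlToTextSketch(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '') // drop non-content elements
    .replace(/<[^>]+>/g, ' ')                        // strip remaining tags
    .replace(/\s+/g, ' ')                            // collapse whitespace
    .trim();
}

console.log(htmlToTextSketch('<p>Hello <b>world</b></p><script>x()</script>'));
// 'Hello world'
```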
+ ### Metadata Extraction
+
+ ```typescript
+ import { parseHTML, extractOpenGraph, extractSchemaOrg, extractSEO } from 'magpie-html';
+
+ const doc = parseHTML(html);
+
+ // Extract OpenGraph metadata
+ const og = extractOpenGraph(doc);
+ console.log(og.title);
+ console.log(og.description);
+ console.log(og.image);
+
+ // Extract Schema.org data
+ const schema = extractSchemaOrg(doc);
+ console.log(schema.articles); // NewsArticle, etc.
+
+ // Extract SEO metadata
+ const seo = extractSEO(doc);
+ console.log(seo.title);
+ console.log(seo.description);
+ console.log(seo.keywords);
+ ```
+
+ **Available extractors:**
+ - `extractSEO` - SEO meta tags
+ - `extractOpenGraph` - OpenGraph metadata
+ - `extractTwitterCard` - Twitter Card metadata
+ - `extractSchemaOrg` - Schema.org / JSON-LD
+ - `extractCanonical` - Canonical URLs
+ - `extractLanguage` - Language detection
+ - `extractIcons` - Favicon and icons
+ - `extractAssets` - All linked assets (images, scripts, fonts, etc.)
+ - `extractLinks` - Navigation links (with internal/external split)
+ - `extractFeedDiscovery` - Discover RSS/Atom/JSON feeds
+ - ...and more
+
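Conceptually, an extractor like `extractOpenGraph` walks the parsed document collecting `og:*` meta tags. A naive regex-based sketch conveys the idea — illustrative only, since the real extractors operate on a parsed DOM and `sketchOpenGraph` is a made-up name:

```typescript
// Naive sketch: collect og:* meta tags with a regex.
// Real-world HTML needs a proper parser; this only handles the simple form.
function sketchOpenGraph(html: string): Record<string, string> {
  const out: Record<string, string> = {};
  const re = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    out[m[1]] = m[2];
  }
  return out;
}

const og = sketchOpenGraph(
  '<meta property="og:title" content="Hello"><meta property="og:image" content="/a.png">'
);
console.log(og.title); // 'Hello'
console.log(og.image); // '/a.png'
```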
+ ### Enhanced Fetching
+
+ Use `pluck()` for robust fetching with automatic encoding and redirect handling:
+
+ ```typescript
+ import { pluck } from 'magpie-html';
+
+ const response = await pluck('https://example.com', {
+   timeout: 30000, // 30 second timeout
+   maxRedirects: 10, // Follow up to 10 redirects
+   maxSize: 10485760, // 10MB limit
+   userAgent: 'MyBot/1.0',
+   throwOnHttpError: true,
+   strictContentType: false,
+ });
+
+ // Enhanced response properties
+ console.log(response.finalUrl); // URL after redirects
+ console.log(response.redirectChain); // All redirect URLs
+ console.log(response.detectedEncoding); // Detected charset
+ console.log(response.timing); // Request timing
+
+ // Get UTF-8 decoded content
+ const text = await response.textUtf8();
+ ```
+
+ **Why `pluck()`?**
+ - Handles broken sites with wrong/missing encoding declarations
+ - Follows redirect chains and tracks them
+ - Enforces timeouts and size limits
+ - Compatible with standard `fetch()` API
+ - Named `pluck()` to avoid confusion with the standard `fetch()` (magpies pluck things! 🦅)
+
+ ## API Reference
+
+ ### High-Level Convenience
+
+ - **`gatherWebsite(url)`** - Extract complete website metadata
+ - **`gatherArticle(url)`** - Extract article content + metadata
+ - **`gatherFeed(url)`** - Parse any feed format
+
+ ### Fetching
+
+ - **`pluck(url, options?)`** - Enhanced fetch for web scraping
+
+ ### Parsing
+
+ - **`parseFeed(content, baseUrl?)`** - Parse RSS/Atom/JSON feeds
+ - **`detectFormat(content)`** - Detect feed format
+ - **`parseHTML(html)`** - Parse HTML to Document
+
+ ### Content Extraction
+
+ - **`extractContent(doc, options?)`** - Extract article with Readability
+ - **`htmlToText(html, options?)`** - Convert HTML to plain text
+ - **`isProbablyReaderable(doc)`** - Check if content is article-like
+
+ ### Metadata Extraction
+
+ - **`extractSEO(doc)`** - SEO meta tags
+ - **`extractOpenGraph(doc)`** - OpenGraph metadata
+ - **`extractTwitterCard(doc)`** - Twitter Card metadata
+ - **`extractSchemaOrg(doc)`** - Schema.org / JSON-LD
+ - **`extractCanonical(doc)`** - Canonical URLs
+ - **`extractLanguage(doc)`** - Language detection
+ - **`extractIcons(doc)`** - Favicons and icons
+ - **`extractAssets(doc, baseUrl)`** - Linked assets
+ - **`extractLinks(doc, baseUrl, options?)`** - Navigation links
+ - **`extractFeedDiscovery(doc, baseUrl)`** - Discover feeds
+ - ...and 10+ more specialized extractors
+
+ ### Utilities
+
+ - **`normalizeUrl(baseUrl, url)`** - Convert relative to absolute URLs
+ - **`countWords(text)`** - Count words in text
+ - **`calculateReadingTime(wordCount)`** - Estimate reading time
+
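The utilities can be approximated with standard platform APIs. A minimal sketch under assumed semantics — the 200 words-per-minute constant is an assumption, and the real functions may differ in edge cases:

```typescript
// Illustrative sketches only — magpie-html's implementations may differ.

// URL normalization: resolve a possibly-relative URL against a base,
// using the WHATWG URL constructor available in Node.js and browsers.
function normalizeUrlSketch(baseUrl: string, url: string): string {
  return new URL(url, baseUrl).href;
}

// Word counting: split on whitespace runs.
function countWordsSketch(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Reading time: assumed ~200 words per minute, rounded up to at least 1.
function calculateReadingTimeSketch(wordCount: number): number {
  return Math.max(1, Math.ceil(wordCount / 200));
}

console.log(normalizeUrlSketch('https://example.com/blog/', '../feed.xml'));
// 'https://example.com/feed.xml'
console.log(countWordsSketch('  hello   scraping  world ')); // 3
console.log(calculateReadingTimeSketch(450)); // 3
```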
328
+ See [TypeDoc documentation](https://anonyfox.github.io/magpie-html) for complete API reference.
329
+
330
+ ## Performance Tips
331
+
332
+ **Best Practice:** Parse HTML once and reuse the document:
333
+
334
+ ```typescript
335
+ import { parseHTML, extractSEO, extractOpenGraph, extractContent } from 'magpie-html';
336
+
337
+ const doc = parseHTML(html);
338
+
339
+ // Reuse the same document for multiple extractions
340
+ const seo = extractSEO(doc); // Fast: <5ms
341
+ const og = extractOpenGraph(doc); // Fast: <5ms
342
+ const content = extractContent(doc); // ~100-500ms
343
+
344
+ // Total: One parse + all extractions
345
+ ```
346
+
+ ## Development
+
+ ### Setup
+
+ ```bash
+ npm install
+ ```
+
+ ### Run Tests
+
+ ```bash
+ npm test
+ ```
+
+ The test suite includes both unit tests (`*.test.ts`) and integration tests using real-world HTML/feed files from `cache/`.
+
+ ### Watch Mode
+
+ ```bash
+ npm run test:watch
+ ```
+
+ ### Build
+
+ ```bash
+ npm run build
+ ```
+
+ ### Linting & Formatting
+
+ ```bash
+ # Check for issues
+ npm run lint
+
+ # Auto-fix issues
+ npm run lint:fix
+
+ # Format code
+ npm run format
+
+ # Run all checks (typecheck + lint)
+ npm run check
+ ```
+
+ ### Type Check
+
+ ```bash
+ npm run typecheck
+ ```
+
+ ### Documentation
+
+ Generate API documentation:
+
+ ```bash
+ npm run docs
+ npm run docs:serve
+ ```
+
+ ## Integration Testing
+
+ The `cache/` directory contains real-world HTML and feed samples for integration testing. This enables testing against actual production data without network calls.
+
+ ## Publishing
+
+ ```bash
+ npm publish
+ ```
+
+ The `prepublishOnly` script automatically builds the package before publishing.
+
+ ## License
+
+ MIT
+
+ ## Author
+
+ [Anonyfox](https://anonyfox.com)