magpie-html 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +22 -0
- package/README.md +424 -0
- package/dist/index.cjs +5197 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +3072 -0
- package/dist/index.d.ts +3072 -0
- package/dist/index.js +5149 -0
- package/dist/index.js.map +1 -0
- package/package.json +80 -0
package/LICENSE
ADDED
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2025 Anonyfox

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

package/README.md
ADDED
@@ -0,0 +1,424 @@
# Magpie HTML

Modern TypeScript library for scraping web content with isomorphic support. Works seamlessly in both Node.js and browser environments.

## Features

- **Isomorphic** - Works in Node.js and browsers
- **Modern ESM/CJS** - Dual format support
- **Type-safe** - Full TypeScript support
- **Well-tested** - Built with Node.js native test runner
- **Minimal dependencies** - Lightweight and fast
- **Multi-Format Feed Parser** - Parse RSS 2.0, Atom 1.0, and JSON Feed
- **Smart URL Resolution** - Automatic normalization to absolute URLs
- **Error Resilient** - Graceful handling of malformed data
- **High-Level Convenience** - One-line functions for common tasks

## Installation

```bash
npm install magpie-html
```

## Quick Start

```typescript
import { gatherWebsite, gatherArticle, gatherFeed } from 'magpie-html';

// Gather complete website metadata
const site = await gatherWebsite('https://example.com');
console.log(site.title); // Page title
console.log(site.description); // Meta description
console.log(site.image); // Featured image
console.log(site.feeds); // Discovered feeds
console.log(site.internalLinks); // Internal links

// Gather article content + metadata
const article = await gatherArticle('https://example.com/article');
console.log(article.title); // Article title
console.log(article.content); // Clean article text
console.log(article.wordCount); // Word count
console.log(article.readingTime); // Reading time in minutes

// Gather feed data
const feed = await gatherFeed('https://example.com/feed.xml');
console.log(feed.title); // Feed title
console.log(feed.items); // Feed items
```

## Usage

### Gathering Websites

Extract comprehensive metadata from any webpage:

```typescript
import { gatherWebsite } from 'magpie-html';

const site = await gatherWebsite('https://example.com');

// Basic metadata
console.log(site.url); // Final URL (after redirects)
console.log(site.title); // Best title (cleaned)
console.log(site.description); // Meta description
console.log(site.image); // Featured image URL
console.log(site.icon); // Site favicon/icon

// Language & region
console.log(site.language); // ISO 639-1 code (e.g., 'en')
console.log(site.region); // ISO 3166-1 alpha-2 (e.g., 'US')

// Discovered content
console.log(site.feeds); // Array of feed URLs
console.log(site.internalLinks); // Internal links (same domain)
console.log(site.externalLinks); // External links (other domains)

// Raw content
console.log(site.html); // Raw HTML
console.log(site.text); // Plain text (full page)
```

**What it does:**
- Fetches the page with automatic redirect handling
- Extracts metadata from multiple sources (OpenGraph, Schema.org, Twitter Card, etc.)
- Picks the "best" value for each field (longest, highest priority, cleaned)
- Discovers RSS/Atom/JSON feeds linked on the page
- Categorizes internal vs external links
- Returns normalized, absolute URLs
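
The discovered feeds plug straight into the other high-level helpers. A small sketch (assuming `site.feeds` holds absolute feed URL strings, as documented above) that crawls every feed a page advertises:

```typescript
import { gatherWebsite, gatherFeed } from 'magpie-html';

// Discover feeds on a page, then parse each one
const site = await gatherWebsite('https://example.com');

for (const feedUrl of site.feeds) {
  const feed = await gatherFeed(feedUrl);
  console.log(`${feed.title}: ${feed.items.length} items`);
}
```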

### Gathering Articles

Extract clean article content with metadata:

```typescript
import { gatherArticle } from 'magpie-html';

const article = await gatherArticle('https://example.com/article');

// Core content
console.log(article.url); // Final URL
console.log(article.title); // Article title (Readability or metadata)
console.log(article.content); // Clean article text (formatted)
console.log(article.description); // Excerpt/summary

// Metrics
console.log(article.wordCount); // Word count
console.log(article.readingTime); // Est. reading time (minutes)

// Media & language
console.log(article.image); // Article image
console.log(article.language); // Language code
console.log(article.region); // Region code

// Links & raw content
console.log(article.internalLinks); // Internal links
console.log(article.externalLinks); // External links (citations)
console.log(article.html); // Raw HTML
console.log(article.text); // Plain text (full page)
```

**What it does:**
- Uses Mozilla Readability to extract clean article content
- Falls back to metadata extraction if Readability fails
- Converts cleaned HTML to well-formatted plain text
- Calculates reading metrics (word count, reading time)
- Provides both cleaned content and raw HTML
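
Combined with `gatherWebsite`, this makes a simple article crawler. A minimal sketch, assuming `site.internalLinks` is an array of absolute URL strings and that sequential fetching is acceptable:

```typescript
import { gatherWebsite, gatherArticle } from 'magpie-html';

// Gather the first few articles linked from a landing page
const site = await gatherWebsite('https://example.com/blog');

for (const url of site.internalLinks.slice(0, 5)) {
  const article = await gatherArticle(url);
  console.log(`${article.title} (${article.wordCount} words, ~${article.readingTime} min)`);
}
```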

### Gathering Feeds

Parse any feed format with one function:

```typescript
import { gatherFeed } from 'magpie-html';

const feed = await gatherFeed('https://example.com/feed.xml');

// Feed metadata
console.log(feed.title); // Feed title
console.log(feed.description); // Feed description
console.log(feed.url); // Feed URL
console.log(feed.siteUrl); // Website URL

// Feed items
for (const item of feed.items) {
  console.log(item.title); // Item title
  console.log(item.url); // Item URL (absolute)
  console.log(item.description); // Item description
  console.log(item.publishedAt); // Publication date
  console.log(item.author); // Author
}

// Format detection
console.log(feed.format); // 'rss', 'atom', or 'json-feed'
```

**What it does:**
- Auto-detects feed format (RSS 2.0, Atom 1.0, JSON Feed)
- Normalizes all formats to a unified interface
- Resolves relative URLs to absolute
- Handles malformed data gracefully
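
Because every format is normalized to the same shape, downstream code never needs to branch on the feed type. A small sketch (it assumes `publishedAt` and `author` may be missing on sparse or malformed entries):

```typescript
import { gatherFeed } from 'magpie-html';

// One code path for RSS, Atom, and JSON Feed alike
const feed = await gatherFeed('https://example.com/feed.xml');
console.log(`${feed.title} (${feed.format})`);

for (const item of feed.items) {
  // Guard optional fields instead of assuming they exist
  const when = item.publishedAt ?? 'unknown date';
  const by = item.author ?? 'unknown author';
  console.log(`- ${item.title} by ${by} on ${when}`);
}
```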

## Advanced Usage

For more control, use the lower-level modules directly:

### Feed Parsing

```typescript
import { pluck, parseFeed } from 'magpie-html';

// Fetch feed content
const response = await pluck('https://example.com/feed.xml');
const feedContent = await response.textUtf8();

// Parse with base URL for relative links
const result = parseFeed(feedContent, response.finalUrl);

console.log(result.feed.title);
console.log(result.feed.items[0].title);
console.log(result.feed.format); // 'rss', 'atom', or 'json-feed'
```

### Content Extraction

```typescript
import { parseHTML, extractContent, htmlToText } from 'magpie-html';

// Parse HTML once
const doc = parseHTML(html);

// Extract article with Readability
const result = extractContent(doc, {
  baseUrl: 'https://example.com/article',
  cleanConditionally: true,
  keepClasses: false,
});

if (result.success) {
  console.log(result.title); // Article title
  console.log(result.excerpt); // Article excerpt
  console.log(result.content); // Clean HTML
  console.log(result.textContent); // Plain text
  console.log(result.wordCount); // Word count
  console.log(result.readingTime); // Reading time
}

// Or convert any HTML to text
const plainText = htmlToText(html, {
  preserveWhitespace: false,
  includeLinks: true,
  wrapColumn: 80,
});
```

### Metadata Extraction

```typescript
import { parseHTML, extractOpenGraph, extractSchemaOrg, extractSEO } from 'magpie-html';

const doc = parseHTML(html);

// Extract OpenGraph metadata
const og = extractOpenGraph(doc);
console.log(og.title);
console.log(og.description);
console.log(og.image);

// Extract Schema.org data
const schema = extractSchemaOrg(doc);
console.log(schema.articles); // NewsArticle, etc.

// Extract SEO metadata
const seo = extractSEO(doc);
console.log(seo.title);
console.log(seo.description);
console.log(seo.keywords);
```

**Available extractors:**
- `extractSEO` - SEO meta tags
- `extractOpenGraph` - OpenGraph metadata
- `extractTwitterCard` - Twitter Card metadata
- `extractSchemaOrg` - Schema.org / JSON-LD
- `extractCanonical` - Canonical URLs
- `extractLanguage` - Language detection
- `extractIcons` - Favicon and icons
- `extractAssets` - All linked assets (images, scripts, fonts, etc.)
- `extractLinks` - Navigation links (with internal/external split)
- `extractFeedDiscovery` - Discover RSS/Atom/JSON feeds
- ...and more
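
The extractors all take the same parsed `doc`, so a common pattern is to run several of them and fall back across sources when a field is missing. A rough sketch of that idea (the Twitter Card field names are assumed to mirror the OpenGraph and SEO ones shown above):

```typescript
import { pluck, parseHTML, extractOpenGraph, extractTwitterCard, extractSEO } from 'magpie-html';

const response = await pluck('https://example.com');
const doc = parseHTML(await response.textUtf8());

const og = extractOpenGraph(doc);
const twitter = extractTwitterCard(doc);
const seo = extractSEO(doc);

// Prefer OpenGraph, then Twitter Card, then plain SEO tags
const title = og.title ?? twitter.title ?? seo.title;
const description = og.description ?? twitter.description ?? seo.description;

console.log({ title, description });
```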

### Enhanced Fetching

Use `pluck()` for robust fetching with automatic encoding and redirect handling:

```typescript
import { pluck } from 'magpie-html';

const response = await pluck('https://example.com', {
  timeout: 30000, // 30 second timeout
  maxRedirects: 10, // Follow up to 10 redirects
  maxSize: 10485760, // 10MB limit
  userAgent: 'MyBot/1.0',
  throwOnHttpError: true,
  strictContentType: false,
});

// Enhanced response properties
console.log(response.finalUrl); // URL after redirects
console.log(response.redirectChain); // All redirect URLs
console.log(response.detectedEncoding); // Detected charset
console.log(response.timing); // Request timing

// Get UTF-8 decoded content
const text = await response.textUtf8();
```

**Why `pluck()`?**
- Handles broken sites with wrong/missing encoding declarations
- Follows redirect chains and tracks them
- Enforces timeouts and size limits
- Compatible with the standard `fetch()` API
- Named `pluck()` to avoid confusion with the native `fetch()` (magpies pluck things!)
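
In a crawler you usually want a failed fetch to be reported, not to take the whole run down. A defensive sketch (the error types `pluck()` throws are not specified here, so the catch stays generic):

```typescript
import { pluck } from 'magpie-html';

// Fail fast on slow or oversized responses instead of hanging the crawler
try {
  const response = await pluck('https://example.com', {
    timeout: 10000,
    maxSize: 5 * 1024 * 1024, // 5MB
    throwOnHttpError: true,
  });
  console.log(response.finalUrl, response.timing);
} catch (error) {
  console.warn('Fetch failed:', error instanceof Error ? error.message : error);
}
```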

## API Reference

### High-Level Convenience

- **`gatherWebsite(url)`** - Extract complete website metadata
- **`gatherArticle(url)`** - Extract article content + metadata
- **`gatherFeed(url)`** - Parse any feed format

### Fetching

- **`pluck(url, options?)`** - Enhanced fetch for web scraping

### Parsing

- **`parseFeed(content, baseUrl?)`** - Parse RSS/Atom/JSON feeds
- **`detectFormat(content)`** - Detect feed format
- **`parseHTML(html)`** - Parse HTML to Document

### Content Extraction

- **`extractContent(doc, options?)`** - Extract article with Readability
- **`htmlToText(html, options?)`** - Convert HTML to plain text
- **`isProbablyReaderable(doc)`** - Check if content is article-like
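
`isProbablyReaderable` works well as a cheap pre-check before the full Readability pass. A short sketch, assuming it returns a boolean:

```typescript
import { pluck, parseHTML, isProbablyReaderable, extractContent } from 'magpie-html';

const response = await pluck('https://example.com/article');
const doc = parseHTML(await response.textUtf8());

// Skip the expensive extraction on pages that are clearly not articles
if (isProbablyReaderable(doc)) {
  const result = extractContent(doc, { baseUrl: response.finalUrl });
  if (result.success) {
    console.log(result.title, result.wordCount);
  }
}
```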

### Metadata Extraction

- **`extractSEO(doc)`** - SEO meta tags
- **`extractOpenGraph(doc)`** - OpenGraph metadata
- **`extractTwitterCard(doc)`** - Twitter Card metadata
- **`extractSchemaOrg(doc)`** - Schema.org / JSON-LD
- **`extractCanonical(doc)`** - Canonical URLs
- **`extractLanguage(doc)`** - Language detection
- **`extractIcons(doc)`** - Favicons and icons
- **`extractAssets(doc, baseUrl)`** - Linked assets
- **`extractLinks(doc, baseUrl, options?)`** - Navigation links
- **`extractFeedDiscovery(doc, baseUrl)`** - Discover feeds
- ...and 10+ more specialized extractors

### Utilities

- **`normalizeUrl(baseUrl, url)`** - Convert relative to absolute URLs
- **`countWords(text)`** - Count words in text
- **`calculateReadingTime(wordCount)`** - Estimate reading time
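
The utilities are handy on their own when you already have raw text or relative links in hand. A minimal sketch using the signatures listed above:

```typescript
import { normalizeUrl, countWords, calculateReadingTime } from 'magpie-html';

// Resolve a relative href against the page it was found on
const absolute = normalizeUrl('https://example.com/blog/post-1', '../about');

// Reading metrics from plain text you already extracted
const text = 'Plain text previously extracted from a page...';
const words = countWords(text);
const minutes = calculateReadingTime(words);

console.log({ absolute, words, minutes });
```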

See the [TypeDoc documentation](https://anonyfox.github.io/magpie-html) for the complete API reference.

## Performance Tips

**Best Practice:** Parse HTML once and reuse the document:

```typescript
import { parseHTML, extractSEO, extractOpenGraph, extractContent } from 'magpie-html';

const doc = parseHTML(html);

// Reuse the same document for multiple extractions
const seo = extractSEO(doc); // Fast: <5ms
const og = extractOpenGraph(doc); // Fast: <5ms
const content = extractContent(doc); // ~100-500ms

// Total: One parse + all extractions
```

## Development

### Setup

```bash
npm install
```

### Run Tests

```bash
npm test
```

The test suite includes both unit tests (`*.test.ts`) and integration tests using real-world HTML/feed files from `cache/`.

### Watch Mode

```bash
npm run test:watch
```

### Build

```bash
npm run build
```

### Linting & Formatting

```bash
# Check for issues
npm run lint

# Auto-fix issues
npm run lint:fix

# Format code
npm run format

# Run all checks (typecheck + lint)
npm run check
```

### Type Check

```bash
npm run typecheck
```

### Documentation

Generate API documentation:

```bash
npm run docs
npm run docs:serve
```

## Integration Testing

The `cache/` directory contains real-world HTML and feed samples for integration testing. This enables testing against actual production data without network calls.
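
A rough sketch of what such a fixture-backed test can look like with the Node.js native test runner (the fixture path and expectations here are illustrative, not the actual repository layout):

```typescript
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { readFile } from 'node:fs/promises';
import { parseFeed } from 'magpie-html';

test('parses a cached RSS fixture', async () => {
  // Hypothetical fixture file; the real cache/ contents may differ
  const xml = await readFile('cache/example-feed.xml', 'utf8');
  const result = parseFeed(xml, 'https://example.com/');

  assert.equal(result.feed.format, 'rss');
  assert.ok(result.feed.items.length > 0);
});
```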

## Publishing

```bash
npm publish
```

The `prepublishOnly` script automatically builds the package before publishing.

## License

MIT

## Author

[Anonyfox](https://anonyfox.com)