magpie-html 0.1.2 → 0.1.3

package/README.md CHANGED
@@ -303,51 +303,51 @@ const text = await response.textUtf8();
  - Compatible with standard `fetch()` API
  - Named `pluck()` to avoid confusion (magpies pluck things! 🦅)
 
- ## API Reference
+ ## Experimental: `swoop()` (client-side DOM rendering without a browser engine)
 
- ### High-Level Convenience
+ > **⚠️ SECURITY WARNING: Remote Code Execution (RCE)**
+ >
+ > `swoop()` **executes remote, third-party JavaScript inside your current Node.js process** (via `node:vm` + browser shims).
+ > This is **fundamentally insecure**. Only use `swoop()` on **fully trusted targets** and treat inputs as **hostile by default**.
+ > For any professional/untrusted scraping, run this in a **real sandbox** (container/VM/locked-down OS user + seccomp/apparmor/firejail, etc.).
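Why an in-process `node:vm` context is not a sandbox can be shown with Node's standard library alone. The sketch below does not call `swoop()` or reproduce its internals; the `document` shim and the script string are made up for illustration. It relies on a well-known property of `node:vm`: any host function exposed to the context carries a reference back to the host realm's `Function` constructor.

```typescript
import * as vm from "node:vm";

// Hypothetical shim: a host-side function exposed to the script, as a browser shim would be.
const sandbox = {
  document: {
    getElementById: (_id: string) => null,
  },
};
vm.createContext(sandbox);

// The script is never handed `process`, yet it can recover it:
// the shim function's `.constructor` is the host's `Function`, which
// compiles and runs code in the host realm instead of the vm context.
const escaped: unknown = vm.runInContext(
  `document.getElementById.constructor("return process")()`,
  sandbox,
);

// Reports whether the script obtained the real host `process` object.
console.log("script reached host process:", escaped === process);
```

Once a script holds the host `process`, it can do whatever the Node.js process itself can do, which is why the warning asks for isolation at the OS level (container, VM, restricted user) and not inside the process.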
 
- - **`gatherWebsite(url)`** - Extract complete website metadata
- - **`gatherArticle(url)`** - Extract article content + metadata
- - **`gatherFeed(url)`** - Parse any feed format
+ > **Note:** `magpie-html` does **not** use `swoop()` internally. It’s provided as an **optional standalone utility** for the few cases where you really need DOM-only client-side rendering.
 
- ### Fetching
+ `swoop()` is an **explicitly experimental** helper that tries to execute client-side scripts against a **DOM-only** environment and then returns a **best-effort HTML snapshot**.
 
- - **`pluck(url, options?)`** - Enhanced fetch for web scraping
+ ### Why this exists
 
- ### Parsing
+ Sometimes `curl` / `fetch` / `pluck()` isn’t enough because the page is a SPA and only renders content after client-side JavaScript runs.
+ `swoop()` exists to **quickly turn “CSR-only” pages into HTML** so the rest of `magpie-html` can work with the result.
 
- - **`parseFeed(content, baseUrl?)`** - Parse RSS/Atom/JSON feeds
- - **`detectFormat(content)`** - Detect feed format
- - **`parseHTML(html)`** - Parse HTML to Document
+ If it works, it can be **comparatively lightweight and fast** because it avoids a full browser engine by using a custom `node:vm`-based execution environment with browser shims.
 
- ### Content Extraction
+ For very complicated targets (heavy JS, complex navigation, strong anti-bot, layout-dependent rendering), you should use a **real browser engine** instead.
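The general idea of running page scripts against a `node:vm` context with browser shims, as described above, can be sketched in a few lines. This is an illustration under stated assumptions, not `swoop()`'s actual implementation: `renderWithShim`, `FakeElement`, and the single-element `document` shim are hypothetical, and a real page would need far more of the DOM than this.

```typescript
import * as vm from "node:vm";

// Hypothetical stand-in for a DOM element: just enough for the demo script.
interface FakeElement {
  id: string;
  innerHTML: string;
}

// Runs `script` against a tiny fake `document` and returns a best-effort snapshot.
function renderWithShim(
  script: string,
  timeoutMs = 1000,
): { html: string; errors: string[] } {
  const root: FakeElement = { id: "app", innerHTML: "" };
  const errors: string[] = [];

  // Only what the shim exposes exists inside the context: no layout, no CSS, no network.
  const context = vm.createContext({
    document: {
      getElementById: (id: string) => (id === root.id ? root : null),
    },
  });

  try {
    // `timeout` stops synchronous runaway scripts; it is not a security control.
    vm.runInContext(script, context, { timeout: timeoutMs });
  } catch (err) {
    // Script errors are collected instead of aborting the snapshot.
    errors.push(String(err));
  }

  return { html: `<div id="app">${root.innerHTML}</div>`, errors };
}

const demo = renderWithShim(
  `document.getElementById("app").innerHTML = "<h1>Hello</h1>";`,
);
console.log(demo.html, demo.errors);
```

The sketch also hints at why layout-dependent pages fail with this approach: a script that asks the shim for something it does not model gets `null` or an error, and the snapshot stays incomplete.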
 
- - **`extractContent(doc, options?)`** - Extract article with Readability
- - **`htmlToText(html, options?)`** - Convert HTML to plain text
- - **`isProbablyReaderable(doc)`** - Check if content is article-like
+ `swoop()` is best seen as a **building block**: you still need to provide the **real sandboxing** around it.
 
- ### Metadata Extraction
+ ### What it is
+
+ - A pragmatic “SPA snapshotter” for cases where a page renders content via client-side JavaScript.
+ - **No browser engine**: no layout/paint/CSS correctness.
+
+ ### What it is NOT
+
+ - Not a headless browser replacement (no navigation lifecycle, no reliable layout APIs).
+
+ ### Usage
 
- - **`extractSEO(doc)`** - SEO meta tags
- - **`extractOpenGraph(doc)`** - OpenGraph metadata
- - **`extractTwitterCard(doc)`** - Twitter Card metadata
- - **`extractSchemaOrg(doc)`** - Schema.org / JSON-LD
- - **`extractCanonical(doc)`** - Canonical URLs
- - **`extractLanguage(doc)`** - Language detection
- - **`extractIcons(doc)`** - Favicons and icons
- - **`extractAssets(doc, baseUrl)`** - Linked assets
- - **`extractLinks(doc, baseUrl, options?)`** - Navigation links
- - **`extractFeedDiscovery(doc, baseUrl)`** - Discover feeds
- - ...and 10+ more specialized extractors
-
- ### Utilities
-
- - **`normalizeUrl(baseUrl, url)`** - Convert relative to absolute URLs
- - **`countWords(text)`** - Count words in text
- - **`calculateReadingTime(wordCount)`** - Estimate reading time
-
- See [TypeDoc documentation](https://anonyfox.github.io/magpie-html) for complete API reference.
+ ```typescript
+ import { swoop } from "magpie-html";
+
+ const result = await swoop("https://example.com/spa", {
+   waitStrategy: "networkidle",
+   timeout: 3000,
+ });
+
+ console.log(result.html);
+ console.log(result.errors);
+ ```
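Because the snapshot is best-effort, a caller will likely want to check it before handing it to further extraction. The sketch below is one hedged way to do that. It assumes only that the result has an `html` string and an `errors` array, as the usage above suggests; the `SnapshotLike` type, the `hasVisibleText` helper, and the 50-character threshold are hypothetical and not part of `magpie-html`.

```typescript
// Assumed result shape: only `html` and `errors` appear in the usage example;
// their exact types are a guess for this sketch.
interface SnapshotLike {
  html: string;
  errors: unknown[];
}

// Hypothetical helper: crude check for whether a snapshot contains visible text
// once scripts, styles, and tags are stripped. Not an HTML parser.
function hasVisibleText(result: SnapshotLike, minChars = 50): boolean {
  const text = result.html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return text.length >= minChars;
}

// An unrendered SPA shell versus a snapshot that actually contains content.
const shell: SnapshotLike = {
  html: '<div id="root"></div><script>boot()</script>',
  errors: ["ReferenceError: boot is not defined"],
};
const rendered: SnapshotLike = {
  html: `<div id="root"><p>${"word ".repeat(20)}</p></div>`,
  errors: [],
};

console.log(hasVisibleText(shell), hasVisibleText(rendered));
```

When a check like this fails, or `errors` is non-empty, falling back to a real browser engine is the safer path.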
 
  ## Performance Tips