magpie-html 0.1.2 → 0.1.4
- package/README.md +35 -35
- package/dist/index.cjs +1787 -18
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +184 -1
- package/dist/index.d.ts +184 -1
- package/dist/index.js +1778 -19
- package/dist/index.js.map +1 -1
- package/package.json +8 -3
package/README.md
CHANGED
@@ -303,51 +303,51 @@ const text = await response.textUtf8();
 - Compatible with standard `fetch()` API
 - Named `pluck()` to avoid confusion (magpies pluck things! 🦅)
 
-##
+## Experimental: `swoop()` (client-side DOM rendering without a browser engine)
 
-
+> **⚠️ SECURITY WARNING — Remote Code Execution (RCE)**
+>
+> `swoop()` **executes remote, third‑party JavaScript inside your current Node.js process** (via `node:vm` + browser shims).
+> This is **fundamentally insecure**. Only use `swoop()` on **fully trusted targets** and treat inputs as **hostile by default**.
+> For any professional/untrusted scraping, run this in a **real sandbox** (container/VM/locked-down OS user + seccomp/apparmor/firejail, etc.).
 
--
-- **`gatherArticle(url)`** - Extract article content + metadata
-- **`gatherFeed(url)`** - Parse any feed format
+> **Note:** `magpie-html` does **not** use `swoop()` internally. It’s provided as an **optional standalone utility** for the few cases where you really need DOM-only client-side rendering.
 
-
+`swoop()` is an **explicitly experimental** helper that tries to execute client-side scripts against a **DOM-only** environment and then returns a **best-effort HTML snapshot**.
 
-
+### Why this exists
 
-
+Sometimes `curl` / `fetch` / `pluck()` isn’t enough because the page is a SPA and only renders content after client-side JavaScript runs.
+`swoop()` exists to **quickly turn “CSR-only” pages into HTML** so the rest of `magpie-html` can work with the result.
 
-
-- **`detectFormat(content)`** - Detect feed format
-- **`parseHTML(html)`** - Parse HTML to Document
+If it works, it can be **comparably light and fast** because it avoids a full browser engine by using a custom `node:vm`-based execution environment with browser shims.
 
-
+For very complicated targets (heavy JS, complex navigation, strong anti-bot, layout-dependent rendering), you should use a **real browser engine** instead.
 
-
-- **`htmlToText(html, options?)`** - Convert HTML to plain text
-- **`isProbablyReaderable(doc)`** - Check if content is article-like
+`swoop()` is best seen as a **building block**—you still need to provide the **real sandboxing** around it.
 
-###
+### What it is
+
+- A pragmatic “SPA snapshotter” for cases where a page renders content via client-side JavaScript.
+- **No browser engine**: no layout/paint/CSS correctness.
+
+### What it is NOT
+
+- Not a headless browser replacement (no navigation lifecycle, no reliable layout APIs).
+
+### Usage
 
-
-
-
-
-
-
-
-
-
-
-
-
-### Utilities
-
-- **`normalizeUrl(baseUrl, url)`** - Convert relative to absolute URLs
-- **`countWords(text)`** - Count words in text
-- **`calculateReadingTime(wordCount)`** - Estimate reading time
-
-See [TypeDoc documentation](https://anonyfox.github.io/magpie-html) for complete API reference.
+```typescript
+import { swoop } from "magpie-html";
+
+const result = await swoop("https://example.com/spa", {
+  waitStrategy: "networkidle",
+  timeout: 3000,
+});
+
+console.log(result.html);
+console.log(result.errors);
+```
 
 ## Performance Tips
 
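Reviewer note on the README hunk above: the new security warning says `swoop()` runs page scripts through `node:vm` plus browser shims. The sketch below is not magpie-html's implementation. `runInShim` and its one-property `document` shim are hypothetical stand-ins, used only to illustrate why a `node:vm` context should not be treated as a security boundary, which is presumably what motivates the warning.

```typescript
import * as vm from "node:vm";

// Hypothetical helper (an assumption, not part of magpie-html): run a script
// string in a fresh vm context that exposes a tiny stand-in for DOM shims.
function runInShim(code: string, timeoutMs = 100): unknown {
  const shim = { document: { title: "" } }; // plain host object used as the global
  const context = vm.createContext(shim);
  // `timeout` only bounds synchronous execution time; it does not isolate anything.
  return vm.runInContext(code, context, { timeout: timeoutMs });
}

// A well-behaved page script only touches what the shim exposes.
const title = runInShim("document.title = 'rendered'; document.title");

// A hostile page script walks the prototype chain of the shim object to the
// host realm's `Function` constructor and returns the real `process` object.
const leaked = runInShim("this.constructor.constructor('return process')()");

console.log(title);              // prints "rendered"
console.log(leaked === process); // prints true: the script reached the host process
```

Creating the shim with `Object.create(null)` would block this particular path, but any host object handed into the context can reopen it, so the README's advice to add OS-level isolation (container, VM, locked-down user) still applies.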