@purepageio/fetch-engines 0.6.0 → 0.6.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2)

package/README.md +81 -281
package/package.json +4 -3

package/README.md CHANGED

@@ -3,342 +3,142 @@
 [![npm version](https://img.shields.io/npm/v/@purepageio/fetch-engines.svg)](https://www.npmjs.com/package/@purepageio/fetch-engines)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-Web scraping requires handling static HTML, JavaScript-heavy sites, network errors, retries, caching, and bot detection. Managing browser automation tools like Playwright adds complexity with resource pooling and stealth configurations.
+Fetch websites with confidence. `@purepageio/fetch-engines` gives teams an HTTP-first workflow that automatically promotes tricky pages to a managed Playwright browser and can even hand structured results back through OpenAI.
-`@purepageio/fetch-engines` provides engines for retrieving web content through a unified API.
-**Key Benefits:**
-- **Unified API:** Use `fetchHTML(url, options?)` for processed content or `fetchContent(url, options?)` for raw content
-- **Smart Fallback Strategy:** Tries fast HTTP first, automatically falls back to full browser for complex sites
-- **AI-Powered Data Extraction:** Extract structured data from web pages using OpenAI and Zod schemas
-- **Raw Content Support:** Retrieve PDFs, images, APIs with the same fallback logic
-- **Built-in Resilience:** Caching, retries, and standardised error handling
-- **Browser Management:** Automatic browser pooling and stealth measures for complex sites
-- **Content Transformation:** Convert HTML to clean Markdown
-- **TypeScript Ready:** Fully typed codebase
-## Table of Contents
+## Table of contents
+- [Why fetch-engines?](#why-fetch-engines)
 - [Installation](#installation)
-- [Engines](#engines)
-- [Basic Usage](#basic-usage)
-- [fetchHTML vs fetchContent](#fetchhtml-vs-fetchcontent)
-- [Structured Content Extraction](#structured-content-extraction)
+- [Quick start](#quick-start)
+- [Usage patterns](#usage-patterns)
+  - [Pick an engine](#pick-an-engine)
+  - [Structured extraction](#structured-extraction)
 - [Configuration](#configuration)
-- [Return Value](#return-value)
-- [API Reference](#api-reference)
-- [Stealth Features](#stealth-features)
-- [Error Handling](#error-handling)
+  - [Essentials](#essentials)
+  - [Complete reference](#complete-reference)
+- [Error handling](#error-handling)
+- [Tooling and examples](#tooling-and-examples)
 - [Contributing](#contributing)
+- [License](#license)
-## Installation
+## Why fetch-engines?
-```bash
-pnpm add @purepageio/fetch-engines
-```
+- **One API for multiple strategies** – Call `fetchHTML` for rendered pages or `fetchContent` for raw responses. The library handles HTTP shortcuts and Playwright fallbacks automatically.
+- **Production-minded defaults** – Retries, caching, and consistent telemetry are ready out of the box.
+- **Drop-in AI enrichment** – Provide a Zod schema and let OpenAI convert full pages into structured data.
+- **Typed and tested** – Built in TypeScript with examples that mirror real-world scraping pipelines.
-For `HybridEngine` (uses Playwright), install browser binaries:
+## Installation
 ```bash
+pnpm add @purepageio/fetch-engines
+# install Playwright browsers once if you plan to use the Hybrid or Playwright engines
 pnpm exec playwright install
 ```
-## Engines
-**`HybridEngine`** (recommended): Attempts fast HTTP fetch first, falls back to Playwright browser on failure or when SPA shell detected. Handles both simple and complex sites automatically.
-**`FetchEngine`**: Lightweight HTTP-only engine for basic sites without browser fallback.
-**`StructuredContentEngine`**: AI-powered engine that combines HybridEngine with OpenAI for structured data extraction.
-## Basic Usage
-### Quick Start
+## Quick start
 ```typescript
 import { HybridEngine } from "@purepageio/fetch-engines";
 const engine = new HybridEngine();
-// Simple sites use fast HTTP
-const simple = await engine.fetchHTML("https://example.com");
-console.log(`Title: ${simple.title}`);
-// Complex sites automatically use browser
-const complex = await engine.fetchHTML("https://spa-site.com", {
-  markdown: true,
-  spaMode: true,
-});
-await engine.cleanup(); // Important: releases browser resources
-```
-### With Custom Headers
-```typescript
-const engine = new HybridEngine({
-  headers: { "X-Custom-Header": "value" },
-});
-const result = await engine.fetchHTML("https://example.com", {
-  headers: { "X-Request-Header": "value" },
-});
-```
-### Raw Content (PDFs, Images, APIs)
-```typescript
-const engine = new HybridEngine();
-// Fetch PDF
-const pdf = await engine.fetchContent("https://example.com/doc.pdf");
-console.log(`PDF size: ${pdf.content.length} bytes`);
-// Fetch JSON API with auth
-const api = await engine.fetchContent("https://api.example.com/data", {
-  headers: { Authorization: "Bearer token" },
-});
+const page = await engine.fetchHTML("https://example.com");
+console.log(page.title);
 await engine.cleanup();
 ```
-## fetchHTML vs fetchContent
-### `fetchHTML(url, options?)`
+## Usage patterns
-**Use for:** Web page content extraction
+### Pick an engine
-- Processes HTML and extracts metadata (title, etc.)
-- Supports HTML-to-Markdown conversion
-- Content-type restrictions (HTML/XML only)
-- Returns processed content as `string`
+| Engine                    | When to use it                                                                   |
+| ------------------------- | -------------------------------------------------------------------------------- |
+| `HybridEngine`            | Default option. Starts with HTTP, then retries via Playwright for tougher pages. |
+| `FetchEngine`             | Lightweight HTML/text fetching with zero browser overhead.                       |
+| `StructuredContentEngine` | Fetch a page and transform it into typed data with OpenAI.                       |
-### `fetchContent(url, options?)`
-**Use for:** Raw content retrieval (like standard `fetch`)
-- Retrieves any content type (PDFs, images, JSON, XML, etc.)
-- No content-type restrictions
-- Returns `Buffer` (binary) or `string` (text)
-- Preserves original MIME type
-### Example Comparison
-```typescript
-// fetchHTML - processes content
-const html = await engine.fetchHTML("https://example.com");
-console.log(html.title); // "Example Domain"
-console.log(html.contentType); // "html" or "markdown"
-// fetchContent - raw content
-const raw = await engine.fetchContent("https://example.com");
-console.log(raw.contentType); // "text/html"
-console.log(typeof raw.content); // "string" (raw HTML)
-// Binary content
-const pdf = await engine.fetchContent("https://example.com/doc.pdf");
-console.log(Buffer.isBuffer(pdf.content)); // true
-```
-## Structured Content Extraction
-Extract structured data from web pages using AI and Zod schemas.
-### Prerequisites
-Set environment variable:
-```bash
-export OPENAI_API_KEY="your-openai-api-key"
-```
-### Basic Usage
+### Structured extraction
 ```typescript
 import { fetchStructuredContent } from "@purepageio/fetch-engines";
 import { z } from "zod";
-const articleSchema = z.object({
+type Article = {
+  title: string;
+  summary: string;
+};
+const schema = z.object({
   title: z.string(),
-  author: z.string().optional(),
-  publishDate: z.string().optional(),
   summary: z.string(),
-  tags: z.array(z.string()),
-});
-const result = await fetchStructuredContent("https://example.com/article", articleSchema, {
-  model: "gpt-4.1-mini",
-  customPrompt: "Extract main article information",
-});
-console.log("Extracted:", result.data);
-console.log("Token usage:", result.usage);
-```
-### StructuredContentEngine Class
-```typescript
-import { StructuredContentEngine } from "@purepageio/fetch-engines";
-const productSchema = z.object({
-  name: z.string(),
-  price: z.number(),
-  inStock: z.boolean(),
-});
-const engine = new StructuredContentEngine({
-  spaMode: true,
-  spaRenderDelayMs: 2000,
 });
-const result = await engine.fetchStructuredContent("https://shop.com/product", productSchema);
-console.log(`${result.data.name} costs $${result.data.price}`);
+const result = await fetchStructuredContent("https://example.com/article", schema, { model: "gpt-4.1-mini" });
-await engine.cleanup();
+console.log(result.data.summary);
 ```
-### Supported Models
-- `'gpt-5-mini'` - Latest model, mini version **(default)**
-- `'gpt-5'` - Most capable model
-- `'gpt-4.1-mini'` - Fast and cost-effective
-- `'gpt-4.1'` - More capable GPT-4.1 version
+Set `OPENAI_API_KEY` before running structured helpers.
 ## Configuration
-### FetchEngine Options
-| Option     | Type                     | Default | Description              |
-| ---------- | ------------------------ | ------- | ------------------------ |
-| `markdown` | `boolean`                | `false` | Convert HTML to Markdown |
-| `headers`  | `Record<string, string>` | `{}`    | Custom HTTP headers      |
-### HybridEngine Configuration
-| Option             | Type                     | Default  | Description                                  |
-| ------------------ | ------------------------ | -------- | -------------------------------------------- |
-| `headers`          | `Record<string, string>` | `{}`     | Default headers for both engines             |
-| `markdown`         | `boolean`                | `false`  | Default Markdown conversion                  |
-| `useHttpFallback`  | `boolean`                | `true`   | Try HTTP before Playwright                   |
-| `spaMode`          | `boolean`                | `false`  | Enable SPA mode with patient load conditions |
-| `spaRenderDelayMs` | `number`                 | `0`      | Delay after page load in SPA mode            |
-| `maxRetries`       | `number`                 | `3`      | Max retry attempts                           |
-| `cacheTTL`         | `number`                 | `900000` | Cache TTL in ms (15 min default)             |
-| `concurrentPages`  | `number`                 | `3`      | Max concurrent pages                         |
-### Browser Pool Options
-| Option               | Type     | Default   | Description                        |
-| -------------------- | -------- | --------- | ---------------------------------- |
-| `maxBrowsers`        | `number` | `2`       | Max browser instances              |
-| `maxPagesPerContext` | `number` | `6`       | Pages per context before recycling |
-| `maxBrowserAge`      | `number` | `1200000` | Browser lifetime (20 min)          |
+### Essentials
-### Header Precedence
+All engines accept familiar `fetch` options such as custom headers. Additional Hybrid/Playwright options you are likely to tweak:
-Headers merge in this order (highest precedence first):
+- `markdown` – return Markdown instead of HTML.
+- `spaMode` & `spaRenderDelayMs` – allow single-page apps to render before extraction.
+- `cacheTTL`, `maxRetries`, and browser pool sizes – control resilience and throughput.
-1. Request-specific headers in `fetchHTML(url, { headers })`
-2. Engine constructor headers
-3. Engine default headers
+Check the inline TypeScript docs or the [`/examples`](./examples) directory for end-to-end flows.
-## Return Value
+### Complete reference
-### HTMLFetchResult (fetchHTML)
+Every option from `PlaywrightEngineConfig` (consumed by `HybridEngine`) with defaults:
-- `content` (`string`): HTML or Markdown content
-- `contentType` (`'html' | 'markdown'`): Content format
-- `title` (`string | null`): Extracted page title
-- `url` (`string`): Final URL after redirects
-- `isFromCache` (`boolean`): Cache hit indicator
-- `statusCode` (`number | undefined`): HTTP status code
+| Option                     | Default     | Purpose                                                                                       |
+| -------------------------- | ----------- | --------------------------------------------------------------------------------------------- |
+| `headers`                  | `{}`        | Extra headers merged into every request.                                                      |
+| `concurrentPages`          | `3`         | Maximum Playwright pages processed at once.                                                   |
+| `maxRetries`               | `3`         | Additional retry attempts after the first failure.                                            |
+| `retryDelay`               | `5000`      | Milliseconds to wait between retries.                                                         |
+| `cacheTTL`                 | `900000`    | Cache lifetime in ms (`0` disables caching).                                                  |
+| `useHttpFallback`          | `true`      | Try a fast HTTP GET before spinning up Playwright.                                            |
+| `useHeadedModeFallback`    | `false`     | Automatically retry a domain in headed mode after repeated failures.                          |
+| `defaultFastMode`          | `true`      | Block non-critical assets and skip human simulation unless overridden.                        |
+| `simulateHumanBehavior`    | `true`      | When not in fast mode, add delays and scrolling to avoid bot detection.                       |
+| `maxBrowsers`              | `2`         | Highest number of Playwright browser instances kept in the pool.                              |
+| `maxPagesPerContext`       | `6`         | Pages opened per browser context before recycling it.                                         |
+| `maxBrowserAge`            | `1200000`   | Milliseconds before a browser instance is torn down (20 minutes).                             |
+| `healthCheckInterval`      | `60000`     | Pool health check frequency in ms.                                                            |
+| `poolBlockedDomains`       | `[]`        | Domains blocked across every Playwright request (inherit pool defaults if empty).             |
+| `poolBlockedResourceTypes` | `[]`        | Resource types (e.g. `"image"`) blocked globally.                                             |
+| `proxy`                    | `undefined` | Per-browser proxy `{ server, username?, password? }`.                                         |
+| `useHeadedMode`            | `false`     | Force every browser to launch with a visible window.                                          |
+| `markdown`                 | `true`      | Return Markdown (instead of HTML) when possible. Override per request with `markdown: false`. |
+| `spaMode`                  | `false`     | Enable SPA heuristics and allow additional waits for client rendering.                        |
+| `spaRenderDelayMs`         | `0`         | Extra delay after load when `spaMode` is `true`.                                              |
+| `playwrightOnlyPatterns`   | `[]`        | URLs matching any string/regex go straight to Playwright, skipping HTTP fetches.              |
+| `playwrightLaunchOptions`  | `undefined` | Options passed to `browserType.launch` (see Playwright docs).                                 |
-### ContentFetchResult (fetchContent)
+Per-request overrides: `fetchHTML` accepts `fastMode`, `markdown`, `spaMode`, and `headers`, while `fetchContent` supports `fastMode` and `headers`.
-- `content` (`Buffer | string`): Raw content (binary as Buffer, text as string)
-- `contentType` (`string`): Original MIME type
-- `title` (`string | null`): Title if HTML content, otherwise null
-- `url` (`string`): Final URL after redirects
-- `isFromCache` (`boolean`): Cache hit indicator
-- `statusCode` (`number | undefined`): HTTP status code
+## Error handling
-## API Reference
+Failures raise a typed `FetchError` exposing `code`, `statusCode`, and the underlying error. Log these fields to diagnose issues quickly and tune your retry policy.
-### `engine.fetchHTML(url, options?)`
+## Tooling and examples
-- `url` (`string`): Target URL
-- `options?` (`FetchOptions`):
-  - `headers?: Record<string, string>`: Request headers
-  - `markdown?: boolean`: Request Markdown (HybridEngine only)
-  - `fastMode?: boolean`: Override fast mode (HybridEngine only)
-  - `spaMode?: boolean`: Override SPA mode (HybridEngine only)
-- **Returns:** `Promise<HTMLFetchResult>`
-### `engine.fetchContent(url, options?)`
-- `url` (`string`): Target URL
-- `options?` (`ContentFetchOptions`):
-  - `headers?: Record<string, string>`: Request headers
-- **Returns:** `Promise<ContentFetchResult>`
-### `fetchStructuredContent(url, schema, options?)`
-- `url` (`string`): Target URL
-- `schema` (`z.ZodSchema<T>`): Zod schema for extraction
-- `options?` (`StructuredContentOptions`):
-  - `model?: string`: OpenAI model (default: 'gpt-5-mini')
-  - `customPrompt?: string`: Additional AI context
-  - `engineConfig?: PlaywrightEngineConfig`: HybridEngine config
-- **Returns:** `Promise<StructuredContentResult<T>>`
-### `engine.cleanup()`
-Shuts down browser instances for `HybridEngine` and `StructuredContentEngine`. Call when finished to release resources. No-op for `FetchEngine`.
-## Stealth Features
-When `HybridEngine` uses Playwright, it automatically applies stealth measures via `playwright-extra` and stealth plugins to bypass common bot detection. No manual configuration required.
-Stealth techniques are not foolproof against sophisticated detection systems.
-## Error Handling
-Errors are thrown as `FetchError` instances with additional context:
-- `message` (`string`): Error description
-- `code` (`string | undefined`): Specific error code
-- `originalError` (`Error | undefined`): Underlying error
-- `statusCode` (`number | undefined`): HTTP status code
-Common error codes:
-- `ERR_HTTP_ERROR`: HTTP status >= 400
-- `ERR_NON_HTML_CONTENT`: Non-HTML content for HTML request
-- `ERR_FETCH_FAILED`: General fetch operation failure
-- `ERR_PLAYWRIGHT_OPERATION`: Playwright operation failure
-- `ERR_NAVIGATION`: Navigation timeout or failure
-- `ERR_BROWSER_POOL_EXHAUSTED`: No available browser resources
-- `ERR_MAX_RETRIES_REACHED`: All retry attempts exhausted
-- `ERR_MARKDOWN_CONVERSION_NON_HTML`: Markdown conversion on non-HTML content
-```typescript
-import { HybridEngine } from "@purepageio/fetch-engines";
-const engine = new HybridEngine();
-try {
-  const result = await engine.fetchHTML(url);
-} catch (error: any) {
-  console.error(`Error: ${error.code || "Unknown"} - ${error.message}`);
-  if (error.statusCode) console.error(`Status: ${error.statusCode}`);
-}
-```
+- Explore the [`examples`](./examples) directory for scripts you can run end-to-end.
+- Ready-to-use TypeScript types ship with the package.
+- `pnpm test` runs the automated suite when you are ready to contribute.
 ## Contributing
-Contributions welcome! Open an issue or submit a pull request on [GitHub](https://github.com/purepageio/fetch-engines).
+Issues and pull requests are welcome! Please follow the existing linting/test commands before sending a change.
 ## License
-MIT
+Distributed under the [MIT](./LICENSE) license.
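
Note on the error-handling change: the 0.6.2 README reduces the section to one sentence and drops the earlier try/catch example. A minimal sketch of logging the fields it names (`code`, `statusCode`, and the underlying error) could look like the following. `FetchErrorLike`, `isFetchErrorLike`, and `describeFetchError` are hypothetical names introduced here for illustration; they are not part of the package. The field names are taken from the 0.6.0 lines removed above, and the real `FetchError` class may differ.

```typescript
// Hypothetical shape mirroring the fields listed in the 0.6.0 README.
// The real FetchError is exported by the package and may differ.
interface FetchErrorLike {
  message: string;
  code?: string;
  statusCode?: number;
  originalError?: Error;
}

// Narrow an unknown thrown value to an object that carries a string message.
function isFetchErrorLike(err: unknown): err is FetchErrorLike {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { message?: unknown }).message === "string"
  );
}

// Build a single log line from whichever fields are present.
function describeFetchError(err: unknown): string {
  if (!isFetchErrorLike(err)) return `Unknown error: ${String(err)}`;
  const parts = [`${err.code ?? "Unknown"} - ${err.message}`];
  if (err.statusCode !== undefined) parts.push(`status ${err.statusCode}`);
  if (err.originalError) parts.push(`cause: ${err.originalError.message}`);
  return parts.join("; ");
}
```

In a `catch (error)` block around `engine.fetchHTML(url)`, calling `console.error(describeFetchError(error))` would cover the same ground as the removed example without typing the caught value as `any`.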

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@purepageio/fetch-engines",
-  "version": "0.6.0",
+  "version": "0.6.2",
   "type": "module",
   "description": "A collection of configurable engines for fetching HTML content using fetch or Playwright.",
   "main": "dist/index.js",
@@ -13,10 +13,10 @@
   "dependencies": {
     "@ai-sdk/openai": "^2.0.30",
     "ai": "^5.0.44",
-    "axios": "^1.6.8",
+    "axios": "^1.12.0",
     "node-html-parser": "^7.0.1",
     "p-queue": "^7.4.1",
-    "playwright": "^1.43.0",
+    "playwright": "^1.55.1",
     "playwright-extra": "^4.3.6",
     "puppeteer-extra-plugin": "^3.2.3",
     "puppeteer-extra-plugin-stealth": "^2.11.2",
@@ -27,6 +27,7 @@
     "zod": "^4.1.8"
   },
   "devDependencies": {
+    "@playwright/test": "^1.55.1",
     "@types/axios": "^0.14.0",
     "@types/jsdom": "^21.1.6",
     "@types/node": "^18.0.0",
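
Note on the dependency bumps: both `axios` and `playwright` keep caret ranges, so only the lower bound moves. The sketch below illustrates which installed versions each range accepts. It is a simplified stand-in for npm's range matching, not the `semver` package: `parseVersion`, `compare`, and `satisfiesCaret` are hypothetical helpers, and prerelease or build tags are deliberately unsupported.

```typescript
type Version = [number, number, number];

// Parse a plain "MAJOR.MINOR.PATCH" string; anything else is rejected in this sketch.
function parseVersion(v: string): Version {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error(`Unsupported version string: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compare(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// A caret range accepts versions from the stated floor up to, but excluding,
// the next increment of the leftmost non-zero component.
function satisfiesCaret(version: string, range: string): boolean {
  if (!range.startsWith("^")) throw new Error("Only caret ranges are handled here");
  const floor = parseVersion(range.slice(1));
  const ceiling: Version =
    floor[0] > 0 ? [floor[0] + 1, 0, 0]
    : floor[1] > 0 ? [0, floor[1] + 1, 0]
    : [0, 0, floor[2] + 1];
  const v = parseVersion(version);
  return compare(v, floor) >= 0 && compare(v, ceiling) < 0;
}

console.log(satisfiesCaret("1.6.8", "^1.12.0"));  // false: below the new axios floor
console.log(satisfiesCaret("1.12.2", "^1.12.0")); // true
console.log(satisfiesCaret("1.43.0", "^1.55.1")); // false: below the new playwright floor
```

In practice this means a lockfile that pinned `axios` 1.6.x or `playwright` 1.43.x under 0.6.0 no longer satisfies the 0.6.2 ranges and will be re-resolved on install.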