@purepageio/fetch-engines 0.5.1-rc.0 → 0.6.0

package/README.md CHANGED
@@ -3,546 +3,341 @@
3
3
  [![npm version](https://img.shields.io/npm/v/@purepageio/fetch-engines.svg)](https://www.npmjs.com/package/@purepageio/fetch-engines)
4
4
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
5
 
6
- Fetching web content can be complex. You need to handle static HTML, dynamic JavaScript-driven sites, network errors, retries, caching, and potential bot detection measures. Managing browser automation tools like Playwright adds another layer of complexity with resource pooling and stealth configurations.
6
+ Web scraping requires handling static HTML, JavaScript-heavy sites, network errors, retries, caching, and bot detection. Managing browser automation tools like Playwright adds complexity with resource pooling and stealth configurations.
7
7
 
8
- `@purepageio/fetch-engines` simplifies this entire process by providing a set of robust, configurable, and easy-to-use engines for retrieving web page content.
8
+ `@purepageio/fetch-engines` provides engines for retrieving web content through a unified API.
9
9
 
10
- **Why use `@purepageio/fetch-engines`?**
10
+ **Key Benefits:**
11
11
 
12
- - **Unified API:** Get content from simple or complex sites using the same `fetchHTML(url, options?)` method for processed content or `fetchContent(url, options?)` for raw content.
13
- - **Flexible Strategies:** Choose the right tool for the job:
14
- - `FetchEngine`: Lightweight and fast for static HTML, using the standard `fetch` API. Ideal for speed and efficiency with content that doesn't require JavaScript rendering. Supports custom headers.
15
- - `HybridEngine`: The best of both worlds – tries `FetchEngine` first for speed, automatically falls back to a powerful browser engine (internally, `PlaywrightEngine`) for reliability on complex, JavaScript-heavy pages. Supports custom headers.
16
- - **Raw Content Support:** Use `fetchContent()` to retrieve any type of content (PDFs, images, APIs, etc.) with the same smart fallback logic as `fetchHTML()`.
17
- - **Robust & Resilient:** Built-in caching, configurable retries, and standardized error handling make your fetching logic more dependable.
18
- - **Simplified Automation:** When `HybridEngine` uses its browser capabilities (via the internal `PlaywrightEngine`), it manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
19
- - **Content Transformation:** Optionally convert fetched HTML directly to clean Markdown content.
20
- - **TypeScript Ready:** Fully typed for a better development experience.
21
-
22
- This package provides a high-level abstraction, letting you focus on using the web content rather than the intricacies of fetching it.
12
+ - **Unified API:** Use `fetchHTML(url, options?)` for processed content or `fetchContent(url, options?)` for raw content
13
+ - **Smart Fallback Strategy:** Tries fast HTTP first, automatically falls back to full browser for complex sites
14
+ - **AI-Powered Data Extraction:** Extract structured data from web pages using OpenAI and Zod schemas
15
+ - **Raw Content Support:** Retrieve PDFs, images, APIs with the same fallback logic
16
+ - **Built-in Resilience:** Caching, retries, and standardised error handling
17
+ - **Browser Management:** Automatic browser pooling and stealth measures for complex sites
18
+ - **Content Transformation:** Convert HTML to clean Markdown
19
+ - **TypeScript Ready:** Fully typed codebase
23
20
 
24
21
  ## Table of Contents
25
22
 
26
- - [Features](#features)
27
23
  - [Installation](#installation)
28
24
  - [Engines](#engines)
29
25
  - [Basic Usage](#basic-usage)
30
26
  - [fetchHTML vs fetchContent](#fetchhtml-vs-fetchcontent)
27
+ - [Structured Content Extraction](#structured-content-extraction)
31
28
  - [Configuration](#configuration)
32
29
  - [Return Value](#return-value)
33
30
  - [API Reference](#api-reference)
34
- - [Stealth / Anti-Detection (`PlaywrightEngine`)](#stealth--anti-detection-playwrightengine)
31
+ - [Stealth Features](#stealth-features)
35
32
  - [Error Handling](#error-handling)
36
- - [Logging](#logging)
37
33
  - [Contributing](#contributing)
38
- - [License](#license)
39
-
40
- ## Features
41
-
42
- - **Multiple Fetching Strategies:** Choose between `FetchEngine` (lightweight `fetch`) or `HybridEngine` (smart fallback to a full browser engine).
43
- - **Unified API:** Simple `fetchHTML(url, options?)` interface for processed content and `fetchContent(url, options?)` for raw content across both primary engines.
44
- - **Raw Content Fetching:** Use `fetchContent()` to retrieve any type of content (PDFs, images, JSON, XML, etc.) without HTML processing or content-type restrictions.
45
- - **Custom Headers:** Easily provide custom HTTP headers for requests in both `FetchEngine` and `HybridEngine`.
46
- - **Configurable Retries:** Automatic retries on failure with customizable attempts and delays.
47
- - **Built-in Caching:** In-memory caching with configurable TTL to reduce redundant fetches.
48
- - **Playwright Stealth:** When `HybridEngine` utilizes its browser capabilities, it automatically integrates `playwright-extra` and stealth plugins to bypass common bot detection.
49
- - **Managed Browser Pooling:** Efficient resource management for `HybridEngine`'s browser mode with configurable browser/context limits and lifecycles.
50
- - **Smart Fallbacks:** `HybridEngine` uses `FetchEngine` first, falling back to its internal browser engine only when needed. The internal browser engine can also optionally use a fast HTTP fetch before launching a full browser.
51
- - **Content Conversion:** Optionally convert fetched HTML directly to Markdown, preserving `<table>` elements even without header rows.
52
- - **Standardized Errors:** Custom `FetchError` classes provide context on failures.
53
- - **TypeScript Ready:** Fully typed codebase for enhanced developer experience.
54
34
 
55
35
  ## Installation
56
36
 
57
37
  ```bash
58
38
  pnpm add @purepageio/fetch-engines
59
- # or with npm
60
- npm install @purepageio/fetch-engines
61
- # or with yarn
62
- yarn add @purepageio/fetch-engines
63
39
  ```
64
40
 
65
- If you plan to use the `HybridEngine` (which internally uses Playwright for advanced fetching), you also need to install Playwright's browser binaries:
41
+ For `HybridEngine` (uses Playwright), install browser binaries:
66
42
 
67
43
  ```bash
68
44
  pnpm exec playwright install
69
- # or
70
- npx playwright install
71
45
  ```
72
46
 
73
47
  ## Engines
74
48
 
75
- - **`FetchEngine`**: Uses the standard `fetch` API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast. This is your go-to for speed and efficiency when JavaScript rendering is not required.
76
- - **`HybridEngine`**: A smart combination. It first attempts to fetch content using the lightweight `FetchEngine`. If that fails for _any_ reason (e.g., network error, non-HTML content, HTTP error like 403), or if `spaMode` is enabled and an SPA shell is detected, it automatically falls back to using an internal, powerful browser engine (based on Playwright). This provides the speed of `FetchEngine` for simple sites while retaining the power of a full browser for complex, dynamic websites. This is recommended for most general-purpose fetching tasks.
77
- - **`PlaywrightEngine` (Internal Component)**: While not recommended for direct use by most users, `PlaywrightEngine` is the component `HybridEngine` uses internally for its browser-based fetching. It manages Playwright browser instances, contexts, and stealth features. Users needing direct, low-level control over Playwright might consider it, but `HybridEngine` offers a more robust and flexible approach for most scenarios.
49
+ **`HybridEngine`** (recommended): Attempts fast HTTP fetch first, falls back to Playwright browser on failure or when SPA shell detected. Handles both simple and complex sites automatically.
50
+
51
+ **`FetchEngine`**: Lightweight HTTP-only engine for basic sites without browser fallback.
52
+
53
+ **`StructuredContentEngine`**: AI-powered engine that combines HybridEngine with OpenAI for structured data extraction.
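
The "fast HTTP first, browser on failure" behaviour described for `HybridEngine` can be pictured with a small, library-independent sketch. The names `withFallback`, `Attempt`, and `needsFallback` are hypothetical and do not come from `@purepageio/fetch-engines`; this shows the general pattern only, not the package's internals.

```typescript
// Hypothetical sketch of a "fast path first, fall back on failure" strategy.
// None of these names are part of @purepageio/fetch-engines.
type Attempt<T> = () => Promise<T>;

async function withFallback<T>(
  fast: Attempt<T>,
  slow: Attempt<T>,
  needsFallback: (result: T) => boolean = () => false,
): Promise<{ value: T; usedFallback: boolean }> {
  try {
    const value = await fast();
    // A successful fast result can still be rejected, e.g. an empty SPA shell.
    if (!needsFallback(value)) {
      return { value, usedFallback: false };
    }
  } catch {
    // Any error on the fast path falls through to the slow path below.
  }
  return { value: await slow(), usedFallback: true };
}
```

In this sketch the `needsFallback` predicate plays the role of SPA shell detection: the fast result is discarded when the predicate returns `true`, even though the request itself succeeded.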
78
54
 
79
55
  ## Basic Usage
80
56
 
81
- ### FetchEngine
57
+ ### Quick Start
82
58
 
83
59
  ```typescript
84
- import { FetchEngine } from "@purepageio/fetch-engines";
85
-
86
- const engine = new FetchEngine(); // Default: fetches HTML
87
-
88
- async function main() {
89
- try {
90
- const url = "https://example.com";
91
- const result = await engine.fetchHTML(url);
92
- console.log(`Fetched ${result.url} (ContentType: ${result.contentType})`);
93
- console.log(`Title: ${result.title}`);
94
- console.log(`Content (HTML): ${result.content.substring(0, 100)}...`);
95
-
96
- // Example fetching Markdown directly via constructor option
97
- const markdownEngine = new FetchEngine({ markdown: true });
98
- const mdResult = await markdownEngine.fetchHTML(url);
99
- console.log(`\nFetched ${mdResult.url} (ContentType: ${mdResult.contentType})`);
100
- console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
101
- } catch (error) {
102
- console.error("Fetch failed:", error);
103
- }
104
- }
105
- main();
60
+ import { HybridEngine } from "@purepageio/fetch-engines";
61
+
62
+ const engine = new HybridEngine();
63
+
64
+ // Simple sites use fast HTTP
65
+ const simple = await engine.fetchHTML("https://example.com");
66
+ console.log(`Title: ${simple.title}`);
67
+
68
+ // Complex sites automatically use browser
69
+ const complex = await engine.fetchHTML("https://spa-site.com", {
70
+ markdown: true,
71
+ spaMode: true,
72
+ });
73
+
74
+ await engine.cleanup(); // Important: releases browser resources
106
75
  ```
107
76
 
108
- ### HybridEngine
77
+ ### With Custom Headers
109
78
 
110
79
  ```typescript
111
- import { HybridEngine } from "@purepageio/fetch-engines";
112
-
113
- // Engine configured to fetch HTML by default for its internal engines
114
- // and provide some custom headers for all requests made by HybridEngine.
115
80
  const engine = new HybridEngine({
116
- markdown: false,
117
- headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
118
- // Other PlaywrightEngine specific configs can be set here for the fallback mechanism
119
- // e.g., playwrightLaunchOptions: { args: ["--disable-gpu"] }
81
+ headers: { "X-Custom-Header": "value" },
120
82
  });
121
83
 
122
- async function main() {
123
- try {
124
- const urlSimple = "https://example.com"; // Simple site, likely handled by FetchEngine
125
- const urlComplex = "https://quotes.toscrape.com/"; // JS-heavy site, likely requiring Playwright fallback
126
-
127
- // --- Scenario 1: FetchEngine part of HybridEngine handles it ---
128
- console.log(`\nFetching simple site (${urlSimple}) with per-request headers...`);
129
- const result1 = await engine.fetchHTML(urlSimple, {
130
- headers: { "X-Request-Specific": "SimpleRequestValue" },
131
- });
132
- // FetchEngine (via HybridEngine) will use:
133
- // 1. Its base default headers (User-Agent etc.)
134
- // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
135
- // 3. Overridden/augmented by per-request headers ("X-Request-Specific")
136
- console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
137
- console.log(`Content (HTML): ${result1.content.substring(0, 100)}...`);
138
-
139
- // --- Scenario 2: Playwright part of HybridEngine handles it ---
140
- console.log(`\nFetching complex site (${urlComplex}) requesting Markdown and with per-request headers...`);
141
- const result2 = await engine.fetchHTML(urlComplex, {
142
- markdown: true,
143
- headers: { "X-Request-Specific": "ComplexRequestValue", "X-Another": "ComplexAnother" },
144
- });
145
- // PlaywrightEngine (via HybridEngine) will use:
146
- // 1. Its base default headers (User-Agent etc. if doing HTTP fallback, or for page.setExtraHTTPHeaders)
147
- // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
148
- // 3. Overridden/augmented by per-request headers ("X-Request-Specific", "X-Another")
149
- // The markdown: true option will be respected by the Playwright part.
150
- console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
151
- console.log(`Content (Markdown):\n${result2.content.substring(0, 300)}...`);
152
- } catch (error) {
153
- console.error("Hybrid fetch failed:", error);
154
- } finally {
155
- await engine.cleanup(); // Important for HybridEngine
156
- }
157
- }
158
- main();
84
+ const result = await engine.fetchHTML("https://example.com", {
85
+ headers: { "X-Request-Header": "value" },
86
+ });
159
87
  ```
160
88
 
161
- ### Raw Content Fetching
89
+ ### Raw Content (PDFs, Images, APIs)
162
90
 
163
91
  ```typescript
164
- import { HybridEngine } from "@purepageio/fetch-engines";
165
-
166
92
  const engine = new HybridEngine();
167
93
 
168
- async function fetchRawContent() {
169
- try {
170
- // Fetch a PDF document
171
- const pdfResult = await engine.fetchContent("https://example.com/document.pdf");
172
- console.log(`PDF Content-Type: ${pdfResult.contentType}`);
173
- console.log(
174
- `PDF Size: ${Buffer.isBuffer(pdfResult.content) ? pdfResult.content.length : pdfResult.content.length} bytes`
175
- );
176
-
177
- // Fetch JSON API
178
- const jsonResult = await engine.fetchContent("https://api.example.com/data");
179
- console.log(`JSON Content-Type: ${jsonResult.contentType}`);
180
- console.log(`JSON Data: ${typeof jsonResult.content === "string" ? jsonResult.content : "Binary data"}`);
181
-
182
- // Fetch with custom headers
183
- const customResult = await engine.fetchContent("https://protected-api.example.com/data", {
184
- headers: {
185
- Authorization: "Bearer your-token",
186
- Accept: "application/json",
187
- },
188
- });
189
- console.log(`Custom fetch result: ${customResult.statusCode}`);
190
- } catch (error) {
191
- console.error("Raw content fetch failed:", error);
192
- } finally {
193
- await engine.cleanup();
194
- }
195
- }
196
- fetchRawContent();
94
+ // Fetch PDF
95
+ const pdf = await engine.fetchContent("https://example.com/doc.pdf");
96
+ console.log(`PDF size: ${pdf.content.length} bytes`);
97
+
98
+ // Fetch JSON API with auth
99
+ const api = await engine.fetchContent("https://api.example.com/data", {
100
+ headers: { Authorization: "Bearer token" },
101
+ });
102
+
103
+ await engine.cleanup();
197
104
  ```
198
105
 
199
106
  ## fetchHTML vs fetchContent
200
107
 
201
- Choose the right method for your use case:
202
-
203
108
  ### `fetchHTML(url, options?)`
204
109
 
205
- **Use when:** You want to extract and process web page content.
206
-
207
- **Features:**
110
+ **Use for:** Web page content extraction
208
111
 
209
- - Processes HTML content and extracts metadata (title, etc.)
112
+ - Processes HTML and extracts metadata (title, etc.)
210
113
  - Supports HTML-to-Markdown conversion
211
- - Optimized for web page content
212
114
  - Content-type restrictions (HTML/XML only)
213
115
  - Returns processed content as `string`
214
116
 
215
- **Best for:**
216
-
217
- - Web scraping
218
- - Content extraction
219
- - Blog/article processing
220
- - Any scenario where you need structured HTML or Markdown
221
-
222
117
  ### `fetchContent(url, options?)`
223
118
 
224
- **Use when:** You want raw content without processing, mimicking standard `fetch()` behavior.
225
-
226
- **Features:**
119
+ **Use for:** Raw content retrieval (like standard `fetch`)
227
120
 
228
121
  - Retrieves any content type (PDFs, images, JSON, XML, etc.)
229
122
  - No content-type restrictions
230
- - Returns raw content as `Buffer` (binary) or `string` (text)
231
- - Preserves original MIME type information
232
- - Minimal processing overhead
233
-
234
- **Best for:**
235
-
236
- - API consumption
237
- - File downloads (PDFs, images, etc.)
238
- - Binary content retrieval
239
- - Any scenario where you need the raw response
123
+ - Returns `Buffer` (binary) or `string` (text)
124
+ - Preserves original MIME type
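
Because `fetchContent` returns either a `Buffer` or a `string`, callers typically narrow the type before using the content. A minimal sketch, assuming a Node.js runtime; `contentSizeInBytes` is a hypothetical helper and not part of the package.

```typescript
// Illustrative helper, not part of the package: report the size of content
// that may be either binary (Buffer) or text (string).
function contentSizeInBytes(content: Buffer | string): number {
  return Buffer.isBuffer(content)
    ? content.length // a Buffer's length is already a byte count
    : Buffer.byteLength(content, "utf8"); // string.length counts UTF-16 units, not bytes
}
```

For text content, `content.length` on a string is a character-unit count, so it only matches the byte size for pure ASCII.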
240
125
 
241
126
  ### Example Comparison
242
127
 
243
128
  ```typescript
244
- import { HybridEngine } from "@purepageio/fetch-engines";
245
-
246
- const engine = new HybridEngine();
247
-
248
- // fetchHTML - for web page content
249
- const htmlResult = await engine.fetchHTML("https://example.com");
250
- console.log(htmlResult.title); // "Example Domain"
251
- console.log(htmlResult.contentType); // "html" or "markdown"
252
- console.log(typeof htmlResult.content); // "string" (processed HTML/Markdown)
253
-
254
- // fetchContent - for raw content
255
- const contentResult = await engine.fetchContent("https://example.com");
256
- console.log(contentResult.title); // "Example Domain" (extracted but not processed)
257
- console.log(contentResult.contentType); // "text/html" (original MIME type)
258
- console.log(typeof contentResult.content); // "string" (raw HTML)
259
-
260
- // fetchContent - for non-HTML content
261
- const pdfResult = await engine.fetchContent("https://example.com/doc.pdf");
262
- console.log(pdfResult.contentType); // "application/pdf"
263
- console.log(Buffer.isBuffer(pdfResult.content)); // true (binary content)
129
+ // fetchHTML - processes content
130
+ const html = await engine.fetchHTML("https://example.com");
131
+ console.log(html.title); // "Example Domain"
132
+ console.log(html.contentType); // "html" or "markdown"
133
+
134
+ // fetchContent - raw content
135
+ const raw = await engine.fetchContent("https://example.com");
136
+ console.log(raw.contentType); // "text/html"
137
+ console.log(typeof raw.content); // "string" (raw HTML)
138
+
139
+ // Binary content
140
+ const pdf = await engine.fetchContent("https://example.com/doc.pdf");
141
+ console.log(Buffer.isBuffer(pdf.content)); // true
264
142
  ```
265
143
 
266
- ## Configuration
144
+ ## Structured Content Extraction
267
145
 
268
- Engines accept an optional configuration object in their constructor to customise behavior.
146
+ Extract structured data from web pages using AI and Zod schemas.
269
147
 
270
- ### FetchEngine
148
+ ### Prerequisites
271
149
 
272
- The `FetchEngine` accepts a `FetchEngineOptions` object with the following properties:
150
+ Set environment variable:
273
151
 
274
- | Option | Type | Default | Description |
275
- | ---------- | ------------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
276
- | `markdown` | `boolean` | `false` | If `true`, converts fetched HTML to Markdown. `contentType` in the result will be set to `'markdown'`. |
277
- | `headers` | `Record<string, string>` | `{}` | Custom HTTP headers to be sent with the request. These are merged with and can override the engine's default headers. Headers from `fetchHTML` options take higher precedence. |
152
+ ```bash
153
+ export OPENAI_API_KEY="your-openai-api-key"
154
+ ```
155
+
156
+ ### Basic Usage
278
157
 
279
158
  ```typescript
280
- // Example: FetchEngine with custom headers and Markdown conversion
281
- const customFetchEngine = new FetchEngine({
282
- markdown: true,
283
- headers: {
284
- "User-Agent": "MyCustomFetchAgent/1.0",
285
- "X-Api-Key": "your-api-key",
286
- },
159
+ import { fetchStructuredContent } from "@purepageio/fetch-engines";
160
+ import { z } from "zod";
161
+
162
+ const articleSchema = z.object({
163
+ title: z.string(),
164
+ author: z.string().optional(),
165
+ publishDate: z.string().optional(),
166
+ summary: z.string(),
167
+ tags: z.array(z.string()),
287
168
  });
169
+
170
+ const result = await fetchStructuredContent("https://example.com/article", articleSchema, {
171
+ model: "gpt-4.1-mini",
172
+ customPrompt: "Extract main article information",
173
+ });
174
+
175
+ console.log("Extracted:", result.data);
176
+ console.log("Token usage:", result.usage);
288
177
  ```
289
178
 
290
- #### Header Precedence for `FetchEngine`:
291
-
292
- 1. Headers passed in `fetchHTML(url, { headers: { ... } })` (highest precedence).
293
- 2. Headers passed in the `FetchEngine` constructor `new FetchEngine({ headers: { ... } })`.
294
- 3. Default headers of the `FetchEngine` (e.g., its default `User-Agent`) (lowest precedence).
295
-
296
- ### `PlaywrightEngineConfig` (Used by `HybridEngine`)
297
-
298
- The `HybridEngine` constructor accepts a `PlaywrightEngineConfig` object. These settings configure the underlying `FetchEngine` and `PlaywrightEngine` (for fallback scenarios) and the hybrid strategy itself. When using `HybridEngine`, you are essentially configuring how it will behave and how its internal Playwright capabilities will operate if needed.
299
-
300
- **Key Options for `HybridEngine` (from `PlaywrightEngineConfig`):**
301
-
302
- | Option | Type | Default | Description |
303
- | ------------------------- | ------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
304
- | `headers` | `Record<string, string>` | `{}` | Custom HTTP headers. For `HybridEngine`, these serve as default headers for both its internal `FetchEngine` (constructor) and `PlaywrightEngine` (constructor). They can be overridden by headers in `HybridEngine.fetchHTML()` options. |
305
- | `markdown` | `boolean` | `false` | Default Markdown conversion. For `HybridEngine`: sets default for internal `FetchEngine` (constructor) and internal `PlaywrightEngine`. Can be overridden per-request for the `PlaywrightEngine` part. |
306
- | `useHttpFallback` | `boolean` | `true` | (For Playwright part) If `true`, attempts a fast HTTP fetch before using Playwright. Ineffective if `spaMode` is `true`. |
307
- | `useHeadedModeFallback` | `boolean` | `false` | (For Playwright part) If `true`, automatically retries specific failed Playwright attempts in headed (visible) mode. |
308
- | `defaultFastMode` | `boolean` | `true` | If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively `false` if `spaMode` is `true`. |
309
- | `simulateHumanBehavior` | `boolean` | `true` | If `true` (and not `fastMode` or `spaMode`), attempts basic human-like interactions. |
310
- | `concurrentPages` | `number` | `3` | Max number of pages to process concurrently within the engine queue. |
311
- | `maxRetries` | `number` | `3` | Max retry attempts for a failed fetch (excluding initial try). |
312
- | `retryDelay` | `number` | `5000` | Delay (ms) between retries. |
313
- | `cacheTTL` | `number` | `900000` | Cache Time-To-Live (ms). `0` disables caching. (15 mins default) |
314
- | `spaMode` | `boolean` | `false` | If `true`, enables Single Page Application mode. This typically bypasses `useHttpFallback`, effectively sets `fastMode` to `false`, uses more patient load conditions (e.g., network idle), and may apply `spaRenderDelayMs`. Recommended for JavaScript-heavy sites. |
315
- | `spaRenderDelayMs` | `number` | `0` | Explicit delay (ms) after page load events in `spaMode` to allow for client-side rendering. Only applies if `spaMode` is `true`. |
316
- | `playwrightLaunchOptions` | `LaunchOptions` | `undefined` | (For Playwright part) Optional Playwright launch options (from `playwright` package, e.g., `{ args: ['--some-flag'] }`) passed when a browser instance is created. Merged with internal defaults. |
317
-
318
- **Browser Pool Options (For `HybridEngine`'s internal `PlaywrightEngine`):**
319
-
320
- | Option | Type | Default | Description |
321
- | -------------------------- | -------------------------- | ----------- | ------------------------------------------------------------------------------------------- |
322
- | `maxBrowsers` | `number` | `2` | Max concurrent browser instances managed by the pool. |
323
- | `maxPagesPerContext` | `number` | `6` | Max pages per browser context before recycling. |
324
- | `maxBrowserAge` | `number` | `1200000` | Max age (ms) a browser instance lives before recycling. (20 mins default) |
325
- | `healthCheckInterval` | `number` | `60000` | How often (ms) the pool checks browser health. (1 min default) |
326
- | `useHeadedMode` | `boolean` | `false` | Forces the _entire pool_ (for Playwright part) to launch browsers in headed (visible) mode. |
327
- | `poolBlockedDomains` | `string[]` | `[]` | List of domain glob patterns to block requests to (for Playwright part). |
328
- | `poolBlockedResourceTypes` | `string[]` | `[]` | List of Playwright resource types (e.g., 'image', 'font') to block (for Playwright part). |
329
- | `proxy` | `{ server: string, ... }?` | `undefined` | Proxy configuration object (see `PlaywrightEngineConfig` type) (for Playwright part). |
330
-
331
- ### `HybridEngine` - Configuration Summary & Header Precedence
332
-
333
- When you configure `HybridEngine` using `PlaywrightEngineConfig`:
334
-
335
- - **`headers`**: Constructor headers are passed to the internal `FetchEngine`'s constructor and the internal `PlaywrightEngine`'s constructor.
336
- - **`markdown`**: Sets the default for both internal engines.
337
- - **`spaMode`**: Sets the default for `HybridEngine`'s SPA shell detection and for the internal `PlaywrightEngine`.
338
- - Other options primarily configure the internal `PlaywrightEngine` or general retry/caching logic.
339
-
340
- **Per-request `options` in `HybridEngine.fetchHTML(url, options)`:**
341
-
342
- - **`headers?: Record<string, string>`**:
343
- - These headers override any headers set in the `HybridEngine` constructor.
344
- - If `FetchEngine` is used: These headers are passed to `FetchEngine.fetchHTML(url, { headers: ... })`. `FetchEngine` then merges them with its constructor headers and base defaults.
345
- - If `PlaywrightEngine` (fallback) is used: These headers are merged with `HybridEngine` constructor headers (options take precedence) and the result is passed to `PlaywrightEngine`'s `fetchHTML()`. `PlaywrightEngine` then applies its own logic (e.g., for `page.setExtraHTTPHeaders` or its HTTP fallback).
346
- - **`markdown?: boolean`**:
347
- - If `FetchEngine` is used: This per-request option is **ignored**. `FetchEngine` uses its own constructor `markdown` setting.
348
- - If `PlaywrightEngine` (fallback) is used: This overrides `PlaywrightEngine`'s default and determines its output format.
349
- - **`spaMode?: boolean`**: Overrides `HybridEngine`'s default SPA mode and is passed to `PlaywrightEngine` if used.
350
- - **`fastMode?: boolean`**: Passed to `PlaywrightEngine` if used; no effect on `FetchEngine`.
179
+ ### StructuredContentEngine Class
351
180
 
352
181
  ```typescript
353
- // Example: HybridEngine with SPA mode enabled by default
354
- const spaHybridEngine = new HybridEngine({ spaMode: true, spaRenderDelayMs: 2000 });
355
-
356
- async function fetchSpaSite() {
357
- try {
358
- // This will use PlaywrightEngine directly if smallblackdots is an SPA shell
359
- const result = await spaHybridEngine.fetchHTML(
360
- "https://www.smallblackdots.net/release/16109/corrina-joseph-wish-tonite-lonely"
361
- );
362
- console.log(`Title: ${result.title}`);
363
- } catch (e) {
364
- console.error(e);
365
- }
366
- }
182
+ import { StructuredContentEngine } from "@purepageio/fetch-engines";
183
+
184
+ const productSchema = z.object({
185
+ name: z.string(),
186
+ price: z.number(),
187
+ inStock: z.boolean(),
188
+ });
189
+
190
+ const engine = new StructuredContentEngine({
191
+ spaMode: true,
192
+ spaRenderDelayMs: 2000,
193
+ });
194
+
195
+ const result = await engine.fetchStructuredContent("https://shop.com/product", productSchema);
196
+ console.log(`${result.data.name} costs $${result.data.price}`);
197
+
198
+ await engine.cleanup();
367
199
  ```
368
200
 
369
- ## Return Value
201
+ ### Supported Models
370
202
 
371
- ### `fetchHTML()` Result
203
+ - `'gpt-5-mini'` - Latest model, mini version **(default)**
204
+ - `'gpt-5'` - Most capable model
205
+ - `'gpt-4.1-mini'` - Fast and cost-effective
206
+ - `'gpt-4.1'` - More capable GPT-4.1 version
372
207
 
373
- All `fetchHTML()` methods return a Promise that resolves to an `HTMLFetchResult` object:
208
+ ## Configuration
374
209
 
375
- - `content` (`string`): The fetched content, either original HTML or converted Markdown.
376
- - `contentType` (`'html' | 'markdown'`): Indicates the format of the `content` string.
377
- - `title` (`string | null`): Extracted page title (from original HTML).
378
- - `url` (`string`): Final URL after redirects.
379
- - `isFromCache` (`boolean`): True if the result came from cache.
380
- - `statusCode` (`number | undefined`): HTTP status code.
381
- - `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
210
+ ### FetchEngine Options
382
211
 
383
- ### `fetchContent()` Result
212
+ | Option | Type | Default | Description |
213
+ | ---------- | ------------------------ | ------- | ------------------------ |
214
+ | `markdown` | `boolean` | `false` | Convert HTML to Markdown |
215
+ | `headers` | `Record<string, string>` | `{}` | Custom HTTP headers |
384
216
 
385
- All `fetchContent()` methods return a Promise that resolves to a `ContentFetchResult` object:
217
+ ### HybridEngine Configuration
386
218
 
387
- - `content` (`Buffer | string`): The raw fetched content. Binary content (PDFs, images, etc.) is returned as `Buffer`, text content as `string`.
388
- - `contentType` (`string`): The original MIME type from the server (e.g., `"application/pdf"`, `"text/html"`, `"application/json"`).
389
- - `title` (`string | null`): Extracted page title if the content is HTML, otherwise `null`.
390
- - `url` (`string`): Final URL after redirects.
391
- - `isFromCache` (`boolean`): True if the result came from cache.
392
- - `statusCode` (`number | undefined`): HTTP status code.
393
- - `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
219
+ | Option | Type | Default | Description |
220
+ | ------------------ | ------------------------ | -------- | -------------------------------------------- |
221
+ | `headers` | `Record<string, string>` | `{}` | Default headers for both engines |
222
+ | `markdown` | `boolean` | `false` | Default Markdown conversion |
223
+ | `useHttpFallback` | `boolean` | `true` | Try HTTP before Playwright |
224
+ | `spaMode` | `boolean` | `false` | Enable SPA mode with patient load conditions |
225
+ | `spaRenderDelayMs` | `number` | `0` | Delay after page load in SPA mode |
226
+ | `maxRetries` | `number` | `3` | Max retry attempts |
227
+ | `cacheTTL` | `number` | `900000` | Cache TTL in ms (15 min default) |
228
+ | `concurrentPages` | `number` | `3` | Max concurrent pages |
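
To make the `cacheTTL` setting concrete, here is a minimal sketch of a time-to-live cache. It is an assumption-laden illustration of the general idea, not the package's cache: the `TtlCache` class and its injectable clock are hypothetical.

```typescript
// Hypothetical in-memory TTL cache; the clock is injectable so expiry can be tested.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

With a TTL of `900000` ms, an entry stored at time `t` is served until `t + 15 minutes` and treated as a miss afterwards.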
394
229
 
395
- ## API Reference
230
+ ### Browser Pool Options
396
231
 
397
- ### `engine.fetchHTML(url, options?)`
232
+ | Option | Type | Default | Description |
233
+ | -------------------- | -------- | --------- | ---------------------------------- |
234
+ | `maxBrowsers` | `number` | `2` | Max browser instances |
235
+ | `maxPagesPerContext` | `number` | `6` | Pages per context before recycling |
236
+ | `maxBrowserAge` | `number` | `1200000` | Browser lifetime (20 min) |
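The millisecond defaults in the two tables above can be checked against their minute annotations with a small conversion. This is only an arithmetic sketch; `msToMinutes` and `MS_PER_MINUTE` are illustrative names, not part of the package.

```typescript
// Hypothetical helper for reading millisecond-based options in minutes.
const MS_PER_MINUTE = 60_000;

function msToMinutes(ms: number): number {
  return ms / MS_PER_MINUTE;
}

// Defaults as listed in the README tables.
const cacheTtlMinutes = msToMinutes(900_000); // cacheTTL
const maxBrowserAgeMinutes = msToMinutes(1_200_000); // maxBrowserAge

console.log(cacheTtlMinutes, maxBrowserAgeMinutes);
```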
398
237
 
399
- - `url` (`string`): URL to fetch.
400
- - `options?` (`FetchOptions`): Optional per-request overrides.
401
- - `headers?: Record<string, string>`: Custom headers for this specific request.
402
- - `markdown?: boolean`: (For `HybridEngine`'s Playwright part) Request Markdown conversion.
403
- - `fastMode?: boolean`: (For `HybridEngine`'s Playwright part) Override fast mode.
404
- - `spaMode?: boolean`: (For `HybridEngine`) Override SPA mode behavior for this request.
405
- - **Returns:** `Promise<HTMLFetchResult>`
238
+ ### Header Precedence
406
239
 
407
- Fetches content, returning HTML or Markdown based on configuration/options in `result.content` with `result.contentType` indicating the format.
240
+ Headers merge in this order (highest precedence first):
408
241
 
409
- ### `engine.fetchContent(url, options?)`
242
+ 1. Request-specific headers in `fetchHTML(url, { headers })`
243
+ 2. Engine constructor headers
244
+ 3. Engine default headers
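The precedence order above can be sketched with plain object spreads, where a later spread overwrites an earlier one. This is an illustration under assumptions: `mergeHeaders`, `HeaderMap`, and the sample header values are hypothetical, and the library's actual merge (for example, how it treats header-name casing) may differ.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical helper: lowest-precedence source first, highest last,
// so request-specific headers overwrite constructor and default headers.
function mergeHeaders(
  engineDefaults: HeaderMap,
  constructorHeaders: HeaderMap,
  requestHeaders: HeaderMap,
): HeaderMap {
  return { ...engineDefaults, ...constructorHeaders, ...requestHeaders };
}

const merged = mergeHeaders(
  { "User-Agent": "default-agent", Accept: "text/html" },
  { "User-Agent": "constructor-agent", "X-Team": "docs" },
  { "User-Agent": "request-agent" },
);

console.log(merged);
```

Keys that appear in only one source survive unchanged; only colliding keys are resolved by precedence. Note that this sketch compares keys case-sensitively.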
410
245
 
411
- - `url` (`string`): URL to fetch.
412
- - `options?` (`ContentFetchOptions`): Optional per-request overrides.
413
- - `headers?: Record<string, string>`: Custom headers for this specific request.
414
- - **Returns:** `Promise<ContentFetchResult>`
246
+ ## Return Value
415
247
 
416
- Fetches raw content without processing, mimicking standard `fetch()` behavior. Returns binary content as `Buffer` and text content as `string`. Supports any content type (PDFs, images, JSON, XML, etc.) and uses the same smart fallback logic as `fetchHTML()` but without HTML-specific processing or content-type restrictions.
248
+ ### HTMLFetchResult (fetchHTML)
417
249
 
418
- ### `engine.cleanup()` (`HybridEngine` and direct `FetchEngine` if no cleanup needed)
250
+ - `content` (`string`): HTML or Markdown content
251
+ - `contentType` (`'html' | 'markdown'`): Content format
252
+ - `title` (`string | null`): Extracted page title
253
+ - `url` (`string`): Final URL after redirects
254
+ - `isFromCache` (`boolean`): Cache hit indicator
255
+ - `statusCode` (`number | undefined`): HTTP status code
419
256
 
420
- - **Returns:** `Promise<void>`
257
+ ### ContentFetchResult (fetchContent)
421
258
 
422
- For `HybridEngine`, this gracefully shuts down all browser instances managed by its internal `PlaywrightEngine`. **It is crucial to call `await engine.cleanup()` when you are finished using `HybridEngine`** to release system resources.
423
- `FetchEngine` has a `cleanup` method for API consistency, but it's a no-op as `FetchEngine` doesn't manage persistent resources.
259
+ - `content` (`Buffer | string`): Raw content (binary as Buffer, text as string)
260
+ - `contentType` (`string`): Original MIME type
261
+ - `title` (`string | null`): Title if HTML content, otherwise null
262
+ - `url` (`string`): Final URL after redirects
263
+ - `isFromCache` (`boolean`): Cache hit indicator
264
+ - `statusCode` (`number | undefined`): HTTP status code
424
265
 
425
- ## Stealth / Anti-Detection (via `HybridEngine`)
266
+ ## API Reference
426
267
 
427
- When `HybridEngine` uses its internal browser capabilities (via `PlaywrightEngine`), it automatically integrates `playwright-extra` and its powerful stealth plugin ([`puppeteer-extra-plugin-stealth`](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
268
+ ### `engine.fetchHTML(url, options?)`
428
269
 
429
- There are **no manual configuration options** for stealth; it is enabled by default when `HybridEngine` uses its browser functionality.
270
+ - `url` (`string`): Target URL
271
+ - `options?` (`FetchOptions`):
272
+ - `headers?: Record<string, string>`: Request headers
273
+ - `markdown?: boolean`: Request Markdown (HybridEngine only)
274
+ - `fastMode?: boolean`: Override fast mode (HybridEngine only)
275
+ - `spaMode?: boolean`: Override SPA mode (HybridEngine only)
276
+ - **Returns:** `Promise<HTMLFetchResult>`
430
277
 
431
- While effective, be aware that no stealth technique is foolproof, and sophisticated websites may still detect automated browsing.
278
+ ### `engine.fetchContent(url, options?)`
432
279
 
433
- ## Error Handling
280
+ - `url` (`string`): Target URL
281
+ - `options?` (`ContentFetchOptions`):
282
+ - `headers?: Record<string, string>`: Request headers
283
+ - **Returns:** `Promise<ContentFetchResult>`
434
284
 
435
- Errors during fetching are typically thrown as instances of `FetchError` (or its subclasses like `FetchEngineHttpError`), providing more context than standard `Error` objects.
285
+ ### `fetchStructuredContent(url, schema, options?)`
436
286
 
437
- - `FetchError` properties:
438
- - `message` (`string`): Description of the error.
439
- - `code` (`string | undefined`): A specific error code (e.g., `ERR_NAVIGATION_TIMEOUT`, `ERR_HTTP_ERROR`, `ERR_NON_HTML_CONTENT`).
440
- - `originalError` (`Error | undefined`): The underlying error that caused this fetch error (e.g., a Playwright error object).
441
- - `statusCode` (`number | undefined`): The HTTP status code, if relevant (especially for `FetchEngineHttpError`).
287
+ - `url` (`string`): Target URL
288
+ - `schema` (`z.ZodSchema<T>`): Zod schema for extraction
289
+ - `options?` (`StructuredContentOptions`):
290
+ - `model?: string`: OpenAI model (default: 'gpt-5-mini')
291
+ - `customPrompt?: string`: Additional AI context
292
+ - `engineConfig?: PlaywrightEngineConfig`: HybridEngine config
293
+ - **Returns:** `Promise<StructuredContentResult<T>>`
442
294
 
443
- Common `FetchError` codes and scenarios:
295
+ ### `engine.cleanup()`
444
296
 
445
- - **`ERR_HTTP_ERROR`**: Thrown by `FetchEngine` for HTTP status codes >= 400. `error.statusCode` will be set.
446
- - **`ERR_NON_HTML_CONTENT`**: Thrown by `FetchEngine` if the content type is not HTML and `markdown` conversion is not requested. **Note:** `fetchContent()` does not throw this error as it supports all content types.
447
- - **`ERR_PLAYWRIGHT_OPERATION`**: A general error from `HybridEngine`'s browser mode indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The `originalError` property will often contain the specific Playwright error.
448
- - **`ERR_NAVIGATION`**: Often seen as part of `ERR_PLAYWRIGHT_OPERATION`'s message or in `originalError` when a Playwright navigation (in `HybridEngine`'s browser mode) fails (e.g., timeout, SSL error).
449
- - **`ERR_MARKDOWN_CONVERSION_NON_HTML`**: Thrown by `HybridEngine` (when its Playwright part is active) if `markdown: true` is requested for a non-HTML content type (e.g., XML, JSON). **Note:** Only applies to `fetchHTML()` as `fetchContent()` doesn't perform markdown conversion.
450
- - **`ERR_CACHE_ERROR`**: Indicates an issue with cache read/write operations.
451
- - **`ERR_PROXY_CONFIG_ERROR`**: Problem with proxy configuration (for `HybridEngine`'s browser mode).
452
- - **`ERR_BROWSER_POOL_EXHAUSTED`**: If `HybridEngine`'s browser pool cannot provide a page.
453
- - **Other Scenarios (often wrapped by `ERR_PLAYWRIGHT_OPERATION` or a generic `FetchError` when `HybridEngine` uses its browser mode):**
454
- - Network issues (DNS resolution, connection refused).
455
- - Proxy connection failures.
456
- - Page crashes or context/browser disconnections within Playwright.
457
- - Failures during browser launch or management by the pool.
297
+ Shuts down browser instances for `HybridEngine` and `StructuredContentEngine`. Call when finished to release resources. No-op for `FetchEngine`.
458
298
 
459
- The `HTMLFetchResult` object may also contain an `error` property if the final fetch attempt failed after all retries but an earlier attempt (within retries) might have produced some intermediate (potentially unusable) result data. It's generally best to rely on the thrown error for failure handling.
299
+ ## Stealth Features
460
300
 
461
- **Example:**
301
+ When `HybridEngine` uses Playwright, it automatically applies stealth measures via `playwright-extra` and stealth plugins to bypass common bot detection. No manual configuration required.
462
302
 
463
- ```typescript
464
- import { HybridEngine, FetchError } from "@purepageio/fetch-engines";
465
-
466
- // Example using HybridEngine to illustrate error handling
467
- const engine = new HybridEngine({ useHttpFallback: false, maxRetries: 1 }); // useHttpFallback for Playwright part
468
-
469
- async function fetchWithHandling(url: string) {
470
- try {
471
- // Try fetchHTML first
472
- const htmlResult = await engine.fetchHTML(url, { headers: { "X-My-Header": "TestValue" } });
473
- if (htmlResult.error) {
474
- console.warn(`fetchHTML for ${url} included non-critical error after retries: ${htmlResult.error.message}`);
475
- }
476
- console.log(`fetchHTML Success for ${url}! Title: ${htmlResult.title}, Content type: ${htmlResult.contentType}`);
477
- // Use htmlResult.content
478
- } catch (error) {
479
- console.error(`fetchHTML failed for ${url}, trying fetchContent...`);
480
-
481
- try {
482
- // Fallback to fetchContent for raw content
483
- const contentResult = await engine.fetchContent(url, { headers: { "X-My-Header": "TestValue" } });
484
- if (contentResult.error) {
485
- console.warn(
486
- `fetchContent for ${url} included non-critical error after retries: ${contentResult.error.message}`
487
- );
488
- }
489
- console.log(`fetchContent Success for ${url}! Content type: ${contentResult.contentType}`);
490
- // Use contentResult.content (could be Buffer or string)
491
- } catch (contentError) {
492
- console.error(`Both fetchHTML and fetchContent failed for ${url}:`);
493
- if (contentError instanceof FetchError) {
494
- console.error(` Error Code: ${contentError.code || "N/A"}`);
495
- console.error(` Message: ${contentError.message}`);
496
- if (contentError.statusCode) {
497
- console.error(` Status Code: ${contentError.statusCode}`);
498
- }
499
- if (contentError.originalError) {
500
- console.error(` Original Error: ${contentError.originalError.name} - ${contentError.originalError.message}`);
501
- }
502
- // Example of specific handling:
503
- if (contentError.code === "ERR_PLAYWRIGHT_OPERATION") {
504
- console.error(
505
- " Hint: This was a Playwright operation failure (HybridEngine's browser mode). Check Playwright logs or originalError."
506
- );
507
- }
508
- } else if (contentError instanceof Error) {
509
- console.error(` Generic Error: ${contentError.message}`);
510
- } else {
511
- console.error(` Unknown error occurred: ${String(contentError)}`);
512
- }
513
- }
514
- }
515
- }
303
+ Stealth techniques are not foolproof against sophisticated detection systems.
516
304
 
517
- async function runExamples() {
518
- await fetchWithHandling("https://nonexistentdomain.example.com"); // Likely DNS or navigation error
519
- await fetchWithHandling("https://example.com/document.pdf"); // PDF content - fetchHTML will fail, fetchContent will succeed
520
- await fetchWithHandling("https://example.com/api/data.json"); // JSON content - fetchHTML will fail, fetchContent will succeed
521
- await engine.cleanup(); // Important for HybridEngine
522
- }
305
+ ## Error Handling
523
306
 
524
- runExamples();
525
- ```
307
+ Errors are thrown as `FetchError` instances with additional context:
526
308
 
527
- ## Logging
309
+ - `message` (`string`): Error description
310
+ - `code` (`string | undefined`): Specific error code
311
+ - `originalError` (`Error | undefined`): Underlying error
312
+ - `statusCode` (`number | undefined`): HTTP status code
528
313
 
529
- Currently, the library uses `console.warn` and `console.error` for internal warnings (like fallback events) and critical errors. More sophisticated logging options may be added in the future.
314
+ Common error codes:
530
315
 
531
- ## Testing
316
+ - `ERR_HTTP_ERROR`: HTTP status >= 400
317
+ - `ERR_NON_HTML_CONTENT`: Non-HTML content for HTML request
318
+ - `ERR_FETCH_FAILED`: General fetch operation failure
319
+ - `ERR_PLAYWRIGHT_OPERATION`: Playwright operation failure
320
+ - `ERR_NAVIGATION`: Navigation timeout or failure
321
+ - `ERR_BROWSER_POOL_EXHAUSTED`: No available browser resources
322
+ - `ERR_MAX_RETRIES_REACHED`: All retry attempts exhausted
323
+ - `ERR_MARKDOWN_CONVERSION_NON_HTML`: Markdown conversion on non-HTML content
532
324
 
533
- To run the test suite locally:
325
+ ```typescript
326
+ import { HybridEngine } from "@purepageio/fetch-engines";
534
327
 
535
- ```bash
536
- pnpm install
537
- pnpm exec playwright install
538
- pnpm test
539
- ```
328
+ const engine = new HybridEngine();
540
329
 
541
- The `pnpm exec playwright install` step downloads the required browser binaries for Playwright.
330
+ try {
331
+ const result = await engine.fetchHTML(url);
332
+ } catch (error: any) {
333
+ console.error(`Error: ${error.code || "Unknown"} - ${error.message}`);
334
+ if (error.statusCode) console.error(`Status: ${error.statusCode}`);
335
+ }
336
+ ```
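The README example above types the caught value as `any`. A stricter variant, sketched here, narrows an `unknown` value with a type guard before reading the documented fields. `FetchErrorLike`, `isFetchErrorLike`, and `describeError` are hypothetical local names, not package exports; the guard only checks for a string `message`, so treat it as a starting point rather than a full validation.

```typescript
// Local stand-in for the documented FetchError fields; not imported from the package.
interface FetchErrorLike {
  message: string;
  code?: string;
  statusCode?: number;
}

// Type guard: accepts any object that carries a string `message`.
function isFetchErrorLike(value: unknown): value is FetchErrorLike {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { message?: unknown }).message === "string"
  );
}

// Formats an error in the same style as the README example.
function describeError(error: unknown): string {
  if (!isFetchErrorLike(error)) return `Unknown error: ${String(error)}`;
  const status = error.statusCode !== undefined ? ` (status ${error.statusCode})` : "";
  return `${error.code ?? "Unknown"} - ${error.message}${status}`;
}

const httpError = Object.assign(new Error("Not Found"), {
  code: "ERR_HTTP_ERROR",
  statusCode: 404,
});

console.log(describeError(httpError));
console.log(describeError("boom"));
```

Inside a `catch (error)` block, calling `describeError(error)` avoids the `any` annotation while still handling non-`Error` throw values.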
542
337
 
543
338
  ## Contributing
544
339
 
545
- Contributions are welcome! Please open an issue or submit a pull request on the [GitHub repository](https://github.com/purepageio/fetch-engines).
340
+ Contributions welcome! Open an issue or submit a pull request on [GitHub](https://github.com/purepageio/fetch-engines).
546
341
 
547
342
  ## License
548
343
 
@@ -0,0 +1,67 @@
1
+ import type { z } from "zod";
2
+ import type { PlaywrightEngineConfig } from "./types.js";
3
+ /**
4
+ * Configuration options for structured content fetching
5
+ */
6
+ export interface StructuredContentOptions {
7
+ /** OpenAI model to use. Options: 'gpt-4.1-mini', 'gpt-4.1', 'gpt-5', 'gpt-5-mini' */
8
+ model?: "gpt-4.1-mini" | "gpt-4.1" | "gpt-5" | "gpt-5-mini";
9
+ /** Custom prompt to provide additional context to the LLM */
10
+ customPrompt?: string;
11
+ /** HybridEngine configuration for content fetching */
12
+ engineConfig?: PlaywrightEngineConfig;
13
+ }
14
+ /**
15
+ * Result of structured content extraction
16
+ */
17
+ export interface StructuredContentResult<T> {
18
+ /** The structured data extracted from the content */
19
+ data: T;
20
+ /** The original markdown content that was processed */
21
+ markdown: string;
22
+ /** The URL that was processed */
23
+ url: string;
24
+ /** The title of the page if available */
25
+ title: string | null;
26
+ /** Token usage information */
27
+ usage: {
28
+ promptTokens: number;
29
+ completionTokens: number;
30
+ totalTokens: number;
31
+ };
32
+ }
33
+ /**
34
+ * Engine for fetching web content and extracting structured data using AI
35
+ */
36
+ export declare class StructuredContentEngine {
37
+ private hybridEngine;
38
+ constructor(config?: PlaywrightEngineConfig);
39
+ /**
40
+ * Fetches content from a URL and extracts structured data using AI
41
+ *
42
+ * @param url The URL to fetch content from
43
+ * @param schema Zod schema defining the structure of data to extract
44
+ * @param options Additional options for the extraction process
45
+ * @returns Promise resolving to structured data and metadata
46
+ * @throws Error if OPENAI_API_KEY is not set or if extraction fails
47
+ */
48
+ fetchStructuredContent<T>(url: string, schema: z.ZodSchema<T>, options?: StructuredContentOptions): Promise<StructuredContentResult<T>>;
49
+ /**
50
+ * Get model-specific configuration options
51
+ */
52
+ private getModelConfig;
53
+ /**
54
+ * Clean up resources
55
+ */
56
+ cleanup(): Promise<void>;
57
+ }
58
+ /**
59
+ * Convenience function for one-off structured content extraction
60
+ *
61
+ * @param url The URL to fetch content from
62
+ * @param schema Zod schema defining the structure of data to extract
63
+ * @param options Additional options for the extraction process
64
+ * @returns Promise resolving to structured data and metadata
65
+ */
66
+ export declare function fetchStructuredContent<T>(url: string, schema: z.ZodSchema<T>, options?: StructuredContentOptions): Promise<StructuredContentResult<T>>;
67
+ //# sourceMappingURL=StructuredContentEngine.d.ts.map
@@ -0,0 +1 @@
1
+ {"version":3,"file":"StructuredContentEngine.d.ts","sourceRoot":"","sources":["../src/StructuredContentEngine.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAE7B,OAAO,KAAK,EAAE,sBAAsB,EAAE,MAAM,YAAY,CAAC;AAEzD;;GAEG;AACH,MAAM,WAAW,wBAAwB;IACvC,qFAAqF;IACrF,KAAK,CAAC,EAAE,cAAc,GAAG,SAAS,GAAG,OAAO,GAAG,YAAY,CAAC;IAC5D,6DAA6D;IAC7D,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,sDAAsD;IACtD,YAAY,CAAC,EAAE,sBAAsB,CAAC;CACvC;AAED;;GAEG;AACH,MAAM,WAAW,uBAAuB,CAAC,CAAC;IACxC,qDAAqD;IACrD,IAAI,EAAE,CAAC,CAAC;IACR,uDAAuD;IACvD,QAAQ,EAAE,MAAM,CAAC;IACjB,iCAAiC;IACjC,GAAG,EAAE,MAAM,CAAC;IACZ,yCAAyC;IACzC,KAAK,EAAE,MAAM,GAAG,IAAI,CAAC;IACrB,8BAA8B;IAC9B,KAAK,EAAE;QACL,YAAY,EAAE,MAAM,CAAC;QACrB,gBAAgB,EAAE,MAAM,CAAC;QACzB,WAAW,EAAE,MAAM,CAAC;KACrB,CAAC;CACH;AAED;;GAEG;AACH,qBAAa,uBAAuB;IAClC,OAAO,CAAC,YAAY,CAAe;gBAEvB,MAAM,GAAE,sBAA2B;IAQ/C;;;;;;;;OAQG;IACG,sBAAsB,CAAC,CAAC,EAC5B,GAAG,EAAE,MAAM,EACX,MAAM,EAAE,CAAC,CAAC,SAAS,CAAC,CAAC,CAAC,EACtB,OAAO,GAAE,wBAA6B,GACrC,OAAO,CAAC,uBAAuB,CAAC,CAAC,CAAC,CAAC;IAsDtC;;OAEG;IACH,OAAO,CAAC,cAAc;IAiBtB;;OAEG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;CAG/B;AAED;;;;;;;GAOG;AACH,wBAAsB,sBAAsB,CAAC,CAAC,EAC5C,GAAG,EAAE,MAAM,EACX,MAAM,EAAE,CAAC,CAAC,SAAS,CAAC,CAAC,CAAC,EACtB,OAAO,GAAE,wBAA6B,GACrC,OAAO,CAAC,uBAAuB,CAAC,CAAC,CAAC,CAAC,CAOrC"}
@@ -0,0 +1,116 @@
1
+ import { generateObject } from "ai";
2
+ import { openai } from "@ai-sdk/openai";
3
+ import { HybridEngine } from "./HybridEngine.js";
4
+ /**
5
+ * Engine for fetching web content and extracting structured data using AI
6
+ */
7
+ export class StructuredContentEngine {
8
+ hybridEngine;
9
+ constructor(config = {}) {
10
+ // Always enable markdown conversion for structured content
11
+ this.hybridEngine = new HybridEngine({
12
+ ...config,
13
+ markdown: true,
14
+ });
15
+ }
16
+ /**
17
+ * Fetches content from a URL and extracts structured data using AI
18
+ *
19
+ * @param url The URL to fetch content from
20
+ * @param schema Zod schema defining the structure of data to extract
21
+ * @param options Additional options for the extraction process
22
+ * @returns Promise resolving to structured data and metadata
23
+ * @throws Error if OPENAI_API_KEY is not set or if extraction fails
24
+ */
25
+ async fetchStructuredContent(url, schema, options = {}) {
26
+ // Check for OpenAI API key
27
+ if (!process.env.OPENAI_API_KEY) {
28
+ throw new Error("OPENAI_API_KEY environment variable is required for structured content extraction");
29
+ }
30
+ const { model = "gpt-5-mini", customPrompt = "", engineConfig = {} } = options;
31
+ // Fetch content using HybridEngine with markdown enabled
32
+ const result = await this.hybridEngine.fetchHTML(url, {
33
+ markdown: true,
34
+ ...engineConfig,
35
+ });
36
+ if (result.contentType !== "markdown") {
37
+ throw new Error("Failed to convert content to markdown");
38
+ }
39
+ // Prepare the prompt for the LLM
40
+ const systemPrompt = `You are an expert at extracting structured data from web content.
41
+ Extract the requested information from the provided markdown content accurately and completely.
42
+ ${customPrompt ? `\nAdditional context: ${customPrompt}` : ""}
43
+
44
+ Content to analyze:
45
+ ${result.content}`;
46
+ // Configure model-specific options
47
+ const modelConfig = this.getModelConfig(model);
48
+ try {
49
+ // Generate structured object using AI SDK
50
+ const aiResult = await generateObject({
51
+ model: openai(model),
52
+ schema,
53
+ prompt: systemPrompt,
54
+ ...modelConfig,
55
+ });
56
+ return {
57
+ data: aiResult.object,
58
+ markdown: result.content,
59
+ url: result.url,
60
+ title: result.title,
61
+ usage: {
62
+ promptTokens: aiResult.usage?.promptTokens ?? 0,
63
+ completionTokens: aiResult.usage?.completionTokens ?? 0,
64
+ totalTokens: aiResult.usage?.totalTokens ?? 0,
65
+ },
66
+ };
67
+ }
68
+ catch (error) {
69
+ throw new Error(`Failed to extract structured data: ${error instanceof Error ? error.message : String(error)}`);
70
+ }
71
+ }
72
+ /**
73
+ * Get model-specific configuration options
74
+ */
75
+ getModelConfig(model) {
76
+ if (model.startsWith("gpt-5")) {
77
+ return {
78
+ providerOptions: {
79
+ openai: {
80
+ reasoning_effort: "low",
81
+ },
82
+ },
83
+ };
84
+ }
85
+ else if (model.startsWith("gpt-4.1")) {
86
+ return {
87
+ temperature: 0,
88
+ };
89
+ }
90
+ return {};
91
+ }
92
+ /**
93
+ * Clean up resources
94
+ */
95
+ async cleanup() {
96
+ await this.hybridEngine.cleanup();
97
+ }
98
+ }
99
+ /**
100
+ * Convenience function for one-off structured content extraction
101
+ *
102
+ * @param url The URL to fetch content from
103
+ * @param schema Zod schema defining the structure of data to extract
104
+ * @param options Additional options for the extraction process
105
+ * @returns Promise resolving to structured data and metadata
106
+ */
107
+ export async function fetchStructuredContent(url, schema, options = {}) {
108
+ const engine = new StructuredContentEngine(options.engineConfig);
109
+ try {
110
+ return await engine.fetchStructuredContent(url, schema, options);
111
+ }
112
+ finally {
113
+ await engine.cleanup();
114
+ }
115
+ }
116
+ //# sourceMappingURL=StructuredContentEngine.js.map
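One detail in the compiled file above is the order of object spreads. The constructor writes `markdown: true` after `...config`, while `fetchStructuredContent` writes `markdown: true` before `...engineConfig`. In JavaScript the later key wins, so the two call sites would treat a caller-supplied `markdown: false` differently. The sketch below demonstrates only that language rule with stand-in objects (`Options` and `callerConfig` are hypothetical); whether the difference is intended is not stated in the source.

```typescript
// Stand-in for the relevant slice of the engine config.
interface Options {
  markdown?: boolean;
  spaMode?: boolean;
}

const callerConfig: Options = { markdown: false, spaMode: true };

// Same shape as the constructor: the explicit key comes last and wins.
const constructorStyle: Options = { ...callerConfig, markdown: true };

// Same shape as the per-request call: the spread comes last and wins.
const requestStyle: Options = { markdown: true, ...callerConfig };

console.log(constructorStyle.markdown, requestStyle.markdown);
```

If the per-request options did end up with `markdown: false`, the `contentType !== "markdown"` check in the compiled method would presumably raise "Failed to convert content to markdown", assuming the underlying engine honors the per-request flag.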
@@ -0,0 +1 @@
1
+ {"version":3,"file":"StructuredContentEngine.js","sourceRoot":"","sources":["../src/StructuredContentEngine.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,IAAI,CAAC;AACpC,OAAO,EAAE,MAAM,EAAE,MAAM,gBAAgB,CAAC;AAExC,OAAO,EAAE,YAAY,EAAE,MAAM,mBAAmB,CAAC;AAmCjD;;GAEG;AACH,MAAM,OAAO,uBAAuB;IAC1B,YAAY,CAAe;IAEnC,YAAY,SAAiC,EAAE;QAC7C,2DAA2D;QAC3D,IAAI,CAAC,YAAY,GAAG,IAAI,YAAY,CAAC;YACnC,GAAG,MAAM;YACT,QAAQ,EAAE,IAAI;SACf,CAAC,CAAC;IACL,CAAC;IAED;;;;;;;;OAQG;IACH,KAAK,CAAC,sBAAsB,CAC1B,GAAW,EACX,MAAsB,EACtB,UAAoC,EAAE;QAEtC,2BAA2B;QAC3B,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,cAAc,EAAE,CAAC;YAChC,MAAM,IAAI,KAAK,CAAC,mFAAmF,CAAC,CAAC;QACvG,CAAC;QAED,MAAM,EAAE,KAAK,GAAG,YAAY,EAAE,YAAY,GAAG,EAAE,EAAE,YAAY,GAAG,EAAE,EAAE,GAAG,OAAO,CAAC;QAE/E,yDAAyD;QACzD,MAAM,MAAM,GAAG,MAAM,IAAI,CAAC,YAAY,CAAC,SAAS,CAAC,GAAG,EAAE;YACpD,QAAQ,EAAE,IAAI;YACd,GAAG,YAAY;SAChB,CAAC,CAAC;QAEH,IAAI,MAAM,CAAC,WAAW,KAAK,UAAU,EAAE,CAAC;YACtC,MAAM,IAAI,KAAK,CAAC,uCAAuC,CAAC,CAAC;QAC3D,CAAC;QAED,iCAAiC;QACjC,MAAM,YAAY,GAAG;;EAEvB,YAAY,CAAC,CAAC,CAAC,yBAAyB,YAAY,EAAE,CAAC,CAAC,CAAC,EAAE;;;EAG3D,MAAM,CAAC,OAAO,EAAE,CAAC;QAEf,mCAAmC;QACnC,MAAM,WAAW,GAAG,IAAI,CAAC,cAAc,CAAC,KAAK,CAAC,CAAC;QAE/C,IAAI,CAAC;YACH,0CAA0C;YAC1C,MAAM,QAAQ,GAAG,MAAM,cAAc,CAAC;gBACpC,KAAK,EAAE,MAAM,CAAC,KAAK,CAAC;gBACpB,MAAM;gBACN,MAAM,EAAE,YAAY;gBACpB,GAAG,WAAW;aACf,CAAC,CAAC;YAEH,OAAO;gBACL,IAAI,EAAE,QAAQ,CAAC,MAAM;gBACrB,QAAQ,EAAE,MAAM,CAAC,OAAO;gBACxB,GAAG,EAAE,MAAM,CAAC,GAAG;gBACf,KAAK,EAAE,MAAM,CAAC,KAAK;gBACnB,KAAK,EAAE;oBACL,YAAY,EAAG,QAAQ,CAAC,KAAa,EAAE,YAAY,IAAI,CAAC;oBACxD,gBAAgB,EAAG,QAAQ,CAAC,KAAa,EAAE,gBAAgB,IAAI,CAAC;oBAChE,WAAW,EAAG,QAAQ,CAAC,KAAa,EAAE,WAAW,IAAI,CAAC;iBACvD;aACF,CAAC;QACJ,CAAC;QAAC,OAAO,KAAK,EAAE,CAAC;YACf,MAAM,IAAI,KAAK,CAAC,sCAAsC,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,KAAK,CAAC,EAAE,CAAC,CAAC;QAClH,CAAC;IACH,CAAC;IAED;;OAEG;IACK,cAAc,CAAC,KAAa;QAClC,IAAI,KAAK,CAAC,UAAU,CAAC,OAAO,CAAC,EAAE,CAAC;YAC9B,OAAO;gBACL,eAAe,EAAE;oBACf,MAAM,EAAE;wBACN,gBAAgB,EAAE,KAAK;qBACxB;i
BACF;aACF,CAAC;QACJ,CAAC;aAAM,IAAI,KAAK,CAAC,UAAU,CAAC,SAAS,CAAC,EAAE,CAAC;YACvC,OAAO;gBACL,WAAW,EAAE,CAAC;aACf,CAAC;QACJ,CAAC;QACD,OAAO,EAAE,CAAC;IACZ,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,OAAO;QACX,MAAM,IAAI,CAAC,YAAY,CAAC,OAAO,EAAE,CAAC;IACpC,CAAC;CACF;AAED;;;;;;;GAOG;AACH,MAAM,CAAC,KAAK,UAAU,sBAAsB,CAC1C,GAAW,EACX,MAAsB,EACtB,UAAoC,EAAE;IAEtC,MAAM,MAAM,GAAG,IAAI,uBAAuB,CAAC,OAAO,CAAC,YAAY,CAAC,CAAC;IACjE,IAAI,CAAC;QACH,OAAO,MAAM,MAAM,CAAC,sBAAsB,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,CAAC;IACnE,CAAC;YAAS,CAAC;QACT,MAAM,MAAM,CAAC,OAAO,EAAE,CAAC;IACzB,CAAC;AACH,CAAC"}
package/dist/index.d.ts CHANGED
@@ -4,4 +4,5 @@ import type { HTMLFetchResult, ContentFetchResult, ContentFetchOptions, BrowserM
4
4
  export type { IEngine, HTMLFetchResult, ContentFetchResult, ContentFetchOptions, BrowserMetrics };
5
5
  export { FetchEngine };
6
6
  export * from "./HybridEngine.js";
7
+ export * from "./StructuredContentEngine.js";
7
8
  //# sourceMappingURL=index.d.ts.map
@@ -1 +1 @@
1
- {"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAC5C,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAE/C,OAAO,KAAK,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAE3G,YAAY,EAAE,OAAO,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,CAAC;AAClG,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC"}
1
+ {"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAC5C,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAE/C,OAAO,KAAK,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAE3G,YAAY,EAAE,OAAO,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,CAAC;AAClG,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC;AAClC,cAAc,8BAA8B,CAAC"}
package/dist/index.js CHANGED
@@ -1,4 +1,5 @@
1
1
  import { FetchEngine } from "./FetchEngine.js";
2
2
  export { FetchEngine };
3
3
  export * from "./HybridEngine.js"; // Export the new engine
4
+ export * from "./StructuredContentEngine.js"; // Export structured content functionality
4
5
  //# sourceMappingURL=index.js.map
package/dist/index.js.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAK/C,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC,CAAC,wBAAwB"}
1
+ {"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAK/C,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC,CAAC,wBAAwB;AAC3D,cAAc,8BAA8B,CAAC,CAAC,0CAA0C"}
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@purepageio/fetch-engines",
3
- "version": "0.5.1-rc.0",
3
+ "version": "0.6.0",
4
4
  "type": "module",
5
5
  "description": "A collection of configurable engines for fetching HTML content using fetch or Playwright.",
6
6
  "main": "dist/index.js",
@@ -11,6 +11,8 @@
11
11
  "LICENSE"
12
12
  ],
13
13
  "dependencies": {
14
+ "@ai-sdk/openai": "^2.0.30",
15
+ "ai": "^5.0.44",
14
16
  "axios": "^1.6.8",
15
17
  "node-html-parser": "^7.0.1",
16
18
  "p-queue": "^7.4.1",
@@ -21,7 +23,8 @@
21
23
  "turndown": "^7.2.0",
22
24
  "turndown-plugin-gfm": "^1.0.2",
23
25
  "user-agents": "^1.1.208",
24
- "uuid": "^11.1.0"
26
+ "uuid": "^11.1.0",
27
+ "zod": "^4.1.8"
25
28
  },
26
29
  "devDependencies": {
27
30
  "@types/axios": "^0.14.0",
@@ -76,6 +79,9 @@
76
79
  "test": "vitest run",
77
80
  "test:unit": "vitest run",
78
81
  "test:live": "LIVE_NETWORK=1 vitest run test/live/*.test.ts",
79
- "examples:hybrid-md": "node scripts/hybrid-md-dump.mjs"
82
+ "examples:hybrid-md": "node scripts/hybrid-md-dump.mjs",
83
+ "simple-scraping": "tsx examples/simple-scraping.ts",
84
+ "smart-scraping": "tsx examples/smart-scraping.ts",
85
+ "ai-extraction": "tsx examples/ai-extraction.ts"
80
86
  }
81
87
  }