@purepageio/fetch-engines 0.5.1-rc.0 → 0.6.0
- package/README.md +214 -419
- package/dist/StructuredContentEngine.d.ts +67 -0
- package/dist/StructuredContentEngine.d.ts.map +1 -0
- package/dist/StructuredContentEngine.js +116 -0
- package/dist/StructuredContentEngine.js.map +1 -0
- package/dist/index.d.ts +1 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +1 -0
- package/dist/index.js.map +1 -1
- package/package.json +9 -3
package/README.md
CHANGED
|
@@ -3,546 +3,341 @@
|
|
|
3
3
|
[](https://www.npmjs.com/package/@purepageio/fetch-engines)
|
|
4
4
|
[](https://opensource.org/licenses/MIT)
|
|
5
5
|
|
|
6
|
-
|
|
6
|
+
Web scraping requires handling static HTML, JavaScript-heavy sites, network errors, retries, caching, and bot detection. Managing browser automation tools like Playwright adds complexity with resource pooling and stealth configurations.
|
|
7
7
|
|
|
8
|
-
`@purepageio/fetch-engines`
|
|
8
|
+
`@purepageio/fetch-engines` provides engines for retrieving web content through a unified API.
|
|
9
9
|
|
|
10
|
-
**
|
|
10
|
+
**Key Benefits:**
|
|
11
11
|
|
|
12
|
-
- **Unified API:**
|
|
13
|
-
- **
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
- **
|
|
17
|
-
- **
|
|
18
|
-
- **
|
|
19
|
-
- **
|
|
20
|
-
- **TypeScript Ready:** Fully typed for a better development experience.
|
|
21
|
-
|
|
22
|
-
This package provides a high-level abstraction, letting you focus on using the web content rather than the intricacies of fetching it.
|
|
12
|
+
- **Unified API:** Use `fetchHTML(url, options?)` for processed content or `fetchContent(url, options?)` for raw content
|
|
13
|
+
- **Smart Fallback Strategy:** Tries fast HTTP first, automatically falls back to full browser for complex sites
|
|
14
|
+
- **AI-Powered Data Extraction:** Extract structured data from web pages using OpenAI and Zod schemas
|
|
15
|
+
- **Raw Content Support:** Retrieve PDFs, images, APIs with the same fallback logic
|
|
16
|
+
- **Built-in Resilience:** Caching, retries, and standardised error handling
|
|
17
|
+
- **Browser Management:** Automatic browser pooling and stealth measures for complex sites
|
|
18
|
+
- **Content Transformation:** Convert HTML to clean Markdown
|
|
19
|
+
- **TypeScript Ready:** Fully typed codebase
|
|
23
20
|
|
|
24
21
|
## Table of Contents
|
|
25
22
|
|
|
26
|
-
- [Features](#features)
|
|
27
23
|
- [Installation](#installation)
|
|
28
24
|
- [Engines](#engines)
|
|
29
25
|
- [Basic Usage](#basic-usage)
|
|
30
26
|
- [fetchHTML vs fetchContent](#fetchhtml-vs-fetchcontent)
|
|
27
|
+
- [Structured Content Extraction](#structured-content-extraction)
|
|
31
28
|
- [Configuration](#configuration)
|
|
32
29
|
- [Return Value](#return-value)
|
|
33
30
|
- [API Reference](#api-reference)
|
|
34
|
-
- [Stealth
|
|
31
|
+
- [Stealth Features](#stealth-features)
|
|
35
32
|
- [Error Handling](#error-handling)
|
|
36
|
-
- [Logging](#logging)
|
|
37
33
|
- [Contributing](#contributing)
|
|
38
|
-
- [License](#license)
|
|
39
|
-
|
|
40
|
-
## Features
|
|
41
|
-
|
|
42
|
-
- **Multiple Fetching Strategies:** Choose between `FetchEngine` (lightweight `fetch`) or `HybridEngine` (smart fallback to a full browser engine).
|
|
43
|
-
- **Unified API:** Simple `fetchHTML(url, options?)` interface for processed content and `fetchContent(url, options?)` for raw content across both primary engines.
|
|
44
|
-
- **Raw Content Fetching:** Use `fetchContent()` to retrieve any type of content (PDFs, images, JSON, XML, etc.) without HTML processing or content-type restrictions.
|
|
45
|
-
- **Custom Headers:** Easily provide custom HTTP headers for requests in both `FetchEngine` and `HybridEngine`.
|
|
46
|
-
- **Configurable Retries:** Automatic retries on failure with customizable attempts and delays.
|
|
47
|
-
- **Built-in Caching:** In-memory caching with configurable TTL to reduce redundant fetches.
|
|
48
|
-
- **Playwright Stealth:** When `HybridEngine` utilizes its browser capabilities, it automatically integrates `playwright-extra` and stealth plugins to bypass common bot detection.
|
|
49
|
-
- **Managed Browser Pooling:** Efficient resource management for `HybridEngine`'s browser mode with configurable browser/context limits and lifecycles.
|
|
50
|
-
- **Smart Fallbacks:** `HybridEngine` uses `FetchEngine` first, falling back to its internal browser engine only when needed. The internal browser engine can also optionally use a fast HTTP fetch before launching a full browser.
|
|
51
|
-
- **Content Conversion:** Optionally convert fetched HTML directly to Markdown, preserving `<table>` elements even without header rows.
|
|
52
|
-
- **Standardized Errors:** Custom `FetchError` classes provide context on failures.
|
|
53
|
-
- **TypeScript Ready:** Fully typed codebase for enhanced developer experience.
|
|
54
34
|
|
|
55
35
|
## Installation
|
|
56
36
|
|
|
57
37
|
```bash
|
|
58
38
|
pnpm add @purepageio/fetch-engines
|
|
59
|
-
# or with npm
|
|
60
|
-
npm install @purepageio/fetch-engines
|
|
61
|
-
# or with yarn
|
|
62
|
-
yarn add @purepageio/fetch-engines
|
|
63
39
|
```
|
|
64
40
|
|
|
65
|
-
|
|
41
|
+
For `HybridEngine` (uses Playwright), install browser binaries:
|
|
66
42
|
|
|
67
43
|
```bash
|
|
68
44
|
pnpm exec playwright install
|
|
69
|
-
# or
|
|
70
|
-
npx playwright install
|
|
71
45
|
```
|
|
72
46
|
|
|
73
47
|
## Engines
|
|
74
48
|
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
49
|
+
**`HybridEngine`** (recommended): Attempts fast HTTP fetch first, falls back to Playwright browser on failure or when SPA shell detected. Handles both simple and complex sites automatically.
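The fallback order described for `HybridEngine` can be illustrated with a small, self-contained sketch. This is not the library's implementation: `fetchWithFallback`, `looksLikeSpaShell`, the 50-character threshold, and the injected `httpFetch` / `browserFetch` functions are all hypothetical names chosen for illustration, and the shell heuristic is only one plausible way to detect an unrendered SPA page.

```typescript
interface PageResult {
  html: string;
  via: "http" | "browser";
}

// Hypothetical heuristic: a page whose <body> contains almost no visible text
// (after dropping scripts and tags) is treated as an unrendered SPA shell.
function looksLikeSpaShell(html: string): boolean {
  const body = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  if (!body) return false;
  const visibleText = body[1]
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return visibleText.length < 50; // threshold is an arbitrary assumption
}

// Try the cheap HTTP path first; use the browser path when HTTP fails
// or when the HTTP response looks like an empty SPA shell.
async function fetchWithFallback(
  url: string,
  httpFetch: (url: string) => Promise<string>,
  browserFetch: (url: string) => Promise<string>,
): Promise<PageResult> {
  try {
    const html = await httpFetch(url);
    if (!looksLikeSpaShell(html)) {
      return { html, via: "http" };
    }
  } catch {
    // HTTP attempt failed; fall through to the browser path.
  }
  return { html: await browserFetch(url), via: "browser" };
}
```

Injecting the two fetchers keeps the decision logic testable without a network or a real browser.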
|
|
50
|
+
|
|
51
|
+
**`FetchEngine`**: Lightweight HTTP-only engine for basic sites without browser fallback.
|
|
52
|
+
|
|
53
|
+
**`StructuredContentEngine`**: AI-powered engine that combines HybridEngine with OpenAI for structured data extraction.
|
|
78
54
|
|
|
79
55
|
## Basic Usage
|
|
80
56
|
|
|
81
|
-
###
|
|
57
|
+
### Quick Start
|
|
82
58
|
|
|
83
59
|
```typescript
|
|
84
|
-
import {
|
|
85
|
-
|
|
86
|
-
const engine = new
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
console.log(`\nFetched ${mdResult.url} (ContentType: ${mdResult.contentType})`);
|
|
100
|
-
console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
|
|
101
|
-
} catch (error) {
|
|
102
|
-
console.error("Fetch failed:", error);
|
|
103
|
-
}
|
|
104
|
-
}
|
|
105
|
-
main();
|
|
60
|
+
import { HybridEngine } from "@purepageio/fetch-engines";
|
|
61
|
+
|
|
62
|
+
const engine = new HybridEngine();
|
|
63
|
+
|
|
64
|
+
// Simple sites use fast HTTP
|
|
65
|
+
const simple = await engine.fetchHTML("https://example.com");
|
|
66
|
+
console.log(`Title: ${simple.title}`);
|
|
67
|
+
|
|
68
|
+
// Complex sites automatically use browser
|
|
69
|
+
const complex = await engine.fetchHTML("https://spa-site.com", {
|
|
70
|
+
markdown: true,
|
|
71
|
+
spaMode: true,
|
|
72
|
+
});
|
|
73
|
+
|
|
74
|
+
await engine.cleanup(); // Important: releases browser resources
|
|
106
75
|
```
|
|
107
76
|
|
|
108
|
-
###
|
|
77
|
+
### With Custom Headers
|
|
109
78
|
|
|
110
79
|
```typescript
|
|
111
|
-
import { HybridEngine } from "@purepageio/fetch-engines";
|
|
112
|
-
|
|
113
|
-
// Engine configured to fetch HTML by default for its internal engines
|
|
114
|
-
// and provide some custom headers for all requests made by HybridEngine.
|
|
115
80
|
const engine = new HybridEngine({
|
|
116
|
-
|
|
117
|
-
headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
|
|
118
|
-
// Other PlaywrightEngine specific configs can be set here for the fallback mechanism
|
|
119
|
-
// e.g., playwrightLaunchOptions: { args: ["--disable-gpu"] }
|
|
81
|
+
headers: { "X-Custom-Header": "value" },
|
|
120
82
|
});
|
|
121
83
|
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
const urlComplex = "https://quotes.toscrape.com/"; // JS-heavy site, likely requiring Playwright fallback
|
|
126
|
-
|
|
127
|
-
// --- Scenario 1: FetchEngine part of HybridEngine handles it ---
|
|
128
|
-
console.log(`\nFetching simple site (${urlSimple}) with per-request headers...`);
|
|
129
|
-
const result1 = await engine.fetchHTML(urlSimple, {
|
|
130
|
-
headers: { "X-Request-Specific": "SimpleRequestValue" },
|
|
131
|
-
});
|
|
132
|
-
// FetchEngine (via HybridEngine) will use:
|
|
133
|
-
// 1. Its base default headers (User-Agent etc.)
|
|
134
|
-
// 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
|
|
135
|
-
// 3. Overridden/augmented by per-request headers ("X-Request-Specific")
|
|
136
|
-
console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
|
|
137
|
-
console.log(`Content (HTML): ${result1.content.substring(0, 100)}...`);
|
|
138
|
-
|
|
139
|
-
// --- Scenario 2: Playwright part of HybridEngine handles it ---
|
|
140
|
-
console.log(`\nFetching complex site (${urlComplex}) requesting Markdown and with per-request headers...`);
|
|
141
|
-
const result2 = await engine.fetchHTML(urlComplex, {
|
|
142
|
-
markdown: true,
|
|
143
|
-
headers: { "X-Request-Specific": "ComplexRequestValue", "X-Another": "ComplexAnother" },
|
|
144
|
-
});
|
|
145
|
-
// PlaywrightEngine (via HybridEngine) will use:
|
|
146
|
-
// 1. Its base default headers (User-Agent etc. if doing HTTP fallback, or for page.setExtraHTTPHeaders)
|
|
147
|
-
// 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
|
|
148
|
-
// 3. Overridden/augmented by per-request headers ("X-Request-Specific", "X-Another")
|
|
149
|
-
// The markdown: true option will be respected by the Playwright part.
|
|
150
|
-
console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
|
|
151
|
-
console.log(`Content (Markdown):\n${result2.content.substring(0, 300)}...`);
|
|
152
|
-
} catch (error) {
|
|
153
|
-
console.error("Hybrid fetch failed:", error);
|
|
154
|
-
} finally {
|
|
155
|
-
await engine.cleanup(); // Important for HybridEngine
|
|
156
|
-
}
|
|
157
|
-
}
|
|
158
|
-
main();
|
|
84
|
+
const result = await engine.fetchHTML("https://example.com", {
|
|
85
|
+
headers: { "X-Request-Header": "value" },
|
|
86
|
+
});
|
|
159
87
|
```
|
|
160
88
|
|
|
161
|
-
### Raw Content
|
|
89
|
+
### Raw Content (PDFs, Images, APIs)
|
|
162
90
|
|
|
163
91
|
```typescript
|
|
164
|
-
import { HybridEngine } from "@purepageio/fetch-engines";
|
|
165
|
-
|
|
166
92
|
const engine = new HybridEngine();
|
|
167
93
|
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
const jsonResult = await engine.fetchContent("https://api.example.com/data");
|
|
179
|
-
console.log(`JSON Content-Type: ${jsonResult.contentType}`);
|
|
180
|
-
console.log(`JSON Data: ${typeof jsonResult.content === "string" ? jsonResult.content : "Binary data"}`);
|
|
181
|
-
|
|
182
|
-
// Fetch with custom headers
|
|
183
|
-
const customResult = await engine.fetchContent("https://protected-api.example.com/data", {
|
|
184
|
-
headers: {
|
|
185
|
-
Authorization: "Bearer your-token",
|
|
186
|
-
Accept: "application/json",
|
|
187
|
-
},
|
|
188
|
-
});
|
|
189
|
-
console.log(`Custom fetch result: ${customResult.statusCode}`);
|
|
190
|
-
} catch (error) {
|
|
191
|
-
console.error("Raw content fetch failed:", error);
|
|
192
|
-
} finally {
|
|
193
|
-
await engine.cleanup();
|
|
194
|
-
}
|
|
195
|
-
}
|
|
196
|
-
fetchRawContent();
|
|
94
|
+
// Fetch PDF
|
|
95
|
+
const pdf = await engine.fetchContent("https://example.com/doc.pdf");
|
|
96
|
+
console.log(`PDF size: ${pdf.content.length} bytes`);
|
|
97
|
+
|
|
98
|
+
// Fetch JSON API with auth
|
|
99
|
+
const api = await engine.fetchContent("https://api.example.com/data", {
|
|
100
|
+
headers: { Authorization: "Bearer token" },
|
|
101
|
+
});
|
|
102
|
+
|
|
103
|
+
await engine.cleanup();
|
|
197
104
|
```
|
|
198
105
|
|
|
199
106
|
## fetchHTML vs fetchContent
|
|
200
107
|
|
|
201
|
-
Choose the right method for your use case:
|
|
202
|
-
|
|
203
108
|
### `fetchHTML(url, options?)`
|
|
204
109
|
|
|
205
|
-
**Use
|
|
206
|
-
|
|
207
|
-
**Features:**
|
|
110
|
+
**Use for:** Web page content extraction
|
|
208
111
|
|
|
209
|
-
- Processes HTML
|
|
112
|
+
- Processes HTML and extracts metadata (title, etc.)
|
|
210
113
|
- Supports HTML-to-Markdown conversion
|
|
211
|
-
- Optimized for web page content
|
|
212
114
|
- Content-type restrictions (HTML/XML only)
|
|
213
115
|
- Returns processed content as `string`
|
|
214
116
|
|
|
215
|
-
**Best for:**
|
|
216
|
-
|
|
217
|
-
- Web scraping
|
|
218
|
-
- Content extraction
|
|
219
|
-
- Blog/article processing
|
|
220
|
-
- Any scenario where you need structured HTML or Markdown
|
|
221
|
-
|
|
222
117
|
### `fetchContent(url, options?)`
|
|
223
118
|
|
|
224
|
-
**Use
|
|
225
|
-
|
|
226
|
-
**Features:**
|
|
119
|
+
**Use for:** Raw content retrieval (like standard `fetch`)
|
|
227
120
|
|
|
228
121
|
- Retrieves any content type (PDFs, images, JSON, XML, etc.)
|
|
229
122
|
- No content-type restrictions
|
|
230
|
-
- Returns
|
|
231
|
-
- Preserves original MIME type
|
|
232
|
-
- Minimal processing overhead
|
|
233
|
-
|
|
234
|
-
**Best for:**
|
|
235
|
-
|
|
236
|
-
- API consumption
|
|
237
|
-
- File downloads (PDFs, images, etc.)
|
|
238
|
-
- Binary content retrieval
|
|
239
|
-
- Any scenario where you need the raw response
|
|
123
|
+
- Returns `Buffer` (binary) or `string` (text)
|
|
124
|
+
- Preserves original MIME type
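Because `fetchContent` may hand back either binary or text content, callers usually need to narrow the type before using it. The sketch below shows one way to do that; `describeContent` is a hypothetical helper, and it accepts `Uint8Array` so that it runs without Node type definitions (Node's `Buffer` is a subclass of `Uint8Array`, so a `Buffer` also takes the binary branch).

```typescript
// Narrow a `Buffer | string` style payload before using it.
// Buffer extends Uint8Array, so Buffers are handled by the second branch.
function describeContent(content: Uint8Array | string, contentType: string): string {
  if (typeof content === "string") {
    return `${contentType}: text, ${content.length} characters`;
  }
  return `${contentType}: binary, ${content.byteLength} bytes`;
}
```

In real code the two arguments would come from the `content` and `contentType` fields of a `fetchContent` result.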
|
|
240
125
|
|
|
241
126
|
### Example Comparison
|
|
242
127
|
|
|
243
128
|
```typescript
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
250
|
-
|
|
251
|
-
console.log(
|
|
252
|
-
console.log(typeof
|
|
253
|
-
|
|
254
|
-
//
|
|
255
|
-
const
|
|
256
|
-
console.log(
|
|
257
|
-
console.log(contentResult.contentType); // "text/html" (original MIME type)
|
|
258
|
-
console.log(typeof contentResult.content); // "string" (raw HTML)
|
|
259
|
-
|
|
260
|
-
// fetchContent - for non-HTML content
|
|
261
|
-
const pdfResult = await engine.fetchContent("https://example.com/doc.pdf");
|
|
262
|
-
console.log(pdfResult.contentType); // "application/pdf"
|
|
263
|
-
console.log(Buffer.isBuffer(pdfResult.content)); // true (binary content)
|
|
129
|
+
// fetchHTML - processes content
|
|
130
|
+
const html = await engine.fetchHTML("https://example.com");
|
|
131
|
+
console.log(html.title); // "Example Domain"
|
|
132
|
+
console.log(html.contentType); // "html" or "markdown"
|
|
133
|
+
|
|
134
|
+
// fetchContent - raw content
|
|
135
|
+
const raw = await engine.fetchContent("https://example.com");
|
|
136
|
+
console.log(raw.contentType); // "text/html"
|
|
137
|
+
console.log(typeof raw.content); // "string" (raw HTML)
|
|
138
|
+
|
|
139
|
+
// Binary content
|
|
140
|
+
const pdf = await engine.fetchContent("https://example.com/doc.pdf");
|
|
141
|
+
console.log(Buffer.isBuffer(pdf.content)); // true
|
|
264
142
|
```
|
|
265
143
|
|
|
266
|
-
##
|
|
144
|
+
## Structured Content Extraction
|
|
267
145
|
|
|
268
|
-
|
|
146
|
+
Extract structured data from web pages using AI and Zod schemas.
|
|
269
147
|
|
|
270
|
-
###
|
|
148
|
+
### Prerequisites
|
|
271
149
|
|
|
272
|
-
|
|
150
|
+
Set environment variable:
|
|
273
151
|
|
|
274
|
-
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
152
|
+
```bash
|
|
153
|
+
export OPENAI_API_KEY="your-openai-api-key"
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
### Basic Usage
|
|
278
157
|
|
|
279
158
|
```typescript
|
|
280
|
-
|
|
281
|
-
|
|
282
|
-
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
159
|
+
import { fetchStructuredContent } from "@purepageio/fetch-engines";
|
|
160
|
+
import { z } from "zod";
|
|
161
|
+
|
|
162
|
+
const articleSchema = z.object({
|
|
163
|
+
title: z.string(),
|
|
164
|
+
author: z.string().optional(),
|
|
165
|
+
publishDate: z.string().optional(),
|
|
166
|
+
summary: z.string(),
|
|
167
|
+
tags: z.array(z.string()),
|
|
287
168
|
});
|
|
169
|
+
|
|
170
|
+
const result = await fetchStructuredContent("https://example.com/article", articleSchema, {
|
|
171
|
+
model: "gpt-4.1-mini",
|
|
172
|
+
customPrompt: "Extract main article information",
|
|
173
|
+
});
|
|
174
|
+
|
|
175
|
+
console.log("Extracted:", result.data);
|
|
176
|
+
console.log("Token usage:", result.usage);
|
|
288
177
|
```
|
|
289
178
|
|
|
290
|
-
|
|
291
|
-
|
|
292
|
-
1. Headers passed in `fetchHTML(url, { headers: { ... } })` (highest precedence).
|
|
293
|
-
2. Headers passed in the `FetchEngine` constructor `new FetchEngine({ headers: { ... } })`.
|
|
294
|
-
3. Default headers of the `FetchEngine` (e.g., its default `User-Agent`) (lowest precedence).
|
|
295
|
-
|
|
296
|
-
### `PlaywrightEngineConfig` (Used by `HybridEngine`)
|
|
297
|
-
|
|
298
|
-
The `HybridEngine` constructor accepts a `PlaywrightEngineConfig` object. These settings configure the underlying `FetchEngine` and `PlaywrightEngine` (for fallback scenarios) and the hybrid strategy itself. When using `HybridEngine`, you are essentially configuring how it will behave and how its internal Playwright capabilities will operate if needed.
|
|
299
|
-
|
|
300
|
-
**Key Options for `HybridEngine` (from `PlaywrightEngineConfig`):**
|
|
301
|
-
|
|
302
|
-
| Option | Type | Default | Description |
|
|
303
|
-
| ------------------------- | ------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
304
|
-
| `headers` | `Record<string, string>` | `{}` | Custom HTTP headers. For `HybridEngine`, these serve as default headers for both its internal `FetchEngine` (constructor) and `PlaywrightEngine` (constructor). They can be overridden by headers in `HybridEngine.fetchHTML()` options. |
|
|
305
|
-
| `markdown` | `boolean` | `false` | Default Markdown conversion. For `HybridEngine`: sets default for internal `FetchEngine` (constructor) and internal `PlaywrightEngine`. Can be overridden per-request for the `PlaywrightEngine` part. |
|
|
306
|
-
| `useHttpFallback` | `boolean` | `true` | (For Playwright part) If `true`, attempts a fast HTTP fetch before using Playwright. Ineffective if `spaMode` is `true`. |
|
|
307
|
-
| `useHeadedModeFallback` | `boolean` | `false` | (For Playwright part) If `true`, automatically retries specific failed Playwright attempts in headed (visible) mode. |
|
|
308
|
-
| `defaultFastMode` | `boolean` | `true` | If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively `false` if `spaMode` is `true`. |
|
|
309
|
-
| `simulateHumanBehavior` | `boolean` | `true` | If `true` (and not `fastMode` or `spaMode`), attempts basic human-like interactions. |
|
|
310
|
-
| `concurrentPages` | `number` | `3` | Max number of pages to process concurrently within the engine queue. |
|
|
311
|
-
| `maxRetries` | `number` | `3` | Max retry attempts for a failed fetch (excluding initial try). |
|
|
312
|
-
| `retryDelay` | `number` | `5000` | Delay (ms) between retries. |
|
|
313
|
-
| `cacheTTL` | `number` | `900000` | Cache Time-To-Live (ms). `0` disables caching. (15 mins default) |
|
|
314
|
-
| `spaMode` | `boolean` | `false` | If `true`, enables Single Page Application mode. This typically bypasses `useHttpFallback`, effectively sets `fastMode` to `false`, uses more patient load conditions (e.g., network idle), and may apply `spaRenderDelayMs`. Recommended for JavaScript-heavy sites. |
|
|
315
|
-
| `spaRenderDelayMs` | `number` | `0` | Explicit delay (ms) after page load events in `spaMode` to allow for client-side rendering. Only applies if `spaMode` is `true`. |
|
|
316
|
-
| `playwrightLaunchOptions` | `LaunchOptions` | `undefined` | (For Playwright part) Optional Playwright launch options (from `playwright` package, e.g., `{ args: ['--some-flag'] }`) passed when a browser instance is created. Merged with internal defaults. |
|
|
317
|
-
|
|
318
|
-
**Browser Pool Options (For `HybridEngine`'s internal `PlaywrightEngine`):**
|
|
319
|
-
|
|
320
|
-
| Option | Type | Default | Description |
|
|
321
|
-
| -------------------------- | -------------------------- | ----------- | ------------------------------------------------------------------------------------------- |
|
|
322
|
-
| `maxBrowsers` | `number` | `2` | Max concurrent browser instances managed by the pool. |
|
|
323
|
-
| `maxPagesPerContext` | `number` | `6` | Max pages per browser context before recycling. |
|
|
324
|
-
| `maxBrowserAge` | `number` | `1200000` | Max age (ms) a browser instance lives before recycling. (20 mins default) |
|
|
325
|
-
| `healthCheckInterval` | `number` | `60000` | How often (ms) the pool checks browser health. (1 min default) |
|
|
326
|
-
| `useHeadedMode` | `boolean` | `false` | Forces the _entire pool_ (for Playwright part) to launch browsers in headed (visible) mode. |
|
|
327
|
-
| `poolBlockedDomains` | `string[]` | `[]` | List of domain glob patterns to block requests to (for Playwright part). |
|
|
328
|
-
| `poolBlockedResourceTypes` | `string[]` | `[]` | List of Playwright resource types (e.g., 'image', 'font') to block (for Playwright part). |
|
|
329
|
-
| `proxy` | `{ server: string, ... }?` | `undefined` | Proxy configuration object (see `PlaywrightEngineConfig` type) (for Playwright part). |
|
|
330
|
-
|
|
331
|
-
### `HybridEngine` - Configuration Summary & Header Precedence
|
|
332
|
-
|
|
333
|
-
When you configure `HybridEngine` using `PlaywrightEngineConfig`:
|
|
334
|
-
|
|
335
|
-
- **`headers`**: Constructor headers are passed to the internal `FetchEngine`'s constructor and the internal `PlaywrightEngine`'s constructor.
|
|
336
|
-
- **`markdown`**: Sets the default for both internal engines.
|
|
337
|
-
- **`spaMode`**: Sets the default for `HybridEngine`'s SPA shell detection and for the internal `PlaywrightEngine`.
|
|
338
|
-
- Other options primarily configure the internal `PlaywrightEngine` or general retry/caching logic.
|
|
339
|
-
|
|
340
|
-
**Per-request `options` in `HybridEngine.fetchHTML(url, options)`:**
|
|
341
|
-
|
|
342
|
-
- **`headers?: Record<string, string>`**:
|
|
343
|
-
- These headers override any headers set in the `HybridEngine` constructor.
|
|
344
|
-
- If `FetchEngine` is used: These headers are passed to `FetchEngine.fetchHTML(url, { headers: ... })`. `FetchEngine` then merges them with its constructor headers and base defaults.
|
|
345
|
-
- If `PlaywrightEngine` (fallback) is used: These headers are merged with `HybridEngine` constructor headers (options take precedence) and the result is passed to `PlaywrightEngine`'s `fetchHTML()`. `PlaywrightEngine` then applies its own logic (e.g., for `page.setExtraHTTPHeaders` or its HTTP fallback).
|
|
346
|
-
- **`markdown?: boolean`**:
|
|
347
|
-
- If `FetchEngine` is used: This per-request option is **ignored**. `FetchEngine` uses its own constructor `markdown` setting.
|
|
348
|
-
- If `PlaywrightEngine` (fallback) is used: This overrides `PlaywrightEngine`'s default and determines its output format.
|
|
349
|
-
- **`spaMode?: boolean`**: Overrides `HybridEngine`'s default SPA mode and is passed to `PlaywrightEngine` if used.
|
|
350
|
-
- **`fastMode?: boolean`**: Passed to `PlaywrightEngine` if used; no effect on `FetchEngine`.
|
|
179
|
+
### StructuredContentEngine Class
|
|
351
180
|
|
|
352
181
|
```typescript
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
|
|
356
|
-
|
|
357
|
-
|
|
358
|
-
|
|
359
|
-
|
|
360
|
-
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
|
|
182
|
+
import { StructuredContentEngine } from "@purepageio/fetch-engines";
|
|
183
|
+
|
|
184
|
+
const productSchema = z.object({
|
|
185
|
+
name: z.string(),
|
|
186
|
+
price: z.number(),
|
|
187
|
+
inStock: z.boolean(),
|
|
188
|
+
});
|
|
189
|
+
|
|
190
|
+
const engine = new StructuredContentEngine({
|
|
191
|
+
spaMode: true,
|
|
192
|
+
spaRenderDelayMs: 2000,
|
|
193
|
+
});
|
|
194
|
+
|
|
195
|
+
const result = await engine.fetchStructuredContent("https://shop.com/product", productSchema);
|
|
196
|
+
console.log(`${result.data.name} costs $${result.data.price}`);
|
|
197
|
+
|
|
198
|
+
await engine.cleanup();
|
|
367
199
|
```
|
|
368
200
|
|
|
369
|
-
|
|
201
|
+
### Supported Models
|
|
370
202
|
|
|
371
|
-
|
|
203
|
+
- `'gpt-5-mini'` - Latest model, mini version **(default)**
|
|
204
|
+
- `'gpt-5'` - Most capable model
|
|
205
|
+
- `'gpt-4.1-mini'` - Fast and cost-effective
|
|
206
|
+
- `'gpt-4.1'` - More capable GPT-4.1 version
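If you want compile-time checking of the model names listed above, a literal union is one option. This is a sketch under assumptions: `SUPPORTED_MODELS`, `SupportedModel`, and `resolveModel` are hypothetical names, and whether the library itself rejects other model strings is not stated in this README.

```typescript
// Model names copied from the list above; everything else here is illustrative.
const SUPPORTED_MODELS = ["gpt-5-mini", "gpt-5", "gpt-4.1-mini", "gpt-4.1"] as const;
type SupportedModel = (typeof SUPPORTED_MODELS)[number];

const DEFAULT_MODEL: SupportedModel = "gpt-5-mini";

// Returns the documented default when no model is given,
// and throws for names that are not in the documented list.
function resolveModel(requested?: string): SupportedModel {
  if (requested === undefined) return DEFAULT_MODEL;
  const match = SUPPORTED_MODELS.find((model) => model === requested);
  if (match === undefined) {
    throw new Error(`Unsupported model: ${requested}`);
  }
  return match;
}
```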
|
|
372
207
|
|
|
373
|
-
|
|
208
|
+
## Configuration
|
|
374
209
|
|
|
375
|
-
|
|
376
|
-
- `contentType` (`'html' | 'markdown'`): Indicates the format of the `content` string.
|
|
377
|
-
- `title` (`string | null`): Extracted page title (from original HTML).
|
|
378
|
-
- `url` (`string`): Final URL after redirects.
|
|
379
|
-
- `isFromCache` (`boolean`): True if the result came from cache.
|
|
380
|
-
- `statusCode` (`number | undefined`): HTTP status code.
|
|
381
|
-
- `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
|
|
210
|
+
### FetchEngine Options
|
|
382
211
|
|
|
383
|
-
|
|
212
|
+
| Option | Type | Default | Description |
|
|
213
|
+
| ---------- | ------------------------ | ------- | ------------------------ |
|
|
214
|
+
| `markdown` | `boolean` | `false` | Convert HTML to Markdown |
|
|
215
|
+
| `headers` | `Record<string, string>` | `{}` | Custom HTTP headers |
|
|
384
216
|
|
|
385
|
-
|
|
217
|
+
### HybridEngine Configuration
|
|
386
218
|
|
|
387
|
-
|
|
388
|
-
|
|
389
|
-
|
|
390
|
-
|
|
391
|
-
|
|
392
|
-
|
|
393
|
-
|
|
219
|
+
| Option | Type | Default | Description |
|
|
220
|
+
| ------------------ | ------------------------ | -------- | -------------------------------------------- |
|
|
221
|
+
| `headers` | `Record<string, string>` | `{}` | Default headers for both engines |
|
|
222
|
+
| `markdown` | `boolean` | `false` | Default Markdown conversion |
|
|
223
|
+
| `useHttpFallback` | `boolean` | `true` | Try HTTP before Playwright |
|
|
224
|
+
| `spaMode` | `boolean` | `false` | Enable SPA mode with patient load conditions |
|
|
225
|
+
| `spaRenderDelayMs` | `number` | `0` | Delay after page load in SPA mode |
|
|
226
|
+
| `maxRetries` | `number` | `3` | Max retry attempts |
|
|
227
|
+
| `cacheTTL` | `number` | `900000` | Cache TTL in ms (15 min default) |
|
|
228
|
+
| `concurrentPages` | `number` | `3` | Max concurrent pages |
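To make the `cacheTTL` option concrete, here is a minimal time-to-live cache sketch. It is not the engine's cache: `TtlCache` and its injectable `now` clock are hypothetical, and treating a TTL of `0` as "caching disabled" follows the previous README's description rather than anything stated in the table above.

```typescript
// Hypothetical in-memory cache keyed by URL. The clock is injectable
// so that expiry can be exercised without waiting in real time.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    if (this.ttlMs === 0) return undefined; // assumption: 0 disables caching
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    if (this.ttlMs === 0) return;
    this.entries.set(key, { value, storedAt: this.now() });
  }
}
```

With the default of `900000` ms, an entry stored at time `t` would be served until just before `t + 15 minutes` in this sketch.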
|
|
394
229
|
|
|
395
|
-
|
|
230
|
+
### Browser Pool Options
|
|
396
231
|
|
|
397
|
-
|
|
232
|
+
| Option | Type | Default | Description |
|
|
233
|
+
| -------------------- | -------- | --------- | ---------------------------------- |
|
|
234
|
+
| `maxBrowsers` | `number` | `2` | Max browser instances |
|
|
235
|
+
| `maxPagesPerContext` | `number` | `6` | Pages per context before recycling |
|
|
236
|
+
| `maxBrowserAge` | `number` | `1200000` | Browser lifetime (20 min) |
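The two recycling limits in this table can be read as simple threshold checks. The sketch below is an interpretation, not the pool's actual logic: `PoolLimits`, `contextNeedsRecycling`, and `browserNeedsRecycling` are hypothetical names, and the exact comparison (`>=`) is an assumption.

```typescript
interface PoolLimits {
  maxPagesPerContext: number; // pages per context before recycling
  maxBrowserAge: number; // browser lifetime in milliseconds
}

// A context is recycled once it has served its page quota.
function contextNeedsRecycling(pagesOpened: number, limits: PoolLimits): boolean {
  return pagesOpened >= limits.maxPagesPerContext;
}

// A browser is recycled once it has been alive for its maximum age.
function browserNeedsRecycling(launchedAtMs: number, nowMs: number, limits: PoolLimits): boolean {
  return nowMs - launchedAtMs >= limits.maxBrowserAge;
}
```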
|
|
398
237
|
|
|
399
|
-
|
|
400
|
-
- `options?` (`FetchOptions`): Optional per-request overrides.
|
|
401
|
-
- `headers?: Record<string, string>`: Custom headers for this specific request.
|
|
402
|
-
- `markdown?: boolean`: (For `HybridEngine`'s Playwright part) Request Markdown conversion.
|
|
403
|
-
- `fastMode?: boolean`: (For `HybridEngine`'s Playwright part) Override fast mode.
|
|
404
|
-
- `spaMode?: boolean`: (For `HybridEngine`) Override SPA mode behavior for this request.
|
|
405
|
-
- **Returns:** `Promise<HTMLFetchResult>`
|
|
238
|
+
### Header Precedence
|
|
406
239
|
|
|
407
|
-
|
|
240
|
+
Headers merge in this order (highest precedence first):
|
|
408
241
|
|
|
409
|
-
|
|
242
|
+
1. Request-specific headers in `fetchHTML(url, { headers })`
|
|
243
|
+
2. Engine constructor headers
|
|
244
|
+
3. Engine default headers
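The precedence order above amounts to a layered merge in which later layers overwrite earlier ones. The following sketch illustrates that idea only; `mergeHeaders` and `HeaderMap` are hypothetical, the sample header values are made up, and lower-casing header names is an assumption made here so that differently cased names collide as HTTP intends.

```typescript
type HeaderMap = Record<string, string>;

// Merge header layers from lowest to highest precedence.
// Names are lower-cased so "User-Agent" and "user-agent" are treated as one header.
function mergeHeaders(
  engineDefaults: HeaderMap,
  constructorHeaders: HeaderMap,
  requestHeaders: HeaderMap,
): HeaderMap {
  const merged: HeaderMap = {};
  for (const layer of [engineDefaults, constructorHeaders, requestHeaders]) {
    for (const headerName of Object.keys(layer)) {
      merged[headerName.toLowerCase()] = layer[headerName];
    }
  }
  return merged;
}

const effectiveHeaders = mergeHeaders(
  { "User-Agent": "default-agent", Accept: "text/html" }, // engine defaults
  { "User-Agent": "my-crawler", "X-Custom-Header": "value" }, // constructor
  { "X-Custom-Header": "per-request" }, // fetchHTML(url, { headers })
);
```

Here the request-level `X-Custom-Header` wins over the constructor value, the constructor `User-Agent` wins over the default, and `Accept` passes through untouched.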
|
|
410
245
|
|
|
411
|
-
|
|
412
|
-
- `options?` (`ContentFetchOptions`): Optional per-request overrides.
|
|
413
|
-
- `headers?: Record<string, string>`: Custom headers for this specific request.
|
|
414
|
-
- **Returns:** `Promise<ContentFetchResult>`
|
|
246
|
+
## Return Value
|
|
415
247
|
|
|
416
|
-
|
|
248
|
+
### HTMLFetchResult (fetchHTML)
|
|
417
249
|
|
|
418
|
-
|
|
250
|
+
- `content` (`string`): HTML or Markdown content
|
|
251
|
+
- `contentType` (`'html' | 'markdown'`): Content format
|
|
252
|
+
- `title` (`string | null`): Extracted page title
|
|
253
|
+
- `url` (`string`): Final URL after redirects
|
|
254
|
+
- `isFromCache` (`boolean`): Cache hit indicator
|
|
255
|
+
- `statusCode` (`number | undefined`): HTTP status code
|
|
419
256
|
|
|
420
|
-
|
|
257
|
+
### ContentFetchResult (fetchContent)
|
|
421
258
|
|
|
422
|
-
|
|
423
|
-
|
|
259
|
+
- `content` (`Buffer | string`): Raw content (binary as Buffer, text as string)
|
|
260
|
+
- `contentType` (`string`): Original MIME type
|
|
261
|
+
- `title` (`string | null`): Title if HTML content, otherwise null
|
|
262
|
+
- `url` (`string`): Final URL after redirects
|
|
263
|
+
- `isFromCache` (`boolean`): Cache hit indicator
|
|
264
|
+
- `statusCode` (`number | undefined`): HTTP status code
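The two result shapes share most of their fields, which makes it easy to write helpers that accept either. The interfaces below are local restatements of the field lists above, written out so the sketch is self-contained; in a real project you would use the package's own types. `summarizeResult` is a hypothetical helper, and `Uint8Array` stands in for `Buffer` to avoid a dependency on Node type definitions.

```typescript
// Local restatements of the documented fields (illustrative only).
interface HTMLFetchResult {
  content: string;
  contentType: "html" | "markdown";
  title: string | null;
  url: string;
  isFromCache: boolean;
  statusCode: number | undefined;
}

interface ContentFetchResult {
  content: Uint8Array | string;
  contentType: string;
  title: string | null;
  url: string;
  isFromCache: boolean;
  statusCode: number | undefined;
}

// One-line log summary that works for either result shape.
function summarizeResult(result: HTMLFetchResult | ContentFetchResult): string {
  const status = result.statusCode === undefined ? "n/a" : String(result.statusCode);
  const cached = result.isFromCache ? " (cached)" : "";
  return `${status} ${result.url} [${result.contentType}]${cached}`;
}
```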
|
|
424
265
|
|
|
425
|
-
##
|
|
266
|
+
## API Reference
|
|
426
267
|
|
|
427
|
-
|
|
268
|
+
### `engine.fetchHTML(url, options?)`
|
|
428
269
|
|
|
429
|
-
|
|
270
|
+
- `url` (`string`): Target URL
|
|
271
|
+
- `options?` (`FetchOptions`):
|
|
272
|
+
- `headers?: Record<string, string>`: Request headers
|
|
273
|
+
- `markdown?: boolean`: Request Markdown (HybridEngine only)
|
|
274
|
+
- `fastMode?: boolean`: Override fast mode (HybridEngine only)
|
|
275
|
+
- `spaMode?: boolean`: Override SPA mode (HybridEngine only)
|
|
276
|
+
- **Returns:** `Promise<HTMLFetchResult>`
|
|
430
277
|
|
|
431
|
-
|
|
278
|
+
### `engine.fetchContent(url, options?)`
|
|
432
279
|
|
|
433
|
-
|
|
280
|
+
- `url` (`string`): Target URL
|
|
281
|
+
- `options?` (`ContentFetchOptions`):
|
|
282
|
+
- `headers?: Record<string, string>`: Request headers
|
|
283
|
+
- **Returns:** `Promise<ContentFetchResult>`
|
|
434
284
|
|
|
435
|
-
|
|
285
|
+
### `fetchStructuredContent(url, schema, options?)`
|
|
436
286
|
|
|
437
|
-
- `
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
- `
|
|
441
|
-
- `
|
|
287
|
+
- `url` (`string`): Target URL
|
|
288
|
+
- `schema` (`z.ZodSchema<T>`): Zod schema for extraction
|
|
289
|
+
- `options?` (`StructuredContentOptions`):
|
|
290
|
+
- `model?: string`: OpenAI model (default: 'gpt-5-mini')
|
|
291
|
+
- `customPrompt?: string`: Additional AI context
|
|
292
|
+
- `engineConfig?: PlaywrightEngineConfig`: HybridEngine config
|
|
293
|
+
- **Returns:** `Promise<StructuredContentResult<T>>`
|
|
442
294
|
|
|
443
|
-
|
|
295
|
+
### `engine.cleanup()`
|
|
444
296
|
|
|
445
|
-
|
|
446
|
-
- **`ERR_NON_HTML_CONTENT`**: Thrown by `FetchEngine` if the content type is not HTML and `markdown` conversion is not requested. **Note:** `fetchContent()` does not throw this error as it supports all content types.
|
|
447
|
-
- **`ERR_PLAYWRIGHT_OPERATION`**: A general error from `HybridEngine`'s browser mode indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The `originalError` property will often contain the specific Playwright error.
|
|
448
|
-
- **`ERR_NAVIGATION`**: Often seen as part of `ERR_PLAYWRIGHT_OPERATION`'s message or in `originalError` when a Playwright navigation (in `HybridEngine`'s browser mode) fails (e.g., timeout, SSL error).
|
|
449
|
-
- **`ERR_MARKDOWN_CONVERSION_NON_HTML`**: Thrown by `HybridEngine` (when its Playwright part is active) if `markdown: true` is requested for a non-HTML content type (e.g., XML, JSON). **Note:** Only applies to `fetchHTML()` as `fetchContent()` doesn't perform markdown conversion.
|
|
450
|
-
- **`ERR_CACHE_ERROR`**: Indicates an issue with cache read/write operations.
|
|
451
|
-
- **`ERR_PROXY_CONFIG_ERROR`**: Problem with proxy configuration (for `HybridEngine`'s browser mode).
|
|
452
|
-
- **`ERR_BROWSER_POOL_EXHAUSTED`**: If `HybridEngine`'s browser pool cannot provide a page.
|
|
453
|
-
- **Other Scenarios (often wrapped by `ERR_PLAYWRIGHT_OPERATION` or a generic `FetchError` when `HybridEngine` uses its browser mode):**
|
|
454
|
-
- Network issues (DNS resolution, connection refused).
|
|
455
|
-
- Proxy connection failures.
|
|
456
|
-
- Page crashes or context/browser disconnections within Playwright.
|
|
457
|
-
- Failures during browser launch or management by the pool.
|
|
297
|
+
Shuts down browser instances for `HybridEngine` and `StructuredContentEngine`. Call when finished to release resources. No-op for `FetchEngine`.
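One way to make sure `cleanup()` always runs is a `try`/`finally` wrapper, the same pattern the `fetchStructuredContent` convenience function uses in this release. The sketch below is illustrative only: `Cleanable` and `withEngine` are hypothetical names, and the only assumption about an engine is that it exposes an async `cleanup()`.

```typescript
// Minimal structural type: anything exposing an async cleanup(), as the engines do.
interface Cleanable {
  cleanup(): Promise<void>;
}

// Hypothetical helper: runs `work` with the engine and always calls cleanup(),
// whether `work` resolves or throws.
async function withEngine<E extends Cleanable, R>(
  engine: E,
  work: (engine: E) => Promise<R>,
): Promise<R> {
  try {
    return await work(engine);
  } finally {
    await engine.cleanup();
  }
}
```

With the real package this could be called as `withEngine(new HybridEngine(), (e) => e.fetchHTML("https://example.com"))`, so browser instances are released even if the fetch throws.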
|
|
458
298
|
|
|
459
|
-
|
|
299
|
+
## Stealth Features
|
|
460
300
|
|
|
461
|
-
|
|
301
|
+
When `HybridEngine` uses Playwright, it automatically applies stealth measures via `playwright-extra` and stealth plugins to bypass common bot detection. No manual configuration is required.
|
|
462
302
|
|
|
463
|
-
|
|
464
|
-
import { HybridEngine, FetchError } from "@purepageio/fetch-engines";
|
|
465
|
-
|
|
466
|
-
// Example using HybridEngine to illustrate error handling
|
|
467
|
-
const engine = new HybridEngine({ useHttpFallback: false, maxRetries: 1 }); // useHttpFallback for Playwright part
|
|
468
|
-
|
|
469
|
-
async function fetchWithHandling(url: string) {
|
|
470
|
-
try {
|
|
471
|
-
// Try fetchHTML first
|
|
472
|
-
const htmlResult = await engine.fetchHTML(url, { headers: { "X-My-Header": "TestValue" } });
|
|
473
|
-
if (htmlResult.error) {
|
|
474
|
-
console.warn(`fetchHTML for ${url} included non-critical error after retries: ${htmlResult.error.message}`);
|
|
475
|
-
}
|
|
476
|
-
console.log(`fetchHTML Success for ${url}! Title: ${htmlResult.title}, Content type: ${htmlResult.contentType}`);
|
|
477
|
-
// Use htmlResult.content
|
|
478
|
-
} catch (error) {
|
|
479
|
-
console.error(`fetchHTML failed for ${url}, trying fetchContent...`);
|
|
480
|
-
|
|
481
|
-
try {
|
|
482
|
-
// Fallback to fetchContent for raw content
|
|
483
|
-
const contentResult = await engine.fetchContent(url, { headers: { "X-My-Header": "TestValue" } });
|
|
484
|
-
if (contentResult.error) {
|
|
485
|
-
console.warn(
|
|
486
|
-
`fetchContent for ${url} included non-critical error after retries: ${contentResult.error.message}`
|
|
487
|
-
);
|
|
488
|
-
}
|
|
489
|
-
console.log(`fetchContent Success for ${url}! Content type: ${contentResult.contentType}`);
|
|
490
|
-
// Use contentResult.content (could be Buffer or string)
|
|
491
|
-
} catch (contentError) {
|
|
492
|
-
console.error(`Both fetchHTML and fetchContent failed for ${url}:`);
|
|
493
|
-
if (contentError instanceof FetchError) {
|
|
494
|
-
console.error(` Error Code: ${contentError.code || "N/A"}`);
|
|
495
|
-
console.error(` Message: ${contentError.message}`);
|
|
496
|
-
if (contentError.statusCode) {
|
|
497
|
-
console.error(` Status Code: ${contentError.statusCode}`);
|
|
498
|
-
}
|
|
499
|
-
if (contentError.originalError) {
|
|
500
|
-
console.error(` Original Error: ${contentError.originalError.name} - ${contentError.originalError.message}`);
|
|
501
|
-
}
|
|
502
|
-
// Example of specific handling:
|
|
503
|
-
if (contentError.code === "ERR_PLAYWRIGHT_OPERATION") {
|
|
504
|
-
console.error(
|
|
505
|
-
" Hint: This was a Playwright operation failure (HybridEngine's browser mode). Check Playwright logs or originalError."
|
|
506
|
-
);
|
|
507
|
-
}
|
|
508
|
-
} else if (contentError instanceof Error) {
|
|
509
|
-
console.error(` Generic Error: ${contentError.message}`);
|
|
510
|
-
} else {
|
|
511
|
-
console.error(` Unknown error occurred: ${String(contentError)}`);
|
|
512
|
-
}
|
|
513
|
-
}
|
|
514
|
-
}
|
|
515
|
-
}
|
|
303
|
+
Stealth techniques are not foolproof against sophisticated detection systems.
|
|
516
304
|
|
|
517
|
-
|
|
518
|
-
await fetchWithHandling("https://nonexistentdomain.example.com"); // Likely DNS or navigation error
|
|
519
|
-
await fetchWithHandling("https://example.com/document.pdf"); // PDF content - fetchHTML will fail, fetchContent will succeed
|
|
520
|
-
await fetchWithHandling("https://example.com/api/data.json"); // JSON content - fetchHTML will fail, fetchContent will succeed
|
|
521
|
-
await engine.cleanup(); // Important for HybridEngine
|
|
522
|
-
}
|
|
305
|
+
## Error Handling
|
|
523
306
|
|
|
524
|
-
|
|
525
|
-
```
|
|
307
|
+
Errors are thrown as `FetchError` instances with additional context:
|
|
526
308
|
|
|
527
|
-
|
|
309
|
+
- `message` (`string`): Error description
|
|
310
|
+
- `code` (`string | undefined`): Specific error code
|
|
311
|
+
- `originalError` (`Error | undefined`): Underlying error
|
|
312
|
+
- `statusCode` (`number | undefined`): HTTP status code
|
|
528
313
|
|
|
529
|
-
|
|
314
|
+
Common error codes:
|
|
530
315
|
|
|
531
|
-
|
|
316
|
+
- `ERR_HTTP_ERROR`: HTTP status >= 400
|
|
317
|
+
- `ERR_NON_HTML_CONTENT`: Non-HTML content for HTML request
|
|
318
|
+
- `ERR_FETCH_FAILED`: General fetch operation failure
|
|
319
|
+
- `ERR_PLAYWRIGHT_OPERATION`: Playwright operation failure
|
|
320
|
+
- `ERR_NAVIGATION`: Navigation timeout or failure
|
|
321
|
+
- `ERR_BROWSER_POOL_EXHAUSTED`: No available browser resources
|
|
322
|
+
- `ERR_MAX_RETRIES_REACHED`: All retry attempts exhausted
|
|
323
|
+
- `ERR_MARKDOWN_CONVERSION_NON_HTML`: Markdown conversion on non-HTML content
|
|
532
324
|
|
|
533
|
-
|
|
325
|
+
```typescript
|
|
326
|
+
import { HybridEngine } from "@purepageio/fetch-engines";
|
|
534
327
|
|
|
535
|
-
|
|
536
|
-
pnpm install
|
|
537
|
-
pnpm exec playwright install
|
|
538
|
-
pnpm test
|
|
539
|
-
```
|
|
328
|
+
const engine = new HybridEngine();
|
|
540
329
|
|
|
541
|
-
|
|
330
|
+
try {
|
|
331
|
+
const result = await engine.fetchHTML(url);
|
|
332
|
+
} catch (error: any) {
|
|
333
|
+
console.error(`Error: ${error.code || "Unknown"} - ${error.message}`);
|
|
334
|
+
if (error.statusCode) console.error(`Status: ${error.statusCode}`);
|
|
335
|
+
}
|
|
336
|
+
```
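The example above types the caught value as `any`. As a hedged alternative, the sketch below narrows an `unknown` error structurally, using only the `FetchError` fields listed in this section (`message`, `code`, `statusCode`, `originalError`). `FetchErrorLike`, `isFetchErrorLike`, and `describeError` are hypothetical names for illustration and are not exported by the package.

```typescript
// Structural view of the FetchError fields listed in the README.
interface FetchErrorLike {
  message: string;
  code?: string;
  statusCode?: number;
  originalError?: Error;
}

// Hypothetical type guard: accepts any object with a string `message` and,
// when present, a string `code` and numeric `statusCode`.
// Note that a plain Error also passes, since `code` and `statusCode` are optional.
function isFetchErrorLike(error: unknown): error is FetchErrorLike {
  if (typeof error !== "object" || error === null) return false;
  const e = error as Record<string, unknown>;
  return (
    typeof e["message"] === "string" &&
    (e["code"] === undefined || typeof e["code"] === "string") &&
    (e["statusCode"] === undefined || typeof e["statusCode"] === "number")
  );
}

// Formats an error in the same style as the README example, without `any`.
function describeError(error: unknown): string {
  if (!isFetchErrorLike(error)) return `Unknown error: ${String(error)}`;
  const status = error.statusCode !== undefined ? ` (status ${error.statusCode})` : "";
  return `${error.code ?? "Unknown"} - ${error.message}${status}`;
}
```

Inside a `catch (error)` block, `console.error(describeError(error))` would then cover `FetchError` instances, plain `Error` objects, and non-error throwables alike.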
|
|
542
337
|
|
|
543
338
|
## Contributing
|
|
544
339
|
|
|
545
|
-
Contributions
|
|
340
|
+
Contributions welcome! Open an issue or submit a pull request on [GitHub](https://github.com/purepageio/fetch-engines).
|
|
546
341
|
|
|
547
342
|
## License
|
|
548
343
|
|
|
@@ -0,0 +1,67 @@
|
|
|
1
|
+
import type { z } from "zod";
|
|
2
|
+
import type { PlaywrightEngineConfig } from "./types.js";
|
|
3
|
+
/**
|
|
4
|
+
* Configuration options for structured content fetching
|
|
5
|
+
*/
|
|
6
|
+
export interface StructuredContentOptions {
|
|
7
|
+
/** OpenAI model to use. Options: 'gpt-4.1-mini', 'gpt-4.1', 'gpt-5', 'gpt-5-mini' */
|
|
8
|
+
model?: "gpt-4.1-mini" | "gpt-4.1" | "gpt-5" | "gpt-5-mini";
|
|
9
|
+
/** Custom prompt to provide additional context to the LLM */
|
|
10
|
+
customPrompt?: string;
|
|
11
|
+
/** HybridEngine configuration for content fetching */
|
|
12
|
+
engineConfig?: PlaywrightEngineConfig;
|
|
13
|
+
}
|
|
14
|
+
/**
|
|
15
|
+
* Result of structured content extraction
|
|
16
|
+
*/
|
|
17
|
+
export interface StructuredContentResult<T> {
|
|
18
|
+
/** The structured data extracted from the content */
|
|
19
|
+
data: T;
|
|
20
|
+
/** The original markdown content that was processed */
|
|
21
|
+
markdown: string;
|
|
22
|
+
/** The URL that was processed */
|
|
23
|
+
url: string;
|
|
24
|
+
/** The title of the page if available */
|
|
25
|
+
title: string | null;
|
|
26
|
+
/** Token usage information */
|
|
27
|
+
usage: {
|
|
28
|
+
promptTokens: number;
|
|
29
|
+
completionTokens: number;
|
|
30
|
+
totalTokens: number;
|
|
31
|
+
};
|
|
32
|
+
}
|
|
33
|
+
/**
|
|
34
|
+
* Engine for fetching web content and extracting structured data using AI
|
|
35
|
+
*/
|
|
36
|
+
export declare class StructuredContentEngine {
|
|
37
|
+
private hybridEngine;
|
|
38
|
+
constructor(config?: PlaywrightEngineConfig);
|
|
39
|
+
/**
|
|
40
|
+
* Fetches content from a URL and extracts structured data using AI
|
|
41
|
+
*
|
|
42
|
+
* @param url The URL to fetch content from
|
|
43
|
+
* @param schema Zod schema defining the structure of data to extract
|
|
44
|
+
* @param options Additional options for the extraction process
|
|
45
|
+
* @returns Promise resolving to structured data and metadata
|
|
46
|
+
* @throws Error if OPENAI_API_KEY is not set or if extraction fails
|
|
47
|
+
*/
|
|
48
|
+
fetchStructuredContent<T>(url: string, schema: z.ZodSchema<T>, options?: StructuredContentOptions): Promise<StructuredContentResult<T>>;
|
|
49
|
+
/**
|
|
50
|
+
* Get model-specific configuration options
|
|
51
|
+
*/
|
|
52
|
+
private getModelConfig;
|
|
53
|
+
/**
|
|
54
|
+
* Clean up resources
|
|
55
|
+
*/
|
|
56
|
+
cleanup(): Promise<void>;
|
|
57
|
+
}
|
|
58
|
+
/**
|
|
59
|
+
* Convenience function for one-off structured content extraction
|
|
60
|
+
*
|
|
61
|
+
* @param url The URL to fetch content from
|
|
62
|
+
* @param schema Zod schema defining the structure of data to extract
|
|
63
|
+
* @param options Additional options for the extraction process
|
|
64
|
+
* @returns Promise resolving to structured data and metadata
|
|
65
|
+
*/
|
|
66
|
+
export declare function fetchStructuredContent<T>(url: string, schema: z.ZodSchema<T>, options?: StructuredContentOptions): Promise<StructuredContentResult<T>>;
|
|
67
|
+
//# sourceMappingURL=StructuredContentEngine.d.ts.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"StructuredContentEngine.d.ts","sourceRoot":"","sources":["../src/StructuredContentEngine.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAE7B,OAAO,KAAK,EAAE,sBAAsB,EAAE,MAAM,YAAY,CAAC;AAEzD;;GAEG;AACH,MAAM,WAAW,wBAAwB;IACvC,qFAAqF;IACrF,KAAK,CAAC,EAAE,cAAc,GAAG,SAAS,GAAG,OAAO,GAAG,YAAY,CAAC;IAC5D,6DAA6D;IAC7D,YAAY,CAAC,EAAE,MAAM,CAAC;IACtB,sDAAsD;IACtD,YAAY,CAAC,EAAE,sBAAsB,CAAC;CACvC;AAED;;GAEG;AACH,MAAM,WAAW,uBAAuB,CAAC,CAAC;IACxC,qDAAqD;IACrD,IAAI,EAAE,CAAC,CAAC;IACR,uDAAuD;IACvD,QAAQ,EAAE,MAAM,CAAC;IACjB,iCAAiC;IACjC,GAAG,EAAE,MAAM,CAAC;IACZ,yCAAyC;IACzC,KAAK,EAAE,MAAM,GAAG,IAAI,CAAC;IACrB,8BAA8B;IAC9B,KAAK,EAAE;QACL,YAAY,EAAE,MAAM,CAAC;QACrB,gBAAgB,EAAE,MAAM,CAAC;QACzB,WAAW,EAAE,MAAM,CAAC;KACrB,CAAC;CACH;AAED;;GAEG;AACH,qBAAa,uBAAuB;IAClC,OAAO,CAAC,YAAY,CAAe;gBAEvB,MAAM,GAAE,sBAA2B;IAQ/C;;;;;;;;OAQG;IACG,sBAAsB,CAAC,CAAC,EAC5B,GAAG,EAAE,MAAM,EACX,MAAM,EAAE,CAAC,CAAC,SAAS,CAAC,CAAC,CAAC,EACtB,OAAO,GAAE,wBAA6B,GACrC,OAAO,CAAC,uBAAuB,CAAC,CAAC,CAAC,CAAC;IAsDtC;;OAEG;IACH,OAAO,CAAC,cAAc;IAiBtB;;OAEG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;CAG/B;AAED;;;;;;;GAOG;AACH,wBAAsB,sBAAsB,CAAC,CAAC,EAC5C,GAAG,EAAE,MAAM,EACX,MAAM,EAAE,CAAC,CAAC,SAAS,CAAC,CAAC,CAAC,EACtB,OAAO,GAAE,wBAA6B,GACrC,OAAO,CAAC,uBAAuB,CAAC,CAAC,CAAC,CAAC,CAOrC"}
|
|
@@ -0,0 +1,116 @@
|
|
|
1
|
+
import { generateObject } from "ai";
|
|
2
|
+
import { openai } from "@ai-sdk/openai";
|
|
3
|
+
import { HybridEngine } from "./HybridEngine.js";
|
|
4
|
+
/**
|
|
5
|
+
* Engine for fetching web content and extracting structured data using AI
|
|
6
|
+
*/
|
|
7
|
+
export class StructuredContentEngine {
|
|
8
|
+
hybridEngine;
|
|
9
|
+
constructor(config = {}) {
|
|
10
|
+
// Always enable markdown conversion for structured content
|
|
11
|
+
this.hybridEngine = new HybridEngine({
|
|
12
|
+
...config,
|
|
13
|
+
markdown: true,
|
|
14
|
+
});
|
|
15
|
+
}
|
|
16
|
+
/**
|
|
17
|
+
* Fetches content from a URL and extracts structured data using AI
|
|
18
|
+
*
|
|
19
|
+
* @param url The URL to fetch content from
|
|
20
|
+
* @param schema Zod schema defining the structure of data to extract
|
|
21
|
+
* @param options Additional options for the extraction process
|
|
22
|
+
* @returns Promise resolving to structured data and metadata
|
|
23
|
+
* @throws Error if OPENAI_API_KEY is not set or if extraction fails
|
|
24
|
+
*/
|
|
25
|
+
async fetchStructuredContent(url, schema, options = {}) {
|
|
26
|
+
// Check for OpenAI API key
|
|
27
|
+
if (!process.env.OPENAI_API_KEY) {
|
|
28
|
+
throw new Error("OPENAI_API_KEY environment variable is required for structured content extraction");
|
|
29
|
+
}
|
|
30
|
+
const { model = "gpt-5-mini", customPrompt = "", engineConfig = {} } = options;
|
|
31
|
+
// Fetch content using HybridEngine with markdown enabled
|
|
32
|
+
const result = await this.hybridEngine.fetchHTML(url, {
|
|
33
|
+
markdown: true,
|
|
34
|
+
...engineConfig,
|
|
35
|
+
});
|
|
36
|
+
if (result.contentType !== "markdown") {
|
|
37
|
+
throw new Error("Failed to convert content to markdown");
|
|
38
|
+
}
|
|
39
|
+
// Prepare the prompt for the LLM
|
|
40
|
+
const systemPrompt = `You are an expert at extracting structured data from web content.
|
|
41
|
+
Extract the requested information from the provided markdown content accurately and completely.
|
|
42
|
+
${customPrompt ? `\nAdditional context: ${customPrompt}` : ""}
|
|
43
|
+
|
|
44
|
+
Content to analyze:
|
|
45
|
+
${result.content}`;
|
|
46
|
+
// Configure model-specific options
|
|
47
|
+
const modelConfig = this.getModelConfig(model);
|
|
48
|
+
try {
|
|
49
|
+
// Generate structured object using AI SDK
|
|
50
|
+
const aiResult = await generateObject({
|
|
51
|
+
model: openai(model),
|
|
52
|
+
schema,
|
|
53
|
+
prompt: systemPrompt,
|
|
54
|
+
...modelConfig,
|
|
55
|
+
});
|
|
56
|
+
return {
|
|
57
|
+
data: aiResult.object,
|
|
58
|
+
markdown: result.content,
|
|
59
|
+
url: result.url,
|
|
60
|
+
title: result.title,
|
|
61
|
+
usage: {
|
|
62
|
+
promptTokens: aiResult.usage?.promptTokens ?? 0,
|
|
63
|
+
completionTokens: aiResult.usage?.completionTokens ?? 0,
|
|
64
|
+
totalTokens: aiResult.usage?.totalTokens ?? 0,
|
|
65
|
+
},
|
|
66
|
+
};
|
|
67
|
+
}
|
|
68
|
+
catch (error) {
|
|
69
|
+
throw new Error(`Failed to extract structured data: ${error instanceof Error ? error.message : String(error)}`);
|
|
70
|
+
}
|
|
71
|
+
}
|
|
72
|
+
/**
|
|
73
|
+
* Get model-specific configuration options
|
|
74
|
+
*/
|
|
75
|
+
getModelConfig(model) {
|
|
76
|
+
if (model.startsWith("gpt-5")) {
|
|
77
|
+
return {
|
|
78
|
+
providerOptions: {
|
|
79
|
+
openai: {
|
|
80
|
+
reasoning_effort: "low",
|
|
81
|
+
},
|
|
82
|
+
},
|
|
83
|
+
};
|
|
84
|
+
}
|
|
85
|
+
else if (model.startsWith("gpt-4.1")) {
|
|
86
|
+
return {
|
|
87
|
+
temperature: 0,
|
|
88
|
+
};
|
|
89
|
+
}
|
|
90
|
+
return {};
|
|
91
|
+
}
|
|
92
|
+
/**
|
|
93
|
+
* Clean up resources
|
|
94
|
+
*/
|
|
95
|
+
async cleanup() {
|
|
96
|
+
await this.hybridEngine.cleanup();
|
|
97
|
+
}
|
|
98
|
+
}
|
|
99
|
+
/**
|
|
100
|
+
* Convenience function for one-off structured content extraction
|
|
101
|
+
*
|
|
102
|
+
* @param url The URL to fetch content from
|
|
103
|
+
* @param schema Zod schema defining the structure of data to extract
|
|
104
|
+
* @param options Additional options for the extraction process
|
|
105
|
+
* @returns Promise resolving to structured data and metadata
|
|
106
|
+
*/
|
|
107
|
+
export async function fetchStructuredContent(url, schema, options = {}) {
|
|
108
|
+
const engine = new StructuredContentEngine(options.engineConfig);
|
|
109
|
+
try {
|
|
110
|
+
return await engine.fetchStructuredContent(url, schema, options);
|
|
111
|
+
}
|
|
112
|
+
finally {
|
|
113
|
+
await engine.cleanup();
|
|
114
|
+
}
|
|
115
|
+
}
|
|
116
|
+
//# sourceMappingURL=StructuredContentEngine.js.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"StructuredContentEngine.js","sourceRoot":"","sources":["../src/StructuredContentEngine.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,IAAI,CAAC;AACpC,OAAO,EAAE,MAAM,EAAE,MAAM,gBAAgB,CAAC;AAExC,OAAO,EAAE,YAAY,EAAE,MAAM,mBAAmB,CAAC;AAmCjD;;GAEG;AACH,MAAM,OAAO,uBAAuB;IAC1B,YAAY,CAAe;IAEnC,YAAY,SAAiC,EAAE;QAC7C,2DAA2D;QAC3D,IAAI,CAAC,YAAY,GAAG,IAAI,YAAY,CAAC;YACnC,GAAG,MAAM;YACT,QAAQ,EAAE,IAAI;SACf,CAAC,CAAC;IACL,CAAC;IAED;;;;;;;;OAQG;IACH,KAAK,CAAC,sBAAsB,CAC1B,GAAW,EACX,MAAsB,EACtB,UAAoC,EAAE;QAEtC,2BAA2B;QAC3B,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,cAAc,EAAE,CAAC;YAChC,MAAM,IAAI,KAAK,CAAC,mFAAmF,CAAC,CAAC;QACvG,CAAC;QAED,MAAM,EAAE,KAAK,GAAG,YAAY,EAAE,YAAY,GAAG,EAAE,EAAE,YAAY,GAAG,EAAE,EAAE,GAAG,OAAO,CAAC;QAE/E,yDAAyD;QACzD,MAAM,MAAM,GAAG,MAAM,IAAI,CAAC,YAAY,CAAC,SAAS,CAAC,GAAG,EAAE;YACpD,QAAQ,EAAE,IAAI;YACd,GAAG,YAAY;SAChB,CAAC,CAAC;QAEH,IAAI,MAAM,CAAC,WAAW,KAAK,UAAU,EAAE,CAAC;YACtC,MAAM,IAAI,KAAK,CAAC,uCAAuC,CAAC,CAAC;QAC3D,CAAC;QAED,iCAAiC;QACjC,MAAM,YAAY,GAAG;;EAEvB,YAAY,CAAC,CAAC,CAAC,yBAAyB,YAAY,EAAE,CAAC,CAAC,CAAC,EAAE;;;EAG3D,MAAM,CAAC,OAAO,EAAE,CAAC;QAEf,mCAAmC;QACnC,MAAM,WAAW,GAAG,IAAI,CAAC,cAAc,CAAC,KAAK,CAAC,CAAC;QAE/C,IAAI,CAAC;YACH,0CAA0C;YAC1C,MAAM,QAAQ,GAAG,MAAM,cAAc,CAAC;gBACpC,KAAK,EAAE,MAAM,CAAC,KAAK,CAAC;gBACpB,MAAM;gBACN,MAAM,EAAE,YAAY;gBACpB,GAAG,WAAW;aACf,CAAC,CAAC;YAEH,OAAO;gBACL,IAAI,EAAE,QAAQ,CAAC,MAAM;gBACrB,QAAQ,EAAE,MAAM,CAAC,OAAO;gBACxB,GAAG,EAAE,MAAM,CAAC,GAAG;gBACf,KAAK,EAAE,MAAM,CAAC,KAAK;gBACnB,KAAK,EAAE;oBACL,YAAY,EAAG,QAAQ,CAAC,KAAa,EAAE,YAAY,IAAI,CAAC;oBACxD,gBAAgB,EAAG,QAAQ,CAAC,KAAa,EAAE,gBAAgB,IAAI,CAAC;oBAChE,WAAW,EAAG,QAAQ,CAAC,KAAa,EAAE,WAAW,IAAI,CAAC;iBACvD;aACF,CAAC;QACJ,CAAC;QAAC,OAAO,KAAK,EAAE,CAAC;YACf,MAAM,IAAI,KAAK,CAAC,sCAAsC,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,KAAK,CAAC,EAAE,CAAC,CAAC;QAClH,CAAC;IACH,CAAC;IAED;;OAEG;IACK,cAAc,CAAC,KAAa;QAClC,IAAI,KAAK,CAAC,UAAU,CAAC,OAAO,CAAC,EAAE,CAAC;YAC9B,OAAO;gBACL,eAAe,EAAE;oBACf,MAAM,EAAE;wBACN,gBAAgB,EAAE,KAAK;qBACxB;iBACF;aACF,CAAC;QACJ,CAAC;aAAM,IAAI,KAAK,CAAC,UAAU,CAAC,SAAS,CAAC,EAAE,CAAC;YACvC,OAAO;gBACL,WAAW,EAAE,CAAC;aACf,CAAC;QACJ,CAAC;QACD,OAAO,EAAE,CAAC;IACZ,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,OAAO;QACX,MAAM,IAAI,CAAC,YAAY,CAAC,OAAO,EAAE,CAAC;IACpC,CAAC;CACF;AAED;;;;;;;GAOG;AACH,MAAM,CAAC,KAAK,UAAU,sBAAsB,CAC1C,GAAW,EACX,MAAsB,EACtB,UAAoC,EAAE;IAEtC,MAAM,MAAM,GAAG,IAAI,uBAAuB,CAAC,OAAO,CAAC,YAAY,CAAC,CAAC;IACjE,IAAI,CAAC;QACH,OAAO,MAAM,MAAM,CAAC,sBAAsB,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,CAAC;IACnE,CAAC;YAAS,CAAC;QACT,MAAM,MAAM,CAAC,OAAO,EAAE,CAAC;IACzB,CAAC;AACH,CAAC"}
|
package/dist/index.d.ts
CHANGED
|
@@ -4,4 +4,5 @@ import type { HTMLFetchResult, ContentFetchResult, ContentFetchOptions, BrowserM
|
|
|
4
4
|
export type { IEngine, HTMLFetchResult, ContentFetchResult, ContentFetchOptions, BrowserMetrics };
|
|
5
5
|
export { FetchEngine };
|
|
6
6
|
export * from "./HybridEngine.js";
|
|
7
|
+
export * from "./StructuredContentEngine.js";
|
|
7
8
|
//# sourceMappingURL=index.d.ts.map
|
package/dist/index.d.ts.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAC5C,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAE/C,OAAO,KAAK,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAE3G,YAAY,EAAE,OAAO,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,CAAC;AAClG,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC"}
|
|
1
|
+
{"version":3,"file":"index.d.ts","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAC5C,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAE/C,OAAO,KAAK,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAE3G,YAAY,EAAE,OAAO,EAAE,eAAe,EAAE,kBAAkB,EAAE,mBAAmB,EAAE,cAAc,EAAE,CAAC;AAClG,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC;AAClC,cAAc,8BAA8B,CAAC"}
|
package/dist/index.js
CHANGED
package/dist/index.js.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAK/C,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC,CAAC,wBAAwB"}
|
|
1
|
+
{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AACA,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC;AAK/C,OAAO,EAAE,WAAW,EAAE,CAAC;AACvB,cAAc,mBAAmB,CAAC,CAAC,wBAAwB;AAC3D,cAAc,8BAA8B,CAAC,CAAC,0CAA0C"}
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@purepageio/fetch-engines",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.6.0",
|
|
4
4
|
"type": "module",
|
|
5
5
|
"description": "A collection of configurable engines for fetching HTML content using fetch or Playwright.",
|
|
6
6
|
"main": "dist/index.js",
|
|
@@ -11,6 +11,8 @@
|
|
|
11
11
|
"LICENSE"
|
|
12
12
|
],
|
|
13
13
|
"dependencies": {
|
|
14
|
+
"@ai-sdk/openai": "^2.0.30",
|
|
15
|
+
"ai": "^5.0.44",
|
|
14
16
|
"axios": "^1.6.8",
|
|
15
17
|
"node-html-parser": "^7.0.1",
|
|
16
18
|
"p-queue": "^7.4.1",
|
|
@@ -21,7 +23,8 @@
|
|
|
21
23
|
"turndown": "^7.2.0",
|
|
22
24
|
"turndown-plugin-gfm": "^1.0.2",
|
|
23
25
|
"user-agents": "^1.1.208",
|
|
24
|
-
"uuid": "^11.1.0"
|
|
26
|
+
"uuid": "^11.1.0",
|
|
27
|
+
"zod": "^4.1.8"
|
|
25
28
|
},
|
|
26
29
|
"devDependencies": {
|
|
27
30
|
"@types/axios": "^0.14.0",
|
|
@@ -76,6 +79,9 @@
|
|
|
76
79
|
"test": "vitest run",
|
|
77
80
|
"test:unit": "vitest run",
|
|
78
81
|
"test:live": "LIVE_NETWORK=1 vitest run test/live/*.test.ts",
|
|
79
|
-
"examples:hybrid-md": "node scripts/hybrid-md-dump.mjs"
|
|
82
|
+
"examples:hybrid-md": "node scripts/hybrid-md-dump.mjs",
|
|
83
|
+
"simple-scraping": "tsx examples/simple-scraping.ts",
|
|
84
|
+
"smart-scraping": "tsx examples/smart-scraping.ts",
|
|
85
|
+
"ai-extraction": "tsx examples/ai-extraction.ts"
|
|
80
86
|
}
|
|
81
87
|
}
|