@purepageio/fetch-engines 0.3.0 → 0.5.0

package/README.md CHANGED
@@ -9,13 +9,13 @@ Fetching web content can be complex. You need to handle static HTML, dynamic Jav
9
9
 
10
10
  **Why use `@purepageio/fetch-engines`?**
11
11
 
12
- - **Unified API:** Get content from simple or complex sites using the same `fetchHTML(url, options?)` method.
12
+ - **Unified API:** Get content from simple or complex sites using the same `fetchHTML(url, options?)` method for processed content or `fetchContent(url, options?)` for raw content.
13
13
  - **Flexible Strategies:** Choose the right tool for the job:
14
- - `FetchEngine`: Lightweight and fast for static HTML, using the standard `fetch` API.
15
- - `PlaywrightEngine`: Powerful browser automation for JavaScript-heavy sites, handling rendering and interactions.
16
- - `HybridEngine`: The best of both worlds tries `FetchEngine` first for speed, automatically falls back to `PlaywrightEngine` for reliability on complex pages.
14
+ - `FetchEngine`: Lightweight and fast for static HTML, using the standard `fetch` API. Ideal for speed and efficiency with content that doesn't require JavaScript rendering. Supports custom headers.
15
+ - `HybridEngine`: The best of both worlds – tries `FetchEngine` first for speed, automatically falls back to a powerful browser engine (internally, `PlaywrightEngine`) for reliability on complex, JavaScript-heavy pages. Supports custom headers.
16
+ - **Raw Content Support:** Use `fetchContent()` to retrieve any type of content (PDFs, images, APIs, etc.) with the same smart fallback logic as `fetchHTML()`.
17
17
  - **Robust & Resilient:** Built-in caching, configurable retries, and standardized error handling make your fetching logic more dependable.
18
- - **Simplified Automation:** `PlaywrightEngine` manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
18
+ - **Simplified Automation:** When `HybridEngine` uses its browser capabilities (via the internal `PlaywrightEngine`), it manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
19
19
  - **Content Transformation:** Optionally convert fetched HTML directly to clean Markdown content.
20
20
  - **TypeScript Ready:** Fully typed for a better development experience.
21
21
 
@@ -27,6 +27,7 @@ This package provides a high-level abstraction, letting you focus on using the w
27
27
  - [Installation](#installation)
28
28
  - [Engines](#engines)
29
29
  - [Basic Usage](#basic-usage)
30
+ - [fetchHTML vs fetchContent](#fetchhtml-vs-fetchcontent)
30
31
  - [Configuration](#configuration)
31
32
  - [Return Value](#return-value)
32
33
  - [API Reference](#api-reference)
@@ -38,13 +39,15 @@ This package provides a high-level abstraction, letting you focus on using the w
38
39
 
39
40
  ## Features
40
41
 
41
- - **Multiple Fetching Strategies:** Choose between `FetchEngine` (lightweight `fetch`), `PlaywrightEngine` (robust JS rendering via Playwright), or `HybridEngine` (smart fallback).
42
- - **Unified API:** Simple `fetchHTML(url, options?)` interface across all engines.
42
+ - **Multiple Fetching Strategies:** Choose between `FetchEngine` (lightweight `fetch`) or `HybridEngine` (smart fallback to a full browser engine).
43
+ - **Unified API:** Simple `fetchHTML(url, options?)` interface for processed content and `fetchContent(url, options?)` for raw content across both primary engines.
44
+ - **Raw Content Fetching:** Use `fetchContent()` to retrieve any type of content (PDFs, images, JSON, XML, etc.) without HTML processing or content-type restrictions.
45
+ - **Custom Headers:** Easily provide custom HTTP headers for requests in both `FetchEngine` and `HybridEngine`.
43
46
  - **Configurable Retries:** Automatic retries on failure with customizable attempts and delays.
44
47
  - **Built-in Caching:** In-memory caching with configurable TTL to reduce redundant fetches.
45
- - **Playwright Stealth:** Automatic integration of `playwright-extra` and stealth plugins to bypass common bot detection.
46
- - **Managed Browser Pooling:** Efficient resource management for `PlaywrightEngine` with configurable browser/context limits and lifecycles.
47
- - **Smart Fallbacks:** `HybridEngine` uses `FetchEngine` first, falling back to `PlaywrightEngine` only when needed. `PlaywrightEngine` can optionally use a fast HTTP fetch before launching a full browser.
48
+ - **Playwright Stealth:** When `HybridEngine` utilizes its browser capabilities, it automatically integrates `playwright-extra` and stealth plugins to bypass common bot detection.
49
+ - **Managed Browser Pooling:** Efficient resource management for `HybridEngine`'s browser mode with configurable browser/context limits and lifecycles.
50
+ - **Smart Fallbacks:** `HybridEngine` uses `FetchEngine` first, falling back to its internal browser engine only when needed. The internal browser engine can also optionally use a fast HTTP fetch before launching a full browser.
48
51
  - **Content Conversion:** Optionally convert fetched HTML directly to Markdown.
49
52
  - **Standardized Errors:** Custom `FetchError` classes provide context on failures.
50
53
  - **TypeScript Ready:** Fully typed codebase for enhanced developer experience.
@@ -59,7 +62,7 @@ npm install @purepageio/fetch-engines
59
62
  yarn add @purepageio/fetch-engines
60
63
  ```
61
64
 
62
- If you plan to use the `PlaywrightEngine` or `HybridEngine`, you also need to install Playwright's browser binaries:
65
+ If you plan to use the `HybridEngine` (which internally uses Playwright for advanced fetching), you also need to install Playwright's browser binaries:
63
66
 
64
67
  ```bash
65
68
  pnpm exec playwright install
@@ -69,9 +72,9 @@ npx playwright install
69
72
 
70
73
  ## Engines
71
74
 
72
- - **`FetchEngine`**: Uses the standard `fetch` API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast.
73
- - **`PlaywrightEngine`**: Uses Playwright to control a managed pool of headless browsers (Chromium by default via `playwright-extra`). Handles JavaScript rendering, complex interactions, and provides automatic stealth/anti-bot detection measures. More resource-intensive but necessary for dynamic websites.
74
- - **`HybridEngine`**: A smart combination. It first attempts to fetch content using the lightweight `FetchEngine`. If that fails for _any_ reason (e.g., network error, non-HTML content, HTTP error like 403), it automatically falls back to using the `PlaywrightEngine`. This provides the speed of `FetchEngine` for simple sites while retaining the power of `PlaywrightEngine` for complex ones.
75
+ - **`FetchEngine`**: Uses the standard `fetch` API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast. This is your go-to for speed and efficiency when JavaScript rendering is not required.
76
+ - **`HybridEngine`**: A smart combination. It first attempts to fetch content using the lightweight `FetchEngine`. If that fails for _any_ reason (e.g., network error, non-HTML content, HTTP error like 403), or if `spaMode` is enabled and an SPA shell is detected, it automatically falls back to using an internal, powerful browser engine (based on Playwright). This provides the speed of `FetchEngine` for simple sites while retaining the power of a full browser for complex, dynamic websites. This is recommended for most general-purpose fetching tasks.
77
+ - **`PlaywrightEngine` (Internal Component)**: While not recommended for direct use by most users, `PlaywrightEngine` is the component `HybridEngine` uses internally for its browser-based fetching. It manages Playwright browser instances, contexts, and stealth features. Users needing direct, low-level control over Playwright might consider it, but `HybridEngine` offers a more robust and flexible approach for most scenarios.
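The fallback behaviour described for `HybridEngine` can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `fetchWithFallback`, `looksLikeSpaShell`, and the `Fetcher` type are hypothetical names, and the SPA-shell heuristic is only a placeholder.

```typescript
// Hypothetical types and helpers for illustration; not part of @purepageio/fetch-engines.
type Fetcher = (url: string) => Promise<string>;

// Placeholder heuristic (an assumption): treat a page whose <body> has almost no
// visible text as an SPA shell. The real detection logic may differ.
function looksLikeSpaShell(html: string): boolean {
  const body = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html)?.[1] ?? html;
  const text = body
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return text.length < 20;
}

// Try the lightweight fetcher first; use the browser fetcher if it throws,
// or if spaMode is on and the result looks like an empty SPA shell.
async function fetchWithFallback(
  url: string,
  fast: Fetcher,
  browser: Fetcher,
  spaMode = false,
): Promise<{ html: string; engine: "fetch" | "browser" }> {
  try {
    const html = await fast(url);
    if (!(spaMode && looksLikeSpaShell(html))) {
      return { html, engine: "fetch" };
    }
  } catch {
    // Any failure (network error, HTTP error, non-HTML content) triggers the fallback.
  }
  return { html: await browser(url), engine: "browser" };
}
```

Passing the two fetchers in as parameters keeps the sketch independent of any real network or browser, so the decision logic can be exercised with stubs.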
75
78
 
76
79
  ## Basic Usage
77
80
 
@@ -102,74 +105,162 @@ async function main() {
102
105
  main();
103
106
  ```
104
107
 
105
- ### PlaywrightEngine
108
+ ### HybridEngine
106
109
 
107
110
  ```typescript
108
- import { PlaywrightEngine } from "@purepageio/fetch-engines";
111
+ import { HybridEngine } from "@purepageio/fetch-engines";
109
112
 
110
- // Engine configured to fetch HTML by default and pass custom launch arguments
111
- const engine = new PlaywrightEngine({
113
+ // Engine configured to fetch HTML by default for its internal engines
114
+ // and provide some custom headers for all requests made by HybridEngine.
115
+ const engine = new HybridEngine({
112
116
  markdown: false,
113
- playwrightLaunchOptions: { args: ["--disable-gpu"] },
117
+ headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
118
+ // Other PlaywrightEngine specific configs can be set here for the fallback mechanism
119
+ // e.g., playwrightLaunchOptions: { args: ["--disable-gpu"] }
114
120
  });
115
121
 
116
122
  async function main() {
117
123
  try {
118
- const url = "https://quotes.toscrape.com/";
119
-
120
- // Example: Fetching as Markdown using per-request override
121
- console.log(`Fetching ${url} as Markdown...`);
122
- const mdResult = await engine.fetchHTML(url, { markdown: true });
123
- console.log(`Fetched ${mdResult.url} (ContentType: ${mdResult.contentType}) - Title: ${mdResult.title}`);
124
- console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
125
-
126
- // You could also fetch as HTML by default:
127
- // const htmlResult = await engine.fetchHTML(url);
128
- // console.log(`\nFetched ${htmlResult.url} (ContentType: ${htmlResult.contentType}) - Title: ${htmlResult.title}`);
124
+ const urlSimple = "https://example.com"; // Simple site, likely handled by FetchEngine
125
+ const urlComplex = "https://quotes.toscrape.com/"; // JS-heavy site, likely requiring Playwright fallback
126
+
127
+ // --- Scenario 1: FetchEngine part of HybridEngine handles it ---
128
+ console.log(`\nFetching simple site (${urlSimple}) with per-request headers...`);
129
+ const result1 = await engine.fetchHTML(urlSimple, {
130
+ headers: { "X-Request-Specific": "SimpleRequestValue" },
131
+ });
132
+ // FetchEngine (via HybridEngine) will use:
133
+ // 1. Its base default headers (User-Agent etc.)
134
+ // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
135
+ // 3. Overridden/augmented by per-request headers ("X-Request-Specific")
136
+ console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
137
+ console.log(`Content (HTML): ${result1.content.substring(0, 100)}...`);
138
+
139
+ // --- Scenario 2: Playwright part of HybridEngine handles it ---
140
+ console.log(`\nFetching complex site (${urlComplex}) requesting Markdown and with per-request headers...`);
141
+ const result2 = await engine.fetchHTML(urlComplex, {
142
+ markdown: true,
143
+ headers: { "X-Request-Specific": "ComplexRequestValue", "X-Another": "ComplexAnother" },
144
+ });
145
+ // PlaywrightEngine (via HybridEngine) will use:
146
+ // 1. Its base default headers (User-Agent etc. if doing HTTP fallback, or for page.setExtraHTTPHeaders)
147
+ // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
148
+ // 3. Overridden/augmented by per-request headers ("X-Request-Specific", "X-Another")
149
+ // The markdown: true option will be respected by the Playwright part.
150
+ console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
151
+ console.log(`Content (Markdown):\n${result2.content.substring(0, 300)}...`);
129
152
  } catch (error) {
130
- console.error("Playwright fetch failed:", error);
153
+ console.error("Hybrid fetch failed:", error);
131
154
  } finally {
132
- await engine.cleanup();
155
+ await engine.cleanup(); // Important for HybridEngine
133
156
  }
134
157
  }
135
158
  main();
136
159
  ```
137
160
 
138
- ### HybridEngine
161
+ ### Raw Content Fetching
139
162
 
140
163
  ```typescript
141
164
  import { HybridEngine } from "@purepageio/fetch-engines";
142
165
 
143
- // Engine configured to fetch HTML by default for both internal engines
144
- const engine = new HybridEngine({ markdown: false });
166
+ const engine = new HybridEngine();
145
167
 
146
- async function main() {
168
+ async function fetchRawContent() {
147
169
  try {
148
- const url1 = "https://example.com"; // Simple site
149
- const url2 = "https://quotes.toscrape.com/"; // Complex site
150
-
151
- // --- Scenario 1: FetchEngine Succeeds ---
152
- console.log(`\nFetching simple site (${url1}) requesting Markdown...`);
153
- // FetchEngine uses its constructor config (markdown: false), ignoring the per-request option.
154
- const result1 = await engine.fetchHTML(url1, { markdown: true });
155
- console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
156
- console.log(`Content is ${result1.contentType} because FetchEngine succeeded and used its own config.`);
157
- console.log(`${result1.content.substring(0, 300)}...`);
170
+ // Fetch a PDF document
171
+ const pdfResult = await engine.fetchContent("https://example.com/document.pdf");
172
+ console.log(`PDF Content-Type: ${pdfResult.contentType}`);
173
+ console.log(
174
+ `PDF Size: ${Buffer.byteLength(pdfResult.content)} bytes`
175
+ );
158
176
 
159
- // --- Scenario 2: FetchEngine Fails, Playwright Fallback Occurs ---
160
- console.log(`\nFetching complex site (${url2}) requesting Markdown...`);
161
- // Assume FetchEngine fails for url2. PlaywrightEngine will be used and *will* receive the markdown: true override.
162
- const result2 = await engine.fetchHTML(url2, { markdown: true });
163
- console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
164
- console.log(`Content is ${result2.contentType} because Playwright fallback used the per-request option.`);
165
- console.log(`${result2.content.substring(0, 300)}...`);
177
+ // Fetch JSON API
178
+ const jsonResult = await engine.fetchContent("https://api.example.com/data");
179
+ console.log(`JSON Content-Type: ${jsonResult.contentType}`);
180
+ console.log(`JSON Data: ${typeof jsonResult.content === "string" ? jsonResult.content : "Binary data"}`);
181
+
182
+ // Fetch with custom headers
183
+ const customResult = await engine.fetchContent("https://protected-api.example.com/data", {
184
+ headers: {
185
+ Authorization: "Bearer your-token",
186
+ Accept: "application/json",
187
+ },
188
+ });
189
+ console.log(`Custom fetch result: ${customResult.statusCode}`);
166
190
  } catch (error) {
167
- console.error("Hybrid fetch failed:", error);
191
+ console.error("Raw content fetch failed:", error);
168
192
  } finally {
169
193
  await engine.cleanup();
170
194
  }
171
195
  }
172
- main();
196
+ fetchRawContent();
197
+ ```
198
+
199
+ ## fetchHTML vs fetchContent
200
+
201
+ Choose the right method for your use case:
202
+
203
+ ### `fetchHTML(url, options?)`
204
+
205
+ **Use when:** You want to extract and process web page content.
206
+
207
+ **Features:**
208
+
209
+ - Processes HTML content and extracts metadata (title, etc.)
210
+ - Supports HTML-to-Markdown conversion
211
+ - Optimized for web page content
212
+ - Content-type restrictions (HTML/XML only)
213
+ - Returns processed content as `string`
214
+
215
+ **Best for:**
216
+
217
+ - Web scraping
218
+ - Content extraction
219
+ - Blog/article processing
220
+ - Any scenario where you need structured HTML or Markdown
221
+
222
+ ### `fetchContent(url, options?)`
223
+
224
+ **Use when:** You want raw content without processing, mimicking standard `fetch()` behavior.
225
+
226
+ **Features:**
227
+
228
+ - Retrieves any content type (PDFs, images, JSON, XML, etc.)
229
+ - No content-type restrictions
230
+ - Returns raw content as `Buffer` (binary) or `string` (text)
231
+ - Preserves original MIME type information
232
+ - Minimal processing overhead
233
+
234
+ **Best for:**
235
+
236
+ - API consumption
237
+ - File downloads (PDFs, images, etc.)
238
+ - Binary content retrieval
239
+ - Any scenario where you need the raw response
240
+
241
+ ### Example Comparison
242
+
243
+ ```typescript
244
+ import { HybridEngine } from "@purepageio/fetch-engines";
245
+
246
+ const engine = new HybridEngine();
247
+
248
+ // fetchHTML - for web page content
249
+ const htmlResult = await engine.fetchHTML("https://example.com");
250
+ console.log(htmlResult.title); // "Example Domain"
251
+ console.log(htmlResult.contentType); // "html" or "markdown"
252
+ console.log(typeof htmlResult.content); // "string" (processed HTML/Markdown)
253
+
254
+ // fetchContent - for raw content
255
+ const contentResult = await engine.fetchContent("https://example.com");
256
+ console.log(contentResult.title); // "Example Domain" (extracted but not processed)
257
+ console.log(contentResult.contentType); // "text/html" (original MIME type)
258
+ console.log(typeof contentResult.content); // "string" (raw HTML)
259
+
260
+ // fetchContent - for non-HTML content
261
+ const pdfResult = await engine.fetchContent("https://example.com/doc.pdf");
262
+ console.log(pdfResult.contentType); // "application/pdf"
263
+ console.log(Buffer.isBuffer(pdfResult.content)); // true (binary content)
173
264
  ```
174
265
 
175
266
  ## Configuration
@@ -180,71 +271,83 @@ Engines accept an optional configuration object in their constructor to customis
180
271
 
181
272
  The `FetchEngine` accepts a `FetchEngineOptions` object with the following properties:
182
273
 
183
- | Option | Type | Default | Description |
184
- | ---------- | --------- | ------- | ------------------------------------------------------------------------------------------------------ |
185
- | `markdown` | `boolean` | `false` | If `true`, converts fetched HTML to Markdown. `contentType` in the result will be set to `'markdown'`. |
274
+ | Option | Type | Default | Description |
275
+ | ---------- | ------------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
276
+ | `markdown` | `boolean` | `false` | If `true`, converts fetched HTML to Markdown. `contentType` in the result will be set to `'markdown'`. |
277
+ | `headers` | `Record<string, string>` | `{}` | Custom HTTP headers to be sent with the request. These are merged with and can override the engine's default headers. Headers from `fetchHTML` options take higher precedence. |
186
278
 
187
279
  ```typescript
188
- // Example: Always convert to Markdown
189
- const mdFetchEngine = new FetchEngine({ markdown: true });
280
+ // Example: FetchEngine with custom headers and Markdown conversion
281
+ const customFetchEngine = new FetchEngine({
282
+ markdown: true,
283
+ headers: {
284
+ "User-Agent": "MyCustomFetchAgent/1.0",
285
+ "X-Api-Key": "your-api-key",
286
+ },
287
+ });
190
288
  ```
191
289
 
192
- ### PlaywrightEngine
193
-
194
- The `PlaywrightEngine` accepts a `PlaywrightEngineConfig` object with the following properties:
195
-
196
- **General Options:**
197
-
198
- | Option | Type | Default | Description |
199
- | ------------------------- | --------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
200
- | `markdown` | `boolean` | `false` | If `true`, converts content (from Playwright or its internal HTTP fallback) to Markdown. `contentType` will be `'markdown'`. Can be overridden per-request. |
201
- | `useHttpFallback` | `boolean` | `true` | If `true`, attempts a fast HTTP fetch before using Playwright. Ineffective if `spaMode` is `true`. |
202
- | `useHeadedModeFallback` | `boolean` | `false` | If `true`, automatically retries specific failed Playwright attempts in headed (visible) mode. |
203
- | `defaultFastMode` | `boolean` | `true` | If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively `false` if `spaMode` is `true`. |
204
- | `simulateHumanBehavior` | `boolean` | `true` | If `true` (and not `fastMode` or `spaMode`), attempts basic human-like interactions. |
205
- | `concurrentPages` | `number` | `3` | Max number of pages to process concurrently within the engine queue. |
206
- | `maxRetries` | `number` | `3` | Max retry attempts for a failed fetch (excluding initial try). |
207
- | `retryDelay` | `number` | `5000` | Delay (ms) between retries. |
208
- | `cacheTTL` | `number` | `900000` | Cache Time-To-Live (ms). `0` disables caching. (15 mins default) |
209
- | `spaMode` | `boolean` | `false` | If `true`, enables Single Page Application mode. This typically bypasses `useHttpFallback`, effectively sets `fastMode` to `false`, uses more patient load conditions (e.g., network idle), and may apply `spaRenderDelayMs`. Recommended for JavaScript-heavy sites. |
210
- | `spaRenderDelayMs` | `number` | `0` | Explicit delay (ms) after page load events in `spaMode` to allow for client-side rendering. Only applies if `spaMode` is `true`. |
211
- | `playwrightLaunchOptions` | `LaunchOptions` | `undefined` | Optional Playwright launch options (from `playwright` package, e.g., `{ args: ['--some-flag'] }`) passed when a browser instance is created. Merged with internal defaults. |
212
-
213
- **Browser Pool Options (Passed to internal `PlaywrightBrowserPool`):**
214
-
215
- | Option | Type | Default | Description |
216
- | -------------------------- | -------------------------- | ----------- | ------------------------------------------------------------------------- |
217
- | `maxBrowsers` | `number` | `2` | Max concurrent browser instances managed by the pool. |
218
- | `maxPagesPerContext` | `number` | `6` | Max pages per browser context before recycling. |
219
- | `maxBrowserAge` | `number` | `1200000` | Max age (ms) a browser instance lives before recycling. (20 mins default) |
220
- | `healthCheckInterval` | `number` | `60000` | How often (ms) the pool checks browser health. (1 min default) |
221
- | `useHeadedMode` | `boolean` | `false` | Forces the _entire pool_ to launch browsers in headed (visible) mode. |
222
- | `poolBlockedDomains` | `string[]` | `[]` | List of domain glob patterns to block requests to. |
223
- | `poolBlockedResourceTypes` | `string[]` | `[]` | List of Playwright resource types (e.g., 'image', 'font') to block. |
224
- | `proxy` | `{ server: string, ... }?` | `undefined` | Proxy configuration object (see `PlaywrightEngineConfig` type). |
290
+ #### Header Precedence for `FetchEngine`:
225
291
 
226
- ### HybridEngine
292
+ 1. Headers passed in `fetchHTML(url, { headers: { ... } })` (highest precedence).
293
+ 2. Headers passed in the `FetchEngine` constructor `new FetchEngine({ headers: { ... } })`.
294
+ 3. Default headers of the `FetchEngine` (e.g., its default `User-Agent`) (lowest precedence).
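The precedence order above amounts to a layered merge in which later layers win. The sketch below is an illustration only, not the package's internal code: `mergeHeaders`, `HeaderMap`, and the default header values are hypothetical.

```typescript
// Hypothetical helper for illustration; not the package's internal code.
type HeaderMap = Record<string, string>;

// Merge header sets from lowest to highest precedence. Header names are
// case-insensitive in HTTP, so keys are lowercased before merging.
function mergeHeaders(...layers: HeaderMap[]): HeaderMap {
  const merged: HeaderMap = {};
  for (const layer of layers) {
    for (const [name, value] of Object.entries(layer)) {
      merged[name.toLowerCase()] = value;
    }
  }
  return merged;
}

// Assumed example values for each layer.
const engineDefaults: HeaderMap = { "User-Agent": "DefaultAgent/1.0", Accept: "text/html" };
const constructorHeaders: HeaderMap = { "User-Agent": "MyCustomFetchAgent/1.0", "X-Api-Key": "your-api-key" };
const requestHeaders: HeaderMap = { "X-Api-Key": "per-request-key" };

const finalHeaders = mergeHeaders(engineDefaults, constructorHeaders, requestHeaders);
console.log(finalHeaders);
```

Here the per-request `X-Api-Key` wins over the constructor value, the constructor `User-Agent` wins over the engine default, and the untouched `Accept` default survives.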
295
+
296
+ ### `PlaywrightEngineConfig` (Used by `HybridEngine`)
297
+
298
+ The `HybridEngine` constructor accepts a `PlaywrightEngineConfig` object. These settings configure the internal `FetchEngine`, the internal `PlaywrightEngine` (used for fallback), and the hybrid strategy itself.
299
+
300
+ **Key Options for `HybridEngine` (from `PlaywrightEngineConfig`):**
301
+
302
+ | Option | Type | Default | Description |
303
+ | ------------------------- | ------------------------ | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
304
+ | `headers` | `Record<string, string>` | `{}` | Custom HTTP headers. For `HybridEngine`, these serve as default headers for both its internal `FetchEngine` (constructor) and `PlaywrightEngine` (constructor). They can be overridden by headers in `HybridEngine.fetchHTML()` options. |
305
+ | `markdown` | `boolean` | `false` | Default Markdown conversion. For `HybridEngine`: sets default for internal `FetchEngine` (constructor) and internal `PlaywrightEngine`. Can be overridden per-request for the `PlaywrightEngine` part. |
306
+ | `useHttpFallback` | `boolean` | `true` | (For Playwright part) If `true`, attempts a fast HTTP fetch before using Playwright. Ineffective if `spaMode` is `true`. |
307
+ | `useHeadedModeFallback` | `boolean` | `false` | (For Playwright part) If `true`, automatically retries specific failed Playwright attempts in headed (visible) mode. |
308
+ | `defaultFastMode` | `boolean` | `true` | If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively `false` if `spaMode` is `true`. |
309
+ | `simulateHumanBehavior` | `boolean` | `true` | If `true` (and not `fastMode` or `spaMode`), attempts basic human-like interactions. |
310
+ | `concurrentPages` | `number` | `3` | Max number of pages to process concurrently within the engine queue. |
311
+ | `maxRetries` | `number` | `3` | Max retry attempts for a failed fetch (excluding initial try). |
312
+ | `retryDelay` | `number` | `5000` | Delay (ms) between retries. |
313
+ | `cacheTTL` | `number` | `900000` | Cache Time-To-Live (ms). `0` disables caching. (15 mins default) |
314
+ | `spaMode` | `boolean` | `false` | If `true`, enables Single Page Application mode. This typically bypasses `useHttpFallback`, effectively sets `fastMode` to `false`, uses more patient load conditions (e.g., network idle), and may apply `spaRenderDelayMs`. Recommended for JavaScript-heavy sites. |
315
+ | `spaRenderDelayMs` | `number` | `0` | Explicit delay (ms) after page load events in `spaMode` to allow for client-side rendering. Only applies if `spaMode` is `true`. |
316
+ | `playwrightLaunchOptions` | `LaunchOptions` | `undefined` | (For Playwright part) Optional Playwright launch options (from `playwright` package, e.g., `{ args: ['--some-flag'] }`) passed when a browser instance is created. Merged with internal defaults. |
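To make the `maxRetries` and `retryDelay` semantics concrete (retries are counted in addition to the initial attempt), here is a minimal sketch. `withRetries` is a hypothetical helper, and the package's actual retry logic may differ.

```typescript
// Hypothetical helper for illustration; the package's retry logic may differ.
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  retryDelay = 5000,
): Promise<T> {
  let lastError: unknown;
  // One initial attempt plus up to `maxRetries` retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        // Wait `retryDelay` milliseconds before the next attempt.
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  throw lastError;
}
```

Under this sketch, with the documented defaults (`maxRetries: 3`, `retryDelay: 5000`), a task that always fails is attempted four times, with about 15 seconds of waiting in total.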
317
+
318
+ **Browser Pool Options (For `HybridEngine`'s internal `PlaywrightEngine`):**
319
+
320
+ | Option | Type | Default | Description |
321
+ | -------------------------- | -------------------------- | ----------- | ------------------------------------------------------------------------------------------- |
322
+ | `maxBrowsers` | `number` | `2` | Max concurrent browser instances managed by the pool. |
323
+ | `maxPagesPerContext` | `number` | `6` | Max pages per browser context before recycling. |
324
+ | `maxBrowserAge` | `number` | `1200000` | Max age (ms) a browser instance lives before recycling. (20 mins default) |
325
+ | `healthCheckInterval` | `number` | `60000` | How often (ms) the pool checks browser health. (1 min default) |
326
+ | `useHeadedMode` | `boolean` | `false` | Forces the _entire pool_ (for Playwright part) to launch browsers in headed (visible) mode. |
327
+ | `poolBlockedDomains` | `string[]` | `[]` | List of domain glob patterns to block requests to (for Playwright part). |
328
+ | `poolBlockedResourceTypes` | `string[]` | `[]` | List of Playwright resource types (e.g., 'image', 'font') to block (for Playwright part). |
329
+ | `proxy` | `{ server: string, ... }?` | `undefined` | Proxy configuration object (see `PlaywrightEngineConfig` type) (for Playwright part). |
227
330
 
228
- The `HybridEngine` constructor accepts `PlaywrightEngineConfig` options. These settings configure the underlying engines and the hybrid strategy:
331
+ ### `HybridEngine` - Configuration Summary & Header Precedence
229
332
 
230
- - **Constructor `markdown` option:**
231
- - Sets the default Markdown conversion for the internal `FetchEngine`. This `FetchEngine` instance **does not** react to per-request `markdown` overrides.
232
- - Sets the default for the internal `PlaywrightEngine`.
233
- - **Constructor `spaMode` option:**
234
- - Sets the default SPA mode for `HybridEngine`. If `true`, `HybridEngine` checks `FetchEngine`'s output for SPA shell characteristics. If an SPA shell is detected, it forces a fallback to `PlaywrightEngine` (which will also run in SPA mode).
235
- - Sets the default for the internal `PlaywrightEngine`.
236
- - **Other `PlaywrightEngineConfig` options** (e.g., `maxRetries`, `cacheTTL`, `playwrightLaunchOptions`, pool settings) are primarily passed to and used by the internal `PlaywrightEngine`.
333
+ When you configure `HybridEngine` using `PlaywrightEngineConfig`:
334
+
335
+ - **`headers`**: Constructor headers are passed to the internal `FetchEngine`'s constructor and the internal `PlaywrightEngine`'s constructor.
336
+ - **`markdown`**: Sets the default for both internal engines.
337
+ - **`spaMode`**: Sets the default for `HybridEngine`'s SPA shell detection and for the internal `PlaywrightEngine`.
338
+ - Other options primarily configure the internal `PlaywrightEngine` or general retry/caching logic.
237
339
 
238
340
  **Per-request `options` in `HybridEngine.fetchHTML(url, options)`:**
239
341
 
240
- - **`options.markdown` (`boolean`):**
241
- - If `FetchEngine` succeeds and its content is used (i.e., not an SPA shell when `spaMode` is active), this per-request `markdown` option is **ignored**. The content's format is determined by the `FetchEngine`'s constructor `markdown` setting.
242
- - If `HybridEngine` falls back to `PlaywrightEngine` (due to `FetchEngine` failure or SPA shell detection), this per-request `markdown` option **overrides** the `PlaywrightEngine`'s default and determines if its output is Markdown.
243
- - **`options.spaMode` (`boolean`):**
244
- - Overrides the `HybridEngine`'s default SPA mode behavior for this specific request (affecting SPA shell detection and potential fallback to `PlaywrightEngine`).
245
- - If `PlaywrightEngine` is used, this option also overrides its default SPA mode.
246
- - **`options.fastMode` (`boolean`):**
247
- - If `PlaywrightEngine` is used, this option overrides its `defaultFastMode` setting. It has no effect on `FetchEngine`.
342
+ - **`headers?: Record<string, string>`**:
343
+ - These headers override any headers set in the `HybridEngine` constructor.
344
+ - If `FetchEngine` is used: These headers are passed to `FetchEngine.fetchHTML(url, { headers: ... })`. `FetchEngine` then merges them with its constructor headers and base defaults.
345
+ - If `PlaywrightEngine` (fallback) is used: These headers are merged with `HybridEngine` constructor headers (options take precedence) and the result is passed to `PlaywrightEngine`'s `fetchHTML()`. `PlaywrightEngine` then applies its own logic (e.g., for `page.setExtraHTTPHeaders` or its HTTP fallback).
346
+ - **`markdown?: boolean`**:
347
+ - If `FetchEngine` is used: This per-request option is **ignored**. `FetchEngine` uses its own constructor `markdown` setting.
348
+ - If `PlaywrightEngine` (fallback) is used: This overrides `PlaywrightEngine`'s default and determines its output format.
349
+ - **`spaMode?: boolean`**: Overrides `HybridEngine`'s default SPA mode and is passed to `PlaywrightEngine` if used.
350
+ - **`fastMode?: boolean`**: Passed to `PlaywrightEngine` if used; no effect on `FetchEngine`.
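The asymmetric handling of the per-request `markdown` option can be summarised in a few lines. This is a sketch of the rules listed above, not the package's code; `resolveMarkdown` and `EngineUsed` are hypothetical names.

```typescript
// Hypothetical helper for illustration; names are not from the package.
type EngineUsed = "fetch" | "playwright";

// Resolve the effective `markdown` setting for one HybridEngine request.
function resolveMarkdown(
  engineUsed: EngineUsed,
  constructorMarkdown: boolean,
  requestMarkdown?: boolean,
): boolean {
  if (engineUsed === "fetch") {
    // The FetchEngine path ignores the per-request option.
    return constructorMarkdown;
  }
  // The Playwright fallback honours the per-request option when it is given.
  return requestMarkdown ?? constructorMarkdown;
}

console.log(resolveMarkdown("fetch", false, true)); // false
console.log(resolveMarkdown("playwright", false, true)); // true
```

In practice this means the same `fetchHTML(url, { markdown: true })` call can return HTML or Markdown depending on which internal engine served the request, so callers may want to check `contentType` on the result.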
248
351
 
249
352
  ```typescript
250
353
  // Example: HybridEngine with SPA mode enabled by default
@@ -265,6 +368,8 @@ async function fetchSpaSite() {
265
368
 
266
369
  ## Return Value
267
370
 
371
+ ### `fetchHTML()` Result
372
+
268
373
  All `fetchHTML()` methods return a Promise that resolves to an `HTMLFetchResult` object:
269
374
 
270
375
  - `content` (`string`): The fetched content, either original HTML or converted Markdown.
@@ -275,30 +380,53 @@ All `fetchHTML()` methods return a Promise that resolves to an `HTMLFetchResult`
275
380
  - `statusCode` (`number | undefined`): HTTP status code.
276
381
  - `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
277
382
 
383
+ ### `fetchContent()` Result
384
+
385
+ All `fetchContent()` methods return a Promise that resolves to a `ContentFetchResult` object:
386
+
387
+ - `content` (`Buffer | string`): The raw fetched content. Binary content (PDFs, images, etc.) is returned as `Buffer`, text content as `string`.
388
+ - `contentType` (`string`): The original MIME type from the server (e.g., `"application/pdf"`, `"text/html"`, `"application/json"`).
389
+ - `title` (`string | null`): Extracted page title if the content is HTML, otherwise `null`.
390
+ - `url` (`string`): Final URL after redirects.
391
+ - `isFromCache` (`boolean`): True if the result came from cache.
392
+ - `statusCode` (`number | undefined`): HTTP status code.
393
+ - `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
394
+
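Because `content` may be either binary or text, callers usually need to narrow it before use. The sketch below shows one way to do that. `ContentResultLike` and `describeContent` are hypothetical names introduced for illustration, and the type uses `Uint8Array` (Node's `Buffer` is a `Uint8Array` subclass) so that the snippet runs without Node type definitions.

```typescript
// Local stand-in for the two fields used here; the real result type comes from the library.
interface ContentResultLike {
  content: Uint8Array | string; // a Node Buffer satisfies Uint8Array
  contentType: string;
}

// Hypothetical helper: narrows `content` and reports what kind of payload was received.
function describeContent(result: ContentResultLike): string {
  if (typeof result.content === "string") {
    return `text ${result.contentType}, ${result.content.length} characters`;
  }
  // Not a string, so this is the binary branch.
  return `binary ${result.contentType}, ${result.content.byteLength} bytes`;
}

const pdfSummary = describeContent({
  content: new Uint8Array([0x25, 0x50, 0x44, 0x46]), // the bytes of "%PDF"
  contentType: "application/pdf",
});
const jsonSummary = describeContent({ content: '{"ok":true}', contentType: "application/json" });

console.log(pdfSummary);
console.log(jsonSummary);
```

Checking the runtime type of `content` rather than trusting `contentType` alone keeps the handling correct even when a server reports a misleading MIME type.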
278
395
  ## API Reference
279
396
 
280
397
  ### `engine.fetchHTML(url, options?)`
281
398
 
282
399
  - `url` (`string`): URL to fetch.
283
400
  - `options?` (`FetchOptions`): Optional per-request overrides.
284
- - `markdown?: boolean`: (Playwright/Hybrid only) Request Markdown conversion. For Hybrid, only applies on fallback to Playwright.
285
- - `fastMode?: boolean`: (Playwright/Hybrid only) Override fast mode.
286
- - `spaMode?: boolean`: (Playwright/Hybrid only) Override SPA mode behavior for this request.
401
+ - `headers?: Record<string, string>`: Custom headers for this specific request.
402
+ - `markdown?: boolean`: (`HybridEngine` only; applies when it falls back to Playwright) Request Markdown conversion.
403
+ - `fastMode?: boolean`: (`HybridEngine` only; applies when it falls back to Playwright) Override fast mode.
404
+ - `spaMode?: boolean`: (For `HybridEngine`) Override SPA mode behavior for this request.
287
405
  - **Returns:** `Promise<HTMLFetchResult>`
288
406
 
289
407
  Fetches content and returns it in `result.content` as HTML or Markdown, depending on configuration and options; `result.contentType` indicates the format.
290
408
 
291
- ### `engine.cleanup()` (PlaywrightEngine & HybridEngine)
409
+ ### `engine.fetchContent(url, options?)`
410
+
411
+ - `url` (`string`): URL to fetch.
412
+ - `options?` (`ContentFetchOptions`): Optional per-request overrides.
413
+ - `headers?: Record<string, string>`: Custom headers for this specific request.
414
+ - **Returns:** `Promise<ContentFetchResult>`
415
+
416
+ Fetches raw content without processing, mimicking standard `fetch()` behavior. Binary content is returned as a `Buffer` and text content as a `string`. Any content type is supported (PDFs, images, JSON, XML, etc.). It uses the same smart fallback logic as `fetchHTML()`, but without HTML-specific processing or content-type restrictions.
417
+
418
+ ### `engine.cleanup()` (`HybridEngine`; a no-op on `FetchEngine`)
292
419
 
293
420
  - **Returns:** `Promise<void>`
294
421
 
295
- Gracefully shuts down all browser instances managed by the `PlaywrightEngine`'s browser pool (used by both `PlaywrightEngine` and `HybridEngine`). **It is crucial to call `await engine.cleanup()` when you are finished using these engines** to release system resources.
422
+ For `HybridEngine`, this gracefully shuts down all browser instances managed by its internal `PlaywrightEngine`. **It is crucial to call `await engine.cleanup()` when you are finished using `HybridEngine`** to release system resources.
423
+ `FetchEngine` has a `cleanup` method for API consistency, but it's a no-op as `FetchEngine` doesn't manage persistent resources.
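One way to make sure `cleanup()` always runs is to wrap the work in `try`/`finally`. The sketch below assumes only that the engine exposes an async `cleanup()` method, as described above. `Cleanable` and `withEngine` are hypothetical names, and a stand-in object is used instead of a real `HybridEngine` so the snippet runs without launching a browser.

```typescript
// Minimal structural type: anything with an async cleanup() method qualifies.
interface Cleanable {
  cleanup(): Promise<void>;
}

// Hypothetical helper (not part of the library): runs `work`, then always calls cleanup().
async function withEngine<E extends Cleanable, T>(engine: E, work: (engine: E) => Promise<T>): Promise<T> {
  try {
    return await work(engine);
  } finally {
    // Runs whether `work` resolved or threw, so browser resources are released either way.
    await engine.cleanup();
  }
}

// Demo with a stand-in engine that only counts cleanup calls.
let cleanupCalls = 0;
const fakeEngine: Cleanable = {
  cleanup: async () => {
    cleanupCalls += 1;
  },
};

withEngine(fakeEngine, async () => "done").then((value) => {
  console.log(value, cleanupCalls);
});
```

Note that if `cleanup()` itself throws inside `finally`, that error replaces any error thrown by `work`; wrap the cleanup call separately if the original error must be preserved.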
296
424
 
297
- ## Stealth / Anti-Detection (`PlaywrightEngine`)
425
+ ## Stealth / Anti-Detection (via `HybridEngine`)
298
426
 
299
- The `PlaywrightEngine` automatically integrates `playwright-extra` and its powerful stealth plugin ([`puppeteer-extra-plugin-stealth`](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
427
+ When `HybridEngine` uses its internal browser capabilities (via `PlaywrightEngine`), it automatically integrates `playwright-extra` and its powerful stealth plugin ([`puppeteer-extra-plugin-stealth`](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
300
428
 
301
- There are **no manual configuration options** for stealth; it is enabled by default when using `PlaywrightEngine`. The previous options (`useStealthMode`, `randomizeFingerprint`, `evasionLevel`) have been removed.
429
+ There are **no manual configuration options** for stealth; it is enabled by default when `HybridEngine` uses its browser functionality.
302
430
 
303
431
  While effective, be aware that no stealth technique is foolproof, and sophisticated websites may still detect automated browsing.
304
432
 
@@ -315,15 +443,14 @@ Errors during fetching are typically thrown as instances of `FetchError` (or its
315
443
  Common `FetchError` codes and scenarios:
316
444
 
317
445
  - **`ERR_HTTP_ERROR`**: Thrown by `FetchEngine` for HTTP status codes >= 400. `error.statusCode` will be set.
318
- - **`ERR_NON_HTML_CONTENT`**: Thrown by `FetchEngine` if the content type is not HTML and `markdown` conversion is not requested.
319
- - **`ERR_PLAYWRIGHT_OPERATION`**: A general error from `PlaywrightEngine` indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The `originalError` property will often contain the specific Playwright error.
320
- - **`ERR_NAVIGATION`**: Often seen as part of `ERR_PLAYWRIGHT_OPERATION`'s message or in `originalError` when a Playwright navigation fails (e.g., timeout, SSL error).
321
- - **`ERR_MARKDOWN_CONVERSION_NON_HTML`**: Thrown by `PlaywrightEngine` (or `HybridEngine` if falling back to Playwright) if `markdown: true` is requested for a non-HTML content type (e.g., XML, JSON).
322
- - **`ERR_UNSUPPORTED_RAW_CONTENT_TYPE`**: Thrown by `PlaywrightEngine` if `markdown: false` is requested for a content type it doesn't support for direct fetching (e.g., images, applications). Currently, it primarily supports `text/*` and `application/json`, `application/xml` like types when `markdown: false`.
446
+ - **`ERR_NON_HTML_CONTENT`**: Thrown by `FetchEngine` if the content type is not HTML and `markdown` conversion is not requested. **Note:** `fetchContent()` does not throw this error, because it supports all content types.
447
+ - **`ERR_PLAYWRIGHT_OPERATION`**: A general error from `HybridEngine`'s browser mode indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The `originalError` property will often contain the specific Playwright error.
448
+ - **`ERR_NAVIGATION`**: Often seen as part of `ERR_PLAYWRIGHT_OPERATION`'s message or in `originalError` when a Playwright navigation (in `HybridEngine`'s browser mode) fails (e.g., timeout, SSL error).
449
+ - **`ERR_MARKDOWN_CONVERSION_NON_HTML`**: Thrown by `HybridEngine` (in browser mode) if `markdown: true` is requested for a non-HTML content type (e.g., XML, JSON). **Note:** This applies only to `fetchHTML()`, because `fetchContent()` does not perform Markdown conversion.
323
450
  - **`ERR_CACHE_ERROR`**: Indicates an issue with cache read/write operations.
324
- - **`ERR_PROXY_CONFIG_ERROR`**: Problem with proxy configuration.
325
- - **`ERR_BROWSER_POOL_EXHAUSTED`**: If the browser pool cannot provide a page (e.g. max browsers reached and all are busy beyond timeout).
326
- - **Other Scenarios (often wrapped by `ERR_PLAYWRIGHT_OPERATION` or a generic `FetchError`):**
451
+ - **`ERR_PROXY_CONFIG_ERROR`**: Indicates a problem with the proxy configuration used by `HybridEngine`'s browser mode.
452
+ - **`ERR_BROWSER_POOL_EXHAUSTED`**: Thrown when `HybridEngine`'s browser pool cannot provide a page (e.g., the maximum number of browsers is reached and all remain busy past the timeout).
453
+ - **Other Scenarios (often wrapped by `ERR_PLAYWRIGHT_OPERATION` or a generic `FetchError` when `HybridEngine` uses its browser mode):**
327
454
  - Network issues (DNS resolution, connection refused).
328
455
  - Proxy connection failures.
329
456
  - Page crashes or context/browser disconnections within Playwright.
@@ -334,47 +461,64 @@ The `HTMLFetchResult` object may also contain an `error` property if the final f
334
461
  **Example:**
335
462
 
336
463
  ```typescript
337
- import { PlaywrightEngine, FetchError } from "@purepageio/fetch-engines";
464
+ import { HybridEngine, FetchError } from "@purepageio/fetch-engines";
338
465
 
339
- // Example using PlaywrightEngine to illustrate more complex error handling
340
- const engine = new PlaywrightEngine({ useHttpFallback: false, maxRetries: 1 });
466
+ // Example using HybridEngine to illustrate error handling
467
+ const engine = new HybridEngine({ useHttpFallback: false, maxRetries: 1 }); // useHttpFallback applies to the internal PlaywrightEngine
341
468
 
342
469
  async function fetchWithHandling(url: string) {
343
470
  try {
344
- const result = await engine.fetchHTML(url);
345
- if (result.error) {
346
- console.warn(`Fetch for ${url} included non-critical error after retries: ${result.error.message}`);
471
+ // Try fetchHTML first
472
+ const htmlResult = await engine.fetchHTML(url, { headers: { "X-My-Header": "TestValue" } });
473
+ if (htmlResult.error) {
474
+ console.warn(`fetchHTML for ${url} included non-critical error after retries: ${htmlResult.error.message}`);
347
475
  }
348
- console.log(`Success for ${url}! Title: ${result.title}, Content type: ${result.contentType}`);
349
- // Use result.content
476
+ console.log(`fetchHTML Success for ${url}! Title: ${htmlResult.title}, Content type: ${htmlResult.contentType}`);
477
+ // Use htmlResult.content
350
478
  } catch (error) {
351
- console.error(`Fetch failed for ${url}:`);
352
- if (error instanceof FetchError) {
353
- console.error(` Error Code: ${error.code || "N/A"}`);
354
- console.error(` Message: ${error.message}`);
355
- if (error.statusCode) {
356
- console.error(` Status Code: ${error.statusCode}`);
357
- }
358
- if (error.originalError) {
359
- console.error(` Original Error: ${error.originalError.name} - ${error.originalError.message}`);
479
+ console.error(`fetchHTML failed for ${url} (${error instanceof Error ? error.message : String(error)}), trying fetchContent...`);
480
+
481
+ try {
482
+ // Fallback to fetchContent for raw content
483
+ const contentResult = await engine.fetchContent(url, { headers: { "X-My-Header": "TestValue" } });
484
+ if (contentResult.error) {
485
+ console.warn(
486
+ `fetchContent for ${url} included non-critical error after retries: ${contentResult.error.message}`
487
+ );
360
488
  }
361
- // Example of specific handling:
362
- if (error.code === "ERR_PLAYWRIGHT_OPERATION") {
363
- console.error(" Hint: This was a Playwright operation failure. Check Playwright logs or originalError.");
489
+ console.log(`fetchContent Success for ${url}! Content type: ${contentResult.contentType}`);
490
+ // Use contentResult.content (could be Buffer or string)
491
+ } catch (contentError) {
492
+ console.error(`Both fetchHTML and fetchContent failed for ${url}:`);
493
+ if (contentError instanceof FetchError) {
494
+ console.error(` Error Code: ${contentError.code || "N/A"}`);
495
+ console.error(` Message: ${contentError.message}`);
496
+ if (contentError.statusCode) {
497
+ console.error(` Status Code: ${contentError.statusCode}`);
498
+ }
499
+ if (contentError.originalError) {
500
+ console.error(` Original Error: ${contentError.originalError.name} - ${contentError.originalError.message}`);
501
+ }
502
+ // Example of specific handling:
503
+ if (contentError.code === "ERR_PLAYWRIGHT_OPERATION") {
504
+ console.error(
505
+ " Hint: This was a Playwright operation failure (HybridEngine's browser mode). Check Playwright logs or originalError."
506
+ );
507
+ }
508
+ } else if (contentError instanceof Error) {
509
+ console.error(` Generic Error: ${contentError.message}`);
510
+ } else {
511
+ console.error(` Unknown error occurred: ${String(contentError)}`);
364
512
  }
365
- } else if (error instanceof Error) {
366
- console.error(` Generic Error: ${error.message}`);
367
- } else {
368
- console.error(` Unknown error occurred: ${String(error)}`);
369
513
  }
370
514
  }
371
515
  }
372
516
 
373
517
  async function runExamples() {
374
518
  await fetchWithHandling("https://nonexistentdomain.example.com"); // Likely DNS or navigation error
375
- await fetchWithHandling("https://example.com/non_html_resource.json"); // Test with actual JSON URL if available
376
- // or a site known to cause Playwright issues for a demo.
377
- await engine.cleanup(); // Important for PlaywrightEngine
519
+ await fetchWithHandling("https://example.com/document.pdf"); // Placeholder PDF URL - fetchHTML is expected to fail, fetchContent should succeed if the URL serves a PDF
520
+ await fetchWithHandling("https://example.com/api/data.json"); // Placeholder JSON URL - fetchHTML may reject non-HTML content, fetchContent should succeed if the URL serves JSON
521
+ await engine.cleanup(); // Important for HybridEngine
378
522
  }
379
523
 
380
524
  runExamples();