npm - @purepageio/fetch-engines - Versions diffs - 0.1.3 → 0.2.0 - Mend

@purepageio/fetch-engines 0.1.3 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (25)

package/README.md +206 -106
package/dist/FetchEngine.d.ts +9 -8
package/dist/FetchEngine.d.ts.map +1 -1
package/dist/FetchEngine.js +54 -77
package/dist/FetchEngine.js.map +1 -1
package/dist/HybridEngine.d.ts +13 -7
package/dist/HybridEngine.d.ts.map +1 -1
package/dist/HybridEngine.js +35 -18
package/dist/HybridEngine.js.map +1 -1
package/dist/PlaywrightEngine.d.ts +4 -2
package/dist/PlaywrightEngine.d.ts.map +1 -1
package/dist/PlaywrightEngine.js +117 -96
package/dist/PlaywrightEngine.js.map +1 -1
package/dist/PlaywrightEngine.test.js +62 -154
package/dist/PlaywrightEngine.test.js.map +1 -1
package/dist/browser/PlaywrightBrowserPool.d.ts.map +1 -1
package/dist/browser/PlaywrightBrowserPool.js +5 -10
package/dist/browser/PlaywrightBrowserPool.js.map +1 -1
package/dist/types.d.ts +27 -11
package/dist/types.d.ts.map +1 -1
package/dist/utils/markdown-converter.d.ts +31 -0
package/dist/utils/markdown-converter.d.ts.map +1 -0
package/dist/utils/markdown-converter.js +794 -0
package/dist/utils/markdown-converter.js.map +1 -0
package/package.json +7 -2

package/README.md CHANGED

@@ -1,8 +1,53 @@
 # @purepageio/fetch-engines
-A collection of configurable engines for fetching HTML content using plain `fetch` or Playwright.
-This package provides robust and customizable ways to retrieve web page content, handling retries, caching, user agents, and optional browser automation via Playwright for complex JavaScript-driven sites.
+[![npm version](https://img.shields.io/npm/v/@purepageio/fetch-engines.svg)](https://www.npmjs.com/package/@purepageio/fetch-engines)
+[![Build Status](https://github.com/purepageio/fetch-engines/actions/workflows/build.yml/badge.svg)](https://github.com/purepageio/fetch-engines/actions/workflows/build.yml) <!-- Assuming build.yml is the workflow filename -->
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+Fetching web content can be complex. You need to handle static HTML, dynamic JavaScript-driven sites, network errors, retries, caching, and potential bot detection measures. Managing browser automation tools like Playwright adds another layer of complexity with resource pooling and stealth configurations.
+`@purepageio/fetch-engines` simplifies this entire process by providing a set of robust, configurable, and easy-to-use engines for retrieving web page content.
+**Why use `@purepageio/fetch-engines`?**
+- **Unified API:** Get content from simple or complex sites using the same `fetchHTML(url, options?)` method.
+- **Flexible Strategies:** Choose the right tool for the job:
+  - `FetchEngine`: Lightweight and fast for static HTML, using the standard `fetch` API.
+  - `PlaywrightEngine`: Powerful browser automation for JavaScript-heavy sites, handling rendering and interactions.
+  - `HybridEngine`: The best of both worlds – tries `FetchEngine` first for speed, automatically falls back to `PlaywrightEngine` for reliability on complex pages.
+- **Robust & Resilient:** Built-in caching, configurable retries, and standardized error handling make your fetching logic more dependable.
+- **Simplified Automation:** `PlaywrightEngine` manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
+- **Content Transformation:** Optionally convert fetched HTML directly to clean Markdown content.
+- **TypeScript Ready:** Fully typed for a better development experience.
+This package provides a high-level abstraction, letting you focus on using the web content rather than the intricacies of fetching it.
+## Table of Contents
+- [Features](#features)
+- [Installation](#installation)
+- [Engines](#engines)
+- [Basic Usage](#basic-usage)
+- [Configuration](#configuration)
+- [Return Value](#return-value)
+- [API Reference](#api-reference)
+- [Stealth / Anti-Detection (`PlaywrightEngine`)](#stealth--anti-detection-playwrightengine)
+- [Error Handling](#error-handling)
+- [Contributing](#contributing)
+- [License](#license)
+## Features
+- **Multiple Fetching Strategies:** Choose between `FetchEngine` (lightweight `fetch`), `PlaywrightEngine` (robust JS rendering via Playwright), or `HybridEngine` (smart fallback).
+- **Unified API:** Simple `fetchHTML(url, options?)` interface across all engines.
+- **Configurable Retries:** Automatic retries on failure with customizable attempts and delays.
+- **Built-in Caching:** In-memory caching with configurable TTL to reduce redundant fetches.
+- **Playwright Stealth:** Automatic integration of `playwright-extra` and stealth plugins to bypass common bot detection.
+- **Managed Browser Pooling:** Efficient resource management for `PlaywrightEngine` with configurable browser/context limits and lifecycles.
+- **Smart Fallbacks:** `HybridEngine` uses `FetchEngine` first, falling back to `PlaywrightEngine` only when needed. `PlaywrightEngine` can optionally use a fast HTTP fetch before launching a full browser.
+- **Content Conversion:** Optionally convert fetched HTML directly to Markdown.
+- **Standardized Errors:** Custom `FetchError` classes provide context on failures.
+- **TypeScript Ready:** Fully typed codebase for enhanced developer experience.
 ## Installation
@@ -26,7 +71,7 @@ npx playwright install
 - **`FetchEngine`**: Uses the standard `fetch` API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast.
 - **`PlaywrightEngine`**: Uses Playwright to control a managed pool of headless browsers (Chromium by default via `playwright-extra`). Handles JavaScript rendering, complex interactions, and provides automatic stealth/anti-bot detection measures. More resource-intensive but necessary for dynamic websites.
-- **`HybridEngine`**: A smart combination. It first attempts to fetch content using the lightweight `FetchEngine`. If that fails for *any* reason (e.g., network error, non-HTML content, HTTP error like 403), it automatically falls back to using the `PlaywrightEngine`. This provides the speed of `FetchEngine` for simple sites while retaining the power of `PlaywrightEngine` for complex ones.
+- **`HybridEngine`**: A smart combination. It first attempts to fetch content using the lightweight `FetchEngine`. If that fails for _any_ reason (e.g., network error, non-HTML content, HTTP error like 403), it automatically falls back to using the `PlaywrightEngine`. This provides the speed of `FetchEngine` for simple sites while retaining the power of `PlaywrightEngine` for complex ones.
 ## Basic Usage
@@ -35,20 +80,25 @@ npx playwright install
 ```typescript
 import { FetchEngine } from "@purepageio/fetch-engines";
-const engine = new FetchEngine();
+const engine = new FetchEngine(); // Default: fetches HTML
 async function main() {
   try {
     const url = "https://example.com";
     const result = await engine.fetchHTML(url);
-    console.log(`Fetched ${result.url} (Status: ${result.statusCode})`);
+    console.log(`Fetched ${result.url} (ContentType: ${result.contentType})`);
     console.log(`Title: ${result.title}`);
-    // console.log(`HTML: ${result.html.substring(0, 200)}...`);
+    console.log(`Content (HTML): ${result.content.substring(0, 100)}...`);
+    // Example fetching Markdown directly via constructor option
+    const markdownEngine = new FetchEngine({ markdown: true });
+    const mdResult = await markdownEngine.fetchHTML(url);
+    console.log(`\nFetched ${mdResult.url} (ContentType: ${mdResult.contentType})`);
+    console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
   } catch (error) {
     console.error("Fetch failed:", error);
   }
 }
 main();
 ```
@@ -57,165 +107,161 @@ main();
 ```typescript
 import { PlaywrightEngine } from "@purepageio/fetch-engines";
-// Configure engine options (optional)
-const engine = new PlaywrightEngine({
-  maxRetries: 2, // Number of retry attempts
-  useHttpFallback: true, // Try simple HTTP fetch first
-  cacheTTL: 5 * 60 * 1000, // Cache results for 5 minutes (in milliseconds)
-});
+// Engine configured to fetch HTML by default
+const engine = new PlaywrightEngine({ markdown: false });
 async function main() {
   try {
-    const url = "https://quotes.toscrape.com/"; // A site that might benefit from JS rendering
-    const result = await engine.fetchHTML(url);
-    console.log(`Fetched ${result.url} (Status: ${result.statusCode})`);
-    console.log(`Title: ${result.title}`);
-    // console.log(`HTML: ${result.html.substring(0, 200)}...`);
+    const url = "https://quotes.toscrape.com/";
+    // Example: Fetching as Markdown using per-request override
+    console.log(`Fetching ${url} as Markdown...`);
+    const mdResult = await engine.fetchHTML(url, { markdown: true });
+    console.log(`Fetched ${mdResult.url} (ContentType: ${mdResult.contentType}) - Title: ${mdResult.title}`);
+    console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
+    // You could also fetch as HTML by default:
+    // const htmlResult = await engine.fetchHTML(url);
+    // console.log(`\nFetched ${htmlResult.url} (ContentType: ${htmlResult.contentType}) - Title: ${htmlResult.title}`);
   } catch (error) {
     console.error("Playwright fetch failed:", error);
   } finally {
-    // Important: Clean up browser resources when done
     await engine.cleanup();
   }
 }
 main();
 ```
 ### HybridEngine
 ```typescript
-import { HybridEngine } from '@purepageio/fetch-engines';
+import { HybridEngine } from "@purepageio/fetch-engines";
-// Configure the underlying PlaywrightEngine (optional)
-const engine = new HybridEngine({
-  maxRetries: 2, // PlaywrightEngine retry config
-  maxBrowsers: 3, // PlaywrightEngine pool config
-  // FetchEngine part has no config
-});
+// Engine configured to fetch HTML by default for both internal engines
+const engine = new HybridEngine({ markdown: false });
 async function main() {
   try {
-    // Try a simple site (likely uses FetchEngine)
-    const url1 = 'https://example.com';
-    const result1 = await engine.fetchHTML(url1);
-    console.log(`Fetched ${result1.url} (Status: ${result1.statusCode}) - Title: ${result1.title}`);
-    // Try a complex site (likely falls back to PlaywrightEngine)
-    const url2 = 'https://quotes.toscrape.com/';
-    const result2 = await engine.fetchHTML(url2);
-    console.log(`Fetched ${result2.url} (Status: ${result2.statusCode}) - Title: ${result2.title}`);
+    const url1 = "https://example.com"; // Simple site
+    const url2 = "https://quotes.toscrape.com/"; // Complex site
+    // --- Scenario 1: FetchEngine Succeeds ---
+    console.log(`\nFetching simple site (${url1}) requesting Markdown...`);
+    // FetchEngine uses its constructor config (markdown: false), ignoring the per-request option.
+    const result1 = await engine.fetchHTML(url1, { markdown: true });
+    console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
+    console.log(`Content is ${result1.contentType} because FetchEngine succeeded and used its own config.`);
+    console.log(`${result1.content.substring(0, 300)}...`);
+    // --- Scenario 2: FetchEngine Fails, Playwright Fallback Occurs ---
+    console.log(`\nFetching complex site (${url2}) requesting Markdown...`);
+    // Assume FetchEngine fails for url2. PlaywrightEngine will be used and *will* receive the markdown: true override.
+    const result2 = await engine.fetchHTML(url2, { markdown: true });
+    console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
+    console.log(`Content is ${result2.contentType} because Playwright fallback used the per-request option.`);
+    console.log(`${result2.content.substring(0, 300)}...`);
   } catch (error) {
     console.error("Hybrid fetch failed:", error);
   } finally {
-    // Important: Clean up browser resources (for the Playwright part) when done
     await engine.cleanup();
   }
 }
 main();
 ```
 ## Configuration
-Engines accept an optional configuration object in their constructor to customize behavior.
+Engines accept an optional configuration object in their constructor to customise behavior.
 ### FetchEngine
-The `FetchEngine` currently has **no configurable options** via its constructor. It uses standard `fetch` with default browser/Node.js retry/timeout behavior and a fixed set of browser-like headers.
+The `FetchEngine` accepts a `FetchEngineOptions` object with the following properties:
+| Option     | Type      | Default | Description                                                                                            |
+| ---------- | --------- | ------- | ------------------------------------------------------------------------------------------------------ |
+| `markdown` | `boolean` | `false` | If `true`, converts fetched HTML to Markdown. `contentType` in the result will be set to `'markdown'`. |
+```typescript
+// Example: Always convert to Markdown
+const mdFetchEngine = new FetchEngine({ markdown: true });
+```
 ### PlaywrightEngine
-The `PlaywrightEngine` accepts a `PlaywrightEngineConfig` object. See the detailed options below:
+The `PlaywrightEngine` accepts a `PlaywrightEngineConfig` object with the following properties:
 **General Options:**
-- `concurrentPages` (`number`, default: `3`)
-  - Maximum number of Playwright pages to process concurrently across all browser instances.
-- `maxRetries` (`number`, default: `3`)
-  - Maximum number of retry attempts for a failed Playwright fetch operation (excluding initial attempt).
-- `retryDelay` (`number`, default: `5000`)
-  - Delay in milliseconds between Playwright retry attempts.
-- `cacheTTL` (`number`, default: `900000` (15 minutes))
-  - Time-to-live for cached results in milliseconds. Set to `0` to disable the in-memory cache. Affects both HTTP fallback and Playwright results.
-- `useHttpFallback` (`boolean`, default: `true`)
-  - If `true`, the engine first attempts a simple, fast HTTP GET request. If this fails or appears to receive a challenge/CAPTCHA page, it then proceeds with a full Playwright browser request.
-- `useHeadedModeFallback` (`boolean`, default: `false`)
-  - If `true` and a Playwright request fails (potentially due to bot detection), subsequent Playwright requests *to that specific domain* will automatically use a headed (visible) browser instance.
-- `defaultFastMode` (`boolean`, default: `true`)
-  - If `true`, Playwright requests initially run in "fast mode", blocking non-essential resources and skipping human behavior simulation. Can be overridden per-request via `fetchHTML` options.
-- `simulateHumanBehavior` (`boolean`, default: `true`)
-  - If `true` and the Playwright request is *not* in `fastMode`, the engine attempts basic human-like interactions. *Note: This simulation is currently basic.*
+| Option                  | Type      | Default  | Description                                                                                                                               |
+| ----------------------- | --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| `markdown`              | `boolean` | `false`  | If `true`, converts content (from Playwright or fallback) to Markdown. `contentType` will be `'markdown'`. Can be overridden per-request. |
+| `useHttpFallback`       | `boolean` | `true`   | If `true`, attempts a fast HTTP fetch before using Playwright.                                                                            |
+| `useHeadedModeFallback` | `boolean` | `false`  | If `true`, automatically retries specific failed domains in headed (visible) mode.                                                        |
+| `defaultFastMode`       | `boolean` | `true`   | If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request.                            |
+| `simulateHumanBehavior` | `boolean` | `true`   | If `true` (and not `fastMode`), attempts basic human-like interactions.                                                                   |
+| `concurrentPages`       | `number`  | `3`      | Max number of pages to process concurrently within the engine queue.                                                                      |
+| `maxRetries`            | `number`  | `3`      | Max retry attempts for a failed fetch (excluding initial try).                                                                            |
+| `retryDelay`            | `number`  | `5000`   | Delay (ms) between retries.                                                                                                               |
+| `cacheTTL`              | `number`  | `900000` | Cache Time-To-Live (ms). `0` disables caching. (15 mins default)                                                                          |
 **Browser Pool Options (Passed to internal `PlaywrightBrowserPool`):**
-- `maxBrowsers` (`number`, default: `2`)
-  - Maximum number of concurrent browser instances the pool will manage.
-- `maxPagesPerContext` (`number`, default: `6`)
-  - Maximum number of pages per browser context before recycling.
-- `maxBrowserAge` (`number`, default: `1200000` (20 minutes))
-  - Maximum age in milliseconds a browser instance lives before recycling.
-- `healthCheckInterval` (`number`, default: `60000` (1 minute))
-  - How often (in milliseconds) the pool checks browser health.
-- `useHeadedMode` (`boolean`, default: `false`)
-  - Forces the *entire* browser pool to launch browsers in headed (visible) mode.
-- `poolBlockedDomains` (`string[]`, default: `[]` - uses pool's internal defaults)
-  - List of domain *glob patterns* to block browser requests to.
-- `poolBlockedResourceTypes` (`string[]`, default: `[]` - uses pool's internal defaults)
-  - List of Playwright resource types (e.g., `image`, `font`) to block.
-- `proxy` (`object | undefined`, default: `undefined`)
-  - Proxy configuration for browser instances (`server`, `username?`, `password?`).
+| Option                     | Type                       | Default     | Description                                                               |
+| -------------------------- | -------------------------- | ----------- | ------------------------------------------------------------------------- |
+| `maxBrowsers`              | `number`                   | `2`         | Max concurrent browser instances managed by the pool.                     |
+| `maxPagesPerContext`       | `number`                   | `6`         | Max pages per browser context before recycling.                           |
+| `maxBrowserAge`            | `number`                   | `1200000`   | Max age (ms) a browser instance lives before recycling. (20 mins default) |
+| `healthCheckInterval`      | `number`                   | `60000`     | How often (ms) the pool checks browser health. (1 min default)            |
+| `useHeadedMode`            | `boolean`                  | `false`     | Forces the _entire pool_ to launch browsers in headed (visible) mode.     |
+| `poolBlockedDomains`       | `string[]`                 | `[]`        | List of domain glob patterns to block requests to.                        |
+| `poolBlockedResourceTypes` | `string[]`                 | `[]`        | List of Playwright resource types (e.g., 'image', 'font') to block.       |
+| `proxy`                    | `{ server: string, ... }?` | `undefined` | Proxy configuration object (see `PlaywrightEngineConfig` type).           |
 ### HybridEngine
-The `HybridEngine` constructor accepts a single optional argument: `playwrightConfig`. This object follows the **`PlaywrightEngineConfig`** structure described above.
+The `HybridEngine` constructor accepts a single optional argument which uses the **`PlaywrightEngineConfig`** structure (see the `PlaywrightEngine` tables above). These options configure the underlying engines where applicable:
+- Options like `maxRetries`, `cacheTTL`, `proxy`, `maxBrowsers`, etc., are primarily passed to the internal `PlaywrightEngine`.
+- The `markdown` setting in the constructor (`boolean`, default: `false`) applies to **both** internal engines by default.
+- If you provide `markdown: true` in the `options` object when calling `fetchHTML`, this override **only applies if a fallback to `PlaywrightEngine` is necessary**. The `FetchEngine` part will always use the `markdown` setting provided in the `HybridEngine` constructor.
 ```typescript
-import { HybridEngine } from '@purepageio/fetch-engines';
-const engine = new HybridEngine({
-  // These options configure the PlaywrightEngine used for fallbacks
-  maxRetries: 1,
-  maxBrowsers: 1,
-  cacheTTL: 0 // Disable caching in the Playwright part
-});
+// ... (HybridEngine examples remain the same) ...
 ```
-The internal `FetchEngine` used by `HybridEngine` is *not* configurable.
 ## Return Value
-Both `FetchEngine.fetchHTML()` and `PlaywrightEngine.fetchHTML()` return a Promise that resolves to a `FetchResult` object with the following properties:
+All `fetchHTML()` methods return a Promise that resolves to an `HTMLFetchResult` object:
-- `html` (`string`): The full HTML content of the fetched page.
-- `title` (`string | null`): The extracted `<title>` tag content, or `null` if no title is found.
-- `url` (`string`): The final URL after any redirects.
-- `isFromCache` (`boolean`): `true` if the result was served from the engine's cache, `false` otherwise.
-- `statusCode` (`number | undefined`): The HTTP status code of the final response. This is typically available for `FetchEngine` and the HTTP fallback in `PlaywrightEngine`, but might be `undefined` for some Playwright navigation scenarios if the primary response wasn't directly captured.
-- `error` (`FetchError | Error | undefined`): If an error occurred during the _final_ fetch attempt (after retries), this property will contain the error object. It might be a specific `FetchError` (see Error Handling) or a generic `Error`.
+- `content` (`string`): The fetched content, either original HTML or converted Markdown.
+- `contentType` (`'html' | 'markdown'`): Indicates the format of the `content` string.
+- `title` (`string | null`): Extracted page title (from original HTML).
+- `url` (`string`): Final URL after redirects.
+- `isFromCache` (`boolean`): True if the result came from cache.
+- `statusCode` (`number | undefined`): HTTP status code.
+- `error` (`Error | undefined`): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
 ## API Reference
 ### `engine.fetchHTML(url, options?)`
-- `url` (`string`): The URL of the page to fetch.
-- `options` (`object`, optional): Per-request options to override engine defaults.
-  - For `PlaywrightEngine`, you can override `fastMode` (`boolean`) to force or disable fast mode for this specific request.
-  - _(Other per-request options may be added in the future)._
-- **Returns:** `Promise<FetchResult>`
+- `url` (`string`): URL to fetch.
+- `options?` (`FetchOptions`): Optional per-request overrides.
+  - `markdown?: boolean`: (Playwright/Hybrid only) Request Markdown conversion. For Hybrid, only applies on fallback to Playwright.
+  - `fastMode?: boolean`: (Playwright/Hybrid only) Override fast mode.
+- **Returns:** `Promise<HTMLFetchResult>`
-Fetches the HTML content for the given URL using the engine's configured strategy (plain fetch or Playwright).
+Fetches content, returning HTML or Markdown based on configuration/options in `result.content` with `result.contentType` indicating the format.
-### `engine.cleanup()` (PlaywrightEngine only)
+### `engine.cleanup()` (PlaywrightEngine & HybridEngine)
 - **Returns:** `Promise<void>`
-Gracefully shuts down all browser instances managed by the `PlaywrightEngine`'s browser pool. **It is crucial to call `await engine.cleanup()` when you are finished using a `PlaywrightEngine` instance** to release system resources.
+Gracefully shuts down all browser instances managed by the `PlaywrightEngine`'s browser pool (used by both `PlaywrightEngine` and `HybridEngine`). **It is crucial to call `await engine.cleanup()` when you are finished using these engines** to release system resources.
 ## Stealth / Anti-Detection (`PlaywrightEngine`)
-The `PlaywrightEngine` automatically integrates `playwright-extra` and its powerful stealth plugin (`puppeteer-extra-plugin-stealth`). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
+The `PlaywrightEngine` automatically integrates `playwright-extra` and its powerful stealth plugin ([`puppeteer-extra-plugin-stealth`](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
 There are **no manual configuration options** for stealth; it is enabled by default when using `PlaywrightEngine`. The previous options (`useStealthMode`, `randomizeFingerprint`, `evasionLevel`) have been removed.
@@ -229,21 +275,75 @@ Errors during fetching are typically thrown as instances of `FetchError` (or its
   - `message` (`string`): Description of the error.
   - `code` (`string | undefined`): A specific error code (e.g., `ERR_NAVIGATION_TIMEOUT`, `ERR_HTTP_ERROR`, `ERR_NON_HTML_CONTENT`).
   - `originalError` (`Error | undefined`): The underlying error that caused this fetch error (e.g., a Playwright error object).
+  - `statusCode` (`number | undefined`): The HTTP status code, if relevant (especially for `FetchEngineHttpError`).
 Common error scenarios include:
 - Network issues (DNS resolution failure, connection refused).
-- HTTP errors (4xx client errors, 5xx server errors).
-- Non-HTML content type received (for `FetchEngine`).
-- Playwright navigation timeouts.
+- HTTP errors (4xx client errors, 5xx server errors) -> `FetchEngineHttpError` from `FetchEngine` or potentially wrapped `FetchError` from `PlaywrightEngine`.
+- Non-HTML content type received -> `FetchError` with code `ERR_NON_HTML_CONTENT` from `FetchEngine`.
+- Playwright navigation timeouts -> `FetchError` wrapping Playwright error, often with code `ERR_NAVIGATION_TIMEOUT`.
 - Proxy connection errors.
 - Page crashes within Playwright.
 - Errors thrown by the browser pool (e.g., failure to launch browser).
-The `FetchResult` object may also contain an `error` property if the final fetch attempt failed after all retries.
+The `HTMLFetchResult` object may also contain an `error` property if the final fetch attempt failed after all retries but an earlier attempt (within retries) might have produced some intermediate (potentially unusable) result data. It's generally best to rely on the thrown error for failure handling.
+**Example:**
+```typescript
+import { FetchEngine, FetchError } from "@purepageio/fetch-engines";
+const engine = new FetchEngine();
+async function fetchWithHandling(url: string) {
+  try {
+    const result = await engine.fetchHTML(url);
+    // Note: result.error is less common, primary errors are thrown.
+    if (result.error) {
+      console.error(`Fetch for ${url} reported error after retries: ${result.error.message}`);
+    } else {
+      console.log(`Success for ${url}! Content type: ${result.contentType}`);
+      // Use result.content
+    }
+  } catch (error) {
+    console.error(`Fetch failed entirely for ${url}:`);
+    if (error instanceof FetchError) {
+      // Handle specific FetchError codes
+      switch (error.code) {
+        case "ERR_HTTP_ERROR":
+          console.error(`  HTTP Error: Status ${error.statusCode} - ${error.message}`);
+          break;
+        case "ERR_NON_HTML_CONTENT":
+          console.error(`  Wrong Content Type: ${error.message}`);
+          break;
+        // Add other specific codes as needed
+        default:
+          console.error(`  FetchError (${error.code || "UNKNOWN"}): ${error.message}`);
+          break;
+      }
+      if (error.originalError) {
+        console.error(`  Original Error: ${error.originalError.message}`);
+      }
+    } else if (error instanceof Error) {
+      // Handle generic JavaScript errors
+      console.error(`  Generic Error: ${error.message}`);
+    } else {
+      // Handle unexpected throw types
+      console.error(`  Unknown error occurred.`);
+    }
+  }
+}
+fetchWithHandling("https://example.com");
+fetchWithHandling("https://httpbin.org/status/404"); // Example causing HTTP error
+fetchWithHandling("https://httpbin.org/image/png"); // Example causing non-HTML error
+```
 ## Logging
+Currently, the library uses `console.warn` and `console.error` for internal warnings (like fallback events) and critical errors. More sophisticated logging options may be added in the future.
 ## Contributing
 Contributions are welcome! Please open an issue or submit a pull request on the [GitHub repository](https://github.com/purepageio/fetch-engines).
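
Note on the README changes above: the new `HybridEngine` section says the per-request `markdown` option only takes effect when the engine falls back to `PlaywrightEngine`, while the `FetchEngine` path always uses the constructor setting. The sketch below models only that precedence rule as described in the README. `HybridScenario` and `resolveContentType` are hypothetical names introduced for illustration and are not part of the package.

```typescript
// Hypothetical model of the precedence rule described in the 0.2.0 README.
// It does not call the library; it only encodes the documented behaviour.
type ContentType = "html" | "markdown";

interface HybridScenario {
  /** `markdown` value passed to the HybridEngine constructor (README default: false). */
  constructorMarkdown: boolean;
  /** Optional per-request `markdown` override passed to fetchHTML. */
  requestMarkdown?: boolean;
  /** Whether the lightweight FetchEngine attempt succeeded. */
  fetchEngineSucceeded: boolean;
}

function resolveContentType(scenario: HybridScenario): ContentType {
  const useMarkdown = scenario.fetchEngineSucceeded
    ? // FetchEngine path: the per-request override is ignored.
      scenario.constructorMarkdown
    : // Playwright fallback: the per-request override wins when present.
      scenario.requestMarkdown ?? scenario.constructorMarkdown;
  return useMarkdown ? "markdown" : "html";
}

// Mirrors the two scenarios in the README's HybridEngine example.
const simpleSite = resolveContentType({
  constructorMarkdown: false,
  requestMarkdown: true,
  fetchEngineSucceeded: true,
});
const complexSite = resolveContentType({
  constructorMarkdown: false,
  requestMarkdown: true,
  fetchEngineSucceeded: false,
});
console.log(simpleSite, complexSite);
```

If this reading of the README is right, callers who always need Markdown from `HybridEngine` would set `markdown: true` in the constructor rather than rely on the per-request option.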

package/dist/FetchEngine.d.ts CHANGED

@@ -1,9 +1,10 @@
-import type { HTMLFetchResult, BrowserMetrics } from "./types.js";
+import type { HTMLFetchResult, BrowserMetrics, FetchEngineOptions } from "./types.js";
 import type { IEngine } from "./IEngine.js";
+import { FetchError } from "./errors.js";
 /**
  * Custom error class for HTTP errors from FetchEngine.
  */
-export declare class FetchEngineHttpError extends Error {
+export declare class FetchEngineHttpError extends FetchError {
     readonly statusCode: number;
     constructor(message: string, statusCode: number);
 }
@@ -14,22 +15,22 @@ export declare class FetchEngineHttpError extends Error {
  * It does not support advanced configurations like retries, caching, or proxies directly.
  */
 export declare class FetchEngine implements IEngine {
-    private readonly headers;
+    private readonly options;
+    private static readonly DEFAULT_OPTIONS;
     /**
      * Creates an instance of FetchEngine.
-     * Note: This engine currently does not accept configuration options.
+     * @param options Configuration options for the FetchEngine.
      */
-    constructor();
+    constructor(options?: FetchEngineOptions);
     /**
-     * Fetches HTML content from the specified URL using the `fetch` API.
+     * Fetches HTML or converts to Markdown from the specified URL.
      *
      * @param url The URL to fetch.
      * @returns A Promise resolving to an HTMLFetchResult object.
      * @throws {FetchEngineHttpError} If the HTTP response status is not ok (e.g., 404, 500).
      * @throws {Error} If the content type is not HTML or for other network errors.
      */
-    fetchHTML(url: string): Promise<HTMLFetchResult>;
-    private detectSPA;
+    fetchHTML(url: string, options?: FetchEngineOptions): Promise<HTMLFetchResult>;
     /**
      * Cleans up resources used by the engine.
      * For FetchEngine, this is a no-op as it doesn't manage persistent resources.
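
Note on the declaration change above: `FetchEngineHttpError` now extends `FetchError` instead of `Error`, so a single `instanceof FetchError` check should also match HTTP errors. The sketch below illustrates that consequence with local stand-in classes. These are simplified assumptions, not the package's real implementations: the constructor shape of the stand-in `FetchError`, the `ERR_HTTP_ERROR` code passed by the subclass, and the `classify` helper are all hypothetical.

```typescript
// Simplified stand-ins for illustration only; the real classes live in the package.
class FetchError extends Error {
  readonly code?: string;
  readonly originalError?: Error;

  constructor(message: string, code?: string, originalError?: Error) {
    super(message);
    this.name = "FetchError";
    this.code = code;
    this.originalError = originalError;
    // Keeps instanceof working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class FetchEngineHttpError extends FetchError {
  readonly statusCode: number;

  constructor(message: string, statusCode: number) {
    // Assumption: HTTP errors carry the ERR_HTTP_ERROR code shown in the README example.
    super(message, "ERR_HTTP_ERROR");
    this.name = "FetchEngineHttpError";
    this.statusCode = statusCode;
  }
}

// Hypothetical helper: check the most specific class first, then the base class.
function classify(error: unknown): string {
  if (error instanceof FetchEngineHttpError) {
    return `http:${error.statusCode}`;
  }
  if (error instanceof FetchError) {
    return `fetch:${error.code ?? "UNKNOWN"}`;
  }
  return "other";
}

const httpError = new FetchEngineHttpError("Not Found", 404);
console.log(classify(httpError), httpError instanceof FetchError);
```

Under the 0.1.3 declaration (`extends Error`), the second branch would not have matched HTTP errors, so handlers written against `FetchError` alone may start catching them after the upgrade.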

package/dist/FetchEngine.d.ts.map CHANGED

@@ -1 +1 @@
-{"version":3,"file":"FetchEngine.d.ts","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,eAAe,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAClE,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAG5C;;GAEG;AACH,qBAAa,oBAAqB,SAAQ,KAAK;IAC7C,SAAgB,UAAU,EAAE,MAAM,CAAC;gBAEvB,OAAO,EAAE,MAAM,EAAE,UAAU,EAAE,MAAM;CAShD;AAED;;;;;GAKG;AACH,qBAAa,WAAY,YAAW,OAAO;IACzC,OAAO,CAAC,QAAQ,CAAC,OAAO,CAAyB;IAEjD;;;OAGG;;IAeH;;;;;;;OAOG;IACG,SAAS,CAAC,GAAG,EAAE,MAAM,GAAG,OAAO,CAAC,eAAe,CAAC;IAgDtD,OAAO,CAAC,SAAS;IA+BjB;;;;OAIG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;IAK9B;;;;OAIG;IACH,UAAU,IAAI,cAAc,EAAE;CAI/B"}
+{"version":3,"file":"FetchEngine.d.ts","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,eAAe,EAAE,cAAc,EAAE,kBAAkB,EAAE,MAAM,YAAY,CAAC;AACtF,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAG5C,OAAO,EAAE,UAAU,EAAE,MAAM,aAAa,CAAC;AAEzC;;GAEG;AACH,qBAAa,oBAAqB,SAAQ,UAAU;aAGhC,UAAU,EAAE,MAAM;gBADlC,OAAO,EAAE,MAAM,EACC,UAAU,EAAE,MAAM;CAKrC;AAED;;;;;GAKG;AACH,qBAAa,WAAY,YAAW,OAAO;IACzC,OAAO,CAAC,QAAQ,CAAC,OAAO,CAA+B;IAEvD,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,eAAe,CAErC;IAEF;;;OAGG;gBACS,OAAO,GAAE,kBAAuB;IAI5C;;;;;;;OAOG;IACG,SAAS,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,EAAE,kBAAkB,GAAG,OAAO,CAAC,eAAe,CAAC;IAiEpF;;;;OAIG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;IAI9B;;;;OAIG;IACH,UAAU,IAAI,cAAc,EAAE;CAG/B"}