@purepageio/fetch-engines 0.9.1 → 0.10.1
- package/README.md +62 -30
- package/dist/FetchEngine.d.ts +0 -1
- package/dist/FetchEngine.d.ts.map +1 -1
- package/dist/FetchEngine.js +3 -15
- package/dist/FetchEngine.js.map +1 -1
- package/dist/HybridEngine.d.ts +2 -1
- package/dist/HybridEngine.d.ts.map +1 -1
- package/dist/HybridEngine.js +51 -30
- package/dist/HybridEngine.js.map +1 -1
- package/dist/PlaywrightEngine.d.ts +6 -1
- package/dist/PlaywrightEngine.d.ts.map +1 -1
- package/dist/PlaywrightEngine.js +169 -32
- package/dist/PlaywrightEngine.js.map +1 -1
- package/dist/evals/auto-render-cases.d.ts +17 -0
- package/dist/evals/auto-render-cases.d.ts.map +1 -0
- package/dist/evals/auto-render-cases.js +164 -0
- package/dist/evals/auto-render-cases.js.map +1 -0
- package/dist/types.d.ts +7 -5
- package/dist/types.d.ts.map +1 -1
- package/dist/utils/markdown-converter.d.ts +12 -0
- package/dist/utils/markdown-converter.d.ts.map +1 -1
- package/dist/utils/markdown-converter.js +244 -11
- package/dist/utils/markdown-converter.js.map +1 -1
- package/dist/utils/render-detection.d.ts +29 -0
- package/dist/utils/render-detection.d.ts.map +1 -0
- package/dist/utils/render-detection.js +145 -0
- package/dist/utils/render-detection.js.map +1 -0
- package/package.json +5 -2
package/README.md
CHANGED
@@ -2,13 +2,18 @@
 
 [](https://www.npmjs.com/package/@purepageio/fetch-engines)
 [](https://github.com/purepage/fetch-engines/actions/workflows/publish.yml)
+[](https://github.com/purepage/fetch-engines/actions/workflows/live-browser-evals.yml)
 [](https://opensource.org/licenses/MIT)
 
-
+Reliable web extraction.
+
+HTTP-first for speed. Browser-backed when needed. Clean Markdown, soft-block handling, and structured extraction for RAG and AI pipelines.
 
 ## Table of contents
 
 - [Why fetch-engines?](#why-fetch-engines)
+- [Why trust fetch-engines](#why-trust-fetch-engines)
+- [Library vs hosted crawler](#library-vs-hosted-crawler)
 - [Installation](#installation)
 - [Quick start](#quick-start)
 - [Usage patterns](#usage-patterns)
@@ -26,11 +31,33 @@ Fetch web pages as clean Markdown or structured data. HTTP-first with automatic
 ## Why fetch-engines?
 
 - **One API for multiple strategies** – Call `fetchHTML` for rendered pages or `fetchContent` for raw responses. The library handles HTTP shortcuts and Playwright fallbacks automatically.
-- **
-- **
-- **
+- **Automatic app-shell & soft-block detection** – Shell-like HTTP responses and bot-gate pages (Cloudflare challenges, CAPTCHAs, "verify you're human") are upgraded to Playwright rendering by default, so client-rendered pages and soft blocks work without per-domain rules.
+- **RAG-ready Markdown** – Convert public content pages to clean Markdown with boilerplate, nav, and SVG noise stripped out. Powered by a Rust-native converter.
+- **HTTP-first, browser-backed when needed** – Fast pages stay cheap via plain HTTP, while harder pages automatically benefit from Playwright fallback.
+- **Structured extraction built in** – Define a Zod schema and go from URL to typed data via any OpenAI-compatible API. The page is fetched as Markdown first to minimise tokens.
 - **Playwright is optional** – `FetchEngine` works without browser dependencies. Playwright is only loaded when you use `HybridEngine` or `PlaywrightEngine`.
 
+## Why trust fetch-engines
+
+- **19 live URLs across 7 archetypes** (docs, government, knowledge, marketing, commerce, static, access-guarded) validated on every release and nightly via browser-enabled CI
+- **85 unit tests + dedicated live browser eval workflow** — not just "it compiles," but "it extracts real content from real pages"
+- Handles app shells, Cloudflare challenges, CAPTCHAs, and utility-class-heavy doc sites (Tailwind, Vite) without per-domain rules
+- Produces clean Markdown with absolute URLs — boilerplate removal typically reduces raw HTML to 10–30% of its original size before it reaches your LLM
+- Structured extraction with Zod schemas and any OpenAI-compatible provider, in the same pipeline as page fetching
+
+## Library vs hosted crawler
+
+| | fetch-engines | Hosted crawlers |
+| ------------------- | --------------------------------------- | ------------------- |
+| **Runs where** | Your Node.js process | Third-party API |
+| **Data stays** | In your infrastructure | Leaves your network |
+| **Cost model** | Free + your compute | Per-page pricing |
+| **Customisation** | Full source access, tune heuristics | Configuration flags |
+| **Browser control** | Your Playwright instance, your proxy | Opaque |
+| **Transparency** | Open tests, open evals, open heuristics | Black box |
+
+Choose `fetch-engines` when you want full control over extraction, data residency, and cost. Choose a hosted crawler when you need managed infrastructure and don't want to run browsers yourself.
+
 ## Installation
 
 ```bash
@@ -84,7 +111,9 @@ console.log(page.contentType); // "markdown"
 await engine.cleanup();
 ```
 
-`FetchEngine` also supports `markdown: true` for static pages that don't need JavaScript rendering.
+`FetchEngine` also supports `markdown: true` for static pages that don't need JavaScript rendering. `HybridEngine` now decides whether to render before converting to Markdown, so shell detection still works when callers request Markdown output.
+Relative links and image URLs in Markdown output are normalized to absolute URLs using the final fetched page URL. The converter strips generic UI chrome (nav/footer/button controls and dense link clusters) using domain-agnostic heuristics, while preserving content on pages without semantic `<main>`/`<article>` containers (e.g., Tailwind CSS docs).
+The extraction path is tuned for publicly accessible content. Paywalled or member-only pages may still return intentionally partial content unless you supply authenticated access yourself.
 
 ### Structured extraction
 
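The README change above says relative link and image targets are resolved against the final fetched page URL. The converter's own implementation is not part of this section of the diff, so the following is only a minimal sketch of that idea. The function name `absolutizeMarkdownLinks` and the regular expression are assumptions made for illustration, and link titles or nested parentheses are deliberately not handled.

```typescript
// Hypothetical helper: not the package's converter, just an illustration of
// resolving relative Markdown targets against a base URL with the WHATWG URL API.
function absolutizeMarkdownLinks(markdown: string, baseUrl: string): string {
  // Matches [text](target) and ![alt](target) where target has no spaces or ")".
  const linkPattern = /(!?\[[^\]]*\]\()([^)\s]+)(\))/g;
  return markdown.replace(
    linkPattern,
    (match: string, open: string, target: string, close: string): string => {
      try {
        // new URL(relative, base) leaves absolute targets as they are
        // and resolves relative ones against the base.
        return `${open}${new URL(target, baseUrl).href}${close}`;
      } catch {
        // Leave targets that cannot be parsed untouched.
        return match;
      }
    },
  );
}

const sample = "[Docs](/guide) ![logo](img/a.png)";
console.log(absolutizeMarkdownLinks(sample, "https://example.com/docs/intro"));
```

Passing the final response URL rather than the requested URL matters when a redirect changes the path, because relative targets are resolved against wherever the page actually ended up.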
@@ -131,7 +160,7 @@ When you supply a custom `baseURL`, the engine automatically switches to the Ver
 All engines accept familiar `fetch` options such as custom headers. Additional Hybrid/Playwright options you are likely to tweak:
 
 - `markdown` – return Markdown instead of HTML.
-- `spaMode` & `spaRenderDelayMs`
+- Automatic shell detection is enabled by default. `spaMode` & `spaRenderDelayMs` still force a more patient render path when you know a page is highly dynamic.
 - `cacheTTL`, `maxRetries`, and browser pool sizes – control resilience and throughput.
 
 Check the inline TypeScript docs or the [`/examples`](./examples) directory for end-to-end flows.
@@ -140,30 +169,30 @@ Check the inline TypeScript docs or the [`/examples`](./examples) directory for
 
 Every option from `PlaywrightEngineConfig` (consumed by `HybridEngine`) with defaults:
 
-| Option | Default | Purpose
-| -------------------------- | ----------- |
-| `headers` | `{}` | Extra headers merged into every request.
-| `concurrentPages` | `3` | Maximum Playwright pages processed at once.
-| `maxRetries` | `3` | Additional retry attempts after the first failure.
-| `retryDelay` | `5000` | Milliseconds to wait between retries.
-| `cacheTTL` | `900000` | Cache lifetime in ms (`0` disables caching).
-| `useHttpFallback` | `true` | Try a fast HTTP GET before spinning up Playwright.
-| `useHeadedModeFallback` | `false` | Automatically retry a domain in headed mode after repeated failures.
-| `defaultFastMode` | `true` | Block non-critical assets and skip human simulation unless overridden.
-| `simulateHumanBehavior` | `true` | When not in fast mode, add delays and scrolling to avoid bot detection.
-| `maxBrowsers` | `2` | Highest number of Playwright browser instances kept in the pool.
-| `maxPagesPerContext` | `6` | Pages opened per browser context before recycling it.
-| `maxBrowserAge` | `1200000` | Milliseconds before a browser instance is torn down (20 minutes).
-| `healthCheckInterval` | `60000` | Pool health check frequency in ms.
-| `poolBlockedDomains` | `[]` | Domains blocked across every Playwright request (inherit pool defaults if empty).
-| `poolBlockedResourceTypes` | `[]` | Resource types (e.g. `"image"`) blocked globally.
-| `proxy` | `undefined` | Per-browser proxy `{ server, username?, password? }`.
-| `useHeadedMode` | `false` | Force every browser to launch with a visible window.
-| `markdown` | `false` | Return Markdown instead of raw HTML. Converts via a Rust-native engine with boilerplate removal.
-| `spaMode` | `false` |
-| `spaRenderDelayMs` | `0` |
-| `playwrightOnlyPatterns` | `[]` | URLs matching any string/regex go straight to Playwright, skipping HTTP
-| `playwrightLaunchOptions` | `undefined` | Options passed to `browserType.launch` (see Playwright docs).
+| Option | Default | Purpose |
+| -------------------------- | ----------- | -------------------------------------------------------------------------------------------------- |
+| `headers` | `{}` | Extra headers merged into every request. |
+| `concurrentPages` | `3` | Maximum Playwright pages processed at once. |
+| `maxRetries` | `3` | Additional retry attempts after the first failure. |
+| `retryDelay` | `5000` | Milliseconds to wait between retries. |
+| `cacheTTL` | `900000` | Cache lifetime in ms (`0` disables caching). |
+| `useHttpFallback` | `true` | Try a fast HTTP GET before spinning up Playwright. |
+| `useHeadedModeFallback` | `false` | Automatically retry a domain in headed mode after repeated failures. |
+| `defaultFastMode` | `true` | Block non-critical assets and skip human simulation unless overridden. |
+| `simulateHumanBehavior` | `true` | When not in fast mode, add delays and scrolling to avoid bot detection. |
+| `maxBrowsers` | `2` | Highest number of Playwright browser instances kept in the pool. |
+| `maxPagesPerContext` | `6` | Pages opened per browser context before recycling it. |
+| `maxBrowserAge` | `1200000` | Milliseconds before a browser instance is torn down (20 minutes). |
+| `healthCheckInterval` | `60000` | Pool health check frequency in ms. |
+| `poolBlockedDomains` | `[]` | Domains blocked across every Playwright request (inherit pool defaults if empty). |
+| `poolBlockedResourceTypes` | `[]` | Resource types (e.g. `"image"`) blocked globally. |
+| `proxy` | `undefined` | Per-browser proxy `{ server, username?, password? }`. |
+| `useHeadedMode` | `false` | Force every browser to launch with a visible window. |
+| `markdown` | `false` | Return Markdown instead of raw HTML. Converts via a Rust-native engine with boilerplate removal. |
+| `spaMode` | `false` | Force the more patient render path. Many shell-like pages are auto-detected even when this is off. |
+| `spaRenderDelayMs` | `0` | Minimum extra wait budget when `spaMode` is `true`. |
+| `playwrightOnlyPatterns` | `[]` | URLs matching any string/regex go straight to Playwright, skipping HTTP shell detection. |
+| `playwrightLaunchOptions` | `undefined` | Options passed to `browserType.launch` (see Playwright docs). |
 
 Per-request overrides: `fetchHTML` accepts `fastMode`, `markdown`, `spaMode`, and `headers`, while `fetchContent` supports `fastMode` and `headers`.
 
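The table lists engine-level defaults and the closing sentence lists per-request overrides, which suggests a three-layer lookup: table default, then engine config, then the per-request value. The sketch below illustrates that layering for a small subset of options. The names `EngineOptions`, `TABLE_DEFAULTS`, and `resolveOptions` are hypothetical, and the precedence order and header merging are assumptions rather than something this part of the diff confirms.

```typescript
// Hypothetical subset of the options table; defaults are copied from the README.
interface EngineOptions {
  markdown: boolean;
  spaMode: boolean;
  spaRenderDelayMs: number;
  maxRetries: number;
  retryDelay: number;
  cacheTTL: number;
  headers: Record<string, string>;
}

const TABLE_DEFAULTS: EngineOptions = {
  markdown: false,
  spaMode: false,
  spaRenderDelayMs: 0,
  maxRetries: 3,
  retryDelay: 5000,
  cacheTTL: 900000,
  headers: {},
};

// Overrides the README lists for fetchHTML (fastMode is left out of this subset).
type PerRequestOverrides = Partial<Pick<EngineOptions, "markdown" | "spaMode" | "headers">>;

// Assumed precedence: per-request value, then engine config, then table default.
// Headers are merged key by key instead of being replaced wholesale.
function resolveOptions(
  engineConfig: Partial<EngineOptions> = {},
  perRequest: PerRequestOverrides = {},
): EngineOptions {
  return {
    ...TABLE_DEFAULTS,
    ...engineConfig,
    ...perRequest,
    headers: {
      ...TABLE_DEFAULTS.headers,
      ...engineConfig.headers,
      ...perRequest.headers,
    },
  };
}

const resolved = resolveOptions(
  { markdown: true, headers: { "x-team": "docs" } },
  { spaMode: true, headers: { "x-trace": "abc" } },
);
console.log(resolved.markdown, resolved.spaMode, resolved.maxRetries, resolved.headers);
```

Options such as `maxRetries` or `cacheTTL` appear only at the engine level in this sketch because the README does not list them among the per-request overrides.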
@@ -176,6 +205,9 @@ Failures raise a typed `FetchError` exposing `code`, `statusCode`, and the under
 - Explore the [`examples`](./examples) directory for scripts you can run end-to-end.
 - Ready-to-use TypeScript types ship with the package.
 - `pnpm test` runs the automated suite when you are ready to contribute.
+- `pnpm eval:auto-render` runs a live Hybrid-vs-HTTP quality matrix across docs, government, knowledge, marketing, commerce, and access-guarded pages, using a stable gated core plus observe-only sentinels for harder domains.
+- `pnpm test:live:auto-render` runs the same hypothesis as a Vitest live test (`LIVE_NETWORK=1`).
+- GitHub Actions includes a dedicated browser-enabled live eval workflow that runs on `main` changes, nightly on a schedule, and on manual dispatch. It uploads the JSON report as a build artifact.
 
 ## Contributing
 
package/dist/FetchEngine.d.ts
CHANGED
@@ -31,7 +31,6 @@ export declare class FetchEngine implements IEngine {
      * @throws {Error} If the content type is not HTML or for other network errors.
      */
     fetchHTML(url: string, options?: FetchEngineOptions): Promise<HTMLFetchResult>;
-    private _injectSourceUnderH1;
     /**
      * Fetches raw content from the specified URL (mimics standard fetch API).
      *
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"FetchEngine.d.ts","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EACV,eAAe,EACf,kBAAkB,EAClB,mBAAmB,EACnB,cAAc,EACd,kBAAkB,EACnB,MAAM,YAAY,CAAC;AACpB,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAG5C,OAAO,EAAE,UAAU,EAAE,MAAM,aAAa,CAAC;AAEzC;;GAEG;AACH,qBAAa,oBAAqB,SAAQ,UAAU;aAGhC,UAAU,EAAE,MAAM;gBADlC,OAAO,EAAE,MAAM,EACC,UAAU,EAAE,MAAM;CAKrC;AAED;;;;;GAKG;AACH,qBAAa,WAAY,YAAW,OAAO;IACzC,OAAO,CAAC,QAAQ,CAAC,OAAO,CAA+B;IAEvD,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,eAAe,CAGrC;IAEF;;;OAGG;gBACS,OAAO,GAAE,kBAAuB;IAI5C;;;;;;;OAOG;IACG,SAAS,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,EAAE,kBAAkB,GAAG,OAAO,CAAC,eAAe,CAAC;
|
|
1
|
+
{"version":3,"file":"FetchEngine.d.ts","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EACV,eAAe,EACf,kBAAkB,EAClB,mBAAmB,EACnB,cAAc,EACd,kBAAkB,EACnB,MAAM,YAAY,CAAC;AACpB,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAG5C,OAAO,EAAE,UAAU,EAAE,MAAM,aAAa,CAAC;AAEzC;;GAEG;AACH,qBAAa,oBAAqB,SAAQ,UAAU;aAGhC,UAAU,EAAE,MAAM;gBADlC,OAAO,EAAE,MAAM,EACC,UAAU,EAAE,MAAM;CAKrC;AAED;;;;;GAKG;AACH,qBAAa,WAAY,YAAW,OAAO;IACzC,OAAO,CAAC,QAAQ,CAAC,OAAO,CAA+B;IAEvD,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,eAAe,CAGrC;IAEF;;;OAGG;gBACS,OAAO,GAAE,kBAAuB;IAI5C;;;;;;;OAOG;IACG,SAAS,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,EAAE,kBAAkB,GAAG,OAAO,CAAC,eAAe,CAAC;IAgFpF;;;;;;;;OAQG;IACG,YAAY,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,CAAC,EAAE,mBAAmB,GAAG,OAAO,CAAC,kBAAkB,CAAC;IA8E3F;;;;OAIG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;IAI9B;;;;OAIG;IACH,UAAU,IAAI,cAAc,EAAE;CAG/B"}
|
package/dist/FetchEngine.js
CHANGED
@@ -1,4 +1,4 @@
-import { MarkdownConverter } from "./utils/markdown-converter.js";
+import { MarkdownConverter, injectSourceUrl } from "./utils/markdown-converter.js";
 import { FetchError } from "./errors.js"; // Only import FetchError
 /**
  * Custom error class for HTTP errors from FetchEngine.
@@ -76,9 +76,8 @@ export class FetchEngine {
             if (effectiveOptions.markdown) {
                 try {
                     const converter = new MarkdownConverter();
-                    finalContent = converter.convert(html);
-
-                    finalContent = this._injectSourceUnderH1(finalContent, response.url || url);
+                    finalContent = converter.convert(html, { baseUrl: response.url || url });
+                    finalContent = injectSourceUrl(finalContent, response.url || url);
                     finalContentType = "markdown";
                 }
                 catch (conversionError) {
@@ -107,17 +106,6 @@ export class FetchEngine {
             throw new FetchError(`Fetch failed: ${message}`, "ERR_FETCH_FAILED", error instanceof Error ? error : undefined);
         }
     }
-    // Insert a "Source: <url>" line immediately below the first H1.
-    _injectSourceUnderH1(markdown, sourceUrl) {
-        if (!markdown || !sourceUrl)
-            return markdown;
-        // Avoid duplicate insertion if already present near the top
-        const head = markdown.split("\n").slice(0, 50).join("\n");
-        if (/^Source:\s+/m.test(head))
-            return markdown;
-        const safeUrl = sourceUrl.trim();
-        return markdown.replace(/^(\s*#\s.*)$/m, `$1\n\nSource: ${safeUrl}`);
-    }
     /**
      * Fetches raw content from the specified URL (mimics standard fetch API).
      *
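The private `_injectSourceUnderH1` method is removed here and its call site now uses `injectSourceUrl` imported from `./utils/markdown-converter.js`. The body of the new helper is not visible in this part of the diff, so the sketch below is a standalone TypeScript port of the removed method only, on the assumption that the shared helper behaves similarly. The function name `injectSourceUnderH1` is illustrative. One deliberate difference is that it uses a replacer function so that `$` characters in a URL are not read as replacement patterns.

```typescript
// Standalone port of the removed private method (assumed, not confirmed,
// to resemble the new shared helper in markdown-converter.js).
function injectSourceUnderH1(markdown: string, sourceUrl: string): string {
  if (!markdown || !sourceUrl) {
    return markdown;
  }
  // Skip insertion when a "Source:" line already exists in the first 50 lines.
  const head = markdown.split("\n").slice(0, 50).join("\n");
  if (/^Source:\s+/m.test(head)) {
    return markdown;
  }
  const safeUrl = sourceUrl.trim();
  // Only the first H1 ("# Title") is matched; "##" headings do not qualify.
  return markdown.replace(
    /^(\s*#\s.*)$/m,
    (heading: string): string => `${heading}\n\nSource: ${safeUrl}`,
  );
}

const withSource = injectSourceUnderH1("# Title\n\nBody", "https://example.com/a");
console.log(withSource);
```

Calling the function a second time on its own output returns the text unchanged, which is the duplicate-insertion guard the removed code comment describes.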
package/dist/FetchEngine.js.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"FetchEngine.js","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AASA,OAAO,EAAE,iBAAiB,EAAE,MAAM,+BAA+B,CAAC
|
|
1
|
+
{"version":3,"file":"FetchEngine.js","sourceRoot":"","sources":["../src/FetchEngine.ts"],"names":[],"mappings":"AASA,OAAO,EAAE,iBAAiB,EAAE,eAAe,EAAE,MAAM,+BAA+B,CAAC;AACnF,OAAO,EAAE,UAAU,EAAE,MAAM,aAAa,CAAC,CAAC,yBAAyB;AAEnE;;GAEG;AACH,MAAM,OAAO,oBAAqB,SAAQ,UAAU;IAGhC;IAFlB,YACE,OAAe,EACC,UAAkB;QAElC,KAAK,CAAC,OAAO,EAAE,gBAAgB,EAAE,SAAS,EAAE,UAAU,CAAC,CAAC;QAFxC,eAAU,GAAV,UAAU,CAAQ;QAGlC,IAAI,CAAC,IAAI,GAAG,sBAAsB,CAAC;IACrC,CAAC;CACF;AAED;;;;;GAKG;AACH,MAAM,OAAO,WAAW;IACL,OAAO,CAA+B;IAE/C,MAAM,CAAU,eAAe,GAAiC;QACtE,QAAQ,EAAE,KAAK;QACf,OAAO,EAAE,EAAE;KACZ,CAAC;IAEF;;;OAGG;IACH,YAAY,UAA8B,EAAE;QAC1C,IAAI,CAAC,OAAO,GAAG,EAAE,GAAG,WAAW,CAAC,eAAe,EAAE,GAAG,OAAO,EAAE,CAAC;IAChE,CAAC;IAED;;;;;;;OAOG;IACH,KAAK,CAAC,SAAS,CAAC,GAAW,EAAE,OAA4B;QACvD,MAAM,gBAAgB,GAAG,EAAE,GAAG,IAAI,CAAC,OAAO,EAAE,GAAG,OAAO,EAAE,CAAC,CAAC,uCAAuC;QACjG,IAAI,QAAkB,CAAC;QACvB,IAAI,CAAC;YACH,MAAM,WAAW,GAAG;gBAClB,YAAY,EACV,iHAAiH;gBACnH,MAAM,EAAE,kGAAkG;gBAC1G,iBAAiB,EAAE,gBAAgB;aACpC,CAAC;YAEF,6DAA6D;YAC7D,MAAM,kBAAkB,GAAG,IAAI,CAAC,OAAO,CAAC,OAAO,IAAI,EAAE,CAAC;YAEtD,sEAAsE;YACtE,0GAA0G;YAC1G,MAAM,mBAAmB,GAAG,OAAO,EAAE,OAAO,IAAI,EAAE,CAAC;YAEnD,MAAM,YAAY,GAAG;gBACnB,GAAG,WAAW;gBACd,GAAG,kBAAkB;gBACrB,GAAG,mBAAmB,EAAE,sFAAsF;aAC/G,CAAC;YAEF,QAAQ,GAAG,MAAM,KAAK,CAAC,GAAG,EAAE;gBAC1B,QAAQ,EAAE,QAAQ;gBAClB,OAAO,EAAE,YAAY;aACtB,CAAC,CAAC;YAEH,IAAI,CAAC,QAAQ,CAAC,EAAE,EAAE,CAAC;gBACjB,MAAM,IAAI,oBAAoB,CAAC,uBAAuB,QAAQ,CAAC,MAAM,EAAE,EAAE,QAAQ,CAAC,MAAM,CAAC,CAAC;YAC5F,CAAC;YAED,MAAM,iBAAiB,GAAG,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,cAAc,CAAC,CAAC;YAC/D,IAAI,CAAC,iBAAiB,IAAI,CAAC,iBAAiB,CAAC,QAAQ,CAAC,WAAW,CAAC,EAAE,CAAC;gBACnE,MAAM,IAAI,UAAU,CAAC,+BAA+B,EAAE,sBAAsB,CAAC,CAAC;YAChF,CAAC;YAED,MAAM,IAAI,GAAG,MAAM,QAAQ,CAAC,IAAI,EAAE,CAAC;YACnC,MAAM,UAAU,GAAG,IAAI,CAAC,KAAK,CAAC,+BAA+B,CAAC,CAAC;YAC/D,MAAM,KAAK,GAAG,UAAU,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC,CAAC,CAAC,IAAI,CAAC;YAEvD,IAAI,YAAY,GAAG,IAAI,CAAC;YACxB,IAAI,gBAAgB,GAAwB,MAAM,CAAC;YAEnD,IAAI,gBAAgB,CAAC,QAAQ,EAAE,CAAC;gBAC9B,IAAI,CAAC;o
BACH,MAAM,SAAS,GAAG,IAAI,iBAAiB,EAAE,CAAC;oBAC1C,YAAY,GAAG,SAAS,CAAC,OAAO,CAAC,IAAI,EAAE,EAAE,OAAO,EAAE,QAAQ,CAAC,GAAG,IAAI,GAAG,EAAE,CAAC,CAAC;oBACzE,YAAY,GAAG,eAAe,CAAC,YAAY,EAAE,QAAQ,CAAC,GAAG,IAAI,GAAG,CAAC,CAAC;oBAClE,gBAAgB,GAAG,UAAU,CAAC;gBAChC,CAAC;gBAAC,OAAO,eAAwB,EAAE,CAAC;oBAClC,OAAO,CAAC,KAAK,CAAC,kCAAkC,GAAG,iBAAiB,EAAE,eAAe,CAAC,CAAC;oBACvF,gDAAgD;gBAClD,CAAC;YACH,CAAC;YAED,OAAO;gBACL,OAAO,EAAE,YAAY;gBACrB,WAAW,EAAE,gBAAgB;gBAC7B,KAAK,EAAE,KAAK;gBACZ,GAAG,EAAE,QAAQ,CAAC,GAAG,EAAE,oCAAoC;gBACvD,WAAW,EAAE,KAAK;gBAClB,UAAU,EAAE,QAAQ,CAAC,MAAM;gBAC3B,KAAK,EAAE,SAAS;aACjB,CAAC;QACJ,CAAC;QAAC,OAAO,KAAc,EAAE,CAAC;YACxB,0CAA0C;YAC1C,IACE,KAAK,YAAY,oBAAoB;gBACrC,CAAC,KAAK,YAAY,UAAU,IAAI,KAAK,CAAC,IAAI,KAAK,sBAAsB,CAAC,EACtE,CAAC;gBACD,MAAM,KAAK,CAAC;YACd,CAAC;YACD,+BAA+B;YAC/B,MAAM,OAAO,GAAG,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,qBAAqB,CAAC;YAC/E,MAAM,IAAI,UAAU,CAAC,iBAAiB,OAAO,EAAE,EAAE,kBAAkB,EAAE,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;QACnH,CAAC;IACH,CAAC;IAED;;;;;;;;OAQG;IACH,KAAK,CAAC,YAAY,CAAC,GAAW,EAAE,OAA6B;QAC3D,IAAI,QAAkB,CAAC;QACvB,IAAI,CAAC;YACH,MAAM,WAAW,GAAG;gBAClB,YAAY,EACV,iHAAiH;gBACnH,MAAM,EAAE,KAAK,EAAE,mDAAmD;aACnE,CAAC;YAEF,sDAAsD;YACtD,MAAM,kBAAkB,GAAG,IAAI,CAAC,OAAO,CAAC,OAAO,IAAI,EAAE,CAAC;YACtD,MAAM,mBAAmB,GAAG,OAAO,EAAE,OAAO,IAAI,EAAE,CAAC;YAEnD,MAAM,YAAY,GAAG;gBACnB,GAAG,WAAW;gBACd,GAAG,kBAAkB;gBACrB,GAAG,mBAAmB;aACvB,CAAC;YAEF,QAAQ,GAAG,MAAM,KAAK,CAAC,GAAG,EAAE;gBAC1B,QAAQ,EAAE,QAAQ;gBAClB,OAAO,EAAE,YAAY;aACtB,CAAC,CAAC;YAEH,IAAI,CAAC,QAAQ,CAAC,EAAE,EAAE,CAAC;gBACjB,MAAM,IAAI,oBAAoB,CAAC,uBAAuB,QAAQ,CAAC,MAAM,EAAE,EAAE,QAAQ,CAAC,MAAM,CAAC,CAAC;YAC5F,CAAC;YAED,MAAM,iBAAiB,GAAG,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,cAAc,CAAC,IAAI,0BAA0B,CAAC;YAE7F,+CAA+C;YAC/C,MAAM,WAAW,GACf,iBAAiB,CAAC,UAAU,CAAC,OAAO,CAAC;gBACrC,iBAAiB,CAAC,QAAQ,CAAC,MAAM,CAAC;gBAClC,iBAAiB,CAAC,QAAQ,CAAC,KAAK,CAAC;gBACjC,iBAAiB,CAAC,QAAQ,CAAC,YAAY,CAAC;gBACxC,iBAAiB,CAAC,QAAQ,CAAC,MAAM,CAAC;gBAClC,iBAAiB,CAAC,QAAQ,CAAC,KAAK,CAA
C,CAAC;YAEpC,IAAI,OAAwB,CAAC;YAC7B,IAAI,WAAW,EAAE,CAAC;gBAChB,OAAO,GAAG,MAAM,QAAQ,CAAC,IAAI,EAAE,CAAC;YAClC,CAAC;iBAAM,CAAC;gBACN,MAAM,WAAW,GAAG,MAAM,QAAQ,CAAC,WAAW,EAAE,CAAC;gBACjD,OAAO,GAAG,MAAM,CAAC,IAAI,CAAC,WAAW,CAAC,CAAC;YACrC,CAAC;YAED,wCAAwC;YACxC,IAAI,KAAK,GAAkB,IAAI,CAAC;YAChC,IAAI,OAAO,OAAO,KAAK,QAAQ,IAAI,iBAAiB,CAAC,QAAQ,CAAC,MAAM,CAAC,EAAE,CAAC;gBACtE,MAAM,UAAU,GAAG,OAAO,CAAC,KAAK,CAAC,+BAA+B,CAAC,CAAC;gBAClE,KAAK,GAAG,UAAU,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC,CAAC,CAAC,IAAI,CAAC;YACnD,CAAC;YAED,OAAO;gBACL,OAAO;gBACP,WAAW,EAAE,iBAAiB;gBAC9B,KAAK;gBACL,GAAG,EAAE,QAAQ,CAAC,GAAG,EAAE,oCAAoC;gBACvD,WAAW,EAAE,KAAK;gBAClB,UAAU,EAAE,QAAQ,CAAC,MAAM;gBAC3B,KAAK,EAAE,SAAS;aACjB,CAAC;QACJ,CAAC;QAAC,OAAO,KAAc,EAAE,CAAC;YACxB,0CAA0C;YAC1C,IAAI,KAAK,YAAY,oBAAoB,EAAE,CAAC;gBAC1C,MAAM,KAAK,CAAC;YACd,CAAC;YACD,+BAA+B;YAC/B,MAAM,OAAO,GAAG,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,qBAAqB,CAAC;YAC/E,MAAM,IAAI,UAAU,CAClB,yBAAyB,OAAO,EAAE,EAClC,kBAAkB,EAClB,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,SAAS,CAC3C,CAAC;QACJ,CAAC;IACH,CAAC;IAED;;;;OAIG;IACH,KAAK,CAAC,OAAO;QACX,OAAO,OAAO,CAAC,OAAO,EAAE,CAAC;IAC3B,CAAC;IAED;;;;OAIG;IACH,UAAU;QACR,OAAO,EAAE,CAAC;IACZ,CAAC"}
|
package/dist/HybridEngine.d.ts
CHANGED
@@ -9,7 +9,8 @@ export declare class HybridEngine implements IEngine {
     private readonly config;
     private readonly playwrightOnlyPatterns;
     constructor(config?: PlaywrightEngineConfig);
-    private
+    private _convertHtmlToMarkdown;
+    private _shouldAutoRender;
     fetchHTML(url: string, options?: FetchOptions): Promise<HTMLFetchResult>;
     /**
      * Fetches raw content from the specified URL using the hybrid approach.
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"HybridEngine.d.ts","sourceRoot":"","sources":["../src/HybridEngine.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;
|
|
1
|
+
{"version":3,"file":"HybridEngine.d.ts","sourceRoot":"","sources":["../src/HybridEngine.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AAQ5C,OAAO,KAAK,EACV,eAAe,EACf,kBAAkB,EAClB,mBAAmB,EACnB,sBAAsB,EACtB,YAAY,EACZ,cAAc,EACf,MAAM,YAAY,CAAC;AAEpB;;GAEG;AACH,qBAAa,YAAa,YAAW,OAAO;IAC1C,OAAO,CAAC,QAAQ,CAAC,WAAW,CAAc;IAC1C,OAAO,CAAC,QAAQ,CAAC,gBAAgB,CAAmB;IACpD,OAAO,CAAC,QAAQ,CAAC,MAAM,CAAyB;IAChD,OAAO,CAAC,QAAQ,CAAC,sBAAsB,CAAsB;gBAEjD,MAAM,GAAE,sBAA2B;IAS/C,OAAO,CAAC,sBAAsB;IAkB9B,OAAO,CAAC,iBAAiB;IAUnB,SAAS,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,GAAE,YAAiB,GAAG,OAAO,CAAC,eAAe,CAAC;IAsGlF;;;;;;;;;OASG;IACG,YAAY,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,GAAE,mBAAwB,GAAG,OAAO,CAAC,kBAAkB,CAAC;IA0C/F;;OAEG;IACH,UAAU,IAAI,cAAc,EAAE;IAI9B;;OAEG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;CAM/B"}
|
package/dist/HybridEngine.js
CHANGED
@@ -1,5 +1,7 @@
 import { FetchEngine, FetchEngineHttpError } from "./FetchEngine.js";
 import { PlaywrightEngine } from "./PlaywrightEngine.js";
+import { MarkdownConverter, injectSourceUrl } from "./utils/markdown-converter.js";
+import { assessHtmlRenderNeed, assessSerializedContent, isRenderedContentMeaningfullyBetter, isSoftBlockPage, } from "./utils/render-detection.js";
 /**
  * HybridEngine - Tries FetchEngine first, falls back to PlaywrightEngine on failure.
  */
@@ -10,30 +12,35 @@ export class HybridEngine {
     playwrightOnlyPatterns;
     constructor(config = {}) {
         // Pass relevant config parts to each engine
-        //
-
-        this.fetchEngine = new FetchEngine({ markdown: config.markdown, headers: config.headers });
+        // HybridEngine fetches raw HTML first so it can decide whether rendering is necessary.
+        this.fetchEngine = new FetchEngine({ markdown: false, headers: config.headers });
         this.playwrightEngine = new PlaywrightEngine(config);
         this.config = config; // Store for merging later
         this.playwrightOnlyPatterns = config.playwrightOnlyPatterns || [];
     }
-
-
-
-
-
-
+    _convertHtmlToMarkdown(htmlResult) {
+        try {
+            const converter = new MarkdownConverter();
+            const content = injectSourceUrl(converter.convert(htmlResult.content, { baseUrl: htmlResult.url }), htmlResult.url);
+            return {
+                ...htmlResult,
+                content,
+                contentType: "markdown",
+            };
         }
-
-
-        return
-
-
+        catch (conversionError) {
+            console.error(`HybridEngine: Markdown conversion failed for ${htmlResult.url}:`, conversionError);
+            return htmlResult;
+        }
+    }
+    _shouldAutoRender(fetchResult, forceSpaMode) {
+        if (forceSpaMode) {
             return true;
-
-        if (
+        }
+        if (isSoftBlockPage(fetchResult.content)) {
             return true;
-
+        }
+        return assessHtmlRenderNeed(fetchResult.content).renderLikelyNeeded;
     }
     async fetchHTML(url, options = {}) {
         // Determine effective SPA mode and markdown options
@@ -70,22 +77,36 @@ export class HybridEngine {
             }
         }
         try {
-
-
-
-
+            const fetchResult = await this.fetchEngine.fetchHTML(url, {
+                markdown: false,
+                headers: options.headers,
+            });
+            const httpPreferredResult = effectiveMarkdown ? this._convertHtmlToMarkdown(fetchResult) : fetchResult;
+            if (!this._shouldAutoRender(fetchResult, effectiveSpaMode)) {
+                return httpPreferredResult;
+            }
+            console.warn(`HybridEngine: HTTP fetch for ${url} looks incomplete. Attempting Playwright render.`);
+            // Skip HTTP fallback (we already know it's a shell) and use SPA rendering path for patient waits.
+            const autoRenderOptions = {
+                ...playwrightOptions,
+                useHttpFallback: false,
+                spaMode: true,
             };
-
-
-
-
-
-
-            return
+            try {
+                const playwrightResult = await this.playwrightEngine.fetchHTML(url, autoRenderOptions);
+                const staticAssessment = assessSerializedContent(httpPreferredResult.content, httpPreferredResult.contentType);
+                const renderedAssessment = assessSerializedContent(playwrightResult.content, playwrightResult.contentType);
+                if (!isRenderedContentMeaningfullyBetter(staticAssessment, renderedAssessment)) {
+                    console.warn(`HybridEngine: Playwright render for ${url} was not meaningfully better. Keeping HTTP result.`);
+                    return httpPreferredResult;
                 }
+                return playwrightResult;
+            }
+            catch (playwrightError) {
+                const pwMessage = playwrightError instanceof Error ? playwrightError.message : String(playwrightError);
+                console.warn(`HybridEngine: Playwright render failed for ${url}: ${pwMessage}. Returning HTTP result.`);
+                return httpPreferredResult;
             }
-            // If not spaMode, or if spaMode but content is not a shell, return FetchEngine's result
-            return fetchResult;
         }
         catch (fetchError) {
             // If FetchEngine returned a 404, do not attempt Playwright fallback
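The new `fetchHTML` path in this hunk follows a fixed order: fetch over HTTP, decide whether the result looks incomplete, render with Playwright only if needed, then keep the rendered page only when it is meaningfully better. The sketch below restates that order with injected stand-ins so it runs without the package. `PageResult`, `HybridDeps`, and `fetchWithAutoRender` are hypothetical names. The real heuristics live in `./utils/render-detection.js`, which is not shown in this section, so `looksIncomplete` and `isMeaningfullyBetter` are placeholders for them. The outer handler for a failed HTTP fetch is omitted.

```typescript
interface PageResult {
  content: string;
  contentType: "html" | "markdown";
  url: string;
}

// Injected stand-ins for the two engines and the render-detection helpers.
interface HybridDeps {
  fetchHttp(url: string): Promise<PageResult>;
  render(url: string): Promise<PageResult>;
  looksIncomplete(html: string): boolean;
  isMeaningfullyBetter(httpResult: PageResult, rendered: PageResult): boolean;
}

async function fetchWithAutoRender(
  url: string,
  deps: HybridDeps,
  forceSpaMode = false,
): Promise<PageResult> {
  const httpResult = await deps.fetchHttp(url);
  // Cheap path: keep the HTTP result unless rendering is forced or looks necessary.
  if (!forceSpaMode && !deps.looksIncomplete(httpResult.content)) {
    return httpResult;
  }
  try {
    const rendered = await deps.render(url);
    // Keep the rendered page only if it clearly improves on the HTTP result.
    return deps.isMeaningfullyBetter(httpResult, rendered) ? rendered : httpResult;
  } catch {
    // A failed render is not fatal here: fall back to what HTTP returned.
    return httpResult;
  }
}

// Toy wiring with made-up heuristics, purely to exercise the control flow.
const demoDeps: HybridDeps = {
  fetchHttp: async (u) => ({ content: '<div id="root"></div>', contentType: "html", url: u }),
  render: async (u) => ({ content: "<main>" + "text ".repeat(50) + "</main>", contentType: "html", url: u }),
  looksIncomplete: (html) => html.length < 100,
  isMeaningfullyBetter: (a, b) => b.content.length > a.content.length * 2,
};

fetchWithAutoRender("https://example.com/", demoDeps).then((page) => {
  console.log(page.content.length > 100 ? "kept rendered page" : "kept HTTP page");
});
```

Comparing the two results before returning means a slow render that adds nothing does not replace a perfectly usable HTTP response, which matches the "was not meaningfully better" warning in the diff.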
package/dist/HybridEngine.js.map
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"HybridEngine.js","sourceRoot":"","sources":["../src/HybridEngine.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,kBAAkB,CAAC;AACrE,OAAO,EAAE,gBAAgB,EAAE,MAAM,uBAAuB,CAAC;
|
|
1
|
+
{"version":3,"file":"HybridEngine.js","sourceRoot":"","sources":["../src/HybridEngine.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,kBAAkB,CAAC;AACrE,OAAO,EAAE,gBAAgB,EAAE,MAAM,uBAAuB,CAAC;AAEzD,OAAO,EAAE,iBAAiB,EAAE,eAAe,EAAE,MAAM,+BAA+B,CAAC;AACnF,OAAO,EACL,oBAAoB,EACpB,uBAAuB,EACvB,mCAAmC,EACnC,eAAe,GAChB,MAAM,6BAA6B,CAAC;AAUrC;;GAEG;AACH,MAAM,OAAO,YAAY;IACN,WAAW,CAAc;IACzB,gBAAgB,CAAmB;IACnC,MAAM,CAAyB,CAAC,sDAAsD;IACtF,sBAAsB,CAAsB;IAE7D,YAAY,SAAiC,EAAE;QAC7C,4CAA4C;QAC5C,uFAAuF;QACvF,IAAI,CAAC,WAAW,GAAG,IAAI,WAAW,CAAC,EAAE,QAAQ,EAAE,KAAK,EAAE,OAAO,EAAE,MAAM,CAAC,OAAO,EAAE,CAAC,CAAC;QACjF,IAAI,CAAC,gBAAgB,GAAG,IAAI,gBAAgB,CAAC,MAAM,CAAC,CAAC;QACrD,IAAI,CAAC,MAAM,GAAG,MAAM,CAAC,CAAC,0BAA0B;QAChD,IAAI,CAAC,sBAAsB,GAAG,MAAM,CAAC,sBAAsB,IAAI,EAAE,CAAC;IACpE,CAAC;IAEO,sBAAsB,CAAC,UAA2B;QACxD,IAAI,CAAC;YACH,MAAM,SAAS,GAAG,IAAI,iBAAiB,EAAE,CAAC;YAC1C,MAAM,OAAO,GAAG,eAAe,CAC7B,SAAS,CAAC,OAAO,CAAC,UAAU,CAAC,OAAO,EAAE,EAAE,OAAO,EAAE,UAAU,CAAC,GAAG,EAAE,CAAC,EAClE,UAAU,CAAC,GAAG,CACf,CAAC;YACF,OAAO;gBACL,GAAG,UAAU;gBACb,OAAO;gBACP,WAAW,EAAE,UAAU;aACxB,CAAC;QACJ,CAAC;QAAC,OAAO,eAAwB,EAAE,CAAC;YAClC,OAAO,CAAC,KAAK,CAAC,gDAAgD,UAAU,CAAC,GAAG,GAAG,EAAE,eAAe,CAAC,CAAC;YAClG,OAAO,UAAU,CAAC;QACpB,CAAC;IACH,CAAC;IAEO,iBAAiB,CAAC,WAA4B,EAAE,YAAqB;QAC3E,IAAI,YAAY,EAAE,CAAC;YACjB,OAAO,IAAI,CAAC;QACd,CAAC;QACD,IAAI,eAAe,CAAC,WAAW,CAAC,OAAO,CAAC,EAAE,CAAC;YACzC,OAAO,IAAI,CAAC;QACd,CAAC;QACD,OAAO,oBAAoB,CAAC,WAAW,CAAC,OAAO,CAAC,CAAC,kBAAkB,CAAC;IACtE,CAAC;IAED,KAAK,CAAC,SAAS,CAAC,GAAW,EAAE,UAAwB,EAAE;QACrD,oDAAoD;QACpD,gHAAgH;QAChH,MAAM,gBAAgB,GACpB,OAAO,CAAC,OAAO,KAAK,SAAS,CAAC,CAAC,CAAC,OAAO,CAAC,OAAO,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,OAAO,KAAK,SAAS,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,KAAK,CAAC;QACpH,MAAM,iBAAiB,GACrB,OAAO,CAAC,QAAQ,KAAK,SAAS;YAC5B,CAAC,CAAC,OAAO,CAAC,QAAQ;YAClB,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,QAAQ,KAAK,SAAS;gBAClC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,QAAQ;gBACtB,CAAC,CAAC,KAAK,CAAC;QAEd,yFAAyF;QACzF,mEAAmE;QACnE,MAAM,kBAAkB,GAAG,IA
AI,CAAC,MAAM,CAAC,OAAO,IAAI,EAAE,CAAC;QACrD,MAAM,sBAAsB,GAAG,OAAO,CAAC,OAAO,IAAI,EAAE,CAAC,CAAC,mEAAmE;QAEzH,8DAA8D;QAC9D,MAAM,0BAA0B,GAAG,EAAE,GAAG,kBAAkB,EAAE,GAAG,sBAAsB,EAAE,CAAC;QAExF,kEAAkE;QAClE,MAAM,iBAAiB,GAInB;YACF,GAAG,IAAI,CAAC,MAAM,EAAE,gEAAgE;YAChF,GAAG,OAAO,EAAE,yDAAyD;YACrE,OAAO,EAAE,0BAA0B,EAAE,sCAAsC;YAC3E,QAAQ,EAAE,iBAAiB;YAC3B,OAAO,EAAE,gBAAgB;SAC1B,CAAC;QAEF,qCAAqC;QACrC,KAAK,MAAM,OAAO,IAAI,IAAI,CAAC,sBAAsB,EAAE,CAAC;YAClD,IAAI,OAAO,OAAO,KAAK,QAAQ,IAAI,GAAG,CAAC,QAAQ,CAAC,OAAO,CAAC,EAAE,CAAC;gBACzD,OAAO,CAAC,IAAI,CAAC,qBAAqB,GAAG,4BAA4B,OAAO,qCAAqC,CAAC,CAAC;gBAC/G,OAAO,IAAI,CAAC,gBAAgB,CAAC,SAAS,CAAC,GAAG,EAAE,iBAAiB,CAAC,CAAC;YACjE,CAAC;iBAAM,IAAI,OAAO,YAAY,MAAM,IAAI,OAAO,CAAC,IAAI,CAAC,GAAG,CAAC,EAAE,CAAC;gBAC1D,OAAO,CAAC,IAAI,CACV,qBAAqB,GAAG,2BAA2B,OAAO,CAAC,QAAQ,EAAE,qCAAqC,CAC3G,CAAC;gBACF,OAAO,IAAI,CAAC,gBAAgB,CAAC,SAAS,CAAC,GAAG,EAAE,iBAAiB,CAAC,CAAC;YACjE,CAAC;QACH,CAAC;QAED,IAAI,CAAC;YACH,MAAM,WAAW,GAAG,MAAM,IAAI,CAAC,WAAW,CAAC,SAAS,CAAC,GAAG,EAAE;gBACxD,QAAQ,EAAE,KAAK;gBACf,OAAO,EAAE,OAAO,CAAC,OAAO;aACzB,CAAC,CAAC;YACH,MAAM,mBAAmB,GAAG,iBAAiB,CAAC,CAAC,CAAC,IAAI,CAAC,sBAAsB,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,WAAW,CAAC;YAEvG,IAAI,CAAC,IAAI,CAAC,iBAAiB,CAAC,WAAW,EAAE,gBAAgB,CAAC,EAAE,CAAC;gBAC3D,OAAO,mBAAmB,CAAC;YAC7B,CAAC;YAED,OAAO,CAAC,IAAI,CAAC,gCAAgC,GAAG,kDAAkD,CAAC,CAAC;YAEpG,kGAAkG;YAClG,MAAM,iBAAiB,GAAG;gBACxB,GAAG,iBAAiB;gBACpB,eAAe,EAAE,KAAK;gBACtB,OAAO,EAAE,IAAI;aACd,CAAC;YAEF,IAAI,CAAC;gBACH,MAAM,gBAAgB,GAAG,MAAM,IAAI,CAAC,gBAAgB,CAAC,SAAS,CAAC,GAAG,EAAE,iBAAiB,CAAC,CAAC;gBACvF,MAAM,gBAAgB,GAAG,uBAAuB,CAAC,mBAAmB,CAAC,OAAO,EAAE,mBAAmB,CAAC,WAAW,CAAC,CAAC;gBAC/G,MAAM,kBAAkB,GAAG,uBAAuB,CAAC,gBAAgB,CAAC,OAAO,EAAE,gBAAgB,CAAC,WAAW,CAAC,CAAC;gBAE3G,IAAI,CAAC,mCAAmC,CAAC,gBAAgB,EAAE,kBAAkB,CAAC,EAAE,CAAC;oBAC/E,OAAO,CAAC,IAAI,CAAC,uCAAuC,GAAG,oDAAoD,CAAC,CAAC;oBAC7G,OAAO,mBAAmB,CAAC;gBAC7B,CAAC;gBAED,OAAO,gBAAgB,CAAC;YAC1B,CAAC;YAAC,OAAO,eAAwB,EAAE,CAAC;gBAClC,MAAM,SAAS,GAAG,eAAe,YAAY,KAAK,CAAC,CAAC,CAAC,eAAe,CAAC,OAAO,CAAC,CAAC,CAAC
,MAAM,CAAC,eAAe,CAAC,CAAC;gBACvG,OAAO,CAAC,IAAI,CAAC,8CAA8C,GAAG,KAAK,SAAS,0BAA0B,CAAC,CAAC;gBACxG,OAAO,mBAAmB,CAAC;YAC7B,CAAC;QACH,CAAC;QAAC,OAAO,UAAmB,EAAE,CAAC;YAC7B,oEAAoE;YACpE,IAAI,UAAU,YAAY,oBAAoB,IAAI,UAAU,CAAC,UAAU,KAAK,GAAG,EAAE,CAAC;gBAChF,OAAO,CAAC,IAAI,CAAC,8CAA8C,GAAG,qBAAqB,CAAC,CAAC;gBACrF,MAAM,UAAU,CAAC;YACnB,CAAC;YACD,MAAM,OAAO,GAAG,UAAU,YAAY,KAAK,CAAC,CAAC,CAAC,UAAU,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,UAAU,CAAC,CAAC;YACtF,OAAO,CAAC,IAAI,CAAC,wCAAwC,GAAG,KAAK,OAAO,qCAAqC,CAAC,CAAC;YAC3G,IAAI,CAAC;gBACH,yEAAyE;gBACzE,MAAM,gBAAgB,GAAG,MAAM,IAAI,CAAC,gBAAgB,CAAC,SAAS,CAAC,GAAG,EAAE,iBAAiB,CAAC,CAAC;gBACvF,OAAO,gBAAgB,CAAC;YAC1B,CAAC;YAAC,OAAO,eAAwB,EAAE,CAAC;gBAClC,MAAM,SAAS,GAAG,eAAe,YAAY,KAAK,CAAC,CAAC,CAAC,eAAe,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,eAAe,CAAC,CAAC;gBACvG,OAAO,CAAC,KAAK,CAAC,2DAA2D,GAAG,KAAK,SAAS,EAAE,CAAC,CAAC;gBAC9F,MAAM,eAAe,CAAC,CAAC,8DAA8D;YACvF,CAAC;QACH,CAAC;IACH,CAAC;IAED;;;;;;;;;OASG;IACH,KAAK,CAAC,YAAY,CAAC,GAAW,EAAE,UAA+B,EAAE;QAC/D,qCAAqC;QACrC,KAAK,MAAM,OAAO,IAAI,IAAI,CAAC,sBAAsB,EAAE,CAAC;YAClD,IAAI,OAAO,OAAO,KAAK,QAAQ,IAAI,GAAG,CAAC,QAAQ,CAAC,OAAO,CAAC,EAAE,CAAC;gBACzD,OAAO,CAAC,IAAI,CACV,qBAAqB,GAAG,4BAA4B,OAAO,uDAAuD,CACnH,CAAC;gBACF,OAAO,IAAI,CAAC,gBAAgB,CAAC,YAAY,CAAC,GAAG,EAAE,OAAO,CAAC,CAAC;YAC1D,CAAC;iBAAM,IAAI,OAAO,YAAY,MAAM,IAAI,OAAO,CAAC,IAAI,CAAC,GAAG,CAAC,EAAE,CAAC;gBAC1D,OAAO,CAAC,IAAI,CACV,qBAAqB,GAAG,2BAA2B,OAAO,CAAC,QAAQ,EAAE,uDAAuD,CAC7H,CAAC;gBACF,OAAO,IAAI,CAAC,gBAAgB,CAAC,YAAY,CAAC,GAAG,EAAE,OAAO,CAAC,CAAC;YAC1D,CAAC;QACH,CAAC;QAED,IAAI,CAAC;YACH,wBAAwB;YACxB,MAAM,WAAW,GAAG,MAAM,IAAI,CAAC,WAAW,CAAC,YAAY,CAAC,GAAG,EAAE,OAAO,CAAC,CAAC;YACtE,OAAO,WAAW,CAAC;QACrB,CAAC;QAAC,OAAO,UAAmB,EAAE,CAAC;YAC7B,oEAAoE;YACpE,IAAI,UAAU,YAAY,oBAAoB,IAAI,UAAU,CAAC,UAAU,KAAK,GAAG,EAAE,CAAC;gBAChF,OAAO,CAAC,IAAI,CAAC,4DAA4D,GAAG,qBAAqB,CAAC,CAAC;gBACnG,MAAM,UAAU,CAAC;YACnB,CAAC;YACD,MAAM,OAAO,GAAG,UAAU,YAAY,KAAK,CAAC,CAAC,CAAC,UAAU,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,UAAU,CAAC,CAAC;YACtF,OAAO,CAAC,IAAI,CACV,sDAAsD,GAAG,KAAK,O
AAO,qCAAqC,CAC3G,CAAC;YACF,IAAI,CAAC;gBACH,+BAA+B;gBAC/B,MAAM,gBAAgB,GAAG,MAAM,IAAI,CAAC,gBAAgB,CAAC,YAAY,CAAC,GAAG,EAAE,OAAO,CAAC,CAAC;gBAChF,OAAO,gBAAgB,CAAC;YAC1B,CAAC;YAAC,OAAO,eAAwB,EAAE,CAAC;gBAClC,MAAM,SAAS,GAAG,eAAe,YAAY,KAAK,CAAC,CAAC,CAAC,eAAe,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,eAAe,CAAC,CAAC;gBACvG,OAAO,CAAC,KAAK,CAAC,yEAAyE,GAAG,KAAK,SAAS,EAAE,CAAC,CAAC;gBAC5G,MAAM,eAAe,CAAC,CAAC,8DAA8D;YACvF,CAAC;QACH,CAAC;IACH,CAAC;IAED;;OAEG;IACH,UAAU;QACR,OAAO,IAAI,CAAC,gBAAgB,CAAC,UAAU,EAAE,CAAC;IAC5C,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,OAAO;QACX,MAAM,OAAO,CAAC,UAAU,CAAC;YACvB,IAAI,CAAC,WAAW,CAAC,OAAO,EAAE,EAAE,yCAAyC;YACrE,IAAI,CAAC,gBAAgB,CAAC,OAAO,EAAE;SAChC,CAAC,CAAC;IACL,CAAC;CACF"}
@@ -8,6 +8,9 @@ import type { IEngine } from "./IEngine.js";
  * Features include caching, retries, HTTP fallback, and configurable browser pooling.
  */
 export declare class PlaywrightEngine implements IEngine {
+    private static readonly AUTO_RENDER_POLL_MS;
+    private static readonly AUTO_RENDER_QUIET_WINDOW_MS;
+    private static readonly AUTO_RENDER_MAX_WAIT_MS;
     private browserPool;
     private readonly queue;
     private readonly cache;
@@ -41,6 +44,9 @@ export declare class PlaywrightEngine implements IEngine {
      * Simulate human-like interactions on the page.
      */
     private simulateHumanBehavior;
+    private captureRenderedDomSnapshot;
+    private shouldAutoWaitForRenderedDom;
+    private waitForRenderedDomIfNeeded;
     /**
      * Adds a result to the in-memory cache.
      */
@@ -102,7 +108,6 @@ export declare class PlaywrightEngine implements IEngine {
      */
     private fetchWithPlaywright;
     private applyBlockingRules;
-    private _injectSourceUnderH1;
     /**
      * Cleans up resources used by the engine, primarily closing browser instances in the pool.
      *
@@ -1 +1 @@
|
|
|
1
|
-
{"version":3,"file":"PlaywrightEngine.d.ts","sourceRoot":"","sources":["../src/PlaywrightEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EACV,eAAe,EACf,kBAAkB,EAClB,mBAAmB,EACnB,cAAc,EACd,sBAAsB,EACtB,YAAY,EACb,MAAM,YAAY,CAAC;AACpB,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;
|
|
1
|
+
{"version":3,"file":"PlaywrightEngine.d.ts","sourceRoot":"","sources":["../src/PlaywrightEngine.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EACV,eAAe,EACf,kBAAkB,EAClB,mBAAmB,EACnB,cAAc,EACd,sBAAsB,EACtB,YAAY,EACb,MAAM,YAAY,CAAC;AACpB,OAAO,KAAK,EAAE,OAAO,EAAE,MAAM,cAAc,CAAC;AA2D5C;;;;;;GAMG;AACH,qBAAa,gBAAiB,YAAW,OAAO;IAC9C,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,mBAAmB,CAAO;IAClD,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,2BAA2B,CAAO;IAC1D,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,uBAAuB,CAAQ;IAEvD,OAAO,CAAC,WAAW,CAAsC;IACzD,OAAO,CAAC,QAAQ,CAAC,KAAK,CAAS;IAC/B,OAAO,CAAC,QAAQ,CAAC,KAAK,CAAsC;IAC5D,OAAO,CAAC,QAAQ,CAAC,MAAM,CAAiC;IAGxD,OAAO,CAAC,uBAAuB,CAAkB;IACjD,OAAO,CAAC,iBAAiB,CAAkB;IAC3C,OAAO,CAAC,mBAAmB,CAA0B;IAGrD,OAAO,CAAC,MAAM,CAAC,QAAQ,CAAC,cAAc,CAuBpC;IAEF;;;;;OAKG;gBACS,MAAM,GAAE,sBAA2B;IAM/C;;OAEG;YACW,qBAAqB;IAwCnC;;;OAGG;YACW,yBAAyB;IA0EvC,OAAO,CAAC,UAAU;IAalB;;OAEG;YACW,WAAW;IAazB;;OAEG;YACW,qBAAqB;YA2CrB,0BAA0B;IAqDxC,OAAO,CAAC,4BAA4B;YAUtB,0BAA0B;IA8FxC;;OAEG;IACH,OAAO,CAAC,UAAU;IAUlB;;;;;;;;;OASG;IACG,SAAS,CACb,GAAG,EAAE,MAAM,EACX,OAAO,GAAE,YAAY,GAAG;QAAE,QAAQ,CAAC,EAAE,OAAO,CAAC;QAAC,OAAO,CAAC,EAAE,OAAO,CAAA;KAAO,GACrE,OAAO,CAAC,eAAe,CAAC;IAoB3B;;;;;;;OAOG;IACH,OAAO,CAAC,iBAAiB;IAqDzB;;;;;;;OAOG;YACW,oBAAoB;IAmClC;;;;;;;;OAQG;YACW,6BAA6B;IAmC3C;;;;;;;OAOG;YACW,eAAe;IAkH7B;;;OAGG;YACW,mBAAmB;YA6KnB,kBAAkB;IAyChC;;;;;OAKG;IACG,OAAO,IAAI,OAAO,CAAC,IAAI,CAAC;IAoB9B;;;OAGG;IACH,UAAU,IAAI,cAAc,EAAE;IAQ9B,OAAO,CAAC,mBAAmB;IAU3B;;;;;;;;OAQG;IACG,YAAY,CAAC,GAAG,EAAE,MAAM,EAAE,OAAO,GAAE,mBAAwB,GAAG,OAAO,CAAC,kBAAkB,CAAC;IA0C/F;;OAEG;IACH,OAAO,CAAC,iBAAiB;IAkBzB;;OAEG;IACH,OAAO,CAAC,iBAAiB;IAoBzB;;OAEG;YACW,sBAAsB;IAyDpC;;OAEG;YACW,2BAA2B;IA4DzC;;OAEG;YACW,0BAA0B;CA2FzC"}
|