@isdk/web-fetcher 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. package/README.action.cn.md +469 -0
  2. package/README.action.md +452 -0
  3. package/README.cn.md +147 -0
  4. package/README.engine.cn.md +262 -0
  5. package/README.engine.md +262 -0
  6. package/README.md +147 -0
  7. package/dist/index.d.mts +1603 -0
  8. package/dist/index.d.ts +1603 -0
  9. package/dist/index.js +1 -0
  10. package/dist/index.mjs +1 -0
  11. package/docs/README.md +151 -0
  12. package/docs/_media/LICENSE-MIT +22 -0
  13. package/docs/_media/README.action.md +452 -0
  14. package/docs/_media/README.cn.md +147 -0
  15. package/docs/_media/README.engine.md +262 -0
  16. package/docs/classes/CheerioFetchEngine.md +1447 -0
  17. package/docs/classes/ClickAction.md +533 -0
  18. package/docs/classes/ExtractAction.md +533 -0
  19. package/docs/classes/FetchAction.md +444 -0
  20. package/docs/classes/FetchEngine.md +1230 -0
  21. package/docs/classes/FetchSession.md +111 -0
  22. package/docs/classes/FillAction.md +533 -0
  23. package/docs/classes/GetContentAction.md +533 -0
  24. package/docs/classes/GotoAction.md +537 -0
  25. package/docs/classes/PauseAction.md +533 -0
  26. package/docs/classes/PlaywrightFetchEngine.md +1437 -0
  27. package/docs/classes/SubmitAction.md +533 -0
  28. package/docs/classes/WaitForAction.md +533 -0
  29. package/docs/classes/WebFetcher.md +85 -0
  30. package/docs/enumerations/FetchActionResultStatus.md +40 -0
  31. package/docs/functions/fetchWeb.md +43 -0
  32. package/docs/globals.md +72 -0
  33. package/docs/interfaces/BaseFetchActionProperties.md +83 -0
  34. package/docs/interfaces/BaseFetchCollectorActionProperties.md +145 -0
  35. package/docs/interfaces/BaseFetcherProperties.md +206 -0
  36. package/docs/interfaces/Cookie.md +142 -0
  37. package/docs/interfaces/DispatchedEngineAction.md +60 -0
  38. package/docs/interfaces/ExtractActionProperties.md +113 -0
  39. package/docs/interfaces/FetchActionInContext.md +149 -0
  40. package/docs/interfaces/FetchActionProperties.md +125 -0
  41. package/docs/interfaces/FetchActionResult.md +55 -0
  42. package/docs/interfaces/FetchContext.md +424 -0
  43. package/docs/interfaces/FetchEngineContext.md +328 -0
  44. package/docs/interfaces/FetchMetadata.md +73 -0
  45. package/docs/interfaces/FetchResponse.md +105 -0
  46. package/docs/interfaces/FetchReturnTypeRegistry.md +57 -0
  47. package/docs/interfaces/FetchSite.md +320 -0
  48. package/docs/interfaces/FetcherOptions.md +300 -0
  49. package/docs/interfaces/GotoActionOptions.md +66 -0
  50. package/docs/interfaces/PendingEngineRequest.md +51 -0
  51. package/docs/interfaces/SubmitActionOptions.md +23 -0
  52. package/docs/interfaces/WaitForActionOptions.md +39 -0
  53. package/docs/type-aliases/BaseFetchActionOptions.md +11 -0
  54. package/docs/type-aliases/BaseFetchCollectorOptions.md +11 -0
  55. package/docs/type-aliases/BrowserEngine.md +11 -0
  56. package/docs/type-aliases/FetchActionCapabilities.md +11 -0
  57. package/docs/type-aliases/FetchActionCapabilityMode.md +11 -0
  58. package/docs/type-aliases/FetchActionOptions.md +11 -0
  59. package/docs/type-aliases/FetchEngineAction.md +18 -0
  60. package/docs/type-aliases/FetchEngineType.md +11 -0
  61. package/docs/type-aliases/FetchReturnType.md +11 -0
  62. package/docs/type-aliases/FetchReturnTypeFor.md +17 -0
  63. package/docs/type-aliases/OnFetchPauseCallback.md +23 -0
  64. package/docs/type-aliases/ResourceType.md +11 -0
  65. package/docs/variables/DefaultFetcherProperties.md +11 -0
  66. package/package.json +90 -0
# ⚙️ Fetch Engine Architecture

English | [简体中文](./README.engine.cn.md)

> This document provides a comprehensive overview of the Fetch Engine architecture, which abstracts web content fetching and interaction. It is intended for developers who need to understand, maintain, or extend the fetching capabilities.

---

## 🎯 1. Overview

The `engine` directory contains the core logic for the web fetcher. Its primary responsibility is to provide a unified interface for interacting with web pages, regardless of the underlying technology (e.g., simple HTTP requests or a full-fledged browser). This is achieved through an abstract `FetchEngine` class and concrete implementations that leverage different crawling technologies.

> ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library, using its powerful crawler abstractions.

---

## 🧩 2. Core Concepts

### `FetchEngine` (base.ts)

This is the abstract base class that defines the contract for all fetch engines.

* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction.
* **Key Abstractions**:
  * **Lifecycle**: `initialize()` and `cleanup()` methods.
  * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
  * **Configuration**: `headers()`, `cookies()`, `blockResources()`.
* **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing an engine to be selected dynamically by `id` or `mode`.

### `FetchEngine.create(context, options)`

This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.

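The registry-plus-factory idea can be sketched in isolation as follows. This is a minimal, self-contained illustration with simplified names; the real `FetchEngine` registry also handles context, options, and engine initialization.

```typescript
// Minimal sketch of the static registry + factory pattern described above.
// All names other than register/create/id/mode are illustrative.
type EngineClass = { id: string; mode: string; new (): object };

class EngineRegistry {
  private static engines: EngineClass[] = [];

  // Implementations self-register once, at module load time.
  static register(engine: EngineClass): void {
    EngineRegistry.engines.push(engine);
  }

  // The factory picks an implementation by `id`, or falls back to `mode`.
  static create(options: { id?: string; mode?: string }): object {
    const found = EngineRegistry.engines.find((e) =>
      options.id ? e.id === options.id : e.mode === options.mode,
    );
    if (!found) throw new Error('No matching fetch engine registered');
    return new found();
  }
}

class HttpLikeEngine { static id = 'cheerio'; static mode = 'http'; }
class BrowserLikeEngine { static id = 'playwright'; static mode = 'browser'; }
EngineRegistry.register(HttpLikeEngine);
EngineRegistry.register(BrowserLikeEngine);
```

With this shape, a caller never names a concrete class; it only asks for an `id` or a `mode` and receives whichever registered implementation matches.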
---

## 🏗️ 3. Architecture and Workflow

The engine's architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.

### The Core Problem: Ephemeral Page Context

In Crawlee, the page context (the `page` object in Playwright or the `$` function in Cheerio) is **only available within the scope of the `requestHandler` function**. Once the handler returns, the context is lost. This makes it impossible to directly implement a sequence like `await engine.goto(); await engine.click();`.

### The Solution: Bridging Scopes with an Action Loop

Our engine solves this by creating a bridge between the external API calls and the internal `requestHandler` scope. This is the most critical part of the design.

**The workflow is as follows:**

1. **Initialization**: A consumer calls `FetchEngine.create()`, which initializes a Crawlee crawler (e.g., `PlaywrightCrawler`) that runs in the background.
2. **Navigation (`goto`)**: The consumer calls `await engine.goto(url)`. This adds the URL to Crawlee's `RequestQueue` and returns a `Promise` that resolves when the page is loaded.
3. **Crawlee Processing**: The background crawler picks up the request and invokes the engine's `requestHandler`, passing it the crucial page context.
4. **Page Activation & Action Loop**: Inside the `requestHandler`:
   * The page context is used to resolve the `Promise` from the `goto()` call.
   * The page is marked as "active" (`isPageActive = true`).
   * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). This loop effectively **pauses the `requestHandler`** by listening for events on an `EventEmitter`, keeping the page context alive.
5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, receives the event. Because it has access to the page context, it can perform the *actual* interaction (e.g., `page.click(...)`).
7. **Cleanup**: The loop continues until a `dispose` action is dispatched (e.g., by a new `goto()` call), which terminates the loop and allows the `requestHandler` to finally complete.

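The scope-bridging pattern in steps 4–6 can be sketched in isolation. This is a simplified, self-contained illustration; the real loop in `_executePendingActions` additionally handles timeouts, errors, and typed actions.

```typescript
import { EventEmitter } from 'node:events';

interface DispatchedAction {
  name: string;
  resolve: (value: unknown) => void;
}

class ActionBridge {
  private emitter = new EventEmitter();

  // Consumer side: dispatch an action and get back a Promise for its result.
  dispatch(name: string): Promise<unknown> {
    return new Promise((resolve) => {
      this.emitter.emit('action', { name, resolve });
    });
  }

  // requestHandler side: loop until disposed, so the handler (and with it
  // the page context) stays alive while actions arrive from outside.
  async run(execute: (name: string) => unknown): Promise<void> {
    for (;;) {
      const action = await new Promise<DispatchedAction>((next) =>
        this.emitter.once('action', (a: DispatchedAction) => next(a)),
      );
      if (action.name === 'dispose') {
        action.resolve(undefined);
        return; // lets the requestHandler finally complete
      }
      // The page context is still in scope here, so the real engine can
      // perform e.g. page.click(...) at this point.
      action.resolve(execute(action.name));
    }
  }
}
```

In the real engine, `run()` would be invoked at the end of the `requestHandler`, while `dispatch()` backs the public `click()`/`fill()` methods.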
---

## 🛠️ 4. Implementations

There are two primary engine implementations:

### `CheerioFetchEngine` (http mode)

* **ID**: `cheerio`
* **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP and parse static HTML.
* **Behavior**:
  * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
  * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests.
* **Use Case**: Scraping static websites, server-rendered pages, or APIs.

### `PlaywrightFetchEngine` (browser mode)

* **ID**: `playwright`
* **Mechanism**: Uses `PlaywrightCrawler` to control a real headless browser.
* **Behavior**:
  * ✅ **Full Browser Environment**: Executes JavaScript and natively handles cookies, sessions, and complex user interactions.
  * ✅ **Robust Interaction**: Actions accurately mimic real user behavior.
  * ⚠️ **Resource Intensive**: Slower and requires more memory/CPU.
* **Use Case**: Interacting with modern, dynamic web applications (SPAs) or any site that relies heavily on JavaScript.

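In http mode, a "click" on a link cannot execute any JavaScript; conceptually it reduces to resolving the target's `href` against the current page URL and enqueueing the result as a new HTTP request. A hypothetical helper (not the engine's actual code) illustrates the idea:

```typescript
// Hypothetical sketch: how an http-mode engine can "click" a link —
// resolve the href against the current page URL and enqueue a new request.
function simulateLinkClick(currentUrl: string, href: string, queue: string[]): string {
  const target = new URL(href, currentUrl).toString(); // handles relative hrefs
  queue.push(target); // the real engine enqueues into Crawlee's RequestQueue
  return target;
}
```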
#### Anti-Bot Evasion (`antibot` option)

To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an `antibot` mode. When enabled, it integrates [Camoufox](https://github.com/prescience-data/camoufox) to enhance its stealth capabilities.

* **Mechanism**:
  * Uses a hardened Firefox browser via `camoufox-js`.
  * Disables the default fingerprint spoofing so that Camoufox manages the browser's fingerprint.
  * Automatically attempts to solve Cloudflare challenges encountered during navigation.
* **How to enable**: Set the `antibot: true` option when creating the fetcher properties.
* **Use Case**: Scraping websites protected by services like Cloudflare or other advanced bot-detection systems.
* **Note**: This feature requires additional dependencies (`camoufox-js`, `firefox`) and carries a performance overhead.

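As a configuration fragment, enabling the mode might look like this. Only `antibot` itself is confirmed by this document; the surrounding property names are illustrative.

```typescript
// Illustrative fetcher properties — `antibot: true` switches the
// Playwright engine to the Camoufox-hardened Firefox browser.
const fetcherProperties = {
  engine: 'playwright', // illustrative: select the browser engine
  antibot: true,
};
```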
---

## 📊 5. Data Extraction with `extract()`

The `extract()` method provides a powerful, declarative way to pull structured data from a web page. It uses a **schema** to define the structure of your desired JSON output, and the engine automatically handles DOM traversal and data extraction.

### Core Design: Schema Normalization

To enhance usability and flexibility, the `extract` method internally implements a **normalization** layer. This means you can provide semantically clear shorthands, and the engine will automatically convert them into a standard, more verbose internal format before execution. This makes writing complex extraction rules simple and intuitive.

### Schema Structure

A schema can be one of three types:

* **Value Extraction (`ExtractValueSchema`)**: Extracts a single value.
  * `selector`: (Optional) A CSS selector to locate the element.
  * `type`: (Optional) The type of the extracted value: `'string'` (default), `'number'`, `'boolean'`, or `'html'` (extracts inner HTML).
  * `attribute`: (Optional) If provided, extracts the value of the specified attribute (e.g., `href`) instead of the element's text content.
* **Object Extraction (`ExtractObjectSchema`)**: Extracts a JSON object.
  * `selector`: (Optional) The root element for the object's data.
  * `properties`: Defines sub-extraction rules for each field of the object.
* **Array Extraction (`ExtractArraySchema`)**: Extracts an array.
  * `selector`: A CSS selector matching each item element in the list.
  * `items`: (Optional) The extraction rule to apply to each item element. **If omitted, it defaults to extracting the element's text content.**

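The three schema types compose naturally. The following sketch (selectors and field names are illustrative) describes a single object whose fields exercise the value options above:

```typescript
// Illustrative extraction schema combining the value options:
// default text content, an attribute, a typed number, and inner HTML.
const articleSchema = {
  type: 'object',
  selector: 'article.main',
  properties: {
    title: { selector: 'h1' }, // text content, type 'string' by default
    canonical: { selector: 'link[rel=canonical]', attribute: 'href' },
    commentCount: { selector: '.comment-count', type: 'number' },
    body: { selector: '.content', type: 'html' }, // inner HTML
  },
};
```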
### Advanced Features (via Normalization)

#### 1. Shorthand Syntax

To simplify common scenarios, the following shorthands are supported:

* **Extracting an Array of Attributes**: You can use the `attribute` property directly on an `array`-type schema as a shorthand for `items: { attribute: '...' }`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "attribute": "href"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "items": { "attribute": "href" }
  }
  ```

* **Extracting an Array of Texts**: Simply provide the selector and omit `items`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".tags li"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".tags li",
    "items": { "type": "string" }
  }
  ```

#### 2. Precise Filtering: `has` and `exclude`

You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.

* `has`: A CSS selector; the selected element **must contain** a descendant matching it (similar to `:has()`).
* `exclude`: A CSS selector; elements matching it are **excluded** from the results (similar to `:not()`).

### Complete Example

Suppose we have the following HTML, and we want to extract the titles and links of all "important" articles while excluding "archived" ones.

```html
<div id="articles">
  <div class="post important">
    <a href="/post/1"><h3>Post 1</h3></a>
  </div>
  <div class="post">
    <!-- This one is NOT important; it lacks an h3 -->
    <a href="/post/2">Post 2</a>
  </div>
  <div class="post important archived">
    <!-- This one is important but archived -->
    <a href="/post/3"><h3>Archived Post 3</h3></a>
  </div>
</div>
```

We can use the following schema:

```typescript
const schema = {
  type: 'array',
  selector: '.post',    // 1. Select all posts
  has: 'h3',            // 2. Only keep posts that contain an <h3> (important)
  exclude: '.archived', // 3. Exclude posts with the .archived class
  items: {
    type: 'object',
    properties: {
      title: { selector: 'h3' },
      link: { selector: 'a', attribute: 'href' },
    },
  },
};

const data = await engine.extract(schema);

/*
Expected output:
[
  {
    "title": "Post 1",
    "link": "/post/1"
  }
]
*/
```

---

## 🧑‍💻 6. How to Extend the Engine

Adding a new fetch engine is straightforward:

1. **Create the Class**: Define a new class that extends the generic `FetchEngine`, providing the specific `Context`, `Crawler`, and `Options` types from Crawlee.

   ```typescript
   import { PlaywrightCrawler, PlaywrightCrawlerOptions, PlaywrightCrawlingContext } from 'crawlee';

   class MyPlaywrightEngine extends FetchEngine<
     PlaywrightCrawlingContext,
     PlaywrightCrawler,
     PlaywrightCrawlerOptions
   > {
     // ...
   }
   ```

2. **Define Static Properties**: Set the unique `id` and `mode`.
3. **Implement Abstract Methods**: Provide concrete implementations for the abstract methods of the base class:
   * `_getSpecificCrawlerOptions()`: Return an object with crawler-specific options (e.g., `headless` mode, `preNavigationHooks`).
   * `_createCrawler()`: Return a new instance of your crawler (e.g., `new PlaywrightCrawler(options)`).
   * `buildResponse()`: Convert the crawling context into a standard `FetchResponse`.
   * `executeAction()`: Handle the engine-specific implementation of actions such as `click`, `fill`, etc.
4. **Register the Engine**: Call `FetchEngine.register(MyNewEngine)`.

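Putting the steps together, a skeleton might look like this. It is a sketch against a stand-in base class so it stays self-contained; the real abstract method signatures live on `FetchEngine` and use Crawlee's context and option types.

```typescript
// Stand-in base class so the sketch is self-contained; a real engine
// extends FetchEngine from this package instead.
abstract class FetchEngineBase {
  static id = '';
  static mode = '';
  protected abstract _getSpecificCrawlerOptions(): object;
  protected abstract _createCrawler(options: object): object;
}

class MyNewEngine extends FetchEngineBase {
  static id = 'my-engine'; // step 2: unique id
  static mode = 'browser'; // step 2: mode

  protected _getSpecificCrawlerOptions(): object {
    return { headless: true }; // step 3: crawler-specific options
  }

  protected _createCrawler(options: object): object {
    return { options }; // step 3: e.g. new PlaywrightCrawler(options)
  }

  // buildResponse() and executeAction() are omitted for brevity.
}

// Step 4 in the real package: FetchEngine.register(MyNewEngine);
```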
---

## ✅ 7. Testing

The file `engine.spec.ts` contains a comprehensive test suite. The `engineTestSuite` function defines a standard set of tests that is run against both `CheerioFetchEngine` and `PlaywrightFetchEngine` to ensure they are functionally equivalent and conform to the `FetchEngine` API contract.