@isdk/web-fetcher 0.2.12 → 0.3.1
- package/README.action.cn.md +197 -155
- package/README.action.extract.cn.md +263 -0
- package/README.action.extract.md +263 -0
- package/README.action.md +202 -147
- package/README.cn.md +25 -15
- package/README.engine.cn.md +118 -14
- package/README.engine.md +115 -14
- package/README.md +19 -10
- package/dist/index.d.mts +667 -50
- package/dist/index.d.ts +667 -50
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +19 -10
- package/docs/_media/README.action.md +202 -147
- package/docs/_media/README.cn.md +25 -15
- package/docs/_media/README.engine.md +115 -14
- package/docs/classes/CheerioFetchEngine.md +805 -135
- package/docs/classes/ClickAction.md +33 -33
- package/docs/classes/EvaluateAction.md +559 -0
- package/docs/classes/ExtractAction.md +33 -33
- package/docs/classes/FetchAction.md +39 -33
- package/docs/classes/FetchEngine.md +660 -122
- package/docs/classes/FetchSession.md +38 -16
- package/docs/classes/FillAction.md +33 -33
- package/docs/classes/GetContentAction.md +33 -33
- package/docs/classes/GotoAction.md +33 -33
- package/docs/classes/KeyboardPressAction.md +533 -0
- package/docs/classes/KeyboardTypeAction.md +533 -0
- package/docs/classes/MouseClickAction.md +533 -0
- package/docs/classes/MouseMoveAction.md +533 -0
- package/docs/classes/PauseAction.md +33 -33
- package/docs/classes/PlaywrightFetchEngine.md +820 -122
- package/docs/classes/SubmitAction.md +33 -33
- package/docs/classes/TrimAction.md +533 -0
- package/docs/classes/WaitForAction.md +33 -33
- package/docs/classes/WebFetcher.md +9 -9
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +6 -6
- package/docs/globals.md +14 -0
- package/docs/interfaces/BaseFetchActionProperties.md +12 -12
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
- package/docs/interfaces/BaseFetcherProperties.md +32 -28
- package/docs/interfaces/Cookie.md +14 -14
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +81 -0
- package/docs/interfaces/ExtractActionProperties.md +12 -12
- package/docs/interfaces/FetchActionInContext.md +15 -15
- package/docs/interfaces/FetchActionProperties.md +13 -13
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +42 -38
- package/docs/interfaces/FetchEngineContext.md +37 -33
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
- package/docs/interfaces/FetchSite.md +35 -31
- package/docs/interfaces/FetcherOptions.md +34 -30
- package/docs/interfaces/GotoActionOptions.md +14 -6
- package/docs/interfaces/KeyboardPressParams.md +25 -0
- package/docs/interfaces/KeyboardTypeParams.md +25 -0
- package/docs/interfaces/MouseClickParams.md +49 -0
- package/docs/interfaces/MouseMoveParams.md +41 -0
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +27 -0
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +2 -2
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +13 -0
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +11 -0
- package/package.json +11 -11
@@ -12,6 +12,14 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 
 > ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library, using its powerful crawler abstractions.
 
+### Debug Mode & Tracing
+
+When the `debug: true` option is enabled, the engine provides detailed tracing of its internal operations:
+
+1. **Detailed Tracing**: Every major step (request processing, element selection, data extraction) is logged to the console with a `[FetchEngine:id]` prefix.
+2. **Extraction Insights**: During `extract()`, the engine logs how many elements were matched for each selector and the specific values being extracted, making it easier to debug complex or misaligned schemas.
+3. **Metadata**: The `FetchResponse` will include an enriched `metadata` object containing engine details, timing metrics (where available), and proxy information.
+
 ---
 
 ## 🧩 2. Core Concepts
@@ -20,13 +28,23 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 
 This is the abstract base class that defines the contract for all fetch engines.
 
-* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction.
+* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction. It acts as the central **Action Dispatcher**, handling engine-agnostic logic (like `extract`, `pause`, `getContent`) and delegating others.
 * **Key Abstractions**:
   * **Lifecycle**: `initialize()` and `cleanup()` methods.
   * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
+  * **DOM Primitives**: `_querySelectorAll()`, `_extractValue()`, `_parentElement()`, `_nextSiblingsUntil()`.
   * **Configuration & State**: `headers()`, `cookies()`, `blockResources()`, `getState()`, `sessionPoolOptions`.
 * **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing for dynamic selection by `id` or `mode`.
 
+### `FetchElementScope`
+
+The **`FetchElementScope`** is the engine-specific "handle" or "context" for a DOM element.
+
+- In **`CheerioFetchEngine`**, it is a combination of the Cheerio API (`$`) and the current selection (`el`).
+- In **`PlaywrightFetchEngine`**, it is a `Locator`.
+
+All extraction and interaction primitives operate on this scope, ensuring a unified way to reference elements across different underlying technologies.
+
 ### `FetchEngine.create(context, options)`
 
 This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
@@ -69,14 +87,14 @@ await session.executeAll([
 The engine supports persisting and restoring session state (primarily cookies) between executions.
 
 * **Flexible Session Isolation & Storage**: The library provides fine-grained control over how session data is stored and isolated via the `storage` configuration:
-
-
-
-
-
-
-
-
+  * **`id`**: A custom string to identify the storage.
+    * **Isolation (Default)**: If omitted, each session gets a unique ID, ensuring complete isolation of `RequestQueue`, `KeyValueStore`, and `SessionPool`.
+    * **Sharing**: Providing the same `id` across sessions allows them to share the same underlying storage, useful for persistent login sessions.
+  * **`persist`**: (boolean) Whether to enable disk persistence (Crawlee's `persistStorage`). Defaults to `false` (in-memory).
+  * **`purge`**: (boolean) Whether to delete the storage (drop `RequestQueue` and `KeyValueStore`) when the session is closed. Defaults to `true`.
+    * Set `purge: false` and provide a fixed `id` to create a truly persistent session that survives across application restarts.
+  * **`config`**: Allows passing raw configuration to the underlying Crawlee instance.
+    * **Note**: When `persist` is true, use `localDataDirectory` in the config to specify the storage path (e.g., `storage: { persist: true, config: { localDataDirectory: './my-data' } }`).
 * **`sessionState`**: A comprehensive state object (derived from Crawlee's SessionPool) that can be used to fully restore a previous session. This state is **automatically included in every `FetchResponse`**, making it easy to persist and later provide back to the engine during initialization.
 * **`sessionPoolOptions`**: Allows advanced configuration of the underlying Crawlee `SessionPool` (e.g., `maxUsageCount`, `maxPoolSize`).
 * **`overrideSessionState`**: If set to `true`, it forces the engine to overwrite any existing persistent state in the storage with the provided `sessionState`. This is useful when you want to ensure the session starts with the exact state provided, ignoring any stale data in the persistence layer.
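The `storage` fields added in the hunk above can be combined into a configuration for a session that survives restarts. The fragment below is only a sketch: `StorageOptionsSketch` is a hypothetical local type that mirrors the documented fields (the package's real type may differ), and the `id` value and directory are illustrative.

```typescript
// Hypothetical local type mirroring the documented `storage` fields.
// It is NOT imported from the package; the real type may differ.
interface StorageOptionsSketch {
  id?: string;
  persist?: boolean;
  purge?: boolean;
  config?: Record<string, unknown>;
}

// A session meant to survive application restarts:
// fixed id, disk persistence enabled, storage kept when the session closes.
const persistentStorage: StorageOptionsSketch = {
  id: 'my-login-session', // illustrative id; reuse it to share storage
  persist: true,
  purge: false,
  config: { localDataDirectory: './my-data' },
};

// Omitting every field keeps the documented defaults:
// a unique id per session, in-memory storage, purge on close.
const isolatedStorage: StorageOptionsSketch = {};
```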
@@ -86,6 +104,7 @@ The engine supports persisting and restoring session state (primarily cookies) b
 
 **Precedence Rule:**
 If both `sessionState` and `cookies` are provided, the engine adopts a **"Merge and Override"** strategy:
+
 1. The session is first restored from the `sessionState`.
 2. The explicit `cookies` are then applied on top.
 * **Result:** Any conflicting cookies in `sessionState` will be **overwritten** by the explicit `cookies`.
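The "Merge and Override" rule in the hunk above can be sketched as a two-pass merge. This is an illustration under assumptions, not the engine's code: `CookieSketch`, `cookieKey`, and `mergeCookies` are hypothetical names, and treating two cookies as conflicting when their name, domain, and path all match is an assumption made for the sketch.

```typescript
// Hypothetical minimal cookie shape for this sketch.
interface CookieSketch {
  name: string;
  value: string;
  domain?: string;
  path?: string;
}

// Assumption: cookies conflict when name, domain, and path all match.
function cookieKey(c: CookieSketch): string {
  return `${c.name};${c.domain ?? ''};${c.path ?? '/'}`;
}

function mergeCookies(
  fromSessionState: CookieSketch[],
  explicit: CookieSketch[],
): CookieSketch[] {
  const merged = new Map<string, CookieSketch>();
  // Step 1: restore everything from the session state.
  for (const c of fromSessionState) merged.set(cookieKey(c), c);
  // Step 2: apply explicit cookies on top; conflicts are overwritten.
  for (const c of explicit) merged.set(cookieKey(c), c);
  return Array.from(merged.values());
}
```

Cookies that appear only in the session state are kept unchanged, so the explicit list only needs to contain the values that should win.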
@@ -114,8 +133,11 @@ Our engine solves this by creating a bridge between the external API calls and t
     * The page context is used to resolve the `Promise` from the `goto()` call.
     * The page is marked as "active" (`isPageActive = true`).
     * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). This loop effectively **pauses the `requestHandler`** by listening for events on an `EventEmitter`, keeping the page context alive.
+    * **Strict Sequential Execution & Re-entrancy**: The loop uses an internal queue to ensure all actions are executed in the exact order they were dispatched. It also includes re-entrancy protection to allow composite actions to call atomic actions without deadlocking.
 5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
-6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event.
+6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event.
+    * **Centralized Actions**: Actions like `extract`, `pause`, and `getContent` are processed immediately by the `FetchEngine` base class using the unified logic.
+    * **Delegated Actions**: Engine-specific interactions (e.g., `click`, `fill`) are delegated to the subclass's `executeAction` implementation.
 7. **Robust Cleanup**: When `dispose()` or `cleanup()` is called:
     * An `isEngineDisposed` flag is set to prevent new actions.
     * A `dispose` signal is emitted to wake up and terminate the action loop.
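The "Strict Sequential Execution & Re-entrancy" point in the hunk above can be illustrated with a small promise-chained queue. This is a sketch under assumptions, not the engine's implementation: `SequentialActionQueue`, `NestedRunner`, and `ActionFn` are hypothetical names, and the re-entrancy mechanism shown here (handing each action a runner that executes nested actions inline) is one possible design; the diff does not say how the engine does it.

```typescript
// An action receives a runner it can use to invoke nested (atomic) actions.
type ActionFn<T> = (nested: NestedRunner) => Promise<T>;

interface NestedRunner {
  run<T>(action: ActionFn<T>): Promise<T>;
}

// Nested actions run inline: they already occupy the queue's current slot,
// so re-queueing them behind their parent would deadlock.
function createNestedRunner(): NestedRunner {
  const runner: NestedRunner = {
    run: (action) => action(runner),
  };
  return runner;
}

class SequentialActionQueue {
  private tail: Promise<unknown> = Promise.resolve();
  private readonly nested: NestedRunner = createNestedRunner();

  // Actions start strictly in dispatch order; each waits for the previous one.
  dispatch<T>(action: ActionFn<T>): Promise<T> {
    const result = this.tail.then(() => action(this.nested));
    // A failed action must not block the actions queued after it.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

A composite action would call `nested.run(...)` for each atomic step instead of `dispatch(...)`, which is what keeps it from waiting on its own queue slot.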
@@ -134,8 +156,9 @@ There are two primary engine implementations:
 * **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP and parse static HTML.
 * **Behavior**:
   * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
+  * ✅ **HTTP-Compliant Redirects**: Correctly handles 301-303 and 307/308 redirects, preserving methods/bodies or converting to GET as per HTTP specifications.
   * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
-  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests.
+  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests. **Browser-only actions** (e.g., `mouseMove`, `keyboardType`) will throw a `not_supported` error.
 * **Use Case**: Scraping static websites, server-rendered pages, or APIs.
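The "HTTP-Compliant Redirects" bullet above does not spell out the exact rules the engine applies. As a hedged illustration, the sketch below encodes the behavior commonly derived from RFC 9110 and the Fetch standard: 301/302 turn a POST into a GET, 303 turns everything except GET/HEAD into a GET, and 307/308 preserve method and body. `planRedirect` and `RedirectPlan` are hypothetical names, not part of the package.

```typescript
interface RedirectPlan {
  method: string;    // method to use for the follow-up request
  keepBody: boolean; // whether the original request body is re-sent
}

// Returns null when the status code is not one of the handled redirects.
function planRedirect(status: number, method: string): RedirectPlan | null {
  const m = method.toUpperCase();
  switch (status) {
    case 301:
    case 302:
      // Long-standing client behavior: POST is rewritten to a body-less GET.
      return m === 'POST'
        ? { method: 'GET', keepBody: false }
        : { method: m, keepBody: true };
    case 303:
      // "See Other": anything except GET/HEAD becomes a body-less GET.
      return m === 'GET' || m === 'HEAD'
        ? { method: m, keepBody: false }
        : { method: 'GET', keepBody: false };
    case 307:
    case 308:
      // Method and body must be preserved.
      return { method: m, keepBody: true };
    default:
      return null;
  }
}
```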
 
 ### `PlaywrightFetchEngine` (browser mode)
@@ -160,17 +183,95 @@ To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an
 * **Use Case**: Scraping websites protected by services like Cloudflare or other advanced bot-detection systems.
 * **Note**: This feature requires additional dependencies (`camoufox-js`, `firefox`) and may have a performance overhead.
 
+#### Configuration
+
+You can configure the browser engine via the `browser` property in options:
+
+* `headless` (boolean): Whether to run browser in headless mode (default: `true`).
+* `launchOptions` (object): Native Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch) passed directly to the browser launcher (e.g., `slowMo`, `args`, `devtools`).
+
+```typescript
+const result = await fetchWeb({
+  url: 'https://example.com',
+  engine: 'browser',
+  browser: {
+    headless: false,
+    launchOptions: {
+      slowMo: 100, // Slow down operations by 100ms
+      args: ['--start-maximized'] // Pass custom arguments
+    }
+  }
+});
+```
+
 ---
 
 ## 📊 5. Data Extraction with `extract()`
 
 The `extract()` method provides a powerful, declarative way to pull structured data from a web page. It uses a **Schema** to define the structure of your desired JSON output, and the engine automatically handles DOM traversal and data extraction.
 
-### Core Design:
+### Core Design: The Three-Layer Architecture
+
+To ensure consistency across engines and maintain high quality, the extraction system is divided into three layers:
+
+1. **Normalization Layer (`src/core/normalize-extract-schema.ts`)**: Pre-processes user-provided schemas into a canonical internal format, handling CSS filter merging and implicit object detection.
+2. **Core Extraction Logic (`src/core/extract.ts`)**: An engine-agnostic layer responsible for the extraction workflow. It dispatches tasks to `_extractObject`, `_extractArray`, or `_extractValue` via a **Dispatcher (`_extract`)**. It manages recursion, strict mode, required field validation, anchor resolution, performance-optimized tree operations (LCA, bubbling), and sequential consumption cursors.
+3. **Engine Interface (`IExtractEngine`)**: Defined at the core layer and implemented by engines to provide low-level DOM primitives.
+
+#### Implementation Rules for `IExtractEngine`
+
+To maintain cross-engine consistency, all implementations MUST follow these behavior contracts:
+
+- **`_querySelectorAll`**:
+  - MUST return matching elements in **document order**.
+  - MUST check if the scope element(s) **themselves** match the selector and search their **descendants**.
+- **`_nextSiblingsUntil`**:
+  - MUST return a flat list of siblings starting *after* the anchor and stopping *before* the first element matching the `untilSelector`.
+- **`_isSameElement`**:
+  - MUST compare elements based on **identity**, not content.
+- **`_findClosestAncestor`**:
+  - MUST efficiently find the closest ancestor of an element that exists in a given set of candidates.
+  - MUST be optimized to avoid multiple IPC calls in browser-based engines.
+- **`_contains`**:
+  - MUST implement standard DOM `Node.contains()` behavior.
+  - MUST be optimized for high-frequency boundary checks.
+- **`_findCommonAncestor`**:
+  - MUST find the Lowest Common Ancestor (LCA) of two elements.
+  - **Performance Critical**: In browser engines, this MUST be executed within a single `evaluate` call to minimize IPC (Inter-Process Communication) overhead.
+- **`_findContainerChild`**:
+  - MUST find the direct child of a container that contains a specific descendant.
+  - **Performance Critical**: This replaces manual "bubble-up" loops in the Node.js context, significantly reducing overhead for deep DOM trees.
+- **`_bubbleUpToScope` (Internal Helper)**:
+  - Implements the logic to bubble up from a deep element to its direct ancestor in the current scope.
+  - Supports an optional `depth` parameter to limit how many parent levels to traverse.
+  - MUST include a maximum depth limit (default 1000) to prevent infinite loops.
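As a rough illustration of several contracts listed above, the sketch below implements sibling scanning, `Node.contains()`-style checks, lowest-common-ancestor lookup, and container-child lookup over a tiny in-memory tree. Everything here is hypothetical: `TreeNodeSketch` and the function names are not the package's API, real engines operate on Cheerio selections or Playwright `Locator`s, and a browser engine would run the ancestor walk inside a single `evaluate` call rather than in Node.js.

```typescript
interface TreeNodeSketch {
  tag: string;
  parent: TreeNodeSketch | null;
  children: TreeNodeSketch[];
}

function makeNode(tag: string, children: TreeNodeSketch[] = []): TreeNodeSketch {
  const node: TreeNodeSketch = { tag, parent: null, children };
  for (const child of children) child.parent = node;
  return node;
}

// Siblings strictly after `anchor`, stopping before the first one matching `until`.
function nextSiblingsUntil(
  anchor: TreeNodeSketch,
  until?: (n: TreeNodeSketch) => boolean,
): TreeNodeSketch[] {
  const siblings = anchor.parent ? anchor.parent.children : [];
  const result: TreeNodeSketch[] = [];
  for (const sibling of siblings.slice(siblings.indexOf(anchor) + 1)) {
    if (until && until(sibling)) break;
    result.push(sibling);
  }
  return result;
}

// Node.contains() semantics: a node contains itself and all of its descendants.
// Comparison is by identity (===), never by content.
function contains(container: TreeNodeSketch, node: TreeNodeSketch | null): boolean {
  for (let cur = node; cur !== null; cur = cur.parent) {
    if (cur === container) return true;
  }
  return false;
}

// Lowest common ancestor: walk up from `a` until the current node contains `b`.
function findCommonAncestor(a: TreeNodeSketch, b: TreeNodeSketch): TreeNodeSketch | null {
  for (let cur: TreeNodeSketch | null = a; cur !== null; cur = cur.parent) {
    if (contains(cur, b)) return cur;
  }
  return null;
}

// The direct child of `container` that contains `descendant`, or null if outside it.
function findContainerChild(
  container: TreeNodeSketch,
  descendant: TreeNodeSketch,
): TreeNodeSketch | null {
  for (let cur: TreeNodeSketch | null = descendant; cur !== null; cur = cur.parent) {
    if (cur.parent === container) return cur;
  }
  return null;
}
```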
+
+This architecture ensures that complex features like **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically across the fast Cheerio engine and the full Playwright browser.
+
+### Schema Normalization
+
+To enhance usability and flexibility, the `extract` method internally implements a **"Normalization"** layer. This allows you to provide semantically clear shorthands, which are automatically converted into a standardized internal format.
+
+#### 1. Shorthand Rules
+
+- **String Shorthand**: A simple string like `'h1'` is automatically expanded to `{ selector: 'h1', type: 'string', mode: 'text' }`.
+- **Implicit Object Shorthand**: If you provide an object without an explicit `type: 'object'`, it is automatically treated as an `object` schema where the keys are the property names.
+  - *Example*: `{ "title": "h1" }` becomes `{ "type": "object", "properties": { "title": { "selector": "h1" } } }`.
+- **Filter Shorthand**: If you provide `has` or `exclude` alongside a `selector`, they are automatically merged into the CSS selector using `:has()` and `:not()` pseudo-classes.
+- **Array Shorthand**: Providing `attribute` directly on an `array` schema acts as a shorthand for its `items`.
+  - *Example*: `{ "type": "array", "attribute": "href" }` becomes `{ "type": "array", "items": { "type": "string", "attribute": "href" } }`.
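The string, filter, and array shorthands above can be sketched as a small normalizer. This is an illustration under assumptions, not the code in `src/core/normalize-extract-schema.ts`: `normalizeShorthand`, `mergeFilters`, and `SchemaSketch` are hypothetical names, the exact merged selector format (`selector:has(...):not(...)`) is assumed, and selector lists containing commas are not handled.

```typescript
type SchemaSketch = { [key: string]: unknown };

// Filter shorthand: fold `has` / `exclude` into the CSS selector.
function mergeFilters(schema: SchemaSketch): SchemaSketch {
  const { has, exclude, ...rest } = schema;
  if (typeof rest.selector !== 'string') return schema; // nothing to merge into
  let selector = rest.selector;
  if (typeof has === 'string') selector += `:has(${has})`;
  if (typeof exclude === 'string') selector += `:not(${exclude})`;
  return { ...rest, selector };
}

function normalizeShorthand(input: string | SchemaSketch): SchemaSketch {
  // String shorthand: 'h1' -> text extraction from that selector.
  if (typeof input === 'string') {
    return { selector: input, type: 'string', mode: 'text' };
  }
  const schema = mergeFilters(input);
  // Array shorthand: move a top-level `attribute` into `items`.
  if (schema.type === 'array' && schema.items === undefined && schema.attribute !== undefined) {
    const { attribute, ...rest } = schema;
    return { ...rest, items: { type: 'string', attribute } };
  }
  return schema;
}
```

For instance, `normalizeShorthand({ selector: 'div', has: 'a', exclude: '.ad' })` yields a schema whose selector is `div:has(a):not(.ad)` under the assumed merge format.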
+
+#### 2. Context vs. Data (Keyword Separation)
+
+In **Implicit Objects**, the engine must distinguish between *configuration* (where to look) and *data* (what to extract).
+
+- **Context Keys**: The keys `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are reserved for defining the extraction context and validation. They stay at the root of the schema.
+- **Data Keys**: All other keys (including `items`, `attribute`, `mode`, or even a field named `type`) are moved into the `properties` object as data fields to be extracted.
+- **Collision Handling**: You can safely extract a field named `type` as long as its value is not one of the reserved schema type keywords (`string`, `number`, `boolean`, `html`, `object`, `array`).
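The key separation above can be sketched as a single pass over the object's keys. This is a hedged illustration, not the package's normalizer: `expandImplicitObject`, `isExplicitSchema`, and `LooseSchema` are hypothetical names. Two assumptions are made: an object with no data keys is left untouched (treated as an ordinary value schema), and property values are not normalized recursively, which a full implementation would presumably do.

```typescript
const CONTEXT_KEYS = ['selector', 'has', 'exclude', 'required', 'strict', 'depth'];
const TYPE_KEYWORDS = ['string', 'number', 'boolean', 'html', 'object', 'array'];

type LooseSchema = { [key: string]: unknown };

// A schema is explicit when `type` holds one of the reserved keywords.
function isExplicitSchema(schema: LooseSchema): boolean {
  return typeof schema.type === 'string' && TYPE_KEYWORDS.indexOf(schema.type) !== -1;
}

function expandImplicitObject(schema: LooseSchema): LooseSchema {
  if (isExplicitSchema(schema)) return schema;

  const root: LooseSchema = { type: 'object' };
  const properties: LooseSchema = {};
  for (const key of Object.keys(schema)) {
    if (CONTEXT_KEYS.indexOf(key) !== -1) {
      root[key] = schema[key];       // configuration stays at the root
    } else {
      properties[key] = schema[key]; // everything else is a data field
    }
  }
  // Assumption: with no data keys this is not an implicit object.
  if (Object.keys(properties).length === 0) return schema;

  root.properties = properties;
  return root;
}
```

With this sketch, `{ selector: '.card', title: 'h1', type: '.kind' }` keeps `selector` at the root and moves both `title` and the field named `type` into `properties`, because `'.kind'` is not a reserved type keyword.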
 
-
+#### 3. Cross-Engine Consistency
 
-
+This normalization layer ensures that regardless of whether you are using the `cheerio` (http) or `playwright` (browser) engine, the complex shorthand logic behaves identically, providing a consistent "AI-friendly" interface.
 
 A schema can be one of three types:
 