@isdk/web-fetcher 0.3.2 → 0.3.3
- package/README.cn.md +19 -0
- package/README.engine.cn.md +34 -6
- package/README.engine.md +29 -1
- package/README.md +21 -1
- package/dist/index.d.mts +1515 -1490
- package/dist/index.d.ts +1515 -1490
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +21 -1
- package/docs/_media/README.cn.md +19 -0
- package/docs/_media/README.engine.md +29 -1
- package/docs/classes/CheerioFetchEngine.md +95 -95
- package/docs/classes/ClickAction.md +29 -29
- package/docs/classes/EngineUpgradeError.md +335 -0
- package/docs/classes/EvaluateAction.md +29 -29
- package/docs/classes/ExtractAction.md +29 -29
- package/docs/classes/FetchAction.md +29 -29
- package/docs/classes/FetchEngine.md +93 -93
- package/docs/classes/FetchSession.md +14 -14
- package/docs/classes/FillAction.md +29 -29
- package/docs/classes/GetContentAction.md +29 -29
- package/docs/classes/GotoAction.md +29 -29
- package/docs/classes/KeyboardPressAction.md +29 -29
- package/docs/classes/KeyboardTypeAction.md +29 -29
- package/docs/classes/MouseClickAction.md +29 -29
- package/docs/classes/MouseMoveAction.md +29 -29
- package/docs/classes/MouseWheelAction.md +29 -29
- package/docs/classes/PauseAction.md +29 -29
- package/docs/classes/PlaywrightFetchEngine.md +101 -101
- package/docs/classes/ScrollIntoViewAction.md +29 -29
- package/docs/classes/SubmitAction.md +29 -29
- package/docs/classes/TrimAction.md +29 -29
- package/docs/classes/WaitForAction.md +29 -29
- package/docs/classes/WebFetcher.md +5 -5
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +2 -2
- package/docs/functions/getRandomDelay.md +1 -1
- package/docs/globals.md +3 -1
- package/docs/interfaces/BaseFetchActionProperties.md +13 -13
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +17 -17
- package/docs/interfaces/BaseFetcherProperties.md +44 -28
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +3 -3
- package/docs/interfaces/ExtractActionProperties.md +13 -13
- package/docs/interfaces/FetchActionMeta.md +73 -0
- package/docs/interfaces/FetchActionProperties.md +15 -19
- package/docs/interfaces/FetchActionResult.md +7 -7
- package/docs/interfaces/FetchContext.md +65 -41
- package/docs/interfaces/FetchEngineContext.md +57 -33
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +7 -7
- package/docs/interfaces/FetchSite.md +55 -31
- package/docs/interfaces/FetcherOptions.md +55 -31
- package/docs/interfaces/GotoActionOptions.md +8 -8
- package/docs/interfaces/KeyboardPressParams.md +3 -3
- package/docs/interfaces/KeyboardTypeParams.md +3 -3
- package/docs/interfaces/MouseClickParams.md +6 -6
- package/docs/interfaces/MouseMoveParams.md +5 -5
- package/docs/interfaces/MouseWheelParams.md +7 -7
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/ScrollIntoViewParams.md +2 -2
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +3 -3
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionInContext.md +38 -0
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +1 -1
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +1 -1
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +1 -1
- package/package.json +1 -1
- package/docs/interfaces/FetchActionInContext.md +0 -190
package/README.cn.md
CHANGED
@@ -27,6 +27,19 @@
 
 ---
 
+### Smart Upgrade and Retry Strategy
+
+When `enableSmart` is enabled, the system automatically decides, based on response characteristics, whether the engine needs to be upgraded:
+
+- Conditions that trigger an upgrade include:
+  - HTTP status codes: `401 / 403 / 500 / 429`
+  - The page appears to be dynamically rendered (typical JS framework signatures detected in the HTML)
+  - `Retry-After` exceeds `upgradeThresholdMs`
+- During an upgrade, you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`)
+- For `429` responses, if `Retry-After` is below the `upgradeThresholdMs` threshold, the system retries first instead of upgrading
+
+---
+
 ## 📦 Installation
 
 1. **Install the package:**
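The retry-versus-upgrade rule described in this hunk can be sketched as a small pure function. This is only an illustration under stated assumptions: `decideNextStep` and `SmartDecision` are hypothetical names, not taken from the package source, and the dynamic-page trigger is left out. Only the status codes, `Retry-After`, and `upgradeThresholdMs` (default `5000`) come from the README text. The README does not say what happens when `Retry-After` equals the threshold; the sketch assumes an upgrade in that case.

```typescript
// Hypothetical sketch of the documented rule; not the package's actual code.
type SmartDecision = "retry" | "upgrade" | "none";

// Status codes the README lists as upgrade triggers.
const UPGRADE_STATUS_CODES = [401, 403, 429, 500];

function decideNextStep(
  status: number,
  retryAfterMs: number | undefined, // parsed Retry-After, if the header was present
  upgradeThresholdMs: number = 5000, // default stated in the README
): SmartDecision {
  if (UPGRADE_STATUS_CODES.indexOf(status) === -1) return "none";
  // For 429, a short Retry-After means waiting and retrying on the same engine.
  if (
    status === 429 &&
    retryAfterMs !== undefined &&
    retryAfterMs < upgradeThresholdMs
  ) {
    return "retry";
  }
  // Everything else in the trigger list (including 429 with a long or
  // missing Retry-After) is treated as a reason to upgrade.
  return "upgrade";
}
```

Keeping the decision free of I/O makes the threshold behaviour easy to check in isolation, for example `decideNextStep(429, 1000)` versus `decideNextStep(429, 6000)`.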
@@ -150,6 +163,12 @@ searchGoogle('gemini');
 * `headless` (boolean): Whether to run in headless mode (default: `true`).
 * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
 * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
+* `enableSmart` (boolean): Whether to enable smart detection and automatic engine upgrade (default: `true`).
+* `syncStateOnUpgrade` (boolean): Whether to sync Cookies / Session state when upgrading from the http engine to the browser engine (default: `false`).
+* `upgradeThresholdMs` (number): Wait-time threshold (in milliseconds) for triggering an engine upgrade; the engine is upgraded if the wait exceeds it or if no explicit retry information is available (default: `5000`).
+* `maxRetries` (number): Maximum number of retries for a single Action (default: `0`).
+* `failOnError` (boolean): Whether to throw an exception when an Action fails (default: `true` for the main flow, `false` for collectors).
+* `failOnTimeout` (boolean): Whether a timeout is treated as a failure (default: `false`).
 * ...and many other options for proxy, retries, etc.
 
 ### Built-in Actions
package/README.engine.cn.md
CHANGED
@@ -64,7 +64,13 @@ const session = new FetchSession({ engine: 'browser' });
 1. **Explicit Option**: If an engine is provided in `options.engine` of `FetchSession` (or in a temporary context override in `executeAll`) and is not `'auto'`.
     * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
 2. **Site Registry**: If set to `'auto'` (default), the system attempts to match the target URL against the `sites` registry.
-3. **Smart Upgrade**:
+3. **Smart Upgrade**:
+    - When `enableSmart: true`, the system automatically upgrades from `http` to `browser` in the following cases:
+        - The response returns `401 / 403 / 500 / 429`
+        - The HTML content is identified as "highly dynamic" (heavy JS)
+        - `Retry-After` exceeds `upgradeThresholdMs`
+    - During the upgrade, you can choose whether to sync Cookies / Session (`syncStateOnUpgrade`)
+    - If the upgrade fails or still does not meet the requirements, the original error may still be thrown
 4. **Default**: Falls back to `'http'` (Cheerio).
 
 #### Batch Execution with Overrides
@@ -82,6 +88,20 @@ await session.executeAll([
 });
 ```
 
+#### Smart Upgrade & Retry
+
+During `executeAll`, if an Action throws `ENGINE_UPGRADE_REQUIRED`:
+
+- The system attempts to release the current engine
+- Creates a new browser engine
+- **Automatically re-executes the action list from the beginning**
+- Side effects of actions that already succeeded (e.g., cookie writes) can be preserved, depending on configuration
+
+In addition, for `429` responses:
+
+- If `Retry-After` is present and less than `upgradeThresholdMs`
+- The system automatically retries within the same engine, without triggering an upgrade
+
 ### Session Management & State Persistence
 
 The engine supports persisting and restoring session state (primarily cookies) between executions.
@@ -112,6 +132,16 @@ await session.executeAll([
 
 ---
 
+### Error Enhancement & Retry-After Support
+
+- All HTTP errors now attach the original `FetchResponse` to the error object (`error.response`)
+- Support for parsing the HTTP `Retry-After` header:
+  - Integer (seconds) format
+  - HTTP date format
+- Error messages include a hint about the retry wait time
+
+---
+
 ## 🏗️ 3. Architecture and Workflow
 
 The engine's architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.
@@ -187,8 +217,8 @@ await session.executeAll([
 
 You can configure the browser engine through the `browser` property in the options:
 
-*
-*
+* `headless` (boolean): Whether to run the browser in headless mode (default: `true`).
+* `launchOptions` (object): Native Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch), passed directly to the browser launcher (e.g., `slowMo`, `args`, `devtools`).
 
 ```typescript
 const result = await fetchWeb({
@@ -246,9 +276,7 @@ await session.executeAll([
 - Supports an optional `depth` parameter to limit how many levels are traversed upward.
 - Must include a maximum depth limit (default 1000) to prevent infinite loops.
 
-This architecture ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping**
-
-This decoupling ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically whether they run on the fast Cheerio engine or in a full Playwright browser.
+This architecture ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically whether they run on the fast Cheerio engine or in a full Playwright browser.
 
 ### Schema Normalization
 
package/README.engine.md
CHANGED
@@ -64,9 +64,27 @@ The engine is initialized lazily upon the first action execution and remains fix
 1. **Explicit Option**: If `options.engine` (or temporary context override in `executeAll`) is provided and NOT set to `'auto'`.
     * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
 2. **Site Registry**: If set to `'auto'` (default), the system attempts to match the target URL against the `sites` registry.
-3. **Smart Upgrade**: If
+3. **Smart Upgrade**: If `enableSmart: true`, the system will automatically upgrade from `http` to `browser` under the following conditions:
+    - Returns `401 / 403 / 500 / 429`
+    - HTML content is identified as "highly dynamic" (heavy JS)
+    - `Retry-After` exceeds `upgradeThresholdMs`
+    - You can optionally sync Cookies/Session during upgrade (`syncStateOnUpgrade`)
+    - If upgrade fails or still doesn't meet requirements, the original error is thrown
 4. **Default**: Falls back to `'http'` (Cheerio).
 
+#### Smart Upgrade & Retry
+
+During `executeAll`, if an Action throws `ENGINE_UPGRADE_REQUIRED`:
+
+- The system attempts to release the current engine
+- Creates a new browser engine
+- **Automatically re-executes the action list from the beginning**
+- Side effects of successfully executed actions (e.g., cookie writes) can be preserved based on configuration
+
+For `429` responses:
+- If `Retry-After` exists and is less than `upgradeThresholdMs`
+- The system will automatically retry within the same engine without triggering an upgrade
+
 #### Batch Execution with Overrides
 
 You can execute a sequence of actions with temporary configuration overrides (e.g., headers, timeout) that apply only to that specific batch, without modifying the session's global state.
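The restart behaviour in the "Smart Upgrade & Retry" bullets (release the engine, create a browser engine, re-run the list from the start) might look roughly like the loop below. It is a minimal sketch, not the package's implementation: `Engine`, `EngineKind`, `runWithUpgrade`, `isUpgradeRequired`, and the `createEngine` factory are hypothetical stand-ins, and exposing `ENGINE_UPGRADE_REQUIRED` through an error `code` property is an assumption. Preserving side effects such as cookies (`syncStateOnUpgrade`) is not modelled.

```typescript
// Hypothetical stand-ins; the real package's types and error shape may differ.
type EngineKind = "http" | "browser";

interface Engine {
  run(action: string): Promise<string>;
  dispose(): Promise<void>;
}

// Assumption: the upgrade signal is carried in an error `code` property.
function isUpgradeRequired(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "ENGINE_UPGRADE_REQUIRED"
  );
}

async function runWithUpgrade(
  actions: string[],
  createEngine: (kind: EngineKind) => Engine,
): Promise<string[]> {
  let kind: EngineKind = "http";
  for (;;) {
    const engine = createEngine(kind);
    try {
      const results: string[] = [];
      for (const action of actions) {
        results.push(await engine.run(action));
      }
      return results;
    } catch (err) {
      // Only an http engine can be upgraded; anything else is rethrown.
      if (kind !== "http" || !isUpgradeRequired(err)) throw err;
      kind = "browser"; // next iteration restarts the whole list
    } finally {
      await engine.dispose(); // always release the engine used in this attempt
    }
  }
}
```

Restarting from the first action, instead of resuming at the failed one, keeps the browser engine's page state consistent with the script, at the cost of repeating earlier steps.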
@@ -112,6 +130,16 @@ If both `sessionState` and `cookies` are provided, the engine adopts a **"Merge
 
 ---
 
+### Error Enhancement & Retry-After Support
+
+- All HTTP errors now attach the original `FetchResponse` to the error object (`error.response`)
+- Support for parsing HTTP `Retry-After` header:
+  - Integer (seconds) format
+  - HTTP date format
+- Retry wait time hints are included in error messages
+
+---
+
 ## 🏗️ 3. Architecture and Workflow
 
 The engine's architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.
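The two `Retry-After` forms listed in this hunk (integer seconds and HTTP date) can be handled with a short parser. The function below is a sketch under assumptions: `parseRetryAfterMs` is a hypothetical name and does not reflect how the package parses the header internally. It relies on `Date.parse`, which accepts more formats than the HTTP-date grammar strictly allows.

```typescript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Returns undefined when the value is missing or cannot be parsed.
function parseRetryAfterMs(
  value: string | null | undefined,
  now: number = Date.now(),
): number | undefined {
  if (value === null || value === undefined) return undefined;
  const trimmed = value.trim();
  if (trimmed === "") return undefined;

  // Form 1: a delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(trimmed);
  if (isNaN(date)) return undefined;

  // A date in the past means the caller may retry immediately.
  return Math.max(0, date - now);
}
```

Passing `now` explicitly keeps the date branch deterministic, which is convenient when comparing the result against a threshold such as `upgradeThresholdMs`.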
package/README.md
CHANGED
@@ -20,9 +20,23 @@ English | [简体中文](./README.cn.md)
 * **📜 Declarative Action Scripts**: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format.
 * **📊 Powerful and Flexible Data Extraction**: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive and powerful declarative Schema.
 * **🧠 Smart Engine Selection**: Automatically detects dynamic sites and can upgrade the engine from `http` to `browser` on the fly.
+* **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps to bypass common anti-bot measures like Cloudflare challenges.
+* **🕹️ High-Fidelity Interaction Simulation**: Supports Bézier curve-based mouse trajectory movement, realistic typing delay simulation, and complex keyboard interactions to significantly improve anti-bot evasion.
 * **🧩 Extensible**: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a `login` action).
 * **🧲 Advanced Collectors**: Asynchronously collect data in the background, triggered by events during the execution of a main action.
-
+
+---
+
+### Smart Upgrade and Retry Strategy
+
+When `enableSmart` is enabled, the system automatically determines whether an engine upgrade is needed based on response characteristics:
+
+- Triggers for upgrade include:
+  - HTTP status codes: `401 / 403 / 500 / 429`
+  - Page appears to be dynamically rendered (detected typical JS framework signatures in HTML)
+  - `Retry-After` exceeds `upgradeThresholdMs`
+- During upgrade, you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`)
+- For `429` responses, if `Retry-After` is less than the `upgradeThresholdMs` threshold, the system will prioritize retry over upgrade
 
 ---
 
@@ -149,6 +163,12 @@ This is the main entry point for the library.
 * `headless` (boolean): Run in headless mode (default: `true`).
 * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
 * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
+* `enableSmart` (boolean): Enable smart detection and automatic engine upgrade (default: `true`).
+* `syncStateOnUpgrade` (boolean): Whether to sync Cookies / Session state when upgrading from http to browser engine (default: `false`).
+* `upgradeThresholdMs` (number): Wait time threshold in milliseconds to trigger engine upgrade; upgrades if exceeded or no explicit retry info (default: `5000`).
+* `maxRetries` (number): Maximum retry attempts for a single Action (default: `0`).
+* `failOnError` (boolean): Whether to throw an exception when an Action fails (default: `true` for main flow, `false` for collector).
+* `failOnTimeout` (boolean): Whether to treat timeout as failure (default: `false`).
 * ...and many other options for proxy, retries, etc.
 
 ### Built-in Actions
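For reference, the six options added in this hunk and their documented defaults can be written out as a typed object. This is a sketch only: `SmartRetryOptions` is a hypothetical local type, and the diff does not show where these fields sit inside the package's real options object, so nothing here is passed to the library.

```typescript
// Hypothetical local type mirroring the six options added in this hunk.
interface SmartRetryOptions {
  enableSmart?: boolean;
  syncStateOnUpgrade?: boolean;
  upgradeThresholdMs?: number;
  maxRetries?: number;
  failOnError?: boolean;
  failOnTimeout?: boolean;
}

// Defaults as stated in the README text (failOnError shown for the main flow;
// the README gives `false` for collectors).
const documentedDefaults: Required<SmartRetryOptions> = {
  enableSmart: true,
  syncStateOnUpgrade: false,
  upgradeThresholdMs: 5000,
  maxRetries: 0,
  failOnError: true,
  failOnTimeout: false,
};

// Example override: keep cookies across an upgrade, tolerate longer
// Retry-After waits before upgrading, and retry each Action up to twice.
const options: Required<SmartRetryOptions> = {
  ...documentedDefaults,
  syncStateOnUpgrade: true,
  upgradeThresholdMs: 10000,
  maxRetries: 2,
};
```

Spreading the defaults first makes it explicit which settings deviate from the documented behaviour.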