@isdk/web-fetcher 0.3.1 → 0.3.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (88)
  1. package/README.action.cn.md +28 -4
  2. package/README.action.md +27 -4
  3. package/README.cn.md +21 -0
  4. package/README.engine.cn.md +35 -7
  5. package/README.engine.md +30 -2
  6. package/README.md +23 -1
  7. package/dist/index.d.mts +1571 -1448
  8. package/dist/index.d.ts +1571 -1448
  9. package/dist/index.js +1 -1
  10. package/dist/index.mjs +1 -1
  11. package/docs/README.md +23 -1
  12. package/docs/_media/README.action.md +27 -4
  13. package/docs/_media/README.cn.md +21 -0
  14. package/docs/_media/README.engine.md +30 -2
  15. package/docs/classes/CheerioFetchEngine.md +169 -93
  16. package/docs/classes/ClickAction.md +29 -29
  17. package/docs/classes/EngineUpgradeError.md +335 -0
  18. package/docs/classes/EvaluateAction.md +29 -29
  19. package/docs/classes/ExtractAction.md +29 -29
  20. package/docs/classes/FetchAction.md +31 -29
  21. package/docs/classes/FetchEngine.md +159 -91
  22. package/docs/classes/FetchSession.md +14 -14
  23. package/docs/classes/FillAction.md +29 -29
  24. package/docs/classes/GetContentAction.md +29 -29
  25. package/docs/classes/GotoAction.md +29 -29
  26. package/docs/classes/KeyboardPressAction.md +29 -29
  27. package/docs/classes/KeyboardTypeAction.md +29 -29
  28. package/docs/classes/MouseClickAction.md +29 -29
  29. package/docs/classes/MouseMoveAction.md +29 -29
  30. package/docs/classes/MouseWheelAction.md +533 -0
  31. package/docs/classes/PauseAction.md +29 -29
  32. package/docs/classes/PlaywrightFetchEngine.md +252 -118
  33. package/docs/classes/ScrollIntoViewAction.md +533 -0
  34. package/docs/classes/SubmitAction.md +29 -29
  35. package/docs/classes/TrimAction.md +29 -29
  36. package/docs/classes/WaitForAction.md +29 -29
  37. package/docs/classes/WebFetcher.md +5 -5
  38. package/docs/enumerations/FetchActionResultStatus.md +4 -4
  39. package/docs/functions/fetchWeb.md +2 -2
  40. package/docs/functions/getRandomDelay.md +25 -0
  41. package/docs/globals.md +8 -1
  42. package/docs/interfaces/BaseFetchActionProperties.md +13 -13
  43. package/docs/interfaces/BaseFetchCollectorActionProperties.md +17 -17
  44. package/docs/interfaces/BaseFetcherProperties.md +44 -28
  45. package/docs/interfaces/DispatchedEngineAction.md +4 -4
  46. package/docs/interfaces/EvaluateActionOptions.md +3 -3
  47. package/docs/interfaces/ExtractActionProperties.md +13 -13
  48. package/docs/interfaces/FetchActionMeta.md +73 -0
  49. package/docs/interfaces/FetchActionProperties.md +15 -19
  50. package/docs/interfaces/FetchActionResult.md +7 -7
  51. package/docs/interfaces/FetchContext.md +65 -41
  52. package/docs/interfaces/FetchEngineContext.md +57 -33
  53. package/docs/interfaces/FetchMetadata.md +5 -5
  54. package/docs/interfaces/FetchResponse.md +14 -14
  55. package/docs/interfaces/FetchReturnTypeRegistry.md +7 -7
  56. package/docs/interfaces/FetchSite.md +55 -31
  57. package/docs/interfaces/FetcherOptions.md +55 -31
  58. package/docs/interfaces/GotoActionOptions.md +8 -8
  59. package/docs/interfaces/KeyboardPressParams.md +3 -3
  60. package/docs/interfaces/KeyboardTypeParams.md +3 -3
  61. package/docs/interfaces/MouseClickParams.md +6 -6
  62. package/docs/interfaces/MouseMoveParams.md +5 -5
  63. package/docs/interfaces/MouseWheelParams.md +69 -0
  64. package/docs/interfaces/PendingEngineRequest.md +3 -3
  65. package/docs/interfaces/ScrollIntoViewParams.md +17 -0
  66. package/docs/interfaces/StorageOptions.md +5 -5
  67. package/docs/interfaces/SubmitActionOptions.md +2 -2
  68. package/docs/interfaces/TrimActionOptions.md +3 -3
  69. package/docs/interfaces/WaitForActionOptions.md +5 -5
  70. package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
  71. package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
  72. package/docs/type-aliases/BrowserEngine.md +1 -1
  73. package/docs/type-aliases/FetchActionCapabilities.md +1 -1
  74. package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
  75. package/docs/type-aliases/FetchActionInContext.md +38 -0
  76. package/docs/type-aliases/FetchActionOptions.md +1 -1
  77. package/docs/type-aliases/FetchEngineAction.md +2 -2
  78. package/docs/type-aliases/FetchEngineType.md +1 -1
  79. package/docs/type-aliases/FetchReturnType.md +1 -1
  80. package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
  81. package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
  82. package/docs/type-aliases/ResourceType.md +1 -1
  83. package/docs/type-aliases/TrimPreset.md +1 -1
  84. package/docs/variables/DefaultFetcherProperties.md +1 -1
  85. package/docs/variables/FetcherOptionKeys.md +1 -1
  86. package/docs/variables/TRIM_PRESETS.md +1 -1
  87. package/package.json +7 -7
  88. package/docs/interfaces/FetchActionInContext.md +0 -190
package/README.action.cn.md CHANGED
@@ -251,8 +251,8 @@ await fetchWeb({
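 
  * **`id`**: `mouseMove`
  * **`params`**:
- * `x` (number, optional): The absolute X coordinate.
- * `y` (number, optional): The absolute Y coordinate.
+ * `x` (number, optional): The absolute X coordinate. If negative, it is treated as a random offset relative to the current position.
+ * `y` (number, optional): The absolute Y coordinate. If negative, it is treated as a random offset relative to the current position.
  * `selector` (string, optional): A CSS selector. If provided, the mouse moves to the center of the element.
  * `steps` (number, optional): The number of intermediate steps in the trajectory (default: `-1`). Set to `-1` to compute the step count dynamically from the distance (simulating natural movement speed).
  * **`returns`**: `none`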
@@ -263,14 +263,38 @@ await fetchWeb({
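 
  * **`id`**: `mouseClick`
  * **`params`**:
- * `x` (number, optional): The absolute X coordinate to click.
- * `y` (number, optional): The absolute Y coordinate to click.
+ * `x` (number, optional): The absolute X coordinate to click. If negative, it is treated as a random offset relative to the current position.
+ * `y` (number, optional): The absolute Y coordinate to click. If negative, it is treated as a random offset relative to the current position.
+
  * `selector` (string, optional): A CSS selector. If provided, the mouse first moves to the element.
  * `button` (string, optional): The mouse button to use (`left`, `right`, or `middle`). Defaults to `left`.
  * `clickCount` (number, optional): The number of clicks (e.g., 2 for a double-click). Defaults to 1.
  * `delay` (number, optional): Delay between mousedown and mouseup, in milliseconds.
  * **`returns`**: `none`
 
+ #### `mouseWheel`
+
+ Simulates a mouse wheel scroll event. If a `selector` is provided, the element is automatically scrolled into the viewport and the mouse pointer is moved to its center before scrolling. If `steps` is provided, the scroll delta is split into multiple steps to simulate realistic scrolling.
+
+ * **`id`**: `mouseWheel`
+ * **`params`**:
+ * `x` (number, optional): The absolute X coordinate to scroll at. If negative, it is treated as a random offset relative to the current position.
+ * `y` (number, optional): The absolute Y coordinate to scroll at. If negative, it is treated as a random offset relative to the current position.
+ * `selector` (string, optional): A CSS selector. If provided, ensures the element is visible and first moves the mouse to its center.
+ * `deltaX` (number, optional): The horizontal scroll amount. Defaults to 0.
+ * `deltaY` (number, optional): The vertical scroll amount. Defaults to 0.
+ * `steps` (number, optional): The number of steps to split the scroll into (default: `1`).
+ * **`returns`**: `none`
+
+ #### `scrollIntoView`
+
+ Scrolls the page or a scrollable container so that a specific element is visible in the viewport.
+
+ * **`id`**: `scrollIntoView`
+ * **`params`**:
+ * `selector` (string): The CSS selector of the element to scroll into view.
+ * **`returns`**: `none`
+
  #### `keyboardType`
 
  Simulates a real person typing text into the currently focused element.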
package/README.action.md CHANGED
@@ -253,8 +253,8 @@ Moves the mouse cursor to a specific coordinate or element. In `browser` mode, i
 
  * **`id`**: `mouseMove`
  * **`params`**:
- * `x` (number, optional): The absolute X coordinate.
- * `y` (number, optional): The absolute Y coordinate.
+ * `x` (number, optional): The absolute X coordinate. If negative, it's treated as a relative random offset from current position.
+ * `y` (number, optional): The absolute Y coordinate. If negative, it's treated as a relative random offset from current position.
  * `selector` (string, optional): A CSS selector. If provided, the mouse moves to the center of the element.
  * `steps` (number, optional): The number of intermediate steps for the trajectory (default: `-1`). Set to `-1` to calculate steps automatically based on distance (simulating natural speed).
  * **`returns`**: `none`
@@ -265,14 +265,37 @@ Triggers a mouse click at the current position or specified coordinates. If a `s
 
  * **`id`**: `mouseClick`
  * **`params`**:
- * `x` (number, optional): The absolute X coordinate to click.
- * `y` (number, optional): The absolute Y coordinate to click.
+ * `x` (number, optional): The absolute X coordinate to click. If negative, it's treated as a relative random offset from current position.
+ * `y` (number, optional): The absolute Y coordinate to click. If negative, it's treated as a relative random offset from current position.
  * `selector` (string, optional): A CSS selector. If provided, moves the mouse to the element first.
  * `button` (string, optional): The mouse button to use (`left`, `right`, or `middle`). Default is `left`.
  * `clickCount` (number, optional): The number of clicks (e.g., 2 for double-click). Default is 1.
  * `delay` (number, optional): Delay between mousedown and mouseup in milliseconds.
  * **`returns`**: `none`
 
+ #### `mouseWheel`
+
+ Simulates a mouse wheel scroll event. If a `selector` is provided, the element is automatically scrolled into view, and the cursor is moved to its center before scrolling. If `steps` is provided, the scroll delta is split into multiple steps for realistic simulation.
+
+ * **`id`**: `mouseWheel`
+ * **`params`**:
+ * `x` (number, optional): The absolute X coordinate to scroll at. If negative, it's treated as a relative random offset from current position.
+ * `y` (number, optional): The absolute Y coordinate to scroll at. If negative, it's treated as a relative random offset from current position.
+ * `selector` (string, optional): A CSS selector. If provided, ensures the element is visible and moves the mouse to its center first.
+ * `deltaX` (number, optional): The horizontal scroll amount. Default is 0.
+ * `deltaY` (number, optional): The vertical scroll amount. Default is 0.
+ * `steps` (number, optional): The number of steps to split the scroll into (default: `1`).
+ * **`returns`**: `none`
+
+ #### `scrollIntoView`
+
+ Scrolls the page or a scrollable container to make a specific element visible in the viewport.
+
+ * **`id`**: `scrollIntoView`
+ * **`params`**:
+ * `selector` (string): The CSS selector of the element to scroll into view.
+ * **`returns`**: `none`
+
  #### `keyboardType`
 
  Simulates a person typing text into the currently focused element.
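Note on the `steps` parameter of `mouseWheel` in the hunk above: the README says the scroll delta is split into multiple steps but does not say how. The sketch below shows one plausible even split. `splitDelta` is a hypothetical helper written for this note, not a function exported by `@isdk/web-fetcher`, and the action object only mirrors the documented `id` / `params` shape.

```typescript
// Hypothetical helper (an assumption, not the package's implementation):
// splits a total wheel delta into `steps` increments that sum to the total.
function splitDelta(total: number, steps: number): number[] {
  const count = Math.max(1, Math.floor(steps));
  const base = Math.trunc(total / count);
  const remainder = total - base * count;
  const parts: number[] = [];
  for (let i = 0; i < count; i++) {
    // The leftover goes on the last step so the sum is preserved.
    parts.push(i === count - 1 ? base + remainder : base);
  }
  return parts;
}

// An action entry shaped like the documented `mouseWheel` params.
// The selector value is made up for the example.
const wheelAction = {
  id: 'mouseWheel',
  params: { selector: '#feed', deltaY: 1000, steps: 3 },
};

const increments = splitDelta(wheelAction.params.deltaY, wheelAction.params.steps);
console.log(increments); // [ 333, 333, 334 ]
```

With the default `steps: 1` this degenerates to a single wheel event carrying the full delta, which matches the documented default.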
package/README.cn.md CHANGED
@@ -27,6 +27,19 @@
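 
  ---
 
+ ### Smart Upgrade and Retry Strategy
+
+ When `enableSmart` is on, the system automatically decides from response characteristics whether the engine needs to be upgraded:
+
+ - Conditions that trigger an upgrade include:
+ - HTTP status codes: `401 / 403 / 500 / 429`
+ - The page appears to be dynamically rendered (typical JS framework signatures detected in the HTML)
+ - `Retry-After` exceeds `upgradeThresholdMs`
+ - During the upgrade you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`)
+ - For `429` responses, if `Retry-After` is below the `upgradeThresholdMs` threshold, the system prefers retrying over upgrading
+
+ ---
+
  ## 📦 Installation
 
  1. **Install the dependency package:**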
@@ -150,6 +163,12 @@ searchGoogle('gemini');
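  * `headless` (boolean): Whether to run in headless mode (default: `true`).
  * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
  * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
+ * `enableSmart` (boolean): Whether to enable smart detection and automatic engine upgrade (default: `true`).
+ * `syncStateOnUpgrade` (boolean): Whether to sync Cookies / Session state when upgrading from the http engine to the browser engine (default: `false`).
+ * `upgradeThresholdMs` (number): Wait-time threshold in milliseconds for triggering an engine upgrade; the engine upgrades if the wait exceeds it or no explicit retry information is given (default: `5000`).
+ * `maxRetries` (number): Maximum number of retries for a single Action (default: `0`).
+ * `failOnError` (boolean): Whether to throw an exception when an Action fails (default: `true` for the main flow, `false` for collectors).
+ * `failOnTimeout` (boolean): Whether a timeout counts as a failure (default: `false`).
  * ...and many other options for proxies, retries, etc.
 
  ### Built-in Actions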
@@ -162,6 +181,8 @@ searchGoogle('gemini');
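  * `submit`: Submits a form (engine-specific).
  * `mouseMove`: Moves the mouse pointer to the specified coordinates or element (Bézier curves supported).
  * `mouseClick`: Triggers a mouse click at the current position or at specified coordinates.
+ * `mouseWheel`: Simulates mouse wheel scrolling at the target position; supports horizontal and vertical deltas, stepped simulation, and automatically scrolling the target element into the viewport.
+ * `scrollIntoView`: Scrolls the page or a container so that a specific element is visible in the viewport.
  * `keyboardType`: Simulates a real person typing text into the currently focused element.
  * `keyboardPress`: Simulates pressing a single key or a key combination.
  * `trim`: Removes elements from the DOM to clean up the page (e.g., scripts, ads, hidden content).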
package/README.engine.cn.md CHANGED
@@ -64,7 +64,13 @@ const session = new FetchSession({ engine: 'browser' });
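  1. **Explicit Option**: If an engine is provided in the `FetchSession`'s `options.engine` (or in a temporary context override in `executeAll`) and is not `'auto'`.
  * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
  2. **Site Registry**: If set to `'auto'` (the default), the system tries to match the target URL against the `sites` registry.
- 3. **Smart Upgrade**: If enabled, the system may dynamically upgrade from `http` to `browser` based on response characteristics (such as bot detection or heavy JS).
+ 3. **Smart Upgrade**:
+ - When `enableSmart: true`, the system automatically upgrades from `http` to `browser` in the following cases:
+ - The response is `401 / 403 / 500 / 429`
+ - The HTML content is identified as "highly dynamic" (heavy JS)
+ - `Retry-After` exceeds `upgradeThresholdMs`
+ - During the upgrade you can choose whether to sync Cookies / Session (`syncStateOnUpgrade`)
+ - If the upgrade fails or still does not meet the requirements, the original error may still be thrown
  4. **Default**: Falls back to `'http'` (Cheerio).
 
  #### Batch Execution with Overrides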
@@ -82,6 +88,20 @@ await session.executeAll([
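  });
  ```
 
+ #### Smart Upgrade and Retry
+
+ During `executeAll`, if an Action throws `ENGINE_UPGRADE_REQUIRED`:
+
+ - The system attempts to release the current engine
+ - Creates a new browser engine
+ - **Automatically re-executes the action list from the beginning**
+ - Side effects of actions that already succeeded (such as cookie writes) can be preserved depending on configuration
+
+ In addition, for `429` responses:
+
+ - If `Retry-After` is present and less than `upgradeThresholdMs`
+ - The system retries automatically within the same engine without triggering an upgrade
+
  ### Session Management and State Persistence
 
  The engine supports persisting and restoring session state (primarily cookies) across multiple executions.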
@@ -112,6 +132,16 @@ await session.executeAll([
 
  ---
 
+ ### Error Enhancement and Retry-After Support
+
+ - All HTTP errors now attach the original `FetchResponse` to the error object (`error.response`)
+ - Supports parsing the HTTP `Retry-After` header:
+ - Integer (seconds) values
+ - HTTP date format
+ - Error messages include a hint about the retry wait time
+
+ ---
+
  ## 🏗️ 3. Architecture and Workflow
 
  The engine architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (first `goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.
@@ -158,7 +188,7 @@ await session.executeAll([
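  * ✅ **Fast and Lightweight**: Well suited to scenarios that prioritize speed and low resource consumption.
  * ✅ **HTTP-Compliant Redirects**: Correctly handles 301-303 and 307/308 redirects, preserving the method/body or converting to GET as the HTTP specification requires.
  * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
- * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by issuing new HTTP requests. **Browser-only actions** (such as `mouseMove`, `keyboardType`) will throw a `not_supported` error.
+ * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by issuing new HTTP requests. **Browser-only actions** (such as `mouseMove`, `mouseWheel`, `keyboardType`) will throw a `not_supported` error.
  * **Use Case**: Scraping static websites, server-rendered pages, or APIs.
 
  ### `PlaywrightFetchEngine` (browser mode)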
@@ -187,8 +217,8 @@ await session.executeAll([
 
  You can configure the browser engine through the `browser` property in the options:
 
- * `headless` (boolean): Whether to run the browser in headless mode (default: `true`).
- * `launchOptions` (object): Native Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch), passed directly to the browser launcher (e.g., `slowMo`, `args`, `devtools`).
+ * `headless` (boolean): Whether to run the browser in headless mode (default: `true`).
+ * `launchOptions` (object): Native Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch), passed directly to the browser launcher (e.g., `slowMo`, `args`, `devtools`).
 
  ```typescript
  const result = await fetchWeb({
@@ -246,9 +276,7 @@ await session.executeAll([
  - Supports an optional `depth` parameter to limit how many levels to traverse upward.
  - Must include a maximum depth limit (default 1000) to prevent infinite loops.
 
- This architecture ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave highly consistently across engines.
-
- This decoupling ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically, whether in the fast Cheerio engine or in a full Playwright browser.
+ This architecture ensures that complex features such as **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically, whether in the fast Cheerio engine or in a full Playwright browser.
 
  ### Schema Normalization
 
package/README.engine.md CHANGED
@@ -64,9 +64,27 @@ The engine is initialized lazily upon the first action execution and remains fix
  1. **Explicit Option**: If `options.engine` (or temporary context override in `executeAll`) is provided and NOT set to `'auto'`.
  * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
  2. **Site Registry**: If set to `'auto'` (default), the system attempts to match the target URL against the `sites` registry.
- 3. **Smart Upgrade**: If enabled, the engine may be dynamically upgraded from `http` to `browser` based on response characteristics (e.g., bot detection or heavy JS).
+ 3. **Smart Upgrade**: If `enableSmart: true`, the system will automatically upgrade from `http` to `browser` under the following conditions:
+ - Returns `401 / 403 / 500 / 429`
+ - HTML content is identified as "highly dynamic" (heavy JS)
+ - `Retry-After` exceeds `upgradeThresholdMs`
+ - You can optionally sync Cookies/Session during upgrade (`syncStateOnUpgrade`)
+ - If upgrade fails or still doesn't meet requirements, the original error is thrown
  4. **Default**: Falls back to `'http'` (Cheerio).
 
+ #### Smart Upgrade & Retry
+
+ During `executeAll`, if an Action throws `ENGINE_UPGRADE_REQUIRED`:
+
+ - The system attempts to release the current engine
+ - Creates a new browser engine
+ - **Automatically re-executes the action list from the beginning**
+ - Side effects of successfully executed actions (e.g., cookie writes) can be preserved based on configuration
+
+ For `429` responses:
+ - If `Retry-After` exists and is less than `upgradeThresholdMs`
+ - The system will automatically retry within the same engine without triggering an upgrade
+
  #### Batch Execution with Overrides
 
  You can execute a sequence of actions with temporary configuration overrides (e.g., headers, timeout) that apply only to that specific batch, without modifying the session's global state.
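Note on the "Smart Upgrade & Retry" addition in the hunk above: the control flow it describes (release the engine, create a browser engine, re-run the action list from the first action) can be sketched as a restart loop. Everything here is illustrative: `EngineKind`, `Action`, `upgradeRequired`, and `runAll` are hypothetical names, the actions are synchronous to keep the flow easy to follow, and the sketch assumes only one upgrade is attempted. The package's real `executeAll` is asynchronous and may differ.

```typescript
// Hypothetical types for illustration; not the package's real API.
type EngineKind = 'http' | 'browser';
type Action = (engine: EngineKind) => string;

// Stand-in for the documented ENGINE_UPGRADE_REQUIRED signal.
function upgradeRequired(): Error {
  return Object.assign(new Error('engine upgrade required'), {
    code: 'ENGINE_UPGRADE_REQUIRED',
  });
}

function runAll(actions: Action[]): { engine: EngineKind; results: string[] } {
  let engine: EngineKind = 'http';
  for (;;) {
    const results: string[] = []; // results are discarded on restart
    try {
      for (const action of actions) results.push(action(engine));
      return { engine, results };
    } catch (err) {
      const code = (err as { code?: string }).code;
      // Assumption: upgrade at most once; any other error, or a failure
      // while already in browser mode, propagates to the caller.
      if (code !== 'ENGINE_UPGRADE_REQUIRED' || engine === 'browser') throw err;
      engine = 'browser'; // "release" the http engine and restart the list
    }
  }
}

// Usage: the second action only works in the browser engine.
const log: string[] = [];
const outcome = runAll([
  (engine) => { log.push(`goto:${engine}`); return 'loaded'; },
  (engine) => {
    if (engine === 'http') throw upgradeRequired();
    log.push(`click:${engine}`);
    return 'clicked';
  },
]);
console.log(outcome.engine, log);
```

The log shows why the README calls out side effects: the first action runs twice (once per engine), so anything it wrote during the `http` pass either has to be preserved deliberately or redone.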
@@ -112,6 +130,16 @@ If both `sessionState` and `cookies` are provided, the engine adopts a **"Merge
 
  ---
 
+ ### Error Enhancement & Retry-After Support
+
+ - All HTTP errors now attach the original `FetchResponse` to the error object (`error.response`)
+ - Support for parsing HTTP `Retry-After` header:
+ - Integer (seconds) format
+ - HTTP date format
+ - Retry wait time hints are included in error messages
+
+ ---
+
  ## 🏗️ 3. Architecture and Workflow
 
  The engine's architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.
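Note on the `Retry-After` support added in the hunk above: the header has two forms, a non-negative integer number of seconds and an HTTP date. A minimal parser for both might look like the following. `parseRetryAfterMs` is a hypothetical name chosen for this note; the diff does not show how the package itself parses the header.

```typescript
// Hypothetical helper (an assumption, not the package's code): converts a
// Retry-After header value into a wait time in milliseconds.
function parseRetryAfterMs(
  header: string | undefined,
  now: number = Date.now(),
): number | undefined {
  if (!header) return undefined;
  const value = header.trim();
  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(value);
  if (Number.isNaN(date)) return undefined;
  // A date in the past means "retry now", so clamp at zero.
  return Math.max(0, date - now);
}

const fixedNow = Date.parse('Wed, 21 Oct 2015 07:28:00 GMT');
console.log(parseRetryAfterMs('120'));
console.log(parseRetryAfterMs('Wed, 21 Oct 2015 07:28:30 GMT', fixedNow));
```

Passing `now` explicitly keeps the date branch deterministic in tests; in normal use the default `Date.now()` applies.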
@@ -158,7 +186,7 @@ There are two primary engine implementations:
  * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
  * ✅ **HTTP-Compliant Redirects**: Correctly handles 301-303 and 307/308 redirects, preserving methods/bodies or converting to GET as per HTTP specifications.
  * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
- * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests. **Browser-only actions** (e.g., `mouseMove`, `keyboardType`) will throw a `not_supported` error.
+ * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests. **Browser-only actions** (e.g., `mouseMove`, `mouseWheel`, `keyboardType`) will throw a `not_supported` error.
  * **Use Case**: Scraping static websites, server-rendered pages, or APIs.
 
  ### `PlaywrightFetchEngine` (browser mode)
package/README.md CHANGED
@@ -20,9 +20,23 @@ English | [简体中文](./README.cn.md)
  * **📜 Declarative Action Scripts**: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format.
  * **📊 Powerful and Flexible Data Extraction**: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive and powerful declarative Schema.
  * **🧠 Smart Engine Selection**: Automatically detects dynamic sites and can upgrade the engine from `http` to `browser` on the fly.
+ * **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps to bypass common anti-bot measures like Cloudflare challenges.
+ * **🕹️ High-Fidelity Interaction Simulation**: Supports Bézier curve-based mouse trajectory movement, realistic typing delay simulation, and complex keyboard interactions to significantly improve anti-bot evasion.
  * **🧩 Extensible**: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a `login` action).
  * **🧲 Advanced Collectors**: Asynchronously collect data in the background, triggered by events during the execution of a main action.
- * **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps to bypass common anti-bot measures like Cloudflare challenges.
+
+ ---
+
+ ### Smart Upgrade and Retry Strategy
+
+ When `enableSmart` is enabled, the system automatically determines whether an engine upgrade is needed based on response characteristics:
+
+ - Triggers for upgrade include:
+ - HTTP status codes: `401 / 403 / 500 / 429`
+ - Page appears to be dynamically rendered (detected typical JS framework signatures in HTML)
+ - `Retry-After` exceeds `upgradeThresholdMs`
+ - During upgrade, you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`)
+ - For `429` responses, if `Retry-After` is less than the `upgradeThresholdMs` threshold, the system will prioritize retry over upgrade
 
  ---
 
@@ -149,6 +163,12 @@ This is the main entry point for the library.
  * `headless` (boolean): Run in headless mode (default: `true`).
  * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
  * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
+ * `enableSmart` (boolean): Enable smart detection and automatic engine upgrade (default: `true`).
+ * `syncStateOnUpgrade` (boolean): Whether to sync Cookies / Session state when upgrading from http to browser engine (default: `false`).
+ * `upgradeThresholdMs` (number): Wait time threshold in milliseconds to trigger engine upgrade; upgrades if exceeded or no explicit retry info (default: `5000`).
+ * `maxRetries` (number): Maximum retry attempts for a single Action (default: `0`).
+ * `failOnError` (boolean): Whether to throw an exception when an Action fails (default: `true` for main flow, `false` for collector).
+ * `failOnTimeout` (boolean): Whether to treat timeout as failure (default: `false`).
  * ...and many other options for proxy, retries, etc.
 
  ### Built-in Actions
@@ -161,6 +181,8 @@ The library provides a set of powerful built-in actions, many of which are engin
  * `submit`: Submits a form (Engine-specific).
  * `mouseMove`: Moves the mouse cursor to a specific coordinate or element (Bézier curve supported).
  * `mouseClick`: Triggers a mouse click at the current position or specified coordinates.
+ * `mouseWheel`: Simulates a mouse wheel scroll event with horizontal and vertical deltas. Supports splitting into multiple steps and automatic scrolling to make the target element visible.
+ * `scrollIntoView`: Scrolls the page or a container to make a specific element visible in the viewport.
  * `keyboardType`: Simulates human-like typing into the currently focused element.
  * `keyboardPress`: Simulates pressing a single key or a key combination.
  * `trim`: Removes elements from the DOM to clean up the page.
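Note on the options and triggers added to `package/README.md` above: the documented defaults and the upgrade/retry rules can be put together in one small sketch. The option names and default values come from the README text; `SmartOptions`, `resolveOptions`, `ResponseInfo`, and `decide` are hypothetical names for illustration. Two points are assumptions because the README does not settle them: a `Retry-After` exactly equal to the threshold is treated as an upgrade, and `enableSmart: false` skips the decision entirely.

```typescript
// Option names and defaults as documented in the README hunk above.
// `failOnError` is omitted because its default depends on the context
// (main flow vs. collector).
interface SmartOptions {
  enableSmart: boolean;
  syncStateOnUpgrade: boolean;
  upgradeThresholdMs: number;
  maxRetries: number;
  failOnTimeout: boolean;
}

const DOCUMENTED_DEFAULTS: SmartOptions = {
  enableSmart: true,
  syncStateOnUpgrade: false,
  upgradeThresholdMs: 5000,
  maxRetries: 0,
  failOnTimeout: false,
};

// Hypothetical helper: overlays caller overrides on the documented defaults.
function resolveOptions(overrides: Partial<SmartOptions> = {}): SmartOptions {
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

type Decision = 'proceed' | 'retry' | 'upgrade';

interface ResponseInfo {
  status: number;
  retryAfterMs?: number;  // parsed Retry-After, if the header was present
  looksDynamic?: boolean; // HTML heuristics flagged heavy client-side JS
}

const UPGRADE_STATUSES = new Set([401, 403, 429, 500]);

// Hypothetical decision function following the documented triggers.
function decide(res: ResponseInfo, options: SmartOptions): Decision {
  if (!options.enableSmart) return 'proceed'; // assumption: no smart handling
  const { status, retryAfterMs, looksDynamic } = res;
  // 429 with a short, explicit wait: retry in the same engine.
  if (status === 429 && retryAfterMs !== undefined && retryAfterMs < options.upgradeThresholdMs) {
    return 'retry';
  }
  if (retryAfterMs !== undefined && retryAfterMs > options.upgradeThresholdMs) return 'upgrade';
  if (UPGRADE_STATUSES.has(status) || looksDynamic === true) return 'upgrade';
  return 'proceed';
}

const opts = resolveOptions({ upgradeThresholdMs: 3000 });
console.log(decide({ status: 429, retryAfterMs: 2000 }, opts)); // retry
console.log(decide({ status: 429 }, opts));                     // upgrade
console.log(decide({ status: 200, looksDynamic: true }, opts)); // upgrade
```

A `429` without a `Retry-After` header falls through to the status check and upgrades, which is consistent with the "no explicit retry info" wording in the `upgradeThresholdMs` description.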