@isdk/web-fetcher 0.2.12 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (83)
  1. package/README.action.cn.md +197 -155
  2. package/README.action.extract.cn.md +263 -0
  3. package/README.action.extract.md +263 -0
  4. package/README.action.md +202 -147
  5. package/README.cn.md +25 -15
  6. package/README.engine.cn.md +118 -14
  7. package/README.engine.md +115 -14
  8. package/README.md +19 -10
  9. package/dist/index.d.mts +667 -50
  10. package/dist/index.d.ts +667 -50
  11. package/dist/index.js +1 -1
  12. package/dist/index.mjs +1 -1
  13. package/docs/README.md +19 -10
  14. package/docs/_media/README.action.md +202 -147
  15. package/docs/_media/README.cn.md +25 -15
  16. package/docs/_media/README.engine.md +115 -14
  17. package/docs/classes/CheerioFetchEngine.md +805 -135
  18. package/docs/classes/ClickAction.md +33 -33
  19. package/docs/classes/EvaluateAction.md +559 -0
  20. package/docs/classes/ExtractAction.md +33 -33
  21. package/docs/classes/FetchAction.md +39 -33
  22. package/docs/classes/FetchEngine.md +660 -122
  23. package/docs/classes/FetchSession.md +38 -16
  24. package/docs/classes/FillAction.md +33 -33
  25. package/docs/classes/GetContentAction.md +33 -33
  26. package/docs/classes/GotoAction.md +33 -33
  27. package/docs/classes/KeyboardPressAction.md +533 -0
  28. package/docs/classes/KeyboardTypeAction.md +533 -0
  29. package/docs/classes/MouseClickAction.md +533 -0
  30. package/docs/classes/MouseMoveAction.md +533 -0
  31. package/docs/classes/PauseAction.md +33 -33
  32. package/docs/classes/PlaywrightFetchEngine.md +820 -122
  33. package/docs/classes/SubmitAction.md +33 -33
  34. package/docs/classes/TrimAction.md +533 -0
  35. package/docs/classes/WaitForAction.md +33 -33
  36. package/docs/classes/WebFetcher.md +9 -9
  37. package/docs/enumerations/FetchActionResultStatus.md +4 -4
  38. package/docs/functions/fetchWeb.md +6 -6
  39. package/docs/globals.md +14 -0
  40. package/docs/interfaces/BaseFetchActionProperties.md +12 -12
  41. package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
  42. package/docs/interfaces/BaseFetcherProperties.md +32 -28
  43. package/docs/interfaces/Cookie.md +14 -14
  44. package/docs/interfaces/DispatchedEngineAction.md +4 -4
  45. package/docs/interfaces/EvaluateActionOptions.md +81 -0
  46. package/docs/interfaces/ExtractActionProperties.md +12 -12
  47. package/docs/interfaces/FetchActionInContext.md +15 -15
  48. package/docs/interfaces/FetchActionProperties.md +13 -13
  49. package/docs/interfaces/FetchActionResult.md +6 -6
  50. package/docs/interfaces/FetchContext.md +42 -38
  51. package/docs/interfaces/FetchEngineContext.md +37 -33
  52. package/docs/interfaces/FetchMetadata.md +5 -5
  53. package/docs/interfaces/FetchResponse.md +14 -14
  54. package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
  55. package/docs/interfaces/FetchSite.md +35 -31
  56. package/docs/interfaces/FetcherOptions.md +34 -30
  57. package/docs/interfaces/GotoActionOptions.md +14 -6
  58. package/docs/interfaces/KeyboardPressParams.md +25 -0
  59. package/docs/interfaces/KeyboardTypeParams.md +25 -0
  60. package/docs/interfaces/MouseClickParams.md +49 -0
  61. package/docs/interfaces/MouseMoveParams.md +41 -0
  62. package/docs/interfaces/PendingEngineRequest.md +3 -3
  63. package/docs/interfaces/StorageOptions.md +5 -5
  64. package/docs/interfaces/SubmitActionOptions.md +2 -2
  65. package/docs/interfaces/TrimActionOptions.md +27 -0
  66. package/docs/interfaces/WaitForActionOptions.md +5 -5
  67. package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
  68. package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
  69. package/docs/type-aliases/BrowserEngine.md +1 -1
  70. package/docs/type-aliases/FetchActionCapabilities.md +1 -1
  71. package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
  72. package/docs/type-aliases/FetchActionOptions.md +1 -1
  73. package/docs/type-aliases/FetchEngineAction.md +2 -2
  74. package/docs/type-aliases/FetchEngineType.md +1 -1
  75. package/docs/type-aliases/FetchReturnType.md +1 -1
  76. package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
  77. package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
  78. package/docs/type-aliases/ResourceType.md +1 -1
  79. package/docs/type-aliases/TrimPreset.md +13 -0
  80. package/docs/variables/DefaultFetcherProperties.md +1 -1
  81. package/docs/variables/FetcherOptionKeys.md +1 -1
  82. package/docs/variables/TRIM_PRESETS.md +11 -0
  83. package/package.json +11 -11
@@ -12,6 +12,14 @@
 
 > ℹ️ 该系统建立在 [Crawlee](https://crawlee.dev/) 库之上,并使用其强大的爬虫抽象。
 
+### Debug 模式与追踪 (Debug Mode & Tracing)
+
+当启用 `debug: true` 选项时,引擎会提供其内部操作的详细追踪信息:
+
+1. **详细日志追踪**:每个主要步骤(请求处理、元素选择、数据提取)都会记录到控制台,并带有 `[FetchEngine:id]` 前缀。
+2. **提取洞察**:在执行 `extract()` 期间,引擎会记录每个选择器匹配到的元素数量以及正在提取的具体值,这使得调试复杂或错位的 Schema 变得更加容易。
+3. **元数据 (Metadata)**:`FetchResponse` 将包含一个丰富的 `metadata` 对象,其中包含引擎详情、计时指标(如果可用)和代理信息。
+
 ---
 
 ## 🧩 2. 核心概念
@@ -20,13 +28,23 @@
 
 这是定义所有抓取引擎契约的抽象基类。
 
-* **职责**:为导航、内容检索和用户交互等操作提供一致的高级 API
+* **职责**:为导航、内容检索和用户交互等操作提供一致的高级 API。它充当核心 **动作分发器 (Action Dispatcher)**,统一处理引擎无关的逻辑(如 `extract`、`pause`、`getContent`),并将其余动作委托给具体实现。
 * **关键抽象**:
   * **生命周期**:`initialize()` 和 `cleanup()` 方法。
   * **核心操作**:`goto()`、`getContent()`、`click()`、`fill()`、`submit()`、`waitFor()`、`extract()`。
+  * **DOM 原子操作 (Primitives)**: `_querySelectorAll()`、`_extractValue()`、`_parentElement()`、`_nextSiblingsUntil()`。
 * **配置与状态**:`headers()`、`cookies()`、`blockResources()`、`getState()`、`sessionPoolOptions`。
 * **静态注册表**:它维护所有可用引擎实现的静态注册表(`FetchEngine.register`),允许通过 `id` 或 `mode` 动态选择引擎。
 
+### `FetchElementScope`
+
+**`FetchElementScope`** 是引擎相关的 DOM 元素“句柄”或“上下文”。
+
+- 在 **`CheerioFetchEngine`** 中,它是 Cheerio API (`$`) 与当前选择 (`el`) 的组合。
+- 在 **`PlaywrightFetchEngine`** 中,它是 `Locator`。
+
+所有提取和交互的原子操作都基于此 Scope 运行,确保了在不同底层技术之上引用元素的统一方式。
+
 ### `FetchEngine.create(context, options)`
 
 此静态工厂方法是创建引擎实例的指定入口点。它会自动选择并初始化合适的引擎。
@@ -69,14 +87,14 @@ await session.executeAll([
 引擎支持在多次执行之间持久化和恢复会话状态(主要是 Cookie)。
 
 * **灵活的会话隔离与存储控制**:库提供了对会话数据存储和隔离的精细控制,通过 `storage` 配置实现:
-    * **`id`**:自定义存储标识符。
-        * **隔离(默认)**:如果省略,每个会话将获得一个唯一 ID,确保 `RequestQueue`、`KeyValueStore` 和 `SessionPool` 完全隔离。
-        * **共享**:在不同会话中提供相同的 `id` 可以让它们共享底层存储,适用于保持持久登录状态。
-    * **`persist`**:(boolean) 是否启用磁盘持久化(对应 Crawlee 的 `persistStorage`)。默认为 `false`(仅在内存中)。
-    * **`purge`**:(boolean) 会话关闭时是否删除存储(清理 `RequestQueue` 和 `KeyValueStore`)。默认为 `true`。
-        * 设置 `purge: false` 并配合固定的 `id`,可以创建跨应用重启依然存在的持久会话。
-    * **`config`**:允许向底层 Crawlee 实例传递原生配置。
-        * **注意**:当 `persist` 为 true 时,在 config 中使用 `localDataDirectory` 指定存储路径(例如:`storage: { persist: true, config: { localDataDirectory: './my-data' } }`)。
+  * **`id`**:自定义存储标识符。
+    * **隔离(默认)**:如果省略,每个会话将获得一个唯一 ID,确保 `RequestQueue`、`KeyValueStore` 和 `SessionPool` 完全隔离。
+    * **共享**:在不同会话中提供相同的 `id` 可以让它们共享底层存储,适用于保持持久登录状态。
+  * **`persist`**:(boolean) 是否启用磁盘持久化(对应 Crawlee 的 `persistStorage`)。默认为 `false`(仅在内存中)。
+  * **`purge`**:(boolean) 会话关闭时是否删除存储(清理 `RequestQueue` 和 `KeyValueStore`)。默认为 `true`。
+    * 设置 `purge: false` 并配合固定的 `id`,可以创建跨应用重启依然存在的持久会话。
+  * **`config`**:允许向底层 Crawlee 实例传递原生配置。
+    * **注意**:当 `persist` 为 true 时,在 config 中使用 `localDataDirectory` 指定存储路径(例如:`storage: { persist: true, config: { localDataDirectory: './my-data' } }`)。
 * **`sessionState`**: 一个完整的状态对象(源自 Crawlee 的 SessionPool),可用于完全恢复之前的会话。该状态会**自动包含在每个 `FetchResponse` 中**,方便进行持久化,并在以后初始化引擎时通过选项传回。
 * **`sessionPoolOptions`**: 允许对底层的 Crawlee `SessionPool` 进行高级配置(例如 `maxUsageCount`, `maxPoolSize`)。
 * **`overrideSessionState`**: 如果设置为 `true`,则强制引擎使用提供的 `sessionState` 覆盖存储中的任何现有持久化状态。当你希望确保会话以确切提供的状态启动,忽略持久化层中的任何陈旧数据时,这非常有用。
@@ -86,6 +104,7 @@ await session.executeAll([
 
 **优先级规则:**
 如果同时提供了 `sessionState` 和 `cookies`,引擎将采用**“合并并覆盖”**策略:
+
 1. 首先从 `sessionState` 恢复会话。
 2. 然后在之上应用显式的 `cookies`。
    * **结果**:`sessionState` 中任何冲突的 Cookie 都会被显式的 `cookies` **覆盖**。
@@ -114,8 +133,11 @@ await session.executeAll([
    * 页面上下文用于解析 `goto()` 调用返回的 `Promise`。
    * 页面被标记为“活动”状态 (`isPageActive = true`)。
    * 至关重要的是,在 `requestHandler` 返回之前,它会启动一个**动作循环** (`_executePendingActions`)。此循环通过监听 `EventEmitter` 上的事件,有效地**暂停 `requestHandler`**,从而保持页面上下文的存活。
+   * **严格顺序执行与重入保护**:循环使用内部队列确保所有动作按分发顺序执行。同时包含重入保护,允许组合动作调用原子动作而不产生死锁。
 5. **交互式动作 (`click`, `fill` 等)**:消费者现在可以调用 `await engine.click(...)`。此方法将一个动作分派到 `EventEmitter` 并返回一个新的 `Promise`。
-6. **动作执行**:仍在原始 `requestHandler` 作用域内运行的动作循环,会监听到该事件。因为它能访问页面上下文,所以可以执行*实际的*交互(例如 `page.click(...)`)。
+6. **动作执行**:仍在原始 `requestHandler` 作用域内运行的动作循环,会监听到该事件。
+   * **统一处理的动作**:像 `extract`、`pause` 和 `getContent` 这样的动作会直接由 `FetchEngine` 基类使用统一的逻辑进行处理。
+   * **委托执行的动作**:引擎相关的交互(如 `click`、`fill`)会委托给子类的 `executeAction` 实现。
 7. **健壮的清理**:当调用 `dispose()` 或 `cleanup()` 时:
    * 设置 `isEngineDisposed` 标志以阻止新动作。
    * 发出 `dispose` 信号以唤醒并终止动作循环。
@@ -134,8 +156,9 @@ await session.executeAll([
 * **机制**: 使用 `CheerioCrawler` 通过原始 HTTP 请求抓取页面并解析静态 HTML。
 * **行为**:
   * ✅ **快速轻量**:非常适合追求速度和低资源消耗的场景。
+  * ✅ **符合 HTTP 标准的重定向**:正确处理 301-303 和 307/308 重定向,按照 HTTP 规范保留方法/正文或转换为 GET。
   * ❌ **无 JavaScript 执行**:无法与客户端渲染的内容交互。
-  * ⚙️ **模拟交互**:像 `click` 和 `submit` 这样的动作是通过发起新的 HTTP 请求来模拟的。
+  * ⚙️ **模拟交互**:像 `click` 和 `submit` 这样的动作是通过发起新的 HTTP 请求来模拟的。**仅浏览器支持的动作**(如 `mouseMove`, `keyboardType`)将抛出 `not_supported` 错误。
 * **用例**: 抓取静态网站、服务器渲染页面或 API。
 
 ### `PlaywrightFetchEngine` (browser 模式)
@@ -160,17 +183,98 @@ await session.executeAll([
 * **用例**: 抓取受 Cloudflare 或其他高级机器人检测系统保护的网站。
 * **注意**: 此功能需要额外的依赖项(`camoufox-js`, `firefox`),并可能带来性能开销。
 
+#### 配置 (Configuration)
+
+您可以通过选项中的 `browser` 属性来配置浏览器引擎:
+
+* `headless` (boolean): 是否以无头模式运行浏览器(默认:`true`)。
+* `launchOptions` (object): 原生 Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch),直接传递给浏览器启动器(例如 `slowMo`, `args`, `devtools`)。
+
+```typescript
+const result = await fetchWeb({
+  url: 'https://example.com',
+  engine: 'browser',
+  browser: {
+    headless: false,
+    launchOptions: {
+      slowMo: 100, // 将操作减慢 100ms
+      args: ['--start-maximized'] // 传递自定义参数
+    }
+  }
+});
+```
+
 ---
 
 ## 📊 5. 使用 `extract()` 进行数据提取
 
 `extract()` 方法提供了一种强大的声明式方式来从网页中提取结构化数据。它通过一个 **Schema (模式)** 来定义您期望的 JSON 输出结构,引擎会自动处理 DOM 遍历和数据提取。
 
-### 核心设计: Schema 规范化
+### 核心设计: 三层提取架构
+
+为了确保不同引擎之间的一致性并保持高质量,提取系统分为三个层次:
+
+1. **规范化层 (`src/core/normalize-extract-schema.ts`)**: 将用户提供的各种简写模式预处理为统一的内部格式,处理 CSS 筛选器的合并。
+2. **核心提取逻辑 (`src/core/extract.ts`)**: 引擎无关层,负责提取的工作流分发。它通过 **Dispatcher (`_extract`)** 将任务分发给 `_extractObject`、`_extractArray` 或 `_extractValue`。该层管理递归、严格模式/必填字段验证、锚点解析、性能优化的树操作(LCA、冒泡)以及顺序消费游标逻辑。
+3. **引擎接口层 (`IExtractEngine`)**: 在核心层定义,由各引擎实现,提供底层的 DOM 原子操作。
+
+#### `IExtractEngine` 实现准则
+
+为了保持跨引擎的一致性,所有实现必须遵循以下行为契约:
+
+- **`_querySelectorAll`**:
+  - 必须按 **DOM 文档顺序**返回匹配的元素。
+  - 必须检查 Scope 元素**自身**是否匹配选择器,并搜索其**后代**。
+- **`_nextSiblingsUntil`**:
+  - 必须返回一个平铺的兄弟节点列表,从起始锚点之后开始,到第一个匹配 `untilSelector` 的元素之前停止。
+- **`_isSameElement`**:
+  - 必须基于元素身份(Identity)进行比较,而不是内容。
+- **`_findClosestAncestor`**:
+  - 必须高效地查找在给定候选集中的最近祖先元素。
+  - 必须针对浏览器引擎进行优化,避免多次 IPC 调用。
+- **`_contains`**:
+  - 必须实现标准的 DOM `Node.contains()` 行为。
+  - 必须针对高频边界检查进行优化。
+- **`_findCommonAncestor`**:
+  - 必须查找两个元素的最近公共祖先 (Lowest Common Ancestor, LCA)。
+  - **性能关键**: 在浏览器引擎中,这必须在单次 `evaluate` 调用中执行,以最小化 IPC (进程间通信) 开销。
+- **`_findContainerChild`**:
+  - 必须在容器中查找包含特定后代元素的直系子元素。
+  - **性能关键**: 这取代了在 Node.js 环境中的手动“冒泡”循环,显著减少了深层 DOM 树的操作开销。
+- **`_bubbleUpToScope` (内部辅助方法)**:
+  - 实现从深层元素向上冒泡到当前作用域内直接祖先的逻辑。
+  - 支持可选的 `depth` 参数来限制向上遍历的层级。
+  - 必须包含最大深度限制 (默认 1000) 以防止无限循环。
+
+这种架构确保了诸如 **列对齐 (Columnar Alignment)**、**分段扫描 (Segmented Scanning)** 以及 **属性锚点跳转 (Anchor Jumping)** 等复杂功能在不同引擎下表现高度一致。
+
+这种解耦确保了诸如 **列对齐 (Columnar Alignment)**、**分段扫描 (Segmented Scanning)** 以及 **属性锚点跳转 (Anchor Jumping)** 等复杂功能,无论是在快速的 Cheerio 引擎还是完整的 Playwright 浏览器中,其行为都完全一致。
+
+### Schema 规范化
+
+为了提升易用性和灵活性,`extract` 方法在内部实现了一个**“规范化 (Normalization)”**层。这意味着您可以提供语义清晰的简写形式,在执行前,引擎会自动将其转换为标准的内部格式。
+
+#### 1. 简写规则
+
+- **字符串简写**: 一个简单的字符串(如 `'h1'`)会自动扩展为 `{ selector: 'h1', type: 'string', mode: 'text' }`。
+- **隐式对象简写**: 如果你提供一个没有显式 `type: 'object'` 的对象,它会被视为一个 `object` 模式,其中对象的键即为要提取的属性名。
+  - *示例*: `{ "title": "h1" }` 变为 `{ "type": "object", "properties": { "title": { "selector": "h1" } } }`。
+- **筛选器简写**: 如果在提供 `selector` 的同时提供了 `has` 或 `exclude`,它们会自动使用 `:has()` 和 `:not()` 伪类合并到 CSS 选择器中。
+  - *示例*: `{ "selector": "div", "has": "p" }` 变为 `{ "selector": "div:has(p)" }`。
+- **数组简写**: 在 `array` 模式下直接提供 `attribute` 会作为其 `items` 的简写。
+  - *示例*: `{ "type": "array", "attribute": "href" }` 变为 `{ "type": "array", "items": { "type": "string", "attribute": "href" } }`。
+
+#### 2. 配置与数据 (关键字分离)
+
+在**隐式对象 (Implicit Objects)** 中,引擎必须区分*配置* (去哪里找) 和*数据* (提取什么)。
+
+- **上下文配置关键字 (Context Keys)**:`selector`、`has`、`exclude`、`required`、`strict` 和 `depth` 是保留的,用于定义提取上下文和校验规则。它们保留在 Schema 的根部。
+- **数据关键字 (Data Keys)**:所有其他键 (包括 `items`、`attribute`、`mode`,甚至名为 `type` 的字段) 都会被移动到 `properties` 对象中,作为要提取的数据字段。
+- **冲突处理**:您可以安全地提取名为 `type` 的字段,只要其值不是保留的 Schema 类型关键字 (`string`、`number`、`boolean`、`html`、`object`、`array`)。
 
-为了提升易用性和灵活性,`extract` 方法在内部实现了一个**“规范化 (Normalization)”**层。这意味着您可以提供语义清晰的简写形式,在执行前,引擎会自动将其转换为标准的、更详细的内部格式。这使得编写复杂的提取规则变得简单直观。
+#### 3. 跨引擎一致性
 
-### Schema 结构
+规范化层确保了无论你使用的是 `cheerio` (http) 还是 `playwright` (browser) 引擎,复杂的简写逻辑行为都完全一致,提供了一个统一的“AI 友好”接口。
 
 一个 Schema 可以是以下三种类型之一:
 
package/README.engine.md CHANGED
@@ -12,6 +12,14 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 
 > ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library, using its powerful crawler abstractions.
 
+### Debug Mode & Tracing
+
+When the `debug: true` option is enabled, the engine provides detailed tracing of its internal operations:
+
+1. **Detailed Tracing**: Every major step (request processing, element selection, data extraction) is logged to the console with a `[FetchEngine:id]` prefix.
+2. **Extraction Insights**: During `extract()`, the engine logs how many elements were matched for each selector and the specific values being extracted, making it easier to debug complex or misaligned schemas.
+3. **Metadata**: The `FetchResponse` will include an enriched `metadata` object containing engine details, timing metrics (where available), and proxy information.
+
 ---
 
 ## 🧩 2. Core Concepts
@@ -20,13 +28,23 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 
 This is the abstract base class that defines the contract for all fetch engines.
 
-* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction.
+* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction. It acts as the central **Action Dispatcher**, handling engine-agnostic logic (like `extract`, `pause`, `getContent`) and delegating others.
 * **Key Abstractions**:
   * **Lifecycle**: `initialize()` and `cleanup()` methods.
   * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
+  * **DOM Primitives**: `_querySelectorAll()`, `_extractValue()`, `_parentElement()`, `_nextSiblingsUntil()`.
 * **Configuration & State**: `headers()`, `cookies()`, `blockResources()`, `getState()`, `sessionPoolOptions`.
 * **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing for dynamic selection by `id` or `mode`.
 
+### `FetchElementScope`
+
+The **`FetchElementScope`** is the engine-specific "handle" or "context" for a DOM element.
+
+- In **`CheerioFetchEngine`**, it is a combination of the Cheerio API (`$`) and the current selection (`el`).
+- In **`PlaywrightFetchEngine`**, it is a `Locator`.
+
+All extraction and interaction primitives operate on this scope, ensuring a unified way to reference elements across different underlying technologies.
+
 ### `FetchEngine.create(context, options)`
 
 This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
@@ -69,14 +87,14 @@ await session.executeAll([
 The engine supports persisting and restoring session state (primarily cookies) between executions.
 
 * **Flexible Session Isolation & Storage**: The library provides fine-grained control over how session data is stored and isolated via the `storage` configuration:
-    * **`id`**: A custom string to identify the storage.
-        * **Isolation (Default)**: If omitted, each session gets a unique ID, ensuring complete isolation of `RequestQueue`, `KeyValueStore`, and `SessionPool`.
-        * **Sharing**: Providing the same `id` across sessions allows them to share the same underlying storage, useful for persistent login sessions.
-    * **`persist`**: (boolean) Whether to enable disk persistence (Crawlee's `persistStorage`). Defaults to `false` (in-memory).
-    * **`purge`**: (boolean) Whether to delete the storage (drop `RequestQueue` and `KeyValueStore`) when the session is closed. Defaults to `true`.
-        * Set `purge: false` and provide a fixed `id` to create a truly persistent session that survives across application restarts.
-    * **`config`**: Allows passing raw configuration to the underlying Crawlee instance.
-        * **Note**: When `persist` is true, use `localDataDirectory` in the config to specify the storage path (e.g., `storage: { persist: true, config: { localDataDirectory: './my-data' } }`).
+  * **`id`**: A custom string to identify the storage.
+    * **Isolation (Default)**: If omitted, each session gets a unique ID, ensuring complete isolation of `RequestQueue`, `KeyValueStore`, and `SessionPool`.
+    * **Sharing**: Providing the same `id` across sessions allows them to share the same underlying storage, useful for persistent login sessions.
+  * **`persist`**: (boolean) Whether to enable disk persistence (Crawlee's `persistStorage`). Defaults to `false` (in-memory).
+  * **`purge`**: (boolean) Whether to delete the storage (drop `RequestQueue` and `KeyValueStore`) when the session is closed. Defaults to `true`.
+    * Set `purge: false` and provide a fixed `id` to create a truly persistent session that survives across application restarts.
+  * **`config`**: Allows passing raw configuration to the underlying Crawlee instance.
+    * **Note**: When `persist` is true, use `localDataDirectory` in the config to specify the storage path (e.g., `storage: { persist: true, config: { localDataDirectory: './my-data' } }`).
 * **`sessionState`**: A comprehensive state object (derived from Crawlee's SessionPool) that can be used to fully restore a previous session. This state is **automatically included in every `FetchResponse`**, making it easy to persist and later provide back to the engine during initialization.
 * **`sessionPoolOptions`**: Allows advanced configuration of the underlying Crawlee `SessionPool` (e.g., `maxUsageCount`, `maxPoolSize`).
 * **`overrideSessionState`**: If set to `true`, it forces the engine to overwrite any existing persistent state in the storage with the provided `sessionState`. This is useful when you want to ensure the session starts with the exact state provided, ignoring any stale data in the persistence layer.
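Putting the storage options in the hunk above together, a configuration for a session that survives application restarts might look like this (field names as documented above; a standalone sketch, not a complete `fetchWeb` call):

```typescript
// Storage configuration for a session that survives restarts, assembled
// from the documented options: a fixed `id` shares the same underlying
// storage across runs, `persist` enables Crawlee's disk persistence, and
// `purge: false` keeps the RequestQueue/KeyValueStore on session close.
const storage = {
  id: 'login-session', // fixed id → storage shared across runs
  persist: true,       // enable disk persistence (Crawlee's persistStorage)
  purge: false,        // do not drop storage when the session closes
  config: {
    localDataDirectory: './my-data', // where persisted data is written
  },
};
```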
@@ -86,6 +104,7 @@ The engine supports persisting and restoring session state (primarily cookies) b
 
 **Precedence Rule:**
 If both `sessionState` and `cookies` are provided, the engine adopts a **"Merge and Override"** strategy:
+
 1. The session is first restored from the `sessionState`.
 2. The explicit `cookies` are then applied on top.
    * **Result:** Any conflicting cookies in `sessionState` will be **overwritten** by the explicit `cookies`.
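The "Merge and Override" precedence above amounts to a merge keyed by cookie identity, where explicit cookies win. A simplified sketch (the real engine merges through Crawlee's session machinery; the key shape here is an assumption):

```typescript
// Sketch of the "Merge and Override" cookie precedence: cookies restored
// from sessionState are applied first, then explicit cookies overwrite any
// entry with the same name + domain + path.
interface Cookie {
  name: string;
  value: string;
  domain?: string;
  path?: string;
}

function mergeCookies(fromSessionState: Cookie[], explicit: Cookie[]): Cookie[] {
  const keyOf = (c: Cookie) => `${c.name}|${c.domain ?? ''}|${c.path ?? '/'}`;
  const merged = new Map<string, Cookie>();
  for (const c of fromSessionState) merged.set(keyOf(c), c);
  for (const c of explicit) merged.set(keyOf(c), c); // explicit cookies win
  return [...merged.values()];
}

const result = mergeCookies(
  [{ name: 'sid', value: 'stale', domain: 'example.com' }],
  [{ name: 'sid', value: 'fresh', domain: 'example.com' }],
);
// result contains a single 'sid' cookie whose value is 'fresh'
```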
@@ -114,8 +133,11 @@ Our engine solves this by creating a bridge between the external API calls and t
    * The page context is used to resolve the `Promise` from the `goto()` call.
    * The page is marked as "active" (`isPageActive = true`).
    * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). This loop effectively **pauses the `requestHandler`** by listening for events on an `EventEmitter`, keeping the page context alive.
+   * **Strict Sequential Execution & Re-entrancy**: The loop uses an internal queue to ensure all actions are executed in the exact order they were dispatched. It also includes re-entrancy protection to allow composite actions to call atomic actions without deadlocking.
 5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
-6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event. Because it has access to the page context, it can perform the *actual* interaction (e.g., `page.click(...)`).
+6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event.
+   * **Centralized Actions**: Actions like `extract`, `pause`, and `getContent` are processed immediately by the `FetchEngine` base class using the unified logic.
+   * **Delegated Actions**: Engine-specific interactions (e.g., `click`, `fill`) are delegated to the subclass's `executeAction` implementation.
 7. **Robust Cleanup**: When `dispose()` or `cleanup()` is called:
    * An `isEngineDisposed` flag is set to prevent new actions.
    * A `dispose` signal is emitted to wake up and terminate the action loop.
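The bridge described in the hunk above can be sketched as a minimal, self-contained pattern: a loop that stays inside the handler's scope, drains a queue in dispatch order, and sleeps on an `EventEmitter` until woken by an action or a dispose signal. All names here (`ActionBridge`, `dispatch`, `run`) are illustrative, not the engine's real API, and the real loop also handles re-entrancy and error propagation:

```typescript
import { EventEmitter } from 'node:events';

// Minimal sketch of the "action loop" bridge described above.
type PendingAction = { id: string; resolve: (result: string) => void };

class ActionBridge extends EventEmitter {
  private queue: PendingAction[] = [];

  // Called by the consumer (engine.click(...) etc.): enqueue and wake the loop.
  dispatch(id: string): Promise<string> {
    return new Promise((resolve) => {
      this.queue.push({ id, resolve });
      this.emit('action');
    });
  }

  // Runs "inside the requestHandler": executes actions sequentially,
  // keeping the surrounding scope (the page context) alive.
  async run(execute: (id: string) => string): Promise<void> {
    for (;;) {
      while (this.queue.length > 0) {
        const action = this.queue.shift()!;
        action.resolve(execute(action.id)); // strict dispatch order
      }
      const disposed = await new Promise<boolean>((resolve) => {
        const onAction = () => { this.off('dispose', onDispose); resolve(false); };
        const onDispose = () => { this.off('action', onAction); resolve(true); };
        this.once('action', onAction);
        this.once('dispose', onDispose);
      });
      if (disposed) return; // woken up by cleanup
    }
  }

  dispose(): void {
    this.emit('dispose');
  }
}

async function demo(): Promise<string[]> {
  const bridge = new ActionBridge();
  const loop = bridge.run((id) => `did:${id}`);
  const results = [await bridge.dispatch('click'), await bridge.dispatch('fill')];
  bridge.dispose(); // wakes and terminates the loop
  await loop;
  return results;
}
```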
@@ -134,8 +156,9 @@ There are two primary engine implementations:
 * **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP and parse static HTML.
 * **Behavior**:
   * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
+  * ✅ **HTTP-Compliant Redirects**: Correctly handles 301-303 and 307/308 redirects, preserving methods/bodies or converting to GET as per HTTP specifications.
   * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
-  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests.
+  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests. **Browser-only actions** (e.g., `mouseMove`, `keyboardType`) will throw a `not_supported` error.
 * **Use Case**: Scraping static websites, server-rendered pages, or APIs.
 
 ### `PlaywrightFetchEngine` (browser mode)
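The redirect rule mentioned in the hunk above boils down to a small mapping. This is a sketch of standard HTTP client behavior (per RFC 9110 and common browser practice), not the engine's exact code:

```typescript
// 301-303 switch non-GET/HEAD requests to GET (dropping the body), while
// 307/308 preserve the original method and body.
function redirectedMethod(status: number, method: string): string {
  if (status === 307 || status === 308) return method; // method preserved
  if (status >= 301 && status <= 303) {
    // 303 always switches to GET (except HEAD); clients historically
    // treat 301/302 the same way for non-GET methods.
    return method === 'GET' || method === 'HEAD' ? method : 'GET';
  }
  return method; // not a redirect status handled here
}

redirectedMethod(303, 'POST'); // → 'GET'
redirectedMethod(308, 'POST'); // → 'POST'
```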
@@ -160,17 +183,95 @@ To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an
 * **Use Case**: Scraping websites protected by services like Cloudflare or other advanced bot-detection systems.
 * **Note**: This feature requires additional dependencies (`camoufox-js`, `firefox`) and may have a performance overhead.
 
+#### Configuration
+
+You can configure the browser engine via the `browser` property in options:
+
+* `headless` (boolean): Whether to run the browser in headless mode (default: `true`).
+* `launchOptions` (object): Native Playwright [LaunchOptions](https://playwright.dev/docs/api/class-browsertype#browser-type-launch) passed directly to the browser launcher (e.g., `slowMo`, `args`, `devtools`).
+
+```typescript
+const result = await fetchWeb({
+  url: 'https://example.com',
+  engine: 'browser',
+  browser: {
+    headless: false,
+    launchOptions: {
+      slowMo: 100, // Slow down operations by 100ms
+      args: ['--start-maximized'] // Pass custom arguments
+    }
+  }
+});
+```
+
 ---
 
 ## 📊 5. Data Extraction with `extract()`
 
 The `extract()` method provides a powerful, declarative way to pull structured data from a web page. It uses a **Schema** to define the structure of your desired JSON output, and the engine automatically handles DOM traversal and data extraction.
 
-### Core Design: Schema Normalization
+### Core Design: The Three-Layer Architecture
+
+To ensure consistency across engines and maintain high quality, the extraction system is divided into three layers:
+
+1. **Normalization Layer (`src/core/normalize-extract-schema.ts`)**: Pre-processes user-provided schemas into a canonical internal format, handling CSS filter merging and implicit object detection.
+2. **Core Extraction Logic (`src/core/extract.ts`)**: An engine-agnostic layer responsible for the extraction workflow. It dispatches tasks to `_extractObject`, `_extractArray`, or `_extractValue` via a **Dispatcher (`_extract`)**. It manages recursion, strict mode, required field validation, anchor resolution, performance-optimized tree operations (LCA, bubbling), and sequential consumption cursors.
+3. **Engine Interface (`IExtractEngine`)**: Defined at the core layer and implemented by engines to provide low-level DOM primitives.
+
+#### Implementation Rules for `IExtractEngine`
+
+To maintain cross-engine consistency, all implementations MUST follow these behavior contracts:
+
+- **`_querySelectorAll`**:
+  - MUST return matching elements in **document order**.
+  - MUST check if the scope element(s) **themselves** match the selector and search their **descendants**.
+- **`_nextSiblingsUntil`**:
+  - MUST return a flat list of siblings starting *after* the anchor and stopping *before* the first element matching the `untilSelector`.
+- **`_isSameElement`**:
+  - MUST compare elements based on **identity**, not content.
+- **`_findClosestAncestor`**:
+  - MUST efficiently find the closest ancestor of an element that exists in a given set of candidates.
+  - MUST be optimized to avoid multiple IPC calls in browser-based engines.
+- **`_contains`**:
+  - MUST implement standard DOM `Node.contains()` behavior.
+  - MUST be optimized for high-frequency boundary checks.
+- **`_findCommonAncestor`**:
+  - MUST find the Lowest Common Ancestor (LCA) of two elements.
+  - **Performance Critical**: In browser engines, this MUST be executed within a single `evaluate` call to minimize IPC (Inter-Process Communication) overhead.
+- **`_findContainerChild`**:
+  - MUST find the direct child of a container that contains a specific descendant.
+  - **Performance Critical**: This replaces manual "bubble-up" loops in the Node.js context, significantly reducing overhead for deep DOM trees.
+- **`_bubbleUpToScope` (Internal Helper)**:
+  - Implements the logic to bubble up from a deep element to its direct ancestor in the current scope.
+  - Supports an optional `depth` parameter to limit how many parent levels to traverse.
+  - MUST include a maximum depth limit (default 1000) to prevent infinite loops.
+
+This architecture ensures that complex features like **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically across the fast Cheerio engine and the full Playwright browser.
+
+### Schema Normalization
+
+To enhance usability and flexibility, the `extract` method internally implements a **"Normalization"** layer. This allows you to provide semantically clear shorthands, which are automatically converted into a standardized internal format.
+
+#### 1. Shorthand Rules
+
+- **String Shorthand**: A simple string like `'h1'` is automatically expanded to `{ selector: 'h1', type: 'string', mode: 'text' }`.
+- **Implicit Object Shorthand**: If you provide an object without an explicit `type: 'object'`, it is automatically treated as an `object` schema where the keys are the property names.
+  - *Example*: `{ "title": "h1" }` becomes `{ "type": "object", "properties": { "title": { "selector": "h1" } } }`.
+- **Filter Shorthand**: If you provide `has` or `exclude` alongside a `selector`, they are automatically merged into the CSS selector using `:has()` and `:not()` pseudo-classes.
+- **Array Shorthand**: Providing `attribute` directly on an `array` schema acts as a shorthand for its `items`.
+  - *Example*: `{ "type": "array", "attribute": "href" }` becomes `{ "type": "array", "items": { "type": "string", "attribute": "href" } }`.
+
+#### 2. Context vs. Data (Keyword Separation)
+
+In **Implicit Objects**, the engine must distinguish between *configuration* (where to look) and *data* (what to extract).
+
+- **Context Keys**: The keys `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are reserved for defining the extraction context and validation. They stay at the root of the schema.
+- **Data Keys**: All other keys (including `items`, `attribute`, `mode`, or even a field named `type`) are moved into the `properties` object as data fields to be extracted.
+- **Collision Handling**: You can safely extract a field named `type` as long as its value is not one of the reserved schema type keywords (`string`, `number`, `boolean`, `html`, `object`, `array`).
 
-To enhance usability and flexibility, the `extract` method internally implements a **"Normalization"** layer. This means you can provide semantically clear shorthands, and the engine will automatically convert them into a standard, more verbose internal format before execution. This makes writing complex extraction rules simple and intuitive.
+#### 3. Cross-Engine Consistency
 
-### Schema Structure
+This normalization layer ensures that regardless of whether you are using the `cheerio` (http) or `playwright` (browser) engine, the complex shorthand logic behaves identically, providing a consistent "AI-friendly" interface.
 
 A schema can be one of three types:
 
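The shorthand and keyword-separation rules documented above can be condensed into a small standalone sketch. This is illustrative only, not the package's actual implementation in `src/core/normalize-extract-schema.ts`, which covers more cases (such as the array shorthand):

```typescript
// Simplified sketch of the documented normalization rules: string shorthand,
// filter shorthand (has/exclude merged into the selector), and the implicit
// object rule that separates context keys from data keys.
type Schema = string | { [key: string]: any };

const TYPE_KEYWORDS = ['string', 'number', 'boolean', 'html', 'object', 'array'];
// `has`/`exclude` are merged into the selector before this check runs.
const CONTEXT_KEYS = ['selector', 'required', 'strict', 'depth'];

function normalizeSchema(schema: Schema): Record<string, any> {
  // String shorthand: 'h1' → { selector: 'h1', type: 'string', mode: 'text' }
  if (typeof schema === 'string') {
    return { selector: schema, type: 'string', mode: 'text' };
  }
  const result: Record<string, any> = { ...schema };
  // Filter shorthand: merge `has`/`exclude` into the CSS selector.
  if (result.selector && result.has) {
    result.selector = `${result.selector}:has(${result.has})`;
    delete result.has;
  }
  if (result.selector && result.exclude) {
    result.selector = `${result.selector}:not(${result.exclude})`;
    delete result.exclude;
  }
  // Implicit object shorthand: without a reserved `type` keyword, every
  // non-context key becomes a property to extract (keyword separation).
  if (!TYPE_KEYWORDS.includes(result.type)) {
    const properties: Record<string, any> = {};
    for (const key of Object.keys(result)) {
      if (!CONTEXT_KEYS.includes(key)) {
        properties[key] = normalizeSchema(result[key]);
        delete result[key];
      }
    }
    result.type = 'object';
    result.properties = properties;
  }
  return result;
}

const normalized = normalizeSchema({ selector: 'article', has: 'p', title: 'h1' });
// → { selector: 'article:has(p)', type: 'object',
//     properties: { title: { selector: 'h1', type: 'string', mode: 'text' } } }
```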
package/README.md CHANGED
@@ -132,7 +132,7 @@ This is the main entry point for the library.
 * `url` (string): The initial URL to navigate to.
 * `engine` ('http' | 'browser' | 'auto'): The engine to use. Defaults to `auto`.
 * `proxy` (string | string[]): Proxy URL(s) to use for requests.
-* `debug` (boolean): Enable detailed execution metadata (timings, engine used, etc.) in response.
+* `debug` (boolean | string | string[]): Enable detailed execution metadata (timings, engine used, etc.) in the response, or enable debug logs for specific categories (e.g., 'extract', 'submit', 'request').
 * `actions` (FetchActionOptions[]): An array of action objects to execute. (Supports `action`/`name` as an alias for `id`, and `args` as an alias for `params`.)
 * `headers` (Record<string, string>): Headers to use for all requests.
 * `cookies` (Cookie[]): Array of cookies to use.
@@ -145,21 +145,30 @@ This is the main entry point for the library.
 * `output` (object): Controls the output fields in `FetchResponse`.
   * `cookies` (boolean): Whether to include cookies in the response (default: `true`).
   * `sessionState` (boolean): Whether to include session state in the response (default: `true`).
+* `browser` (object): Browser engine configuration.
+  * `headless` (boolean): Run in headless mode (default: `true`).
+  * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
 * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
 * ...and many other options for proxy, retries, etc.
 
 ### Built-in Actions
 
-Here are the essential built-in actions:
+The library provides a set of powerful built-in actions, many of which are engine-agnostic and handled centrally for consistency:
 
 * `goto`: Navigates to a new URL.
-* `click`: Clicks on an element specified by a selector.
-* `fill`: Fills an input field with a specified value.
-* `submit`: Submits a form.
-* `waitFor`: Pauses execution to wait for a specific condition (e.g., a timeout, a selector to appear, or network to be idle).
-* `pause`: Pauses execution for manual intervention (e.g., solving a CAPTCHA).
-* `getContent`: Retrieves the full content (HTML, text, etc.) of the current page state.
-* `extract`: Extracts any structured data from the page with ease using an expressive, declarative schema.
+* `click`: Clicks on an element (engine-specific).
+* `fill`: Fills an input field (engine-specific).
+* `submit`: Submits a form (engine-specific).
+* `mouseMove`: Moves the mouse cursor to a specific coordinate or element (Bézier curves supported).
+* `mouseClick`: Triggers a mouse click at the current position or specified coordinates.
+* `keyboardType`: Simulates human-like typing into the currently focused element.
+* `keyboardPress`: Simulates pressing a single key or a key combination.
+* `trim`: Removes elements from the DOM to clean up the page.
+* `waitFor`: Pauses execution to wait for a specific condition (fixed timeouts are handled centrally).
+* `pause`: Pauses execution for manual intervention (handled centrally).
+* `getContent`: Retrieves the full content of the current page (handled centrally).
+* `evaluate`: Executes custom JavaScript within the page context.
+* `extract`: Extracts structured data using engine-agnostic core logic and engine-specific DOM primitives. Supports `required` fields and `strict` validation.
 
 ### Response Structure
 
@@ -172,7 +181,7 @@ The `fetchWeb` function returns an object containing:
 * `cookies`: Array of cookies.
 * `sessionState`: Crawlee session state.
 * `text`, `html`: Page content.
-* `outputs` (Record<string, any>): Data extracted and stored via `storeAs`.
+* `outputs` (Record<string, any>): Data extracted and stored via `storeAs`. Note: when multiple actions store objects into the same key, they are merged rather than overwritten.
 
 ---
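The `outputs` merge behavior noted in the last hunk can be sketched as follows (assumption: a shallow object merge; the library may merge differently for nested structures):

```typescript
// Sketch of the documented `outputs` behavior: when two actions use the
// same `storeAs` key and both store plain objects, the results are merged
// rather than the second overwriting the first.
function storeOutput(
  outputs: Record<string, any>,
  key: string,
  value: any,
): void {
  const isObj = (v: any) =>
    typeof v === 'object' && v !== null && !Array.isArray(v);
  const existing = outputs[key];
  outputs[key] = isObj(existing) && isObj(value)
    ? { ...existing, ...value } // merge objects stored under the same key
    : value;                    // otherwise the new value replaces the old
}

const outputs: Record<string, any> = {};
storeOutput(outputs, 'page', { title: 'Hello' });
storeOutput(outputs, 'page', { links: 3 });
// outputs.page → { title: 'Hello', links: 3 }
```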