@isdk/web-fetcher 0.2.11 → 0.2.12

Files changed (61)
  1. package/README.action.cn.md +81 -20
  2. package/README.action.md +81 -4
  3. package/README.engine.cn.md +28 -8
  4. package/README.engine.md +28 -8
  5. package/dist/index.d.mts +405 -10
  6. package/dist/index.d.ts +405 -10
  7. package/dist/index.js +1 -1
  8. package/dist/index.mjs +1 -1
  9. package/docs/_media/README.action.md +81 -4
  10. package/docs/_media/README.engine.md +28 -8
  11. package/docs/classes/CheerioFetchEngine.md +354 -67
  12. package/docs/classes/ClickAction.md +23 -23
  13. package/docs/classes/ExtractAction.md +23 -23
  14. package/docs/classes/FetchAction.md +23 -23
  15. package/docs/classes/FetchEngine.md +314 -65
  16. package/docs/classes/FetchSession.md +118 -21
  17. package/docs/classes/FillAction.md +23 -23
  18. package/docs/classes/GetContentAction.md +23 -23
  19. package/docs/classes/GotoAction.md +23 -23
  20. package/docs/classes/PauseAction.md +23 -23
  21. package/docs/classes/PlaywrightFetchEngine.md +339 -66
  22. package/docs/classes/SubmitAction.md +23 -23
  23. package/docs/classes/WaitForAction.md +23 -23
  24. package/docs/classes/WebFetcher.md +53 -5
  25. package/docs/enumerations/FetchActionResultStatus.md +4 -4
  26. package/docs/functions/fetchWeb.md +2 -2
  27. package/docs/interfaces/BaseFetchActionProperties.md +19 -11
  28. package/docs/interfaces/BaseFetchCollectorActionProperties.md +27 -15
  29. package/docs/interfaces/BaseFetcherProperties.md +27 -27
  30. package/docs/interfaces/DispatchedEngineAction.md +4 -4
  31. package/docs/interfaces/ExtractActionProperties.md +23 -11
  32. package/docs/interfaces/FetchActionInContext.md +32 -15
  33. package/docs/interfaces/FetchActionProperties.md +24 -12
  34. package/docs/interfaces/FetchActionResult.md +6 -6
  35. package/docs/interfaces/FetchContext.md +80 -37
  36. package/docs/interfaces/FetchEngineContext.md +51 -32
  37. package/docs/interfaces/FetchMetadata.md +5 -5
  38. package/docs/interfaces/FetchResponse.md +14 -14
  39. package/docs/interfaces/FetchReturnTypeRegistry.md +7 -7
  40. package/docs/interfaces/FetchSite.md +30 -30
  41. package/docs/interfaces/FetcherOptions.md +29 -29
  42. package/docs/interfaces/GotoActionOptions.md +6 -6
  43. package/docs/interfaces/PendingEngineRequest.md +3 -3
  44. package/docs/interfaces/StorageOptions.md +5 -5
  45. package/docs/interfaces/SubmitActionOptions.md +2 -2
  46. package/docs/interfaces/WaitForActionOptions.md +5 -5
  47. package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
  48. package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
  49. package/docs/type-aliases/BrowserEngine.md +1 -1
  50. package/docs/type-aliases/FetchActionCapabilities.md +1 -1
  51. package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
  52. package/docs/type-aliases/FetchActionOptions.md +1 -1
  53. package/docs/type-aliases/FetchEngineAction.md +1 -1
  54. package/docs/type-aliases/FetchEngineType.md +1 -1
  55. package/docs/type-aliases/FetchReturnType.md +1 -1
  56. package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
  57. package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
  58. package/docs/type-aliases/ResourceType.md +1 -1
  59. package/docs/variables/DefaultFetcherProperties.md +1 -1
  60. package/docs/variables/FetcherOptionKeys.md +1 -1
  61. package/package.json +7 -4
package/README.action.cn.md CHANGED
@@ -222,19 +222,26 @@ await fetchWeb({
 
  ###### 1. Extracting a Single Value
 
- The most basic extraction. You can specify a **`selector`** (CSS selector), an **`attribute`** (the name of the attribute to extract), and a **`type`** (string, number, boolean, html).
+ The most basic extraction. You can specify a **`selector`** (CSS selector), an **`attribute`** (the name of the attribute to extract), a **`type`** (string, number, boolean, html), and a **`mode`** (text, innerText)
 
  ```json
  {
    "id": "extract",
    "params": {
      "selector": "h1.main-title",
-     "type": "string"
+     "type": "string",
+     "mode": "innerText"
    }
  }
  ```
 
- > The example above extracts the text content of the `<h1>` tag with the class `main-title`.
+ > **Extraction Modes:**
+ >
+ > * **`text`** (default): Extracts the element's `textContent`.
+ > * **`innerText`**: Extracts the rendered text, respecting CSS styles and handling line breaks.
+ > * **`html`**: Returns the element's `innerHTML`.
+ > * **`outerHTML`**: Returns the complete HTML including the tag itself. Very useful for preserving the element structure.
+ > The example above uses the `innerText` mode to extract the text content of the `<h1>` tag with the class `main-title`.
 
  ###### 2. Extracting an Object
 
@@ -261,21 +268,13 @@ await fetchWeb({
  * **Extracting an array of texts (default behavior)**: When you want to extract a list of texts, just provide the selector and omit `items`. This is the most common usage.
 
  ```json
-
  {
-
    "id": "extract",
-
    "params": {
-
      "type": "array",
-
      "selector": ".tags li"
-
    }
-
  }
-
  ```
 
  > The example above returns an array containing the text of all `<li>` tags, e.g. `["tech", "news"]`.
@@ -285,26 +284,88 @@ await fetchWeb({
  ```json
 
  {
-
    "id": "extract",
-
    "params": {
-
      "type": "array",
-
      "selector": ".gallery img",
-
      "attribute": "src"
-
    }
-
  }
-
  ```
 
  > The example above returns an array containing the `src` attributes of all `<img>` tags.
 
- ###### 4. Precise Filtering: `has` and `exclude`
+ * **Array extraction modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
+
+     * **`nested`** (default): The `selector` matches the wrapper element of each item.
+     * **`columnar`** (formerly the Zip strategy): The `selector` points to a **container**, and the fields in `items` are parallel columns stitched together by index.
+     * **`segmented`**: The `selector` points to a **container**, and items are extracted in segments delimited by an "anchor" field.
+
+ ###### 4. Columnar Mode (formerly the Zip Strategy)
+
+ Use this mode when the `selector` points to a **container** (such as a results list) and the item data is scattered inside it as parallel, independent columns.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "type": "array",
+     "selector": "#search-results",
+     "mode": "columnar",
+     "items": {
+       "title": { "selector": ".item-title" },
+       "link": { "selector": "a.item-link", "attribute": "href" }
+     }
+   }
+ }
+ ```
+
+ > **Heuristic auto-detection:** If `mode` is omitted, the `selector` matches exactly one element, and `items` contains selectors, the engine automatically uses **columnar** mode.
+
+ **Columnar configuration parameters:**
+
+ * **`strict`** (boolean, default: `true`): If `true`, an error is thrown when different fields match different numbers of elements.
+ * **`inference`** (boolean, default: `false`): If `true`, when field counts do not match, the engine tries to find the "wrapper elements" automatically via the DOM tree to repair misaligned lists.
+
+ ###### 5. Segmented Mode (Segment Scanning)
+
+ Suited to completely "flat" structures with no wrapper elements. It uses the first field (or the specified `anchor`) as the anchor to segment the container's content.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "type": "array",
+     "selector": "#flat-container",
+     "mode": { "type": "segmented", "anchor": "title" },
+     "items": {
+       "title": { "selector": "h3" },
+       "desc": { "selector": "p" }
+     }
+   }
+ }
+ ```
+
+ > In this mode, each time the engine finds an `h3`, it starts a new item. Any `p` tags found afterwards (until the next `h3` appears) belong to that item.
+
+ ###### 6. Implicit Object Extraction (Simplest Syntax)
+
+ To make object extraction simpler, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not reserved keywords (such as `selector`, `attribute`, `type`, etc.), it is treated as an object schema whose keys serve as property names.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "selector": ".author-bio",
+     "name": { "selector": ".author-name" },
+     "email": { "selector": "a.email", "attribute": "href" }
+   }
+ }
+ ```
+
+ > This is equivalent to Example 2, but more concise.
+
+ ###### 6. Precise Filtering: `has` and `exclude`
 
  You can use the **`has`** and **`exclude`** fields in any Schema that contains a **`selector`** to precisely control element selection.
 
package/README.action.md CHANGED
@@ -224,19 +224,26 @@ The `params` object itself is a Schema that describes the data structure you wan
 
  ###### 1. Extracting a Single Value
 
- The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), and a `type` (string, number, boolean, html).
+ The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
 
  ```json
  {
    "id": "extract",
    "params": {
      "selector": "h1.main-title",
-     "type": "string"
+     "type": "string",
+     "mode": "innerText"
    }
  }
  ```
 
- > The example above will extract the text content of the `<h1>` tag with the class `main-title`.
+ > **Extraction Modes:**
+ >
+ > * **`text`** (default): Extracts the `textContent` of the element.
+ > * **`innerText`**: Extracts the rendered text, respecting CSS styling and line breaks.
+ > * **`html`**: Returns the `innerHTML` of the element.
+ > * **`outerHTML`**: Returns the HTML including the tag itself. Useful for preserving the full element structure.
+ > The example above will extract the text content of the `<h1>` tag with the class `main-title` using the `innerText` mode.
 
  ###### 2. Extracting an Object
 
@@ -289,7 +296,77 @@ Extract a list using `type: 'array'`. To make the most common operations simpler
 
  > The example above will return an array of the `src` attributes from all `<img>` tags.
 
- ###### 4. Precise Filtering: `has` and `exclude`
+ * **Array Extraction Modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
+
+     * **`nested`** (Default): The `selector` matches individual item wrappers.
+     * **`columnar`** (formerly Zip): The `selector` matches a **container**, and fields in `items` are parallel columns stitched together by index.
+     * **`segmented`**: The `selector` matches a **container**, and items are segmented by an "anchor" field.
+
+ ###### 4. Columnar Mode (formerly Zip Strategy)
+
+ This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "type": "array",
+     "selector": "#search-results",
+     "mode": "columnar",
+     "items": {
+       "title": { "selector": ".item-title" },
+       "link": { "selector": "a.item-link", "attribute": "href" }
+     }
+   }
+ }
+ ```
+
+ > **Heuristic Detection:** If `mode` is omitted and the `selector` matches exactly one element while `items` contains nested selectors, the engine automatically uses **columnar** mode.
+
+ **Columnar Configuration:**
+
+ * **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
+ * **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists.
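
A note on the columnar behaviour documented in this hunk: stitching parallel columns by index and failing on mismatched counts can be pictured with a short TypeScript sketch. This is an illustration under stated assumptions, not the package's implementation. `stitchColumns` and the `Columns` shape are hypothetical names, the input stands in for the values each field selector matched, and the non-strict branch (truncating to the shortest column) is a guess, because the README only specifies the strict case.

```typescript
// Hypothetical input: each key holds the values matched by one field selector.
type Columns = Record<string, string[]>;

function stitchColumns(columns: Columns, strict = true): Record<string, string>[] {
  const names = Object.keys(columns);
  const lengths = names.map((name) => columns[name].length);
  const min = lengths.length > 0 ? Math.min(...lengths) : 0;
  const max = lengths.length > 0 ? Math.max(...lengths) : 0;

  // Mirrors the documented `strict: true` rule: differing match counts are an error.
  if (strict && min !== max) {
    const detail = names.map((n, i) => `${n}=${lengths[i]}`).join(", ");
    throw new Error(`Column length mismatch: ${detail}`);
  }

  // Assumption: in non-strict mode rows are built only up to the shortest column.
  const rows: Record<string, string>[] = [];
  for (let i = 0; i < min; i++) {
    const row: Record<string, string> = {};
    for (const name of names) row[name] = columns[name][i];
    rows.push(row);
  }
  return rows;
}

// Two parallel columns become two stitched items.
const rows = stitchColumns({
  title: ["First", "Second"],
  link: ["/a", "/b"],
});
```

With `strict` left at its default, a container in which `.item-title` matches three elements and `a.item-link` matches two would raise an error rather than silently pair titles with the wrong links.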
+
+ ###### 5. Segmented Mode (Anchor-based Scanning)
+
+ This mode is ideal for "flat" structures where there are no item wrappers. It uses the first field (or a specified `anchor`) to segment the container's content.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "type": "array",
+     "selector": "#flat-container",
+     "mode": { "type": "segmented", "anchor": "title" },
+     "items": {
+       "title": { "selector": "h3" },
+       "desc": { "selector": "p" }
+     }
+   }
+ }
+ ```
+
+ > In this mode, every time the engine finds an `h3`, it starts a new item. Any `<p>` found after that `h3` (and before the next one) is assigned to that item.
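
The anchor-based scan described in this hunk can be sketched over a flat list of nodes. The following TypeScript is a simplified model, not the engine's code: `FlatNode`, `segment`, and the tag-only matching are assumptions made for illustration (the real engine works with CSS selectors on a DOM), as is the choice that the first match per field wins within a segment.

```typescript
// Hypothetical flat model of a container's children: tag name plus text.
interface FlatNode {
  tag: string;
  text: string;
}

// `fields` maps a field name to the tag it selects; `anchor` names the field
// whose matches open a new item.
function segment(
  nodes: FlatNode[],
  fields: Record<string, string>,
  anchor: string,
): Record<string, string>[] {
  const anchorTag = fields[anchor];
  const items: Record<string, string>[] = [];
  let current: Record<string, string> | undefined;

  for (const node of nodes) {
    if (node.tag === anchorTag) {
      // Every anchor match starts a new item.
      current = { [anchor]: node.text };
      items.push(current);
      continue;
    }
    if (!current) continue; // Nodes before the first anchor belong to no item.
    for (const [name, tag] of Object.entries(fields)) {
      // Assumption: the first match per field wins within a segment.
      if (tag === node.tag && !(name in current)) current[name] = node.text;
    }
  }
  return items;
}

const segmented = segment(
  [
    { tag: "h3", text: "Alpha" },
    { tag: "p", text: "About alpha" },
    { tag: "h3", text: "Beta" },
    { tag: "p", text: "About beta" },
  ],
  { title: "h3", desc: "p" },
  "title",
);
```

Here the two `h3` nodes open two items, and each `p` is attached to the item opened by the `h3` that precedes it.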
+
+ ###### 6. Implicit Object Extraction (Simplest Syntax)
+
+ For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not reserved keywords (like `selector`, `attribute`, `type`, etc.), it is treated as an object schema where keys are property names.
+
+ ```json
+ {
+   "id": "extract",
+   "params": {
+     "selector": ".author-bio",
+     "name": { "selector": ".author-name" },
+     "email": { "selector": "a.email", "attribute": "href" }
+   }
+ }
+ ```
+
+ > This is equivalent to Example 2 but more concise.
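
One way to read the implicit-object rule in this hunk is as a normalization step that rewrites the shorthand into an explicit `type: 'object'` schema. The TypeScript below is a hedged sketch of that idea only. `normalizeImplicitObject` is a hypothetical helper, and `RESERVED_KEYS` is an illustrative subset assembled from keywords that appear elsewhere in this README diff; the package's real reserved list is not shown here and may differ.

```typescript
// Assumption: illustrative subset of reserved schema keywords, not the package's actual list.
const RESERVED_KEYS = new Set([
  "selector", "attribute", "type", "mode", "items", "properties", "has", "exclude",
]);

type Schema = Record<string, unknown>;

// Rewrites the shorthand form into an explicit object schema.
function normalizeImplicitObject(schema: Schema): Schema {
  const reserved: Schema = {};
  const properties: Schema = {};
  for (const [key, value] of Object.entries(schema)) {
    if (RESERVED_KEYS.has(key)) reserved[key] = value;
    else properties[key] = value;
  }
  // No non-reserved keys: the schema is not in shorthand form, so return it unchanged.
  if (Object.keys(properties).length === 0) return schema;
  return { ...reserved, type: "object", properties };
}

const normalized = normalizeImplicitObject({
  selector: ".author-bio",
  name: { selector: ".author-name" },
  email: { selector: "a.email", attribute: "href" },
});
```

Under these assumptions, `normalized` keeps `selector` at the top level and moves `name` and `email` under `properties`, which is the explicit form the shorthand is said to be equivalent to.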
+
+ ###### 6. Precise Filtering: `has` and `exclude`
 
  You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
 
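package/README.engine.cn.md CHANGED
@@ -31,18 +31,38 @@
 
  This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
 
- ### `FetchSession(options, engine)`
+ ### `FetchSession(options)`
 
- The `FetchSession` class manages the lifecycle of a fetch operation. Its constructor now supports an optional `engine` parameter that forces a specific engine implementation for the session, bypassing any auto-detection or registry-based selection.
+ The `FetchSession` class manages the lifecycle of a fetch operation. You can specify `engine` in `options` to force a specific engine implementation for the session.
 
- ### Engine Selection Priority
+ ```typescript
+ const session = new FetchSession({ engine: 'browser' });
+ ```
+
+ #### Engine Selection Priority
+
+ The engine is initialized lazily when the first action executes and stays fixed for the duration of the session. Selection follows these rules:
 
- When the library decides which engine to use (via the internal `maybeCreateEngine`), it follows this priority:
+ 1. **Explicit Option**: If an engine is provided in the `FetchSession`'s `options.engine` (or in a temporary context override of `executeAll`) and is not `'auto'`.
+     * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
+ 2. **Site Registry**: If set to `'auto'` (the default), the system tries to match the target URL against the `sites` registry.
+ 3. **Smart Upgrade**: If enabled, the system may dynamically upgrade from `http` to `browser` based on response characteristics (such as bot detection or heavy JS).
+ 4. **Default**: Falls back to `'http'` (Cheerio).
 
- 1. **Explicit Forced Engine**: If an engine ID is passed explicitly during session or engine creation (e.g., the new `engine` parameter in `FetchSession`).
- 2. **Configuration Engine**: The `engine` property defined in `FetcherOptions`.
- 3. **Site Registry**: If the target URL matches a domain in the configured `sites` registry, the engine preferred for that site is used.
- 4. **Default**: Defaults to `auto`. If `enableSmart` is enabled, it switches intelligently between `http` and `browser`; otherwise it defaults to `http`.
+ #### Batch Execution with Overrides
+
+ You can execute a series of actions with temporary configuration overrides (e.g., headers, timeout). These overrides apply only to the current batch and do not modify the session's global state.
+
+ ```typescript
+ // Execute actions with a temporary custom header and timeout
+ await session.executeAll([
+   { name: 'goto', params: { url: '...' } },
+   { name: 'extract', params: { ... } }
+ ], {
+   headers: { 'x-custom-priority': 'high' },
+   timeoutMs: 30000
+ });
+ ```
 
  ### Session Management & State Persistence
 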
package/README.engine.md CHANGED
@@ -31,18 +31,38 @@ This is the abstract base class that defines the contract for all fetch engines.
 
  This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
 
- ### `FetchSession(options, engine)`
+ ### `FetchSession(options)`
 
- The `FetchSession` class manages the lifecycle of a fetch operation. It now supports an optional `engine` parameter in its constructor to force a specific engine implementation for that session, bypassing any auto-detection or registry-based selection.
+ The `FetchSession` class manages the lifecycle of a fetch operation. You can specify the `engine` in the `options` to force a specific engine implementation for that session.
 
- ### Engine Selection Priority
+ ```typescript
+ const session = new FetchSession({ engine: 'browser' });
+ ```
+
+ #### Engine Selection Priority
+
+ The engine is initialized lazily upon the first action execution and remains fixed for the duration of the session. The selection follows these rules:
 
- When the library determines which engine to use (via internal `maybeCreateEngine`), it follows this priority:
+ 1. **Explicit Option**: If `options.engine` (or temporary context override in `executeAll`) is provided and NOT set to `'auto'`.
+     * ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
+ 2. **Site Registry**: If set to `'auto'` (default), the system attempts to match the target URL against the `sites` registry.
+ 3. **Smart Upgrade**: If enabled, the engine may be dynamically upgraded from `http` to `browser` based on response characteristics (e.g., bot detection or heavy JS).
+ 4. **Default**: Falls back to `'http'` (Cheerio).
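
The new priority list can be read as a small decision function. The TypeScript below sketches rules 1, 2, and 4 under explicit assumptions and is not the library's selection code: `selectEngine`, the `SelectionInput` shape, the `sites` entry format, and the `available` list are all hypothetical, and the URL parsing is deliberately simplistic. Rule 3 (Smart Upgrade) depends on an actual response, so it is only marked with a comment.

```typescript
type EngineId = "http" | "browser";

// Hypothetical input shape, invented for this illustration.
interface SelectionInput {
  engine?: EngineId | "auto";                      // explicit option; 'auto' is the default
  url: string;                                     // target URL of the first action
  sites?: { domain: string; engine: EngineId }[];  // assumed registry entry format
  available?: EngineId[];                          // assumed: engines whose dependencies are installed
}

function selectEngine(input: SelectionInput): EngineId {
  const available = input.available ?? ["http", "browser"];
  const requested = input.engine ?? "auto";

  // 1. Explicit option, failing fast when the engine cannot be used.
  if (requested !== "auto") {
    if (!available.includes(requested)) {
      throw new Error(`Engine "${requested}" is unavailable`);
    }
    return requested;
  }

  // 2. Site registry: match the host against registered domains (naive host parsing).
  const host = input.url
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, "")
    .split(/[/:?#]/)[0]
    .toLowerCase();
  const site = (input.sites ?? []).find(
    (s) => host === s.domain || host.endsWith(`.${s.domain}`),
  );
  if (site) return site.engine;

  // 3. Smart Upgrade would happen later, once a response has been inspected.
  // 4. Default.
  return "http";
}
```

For example, `selectEngine({ url: "https://shop.example.com/x", sites: [{ domain: "example.com", engine: "browser" }] })` resolves through the registry, while a call with no `engine` and no matching site falls back to `"http"`.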
 
- 1. **Explicit Forced Engine**: If an engine ID is explicitly passed during session or engine creation (e.g., the new `engine` parameter in `FetchSession`).
- 2. **Configuration Engine**: The `engine` property defined in `FetcherOptions`.
- 3. **Site Registry**: If the target URL matches a domain in the configured `sites` registry, it uses the engine preferred for that site.
- 4. **Default**: Defaults to `auto`, which intelligently switches between `http` and `browser` if `enableSmart` is enabled, or defaults to `http` otherwise.
+ #### Batch Execution with Overrides
+
+ You can execute a sequence of actions with temporary configuration overrides (e.g., headers, timeout) that apply only to that specific batch, without modifying the session's global state.
+
+ ```typescript
+ // Execute actions with a temporary custom header and timeout
+ await session.executeAll([
+   { name: 'goto', params: { url: '...' } },
+   { name: 'extract', params: { ... } }
+ ], {
+   headers: { 'x-custom-priority': 'high' },
+   timeoutMs: 30000
+ });
+ ```
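
The phrase "without modifying the session's global state" suggests that the overrides are combined with the session configuration into a fresh object for each batch. A minimal TypeScript sketch of that pattern follows. It is an assumption about the approach, not the package's code: `SessionConfig` and `withOverrides` are hypothetical, and merging headers key by key (rather than replacing them wholesale) is a guess.

```typescript
// Hypothetical slice of a session's configuration.
interface SessionConfig {
  headers: Record<string, string>;
  timeoutMs: number;
}

// Builds the effective config for one batch; the base object is never mutated.
function withOverrides(
  base: SessionConfig,
  overrides: Partial<SessionConfig> = {},
): SessionConfig {
  return {
    ...base,
    ...overrides,
    // Assumption: headers are merged key by key rather than replaced wholesale.
    headers: { ...base.headers, ...(overrides.headers ?? {}) },
  };
}

const sessionConfig: SessionConfig = {
  headers: { accept: "text/html" },
  timeoutMs: 10000,
};

const batchConfig = withOverrides(sessionConfig, {
  headers: { "x-custom-priority": "high" },
  timeoutMs: 30000,
});
```

After the call, `batchConfig` carries both headers and the longer timeout, while `sessionConfig` still holds its original values for any later batch.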
 
  ### Session Management & State Persistence