@isdk/web-fetcher 0.2.11 → 0.2.12
- package/README.action.cn.md +81 -20
- package/README.action.md +81 -4
- package/README.engine.cn.md +28 -8
- package/README.engine.md +28 -8
- package/dist/index.d.mts +405 -10
- package/dist/index.d.ts +405 -10
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/_media/README.action.md +81 -4
- package/docs/_media/README.engine.md +28 -8
- package/docs/classes/CheerioFetchEngine.md +354 -67
- package/docs/classes/ClickAction.md +23 -23
- package/docs/classes/ExtractAction.md +23 -23
- package/docs/classes/FetchAction.md +23 -23
- package/docs/classes/FetchEngine.md +314 -65
- package/docs/classes/FetchSession.md +118 -21
- package/docs/classes/FillAction.md +23 -23
- package/docs/classes/GetContentAction.md +23 -23
- package/docs/classes/GotoAction.md +23 -23
- package/docs/classes/PauseAction.md +23 -23
- package/docs/classes/PlaywrightFetchEngine.md +339 -66
- package/docs/classes/SubmitAction.md +23 -23
- package/docs/classes/WaitForAction.md +23 -23
- package/docs/classes/WebFetcher.md +53 -5
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +2 -2
- package/docs/interfaces/BaseFetchActionProperties.md +19 -11
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +27 -15
- package/docs/interfaces/BaseFetcherProperties.md +27 -27
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/ExtractActionProperties.md +23 -11
- package/docs/interfaces/FetchActionInContext.md +32 -15
- package/docs/interfaces/FetchActionProperties.md +24 -12
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +80 -37
- package/docs/interfaces/FetchEngineContext.md +51 -32
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +7 -7
- package/docs/interfaces/FetchSite.md +30 -30
- package/docs/interfaces/FetcherOptions.md +29 -29
- package/docs/interfaces/GotoActionOptions.md +6 -6
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +1 -1
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/package.json +7 -4
package/README.action.cn.md
CHANGED
|
@@ -222,19 +222,26 @@ await fetchWeb({
|
|
|
222
222
|
|
|
223
223
|
###### 1. Extracting a Single Value
|
|
224
224
|
|
|
225
|
-
The most basic extraction. You can specify a **`selector`** (CSS selector), an **`attribute`** (the name of the attribute to extract),
|
|
225
|
+
The most basic extraction. You can specify a **`selector`** (CSS selector), an **`attribute`** (the name of the attribute to extract), a **`type`** (string, number, boolean, html), and a **`mode`** (text, innerText).
|
|
226
226
|
|
|
227
227
|
```json
|
|
228
228
|
{
|
|
229
229
|
"id": "extract",
|
|
230
230
|
"params": {
|
|
231
231
|
"selector": "h1.main-title",
|
|
232
|
-
"type": "string"
|
|
232
|
+
"type": "string",
|
|
233
|
+
"mode": "innerText"
|
|
233
234
|
}
|
|
234
235
|
}
|
|
235
236
|
```
|
|
236
237
|
|
|
237
|
-
>
|
|
238
|
+
> **Extraction Modes:**
|
|
239
|
+
>
|
|
240
|
+
> * **`text`** (default): Extracts the element's `textContent`.
|
|
241
|
+
> * **`innerText`**: Extracts the rendered text, respecting CSS styling and handling line breaks.
|
|
242
|
+
> * **`html`**: Returns the element's `innerHTML`.
|
|
243
|
+
> * **`outerHTML`**: Returns the full HTML including the tag itself. Useful for preserving the element structure.
|
|
244
|
+
> The example above extracts the text content of the `<h1>` tag with the class `main-title` using the `innerText` mode.
|
|
238
245
|
|
|
239
246
|
###### 2. Extracting an Object
|
|
240
247
|
|
|
@@ -261,21 +268,13 @@ await fetchWeb({
|
|
|
261
268
|
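* **Extracting an array of text (default behavior)**: When you want to extract a list of text values, provide only the selector and omit `items`. This is the most common usage.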
|
|
262
269
|
|
|
263
270
|
```json
|
|
264
|
-
|
|
265
271
|
{
|
|
266
|
-
|
|
267
272
|
"id": "extract",
|
|
268
|
-
|
|
269
273
|
"params": {
|
|
270
|
-
|
|
271
274
|
"type": "array",
|
|
272
|
-
|
|
273
275
|
"selector": ".tags li"
|
|
274
|
-
|
|
275
276
|
}
|
|
276
|
-
|
|
277
277
|
}
|
|
278
|
-
|
|
279
278
|
```
|
|
280
279
|
|
|
281
280
|
> The example above returns an array containing the text of every `<li>` tag, such as `["tech", "news"]`.
|
|
@@ -285,26 +284,88 @@ await fetchWeb({
|
|
|
285
284
|
```json
|
|
286
285
|
|
|
287
286
|
{
|
|
288
|
-
|
|
289
287
|
"id": "extract",
|
|
290
|
-
|
|
291
288
|
"params": {
|
|
292
|
-
|
|
293
289
|
"type": "array",
|
|
294
|
-
|
|
295
290
|
"selector": ".gallery img",
|
|
296
|
-
|
|
297
291
|
"attribute": "src"
|
|
298
|
-
|
|
299
292
|
}
|
|
300
|
-
|
|
301
293
|
}
|
|
302
|
-
|
|
303
294
|
```
|
|
304
295
|
|
|
305
296
|
> The example above returns an array of the `src` attributes of all `<img>` tags.
|
|
306
297
|
|
|
307
|
-
|
|
298
|
+
* **Array Extraction Modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
|
|
299
|
+
|
|
300
|
+
* **`nested`** (default): The `selector` matches each item's wrapper element.
|
|
301
|
+
* **`columnar`** (formerly the Zip strategy): The `selector` points to a **container**, and the fields in `items` are parallel columns stitched together by index.
|
|
302
|
+
* **`segmented`**: The `selector` points to a **container**, and items are segmented by an "anchor" field.
|
|
303
|
+
|
|
304
|
+
###### 4. Columnar Mode (formerly Zip Strategy)
|
|
305
|
+
|
|
306
|
+
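Use this mode when the `selector` points to a **container** (such as a results list) and the item data is scattered inside it as separate, parallel columns.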
|
|
307
|
+
|
|
308
|
+
```json
|
|
309
|
+
{
|
|
310
|
+
"id": "extract",
|
|
311
|
+
"params": {
|
|
312
|
+
"type": "array",
|
|
313
|
+
"selector": "#search-results",
|
|
314
|
+
"mode": "columnar",
|
|
315
|
+
"items": {
|
|
316
|
+
"title": { "selector": ".item-title" },
|
|
317
|
+
"link": { "selector": "a.item-link", "attribute": "href" }
|
|
318
|
+
}
|
|
319
|
+
}
|
|
320
|
+
}
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
> **Heuristic auto-detection:** If `mode` is omitted, the `selector` matches exactly one element, and `items` contains selectors, the engine automatically uses **columnar** mode.
|
|
324
|
+
|
|
325
|
+
**Columnar Configuration:**
|
|
326
|
+
|
|
327
|
+
* **`strict`** (boolean, default: `true`): If `true`, an error is thrown when fields have different match counts.
|
|
328
|
+
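* **`inference`** (boolean, default: `false`): If `true`, when field counts do not match, the engine tries to locate the "wrapper elements" automatically through the DOM tree to repair misaligned lists.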
|
|
329
|
+
|
|
330
|
+
###### 5. Segmented Mode
|
|
331
|
+
|
|
332
|
+
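Suited to completely "flat" structures that have no wrapper elements. It uses the first field (or a specified `anchor`) as the anchor for segmenting the container's content.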
|
|
333
|
+
|
|
334
|
+
```json
|
|
335
|
+
{
|
|
336
|
+
"id": "extract",
|
|
337
|
+
"params": {
|
|
338
|
+
"type": "array",
|
|
339
|
+
"selector": "#flat-container",
|
|
340
|
+
"mode": { "type": "segmented", "anchor": "title" },
|
|
341
|
+
"items": {
|
|
342
|
+
"title": { "selector": "h3" },
|
|
343
|
+
"desc": { "selector": "p" }
|
|
344
|
+
}
|
|
345
|
+
}
|
|
346
|
+
}
|
|
347
|
+
```
|
|
348
|
+
|
|
349
|
+
> In this mode, each time the engine finds an `h3`, it starts a new item. Any `p` tags found afterwards (until the next `h3` appears) belong to that item.
|
|
350
|
+
|
|
351
|
+
###### 6. Implicit Object Extraction (Simplest Syntax)
|
|
352
|
+
|
|
353
|
+
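To make object extraction simpler, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not reserved keywords (such as `selector`, `attribute`, `type`, etc.), it is treated as an object schema whose keys are the property names.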
|
|
354
|
+
|
|
355
|
+
```json
|
|
356
|
+
{
|
|
357
|
+
"id": "extract",
|
|
358
|
+
"params": {
|
|
359
|
+
"selector": ".author-bio",
|
|
360
|
+
"name": { "selector": ".author-name" },
|
|
361
|
+
"email": { "selector": "a.email", "attribute": "href" }
|
|
362
|
+
}
|
|
363
|
+
}
|
|
364
|
+
```
|
|
365
|
+
|
|
366
|
+
> This is equivalent to Example 2, but more concise.
|
|
367
|
+
|
|
368
|
+
###### 7. Precise Filtering: `has` and `exclude`
|
|
308
369
|
|
|
309
370
|
You can use the **`has`** and **`exclude`** fields in any schema that includes a **`selector`** to precisely control element selection.
|
|
310
371
|
|
package/README.action.md
CHANGED
|
@@ -224,19 +224,26 @@ The `params` object itself is a Schema that describes the data structure you wan
|
|
|
224
224
|
|
|
225
225
|
###### 1. Extracting a Single Value
|
|
226
226
|
|
|
227
|
-
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract),
|
|
227
|
+
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
|
|
228
228
|
|
|
229
229
|
```json
|
|
230
230
|
{
|
|
231
231
|
"id": "extract",
|
|
232
232
|
"params": {
|
|
233
233
|
"selector": "h1.main-title",
|
|
234
|
-
"type": "string"
|
|
234
|
+
"type": "string",
|
|
235
|
+
"mode": "innerText"
|
|
235
236
|
}
|
|
236
237
|
}
|
|
237
238
|
```
|
|
238
239
|
|
|
239
|
-
>
|
|
240
|
+
> **Extraction Modes:**
|
|
241
|
+
>
|
|
242
|
+
> * **`text`** (default): Extracts the `textContent` of the element.
|
|
243
|
+
> * **`innerText`**: Extracts the rendered text, respecting CSS styling and line breaks.
|
|
244
|
+
> * **`html`**: Returns the `innerHTML` of the element.
|
|
245
|
+
> * **`outerHTML`**: Returns the HTML including the tag itself. Useful for preserving the full element structure.
|
|
246
|
+
> The example above will extract the text content of the `<h1>` tag with the class `main-title` using the `innerText` mode.
|
|
240
247
|
|
|
241
248
|
###### 2. Extracting an Object
|
|
242
249
|
|
|
@@ -289,7 +296,77 @@ Extract a list using `type: 'array'`. To make the most common operations simpler
|
|
|
289
296
|
|
|
290
297
|
> The example above will return an array of the `src` attributes from all `<img>` tags.
|
|
291
298
|
|
|
292
|
-
|
|
299
|
+
* **Array Extraction Modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
|
|
300
|
+
|
|
301
|
+
* **`nested`** (Default): The `selector` matches individual item wrappers.
|
|
302
|
+
* **`columnar`** (formerly Zip): The `selector` matches a **container**, and fields in `items` are parallel columns stitched together by index.
|
|
303
|
+
* **`segmented`**: The `selector` matches a **container**, and items are segmented by an "anchor" field.
|
|
304
|
+
|
|
305
|
+
###### 4. Columnar Mode (formerly Zip Strategy)
|
|
306
|
+
|
|
307
|
+
This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns.
|
|
308
|
+
|
|
309
|
+
```json
|
|
310
|
+
{
|
|
311
|
+
"id": "extract",
|
|
312
|
+
"params": {
|
|
313
|
+
"type": "array",
|
|
314
|
+
"selector": "#search-results",
|
|
315
|
+
"mode": "columnar",
|
|
316
|
+
"items": {
|
|
317
|
+
"title": { "selector": ".item-title" },
|
|
318
|
+
"link": { "selector": "a.item-link", "attribute": "href" }
|
|
319
|
+
}
|
|
320
|
+
}
|
|
321
|
+
}
|
|
322
|
+
```
|
|
323
|
+
|
|
324
|
+
> **Heuristic Detection:** If `mode` is omitted and the `selector` matches exactly one element while `items` contains nested selectors, the engine automatically uses **columnar** mode.
|
|
325
|
+
|
|
326
|
+
**Columnar Configuration:**
|
|
327
|
+
|
|
328
|
+
* **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
|
|
329
|
+
* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists.
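
The index-wise stitching and the `strict` check described above can be sketched as follows. This is a minimal illustration over plain string arrays, not the library's implementation: `stitchColumns` is a hypothetical helper, and filling missing cells with `undefined` in non-strict mode is an assumption made only for this example.

```typescript
// Hypothetical helper: stitches parallel columns into row objects by index.
type Columns = Record<string, string[]>;
type Row = Record<string, string | undefined>;

function stitchColumns(columns: Columns, strict = true): Row[] {
  const entries = Object.entries(columns);
  const lengths = entries.map(([, values]) => values.length);
  const longest = Math.max(0, ...lengths);

  // In strict mode, any difference in match counts is treated as an error.
  if (strict && lengths.some((len) => len !== longest)) {
    const summary = entries.map(([field, values]) => `${field}=${values.length}`).join(', ');
    throw new Error(`Column length mismatch: ${summary}`);
  }

  // Assumption for this sketch: in non-strict mode, missing cells stay undefined.
  const rows: Row[] = [];
  for (let i = 0; i < longest; i++) {
    const row: Row = {};
    for (const [field, values] of entries) {
      row[field] = values[i];
    }
    rows.push(row);
  }
  return rows;
}

// Two columns with the same number of matches are paired up by position.
const stitchedRows = stitchColumns({
  title: ['First', 'Second'],
  link: ['/a', '/b'],
});
```

Here `stitchedRows[0]` pairs the first title with the first link, mirroring how the `title` and `link` fields in the JSON example would be aligned.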
|
|
330
|
+
|
|
331
|
+
###### 5. Segmented Mode (Anchor-based Scanning)
|
|
332
|
+
|
|
333
|
+
This mode is ideal for "flat" structures where there are no item wrappers. It uses the first field (or a specified `anchor`) to segment the container's content.
|
|
334
|
+
|
|
335
|
+
```json
|
|
336
|
+
{
|
|
337
|
+
"id": "extract",
|
|
338
|
+
"params": {
|
|
339
|
+
"type": "array",
|
|
340
|
+
"selector": "#flat-container",
|
|
341
|
+
"mode": { "type": "segmented", "anchor": "title" },
|
|
342
|
+
"items": {
|
|
343
|
+
"title": { "selector": "h3" },
|
|
344
|
+
"desc": { "selector": "p" }
|
|
345
|
+
}
|
|
346
|
+
}
|
|
347
|
+
}
|
|
348
|
+
```
|
|
349
|
+
|
|
350
|
+
> In this mode, every time the engine finds an `h3`, it starts a new item. Any `<p>` found after that `h3` (and before the next one) is assigned to that item.
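
The anchor-based scan can be sketched as below. This is an illustrative model only: the container's descendants are represented as a flat array of `{ tag, text }` records in document order, field selectors are reduced to tag names, and `segmentByAnchor` is a hypothetical name. Keeping only the first match per field within a segment is also an assumption of this sketch.

```typescript
// Hypothetical flat model of the container's descendants, in document order.
interface FlatNode {
  tag: string;
  text: string;
}

// Maps a field name to a tag name; a stand-in for each field's `selector`.
type FieldTags = Record<string, string>;

function segmentByAnchor(
  nodes: FlatNode[],
  fields: FieldTags,
  anchor?: string,
): Record<string, string>[] {
  const fieldNames = Object.keys(fields);
  // The first field acts as the anchor unless one is specified.
  const anchorField = anchor ?? fieldNames[0];
  const segments: Record<string, string>[] = [];
  let current: Record<string, string> | undefined;

  for (const node of nodes) {
    if (node.tag === fields[anchorField]) {
      // Every anchor match opens a new item.
      current = { [anchorField]: node.text };
      segments.push(current);
      continue;
    }
    if (!current) continue; // content before the first anchor is skipped here
    for (const field of fieldNames) {
      // Assumption: the first match per field within a segment wins.
      if (field !== anchorField && node.tag === fields[field] && !(field in current)) {
        current[field] = node.text;
      }
    }
  }
  return segments;
}

const segmentedItems = segmentByAnchor(
  [
    { tag: 'h3', text: 'Title 1' },
    { tag: 'p', text: 'Desc 1' },
    { tag: 'h3', text: 'Title 2' },
    { tag: 'p', text: 'Desc 2' },
  ],
  { title: 'h3', desc: 'p' },
  'title',
);
```

Each `h3` starts a new entry in `segmentedItems`, and the `p` that follows it is attached to that entry as `desc`.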
|
|
351
|
+
|
|
352
|
+
###### 6. Implicit Object Extraction (Simplest Syntax)
|
|
353
|
+
|
|
354
|
+
For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not reserved keywords (like `selector`, `attribute`, `type`, etc.), it is treated as an object schema where keys are property names.
|
|
355
|
+
|
|
356
|
+
```json
|
|
357
|
+
{
|
|
358
|
+
"id": "extract",
|
|
359
|
+
"params": {
|
|
360
|
+
"selector": ".author-bio",
|
|
361
|
+
"name": { "selector": ".author-name" },
|
|
362
|
+
"email": { "selector": "a.email", "attribute": "href" }
|
|
363
|
+
}
|
|
364
|
+
}
|
|
365
|
+
```
|
|
366
|
+
|
|
367
|
+
> This is equivalent to Example 2 but more concise.
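
One way to picture the implicit-object rule is as a normalization step that moves non-reserved keys under `properties`. The sketch below is an assumption-laden illustration, not the package's code: `normalizeSchema` is a hypothetical name, and `RESERVED_KEYS` contains only keywords that appear in this README, so the library's real reserved list may differ.

```typescript
// Illustrative reserved-key list (assumed; limited to keywords shown in this README).
const RESERVED_KEYS = new Set([
  'selector', 'attribute', 'type', 'mode', 'items', 'properties', 'has', 'exclude',
]);

type Schema = Record<string, unknown>;

function normalizeSchema(schema: Schema): Schema {
  // An explicit `type` or `properties` means the schema is already in full form.
  if ('type' in schema || 'properties' in schema) return schema;

  const reserved: Schema = {};
  const properties: Schema = {};
  for (const [key, value] of Object.entries(schema)) {
    if (RESERVED_KEYS.has(key)) {
      reserved[key] = value;
    } else {
      properties[key] = value; // non-reserved keys become property names
    }
  }

  // No custom keys: this is not an implicit object, so leave it untouched.
  if (Object.keys(properties).length === 0) return schema;
  return { ...reserved, type: 'object', properties };
}

const normalizedSchema = normalizeSchema({
  selector: '.author-bio',
  name: { selector: '.author-name' },
  email: { selector: 'a.email', attribute: 'href' },
});
```

Under these assumptions, `normalizedSchema` keeps `selector` at the top level and carries `name` and `email` inside `properties`, which matches the explicit form with `type: 'object'`.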
|
|
368
|
+
|
|
369
|
+
###### 7. Precise Filtering: `has` and `exclude`
|
|
293
370
|
|
|
294
371
|
You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
|
|
295
372
|
|
package/README.engine.cn.md
CHANGED
|
@@ -31,18 +31,38 @@
|
|
|
31
31
|
|
|
32
32
|
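This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.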
|
|
33
33
|
|
|
34
|
-
### `FetchSession(options
|
|
34
|
+
### `FetchSession(options)`
|
|
35
35
|
|
|
36
|
-
`FetchSession`
|
|
36
|
+
The `FetchSession` class manages the lifecycle of a fetch operation. You can specify `engine` in `options` to force a specific engine implementation for that session.
|
|
37
37
|
|
|
38
|
-
|
|
38
|
+
```typescript
|
|
39
|
+
const session = new FetchSession({ engine: 'browser' });
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
#### Engine Selection Priority
|
|
43
|
+
|
|
44
|
+
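The engine is initialized lazily when the first action executes and remains fixed for the duration of the session. Selection follows these rules: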
|
|
39
45
|
|
|
40
|
-
|
|
46
|
+
1. **Explicit Option**: If an engine is provided in the `FetchSession` `options.engine` (or in a temporary context override in `executeAll`) and is not `'auto'`, that engine is used.
|
|
47
|
+
* ⚠️ **Fail-Fast**: If the requested engine is unavailable (for example, because of missing dependencies), an error is thrown immediately.
|
|
48
|
+
2. **Site Registry**: If set to `'auto'` (the default), the system attempts to match the target URL against the `sites` registry.
|
|
49
|
+
3. **Smart Upgrade**: If enabled, the system may dynamically upgrade from `http` to `browser` based on response characteristics (such as bot detection or heavy JS).
|
|
50
|
+
4. **Default**: Falls back to `'http'` (Cheerio).
|
|
41
51
|
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
52
|
+
#### Batch Execution with Overrides
|
|
53
|
+
|
|
54
|
+
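You can execute a sequence of actions with temporary configuration overrides (for example, headers or timeout). These overrides apply only to the current batch and do not modify the session's global state.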
|
|
55
|
+
|
|
56
|
+
```typescript
|
|
57
|
+
// Execute actions with a temporary custom header and timeout
|
|
58
|
+
await session.executeAll([
|
|
59
|
+
{ name: 'goto', params: { url: '...' } },
|
|
60
|
+
{ name: 'extract', params: { ... } }
|
|
61
|
+
], {
|
|
62
|
+
headers: { 'x-custom-priority': 'high' },
|
|
63
|
+
timeoutMs: 30000
|
|
64
|
+
});
|
|
65
|
+
```
|
|
46
66
|
|
|
47
67
|
### Session Management & State Persistence
|
|
48
68
|
|
package/README.engine.md
CHANGED
|
@@ -31,18 +31,38 @@ This is the abstract base class that defines the contract for all fetch engines.
|
|
|
31
31
|
|
|
32
32
|
This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
|
|
33
33
|
|
|
34
|
-
### `FetchSession(options
|
|
34
|
+
### `FetchSession(options)`
|
|
35
35
|
|
|
36
|
-
The `FetchSession` class manages the lifecycle of a fetch operation.
|
|
36
|
+
The `FetchSession` class manages the lifecycle of a fetch operation. You can specify the `engine` in the `options` to force a specific engine implementation for that session.
|
|
37
37
|
|
|
38
|
-
|
|
38
|
+
```typescript
|
|
39
|
+
const session = new FetchSession({ engine: 'browser' });
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
#### Engine Selection Priority
|
|
43
|
+
|
|
44
|
+
The engine is initialized lazily upon the first action execution and remains fixed for the duration of the session. The selection follows these rules:
|
|
39
45
|
|
|
40
|
-
|
|
46
|
+
1. **Explicit Option**: If `options.engine` (or a temporary context override in `executeAll`) is provided and is not `'auto'`, that engine is used.
|
|
47
|
+
* ⚠️ **Fail-Fast**: If the requested engine is unavailable (e.g., missing dependencies), an error is thrown immediately.
|
|
48
|
+
2. **Site Registry**: If set to `'auto'` (default), the system attempts to match the target URL against the `sites` registry.
|
|
49
|
+
3. **Smart Upgrade**: If enabled, the engine may be dynamically upgraded from `http` to `browser` based on response characteristics (e.g., bot detection or heavy JS).
|
|
50
|
+
4. **Default**: Falls back to `'http'` (Cheerio).
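
The priority order above can be sketched as a small resolver. This is a rough illustration under stated assumptions, not the package's internals: `resolveEngine`, the `SiteRule` shape, the suffix-based host matching, and the `available` set are all hypothetical, and the Smart Upgrade step is left out because it depends on the response rather than on configuration.

```typescript
type EngineName = 'http' | 'browser';
type EngineOption = EngineName | 'auto';

// Hypothetical registry entry; the real `sites` registry may use a different shape.
interface SiteRule {
  domain: string;
  engine: EngineName;
}

function resolveEngine(
  hostname: string,
  explicit: EngineOption | undefined,
  sites: SiteRule[],
  available: ReadonlySet<EngineName>,
): EngineName {
  // 1. An explicit, non-'auto' option wins, and fails fast if unavailable.
  if (explicit !== undefined && explicit !== 'auto') {
    if (!available.has(explicit)) {
      throw new Error(`Engine "${explicit}" is not available`);
    }
    return explicit;
  }

  // 2. In 'auto' mode, consult the site registry (assumed: exact or subdomain match).
  const rule = sites.find(
    (site) => hostname === site.domain || hostname.endsWith(`.${site.domain}`),
  );
  if (rule) return rule.engine;

  // 3. Smart Upgrade would happen later, after a response is seen (not modeled here).
  // 4. Default fallback.
  return 'http';
}

const siteRegistry: SiteRule[] = [{ domain: 'example.com', engine: 'browser' }];
const installedEngines = new Set<EngineName>(['http', 'browser']);
const chosenEngine = resolveEngine('docs.example.com', 'auto', siteRegistry, installedEngines);
```

With this registry, a host under `example.com` resolves to `browser` in `'auto'` mode, while an unlisted host falls through to `http`.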
|
|
41
51
|
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
52
|
+
#### Batch Execution with Overrides
|
|
53
|
+
|
|
54
|
+
You can execute a sequence of actions with temporary configuration overrides (e.g., headers, timeout) that apply only to that specific batch, without modifying the session's global state.
|
|
55
|
+
|
|
56
|
+
```typescript
|
|
57
|
+
// Execute actions with a temporary custom header and timeout
|
|
58
|
+
await session.executeAll([
|
|
59
|
+
{ name: 'goto', params: { url: '...' } },
|
|
60
|
+
{ name: 'extract', params: { ... } }
|
|
61
|
+
], {
|
|
62
|
+
headers: { 'x-custom-priority': 'high' },
|
|
63
|
+
timeoutMs: 30000
|
|
64
|
+
});
|
|
65
|
+
```
|
|
46
66
|
|
|
47
67
|
### Session Management & State Persistence
|
|
48
68
|
|