@isdk/web-fetcher 0.2.11 → 0.3.0
- package/README.action.cn.md +381 -19
- package/README.action.md +394 -4
- package/README.cn.md +13 -11
- package/README.engine.cn.md +124 -21
- package/README.engine.md +121 -21
- package/README.md +11 -9
- package/dist/index.d.mts +890 -24
- package/dist/index.d.ts +890 -24
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +11 -9
- package/docs/_media/README.action.md +394 -4
- package/docs/_media/README.cn.md +13 -11
- package/docs/_media/README.engine.md +121 -21
- package/docs/classes/CheerioFetchEngine.md +996 -187
- package/docs/classes/ClickAction.md +33 -33
- package/docs/classes/EvaluateAction.md +559 -0
- package/docs/classes/ExtractAction.md +33 -33
- package/docs/classes/FetchAction.md +35 -33
- package/docs/classes/FetchEngine.md +740 -85
- package/docs/classes/FetchSession.md +143 -24
- package/docs/classes/FillAction.md +33 -33
- package/docs/classes/GetContentAction.md +33 -33
- package/docs/classes/GotoAction.md +33 -33
- package/docs/classes/PauseAction.md +33 -33
- package/docs/classes/PlaywrightFetchEngine.md +901 -180
- package/docs/classes/SubmitAction.md +33 -33
- package/docs/classes/TrimAction.md +533 -0
- package/docs/classes/WaitForAction.md +33 -33
- package/docs/classes/WebFetcher.md +57 -9
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +6 -6
- package/docs/globals.md +6 -0
- package/docs/interfaces/BaseFetchActionProperties.md +19 -11
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +27 -15
- package/docs/interfaces/BaseFetcherProperties.md +28 -28
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +81 -0
- package/docs/interfaces/ExtractActionProperties.md +23 -11
- package/docs/interfaces/FetchActionInContext.md +32 -15
- package/docs/interfaces/FetchActionProperties.md +24 -12
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +81 -38
- package/docs/interfaces/FetchEngineContext.md +52 -33
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
- package/docs/interfaces/FetchSite.md +31 -31
- package/docs/interfaces/FetcherOptions.md +30 -30
- package/docs/interfaces/GotoActionOptions.md +6 -6
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +27 -0
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +2 -2
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +13 -0
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +11 -0
- package/package.json +10 -7
|
@@ -90,6 +90,8 @@ Navigates the browser to a new URL.
|
|
|
90
90
|
* ...other navigation options like `waitUntil`, `timeout` which are passed to the engine.
|
|
91
91
|
* **`returns`**: `response`
|
|
92
92
|
|
|
93
|
+
> **Note**: This action can be called multiple times within a single session script. The engine ensures that each navigation is completed and its corresponding Action Loop is settled before processing the next action in the sequence.
|
|
94
|
+
|
|
93
95
|
#### `click`
|
|
94
96
|
|
|
95
97
|
Clicks on an element specified by a selector.
|
|
@@ -123,6 +125,41 @@ Submits a form.
|
|
|
123
125
|
* `selector` (string, optional): A selector for the form element.
|
|
124
126
|
* **`returns`**: `none`
|
|
125
127
|
|
|
128
|
+
#### `trim`
|
|
129
|
+
|
|
130
|
+
Removes specific elements from the DOM to clean up the page before extraction. This is a persistent modification to the current session's page state.
|
|
131
|
+
|
|
132
|
+
* **`id`**: `trim`
|
|
133
|
+
* **`params`**:
|
|
134
|
+
* **`selectors`** (string | string[], optional): One or more CSS selectors of elements to remove.
|
|
135
|
+
* **`presets`** (string | string[], optional): Predefined groups of elements to remove. Supported presets:
|
|
136
|
+
* `scripts`: Removes all `<script>` tags.
|
|
137
|
+
* `styles`: Removes all `<style>` and `<link rel="stylesheet">` tags.
|
|
138
|
+
* `svgs`: Removes all `<svg>` elements.
|
|
139
|
+
* `images`: Removes `<img>`, `<picture>`, and `<canvas>` elements.
|
|
140
|
+
* `comments`: Removes HTML comments.
|
|
141
|
+
* `hidden`: Removes elements with the `hidden` attribute or an inline `display:none` style. In **browser** mode, it also detects and removes elements hidden via external CSS (e.g., `display: none` or `visibility: hidden` in stylesheets).
|
|
142
|
+
* `all`: Includes all of the above.
|
|
143
|
+
* **`returns`**: `none`
|
|
144
|
+
|
|
145
|
+
**Example: Cleaning up a page before extraction**
|
|
146
|
+
|
|
147
|
+
```json
|
|
148
|
+
{
|
|
149
|
+
"actions": [
|
|
150
|
+
{ "action": "goto", "params": { "url": "https://example.com" } },
|
|
151
|
+
{
|
|
152
|
+
"action": "trim",
|
|
153
|
+
"params": {
|
|
154
|
+
"selectors": ["#ad-banner", ".popup"],
|
|
155
|
+
"presets": ["scripts", "styles", "comments"]
|
|
156
|
+
}
|
|
157
|
+
},
|
|
158
|
+
{ "action": "extract", "params": { "schema": { "content": "#main-content" } } }
|
|
159
|
+
]
|
|
160
|
+
}
|
|
161
|
+
```
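To illustrate how `selectors` and `presets` might combine into one removal list, here is a minimal TypeScript sketch. The `PRESET_SELECTORS` table and the `resolveTrimSelectors` helper are hypothetical names inferred from the preset descriptions above; they are not the library's `TRIM_PRESETS` implementation. The `comments` preset is left out because comment nodes cannot be matched by a CSS selector, and `hidden` only approximates the attribute and inline-style cases.

```typescript
// Hypothetical preset-to-selector table, inferred from the README text.
const PRESET_SELECTORS: Record<string, string[]> = {
  scripts: ["script"],
  styles: ["style", 'link[rel="stylesheet"]'],
  svgs: ["svg"],
  images: ["img", "picture", "canvas"],
  // Approximation: stylesheet-based hiding can only be detected in a real browser.
  hidden: ["[hidden]", '[style*="display:none"]'],
};

interface TrimParamsSketch {
  selectors?: string | string[];
  presets?: string | string[];
}

// Normalizes `string | string[] | undefined` into an array.
function toArray(value?: string | string[]): string[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Merges explicit selectors with preset-derived ones, dropping duplicates.
// Unknown presets (including `comments`) contribute no selectors.
function resolveTrimSelectors(params: TrimParamsSketch): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  const add = (selector: string): void => {
    if (!seen.has(selector)) {
      seen.add(selector);
      result.push(selector);
    }
  };
  toArray(params.selectors).forEach(add);
  for (const preset of toArray(params.presets)) {
    const names = preset === "all" ? Object.keys(PRESET_SELECTORS) : [preset];
    for (const name of names) {
      (PRESET_SELECTORS[name] ?? []).forEach(add);
    }
  }
  return result;
}

const trimSelectors = resolveTrimSelectors({
  selectors: ["#ad-banner", ".popup"],
  presets: ["scripts", "styles", "comments"],
});
```

Explicit selectors come first, so a caller can see their own entries at the head of the list before any preset expansions.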
|
|
162
|
+
|
|
126
163
|
#### `waitFor`
|
|
127
164
|
|
|
128
165
|
Pauses execution to wait for one or more conditions to be met.
|
|
@@ -210,6 +247,63 @@ Retrieves the full content of the current page state.
|
|
|
210
247
|
* **`params`**: (none)
|
|
211
248
|
* **`returns`**: `response`
|
|
212
249
|
|
|
250
|
+
#### `evaluate`
|
|
251
|
+
|
|
252
|
+
Executes custom JavaScript code or an expression within the page context.
|
|
253
|
+
|
|
254
|
+
* **`id`**: `evaluate`
|
|
255
|
+
* **`params`**:
|
|
256
|
+
* **`fn`**: `string | Function` - The JavaScript function or expression to execute.
|
|
257
|
+
* In `browser` mode, this runs in the real browser.
|
|
258
|
+
* In `http` mode, this runs in a safe Node.js sandbox with a mocked browser environment (`window`, `document`, `console`, `$`).
|
|
259
|
+
* **`args`**: `any` - A single argument to pass to the function. Use an array or object to pass multiple values.
|
|
260
|
+
|
|
261
|
+
**Key Features:**
|
|
262
|
+
|
|
263
|
+
- **Automatic Navigation**: If the code modifies `window.location.href` or calls `assign()`/`replace()`, the engine automatically triggers and waits for the navigation to complete.
|
|
264
|
+
- **Enhanced Mock DOM (HTTP Mode)**: Supports common DOM methods like `querySelector`, `querySelectorAll`, `getElementById`, `getElementsByClassName`, and properties like `document.body` and `document.title`.
|
|
265
|
+
- **Sandbox Security**: Uses `util-ex`'s `newFunction` in HTTP mode to prevent global state pollution.
|
|
266
|
+
|
|
267
|
+
**Example (Array Args):**
|
|
268
|
+
|
|
269
|
+
```json
|
|
270
|
+
{
|
|
271
|
+
"action": "evaluate",
|
|
272
|
+
"params": {
|
|
273
|
+
"fn": "([a, b]) => a + b",
|
|
274
|
+
"args": [1, 2]
|
|
275
|
+
}
|
|
276
|
+
}
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
**Example (Navigation):**
|
|
280
|
+
|
|
281
|
+
```json
|
|
282
|
+
{
|
|
283
|
+
"action": "evaluate",
|
|
284
|
+
"params": {
|
|
285
|
+
"fn": "() => { window.location.href = '/new-page'; }"
|
|
286
|
+
}
|
|
287
|
+
}
|
|
288
|
+
```
|
|
289
|
+
|
|
290
|
+
$`).
|
|
291
|
+
>
|
|
292
|
+
> * **Navigation**: If the code modifies `window.location.href`, the engine will automatically trigger a `goto` to the new URL.
|
|
293
|
+
|
|
294
|
+
**Example: Extracting data via custom script**
|
|
295
|
+
|
|
296
|
+
```json
|
|
297
|
+
{
|
|
298
|
+
"action": "evaluate",
|
|
299
|
+
"params": {
|
|
300
|
+
"fn": "([selector]) => document.querySelector(selector).innerText",
|
|
301
|
+
"args": [".main-title"]
|
|
302
|
+
},
|
|
303
|
+
"storeAs": "title"
|
|
304
|
+
}
|
|
305
|
+
```
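The single-argument convention used by `args` can be sketched in TypeScript as follows. `runEvaluate` is a hypothetical helper written for this illustration, and it compiles source strings with the plain `Function` constructor, which is an assumption made for brevity and is not a sandbox; the README states that HTTP mode uses `util-ex`'s `newFunction` instead.

```typescript
// Stand-in for the documented calling convention: `fn` is a function or a
// source string, and it receives exactly ONE argument (`args`).
// ASSUMPTION: the Function constructor is used only for illustration.
function runEvaluate(fn: string | Function, args?: unknown): unknown {
  const compiled: unknown =
    typeof fn === "string" ? new Function(`return (${fn});`)() : fn;
  // A source string may be a plain expression ("6 * 7") rather than a function.
  return typeof compiled === "function" ? compiled(args) : compiled;
}

// Multiple values travel inside one array or one object and are destructured.
const sum = runEvaluate("([a, b]) => a + b", [1, 2]);
const greeting = runEvaluate("({ name }) => 'Hello, ' + name", { name: "Ada" });
const constant = runEvaluate("6 * 7");
```

Passing one array or object keeps the payload easy to express in a JSON action script, which is presumably why the action accepts a single `args` value.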
|
|
306
|
+
|
|
213
307
|
#### `extract`
|
|
214
308
|
|
|
215
309
|
Extracts structured data from the page using a powerful and declarative Schema. This is the core Action for data collection.
|
|
@@ -224,19 +318,28 @@ The `params` object itself is a Schema that describes the data structure you wan
|
|
|
224
318
|
|
|
225
319
|
###### 1. Extracting a Single Value
|
|
226
320
|
|
|
227
|
-
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract),
|
|
321
|
+
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
|
|
322
|
+
|
|
323
|
+
* **`depth`** (number, optional): After matching the element with `selector`, the engine bubbles up the DOM tree by the specified number of levels. The resulting ancestor element becomes the actual target for value extraction (e.g., to extract an attribute from a parent wrapper).
|
|
228
324
|
|
|
229
325
|
```json
|
|
230
326
|
{
|
|
231
327
|
"id": "extract",
|
|
232
328
|
"params": {
|
|
233
329
|
"selector": "h1.main-title",
|
|
234
|
-
"type": "string"
|
|
330
|
+
"type": "string",
|
|
331
|
+
"mode": "innerText"
|
|
235
332
|
}
|
|
236
333
|
}
|
|
237
334
|
```
|
|
238
335
|
|
|
239
|
-
>
|
|
336
|
+
> **Extraction Modes:**
|
|
337
|
+
>
|
|
338
|
+
> * **`text`** (default): Extracts the `textContent` of the element.
|
|
339
|
+
> * **`innerText`**: Extracts the rendered text, respecting CSS styling and line breaks.
|
|
340
|
+
> * **`html`**: Returns the `innerHTML` of the element.
|
|
341
|
+
> * **`outerHTML`**: Returns the HTML including the tag itself. Useful for preserving the full element structure.
|
|
342
|
+
> The example above will extract the text content of the `<h1>` tag with the class `main-title` using the `innerText` mode.
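The `depth` option for single values can be pictured as a walk up parent pointers. The sketch below uses a hypothetical `MiniNode` type instead of a real DOM, and it assumes the walk simply stops at the root when `depth` exceeds the number of ancestors; the engine's actual behavior in that case is not stated in the README.

```typescript
// Minimal stand-in for a DOM element with a parent pointer.
interface MiniNode {
  tag: string;
  attrs: Record<string, string>;
  parent?: MiniNode;
}

// Climbs `depth` levels from the matched node (or stops at the root).
function bubbleUp(node: MiniNode, depth: number = 0): MiniNode {
  let current = node;
  for (let i = 0; i < depth; i++) {
    const parent = current.parent;
    if (!parent) break; // ASSUMPTION: stop quietly at the root
    current = parent;
  }
  return current;
}

// <div data-id="42"><a><span>label</span></a></div>
const card: MiniNode = { tag: "div", attrs: { "data-id": "42" } };
const link: MiniNode = { tag: "a", attrs: {}, parent: card };
const label: MiniNode = { tag: "span", attrs: {}, parent: link };

// The selector matched the <span>; depth 2 retargets extraction to the <div>.
const extractionTarget = bubbleUp(label, 2);
const dataId = extractionTarget.attrs["data-id"];
```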
|
|
240
343
|
|
|
241
344
|
###### 2. Extracting an Object
|
|
242
345
|
|
|
@@ -256,6 +359,20 @@ Define a structured object using `type: 'object'` and the `properties` field.
|
|
|
256
359
|
}
|
|
257
360
|
```
|
|
258
361
|
|
|
362
|
+
* **`depth`** (number, optional): Enables "Try-And-Bubble" strategy. If a `required` field is missing in the matched element, the engine attempts to bubble up the DOM tree (up to `depth` levels) to find an ancestor where the required field exists. This is useful when the selector matches a descendant (e.g., an inner `span`) but the data resides on a parent container.
|
|
363
|
+
|
|
364
|
+
**Advanced Object Features:**
|
|
365
|
+
|
|
366
|
+
* **Anchor Jumping (`anchor`)**: Specifies a starting reference point for a field.
|
|
367
|
+
* **Field Reference**: Use the DOM element of a previously extracted field.
|
|
368
|
+
* **CSS Selector**: Query an anchor element on the fly within the object's scope.
|
|
369
|
+
* **`depth`** (number, optional): When using an anchor, defines how many parent levels to traverse upwards to collect following siblings.
|
|
370
|
+
* **Note**: If omitted, the engine defaults to maximum depth (up to the object's root) for backward compatibility. To strictly limit the search to the anchor's own siblings, set `depth: 0`.
|
|
371
|
+
* **Effect**: Once an anchor is set, the search scope for that field becomes the siblings **following** the anchor (and its ancestors, depending on `depth`). This allows for non-linear "jumping" extraction in flat structures.
|
|
372
|
+
* **Sequential Consumption (`relativeTo: "previous"`)**:
|
|
373
|
+
* Combined with the `order` property, this ensures each field's search scope starts *after* the previous field's match.
|
|
374
|
+
* Essential for extracting from lists composed of identical tags (e.g., consecutive `<p>` tags with different meanings).
|
|
375
|
+
|
|
259
376
|
###### 3. Extracting an Array (Convenient Usage)
|
|
260
377
|
|
|
261
378
|
Extract a list using `type: 'array'`. To make the most common operations simpler, we provide some convenient usages.
|
|
@@ -289,7 +406,186 @@ Extract a list using `type: 'array'`. To make the most common operations simpler
|
|
|
289
406
|
|
|
290
407
|
> The example above will return an array of the `src` attributes from all `<img>` tags.
|
|
291
408
|
|
|
292
|
-
|
|
409
|
+
* **Array Extraction Modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
|
|
410
|
+
|
|
411
|
+
* **`nested`** (Default): The `selector` matches individual item wrappers.
|
|
412
|
+
* **`columnar`** (formerly Zip): The `selector` matches a **container**, and fields in `items` are parallel columns stitched together by index.
|
|
413
|
+
* **`segmented`**: The `selector` matches a **container**, and items are segmented by an "anchor" field.
|
|
414
|
+
|
|
415
|
+
###### 4. Columnar Mode (formerly Zip Strategy)
|
|
416
|
+
|
|
417
|
+
This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns. It is highly optimized for performance, especially in browser mode, by minimizing the number of DOM queries and RPC calls.
|
|
418
|
+
|
|
419
|
+
> **💡 Broadcasting & Performance**: If a property matches the container element itself (e.g. by omitting a selector or matching the container's own attributes), its value is **broadcasted** to every row. This is not only a feature but also a major performance optimization: the value is extracted **once** and reused across all rows, avoiding thousands of redundant engine calls.
|
|
420
|
+
|
|
421
|
+
```json
|
|
422
|
+
{
|
|
423
|
+
"id": "extract",
|
|
424
|
+
"params": {
|
|
425
|
+
"type": "array",
|
|
426
|
+
"selector": "#search-results",
|
|
427
|
+
"mode": "columnar",
|
|
428
|
+
"items": {
|
|
429
|
+
"title": { "selector": ".item-title" },
|
|
430
|
+
"link": { "selector": "a.item-link", "attribute": "href" }
|
|
431
|
+
}
|
|
432
|
+
}
|
|
433
|
+
}
|
|
434
|
+
```
|
|
435
|
+
|
|
436
|
+
> **Heuristic Detection:** If `mode` is omitted and the `selector` matches exactly one element while `items` contains nested selectors, the engine automatically uses **columnar** mode.
|
|
437
|
+
|
|
438
|
+
**Example: Columnar Broadcasting**
|
|
439
|
+
|
|
440
|
+
Use this when the category is stored on the container element while the items themselves are nested inside it.
|
|
441
|
+
|
|
442
|
+
```json
|
|
443
|
+
{
|
|
444
|
+
"id": "extract",
|
|
445
|
+
"params": {
|
|
446
|
+
"type": "array",
|
|
447
|
+
"selector": "#book-category",
|
|
448
|
+
"mode": "columnar",
|
|
449
|
+
"items": {
|
|
450
|
+
"category": { "attribute": "data-category" },
|
|
451
|
+
"title": { "selector": ".book-title" }
|
|
452
|
+
}
|
|
453
|
+
}
|
|
454
|
+
}
|
|
455
|
+
```
|
|
456
|
+
|
|
457
|
+
> If `#book-category` has `data-category="Sci-Fi"` and contains 3 books, the result will be 3 items, each having `"category": "Sci-Fi"`.
|
|
458
|
+
|
|
459
|
+
**Columnar Configuration:**
|
|
460
|
+
|
|
461
|
+
* **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
|
|
462
|
+
* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists. It uses an **optimized ancestor search** that is significantly faster in browser mode than manual traversal.
|
|
463
|
+
* **Performance Note**: The engine automatically detects shared structures and pre-calculates alignment to ensure O(N) performance even in complex DOM trees. In browser mode, it minimizes IPC round-trips by pre-calculating "broadcast" flags.
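The columnar idea (parallel columns stitched by index, container-level values broadcast to every row, and `strict` rejecting mismatched counts) can be sketched in TypeScript as below. `zipColumns` and its types are hypothetical and operate on already-extracted strings; the sketch does not model `inference` or the engine's DOM queries.

```typescript
// A bare string models a value taken from the container itself (broadcast);
// an array models one column of per-item matches.
type Column = string | string[];
type Row = Record<string, string | null>;

function zipColumns(columns: Record<string, Column>, strict: boolean = true): Row[] {
  const keys = Object.keys(columns);
  const lengths: number[] = [];
  for (const key of keys) {
    const column = columns[key];
    if (Array.isArray(column)) lengths.push(column.length);
  }
  const rowCount = lengths.length > 0 ? Math.max(...lengths) : 0;
  if (strict && lengths.some((n) => n !== rowCount)) {
    throw new Error(`Column length mismatch: ${lengths.join(", ")}`);
  }
  const rows: Row[] = [];
  for (let i = 0; i < rowCount; i++) {
    const row: Row = {};
    for (const key of keys) {
      const column = columns[key];
      if (Array.isArray(column)) {
        row[key] = column[i] ?? null; // shorter columns yield null when not strict
      } else {
        row[key] = column ?? null; // broadcast: read once, reused for every row
      }
    }
    rows.push(row);
  }
  return rows;
}

const books = zipColumns({
  category: "Sci-Fi",
  title: ["Dune", "Neuromancer", "Foundation"],
});
```

With `strict` set to `false`, the sketch pads short columns with `null` instead of throwing, which is one plausible reading of the non-strict behavior rather than a documented guarantee.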
|
|
464
|
+
|
|
465
|
+
###### 5. Segmented Mode (Anchor-based Scanning)
|
|
466
|
+
|
|
467
|
+
Ideal for "flat" structures where there are no item wrappers. It uses a specified `anchor` to segment the container's content.
|
|
468
|
+
|
|
469
|
+
**Core Feature: Automatic Container Detection (Bubble Up)**
|
|
470
|
+
|
|
471
|
+
To handle structures that appear flat but have subtle containers, the engine uses a bubble-up strategy:
|
|
472
|
+
|
|
473
|
+
- **Smart Bubble Up**: When an anchor is nested deep (e.g., `div.card > h3.title`), the engine automatically crawls up the DOM to find the largest "safe container" (e.g., `div.card`) that doesn't overlap with neighbors.
|
|
474
|
+
- **Logical Isolation**: If a container is found, it becomes the scope for that segment. This allows you to extract any content within that container using simple relative selectors, even if the content sits above the anchor or deep in a sibling branch.
|
|
475
|
+
- **Flat Fallback**: If no container isolation is possible, it automatically falls back to classic sibling scanning.
|
|
476
|
+
|
|
477
|
+
```json
|
|
478
|
+
{
|
|
479
|
+
"id": "extract",
|
|
480
|
+
"params": {
|
|
481
|
+
"type": "array",
|
|
482
|
+
"selector": "#flat-container",
|
|
483
|
+
"mode": { "type": "segmented", "anchor": "h3.item-title" },
|
|
484
|
+
"items": {
|
|
485
|
+
"title": { "selector": "h3" },
|
|
486
|
+
"desc": { "selector": "p" }
|
|
487
|
+
}
|
|
488
|
+
}
|
|
489
|
+
}
|
|
490
|
+
```
|
|
491
|
+
|
|
492
|
+
**Segmented Configuration:**
|
|
493
|
+
|
|
494
|
+
* **`anchor`** (string):
|
|
495
|
+
* Can be a **field name** defined in `items` (e.g., `"title"`).
|
|
496
|
+
* Can be a **direct CSS selector** (e.g., `"h3.item-title"`).
|
|
497
|
+
* Defaults to the selector of the first field in `items`.
|
|
498
|
+
* **`depth`** (number, optional): The maximum number of levels to bubble up from the anchor to find a segment container. If omitted, it bubbles up as high as possible without conflicting with neighboring segments.
|
|
499
|
+
* **`strict`** (boolean, default: `false`): If `true`, throws an error if no anchor elements are found or if any item violates its own `required` constraints.
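The flat fallback (classic sibling scanning) can be sketched as follows: each anchor starts a new segment, and the siblings that follow it belong to that segment until the next anchor. `FlatElement` and `segmentByAnchor` are hypothetical names, tag names stand in for CSS selectors, and the bubble-up container detection is deliberately not modeled.

```typescript
// One child of the flat container, reduced to a tag name and its text.
interface FlatElement {
  tag: string;
  text: string;
}

// Splits a flat child list into segments, starting a new one at each anchor.
function segmentByAnchor(children: FlatElement[], anchorTag: string): FlatElement[][] {
  const segments: FlatElement[][] = [];
  let current: FlatElement[] | undefined;
  for (const child of children) {
    if (child.tag === anchorTag) {
      current = [child];
      segments.push(current);
    } else if (current) {
      current.push(child);
    }
    // Elements before the first anchor belong to no segment.
  }
  return segments;
}

const flatChildren: FlatElement[] = [
  { tag: "p", text: "intro (ignored)" },
  { tag: "h3", text: "First" },
  { tag: "p", text: "First description" },
  { tag: "h3", text: "Second" },
  { tag: "p", text: "Second description" },
];

// Within each segment, fields are resolved with simple relative lookups.
const segmentedItems = segmentByAnchor(flatChildren, "h3").map((segment) => ({
  title: segment.find((el) => el.tag === "h3")?.text ?? null,
  desc: segment.find((el) => el.tag === "p")?.text ?? null,
}));
```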
|
|
500
|
+
|
|
501
|
+
###### 5.1 Advanced: Handling Repeating Tags (`relativeTo`)
|
|
502
|
+
|
|
503
|
+
When a segment contains multiple identical tags (e.g., several `<p>` tags in a row) representing different fields, use `relativeTo: "previous"` to "consume" them one by one.
|
|
504
|
+
|
|
505
|
+
```json
|
|
506
|
+
{
|
|
507
|
+
"id": "extract",
|
|
508
|
+
"params": {
|
|
509
|
+
"type": "array",
|
|
510
|
+
"selector": "#container",
|
|
511
|
+
"mode": {
|
|
512
|
+
"type": "segmented",
|
|
513
|
+
"anchor": ".item-start",
|
|
514
|
+
"relativeTo": "previous"
|
|
515
|
+
},
|
|
516
|
+
"items": {
|
|
517
|
+
"type": "object",
|
|
518
|
+
"order": ["id", "desc", "extra"],
|
|
519
|
+
"properties": {
|
|
520
|
+
"id": "h1",
|
|
521
|
+
"desc": "p",
|
|
522
|
+
"extra": "p"
|
|
523
|
+
}
|
|
524
|
+
}
|
|
525
|
+
}
|
|
526
|
+
}
|
|
527
|
+
```
|
|
528
|
+
|
|
529
|
+
* **`relativeTo: "previous"`**: After finding `id` (h1), the search for `desc` starts *after* that h1. After finding `desc` (the first p), the search for `extra` starts *after* that p, successfully picking up the second `<p>`.
|
|
530
|
+
* **`order`**: Defines the sequence of consumption. Highly recommended with `relativeTo: "previous"`.
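Sequential consumption can be pictured as a cursor that moves past every match. In this hypothetical sketch, `consumeSequentially` walks one segment in the given `order`, with tag names standing in for selectors; it is an illustration of the idea, not the engine's implementation.

```typescript
interface TaggedText {
  tag: string;
  text: string;
}

// Resolves fields in `order`; each search starts after the previous match.
function consumeSequentially(
  segment: TaggedText[],
  order: string[],
  properties: Record<string, string>, // field name -> tag name (simplified selector)
): Record<string, string | null> {
  const result: Record<string, string | null> = {};
  let cursor = 0; // index of the first element that is still unconsumed
  for (const field of order) {
    const wanted = properties[field];
    result[field] = null;
    for (let i = cursor; i < segment.length; i++) {
      const el = segment[i];
      if (el !== undefined && el.tag === wanted) {
        result[field] = el.text;
        cursor = i + 1; // consume the match so the next field cannot reuse it
        break;
      }
    }
  }
  return result;
}

const consumed = consumeSequentially(
  [
    { tag: "h1", text: "A-1" },
    { tag: "p", text: "first paragraph" },
    { tag: "p", text: "second paragraph" },
  ],
  ["id", "desc", "extra"],
  { id: "h1", desc: "p", extra: "p" },
);
```

Without the cursor, both `desc` and `extra` would resolve to the first `<p>`, which is the problem `relativeTo: "previous"` is described as solving.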
|
|
531
|
+
|
|
532
|
+
###### 6. Quality Control: `required` and `strict`
|
|
533
|
+
|
|
534
|
+
- **`required`**: Marks a field as mandatory.
|
|
535
|
+
- **In Objects**: If any required field is `null`, the entire object returns `null`.
|
|
536
|
+
- **In Arrays**: Items missing a required field are automatically skipped.
|
|
537
|
+
- **Null Propagation**: For implicit objects without a `selector`, if ALL sub-properties are `null`, the object itself becomes `null`, triggering parent-level required or skip logic.
|
|
538
|
+
- **`strict`**:
|
|
539
|
+
- `false` (Default): Silently skip or ignore incomplete data.
|
|
540
|
+
- `true`: Throw an error on any missing required field or alignment mismatch.
|
|
541
|
+
- **Inheritance**: Setting `strict: true` at the array level automatically propagates to all nested children.
|
|
542
|
+
|
|
543
|
+
**Example: Ignoring items with missing mandatory fields**
|
|
544
|
+
|
|
545
|
+
```json
|
|
546
|
+
{
|
|
547
|
+
"id": "extract",
|
|
548
|
+
"params": {
|
|
549
|
+
"type": "array",
|
|
550
|
+
"selector": ".product-list",
|
|
551
|
+
"mode": "columnar",
|
|
552
|
+
"items": {
|
|
553
|
+
"name": { "selector": ".title", "required": true },
|
|
554
|
+
"price": { "selector": ".price", "required": true },
|
|
555
|
+
"discount": ".promo"
|
|
556
|
+
}
|
|
557
|
+
}
|
|
558
|
+
}
|
|
559
|
+
```
|
|
560
|
+
|
|
561
|
+
> In this example, if a product lacks either a `name` or a `price`, it will be completely omitted from the result array. The optional `discount` field doesn't affect the item's inclusion.
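The interaction between `required` and `strict` for array items can be sketched as a filter that either skips or throws. `applyRequired` is a hypothetical helper operating on already-extracted rows, assuming missing values are represented as `null`.

```typescript
type ExtractedItem = Record<string, string | null>;

// Keeps items that have every required field; in strict mode, a missing
// required field raises an error instead of silently dropping the item.
function applyRequired(
  items: ExtractedItem[],
  required: string[],
  strict: boolean = false,
): ExtractedItem[] {
  const kept: ExtractedItem[] = [];
  items.forEach((item, index) => {
    const missing = required.filter((key) => item[key] === null || item[key] === undefined);
    if (missing.length === 0) {
      kept.push(item);
    } else if (strict) {
      throw new Error(`Item ${index} is missing required field(s): ${missing.join(", ")}`);
    }
  });
  return kept;
}

const products: ExtractedItem[] = [
  { name: "Lamp", price: "19.99", discount: null },
  { name: "Chair", price: null, discount: "10%" },
  { name: "Desk", price: "120.00", discount: "5%" },
];

// The optional `discount` field never affects inclusion.
const validProducts = applyRequired(products, ["name", "price"]);
```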
|
|
562
|
+
|
|
563
|
+
###### 7. Implicit Object Extraction (Simplest Syntax)
|
|
564
|
+
|
|
565
|
+
For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not context-defining keywords (like `selector`, `has`, `exclude`, `required`, `strict`, `depth`), it is treated as an object schema where keys are property names.
|
|
566
|
+
|
|
567
|
+
> **Keyword Collision Handling:** You can safely extract a data field named `type` as long as its value is not a reserved schema type (like `"string"`, `"object"`, `"array"`, etc.).
|
|
568
|
+
|
|
569
|
+
```json
|
|
570
|
+
{
|
|
571
|
+
"id": "extract",
|
|
572
|
+
"params": {
|
|
573
|
+
"selector": ".author-bio",
|
|
574
|
+
"name": ".author-name",
|
|
575
|
+
"type": ".author-rank",
|
|
576
|
+
"items": { "type": "array", "selector": "li" }
|
|
577
|
+
}
|
|
578
|
+
}
|
|
579
|
+
```
|
|
580
|
+
|
|
581
|
+
> **Key features of implicit objects:**
|
|
582
|
+
>
|
|
583
|
+
> 1. **Keyword Handling**: Common configuration keywords like `items`, `attribute`, or `mode` **can be used as property names** within an implicit object. They are only treated as configuration when a `type` (like `array`) is explicitly present. Configuration keywords like `required`, `strict`, and `depth` are also handled as context-defining keys.
|
|
584
|
+
> 2. **String Shorthand**: You can use a simple string as a property value (e.g., `"email": "a.email"`), which is automatically expanded to `{ "selector": "a.email" }`.
|
|
585
|
+
> 3. **Context Separation**: Only `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are used to define the context and validation for the implicit object; all other keys are treated as data to be extracted.
|
|
586
|
+
> 4. **Null Propagation**: If an implicit object has no `selector` and ALL of its sub-properties extract to `null`, the object itself returns `null`. This is crucial for `required` validation on the parent object or for skipping items in an array.
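The rules above (context keys versus data keys, string shorthand, and the `type` collision check) can be sketched as a small normalizer. `isImplicitObject`, `normalizeImplicitObject`, and the `RESERVED_TYPES` list are hypothetical; in particular, the exact set of reserved schema types is an assumption, since the README only names some of them.

```typescript
// Keys the README lists as defining context and validation.
const CONTEXT_KEYS = ["selector", "has", "exclude", "required", "strict", "depth"];
// ASSUMPTION: illustrative list; the library's reserved types may differ.
const RESERVED_TYPES = ["string", "number", "boolean", "html", "object", "array"];

type SchemaObject = { [key: string]: unknown };

// A schema is implicit when it declares no reserved `type` and carries at
// least one key that is not a context-defining keyword.
function isImplicitObject(schema: SchemaObject): boolean {
  const declared = schema["type"];
  if (typeof declared === "string" && RESERVED_TYPES.indexOf(declared) !== -1) {
    return false;
  }
  return Object.keys(schema).some((key) => CONTEXT_KEYS.indexOf(key) === -1);
}

// Splits context keys from data keys and expands string shorthand.
function normalizeImplicitObject(schema: SchemaObject): {
  context: SchemaObject;
  properties: SchemaObject;
} {
  const context: SchemaObject = {};
  const properties: SchemaObject = {};
  for (const key of Object.keys(schema)) {
    const value = schema[key];
    if (CONTEXT_KEYS.indexOf(key) !== -1) {
      context[key] = value;
    } else {
      properties[key] = typeof value === "string" ? { selector: value } : value;
    }
  }
  return { context, properties };
}

const authorSchema: SchemaObject = {
  selector: ".author-bio",
  name: ".author-name",
  type: ".author-rank", // not a reserved type, so it is a data field
  items: { type: "array", selector: "li" },
};
const normalizedAuthor = normalizeImplicitObject(authorSchema);
```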
|
|
587
|
+
|
|
588
|
+
###### 8. Advanced Filtering: `has` and `exclude`
|
|
293
589
|
|
|
294
590
|
You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
|
|
295
591
|
|
|
@@ -469,3 +765,97 @@ The `FetchAction` base class provides lifecycle hooks that allow injecting custo
|
|
|
469
765
|
* `protected onAfterExec?()`: Called after `onExecute`.
|
|
470
766
|
|
|
471
767
|
For Actions that need to manage complex state or resources, you can implement these hooks. Generally, for composite actions, writing the logic directly in `onExecute` is sufficient.
|
|
768
|
+
|
|
769
|
+
---
|
|
770
|
+
|
|
771
|
+
## 💎 7. Action Return Types & State Management (Advanced)
|
|
772
|
+
|
|
773
|
+
In `@isdk/web-fetcher`, an Action's `static returnType` is more than a type hint. It defines how the framework manages the **session state** and automates data synchronization after execution.
|
|
774
|
+
|
|
775
|
+
### 7.1 Detailed Type Breakdown
|
|
776
|
+
|
|
777
|
+
#### 🟢 `response` (Page Response)
|
|
778
|
+
|
|
779
|
+
* **Definition**: A `FetchResponse` object containing HTTP status, headers, body, and cookies.
|
|
780
|
+
* **Purpose**: To synchronize the latest page content and state with the session.
|
|
781
|
+
* **Usage**: Used by actions that navigate, refresh, or capture the current page state.
|
|
782
|
+
* **System Behavior**: The framework automatically updates `context.lastResponse` with this result in the `afterExec` phase. Subsequent actions can access this via the context.
|
|
783
|
+
* **Typical Actions**: `goto`, `getContent`, `fill` (in some engines).
|
|
784
|
+
* **Example**:
|
|
785
|
+
|
|
786
|
+
```typescript
|
|
787
|
+
export class MyNavigateAction extends FetchAction {
|
|
788
|
+
static override id = 'myGoto';
|
|
789
|
+
static override returnType = 'response' as const;
|
|
790
|
+
|
|
791
|
+
async onExecute(context, options) {
|
|
792
|
+
// Logic that returns a FetchResponse
|
|
793
|
+
return await this.delegateToEngine(context, 'goto', options.params.url);
|
|
794
|
+
}
|
|
795
|
+
}
|
|
796
|
+
```
|
|
797
|
+
|
|
798
|
+
#### 🟡 `any` (Generic Data - Default)
|
|
799
|
+
|
|
800
|
+
* **Definition**: Any serializable data structure (Object, Array, string, etc.).
|
|
801
|
+
* **Purpose**: Primary mechanism for business data extraction.
|
|
802
|
+
* **Usage**: Use this when your action produces processed data that doesn't represent the whole page or system state.
|
|
803
|
+
* **System Behavior**: If the action configuration includes `storeAs: "key"`, the framework automatically saves the `result` into `context.outputs["key"]`.
|
|
804
|
+
* **Typical Actions**: `extract`.
|
|
805
|
+
* **Example**:
|
|
806
|
+
|
|
807
|
+
```typescript
|
|
808
|
+
static override returnType = 'any' as const;
|
|
809
|
+
async onExecute(context, options) {
|
|
810
|
+
return { title: 'Hello', price: 99 }; // Saved to outputs if storeAs is set
|
|
811
|
+
}
|
|
812
|
+
```
|
|
813
|
+
|
|
814
|
+
#### ⚪ `none` (No Return)
|
|
815
|
+
|
|
816
|
+
* **Definition**: `void`.
|
|
817
|
+
* **Purpose**: Pure interaction/side-effects without data output.
|
|
818
|
+
* **Usage**: Actions that perform UI interactions or timing controls.
|
|
819
|
+
* **Typical Actions**: `click`, `submit`, `pause`, `trim`, `waitFor`.
|
|
820
|
+
* **Example**:
|
|
821
|
+
|
|
822
|
+
```typescript
|
|
823
|
+
static override returnType = 'none' as const;
|
|
824
|
+
async onExecute(context, options) {
|
|
825
|
+
await this.delegateToEngine(context, 'click', options.params.selector);
|
|
826
|
+
// No return value needed
|
|
827
|
+
}
|
|
828
|
+
```
|
|
829
|
+
|
|
830
|
+
#### 🔵 `outputs` (Accumulated Results)
|
|
831
|
+
|
|
832
|
+
* **Definition**: The entire `context.outputs` record (`Record<string, any>`).
|
|
833
|
+
* **Purpose**: To retrieve all data extracted and stored during the current session.
|
|
834
|
+
* **Usage**: Typically used as a "summary" action at the end of a chain or for debugging.
|
|
835
|
+
* **Typical Actions**: Custom data summary actions.
|
|
836
|
+
|
|
837
|
+
#### 🟣 `context` (Session Snapshot)
|
|
838
|
+
|
|
839
|
+
* **Definition**: The full `FetchContext` object.
|
|
840
|
+
* **Purpose**: Meta-programming and deep debugging.
|
|
841
|
+
* **Usage**: Allows the caller to inspect current session configurations (timeouts, proxies, headers) and internal engine metadata.
|
|
842
|
+
|
|
843
|
+
---
|
|
844
|
+
|
|
845
|
+
### 7.2 Result Wrapping (`FetchActionResult`)
|
|
846
|
+
|
|
847
|
+
Every value returned by `onExecute` is automatically wrapped by the `FetchAction.execute` method into a `FetchActionResult` object. This ensures consistent error handling and metadata tracking across all actions.
|
|
848
|
+
|
|
849
|
+
**Structure of `FetchActionResult`:**
|
|
850
|
+
|
|
851
|
+
* `status`: `Success`, `Failed`, or `Skipped`.
|
|
852
|
+
* `returnType`: Matches the action's `static returnType`.
|
|
853
|
+
* `result`: The raw data returned by `onExecute`.
|
|
854
|
+
* `error`: Captured error if the action failed.
|
|
855
|
+
* `meta`: Diagnostic information including execution time, engine type, and retry counts.
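The wrapping step can be sketched as a try/catch around `onExecute`. Everything here is a simplified, hypothetical stand-in: `ActionResultSketch` carries only a few of the fields listed above, `wrapExecute` is synchronous for brevity (real actions are asynchronous), and it stores to `outputs` whenever `storeAs` is given, without checking the return type.

```typescript
type SketchStatus = "Success" | "Failed";

// Reduced version of the result shape described above.
interface ActionResultSketch<T> {
  status: SketchStatus;
  returnType: string;
  result?: T;
  error?: Error;
  meta: { durationMs: number };
}

// Runs `onExecute`, captures errors, and optionally stores the result.
function wrapExecute<T>(
  returnType: string,
  onExecute: () => T,
  outputs: Record<string, unknown>,
  storeAs?: string,
): ActionResultSketch<T> {
  const startedAt = Date.now();
  try {
    const result = onExecute();
    if (storeAs !== undefined) outputs[storeAs] = result;
    return { status: "Success", returnType, result, meta: { durationMs: Date.now() - startedAt } };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    return { status: "Failed", returnType, error, meta: { durationMs: Date.now() - startedAt } };
  }
}

const sessionOutputs: Record<string, unknown> = {};
const okResult = wrapExecute("any", () => ({ title: "Hello", price: 99 }), sessionOutputs, "book");
const failedResult = wrapExecute(
  "none",
  (): void => {
    throw new Error("selector not found");
  },
  sessionOutputs,
);
```

Returning a failed result rather than rethrowing lets a caller inspect every step of a script uniformly, which matches the stated goal of consistent error handling.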
|
|
856
|
+
|
|
857
|
+
### 7.3 Developer Best Practices
|
|
858
|
+
|
|
859
|
+
1. **Choose `response` for Navigation**: Always use `response` for actions that land on a new URL to ensure the session's "current page" is kept in sync.
|
|
860
|
+
2. **Leverage `any` + `storeAs`**: For data extraction, return the data as `any` and let the user decide the storage key via `storeAs` in the JSON script.
|
|
861
|
+
3. **Be Explicit with `none`**: Using `none` clearly signals that the action is used for its side effects (like clicking or waiting), making the workflow easier to understand.
|
package/docs/_media/README.cn.md
CHANGED
|
@@ -132,7 +132,7 @@ searchGoogle('gemini');
|
|
|
132
132
|
* `url` (string): The initial URL to navigate to.
|
|
133
133
|
* `engine` ('http' | 'browser' | 'auto'): The engine to use. Defaults to `auto`.
|
|
134
134
|
* `proxy` (string | string[]): Proxy URL(s) to use for requests.
|
|
135
|
-
* `debug` (boolean):
|
|
135
|
+
* `debug` (boolean | string | string[]): Enables detailed execution metadata in the response (timings, engine used, etc.), or enables debug logging for specific categories (such as 'extract', 'submit', 'request').
|
|
136
136
|
* `actions` (FetchActionOptions[]): An array of action objects to execute. (Supports `action`/`name` as aliases for `id`, and `args` as an alias for `params`.)
|
|
137
137
|
* `headers` (Record<string, string>): Headers to send with all requests.
|
|
138
138
|
* `cookies` (Cookie[]): An array of cookies to use.
|
|
@@ -148,18 +148,20 @@ searchGoogle('gemini');
|
|
|
148
148
|
* `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
|
|
149
149
|
* ...and many other options for proxies, retries, and more.
|
|
150
150
|
|
|
151
|
-
### Built-in Actions
|
|
151
|
+
### Built-in Actions
|
|
152
152
|
|
|
153
|
-
|
|
153
|
+
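The library provides a set of powerful built-in actions, many of which are engine-agnostic and handled centrally by the core layer to ensure consistency: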
|
|
154
154
|
|
|
155
|
-
* `goto`:
|
|
156
|
-
* `click`:
|
|
157
|
-
* `fill`:
|
|
158
|
-
* `submit`:
|
|
159
|
-
* `
|
|
160
|
-
* `
|
|
161
|
-
* `
|
|
162
|
-
* `
|
|
155
|
+
* `goto`: Navigates to a new URL.
|
|
156
|
+
* `click`: Clicks the element specified by a selector (engine-specific).
|
|
157
|
+
* `fill`: Fills an input field with the specified value (engine-specific).
|
|
158
|
+
* `submit`: Submits a form (engine-specific).
|
|
159
|
+
* `trim`: Removes elements from the DOM to clean up the page (e.g., scripts, ads, hidden content).
|
|
160
|
+
* `waitFor`: Pauses execution to wait for a specific condition (supports fixed timeouts, which are handled centrally).
|
|
161
|
+
* `pause`: Pauses execution for manual intervention (e.g., solving a CAPTCHA; handled centrally by the core layer).
|
|
162
|
+
* `getContent`: Retrieves the full content of the current page state (handled centrally by the core layer).
|
|
163
|
+
* `evaluate`: Executes custom JavaScript in the page context.
|
|
164
|
+
* `extract`: Extracts structured data using engine-agnostic core logic and engine-specific DOM primitives. Supports `required` fields and `strict` validation.
|
|
163
165
|
|
|
164
166
|
### Response Structure
|
|
165
167
|
|