@isdk/web-fetcher 0.2.12 → 0.3.1
- package/README.action.cn.md +197 -155
- package/README.action.extract.cn.md +263 -0
- package/README.action.extract.md +263 -0
- package/README.action.md +202 -147
- package/README.cn.md +25 -15
- package/README.engine.cn.md +118 -14
- package/README.engine.md +115 -14
- package/README.md +19 -10
- package/dist/index.d.mts +667 -50
- package/dist/index.d.ts +667 -50
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +19 -10
- package/docs/_media/README.action.md +202 -147
- package/docs/_media/README.cn.md +25 -15
- package/docs/_media/README.engine.md +115 -14
- package/docs/classes/CheerioFetchEngine.md +805 -135
- package/docs/classes/ClickAction.md +33 -33
- package/docs/classes/EvaluateAction.md +559 -0
- package/docs/classes/ExtractAction.md +33 -33
- package/docs/classes/FetchAction.md +39 -33
- package/docs/classes/FetchEngine.md +660 -122
- package/docs/classes/FetchSession.md +38 -16
- package/docs/classes/FillAction.md +33 -33
- package/docs/classes/GetContentAction.md +33 -33
- package/docs/classes/GotoAction.md +33 -33
- package/docs/classes/KeyboardPressAction.md +533 -0
- package/docs/classes/KeyboardTypeAction.md +533 -0
- package/docs/classes/MouseClickAction.md +533 -0
- package/docs/classes/MouseMoveAction.md +533 -0
- package/docs/classes/PauseAction.md +33 -33
- package/docs/classes/PlaywrightFetchEngine.md +820 -122
- package/docs/classes/SubmitAction.md +33 -33
- package/docs/classes/TrimAction.md +533 -0
- package/docs/classes/WaitForAction.md +33 -33
- package/docs/classes/WebFetcher.md +9 -9
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +6 -6
- package/docs/globals.md +14 -0
- package/docs/interfaces/BaseFetchActionProperties.md +12 -12
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
- package/docs/interfaces/BaseFetcherProperties.md +32 -28
- package/docs/interfaces/Cookie.md +14 -14
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +81 -0
- package/docs/interfaces/ExtractActionProperties.md +12 -12
- package/docs/interfaces/FetchActionInContext.md +15 -15
- package/docs/interfaces/FetchActionProperties.md +13 -13
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +42 -38
- package/docs/interfaces/FetchEngineContext.md +37 -33
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
- package/docs/interfaces/FetchSite.md +35 -31
- package/docs/interfaces/FetcherOptions.md +34 -30
- package/docs/interfaces/GotoActionOptions.md +14 -6
- package/docs/interfaces/KeyboardPressParams.md +25 -0
- package/docs/interfaces/KeyboardTypeParams.md +25 -0
- package/docs/interfaces/MouseClickParams.md +49 -0
- package/docs/interfaces/MouseMoveParams.md +41 -0
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +27 -0
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +2 -2
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +13 -0
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +11 -0
- package/package.json +11 -11
package/README.action.md
CHANGED
|
@@ -90,6 +90,8 @@ Navigates the browser to a new URL.
|
|
|
90
90
|
* ...other navigation options like `waitUntil`, `timeout` which are passed to the engine.
|
|
91
91
|
* **`returns`**: `response`
|
|
92
92
|
|
|
93
|
+
> **Note**: This action can be called multiple times within a single session script. The engine ensures that each navigation is completed and its corresponding Action Loop is settled before processing the next action in the sequence.
|
|
94
|
+
|
|
93
95
|
#### `click`
|
|
94
96
|
|
|
95
97
|
Clicks on an element specified by a selector.
|
|
@@ -123,6 +125,41 @@ Submits a form.
|
|
|
123
125
|
* `selector` (string, optional): A selector for the form element.
|
|
124
126
|
* **`returns`**: `none`
|
|
125
127
|
|
|
128
|
+
#### `trim`
|
|
129
|
+
|
|
130
|
+
Removes specific elements from the DOM to clean up the page before extraction. This is a persistent modification to the current session's page state.
|
|
131
|
+
|
|
132
|
+
* **`id`**: `trim`
|
|
133
|
+
* **`params`**:
|
|
134
|
+
* **`selectors`** (string | string[], optional): One or more CSS selectors of elements to remove.
|
|
135
|
+
* **`presets`** (string | string[], optional): Predefined groups of elements to remove. Supported presets:
|
|
136
|
+
* `scripts`: Removes all `<script>` tags.
|
|
137
|
+
* `styles`: Removes all `<style>` and `<link rel="stylesheet">` tags.
|
|
138
|
+
* `svgs`: Removes all `<svg>` elements.
|
|
139
|
+
* `images`: Removes `<img>`, `<picture>`, and `<canvas>` elements.
|
|
140
|
+
* `comments`: Removes HTML comments.
|
|
141
|
+
* `hidden`: Removes elements with `hidden` attribute or inline `display:none`. In **browser** mode, it also detects and removes elements hidden via external CSS (e.g., `display: none` or `visibility: hidden` in stylesheets).
|
|
142
|
+
* `all`: Includes all of the above.
|
|
143
|
+
* **`returns`**: `none`
|
|
144
|
+
|
|
145
|
+
**Example: Cleaning up a page before extraction**
|
|
146
|
+
|
|
147
|
+
```json
|
|
148
|
+
{
|
|
149
|
+
"actions": [
|
|
150
|
+
{ "action": "goto", "params": { "url": "https://example.com" } },
|
|
151
|
+
{
|
|
152
|
+
"action": "trim",
|
|
153
|
+
"params": {
|
|
154
|
+
"selectors": ["#ad-banner", ".popup"],
|
|
155
|
+
"presets": ["scripts", "styles", "comments"]
|
|
156
|
+
}
|
|
157
|
+
},
|
|
158
|
+
{ "action": "extract", "params": { "schema": { "content": "#main-content" } } }
|
|
159
|
+
]
|
|
160
|
+
}
|
|
161
|
+
```
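How `selectors` and `presets` might combine into one removal list can be sketched in TypeScript. This is only an illustration built from the preset descriptions above: `PRESET_SELECTORS` and `resolveTrimSelectors` are hypothetical names, not the package's exports, and the `comments` and `hidden` presets are left out because they need node-type or computed-style checks rather than plain CSS selectors.

```typescript
// Selector equivalents for the presets that map directly onto CSS selectors.
// Assumption: derived from the README text, not from the library source.
const PRESET_SELECTORS: Record<string, string[]> = {
  scripts: ['script'],
  styles: ['style', 'link[rel="stylesheet"]'],
  svgs: ['svg'],
  images: ['img', 'picture', 'canvas'],
};

// Normalize the "string | string[]" parameter shape into an array.
function toArray(value?: string | string[]): string[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Hypothetical helper: merge explicit selectors with preset-derived ones,
// dropping duplicates while keeping first-seen order.
function resolveTrimSelectors(params: {
  selectors?: string | string[];
  presets?: string | string[];
}): string[] {
  const result = new Set<string>(toArray(params.selectors));
  for (const preset of toArray(params.presets)) {
    for (const selector of PRESET_SELECTORS[preset] ?? []) {
      result.add(selector);
    }
  }
  return Array.from(result);
}

const resolved = resolveTrimSelectors({
  selectors: '#ad-banner',
  presets: ['scripts', 'styles'],
});
console.log(resolved.join(', '));
```

Unknown preset names are silently ignored in this sketch; a real implementation may prefer to reject them.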
|
|
162
|
+
|
|
126
163
|
#### `waitFor`
|
|
127
164
|
|
|
128
165
|
Pauses execution to wait for one or more conditions to be met.
|
|
@@ -210,198 +247,122 @@ Retrieves the full content of the current page state.
|
|
|
210
247
|
* **`params`**: (none)
|
|
211
248
|
* **`returns`**: `response`
|
|
212
249
|
|
|
213
|
-
#### `
|
|
214
|
-
|
|
215
|
-
Extracts structured data from the page using a powerful and declarative Schema. This is the core Action for data collection.
|
|
216
|
-
|
|
217
|
-
* **`id`**: `extract`
|
|
218
|
-
* **`params`**: An `ExtractSchema` object that defines the extraction rules.
|
|
219
|
-
* **`returns`**: `any` (the extracted data)
|
|
220
|
-
|
|
221
|
-
##### Detailed Explanation of Extraction Schema
|
|
250
|
+
#### `mouseMove`
|
|
222
251
|
|
|
223
|
-
|
|
224
|
-
|
|
225
|
-
###### 1. Extracting a Single Value
|
|
226
|
-
|
|
227
|
-
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
|
|
228
|
-
|
|
229
|
-
```json
|
|
230
|
-
{
|
|
231
|
-
"id": "extract",
|
|
232
|
-
"params": {
|
|
233
|
-
"selector": "h1.main-title",
|
|
234
|
-
"type": "string",
|
|
235
|
-
"mode": "innerText"
|
|
236
|
-
}
|
|
237
|
-
}
|
|
238
|
-
```
|
|
252
|
+
Moves the mouse cursor to a specific coordinate or element. In `browser` mode, it uses a **Bézier curve** to simulate a human-like non-linear trajectory with slight jitter for realism.
|
|
239
253
|
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
243
|
-
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
254
|
+
* **`id`**: `mouseMove`
|
|
255
|
+
* **`params`**:
|
|
256
|
+
* `x` (number, optional): The absolute X coordinate.
|
|
257
|
+
* `y` (number, optional): The absolute Y coordinate.
|
|
258
|
+
* `selector` (string, optional): A CSS selector. If provided, the mouse moves to the center of the element.
|
|
259
|
+
* `steps` (number, optional): The number of intermediate steps for the trajectory (default: `-1`). Set to `-1` to calculate steps automatically based on distance (simulating natural speed).
|
|
260
|
+
* **`returns`**: `none`
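The curved trajectory described for `mouseMove` can be illustrated with a minimal TypeScript sketch. It is not the library's implementation: `bezierPath`, the control-point placement, and the "one step per 10 pixels" rule used when `steps` is `-1` are all assumptions made for illustration, and jitter is omitted so the output stays deterministic.

```typescript
interface Point2D {
  x: number;
  y: number;
}

// Evaluate a cubic Bézier curve at parameter t in [0, 1].
function cubicBezier(p0: Point2D, p1: Point2D, p2: Point2D, p3: Point2D, t: number): Point2D {
  const u = 1 - t;
  const a = u * u * u;
  const b = 3 * u * u * t;
  const c = 3 * u * t * t;
  const d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

// Hypothetical helper: returns the intermediate cursor positions.
// A non-positive `steps` (such as -1) derives the step count from the distance.
function bezierPath(from: Point2D, to: Point2D, steps = -1): Point2D[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  const distance = Math.hypot(dx, dy);
  // Assumed rule: roughly one step per 10 pixels, never fewer than 2.
  const count = steps > 0 ? steps : Math.max(2, Math.round(distance / 10));
  // Push both control points sideways (perpendicular to the straight line)
  // so the path bends instead of running straight.
  const bend = distance * 0.2;
  const nx = distance === 0 ? 0 : -dy / distance;
  const ny = distance === 0 ? 0 : dx / distance;
  const c1 = { x: from.x + dx / 3 + nx * bend, y: from.y + dy / 3 + ny * bend };
  const c2 = { x: from.x + (2 * dx) / 3 + nx * bend, y: from.y + (2 * dy) / 3 + ny * bend };
  const points: Point2D[] = [];
  for (let i = 0; i <= count; i++) {
    points.push(cubicBezier(from, c1, c2, to, i / count));
  }
  return points;
}

const trajectory = bezierPath({ x: 0, y: 0 }, { x: 300, y: 400 });
console.log(trajectory.length, trajectory[0], trajectory[trajectory.length - 1]);
```

The path starts and ends exactly on the requested points, while the midpoint sits off the straight line between them, which is what gives the movement its non-linear shape.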
|
|
247
261
|
|
|
248
|
-
|
|
262
|
+
#### `mouseClick`
|
|
249
263
|
|
|
250
|
-
|
|
264
|
+
Triggers a mouse click at the current position or specified coordinates. If a `selector` is provided, the cursor will first move smoothly to the target element (using dynamic steps) before clicking.
|
|
251
265
|
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
|
|
260
|
-
|
|
261
|
-
}
|
|
262
|
-
}
|
|
263
|
-
}
|
|
264
|
-
```
|
|
266
|
+
* **`id`**: `mouseClick`
|
|
267
|
+
* **`params`**:
|
|
268
|
+
* `x` (number, optional): The absolute X coordinate to click.
|
|
269
|
+
* `y` (number, optional): The absolute Y coordinate to click.
|
|
270
|
+
* `selector` (string, optional): A CSS selector. If provided, moves the mouse to the element first.
|
|
271
|
+
* `button` (string, optional): The mouse button to use (`left`, `right`, or `middle`). Default is `left`.
|
|
272
|
+
* `clickCount` (number, optional): The number of clicks (e.g., 2 for double-click). Default is 1.
|
|
273
|
+
* `delay` (number, optional): Delay between mousedown and mouseup in milliseconds.
|
|
274
|
+
* **`returns`**: `none`
|
|
265
275
|
|
|
266
|
-
|
|
276
|
+
#### `keyboardType`
|
|
267
277
|
|
|
268
|
-
|
|
278
|
+
Simulates a person typing text into the currently focused element.
|
|
269
279
|
|
|
270
|
-
*
|
|
280
|
+
* **`id`**: `keyboardType`
|
|
281
|
+
* **`params`**:
|
|
282
|
+
* `text` (string): The text to type.
|
|
283
|
+
* `delay` (number, optional): The delay between key presses in milliseconds (default: 100).
|
|
284
|
+
* **`returns`**: `none`
|
|
271
285
|
|
|
272
|
-
|
|
273
|
-
{
|
|
274
|
-
"id": "extract",
|
|
275
|
-
"params": {
|
|
276
|
-
"type": "array",
|
|
277
|
-
"selector": ".tags li"
|
|
278
|
-
}
|
|
279
|
-
}
|
|
280
|
-
```
|
|
286
|
+
#### `keyboardPress`
|
|
281
287
|
|
|
282
|
-
|
|
288
|
+
Simulates pressing a single key or a key combination (e.g., `Enter`, `Control+A`).
|
|
283
289
|
|
|
284
|
-
*
|
|
290
|
+
* **`id`**: `keyboardPress`
|
|
291
|
+
* **`params`**:
|
|
292
|
+
* `key` (string): The name of the key to press (e.g., `Enter`, `Tab`, `Backspace`, `ArrowUp`).
|
|
293
|
+
* `delay` (number, optional): The delay after the key press in milliseconds.
|
|
294
|
+
* **`returns`**: `none`
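The mouse and keyboard actions above have no script example in this section, so here is a small sketch of how they might be chained, written as a TypeScript object that serializes to the same JSON shape used elsewhere in this README. The `ActionStep` type is a local stand-in rather than the package's `FetchActionOptions`, and the URL, selector, and typed text are hypothetical.

```typescript
// Local type for illustration only; the package ships its own action option types.
interface ActionStep {
  action: string;
  params?: Record<string, unknown>;
}

const searchScript: { actions: ActionStep[] } = {
  actions: [
    { action: 'goto', params: { url: 'https://example.com' } },
    // With a selector, mouseClick first moves the cursor to the element, then clicks.
    { action: 'mouseClick', params: { selector: '#search-input', button: 'left', clickCount: 1 } },
    // Types into whichever element currently has focus (the input clicked above).
    { action: 'keyboardType', params: { text: 'web fetcher', delay: 100 } },
    // Confirms the input with a single key press.
    { action: 'keyboardPress', params: { key: 'Enter' } },
  ],
};

console.log(JSON.stringify(searchScript, null, 2));
```

Because `keyboardType` targets the focused element, the preceding click is what decides where the text ends up.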
|
|
285
295
|
|
|
286
|
-
|
|
287
|
-
{
|
|
288
|
-
"id": "extract",
|
|
289
|
-
"params": {
|
|
290
|
-
"type": "array",
|
|
291
|
-
"selector": ".gallery img",
|
|
292
|
-
"attribute": "src"
|
|
293
|
-
}
|
|
294
|
-
}
|
|
295
|
-
```
|
|
296
|
+
#### `evaluate`
|
|
296
297
|
|
|
297
|
-
|
|
298
|
+
Executes custom JavaScript code or an expression within the page context.
|
|
298
299
|
|
|
299
|
-
*
|
|
300
|
+
* **`id`**: `evaluate`
|
|
301
|
+
* **`params`**:
|
|
302
|
+
* **`fn`**: `string | Function` - The JavaScript function or expression to execute.
|
|
303
|
+
* In `browser` mode, this runs in the real browser.
|
|
304
|
+
* In `http` mode, this runs in a safe Node.js sandbox with a mocked browser environment (including `window`, `document`, and `console`).
|
|
305
|
+
* **`args`**: `any` - A single argument to pass to the function. Use an array or object to pass multiple values.
|
|
300
306
|
|
|
301
|
-
|
|
302
|
-
* **`columnar`** (formerly Zip): The `selector` matches a **container**, and fields in `items` are parallel columns stitched together by index.
|
|
303
|
-
* **`segmented`**: The `selector` matches a **container**, and items are segmented by an "anchor" field.
|
|
307
|
+
**Key Features:**
|
|
304
308
|
|
|
305
|
-
|
|
309
|
+
- **Automatic Navigation**: If the code modifies `window.location.href` or calls `assign()`/`replace()`, the engine automatically triggers and waits for the navigation to complete.
|
|
310
|
+
- **Enhanced Mock DOM (HTTP Mode)**: Supports common DOM methods like `querySelector`, `querySelectorAll`, `getElementById`, `getElementsByClassName`, and properties like `document.body` and `document.title`.
|
|
311
|
+
- **Sandbox Security**: Uses `util-ex`'s `newFunction` in HTTP mode to prevent global state pollution.
|
|
306
312
|
|
|
307
|
-
|
|
313
|
+
**Example (Array Args):**
|
|
308
314
|
|
|
309
315
|
```json
|
|
310
316
|
{
|
|
311
|
-
"
|
|
317
|
+
"action": "evaluate",
|
|
312
318
|
"params": {
|
|
313
|
-
"
|
|
314
|
-
"
|
|
315
|
-
"mode": "columnar",
|
|
316
|
-
"items": {
|
|
317
|
-
"title": { "selector": ".item-title" },
|
|
318
|
-
"link": { "selector": "a.item-link", "attribute": "href" }
|
|
319
|
-
}
|
|
319
|
+
"fn": "([a, b]) => a + b",
|
|
320
|
+
"args": [1, 2]
|
|
320
321
|
}
|
|
321
322
|
}
|
|
322
323
|
```
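The single-argument convention can be illustrated with a short TypeScript sketch. It uses the plain `Function` constructor as a stand-in, not the `util-ex` sandbox the README mentions, so it shows only how a string `fn` and one `args` value fit together; `evaluateSketch` is a hypothetical name and provides no isolation.

```typescript
// Illustration only: compiles `fn` with the Function constructor.
// If the source evaluates to a function, it is called with the single `args`
// value; otherwise the source is treated as an expression and its value returned.
function evaluateSketch(fn: string, args?: unknown): unknown {
  const compiled: unknown = new Function(`return (${fn});`)();
  if (typeof compiled === 'function') {
    return (compiled as (arg: unknown) => unknown)(args);
  }
  return compiled;
}

// An array passed as `args` arrives as one parameter and is destructured by the function.
console.log(evaluateSketch('([a, b]) => a + b', [1, 2]));
// An object works the same way for named values.
console.log(evaluateSketch('({ first, second }) => first + second', { first: 'a', second: 'b' }));
```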
|
|
323
324
|
|
|
324
|
-
|
|
325
|
-
|
|
326
|
-
**Columnar Configuration:**
|
|
327
|
-
|
|
328
|
-
* **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
|
|
329
|
-
* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists.
|
|
330
|
-
|
|
331
|
-
###### 5. Segmented Mode (Anchor-based Scanning)
|
|
332
|
-
|
|
333
|
-
This mode is ideal for "flat" structures where there are no item wrappers. It uses the first field (or a specified `anchor`) to segment the container's content.
|
|
325
|
+
**Example (Navigation):**
|
|
334
326
|
|
|
335
327
|
```json
|
|
336
328
|
{
|
|
337
|
-
"
|
|
329
|
+
"action": "evaluate",
|
|
338
330
|
"params": {
|
|
339
|
-
"
|
|
340
|
-
"selector": "#flat-container",
|
|
341
|
-
"mode": { "type": "segmented", "anchor": "title" },
|
|
342
|
-
"items": {
|
|
343
|
-
"title": { "selector": "h3" },
|
|
344
|
-
"desc": { "selector": "p" }
|
|
345
|
-
}
|
|
331
|
+
"fn": "() => { window.location.href = '/new-page'; }"
|
|
346
332
|
}
|
|
347
333
|
}
|
|
348
334
|
```
|
|
349
335
|
|
|
350
|
-
|
|
351
|
-
|
|
352
|
-
|
|
336
|
+
$`).
|
|
337
|
+
>
|
|
338
|
+
> * **Navigation**: If the code modifies `window.location.href`, the engine will automatically trigger a `goto` to the new URL.
|
|
353
339
|
|
|
354
|
-
|
|
340
|
+
**Example: Extracting data via custom script**
|
|
355
341
|
|
|
356
342
|
```json
|
|
357
343
|
{
|
|
358
|
-
"
|
|
344
|
+
"action": "evaluate",
|
|
359
345
|
"params": {
|
|
360
|
-
"
|
|
361
|
-
"
|
|
362
|
-
|
|
363
|
-
|
|
346
|
+
"fn": "([selector]) => document.querySelector(selector).innerText",
|
|
347
|
+
"args": [".main-title"]
|
|
348
|
+
},
|
|
349
|
+
"storeAs": "title"
|
|
364
350
|
}
|
|
365
351
|
```
|
|
366
352
|
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
###### 6. Precise Filtering: `has` and `exclude`
|
|
370
|
-
|
|
371
|
-
You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
|
|
372
|
-
|
|
373
|
-
* `has`: A CSS selector to ensure the selected element **must contain** a descendant matching this selector.
|
|
374
|
-
* `exclude`: A CSS selector to **exclude** elements matching this selector from the results.
|
|
353
|
+
#### `extract`
|
|
375
354
|
|
|
376
|
-
|
|
355
|
+
Extracts structured data from the page using a powerful and declarative Schema.
|
|
377
356
|
|
|
378
|
-
|
|
379
|
-
|
|
380
|
-
|
|
381
|
-
{ "id": "goto", "params": { "url": "https://example.com/articles" } },
|
|
382
|
-
{
|
|
383
|
-
"id": "extract",
|
|
384
|
-
"params": {
|
|
385
|
-
"type": "array",
|
|
386
|
-
"selector": "div.article-card",
|
|
387
|
-
"has": "img.cover-image",
|
|
388
|
-
"exclude": ".draft",
|
|
389
|
-
"items": {
|
|
390
|
-
"selector": "a.title-link",
|
|
391
|
-
"attribute": "href"
|
|
392
|
-
}
|
|
393
|
-
}
|
|
394
|
-
}
|
|
395
|
-
]
|
|
396
|
-
}
|
|
397
|
-
```
|
|
357
|
+
* **`id`**: `extract`
|
|
358
|
+
* **`params`**: An `ExtractSchema` object.
|
|
359
|
+
* **`returns`**: The extracted structured data.
|
|
398
360
|
|
|
399
|
-
>
|
|
361
|
+
> **📚 Detailed Manual**: Because `extract` is feature-rich (array modes, scope control, anchor jumping, and more), it has a dedicated manual:
|
|
400
362
|
>
|
|
401
|
-
>
|
|
402
|
-
|
|
403
|
-
|
|
404
|
-
> 4. For each of the remaining `div.article-card` elements, find its descendant `a.title-link` and extract the `href` attribute.
|
|
363
|
+
> 👉 **[Click to View Extract Action Deep Dive](./README.action.extract.md)**
|
|
364
|
+
|
|
365
|
+
---
|
|
405
366
|
|
|
406
367
|
### Building High-Level Semantic Actions via "Composition"
|
|
407
368
|
|
|
@@ -546,3 +507,97 @@ The `FetchAction` base class provides lifecycle hooks that allow injecting custo
|
|
|
546
507
|
* `protected onAfterExec?()`: Called after `onExecute`.
|
|
547
508
|
|
|
548
509
|
For Actions that need to manage complex state or resources, you can implement these hooks. Generally, for composite actions, writing the logic directly in `onExecute` is sufficient.
|
|
510
|
+
|
|
511
|
+
---
|
|
512
|
+
|
|
513
|
+
## 💎 7. Action Return Types & State Management (Advanced)
|
|
514
|
+
|
|
515
|
+
In `@isdk/web-fetcher`, an Action's `static returnType` is more than a type hint. It defines how the framework manages the **session state** and automates data synchronization after execution.
|
|
516
|
+
|
|
517
|
+
### 7.1 Detailed Type Breakdown
|
|
518
|
+
|
|
519
|
+
#### 🟢 `response` (Page Response)
|
|
520
|
+
|
|
521
|
+
* **Definition**: A `FetchResponse` object containing HTTP status, headers, body, and cookies.
|
|
522
|
+
* **Purpose**: To synchronize the latest page content and state with the session.
|
|
523
|
+
* **Usage**: Used by actions that navigate, refresh, or capture the current page state.
|
|
524
|
+
* **System Behavior**: The framework automatically updates `context.lastResponse` with this result in the `afterExec` phase. Subsequent actions can access this via the context.
|
|
525
|
+
* **Typical Actions**: `goto`, `getContent`, `fill` (in some engines).
|
|
526
|
+
* **Example**:
|
|
527
|
+
|
|
528
|
+
```typescript
|
|
529
|
+
export class MyNavigateAction extends FetchAction {
|
|
530
|
+
static override id = 'myGoto';
|
|
531
|
+
static override returnType = 'response' as const;
|
|
532
|
+
|
|
533
|
+
async onExecute(context, options) {
|
|
534
|
+
// Logic that returns a FetchResponse
|
|
535
|
+
return await this.delegateToEngine(context, 'goto', options.params.url);
|
|
536
|
+
}
|
|
537
|
+
}
|
|
538
|
+
```
|
|
539
|
+
|
|
540
|
+
#### 🟡 `any` (Generic Data - Default)
|
|
541
|
+
|
|
542
|
+
* **Definition**: Any serializable data structure (Object, Array, string, etc.).
|
|
543
|
+
* **Purpose**: Primary mechanism for business data extraction.
|
|
544
|
+
* **Usage**: Use this when your action produces processed data that doesn't represent the whole page or system state.
|
|
545
|
+
* **System Behavior**: If the action configuration includes `storeAs: "key"`, the framework automatically saves the `result` into `context.outputs["key"]`. If the target key already contains an object and the new result is also an object, they will be merged (shallow merge) instead of overwritten. This allows multiple `extract` actions to accumulate data into the same output key.
|
|
546
|
+
* **Typical Actions**: `extract`.
|
|
547
|
+
* **Example**:
|
|
548
|
+
|
|
549
|
+
```typescript
|
|
550
|
+
static override returnType = 'any' as const;
|
|
551
|
+
async onExecute(context, options) {
|
|
552
|
+
return { title: 'Hello', price: 99 }; // Saved to outputs if storeAs is set
|
|
553
|
+
}
|
|
554
|
+
```
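The `storeAs` merge rule described under **System Behavior** can be sketched as follows. This is a minimal illustration, not the framework's code: `storeResult` is a hypothetical helper, and treating arrays as non-mergeable values is an assumption.

```typescript
type Outputs = Record<string, unknown>;

// Plain objects only; arrays and null are treated as ordinary values (assumption).
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical helper: shallow-merge when both sides are objects, otherwise overwrite.
function storeResult(outputs: Outputs, key: string, result: unknown): void {
  const existing = outputs[key];
  if (isPlainObject(existing) && isPlainObject(result)) {
    outputs[key] = { ...existing, ...result };
  } else {
    outputs[key] = result;
  }
}

const outputs: Outputs = {};
storeResult(outputs, 'product', { title: 'Hello', meta: { lang: 'en' } });
storeResult(outputs, 'product', { price: 99, meta: { currency: 'USD' } });
console.log(JSON.stringify(outputs));
```

Because the merge is shallow, the second `meta` object replaces the first one instead of being combined with it.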
|
|
555
|
+
|
|
556
|
+
#### ⚪ `none` (No Return)
|
|
557
|
+
|
|
558
|
+
* **Definition**: `void`.
|
|
559
|
+
* **Purpose**: Pure interaction/side-effects without data output.
|
|
560
|
+
* **Usage**: Actions that perform UI interactions or timing controls.
|
|
561
|
+
* **Typical Actions**: `click`, `submit`, `pause`, `trim`, `waitFor`.
|
|
562
|
+
* **Example**:
|
|
563
|
+
|
|
564
|
+
```typescript
|
|
565
|
+
static override returnType = 'none' as const;
|
|
566
|
+
async onExecute(context, options) {
|
|
567
|
+
await this.delegateToEngine(context, 'click', options.params.selector);
|
|
568
|
+
// No return value needed
|
|
569
|
+
}
|
|
570
|
+
```
|
|
571
|
+
|
|
572
|
+
#### 🔵 `outputs` (Accumulated Results)
|
|
573
|
+
|
|
574
|
+
* **Definition**: The entire `context.outputs` record (`Record<string, any>`).
|
|
575
|
+
* **Purpose**: To retrieve all data extracted and stored during the current session.
|
|
576
|
+
* **Usage**: Typically used as a "summary" action at the end of a chain or for debugging.
|
|
577
|
+
* **Typical Actions**: Custom data summary actions.
|
|
578
|
+
|
|
579
|
+
#### 🟣 `context` (Session Snapshot)
|
|
580
|
+
|
|
581
|
+
* **Definition**: The full `FetchContext` object.
|
|
582
|
+
* **Purpose**: Meta-programming and deep debugging.
|
|
583
|
+
* **Usage**: Allows the caller to inspect current session configurations (timeouts, proxies, headers) and internal engine metadata.
|
|
584
|
+
|
|
585
|
+
---
|
|
586
|
+
|
|
587
|
+
### 7.2 Result Wrapping (`FetchActionResult`)
|
|
588
|
+
|
|
589
|
+
Every value returned by `onExecute` is automatically wrapped by the `FetchAction.execute` method into a `FetchActionResult` object. This ensures consistent error handling and metadata tracking across all actions.
|
|
590
|
+
|
|
591
|
+
**Structure of `FetchActionResult`:**
|
|
592
|
+
|
|
593
|
+
* `status`: `Success`, `Failed`, or `Skipped`.
|
|
594
|
+
* `returnType`: Matches the action's `static returnType`.
|
|
595
|
+
* `result`: The raw data returned by `onExecute`.
|
|
596
|
+
* `error`: Captured error if the action failed.
|
|
597
|
+
* `meta`: Diagnostic information including execution time, engine type, and retry counts.
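To make the wrapping concrete, here is a rough TypeScript sketch of the shape listed above and of how a wrapper could populate it. The field types, the `durationMs` meta key, and `wrapExecution` are assumptions for illustration; the real `FetchActionResult` interface and `FetchActionResultStatus` enumeration are documented separately, and the `Skipped` path is not modeled here.

```typescript
type StatusSketch = 'Success' | 'Failed' | 'Skipped';

// Assumed shape, following the field list above.
interface ActionResultSketch<T> {
  status: StatusSketch;
  returnType: string;
  result?: T;
  error?: Error;
  meta?: { durationMs: number };
}

// Hypothetical wrapper: runs the action body and converts the outcome,
// including thrown errors, into a uniform result object.
async function wrapExecution<T>(
  returnType: string,
  run: () => Promise<T>,
): Promise<ActionResultSketch<T>> {
  const startedAt = Date.now();
  try {
    const result = await run();
    return { status: 'Success', returnType, result, meta: { durationMs: Date.now() - startedAt } };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    return { status: 'Failed', returnType, error, meta: { durationMs: Date.now() - startedAt } };
  }
}

wrapExecution('any', async () => ({ title: 'Hello' })).then((r) => console.log(r.status));
```

Catching the error inside the wrapper means callers can inspect `status` instead of wrapping every action in their own `try`/`catch`.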
|
|
598
|
+
|
|
599
|
+
### 7.3 Developer Best Practices
|
|
600
|
+
|
|
601
|
+
1. **Choose `response` for Navigation**: Always use `response` for actions that land on a new URL to ensure the session's "current page" is kept in sync.
|
|
602
|
+
2. **Leverage `any` + `storeAs`**: For data extraction, return the data as `any` and let the user decide the storage key via `storeAs` in the JSON script.
|
|
603
|
+
3. **Be Explicit with `none`**: Using `none` clearly signals that the action is used for its side effects (like clicking or waiting), making the workflow easier to understand.
|
package/README.cn.md
CHANGED
|
@@ -20,9 +20,10 @@
|
|
|
20
20
|
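* **📜 Declarative Action Scripts**: Define multi-step workflows (such as logging in, filling out forms, and clicking buttons) in a simple, readable JSON format.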
|
|
21
21
|
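* **📊 Powerful and Flexible Data Extraction**: Easily extract structured data, from simple text to complex nested structures, through an intuitive declarative Schema.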
|
|
22
22
|
* **🧠 Smart Engine Selection**: Automatically detects dynamic sites and upgrades the engine from `http` to `browser` when needed.
|
|
23
|
+
* **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps bypass common anti-bot measures such as Cloudflare challenges.
|
|
24
|
+
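* **🕹️ High-Fidelity Interaction Simulation**: Supports **Bézier curve** mouse trajectories, realistic typing-delay simulation, and complex keyboard interactions, substantially improving the ability to get past anti-bot detection.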
|
|
23
25
|
* **🧩 Extensibility**: Easily create custom, high-level "composite actions" that encapsulate reusable business logic (for example, a `login` action).
|
|
24
26
|
* **🧲 Advanced Collectors**: Collect data asynchronously in the background, triggered by events, while the main action executes.
|
|
25
|
-
* **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps bypass common anti-bot measures such as Cloudflare challenges.
|
|
26
27
|
|
|
27
28
|
---
|
|
28
29
|
|
|
@@ -132,7 +133,7 @@ searchGoogle('gemini');
|
|
|
132
133
|
* `url` (string): The initial URL to navigate to.
|
|
133
134
|
* `engine` ('http' | 'browser' | 'auto'): The engine to use. Defaults to `auto`.
|
|
134
135
|
* `proxy` (string | string[]): Proxy URL(s) to use for requests.
|
|
135
|
-
* `debug` (boolean):
|
|
136
|
+
* `debug` (boolean | string | string[]): Enables detailed execution metadata in the response (timings, engine used, etc.), or enables debug logging for specific categories (such as 'extract', 'submit', 'request').
|
|
136
137
|
* `actions` (FetchActionOptions[]): An array of action objects to execute. (Supports `action`/`name` as aliases for `id`, and `args` as an alias for `params`.)
|
|
137
138
|
* `headers` (Record<string, string>): Headers to send with all requests.
|
|
138
139
|
* `cookies` (Cookie[]): An array of cookies to use.
|
|
@@ -145,21 +146,30 @@ searchGoogle('gemini');
|
|
|
145
146
|
* `output` (object): Controls the output fields in `FetchResponse`.
|
|
146
147
|
* `cookies` (boolean): Whether to include cookies in the response (default: `true`).
|
|
147
148
|
* `sessionState` (boolean): Whether to include the session state in the response (default: `true`).
|
|
149
|
+
* `browser` (object): Browser engine configuration.
|
|
150
|
+
* `headless` (boolean): Whether to run in headless mode (default: `true`).
|
|
151
|
+
* `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
|
|
148
152
|
* `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
|
|
149
153
|
* ...and many other options for proxies, retries, and more.
|
|
150
154
|
|
|
151
|
-
### Built-in Actions
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
* `goto`:
|
|
156
|
-
* `click`:
|
|
157
|
-
* `fill`:
|
|
158
|
-
* `submit`:
|
|
159
|
-
* `
|
|
160
|
-
* `
|
|
161
|
-
* `
|
|
162
|
-
* `
|
|
155
|
+
### Built-in Actions
|
|
156
|
+
|
|
157
|
+
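The library provides a set of powerful built-in actions. Many of them are engine-agnostic and are handled uniformly by the core layer to ensure consistency: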
|
|
158
|
+
|
|
159
|
+
* `goto`: Navigates to a new URL.
|
|
160
|
+
* `click`: Clicks the element specified by a selector (engine-specific).
|
|
161
|
+
* `fill`: Fills an input field with the specified value (engine-specific).
|
|
162
|
+
* `submit`: Submits a form (engine-specific).
|
|
163
|
+
* `mouseMove`: Moves the mouse pointer to the specified coordinates or element (supports Bézier curves).
|
|
164
|
+
* `mouseClick`: Triggers a mouse click at the current position or at the specified coordinates.
|
|
165
|
+
* `keyboardType`: Simulates a person typing text into the currently focused element.
|
|
166
|
+
* `keyboardPress`: Simulates pressing a single key or a key combination.
|
|
167
|
+
* `trim`: Removes elements from the DOM to clean up the page (such as scripts, ads, and hidden content).
|
|
168
|
+
* `waitFor`: Pauses execution to wait for specific conditions (fixed timeouts are handled uniformly across engines).
|
|
169
|
+
* `pause`: Pauses execution for manual intervention (such as solving a CAPTCHA; handled uniformly by the core layer).
|
|
170
|
+
* `getContent`: Retrieves the full content of the current page state (handled uniformly by the core layer).
|
|
171
|
+
* `evaluate`: Executes custom JavaScript in the page context.
|
|
172
|
+
* `extract`: Extracts structured data using engine-agnostic core logic and engine-specific DOM primitives. Supports `required` fields and `strict` validation.
|
|
163
173
|
|
|
164
174
|
### Response Structure
|
|
165
175
|
|
|
@@ -172,7 +182,7 @@ searchGoogle('gemini');
|
|
|
172
182
|
* `cookies`: An array of cookies.
|
|
173
183
|
* `sessionState`: The Crawlee session state.
|
|
174
184
|
* `text`, `html`: The page content.
|
|
175
|
-
* `outputs` (Record<string, any>): Via `storeAs`
|
|
185
|
+
* `outputs` (Record<string, any>): Data extracted and stored via `storeAs`. Note: when multiple actions store objects under the same key, the objects are merged rather than overwritten.
|
|
176
186
|
|
|
177
187
|
---
|
|
178
188
|
|