@isdk/web-fetcher 0.3.0 → 0.3.1
- package/README.action.cn.md +53 -312
- package/README.action.extract.cn.md +263 -0
- package/README.action.extract.md +263 -0
- package/README.action.md +53 -311
- package/README.cn.md +10 -2
- package/README.engine.cn.md +22 -1
- package/README.engine.md +22 -1
- package/README.md +8 -1
- package/dist/index.d.mts +147 -1
- package/dist/index.d.ts +147 -1
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +8 -1
- package/docs/_media/README.action.md +53 -311
- package/docs/_media/README.cn.md +10 -2
- package/docs/_media/README.engine.md +22 -1
- package/docs/classes/CheerioFetchEngine.md +236 -88
- package/docs/classes/ClickAction.md +23 -23
- package/docs/classes/EvaluateAction.md +23 -23
- package/docs/classes/ExtractAction.md +23 -23
- package/docs/classes/FetchAction.md +27 -23
- package/docs/classes/FetchEngine.md +218 -86
- package/docs/classes/FetchSession.md +13 -13
- package/docs/classes/FillAction.md +23 -23
- package/docs/classes/GetContentAction.md +23 -23
- package/docs/classes/GotoAction.md +23 -23
- package/docs/classes/KeyboardPressAction.md +533 -0
- package/docs/classes/KeyboardTypeAction.md +533 -0
- package/docs/classes/MouseClickAction.md +533 -0
- package/docs/classes/MouseMoveAction.md +533 -0
- package/docs/classes/PauseAction.md +23 -23
- package/docs/classes/PlaywrightFetchEngine.md +337 -87
- package/docs/classes/SubmitAction.md +23 -23
- package/docs/classes/TrimAction.md +23 -23
- package/docs/classes/WaitForAction.md +23 -23
- package/docs/classes/WebFetcher.md +5 -5
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +2 -2
- package/docs/globals.md +8 -0
- package/docs/interfaces/BaseFetchActionProperties.md +12 -12
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
- package/docs/interfaces/BaseFetcherProperties.md +31 -27
- package/docs/interfaces/Cookie.md +14 -14
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +3 -3
- package/docs/interfaces/ExtractActionProperties.md +12 -12
- package/docs/interfaces/FetchActionInContext.md +15 -15
- package/docs/interfaces/FetchActionProperties.md +13 -13
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +41 -37
- package/docs/interfaces/FetchEngineContext.md +36 -32
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +7 -7
- package/docs/interfaces/FetchSite.md +34 -30
- package/docs/interfaces/FetcherOptions.md +33 -29
- package/docs/interfaces/GotoActionOptions.md +14 -6
- package/docs/interfaces/KeyboardPressParams.md +25 -0
- package/docs/interfaces/KeyboardTypeParams.md +25 -0
- package/docs/interfaces/MouseClickParams.md +49 -0
- package/docs/interfaces/MouseMoveParams.md +41 -0
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +3 -3
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +2 -2
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +1 -1
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +1 -1
- package/package.json +10 -10
|
@@ -0,0 +1,263 @@
|
|
|
1
|
+
# 🔍 Extract Action Deep Dive (The Data Surgeon's Manual)
|
|
2
|
+
|
|
3
|
+
[简体中文](./README.action.extract.cn.md) | English
|
|
4
|
+
|
|
5
|
+
`extract` is the heart of `@isdk/web-fetcher`. It's not just a "scraper"; it's an **intelligent converter** that transforms chaotic HTML into polished, ready-to-use JSON.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## ⚡ 1. Quick Start: Shorthand Magic
|
|
10
|
+
|
|
11
|
+
**Scenario**: Standard webpages with simple structures. Get results fast without long JSON configurations.
|
|
12
|
+
|
|
13
|
+
```json
|
|
14
|
+
{
|
|
15
|
+
"action": "extract",
|
|
16
|
+
"params": {
|
|
17
|
+
"title": "h1", // Grab h1 text into 'title'
|
|
18
|
+
"link": { "selector": "a.main", "attribute": "href" }, // Precision attribute extraction
|
|
19
|
+
"tags": { "type": "array", "selector": ".tag-item" } // Grab a group of tags at once
|
|
20
|
+
}
|
|
21
|
+
}
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
---
|
|
25
|
+
|
|
26
|
+
## 📄 2. Basic Value Extraction: The Tweezers
|
|
27
|
+
|
|
28
|
+
**Scenario**: Processing single data points like prices, titles, or IDs.
|
|
29
|
+
|
|
30
|
+
### Value Properties Analysis
|
|
31
|
+
|
|
32
|
+
* **`selector`**: CSS selector—tells the tweezers where to grab.
|
|
33
|
+
* **`type`**: Automatic casting. Supports `string`, `number` (auto-removes currency/units), `boolean`, `html`.
|
|
34
|
+
* **`mode`**: Extraction mode.
|
|
35
|
+
* `innerText`: **(Highly Recommended)** Sees like a human, capturing only visible text and handling line breaks/noise.
|
|
36
|
+
* `text`: Raw `textContent` from the source, including hidden characters.
|
|
37
|
+
* `outerHTML`: Captures the full HTML, including the element's tags.
|
|
38
|
+
* **`attribute`**: If you need an attribute (e.g., `href`, `src`) instead of text, specify it here.
|
|
39
|
+
* **`depth`**: Bubble up. Once matched, move up N parent levels before extracting (useful for grabbing IDs from parent containers).
|
|
40
|
+
|
|
41
|
+
**Example**:
|
|
42
|
+
|
|
43
|
+
```json
|
|
44
|
+
{
|
|
45
|
+
"selector": ".price-tag",
|
|
46
|
+
"type": "number", // Transforms "¥99.00" to 99
|
|
47
|
+
"mode": "innerText", // Ensures clean text
|
|
48
|
+
"attribute": "data-v", // Grabs the data-v attribute
|
|
49
|
+
"depth": 1 // Grabs from the parent element
|
|
50
|
+
}
|
|
51
|
+
```
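
To make the `"number"` cast concrete, here is a minimal TypeScript sketch of how a raw string such as `"¥99.00"` could be reduced to a number. The helper name `castToNumber` and its exact rules (commas treated as thousands separators, first numeric run wins) are assumptions for illustration only; the library's real casting logic may differ.

```typescript
// Hypothetical helper: NOT the library's implementation, only an illustration
// of what casting a raw text value to "number" could look like.
function castToNumber(raw: string): number | null {
  // Assumption: commas are thousands separators, so they are dropped first.
  const match = raw.replace(/,/g, "").match(/-?\d+(?:\.\d+)?/);
  // No digits at all means there is nothing to cast.
  return match ? Number(match[0]) : null;
}

console.log(castToNumber("¥99.00"));        // 99
console.log(castToNumber("$1,299.50 USD")); // 1299.5
console.log(castToNumber("sold out"));      // null
```

Returning `null` for text without digits keeps the result compatible with the `required` handling described in the next section.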
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## 📦 3. Object Extraction: The Bento Box
|
|
56
|
+
|
|
57
|
+
**Scenario**: Pack related fields (e.g., name, avatar, bio) into a structured JSON object.
|
|
58
|
+
|
|
59
|
+
### Object Properties Analysis
|
|
60
|
+
|
|
61
|
+
* **`type: "object"`**: Declares the object structure.
|
|
62
|
+
* **`selector`**: **The root container**. This is crucial! All internal property selectors are searched relative to this container.
|
|
63
|
+
* **`properties`**: **Core Config**. Define your JSON keys and their respective extraction rules.
|
|
64
|
+
* **`required`**: Set `required: true` on any field that is essential. If that field cannot be extracted, the entire object returns `null`.
|
|
65
|
+
* **`depth`**: The maximum number of parent levels the engine may climb when "bubble-up" is triggered by a missing `required` field (see section 6.3).
|
|
66
|
+
|
|
67
|
+
**Example: Packing User Info**
|
|
68
|
+
|
|
69
|
+
```json
|
|
70
|
+
{
|
|
71
|
+
"type": "object",
|
|
72
|
+
"selector": ".user-card",
|
|
73
|
+
"properties": {
|
|
74
|
+
"name": { "selector": ".username", "required": true },
|
|
75
|
+
"bio": ".user-description",
|
|
76
|
+
"avatar": { "selector": "img.avatar", "attribute": "src" }
|
|
77
|
+
}
|
|
78
|
+
}
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
> **💡 Transparent Box (Implicit Object)**: If you omit `type: "object"` and don't provide a `selector`, the box becomes "transparent." If all internal fields fail to extract, the entire box disappears. This is magical for filtering ads or empty items in a list.
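
The two null rules above (a missing `required` field voids an explicit object; an implicit object vanishes when every field is empty) can be sketched in TypeScript as follows. The function names and the `FieldRule` shape are hypothetical and work on already-extracted values rather than on a real DOM.

```typescript
// Hypothetical shapes for illustration; not types exported by the package.
type FieldRule = { value: string | null; required?: boolean };

// Explicit object: one missing required field turns the whole object into null.
function buildObject(
  fields: Record<string, FieldRule>,
): Record<string, string | null> | null {
  const result: Record<string, string | null> = {};
  for (const [key, rule] of Object.entries(fields)) {
    if (rule.required && rule.value === null) return null;
    result[key] = rule.value;
  }
  return result;
}

// Implicit ("transparent") object: disappears only if ALL fields are null.
function buildImplicitObject(
  fields: Record<string, string | null>,
): Record<string, string | null> | null {
  const allMissing = Object.values(fields).every((v) => v === null);
  return allMissing ? null : fields;
}

const userCard = buildObject({
  name: { value: "Alice", required: true },
  bio: { value: null }, // optional, so a null here is tolerated
});
const brokenCard = buildObject({
  name: { value: null, required: true },
  bio: { value: "Hello" },
});
const adSlot = buildImplicitObject({ title: null, price: null });

console.log(userCard, brokenCard, adSlot);
```

In a list, entries like `brokenCard` and `adSlot` are the ones that would be dropped or filtered out.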
|
|
82
|
+
|
|
83
|
+
---
|
|
84
|
+
|
|
85
|
+
## 📑 4. Array Extraction: The Sorting Machines
|
|
86
|
+
|
|
87
|
+
**Scenario**: Handle repeating data like search results, news feeds, or tag clouds.
|
|
88
|
+
|
|
89
|
+
### Array Properties Analysis
|
|
90
|
+
|
|
91
|
+
* **`type: "array"`**: Declares a list extraction.
|
|
92
|
+
* **`selector`**: **Scanning scope**. In default mode, it matches the "wrapper" of each item; in flat modes, it matches the entire list container.
|
|
93
|
+
* **`items`**: **The Blueprint**. Defines what each row looks like. (Note: arrays use `items`, not `properties`).
|
|
94
|
+
* **`mode`**: Choose the sorting strategy:
|
|
95
|
+
* `"nested"`: (Default) Every item has its own "slot" (container).
|
|
96
|
+
* `"columnar"`: Data is laid out in columns without row containers (like Excel).
|
|
97
|
+
* `"segmented"`: Data is completely flat, sliced into segments using "anchors."
|
|
98
|
+
* **`limit`**: Restricts the maximum number of items to prevent data bloat.
|
|
99
|
+
* **`inference`**: Heuristic inference. Enable this to let the engine automatically correct misalignments in messy lists.
|
|
100
|
+
|
|
101
|
+
### 4.1 Nested (The Egg Carton): The Stable Standard
|
|
102
|
+
|
|
103
|
+
**Scenario**: Items are wrapped in `<li>` or `.item` containers. This is the most stable and performant choice.
|
|
104
|
+
|
|
105
|
+
```json
|
|
106
|
+
{
|
|
107
|
+
"type": "array",
|
|
108
|
+
"selector": ".product-list .item", // Matches the shell of each product
|
|
109
|
+
"items": {
|
|
110
|
+
"name": "h3",
|
|
111
|
+
"price": ".price"
|
|
112
|
+
}
|
|
113
|
+
}
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
### 4.2 Columnar: Flat Alignment
|
|
117
|
+
|
|
118
|
+
**Scenario**: Data laid out like a spreadsheet. One column for titles, one for prices, but no shared parent for the row.
|
|
119
|
+
**Broadcasting**: If a field is constant across all items (e.g., category), extract it once from the main container, and it will "broadcast" to every row.
|
|
120
|
+
|
|
121
|
+
```json
|
|
122
|
+
{
|
|
123
|
+
"type": "array",
|
|
124
|
+
"selector": "#table-container",
|
|
125
|
+
"mode": "columnar",
|
|
126
|
+
"items": {
|
|
127
|
+
"name": ".title-cols",
|
|
128
|
+
"date": "" // Empty selector broadcasts date info from the main container
|
|
129
|
+
}
|
|
130
|
+
}
|
|
131
|
+
```
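
The stitching and broadcasting idea can be sketched in TypeScript over plain arrays. This is only an illustration under simplifying assumptions: `stitchColumns` and the `Column` type are hypothetical names, the input is already-extracted text, and mismatched column lengths are tolerated instead of being reported.

```typescript
// Hypothetical model: a column is either one value per row (string[])
// or a single container-level value that is broadcast to every row (string).
type Column = string[] | string;

function stitchColumns(columns: Record<string, Column>): Record<string, string>[] {
  // The row count is taken from the longest per-row column.
  let rowCount = 0;
  for (const col of Object.values(columns)) {
    if (Array.isArray(col)) rowCount = Math.max(rowCount, col.length);
  }

  const rows: Record<string, string>[] = [];
  for (let i = 0; i < rowCount; i++) {
    const row: Record<string, string> = {};
    for (const [key, col] of Object.entries(columns)) {
      // Per-row columns are read by index; broadcast values are reused as-is.
      const cell = Array.isArray(col) ? col[i] : col;
      if (cell !== undefined) row[key] = cell;
    }
    rows.push(row);
  }
  return rows;
}

const stitchedRows = stitchColumns({
  name: ["Alpha", "Beta", "Gamma"], // three matches of a column selector
  date: "2024-01-01",               // extracted once from the container
});
console.log(stitchedRows);
```

Because the broadcast value is read once and reused, the cost of that field does not grow with the number of rows.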
|
|
132
|
+
|
|
133
|
+
### 4.3 Segmented: Slicing the Unordered
|
|
134
|
+
|
|
135
|
+
**Scenario**: Terrible page structure where everything is siblings. You must rely on a marker (like an `h2` header) to define segments.
|
|
136
|
+
|
|
137
|
+
```json
|
|
138
|
+
{
|
|
139
|
+
"type": "array",
|
|
140
|
+
"selector": ".content-body",
|
|
141
|
+
"mode": { "type": "segmented", "anchor": "h2" }, // Slice whenever an h2 is found
|
|
142
|
+
"items": {
|
|
143
|
+
"section_title": "h2",
|
|
144
|
+
"text": "p"
|
|
145
|
+
}
|
|
146
|
+
}
|
|
147
|
+
```
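
As a rough mental model, segmented mode behaves like the TypeScript sketch below: walk the flat sibling list and start a new segment at every anchor. `FlatNode` and `sliceByAnchor` are hypothetical names, and dropping nodes that precede the first anchor is an assumption of this sketch, not documented behavior.

```typescript
// Hypothetical flat representation of sibling elements inside the container.
type FlatNode = { tag: string; text: string };

function sliceByAnchor(nodes: FlatNode[], anchorTag: string): FlatNode[][] {
  const segments: FlatNode[][] = [];
  let current: FlatNode[] | null = null;
  for (const node of nodes) {
    if (node.tag === anchorTag) {
      // Every anchor opens a fresh segment.
      current = [node];
      segments.push(current);
    } else if (current) {
      // Non-anchor siblings belong to the most recent anchor.
      current.push(node);
    }
  }
  return segments;
}

const flatBody: FlatNode[] = [
  { tag: "h2", text: "Intro" },
  { tag: "p", text: "Hello" },
  { tag: "h2", text: "Usage" },
  { tag: "p", text: "Run it" },
];

// Apply the "items" blueprint inside each segment only.
const sections = sliceByAnchor(flatBody, "h2").map((seg) => ({
  section_title: seg.find((n) => n.tag === "h2")?.text ?? null,
  text: seg.find((n) => n.tag === "p")?.text ?? null,
}));
console.log(sections);
```

Each field lookup is confined to its own segment, which is why the second section's `p` can never leak into the first.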
|
|
148
|
+
|
|
149
|
+
---
|
|
150
|
+
|
|
151
|
+
## 🛡️ 5. Quality Control: Filters & Sieves
|
|
152
|
+
|
|
153
|
+
**Scenario**: Filtering out ads, out-of-stock notices, or unwanted junk items.
|
|
154
|
+
|
|
155
|
+
* **`has`**: **The Essence Filter**. Only extract if it contains specific sub-elements. Example: `"has": ".promo-tag"` only keeps items with discount tags.
|
|
156
|
+
* **`exclude`**: **The Junk Remover**. Exclude matching elements. Example: `"exclude": ".is-advertisement"`.
|
|
157
|
+
* **`strict`**: **OCD Mode**. If a required field is missing, throw an error immediately instead of silently returning `null`.
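
The `has` and `exclude` sieves can be pictured with a small TypeScript sketch. Items are modelled here by bare class names instead of real CSS selectors, and `CandidateItem` and `filterItems` are hypothetical names, so treat this as an illustration of the filtering order, not of the engine's selector matching.

```typescript
// Hypothetical item model: classes on the item itself and on its descendants.
type CandidateItem = {
  label: string;
  ownClasses: string[];
  descendantClasses: string[];
};

function filterItems(
  items: CandidateItem[],
  opts: { has?: string; exclude?: string },
): CandidateItem[] {
  return items.filter((item) => {
    // `has`: keep the item only if it contains the wanted sub-element.
    if (opts.has !== undefined && !item.descendantClasses.includes(opts.has)) return false;
    // `exclude`: drop the item if the item itself matches.
    if (opts.exclude !== undefined && item.ownClasses.includes(opts.exclude)) return false;
    return true;
  });
}

const keptItems = filterItems(
  [
    { label: "Discounted lamp", ownClasses: ["item"], descendantClasses: ["title", "promo-tag"] },
    { label: "Sponsored sofa", ownClasses: ["item", "is-advertisement"], descendantClasses: ["title", "promo-tag"] },
    { label: "Plain chair", ownClasses: ["item"], descendantClasses: ["title"] },
  ],
  { has: "promo-tag", exclude: "is-advertisement" },
);
console.log(keptItems.map((item) => item.label));
```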
|
|
158
|
+
|
|
159
|
+
---
|
|
160
|
+
|
|
161
|
+
## 🧠 6. Advanced Mastery: Scope & Boundary Control
|
|
162
|
+
|
|
163
|
+
Surgical solutions for messy, class-less, and flat-structured HTML.
|
|
164
|
+
|
|
165
|
+
### 6.1 Sequential Slicing: Solving "The Twin Tag" Problem
|
|
166
|
+
|
|
167
|
+
* **Pain Point**: A row of identical `<span>` or `<p>` tags. A plain selector matches the first one every time.
|
|
168
|
+
* **Solution**: `relativeTo: "previous"`.
|
|
169
|
+
* **Mechanism**: The engine remembers where it left off. After extracting the first field, it leaves a "Bookmark." The next search starts **after that Bookmark**.
|
|
170
|
+
* **Key Point**: Because extraction is sequential, explicitly set `order` (e.g., `order: ["name", "job"]`) so that bookmarks follow your intended sequence instead of depending on JSON key ordering.
|
|
171
|
+
* **Example Scenario**:
|
|
172
|
+
|
|
173
|
+
```html
|
|
174
|
+
<p>Name</p> <span>Alice</span>
|
|
175
|
+
<p>Job</p> <span>Manager</span>
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
* **Example Config**:
|
|
179
|
+
|
|
180
|
+
```json
|
|
181
|
+
{
|
|
182
|
+
"relativeTo": "previous",
|
|
183
|
+
"order": ["k1", "v1", "k2", "v2"], // Force order to prevent bookmark displacement
|
|
184
|
+
"properties": {
|
|
185
|
+
"k1": "p", // Grabs 1st p (Name)
|
|
186
|
+
"v1": "span", // Grabs Alice after k1
|
|
187
|
+
"k2": "p", // Grabs 2nd p (Job) after v1
|
|
188
|
+
"v2": "span" // Grabs Manager after k2
|
|
189
|
+
}
|
|
190
|
+
}
|
|
191
|
+
```
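
The bookmark mechanism can be sketched in TypeScript over a flat list of siblings. `SiblingNode` and `extractSequentially` are hypothetical names, and matching is done by tag name only; the real engine works on CSS selectors and DOM nodes.

```typescript
// Hypothetical flat model of the siblings from the example scenario.
type SiblingNode = { tag: string; text: string };

function extractSequentially(
  nodes: SiblingNode[],
  order: string[],
  properties: Record<string, string>, // field name -> tag to match
): Record<string, string | null> {
  const result: Record<string, string | null> = {};
  let bookmark = 0; // index where the next search starts
  for (const field of order) {
    const wantedTag = properties[field];
    let found = -1;
    for (let i = bookmark; i < nodes.length; i++) {
      if (nodes[i]?.tag === wantedTag) {
        found = i;
        break;
      }
    }
    if (found === -1) {
      result[field] = null; // nothing left after the bookmark
      continue;
    }
    result[field] = nodes[found]?.text ?? null;
    bookmark = found + 1; // the next field searches AFTER this match
  }
  return result;
}

const person = extractSequentially(
  [
    { tag: "p", text: "Name" },
    { tag: "span", text: "Alice" },
    { tag: "p", text: "Job" },
    { tag: "span", text: "Manager" },
  ],
  ["k1", "v1", "k2", "v2"],
  { k1: "p", v1: "span", k2: "p", v2: "span" },
);
console.log(person);
```

Without the moving bookmark, `k2` would match the first `<p>` again and `v2` the first `<span>`, which is exactly the "twin tag" problem.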
|
|
192
|
+
|
|
193
|
+
### 6.2 Segmented Isolation & LCA: Building "Fences" for Neighbors
|
|
194
|
+
|
|
195
|
+
* **Pain Point**: In flat structures, Item 1's search might bleed into Item 2's content without a boundary.
|
|
196
|
+
* **Solution**: Use `mode: "segmented"` with `anchor`.
|
|
197
|
+
* **Mechanism**: The engine calculates the "Lowest Common Ancestor (LCA)" between anchors and builds a logical fence, locking searches strictly within the item's territory.
|
|
198
|
+
* **Example Scenario**:
|
|
199
|
+
|
|
200
|
+
```html
|
|
201
|
+
<h2 class="title">Article 1</h2>
|
|
202
|
+
<p class="summary">Summary 1</p>
|
|
203
|
+
<h2 class="title">Article 2</h2>
|
|
204
|
+
<p class="summary">Summary 2</p>
|
|
205
|
+
```
|
|
206
|
+
|
|
207
|
+
* **Example Config**:
|
|
208
|
+
|
|
209
|
+
```json
|
|
210
|
+
{
|
|
211
|
+
"type": "array",
|
|
212
|
+
"mode": { "type": "segmented", "anchor": "h2.title" }, // h2 as the fence
|
|
213
|
+
"items": {
|
|
214
|
+
"title": "h2",
|
|
215
|
+
"desc": "p"
|
|
216
|
+
}
|
|
217
|
+
}
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
### 6.3 Bubble-up: The "Retry" Safety Net
|
|
221
|
+
|
|
222
|
+
* **Pain Point**: Data is just outside the locked scope (e.g., an ID on a parent container).
|
|
223
|
+
* **Solution**: Set `depth` and use `required`.
|
|
224
|
+
* **Mechanism**:
|
|
225
|
+
1. The engine looks for a mandatory field in the current scope—fails.
|
|
226
|
+
2. It automatically steps back (`depth: 1`) to the parent node.
|
|
227
|
+
3. It rescans in the wider view to find missing data that was "just out of sight."
|
|
228
|
+
* **Example Scenario**:
|
|
229
|
+
|
|
230
|
+
```html
|
|
231
|
+
<div class="wrapper" data-id="ID_001">
|
|
232
|
+
<div class="content-box">
|
|
233
|
+
<h3 class="title">Product Name</h3>
|
|
234
|
+
</div>
|
|
235
|
+
</div>
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
* **Example Config**:
|
|
239
|
+
|
|
240
|
+
```json
|
|
241
|
+
{
|
|
242
|
+
"selector": ".content-box",
|
|
243
|
+
"depth": 1,
|
|
244
|
+
"properties": {
|
|
245
|
+
"product_id": {
|
|
246
|
+
"selector": "..", // ".." means look at parent
|
|
247
|
+
"attribute": "data-id",
|
|
248
|
+
"required": true // If this field is missing, bubble-up is triggered
|
|
249
|
+
},
|
|
250
|
+
"name": "h3"
|
|
251
|
+
}
|
|
252
|
+
}
|
|
253
|
+
```
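
The retry loop described in the mechanism can be sketched in TypeScript as a bounded walk up the parent chain. `BoxNode` and `findAttributeWithBubbleUp` are hypothetical names, and the sketch only looks at attributes on each scope element; it is a simplified picture, not the engine's actual algorithm.

```typescript
// Hypothetical tree node: attributes plus a link to the parent element.
type BoxNode = { attrs: Record<string, string>; parent: BoxNode | null };

function findAttributeWithBubbleUp(
  start: BoxNode,
  attribute: string,
  depth: number,
): string | null {
  let scope: BoxNode | null = start;
  // Level 0 is the matched element; each further level is one parent up.
  for (let level = 0; level <= depth && scope !== null; level++) {
    const value = scope.attrs[attribute];
    if (value !== undefined) return value;
    scope = scope.parent;
  }
  return null; // still missing after `depth` retries
}

const wrapperNode: BoxNode = { attrs: { class: "wrapper", "data-id": "ID_001" }, parent: null };
const contentBoxNode: BoxNode = { attrs: { class: "content-box" }, parent: wrapperNode };

console.log(findAttributeWithBubbleUp(contentBoxNode, "data-id", 1)); // "ID_001"
console.log(findAttributeWithBubbleUp(contentBoxNode, "data-id", 0)); // null
```

Keeping `depth` small limits how far the scope may widen, which reduces the chance of picking up data that belongs to a neighboring item.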
|
|
254
|
+
|
|
255
|
+
---
|
|
256
|
+
|
|
257
|
+
## 💡 7. Best Practice Tips
|
|
258
|
+
|
|
259
|
+
1. **Containers First**: If the HTML has `.item` wrappers, always prefer `nested` mode; it is the most stable and the fastest.
|
|
260
|
+
2. **innerText as Default**: In most cases it is a better choice than `text`, because it yields cleaner results and handles line breaks automatically.
|
|
261
|
+
3. **Flat Siblings? Use Anchors**: When you see siblings without a container shell, immediately think `segmented` + `anchor`.
|
|
262
|
+
4. **Leverage `required` for Filtering**: If some items in a list are empty placeholders, mark core fields as `required: true`. Items missing those fields automatically become `null`, making them easy to filter out in your code.
|
|
263
|
+
5. **Debug Tip: Check the Scope**: If data is missing, check if your `selector` is too restrictive or try increasing the `depth`.
|
package/README.action.md
CHANGED
|
@@ -247,6 +247,52 @@ Retrieves the full content of the current page state.
|
|
|
247
247
|
* **`params`**: (none)
|
|
248
248
|
* **`returns`**: `response`
|
|
249
249
|
|
|
250
|
+
#### `mouseMove`
|
|
251
|
+
|
|
252
|
+
Moves the mouse cursor to a specific coordinate or element. In `browser` mode, it uses a **Bézier curve** to simulate a human-like non-linear trajectory with slight jitter for realism.
|
|
253
|
+
|
|
254
|
+
* **`id`**: `mouseMove`
|
|
255
|
+
* **`params`**:
|
|
256
|
+
* `x` (number, optional): The absolute X coordinate.
|
|
257
|
+
* `y` (number, optional): The absolute Y coordinate.
|
|
258
|
+
* `selector` (string, optional): A CSS selector. If provided, the mouse moves to the center of the element.
|
|
259
|
+
* `steps` (number, optional): The number of intermediate steps for the trajectory (default: `-1`). Set to `-1` to calculate steps automatically based on distance (simulating natural speed).
|
|
260
|
+
* **`returns`**: `none`
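
To illustrate the idea of a curved trajectory with automatic step counts, here is a minimal TypeScript sketch of a cubic Bézier path between two points. Everything in it is an assumption for illustration: the names `planMousePath` and `PathPoint`, the "one step per 10 pixels" heuristic, and the fixed control-point offset. Jitter is left out so the output stays deterministic; the engine's real trajectory code is not shown in this diff.

```typescript
type PathPoint = { x: number; y: number };

// Standard cubic Bézier evaluation at parameter t in [0, 1].
function cubicBezier(
  p0: PathPoint, p1: PathPoint, p2: PathPoint, p3: PathPoint, t: number,
): PathPoint {
  const u = 1 - t;
  const a = u * u * u;
  const b = 3 * u * u * t;
  const c = 3 * u * t * t;
  const d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

function planMousePath(from: PathPoint, to: PathPoint, steps = -1): PathPoint[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // Assumed heuristic: when steps is not positive, derive it from the distance.
  const count = steps > 0 ? steps : Math.max(2, Math.round(Math.hypot(dx, dy) / 10));
  // Control points are pushed sideways (perpendicular to the straight line)
  // so the path bends instead of running straight.
  const offsetX = -dy * 0.2;
  const offsetY = dx * 0.2;
  const c1 = { x: from.x + dx / 3 + offsetX, y: from.y + dy / 3 + offsetY };
  const c2 = { x: from.x + (2 * dx) / 3 + offsetX, y: from.y + (2 * dy) / 3 + offsetY };

  const path: PathPoint[] = [];
  for (let i = 1; i <= count; i++) {
    path.push(cubicBezier(from, c1, c2, to, i / count));
  }
  return path;
}

const autoPath = planMousePath({ x: 0, y: 0 }, { x: 100, y: 0 });
console.log(autoPath.length, autoPath[autoPath.length - 1]);
```

The final point always coincides with the target, while the intermediate points drift off the straight line, which is what makes the motion look less mechanical.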
|
|
261
|
+
|
|
262
|
+
#### `mouseClick`
|
|
263
|
+
|
|
264
|
+
Triggers a mouse click at the current position or specified coordinates. If a `selector` is provided, the cursor will first move smoothly to the target element (using dynamic steps) before clicking.
|
|
265
|
+
|
|
266
|
+
* **`id`**: `mouseClick`
|
|
267
|
+
* **`params`**:
|
|
268
|
+
* `x` (number, optional): The absolute X coordinate to click.
|
|
269
|
+
* `y` (number, optional): The absolute Y coordinate to click.
|
|
270
|
+
* `selector` (string, optional): A CSS selector. If provided, moves the mouse to the element first.
|
|
271
|
+
* `button` (string, optional): The mouse button to use (`left`, `right`, or `middle`). Default is `left`.
|
|
272
|
+
* `clickCount` (number, optional): The number of clicks (e.g., 2 for double-click). Default is 1.
|
|
273
|
+
* `delay` (number, optional): Delay between mousedown and mouseup in milliseconds.
|
|
274
|
+
* **`returns`**: `none`
|
|
275
|
+
|
|
276
|
+
#### `keyboardType`
|
|
277
|
+
|
|
278
|
+
Simulates a person typing text into the currently focused element.
|
|
279
|
+
|
|
280
|
+
* **`id`**: `keyboardType`
|
|
281
|
+
* **`params`**:
|
|
282
|
+
* `text` (string): The text to type.
|
|
283
|
+
* `delay` (number, optional): The delay between key presses in milliseconds (default: 100).
|
|
284
|
+
* **`returns`**: `none`
|
|
285
|
+
|
|
286
|
+
#### `keyboardPress`
|
|
287
|
+
|
|
288
|
+
Simulates pressing a single key or a key combination (e.g., `Enter`, `Control+A`).
|
|
289
|
+
|
|
290
|
+
* **`id`**: `keyboardPress`
|
|
291
|
+
* **`params`**:
|
|
292
|
+
* `key` (string): The name of the key to press (e.g., `Enter`, `Tab`, `Backspace`, `ArrowUp`).
|
|
293
|
+
* `delay` (number, optional): The delay after the key press in milliseconds.
|
|
294
|
+
* **`returns`**: `none`
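
Putting the input actions together, the TypeScript sketch below builds a short action list that clicks a field, types a query, and presses `Enter`. The local types mirror only the parameters documented above and are written out by hand rather than imported from the package; the `#search-box` selector and the query text are made-up values.

```typescript
// Hypothetical local types mirroring the documented params (not package imports).
type MouseClickStep = {
  id: "mouseClick";
  params: {
    x?: number;
    y?: number;
    selector?: string;
    button?: "left" | "right" | "middle";
    clickCount?: number;
    delay?: number;
  };
};
type KeyboardTypeStep = { id: "keyboardType"; params: { text: string; delay?: number } };
type KeyboardPressStep = { id: "keyboardPress"; params: { key: string; delay?: number } };
type InputStep = MouseClickStep | KeyboardTypeStep | KeyboardPressStep;

const searchSteps: InputStep[] = [
  // Focus the field; with a selector the cursor moves to the element first.
  { id: "mouseClick", params: { selector: "#search-box" } },
  // Type into the currently focused element, 100 ms between key presses.
  { id: "keyboardType", params: { text: "web fetcher", delay: 100 } },
  // Submit with a single key press.
  { id: "keyboardPress", params: { key: "Enter" } },
];

console.log(JSON.stringify(searchSteps, null, 2));
```

The click comes first because `keyboardType` targets whichever element currently has focus.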
|
|
295
|
+
|
|
250
296
|
#### `evaluate`
|
|
251
297
|
|
|
252
298
|
Executes custom JavaScript code or an expression within the page context.
|
|
@@ -306,321 +352,17 @@ $`).
|
|
|
306
352
|
|
|
307
353
|
#### `extract`
|
|
308
354
|
|
|
309
|
-
Extracts structured data from the page using a powerful and declarative Schema.
|
|
355
|
+
Extracts structured data from the page using a powerful and declarative Schema.
|
|
310
356
|
|
|
311
357
|
* **`id`**: `extract`
|
|
312
|
-
* **`params`**: An `ExtractSchema` object
|
|
313
|
-
* **`returns`**:
|
|
314
|
-
|
|
315
|
-
##### Detailed Explanation of Extraction Schema
|
|
316
|
-
|
|
317
|
-
The `params` object itself is a Schema that describes the data structure you want to extract.
|
|
318
|
-
|
|
319
|
-
###### 1. Extracting a Single Value
|
|
320
|
-
|
|
321
|
-
The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
|
|
322
|
-
|
|
323
|
-
* **`depth`** (number, optional): After matching the element with `selector`, it bubbles up the DOM tree by the specified number of levels. The resulting ancestor element becomes the actual target for value extraction (e.g., to extract an attribute from a parent wrapper).
|
|
324
|
-
|
|
325
|
-
```json
|
|
326
|
-
{
|
|
327
|
-
"id": "extract",
|
|
328
|
-
"params": {
|
|
329
|
-
"selector": "h1.main-title",
|
|
330
|
-
"type": "string",
|
|
331
|
-
"mode": "innerText"
|
|
332
|
-
}
|
|
333
|
-
}
|
|
334
|
-
```
|
|
335
|
-
|
|
336
|
-
> **Extraction Modes:**
|
|
337
|
-
>
|
|
338
|
-
> * **`text`** (default): Extracts the `textContent` of the element.
|
|
339
|
-
> * **`innerText`**: Extracts the rendered text, respecting CSS styling and line breaks.
|
|
340
|
-
> * **`html`**: Returns the `innerHTML` of the element.
|
|
341
|
-
> * **`outerHTML`**: Returns the HTML including the tag itself. Useful for preserving the full element structure.
|
|
342
|
-
> The example above will extract the text content of the `<h1>` tag with the class `main-title` using the `innerText` mode.
|
|
343
|
-
|
|
344
|
-
###### 2. Extracting an Object
|
|
345
|
-
|
|
346
|
-
Define a structured object using `type: 'object'` and the `properties` field.
|
|
347
|
-
|
|
348
|
-
```json
|
|
349
|
-
{
|
|
350
|
-
"id": "extract",
|
|
351
|
-
"params": {
|
|
352
|
-
"type": "object",
|
|
353
|
-
"selector": ".author-bio",
|
|
354
|
-
"properties": {
|
|
355
|
-
"name": { "selector": ".author-name" },
|
|
356
|
-
"email": { "selector": "a.email", "attribute": "href" }
|
|
357
|
-
}
|
|
358
|
-
}
|
|
359
|
-
}
|
|
360
|
-
```
|
|
361
|
-
|
|
362
|
-
* **`depth`** (number, optional): Enables "Try-And-Bubble" strategy. If a `required` field is missing in the matched element, the engine attempts to bubble up the DOM tree (up to `depth` levels) to find an ancestor where the required field exists. This is useful when the selector matches a descendant (e.g., an inner `span`) but the data resides on a parent container.
|
|
363
|
-
|
|
364
|
-
**Advanced Object Features:**
|
|
365
|
-
|
|
366
|
-
* **Anchor Jumping (`anchor`)**: Specifies a starting reference point for a field.
|
|
367
|
-
* **Field Reference**: Use the DOM element of a previously extracted field.
|
|
368
|
-
* **CSS Selector**: Query an anchor element on the fly within the object's scope.
|
|
369
|
-
* **`depth`** (number, optional): When using an anchor, defines how many parent levels to traverse upwards to collect following siblings.
|
|
370
|
-
* **Note**: If omitted, the engine defaults to maximum depth (up to the object's root) for backward compatibility. To strictly limit the search to the anchor's own siblings, set `depth: 0`.
|
|
371
|
-
* **Effect**: Once an anchor is set, the search scope for that field becomes the siblings **following** the anchor (and its ancestors, depending on `depth`). This allows for non-linear "jumping" extraction in flat structures.
|
|
372
|
-
* **Sequential Consumption (`relativeTo: "previous"`)**:
|
|
373
|
-
* Combined with the `order` property, this ensures each field's search scope starts *after* the previous field's match.
|
|
374
|
-
* Essential for extracting from lists composed of identical tags (e.g., consecutive `<p>` tags with different meanings).
|
|
375
|
-
|
|
376
|
-
###### 3. Extracting an Array (Convenient Usage)
|
|
377
|
-
|
|
378
|
-
Extract a list using `type: 'array'`. To make the most common operations simpler, we provide some convenient usages.
|
|
379
|
-
|
|
380
|
-
* **Extracting an Array of Texts (Default Behavior)**: When you want to extract a list of text, just provide the selector and omit `items`. This is the most common usage.
|
|
381
|
-
|
|
382
|
-
```json
|
|
383
|
-
{
|
|
384
|
-
"id": "extract",
|
|
385
|
-
"params": {
|
|
386
|
-
"type": "array",
|
|
387
|
-
"selector": ".tags li"
|
|
388
|
-
}
|
|
389
|
-
}
|
|
390
|
-
```
|
|
391
|
-
|
|
392
|
-
> The example above will return an array of the text from all `<li>` tags, e.g., `["tech", "news"]`.
|
|
393
|
-
|
|
394
|
-
* **Extracting an Array of Attributes (Shortcut)**: When you only want to extract a list of attributes (e.g., all `href`s from links), there's no need to nest `items` either. Just declare `attribute` directly in the `array` definition.
|
|
395
|
-
|
|
396
|
-
```json
|
|
397
|
-
{
|
|
398
|
-
"id": "extract",
|
|
399
|
-
"params": {
|
|
400
|
-
"type": "array",
|
|
401
|
-
"selector": ".gallery img",
|
|
402
|
-
"attribute": "src"
|
|
403
|
-
}
|
|
404
|
-
}
|
|
405
|
-
```
|
|
406
|
-
|
|
407
|
-
> The example above will return an array of the `src` attributes from all `<img>` tags.
|
|
408
|
-
|
|
409
|
-
* **Array Extraction Modes**: When extracting an array, the engine supports different modes to handle various DOM structures.
|
|
410
|
-
|
|
411
|
-
* **`nested`** (Default): The `selector` matches individual item wrappers.
|
|
412
|
-
* **`columnar`** (formerly Zip): The `selector` matches a **container**, and fields in `items` are parallel columns stitched together by index.
|
|
413
|
-
* **`segmented`**: The `selector` matches a **container**, and items are segmented by an "anchor" field.
|
|
414
|
-
|
|
415
|
-
###### 4. Columnar Mode (formerly Zip Strategy)
|
|
416
|
-
|
|
417
|
-
This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns. It is highly optimized for performance, especially in browser mode, by minimizing the number of DOM queries and RPC calls.
|
|
418
|
-
|
|
419
|
-
> **💡 Broadcasting & Performance**: If a property matches the container element itself (e.g. by omitting a selector or matching the container's own attributes), its value is **broadcasted** to every row. This is not only a feature but also a major performance optimization: the value is extracted **once** and reused across all rows, avoiding thousands of redundant engine calls.
|
|
420
|
-
|
|
421
|
-
```json
|
|
422
|
-
{
|
|
423
|
-
"id": "extract",
|
|
424
|
-
"params": {
|
|
425
|
-
"type": "array",
|
|
426
|
-
"selector": "#search-results",
|
|
427
|
-
"mode": "columnar",
|
|
428
|
-
"items": {
|
|
429
|
-
"title": { "selector": ".item-title" },
|
|
430
|
-
"link": { "selector": "a.item-link", "attribute": "href" }
|
|
431
|
-
}
|
|
432
|
-
}
|
|
433
|
-
}
|
|
434
|
-
```
|
|
435
|
-
|
|
436
|
-
> **Heuristic Detection:** If `mode` is omitted and the `selector` matches exactly one element while `items` contains nested selectors, the engine automatically uses **columnar** mode.
|
|
358
|
+
* **`params`**: An `ExtractSchema` object.
|
|
359
|
+
* **`returns`**: The extracted structured data.
|
|
437
360
|
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
When you have a list where the category is on the container, but items are inside.
|
|
441
|
-
|
|
442
|
-
```json
|
|
443
|
-
{
|
|
444
|
-
"id": "extract",
|
|
445
|
-
"params": {
|
|
446
|
-
"type": "array",
|
|
447
|
-
"selector": "#book-category",
|
|
448
|
-
"mode": "columnar",
|
|
449
|
-
"items": {
|
|
450
|
-
"category": { "attribute": "data-category" },
|
|
451
|
-
"title": { "selector": ".book-title" }
|
|
452
|
-
}
|
|
453
|
-
}
|
|
454
|
-
}
|
|
455
|
-
```
|
|
456
|
-
|
|
457
|
-
> If `#book-category` has `data-category="Sci-Fi"` and contains 3 books, the result will be 3 items, each having `"category": "Sci-Fi"`.
|
|
458
|
-
|
|
459
|
-
**Columnar Configuration:**
|
|
460
|
-
|
|
461
|
-
* **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
|
|
462
|
-
* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists. It uses an **optimized ancestor search** that is significantly faster in browser mode than manual traversal.
|
|
463
|
-
* **Performance Note**: The engine automatically detects shared structures and pre-calculates alignment to ensure O(N) performance even in complex DOM trees. In browser mode, it minimizes IPC round-trips by pre-calculating "broadcast" flags.
|
|
464
|
-
|
|
465
|
-
###### 5. Segmented Mode (Anchor-based Scanning)
|
|
466
|
-
|
|
467
|
-
Ideal for "flat" structures where there are no item wrappers. It uses a specified `anchor` to segment the container's content.
|
|
468
|
-
|
|
469
|
-
**Core Feature: Automatic Container Detection (Bubble Up)**
|
|
470
|
-
|
|
471
|
-
To handle structures that appear flat but have subtle containers, the engine uses a bubble-up strategy:
|
|
472
|
-
|
|
473
|
-
- **Smart Bubble Up**: When an anchor is nested deep (e.g., `div.card > h3.title`), the engine automatically crawls up the DOM to find the largest "safe container" (e.g., `div.card`) that doesn't overlap with neighbors.
|
|
474
|
-
- **Logical Isolation**: If a container is found, it becomes the scope for that segment. This allows you to extract any content within that container using simple relative selectors, even if it's "above" or deep alongside the anchor.
|
|
475
|
-
- **Flat Fallback**: If no container isolation is possible, it automatically falls back to classic sibling scanning.
|
|
476
|
-
|
|
477
|
-
```json
|
|
478
|
-
{
|
|
479
|
-
"id": "extract",
|
|
480
|
-
"params": {
|
|
481
|
-
"type": "array",
|
|
482
|
-
"selector": "#flat-container",
|
|
483
|
-
"mode": { "type": "segmented", "anchor": "h3.item-title" },
|
|
484
|
-
"items": {
|
|
485
|
-
"title": { "selector": "h3" },
|
|
486
|
-
"desc": { "selector": "p" }
|
|
487
|
-
}
|
|
488
|
-
}
|
|
489
|
-
}
|
|
490
|
-
```
|
|
491
|
-
|
|
492
|
-
**Segmented Configuration:**
|
|
493
|
-
|
|
494
|
-
* **`anchor`** (string):
|
|
495
|
-
* Can be a **field name** defined in `items` (e.g., `"title"`).
|
|
496
|
-
* Can be a **direct CSS selector** (e.g., `"h3.item-title"`).
|
|
497
|
-
* Defaults to the selector of the first field in `items`.
|
|
498
|
-
* **`depth`** (number, optional): The maximum number of levels to bubble up from the anchor to find a segment container. If omitted, it bubbles up as high as possible without conflicting with neighboring segments.
|
|
499
|
-
* **`strict`** (boolean, default: `false`): If `true`, throws an error if no anchor elements are found or if any item violates its own `required` constraints.
|
|
500
|
-
|
|
501
|
-
###### 5.1 Advanced: Handling Repeating Tags (`relativeTo`)
|
|
502
|
-
|
|
503
|
-
When a segment contains multiple identical tags (e.g., several `<p>` tags in a row) representing different fields, use `relativeTo: "previous"` to "consume" them one by one.
|
|
504
|
-
|
|
505
|
-
```json
|
|
506
|
-
{
|
|
507
|
-
"id": "extract",
|
|
508
|
-
"params": {
|
|
509
|
-
"type": "array",
|
|
510
|
-
"selector": "#container",
|
|
511
|
-
"mode": {
|
|
512
|
-
"type": "segmented",
|
|
513
|
-
"anchor": ".item-start",
|
|
514
|
-
"relativeTo": "previous"
|
|
515
|
-
},
|
|
516
|
-
"items": {
|
|
517
|
-
"type": "object",
|
|
518
|
-
"order": ["id", "desc", "extra"],
|
|
519
|
-
"properties": {
|
|
520
|
-
"id": "h1",
|
|
521
|
-
"desc": "p",
|
|
522
|
-
"extra": "p"
|
|
523
|
-
}
|
|
524
|
-
}
|
|
525
|
-
}
|
|
526
|
-
}
|
|
527
|
-
```
|
|
528
|
-
|
|
529
|
-
* **`relativeTo: "previous"`**: After finding `id` (h1), the search for `desc` starts *after* that h1. After finding `desc` (the first p), the search for `extra` starts *after* that p, successfully picking up the second `<p>`.
|
|
530
|
-
* **`order`**: Defines the sequence of consumption. Highly recommended with `relativeTo: "previous"`.
|
|
531
|
-
|
|
532
|
-
###### 6. Quality Control: `required` and `strict`
|
|
533
|
-
|
|
534
|
-
- **`required`**: Marks a field as mandatory.
|
|
535
|
-
- **In Objects**: If any required field is `null`, the entire object returns `null`.
|
|
536
|
-
- **In Arrays**: Items missing a required field are automatically skipped.
|
|
537
|
-
- **Null Propagation**: For implicit objects without a `selector`, if ALL sub-properties are `null`, the object itself becomes `null`, triggering parent-level required or skip logic.
|
|
538
|
-
- **`strict`**:
|
|
539
|
-
- `false` (Default): Silently skip or ignore incomplete data.
|
|
540
|
-
- `true`: Throw an error on any missing required field or alignment mismatch.
|
|
541
|
-
- **Inheritance**: Setting `strict: true` at the array level automatically propagates to all nested children.
|
|
542
|
-
|
|
543
|
-
**Example: Ignoring items with missing mandatory fields**
|
|
544
|
-
|
|
545
|
-
```json
|
|
546
|
-
{
|
|
547
|
-
"id": "extract",
|
|
548
|
-
"params": {
|
|
549
|
-
"type": "array",
|
|
550
|
-
"selector": ".product-list",
|
|
551
|
-
"mode": "columnar",
|
|
552
|
-
"items": {
|
|
553
|
-
"name": { "selector": ".title", "required": true },
|
|
554
|
-
"price": { "selector": ".price", "required": true },
|
|
555
|
-
"discount": ".promo"
|
|
556
|
-
}
|
|
557
|
-
}
|
|
558
|
-
}
|
|
559
|
-
```
|
|
560
|
-
|
|
561
|
-
> In this example, if a product lacks either a `name` or a `price`, it will be completely omitted from the result array. The optional `discount` field doesn't affect the item's inclusion.
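As a rough sketch of the skip-versus-throw behavior, the following models already-extracted rows and applies the `required` rule to them. The names `Row` and `filterRequired` are hypothetical and not part of `@isdk/web-fetcher`; the real engine applies these rules during extraction rather than as a post-processing step.

```typescript
type Row = Record<string, string | null>;

// Keep only rows whose required fields are all present.
// With strict = true, a missing required field throws instead of being skipped.
function filterRequired(rows: Row[], required: string[], strict = false): Row[] {
  const kept: Row[] = [];
  rows.forEach((row, index) => {
    const missing = required.filter((key) => row[key] == null);
    if (missing.length === 0) {
      kept.push(row);
    } else if (strict) {
      throw new Error(`Item ${index} is missing required field(s): ${missing.join(", ")}`);
    }
    // Non-strict mode: the incomplete row is silently dropped.
  });
  return kept;
}

const products: Row[] = [
  { name: "Lamp", price: "$20", discount: null }, // kept: discount is optional
  { name: "Chair", price: null, discount: "10%" }, // dropped: no price
  { name: null, price: "$5", discount: null }, // dropped: no name
];

const kept = filterRequired(products, ["name", "price"]);
console.log(kept.length);
```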
|
|
562
|
-
|
|
563
|
-
###### 7. Implicit Object Extraction (Simplest Syntax)
|
|
564
|
-
|
|
565
|
-
For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not context-defining keywords (like `selector`, `has`, `exclude`, `required`, `strict`, `depth`), it is treated as an object schema where keys are property names.
|
|
566
|
-
|
|
567
|
-
> **Keyword Collision Handling:** You can safely extract a data field named `type` as long as its value is not a reserved schema type (like `"string"`, `"object"`, `"array"`, etc.).
|
|
568
|
-
|
|
569
|
-
```json
|
|
570
|
-
{
|
|
571
|
-
"id": "extract",
|
|
572
|
-
"params": {
|
|
573
|
-
"selector": ".author-bio",
|
|
574
|
-
"name": ".author-name",
|
|
575
|
-
"type": ".author-rank",
|
|
576
|
-
"items": { "type": "array", "selector": "li" }
|
|
577
|
-
}
|
|
578
|
-
}
|
|
579
|
-
```
|
|
580
|
-
|
|
581
|
-
> **Key features of implicit objects:**
|
|
361
|
+
> **📚 Detailed Manual**: Because `extract` has many features (array modes, scope control, anchor jumping, and more), it is documented in a dedicated manual:
|
|
582
362
|
>
|
|
583
|
-
>
|
|
584
|
-
> 2. **String Shorthand**: You can use a simple string as a property value (e.g., `"email": "a.email"`), which is automatically expanded to `{ "selector": "a.email" }`.
|
|
585
|
-
> 3. **Context Separation**: Only `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are used to define the context and validation for the implicit object; all other keys are treated as data to be extracted.
|
|
586
|
-
> 4. **Null Propagation**: If an implicit object has no `selector` and ALL of its sub-properties extract to `null`, the object itself returns `null`. This is crucial for `required` validation on the parent object or for skipping items in an array.
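The string shorthand and context separation rules can be illustrated with a small normalizer. This is a sketch under assumptions: `splitImplicit` and `CONTEXT_KEYS` are hypothetical names, and the check that decides whether a `type` value is a reserved schema type is deliberately left out.

```typescript
// Keys that define context and validation, as listed above; everything else is data.
const CONTEXT_KEYS = ["selector", "has", "exclude", "required", "strict", "depth"];

// Split an implicit object schema into context options and data properties,
// expanding string shorthand ("a.email") into { selector: "a.email" }.
// Note: this sketch does not detect reserved `type` values such as "array".
function splitImplicit(schema: Record<string, unknown>) {
  const context: Record<string, unknown> = {};
  const properties: Record<string, unknown> = {};
  for (const key of Object.keys(schema)) {
    const value = schema[key];
    if (CONTEXT_KEYS.indexOf(key) !== -1) {
      context[key] = value;
    } else {
      properties[key] = typeof value === "string" ? { selector: value } : value;
    }
  }
  return { context, properties };
}

const parts = splitImplicit({
  selector: ".author-bio",
  name: ".author-name",
  type: ".author-rank",
  items: { type: "array", selector: "li" },
});
console.log(JSON.stringify(parts, null, 2));
```

Here `selector` ends up in `context`, while `name`, `type`, and `items` are all treated as fields to extract.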
|
|
587
|
-
|
|
588
|
-
###### 8. Advanced Filtering: `has` and `exclude`
|
|
589
|
-
|
|
590
|
-
You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
|
|
591
|
-
|
|
592
|
-
* `has`: A CSS selector to ensure the selected element **must contain** a descendant matching this selector.
|
|
593
|
-
* `exclude`: A CSS selector to **exclude** elements matching this selector from the results.
|
|
594
|
-
|
|
595
|
-
**Complete Example: Extracting links of articles that have an image and are not marked as "draft"**
|
|
596
|
-
|
|
597
|
-
```json
|
|
598
|
-
{
|
|
599
|
-
"actions": [
|
|
600
|
-
{ "id": "goto", "params": { "url": "https://example.com/articles" } },
|
|
601
|
-
{
|
|
602
|
-
"id": "extract",
|
|
603
|
-
"params": {
|
|
604
|
-
"type": "array",
|
|
605
|
-
"selector": "div.article-card",
|
|
606
|
-
"has": "img.cover-image",
|
|
607
|
-
"exclude": ".draft",
|
|
608
|
-
"items": {
|
|
609
|
-
"selector": "a.title-link",
|
|
610
|
-
"attribute": "href"
|
|
611
|
-
}
|
|
612
|
-
}
|
|
613
|
-
}
|
|
614
|
-
]
|
|
615
|
-
}
|
|
616
|
-
```
|
|
363
|
+
> 👉 **[Click to View Extract Action Deep Dive](./README.action.extract.md)**
|
|
617
364
|
|
|
618
|
-
|
|
619
|
-
>
|
|
620
|
-
> 1. Find all `div.article-card` elements.
|
|
621
|
-
> 2. Filter them to only include those that contain an `<img class="cover-image">`.
|
|
622
|
-
> 3. Further filter the results to exclude any that also have the `.draft` class.
|
|
623
|
-
> 4. For each of the remaining `div.article-card` elements, find its descendant `a.title-link` and extract the `href` attribute.
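The four steps can be mirrored as a plain filter pipeline. The `Card` shape below is a hypothetical stand-in for DOM elements (a boolean replaces the `has` descendant lookup and a string replaces the `a.title-link` lookup), so this only shows the order of the filters, not how the engines evaluate selectors.

```typescript
// Hypothetical stand-in for a DOM element on the articles page.
interface Card {
  classes: string[];
  hasCoverImage: boolean; // stands in for matching has: "img.cover-image"
  href: string; // href of the descendant a.title-link
}

function extractLinks(cards: Card[]): string[] {
  return cards
    .filter((card) => card.classes.indexOf("article-card") !== -1) // 1. selector
    .filter((card) => card.hasCoverImage) // 2. has
    .filter((card) => card.classes.indexOf("draft") === -1) // 3. exclude
    .map((card) => card.href); // 4. items: extract the href attribute
}

const cards: Card[] = [
  { classes: ["article-card"], hasCoverImage: true, href: "/a" },
  { classes: ["article-card", "draft"], hasCoverImage: true, href: "/b" },
  { classes: ["article-card"], hasCoverImage: false, href: "/c" },
  { classes: ["sidebar"], hasCoverImage: true, href: "/d" },
];

const links = extractLinks(cards);
console.log(links);
```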
|
|
365
|
+
---
|
|
624
366
|
|
|
625
367
|
### Building High-Level Semantic Actions via "Composition"
|
|
626
368
|
|
|
@@ -800,7 +542,7 @@ In `@isdk/web-fetcher`, an Action's `static returnType` is more than a type hint
|
|
|
800
542
|
* **Definition**: Any serializable data structure (Object, Array, string, etc.).
|
|
801
543
|
* **Purpose**: Primary mechanism for business data extraction.
|
|
802
544
|
* **Usage**: Use this when your action produces processed data that doesn't represent the whole page or system state.
|
|
803
|
-
* **System Behavior**: If the action configuration includes `storeAs: "key"`, the framework automatically saves the `result` into `context.outputs["key"]`.
|
|
545
|
+
* **System Behavior**: If the action configuration includes `storeAs: "key"`, the framework automatically saves the `result` into `context.outputs["key"]`. If the target key already holds an object and the new result is also an object, the two are shallow-merged rather than overwritten. This lets multiple `extract` actions accumulate data under the same output key.
|
|
804
546
|
* **Typical Actions**: `extract`.
|
|
805
547
|
* **Example**:
|
|
806
548
|
|