@isdk/web-fetcher 0.2.12 → 0.3.1
- package/README.action.cn.md +197 -155
- package/README.action.extract.cn.md +263 -0
- package/README.action.extract.md +263 -0
- package/README.action.md +202 -147
- package/README.cn.md +25 -15
- package/README.engine.cn.md +118 -14
- package/README.engine.md +115 -14
- package/README.md +19 -10
- package/dist/index.d.mts +667 -50
- package/dist/index.d.ts +667 -50
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +19 -10
- package/docs/_media/README.action.md +202 -147
- package/docs/_media/README.cn.md +25 -15
- package/docs/_media/README.engine.md +115 -14
- package/docs/classes/CheerioFetchEngine.md +805 -135
- package/docs/classes/ClickAction.md +33 -33
- package/docs/classes/EvaluateAction.md +559 -0
- package/docs/classes/ExtractAction.md +33 -33
- package/docs/classes/FetchAction.md +39 -33
- package/docs/classes/FetchEngine.md +660 -122
- package/docs/classes/FetchSession.md +38 -16
- package/docs/classes/FillAction.md +33 -33
- package/docs/classes/GetContentAction.md +33 -33
- package/docs/classes/GotoAction.md +33 -33
- package/docs/classes/KeyboardPressAction.md +533 -0
- package/docs/classes/KeyboardTypeAction.md +533 -0
- package/docs/classes/MouseClickAction.md +533 -0
- package/docs/classes/MouseMoveAction.md +533 -0
- package/docs/classes/PauseAction.md +33 -33
- package/docs/classes/PlaywrightFetchEngine.md +820 -122
- package/docs/classes/SubmitAction.md +33 -33
- package/docs/classes/TrimAction.md +533 -0
- package/docs/classes/WaitForAction.md +33 -33
- package/docs/classes/WebFetcher.md +9 -9
- package/docs/enumerations/FetchActionResultStatus.md +4 -4
- package/docs/functions/fetchWeb.md +6 -6
- package/docs/globals.md +14 -0
- package/docs/interfaces/BaseFetchActionProperties.md +12 -12
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
- package/docs/interfaces/BaseFetcherProperties.md +32 -28
- package/docs/interfaces/Cookie.md +14 -14
- package/docs/interfaces/DispatchedEngineAction.md +4 -4
- package/docs/interfaces/EvaluateActionOptions.md +81 -0
- package/docs/interfaces/ExtractActionProperties.md +12 -12
- package/docs/interfaces/FetchActionInContext.md +15 -15
- package/docs/interfaces/FetchActionProperties.md +13 -13
- package/docs/interfaces/FetchActionResult.md +6 -6
- package/docs/interfaces/FetchContext.md +42 -38
- package/docs/interfaces/FetchEngineContext.md +37 -33
- package/docs/interfaces/FetchMetadata.md +5 -5
- package/docs/interfaces/FetchResponse.md +14 -14
- package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
- package/docs/interfaces/FetchSite.md +35 -31
- package/docs/interfaces/FetcherOptions.md +34 -30
- package/docs/interfaces/GotoActionOptions.md +14 -6
- package/docs/interfaces/KeyboardPressParams.md +25 -0
- package/docs/interfaces/KeyboardTypeParams.md +25 -0
- package/docs/interfaces/MouseClickParams.md +49 -0
- package/docs/interfaces/MouseMoveParams.md +41 -0
- package/docs/interfaces/PendingEngineRequest.md +3 -3
- package/docs/interfaces/StorageOptions.md +5 -5
- package/docs/interfaces/SubmitActionOptions.md +2 -2
- package/docs/interfaces/TrimActionOptions.md +27 -0
- package/docs/interfaces/WaitForActionOptions.md +5 -5
- package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
- package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
- package/docs/type-aliases/BrowserEngine.md +1 -1
- package/docs/type-aliases/FetchActionCapabilities.md +1 -1
- package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
- package/docs/type-aliases/FetchActionOptions.md +1 -1
- package/docs/type-aliases/FetchEngineAction.md +2 -2
- package/docs/type-aliases/FetchEngineType.md +1 -1
- package/docs/type-aliases/FetchReturnType.md +1 -1
- package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
- package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
- package/docs/type-aliases/ResourceType.md +1 -1
- package/docs/type-aliases/TrimPreset.md +13 -0
- package/docs/variables/DefaultFetcherProperties.md +1 -1
- package/docs/variables/FetcherOptionKeys.md +1 -1
- package/docs/variables/TRIM_PRESETS.md +11 -0
- package/package.json +11 -11
@@ -0,0 +1,263 @@

# 🔍 Extract Action Deep Dive (The Data Surgeon's Manual)

[简体中文](./README.action.extract.cn.md) | English

`extract` is the heart of `@isdk/web-fetcher`. It's not just a "scraper"; it's an **intelligent converter** that transforms chaotic HTML into polished, ready-to-use JSON.

---

## ⚡ 1. Quick Start: Shorthand Magic

**Scenario**: Standard webpages with simple structures. Get results fast without long JSON configurations.

```json
{
  "action": "extract",
  "params": {
    "title": "h1", // Grab h1 text into 'title'
    "link": { "selector": "a.main", "attribute": "href" }, // Precision attribute extraction
    "tags": { "type": "array", "selector": ".tag-item" } // Grab a group of tags at once
  }
}
```
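
One way to read the shorthand is that a bare string stands for a rule that has only a `selector`. The sketch below illustrates that reading and nothing more: `FieldRule`, `Shorthand`, and `normalizeRule` are hypothetical names made up for this example, not part of the `@isdk/web-fetcher` API, and the library's own normalization may differ.

```typescript
// Hypothetical model of the shorthand (assumption, not the library's code):
// a bare string is treated as a selector-only rule.
interface FieldRule {
  selector?: string;
  attribute?: string;
  type?: string;
}

type Shorthand = string | FieldRule;

function normalizeRule(rule: Shorthand): FieldRule {
  return typeof rule === "string" ? { selector: rule } : rule;
}

// Same three fields as the Quick Start config above.
const params: Record<string, Shorthand> = {
  title: "h1",
  link: { selector: "a.main", attribute: "href" },
  tags: { type: "array", selector: ".tag-item" },
};

// Expand every entry into its explicit object form.
const normalized: Record<string, FieldRule> = {};
for (const key of Object.keys(params)) {
  const rule = params[key];
  if (rule !== undefined) normalized[key] = normalizeRule(rule);
}
```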

---

## 📄 2. Basic Value Extraction: The Tweezers

**Scenario**: Processing single data points like prices, titles, or IDs.

### Value Properties Analysis

* **`selector`**: CSS selector that tells the tweezers where to grab.
* **`type`**: Automatic casting. Supports `string`, `number` (auto-removes currency/units), `boolean`, `html`.
* **`mode`**: Extraction mode.
  * `innerText`: **(Highly Recommended)** Sees like a human, capturing only visible text and handling line breaks/noise.
  * `text`: Raw `textContent` from the source, including hidden characters.
  * `outerHTML`: Captures the full HTML, including the element's tags.
* **`attribute`**: If you need an attribute (e.g., `href`, `src`) instead of text, specify it here.
* **`depth`**: Bubble up. Once matched, move up N parent levels before extracting (useful for grabbing IDs from parent containers).

**Example**:

```json
{
  "selector": ".price-tag",
  "type": "number", // Transforms "¥99.00" to 99
  "mode": "innerText", // Ensures clean text
  "attribute": "data-v", // Grabs the data-v attribute
  "depth": 1 // Grabs from the parent element
}
```
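
The `number` cast is described as stripping currency symbols and units. A minimal sketch of that idea follows. `castToNumber` is a hypothetical helper and the regular expression is an assumption, so the library's real parsing rules (for example, for locales or thousands separators) may differ.

```typescript
// Hypothetical helper (assumption): keep the first signed decimal number in the text.
function castToNumber(raw: string): number | null {
  // Drop thousands separators, then look for digits with an optional fraction.
  const match = raw.replace(/,/g, "").match(/-?\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

const price = castToNumber("¥99.00");    // 99
const weight = castToNumber("1,250 kg"); // 1250
const missing = castToNumber("N/A");     // null, no digits to parse
```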

---

## 📦 3. Object Extraction: The Bento Box

**Scenario**: Pack related fields (e.g., name, avatar, bio) into a structured JSON object.

### Object Properties Analysis

* **`type: "object"`**: Declares the object structure.
* **`selector`**: **The root container**. This is crucial! All internal property selectors are searched relative to this container.
* **`properties`**: **Core Config**. Define your JSON keys and their respective extraction rules.
* **`required`**: If a field is essential, set this to `true`. If extraction fails, the entire object returns `null`.
* **`depth`**: Depth used to trigger "bubble-up" logic (see section 6.3).

**Example: Packing User Info**

```json
{
  "type": "object",
  "selector": ".user-card",
  "properties": {
    "name": { "selector": ".username", "required": true },
    "bio": ".user-description",
    "avatar": { "selector": "img.avatar", "attribute": "src" }
  }
}
```
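
The effect of `required` can be pictured as follows: if any required field comes back empty, the whole object collapses to `null`. This is a simplified model rather than the library's code. `assembleObject`, `FieldSpec`, and the pre-extracted `values` maps are hypothetical and exist only for this illustration.

```typescript
type Extracted = string | null;

interface FieldSpec {
  required?: boolean;
}

// Returns null as soon as a required field is missing; optional fields may stay null.
function assembleObject(
  specs: Record<string, FieldSpec>,
  values: Record<string, Extracted>,
): Record<string, Extracted> | null {
  const result: Record<string, Extracted> = {};
  for (const key of Object.keys(specs)) {
    const value = values[key] ?? null;
    if (value === null && specs[key]?.required) return null;
    result[key] = value;
  }
  return result;
}

// Mirrors the user-card example: only `name` is required.
const specs: Record<string, FieldSpec> = {
  name: { required: true },
  bio: {},
  avatar: {},
};

// `complete` keeps its null bio; `rejected` is null because the required name is missing.
const complete = assembleObject(specs, { name: "Alice", bio: null, avatar: "/a.png" });
const rejected = assembleObject(specs, { name: null, bio: "Hi", avatar: "/b.png" });
```

In a list, `null` entries like `rejected` can then be dropped with an ordinary `filter`.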

> **💡 Transparent Box (Implicit Object)**: If you omit `type: "object"` and don't provide a `selector`, the box becomes "transparent." If all internal fields fail to extract, the entire box disappears. This is magical for filtering ads or empty items in a list.

---

## 📑 4. Array Extraction: The Sorting Machines

**Scenario**: Handle repeating data like search results, news feeds, or tag clouds.

### Array Properties Analysis

* **`type: "array"`**: Declares a list extraction.
* **`selector`**: **Scanning scope**. In default mode, it matches the "wrapper" of each item; in flat modes, it matches the entire list container.
* **`items`**: **The Blueprint**. Defines what each row looks like. (Note: arrays use `items`, not `properties`).
* **`mode`**: Choose the sorting strategy:
  * `"nested"`: (Default) Every item has its own "slot" (container).
  * `"columnar"`: Data is laid out in columns without row containers (like Excel).
  * `"segmented"`: Data is completely flat, sliced into segments using "anchors."
* **`limit`**: Restricts the maximum number of items to prevent data bloat.
* **`inference`**: Heuristic inference. Enable this to let the engine automatically correct misalignments in messy lists.

### 4.1 Nested (The Egg Carton): The Stable Standard

**Scenario**: Items are wrapped in `<li>` or `.item` containers. This is the most stable and performant choice.

```json
{
  "type": "array",
  "selector": ".product-list .item", // Matches the shell of each product
  "items": {
    "name": "h3",
    "price": ".price"
  }
}
```

### 4.2 Columnar: Flat Alignment

**Scenario**: Data laid out like a spreadsheet. One column for titles, one for prices, but no shared parent for the row.

**Broadcasting**: If a field is constant across all items (e.g., category), extract it once from the main container, and it will "broadcast" to every row.

```json
{
  "type": "array",
  "selector": "#table-container",
  "mode": "columnar",
  "items": {
    "name": ".title-cols",
    "date": "" // Empty selector broadcasts date info from the main container
  }
}
```
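
Columnar mode can be thought of as zipping parallel columns into rows, with single values repeated on every row. The sketch below models only that idea on plain arrays. `zipColumns` is a hypothetical helper, the sample values are invented, and padding short columns with `null` is an assumption, since the source does not say how columns of unequal length are handled.

```typescript
// A string[] is a real column; a plain string is a value to broadcast.
type Column = string[] | string;
type Row = Record<string, string | null>;

function zipColumns(columns: Record<string, Column>): Row[] {
  const keys = Object.keys(columns);

  // The longest real column decides how many rows are produced.
  let rowCount = 0;
  for (const key of keys) {
    const col = columns[key];
    if (Array.isArray(col)) rowCount = Math.max(rowCount, col.length);
  }

  const rows: Row[] = [];
  for (let i = 0; i < rowCount; i++) {
    const row: Row = {};
    for (const key of keys) {
      const col = columns[key];
      // Arrays contribute their i-th cell; single values repeat on every row.
      row[key] = Array.isArray(col) ? (col[i] ?? null) : (col ?? null);
    }
    rows.push(row);
  }
  return rows;
}

const rows = zipColumns({
  name: ["Post A", "Post B"],
  date: "2024-01-01", // single value, broadcast to both rows
});
```

Here `rows` holds two objects, and both carry the same `date`.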

### 4.3 Segmented: Slicing the Unordered

**Scenario**: Terrible page structure where everything is siblings. You must rely on a marker (like an `h2` header) to define segments.

```json
{
  "type": "array",
  "selector": ".content-body",
  "mode": { "type": "segmented", "anchor": "h2" }, // Slice whenever an h2 is found
  "items": {
    "section_title": "h2",
    "text": "p"
  }
}
```

---

## 🛡️ 5. Quality Control: Filters & Sieves

**Scenario**: Filtering out ads, out-of-stock notices, or unwanted junk items.

* **`has`**: **The Essence Filter**. Only extract an element if it contains specific sub-elements. Example: `"has": ".promo-tag"` only keeps items with discount tags.
* **`exclude`**: **The Junk Remover**. Exclude matching elements. Example: `"exclude": ".is-advertisement"`.
* **`strict`**: **OCD Mode**. If a required field is missing, throw an error immediately instead of silently returning `null`.
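
The `has` and `exclude` filters can be pictured as two predicates applied to each candidate item. The sketch below is a rough model under loud assumptions: plain class names stand in for CSS selectors, and `Item`, `Filters`, and `applyFilters` are hypothetical names, not the library's API.

```typescript
interface Item {
  classes: string[];      // classes on the item itself
  childClasses: string[]; // classes found on its descendants
}

interface Filters {
  has?: string;     // keep only items that contain a descendant with this class
  exclude?: string; // drop items that carry this class themselves
}

function applyFilters(items: Item[], filters: Filters): Item[] {
  const { has, exclude } = filters;
  return items.filter((item) => {
    if (has !== undefined && item.childClasses.indexOf(has) === -1) return false;
    if (exclude !== undefined && item.classes.indexOf(exclude) !== -1) return false;
    return true;
  });
}

const candidates: Item[] = [
  { classes: ["item"], childClasses: ["promo-tag"] },                     // kept
  { classes: ["item", "is-advertisement"], childClasses: ["promo-tag"] }, // excluded as an ad
  { classes: ["item"], childClasses: [] },                                // dropped, no promo tag
];

const kept = applyFilters(candidates, { has: "promo-tag", exclude: "is-advertisement" });
```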

---

## 🧠 6. Advanced Mastery: Scope & Boundary Control

Surgical solutions for messy, class-less, and flat-structured HTML.

### 6.1 Sequential Slicing: Solving "The Twin Tag" Problem

* **Pain Point**: A row of identical `<span>` or `<p>` tags. A simple selector only grabs the first one repeatedly.
* **Solution**: `relativeTo: "previous"`.
* **Mechanism**: The engine remembers where it left off. After extracting the first field, it leaves a "Bookmark." The next search starts **after that Bookmark**.
* **Key Point**: Since this is sequential, we recommend explicitly setting `order: ["name", "job"]` to ensure the bookmarks don't get mixed up due to JSON key ordering.
* **Example Scenario**:

  ```html
  <p>Name</p> <span>Alice</span>
  <p>Job</p> <span>Manager</span>
  ```

* **Example Config**:

  ```json
  {
    "relativeTo": "previous",
    "order": ["k1", "v1", "k2", "v2"], // Force order to prevent bookmark displacement
    "properties": {
      "k1": "p", // Grabs 1st p (Name)
      "v1": "span", // Grabs Alice after k1
      "k2": "p", // Grabs 2nd p (Job) after v1
      "v2": "span" // Grabs Manager after k2
    }
  }
  ```
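
The bookmark mechanism can be sketched as a cursor that only ever moves forward through a flat list of nodes. This is a simplified model, not the engine's implementation: `FlatNode` and `extractSequential` are hypothetical, and tag names stand in for CSS selectors.

```typescript
interface FlatNode {
  tag: string;
  text: string;
}

// Walks the rules in the given order; each search starts after the previous match.
function extractSequential(
  nodes: FlatNode[],
  rules: Record<string, string>,
  order: string[],
): Record<string, string | null> {
  const result: Record<string, string | null> = {};
  let cursor = 0; // the "bookmark": index just past the last match
  for (const key of order) {
    const tag = rules[key];
    result[key] = null;
    for (let i = cursor; i < nodes.length; i++) {
      const node = nodes[i];
      if (node && node.tag === tag) {
        result[key] = node.text;
        cursor = i + 1;
        break;
      }
    }
  }
  return result;
}

// Flat version of the example scenario above.
const nodes: FlatNode[] = [
  { tag: "p", text: "Name" },
  { tag: "span", text: "Alice" },
  { tag: "p", text: "Job" },
  { tag: "span", text: "Manager" },
];
const rules = { k1: "p", v1: "span", k2: "p", v2: "span" };

const record = extractSequential(nodes, rules, ["k1", "v1", "k2", "v2"]);
// record: { k1: "Name", v1: "Alice", k2: "Job", v2: "Manager" }

const shuffled = extractSequential(nodes, rules, ["k1", "k2", "v1", "v2"]);
// shuffled: { k1: "Name", k2: "Job", v1: "Manager", v2: null }
```

The `shuffled` result shows why an explicit `order` matters in this model: once the cursor has passed "Alice", no later rule can reach it.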

### 6.2 Segmented Isolation & LCA: Building "Fences" for Neighbors

* **Pain Point**: In flat structures, Item 1's search might bleed into Item 2's content without a boundary.
* **Solution**: Use `mode: "segmented"` with `anchor`.
* **Mechanism**: The engine calculates the "Lowest Common Ancestor (LCA)" between anchors and builds a logical fence, locking searches strictly within the item's territory.
* **Example Scenario**:

  ```html
  <h2 class="title">Article 1</h2>
  <p class="summary">Summary 1</p>
  <h2 class="title">Article 2</h2>
  <p class="summary">Summary 2</p>
  ```

* **Example Config**:

  ```json
  {
    "type": "array",
    "mode": { "type": "segmented", "anchor": "h2.title" }, // h2 as the fence
    "items": {
      "title": "h2",
      "desc": "p"
    }
  }
  ```
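
The fencing idea can be illustrated on a flat list of siblings: every anchor opens a new segment, and each field is looked up only inside its own segment. This sketch deliberately skips the LCA computation described above and assumes all nodes are direct siblings. `Sibling`, `sliceByAnchor`, and `firstText` are hypothetical names, and ignoring nodes that appear before the first anchor is an assumption.

```typescript
interface Sibling {
  tag: string;
  text: string;
}

// Starts a new segment at every anchor; nodes before the first anchor are ignored.
function sliceByAnchor(siblings: Sibling[], anchorTag: string): Sibling[][] {
  const segments: Sibling[][] = [];
  let current: Sibling[] | null = null;
  for (const node of siblings) {
    if (node.tag === anchorTag) {
      current = [node];
      segments.push(current);
    } else if (current) {
      current.push(node);
    }
  }
  return segments;
}

// Looks up a field inside one segment only, so it cannot bleed into a neighbor.
function firstText(segment: Sibling[], tag: string): string | null {
  for (const node of segment) {
    if (node.tag === tag) return node.text;
  }
  return null;
}

// Flat version of the example scenario above.
const siblings: Sibling[] = [
  { tag: "h2", text: "Article 1" },
  { tag: "p", text: "Summary 1" },
  { tag: "h2", text: "Article 2" },
  { tag: "p", text: "Summary 2" },
];

const articles = sliceByAnchor(siblings, "h2").map((segment) => ({
  title: firstText(segment, "h2"),
  desc: firstText(segment, "p"),
}));
// articles: [{ title: "Article 1", desc: "Summary 1" }, { title: "Article 2", desc: "Summary 2" }]
```

Because each lookup is confined to one segment, a missing summary in "Article 1" would yield `null` instead of borrowing "Summary 2".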

### 6.3 Bubble-up: The "Retry" Safety Net

* **Pain Point**: Data is just outside the locked scope (e.g., an ID on a parent container).
* **Solution**: Set `depth` and use `required`.
* **Mechanism**:
  1. The engine looks for a mandatory field in the current scope and fails.
  2. It automatically steps back (`depth: 1`) to the parent node.
  3. It rescans in the wider view to find missing data that was "just out of sight."
* **Example Scenario**:

  ```html
  <div class="wrapper" data-id="ID_001">
    <div class="content-box">
      <h3 class="title">Product Name</h3>
    </div>
  </div>
  ```

* **Example Config**:

  ```json
  {
    "selector": ".content-box",
    "depth": 1,
    "properties": {
      "product_id": {
        "selector": "..", // ".." means look at parent
        "attribute": "data-id",
        "required": true // A failure here triggers bubble-up
      },
      "name": "h3"
    }
  }
  ```
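
The three steps above can be sketched as a lookup that climbs at most `depth` parents before giving up. This is a simplified model under stated assumptions: `TreeNode` and `readWithBubbleUp` are hypothetical, and the sketch reads a single attribute, whereas the engine rescans the wider scope with the full rule.

```typescript
interface TreeNode {
  attrs: Record<string, string>;
  parent: TreeNode | null;
}

// Reads an attribute from the node; on failure climbs at most `depth` parents.
function readWithBubbleUp(start: TreeNode, attribute: string, depth: number): string | null {
  let node: TreeNode | null = start;
  for (let level = 0; level <= depth && node !== null; level++) {
    const value: string | undefined = node.attrs[attribute];
    if (value !== undefined) return value;
    node = node.parent; // step back one level and retry
  }
  return null;
}

// Mirrors the example scenario: the ID lives on the wrapper, not the content box.
const wrapper: TreeNode = { attrs: { "data-id": "ID_001" }, parent: null };
const contentBox: TreeNode = { attrs: {}, parent: wrapper };

const withoutRetry = readWithBubbleUp(contentBox, "data-id", 0); // null
const withRetry = readWithBubbleUp(contentBox, "data-id", 1);    // "ID_001"
```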

---

## 💡 7. Best Practice Tips

1. **Containers First**: If the HTML has `.item` wrappers, always prefer `nested` mode. It is the most stable and the fastest.
2. **innerText as Default**: In most cases, it's better than `text` because it yields cleaner results and handles line breaks automatically.
3. **Flat Siblings? Use Anchors**: When you see siblings without a container shell, immediately think `segmented` + `anchor`.
4. **Leverage `required` for Filtering**: If some items in a list are empty (placeholders), mark core fields as `required: true`. They will automatically become `null`, making them easy to filter in your code.
5. **Debug Tip: Check the Scope**: If data is missing, check if your `selector` is too restrictive or try increasing the `depth`.