@isdk/web-fetcher 0.2.12 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (83) hide show
  1. package/README.action.cn.md +197 -155
  2. package/README.action.extract.cn.md +263 -0
  3. package/README.action.extract.md +263 -0
  4. package/README.action.md +202 -147
  5. package/README.cn.md +25 -15
  6. package/README.engine.cn.md +118 -14
  7. package/README.engine.md +115 -14
  8. package/README.md +19 -10
  9. package/dist/index.d.mts +667 -50
  10. package/dist/index.d.ts +667 -50
  11. package/dist/index.js +1 -1
  12. package/dist/index.mjs +1 -1
  13. package/docs/README.md +19 -10
  14. package/docs/_media/README.action.md +202 -147
  15. package/docs/_media/README.cn.md +25 -15
  16. package/docs/_media/README.engine.md +115 -14
  17. package/docs/classes/CheerioFetchEngine.md +805 -135
  18. package/docs/classes/ClickAction.md +33 -33
  19. package/docs/classes/EvaluateAction.md +559 -0
  20. package/docs/classes/ExtractAction.md +33 -33
  21. package/docs/classes/FetchAction.md +39 -33
  22. package/docs/classes/FetchEngine.md +660 -122
  23. package/docs/classes/FetchSession.md +38 -16
  24. package/docs/classes/FillAction.md +33 -33
  25. package/docs/classes/GetContentAction.md +33 -33
  26. package/docs/classes/GotoAction.md +33 -33
  27. package/docs/classes/KeyboardPressAction.md +533 -0
  28. package/docs/classes/KeyboardTypeAction.md +533 -0
  29. package/docs/classes/MouseClickAction.md +533 -0
  30. package/docs/classes/MouseMoveAction.md +533 -0
  31. package/docs/classes/PauseAction.md +33 -33
  32. package/docs/classes/PlaywrightFetchEngine.md +820 -122
  33. package/docs/classes/SubmitAction.md +33 -33
  34. package/docs/classes/TrimAction.md +533 -0
  35. package/docs/classes/WaitForAction.md +33 -33
  36. package/docs/classes/WebFetcher.md +9 -9
  37. package/docs/enumerations/FetchActionResultStatus.md +4 -4
  38. package/docs/functions/fetchWeb.md +6 -6
  39. package/docs/globals.md +14 -0
  40. package/docs/interfaces/BaseFetchActionProperties.md +12 -12
  41. package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
  42. package/docs/interfaces/BaseFetcherProperties.md +32 -28
  43. package/docs/interfaces/Cookie.md +14 -14
  44. package/docs/interfaces/DispatchedEngineAction.md +4 -4
  45. package/docs/interfaces/EvaluateActionOptions.md +81 -0
  46. package/docs/interfaces/ExtractActionProperties.md +12 -12
  47. package/docs/interfaces/FetchActionInContext.md +15 -15
  48. package/docs/interfaces/FetchActionProperties.md +13 -13
  49. package/docs/interfaces/FetchActionResult.md +6 -6
  50. package/docs/interfaces/FetchContext.md +42 -38
  51. package/docs/interfaces/FetchEngineContext.md +37 -33
  52. package/docs/interfaces/FetchMetadata.md +5 -5
  53. package/docs/interfaces/FetchResponse.md +14 -14
  54. package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
  55. package/docs/interfaces/FetchSite.md +35 -31
  56. package/docs/interfaces/FetcherOptions.md +34 -30
  57. package/docs/interfaces/GotoActionOptions.md +14 -6
  58. package/docs/interfaces/KeyboardPressParams.md +25 -0
  59. package/docs/interfaces/KeyboardTypeParams.md +25 -0
  60. package/docs/interfaces/MouseClickParams.md +49 -0
  61. package/docs/interfaces/MouseMoveParams.md +41 -0
  62. package/docs/interfaces/PendingEngineRequest.md +3 -3
  63. package/docs/interfaces/StorageOptions.md +5 -5
  64. package/docs/interfaces/SubmitActionOptions.md +2 -2
  65. package/docs/interfaces/TrimActionOptions.md +27 -0
  66. package/docs/interfaces/WaitForActionOptions.md +5 -5
  67. package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
  68. package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
  69. package/docs/type-aliases/BrowserEngine.md +1 -1
  70. package/docs/type-aliases/FetchActionCapabilities.md +1 -1
  71. package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
  72. package/docs/type-aliases/FetchActionOptions.md +1 -1
  73. package/docs/type-aliases/FetchEngineAction.md +2 -2
  74. package/docs/type-aliases/FetchEngineType.md +1 -1
  75. package/docs/type-aliases/FetchReturnType.md +1 -1
  76. package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
  77. package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
  78. package/docs/type-aliases/ResourceType.md +1 -1
  79. package/docs/type-aliases/TrimPreset.md +13 -0
  80. package/docs/variables/DefaultFetcherProperties.md +1 -1
  81. package/docs/variables/FetcherOptionKeys.md +1 -1
  82. package/docs/variables/TRIM_PRESETS.md +11 -0
  83. package/package.json +11 -11
@@ -0,0 +1,263 @@
1
+ # 🔍 Extract Action Deep Dive (The Data Surgeon's Manual)
2
+
3
+ [简体中文](./README.action.extract.cn.md) | English
4
+
5
+ `extract` is the heart of `@isdk/web-fetcher`. It's not just a "scraper"; it's an **intelligent converter** that transforms chaotic HTML into polished, ready-to-use JSON.
6
+
7
+ ---
8
+
9
+ ## ⚡ 1. Quick Start: Shorthand Magic
10
+
11
+ **Scenario**: Standard webpages with simple structures. Get results fast without long JSON configurations.
12
+
13
+ ```json
14
+ {
15
+ "action": "extract",
16
+ "params": {
17
+ "title": "h1", // Grab h1 text into 'title'
18
+ "link": { "selector": "a.main", "attribute": "href" }, // Precision attribute extraction
19
+ "tags": { "type": "array", "selector": ".tag-item" } // Grab a group of tags at once
20
+ }
21
+ }
22
+ ```
23
+
24
+ ---
25
+
26
+ ## 📄 2. Basic Value Extraction: The Tweezers
27
+
28
+ **Scenario**: Processing single data points like prices, titles, or IDs.
29
+
30
+ ### Value Properties Analysis
31
+
32
+ * **`selector`**: CSS selector—tells the tweezers where to grab.
33
+ * **`type`**: Automatic casting. Supports `string`, `number` (auto-removes currency/units), `boolean`, `html`.
34
+ * **`mode`**: Extraction mode.
35
+ * `innerText`: **(Highly Recommended)** Sees like a human, capturing only visible text and handling line breaks/noise.
36
+ * `text`: Raw `textContent` from the source, including hidden characters.
37
+ * `outerHTML`: Captures the full HTML, including the element's tags.
38
+ * **`attribute`**: If you need an attribute (e.g., `href`, `src`) instead of text, specify it here.
39
+ * **`depth`**: Bubble up. Once matched, move up N parent levels before extracting (useful for grabbing IDs from parent containers).
40
+
41
+ **Example**:
42
+
43
+ ```json
44
+ {
45
+ "selector": ".price-tag",
46
+ "type": "number", // Transforms "¥99.00" to 99
47
+ "mode": "innerText", // Ensures clean text
48
+ "attribute": "data-v", // Grabs the data-v attribute
49
+ "depth": 1 // Grabs from the parent element
50
+ }
51
+ ```
52
+
53
+ ---
54
+
55
+ ## 📦 3. Object Extraction: The Bento Box
56
+
57
+ **Scenario**: Pack related fields (e.g., name, avatar, bio) into a structured JSON object.
58
+
59
+ ### Object Properties Analysis
60
+
61
+ * **`type: "object"`**: Declares the object structure.
62
+ * **`selector`**: **The root container**. This is crucial! All internal property selectors are searched relative to this container.
63
+ * **`properties`**: **Core Config**. Define your JSON keys and their respective extraction rules.
64
+ * **`required`**: If a field is essential, set this to `true`. If extraction fails, the entire object returns `null`.
65
+ * **`depth`**: Depth used to trigger "bubble-up" logic (see advanced chapter).
66
+
67
+ **Example: Packing User Info**
68
+
69
+ ```json
70
+ {
71
+ "type": "object",
72
+ "selector": ".user-card",
73
+ "properties": {
74
+ "name": { "selector": ".username", "required": true },
75
+ "bio": ".user-description",
76
+ "avatar": { "selector": "img.avatar", "attribute": "src" }
77
+ }
78
+ }
79
+ ```
80
+
81
+ > **💡 Transparent Box (Implicit Object)**: If you omit `type: "object"` and don't provide a `selector`, the box becomes "transparent." If all internal fields fail to extract, the entire box disappears. This is magical for filtering ads or empty items in a list.
82
+
83
+ ---
84
+
85
+ ## 📑 4. Array Extraction: The Sorting Machines
86
+
87
+ **Scenario**: Handle repeating data like search results, news feeds, or tag clouds.
88
+
89
+ ### Array Properties Analysis
90
+
91
+ * **`type: "array"`**: Declares a list extraction.
92
+ * **`selector`**: **Scanning scope**. In default mode, it matches the "wrapper" of each item; in flat modes, it matches the entire list container.
93
+ * **`items`**: **The Blueprint**. Defines what each row looks like. (Note: arrays use `items`, not `properties`).
94
+ * **`mode`**: Choose the sorting strategy:
95
+ * `"nested"`: (Default) Every item has its own "slot" (container).
96
+ * `"columnar"`: Data is laid out in columns without row containers (like Excel).
97
+ * `"segmented"`: Data is completely flat, sliced into segments using "anchors."
98
+ * **`limit`**: Restricts the maximum number of items to prevent data bloat.
99
+ * **`inference`**: Heuristic inference. Enable this to let the engine automatically correct misalignments in messy lists.
100
+
101
+ ### 4.1 Nested (The Egg Carton): The Stable Standard
102
+
103
+ **Scenario**: Items are wrapped in `<li>` or `.item` containers. This is the most stable and performant choice.
104
+
105
+ ```json
106
+ {
107
+ "type": "array",
108
+ "selector": ".product-list .item", // Matches the shell of each product
109
+ "items": {
110
+ "name": "h3",
111
+ "price": ".price"
112
+ }
113
+ }
114
+ ```
115
+
116
+ ### 4.2 Columnar: Flat Alignment
117
+
118
+ **Scenario**: Data laid out like a spreadsheet. One column for titles, one for prices, but no shared parent for the row.
119
+ **Broadcasting**: If a field is constant across all items (e.g., category), extract it once from the main container, and it will "broadcast" to every row.
120
+
121
+ ```json
122
+ {
123
+ "type": "array",
124
+ "selector": "#table-container",
125
+ "mode": "columnar",
126
+ "items": {
127
+ "name": ".title-cols",
128
+ "date": "" // Empty selector broadcasts date info from the main container
129
+ }
130
+ }
131
+ ```
132
+
133
+ ### 4.3 Segmented: Slicing the Unordered
134
+
135
+ **Scenario**: Terrible page structure where everything is siblings. You must rely on a marker (like an `h2` header) to define segments.
136
+
137
+ ```json
138
+ {
139
+ "type": "array",
140
+ "selector": ".content-body",
141
+ "mode": { "type": "segmented", "anchor": "h2" }, // Slice whenever an h2 is found
142
+ "items": {
143
+ "section_title": "h2",
144
+ "text": "p"
145
+ }
146
+ }
147
+ ```
148
+
149
+ ---
150
+
151
+ ## 🛡️ 5. Quality Control: Filters & Sieves
152
+
153
+ **Scenario**: Filtering out ads, out-of-stock notices, or unwanted junk items.
154
+
155
+ * **`has`**: **The Essence Filter**. Only extract if it contains specific sub-elements. Example: `"has": ".promo-tag"` only keeps items with discount tags.
156
+ * **`exclude`**: **The Junk Remover**. Exclude matching elements. Example: `"exclude": ".is-advertisement"`.
157
+ * **`strict`**: **OCD Mode**. If a required field is missing, throw an error immediately instead of silently returning `null`.
158
+
159
+ ---
160
+
161
+ ## 🧠 6. Advanced Mastery: Scope & Boundary Control
162
+
163
+ Surgical solutions for messy, class-less, and flat-structured HTML.
164
+
165
+ ### 6.1 Sequential Slicing: Solving "The Twin Tag" Problem
166
+
167
+ * **Pain Point**: A row of identical `<span>` or `p` tags. A simple selector only grabs the first one repeatedly.
168
+ * **Solution**: `relativeTo: "previous"`.
169
+ * **Mechanism**: The engine remembers where it left off. After extracting the first field, it leaves a "Bookmark." The next search starts **after that Bookmark**.
170
+ * **Key Point**: Since this is sequential, we recommend explicitly setting `order: ["name", "job"]` to ensure the bookmarks don't get mixed up due to JSON key ordering.
171
+ * **Example Scenario**:
172
+
173
+ ```html
174
+ <p>Name</p> <span>Alice</span>
175
+ <p>Job</p> <span>Manager</span>
176
+ ```
177
+
178
+ * **Example Config**:
179
+
180
+ ```json
181
+ {
182
+ "relativeTo": "previous",
183
+ "order": ["k1", "v1", "k2", "v2"], // Force order to prevent bookmark displacement
184
+ "properties": {
185
+ "k1": "p", // Grabs 1st p (Name)
186
+ "v1": "span", // Grabs Alice after k1
187
+ "k2": "p", // Grabs 2nd p (Job) after v1
188
+ "v2": "span" // Grabs Manager after k2
189
+ }
190
+ }
191
+ ```
192
+
193
+ ### 6.2 Segmented Isolation & LCA: Building "Fences" for Neighbors
194
+
195
+ * **Pain Point**: In flat structures, Item 1's search might bleed into Item 2's content without a boundary.
196
+ * **Solution**: Use `mode: "segmented"` with `anchor`.
197
+ * **Mechanism**: The engine calculates the "Lowest Common Ancestor (LCA)" between anchors and builds a logical fence, locking searches strictly within the item's territory.
198
+ * **Example Scenario**:
199
+
200
+ ```html
201
+ <h2 class="title">Article 1</h2>
202
+ <p class="summary">Summary 1</p>
203
+ <h2 class="title">Article 2</h2>
204
+ <p class="summary">Summary 2</p>
205
+ ```
206
+
207
+ * **Example Config**:
208
+
209
+ ```json
210
+ {
211
+ "type": "array",
212
+ "mode": { "type": "segmented", "anchor": "h2.title" }, // h2 as the fence
213
+ "items": {
214
+ "title": "h2",
215
+ "desc": "p"
216
+ }
217
+ }
218
+ ```
219
+
220
+ ### 6.3 Bubble-up: The "Retry" Safety Net
221
+
222
+ * **Pain Point**: Data is just outside the locked scope (e.g., an ID on a parent container).
223
+ * **Solution**: Set `depth` and use `required`.
224
+ * **Mechanism**:
225
+ 1. The engine looks for a mandatory field in the current scope—fails.
226
+ 2. It automatically steps back (`depth: 1`) to the parent node.
227
+ 3. It rescans in the wider view to find missing data that was "just out of sight."
228
+ * **Example Scenario**:
229
+
230
+ ```html
231
+ <div class="wrapper" data-id="ID_001">
232
+ <div class="content-box">
233
+ <h3 class="title">Product Name</h3>
234
+ </div>
235
+ </div>
236
+ ```
237
+
238
+ * **Example Config**:
239
+
240
+ ```json
241
+ {
242
+ "selector": ".content-box",
243
+ "depth": 1,
244
+ "properties": {
245
+ "product_id": {
246
+ "selector": "..", // ".." means look at parent
247
+ "attribute": "data-id",
248
+ "required": true // Fails to trigger bubble-up
249
+ },
250
+ "name": "h3"
251
+ }
252
+ }
253
+ ```
254
+
255
+ ---
256
+
257
+ ## 💡 7. Best Practice Tips
258
+
259
+ 1. **Containers First**: If the HTML has `.item` wrappers, always prefer `Nested` mode—it's the stablest and fastest.
260
+ 2. **innerText as Default**: In 90% of cases, it's better than `text` because it yields cleaner results and handles line breaks automatically.
261
+ 3. **Flat Siblings? Use Anchors**: When you see siblings without a container shell, immediately think `segmented` + `anchor`.
262
+ 4. **Leverage required for Filtering**: If some items in a list are empty (placeholders), mark core fields as `required: true`. They will automatically become `null`, making them easy to filter in your code.
263
+ 5. **Debug Tip: Check the Scope**: If data is missing, check if your `selector` is too restrictive or try increasing the `depth`.