npm - @isdk/web-fetcher - Versions diffs - 0.2.12 → 0.3.0 - Mend

@isdk/web-fetcher 0.2.12 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (72) hide show

package/README.action.cn.md +312 -11
package/README.action.md +324 -11
package/README.cn.md +13 -11
package/README.engine.cn.md +96 -13
package/README.engine.md +93 -13
package/README.md +11 -9
package/dist/index.d.mts +521 -50
package/dist/index.d.ts +521 -50
package/dist/index.js +1 -1
package/dist/index.mjs +1 -1
package/docs/README.md +11 -9
package/docs/_media/README.action.md +324 -11
package/docs/_media/README.cn.md +13 -11
package/docs/_media/README.engine.md +93 -13
package/docs/classes/CheerioFetchEngine.md +647 -125
package/docs/classes/ClickAction.md +33 -33
package/docs/classes/EvaluateAction.md +559 -0
package/docs/classes/ExtractAction.md +33 -33
package/docs/classes/FetchAction.md +35 -33
package/docs/classes/FetchEngine.md +528 -122
package/docs/classes/FetchSession.md +38 -16
package/docs/classes/FillAction.md +33 -33
package/docs/classes/GetContentAction.md +33 -33
package/docs/classes/GotoAction.md +33 -33
package/docs/classes/PauseAction.md +33 -33
package/docs/classes/PlaywrightFetchEngine.md +570 -122
package/docs/classes/SubmitAction.md +33 -33
package/docs/classes/TrimAction.md +533 -0
package/docs/classes/WaitForAction.md +33 -33
package/docs/classes/WebFetcher.md +9 -9
package/docs/enumerations/FetchActionResultStatus.md +4 -4
package/docs/functions/fetchWeb.md +6 -6
package/docs/globals.md +6 -0
package/docs/interfaces/BaseFetchActionProperties.md +12 -12
package/docs/interfaces/BaseFetchCollectorActionProperties.md +16 -16
package/docs/interfaces/BaseFetcherProperties.md +28 -28
package/docs/interfaces/DispatchedEngineAction.md +4 -4
package/docs/interfaces/EvaluateActionOptions.md +81 -0
package/docs/interfaces/ExtractActionProperties.md +12 -12
package/docs/interfaces/FetchActionInContext.md +15 -15
package/docs/interfaces/FetchActionProperties.md +13 -13
package/docs/interfaces/FetchActionResult.md +6 -6
package/docs/interfaces/FetchContext.md +38 -38
package/docs/interfaces/FetchEngineContext.md +33 -33
package/docs/interfaces/FetchMetadata.md +5 -5
package/docs/interfaces/FetchResponse.md +14 -14
package/docs/interfaces/FetchReturnTypeRegistry.md +8 -8
package/docs/interfaces/FetchSite.md +31 -31
package/docs/interfaces/FetcherOptions.md +30 -30
package/docs/interfaces/GotoActionOptions.md +6 -6
package/docs/interfaces/PendingEngineRequest.md +3 -3
package/docs/interfaces/StorageOptions.md +5 -5
package/docs/interfaces/SubmitActionOptions.md +2 -2
package/docs/interfaces/TrimActionOptions.md +27 -0
package/docs/interfaces/WaitForActionOptions.md +5 -5
package/docs/type-aliases/BaseFetchActionOptions.md +1 -1
package/docs/type-aliases/BaseFetchCollectorOptions.md +1 -1
package/docs/type-aliases/BrowserEngine.md +1 -1
package/docs/type-aliases/FetchActionCapabilities.md +1 -1
package/docs/type-aliases/FetchActionCapabilityMode.md +1 -1
package/docs/type-aliases/FetchActionOptions.md +1 -1
package/docs/type-aliases/FetchEngineAction.md +2 -2
package/docs/type-aliases/FetchEngineType.md +1 -1
package/docs/type-aliases/FetchReturnType.md +1 -1
package/docs/type-aliases/FetchReturnTypeFor.md +1 -1
package/docs/type-aliases/OnFetchPauseCallback.md +1 -1
package/docs/type-aliases/ResourceType.md +1 -1
package/docs/type-aliases/TrimPreset.md +13 -0
package/docs/variables/DefaultFetcherProperties.md +1 -1
package/docs/variables/FetcherOptionKeys.md +1 -1
package/docs/variables/TRIM_PRESETS.md +11 -0
package/package.json +4 -4

package/docs/_media/README.action.md CHANGED Viewed

@@ -90,6 +90,8 @@ Navigates the browser to a new URL.
   * ...other navigation options like `waitUntil`, `timeout` which are passed to the engine.
 * **`returns`**: `response`
+> **Note**: This action can be called multiple times within a single session script. The engine ensures that each navigation is completed and its corresponding Action Loop is settled before processing the next action in the sequence.
 #### `click`
 Clicks on an element specified by a selector.
@@ -123,6 +125,41 @@ Submits a form.
   * `selector` (string, optional): A selector for the form element.
 * **`returns`**: `none`
+#### `trim`
+Removes specific elements from the DOM to clean up the page before extraction. This is a persistent modification to the current session's page state.
+* **`id`**: `trim`
+* **`params`**:
+  * **`selectors`** (string | string[], optional): One or more CSS selectors of elements to remove.
+  * **`presets`** (string | string[], optional): Predefined groups of elements to remove. Supported presets:
+    * `scripts`: Removes all `<script>` tags.
+    * `styles`: Removes all `<style>` and `<link rel="stylesheet">` tags.
+    * `svgs`: Removes all `<svg>` elements.
+    * `images`: Removes `<img>`, `<picture>`, and `<canvas>` elements.
+    * `comments`: Removes HTML comments.
+    * `hidden`: Removes elements with `hidden` attribute or inline `display:none`. In **browser** mode, it also detects and removes elements hidden via external CSS (e.g., `display: none` or `visibility: hidden` in stylesheets).
+    * `all`: Includes all of the above.
+* **`returns`**: `none`
+**Example: Cleaning up a page before extraction**
+```json
+{
+  "actions": [
+    { "action": "goto", "params": { "url": "https://example.com" } },
+    {
+      "action": "trim",
+      "params": {
+        "selectors": ["#ad-banner", ".popup"],
+        "presets": ["scripts", "styles", "comments"]
+      }
+    },
+    { "action": "extract", "params": { "schema": { "content": "#main-content" } } }
+  ]
+}
+```
 #### `waitFor`
 Pauses execution to wait for one or more conditions to be met.
@@ -210,6 +247,63 @@ Retrieves the full content of the current page state.
 * **`params`**: (none)
 * **`returns`**: `response`
+#### `evaluate`
+Executes custom JavaScript code or an expression within the page context.
+* **`id`**: `evaluate`
+* **`params`**:
+  * **`fn`**: `string | Function` - The JavaScript function or expression to execute.
+    * In `browser` mode, this runs in the real browser.
+    * In `http` mode, this runs in a safe Node.js sandbox with a mocked browser environment (`window`, `document`, `console`, `).
+  * **`args`**: `any` - A single argument to pass to the function. Use an array or object to pass multiple values.
+**Key Features:**
+- **Automatic Navigation**: If the code modifies `window.location.href` or calls `assign()`/`replace()`, the engine automatically triggers and waits for the navigation to complete.
+- **Enhanced Mock DOM (HTTP Mode)**: Supports common DOM methods like `querySelector`, `querySelectorAll`, `getElementById`, `getElementsByClassName`, and properties like `document.body` and `document.title`.
+- **Sandbox Security**: Uses `util-ex`'s `newFunction` in HTTP mode to prevent global state pollution.
+**Example (Array Args):**
+```json
+{
+  "action": "evaluate",
+  "params": {
+    "fn": "([a, b]) => a + b",
+    "args": [1, 2]
+  }
+}
+```
+**Example (Navigation):**
+```json
+{
+  "action": "evaluate",
+  "params": {
+    "fn": "() => { window.location.href = '/new-page'; }"
+  }
+}
+```
+$`).
+>
+> * **Navigation**: If the code modifies `window.location.href`, the engine will automatically trigger a `goto` to the new URL.
+**Example: Extracting data via custom script**
+```json
+{
+  "action": "evaluate",
+  "params": {
+    "fn": "([selector]) => document.querySelector(selector).innerText",
+    "args": [".main-title"]
+  },
+  "storeAs": "title"
+}
+```
 #### `extract`
 Extracts structured data from the page using a powerful and declarative Schema. This is the core Action for data collection.
@@ -226,6 +320,8 @@ The `params` object itself is a Schema that describes the data structure you wan
 The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), a `type` (string, number, boolean, html), and a `mode` (text, innerText).
+* **`depth`** (number, optional): After matching the element with `selector`, it bubbles up the DOM tree by the specified number of levels. The resulting ancestor element becomes the actual target for value extraction (e.g., to extract an attribute from a parent wrapper).
 ```json
 {
   "id": "extract",
@@ -263,6 +359,20 @@ Define a structured object using `type: 'object'` and the `properties` field.
 }
 ```
+* **`depth`** (number, optional): Enables "Try-And-Bubble" strategy. If a `required` field is missing in the matched element, the engine attempts to bubble up the DOM tree (up to `depth` levels) to find an ancestor where the required field exists. This is useful when the selector matches a descendant (e.g., an inner `span`) but the data resides on a parent container.
+**Advanced Object Features:**
+* **Anchor Jumping (`anchor`)**: Specifies a starting reference point for a field.
+  * **Field Reference**: Use the DOM element of a previously extracted field.
+  * **CSS Selector**: Query an anchor element on the fly within the object's scope.
+  * **`depth`** (number, optional): When using an anchor, defines how many parent levels to traverse upwards to collect following siblings.
+    * **Note**: If omitted, the engine defaults to maximum depth (up to the object's root) for backward compatibility. To strictly limit the search to the anchor's own siblings, set `depth: 0`.
+  * **Effect**: Once an anchor is set, the search scope for that field becomes the siblings **following** the anchor (and its ancestors, depending on `depth`). This allows for non-linear "jumping" extraction in flat structures.
+* **Sequential Consumption (`relativeTo: "previous"`)**:
+  * Combined with the `order` property, this ensures each field's search scope starts *after* the previous field's match.
+  * Essential for extracting from lists composed of identical tags (e.g., consecutive `<p>` tags with different meanings).
 ###### 3. Extracting an Array (Convenient Usage)
 Extract a list using `type: 'array'`. To make the most common operations simpler, we provide some convenient usages.
@@ -304,7 +414,9 @@ Extract a list using `type: 'array'`. To make the most common operations simpler
 ###### 4. Columnar Mode (formerly Zip Strategy)
-This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns.
+This mode is used when the `selector` points to a **container** (like a results list) and item data is scattered as separate columns. It is highly optimized for performance, especially in browser mode, by minimizing the number of DOM queries and RPC calls.
+> **💡 Broadcasting & Performance**: If a property matches the container element itself (e.g. by omitting a selector or matching the container's own attributes), its value is **broadcasted** to every row. This is not only a feature but also a major performance optimization: the value is extracted **once** and reused across all rows, avoiding thousands of redundant engine calls.
 ```json
 {
@@ -323,14 +435,44 @@ This mode is used when the `selector` points to a **container** (like a results
 > **Heuristic Detection:** If `mode` is omitted and the `selector` matches exactly one element while `items` contains nested selectors, the engine automatically uses **columnar** mode.
+**Example: Columnar Broadcasting**
+When you have a list where the category is on the container, but items are inside.
+```json
+{
+  "id": "extract",
+  "params": {
+    "type": "array",
+    "selector": "#book-category",
+    "mode": "columnar",
+    "items": {
+      "category": { "attribute": "data-category" },
+      "title": { "selector": ".book-title" }
+    }
+  }
+}
+```
+> If `#book-category` has `data-category="Sci-Fi"` and contains 3 books, the result will be 3 items, each having `"category": "Sci-Fi"`.
 **Columnar Configuration:**
 * **`strict`** (boolean, default: `true`): If `true`, throws an error if fields have different match counts.
-* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists.
+* **`inference`** (boolean, default: `false`): If `true`, tries to automatically find the "item wrapper" elements to fix misaligned lists. It uses an **optimized ancestor search** that is significantly faster in browser mode than manual traversal.
+* **Performance Note**: The engine automatically detects shared structures and pre-calculates alignment to ensure O(N) performance even in complex DOM trees. In browser mode, it minimizes IPC round-trips by pre-calculating "broadcast" flags.
 ###### 5. Segmented Mode (Anchor-based Scanning)
-This mode is ideal for "flat" structures where there are no item wrappers. It uses the first field (or a specified `anchor`) to segment the container's content.
+Ideal for "flat" structures where there are no item wrappers. It uses a specified `anchor` to segment the container's content.
+**Core Feature: Automatic Container Detection (Bubble Up)**
+To handle structures that appear flat but have subtle containers, the engine uses a bubble-up strategy:
+- **Smart Bubble Up**: When an anchor is nested deep (e.g., `div.card > h3.title`), the engine automatically crawls up the DOM to find the largest "safe container" (e.g., `div.card`) that doesn't overlap with neighbors.
+- **Logical Isolation**: If a container is found, it becomes the scope for that segment. This allows you to extract any content within that container using simple relative selectors, even if it's "above" or deep alongside the anchor.
+- **Flat Fallback**: If no container isolation is possible, it automatically falls back to classic sibling scanning.
 ```json
 {
@@ -338,7 +480,7 @@ This mode is ideal for "flat" structures where there are no item wrappers. It us
   "params": {
     "type": "array",
     "selector": "#flat-container",
-    "mode": { "type": "segmented", "anchor": "title" },
+    "mode": { "type": "segmented", "anchor": "h3.item-title" },
     "items": {
       "title": { "selector": "h3" },
       "desc": { "selector": "p" }
@@ -347,26 +489,103 @@ This mode is ideal for "flat" structures where there are no item wrappers. It us
 }
 ```
-> In this mode, every time the engine finds an `h3`, it starts a new item. Any `<p>` found after that `h3` (and before the next one) is assigned to that item.
+**Segmented Configuration:**
-###### 6. Implicit Object Extraction (Simplest Syntax)
+* **`anchor`** (string):
+  * Can be a **field name** defined in `items` (e.g., `"title"`).
+  * Can be a **direct CSS selector** (e.g., `"h3.item-title"`).
+  * Defaults to the selector of the first field in `items`.
+* **`depth`** (number, optional): The maximum number of levels to bubble up from the anchor to find a segment container. If omitted, it bubbles up as high as possible without conflicting with neighboring segments.
+* **`strict`** (boolean, default: `false`): If `true`, throws an error if no anchor elements are found or if any item violates its own `required` constraints.
-For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not reserved keywords (like `selector`, `attribute`, `type`, etc.), it is treated as an object schema where keys are property names.
+###### 5.1 Advanced: Handling Repeating Tags (`relativeTo`)
+When a segment contains multiple identical tags (e.g., several `<p>` tags in a row) representing different fields, use `relativeTo: "previous"` to "consume" them one by one.
+```json
+{
+  "id": "extract",
+  "params": {
+    "type": "array",
+    "selector": "#container",
+    "mode": {
+      "type": "segmented",
+      "anchor": ".item-start",
+      "relativeTo": "previous"
+    },
+    "items": {
+      "type": "object",
+      "order": ["id", "desc", "extra"],
+      "properties": {
+        "id": "h1",
+        "desc": "p",
+        "extra": "p"
+      }
+    }
+  }
+}
+```
+* **`relativeTo: "previous"`**: After finding `id` (h1), the search for `desc` starts *after* that h1. After finding `desc` (the first p), the search for `extra` starts *after* that p, successfully picking up the second `<p>`.
+* **`order`**: Defines the sequence of consumption. Highly recommended with `relativeTo: "previous"`.
+###### 6. Quality Control: `required` and `strict`
+- **`required`**: Marks a field as mandatory.
+  - **In Objects**: If any required field is `null`, the entire object returns `null`.
+  - **In Arrays**: Items missing a required field are automatically skipped.
+  - **Null Propagation**: For implicit objects without a `selector`, if ALL sub-properties are `null`, the object itself becomes `null`, triggering parent-level required or skip logic.
+- **`strict`**:
+  - `false` (Default): Silently skip or ignore incomplete data.
+  - `true`: Throw an error on any missing required field or alignment mismatch.
+  - **Inheritance**: Setting `strict: true` at the array level automatically propagates to all nested children.
+**Example: Ignoring items with missing mandatory fields**
+```json
+{
+  "id": "extract",
+  "params": {
+    "type": "array",
+    "selector": ".product-list",
+    "mode": "columnar",
+    "items": {
+      "name": { "selector": ".title", "required": true },
+      "price": { "selector": ".price", "required": true },
+      "discount": ".promo"
+    }
+  }
+}
+```
+> In this example, if a product lacks either a `name` or a `price`, it will be completely omitted from the result array. The optional `discount` field doesn't affect the item's inclusion.
+###### 7. Implicit Object Extraction (Simplest Syntax)
+For simpler object extraction, you can omit `type: 'object'` and `properties`. If the schema object contains keys that are not context-defining keywords (like `selector`, `has`, `exclude`, `required`, `strict`, `depth`), it is treated as an object schema where keys are property names.
+> **Keyword Collision Handling:** You can safely extract a data field named `type` as long as its value is not a reserved schema type (like `"string"`, `"object"`, `"array"`, etc.).
 ```json
 {
   "id": "extract",
   "params": {
     "selector": ".author-bio",
-    "name": { "selector": ".author-name" },
-    "email": { "selector": "a.email", "attribute": "href" }
+    "name": ".author-name",
+    "type": ".author-rank",
+    "items": { "type": "array", "selector": "li" }
   }
 }
 ```
-> This is equivalent to Example 2 but more concise.
+> **Key features of implicit objects:**
+>
+> 1. **Keyword Handling**: Common configuration keywords like `items`, `attribute`, or `mode` **can be used as property names** within an implicit object. They are only treated as configuration when a `type` (like `array`) is explicitly present. Configuration keywords like `required`, `strict`, and `depth` are also handled as context defining keys.
+> 2. **String Shorthand**: You can use a simple string as a property value (e.g., `"email": "a.email"`), which is automatically expanded to `{ "selector": "a.email" }`.
+> 3. **Context Separation**: Only `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are used to define the context and validation for the implicit object; all other keys are treated as data to be extracted.
+> 4. **Null Propagation**: If an implicit object has no `selector` and ALL of its sub-properties extract to `null`, the object itself returns `null`. This is crucial for `required` validation on the parent object or for skipping items in an array.
-###### 6. Precise Filtering: `has` and `exclude`
+###### 8. Advanced Filtering: `has` and `exclude`
 You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
@@ -546,3 +765,97 @@ The `FetchAction` base class provides lifecycle hooks that allow injecting custo
 * `protected onAfterExec?()`: Called after `onExecute`.
 For Actions that need to manage complex state or resources, you can implement these hooks. Generally, for composite actions, writing the logic directly in `onExecute` is sufficient.
+---
+## 💎 7. Action Return Types & State Management (Advanced)
+In `@isdk/web-fetcher`, an Action's `static returnType` is more than a type hint. It defines how the framework manages the **session state** and automates data synchronization after execution.
+### 7.1 Detailed Type Breakdown
+#### 🟢 `response` (Page Response)
+* **Definition**: A `FetchResponse` object containing HTTP status, headers, body, and cookies.
+* **Purpose**: To synchronize the latest page content and state with the session.
+* **Usage**: Used by actions that navigate, refresh, or capture the current page state.
+* **System Behavior**: The framework automatically updates `context.lastResponse` with this result in the `afterExec` phase. Subsequent actions can access this via the context.
+* **Typical Actions**: `goto`, `getContent`, `fill` (in some engines).
+* **Example**:
+  ```typescript
+  export class MyNavigateAction extends FetchAction {
+    static override id = 'myGoto';
+    static override returnType = 'response' as const;
+    async onExecute(context, options) {
+      // Logic that returns a FetchResponse
+      return await this.delegateToEngine(context, 'goto', options.params.url);
+    }
+  }
+  ```
+#### 🟡 `any` (Generic Data - Default)
+* **Definition**: Any serializable data structure (Object, Array, string, etc.).
+* **Purpose**: Primary mechanism for business data extraction.
+* **Usage**: Use this when your action produces processed data that doesn't represent the whole page or system state.
+* **System Behavior**: If the action configuration includes `storeAs: "key"`, the framework automatically saves the `result` into `context.outputs["key"]`.
+* **Typical Actions**: `extract`.
+* **Example**:
+  ```typescript
+  static override returnType = 'any' as const;
+  async onExecute(context, options) {
+    return { title: 'Hello', price: 99 }; // Saved to outputs if storeAs is set
+  }
+  ```
+#### ⚪ `none` (No Return)
+* **Definition**: `void`.
+* **Purpose**: Pure interaction/side-effects without data output.
+* **Usage**: Actions that perform UI interactions or timing controls.
+* **Typical Actions**: `click`, `submit`, `pause`, `trim`, `waitFor`.
+* **Example**:
+  ```typescript
+  static override returnType = 'none' as const;
+  async onExecute(context, options) {
+    await this.delegateToEngine(context, 'click', options.params.selector);
+    // No return value needed
+  }
+  ```
+#### 🔵 `outputs` (Accumulated Results)
+* **Definition**: The entire `context.outputs` record (`Record<string, any>`).
+* **Purpose**: To retrieve all data extracted and stored during the current session.
+* **Usage**: Typically used as a "summary" action at the end of a chain or for debugging.
+* **Typical Actions**: Custom data summary actions.
+#### 🟣 `context` (Session Snapshot)
+* **Definition**: The full `FetchContext` object.
+* **Purpose**: Meta-programming and deep debugging.
+* **Usage**: Allows the caller to inspect current session configurations (timeouts, proxies, headers) and internal engine metadata.
+---
+### 7.2 Result Wrapping (`FetchActionResult`)
+Every value returned by `onExecute` is automatically wrapped by the `FetchAction.execute` method into a `FetchActionResult` object. This ensures consistent error handling and metadata tracking across all actions.
+**Structure of `FetchActionResult`:**
+* `status`: `Success`, `Failed`, or `Skipped`.
+* `returnType`: Matches the action's `static returnType`.
+* `result`: The raw data returned by `onExecute`.
+* `error`: Captured error if the action failed.
+* `meta`: Diagnostic information including execution time, engine type, and retry counts.
+### 7.3 Developer Best Practices
+1. **Choose `response` for Navigation**: Always use `response` for actions that land on a new URL to ensure the session's "current page" is kept in sync.
+2. **Leverage `any` + `storeAs`**: For data extraction, return the data as `any` and let the user decide the storage key via `storeAs` in the JSON script.
+3. **Be Explicit with `none`**: Using `none` clearly signals that the action is used for its side effects (like clicking or waiting), making the workflow easier to understand.

package/docs/_media/README.cn.md CHANGED Viewed

@@ -132,7 +132,7 @@ searchGoogle('gemini');
 * `url` (string): 要导航的初始 URL。
 * `engine` ('http' | 'browser' | 'auto'): 要使用的引擎。默认为 `auto`。
 * `proxy` (string | string[]): 用于请求的代理 URL。
-* `debug` (boolean): 在响应中启用详细的执行元数据（耗时、使用的引擎等）。
+* `debug` (boolean | string | string[]): 在响应中启用详细的执行元数据（耗时、使用的引擎等），或启用特定类别（如 'extract', 'submit', 'request'）的调试日志。
 * `actions` (FetchActionOptions[]): 要执行的动作对象数组。（支持 `action`/`name` 作为 `id` 的别名，`args` 作为 `params` 的别名）
 * `headers` (Record<string, string>): 用于所有请求的头信息。
 * `cookies` (Cookie[]): 要使用的 Cookie 数组。
@@ -148,18 +148,20 @@ searchGoogle('gemini');
 * `sessionPoolOptions` (SessionPoolOptions): 底层 Crawlee SessionPool 的高级配置。
 * ...以及许多其他用于代理、重试等的选项。
-### 内置动作
+### 内置动作 (Built-in Actions)
-以下是核心的内置动作：
+该库提供了一系列强大的内置动作，其中许多动作是跨引擎通用的，并由核心层统一处理以确保一致性：
-* `goto`: 导航到一个新的 URL。
-* `click`: 点击一个由选择器指定的元素。
-* `fill`: 用指定的值填充一个输入字段。
-* `submit`: 提交一个表单。
-* `waitFor`: 暂停执行以等待特定条件（例如，超时、选择器出现或网络空闲）。
-* `pause`: 暂停执行以进行手动干预（例如，解决验证码）。
-* `getContent`: 获取当前页面状态的完整内容（HTML、文本等）。
-* `extract`: 使用富有表现力的声明式 Schema,可轻松提取页面中的任意结构化数据。
+* `goto`: 导航到新 URL。
+* `click`: 点击选择器指定的元素（引擎相关）。
+* `fill`: 用指定值填充输入框（引擎相关）。
+* `submit`: 提交表单（引擎相关）。
+* `trim`: 从 DOM 中移除元素以清理页面（如脚本、广告、隐藏内容）。
+* `waitFor`: 暂停执行以等待特定条件（支持统一处理的固定超时）。
+* `pause`: 暂停执行以进行人工干预（如处理验证码，由核心层统一处理）。
+* `getContent`: 获取当前页面状态的完整内容（由核心层统一处理）。
+* `evaluate`: 在页面上下文中执行自定义 JavaScript。
+* `extract`: 使用引擎无关的核心逻辑和引擎相关的 DOM 原语提取结构化数据。支持 `required` 字段和 `strict` 验证。
 ### 响应结构

package/docs/_media/README.engine.md CHANGED Viewed

@@ -12,6 +12,14 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 > ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library, using its powerful crawler abstractions.
+### Debug Mode & Tracing
+When the `debug: true` option is enabled, the engine provides detailed tracing of its internal operations:
+1. **Detailed Tracing**: Every major step (request processing, element selection, data extraction) is logged to the console with a `[FetchEngine:id]` prefix.
+2. **Extraction Insights**: During `extract()`, the engine logs how many elements were matched for each selector and the specific values being extracted, making it easier to debug complex or misaligned schemas.
+3. **Metadata**: The `FetchResponse` will include an enriched `metadata` object containing engine details, timing metrics (where available), and proxy information.
 ---
 ## 🧩 2. Core Concepts
@@ -20,13 +28,23 @@ The `engine` directory contains the core logic for the web fetcher. Its primary
 This is the abstract base class that defines the contract for all fetch engines.
-* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction.
+* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction. It acts as the central **Action Dispatcher**, handling engine-agnostic logic (like `extract`, `pause`, `getContent`) and delegating others.
 * **Key Abstractions**:
   * **Lifecycle**: `initialize()` and `cleanup()` methods.
   * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
+  * **DOM Primitives**: `_querySelectorAll()`, `_extractValue()`, `_parentElement()`, `_nextSiblingsUntil()`.
   * **Configuration & State**: `headers()`, `cookies()`, `blockResources()`, `getState()`, `sessionPoolOptions`.
 * **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing for dynamic selection by `id` or `mode`.
+### `FetchElementScope`
+The **`FetchElementScope`** is the engine-specific "handle" or "context" for a DOM element.
+- In **`CheerioFetchEngine`**, it is a combination of the Cheerio API (`$`) and the current selection (`el`).
+- In **`PlaywrightFetchEngine`**, it is a `Locator`.
+All extraction and interaction primitives operate on this scope, ensuring a unified way to reference elements across different underlying technologies.
 ### `FetchEngine.create(context, options)`
 This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
@@ -69,14 +87,14 @@ await session.executeAll([
 The engine supports persisting and restoring session state (primarily cookies) between executions.
 * **Flexible Session Isolation & Storage**: The library provides fine-grained control over how session data is stored and isolated via the `storage` configuration:
-    * **`id`**: A custom string to identify the storage.
-        * **Isolation (Default)**: If omitted, each session gets a unique ID, ensuring complete isolation of `RequestQueue`, `KeyValueStore`, and `SessionPool`.
-        * **Sharing**: Providing the same `id` across sessions allows them to share the same underlying storage, useful for persistent login sessions.
-    * **`persist`**: (boolean) Whether to enable disk persistence (Crawlee's `persistStorage`). Defaults to `false` (in-memory).
-    * **`purge`**: (boolean) Whether to delete the storage (drop `RequestQueue` and `KeyValueStore`) when the session is closed. Defaults to `true`.
-        * Set `purge: false` and provide a fixed `id` to create a truly persistent session that survives across application restarts.
-    * **`config`**: Allows passing raw configuration to the underlying Crawlee instance.
-        * **Note**: When `persist` is true, use `localDataDirectory` in the config to specify the storage path (e.g., `storage: { persist: true, config: { localDataDirectory: './my-data' } }`).
+  * **`id`**: A custom string to identify the storage.
+    * **Isolation (Default)**: If omitted, each session gets a unique ID, ensuring complete isolation of `RequestQueue`, `KeyValueStore`, and `SessionPool`.
+    * **Sharing**: Providing the same `id` across sessions allows them to share the same underlying storage, useful for persistent login sessions.
+  * **`persist`**: (boolean) Whether to enable disk persistence (Crawlee's `persistStorage`). Defaults to `false` (in-memory).
+  * **`purge`**: (boolean) Whether to delete the storage (drop `RequestQueue` and `KeyValueStore`) when the session is closed. Defaults to `true`.
+    * Set `purge: false` and provide a fixed `id` to create a truly persistent session that survives across application restarts.
+  * **`config`**: Allows passing raw configuration to the underlying Crawlee instance.
+    * **Note**: When `persist` is true, use `localDataDirectory` in the config to specify the storage path (e.g., `storage: { persist: true, config: { localDataDirectory: './my-data' } }`).
 * **`sessionState`**: A comprehensive state object (derived from Crawlee's SessionPool) that can be used to fully restore a previous session. This state is **automatically included in every `FetchResponse`**, making it easy to persist and later provide back to the engine during initialization.
 * **`sessionPoolOptions`**: Allows advanced configuration of the underlying Crawlee `SessionPool` (e.g., `maxUsageCount`, `maxPoolSize`).
 * **`overrideSessionState`**: If set to `true`, it forces the engine to overwrite any existing persistent state in the storage with the provided `sessionState`. This is useful when you want to ensure the session starts with the exact state provided, ignoring any stale data in the persistence layer.
@@ -86,6 +104,7 @@ The engine supports persisting and restoring session state (primarily cookies) b
 **Precedence Rule:**
 If both `sessionState` and `cookies` are provided, the engine adopts a **"Merge and Override"** strategy:
 1. The session is first restored from the `sessionState`.
 2. The explicit `cookies` are then applied on top.
    * **Result:** Any conflicting cookies in `sessionState` will be **overwritten** by the explicit `cookies`.
@@ -114,8 +133,11 @@ Our engine solves this by creating a bridge between the external API calls and t
     * The page context is used to resolve the `Promise` from the `goto()` call.
     * The page is marked as "active" (`isPageActive = true`).
     * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). This loop effectively **pauses the `requestHandler`** by listening for events on an `EventEmitter`, keeping the page context alive.
+    * **Strict Sequential Execution & Re-entrancy**: The loop uses an internal queue to ensure all actions are executed in the exact order they were dispatched. It also includes re-entrancy protection to allow composite actions to call atomic actions without deadlocking.
 5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
-6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event. Because it has access to the page context, it can perform the *actual* interaction (e.g., `page.click(...)`).
+6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, hears the event.
+    * **Centralized Actions**: Actions like `extract`, `pause`, and `getContent` are processed immediately by the `FetchEngine` base class using the unified logic.
+    * **Delegated Actions**: Engine-specific interactions (e.g., `click`, `fill`) are delegated to the subclass's `executeAction` implementation.
 7. **Robust Cleanup**: When `dispose()` or `cleanup()` is called:
     * An `isEngineDisposed` flag is set to prevent new actions.
     * A `dispose` signal is emitted to wake up and terminate the action loop.
@@ -134,6 +156,7 @@ There are two primary engine implementations:
 * **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP and parse static HTML.
 * **Behavior**:
   * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
+  * ✅ **HTTP-Compliant Redirects**: Correctly handles 301-303 and 307/308 redirects, preserving methods/bodies or converting to GET as per HTTP specifications.
   * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
   * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests.
 * **Use Case**: Scraping static websites, server-rendered pages, or APIs.
@@ -166,11 +189,68 @@ To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an
 The `extract()` method provides a powerful, declarative way to pull structured data from a web page. It uses a **Schema** to define the structure of your desired JSON output, and the engine automatically handles DOM traversal and data extraction.
-### Core Design: Schema Normalization
+### Core Design: The Three-Layer Architecture
+To ensure consistency across engines and maintain high quality, the extraction system is divided into three layers:
+1. **Normalization Layer (`src/core/normalize-extract-schema.ts`)**: Pre-processes user-provided schemas into a canonical internal format, handling CSS filter merging and implicit object detection.
+2. **Core Extraction Logic (`src/core/extract.ts`)**: An engine-agnostic layer responsible for the extraction workflow. It dispatches tasks to `_extractObject`, `_extractArray`, or `_extractValue` via a **Dispatcher (`_extract`)**. It manages recursion, strict mode, required field validation, anchor resolution, performance-optimized tree operations (LCA, bubbling), and sequential consumption cursors.
+3. **Engine Interface (`IExtractEngine`)**: Defined at the core layer and implemented by engines to provide low-level DOM primitives.
+#### Implementation Rules for `IExtractEngine`
+To maintain cross-engine consistency, all implementations MUST follow these behavior contracts:
+- **`_querySelectorAll`**:
+  - MUST return matching elements in **document order**.
+  - MUST check if the scope element(s) **themselves** match the selector and search their **descendants**.
+- **`_nextSiblingsUntil`**:
+  - MUST return a flat list of siblings starting *after* the anchor and stopping *before* the first element matching the `untilSelector`.
+- **`_isSameElement`**:
+  - MUST compare elements based on **identity**, not content.
+- **`_findClosestAncestor`**:
+  - MUST efficiently find the closest ancestor of an element that exists in a given set of candidates.
+  - MUST be optimized to avoid multiple IPC calls in browser-based engines.
+- **`_contains`**:
+  - MUST implement standard DOM `Node.contains()` behavior.
+  - MUST be optimized for high-frequency boundary checks.
+- **`_findCommonAncestor`**:
+  - MUST find the Lowest Common Ancestor (LCA) of two elements.
+  - **Performance Critical**: In browser engines, this MUST be executed within a single `evaluate` call to minimize IPC (Inter-Process Communication) overhead.
+- **`_findContainerChild`**:
+  - MUST find the direct child of a container that contains a specific descendant.
+  - **Performance Critical**: This replaces manual "bubble-up" loops in the Node.js context, significantly reducing overhead for deep DOM trees.
+- **`_bubbleUpToScope` (Internal Helper)**:
+  - Implements the logic to bubble up from a deep element to its direct ancestor in the current scope.
+  - Supports an optional `depth` parameter to limit how many parent levels to traverse.
+  - MUST include a maximum depth limit (default 1000) to prevent infinite loops.
+This architecture ensures that complex features like **Columnar Alignment**, **Segmented Scanning**, and **Anchor Jumping** behave identically across the fast Cheerio engine and the full Playwright browser.
+### Schema Normalization
+To enhance usability and flexibility, the `extract` method internally implements a **"Normalization"** layer. This allows you to provide semantically clear shorthands, which are automatically converted into a standardized internal format.
+#### 1. Shorthand Rules
+- **String Shorthand**: A simple string like `'h1'` is automatically expanded to `{ selector: 'h1', type: 'string', mode: 'text' }`.
+- **Implicit Object Shorthand**: If you provide an object without an explicit `type: 'object'`, it is automatically treated as an `object` schema where the keys are the property names.
+  - *Example*: `{ "title": "h1" }` becomes `{ "type": "object", "properties": { "title": { "selector": "h1" } } }`.
+- **Filter Shorthand**: If you provide `has` or `exclude` alongside a `selector`, they are automatically merged into the CSS selector using `:has()` and `:not()` pseudo-classes.
+- **Array Shorthand**: Providing `attribute` directly on an `array` schema acts as a shorthand for its `items`.
+  - *Example*: `{ "type": "array", "attribute": "href" }` becomes `{ "type": "array", "items": { "type": "string", "attribute": "href" } }`.
+#### 2. Context vs. Data (Keyword Separation)
+In **Implicit Objects**, the engine must distinguish between *configuration* (where to look) and *data* (what to extract).
+- **Context Keys**: The keys `selector`, `has`, `exclude`, `required`, `strict`, and `depth` are reserved for defining the extraction context and validation. They stay at the root of the schema.
+- **Data Keys**: All other keys (including `items`, `attribute`, `mode`, or even a field named `type`) are moved into the `properties` object as data fields to be extracted.
+- **Collision Handling**: You can safely extract a field named `type` as long as its value is not one of the reserved schema type keywords (`string`, `number`, `boolean`, `html`, `object`, `array`).
-To enhance usability and flexibility, the `extract` method internally implements a **"Normalization"** layer. This means you can provide semantically clear shorthands, and the engine will automatically convert them into a standard, more verbose internal format before execution. This makes writing complex extraction rules simple and intuitive.
+#### 3. Cross-Engine Consistency
-### Schema Structure
+This normalization layer ensures that regardless of whether you are using the `cheerio` (http) or `playwright` (browser) engine, the complex shorthand logic behaves identically, providing a consistent "AI-friendly" interface.
 A schema can be one of three types: