npm - @isdk/web-fetcher - Versions diffs - 0.2.0 - Mend

@isdk/web-fetcher 0.2.0


Files changed (66)

package/README.action.cn.md +469 -0
package/README.action.md +452 -0
package/README.cn.md +147 -0
package/README.engine.cn.md +262 -0
package/README.engine.md +262 -0
package/README.md +147 -0
package/dist/index.d.mts +1603 -0
package/dist/index.d.ts +1603 -0
package/dist/index.js +1 -0
package/dist/index.mjs +1 -0
package/docs/README.md +151 -0
package/docs/_media/LICENSE-MIT +22 -0
package/docs/_media/README.action.md +452 -0
package/docs/_media/README.cn.md +147 -0
package/docs/_media/README.engine.md +262 -0
package/docs/classes/CheerioFetchEngine.md +1447 -0
package/docs/classes/ClickAction.md +533 -0
package/docs/classes/ExtractAction.md +533 -0
package/docs/classes/FetchAction.md +444 -0
package/docs/classes/FetchEngine.md +1230 -0
package/docs/classes/FetchSession.md +111 -0
package/docs/classes/FillAction.md +533 -0
package/docs/classes/GetContentAction.md +533 -0
package/docs/classes/GotoAction.md +537 -0
package/docs/classes/PauseAction.md +533 -0
package/docs/classes/PlaywrightFetchEngine.md +1437 -0
package/docs/classes/SubmitAction.md +533 -0
package/docs/classes/WaitForAction.md +533 -0
package/docs/classes/WebFetcher.md +85 -0
package/docs/enumerations/FetchActionResultStatus.md +40 -0
package/docs/functions/fetchWeb.md +43 -0
package/docs/globals.md +72 -0
package/docs/interfaces/BaseFetchActionProperties.md +83 -0
package/docs/interfaces/BaseFetchCollectorActionProperties.md +145 -0
package/docs/interfaces/BaseFetcherProperties.md +206 -0
package/docs/interfaces/Cookie.md +142 -0
package/docs/interfaces/DispatchedEngineAction.md +60 -0
package/docs/interfaces/ExtractActionProperties.md +113 -0
package/docs/interfaces/FetchActionInContext.md +149 -0
package/docs/interfaces/FetchActionProperties.md +125 -0
package/docs/interfaces/FetchActionResult.md +55 -0
package/docs/interfaces/FetchContext.md +424 -0
package/docs/interfaces/FetchEngineContext.md +328 -0
package/docs/interfaces/FetchMetadata.md +73 -0
package/docs/interfaces/FetchResponse.md +105 -0
package/docs/interfaces/FetchReturnTypeRegistry.md +57 -0
package/docs/interfaces/FetchSite.md +320 -0
package/docs/interfaces/FetcherOptions.md +300 -0
package/docs/interfaces/GotoActionOptions.md +66 -0
package/docs/interfaces/PendingEngineRequest.md +51 -0
package/docs/interfaces/SubmitActionOptions.md +23 -0
package/docs/interfaces/WaitForActionOptions.md +39 -0
package/docs/type-aliases/BaseFetchActionOptions.md +11 -0
package/docs/type-aliases/BaseFetchCollectorOptions.md +11 -0
package/docs/type-aliases/BrowserEngine.md +11 -0
package/docs/type-aliases/FetchActionCapabilities.md +11 -0
package/docs/type-aliases/FetchActionCapabilityMode.md +11 -0
package/docs/type-aliases/FetchActionOptions.md +11 -0
package/docs/type-aliases/FetchEngineAction.md +18 -0
package/docs/type-aliases/FetchEngineType.md +11 -0
package/docs/type-aliases/FetchReturnType.md +11 -0
package/docs/type-aliases/FetchReturnTypeFor.md +17 -0
package/docs/type-aliases/OnFetchPauseCallback.md +23 -0
package/docs/type-aliases/ResourceType.md +11 -0
package/docs/variables/DefaultFetcherProperties.md +11 -0
package/package.json +90 -0

package/docs/_media/README.action.md ADDED

@@ -0,0 +1,452 @@
+# 📜 Action Script Architecture
+English | [简体中文](./README.action.cn.md)
+> This document details the architecture, design philosophy, and usage of the Action Script system within `@isdk/web-fetcher`. It is intended to help developers maintain and extend the system, and to help users efficiently build automation tasks.
+## 🎯 1. Overview
+The core goal of the Action Script system is to provide a **declarative, engine-agnostic** way to define and execute a series of web interactions.
+The system is built on two fundamental concepts:
+* **⚛️ Atomic Actions:** Built into the library, these represent a single, indivisible operation and are the basic "atoms" that make up all complex processes. Examples: `goto`, `click`, `fill`.
+* **🧩 Composite Actions:** Created by the library user, these represent a complex operation with business semantics, composed of multiple atomic actions. This is the essence of the architecture, encouraging users to encapsulate low-level operations into higher-level "molecules" that are easier to understand and reuse. Examples: `login`, `search`, `addToCart`.
+This approach allows users to describe a complete business process with intuitive semantics, while hiding the specific, engine-related implementation details in the underlying layers.
+---
+## 🛠️ 2. Core Concepts
+### `FetchAction` (Base Class)
+`FetchAction` is the abstract base class for all Actions. It defines the core elements of an Action:
+* `static id`: The unique identifier for the Action, e.g., `'click'`.
+* `static returnType`: The type of the result returned after the Action executes, e.g., `'none'`, `'response'`.
+* `static capabilities`: Declares the capability level of this Action in different engines (`http`, `browser`), such as `native`, `simulate`, or `noop`.
+* `static register()`: A static method to register the Action class in a global registry, allowing it to be dynamically created by its `id`.
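The static members listed above suggest a registry keyed by `id`. The following is a minimal, self-contained sketch of that pattern, not the library's actual implementation: `SketchAction`, `SketchActionClass`, `registerAction`, `createAction`, and the capability values chosen for `SketchClickAction` are all hypothetical names and values used for illustration only.

```typescript
// Hypothetical sketch of an id-keyed action registry.
// None of these names are taken from @isdk/web-fetcher's internals.
type EngineMode = 'http' | 'browser';
type CapabilityMode = 'native' | 'simulate' | 'noop';

abstract class SketchAction {
  // Returns a description of what it did, so the sketch is easy to inspect.
  abstract onExecute(params: Record<string, unknown>): string;
}

// The static side of an action class: metadata plus a constructor.
interface SketchActionClass {
  readonly id: string;
  readonly returnType: string;
  readonly capabilities: Record<EngineMode, CapabilityMode>;
  new (): SketchAction;
}

const actionRegistry = new Map<string, SketchActionClass>();

function registerAction(actionClass: SketchActionClass): void {
  actionRegistry.set(actionClass.id, actionClass);
}

// Creates an action instance dynamically from its string id.
function createAction(id: string): SketchAction {
  const actionClass = actionRegistry.get(id);
  if (!actionClass) throw new Error(`Unknown action id: ${id}`);
  return new actionClass();
}

class SketchClickAction extends SketchAction {
  static readonly id = 'click';
  static readonly returnType = 'none';
  // Assumed values, chosen only to show the shape of the declaration.
  static readonly capabilities = { http: 'simulate', browser: 'native' } as const;

  onExecute(params: Record<string, unknown>): string {
    return `click ${String(params['selector'])}`;
  }
}

registerAction(SketchClickAction);
const createdAction = createAction('click');
const clickDescription = createdAction.onExecute({ selector: '#submit' });
```

Keeping the metadata on the static side means a script runner can look up an action's `id` and `capabilities` without instantiating it.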
+### `onExecute` (Core Logic)
+Every `FetchAction` subclass must implement the `onExecute` method. This is where an Action defines its behavior.
+### `delegateToEngine` (Delegation Helper)
+To simplify the creation of **atomic actions**, the `FetchAction` base class provides a protected helper method, `delegateToEngine`. It forwards the call to the corresponding method on the active engine, passing along any arguments, so an atomic action can be a thin wrapper around an engine capability.
+**Example: The `fill` Action using `delegateToEngine`**
+```typescript
+// src/action/definitions/fill.ts
+export class FillAction extends FetchAction {
+  // ...
+  async onExecute(context: FetchContext, options?: BaseFetchActionProperties): Promise<void> {
+    const { selector, value, ...restOptions } = options?.params || {};
+    if (!selector) throw new Error('Selector is required for fill action');
+    if (value === undefined) throw new Error('Value is required for fill action');
+    // selector, value, and restOptions are passed as arguments to engine.fill()
+    await this.delegateToEngine(context, 'fill', selector, value, restOptions);
+  }
+}
+```
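To make the forwarding idea concrete, here is a rough sketch of how a helper like `delegateToEngine` could work, assuming the engine is reachable as `context.internal.engine` (as in the `LoginAction` example in this document). `SketchEngine`, `SketchContext`, and `delegateToEngineSketch` are hypothetical stand-ins, and the real helper may differ in signature and error handling.

```typescript
// Hypothetical stand-ins; these are not the library's real types.
type EngineMethod = (...args: unknown[]) => unknown;
type SketchEngine = Record<string, EngineMethod>;

interface SketchContext {
  internal: { engine?: SketchEngine };
}

// Looks up `method` on the active engine and forwards all remaining arguments.
function delegateToEngineSketch(
  context: SketchContext,
  method: string,
  ...args: unknown[]
): unknown {
  const engine = context.internal.engine;
  if (!engine) throw new Error('No engine available');
  const engineMethod = engine[method];
  if (typeof engineMethod !== 'function') {
    throw new Error(`Engine does not implement "${method}"`);
  }
  return engineMethod.apply(engine, args);
}

// A fake engine that records calls instead of touching a page.
const engineCalls: string[] = [];
const sketchContext: SketchContext = {
  internal: {
    engine: {
      fill: (selector, value) => {
        engineCalls.push(`fill ${String(selector)} = ${String(value)}`);
      },
    },
  },
};

delegateToEngineSketch(sketchContext, 'fill', '#email', 'user@example.com', {});
```

Because the helper only needs a method name and an argument list, each atomic action can stay focused on validating its own `params`.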
+---
+## 🚀 3. How to Use (For Users)
+Users define a complete automation workflow via a JSON-formatted `actions` array.
+### Using Atomic Actions
+For simple, linear workflows, you can use a list of the library's built-in atomic actions directly.
+**Example: Searching for "gemini" on Google**
+```json
+{
+  "actions": [
+    { "id": "goto", "params": { "url": "https://www.google.com" } },
+    { "id": "fill", "params": { "selector": "textarea[name=q]", "value": "gemini" } },
+    { "id": "submit", "params": { "selector": "form" } }
+  ]
+}
+```
+### Built-in Atomic Actions
+The library provides a set of essential atomic actions to perform common web interactions.
+#### `goto`
+Navigates the browser to a new URL.
+* **`id`**: `goto`
+* **`params`**:
+  * `url` (string): The URL to navigate to.
+  * Other navigation options, such as `waitUntil` and `timeout`, are passed through to the engine.
+* **`returns`**: `response`
+#### `click`
+Clicks on an element specified by a selector.
+* **`id`**: `click`
+* **`params`**:
+  * `selector` (string): A CSS selector or XPath to identify the element to click.
+* **`returns`**: `none`
+#### `fill`
+Fills an input field with a specified value.
+* **`id`**: `fill`
+* **`params`**:
+  * `selector` (string): A selector for the input element.
+  * `value` (string): The text to fill into the element.
+* **`returns`**: `none`
+#### `submit`
+Submits a form.
+* **`id`**: `submit`
+* **`params`**:
+  * `selector` (string, optional): A selector for the form element.
+* **`returns`**: `none`
+#### `waitFor`
+Pauses execution to wait for a specific condition to be met.
+* **`id`**: `waitFor`
+* **`params`**: An object specifying the wait condition (e.g., `ms`, `selector`, `networkIdle`).
+* **`returns`**: `none`
+#### `pause`
+Pauses the execution of the Action Script to allow for manual user intervention (e.g., solving a CAPTCHA).
+This action **requires** an `onPause` callback handler to be provided in the `fetchWeb` options. When triggered, this action calls the `onPause` handler and waits for it to complete.
+* **`id`**: `pause`
+* **`params`**:
+  * `selector` (string, optional): If provided, the action will only pause if an element matching this selector exists.
+  * `attribute` (string, optional): Used in conjunction with `selector`. If provided, the action will only pause if the element exists AND has the specified attribute.
+  * `message` (string, optional): A message that will be passed to the `onPause` handler, which can be used to display prompts to the user.
+* **`returns`**: `none`
+**Example: Handling a CAPTCHA in Google Search**
+```json
+{
+  "actions": [
+    { "id": "goto", "params": { "url": "https://www.google.com/search?q=gemini" } },
+    {
+      "id": "pause",
+      "params": {
+        "selector": "#recaptcha",
+        "message": "Google CAPTCHA detected. Please solve it in the browser and press Enter to continue."
+      }
+    },
+    { "id": "waitFor", "params": { "selector": "#search" } }
+  ]
+}
+```
+**`onPause` Handler Example:**
+```typescript
+// In your code that calls fetchWeb
+import { fetchWeb } from '@isdk/web-fetcher';
+import readline from 'readline';
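+// The parameter shape below is inferred from the `message` param described above.
+const handlePause = async ({ message }: { message?: string }): Promise<void> => {
+  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
+  // Resolve once the user presses Enter, then release the terminal.
+  await new Promise<void>(resolve => {
+    rl.question(message || 'Execution paused. Press Enter to continue...', () => {
+      rl.close();
+      resolve();
+    });
+  });
+};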
+await fetchWeb({
+  // ...,
+  engine: 'browser',
+  engineOptions: { headless: false },
+  onPause: handlePause,
+  actions: [
+    // ... your actions
+  ]
+});
+```
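The `selector` and `attribute` conditions described for `pause` can be read as a small decision function. The sketch below is one interpretation of those rules, not the library's code; it assumes that omitting `selector` means the action always pauses. `PauseParamsSketch`, `FoundElementSketch`, `ElementLookup`, and `shouldPause` are hypothetical names.

```typescript
// Hypothetical model of the `pause` params and of an element lookup.
interface PauseParamsSketch {
  selector?: string;
  attribute?: string;
  message?: string;
}

interface FoundElementSketch {
  attributes: Record<string, string>;
}

type ElementLookup = (selector: string) => FoundElementSketch | undefined;

function shouldPause(params: PauseParamsSketch, findElement: ElementLookup): boolean {
  // Assumption: without a selector the pause is unconditional.
  if (!params.selector) return true;
  const element = findElement(params.selector);
  // With a selector, pause only if a matching element exists...
  if (!element) return false;
  // ...and, if `attribute` is given, only if the element has that attribute.
  if (params.attribute) {
    return Object.prototype.hasOwnProperty.call(element.attributes, params.attribute);
  }
  return true;
}

// A fake page containing a single CAPTCHA element.
const fakePage: Record<string, FoundElementSketch> = {
  '#recaptcha': { attributes: { 'data-sitekey': 'demo' } },
};
const lookupOnFakePage: ElementLookup = (selector) => fakePage[selector];

const pausesForCaptcha = shouldPause({ selector: '#recaptcha' }, lookupOnFakePage);
const pausesForMissing = shouldPause({ selector: '#missing' }, lookupOnFakePage);
```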
+#### `getContent`
+Retrieves the full content of the current page state.
+* **`id`**: `getContent`
+* **`params`**: (none)
+* **`returns`**: `response`
+#### `extract`
+Extracts structured data from the page using a powerful and declarative Schema. This is the core Action for data collection.
+* **`id`**: `extract`
+* **`params`**: An `ExtractSchema` object that defines the extraction rules.
+* **`returns`**: `any` (the extracted data)
+##### Detailed Explanation of Extraction Schema
+The `params` object itself is a Schema that describes the data structure you want to extract.
+###### 1. Extracting a Single Value
+The most basic extraction. You can specify a `selector` (CSS selector), an `attribute` (the name of the attribute to extract), and a `type` (string, number, boolean, html).
+```json
+{
+  "id": "extract",
+  "params": {
+    "selector": "h1.main-title",
+    "type": "string"
+  }
+}
+```
+> The example above will extract the text content of the `<h1>` tag with the class `main-title`.
+###### 2. Extracting an Object
+Define a structured object using `type: 'object'` and the `properties` field.
+```json
+{
+  "id": "extract",
+  "params": {
+    "type": "object",
+    "selector": ".author-bio",
+    "properties": {
+      "name": { "selector": ".author-name" },
+      "email": { "selector": "a.email", "attribute": "href" }
+    }
+  }
+}
+```
+###### 3. Extracting an Array (Convenient Usage)
+Extract a list using `type: 'array'`. Two shortcuts cover the most common cases.
+* **Extracting an Array of Texts (Default Behavior)**: When you want to extract a list of text, just provide the selector and omit `items`. This is the most common usage.
+    ```json
+    {
+      "id": "extract",
+      "params": {
+        "type": "array",
+        "selector": ".tags li"
+      }
+    }
+    ```
+    > The example above will return an array of the text from all `<li>` tags, e.g., `["tech", "news"]`.
+* **Extracting an Array of Attributes (Shortcut)**: When you only want to extract a list of attributes (e.g., all `href`s from links), there's no need to nest `items` either. Just declare `attribute` directly in the `array` definition.
+    ```json
+    {
+      "id": "extract",
+      "params": {
+        "type": "array",
+        "selector": ".gallery img",
+        "attribute": "src"
+      }
+    }
+    ```
+    > The example above will return an array of the `src` attributes from all `<img>` tags.
+###### 4. Precise Filtering: `has` and `exclude`
+You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.
+* `has`: A CSS selector to ensure the selected element **must contain** a descendant matching this selector.
+* `exclude`: A CSS selector to **exclude** elements matching this selector from the results.
+**Complete Example: Extracting links of articles that have an image and are not marked as "draft"**
+```json
+{
+  "actions": [
+    { "id": "goto", "params": { "url": "https://example.com/articles" } },
+    {
+      "id": "extract",
+      "params": {
+        "type": "array",
+        "selector": "div.article-card",
+        "has": "img.cover-image",
+        "exclude": ".draft",
+        "items": {
+          "selector": "a.title-link",
+          "attribute": "href"
+        }
+      }
+    }
+  ]
+}
+```
+> The `extract` action above will:
+>
+> 1. Find all `div.article-card` elements.
+> 2. Filter them to only include those that contain an `<img class="cover-image">`.
+> 3. Further filter the results to exclude any that also have the `.draft` class.
+> 4. For each of the remaining `div.article-card` elements, find its descendant `a.title-link` and extract the `href` attribute.
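The four steps above can be sketched against a tiny in-memory element model. This is an illustration of the filtering order only, under simplifying assumptions: `SketchElement`, `ArraySchemaSketch`, `matchesSelector`, and `extractArray` are hypothetical, and the matcher understands just `tag`, `.class`, and `tag.class` selectors, whereas the real engines rely on Cheerio or Playwright.

```typescript
// Hypothetical, simplified DOM model used only for this sketch.
interface SketchElement {
  tag: string;
  classes: string[];
  attributes: Record<string, string>;
  children: SketchElement[];
}

interface ArraySchemaSketch {
  selector: string;
  has?: string;
  exclude?: string;
  items: { selector: string; attribute: string };
}

// Supports only `tag`, `.class`, and `tag.class` selectors.
function matchesSelector(element: SketchElement, selector: string): boolean {
  const [tag, ...classes] = selector.split('.');
  if (tag && tag !== element.tag) return false;
  return classes.every((cls) => element.classes.includes(cls));
}

function descendantsOf(element: SketchElement): SketchElement[] {
  const found: SketchElement[] = [];
  for (const child of element.children) {
    found.push(child, ...descendantsOf(child));
  }
  return found;
}

function extractArray(root: SketchElement, schema: ArraySchemaSketch): string[] {
  const { selector, has, exclude, items } = schema;
  const results: string[] = [];
  for (const candidate of descendantsOf(root)) {
    if (!matchesSelector(candidate, selector)) continue; // step 1: select
    const inside = descendantsOf(candidate);
    if (has !== undefined && !inside.some((el) => matchesSelector(el, has))) continue; // step 2: has
    if (exclude !== undefined && matchesSelector(candidate, exclude)) continue; // step 3: exclude
    const target = inside.find((el) => matchesSelector(el, items.selector)); // step 4: items
    const value = target?.attributes[items.attribute];
    if (value !== undefined) results.push(value);
  }
  return results;
}

function el(
  tag: string,
  classes: string[],
  attributes: Record<string, string> = {},
  children: SketchElement[] = [],
): SketchElement {
  return { tag, classes, attributes, children };
}

// Three cards: a valid one, one without a cover image, and one marked as draft.
const samplePage = el('body', [], {}, [
  el('div', ['article-card'], {}, [
    el('img', ['cover-image']),
    el('a', ['title-link'], { href: '/post-1' }),
  ]),
  el('div', ['article-card'], {}, [el('a', ['title-link'], { href: '/post-2' })]),
  el('div', ['article-card', 'draft'], {}, [
    el('img', ['cover-image']),
    el('a', ['title-link'], { href: '/post-3' }),
  ]),
]);

const articleLinks = extractArray(samplePage, {
  selector: 'div.article-card',
  has: 'img.cover-image',
  exclude: '.draft',
  items: { selector: 'a.title-link', attribute: 'href' },
});
```

In this sample only the first card survives both filters, so `articleLinks` contains just `/post-1`.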
+### Building High-Level Semantic Actions via "Composition"
+This is the recommended best practice for **users** to encapsulate and reuse business logic.
+**Scenario: Creating a reusable `LoginAction`**
+1. **Define `LoginAction.ts` in your project:**
+    ```typescript
+    import { FetchContext, FetchAction, BaseFetchActionOptions } from '@isdk/web-fetcher';
+    export class LoginAction extends FetchAction {
+      static override id = 'login';
+      static override capabilities = { http: 'simulate' as const, browser: 'native' as const };
+      async onExecute(context: FetchContext, options?: BaseFetchActionOptions): Promise<void> {
+        const { username, password, userSelector, passSelector, submitSelector } = options?.params || {};
+        if (!username || !password || !userSelector || !passSelector || !submitSelector) {
+          throw new Error('Username, password, and all selectors are required for login action');
+        }
+        const engine = context.internal.engine;
+        if (!engine) throw new Error('No engine available');
+        // Orchestrate atomic capabilities to form a complete business process
+        await engine.fill({ selector: userSelector, value: username });
+        await engine.fill({ selector: passSelector, value: password });
+        await engine.click({ selector: submitSelector });
+        await engine.waitFor({ networkIdle: true });
+      }
+    }
+    ```
+2. **Register this custom Action when your application starts:**
+    ```typescript
+    import { FetchAction } from '@isdk/web-fetcher';
+    import { LoginAction } from './path/to/LoginAction';
+    FetchAction.register(LoginAction);
+    ```
+3. **Use your `LoginAction` in scripts:**
+    Now, your action script becomes much cleaner and more semantic:
+    ```json
+    {
+      "actions": [
+        {
+          "id": "login",
+          "params": {
+            "username": "testuser",
+            "password": "password123",
+            "userSelector": "#username",
+            "passSelector": "#password",
+            "submitSelector": "button[type=submit]"
+          }
+        }
+      ]
+    }
+    ```
+---
+## 🧲 4. Advanced Feature: Collectors
+A Collector is a powerful mechanism that allows a **main Action** to run one or more **child Actions** during its execution to collect data in a parallel, event-driven manner.
+### Core Concepts
+Collectors are defined in the `collectors` array of a main Action. Their execution is event-driven:
+* `activateOn`: Event(s) to activate the collector.
+* `collectOn`: Event(s) that trigger the collector's `onExecute` logic.
+* `deactivateOn`: Event(s) to deactivate the collector.
+* `storeAs`: A key to store the collected results in `context.outputs`.
+> **ℹ️ Special Rule**: If a collector has none of the `*On` events (`activateOn`, `collectOn`, `deactivateOn`) configured, it will execute its `onExecute` logic once when the main Action's `end` event is triggered.
+### Applicable Scenarios for Collectors
+> **⚠️ Important**: While any Action can technically be used as a collector, it is only meaningful for **Actions whose purpose is to return data** (e.g., `getContent`, `extract`). Using an action like `click` or `fill` as a collector is pointless as it doesn't "collect" anything.
+### Usage Example
+**Scenario**: Visit a blog page and collect all hyperlinks (`href` from `<a>` tags) after the page has loaded.
+```json
+{
+  "actions": [
+    {
+      "id": "goto",
+      "params": { "url": "https://example.com/blog/my-post" },
+      "collectors": [
+        {
+          "id": "extract",
+          "name": "linkCollector",
+          "params": {
+            "type": "array",
+            "selector": "a",
+            "attribute": "href"
+          },
+          "storeAs": "allLinks"
+        }
+      ]
+    }
+  ]
+}
+```
+**Execution Flow**:
+1. The main `goto` Action begins.
+2. The `linkCollector` is initialized.
+3. Since it has no triggers, it waits for the `goto` action to complete.
+4. `goto` loads the page and fires its `action:goto.end` event.
+5. `linkCollector` hears this event and executes, extracting the `href` from all `<a>` tags.
+6. The results are pushed into the `context.outputs.allLinks` array.
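The execution flow above can be approximated with a very small event bus. This sketch covers only the default rule (a collector with no events configured runs once on the main action's `end` event). `TinyEventBus`, `CollectorSketch`, and `runActionWithCollectors` are hypothetical, and spreading the collected values into the `storeAs` array is an assumption about how results are stored.

```typescript
type Listener = () => void;

// Minimal hand-rolled event bus, standing in for the library's event system.
class TinyEventBus {
  private readonly listeners = new Map<string, Listener[]>();

  on(eventName: string, listener: Listener): void {
    const existing = this.listeners.get(eventName) ?? [];
    this.listeners.set(eventName, [...existing, listener]);
  }

  emit(eventName: string): void {
    for (const listener of this.listeners.get(eventName) ?? []) listener();
  }
}

// Hypothetical collector shape: a name, an output key, and a collect function.
interface CollectorSketch {
  name: string;
  storeAs: string;
  collect: () => string[];
}

function runActionWithCollectors(
  actionId: string,
  mainAction: () => void,
  collectors: CollectorSketch[],
  outputs: Record<string, string[]>,
): void {
  const bus = new TinyEventBus();
  const endEvent = `action:${actionId}.end`;
  for (const collector of collectors) {
    // Default rule: no triggers configured, so collect once on the end event.
    bus.on(endEvent, () => {
      const bucket = outputs[collector.storeAs] ?? [];
      // Assumption: collected values are appended to the `storeAs` array.
      bucket.push(...collector.collect());
      outputs[collector.storeAs] = bucket;
    });
  }
  mainAction();
  bus.emit(endEvent);
}

// Simulated page state that only exists after the main action has run.
let loadedLinks: string[] = [];
const collectedOutputs: Record<string, string[]> = {};

runActionWithCollectors(
  'goto',
  () => {
    loadedLinks = ['/about', '/contact'];
  },
  [{ name: 'linkCollector', storeAs: 'allLinks', collect: () => [...loadedLinks] }],
  collectedOutputs,
);
```

Because the collector subscribes before the main action runs but executes only on the end event, it sees the page state produced by `goto`.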
+---
+## 🧑‍💻 5. How to Extend (For Developers)
+As a library developer, your primary responsibility is to enrich the **atomic Action** ecosystem.
+### Adding a New Atomic Action
+1. **Define the Capability in the Engine:** Add a new abstract method to `FetchEngine` in `src/engine/base.ts` and implement it in the concrete engines (`CheerioFetchEngine`, `PlaywrightFetchEngine`).
+2. **Create the Action Class:** Create a new file like `src/action/definitions/MyNewAction.ts`.
+3. **Implement `onExecute`:** Use the `delegateToEngine` helper for simple cases.
+4. **Register the Action:** Call `FetchAction.register(MyNewAction)` in your new file.
+---
+## 🔄 6. Action Lifecycle
+The `FetchAction` base class provides lifecycle hooks that allow injecting custom behavior before and after the core logic of an Action executes.
+* `protected onBeforeExec?()`: Called before `onExecute`.
+* `protected onAfterExec?()`: Called after `onExecute`.
+For Actions that need to manage complex state or resources, you can implement these hooks. Generally, for composite actions, writing the logic directly in `onExecute` is sufficient.
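The hook ordering can be illustrated with a small template-method sketch. It is kept synchronous for brevity, and the hook signatures are assumptions, since this document does not specify them; the library's hooks may well be asynchronous. `LifecycleSketchAction` and `ResourceSketchAction` are hypothetical classes.

```typescript
// Hypothetical base class showing the order in which hooks would run.
abstract class LifecycleSketchAction {
  // Optional hooks: subclasses implement them only when needed.
  protected onBeforeExec?(): void;
  protected onAfterExec?(): void;
  protected abstract onExecute(): string;

  execute(): string {
    this.onBeforeExec?.();
    const result = this.onExecute();
    this.onAfterExec?.();
    return result;
  }
}

// An action that records each lifecycle step, e.g. to manage a resource.
class ResourceSketchAction extends LifecycleSketchAction {
  readonly steps: string[] = [];

  protected override onBeforeExec(): void {
    this.steps.push('before');
  }

  protected override onExecute(): string {
    this.steps.push('execute');
    return 'done';
  }

  protected override onAfterExec(): void {
    this.steps.push('after');
  }
}

const lifecycleAction = new ResourceSketchAction();
const lifecycleResult = lifecycleAction.execute();
```

After `execute()` returns, `lifecycleAction.steps` holds the three steps in the order `before`, `execute`, `after`.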

package/docs/_media/README.cn.md ADDED

@@ -0,0 +1,147 @@
+# 🕸️ @isdk/web-fetcher
+[English](./README.md) | 简体中文
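+> A powerful and flexible library for web scraping and browser automation.
+> With its dual-engine architecture (HTTP and browser) and declarative action system, it is well suited to AI agents and complex data-scraping tasks.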
+---
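+## ✨ Core Features
+* **⚙️ Dual-Engine Architecture**: Choose between **`http`** mode (powered by Cheerio; fast and suited to static sites) and **`browser`** mode (powered by Playwright; suited to dynamic sites, with full JavaScript interaction).
+* **📜 Declarative Action Scripts**: Define multi-step workflows (such as logging in, filling in forms, and clicking buttons) in a simple, readable JSON format.
+* **📊 Powerful and Flexible Data Extraction**: Use an intuitive, declarative Schema to extract structured data ranging from simple text to complex nested structures.
+* **🧠 Smart Engine Selection**: Automatically detects dynamic sites and, when needed, upgrades the engine from `http` to `browser` on the fly.
+* **🧩 Extensibility**: Easily create custom, high-level "composite actions" that encapsulate reusable business logic (for example, a `login` action).
+* **🧲 Advanced Collectors**: Collect data asynchronously in the background, triggered by events, while the main action executes.
+* **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps bypass common anti-bot measures such as Cloudflare challenges.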
+---
+## 📦 Installation
+1. **Install the package:**
+    ```bash
+    npm install @isdk/web-fetcher
+    ```
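+2. **Install browsers (for `browser` mode):**
+    The `browser` engine is powered by Playwright, which needs to download separate browser binaries. If you plan to use the `browser` engine to interact with dynamic websites, run the following command: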
+    ```bash
+    npx playwright install
+    ```
+    > ℹ️ **Note:** This step is required only if you need `browser` mode. The lightweight `http` mode works without installing any browsers.
+---
+## 🚀 Quick Start
+The following example fetches a web page and extracts its title.
+```typescript
+import { fetchWeb } from '@isdk/web-fetcher';
+async function getTitle(url: string) {
+  const { outputs } = await fetchWeb({
+    url,
+    actions: [
+      {
+        id: 'extract',
+        params: {
+          // Extract the text content of the <title> tag
+          selector: 'title',
+        },
+        // Store the result under the 'pageTitle' key of the `outputs` object
+        storeAs: 'pageTitle',
+      },
+    ],
+  });
+  console.log('Page title:', outputs.pageTitle);
+}
+getTitle('https://www.google.com');
+```
+---
+## 🤖 Advanced Usage: Multi-Step Form Submission
+This example demonstrates how to use the `browser` engine to perform a search on Google.
+```typescript
+import { fetchWeb } from '@isdk/web-fetcher';
+async function searchGoogle(query: string) {
+  // Search Google for the given query
+  const { result, outputs } = await fetchWeb({
+    url: 'https://www.google.com',
+    engine: 'browser', // Use the full browser engine for interaction
+    actions: [
+      // The initial navigation to google.com is handled by the `url` option
+      { id: 'fill', params: { selector: 'textarea[name=q]', value: query } },
+      { id: 'submit', params: { selector: 'form' } },
+      { id: 'waitFor', params: { selector: '#search' } }, // Wait for the search results container to appear
+      { id: 'getContent', storeAs: 'searchResultsPage' },
+    ]
+  });
+  console.log('Search results URL:', result?.finalUrl);
+  console.log('Outputs contain the full page content:', outputs.searchResultsPage.html.substring(0, 100));
+}
+searchGoogle('gemini');
+```
+---
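+## 🏗️ Architecture
+The library is built on two core concepts: **Engines** and **Actions**.
+* ### Engine Architecture
+    At the core of the library is its dual-engine design, which abstracts the complexity of web interaction behind a unified API. For details on the `http` (Cheerio) and `browser` (Playwright) engines, how they manage state, and how to extend them, see the [**Fetch Engine Architecture**](./README.engine.cn.md) document.
+* ### Action Architecture
+    All workflows are defined as a sequence of "actions". The library provides a set of built-in atomic actions and a powerful composition model for creating your own semantic actions. For an in-depth look at creating and using actions, see the [**Action Script Architecture**](./README.action.cn.md) document.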
+---
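+## 📚 API Reference
+### `fetchWeb(options)` or `fetchWeb(url, options)`
+This is the main entry point of the library.
+**Key `FetcherOptions`**:
+* `url` (string): The initial URL to navigate to.
+* `engine` ('http' | 'browser' | 'auto'): The engine to use. Defaults to `auto`.
+* `actions` (FetchActionOptions[]): An array of action objects to execute.
+* `headers` (Record<string, string>): Headers to use for all requests.
+* ...and many other options for proxies, cookies, retries, and more.
+### Built-in Actions
+The core built-in actions are:
+* `goto`: Navigates to a new URL.
+* `click`: Clicks an element specified by a selector.
+* `fill`: Fills an input field with a specified value.
+* `submit`: Submits a form.
+* `waitFor`: Pauses execution to wait for a specific condition (for example, a timeout, a selector appearing, or network idle).
+* `getContent`: Retrieves the full content of the current page state (HTML, text, etc.).
+* `extract`: Extracts arbitrary structured data from the page using an expressive, declarative Schema.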
+---
+## 📜 License
+[MIT](./LICENSE-MIT)