@isdk/web-fetcher 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.action.cn.md +469 -0
- package/README.action.md +452 -0
- package/README.cn.md +147 -0
- package/README.engine.cn.md +262 -0
- package/README.engine.md +262 -0
- package/README.md +147 -0
- package/dist/index.d.mts +1603 -0
- package/dist/index.d.ts +1603 -0
- package/dist/index.js +1 -0
- package/dist/index.mjs +1 -0
- package/docs/README.md +151 -0
- package/docs/_media/LICENSE-MIT +22 -0
- package/docs/_media/README.action.md +452 -0
- package/docs/_media/README.cn.md +147 -0
- package/docs/_media/README.engine.md +262 -0
- package/docs/classes/CheerioFetchEngine.md +1447 -0
- package/docs/classes/ClickAction.md +533 -0
- package/docs/classes/ExtractAction.md +533 -0
- package/docs/classes/FetchAction.md +444 -0
- package/docs/classes/FetchEngine.md +1230 -0
- package/docs/classes/FetchSession.md +111 -0
- package/docs/classes/FillAction.md +533 -0
- package/docs/classes/GetContentAction.md +533 -0
- package/docs/classes/GotoAction.md +537 -0
- package/docs/classes/PauseAction.md +533 -0
- package/docs/classes/PlaywrightFetchEngine.md +1437 -0
- package/docs/classes/SubmitAction.md +533 -0
- package/docs/classes/WaitForAction.md +533 -0
- package/docs/classes/WebFetcher.md +85 -0
- package/docs/enumerations/FetchActionResultStatus.md +40 -0
- package/docs/functions/fetchWeb.md +43 -0
- package/docs/globals.md +72 -0
- package/docs/interfaces/BaseFetchActionProperties.md +83 -0
- package/docs/interfaces/BaseFetchCollectorActionProperties.md +145 -0
- package/docs/interfaces/BaseFetcherProperties.md +206 -0
- package/docs/interfaces/Cookie.md +142 -0
- package/docs/interfaces/DispatchedEngineAction.md +60 -0
- package/docs/interfaces/ExtractActionProperties.md +113 -0
- package/docs/interfaces/FetchActionInContext.md +149 -0
- package/docs/interfaces/FetchActionProperties.md +125 -0
- package/docs/interfaces/FetchActionResult.md +55 -0
- package/docs/interfaces/FetchContext.md +424 -0
- package/docs/interfaces/FetchEngineContext.md +328 -0
- package/docs/interfaces/FetchMetadata.md +73 -0
- package/docs/interfaces/FetchResponse.md +105 -0
- package/docs/interfaces/FetchReturnTypeRegistry.md +57 -0
- package/docs/interfaces/FetchSite.md +320 -0
- package/docs/interfaces/FetcherOptions.md +300 -0
- package/docs/interfaces/GotoActionOptions.md +66 -0
- package/docs/interfaces/PendingEngineRequest.md +51 -0
- package/docs/interfaces/SubmitActionOptions.md +23 -0
- package/docs/interfaces/WaitForActionOptions.md +39 -0
- package/docs/type-aliases/BaseFetchActionOptions.md +11 -0
- package/docs/type-aliases/BaseFetchCollectorOptions.md +11 -0
- package/docs/type-aliases/BrowserEngine.md +11 -0
- package/docs/type-aliases/FetchActionCapabilities.md +11 -0
- package/docs/type-aliases/FetchActionCapabilityMode.md +11 -0
- package/docs/type-aliases/FetchActionOptions.md +11 -0
- package/docs/type-aliases/FetchEngineAction.md +18 -0
- package/docs/type-aliases/FetchEngineType.md +11 -0
- package/docs/type-aliases/FetchReturnType.md +11 -0
- package/docs/type-aliases/FetchReturnTypeFor.md +17 -0
- package/docs/type-aliases/OnFetchPauseCallback.md +23 -0
- package/docs/type-aliases/ResourceType.md +11 -0
- package/docs/variables/DefaultFetcherProperties.md +11 -0
- package/package.json +90 -0
package/README.engine.cn.md
ADDED

@@ -0,0 +1,262 @@

# ⚙️ Fetch Engine Architecture

[English](./README.engine.md) | Simplified Chinese

> This document provides a comprehensive overview of the Fetch Engine architecture, which is designed to abstract web content fetching and interaction. It is intended for developers who need to understand, maintain, or extend the fetching capabilities.

---

## 🎯 1. Overview

The `engine` directory contains the core logic of the web fetcher. Its primary responsibility is to provide a unified interface for interacting with web pages, regardless of the underlying technology (e.g., simple HTTP requests or a full-fledged browser). This is achieved through an abstract `FetchEngine` class and concrete implementations that leverage different fetching technologies.

> ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library and uses its powerful crawler abstractions.

---

## 🧩 2. Core Concepts

### `FetchEngine` (base.ts)

This is the abstract base class that defines the contract for all fetch engines.

* **Role**: To provide a consistent, high-level API for operations such as navigation, content retrieval, and user interaction.
* **Key Abstractions**:
  * **Lifecycle**: `initialize()` and `cleanup()` methods.
  * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
  * **Configuration**: `headers()`, `cookies()`, `blockResources()`.
* **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing an engine to be selected dynamically by `id` or `mode`.

### `FetchEngine.create(context, options)`

This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.

---

## 🏗️ 3. Architecture and Workflow

The engine architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.

### The Core Problem: Ephemeral Page Context

In Crawlee, the page context (the `page` object in Playwright or the `$` function in Cheerio) is **only available within the synchronous scope of the `requestHandler` function**. Once the handler finishes or `await`s, the context is lost. This makes it impossible to directly implement a sequence like `await engine.goto(); await engine.click();`.

### The Solution: Bridging Scopes with an Action Loop

Our engine solves this by creating a bridge between the external API calls and the internal `requestHandler` scope. This is the most critical part of the design.

**The workflow is as follows:**

1. **Initialization**: A consumer calls `FetchEngine.create()`, which initializes a Crawlee crawler (e.g., `PlaywrightCrawler`) running in the background.
2. **Navigation (`goto`)**: The consumer calls `await engine.goto(url)`. This method adds the URL to Crawlee's `RequestQueue` and returns a `Promise` that resolves once the page has loaded.
3. **Crawlee Processing**: The background crawler picks up the request and invokes the engine's `requestHandler`, passing it the crucial page context.
4. **Page Activation & Action Loop**: Inside the `requestHandler`:
   * The page context is used to resolve the `Promise` returned by the `goto()` call.
   * The page is marked as "active" (`isPageActive = true`).
   * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). By listening for events on an `EventEmitter`, this loop effectively **pauses the `requestHandler`**, keeping the page context alive.
5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
6. **Action Execution**: The action loop, still running within the original `requestHandler` scope, picks up the event. Because it has access to the page context, it can perform the *actual* interaction (e.g., `page.click(...)`).
7. **Cleanup**: The loop continues until a `dispose` action is dispatched (e.g., triggered by a new `goto()` call), which terminates the loop and allows the `requestHandler` to finally complete.

---

## 🛠️ 4. Implementations

There are two primary engine implementations:

### `CheerioFetchEngine` (http mode)

* **ID**: `cheerio`
* **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP requests and parse static HTML.
* **Behavior**:
  * ✅ **Fast and Lightweight**: Ideal for scenarios where speed and low resource consumption matter.
  * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by issuing new HTTP requests.
* **Use Case**: Scraping static websites, server-rendered pages, or APIs.

### `PlaywrightFetchEngine` (browser mode)

* **ID**: `playwright`
* **Mechanism**: Uses `PlaywrightCrawler` to control a real headless browser.
* **Behavior**:
  * ✅ **Full Browser Environment**: Executes JavaScript and natively handles cookies, sessions, and complex user interactions.
  * ✅ **Robust Interaction**: Actions accurately mimic real user behavior.
  * ⚠️ **Resource Intensive**: Slower than the Cheerio engine and requires more memory/CPU.
* **Use Case**: Interacting with modern, dynamic web applications (SPAs) or any site that relies heavily on JavaScript.

#### Anti-Bot Evasion (`antibot` option)

To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an `antibot` mode. When enabled, it integrates [Camoufox](https://github.com/prescience-data/camoufox) to enhance its evasion capabilities.

* **Mechanism**:
  * Uses a hardened Firefox browser via `camoufox-js`.
  * Disables the default fingerprint spoofing and lets Camoufox manage the browser fingerprint.
  * Automatically attempts to solve Cloudflare challenges encountered during navigation.
* **How to enable**: Set the `antibot: true` option when creating the fetcher properties.
* **Use Case**: Scraping websites protected by Cloudflare or other advanced bot-detection systems.
* **Note**: This feature requires additional dependencies (`camoufox-js`, `firefox`) and may introduce performance overhead.

---

## 📊 5. Data Extraction with `extract()`

The `extract()` method provides a powerful, declarative way to extract structured data from web pages. You define the desired JSON output structure through a **Schema**, and the engine automatically handles DOM traversal and data extraction.

### Core Design: Schema Normalization

To improve usability and flexibility, the `extract` method implements an internal **"normalization"** layer. This means you can provide semantically clear shorthand forms, and the engine automatically converts them into the standard, more verbose internal format before execution. This makes writing complex extraction rules simple and intuitive.

### Schema Structure

A schema can be one of three types:

* **Value Extraction (`ExtractValueSchema`)**: Extracts a single value.
  * `selector`: (Optional) A CSS selector used to locate the element.
  * `type`: (Optional) The type of the extracted value: `'string'` (default), `'number'`, `'boolean'`, or `'html'` (extracts the inner HTML).
  * `attribute`: (Optional) If provided, extracts the value of the specified attribute of the element (e.g., `href`) instead of its text content.
* **Object Extraction (`ExtractObjectSchema`)**: Extracts a JSON object.
  * `selector`: (Optional) The root element for the object's data.
  * `properties`: Defines the sub-extraction rule for each field of the object.
* **Array Extraction (`ExtractArraySchema`)**: Extracts an array.
  * `selector`: A CSS selector matching each item element in the list.
  * `items`: (Optional) The extraction rule applied to each item element. **If omitted, it defaults to extracting each element's text content**.

### Advanced Features (via Normalization)

#### 1. Shorthand Syntax

To simplify common scenarios, the following shorthands are supported:

* **Extracting an Array of Attributes**: You can use the `attribute` property directly on an `array`-type schema as a shorthand for `items: { attribute: '...' }`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "attribute": "href"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "items": { "attribute": "href" }
  }
  ```

* **Extracting an Array of Texts**: Simply provide the selector and omit `items`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".tags li"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".tags li",
    "items": { "type": "string" }
  }
  ```

#### 2. Precise Filtering: `has` and `exclude`

You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.

* `has`: A CSS selector ensuring that the selected element **must contain** a descendant matching this selector (similar to `:has()`).
* `exclude`: A CSS selector that **excludes** elements matching it from the results (similar to `:not()`).

### Complete Example

Suppose we have the following HTML, and we want to extract the titles and links of all "important" articles while excluding "archived" ones.

```html
<div id="articles">
  <div class="post important">
    <a href="/post/1"><h3>Post 1</h3></a>
  </div>
  <div class="post">
    <!-- This one is NOT important, lacks h3 -->
    <a href="/post/2">Post 2</a>
  </div>
  <div class="post important archived">
    <!-- This one is important but archived -->
    <a href="/post/3"><h3>Archived Post 3</h3></a>
  </div>
</div>
```

We can use the following schema:

```typescript
const schema = {
  type: 'array',
  selector: '.post',    // 1. Select all posts
  has: 'h3',            // 2. Only keep posts that contain an <h3> (important)
  exclude: '.archived', // 3. Exclude posts with the .archived class
  items: {
    type: 'object',
    properties: {
      title: { selector: 'h3' },
      link: { selector: 'a', attribute: 'href' },
    },
  },
};

const data = await engine.extract(schema);

/*
Expected output:
[
  {
    "title": "Post 1",
    "link": "/post/1"
  }
]
*/
```

---

## 🧑‍💻 6. How to Extend the Engine

Adding a new fetch engine is straightforward:

1. **Create the Class**: Define a new class that extends the generic `FetchEngine`, providing the concrete `Context`, `Crawler`, and `Options` types.

   ```typescript
   import { PlaywrightCrawler, PlaywrightCrawlerOptions, PlaywrightCrawlingContext } from 'crawlee';

   class MyPlaywrightEngine extends FetchEngine<
     PlaywrightCrawlingContext,
     PlaywrightCrawler,
     PlaywrightCrawlerOptions
   > {
     // ...
   }
   ```

2. **Define Static Properties**: Set a unique `id` and `mode`.
3. **Implement Abstract Methods**: Provide concrete implementations for the base class's abstract methods:
   * `_getSpecificCrawlerOptions()`: Return an object with crawler-specific options (e.g., `headless` mode, `preNavigationHooks`).
   * `_createCrawler()`: Return a new crawler instance (e.g., `new PlaywrightCrawler(options)`).
   * `buildResponse()`: Convert the crawling context into a standard `FetchResponse`.
   * `executeAction()`: Handle the engine-specific implementation of actions such as `click`, `fill`, etc.
4. **Register the Engine**: Call `FetchEngine.register(MyNewEngine)`.

---

## ✅ 7. Testing

The `engine.spec.ts` file contains a comprehensive test suite. The `engineTestSuite` function defines a standard set of tests that is run against both `CheerioFetchEngine` and `PlaywrightFetchEngine` to ensure they are functionally equivalent and conform to the `FetchEngine` API contract.
package/README.engine.md
ADDED
@@ -0,0 +1,262 @@

# ⚙️ Fetch Engine Architecture

English | [简体中文](./README.engine.cn.md)

> This document provides a comprehensive overview of the Fetch Engine architecture, designed to abstract web content fetching and interaction. It's intended for developers who need to understand, maintain, or extend the fetching capabilities.

---

## 🎯 1. Overview

The `engine` directory contains the core logic for the web fetcher. Its primary responsibility is to provide a unified interface for interacting with web pages, regardless of the underlying technology (e.g., simple HTTP requests or a full-fledged browser). This is achieved through an abstract `FetchEngine` class and concrete implementations that leverage different crawling technologies.

> ℹ️ The system is built on top of the [Crawlee](https://crawlee.dev/) library, using its powerful crawler abstractions.

---

## 🧩 2. Core Concepts

### `FetchEngine` (base.ts)

This is the abstract base class that defines the contract for all fetch engines.

* **Role**: To provide a consistent, high-level API for actions like navigation, content retrieval, and user interaction.
* **Key Abstractions**:
  * **Lifecycle**: `initialize()` and `cleanup()` methods.
  * **Core Actions**: `goto()`, `getContent()`, `click()`, `fill()`, `submit()`, `waitFor()`, `extract()`.
  * **Configuration**: `headers()`, `cookies()`, `blockResources()`.
* **Static Registry**: It maintains a static registry of all available engine implementations (`FetchEngine.register`), allowing for dynamic selection by `id` or `mode`.

### `FetchEngine.create(context, options)`

This static factory method is the designated entry point for creating an engine instance. It automatically selects and initializes the appropriate engine.
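As a rough usage sketch, creation, use, and teardown might look like the following; the option shape (`mode`) and the availability of a prepared `FetchContext` are assumptions for illustration, not details confirmed by this diff:

```typescript
import { FetchEngine, type FetchContext } from '@isdk/web-fetcher';

declare const context: FetchContext; // assumed to be prepared elsewhere

// Hypothetical sketch: the `mode` option mirrors the registry's
// selection modes ('http' / 'browser') described above.
const engine = await FetchEngine.create(context, { mode: 'browser' });
try {
  await engine.goto('https://example.com');
  const content = await engine.getContent();
} finally {
  await engine.cleanup();
}
```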
---

## 🏗️ 3. Architecture and Workflow

The engine's architecture is designed to solve a key challenge: providing a simple, **stateful-like API** (`goto()`, then `click()`, then `fill()`) on top of Crawlee's fundamentally **stateless, asynchronous** request handling.

### The Core Problem: Ephemeral Page Context

In Crawlee, the page context (the `page` object in Playwright or the `$` function in Cheerio) is **only available within the synchronous scope of the `requestHandler` function**. Once the handler finishes or `await`s, the context is lost. This makes it impossible to directly implement a sequence like `await engine.goto(); await engine.click();`.

### The Solution: Bridging Scopes with an Action Loop

Our engine solves this by creating a bridge between the external API calls and the internal `requestHandler` scope. This is the most critical part of the design.

**The workflow is as follows:**

1. **Initialization**: A consumer calls `FetchEngine.create()`, which initializes a Crawlee crawler (e.g., `PlaywrightCrawler`) that runs in the background.
2. **Navigation (`goto`)**: The consumer calls `await engine.goto(url)`. This adds the URL to Crawlee's `RequestQueue` and returns a `Promise` that will resolve when the page is loaded.
3. **Crawlee Processing**: The background crawler picks up the request and invokes the engine's `requestHandler`, passing it the crucial page context.
4. **Page Activation & Action Loop**: Inside the `requestHandler`:
   * The page context is used to resolve the `Promise` from the `goto()` call.
   * The page is marked as "active" (`isPageActive = true`).
   * Crucially, before the `requestHandler` returns, it starts an **action loop** (`_executePendingActions`). This loop effectively **pauses the `requestHandler`** by listening for events on an `EventEmitter`, keeping the page context alive.
5. **Interactive Actions (`click`, `fill`, etc.)**: The consumer can now call `await engine.click(...)`. This dispatches an action to the `EventEmitter` and returns a new `Promise`.
6. **Action Execution**: The action loop, still running within the original `requestHandler`'s scope, picks up the event. Because it has access to the page context, it can perform the *actual* interaction (e.g., `page.click(...)`).
7. **Cleanup**: The loop continues until a `dispose` action is dispatched (e.g., by a new `goto()` call), which terminates the loop and allows the `requestHandler` to finally complete. A minimal sketch of this dispatch/loop pattern follows.
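To make the bridge concrete, here is a minimal, self-contained sketch of the dispatch/loop pattern described above. The names (`ActionBridge`, `PendingAction`) are illustrative only, not the library's actual internals, which live behind `_executePendingActions`:

```typescript
import { EventEmitter } from 'node:events';

// Illustrative sketch of the scope-bridging pattern; not the real internals.
interface PendingAction {
  type: 'click' | 'fill' | 'dispose';
  payload?: unknown;
  resolve: (result: unknown) => void;
  reject: (err: Error) => void;
}

class ActionBridge {
  private emitter = new EventEmitter();

  // Called from the public API (e.g., engine.click()): dispatch an action
  // and hand back a Promise that the action loop will settle.
  dispatch(type: PendingAction['type'], payload?: unknown): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.emitter.emit('action', { type, payload, resolve, reject });
    });
  }

  // Runs inside the requestHandler: by awaiting here, the handler never
  // returns, so the page context stays alive between external calls.
  // (A real implementation would also queue actions dispatched while a
  // previous action is still executing.)
  async run(execute: (a: PendingAction) => Promise<unknown>): Promise<void> {
    for (;;) {
      const action = await new Promise<PendingAction>((res) =>
        this.emitter.once('action', res),
      );
      if (action.type === 'dispose') {
        action.resolve(undefined);
        return; // lets the requestHandler finally complete
      }
      try {
        action.resolve(await execute(action)); // `execute` closes over `page`
      } catch (err) {
        action.reject(err as Error);
      }
    }
  }
}
```

In this pattern, a new `goto()` call would first dispatch `dispose` to end the previous page's loop (step 7) before enqueuing the next request.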
---

## 🛠️ 4. Implementations

There are two primary engine implementations:

### `CheerioFetchEngine` (http mode)

* **ID**: `cheerio`
* **Mechanism**: Uses `CheerioCrawler` to fetch pages via raw HTTP and parse static HTML.
* **Behavior**:
  * ✅ **Fast and Lightweight**: Ideal for speed and low resource consumption.
  * ❌ **No JavaScript Execution**: Cannot interact with client-side rendered content.
  * ⚙️ **Simulated Interaction**: Actions like `click` and `submit` are simulated by making new HTTP requests.
* **Use Case**: Scraping static websites, server-rendered pages, or APIs.

### `PlaywrightFetchEngine` (browser mode)

* **ID**: `playwright`
* **Mechanism**: Uses `PlaywrightCrawler` to control a real headless browser.
* **Behavior**:
  * ✅ **Full Browser Environment**: Executes JavaScript, handles cookies, sessions, and complex user interactions natively.
  * ✅ **Robust Interaction**: Actions accurately mimic real user behavior.
  * ⚠️ **Resource Intensive**: Slower than the Cheerio engine and requires more memory/CPU.
* **Use Case**: Interacting with modern, dynamic web applications (SPAs) or any site that relies heavily on JavaScript.

#### Anti-Bot Evasion (`antibot` option)

To combat sophisticated anti-bot measures, the `PlaywrightFetchEngine` offers an `antibot` mode. When enabled, it integrates [Camoufox](https://github.com/prescience-data/camoufox) to enhance its stealth capabilities.

* **Mechanism**:
  * Uses a hardened Firefox browser via `camoufox-js`.
  * Disables the default fingerprint spoofing so that Camoufox manages the browser's fingerprint.
  * Automatically attempts to solve Cloudflare challenges encountered during navigation.
* **How to enable**: Set the `antibot: true` option when creating the fetcher properties, as sketched below.
* **Use Case**: Scraping websites protected by services like Cloudflare or other advanced bot-detection systems.
* **Note**: This feature requires additional dependencies (`camoufox-js`, `firefox`) and may have a performance overhead.
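As a hedged illustration, enabling the mode might look like the following; the exact placement of `antibot` among the fetcher properties is inferred from the description above, not confirmed by this diff:

```typescript
import { fetchWeb } from '@isdk/web-fetcher';

// Hypothetical sketch: `antibot` set alongside the other fetcher options.
async function fetchProtected(url: string) {
  const { result } = await fetchWeb({
    url,
    engine: 'browser', // antibot is a PlaywrightFetchEngine (browser) feature
    antibot: true,
  });
  return result;
}
```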
---

## 📊 5. Data Extraction with `extract()`

The `extract()` method provides a powerful, declarative way to pull structured data from a web page. It uses a **Schema** to define the structure of your desired JSON output, and the engine automatically handles DOM traversal and data extraction.

### Core Design: Schema Normalization

To enhance usability and flexibility, the `extract` method internally implements a **"normalization"** layer. This means you can provide semantically clear shorthands, and the engine will automatically convert them into a standard, more verbose internal format before execution. This makes writing complex extraction rules simple and intuitive.

### Schema Structure

A schema can be one of three types (a small combined example follows the list):

* **Value Extraction (`ExtractValueSchema`)**: Extracts a single value.
  * `selector`: (Optional) A CSS selector to locate the element.
  * `type`: (Optional) The type of the extracted value: `'string'` (default), `'number'`, `'boolean'`, or `'html'` (extracts the inner HTML).
  * `attribute`: (Optional) If provided, extracts the value of the specified attribute (e.g., `href`) instead of the element's text content.
* **Object Extraction (`ExtractObjectSchema`)**: Extracts a JSON object.
  * `selector`: (Optional) The root element for the object's data.
  * `properties`: Defines sub-extraction rules for each field of the object.
* **Array Extraction (`ExtractArraySchema`)**: Extracts an array.
  * `selector`: A CSS selector matching each item element in the list.
  * `items`: (Optional) The extraction rule to apply to each item element. **If omitted, it defaults to extracting each element's text content**.
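The three kinds compose naturally. For instance, the following schema (illustrative selectors against assumed markup, not taken from this package) extracts an object whose fields mix text, attribute, and numeric extraction:

```json
{
  "type": "object",
  "selector": "article",
  "properties": {
    "title": { "selector": "h2" },
    "url": { "selector": "a", "attribute": "href" },
    "likes": { "selector": ".likes", "type": "number" }
  }
}
```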
### Advanced Features (via Normalization)

#### 1. Shorthand Syntax

To simplify common scenarios, the following shorthands are supported:

* **Extracting an Array of Attributes**: You can use the `attribute` property directly on an `array`-type schema as a shorthand for `items: { attribute: '...' }`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "attribute": "href"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".post a",
    "items": { "attribute": "href" }
  }
  ```

* **Extracting an Array of Texts**: Simply provide the selector and omit `items`.

  **Shorthand:**

  ```json
  {
    "type": "array",
    "selector": ".tags li"
  }
  ```

  **Equivalent to:**

  ```json
  {
    "type": "array",
    "selector": ".tags li",
    "items": { "type": "string" }
  }
  ```

#### 2. Precise Filtering: `has` and `exclude`

You can use the `has` and `exclude` fields in any schema that includes a `selector` to precisely control element selection.

* `has`: A CSS selector ensuring the selected element **must contain** a descendant matching this selector (similar to `:has()`).
* `exclude`: A CSS selector that **excludes** elements matching it from the results (similar to `:not()`).

### Complete Example

Suppose we have the following HTML, and we want to extract the titles and links of all "important" articles while excluding "archived" ones.

```html
<div id="articles">
  <div class="post important">
    <a href="/post/1"><h3>Post 1</h3></a>
  </div>
  <div class="post">
    <!-- This one is NOT important, lacks h3 -->
    <a href="/post/2">Post 2</a>
  </div>
  <div class="post important archived">
    <!-- This one is important but archived -->
    <a href="/post/3"><h3>Archived Post 3</h3></a>
  </div>
</div>
```

We can use the following schema:

```typescript
const schema = {
  type: 'array',
  selector: '.post',    // 1. Select all posts
  has: 'h3',            // 2. Only keep posts that contain an <h3> (important)
  exclude: '.archived', // 3. Exclude posts with the .archived class
  items: {
    type: 'object',
    properties: {
      title: { selector: 'h3' },
      link: { selector: 'a', attribute: 'href' },
    },
  },
};

const data = await engine.extract(schema);

/*
Expected output:
[
  {
    "title": "Post 1",
    "link": "/post/1"
  }
]
*/
```

---

## 🧑‍💻 6. How to Extend the Engine

Adding a new fetch engine is straightforward:

1. **Create the Class**: Define a new class that extends the generic `FetchEngine`, providing the specific `Context`, `Crawler`, and `Options` types from Crawlee.

   ```typescript
   import { PlaywrightCrawler, PlaywrightCrawlerOptions, PlaywrightCrawlingContext } from 'crawlee';

   class MyPlaywrightEngine extends FetchEngine<
     PlaywrightCrawlingContext,
     PlaywrightCrawler,
     PlaywrightCrawlerOptions
   > {
     // ...
   }
   ```

2. **Define Static Properties**: Set the unique `id` and `mode`.
3. **Implement Abstract Methods**: Provide concrete implementations for the abstract methods from the base class:
   * `_getSpecificCrawlerOptions()`: Return an object with crawler-specific options (e.g., `headless` mode, `preNavigationHooks`).
   * `_createCrawler()`: Return a new instance of your crawler (e.g., `new PlaywrightCrawler(options)`).
   * `buildResponse()`: Convert the crawling context to a standard `FetchResponse`.
   * `executeAction()`: Handle engine-specific implementations for actions like `click`, `fill`, etc.
4. **Register the Engine**: Call `FetchEngine.register(MyNewEngine)`. A consolidated sketch of steps 2-4 follows.
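A hedged end-to-end sketch of steps 2-4 (the method signatures and return types are assumptions for illustration; consult the base class for the real ones):

```typescript
import { PlaywrightCrawler, PlaywrightCrawlerOptions, PlaywrightCrawlingContext } from 'crawlee';
import { FetchEngine } from '@isdk/web-fetcher';

// Sketch only: signatures below are assumed, not the confirmed API.
class MyPlaywrightEngine extends FetchEngine<
  PlaywrightCrawlingContext,
  PlaywrightCrawler,
  PlaywrightCrawlerOptions
> {
  static id = 'my-playwright'; // step 2: unique id
  static mode = 'browser';     // step 2: selection mode

  protected _getSpecificCrawlerOptions(): Partial<PlaywrightCrawlerOptions> {
    return { headless: true }; // step 3: crawler-specific options
  }

  protected _createCrawler(options: PlaywrightCrawlerOptions) {
    return new PlaywrightCrawler(options); // step 3: concrete crawler
  }

  // step 3 (continued): buildResponse() and executeAction() would map the
  // crawling context to a FetchResponse and implement click/fill/etc.
}

FetchEngine.register(MyPlaywrightEngine); // step 4
```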
---

## ✅ 7. Testing

The file `engine.spec.ts` contains a comprehensive test suite. The `engineTestSuite` function defines a standard set of tests that is run against both `CheerioFetchEngine` and `PlaywrightFetchEngine` to ensure they are functionally equivalent and conform to the `FetchEngine` API contract.
package/README.md
ADDED
@@ -0,0 +1,147 @@

# 🕸️ @isdk/web-fetcher

English | [简体中文](./README.cn.md)

> A powerful and flexible web fetching and browser automation library.
> It features a dual-engine architecture (HTTP and Browser) and a declarative action system, making it well suited to AI agents and complex data-scraping tasks.

---

## ✨ Core Features

* **⚙️ Dual-Engine Architecture**: Choose between **`http`** mode (powered by Cheerio) for speed on static sites, or **`browser`** mode (powered by Playwright) for full JavaScript execution on dynamic sites.
* **📜 Declarative Action Scripts**: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format.
* **📊 Powerful and Flexible Data Extraction**: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive, declarative schema.
* **🧠 Smart Engine Selection**: Automatically detects dynamic sites and can upgrade the engine from `http` to `browser` on the fly.
* **🧩 Extensible**: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a `login` action).
* **🧲 Advanced Collectors**: Asynchronously collect data in the background, triggered by events during the execution of a main action.
* **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps bypass common anti-bot measures like Cloudflare challenges.

---

## 📦 Installation

1. **Install the Package:**

   ```bash
   npm install @isdk/web-fetcher
   ```

2. **Install Browsers (for `browser` mode):**

   The `browser` engine is powered by Playwright, which requires separate browser binaries to be downloaded. If you plan to use the `browser` engine to interact with dynamic websites, run the following command:

   ```bash
   npx playwright install
   ```

   > ℹ️ **Note:** This step is only required for `browser` mode. The lightweight `http` mode works out of the box without this installation.

---

## 🚀 Quick Start

The following example fetches a web page and extracts its title.

```typescript
import { fetchWeb } from '@isdk/web-fetcher';

async function getTitle(url: string) {
  const { outputs } = await fetchWeb({
    url,
    actions: [
      {
        id: 'extract',
        params: {
          // Extracts the text content of the <title> tag
          selector: 'title',
        },
        // Stores the result in the `outputs` object under the key 'pageTitle'
        storeAs: 'pageTitle',
      },
    ],
  });

  console.log('Page Title:', outputs.pageTitle);
}

getTitle('https://www.google.com');
```

---

## 🤖 Advanced Usage: Multi-Step Form Submission

This example demonstrates how to use the `browser` engine to perform a search on Google.

```typescript
import { fetchWeb } from '@isdk/web-fetcher';

async function searchGoogle(query: string) {
  // Search for the query on Google
  const { result, outputs } = await fetchWeb({
    url: 'https://www.google.com',
    engine: 'browser', // Use the full browser engine for interaction
    actions: [
      // The initial navigation to google.com is handled by the `url` option
      { id: 'fill', params: { selector: 'textarea[name=q]', value: query } },
      { id: 'submit', params: { selector: 'form' } },
      { id: 'waitFor', params: { selector: '#search' } }, // Wait for the search results container to appear
      { id: 'getContent', storeAs: 'searchResultsPage' },
    ],
  });

  console.log('Search Results URL:', result?.finalUrl);
  console.log('Outputs contains the full page content:', outputs.searchResultsPage.html.substring(0, 100));
}

searchGoogle('gemini');
```

---

## 🏗️ Architecture

This library is built on two core concepts: **Engines** and **Actions**.

* ### Engine Architecture

  The library's core is its dual-engine design. It abstracts away the complexities of web interaction behind a unified API. For detailed information on the `http` (Cheerio) and `browser` (Playwright) engines, how they manage state, and how to extend them, please see the [**Fetch Engine Architecture**](./README.engine.md) document.

* ### Action Architecture

  All workflows are defined as a series of "Actions". The library provides a set of built-in atomic actions and a powerful composition model for creating your own semantic actions. For a deep dive into creating and using actions, see the [**Action Script Architecture**](./README.action.md) document.

---

## 📚 API Reference

### `fetchWeb(options)` or `fetchWeb(url, options)`

This is the main entry point for the library. Both call forms are shown in the sketch after the option list.

**Key `FetcherOptions`**:

* `url` (string): The initial URL to navigate to.
* `engine` (`'http' | 'browser' | 'auto'`): The engine to use. Defaults to `'auto'`.
* `actions` (`FetchActionOptions[]`): An array of action objects to execute.
* `headers` (`Record<string, string>`): Headers to use for all requests.
* ...and many other options for proxy, cookies, retries, etc.
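Both call forms accept the same options; a minimal sketch (the URLs are placeholders):

```typescript
import { fetchWeb } from '@isdk/web-fetcher';

async function demo() {
  // Options-only form: the URL travels inside FetcherOptions.
  const a = await fetchWeb({ url: 'https://example.com', engine: 'http' });

  // URL-first form: equivalent, with the URL as the first argument.
  const b = await fetchWeb('https://example.com', { engine: 'http' });

  return [a, b];
}
```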
### Built-in Actions

Here are the essential built-in actions:

* `goto`: Navigates to a new URL.
* `click`: Clicks an element specified by a selector.
* `fill`: Fills an input field with a specified value.
* `submit`: Submits a form.
* `waitFor`: Pauses execution to wait for a specific condition (e.g., a timeout, a selector to appear, or the network to become idle).
* `getContent`: Retrieves the full content (HTML, text, etc.) of the current page state.
* `extract`: Extracts structured data from the page using an expressive, declarative schema.

---

## 📜 License

[MIT](./LICENSE-MIT)