@isdk/web-searcher 0.1.3 → 0.1.5
- package/README.cn.md +168 -8
- package/README.md +168 -8
- package/dist/index.d.mts +221 -12
- package/dist/index.d.ts +221 -12
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +168 -8
- package/docs/classes/GoogleSearcher.md +171 -44
- package/docs/classes/WebSearcher.md +158 -45
- package/docs/functions/extractDate.md +42 -0
- package/docs/functions/extractMetadataFrom.md +40 -0
- package/docs/functions/fetchHeaders.md +34 -0
- package/docs/functions/fetchPartial.md +41 -0
- package/docs/functions/normalizeDate.md +29 -0
- package/docs/functions/parseHeaders.md +28 -0
- package/docs/functions/parseHtml.md +31 -0
- package/docs/functions/testUrlsByLatency.md +38 -0
- package/docs/globals.md +18 -0
- package/docs/interfaces/CustomTimeRange.md +3 -3
- package/docs/interfaces/ExtractOptions.md +54 -0
- package/docs/interfaces/FetchExtractorOptions.md +35 -0
- package/docs/interfaces/FetcherOptions.md +424 -0
- package/docs/interfaces/HtmlData.md +53 -0
- package/docs/interfaces/MetadataResult.md +27 -0
- package/docs/interfaces/PaginationConfig.md +9 -9
- package/docs/interfaces/SearchContext.md +30 -4
- package/docs/interfaces/SearchOptions.md +77 -11
- package/docs/interfaces/StandardSearchResult.md +10 -10
- package/docs/interfaces/VerifiedUrl.md +25 -0
- package/docs/type-aliases/MetadataType.md +13 -0
- package/docs/type-aliases/SafeSearchLevel.md +1 -1
- package/docs/type-aliases/SearchCategory.md +2 -2
- package/docs/type-aliases/SearchTimeRange.md +1 -1
- package/docs/type-aliases/SearchTimeRangePreset.md +2 -2
- package/docs/type-aliases/SearcherConstructor.md +2 -2
- package/package.json +3 -2
package/README.cn.md
CHANGED
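@@ -19,8 +19,9 @@ The Search module provides a class-based, high-level framework for building search engine
 
 > **⚠️ Note on `GoogleSearcher`**: The `GoogleSearcher` class used in these examples is only a **demo implementation** for teaching purposes. It is not suitable for production use.
 >
-> *
-> *
+> * **Strict anti-bot detection**: Even when `browser` mode simulates simple "human behavior" (such as waiting a few seconds before auto-filling the search box and submitting), Google still identifies it as an automated program. Simple interaction simulation is not enough to pass detection.
+> * **Scalability limits**: It lacks the advanced anti-bot handling (such as CAPTCHA solving and proxy rotation) needed to scrape Google reliably at scale.
+> * **Fragility**: Because of Google's frequent DOM changes and A/B testing, the extracted data may be **inaccurate or misaligned**.
 
 Use the static method `WebSearcher.search` for quick, throwaway tasks. It automatically creates a session, fetches the results, and cleans up.
 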
@@ -38,15 +39,65 @@ const results = await WebSearcher.search('Google', 'open source', { limit: 20 })
 console.log(results);
 ```
 
-###
+### 3. Multi-Engine Orchestration
+
+`WebSearcher.search` has a built-in **Waterfall** fallback mechanism. When you pass an array of engines, it runs them in order and automatically tops up the result count:
+
+- **Automatic completion**: If earlier engines return too few results (fewer than `limit`), it automatically queries the following engines to fill the gap.
+- **Fault-tolerant degradation**: If an engine fails (for example, it is blocked or times out), it is skipped and the next engine is tried, so that results are returned whenever possible.
+- **Automatic de-duplication**: Results are de-duplicated by `url` when merged.
+
+```typescript
+// Waterfall search: Google first; if that is not enough, use Bing; if still not enough, fill up with SearXNG
+const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open source', {
+  limit: 20,
+  fillLimit: true // true by default
+});
+```
+
+### 4. Stateful Session
 
 Because `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage across multiple requests. This is useful for searches that require login or for avoiding anti-bot measures by simulating human behavior.
 
+### 🧬 Dynamic Templates
+
+While a static `template` works for simple search engines, many sites (such as Google) completely change their HTML structure depending on the search category (for example, "Web" vs "Images" vs "News").
+
+To handle this, you can override the `getTemplate(variables, options)` method.
+
+- **`variables`**: The computed variables (from `formatOptions`, pagination, and so on).
+- **`options`**: The original `SearchOptions` provided by the user.
+
+```typescript
+export class MyAdvancedSearcher extends WebSearcher {
+  get template(): FetcherOptions {
+    // Default template (usually for web search)
+    return {
+      url: '...',
+      actions: [ { id: 'extract', params: { selector: '.web-result' } } ]
+    };
+  }
+
+  protected override getTemplate(variables: Record<string, any>, options: SearchOptions): FetcherOptions {
+    if (options.category === 'images') {
+      return {
+        url: 'https://site.com/images?q=${query}',
+        actions: [ { id: 'extract', params: { selector: '.img-item' } } ]
+      };
+    }
+    // Fall back to the default template getter
+    return super.getTemplate(variables, options);
+  }
+}
+```
+
 ### 🛡️ Core Principle: Template is Law
 
-
+The `template` (or the dynamic template returned by `getTemplate`) is the authoritative "blueprint".
 
 - **Template priority**: If the template defines a property (such as `engine: 'browser'` or specific `headers`), that value is **locked** and user options cannot override it. This keeps the scraping logic stable.
+- **Immutable actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify execution steps through `options`. This prevents external logic from breaking the scraper's execution flow.
+- **Session context**: To keep the session clean, **actions are filtered out of the session's persistent context**. They are used only while a `search()` call executes. This ensures that session-level settings (such as cookies or engine type) are preserved without being polluted by search-specific extraction rules.
 - **User flexibility**: Properties that are **not** explicitly locked in the template (such as `proxy`, `timeoutMs`, or custom variables) can be set freely in the constructor or in the `search()` method.
 
 ```typescript
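@@ -56,7 +107,17 @@ const google = new GoogleSearcher({
   proxy: 'http://my-proxy:8080',
   timeoutMs: 30000 // Valid (assuming the GoogleSearcher template does not explicitly set timeoutMs)
 });
+```
 
+### 🧠 Intelligent Navigation (Goto)
+
+`WebSearcher` automatically manages navigation to the search URL.
+
+1. **Auto-injection**: If your template does **not** contain a `goto` action, the searcher automatically inserts one at the start of the action list, pointing to the resolved `url` (with query variables injected).
+2. **Manual control**: If you explicitly add a `goto` action in the template that matches the resolved URL, the searcher detects the duplicate and **skips** the automatic injection. This gives you full control over the navigation step (for example, adding headers, a referrer, or other specific parameters).
+3. **Multi-step flows**: You can define several `goto` actions in the template (for example, visiting a login page first). The searcher still prepends navigation to the main search URL unless one of your `goto` actions matches it exactly.
+
+```typescript
 try {
   // First query
   // You can also pass runtime options to override session defaults or inject variables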
@@ -156,12 +217,17 @@ protected override get pagination() {
 
 ### Step 3: Transform and Clean Data
 
-Override `transform`
+Override `transform` to clean the data. The `context` parameter contains the current search state and any custom parameters you passed to `search()`. Because `WebSearcher` is itself a `FetchSession`, you can also use `this` to make extra requests (such as resolving redirects).
 
 ```typescript
-protected override async transform(outputs: Record<string, any
+protected override async transform(outputs: Record<string, any>, context: SearchContext) {
   const results = outputs['results'] || [];
 
+  // You can access custom parameters from context
+  if (context.myCustomFlag) {
+    // ... logic
+  }
+
   // Clean or filter the data
   return results.map(item => ({
     ...item,
@@ -198,8 +264,10 @@ protected override async transform(outputs: Record<string, any>) {
 ```typescript
 await google.search('test', {
   limit: 20,
+  myCustomFlag: true,
   // Example: filter out sponsored results (ads) and keep only PDFs
-  transform: (results) => {
+  transform: (results, context) => {
+    console.log('Searching for:', context.query);
     return results.filter(r => {
       const isAd = r.isSponsored || r.url.includes('googleadservices.com');
       return !isAd && r.url.endsWith('.pdf');
@@ -215,7 +283,7 @@ await google.search('test', {
 ```typescript
 const results = await google.search('open source', {
   limit: 20,
-  timeRange: 'month', // 'day', 'week', 'month', 'year'
+  timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
   // Or a custom range:
   // timeRange: { from: '2023-01-01', to: '2023-12-31' },
   category: 'news', // 'all', 'images', 'videos', 'news'
@@ -225,6 +293,40 @@ const results = await google.search('open source', {
 });
 ```
 
+#### Search Options Reference
+
+| Option | Type | Description |
+| :--- | :--- | :--- |
+| `limit` | `number` | The total number of results you want. The searcher paginates automatically to reach this number. |
+| `maxPages` | `number` | Maximum number of pages to fetch (pagination loop iterations). A safety threshold to prevent infinite loops. Default: `10`. |
+| `timeRange` | `string` \| `object` | Filter by time. Presets: `'all'`, `'hour'`, `'day'`, `'week'`, `'month'`, `'year'`.<br/> Or a custom range `{ from: Date\|string, to?: Date\|string }` |
+| `category` | `string` | Search category: `'all'`, `'images'`, `'videos'`, `'news'`. |
+| `region` | `string` | ISO 3166-1 alpha-2 region code (e.g., `'US'`, `'CN'`). |
+| `language` | `string` | Language code: ISO 639-1, optionally with a region subtag (e.g., `'en'`, `'zh-CN'`). |
+| `safeSearch` | `string` | Safe search level: `'off'`, `'moderate'`, `'strict'`. |
+| `transform` | `function` | Custom transform function applied at runtime. Runs after the engine's built-in transform. |
+| `baseUrls` | `string[]` \| `Record<string, string[]>` | Overrides the engine's base URLs. Either a URL array for a single engine, or a map from engine name to URL array. |
+| `fillLimit` | `boolean` | When `true` (default), later engines are tried automatically if the current engine returns fewer results than `limit`. |
+| `startPage` | `number` | Starting page index for pagination. Useful when delegating pagination across sessions. Default: `0`. |
+| `validator` | `function` | Custom callback that validates fetched results. Returning `false` triggers failover/retry. Signature: `(results, context) => boolean \| Promise<boolean>` |
+| `...custom` | `any` | Any other key is passed to the template as a custom variable (e.g., `${myVar}`). |
+
+#### Standard Search Result
+
+Each result in the returned array has the following structure:
+
+| Field | Type | Description |
+| :--- | :--- | :--- |
+| `title` | `string` | Title of the search result. |
+| `url` | `string` | Absolute URL of the result. |
+| `snippet` | `string` | Short summary or description. |
+| `image` | `string` | (Optional) URL of a thumbnail or related image. |
+| `date` | `string`\|`Date` | (Optional) Publication date. |
+| `author` | `string` | (Optional) Author or source name. |
+| `favicon` | `string` | (Optional) Favicon URL of the source site. |
+| `rank` | `number` | (Optional) Rank within the results (starting at 1). |
+| `source` | `string` | (Optional) Name of the source site (e.g., 'GitHub'). |
+
 To support these options in your own engine, override the `formatOptions` method:
 
 ```typescript
@@ -239,6 +341,36 @@ protected override formatOptions(options: SearchOptions): Record<string, any> {
 Then use these variables in your `template.url`:
 `url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
 
+### 🚀 Implementing Multi-instance Support
+
+If a search engine supports multiple mirrors or a distributed deployment, you can easily add failover to it:
+
+1. **Configure base URLs**: Accept a list of addresses in the constructor.
+2. **Validate results**: Override `validateFetchResult(outputs, context)`. If it returns `false`, the searcher automatically tries the next address in the list.
+3. **Template variable**: Use the `${baseUrl}` placeholder in the template URL.
+
+```typescript
+export class MyDistributedSearcher extends WebSearcher {
+  protected get template(): FetcherOptions {
+    return {
+      url: '${baseUrl}/search?q=${query}',
+      // ...
+    };
+  }
+
+  protected override validateFetchResult(outputs: Record<string, any>, context: SearchContext): boolean {
+    const results = outputs['results'] || [];
+    // If there are no results, trigger failover to the next node
+    return results.length > 0;
+  }
+}
+
+// Usage
+const searcher = new MyDistributedSearcher({
+  baseUrls: ['https://node1.com', 'https://node2.com']
+});
+```
+
 ### Custom Variables
 
 You can pass custom variables to `search()` and use them in the template.
@@ -251,6 +383,34 @@ await google.search('test', { category: 'news' });
 url: 'https://site.com?q=${query}&cat=${category}'
 ```
 
+## 🛡️ Resilient Search & Latency Tools
+
+This module provides a set of general-purpose utility functions for assessing node health and implementing failover.
+
+### 1. General Latency Testing Tool
+
+We provide `testUrlsByLatency`, a general latency-testing function built on `web-fetcher`, which can test the live response speed of any list of URLs and sort them by latency.
+
+```typescript
+import { testUrlsByLatency } from '@isdk/web-searcher/utils';
+
+const urls = ['https://google.com', 'https://bing.com', 'https://baidu.com'];
+const sorted = await testUrlsByLatency(urls, { timeout: 5000 });
+
+// Returns [{ url: '...', latency: 123 }, ...], sorted by latency in ascending order
+```
+
+### 2. Engine-Specific Resilient Discovery
+
+For engines such as **SearXNG** that support multiple instances and are unstable, we provide dedicated failover and automatic discovery mechanisms.
+
+- **Automatic failover**: Supports configuring multiple `baseUrls` and switches nodes automatically when a connection fails.
+- **Dynamic discovery**: Supports automatically fetching and filtering high-quality nodes from `searx.space` or GitHub.
+
+For details, see: [SearXNG resilient search documentation](./src/engines/searxng.cn.md).
+
+---
+
 ## Pagination Guide
 
 ### 1. Offset-based, e.g. Google
package/README.md
CHANGED
@@ -19,8 +19,9 @@ This module encapsulates these patterns into a reusable `WebSearcher` class.
 
 > **⚠️ Note on `GoogleSearcher`**: The `GoogleSearcher` class used in these examples is a **demo implementation** included for educational purposes. It is not intended for production use.
 >
-> *
-> *
+> * **Strict Anti-Bot Detection**: Even when `browser` mode simulates simple "human behavior" (such as waiting a few seconds before filling in the search box and submitting), Google still detects it as an automated program. Simple interaction simulation is not enough to pass detection.
+> * **Scalability Limitations**: It lacks advanced countermeasures like CAPTCHA solving, fingerprint spoofing, or high-quality proxy rotation required for reliable scraping.
+> * **Fragility**: The extracted data may be **inaccurate or misaligned** due to Google's frequent DOM changes and A/B testing.
 
 Use the static `WebSearcher.search` method for quick, disposable tasks. It automatically creates a session, fetches results, and cleans up.
 
@@ -38,15 +39,65 @@ const results = await WebSearcher.search('Google', 'open source', { limit: 20 })
 console.log(results);
 ```
 
-###
+### 3. Multi-Engine Orchestration
+
+`WebSearcher.search` has a built-in **Waterfall** fallback mechanism. When you pass an array of engine names, it runs them in order and automatically tops up the result count:
+
+- **Automatic Completion**: If the preceding engines return fewer results than the `limit`, it automatically requests subsequent engines to fill the gap.
+- **Failover & Degradation**: If an engine fails (e.g., blocked, timeout), it automatically skips it and tries the next one, ensuring results are returned whenever possible.
+- **Auto Deduplication**: It automatically de-duplicates results based on their `url` during the merging process.
+
+```typescript
+// Waterfall search: Google first, Bing as fallback, SearXNG as final backup
+const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open source', {
+  limit: 20,
+  fillLimit: true // Enabled by default
+});
+```
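Editorial sketch (not part of the package diff): the waterfall behavior described in this hunk can be pictured roughly as the loop below. It assumes each engine is modeled as a plain async function; `Engine`, `Result`, and `waterfallSearch` are names invented here, and the handling of `fillLimit: false` (stop after the first engine that answers) is an assumption, not confirmed by the README.

```typescript
// Hypothetical shapes for illustration; not the package's real types.
interface Result { title: string; url: string }
type Engine = (query: string, limit: number) => Promise<Result[]>;

async function waterfallSearch(
  engines: Engine[],
  query: string,
  limit: number,
  fillLimit = true,
): Promise<Result[]> {
  const seen = new Set<string>();
  const merged: Result[] = [];
  for (const engine of engines) {
    let batch: Result[] = [];
    try {
      // Ask only for what is still missing.
      batch = await engine(query, limit - merged.length);
    } catch {
      continue; // failover: skip the failing engine and try the next one
    }
    for (const item of batch) {
      if (merged.length >= limit) break;
      if (seen.has(item.url)) continue; // de-duplicate by url
      seen.add(item.url);
      merged.push(item);
    }
    // Stop when the limit is met, or (assumed) when fillLimit is off
    // and one engine has already answered.
    if (merged.length >= limit || !fillLimit) break;
  }
  return merged;
}
```

Spreading the work across engines in order, rather than querying all of them in parallel, keeps the request count low when the first engine already satisfies `limit`.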
+
+### 4. Stateful Session
 
 Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
 
+### 🧬 Dynamic Templates
+
+While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').
+
+To handle this, you can override the `getTemplate(variables, options)` method.
+
+- **`variables`**: The calculated variables (from `formatOptions`, pagination, etc.).
+- **`options`**: The original `SearchOptions` provided by the user.
+
+```typescript
+export class MyAdvancedSearcher extends WebSearcher {
+  get template(): FetcherOptions {
+    // Default template (usually for web search)
+    return {
+      url: '...',
+      actions: [ { id: 'extract', params: { selector: '.web-result' } } ]
+    };
+  }
+
+  protected override getTemplate(variables: Record<string, any>, options: SearchOptions): FetcherOptions {
+    if (options.category === 'images') {
+      return {
+        url: 'https://site.com/images?q=${query}',
+        actions: [ { id: 'extract', params: { selector: '.img-item' } } ]
+      };
+    }
+    // Fallback to the default template getter
+    return super.getTemplate(variables, options);
+  }
+}
+```
+
 ### 🛡️ Core Principle: Template is Law
 
-The `template`
+The `template` (or the dynamic template returned by `getTemplate`) acts as the authoritative "blueprint".
 
 - **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
+- **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
+- **Session Context**: To maintain a clean session, **actions are filtered out** of the session's persistent context. They are only used during the execution of a `search()` call. This ensures that session-level settings (like cookies or engine type) are preserved without being cluttered by search-specific extraction rules.
 - **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
 
 ```typescript
@@ -56,7 +107,17 @@ const google = new GoogleSearcher({
   proxy: 'http://my-proxy:8080',
   timeoutMs: 30000 // Set a global timeout (valid if template doesn't define it)
 });
+```
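Editorial sketch (not part of the package diff): the "Template is Law" precedence can be pictured as an object merge in which template keys win and user-supplied `actions` are discarded. `applyTemplate` and the `Options` alias are hypothetical names for illustration; the package's actual merge logic is not shown in this diff and may differ.

```typescript
// Hypothetical option bag; the real FetcherOptions type is richer.
type Options = Record<string, unknown>;

function applyTemplate(template: Options, userOptions: Options): Options {
  // User-supplied `actions` are dropped: only the template may define steps.
  const { actions: _ignored, ...rest } = userOptions;
  // Template keys are spread last, so any key the template defines is locked.
  return { ...rest, ...template };
}

const merged = applyTemplate(
  { engine: 'browser', actions: [{ id: 'extract' }] },
  { engine: 'http', proxy: 'http://my-proxy:8080', actions: [{ id: 'other' }] },
);
// merged.engine stays 'browser'; merged.proxy is taken from the user options.
```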
 
+### 🧠 Intelligent Navigation (Goto)
+
+The `WebSearcher` automatically manages navigation to the search URL.
+
+1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).
+2. **Manual Control**: If you explicitly add a `goto` action in your template that matches the resolved URL, the searcher detects this duplicate and **skips** the automatic injection. This gives you full control to add headers, referrer, or other specific parameters to the navigation step if needed.
+3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (e.g., visit a login page first). The searcher will still prepend the main search URL navigation unless one of your `goto` actions matches it exactly.
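Editorial sketch (not part of the package diff): the three rules above amount to "prepend a `goto` unless an exact match already exists". The `Action` shape and the `params.url` key are assumptions made for this illustration, as is the function name `ensureGoto`.

```typescript
// Hypothetical action shape; the real action type may carry more fields.
interface Action { id: string; params?: Record<string, unknown> }

function ensureGoto(actions: Action[], resolvedUrl: string): Action[] {
  const alreadyNavigates = actions.some(
    (a) => a.id === 'goto' && a.params?.url === resolvedUrl,
  );
  // Exact match found: keep the template's own goto and its extra params.
  if (alreadyNavigates) return actions;
  // Otherwise prepend navigation to the resolved search URL.
  return [{ id: 'goto', params: { url: resolvedUrl } }, ...actions];
}
```

A template whose only `goto` points at a login page therefore ends up with two navigations: the injected search URL first, then the login page, which is why the README suggests adding an exactly matching `goto` yourself when the order matters.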
+
+```typescript
 try {
   // First query
   // You can also pass runtime options to override session defaults or inject variables
@@ -156,12 +217,17 @@ protected override get pagination() {
 
 ### Step 3: Transform & Clean Data
 
-Override `transform` to clean data. Since `WebSearcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.
+Override `transform` to clean data. The `context` parameter contains the current search state and any custom parameters you passed to `search()`. Since `WebSearcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.
 
 ```typescript
-protected override async transform(outputs: Record<string, any
+protected override async transform(outputs: Record<string, any>, context: SearchContext) {
   const results = outputs['results'] || [];
 
+  // You can access custom parameters from context
+  if (context.myCustomFlag) {
+    // ... logic
+  }
+
   // Clean data or filter
   return results.map(item => ({
     ...item,
@@ -198,8 +264,10 @@ This is extremely powerful for **filtering out ads** or irrelevant content. If t
 ```typescript
 await google.search('test', {
   limit: 20,
+  myCustomFlag: true,
   // Example: Filter out sponsored results and only keep PDFs
-  transform: (results) => {
+  transform: (results, context) => {
+    console.log('Searching for:', context.query);
     return results.filter(r => {
       const isAd = r.isSponsored || r.url.includes('googleadservices.com');
       return !isAd && r.url.endsWith('.pdf');
@@ -215,7 +283,7 @@ When calling `search()`, you can provide standardized options that the search en
 ```typescript
 const results = await google.search('open source', {
   limit: 20,
-  timeRange: 'month', // 'day', 'week', 'month', 'year'
+  timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
   // Or custom range:
   // timeRange: { from: '2023-01-01', to: '2023-12-31' },
   category: 'news', // 'all', 'images', 'videos', 'news'
@@ -225,6 +293,40 @@ const results = await google.search('open source', {
 });
 ```
 
+#### Search Options Reference
+
+| Option | Type | Description |
+| :--- | :--- | :--- |
+| `limit` | `number` | The target number of total results to retrieve. The searcher will automatically paginate to reach this number. |
+| `maxPages` | `number` | The maximum number of pages (fetch cycles) to fetch. Safety threshold to prevent infinite loops. Default: `10`. |
+| `timeRange` | `string` \| `object` | Filter by time. Presets: `'all'`, `'hour'`, `'day'`, `'week'`, `'month'`, `'year'`. <br/> Or `{ from: Date\|string, to?: Date\|string }` |
+| `category` | `string` | Search category: `'all'`, `'images'`, `'videos'`, `'news'`. |
+| `region` | `string` | ISO 3166-1 alpha-2 region code (e.g., `'US'`, `'CN'`). |
+| `language` | `string` | Language code: ISO 639-1, optionally with a region subtag (e.g., `'en'`, `'zh-CN'`). |
+| `safeSearch` | `string` | Safe search level: `'off'`, `'moderate'`, `'strict'`. |
+| `transform` | `function` | A custom function to filter or modify results at runtime. Runs after the engine's built-in transform. |
+| `baseUrls` | `string[]` \| `Record<string, string[]>` | Override the base URLs for engines. Can be an array for a single engine, or a map of engine names to URL arrays. |
+| `fillLimit` | `boolean` | If `true` (default), continues to subsequent engines in the chain when the current engine returns fewer results than `limit`. |
+| `startPage` | `number` | The page index to start from. Useful when delegating pagination across different sessions. Default: `0`. |
+| `validator` | `function` | Custom callback to validate fetched results. If it returns `false`, triggers failover/retry. Signature: `(results, context) => boolean \| Promise<boolean>`. |
+| `...custom` | `any` | Any other keys are passed as custom variables to the template (e.g., `${myVar}`). |
+
+#### Standard Search Result
+
+Each result in the returned array follows this structure:
+
+| Field | Type | Description |
+| :--- | :--- | :--- |
+| `title` | `string` | The title of the search result. |
+| `url` | `string` | The absolute URL of the result. |
+| `snippet` | `string` | A brief snippet or description. |
+| `image` | `string` | (Optional) URL of a thumbnail or associated image. |
+| `date` | `string`\|`Date` | (Optional) Publication date. |
+| `author` | `string` | (Optional) Author or source name. |
+| `favicon` | `string` | (Optional) Favicon URL of the source website. |
+| `rank` | `number` | (Optional) 1-indexed position in the results. |
+| `source` | `string` | (Optional) Source website name (e.g., 'GitHub'). |
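Editorial sketch (not part of the package diff): the result table above can be written down as a TypeScript shape plus a small runtime guard. `SearchResultSketch` and `isSearchResult` are names invented here, and treating the three fields not marked "(Optional)" as required is an assumption; the package documents its own type as `StandardSearchResult`, which may differ in detail.

```typescript
// Hypothetical mirror of the "Standard Search Result" table; not the package's declaration.
interface SearchResultSketch {
  title: string;
  url: string;
  snippet: string;
  image?: string;
  date?: string | Date;
  author?: string;
  favicon?: string;
  rank?: number; // 1-indexed position
  source?: string;
}

// Runtime check of the three assumed-required fields, e.g. before trusting scraped data.
function isSearchResult(value: unknown): value is SearchResultSketch {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.title === 'string'
    && typeof v.url === 'string'
    && typeof v.snippet === 'string';
}
```

A guard like this could be used inside a `validator` callback to reject a page whose extracted rows are malformed.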
+
 To support these in your own engine, override the `formatOptions` method:
 
 ```typescript
@@ -239,6 +341,36 @@ protected override formatOptions(options: SearchOptions): Record<string, any> {
 Then use these variables in your `template.url`:
 `url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
 
+### 🚀 Implementing Multi-instance Support
+
+If a search engine supports multiple mirrors or distributed deployment, you can easily add failover capabilities:
+
+1. **Configure Base URLs**: Support a list of addresses in the constructor.
+2. **Validate Results**: Override `validateFetchResult(outputs, context)`. If it returns `false`, the searcher automatically tries the next address in the list.
+3. **Template Variables**: Use the `${baseUrl}` placeholder in your template URL.
+
+```typescript
+export class MyDistributedSearcher extends WebSearcher {
+  protected get template(): FetcherOptions {
+    return {
+      url: '${baseUrl}/search?q=${query}',
+      // ...
+    };
+  }
+
+  protected override validateFetchResult(outputs: Record<string, any>, context: SearchContext): boolean {
+    const results = outputs['results'] || [];
+    // If no results, trigger failover to the next node
+    return results.length > 0;
+  }
+}
+
+// Usage
+const searcher = new MyDistributedSearcher({
+  baseUrls: ['https://node1.com', 'https://node2.com']
+});
+```
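Editorial sketch (not part of the package diff): the failover described in this hunk behaves roughly like the loop below, which tries each base URL in order and moves on when the fetch throws or validation returns `false`. `fetchWithFailover`, `fetchFrom`, and `validate` are hypothetical names; the package performs this internally and its real control flow is not shown here.

```typescript
// Hypothetical helper: try each base URL until one yields valid outputs.
async function fetchWithFailover<T>(
  baseUrls: string[],
  fetchFrom: (baseUrl: string) => Promise<T>,
  validate: (outputs: T) => boolean,
): Promise<{ baseUrl: string; outputs: T }> {
  for (const baseUrl of baseUrls) {
    try {
      const outputs = await fetchFrom(baseUrl);
      // Mirrors validateFetchResult: false means "try the next address".
      if (validate(outputs)) return { baseUrl, outputs };
    } catch {
      // Connection failure: fall through to the next address.
    }
  }
  throw new Error('All base URLs failed or returned invalid results');
}
```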
+
 ### Custom Variables
 
 You can pass custom variables to `search()` and use them in your template.
@@ -251,6 +383,34 @@ await google.search('test', { category: 'news' });
 url: 'https://site.com?q=${query}&cat=${category}'
 ```
 
+## 🛡️ Resilient Search & Latency Tools
+
+This module provides a set of general utility functions to evaluate node health and implement failover.
+
+### 1. General Latency Testing Utility
+
+We provide a general latency testing function `testUrlsByLatency` based on `web-fetcher` that can be used for real-time response testing and sorting of any URL list.
+
+```typescript
+import { testUrlsByLatency } from '@isdk/web-searcher/utils';
+
+const urls = ['https://google.com', 'https://bing.com', 'https://baidu.com'];
+const sorted = await testUrlsByLatency(urls, { timeout: 5000 });
+
+// Returns [{ url: '...', latency: 123 }, ...], sorted by latency ascending.
+```
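Editorial sketch (not part of the package diff): one way a latency ranking like this could work is to time a probe per URL, race it against a timeout, and sort the survivors. `sortByLatency` and the injected `probe` function are hypothetical; whether the real `testUrlsByLatency` drops unreachable URLs, as this sketch does, is an assumption.

```typescript
interface Timed { url: string; latency: number }
type Probe = (url: string) => Promise<void>;

async function sortByLatency(urls: string[], probe: Probe, timeout = 5000): Promise<Timed[]> {
  const timed = await Promise.all(urls.map(async (url) => {
    const started = Date.now();
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Whichever settles first wins: the probe or the timeout rejection.
      await Promise.race([
        probe(url),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error('timeout')), timeout);
        }),
      ]);
      return { url, latency: Date.now() - started };
    } catch {
      return undefined; // unreachable or too slow: dropped (assumed behavior)
    } finally {
      clearTimeout(timer);
    }
  }));
  return timed
    .filter((t): t is Timed => t !== undefined)
    .sort((a, b) => a.latency - b.latency);
}
```

Injecting the probe keeps the sketch independent of any particular HTTP client; in practice it would issue a lightweight request to the URL.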
+
+### 2. Engine-Specific Resilient Discovery
+
+For engines like **SearXNG** that support multiple instances and can be unstable, we provide specialized failover and discovery mechanisms.
+
+- **Automatic Failover**: Configure multiple `baseUrls` to automatically switch nodes on connection failure.
+- **Dynamic Discovery**: Automatically fetch and filter high-quality nodes from `searx.space` or GitHub.
+
+For more details, see: [SearXNG Resilient Search Documentation](./src/engines/searxng.md).
+
+---
+
 ## Pagination Guide
 
 ### 1. Offset-based (e.g., Google)