@isdk/web-searcher 0.1.4 → 0.1.6

Files changed (36)
  1. package/README.cn.md +196 -7
  2. package/README.md +196 -7
  3. package/dist/index.d.mts +234 -11
  4. package/dist/index.d.ts +234 -11
  5. package/dist/index.js +1 -1
  6. package/dist/index.mjs +1 -1
  7. package/docs/README.md +196 -7
  8. package/docs/classes/GoogleSearcher.md +289 -60
  9. package/docs/classes/WebSearcher.md +264 -61
  10. package/docs/functions/extractDate.md +42 -0
  11. package/docs/functions/extractMetadataFrom.md +40 -0
  12. package/docs/functions/fetchHeaders.md +34 -0
  13. package/docs/functions/fetchPartial.md +41 -0
  14. package/docs/functions/normalizeDate.md +29 -0
  15. package/docs/functions/parseHeaders.md +28 -0
  16. package/docs/functions/parseHtml.md +31 -0
  17. package/docs/functions/testUrlsByLatency.md +42 -0
  18. package/docs/globals.md +18 -0
  19. package/docs/interfaces/CustomTimeRange.md +3 -3
  20. package/docs/interfaces/ExtractOptions.md +54 -0
  21. package/docs/interfaces/FetchExtractorOptions.md +35 -0
  22. package/docs/interfaces/FetcherOptions.md +436 -0
  23. package/docs/interfaces/HtmlData.md +53 -0
  24. package/docs/interfaces/MetadataResult.md +27 -0
  25. package/docs/interfaces/PaginationConfig.md +9 -9
  26. package/docs/interfaces/SearchContext.md +30 -4
  27. package/docs/interfaces/SearchOptions.md +77 -11
  28. package/docs/interfaces/StandardSearchResult.md +10 -10
  29. package/docs/interfaces/VerifiedUrl.md +25 -0
  30. package/docs/type-aliases/MetadataType.md +13 -0
  31. package/docs/type-aliases/SafeSearchLevel.md +1 -1
  32. package/docs/type-aliases/SearchCategory.md +2 -2
  33. package/docs/type-aliases/SearchTimeRange.md +1 -1
  34. package/docs/type-aliases/SearchTimeRangePreset.md +1 -1
  35. package/docs/type-aliases/SearcherConstructor.md +2 -2
  36. package/package.json +3 -2
package/README.cn.md CHANGED
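@@ -19,8 +19,9 @@ The Search module provides a high-level, class-based framework for building search engine
 
  > **⚠️ Note on `GoogleSearcher`**: The `GoogleSearcher` class used in these examples is a **demo implementation** for educational purposes only. It is not suitable for production use.
  >
- > * It lacks the advanced anti-bot handling (e.g., CAPTCHA solving, proxy rotation) needed to scrape Google reliably at scale.
- > * Because of Google's frequent DOM changes and A/B testing, the extracted data may be **inaccurate or misaligned**.
+ > * **Strict anti-bot detection**: So far, even simulating simple "human behavior" in `browser` mode (e.g., waiting a few seconds, then auto-filling the search box and submitting) is still recognized by Google as automation. Simple action simulation is therefore not enough to pass detection.
+ > * **Scalability limitations**: It lacks the advanced anti-bot handling (e.g., CAPTCHA solving, proxy rotation) needed to scrape Google reliably at scale.
+ > * **Fragility**: Because of Google's frequent DOM changes and A/B testing, the extracted data may be **inaccurate or misaligned**.
 
  Use the static method `WebSearcher.search` for quick, one-off tasks. It automatically creates a session, fetches results, and cleans up.
 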
@@ -38,16 +39,105 @@ const results = await WebSearcher.search('Google', 'open source', { limit: 20 })
  console.log(results);
  ```
 
- ### 2. Stateful Session
+ ### 3. Multi-Engine Orchestration
+
+ `WebSearcher.search` has a built-in **Waterfall** compensation mechanism. When you pass an array of engines, it runs them in order and automatically tops up the result count:
+
+ - **Automatic completion**: If earlier engines return too few results (fewer than `limit`), it automatically queries the following engines to fill the gap.
+ - **Failover and degradation**: If an engine errors out (e.g., blocked, timed out), it is skipped automatically and the next engine is tried, so that results are returned whenever possible.
+ - **Automatic de-duplication**: Results are de-duplicated by `url` when they are merged.
+
+ ```typescript
+ // Waterfall search: Google first; if that is not enough, Bing; if still not enough, SearXNG fills the rest
+ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open source', {
+   limit: 20,
+   fillLimit: true // true by default
+ });
+ ```
+
+ ### 4. Stateful Session
 
  Because `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage across multiple requests. This is useful for searches that require login or for avoiding anti-bot measures by simulating human behavior.
 
+ ### 5. Default Search Parameters
+
+ You can set default search parameters at three levels: **global**, **engine-specific**, and **instance-level**. This avoids passing the same options on every `search()` call.
+
+ The priority order (highest to lowest) is:
+ `search(query, options)` (call options) > `this.options` (instance options) > `Engine.defaultOptions` (engine static options) > `WebSearcher.defaultOptions` (global static options)
+
+ #### A. Global Static Defaults
+
+ Affects all search engines.
+
+ ```typescript
+ import { WebSearcher } from '@isdk/web-fetcher';
+
+ // Set a global limit for all searchers
+ WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+ ```
+
+ #### B. Engine-Specific Static Defaults
+
+ Affects only a specific engine (and its subclasses).
+
+ ```typescript
+ import { GoogleSearcher } from '@isdk/web-fetcher';
+
+ // Only Google will use these defaults
+ GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+ ```
+
+ #### C. Instance-Level Defaults
+
+ Set when creating a searcher instance.
+
+ ```typescript
+ const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+ // This search automatically uses limit: 5 and category: 'news'
+ const results = await google.search('open source');
+ ```
+
+ ### 🧬 Dynamic Templates
+
+ While a static `template` suits simple search engines, many sites (such as Google) change their HTML structure completely depending on the search category (e.g., "Web" vs "Images" vs "News").
+
+ To handle this, you can override the `getTemplate(variables, options)` method.
+
+ - **`variables`**: The computed variables (from `formatOptions`, pagination, etc.).
+ - **`options`**: The original `SearchOptions` supplied by the user.
+
+ ```typescript
+ export class MyAdvancedSearcher extends WebSearcher {
+   get template(): FetcherOptions {
+     // Default template (usually for web search)
+     return {
+       url: '...',
+       actions: [ { id: 'extract', params: { selector: '.web-result' } } ]
+     };
+   }
+
+   protected override getTemplate(variables: Record<string, any>, options: SearchOptions): FetcherOptions {
+     if (options.category === 'images') {
+       return {
+         url: 'https://site.com/images?q=${query}',
+         actions: [ { id: 'extract', params: { selector: '.img-item' } } ]
+       };
+     }
+     // Fall back to the default template getter
+     return super.getTemplate(variables, options);
+   }
+ }
+ ```
+
  ### 🛡️ Core Principle: Template is Law
 
- The `template` defined in a `WebSearcher` subclass is the authoritative "blueprint".
+ The `template` (or the dynamic template returned by `getTemplate`) is the authoritative "blueprint".
 
  - **Template priority**: If the template defines a property (e.g., `engine: 'browser'`, specific `headers`), that value is **locked** and user options cannot override it. This keeps the scraping logic stable.
  - **Immutable actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify execution steps via `options`. This prevents external logic from breaking the scraper's execution flow.
+ - **Session Context**: To keep the session clean, **actions are filtered out of the session's persistent context**. They are used only while a `search()` call is executing. This ensures session-level settings (such as cookies or engine type) are preserved without being polluted by search-specific extraction rules.
  - **User flexibility**: Properties **not** explicitly locked in the template (such as `proxy`, `timeoutMs`, or custom variables) can be set freely in the constructor or the `search()` method.
 
  ```typescript
@@ -167,12 +257,17 @@ protected override get pagination() {
 
  ### Step 3: Transform and Clean Data
 
- Override `transform` to clean the data. Because `WebSearcher` is itself a `FetchSession`, you can also use `this` to make additional requests (such as resolving redirects).
+ Override `transform` to clean the data. The `context` parameter contains the current search state and any custom parameters you passed to `search()`. Because `WebSearcher` is itself a `FetchSession`, you can also use `this` to make additional requests (such as resolving redirects).
 
  ```typescript
- protected override async transform(outputs: Record<string, any>) {
+ protected override async transform(outputs: Record<string, any>, context: SearchContext) {
    const results = outputs['results'] || [];
 
+   // You can access custom parameters from context
+   if (context.myCustomFlag) {
+     // ... logic
+   }
+
    // Clean or filter the data
    return results.map(item => ({
      ...item,
@@ -209,8 +304,10 @@ protected override async transform(outputs: Record<string, any>) {
  ```typescript
  await google.search('test', {
    limit: 20,
+   myCustomFlag: true,
    // Example: filter out sponsored results (ads) and keep only PDFs
-   transform: (results) => {
+   transform: (results, context) => {
+     console.log('Searching for:', context.query);
      return results.filter(r => {
        const isAd = r.isSponsored || r.url.includes('googleadservices.com');
        return !isAd && r.url.endsWith('.pdf');
@@ -236,6 +333,40 @@ const results = await google.search('open source', {
  });
  ```
 
+ #### Search Options Reference
+
+ | Option | Type | Description |
+ | :--- | :--- | :--- |
+ | `limit` | `number` | The desired total number of results. The searcher paginates automatically to reach this number. |
+ | `maxPages` | `number` | Maximum number of pages to fetch (pagination cycles). A safety threshold to prevent infinite loops. Default: `10`. |
+ | `timeRange` | `string` \| `object` | Filter by time. Presets: `'all'`, `'hour'`, `'day'`, `'week'`, `'month'`, `'year'`.<br/> Or a custom range `{ from: Date\|string, to?: Date\|string }` |
+ | `category` | `string` | Search category: `'all'`, `'images'`, `'videos'`, `'news'`. |
+ | `region` | `string` | ISO 3166-1 alpha-2 region code (e.g., `'US'`, `'CN'`). |
+ | `language` | `string` | ISO 639-1 language code (e.g., `'en'`, `'zh-CN'`). |
+ | `safeSearch` | `string` | Safe search level: `'off'`, `'moderate'`, `'strict'`. |
+ | `transform` | `function` | Custom transform function applied at runtime. Runs after the engine's built-in transform. |
+ | `baseUrls` | `string[]` \| `Record<string, string[]>` | Overrides the engine's base URLs. Either a URL array for a single engine, or a map from engine names to URL arrays. |
+ | `fillLimit` | `boolean` | When `true` (the default), subsequent engines are tried automatically if the current engine returns fewer results than `limit`. |
+ | `startPage` | `number` | Index of the page to start from. Useful when pagination is delegated across sessions. Default: `0`. |
+ | `validator` | `function` | Custom callback that validates the fetched results. Returning `false` triggers failover/retry. Signature: `(results, context) => boolean \| Promise<boolean>` |
+ | `...custom` | `any` | Any other keys are passed to the template as custom variables (e.g., `${myVar}`). |
+
+ #### Standard Search Result
+
+ Each result in the returned array has the following structure:
+
+ | Field | Type | Description |
+ | :--- | :--- | :--- |
+ | `title` | `string` | The title of the search result. |
+ | `url` | `string` | The absolute URL of the result. |
+ | `snippet` | `string` | A short summary or description. |
+ | `image` | `string` | (Optional) URL of a thumbnail or related image. |
+ | `date` | `string`\|`Date` | (Optional) Publication date. |
+ | `author` | `string` | (Optional) Author or source name. |
+ | `favicon` | `string` | (Optional) Favicon URL of the source website. |
+ | `rank` | `number` | (Optional) Position in the results (starting from 1). |
+ | `source` | `string` | (Optional) Source website name (e.g., 'GitHub'). |
+
  To support these options in your own engine, override the `formatOptions` method:
 
  ```typescript
@@ -250,6 +381,36 @@ protected override formatOptions(options: SearchOptions): Record<string, any> {
  Then use these variables in your `template.url`:
  `url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
 
+ ### 🚀 Implementing Multi-instance Support
+
+ If a search engine supports multiple mirrors or a distributed deployment, you can easily add failover to it:
+
+ 1. **Configure Base URLs**: Accept a list of addresses in the constructor.
+ 2. **Validate results**: Override `validateFetchResult(outputs, context)`. If it returns `false`, the searcher automatically tries the next address in the list.
+ 3. **Template variables**: Use the `${baseUrl}` placeholder in the template URL.
+
+ ```typescript
+ export class MyDistributedSearcher extends WebSearcher {
+   protected get template(): FetcherOptions {
+     return {
+       url: '${baseUrl}/search?q=${query}',
+       // ...
+     };
+   }
+
+   protected override validateFetchResult(outputs: Record<string, any>, context: SearchContext): boolean {
+     const results = outputs['results'] || [];
+     // If there are no results, trigger failover to the next node
+     return results.length > 0;
+   }
+ }
+
+ // Usage
+ const searcher = new MyDistributedSearcher({
+   baseUrls: ['https://node1.com', 'https://node2.com']
+ });
+ ```
+
  ### Custom Variables
 
  You can pass custom variables to `search()` and use them in the template.
@@ -262,6 +423,34 @@ await google.search('test', { category: 'news' });
  url: 'https://site.com?q=${query}&cat=${category}'
  ```
 
+ ## 🛡️ Resilient Search & Latency Tools
+
+ This module provides a set of general-purpose utility functions for assessing node health and implementing failover.
+
+ ### 1. General Latency Testing Utility
+
+ We provide `testUrlsByLatency`, a general latency testing function built on `web-fetcher`, which tests the real-time response speed of any list of URLs and sorts them by latency.
+
+ ```typescript
+ import { testUrlsByLatency } from '@isdk/web-searcher/utils';
+
+ const urls = ['https://google.com', 'https://bing.com', 'https://baidu.com'];
+ const sorted = await testUrlsByLatency(urls, { timeout: 5000 });
+
+ // Returns [{ url: '...', latency: 123 }, ...], sorted by latency in ascending order
+ ```
+
+ ### 2. Engine-Specific Resilient Discovery
+
+ For engines such as **SearXNG** that support multiple instances and are unstable, we provide dedicated failover and automatic discovery mechanisms.
+
+ - **Automatic failover**: Supports configuring multiple `baseUrls` and switches nodes automatically on connection failure.
+ - **Dynamic discovery**: Supports automatically fetching and filtering high-quality nodes from `searx.space` or GitHub.
+
+ For details, see: [SearXNG Resilient Search Documentation](./src/engines/searxng.cn.md).
+
+ ---
+
  ## Pagination Guide
 
  ### 1. Offset-based (e.g., Google)
package/README.md CHANGED
@@ -19,8 +19,9 @@ This module encapsulates these patterns into a reusable `WebSearcher` class.
 
  > **⚠️ Note on `GoogleSearcher`**: The `GoogleSearcher` class used in these examples is a **demo implementation** included for educational purposes. It is not intended for production use.
  >
- > * It lacks advanced anti-bot handling (CAPTCHA solving, proxy rotation) required for scraping Google reliably at scale.
- > * The extracted data may be **inaccurate or misaligned** due to Google's frequent DOM changes and A/B testing.
+ > * **Strict Anti-Bot Detection**: Currently, it has been found that even when attempting to simulate simple "human behavior" in `browser` mode (such as waiting for a few seconds before automatically filling in the search box and submitting), it is still detected as an automated program by Google. This indicates that simple operation simulation is not enough to pass the detection.
+ > * **Scalability Limitations**: It lacks advanced countermeasures like CAPTCHA solving, fingerprint spoofing, or high-quality proxy rotation required for reliable scraping.
+ > * **Fragility**: The extracted data may be **inaccurate or misaligned** due to Google's frequent DOM changes and A/B testing.
 
  Use the static `WebSearcher.search` method for quick, disposable tasks. It automatically creates a session, fetches results, and cleans up.
 
@@ -38,16 +39,105 @@ const results = await WebSearcher.search('Google', 'open source', { limit: 20 })
  console.log(results);
  ```
 
- ### 2. Stateful Session
+ ### 3. Multi-Engine Orchestration
+
+ `WebSearcher.search` features a built-in **Waterfall** compensation mechanism. When you provide an array of engine names, it executes them sequentially and automatically fills the result count:
+
+ - **Automatic Completion**: If the preceding engines return fewer results than the `limit`, it automatically requests subsequent engines to fill the gap.
+ - **Failover & Degradation**: If an engine fails (e.g., blocked, timeout), it automatically skips it and tries the next one, ensuring results are returned whenever possible.
+ - **Auto Deduplication**: It automatically de-duplicates results based on their `url` during the merging process.
+
+ ```typescript
+ // Waterfall search: Google first, Bing as fallback, SearXNG as final backup
+ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open source', {
+   limit: 20,
+   fillLimit: true // Enabled by default
+ });
+ ```
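
The waterfall behaviour described in this hunk can be sketched in isolation. The block below is an illustrative model, not the package's implementation: the `Result` and `Engine` types and the `waterfallSearch` function are hypothetical names, and the handling of `fillLimit: false` (stop after the first engine that yields results) is an assumption.

```typescript
// Hypothetical types for illustration; not the library's actual definitions.
interface Result {
  title: string;
  url: string;
}
type Engine = (query: string, limit: number) => Promise<Result[]>;

async function waterfallSearch(
  engines: Engine[],
  query: string,
  limit: number,
  fillLimit = true,
): Promise<Result[]> {
  const seen = new Set<string>();
  const merged: Result[] = [];

  for (const engine of engines) {
    try {
      // Ask the engine only for the results that are still missing.
      const batch = await engine(query, limit - merged.length);
      for (const item of batch) {
        if (merged.length >= limit) break;
        if (seen.has(item.url)) continue; // de-duplicate by url
        seen.add(item.url);
        merged.push(item);
      }
    } catch (err) {
      // Failover: a failing engine is skipped and the next one is tried.
      continue;
    }
    if (merged.length >= limit) break;
    // Assumption: without fillLimit, the first engine that returned results ends the chain.
    if (!fillLimit && merged.length > 0) break;
  }
  return merged;
}
```

Called with three engines where the second one throws, this sketch returns the first engine's results followed by any new URLs from the third, capped at `limit`.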
+
+ ### 4. Stateful Session
 
  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
 
+ ### 5. Default Search Parameters
+
+ You can set default search parameters at three levels: **Global**, **Engine-specific**, and **Instance-level**. This avoids passing repetitive options to every `search()` call.
+
+ The priority order (from highest to lowest) is:
+ `search(query, options)` (Call) > `this.options` (Instance) > `Engine.defaultOptions` (Static Engine) > `WebSearcher.defaultOptions` (Static Global)
+
+ #### A. Global Static Defaults
+
+ Affects all search engines.
+
+ ```typescript
+ import { WebSearcher } from '@isdk/web-fetcher';
+
+ // Set global limit for all searchers
+ WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+ ```
+
+ #### B. Engine-Specific Static Defaults
+
+ Affects only a specific engine (and its subclasses).
+
+ ```typescript
+ import { GoogleSearcher } from '@isdk/web-fetcher';
+
+ // Only Google will use these defaults
+ GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+ ```
+
+ #### C. Instance-Level Defaults
+
+ Set when creating a searcher instance.
+
+ ```typescript
+ const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+ // This search will use limit: 5 and category: 'news' automatically
+ const results = await google.search('open source');
+ ```
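
The four-level priority order above can be modelled with a shallow merge. This is a sketch under assumptions: `OptionsSketch` and `resolveOptions` are hypothetical names, and whether the package merges shallowly with object spread is not confirmed by the diff.

```typescript
// Hypothetical option shape; only a few of the documented keys are modelled.
interface OptionsSketch {
  limit?: number;
  region?: string;
  category?: string;
  safeSearch?: string;
}

function resolveOptions(
  globalDefaults: OptionsSketch,
  engineDefaults: OptionsSketch,
  instanceOptions: OptionsSketch,
  callOptions: OptionsSketch,
): OptionsSketch {
  // Later spreads win, so the arguments run from lowest to highest priority.
  return { ...globalDefaults, ...engineDefaults, ...instanceOptions, ...callOptions };
}

const resolved = resolveOptions(
  { limit: 20, safeSearch: 'strict' }, // WebSearcher.defaultOptions (global)
  { region: 'US' },                    // Engine.defaultOptions (engine)
  { limit: 5, category: 'news' },      // this.options (instance)
  { limit: 10 },                       // search(query, options) (call)
);
// limit comes from the call (10), category from the instance, region from the
// engine, and safeSearch from the global defaults.
console.log(resolved);
```

One caveat of this approach: a key explicitly set to `undefined` at a higher level would overwrite a lower-level value, so a real implementation may need to filter such keys first.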
+
+ ### 🧬 Dynamic Templates
+
+ While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').
+
+ To handle this, you can override the `getTemplate(variables, options)` method.
+
+ - **`variables`**: The calculated variables (from `formatOptions`, pagination, etc.).
+ - **`options`**: The original `SearchOptions` provided by the user.
+
+ ```typescript
+ export class MyAdvancedSearcher extends WebSearcher {
+   get template(): FetcherOptions {
+     // Default template (usually for web search)
+     return {
+       url: '...',
+       actions: [ { id: 'extract', params: { selector: '.web-result' } } ]
+     };
+   }
+
+   protected override getTemplate(variables: Record<string, any>, options: SearchOptions): FetcherOptions {
+     if (options.category === 'images') {
+       return {
+         url: 'https://site.com/images?q=${query}',
+         actions: [ { id: 'extract', params: { selector: '.img-item' } } ]
+       };
+     }
+     // Fallback to the default template getter
+     return super.getTemplate(variables, options);
+   }
+ }
+ ```
+
  ### 🛡️ Core Principle: Template is Law
 
- The `template` defined in the `WebSearcher` subclass acts as the authoritative "blueprint".
+ The `template` (or the dynamic template returned by `getTemplate`) acts as the authoritative "blueprint".
 
  - **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
  - **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
+ - **Session Context**: To maintain a clean session, **actions are filtered out** of the session's persistent context. They are only used during the execution of a `search()` call. This ensures that session-level settings (like cookies or engine type) are preserved without being cluttered by search-specific extraction rules.
  - **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
 
  ```typescript
@@ -167,12 +257,17 @@ protected override get pagination() {
 
  ### Step 3: Transform & Clean Data
 
- Override `transform` to clean data. Since `WebSearcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.
+ Override `transform` to clean data. The `context` parameter contains the current search state and any custom parameters you passed to `search()`. Since `WebSearcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.
 
  ```typescript
- protected override async transform(outputs: Record<string, any>) {
+ protected override async transform(outputs: Record<string, any>, context: SearchContext) {
    const results = outputs['results'] || [];
 
+   // You can access custom parameters from context
+   if (context.myCustomFlag) {
+     // ... logic
+   }
+
    // Clean data or filter
    return results.map(item => ({
      ...item,
@@ -209,8 +304,10 @@ This is extremely powerful for **filtering out ads** or irrelevant content. If t
  ```typescript
  await google.search('test', {
    limit: 20,
+   myCustomFlag: true,
    // Example: Filter out sponsored results and only keep PDFs
-   transform: (results) => {
+   transform: (results, context) => {
+     console.log('Searching for:', context.query);
      return results.filter(r => {
        const isAd = r.isSponsored || r.url.includes('googleadservices.com');
        return !isAd && r.url.endsWith('.pdf');
@@ -236,6 +333,40 @@ const results = await google.search('open source', {
  });
  ```
 
+ #### Search Options Reference
+
+ | Option | Type | Description |
+ | :--- | :--- | :--- |
+ | `limit` | `number` | The target number of total results to retrieve. The searcher will automatically paginate to reach this number. |
+ | `maxPages` | `number` | The maximum number of pages (fetch cycles) to fetch. Safety threshold to prevent infinite loops. Default: `10`. |
+ | `timeRange` | `string` \| `object` | Filter by time. Presets: `'all'`, `'hour'`, `'day'`, `'week'`, `'month'`, `'year'`. <br/> Or `{ from: Date\|string, to?: Date\|string }` |
+ | `category` | `string` | Search category: `'all'`, `'images'`, `'videos'`, `'news'`. |
+ | `region` | `string` | ISO 3166-1 alpha-2 region code (e.g., `'US'`, `'CN'`). |
+ | `language` | `string` | ISO 639-1 language code (e.g., `'en'`, `'zh-CN'`). |
+ | `safeSearch` | `string` | Safe search level: `'off'`, `'moderate'`, `'strict'`. |
+ | `transform` | `function` | A custom function to filter or modify results at runtime. Runs after the engine's built-in transform. |
+ | `baseUrls` | `string[]` \| `Record<string, string[]>` | Override the base URLs for engines. Can be an array for a single engine, or a map of engine names to URL arrays. |
+ | `fillLimit` | `boolean` | If `true` (default), continues to subsequent engines in the chain when the current engine returns fewer results than `limit`. |
+ | `startPage` | `number` | The page index to start from. Useful when delegating pagination across different sessions. Default: `0`. |
+ | `validator` | `function` | Custom callback to validate fetched results. If it returns `false`, triggers failover/retry. Signature: `(results, context) => boolean \| Promise<boolean>`. |
+ | `...custom` | `any` | Any other keys are passed as custom variables to the template (e.g., `${myVar}`). |
+
+ #### Standard Search Result
+
+ Each result in the returned array follows this structure:
+
+ | Field | Type | Description |
+ | :--- | :--- | :--- |
+ | `title` | `string` | The title of the search result. |
+ | `url` | `string` | The absolute URL of the result. |
+ | `snippet` | `string` | A brief snippet or description. |
+ | `image` | `string` | (Optional) URL of a thumbnail or associated image. |
+ | `date` | `string`\|`Date` | (Optional) Publication date. |
+ | `author` | `string` | (Optional) Author or source name. |
+ | `favicon` | `string` | (Optional) Favicon URL of the source website. |
+ | `rank` | `number` | (Optional) 1-indexed position in the results. |
+ | `source` | `string` | (Optional) Source website name (e.g., 'GitHub'). |
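
The result table and the `validator` signature above can be combined into a small check. This is a sketch, assuming that `title`, `url`, and `snippet` are required because the table marks only the other fields as optional; `StandardSearchResultSketch`, `hasRequiredFields`, and `requireWellFormedResults` are hypothetical names, and the `context` argument is left untyped because its shape is not shown in this hunk.

```typescript
// Shape transcribed from the table above; the name is suffixed to mark it as a sketch.
interface StandardSearchResultSketch {
  title: string;
  url: string;
  snippet: string;
  image?: string;
  date?: string | Date;
  author?: string;
  favicon?: string;
  rank?: number;
  source?: string;
}

// Type guard: checks only the three fields assumed to be required.
function hasRequiredFields(value: unknown): value is StandardSearchResultSketch {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === 'string' &&
    typeof v.url === 'string' &&
    typeof v.snippet === 'string'
  );
}

// Matches the documented validator signature: (results, context) => boolean.
// Returning false would trigger failover/retry according to the options table.
const requireWellFormedResults = (results: unknown[], _context: unknown): boolean =>
  results.length > 0 && results.every(hasRequiredFields);
```

A function like this could be passed as the `validator` option so that an empty or malformed page is treated as a failure rather than as a valid empty result.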
+
  To support these in your own engine, override the `formatOptions` method:
 
  ```typescript
@@ -250,6 +381,36 @@ protected override formatOptions(options: SearchOptions): Record<string, any> {
  Then use these variables in your `template.url`:
  `url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
 
+ ### 🚀 Implementing Multi-instance Support
+
+ If a search engine supports multiple mirrors or distributed deployment, you can easily add failover capabilities:
+
+ 1. **Configure Base URLs**: Support a list of addresses in the constructor.
+ 2. **Validate Results**: Override `validateFetchResult(outputs, context)`. If it returns `false`, the searcher automatically tries the next address in the list.
+ 3. **Template Variables**: Use the `${baseUrl}` placeholder in your template URL.
+
+ ```typescript
+ export class MyDistributedSearcher extends WebSearcher {
+   protected get template(): FetcherOptions {
+     return {
+       url: '${baseUrl}/search?q=${query}',
+       // ...
+     };
+   }
+
+   protected override validateFetchResult(outputs: Record<string, any>, context: SearchContext): boolean {
+     const results = outputs['results'] || [];
+     // If no results, trigger failover to the next node
+     return results.length > 0;
+   }
+ }
+
+ // Usage
+ const searcher = new MyDistributedSearcher({
+   baseUrls: ['https://node1.com', 'https://node2.com']
+ });
+ ```
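
The failover loop implied by steps 1 to 3 can be sketched without the library. Everything here is hypothetical (`Outputs`, `FetchFn`, `searchWithFailover`); the real searcher presumably drives this loop internally, and how it treats thrown errors versus a `false` validation is an assumption.

```typescript
type Outputs = Record<string, unknown>;
type FetchFn = (url: string) => Promise<Outputs>;

async function searchWithFailover(
  baseUrls: string[],
  query: string,
  fetchOutputs: FetchFn,
  validate: (outputs: Outputs) => boolean,
): Promise<{ baseUrl: string; outputs: Outputs } | undefined> {
  for (const baseUrl of baseUrls) {
    // Same URL shape as the '${baseUrl}/search?q=${query}' template above.
    const url = `${baseUrl}/search?q=${encodeURIComponent(query)}`;
    try {
      const outputs = await fetchOutputs(url);
      // A failed validation is treated like a failed node: move on to the next one.
      if (validate(outputs)) return { baseUrl, outputs };
    } catch (err) {
      // Connection errors also fall through to the next base URL.
    }
  }
  return undefined; // every node failed or returned invalid output
}
```

Passing the fetch function in as a parameter keeps the sketch testable with a stub; the validation callback plays the role of `validateFetchResult`.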
+
  ### Custom Variables
 
  You can pass custom variables to `search()` and use them in your template.
@@ -262,6 +423,34 @@ await google.search('test', { category: 'news' });
  url: 'https://site.com?q=${query}&cat=${category}'
  ```
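
How `${name}` placeholders in a template URL might be filled can be illustrated with a plain string replacement. This is a sketch only: `fillTemplate` is a hypothetical helper, the package's actual interpolation (including any URL encoding) is not shown in this diff, and leaving unknown placeholders untouched is an assumption.

```typescript
// Replaces ${name} placeholders with values from `variables`.
// Values are inserted as-is; URL encoding is left to the caller in this sketch.
function fillTemplate(template: string, variables: Record<string, unknown>): string {
  return template.replace(/\$\{(\w+)\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(variables, name)
      ? String(variables[name])
      : match, // unknown placeholders are left untouched
  );
}

// Single quotes keep ${...} literal, as in the template examples above.
const filled = fillTemplate('https://site.com?q=${query}&cat=${category}', {
  query: 'test',
  category: 'news',
});
console.log(filled); // https://site.com?q=test&cat=news
```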
 
+ ## 🛡️ Resilient Search & Latency Tools
+
+ This module provides a set of general utility functions to evaluate node health and implement failover.
+
+ ### 1. General Latency Testing Utility
+
+ We provide a general latency testing function `testUrlsByLatency` based on `web-fetcher` that can be used for real-time response testing and sorting of any URL list.
+
+ ```typescript
+ import { testUrlsByLatency } from '@isdk/web-searcher/utils';
+
+ const urls = ['https://google.com', 'https://bing.com', 'https://baidu.com'];
+ const sorted = await testUrlsByLatency(urls, { timeout: 5000 });
+
+ // Returns [{ url: '...', latency: 123 }, ...], sorted by latency ascending.
+ ```
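
For readers curious how such a latency ranking could work, here is a minimal stand-alone sketch. It is not the implementation of `testUrlsByLatency`: `LatencyResult`, `Probe`, `headProbe`, and `sortByLatency` are hypothetical names, the probe uses the global `fetch` (Node 18+ or a browser) with a `HEAD` request, and dropping unreachable URLs from the result is an assumption.

```typescript
interface LatencyResult {
  url: string;
  latency: number; // milliseconds
}
type Probe = (url: string, timeoutMs: number) => Promise<number>;

// Default probe: times a HEAD request and aborts it after timeoutMs.
const headProbe: Probe = async (url, timeoutMs) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    await fetch(url, { method: 'HEAD', signal: controller.signal });
    return Date.now() - started;
  } finally {
    clearTimeout(timer);
  }
};

async function sortByLatency(
  urls: string[],
  timeoutMs = 5000,
  probe: Probe = headProbe,
): Promise<LatencyResult[]> {
  // Probe all URLs concurrently; a failed or timed-out probe yields null.
  const probes = urls.map(async (url): Promise<LatencyResult | null> => {
    try {
      return { url, latency: await probe(url, timeoutMs) };
    } catch (err) {
      return null;
    }
  });
  const settled = await Promise.all(probes);
  return settled
    .filter((r): r is LatencyResult => r !== null)
    .sort((a, b) => a.latency - b.latency); // fastest first
}
```

Making the probe injectable keeps the ranking logic testable without network access; in real use the default `headProbe` would be left in place.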
+
+ ### 2. Engine-Specific Resilient Discovery
+
+ For engines like **SearXNG** that support multiple instances and can be unstable, we provide specialized failover and discovery mechanisms.
+
+ - **Automatic Failover**: Configure multiple `baseUrls` to automatically switch nodes on connection failure.
+ - **Dynamic Discovery**: Automatically fetch and filter high-quality nodes from `searx.space` or GitHub.
+
+ For more details, see: [SearXNG Resilient Search Documentation](./src/engines/searxng.md).
+
+ ---
+
  ## Pagination Guide
 
  ### 1. Offset-based (e.g., Google)