@isdk/web-searcher 0.1.2 → 0.1.4
- package/README.cn.md +43 -14
- package/README.md +44 -15
- package/dist/index.d.mts +31 -2
- package/dist/index.d.ts +31 -2
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +44 -15
- package/docs/classes/GoogleSearcher.md +27 -27
- package/docs/classes/WebSearcher.md +27 -27
- package/docs/interfaces/CustomTimeRange.md +3 -3
- package/docs/interfaces/PaginationConfig.md +26 -6
- package/docs/interfaces/SearchContext.md +4 -4
- package/docs/interfaces/SearchOptions.md +23 -8
- package/docs/interfaces/StandardSearchResult.md +56 -6
- package/docs/type-aliases/SafeSearchLevel.md +1 -1
- package/docs/type-aliases/SearchCategory.md +1 -1
- package/docs/type-aliases/SearchTimeRange.md +1 -1
- package/docs/type-aliases/SearchTimeRangePreset.md +2 -2
- package/docs/type-aliases/SearcherConstructor.md +1 -1
- package/package.json +1 -1
package/README.cn.md
CHANGED
|
@@ -42,27 +42,37 @@ console.log(results);
|
|
|
42
42
|
|
|
43
43
|
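Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for searches that require login, or for avoiding anti-bot measures by simulating human behavior.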
|
|
44
44
|
|
|
45
|
-
|
|
46
|
-
When creating a session, options are merged in the following order:
|
|
45
|
+
### 🛡️ Core Principle: Template is Law
|
|
47
46
|
|
|
48
|
-
|
|
49
|
-
2. **User Options**: Options passed to the constructor (can fill in missing defaults, or override where allowed).
|
|
47
|
+
The `template` defined in a `WebSearcher` subclass is the authoritative "blueprint".
|
|
50
48
|
|
|
51
|
-
|
|
49
|
+
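- **Template Priority**: If the template defines a property (such as `engine: 'browser'` or specific `headers`), that value is **locked** and user options cannot override it. This keeps the scraping logic stable.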
|
|
50
|
+
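- **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify execution steps via `options`. This prevents external logic from breaking the scraper's execution flow.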
|
|
51
|
+
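- **User Flexibility**: Properties **not** explicitly locked in the template (such as `proxy`, `timeoutMs`, or custom variables) can be set freely in the constructor or the `search()` method.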
|
|
52
52
|
|
|
53
53
|
```typescript
|
|
54
54
|
// Create a persistent session
|
|
55
55
|
const google = new GoogleSearcher({
|
|
56
|
-
headless: false, //
|
|
56
|
+
headless: false, // Can be overridden if not locked in the template
|
|
57
57
|
proxy: 'http://my-proxy:8080',
|
|
58
|
-
timeoutMs: 30000 //
|
|
58
|
+
timeoutMs: 30000 // Valid (assuming the GoogleSearcher template does not explicitly set timeoutMs)
|
|
59
59
|
});
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
### 🧠 Intelligent Navigation (Goto)
|
|
63
|
+
|
|
64
|
+
`WebSearcher` automatically manages navigation to the search URL.
|
|
65
|
+
|
|
66
|
+
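1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).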
|
|
67
|
+
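2. **Manual Control**: If you explicitly add a `goto` action to your template that matches the resolved URL, the searcher detects the duplicate and **skips** the automatic injection. This gives you full control over the navigation step (for example, adding headers, a referrer, or other specific parameters).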
|
|
68
|
+
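3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (for example, visiting a login page first). The searcher still prepends navigation to the main search URL unless one of your `goto` actions matches it exactly.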
|
|
60
69
|
|
|
70
|
+
```typescript
|
|
61
71
|
try {
|
|
62
72
|
// First query
|
|
63
73
|
// You can also pass runtime options to override session defaults or inject variables
|
|
64
74
|
const results1 = await google.search('term A', {
|
|
65
|
-
timeoutMs: 60000, //
|
|
75
|
+
timeoutMs: 60000, // Override the timeout just for this search
|
|
66
76
|
extraParam: 'value' // Can be used in the template as ${extraParam}
|
|
67
77
|
});
|
|
68
78
|
|
|
@@ -174,22 +184,41 @@ protected override async transform(outputs: Record<string, any>) {
|
|
|
174
184
|
|
|
175
185
|
## 🧠 Advanced Concepts
|
|
176
186
|
|
|
177
|
-
###
|
|
187
|
+
### Auto-Pagination: How `limit` and `maxPages` Relate
|
|
188
|
+
|
|
189
|
+
`WebSearcher` is designed to be result-oriented. When you call `search()`, you only specify how many results you want, and the searcher handles pagination automatically.
|
|
190
|
+
|
|
191
|
+
- **`limit`**: The total number of results you want.
|
|
192
|
+
- **`maxPages`**: The safety threshold. It caps the number of pages (pagination cycles) the searcher may fetch in order to satisfy `limit`.
|
|
178
193
|
|
|
179
|
-
|
|
194
|
+
**Example of how they work together:**
|
|
195
|
+
If you request `{ limit: 50 }` but each page has only 5 results:
|
|
196
|
+
|
|
197
|
+
1. The searcher fetches page 1 (5 results).
|
|
198
|
+
2. It sees that `5 < 50`, so it automatically fetches page 2.
|
|
199
|
+
3. The loop continues until it has 50 results **or** it reaches the `maxPages` limit (10 pages by default).
|
|
200
|
+
|
|
201
|
+
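This mechanism prevents unbounded fetching when the "next page" selector breaks or the engine gets stuck in a loop, protecting your system resources.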
|
|
180
202
|
|
|
181
203
|
### User-defined Transforms
|
|
182
204
|
|
|
183
205
|
Users can provide their own `transform` when calling `search`. It runs **after** the engine's built-in transform.
|
|
184
206
|
|
|
207
|
+
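This is very powerful for **filtering out ads** or irrelevant content. If the user filters out some results, the auto-pagination logic **automatically kicks in** to fetch more pages, so that the final result list both meets the requested `limit` and contains only valid entries.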
|
|
208
|
+
|
|
185
209
|
```typescript
|
|
186
210
|
await google.search('test', {
|
|
187
|
-
|
|
211
|
+
limit: 20,
|
|
212
|
+
// Example: filter out sponsored results (ads) and keep only PDFs
|
|
213
|
+
transform: (results) => {
|
|
214
|
+
return results.filter(r => {
|
|
215
|
+
const isAd = r.isSponsored || r.url.includes('googleadservices.com');
|
|
216
|
+
return !isAd && r.url.endsWith('.pdf');
|
|
217
|
+
});
|
|
218
|
+
}
|
|
188
219
|
});
|
|
189
220
|
```
|
|
190
221
|
|
|
191
|
-
If the user filters out results, the auto-pagination logic kicks in to fetch more pages to meet the requested limit.
|
|
192
|
-
|
|
193
222
|
### Standardized Search Options
|
|
194
223
|
|
|
195
224
|
When calling `search()`, you can provide standardized options, which the search engine maps to its specific parameters:
|
|
@@ -197,7 +226,7 @@ await google.search('test', {
|
|
|
197
226
|
```typescript
|
|
198
227
|
const results = await google.search('open source', {
|
|
199
228
|
limit: 20,
|
|
200
|
-
timeRange: 'month', // 'day', 'week', 'month', 'year'
|
|
229
|
+
timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
|
|
201
230
|
// Or a custom range:
|
|
202
231
|
// timeRange: { from: '2023-01-01', to: '2023-12-31' },
|
|
203
232
|
category: 'news', // 'all', 'images', 'videos', 'news'
|
package/README.md
CHANGED
|
@@ -42,27 +42,37 @@ console.log(results);
|
|
|
42
42
|
|
|
43
43
|
Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
|
|
44
44
|
|
|
45
|
-
|
|
46
|
-
When creating a session, options are merged in the following order:
|
|
45
|
+
### 🛡️ Core Principle: Template is Law
|
|
47
46
|
|
|
48
|
-
|
|
49
|
-
2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).
|
|
47
|
+
The `template` defined in the `WebSearcher` subclass acts as the authoritative "blueprint".
|
|
50
48
|
|
|
51
|
-
|
|
49
|
+
- **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
|
|
50
|
+
- **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
|
|
51
|
+
- **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
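The three rules above can be sketched as a small merge routine. This is a minimal illustration under stated assumptions, not the package's implementation: `resolveOptions`, `fillMissing`, and the `Plain` type are hypothetical names, and the sketch only models "template values win, user options fill gaps, user `actions` are dropped".

```typescript
// Illustrative sketch only: these helpers are hypothetical, not part of @isdk/web-searcher.
type Plain = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Plain {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Copy values from `source` into `target` only where `target` has none yet,
// recursing into nested plain objects. Nested objects are cloned on assignment
// so that later fills never mutate the source.
function fillMissing(target: Plain, source: Plain): Plain {
  for (const key of Object.keys(source)) {
    const incoming = source[key];
    const existing = target[key];
    if (existing === undefined) {
      target[key] = isPlainObject(incoming) ? fillMissing({}, incoming) : incoming;
    } else if (isPlainObject(existing) && isPlainObject(incoming)) {
      fillMissing(existing, incoming);
    }
  }
  return target;
}

function resolveOptions(template: Plain, userOptions: Plain): Plain {
  // User-supplied `actions` are discarded so the template's steps stay intact.
  const withoutActions: Plain = {};
  for (const key of Object.keys(userOptions)) {
    if (key !== 'actions') withoutActions[key] = userOptions[key];
  }
  // The template is applied first, so its values are locked;
  // user options can only fill properties the template leaves undefined.
  const resolved = fillMissing({}, template);
  return fillMissing(resolved, withoutActions);
}

const resolved = resolveOptions(
  { engine: 'browser', browser: { headless: false }, actions: [{ id: 'extract' }] },
  { engine: 'http', proxy: 'http://my-proxy:8080', timeoutMs: 30000, actions: [] },
);
console.log(resolved);
```

In this sketch `engine` stays `'browser'` and the template's single action survives, while `proxy` and `timeoutMs` pass through because the template does not define them.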
|
|
52
52
|
|
|
53
53
|
```typescript
|
|
54
54
|
// Create a persistent session
|
|
55
55
|
const google = new GoogleSearcher({
|
|
56
|
-
headless: false, // Override
|
|
56
|
+
headless: false, // Override if not locked in template
|
|
57
57
|
proxy: 'http://my-proxy:8080',
|
|
58
|
-
timeoutMs: 30000 // Set a global timeout
|
|
58
|
+
timeoutMs: 30000 // Set a global timeout (valid if template doesn't define it)
|
|
59
59
|
});
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
### 🧠 Intelligent Navigation (Goto)
|
|
63
|
+
|
|
64
|
+
The `WebSearcher` automatically manages navigation to the search URL.
|
|
65
|
+
|
|
66
|
+
1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).
|
|
67
|
+
2. **Manual Control**: If you explicitly add a `goto` action in your template that matches the resolved URL, the searcher detects this duplicate and **skips** the automatic injection. This gives you full control to add headers, referrer, or other specific parameters to the navigation step if needed.
|
|
68
|
+
3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (e.g., visit a login page first). The searcher will still prepend the main search URL navigation unless one of your `goto` actions matches it exactly.
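The injection rule described in the list above can be sketched as follows. This is a hedged illustration: `withGoto` and the `Action` shape are assumptions made for the example, loosely modeled on the exact-URL match the list describes, and are not the package's exported API.

```typescript
// Illustrative sketch only: `Action` and `withGoto` are hypothetical names.
interface Action {
  id?: string;
  name?: string;
  action?: string;
  params?: { url?: string; [key: string]: unknown };
}

// Prepend a `goto` for the resolved search URL unless the template already
// contains a `goto` action whose URL matches it exactly.
function withGoto(actions: Action[], resolvedUrl: string): Action[] {
  const alreadyNavigates = actions.some(
    (a) => (a.id ?? a.name ?? a.action) === 'goto' && a.params?.url === resolvedUrl,
  );
  return alreadyNavigates
    ? [...actions]
    : [{ id: 'goto', params: { url: resolvedUrl } }, ...actions];
}

const searchUrl = 'https://example.com/search?q=term';

// No goto in the template: one is injected at the front.
console.log(withGoto([{ id: 'extract' }], searchUrl));

// A login goto that does not match the search URL: the search goto is still prepended.
console.log(withGoto([{ id: 'goto', params: { url: 'https://example.com/login' } }], searchUrl));
```

Returning a new array instead of mutating the input keeps the template's `actions` untouched between searches, which fits the immutability rule stated earlier.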
|
|
60
69
|
|
|
70
|
+
```typescript
|
|
61
71
|
try {
|
|
62
72
|
// First query
|
|
63
73
|
// You can also pass runtime options to override session defaults or inject variables
|
|
64
74
|
const results1 = await google.search('term A', {
|
|
65
|
-
timeoutMs: 60000, // Override timeout just for this search
|
|
75
|
+
timeoutMs: 60000, // Override session timeout just for this search
|
|
66
76
|
extraParam: 'value' // Can be used in template as ${extraParam}
|
|
67
77
|
});
|
|
68
78
|
|
|
@@ -172,24 +182,43 @@ protected override async transform(outputs: Record<string, any>) {
|
|
|
172
182
|
}
|
|
173
183
|
```
|
|
174
184
|
|
|
175
|
-
|
|
185
|
+
### 🧠 Advanced Concepts
|
|
186
|
+
|
|
187
|
+
### Auto-Pagination: `limit` vs `maxPages`
|
|
188
|
+
|
|
189
|
+
The `WebSearcher` is designed to be result-oriented. When you call `search()`, you specify how many results you want, and the searcher handles the pagination logic.
|
|
176
190
|
|
|
177
|
-
|
|
191
|
+
- **`limit`**: Your target number of total results.
|
|
192
|
+
- **`maxPages`**: The safety threshold. It limits how many pages (fetch cycles) the searcher is allowed to navigate to satisfy your `limit`.
|
|
178
193
|
|
|
179
|
-
|
|
194
|
+
**Example Logic:**
|
|
195
|
+
If you request `{ limit: 50 }` but each page only has 5 results:
|
|
196
|
+
|
|
197
|
+
1. The searcher fetches page 1 (5 results).
|
|
198
|
+
2. It sees `5 < 50`, so it fetches page 2.
|
|
199
|
+
3. It continues until it has 50 results **OR** it reaches `maxPages` (default 10).
|
|
200
|
+
|
|
201
|
+
This prevents infinite loops if the "Next" button selector is broken or if the search engine keeps returning the same results.
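The interplay of `limit` and `maxPages` can be sketched as a bounded loop. This is a minimal sketch under stated assumptions: `collectResults` and the `fetchPage` callback are hypothetical stand-ins for the searcher's page-fetching step, and the defaults of 10 mirror the documented defaults rather than any guaranteed internal behavior.

```typescript
// Illustrative sketch only: `collectResults` and `fetchPage` are hypothetical.
interface CollectOptions<T> {
  limit?: number;
  maxPages?: number;
  transform?: (batch: T[]) => T[];
}

async function collectResults<T>(
  fetchPage: (pageIndex: number) => Promise<T[]>,
  options: CollectOptions<T> = {},
): Promise<T[]> {
  const limit = options.limit || 10;
  const maxPages = options.maxPages || 10;
  const collected: T[] = [];
  let page = 0;
  while (collected.length < limit) {
    let batch = await fetchPage(page);
    // A user filter shrinks the batch, which makes the loop fetch further pages.
    if (options.transform) batch = options.transform(batch);
    // In this sketch an empty page (or a page filtered down to nothing) ends the loop.
    if (batch.length === 0) break;
    collected.push(...batch);
    page++;
    // Safety threshold: stop even if `limit` has not been reached.
    if (page >= maxPages) break;
  }
  return collected.slice(0, limit);
}

// Five results per page: limit 50 needs 10 pages, while maxPages 3 stops at 15 results.
const fivePerPage = async (p: number) => [0, 1, 2, 3, 4].map((i) => p * 5 + i);
collectResults(fivePerPage, { limit: 50 }).then((r) => console.log(r.length));
collectResults(fivePerPage, { limit: 50, maxPages: 3 }).then((r) => console.log(r.length));
```

The final `slice` trims any overshoot when the last page pushes the total past `limit`.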
|
|
180
202
|
|
|
181
203
|
### User-defined Transforms
|
|
182
204
|
|
|
183
205
|
Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.
|
|
184
206
|
|
|
207
|
+
This is extremely powerful for **filtering out ads** or irrelevant content. If the user filters out results, the auto-pagination logic will automatically kick in to fetch more pages to ensure the final result list meets your requested `limit` with only valid entries.
|
|
208
|
+
|
|
185
209
|
```typescript
|
|
186
210
|
await google.search('test', {
|
|
187
|
-
|
|
211
|
+
limit: 20,
|
|
212
|
+
// Example: Filter out sponsored results and only keep PDFs
|
|
213
|
+
transform: (results) => {
|
|
214
|
+
return results.filter(r => {
|
|
215
|
+
const isAd = r.isSponsored || r.url.includes('googleadservices.com');
|
|
216
|
+
return !isAd && r.url.endsWith('.pdf');
|
|
217
|
+
});
|
|
218
|
+
}
|
|
188
219
|
});
|
|
189
220
|
```
|
|
190
221
|
|
|
191
|
-
If the user filters out results, the auto-pagination logic will kick in to fetch more pages to meet the requested limit.
|
|
192
|
-
|
|
193
222
|
### Standardized Search Options
|
|
194
223
|
|
|
195
224
|
When calling `search()`, you can provide standardized options that the search engine will map to specific parameters:
|
|
@@ -197,7 +226,7 @@ When calling `search()`, you can provide standardized options that the search en
|
|
|
197
226
|
```typescript
|
|
198
227
|
const results = await google.search('open source', {
|
|
199
228
|
limit: 20,
|
|
200
|
-
timeRange: 'month', // 'day', 'week', 'month', 'year'
|
|
229
|
+
timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
|
|
201
230
|
// Or custom range:
|
|
202
231
|
// timeRange: { from: '2023-01-01', to: '2023-12-31' },
|
|
203
232
|
category: 'news', // 'all', 'images', 'videos', 'news'
|
package/dist/index.d.mts
CHANGED
|
@@ -15,7 +15,17 @@ interface StandardSearchResult {
|
|
|
15
15
|
snippet?: string;
|
|
16
16
|
/** An optional image URL associated with the result. */
|
|
17
17
|
image?: string;
|
|
18
|
-
/**
|
|
18
|
+
/** The date the result was published or last updated. */
|
|
19
|
+
date?: string | Date;
|
|
20
|
+
/** The author or source name of the result. */
|
|
21
|
+
author?: string;
|
|
22
|
+
/** The favicon URL of the source website. */
|
|
23
|
+
favicon?: string;
|
|
24
|
+
/** The rank or position of the result (usually 1-indexed). */
|
|
25
|
+
rank?: number;
|
|
26
|
+
/** The source website name (e.g., 'GitHub', 'StackOverflow'). */
|
|
27
|
+
source?: string;
|
|
28
|
+
/** Allows for engine-specific extra fields (e.g., siteIcon, category). */
|
|
19
29
|
[key: string]: any;
|
|
20
30
|
}
|
|
21
31
|
/**
|
|
@@ -52,6 +62,16 @@ interface PaginationConfig {
|
|
|
52
62
|
* Required if type is 'click-next'.
|
|
53
63
|
*/
|
|
54
64
|
nextButtonSelector?: string;
|
|
65
|
+
/**
|
|
66
|
+
* The safety threshold for the maximum number of pages to fetch automatically
|
|
67
|
+
* in a single search call.
|
|
68
|
+
*
|
|
69
|
+
* Even if the requested `limit` of results hasn't been reached, the searcher
|
|
70
|
+
* will stop after this many pages to prevent infinite loops or excessive API usage.
|
|
71
|
+
*
|
|
72
|
+
* @default 10
|
|
73
|
+
*/
|
|
74
|
+
maxPages?: number;
|
|
55
75
|
}
|
|
56
76
|
/**
|
|
57
77
|
* Context object passed to the transform function.
|
|
@@ -64,7 +84,7 @@ interface SearchContext {
|
|
|
64
84
|
/** The requested limit of results. */
|
|
65
85
|
limit?: number;
|
|
66
86
|
}
|
|
67
|
-
type SearchTimeRangePreset = 'all' | 'day' | 'week' | 'month' | 'year';
|
|
87
|
+
type SearchTimeRangePreset = 'all' | 'hour' | 'day' | 'week' | 'month' | 'year';
|
|
68
88
|
interface CustomTimeRange {
|
|
69
89
|
/** Start date (Date object or string like 'YYYY-MM-DD'). */
|
|
70
90
|
from: Date | string;
|
|
@@ -80,6 +100,15 @@ type SafeSearchLevel = 'off' | 'moderate' | 'strict';
|
|
|
80
100
|
interface SearchOptions {
|
|
81
101
|
/** The maximum number of results to retrieve. */
|
|
82
102
|
limit?: number;
|
|
103
|
+
/**
|
|
104
|
+
* The maximum number of pages (fetch cycles) allowed to reach the requested `limit`.
|
|
105
|
+
*
|
|
106
|
+
* This is a safety guard. If the `limit` is high but each page has few results,
|
|
107
|
+
* the searcher will stop once this page count is reached.
|
|
108
|
+
*
|
|
109
|
+
* If not provided, it defaults to the value in `PaginationConfig` or 10.
|
|
110
|
+
*/
|
|
111
|
+
maxPages?: number;
|
|
83
112
|
/**
|
|
84
113
|
* Date range for the search results.
|
|
85
114
|
* Default: 'all'
|
package/dist/index.d.ts
CHANGED
|
@@ -15,7 +15,17 @@ interface StandardSearchResult {
|
|
|
15
15
|
snippet?: string;
|
|
16
16
|
/** An optional image URL associated with the result. */
|
|
17
17
|
image?: string;
|
|
18
|
-
/**
|
|
18
|
+
/** The date the result was published or last updated. */
|
|
19
|
+
date?: string | Date;
|
|
20
|
+
/** The author or source name of the result. */
|
|
21
|
+
author?: string;
|
|
22
|
+
/** The favicon URL of the source website. */
|
|
23
|
+
favicon?: string;
|
|
24
|
+
/** The rank or position of the result (usually 1-indexed). */
|
|
25
|
+
rank?: number;
|
|
26
|
+
/** The source website name (e.g., 'GitHub', 'StackOverflow'). */
|
|
27
|
+
source?: string;
|
|
28
|
+
/** Allows for engine-specific extra fields (e.g., siteIcon, category). */
|
|
19
29
|
[key: string]: any;
|
|
20
30
|
}
|
|
21
31
|
/**
|
|
@@ -52,6 +62,16 @@ interface PaginationConfig {
|
|
|
52
62
|
* Required if type is 'click-next'.
|
|
53
63
|
*/
|
|
54
64
|
nextButtonSelector?: string;
|
|
65
|
+
/**
|
|
66
|
+
* The safety threshold for the maximum number of pages to fetch automatically
|
|
67
|
+
* in a single search call.
|
|
68
|
+
*
|
|
69
|
+
* Even if the requested `limit` of results hasn't been reached, the searcher
|
|
70
|
+
* will stop after this many pages to prevent infinite loops or excessive API usage.
|
|
71
|
+
*
|
|
72
|
+
* @default 10
|
|
73
|
+
*/
|
|
74
|
+
maxPages?: number;
|
|
55
75
|
}
|
|
56
76
|
/**
|
|
57
77
|
* Context object passed to the transform function.
|
|
@@ -64,7 +84,7 @@ interface SearchContext {
|
|
|
64
84
|
/** The requested limit of results. */
|
|
65
85
|
limit?: number;
|
|
66
86
|
}
|
|
67
|
-
type SearchTimeRangePreset = 'all' | 'day' | 'week' | 'month' | 'year';
|
|
87
|
+
type SearchTimeRangePreset = 'all' | 'hour' | 'day' | 'week' | 'month' | 'year';
|
|
68
88
|
interface CustomTimeRange {
|
|
69
89
|
/** Start date (Date object or string like 'YYYY-MM-DD'). */
|
|
70
90
|
from: Date | string;
|
|
@@ -80,6 +100,15 @@ type SafeSearchLevel = 'off' | 'moderate' | 'strict';
|
|
|
80
100
|
interface SearchOptions {
|
|
81
101
|
/** The maximum number of results to retrieve. */
|
|
82
102
|
limit?: number;
|
|
103
|
+
/**
|
|
104
|
+
* The maximum number of pages (fetch cycles) allowed to reach the requested `limit`.
|
|
105
|
+
*
|
|
106
|
+
* This is a safety guard. If the `limit` is high but each page has few results,
|
|
107
|
+
* the searcher will stop once this page count is reached.
|
|
108
|
+
*
|
|
109
|
+
* If not provided, it defaults to the value in `PaginationConfig` or 10.
|
|
110
|
+
*/
|
|
111
|
+
maxPages?: number;
|
|
83
112
|
/**
|
|
84
113
|
* Date range for the search results.
|
|
85
114
|
* Default: 'all'
|
package/dist/index.js
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
"use strict";var t
|
|
1
|
+
"use strict";var e,t=Object.defineProperty,r=Object.getOwnPropertyDescriptor,s=Object.getOwnPropertyNames,a=Object.prototype.hasOwnProperty,i={};((e,r)=>{for(var s in r)t(e,s,{get:r[s],enumerable:!0})})(i,{GoogleSearcher:()=>f,WebSearcher:()=>h}),module.exports=(e=i,((e,i,n,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of s(i))a.call(e,c)||c===n||t(e,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return e})(t({},"__esModule",{value:!0}),e));var n=require("@isdk/web-fetcher"),o=require("custom-factory"),c=require("lodash-es");function l(e,t){if("string"==typeof e)return e.replace(/\$\{(.*?)\}/g,(e,r)=>{const s=t[r.trim()];return void 0!==s?String(s):""});if(Array.isArray(e))return e.map(e=>l(e,t));if((0,c.isPlainObject)(e)){const r={};for(const s in e)Object.prototype.hasOwnProperty.call(e,s)&&(r[s]=l(e[s],t));return r}return e}var u=require("lodash-es"),h=class extends n.FetchSession{static async search(e,t,r={}){const s=this.createObject(e,r);if(!s)throw new Error(`Search engine not found: ${e}`);try{return await s.search(t,r)}finally{await s.dispose()}}get pagination(){}createContext(e=this.options){const t=this.template,r=(0,u.defaultsDeep)({},t,e);return t.engine&&"auto"!==t.engine||!e.engine||(r.engine=e.engine),super.createContext(r)}async search(e,t={}){const r=t.limit||10,s=[];let a=0;const i=this.pagination?.startValue??0,n=this.pagination?.increment??1,o=t.maxPages||this.pagination?.maxPages||10;for(;s.length<r;){const 
c=this.formatOptions(t),h=i+a*n,f={...t,...c,query:e,page:a+i,offset:h,limit:r},m=l(this.template,f),{actions:d,...g}=t,p=(0,u.defaultsDeep)({},m,g),w=[],y=p.actions||[];if(0===a||"url-param"===this.pagination?.type){if(p.url){y.some(e=>"goto"===(e.id??e.name??e.action)&&e.params?.url===p.url)||w.push({id:"goto",params:{url:p.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(w.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),w.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));w.push(...y),p.engine&&this.context.engine!==p.engine&&p.engine;const{outputs:b}=await this.executeAll(w),q={query:e,page:a,limit:t.limit};let $=[];if($=await this.transform(b,q),t.transform&&($=await t.transform($,q)),!$||0===$.length)break;if(s.push(...$),s.length>=r||!this.pagination)break;if(a++,a>=o)break}return s.slice(0,r)}async transform(e,t){return e.results||[]}formatOptions(e){return{...e}}};h._isFactory=!1,(0,o.addBaseFactoryAbility)(h),h.prototype.name="Searcher";var f=class extends h{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(e){const t={};if(e.timeRange)if("string"==typeof e.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[e.timeRange]&&(t.tbs=r[e.timeRange])}else{const r=new Date(e.timeRange.from),s=e.timeRange.to?new Date(e.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(s.getTime())){const 
e=e=>`${e.getMonth()+1}/${e.getDate()}/${e.getFullYear()}`;t.tbs=`cdr:1,cd_min:${e(r)},cd_max:${e(s)}`}}if(e.category){const r={images:"isch",videos:"vid",news:"nws"};r[e.category]&&(t.tbm=r[e.category])}return e.region&&(t.gl=e.region),e.language&&(t.hl=e.language),e.safeSearch&&("strict"===e.safeSearch?t.safe="active":"off"===e.safeSearch&&(t.safe="images")),t}async transform(e){const t=e.results||[];return Array.isArray(t)?t.map(e=>{if(e.url&&e.url.startsWith("/url?q="))try{const t=new URL(e.url,"https://www.google.com").searchParams.get("q");t&&(e.url=t)}catch(e){}return e}):[]}};f.alias=["google"];
|
package/dist/index.mjs
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
import{FetchSession as t}from"@isdk/web-fetcher";import{addBaseFactoryAbility as r}from"custom-factory";import{isPlainObject as e}from"lodash-es";function s(t,r){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,e)=>{const s=r[e.trim()];return void 0!==s?String(s):""});if(Array.isArray(t))return t.map(t=>s(t,r));if(e(t)){const e={};for(const a in t)Object.prototype.hasOwnProperty.call(t,a)&&(e[a]=s(t[a],r));return e}return t}import{defaultsDeep as a}from"lodash-es";var i=class extends t{static async search(t,r,e={}){const s=this.createObject(t,e);if(!s)throw new Error(`Search engine not found: ${t}`);try{return await s.search(r,e)}finally{await s.dispose()}}get pagination(){}createContext(t=this.options){const r=this.template,e=a({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(e.engine=t.engine),super.createContext(e)}async search(t,r={}){const e=r.limit||10,i=[];let o=0;const n=this.pagination?.startValue??0,c=this.pagination?.increment??1;for(;i.length<e;){const l=this.formatOptions(r),m=n+o*c,
|
|
1
|
+
import{FetchSession as t}from"@isdk/web-fetcher";import{addBaseFactoryAbility as r}from"custom-factory";import{isPlainObject as e}from"lodash-es";function s(t,r){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,e)=>{const s=r[e.trim()];return void 0!==s?String(s):""});if(Array.isArray(t))return t.map(t=>s(t,r));if(e(t)){const e={};for(const a in t)Object.prototype.hasOwnProperty.call(t,a)&&(e[a]=s(t[a],r));return e}return t}import{defaultsDeep as a}from"lodash-es";var i=class extends t{static async search(t,r,e={}){const s=this.createObject(t,e);if(!s)throw new Error(`Search engine not found: ${t}`);try{return await s.search(r,e)}finally{await s.dispose()}}get pagination(){}createContext(t=this.options){const r=this.template,e=a({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(e.engine=t.engine),super.createContext(e)}async search(t,r={}){const e=r.limit||10,i=[];let o=0;const n=this.pagination?.startValue??0,c=this.pagination?.increment??1,h=r.maxPages||this.pagination?.maxPages||10;for(;i.length<e;){const l=this.formatOptions(r),m=n+o*c,f={...r,...l,query:t,page:o+n,offset:m,limit:e},u=s(this.template,f),{actions:p,...d}=r,w=a({},u,d),g=[],y=w.actions||[];if(0===o||"url-param"===this.pagination?.type){if(w.url){y.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===w.url)||g.push({id:"goto",params:{url:w.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(g.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),g.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));g.push(...y),w.engine&&this.context.engine!==w.engine&&w.engine;const{outputs:$}=await this.executeAll(g),b={query:t,page:o,limit:r.limit};let q=[];if(q=await this.transform($,b),r.transform&&(q=await r.transform(q,b)),!q||0===q.length)break;if(i.push(...q),i.length>=e||!this.pagination)break;if(o++,o>=h)break}return i.slice(0,e)}async transform(t,r){return 
t.results||[]}formatOptions(t){return{...t}}};i._isFactory=!1,r(i),i.prototype.name="Searcher";var o=class extends i{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const r={};if(t.timeRange)if("string"==typeof t.timeRange){const e={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};e[t.timeRange]&&(r.tbs=e[t.timeRange])}else{const e=new Date(t.timeRange.from),s=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(e.getTime())&&!isNaN(s.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;r.tbs=`cdr:1,cd_min:${t(e)},cd_max:${t(s)}`}}if(t.category){const e={images:"isch",videos:"vid",news:"nws"};e[t.category]&&(r.tbm=e[t.category])}return t.region&&(r.gl=t.region),t.language&&(r.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?r.safe="active":"off"===t.safeSearch&&(r.safe="images")),r}async transform(t){const r=t.results||[];return Array.isArray(r)?r.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const r=new URL(t.url,"https://www.google.com").searchParams.get("q");r&&(t.url=r)}catch(t){}return t}):[]}};o.alias=["google"];export{o as GoogleSearcher,i as WebSearcher};
|
package/docs/README.md
CHANGED
|
@@ -46,27 +46,37 @@ console.log(results);
|
|
|
46
46
|
|
|
47
47
|
Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
|
|
48
48
|
|
|
49
|
-
|
|
50
|
-
When creating a session, options are merged in the following order:
|
|
49
|
+
### 🛡️ Core Principle: Template is Law
|
|
51
50
|
|
|
52
|
-
|
|
53
|
-
2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).
|
|
51
|
+
The `template` defined in the `WebSearcher` subclass acts as the authoritative "blueprint".
|
|
54
52
|
|
|
55
|
-
|
|
53
|
+
- **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
|
|
54
|
+
- **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
|
|
55
|
+
- **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
|
|
56
56
|
|
|
57
57
|
```typescript
|
|
58
58
|
// Create a persistent session
|
|
59
59
|
const google = new GoogleSearcher({
|
|
60
|
-
headless: false, // Override
|
|
60
|
+
headless: false, // Override if not locked in template
|
|
61
61
|
proxy: 'http://my-proxy:8080',
|
|
62
|
-
timeoutMs: 30000 // Set a global timeout
|
|
62
|
+
timeoutMs: 30000 // Set a global timeout (valid if template doesn't define it)
|
|
63
63
|
});
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
### 🧠 Intelligent Navigation (Goto)
|
|
67
|
+
|
|
68
|
+
The `WebSearcher` automatically manages navigation to the search URL.
|
|
69
|
+
|
|
70
|
+
1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).
|
|
71
|
+
2. **Manual Control**: If you explicitly add a `goto` action in your template that matches the resolved URL, the searcher detects this duplicate and **skips** the automatic injection. This gives you full control to add headers, referrer, or other specific parameters to the navigation step if needed.
|
|
72
|
+
3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (e.g., visit a login page first). The searcher will still prepend the main search URL navigation unless one of your `goto` actions matches it exactly.
|
|
64
73
|
|
|
74
|
+
```typescript
|
|
65
75
|
try {
|
|
66
76
|
// First query
|
|
67
77
|
// You can also pass runtime options to override session defaults or inject variables
|
|
68
78
|
const results1 = await google.search('term A', {
|
|
69
|
-
timeoutMs: 60000, // Override timeout just for this search
|
|
79
|
+
timeoutMs: 60000, // Override session timeout just for this search
|
|
70
80
|
extraParam: 'value' // Can be used in template as ${extraParam}
|
|
71
81
|
});
|
|
72
82
|
|
|
@@ -176,24 +186,43 @@ protected override async transform(outputs: Record<string, any>) {
|
|
|
176
186
|
}
|
|
177
187
|
```
|
|
178
188
|
|
|
179
|
-
|
|
189
|
+
### 🧠 Advanced Concepts
|
|
190
|
+
|
|
191
|
+
### Auto-Pagination: `limit` vs `maxPages`
|
|
192
|
+
|
|
193
|
+
The `WebSearcher` is designed to be result-oriented. When you call `search()`, you specify how many results you want, and the searcher handles the pagination logic.
|
|
180
194
|
|
|
181
|
-
|
|
195
|
+
- **`limit`**: Your target number of total results.
|
|
196
|
+
- **`maxPages`**: The safety threshold. It limits how many pages (fetch cycles) the searcher is allowed to navigate to satisfy your `limit`.
|
|
182
197
|
|
|
183
|
-
|
|
198
|
+
**Example Logic:**
|
|
199
|
+
If you request `{ limit: 50 }` but each page only has 5 results:
|
|
200
|
+
|
|
201
|
+
1. The searcher fetches page 1 (5 results).
|
|
202
|
+
2. It sees `5 < 50`, so it fetches page 2.
|
|
203
|
+
3. It continues until it has 50 results **OR** it reaches `maxPages` (default 10).
|
|
204
|
+
|
|
205
|
+
This prevents infinite loops if the "Next" button selector is broken or if the search engine keeps returning the same results.
|
|
184
206
|
|
|
185
207
|
### User-defined Transforms
|
|
186
208
|
|
|
187
209
|
Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.
|
|
188
210
|
|
|
211
|
+
This is extremely powerful for **filtering out ads** or irrelevant content. If the user filters out results, the auto-pagination logic will automatically kick in to fetch more pages to ensure the final result list meets your requested `limit` with only valid entries.
|
|
212
|
+
|
|
189
213
|
```typescript
|
|
190
214
|
await google.search('test', {
|
|
191
|
-
|
|
215
|
+
limit: 20,
|
|
216
|
+
// Example: Filter out sponsored results and only keep PDFs
|
|
217
|
+
transform: (results) => {
|
|
218
|
+
return results.filter(r => {
|
|
219
|
+
const isAd = r.isSponsored || r.url.includes('googleadservices.com');
|
|
220
|
+
return !isAd && r.url.endsWith('.pdf');
|
|
221
|
+
});
|
|
222
|
+
}
|
|
192
223
|
});
|
|
193
224
|
```
|
|
194
225
|
|
|
195
|
-
If the user filters out results, the auto-pagination logic will kick in to fetch more pages to meet the requested limit.
|
|
196
|
-
|
|
197
226
|
### Standardized Search Options
|
|
198
227
|
|
|
199
228
|
When calling `search()`, you can provide standardized options that the search engine will map to specific parameters:
|
|
@@ -201,7 +230,7 @@ When calling `search()`, you can provide standardized options that the search en
|
|
|
201
230
|
```typescript
|
|
202
231
|
const results = await google.search('open source', {
|
|
203
232
|
limit: 20,
|
|
204
|
-
timeRange: 'month', // 'day', 'week', 'month', 'year'
|
|
233
|
+
timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
|
|
205
234
|
// Or custom range:
|
|
206
235
|
// timeRange: { from: '2023-01-01', to: '2023-12-31' },
|
|
207
236
|
category: 'news', // 'all', 'images', 'videos', 'news'
|