@isdk/web-searcher 0.1.1
- package/README.cn.md +274 -0
- package/README.md +274 -0
- package/dist/index.d.mts +321 -0
- package/dist/index.d.ts +321 -0
- package/dist/index.js +1 -0
- package/dist/index.mjs +1 -0
- package/docs/README.md +278 -0
- package/docs/classes/GoogleSearcher.md +695 -0
- package/docs/classes/WebSearcher.md +661 -0
- package/docs/globals.md +26 -0
- package/docs/interfaces/CustomTimeRange.md +29 -0
- package/docs/interfaces/PaginationConfig.md +86 -0
- package/docs/interfaces/SearchContext.md +41 -0
- package/docs/interfaces/SearchOptions.md +105 -0
- package/docs/interfaces/StandardSearchResult.md +58 -0
- package/docs/type-aliases/SafeSearchLevel.md +11 -0
- package/docs/type-aliases/SearchCategory.md +11 -0
- package/docs/type-aliases/SearchTimeRange.md +11 -0
- package/docs/type-aliases/SearchTimeRangePreset.md +11 -0
- package/docs/type-aliases/SearcherConstructor.md +23 -0
- package/package.json +87 -0
package/docs/README.md
ADDED
@@ -0,0 +1,278 @@

**@isdk/web-searcher**

***

# Search Module

The Search module provides a high-level, class-based framework for building search engine scrapers. It is built on top of `@isdk/web-fetcher` and extends it to handle **multi-page navigation**, **session persistence**, and **result standardization**.

## 🌟 Why use the Search Module?

Building a robust search scraper involves more than just fetching a URL. You often need to handle:

- **Pagination**: Automatically click "Next" or modify URL parameters until you have enough results.
- **Session Management**: Maintain cookies and headers across multiple search queries.
- **Data Cleaning**: Parse raw HTML and resolve redirect links.
- **Flexibility**: Switch between HTTP (fast) and Browser (anti-bot) modes easily.

This module encapsulates these patterns into a reusable `Searcher` class.

## 🚀 Quick Start

### 1. One-off Search

Use the static `Searcher.search` method for quick, disposable tasks. It automatically creates a session, fetches results, and cleans up.

```typescript
import { Searcher } from '@isdk/web-fetcher/search';
import { GoogleSearcher } from '@isdk/web-fetcher/search/engines/google';

// Register the engine (only needs to be done once)
Searcher.register(GoogleSearcher);

// Search!
// The 'limit' parameter ensures we fetch enough pages to get 20 results.
// Note: The engine name is case-sensitive and derived from the class name (e.g., 'GoogleSearcher' -> 'Google')
const results = await Searcher.search('Google', 'open source', { limit: 20 });

console.log(results);
```
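The naming rule mentioned in the comment above (`'GoogleSearcher'` -> `'Google'`, case-sensitive) can be pictured with a small registry. This is a minimal sketch under assumptions, not the package's actual implementation: `deriveEngineName`, `register`, and `lookup` are hypothetical helpers, and stripping a trailing `Searcher` suffix is an inference from the example.

```typescript
// Hypothetical registry; the names are illustrative, not the library's internals.
type EngineClass = { name: string; alias?: string[] };

const registry = new Map<string, EngineClass>();

// Strip a trailing "Searcher" suffix: 'GoogleSearcher' -> 'Google'.
function deriveEngineName(className: string): string {
  return className.replace(/Searcher$/, '');
}

// Store the engine under its derived name and under every alias.
function register(engine: EngineClass): void {
  registry.set(deriveEngineName(engine.name), engine);
  for (const alias of engine.alias ?? []) {
    registry.set(alias, engine);
  }
}

// Lookups are case-sensitive because Map keys are compared exactly.
function lookup(name: string): EngineClass | undefined {
  return registry.get(name);
}

register({ name: 'GoogleSearcher', alias: ['g'] });
console.log(lookup('Google') !== undefined); // true
console.log(lookup('google') !== undefined); // false
```

An exact-match map keeps lookups predictable, which is one plausible reason for making engine names case-sensitive.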

### 2. Stateful Session

Since `Searcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or for avoiding bot detection by behaving more like a human user.

**Configuration Precedence:**
When creating a session, options are merged in the following order:
1. **Template Default**: Defined in the Searcher class (highest priority for structural options).
2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).

*Note: If the template sets `engine: 'auto'` (the default), a user-provided `engine` option is respected.*

```typescript
// Create a persistent session
const google = new GoogleSearcher({
  headless: false, // Override default options (e.g., show browser)
  proxy: 'http://my-proxy:8080',
  timeoutMs: 30000 // Set a global timeout for requests
});

try {
  // First query
  // You can also pass runtime options to override session defaults or inject variables
  const results1 = await google.search('term A', {
    timeoutMs: 60000, // Override timeout just for this search
    extraParam: 'value' // Can be used in template as ${extraParam}
  });

  // Second query (reuses the same browser window/cookies)
  const results2 = await google.search('term B');
} finally {
  // Always dispose to close the browser/release resources
  await google.dispose();
}
```
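The precedence rules described in this section can be sketched as a small merge function. This is only an illustration under assumptions: `SketchOptions` and `mergeOptions` are hypothetical, and treating `url` as the one "structural" option is a guess, since the README does not list which options count as structural.

```typescript
// Hypothetical option shape; the real FetcherOptions type has more fields.
interface SketchOptions {
  engine?: 'auto' | 'http' | 'browser';
  url?: string;
  headless?: boolean;
  timeoutMs?: number;
}

function mergeOptions(template: SketchOptions, user: SketchOptions): SketchOptions {
  // User options fill gaps and override non-structural template defaults.
  const merged: SketchOptions = { ...template, ...user };

  // Assumption: `url` is structural, so the template value always wins.
  if (template.url !== undefined) {
    merged.url = template.url;
  }

  // A concrete template engine wins; 'auto' (or unset) defers to the user.
  const templateEngine = template.engine ?? 'auto';
  merged.engine = templateEngine === 'auto' ? (user.engine ?? 'auto') : templateEngine;
  return merged;
}

const merged = mergeOptions(
  { engine: 'auto', url: 'https://blog.example.com/search?q=${query}' },
  { engine: 'browser', headless: false, timeoutMs: 30000 }
);
console.log(merged.engine); // 'browser'
```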

## 🛠️ Implementing a New Search Engine

To support a new website, create a class that extends `Searcher`.

### Step 1: Define the Template

The engine name is automatically derived from the class name (e.g., `MyBlogSearcher` -> `MyBlog`), but you can customize it and add aliases using static properties.

The `template` property defines the blueprint for your search. It's a standard `FetcherOptions` object but supports **variable injection**.

Supported variables:

- `${query}`: The search string.
- `${page}`: Current page number (starts at 0 or 1 based on config).
- `${offset}`: Current item offset (e.g., 0, 10, 20).
- `${limit}`: The requested limit.

```typescript
import { Searcher } from '@isdk/web-fetcher/search';
import { FetcherOptions } from '@isdk/web-fetcher/types';

export class MyBlogSearcher extends Searcher {
  static name = 'blog'; // Custom name (case-sensitive)
  static alias = ['myblog', 'news'];

  protected get template(): FetcherOptions {
    return {
      engine: 'http', // Use 'browser' if the site has anti-bot protection
      // Dynamic URL with variables
      url: 'https://blog.example.com/search?q=${query}&page=${page}',
      actions: [
        {
          id: 'extract',
          storeAs: 'results', // MUST store results here
          params: {
            type: 'array',
            selector: 'article.post',
            items: {
              title: { selector: 'h2' },
              url: { selector: 'a', attribute: 'href' }
            }
          }
        }
      ]
    };
  }
}
```
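Variable injection in the template URL can be pictured as plain placeholder substitution. The sketch below is a stand-in for illustration, not the library's own code: `renderTemplate` is a hypothetical helper, and both URL-encoding the values and rendering unknown placeholders as an empty string are assumptions.

```typescript
// Values available to the template; extra keys model custom variables.
type TemplateVars = Record<string, string | number>;

// Replace each ${name} placeholder with its URL-encoded value.
// Assumption: unknown placeholders become an empty string.
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\$\{(\w+)\}/g, (_match, name: string) =>
    name in vars ? encodeURIComponent(String(vars[name])) : ''
  );
}

// Single quotes keep ${...} literal, so TypeScript does not interpolate it.
const url = renderTemplate(
  'https://blog.example.com/search?q=${query}&page=${page}',
  { query: 'open source', page: 2 }
);
console.log(url); // https://blog.example.com/search?q=open%20source&page=2
```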

### Step 2: Configure Pagination

Tell the `Searcher` how to navigate to the next page by implementing the `pagination` getter.

#### Option A: URL Parameters (Offset/Page)

Best for stateless HTTP scraping.

```typescript
protected override get pagination() {
  return {
    type: 'url-param',
    paramName: 'page',
    startValue: 1, // First page is 1
    increment: 1   // Add 1 for next page
  };
}
```

#### Option B: Click "Next" Button

Best for SPAs or complex session-based sites. Requires `engine: 'browser'`.

```typescript
protected override get pagination() {
  return {
    type: 'click-next',
    nextButtonSelector: 'a.next-page-btn'
  };
}
```
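The two options above fit naturally into a discriminated union keyed on `type`. The sketch below assumes only the fields shown in these examples; `PaginationSketch` and `describeStep` are hypothetical names, and the package's real `PaginationConfig` interface may carry more fields.

```typescript
// Hypothetical shapes mirroring the two examples above.
type UrlParamPagination = {
  type: 'url-param';
  paramName: string;
  startValue: number;
  increment: number;
};

type ClickNextPagination = {
  type: 'click-next';
  nextButtonSelector: string;
};

type PaginationSketch = UrlParamPagination | ClickNextPagination;

// Describe what moving to page `pageIndex` (0-based) would involve.
function describeStep(config: PaginationSketch, pageIndex: number): string {
  switch (config.type) {
    case 'url-param':
      // The parameter value grows linearly from startValue.
      return `set ${config.paramName}=${config.startValue + pageIndex * config.increment}`;
    case 'click-next':
      // Browser mode: no URL change, just a click on the selector.
      return `click ${config.nextButtonSelector}`;
  }
}

console.log(
  describeStep({ type: 'url-param', paramName: 'page', startValue: 1, increment: 1 }, 2)
); // set page=3
```

Switching on `type` lets the compiler narrow each branch, so `nextButtonSelector` is only reachable for the click-based variant.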

### Step 3: Transform & Clean Data

Override `transform` to clean data. Since `Searcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.

```typescript
protected override async transform(outputs: Record<string, any>) {
  const results = outputs['results'] || [];

  // Clean data or filter
  return results.map(item => ({
    ...item,
    // Guard against entries where the title selector matched nothing
    title: (item.title ?? '').trim(),
    // Resolve relative links against the site origin
    url: new URL(item.url, 'https://blog.example.com').href
  }));
}
```
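Many result pages wrap outbound links in a redirect URL. A `transform` could unwrap such links with a helper like the one sketched here. `unwrapRedirect` is hypothetical, and the assumption that the destination sits in a `q` or `url` query parameter will not hold for every site.

```typescript
// Assumption: the redirect wrapper carries the destination in a `q` or `url`
// query parameter; adjust the names for the site you are scraping.
function unwrapRedirect(href: string, base: string): string {
  const parsed = new URL(href, base);
  const target = parsed.searchParams.get('q') ?? parsed.searchParams.get('url');

  // Only accept absolute http(s) targets; otherwise keep the resolved link.
  if (target !== null && /^https?:\/\//i.test(target)) {
    return target;
  }
  return parsed.href;
}

console.log(
  unwrapRedirect('/url?q=https%3A%2F%2Fexample.org%2Fpost&sa=U', 'https://search.example.com')
); // https://example.org/post

console.log(unwrapRedirect('/posts/1', 'https://blog.example.com'));
// https://blog.example.com/posts/1
```

Checking the scheme before trusting the parameter avoids mistaking an ordinary `q=search terms` value for a destination URL.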

## 🧠 Advanced Concepts

### Auto-Pagination & Filtering

`Searcher` paginates automatically. If you request `limit: 10` but the first page returns only 5 results (or your `transform` filters some out), it keeps fetching the next page until the limit is met.
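The behaviour described here amounts to a fetch loop that stops once enough results survive filtering. The sketch below models it with a hypothetical `fetchPage` callback instead of real network calls. The `maxPages` cap and the stop-on-empty-page rule are assumptions added so the sketch always terminates; the library may use different stopping conditions.

```typescript
type Item = { title: string; url: string };

// Hypothetical page fetcher: returns the items on the given 0-based page.
type FetchPage = (pageIndex: number) => Promise<Item[]>;

async function collect(
  fetchPage: FetchPage,
  limit: number,
  keep: (item: Item) => boolean = () => true,
  maxPages = 10 // safety cap so heavy filtering cannot loop forever
): Promise<Item[]> {
  const results: Item[] = [];
  for (let page = 0; page < maxPages && results.length < limit; page++) {
    const items = await fetchPage(page);
    if (items.length === 0) break; // the site has no more results
    results.push(...items.filter(keep));
  }
  // The last page may overshoot the limit, so trim the surplus.
  return results.slice(0, limit);
}

// Fake source for demonstration: three pages of five items each.
const fakeFetch: FetchPage = async (page) =>
  page < 3
    ? Array.from({ length: 5 }, (_, i) => ({
        title: `Post ${page * 5 + i}`,
        url: `https://blog.example.com/posts/${page * 5 + i}`,
      }))
    : [];

void collect(fakeFetch, 10).then((items) => console.log(items.length)); // 10
```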

### User-defined Transforms

Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.

```typescript
await google.search('test', {
  transform: (results) => results.filter(r => r.url.endsWith('.pdf'))
});
```

If the user transform filters out results, auto-pagination fetches more pages to meet the requested limit.
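The ordering described in this section (engine transform first, user transform second) can be sketched as a simple chain. `applyTransforms` and the `Transform` type are hypothetical names used for illustration, not the package's API.

```typescript
type Result = { title: string; url: string };
type Transform = (results: Result[]) => Result[] | Promise<Result[]>;

// Run the engine's transform first, then the caller's (if any).
async function applyTransforms(
  raw: Result[],
  engineTransform: Transform,
  userTransform?: Transform
): Promise<Result[]> {
  const cleaned = await engineTransform(raw);
  return userTransform ? await userTransform(cleaned) : cleaned;
}

// Engine step: trim titles. User step: keep only PDF links.
const trimTitles: Transform = (results) =>
  results.map((r) => ({ ...r, title: r.title.trim() }));
const onlyPdf: Transform = (results) =>
  results.filter((r) => r.url.endsWith('.pdf'));

const sample: Result[] = [
  { title: '  Report  ', url: 'https://example.com/report.pdf' },
  { title: 'Home', url: 'https://example.com/' },
];

void applyTransforms(sample, trimTitles, onlyPdf).then((out) => console.log(out));
```

Because the user step sees already-cleaned data, a filter like `onlyPdf` can rely on normalized fields.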

### Standardized Search Options

When calling `search()`, you can provide standardized options that each engine maps to its own query parameters:

```typescript
const results = await google.search('open source', {
  limit: 20,
  timeRange: 'month',   // 'day', 'week', 'month', 'year'
  // Or custom range:
  // timeRange: { from: '2023-01-01', to: '2023-12-31' },
  category: 'news',     // 'all', 'images', 'videos', 'news'
  region: 'US',         // ISO 3166-1 alpha-2
  language: 'en',       // ISO 639-1
  safeSearch: 'strict', // 'off', 'moderate', 'strict'
});
```

To support these in your own engine, override the `formatOptions` method:

```typescript
protected override formatOptions(options: SearchOptions): Record<string, any> {
  const vars: Record<string, any> = {};
  if (options.timeRange === 'day') vars.tbs = 'qdr:d';
  // ... map other options to template variables
  return vars;
}
```

Then use these variables in your `template.url`:
`url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
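A fuller mapping might cover every time-range preset with a lookup table. This is a sketch under assumptions: `OptionsSketch` and `toTemplateVars` are hypothetical, the `qdr:` values follow the pattern of the `'qdr:d'` example above, and mapping `language` to an `hl` variable is a guess about what an engine might do.

```typescript
type TimeRangePreset = 'day' | 'week' | 'month' | 'year';

// Hypothetical subset of the search options discussed above.
interface OptionsSketch {
  timeRange?: TimeRangePreset;
  language?: string;
}

// Relative-date filter values for the `tbs` query parameter.
const QDR: Record<TimeRangePreset, string> = {
  day: 'qdr:d',
  week: 'qdr:w',
  month: 'qdr:m',
  year: 'qdr:y',
};

function toTemplateVars(options: OptionsSketch): Record<string, string> {
  return {
    // Default to '' so an unset option renders as an empty `tbs=` value
    // instead of leaving a literal placeholder in the URL.
    tbs: options.timeRange ? QDR[options.timeRange] : '',
    // Assumption: language maps to an `hl` template variable.
    hl: options.language ?? '',
  };
}

console.log(toTemplateVars({ timeRange: 'week', language: 'en' }));
// { tbs: 'qdr:w', hl: 'en' }
```

Using a `Record` keyed by the preset type means the compiler flags any preset that lacks a mapping.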

### Custom Variables

You can pass custom variables to `search()` and use them in your template.

```typescript
// Call
await google.search('test', { category: 'news' });

// Template
url: 'https://site.com?q=${query}&cat=${category}'
```

## Pagination Guide

### 1. Offset-based (e.g., Google)

```typescript
protected override get pagination() {
  return {
    type: 'url-param',
    paramName: 'start',
    startValue: 0,
    increment: 10 // Jump 10 items per page
  };
}
```

URL: `search?q=...&start=${offset}`
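To see which offsets such a configuration would request for a given `limit`, the arithmetic can be sketched as below. `offsetsFor` is a hypothetical helper that assumes every page returns exactly `increment` items, which real result pages do not guarantee.

```typescript
// Offsets to request so that `limit` results are covered, assuming every
// page returns exactly `increment` items.
function offsetsFor(limit: number, startValue = 0, increment = 10): number[] {
  const pages = Math.ceil(limit / increment);
  return Array.from({ length: pages }, (_, i) => startValue + i * increment);
}

// 25 results at 10 per page need three requests: start=0, 10, 20.
const urls = offsetsFor(25).map(
  (offset) => `https://www.example.com/search?q=test&start=${offset}`
);
console.log(urls);
```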

### 2. Page-based (e.g., Bing)

```typescript
protected override get pagination() {
  return {
    type: 'url-param',
    paramName: 'page',
    startValue: 1,
    increment: 1
  };
}
```

URL: `search?q=...&page=${page}`

### 3. Click-based (SPA)

```typescript
protected override get pagination() {
  return {
    type: 'click-next',
    nextButtonSelector: '.pagination .next'
  };
}
```

The engine will click this selector and wait for network idle before scraping the next batch.