@isdk/web-searcher 0.1.1

package/docs/README.md ADDED
@@ -0,0 +1,278 @@
1
+ **@isdk/web-searcher**
2
+
3
+ ***
4
+
5
+ # Search Module
6
+
7
+ The Search module provides a high-level, class-based framework for building search engine scrapers. It is built on top of `@isdk/web-fetcher` and extends its capabilities to handle **multi-page navigation**, **session persistence**, and **result standardization**.
8
+
9
+ ## 🌟 Why use the Search Module?
10
+
11
+ Building a robust search scraper involves more than just fetching a URL. You often need to handle:
12
+
13
+ - **Pagination**: Automatically click "Next" or modify URL parameters until you have enough results.
14
+ - **Session Management**: Maintain cookies and headers across multiple search queries.
15
+ - **Data Cleaning**: Parse raw HTML and resolve redirect links.
16
+ - **Flexibility**: Switch between HTTP (fast) and Browser (anti-bot) modes easily.
17
+
18
+ This module encapsulates these patterns into a reusable `Searcher` class.
19
+
20
+ ## 🚀 Quick Start
21
+
22
+ ### 1. One-off Search
23
+
24
+ Use the static `Searcher.search` method for quick, disposable tasks. It automatically creates a session, fetches results, and cleans up.
25
+
26
+ ```typescript
27
+ import { Searcher } from '@isdk/web-fetcher/search';
28
+ import { GoogleSearcher } from '@isdk/web-fetcher/search/engines/google';
29
+
30
+ // Register the engine (only needs to be done once)
31
+ Searcher.register(GoogleSearcher);
32
+
33
+ // Search!
34
+ // The 'limit' parameter ensures we fetch enough pages to get 20 results.
35
+ // Note: The engine name is case-sensitive and derived from the class name (e.g., 'GoogleSearcher' -> 'Google')
36
+ const results = await Searcher.search('Google', 'open source', { limit: 20 });
37
+
38
+ console.log(results);
39
+ ```
40
+
41
+ ### 2. Stateful Session
42
+
43
+ Since `Searcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
44
+
45
+ **Configuration Precedence:**
46
+ When creating a session, options are merged in the following order:
47
+ 1. **Template Default**: Defined in the Searcher class (highest priority for structural options).
48
+ 2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).
49
+
50
+ *Note: If the template sets `engine: 'auto'` (the default), a user-provided `engine` option is respected.*
51
+
52
+ ```typescript
53
+ // Create a persistent session
54
+ const google = new GoogleSearcher({
55
+   headless: false, // Override default options (e.g., show browser)
56
+   proxy: 'http://my-proxy:8080',
57
+   timeoutMs: 30000 // Set a global timeout for requests
58
+ });
59
+
60
+ try {
61
+   // First query
62
+   // You can also pass runtime options to override session defaults or inject variables
63
+   const results1 = await google.search('term A', {
64
+     timeoutMs: 60000, // Override timeout just for this search
65
+     extraParam: 'value' // Can be used in template as ${extraParam}
66
+   });
67
+
68
+   // Second query (reuses the same browser window/cookies)
69
+   const results2 = await google.search('term B');
70
+ } finally {
71
+   // Always dispose to close the browser/release resources
72
+   await google.dispose();
73
+ }
74
+ ```
75
+
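One way to picture the precedence rules above is a small merge helper. This is only a sketch under stated assumptions: `mergeOptions` and `STRUCTURAL_KEYS` are hypothetical names, and which options count as "structural" is chosen here purely for illustration, not taken from the library.

```typescript
type Options = Record<string, unknown>;

// Hypothetical: keys the template is assumed to own ("structural" options).
const STRUCTURAL_KEYS = ['url', 'actions'];

function mergeOptions(template: Options, user: Options): Options {
  // User options fill in missing defaults and override non-structural ones.
  const merged: Options = { ...template, ...user };

  // Structural keys defined by the template always win.
  for (const key of STRUCTURAL_KEYS) {
    if (template[key] !== undefined) merged[key] = template[key];
  }

  // A concrete template engine wins; 'auto' (or no engine) defers to the user.
  const templateEngine = template.engine ?? 'auto';
  merged.engine = templateEngine === 'auto' ? (user.engine ?? 'auto') : templateEngine;
  return merged;
}

const merged = mergeOptions(
  { engine: 'auto', url: 'https://example.com/search?q=${query}', headless: true },
  { engine: 'browser', url: 'https://other.example', headless: false, proxy: 'http://my-proxy:8080' },
);
console.log(merged.engine, merged.url, merged.headless);
```

In this sketch the user's `engine`, `headless`, and `proxy` values are kept, while the template's `url` is preserved.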
76
+ ## 🛠️ Implementing a New Search Engine
77
+
78
+ To support a new website, create a class that extends `Searcher`.
79
+
80
+ ### Step 1: Define the Template
81
+
82
+ The engine name is automatically derived from the class name (e.g., `MyBlogSearcher` -> `MyBlog`), but you can customize it and add aliases using static properties.
83
+
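The naming rules above can be sketched as a tiny registry. This is an illustration, not the library's implementation: `register`, `deriveName`, and the `engineName` static are hypothetical (the README uses a static `name`; a different property is used here so the sketch compiles without clashing with the built-in `Function.name`), and stripping a `Searcher` suffix is an assumption based on the examples given.

```typescript
type EngineCtor = { new (): object; name: string; engineName?: string; alias?: string[] };

// Keys are matched exactly, so lookups are case-sensitive.
const registry = new Map<string, EngineCtor>();

function deriveName(ctor: EngineCtor): string {
  // Prefer an explicit custom name, otherwise strip the 'Searcher' suffix.
  return ctor.engineName ?? ctor.name.replace(/Searcher$/, '');
}

function register(ctor: EngineCtor): void {
  // Register the primary name and every alias under the same class.
  for (const key of [deriveName(ctor), ...(ctor.alias ?? [])]) {
    registry.set(key, ctor);
  }
}

class GoogleSearcher {}
class MyBlogSearcher {
  static engineName = 'blog';
  static alias = ['myblog', 'news'];
}

register(GoogleSearcher);
register(MyBlogSearcher);

console.log(registry.has('Google'), registry.has('google')); // true false
console.log(registry.get('news') === MyBlogSearcher);        // true
```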
84
+ The `template` property defines the "Blueprint" for your search. It's a standard `FetcherOptions` object but supports **variable injection**.
85
+
86
+ Supported variables:
87
+
88
+ - `${query}`: The search string.
89
+ - `${page}`: Current page number (starts at 0 or 1 based on config).
90
+ - `${offset}`: Current item offset (e.g., 0, 10, 20).
91
+ - `${limit}`: The requested limit.
92
+
93
+ ```typescript
94
+ import { Searcher } from '@isdk/web-fetcher/search';
95
+ import { FetcherOptions } from '@isdk/web-fetcher/types';
96
+
97
+ export class MyBlogSearcher extends Searcher {
98
+   static name = 'blog'; // Custom name (case-sensitive)
99
+   static alias = ['myblog', 'news'];
100
+
101
+   protected get template(): FetcherOptions {
102
+     return {
103
+       engine: 'http', // Use 'browser' if the site has anti-bot
104
+       // Dynamic URL with variables
105
+       url: 'https://blog.example.com/search?q=${query}&page=${page}',
106
+       actions: [
107
+         {
108
+           id: 'extract',
109
+           storeAs: 'results', // MUST store results here
110
+           params: {
111
+             type: 'array',
112
+             selector: 'article.post',
113
+             items: {
114
+               title: { selector: 'h2' },
115
+               url: { selector: 'a', attribute: 'href' }
116
+             }
117
+           }
118
+         }
119
+       ]
120
+     };
121
+   }
122
+ }
123
+ ```
124
+
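To make the variable injection described above concrete, here is a minimal sketch of placeholder substitution. `renderTemplate` is a hypothetical helper; URL-encoding the values and leaving unknown placeholders untouched are assumptions, since the README does not say how the library handles either case.

```typescript
type TemplateVars = Record<string, string | number>;

// Replace each ${name} placeholder with the URL-encoded value of vars[name].
// Placeholders without a matching variable are left as-is (assumption).
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\$\{(\w+)\}/g, (placeholder: string, name: string) =>
    name in vars ? encodeURIComponent(String(vars[name])) : placeholder,
  );
}

// Single quotes keep ${...} literal, exactly as in the template above.
const rendered = renderTemplate(
  'https://blog.example.com/search?q=${query}&page=${page}',
  { query: 'open source', page: 2 },
);
console.log(rendered); // https://blog.example.com/search?q=open%20source&page=2
```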
125
+ ### Step 2: Configure Pagination
126
+
127
+ Tell the `Searcher` how to navigate to the next page. Implement the `pagination` getter.
128
+
129
+ #### Option A: URL Parameters (Offset/Page)
130
+
131
+ Best for stateless HTTP scraping.
132
+
133
+ ```typescript
134
+ protected override get pagination() {
135
+   return {
136
+     type: 'url-param',
137
+     paramName: 'page',
138
+     startValue: 1, // First page is 1
139
+     increment: 1 // Add 1 for next page
140
+   };
141
+ }
142
+ ```
143
+
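The effect of `startValue` and `increment` can be sketched with a few lines of arithmetic. The `UrlParamPagination` interface and `paramValueFor` helper below are hypothetical and only mirror the fields shown above; the real option type may differ.

```typescript
interface UrlParamPagination {
  type: 'url-param';
  paramName: string;
  startValue: number;
  increment: number;
}

// Value of the pagination parameter for the n-th request (0-based).
function paramValueFor(config: UrlParamPagination, requestIndex: number): number {
  return config.startValue + requestIndex * config.increment;
}

const pageBased: UrlParamPagination = { type: 'url-param', paramName: 'page', startValue: 1, increment: 1 };
const offsetBased: UrlParamPagination = { type: 'url-param', paramName: 'start', startValue: 0, increment: 10 };

const pageValues = [0, 1, 2].map((i) => paramValueFor(pageBased, i));     // [1, 2, 3]
const offsetValues = [0, 1, 2].map((i) => paramValueFor(offsetBased, i)); // [0, 10, 20]
console.log(pageValues, offsetValues);
```

The same formula covers both page-numbered sites (`increment: 1`) and offset-based sites (`increment` equal to the page size).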
144
+ #### Option B: Click "Next" Button
145
+
146
+ Best for SPAs or complex session-based sites. Requires `engine: 'browser'`.
147
+
148
+ ```typescript
149
+ protected override get pagination() {
150
+   return {
151
+     type: 'click-next',
152
+     nextButtonSelector: 'a.next-page-btn'
153
+   };
154
+ }
155
+ ```
156
+
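The two pagination styles can be modelled as a discriminated union, which also makes the "requires a browser" constraint easy to check up front. This is a sketch only: `PaginationConfig`, `EngineMode`, and `isCompatible` are hypothetical names, and treating `'auto'` as acceptable for click-based pagination is an assumption.

```typescript
type PaginationConfig =
  | { type: 'url-param'; paramName: string; startValue: number; increment: number }
  | { type: 'click-next'; nextButtonSelector: string };

type EngineMode = 'http' | 'browser' | 'auto';

// Clicking a "Next" button needs a rendered page, so plain HTTP is rejected.
function isCompatible(pagination: PaginationConfig, engine: EngineMode): boolean {
  if (pagination.type === 'click-next') return engine !== 'http';
  return true; // URL-parameter pagination works in every mode
}

const clickNext: PaginationConfig = { type: 'click-next', nextButtonSelector: 'a.next-page-btn' };
console.log(isCompatible(clickNext, 'http'), isCompatible(clickNext, 'browser')); // false true
```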
157
+ ### Step 3: Transform & Clean Data
158
+
159
+ Override `transform` to clean data. Since `Searcher` is a `FetchSession`, you can also make extra requests (like resolving redirects) using `this`.
160
+
161
+ ```typescript
162
+ protected override async transform(outputs: Record<string, any>) {
163
+   const results: any[] = outputs['results'] || [];
164
+
165
+   // Clean data or filter
166
+   return results.map(item => ({
167
+     ...item,
168
+     title: (item.title ?? '').trim(), // Guard against a missing title
169
+     url: new URL(item.url, 'https://blog.example.com').href
170
+   }));
171
+ }
172
+ ```
173
+
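Many search pages wrap result links in a redirect such as `/url?q=<target>`. The helper below is a sketch of how a `transform` could unwrap such links without an extra request; `unwrapRedirect` and the list of query-parameter names are hypothetical and would need adjusting for a real site.

```typescript
// Resolve href against base and, if it looks like a redirect wrapper,
// return the absolute target found in one of the given query parameters.
function unwrapRedirect(
  href: string,
  base: string,
  paramNames: string[] = ['q', 'url', 'u'], // assumed wrapper parameter names
): string {
  const resolved = new URL(href, base);
  for (const name of paramNames) {
    const target = resolved.searchParams.get(name); // already percent-decoded
    if (target && /^https?:\/\//i.test(target)) return target;
  }
  return resolved.href; // not a wrapper: just make the URL absolute
}

const direct = unwrapRedirect('/posts/1', 'https://blog.example.com');
const unwrapped = unwrapRedirect(
  '/url?q=https%3A%2F%2Fexample.org%2Fpost&sa=U',
  'https://search.example.com',
);
console.log(direct);    // https://blog.example.com/posts/1
console.log(unwrapped); // https://example.org/post
```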
174
+ ## 🧠 Advanced Concepts
175
+
176
+ ### Auto-Pagination & Filtering
177
+
178
+ The `Searcher` paginates automatically to satisfy the requested limit. If you request `limit: 10` but the first page returns only 5 results (or your `transform` filters some out), it fetches further pages until the limit is met.
179
+
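The behaviour described above amounts to a collect-until-limit loop. The sketch below simulates it with an in-memory, synchronous page source so it stays self-contained (real fetching is asynchronous); `collectResults`, the `maxPages` cap, and stopping on an empty page are assumptions, not documented library behaviour.

```typescript
type Result = { title: string; url: string };

// Collect results page by page until `limit` is reached, a page comes back
// empty, or `maxPages` is hit (the last two stop conditions are assumptions).
function collectResults(
  fetchPage: (pageIndex: number) => Result[],
  transform: (results: Result[]) => Result[],
  limit: number,
  maxPages = 10,
): Result[] {
  const collected: Result[] = [];
  for (let page = 0; page < maxPages && collected.length < limit; page++) {
    const raw = fetchPage(page);
    if (raw.length === 0) break; // nothing left to fetch
    collected.push(...transform(raw)); // filtering may shrink the page
  }
  return collected.slice(0, limit);
}

// Simulated site: 4 pages of 5 results; odd-numbered items are PDFs.
let pagesFetched = 0;
const fakeFetchPage = (pageIndex: number): Result[] => {
  pagesFetched++;
  if (pageIndex >= 4) return [];
  return Array.from({ length: 5 }, (_, i) => {
    const n = pageIndex * 5 + i;
    return { title: `Item ${n}`, url: `https://example.com/${n}.${n % 2 ? 'pdf' : 'html'}` };
  });
};

const pdfOnly = (results: Result[]): Result[] => results.filter((r) => r.url.endsWith('.pdf'));
const pdfs = collectResults(fakeFetchPage, pdfOnly, 6);
console.log(pdfs.length, pagesFetched); // 6 3
```

Because the filter keeps only two or three items per page, three pages are needed to reach six results.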
180
+ ### User-defined Transforms
181
+
182
+ Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.
183
+
184
+ ```typescript
185
+ await google.search('test', {
186
+   transform: (results) => results.filter(r => r.url.endsWith('.pdf'))
187
+ });
188
+ ```
189
+
190
+ If the user filters out results, the auto-pagination logic will kick in to fetch more pages to meet the requested limit.
191
+
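The ordering of the two transforms can be pictured as a simple left-to-right pipeline. This is a sketch with hypothetical names (`applyTransforms`, `engineTransform`, `userTransform`); it only illustrates that the user's function sees the engine's already-cleaned output.

```typescript
type Result = { title: string; url: string };
type Transform = (results: Result[]) => Result[];

// Engine-level cleanup runs first...
const engineTransform: Transform = (results) =>
  results.map((r) => ({ ...r, title: r.title.trim() }));

// ...then the user-supplied transform sees the cleaned results.
const userTransform: Transform = (results) =>
  results.filter((r) => r.url.endsWith('.pdf'));

// Apply transforms left to right.
function applyTransforms(results: Result[], transforms: Transform[]): Result[] {
  return transforms.reduce((current, transform) => transform(current), results);
}

const output = applyTransforms(
  [
    { title: '  Annual report ', url: 'https://example.com/report.pdf' },
    { title: 'Home', url: 'https://example.com/' },
  ],
  [engineTransform, userTransform],
);
console.log(output); // one result, with the title trimmed to 'Annual report'
```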
192
+ ### Standardized Search Options
193
+
194
+ When calling `search()`, you can provide standardized options that the search engine will map to specific parameters:
195
+
196
+ ```typescript
197
+ const results = await google.search('open source', {
198
+   limit: 20,
199
+   timeRange: 'month', // 'day', 'week', 'month', 'year'
200
+   // Or custom range:
201
+   // timeRange: { from: '2023-01-01', to: '2023-12-31' },
202
+   category: 'news', // 'all', 'images', 'videos', 'news'
203
+   region: 'US', // ISO 3166-1 alpha-2
204
+   language: 'en', // ISO 639-1
205
+   safeSearch: 'strict', // 'off', 'moderate', 'strict'
206
+ });
207
+ ```
208
+
209
+ To support these in your own engine, override the `formatOptions` method:
210
+
211
+ ```typescript
212
+ protected override formatOptions(options: SearchOptions): Record<string, any> {
213
+   const vars: Record<string, any> = {};
214
+   if (options.timeRange === 'day') vars.tbs = 'qdr:d';
215
+   // ... map other options to template variables
216
+   return vars;
217
+ }
218
+ ```
219
+
220
+ Then use these variables in your `template.url`:
221
+ `url: 'https://www.google.com/search?q=${query}&tbs=${tbs}'`
222
+
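A fuller mapping might look like the standalone sketch below. The local `SearchOptions` type covers only the preset values listed earlier (no custom date range), and the target variable names (`tbs`, `tbm`, `gl`, `hl`, `safe`) follow commonly seen Google query parameters; both are assumptions for illustration and not necessarily what the bundled `GoogleSearcher` does.

```typescript
type TimeRange = 'day' | 'week' | 'month' | 'year';
type Category = 'all' | 'images' | 'videos' | 'news';

// Hypothetical local shape; the library's SearchOptions may differ.
interface SearchOptions {
  timeRange?: TimeRange;
  category?: Category;
  region?: string;   // ISO 3166-1 alpha-2
  language?: string; // ISO 639-1
  safeSearch?: 'off' | 'moderate' | 'strict';
}

const TIME_RANGE_CODES: Record<TimeRange, string> = {
  day: 'qdr:d', week: 'qdr:w', month: 'qdr:m', year: 'qdr:y',
};
const CATEGORY_CODES: Record<Exclude<Category, 'all'>, string> = {
  images: 'isch', videos: 'vid', news: 'nws',
};

function formatOptions(options: SearchOptions): Record<string, string> {
  const vars: Record<string, string> = {};
  if (options.timeRange) vars.tbs = TIME_RANGE_CODES[options.timeRange];
  const category = options.category;
  if (category && category !== 'all') vars.tbm = CATEGORY_CODES[category];
  if (options.region) vars.gl = options.region.toLowerCase();
  if (options.language) vars.hl = options.language;
  // 'moderate' and 'strict' are both mapped to 'active' here (assumption).
  if (options.safeSearch) vars.safe = options.safeSearch === 'off' ? 'off' : 'active';
  return vars;
}

const vars = formatOptions({ timeRange: 'month', category: 'news', region: 'US', language: 'en', safeSearch: 'strict' });
console.log(vars);
```

Options that are not set produce no variable at all, so a template that references `${tbs}` should account for the variable being absent.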
223
+ ### Custom Variables
224
+
225
+ You can pass custom variables to `search()` and use them in your template.
226
+
227
+ ```typescript
228
+ // Call
229
+ await google.search('test', { category: 'news' });
230
+
231
+ // Template
232
+ url: 'https://site.com?q=${query}&cat=${category}'
233
+ ```
234
+
235
+ ## Pagination Guide
236
+
237
+ ### 1. Offset-based (e.g., Google)
238
+
239
+ ```typescript
240
+ protected override get pagination() {
241
+   return {
242
+     type: 'url-param',
243
+     paramName: 'start',
244
+     startValue: 0,
245
+     increment: 10 // Jump 10 items per page
246
+   };
247
+ }
248
+ ```
249
+
250
+ URL: `search?q=...&start=${offset}`
251
+
252
+ ### 2. Page-based (e.g., Bing)
253
+
254
+ ```typescript
255
+ protected override get pagination() {
256
+   return {
257
+     type: 'url-param',
258
+     paramName: 'page',
259
+     startValue: 1,
260
+     increment: 1
261
+   };
262
+ }
263
+ ```
264
+
265
+ URL: `search?q=...&page=${page}`
266
+
267
+ ### 3. Click-based (SPA)
268
+
269
+ ```typescript
270
+ protected override get pagination() {
271
+   return {
272
+     type: 'click-next',
273
+     nextButtonSelector: '.pagination .next'
274
+   };
275
+ }
276
+ ```
277
+
278
+ The engine clicks the element matching this selector and waits for network idle before scraping the next batch.