@isdk/web-searcher 0.1.2 → 0.1.4

package/README.cn.md CHANGED
@@ -42,27 +42,37 @@ console.log(results);
42
42
 
43
43
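  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for searches that require login, or for avoiding anti-scraping measures by simulating human behavior.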
44
44
 
45
- **Configuration precedence:**
46
- When a session is created, options are merged in the following order:
45
+ ### 🛡️ Core Principle: Template is Law
47
46
 
48
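- 1. **Template Default**: defined in the WebSearcher class (highest priority for structural options).
49
- 2. **User Options**: options passed to the constructor (they can fill in missing defaults, or override them where allowed).
47
+ The `template` defined in a `WebSearcher` subclass is the authoritative "blueprint".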
50
48
 
51
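- *Note: if the template sets `engine: 'auto'` (the default), the user-provided `engine` option is respected.*
49
+ - **Template priority**: if the template defines a property (such as `engine: 'browser'` or specific `headers`), that value is **locked** and user options cannot override it. This keeps the scraping logic stable.
50
+ - **Immutable actions**: the `actions` array in the template is strictly protected. Users cannot append, replace, or modify execution steps through `options`. This prevents external logic from breaking the crawler's execution flow.
51
+ - **User flexibility**: properties that are **not** explicitly locked in the template (such as `proxy`, `timeoutMs`, or custom variables) can be set freely in the constructor or in the `search()` method.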
52
52
 
53
53
  ```typescript
54
54
  // Create a persistent session
55
55
  const google = new GoogleSearcher({
56
- headless: false, // Override a default option (for example, show the browser)
56
+ headless: false, // Can be overridden if not locked in the template
57
57
  proxy: 'http://my-proxy:8080',
58
- timeoutMs: 30000 // Set a global timeout for requests
58
+ timeoutMs: 30000 // Valid (assuming the GoogleSearcher template does not explicitly set timeoutMs)
59
59
  });
60
+ ```
61
+
62
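+ ### 🧠 Intelligent Navigation (Goto)
63
+
64
+ `WebSearcher` automatically manages navigation to the search URL.
65
+
66
+ 1. **Auto-injection**: if your template does **not** contain a `goto` action, the searcher automatically inserts one at the start of the action list, pointing to the resolved `url` (with query variables injected).
67
+ 2. **Manual control**: if you explicitly add a `goto` action to the template that matches the resolved URL, the searcher detects the duplicate and **skips** the automatic injection. This gives you full control over the navigation step (for example, adding headers, a referrer, or other specific parameters).
68
+ 3. **Multi-step flows**: you can define several `goto` actions in the template (for example, visiting a login page first). The searcher still prepends navigation to the main search URL unless one of your `goto` actions matches it exactly.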
60
69
 
70
+ ```typescript
61
71
  try {
62
72
  // First query
63
73
  // You can also pass runtime options to override session defaults or inject variables
64
74
  const results1 = await google.search('term A', {
65
- timeoutMs: 60000, // Override the timeout for this search only
75
+ timeoutMs: 60000, // Override the timeout for this search
66
76
  extraParam: 'value' // Can be used in the template as ${extraParam}
67
77
  });
68
78
 
@@ -174,22 +184,41 @@ protected override async transform(outputs: Record<string, any>) {
174
184
 
175
185
  ## 🧠 Advanced Concepts
176
186
 
177
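- ### Auto-Pagination and Filtering
187
+ ### Auto-Pagination: How `limit` and `maxPages` Relate
188
+
189
+ `WebSearcher` is designed to be result-oriented. When you call `search()`, you only specify how many results you want, and the searcher handles the paging logic automatically.
190
+
191
+ - **`limit`**: the total number of results you expect to get.
192
+ - **`maxPages`**: a safety threshold. It caps the number of pages (paging loop iterations) the searcher may fetch in order to satisfy `limit`.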
178
193
 
179
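- `WebSearcher` is smart. If you request `limit: 10` but the first page returns only 5 results (or your `transform` filters some out), it automatically fetches the next page until the limit is met.
194
+ **Example of how they work together:**
195
+ If you request `{ limit: 50 }` but each page has only 5 results:
196
+
197
+ 1. The searcher fetches page 1 (5 results).
198
+ 2. It sees that `5 < 50`, so it automatically fetches page 2.
199
+ 3. The loop continues until 50 results have been collected **or** the `maxPages` limit (10 pages by default) is reached.
200
+
201
+ This mechanism prevents unbounded fetching when the "next page" selector is broken or the engine gets stuck in a loop, protecting your system resources.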
180
202
 
181
203
  ### User-defined Transforms
182
204
 
183
205
  Users can provide their own `transform` when calling `search`. It runs **after** the engine's built-in transform.
184
206
 
207
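+ This is very powerful for **filtering out ads** or irrelevant content. If the user filters out some results, the auto-pagination logic **kicks in automatically** to fetch more pages, so that the returned list both satisfies the `limit` count and contains only valid entries.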
208
+
185
209
  ```typescript
186
210
  await google.search('test', {
187
- transform: (results) => results.filter(r => r.url.endsWith('.pdf'))
211
+ limit: 20,
212
+ // Example: filter out sponsored results (ads) and keep only PDFs
213
+ transform: (results) => {
214
+ return results.filter(r => {
215
+ const isAd = r.isSponsored || r.url.includes('googleadservices.com');
216
+ return !isAd && r.url.endsWith('.pdf');
217
+ });
218
+ }
188
219
  });
189
220
  ```
190
221
 
191
- If the user filters out results, the auto-pagination logic kicks in to fetch more pages to meet the requested limit.
192
-
193
222
  ### Standardized Search Options
194
223
 
195
224
  When calling `search()`, you can provide standardized options, which the search engine maps to its specific parameters:
@@ -197,7 +226,7 @@ await google.search('test', {
197
226
  ```typescript
198
227
  const results = await google.search('open source', {
199
228
  limit: 20,
200
- timeRange: 'month', // 'day', 'week', 'month', 'year'
229
+ timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
201
230
  // Or a custom range:
202
231
  // timeRange: { from: '2023-01-01', to: '2023-12-31' },
203
232
  category: 'news', // 'all', 'images', 'videos', 'news'
package/README.md CHANGED
@@ -42,27 +42,37 @@ console.log(results);
42
42
 
43
43
  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
44
44
 
45
- **Configuration Precedence:**
46
- When creating a session, options are merged in the following order:
45
+ ### 🛡️ Core Principle: Template is Law
47
46
 
48
- 1. **Template Default**: Defined in the WebSearcher class (highest priority for structural options).
49
- 2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).
47
+ The `template` defined in the `WebSearcher` subclass acts as the authoritative "blueprint".
50
48
 
51
- *Note: If the template sets `engine: 'auto'` (default), user-provided `engine` option will be respected.*
49
+ - **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
50
+ - **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
51
+ - **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
52
52
 
53
53
  ```typescript
54
54
  // Create a persistent session
55
55
  const google = new GoogleSearcher({
56
- headless: false, // Override default options (e.g., show browser)
56
+ headless: false, // Override if not locked in template
57
57
  proxy: 'http://my-proxy:8080',
58
- timeoutMs: 30000 // Set a global timeout for requests
58
+ timeoutMs: 30000 // Set a global timeout (valid if template doesn't define it)
59
59
  });
60
+ ```
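The locking rules in this section can be pictured as a defaults-style merge: the template is applied first and user options only fill whatever is still missing. The block below is a minimal sketch under stated assumptions, not the library's code. `mergeWithTemplate`, `fillMissing`, and `PlainObject` are hypothetical names, and the hand-written helper is a simplified stand-in for the `defaultsDeep` call visible in the bundled `dist/index.js`.

```typescript
// Hypothetical shape for illustration only; the real option types live in the library.
type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Copies values from `source` into `target` only where `target` has no value yet,
// recursing into nested plain objects. Nested objects are cloned so that the
// template itself is never mutated.
function fillMissing(target: PlainObject, source: PlainObject): PlainObject {
  for (const key of Object.keys(source)) {
    const current = target[key];
    const incoming = source[key];
    if (current === undefined) {
      target[key] = isPlainObject(incoming) ? fillMissing({}, incoming) : incoming;
    } else if (isPlainObject(current) && isPlainObject(incoming)) {
      fillMissing(current, incoming);
    }
  }
  return target;
}

// Template values win, user options fill the gaps, user-supplied `actions` are dropped.
function mergeWithTemplate(template: PlainObject, userOptions: PlainObject): PlainObject {
  const { actions: _ignored, ...rest } = userOptions;
  const merged = fillMissing({}, template); // the template is applied first
  return fillMissing(merged, rest);         // user options fill what is still missing
}

const template = {
  engine: 'browser',
  browser: { headless: false },
  actions: [{ id: 'extract' }],
};

const merged = mergeWithTemplate(template, {
  engine: 'http',                // ignored: the template already defines `engine`
  proxy: 'http://my-proxy:8080', // accepted: the template does not define `proxy`
  actions: [{ id: 'click' }],    // ignored: template actions are protected
});
```

In this sketch `merged.engine` stays `'browser'` while `merged.proxy` takes the user's value, which matches the "locked unless undefined in the template" rule described in the bullets.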
61
+
62
+ ### 🧠 Intelligent Navigation (Goto)
63
+
64
+ The `WebSearcher` automatically manages navigation to the search URL.
65
+
66
+ 1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).
67
+ 2. **Manual Control**: If you explicitly add a `goto` action in your template that matches the resolved URL, the searcher detects this duplicate and **skips** the automatic injection. This gives you full control to add headers, referrer, or other specific parameters to the navigation step if needed.
68
+ 3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (e.g., visit a login page first). The searcher will still prepend the main search URL navigation unless one of your `goto` actions matches it exactly.
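The three rules above can be sketched as a small function that decides whether to prepend a `goto` step. This is an illustrative sketch modelled on the check that appears in the bundled `dist/index.js` of 0.1.4. It covers only the first-page and `url-param` cases, and `buildActions`, the `Action` type, and the `referrer` parameter are hypothetical rather than part of the package API.

```typescript
// Hypothetical action shape for illustration; real actions carry more fields.
interface Action {
  id: string;
  params?: { url?: string; [key: string]: unknown };
}

// Prepends a `goto` for the resolved search URL unless the template already
// contains a `goto` whose URL matches it exactly.
function buildActions(templateActions: Action[], resolvedUrl: string): Action[] {
  const alreadyNavigates = templateActions.some(
    (action) => action.id === 'goto' && action.params?.url === resolvedUrl,
  );
  const injected: Action[] = alreadyNavigates
    ? []
    : [{ id: 'goto', params: { url: resolvedUrl } }];
  return [...injected, ...templateActions];
}

const searchUrl = 'https://example.com/search?q=term';

// Rule 1: no goto in the template, so one is injected first.
const auto = buildActions([{ id: 'extract' }], searchUrl);

// Rule 2: a matching goto exists, so nothing is injected.
const manual = buildActions(
  [
    { id: 'goto', params: { url: searchUrl, referrer: 'https://example.com/' } },
    { id: 'extract' },
  ],
  searchUrl,
);

// Rule 3: a goto to a different URL (a login page here) does not suppress injection.
const multiStep = buildActions(
  [{ id: 'goto', params: { url: 'https://example.com/login' } }, { id: 'extract' }],
  searchUrl,
);
```

Note that in the multi-step case the injected navigation comes before the login `goto`. If the login must happen first, the sketch suggests adding an explicit `goto` that matches the search URL after the login steps, so that the automatic injection is skipped.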
60
69
 
70
+ ```typescript
61
71
  try {
62
72
  // First query
63
73
  // You can also pass runtime options to override session defaults or inject variables
64
74
  const results1 = await google.search('term A', {
65
- timeoutMs: 60000, // Override timeout just for this search
75
+ timeoutMs: 60000, // Override session timeout just for this search
66
76
  extraParam: 'value' // Can be used in template as ${extraParam}
67
77
  });
68
78
 
@@ -172,24 +182,43 @@ protected override async transform(outputs: Record<string, any>) {
172
182
  }
173
183
  ```
174
184
 
175
- ## 🧠 Advanced Concepts
185
+ ### 🧠 Advanced Concepts
186
+
187
+ ### Auto-Pagination: `limit` vs `maxPages`
188
+
189
+ The `WebSearcher` is designed to be result-oriented. When you call `search()`, you specify how many results you want, and the searcher handles the pagination logic.
176
190
 
177
- ### Auto-Pagination & Filtering
191
+ - **`limit`**: Your target number of total results.
192
+ - **`maxPages`**: The safety threshold. It limits how many pages (fetch cycles) the searcher is allowed to navigate to satisfy your `limit`.
178
193
 
179
- The `WebSearcher` is smart. If you request `limit: 10`, but the first page only returns 5 results (or if your `transform` filters out results), it will automatically fetch the next page until the limit is met.
194
+ **Example Logic:**
195
+ If you request `{ limit: 50 }` but each page only has 5 results:
196
+
197
+ 1. The searcher fetches page 1 (5 results).
198
+ 2. It sees `5 < 50`, so it fetches page 2.
199
+ 3. It continues until it has 50 results **OR** it reaches `maxPages` (default 10).
200
+
201
+ This prevents infinite loops if the "Next" button selector is broken or if the search engine keeps returning the same results.
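The interplay of `limit` and `maxPages` can be sketched as a loop with two exit conditions. This is a simplified illustration under stated assumptions: `collectResults` and `fetchPage` are hypothetical names, and the real searcher executes template actions and transforms rather than calling a single fetch function.

```typescript
// Hypothetical page fetcher: returns the results for a zero-based page index.
type FetchPage = (page: number) => Promise<string[]>;

async function collectResults(
  fetchPage: FetchPage,
  limit = 10,
  maxPages = 10,
): Promise<string[]> {
  const results: string[] = [];
  let page = 0;
  while (results.length < limit) {
    const batch = await fetchPage(page);
    if (batch.length === 0) break; // nothing more to read
    results.push(...batch);
    page++;
    if (page >= maxPages) break;   // safety threshold reached
  }
  return results.slice(0, limit);  // never return more than requested
}

// Every page yields 5 results, so a limit of 50 needs 10 pages.
const fivePerPage: FetchPage = async (page) =>
  Array.from({ length: 5 }, (_, i) => `result-${page * 5 + i + 1}`);

const full = collectResults(fivePerPage, 50);      // resolves to 50 results after 10 pages
const capped = collectResults(fivePerPage, 50, 3); // resolves to 15 results after 3 pages
```

The second call shows the trade-off: with `maxPages` set to 3 the caller receives fewer results than the requested `limit`, which is the intended behaviour of a safety cap.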
180
202
 
181
203
  ### User-defined Transforms
182
204
 
183
205
  Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.
184
206
 
207
+ This is particularly useful for **filtering out ads** or irrelevant content. If your transform filters out results, auto-pagination fetches further pages (up to `maxPages`) so that the final list can still reach the requested `limit` using only valid entries.
208
+
185
209
  ```typescript
186
210
  await google.search('test', {
187
- transform: (results) => results.filter(r => r.url.endsWith('.pdf'))
211
+ limit: 20,
212
+ // Example: Filter out sponsored results and only keep PDFs
213
+ transform: (results) => {
214
+ return results.filter(r => {
215
+ const isAd = r.isSponsored || r.url.includes('googleadservices.com');
216
+ return !isAd && r.url.endsWith('.pdf');
217
+ });
218
+ }
188
219
  });
189
220
  ```
190
221
 
191
- If the user filters out results, the auto-pagination logic will kick in to fetch more pages to meet the requested limit.
192
-
193
222
  ### Standardized Search Options
194
223
 
195
224
  When calling `search()`, you can provide standardized options that the search engine will map to specific parameters:
@@ -197,7 +226,7 @@ When calling `search()`, you can provide standardized options that the search en
197
226
  ```typescript
198
227
  const results = await google.search('open source', {
199
228
  limit: 20,
200
- timeRange: 'month', // 'day', 'week', 'month', 'year'
229
+ timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
201
230
  // Or custom range:
202
231
  // timeRange: { from: '2023-01-01', to: '2023-12-31' },
203
232
  category: 'news', // 'all', 'images', 'videos', 'news'
package/dist/index.d.mts CHANGED
@@ -15,7 +15,17 @@ interface StandardSearchResult {
15
15
  snippet?: string;
16
16
  /** An optional image URL associated with the result. */
17
17
  image?: string;
18
- /** Allows for engine-specific extra fields (e.g., rank, author, date). */
18
+ /** The date the result was published or last updated. */
19
+ date?: string | Date;
20
+ /** The author or source name of the result. */
21
+ author?: string;
22
+ /** The favicon URL of the source website. */
23
+ favicon?: string;
24
+ /** The rank or position of the result (usually 1-indexed). */
25
+ rank?: number;
26
+ /** The source website name (e.g., 'GitHub', 'StackOverflow'). */
27
+ source?: string;
28
+ /** Allows for engine-specific extra fields (e.g., siteIcon, category). */
19
29
  [key: string]: any;
20
30
  }
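To show how the newly typed fields might be consumed, here is a small sketch that formats a result into one line. `ResultLike`, `toIsoDate`, and `describeResult` are hypothetical names. The local interface mirrors only fields visible in this diff, and `title` and `url` are assumed because the top of `StandardSearchResult` lies outside the hunk.

```typescript
// Local stand-in mirroring fields shown in the diff; `title` and `url` are assumptions.
interface ResultLike {
  title: string;
  url: string;
  date?: string | Date;
  author?: string;
  rank?: number;
  source?: string;
}

// Normalizes the `string | Date` union to an ISO date (YYYY-MM-DD, in UTC),
// or undefined when the value is absent or cannot be parsed.
function toIsoDate(value: string | Date | undefined): string | undefined {
  if (value === undefined) return undefined;
  const parsed = value instanceof Date ? value : new Date(value);
  return Number.isNaN(parsed.getTime()) ? undefined : parsed.toISOString().slice(0, 10);
}

// Builds a one-line summary, skipping any optional field that is absent.
function describeResult(result: ResultLike): string {
  const parts = [
    result.rank !== undefined ? `#${result.rank}` : undefined,
    result.title,
    result.source,
    result.author ? `by ${result.author}` : undefined,
    toIsoDate(result.date),
  ];
  return parts.filter((part): part is string => part !== undefined).join(' | ');
}

const line = describeResult({
  title: 'Example repository',
  url: 'https://example.com/repo',
  rank: 1,
  source: 'GitHub',
  date: '2024-03-05T12:00:00Z',
});
// line is "#1 | Example repository | GitHub | 2024-03-05"
```

Because `date` is typed as `string | Date`, consumers probably need a normalization step like `toIsoDate` before sorting or comparing results.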
21
31
  /**
@@ -52,6 +62,16 @@ interface PaginationConfig {
52
62
  * Required if type is 'click-next'.
53
63
  */
54
64
  nextButtonSelector?: string;
65
+ /**
66
+ * The safety threshold for the maximum number of pages to fetch automatically
67
+ * in a single search call.
68
+ *
69
+ * Even if the requested `limit` of results hasn't been reached, the searcher
70
+ * will stop after this many pages to prevent infinite loops or excessive API usage.
71
+ *
72
+ * @default 10
73
+ */
74
+ maxPages?: number;
55
75
  }
56
76
  /**
57
77
  * Context object passed to the transform function.
@@ -64,7 +84,7 @@ interface SearchContext {
64
84
  /** The requested limit of results. */
65
85
  limit?: number;
66
86
  }
67
- type SearchTimeRangePreset = 'all' | 'day' | 'week' | 'month' | 'year';
87
+ type SearchTimeRangePreset = 'all' | 'hour' | 'day' | 'week' | 'month' | 'year';
68
88
  interface CustomTimeRange {
69
89
  /** Start date (Date object or string like 'YYYY-MM-DD'). */
70
90
  from: Date | string;
@@ -80,6 +100,15 @@ type SafeSearchLevel = 'off' | 'moderate' | 'strict';
80
100
  interface SearchOptions {
81
101
  /** The maximum number of results to retrieve. */
82
102
  limit?: number;
103
+ /**
104
+ * The maximum number of pages (fetch cycles) allowed to reach the requested `limit`.
105
+ *
106
+ * This is a safety guard. If the `limit` is high but each page has few results,
107
+ * the searcher will stop once this page count is reached.
108
+ *
109
+ * If not provided, it defaults to the value in `PaginationConfig` or 10.
110
+ */
111
+ maxPages?: number;
83
112
  /**
84
113
  * Date range for the search results.
85
114
  * Default: 'all'
package/dist/index.d.ts CHANGED
@@ -15,7 +15,17 @@ interface StandardSearchResult {
15
15
  snippet?: string;
16
16
  /** An optional image URL associated with the result. */
17
17
  image?: string;
18
- /** Allows for engine-specific extra fields (e.g., rank, author, date). */
18
+ /** The date the result was published or last updated. */
19
+ date?: string | Date;
20
+ /** The author or source name of the result. */
21
+ author?: string;
22
+ /** The favicon URL of the source website. */
23
+ favicon?: string;
24
+ /** The rank or position of the result (usually 1-indexed). */
25
+ rank?: number;
26
+ /** The source website name (e.g., 'GitHub', 'StackOverflow'). */
27
+ source?: string;
28
+ /** Allows for engine-specific extra fields (e.g., siteIcon, category). */
19
29
  [key: string]: any;
20
30
  }
21
31
  /**
@@ -52,6 +62,16 @@ interface PaginationConfig {
52
62
  * Required if type is 'click-next'.
53
63
  */
54
64
  nextButtonSelector?: string;
65
+ /**
66
+ * The safety threshold for the maximum number of pages to fetch automatically
67
+ * in a single search call.
68
+ *
69
+ * Even if the requested `limit` of results hasn't been reached, the searcher
70
+ * will stop after this many pages to prevent infinite loops or excessive API usage.
71
+ *
72
+ * @default 10
73
+ */
74
+ maxPages?: number;
55
75
  }
56
76
  /**
57
77
  * Context object passed to the transform function.
@@ -64,7 +84,7 @@ interface SearchContext {
64
84
  /** The requested limit of results. */
65
85
  limit?: number;
66
86
  }
67
- type SearchTimeRangePreset = 'all' | 'day' | 'week' | 'month' | 'year';
87
+ type SearchTimeRangePreset = 'all' | 'hour' | 'day' | 'week' | 'month' | 'year';
68
88
  interface CustomTimeRange {
69
89
  /** Start date (Date object or string like 'YYYY-MM-DD'). */
70
90
  from: Date | string;
@@ -80,6 +100,15 @@ type SafeSearchLevel = 'off' | 'moderate' | 'strict';
80
100
  interface SearchOptions {
81
101
  /** The maximum number of results to retrieve. */
82
102
  limit?: number;
103
+ /**
104
+ * The maximum number of pages (fetch cycles) allowed to reach the requested `limit`.
105
+ *
106
+ * This is a safety guard. If the `limit` is high but each page has few results,
107
+ * the searcher will stop once this page count is reached.
108
+ *
109
+ * If not provided, it defaults to the value in `PaginationConfig` or 10.
110
+ */
111
+ maxPages?: number;
83
112
  /**
84
113
  * Date range for the search results.
85
114
  * Default: 'all'
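The fallback order described in the `maxPages` doc comment can be sketched as follows. The bundled 0.1.4 code appears to chain the candidates with `||`, so the sketch does the same. `resolveMaxPages` is a hypothetical helper name, not an exported function.

```typescript
// Hypothetical helper mirroring the fallback order:
// per-search option, then pagination config, then the built-in default of 10.
function resolveMaxPages(optionValue?: number, paginationValue?: number): number {
  // `||` treats 0 (and NaN) as "not provided", so such values fall through.
  return optionValue || paginationValue || 10;
}

const fromOption = resolveMaxPages(3, 5);         // 3: the per-search option wins
const fromConfig = resolveMaxPages(undefined, 5); // 5: falls back to the pagination config
const fallback = resolveMaxPages();               // 10: built-in default
const zeroIgnored = resolveMaxPages(0, 5);        // 5: 0 is falsy, so it is skipped
```

One consequence of the `||` chain, if the bundle is read correctly, is that passing `maxPages: 0` does not disable pagination; it selects the next candidate instead.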
package/dist/index.js CHANGED
@@ -1 +1 @@
1
- "use strict";var t,e=Object.defineProperty,r=Object.getOwnPropertyDescriptor,s=Object.getOwnPropertyNames,a=Object.prototype.hasOwnProperty,i={};((t,r)=>{for(var s in r)e(t,s,{get:r[s],enumerable:!0})})(i,{GoogleSearcher:()=>h,WebSearcher:()=>f}),module.exports=(t=i,((t,i,n,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of s(i))a.call(t,c)||c===n||e(t,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return t})(e({},"__esModule",{value:!0}),t));var n=require("@isdk/web-fetcher"),o=require("custom-factory"),c=require("lodash-es");function l(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const s=e[r.trim()];return void 0!==s?String(s):""});if(Array.isArray(t))return t.map(t=>l(t,e));if((0,c.isPlainObject)(t)){const r={};for(const s in t)Object.prototype.hasOwnProperty.call(t,s)&&(r[s]=l(t[s],e));return r}return t}var u=require("lodash-es"),f=class extends n.FetchSession{static async search(t,e,r={}){const s=this.createObject(t,r);if(!s)throw new Error(`Search engine not found: ${t}`);try{return await s.search(e,r)}finally{await s.dispose()}}get pagination(){}createContext(t=this.options){const e=this.template,r=(0,u.defaultsDeep)({},e,t);return e.engine&&"auto"!==e.engine||!t.engine||(r.engine=t.engine),super.createContext(r)}async search(t,e={}){const r=e.limit||10,s=[];let a=0;const i=this.pagination?.startValue??0,n=this.pagination?.increment??1;for(;s.length<r;){const o=this.formatOptions(e),c=i+a*n,f={...e,...o,query:t,page:a+i,offset:c,limit:r},h=l(this.template,f),m=(0,u.defaultsDeep)({},h,e),d=[];if(0===a||"url-param"===this.pagination?.type?m.url&&d.push({id:"goto",params:{url:m.url}}):"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(d.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),d.push({id:"waitFor",params:{networkIdle:!0,ms:500}})),m.actions){const 
t=m.actions.filter(t=>!(d.length>0&&"goto"===d[0].id&&"goto"===t.id));d.push(...t)}m.engine&&this.context.engine!==m.engine&&m.engine;const{outputs:g}=await this.executeAll(d),p={query:t,page:a,limit:e.limit};let w=[];if(w=await this.transform(g,p),e.transform&&(w=await e.transform(w,p)),!w||0===w.length)break;if(s.push(...w),s.length>=r||!this.pagination)break;if(a++,a>10)break}return s.slice(0,r)}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};f._isFactory=!1,(0,o.addBaseFactoryAbility)(f),f.prototype.name="Searcher";var h=class extends f{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),s=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(s.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(s)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new 
URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};h.alias=["google"];
1
+ "use strict";var e,t=Object.defineProperty,r=Object.getOwnPropertyDescriptor,s=Object.getOwnPropertyNames,a=Object.prototype.hasOwnProperty,i={};((e,r)=>{for(var s in r)t(e,s,{get:r[s],enumerable:!0})})(i,{GoogleSearcher:()=>f,WebSearcher:()=>h}),module.exports=(e=i,((e,i,n,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of s(i))a.call(e,c)||c===n||t(e,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return e})(t({},"__esModule",{value:!0}),e));var n=require("@isdk/web-fetcher"),o=require("custom-factory"),c=require("lodash-es");function l(e,t){if("string"==typeof e)return e.replace(/\$\{(.*?)\}/g,(e,r)=>{const s=t[r.trim()];return void 0!==s?String(s):""});if(Array.isArray(e))return e.map(e=>l(e,t));if((0,c.isPlainObject)(e)){const r={};for(const s in e)Object.prototype.hasOwnProperty.call(e,s)&&(r[s]=l(e[s],t));return r}return e}var u=require("lodash-es"),h=class extends n.FetchSession{static async search(e,t,r={}){const s=this.createObject(e,r);if(!s)throw new Error(`Search engine not found: ${e}`);try{return await s.search(t,r)}finally{await s.dispose()}}get pagination(){}createContext(e=this.options){const t=this.template,r=(0,u.defaultsDeep)({},t,e);return t.engine&&"auto"!==t.engine||!e.engine||(r.engine=e.engine),super.createContext(r)}async search(e,t={}){const r=t.limit||10,s=[];let a=0;const i=this.pagination?.startValue??0,n=this.pagination?.increment??1,o=t.maxPages||this.pagination?.maxPages||10;for(;s.length<r;){const 
c=this.formatOptions(t),h=i+a*n,f={...t,...c,query:e,page:a+i,offset:h,limit:r},m=l(this.template,f),{actions:d,...g}=t,p=(0,u.defaultsDeep)({},m,g),w=[],y=p.actions||[];if(0===a||"url-param"===this.pagination?.type){if(p.url){y.some(e=>"goto"===(e.id??e.name??e.action)&&e.params?.url===p.url)||w.push({id:"goto",params:{url:p.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(w.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),w.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));w.push(...y),p.engine&&this.context.engine!==p.engine&&p.engine;const{outputs:b}=await this.executeAll(w),q={query:e,page:a,limit:t.limit};let $=[];if($=await this.transform(b,q),t.transform&&($=await t.transform($,q)),!$||0===$.length)break;if(s.push(...$),s.length>=r||!this.pagination)break;if(a++,a>=o)break}return s.slice(0,r)}async transform(e,t){return e.results||[]}formatOptions(e){return{...e}}};h._isFactory=!1,(0,o.addBaseFactoryAbility)(h),h.prototype.name="Searcher";var f=class extends h{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(e){const t={};if(e.timeRange)if("string"==typeof e.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[e.timeRange]&&(t.tbs=r[e.timeRange])}else{const r=new Date(e.timeRange.from),s=e.timeRange.to?new Date(e.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(s.getTime())){const 
e=e=>`${e.getMonth()+1}/${e.getDate()}/${e.getFullYear()}`;t.tbs=`cdr:1,cd_min:${e(r)},cd_max:${e(s)}`}}if(e.category){const r={images:"isch",videos:"vid",news:"nws"};r[e.category]&&(t.tbm=r[e.category])}return e.region&&(t.gl=e.region),e.language&&(t.hl=e.language),e.safeSearch&&("strict"===e.safeSearch?t.safe="active":"off"===e.safeSearch&&(t.safe="images")),t}async transform(e){const t=e.results||[];return Array.isArray(t)?t.map(e=>{if(e.url&&e.url.startsWith("/url?q="))try{const t=new URL(e.url,"https://www.google.com").searchParams.get("q");t&&(e.url=t)}catch(e){}return e}):[]}};f.alias=["google"];
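For readers who prefer not to parse the minified bundle, the time-range handling in `GoogleSearcher.formatOptions` (including the new `'hour'` preset) can be restated roughly as below. This is a readable sketch derived from the bundled code above, not the original source. `toTbs`, `PRESET_TO_TBS`, `Preset`, and `CustomRange` are hypothetical names, and the optional `to` field is an assumption based on how the bundle handles a missing end date.

```typescript
type Preset = 'all' | 'hour' | 'day' | 'week' | 'month' | 'year';

interface CustomRange {
  from: Date | string;
  to?: Date | string; // assumed optional: the bundle falls back to the current date
}

// Preset names mapped to the value of Google's `tbs` query parameter.
const PRESET_TO_TBS: Partial<Record<Preset, string>> = {
  hour: 'qdr:h',
  day: 'qdr:d',
  week: 'qdr:w',
  month: 'qdr:m',
  year: 'qdr:y',
};

// Formats a date as M/D/YYYY using local time, as the bundle does.
const formatDate = (d: Date): string =>
  `${d.getMonth() + 1}/${d.getDate()}/${d.getFullYear()}`;

function toTbs(timeRange: Preset | CustomRange): string | undefined {
  if (typeof timeRange === 'string') {
    return PRESET_TO_TBS[timeRange]; // 'all' has no entry, so no filter is applied
  }
  const from = new Date(timeRange.from);
  const to = timeRange.to ? new Date(timeRange.to) : new Date();
  if (Number.isNaN(from.getTime()) || Number.isNaN(to.getTime())) return undefined;
  return `cdr:1,cd_min:${formatDate(from)},cd_max:${formatDate(to)}`;
}

const lastHour = toTbs('hour'); // 'qdr:h'
const custom = toTbs({ from: new Date(2023, 0, 1), to: new Date(2023, 11, 31) });
// custom is 'cdr:1,cd_min:1/1/2023,cd_max:12/31/2023'
```

Unparsable dates yield `undefined` in this sketch, which corresponds to the bundle leaving `tbs` unset so that the `${tbs}` placeholder in the URL template resolves to an empty string.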
package/dist/index.mjs CHANGED
@@ -1 +1 @@
1
- import{FetchSession as t}from"@isdk/web-fetcher";import{addBaseFactoryAbility as r}from"custom-factory";import{isPlainObject as e}from"lodash-es";function s(t,r){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,e)=>{const s=r[e.trim()];return void 0!==s?String(s):""});if(Array.isArray(t))return t.map(t=>s(t,r));if(e(t)){const e={};for(const a in t)Object.prototype.hasOwnProperty.call(t,a)&&(e[a]=s(t[a],r));return e}return t}import{defaultsDeep as a}from"lodash-es";var i=class extends t{static async search(t,r,e={}){const s=this.createObject(t,e);if(!s)throw new Error(`Search engine not found: ${t}`);try{return await s.search(r,e)}finally{await s.dispose()}}get pagination(){}createContext(t=this.options){const r=this.template,e=a({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(e.engine=t.engine),super.createContext(e)}async search(t,r={}){const e=r.limit||10,i=[];let o=0;const n=this.pagination?.startValue??0,c=this.pagination?.increment??1;for(;i.length<e;){const l=this.formatOptions(r),m=n+o*c,h={...r,...l,query:t,page:o+n,offset:m,limit:e},f=s(this.template,h),u=a({},f,r),p=[];if(0===o||"url-param"===this.pagination?.type?u.url&&p.push({id:"goto",params:{url:u.url}}):"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(p.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),p.push({id:"waitFor",params:{networkIdle:!0,ms:500}})),u.actions){const t=u.actions.filter(t=>!(p.length>0&&"goto"===p[0].id&&"goto"===t.id));p.push(...t)}u.engine&&this.context.engine!==u.engine&&u.engine;const{outputs:d}=await this.executeAll(p),w={query:t,page:o,limit:r.limit};let g=[];if(g=await this.transform(d,w),r.transform&&(g=await r.transform(g,w)),!g||0===g.length)break;if(i.push(...g),i.length>=e||!this.pagination)break;if(o++,o>10)break}return i.slice(0,e)}async transform(t,r){return t.results||[]}formatOptions(t){return{...t}}};i._isFactory=!1,r(i),i.prototype.name="Searcher";var o=class extends i{get 
template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const r={};if(t.timeRange)if("string"==typeof t.timeRange){const e={day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};e[t.timeRange]&&(r.tbs=e[t.timeRange])}else{const e=new Date(t.timeRange.from),s=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(e.getTime())&&!isNaN(s.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;r.tbs=`cdr:1,cd_min:${t(e)},cd_max:${t(s)}`}}if(t.category){const e={images:"isch",videos:"vid",news:"nws"};e[t.category]&&(r.tbm=e[t.category])}return t.region&&(r.gl=t.region),t.language&&(r.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?r.safe="active":"off"===t.safeSearch&&(r.safe="images")),r}async transform(t){const r=t.results||[];return Array.isArray(r)?r.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const r=new URL(t.url,"https://www.google.com").searchParams.get("q");r&&(t.url=r)}catch(t){}return t}):[]}};o.alias=["google"];export{o as GoogleSearcher,i as WebSearcher};
1
+ import{FetchSession as t}from"@isdk/web-fetcher";import{addBaseFactoryAbility as r}from"custom-factory";import{isPlainObject as e}from"lodash-es";function s(t,r){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,e)=>{const s=r[e.trim()];return void 0!==s?String(s):""});if(Array.isArray(t))return t.map(t=>s(t,r));if(e(t)){const e={};for(const a in t)Object.prototype.hasOwnProperty.call(t,a)&&(e[a]=s(t[a],r));return e}return t}import{defaultsDeep as a}from"lodash-es";var i=class extends t{static async search(t,r,e={}){const s=this.createObject(t,e);if(!s)throw new Error(`Search engine not found: ${t}`);try{return await s.search(r,e)}finally{await s.dispose()}}get pagination(){}createContext(t=this.options){const r=this.template,e=a({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(e.engine=t.engine),super.createContext(e)}async search(t,r={}){const e=r.limit||10,i=[];let o=0;const n=this.pagination?.startValue??0,c=this.pagination?.increment??1,h=r.maxPages||this.pagination?.maxPages||10;for(;i.length<e;){const l=this.formatOptions(r),m=n+o*c,f={...r,...l,query:t,page:o+n,offset:m,limit:e},u=s(this.template,f),{actions:p,...d}=r,w=a({},u,d),g=[],y=w.actions||[];if(0===o||"url-param"===this.pagination?.type){if(w.url){y.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===w.url)||g.push({id:"goto",params:{url:w.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(g.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),g.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));g.push(...y),w.engine&&this.context.engine!==w.engine&&w.engine;const{outputs:$}=await this.executeAll(g),b={query:t,page:o,limit:r.limit};let q=[];if(q=await this.transform($,b),r.transform&&(q=await r.transform(q,b)),!q||0===q.length)break;if(i.push(...q),i.length>=e||!this.pagination)break;if(o++,o>=h)break}return i.slice(0,e)}async transform(t,r){return 
t.results||[]}formatOptions(t){return{...t}}};i._isFactory=!1,r(i),i.prototype.name="Searcher";var o=class extends i{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const r={};if(t.timeRange)if("string"==typeof t.timeRange){const e={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};e[t.timeRange]&&(r.tbs=e[t.timeRange])}else{const e=new Date(t.timeRange.from),s=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(e.getTime())&&!isNaN(s.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;r.tbs=`cdr:1,cd_min:${t(e)},cd_max:${t(s)}`}}if(t.category){const e={images:"isch",videos:"vid",news:"nws"};e[t.category]&&(r.tbm=e[t.category])}return t.region&&(r.gl=t.region),t.language&&(r.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?r.safe="active":"off"===t.safeSearch&&(r.safe="images")),r}async transform(t){const r=t.results||[];return Array.isArray(r)?r.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const r=new URL(t.url,"https://www.google.com").searchParams.get("q");r&&(t.url=r)}catch(t){}return t}):[]}};o.alias=["google"];export{o as GoogleSearcher,i as WebSearcher};
package/docs/README.md CHANGED
@@ -46,27 +46,37 @@ console.log(results);
46
46
 
47
47
  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
48
48
 
49
- **Configuration Precedence:**
50
- When creating a session, options are merged in the following order:
49
+ ### 🛡️ Core Principle: Template is Law
51
50
 
52
- 1. **Template Default**: Defined in the WebSearcher class (highest priority for structural options).
53
- 2. **User Options**: Passed to the constructor (can fill missing defaults or override if allowed).
51
+ The `template` defined in the `WebSearcher` subclass acts as the authoritative "blueprint".
54
52
 
55
- *Note: If the template sets `engine: 'auto'` (default), user-provided `engine` option will be respected.*
53
+ - **Template Priority**: If the template defines a property (e.g., `engine: 'browser'`, `headers`), that value is **locked** and cannot be overridden by user options. This ensures engine stability.
54
+ - **Immutable Actions**: The `actions` array in the template is strictly protected. Users cannot append, replace, or modify the execution steps via `options`. This prevents external logic from breaking the scraper's flow.
+ - **User Flexibility**: Properties **not** explicitly defined in the template (such as `proxy`, `timeoutMs`, or custom variables) can be freely set by the user in the constructor or `search()` method.
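
A minimal sketch of the precedence rule described above, assuming a shallow merge in which any key the template defines wins over the user's value. `mergeWithTemplate`, the sample template, and the `'http'` engine value are hypothetical and exist only for illustration; the package's real merge logic may differ, for example in how nested objects are handled.

```typescript
// Hypothetical helper, not the package's actual implementation.
type Options = Record<string, unknown>;

function mergeWithTemplate(template: Options, user: Options): Options {
  // Start from the user's options so unlocked keys (proxy, timeoutMs) survive.
  const merged: Options = { ...user };
  for (const key of Object.keys(template)) {
    // Any key the template defines is treated as locked: the template wins.
    if (template[key] !== undefined) merged[key] = template[key];
  }
  return merged;
}

const template: Options = {
  engine: "browser",
  actions: [{ id: "extract", storeAs: "results" }],
};

const userOptions: Options = {
  engine: "http", // assumed alternative engine name; ignored because locked
  actions: [], // ignored: actions are defined by the template
  proxy: "http://my-proxy:8080", // kept: not defined by the template
  timeoutMs: 30000, // kept: not defined by the template
};

const merged = mergeWithTemplate(template, userOptions);
console.log(merged.engine, merged.proxy, merged.timeoutMs);
```

In this sketch the user's `engine` and `actions` are discarded, while `proxy` and `timeoutMs` pass through unchanged.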
 
  ```typescript
  // Create a persistent session
  const google = new GoogleSearcher({
- headless: false, // Override default options (e.g., show browser)
+ headless: false, // Override if not locked in template
  proxy: 'http://my-proxy:8080',
- timeoutMs: 30000 // Set a global timeout for requests
+ timeoutMs: 30000 // Set a global timeout (valid if template doesn't define it)
  });
+ ```
+
+ ### 🧠 Intelligent Navigation (Goto)
+
+ The `WebSearcher` automatically manages navigation to the search URL.
+
+ 1. **Auto-Injection**: If your template does **not** include a `goto` action, the searcher automatically inserts one at the beginning of the action list, pointing to the resolved `url` (with query variables injected).
+ 2. **Manual Control**: If you explicitly add a `goto` action in your template that matches the resolved URL, the searcher detects this duplicate and **skips** the automatic injection. This gives you full control to add headers, referrer, or other specific parameters to the navigation step if needed.
+ 3. **Multi-Step Flows**: You can define multiple `goto` actions in your template (e.g., visit a login page first). The searcher will still prepend the main search URL navigation unless one of your `goto` actions matches it exactly.
 
+ ```typescript
  try {
  // First query
  // You can also pass runtime options to override session defaults or inject variables
  const results1 = await google.search('term A', {
- timeoutMs: 60000, // Override timeout just for this search
+ timeoutMs: 60000, // Override session timeout just for this search
  extraParam: 'value' // Can be used in template as ${extraParam}
  });
 
@@ -176,24 +186,43 @@ protected override async transform(outputs: Record<string, any>) {
  }
  ```
 
- ## 🧠 Advanced Concepts
+ ### 🧠 Advanced Concepts
+
+ ### Auto-Pagination: `limit` vs `maxPages`
+
+ The `WebSearcher` is designed to be result-oriented. When you call `search()`, you specify how many results you want, and the searcher handles the pagination logic.
 
- ### Auto-Pagination & Filtering
+ - **`limit`**: Your target number of total results.
+ - **`maxPages`**: The safety threshold. It limits how many pages (fetch cycles) the searcher is allowed to navigate to satisfy your `limit`.
 
- The `WebSearcher` is smart. If you request `limit: 10`, but the first page only returns 5 results (or if your `transform` filters out results), it will automatically fetch the next page until the limit is met.
+ **Example Logic:**
+ If you request `{ limit: 50 }` but each page only has 5 results:
+
+ 1. The searcher fetches page 1 (5 results).
+ 2. It sees `5 < 50`, so it fetches page 2.
+ 3. It continues until it has 50 results **OR** it reaches `maxPages` (default 10).
+
+ This prevents infinite loops if the "Next" button selector is broken or if the search engine keeps returning the same results.
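
The stopping rule above can be sketched as a simple loop, assuming the searcher keeps fetching until it has `limit` results or has used up `maxPages`. `collect`, `fakePage`, the `Result` shape, and the `keep` predicate are hypothetical stand-ins for illustration, not part of the package's API.

```typescript
// Hypothetical types and helpers for illustration; not the package's API.
type Result = { url: string; sponsored: boolean };
type FetchPage = (pageIndex: number) => Promise<Result[]>;

async function collect(
  fetchPage: FetchPage,
  limit: number,
  maxPages: number = 10,
  keep: (r: Result) => boolean = () => true,
): Promise<{ results: Result[]; pagesFetched: number }> {
  const results: Result[] = [];
  let pagesFetched = 0;
  // Stop when the target is met OR the page budget is exhausted.
  while (results.length < limit && pagesFetched < maxPages) {
    const page = await fetchPage(pagesFetched);
    pagesFetched += 1;
    // Filtered-out entries do not count toward the limit.
    results.push(...page.filter(keep));
  }
  return { results: results.slice(0, limit), pagesFetched };
}

// Fake engine: every page holds 5 results; odd-numbered ones are "sponsored".
const fakePage: FetchPage = async (pageIndex) =>
  Array.from({ length: 5 }, (_, j) => {
    const n = pageIndex * 5 + j;
    return { url: `https://example.com/${n}`, sponsored: n % 2 === 1 };
  });

// limit 50 at 5 results per page needs all 10 default pages.
collect(fakePage, 50).then((r) => console.log(r.results.length, r.pagesFetched)); // 50 10
// A lower maxPages cuts the run short before the limit is reached.
collect(fakePage, 50, 3).then((r) => console.log(r.results.length, r.pagesFetched)); // 15 3
```

Passing a `keep` predicate such as `(r) => !r.sponsored` makes the same loop fetch extra pages to compensate for dropped entries: with `limit: 10`, this fake engine needs four pages instead of two.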
 
  ### User-defined Transforms
 
  Users can provide their own `transform` when calling `search`. This runs **after** the engine's built-in transform.
 
+ This is particularly useful for **filtering out ads** or irrelevant content. If your transform removes results, the auto-pagination logic fetches additional pages so that the final list still meets the requested `limit` with only valid entries, up to `maxPages`.
+
  ```typescript
  await google.search('test', {
- transform: (results) => results.filter(r => r.url.endsWith('.pdf'))
+ limit: 20,
+ // Example: Filter out sponsored results and only keep PDFs
+ transform: (results) => {
+ return results.filter(r => {
+ const isAd = r.isSponsored || r.url.includes('googleadservices.com');
+ return !isAd && r.url.endsWith('.pdf');
+ });
+ }
  });
  ```
 
- If the user filters out results, the auto-pagination logic will kick in to fetch more pages to meet the requested limit.
-
  ### Standardized Search Options
 
  When calling `search()`, you can provide standardized options that the search engine will map to specific parameters:
@@ -201,7 +230,7 @@ When calling `search()`, you can provide standardized options that the search en
  ```typescript
  const results = await google.search('open source', {
  limit: 20,
- timeRange: 'month', // 'day', 'week', 'month', 'year'
+ timeRange: 'month', // 'hour', 'day', 'week', 'month', 'year'
  // Or custom range:
  // timeRange: { from: '2023-01-01', to: '2023-12-31' },
  category: 'news', // 'all', 'images', 'videos', 'news'