@isdk/web-searcher 0.1.5 → 0.1.6

Files changed (35)
  1. package/README.cn.md +40 -0
  2. package/README.md +40 -0
  3. package/dist/index.d.mts +14 -0
  4. package/dist/index.d.ts +14 -0
  5. package/dist/index.js +1 -1
  6. package/dist/index.mjs +1 -1
  7. package/docs/README.md +40 -0
  8. package/docs/classes/GoogleSearcher.md +150 -48
  9. package/docs/classes/WebSearcher.md +138 -48
  10. package/docs/functions/extractDate.md +1 -1
  11. package/docs/functions/extractMetadataFrom.md +1 -1
  12. package/docs/functions/fetchHeaders.md +1 -1
  13. package/docs/functions/fetchPartial.md +1 -1
  14. package/docs/functions/normalizeDate.md +1 -1
  15. package/docs/functions/parseHeaders.md +1 -1
  16. package/docs/functions/parseHtml.md +1 -1
  17. package/docs/functions/testUrlsByLatency.md +5 -1
  18. package/docs/interfaces/CustomTimeRange.md +3 -3
  19. package/docs/interfaces/ExtractOptions.md +4 -4
  20. package/docs/interfaces/FetchExtractorOptions.md +3 -3
  21. package/docs/interfaces/FetcherOptions.md +41 -29
  22. package/docs/interfaces/HtmlData.md +4 -4
  23. package/docs/interfaces/MetadataResult.md +2 -2
  24. package/docs/interfaces/PaginationConfig.md +7 -7
  25. package/docs/interfaces/SearchContext.md +6 -6
  26. package/docs/interfaces/SearchOptions.md +13 -13
  27. package/docs/interfaces/StandardSearchResult.md +10 -10
  28. package/docs/interfaces/VerifiedUrl.md +3 -3
  29. package/docs/type-aliases/MetadataType.md +1 -1
  30. package/docs/type-aliases/SafeSearchLevel.md +1 -1
  31. package/docs/type-aliases/SearchCategory.md +1 -1
  32. package/docs/type-aliases/SearchTimeRange.md +1 -1
  33. package/docs/type-aliases/SearchTimeRangePreset.md +1 -1
  34. package/docs/type-aliases/SearcherConstructor.md +1 -1
  35. package/package.json +2 -2
package/README.cn.md CHANGED
@@ -59,6 +59,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
 
  由于 `WebSearcher` 继承自 `FetchSession`,您可以实例化它以在多个请求之间保持 Cookie 和存储。这对于需要登录的搜索或通过模拟人类行为来避免反爬虫非常有用。
 
+ ### 5. 默认搜索参数 (Default Search Parameters)
+
+ 您可以从三个层面设置默认搜索参数:**全局**、**引擎特定**和**实例级别**。这可以避免在每次调用 `search()` 时重复传递相同的选项。
+
+ 优先级顺序(从高到低)为:
+ `search(query, options)` (调用参数) > `this.options` (实例参数) > `Engine.defaultOptions` (引擎静态参数) > `WebSearcher.defaultOptions` (全局静态参数)
+
+ #### A. 全局静态默认值
+
+ 影响所有搜索引擎。
+
+ ```typescript
+ import { WebSearcher } from '@isdk/web-fetcher';
+
+ // 为所有搜索器设置全局限制
+ WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+ ```
+
+ #### B. 引擎特定静态默认值
+
+ 仅影响特定的引擎(及其子类)。
+
+ ```typescript
+ import { GoogleSearcher } from '@isdk/web-fetcher';
+
+ // 仅 Google 会使用这些默认值
+ GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+ ```
+
+ #### C. 实例级别默认值
+
+ 在创建搜索器实例时设置。
+
+ ```typescript
+ const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+ // 此次搜索将自动使用 limit: 5 和 category: 'news'
+ const results = await google.search('open source');
+ ```
+
  ### 🧬 动态模板 (Dynamic Templates)
 
  虽然静态 `template` 适用于简单的搜索引擎,但许多网站(如 Google)会根据搜索类别(如“网页” vs “图片” vs “新闻”)彻底改变其 HTML 结构。
package/README.md CHANGED
@@ -59,6 +59,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
 
  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
 
+ ### 5. Default Search Parameters
+
+ You can set default search parameters at three levels: **Global**, **Engine-specific**, and **Instance-level**. This avoids passing repetitive options to every `search()` call.
+
+ The priority order (from highest to lowest) is:
+ `search(query, options)` (Call) > `this.options` (Instance) > `Engine.defaultOptions` (Static Engine) > `WebSearcher.defaultOptions` (Static Global)
+
+ #### A. Global Static Defaults
+
+ Affects all search engines.
+
+ ```typescript
+ import { WebSearcher } from '@isdk/web-fetcher';
+
+ // Set global limit for all searchers
+ WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+ ```
+
+ #### B. Engine-Specific Static Defaults
+
+ Affects only a specific engine (and its subclasses).
+
+ ```typescript
+ import { GoogleSearcher } from '@isdk/web-fetcher';
+
+ // Only Google will use these defaults
+ GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+ ```
+
+ #### C. Instance-Level Defaults
+
+ Set when creating a searcher instance.
+
+ ```typescript
+ const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+ // This search will use limit: 5 and category: 'news' automatically
+ const results = await google.search('open source');
+ ```
+
  ### 🧬 Dynamic Templates
 
  While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').
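The priority order described in the README hunk above (call > instance > engine static > global static) can be illustrated with a small self-contained sketch. It does not import the package: `FlatOptions` and `mergeByPriority` are hypothetical names introduced here, and the shallow "first defined value wins" merge only approximates the `defaultsDeep` call that appears in the bundled `search()` method.

```typescript
// Hypothetical stand-in for the package's SearchOptions; any flat key/value bag works here.
type FlatOptions = Record<string, unknown>;

// Merge layers so that earlier arguments take priority over later ones.
// This is a shallow approximation of lodash `defaultsDeep`, which also recurses into nested objects.
function mergeByPriority(...layers: FlatOptions[]): FlatOptions {
  const merged: FlatOptions = {};
  for (const layer of layers) {
    for (const key of Object.keys(layer)) {
      if (merged[key] === undefined && layer[key] !== undefined) {
        merged[key] = layer[key];
      }
    }
  }
  return merged;
}

// Values taken from the README examples, plus an assumed per-call override.
const globalDefaults: FlatOptions = { limit: 20, safeSearch: 'strict' };
const engineDefaults: FlatOptions = { region: 'US', language: 'en' };
const instanceOptions: FlatOptions = { limit: 5, category: 'news' };
const callOptions: FlatOptions = { limit: 3 };

// Highest priority first: call, instance, engine static, global static.
const effective = mergeByPriority(callOptions, instanceOptions, engineDefaults, globalDefaults);
console.log(effective);
```

Under these assumptions `limit` resolves to 3 from the call, `category` comes from the instance, `region` and `language` from the engine defaults, and `safeSearch` from the global defaults.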
package/dist/index.d.mts CHANGED
@@ -320,6 +320,7 @@ declare function testUrlsByLatency(urls: string[], options?: {
  timeout?: number;
  limit?: number;
  testPath?: string;
+ proxy?: string;
  }): Promise<VerifiedUrl[]>;
 
  /**
@@ -362,6 +363,19 @@ declare abstract class WebSearcher extends FetchSession {
  static defaultBaseUrls?: string[];
  /** Globally shared index for tracking the currently active instance (node) across sessions. */
  static currentInstanceIndex?: number;
+ /** @internal */
+ static _defaultOptions?: SearchOptions;
+ /**
+ * Gets or sets the default search parameters for this specific engine class.
+ * This does not include settings from parent classes.
+ */
+ static get defaultOptions(): SearchOptions;
+ static set defaultOptions(options: SearchOptions);
+ /**
+ * Retrieves the combined default search options by traversing the prototype chain.
+ * Priority: Current class > Parent class > WebSearcher base class.
+ */
+ static getDefaultOptions(): SearchOptions;
  /**
  * Registers a search engine class.
  *
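Besides the new `proxy` option in the declaration above, the 0.1.6 bundle in `package/dist/index.js` appears to change how `testUrlsByLatency` treats `limit`: 0.1.5 defaulted it to 20 and always sliced, while 0.1.6 slices only when `limit` is a non-zero number. The sketch below models just that post-processing step. `rankMeasurements` is a hypothetical helper, the `VerifiedUrl` shape is inferred from the bundle's return value, and no network calls are made.

```typescript
// Shape inferred from the bundle's `{ url, latency }` results; redeclared so the sketch is self-contained.
interface VerifiedUrl {
  url: string;
  latency: number;
}

// Hypothetical helper: drop failed probes (represented as `undefined`),
// sort by ascending latency, and truncate only when a usable limit is given.
function rankMeasurements(
  measured: Array<VerifiedUrl | undefined>,
  limit?: number,
): VerifiedUrl[] {
  let ranked = measured
    .filter((entry): entry is VerifiedUrl => entry != null)
    .sort((a, b) => a.latency - b.latency);
  // No default limit: an omitted or zero limit returns every reachable URL.
  if (typeof limit === 'number' && limit) {
    ranked = ranked.slice(0, limit);
  }
  return ranked;
}

// Made-up measurements for illustration only.
const measured: Array<VerifiedUrl | undefined> = [
  { url: 'https://a.example', latency: 120 },
  undefined, // a probe that failed or timed out
  { url: 'https://b.example', latency: 45 },
  { url: 'https://c.example', latency: 300 },
];

const allReachable = rankMeasurements(measured);
const fastestTwo = rankMeasurements(measured, 2);
console.log(allReachable.map((entry) => entry.url));
console.log(fastestTwo.map((entry) => entry.url));
```

If this reading of the bundle is right, callers that relied on the implicit cap of 20 results would need to pass `limit` explicitly after upgrading.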
package/dist/index.d.ts CHANGED
@@ -320,6 +320,7 @@ declare function testUrlsByLatency(urls: string[], options?: {
  timeout?: number;
  limit?: number;
  testPath?: string;
+ proxy?: string;
  }): Promise<VerifiedUrl[]>;
 
  /**
@@ -362,6 +363,19 @@ declare abstract class WebSearcher extends FetchSession {
  static defaultBaseUrls?: string[];
  /** Globally shared index for tracking the currently active instance (node) across sessions. */
  static currentInstanceIndex?: number;
+ /** @internal */
+ static _defaultOptions?: SearchOptions;
+ /**
+ * Gets or sets the default search parameters for this specific engine class.
+ * This does not include settings from parent classes.
+ */
+ static get defaultOptions(): SearchOptions;
+ static set defaultOptions(options: SearchOptions);
+ /**
+ * Retrieves the combined default search options by traversing the prototype chain.
+ * Priority: Current class > Parent class > WebSearcher base class.
+ */
+ static getDefaultOptions(): SearchOptions;
  /**
  * Registers a search engine class.
  *
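The split between `defaultOptions` (own settings only) and `getDefaultOptions()` (settings combined along the prototype chain) declared above can be sketched as follows. This is a readable approximation of what the minified 0.1.6 bundle seems to do, not the package source: `BaseSearcher`, `EngineSearcher`, and `OptionBag` are hypothetical names, and a shallow merge stands in for lodash `defaultsDeep`.

```typescript
// Hypothetical stand-in for SearchOptions.
type OptionBag = Record<string, unknown>;

class BaseSearcher {
  static _defaultOptions?: OptionBag;

  // Returns the options stored on this exact class, creating an empty bag on first access.
  static get defaultOptions(): OptionBag {
    const hasOwn = Object.prototype.hasOwnProperty.call(this, '_defaultOptions');
    if (!hasOwn || !this._defaultOptions) {
      this._defaultOptions = {};
    }
    return this._defaultOptions as OptionBag;
  }

  static set defaultOptions(options: OptionBag) {
    this._defaultOptions = options;
  }

  // Walks from the current class up to BaseSearcher, collecting each class's own options.
  static getDefaultOptions(): OptionBag {
    const layers: OptionBag[] = [];
    let ctor: any = this; // `any` because we step through constructor objects generically
    while (ctor && ctor !== Object.prototype) {
      if (Object.prototype.hasOwnProperty.call(ctor, '_defaultOptions') && ctor._defaultOptions) {
        layers.push(ctor._defaultOptions);
      }
      if (ctor === BaseSearcher) break;
      ctor = Object.getPrototypeOf(ctor);
    }
    // More derived classes were collected first, so their values win.
    const merged: OptionBag = {};
    for (const layer of layers) {
      for (const key of Object.keys(layer)) {
        if (merged[key] === undefined) merged[key] = layer[key];
      }
    }
    return merged;
  }
}

class EngineSearcher extends BaseSearcher {}
class OtherSearcher extends BaseSearcher {}

BaseSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
EngineSearcher.defaultOptions = { limit: 10, region: 'US' };

const engineOwn = EngineSearcher.defaultOptions;
const engineCombined = EngineSearcher.getDefaultOptions();
const otherCombined = OtherSearcher.getDefaultOptions();
console.log(engineOwn, engineCombined, otherCombined);
```

Here `engineOwn` holds only what was assigned to `EngineSearcher`, `engineCombined` adds the inherited `safeSearch` while keeping the engine's own `limit`, and `otherCombined` falls back entirely to the base-class values.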
package/dist/index.js CHANGED
@@ -1 +1 @@
- "use strict";var t,e=Object.defineProperty,r=Object.getOwnPropertyDescriptor,n=Object.getOwnPropertyNames,s=Object.prototype.hasOwnProperty,i={};async function a(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function o(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,a=setTimeout(()=>i.abort(),n);let o="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),a=n?.match(/charset=([\w-]+)/i),l=a?a[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,o+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:o,headers:c}}catch(t){return o.length>0?{content:o,headers:c}:null}finally{clearTimeout(a)}}function c(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function l(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=f(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const s=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=s.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=u(t);n&&e.jsonLd.push(n)}}const i=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=i.exec(t));){const t=f(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function u(t){const 
e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function f(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function d(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function h(t,e){const r=l(t.content);return"date"===e?function(t,e){const r=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),n=d(r);if(n)return n;const s=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),i=d(s);if(i)return i;for(const e of t.time){const t=d(e.datetime||e.text);if(t)return t}const a=c(e);return d(a["last-modified"])}(r,t.headers):null}async function m(t,e={}){const r=await o(t,e.maxBytes,e);return r?h(r,"date"):null}((t,r)=>{for(var n in 
r)e(t,n,{get:r[n],enumerable:!0})})(i,{FetcherOptions:()=>y.FetcherOptions,GoogleSearcher:()=>A,WebSearcher:()=>q,extractDate:()=>m,extractMetadataFrom:()=>h,fetchHeaders:()=>a,fetchPartial:()=>o,normalizeDate:()=>d,parseHeaders:()=>c,parseHtml:()=>l,testUrlsByLatency:()=>b}),module.exports=(t=i,((t,i,a,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of n(i))s.call(t,c)||c===a||e(t,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return t})(e({},"__esModule",{value:!0}),t));var p=require("@isdk/web-fetcher");async function b(t,e={}){const{timeout:r=5e3,limit:n=20,testPath:s=""}=e;return(await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await(0,p.fetchWeb)(n,{timeoutMs:r}),{url:t,latency:Date.now()-e}}catch(t){return null}}))).filter(t=>null!==t).sort((t,e)=>t.latency-e.latency).slice(0,n)}var y=require("@isdk/web-fetcher"),w=require("custom-factory"),g=require("lodash-es");function k(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>k(t,e));if((0,g.isPlainObject)(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=k(t[n],e));return r}return t}var $=require("lodash-es"),q=class extends y.FetchSession{static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=r.limit||10,i=r.fillLimit??!0,a=[];for(let t=0;t<n.length;t++){const o=n[t];if(a.length>=s)break;const c=s-a.length,l={...r,limit:c},u=this.createObject(o,l);if(!u)throw new Error(`Search engine not found: ${o}`);try{const t=await u.search(e,l);for(const e of t)e.url&&!a.some(t=>t.url===e.url)&&a.push(e);if(a.length>=s)break;if(!1===i)break}catch(e){if(console.warn(`[WebSearcher] Engine '${o}' failed completely:`,e),t===n.length-1&&0===a.length)throw e}finally{await u.dispose()}}return a}get template(){return{}}get 
pagination(){}getTemplate(t,e){return(0,$.cloneDeep)(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=(0,$.defaultsDeep)({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=e.limit||10,n=[],s=new Set;let i=e.startPage||0;const a=this.pagination?.startValue??0,o=this.pagination?.increment??1,c=e.maxPages||this.pagination?.maxPages||10,l=this.constructor.name;let u;e.baseUrls&&(Array.isArray(e.baseUrls)?u=e.baseUrls:"object"==typeof e.baseUrls&&(u=e.baseUrls[l]||e.baseUrls[this.constructor.alias?.[0]])),u&&0!==u.length||(u=this.constructor.defaultBaseUrls);const f=u&&u.length>0;let d=0;f&&"number"==typeof this.constructor.currentInstanceIndex&&(d=this.constructor.currentInstanceIndex);let h=!1;for(;n.length<r;){let m=!1,p=null;const b=f?u.length:1;let y=0;for(;y<b;){const c=f?u[d]:void 0,b=this.formatOptions(e),w=a+i*o,g={...e,...b,query:t,page:i+a,offset:w,limit:r,baseUrl:c?.endsWith("/")?c.slice(0,-1):c},q=k(this.getTemplate(g,e),g),{actions:A,...v}=e,x=(0,$.defaultsDeep)({},q,v),D=[],S=x.actions||[];if(i===(e.startPage||0)||"url-param"===this.pagination?.type){if(x.url){S.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===x.url)||D.push({id:"goto",params:{url:x.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(D.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),D.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));D.push(...S),x.engine&&this.context.engine!==x.engine&&x.engine;try{const{outputs:r}=await this.executeAll(D,e),a={...e,query:t,page:i,baseUrl:c,engine:l};let o=await this.transform(r,a);e.transform&&(o=await e.transform(o,a));let u=!0;if(this.validateFetchResult&&(u=await this.validateFetchResult(o,a)),u&&e.validator&&(u=await e.validator(o,a)),!u)throw new Error(`Results validation failed for engine: ${l}, url: ${c}`);if(o&&0!==o.length)for(const t of 
o)t.url&&!s.has(t.url)&&(s.add(t.url),n.push(t));else h=!0;m=!0;break}catch(t){p=t,f&&(d=(d+1)%u.length,this.constructor.currentInstanceIndex=d),y++}}if(!m)throw p||new Error(`All instances failed for engine: ${l}`);if(h)break;if(n.length>=r||!this.pagination)break;if(i++,i>=c)break}return n.slice(0,r)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};q._isFactory=!1,(0,w.addBaseFactoryAbility)(q),q.prototype.name="Searcher";var A=class extends q{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return 
t}):[]}};A.alias=["google"];
+ "use strict";var t,e=Object.defineProperty,r=Object.getOwnPropertyDescriptor,n=Object.getOwnPropertyNames,s=Object.prototype.hasOwnProperty,i={};async function a(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function o(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,a=setTimeout(()=>i.abort(),n);let o="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),a=n?.match(/charset=([\w-]+)/i),l=a?a[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,o+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:o,headers:c}}catch(t){return o.length>0?{content:o,headers:c}:null}finally{clearTimeout(a)}}function c(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function l(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=f(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const s=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=s.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=u(t);n&&e.jsonLd.push(n)}}const i=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=i.exec(t));){const t=f(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function u(t){const 
e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function f(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function d(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function h(t,e){const r=l(t.content);return"date"===e?function(t,e){const r=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),n=d(r);if(n)return n;const s=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),i=d(s);if(i)return i;for(const e of t.time){const t=d(e.datetime||e.text);if(t)return t}const a=c(e);return d(a["last-modified"])}(r,t.headers):null}async function p(t,e={}){const r=await o(t,e.maxBytes,e);return r?h(r,"date"):null}((t,r)=>{for(var n in 
r)e(t,n,{get:r[n],enumerable:!0})})(i,{FetcherOptions:()=>y.FetcherOptions,GoogleSearcher:()=>O,WebSearcher:()=>A,extractDate:()=>p,extractMetadataFrom:()=>h,fetchHeaders:()=>a,fetchPartial:()=>o,normalizeDate:()=>d,parseHeaders:()=>c,parseHtml:()=>l,testUrlsByLatency:()=>b}),module.exports=(t=i,((t,i,a,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of n(i))s.call(t,c)||c===a||e(t,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return t})(e({},"__esModule",{value:!0}),t));var m=require("@isdk/web-fetcher");async function b(t,e={}){const{timeout:r=5e3,limit:n,testPath:s="",proxy:i}=e;let a=await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await(0,m.fetchWeb)(n,{timeoutMs:r,proxy:i,throwHttpErrors:!0,enableSmart:!1,engine:"http"}),{url:t,latency:Date.now()-e}}catch(t){return}}));return a=a.filter(t=>null!=t).sort((t,e)=>t.latency-e.latency),"number"==typeof n&&n&&(a=a.slice(0,n)),a}var y=require("@isdk/web-fetcher"),w=require("custom-factory"),g=require("lodash-es");function k(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>k(t,e));if((0,g.isPlainObject)(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=k(t[n],e));return r}return t}var $=require("lodash-es"),q=class t extends y.FetchSession{static get defaultOptions(){return Object.prototype.hasOwnProperty.call(this,"_defaultOptions")||(this._defaultOptions={}),this._defaultOptions}static set defaultOptions(t){this._defaultOptions=t}static getDefaultOptions(){const e=[];let r=this;for(;r&&r!==Object.prototype&&(Object.prototype.hasOwnProperty.call(r,"_defaultOptions")&&r._defaultOptions&&e.push(r._defaultOptions),r!==t);)r=Object.getPrototypeOf(r);return e.length>0?(0,$.defaultsDeep)({},...e):{}}static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=[];for(let 
t=0;t<n.length;t++){const i=n[t],a=this.get(i),o=a?a.getDefaultOptions():this.getDefaultOptions(),c=(0,$.defaultsDeep)({},r,o),l=c.limit||10;if(s.length>=l)break;const u=l-s.length,f={...r,limit:u},d=this.createObject(i,f);if(!d)throw new Error(`Search engine not found: ${i}`);try{const t=await d.search(e,f);for(const e of t)e.url&&!s.some(t=>t.url===e.url)&&s.push(e);if(s.length>=l)break;if(!1===c.fillLimit)break}catch(e){if(console.warn(`[WebSearcher] Engine '${i}' failed completely:`,e),t===n.length-1&&0===s.length)throw e}finally{await d.dispose()}}return s}get template(){return{}}get pagination(){}getTemplate(t,e){return(0,$.cloneDeep)(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=(0,$.defaultsDeep)({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=this.constructor,n=(e=(0,$.defaultsDeep)({},e,this.options,r.getDefaultOptions())).limit||10,s=[],i=new Set;let a=e.startPage||0;const o=this.pagination?.startValue??0,c=this.pagination?.increment??1,l=e.maxPages||this.pagination?.maxPages||10,u=this.constructor.name;let f;e.baseUrls&&(Array.isArray(e.baseUrls)?f=e.baseUrls:"object"==typeof e.baseUrls&&(f=e.baseUrls[u]||e.baseUrls[this.constructor.alias?.[0]])),f&&0!==f.length||(f=this.constructor.defaultBaseUrls);const d=f&&f.length>0;let h=0;d&&"number"==typeof this.constructor.currentInstanceIndex&&(h=this.constructor.currentInstanceIndex);let p=!1;for(;s.length<n;){let r=!1,m=null;const b=d?f.length:1;let y=0;for(;y<b;){const l=d?f[h]:void 
0,b=this.formatOptions(e),w=o+a*c,g={...e,...b,query:t,page:a+o,offset:w,limit:n,baseUrl:l?.endsWith("/")?l.slice(0,-1):l},q=k(this.getTemplate(g,e),g),{actions:A,...O}=e,v=(0,$.defaultsDeep)({},q,O),x=[],j=v.actions||[];if(a===(e.startPage||0)||"url-param"===this.pagination?.type){if(v.url){j.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===v.url)||x.push({id:"goto",params:{url:v.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(x.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),x.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));x.push(...j),v.engine&&this.context.engine!==v.engine&&v.engine;try{const{outputs:n}=await this.executeAll(x,e),o={...e,query:t,page:a,baseUrl:l,engine:u};let c=await this.transform(n,o);e.transform&&(c=await e.transform(c,o));let f=!0;if(this.validateFetchResult&&(f=await this.validateFetchResult(c,o)),f&&e.validator&&(f=await e.validator(c,o)),!f)throw new Error(`Results validation failed for engine: ${u}, url: ${l}`);if(c&&0!==c.length)for(const t of c)t.url&&!i.has(t.url)&&(i.add(t.url),s.push(t));else p=!0;r=!0;break}catch(t){m=t,d&&(h=(h+1)%f.length,this.constructor.currentInstanceIndex=h),y++}}if(!r)throw m||new Error(`All instances failed for engine: ${u}`);if(p)break;if(s.length>=n||!this.pagination)break;if(a++,a>=l)break}return s.slice(0,n)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};q._isFactory=!1;var A=q;(0,w.addBaseFactoryAbility)(A),A.prototype.name="Searcher";var O=class extends A{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) 
h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};O.alias=["google"];
package/dist/index.mjs CHANGED
@@ -1 +1 @@
- async function t(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function e(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,o=setTimeout(()=>i.abort(),n);let a="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),o=n?.match(/charset=([\w-]+)/i),l=o?o[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,a+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:a,headers:c}}catch(t){return a.length>0?{content:a,headers:c}:null}finally{clearTimeout(o)}}function r(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function n(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=i(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const o=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=o.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=s(t);n&&e.jsonLd.push(n)}}const a=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=a.exec(t));){const t=i(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function s(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new 
RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function i(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function o(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function a(t,e){const s=n(t.content);return"date"===e?function(t,e){const n=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),s=o(n);if(s)return s;const i=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),a=o(i);if(a)return a;for(const e of t.time){const t=o(e.datetime||e.text);if(t)return t}const c=r(e);return o(c["last-modified"])}(s,t.headers):null}async function c(t,r={}){const n=await e(t,r.maxBytes,r);return n?a(n,"date"):null}import{fetchWeb as l}from"@isdk/web-fetcher";async function u(t,e={}){const{timeout:r=5e3,limit:n=20,testPath:s=""}=e;return(await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return 
await l(n,{timeoutMs:r}),{url:t,latency:Date.now()-e}}catch(t){return null}}))).filter(t=>null!==t).sort((t,e)=>t.latency-e.latency).slice(0,n)}import{FetcherOptions as f,FetchSession as d}from"@isdk/web-fetcher";import{addBaseFactoryAbility as h}from"custom-factory";import{isPlainObject as m}from"lodash-es";function p(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>p(t,e));if(m(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=p(t[n],e));return r}return t}import{cloneDeep as w,defaultsDeep as y}from"lodash-es";var b=class extends d{static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=r.limit||10,i=r.fillLimit??!0,o=[];for(let t=0;t<n.length;t++){const a=n[t];if(o.length>=s)break;const c=s-o.length,l={...r,limit:c},u=this.createObject(a,l);if(!u)throw new Error(`Search engine not found: ${a}`);try{const t=await u.search(e,l);for(const e of t)e.url&&!o.some(t=>t.url===e.url)&&o.push(e);if(o.length>=s)break;if(!1===i)break}catch(e){if(console.warn(`[WebSearcher] Engine '${a}' failed completely:`,e),t===n.length-1&&0===o.length)throw e}finally{await u.dispose()}}return o}get template(){return{}}get pagination(){}getTemplate(t,e){return w(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=y({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=e.limit||10,n=[],s=new Set;let i=e.startPage||0;const o=this.pagination?.startValue??0,a=this.pagination?.increment??1,c=e.maxPages||this.pagination?.maxPages||10,l=this.constructor.name;let u;e.baseUrls&&(Array.isArray(e.baseUrls)?u=e.baseUrls:"object"==typeof e.baseUrls&&(u=e.baseUrls[l]||e.baseUrls[this.constructor.alias?.[0]])),u&&0!==u.length||(u=this.constructor.defaultBaseUrls);const f=u&&u.length>0;let d=0;f&&"number"==typeof 
this.constructor.currentInstanceIndex&&(d=this.constructor.currentInstanceIndex);let h=!1;for(;n.length<r;){let m=!1,w=null;const b=f?u.length:1;let g=0;for(;g<b;){const c=f?u[d]:void 0,b=this.formatOptions(e),k=o+i*a,$={...e,...b,query:t,page:i+o,offset:k,limit:r,baseUrl:c?.endsWith("/")?c.slice(0,-1):c},A=p(this.getTemplate($,e),$),{actions:q,...x}=e,v=y({},A,x),D=[],T=v.actions||[];if(i===(e.startPage||0)||"url-param"===this.pagination?.type){if(v.url){T.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===v.url)||D.push({id:"goto",params:{url:v.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(D.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),D.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));D.push(...T),v.engine&&this.context.engine!==v.engine&&v.engine;try{const{outputs:r}=await this.executeAll(D,e),o={...e,query:t,page:i,baseUrl:c,engine:l};let a=await this.transform(r,o);e.transform&&(a=await e.transform(a,o));let u=!0;if(this.validateFetchResult&&(u=await this.validateFetchResult(a,o)),u&&e.validator&&(u=await e.validator(a,o)),!u)throw new Error(`Results validation failed for engine: ${l}, url: ${c}`);if(a&&0!==a.length)for(const t of a)t.url&&!s.has(t.url)&&(s.add(t.url),n.push(t));else h=!0;m=!0;break}catch(t){w=t,f&&(d=(d+1)%u.length,this.constructor.currentInstanceIndex=d),g++}}if(!m)throw w||new Error(`All instances failed for engine: ${l}`);if(h)break;if(n.length>=r||!this.pagination)break;if(i++,i>=c)break}return n.slice(0,r)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};b._isFactory=!1,h(b),b.prototype.name="Searcher";var g=class extends b{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main 
#search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};g.alias=["google"];export{f as FetcherOptions,g as GoogleSearcher,b as WebSearcher,c as extractDate,a as extractMetadataFrom,t as fetchHeaders,e as fetchPartial,o as normalizeDate,r as parseHeaders,n as parseHtml,u as testUrlsByLatency};
1
+ async function t(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function e(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,o=setTimeout(()=>i.abort(),n);let a="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),o=n?.match(/charset=([\w-]+)/i),l=o?o[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,a+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:a,headers:c}}catch(t){return a.length>0?{content:a,headers:c}:null}finally{clearTimeout(o)}}function r(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function n(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=i(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const o=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=o.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=s(t);n&&e.jsonLd.push(n)}}const a=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=a.exec(t));){const t=i(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function s(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new 
RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function i(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function o(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function a(t,e){const s=n(t.content);return"date"===e?function(t,e){const n=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),s=o(n);if(s)return s;const i=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),a=o(i);if(a)return a;for(const e of t.time){const t=o(e.datetime||e.text);if(t)return t}const c=r(e);return o(c["last-modified"])}(s,t.headers):null}async function c(t,r={}){const n=await e(t,r.maxBytes,r);return n?a(n,"date"):null}import{fetchWeb as l}from"@isdk/web-fetcher";async function u(t,e={}){const{timeout:r=5e3,limit:n,testPath:s="",proxy:i}=e;let o=await Promise.all(t.map(async t=>{const e=Date.now();try{const 
n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await l(n,{timeoutMs:r,proxy:i,throwHttpErrors:!0,enableSmart:!1,engine:"http"}),{url:t,latency:Date.now()-e}}catch(t){return}}));return o=o.filter(t=>null!=t).sort((t,e)=>t.latency-e.latency),"number"==typeof n&&n&&(o=o.slice(0,n)),o}import{FetcherOptions as f,FetchSession as d}from"@isdk/web-fetcher";import{addBaseFactoryAbility as h}from"custom-factory";import{isPlainObject as m}from"lodash-es";function p(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>p(t,e));if(m(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=p(t[n],e));return r}return t}import{cloneDeep as y,defaultsDeep as b}from"lodash-es";var w=class t extends d{static get defaultOptions(){return Object.prototype.hasOwnProperty.call(this,"_defaultOptions")||(this._defaultOptions={}),this._defaultOptions}static set defaultOptions(t){this._defaultOptions=t}static getDefaultOptions(){const e=[];let r=this;for(;r&&r!==Object.prototype&&(Object.prototype.hasOwnProperty.call(r,"_defaultOptions")&&r._defaultOptions&&e.push(r._defaultOptions),r!==t);)r=Object.getPrototypeOf(r);return e.length>0?b({},...e):{}}static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=[];for(let t=0;t<n.length;t++){const i=n[t],o=this.get(i),a=o?o.getDefaultOptions():this.getDefaultOptions(),c=b({},r,a),l=c.limit||10;if(s.length>=l)break;const u=l-s.length,f={...r,limit:u},d=this.createObject(i,f);if(!d)throw new Error(`Search engine not found: ${i}`);try{const t=await d.search(e,f);for(const e of t)e.url&&!s.some(t=>t.url===e.url)&&s.push(e);if(s.length>=l)break;if(!1===c.fillLimit)break}catch(e){if(console.warn(`[WebSearcher] Engine '${i}' failed completely:`,e),t===n.length-1&&0===s.length)throw e}finally{await d.dispose()}}return s}get template(){return{}}get pagination(){}getTemplate(t,e){return 
y(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=b({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=this.constructor,n=(e=b({},e,this.options,r.getDefaultOptions())).limit||10,s=[],i=new Set;let o=e.startPage||0;const a=this.pagination?.startValue??0,c=this.pagination?.increment??1,l=e.maxPages||this.pagination?.maxPages||10,u=this.constructor.name;let f;e.baseUrls&&(Array.isArray(e.baseUrls)?f=e.baseUrls:"object"==typeof e.baseUrls&&(f=e.baseUrls[u]||e.baseUrls[this.constructor.alias?.[0]])),f&&0!==f.length||(f=this.constructor.defaultBaseUrls);const d=f&&f.length>0;let h=0;d&&"number"==typeof this.constructor.currentInstanceIndex&&(h=this.constructor.currentInstanceIndex);let m=!1;for(;s.length<n;){let r=!1,y=null;const w=d?f.length:1;let g=0;for(;g<w;){const l=d?f[h]:void 0,w=this.formatOptions(e),k=a+o*c,$={...e,...w,query:t,page:o+a,offset:k,limit:n,baseUrl:l?.endsWith("/")?l.slice(0,-1):l},A=p(this.getTemplate($,e),$),{actions:x,...q}=e,O=b({},A,q),v=[],D=O.actions||[];if(o===(e.startPage||0)||"url-param"===this.pagination?.type){if(O.url){D.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===O.url)||v.push({id:"goto",params:{url:O.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(v.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),v.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));v.push(...D),O.engine&&this.context.engine!==O.engine&&O.engine;try{const{outputs:n}=await this.executeAll(v,e),a={...e,query:t,page:o,baseUrl:l,engine:u};let c=await this.transform(n,a);e.transform&&(c=await e.transform(c,a));let f=!0;if(this.validateFetchResult&&(f=await this.validateFetchResult(c,a)),f&&e.validator&&(f=await e.validator(c,a)),!f)throw new Error(`Results validation failed for engine: ${u}, url: ${l}`);if(c&&0!==c.length)for(const t of 
c)t.url&&!i.has(t.url)&&(i.add(t.url),s.push(t));else m=!0;r=!0;break}catch(t){y=t,d&&(h=(h+1)%f.length,this.constructor.currentInstanceIndex=h),g++}}if(!r)throw y||new Error(`All instances failed for engine: ${u}`);if(m)break;if(s.length>=n||!this.pagination)break;if(o++,o>=l)break}return s.slice(0,n)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};w._isFactory=!1;var g=w;h(g),g.prototype.name="Searcher";var k=class extends g{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};k.alias=["google"];export{f as 
FetcherOptions,k as GoogleSearcher,g as WebSearcher,c as extractDate,a as extractMetadataFrom,t as fetchHeaders,e as fetchPartial,o as normalizeDate,r as parseHeaders,n as parseHtml,u as testUrlsByLatency};
package/docs/README.md CHANGED
@@ -63,6 +63,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
63
63
 
64
64
  Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
65
65
 
66
+ ### 5. Default Search Parameters
67
+
68
+ You can set default search parameters at three levels: **Global**, **Engine-specific**, and **Instance-level**, so you do not have to repeat the same options on every `search()` call.
69
+
70
+ The priority order (from highest to lowest) is:
71
+ `search(query, options)` (Call) > `this.options` (Instance) > `Engine.defaultOptions` (Static Engine) > `WebSearcher.defaultOptions` (Static Global)
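The priority chain above can be illustrated with a small, self-contained sketch. `resolveOptions` is a hypothetical helper written for this illustration and is not part of `@isdk/web-searcher`; the bundled code in this diff appears to use `defaultsDeep` from `lodash-es` for the real merge, whereas this sketch merges only top-level keys.

```typescript
// Hypothetical helper (not a library API): layers are passed from highest
// to lowest priority, and the first layer that defines a key wins.
type Options = Record<string, unknown>;

function resolveOptions(...layers: Array<Options | undefined>): Options {
  const resolved: Options = {};
  for (const layer of layers) {
    if (!layer) continue;
    for (const [key, value] of Object.entries(layer)) {
      // Only fill keys that no higher-priority layer has set yet.
      if (resolved[key] === undefined && value !== undefined) {
        resolved[key] = value;
      }
    }
  }
  return resolved;
}

// Example values mirror the README snippets; the call-level value is made up.
const globalDefaults: Options = { limit: 20, safeSearch: 'strict' };
const engineDefaults: Options = { region: 'US', language: 'en' };
const instanceOptions: Options = { limit: 5, category: 'news' };
const callOptions: Options = { limit: 3 };

const effective = resolveOptions(
  callOptions,      // search(query, options)
  instanceOptions,  // this.options
  engineDefaults,   // Engine.defaultOptions
  globalDefaults,   // WebSearcher.defaultOptions
);
console.log(effective);
```

Here `limit` comes from the call, `category` from the instance, `region` and `language` from the engine defaults, and `safeSearch` from the global defaults.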
72
+
73
+ #### A. Global Static Defaults
74
+
75
+ Affects all search engines.
76
+
77
+ ```typescript
78
+ import { WebSearcher } from '@isdk/web-searcher';
79
+
80
+ // Set global limit for all searchers
81
+ WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
82
+ ```
83
+
84
+ #### B. Engine-Specific Static Defaults
85
+
86
+ Affects only a specific engine (and its subclasses).
87
+
88
+ ```typescript
89
+ import { GoogleSearcher } from '@isdk/web-searcher';
90
+
91
+ // Only Google will use these defaults
92
+ GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
93
+ ```
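As a rough sketch of how per-class static defaults can stay isolated while still inheriting from parent classes, the stand-in classes below keep one options object per class and walk the prototype chain when resolving. `BaseSearcher`, `EngineA`, `EngineB`, and the `WeakMap` storage are assumptions made for illustration; the bundled code in this diff appears to store an own `_defaultOptions` property per class and merge it in `getDefaultOptions()`, which this sketch only approximates.

```typescript
type Options = Record<string, unknown>;

// One defaults object per class constructor (illustrative storage choice).
const defaultsByClass = new WeakMap<object, Options>();

class BaseSearcher {
  static get defaultOptions(): Options {
    let own = defaultsByClass.get(this);
    if (!own) {
      own = {};
      defaultsByClass.set(this, own);
    }
    return own;
  }

  static set defaultOptions(value: Options) {
    // `this` is the class being assigned to, so subclasses stay separate.
    defaultsByClass.set(this, value);
  }

  // Walk from this class up to BaseSearcher; nearer classes take priority.
  static getDefaultOptions(): Options {
    const merged: Options = {};
    let cls: object | null = this;
    while (cls) {
      const own = defaultsByClass.get(cls);
      if (own) {
        for (const [key, value] of Object.entries(own)) {
          if (!(key in merged)) merged[key] = value;
        }
      }
      if (cls === BaseSearcher) break;
      cls = Object.getPrototypeOf(cls);
    }
    return merged;
  }
}

class EngineA extends BaseSearcher {}
class EngineB extends BaseSearcher {}

BaseSearcher.defaultOptions = { limit: 20 };
EngineA.defaultOptions = { limit: 5, region: 'US' };

const forA = EngineA.getDefaultOptions();
const forB = EngineB.getDefaultOptions();
console.log(forA, forB);
```

In this sketch `EngineA` resolves its own `limit` and `region`, while `EngineB` falls back to the base class `limit`, which is the behaviour the engine-specific and global sections describe.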
94
+
95
+ #### C. Instance-Level Defaults
96
+
97
+ Set when creating a searcher instance.
98
+
99
+ ```typescript
100
+ import { GoogleSearcher } from '@isdk/web-searcher';
+
+ const google = new GoogleSearcher({ limit: 5, category: 'news' });
101
+
102
+ // This search will use limit: 5 and category: 'news' automatically
103
+ const results = await google.search('open source');
104
+ ```
105
+
66
106
  ### 🧬 Dynamic Templates
67
107
 
68
108
  While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').