@isdk/web-searcher 0.1.5 → 0.1.6
- package/README.cn.md +40 -0
- package/README.md +40 -0
- package/dist/index.d.mts +14 -0
- package/dist/index.d.ts +14 -0
- package/dist/index.js +1 -1
- package/dist/index.mjs +1 -1
- package/docs/README.md +40 -0
- package/docs/classes/GoogleSearcher.md +150 -48
- package/docs/classes/WebSearcher.md +138 -48
- package/docs/functions/extractDate.md +1 -1
- package/docs/functions/extractMetadataFrom.md +1 -1
- package/docs/functions/fetchHeaders.md +1 -1
- package/docs/functions/fetchPartial.md +1 -1
- package/docs/functions/normalizeDate.md +1 -1
- package/docs/functions/parseHeaders.md +1 -1
- package/docs/functions/parseHtml.md +1 -1
- package/docs/functions/testUrlsByLatency.md +5 -1
- package/docs/interfaces/CustomTimeRange.md +3 -3
- package/docs/interfaces/ExtractOptions.md +4 -4
- package/docs/interfaces/FetchExtractorOptions.md +3 -3
- package/docs/interfaces/FetcherOptions.md +41 -29
- package/docs/interfaces/HtmlData.md +4 -4
- package/docs/interfaces/MetadataResult.md +2 -2
- package/docs/interfaces/PaginationConfig.md +7 -7
- package/docs/interfaces/SearchContext.md +6 -6
- package/docs/interfaces/SearchOptions.md +13 -13
- package/docs/interfaces/StandardSearchResult.md +10 -10
- package/docs/interfaces/VerifiedUrl.md +3 -3
- package/docs/type-aliases/MetadataType.md +1 -1
- package/docs/type-aliases/SafeSearchLevel.md +1 -1
- package/docs/type-aliases/SearchCategory.md +1 -1
- package/docs/type-aliases/SearchTimeRange.md +1 -1
- package/docs/type-aliases/SearchTimeRangePreset.md +1 -1
- package/docs/type-aliases/SearcherConstructor.md +1 -1
- package/package.json +2 -2
package/README.cn.md
CHANGED
@@ -59,6 +59,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
 
 由于 `WebSearcher` 继承自 `FetchSession`,您可以实例化它以在多个请求之间保持 Cookie 和存储。这对于需要登录的搜索或通过模拟人类行为来避免反爬虫非常有用。
 
+### 5. 默认搜索参数 (Default Search Parameters)
+
+您可以从三个层面设置默认搜索参数:**全局**、**引擎特定**和**实例级别**。这可以避免在每次调用 `search()` 时重复传递相同的选项。
+
+优先级顺序(从高到低)为:
+`search(query, options)` (调用参数) > `this.options` (实例参数) > `Engine.defaultOptions` (引擎静态参数) > `WebSearcher.defaultOptions` (全局静态参数)
+
+#### A. 全局静态默认值
+
+影响所有搜索引擎。
+
+```typescript
+import { WebSearcher } from '@isdk/web-fetcher';
+
+// 为所有搜索器设置全局限制
+WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+```
+
+#### B. 引擎特定静态默认值
+
+仅影响特定的引擎(及其子类)。
+
+```typescript
+import { GoogleSearcher } from '@isdk/web-fetcher';
+
+// 仅 Google 会使用这些默认值
+GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+```
+
+#### C. 实例级别默认值
+
+在创建搜索器实例时设置。
+
+```typescript
+const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+// 此次搜索将自动使用 limit: 5 和 category: 'news'
+const results = await google.search('open source');
+```
+
 ### 🧬 动态模板 (Dynamic Templates)
 
 虽然静态 `template` 适用于简单的搜索引擎,但许多网站(如 Google)会根据搜索类别(如“网页” vs “图片” vs “新闻”)彻底改变其 HTML 结构。
package/README.md
CHANGED
@@ -59,6 +59,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
 
 Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
 
+### 5. Default Search Parameters
+
+You can set default search parameters at three levels: **Global**, **Engine-specific**, and **Instance-level**. This avoids passing repetitive options to every `search()` call.
+
+The priority order (from highest to lowest) is:
+`search(query, options)` (Call) > `this.options` (Instance) > `Engine.defaultOptions` (Static Engine) > `WebSearcher.defaultOptions` (Static Global)
+
+#### A. Global Static Defaults
+
+Affects all search engines.
+
+```typescript
+import { WebSearcher } from '@isdk/web-fetcher';
+
+// Set global limit for all searchers
+WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+```
+
+#### B. Engine-Specific Static Defaults
+
+Affects only a specific engine (and its subclasses).
+
+```typescript
+import { GoogleSearcher } from '@isdk/web-fetcher';
+
+// Only Google will use these defaults
+GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+```
+
+#### C. Instance-Level Defaults
+
+Set when creating a searcher instance.
+
+```typescript
+const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+// This search will use limit: 5 and category: 'news' automatically
+const results = await google.search('open source');
+```
+
 ### 🧬 Dynamic Templates
 
 While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').
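Note on the priority order added to the README: the four levels can be pictured as one merge in which later, more specific sources win. The following is only a minimal sketch of that ordering. The `Options` shape, `resolveOptions`, and the sample values are hypothetical names for illustration; the shallow object spread stands in for the `defaultsDeep` call visible in the bundled `dist/index.js`, which also merges nested objects and ignores `undefined` values.

```typescript
// Hypothetical, trimmed-down option shape; the real SearchOptions has more fields.
interface Options {
  limit?: number;
  category?: string;
  region?: string;
  safeSearch?: string;
}

// Later spreads overwrite earlier ones, so the order below reproduces
// Call > Instance > Static Engine > Static Global.
function resolveOptions(
  callOptions: Options,
  instanceOptions: Options,
  engineDefaults: Options,
  globalDefaults: Options,
): Options {
  return { ...globalDefaults, ...engineDefaults, ...instanceOptions, ...callOptions };
}

const globalDefaults: Options = { limit: 20, safeSearch: 'strict' };
const engineDefaults: Options = { limit: 15, region: 'US' };
const instanceOptions: Options = { limit: 5, category: 'news' };
const callOptions: Options = { category: 'images' };

// category comes from the call, limit from the instance,
// region from the engine defaults, safeSearch from the global defaults.
const effective = resolveOptions(callOptions, instanceOptions, engineDefaults, globalDefaults);
console.log(effective);
```

Passing an explicit `undefined` at a higher level would overwrite a lower-level value in this shallow version, which is one reason a defaults-style merge is the safer choice in real code.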
package/dist/index.d.mts
CHANGED
@@ -320,6 +320,7 @@ declare function testUrlsByLatency(urls: string[], options?: {
     timeout?: number;
     limit?: number;
     testPath?: string;
+    proxy?: string;
 }): Promise<VerifiedUrl[]>;
 
 /**
@@ -362,6 +363,19 @@ declare abstract class WebSearcher extends FetchSession {
     static defaultBaseUrls?: string[];
     /** Globally shared index for tracking the currently active instance (node) across sessions. */
     static currentInstanceIndex?: number;
+    /** @internal */
+    static _defaultOptions?: SearchOptions;
+    /**
+     * Gets or sets the default search parameters for this specific engine class.
+     * This does not include settings from parent classes.
+     */
+    static get defaultOptions(): SearchOptions;
+    static set defaultOptions(options: SearchOptions);
+    /**
+     * Retrieves the combined default search options by traversing the prototype chain.
+     * Priority: Current class > Parent class > WebSearcher base class.
+     */
+    static getDefaultOptions(): SearchOptions;
     /**
      * Registers a search engine class.
      *
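Note on `testUrlsByLatency`: besides the new `proxy` option, the 0.1.6 bundle appears to apply `limit` only when a number is passed, where 0.1.5 defaulted it to 20. The sketch below illustrates that probe, sort, and optional slice flow under stated assumptions. `rankByLatency` and the injected `probe` callback are hypothetical; the real function calls `fetchWeb` from `@isdk/web-fetcher`, which is not reproduced here.

```typescript
interface VerifiedUrl {
  url: string;
  latency: number; // milliseconds
}

// `probe` is a hypothetical stand-in for the real network request;
// it should reject when the URL is unreachable.
async function rankByLatency(
  urls: string[],
  probe: (url: string) => Promise<void>,
  limit?: number,
): Promise<VerifiedUrl[]> {
  const settled = await Promise.all(
    urls.map(async (url): Promise<VerifiedUrl | undefined> => {
      const start = Date.now();
      try {
        await probe(url);
        return { url, latency: Date.now() - start };
      } catch {
        return undefined; // failed probes are dropped from the result
      }
    }),
  );

  let ranked = settled
    .filter((entry): entry is VerifiedUrl => entry != null)
    .sort((a, b) => a.latency - b.latency);

  // Truncate only when a positive numeric limit is supplied.
  if (typeof limit === 'number' && limit > 0) {
    ranked = ranked.slice(0, limit);
  }
  return ranked;
}
```

Injecting the probe keeps the ranking logic testable without network access; a caller could wrap any HTTP client, with or without a proxy, in that callback.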
package/dist/index.d.ts
CHANGED
@@ -320,6 +320,7 @@ declare function testUrlsByLatency(urls: string[], options?: {
     timeout?: number;
     limit?: number;
     testPath?: string;
+    proxy?: string;
 }): Promise<VerifiedUrl[]>;
 
 /**
@@ -362,6 +363,19 @@ declare abstract class WebSearcher extends FetchSession {
     static defaultBaseUrls?: string[];
     /** Globally shared index for tracking the currently active instance (node) across sessions. */
     static currentInstanceIndex?: number;
+    /** @internal */
+    static _defaultOptions?: SearchOptions;
+    /**
+     * Gets or sets the default search parameters for this specific engine class.
+     * This does not include settings from parent classes.
+     */
+    static get defaultOptions(): SearchOptions;
+    static set defaultOptions(options: SearchOptions);
+    /**
+     * Retrieves the combined default search options by traversing the prototype chain.
+     * Priority: Current class > Parent class > WebSearcher base class.
+     */
+    static getDefaultOptions(): SearchOptions;
     /**
      * Registers a search engine class.
      *
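Note on the new static members: the declarations describe a per-class `defaultOptions` accessor plus a `getDefaultOptions()` that walks the class hierarchy. The following is a minimal sketch of how such a pair could behave, not the package's implementation. `BaseSearcher` and `GoogleLike` are hypothetical stand-ins for `WebSearcher` and an engine subclass, and the shallow `Object.assign` replaces the deep merge used in the bundle.

```typescript
type Options = Record<string, unknown>;

class BaseSearcher {
  /** Backing store; every class that sets defaults gets its own own-property. */
  static _defaultOptions?: Options;

  /** Returns only this class's own defaults, never a parent's object. */
  static get defaultOptions(): Options {
    const hasOwn = Object.prototype.hasOwnProperty.call(this, '_defaultOptions');
    if (!hasOwn || !this._defaultOptions) this._defaultOptions = {};
    return this._defaultOptions as Options;
  }

  static set defaultOptions(options: Options) {
    // Assigning through `this` creates the property on the class being written to.
    this._defaultOptions = options;
  }

  /** Walks from the current class up to BaseSearcher and merges the layers. */
  static getDefaultOptions(): Options {
    const layers: Options[] = [];
    let ctor: any = this;
    while (ctor && ctor !== Object.prototype) {
      if (Object.prototype.hasOwnProperty.call(ctor, '_defaultOptions') && ctor._defaultOptions) {
        layers.push(ctor._defaultOptions);
      }
      if (ctor === BaseSearcher) break;
      ctor = Object.getPrototypeOf(ctor);
    }
    // `layers` is ordered most-derived first; applying it in reverse
    // lets the most-derived class win on conflicting keys.
    return Object.assign({}, ...layers.reverse());
  }
}

class GoogleLike extends BaseSearcher {}

BaseSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
GoogleLike.defaultOptions = { limit: 5, region: 'US' };

const ownOnly = GoogleLike.defaultOptions; // just the subclass's own settings
const combined = GoogleLike.getDefaultOptions(); // subclass settings layered over the base
console.log(ownOnly, combined);
```

The `hasOwnProperty` check is what keeps a subclass from silently mutating its parent's defaults: static properties are inherited through the constructor's prototype chain, so a plain read would otherwise return the parent's object.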
package/dist/index.js
CHANGED
@@ -1 +1 @@
-
"use strict";var t,e=Object.defineProperty,r=Object.getOwnPropertyDescriptor,n=Object.getOwnPropertyNames,s=Object.prototype.hasOwnProperty,i={};async function a(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function o(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,a=setTimeout(()=>i.abort(),n);let o="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),a=n?.match(/charset=([\w-]+)/i),l=a?a[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,o+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:o,headers:c}}catch(t){return o.length>0?{content:o,headers:c}:null}finally{clearTimeout(a)}}function c(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function l(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=f(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const s=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=s.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=u(t);n&&e.jsonLd.push(n)}}const i=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=i.exec(t));){const t=f(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function u(t){const 
e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function f(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function d(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function h(t,e){const r=l(t.content);return"date"===e?function(t,e){const r=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),n=d(r);if(n)return n;const s=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),i=d(s);if(i)return i;for(const e of t.time){const t=d(e.datetime||e.text);if(t)return t}const a=c(e);return d(a["last-modified"])}(r,t.headers):null}async function m(t,e={}){const r=await o(t,e.maxBytes,e);return r?h(r,"date"):null}((t,r)=>{for(var n in 
r)e(t,n,{get:r[n],enumerable:!0})})(i,{FetcherOptions:()=>y.FetcherOptions,GoogleSearcher:()=>A,WebSearcher:()=>q,extractDate:()=>m,extractMetadataFrom:()=>h,fetchHeaders:()=>a,fetchPartial:()=>o,normalizeDate:()=>d,parseHeaders:()=>c,parseHtml:()=>l,testUrlsByLatency:()=>b}),module.exports=(t=i,((t,i,a,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of n(i))s.call(t,c)||c===a||e(t,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return t})(e({},"__esModule",{value:!0}),t));var p=require("@isdk/web-fetcher");async function b(t,e={}){const{timeout:r=5e3,limit:n=20,testPath:s=""}=e;return(await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await(0,p.fetchWeb)(n,{timeoutMs:r}),{url:t,latency:Date.now()-e}}catch(t){return null}}))).filter(t=>null!==t).sort((t,e)=>t.latency-e.latency).slice(0,n)}var y=require("@isdk/web-fetcher"),w=require("custom-factory"),g=require("lodash-es");function k(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>k(t,e));if((0,g.isPlainObject)(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=k(t[n],e));return r}return t}var $=require("lodash-es"),q=class extends y.FetchSession{static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=r.limit||10,i=r.fillLimit??!0,a=[];for(let t=0;t<n.length;t++){const o=n[t];if(a.length>=s)break;const c=s-a.length,l={...r,limit:c},u=this.createObject(o,l);if(!u)throw new Error(`Search engine not found: ${o}`);try{const t=await u.search(e,l);for(const e of t)e.url&&!a.some(t=>t.url===e.url)&&a.push(e);if(a.length>=s)break;if(!1===i)break}catch(e){if(console.warn(`[WebSearcher] Engine '${o}' failed completely:`,e),t===n.length-1&&0===a.length)throw e}finally{await u.dispose()}}return a}get template(){return{}}get 
pagination(){}getTemplate(t,e){return(0,$.cloneDeep)(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=(0,$.defaultsDeep)({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=e.limit||10,n=[],s=new Set;let i=e.startPage||0;const a=this.pagination?.startValue??0,o=this.pagination?.increment??1,c=e.maxPages||this.pagination?.maxPages||10,l=this.constructor.name;let u;e.baseUrls&&(Array.isArray(e.baseUrls)?u=e.baseUrls:"object"==typeof e.baseUrls&&(u=e.baseUrls[l]||e.baseUrls[this.constructor.alias?.[0]])),u&&0!==u.length||(u=this.constructor.defaultBaseUrls);const f=u&&u.length>0;let d=0;f&&"number"==typeof this.constructor.currentInstanceIndex&&(d=this.constructor.currentInstanceIndex);let h=!1;for(;n.length<r;){let m=!1,p=null;const b=f?u.length:1;let y=0;for(;y<b;){const c=f?u[d]:void 0,b=this.formatOptions(e),w=a+i*o,g={...e,...b,query:t,page:i+a,offset:w,limit:r,baseUrl:c?.endsWith("/")?c.slice(0,-1):c},q=k(this.getTemplate(g,e),g),{actions:A,...v}=e,x=(0,$.defaultsDeep)({},q,v),D=[],S=x.actions||[];if(i===(e.startPage||0)||"url-param"===this.pagination?.type){if(x.url){S.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===x.url)||D.push({id:"goto",params:{url:x.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(D.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),D.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));D.push(...S),x.engine&&this.context.engine!==x.engine&&x.engine;try{const{outputs:r}=await this.executeAll(D,e),a={...e,query:t,page:i,baseUrl:c,engine:l};let o=await this.transform(r,a);e.transform&&(o=await e.transform(o,a));let u=!0;if(this.validateFetchResult&&(u=await this.validateFetchResult(o,a)),u&&e.validator&&(u=await e.validator(o,a)),!u)throw new Error(`Results validation failed for engine: ${l}, url: ${c}`);if(o&&0!==o.length)for(const t of 
o)t.url&&!s.has(t.url)&&(s.add(t.url),n.push(t));else h=!0;m=!0;break}catch(t){p=t,f&&(d=(d+1)%u.length,this.constructor.currentInstanceIndex=d),y++}}if(!m)throw p||new Error(`All instances failed for engine: ${l}`);if(h)break;if(n.length>=r||!this.pagination)break;if(i++,i>=c)break}return n.slice(0,r)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};q._isFactory=!1,(0,w.addBaseFactoryAbility)(q),q.prototype.name="Searcher";var A=class extends q{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return 
t}):[]}};A.alias=["google"];
+
"use strict";var t,e=Object.defineProperty,r=Object.getOwnPropertyDescriptor,n=Object.getOwnPropertyNames,s=Object.prototype.hasOwnProperty,i={};async function a(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function o(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,a=setTimeout(()=>i.abort(),n);let o="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),a=n?.match(/charset=([\w-]+)/i),l=a?a[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,o+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:o,headers:c}}catch(t){return o.length>0?{content:o,headers:c}:null}finally{clearTimeout(a)}}function c(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function l(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=f(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const s=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=s.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=u(t);n&&e.jsonLd.push(n)}}const i=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=i.exec(t));){const t=f(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function u(t){const 
e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function f(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function d(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function h(t,e){const r=l(t.content);return"date"===e?function(t,e){const r=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),n=d(r);if(n)return n;const s=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),i=d(s);if(i)return i;for(const e of t.time){const t=d(e.datetime||e.text);if(t)return t}const a=c(e);return d(a["last-modified"])}(r,t.headers):null}async function p(t,e={}){const r=await o(t,e.maxBytes,e);return r?h(r,"date"):null}((t,r)=>{for(var n in 
r)e(t,n,{get:r[n],enumerable:!0})})(i,{FetcherOptions:()=>y.FetcherOptions,GoogleSearcher:()=>O,WebSearcher:()=>A,extractDate:()=>p,extractMetadataFrom:()=>h,fetchHeaders:()=>a,fetchPartial:()=>o,normalizeDate:()=>d,parseHeaders:()=>c,parseHtml:()=>l,testUrlsByLatency:()=>b}),module.exports=(t=i,((t,i,a,o)=>{if(i&&"object"==typeof i||"function"==typeof i)for(let c of n(i))s.call(t,c)||c===a||e(t,c,{get:()=>i[c],enumerable:!(o=r(i,c))||o.enumerable});return t})(e({},"__esModule",{value:!0}),t));var m=require("@isdk/web-fetcher");async function b(t,e={}){const{timeout:r=5e3,limit:n,testPath:s="",proxy:i}=e;let a=await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await(0,m.fetchWeb)(n,{timeoutMs:r,proxy:i,throwHttpErrors:!0,enableSmart:!1,engine:"http"}),{url:t,latency:Date.now()-e}}catch(t){return}}));return a=a.filter(t=>null!=t).sort((t,e)=>t.latency-e.latency),"number"==typeof n&&n&&(a=a.slice(0,n)),a}var y=require("@isdk/web-fetcher"),w=require("custom-factory"),g=require("lodash-es");function k(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>k(t,e));if((0,g.isPlainObject)(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=k(t[n],e));return r}return t}var $=require("lodash-es"),q=class t extends y.FetchSession{static get defaultOptions(){return Object.prototype.hasOwnProperty.call(this,"_defaultOptions")||(this._defaultOptions={}),this._defaultOptions}static set defaultOptions(t){this._defaultOptions=t}static getDefaultOptions(){const e=[];let r=this;for(;r&&r!==Object.prototype&&(Object.prototype.hasOwnProperty.call(r,"_defaultOptions")&&r._defaultOptions&&e.push(r._defaultOptions),r!==t);)r=Object.getPrototypeOf(r);return e.length>0?(0,$.defaultsDeep)({},...e):{}}static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=[];for(let 
t=0;t<n.length;t++){const i=n[t],a=this.get(i),o=a?a.getDefaultOptions():this.getDefaultOptions(),c=(0,$.defaultsDeep)({},r,o),l=c.limit||10;if(s.length>=l)break;const u=l-s.length,f={...r,limit:u},d=this.createObject(i,f);if(!d)throw new Error(`Search engine not found: ${i}`);try{const t=await d.search(e,f);for(const e of t)e.url&&!s.some(t=>t.url===e.url)&&s.push(e);if(s.length>=l)break;if(!1===c.fillLimit)break}catch(e){if(console.warn(`[WebSearcher] Engine '${i}' failed completely:`,e),t===n.length-1&&0===s.length)throw e}finally{await d.dispose()}}return s}get template(){return{}}get pagination(){}getTemplate(t,e){return(0,$.cloneDeep)(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=(0,$.defaultsDeep)({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=this.constructor,n=(e=(0,$.defaultsDeep)({},e,this.options,r.getDefaultOptions())).limit||10,s=[],i=new Set;let a=e.startPage||0;const o=this.pagination?.startValue??0,c=this.pagination?.increment??1,l=e.maxPages||this.pagination?.maxPages||10,u=this.constructor.name;let f;e.baseUrls&&(Array.isArray(e.baseUrls)?f=e.baseUrls:"object"==typeof e.baseUrls&&(f=e.baseUrls[u]||e.baseUrls[this.constructor.alias?.[0]])),f&&0!==f.length||(f=this.constructor.defaultBaseUrls);const d=f&&f.length>0;let h=0;d&&"number"==typeof this.constructor.currentInstanceIndex&&(h=this.constructor.currentInstanceIndex);let p=!1;for(;s.length<n;){let r=!1,m=null;const b=d?f.length:1;let y=0;for(;y<b;){const l=d?f[h]:void 
0,b=this.formatOptions(e),w=o+a*c,g={...e,...b,query:t,page:a+o,offset:w,limit:n,baseUrl:l?.endsWith("/")?l.slice(0,-1):l},q=k(this.getTemplate(g,e),g),{actions:A,...O}=e,v=(0,$.defaultsDeep)({},q,O),x=[],j=v.actions||[];if(a===(e.startPage||0)||"url-param"===this.pagination?.type){if(v.url){j.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===v.url)||x.push({id:"goto",params:{url:v.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(x.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),x.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));x.push(...j),v.engine&&this.context.engine!==v.engine&&v.engine;try{const{outputs:n}=await this.executeAll(x,e),o={...e,query:t,page:a,baseUrl:l,engine:u};let c=await this.transform(n,o);e.transform&&(c=await e.transform(c,o));let f=!0;if(this.validateFetchResult&&(f=await this.validateFetchResult(c,o)),f&&e.validator&&(f=await e.validator(c,o)),!f)throw new Error(`Results validation failed for engine: ${u}, url: ${l}`);if(c&&0!==c.length)for(const t of c)t.url&&!i.has(t.url)&&(i.add(t.url),s.push(t));else p=!0;r=!0;break}catch(t){m=t,d&&(h=(h+1)%f.length,this.constructor.currentInstanceIndex=h),y++}}if(!r)throw m||new Error(`All instances failed for engine: ${u}`);if(p)break;if(s.length>=n||!this.pagination)break;if(a++,a>=l)break}return s.slice(0,n)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};q._isFactory=!1;var A=q;(0,w.addBaseFactoryAbility)(A),A.prototype.name="Searcher";var O=class extends A{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) 
h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};O.alias=["google"];
package/dist/index.mjs
CHANGED
@@ -1 +1 @@
-
async function t(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function e(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,o=setTimeout(()=>i.abort(),n);let a="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),o=n?.match(/charset=([\w-]+)/i),l=o?o[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,a+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:a,headers:c}}catch(t){return a.length>0?{content:a,headers:c}:null}finally{clearTimeout(o)}}function r(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function n(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=i(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const o=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=o.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=s(t);n&&e.jsonLd.push(n)}}const a=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=a.exec(t));){const t=i(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function s(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new 
RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function i(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function o(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function a(t,e){const s=n(t.content);return"date"===e?function(t,e){const n=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),s=o(n);if(s)return s;const i=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),a=o(i);if(a)return a;for(const e of t.time){const t=o(e.datetime||e.text);if(t)return t}const c=r(e);return o(c["last-modified"])}(s,t.headers):null}async function c(t,r={}){const n=await e(t,r.maxBytes,r);return n?a(n,"date"):null}import{fetchWeb as l}from"@isdk/web-fetcher";async function u(t,e={}){const{timeout:r=5e3,limit:n=20,testPath:s=""}=e;return(await Promise.all(t.map(async t=>{const e=Date.now();try{const n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return 
await l(n,{timeoutMs:r}),{url:t,latency:Date.now()-e}}catch(t){return null}}))).filter(t=>null!==t).sort((t,e)=>t.latency-e.latency).slice(0,n)}import{FetcherOptions as f,FetchSession as d}from"@isdk/web-fetcher";import{addBaseFactoryAbility as h}from"custom-factory";import{isPlainObject as m}from"lodash-es";function p(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>p(t,e));if(m(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=p(t[n],e));return r}return t}import{cloneDeep as w,defaultsDeep as y}from"lodash-es";var b=class extends d{static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=r.limit||10,i=r.fillLimit??!0,o=[];for(let t=0;t<n.length;t++){const a=n[t];if(o.length>=s)break;const c=s-o.length,l={...r,limit:c},u=this.createObject(a,l);if(!u)throw new Error(`Search engine not found: ${a}`);try{const t=await u.search(e,l);for(const e of t)e.url&&!o.some(t=>t.url===e.url)&&o.push(e);if(o.length>=s)break;if(!1===i)break}catch(e){if(console.warn(`[WebSearcher] Engine '${a}' failed completely:`,e),t===n.length-1&&0===o.length)throw e}finally{await u.dispose()}}return o}get template(){return{}}get pagination(){}getTemplate(t,e){return w(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=y({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=e.limit||10,n=[],s=new Set;let i=e.startPage||0;const o=this.pagination?.startValue??0,a=this.pagination?.increment??1,c=e.maxPages||this.pagination?.maxPages||10,l=this.constructor.name;let u;e.baseUrls&&(Array.isArray(e.baseUrls)?u=e.baseUrls:"object"==typeof e.baseUrls&&(u=e.baseUrls[l]||e.baseUrls[this.constructor.alias?.[0]])),u&&0!==u.length||(u=this.constructor.defaultBaseUrls);const f=u&&u.length>0;let d=0;f&&"number"==typeof 
this.constructor.currentInstanceIndex&&(d=this.constructor.currentInstanceIndex);let h=!1;for(;n.length<r;){let m=!1,w=null;const b=f?u.length:1;let g=0;for(;g<b;){const c=f?u[d]:void 0,b=this.formatOptions(e),k=o+i*a,$={...e,...b,query:t,page:i+o,offset:k,limit:r,baseUrl:c?.endsWith("/")?c.slice(0,-1):c},A=p(this.getTemplate($,e),$),{actions:q,...x}=e,v=y({},A,x),D=[],T=v.actions||[];if(i===(e.startPage||0)||"url-param"===this.pagination?.type){if(v.url){T.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===v.url)||D.push({id:"goto",params:{url:v.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(D.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),D.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));D.push(...T),v.engine&&this.context.engine!==v.engine&&v.engine;try{const{outputs:r}=await this.executeAll(D,e),o={...e,query:t,page:i,baseUrl:c,engine:l};let a=await this.transform(r,o);e.transform&&(a=await e.transform(a,o));let u=!0;if(this.validateFetchResult&&(u=await this.validateFetchResult(a,o)),u&&e.validator&&(u=await e.validator(a,o)),!u)throw new Error(`Results validation failed for engine: ${l}, url: ${c}`);if(a&&0!==a.length)for(const t of a)t.url&&!s.has(t.url)&&(s.add(t.url),n.push(t));else h=!0;m=!0;break}catch(t){w=t,f&&(d=(d+1)%u.length,this.constructor.currentInstanceIndex=d),g++}}if(!m)throw w||new Error(`All instances failed for engine: ${l}`);if(h)break;if(n.length>=r||!this.pagination)break;if(i++,i>=c)break}return n.slice(0,r)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};b._isFactory=!1,h(b),b.prototype.name="Searcher";var g=class extends b{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main 
#search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};g.alias=["google"];export{f as FetcherOptions,g as GoogleSearcher,b as WebSearcher,c as extractDate,a as extractMetadataFrom,t as fetchHeaders,e as fetchPartial,o as normalizeDate,r as parseHeaders,n as parseHtml,u as testUrlsByLatency};
|
|
1
|
+
async function t(t,e={}){const{timeout:r=5e3,headers:n}=e,s=new AbortController,i=setTimeout(()=>s.abort(),r);try{return(await fetch(t,{method:"HEAD",signal:s.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...n}})).headers}catch(t){return null}finally{clearTimeout(i)}}async function e(t,e=32768,r={}){const{timeout:n=1e4,headers:s}=r,i=new AbortController,o=setTimeout(()=>i.abort(),n);let a="",c=new Headers;try{const r=await fetch(t,{signal:i.signal,headers:{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",...s}});if(c=r.headers,!r.ok||!r.body)return null;const n=r.headers.get("content-type"),o=n?.match(/charset=([\w-]+)/i),l=o?o[1]:"utf-8",u=r.body.getReader(),f=new TextDecoder(l);let d=0;for(;;)try{const{done:t,value:r}=await u.read();if(t)break;if(d+=r.length,a+=f.decode(r,{stream:!0}),d>=e){i.abort();break}}catch(t){if("AbortError"===t.name)break;throw t}return{content:a,headers:c}}catch(t){return a.length>0?{content:a,headers:c}:null}finally{clearTimeout(o)}}function r(t){const e={};return t.forEach((t,r)=>{e[r.toLowerCase()]=t}),e}function n(t){const e={meta:{},jsonLd:[],time:[]},r=/<meta\s+([^>]+?)>/gi;let n;for(;null!==(n=r.exec(t));){const t=i(n[1]),r=t.name||t.property||t.itemprop,s=t.content;r&&s&&(e.meta[r.toLowerCase()]=s)}const o=/<script\s+[^>]*?type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;for(;null!==(n=o.exec(t));){const t=n[1];try{const r=JSON.parse(t);e.jsonLd.push(r)}catch(r){const n=s(t);n&&e.jsonLd.push(n)}}const a=/<time([^>]*?)>([\s\S]*?)<\/time>/gi;for(;null!==(n=a.exec(t));){const t=i(n[1]).datetime,r=n[2].replace(/<[^>]*>/g,"").trim();e.time.push({datetime:t,text:r})}return e}function s(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r={};let n=!1;for(const s of e){const e=new 
RegExp(`"${s}"\\s*:\\s*"([^"]+)"`,"i"),i=t.match(e);i&&(r[s]=i[1],n=!0)}return n?r:null}function i(t){const e={},r=/([a-z0-9:._-]+)(?:\s*=\s*(?:(?:"([^"]*)")|(?:'([^']*)')|([^>\s]+)))?/gi;let n;for(;null!==(n=r.exec(t));){const t=n[1].toLowerCase(),r=n[2]??n[3]??n[4]??"";e[t]=r}return e}function o(t){if(!t)return null;try{let e=t.trim();if(!e)return null;e=e.replace(/^(?:last|first|posted|originally)\s*(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.replace(/^(?:published|updated|date|posted|modified)\s*(?:on|at)?[:\s]*/i,""),e=e.split(/[\(|\|]|by\s+|[-–—]\s*\d+\s*min/i)[0].trim();const r=new Date(e);if(!isNaN(r.getTime())){const t=r.getUTCFullYear(),e=(new Date).getUTCFullYear();if(t>=-1e4&&t<=e+20)return r.toISOString()}}catch(t){}return null}function a(t,e){const s=n(t.content);return"date"===e?function(t,e){const n=function(t){const e=["datePublished","dateModified","pubDate","publishedAt"],r=t=>{if(!t||"object"!=typeof t)return null;for(const r of e)if("string"==typeof t[r])return t[r];if(Array.isArray(t))for(const e of t){const t=r(e);if(t)return t}else if(t["@graph"]&&Array.isArray(t["@graph"]))return r(t["@graph"]);return null};return r(t)}(t.jsonLd),s=o(n);if(s)return s;const i=function(t){const e=["article:published_time","og:published_time","datepublished","date","pubdate","publishdate","dc.date.issued","bt:pubdate","sailthru.date","article:modified_time","og:updated_time","modifieddate"];for(const r of e)if(t[r])return t[r];return null}(t.meta),a=o(i);if(a)return a;for(const e of t.time){const t=o(e.datetime||e.text);if(t)return t}const c=r(e);return o(c["last-modified"])}(s,t.headers):null}async function c(t,r={}){const n=await e(t,r.maxBytes,r);return n?a(n,"date"):null}import{fetchWeb as l}from"@isdk/web-fetcher";async function u(t,e={}){const{timeout:r=5e3,limit:n,testPath:s="",proxy:i}=e;let o=await Promise.all(t.map(async t=>{const e=Date.now();try{const 
n=s?(t.endsWith("/")?t.slice(0,-1):t)+(s.startsWith("/")?s:"/"+s):t;return await l(n,{timeoutMs:r,proxy:i,throwHttpErrors:!0,enableSmart:!1,engine:"http"}),{url:t,latency:Date.now()-e}}catch(t){return}}));return o=o.filter(t=>null!=t).sort((t,e)=>t.latency-e.latency),"number"==typeof n&&n&&(o=o.slice(0,n)),o}import{FetcherOptions as f,FetchSession as d}from"@isdk/web-fetcher";import{addBaseFactoryAbility as h}from"custom-factory";import{isPlainObject as m}from"lodash-es";function p(t,e){if("string"==typeof t)return t.replace(/\$\{(.*?)\}/g,(t,r)=>{const n=e[r.trim()];return void 0!==n?String(n):""});if(Array.isArray(t))return t.map(t=>p(t,e));if(m(t)){const r={};for(const n in t)Object.prototype.hasOwnProperty.call(t,n)&&(r[n]=p(t[n],e));return r}return t}import{cloneDeep as y,defaultsDeep as b}from"lodash-es";var w=class t extends d{static get defaultOptions(){return Object.prototype.hasOwnProperty.call(this,"_defaultOptions")||(this._defaultOptions={}),this._defaultOptions}static set defaultOptions(t){this._defaultOptions=t}static getDefaultOptions(){const e=[];let r=this;for(;r&&r!==Object.prototype&&(Object.prototype.hasOwnProperty.call(r,"_defaultOptions")&&r._defaultOptions&&e.push(r._defaultOptions),r!==t);)r=Object.getPrototypeOf(r);return e.length>0?b({},...e):{}}static async search(t,e,r={}){const n=Array.isArray(t)?t:[t],s=[];for(let t=0;t<n.length;t++){const i=n[t],o=this.get(i),a=o?o.getDefaultOptions():this.getDefaultOptions(),c=b({},r,a),l=c.limit||10;if(s.length>=l)break;const u=l-s.length,f={...r,limit:u},d=this.createObject(i,f);if(!d)throw new Error(`Search engine not found: ${i}`);try{const t=await d.search(e,f);for(const e of t)e.url&&!s.some(t=>t.url===e.url)&&s.push(e);if(s.length>=l)break;if(!1===c.fillLimit)break}catch(e){if(console.warn(`[WebSearcher] Engine '${i}' failed completely:`,e),t===n.length-1&&0===s.length)throw e}finally{await d.dispose()}}return s}get template(){return{}}get pagination(){}getTemplate(t,e){return 
y(this.template)}createContext(t=this.options){const{actions:e,...r}=this.template,n=b({},r,t);return r.engine&&"auto"!==r.engine||!t.engine||(n.engine=t.engine),super.createContext(n)}async search(t,e={}){const r=this.constructor,n=(e=b({},e,this.options,r.getDefaultOptions())).limit||10,s=[],i=new Set;let o=e.startPage||0;const a=this.pagination?.startValue??0,c=this.pagination?.increment??1,l=e.maxPages||this.pagination?.maxPages||10,u=this.constructor.name;let f;e.baseUrls&&(Array.isArray(e.baseUrls)?f=e.baseUrls:"object"==typeof e.baseUrls&&(f=e.baseUrls[u]||e.baseUrls[this.constructor.alias?.[0]])),f&&0!==f.length||(f=this.constructor.defaultBaseUrls);const d=f&&f.length>0;let h=0;d&&"number"==typeof this.constructor.currentInstanceIndex&&(h=this.constructor.currentInstanceIndex);let m=!1;for(;s.length<n;){let r=!1,y=null;const w=d?f.length:1;let g=0;for(;g<w;){const l=d?f[h]:void 0,w=this.formatOptions(e),k=a+o*c,$={...e,...w,query:t,page:o+a,offset:k,limit:n,baseUrl:l?.endsWith("/")?l.slice(0,-1):l},A=p(this.getTemplate($,e),$),{actions:x,...q}=e,O=b({},A,q),v=[],D=O.actions||[];if(o===(e.startPage||0)||"url-param"===this.pagination?.type){if(O.url){D.some(t=>"goto"===(t.id??t.name??t.action)&&t.params?.url===O.url)||v.push({id:"goto",params:{url:O.url}})}}else"click-next"===this.pagination?.type&&this.pagination.nextButtonSelector&&(v.push({id:"click",params:{selector:this.pagination.nextButtonSelector}}),v.push({id:"waitFor",params:{networkIdle:!0,ms:500}}));v.push(...D),O.engine&&this.context.engine!==O.engine&&O.engine;try{const{outputs:n}=await this.executeAll(v,e),a={...e,query:t,page:o,baseUrl:l,engine:u};let c=await this.transform(n,a);e.transform&&(c=await e.transform(c,a));let f=!0;if(this.validateFetchResult&&(f=await this.validateFetchResult(c,a)),f&&e.validator&&(f=await e.validator(c,a)),!f)throw new Error(`Results validation failed for engine: ${u}, url: ${l}`);if(c&&0!==c.length)for(const t of 
c)t.url&&!i.has(t.url)&&(i.add(t.url),s.push(t));else m=!0;r=!0;break}catch(t){y=t,d&&(h=(h+1)%f.length,this.constructor.currentInstanceIndex=h),g++}}if(!r)throw y||new Error(`All instances failed for engine: ${u}`);if(m)break;if(s.length>=n||!this.pagination)break;if(o++,o>=l)break}return s.slice(0,n)}async validateFetchResult(t,e){return!0}async transform(t,e){return t.results||[]}formatOptions(t){return{...t}}};w._isFactory=!1;var g=w;h(g),g.prototype.name="Searcher";var k=class extends g{get template(){return{engine:"browser",browser:{headless:!1},url:"https://www.google.com/search?q=${query}&start=${offset}&tbs=${tbs}&tbm=${tbm}&gl=${gl}&hl=${hl}&safe=${safe}",actions:[{id:"extract",storeAs:"results",params:{type:"array",selector:"#main #search",items:{url:{selector:"a:has(h3)",attribute:"href",required:!0},title:{selector:"a:has(h3) h3",required:!0,mode:"innerText"},snippet:{selector:"div[style*='-webkit-line-clamp']",type:"html"}}}}]}}get pagination(){return{type:"url-param",paramName:"start",startValue:0,increment:10}}formatOptions(t){const e={};if(t.timeRange)if("string"==typeof t.timeRange){const r={hour:"qdr:h",day:"qdr:d",week:"qdr:w",month:"qdr:m",year:"qdr:y"};r[t.timeRange]&&(e.tbs=r[t.timeRange])}else{const r=new Date(t.timeRange.from),n=t.timeRange.to?new Date(t.timeRange.to):new Date;if(!isNaN(r.getTime())&&!isNaN(n.getTime())){const t=t=>`${t.getMonth()+1}/${t.getDate()}/${t.getFullYear()}`;e.tbs=`cdr:1,cd_min:${t(r)},cd_max:${t(n)}`}}if(t.category){const r={images:"isch",videos:"vid",news:"nws"};r[t.category]&&(e.tbm=r[t.category])}return t.region&&(e.gl=t.region),t.language&&(e.hl=t.language),t.safeSearch&&("strict"===t.safeSearch?e.safe="active":"off"===t.safeSearch&&(e.safe="images")),e}async transform(t){const e=t.results||[];return Array.isArray(e)?e.map(t=>{if(t.url&&t.url.startsWith("/url?q="))try{const e=new URL(t.url,"https://www.google.com").searchParams.get("q");e&&(t.url=e)}catch(t){}return t}):[]}};k.alias=["google"];export{f as 
FetcherOptions,k as GoogleSearcher,g as WebSearcher,c as extractDate,a as extractMetadataFrom,t as fetchHeaders,e as fetchPartial,o as normalizeDate,r as parseHeaders,n as parseHtml,u as testUrlsByLatency};
|
package/docs/README.md
CHANGED
|
@@ -63,6 +63,46 @@ const results = await WebSearcher.search(['Google', 'Bing', 'SearXNG'], 'open so
 
 Since `WebSearcher` extends `FetchSession`, you can instantiate it to keep cookies and storage alive across multiple requests. This is useful for authenticated searches or avoiding bot detection by behaving like a human.
 
+### 5. Default Search Parameters
+
+You can set default search parameters at three levels: **Global**, **Engine-specific**, and **Instance-level**. This avoids passing repetitive options to every `search()` call.
+
+The priority order (from highest to lowest) is:
+`search(query, options)` (Call) > `this.options` (Instance) > `Engine.defaultOptions` (Static Engine) > `WebSearcher.defaultOptions` (Static Global)
+
+#### A. Global Static Defaults
+
+Affects all search engines.
+
+```typescript
+import { WebSearcher } from '@isdk/web-fetcher';
+
+// Set global limit for all searchers
+WebSearcher.defaultOptions = { limit: 20, safeSearch: 'strict' };
+```
+
+#### B. Engine-Specific Static Defaults
+
+Affects only a specific engine (and its subclasses).
+
+```typescript
+import { GoogleSearcher } from '@isdk/web-fetcher';
+
+// Only Google will use these defaults
+GoogleSearcher.defaultOptions = { region: 'US', language: 'en' };
+```
+
+#### C. Instance-Level Defaults
+
+Set when creating a searcher instance.
+
+```typescript
+const google = new GoogleSearcher({ limit: 5, category: 'news' });
+
+// This search will use limit: 5 and category: 'news' automatically
+const results = await google.search('open source');
+```
+
 ### 🧬 Dynamic Templates
 
 While a static `template` works for simple search engines, many sites (like Google) change their HTML structure drastically based on the search category (e.g., 'Web' vs 'Images' vs 'News').
|