@d-zero/beholder 2.1.5 → 3.0.0
- package/CHANGELOG.md +44 -0
- package/README.md +9 -276
- package/dist/dom-evaluation.d.ts +100 -62
- package/dist/dom-evaluation.js +498 -195
- package/dist/index.d.ts +1 -1
- package/dist/meta/classify.d.ts +52 -0
- package/dist/meta/classify.js +731 -0
- package/dist/meta/id-extractors.d.ts +40 -0
- package/dist/meta/id-extractors.js +196 -0
- package/dist/meta/keys.d.ts +41 -0
- package/dist/meta/keys.js +507 -0
- package/dist/meta/parsers.d.ts +74 -0
- package/dist/meta/parsers.js +293 -0
- package/dist/meta/tag-detection.d.ts +59 -0
- package/dist/meta/tag-detection.js +120 -0
- package/dist/meta/types.d.ts +874 -0
- package/dist/meta/types.js +12 -0
- package/dist/scraper.js +22 -18
- package/dist/types.d.ts +8 -37
- package/package.json +5 -4
- package/src/dom-evaluation.spec.ts +521 -0
- package/src/dom-evaluation.ts +655 -227
- package/src/index.ts +43 -0
- package/src/meta/classify.spec.ts +281 -0
- package/src/meta/classify.ts +810 -0
- package/src/meta/id-extractors.spec.ts +69 -0
- package/src/meta/id-extractors.ts +206 -0
- package/src/meta/keys.ts +568 -0
- package/src/meta/parsers.spec.ts +178 -0
- package/src/meta/parsers.ts +304 -0
- package/src/meta/simple-wappalyzer.d.ts +37 -0
- package/src/meta/tag-detection.spec.ts +134 -0
- package/src/meta/tag-detection.ts +161 -0
- package/src/meta/types.ts +949 -0
- package/src/scraper.ts +32 -16
- package/src/types.ts +54 -54
- package/tsconfig.tsbuildinfo +1 -1
package/CHANGELOG.md
CHANGED
|
@@ -3,6 +3,50 @@
|
|
|
3
3
|
All notable changes to this project will be documented in this file.
|
|
4
4
|
See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
|
|
5
5
|
|
|
6
|
+
# [3.0.0](https://github.com/d-zero-dev/tools/compare/@d-zero/beholder@2.1.6...@d-zero/beholder@3.0.0) (2026-06-16)
|
|
7
|
+
|
|
8
|
+
### Bug Fixes
|
|
9
|
+
|
|
10
|
+
- **beholder:** warn loudly and tripwire-test puppeteer Page.\_client() coverage ([97a07ea](https://github.com/d-zero-dev/tools/commit/97a07ea273e90d50bfede1d68f594ddee9c33268))
|
|
11
|
+
|
|
12
|
+
- feat(beholder)!: expand meta extraction with frontmatter-keys schema and Wappalyzer tag detection ([6ee7861](https://github.com/d-zero-dev/tools/commit/6ee78617aac3fe3d5c022ccfd0df265de0c5310b))
|
|
13
|
+
|
|
14
|
+
### Features
|
|
15
|
+
|
|
16
|
+
- **beholder:** rewrite getAnchorList with single AX tree + parallel describeNode ([#876](https://github.com/d-zero-dev/tools/issues/876)) ([7e5b089](https://github.com/d-zero-dev/tools/commit/7e5b089695bd1e605d63c6faef2e8bf927bd861f))
|
|
17
|
+
|
|
18
|
+
### BREAKING CHANGES
|
|
19
|
+
|
|
20
|
+
- `Meta` is restructured from flat keys (`noindex`, `canonical`,
|
|
21
|
+
`'og:type'`, `'twitter:card'`, ...) into a nested shape backed by
|
|
22
|
+
`frontmatter-keys.md`. New required fields: `title`, `jsonLd`,
|
|
23
|
+
`speculationRules`, `originTrial`, `tags`, `others`. `getMeta(page)` now takes
|
|
24
|
+
a context object `getMeta(page, { url, html?, statusCode?, headers? }, timeout?)`.
|
|
25
|
+
Old top-level shortcuts (`canonical`, `alternate`, `noindex`, `nofollow`,
|
|
26
|
+
`noarchive`, `'og:*'`, `'twitter:card'`) are removed; values move to
|
|
27
|
+
`meta.link.canonical`, `meta.robots.*`, `meta.og.*`, `meta.twitter.*` etc.
|
|
28
|
+
|
|
29
|
+
Changes:
|
|
30
|
+
|
|
31
|
+
- New `src/meta/` module: `types.ts`, `keys.ts`, `parsers.ts`, `classify.ts`,
|
|
32
|
+
`id-extractors.ts`, `tag-detection.ts`, plus ambient `simple-wappalyzer.d.ts`
|
|
33
|
+
- Browser-side `collectHead()` serializes every `<meta>`, `<link>`, structured-data
|
|
34
|
+
`<script>`, `<base>`, `<iframe>` plus a curated set of `window` globals into
|
|
35
|
+
`RawHeadEntry[]`; Node-side `classify()` maps these to typed Meta fields
|
|
36
|
+
- `simple-wappalyzer` (MIT) added as a dependency for technology detection;
|
|
37
|
+
detected providers run through `id-extractors.ts` for real ID extraction
|
|
38
|
+
(GA4, GTM, UA, FB Pixel, Hotjar, Clarity, ...)
|
|
39
|
+
- Unknown markup is preserved under `Meta.others` (meta/property/httpEquiv/
|
|
40
|
+
itemprop/link/script/iframe buckets) so nothing is silently dropped
|
|
41
|
+
- Tests: parsers/classify/id-extractors/tag-detection units + getMeta
|
|
42
|
+
error/timeout fallback
|
|
43
|
+
|
|
44
|
+
## [2.1.6](https://github.com/d-zero-dev/tools/compare/@d-zero/beholder@2.1.5...@d-zero/beholder@2.1.6) (2026-06-15)
|
|
45
|
+
|
|
46
|
+
### Bug Fixes
|
|
47
|
+
|
|
48
|
+
- **beholder:** prevent getMeta from hanging on unresponsive pages ([f55bb26](https://github.com/d-zero-dev/tools/commit/f55bb261c1868b40709cbae6aa17d273c5516e74)), closes [#874](https://github.com/d-zero-dev/tools/issues/874)
|
|
49
|
+
|
|
6
50
|
## [2.1.5](https://github.com/d-zero-dev/tools/compare/@d-zero/beholder@2.1.4...@d-zero/beholder@2.1.5) (2026-05-27)
|
|
7
51
|
|
|
8
52
|
**Note:** Version bump only for package @d-zero/beholder
|
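The `BREAKING CHANGES` entry in the changelog above moves the old flat `Meta` keys under nested objects. The following is a minimal sketch of a read-side compatibility shim, not the package's own code. It assumes only the paths named in the changelog and in the `getMeta` JSDoc (`meta.link.canonical`, `meta.robots.*`, and `meta.og.image` as an array). `NestedMetaSketch`, `LegacyMetaView`, and `toLegacyView` are hypothetical names, and the exact 3.0.0 field types are not visible in this part of the diff, so the types below are assumptions.

```ts
// Hypothetical, partial view of the nested 3.0.0 `Meta`. Only paths named in
// the changelog are modelled, and their exact types are assumptions.
type NestedMetaSketch = {
  title: string;
  link?: { canonical?: string };
  robots?: { noindex?: boolean; nofollow?: boolean; noarchive?: boolean };
  og?: { image?: string[] };
};

// A subset of the flat 2.x keys, for callers that still read the old shape.
type LegacyMetaView = {
  title: string;
  canonical?: string;
  noindex: boolean;
  nofollow: boolean;
  noarchive: boolean;
  'og:image'?: string;
};

export function toLegacyView(meta: NestedMetaSketch): LegacyMetaView {
  return {
    title: meta.title,
    canonical: meta.link?.canonical,
    // Missing robots directives are treated as "not set".
    noindex: meta.robots?.noindex ?? false,
    nofollow: meta.robots?.nofollow ?? false,
    noarchive: meta.robots?.noarchive ?? false,
    // The old key held a single string; take the first image if any.
    'og:image': meta.og?.image?.[0],
  };
}
```

A shim like this lets downstream code migrate call sites gradually; new code would normally read the nested paths directly.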
package/README.md
CHANGED
|
@@ -1,58 +1,16 @@
|
|
|
1
1
|
# `@d-zero/beholder`
|
|
2
2
|
|
|
3
|
-
Puppeteer
|
|
3
|
+
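An in-process scraper that takes a Puppeteer `Page` and collects a single page's metadata, links, images, and network resources. Results are returned as a `ScrapeResult` return value (not via events). Browser management is the caller's responsibility.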
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
## Installation
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
```bash
|
|
7
|
+
```sh
|
|
10
8
|
yarn add @d-zero/beholder
|
|
11
9
|
```
|
|
12
10
|
|
|
13
|
-
##
|
|
14
|
-
|
|
15
|
-
`@d-zero/beholder` is an in-process scraper that takes a Puppeteer `Page` object and scrapes a single page.
|
|
16
|
-
|
|
17
|
-
**Key features:**
|
|
18
|
-
|
|
19
|
-
- Results are returned as a `ScrapeResult` return value (not via events)
|
|
20
|
-
- Progress can be monitored through streaming events (`changePhase`, `resourceResponse`)
|
|
21
|
-
- Page skipping via keyword and path exclusion
|
|
22
|
-
- Responsive image capture with support for multiple device presets
|
|
23
|
-
- Browser management is the caller's responsibility (Scraper only operates on the page)
|
|
24
|
-
|
|
25
|
-
## Exported API
|
|
11
|
+
## Usage
|
|
26
12
|
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
A page-level scraper class. It extends `TypedAwaitEventEmitter`.
|
|
30
|
-
|
|
31
|
-
```typescript
|
|
32
|
-
import Scraper from '@d-zero/beholder';
|
|
33
|
-
```
|
|
34
|
-
|
|
35
|
-
#### `scrapeStart(page, url, options?, isSkip?)`
|
|
36
|
-
|
|
37
|
-
Runs scraping on a Puppeteer page.
|
|
38
|
-
|
|
39
|
-
**Parameters:**
|
|
40
|
-
|
|
41
|
-
- `page` (`Page`): the Puppeteer page instance
|
|
42
|
-
- `url` (`ExURL`): the URL to scrape
|
|
43
|
-
- `options` (`Partial<ScraperOptions>`, optional): scraping options
|
|
44
|
-
- `isSkip` (`boolean`, optional): when `true`, skips the page without making network requests
|
|
45
|
-
|
|
46
|
-
**Returns:** `Promise<ScrapeResult>`
|
|
47
|
-
|
|
48
|
-
- `type: "success"`: the scraping result is stored in `pageData`
|
|
49
|
-
- `type: "skipped"`: the skip reason is stored in `ignored`
|
|
50
|
-
- `type: "error"`: the error details are stored in `error`
|
|
51
|
-
- `failedRequests`: list of subresource requests that failed, for example due to a network disconnection (present only when there are any)
|
|
52
|
-
|
|
53
|
-
**Example:**
|
|
54
|
-
|
|
55
|
-
```typescript
|
|
13
|
+
```ts
|
|
56
14
|
import Scraper from '@d-zero/beholder';
|
|
57
15
|
import { parseUrl } from '@d-zero/shared/parse-url';
|
|
58
16
|
import { launch } from 'puppeteer';
|
|
@@ -61,241 +19,16 @@ const browser = await launch();
|
|
|
61
19
|
const page = await browser.newPage();
|
|
62
20
|
|
|
63
21
|
const scraper = new Scraper();
|
|
22
|
+
scraper.on('changePhase', (event) => console.log(event.message));
|
|
64
23
|
|
|
65
|
-
|
|
66
|
-
scraper.on('changePhase', async (event) => {
|
|
67
|
-
console.log(`[${event.pid}] ${event.name}: ${event.message}`);
|
|
68
|
-
});
|
|
69
|
-
|
|
70
|
-
// Run the scrape
|
|
71
|
-
const url = parseUrl('https://example.com');
|
|
72
|
-
const result = await scraper.scrapeStart(page, url, {
|
|
24
|
+
const result = await scraper.scrapeStart(page, parseUrl('https://example.com'), {
|
|
73
25
|
captureImages: true,
|
|
74
|
-
excludeKeywords: ['広告'],
|
|
75
26
|
isExternal: false,
|
|
76
27
|
});
|
|
77
28
|
|
|
78
29
|
if (result.type === 'success') {
|
|
79
|
-
console.log(
|
|
80
|
-
console.log('Link count:', result.pageData?.anchorList.length);
|
|
81
|
-
console.log('Image count:', result.pageData?.imageList.length);
|
|
82
|
-
console.log('Subresource count:', result.resources.length);
|
|
30
|
+
console.log(result.pageData?.meta.title);
|
|
83
31
|
}
|
|
84
|
-
|
|
85
|
-
// Clean up at the browser level
|
|
86
|
-
await page.close();
|
|
87
|
-
await browser.close();
|
|
88
|
-
```
|
|
89
|
-
|
|
90
|
-
#### Events
|
|
91
|
-
|
|
92
|
-
| Event              | Description                                    |
|
|
93
|
-
| ------------------ | ---------------------------------------------- |
|
|
94
|
-
| `changePhase`      | Fired when the scraping phase changes          |
|
|
95
|
-
| `resourceResponse` | Fired when a subresource response is captured  |
|
|
96
|
-
|
|
97
|
-
### `ScraperOptions`
|
|
98
|
-
|
|
99
|
-
| Property            | Type       | Default    | Description                                                      |
|
|
100
|
-
| ------------------- | ---------- | ---------- | ---------------------------------------------------------------- |
|
|
101
|
-
| `isExternal`        | `boolean`  | `false`    | Whether the URL is external                                      |
|
|
102
|
-
| `captureImages`     | `boolean`  | `true`     | Whether to capture image data                                    |
|
|
103
|
-
| `excludeKeywords`   | `string[]` | `[]`       | Keywords that cause the page to be skipped when matched in the HTML |
|
|
104
|
-
| `metadataOnly`      | `boolean`  | `false`    | Fetch metadata only (no browser scraping)                        |
|
|
105
|
-
| `imageLoadTimeout`  | `number`   | `5000`     | Timeout for waiting on image loads (ms)                          |
|
|
106
|
-
| `disableQueries`    | `boolean`  | -          | Whether to strip query parameters when parsing URLs              |
|
|
107
|
-
| `retries`           | `number`   | -          | Number of retries for network operations                         |
|
|
108
|
-
| `headCheckResult`   | `PageData` | -          | Pre-fetched HEAD check result (the HEAD request is skipped when omitted) |
|
|
109
|
-
| `navigationTimeout` | `number`   | `60000`    | Timeout for `page.goto()` (ms)                                   |
|
|
110
|
-
|
|
111
|
-
### Utility functions
|
|
112
|
-
|
|
113
|
-
#### `isError(status)`
|
|
114
|
-
|
|
115
|
-
Determines whether an HTTP status code is an error. 200-399 counts as success; anything else is an error.
|
|
116
|
-
|
|
117
|
-
```typescript
|
|
118
|
-
import { isError } from '@d-zero/beholder';
|
|
119
|
-
|
|
120
|
-
isError(200); // false
|
|
121
|
-
isError(404); // true
|
|
122
|
-
```
|
|
123
|
-
|
|
124
|
-
#### `detectCompress(headers)` / `detectCDN(headers)`
|
|
125
|
-
|
|
126
|
-
Detects the compression method and CDN provider from response headers (re-exported from `@d-zero/shared`).
|
|
127
|
-
|
|
128
|
-
### Type definitions
|
|
129
|
-
|
|
130
|
-
#### `ScrapeResult`
|
|
131
|
-
|
|
132
|
-
Represents the result of a scraping operation.
|
|
133
|
-
|
|
134
|
-
```typescript
|
|
135
|
-
type ScrapeResult = {
|
|
136
|
-
type: 'success' | 'skipped' | 'error';
|
|
137
|
-
pageData?: PageData;
|
|
138
|
-
resources: ResourceEntry[];
|
|
139
|
-
ignored?: { url: ExURL; matchedText: string; excludeKeywords: string[] };
|
|
140
|
-
error?: { name: string; message: string; stack?: string; shutdown: boolean };
|
|
141
|
-
failedRequests?: ReadonlyArray<{ url: string; errorText: string }>;
|
|
142
|
-
};
|
|
143
32
|
```
|
|
144
33
|
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
Page data for a successful scrape.
|
|
148
|
-
|
|
149
|
-
```typescript
|
|
150
|
-
type PageData = {
|
|
151
|
-
url: ExURL;
|
|
152
|
-
redirectPaths: string[];
|
|
153
|
-
isTarget: boolean;
|
|
154
|
-
isExternal: boolean;
|
|
155
|
-
status: number;
|
|
156
|
-
statusText: string;
|
|
157
|
-
contentType: string | null;
|
|
158
|
-
contentLength: number | null;
|
|
159
|
-
responseHeaders: Record<string, string | string[] | undefined> | null;
|
|
160
|
-
meta: Meta;
|
|
161
|
-
anchorList: AnchorData[];
|
|
162
|
-
imageList: ImageElement[];
|
|
163
|
-
html: string;
|
|
164
|
-
isSkipped: false;
|
|
165
|
-
};
|
|
166
|
-
```
|
|
167
|
-
|
|
168
|
-
#### `Meta`
|
|
169
|
-
|
|
170
|
-
Metadata extracted from the page's `<head>`.
|
|
171
|
-
|
|
172
|
-
```typescript
|
|
173
|
-
type Meta = {
|
|
174
|
-
lang?: string;
|
|
175
|
-
title: string;
|
|
176
|
-
description?: string;
|
|
177
|
-
keywords?: string;
|
|
178
|
-
noindex?: boolean;
|
|
179
|
-
nofollow?: boolean;
|
|
180
|
-
noarchive?: boolean;
|
|
181
|
-
canonical?: string;
|
|
182
|
-
alternate?: string;
|
|
183
|
-
'og:type'?: string;
|
|
184
|
-
'og:title'?: string;
|
|
185
|
-
'og:site_name'?: string;
|
|
186
|
-
'og:description'?: string;
|
|
187
|
-
'og:url'?: string;
|
|
188
|
-
'og:image'?: string;
|
|
189
|
-
'twitter:card'?: string;
|
|
190
|
-
};
|
|
191
|
-
```
|
|
192
|
-
|
|
193
|
-
#### `AnchorData`
|
|
194
|
-
|
|
195
|
-
Data for an anchor element (`<a>` / `<area>`).
|
|
196
|
-
|
|
197
|
-
```typescript
|
|
198
|
-
type AnchorData = {
|
|
199
|
-
href: ExURL;
|
|
200
|
-
textContent: string;
|
|
201
|
-
isExternal?: boolean;
|
|
202
|
-
};
|
|
203
|
-
```
|
|
204
|
-
|
|
205
|
-
#### `ImageElement`
|
|
206
|
-
|
|
207
|
-
Data for an image element. Captured once per device preset.
|
|
208
|
-
|
|
209
|
-
```typescript
|
|
210
|
-
type ImageElement = {
|
|
211
|
-
src: string;
|
|
212
|
-
currentSrc: string;
|
|
213
|
-
alt: string;
|
|
214
|
-
width: number;
|
|
215
|
-
height: number;
|
|
216
|
-
naturalWidth: number;
|
|
217
|
-
naturalHeight: number;
|
|
218
|
-
isLazy: boolean;
|
|
219
|
-
viewportWidth: number;
|
|
220
|
-
sourceCode: string;
|
|
221
|
-
};
|
|
222
|
-
```
|
|
223
|
-
|
|
224
|
-
#### `ResourceEntry`
|
|
225
|
-
|
|
226
|
-
A subresource captured during page load.
|
|
227
|
-
|
|
228
|
-
```typescript
|
|
229
|
-
type ResourceEntry = {
|
|
230
|
-
log: NetworkLog;
|
|
231
|
-
resource: Omit<Resource, 'uid'>;
|
|
232
|
-
pageUrl: string;
|
|
233
|
-
};
|
|
234
|
-
```
|
|
235
|
-
|
|
236
|
-
#### `NetworkLog`
|
|
237
|
-
|
|
238
|
-
A log entry for a network request/response.
|
|
239
|
-
|
|
240
|
-
```typescript
|
|
241
|
-
type NetworkLog = {
|
|
242
|
-
url: ExURL;
|
|
243
|
-
status: number | null;
|
|
244
|
-
contentLength: number;
|
|
245
|
-
contentType: string;
|
|
246
|
-
isError: boolean;
|
|
247
|
-
request: { ts: number; headers: Record<string, string>; method: string };
|
|
248
|
-
response?: {
|
|
249
|
-
ts: number;
|
|
250
|
-
status: number;
|
|
251
|
-
statusText: string;
|
|
252
|
-
fromCache: boolean;
|
|
253
|
-
headers: Record<string, string>;
|
|
254
|
-
};
|
|
255
|
-
};
|
|
256
|
-
```
|
|
257
|
-
|
|
258
|
-
#### `Resource`
|
|
259
|
-
|
|
260
|
-
Metadata for a network resource.
|
|
261
|
-
|
|
262
|
-
```typescript
|
|
263
|
-
type Resource = {
|
|
264
|
-
url: ExURL;
|
|
265
|
-
isExternal: boolean;
|
|
266
|
-
isError: boolean;
|
|
267
|
-
status: number | null;
|
|
268
|
-
statusText: string | null;
|
|
269
|
-
contentType: string | null;
|
|
270
|
-
contentLength: number | null;
|
|
271
|
-
compress: false | CompressType;
|
|
272
|
-
cdn: false | CDNType;
|
|
273
|
-
headers: Record<string, string | string[] | undefined> | null;
|
|
274
|
-
};
|
|
275
|
-
```
|
|
276
|
-
|
|
277
|
-
#### `ChangePhaseEvent`
|
|
278
|
-
|
|
279
|
-
A phase-transition event in the scraping lifecycle.
|
|
280
|
-
|
|
281
|
-
Main phases: `scrapeStart` → `openPage` → `loadDOMContent` → `waitNetworkIdle` → `getHTML` → `getAnchors` → `getMeta` → `extractImages` → `getImages` → `scrapeEnd`
|
|
282
|
-
|
|
283
|
-
Other phases: `launchBrowser`, `headRequest`, `headRequestTimeout`, `newPage`, `setViewport`, `scrollToBottom`, `waitImageLoad`, `pageSkipped`, `retryWait`, `retryExhausted`, `beforeCleanup`, `cleanedUp`
|
|
284
|
-
|
|
285
|
-
#### `SkippedPageData`
|
|
286
|
-
|
|
287
|
-
Data for a page that was skipped by keyword or path exclusion.
|
|
288
|
-
|
|
289
|
-
```typescript
|
|
290
|
-
type SkippedPageData = {
|
|
291
|
-
isSkipped: true;
|
|
292
|
-
url: ExURL;
|
|
293
|
-
matched:
|
|
294
|
-
| { type: 'keyword'; text: string; excludeKeywords: string[] }
|
|
295
|
-
| { type: 'path'; excludes: string[] };
|
|
296
|
-
};
|
|
297
|
-
```
|
|
298
|
-
|
|
299
|
-
## License
|
|
300
|
-
|
|
301
|
-
MIT
|
|
34
|
+
For design decisions (why results are returned rather than emitted as events, who owns the `page` lifecycle, the retry mechanism, and so on), see the JSDoc in `src/scraper.ts`.
|
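The 3.0.0 README drops the reference section that described the three `ScrapeResult` outcomes. As a reminder of how a caller might branch on them, here is a minimal sketch. `ScrapeResultSketch` is a local, trimmed copy of the shape documented in the 2.1.5 README, and it is an assumption that 3.0.0 keeps these fields. `summarize` is a hypothetical helper, not part of the package.

```ts
// Trimmed local copy of the result shape from the 2.1.5 README. It is assumed,
// not confirmed by this diff, that 3.0.0 keeps these fields.
type ScrapeResultSketch = {
  type: 'success' | 'skipped' | 'error';
  pageData?: { meta: { title: string }; anchorList: unknown[] };
  resources: unknown[];
  ignored?: { matchedText: string };
  error?: { name: string; message: string; shutdown: boolean };
  failedRequests?: ReadonlyArray<{ url: string; errorText: string }>;
};

// Produces a one-line summary for logging, branching on the `type` tag.
export function summarize(result: ScrapeResultSketch): string {
  if (result.type === 'success') {
    const title = result.pageData?.meta.title ?? '';
    const links = result.pageData?.anchorList.length ?? 0;
    const failed = result.failedRequests?.length ?? 0;
    return `ok: "${title}", ${links} links, ${result.resources.length} resources, ${failed} failed requests`;
  }
  if (result.type === 'skipped') {
    return `skipped: matched "${result.ignored?.matchedText ?? ''}"`;
  }
  return `error: ${result.error?.name ?? 'Error'}: ${result.error?.message ?? ''}`;
}
```

Because every payload field is optional in the documented type, the sketch reads them defensively instead of assuming that `type: 'success'` guarantees `pageData`.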
package/dist/dom-evaluation.d.ts
CHANGED
|
@@ -3,10 +3,28 @@
|
|
|
3
3
|
*
|
|
4
4
|
* These functions are called by {@link ./scraper.ts | Scraper.#fetchData} to extract
|
|
5
5
|
* anchors, images, and meta information after page navigation completes.
|
|
6
|
+
*
|
|
7
|
+
* WHY timeouts everywhere: A page whose main thread is blocked (heavy JS, autoplay
|
|
8
|
+
* video players, infinite loops) makes every CDP round-trip hang. `getMeta` and
|
|
9
|
+
* `getImageList` therefore collect all data in a single `page.evaluate` and wrap it
|
|
10
|
+
* in {@link raceWithTimeout} so a blocked thread is abandoned after a bounded budget
|
|
11
|
+
* instead of accumulating per-property timeouts up to the caller's global timeout.
|
|
12
|
+
* Note that `page.evaluate` itself runs on the page's main thread and has no built-in
|
|
13
|
+
* timeout, so the surrounding race is what actually bounds the hang.
|
|
6
14
|
* @see {@link ./types.ts} for the data types returned by these functions
|
|
7
15
|
*/
|
|
8
|
-
import type { AnchorData, ImageElement, ParseURLOptions } from './types.js';
|
|
16
|
+
import type { AnchorData, ImageElement, Meta, ParseURLOptions } from './types.js';
|
|
9
17
|
import type { ElementHandle, Page } from 'puppeteer';
|
|
18
|
+
/**
|
|
19
|
+
* Default timeout (ms) applied to DOM evaluation operations when the caller does not
|
|
20
|
+
* specify one. Bounds how long a single `page.evaluate` / property read may hang on a
|
|
21
|
+
* page whose main thread is unresponsive.
|
|
22
|
+
*
|
|
23
|
+
* WHY 180s: Aligned with the upstream `Scraper#fetchData` retryable timeout (3 min) so
|
|
24
|
+
* a single phase does not exceed the retry budget while still tolerating large pages
|
|
25
|
+
* (e.g., 1000+ anchors) and slow main threads.
|
|
26
|
+
*/
|
|
27
|
+
export declare const DEFAULT_DOM_EVALUATION_TIMEOUT = 180000;
|
|
10
28
|
/**
|
|
11
29
|
* Parameters for {@link getProp}.
|
|
12
30
|
* @template T - The expected type of the property value.
|
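The header comment above relies on `raceWithTimeout` to bound a hung `page.evaluate`, but its implementation is not part of this hunk. The following is a minimal sketch of the general pattern, assuming a fallback-returning design like the one described for `getProp`. The name `raceWithTimeoutSketch` and its signature are hypothetical and may differ from the package's helper.

```ts
// Hypothetical helper: resolves with `fallback` if `work` does not settle
// within `timeoutMs`, or if `work` rejects.
export async function raceWithTimeoutSketch<T>(
  work: Promise<T>,
  timeoutMs: number,
  fallback: T,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  try {
    // Attaching `catch` here also keeps a late rejection of `work`
    // (after the timeout already won) from surfacing as unhandled.
    return await Promise.race([work.catch(() => fallback), timeout]);
  } finally {
    // Clear the losing timer so it cannot keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that the race only stops the caller from waiting; it does not cancel the underlying work, which is why the comment above describes a blocked thread as being abandoned rather than stopped.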
|
@@ -22,88 +40,108 @@ export interface GetPropParams<T> {
|
|
|
22
40
|
/**
|
|
23
41
|
* Retrieves a DOM property value from a Puppeteer element handle with a timeout.
|
|
24
42
|
*
|
|
25
|
-
* Races the actual property retrieval against a
|
|
43
|
+
* Races the actual property retrieval against a timeout via {@link raceWithTimeout},
|
|
44
|
+
* which clears the loser-side timer so it cannot keep the event loop alive.
|
|
26
45
|
* If the property cannot be read or the timeout expires, the fallback value is returned.
|
|
27
46
|
* @template T - The expected type of the property value.
|
|
28
47
|
* @param params - Parameters containing the element, property name, and fallback.
|
|
29
|
-
* @
|
|
30
|
-
|
|
31
|
-
export declare function getProp<T>(params: GetPropParams<T>): Promise<T>;
|
|
32
|
-
/**
|
|
33
|
-
* Parameters for {@link getPropBySelector}.
|
|
34
|
-
* @template T - The expected type of the property value.
|
|
35
|
-
*/
|
|
36
|
-
export interface GetPropBySelectorParams<T> {
|
|
37
|
-
/** The Puppeteer page to query. */
|
|
38
|
-
readonly page: Page;
|
|
39
|
-
/** A CSS selector to find the target element. */
|
|
40
|
-
readonly selector: string;
|
|
41
|
-
/** The DOM property name to read from the matched element. */
|
|
42
|
-
readonly propName: string;
|
|
43
|
-
/** The default value if no element matches or the property cannot be read. */
|
|
44
|
-
readonly fallback: T;
|
|
45
|
-
}
|
|
46
|
-
/**
|
|
47
|
-
* Retrieves a DOM property value from the first element matching a CSS selector.
|
|
48
|
-
*
|
|
49
|
-
* Combines `page.$()` with {@link getProp} for convenient single-element lookups.
|
|
50
|
-
* @template T - The expected type of the property value.
|
|
51
|
-
* @param params - Parameters containing the page, selector, property name, and fallback.
|
|
52
|
-
* @returns The property value, or the fallback if the element is not found or retrieval fails.
|
|
48
|
+
* @param timeout - Timeout in ms before falling back. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
|
|
49
|
+
* @returns The property value, or the fallback if retrieval fails or times out.
|
|
53
50
|
*/
|
|
54
|
-
export declare function
|
|
51
|
+
export declare function getProp<T>(params: GetPropParams<T>, timeout?: number): Promise<T>;
|
|
55
52
|
/**
|
|
56
53
|
* Extracts all `<img>` elements from the page and returns their properties.
|
|
57
54
|
*
|
|
58
|
-
*
|
|
59
|
-
* natural dimensions, lazy-loading status, and
|
|
55
|
+
* Collects every image's `src`, `currentSrc`, `alt`, layout dimensions,
|
|
56
|
+
* natural dimensions, lazy-loading status, and outer HTML in a single
|
|
57
|
+
* `page.evaluate` call, wrapped in {@link raceWithTimeout}. On timeout (an
|
|
58
|
+
* unresponsive page) an empty array is returned rather than hanging.
|
|
60
59
|
* @param page - The Puppeteer page to extract images from.
|
|
61
60
|
* @param viewportWidth - The current viewport width in pixels, recorded alongside each image entry.
|
|
61
|
+
* @param timeout - Timeout in ms for the evaluation. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
|
|
62
62
|
* @returns An array of {@link ImageElement} objects describing each image on the page.
|
|
63
63
|
*/
|
|
64
|
-
export declare function getImageList(page: Page, viewportWidth: number): Promise<ImageElement[]>;
|
|
64
|
+
export declare function getImageList(page: Page, viewportWidth: number, timeout?: number): Promise<ImageElement[]>;
|
|
65
65
|
/**
|
|
66
66
|
* Extracts all anchor (`<a>` and `<area>`) elements with `href` attributes from the page.
|
|
67
67
|
*
|
|
68
68
|
* For each anchor, resolves the `href` to an `ExURL` via `parseUrl`, retrieves
|
|
69
69
|
* the accessible name (from the accessibility tree, falling back to `textContent`),
|
|
70
70
|
* and filters out non-HTTP links.
|
|
71
|
+
*
|
|
72
|
+
* WHY Strategy F (single AX-tree fetch + parallel `DOM.describeNode`): the old
|
|
73
|
+
* implementation called `page.accessibility.snapshot({ root })` per anchor, which
|
|
74
|
+
* triggers a CDP round-trip *and* a Chrome-side AX subtree computation (~42ms
|
|
75
|
+
* each). On a page with 1181 anchors that compounded to ~53s. By fetching the
|
|
76
|
+
* full AX tree once and using `DOM.describeNode` in parallel to map element
|
|
77
|
+
* handles back to AX nodes by `backendDOMNodeId`, the same data is collected in
|
|
78
|
+
* ~150ms on the same page — a ~350× speed-up while preserving the original
|
|
79
|
+
* accessible-name semantics. See issue #876 for measurements.
|
|
80
|
+
*
|
|
81
|
+
* WHY the whole operation is wrapped in `raceWithTimeout`: even with bounded
|
|
82
|
+
* per-CDP-call timeouts, a degenerate page (blocked main thread, thousands of
|
|
83
|
+
* anchors, runaway describeNode latency) could chain enough sub-timeouts to
|
|
84
|
+
* exceed the caller's `timeout` budget. The outer race guarantees the function
|
|
85
|
+
* returns within `timeout`, surfacing whatever anchors were collected so far so
|
|
86
|
+
* the upstream scrape phase can continue rather than tripping a retryable retry.
|
|
71
87
|
* @param page - The Puppeteer page to extract anchors from.
|
|
72
88
|
* @param options - Optional URL parsing options (e.g., `disableQueries`).
|
|
89
|
+
* @param timeout - Total time budget in ms for the whole extraction. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
|
|
73
90
|
* @returns An array of {@link AnchorData} objects for all HTTP(S) links found on the page.
|
|
74
91
|
*/
|
|
75
|
-
export declare function getAnchorList(page: Page, options?: ParseURLOptions): Promise<AnchorData[]>;
|
|
92
|
+
export declare function getAnchorList(page: Page, options?: ParseURLOptions, timeout?: number): Promise<AnchorData[]>;
|
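The "Strategy F" comment above describes fetching the accessibility tree once and then mapping element handles to AX nodes by `backendDOMNodeId`. A minimal sketch of that lookup follows, written against a small structural interface so it runs without a browser. `CdpLike` and `accessibleNamesSketch` are hypothetical names; `Accessibility.getFullAXTree` and `DOM.describeNode` are the Chrome DevTools Protocol methods, and the response shapes are trimmed to the fields used here. The real implementation also applies timeouts and a `textContent` fallback, which this sketch omits.

```ts
// Minimal structural stand-in for a CDP session (hypothetical name).
interface CdpLike {
  send(method: string, params?: Record<string, unknown>): Promise<unknown>;
}

// Trimmed AX node shape: only the fields this sketch reads.
type AxNodeSketch = { backendDOMNodeId?: number; name?: { value?: unknown } };

export async function accessibleNamesSketch(
  cdp: CdpLike,
  objectIds: readonly string[],
): Promise<(string | null)[]> {
  // One round-trip for the whole accessibility tree.
  const tree = (await cdp.send('Accessibility.getFullAXTree')) as { nodes: AxNodeSketch[] };
  const nameByBackendId = new Map<number, string>();
  for (const node of tree.nodes) {
    const value = node.name?.value;
    if (node.backendDOMNodeId !== undefined && typeof value === 'string') {
      nameByBackendId.set(node.backendDOMNodeId, value);
    }
  }
  // The describeNode calls are independent, so they are issued concurrently.
  return Promise.all(
    objectIds.map(async (objectId) => {
      const described = (await cdp.send('DOM.describeNode', { objectId })) as {
        node: { backendNodeId: number };
      };
      return nameByBackendId.get(described.node.backendNodeId) ?? null;
    }),
  );
}
```

The cost model is the point of the design: the expensive AX computation happens once, and the per-anchor work shrinks to a cheap ID lookup.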
|
76
93
|
/**
|
|
77
|
-
*
|
|
94
|
+
* Required context for {@link getMeta}. Provided by the scraper from data it
|
|
95
|
+
* already has on hand (URL it navigated to, response status/headers it received).
|
|
96
|
+
*
|
|
97
|
+
* `html` is optional: when omitted, `getMeta` falls back to `page.content()`
|
|
98
|
+
* to obtain the rendered HTML for the third-party tag detection pass.
|
|
99
|
+
*/
|
|
100
|
+
export type GetMetaContext = {
|
|
101
|
+
/** The fully resolved URL of the page (after redirects). */
|
|
102
|
+
readonly url: string;
|
|
103
|
+
/** Rendered HTML. Falls back to `page.content()` when omitted. */
|
|
104
|
+
readonly html?: string;
|
|
105
|
+
/** Response status code, surfaced to the Wappalyzer driver. */
|
|
106
|
+
readonly statusCode?: number;
|
|
107
|
+
/** Response headers; case is preserved by the caller, lowercased internally. */
|
|
108
|
+
readonly headers?: Record<string, string | string[] | undefined>;
|
|
109
|
+
/**
|
|
110
|
+
* When `true`, the returned `Meta` includes `_raw: RawHeadEntry[]` for
|
|
111
|
+
* debugging. Default `false` to keep the serialized payload small.
|
|
112
|
+
*/
|
|
113
|
+
readonly includeRaw?: boolean;
|
|
114
|
+
};
|
|
115
|
+
/**
|
|
116
|
+
* Extracts comprehensive metadata from the page.
|
|
117
|
+
*
|
|
118
|
+
* Two passes happen in parallel:
|
|
119
|
+
* 1. Browser-side `collectHead()` serializes every `<meta>`, `<link>`,
|
|
120
|
+
* relevant `<script>`, `<base>`, `<noscript>`/`<iframe>` and a curated
|
|
121
|
+
* set of `window` globals into a `RawHeadEntry[]`. Node-side `classify()`
|
|
122
|
+
* then maps those entries to typed `Meta` fields using the lookup tables
|
|
123
|
+
* in `./meta/keys.ts`, with unknown entries preserved in `Meta.others`.
|
|
124
|
+
* 2. `detectTags()` runs `simple-wappalyzer` over the page HTML to produce
|
|
125
|
+
* `Meta.tags` (technology detection + real-ID extraction).
|
|
78
126
|
*
|
|
79
|
-
*
|
|
80
|
-
*
|
|
81
|
-
*
|
|
82
|
-
*
|
|
83
|
-
*
|
|
84
|
-
*
|
|
85
|
-
*
|
|
86
|
-
*
|
|
87
|
-
*
|
|
88
|
-
*
|
|
89
|
-
*
|
|
90
|
-
*
|
|
127
|
+
* The whole call is wrapped in `raceWithTimeout`. On timeout an empty `Meta`
|
|
128
|
+
* (with `title: ''` and empty required arrays/objects) is returned.
|
|
129
|
+
* @param page
|
|
130
|
+
* @param context
|
|
131
|
+
* @param timeout
|
|
132
|
+
* @example
|
|
133
|
+
* ```ts
|
|
134
|
+
* const meta = await getMeta(page, {
|
|
135
|
+
* url: 'https://example.com/',
|
|
136
|
+
* html: await page.content(),
|
|
137
|
+
* statusCode: response.status,
|
|
138
|
+
* headers: response.headers,
|
|
139
|
+
* });
|
|
140
|
+
* console.log(meta.title); // <title> text
|
|
141
|
+
* console.log(meta.og?.image); // og:image[] array
|
|
142
|
+
* console.log(meta.robots?.noindex); // parsed robots
|
|
143
|
+
* console.log(meta.tags.detected.Analytics); // Wappalyzer hits
|
|
144
|
+
* console.log(meta.tags.entries.find(e => e.provider === 'Google Analytics')?.id);
|
|
145
|
+
* ```
|
|
91
146
|
*/
|
|
92
|
-
export declare function getMeta(page: Page): Promise<
|
|
93
|
-
title: string;
|
|
94
|
-
lang: string;
|
|
95
|
-
description: string;
|
|
96
|
-
keywords: string;
|
|
97
|
-
noindex: boolean;
|
|
98
|
-
nofollow: boolean;
|
|
99
|
-
noarchive: boolean;
|
|
100
|
-
canonical: string;
|
|
101
|
-
alternate: string;
|
|
102
|
-
'og:type': string;
|
|
103
|
-
'og:title': string;
|
|
104
|
-
'og:site_name': string;
|
|
105
|
-
'og:description': string;
|
|
106
|
-
'og:url': string;
|
|
107
|
-
'og:image': string;
|
|
108
|
-
'twitter:card': string;
|
|
109
|
-
}>;
|
|
147
|
+
export declare function getMeta(page: Page, context: GetMetaContext, timeout?: number): Promise<Meta>;
|
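The new `getMeta(page, context, timeout?)` signature expects the caller to pass what it already knows about the response. A minimal sketch of assembling that context from a navigation response follows. `GetMetaContextSketch` mirrors the `GetMetaContext` declaration in the diff above; `ResponseLike` and `buildGetMetaContext` are hypothetical names. The sketch assumes a Puppeteer `HTTPResponse`, where `url`, `status`, and `headers` are methods, so the JSDoc example's `response.status` and `response.headers` would need call parentheses if `response` is that type.

```ts
// Mirrors the `GetMetaContext` declaration shown in the diff above.
type GetMetaContextSketch = {
  readonly url: string;
  readonly html?: string;
  readonly statusCode?: number;
  readonly headers?: Record<string, string | string[] | undefined>;
  readonly includeRaw?: boolean;
};

// Structural stand-in for Puppeteer's `HTTPResponse` (hypothetical name).
interface ResponseLike {
  url(): string;
  status(): number;
  headers(): Record<string, string>;
}

// Builds the context from a navigation response. `page.goto()` can resolve
// to `null`, in which case only the requested URL (and HTML, if given) is set.
export function buildGetMetaContext(
  response: ResponseLike | null,
  requestedUrl: string,
  html?: string,
): GetMetaContextSketch {
  if (response === null) {
    return { url: requestedUrl, html };
  }
  return {
    url: response.url(), // final URL after redirects
    html,
    statusCode: response.status(),
    headers: response.headers(),
  };
}
```

Leaving `html` undefined is fine under the declared type, since `getMeta` is documented to fall back to `page.content()` when it is omitted.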