npm - @d-zero/beholder - Versions diffs - 2.1.6 → 3.0.0 - Mend

@d-zero/beholder 2.1.6 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

package/CHANGELOG.md +38 -0
package/dist/dom-evaluation.d.ts +72 -24
package/dist/dom-evaluation.js +442 -84
package/dist/index.d.ts +1 -1
package/dist/meta/classify.d.ts +52 -0
package/dist/meta/classify.js +731 -0
package/dist/meta/id-extractors.d.ts +40 -0
package/dist/meta/id-extractors.js +196 -0
package/dist/meta/keys.d.ts +41 -0
package/dist/meta/keys.js +507 -0
package/dist/meta/parsers.d.ts +74 -0
package/dist/meta/parsers.js +293 -0
package/dist/meta/tag-detection.d.ts +59 -0
package/dist/meta/tag-detection.js +120 -0
package/dist/meta/types.d.ts +874 -0
package/dist/meta/types.js +12 -0
package/dist/scraper.js +15 -13
package/dist/types.d.ts +3 -38
package/package.json +5 -4
package/src/dom-evaluation.spec.ts +301 -73
package/src/dom-evaluation.ts +558 -88
package/src/index.ts +43 -0
package/src/meta/classify.spec.ts +281 -0
package/src/meta/classify.ts +810 -0
package/src/meta/id-extractors.spec.ts +69 -0
package/src/meta/id-extractors.ts +206 -0
package/src/meta/keys.ts +568 -0
package/src/meta/parsers.spec.ts +178 -0
package/src/meta/parsers.ts +304 -0
package/src/meta/simple-wappalyzer.d.ts +37 -0
package/src/meta/tag-detection.spec.ts +134 -0
package/src/meta/tag-detection.ts +161 -0
package/src/meta/types.ts +949 -0
package/src/scraper.ts +19 -13
package/src/types.ts +49 -55
package/tsconfig.tsbuildinfo +1 -1

package/CHANGELOG.md CHANGED

@@ -3,6 +3,44 @@
 All notable changes to this project will be documented in this file.
 See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
+# [3.0.0](https://github.com/d-zero-dev/tools/compare/@d-zero/beholder@2.1.6...@d-zero/beholder@3.0.0) (2026-06-16)
+### Bug Fixes
+- **beholder:** warn loudly and tripwire-test puppeteer Page.\_client() coverage ([97a07ea](https://github.com/d-zero-dev/tools/commit/97a07ea273e90d50bfede1d68f594ddee9c33268))
+- feat(beholder)!: expand meta extraction with frontmatter-keys schema and Wappalyzer tag detection ([6ee7861](https://github.com/d-zero-dev/tools/commit/6ee78617aac3fe3d5c022ccfd0df265de0c5310b))
+### Features
+- **beholder:** rewrite getAnchorList with single AX tree + parallel describeNode ([#876](https://github.com/d-zero-dev/tools/issues/876)) ([7e5b089](https://github.com/d-zero-dev/tools/commit/7e5b089695bd1e605d63c6faef2e8bf927bd861f))
+### BREAKING CHANGES
+- `Meta` is restructured from flat keys (`noindex`, `canonical`,
+  `'og:type'`, `'twitter:card'`, ...) into a nested shape backed by
+  `frontmatter-keys.md`. New required fields: `title`, `jsonLd`,
+  `speculationRules`, `originTrial`, `tags`, `others`. `getMeta(page)` now takes
+  a context object `getMeta(page, { url, html?, statusCode?, headers? }, timeout?)`.
+  Old top-level shortcuts (`canonical`, `alternate`, `noindex`, `nofollow`,
+  `noarchive`, `'og:*'`, `'twitter:card'`) are removed; values move to
+  `meta.link.canonical`, `meta.robots.*`, `meta.og.*`, `meta.twitter.*` etc.
+Changes:
+- New `src/meta/` module: `types.ts`, `keys.ts`, `parsers.ts`, `classify.ts`,
+  `id-extractors.ts`, `tag-detection.ts`, plus ambient `simple-wappalyzer.d.ts`
+- Browser-side `collectHead()` serializes every `<meta>`, `<link>`, structured-data
+  `<script>`, `<base>`, `<iframe>` plus a curated set of `window` globals into
+  `RawHeadEntry[]`; Node-side `classify()` maps these to typed Meta fields
+- `simple-wappalyzer` (MIT) added as a dependency for technology detection;
+  detected providers run through `id-extractors.ts` for real ID extraction
+  (GA4, GTM, UA, FB Pixel, Hotjar, Clarity, ...)
+- Unknown markup is preserved under `Meta.others` (meta/property/httpEquiv/
+  itemprop/link/script/iframe buckets) so nothing is silently dropped
+- Tests: parsers/classify/id-extractors/tag-detection units + getMeta
+  error/timeout fallback
 ## [2.1.6](https://github.com/d-zero-dev/tools/compare/@d-zero/beholder@2.1.5...@d-zero/beholder@2.1.6) (2026-06-15)
 ### Bug Fixes
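
For callers migrating across the 3.0.0 breaking change, the key moves listed in the changelog entry above can be sketched as a small adapter. This is a minimal illustration, not code from the package: `LegacyMetaShortcuts`, `NestedMetaSubset`, and `toLegacyShortcuts` are hypothetical names, and the leaf field names and types under `link`, `robots`, `og`, and `twitter` are assumptions inferred from the changelog wording. The authoritative shape lives in `src/meta/types.ts`.

```typescript
// Hypothetical subset of the 2.x flat shape (only keys named in the changelog).
type LegacyMetaShortcuts = {
  canonical?: string;
  noindex?: boolean;
  nofollow?: boolean;
  'og:type'?: string;
  'twitter:card'?: string;
};

// Assumed subset of the 3.0.0 nested shape; leaf names and types are guesses.
type NestedMetaSubset = {
  link?: { canonical?: string };
  robots?: { noindex?: boolean; nofollow?: boolean };
  og?: { type?: string };
  twitter?: { card?: string };
};

// Reads the nested locations and exposes them under the removed flat keys,
// so existing consumers can be moved over one call site at a time.
function toLegacyShortcuts(meta: NestedMetaSubset): LegacyMetaShortcuts {
  return {
    canonical: meta.link?.canonical,
    noindex: meta.robots?.noindex,
    nofollow: meta.robots?.nofollow,
    'og:type': meta.og?.type,
    'twitter:card': meta.twitter?.card,
  };
}

const legacy = toLegacyShortcuts({
  link: { canonical: 'https://example.com/' },
  robots: { noindex: true },
});
```

Keys that have no counterpart in the input simply come back as `undefined`, which mirrors how an absent tag would have looked through the old optional shortcuts.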

package/dist/dom-evaluation.d.ts CHANGED

@@ -19,8 +19,12 @@ import type { ElementHandle, Page } from 'puppeteer';
  * Default timeout (ms) applied to DOM evaluation operations when the caller does not
  * specify one. Bounds how long a single `page.evaluate` / property read may hang on a
  * page whose main thread is unresponsive.
+ *
+ * WHY 180s: Aligned with the upstream `Scraper#fetchData` retryable timeout (3 min) so
+ * a single phase does not exceed the retry budget while still tolerating large pages
+ * (e.g., 1000+ anchors) and slow main threads.
  */
-export declare const DEFAULT_DOM_EVALUATION_TIMEOUT = 30000;
+export declare const DEFAULT_DOM_EVALUATION_TIMEOUT = 180000;
 /**
  * Parameters for {@link getProp}.
  * @template T - The expected type of the property value.
@@ -65,35 +69,79 @@ export declare function getImageList(page: Page, viewportWidth: number, timeout?
  * the accessible name (from the accessibility tree, falling back to `textContent`),
  * and filters out non-HTTP links.
  *
- * WHY this keeps per-element CDP calls (unlike {@link getMeta} / {@link getImageList}):
- * the accessible name comes from Chrome's computed accessibility tree
- * (`page.accessibility.snapshot`), which is a CDP-only feature unavailable to in-page
- * DOM APIs. Each {@link getProp} read is still bounded by `timeout`.
+ * WHY Strategy F (single AX-tree fetch + parallel `DOM.describeNode`): the old
+ * implementation called `page.accessibility.snapshot({ root })` per anchor, which
+ * triggers a CDP round-trip *and* a Chrome-side AX subtree computation (~42ms
+ * each). On a page with 1181 anchors that compounded to ~53s. By fetching the
+ * full AX tree once and using `DOM.describeNode` in parallel to map element
+ * handles back to AX nodes by `backendDOMNodeId`, the same data is collected in
+ * ~150ms on the same page — a ~350× speed-up while preserving the original
+ * accessible-name semantics. See issue #876 for measurements.
+ *
+ * WHY the whole operation is wrapped in `raceWithTimeout`: even with bounded
+ * per-CDP-call timeouts, a degenerate page (blocked main thread, thousands of
+ * anchors, runaway describeNode latency) could chain enough sub-timeouts to
+ * exceed the caller's `timeout` budget. The outer race guarantees the function
+ * returns within `timeout`, surfacing whatever anchors were collected so far so
+ * the upstream scrape phase can continue rather than tripping a retryable retry.
  * @param page - The Puppeteer page to extract anchors from.
  * @param options - Optional URL parsing options (e.g., `disableQueries`).
- * @param timeout - Timeout in ms per property read. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
+ * @param timeout - Total time budget in ms for the whole extraction. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
  * @returns An array of {@link AnchorData} objects for all HTTP(S) links found on the page.
  */
 export declare function getAnchorList(page: Page, options?: ParseURLOptions, timeout?: number): Promise<AnchorData[]>;
 /**
- * Extracts comprehensive meta information from the page's `<head>`.
+ * Required context for {@link getMeta}. Provided by the scraper from data it
+ * already has on hand (URL it navigated to, response status/headers it received).
+ *
+ * `html` is optional: when omitted, `getMeta` falls back to `page.content()`
+ * to obtain the rendered HTML for the third-party tag detection pass.
+ */
+export type GetMetaContext = {
+    /** The fully resolved URL of the page (after redirects). */
+    readonly url: string;
+    /** Rendered HTML. Falls back to `page.content()` when omitted. */
+    readonly html?: string;
+    /** Response status code, surfaced to the Wappalyzer driver. */
+    readonly statusCode?: number;
+    /** Response headers; case is preserved by the caller, lowercased internally. */
+    readonly headers?: Record<string, string | string[] | undefined>;
+    /**
+     * When `true`, the returned `Meta` includes `_raw: RawHeadEntry[]` for
+     * debugging. Default `false` to keep the serialized payload small.
+     */
+    readonly includeRaw?: boolean;
+};
+/**
+ * Extracts comprehensive metadata from the page.
  *
- * Collects all metadata in a single `page.evaluate` call (14 CDP round-trips
- * collapsed into 1) wrapped in {@link raceWithTimeout}. On timeout (an unresponsive
- * page) a minimal `{ title: '' }` is returned rather than hanging.
+ * Two passes happen in parallel:
+ * 1. Browser-side `collectHead()` serializes every `<meta>`, `<link>`,
+ *    relevant `<script>`, `<base>`, `<noscript>`/`<iframe>` and a curated
+ *    set of `window` globals into a `RawHeadEntry[]`. Node-side `classify()`
+ *    then maps those entries to typed `Meta` fields using the lookup tables
+ *    in `./meta/keys.ts`, with unknown entries preserved in `Meta.others`.
+ * 2. `detectTags()` runs `simple-wappalyzer` over the page HTML to produce
+ *    `Meta.tags` (technology detection + real-ID extraction).
  *
- * Collected metadata:
- * - `title` - The document title.
- * - `lang` - The `lang` attribute of the `<html>` element.
- * - `description` - The `<meta name="description">` content.
- * - `keywords` - The `<meta name="keywords">` content.
- * - `noindex` / `nofollow` / `noarchive` - Parsed from the `<meta name="robots">` directives.
- * - `canonical` - The `<link rel="canonical">` content.
- * - `alternate` - The `<link rel="alternate">` content.
- * - Open Graph tags: `og:type`, `og:title`, `og:site_name`, `og:description`, `og:url`, `og:image`.
- * - `twitter:card` - The Twitter Card type.
- * @param page - The Puppeteer page to extract meta information from.
- * @param timeout - Timeout in ms for the evaluation. Defaults to {@link DEFAULT_DOM_EVALUATION_TIMEOUT}.
- * @returns An object containing all extracted meta properties.
+ * The whole call is wrapped in `raceWithTimeout`. On timeout an empty `Meta`
+ * (with `title: ''` and empty required arrays/objects) is returned.
+ * @param page
+ * @param context
+ * @param timeout
+ * @example
+ * ```ts
+ * const meta = await getMeta(page, {
+ *   url: 'https://example.com/',
+ *   html: await page.content(),
+ *   statusCode: response.status,
+ *   headers: response.headers,
+ * });
+ * console.log(meta.title);                         // <title> text
+ * console.log(meta.og?.image);                     // og:image[] array
+ * console.log(meta.robots?.noindex);               // parsed robots
+ * console.log(meta.tags.detected.Analytics);       // Wappalyzer hits
+ * console.log(meta.tags.entries.find(e => e.provider === 'Google Analytics')?.id);
+ * ```
  */
-export declare function getMeta(page: Page, timeout?: number): Promise<Meta>;
+export declare function getMeta(page: Page, context: GetMetaContext, timeout?: number): Promise<Meta>;
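
The "Strategy F" note on `getAnchorList` in the hunk above describes fetching the AX tree once and joining it to anchors by `backendDOMNodeId`. The join step can be sketched with plain data as follows. This is an illustration under stated assumptions, not the package's implementation: `AxNode`, `DescribedAnchor`, `indexAxNames`, and `nameAnchors` are hypothetical names, the input arrays stand in for what a full AX tree fetch and parallel `DOM.describeNode` calls would return, and the `textContent` fallback follows the doc comment's description.

```typescript
// Assumed minimal shapes; real CDP payloads carry many more fields.
type AxNode = { backendDOMNodeId?: number; name?: { value?: string } };
type DescribedAnchor = { backendNodeId: number; href: string; textContent: string };
type NamedAnchor = { href: string; name: string };

// Build the lookup once: O(n) over the AX tree instead of one snapshot per anchor.
function indexAxNames(axNodes: readonly AxNode[]): Map<number, string> {
  const names = new Map<number, string>();
  for (const node of axNodes) {
    const name = node.name?.value;
    if (node.backendDOMNodeId !== undefined && name) {
      names.set(node.backendDOMNodeId, name);
    }
  }
  return names;
}

// Join each anchor to its AX node by backend node id, falling back to
// trimmed textContent when the AX tree has no non-empty name for it.
function nameAnchors(
  axNodes: readonly AxNode[],
  anchors: readonly DescribedAnchor[],
): NamedAnchor[] {
  const names = indexAxNames(axNodes);
  return anchors.map((anchor) => ({
    href: anchor.href,
    name: names.get(anchor.backendNodeId) ?? anchor.textContent.trim(),
  }));
}

const named = nameAnchors(
  [
    { backendDOMNodeId: 11, name: { value: 'Home page' } },
    { backendDOMNodeId: 12, name: { value: '' } },
  ],
  [
    { backendNodeId: 11, href: 'https://example.com/', textContent: 'Home' },
    { backendNodeId: 12, href: 'https://example.com/about', textContent: ' About ' },
  ],
);
```

Because the map is built once, each anchor costs a single lookup, which is the property the doc comment credits for the drop from per-anchor snapshots to one tree fetch.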