npm - page-analyzer - Versions diffs - 1.0.1 → 1.2.0 - Mend

page-analyzer 1.0.1 → 1.2.0

Files changed (13)

package/LICENSE +21 -0
package/README.md +72 -9
package/index.js +206 -22
package/llm/analyzers/event-analyzer/event-analyzer-blocks.js +23 -2
package/llm/analyzers/event-analyzer/event-analyzer-constants.js +1 -1
package/llm/analyzers/event-analyzer/event-analyzer.js +1 -1
package/package.json +6 -3
package/page-extractor.js +562 -36
package/result-viewer.html +1064 -0
package/scripts/analyze.js +51 -0
package/scripts/build-result-viewer.js +1076 -0
package/scripts/serve-result-viewer.js +68 -0
package/test/smoke.test.js +454 -0

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 page-analyzer contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md CHANGED

@@ -71,21 +71,21 @@ LLM_API_ENDPOINT=https://api.openai.com/v1/chat/completions
 LLM_MODEL=gpt-4o-mini
 ```
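-## Running the example script
+## Running tests and examples
-The repository provides `test.js` as a local debugging entry point. It reads `.env` from the project root, analyzes the specified URL, and writes the result to `result.json`.
+Local tests do not call real web pages or LLM APIs: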
 ```bash
 npm test
 ```
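-You can also specify a URL:
+To analyze a real page manually, run the example script. It reads `.env` from the project root, analyzes the specified URL, and writes the result to `result.json`.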
 ```bash
-node test.js https://example.com
+npm run analyze -- https://example.com
 ```
-Note: `test.js` depends on the following environment variables:
+Note: `npm run analyze` depends on the following environment variables:
 - `LLM_API_KEY`
 - `LLM_API_ENDPOINT`
@@ -114,6 +114,9 @@ const result = await analyzeUrl('https://example.com', {
   },
   showEvents: true,
   showBlockIdx: true,
+  fullPageScreenshot: true,
+  blockScreenshots: true,
+  waitForImagesLoaded: true,
   knownEventTypes: ['click_link', 'submit_form'],
   extractorConfig: {
     viewportWidth: 1440,
@@ -145,6 +148,10 @@ const result = await analyzeUrl('https://example.com', {
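 | `options.extractorConfig` | `object` | No | Playwright page extraction configuration |
 | `options.showEvents` | `boolean` | No | Whether to return the full event arrays and element details |
 | `options.showBlockIdx` | `boolean` | No | Whether to return CSV and block-index related fields |
+| `options.fullPageScreenshot` | `boolean` | No | Whether to save a full-page screenshot to `snapshots/` in the current working directory and return the file path |
+| `options.blockScreenshots` | `boolean` | No | Whether to save a screenshot of each logical block to `snapshots/` in the current working directory, after the LLM merges blocks, and return the file paths |
+| `options.waitForImagesLoaded` | `boolean` | No | Whether to wait for page images to finish loading before block extraction, analysis, and screenshots; defaults to `false` |
+| `options.extractorConfig.s3` | `object` | No | S3 upload configuration for screenshots. When set, screenshots are uploaded to S3 and HTTPS URLs are returned; when unset, they are still saved to the local `snapshots/` |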
 ### analyzePageEvents(input)
@@ -239,6 +246,30 @@ const result = await analyzePageEvents({
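 With `showBlockIdx: true`, block results additionally include fields such as `blockIdxs`, `blockSemanticGroups`, and `rowCount`, and `csvContent` is returned.
+With `fullPageScreenshot: true`, the result includes `screenshots.fullPage`, whose value is the file path of the full-page screenshot.
+With `blockScreenshots: true`, the module takes screenshots only after the LLM has merged blocks. The result includes `screenshots.blocks`, where each item contains the logical block index `blockIdx` and the corresponding screenshot `path`; each block in the block analysis result also carries `blockScreenshotPaths`, with at most one screenshot per logical block. Hidden or empty blocks that cannot be captured via `blockCssPath` are skipped.
+If `extractorConfig.s3` is configured, screenshots are not written to the local `snapshots/` but uploaded directly to S3; `screenshots.fullPage`, `screenshots.blocks[].path`, and `blockScreenshotPaths` return HTTPS URLs. Uploads do not set an ACL, so access permissions follow the bucket policy. A failed upload of a single screenshot is retried 3 times; if it still fails, that screenshot is skipped.
+With `waitForImagesLoaded: true`, the module first scrolls the page to trigger lazy loading, then waits for every `<img>` currently in the DOM to load or fail, and only then extracts blocks, analyzes, and takes screenshots; the wait time is bounded by `extractorConfig.timeoutMs`.
+Example of the additional output when the screenshot options are enabled: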
+```js
+{
+  screenshots: {
+    fullPage: '/path/to/page-analyzer/snapshots/example-com-20260507-095500-full-page.png',
+    blocks: [
+      {
+        blockIdx: 0,
+        path: '/path/to/page-analyzer/snapshots/example-com-20260507-095500-block-000.png'
+      }
+    ]
+  }
+}
+```
 ## Configuration
 ### extractorConfig
@@ -255,6 +286,37 @@ const result = await analyzePageEvents({
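 | `blockMaxHeightRatio` | `1.5` | Maximum block height as a ratio of viewport height |
 | `blockMaxDepth` | `15` | Maximum DOM depth for block extraction |
 | `textPreviewMaxChars` | `1200` | Maximum length of the block text preview |
+| `waitForImagesLoaded` | `false` | Whether to wait for page images to finish loading before block extraction, analysis, and screenshots |
+| `s3` | none | S3 upload configuration for screenshots. When set, screenshots are uploaded directly to S3; when unset, they are saved locally |
+S3 screenshot upload example: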
+```js
+const result = await analyzeUrl('https://example.com', {
+  fullPageScreenshot: true,
+  blockScreenshots: true,
+  llm: {
+    apiKey: process.env.LLM_API_KEY,
+    apiEndpoint: process.env.LLM_API_ENDPOINT,
+    model: process.env.LLM_MODEL
+  },
+  extractorConfig: {
+    s3: {
+      bucket: 'my-bucket',
+      region: 'ap-northeast-1',
+      prefix: 'page-analyzer/snapshots',
+      publicBaseUrl: 'https://cdn.example.com/page-analyzer/snapshots',
+      credentials: {
+        accessKeyId: process.env.AWS_ACCESS_KEY_ID,
+        secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
+        sessionToken: process.env.AWS_SESSION_TOKEN
+      }
+    }
+  }
+});
+```
+`extractorConfig.s3.bucket` and `extractorConfig.s3.region` are required. `credentials` is optional; when omitted, the AWS SDK default credential chain is used. `publicBaseUrl` is optional; when omitted, the returned URL is `https://<bucket>.s3.<region>.amazonaws.com/<key>`.
 ### parserConfig
@@ -320,7 +382,7 @@ data.choices[0].message.content
 npm install
 ```
-Run the test script:
+Run the local tests:
 ```bash
 npm test
@@ -355,12 +417,13 @@ page-analyzer/
   models/                          # context data models
   utils/                           # text, URL, and selector utilities
   vendor/                          # in-browser block extraction scripts
-  test.js                          # local debugging script
+  scripts/analyze.js               # manual real-page analysis script
+  test/smoke.test.js               # local smoke test
 ```
 ## FAQ
-### npm test reports missing LLM configuration
+### npm run analyze reports missing LLM configuration
 Make sure `.env` exists in the project root and contains:
@@ -397,4 +460,4 @@ extractorConfig: {
 ## License
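-The repository does not currently declare a license. Before publishing to npm, add an explicit open-source license such as MIT or Apache-2.0, or a proprietary license notice.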
+MIT License. See [LICENSE](./LICENSE).
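
Note on the README change above: it states that when `publicBaseUrl` is omitted, screenshot URLs take the form `https://<bucket>.s3.<region>.amazonaws.com/<key>`. The sketch below shows one way such a URL could be assembled. `defaultS3Url` is a hypothetical helper, not a function from the package, and the per-segment encoding of the key is an assumption.

```js
// Hypothetical helper, not taken from the package source.
// Builds the virtual-hosted-style URL described in the README.
function defaultS3Url({ bucket, region }, key) {
  if (!bucket || !region) {
    // The README marks both fields as required.
    throw new Error('bucket and region are required');
  }
  // Assumption: encode each path segment but keep the "/" separators.
  const encodedKey = String(key).split('/').map(encodeURIComponent).join('/');
  return `https://${bucket}.s3.${region}.amazonaws.com/${encodedKey}`;
}

const url = defaultS3Url(
  { bucket: 'my-bucket', region: 'ap-northeast-1' },
  'page-analyzer/snapshots/example-com-full-page.png'
);
console.log(url);
```

How the package combines `prefix` and `publicBaseUrl` is not visible in this diff, so the sketch covers only the default case.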

package/index.js CHANGED

@@ -38,6 +38,131 @@ function normalizeDisplayOptions(options = {}) {
   };
 }
+function parseBlockIdxs(value) {
+  if (Array.isArray(value)) {
+    return value
+      .map((item) => Number.parseInt(String(item), 10))
+      .filter(Number.isInteger);
+  }
+  if (Number.isInteger(value)) {
+    return [value];
+  }
+  return String(value || '')
+    .split(/[.,\s]+/)
+    .map((item) => Number.parseInt(item, 10))
+    .filter(Number.isInteger);
+}
+function buildBlockScreenshotMap(screenshots) {
+  const map = new Map();
+  for (const item of Array.isArray(screenshots?.blocks) ? screenshots.blocks : []) {
+    const blockIdx = Number.isInteger(item?.blockIdx)
+      ? item.blockIdx
+      : Number.parseInt(String(item?.blockIdx), 10);
+    const screenshotPath = typeof item?.path === 'string' ? item.path : '';
+    if (Number.isInteger(blockIdx) && screenshotPath) {
+      map.set(blockIdx, screenshotPath);
+    }
+  }
+  return map;
+}
+function attachBlockScreenshotPaths(analysis, screenshots) {
+  const screenshotByBlockIdx = buildBlockScreenshotMap(screenshots);
+  if (screenshotByBlockIdx.size === 0 || !isObject(analysis?.block_analysis)) {
+    return analysis;
+  }
+  const sourceBlocks = analysis.block_analysis.blocks;
+  if (!Array.isArray(sourceBlocks)) {
+    return analysis;
+  }
+  const blocks = sourceBlocks.map((block) => {
+    const blockIdxs = parseBlockIdxs(block?.blockIdxs ?? block?.blockIdx);
+    const blockScreenshotPaths = blockIdxs
+      .map((blockIdx) => screenshotByBlockIdx.get(blockIdx))
+      .filter(Boolean);
+    if (blockScreenshotPaths.length === 0) {
+      return block;
+    }
+    return {
+      ...block,
+      blockScreenshotPaths
+    };
+  });
+  return {
+    ...analysis,
+    block_analysis: {
+      ...analysis.block_analysis,
+      blocks
+    }
+  };
+}
+function hasScreenshots(screenshots) {
+  return Boolean(
+    screenshots?.fullPage ||
+    (Array.isArray(screenshots?.blocks) && screenshots.blocks.length > 0)
+  );
+}
+function mergeScreenshots(primary, secondary) {
+  const merged = {};
+  if (primary?.fullPage) {
+    merged.fullPage = primary.fullPage;
+  }
+  if (secondary?.fullPage) {
+    merged.fullPage = secondary.fullPage;
+  }
+  const primaryBlocks = Array.isArray(primary?.blocks) ? primary.blocks : [];
+  const secondaryBlocks = Array.isArray(secondary?.blocks) ? secondary.blocks : [];
+  const blocks = secondaryBlocks.length > 0 ? secondaryBlocks : primaryBlocks;
+  if (blocks.length > 0) {
+    merged.blocks = blocks;
+  }
+  return hasScreenshots(merged) ? merged : null;
+}
+function attachLogicalBlockScreenshotPaths(result, screenshots) {
+  const blocks = result?.analysis?.block_analysis?.blocks;
+  if (!Array.isArray(blocks) || blocks.length === 0) {
+    return result;
+  }
+  const screenshotByLogicalIndex = buildBlockScreenshotMap(screenshots);
+  if (screenshotByLogicalIndex.size === 0) {
+    return result;
+  }
+  return {
+    ...result,
+    analysis: {
+      ...result.analysis,
+      block_analysis: {
+        ...result.analysis.block_analysis,
+        blocks: blocks.map((block, index) => {
+          const screenshotPath = screenshotByLogicalIndex.get(index);
+          if (!screenshotPath) {
+            return block;
+          }
+          return {
+            ...block,
+            blockScreenshotPaths: [screenshotPath]
+          };
+        })
+      }
+    }
+  };
+}
 function compactBlockAnalysisBlock(block, displayOptions) {
   const source = isObject(block) ? block : {};
   const out = {
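
Note on `parseBlockIdxs` in the hunk above: it accepts an array, a single integer, or a delimited string, and always returns an array of integers. The sketch below copies the function from the diff and runs it on a few made-up inputs; the sample values are illustrative only.

```js
// Copied from the added lines in package/index.js.
function parseBlockIdxs(value) {
  if (Array.isArray(value)) {
    return value
      .map((item) => Number.parseInt(String(item), 10))
      .filter(Number.isInteger);
  }
  if (Number.isInteger(value)) {
    return [value];
  }
  return String(value || '')
    .split(/[.,\s]+/)
    .map((item) => Number.parseInt(item, 10))
    .filter(Number.isInteger);
}

const fromArray = parseBlockIdxs(['1', 2, 'x']); // [1, 2]: non-numeric items are dropped
const fromInteger = parseBlockIdxs(3);           // [3]
const fromString = parseBlockIdxs('0, 4 7');     // [0, 4, 7]
const fromDotted = parseBlockIdxs('1.5');        // [1, 5]: "." is treated as a separator
const fromMissing = parseBlockIdxs(undefined);   // []

console.log(fromArray, fromInteger, fromString, fromDotted, fromMissing);
```

Because the split pattern includes `.`, a decimal-looking string yields two indexes rather than one rounded value.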
@@ -63,6 +188,10 @@ function compactBlockAnalysisBlock(block, displayOptions) {
     out.mode = source.mode;
   }
+  if (Array.isArray(source.blockScreenshotPaths) && source.blockScreenshotPaths.length > 0) {
+    out.blockScreenshotPaths = source.blockScreenshotPaths;
+  }
   return out;
 }
@@ -121,14 +250,20 @@ function buildPageAnalysisResult({
   csvContent,
   pageData,
   analysis,
-  displayOptions
+  displayOptions,
+  screenshots
 }) {
+  const analysisWithScreenshots = attachBlockScreenshotPaths(analysis, screenshots);
   const result = {
     title: pageData.title,
     parseMetrics: pageData.metrics,
-    analysis: buildAnalysisResult(analysis, displayOptions)
+    analysis: buildAnalysisResult(analysisWithScreenshots, displayOptions)
   };
+  if (hasScreenshots(screenshots)) {
+    result.screenshots = screenshots;
+  }
   if (displayOptions.showEvents) {
     result.elements = elements;
     result.csvContent = csvContent;
@@ -154,38 +289,84 @@ function buildPageAnalysisResult({
  * @param {boolean} [options.showEvents=false] - Include event arrays and full event-related metadata.
  *   Also enables node-level event classification.
  * @param {boolean} [options.showBlockIdx=false] - Include CSV/block index alignment fields.
+ * @param {boolean} [options.fullPageScreenshot=false] - Save a full-page screenshot to snapshots/ and return its path.
+ * @param {boolean} [options.blockScreenshots=false] - Save one screenshot per merged logical block to snapshots/ and return their paths.
+ * @param {boolean} [options.waitForImagesLoaded=false] - Wait for page images before extracting and screenshotting.
+ * @param {Object} [options.extractorConfig.s3] - Optional S3 config for uploading screenshots instead of saving locally.
  * @returns {Promise<Object>} Analysis result. Event and idx fields are omitted unless requested.
  */
 export async function analyzeUrl(url, options = {}) {
-  const { llm: llmConfig, knownEventTypes, parserConfig, extractorConfig, showEvents, showBlockIdx } = options;
+  const {
+    llm: llmConfig,
+    knownEventTypes,
+    parserConfig,
+    extractorConfig,
+    showEvents,
+    showBlockIdx,
+    fullPageScreenshot,
+    blockScreenshots,
+    waitForImagesLoaded
+  } = options;
   if (!url) throw new Error('url is required');
   if (!llmConfig?.apiKey || !llmConfig?.apiEndpoint || !llmConfig?.model) {
     throw new Error('options.llm.apiKey, apiEndpoint, and model are required');
   }
+  const shouldCaptureFullPage = fullPageScreenshot ?? extractorConfig?.fullPageScreenshot;
+  const shouldCaptureBlocks = blockScreenshots ?? extractorConfig?.blockScreenshots;
   // Step 0: Playwright extraction
   console.log(`[page-analyzer] Extracting ${url} ...`);
-  const extractor = new PageExtractor(extractorConfig);
-  const bundle = await extractor.extract(url);
-  console.log(`[page-analyzer] Extracted: ${bundle.blocks.length} blocks, ${bundle.elementGeometries.length} geometries`);
+  const extractor = new PageExtractor({
+    ...extractorConfig,
+    fullPageScreenshot: shouldCaptureFullPage,
+    blockScreenshots: false,
+    waitForImagesLoaded: waitForImagesLoaded ?? extractorConfig?.waitForImagesLoaded
+  });
-  // Derive domain from URL
-  let domain = '';
-  try { domain = new URL(url).hostname.replace(/^www\./, ''); } catch { /* ignore */ }
+  return await extractor.withPreparedPage(url, async (page, targetUrl) => {
+    const bundle = await extractor.extractPreparedPage(page, targetUrl);
+    console.log(`[page-analyzer] Extracted: ${bundle.blocks.length} blocks, ${bundle.elementGeometries.length} geometries`);
+    // Derive domain from URL
+    let domain = '';
+    try { domain = new URL(targetUrl).hostname.replace(/^www\./, ''); } catch { /* ignore */ }
+    let result = await analyzePageEvents({
+      html: bundle.html,
+      url: targetUrl,
+      blocks: bundle.blocks,
+      elementGeometries: bundle.elementGeometries,
+      llm: llmConfig,
+      knownEventTypes,
+      parserConfig,
+      showEvents,
+      showBlockIdx,
+      screenshots: bundle.screenshots,
+      domain,
+      nodeId: `${domain}-root`
+    });
+    if (shouldCaptureBlocks) {
+      const logicalBlocks = Array.isArray(result?.analysis?.block_analysis?.blocks)
+        ? result.analysis.block_analysis.blocks
+        : [];
+      const blockScreenshotsBundle = await extractor.captureScreenshots(page, targetUrl, logicalBlocks, {
+        fullPageScreenshot: false,
+        blockScreenshots: true
+      });
+      const screenshots = mergeScreenshots(result.screenshots, blockScreenshotsBundle);
+      result = attachLogicalBlockScreenshotPaths(
+        {
+          ...result,
+          ...(screenshots ? { screenshots } : {})
+        },
+        screenshots
+      );
+    }
-  return analyzePageEvents({
-    html: bundle.html,
-    url,
-    blocks: bundle.blocks,
-    elementGeometries: bundle.elementGeometries,
-    llm: llmConfig,
-    knownEventTypes,
-    parserConfig,
-    showEvents,
-    showBlockIdx,
-    domain,
-    nodeId: `${domain}-root`
+    return result;
   });
 }
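
Note on the option handling in `analyzeUrl` above: the screenshot flags are resolved with `??`, so a top-level option takes precedence over the same key in `extractorConfig` unless it is `null` or `undefined`. The sketch below isolates that expression. `resolveFlag` is a hypothetical helper introduced here for illustration, not part of the package.

```js
// Hypothetical helper mirroring the `option ?? extractorConfig?.option`
// expressions in analyzeUrl.
function resolveFlag(topLevelValue, extractorConfig, name) {
  return topLevelValue ?? extractorConfig?.[name];
}

// Top-level option absent: fall back to extractorConfig.
const inherited = resolveFlag(undefined, { fullPageScreenshot: true }, 'fullPageScreenshot');

// Explicit top-level `false` wins, because `??` only skips null and undefined.
const overridden = resolveFlag(false, { fullPageScreenshot: true }, 'fullPageScreenshot');

// Neither place sets the flag: the result stays undefined (falsy).
const unset = resolveFlag(undefined, undefined, 'fullPageScreenshot');

console.log(inherited, overridden, unset);
```

With `||` in place of `??`, an explicit `false` at the top level would be overridden by `extractorConfig`.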
@@ -213,6 +394,7 @@ export async function analyzeUrl(url, options = {}) {
  * @param {boolean} [input.showEvents=false] - Include event arrays and full event-related metadata.
  *   Also enables node-level event classification.
  * @param {boolean} [input.showBlockIdx=false] - Include CSV/block index alignment fields.
+ * @param {Object} [input.screenshots] - Screenshot paths captured during extraction.
  * @param {string} [input.nodeId] - Node ID for logging context
  * @param {string} [input.domain] - Domain for logging context
  * @returns {Promise<Object>} Analysis result. Event and idx fields are omitted unless requested.
@@ -229,6 +411,7 @@ export async function analyzePageEvents(input) {
     parserConfig = {},
     showEvents = false,
     showBlockIdx = false,
+    screenshots = null,
     nodeId = '',
     domain = ''
   } = input;
@@ -289,7 +472,8 @@ export async function analyzePageEvents(input) {
     csvContent,
     pageData,
     analysis,
-    displayOptions
+    displayOptions,
+    screenshots
   });
 }
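
Note on `mergeScreenshots` in the index.js diff above: `analyzeUrl` uses it to combine the full-page screenshot from extraction with the block screenshots captured after LLM analysis. The sketch below copies `hasScreenshots` and `mergeScreenshots` from the diff and applies them to made-up paths to show which side takes precedence.

```js
// Copied from the added lines in package/index.js.
function hasScreenshots(screenshots) {
  return Boolean(
    screenshots?.fullPage ||
    (Array.isArray(screenshots?.blocks) && screenshots.blocks.length > 0)
  );
}

function mergeScreenshots(primary, secondary) {
  const merged = {};
  if (primary?.fullPage) {
    merged.fullPage = primary.fullPage;
  }
  if (secondary?.fullPage) {
    merged.fullPage = secondary.fullPage;
  }
  const primaryBlocks = Array.isArray(primary?.blocks) ? primary.blocks : [];
  const secondaryBlocks = Array.isArray(secondary?.blocks) ? secondary.blocks : [];
  const blocks = secondaryBlocks.length > 0 ? secondaryBlocks : primaryBlocks;
  if (blocks.length > 0) {
    merged.blocks = blocks;
  }
  return hasScreenshots(merged) ? merged : null;
}

// Illustrative inputs: the paths are placeholders.
const fromExtraction = { fullPage: 'snapshots/full-page.png' };
const fromBlockPass = { blocks: [{ blockIdx: 0, path: 'snapshots/block-000.png' }] };

const merged = mergeScreenshots(fromExtraction, fromBlockPass);
const nothing = mergeScreenshots(null, {});

console.log(merged);  // fullPage from the first argument, blocks from the second
console.log(nothing); // null
```

The second argument wins for both `fullPage` and `blocks` when it has values. The block lists are not concatenated.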

package/llm/analyzers/event-analyzer/event-analyzer-blocks.js CHANGED

@@ -115,13 +115,34 @@ function buildLogicalBlockPosition(sourceBlocks = []) {
 }
 function resolveLogicalBlockCssPath(sourceBlocks = []) {
+  const paths = [];
   for (const block of Array.isArray(sourceBlocks) ? sourceBlocks : []) {
     const path = cleanText(block?.blockCssPath || block?.cssPath || '', 500);
     if (path) {
-      return path;
+      paths.push(path);
     }
   }
-  return '';
+  if (paths.length === 0) {
+    return '';
+  }
+  if (paths.length === 1) {
+    return paths[0];
+  }
+  const partsList = paths.map((path) => path.split('>').map((part) => part.trim()).filter(Boolean));
+  const commonParts = [];
+  const firstParts = partsList[0];
+  for (let index = 0; index < firstParts.length; index += 1) {
+    const part = firstParts[index];
+    if (partsList.every((parts) => parts[index] === part)) {
+      commonParts.push(part);
+      continue;
+    }
+    break;
+  }
+  return commonParts.length > 1 ? commonParts.join(' > ') : paths[0];
 }
 function normalizePossibleEvents(responseHelper, value) {
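
Note on the `resolveLogicalBlockCssPath` change above: instead of returning the first non-empty path, it now returns the longest common ancestor path of all source blocks and falls back to the first path when the shared prefix is a single segment or shorter. The sketch below is a simplified version that takes plain strings. `commonCssPath` is a hypothetical name, and the package's `cleanText` step is omitted.

```js
// Simplified sketch of the common-prefix logic; operates on raw path strings.
function commonCssPath(paths) {
  if (paths.length === 0) return '';
  if (paths.length === 1) return paths[0];

  // Split each "a > b > c" path into trimmed segments.
  const partsList = paths.map((path) =>
    path.split('>').map((part) => part.trim()).filter(Boolean)
  );

  const commonParts = [];
  const firstParts = partsList[0];
  for (let index = 0; index < firstParts.length; index += 1) {
    const part = firstParts[index];
    if (!partsList.every((parts) => parts[index] === part)) break;
    commonParts.push(part);
  }

  // A shared prefix of one segment (for example just "body") is too broad,
  // so the first path is used instead.
  return commonParts.length > 1 ? commonParts.join(' > ') : paths[0];
}

const shared = commonCssPath([
  'body > main > section.hero > div',
  'body > main > section.features'
]);
const fallback = commonCssPath(['body > div.a', 'body > div.b']);

console.log(shared);   // "body > main"
console.log(fallback); // "body > div.a"
```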

package/llm/analyzers/event-analyzer/event-analyzer-constants.js CHANGED

@@ -1,7 +1,7 @@
 const DEFAULT_ATTRIBUTE_KEYS = [
   'text',
   'page_area',
-  'content_category(producdt/support/company/legal)',
+  'content_category(product/support/company/legal)',
   'is_external'
 ];

package/llm/analyzers/event-analyzer/event-analyzer.js CHANGED

@@ -307,7 +307,7 @@ class EventAnalyzer {
   }
   async analyzeEvents(csvData, _mdData, knownEventTypes = [], options = {}) {
-    const analyzeNodeEvents = !options?.analyzeNodeEvents;
+    const analyzeNodeEvents = options?.analyzeNodeEvents === true;
     const configuredKnownEventTypes = this.response.normalizeStringList(
       this.config?.knownEventTypes,
       { eventType: true }
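
Note on the one-line change above: the old expression negated the option, so node-level analysis was on when `analyzeNodeEvents` was omitted and off when it was `true`. The new expression enables it only for a strict `true`. The sketch below compares the two expressions side by side. `oldFlag` and `newFlag` are names introduced here for illustration.

```js
// The two expressions from the diff, wrapped in functions for comparison.
const oldFlag = (options) => !options?.analyzeNodeEvents;
const newFlag = (options) => options?.analyzeNodeEvents === true;

const cases = [
  {},                            // option omitted
  { analyzeNodeEvents: true },   // explicitly enabled
  { analyzeNodeEvents: false },  // explicitly disabled
  { analyzeNodeEvents: 'yes' }   // truthy but not strictly true
];

const table = cases.map((options) => ({
  options: JSON.stringify(options),
  old: oldFlag(options),
  new: newFlag(options)
}));

console.table(table);
```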

package/package.json CHANGED

@@ -1,14 +1,17 @@
 {
   "name": "page-analyzer",
-  "version": "1.0.1",
+  "version": "1.2.0",
   "type": "module",
   "description": "Standalone page analysis module.",
+  "license": "MIT",
   "main": "index.js",
   "scripts": {
-    "test": "node test.js",
-    "analyze": "node test.js"
+    "test": "node test/smoke.test.js",
+    "analyze": "node scripts/analyze.js",
+    "viewer": "node scripts/serve-result-viewer.js"
   },
   "dependencies": {
+    "@aws-sdk/client-s3": "^3.1045.0",
     "cheerio": "^1.2.0",
     "csv-parse": "^5.6.0",
     "playwright": "^1.58.2"