@adobe/spacecat-shared-scrape-client 2.0.0 → 2.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +14 -0
- package/README.md +69 -1
- package/package.json +1 -1
- package/src/clients/scrape-client.js +82 -3
- package/src/clients/scrape-job-supervisor.js +9 -4
package/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,17 @@
|
|
|
1
|
+
# [@adobe/spacecat-shared-scrape-client-v2.1.1](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-scrape-client-v2.1.0...@adobe/spacecat-shared-scrape-client-v2.1.1) (2025-08-28)
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
### Bug Fixes
|
|
5
|
+
|
|
6
|
+
* enhance validation for scrape job configuration ([#940](https://github.com/adobe/spacecat-shared/issues/940)) ([54d0a6a](https://github.com/adobe/spacecat-shared/commit/54d0a6aa322547e13da25f2f97e1542fd5688849))
|
|
7
|
+
|
|
8
|
+
# [@adobe/spacecat-shared-scrape-client-v2.1.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-scrape-client-v2.0.0...@adobe/spacecat-shared-scrape-client-v2.1.0) (2025-08-20)
|
|
9
|
+
|
|
10
|
+
|
|
11
|
+
### Features
|
|
12
|
+
|
|
13
|
+
* add scrape-client destination ([#913](https://github.com/adobe/spacecat-shared/issues/913)) ([e208a87](https://github.com/adobe/spacecat-shared/commit/e208a87214874a2708ac2d7614fcfd4c0770fe17))
|
|
14
|
+
|
|
1
15
|
# [@adobe/spacecat-shared-scrape-client-v2.0.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-scrape-client-v1.0.7...@adobe/spacecat-shared-scrape-client-v2.0.0) (2025-08-13)
|
|
2
16
|
|
|
3
17
|
|
package/README.md
CHANGED
|
@@ -67,7 +67,9 @@ const jobData = {
|
|
|
67
67
|
'Authorization': 'Bearer token',
|
|
68
68
|
'X-Custom-Header': 'value'
|
|
69
69
|
},
|
|
70
|
-
processingType: 'default' // Optional, defaults to 'DEFAULT'
|
|
70
|
+
processingType: 'default', // Optional, defaults to 'DEFAULT'
|
|
71
|
+
maxScrapeAge: 6, // Optional, max age in hours used to avoid re-scraping recently scraped URLs; 0 means always scrape
|
|
72
|
+
auditData: {} // Optional, this is used for step audits
|
|
71
73
|
};
|
|
72
74
|
|
|
73
75
|
try {
|
|
@@ -122,6 +124,27 @@ try {
|
|
|
122
124
|
}
|
|
123
125
|
```
|
|
124
126
|
|
|
127
|
+
### Getting Successful Scrape Paths
|
|
128
|
+
|
|
129
|
+
```js
|
|
130
|
+
const jobId = 'your-job-id';
|
|
131
|
+
try {
|
|
132
|
+
const paths = await client.getScrapeResultPaths(jobId);
|
|
133
|
+
if (paths === null) {
|
|
134
|
+
console.log('Job not found');
|
|
135
|
+
} else if (paths.size === 0) {
|
|
136
|
+
console.log('No successful paths found for this job');
|
|
137
|
+
} else {
|
|
138
|
+
console.log(`Found ${paths.size} successful paths for job ${jobId}`);
|
|
139
|
+
for (const [url, path] of paths) {
|
|
140
|
+
console.log(`URL: ${url} -> Path: ${path}`);
|
|
141
|
+
}
|
|
142
|
+
}
|
|
143
|
+
} catch (error) {
|
|
144
|
+
console.error('Failed to get successful paths:', error.message);
|
|
145
|
+
}
|
|
146
|
+
```
|
|
147
|
+
|
|
125
148
|
### Finding Jobs by Date Range
|
|
126
149
|
|
|
127
150
|
```js
|
|
@@ -192,6 +215,17 @@ When you retrieve job results, each URL result has this structure:
|
|
|
192
215
|
}
|
|
193
216
|
```
|
|
194
217
|
|
|
218
|
+
## Path Results Format
|
|
219
|
+
|
|
220
|
+
When you retrieve successful scrape paths using `getScrapeResultPaths()`, the response is a JavaScript Map object that maps URLs to their corresponding result file paths. Only URLs with `COMPLETE` status are included:
|
|
221
|
+
|
|
222
|
+
```js
|
|
223
|
+
Map(2) {
|
|
224
|
+
'https://example.com/page1' => 'path/to/result1',
|
|
225
|
+
'https://example.com/page2' => 'path/to/result2'
|
|
226
|
+
}
|
|
227
|
+
```
|
|
228
|
+
|
|
195
229
|
## Configuration
|
|
196
230
|
|
|
197
231
|
The client uses the `SCRAPE_JOB_CONFIGURATION` environment variable for default settings:
|
|
@@ -248,3 +282,37 @@ npm run clean
|
|
|
248
282
|
- **Repository**: [GitHub](https://github.com/adobe/spacecat-shared.git)
|
|
249
283
|
- **Issue Tracking**: [GitHub Issues](https://github.com/adobe/spacecat-shared/issues)
|
|
250
284
|
- **License**: Apache-2.0
|
|
285
|
+
|
|
286
|
+
### ScrapeClient Workflow Overview
|
|
287
|
+
|
|
288
|
+
<img width="889" height="508" alt="Screenshot 2025-08-27 at 08 56 16" src="https://github.com/user-attachments/assets/9ccc1388-ed6b-4bf0-a059-d40e6e90aff8" />
|
|
289
|
+
|
|
290
|
+
When a new scrape job is created, the client performs the following steps:
|
|
291
|
+
1. Creates a new job entry in the database with status `PENDING`.
|
|
292
|
+
2. Splits the provided URLs into batches based on the `maxUrlsPerMessage` configuration (this is limited due to SQS message size constraints).
|
|
293
|
+
3. For each batch, it creates a message in the SQS queue to the scrape-job-manager.
|
|
294
|
+
|
|
295
|
+
In the scrape-job-manager the following steps are performed:
|
|
296
|
+
1. All existing ScrapeURLs are fetched for the base URL to avoid re-scraping recently scraped URLs (based on the `maxScrapeAge` parameter).
|
|
297
|
+
2. For all URLs a new ScrapeURL entry is created with status `PENDING`.
|
|
298
|
+
3. Each URL in the batch is checked against existing ScrapeURLs.
|
|
299
|
+
- Already scraped URLs (with status 'COMPLETE' or 'PENDING') are marked to be skipped with the ID of the existing ScrapeURL and the isOriginal flag set to false.
|
|
300
|
+
- URLs that need to be scraped are marked with the isOriginal flag set to true. (The isOriginal flag is used to avoid the sliding window problem when re-scraping URLs.)
|
|
301
|
+
- All URLs are numbered based on their position in the original list to be able to track the job progress.
|
|
302
|
+
4. For each URL, a message is created in the SQS queue to the content-scraper.
|
|
303
|
+
|
|
304
|
+
In the content-scraper the following steps are performed:
|
|
305
|
+
1. The content-scraper checks if an incoming URL message is marked to be skipped. If so, it just sends a message to the content-processor.
|
|
306
|
+
2. If the URL is not marked to be skipped, the content-scraper scrapes the URL.
|
|
307
|
+
3. The content-scraper creates a message in the SQS queue to the content-processor with the result of the scraping operation.
|
|
308
|
+
|
|
309
|
+
In the content-processor the following steps are performed:
|
|
310
|
+
1. The content-processor processes the incoming message from the content-scraper.
|
|
311
|
+
2. If the URL was skipped, it fetches the existing ScrapeURL entry and updates the new ScrapeURL entry with the same path and status.
|
|
312
|
+
3. If the URL was scraped, it updates the ScrapeURL entry with the result of the scraping operation (status, path, reason).
|
|
313
|
+
4. The content-processor updates the ScrapeJob entry with the new counts (success, failed, redirect).
|
|
314
|
+
5. If all URLs of a job are processed (based on their number and the totalUrlCount of the job), it:
|
|
315
|
+
- performs a cleanup step to set all PENDING URLs to FAILED that were not processed (e.g. due to timeouts).
|
|
316
|
+
- updates the counts of the job again.
|
|
317
|
+
- sets the job status to COMPLETE and sets the endedAt timestamp.
|
|
318
|
+
- Optionally, it can send an SQS message (e.g. to trigger the next audit step).
|
package/package.json
CHANGED
|
@@ -11,8 +11,7 @@
|
|
|
11
11
|
*/
|
|
12
12
|
|
|
13
13
|
import {
|
|
14
|
-
isIsoDate, isObject, isValidUrl,
|
|
15
|
-
isValidUUID,
|
|
14
|
+
hasText, isIsoDate, isNonEmptyArray, isObject, isValidUrl, isValidUUID,
|
|
16
15
|
} from '@adobe/spacecat-shared-utils';
|
|
17
16
|
import { ScrapeJob as ScrapeJobModel } from '@adobe/spacecat-shared-data-access';
|
|
18
17
|
import { ScrapeJobDto } from './scrapeJobDto.js';
|
|
@@ -35,6 +34,59 @@ export default class ScrapeClient {
|
|
|
35
34
|
}
|
|
36
35
|
}
|
|
37
36
|
|
|
37
|
+
static validateScrapeConfiguration(scrapeJobConfiguration) {
|
|
38
|
+
if (!isObject(scrapeJobConfiguration)) {
|
|
39
|
+
throw new Error('Invalid scrape configuration: configuration must be an object');
|
|
40
|
+
}
|
|
41
|
+
|
|
42
|
+
// Validate scrapeWorkerQueue
|
|
43
|
+
if (!hasText(scrapeJobConfiguration.scrapeWorkerQueue)) {
|
|
44
|
+
throw new Error('Invalid scrape configuration: scrapeWorkerQueue must be a non-empty string');
|
|
45
|
+
}
|
|
46
|
+
|
|
47
|
+
if (!isValidUrl(scrapeJobConfiguration.scrapeWorkerQueue)) {
|
|
48
|
+
throw new Error('Invalid scrape configuration: scrapeWorkerQueue must be a valid URL');
|
|
49
|
+
}
|
|
50
|
+
|
|
51
|
+
// Validate s3Bucket
|
|
52
|
+
if (!hasText(scrapeJobConfiguration.s3Bucket)) {
|
|
53
|
+
throw new Error('Invalid scrape configuration: s3Bucket must be a non-empty string');
|
|
54
|
+
}
|
|
55
|
+
|
|
56
|
+
// Validate options
|
|
57
|
+
if (scrapeJobConfiguration.options !== undefined) {
|
|
58
|
+
if (!isObject(scrapeJobConfiguration.options)) {
|
|
59
|
+
throw new Error('Invalid scrape configuration: options must be an object');
|
|
60
|
+
}
|
|
61
|
+
|
|
62
|
+
const { options } = scrapeJobConfiguration;
|
|
63
|
+
|
|
64
|
+
if (options.enableJavascript !== undefined && typeof options.enableJavascript !== 'boolean') {
|
|
65
|
+
throw new Error('Invalid scrape configuration: options.enableJavascript must be a boolean');
|
|
66
|
+
}
|
|
67
|
+
|
|
68
|
+
if (options.hideConsentBanners !== undefined && typeof options.hideConsentBanners !== 'boolean') {
|
|
69
|
+
throw new Error('Invalid scrape configuration: options.hideConsentBanners must be a boolean');
|
|
70
|
+
}
|
|
71
|
+
}
|
|
72
|
+
|
|
73
|
+
// Validate maxUrlsPerJob
|
|
74
|
+
if (scrapeJobConfiguration.maxUrlsPerJob !== undefined) {
|
|
75
|
+
if (!Number.isInteger(scrapeJobConfiguration.maxUrlsPerJob)
|
|
76
|
+
|| scrapeJobConfiguration.maxUrlsPerJob <= 0) {
|
|
77
|
+
throw new Error('Invalid scrape configuration: maxUrlsPerJob must be a positive integer');
|
|
78
|
+
}
|
|
79
|
+
}
|
|
80
|
+
|
|
81
|
+
// Validate maxUrlsPerMessage
|
|
82
|
+
if (scrapeJobConfiguration.maxUrlsPerMessage !== undefined) {
|
|
83
|
+
if (!Number.isInteger(scrapeJobConfiguration.maxUrlsPerMessage)
|
|
84
|
+
|| scrapeJobConfiguration.maxUrlsPerMessage <= 0) {
|
|
85
|
+
throw new Error('Invalid scrape configuration: maxUrlsPerMessage must be a positive integer');
|
|
86
|
+
}
|
|
87
|
+
}
|
|
88
|
+
}
|
|
89
|
+
|
|
38
90
|
validateRequestData(data) {
|
|
39
91
|
if (!isObject(data)) {
|
|
40
92
|
throw new Error('Invalid request: missing application/json request data');
|
|
@@ -104,8 +156,10 @@ export default class ScrapeClient {
|
|
|
104
156
|
let scrapeConfiguration = {};
|
|
105
157
|
try {
|
|
106
158
|
scrapeConfiguration = JSON.parse(this.config.env.SCRAPE_JOB_CONFIGURATION);
|
|
159
|
+
ScrapeClient.validateScrapeConfiguration(scrapeConfiguration);
|
|
107
160
|
} catch (error) {
|
|
108
|
-
this.config.log.error(`Failed to parse scrape job configuration: ${error.message}`);
|
|
161
|
+
this.config.log.error(`Failed to parse or validate scrape job configuration: ${error.message}`);
|
|
162
|
+
throw new Error(`Invalid scrape job configuration: ${error.message}`);
|
|
109
163
|
}
|
|
110
164
|
this.scrapeConfiguration = scrapeConfiguration;
|
|
111
165
|
|
|
@@ -132,6 +186,7 @@ export default class ScrapeClient {
|
|
|
132
186
|
customHeaders,
|
|
133
187
|
processingType = ScrapeJobModel.ScrapeProcessingType.DEFAULT,
|
|
134
188
|
maxScrapeAge = 24,
|
|
189
|
+
auditData = {},
|
|
135
190
|
} = data;
|
|
136
191
|
|
|
137
192
|
this.config.log.info(`Creating a new scrape job with ${urls.length} URLs.`);
|
|
@@ -149,6 +204,7 @@ export default class ScrapeClient {
|
|
|
149
204
|
mergedOptions,
|
|
150
205
|
customHeaders,
|
|
151
206
|
maxScrapeAge,
|
|
207
|
+
auditData,
|
|
152
208
|
);
|
|
153
209
|
return ScrapeJobDto.toJSON(job);
|
|
154
210
|
} catch (error) {
|
|
@@ -228,6 +284,29 @@ export default class ScrapeClient {
|
|
|
228
284
|
}
|
|
229
285
|
}
|
|
230
286
|
|
|
287
|
+
/**
|
|
288
|
+
* Get the result paths of a scrape job
|
|
289
|
+
* @param {string} jobId - The ID of the job to fetch.
|
|
290
|
+
* @return {Promise<Map<string, string>>} A map of URLs to their corresponding result paths.
|
|
291
|
+
*/
|
|
292
|
+
async getScrapeResultPaths(jobId) {
|
|
293
|
+
try {
|
|
294
|
+
const job = await this.scrapeSupervisor.getScrapeJob(jobId);
|
|
295
|
+
if (!job) {
|
|
296
|
+
return null;
|
|
297
|
+
}
|
|
298
|
+
const { ScrapeUrl } = this.config.dataAccess;
|
|
299
|
+
const scrapeUrls = await ScrapeUrl.allByScrapeJobId(job.getId());
|
|
300
|
+
return scrapeUrls
|
|
301
|
+
.filter((url) => url.getStatus() === ScrapeJobModel.ScrapeUrlStatus.COMPLETE)
|
|
302
|
+
.reduce((map, url) => map.set(url.getUrl(), url.getPath()), new Map());
|
|
303
|
+
} catch (error) {
|
|
304
|
+
const msgError = `Failed to fetch the scrape job result: ${error.message}`;
|
|
305
|
+
this.config.log.error(msgError);
|
|
306
|
+
throw new Error(msgError);
|
|
307
|
+
}
|
|
308
|
+
}
|
|
309
|
+
|
|
231
310
|
/**
|
|
232
311
|
* Get all scrape jobs by baseURL and processing type
|
|
233
312
|
* @param {string} baseURL - The baseURL of the jobs to fetch.
|
|
@@ -122,10 +122,12 @@ function ScrapeJobSupervisor(services, config) {
|
|
|
122
122
|
* @param {object} scrapeJob - The scrape job record.
|
|
123
123
|
* @param {object} customHeaders - Optional custom headers to be sent with each request.
|
|
124
124
|
* @param {string} maxScrapeAge - The maximum age of the scrape job
|
|
125
|
+
* @param {object} auditData - Step-Audit specific data
|
|
125
126
|
*/
|
|
126
|
-
|
|
127
|
+
// eslint-disable-next-line max-len
|
|
128
|
+
async function queueUrlsForScrapeWorker(urls, scrapeJob, customHeaders, maxScrapeAge, auditData) {
|
|
127
129
|
log.info(`Starting a new scrape job of baseUrl: ${scrapeJob.getBaseURL()} with ${urls.length}`
|
|
128
|
-
+
|
|
130
|
+
+ ' URLs.'
|
|
129
131
|
+ `(jobId: ${scrapeJob.getId()})`);
|
|
130
132
|
|
|
131
133
|
const options = scrapeJob.getOptions();
|
|
@@ -155,6 +157,7 @@ function ScrapeJobSupervisor(services, config) {
|
|
|
155
157
|
customHeaders,
|
|
156
158
|
options,
|
|
157
159
|
maxScrapeAge,
|
|
160
|
+
auditData,
|
|
158
161
|
};
|
|
159
162
|
|
|
160
163
|
// eslint-disable-next-line no-await-in-loop
|
|
@@ -168,7 +171,8 @@ function ScrapeJobSupervisor(services, config) {
|
|
|
168
171
|
* @param {string} processingType - The type of processing to perform.
|
|
169
172
|
* @param {object} options - Optional configuration params for the scrape job.
|
|
170
173
|
* @param {object} customHeaders - Optional custom headers to be sent with each request.
|
|
171
|
-
* @param {
|
|
174
|
+
* @param {number} maxScrapeAge - The maximum age of the scrape job
|
|
175
|
+
* @param auditContext
|
|
172
176
|
* @returns {Promise<ScrapeJob>} newly created job object
|
|
173
177
|
*/
|
|
174
178
|
async function startNewJob(
|
|
@@ -177,6 +181,7 @@ function ScrapeJobSupervisor(services, config) {
|
|
|
177
181
|
options,
|
|
178
182
|
customHeaders,
|
|
179
183
|
maxScrapeAge,
|
|
184
|
+
auditContext,
|
|
180
185
|
) {
|
|
181
186
|
const newScrapeJob = await createNewScrapeJob(
|
|
182
187
|
urls,
|
|
@@ -196,7 +201,7 @@ function ScrapeJobSupervisor(services, config) {
|
|
|
196
201
|
|
|
197
202
|
// Queue all URLs for scrape as a single message. This enables the controller to respond with
|
|
198
203
|
// a job ID ASAP, while the individual URLs are queued up asynchronously by another function.
|
|
199
|
-
await queueUrlsForScrapeWorker(urls, newScrapeJob, customHeaders, maxScrapeAge);
|
|
204
|
+
await queueUrlsForScrapeWorker(urls, newScrapeJob, customHeaders, maxScrapeAge, auditContext);
|
|
200
205
|
|
|
201
206
|
return newScrapeJob;
|
|
202
207
|
}
|