@adobe/spacecat-shared-scrape-client 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md ADDED
@@ -0,0 +1,6 @@
1
+ # @adobe/spacecat-shared-scrape-client-v1.0.0 (2025-06-19)
2
+
3
+
4
+ ### Features
5
+
6
+ * added scrape client ([#814](https://github.com/adobe/spacecat-shared/issues/814)) ([fad6614](https://github.com/adobe/spacecat-shared/commit/fad6614672a046da5319e493cc7c26bfdc3993d2))
package/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Adobe Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at Grp-opensourceoffice@adobe.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
package/CONTRIBUTING.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributing to Project Franklin
2
+
3
+ This project (like almost all of Project Franklin) is an Open Development project and welcomes contributions from everyone who finds it useful or lacking.
4
+
5
+ ## Code Of Conduct
6
+
7
+ This project adheres to the Adobe [code of conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to cstaub at adobe dot com.
8
+
9
+ ## Contributor License Agreement
10
+
11
+ All third-party contributions to this project must be accompanied by a signed contributor license. This gives Adobe permission to redistribute your contributions as part of the project. [Sign our CLA](http://opensource.adobe.com/cla.html)! You only need to submit an Adobe CLA one time, so if you have submitted one previously, you are good to go!
12
+
13
+ ## Things to Keep in Mind
14
+
15
+ This project uses a **commit then review** process, which means that for approved maintainers, changes can be merged immediately, but will be reviewed by others.
16
+
17
+ For other contributors, a maintainer of the project has to approve the pull request.
18
+
19
+ # Before You Contribute
20
+
21
+ * Check that there is an existing issue in GitHub issues
22
+ * Check if there are other pull requests that might overlap or conflict with your intended contribution
23
+
24
+ # How to Contribute
25
+
26
+ 1. Fork the repository
27
+ 2. Make some changes on a branch on your fork
28
+ 3. Create a pull request from your branch
29
+
30
+ In your pull request, outline:
31
+
32
+ * What the changes are intended to do
33
+ * How they change the existing code
34
+ * Whether (and what) they break
35
+ * Start the pull request with the GitHub issue ID, e.g. #123
36
+
37
+ Lastly, please follow the [pull request template](.github/pull_request_template.md) when submitting a pull request!
38
+
39
+ Each commit message that is not part of a pull request:
40
+
41
+ * Should contain the issue ID like `#123`
42
+ * Can contain the tag `[trivial]` for trivial changes that don't relate to an issue
43
+
44
+
45
+
46
+ ## Coding Styleguides
47
+
48
+ We enforce a coding styleguide using `eslint`. As part of your build, run `npm run lint` to check if your code is conforming to the style guide. We do the same for every PR in our CI, so PRs will get rejected if they don't follow the style guide.
49
+
50
+ You can fix some of the issues automatically by running `npx eslint . --fix`.
51
+
52
+ ## Commit Message Format
53
+
54
+ This project uses a structured commit changelog format that should be used for every commit. Use `npm run commit` instead of your usual `git commit` to generate commit messages using a wizard.
55
+
56
+ ```bash
57
+ # either add all changed files
58
+ $ git add -A
59
+ # or selectively add files
60
+ $ git add package.json
61
+ # then commit using the wizard
62
+ $ npm run commit
63
+ ```
64
+
65
+ # How Contributions get Reviewed
66
+
67
+ One of the maintainers will look at the pull request within one week. Feedback on the pull request will be given in writing, in GitHub.
68
+
69
+ # Release Management
70
+
71
+ The project's committers will release to the [Adobe organization on npmjs.org](https://www.npmjs.com/org/adobe).
72
+ Please contact the [Adobe Open Source Advisory Board](https://git.corp.adobe.com/OpenSourceAdvisoryBoard/discuss/issues) to get access to the npmjs organization.
73
+
74
+ The release process is fully automated using `semantic-release`, increasing the version numbers, etc. based on the contents of the commit messages found.
package/LICENSE.txt ADDED
@@ -0,0 +1,264 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [yyyy] [name of copyright owner]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
203
+
204
+
205
+ APACHE JACKRABBIT SUBCOMPONENTS
206
+
207
+ Apache Jackrabbit includes parts with separate copyright notices and license
208
+ terms. Your use of these subcomponents is subject to the terms and conditions
209
+ of the following licenses:
210
+
211
+ XPath 2.0/XQuery 1.0 Parser:
212
+ http://www.w3.org/2002/11/xquery-xpath-applets/xgrammar.zip
213
+
214
+ Copyright (C) 2002 World Wide Web Consortium, (Massachusetts Institute of
215
+ Technology, European Research Consortium for Informatics and Mathematics,
216
+ Keio University). All Rights Reserved.
217
+
218
+ This work is distributed under the W3C(R) Software License in the hope
219
+ that it will be useful, but WITHOUT ANY WARRANTY; without even the
220
+ implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
221
+
222
+ W3C(R) SOFTWARE NOTICE AND LICENSE
223
+ http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231
224
+
225
+ This work (and included software, documentation such as READMEs, or
226
+ other related items) is being provided by the copyright holders under
227
+ the following license. By obtaining, using and/or copying this work,
228
+ you (the licensee) agree that you have read, understood, and will comply
229
+ with the following terms and conditions.
230
+
231
+ Permission to copy, modify, and distribute this software and its
232
+ documentation, with or without modification, for any purpose and
233
+ without fee or royalty is hereby granted, provided that you include
234
+ the following on ALL copies of the software and documentation or
235
+ portions thereof, including modifications:
236
+
237
+ 1. The full text of this NOTICE in a location viewable to users
238
+ of the redistributed or derivative work.
239
+
240
+ 2. Any pre-existing intellectual property disclaimers, notices,
241
+ or terms and conditions. If none exist, the W3C Software Short
242
+ Notice should be included (hypertext is preferred, text is
243
+ permitted) within the body of any redistributed or derivative code.
244
+
245
+ 3. Notice of any changes or modifications to the files, including
246
+ the date changes were made. (We recommend you provide URIs to the
247
+ location from which the code is derived.)
248
+
249
+ THIS SOFTWARE AND DOCUMENTATION IS PROVIDED "AS IS," AND COPYRIGHT
250
+ HOLDERS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED,
251
+ INCLUDING BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY OR FITNESS
252
+ FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE OR
253
+ DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS,
254
+ TRADEMARKS OR OTHER RIGHTS.
255
+
256
+ COPYRIGHT HOLDERS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL
257
+ OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THE SOFTWARE OR
258
+ DOCUMENTATION.
259
+
260
+ The name and trademarks of copyright holders may NOT be used in
261
+ advertising or publicity pertaining to the software without specific,
262
+ written prior permission. Title to copyright in this software and
263
+ any associated documentation will at all times remain with
264
+ copyright holders.
package/README.md ADDED
@@ -0,0 +1,249 @@
1
+ # Spacecat Shared - Scrape Client
2
+
3
+ A JavaScript client for managing web scraping jobs, part of the SpaceCat Shared library. The ScrapeClient provides a comprehensive interface for creating, monitoring, and retrieving results from web scraping operations without needing to access the SpaceCat API service directly.
4
+
5
+ ## Installation
6
+
7
+ Install the package using npm:
8
+
9
+ ```bash
10
+ npm install @adobe/spacecat-shared-scrape-client
11
+ ```
12
+
13
+ ## Features
14
+
15
+ - **Create Scrape Jobs**: Submit URLs for web scraping with customizable options
16
+ - **Job Monitoring**: Track job status and progress
17
+ - **Result Retrieval**: Get detailed results for completed scraping jobs
18
+ - **Date Range Queries**: Find jobs within specific time periods
19
+ - **Base URL Filtering**: Search jobs by domain or base URL
20
+ - **Processing Type Support**: Different scraping strategies and configurations
21
+ - **Custom Headers**: Add custom HTTP headers for scraping requests
22
+ - **Error Handling**: Comprehensive validation and error reporting
23
+
24
+ ## Usage
25
+
26
+ ### Creating an Instance
27
+
28
+ #### Method 1: Direct Constructor
29
+
30
+ ```js
31
+ import { ScrapeClient } from '@adobe/spacecat-shared-scrape-client';
32
+
33
+ const config = {
34
+ dataAccess: dataAccessClient, // Data access layer
35
+ sqs: sqsClient, // SQS client for job queuing
36
+ env: environmentVariables, // Environment configuration
37
+ log: logger // Logging interface
38
+ };
39
+
40
+ const client = new ScrapeClient(config);
41
+ ```
42
+
43
+ #### Method 2: From Helix Universal Context
44
+
45
+ ```js
46
+ import { ScrapeClient } from '@adobe/spacecat-shared-scrape-client';
47
+
48
+ // In a Helix Universal function, the provided `context` already contains
49
+ // dataAccess, sqs, env, and log, so it can be passed through directly:
50
+ const client = ScrapeClient.createFrom(context);
56
+ ```
57
+
58
+ ### Creating a Scrape Job
59
+
60
+ ```js
61
+ const jobData = {
62
+ urls: ['https://example.com/page1', 'https://example.com/page2'],
63
+ options: {},
64
+ customHeaders: {
65
+ // Custom HTTP headers (optional)
66
+ 'Authorization': 'Bearer token',
67
+ 'X-Custom-Header': 'value'
68
+ },
69
+ processingType: 'default' // Optional, defaults to 'DEFAULT'
70
+ };
71
+
72
+ try {
73
+ const job = await client.createScrapeJob(jobData);
74
+ console.log('Job created:', job.id);
75
+ console.log('Job status:', job.status);
76
+ } catch (error) {
77
+ console.error('Failed to create job:', error.message);
78
+ }
79
+ ```
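If a deployment caps the number of URLs per job (see the `maxUrlsPerJob` configuration option), a larger list can be split into several jobs. A minimal sketch; `chunkUrls` is a hypothetical helper, not part of the client, and the default cap of 5 simply mirrors the example configuration:

```javascript
// Hypothetical helper: split a URL list into batches that respect a
// per-job URL cap. The actual limit depends on your deployment's
// SCRAPE_JOB_CONFIGURATION.
function chunkUrls(urls, maxUrlsPerJob = 5) {
  const batches = [];
  for (let i = 0; i < urls.length; i += maxUrlsPerJob) {
    batches.push(urls.slice(i, i + maxUrlsPerJob));
  }
  return batches;
}

const urls = ['a', 'b', 'c', 'd', 'e', 'f'].map((p) => `https://example.com/${p}`);
const batches = chunkUrls(urls);
console.log(batches.length); // 2 batches: 5 URLs + 1 URL
```

Each batch can then be submitted with its own `createScrapeJob` call.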
80
+
81
+ ### Checking Job Status
82
+
83
+ ```js
84
+ const jobId = 'your-job-id';
85
+
86
+ try {
87
+ const jobStatus = await client.getScrapeJobStatus(jobId);
88
+ if (jobStatus) {
89
+ console.log('Job Status:', jobStatus.status);
90
+ console.log('URL Count:', jobStatus.urlCount);
91
+ console.log('Success Count:', jobStatus.successCount);
92
+ console.log('Failed Count:', jobStatus.failedCount);
93
+ console.log('Duration:', jobStatus.duration);
94
+ } else {
95
+ console.log('Job not found');
96
+ }
97
+ } catch (error) {
98
+ console.error('Failed to get job status:', error.message);
99
+ }
100
+ ```
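Because jobs run asynchronously, callers often poll until a job reaches a terminal state. A minimal polling sketch; `waitForJob` is a hypothetical helper, and the terminal status names are assumptions to adapt to the values your deployment actually reports:

```javascript
// Poll a job's status until it leaves its running state (illustrative).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForJob(client, jobId, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const status = await client.getScrapeJobStatus(jobId);
    // Assumed terminal states; adjust to your deployment's status values.
    if (status && ['COMPLETE', 'FAILED'].includes(status.status)) {
      return status;
    }
    await sleep(intervalMs);
  }
  throw new Error(`Job ${jobId} did not finish within the polling window`);
}
```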
101
+
102
+ ### Getting Job Results
103
+
104
+ ```js
105
+ const jobId = 'your-job-id';
106
+
107
+ try {
108
+ const results = await client.getScrapeJobUrlResults(jobId);
109
+ if (results) {
110
+ results.forEach(result => {
111
+ console.log(`URL: ${result.url}`);
112
+ console.log(`Status: ${result.status}`);
113
+ console.log(`Reason: ${result.reason}`);
114
+ console.log(`Path: ${result.path}`);
115
+ });
116
+ } else {
117
+ console.log('Job not found');
118
+ }
119
+ } catch (error) {
120
+ console.error('Failed to get job results:', error.message);
121
+ }
122
+ ```
123
+
124
+ ### Finding Jobs by Date Range
125
+
126
+ ```js
127
+ const startDate = '2024-01-01T00:00:00Z';
128
+ const endDate = '2024-01-31T23:59:59Z';
129
+
130
+ try {
131
+ const jobs = await client.getScrapeJobsByDateRange(startDate, endDate);
132
+ console.log(`Found ${jobs.length} jobs in date range`);
133
+ jobs.forEach(job => {
134
+ console.log(`Job ${job.id}: ${job.status} - ${job.baseURL}`);
135
+ });
136
+ } catch (error) {
137
+ console.error('Failed to get jobs by date range:', error.message);
138
+ }
139
+ ```
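The start and end dates are ISO 8601 strings, so a rolling window can be built with plain `Date` objects:

```javascript
// Build an ISO 8601 date range covering the last 7 days.
const endDate = new Date();
const startDate = new Date(endDate.getTime() - 7 * 24 * 60 * 60 * 1000);

// These strings can be passed directly to getScrapeJobsByDateRange.
console.log(startDate.toISOString(), endDate.toISOString());
```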
140
+
141
+ ### Finding Jobs by Base URL
142
+
143
+ ```js
144
+ const baseURL = 'https://example.com';
145
+
146
+ try {
147
+ // Get all jobs for a base URL
148
+ const allJobs = await client.getScrapeJobsByBaseURL(baseURL);
149
+ console.log(`Found ${allJobs.length} jobs for ${baseURL}`);
150
+
151
+ // Get jobs for a specific processing type
152
+ const specificJobs = await client.getScrapeJobsByBaseURL(baseURL, 'form');
153
+ console.log(`Found ${specificJobs.length} jobs with 'form' processing`);
154
+ } catch (error) {
155
+ console.error('Failed to get jobs by base URL:', error.message);
156
+ }
157
+ ```
158
+
159
+ ## Job Response Format
160
+
161
+ When you retrieve a scrape job, it returns an object with the following structure:
162
+
163
+ ```js
164
+ {
165
+ id: "job-id",
166
+ baseURL: "https://example.com",
167
+ processingType: "default",
168
+ options: { /* scraping options */ },
169
+ startedAt: "2024-01-01T10:00:00Z",
170
+ endedAt: "2024-01-01T10:05:00Z",
171
+ duration: 300000, // milliseconds
172
+ status: "COMPLETE",
173
+ urlCount: 10,
174
+ successCount: 8,
175
+ failedCount: 2,
176
+ redirectCount: 0,
177
+ customHeaders: { /* custom headers used */ }
178
+ }
179
+ ```
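The counters in this object make it straightforward to derive aggregate metrics; for example, using the fields shown above (the sample values here are illustrative):

```javascript
// Derive a success rate from the documented counter fields,
// guarding against a job with zero URLs.
const job = {
  status: 'COMPLETE', urlCount: 10, successCount: 8, failedCount: 2, duration: 300000,
};

const successRate = job.urlCount > 0 ? (job.successCount / job.urlCount) * 100 : 0;
console.log(`${successRate}% succeeded in ${job.duration / 1000}s`); // 80% succeeded in 300s
```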
180
+
181
+ ## URL Results Format
182
+
183
+ When you retrieve job results, each URL result has this structure:
184
+
185
+ ```js
186
+ {
187
+ url: "https://example.com/page",
188
+ status: "SUCCESS",
189
+ reason: "if the scrape failed, the failure reason appears here",
190
+ path: "/s3/path/to/scraped/content"
191
+ }
192
+ ```
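For example, failed URLs can be collected from a results array using the fields above (the sample data here is illustrative):

```javascript
// Partition results by the documented status field.
const results = [
  { url: 'https://example.com/a', status: 'SUCCESS', reason: null, path: '/s3/a' },
  { url: 'https://example.com/b', status: 'FAILED', reason: 'timeout', path: null },
];

const failed = results.filter((r) => r.status !== 'SUCCESS');
failed.forEach((r) => console.log(`${r.url} failed: ${r.reason}`));
// → https://example.com/b failed: timeout
```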
193
+
194
+ ## Configuration
195
+
196
+ The client uses the `SCRAPE_JOB_CONFIGURATION` environment variable for default settings:
197
+
198
+ ```js
199
+ // Example configuration
200
+ {
201
+ "maxUrlsPerJob": 5,
202
+ "options": {
203
+ "enableJavascript": true,
204
+ "hideConsentBanner": true
205
+ }
206
+ }
207
+ ```
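Because environment variables are strings, the configuration is typically supplied as a JSON string and parsed at runtime. A hedged sketch of how such a value can be read with a fallback; the default values and parsing shown here are illustrative, not the client's exact internals:

```javascript
// Illustrative parsing of a JSON configuration string from the environment.
const DEFAULT_CONFIG = { maxUrlsPerJob: 5, options: {} };

function loadScrapeJobConfiguration(env) {
  const raw = env.SCRAPE_JOB_CONFIGURATION;
  if (!raw) return DEFAULT_CONFIG;
  try {
    return JSON.parse(raw);
  } catch (e) {
    return DEFAULT_CONFIG; // fall back if the variable holds invalid JSON
  }
}

const config = loadScrapeJobConfiguration({
  SCRAPE_JOB_CONFIGURATION: '{"maxUrlsPerJob":10,"options":{"enableJavascript":true}}',
});
console.log(config.maxUrlsPerJob); // 10
```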
208
+
209
+ ## Testing
210
+
211
+ To run tests:
212
+
213
+ ```bash
214
+ npm run test
215
+ ```
216
+
217
+ ## Linting
218
+
219
+ Lint your code:
220
+
221
+ ```bash
222
+ npm run lint
223
+ ```
224
+
225
+ Fix linting issues:
226
+
227
+ ```bash
228
+ npm run lint:fix
229
+ ```
230
+
231
+ ## Cleaning
232
+
233
+ To remove `node_modules` and `package-lock.json`:
234
+
235
+ ```bash
236
+ npm run clean
237
+ ```
238
+
239
+ ## Dependencies
240
+
241
+ - `@adobe/helix-universal`: Universal context support
242
+ - `@adobe/spacecat-shared-data-access`: Data access layer
243
+ - `@adobe/spacecat-shared-utils`: Utility functions
244
+
245
+ ## Additional Information
246
+
247
+ - **Repository**: [GitHub](https://github.com/adobe/spacecat-shared.git)
248
+ - **Issue Tracking**: [GitHub Issues](https://github.com/adobe/spacecat-shared/issues)
249
+ - **License**: Apache-2.0
package/package.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+ "name": "@adobe/spacecat-shared-scrape-client",
3
+ "version": "1.0.0",
4
+ "description": "Shared modules of the Spacecat Services - Scrape Client",
5
+ "type": "module",
6
+ "engines": {
7
+ "node": ">=22.0.0 <23.0.0",
8
+ "npm": ">=10.9.0 <12.0.0"
9
+ },
10
+ "main": "src/index.js",
11
+ "types": "src/index.d.ts",
12
+ "scripts": {
13
+ "test": "c8 mocha --spec=test/**/*.test.js",
14
+ "lint": "eslint .",
15
+ "lint:fix": "eslint --fix .",
16
+ "clean": "rm -rf package-lock.json node_modules"
17
+ },
18
+ "mocha": {
19
+ "require": "test/setup-env.js",
20
+ "reporter": "mocha-multi-reporters",
21
+ "reporter-options": "configFile=.mocha-multi.json",
22
+ "spec": "test/**/*.test.js"
23
+ },
24
+ "repository": {
25
+ "type": "git",
26
+ "url": "https://github.com/adobe/spacecat-shared.git"
27
+ },
28
+ "author": "",
29
+ "license": "Apache-2.0",
30
+ "bugs": {
31
+ "url": "https://github.com/adobe/spacecat-shared/issues"
32
+ },
33
+ "homepage": "https://github.com/adobe/spacecat-shared#readme",
34
+ "publishConfig": {
35
+ "access": "public"
36
+ },
37
+ "dependencies": {
38
+ "@adobe/helix-universal": "5.2.2",
39
+ "@adobe/spacecat-shared-data-access": "2.25.0",
40
+ "@adobe/spacecat-shared-utils": "1.31.0"
41
+ },
42
+ "devDependencies": {
43
+ "chai": "5.2.0",
44
+ "chai-as-promised": "8.0.1",
45
+ "nock": "14.0.5",
46
+ "sinon": "20.0.0",
47
+ "sinon-chai": "4.0.0",
48
+ "typescript": "5.8.3"
49
+ }
50
+ }
package/src/index.d.ts ADDED
@@ -0,0 +1,66 @@
1
+ /*
2
+ * Copyright 2025 Adobe. All rights reserved.
3
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
4
+ * you may not use this file except in compliance with the License. You may obtain a copy
5
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
6
+ *
7
+ * Unless required by applicable law or agreed to in writing, software distributed under
8
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
9
+ * OF ANY KIND, either express or implied. See the License for the specific language
10
+ * governing permissions and limitations under the License.
11
+ */
12
+
13
+ import { UniversalContext } from '@adobe/helix-universal';
14
+
15
+ export default class ScrapeClient {
16
+ /**
17
+ * Static factory method to create an instance of ScrapeClient.
18
+ * @param {UniversalContext} context - The Helix Universal context, providing dataAccess, sqs, env, and log.
19
+ * @returns An instance of ScrapeClient.
20
+ */
21
+ static createFrom(context: UniversalContext): ScrapeClient;
22
+
23
+ /**
24
+ * Constructor for creating an instance of ScrapeClient.
25
+ * @param config - Configuration object for the ScrapeClient.
26
+ * @param log - Optional logger instance for logging messages.
27
+ */
28
+ constructor(config: object, log?: Console);
29
+
30
+ /**
31
+ * Create and start a new scrape job.
32
+ * @param {object} data - json data for scrape job
33
+ * @returns {Promise<Response>} new job object
34
+ */
35
+ async createScrapeJob(data: object): Promise<object>;
36
+
37
+ /**
38
+ * Get all scrape jobs between startDate and endDate
39
+ * @param {string} startDate - The start date of the range.
40
+ * @param {string} endDate - The end date of the range.
41
+ * @returns {Promise<Response>} JSON representation of the scrape jobs.
42
+ */
43
+ async getScrapeJobsByDateRange(startDate: string, endDate: string): Promise<object[]>;
44
+
45
+ /**
46
+ * Get the status of an scrape job.
47
+ * @param {string} jobId - The ID of the job to fetch.
48
+ * @returns {Promise<Response>} JSON representation of the scrape job.
49
+ */
50
+ async getScrapeJobStatus(jobId: string): Promise<object>;
51
+
52
+ /**
53
+ * Get the result of a scrape job
54
+ * @param {string} jobId - The ID of the job to fetch.
55
+ * @returns {Promise<Response>} all results for all urls scrape jobs.
56
+ */
57
+ async getScrapeJobUrlResults(jobId: string): Promise<object[]>;
58
+
59
+ /**
60
+ * Get all scrape jobs by baseURL and processing type
61
+ * @param {string} baseURL - The baseURL of the jobs to fetch.
62
+ * @param {string} processingType - (optional) The processing type of the jobs to fetch.
63
+ * @returns {Promise<Response>} JSON representation of the scrape jobs
64
+ */
65
+ async getScrapeJobsByBaseURL(baseURL: string, processingType?: string): Promise<object[]>;
66
+ }
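For orientation, here is a minimal sketch of the request payload shape that `createScrapeJob` accepts, with a plain `URL` parse standing in for the package's `isValidUrl` helper. The option and header names shown are illustrative assumptions, not part of the published API.

```javascript
// Hypothetical createScrapeJob payload; only `urls` is required.
const data = {
  urls: ['https://example.com/page-1', 'https://example.com/page-2'],
  processingType: 'default',
  options: { pageLoadTimeout: 30000 },
  customHeaders: { 'x-example-header': 'demo' },
};

// Each entry in data.urls must parse as an absolute URL,
// mirroring the per-URL validation the client performs.
const allUrlsValid = data.urls.every((url) => {
  try {
    return Boolean(new URL(url));
  } catch {
    return false;
  }
});
// allUrlsValid === true
```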
@@ -0,0 +1,15 @@
+/*
+ * Copyright 2024 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+import type ScrapeClient from './ScrapeClient.d.ts';
+
+export { ScrapeClient };
@@ -0,0 +1,260 @@
+/*
+ * Copyright 2025 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+import {
+  isIsoDate, isObject, isValidUrl, isNonEmptyArray, hasText,
+  isValidUUID,
+} from '@adobe/spacecat-shared-utils';
+import { ScrapeJob as ScrapeJobModel } from '@adobe/spacecat-shared-data-access';
+import { ScrapeJobDto } from './scrapeJobDto.js';
+import ScrapeJobSupervisor from './scrape-job-supervisor.js';
+
+export default class ScrapeClient {
+  config = null;
+
+  services = null;
+
+  scrapeConfiguration = null;
+
+  scrapeSupervisor = null;
+
+  maxUrlsPerJob = 1;
+
+  static validateIsoDates(startDate, endDate) {
+    if (!isIsoDate(startDate) || !isIsoDate(endDate)) {
+      throw new Error('Invalid request: startDate and endDate must be in ISO 8601 format');
+    }
+  }
+
+  validateRequestData(data) {
+    if (!isObject(data)) {
+      throw new Error('Invalid request: missing application/json request data');
+    }
+
+    if (!isNonEmptyArray(data.urls)) {
+      throw new Error('Invalid request: urls must be provided as a non-empty array');
+    }
+
+    if (data.urls.length > this.maxUrlsPerJob) {
+      throw new Error(`Invalid request: number of URLs provided (${data.urls.length}) exceeds the maximum allowed (${this.maxUrlsPerJob})`);
+    }
+
+    data.urls.forEach((url) => {
+      if (!isValidUrl(url)) {
+        throw new Error(`Invalid request: ${url} is not a valid URL`);
+      }
+    });
+
+    if (data.options && !isObject(data.options)) {
+      throw new Error('Invalid request: options must be an object');
+    }
+
+    if (data.customHeaders && !isObject(data.customHeaders)) {
+      throw new Error('Invalid request: customHeaders must be an object');
+    }
+  }
+
+  /**
+   * Creates a new ScrapeClient from the context.
+   * @param {object} context - The context object
+   * @param {object} context.dataAccess - The data access client
+   * @param {object} context.sqs - The SQS client
+   * @param {object} context.log - The logger
+   * @param {object} context.env - The environment variables
+   * @returns {ScrapeClient} - The ScrapeClient instance
+   */
+  static createFrom(context) {
+    function validateServices() {
+      const requiredServices = ['dataAccess', 'sqs', 'log', 'env'];
+      requiredServices.forEach((service) => {
+        if (!context[service]) {
+          throw new Error(`Invalid services: ${service} is required`);
+        }
+      });
+    }
+    validateServices();
+    const {
+      log,
+      dataAccess,
+      sqs,
+      env,
+    } = context;
+
+    const config = {
+      dataAccess,
+      sqs,
+      env,
+      log,
+    };
+    return new ScrapeClient(config);
+  }
+
+  constructor(config) {
+    this.config = config;
+
+    let scrapeConfiguration = {};
+    try {
+      scrapeConfiguration = JSON.parse(this.config.env.SCRAPE_JOB_CONFIGURATION);
+    } catch (error) {
+      this.config.log.error(`Failed to parse scrape job configuration: ${error.message}`);
+    }
+    this.scrapeConfiguration = scrapeConfiguration;
+
+    // default to 1 url per job
+    this.maxUrlsPerJob = scrapeConfiguration.maxUrlsPerJob || 1;
+
+    this.scrapeSupervisor = new ScrapeJobSupervisor(this.config, scrapeConfiguration);
+  }
+
+  /**
+   * Create and start a new scrape job.
+   * @param {object} data - JSON data for the scrape job.
+   * @returns {Promise<object>} JSON representation of the newly created job.
+   */
+  async createScrapeJob(data) {
+    try {
+      this.validateRequestData(data);
+
+      const {
+        urls, options, customHeaders, processingType = ScrapeJobModel.ScrapeProcessingType.DEFAULT,
+      } = data;
+
+      this.config.log.info(`Creating a new scrape job with ${urls.length} URLs.`);
+
+      // Merge the scrape configuration options with the request options, allowing the user
+      // options to override the defaults
+      const mergedOptions = {
+        ...this.scrapeConfiguration.options,
+        ...options,
+      };
+
+      const job = await this.scrapeSupervisor.startNewJob(
+        urls,
+        processingType,
+        mergedOptions,
+        customHeaders,
+      );
+      return ScrapeJobDto.toJSON(job);
+    } catch (error) {
+      const msgError = `Failed to create a new scrape job: ${error.message}`;
+      this.config.log.error(msgError);
+      throw new Error(msgError);
+    }
+  }
+
+  /**
+   * Get all scrape jobs between startDate and endDate.
+   * @param {string} startDate - The start date of the range.
+   * @param {string} endDate - The end date of the range.
+   * @returns {Promise<object[]>} JSON representation of the scrape jobs.
+   */
+  async getScrapeJobsByDateRange(startDate, endDate) {
+    this.config.log.debug(`Fetching scrape jobs between startDate: ${startDate} and endDate: ${endDate}.`);
+
+    ScrapeClient.validateIsoDates(startDate, endDate);
+    try {
+      const jobs = await this.scrapeSupervisor.getScrapeJobsByDateRange(startDate, endDate);
+      return jobs.map((job) => ScrapeJobDto.toJSON(job));
+    } catch (error) {
+      const msgError = `Failed to fetch scrape jobs between startDate: ${startDate} and endDate: ${endDate}, ${error.message}`;
+      this.config.log.error(msgError);
+      throw new Error(msgError);
+    }
+  }
+
+  /**
+   * Get the status of a scrape job.
+   * @param {string} jobId - The ID of the job to fetch.
+   * @returns {Promise<object|null>} JSON representation of the scrape job, or null if not found.
+   */
+  async getScrapeJobStatus(jobId) {
+    if (!isValidUUID(jobId)) {
+      throw new Error('Invalid request: jobId must be a valid UUID');
+    }
+    try {
+      const job = await this.scrapeSupervisor.getScrapeJob(jobId);
+      if (!job) {
+        return null;
+      }
+      return ScrapeJobDto.toJSON(job);
+    } catch (error) {
+      const msgError = `Failed to fetch scrape job status for jobId: ${jobId}, message: ${error.message}`;
+      this.config.log.error(msgError);
+      throw new Error(msgError);
+    }
+  }
+
+  /**
+   * Get the results of a scrape job.
+   * @param {string} jobId - The ID of the job to fetch.
+   * @returns {Promise<object[]|null>} The results for every URL in the scrape job,
+   *   or null if the job does not exist.
+   */
+  async getScrapeJobUrlResults(jobId) {
+    try {
+      const job = await this.scrapeSupervisor.getScrapeJob(jobId);
+      if (!job) {
+        return null;
+      }
+      const { ScrapeUrl } = this.config.dataAccess;
+      const scrapeUrls = await ScrapeUrl.allByScrapeJobId(job.getId());
+      const results = scrapeUrls.map((url) => ({
+        url: url.getUrl(),
+        status: url.getStatus(),
+        reason: url.getReason(),
+        path: url.getPath(),
+      }));
+
+      return results;
+    } catch (error) {
+      const msgError = `Failed to fetch the scrape job result: ${error.message}`;
+      this.config.log.error(msgError);
+      throw new Error(msgError);
+    }
+  }
+
+  /**
+   * Get all scrape jobs by baseURL and processing type.
+   * @param {string} baseURL - The baseURL of the jobs to fetch.
+   * @param {string} processingType - (optional) The processing type of the jobs to fetch.
+   * @returns {Promise<object[]>} JSON representation of the scrape jobs.
+   */
+  async getScrapeJobsByBaseURL(baseURL, processingType = undefined) {
+    let decodedBaseURL = baseURL;
+    try {
+      decodedBaseURL = decodeURIComponent(baseURL);
+
+      if (!isValidUrl(decodedBaseURL)) {
+        throw new Error('Invalid request: baseURL must be a valid URL');
+      }
+
+      let jobs = [];
+      if (hasText(processingType)) {
+        jobs = await this.scrapeSupervisor.getScrapeJobsByBaseURLAndProcessingType(
+          decodedBaseURL,
+          processingType,
+        );
+      } else {
+        jobs = await this.scrapeSupervisor.getScrapeJobsByBaseURL(decodedBaseURL);
+      }
+
+      if (!isNonEmptyArray(jobs)) {
+        return [];
+      }
+      return jobs.map((job) => ScrapeJobDto.toJSON(job));
+    } catch (error) {
+      const procType = processingType ? ` and processing type: ${processingType}` : '';
+      const msgError = `Failed to fetch scrape jobs by baseURL: ${decodedBaseURL}${procType}, ${error.message}`;
+      this.config.log.error(msgError);
+      throw new Error(msgError);
+    }
+  }
+}
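The option-merge step in `createScrapeJob` is a plain object spread: configuration defaults are spread first and request options second, so the caller's values win on conflicts. A standalone sketch of that behavior (the option names are made up for illustration):

```javascript
// Defaults as they might come from SCRAPE_JOB_CONFIGURATION (hypothetical names).
const configurationOptions = { enableJavascript: true, pageLoadTimeout: 10000 };

// Options supplied by the caller in the request data.
const requestOptions = { pageLoadTimeout: 30000 };

// The later spread wins, so the caller's value overrides the default
// while untouched defaults survive.
const mergedOptions = { ...configurationOptions, ...requestOptions };
// mergedOptions is { enableJavascript: true, pageLoadTimeout: 30000 }
```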
@@ -0,0 +1,235 @@
+/*
+ * Copyright 2024 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+import { ScrapeJob as ScrapeJobModel } from '@adobe/spacecat-shared-data-access';
+import { isValidUUID } from '@adobe/spacecat-shared-utils';
+
+/**
+ * Scrape Supervisor provides functionality to start and manage scrape jobs.
+ * @param {object} services - The services required by the handler.
+ * @param {DataAccess} services.dataAccess - Data access.
+ * @param {object} services.sqs - AWS Simple Queue Service client.
+ * @param {object} services.s3 - AWS S3 client and related helpers.
+ * @param {object} services.log - Logger.
+ * @param {object} config - Scrape configuration details.
+ * @param {Array<string>} config.queues - Array of available scrape queues.
+ * @param {string} config.scrapeWorkerQueue - URL of the scrape worker queue.
+ * @returns {object} Scrape Supervisor.
+ */
+function ScrapeJobSupervisor(services, config) {
+  const {
+    dataAccess, sqs, log,
+  } = services;
+
+  const { ScrapeJob } = dataAccess;
+
+  const {
+    queues = [], // Array of scrape queues
+    scrapeWorkerQueue, // URL of the scrape worker queue
+  } = config;
+
+  /**
+   * Get the queue with the least number of messages.
+   */
+  async function getAvailableScrapeQueue() {
+    const countMessages = async (queue) => {
+      const count = await sqs.getQueueMessageCount(queue);
+      return { queue, count };
+    };
+
+    const arrProm = queues.map(
+      (queue) => countMessages(queue),
+    );
+    const queueMessageCounts = await Promise.all(arrProm);
+
+    if (queueMessageCounts.length === 0) {
+      return null;
+    }
+
+    // get the queue with the lowest number of messages
+    const queueWithLeastMessages = queueMessageCounts.reduce(
+      (min, current) => (min.count < current.count ? min : current),
+    );
+    log.info(`Queue with least messages: ${queueWithLeastMessages.queue}`);
+    return queueWithLeastMessages.queue;
+  }
+
+  function determineBaseURL(urls) {
+    // Initially, we will just use the domain of the first URL
+    const url = new URL(urls[0]);
+    return `${url.protocol}//${url.hostname}`;
+  }
+
+  /**
+   * Create a new scrape job by claiming one of the free scrape queues, persisting the scrape job
+   * metadata, and setting the job status to 'RUNNING'.
+   * @param {Array<string>} urls - The list of URLs to scrape.
+   * @param {string} scrapeQueueId - Name of the queue to use for this scrape job.
+   * @param {string} processingType - The scrape handler to be used for the scrape job.
+   * @param {object} options - Client provided options for the scrape job.
+   * @param {object} customHeaders - Custom headers to be sent with each request.
+   * @returns {Promise<ScrapeJob>}
+   */
+  async function createNewScrapeJob(
+    urls,
+    scrapeQueueId,
+    processingType,
+    options,
+    customHeaders = null,
+  ) {
+    const jobData = {
+      baseURL: determineBaseURL(urls),
+      scrapeQueueId,
+      processingType,
+      options,
+      urlCount: urls.length,
+      status: ScrapeJobModel.ScrapeJobStatus.RUNNING,
+      customHeaders,
+    };
+    log.info(`Creating a new scrape job. Job data: ${JSON.stringify(jobData)}`);
+    return ScrapeJob.create(jobData);
+  }
+
+  /**
+   * Get all scrape jobs between the specified start and end dates.
+   * @param {string} startDate - The start date of the range.
+   * @param {string} endDate - The end date of the range.
+   * @returns {Promise<ScrapeJob[]>}
+   */
+  async function getScrapeJobsByDateRange(startDate, endDate) {
+    return ScrapeJob.allByDateRange(startDate, endDate);
+  }
+
+  /**
+   * Get all scrape jobs by baseURL.
+   * @param {string} baseURL - The baseURL of the jobs to fetch.
+   * @returns {Promise<ScrapeJob[]>}
+   */
+  async function getScrapeJobsByBaseURL(baseURL) {
+    return ScrapeJob.allByBaseURL(baseURL);
+  }
+
+  /**
+   * Get all scrape jobs by baseURL and processing type.
+   * @param {string} baseURL - The baseURL of the jobs to fetch.
+   * @param {string} processingType - The processing type of the jobs to fetch.
+   * @returns {Promise<ScrapeJob[]>}
+   */
+  async function getScrapeJobsByBaseURLAndProcessingType(baseURL, processingType) {
+    return ScrapeJob.allByBaseURLAndProcessingType(baseURL, processingType);
+  }
+
+  /**
+   * Queue all URLs as a single message for processing by another function. This will enable
+   * the controller to respond with a new job ID ASAP, while the individual URLs are queued up
+   * asynchronously.
+   * @param {Array<string>} urls - Array of URL records to queue.
+   * @param {object} scrapeJob - The scrape job record.
+   * @param {object} customHeaders - Optional custom headers to be sent with each request.
+   */
+  async function queueUrlsForScrapeWorker(urls, scrapeJob, customHeaders) {
+    log.info(`Starting a new scrape job of baseUrl: ${scrapeJob.getBaseURL()} with ${urls.length}`
+      + ` URLs. This new job has claimed: ${scrapeJob.getScrapeQueueId()} `
+      + `(jobId: ${scrapeJob.getId()})`);
+
+    const options = scrapeJob.getOptions();
+    const processingType = scrapeJob.getProcessingType();
+
+    // Send a single message containing all URLs and the new job ID
+    const message = {
+      processingType,
+      jobId: scrapeJob.getId(),
+      urls,
+      customHeaders,
+      options,
+    };
+
+    await sqs.sendMessage(scrapeWorkerQueue, message);
+  }
+
+  /**
+   * Starts a new scrape job.
+   * @param {Array<string>} urls - The URLs to scrape.
+   * @param {string} processingType - The scrape handler to be used for the scrape job.
+   * @param {object} options - Optional configuration params for the scrape job.
+   * @param {object} customHeaders - Optional custom headers to be sent with each request.
+   * @returns {Promise<ScrapeJob>} newly created job object
+   */
+  async function startNewJob(
+    urls,
+    processingType,
+    options,
+    customHeaders,
+  ) {
+    // Determine if there is a free scrape queue
+    const scrapeQueueId = await getAvailableScrapeQueue();
+
+    if (scrapeQueueId === null) {
+      throw new Error('Service Unavailable: No scrape queue available');
+    }
+
+    // If a queue is available, create the scrape-job record in dataAccess:
+    const newScrapeJob = await createNewScrapeJob(
+      urls,
+      scrapeQueueId,
+      processingType,
+      options,
+      customHeaders,
+    );
+
+    log.info(
+      'New scrape job created:\n'
+      + `- baseUrl: ${newScrapeJob.getBaseURL()}\n`
+      + `- urlCount: ${urls.length}\n`
+      + `- jobId: ${newScrapeJob.getId()}\n`
+      + `- scrapeQueueId: ${scrapeQueueId}\n`
+      + `- customHeaders: ${JSON.stringify(customHeaders)}\n`
+      + `- options: ${JSON.stringify(options)}`,
+    );
+
+    // Queue all URLs for scrape as a single message. This enables the controller to respond with
+    // a job ID ASAP, while the individual URLs are queued up asynchronously by another function.
+    await queueUrlsForScrapeWorker(urls, newScrapeJob, customHeaders);
+
+    return newScrapeJob;
+  }
+
+  /**
+   * Get a scrape job from the data layer.
+   * @param {string} jobId - The ID of the job.
+   * @returns {Promise<ScrapeJob>} requested scrape job object
+   */
+  async function getScrapeJob(jobId) {
+    if (!isValidUUID(jobId)) {
+      throw new Error('jobId must be a valid UUID');
+    }
+
+    try {
+      return await ScrapeJob.findById(jobId);
+    } catch (error) {
+      if (error.message.includes('Not found')) {
+        return null;
+      }
+      throw error;
+    }
+  }
+
+  return {
+    startNewJob,
+    getScrapeJob,
+    getScrapeJobsByDateRange,
+    getScrapeJobsByBaseURL,
+    getScrapeJobsByBaseURLAndProcessingType,
+  };
+}
+
+export default ScrapeJobSupervisor;
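The queue-selection logic in `getAvailableScrapeQueue` reduces the per-queue message counts to the least-loaded entry. The same reduce, run against stubbed counts instead of live SQS calls (the queue names are illustrative):

```javascript
// Stubbed message counts; in the supervisor these come from
// sqs.getQueueMessageCount for each configured queue.
const queueMessageCounts = [
  { queue: 'scrape-queue-a', count: 12 },
  { queue: 'scrape-queue-b', count: 3 },
  { queue: 'scrape-queue-c', count: 7 },
];

// Keep whichever entry has the lower count; on a tie the earlier queue wins.
const queueWithLeastMessages = queueMessageCounts.reduce(
  (min, current) => (min.count < current.count ? min : current),
);
// queueWithLeastMessages.queue is 'scrape-queue-b'
```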
@@ -0,0 +1,35 @@
+/*
+ * Copyright 2024 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+/**
+ * Data transfer object for a Scrape Job.
+ */
+export const ScrapeJobDto = {
+  /**
+   * Converts a Scrape Job object into a JSON object.
+   */
+  toJSON: (scrapeJob) => ({
+    id: scrapeJob.getId(),
+    baseURL: scrapeJob.getBaseURL(),
+    processingType: scrapeJob.getProcessingType(),
+    options: scrapeJob.getOptions(),
+    startedAt: scrapeJob.getStartedAt(),
+    endedAt: scrapeJob.getEndedAt(),
+    duration: scrapeJob.getDuration(),
+    status: scrapeJob.getStatus(),
+    urlCount: scrapeJob.getUrlCount(),
+    successCount: scrapeJob.getSuccessCount(),
+    failedCount: scrapeJob.getFailedCount(),
+    redirectCount: scrapeJob.getRedirectCount(),
+    customHeaders: scrapeJob.getCustomHeaders(),
+  }),
+};
package/src/index.d.ts ADDED
@@ -0,0 +1,17 @@
+/*
+ * Copyright 2024 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+import type { ScrapeClient } from './clients';
+
+export {
+  ScrapeClient,
+};
package/src/index.js ADDED
@@ -0,0 +1,17 @@
+/*
+ * Copyright 2025 Adobe. All rights reserved.
+ * This file is licensed to you under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License. You may obtain a copy
+ * of the License at http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under
+ * the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
+ * OF ANY KIND, either express or implied. See the License for the specific language
+ * governing permissions and limitations under the License.
+ */
+
+import ScrapeClient from './clients/scrape-client.js';
+
+export {
+  ScrapeClient,
+};