@govtechsg/oobee 0.10.93 → 0.10.94

Files changed (61)
  1. package/AGENTS.md +20 -0
  2. package/dist/cli.js +3 -2
  3. package/dist/combine.js +3 -3
  4. package/dist/constants/common.js +119 -52
  5. package/dist/crawlers/commonCrawlerFunc.js +11 -2
  6. package/dist/crawlers/crawlDomain.js +4 -6
  7. package/dist/crawlers/crawlSitemap.js +14 -2
  8. package/dist/crawlers/custom/utils.js +22 -9
  9. package/dist/crawlers/guards/urlGuard.js +19 -1
  10. package/dist/static/ejs/partials/components/allIssues/CategoryBadges.ejs +3 -0
  11. package/dist/static/ejs/partials/components/allIssues/IssuesTable.ejs +3 -3
  12. package/dist/static/ejs/partials/components/header/aboutScanModal/AboutScanModal.ejs +1 -1
  13. package/dist/static/ejs/partials/components/header/aboutScanModal/ScanConfiguration.ejs +3 -3
  14. package/dist/static/ejs/partials/components/header/aboutScanModal/ScanDetails.ejs +34 -27
  15. package/dist/static/ejs/partials/components/ruleModal/ruleOffcanvas.ejs +1 -0
  16. package/dist/static/ejs/partials/components/scannedPagesSegmentedTabs.ejs +7 -0
  17. package/dist/static/ejs/partials/components/wcagCoverageDetails.ejs +5 -5
  18. package/dist/static/ejs/partials/scripts/header/aboutScanModal/AboutScanModal.ejs +3 -3
  19. package/dist/static/ejs/partials/scripts/prioritiseIssues/PrioritiseIssues.ejs +21 -19
  20. package/dist/static/ejs/partials/scripts/ruleModal/pageAccordionBuilder.ejs +39 -8
  21. package/dist/static/ejs/partials/scripts/scannedPagesSegmentedTabs.ejs +11 -5
  22. package/dist/static/ejs/partials/scripts/screenshotLightbox.ejs +49 -31
  23. package/dist/static/ejs/partials/styles/header/SiteInfo.ejs +1 -1
  24. package/dist/static/ejs/partials/styles/header/aboutScanModal/ScanDetails.ejs +36 -16
  25. package/dist/static/ejs/partials/styles/prioritiseIssues/PrioritiseIssues.ejs +22 -1
  26. package/dist/static/ejs/partials/styles/styles.ejs +1 -1
  27. package/dist/static/ejs/partials/styles/wcagCompliance/WcagGaugeBar.ejs +6 -0
  28. package/dist/static/ejs/partials/styles/wcagCompliance.ejs +5 -4
  29. package/dist/static/ejs/partials/styles/wcagCoverageDetails.ejs +6 -1
  30. package/oobee-client-scanner.js +2 -2
  31. package/package.json +1 -1
  32. package/src/cli.ts +3 -2
  33. package/src/combine.ts +3 -2
  34. package/src/constants/common.ts +112 -36
  35. package/src/crawlers/commonCrawlerFunc.ts +11 -2
  36. package/src/crawlers/crawlDomain.ts +4 -5
  37. package/src/crawlers/crawlSitemap.ts +19 -2
  38. package/src/crawlers/custom/utils.ts +26 -13
  39. package/src/crawlers/guards/urlGuard.ts +18 -1
  40. package/src/static/ejs/partials/components/allIssues/CategoryBadges.ejs +3 -0
  41. package/src/static/ejs/partials/components/allIssues/IssuesTable.ejs +3 -3
  42. package/src/static/ejs/partials/components/header/aboutScanModal/AboutScanModal.ejs +1 -1
  43. package/src/static/ejs/partials/components/header/aboutScanModal/ScanConfiguration.ejs +3 -3
  44. package/src/static/ejs/partials/components/header/aboutScanModal/ScanDetails.ejs +34 -27
  45. package/src/static/ejs/partials/components/ruleModal/ruleOffcanvas.ejs +1 -0
  46. package/src/static/ejs/partials/components/scannedPagesSegmentedTabs.ejs +7 -0
  47. package/src/static/ejs/partials/components/wcagCoverageDetails.ejs +5 -5
  48. package/src/static/ejs/partials/scripts/header/aboutScanModal/AboutScanModal.ejs +3 -3
  49. package/src/static/ejs/partials/scripts/prioritiseIssues/PrioritiseIssues.ejs +21 -19
  50. package/src/static/ejs/partials/scripts/ruleModal/pageAccordionBuilder.ejs +39 -8
  51. package/src/static/ejs/partials/scripts/scannedPagesSegmentedTabs.ejs +11 -5
  52. package/src/static/ejs/partials/scripts/screenshotLightbox.ejs +49 -31
  53. package/src/static/ejs/partials/styles/header/SiteInfo.ejs +1 -1
  54. package/src/static/ejs/partials/styles/header/aboutScanModal/ScanDetails.ejs +36 -16
  55. package/src/static/ejs/partials/styles/prioritiseIssues/PrioritiseIssues.ejs +22 -1
  56. package/src/static/ejs/partials/styles/styles.ejs +1 -1
  57. package/src/static/ejs/partials/styles/wcagCompliance/WcagGaugeBar.ejs +6 -0
  58. package/src/static/ejs/partials/styles/wcagCompliance.ejs +5 -4
  59. package/src/static/ejs/partials/styles/wcagCoverageDetails.ejs +6 -1
  60. package/testStaticJSScanner.html +1 -1
  61. /package/{7339fae5-e8ed-4b50-af13-317847620dbf.txt → 67e8137b-1939-4253-8f11-a82bc833cfcb.txt} +0 -0
package/AGENTS.md CHANGED
@@ -112,6 +112,10 @@ Important behaviors:
112
112
  - The crawler itself enforces `maxRequestsPerCrawl` by counting only successfully scanned pages
113
113
  - `constants.sitemapFetchedLinks` stores the total discovered count for `scanData.json` reporting
114
114
  - For sitemap indexes, child sitemaps are processed recursively
115
+ - Some sitemap XMLs include `<?xml-stylesheet ...?>` (XSL). In `getDataUsingPlaywright()`:
116
+ - Use `waitUntil: 'domcontentloaded'` (not `networkidle`) to avoid 60s timeouts caused by stylesheet/resource loading
117
+ - Prefer `response.text()` to capture raw XML before browser XSL transformation (preserves `<sitemapindex>` / `<urlset>` structure)
118
+ - Only fall back to DOM extraction when raw response text is unavailable
115
119
 
116
120
  ## Shared Mutable State
117
121
 
@@ -229,6 +233,12 @@ docker run oobee node dist/cli.js ...
229
233
 
230
234
  10. **Intermediate JSONL write safety + corruption tolerance** — `ItemsStore.appendPageItems()` requires strict serialization of writes per rule file to prevent interleaved corruption. It also enforces a strict text sanitization regex to filter out literal `\n` and `\r` control characters from website HTML inputs immediately after `JSON.stringify()`. This ensures no single JSON issue accidentally injects illegal implicit newline boundaries when writing to JSONL format. Maintain backward-compatible `fs.appendFile` queues over buffered WriteStreams to guarantee pipeline sync visibility. `ItemsStore.readRuleItems()` tolerates historical malformed lines via fallback skip logic.
231
235
 
236
+ 11. **`preNavigationHooks` and the Playwright header-rewrite warning** — `preNavigationHooks()` in `commonCrawlerFunc.ts` is always included in the crawler `preNavigationHooks` array (for both `crawlDomain` and `crawlSitemap`). The hook does two things:
237
+ - **Header rewriting**: only sets `crawlingContext.request.headers = extraHTTPHeaders` when `extraHTTPHeaders` is non-empty. Setting request headers causes Crawlee/Playwright to intercept every network request to rewrite them, which triggers `WARN Playwright Utils: Using other request methods than GET, rewriting headers and adding payloads has a high impact on performance`. This warning is expected for authenticated scans; it is suppressed for unauthenticated scans because `extraHTTPHeaders` stays empty (see pitfall 12 below).
238
+ - **Navigation wait**: always sets `gotoOptions.waitUntil = 'domcontentloaded'` and `gotoOptions.timeout = 30000` via **in-place object mutation**. Do NOT reassign the `gotoOptions` parameter (`gotoOptions = {...}`) — that only rebinds the local variable and does not propagate to Crawlee. `domcontentloaded` is used (not `networkidle`) to avoid indefinite hangs on sites with WebSockets, analytics polling, lazy-load beacons, or health-check pings that never quiet their network activity. Further page stability is handled by `waitForPageLoaded()` in each requestHandler and the DOM mutation observer in `postNavigationHooks`.
239
+
240
+ 12. **`extraHTTPHeaders` must not be mutated before being passed to crawlers** — `checkUrlConnectivityWithBrowser()` in `common.ts` needs an `Accept` header for its own connectivity check but must NOT add it to the shared `extraHTTPHeaders` object. Mutating the shared object causes crawlers to see a non-empty `extraHTTPHeaders` (at minimum `{ Accept: '...' }`), which silently triggers header rewriting and the Playwright performance warning for every unauthenticated scan. Always use a local copy: `const localHeaders = { ...extraHTTPHeaders }; localHeaders.Accept ||= '...';`.
241
+
232
242
  ## Testing Considerations
233
243
 
234
244
  When making changes, validate these areas which have well-established edge cases:
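Pitfalls 11 and 12 above both come down to JavaScript reference semantics. The following is a minimal, self-contained sketch with no Crawlee or Playwright involved; `GotoOptions`, `hookByReassignment`, `hookByMutation`, and `withAcceptHeader` are hypothetical names used only for illustration, and the `Accept` value is shortened.

```typescript
// Illustrative stand-in for the options object a crawler hands to a hook.
type GotoOptions = { waitUntil?: string; timeout?: number };

// Reassigning the parameter only rebinds the local variable;
// the caller's object is left untouched.
function hookByReassignment(gotoOptions: GotoOptions): void {
  gotoOptions = { waitUntil: 'domcontentloaded', timeout: 30000 };
  void gotoOptions;
}

// Mutating in place changes the object the caller still holds.
function hookByMutation(gotoOptions: GotoOptions): void {
  gotoOptions.waitUntil = 'domcontentloaded';
  gotoOptions.timeout = 30000;
}

// Copy the shared headers before adding Accept, so the shared object
// stays empty for unauthenticated scans.
function withAcceptHeader(shared: Record<string, string>): Record<string, string> {
  const localHeaders = { ...shared };
  localHeaders.Accept ||= 'text/html'; // shortened value, for illustration only
  return localHeaders;
}

const reassigned: GotoOptions = {};
hookByReassignment(reassigned);

const mutated: GotoOptions = {};
hookByMutation(mutated);

const sharedHeaders: Record<string, string> = {};
const localCopy = withAcceptHeader(sharedHeaders);

console.log(reassigned, mutated, Object.keys(sharedHeaders).length, localCopy.Accept);
```

In this sketch only `mutated` picks up the new options, and `sharedHeaders` stays empty after the call, which is the property the crawler-side `Object.keys(extraHTTPHeaders).length > 0` check relies on.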
@@ -260,6 +270,16 @@ When making changes, validate these areas which have well-established edge cases
260
270
  - `document.title` must be captured at the START of `runAxeScript()`, before axe scanning or screenshot capture. Pages can close during these operations (timeout, navigation, crash). Never create a new page just to re-navigate for the title — this leaks pages.
261
271
  - The URL guard script in custom flow must be defensive against pages that close unexpectedly. All page event handlers should handle closed contexts gracefully.
262
272
 
273
+ ### URL Guard & Overlay Management in Custom Flow
274
+
275
+ `src/crawlers/guards/urlGuard.ts` — attached via `addUrlGuardScript()` in `runCustom.ts`:
276
+
277
+ - **`restoreToSafeUrl` must validate the safe URL before calling `page.goto()`**. If the entry URL is `file://` (e.g. `-u '/path/to/report.html'`), `fallbackUrl` is also `file://`. Redirecting to it fires another `framenavigated` for `file://`, which re-triggers `restoreToSafeUrl` → infinite reload loop. Always check `ALLOWED_PROTOCOLS.has(safeObj.protocol)` before navigating; if the fallback is not http/https, return without redirecting.
278
+
279
+ - **`about:` protocol must be skipped in `framenavigated`**. Chromium fires `framenavigated` for `about:blank` as a transient intermediate state during every `page.goto()` call. Intercepting it and calling `restoreToSafeUrl` → `page.goto(safeUrl)` → `about:blank` → `restoreToSafeUrl` → … creates a second infinite loop. Always `return` early when `urlObj.protocol === 'about:'`.
280
+
281
+ - **`reconcileOverlayMenu` must not remove the overlay on macOS/Windows**. On `darwin`/`win32` the custom flow runs headful. When `isOverlayAllowed` returns `false` (e.g. transient `file://` or `about:blank` URL), do **not** call `removeOverlayMenu` — the URL guard will redirect back to the safe URL momentarily. Instead, fall through to the `hasOverlay` / `addOverlayMenu` block so the overlay is (re-)injected regardless of the current URL protocol. On Linux/Docker (headless) the removal behaviour is unchanged.
282
+
263
283
  ### Proxy & Network
264
284
  - Proxy detection must handle `ALL_PROXY` on Windows. The proxy resolution logic should be tested on all platforms.
265
285
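The two URL guard loops described in the AGENTS.md notes above can be summarised as a small decision function. This is a simplified sketch, not the code in `urlGuard.ts`: `decideGuardAction` and the `GuardAction` labels are hypothetical, the real guard applies further checks to allowed URLs, and the contents of `ALLOWED_PROTOCOLS` are assumed to be http and https as the notes imply.

```typescript
// Assumed to mirror the ALLOWED_PROTOCOLS set named in the notes.
const ALLOWED_PROTOCOLS = new Set<string>(['http:', 'https:']);

// Hypothetical labels for what a framenavigated handler might do next.
type GuardAction = 'ignore' | 'allow' | 'restore' | 'skip-redirect';

function decideGuardAction(navigatedUrl: string, safeUrl: string): GuardAction {
  const urlObj = new URL(navigatedUrl);

  // about:blank is a transient state during page.goto(); reacting to it loops.
  if (urlObj.protocol === 'about:') return 'ignore';

  if (ALLOWED_PROTOCOLS.has(urlObj.protocol)) return 'allow';

  // Only redirect when the fallback itself is http/https; redirecting to a
  // file:// fallback would fire framenavigated again and loop.
  const safeObj = new URL(safeUrl);
  return ALLOWED_PROTOCOLS.has(safeObj.protocol) ? 'restore' : 'skip-redirect';
}

console.log(decideGuardAction('about:blank', 'https://example.com/'));
console.log(decideGuardAction('file:///tmp/report.html', 'file:///tmp/report.html'));
```

Keeping the decision separate from the `page.goto()` call makes both loop conditions testable without a browser.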
 
package/dist/cli.js CHANGED
@@ -199,9 +199,10 @@ const scanInit = async (argvs) => {
199
199
  if (res.httpStatus)
200
200
  consoleLogger.info(`Connectivity Check HTTP Response Code: ${res.httpStatus}`);
201
201
  if (res.status === statuses.success.code) {
202
- // Custom flow should continue from the user-provided entry URL so auth redirects
203
- // do not replace the original domain used for overlay gating and navigation.
202
+ // Keep browser-resolved URL as entryUrl for downstream scan metadata/events
203
+ // on non-custom scans.
204
204
  if (data.type !== ScannerTypes.CUSTOM) {
205
+ data.entryUrl = res.url;
205
206
  data.url = res.url;
206
207
  }
207
208
  if (process.env.OOBEE_VALIDATE_URL) {
package/dist/combine.js CHANGED
@@ -23,7 +23,7 @@ export class ViewportSettingsClass {
23
23
  }
24
24
  const combineRun = async (details, deviceToScan) => {
25
25
  const envDetails = { ...details };
26
- const { type, url, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
26
+ const { type, url, entryUrl, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
27
27
  specifiedMaxConcurrency, // Slow scan mode: if checked, = '1'
28
28
  fileTypes, blacklistedPatternsFilename, includeScreenshots, // Include screenshots: if checked, = 'true'
29
29
  followRobots, // Adhere to robots.txt: if checked, = 'true'
@@ -59,8 +59,8 @@ const combineRun = async (details, deviceToScan) => {
59
59
  }
60
60
  // remove basic-auth credentials from URL
61
61
  const finalUrl = !(type === ScannerTypes.SITEMAP || type === ScannerTypes.LOCALFILE)
62
- ? new URL(url)
63
- : new URL(pathToFileURL(url));
62
+ ? new URL(entryUrl)
63
+ : new URL(pathToFileURL(entryUrl));
64
64
  // Use the string version of finalUrl to reduce logic at submitForm
65
65
  const finalUrlString = finalUrl.toString();
66
66
  const scanDetails = {
package/dist/constants/common.js CHANGED
@@ -292,15 +292,18 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
292
292
  return res;
293
293
  }
294
294
  }
295
- // Ensure Accept header for non-html content fallback
296
- extraHTTPHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
295
+ // Ensure Accept header for non-html content fallback — use a local copy to avoid
296
+ // mutating the caller's extraHTTPHeaders object (which is later checked by crawlers
297
+ // to decide whether to enable preNavigationHooks header rewriting).
298
+ const localHeaders = { ...extraHTTPHeaders };
299
+ localHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
297
300
  await initModifiedUserAgent(browserToRun, playwrightDeviceDetailsObject, clonedDataDir);
298
301
  let browserContext;
299
302
  let browserInstance;
300
303
  const rawDevice = (playwrightDeviceDetailsObject || {});
301
304
  const { viewport, isMobile, hasTouch, userAgent: deviceUserAgent, ...restDevice } = rawDevice;
302
305
  const launchOptions = getPlaywrightLaunchOptions(browserToRun);
303
- const { Authorization, ...nonAuthHeaders } = extraHTTPHeaders || {};
306
+ const { Authorization, ...nonAuthHeaders } = localHeaders || {};
304
307
  let httpCredentials = undefined;
305
308
  if (Authorization?.startsWith('Basic ')) {
306
309
  const decoded = Buffer.from(Authorization.slice(6), 'base64').toString();
@@ -355,21 +358,23 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
355
358
  // Only enable generic Authorization header routing interception broadly if
356
359
  // a non-Basic Bearer auth string is heavily relied upon, thereby bypassing
357
360
  // performance warnings inside the check checkUrl phase for typical public scans
358
- if (Authorization && !httpCredentials) {
359
- const entryOrigin = new URL(url).origin;
360
- await browserContext.route('**/*', async (route, request) => {
361
- try {
362
- if (new URL(request.url()).origin === entryOrigin) {
363
- await route.continue({ headers: { ...request.headers(), Authorization } });
361
+ if (Object.keys(localHeaders).length > 0) {
362
+ if (Authorization && !httpCredentials) {
363
+ const entryOrigin = new URL(url).origin;
364
+ await browserContext.route('**/*', async (route, request) => {
365
+ try {
366
+ if (new URL(request.url()).origin === entryOrigin) {
367
+ await route.continue({ headers: { ...request.headers(), Authorization } });
368
+ }
369
+ else {
370
+ await route.continue();
371
+ }
364
372
  }
365
- else {
373
+ catch {
366
374
  await route.continue();
367
375
  }
368
- }
369
- catch {
370
- await route.continue();
371
- }
372
- });
376
+ });
377
+ }
373
378
  }
374
379
  const page = await browserContext.newPage();
375
380
  // Block native Chrome download UI
@@ -491,7 +496,7 @@ export const isSitemapContent = (content) => {
491
496
  return true;
492
497
  }
493
498
  const regexForHtml = new RegExp('<(?:!doctype html|html|head|body)+?>', 'gmi');
494
- const regexForXmlSitemap = new RegExp('<(?:urlset|feed|rss)+?.*>', 'gmi');
499
+ const regexForXmlSitemap = new RegExp('<(?:urlset|sitemapindex|feed|rss)+?.*>', 'gmi');
495
500
  if (content.match(regexForHtml) && content.match(regexForXmlSitemap)) {
496
501
  // is an XML sitemap wrapped in a HTML document
497
502
  return true;
@@ -505,7 +510,18 @@ export const isSitemapContent = (content) => {
505
510
  return false;
506
511
  };
507
512
  export const checkUrl = async (scanner, url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders, fileTypes) => {
508
- const res = await checkUrlConnectivityWithBrowser(url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
513
+ let urlToCheck = url;
514
+ if (scanner === ScannerTypes.LOCALFILE) {
515
+ if (!isFilePath(url)) {
516
+ const res = new RES();
517
+ res.status = constants.urlCheckStatuses.notALocalFile.code;
518
+ return res;
519
+ }
520
+ if (!url.toLowerCase().startsWith('file://')) {
521
+ urlToCheck = pathToFileURL(path.resolve(url)).toString();
522
+ }
523
+ }
524
+ const res = await checkUrlConnectivityWithBrowser(urlToCheck, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
509
525
  // If response is 200 (meaning no other code was set earlier)
510
526
  if (res.status === constants.urlCheckStatuses.success.code) {
511
527
  // Check if document is pdf type
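The local-file branch added to `checkUrl` above normalises plain paths to `file://` URLs before the connectivity check. A minimal sketch of that normalisation on its own, using Node's `path` and `url` modules; `toFileUrl` is a hypothetical helper name and the sketch leaves out the `isFilePath` validation that precedes it in the diff.

```typescript
import * as path from 'node:path';
import { pathToFileURL } from 'node:url';

// Leave existing file:// URLs untouched; resolve anything else against the
// current working directory and convert it to a file:// URL string.
function toFileUrl(input: string): string {
  if (input.toLowerCase().startsWith('file://')) {
    return input;
  }
  return pathToFileURL(path.resolve(input)).toString();
}

console.log(toFileUrl('file:///tmp/report.html'));
console.log(toFileUrl('report.html'));
```

The second call prints an absolute `file://` URL whose exact value depends on the working directory and platform.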
@@ -552,7 +568,7 @@ export const prepareData = async (argv) => {
552
568
  if (isEmptyObject(argv)) {
553
569
  throw Error('No inputs should be provided');
554
570
  }
555
- let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, } = argv;
571
+ let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, finalUrl, } = argv;
556
572
  const extraHTTPHeaders = parseHeaders(header);
557
573
  // Set default username and password for basic auth
558
574
  let username = '';
@@ -578,6 +594,9 @@ export const prepareData = async (argv) => {
578
594
  temp.password = '';
579
595
  url = temp.toString();
580
596
  }
597
+ // Keep browser-resolved URL (if provided by pre-check flow) as canonical entry URL.
598
+ // For local file paths, keep using the normalized `url` value below.
599
+ const resolvedEntryUrl = finalUrl && !isFilePath(finalUrl) ? finalUrl : url;
581
600
  // construct filename for scan results
582
601
  const [date, time] = new Date().toLocaleString('sv').replaceAll(/-|:/g, '').split(' ');
583
602
  const domain = isLocalFileScan ? path.basename(url) : new URL(url).hostname;
@@ -605,7 +624,7 @@ export const prepareData = async (argv) => {
605
624
  return {
606
625
  type: scanner,
607
626
  url,
608
- entryUrl: url,
627
+ entryUrl: resolvedEntryUrl,
609
628
  isHeadless: headless,
610
629
  deviceChosen,
611
630
  customDevice,
@@ -810,6 +829,7 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
810
829
  const scannedSitemaps = new Set();
811
830
  const sitemapLinkCounts = {};
812
831
  const allUrls = new Set(); // all discovered URLs (lightweight strings)
832
+ const isImageSitemapUrl = (candidateUrl) => /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);
813
833
  const addToUrlList = (url) => {
814
834
  if (!url)
815
835
  return;
@@ -880,6 +900,10 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
880
900
  const fetchUrls = async (url, extraHTTPHeaders) => {
881
901
  let data;
882
902
  let sitemapType;
903
+ if (isImageSitemapUrl(url)) {
904
+ consoleLogger.info(`Skipping image sitemap: ${url}`);
905
+ return;
906
+ }
883
907
  if (scannedSitemaps.has(url)) {
884
908
  // Skip processing if the sitemap has already been scanned
885
909
  return;
@@ -926,27 +950,45 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
926
950
  });
927
951
  }
928
952
  const page = await browserContext.newPage();
929
- await page.goto(url, { waitUntil: 'networkidle', timeout: 60000 });
930
- if ((await page.locator('body').count()) > 0) {
931
- data = await page.locator('body').innerText();
932
- }
933
- else {
934
- const urlSet = page.locator('urlset');
935
- const sitemapIndex = page.locator('sitemapindex');
936
- const rss = page.locator('rss');
937
- const feed = page.locator('feed');
938
- const isRoot = async (locator) => (await locator.count()) > 0;
939
- if (await isRoot(urlSet)) {
940
- data = await urlSet.evaluate(elem => elem.outerHTML);
953
+ // Use 'domcontentloaded' instead of 'networkidle': sitemap XMLs with
954
+ // XSL stylesheet references (e.g. <?xml-stylesheet ...?>) cause the browser
955
+ // to fetch and apply the stylesheet, which may load additional resources
956
+ // (fonts, CSS, images) that prevent 'networkidle' from ever being reached.
957
+ const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
958
+ // Prefer the raw response body — this gives us the original XML before
959
+ // the browser applies any XSL transformation (which would turn the XML
960
+ // into rendered HTML, losing the sitemap structure).
961
+ if (response) {
962
+ try {
963
+ data = await response.text();
941
964
  }
942
- else if (await isRoot(sitemapIndex)) {
943
- data = await sitemapIndex.evaluate(elem => elem.outerHTML);
965
+ catch {
966
+ // response.text() can fail if the body was already consumed or
967
+ // if a redirect occurred; fall through to DOM extraction below.
944
968
  }
945
- else if (await isRoot(rss)) {
946
- data = await rss.evaluate(elem => elem.outerHTML);
969
+ }
970
+ if (!data) {
971
+ if ((await page.locator('body').count()) > 0) {
972
+ data = await page.locator('body').innerText();
947
973
  }
948
- else if (await isRoot(feed)) {
949
- data = await feed.evaluate(elem => elem.outerHTML);
974
+ else {
975
+ const urlSet = page.locator('urlset');
976
+ const sitemapIndex = page.locator('sitemapindex');
977
+ const rss = page.locator('rss');
978
+ const feed = page.locator('feed');
979
+ const isRoot = async (locator) => (await locator.count()) > 0;
980
+ if (await isRoot(urlSet)) {
981
+ data = await urlSet.evaluate(elem => elem.outerHTML);
982
+ }
983
+ else if (await isRoot(sitemapIndex)) {
984
+ data = await sitemapIndex.evaluate(elem => elem.outerHTML);
985
+ }
986
+ else if (await isRoot(rss)) {
987
+ data = await rss.evaluate(elem => elem.outerHTML);
988
+ }
989
+ else if (await isRoot(feed)) {
990
+ data = await feed.evaluate(elem => elem.outerHTML);
991
+ }
950
992
  }
951
993
  }
952
994
  }
@@ -969,37 +1011,61 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
969
1011
  data = fs.readFileSync(url, 'utf8');
970
1012
  }
971
1013
  const $ = cheerio.load(data, { xml: true });
1014
+ const countBefore = allUrls.size;
972
1015
  // This case is when the document is not an XML format document
973
1016
  if ($(':root').length === 0) {
974
1017
  processNonStandardSitemap(data);
1018
+ const linksFromThisSitemap = allUrls.size - countBefore;
1019
+ if (linksFromThisSitemap > 0) {
1020
+ sitemapLinkCounts[url] = (sitemapLinkCounts[url] || 0) + linksFromThisSitemap;
1021
+ }
975
1022
  return;
976
1023
  }
977
1024
  // Root element
978
1025
  const root = $(':root')[0];
979
- const { xmlns } = root.attribs;
980
- const xmlFormatNamespace = '/schemas/sitemap';
981
- if (root.name === 'urlset' && xmlns.includes(xmlFormatNamespace)) {
1026
+ const hasImageNamespace = Object.values(root?.attribs ?? {}).some(attribVal => typeof attribVal === 'string' && attribVal.toLowerCase().includes('sitemap-image'));
1027
+ if (hasImageNamespace) {
1028
+ consoleLogger.info(`Skipping image sitemap: ${url}`);
1029
+ return;
1030
+ }
1031
+ const rootName = root?.name?.toLowerCase().split(':').pop() ?? '';
1032
+ const hasXmlSitemapIndexTag = /<\s*(?:[a-z0-9_-]+:)?sitemapindex\b/i.test(data);
1033
+ const hasXmlUrlsetTag = /<\s*(?:[a-z0-9_-]+:)?urlset\b/i.test(data);
1034
+ if (rootName === 'urlset') {
982
1035
  sitemapType = constants.xmlSitemapTypes.xml;
983
1036
  }
984
- else if (root.name === 'sitemapindex' && xmlns.includes(xmlFormatNamespace)) {
1037
+ else if (rootName === 'sitemapindex') {
985
1038
  sitemapType = constants.xmlSitemapTypes.xmlIndex;
986
1039
  }
987
- else if (root.name === 'rss') {
1040
+ else if (rootName === 'rss') {
988
1041
  sitemapType = constants.xmlSitemapTypes.rss;
989
1042
  }
990
- else if (root.name === 'feed') {
1043
+ else if (rootName === 'feed') {
991
1044
  sitemapType = constants.xmlSitemapTypes.atom;
992
1045
  }
1046
+ else if (hasXmlSitemapIndexTag) {
1047
+ sitemapType = constants.xmlSitemapTypes.xmlIndex;
1048
+ }
1049
+ else if (hasXmlUrlsetTag) {
1050
+ sitemapType = constants.xmlSitemapTypes.xml;
1051
+ }
993
1052
  else {
994
1053
  sitemapType = constants.xmlSitemapTypes.unknown;
995
1054
  }
996
- const countBefore = allUrls.size;
997
1055
  switch (sitemapType) {
998
1056
  case constants.xmlSitemapTypes.xmlIndex:
999
- consoleLogger.info(`This is a XML format sitemap index.`);
1057
+ consoleLogger.info(`This is a XML format sitemap index: ${url}`);
1000
1058
  for (const childSitemapUrl of $('loc')) {
1001
- const childSitemapUrlText = $(childSitemapUrl).text();
1002
- if (childSitemapUrlText.endsWith('.xml') || childSitemapUrlText.endsWith('.txt')) {
1059
+ const childSitemapUrlText = $(childSitemapUrl).text().trim();
1060
+ if (!childSitemapUrlText) {
1061
+ continue;
1062
+ }
1063
+ const childSitemapPath = childSitemapUrlText.split(/[?#]/)[0].toLowerCase();
1064
+ if (childSitemapPath.endsWith('.xml') || childSitemapPath.endsWith('.txt')) {
1065
+ if (isImageSitemapUrl(childSitemapUrlText)) {
1066
+ consoleLogger.info(`Skipping image sitemap: ${childSitemapUrlText}`);
1067
+ continue;
1068
+ }
1003
1069
  await fetchUrls(childSitemapUrlText, extraHTTPHeaders); // Recursive call for nested sitemaps
1004
1070
  }
1005
1071
  else {
@@ -1008,19 +1074,19 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
1008
1074
  }
1009
1075
  break;
1010
1076
  case constants.xmlSitemapTypes.xml:
1011
- consoleLogger.info(`This is a XML format sitemap.`);
1077
+ consoleLogger.info(`This is a XML format sitemap: ${url}`);
1012
1078
  await processXmlSitemap($, sitemapType, 'loc', 'lastmod', 'url');
1013
1079
  break;
1014
1080
  case constants.xmlSitemapTypes.rss:
1015
- consoleLogger.info(`This is a RSS format sitemap.`);
1081
+ consoleLogger.info(`This is a RSS format sitemap: ${url}`);
1016
1082
  await processXmlSitemap($, sitemapType, 'link', 'pubDate', 'item');
1017
1083
  break;
1018
1084
  case constants.xmlSitemapTypes.atom:
1019
- consoleLogger.info(`This is a Atom format sitemap.`);
1085
+ consoleLogger.info(`This is a Atom format sitemap: ${url}`);
1020
1086
  await processXmlSitemap($, sitemapType, 'link', 'published', 'entry');
1021
1087
  break;
1022
1088
  default:
1023
- consoleLogger.info(`This is an unrecognised XML sitemap format.`);
1089
+ consoleLogger.info(`This is an unrecognised XML sitemap format: ${url}`);
1024
1090
  processNonStandardSitemap(data);
1025
1091
  }
1026
1092
  const linksFromThisSitemap = allUrls.size - countBefore;
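The classification order in the hunk above (root element name first, then a raw-text regex fallback) and the image-sitemap filter can be exercised in isolation. In this sketch the regular expressions are copied from the diff, while `classifySitemap` and the string labels in `SitemapType` are hypothetical stand-ins for the inline logic and for `constants.xmlSitemapTypes`.

```typescript
type SitemapType = 'xml' | 'xmlIndex' | 'rss' | 'atom' | 'unknown';

// Matches image-sitemap.xml, image-sitemap-index.xml, image-sitemap-3.xml, etc.
const isImageSitemapUrl = (candidateUrl: string): boolean =>
  /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);

function classifySitemap(rootName: string | undefined, data: string): SitemapType {
  // Drop any namespace prefix, e.g. "sm:urlset" becomes "urlset".
  const localName = rootName?.toLowerCase().split(':').pop() ?? '';
  if (localName === 'urlset') return 'xml';
  if (localName === 'sitemapindex') return 'xmlIndex';
  if (localName === 'rss') return 'rss';
  if (localName === 'feed') return 'atom';

  // Fallback: the root is something else (for example an HTML wrapper),
  // so look for sitemap tags in the raw text instead.
  if (/<\s*(?:[a-z0-9_-]+:)?sitemapindex\b/i.test(data)) return 'xmlIndex';
  if (/<\s*(?:[a-z0-9_-]+:)?urlset\b/i.test(data)) return 'xml';
  return 'unknown';
}

console.log(classifySitemap('sm:urlset', ''));
console.log(isImageSitemapUrl('https://example.com/image-sitemap-2.xml'));
```

Unlike the removed code, this ordering no longer requires the `xmlns` attribute to contain `/schemas/sitemap`, so sitemaps with a missing or non-standard namespace are still classified.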
@@ -1836,7 +1902,8 @@ function isValidHttpUrl(urlString) {
1836
1902
  export const isFilePath = (url) => {
1837
1903
  const driveLetterPattern = /^[A-Z]:/i;
1838
1904
  const backslashPattern = /\\/;
1839
- return (url.startsWith('/') ||
1905
+ return (url.toLowerCase().startsWith('file://') ||
1906
+ url.startsWith('/') ||
1840
1907
  driveLetterPattern.test(url) ||
1841
1908
  backslashPattern.test(url) ||
1842
1909
  url.startsWith('./') ||
package/dist/crawlers/commonCrawlerFunc.js CHANGED
@@ -898,10 +898,19 @@ export const createCrawleeSubFolders = async (randomToken) => {
898
898
  export const preNavigationHooks = (extraHTTPHeaders) => {
899
899
  return [
900
900
  async (crawlingContext, gotoOptions) => {
901
- if (extraHTTPHeaders) {
901
+ if (extraHTTPHeaders && Object.keys(extraHTTPHeaders).length > 0) {
902
902
  crawlingContext.request.headers = extraHTTPHeaders;
903
903
  }
904
- gotoOptions = { waitUntil: 'networkidle', timeout: 30000 };
904
+ // Use domcontentloaded: it fires as soon as the DOM is parsed, before
905
+ // images/stylesheets/network requests settle. This avoids indefinite
906
+ // hangs on sites with WebSockets, analytics polling, or infinite-scroll
907
+ // beacons that never reach networkidle. Further page stability is
908
+ // handled by waitForPageLoaded() in each crawler's requestHandler and
909
+ // by the DOM mutation observer in postNavigationHooks.
910
+ if (gotoOptions) {
911
+ gotoOptions.waitUntil = 'domcontentloaded';
912
+ gotoOptions.timeout = 30000;
913
+ }
905
914
  },
906
915
  ];
907
916
  };
package/dist/crawlers/crawlDomain.js CHANGED
@@ -1,6 +1,6 @@
1
1
  import crawlee from 'crawlee';
2
2
  import { CrawlRateController } from './crawlRateController.js';
3
- import { createCrawleeSubFolders, getPreLaunchHook, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, splitAuthHeaders, } from './commonCrawlerFunc.js';
3
+ import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, splitAuthHeaders, } from './commonCrawlerFunc.js';
4
4
  import constants, { blackListedFileExtensions, guiInfoStatusTypes, cssQuerySelectors, STATUS_CODE_METADATA, disallowedListOfPatterns, disallowedSelectorPatterns, FileTypes, } from '../constants/constants.js';
5
5
  import { getPlaywrightLaunchOptions, isBlacklistedFileExtensions, isSkippedUrl, isDisallowedInRobotsTxt, getUrlsFromRobotsTxt, waitForPageLoaded, } from '../constants/common.js';
6
6
  import { areLinksEqual, isFollowStrategy, isSameHostname, normUrl, register } from '../utils.js';
@@ -301,12 +301,10 @@ const crawlDomain = async ({ url, randomToken, host: _host, viewportSettings, ma
301
301
  ],
302
302
  },
303
303
  requestQueue,
304
+ maxRequestRetries: 3,
305
+ maxSessionRotations: 1,
304
306
  preNavigationHooks: [
305
- async (crawlingContext) => {
306
- if (extraHTTPHeaders) {
307
- crawlingContext.request.headers = extraHTTPHeaders;
308
- }
309
- },
307
+ ...preNavigationHooks(extraHTTPHeaders),
310
308
  ],
311
309
  postNavigationHooks: [
312
310
  async (crawlingContext) => {
package/dist/crawlers/crawlSitemap.js CHANGED
@@ -1,6 +1,6 @@
1
1
  import crawlee, { EnqueueStrategy, RequestList } from 'crawlee';
2
2
  import { CrawlRateController } from './crawlRateController.js';
3
- import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, } from './commonCrawlerFunc.js';
3
+ import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, splitAuthHeaders, } from './commonCrawlerFunc.js';
4
4
  import constants, { STATUS_CODE_METADATA, guiInfoStatusTypes, disallowedListOfPatterns, FileTypes, } from '../constants/constants.js';
5
5
  import { getLinksFromSitemap, getPlaywrightLaunchOptions, isSkippedUrl, waitForPageLoaded, isFilePath, } from '../constants/common.js';
6
6
  import { areLinksEqual, isFollowStrategy, isWhitelistedContentType, normUrl, register } from '../utils.js';
@@ -13,6 +13,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  let durationExceeded = false;
  let isAbortingScan = false;
  const rateController = new CrawlRateController(maxRequestsPerCrawl, specifiedMaxConcurrency || constants.maxConcurrency);
+ const initialNoSuccessFailureAbortThreshold = Math.max(5, Math.min(maxRequestsPerCrawl, 25));
  if (fromCrawlIntelligentSitemap) {
  dataset = datasetFromIntelligent;
  urlsCrawled = urlsCrawledFromIntelligent;
@@ -33,6 +34,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  const isScanPdfs = [FileTypes.All, FileTypes.PdfOnly].includes(fileTypes);
  const { playwrightDeviceDetailsObject } = viewportSettings;
  const { maxConcurrency } = constants;
+ const { nonAuthHeaders, httpCredentials } = splitAuthHeaders(extraHTTPHeaders);
  const requestList = await RequestList.open({
  sources: linksFromSitemap,
  });
@@ -53,11 +55,15 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  ...playwrightDeviceDetailsObject,
  ...(process.env.OOBEE_USER_AGENT && { userAgent: process.env.OOBEE_USER_AGENT }),
  ...(process.env.OOBEE_DISABLE_BROWSER_DOWNLOAD && { acceptDownloads: false }),
+ ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
+ ...(httpCredentials && { httpCredentials }),
  };
  },
  ],
  },
  requestList,
+ maxRequestRetries: 3,
+ maxSessionRotations: 1,
  postNavigationHooks: [
  async ({ page }) => {
  try {
@@ -104,6 +110,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  },
  ],
  preNavigationHooks: [
+ ...preNavigationHooks(extraHTTPHeaders),
  async ({ request, page }, gotoOptions) => {
  const url = request.url.toLowerCase();
  const isNotSupportedDocument = disallowedListOfPatterns.some(pattern => url.startsWith(pattern));
@@ -114,7 +121,6 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  // console.log(`[SKIP] Not supported: ${request.url}`);
  return;
  }
- preNavigationHooks(extraHTTPHeaders);
  },
  ],
  requestHandlerTimeoutSecs: 90,
@@ -310,6 +316,12 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
  httpStatusCode: typeof status === 'number' ? status : 0,
  });
  crawlee.log.error(`Failed Request - ${request.url}: ${request.errorMessages}`);
+ if (urlsCrawled.scanned.length === 0 &&
+ urlsCrawled.error.length >= initialNoSuccessFailureAbortThreshold) {
+ consoleLogger.info(`Aborting sitemap crawl: ${urlsCrawled.error.length} failed pages with 0 successful scans.`);
+ isAbortingScan = true;
+ crawler.autoscaledPool?.abort();
+ }
  },
  maxRequestsPerCrawl: Infinity,
  maxConcurrency: specifiedMaxConcurrency || maxConcurrency,
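The early-abort guard added to the sitemap crawler above clamps its failure threshold to the range 5 to 25 and only fires while no page has been scanned successfully. The following is a minimal standalone sketch of that arithmetic; the helper names `abortThreshold` and `shouldAbort` are hypothetical and do not appear in the package.

```javascript
// Hypothetical helpers illustrating the clamp used for
// initialNoSuccessFailureAbortThreshold in the hunk above.
const abortThreshold = (maxRequestsPerCrawl) =>
  Math.max(5, Math.min(maxRequestsPerCrawl, 25));

// Abort only when nothing has been scanned successfully yet
// and the number of failed pages has reached the threshold.
const shouldAbort = (scannedCount, errorCount, threshold) =>
  scannedCount === 0 && errorCount >= threshold;

console.log(abortThreshold(3));   // 5  (lower bound applies)
console.log(abortThreshold(10));  // 10 (within range, unchanged)
console.log(abortThreshold(100)); // 25 (upper bound applies)
console.log(shouldAbort(0, 5, abortThreshold(3)));  // true
console.log(shouldAbort(1, 50, abortThreshold(3))); // false
```

With this clamp, a small scan still tolerates at least five failures before giving up, and a large scan stops after at most 25 consecutive failures with no success.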
@@ -1064,15 +1064,28 @@ export const initNewPage = async (page, pageClosePromises, processPageParams, pa
  return;
  const allowed = isOverlayAllowed(page.url(), processPageParams.entryUrl);
  if (!allowed) {
- await Promise.race([
- removeOverlayMenu(page),
- new Promise((_, reject) => {
- setTimeout(() => {
- reject(new Error(`removeOverlayMenu timed out after ${OVERLAY_OPERATION_TIMEOUT_MS}ms`));
- }, OVERLAY_OPERATION_TIMEOUT_MS);
- }),
- ]);
- return;
+ // On macOS and Windows the custom flow always runs headful.
+ // The URL guard (urlGuard.ts) intercepts non-http/https navigations
+ // and calls page.goto(safeUrl). Do NOT remove the overlay here —
+ // removing it causes it to stay permanently disabled if the redirect
+ // races ahead of the next reconcile cycle.
+ // Instead, fall through to the hasOverlay / addOverlayMenu block so
+ // the overlay is (re-)injected even on transient non-http/https URLs
+ // (e.g. file://, about:blank) and again after the guard's redirect.
+ const isDesktopHost = process.platform === 'darwin' || process.platform === 'win32';
+ if (!isDesktopHost) {
+ // On Linux / Docker: remove overlay for non-http/https URLs and stop.
+ await Promise.race([
+ removeOverlayMenu(page),
+ new Promise((_, reject) => {
+ setTimeout(() => {
+ reject(new Error(`removeOverlayMenu timed out after ${OVERLAY_OPERATION_TIMEOUT_MS}ms`));
+ }, OVERLAY_OPERATION_TIMEOUT_MS);
+ }),
+ ]);
+ return;
+ }
+ // Desktop hosts: skip removal and fall through to re-add overlay.
  }
  const hasOverlay = await page.evaluate(() => Boolean(document.querySelector('#oobeeShadowHost')));
  consoleLogger.info(`Overlay state (${trigger}): ${hasOverlay}`);
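The hunk above races `removeOverlayMenu(page)` against a timer that rejects. A generic sketch of that pattern follows; `withTimeout` is a hypothetical helper, not part of the package. Unlike the code in the diff, this sketch also clears the timer once the race settles, which is an addition for illustration only.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms`.
const withTimeout = (promise, ms, label) => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
  });
  // Whichever settles first wins; clear the timer either way so it
  // does not keep the event loop alive after the race is over.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Usage: a fast operation resolves normally, a stalled one rejects.
withTimeout(Promise.resolve('done'), 100, 'fastOp')
  .then((value) => console.log(value));
withTimeout(new Promise(() => {}), 20, 'stalledOp')
  .catch((err) => console.log(err.message));
```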
@@ -30,8 +30,20 @@ export function addUrlGuardScript(context, opts = {}) {
  // page may have closed before addInitScript completed; safe to ignore
  });
  const restoreToSafeUrl = async (page, attemptedUrl) => {
+ const safeUrl = lastAllowedUrlByPage.get(page) || fallbackUrl || 'about:blank';
+ // Only redirect if the safe URL is itself an allowed (http/https) URL.
+ // If the entry URL is file:// (e.g. scanning a local HTML file), the
+ // fallback is also file://, and redirecting would create an infinite loop:
+ // file:// → restoreToSafeUrl → file:// → framenavigated → restoreToSafeUrl → …
+ try {
+ const safeObj = new URL(safeUrl);
+ if (!ALLOWED_PROTOCOLS.has(safeObj.protocol))
+ return;
+ }
+ catch {
+ return;
+ }
  try {
- const safeUrl = lastAllowedUrlByPage.get(page) || fallbackUrl || 'about:blank';
  await page.goto(safeUrl, { waitUntil: 'domcontentloaded' });
  }
  catch {
@@ -53,6 +65,12 @@ export function addUrlGuardScript(context, opts = {}) {
  lastAllowedUrlByPage.set(page, urlObj.toString());
  return;
  }
+ // Skip browser-internal transitional states (about:blank, about:srcdoc, etc.).
+ // page.goto() navigates through about:blank before loading the target URL.
+ // Redirecting from about: creates an infinite loop:
+ // restoreToSafeUrl → page.goto(safeUrl) → about:blank → restoreToSafeUrl → …
+ if (urlObj.protocol === 'about:')
+ return;
  await restoreToSafeUrl(page, urlStr);
  });
  };
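The two guards added to the URL guard above both come down to checking a URL's protocol before redirecting. Here is a minimal sketch of that check, assuming `ALLOWED_PROTOCOLS` holds `http:` and `https:` (the diff's comments suggest this, but the set's definition is not shown here). `isSafeRedirectTarget` is a hypothetical name.

```javascript
// Assumed contents; the real ALLOWED_PROTOCOLS is defined outside this hunk.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:']);

// Hypothetical helper: true only when `candidate` parses as a URL
// whose protocol is in the allowed set.
const isSafeRedirectTarget = (candidate) => {
  try {
    return ALLOWED_PROTOCOLS.has(new URL(candidate).protocol);
  } catch {
    // new URL() throws a TypeError on unparseable input.
    return false;
  }
};

console.log(isSafeRedirectTarget('https://example.com/')); // true
console.log(isSafeRedirectTarget('file:///tmp/page.html')); // false
console.log(isSafeRedirectTarget('about:blank'));           // false
console.log(isSafeRedirectTarget('not a url'));             // false
```

Refusing to redirect to a `file:` or `about:` target is what breaks the navigation loops described in the added comments.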
@@ -7,6 +7,7 @@
  <button
  type="button"
  class="category-tooltip-icon"
+ aria-label="About Must Fix category"
  aria-describedby="mustFixTooltip"
  >
  <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"
@@ -34,6 +35,7 @@
  <button
  type="button"
  class="category-tooltip-icon"
+ aria-label="About Good to Fix category"
  aria-describedby="goodToFixTooltip"
  >
  <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"
@@ -61,6 +63,7 @@
  <button
  type="button"
  class="category-tooltip-icon"
+ aria-label="About Manual Test category"
  aria-describedby="manualTestTooltip"
  >
  <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"
@@ -2,21 +2,21 @@
  <table class="issues-table" id="issuesTable">
  <thead>
  <tr>
- <th class="sortable" role="button" tabindex="0" aria-sort="none" style="width: 15%;">
+ <th class="sortable" tabindex="0" aria-sort="none" style="width: 15%;">
  <span>Severity</span>
  <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
  <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="1" />
  <path d="M7 15L12 20L17 15H7Z" fill="currentColor" opacity="0.3" />
  </svg>
  </th>
- <th class="sortable" role="button" tabindex="0" aria-sort="none">
+ <th class="sortable" tabindex="0" aria-sort="none">
  <span>Issue Name</span>
  <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
  <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="0.3" />
  <path d="M7 15L12 20L17 15H7Z" fill="currentColor" opacity="1" />
  </svg>
  </th>
- <th class="sortable" role="button" tabindex="0" aria-sort="descending" style="width: 15%;">
+ <th class="sortable" tabindex="0" aria-sort="descending" style="width: 15%;">
  <span>Occurrence</span>
  <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
  <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="0.3" />
@@ -1,4 +1,4 @@
- <div id="aboutScanModal" class="modal fade" tabindex="-1" aria-labelledby="aboutScanModalLabel" aria-hidden="true">
+ <div id="aboutScanModal" class="modal fade" tabindex="-1" aria-label="About this scan" aria-hidden="true">
  <div class="modal-dialog modal-dialog-centered">
  <div class="modal-content">
  <div class="modal-header">