npm - @govtechsg/oobee - Versions diffs - 0.10.93 → 0.10.94 - Mend

@govtechsg/oobee 0.10.93 → 0.10.94

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (61)

package/AGENTS.md CHANGED

@@ -112,6 +112,10 @@ Important behaviors:
 - The crawler itself enforces `maxRequestsPerCrawl` by counting only successfully scanned pages
 - `constants.sitemapFetchedLinks` stores the total discovered count for `scanData.json` reporting
 - For sitemap indexes, child sitemaps are processed recursively
+- Some sitemap XMLs include `<?xml-stylesheet ...?>` (XSL). In `getDataUsingPlaywright()`:
+  - Use `waitUntil: 'domcontentloaded'` (not `networkidle`) to avoid 60s timeouts caused by stylesheet/resource loading
+  - Prefer `response.text()` to capture raw XML before browser XSL transformation (preserves `<sitemapindex>` / `<urlset>` structure)
+  - Only fall back to DOM extraction when raw response text is unavailable
 ## Shared Mutable State
@@ -229,6 +233,12 @@ docker run oobee node dist/cli.js ...
 10. **Intermediate JSONL write safety + corruption tolerance** — `ItemsStore.appendPageItems()` requires strict serialization of writes per rule file to prevent interleaved corruption. It also enforces a strict text sanitization regex to filter out literal `\n` and `\r` control characters from website HTML inputs immediately after `JSON.stringify()`. This ensures no single JSON issue accidentally injects illegal implicit newline boundaries when writing to JSONL format. Maintain backward-compatible `fs.appendFile` queues over buffered WriteStreams to guarantee pipeline sync visibility. `ItemsStore.readRuleItems()` tolerates historical malformed lines via fallback skip logic.
+11. **`preNavigationHooks` and the Playwright header-rewrite warning** — `preNavigationHooks()` in `commonCrawlerFunc.ts` is always included in the crawler `preNavigationHooks` array (for both `crawlDomain` and `crawlSitemap`). The hook does two things:
+    - **Header rewriting**: only sets `crawlingContext.request.headers = extraHTTPHeaders` when `extraHTTPHeaders` is non-empty. Setting request headers causes Crawlee/Playwright to intercept every network request to rewrite them, which triggers `WARN Playwright Utils: Using other request methods than GET, rewriting headers and adding payloads has a high impact on performance`. This warning is expected for authenticated scans; it is suppressed for unauthenticated scans because `extraHTTPHeaders` stays empty (see pitfall 12 below).
+    - **Navigation wait**: always sets `gotoOptions.waitUntil = 'domcontentloaded'` and `gotoOptions.timeout = 30000` via **in-place object mutation**. Do NOT reassign the `gotoOptions` parameter (`gotoOptions = {...}`) — that only rebinds the local variable and does not propagate to Crawlee. `domcontentloaded` is used (not `networkidle`) to avoid indefinite hangs on sites with WebSockets, analytics polling, lazy-load beacons, or health-check pings that never quiet their network activity. Further page stability is handled by `waitForPageLoaded()` in each requestHandler and the DOM mutation observer in `postNavigationHooks`.
+12. **`extraHTTPHeaders` must not be mutated before being passed to crawlers** — `checkUrlConnectivityWithBrowser()` in `common.ts` needs an `Accept` header for its own connectivity check but must NOT add it to the shared `extraHTTPHeaders` object. Mutating the shared object causes crawlers to see a non-empty `extraHTTPHeaders` (at minimum `{ Accept: '...' }`), which silently triggers header rewriting and the Playwright performance warning for every unauthenticated scan. Always use a local copy: `const localHeaders = { ...extraHTTPHeaders }; localHeaders.Accept ||= '...';`.
 ## Testing Considerations
 When making changes, validate these areas which have well-established edge cases:
@@ -260,6 +270,16 @@ When making changes, validate these areas which have well-established edge cases
 - `document.title` must be captured at the START of `runAxeScript()`, before axe scanning or screenshot capture. Pages can close during these operations (timeout, navigation, crash). Never create a new page just to re-navigate for the title — this leaks pages.
 - The URL guard script in custom flow must be defensive against pages that close unexpectedly. All page event handlers should handle closed contexts gracefully.
+### URL Guard & Overlay Management in Custom Flow
+`src/crawlers/guards/urlGuard.ts` — attached via `addUrlGuardScript()` in `runCustom.ts`:
+- **`restoreToSafeUrl` must validate the safe URL before calling `page.goto()`**. If the entry URL is `file://` (e.g. `-u '/path/to/report.html'`), `fallbackUrl` is also `file://`. Redirecting to it fires another `framenavigated` for `file://`, which re-triggers `restoreToSafeUrl` → infinite reload loop. Always check `ALLOWED_PROTOCOLS.has(safeObj.protocol)` before navigating; if the fallback is not http/https, return without redirecting.
+- **`about:` protocol must be skipped in `framenavigated`**. Chromium fires `framenavigated` for `about:blank` as a transient intermediate state during every `page.goto()` call. Intercepting it and calling `restoreToSafeUrl` → `page.goto(safeUrl)` → `about:blank` → `restoreToSafeUrl` → … creates a second infinite loop. Always `return` early when `urlObj.protocol === 'about:'`.
+- **`reconcileOverlayMenu` must not remove the overlay on macOS/Windows**. On `darwin`/`win32` the custom flow runs headful. When `isOverlayAllowed` returns `false` (e.g. transient `file://` or `about:blank` URL), do **not** call `removeOverlayMenu` — the URL guard will redirect back to the safe URL momentarily. Instead, fall through to the `hasOverlay` / `addOverlayMenu` block so the overlay is (re-)injected regardless of the current URL protocol. On Linux/Docker (headless) the removal behaviour is unchanged.
 ### Proxy & Network
 - Proxy detection must handle `ALL_PROXY` on Windows. The proxy resolution logic should be tested on all platforms.
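
Pitfall 12 in the AGENTS.md hunk above (do not mutate the shared `extraHTTPHeaders`) can be illustrated with a minimal sketch. The function names `buildConnectivityHeaders` and `shouldRewriteHeaders` are hypothetical stand-ins, not names from the package; only the copy-then-default pattern and the non-empty check are taken from the diff.

```javascript
// Accept value copied from the diff.
const DEFAULT_ACCEPT = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';

// Hypothetical stand-in for the connectivity check's header setup:
// it needs an Accept header but must leave the caller's object untouched.
function buildConnectivityHeaders(extraHTTPHeaders) {
  const localHeaders = { ...extraHTTPHeaders }; // shallow copy; spreading undefined yields {}
  localHeaders.Accept ||= DEFAULT_ACCEPT;
  return localHeaders;
}

// Hypothetical stand-in for the guard used by the crawler hook:
// headers are rewritten only when the shared object is non-empty.
function shouldRewriteHeaders(extraHTTPHeaders) {
  return Boolean(extraHTTPHeaders) && Object.keys(extraHTTPHeaders).length > 0;
}

const sharedHeaders = {}; // an unauthenticated scan passes no extra headers
const checkHeaders = buildConnectivityHeaders(sharedHeaders);

console.log(Object.keys(checkHeaders));           // the local copy carries Accept
console.log(shouldRewriteHeaders(sharedHeaders)); // the shared object is still empty
```

Because the shared object stays empty, the crawler-side guard does not enable header rewriting for unauthenticated scans, which is the behaviour the pitfall describes.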

package/dist/cli.js CHANGED

@@ -199,9 +199,10 @@ const scanInit = async (argvs) => {
     if (res.httpStatus)
         consoleLogger.info(`Connectivity Check HTTP Response Code: ${res.httpStatus}`);
     if (res.status === statuses.success.code) {
-        // Custom flow should continue from the user-provided entry URL so auth redirects
-        // do not replace the original domain used for overlay gating and navigation.
+        // Keep browser-resolved URL as entryUrl for downstream scan metadata/events
+        // on non-custom scans.
         if (data.type !== ScannerTypes.CUSTOM) {
+            data.entryUrl = res.url;
             data.url = res.url;
         }
         if (process.env.OOBEE_VALIDATE_URL) {

package/dist/combine.js CHANGED

@@ -23,7 +23,7 @@ export class ViewportSettingsClass {
 }
 const combineRun = async (details, deviceToScan) => {
     const envDetails = { ...details };
-    const { type, url, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
+    const { type, url, entryUrl, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
     specifiedMaxConcurrency, // Slow scan mode: if checked, = '1'
     fileTypes, blacklistedPatternsFilename, includeScreenshots, // Include screenshots: if checked, = 'true'
     followRobots, // Adhere to robots.txt: if checked, = 'true'
@@ -59,8 +59,8 @@ const combineRun = async (details, deviceToScan) => {
     }
     // remove basic-auth credentials from URL
     const finalUrl = !(type === ScannerTypes.SITEMAP || type === ScannerTypes.LOCALFILE)
-        ? new URL(url)
-        : new URL(pathToFileURL(url));
+        ? new URL(entryUrl)
+        : new URL(pathToFileURL(entryUrl));
     // Use the string version of finalUrl to reduce logic at submitForm
     const finalUrlString = finalUrl.toString();
     const scanDetails = {

package/dist/constants/common.js CHANGED

@@ -292,15 +292,18 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
             return res;
         }
     }
-    // Ensure Accept header for non-html content fallback
-    extraHTTPHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
+    // Ensure Accept header for non-html content fallback — use a local copy to avoid
+    // mutating the caller's extraHTTPHeaders object (which is later checked by crawlers
+    // to decide whether to enable preNavigationHooks header rewriting).
+    const localHeaders = { ...extraHTTPHeaders };
+    localHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
     await initModifiedUserAgent(browserToRun, playwrightDeviceDetailsObject, clonedDataDir);
     let browserContext;
     let browserInstance;
     const rawDevice = (playwrightDeviceDetailsObject || {});
     const { viewport, isMobile, hasTouch, userAgent: deviceUserAgent, ...restDevice } = rawDevice;
     const launchOptions = getPlaywrightLaunchOptions(browserToRun);
-    const { Authorization, ...nonAuthHeaders } = extraHTTPHeaders || {};
+    const { Authorization, ...nonAuthHeaders } = localHeaders || {};
     let httpCredentials = undefined;
     if (Authorization?.startsWith('Basic ')) {
         const decoded = Buffer.from(Authorization.slice(6), 'base64').toString();
@@ -355,21 +358,23 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
         // Only enable generic Authorization header routing interception broadly if
         // a non-Basic Bearer auth string is heavily relied upon, thereby bypassing
         // performance warnings inside the check checkUrl phase for typical public scans
-        if (Authorization && !httpCredentials) {
-            const entryOrigin = new URL(url).origin;
-            await browserContext.route('**/*', async (route, request) => {
-                try {
-                    if (new URL(request.url()).origin === entryOrigin) {
-                        await route.continue({ headers: { ...request.headers(), Authorization } });
+        if (Object.keys(localHeaders).length > 0) {
+            if (Authorization && !httpCredentials) {
+                const entryOrigin = new URL(url).origin;
+                await browserContext.route('**/*', async (route, request) => {
+                    try {
+                        if (new URL(request.url()).origin === entryOrigin) {
+                            await route.continue({ headers: { ...request.headers(), Authorization } });
+                        }
+                        else {
+                            await route.continue();
+                        }
                     }
-                    else {
+                    catch {
                         await route.continue();
                     }
-                }
-                catch {
-                    await route.continue();
-                }
-            });
+                });
+            }
         }
         const page = await browserContext.newPage();
         // Block native Chrome download UI
@@ -491,7 +496,7 @@ export const isSitemapContent = (content) => {
         return true;
     }
     const regexForHtml = new RegExp('<(?:!doctype html|html|head|body)+?>', 'gmi');
-    const regexForXmlSitemap = new RegExp('<(?:urlset|feed|rss)+?.*>', 'gmi');
+    const regexForXmlSitemap = new RegExp('<(?:urlset|sitemapindex|feed|rss)+?.*>', 'gmi');
     if (content.match(regexForHtml) && content.match(regexForXmlSitemap)) {
         // is an XML sitemap wrapped in a HTML document
         return true;
@@ -505,7 +510,18 @@ export const isSitemapContent = (content) => {
     return false;
 };
 export const checkUrl = async (scanner, url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders, fileTypes) => {
-    const res = await checkUrlConnectivityWithBrowser(url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
+    let urlToCheck = url;
+    if (scanner === ScannerTypes.LOCALFILE) {
+        if (!isFilePath(url)) {
+            const res = new RES();
+            res.status = constants.urlCheckStatuses.notALocalFile.code;
+            return res;
+        }
+        if (!url.toLowerCase().startsWith('file://')) {
+            urlToCheck = pathToFileURL(path.resolve(url)).toString();
+        }
+    }
+    const res = await checkUrlConnectivityWithBrowser(urlToCheck, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
     // If response is 200 (meaning no other code was set earlier)
     if (res.status === constants.urlCheckStatuses.success.code) {
         // Check if document is pdf type
@@ -552,7 +568,7 @@ export const prepareData = async (argv) => {
     if (isEmptyObject(argv)) {
         throw Error('No inputs should be provided');
     }
-    let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, } = argv;
+    let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, finalUrl, } = argv;
     const extraHTTPHeaders = parseHeaders(header);
     // Set default username and password for basic auth
     let username = '';
@@ -578,6 +594,9 @@ export const prepareData = async (argv) => {
         temp.password = '';
         url = temp.toString();
     }
+    // Keep browser-resolved URL (if provided by pre-check flow) as canonical entry URL.
+    // For local file paths, keep using the normalized `url` value below.
+    const resolvedEntryUrl = finalUrl && !isFilePath(finalUrl) ? finalUrl : url;
     // construct filename for scan results
     const [date, time] = new Date().toLocaleString('sv').replaceAll(/-|:/g, '').split(' ');
     const domain = isLocalFileScan ? path.basename(url) : new URL(url).hostname;
@@ -605,7 +624,7 @@ export const prepareData = async (argv) => {
     return {
         type: scanner,
         url,
-        entryUrl: url,
+        entryUrl: resolvedEntryUrl,
         isHeadless: headless,
         deviceChosen,
         customDevice,
@@ -810,6 +829,7 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
     const scannedSitemaps = new Set();
     const sitemapLinkCounts = {};
     const allUrls = new Set(); // all discovered URLs (lightweight strings)
+    const isImageSitemapUrl = (candidateUrl) => /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);
     const addToUrlList = (url) => {
         if (!url)
             return;
@@ -880,6 +900,10 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
     const fetchUrls = async (url, extraHTTPHeaders) => {
         let data;
         let sitemapType;
+        if (isImageSitemapUrl(url)) {
+            consoleLogger.info(`Skipping image sitemap: ${url}`);
+            return;
+        }
         if (scannedSitemaps.has(url)) {
             // Skip processing if the sitemap has already been scanned
             return;
@@ -926,27 +950,45 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
                     });
                 }
                 const page = await browserContext.newPage();
-                await page.goto(url, { waitUntil: 'networkidle', timeout: 60000 });
-                if ((await page.locator('body').count()) > 0) {
-                    data = await page.locator('body').innerText();
-                }
-                else {
-                    const urlSet = page.locator('urlset');
-                    const sitemapIndex = page.locator('sitemapindex');
-                    const rss = page.locator('rss');
-                    const feed = page.locator('feed');
-                    const isRoot = async (locator) => (await locator.count()) > 0;
-                    if (await isRoot(urlSet)) {
-                        data = await urlSet.evaluate(elem => elem.outerHTML);
+                // Use 'domcontentloaded' instead of 'networkidle' — sitemap XMLs with
+                // XSL stylesheet references (e.g. <?xml-stylesheet ...?>) cause the browser
+                // to fetch and apply the stylesheet, which may load additional resources
+                // (fonts, CSS, images) that prevent 'networkidle' from ever being reached.
+                const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
+                // Prefer the raw response body — this gives us the original XML before
+                // the browser applies any XSL transformation (which would turn the XML
+                // into rendered HTML, losing the sitemap structure).
+                if (response) {
+                    try {
+                        data = await response.text();
                     }
-                    else if (await isRoot(sitemapIndex)) {
-                        data = await sitemapIndex.evaluate(elem => elem.outerHTML);
+                    catch {
+                        // response.text() can fail if the body was already consumed or
+                        // if a redirect occurred; fall through to DOM extraction below.
                     }
-                    else if (await isRoot(rss)) {
-                        data = await rss.evaluate(elem => elem.outerHTML);
+                }
+                if (!data) {
+                    if ((await page.locator('body').count()) > 0) {
+                        data = await page.locator('body').innerText();
                     }
-                    else if (await isRoot(feed)) {
-                        data = await feed.evaluate(elem => elem.outerHTML);
+                    else {
+                        const urlSet = page.locator('urlset');
+                        const sitemapIndex = page.locator('sitemapindex');
+                        const rss = page.locator('rss');
+                        const feed = page.locator('feed');
+                        const isRoot = async (locator) => (await locator.count()) > 0;
+                        if (await isRoot(urlSet)) {
+                            data = await urlSet.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(sitemapIndex)) {
+                            data = await sitemapIndex.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(rss)) {
+                            data = await rss.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(feed)) {
+                            data = await feed.evaluate(elem => elem.outerHTML);
+                        }
                     }
                 }
             }
@@ -969,37 +1011,61 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
             data = fs.readFileSync(url, 'utf8');
         }
         const $ = cheerio.load(data, { xml: true });
+        const countBefore = allUrls.size;
         // This case is when the document is not an XML format document
         if ($(':root').length === 0) {
             processNonStandardSitemap(data);
+            const linksFromThisSitemap = allUrls.size - countBefore;
+            if (linksFromThisSitemap > 0) {
+                sitemapLinkCounts[url] = (sitemapLinkCounts[url] || 0) + linksFromThisSitemap;
+            }
             return;
         }
         // Root element
         const root = $(':root')[0];
-        const { xmlns } = root.attribs;
-        const xmlFormatNamespace = '/schemas/sitemap';
-        if (root.name === 'urlset' && xmlns.includes(xmlFormatNamespace)) {
+        const hasImageNamespace = Object.values(root?.attribs ?? {}).some(attribVal => typeof attribVal === 'string' && attribVal.toLowerCase().includes('sitemap-image'));
+        if (hasImageNamespace) {
+            consoleLogger.info(`Skipping image sitemap: ${url}`);
+            return;
+        }
+        const rootName = root?.name?.toLowerCase().split(':').pop() ?? '';
+        const hasXmlSitemapIndexTag = /<\s*(?:[a-z0-9_-]+:)?sitemapindex\b/i.test(data);
+        const hasXmlUrlsetTag = /<\s*(?:[a-z0-9_-]+:)?urlset\b/i.test(data);
+        if (rootName === 'urlset') {
             sitemapType = constants.xmlSitemapTypes.xml;
         }
-        else if (root.name === 'sitemapindex' && xmlns.includes(xmlFormatNamespace)) {
+        else if (rootName === 'sitemapindex') {
             sitemapType = constants.xmlSitemapTypes.xmlIndex;
         }
-        else if (root.name === 'rss') {
+        else if (rootName === 'rss') {
             sitemapType = constants.xmlSitemapTypes.rss;
         }
-        else if (root.name === 'feed') {
+        else if (rootName === 'feed') {
             sitemapType = constants.xmlSitemapTypes.atom;
         }
+        else if (hasXmlSitemapIndexTag) {
+            sitemapType = constants.xmlSitemapTypes.xmlIndex;
+        }
+        else if (hasXmlUrlsetTag) {
+            sitemapType = constants.xmlSitemapTypes.xml;
+        }
         else {
             sitemapType = constants.xmlSitemapTypes.unknown;
         }
-        const countBefore = allUrls.size;
         switch (sitemapType) {
             case constants.xmlSitemapTypes.xmlIndex:
-                consoleLogger.info(`This is a XML format sitemap index.`);
+                consoleLogger.info(`This is a XML format sitemap index: ${url}`);
                 for (const childSitemapUrl of $('loc')) {
-                    const childSitemapUrlText = $(childSitemapUrl).text();
-                    if (childSitemapUrlText.endsWith('.xml') || childSitemapUrlText.endsWith('.txt')) {
+                    const childSitemapUrlText = $(childSitemapUrl).text().trim();
+                    if (!childSitemapUrlText) {
+                        continue;
+                    }
+                    const childSitemapPath = childSitemapUrlText.split(/[?#]/)[0].toLowerCase();
+                    if (childSitemapPath.endsWith('.xml') || childSitemapPath.endsWith('.txt')) {
+                        if (isImageSitemapUrl(childSitemapUrlText)) {
+                            consoleLogger.info(`Skipping image sitemap: ${childSitemapUrlText}`);
+                            continue;
+                        }
                         await fetchUrls(childSitemapUrlText, extraHTTPHeaders); // Recursive call for nested sitemaps
                     }
                     else {
@@ -1008,19 +1074,19 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
                 }
                 break;
             case constants.xmlSitemapTypes.xml:
-                consoleLogger.info(`This is a XML format sitemap.`);
+                consoleLogger.info(`This is a XML format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'loc', 'lastmod', 'url');
                 break;
             case constants.xmlSitemapTypes.rss:
-                consoleLogger.info(`This is a RSS format sitemap.`);
+                consoleLogger.info(`This is a RSS format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'link', 'pubDate', 'item');
                 break;
             case constants.xmlSitemapTypes.atom:
-                consoleLogger.info(`This is a Atom format sitemap.`);
+                consoleLogger.info(`This is a Atom format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'link', 'published', 'entry');
                 break;
             default:
-                consoleLogger.info(`This is an unrecognised XML sitemap format.`);
+                consoleLogger.info(`This is an unrecognised XML sitemap format: ${url}`);
                 processNonStandardSitemap(data);
         }
         const linksFromThisSitemap = allUrls.size - countBefore;
@@ -1836,7 +1902,8 @@ function isValidHttpUrl(urlString) {
 export const isFilePath = (url) => {
     const driveLetterPattern = /^[A-Z]:/i;
     const backslashPattern = /\\/;
-    return (url.startsWith('/') ||
+    return (url.toLowerCase().startsWith('file://') ||
+        url.startsWith('/') ||
         driveLetterPattern.test(url) ||
         backslashPattern.test(url) ||
         url.startsWith('./') ||
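
The sitemap filtering added in `getLinksFromSitemap` is easier to follow with a few concrete inputs. The sketch below copies the `isImageSitemapUrl` regex from the diff. `looksLikeChildSitemap` and `normaliseRootName` are hypothetical helper names that wrap logic the diff writes inline, and the example URLs are made up for illustration.

```javascript
// Regex copied from the diff: matches image-sitemap.xml, image-sitemap-index.xml,
// image-sitemap-3.xml and similar, with an optional query string or fragment.
const isImageSitemapUrl = (candidateUrl) =>
  /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);

// Hypothetical wrapper around the child-sitemap check: trim the <loc> text,
// drop any query string or fragment, then compare the extension case-insensitively.
const looksLikeChildSitemap = (locText) => {
  const text = locText.trim();
  if (!text) return false;
  const pathPart = text.split(/[?#]/)[0].toLowerCase();
  return pathPart.endsWith('.xml') || pathPart.endsWith('.txt');
};

// Hypothetical wrapper around the root-name normalisation: lower-case the
// element name and strip a namespace prefix such as "sm:".
const normaliseRootName = (name) => (name ?? '').toLowerCase().split(':').pop();

console.log(isImageSitemapUrl('https://example.com/image-sitemap-index-2.xml?v=1')); // true
console.log(isImageSitemapUrl('https://example.com/sitemap.xml'));                   // false
console.log(looksLikeChildSitemap(' https://example.com/Sitemap-1.XML?page=2 '));    // true
console.log(normaliseRootName('sm:URLSet'));                                         // urlset
```

Compared with the previous `endsWith('.xml')` test on the raw text, stripping the query string first means child sitemaps addressed as `sitemap.xml?page=2` are followed rather than treated as page URLs.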

package/dist/crawlers/commonCrawlerFunc.js CHANGED

@@ -898,10 +898,19 @@ export const createCrawleeSubFolders = async (randomToken) => {
 export const preNavigationHooks = (extraHTTPHeaders) => {
     return [
         async (crawlingContext, gotoOptions) => {
-            if (extraHTTPHeaders) {
+            if (extraHTTPHeaders && Object.keys(extraHTTPHeaders).length > 0) {
                 crawlingContext.request.headers = extraHTTPHeaders;
             }
-            gotoOptions = { waitUntil: 'networkidle', timeout: 30000 };
+            // Use domcontentloaded — fires as soon as the DOM is parsed, before
+            // images/stylesheets/network requests settle. This avoids indefinite
+            // hangs on sites with WebSockets, analytics polling, or infinite-scroll
+            // beacons that never reach networkidle. Further page stability is
+            // handled by waitForPageLoaded() in each crawler's requestHandler and
+            // by the DOM mutation observer in postNavigationHooks.
+            if (gotoOptions) {
+                gotoOptions.waitUntil = 'domcontentloaded';
+                gotoOptions.timeout = 30000;
+            }
         },
     ];
 };
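
The switch from reassigning `gotoOptions` to mutating it in place matters because JavaScript passes the object reference by value. A minimal sketch of the difference, using plain functions rather than Crawlee itself; the hook names and the initial `'load'` value are assumptions made for illustration:

```javascript
// Old pattern: rebinding the parameter only changes the local variable.
function reassigningHook(_crawlingContext, gotoOptions) {
  gotoOptions = { waitUntil: 'domcontentloaded', timeout: 30000 }; // discarded on return
}

// New pattern: mutating the object the caller passed in.
function mutatingHook(_crawlingContext, gotoOptions) {
  if (gotoOptions) {
    gotoOptions.waitUntil = 'domcontentloaded';
    gotoOptions.timeout = 30000;
  }
}

const optionsA = { waitUntil: 'load' };
reassigningHook({}, optionsA);

const optionsB = { waitUntil: 'load' };
mutatingHook({}, optionsB);

console.log(optionsA); // { waitUntil: 'load' }
console.log(optionsB); // { waitUntil: 'domcontentloaded', timeout: 30000 }
```

Only the second form changes the object that the caller later reads, which is why the removed line had no effect on navigation.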

package/dist/crawlers/crawlDomain.js CHANGED

@@ -1,6 +1,6 @@
 import crawlee from 'crawlee';
 import { CrawlRateController } from './crawlRateController.js';
-import { createCrawleeSubFolders, getPreLaunchHook, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, splitAuthHeaders, } from './commonCrawlerFunc.js';
+import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, splitAuthHeaders, } from './commonCrawlerFunc.js';
 import constants, { blackListedFileExtensions, guiInfoStatusTypes, cssQuerySelectors, STATUS_CODE_METADATA, disallowedListOfPatterns, disallowedSelectorPatterns, FileTypes, } from '../constants/constants.js';
 import { getPlaywrightLaunchOptions, isBlacklistedFileExtensions, isSkippedUrl, isDisallowedInRobotsTxt, getUrlsFromRobotsTxt, waitForPageLoaded, } from '../constants/common.js';
 import { areLinksEqual, isFollowStrategy, isSameHostname, normUrl, register } from '../utils.js';
@@ -301,12 +301,10 @@ const crawlDomain = async ({ url, randomToken, host: _host, viewportSettings, ma
             ],
         },
         requestQueue,
+        maxRequestRetries: 3,
+        maxSessionRotations: 1,
         preNavigationHooks: [
-            async (crawlingContext) => {
-                if (extraHTTPHeaders) {
-                    crawlingContext.request.headers = extraHTTPHeaders;
-                }
-            },
+            ...preNavigationHooks(extraHTTPHeaders),
         ],
         postNavigationHooks: [
             async (crawlingContext) => {

package/dist/crawlers/crawlSitemap.js CHANGED

@@ -1,6 +1,6 @@
 import crawlee, { EnqueueStrategy, RequestList } from 'crawlee';
 import { CrawlRateController } from './crawlRateController.js';
-import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, } from './commonCrawlerFunc.js';
+import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, splitAuthHeaders, } from './commonCrawlerFunc.js';
 import constants, { STATUS_CODE_METADATA, guiInfoStatusTypes, disallowedListOfPatterns, FileTypes, } from '../constants/constants.js';
 import { getLinksFromSitemap, getPlaywrightLaunchOptions, isSkippedUrl, waitForPageLoaded, isFilePath, } from '../constants/common.js';
 import { areLinksEqual, isFollowStrategy, isWhitelistedContentType, normUrl, register } from '../utils.js';
@@ -13,6 +13,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
     let durationExceeded = false;
     let isAbortingScan = false;
     const rateController = new CrawlRateController(maxRequestsPerCrawl, specifiedMaxConcurrency || constants.maxConcurrency);
+    const initialNoSuccessFailureAbortThreshold = Math.max(5, Math.min(maxRequestsPerCrawl, 25));
     if (fromCrawlIntelligentSitemap) {
         dataset = datasetFromIntelligent;
         urlsCrawled = urlsCrawledFromIntelligent;
@@ -33,6 +34,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
     const isScanPdfs = [FileTypes.All, FileTypes.PdfOnly].includes(fileTypes);
     const { playwrightDeviceDetailsObject } = viewportSettings;
     const { maxConcurrency } = constants;
+    const { nonAuthHeaders, httpCredentials } = splitAuthHeaders(extraHTTPHeaders);
     const requestList = await RequestList.open({
         sources: linksFromSitemap,
     });
@@ -53,11 +55,15 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
                         ...playwrightDeviceDetailsObject,
                         ...(process.env.OOBEE_USER_AGENT && { userAgent: process.env.OOBEE_USER_AGENT }),
                         ...(process.env.OOBEE_DISABLE_BROWSER_DOWNLOAD && { acceptDownloads: false }),
+                        ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
+                        ...(httpCredentials && { httpCredentials }),
                     };
                 },
             ],
         },
         requestList,
+        maxRequestRetries: 3,
+        maxSessionRotations: 1,
         postNavigationHooks: [
             async ({ page }) => {
                 try {
@@ -104,6 +110,7 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
             },
         ],
         preNavigationHooks: [
+            ...preNavigationHooks(extraHTTPHeaders),
             async ({ request, page }, gotoOptions) => {
                 const url = request.url.toLowerCase();
                 const isNotSupportedDocument = disallowedListOfPatterns.some(pattern => url.startsWith(pattern));
@@ -114,7 +121,6 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
                     // console.log(`[SKIP] Not supported: ${request.url}`);
                     return;
                 }
-                preNavigationHooks(extraHTTPHeaders);
             },
         ],
         requestHandlerTimeoutSecs: 90,
@@ -310,6 +316,12 @@ const crawlSitemap = async ({ sitemapUrl, randomToken, host, viewportSettings, m
                 httpStatusCode: typeof status === 'number' ? status : 0,
             });
             crawlee.log.error(`Failed Request - ${request.url}: ${request.errorMessages}`);
+            if (urlsCrawled.scanned.length === 0 &&
+                urlsCrawled.error.length >= initialNoSuccessFailureAbortThreshold) {
+                consoleLogger.info(`Aborting sitemap crawl: ${urlsCrawled.error.length} failed pages with 0 successful scans.`);
+                isAbortingScan = true;
+                crawler.autoscaledPool?.abort();
+            }
         },
         maxRequestsPerCrawl: Infinity,
         maxConcurrency: specifiedMaxConcurrency || maxConcurrency,
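Note on the early-abort check added to the failed-request handler above: the crawl is aborted only when nothing has been scanned successfully and the error count has reached a threshold. The sketch below isolates that predicate. The threshold value of 5 is an assumption for illustration; the real `initialNoSuccessFailureAbortThreshold` is defined elsewhere in the package and is not shown in this diff.

```javascript
// Assumed value for illustration; the package defines the real threshold elsewhere.
const ASSUMED_ABORT_THRESHOLD = 5;

// Returns true only when there are zero successful scans and the number of
// failed pages has reached the threshold.
function shouldAbortCrawl(urlsCrawled, threshold = ASSUMED_ABORT_THRESHOLD) {
  return urlsCrawled.scanned.length === 0 && urlsCrawled.error.length >= threshold;
}

const allFailing = { scanned: [], error: ['a', 'b', 'c', 'd', 'e'] };
const oneSuccess = { scanned: ['ok'], error: ['a', 'b', 'c', 'd', 'e', 'f'] };
```

A single successful scan disables the check for the rest of the crawl, so later failures on a partially working site do not trigger the abort.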

package/dist/crawlers/custom/utils.js CHANGED

@@ -1064,15 +1064,28 @@ export const initNewPage = async (page, pageClosePromises, processPageParams, pa
                 return;
             const allowed = isOverlayAllowed(page.url(), processPageParams.entryUrl);
             if (!allowed) {
-                await Promise.race([
-                    removeOverlayMenu(page),
-                    new Promise((_, reject) => {
-                        setTimeout(() => {
-                            reject(new Error(`removeOverlayMenu timed out after ${OVERLAY_OPERATION_TIMEOUT_MS}ms`));
-                        }, OVERLAY_OPERATION_TIMEOUT_MS);
-                    }),
-                ]);
-                return;
+                // On macOS and Windows the custom flow always runs headful.
+                // The URL guard (urlGuard.ts) intercepts non-http/https navigations
+                // and calls page.goto(safeUrl). Do NOT remove the overlay here —
+                // removing it causes it to stay permanently disabled if the redirect
+                // races ahead of the next reconcile cycle.
+                // Instead, fall through to the hasOverlay / addOverlayMenu block so
+                // the overlay is (re-)injected even on transient non-http/https URLs
+                // (e.g. file://, about:blank) and again after the guard's redirect.
+                const isDesktopHost = process.platform === 'darwin' || process.platform === 'win32';
+                if (!isDesktopHost) {
+                    // On Linux / Docker: remove overlay for non-http/https URLs and stop.
+                    await Promise.race([
+                        removeOverlayMenu(page),
+                        new Promise((_, reject) => {
+                            setTimeout(() => {
+                                reject(new Error(`removeOverlayMenu timed out after ${OVERLAY_OPERATION_TIMEOUT_MS}ms`));
+                            }, OVERLAY_OPERATION_TIMEOUT_MS);
+                        }),
+                    ]);
+                    return;
+                }
+                // Desktop hosts: skip removal and fall through to re-add overlay.
             }
             const hasOverlay = await page.evaluate(() => Boolean(document.querySelector('#oobeeShadowHost')));
             consoleLogger.info(`Overlay state (${trigger}): ${hasOverlay}`);
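Note on the overlay change above: the branch now depends on both the URL check and the host platform. The sketch below restates that decision as a pure function, based only on the comments and conditions in the hunk; `overlayActionFor` and its return labels are hypothetical names.

```javascript
// Hypothetical restatement of the branch in the hunk above.
// 'reconcile' means: fall through to the hasOverlay / addOverlayMenu block.
// 'remove-and-stop' means: remove the overlay (with a timeout) and return.
function overlayActionFor(platform, overlayAllowed) {
  if (overlayAllowed) return 'reconcile';
  const isDesktopHost = platform === 'darwin' || platform === 'win32';
  // Desktop hosts skip removal so the overlay is re-injected after the
  // URL guard redirects away from a non-http/https URL.
  return isDesktopHost ? 'reconcile' : 'remove-and-stop';
}

const onMac = overlayActionFor('darwin', false);
const onLinux = overlayActionFor('linux', false);
```

In the real code the platform comes from `process.platform`; passing it as a parameter here only makes the branch easy to exercise.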

package/dist/crawlers/guards/urlGuard.js CHANGED

@@ -30,8 +30,20 @@ export function addUrlGuardScript(context, opts = {}) {
             // page may have closed before addInitScript completed; safe to ignore
         });
         const restoreToSafeUrl = async (page, attemptedUrl) => {
+            const safeUrl = lastAllowedUrlByPage.get(page) || fallbackUrl || 'about:blank';
+            // Only redirect if the safe URL is itself an allowed (http/https) URL.
+            // If the entry URL is file:// (e.g. scanning a local HTML file), the
+            // fallback is also file://, and redirecting would create an infinite loop:
+            //   file:// → restoreToSafeUrl → file:// → framenavigated → restoreToSafeUrl → …
+            try {
+                const safeObj = new URL(safeUrl);
+                if (!ALLOWED_PROTOCOLS.has(safeObj.protocol))
+                    return;
+            }
+            catch {
+                return;
+            }
             try {
-                const safeUrl = lastAllowedUrlByPage.get(page) || fallbackUrl || 'about:blank';
                 await page.goto(safeUrl, { waitUntil: 'domcontentloaded' });
             }
             catch {
@@ -53,6 +65,12 @@ export function addUrlGuardScript(context, opts = {}) {
                 lastAllowedUrlByPage.set(page, urlObj.toString());
                 return;
             }
+            // Skip browser-internal transitional states (about:blank, about:srcdoc, etc.).
+            // page.goto() navigates through about:blank before loading the target URL.
+            // Redirecting from about: creates an infinite loop:
+            //   restoreToSafeUrl → page.goto(safeUrl) → about:blank → restoreToSafeUrl → …
+            if (urlObj.protocol === 'about:')
+                return;
             await restoreToSafeUrl(page, urlStr);
         });
     };
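Note on the two URL guard hunks above: together they break two redirect loops, one where the safe URL is itself not http/https and one where the navigation is a transient `about:` state. The sketch below models the resulting decision as a pure function. It assumes `ALLOWED_PROTOCOLS` contains `http:` and `https:` (as the hunk's comment suggests); `decideGuardAction` and its return labels are hypothetical, and handling of unparsable navigated URLs is left out because the diff does not show it.

```javascript
// Assumption based on the comment in the hunk: only http/https are allowed.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:']);

// True when safeUrl parses and uses an allowed protocol.
function isRedirectTarget(safeUrl) {
  try {
    return ALLOWED_PROTOCOLS.has(new URL(safeUrl).protocol);
  } catch {
    return false;
  }
}

// navigatedUrl must be a parsable URL string.
function decideGuardAction(navigatedUrl, safeUrl) {
  const { protocol } = new URL(navigatedUrl);
  if (ALLOWED_PROTOCOLS.has(protocol)) return 'record'; // remember as last allowed URL
  if (protocol === 'about:') return 'ignore'; // transient browser-internal state
  // Redirect only when the target is itself allowed, otherwise the guard
  // would fire again on the page it just navigated to.
  return isRedirectTarget(safeUrl) ? 'redirect' : 'ignore';
}
```

For example, a `file://` navigation with an `https://` safe URL yields `'redirect'`, while the same navigation with a `file://` safe URL yields `'ignore'`.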

package/dist/static/ejs/partials/components/allIssues/CategoryBadges.ejs CHANGED

@@ -7,6 +7,7 @@
         <button
           type="button"
           class="category-tooltip-icon"
+          aria-label="About Must Fix category"
           aria-describedby="mustFixTooltip"
         >
           <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"
@@ -34,6 +35,7 @@
         <button
           type="button"
           class="category-tooltip-icon"
+          aria-label="About Good to Fix category"
           aria-describedby="goodToFixTooltip"
         >
           <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"
@@ -61,6 +63,7 @@
         <button
           type="button"
           class="category-tooltip-icon"
+          aria-label="About Manual Test category"
           aria-describedby="manualTestTooltip"
         >
           <svg xmlns="http://www.w3.org/2000/svg" width="14" height="14"

package/dist/static/ejs/partials/components/allIssues/IssuesTable.ejs CHANGED

@@ -2,21 +2,21 @@
   <table class="issues-table" id="issuesTable">
     <thead>
       <tr>
-        <th class="sortable" role="button" tabindex="0" aria-sort="none" style="width: 15%;">
+        <th class="sortable" tabindex="0" aria-sort="none" style="width: 15%;">
           <span>Severity</span>
           <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
             <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="1" />
             <path d="M7 15L12 20L17 15H7Z" fill="currentColor" opacity="0.3" />
           </svg>
         </th>
-        <th class="sortable" role="button" tabindex="0" aria-sort="none">
+        <th class="sortable" tabindex="0" aria-sort="none">
           <span>Issue Name</span>
           <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
             <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="0.3" />
             <path d="M7 15L12 20L17 15H7Z" fill="currentColor" opacity="1" />
           </svg>
         </th>
-        <th class="sortable" role="button" tabindex="0" aria-sort="descending" style="width: 15%;">
+        <th class="sortable" tabindex="0" aria-sort="descending" style="width: 15%;">
           <span>Occurrence</span>
           <svg class="sort-icon" width="24" height="24" viewBox="0 0 24 24" fill="none" aria-hidden="true">
             <path d="M7 9L12 4L17 9H7Z" fill="currentColor" opacity="0.3" />

package/dist/static/ejs/partials/components/header/aboutScanModal/AboutScanModal.ejs CHANGED

@@ -1,4 +1,4 @@
-<div id="aboutScanModal" class="modal fade" tabindex="-1" aria-labelledby="aboutScanModalLabel" aria-hidden="true">
+<div id="aboutScanModal" class="modal fade" tabindex="-1" aria-label="About this scan" aria-hidden="true">
   <div class="modal-dialog modal-dialog-centered">
     <div class="modal-content">
       <div class="modal-header">