@govtechsg/oobee 0.10.92 → 0.10.94

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (70)

package/AGENTS.md CHANGED

@@ -79,6 +79,7 @@ All crawlers use Crawlee's `PlaywrightCrawler` with:
 - Docker detection (`/.dockerenv`): adds `--disable-gpu`, `--no-sandbox`, `--disable-dev-shm-usage`
 - Proxy support (manual, PAC, or none) via `getProxyInfo()`
 - Channel set from browser name (undefined for chromium = bundled)
+- `--mute-audio` is added by default in both headless and headful modes, but must be disabled for `customFlow` by calling `getPlaywrightLaunchOptions(browser, { includeMuteAudio: false })`
 ### User-Agent
@@ -111,6 +112,10 @@ Important behaviors:
 - The crawler itself enforces `maxRequestsPerCrawl` by counting only successfully scanned pages
 - `constants.sitemapFetchedLinks` stores the total discovered count for `scanData.json` reporting
 - For sitemap indexes, child sitemaps are processed recursively
+- Some sitemap XMLs include `<?xml-stylesheet ...?>` (XSL). In `getDataUsingPlaywright()`:
+  - Use `waitUntil: 'domcontentloaded'` (not `networkidle`) to avoid 60s timeouts caused by stylesheet/resource loading
+  - Prefer `response.text()` to capture raw XML before browser XSL transformation (preserves `<sitemapindex>` / `<urlset>` structure)
+  - Only fall back to DOM extraction when raw response text is unavailable
 ## Shared Mutable State
@@ -134,6 +139,11 @@ The `constants` default export object holds runtime state:
 | `OOBEE_SLOWMO` | Browser slowmo in ms |
 | `OOBEE_FAST_CRAWLER` | Experimental high-concurrency mode |
 | `OOBEE_DISABLE_BROWSER_DOWNLOAD` | Block browser file downloads |
+| `OOBEE_TAGGED_WEBSITE` | Tag to identify the website in Sentry telemetry (overridden by `--websiteTag` CLI flag) |
+| `OOBEE_SCAN_METADATA` | Overrides `entryUrl` tag in Sentry events |
+| `OOBEE_SCAN_PRODUCT` | Adds `scanProduct` tag to Sentry events |
+| `OOBEE_CONSECUTIVE_MAX_RETRIES` | Max consecutive HTTP failures before circuit breaker aborts crawl (default 100) |
+| `OOBEE_VALIDATE_URL` | If set, exit after URL validation without scanning |
 | `HTTP_PROXY` / `HTTPS_PROXY` / `ALL_PROXY` | Proxy configuration |
 | `NO_PROXY` / `INCLUDE_PROXY` | Proxy bypass/include lists |
@@ -215,6 +225,20 @@ docker run oobee node dist/cli.js ...
 8. **Crawlee dataset** — Results are stored as numbered JSON files in `{randomToken}/datasets/default/`. Each file is one page's axe results. `generateArtifacts()` reads all of them.
+9. **Auth headers and CORS** — Never set `Authorization` in `extraHTTPHeaders` globally on a browser context. Playwright sends `extraHTTPHeaders` to ALL requests (including cross-origin CDNs), which triggers CORS preflight failures. Instead use `splitAuthHeaders()` from `commonCrawlerFunc.ts` to separate auth from non-auth headers:
+    - Non-auth headers → safe to set globally via `extraHTTPHeaders` on context/launch options
+    - Basic auth → set `httpCredentials` on context (Playwright auto-responds to 401 challenges, origin-aware)
+    - Any Authorization header → send only to same-origin requests via `addAuthRouteHandler()` (route interception) or Crawlee's `preNavigationHooks` (navigation-only)
+    - Credentials come from URL-embedded `user:pass@host` or `-m "Authorization Basic ..."` — both produce the same `extraHTTPHeaders.Authorization` value in `prepareData()`
+10. **Intermediate JSONL write safety + corruption tolerance** — `ItemsStore.appendPageItems()` requires strict serialization of writes per rule file to prevent interleaved corruption. It also enforces a strict text sanitization regex to filter out literal `\n` and `\r` control characters from website HTML inputs immediately after `JSON.stringify()`. This ensures no single JSON issue accidentally injects illegal implicit newline boundaries when writing to JSONL format. Maintain backward-compatible `fs.appendFile` queues over buffered WriteStreams to guarantee pipeline sync visibility. `ItemsStore.readRuleItems()` tolerates historical malformed lines via fallback skip logic.
+11. **`preNavigationHooks` and the Playwright header-rewrite warning** — `preNavigationHooks()` in `commonCrawlerFunc.ts` is always included in the crawler `preNavigationHooks` array (for both `crawlDomain` and `crawlSitemap`). The hook does two things:
+    - **Header rewriting**: only sets `crawlingContext.request.headers = extraHTTPHeaders` when `extraHTTPHeaders` is non-empty. Setting request headers causes Crawlee/Playwright to intercept every network request to rewrite them, which triggers `WARN Playwright Utils: Using other request methods than GET, rewriting headers and adding payloads has a high impact on performance`. This warning is expected for authenticated scans; it is suppressed for unauthenticated scans because `extraHTTPHeaders` stays empty (see pitfall 12 below).
+    - **Navigation wait**: always sets `gotoOptions.waitUntil = 'domcontentloaded'` and `gotoOptions.timeout = 30000` via **in-place object mutation**. Do NOT reassign the `gotoOptions` parameter (`gotoOptions = {...}`) — that only rebinds the local variable and does not propagate to Crawlee. `domcontentloaded` is used (not `networkidle`) to avoid indefinite hangs on sites with WebSockets, analytics polling, lazy-load beacons, or health-check pings that never quiet their network activity. Further page stability is handled by `waitForPageLoaded()` in each requestHandler and the DOM mutation observer in `postNavigationHooks`.
+12. **`extraHTTPHeaders` must not be mutated before being passed to crawlers** — `checkUrlConnectivityWithBrowser()` in `common.ts` needs an `Accept` header for its own connectivity check but must NOT add it to the shared `extraHTTPHeaders` object. Mutating the shared object causes crawlers to see a non-empty `extraHTTPHeaders` (at minimum `{ Accept: '...' }`), which silently triggers header rewriting and the Playwright performance warning for every unauthenticated scan. Always use a local copy: `const localHeaders = { ...extraHTTPHeaders }; localHeaders.Accept ||= '...';`.
 ## Testing Considerations
 When making changes, validate these areas which have well-established edge cases:
@@ -246,6 +270,16 @@ When making changes, validate these areas which have well-established edge cases
 - `document.title` must be captured at the START of `runAxeScript()`, before axe scanning or screenshot capture. Pages can close during these operations (timeout, navigation, crash). Never create a new page just to re-navigate for the title — this leaks pages.
 - The URL guard script in custom flow must be defensive against pages that close unexpectedly. All page event handlers should handle closed contexts gracefully.
+### URL Guard & Overlay Management in Custom Flow
+`src/crawlers/guards/urlGuard.ts` — attached via `addUrlGuardScript()` in `runCustom.ts`:
+- **`restoreToSafeUrl` must validate the safe URL before calling `page.goto()`**. If the entry URL is `file://` (e.g. `-u '/path/to/report.html'`), `fallbackUrl` is also `file://`. Redirecting to it fires another `framenavigated` for `file://`, which re-triggers `restoreToSafeUrl` → infinite reload loop. Always check `ALLOWED_PROTOCOLS.has(safeObj.protocol)` before navigating; if the fallback is not http/https, return without redirecting.
+- **`about:` protocol must be skipped in `framenavigated`**. Chromium fires `framenavigated` for `about:blank` as a transient intermediate state during every `page.goto()` call. Intercepting it and calling `restoreToSafeUrl` → `page.goto(safeUrl)` → `about:blank` → `restoreToSafeUrl` → … creates a second infinite loop. Always `return` early when `urlObj.protocol === 'about:'`.
+- **`reconcileOverlayMenu` must not remove the overlay on macOS/Windows**. On `darwin`/`win32` the custom flow runs headful. When `isOverlayAllowed` returns `false` (e.g. transient `file://` or `about:blank` URL), do **not** call `removeOverlayMenu` — the URL guard will redirect back to the safe URL momentarily. Instead, fall through to the `hasOverlay` / `addOverlayMenu` block so the overlay is (re-)injected regardless of the current URL protocol. On Linux/Docker (headless) the removal behaviour is unchanged.
 ### Proxy & Network
 - Proxy detection must handle `ALL_PROXY` on Windows. The proxy resolution logic should be tested on all platforms.
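
Pitfalls 11 and 12 in the AGENTS.md hunk above both come down to JavaScript object-reference semantics: mutating an object is visible to every holder of the reference, while reassigning a parameter is not. The following is a minimal sketch of that behaviour, not the package's code. `reassigningHook`, `mutatingHook`, and `withAcceptHeader` are hypothetical names used only for illustration, and the empty objects stand in for whatever Crawlee and the callers actually pass.

```javascript
// Hypothetical stand-ins for a pre-navigation hook; not Crawlee's or oobee's actual code.
const reassigningHook = (_crawlingContext, gotoOptions) => {
  // Rebinds the local parameter only; the caller's object is left untouched.
  gotoOptions = { waitUntil: 'domcontentloaded', timeout: 30000 };
  return gotoOptions;
};

const mutatingHook = (_crawlingContext, gotoOptions) => {
  // In-place mutation: the caller sees these properties afterwards.
  if (gotoOptions) {
    gotoOptions.waitUntil = 'domcontentloaded';
    gotoOptions.timeout = 30000;
  }
};

// Hypothetical helper mirroring the "local copy" advice in pitfall 12.
const withAcceptHeader = (extraHTTPHeaders) => {
  const localHeaders = { ...extraHTTPHeaders };
  localHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
  return localHeaders;
};

const callerOptionsA = {};
reassigningHook({}, callerOptionsA);

const callerOptionsB = {};
mutatingHook({}, callerOptionsB);

const sharedHeaders = {};
const localHeaders = withAcceptHeader(sharedHeaders);

console.log(Object.keys(callerOptionsA).length); // 0: the reassignment never reached the caller
console.log(callerOptionsB.waitUntil);           // domcontentloaded
console.log(Object.keys(sharedHeaders).length);  // 0: the shared object stays empty
console.log('Accept' in localHeaders);           // true
```

The same mechanism is wanted in one place and harmful in the other: the hook relies on mutation to pass options back, whereas the connectivity check must avoid it so that an unauthenticated scan still sees an empty `extraHTTPHeaders`.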

package/README.md CHANGED

@@ -92,6 +92,10 @@ verapdf --version
 | WARN_LEVEL | Only used in tests. |  |
 | OOBEE_DISABLE_BROWSER_DOWNLOAD | Experimental flag to disable file downloads on Chrome/Chromium/Edge.  Does not affect Local File scan | |
 | OOBEE_SLOWMO | Experimental flag to slow down web browser behaviour by specified duration (in miliseconds) | |
+| OOBEE_TAGGED_WEBSITE | Tag to identify the website in telemetry. Can also be set via `-z, --websiteTag` CLI flag (CLI flag takes precedence). | |
+| OOBEE_SCAN_METADATA | Overrides the `entryUrl` tag sent to telemetry. | |
+| OOBEE_SCAN_PRODUCT | Adds a `scanProduct` tag to telemetry events. | |
+| OOBEE_CONSECUTIVE_MAX_RETRIES | Max consecutive HTTP failures before the circuit breaker aborts the crawl. | `100` |
 | HTTP_PROXY | URL of the proxy server to be used for HTTP requests (e.g. `http://proxy.example.com:8080`). | |
 | HTTPS_PROXY | URL of the proxy server to be used for HTTPS requests (e.g. `https://proxy.example.com:8080`). | |
 | ALL_PROXY | URL of the proxy server to be used for all requests, typically used for SOCKS5 proxies (e.g. `socks5://proxy.example.com:1080`. Note: IPv6 direct connections may still continue even though socks5 proxy is specified due to a known issue with Chrome/Chromium. (Recommended workaround is to turn off IPv6 at host-level). | |
@@ -413,6 +417,21 @@ Examples:
   > [ -d <device> | -w <viewport_width> ]
 ```
+### Basic Auth
+For sites behind HTTP Basic Authentication, you can provide credentials in two ways:
+1. **Embed in URL**: `npm run cli -- -u 'https://user:password@example.com' -c 3`
+2. **Use `-m` flag**: `npm run cli -- -u 'https://example.com' -c 3 -m "Authorization Basic dXNlcjpwYXNzd29yZA=="`
+Both methods work across all scan types (sitemap, website, custom flow). For multiple headers, separate with `, `:
+```
+-m "Authorization Basic dXNlcjpwYXNz, X-Custom-Header myvalue"
+```
+> **Note:** Authorization headers are only sent to same-origin requests to avoid CORS preflight failures on cross-origin resources (e.g., CDN fonts, analytics scripts).
 ### Note on Windows PowerShell:
 You need to run the command as `npm run cli -- --` (with the extra set of `--`) as PowerShell interprets arguments differently.
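
For the Basic Auth section added in the README hunk above, the sketch below shows how the value passed to `-m` relates to a username and password, and what "same-origin" means for deciding where the header is sent. It is an illustration under stated assumptions, not the package's implementation: `toBasicAuthValue`, `parseBasicAuthValue`, and `isSameOrigin` are hypothetical helper names, although the decoding steps follow the `Basic ` prefix and first-colon split visible in the `dist` hunks of this diff.

```javascript
// Hypothetical helper: build the value that follows "Authorization" in the -m flag.
const toBasicAuthValue = (username, password) =>
  `Basic ${Buffer.from(`${username}:${password}`).toString('base64')}`;

// Hypothetical helper: recover credentials from a Basic value.
// Splits on the first colon only, so passwords may themselves contain colons.
const parseBasicAuthValue = (value) => {
  if (typeof value !== 'string' || !value.startsWith('Basic ')) return null;
  const decoded = Buffer.from(value.slice(6), 'base64').toString();
  const colonIdx = decoded.indexOf(':');
  if (colonIdx <= 0) return null;
  return { username: decoded.slice(0, colonIdx), password: decoded.slice(colonIdx + 1) };
};

// Hypothetical helper: true only when scheme, host, and port all match.
const isSameOrigin = (requestUrl, entryUrl) => {
  try {
    return new URL(requestUrl).origin === new URL(entryUrl).origin;
  } catch {
    return false; // unparsable URLs never receive the header in this sketch
  }
};

const headerValue = toBasicAuthValue('user', 'password');
console.log(headerValue);                      // Basic dXNlcjpwYXNzd29yZA==
console.log(parseBasicAuthValue(headerValue)); // { username: 'user', password: 'password' }
console.log(isSameOrigin('https://example.com/page', 'https://example.com'));           // true
console.log(isSameOrigin('https://cdn.example.net/font.woff2', 'https://example.com')); // false
```

Base64 is an encoding, not encryption, so a value passed with `-m` should be treated as carefully as the plain password.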

package/dist/cli.js CHANGED

@@ -199,9 +199,10 @@ const scanInit = async (argvs) => {
     if (res.httpStatus)
         consoleLogger.info(`Connectivity Check HTTP Response Code: ${res.httpStatus}`);
     if (res.status === statuses.success.code) {
-        // Custom flow should continue from the user-provided entry URL so auth redirects
-        // do not replace the original domain used for overlay gating and navigation.
+        // Keep browser-resolved URL as entryUrl for downstream scan metadata/events
+        // on non-custom scans.
         if (data.type !== ScannerTypes.CUSTOM) {
+            data.entryUrl = res.url;
             data.url = res.url;
         }
         if (process.env.OOBEE_VALIDATE_URL) {

package/dist/combine.js CHANGED

@@ -23,7 +23,7 @@ export class ViewportSettingsClass {
 }
 const combineRun = async (details, deviceToScan) => {
     const envDetails = { ...details };
-    const { type, url, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
+    const { type, url, entryUrl, nameEmail, randomToken, deviceChosen, customDevice, viewportWidth, playwrightDeviceDetailsObject, maxRequestsPerCrawl, browser, userDataDirectory, strategy, // Allow subdomains: if checked, = 'same-domain'
     specifiedMaxConcurrency, // Slow scan mode: if checked, = '1'
     fileTypes, blacklistedPatternsFilename, includeScreenshots, // Include screenshots: if checked, = 'true'
     followRobots, // Adhere to robots.txt: if checked, = 'true'
@@ -59,8 +59,8 @@ const combineRun = async (details, deviceToScan) => {
     }
     // remove basic-auth credentials from URL
     const finalUrl = !(type === ScannerTypes.SITEMAP || type === ScannerTypes.LOCALFILE)
-        ? new URL(url)
-        : new URL(pathToFileURL(url));
+        ? new URL(entryUrl)
+        : new URL(pathToFileURL(entryUrl));
     // Use the string version of finalUrl to reduce logic at submitForm
     const finalUrlString = finalUrl.toString();
     const scanDetails = {
@@ -89,7 +89,7 @@ const combineRun = async (details, deviceToScan) => {
     let durationExceeded = false;
     switch (type) {
         case ScannerTypes.CUSTOM:
-            const res = await runCustom(url, randomToken, browser, userDataDirectory, viewportSettings, blacklistedPatterns, includeScreenshots, customFlowLabel && customFlowLabel !== 'None' ? customFlowLabel : '');
+            const res = await runCustom(url, randomToken, browser, userDataDirectory, viewportSettings, blacklistedPatterns, includeScreenshots, customFlowLabel && customFlowLabel !== 'None' ? customFlowLabel : '', extraHTTPHeaders);
             urlsCrawledObj = res.urlsCrawled;
             uiCustomFlowLabel = res.customFlowLabel;
             break;

package/dist/constants/common.js CHANGED

@@ -292,17 +292,30 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
             return res;
         }
     }
-    // Ensure Accept header for non-html content fallback
-    extraHTTPHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
+    // Ensure Accept header for non-html content fallback — use a local copy to avoid
+    // mutating the caller's extraHTTPHeaders object (which is later checked by crawlers
+    // to decide whether to enable preNavigationHooks header rewriting).
+    const localHeaders = { ...extraHTTPHeaders };
+    localHeaders.Accept ||= 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
     await initModifiedUserAgent(browserToRun, playwrightDeviceDetailsObject, clonedDataDir);
     let browserContext;
     let browserInstance;
     const rawDevice = (playwrightDeviceDetailsObject || {});
     const { viewport, isMobile, hasTouch, userAgent: deviceUserAgent, ...restDevice } = rawDevice;
     const launchOptions = getPlaywrightLaunchOptions(browserToRun);
+    const { Authorization, ...nonAuthHeaders } = localHeaders || {};
+    let httpCredentials = undefined;
+    if (Authorization?.startsWith('Basic ')) {
+        const decoded = Buffer.from(Authorization.slice(6), 'base64').toString();
+        const colonIdx = decoded.indexOf(':');
+        if (colonIdx > 0) {
+            httpCredentials = { username: decoded.slice(0, colonIdx), password: decoded.slice(colonIdx + 1) };
+        }
+    }
     const contextOptions = {
         ...restDevice,
-        ...(extraHTTPHeaders && { extraHTTPHeaders }),
+        ...(Object.keys(nonAuthHeaders).length > 0 && { extraHTTPHeaders: nonAuthHeaders }),
+        ...(httpCredentials && { httpCredentials }),
         ignoreHTTPSErrors: true,
         ...(process.env.OOBEE_DISABLE_BROWSER_DOWNLOAD && { acceptDownloads: false }),
     };
@@ -342,6 +355,27 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
         return res;
     }
     try {
+        // Only enable generic Authorization header routing interception broadly if
+        // a non-Basic Bearer auth string is heavily relied upon, thereby bypassing
+        // performance warnings inside the check checkUrl phase for typical public scans
+        if (Object.keys(localHeaders).length > 0) {
+            if (Authorization && !httpCredentials) {
+                const entryOrigin = new URL(url).origin;
+                await browserContext.route('**/*', async (route, request) => {
+                    try {
+                        if (new URL(request.url()).origin === entryOrigin) {
+                            await route.continue({ headers: { ...request.headers(), Authorization } });
+                        }
+                        else {
+                            await route.continue();
+                        }
+                    }
+                    catch {
+                        await route.continue();
+                    }
+                });
+            }
+        }
         const page = await browserContext.newPage();
         // Block native Chrome download UI
         try {
@@ -351,15 +385,6 @@ const checkUrlConnectivityWithBrowser = async (url, browserToRun, clonedDataDir,
         catch (e) {
             consoleLogger.info(`Unable to set download deny: ${e.message}`);
         }
-        // OPTIMIZATION: Block heavy visual resources (Images/Fonts/CSS)
-        // This allows the "Connectivity Check" to pass as soon as HTML is ready
-        await page.route('**/*', (route) => {
-            const type = route.request().resourceType();
-            if (['image', 'media', 'font', 'stylesheet'].includes(type)) {
-                return route.abort();
-            }
-            return route.continue();
-        });
         // STEP 2: Navigate (follows server-side redirects)
         page.once('download', () => {
             res.status = constants.urlCheckStatuses.notASupportedDocument.code;
@@ -471,7 +496,7 @@ export const isSitemapContent = (content) => {
         return true;
     }
     const regexForHtml = new RegExp('<(?:!doctype html|html|head|body)+?>', 'gmi');
-    const regexForXmlSitemap = new RegExp('<(?:urlset|feed|rss)+?.*>', 'gmi');
+    const regexForXmlSitemap = new RegExp('<(?:urlset|sitemapindex|feed|rss)+?.*>', 'gmi');
     if (content.match(regexForHtml) && content.match(regexForXmlSitemap)) {
         // is an XML sitemap wrapped in a HTML document
         return true;
@@ -485,7 +510,18 @@ export const isSitemapContent = (content) => {
     return false;
 };
 export const checkUrl = async (scanner, url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders, fileTypes) => {
-    const res = await checkUrlConnectivityWithBrowser(url, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
+    let urlToCheck = url;
+    if (scanner === ScannerTypes.LOCALFILE) {
+        if (!isFilePath(url)) {
+            const res = new RES();
+            res.status = constants.urlCheckStatuses.notALocalFile.code;
+            return res;
+        }
+        if (!url.toLowerCase().startsWith('file://')) {
+            urlToCheck = pathToFileURL(path.resolve(url)).toString();
+        }
+    }
+    const res = await checkUrlConnectivityWithBrowser(urlToCheck, browser, clonedDataDir, playwrightDeviceDetailsObject, extraHTTPHeaders);
     // If response is 200 (meaning no other code was set earlier)
     if (res.status === constants.urlCheckStatuses.success.code) {
         // Check if document is pdf type
@@ -532,7 +568,7 @@ export const prepareData = async (argv) => {
     if (isEmptyObject(argv)) {
         throw Error('No inputs should be provided');
     }
-    let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, } = argv;
+    let { scanner, headless, url, deviceChosen, customDevice, viewportWidth, maxpages, strategy, isLocalFileScan = argv.scanner === ScannerTypes.LOCALFILE, browserToRun, nameEmail, customFlowLabel, specifiedMaxConcurrency, fileTypes, blacklistedPatternsFilename, additional, metadata, followRobots, header, safeMode, exportDirectory, zip, ruleset, generateJsonFiles, scanDuration, finalUrl, } = argv;
     const extraHTTPHeaders = parseHeaders(header);
     // Set default username and password for basic auth
     let username = '';
@@ -558,6 +594,9 @@ export const prepareData = async (argv) => {
         temp.password = '';
         url = temp.toString();
     }
+    // Keep browser-resolved URL (if provided by pre-check flow) as canonical entry URL.
+    // For local file paths, keep using the normalized `url` value below.
+    const resolvedEntryUrl = finalUrl && !isFilePath(finalUrl) ? finalUrl : url;
     // construct filename for scan results
     const [date, time] = new Date().toLocaleString('sv').replaceAll(/-|:/g, '').split(' ');
     const domain = isLocalFileScan ? path.basename(url) : new URL(url).hostname;
@@ -585,7 +624,7 @@ export const prepareData = async (argv) => {
     return {
         type: scanner,
         url,
-        entryUrl: url,
+        entryUrl: resolvedEntryUrl,
         isHeadless: headless,
         deviceChosen,
         customDevice,
@@ -790,6 +829,7 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
     const scannedSitemaps = new Set();
     const sitemapLinkCounts = {};
     const allUrls = new Set(); // all discovered URLs (lightweight strings)
+    const isImageSitemapUrl = (candidateUrl) => /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);
     const addToUrlList = (url) => {
         if (!url)
             return;
@@ -860,6 +900,10 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
     const fetchUrls = async (url, extraHTTPHeaders) => {
         let data;
         let sitemapType;
+        if (isImageSitemapUrl(url)) {
+            consoleLogger.info(`Skipping image sitemap: ${url}`);
+            return;
+        }
         if (scannedSitemaps.has(url)) {
             // Skip processing if the sitemap has already been scanned
             return;
@@ -906,27 +950,45 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
                     });
                 }
                 const page = await browserContext.newPage();
-                await page.goto(url, { waitUntil: 'networkidle', timeout: 60000 });
-                if ((await page.locator('body').count()) > 0) {
-                    data = await page.locator('body').innerText();
-                }
-                else {
-                    const urlSet = page.locator('urlset');
-                    const sitemapIndex = page.locator('sitemapindex');
-                    const rss = page.locator('rss');
-                    const feed = page.locator('feed');
-                    const isRoot = async (locator) => (await locator.count()) > 0;
-                    if (await isRoot(urlSet)) {
-                        data = await urlSet.evaluate(elem => elem.outerHTML);
+                // Use 'domcontentloaded' instead of 'networkidle' — sitemap XMLs with
+                // XSL stylesheet references (e.g. <?xml-stylesheet ...?>) cause the browser
+                // to fetch and apply the stylesheet, which may load additional resources
+                // (fonts, CSS, images) that prevent 'networkidle' from ever being reached.
+                const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
+                // Prefer the raw response body — this gives us the original XML before
+                // the browser applies any XSL transformation (which would turn the XML
+                // into rendered HTML, losing the sitemap structure).
+                if (response) {
+                    try {
+                        data = await response.text();
                     }
-                    else if (await isRoot(sitemapIndex)) {
-                        data = await sitemapIndex.evaluate(elem => elem.outerHTML);
+                    catch {
+                        // response.text() can fail if the body was already consumed or
+                        // if a redirect occurred; fall through to DOM extraction below.
                     }
-                    else if (await isRoot(rss)) {
-                        data = await rss.evaluate(elem => elem.outerHTML);
+                }
+                if (!data) {
+                    if ((await page.locator('body').count()) > 0) {
+                        data = await page.locator('body').innerText();
                     }
-                    else if (await isRoot(feed)) {
-                        data = await feed.evaluate(elem => elem.outerHTML);
+                    else {
+                        const urlSet = page.locator('urlset');
+                        const sitemapIndex = page.locator('sitemapindex');
+                        const rss = page.locator('rss');
+                        const feed = page.locator('feed');
+                        const isRoot = async (locator) => (await locator.count()) > 0;
+                        if (await isRoot(urlSet)) {
+                            data = await urlSet.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(sitemapIndex)) {
+                            data = await sitemapIndex.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(rss)) {
+                            data = await rss.evaluate(elem => elem.outerHTML);
+                        }
+                        else if (await isRoot(feed)) {
+                            data = await feed.evaluate(elem => elem.outerHTML);
+                        }
                     }
                 }
             }
@@ -949,37 +1011,61 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
             data = fs.readFileSync(url, 'utf8');
         }
         const $ = cheerio.load(data, { xml: true });
+        const countBefore = allUrls.size;
         // This case is when the document is not an XML format document
         if ($(':root').length === 0) {
             processNonStandardSitemap(data);
+            const linksFromThisSitemap = allUrls.size - countBefore;
+            if (linksFromThisSitemap > 0) {
+                sitemapLinkCounts[url] = (sitemapLinkCounts[url] || 0) + linksFromThisSitemap;
+            }
             return;
         }
         // Root element
         const root = $(':root')[0];
-        const { xmlns } = root.attribs;
-        const xmlFormatNamespace = '/schemas/sitemap';
-        if (root.name === 'urlset' && xmlns.includes(xmlFormatNamespace)) {
+        const hasImageNamespace = Object.values(root?.attribs ?? {}).some(attribVal => typeof attribVal === 'string' && attribVal.toLowerCase().includes('sitemap-image'));
+        if (hasImageNamespace) {
+            consoleLogger.info(`Skipping image sitemap: ${url}`);
+            return;
+        }
+        const rootName = root?.name?.toLowerCase().split(':').pop() ?? '';
+        const hasXmlSitemapIndexTag = /<\s*(?:[a-z0-9_-]+:)?sitemapindex\b/i.test(data);
+        const hasXmlUrlsetTag = /<\s*(?:[a-z0-9_-]+:)?urlset\b/i.test(data);
+        if (rootName === 'urlset') {
             sitemapType = constants.xmlSitemapTypes.xml;
         }
-        else if (root.name === 'sitemapindex' && xmlns.includes(xmlFormatNamespace)) {
+        else if (rootName === 'sitemapindex') {
             sitemapType = constants.xmlSitemapTypes.xmlIndex;
         }
-        else if (root.name === 'rss') {
+        else if (rootName === 'rss') {
             sitemapType = constants.xmlSitemapTypes.rss;
         }
-        else if (root.name === 'feed') {
+        else if (rootName === 'feed') {
             sitemapType = constants.xmlSitemapTypes.atom;
         }
+        else if (hasXmlSitemapIndexTag) {
+            sitemapType = constants.xmlSitemapTypes.xmlIndex;
+        }
+        else if (hasXmlUrlsetTag) {
+            sitemapType = constants.xmlSitemapTypes.xml;
+        }
         else {
             sitemapType = constants.xmlSitemapTypes.unknown;
         }
-        const countBefore = allUrls.size;
         switch (sitemapType) {
             case constants.xmlSitemapTypes.xmlIndex:
-                consoleLogger.info(`This is a XML format sitemap index.`);
+                consoleLogger.info(`This is a XML format sitemap index: ${url}`);
                 for (const childSitemapUrl of $('loc')) {
-                    const childSitemapUrlText = $(childSitemapUrl).text();
-                    if (childSitemapUrlText.endsWith('.xml') || childSitemapUrlText.endsWith('.txt')) {
+                    const childSitemapUrlText = $(childSitemapUrl).text().trim();
+                    if (!childSitemapUrlText) {
+                        continue;
+                    }
+                    const childSitemapPath = childSitemapUrlText.split(/[?#]/)[0].toLowerCase();
+                    if (childSitemapPath.endsWith('.xml') || childSitemapPath.endsWith('.txt')) {
+                        if (isImageSitemapUrl(childSitemapUrlText)) {
+                            consoleLogger.info(`Skipping image sitemap: ${childSitemapUrlText}`);
+                            continue;
+                        }
                         await fetchUrls(childSitemapUrlText, extraHTTPHeaders); // Recursive call for nested sitemaps
                     }
                     else {
@@ -988,19 +1074,19 @@ export const getLinksFromSitemap = async (sitemapUrl, _maxLinksCount, browser, u
                 }
                 break;
             case constants.xmlSitemapTypes.xml:
-                consoleLogger.info(`This is a XML format sitemap.`);
+                consoleLogger.info(`This is a XML format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'loc', 'lastmod', 'url');
                 break;
             case constants.xmlSitemapTypes.rss:
-                consoleLogger.info(`This is a RSS format sitemap.`);
+                consoleLogger.info(`This is a RSS format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'link', 'pubDate', 'item');
                 break;
             case constants.xmlSitemapTypes.atom:
-                consoleLogger.info(`This is a Atom format sitemap.`);
+                consoleLogger.info(`This is a Atom format sitemap: ${url}`);
                 await processXmlSitemap($, sitemapType, 'link', 'published', 'entry');
                 break;
             default:
-                consoleLogger.info(`This is an unrecognised XML sitemap format.`);
+                consoleLogger.info(`This is an unrecognised XML sitemap format: ${url}`);
                 processNonStandardSitemap(data);
         }
         const linksFromThisSitemap = allUrls.size - countBefore;
@@ -1816,7 +1902,8 @@ function isValidHttpUrl(urlString) {
 export const isFilePath = (url) => {
     const driveLetterPattern = /^[A-Z]:/i;
     const backslashPattern = /\\/;
-    return (url.startsWith('/') ||
+    return (url.toLowerCase().startsWith('file://') ||
+        url.startsWith('/') ||
         driveLetterPattern.test(url) ||
         backslashPattern.test(url) ||
         url.startsWith('./') ||
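
The sitemap hunks in `common.js` above change three small decisions: which child `<loc>` entries are followed, which URLs are skipped as image sitemaps, and how a namespaced root element name is normalised. The sketch below replays those checks in isolation so they can be tried against sample inputs. The `isImageSitemapUrl` regex is copied from the diff; `shouldFollowChildSitemap` and `rootLocalName` are hypothetical names, and the sketch only returns booleans and strings rather than reproducing the branches that the diff elides.

```javascript
// Regex copied from the isImageSitemapUrl line in the diff above.
const isImageSitemapUrl = (candidateUrl) =>
  /(^|\/)image-sitemap(?:-index)?(?:-\d+)?\.xml(?:$|[?#])/i.test(candidateUrl);

// Hypothetical helper: should a <loc> inside a sitemap index be fetched recursively?
const shouldFollowChildSitemap = (locText) => {
  const trimmed = (locText ?? '').trim();
  if (!trimmed) return false;
  // Ignore query string and fragment, and compare the extension case-insensitively.
  const pathOnly = trimmed.split(/[?#]/)[0].toLowerCase();
  const looksLikeSitemap = pathOnly.endsWith('.xml') || pathOnly.endsWith('.txt');
  return looksLikeSitemap && !isImageSitemapUrl(trimmed);
};

// Hypothetical helper: strip an XML namespace prefix such as "sm:" from a root element name.
const rootLocalName = (name) => (name ?? '').toLowerCase().split(':').pop();

console.log(shouldFollowChildSitemap(' https://example.com/sitemap-pages.XML?page=2 ')); // true
console.log(shouldFollowChildSitemap('https://example.com/image-sitemap-3.xml'));        // false
console.log(shouldFollowChildSitemap('https://example.com/about'));                      // false
console.log(rootLocalName('sm:urlset'));                                                 // urlset
```

Note that the image-sitemap check is purely name-based here; the diff additionally inspects root attributes for a `sitemap-image` namespace, which this sketch does not model.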

package/dist/crawlers/commonCrawlerFunc.js CHANGED

@@ -898,13 +898,65 @@ export const createCrawleeSubFolders = async (randomToken) => {
 export const preNavigationHooks = (extraHTTPHeaders) => {
     return [
         async (crawlingContext, gotoOptions) => {
-            if (extraHTTPHeaders) {
+            if (extraHTTPHeaders && Object.keys(extraHTTPHeaders).length > 0) {
                 crawlingContext.request.headers = extraHTTPHeaders;
             }
-            gotoOptions = { waitUntil: 'networkidle', timeout: 30000 };
+            // Use domcontentloaded — fires as soon as the DOM is parsed, before
+            // images/stylesheets/network requests settle. This avoids indefinite
+            // hangs on sites with WebSockets, analytics polling, or infinite-scroll
+            // beacons that never reach networkidle. Further page stability is
+            // handled by waitForPageLoaded() in each crawler's requestHandler and
+            // by the DOM mutation observer in postNavigationHooks.
+            if (gotoOptions) {
+                gotoOptions.waitUntil = 'domcontentloaded';
+                gotoOptions.timeout = 30000;
+            }
         },
     ];
 };
+/**
+ * Splits extraHTTPHeaders into auth and non-auth parts.
+ * Auth headers (Authorization) must only be sent to same-origin requests to avoid CORS preflight failures.
+ * Non-auth headers are safe to set globally on the browser context.
+ */
+export const splitAuthHeaders = (extraHTTPHeaders) => {
+    const { Authorization, ...nonAuthHeaders } = extraHTTPHeaders || {};
+    return {
+        authHeader: Authorization || null,
+        nonAuthHeaders: Object.keys(nonAuthHeaders).length > 0 ? nonAuthHeaders : null,
+        httpCredentials: (() => {
+            if (!Authorization?.startsWith('Basic '))
+                return null;
+            const decoded = Buffer.from(Authorization.slice(6), 'base64').toString();
+            const colonIdx = decoded.indexOf(':');
+            if (colonIdx <= 0)
+                return null;
+            return { username: decoded.slice(0, colonIdx), password: decoded.slice(colonIdx + 1) };
+        })(),
+    };
+};
+/**
+ * Adds a route handler to a BrowserContext that sends the Authorization header
+ * only to same-origin requests, preventing CORS preflight failures on cross-origin CDN resources.
+ */
+export const addAuthRouteHandler = async (context, entryUrl, authHeader) => {
+    if (!authHeader)
+        return;
+    const entryOrigin = new URL(entryUrl).origin;
+    await context.route('**/*', async (route, request) => {
+        try {
+            if (new URL(request.url()).origin === entryOrigin) {
+                await route.continue({ headers: { ...request.headers(), Authorization: authHeader } });
+            }
+            else {
+                await route.continue();
+            }
+        }
+        catch {
+            await route.continue();
+        }
+    });
+};
 export const postNavigationHooks = [
     async (_crawlingContext) => {
         guiInfoLog(guiInfoStatusTypes.COMPLETED, {});
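
The `preNavigationHooks` change above does more than swap `networkidle` for `domcontentloaded`. The removed line rebinds the `gotoOptions` parameter, which leaves the caller's object untouched, while the added lines mutate the object that was passed in. A minimal sketch of that difference, assuming a plain options object with only `waitUntil` and `timeout` (`makeOptions`, `reassignHook` and `mutateHook` are hypothetical names, and the hooks are synchronous here for brevity although the one in the diff is `async`):

```javascript
// Stand-in for the options object a crawler would pass to a navigation hook.
// Assumed shape: only waitUntil and timeout are modelled.
const makeOptions = () => ({ waitUntil: 'load', timeout: 60000 });

// Removed pattern: rebinding the parameter only changes the local variable.
const reassignHook = (_crawlingContext, gotoOptions) => {
    gotoOptions = { waitUntil: 'networkidle', timeout: 30000 };
    return gotoOptions;
};

// Added pattern: mutating the object the caller handed in.
const mutateHook = (_crawlingContext, gotoOptions) => {
    if (gotoOptions) {
        gotoOptions.waitUntil = 'domcontentloaded';
        gotoOptions.timeout = 30000;
    }
};

const afterReassign = makeOptions();
const afterMutate = makeOptions();
reassignHook({}, afterReassign);
mutateHook({}, afterMutate);

console.log(afterReassign); // caller's object still has waitUntil 'load'
console.log(afterMutate);   // caller's object now has waitUntil 'domcontentloaded'
```

The `if (gotoOptions)` guard in the added lines also keeps the hook safe when it is invoked without an options object.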

package/dist/crawlers/crawlDomain.js CHANGED

@@ -1,6 +1,6 @@
 import crawlee from 'crawlee';
 import { CrawlRateController } from './crawlRateController.js';
-import { createCrawleeSubFolders, getPreLaunchHook, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, } from './commonCrawlerFunc.js';
+import { createCrawleeSubFolders, getPreLaunchHook, preNavigationHooks, runAxeScript, isUrlPdf, shouldSkipClickDueToDisallowedHref, shouldSkipDueToUnsupportedContent, splitAuthHeaders, } from './commonCrawlerFunc.js';
 import constants, { blackListedFileExtensions, guiInfoStatusTypes, cssQuerySelectors, STATUS_CODE_METADATA, disallowedListOfPatterns, disallowedSelectorPatterns, FileTypes, } from '../constants/constants.js';
 import { getPlaywrightLaunchOptions, isBlacklistedFileExtensions, isSkippedUrl, isDisallowedInRobotsTxt, getUrlsFromRobotsTxt, waitForPageLoaded, } from '../constants/common.js';
 import { areLinksEqual, isFollowStrategy, isSameHostname, normUrl, register } from '../utils.js';
@@ -275,6 +275,7 @@ const crawlDomain = async ({ url, randomToken, host: _host, viewportSettings, ma
     };
     let isAbortingScanNow = false;
     const rateController = new CrawlRateController(maxRequestsPerCrawl, specifiedMaxConcurrency || constants.maxConcurrency);
+    const { nonAuthHeaders, httpCredentials } = splitAuthHeaders(extraHTTPHeaders);
     const crawler = register(new crawlee.PlaywrightCrawler({
         launchContext: {
             launcher: constants.launcher,
@@ -293,12 +294,18 @@ const crawlDomain = async ({ url, randomToken, host: _host, viewportSettings, ma
                         ...playwrightDeviceDetailsObject,
                         ...(process.env.OOBEE_USER_AGENT && { userAgent: process.env.OOBEE_USER_AGENT }),
                         ...(process.env.OOBEE_DISABLE_BROWSER_DOWNLOAD && { acceptDownloads: false }),
-                        ...(extraHTTPHeaders && { extraHTTPHeaders }),
+                        ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
+                        ...(httpCredentials && { httpCredentials }),
                     };
                 },
             ],
         },
         requestQueue,
+        maxRequestRetries: 3,
+        maxSessionRotations: 1,
+        preNavigationHooks: [
+            ...preNavigationHooks(extraHTTPHeaders),
+        ],
         postNavigationHooks: [
             async (crawlingContext) => {
                 const { page, request } = crawlingContext;
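
The conditional spreads in this hunk add `extraHTTPHeaders` and `httpCredentials` to the context options only when `splitAuthHeaders` returns a non-null value for them. The sketch below copies `splitAuthHeaders` from the `commonCrawlerFunc.js` hunk so it runs standalone. `buildContextOptions`, the sample credentials and the `X-Trace` header are hypothetical and exist only for illustration.

```javascript
// Copied from the commonCrawlerFunc.js hunk so this sketch is self-contained.
const splitAuthHeaders = (extraHTTPHeaders) => {
    const { Authorization, ...nonAuthHeaders } = extraHTTPHeaders || {};
    return {
        authHeader: Authorization || null,
        nonAuthHeaders: Object.keys(nonAuthHeaders).length > 0 ? nonAuthHeaders : null,
        httpCredentials: (() => {
            if (!Authorization?.startsWith('Basic '))
                return null;
            const decoded = Buffer.from(Authorization.slice(6), 'base64').toString();
            const colonIdx = decoded.indexOf(':');
            if (colonIdx <= 0)
                return null;
            return { username: decoded.slice(0, colonIdx), password: decoded.slice(colonIdx + 1) };
        })(),
    };
};

// Hypothetical helper mirroring the conditional spreads in the crawlDomain hunk.
// Spreading null or undefined contributes no keys, so absent values are omitted.
const buildContextOptions = (extraHTTPHeaders) => {
    const { nonAuthHeaders, httpCredentials } = splitAuthHeaders(extraHTTPHeaders);
    return {
        ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
        ...(httpCredentials && { httpCredentials }),
    };
};

// Illustrative inputs (assumptions, not taken from the package).
const basic = `Basic ${Buffer.from('alice:s3cret').toString('base64')}`;
const withBasic = buildContextOptions({ Authorization: basic, 'X-Trace': '1' });
const withBearer = buildContextOptions({ Authorization: 'Bearer token123' });
const withNone = buildContextOptions(undefined);

console.log(withBasic);  // X-Trace under extraHTTPHeaders, alice/s3cret under httpCredentials
console.log(withBearer); // {}
console.log(withNone);   // {}
```

Two details follow directly from the code as written. The destructuring matches only the exact key `Authorization`, so a lowercase `authorization` key would be treated as a non-auth header. In the visible `crawlDomain` hunk only `nonAuthHeaders` and `httpCredentials` are destructured, so a non-Basic `Authorization` value does not appear in these context options.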

package/dist/crawlers/crawlIntelligentSitemap.js CHANGED

@@ -1,4 +1,4 @@
-import { createCrawleeSubFolders } from './commonCrawlerFunc.js';
+import { createCrawleeSubFolders, splitAuthHeaders, addAuthRouteHandler } from './commonCrawlerFunc.js';
 import constants, { guiInfoStatusTypes, sitemapPaths } from '../constants/constants.js';
 import { consoleLogger, guiInfoLog } from '../logs.js';
 import crawlDomain from './crawlDomain.js';
@@ -26,26 +26,31 @@ const crawlIntelligentSitemap = async (url, randomToken, host, viewportSettings,
         const homeUrl = getHomeUrl(link);
         let sitemapLink = '';
         const launchOptions = getPlaywrightLaunchOptions(browser);
+        const { authHeader, nonAuthHeaders, httpCredentials } = splitAuthHeaders(extraHTTPHeaders);
         let context;
         let browserInstance;
         if (process.env.CRAWLEE_HEADLESS === '1') {
             const effectiveUserDataDirectory = userDataDirectory || '';
             context = await constants.launcher.launchPersistentContext(effectiveUserDataDirectory, {
                 ...launchOptions,
-                ...(extraHTTPHeaders && { extraHTTPHeaders }),
+                ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
+                ...(httpCredentials && { httpCredentials }),
                 ...(process.env.OOBEE_USER_AGENT && { userAgent: process.env.OOBEE_USER_AGENT }),
             });
             register(context);
         }
         else {
-            // In headful mode, avoid launchPersistentContext to prevent "Browser window not found"
             browserInstance = await constants.launcher.launch(launchOptions);
             register(browserInstance);
             context = await browserInstance.newContext({
-                ...(extraHTTPHeaders && { extraHTTPHeaders }),
+                ...(nonAuthHeaders && { extraHTTPHeaders: nonAuthHeaders }),
+                ...(httpCredentials && { httpCredentials }),
                 ...(process.env.OOBEE_USER_AGENT && { userAgent: process.env.OOBEE_USER_AGENT }),
             });
         }
+        if (authHeader) {
+            await addAuthRouteHandler(context, link, authHeader);
+        }
         const page = await context.newPage();
         for (const path of sitemapPaths) {
             sitemapLink = homeUrl + path;