@fanboynz/network-scanner 3.1.2 → 3.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

package/CHANGELOG.md CHANGED

@@ -2,7 +2,58 @@
 All notable changes to the Network Scanner (nwss.js) project.
-## [3.1.1] - 2026-05-30
+## [3.3.0] - 2026-06-06
+### Added
+- **DNS dead-domain skip + corroborated persistence** — within a scan, once a host resolves NXDOMAIN/ENODATA it is remembered and repeat URLs on that host are skipped without re-resolving. With `--dns-cache`, a host that *also* fails navigation (`ERR_NAME_NOT_RESOLVED` / `ERR_ADDRESS_UNREACHABLE`) is corroborated and persisted to the negative cache (`.dnsnegcache`, 12h TTL) so it is skipped on the next run too. Only definitive non-existence is cached — resolver errors fail open and never poison a live host.
+- **`acceptInsecureCerts` on browser launch** — TLS/cert errors (expired, self-signed, name-mismatch) no longer abort navigation, so streaming/pirate domains with broken certs are still scanned.
+- **`--disable-popup-blocking` when a site uses `capture_popups`** — Chrome's pop-up blocker (`chrome://settings/content/popups`) is turned off only for popup-capture scans, so non-gesture popunders (document-level `onclick` / timer SDKs) fire and get captured too. Non-popup scans keep the blocker on (stealthier — a real browser blocks non-gesture `window.open()`); gesture-triggered popups already worked via the synthetic-click path.
+### Changed
+- **The main-frame document is never blocked** — the scanned page (and any main-frame redirect target) is exempt from adblock / `blocked` / `blockDomainsByUrl` aborts. Aborting it made the navigation never commit (`about:blank` → timeout), silently breaking scanned URLs that matched our own filter lists (common on adult/pirate/stream domains). The request still flows through the matcher, so a main-frame redirect destination (e.g. a filecrypt → ad-domain hop) is still captured; sub-frame / ad iframes stay blockable.
+- **Navigation timeouts are recovered, not discarded** — on a nav timeout the scanner retries leniently and proceeds with the partially-loaded page instead of dropping the URL (a page still at `about:blank` is still treated as a failure).
+- **whois disk-cache TTL raised to 36h** (dig stays 20h) — registrar data is stable and whois servers rate-limit aggressively, so a longer TTL cuts repeat queries; dig keeps its 20h TTL.
+- **VPN is Linux-only with a clear guard** — `vpn` / `openvpn` on macOS/Windows now returns an explicit "Linux-only" error instead of cryptic `ip` / `/proc` failures.
+### Performance
+- **`psl.parse` memoized by hostname** in the request hot path — both per-request handlers (main page + popup capture) parsed the root domain of *every* request, while a page hammers the same handful of hosts (CDN, analytics, ad domains). A hostname-keyed memo turns almost all of those into `Map` hits, replacing the URL-keyed cache (fewer + shorter keys, far higher hit rate).
+- **Lower per-request overhead** — the iframe-loop guard's `frame().url()` lookup is now gated behind a cheap URL string test instead of running on every request.
+- **Removed redundant disk I/O** — a leaked adblock combined-list temp file in `tmpdir` is now cleaned up, and a redundant `existsSync` before each forced screenshot's recursive `mkdir` was dropped.
+### Fixed
+- **Periodic debug/`--dumpurls` log flush is now synchronous** — the 2s timer used async `fs.writeFile({flag:'a'})` with no in-flight guard, so two ticks could append to the same file concurrently and interleave lines, and it cleared the buffer *before* the write confirmed (silently dropping entries on a failed write). It now uses `appendFileSync`, clears only after a successful write (transient failures retry next tick), and is bounded so a permanently-unwritable path can't grow memory.
+- **Dead-domain skip works without `--show-dead-domains`** — the in-scan skip recorded into the dead set only when the report flag was on, which made the skip dead code; recording is now unconditional and the flag gates only the end-of-scan report. Transient DNS errors were also dropped from the dead-domain match so only `ERR_NAME_NOT_RESOLVED` / `ERR_ADDRESS_UNREACHABLE` mark a host dead.
+### Removed
+- **Hardcoded `dmzjmp` iframe-loop guard** — the domain-specific abort for a `creative.dmzjmp.com` frame requesting `go.dmzjmp.com/api/models` (added mid-2025 to stop a runaway request loop) has not recurred and was removed from the request hot path; the per-URL timeout remains the backstop. Recoverable from git history — prefer a config-driven `iframe_loop_guards` entry if it ever returns.
+### Documentation
+- **README + man page now document `--block-ads` and `--adblock-engine`** — blocking ads/trackers *during* the scan with EasyList-format list(s) (comma-separated), and the `js` (default, native parser) vs `rust` (Brave `adblock-rs`) matcher backends.
+## [3.2.0] - 2026-06-04
+### Added
+- **`output_regex`** site option — a per-site regex whose capture group 1 (or whole match) becomes the rule body, so output can be a path-prefix rule like `||host/script/` instead of `||host^`. Collapses randomized filenames under a stable path into one rule and lets you block a folder on a host that also serves legit content; falls back to `||host^` when the regex doesn't match. Adblock-only — domain-based formats (dnsmasq/unbound/pi-hole/hosts/plain) emit the bare host. Compiled once per pattern (memoized) and validated at config load.
+- **dig resolver failover** — `digLookup` now fails over through the `--dns` resolvers on timeout / no-reply / `REFUSED` / `SERVFAIL` (up to 3 attempts, `+time=2 +tries=1` each), matching the resilience the whois retry and DNS pre-check rotation already had. With no `--dns`, the system-resolver path keeps dig's native `resolv.conf` rotation unchanged.
+### Changed
+- **Ghost-cursor coordinate clicks now use the same realistic press as the built-in content clicks** (`humanClick`): hover dwell + mousedown/hold/mouseup, plus hand-tremor during the hold and a mouseup drift (so mousedown ≠ mouseup coordinates) when `realistic_click` is set — replacing a 0ms `page.mouse.click`.
+- **Ghost-cursor clicks honor `interact_click_count`** (default 3, cap 20) instead of firing a single click — ad SDKs often swallow the 1st/2nd click as warmup. The bezier movement loop reserves part of `ghost_cursor_duration` for the clicks (raise the duration to fit more; the default 2000ms fits ~1 realistic click).
+- **`dig` success is judged by RCODE, not stderr** — a dig that prints a transient `communications error` warning but still returns a valid `ANSWER SECTION` is no longer discarded.
+- **dig-only configs skip the whois root-domain parse** per request (small per-request saving when no `whois`/`whois-or` is configured).
+### Fixed
+- **`max_redirects: 0`** now means "follow none" instead of silently becoming 10 (the `|| 10` falsy-zero bug in `nwss.js` and `lib/redirect.js`).
+- **A `REFUSED`/`SERVFAIL` dig that exhausts all resolvers returns failure** so it isn't cached — a transient resolver-side error no longer poisons a domain for the cache TTL.
+- **Ghost-cursor coordinate click no longer reports false success** — it returned `true` (and logged "Clicked") even when the click was silently skipped for lack of a page; it now returns `false` and logs the skip.
+### Removed
+- **`follow_redirects`** site option — documented in `--help`, the man page, the README, and example configs but never wired to any runtime behavior; removed from the docs. Use `max_redirects` instead (`0` = follow none).
+### Security
+- **dig argv-injection guard** — `digLookup` rejects non-hostname-shaped input before shelling out. `dig` has no `--` end-of-options marker (unlike whois) and parses `@`/`-`/`+`-leading argv tokens as options, so a crafted "domain" like `@evil-resolver` (redirects the query to an arbitrary server) or `-f /path` (reads a file as a query batch) is now rejected — out-of-charset or dash-leading values fall back to no-match.
+## [3.1.2] - 2026-05-30
 ### Changed
 - **Fingerprint identity pinned to Stable Chrome 148**, not whatever Chrome-for-Testing puppeteer bundles (currently 149, ahead of Stable). The spoof must blend with the real-world population; claiming an unreleased build is itself a tell. The Chrome major + build (`CHROME_BUILD`) + GREASE brand (`CHROME_GREASE_BRAND`) are now single constants — see `lib/fingerprint.md`.
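
The 3.2.0 Security entry above describes rejecting non-hostname-shaped input before `dig` is invoked. The diff shown here does not include that check, so the following is only a minimal sketch of the idea. The function names `isHostnameShaped` and `safeDigTarget` and the exact character set are assumptions, not the package's code.

```javascript
// Hypothetical guard (not the package's implementation): accept only values
// that look like a hostname, so nothing starting with '@', '-' or '+' can
// reach dig's argv and be parsed as an option.
function isHostnameShaped(value) {
  if (typeof value !== 'string' || value.length === 0 || value.length > 253) {
    return false;
  }
  // First and last characters must be alphanumeric or '_'; in between only
  // letters, digits, '.', '-' and '_' are allowed. One trailing dot is fine.
  return /^[A-Za-z0-9_](?:[A-Za-z0-9._-]*[A-Za-z0-9_])?\.?$/.test(value);
}

// Returns the value when it is safe to pass to dig, otherwise null so the
// caller can treat the lookup as "no match" instead of shelling out.
function safeDigTarget(value) {
  return isHostnameShaped(value) ? value : null;
}
```

This is a charset and shape test only. It does not validate label lengths or reject consecutive dots, which a stricter check might also do.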

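The `max_redirects: 0` fix in 3.2.0 is an instance of the falsy-zero pitfall: `value || 10` replaces an explicit `0` with the default. The sketch below shows the difference between the two operators. The helper names are hypothetical and the actual lines in `nwss.js` and `lib/redirect.js` are not part of this diff.

```javascript
// Hypothetical helpers illustrating the bug class named in the changelog.

// Buggy form: 0 is falsy, so an explicit "follow none" silently becomes 10.
function maxRedirectsBuggy(siteConfig) {
  return siteConfig.max_redirects || 10;
}

// Fixed form: ?? falls back only on null/undefined, so 0 is preserved.
function maxRedirectsFixed(siteConfig) {
  return siteConfig.max_redirects ?? 10;
}

maxRedirectsBuggy({ max_redirects: 0 }); // 10 (the bug)
maxRedirectsFixed({ max_redirects: 0 }); // 0  (follow none)
maxRedirectsFixed({});                   // 10 (default still applies)
```
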
package/CLAUDE.md CHANGED

@@ -6,7 +6,7 @@ Puppeteer-based network scanner for analyzing web traffic, generating adblock fi
 - `nwss.js` — Main entry point (~5,800 lines). CLI args, URL processing, orchestration.
 - `config.json` — Default scan configuration (sites, filters, options).
-- `lib/` — 32 focused, single-purpose modules:
+- `lib/` — 33 focused, single-purpose modules:
   - `fingerprint.js` — Bot detection evasion (device/GPU/timezone spoofing)
   - `cloudflare.js` — Cloudflare challenge detection and solving
   - `browserhealth.js` — Memory management and browser lifecycle
@@ -14,6 +14,7 @@ Puppeteer-based network scanner for analyzing web traffic, generating adblock fi
   - `ghost-cursor.js` — Bezier-curve cursor pathing for human-like mouse movement
   - `smart-cache.js` — Multi-layer caching with persistence
   - `nettools.js` — WHOIS/dig integration
+  - `dns.js` — DNS pre-check resolver: multi-nameserver rotation + `--dns` override (pre-check only; not Chrome/dig)
   - `output.js` — Multi-format rule output (adblock, dnsmasq, unbound, pihole, etc.)
   - `proxy.js` — SOCKS5/HTTP proxy support
   - `socks-relay.js` — Local SOCKS proxy relay/chain helper

package/README.md CHANGED

@@ -17,6 +17,7 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
 - Subdomain handling (collapse to root or full subdomain)
 - Optionally match only first-party, third-party, or both
 - Enhanced redirect handling with JavaScript and meta refresh detection
+- Capture and drive popup/popunder chains (`capture_popups` + `interact_popups`) so domains reachable only via a clicked popup still match
 - Per-site proxy routing (SOCKS5, SOCKS4, HTTP, HTTPS) with pre-flight health checks
 ---
@@ -50,7 +51,6 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
 | Argument                  | Description |
 |:---------------------------|:------------|
-| `--verbose`                 | Force verbose mode globally |
 | `--debug`                   | Force debug mode globally |
 | `--silent`                  | Suppress normal console logs |
 | `--titles`                  | Add `! <url>` title before each site's group |
@@ -66,9 +66,10 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
 | `--use-puppeteer-core`      | Use `puppeteer-core` with system Chrome instead of bundled Chromium |
 | `--use-obscura`             | Connect to running Obscura CDP server (`ws://127.0.0.1:9222` or `OBSCURA_WS` env). Skips fingerprint injection — Obscura provides built-in stealth |
 | `--load-extension <path>`   | Load unpacked Chrome extension from directory (can be used multiple times) |
-| `--dns-cache`               | Persist dig/whois results to disk between runs (20hr TTL, 2000-entry cap each, `.digcache`/`.whoiscache`). Disk writes are atomic (tmp + rename); corrupt cache files are detected on load with a `[dns-cache]` warn line and reset cleanly. |
-| `--no-dns-precheck`         | Disable per-URL DNS resolution check before page navigation. By default, hosts that dig/whois have already proven live (within the 20hr cache TTL) skip their c-ares pre-check via a positive-resolution index. |
-| `--block-ads=<files>`       | Block ads using EasyList format rules (comma-separated: `easylist.txt,easyprivacy.txt`) |
+| `--dns-cache`               | Persist dig/whois results to disk between runs (dig 20hr / whois 36hr TTL, 2000-entry cap each, `.digcache`/`.whoiscache`), **plus** the DNS pre-check negative cache (NXDOMAIN/ENODATA only — never resolver errors — 12h TTL, `.dnsnegcache`) so known-dead hosts aren't re-resolved next run. Disk writes are atomic (tmp + rename); corrupt cache files are detected on load with a `[dns-cache]` warn line and reset cleanly. |
+| `--no-dns-precheck`         | Disable per-URL DNS resolution check before page navigation. By default, hosts that dig/whois have already proven live (within the dig/whois cache TTL) skip their c-ares pre-check via a positive-resolution index. |
+| `--block-ads=<files>`       | Block ads/trackers **during the scan** using EasyList-format filter list(s) (`\|\|domain^`, `/ads/*`, etc.). Comma-separated for multiple: `--block-ads=easylist.txt,easyprivacy.txt`. See [Blocking ads during the scan](#blocking-ads-during-the-scan). |
+| `--adblock-engine=<js\|rust>` | Matcher backend for `--block-ads` (default: `js`). `rust` uses Brave's `adblock-rs` (much faster on large lists) and requires `npm i adblock-rs`. |
 | `--cdp`                     | Enable Chrome DevTools Protocol logging (now per-page if enabled) |
 | `--remove-dupes`            | Remove duplicate domains from output (only with `-o`) |
 | `--dry-run`                 | Console output only: show matching regex, titles, whois/dig/searchstring results, and adblock rules |
@@ -76,6 +77,8 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
 | `--help`, `-h`              | Show this help menu |
 | `--version`                 | Show script version |
 | `--max-concurrent <number>` | Maximum concurrent site processing (1-50, overrides config/default) |
+| `--dns <ip[,ip,...]>` | Resolver(s) for the DNS pre-check **and** nettools' `dig` (one pins, several rotate per query; overrides `/etc/resolv.conf`). Does not affect Chrome navigation or `whois`. Useful when the system resolver is flaky and `dig`-gated domains time out |
+| `--show-dead-domains` | At end of scan, list hostnames that did not resolve / were unreachable (`NXDOMAIN`/`ENODATA` + `ERR_NAME_NOT_RESOLVED`/`ERR_ADDRESS_UNREACHABLE`). Excludes blocks/timeouts (those mean the domain is alive). For pruning dead URLs. |
 | `--cleanup-interval <number>` | Browser restart interval in URLs processed (1-1000, overrides config/default) |
 ### Validation Options
@@ -90,6 +93,37 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
 | `--clear-cache`             | Clear persistent cache before scanning (improves fresh start performance) |
 | `--ignore-cache`            | Bypass all smart caching functionality during scanning |
+### Blocking ads during the scan
+`--block-ads` makes the scanner **block** matching requests *during* the scan (separate from capturing rules) — to keep ad/tracker noise out of the page, speed up loads, or test that a list catches what it should.
+**Adding lists.** Pass one or more EasyList-format filter lists (same syntax as uBlock Origin / EasyList):
+```bash
+# Single list
+node nwss.js --block-ads=easylist.txt
+# Multiple lists — comma-separated, no spaces
+node nwss.js --block-ads=easylist.txt,easyprivacy.txt,mylist.txt
+```
+Lists are plain-text **network** rules — `||doubleclick.net^`, `/ads/*`, `||example.com^$script`, etc. Element-hiding/cosmetic rules (`##…`) don't apply to request blocking and are ignored. The scanned page's own top-level document is never blocked (only sub-resources), so a site whose own domain is in a list still loads.
+**Engine — `js` vs `rust`** (`--adblock-engine`, default `js`):
+| Engine | Flag | Backend | When |
+|---|---|---|---|
+| **js** (default) | `--adblock-engine=js` | `lib/adblock.js` — pure-JS, no extra deps | Default; fine for small/medium lists, works everywhere |
+| **rust** | `--adblock-engine=rust` | `lib/adblock-rust.js` — Brave's [`adblock-rs`](https://github.com/brave/adblock-rust) | Large lists (full EasyList + EasyPrivacy + …); much faster matching. Drop-in (same rules, same results). Requires `npm install adblock-rs` (needs a Rust toolchain) |
+The two engines are interchangeable — same rule format, same blocking result; `rust` is purely a speed option for big lists. If you pass `--adblock-engine=rust` without `adblock-rs` installed, install it (`npm i adblock-rs`) or drop the flag to use `js`.
+```bash
+# Fast matching over big lists with the Rust engine
+npm install adblock-rs
+node nwss.js --block-ads=easylist.txt,easyprivacy.txt --adblock-engine=rust
+```
 ---
 ## config.json Format
@@ -152,6 +186,7 @@ Example:
 | `userAgent`          | `chrome`, `chrome_mac`, `chrome_linux`, `firefox`, `firefox_mac`, `firefox_linux`, `safari` | - | User agent for page |
 | `filterRegex`        | String or Array | `.*` | Regex or list of regexes to match requests |
 | `regex_and`          | Boolean | `false` | Use AND logic for multiple filterRegex patterns - ALL patterns must match the same URL |
+| `output_regex`       | String | — | Regex applied to each matched URL to build the rule body: capture group 1 (or whole match) becomes `\|\|<capture>` instead of `\|\|host^`. E.g. `^https?:\/\/([^\/]+\/[^\/]+\/)` turns `https://host.com/script/abc.js` into `\|\|host.com/script/`. The capture must include the host. No match → falls back to `\|\|host^`. Adblock-only; domain formats (dnsmasq/pihole/hosts/plain) emit the bare host |
 | `comments`           | String or Array | - | String of comments or references |
 | `resourceTypes`      | Array | `["script", "xhr", "image", "stylesheet"]` | What resource types to monitor |
 | `reload`             | Integer | `1` | Number of times to reload page |
@@ -176,8 +211,7 @@ Example:
 | Field                | Values | Default | Description |
 |:---------------------|:-------|:-------:|:------------|
-| `follow_redirects`   | Boolean | `true` | Follow redirects to new domains |
-| `max_redirects`      | Integer | `10` | Maximum number of redirects to follow |
+| `max_redirects`      | Integer | `10` | Maximum number of redirects to follow (`0` = follow none) |
 | `js_redirect_timeout` | Milliseconds | `5000` | Time to wait for JavaScript redirects |
 | `detect_js_patterns` | Boolean | `true` | Analyze page source for redirect patterns |
 | `redirect_timeout_multiplier` | Number | `1.5` | Increase timeout for redirected URLs |
@@ -279,6 +313,8 @@ When a page redirects to a new domain, first-party/third-party detection is base
 | `interact_duration`  | Milliseconds | `2000` | Duration of interaction simulation |
 | `interact_scrolling` | Boolean | `true` | Enable scrolling simulation |
 | `interact_clicks`    | Boolean | `false` | Enable element clicking simulation |
+| `interact_click_count` | Integer | `3` | Number of random content-zone clicks per load (capped at 20). Default 3 = primary + 2 backups, since ad SDKs sometimes suppress the 1st/2nd click as warmup |
+| `realistic_click`    | Boolean | `false` | Higher click fidelity: denser mouse approach (15 steps), ±1px hand-tremor micro-moves during the press, and ±1.5px mouseup drift (so mousedown≠mouseup coords) — for sites that score click realism. Costs ~80–120ms/click |
 | `interact_typing`    | Boolean | `false` | Enable typing simulation |
 | `interact_intensity` | String | `"medium"` | Interaction simulation intensity: "low", "medium", "high" |
 | `cursor_mode`        | `"ghost"` | - | Use ghost-cursor Bezier mouse movements (requires `npm i ghost-cursor`) |
@@ -295,6 +331,21 @@ When a page redirects to a new domain, first-party/third-party detection is base
 | `ignore_similar_threshold` | Integer | - | Override global similarity threshold for this site |
 | `ignore_similar_ignored_domains` | Boolean | - | Override global `ignore_similar_ignored_domains` for this site |
+### Popup Capture Options
+Capture (and optionally drive) the popup/popunder windows that ad and redirect
+scripts open, so domains reachable only via a popup chain still match `filterRegex`.
+The same `filterRegex` applies to the whole chain — it must contain every pattern
+you expect along it. Popup capture only fires when the main page is actually
+clicking, so set `interact: true` **and** `interact_clicks: true` as well.
+| Field                | Values | Default | Description |
+|:---------------------|:-------|:-------:|:------------|
+| `capture_popups`     | Boolean | `false` | Capture popup windows opened during the scan and evaluate their landing URL + in-popup requests against `filterRegex`/`dig`/`whois` (requires `interact` + `interact_clicks` to fire user-gesture clicks) |
+| `interact_popups`    | Boolean | `false` | Mouse-click inside captured popups (3 content-zone clicks) so the chain cascades to its next redirect/ad. Requires `capture_popups`. Clicks popups up to `capture_popups_max_depth − 1` (the deepest captured popup is observed, not clicked) |
+| `capture_popups_max_depth` | Integer | `4` | Max popup-chain depth to capture (`site → p1 → p2 → p3 → destination`). Each extra level multiplies popups + time |
+| `capture_popups_window_ms` | Integer | `5000` | Per-popup capture window (ms) before the popup is auto-closed |
 ### VPN Options
 Route traffic through a VPN for specific sites. Requires `sudo` privileges. The VPN connection is established before scanning and torn down after the site completes.
@@ -596,8 +647,11 @@ node nwss.js --max-concurrent 12 --cleanup-interval 300 -o rules.txt
 {
   "url": "https://anti-bot-site.com",
   "interact": true,
+  "interact_clicks": true,
   "cursor_mode": "ghost",
-  "ghost_cursor_duration": 3000,
+  "realistic_click": true,
+  "interact_click_count": 3,
+  "ghost_cursor_duration": 5000,
   "ghost_cursor_speed": 1.2,
   "fingerprint_protection": "random",
   "filterRegex": "tracking|analytics",
@@ -610,6 +664,12 @@ Or enable globally via CLI:
 node nwss.js --ghost-cursor --debug -o rules.txt
 ```
+**Ghost-cursor clicks.** The cursor moves with `cursor_mode: "ghost"`, but it only *clicks* when both `interact: true` **and** `interact_clicks: true` are set (same rule as the built-in path). Click behavior:
+- `realistic_click: true` — each press adds hand-tremor during the hold and a mouseup drift, so `mousedown` ≠ `mouseup` coordinates (the press is routed through the same `humanClick` the built-in content clicks use).
+- `interact_click_count` — number of clicks per load (default `3`, capped at `20`). The default of 3 matters because some ad SDKs swallow the 1st/2nd click as warmup.
+- **Duration vs. clicks:** realistic clicks take ~600–700ms each, and the bezier movement loop reserves up to **half** of `ghost_cursor_duration` for them. So the default `ghost_cursor_duration: 2000` only fits **~1 click** — raise it to roughly `interact_click_count × 700 + movement` (e.g. `5000`–`8000`) to fit all of them.
 > **Note:** ghost-cursor is an optional dependency. Install with `npm install ghost-cursor`. If not installed, the scanner falls back to the built-in mouse simulation automatically.
 #### E-commerce Site Scanning

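The `output_regex` row added to the README above says that capture group 1 (or the whole match) becomes the rule body, with a fallback to `||host^` when the regex does not match. A minimal sketch of that behavior, using the README's own example pattern, might look like this. The function name `buildAdblockRule` is an assumption, and the package's real implementation (memoized compilation, validation at config load) is not shown in this diff.

```javascript
// Hypothetical helper: derives an adblock rule from a matched URL.
// Assumes outputRegex is a non-global RegExp (a /g flag would make exec()
// stateful across calls).
function buildAdblockRule(url, outputRegex) {
  const host = new URL(url).hostname;
  if (outputRegex) {
    const match = outputRegex.exec(url);
    // Prefer capture group 1; fall back to the whole match if the pattern
    // has no group. Per the README, the capture must include the host.
    if (match) return `||${match[1] ?? match[0]}`;
  }
  // No regex or no match: default host-level rule.
  return `||${host}^`;
}

// Pattern taken from the README example.
const pathPrefix = /^https?:\/\/([^\/]+\/[^\/]+\/)/;

buildAdblockRule('https://host.com/script/abc.js', pathPrefix); // '||host.com/script/'
buildAdblockRule('https://host.com/abc.js', pathPrefix);        // '||host.com^'
```

The second call falls back because the URL has no directory segment after the host, so the pattern does not match.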
package/eslint.config.mjs CHANGED

@@ -2,5 +2,17 @@ import globals from "globals";
 import { defineConfig } from "eslint/config";
 export default defineConfig([
-  { files: ["**/*.{js,mjs,cjs}"], languageOptions: { globals: globals.browser } },
+  {
+    files: ["**/*.{js,mjs,cjs}"],
+    // Node globals (require/module/process/Buffer/...) plus browser globals
+    // (document/window/navigator) — the latter are referenced inside
+    // page.evaluate() callbacks that eslint parses as part of the file.
+    languageOptions: { globals: { ...globals.node, ...globals.browser } },
+    // Catch undefined-variable references statically. node --check only
+    // validates syntax, so an orphaned identifier (e.g. a const that was
+    // removed while a usage remained) passes parsing but throws
+    // ReferenceError at runtime only when that branch executes. no-undef
+    // turns that whole class into a build-time failure.
+    rules: { "no-undef": "error" },
+  },
 ]);

package/lib/browserhealth.js CHANGED

@@ -7,12 +7,11 @@ const { formatLogMessage, messageColors } = require('./colorize');
 const IS_PAGE_FROM_PREVIOUS_SCAN_TAG = messageColors.processing('[isPageFromPreviousScan]');
 const REALTIME_CLEANUP_TAG = messageColors.processing('[realtime_cleanup]');
 const GROUP_WINDOW_CLEANUP_TAG = messageColors.processing('[group_window_cleanup]');
-const { execSync, execFile } = require('child_process');
+const { execFile } = require('child_process');
 // Window cleanup delay constant
 const WINDOW_CLEANUP_DELAY_MS = 15000;
 // window_clean REALTIME
-const REALTIME_CLEANUP_BUFFER_MS = 25000; // Additional buffer time after site delay (increased for Cloudflare)
 const REALTIME_CLEANUP_THRESHOLD = 12; // Default number of pages to keep
 const REALTIME_CLEANUP_MIN_PAGES = 6; // Minimum pages before cleanup kicks in
@@ -380,7 +379,30 @@ async function performRealtimeWindowCleanup(browserInstance, threshold = REALTIM
     // Use the provided total delay (already includes appropriate buffer)
     const cleanupDelay = totalDelay;
+    // Pre-wait short-circuit. The only pages this pass can ever close are popups
+    // (untracked) and idle pages — active main pages are protected by
+    // isPageSafeToClose. When concurrency exceeds the threshold the page count is
+    // dominated by active main pages, so without this we'd wait the full
+    // cleanupDelay and then close nothing (e.g. max_concurrent 30 vs threshold 8
+    // = a ~36s no-op on every task). If nothing is even a candidate, skip the
+    // wait. A main task that finishes during the skipped wait closes its OWN page,
+    // so realtime cleanup never needed to wait for it.
+    const hasCloseCandidate = quickPages.some(p => {
+      if (p.isClosed()) return false;
+      const usage = pageUsageTracker.get(p);
+      return !usage || !usage.isProcessing; // untracked popup, or a tracked-idle page
+    });
+    if (!hasCloseCandidate) {
+      if (forceDebug) {
+        console.log(formatLogMessage('debug', `${REALTIME_CLEANUP_TAG} ${quickPages.length} pages but all actively processing — skipping ${cleanupDelay}ms wait (nothing closeable)`));
+      }
+      result.success = true;
+      result.totalPages = quickPages.length;
+      result.reason = 'all_active';
+      return result;
+    }
     if (forceDebug) {
       console.log(formatLogMessage('debug', `${REALTIME_CLEANUP_TAG} Waiting ${cleanupDelay}ms before cleanup (threshold: ${threshold})`));
     }

package/lib/dns.js ADDED

@@ -0,0 +1,238 @@
+/**
+ * DNS pre-check resolver with multi-nameserver rotation.
+ *
+ * Owns nameserver selection and robust resolution for the scan's DNS
+ * pre-check. The default global resolver leads EVERY query with the FIRST
+ * nameserver in /etc/resolv.conf, so under scan concurrency one server
+ * (typically the ISP resolver) takes the whole c-ares burst and starts
+ * answering REFUSED while the other configured servers (e.g. 8.8.8.8/8.8.4.4)
+ * sit idle. This module builds one Resolver per nameserver — each leading with
+ * a different server, the rest kept as failover order — and round-robins them
+ * per resolve attempt so the lead spreads across all servers (and across the
+ * retry). A `--dns` override pins/rotates an explicit list instead of
+ * resolv.conf.
+ *
+ * Scope: this affects the pre-check resolver only. Chrome's navigation DNS
+ * (OS resolver) and nettools' dig/whois are separate paths and unaffected.
+ */
+const net = require('node:net');
+const dnsPromises = require('node:dns/promises');
+const { getServers: getSystemDnsServers } = require('node:dns');
+const { Resolver: DnsPromiseResolver } = require('node:dns/promises');
+const { formatLogMessage } = require('./colorize');
+// c-ares codes that mean "resolver problem" (retry-worthy / fail-open), not
+// "the host does not exist".
+const DNS_TRANSIENT_ERRORS = new Set(['ETIMEOUT', 'ESERVFAIL', 'EREFUSED', 'ECONNREFUSED']);
+/**
+ * True only for a definitive "host does not exist / has no address" answer —
+ * the only case that justifies skipping a URL in the pre-check. Everything
+ * else (EREFUSED, ESERVFAIL, ETIMEOUT, ECONNREFUSED, timeout) is a resolver
+ * problem the caller should fail open on.
+ * @param {string} code
+ * @returns {boolean}
+ */
+function isNonExistenceError(code) {
+  return code === 'ENOTFOUND' || code === 'ENODATA';
+}
+// Accept a bare IPv4/IPv6 address, or an address with a port in the exact form
+// Resolver.setServers() understands: `ipv4:port` or `[ipv6]:port`.
+function isResolverSpec(s) {
+  if (net.isIP(s)) return true;
+  const bracketed = s.match(/^\[([0-9a-fA-F:]+)\](?::\d{1,5})?$/);
+  if (bracketed) return net.isIP(bracketed[1]) === 6;
+  const v4port = s.match(/^(\d{1,3}(?:\.\d{1,3}){3}):\d{1,5}$/);
+  if (v4port) return net.isIP(v4port[1]) === 4;
+  return false;
+}
+/**
+ * Parse + validate a `--dns` / config value into a clean, de-duplicated server
+ * list. Accepts a comma-separated string or an array. Each entry may be a bare
+ * IPv4/IPv6 address or an address with a port (`8.8.8.8:5353`,
+ * `[2001:db8::1]:5353`) — the form setServers() accepts. Invalid entries are
+ * warned and dropped; duplicates are collapsed so the rotation stays even.
+ * @param {string|string[]|undefined} raw
+ * @returns {string[]} validated server specs (possibly empty)
+ */
+function parseDnsServers(raw) {
+  if (!raw) return [];
+  const parts = (Array.isArray(raw) ? raw : String(raw).split(','))
+    .map(s => String(s).trim())
+    .filter(Boolean);
+  const valid = [];
+  const seen = new Set();
+  for (const p of parts) {
+    if (!isResolverSpec(p)) {
+      console.warn(`⚠ --dns: ignoring invalid server "${p}" (expected IPv4/IPv6, optionally with :port)`);
+      continue;
+    }
+    if (!seen.has(p)) { seen.add(p); valid.push(p); }
+  }
+  return valid;
+}
+/**
+ * Build a rotating pre-check resolver.
+ * @param {object} [opts]
+ * @param {string[]} [opts.servers] - explicit servers (from --dns). When empty,
+ *   the system resolv.conf servers are used.
+ * @param {boolean} [opts.forceDebug] - emit a debug line on the retry path.
+ * @returns {{ resolveHost: (hostname:string, timeoutMs:number)=>Promise<void>,
+ *   servers: string[], rotates: boolean, pinned: boolean }}
+ *   resolveHost resolves on success and rejects with the final error
+ *   (err.code intact) on failure.
+ */
+function createRotatingResolver(opts = {}) {
+  const forceDebug = !!opts.forceDebug;
+  const override = Array.isArray(opts.servers) && opts.servers.length > 0 ? opts.servers : null;
+  let systemServers = [];
+  try { systemServers = getSystemDnsServers(); } catch { systemServers = []; }
+  const servers = override || systemServers;
+  // Pin/rotate an explicit --dns list (even a single server — never fall back
+  // to the OS resolver in that case). For resolv.conf, only build a pool when
+  // there is more than one server to rotate; otherwise use the global API
+  // (which already reads resolv.conf).
+  const shouldPool = override ? servers.length >= 1 : servers.length > 1;
+  let pool = null;
+  if (shouldPool) {
+    pool = servers.map((_, i) => {
+      const r = new DnsPromiseResolver();
+      // setServers accepts exactly what we hold here: getServers()'s own output
+      // (system path) or net-validated specs incl. ip:port (override path).
+      // Keep the resolver's default servers if an entry is somehow rejected.
+      try { r.setServers([...servers.slice(i), ...servers.slice(0, i)]); } catch { /* keep default */ }
+      return r;
+    });
+  }
+  let cursor = 0;
+  // Resolver for the next attempt: rotated when a pool exists, else the global
+  // promises API. `cursor++` is a synchronous single-threaded increment, so even
+  // under heavy concurrency every caller gets a distinct slot and the lead
+  // distribution stays exactly even (no locking needed).
+  const nextResolver = () => (pool ? pool[cursor++ % pool.length] : dnsPromises);
+  // One resolution attempt: rotate the lead server, resolve4 first, and on
+  // no-IPv4 (ENODATA/ENOTFOUND) fall back to resolve6 so IPv6-only hosts aren't
+  // wrongly skipped. Any OTHER code propagates unchanged so the caller sees the
+  // real resolver error. A timeout is kept as a safety net: c-ares queries do
+  // not use the libuv threadpool, so it should rarely fire.
+  async function attempt(hostname, timeoutMs) {
+    const resolver = nextResolver();
+    let timer;
+    try {
+      const timeoutP = new Promise((_, reject) => {
+        timer = setTimeout(() => reject(new Error('DNS timeout')), timeoutMs);
+      });
+      const chain = resolver.resolve4(hostname).catch(err => {
+        if (err && (err.code === 'ENODATA' || err.code === 'ENOTFOUND')) {
+          return resolver.resolve6(hostname);
+        }
+        throw err;
+      });
+      await Promise.race([chain, timeoutP]);
+    } finally {
+      if (timer) clearTimeout(timer);
+    }
+  }
+  /**
+   * Resolve a hostname, rotating the lead server per attempt and retrying once
+   * on a transient/resolver error or a timeout. With a multi-server pool the
+   * retry leads with another server, so if one REFUSES the retry can still get
+   * an answer. A failure on the retry propagates to the caller.
+   */
+  async function resolveHost(hostname, timeoutMs) {
+    try {
+      await attempt(hostname, timeoutMs);
+    } catch (firstErr) {
+      const code = firstErr && firstErr.code;
+      if (DNS_TRANSIENT_ERRORS.has(code) || (firstErr && firstErr.message === 'DNS timeout')) {
+        if (forceDebug) console.log(formatLogMessage('debug', `DNS pre-check transient (${code || 'timeout'}) for ${hostname}, retrying once`));
+        await attempt(hostname, timeoutMs);
+      } else {
+        throw firstErr;
+      }
+    }
+  }
+  return { resolveHost, servers, rotates: !!pool, pinned: !!override };
+}
+/**
+ * Circuit breaker for the DNS pre-check. During a resolver-refusal storm the
+ * pre-check is worthless (every host fails open and proceeds anyway) and
+ * actively harmful (it piles ~2× the queries — with the retry — onto an
+ * already-refusing resolver). This trips when resolver errors dominate a recent
+ * window of attempts and suspends pre-checking for a cooldown so the resolver
+ * gets breathing room; sites still load (a suspended pre-check just proceeds to
+ * navigation, exactly like a single fail-open). NXDOMAIN and success count as
+ * HEALTHY (the resolver answered) — only resolver errors (EREFUSED / ESERVFAIL
+ * / ETIMEOUT / ECONNREFUSED / timeout) count against it.
+ *
+ * @param {object} [opts]
+ * @param {number} [opts.window=20]        attempts kept in the rolling window
+ * @param {number} [opts.threshold=10]     resolver-errors in the window to trip
+ * @param {number} [opts.cooldownMs=30000] how long to stay suspended once tripped
+ * @param {boolean} [opts.forceDebug]
+ * @param {function} [opts.now]            clock injection (tests); defaults to Date.now
+ * @returns {{ record:(isResolverError:boolean)=>void, isTripped:()=>boolean,
+ *   stats:()=>{tripped:boolean,errorCount:number,windowFill:number,trips:number} }}
+ */
+function createDnsCircuitBreaker(opts = {}) {
+  const windowSize = opts.window || 20;
+  const threshold = opts.threshold || 10;
+  const cooldownMs = opts.cooldownMs != null ? opts.cooldownMs : 30000;
+  const forceDebug = !!opts.forceDebug;
+  const now = opts.now || Date.now;
+  const recent = [];   // booleans, true = resolver error
+  let errorCount = 0;
+  let openUntil = 0;   // suspended while now() < openUntil
+  let trips = 0;
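+  // Feed one resolve outcome. Normally called while closed (a suspended
+  // pre-check skips the resolve, so no outcome is produced). An outcome that
+  // arrives while suspended is still counted, but the now() >= openUntil guard
+  // below keeps it from tripping again.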
+  function record(isResolverError) {
+    recent.push(!!isResolverError);
+    if (isResolverError) errorCount++;
+    if (recent.length > windowSize && recent.shift()) errorCount--;
+    if (now() >= openUntil && errorCount >= threshold) {
+      openUntil = now() + cooldownMs;
+      trips++;
+      console.log(formatLogMessage('warn', `[dns-precheck] resolver errors ${errorCount}/${recent.length} — suspending DNS pre-check ${Math.round(cooldownMs / 1000)}s (sites still load; backing off the resolver)`));
+    }
+  }
+  // True while suspended. On the first call after the cooldown elapses, resume
+  // with a clean window so the storm is re-measured fresh rather than re-tripping
+  // on stale errors.
+  function isTripped() {
+    if (now() < openUntil) return true;
+    if (openUntil !== 0) {
+      openUntil = 0;
+      recent.length = 0;
+      errorCount = 0;
+      if (forceDebug) console.log(formatLogMessage('debug', '[dns-precheck] cooldown elapsed — resuming DNS pre-check'));
+    }
+    return false;
+  }
+  return {
+    record,
+    isTripped,
+    stats: () => ({ tripped: now() < openUntil, errorCount, windowFill: recent.length, trips }),
+  };
+}
+module.exports = {
+  createRotatingResolver,
+  createDnsCircuitBreaker,
+  parseDnsServers,
+  isNonExistenceError,
+};
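The trip-and-resume behaviour of the circuit breaker above is easiest to see with an injected clock. The following is a minimal standalone sketch, not the package's code: `createBreaker`, the small window/threshold values, and the `fakeTime` variable are illustrative assumptions, and logging is omitted. It mirrors the rolling-window logic shown in the diff so it can run without the package.

```javascript
// Simplified re-statement of the rolling-window breaker, for illustration only.
// `createBreaker` and `fakeTime` are hypothetical names, not package exports.
function createBreaker({ window: windowSize = 20, threshold = 10, cooldownMs = 30000, now = Date.now } = {}) {
  const recent = [];   // booleans, true = resolver error
  let errorCount = 0;
  let openUntil = 0;   // suspended while now() < openUntil
  let trips = 0;

  function record(isResolverError) {
    recent.push(!!isResolverError);
    if (isResolverError) errorCount++;
    // Drop the oldest outcome once the window overflows.
    if (recent.length > windowSize && recent.shift()) errorCount--;
    if (now() >= openUntil && errorCount >= threshold) {
      openUntil = now() + cooldownMs;
      trips++;
    }
  }

  function isTripped() {
    if (now() < openUntil) return true;
    if (openUntil !== 0) {
      // Cooldown elapsed: resume with a clean window.
      openUntil = 0;
      recent.length = 0;
      errorCount = 0;
    }
    return false;
  }

  const stats = () => ({ tripped: now() < openUntil, errorCount, windowFill: recent.length, trips });
  return { record, isTripped, stats };
}

// Drive the breaker with a fake clock so no real waiting is needed.
let fakeTime = 1000;
const breaker = createBreaker({ window: 4, threshold: 2, cooldownMs: 500, now: () => fakeTime });

breaker.record(false);                     // success or NXDOMAIN: resolver answered
breaker.record(true);                      // resolver error, e.g. EREFUSED
const beforeTrip = breaker.isTripped();    // one error, below threshold
breaker.record(true);                      // second error reaches the threshold
const afterTrip = breaker.isTripped();     // suspended during the cooldown
fakeTime += 500;                           // advance past the cooldown
const afterCooldown = breaker.isTripped(); // resumes and clears the window
const finalStats = breaker.stats();

console.log({ beforeTrip, afterTrip, afterCooldown, finalStats });
```

In a caller, the expected pattern would be to check `isTripped()` before each pre-check, skip the resolve while it returns true, and otherwise feed the outcome to `record()` with `true` only for resolver errors.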