@fanboynz/network-scanner 3.2.0 → 3.3.0

package/CHANGELOG.md CHANGED
@@ -2,6 +2,34 @@
2
2
 
3
3
  All notable changes to the Network Scanner (nwss.js) project.
4
4
 
5
+ ## [3.3.0] - 2026-06-06
6
+
7
+ ### Added
8
+ - **DNS dead-domain skip + corroborated persistence** — within a scan, once a host resolves NXDOMAIN/ENODATA it is remembered and repeat URLs on that host are skipped without re-resolving. With `--dns-cache`, a host that *also* fails navigation (`ERR_NAME_NOT_RESOLVED` / `ERR_ADDRESS_UNREACHABLE`) is corroborated and persisted to the negative cache (`.dnsnegcache`, 12h TTL) so it is skipped on the next run too. Only definitive non-existence is cached — resolver errors fail open and never poison a live host.
9
+ - **`acceptInsecureCerts` on browser launch** — TLS/cert errors (expired, self-signed, name-mismatch) no longer abort navigation, so streaming/pirate domains with broken certs are still scanned.
10
+ - **`--disable-popup-blocking` when a site uses `capture_popups`** — Chrome's pop-up blocker (`chrome://settings/content/popups`) is turned off only for popup-capture scans, so non-gesture popunders (document-level `onclick` / timer SDKs) fire and get captured too. Non-popup scans keep the blocker on (stealthier — a real browser blocks non-gesture `window.open()`); gesture-triggered popups already worked via the synthetic-click path.
11
+
12
+ ### Changed
13
+ - **The main-frame document is never blocked** — the scanned page (and any main-frame redirect target) is exempt from adblock / `blocked` / `blockDomainsByUrl` aborts. Aborting it made the navigation never commit (`about:blank` → timeout), silently breaking scanned URLs that matched our own filter lists (common on adult/pirate/stream domains). The request still flows through the matcher, so a main-frame redirect destination (e.g. a filecrypt → ad-domain hop) is still captured; sub-frame / ad iframes stay blockable.
14
+ - **Navigation timeouts are recovered, not discarded** — on a nav timeout the scanner retries leniently and proceeds with the partially-loaded page instead of dropping the URL (a page still at `about:blank` is still treated as a failure).
15
+ - **whois disk-cache TTL raised to 36h** (dig stays 20h) — registrar data is stable and whois servers rate-limit aggressively, so a longer TTL cuts repeat queries.
16
+ - **VPN is Linux-only with a clear guard** — `vpn` / `openvpn` on macOS/Windows now returns an explicit "Linux-only" error instead of cryptic `ip` / `/proc` failures.
17
+
18
+ ### Performance
19
+ - **`psl.parse` memoized by hostname** in the request hot path — both per-request handlers (main page + popup capture) parsed the root domain of *every* request, while a page hammers the same handful of hosts (CDN, analytics, ad domains). A hostname-keyed memo turns almost all of those into `Map` hits, replacing the URL-keyed cache (fewer + shorter keys, far higher hit rate).
20
+ - **Lower per-request overhead** — the iframe-loop guard's `frame().url()` lookup is now gated behind a cheap URL string test instead of running on every request.
21
+ - **Removed redundant disk I/O** — a leaked adblock combined-list temp file in `tmpdir` is now cleaned up, and a redundant `existsSync` before each forced screenshot's recursive `mkdir` was dropped.
22
+
23
+ ### Fixed
24
+ - **Periodic debug/`--dumpurls` log flush is now synchronous** — the 2s timer used async `fs.writeFile({flag:'a'})` with no in-flight guard, so two ticks could append to the same file concurrently and interleave lines, and it cleared the buffer *before* the write confirmed (silently dropping entries on a failed write). It now uses `appendFileSync`, clears only after a successful write (transient failures retry next tick), and is bounded so a permanently-unwritable path can't grow memory.
25
+ - **Dead-domain skip works without `--show-dead-domains`** — the in-scan skip recorded into the dead set only when the report flag was on, which made the skip dead code; recording is now unconditional and the flag gates only the end-of-scan report. Transient DNS errors were also dropped from the dead-domain match so only `ERR_NAME_NOT_RESOLVED` / `ERR_ADDRESS_UNREACHABLE` mark a host dead.
26
+
27
+ ### Removed
28
+ - **Hardcoded `dmzjmp` iframe-loop guard** — the domain-specific abort for a `creative.dmzjmp.com` frame requesting `go.dmzjmp.com/api/models` (added mid-2025 to stop a runaway request loop) has not recurred and was removed from the request hot path; the per-URL timeout remains the backstop. Recoverable from git history — prefer a config-driven `iframe_loop_guards` entry if it ever returns.
29
+
30
+ ### Documentation
31
+ - **README + man page now document `--block-ads` and `--adblock-engine`** — blocking ads/trackers *during* the scan with EasyList-format list(s) (comma-separated), and the `js` (default, native parser) vs `rust` (Brave `adblock-rs`) matcher backends.
32
+
5
33
  ## [3.2.0] - 2026-06-04
6
34
 
7
35
  ### Added
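Review note on the 3.3.0 "DNS dead-domain skip + corroborated persistence" entry: the rule it describes has two stages. A definitive DNS failure marks a host dead for the rest of the scan, and only a host that also fails navigation becomes a candidate for the persisted negative cache. The sketch below is one way to express that rule, under assumptions. `DeadDomainTracker` and its method names are hypothetical and do not appear in the package, and the mapping of NXDOMAIN/ENODATA to Node's `ENOTFOUND`/`ENODATA` error codes is an assumption about how the pre-check reports them. How the scanner orders the two checks is not visible in this diff, so the sketch simply records both signals.

```javascript
// Illustrative only: DeadDomainTracker and its methods are hypothetical names,
// not identifiers from nwss.js.
const DEFINITIVE_DNS_CODES = new Set(['ENOTFOUND', 'ENODATA']); // assumed codes for NXDOMAIN / no data
const DEAD_NAV_ERRORS = ['ERR_NAME_NOT_RESOLVED', 'ERR_ADDRESS_UNREACHABLE'];

class DeadDomainTracker {
  constructor() {
    this.deadThisScan = new Map(); // host -> reason, drives the in-scan skip
    this.corroborated = new Set(); // hosts eligible for the persisted negative cache
  }

  // Record a DNS pre-check failure. Resolver trouble (timeouts, SERVFAIL)
  // is ignored, so a live host is never marked dead.
  noteDnsFailure(host, code) {
    if (DEFINITIVE_DNS_CODES.has(code) && !this.deadThisScan.has(host)) {
      this.deadThisScan.set(host, code);
    }
  }

  // Record a navigation failure. Only a host that DNS already called dead
  // becomes a persistence candidate.
  noteNavFailure(host, message) {
    const matched = DEAD_NAV_ERRORS.some((e) => message.includes(e));
    if (matched && this.deadThisScan.has(host)) this.corroborated.add(host);
  }

  // Repeat URLs on a dead host are skipped without re-resolving.
  shouldSkip(url) {
    try { return this.deadThisScan.has(new URL(url).hostname); }
    catch { return false; }
  }
}

const tracker = new DeadDomainTracker();
tracker.noteDnsFailure('gone.example', 'ENOTFOUND');   // definitive: remembered
tracker.noteDnsFailure('flaky.example', 'ETIMEOUT');   // resolver error: fails open
tracker.noteNavFailure('gone.example', 'net::ERR_NAME_NOT_RESOLVED at https://gone.example/');
tracker.noteNavFailure('flaky.example', 'net::ERR_NAME_NOT_RESOLVED at https://flaky.example/');
console.log(tracker.shouldSkip('https://gone.example/page2'), [...tracker.corroborated]);
```

Keeping the in-scan set and the persistence candidates separate mirrors the changelog's distinction: the first is cheap and always on, the second needs two independent failures before anything is written to disk.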
package/README.md CHANGED
@@ -66,9 +66,10 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
66
66
  | `--use-puppeteer-core` | Use `puppeteer-core` with system Chrome instead of bundled Chromium |
67
67
  | `--use-obscura` | Connect to running Obscura CDP server (`ws://127.0.0.1:9222` or `OBSCURA_WS` env). Skips fingerprint injection — Obscura provides built-in stealth |
68
68
  | `--load-extension <path>` | Load unpacked Chrome extension from directory (can be used multiple times) |
69
- | `--dns-cache` | Persist dig/whois results to disk between runs (20hr TTL, 2000-entry cap each, `.digcache`/`.whoiscache`), **plus** the DNS pre-check negative cache (NXDOMAIN/ENODATA only — never resolver errors — 12h TTL, `.dnsnegcache`) so known-dead hosts aren't re-resolved next run. Disk writes are atomic (tmp + rename); corrupt cache files are detected on load with a `[dns-cache]` warn line and reset cleanly. |
70
- | `--no-dns-precheck` | Disable per-URL DNS resolution check before page navigation. By default, hosts that dig/whois have already proven live (within the 20hr cache TTL) skip their c-ares pre-check via a positive-resolution index. |
71
- | `--block-ads=<files>` | Block ads using EasyList format rules (comma-separated: `easylist.txt,easyprivacy.txt`) |
69
+ | `--dns-cache` | Persist dig/whois results to disk between runs (dig 20hr / whois 36hr TTL, 2000-entry cap each, `.digcache`/`.whoiscache`), **plus** the DNS pre-check negative cache (NXDOMAIN/ENODATA only — never resolver errors — 12h TTL, `.dnsnegcache`) so known-dead hosts aren't re-resolved next run. Disk writes are atomic (tmp + rename); corrupt cache files are detected on load with a `[dns-cache]` warn line and reset cleanly. |
70
+ | `--no-dns-precheck` | Disable per-URL DNS resolution check before page navigation. By default, hosts that dig/whois have already proven live (within the dig/whois cache TTL) skip their c-ares pre-check via a positive-resolution index. |
71
+ | `--block-ads=<files>` | Block ads/trackers **during the scan** using EasyList-format filter list(s) (`\|\|domain^`, `/ads/*`, etc.). Comma-separated for multiple: `--block-ads=easylist.txt,easyprivacy.txt`. See [Blocking ads during the scan](#blocking-ads-during-the-scan). |
72
+ | `--adblock-engine=<js\|rust>` | Matcher backend for `--block-ads` (default: `js`). `rust` uses Brave's `adblock-rs` (much faster on large lists) and requires `npm i adblock-rs`. |
72
73
  | `--cdp` | Enable Chrome DevTools Protocol logging (now per-page if enabled) |
73
74
  | `--remove-dupes` | Remove duplicate domains from output (only with `-o`) |
74
75
  | `--dry-run` | Console output only: show matching regex, titles, whois/dig/searchstring results, and adblock rules |
@@ -92,6 +93,37 @@ A Puppeteer-based tool for scanning websites to find third-party (or optionally
92
93
  | `--clear-cache` | Clear persistent cache before scanning (improves fresh start performance) |
93
94
  | `--ignore-cache` | Bypass all smart caching functionality during scanning |
94
95
 
96
+ ### Blocking ads during the scan
97
+
98
+ `--block-ads` makes the scanner **block** matching requests *during* the scan (separate from capturing rules) — to keep ad/tracker noise out of the page, speed up loads, or test that a list catches what it should.
99
+
100
+ **Adding lists.** Pass one or more EasyList-format filter lists (same syntax as uBlock Origin / EasyList):
101
+
102
+ ```bash
103
+ # Single list
104
+ node nwss.js --block-ads=easylist.txt
105
+
106
+ # Multiple lists — comma-separated, no spaces
107
+ node nwss.js --block-ads=easylist.txt,easyprivacy.txt,mylist.txt
108
+ ```
109
+
110
+ Lists are plain-text **network** rules — `||doubleclick.net^`, `/ads/*`, `||example.com^$script`, etc. Element-hiding/cosmetic rules (`##…`) don't apply to request blocking and are ignored. The scanned page's own top-level document is never blocked (only sub-resources), so a site whose own domain is in a list still loads.
111
+
112
+ **Engine — `js` vs `rust`** (`--adblock-engine`, default `js`):
113
+
114
+ | Engine | Flag | Backend | When |
115
+ |---|---|---|---|
116
+ | **js** (default) | `--adblock-engine=js` | `lib/adblock.js` — pure-JS, no extra deps | Default; fine for small/medium lists, works everywhere |
117
+ | **rust** | `--adblock-engine=rust` | `lib/adblock-rust.js` — Brave's [`adblock-rs`](https://github.com/brave/adblock-rust) | Large lists (full EasyList + EasyPrivacy + …); much faster matching. Drop-in (same rules, same results). Requires `npm install adblock-rs` (needs a Rust toolchain) |
118
+
119
+ The two engines are interchangeable — same rule format, same blocking result; `rust` is purely a speed option for big lists. If you pass `--adblock-engine=rust` without `adblock-rs` installed, install it (`npm i adblock-rs`) or drop the flag to use `js`.
120
+
121
+ ```bash
122
+ # Fast matching over big lists with the Rust engine
123
+ npm install adblock-rs
124
+ node nwss.js --block-ads=easylist.txt,easyprivacy.txt --adblock-engine=rust
125
+ ```
126
+
95
127
  ---
96
128
 
97
129
  ## config.json Format
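Review note on the new "Blocking ads during the scan" README section: it states that only network rules take part in request blocking and that cosmetic (`##…`) rules are ignored. The sketch below shows one simplified way to tell the two apart. It is not the parser in `lib/adblock.js`. `isNetworkRule` is a hypothetical helper, and treating any line containing `##`, `#@#` or `#?#` as cosmetic is a deliberate simplification of EasyList syntax.

```javascript
// Hypothetical classifier, not taken from lib/adblock.js.
// Returns true for lines that could drive request blocking.
function isNetworkRule(line) {
  const rule = line.trim();
  if (rule === '') return false;          // blank line
  if (rule.startsWith('!')) return false; // comment
  if (rule.startsWith('[')) return false; // list header, e.g. [Adblock Plus 2.0]
  // Simplification: any element-hiding separator marks a cosmetic rule.
  return !/#@?\??#/.test(rule);
}

const sample = [
  '! Title: demo list',
  '||doubleclick.net^',
  '/ads/*',
  '||example.com^$script',
  'example.com##.ad-banner',
];
const networkRules = sample.filter(isNetworkRule);
console.log(networkRules);
```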
package/lib/nettools.js CHANGED
@@ -30,7 +30,7 @@ const GLOBAL_DIG_CACHE_MAX = 2000;
30
30
  // Global whois result cache — shared across ALL handler instances and processUrl calls
31
31
  // Whois data is per root domain and doesn't change based on search terms
32
32
  const globalWhoisResultCache = new Map();
33
- const GLOBAL_WHOIS_CACHE_TTL = 72000000; // 20 hours (persisted to disk between runs)
33
+ const GLOBAL_WHOIS_CACHE_TTL = 129600000; // 36 hours (persisted to disk between runs). Longer than dig's 20h: registrar data is very stable and whois servers rate-limit aggressively, so caching longer cuts repeat queries.
34
34
  const GLOBAL_WHOIS_CACHE_MAX = 2000;
35
35
 
36
36
  // Persistent disk cache file paths
@@ -40,8 +40,8 @@ const WHOIS_CACHE_FILE = path.join(__dirname, '..', '.whoiscache');
40
40
  // Index of hostnames known to resolve, populated as a side effect of
41
41
  // positive dig/whois cache writes AND cache hits. nwss.js's DNS pre-check
42
42
  // reads this via domainKnownToResolve() so it can skip its own resolve4
43
- // call on hosts that dig or whois have already proven live within the
44
- // 20-hour TTL window. Populating on cache HITS (not just writes) handles
43
+ // call on hosts that dig or whois have already proven live within their
44
+ // cache TTL window (dig 20h / whois 36h). Populating on cache HITS (not just writes) handles
45
45
  // the --dns-cache disk-load case where entries arrive without going
46
46
  // through the in-process write path. Stale entries -- hostname in Set but
47
47
  // the dig/whois entry has since been evicted -- are harmless: worst case
@@ -778,6 +778,14 @@ function validateOvpnConfig(ovpnConfig) {
778
778
  * @returns {Promise<Object>} { success, connection, tunDevice, error }
779
779
  */
780
780
  async function connectForSite(siteConfig, forceDebug = false) {
781
+ // Platform guard: OpenVPN routing here reads /proc and uses the iproute2 `ip`
782
+ // command, both Linux-only. Fail clearly instead of a cryptic /proc or `ip`
783
+ // error on macOS/Windows. WSL2 reports 'linux' and passes (TUN is checked
784
+ // separately below via isWSL/checkTunDevice).
785
+ if (process.platform !== 'linux') {
786
+ return { success: false, error: `OpenVPN routing is currently Linux-only (needs /proc + the iproute2 'ip' command; not available on ${process.platform}). Run on Linux/WSL2, or remove the 'openvpn' option from the site config.` };
787
+ }
788
+
781
789
  const ovpnConfig = normalizeOvpnConfig(siteConfig.openvpn);
782
790
  if (!ovpnConfig) {
783
791
  return { success: false, error: 'Invalid OpenVPN configuration' };
@@ -388,6 +388,14 @@ function validateVpnConfig(vpnConfig) {
388
388
  * @returns {Promise<Object>} { success, interface, error }
389
389
  */
390
390
  async function connectForSite(siteConfig, forceDebug = false) {
391
+ // Platform guard: WireGuard routing here relies on the iproute2 `ip` command
392
+ // and wg-quick conventions, which are Linux-only. Fail with a clear message
393
+ // instead of a cryptic `ip: command not found` on macOS/Windows. WSL2 reports
394
+ // 'linux' and passes.
395
+ if (process.platform !== 'linux') {
396
+ return { success: false, error: `WireGuard routing is currently Linux-only (needs the iproute2 'ip' command + wg-quick; not available on ${process.platform}). Run on Linux/WSL2, or remove the 'vpn' option from the site config.` };
397
+ }
398
+
391
399
  const vpnConfig = normalizeVpnConfig(siteConfig.vpn);
392
400
  if (!vpnConfig) {
393
401
  return { success: false, error: 'Invalid VPN configuration' };
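Review note on the two platform guards above: both `connectForSite` variants return the same `{ success: false, error }` shape before touching any Linux-only tooling. A minimal sketch of that pattern as a shared helper follows. `linuxOnlyGuard` is a hypothetical name, not something the package defines, and the `platform` parameter exists only so the behaviour can be exercised on any OS.

```javascript
// Hypothetical helper, not part of the package: returns an error result on
// non-Linux platforms and null when the caller may proceed.
function linuxOnlyGuard(feature, optionName, platform = process.platform) {
  if (platform === 'linux') return null; // WSL2 also reports 'linux'
  return {
    success: false,
    error: `${feature} routing is currently Linux-only (not available on ${platform}). ` +
           `Run on Linux/WSL2, or remove the '${optionName}' option from the site config.`,
  };
}

// Simulated calls: the third argument overrides the detected platform.
console.log(linuxOnlyGuard('WireGuard', 'vpn', 'linux'));
console.log(linuxOnlyGuard('OpenVPN', 'openvpn', 'darwin'));
```

Returning a result object instead of throwing keeps the guard consistent with the existing "Invalid VPN configuration" path, so callers need only one failure branch.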
package/nwss.1 CHANGED
@@ -153,6 +153,14 @@ Browser restart interval in URLs processed (1-1000, overrides config/default).
153
153
  .B \--show-dead-domains
154
154
  At end of scan, list hostnames that did not resolve or were unreachable (\fBNXDOMAIN\fR/\fBENODATA\fR plus \fBERR_NAME_NOT_RESOLVED\fR/\fBERR_ADDRESS_UNREACHABLE\fR). Excludes blocks and timeouts, since those mean the domain is alive. Useful for pruning dead URLs.
155
155
 
156
+ .TP
157
+ .BI \--block-ads= FILE\fR[,\fIFILE\fR...]
158
+ Block ads/trackers during the scan using EasyList-format filter list(s) \(em network rules like \fB||domain^\fR, \fB/ads/*\fR, \fB||domain^$script\fR. Comma-separated for multiple lists. Cosmetic (\fB##\fR) rules are ignored; the scanned page's own top-level document is never blocked (only sub-resources).
159
+
160
+ .TP
161
+ .BI \--adblock-engine= js|rust
162
+ Matcher backend for \fB\-\-block-ads\fR (default: \fBjs\fR). \fBjs\fR is the built-in pure-JS matcher (no extra dependencies). \fBrust\fR uses Brave's \fBadblock-rs\fR \(em much faster on large lists, same rules and results, but requires \fBnpm install adblock-rs\fR (needs a Rust toolchain).
163
+
156
164
  .TP
157
165
  .BR \-h ", " \--help
158
166
  Show help message and exit.
package/nwss.js CHANGED
@@ -55,6 +55,7 @@ const CSS_BLOCKED_TAG = messageColors.processing('[css_blocked]');
55
55
  const EVAL_ON_DOC_TAG = messageColors.processing('[evalOnDoc]');
56
56
  const REALTIME_CLEANUP_TAG = messageColors.processing('[realtime_cleanup]');
57
57
  const VPN_TAG = messageColors.processing('[vpn]');
58
+ const POPUP_TAG = messageColors.processing('[popup]');
58
59
  // Precomputed colored '[SmartCache]' subsystem prefix — paired with the
59
60
  // same constant in lib/smart-cache.js so debug lines from both files
60
61
  // produce consistently colored output. formatLogMessage only colors the
@@ -387,7 +388,11 @@ const dnsPrecheckTimeoutMs = 2000;
387
388
  const showDeadDomains = args.includes('--show-dead-domains');
388
389
  const _deadDomains = new Map();
389
390
  function recordDeadDomain(urlOrHost, reason) {
390
- if (!showDeadDomains || !urlOrHost) return;
391
+ // Populate unconditionally: the pre-check skip reads _deadDomains to drop
392
+ // repeat URLs on a host already proven dead this run, which must work whether
393
+ // or not --show-dead-domains is set. The end-of-scan REPORT is separately
394
+ // gated on showDeadDomains, so the flag still controls output, not recording.
395
+ if (!urlOrHost) return;
391
396
  let host = urlOrHost;
392
397
  try { host = new URL(urlOrHost).hostname; } catch { /* already a bare host */ }
393
398
  if (host && !_deadDomains.has(host)) _deadDomains.set(host, reason);
@@ -407,7 +412,7 @@ const DNS_NEGATIVE_CACHE_MAX = 1000;
407
412
  // persisting it can't silently drop a live host. Opt-in via --dns-cache: dead
408
413
  // hosts are remembered for DNS_NEGATIVE_PERSIST_TTL_MS and reloaded next run;
409
414
  // otherwise it's a 5-min in-memory-only cache. The persist TTL is deliberately
410
- // shorter than the dig/whois positive cache (20h): a domain that doesn't exist
415
+ // shorter than the dig/whois positive cache (dig 20h / whois 36h): a domain that doesn't exist
411
416
  // now MAY get registered, and this is a domain-hunting scanner, so the dead
412
417
  // ones are re-checked twice a day rather than trusted for ~a day.
413
418
  const DNS_NEGATIVE_PERSIST_TTL_MS = 12 * 60 * 60 * 1000; // 12 hours
@@ -715,6 +720,9 @@ if (blockAdsIndex !== -1) {
715
720
 
716
721
  adblockEnabled = true;
717
722
  const engine = adblockEngineName === 'rust' ? adblockRust : adblockJs;
723
+ // Only ever assigned the os.tmpdir() path below — never a user file — so the
724
+ // unlink in finally can never touch the caller's own lists.
725
+ let combinedTmpFile = null;
718
726
  try {
719
727
  if (engine === adblockRust) {
720
728
  // Rust wrapper accepts an array directly — no temp file needed.
@@ -723,15 +731,22 @@ if (blockAdsIndex !== -1) {
723
731
  // JS engine takes a single path; concat to a temp file when multiple lists.
724
732
  let rulesFile = rulesFiles[0];
725
733
  if (rulesFiles.length > 1) {
726
- rulesFile = path.join(os.tmpdir(), `nwss-adblock-combined-${Date.now()}.txt`);
734
+ combinedTmpFile = path.join(os.tmpdir(), `nwss-adblock-combined-${Date.now()}.txt`);
735
+ rulesFile = combinedTmpFile;
727
736
  const combined = rulesFiles.map(f => fs.readFileSync(f, 'utf-8')).join('\n');
728
737
  fs.writeFileSync(rulesFile, combined);
729
738
  }
739
+ // parseAdblockRules reads the file synchronously and in full before
740
+ // returning, so the temp copy is safe to remove immediately afterwards.
730
741
  adblockMatcher = engine.parseAdblockRules(rulesFile, { enableLogging: forceDebug });
731
742
  }
732
743
  } catch (err) {
733
744
  console.log(`Error: Failed to load adblock engine '${adblockEngineName}': ${err.message}`);
734
745
  process.exit(1);
746
+ } finally {
747
+ if (combinedTmpFile) {
748
+ try { fs.unlinkSync(combinedTmpFile); } catch { /* best effort — OS reaps tmpdir */ }
749
+ }
735
750
  }
736
751
  const stats = adblockMatcher.getStats();
737
752
  const ruleDesc = stats.total != null
@@ -805,7 +820,7 @@ Validation Options:
805
820
  --cache-requests Cache HTTP requests to avoid re-requesting same URLs within scan
806
821
  --dns <ip[,ip,...]> Resolver(s) for the DNS pre-check AND nettools' dig (not Chrome nav / whois).
807
822
  One pins all queries to it; several rotate per query. Overrides /etc/resolv.conf.
808
- --dns-cache Persist dig/whois results to disk between runs (20h TTL, 2000-entry cap each),
823
+ --dns-cache Persist dig/whois results to disk between runs (dig 20h / whois 36h TTL, 2000-entry cap each),
809
824
  plus the DNS pre-check negative cache (NXDOMAIN only, 12h TTL, .dnsnegcache)
810
825
  --no-dns-precheck Disable per-URL DNS resolution check before page navigation.
811
826
  By default, URLs whose hostname doesn't resolve are skipped
@@ -933,7 +948,7 @@ Advanced Options:
933
948
  whois_delay: <milliseconds> Delay between whois requests for this site (default: global whois_delay)
934
949
  dig: ["term1", "term2"] Check dig output for ALL specified terms (AND logic)
935
950
  dig-or: ["term1", "term2"] Check dig output for ANY specified term (OR logic)
936
- goto_options: {"waitUntil": "domcontentloaded"} Custom page.goto() options (default: {"waitUntil": "load"})
951
+ goto_options: {"waitUntil": "domcontentloaded"} Custom page.goto() options (default: {"waitUntil": "domcontentloaded"})
937
952
  dig_subdomain: true/false Use subdomain for dig lookup instead of root domain (default: false)
938
953
  digRecordType: "A" DNS record type for dig (default: A)
939
954
 
@@ -1423,6 +1438,7 @@ if (dumpUrls) {
1423
1438
  // Avoids blocking I/O on every intercepted request in debug/dumpurls mode
1424
1439
  const _logBuffers = new Map(); // filePath -> string[]
1425
1440
  const LOG_FLUSH_INTERVAL = 2000; // Flush every 2 seconds
1441
+ const LOG_BUFFER_MAX_RETAINED = 10000; // Cap a file's retry backlog (lines) so a permanently unwritable path can't grow memory unboundedly
1426
1442
  let _logFlushTimer = null;
1427
1443
 
1428
1444
  function bufferedLogWrite(filePath, entry) {
@@ -1435,18 +1451,20 @@ function bufferedLogWrite(filePath, entry) {
1435
1451
 
1436
1452
  function flushLogBuffers() {
1437
1453
  for (const [filePath, entries] of _logBuffers) {
1438
- if (entries.length > 0) {
1439
- try {
1440
- const data = entries.join('');
1441
- entries.length = 0; // Clear buffer immediately
1442
- fs.writeFile(filePath, data, { flag: 'a' }, (err) => {
1443
- if (err) {
1444
- console.warn(formatLogMessage('warn', `Failed to flush log buffer to ${filePath}: ${err.message}`));
1445
- }
1446
- });
1447
- } catch (err) {
1448
- console.warn(formatLogMessage('warn', `Failed to flush log buffer to ${filePath}: ${err.message}`));
1449
- }
1454
+ if (entries.length === 0) continue;
1455
+ try {
1456
+ // Synchronous append on purpose: the batched 2s flush is small, and a
1457
+ // blocking append cannot overlap the next timer tick (it holds the event
1458
+ // loop for its duration), eliminating the interleaved concurrent-append
1459
+ // hazard of the old async fs.writeFile({flag:'a'}). Clear ONLY after the
1460
+ // write succeeds, so a transient failure retries next tick instead of
1461
+ // being silently dropped (the old code cleared before the async write
1462
+ // confirmed). Bounded so a permanently unwritable path can't grow memory.
1463
+ fs.appendFileSync(filePath, entries.join(''));
1464
+ entries.length = 0;
1465
+ } catch (err) {
1466
+ console.warn(formatLogMessage('warn', `Failed to flush log buffer to ${filePath}: ${err.message}`));
1467
+ if (entries.length > LOG_BUFFER_MAX_RETAINED) entries.length = 0;
1450
1468
  }
1451
1469
  }
1452
1470
  }
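Review note on the `flushLogBuffers` change above: the three properties the new comment claims (no overlapping appends, clear only after success, bounded backlog) can be checked in isolation. The sketch below re-creates the loop as a standalone function under assumptions. It takes the buffer map as a parameter, uses a demo-sized cap of 3 instead of `LOG_BUFFER_MAX_RETAINED`, omits the warning log, and forces a failure by pointing one buffer at a path whose parent directory does not exist.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

const MAX_RETAINED = 3; // demo-sized cap; the diff uses 10000 lines

// Standalone re-creation of the flush loop for experimentation.
function flushLogBuffers(buffers) {
  for (const [filePath, entries] of buffers) {
    if (entries.length === 0) continue;
    try {
      fs.appendFileSync(filePath, entries.join('')); // blocks, so flushes cannot overlap
      entries.length = 0;                            // clear only after the write succeeded
    } catch (err) {
      // Keep the backlog for the next tick, unless it has grown past the cap.
      if (entries.length > MAX_RETAINED) entries.length = 0;
    }
  }
}

const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'flush-demo-'));
const okFile = path.join(demoDir, 'out.log');
const badFile = path.join(demoDir, 'missing-subdir', 'out.log'); // parent does not exist

const buffers = new Map([
  [okFile, ['a\n', 'b\n']],
  [badFile, ['x\n']],
]);
flushLogBuffers(buffers);
console.log(buffers.get(okFile).length, buffers.get(badFile).length);
```

One consequence worth noting: when the cap is hit, the whole backlog for that file is dropped at once, which trades log completeness for a hard memory bound.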
@@ -1490,21 +1508,29 @@ if (forceDebug && globalComments) {
1490
1508
  * @param {string} url - The URL string to parse.
1491
1509
  * @returns {string} The root domain, or the original hostname if parsing fails (e.g., for IP addresses or invalid URLs), or an empty string on error.
1492
1510
  */
1493
- const _rootDomainCache = new Map();
1494
- function getRootDomain(url) {
1495
- const cached = _rootDomainCache.get(url);
1511
+ // psl.parse memoized by hostname. The request handlers parse the root domain
1512
+ // of EVERY request, and a page hits the same few hosts repeatedly (CDN,
1513
+ // analytics, ad domains) — so a hostname-keyed memo turns almost all of those
1514
+ // into Map hits instead of repeated public-suffix-list lookups. Keyed by
1515
+ // hostname (not full URL) so distinct paths/queries on one host share one
1516
+ // entry: higher hit rate, fewer + shorter keys than a URL-keyed cache.
1517
+ // psl.parse is pure and never throws (malformed input → {domain: null}), so
1518
+ // the catch is defensive only.
1519
+ const _hostRootCache = new Map();
1520
+ function rootDomainForHost(hostname) {
1521
+ if (!hostname) return '';
1522
+ const cached = _hostRootCache.get(hostname);
1496
1523
  if (cached !== undefined) return cached;
1497
- try {
1498
- const { hostname } = new URL(url);
1499
- const parsed = psl.parse(hostname);
1500
- const result = parsed.domain || hostname;
1501
- if (_rootDomainCache.size > 5000) _rootDomainCache.clear();
1502
- _rootDomainCache.set(url, result);
1503
- return result;
1504
- } catch {
1505
- _rootDomainCache.set(url, '');
1506
- return '';
1507
- }
1524
+ let result;
1525
+ try { const parsed = psl.parse(hostname); result = parsed.domain || hostname; }
1526
+ catch { result = hostname; }
1527
+ if (_hostRootCache.size > 5000) _hostRootCache.clear();
1528
+ _hostRootCache.set(hostname, result);
1529
+ return result;
1530
+ }
1531
+ function getRootDomain(url) {
1532
+ try { return rootDomainForHost(new URL(url).hostname); }
1533
+ catch { return ''; }
1508
1534
  }
1509
1535
 
1510
1536
  /**
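Review note on the hostname-keyed memo above: the claimed gain is that distinct URLs on one host share a single cache entry. The sketch below demonstrates that with a call counter. It does not use `psl`. `naiveRootDomain` is a stand-in that just keeps the last two labels, which is not a correct public-suffix lookup and is only there so the example runs without dependencies.

```javascript
// Stand-in for psl.parse: NOT a real public-suffix lookup, demo only.
let parseCalls = 0;
function naiveRootDomain(hostname) {
  parseCalls++;
  return hostname.split('.').slice(-2).join('.');
}

const hostRootCache = new Map();
const CACHE_MAX = 5000; // same clear-on-overflow bound as the diff

function rootDomainForHost(hostname) {
  if (!hostname) return '';
  const cached = hostRootCache.get(hostname);
  if (cached !== undefined) return cached;
  const result = naiveRootDomain(hostname) || hostname;
  if (hostRootCache.size > CACHE_MAX) hostRootCache.clear();
  hostRootCache.set(hostname, result);
  return result;
}

function getRootDomain(url) {
  try { return rootDomainForHost(new URL(url).hostname); }
  catch { return ''; } // unparseable URL
}

// Two different URLs, one host: the expensive parse runs once.
const first = getRootDomain('https://cdn.example.com/a.js');
const second = getRootDomain('https://cdn.example.com/b.js?v=2');
console.log(first, second, parseCalls);
```

With the previous URL-keyed cache, the two calls above would have produced two entries and two parses, since the keys differ in path and query.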
@@ -1839,7 +1865,19 @@ function setupFrameHandling(page, forceDebug) {
1839
1865
 
1840
1866
  // Declare userDataDir in outer scope for cleanup access
1841
1867
  let userDataDir = null;
1842
-
1868
+
1869
+ // Browser-level decision (the browser launches once per batch, so this can't
1870
+ // be per-site): only disable Chrome's pop-up blocker when at least one site
1871
+ // actually wants popups captured. A real browser blocks non-gesture
1872
+ // window.open(), so non-popup scans keep the blocker on for stealth.
1873
+ // capture_popups scans turn it off so non-gesture popunders (document-level
1874
+ // onclick / timer SDKs) fire and get captured too — gesture-triggered
1875
+ // popups already work via the synthetic-click path regardless of this flag.
1876
+ const wantPopups = Array.isArray(sites) && sites.some(s => s && s.capture_popups === true);
1877
+ if (wantPopups && forceDebug) {
1878
+ console.log(formatLogMessage('debug', `${POPUP_TAG} capture_popups set — launching with --disable-popup-blocking (non-gesture popunders allowed)`));
1879
+ }
1880
+
1843
1881
  /**
1844
1882
  * Creates a new browser instance with consistent configuration
1845
1883
  * Uses system Chrome and temporary directories to minimize disk usage
@@ -1930,6 +1968,12 @@ function setupFrameHandling(page, forceDebug) {
1930
1968
  // Puppeteer 22.x headless mode optimization
1931
1969
  // Auto-detect best headless mode based on Puppeteer version
1932
1970
  headless: headlessMode,
1971
+ // Bypass TLS cert errors at the browser level (drives CDP
1972
+ // Security.setIgnoreCertificateErrors). Robust on new-headless Chrome,
1973
+ // where the --ignore-certificate-errors *flag* is increasingly ignored.
1974
+ // An ad/tracker scanner must reach self-signed / mismatched-cert ad and
1975
+ // embed domains; we observe traffic, we don't transmit secrets.
1976
+ acceptInsecureCerts: true,
1933
1977
  args: [
1934
1978
  // CRITICAL: Remove automation detection markers
1935
1979
  '--disable-blink-features=AutomationControlled',
@@ -2018,6 +2062,10 @@ function setupFrameHandling(page, forceDebug) {
2018
2062
  '--memory-pressure-off',
2019
2063
  '--max_old_space_size=2048', // V8 heap limit
2020
2064
  '--disable-prompt-on-repost', // Fixes form popup on page reload
2065
+ // Disable Chrome's pop-up blocker (chrome://settings/content/popups)
2066
+ // ONLY when a site wants popups captured — lets non-gesture popunders
2067
+ // fire. Gated so non-popup scans keep the blocker on for stealth.
2068
+ ...(wantPopups ? ['--disable-popup-blocking'] : []),
2021
2069
  ...(keepBrowserOpen ? [] : ['--disable-background-networking']),
2022
2070
  '--no-sandbox',
2023
2071
  '--disable-setuid-sandbox',
@@ -3362,8 +3410,7 @@ function setupFrameHandling(page, forceDebug) {
3362
3410
  try {
3363
3411
  const parsedUrl = new URL(checkedUrl);
3364
3412
  fullSubdomain = parsedUrl.hostname;
3365
- const pslResult = psl.parse(fullSubdomain);
3366
- checkedRootDomain = pslResult.domain || fullSubdomain;
3413
+ checkedRootDomain = rootDomainForHost(fullSubdomain);
3367
3414
  } catch (_) { return; }
3368
3415
  if (!checkedRootDomain) return;
3369
3416
 
@@ -3638,30 +3685,24 @@ function setupFrameHandling(page, forceDebug) {
3638
3685
  try {
3639
3686
  const parsedUrl = new URL(checkedUrl);
3640
3687
  fullSubdomain = parsedUrl.hostname;
3641
- const pslResult = psl.parse(fullSubdomain);
3642
- checkedRootDomain = pslResult.domain || fullSubdomain;
3688
+ checkedRootDomain = rootDomainForHost(fullSubdomain);
3643
3689
  } catch (e) {}
3644
3690
 
3691
+ // Never BLOCK the top-level document (the scanned page OR a main-frame
3692
+ // redirect target). Aborting it makes the navigation never commit (page
3693
+ // stays at about:blank → navigation timeout), silently breaking any
3694
+ // scanned URL that matches our own filter lists (adblock / blocked /
3695
+ // blockDomainsByUrl) — common on adult/pirate/stream domains. This flag
3696
+ // ONLY guards the abort paths below; the request still flows through the
3697
+ // match logic, so a main-frame redirect destination (e.g. a
3698
+ // filecrypt → ad-domain hop) is still captured via filterRegex/dig/whois.
3699
+ // isNavigationRequest is true for sub-frame docs too, so the mainFrame()
3700
+ // check keeps ad iframes blockable.
3701
+ let isMainFrameDoc = false;
3702
+ try { isMainFrameDoc = request.isNavigationRequest() && request.frame() === page.mainFrame(); } catch (_) {}
3703
+
3645
3704
  // Check against ALL first-party domains (original + all redirects)
3646
3705
  const isFirstParty = checkedRootDomain && firstPartyDomains.has(checkedRootDomain);
3647
-
3648
- // Block infinite iframe loops - safely access frame URL
3649
- const frameUrl = (() => {
3650
- try {
3651
- const frame = request.frame();
3652
- return frame ? frame.url() : '';
3653
- } catch (err) {
3654
- return '';
3655
- }
3656
- })();
3657
- if (frameUrl && frameUrl.includes('creative.dmzjmp.com') &&
3658
- checkedUrl.includes('go.dmzjmp.com/api/models')) {
3659
- if (forceDebug) {
3660
- console.log(formatLogMessage('debug', `Blocking potential infinite iframe loop: ${checkedUrl}`));
3661
- }
3662
- request.abort();
3663
- return;
3664
- }
3665
3706
 
3666
3707
  // Enhanced debug logging to show which frame the request came from
3667
3708
  if (forceDebug) {
@@ -3691,7 +3732,7 @@ function setupFrameHandling(page, forceDebug) {
3691
3732
  request.resourceType()
3692
3733
  );
3693
3734
 
3694
- if (result.blocked) {
3735
+ if (result.blocked && !isMainFrameDoc) {
3695
3736
  adblockStats.blocked++;
3696
3737
  if (forceDebug) {
3697
3738
  console.log(formatLogMessage('debug', `${messageColors.blocked('[adblock]')} ${checkedUrl} (${result.reason})`));
@@ -3699,6 +3740,12 @@ function setupFrameHandling(page, forceDebug) {
3699
3740
  request.abort('blockedbyclient');
3700
3741
  return;
3701
3742
  }
3743
+ if (result.blocked && isMainFrameDoc && forceDebug) {
3744
+ // Matched a filter rule but it's the page we're scanning (or a
3745
+ // main-frame redirect target) — allow it (blocking the top-level
3746
+ // document aborts navigation). It still flows through the matcher.
3747
+ console.log(formatLogMessage('debug', `${messageColors.highlight('[adblock]')} top-level document ${checkedUrl} matched (${result.reason}) — allowed (never block the scanned page)`));
3748
+ }
3702
3749
  adblockStats.allowed++;
3703
3750
  } catch (err) { /* Silently continue on adblock errors */ }
3704
3751
  }
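Review note on the `isMainFrameDoc` guard above: the decision depends on two Puppeteer calls, `request.isNavigationRequest()` and `request.frame() === page.mainFrame()`. The sketch below exercises the same predicate against plain mock objects, so it runs without a browser. `isMainFrameDocument`, `decide` and the mock factories are hypothetical names for illustration. Only the two method names come from the diff.

```javascript
// Same predicate as the diff, isolated so it can run against mocks.
function isMainFrameDocument(request, page) {
  try {
    return request.isNavigationRequest() && request.frame() === page.mainFrame();
  } catch (_) {
    return false; // frame lookup failed: treat as an ordinary request
  }
}

// Hypothetical decision helper: a matched rule aborts everything except
// the top-level document.
function decide(request, page, ruleMatched) {
  if (ruleMatched && !isMainFrameDocument(request, page)) return 'abort';
  return 'continue';
}

// Minimal mocks standing in for Puppeteer's Page and HTTPRequest.
const mainFrame = { name: 'main' };
const adFrame = { name: 'ad-iframe' };
const page = { mainFrame: () => mainFrame };
const makeRequest = (frame, isNav) => ({
  frame: () => frame,
  isNavigationRequest: () => isNav,
});

console.log(decide(makeRequest(mainFrame, true), page, true));  // scanned page itself
console.log(decide(makeRequest(adFrame, true), page, true));    // ad iframe document
console.log(decide(makeRequest(mainFrame, false), page, true)); // sub-resource in main frame
```

The second case is why the frame comparison matters: an iframe's document is also a navigation request, so checking `isNavigationRequest()` alone would exempt ad iframes from blocking.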
@@ -3752,7 +3799,7 @@ function setupFrameHandling(page, forceDebug) {
  // check so domain-based blocks short-circuit without paying the
  // per-URL regex scan. Same abort reason as the static path so
  // request.failure() observers see consistent metadata.
- if (reqDomain && _dynamicallyBlockedDomains.size > 0 && matchesDynamicBlock(reqDomain)) {
+ if (reqDomain && _dynamicallyBlockedDomains.size > 0 && matchesDynamicBlock(reqDomain) && !isMainFrameDoc) {
    if (forceDebug) {
      console.log(formatLogMessage('debug', `${BLOCK_DOMAINS_BY_URL_TAG} aborting ${reqUrl} (domain ${reqDomain} dynamically blocked)`));
    }
@@ -3767,7 +3814,7 @@ function setupFrameHandling(page, forceDebug) {
      break;
    }
  }
- if (blockedMatchIndex !== -1) {
+ if (blockedMatchIndex !== -1 && !isMainFrameDoc) {
    // Always track the hit (zero-cost on the un-debug path) so the
    // scan-end summary can show which patterns are doing work vs.
    // which are stale and ready to prune. Keyed by pattern.source --
@@ -4349,15 +4396,43 @@ function setupFrameHandling(page, forceDebug) {
  try {
    navigationResult = await navigateWithRedirectHandling(page, currentUrl, siteConfig, gotoOptions, forceDebug, formatLogMessage);
  } catch (navErr) {
-   // Only retry on genuine timeouts, not chrome-error:// redirects
+   // Only handle genuine timeouts here, not chrome-error:// redirects.
+   // pageUrl === 'about:blank' means the navigation never committed
+   // (server never responded) — treat as a real failure, not a partial
+   // page; only a page that actually reached a URL is worth observing.
    let pageUrl = '';
    try { if (!page.isClosed()) pageUrl = page.url(); } catch {}
    const isPopupFailure = navErr.message.includes('chrome-error://') || navErr.message.includes('invalid URL') ||
      pageUrl.startsWith('chrome-error://') || pageUrl === 'about:blank';
    if ((navErr.message.includes('timeout') || navErr.message.includes('Timeout')) && !isPopupFailure) {
-     if (forceDebug) console.log(formatLogMessage('debug', `Navigation timeout, retrying with waitUntil:networkidle2 for ${currentUrl}`));
-     const fallbackOptions = { ...gotoOptions, waitUntil: 'networkidle2', timeout: Math.min(timeout, 10000) };
-     navigationResult = await navigateWithRedirectHandling(page, currentUrl, siteConfig, fallbackOptions, forceDebug, formatLogMessage);
+     // The OLD fallback retried with networkidle2 STRICTER than the
+     // domcontentloaded default, so it could never rescue a
+     // domcontentloaded timeout (and Puppeteer 25 has no 'commit', i.e.
+     // nothing more lenient). Two-tier recovery instead:
+     //   1. If the site used a wait STRICTER than domcontentloaded, do one
+     //      lenient retry with domcontentloaded (it fires earlier).
+     //   2. Otherwise proceed with the partially-loaded page rather than
+     //      discarding the URL — it exists and requests already fired
+     //      (captured by page.on('request')); the delay/interact phase
+     //      below keeps capturing. Streaming/embed/media pages routinely
+     //      never reach DOM-ready (a connection stays open) but their
+     //      ad/tracker calls fired early.
+     const primaryWait = gotoOptions.waitUntil || defaultWaitUntil;
+     let recovered = false;
+     if (primaryWait !== 'domcontentloaded') {
+       try {
+         if (forceDebug) console.log(formatLogMessage('debug', `Navigation timeout (${primaryWait}), retrying with waitUntil:domcontentloaded for ${currentUrl}`));
+         const fallbackOptions = { ...gotoOptions, waitUntil: 'domcontentloaded', timeout: Math.min(timeout, 15000) };
+         navigationResult = await navigateWithRedirectHandling(page, currentUrl, siteConfig, fallbackOptions, forceDebug, formatLogMessage);
+         recovered = true;
+       } catch (_) { /* fall through to proceed-with-partial */ }
+     }
+     if (!recovered) {
+       let partialUrl = currentUrl;
+       try { if (!page.isClosed()) partialUrl = page.url() || currentUrl; } catch {}
+       if (forceDebug) console.log(formatLogMessage('debug', `Navigation timeout — proceeding with partially-loaded page for ${currentUrl}`));
+       navigationResult = { finalUrl: partialUrl, redirected: false, redirectChain: [currentUrl], originalUrl: currentUrl, redirectDomains: [], httpStatus: null, cfRay: null };
+     }
    } else {
      throw navErr;
    }
@@ -4630,8 +4705,41 @@ function setupFrameHandling(page, forceDebug) {
      // Capture hard "dead domain" navigation errors for --show-dead-domains
      // (DNS doesn't resolve / host unreachable). Blocks, timeouts and CF
      // challenges are NOT dead — they're excluded by this match.
-     const deadNav = /ERR_NAME_NOT_RESOLVED|ERR_ADDRESS_UNREACHABLE|ERR_DNS/.exec(err.message || '');
-     if (deadNav) recordDeadDomain(currentUrl, deadNav[0]);
+     // Only DEFINITIVE non-existence / unreachable signals — these now drive
+     // the in-scan dead-domain SKIP (not just --show-dead-domains reporting),
+     // so transient DNS errors must NOT match. The bare `ERR_DNS` used to
+     // catch ERR_DNS_TIMED_OUT / ERR_DNS_MALFORMED_RESPONSE / ERR_DNS_SERVER_FAILED
+     // (all transient) — dropped so a slow-DNS blip can't false-skip the
+     // rest of a live host's URLs.
+     const deadNav = /ERR_NAME_NOT_RESOLVED|ERR_ADDRESS_UNREACHABLE/.exec(err.message || '');
+     if (deadNav) {
+       recordDeadDomain(currentUrl, deadNav[0]);
+       // Corroborate-then-persist to the negative cache (.dnsnegcache with
+       // --dns-cache → cross-scan skip; else in-memory). Chrome resolves via
+       // the possibly-flaky SYSTEM resolver, so its ERR_NAME_NOT_RESOLVED may
+       // be a glitch on a LIVE host. Re-confirm via the reliable --dns
+       // resolver and cache ONLY if it ALSO returns a definitive NXDOMAIN.
+       // ERR_ADDRESS_UNREACHABLE is routing (the host resolves), so the
+       // resolve succeeds and it's correctly not cached. Fire-and-forget:
+       // off the critical path; saveDiskCache flushes on exit.
+       if (dnsPrecheckEnabled && deadNav[0] === 'ERR_NAME_NOT_RESOLVED') {
+         let navHost = '';
+         try { navHost = new URL(currentUrl).hostname; } catch {}
+         if (navHost && !/^[\d.:]+$|^\[/.test(navHost) && !dnsNegativeCache.has(navHost)) {
+           dnsResolver.resolveHost(navHost, dnsPrecheckTimeoutMs).then(
+             () => { /* reliable resolver resolves it — system-resolver glitch, do NOT cache */ },
+             (e) => {
+               const code = (e && (e.code || e.message)) || '';
+               if (isNonExistenceError(code)) {
+                 dnsNegativeCacheSet(navHost, code);
+                 recordDeadDomain(navHost, code);
+                 if (forceDebug) console.log(formatLogMessage('debug', `Dead domain confirmed by --dns resolver (${code}) — caching ${navHost} (skips next run with --dns-cache)`));
+               }
+             }
+           ).catch(() => {});
+         }
+       }
+     }
      throw err;
    }
  }
@@ -5263,7 +5371,7 @@ function setupFrameHandling(page, forceDebug) {
  const safeUrl = currentUrl.replace(/https?:\/\//, '').replace(/[^a-zA-Z0-9]/g, '_').substring(0, 80);
  const filename = `screenshots/${safeUrl}-${timestamp}.png`;
  try {
-   if (!fs.existsSync('screenshots')) fs.mkdirSync('screenshots', { recursive: true });
+   fs.mkdirSync('screenshots', { recursive: true }); // recursive:true is a no-op if it already exists
    await page.screenshot({ path: filename, type: 'png', fullPage: true });
    console.log(formatLogMessage('info', `Screenshot saved: ${filename}`));
  } catch (screenshotErr) {
@@ -5759,6 +5867,19 @@ function setupFrameHandling(page, forceDebug) {
  // actually starting — wrongly skipping live domains. c-ares isn't
  // threadpool-bound so it's immune to that contention.
  if (dnsPrecheckEnabled && taskDomain && !/^[\d.:]+$|^\[/.test(taskDomain)) {
+   // Already proven dead earlier THIS run — either a pre-check NXDOMAIN or
+   // a prior URL's navigation hit ERR_NAME_NOT_RESOLVED / ERR_ADDRESS_UNREACHABLE
+   // (recordDeadDomain populates _deadDomains for both). Skip the repeat
+   // instead of paying another fail-open navigation on a multi-URL dead
+   // host (e.g. dlstreams.top?id=39/54/347). In-scan only (NOT persisted):
+   // Chrome resolves via the system resolver, so a nav-level failure could
+   // be a system-resolver glitch on a live host — a false "dead" must not
+   // carry across runs. Cheap: a Map lookup, no DNS resolve.
+   if (_deadDomains.has(taskDomain)) {
+     dnsPrecheckSkips++;
+     if (forceDebug) console.log(formatLogMessage('debug', `DNS pre-check: ${taskDomain} already dead this run (${_deadDomains.get(taskDomain)}) — skipping`));
+     return { url: task.url, rules: [], success: false, error: `DNS: ${_deadDomains.get(taskDomain)}`, skipped: true };
+   }
    const cached = dnsNegativeCache.get(taskDomain);
    if (cached && Date.now() - cached.timestamp < DNS_NEGATIVE_CACHE_TTL_MS) {
      dnsPrecheckSkips++;
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@fanboynz/network-scanner",
-   "version": "3.2.0",
+   "version": "3.3.0",
    "description": "A Puppeteer-based network scanner for analyzing web traffic, generating adblock filter rules, and identifying third-party requests. Features include fingerprint spoofing, Cloudflare bypass, content analysis with curl/grep, and multiple output formats.",
    "main": "nwss.js",
    "scripts": {