mailpop 1.0.0

package/README.md ADDED
@@ -0,0 +1,195 @@
+ # mailpop 🎯
+
+ mailpop is a production-ready, highly optimized TypeScript application built on top of **Node.js 22+** and **Playwright Chromium** to discover reliable, public contact emails directly from company websites.
+
+ Designed specifically for cold outreach lead enrichment where sender reputation is critical, mailpop implements robust verification heuristics, email score priorities, and strict validation checks. It **never invents or guesses emails**; it only returns emails found explicitly on the public web.
+
+ ---
+
+ ## Key Features
+
+ - 🌐 **SPA & JavaScript Execution**: Uses Playwright Chromium (headless by default) to handle modern frameworks (React, Next.js, Vue, Angular, Astro, etc.) and redirects.
+ - ⚡ **BFS Crawler with Priority Routing**: Custom Breadth-First-Search (BFS) traversal prioritizing contact-relevant paths (`contact`, `about`, `support`, etc.) up to depth 2.
+ - 📂 **Sitemap & Robots.txt Parser**: Scans `robots.txt` and recursively crawls sitemaps (and sitemap indices) using Cheerio in `xmlMode` to extract links before crawling.
+ - 🔐 **Obfuscation Decoders**: Automatically decodes Cloudflare email protection, HTML entities, Unicode sequences, Base64 strings, and common textual patterns (e.g., `name [at] company [dot] com`).
+ - 📊 **Scoring & Confidence Engines**: Computes a confidence score based on the location of the email (footer, header, body, mailto, script), page relevance, domain-match alignment, and frequency.
+ - 💾 **Memory Efficient Streaming**: Reads input CSVs and appends output CSV records incrementally using Node.js async generators and `fast-csv`, so datasets of 100 to 50,000+ entries are processed without loading the whole file into memory.
+ - 🔄 **Contiguous Checkpoints & Resuming**: Automatically saves crawler checkpoints every 10 rows. If the process is killed (SIGINT/SIGTERM or crash), it resumes from the last contiguously processed row.
+ - 🚦 **Adaptive Throttling**: Adds randomized delays between page requests to the same website to respect rate limits and avoid hammering servers.
+ - 📝 **Structured Logging**: Generates JSON-formatted app logs, error logs, and dedicated email discovery logs in the `logs/` directory.
+
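The obfuscation decoders listed above can be illustrated with a minimal sketch. The extractor source is not part of this diff, so the function names below (`decodeCfEmail`, `encodeCfEmail`, `deobfuscateText`) are hypothetical and only show the general technique: Cloudflare's email protection stores the address as a hex string whose first byte is an XOR key for the remaining bytes, and textual patterns can be rewritten with regular expressions.

```typescript
// Sketch only: these helper names are assumptions, not mailpop's actual API.

/** Formats a byte (0-255) as two lowercase hex digits. */
function toHexByte(n: number): string {
  return ('0' + n.toString(16)).slice(-2);
}

/**
 * Decodes a Cloudflare-protected address (the `data-cfemail` payload).
 * The first byte is an XOR key applied to every following byte.
 */
function decodeCfEmail(hex: string): string {
  const key = parseInt(hex.slice(0, 2), 16);
  let out = '';
  for (let i = 2; i + 1 < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16) ^ key);
  }
  return out;
}

/** Inverse of decodeCfEmail, useful for building test fixtures. */
function encodeCfEmail(email: string, key: number): string {
  let out = toHexByte(key);
  for (let i = 0; i < email.length; i++) {
    out += toHexByte(email.charCodeAt(i) ^ key);
  }
  return out;
}

/** Rewrites textual obfuscation such as "name [at] company [dot] com". */
function deobfuscateText(text: string): string {
  return text
    .replace(/\s*[\[({]\s*at\s*[\])}]\s*/gi, '@')
    .replace(/\s*[\[({]\s*dot\s*[\])}]\s*/gi, '.');
}
```

A decoder like this only recovers addresses that are already present in the page, which is consistent with the "never invents or guesses emails" guarantee.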
+ ---
+
+ ## Architecture Overview
+
+ ```mermaid
+ graph TD
+     A[input.csv] -->|Read Stream / Generator| B[CSV Orchestrator]
+     B -->|Check Checkpoint| C{Checkpoint Exists?}
+     C -->|Yes| D[Skip to Next Row]
+     C -->|No| E[Fresh Run & Write Headers]
+     D --> F[BFS Crawl Target]
+     E --> F
+     F -->|Fetch Sitemaps & robots.txt| G[Cheerio Link Collector]
+     F -->|Spawn Browser Context| H[Playwright Page Loader]
+     G -->|Seed Queue| H
+     H -->|Block Media/CSS| I[Render SPA & Execute JS]
+     I -->|Check Early Exit| J{High Confidence Email Found?}
+     J -->|Yes| K[Terminate Site Crawl]
+     J -->|No| L[Queue Next Traversal Page]
+     L --> I
+     I -->|Extract & Decode| M[Extractor]
+     M -->|Filter & Score| N[Scorer]
+     N -->|Select Best Email| O[CSV Appender]
+     O -->|Every 10 rows| P[Save Checkpoint]
+ ```
+
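The "contiguous" checkpoint in the diagram matters because rows are crawled concurrently and may finish out of order. The orchestrator code is not included in this diff, so the following is a sketch under that assumption: a hypothetical `ContiguousCheckpoint` class that tracks finished rows and only advances the resume position while there are no gaps.

```typescript
// Sketch only: class and method names are hypothetical.
class ContiguousCheckpoint {
  // Rows that finished but cannot be committed yet because an earlier row is pending.
  private finished: Record<number, boolean> = {};
  // Index of the first row that has NOT been completed contiguously.
  private nextRow = 0;

  /** Marks a row as done and returns the row index a resumed run should start from. */
  markDone(row: number): number {
    this.finished[row] = true;
    while (this.finished[this.nextRow]) {
      delete this.finished[this.nextRow];
      this.nextRow++;
    }
    return this.nextRow;
  }
}
```

With this approach a crash can cause some rows beyond the checkpoint to be re-crawled on resume, but no row is ever skipped.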
+ ---
+
+ ## Tech Stack
+
+ - **Runtime**: Node.js 22+ (ES Modules)
+ - **Language**: TypeScript 5+ (Strict mode compiler settings)
+ - **Scraping**: Playwright Chromium, Cheerio
+ - **Data & Streams**: Fast CSV, p-limit
+ - **Tooling**: tsx (Development runner), ESLint 9+ (Flat config), Prettier
+
+ ---
+
+ ## Configuration (`.env`)
+
+ Create a `.env` file in the root directory (based on `.env.example`):
+
+ ```env
+ # Input & Output Files
+ INPUT_CSV=input.csv
+ OUTPUT_CSV=output/output.csv
+ CHECKPOINT_FILE=output/checkpoint.json
+ CACHE_DIR=output/cache
+
+ # Crawling Limits
+ CONCURRENCY=5
+ MAX_DEPTH=2
+ MAX_PAGES_PER_SITE=25
+ MAX_CRAWL_TIME_PER_SITE_MS=60000
+ PAGE_TIMEOUT_MS=15000
+
+ # Browser Settings
+ HEADLESS=true
+
+ # Throttling & Delay
+ MIN_DELAY_MS=500
+ MAX_DELAY_MS=2000
+
+ # Retry Configuration
+ MAX_RETRIES=3
+ RETRY_INITIAL_DELAY_MS=1000
+ RETRY_MAX_DELAY_MS=10000
+ ```
+
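To make the throttling and retry settings concrete, here is a small sketch of how such values are commonly turned into wait times. The `utils/delay.js` and `utils/retry.js` sources are not shown in this diff, so the doubling factor and both function names are assumptions rather than mailpop's confirmed behavior.

```typescript
// Sketch only: the doubling factor and function names are assumptions.

/** Exponential backoff: initial delay doubled per attempt (1-based), capped at maxDelayMs. */
function backoffDelayMs(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  const exponential = initialDelayMs * Math.pow(2, attempt - 1);
  return Math.min(maxDelayMs, exponential);
}

/** Uniform random integer delay in [minMs, maxMs]; rng is injectable for testing. */
function randomDelayMs(minMs: number, maxMs: number, rng: () => number = Math.random): number {
  return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}
```

Under these assumptions and the defaults above, retries would wait 1000 ms, 2000 ms, then 4000 ms, and pages of the same site would be spaced 500 to 2000 ms apart.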
+ ---
+
+ ## Installation & Setup
+
+ 1. **Clone or navigate to the project directory**:
+    ```bash
+    cd email-hunter-pro # (or the directory mailpop was installed to)
+    ```
+
+ 2. **Install Node dependencies**:
+    ```bash
+    npm install
+    ```
+
+ 3. **Install Playwright Browsers**:
+    ```bash
+    npx playwright install chromium
+    ```
+
+ 4. **Prepare the configuration**:
+    ```bash
+    cp .env.example .env
+    ```
+
+ 5. **Prepare your input data**:
+    Put your list of company websites in `input.csv` (the bundled `input.csv` serves as a template).
+
+ ---
+
+ ## CLI & npx Execution
+
+ After compiling the codebase with `npm run build`, you can invoke the program directly as a CLI tool using `npx`.
+
+ ### Command Syntax
+
+ ```bash
+ # General invocation (defaults to config settings for input and output)
+ npx .
+
+ # Custom input and output paths (positional arguments)
+ npx . my_leads.csv enriched_output.csv
+
+ # Using explicit flags
+ npx . -i my_leads.csv -o enriched_output.csv
+
+ # Viewing CLI help options
+ npx . -h
+ ```
+
+ ### Available Package Scripts
+
+ | Script | Command | Description |
+ | :--- | :--- | :--- |
+ | `npm run dev` | `tsx src/index.ts` | Runs the development code directly using `tsx` |
+ | `npm run build` | `tsc` | Compiles the TypeScript code to standard ES Modules in `dist/` |
+ | `npm run start` | `node dist/index.js` | Runs the compiled output |
+ | `npm run lint` | `eslint src` | Checks code against linting rules |
+ | `npm run format` | `prettier --write "src/**/*.ts"` | Formats code using Prettier |
+ | `npm run typecheck` | `tsc --noEmit` | Performs static type checks |
+
+ ---
+
+ ## CSV Data Processing Heuristics
+
+ ### Dynamic Header Matching
+
+ mailpop does not mandate a fixed schema. It accepts CSVs with **any custom columns** (e.g. CRM IDs, Industry, Contact Names).
+
+ The crawler dynamically detects the target domain or website column by scanning the header row case-insensitively for the keywords `website`, `url`, `domain`, `site`, or `web` (supporting partial matches like `CompanyDomain` or `target_url`).
+
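The header matching described above can be sketched as a simple keyword scan. The CSV orchestrator is not part of this diff, so `findWebsiteColumn` is a hypothetical name, and treating the keyword list as a priority order (most specific first) is an assumption.

```typescript
// Sketch only: function name and keyword priority order are assumptions.
const WEBSITE_KEYWORDS = ['website', 'url', 'domain', 'site', 'web'];

/** Returns the first header containing a website keyword (case-insensitive), or null. */
function findWebsiteColumn(headers: string[]): string | null {
  // Outer loop over keywords so a more specific keyword wins over a looser one.
  for (const keyword of WEBSITE_KEYWORDS) {
    for (const header of headers) {
      if (header.toLowerCase().indexOf(keyword) !== -1) {
        return header;
      }
    }
  }
  return null;
}
```

Because short keywords such as `web` match partially, a column like `Webinar` could be picked up by mistake; checking the looser keywords last, as here, reduces but does not eliminate that risk.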
160
+ ### Enriched Output Generation
161
+
162
+ When writing the output, mailpop **preserves 100% of the original columns** and appends the enrichment results as new columns:
163
+
164
+ - `email`: The discovered, verified contact email (leaves empty if none is confidently found).
165
+ - `email_source`: Specific URL/page where the email was located.
166
+ - `email_type`: Classifies the email category (`role`, `personal`, `automated`).
167
+ - `confidence_score`: Confidence score rating from 10 to 100.
168
+ - `discovery_method`: Scraping discovery origin (`contact-page`, `about-page`, `footer`, `header`, `sitemap`, `general-page`, `obscure-js`, `mailto-link`).
169
+
170
+ #### Custom Column Matching Example
171
+
172
+ **Input CSV (`custom_leads.csv`)**:
173
+ ```csv
174
+ CompanyID,CompanyDomain,Industry,ContactPerson
175
+ 1001,https://github.com,Technology,John Doe
176
+ ```
177
+
178
+ **Output CSV (`enriched.csv`)**:
179
+ ```csv
180
+ CompanyID,CompanyDomain,Industry,ContactPerson,email,email_source,email_type,confidence_score,discovery_method
181
+ 1001,https://github.com,Technology,John Doe,copyright@github.com,https://docs.github.com/site-policy/github-terms/github-terms-of-service,personal,100,mailto-link
182
+ ```
183
+
184
+ ---
185
+
186
+ ## Structured Log Output (`logs/`)
187
+
188
+ - `logs/app.log`: General JSON logs showing crawl transitions and app events.
189
+ - `logs/errors.log`: Failures, timeouts, or exceptions raised by page crawls.
190
+ - `logs/discovered-emails.log`: Every email address found during the traversal, containing its source and confidence metrics.
191
+
192
+ Example log entry in `discovered-emails.log`:
193
+ ```json
194
+ {"timestamp":"2026-06-03T09:12:00.123Z","domain":"acme.com","email":"contact@acme.com","emailSource":"https://acme.com/contact","confidenceScore":98,"discoveryMethod":"contact-page"}
195
+ ```
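Since each log entry is a single JSON object per line, the discovery log can be post-processed with a few lines of code. The sketch below assumes the field names from the example entry above; `parseDiscoveryLog` is a hypothetical helper, not something shipped with the package.

```typescript
// Sketch only: field names come from the README example; the helper is hypothetical.
interface DiscoveryLogEntry {
  timestamp: string;
  domain: string;
  email: string;
  emailSource: string;
  confidenceScore: number;
  discoveryMethod: string;
}

/** Parses JSON-lines log text and keeps entries at or above minScore. */
function parseDiscoveryLog(text: string, minScore: number): DiscoveryLogEntry[] {
  const entries: DiscoveryLogEntry[] = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue;
    try {
      const parsed = JSON.parse(trimmed) as DiscoveryLogEntry;
      if (typeof parsed.confidenceScore === 'number' && parsed.confidenceScore >= minScore) {
        entries.push(parsed);
      }
    } catch {
      // Skip malformed or truncated lines (e.g. from an interrupted write).
    }
  }
  return entries;
}
```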
package/dist/cache.js ADDED
@@ -0,0 +1,81 @@
+ import fs from 'fs/promises';
+ import path from 'path';
+ import crypto from 'crypto';
+ import { config } from './config.js';
+ import { Logger } from './logger.js';
+ export class Cache {
+     cacheDir;
+     constructor() {
+         this.cacheDir = config.cacheDir;
+     }
+     /**
+      * Generates a unique, file-safe cache path for a given key.
+      */
+     getCachePath(key) {
+         const hash = crypto.createHash('md5').update(key).digest('hex');
+         return path.join(this.cacheDir, `${hash}.json`);
+     }
+     /**
+      * Retrieves an item from the cache. Returns null if missing or expired.
+      */
+     async get(key) {
+         try {
+             const cachePath = this.getCachePath(key);
+             const content = await fs.readFile(cachePath, 'utf-8');
+             const entry = JSON.parse(content);
+             if (entry.expiresAt !== null && Date.now() > entry.expiresAt) {
+                 // Expired cache entry, clean it up
+                 await fs.unlink(cachePath);
+                 return null;
+             }
+             return entry.value;
+         }
+         catch (_e) {
+             // Cache miss or error reading
+             return null;
+         }
+     }
+     /**
+      * Sets an item in the cache with an optional TTL (Time To Live).
+      */
+     async set(key, value, ttlMs) {
+         try {
+             await fs.mkdir(this.cacheDir, { recursive: true });
+             const cachePath = this.getCachePath(key);
+             const entry = {
+                 value,
+                 expiresAt: ttlMs ? Date.now() + ttlMs : null,
+             };
+             await fs.writeFile(cachePath, JSON.stringify(entry), 'utf-8');
+         }
+         catch (err) {
+             const errorMsg = err instanceof Error ? err.message : String(err);
+             await Logger.error('cache-write', undefined, undefined, `Failed to write cache key '${key}': ${errorMsg}`);
+         }
+     }
+     /**
+      * Deletes a cache entry.
+      */
+     async delete(key) {
+         try {
+             const cachePath = this.getCachePath(key);
+             await fs.unlink(cachePath);
+         }
+         catch (_e) {
+             // Ignore errors (e.g. key didn't exist)
+         }
+     }
+     /**
+      * Clears the entire cache directory.
+      */
+     async clear() {
+         try {
+             await fs.rm(this.cacheDir, { recursive: true, force: true });
+             await Logger.info('cache-clear', undefined, undefined, 'Cache directory deleted successfully.');
+         }
+         catch (err) {
+             const errorMsg = err instanceof Error ? err.message : String(err);
+             await Logger.error('cache-clear', undefined, undefined, `Failed to clear cache: ${errorMsg}`);
+         }
+     }
+ }
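One detail of `Cache.set` worth noting: `ttlMs ? Date.now() + ttlMs : null` treats a TTL of `0` the same as no TTL, so such an entry never expires. Whether that is intended is not clear from the diff. The sketch below isolates the expiry logic into pure functions (hypothetical names `makeEntry` and `isExpired`) and shows an alternative in which only an omitted TTL means "no expiry".

```typescript
// Sketch only: an alternative reading of the TTL semantics, not the package's behavior.
interface CacheEntry<T> {
  value: T;
  expiresAt: number | null; // epoch milliseconds, or null for "never expires"
}

/** Builds an entry; only an undefined TTL disables expiry (a TTL of 0 expires at `now`). */
function makeEntry<T>(value: T, ttlMs: number | undefined, now: number): CacheEntry<T> {
  return { value, expiresAt: ttlMs === undefined ? null : now + ttlMs };
}

/** Mirrors the check in Cache.get: expired once `now` is strictly past expiresAt. */
function isExpired<T>(entry: CacheEntry<T>, now: number): boolean {
  return entry.expiresAt !== null && now > entry.expiresAt;
}
```

Passing `now` explicitly instead of calling `Date.now()` inside keeps the logic deterministic and easy to test.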
package/dist/config.js ADDED
@@ -0,0 +1,34 @@
+ import dotenv from 'dotenv';
+ import path from 'path';
+ // Load environment variables from .env
+ dotenv.config();
+ const getEnvNumber = (key, defaultValue) => {
+     const val = process.env[key];
+     if (val === undefined)
+         return defaultValue;
+     const num = parseInt(val, 10);
+     return isNaN(num) ? defaultValue : num;
+ };
+ const getEnvBoolean = (key, defaultValue) => {
+     const val = process.env[key];
+     if (val === undefined)
+         return defaultValue;
+     return val.toLowerCase() === 'true';
+ };
+ export const config = {
+     inputCsv: path.resolve(process.env.INPUT_CSV || 'input.csv'),
+     outputCsv: path.resolve(process.env.OUTPUT_CSV || 'output/output.csv'),
+     checkpointFile: path.resolve(process.env.CHECKPOINT_FILE || 'output/checkpoint.json'),
+     cacheDir: path.resolve(process.env.CACHE_DIR || 'output/cache'),
+     concurrency: getEnvNumber('CONCURRENCY', 5),
+     maxDepth: getEnvNumber('MAX_DEPTH', 2),
+     maxPagesPerSite: getEnvNumber('MAX_PAGES_PER_SITE', 25),
+     maxCrawlTimePerSiteMs: getEnvNumber('MAX_CRAWL_TIME_PER_SITE_MS', 60000),
+     pageTimeoutMs: getEnvNumber('PAGE_TIMEOUT_MS', 15000),
+     headless: getEnvBoolean('HEADLESS', true),
+     minDelayMs: getEnvNumber('MIN_DELAY_MS', 500),
+     maxDelayMs: getEnvNumber('MAX_DELAY_MS', 2000),
+     maxRetries: getEnvNumber('MAX_RETRIES', 3),
+     retryInitialDelayMs: getEnvNumber('RETRY_INITIAL_DELAY_MS', 1000),
+     retryMaxDelayMs: getEnvNumber('RETRY_MAX_DELAY_MS', 10000),
+ };
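Two behaviors of the helpers above may surprise users: `parseInt` accepts trailing garbage (`CONCURRENCY=5abc` is read as `5`), and `getEnvBoolean` returns `false` for anything other than `true`, so `HEADLESS=1` launches a visible browser. A stricter variant could look like the sketch below; `parseEnvInt` and `parseEnvBool` are hypothetical names, and they take the raw string as a parameter so the logic stays independent of `process.env`.

```typescript
// Sketch only: stricter alternatives, not part of the package.

/** Accepts only whole integers (optionally signed); anything else yields the default. */
function parseEnvInt(raw: string | undefined, defaultValue: number): number {
  if (raw === undefined) return defaultValue;
  const trimmed = raw.trim();
  return /^-?\d+$/.test(trimmed) ? parseInt(trimmed, 10) : defaultValue;
}

/** Accepts common truthy/falsy spellings; unrecognized values yield the default. */
function parseEnvBool(raw: string | undefined, defaultValue: boolean): boolean {
  if (raw === undefined) return defaultValue;
  const v = raw.trim().toLowerCase();
  if (v === 'true' || v === '1' || v === 'yes' || v === 'on') return true;
  if (v === 'false' || v === '0' || v === 'no' || v === 'off') return false;
  return defaultValue;
}
```

Falling back to the default on unrecognized input is one design choice; failing fast with an error is another reasonable option for a long-running batch job.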
@@ -0,0 +1,280 @@
+ import { chromium } from 'playwright';
+ import { parseRobotsTxt } from './robots.js';
+ import { parseSitemap } from './sitemap.js';
+ import { extractAndFilterLinks, getLinkPriority } from './link-discovery.js';
+ import { extractEmails } from './extractor.js';
+ import { selectBestEmail } from './scorer.js';
+ import { Cache } from './cache.js';
+ import { Logger } from './logger.js';
+ import { getRandomDelay } from './utils/delay.js';
+ import { retryWithBackoff } from './utils/retry.js';
+ import { normalizeUrl } from './utils/normalize.js';
+ import { isDomainMatch } from './utils/validators.js';
+ import { PageLoadError, RateLimitError } from './utils/errors.js';
+ export class Crawler {
+     browser = null;
+     cache;
+     constructor() {
+         this.cache = new Cache();
+     }
+     /**
+      * Launches the headless/headful Playwright Chromium browser.
+      */
+     async initialize(headless = true) {
+         if (this.browser) {
+             return;
+         }
+         this.browser = await chromium.launch({
+             headless,
+             args: [
+                 '--no-sandbox',
+                 '--disable-setuid-sandbox',
+                 '--disable-dev-shm-usage',
+                 '--disable-accelerated-2d-canvas',
+                 '--disable-gpu',
+             ],
+         });
+     }
+     /**
+      * Closes the active browser instance.
+      */
+     async close() {
+         if (this.browser) {
+             await this.browser.close();
+             this.browser = null;
+         }
+     }
+     /**
+      * Loads a single page content using Playwright inside the context.
+      * Leverages exponential backoff retries.
+      */
+     async loadPage(pageUrl, context, config) {
+         return await retryWithBackoff(async () => {
+             const page = await context.newPage();
+             try {
+                 // Navigate with DOMContentLoaded as the baseline
+                 const response = await page.goto(pageUrl, {
+                     waitUntil: 'domcontentloaded',
+                     timeout: config.pageTimeoutMs,
+                 });
+                 if (!response) {
+                     throw new PageLoadError('No response received from page', pageUrl);
+                 }
+                 const status = response.status();
+                 if (status === 429) {
+                     throw new RateLimitError(`Rate limited (429)`, pageUrl);
+                 }
+                 if (status >= 500) {
+                     throw new PageLoadError(`Server error (${status})`, pageUrl, status);
+                 }
+                 // Wait for load states to let JavaScript load components (React, Angular, Vue, Next.js, etc.)
+                 await page.waitForLoadState('load', { timeout: 3000 }).catch(() => { });
+                 await page.waitForLoadState('networkidle', { timeout: 3000 }).catch(() => { });
+                 const html = await page.content();
+                 const title = await page.title().catch(() => '');
+                 return { html, title };
+             }
+             finally {
+                 await page.close();
+             }
+         }, {
+             maxRetries: config.maxRetries,
+             initialDelayMs: config.retryInitialDelayMs,
+             maxDelayMs: config.retryMaxDelayMs,
+             onRetry: (err, attempt) => {
+                 let host = pageUrl;
+                 try {
+                     host = new URL(pageUrl).hostname;
+                 }
+                 catch (_e) {
+                     /* ignore */
+                 }
+                 Logger.info('page-load-retry', host, undefined, `Attempt ${attempt}`, `Retrying navigation to ${pageUrl}: ${err.message}`).catch(() => { });
+             },
+         });
+     }
+     /**
+      * Crawls a single company website, performing discovery, BFS traversal, and email extraction.
+      * Respects depth limit, crawl page budget, and website crawl duration timeout.
+      */
+     async crawlWebsite(target, config) {
+         const startTime = Date.now();
+         const domain = target.domain;
+         if (!this.browser) {
+             throw new Error('Browser is not initialized. Run initialize() first.');
+         }
+         const startUrl = normalizeUrl(target.website);
+         if (!startUrl) {
+             return {
+                 target,
+                 success: false,
+                 discoveredEmails: [],
+                 selectedEmail: null,
+                 error: 'Invalid start URL',
+                 pagesCrawledCount: 0,
+                 durationMs: Date.now() - startTime,
+             };
+         }
+         let context = null;
+         const discoveredEmails = [];
+         const occurrenceCounts = {};
+         let pagesCrawledCount = 0;
+         try {
+             // 1. Initial configuration for browser context
+             context = await this.browser.newContext({
+                 userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
+                 viewport: { width: 1280, height: 800 },
+                 bypassCSP: true,
+                 ignoreHTTPSErrors: true,
+             });
+             // Bandwidth optimization: block assets like images, videos, fonts, and CSS.
+             // This is crucial for performance and avoids loading unnecessary assets.
+             await context.route('**/*', (route) => {
+                 const type = route.request().resourceType();
+                 if (['image', 'media', 'font', 'stylesheet'].includes(type)) {
+                     route.abort().catch(() => { });
+                 }
+                 else {
+                     route.continue().catch(() => { });
+                 }
+             });
+             // 2. Discover Sitemaps & robots.txt
+             const robotsInfo = await parseRobotsTxt(startUrl, this.cache);
+             const sitemapLinks = [...robotsInfo.sitemaps];
+             // If robots.txt doesn't mention sitemaps, guess standard names
+             if (sitemapLinks.length === 0) {
+                 try {
+                     const parsedBase = new URL(startUrl);
+                     sitemapLinks.push(`${parsedBase.protocol}//${parsedBase.host}/sitemap.xml`);
+                     sitemapLinks.push(`${parsedBase.protocol}//${parsedBase.host}/sitemap_index.xml`);
+                 }
+                 catch (_e) {
+                     sitemapLinks.push(`${startUrl}/sitemap.xml`);
+                 }
+             }
+             // Collect URLs from sitemaps (limit total processed URLs to avoid high memory usage)
+             const sitemapUrls = [];
+             for (const sitemapUrl of sitemapLinks) {
+                 const urls = await parseSitemap(sitemapUrl, this.cache, 250);
+                 sitemapUrls.push(...urls);
+                 if (sitemapUrls.length >= 500) {
+                     break;
+                 }
+             }
+             // Initialize BFS queue and visited tracker
+             const queue = [];
+             const visited = new Set();
+             // Push homepage as the first item
+             queue.push({ url: startUrl, depth: 0, referrer: '' });
+             visited.add(startUrl);
+             // Prioritize sitemap URLs (keep ones containing contact keywords, max 10 to start)
+             const filteredSitemapUrls = sitemapUrls
+                 .filter((url) => getLinkPriority(url) > 0)
+                 .slice(0, 10);
+             for (const sUrl of filteredSitemapUrls) {
+                 const normalized = normalizeUrl(sUrl);
+                 if (normalized && !visited.has(normalized)) {
+                     queue.push({ url: normalized, depth: 1, referrer: 'sitemap' });
+                     visited.add(normalized);
+                 }
+             }
+             await Logger.info('crawl-start', domain, undefined, 'Active', `Queue size: ${queue.length} pages, sitemaps parsed: ${sitemapLinks.length}`);
+             // 3. Traversal (BFS) Loop
+             while (queue.length > 0 && pagesCrawledCount < config.maxPagesPerSite) {
+                 const elapsed = Date.now() - startTime;
+                 if (elapsed > config.maxCrawlTimePerSiteMs) {
+                     await Logger.info('crawl-time-limit', domain, elapsed, 'Timeout', `Reached budget limit of ${config.maxCrawlTimePerSiteMs}ms`);
+                     break;
+                 }
+                 const current = queue.shift();
+                 if (!current) {
+                     continue;
+                 }
+                 // Apply throttling delay between pages of the SAME website to avoid rate limits
+                 if (pagesCrawledCount > 0) {
+                     await getRandomDelay(config.minDelayMs, config.maxDelayMs);
+                 }
+                 pagesCrawledCount++;
+                 const pageStart = Date.now();
+                 try {
+                     const { html, title } = await this.loadPage(current.url, context, config);
+                     const pageDuration = Date.now() - pageStart;
+                     // Extract emails
+                     const extracted = extractEmails(html, current.url, title, pageDuration);
+                     for (const item of extracted) {
+                         // Keep occurrences tracker updated
+                         occurrenceCounts[item.email] = (occurrenceCounts[item.email] || 0) + 1;
+                         // Deduplicate: If already found, update with higher confidence if applicable
+                         const existingIdx = discoveredEmails.findIndex((e) => e.email === item.email);
+                         if (existingIdx === -1) {
+                             discoveredEmails.push(item);
+                             await Logger.email(domain, item.email, item.emailSource, item.confidenceScore, item.discoveryMethod);
+                         }
+                         else {
+                             if (item.confidenceScore > discoveredEmails[existingIdx].confidenceScore) {
+                                 discoveredEmails[existingIdx] = item;
+                             }
+                         }
+                     }
+                     // EARLY STOP OPTIMIZATION: If we found a high confidence, domain-matching contact email, stop crawling
+                     const currentBest = selectBestEmail(discoveredEmails, domain, occurrenceCounts);
+                     if (currentBest &&
+                         currentBest.confidenceScore >= 95 &&
+                         isDomainMatch(currentBest.email, domain)) {
+                         await Logger.info('crawl-early-stop', domain, Date.now() - startTime, 'Success', `Early exit triggered by: ${currentBest.email} (${currentBest.confidenceScore} score)`);
+                         break;
+                     }
+                     // Discover and enqueue internal links if depth is within bounds
+                     if (current.depth < config.maxDepth) {
+                         const childLinks = extractAndFilterLinks(html, current.url, domain);
+                         for (const link of childLinks) {
+                             if (!visited.has(link) && visited.size < 100) {
+                                 // Safety ceiling to prevent massive Set sizes
+                                 visited.add(link);
+                                 queue.push({
+                                     url: link,
+                                     depth: current.depth + 1,
+                                     referrer: current.url,
+                                 });
+                             }
+                         }
+                     }
+                 }
+                 catch (err) {
+                     const errorMsg = err instanceof Error ? err.message : String(err);
+                     await Logger.error('page-crawl-error', domain, Date.now() - pageStart, `Failed ${current.url}: ${errorMsg}`);
+                 }
+             }
+             // 4. Select Final Email
+             const selectedEmail = selectBestEmail(discoveredEmails, domain, occurrenceCounts);
+             const totalDuration = Date.now() - startTime;
+             await Logger.info('crawl-end', domain, totalDuration, selectedEmail ? 'Found' : 'Empty', `Crawled ${pagesCrawledCount} pages. Final email: ${selectedEmail ? selectedEmail.email : 'None'}`);
+             return {
+                 target,
+                 success: true,
+                 discoveredEmails,
+                 selectedEmail,
+                 pagesCrawledCount,
+                 durationMs: totalDuration,
+             };
+         }
+         catch (err) {
+             const errorMsg = err instanceof Error ? err.message : String(err);
+             await Logger.error('crawl-fatal-error', domain, Date.now() - startTime, errorMsg);
+             return {
+                 target,
+                 success: false,
+                 discoveredEmails: [],
+                 selectedEmail: null,
+                 error: errorMsg,
+                 pagesCrawledCount,
+                 durationMs: Date.now() - startTime,
+             };
+         }
+         finally {
+             if (context) {
+                 await context.close().catch(() => { });
+             }
+         }
+     }
+ }
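The crawl loop above consumes its queue in plain FIFO order via `queue.shift()`, so any "priority routing" must come from the order in which links are enqueued. `link-discovery.js` is not part of this diff, so the sketch below is only one way such ordering could work: a hypothetical `linkPriority` scorer (the keyword weights are invented for illustration) and an `orderQueue` helper that keeps breadth-first depth order while moving contact-relevant paths to the front within each depth.

```typescript
// Sketch only: keyword weights and function names are assumptions; the real
// getLinkPriority in link-discovery.js may score links differently.
interface QueueItem {
  url: string;
  depth: number;
}

const PATH_PRIORITIES: Array<[string, number]> = [
  ['contact', 3],
  ['about', 2],
  ['support', 1],
];

/** Scores a URL by the best-matching keyword in its path (host is ignored). */
function linkPriority(url: string): number {
  // Strip scheme and host so a domain like "contactly.com" does not match.
  const path = url.replace(/^[a-z]+:\/\/[^\/]+/i, '').toLowerCase();
  let best = 0;
  for (const [keyword, score] of PATH_PRIORITIES) {
    if (path.indexOf(keyword) !== -1 && score > best) {
      best = score;
    }
  }
  return best;
}

/** Orders by depth (BFS), then by priority, then by original position for stability. */
function orderQueue(items: QueueItem[]): QueueItem[] {
  return items
    .map((item, index) => ({ item, index, priority: linkPriority(item.url) }))
    .sort((a, b) => a.item.depth - b.item.depth || b.priority - a.priority || a.index - b.index)
    .map((entry) => entry.item);
}
```

Combined with the early-stop check in the loop, visiting `/contact` before `/blog` can end a site crawl after only a page or two instead of exhausting the page budget.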