npm - mailpop - Versions diffs - 1.0.0 - Mend

mailpop 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/README.md ADDED

@@ -0,0 +1,195 @@
+# mailpop 🎯
+mailpop is a production-ready, highly optimized TypeScript application built on top of **Node.js 22+** and **Playwright Chromium** to discover reliable, public contact emails directly from company websites.
+Designed specifically for cold outreach lead enrichment where sender reputation is critical, mailpop implements robust verification heuristics, email score priorities, and strict validation checks. It **never invents or guesses emails**; it only returns emails found explicitly on the public web.
+---
+## Key Features
+- 🌐 **SPA & JavaScript Execution**: Uses Playwright Chromium (headless by default) to handle modern frameworks (React, Next.js, Vue, Angular, Astro, etc.) and redirects.
+- ⚡ **BFS Crawler with Priority Routing**: Custom Breadth-First-Search (BFS) traversal prioritizing contact-relevant paths (`contact`, `about`, `support`, etc.) up to depth 2.
+- 📂 **Sitemap & Robots.txt Parser**: Scans `robots.txt` and recursively crawls sitemaps (and sitemap indices) using Cheerio in `xmlMode` to extract links before crawling.
+- 🔐 **Obfuscation Decoders**: Automatically bypasses and decodes Cloudflare email protection, HTML entities, Unicode sequences, Base64 strings, and common textual patterns (e.g., `name [at] company [dot] com`).
+- 📊 **Scoring & Confidence Engines**: Computes a confidence score based on the location of the email (footer, header, body, mailto, script), page relevance, domain-match alignment, and frequency.
+- 💾 **Memory Efficient Streaming**: Reads input CSVs and appends output CSV records incrementally using Node.js async generators and `fast-csv`. Handles datasets of 100 to 50,000+ entries without memory leaks.
+- 🔄 **Contiguous Checkpoints & Resuming**: Automatically saves crawler checkpoints every 10 rows. If the process is killed (SIGINT/SIGTERM or crash), it resumes exactly from the last processed contiguous row.
+- 🚦 **Adaptive Throttling**: Adds randomized delays between page requests to the same website to respect rate limits and avoid hammering servers.
+- 📝 **Structured Logging**: Generates JSON-formatted app logs, error logs, and dedicated email discovery logs in the `logs/` directory.
+---
+## Architecture Overview
+```mermaid
+graph TD
+    A[input.csv] -->|Read Stream / Generator| B[CSV Orchestrator]
+    B -->|Check Checkpoint| C{Checkpoint Exists?}
+    C -->|Yes| D[Skip to Next Row]
+    C -->|No| E[Fresh Run & Write Headers]
+    D --> F[BFS Crawl Target]
+    E --> F
+    F -->|Fetch Sitemaps & robots.txt| G[Cheerio Link Collector]
+    F -->|Spawn Browser Context| H[Playwright Page Loader]
+    G -->|Seed Queue| H
+    H -->|Block Media/CSS| I[Render SPA & Execute JS]
+    I -->|Check Early Exit| J{High Confidence Email Found?}
+    J -->|Yes| K[Terminate Site Crawl]
+    J -->|No| L[Queue Next Traversal Page]
+    L --> I
+    I -->|Extract & Decode| M[Extractor]
+    M -->|Filter & Score| N[Scorer]
+    N -->|Select Best Email| O[CSV Appender]
+    O -->|Every 10 rows| P[Save Checkpoint]
+```
+---
+## Tech Stack
+- **Runtime**: Node.js 22+ (ES Modules)
+- **Language**: TypeScript 5+ (Strict mode compiler settings)
+- **Scraping**: Playwright Chromium, Cheerio
+- **Data & Streams**: Fast CSV, p-limit
+- **Tooling**: tsx (Development runner), ESLint 9+ (Flat config), Prettier
+---
+## Configuration (`.env`)
+Create a `.env` file in the root directory (based on `.env.example`):
+```env
+# Input & Output Files
+INPUT_CSV=input.csv
+OUTPUT_CSV=output/output.csv
+CHECKPOINT_FILE=output/checkpoint.json
+CACHE_DIR=output/cache
+# Crawling Limits
+CONCURRENCY=5
+MAX_DEPTH=2
+MAX_PAGES_PER_SITE=25
+MAX_CRAWL_TIME_PER_SITE_MS=60000
+PAGE_TIMEOUT_MS=15000
+# Browser Settings
+HEADLESS=true
+# Throttling & Delay
+MIN_DELAY_MS=500
+MAX_DELAY_MS=2000
+# Retry Configuration
+MAX_RETRIES=3
+RETRY_INITIAL_DELAY_MS=1000
+RETRY_MAX_DELAY_MS=10000
+```
+---
+## Installation & Setup
+1. **Clone or navigate to the project directory**:
+   ```bash
+   cd email-hunter-pro # (Or the directory name mailpop was installed to)
+   ```
+2. **Install Node dependencies**:
+   ```bash
+   npm install
+   ```
+3. **Install Playwright Browsers**:
+   ```bash
+   npx playwright install chromium
+   ```
+4. **Prepare the configuration**:
+   ```bash
+   cp .env.example .env
+   ```
+5. **Prepare your input data**:
+   Put your list of company websites in `input.csv` (see the existing `input.csv` for the template).
+---
+## CLI & npx Execution
+After compiling the codebase with `npm run build`, you can invoke the program directly as a CLI tool using `npx`.
+### Command Syntax
+```bash
+# General invocation (defaults to config settings for input and output)
+npx .
+# Custom input and output paths (positional arguments)
+npx . my_leads.csv enriched_output.csv
+# Using explicit flags
+npx . -i my_leads.csv -o enriched_output.csv
+# Viewing CLI help options
+npx . -h
+```
+### Available Package Scripts
+| Script | Command | Description |
+| :--- | :--- | :--- |
+| `npm run dev` | `tsx src/index.ts` | Runs the development code directly using `tsx` |
+| `npm run build` | `tsc` | Compiles the TypeScript code to standard ES Modules in `dist/` |
+| `npm run start` | `node dist/index.js` | Runs the compiled JavaScript output |
+| `npm run lint` | `eslint src` | Checks code against linting rules |
+| `npm run format` | `prettier --write "src/**/*.ts"` | Formats code using Prettier |
+| `npm run typecheck` | `tsc --noEmit` | Performs static type checks |
+---
+## CSV Data Processing Heuristics
+### Dynamic Header Matching
+mailpop does not mandate a fixed schema. It accepts CSVs with **any custom columns** (e.g. CRM IDs, Industry, Contact Names).
+The crawler dynamically detects the target domain or website column by scanning the header row case-insensitively for keywords: `website`, `url`, `domain`, `site`, or `web` (supporting partial matches like `CompanyDomain` or `target_url`).
+### Enriched Output Generation
+When writing the output, mailpop **preserves 100% of the original columns** and appends the enrichment results as new columns:
+- `email`: The discovered, verified contact email (leaves empty if none is confidently found).
+- `email_source`: Specific URL/page where the email was located.
+- `email_type`: Classifies the email category (`role`, `personal`, `automated`).
+- `confidence_score`: Confidence rating from 10 to 100.
+- `discovery_method`: Scraping discovery origin (`contact-page`, `about-page`, `footer`, `header`, `sitemap`, `general-page`, `obscure-js`, `mailto-link`).
+#### Custom Column Matching Example
+**Input CSV (`custom_leads.csv`)**:
+```csv
+CompanyID,CompanyDomain,Industry,ContactPerson
+1001,https://github.com,Technology,John Doe
+```
+**Output CSV (`enriched.csv`)**:
+```csv
+CompanyID,CompanyDomain,Industry,ContactPerson,email,email_source,email_type,confidence_score,discovery_method
+1001,https://github.com,Technology,John Doe,copyright@github.com,https://docs.github.com/site-policy/github-terms/github-terms-of-service,personal,100,mailto-link
+```
+---
+## Structured Log Output (`logs/`)
+- `logs/app.log`: General JSON logs showing crawl transitions and app events.
+- `logs/errors.log`: Failures, timeouts, or exceptions raised by page crawls.
+- `logs/discovered-emails.log`: Every email address found during the traversal, containing its source and confidence metrics.
+Example log entry in `discovered-emails.log`:
+```json
+{"timestamp":"2026-06-03T09:12:00.123Z","domain":"acme.com","email":"contact@acme.com","emailSource":"https://acme.com/contact","confidenceScore":98,"discoveryMethod":"contact-page"}
+```
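
The dynamic header matching described in the README can be illustrated with a small sketch. This is not code from the package: `findWebsiteColumn` and `KEYWORDS` are hypothetical names, and treating the keyword list as a priority order (most specific first) is an assumption, since the README only says that headers are scanned case-insensitively with partial matches.

```javascript
// Hypothetical sketch of case-insensitive, partial header matching.
// The keyword list comes from the README; the priority order is assumed.
const KEYWORDS = ['website', 'url', 'domain', 'site', 'web'];

function findWebsiteColumn(headers) {
  for (const keyword of KEYWORDS) {
    // Substring match so that "CompanyDomain" or "target_url" qualify.
    const match = headers.find((h) => h.toLowerCase().includes(keyword));
    if (match !== undefined) return match;
  }
  return null; // No column looks like a website/domain column.
}

console.log(findWebsiteColumn(['CompanyID', 'CompanyDomain', 'Industry', 'ContactPerson']));
// prints "CompanyDomain"
```

With the README's own example header row, the `domain` keyword selects `CompanyDomain`; a file with no matching header yields `null`, which a caller would have to treat as an unusable input.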

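Two of the obfuscation schemes the README lists are simple enough to sketch. The extractor source is not shown in this part of the diff, so the functions below (`deobfuscate`, `decodeCfEmail`) are hypothetical illustrations rather than the package's implementation. The Cloudflare scheme shown is the widely documented one: a hex string whose first byte is an XOR key for the remaining bytes.

```javascript
// Hypothetical sketch: rewrite bracketed "at"/"dot" tokens into "@" and ".".
function deobfuscate(text) {
  return text
    .replace(/\s*[\[({]\s*at\s*[\])}]\s*/gi, '@')
    .replace(/\s*[\[({]\s*dot\s*[\])}]\s*/gi, '.');
}

// Hypothetical sketch: decode a Cloudflare-style hex payload.
// Byte 0 is the XOR key; every following byte is one encoded character.
function decodeCfEmail(hex) {
  const key = parseInt(hex.slice(0, 2), 16);
  let out = '';
  for (let i = 2; i < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16) ^ key);
  }
  return out;
}

console.log(deobfuscate('name [at] company [dot] com')); // prints "name@company.com"
console.log(decodeCfEmail('2a4b6a48044945'));            // prints "a@b.co"
```

Only bracketed tokens are rewritten here; matching a bare "at" or "dot" would corrupt ordinary prose, so a real decoder would need stricter context checks.
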
package/dist/cache.js ADDED

@@ -0,0 +1,81 @@
+import fs from 'fs/promises';
+import path from 'path';
+import crypto from 'crypto';
+import { config } from './config.js';
+import { Logger } from './logger.js';
+export class Cache {
+    cacheDir;
+    constructor() {
+        this.cacheDir = config.cacheDir;
+    }
+    /**
+     * Generates a unique, file-safe cache path for a given key.
+     */
+    getCachePath(key) {
+        const hash = crypto.createHash('md5').update(key).digest('hex');
+        return path.join(this.cacheDir, `${hash}.json`);
+    }
+    /**
+     * Retrieves an item from the cache. Returns null if missing or expired.
+     */
+    async get(key) {
+        try {
+            const cachePath = this.getCachePath(key);
+            const content = await fs.readFile(cachePath, 'utf-8');
+            const entry = JSON.parse(content);
+            if (entry.expiresAt !== null && Date.now() > entry.expiresAt) {
+                // Expired cache entry, clean it up
+                await fs.unlink(cachePath);
+                return null;
+            }
+            return entry.value;
+        }
+        catch (_e) {
+            // Cache miss or error reading
+            return null;
+        }
+    }
+    /**
+     * Sets an item in the cache with an optional TTL (Time To Live).
+     */
+    async set(key, value, ttlMs) {
+        try {
+            await fs.mkdir(this.cacheDir, { recursive: true });
+            const cachePath = this.getCachePath(key);
+            const entry = {
+                value,
+                expiresAt: ttlMs ? Date.now() + ttlMs : null,
+            };
+            await fs.writeFile(cachePath, JSON.stringify(entry), 'utf-8');
+        }
+        catch (err) {
+            const errorMsg = err instanceof Error ? err.message : String(err);
+            await Logger.error('cache-write', undefined, undefined, `Failed to write cache key '${key}': ${errorMsg}`);
+        }
+    }
+    /**
+     * Deletes a cache entry.
+     */
+    async delete(key) {
+        try {
+            const cachePath = this.getCachePath(key);
+            await fs.unlink(cachePath);
+        }
+        catch (_e) {
+            // Ignore errors (e.g. key didn't exist)
+        }
+    }
+    /**
+     * Clears the entire cache directory.
+     */
+    async clear() {
+        try {
+            await fs.rm(this.cacheDir, { recursive: true, force: true });
+            await Logger.info('cache-clear', undefined, undefined, 'Cache directory deleted successfully.');
+        }
+        catch (err) {
+            const errorMsg = err instanceof Error ? err.message : String(err);
+            await Logger.error('cache-clear', undefined, undefined, `Failed to clear cache: ${errorMsg}`);
+        }
+    }
+}
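
The expiry rule used by `set` and `get` in `cache.js` can be isolated into two pure functions to make its edge cases visible. `makeEntry` and `readEntry` are hypothetical names introduced for this sketch, and the clock is passed in as `now` so the behavior is deterministic; the conditions themselves are copied from the file above.

```javascript
// Mirrors the entry shape written by Cache.set: { value, expiresAt }.
function makeEntry(value, ttlMs, now) {
  // A falsy ttlMs (undefined or 0) means "never expires".
  return { value, expiresAt: ttlMs ? now + ttlMs : null };
}

// Mirrors the check in Cache.get: expired entries read as null.
function readEntry(entry, now) {
  if (entry.expiresAt !== null && now > entry.expiresAt) return null;
  return entry.value;
}

const entry = makeEntry('robots-body', 1000, 0);
console.log(readEntry(entry, 1000)); // prints "robots-body" (still valid at the boundary)
console.log(readEntry(entry, 1001)); // prints null
console.log(makeEntry('x', 0, 0).expiresAt); // prints null
```

Two details follow from the conditions as written: an entry is still served at exactly `expiresAt` because the comparison is strict, and a TTL of `0` is treated the same as no TTL rather than as "expire immediately".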

package/dist/config.js ADDED

@@ -0,0 +1,34 @@
+import dotenv from 'dotenv';
+import path from 'path';
+// Load environment variables from .env
+dotenv.config();
+const getEnvNumber = (key, defaultValue) => {
+    const val = process.env[key];
+    if (val === undefined)
+        return defaultValue;
+    const num = parseInt(val, 10);
+    return isNaN(num) ? defaultValue : num;
+};
+const getEnvBoolean = (key, defaultValue) => {
+    const val = process.env[key];
+    if (val === undefined)
+        return defaultValue;
+    return val.toLowerCase() === 'true';
+};
+export const config = {
+    inputCsv: path.resolve(process.env.INPUT_CSV || 'input.csv'),
+    outputCsv: path.resolve(process.env.OUTPUT_CSV || 'output/output.csv'),
+    checkpointFile: path.resolve(process.env.CHECKPOINT_FILE || 'output/checkpoint.json'),
+    cacheDir: path.resolve(process.env.CACHE_DIR || 'output/cache'),
+    concurrency: getEnvNumber('CONCURRENCY', 5),
+    maxDepth: getEnvNumber('MAX_DEPTH', 2),
+    maxPagesPerSite: getEnvNumber('MAX_PAGES_PER_SITE', 25),
+    maxCrawlTimePerSiteMs: getEnvNumber('MAX_CRAWL_TIME_PER_SITE_MS', 60000),
+    pageTimeoutMs: getEnvNumber('PAGE_TIMEOUT_MS', 15000),
+    headless: getEnvBoolean('HEADLESS', true),
+    minDelayMs: getEnvNumber('MIN_DELAY_MS', 500),
+    maxDelayMs: getEnvNumber('MAX_DELAY_MS', 2000),
+    maxRetries: getEnvNumber('MAX_RETRIES', 3),
+    retryInitialDelayMs: getEnvNumber('RETRY_INITIAL_DELAY_MS', 1000),
+    retryMaxDelayMs: getEnvNumber('RETRY_MAX_DELAY_MS', 10000),
+};
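
The parsing helpers in `config.js` have a few behaviors that are easy to miss when writing a `.env` file. The sketch below reproduces their logic with one change made for illustration only: the environment is passed in as an `env` argument instead of being read from `process.env`, so the cases can be shown side by side.

```javascript
// Same logic as getEnvNumber in config.js, with env injected (an assumption
// made for this sketch so that it does not depend on process.env).
const getEnvNumber = (env, key, defaultValue) => {
  const val = env[key];
  if (val === undefined) return defaultValue;
  const num = parseInt(val, 10);
  return isNaN(num) ? defaultValue : num;
};

// Same logic as getEnvBoolean in config.js, with env injected.
const getEnvBoolean = (env, key, defaultValue) => {
  const val = env[key];
  if (val === undefined) return defaultValue;
  return val.toLowerCase() === 'true';
};

const env = { CONCURRENCY: '8', MAX_DEPTH: 'deep', PAGE_TIMEOUT_MS: '15000ms', HEADLESS: '1' };
console.log(getEnvNumber(env, 'CONCURRENCY', 5));         // prints 8
console.log(getEnvNumber(env, 'MAX_DEPTH', 2));           // prints 2 (not a number, default used)
console.log(getEnvNumber(env, 'PAGE_TIMEOUT_MS', 15000)); // prints 15000 (trailing "ms" ignored)
console.log(getEnvBoolean(env, 'HEADLESS', true));        // prints false
```

Because only the literal string `true` (in any letter case) enables a boolean, values such as `1` or `yes` disable it, and `parseInt` silently truncates values like `2.9` to `2`.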

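The three retry settings (`maxRetries`, `retryInitialDelayMs`, `retryMaxDelayMs`) are passed to `retryWithBackoff`, whose source (`utils/retry.js`) is not shown in this part of the diff. As a rough guide to what those values could mean, the sketch below computes a delay schedule under the assumption that the delay doubles on each attempt and is capped at the maximum. `backoffDelays` is a hypothetical function, and the doubling factor and absence of jitter are assumptions.

```javascript
// Hypothetical sketch: capped exponential backoff schedule (factor 2 assumed).
function backoffDelays(maxRetries, initialDelayMs, maxDelayMs) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Double the delay each attempt, never exceeding the configured ceiling.
    delays.push(Math.min(initialDelayMs * 2 ** attempt, maxDelayMs));
  }
  return delays;
}

console.log(backoffDelays(3, 1000, 10000)); // prints [ 1000, 2000, 4000 ]
console.log(backoffDelays(6, 1000, 10000)); // prints [ 1000, 2000, 4000, 8000, 10000, 10000 ]
```

Under these assumptions the default configuration would spend at most 7 seconds waiting between attempts for a single page, on top of the per-attempt page timeout.
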
package/dist/crawler.js ADDED

@@ -0,0 +1,280 @@
+import { chromium } from 'playwright';
+import { parseRobotsTxt } from './robots.js';
+import { parseSitemap } from './sitemap.js';
+import { extractAndFilterLinks, getLinkPriority } from './link-discovery.js';
+import { extractEmails } from './extractor.js';
+import { selectBestEmail } from './scorer.js';
+import { Cache } from './cache.js';
+import { Logger } from './logger.js';
+import { getRandomDelay } from './utils/delay.js';
+import { retryWithBackoff } from './utils/retry.js';
+import { normalizeUrl } from './utils/normalize.js';
+import { isDomainMatch } from './utils/validators.js';
+import { PageLoadError, RateLimitError } from './utils/errors.js';
+export class Crawler {
+    browser = null;
+    cache;
+    constructor() {
+        this.cache = new Cache();
+    }
+    /**
+     * Launches the headless/headful Playwright Chromium browser.
+     */
+    async initialize(headless = true) {
+        if (this.browser) {
+            return;
+        }
+        this.browser = await chromium.launch({
+            headless,
+            args: [
+                '--no-sandbox',
+                '--disable-setuid-sandbox',
+                '--disable-dev-shm-usage',
+                '--disable-accelerated-2d-canvas',
+                '--disable-gpu',
+            ],
+        });
+    }
+    /**
+     * Closes the active browser instance.
+     */
+    async close() {
+        if (this.browser) {
+            await this.browser.close();
+            this.browser = null;
+        }
+    }
+    /**
+     * Loads a single page's content using Playwright inside the given context.
+     * Leverages exponential backoff retries.
+     */
+    async loadPage(pageUrl, context, config) {
+        return await retryWithBackoff(async () => {
+            const page = await context.newPage();
+            try {
+                // Navigate with DOMContentLoaded as the baseline
+                const response = await page.goto(pageUrl, {
+                    waitUntil: 'domcontentloaded',
+                    timeout: config.pageTimeoutMs,
+                });
+                if (!response) {
+                    throw new PageLoadError('No response received from page', pageUrl);
+                }
+                const status = response.status();
+                if (status === 429) {
+                    throw new RateLimitError(`Rate limited (429)`, pageUrl);
+                }
+                if (status >= 500) {
+                    throw new PageLoadError(`Server error (${status})`, pageUrl, status);
+                }
+                // Wait for load states to let JavaScript load components (React, Angular, Vue, Next.js, etc.)
+                await page.waitForLoadState('load', { timeout: 3000 }).catch(() => { });
+                await page.waitForLoadState('networkidle', { timeout: 3000 }).catch(() => { });
+                const html = await page.content();
+                const title = await page.title().catch(() => '');
+                return { html, title };
+            }
+            finally {
+                await page.close();
+            }
+        }, {
+            maxRetries: config.maxRetries,
+            initialDelayMs: config.retryInitialDelayMs,
+            maxDelayMs: config.retryMaxDelayMs,
+            onRetry: (err, attempt) => {
+                let host = pageUrl;
+                try {
+                    host = new URL(pageUrl).hostname;
+                }
+                catch (_e) {
+                    /* ignore */
+                }
+                Logger.info('page-load-retry', host, undefined, `Attempt ${attempt}`, `Retrying navigation to ${pageUrl}: ${err.message}`).catch(() => { });
+            },
+        });
+    }
+    /**
+     * Crawls a single company website, performing discovery, BFS traversal, and email extraction.
+     * Respects depth limit, crawl page budget, and website crawl duration timeout.
+     */
+    async crawlWebsite(target, config) {
+        const startTime = Date.now();
+        const domain = target.domain;
+        if (!this.browser) {
+            throw new Error('Browser is not initialized. Run initialize() first.');
+        }
+        const startUrl = normalizeUrl(target.website);
+        if (!startUrl) {
+            return {
+                target,
+                success: false,
+                discoveredEmails: [],
+                selectedEmail: null,
+                error: 'Invalid start URL',
+                pagesCrawledCount: 0,
+                durationMs: Date.now() - startTime,
+            };
+        }
+        let context = null;
+        const discoveredEmails = [];
+        const occurrenceCounts = {};
+        let pagesCrawledCount = 0;
+        try {
+            // 1. Initial configuration for browser context
+            context = await this.browser.newContext({
+                userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
+                viewport: { width: 1280, height: 800 },
+                bypassCSP: true,
+                ignoreHTTPSErrors: true,
+            });
+            // Bandwidth optimization: block assets like images, videos, fonts, and CSS.
+            // This is crucial for performance and avoids loading unnecessary assets.
+            await context.route('**/*', (route) => {
+                const type = route.request().resourceType();
+                if (['image', 'media', 'font', 'stylesheet'].includes(type)) {
+                    route.abort().catch(() => { });
+                }
+                else {
+                    route.continue().catch(() => { });
+                }
+            });
+            // 2. Discover Sitemaps & robots.txt
+            const robotsInfo = await parseRobotsTxt(startUrl, this.cache);
+            const sitemapLinks = [...robotsInfo.sitemaps];
+            // If robots.txt doesn't mention sitemaps, guess standard names
+            if (sitemapLinks.length === 0) {
+                try {
+                    const parsedBase = new URL(startUrl);
+                    sitemapLinks.push(`${parsedBase.protocol}//${parsedBase.host}/sitemap.xml`);
+                    sitemapLinks.push(`${parsedBase.protocol}//${parsedBase.host}/sitemap_index.xml`);
+                }
+                catch (_e) {
+                    sitemapLinks.push(`${startUrl}/sitemap.xml`);
+                }
+            }
+            // Collect URLs from sitemaps (limit total processed URLs to avoid high memory usage)
+            const sitemapUrls = [];
+            for (const sitemapUrl of sitemapLinks) {
+                const urls = await parseSitemap(sitemapUrl, this.cache, 250);
+                sitemapUrls.push(...urls);
+                if (sitemapUrls.length >= 500) {
+                    break;
+                }
+            }
+            // Initialize BFS queue and visited tracker
+            const queue = [];
+            const visited = new Set();
+            // Push homepage as the first item
+            queue.push({ url: startUrl, depth: 0, referrer: '' });
+            visited.add(startUrl);
+            // Prioritize sitemap URLs (keep ones containing contact keywords, max 10 to start)
+            const filteredSitemapUrls = sitemapUrls
+                .filter((url) => getLinkPriority(url) > 0)
+                .slice(0, 10);
+            for (const sUrl of filteredSitemapUrls) {
+                const normalized = normalizeUrl(sUrl);
+                if (normalized && !visited.has(normalized)) {
+                    queue.push({ url: normalized, depth: 1, referrer: 'sitemap' });
+                    visited.add(normalized);
+                }
+            }
+            await Logger.info('crawl-start', domain, undefined, 'Active', `Queue size: ${queue.length} pages, sitemaps parsed: ${sitemapLinks.length}`);
+            // 3. Traversal (BFS) Loop
+            while (queue.length > 0 && pagesCrawledCount < config.maxPagesPerSite) {
+                const elapsed = Date.now() - startTime;
+                if (elapsed > config.maxCrawlTimePerSiteMs) {
+                    await Logger.info('crawl-time-limit', domain, elapsed, 'Timeout', `Reached budget limit of ${config.maxCrawlTimePerSiteMs}ms`);
+                    break;
+                }
+                const current = queue.shift();
+                if (!current) {
+                    continue;
+                }
+                // Apply throttling delay between pages of the SAME website to avoid rate limits
+                if (pagesCrawledCount > 0) {
+                    await getRandomDelay(config.minDelayMs, config.maxDelayMs);
+                }
+                pagesCrawledCount++;
+                const pageStart = Date.now();
+                try {
+                    const { html, title } = await this.loadPage(current.url, context, config);
+                    const pageDuration = Date.now() - pageStart;
+                    // Extract emails
+                    const extracted = extractEmails(html, current.url, title, pageDuration);
+                    for (const item of extracted) {
+                        // Keep occurrences tracker updated
+                        occurrenceCounts[item.email] = (occurrenceCounts[item.email] || 0) + 1;
+                        // Deduplicate: If already found, update with higher confidence if applicable
+                        const existingIdx = discoveredEmails.findIndex((e) => e.email === item.email);
+                        if (existingIdx === -1) {
+                            discoveredEmails.push(item);
+                            await Logger.email(domain, item.email, item.emailSource, item.confidenceScore, item.discoveryMethod);
+                        }
+                        else {
+                            if (item.confidenceScore > discoveredEmails[existingIdx].confidenceScore) {
+                                discoveredEmails[existingIdx] = item;
+                            }
+                        }
+                    }
+                    // EARLY STOP OPTIMIZATION: If we found a high confidence, domain-matching contact email, stop crawling
+                    const currentBest = selectBestEmail(discoveredEmails, domain, occurrenceCounts);
+                    if (currentBest &&
+                        currentBest.confidenceScore >= 95 &&
+                        isDomainMatch(currentBest.email, domain)) {
+                        await Logger.info('crawl-early-stop', domain, Date.now() - startTime, 'Success', `Early exit triggered by: ${currentBest.email} (${currentBest.confidenceScore} score)`);
+                        break;
+                    }
+                    // Discover and enqueue internal links if depth is within bounds
+                    if (current.depth < config.maxDepth) {
+                        const childLinks = extractAndFilterLinks(html, current.url, domain);
+                        for (const link of childLinks) {
+                            if (!visited.has(link) && visited.size < 100) {
+                                // Safety ceiling to prevent massive Set sizes
+                                visited.add(link);
+                                queue.push({
+                                    url: link,
+                                    depth: current.depth + 1,
+                                    referrer: current.url,
+                                });
+                            }
+                        }
+                    }
+                }
+                catch (err) {
+                    const errorMsg = err instanceof Error ? err.message : String(err);
+                    await Logger.error('page-crawl-error', domain, Date.now() - pageStart, `Failed ${current.url}: ${errorMsg}`);
+                }
+            }
+            // 4. Select Final Email
+            const selectedEmail = selectBestEmail(discoveredEmails, domain, occurrenceCounts);
+            const totalDuration = Date.now() - startTime;
+            await Logger.info('crawl-end', domain, totalDuration, selectedEmail ? 'Found' : 'Empty', `Crawled ${pagesCrawledCount} pages. Final email: ${selectedEmail ? selectedEmail.email : 'None'}`);
+            return {
+                target,
+                success: true,
+                discoveredEmails,
+                selectedEmail,
+                pagesCrawledCount,
+                durationMs: totalDuration,
+            };
+        }
+        catch (err) {
+            const errorMsg = err instanceof Error ? err.message : String(err);
+            await Logger.error('crawl-fatal-error', domain, Date.now() - startTime, errorMsg);
+            return {
+                target,
+                success: false,
+                discoveredEmails: [],
+                selectedEmail: null,
+                error: errorMsg,
+                pagesCrawledCount,
+                durationMs: Date.now() - startTime,
+            };
+        }
+        finally {
+            if (context) {
+                await context.close().catch(() => { });
+            }
+        }
+    }
+}
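
The traversal loop in `crawlWebsite` combines a FIFO queue, a `visited` set, a depth limit, and a page budget. The sketch below reduces that loop to its control flow by replacing page loading and link extraction with a lookup in an in-memory link graph. `bfsCrawl` and the `graph` object are hypothetical; the time budget, throttling, the `visited.size < 100` ceiling, and the early exit on a high-confidence email are deliberately left out.

```javascript
// Hypothetical sketch of the depth- and budget-limited BFS loop.
// graph maps a URL to the links found on that page (stand-in for Playwright).
function bfsCrawl(graph, startUrl, maxDepth, maxPages) {
  const queue = [{ url: startUrl, depth: 0 }];
  const visited = new Set([startUrl]);
  const order = [];
  while (queue.length > 0 && order.length < maxPages) {
    const current = queue.shift(); // FIFO: shallower pages come first
    order.push(current.url);
    if (current.depth < maxDepth) {
      for (const link of graph[current.url] || []) {
        if (!visited.has(link)) {
          visited.add(link); // mark on enqueue so a URL is queued only once
          queue.push({ url: link, depth: current.depth + 1 });
        }
      }
    }
  }
  return order;
}

const graph = {
  '/': ['/about', '/contact'],
  '/about': ['/team'],
  '/contact': [],
  '/team': ['/team/alice'],
};
console.log(bfsCrawl(graph, '/', 2, 25)); // prints [ '/', '/about', '/contact', '/team' ]
console.log(bfsCrawl(graph, '/', 2, 2));  // prints [ '/', '/about' ]
```

With `maxDepth` set to 2, `/team` is visited but its links are not followed, so `/team/alice` never enters the queue. This matches the `current.depth < config.maxDepth` check in the source, where pages at the maximum depth are loaded but not expanded.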