scraply 2.0.0 → 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

package/package.json +1 -2
package/readme.md +50 -58
package/src/config/browser.js +37 -0
package/src/config/defaults.js +21 -10
package/src/config/load.js +8 -1
package/src/core/queue.js +65 -11
package/src/core/retry.js +28 -24
package/src/crawler.js +75 -45
package/src/extract/links.js +4 -4
package/src/fetchers/browserFetcher.js +18 -12
package/src/fetchers/httpFetcher.js +40 -3
package/src/index.js +11 -1

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "scraply",
   "description": "A simple, configurable and functional content scraper",
-  "version": "2.0.0",
+  "version": "2.0.1",
   "main": "src/index.js",
   "type": "module",
   "exports": {
@@ -14,7 +14,6 @@
     "node": ">=18"
   },
   "scripts": {
-    "start": "node .",
     "dev": "node src/dev.js"
   },
   "keywords": [

package/readme.md CHANGED

@@ -34,24 +34,46 @@ This crawls `example.com`, extracts the readable text of every allowed page, and
 ## How Scraply works
 1. The crawl is seeded from `startUrls`.
 2. Each page is fetched, its links are discovered and filtered (`include` / `exclude`), and new links are queued.
-3. The page text is extracted (configurable element removal) and saved under `dataset/crawled/`.
+3. The page text is extracted (configurable element removal) and saved under `dataset/crawled/` as `{ url, content, crawledAt, hash }` (`crawledAt` is an ISO timestamp; `hash` is the SHA-256 of `content`, handy for change detection).
 4. When the queue drains, all crawled pages are routed by URL into the files defined in `output.routes` and written to `dataset/formatted/`.
+Each queue entry ends in one of three terminal states: **crawled** (saved), **skipped** (disallowed `Content-Type`), or **error** (fetch failed). The three are tracked separately so stats stay meaningful.
 ### Persistence and resuming
-The queue and crawled pages are checkpointed to disk in `dataset/`. If a run is interrupted (or rate-limited), progress is saved and the next run resumes exactly where it left off without re-crawling finished URLs. When every URL has been processed, Scraply starts a fresh crawl (set `crawl.resetOnComplete: false` to keep the finished queue instead).
+The queue and crawled pages are checkpointed to disk in `dataset/`. If a run is interrupted (or rate-limited), progress is saved and the next run resumes exactly where it left off without re-crawling finished URLs. When every URL has been processed, Scraply starts a fresh crawl (set `crawl.resetOnComplete: false` to keep the finished queue instead). To re-attempt failed URLs on the next run, set `crawl.retryErrors: true` (or call `requeueErrors()` and crawl again).
-### Concurrency and politeness
-Pages are crawled with a worker pool (`crawl.concurrency`). Requests to the same host are spaced by `crawl.delay` for politeness, while different hosts run in parallel.
+### Concurrency and limits
+Pages are crawled with a worker pool (`crawl.concurrency`). Requests to the same host are spaced by `crawl.delay` for politeness, while different hosts run in parallel. `crawl.maxDepth` bounds link depth and `crawl.maxPages` caps the total number of successfully crawled pages (counted across resumes).
 ### Rate limiting
-On HTTP `429`, Scraply either exits immediately with `rateLimit.exitCode` (default) so a scheduler can retry later, or waits (honoring `retry-after` / `x-ratelimit-reset`) and continues when `rateLimit.exitOnLimit` is `false`.
+On HTTP `429`, Scraply either exits immediately with `rateLimit.exitCode` (default) so a scheduler can retry later, or waits (honoring `retry-after` / `x-ratelimit-reset`) and retries — independently of the normal `retry` budget — when `rateLimit.exitOnLimit` is `false`.
 ## Fetchers
 `fetcher` selects the backend:
-- `'http'` (default): fast static fetching with the native `fetch`.
+- `'http'` (default): fast static fetching with the native `fetch`. Redirects are followed up to `request.maxRedirects`, and response bodies larger than `request.maxContentLength` (default 20 MB, `0` disables) are rejected before they are buffered.
 - `'browser'`: full JavaScript rendering via Puppeteer (`puppeteer-cluster`).
 - a custom object implementing the `Fetcher` interface (`{ name, fetch, init?, close? }`), so backends like Playwright or a remote CDP browser can be plugged in without changing the crawler.
+Both built-in fetchers send `request.userAgent` and any extra `request.headers` (e.g. `Authorization`, `Accept-Language`, `Cookie`) with every request.
+### Browser fetcher options
+The `browser` block applies only when `fetcher: 'browser'`. Both options are validated at config load time. See [`src/config/defaults.js`](src/config/defaults.js) for defaults.
+- **`browser.waitUntil`** — passed to Puppeteer `page.goto`. Default `'load'`. Use `'networkidle2'` for SPAs that inject links or content after the initial load (Vue/React sites). Increase `request.timeout` when using slower modes.
+- **`browser.blockResources`** — Puppeteer resource types to abort during fetch (`'image'`, `'stylesheet'`, `'font'`, `'media'`). Default `['image', 'font', 'media']`. Stylesheets are excluded by default because many SPAs need CSS before content renders. Pass `[]` to disable resource blocking entirely.
+```js
+await scraply({
+  startUrls: ['https://spa.example.com/products'],
+  fetcher: 'browser',
+  browser: {
+    waitUntil: 'networkidle2',
+    blockResources: ['image', 'font', 'media']
+  },
+  request: { timeout: 60000 }
+});
+```
 ## Programmatic API
 `createCrawler(config)` returns an instance exposing each stage, plus lifecycle hooks:
@@ -72,68 +94,38 @@ crawler.on('transform', (record) => ({ ...record, length: record.content.length
 await crawler.run();
 ```
-Instance methods: `run()`, `crawl()`, `fetch(url)`, `extract(html, url)`, `enqueue(urls, opts)`, `format(records?)`, `stop()`, `on(event, fn)`.
+Instance methods: `run()`, `crawl()`, `fetch(url)`, `extract(html, url)`, `enqueue(urls, opts)`, `format(records?)`, `requeueErrors()`, `stop()`, `on(event, fn)`.
+`format()` reads crawled pages from `dataset/crawled/` via the persisted queue. You can call it alone to re-route output after a crawl — no need to fetch pages again.
 Hooks: `response`, `extract`, `shouldEnqueue`, `transform`, `page`, `error`.
-Standalone exports for advanced use: `normalizeUrl`, `matchesPattern`, `matchesAnyPattern`, `extractText`, `discoverLinks`, `routeRecord`, `writeRecords`, `formatRecords`, `loadConfig`, `DEFAULT_CONFIG`, `resolveFetcher`, `createHttpFetcher`, `createBrowserFetcher`.
+Standalone exports for advanced use: `normalizeUrl`, `matchesPattern`, `matchesAnyPattern`, `extractText`, `discoverLinks`, `routeRecord`, `writeRecords`, `formatRecords`, `loadConfig`, `DEFAULT_CONFIG`, `resolveFetcher`, `createHttpFetcher`, `createBrowserFetcher`, `assertBrowserConfig`, `BROWSER_WAIT_UNTIL`, `BROWSER_BLOCKABLE_RESOURCES`.
 ## Configuration
-All options are optional except `startUrls`. Durations are milliseconds.
+All options are optional except `startUrls`. Pass a partial object to `scraply()` or `createCrawler()` — it is [deep-merged](src/config/load.js) over the defaults. Durations are in milliseconds.
-```js
-{
-  startUrls: ['https://crawler-test.com/'],
-  include: [],                 // URL prefixes or RegExp; defaults to startUrls
-  exclude: [/\.(zip|png|js|css|...)$/i],
-  allowedContentTypes: ['text/html'],
-  fetcher: 'http',             // 'http' | 'browser' | Fetcher instance
-  logLevel: 'info',            // 'silent' | 'error' | 'warn' | 'info' | 'debug'
-  storage: { dir: 'dataset' },
-  request: {
-    timeout: 10000,
-    maxRedirects: 5,
-    maxContentLength: 20 * 1024 * 1024,
-    userAgent: 'Mozilla/5.0 (compatible; Scraply/2.0; +https://www.npmjs.com/package/scraply)'
-  },
+**Full default values and inline comments:** [`src/config/defaults.js`](src/config/defaults.js)
-  retry: {
-    max: 1,
-    statusCodes: [408, 500, 502, 503, 504],
-    delay: 1000
-  },
+```js
+import { DEFAULT_CONFIG, loadConfig } from 'scraply';
-  rateLimit: {
-    fallbackDelay: 60000,
-    exitOnLimit: true,
-    exitCode: 10
-  },
+// Inspect or extend the defaults programmatically.
+const config = loadConfig({
+  ...DEFAULT_CONFIG,
+  startUrls: ['https://example.com']
+});
+```
-  crawl: {
-    concurrency: 5,
-    delay: 200,                // per-host spacing
-    maxDepth: Infinity,
-    resetOnComplete: true
-  },
+Top-level keys: `startUrls`, `include`, `exclude`, `allowedContentTypes`, `fetcher`, `browser`, `logLevel`, `storage`, `request`, `retry`, `rateLimit`, `crawl`, `extract`, `output`.
-  extract: {
-    removeSelectors: ['script', 'style', 'nav', 'header', 'footer', '...']
-  },
+### Output routing
+`output.routes` is a two-level map:
-  output: {
-    format: 'json',            // 'json' | 'jsonl' | 'lines'
-    exclude: [],
-    routes: {
-      'https://crawler-test.com': { '*': 'general.json' }
-    }
-  }
-}
-```
+1. **Outer keys** — URL prefix (usually `https://origin`, or `https://origin/path`) matched against the full crawled URL.
+2. **Inner keys** — pathname segments joined with `/`, **without a leading slash**, matched from the longest suffix upward. Use `'*'` as fallback within that prefix.
-### Output routing
-`output.routes` maps a URL prefix to `{ pathKey: filename, '*': fallback }`. The most specific matching prefix wins, then the most specific path key, then `'*'`. For example:
+Inner keys are case-sensitive and must match the URL pathname exactly (e.g. `Products/sports-watches`, not `/products/sports-watches`).
 ```js
 output: {
@@ -142,7 +134,7 @@ output: {
       'guide': 'guides.json',
       '*': 'docs.json'
     },
-    'https://example.com': { '*': 'main.json' }
+    'https://example.com/products/sports-watches': { '*': 'watches.json' }
   }
 }
 ```
@@ -167,4 +159,4 @@ The configuration is now camelCase and grouped, and the entry point is `src/inde
 - `DATA_FORMATTER.CATEGORISED_PATHS` -> `output.routes`
 - `DATA_FORMATTER.EXCLUDED_PATTERNS` -> `output.exclude`
-New in 2.0: `crawl.concurrency`, `crawl.maxDepth`, `crawl.resetOnComplete`, `output.format`, pluggable `fetcher`, and lifecycle hooks. Formatted output is now real JSON by default (1.x wrote `url content` text lines).
+New in 2.0: `crawl.concurrency`, `crawl.maxDepth`, `crawl.resetOnComplete`, `output.format`, `browser.waitUntil`, `browser.blockResources`, pluggable `fetcher`, and lifecycle hooks. Formatted output is now real JSON by default (1.x wrote `url content` text lines).
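
Note on the deep merge mentioned in the updated Configuration section: the merge helper itself is not part of this diff, so the following is only a minimal sketch of how such a merge could behave. `deepMerge` is a hypothetical name, and the rule that arrays and RegExps replace the default wholesale (rather than being merged) is an assumption; only `isPlainObject` is copied from `src/config/load.js`.

```javascript
// Copied from src/config/load.js as shown in this diff.
const isPlainObject = (value) =>
  value !== null && typeof value === 'object' && !Array.isArray(value) && !(value instanceof RegExp);

// Hypothetical merge (not scraply's actual code): plain objects are merged key
// by key; arrays, RegExps and primitives from `overrides` replace the default.
const deepMerge = (defaults, overrides = {}) => {
  const result = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    result[key] =
      isPlainObject(value) && isPlainObject(defaults[key]) ? deepMerge(defaults[key], value) : value;
  }
  return result;
};

// A trimmed-down stand-in for DEFAULT_CONFIG, using values visible in the diff.
const defaults = {
  fetcher: 'http',
  browser: { waitUntil: 'load', blockResources: ['image', 'font', 'media'] },
  request: { timeout: 10000, maxRedirects: 5 }
};

// A partial user config: only the keys that differ from the defaults.
const merged = deepMerge(defaults, {
  fetcher: 'browser',
  browser: { waitUntil: 'networkidle2' },
  request: { timeout: 60000 }
});

console.log(merged.browser); // waitUntil overridden, blockResources inherited
console.log(merged.request); // timeout overridden, maxRedirects inherited
```

If the real merge works along these lines, a partial `browser` block such as `{ waitUntil: 'networkidle2' }` passes `assertBrowserConfig` because the missing `blockResources` is filled in from the defaults before validation runs.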

package/src/config/browser.js ADDED

@@ -0,0 +1,37 @@
+/** @type {readonly ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']} */
+export const BROWSER_WAIT_UNTIL = Object.freeze([
+  'load',
+  'domcontentloaded',
+  'networkidle0',
+  'networkidle2'
+]);
+/** Puppeteer resource types Scraply may block to speed up browser fetches. */
+export const BROWSER_BLOCKABLE_RESOURCES = Object.freeze(['image', 'stylesheet', 'font', 'media']);
+/** Default blocked types. Stylesheets are excluded — many SPAs need CSS before content renders. */
+export const DEFAULT_BROWSER_BLOCK_RESOURCES = Object.freeze(['image', 'font', 'media']);
+/**
+ * @param {import('../index.js').BrowserConfig} browser
+ */
+export const assertBrowserConfig = (browser) => {
+  if (!BROWSER_WAIT_UNTIL.includes(browser?.waitUntil)) {
+    throw new Error(
+      `Invalid browser.waitUntil: ${String(browser?.waitUntil)}. Expected one of: ${BROWSER_WAIT_UNTIL.join(', ')}`
+    );
+  }
+  const blockResources = browser?.blockResources;
+  if (!Array.isArray(blockResources)) {
+    throw new Error('Invalid browser.blockResources: expected an array.');
+  }
+  for (const type of blockResources) {
+    if (!BROWSER_BLOCKABLE_RESOURCES.includes(type)) {
+      throw new Error(
+        `Invalid browser.blockResources entry: ${String(type)}. Expected one of: ${BROWSER_BLOCKABLE_RESOURCES.join(', ')}`
+      );
+    }
+  }
+};
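
Note on the validator above: `assertBrowserConfig` throws on the first problem it finds. As an illustration of the same checks, here is a sketch of a variant that collects every problem instead. `collectBrowserConfigErrors` is a hypothetical helper and not part of scraply; the two constants are local copies of the ones added in this file so that the block runs on its own.

```javascript
// Local copies of the constants added in src/config/browser.js.
const BROWSER_WAIT_UNTIL = Object.freeze(['load', 'domcontentloaded', 'networkidle0', 'networkidle2']);
const BROWSER_BLOCKABLE_RESOURCES = Object.freeze(['image', 'stylesheet', 'font', 'media']);

// Hypothetical helper: same checks as assertBrowserConfig, but returns a list
// of problems rather than throwing on the first one.
const collectBrowserConfigErrors = (browser) => {
  const errors = [];
  if (!BROWSER_WAIT_UNTIL.includes(browser?.waitUntil)) {
    errors.push(`waitUntil: ${String(browser?.waitUntil)}`);
  }
  const blockResources = browser?.blockResources;
  if (!Array.isArray(blockResources)) {
    errors.push('blockResources: expected an array');
    return errors;
  }
  for (const type of blockResources) {
    if (!BROWSER_BLOCKABLE_RESOURCES.includes(type)) {
      errors.push(`blockResources entry: ${String(type)}`);
    }
  }
  return errors;
};

// An empty blockResources array is valid: it disables resource blocking.
const ok = collectBrowserConfigErrors({ waitUntil: 'networkidle2', blockResources: [] });
// 'idle' is not a known waitUntil value and 'script' is not a blockable type.
const bad = collectBrowserConfigErrors({ waitUntil: 'idle', blockResources: ['image', 'script'] });

console.log(ok.length, bad);
```

Collecting errors can be friendlier when a config has several mistakes at once; the fail-fast form in the diff keeps `loadConfig` simpler.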

package/src/config/defaults.js CHANGED

@@ -1,6 +1,7 @@
+import { DEFAULT_BROWSER_BLOCK_RESOURCES } from './browser.js';
 /**
- * Default Scraply configuration. Every value here can be overridden by the
- * object passed to `createCrawler()` / `scraply()`. Durations are in milliseconds.
+ * Default Scraply configuration. Every value here can be overridden by the object passed to `createCrawler()` / `scraply()`. Durations are in milliseconds.
  *
  * @type {import('../index.js').ScraplyConfig}
  */
@@ -8,9 +9,7 @@ export const DEFAULT_CONFIG = {
   // URLs the crawl is seeded with.
   startUrls: ['https://crawler-test.com/'],
-  // Which discovered links are allowed into the queue. Each entry is either an
-  // absolute URL prefix (e.g. 'https://site.com/blog') or a RegExp. Empty means
-  // "default to startUrls".
+  // Which discovered links are allowed into the queue. Each entry is either an absolute URL prefix (e.g. 'https://site.com/blog') or a RegExp. Empty means "default to startUrls".
   include: [],
   // Links matching any of these (string prefix or RegExp) are never queued.
@@ -24,6 +23,15 @@ export const DEFAULT_CONFIG = {
   // 'http' (native fetch), 'browser' (Puppeteer) or a custom Fetcher instance.
   fetcher: 'http',
+  // Options for the built-in Puppeteer fetcher (`fetcher: 'browser'`).
+  browser: {
+    // When page.goto considers navigation finished. Use 'networkidle2' for SPAs that inject links/content after load (e.g. Vue/React apps).
+    waitUntil: 'load',
+    // Resource types to abort during fetch (speeds up crawls). Stylesheets are omitted by default because many SPAs need CSS before content renders.
+    blockResources: [...DEFAULT_BROWSER_BLOCK_RESOURCES]
+  },
   // 'silent' | 'error' | 'warn' | 'info' | 'debug'
   logLevel: 'info',
@@ -32,10 +40,11 @@ export const DEFAULT_CONFIG = {
   },
   request: {
-    timeout: 10000,
-    maxRedirects: 5,
-    maxContentLength: 20 * 1024 * 1024,
-    userAgent: 'Mozilla/5.0 (compatible; Scraply/2.0; +https://www.npmjs.com/package/scraply)'
+    timeout: 10000, // per-request budget (aborts the fetch, including body read)
+    maxRedirects: 5, // redirect hops the HTTP fetcher follows before giving up
+    maxContentLength: 20 * 1024 * 1024, // hard cap on the response body (bytes); 0 disables it
+    userAgent: 'Mozilla/5.0 (compatible; Scraply/2.0; +https://www.npmjs.com/package/scraply)',
+    headers: {} // extra request headers (auth, Accept-Language, cookies, ...) sent by every fetcher
   },
   retry: {
@@ -54,7 +63,9 @@ export const DEFAULT_CONFIG = {
     concurrency: 5,
     delay: 200, // minimum spacing between requests to the same host
     maxDepth: Infinity,
-    resetOnComplete: true
+    maxPages: Infinity, // hard cap on successfully crawled pages (counts across resumes)
+    resetOnComplete: true,
+    retryErrors: false // re-queue previously errored URLs on resume so they are retried
   },
   extract: {
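
Note on the new defaults: a partial config that exercises the three options added in this file (`request.headers`, `crawl.maxPages`, `crawl.retryErrors`) might look like the sketch below. The URL, token and numeric values are placeholders chosen for illustration; the object would be passed to `scraply()` or `createCrawler()`, which is left out here so the block stays self-contained.

```javascript
// Illustrative partial config; every value here is a placeholder.
const exampleConfig = {
  startUrls: ['https://example.com'],
  request: {
    // Extra headers are sent by every built-in fetcher.
    headers: {
      'Accept-Language': 'en',
      Authorization: 'Bearer <token>' // placeholder, not a real credential
    }
  },
  crawl: {
    maxPages: 500, // stop after 500 successfully crawled pages (counted across resumes)
    retryErrors: true // re-queue previously errored URLs on the next run
  }
};

console.log(Object.keys(exampleConfig.crawl));
```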

package/src/config/load.js CHANGED

@@ -1,5 +1,7 @@
 import path from 'node:path';
 import { DEFAULT_CONFIG } from './defaults.js';
+import { assertBrowserConfig } from './browser.js';
+import { normalizeUrl } from '../url/normalize.js';
 const isPlainObject = (value) =>
   value !== null && typeof value === 'object' && !Array.isArray(value) && !(value instanceof RegExp);
@@ -31,9 +33,14 @@ export const loadConfig = (userConfig = {}) => {
   config.storage.crawledDir = path.posix.join(dir, 'crawled');
   config.storage.formattedDir = path.posix.join(dir, 'formatted');
+  // When no include rules are given, fall back to the start URLs — normalized so
+  // they match the normalized links the crawler actually discovers (forced
+  // HTTPS, no "www.", no trailing slash).
   if (!config.include?.length) {
-    config.include = [...config.startUrls];
+    config.include = config.startUrls.map(normalizeUrl);
   }
+  assertBrowserConfig(config.browser);
   return config;
 };
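
Note on the `include` fallback: the reason for normalizing the start URLs can be seen with a small sketch. `sketchNormalize` is a hypothetical stand-in for scraply's `normalizeUrl` that covers only the three rules named in the comment above (forced HTTPS, no "www.", no trailing slash), and string patterns are assumed to match by prefix, as the `include` comment in `defaults.js` describes.

```javascript
// Hypothetical stand-in for normalizeUrl: force HTTPS, drop "www.", drop the
// trailing slash. The real function may apply further rules.
const sketchNormalize = (input) => {
  const url = new URL(input);
  url.protocol = 'https:';
  url.hostname = url.hostname.replace(/^www\./, '');
  return url.toString().replace(/\/$/, '');
};

// Assumed matching rule for string patterns: simple prefix comparison.
const matchesPrefix = (link, prefixes) => prefixes.some((prefix) => link.startsWith(prefix));

const startUrls = ['http://www.example.com/'];
const discovered = sketchNormalize('http://www.example.com/blog/'); // a link found while crawling

// 2.0.0 behaviour: include falls back to the raw start URLs.
const rawInclude = [...startUrls];
// 2.0.1 behaviour: include falls back to the normalized start URLs.
const normalizedInclude = startUrls.map(sketchNormalize);

console.log(discovered);
console.log(matchesPrefix(discovered, rawInclude), matchesPrefix(discovered, normalizedInclude));
```

Under these assumptions the raw prefix never matches a normalized link, so a crawl seeded with an `http://` or `www.` start URL would discover links but queue none of them.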

package/src/core/queue.js CHANGED

@@ -3,19 +3,21 @@ import { loadJSON, saveJSON, deletePath } from '../storage/files.js';
 /**
  * @typedef {Object} QueueEntry
  * @property {string} url
- * @property {string|null} file     - path to the saved crawled file, or null
+ * @property {string|null} file     - filename of the saved crawled record (relative to crawledDir), or null
  * @property {number|null} status   - last HTTP status
  * @property {string|null} error    - error message, or null
+ * @property {string|null} skipped  - reason the page was skipped (e.g. content-type), or null
  * @property {string|null} referrer - URL this entry was discovered on
  * @property {number} depth
  */
-const isProcessed = (entry) => entry.file !== null || entry.error !== null;
+const isProcessed = (entry) => entry.file !== null || entry.error !== null || entry.skipped !== null;
 /**
  * Owns the crawl queue: dedup, depth limiting, status tracking and durable
- * checkpointing. Persistence is debounced so a high-concurrency crawl does not
- * rewrite the queue file on every single URL.
+ * checkpointing. Status totals are tracked incrementally (O(1) reads) and
+ * persistence is debounced so a high-concurrency crawl does not rewrite the
+ * queue file on every single URL.
  */
 export class QueueManager {
   /** @param {{ config: import('../index.js').ResolvedConfig, logger: any }} deps */
@@ -32,16 +34,30 @@ export class QueueManager {
     /** @type {QueueEntry[]} */
     this._pending = [];
     this._cursor = 0;
+    this._crawled = 0;
+    this._errors = 0;
+    this._skipped = 0;
     this._dirty = false;
     this._timer = null;
     this._persistInterval = 1000;
   }
-  /** Loads any previously persisted queue and rebuilds the in-memory indexes. */
+  /** Loads any previously persisted queue and rebuilds the in-memory indexes and totals. */
   load() {
     this.entries = loadJSON(this.path, []) ?? [];
     this.index = new Set(this.entries.map((entry) => entry.url));
-    this._pending = this.entries.filter((entry) => !isProcessed(entry));
+    this._pending = [];
+    this._crawled = 0;
+    this._errors = 0;
+    this._skipped = 0;
+    for (const entry of this.entries) {
+      if (entry.file !== null) this._crawled += 1;
+      else if (entry.error !== null) this._errors += 1;
+      else if (entry.skipped !== null) this._skipped += 1;
+      else this._pending.push(entry);
+    }
     this._cursor = 0;
     return this.entries;
   }
@@ -59,7 +75,7 @@ export class QueueManager {
   add(url, { depth = 0, referrer = null } = {}) {
     if (this.index.has(url) || depth > this.maxDepth) return false;
-    const entry = { url, file: null, status: null, error: null, referrer, depth };
+    const entry = { url, file: null, status: null, error: null, skipped: null, referrer, depth };
     this.index.add(url);
     this.entries.push(entry);
     this._pending.push(entry);
@@ -76,29 +92,64 @@ export class QueueManager {
     entry.file = file;
     entry.status = status;
     entry.error = null;
+    entry.skipped = null;
+    this._crawled += 1;
     this._markDirty();
   }
   markError(entry, { error, status }) {
     entry.error = error;
     entry.status = status ?? null;
+    this._errors += 1;
+    this._markDirty();
+  }
+  markSkipped(entry, { reason, status }) {
+    entry.skipped = reason;
+    entry.status = status ?? null;
+    this._skipped += 1;
     this._markDirty();
   }
+  /**
+   * Clears the error on every failed entry and returns it to the pending set so
+   * the next crawl retries it. Persists immediately so a fresh `load()` (e.g. at
+   * the start of `crawl()`) sees the requeued entries.
+   * @returns {number} how many entries were requeued
+   */
+  requeueErrors() {
+    let count = 0;
+    for (const entry of this.entries) {
+      if (entry.error !== null) {
+        entry.error = null;
+        entry.status = null;
+        this._pending.push(entry);
+        this._errors -= 1;
+        count += 1;
+      }
+    }
+    if (count > 0) this.flush();
+    return count;
+  }
   isAllProcessed() {
-    return this.entries.length > 0 && this.entries.every(isProcessed);
+    return this.entries.length > 0 && this.pendingCount() === 0;
   }
   pendingCount() {
-    return this.entries.filter((entry) => !isProcessed(entry)).length;
+    return this.entries.length - this._crawled - this._errors - this._skipped;
   }
   crawledCount() {
-    return this.entries.filter((entry) => entry.file !== null).length;
+    return this._crawled;
   }
   errorCount() {
-    return this.entries.filter((entry) => entry.error !== null).length;
+    return this._errors;
+  }
+  skippedCount() {
+    return this._skipped;
   }
   /** Clears in-memory state and removes the persisted queue file. */
@@ -107,6 +158,9 @@ export class QueueManager {
     this.index = new Set();
     this._pending = [];
     this._cursor = 0;
+    this._crawled = 0;
+    this._errors = 0;
+    this._skipped = 0;
     this._dirty = false;
     deletePath(this.path);
   }
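
Note on the incremental totals: the counters replace four `filter()` passes over `entries` with plain arithmetic. The sketch below isolates that bookkeeping in a hypothetical `StatusTally` class (not part of scraply) to show the invariant that `pendingCount()` relies on: pending = total - crawled - errors - skipped.

```javascript
// Hypothetical, stripped-down model of the counters kept by QueueManager.
class StatusTally {
  constructor() {
    this.total = 0;
    this.crawled = 0;
    this.errors = 0;
    this.skipped = 0;
  }
  add() { this.total += 1; }
  markDone() { this.crawled += 1; }
  markError() { this.errors += 1; }
  markSkipped() { this.skipped += 1; }
  // Mirrors requeueErrors(): an errored entry becomes pending again.
  requeueError() { if (this.errors > 0) this.errors -= 1; }
  // O(1): no scan over the entries is needed.
  pending() { return this.total - this.crawled - this.errors - this.skipped; }
}

const tally = new StatusTally();
for (let i = 0; i < 4; i += 1) tally.add();
tally.markDone();
tally.markError();
tally.markSkipped();
const pendingBefore = tally.pending(); // one entry still untouched

tally.requeueError();
const pendingAfter = tally.pending(); // the errored entry is pending again

console.log(pendingBefore, pendingAfter);
```

The invariant only holds if each entry reaches a terminal state at most once between requeues. Marking the same entry twice would count it twice, which the old `filter()`-based counts did not do.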

package/src/core/retry.js CHANGED

@@ -22,46 +22,50 @@ const computeWait = (headers = {}, fallback) => {
  * Wraps a fetch operation with retry and rate-limit handling shared by every
  * fetcher backend.
  *
+ * Rate limiting (HTTP 429) is handled independently of the normal retry budget:
+ * when `rateLimit.exitOnLimit` is false the runner waits (honoring `retry-after`
+ * / `x-ratelimit-reset`) and retries until the host relents; otherwise it
+ * triggers a clean exit so a scheduler can resume the crawl later.
+ *
  * @param {{ config: import('../index.js').ResolvedConfig, logger: any, onRateLimitExit: (code: number) => void }} deps
  */
 export const createRetryRunner = ({ config, logger, onRateLimitExit }) => {
   const { retry, rateLimit } = config;
-  const shouldRetry = async (error) => {
-    const status = error?.response?.status;
-    if (status === undefined) return true; // network/transport error
-    if (status === 429) {
-      if (rateLimit.exitOnLimit) return false; // run() handles the exit
-      const wait = computeWait(error.response.headers, rateLimit.fallbackDelay);
-      logger.warn(`Rate limited. Waiting ${Math.round(wait / 1000)}s before retrying...`);
-      await delay(wait);
-      return true;
-    }
-    return retry.statusCodes.includes(status);
-  };
   const run = async (fn) => {
-    for (let attempt = 0; ; attempt++) {
+    let attempt = 0;
+    for (;;) {
       try {
         return await fn();
       } catch (error) {
-        const canRetry = attempt < retry.max && (await shouldRetry(error));
-        if (canRetry) {
-          logger.info(`Retry ${attempt + 1}/${retry.max} -> ${error.message}`);
+        const status = error?.response?.status;
+        if (status === 429) {
+          if (rateLimit.exitOnLimit) {
+            logger.warn(`Force exiting with code ${rateLimit.exitCode} (rate limited).`);
+            onRateLimitExit(rateLimit.exitCode);
+            throw error;
+          }
+          const wait = computeWait(error.response.headers, rateLimit.fallbackDelay);
+          logger.warn(`Rate limited. Waiting ${Math.round(wait / 1000)}s before retrying...`);
+          await delay(wait);
+          continue; // rate-limit waits never consume the retry budget
+        }
+        const retriable = status === undefined || retry.statusCodes.includes(status);
+        if (retriable && attempt < retry.max) {
+          attempt += 1;
+          logger.info(`Retry ${attempt}/${retry.max} -> ${error.message}`);
           if (retry.delay > 0) await delay(retry.delay);
           continue;
         }
-        if (error?.response?.status === 429) {
-          logger.warn(`Force exiting with code ${rateLimit.exitCode} (rate limited).`);
-          onRateLimitExit(rateLimit.exitCode);
-        }
         throw error;
       }
     }
   };
-  return { run, shouldRetry };
+  return { run };
 };
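
Note on the new control flow: the branching inside `run` can be restated as a pure decision function, which makes the separation between rate-limit waits and the retry budget easier to see. `decide` is a hypothetical helper written for this note; the config values are the defaults listed in the 2.0.0 README.

```javascript
// Hypothetical restatement of the branches in run(): given the failed
// attempt's HTTP status and the retries used so far, pick the next action.
const decide = ({ status, attempt }, { retry, rateLimit }) => {
  if (status === 429) {
    // Rate limiting is checked first and never looks at `attempt`.
    return rateLimit.exitOnLimit ? 'exit' : 'wait';
  }
  // An undefined status means a network/transport error.
  const retriable = status === undefined || retry.statusCodes.includes(status);
  return retriable && attempt < retry.max ? 'retry' : 'fail';
};

const config = {
  retry: { max: 1, statusCodes: [408, 500, 502, 503, 504], delay: 1000 },
  rateLimit: { fallbackDelay: 60000, exitOnLimit: true, exitCode: 10 }
};
const waitingConfig = { ...config, rateLimit: { ...config.rateLimit, exitOnLimit: false } };

const decisions = [
  decide({ status: 429, attempt: 5 }, config), // exit, regardless of attempts used
  decide({ status: 429, attempt: 5 }, waitingConfig), // wait, budget untouched
  decide({ status: undefined, attempt: 0 }, config), // transport error, budget left
  decide({ status: 503, attempt: 1 }, config), // retriable status, budget spent
  decide({ status: 404, attempt: 0 }, config) // status not in retry.statusCodes
];

console.log(decisions);
```

In 2.0.0 a 429 with `exitOnLimit: false` went through `attempt < retry.max` like any other failure, so with the default `retry.max` of 1 a second rate-limit response ended the request. In 2.0.1 the `wait` branch loops for as long as the host keeps answering 429.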

package/src/crawler.js CHANGED

@@ -1,4 +1,5 @@
 import path from 'node:path';
+import { createHash } from 'node:crypto';
 import * as cheerio from 'cheerio';
 import { loadConfig } from './config/load.js';
@@ -15,22 +16,12 @@ import { resolveFetcher } from './fetchers/index.js';
 import { formatRecords } from './output/writers.js';
 import { loadJSON, saveJSON, deletePath, deleteUntracked } from './storage/files.js';
-const getHeader = (headers, name) => {
-  if (!headers) return undefined;
-  if (headers[name] !== undefined) return headers[name];
-  const lower = name.toLowerCase();
-  for (const key of Object.keys(headers)) {
-    if (key.toLowerCase() === lower) return headers[key];
-  }
-  return undefined;
-};
 const toHtml = (data) => (typeof data === 'string' ? data : Buffer.from(data).toString('utf8'));
+const sha256 = (text) => createHash('sha256').update(text).digest('hex');
 /**
- * Creates a crawler instance. Every stage is exposed as a method so callers can
- * run the whole pipeline (`run`) or drive individual stages and add their own
- * logic via hooks.
+ * Creates a crawler instance. Every stage is exposed as a method so callers can run the whole pipeline (`run`) or drive individual stages and add their own logic via hooks.
  *
  * @param {import('./index.js').ScraplyConfig} [userConfig]
  */
@@ -41,6 +32,11 @@ export const createCrawler = (userConfig = {}) => {
   const queue = new QueueManager({ config, logger });
   const fetcher = resolveFetcher({ config, logger });
+  // Normalized once so the start URLs match discovered (normalized) links and
+  // can be looked up in O(1) during filtering.
+  const startUrls = config.startUrls.map(normalizeUrl);
+  const startUrlSet = new Set(startUrls);
   let stopped = false;
   let initialized = false;
   let datasetCounter = 0;
@@ -76,9 +72,14 @@ export const createCrawler = (userConfig = {}) => {
     queue.load();
     datasetCounter = computeDatasetCounter();
+    if (config.crawl.retryErrors) {
+      const requeued = queue.requeueErrors();
+      if (requeued > 0) logger.info(`Re-queued ${requeued} previously errored URL(s) for retry.`);
+    }
     if (queue.entries.length === 0) {
-      logger.info(`Starting fresh with ${config.startUrls.length} start URL(s).`);
-      queue.seed(config.startUrls.map(normalizeUrl));
+      logger.info(`Starting fresh with ${startUrls.length} start URL(s).`);
+      queue.seed(startUrls);
       return;
     }
@@ -88,7 +89,7 @@ export const createCrawler = (userConfig = {}) => {
         queue.reset();
         deletePath(config.storage.crawledDir);
         datasetCounter = 0;
-        queue.seed(config.startUrls.map(normalizeUrl));
+        queue.seed(startUrls);
       } else {
         logger.info('All URLs already processed (resetOnComplete is false). Nothing to do.');
       }
@@ -100,22 +101,22 @@ export const createCrawler = (userConfig = {}) => {
   // --- stage methods ---
-  /** Fetches a single URL (with retry/rate-limit policy) and returns the raw result. */
+  // Fetches a single URL (with retry/rate-limit policy) and returns the raw result.
   const fetchUrl = (url) => retryRunner.run(() => fetcher.fetch(normalizeUrl(url)));
-  /** Extracts readable text from HTML. */
+  // Extracts readable text from HTML.
   const extract = (html, url = null) => ({
     url,
     content: extractText(html, { removeSelectors: config.extract.removeSelectors })
   });
   const shouldCrawl = (url) => {
-    if (config.startUrls.some((start) => normalizeUrl(start) === url)) return true;
+    if (startUrlSet.has(url)) return true;
     if (matchesAnyPattern(url, config.exclude)) return false;
     return matchesAnyPattern(url, config.include);
   };
-  /** Filters + normalizes URLs and adds the survivors to the queue. */
+  // Filters + normalizes URLs and adds the survivors to the queue.
   const enqueue = async (urls, { depth = 0, referrer = null } = {}) => {
     const list = Array.isArray(urls) ? urls : [urls];
     let added = 0;
@@ -137,15 +138,17 @@ export const createCrawler = (userConfig = {}) => {
     return added;
   };
+  // Persists a crawled record and returns its filename (relative to crawledDir).
+  // Only the bare name is stored in the queue so datasets stay portable.
   const saveDataset = (record) => {
     datasetCounter += 1;
-    const filePath = path.posix.join(config.storage.crawledDir, `${datasetCounter}.json`);
-    saveJSON(filePath, record);
-    return filePath;
+    const file = `${datasetCounter}.json`;
+    saveJSON(path.posix.join(config.storage.crawledDir, file), record);
+    return file;
   };
   const processOne = async (entry) => {
-    if (entry.file || entry.error) return;
+    if (entry.file || entry.error || entry.skipped) return;
     processedCount += 1;
     logger.info(`- ${processedCount}/${queue.entries.length} -> ${entry.url}`);
@@ -154,9 +157,10 @@ export const createCrawler = (userConfig = {}) => {
       const result = await retryRunner.run(() => fetcher.fetch(entry.url));
       await hooks.emit('response', result, entry);
-      const contentType = getHeader(result.headers, 'content-type');
+      // Fetchers return lowercased header keys (see Fetcher interface).
+      const contentType = result.headers?.['content-type'];
       if (!contentType || !config.allowedContentTypes.some((type) => contentType.includes(type))) {
-        queue.markError(entry, { error: `Skipped content-type: ${contentType ?? 'none'}`, status: result.status });
+        queue.markSkipped(entry, { reason: `content-type: ${contentType ?? 'none'}`, status: result.status });
         return;
       }
@@ -168,25 +172,41 @@ export const createCrawler = (userConfig = {}) => {
       let content = extractText($, { removeSelectors: config.extract.removeSelectors });
       content = await hooks.reduce('extract', content, $, entry);
-      const file = saveDataset({ url: entry.url, content });
+      const record = {
+        url: entry.url,
+        content,
+        crawledAt: new Date().toISOString(),
+        hash: sha256(content)
+      };
+      const file = saveDataset(record);
       queue.markDone(entry, { file, status: result.status });
-      const record = await hooks.reduce('transform', { url: entry.url, content }, entry);
-      await hooks.emit('page', record, entry);
+      const transformed = await hooks.reduce('transform', record, entry);
+      await hooks.emit('page', transformed, entry);
     } catch (error) {
-      queue.markError(entry, { error: error.message, status: error.response?.status });
+      // A 429 only reaches here when rateLimit.exitOnLimit is true and the
+      // process is already exiting; leave the entry pending so the next run
+      // retries it instead of recording a permanent error.
+      if (error.response?.status !== 429) {
+        queue.markError(entry, { error: error.message, status: error.response?.status });
+      }
       await hooks.emit('error', error, entry);
       logger.error(`Failed to fetch ${entry.url} -> ${error.message}`);
     }
   };
   const logBanner = () => {
+    const browserLine =
+      fetcher.name === 'browser' ? `\n  - Browser waitUntil: ${config.browser.waitUntil}` : '';
     logger.info(`STARTING SCRAPLY CRAWLER...
   - Start URLs: ${config.startUrls.join(', ')}
-  - Fetcher: ${fetcher.name}
+  - Fetcher: ${fetcher.name}${browserLine}
   - Concurrency: ${config.crawl.concurrency}
   - Per-host delay: ${config.crawl.delay}ms
   - Max depth: ${config.crawl.maxDepth}
+  - Max pages: ${config.crawl.maxPages}
   - Allowed content types: ${config.allowedContentTypes.join(', ')}
   - Output format: ${config.output.format}
 `);
@@ -208,48 +228,52 @@ export const createCrawler = (userConfig = {}) => {
     process.once('SIGTERM', handler);
   };
-  /** Crawls until the queue is drained (or `stop()` is called). */
+  // Crawls until the queue is drained (or `stop()` is called).
   const crawl = async () => {
     init();
     logBanner();
     registerSignals();
     if (fetcher.init) await fetcher.init();
-    processedCount = queue.crawledCount() + queue.errorCount();
+    processedCount = queue.crawledCount() + queue.errorCount() + queue.skippedCount();
     await runPipeline({
       queue,
       concurrency: config.crawl.concurrency,
       perHostDelay: config.crawl.delay,
       processOne,
-      isStopped: () => stopped
+      isStopped: () => stopped || queue.crawledCount() >= config.crawl.maxPages
     });
     queue.flush();
+    if (config.crawl.maxPages !== Infinity && queue.crawledCount() >= config.crawl.maxPages) {
+      logger.info(`Reached maxPages limit (${config.crawl.maxPages}).`);
+    }
     logger.info(
-      `Crawling completed! ${queue.crawledCount()} of ${queue.entries.length} ` +
-        `(${queue.entries.length - queue.crawledCount()} not crawled, ${queue.errorCount()} errors)`
+      `Crawling completed! ${queue.crawledCount()} crawled, ${queue.skippedCount()} skipped, ` +
+        `${queue.errorCount()} errors, ${queue.pendingCount()} pending (of ${queue.entries.length} total).`
     );
   };
-  /** Re-reads crawled pages from disk so resumed runs include earlier sessions. */
+  // Re-reads crawled pages from disk so resumed runs include earlier sessions.
   const collectRecords = () => {
     const records = [];
     for (const entry of queue.entries) {
-      if (!entry.file || entry.error) continue;
-      const data = loadJSON(entry.file, null);
+      if (!entry.file) continue;
+      const data = loadJSON(path.posix.join(config.storage.crawledDir, entry.file), null);
       if (data) records.push({ url: entry.url, content: data.content });
     }
     return records;
   };
-  /**
-   * Routes records to their output files and writes them. Defaults to every
-   * successfully crawled page; pass an explicit array to format custom records.
-   */
+  // Routes records to their output files and writes them. Defaults to every successfully crawled page; pass an explicit array to format custom records. When reading from disk, reloads `dataset/queue.json` first so this can run without calling `crawl()` (e.g. after changing `output.routes`).
   const format = async (records = null) => {
     logger.info('Formatting data...');
+    if (records === null) queue.load();
     const collected = records ?? collectRecords();
     const groups = formatRecords(collected, {
       output: config.output,
@@ -269,7 +293,7 @@ export const createCrawler = (userConfig = {}) => {
     return groups;
   };
-  /** Full pipeline: init -> crawl -> format, with guaranteed cleanup. */
+  // Full pipeline: init -> crawl -> format, with guaranteed cleanup.
   const run = async () => {
     try {
       await crawl();
@@ -292,11 +316,17 @@ export const createCrawler = (userConfig = {}) => {
     crawl,
     format,
     run,
+    // Clears errored entries and returns them to the queue so a later crawl()
+    // retries them. Persists immediately; returns how many were requeued.
+    requeueErrors: () => {
+      if (queue.entries.length === 0) queue.load();
+      return queue.requeueErrors();
+    },
     stop: () => {
       stopped = true;
     }
   };
 };
-/** One-call convenience wrapper: create a crawler and run the full pipeline. */
+// One-call convenience wrapper: create a crawler and run the full pipeline.
 export const scraply = (userConfig = {}) => createCrawler(userConfig).run();
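
The new `isStopped` predicate in `crawl()` combines the manual `stop()` flag with the `maxPages` cap. The sketch below isolates that logic so it can be read without the rest of the crawler. `makeIsStopped` and its option names are hypothetical, not part of scraply, and it assumes `Infinity` means "no cap", as the `!== Infinity` check in the diff suggests.

```javascript
// Hypothetical helper, not part of scraply: builds the stop predicate that the
// diff passes to runPipeline as `isStopped`.
const makeIsStopped = ({ maxPages, crawledCount, isManuallyStopped }) =>
  () => isManuallyStopped() || crawledCount() >= maxPages;

// Simulated crawler state (assumption: the counter only tracks successful pages).
let crawled = 0;
let stopped = false;

const capped = makeIsStopped({
  maxPages: 2,
  crawledCount: () => crawled,
  isManuallyStopped: () => stopped
});
const uncapped = makeIsStopped({
  maxPages: Infinity, // assumed to mean "no limit"
  crawledCount: () => crawled,
  isManuallyStopped: () => stopped
});

console.log(capped()); // false: nothing crawled yet
crawled = 2;
console.log(capped()); // true: the cap has been reached
console.log(uncapped()); // false: no finite count reaches Infinity
stopped = true;
console.log(uncapped()); // true: stop() still wins without a cap
```

Because the predicate reads the queue's crawled count rather than a per-run counter, a resumed run that already holds `maxPages` crawled entries would stop right away, which matches the "counts across resumes" note in the `maxPages` typedef.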

package/src/extract/links.js CHANGED

@@ -1,11 +1,11 @@
 import { URL } from 'node:url';
-import { normalizeUrl } from '../url/normalize.js';
 const NON_NAVIGATIONAL = /^(mailto:|tel:|javascript:|data:)/i;
 /**
- * Collects unique, normalized links from anchor tags in a document. No
- * include/exclude filtering happens here; that is the crawler's job.
+ * Collects unique, absolute links from anchor tags in a document, resolving
+ * relative hrefs against `baseUrl`. Normalization and include/exclude filtering
+ * are the crawler's job (`enqueue`), so links are only resolved here.
  *
  * @param {import('cheerio').CheerioAPI} $
  * @param {string} baseUrl - used to resolve relative hrefs
@@ -19,7 +19,7 @@ export const discoverLinks = ($, baseUrl) => {
     if (!href || href.startsWith('#') || NON_NAVIGATIONAL.test(href)) return;
     try {
-      links.add(normalizeUrl(new URL(href, baseUrl).toString()));
+      links.add(new URL(href, baseUrl).href);
     } catch {
       // Ignore malformed hrefs.
     }
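
After this change `discoverLinks` only resolves hrefs to absolute URLs and leaves normalization to the crawler. A minimal sketch of that resolution step follows, working on a plain array of hrefs instead of a cheerio document. `resolveLinks` is a hypothetical stand-in, and the sample URLs are made up for illustration.

```javascript
const NON_NAVIGATIONAL = /^(mailto:|tel:|javascript:|data:)/i;

// Hypothetical stand-in for discoverLinks that takes raw href strings.
const resolveLinks = (hrefs, baseUrl) => {
  const links = new Set();
  for (const href of hrefs) {
    // Skip empty hrefs, same-page fragments and non-navigational schemes.
    if (!href || href.startsWith('#') || NON_NAVIGATIONAL.test(href)) continue;
    try {
      // Resolve relative hrefs against the page URL; no normalization here.
      links.add(new URL(href, baseUrl).href);
    } catch {
      // Ignore malformed hrefs.
    }
  }
  return [...links];
};

const found = resolveLinks(
  ['/docs', 'guide/intro', '#top', 'mailto:hi@example.com', 'https://other.test/a', '/docs'],
  'https://example.com/start/page'
);
console.log(found);
// [ 'https://example.com/docs',
//   'https://example.com/start/guide/intro',
//   'https://other.test/a' ]
```

The `Set` removes exact duplicates only. Two URLs that differ in, say, a trailing fragment stay distinct at this stage, which is why the updated doc comment defers normalization to `enqueue`.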

package/src/fetchers/browserFetcher.js CHANGED

@@ -1,18 +1,18 @@
 import { Cluster } from 'puppeteer-cluster';
-const BLOCKED_RESOURCES = new Set(['image', 'stylesheet', 'font', 'media']);
 /**
- * Puppeteer-cluster backend for JavaScript-rendered pages. `page.goto` already
- * follows redirects and returns the final response, so no manual redirect
- * handling is needed.
+ * Puppeteer-cluster backend for JavaScript-rendered pages. `page.goto` already follows redirects and returns the final response, so no manual redirect handling is needed. The `browser` config is validated once in `loadConfig`, so no re-validation is needed here.
  *
  * @param {import('./types.js').FetcherDeps} deps
  * @returns {import('./types.js').Fetcher}
  */
 export const createBrowserFetcher = ({ config, logger }) => {
-  const { request, crawl } = config;
+  const { request, crawl, browser } = config;
   const timeout = Math.max(request.timeout, 5000);
+  const { waitUntil, blockResources } = browser;
+  const blockedResources = new Set(blockResources);
   let cluster = null;
   const init = async () => {
@@ -31,13 +31,19 @@ export const createBrowserFetcher = ({ config, logger }) => {
     await cluster.task(async ({ page, data: url }) => {
       await page.setUserAgent(request.userAgent);
-      await page.setRequestInterception(true);
-      page.on('request', (req) => {
-        if (BLOCKED_RESOURCES.has(req.resourceType())) req.abort();
-        else req.continue();
-      });
+      if (Object.keys(request.headers).length > 0) {
+        await page.setExtraHTTPHeaders(request.headers);
+      }
+      if (blockedResources.size > 0) {
+        await page.setRequestInterception(true);
+        page.on('request', (req) => {
+          if (blockedResources.has(req.resourceType())) req.abort();
+          else req.continue();
+        });
+      }
-      const response = await page.goto(url, { timeout, waitUntil: 'domcontentloaded' });
+      const response = await page.goto(url, { timeout, waitUntil });
       const data = await page.content();
       return {

package/src/fetchers/httpFetcher.js CHANGED

@@ -3,9 +3,45 @@ const lowercaseHeaders = (headers) => Object.fromEntries(headers.entries());
 const httpError = (message, status, headers = {}) =>
   Object.assign(new Error(message), { response: { status, headers } });
+/**
+ * Reads a response body as text while enforcing a byte cap (`maxBytes <= 0`
+ * disables it). Rejects early on a declared `Content-Length`, and otherwise
+ * streams the body so an oversized chunked response is aborted instead of being
+ * buffered whole.
+ */
+const readBodyWithLimit = async (response, maxBytes, headers) => {
+  if (maxBytes > 0) {
+    const declared = Number(response.headers.get('content-length'));
+    if (Number.isFinite(declared) && declared > maxBytes) {
+      throw httpError(`Response too large: ${declared} bytes (max ${maxBytes})`, 413, headers);
+    }
+  }
+  if (maxBytes <= 0 || !response.body) return response.text();
+  const reader = response.body.getReader();
+  const chunks = [];
+  let total = 0;
+  for (;;) {
+    const { done, value } = await reader.read();
+    if (done) break;
+    total += value.byteLength;
+    if (total > maxBytes) {
+      await reader.cancel();
+      throw httpError(`Response exceeded max size of ${maxBytes} bytes`, 413, headers);
+    }
+    chunks.push(Buffer.from(value));
+  }
+  return Buffer.concat(chunks).toString('utf8');
+};
 /**
  * Native-fetch based backend. Follows redirects manually so the redirect budget
- * is enforced, and times out via AbortController.
+ * is enforced, times out via AbortController, and caps the body at
+ * `request.maxContentLength`.
  *
  * @param {import('./types.js').FetcherDeps} deps
  * @returns {import('./types.js').Fetcher}
@@ -21,12 +57,13 @@ export const createHttpFetcher = ({ config }) => {
       const response = await fetch(url, {
         signal: controller.signal,
         redirect: 'manual',
-        headers: { 'User-Agent': request.userAgent }
+        headers: { 'User-Agent': request.userAgent, ...request.headers }
       });
       const headers = lowercaseHeaders(response.headers);
       if (response.status >= 300 && response.status < 400) {
+        await response.body?.cancel();
         const location = response.headers.get('location');
         if (!location) throw httpError('Redirect without location header', response.status, headers);
         if (redirectsLeft <= 0) throw httpError('Max redirects reached', response.status, headers);
@@ -35,7 +72,7 @@ export const createHttpFetcher = ({ config }) => {
       if (!response.ok) throw httpError(`Invalid status code: ${response.status}`, response.status, headers);
-      const data = await response.text();
+      const data = await readBodyWithLimit(response, request.maxContentLength, headers);
       return { data, status: response.status, headers };
     } catch (error) {
       if (error.name === 'AbortError') {
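
The byte cap in `readBodyWithLimit` comes down to a running total that is checked before each chunk is kept. The sketch below is a synchronous simplification under stated assumptions: chunks come from a plain iterable rather than `reader.read()`, there is no `Content-Length` pre-check, and `collectWithLimit` is a hypothetical name. It mirrors the `maxBytes <= 0` "disabled" convention and the 413 status from the diff.

```javascript
// Hypothetical, synchronous simplification of the streaming cap.
const collectWithLimit = (chunks, maxBytes) => {
  const kept = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    // A cap of 0 or less disables the check, as in the diff.
    if (maxBytes > 0 && total > maxBytes) {
      throw Object.assign(new Error(`Response exceeded max size of ${maxBytes} bytes`), {
        response: { status: 413 }
      });
    }
    kept.push(Buffer.from(chunk));
  }
  return Buffer.concat(kept).toString('utf8');
};

const body = [Buffer.from('hello '), Buffer.from('world')]; // 11 bytes in total

console.log(collectWithLimit(body, 0)); // cap disabled, prints "hello world"
console.log(collectWithLimit(body, 11)); // exactly at the cap is still allowed

try {
  collectWithLimit(body, 10);
} catch (error) {
  console.log(error.response.status); // 413, raised on the second chunk
}
```

The check uses `>` rather than `>=`, so a body of exactly `maxBytes` bytes passes. The oversized chunk is never pushed, so memory use stays bounded by the cap plus one chunk.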

package/src/index.js CHANGED

@@ -4,8 +4,9 @@
  * @typedef {Object} RequestConfig
  * @property {number} timeout
  * @property {number} maxRedirects
- * @property {number} maxContentLength
+ * @property {number} maxContentLength - hard cap on the response body in bytes; 0 disables it
  * @property {string} userAgent
+ * @property {Record<string, string>} headers - extra request headers sent by every fetcher
  *
  * @typedef {Object} RetryConfig
  * @property {number} max
@@ -21,7 +22,13 @@
  * @property {number} concurrency
  * @property {number} delay - minimum spacing (ms) between requests to the same host
  * @property {number} maxDepth
+ * @property {number} maxPages - hard cap on successfully crawled pages (counts across resumes)
  * @property {boolean} resetOnComplete
+ * @property {boolean} retryErrors - re-queue previously errored URLs on resume
+ *
+ * @typedef {Object} BrowserConfig
+ * @property {'load'|'domcontentloaded'|'networkidle0'|'networkidle2'} waitUntil
+ * @property {Array<'image'|'stylesheet'|'font'|'media'>} blockResources
  *
  * @typedef {Object} OutputConfig
  * @property {'json'|'jsonl'|'lines'} format
@@ -34,6 +41,7 @@
  * @property {Array<string|RegExp>} [exclude]
  * @property {string[]} [allowedContentTypes]
  * @property {'http'|'browser'|import('./fetchers/types.js').Fetcher} [fetcher]
+ * @property {Partial<BrowserConfig>} [browser]
  * @property {'silent'|'error'|'warn'|'info'|'debug'} [logLevel]
  * @property {{ dir?: string }} [storage]
  * @property {Partial<RequestConfig>} [request]
@@ -44,6 +52,7 @@
  * @property {Partial<OutputConfig>} [output]
  *
  * @typedef {Required<ScraplyConfig> & {
+ *   browser: BrowserConfig,
  *   storage: { dir: string, queuePath: string, crawledDir: string, formattedDir: string }
  * }} ResolvedConfig
  */
@@ -54,6 +63,7 @@ export { createCrawler, scraply } from './crawler.js';
 // Config
 export { loadConfig } from './config/load.js';
 export { DEFAULT_CONFIG } from './config/defaults.js';
+export { assertBrowserConfig, BROWSER_WAIT_UNTIL, BROWSER_BLOCKABLE_RESOURCES } from './config/browser.js';
 // Standalone building blocks (usable without a crawler instance)
 export { normalizeUrl } from './url/normalize.js';