npm - scraply - Versions diffs - 2.0.1 → 2.0.2 - Mend

scraply 2.0.1 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

package/package.json CHANGED

@@ -1,11 +1,16 @@
 {
   "name": "scraply",
   "description": "A simple, configurable and functional content scraper",
-  "version": "2.0.1",
+  "version": "2.0.2",
   "main": "src/index.js",
+  "types": "./src/index.d.ts",
   "type": "module",
   "exports": {
-    ".": "./src/index.js"
+    ".": {
+      "types": "./src/index.d.ts",
+      "import": "./src/index.js",
+      "default": "./src/index.js"
+    }
   },
   "files": [
     "src"

package/readme.md CHANGED

@@ -32,21 +32,52 @@ await scraply({
 This crawls `example.com`, extracts the readable text of every allowed page, and writes the results to `dataset/formatted/example.json`.
 ## How Scraply works
-1. The crawl is seeded from `startUrls`.
+1. The crawl is seeded from `startUrls` (and optionally a [sitemap](#sitemap-seeding)).
 2. Each page is fetched, its links are discovered and filtered (`include` / `exclude`), and new links are queued.
-3. The page text is extracted (configurable element removal) and saved under `dataset/crawled/` as `{ url, content, crawledAt, hash }` (`crawledAt` is an ISO timestamp; `hash` is the SHA-256 of `content`, handy for change detection).
+3. The body is processed [based on its `Content-Type`](#content-type-aware-extraction) and saved under `dataset/crawled/` as `{ url, content, crawledAt, hash }` (`crawledAt` is an ISO timestamp; `hash` is the SHA-256 of `content`, handy for change detection). JSON responses additionally carry the parsed value on `data`.
 4. When the queue drains, all crawled pages are routed by URL into the files defined in `output.routes` and written to `dataset/formatted/`.
 Each queue entry ends in one of three terminal states: **crawled** (saved), **skipped** (disallowed `Content-Type`), or **error** (fetch failed). The three are tracked separately so stats stay meaningful.
+### Content-type-aware extraction
+The body is handled according to its `Content-Type`:
+- **HTML** — links are discovered, then readable text is extracted. Use `extract.root` to allow-list the container(s) to read from (e.g. `'main'`), and `extract.removeSelectors` to strip noise. When `root` matches nothing it falls back to `extract.rootFallback` (default `<body>`).
+- **JSON** — parsed and stored as pretty-printed `content`, with the parsed value also exposed on `record.data` (set `extract.json: false` to keep the raw text instead). The `extract` hook still runs (with `$` as `null`).
+- **Other text** — stored as-is.
+Only responses whose `Content-Type` matches `allowedContentTypes` are processed; everything else is skipped (and fires the `skip` hook).
+```js
+await scraply({
+  startUrls: ['https://docs.example.com'],
+  allowedContentTypes: ['text/html', 'application/json'],
+  extract: { root: 'main' }
+});
+```
 ### Persistence and resuming
-The queue and crawled pages are checkpointed to disk in `dataset/`. If a run is interrupted (or rate-limited), progress is saved and the next run resumes exactly where it left off without re-crawling finished URLs. When every URL has been processed, Scraply starts a fresh crawl (set `crawl.resetOnComplete: false` to keep the finished queue instead). To re-attempt failed URLs on the next run, set `crawl.retryErrors: true` (or call `requeueErrors()` and crawl again).
+The queue and crawled pages are checkpointed to disk in `dataset/`. If a run is interrupted (or rate-limited), progress is saved and the next run resumes exactly where it left off without re-crawling finished URLs. When every URL has been processed, Scraply starts a fresh crawl (set `crawl.resetOnComplete: false` to keep the finished queue instead).
+- Re-attempt **failed** URLs on the next run with `crawl.retryErrors: true` (or call `requeueErrors()` and crawl again).
+- Re-attempt **skipped** URLs with `crawl.retrySkipped: true` (or `requeueSkipped()`) — handy after widening `allowedContentTypes` or changing `sites`.
 ### Concurrency and limits
 Pages are crawled with a worker pool (`crawl.concurrency`). Requests to the same host are spaced by `crawl.delay` for politeness, while different hosts run in parallel. `crawl.maxDepth` bounds link depth and `crawl.maxPages` caps the total number of successfully crawled pages (counted across resumes).
 ### Rate limiting
-On HTTP `429`, Scraply either exits immediately with `rateLimit.exitCode` (default) so a scheduler can retry later, or waits (honoring `retry-after` / `x-ratelimit-reset`) and retries — independently of the normal `retry` budget — when `rateLimit.exitOnLimit` is `false`.
+On HTTP `429`, Scraply waits (honoring `retry-after` / `x-ratelimit-reset`) and retries — independently of the normal `retry` budget. This is the default (`rateLimit.exitOnLimit: false`). Set `rateLimit.exitOnLimit: true` to instead abort the crawl by throwing a `RateLimitError` (carrying `error.code = rateLimit.exitCode`); because the queue is persistent, a later run resumes where it stopped. Scraply never calls `process.exit` on your behalf — catch the error and decide what to do (e.g. `process.exit(error.code)` from a CLI).
+```js
+import { scraply, RateLimitError } from 'scraply';
+try {
+  await scraply({ startUrls: ['https://example.com'], rateLimit: { exitOnLimit: true } });
+} catch (error) {
+  if (error instanceof RateLimitError) process.exit(error.code);
+  throw error;
+}
+```
 ## Fetchers
 `fetcher` selects the backend:
@@ -88,19 +119,43 @@ crawler.on('page', (record) => console.log('crawled', record.url));
 // Veto links before they are queued.
 crawler.on('shouldEnqueue', (url) => !url.includes('/admin'));
-// Transform the stored record.
+// Transform the stored record (runs before it is persisted, so changes are saved).
 crawler.on('transform', (record) => ({ ...record, length: record.content.length }));
 await crawler.run();
 ```
-Instance methods: `run()`, `crawl()`, `fetch(url)`, `extract(html, url)`, `enqueue(urls, opts)`, `format(records?)`, `requeueErrors()`, `stop()`, `on(event, fn)`.
+Instance methods: `run()`, `crawl()`, `fetch(url)`, `extract(html, url)`, `enqueue(urls, opts)`, `format(records?)`, `requeueErrors()`, `requeueSkipped()`, `stop()`, `on(event, fn)`.
 `format()` reads crawled pages from `dataset/crawled/` via the persisted queue. You can call it alone to re-route output after a crawl — no need to fetch pages again.
-Hooks: `response`, `extract`, `shouldEnqueue`, `transform`, `page`, `error`.
+### Hooks
+Register with `crawler.on(name, fn)` (returns an unsubscribe function). Async handlers are awaited. **Reduce** hooks may return a replacement value; **emit** hooks are side-effect only.
-Standalone exports for advanced use: `normalizeUrl`, `matchesPattern`, `matchesAnyPattern`, `extractText`, `discoverLinks`, `routeRecord`, `writeRecords`, `formatRecords`, `loadConfig`, `DEFAULT_CONFIG`, `resolveFetcher`, `createHttpFetcher`, `createBrowserFetcher`, `assertBrowserConfig`, `BROWSER_WAIT_UNTIL`, `BROWSER_BLOCKABLE_RESOURCES`.
+| Hook | Type | Arguments | Notes |
+|---|---|---|---|
+| `response` | emit | `(result, entry)` | Raw `{ data, status, headers }`, before the content-type gate. |
+| `skip` | emit | `(entry, { reason, status, result })` | A response was skipped (e.g. disallowed `Content-Type`). |
+| `shouldEnqueue` | reduce | `(allow, url, referrer)` | Return `false` to veto a URL. |
+| `links` | reduce | `(links, $, entry, result)` | Add/replace discovered links before they are enqueued. `$` is `null` for non-HTML — useful to pull URLs out of a JSON API response. |
+| `extract` | reduce | `(content, $, entry, result)` | Replace the extracted content. `$` is `null` for non-HTML. `result` gives raw access to the body/headers. |
+| `transform` | reduce | `(record, entry, result)` | Replace the record **before** it is saved and formatted. |
+| `page` | emit | `(record, entry, result)` | Fires after the record is persisted. |
+| `error` | emit | `(error, entry)` | A fetch/process failed. |
+Standalone exports for advanced use: `runCrawlers`, `RateLimitError`, `normalizeUrl`, `matchesPattern`, `matchesAnyPattern`, `extractText`, `discoverLinks`, `classifyContentType`, `parseJson`, `parseSitemap`, `routeRecord`, `writeRecords`, `formatRecords`, `loadConfig`, `DEFAULT_CONFIG`, `resolveFetcher`, `createHttpFetcher`, `createBrowserFetcher`, `assertBrowserConfig`, `BROWSER_WAIT_UNTIL`, `BROWSER_BLOCKABLE_RESOURCES`.
+### Running multiple crawlers
+Scraply never calls `process.exit`, so several crawlers can share one process. `runCrawlers` accepts config objects or crawler instances and runs them sequentially (or `concurrency` at a time):
+```js
+import { runCrawlers } from 'scraply';
+await runCrawlers([mainConfig, academyConfig]);            // sequential
+await runCrawlers([a, b, c], { concurrency: 2 });          // two at a time
+```
+By default each crawler installs a SIGINT/SIGTERM handler for a graceful stop (a second signal forces quit). Set `signals: false` in a config when embedding Scraply so it never touches process signals.
 ## Configuration
 All options are optional except `startUrls`. Pass a partial object to `scraply()` or `createCrawler()` — it is [deep-merged](src/config/load.js) over the defaults. Durations are in milliseconds.
@@ -117,7 +172,51 @@ const config = loadConfig({
 });
 ```
-Top-level keys: `startUrls`, `include`, `exclude`, `allowedContentTypes`, `fetcher`, `browser`, `logLevel`, `storage`, `request`, `retry`, `rateLimit`, `crawl`, `extract`, `output`.
+Top-level keys: `startUrls`, `include`, `exclude`, `allowedContentTypes`, `sites`, `fetcher`, `browser`, `logLevel`, `signals`, `storage`, `request`, `retry`, `rateLimit`, `crawl`, `extract`, `output`.
+### Extending list options
+List fields — `include`, `exclude`, `allowedContentTypes`, `extract.removeSelectors`, `output.exclude` — accept either an array (which **replaces** the default) or a directive object that **combines** with Scraply's defaults:
+```js
+extract: {
+  // Keep all of Scraply's default removeSelectors AND add your own.
+  removeSelectors: { extend: ['.cookie-banner', '#promo'] }
+}
+// Also supported: { prepend: [...] } and { replace: [...] }.
+```
+### Per-site overrides
+`sites` lets one crawl apply different rules per origin or path. Each entry has a `match` (URL prefix / RegExp, or an array of them) plus the fields to override: **`allowedContentTypes`** and **`extract`** (`root`, `rootFallback`, `json`, `removeSelectors`). The most specific match wins and is merged over the top-level config — so a single crawler can handle several origins with one queue. A site's `extract.removeSelectors` accepts the same `{ extend }` / `{ replace }` directives, resolved against the top-level list.
+```js
+await scraply({
+  startUrls: ['https://example.com', 'https://docs.example.com'],
+  include: ['https://example.com', 'https://docs.example.com'],
+  output: {
+    routes: {
+      'https://example.com': { '*': 'site.json' },
+      'https://docs.example.com': { '*': 'docs.json' }   // routes are origin-aware in one config
+    }
+  },
+  sites: [
+    {
+      match: 'https://docs.example.com',
+      allowedContentTypes: ['text/html', 'application/json'],
+      extract: { root: 'main', removeSelectors: { extend: ['.sidebar'] } }
+    }
+  ]
+});
+```
+**Scope:** `sites` overrides only `allowedContentTypes` and `extract`. `request`, `retry`, `crawl`, and `fetcher` are per-instance (one fetcher/pool per crawler), and `storage.dir` is shared (one queue + one `dataset/`). To keep **separate datasets** per origin, run one crawler each via [`runCrawlers`](#running-multiple-crawlers) with different `storage.dir`s.
+### Sitemap seeding
+Set `crawl.sitemap: true` to seed the crawl from `<origin>/sitemap.xml` for each start URL, or pass an explicit array of sitemap URLs. Sitemap indexes are followed automatically, and discovered URLs still pass through `include` / `exclude`.
+```js
+await scraply({ startUrls: ['https://example.com'], crawl: { sitemap: true } });
+await scraply({ startUrls: ['https://example.com'], crawl: { sitemap: ['https://example.com/sitemap_index.xml'] } });
+```
 ### Output routing
 `output.routes` is a two-level map:
@@ -139,8 +238,11 @@ output: {
 }
 ```
+## TypeScript
+Scraply ships type declarations (`src/index.d.ts`), so configuration, hooks, and the crawler instance are fully typed in both TypeScript and JS (via editor IntelliSense) — no `@types` package needed.
 ## GitHub Actions
-Because crawls are persistent and exit cleanly on rate limits, Scraply works well on a schedule. Commit the `dataset/` directory between runs, and each scheduled run continues the crawl.
+Because crawls are persistent, Scraply works well on a schedule. Commit the `dataset/` directory between runs, and each scheduled run continues the crawl. To stop a run early on rate limits and resume later, set `rateLimit.exitOnLimit: true` and exit with the thrown `RateLimitError.code`.
 ## Migrating from 1.x
 The configuration is now camelCase and grouped, and the entry point is `src/index.js`.

package/src/config/defaults.js CHANGED

@@ -20,6 +20,11 @@ export const DEFAULT_CONFIG = {
   // Only responses whose Content-Type includes one of these are parsed.
   allowedContentTypes: ['text/html'],
+  // Per-origin/route overrides. Each entry: { match, allowedContentTypes?, extract? }.
+  // `match` is a URL prefix / RegExp (or an array of them); the most specific
+  // match wins and its fields override the top-level config for matching URLs.
+  sites: [],
   // 'http' (native fetch), 'browser' (Puppeteer) or a custom Fetcher instance.
   fetcher: 'http',
@@ -35,6 +40,11 @@ export const DEFAULT_CONFIG = {
   // 'silent' | 'error' | 'warn' | 'info' | 'debug'
   logLevel: 'info',
+  // Install SIGINT/SIGTERM handlers for a graceful stop (first signal finishes
+  // in-flight work and flushes; a second forces quit). Set false when embedding
+  // Scraply so it never touches process signals.
+  signals: true,
   storage: {
     dir: 'dataset'
   },
@@ -55,7 +65,10 @@ export const DEFAULT_CONFIG = {
   rateLimit: {
     fallbackDelay: 60000,
-    exitOnLimit: true,
+    // false: wait (honoring retry-after / x-ratelimit-reset) and retry until the
+    // host relents. true: abort the crawl with a RateLimitError carrying
+    // `exitCode` so a scheduler can resume it later (the queue is persistent).
+    exitOnLimit: false,
     exitCode: 10
   },
@@ -65,10 +78,22 @@ export const DEFAULT_CONFIG = {
     maxDepth: Infinity,
     maxPages: Infinity, // hard cap on successfully crawled pages (counts across resumes)
     resetOnComplete: true,
-    retryErrors: false // re-queue previously errored URLs on resume so they are retried
+    retryErrors: false, // re-queue previously errored URLs on resume so they are retried
+    retrySkipped: false, // re-queue previously skipped URLs on resume (e.g. after widening allowedContentTypes)
+    sitemap: false // true -> seed <origin>/sitemap.xml per start URL; or pass an array of sitemap URLs
   },
   extract: {
+    // Allow-list the container(s) to extract text from: a selector, an array of
+    // selectors, or null for the whole <body>. Falls back to `rootFallback`
+    // when the selector matches nothing.
+    root: null,
+    rootFallback: 'body',
+    // true -> JSON responses are parsed and stored as pretty-printed `content`
+    // (with the parsed value on `record.data`). false -> store the raw body text.
+    json: true,
     removeSelectors: [
       'script',
       'noscript',

package/src/config/load.js CHANGED

@@ -20,6 +20,27 @@ const deepMerge = (target, source) => {
   return merged;
 };
+/**
+ * Resolves a list field that may be a plain array (replaces the default) or a
+ * directive object: `{ replace }`, `{ extend }`/`{ append }`, `{ prepend }`.
+ * Directives are combined with the package defaults so users can add to a list
+ * (e.g. `removeSelectors`) without losing Scraply's built-ins.
+ */
+const resolveList = (value, defaults) => {
+  if (Array.isArray(value)) return value;
+  if (value && typeof value === 'object') {
+    if (Array.isArray(value.replace)) return value.replace;
+    const prepend = Array.isArray(value.prepend) ? value.prepend : [];
+    const append = Array.isArray(value.extend)
+      ? value.extend
+      : Array.isArray(value.append)
+        ? value.append
+        : [];
+    return [...prepend, ...defaults, ...append];
+  }
+  return defaults;
+};
 /**
  * Merges a user config over the defaults and derives the storage paths.
  * @param {import('../index.js').ScraplyConfig} [userConfig]
@@ -28,6 +49,34 @@ const deepMerge = (target, source) => {
 export const loadConfig = (userConfig = {}) => {
   const config = deepMerge(DEFAULT_CONFIG, userConfig);
+  // List fields accept { extend } / { prepend } / { replace } directives so a
+  // user can add to Scraply's defaults instead of replacing them wholesale.
+  config.exclude = resolveList(config.exclude, DEFAULT_CONFIG.exclude);
+  config.include = resolveList(config.include, []);
+  config.allowedContentTypes = resolveList(config.allowedContentTypes, DEFAULT_CONFIG.allowedContentTypes);
+  config.extract.removeSelectors = resolveList(config.extract.removeSelectors, DEFAULT_CONFIG.extract.removeSelectors);
+  config.output.exclude = resolveList(config.output.exclude, DEFAULT_CONFIG.output.exclude);
+  // Normalize per-site overrides: `match` becomes an array of patterns, and a
+  // site's `extract.removeSelectors` honors the same { extend } / { replace }
+  // directives — resolved against the (already-resolved) top-level list, so a
+  // site can add to the base instead of silently passing an object downstream.
+  config.sites = (config.sites ?? []).map((site) => {
+    const normalized = {
+      ...site,
+      match: Array.isArray(site.match) ? site.match : [site.match]
+    };
+    if (normalized.extract?.removeSelectors !== undefined) {
+      normalized.extract = {
+        ...normalized.extract,
+        removeSelectors: resolveList(normalized.extract.removeSelectors, config.extract.removeSelectors)
+      };
+    }
+    return normalized;
+  });
   const { dir } = config.storage;
   config.storage.queuePath = path.posix.join(dir, 'queue.json');
   config.storage.crawledDir = path.posix.join(dir, 'crawled');
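
Note on the `resolveList` directives: the sketch below copies the function from the hunk above and runs it against a shortened, made-up default list, to show how each form resolves. The two-item `defaults` array is an assumption for illustration; the real `removeSelectors` defaults are longer and only partly visible in this diff.

```javascript
// Copied from the hunk above so the example runs on its own.
const resolveList = (value, defaults) => {
  if (Array.isArray(value)) return value;
  if (value && typeof value === 'object') {
    if (Array.isArray(value.replace)) return value.replace;
    const prepend = Array.isArray(value.prepend) ? value.prepend : [];
    const append = Array.isArray(value.extend)
      ? value.extend
      : Array.isArray(value.append)
        ? value.append
        : [];
    return [...prepend, ...defaults, ...append];
  }
  return defaults;
};

// Hypothetical, shortened defaults (illustration only).
const defaults = ['script', 'noscript'];

// A plain array replaces the defaults outright.
const replaced = resolveList(['.ad'], defaults);
// `extend` keeps the defaults and appends to them.
const extended = resolveList({ extend: ['.ad'] }, defaults);
// `prepend` puts the additions in front of the defaults.
const prepended = resolveList({ prepend: ['.ad'] }, defaults);
// When both `extend` and `append` are given, only `extend` is used.
const both = resolveList({ extend: ['.a'], append: ['.b'] }, defaults);
// Anything that is neither an array nor an object falls back to the defaults.
const fallback = resolveList(undefined, defaults);

console.log({ replaced, extended, prepended, both, fallback });
```

Two details follow directly from the code as shown: `append` is accepted as an alias for `extend` but is ignored when `extend` is also present, and `include` is resolved against an empty list, so `{ extend: [...] }` on `include` behaves the same as passing a plain array.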

package/src/core/errors.js ADDED

@@ -0,0 +1,23 @@
+/**
+ * Thrown when a host rate-limits the crawl (HTTP 429) and
+ * `rateLimit.exitOnLimit` is true. Instead of killing the host process, Scraply
+ * aborts the current crawl with this error so the caller can decide what to do
+ * (e.g. exit with `error.code` from a CLI, or schedule a later resume — the
+ * persistent queue means crawling continues where it stopped).
+ */
+export class RateLimitError extends Error {
+  /**
+   * @param {string} [message]
+   * @param {{ code?: number, headers?: Record<string, string>, cause?: unknown }} [options]
+   */
+  constructor(message = 'Rate limited', { code = 10, headers = {}, cause } = {}) {
+    super(message);
+    this.name = 'RateLimitError';
+    this.code = code;
+    this.headers = headers;
+    // Mirror the shape fetchers attach so existing `error.response.status`
+    // checks keep working.
+    this.response = { status: 429, headers };
+    if (cause !== undefined) this.cause = cause;
+  }
+}
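
Note on what a caller sees: the sketch below reproduces the class from the new file (without `export`, so it runs standalone) and constructs it the way the `retry.js` hunk does on a 429. The `exitCodeFor` helper and the simulated upstream error are hypothetical and not part of the package.

```javascript
// Copied from the new file above (minus `export`) so the example is standalone.
class RateLimitError extends Error {
  constructor(message = 'Rate limited', { code = 10, headers = {}, cause } = {}) {
    super(message);
    this.name = 'RateLimitError';
    this.code = code;
    this.headers = headers;
    this.response = { status: 429, headers };
    if (cause !== undefined) this.cause = cause;
  }
}

// Hypothetical helper: choose a process exit code for a failed crawl.
const exitCodeFor = (error) => (error instanceof RateLimitError ? error.code : 1);

// Simulated fetcher error, shaped like the `error.response` the runner reads.
const upstream = Object.assign(new Error('HTTP 429'), {
  response: { status: 429, headers: { 'retry-after': '30' } }
});

// Mirrors the construction in the retry.js hunk when exitOnLimit is true.
const error = new RateLimitError('Rate limited', {
  code: 10,
  headers: upstream.response.headers,
  cause: upstream
});

console.log(error.name, error.code, error.response.status, error.headers['retry-after']);
console.log(exitCodeFor(error), exitCodeFor(new Error('boom')));
```

Because `response.status` is always set to 429, code that previously checked `error.response.status` on the raw fetcher error should keep working, while new code can branch on `instanceof RateLimitError` and read `error.code`.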

package/src/core/queue.js CHANGED

@@ -112,26 +112,44 @@ export class QueueManager {
   }
   /**
-   * Clears the error on every failed entry and returns it to the pending set so
-   * the next crawl retries it. Persists immediately so a fresh `load()` (e.g. at
-   * the start of `crawl()`) sees the requeued entries.
+   * Returns matching terminal entries to the pending set so the next crawl
+   * retries them. Persists immediately so a fresh `load()` (e.g. at the start of
+   * `crawl()`) sees the requeued entries.
+   * @param {(entry: QueueEntry) => boolean} match
    * @returns {number} how many entries were requeued
    */
-  requeueErrors() {
+  _requeue(match) {
     let count = 0;
     for (const entry of this.entries) {
-      if (entry.error !== null) {
-        entry.error = null;
-        entry.status = null;
-        this._pending.push(entry);
-        this._errors -= 1;
-        count += 1;
-      }
+      if (!match(entry)) continue;
+      if (entry.error !== null) this._errors -= 1;
+      if (entry.skipped !== null) this._skipped -= 1;
+      entry.error = null;
+      entry.skipped = null;
+      entry.status = null;
+      this._pending.push(entry);
+      count += 1;
     }
     if (count > 0) this.flush();
     return count;
   }
+  /** Re-queues every errored entry for retry. @returns {number} */
+  requeueErrors() {
+    return this._requeue((entry) => entry.error !== null);
+  }
+  /**
+   * Re-queues every skipped entry for another attempt. Useful after widening
+   * `allowedContentTypes` (or changing `sites`) so previously skipped URLs are
+   * reconsidered. @returns {number}
+   */
+  requeueSkipped() {
+    return this._requeue((entry) => entry.skipped !== null);
+  }
   isAllProcessed() {
     return this.entries.length > 0 && this.pendingCount() === 0;
   }
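
Note on the shared `_requeue` path: the sketch below exercises the new logic against a tiny in-memory stand-in. `TinyQueue`, its constructor, the `flushes` counter, and the sample entry values (including the `'content-type'` skip reason) are assumptions for illustration; only the bodies of `_requeue`, `requeueErrors`, and `requeueSkipped` are taken from the hunk above.

```javascript
// Minimal stand-in for QueueManager: only the fields `_requeue` touches.
class TinyQueue {
  constructor(entries) {
    this.entries = entries;
    this._pending = [];
    this._errors = entries.filter((e) => e.error !== null).length;
    this._skipped = entries.filter((e) => e.skipped !== null).length;
    this.flushes = 0;
  }

  // The real method persists the queue to disk; here we only count calls.
  flush() {
    this.flushes += 1;
  }

  // Body copied from the diff.
  _requeue(match) {
    let count = 0;
    for (const entry of this.entries) {
      if (!match(entry)) continue;
      if (entry.error !== null) this._errors -= 1;
      if (entry.skipped !== null) this._skipped -= 1;
      entry.error = null;
      entry.skipped = null;
      entry.status = null;
      this._pending.push(entry);
      count += 1;
    }
    if (count > 0) this.flush();
    return count;
  }

  requeueErrors() {
    return this._requeue((entry) => entry.error !== null);
  }

  requeueSkipped() {
    return this._requeue((entry) => entry.skipped !== null);
  }
}

// Hypothetical entries: one crawled, one errored, one skipped.
const queue = new TinyQueue([
  { url: 'https://example.com/a', status: 200, error: null, skipped: null },
  { url: 'https://example.com/b', status: 500, error: 'HTTP 500', skipped: null },
  { url: 'https://example.com/c.pdf', status: 200, error: null, skipped: 'content-type' }
]);

const skippedRequeued = queue.requeueSkipped(); // touches only the skipped entry
const errorsRequeued = queue.requeueErrors();   // touches only the errored entry
const nothingLeft = queue.requeueErrors();      // no matches, so no extra flush

console.log({ skippedRequeued, errorsRequeued, nothingLeft, flushes: queue.flushes });
```

One thing the sketch makes visible: the counters are adjusted with `!== null` checks, so the logic assumes every persisted entry carries explicit `error` and `skipped` fields set to `null` when unused.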

package/src/core/retry.js CHANGED

@@ -1,4 +1,5 @@
 import { delay } from '../util/delay.js';
+import { RateLimitError } from './errors.js';
 /** Derives how long to wait (ms) from rate-limit headers, falling back to a default. */
 const computeWait = (headers = {}, fallback) => {
@@ -24,12 +25,12 @@ const computeWait = (headers = {}, fallback) => {
  *
  * Rate limiting (HTTP 429) is handled independently of the normal retry budget:
  * when `rateLimit.exitOnLimit` is false the runner waits (honoring `retry-after`
- * / `x-ratelimit-reset`) and retries until the host relents; otherwise it
- * triggers a clean exit so a scheduler can resume the crawl later.
+ * / `x-ratelimit-reset`) and retries until the host relents; otherwise it throws
+ * a `RateLimitError` so the crawl aborts cleanly and can be resumed later.
  *
- * @param {{ config: import('../index.js').ResolvedConfig, logger: any, onRateLimitExit: (code: number) => void }} deps
+ * @param {{ config: import('../index.js').ResolvedConfig, logger: any }} deps
  */
-export const createRetryRunner = ({ config, logger, onRateLimitExit }) => {
+export const createRetryRunner = ({ config, logger }) => {
   const { retry, rateLimit } = config;
   const run = async (fn) => {
@@ -43,9 +44,12 @@ export const createRetryRunner = ({ config, logger, onRateLimitExit }) => {
         if (status === 429) {
           if (rateLimit.exitOnLimit) {
-            logger.warn(`Force exiting with code ${rateLimit.exitCode} (rate limited).`);
-            onRateLimitExit(rateLimit.exitCode);
-            throw error;
+            logger.warn(`Rate limited. Aborting crawl (exitOnLimit) with code ${rateLimit.exitCode}.`);
+            throw new RateLimitError('Rate limited', {
+              code: rateLimit.exitCode,
+              headers: error.response.headers,
+              cause: error
+            });
           }
           const wait = computeWait(error.response.headers, rateLimit.fallbackDelay);