@mdream/crawl 1.0.0-beta.9 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +416 -56
- package/dist/_chunks/crawl.mjs +366 -277
- package/dist/_chunks/playwright-utils.mjs +59 -0
- package/dist/cli.mjs +79 -89
- package/dist/index.d.mts +40 -2
- package/dist/index.mjs +6 -1
- package/package.json +11 -4
package/README.md
CHANGED

# @mdream/crawl

Multi-page website crawler that generates [llms.txt](https://llmstxt.org/) files. Follows internal links and converts HTML to Markdown using [mdream](../mdream).

## Setup

```bash
npm install @mdream/crawl
```

For JavaScript-heavy sites that require browser rendering, install the optional Playwright dependencies:

```bash
npm install crawlee playwright
```

## CLI Usage

### Interactive Mode

Run without arguments to start the interactive prompt-based interface:

```bash
npx @mdream/crawl
```

### Direct Mode

Pass arguments directly to skip interactive prompts:

```bash
npx @mdream/crawl -u https://docs.example.com
```

### CLI Options

| Flag | Alias | Description | Default |
|------|-------|-------------|---------|
| `--url <url>` | `-u` | Website URL to crawl (supports glob patterns) | Required |
| `--output <dir>` | `-o` | Output directory | `output` |
| `--depth <number>` | `-d` | Crawl depth (`0` for a single page, max `10`) | `3` |
| `--single-page` | | Only process the given URL(s), no crawling. Alias for `--depth 0` | |
| `--driver <type>` | | Crawler driver: `http` or `playwright` | `http` |
| `--artifacts <list>` | | Comma-separated output formats: `llms.txt`, `llms-full.txt`, `markdown` | all three |
| `--origin <url>` | | Origin URL for resolving relative paths (overrides auto-detection) | auto-detected |
| `--site-name <name>` | | Override the auto-extracted site name used in `llms.txt` | auto-extracted |
| `--description <desc>` | | Override the auto-extracted site description used in `llms.txt` | auto-extracted |
| `--max-pages <number>` | | Maximum pages to crawl | unlimited |
| `--crawl-delay <seconds>` | | Delay between requests in seconds | from `robots.txt` or none |
| `--exclude <pattern>` | | Exclude URLs matching glob patterns (repeatable) | none |
| `--skip-sitemap` | | Skip `sitemap.xml` and `robots.txt` discovery | `false` |
| `--allow-subdomains` | | Crawl across subdomains of the same root domain | `false` |
| `--verbose` | `-v` | Enable verbose logging | `false` |
| `--help` | `-h` | Show help message | |
| `--version` | | Show version number | |

### CLI Examples

```bash
# Basic crawl with specific artifacts
npx @mdream/crawl -u harlanzw.com --artifacts "llms.txt,markdown"

# Shallow crawl (depth 2) with only llms-full.txt output
npx @mdream/crawl --url https://docs.example.com --depth 2 --artifacts "llms-full.txt"

# Exclude admin and API routes
npx @mdream/crawl -u example.com --exclude "*/admin/*" --exclude "*/api/*"

# Single-page mode (no link following)
npx @mdream/crawl -u example.com/pricing --single-page

# Use Playwright for JavaScript-heavy sites
npx @mdream/crawl -u example.com --driver playwright

# Skip sitemap discovery with verbose output
npx @mdream/crawl -u example.com --skip-sitemap --verbose

# Crawl across subdomains (docs.example.com, blog.example.com, etc.)
npx @mdream/crawl -u example.com --allow-subdomains

# Override site metadata
npx @mdream/crawl -u example.com --site-name "My Company" --description "Company documentation"
```

## Glob Patterns

URLs support glob patterns for targeted crawling. When a glob pattern is provided, the crawler uses sitemap discovery to find all matching URLs.

```bash
# Crawl only the /docs/ section
npx @mdream/crawl -u "docs.example.com/docs/**"

# Crawl pages matching a prefix
npx @mdream/crawl -u "example.com/blog/2024*"
```

Patterns are matched against the URL pathname using [picomatch](https://github.com/micromatch/picomatch) syntax. A trailing single `*` (e.g. `/fieldtypes*`) automatically expands to match both the path itself and all subdirectories.

## Programmatic API

### `crawlAndGenerate(options, onProgress?)`

The main entry point for programmatic use. Returns a `Promise<CrawlResult[]>`.

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

const results = await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
})
```

### `CrawlOptions`

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `urls` | `string[]` | Required | Starting URLs for crawling |
| `outputDir` | `string` | Required | Directory to write output files |
| `driver` | `'http' \| 'playwright'` | `'http'` | Crawler driver to use |
| `maxRequestsPerCrawl` | `number` | `Number.MAX_SAFE_INTEGER` | Maximum total pages to crawl |
| `followLinks` | `boolean` | `false` | Whether to follow internal links discovered on pages |
| `maxDepth` | `number` | `1` | Maximum link-following depth. `0` enables single-page mode |
| `generateLlmsTxt` | `boolean` | `true` | Generate an `llms.txt` file |
| `generateLlmsFullTxt` | `boolean` | `false` | Generate an `llms-full.txt` file with full page content |
| `generateIndividualMd` | `boolean` | `true` | Write individual `.md` files for each page |
| `origin` | `string` | auto-detected | Origin URL for resolving relative paths in HTML |
| `siteNameOverride` | `string` | auto-extracted | Override the site name in the generated `llms.txt` |
| `descriptionOverride` | `string` | auto-extracted | Override the site description in the generated `llms.txt` |
| `globPatterns` | `ParsedUrlPattern[]` | `[]` | Pre-parsed URL glob patterns (advanced usage) |
| `exclude` | `string[]` | `[]` | Glob patterns for URLs to exclude |
| `crawlDelay` | `number` | from `robots.txt` | Delay between requests in seconds |
| `skipSitemap` | `boolean` | `false` | Skip `sitemap.xml` and `robots.txt` discovery |
| `allowSubdomains` | `boolean` | `false` | Crawl across subdomains of the same root domain (e.g. `docs.example.com` + `blog.example.com`). Output files are namespaced by hostname to avoid collisions |
| `useChrome` | `boolean` | `false` | Use system Chrome instead of Playwright's bundled browser (Playwright driver only) |
| `chunkSize` | `number` | | Chunk size passed to mdream for markdown conversion |
| `verbose` | `boolean` | `false` | Enable verbose error logging |
| `hooks` | `Partial<CrawlHooks>` | | Hook functions for the crawl pipeline (see [Hooks](#hooks)) |
| `onPage` | `(page: PageData) => Promise<void> \| void` | | **Deprecated.** Use `hooks['crawl:page']` instead. Still works for backwards compatibility |

### `CrawlResult`

```typescript
interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string // Set when generateIndividualMd is true
  timestamp: number // Unix timestamp of processing time
  success: boolean
  error?: string // Set when success is false
  metadata?: PageMetadata
  depth?: number // Link-following depth at which this page was found
}

interface PageMetadata {
  title: string
  description?: string
  keywords?: string
  author?: string
  links: string[] // Internal links discovered on the page
}
```
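
As a sketch of consuming the returned array, the documented `success`, `url`, and `error` fields are enough to split results into successes and failures (`summarize` is a hypothetical helper, not part of the package; the interface is restated locally, with `metadata` omitted for brevity):

```typescript
// Local restatement of the documented CrawlResult fields (metadata omitted).
interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string
  timestamp: number
  success: boolean
  error?: string
  depth?: number
}

// Hypothetical helper: count successes and collect failure messages.
function summarize(results: CrawlResult[]): { ok: number, failed: string[] } {
  return {
    ok: results.filter(r => r.success).length,
    failed: results
      .filter(r => !r.success)
      .map(r => `${r.url}: ${r.error ?? 'unknown error'}`),
  }
}

// In real use this array would come from `await crawlAndGenerate(...)`.
const sample: CrawlResult[] = [
  { url: 'https://example.com/', title: 'Home', content: '# Home', timestamp: 1700000000, success: true, depth: 0 },
  { url: 'https://example.com/missing', title: '', content: '', timestamp: 1700000001, success: false, error: 'HTTP 404' },
]

console.log(summarize(sample))
```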

### `PageData`

The shape passed to the `onPage` callback:

```typescript
interface PageData {
  url: string
  html: string // Raw HTML (empty string if content was already markdown)
  title: string
  metadata: PageMetadata
  origin: string
}
```

### Progress Callback

The optional second argument to `crawlAndGenerate` receives progress updates:

```typescript
await crawlAndGenerate(options, (progress) => {
  // progress.sitemap.status: 'discovering' | 'processing' | 'completed'
  // progress.sitemap.found: number of sitemap URLs found
  // progress.sitemap.processed: number of URLs after filtering

  // progress.crawling.status: 'starting' | 'processing' | 'completed'
  // progress.crawling.total: total URLs to process
  // progress.crawling.processed: pages completed so far
  // progress.crawling.failed: pages that errored
  // progress.crawling.currentUrl: URL currently being fetched
  // progress.crawling.latency: { total, min, max, count } in ms

  // progress.generation.status: 'idle' | 'generating' | 'completed'
  // progress.generation.current: description of current generation step
})
```
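
Since the `crawling` fields are plain numbers and strings, a one-line status renderer is straightforward (a sketch using only the documented fields; `renderProgress` and the local interface are hypothetical, not part of the package):

```typescript
// Local restatement of the documented progress.crawling fields.
interface CrawlingProgress {
  status: 'starting' | 'processing' | 'completed'
  total: number
  processed: number
  failed: number
  currentUrl?: string
}

// Hypothetical helper: format a progress snapshot as a single status line.
function renderProgress(c: CrawlingProgress): string {
  const pct = c.total > 0 ? Math.round((c.processed / c.total) * 100) : 0
  return `[${c.status}] ${c.processed}/${c.total} pages (${pct}%), ${c.failed} failed`
}

console.log(renderProgress({ status: 'processing', total: 40, processed: 10, failed: 1 }))
// [processing] 10/40 pages (25%), 1 failed
```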

### Examples

#### Custom page processing with `onPage`

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

const pages = []

await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
  generateIndividualMd: false,
  generateLlmsTxt: false,
  onPage: (page) => {
    pages.push({
      url: page.url,
      title: page.title,
      description: page.metadata.description,
    })
  },
})

console.log(`Discovered ${pages.length} pages`)
```

#### Glob filtering with exclusions

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com/docs/**'],
  outputDir: './docs-output',
  exclude: ['/docs/deprecated/*', '/docs/internal/*'],
  followLinks: true,
  maxDepth: 2,
})
```

#### Crawling across subdomains

```typescript
await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  allowSubdomains: true, // Will also crawl docs.example.com, blog.example.com, etc.
  followLinks: true,
  maxDepth: 2,
})
```

#### Single-page mode

Set `maxDepth: 0` to process only the provided URLs without crawling or link following:

```typescript
await crawlAndGenerate({
  urls: ['https://example.com/pricing', 'https://example.com/about'],
  outputDir: './output',
  maxDepth: 0,
})
```

## Config File

Create a `mdream.config.ts` (or `.js`, `.mjs`) file in your project root to set defaults and register hooks. It is loaded via [c12](https://github.com/unjs/c12).

```typescript
import { defineConfig } from '@mdream/crawl'

export default defineConfig({
  exclude: ['*/admin/*', '*/internal/*'],
  driver: 'http',
  maxDepth: 3,
  hooks: {
    'crawl:page': (page) => {
      // Strip branding from all page titles
      page.title = page.title.replace(/ \| My Brand$/, '')
    },
  },
})
```

CLI arguments override config file values. Array options like `exclude` are concatenated (config + CLI).
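
The precedence rule can be pictured as a small merge function (a hypothetical illustration of the stated behavior, not the package's actual loader code):

```typescript
// Hypothetical sketch: CLI values override config scalars, while array
// options (e.g. `exclude`) are concatenated config-first.
function mergeOptions(
  config: Record<string, unknown>,
  cli: Record<string, unknown>,
): Record<string, unknown> {
  const merged: Record<string, unknown> = { ...config }
  for (const [key, value] of Object.entries(cli)) {
    const existing = merged[key]
    merged[key] = Array.isArray(existing) && Array.isArray(value)
      ? [...existing, ...value] // concatenate: config + CLI
      : value // CLI wins for scalars
  }
  return merged
}

const effective = mergeOptions(
  { maxDepth: 3, exclude: ['*/admin/*'] }, // from mdream.config.ts
  { maxDepth: 2, exclude: ['*/api/*'] }, // from CLI flags
)
console.log(effective)
// { maxDepth: 2, exclude: [ '*/admin/*', '*/api/*' ] }
```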

### Config Options

| Option | Type | Description |
|--------|------|-------------|
| `exclude` | `string[]` | Glob patterns for URLs to exclude |
| `driver` | `'http' \| 'playwright'` | Crawler driver |
| `maxDepth` | `number` | Maximum crawl depth |
| `maxPages` | `number` | Maximum pages to crawl |
| `crawlDelay` | `number` | Delay between requests (seconds) |
| `skipSitemap` | `boolean` | Skip sitemap discovery |
| `allowSubdomains` | `boolean` | Crawl across subdomains |
| `verbose` | `boolean` | Enable verbose logging |
| `artifacts` | `string[]` | Output formats: `llms.txt`, `llms-full.txt`, `markdown` |
| `hooks` | `object` | Hook functions (see below) |

## Hooks

Four hooks let you intercept and transform data at each stage of the crawl pipeline. Hooks receive mutable objects; mutate them in place to transform the output.

### `crawl:url`

Called before fetching a URL. Set `ctx.skip = true` to skip it entirely (saving the network request).

```typescript
defineConfig({
  hooks: {
    'crawl:url': (ctx) => {
      // Skip large asset pages
      if (ctx.url.includes('/assets/') || ctx.url.includes('/downloads/'))
        ctx.skip = true
    },
  },
})
```

### `crawl:page`

Called after HTML-to-Markdown conversion, before storage. Mutate `page.title` or other fields. This hook replaces the `onPage` callback (which still works for backwards compatibility).

```typescript
defineConfig({
  hooks: {
    'crawl:page': (page) => {
      // page.url, page.html, page.title, page.metadata, page.origin
      page.title = page.title.replace(/ - Docs$/, '')
    },
  },
})
```

### `crawl:content`

Called before markdown is written to disk. Transform the final output content or change the file path.

```typescript
defineConfig({
  hooks: {
    'crawl:content': (ctx) => {
      // ctx.url, ctx.title, ctx.content, ctx.filePath
      ctx.content = ctx.content.replace(/CONFIDENTIAL/g, '[REDACTED]')
      ctx.filePath = ctx.filePath.replace('.md', '.mdx')
    },
  },
})
```

### `crawl:done`

Called after all pages are crawled, before `llms.txt` generation. Filter or reorder results.

```typescript
defineConfig({
  hooks: {
    'crawl:done': (ctx) => {
      // Remove short pages from the final output
      const filtered = ctx.results.filter(r => r.content.length > 100)
      ctx.results.length = 0
      ctx.results.push(...filtered)
    },
  },
})
```

### Programmatic Hooks

Hooks can also be passed directly to `crawlAndGenerate`:

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  hooks: {
    'crawl:page': (page) => {
      page.title = page.title.replace(/ \| Brand$/, '')
    },
    'crawl:done': (ctx) => {
      ctx.results.sort((a, b) => a.url.localeCompare(b.url))
    },
  },
})
```

## Crawl Drivers

### HTTP Driver (default)

Uses [`ofetch`](https://github.com/unjs/ofetch) for page fetching with up to 20 concurrent requests.

- Automatic retry (2 retries with a 500ms delay)
- 10-second request timeout
- Respects `Retry-After` headers on 429 responses (automatically adjusts the crawl delay)
- Detects `text/markdown` content types and skips HTML-to-Markdown conversion

### Playwright Driver

For sites that require a browser to render content. Requires `crawlee` and `playwright` as peer dependencies (see [Setup](#setup)).

```bash
npx @mdream/crawl -u example.com --driver playwright
```

```typescript
await crawlAndGenerate({
  urls: ['https://spa-app.example.com'],
  outputDir: './output',
  driver: 'playwright',
})
```

Waits for `networkidle` before extracting content. Automatically detects and uses system Chrome when available, falling back to Playwright's bundled browser.

## Sitemap and Robots.txt Discovery

By default, the crawler performs sitemap discovery before crawling:

1. Fetches `robots.txt` to find `Sitemap:` directives and `Crawl-delay` values
2. Loads sitemaps referenced in `robots.txt`
3. Falls back to `/sitemap.xml`
4. Tries common alternatives: `/sitemap_index.xml`, `/sitemaps.xml`, `/sitemap-index.xml`
5. Supports sitemap index files (recursively loading child sitemaps)
6. Filters discovered URLs against glob patterns and exclusion rules

The home page is always included for metadata extraction (site name and description).

Disable discovery with `--skip-sitemap` or `skipSitemap: true`.

## Output Formats

### Individual Markdown Files

One `.md` file per crawled page, written to the output directory preserving the URL path structure. For example, `https://example.com/docs/getting-started` becomes `output/docs/getting-started.md`.
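
The mapping can be sketched as a small pure function (`urlToMarkdownPath` is a hypothetical illustration, not the package's code; its handling of the root path follows the `index.md` entry shown in the llms.txt example, and edge cases such as query strings may differ in practice):

```typescript
// Hypothetical sketch of the URL-path-to-file mapping described above.
function urlToMarkdownPath(url: string): string {
  const { pathname } = new URL(url)
  if (pathname === '/')
    return 'index.md' // root page
  // Strip the leading slash and any trailing slash, then append .md
  return `${pathname.replace(/^\//, '').replace(/\/$/, '')}.md`
}

console.log(urlToMarkdownPath('https://example.com/docs/getting-started'))
// docs/getting-started.md
```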

### llms.txt

A site overview file following the [llms.txt specification](https://llmstxt.org/), listing all crawled pages with titles and links to their markdown files.

```markdown
# example.com

## Pages

- [Example Domain](index.md): https://example.com/
- [About Us](about.md): https://example.com/about
```

### llms-full.txt

Same structure as `llms.txt`, but includes the full markdown content of every page inline.