@mdream/crawl 1.0.0-beta.9 → 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,96 +1,456 @@
  # @mdream/crawl

- Multi-page website crawler that generates comprehensive llms.txt files by following internal links and processing entire websites using mdream HTML-to-Markdown conversion.
+ Multi-page website crawler that generates [llms.txt](https://llmstxt.org/) files. Follows internal links and converts HTML to Markdown using [mdream](../mdream).

- > **Note**: For single-page HTML-to-Markdown conversion, use the [`mdream`](../mdream) binary instead. `@mdream/crawl` is specifically designed for crawling entire websites with multiple pages.
-
- ## Installation
+ ## Setup

  ```bash
  npm install @mdream/crawl
  ```

- ## Usage
+ For JavaScript-heavy sites that require browser rendering, install the optional Playwright dependencies:
+
+ ```bash
+ npm install crawlee playwright
+ ```
+
+ ## CLI Usage

- Simply run the command to start the interactive multi-page website crawler:
+ ### Interactive Mode
+
+ Run without arguments to start the interactive prompt-based interface:

  ```bash
  npx @mdream/crawl
  ```

- The crawler will automatically discover and follow internal links to crawl entire websites. The interactive interface provides:
- - ✨ Beautiful prompts powered by Clack
- - 🎯 Step-by-step configuration guidance
- - ✅ Input validation and helpful hints
- - 📋 Configuration summary before crawling
- - 🎉 Clean result display with progress indicators
- - 🧹 Automatic cleanup of crawler storage
+ ### Direct Mode
+
+ Pass arguments directly to skip interactive prompts:
+
+ ```bash
+ npx @mdream/crawl -u https://docs.example.com
+ ```
+
+ ### CLI Options
+
+ | Flag | Alias | Description | Default |
+ |------|-------|-------------|---------|
+ | `--url <url>` | `-u` | Website URL to crawl (supports glob patterns) | Required |
+ | `--output <dir>` | `-o` | Output directory | `output` |
+ | `--depth <number>` | `-d` | Crawl depth (0 for single page, max 10) | `3` |
+ | `--single-page` | | Only process the given URL(s), no crawling. Alias for `--depth 0` | |
+ | `--driver <type>` | | Crawler driver: `http` or `playwright` | `http` |
+ | `--artifacts <list>` | | Comma-separated output formats: `llms.txt`, `llms-full.txt`, `markdown` | all three |
+ | `--origin <url>` | | Origin URL for resolving relative paths (overrides auto-detection) | auto-detected |
+ | `--site-name <name>` | | Override the auto-extracted site name used in llms.txt | auto-extracted |
+ | `--description <desc>` | | Override the auto-extracted site description used in llms.txt | auto-extracted |
+ | `--max-pages <number>` | | Maximum pages to crawl | unlimited |
+ | `--crawl-delay <seconds>` | | Delay between requests in seconds | from `robots.txt` or none |
+ | `--exclude <pattern>` | | Exclude URLs matching glob patterns (repeatable) | none |
+ | `--skip-sitemap` | | Skip `sitemap.xml` and `robots.txt` discovery | `false` |
+ | `--allow-subdomains` | | Crawl across subdomains of the same root domain | `false` |
+ | `--verbose` | `-v` | Enable verbose logging | `false` |
+ | `--help` | `-h` | Show help message | |
+ | `--version` | | Show version number | |
+
+ ### CLI Examples
+
+ ```bash
+ # Basic crawl with specific artifacts
+ npx @mdream/crawl -u harlanzw.com --artifacts "llms.txt,markdown"
+
+ # Shallow crawl (depth 2) with only llms-full.txt output
+ npx @mdream/crawl --url https://docs.example.com --depth 2 --artifacts "llms-full.txt"
+
+ # Exclude admin and API routes
+ npx @mdream/crawl -u example.com --exclude "*/admin/*" --exclude "*/api/*"
+
+ # Single-page mode (no link following)
+ npx @mdream/crawl -u example.com/pricing --single-page
+
+ # Use Playwright for JavaScript-heavy sites
+ npx @mdream/crawl -u example.com --driver playwright

- ## Programmatic Usage
+ # Skip sitemap discovery with verbose output
+ npx @mdream/crawl -u example.com --skip-sitemap --verbose

- You can also use @mdream/crawl programmatically in your Node.js applications:
+ # Crawl across subdomains (docs.example.com, blog.example.com, etc.)
+ npx @mdream/crawl -u example.com --allow-subdomains
+
+ # Override site metadata
+ npx @mdream/crawl -u example.com --site-name "My Company" --description "Company documentation"
+ ```
+
+ ## Glob Patterns
+
+ URLs support glob patterns for targeted crawling. When a glob pattern is provided, the crawler uses sitemap discovery to find all matching URLs.
+
+ ```bash
+ # Crawl only the /docs/ section
+ npx @mdream/crawl -u "docs.example.com/docs/**"
+
+ # Crawl pages matching a prefix
+ npx @mdream/crawl -u "example.com/blog/2024*"
+ ```
+
+ Patterns are matched against the URL pathname using [picomatch](https://github.com/micromatch/picomatch) syntax. A trailing single `*` (e.g. `/fieldtypes*`) automatically expands to match both the path itself and all subdirectories.
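The matching rules above can be sketched in a few lines. This is an illustration of the described semantics, not the package's actual implementation (which delegates to picomatch):

```typescript
// Sketch of the glob matching described above. Assumed semantics --
// the real crawler uses picomatch, so treat this as illustrative only.
function pathMatches(pattern: string, pathname: string): boolean {
  // A single trailing `*` matches the path itself plus everything
  // beneath it, e.g. `/fieldtypes*` covers both `/fieldtypes` and
  // `/fieldtypes/date`.
  if (pattern.endsWith('*') && !pattern.endsWith('**'))
    return pathname.startsWith(pattern.slice(0, -1))
  // Otherwise translate the glob: `**` spans path segments, `*` does not.
  const regex = new RegExp(`^${pattern
    .replace(/[.+^${}()|[\]\\]/g, '\\$&')
    .replace(/\*\*/g, '\u0000')
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*')}$`)
  return regex.test(pathname)
}
```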
+
+ ## Programmatic API
+
+ ### `crawlAndGenerate(options, onProgress?)`
+
+ The main entry point for programmatic use. Returns a `Promise<CrawlResult[]>`.

  ```typescript
  import { crawlAndGenerate } from '@mdream/crawl'

- // Crawl entire websites programmatically
  const results = await crawlAndGenerate({
- urls: ['https://docs.example.com'], // Starting URLs for website crawling
+ urls: ['https://docs.example.com'],
  outputDir: './output',
- maxRequestsPerCrawl: 100, // Maximum pages per website
- generateLlmsTxt: true,
- followLinks: true, // Follow internal links to crawl entire site
- maxDepth: 3, // How deep to follow links
- driver: 'http', // or 'playwright' for JS-heavy sites
- verbose: true
  })
  ```

- > **Note**: llms.txt artifact generation is handled by [`@mdream/js/llms-txt`](../js). The crawl package uses it internally when `generateLlmsTxt: true`.
+ ### `CrawlOptions`

- ## Output
+ | Option | Type | Default | Description |
+ |--------|------|---------|-------------|
+ | `urls` | `string[]` | Required | Starting URLs for crawling |
+ | `outputDir` | `string` | Required | Directory to write output files |
+ | `driver` | `'http' \| 'playwright'` | `'http'` | Crawler driver to use |
+ | `maxRequestsPerCrawl` | `number` | `Number.MAX_SAFE_INTEGER` | Maximum total pages to crawl |
+ | `followLinks` | `boolean` | `false` | Whether to follow internal links discovered on pages |
+ | `maxDepth` | `number` | `1` | Maximum link-following depth. `0` enables single-page mode |
+ | `generateLlmsTxt` | `boolean` | `true` | Generate an `llms.txt` file |
+ | `generateLlmsFullTxt` | `boolean` | `false` | Generate an `llms-full.txt` file with full page content |
+ | `generateIndividualMd` | `boolean` | `true` | Write individual `.md` files for each page |
+ | `origin` | `string` | auto-detected | Origin URL for resolving relative paths in HTML |
+ | `siteNameOverride` | `string` | auto-extracted | Override the site name in the generated `llms.txt` |
+ | `descriptionOverride` | `string` | auto-extracted | Override the site description in the generated `llms.txt` |
+ | `globPatterns` | `ParsedUrlPattern[]` | `[]` | Pre-parsed URL glob patterns (advanced usage) |
+ | `exclude` | `string[]` | `[]` | Glob patterns for URLs to exclude |
+ | `crawlDelay` | `number` | from `robots.txt` | Delay between requests in seconds |
+ | `skipSitemap` | `boolean` | `false` | Skip `sitemap.xml` and `robots.txt` discovery |
+ | `allowSubdomains` | `boolean` | `false` | Crawl across subdomains of the same root domain (e.g. `docs.example.com` + `blog.example.com`). Output files are namespaced by hostname to avoid collisions |
+ | `useChrome` | `boolean` | `false` | Use system Chrome instead of Playwright's bundled browser (Playwright driver only) |
+ | `chunkSize` | `number` | | Chunk size passed to mdream for markdown conversion |
+ | `verbose` | `boolean` | `false` | Enable verbose error logging |
+ | `hooks` | `Partial<CrawlHooks>` | | Hook functions for the crawl pipeline (see [Hooks](#hooks)) |
+ | `onPage` | `(page: PageData) => Promise<void> \| void` | | **Deprecated.** Use `hooks['crawl:page']` instead. Still works for backwards compatibility |

- The crawler generates comprehensive output from entire websites:
+ ### `CrawlResult`

- 1. **Markdown files** - One `.md` file per crawled page with clean markdown content
- 2. **llms.txt** - Comprehensive site overview file following the [llms.txt specification](https://llmstxt.org/)
+ ```typescript
+ interface CrawlResult {
+ url: string
+ title: string
+ content: string
+ filePath?: string // Set when generateIndividualMd is true
+ timestamp: number // Unix timestamp of processing time
+ success: boolean
+ error?: string // Set when success is false
+ metadata?: PageMetadata
+ depth?: number // Link-following depth at which this page was found
+ }

- ### Example llms.txt output
+ interface PageMetadata {
+ title: string
+ description?: string
+ keywords?: string
+ author?: string
+ links: string[] // Internal links discovered on the page
+ }
+ ```

- ```markdown
- # example.com
+ ### `PageData`

- ## Pages
+ The shape passed to the `onPage` callback:
+
+ ```typescript
+ interface PageData {
+ url: string
+ html: string // Raw HTML (empty string if content was already markdown)
+ title: string
+ metadata: PageMetadata
+ origin: string
+ }
+ ```
+
+ ### Progress Callback
+
+ The optional second argument to `crawlAndGenerate` receives progress updates:
+
+ ```typescript
+ await crawlAndGenerate(options, (progress) => {
+ // progress.sitemap.status: 'discovering' | 'processing' | 'completed'
+ // progress.sitemap.found: number of sitemap URLs found
+ // progress.sitemap.processed: number of URLs after filtering
+
+ // progress.crawling.status: 'starting' | 'processing' | 'completed'
+ // progress.crawling.total: total URLs to process
+ // progress.crawling.processed: pages completed so far
+ // progress.crawling.failed: pages that errored
+ // progress.crawling.currentUrl: URL currently being fetched
+ // progress.crawling.latency: { total, min, max, count } in ms
+
+ // progress.generation.status: 'idle' | 'generating' | 'completed'
+ // progress.generation.current: description of current generation step
+ })
+ ```
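The latency aggregate is a running sum rather than a mean, so a status line typically reduces it itself. A small sketch, assuming the `{ total, min, max, count }` shape shown above:

```typescript
// Derive a mean latency from the aggregate progress shape shown above.
// `total` is the summed per-request latency in ms; `count` is the
// number of completed requests.
interface LatencyStats { total: number, min: number, max: number, count: number }

function meanLatencyMs(latency: LatencyStats): number {
  // Guard against division by zero before any request has finished.
  return latency.count === 0 ? 0 : latency.total / latency.count
}
```

This pairs naturally with `progress.crawling.processed` and `progress.crawling.total` when rendering a progress bar.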
+
+ ### Examples
+
+ #### Custom page processing with `onPage`
+
+ ```typescript
+ import { crawlAndGenerate } from '@mdream/crawl'
+
+ const pages = []
+
+ await crawlAndGenerate({
+ urls: ['https://docs.example.com'],
+ outputDir: './output',
+ generateIndividualMd: false,
+ generateLlmsTxt: false,
+ onPage: (page) => {
+ pages.push({
+ url: page.url,
+ title: page.title,
+ description: page.metadata.description,
+ })
+ },
+ })
+
+ console.log(`Discovered ${pages.length} pages`)
+ ```
+
+ #### Glob filtering with exclusions
+
+ ```typescript
+ import { crawlAndGenerate } from '@mdream/crawl'
+
+ await crawlAndGenerate({
+ urls: ['https://example.com/docs/**'],
+ outputDir: './docs-output',
+ exclude: ['/docs/deprecated/*', '/docs/internal/*'],
+ followLinks: true,
+ maxDepth: 2,
+ })
+ ```
+
+ #### Crawling across subdomains
+
+ ```typescript
+ await crawlAndGenerate({
+ urls: ['https://example.com'],
+ outputDir: './output',
+ allowSubdomains: true, // Will also crawl docs.example.com, blog.example.com, etc.
+ followLinks: true,
+ maxDepth: 2,
+ })
+ ```
+
+ #### Single-page mode
+
+ Set `maxDepth: 0` to process only the provided URLs without crawling or link following:
+
+ ```typescript
+ await crawlAndGenerate({
+ urls: ['https://example.com/pricing', 'https://example.com/about'],
+ outputDir: './output',
+ maxDepth: 0,
+ })
+ ```
+
+ ## Config File
+
+ Create a `mdream.config.ts` (or `.js`, `.mjs`) in your project root to set defaults and register hooks. Loaded via [c12](https://github.com/unjs/c12).
+
+ ```typescript
+ import { defineConfig } from '@mdream/crawl'
+
+ export default defineConfig({
+ exclude: ['*/admin/*', '*/internal/*'],
+ driver: 'http',
+ maxDepth: 3,
+ hooks: {
+ 'crawl:page': (page) => {
+ // Strip branding from all page titles
+ page.title = page.title.replace(/ \| My Brand$/, '')
+ },
+ },
+ })
+ ```
+
+ CLI arguments override config file values. Array options like `exclude` are concatenated (config + CLI).
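That precedence rule can be sketched as a merge function. This is an illustration of the documented behavior (scalars override, arrays concatenate), not the CLI's actual merge code:

```typescript
// Sketch of the documented precedence: CLI scalars win over config-file
// scalars, while array options such as `exclude` are concatenated.
// Illustrative only -- not the package's actual merge implementation.
type Cfg = Record<string, unknown>

function mergeConfig(fileConfig: Cfg, cliConfig: Cfg): Cfg {
  const merged: Cfg = { ...fileConfig }
  for (const [key, value] of Object.entries(cliConfig)) {
    if (value === undefined)
      continue // unset CLI flags leave config values alone
    merged[key] = Array.isArray(value) && Array.isArray(fileConfig[key])
      ? [...(fileConfig[key] as unknown[]), ...value] // config + CLI
      : value // CLI overrides
  }
  return merged
}
```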
+
+ ### Config Options
+
+ | Option | Type | Description |
+ |--------|------|-------------|
+ | `exclude` | `string[]` | Glob patterns for URLs to exclude |
+ | `driver` | `'http' \| 'playwright'` | Crawler driver |
+ | `maxDepth` | `number` | Maximum crawl depth |
+ | `maxPages` | `number` | Maximum pages to crawl |
+ | `crawlDelay` | `number` | Delay between requests (seconds) |
+ | `skipSitemap` | `boolean` | Skip sitemap discovery |
+ | `allowSubdomains` | `boolean` | Crawl across subdomains |
+ | `verbose` | `boolean` | Enable verbose logging |
+ | `artifacts` | `string[]` | Output formats: `llms.txt`, `llms-full.txt`, `markdown` |
+ | `hooks` | `object` | Hook functions (see below) |
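The `artifacts` list appears to correspond to the three `generate*` flags documented under `CrawlOptions`. A plausible mapping, stated here as an assumption rather than the package's actual translation layer:

```typescript
// Assumed mapping from the `artifacts` config list to the programmatic
// `generate*` flags in CrawlOptions -- an illustration, not verified
// against the package source.
function artifactsToFlags(artifacts: string[]) {
  return {
    generateLlmsTxt: artifacts.includes('llms.txt'),
    generateLlmsFullTxt: artifacts.includes('llms-full.txt'),
    generateIndividualMd: artifacts.includes('markdown'),
  }
}
```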
+
+ ## Hooks
+
+ Four hooks let you intercept and transform data at each stage of the crawl pipeline. Hooks receive mutable objects. Mutate in-place to transform output.
+
+ ### `crawl:url`
+
+ Called before fetching a URL. Set `ctx.skip = true` to skip it entirely (saves the network request).
+
+ ```typescript
+ defineConfig({
+ hooks: {
+ 'crawl:url': (ctx) => {
+ // Skip large asset pages
+ if (ctx.url.includes('/assets/') || ctx.url.includes('/downloads/'))
+ ctx.skip = true
+ },
+ },
+ })
+ ```
+
+ ### `crawl:page`
+
+ Called after HTML-to-Markdown conversion, before storage. Mutate `page.title` or other fields. This replaces the `onPage` callback (which still works for backwards compatibility).
+
+ ```typescript
+ defineConfig({
+ hooks: {
+ 'crawl:page': (page) => {
+ // page.url, page.html, page.title, page.metadata, page.origin
+ page.title = page.title.replace(/ - Docs$/, '')
+ },
+ },
+ })
+ ```
+
+ ### `crawl:content`
+
+ Called before markdown is written to disk. Transform the final output content or change the file path.
+
+ ```typescript
+ defineConfig({
+ hooks: {
+ 'crawl:content': (ctx) => {
+ // ctx.url, ctx.title, ctx.content, ctx.filePath
+ ctx.content = ctx.content.replace(/CONFIDENTIAL/g, '[REDACTED]')
+ ctx.filePath = ctx.filePath.replace('.md', '.mdx')
+ },
+ },
+ })
+ ```
+
+ ### `crawl:done`
+
+ Called after all pages are crawled, before `llms.txt` generation. Filter or reorder results.
+
+ ```typescript
+ defineConfig({
+ hooks: {
+ 'crawl:done': (ctx) => {
+ // Remove short pages from the final output
+ const filtered = ctx.results.filter(r => r.content.length > 100)
+ ctx.results.length = 0
+ ctx.results.push(...filtered)
+ },
+ },
+ })
+ ```
+
+ ### Programmatic Hooks
+
+ Hooks can also be passed directly to `crawlAndGenerate`:

- - [Example Domain](https---example-com-.md): https://example.com/
- - [About Us](https---example-com-about.md): https://example.com/about
+ ```typescript
+ import { crawlAndGenerate } from '@mdream/crawl'
+
+ await crawlAndGenerate({
+ urls: ['https://example.com'],
+ outputDir: './output',
+ hooks: {
+ 'crawl:page': (page) => {
+ page.title = page.title.replace(/ \| Brand$/, '')
+ },
+ 'crawl:done': (ctx) => {
+ ctx.results.sort((a, b) => a.url.localeCompare(b.url))
+ },
+ },
+ })
  ```

- ## Features
+ ## Crawl Drivers

- - ✅ **Multi-Page Website Crawling**: Designed specifically for crawling entire websites by following internal links
- - ✅ **Purely Interactive**: No complex command-line options to remember
- - ✅ **Dual Crawler Support**: Fast HTTP crawler (default) + Playwright for JavaScript-heavy sites
- - ✅ **Smart Link Discovery**: Uses mdream's extraction plugin to find and follow internal links
- - ✅ **Rich Metadata Extraction**: Extracts titles, descriptions, keywords, and author info from all pages
- - ✅ **Comprehensive llms.txt Generation**: Creates complete site documentation files
- - ✅ **Configurable Depth Crawling**: Follow links with customizable depth limits (1-10 levels)
- - ✅ **Clean Markdown Conversion**: Powered by mdream's HTML-to-Markdown engine
- - ✅ **Performance Optimized**: HTTP crawler is 5-10x faster than browser-based crawling
- - ✅ **Beautiful Output**: Clean result display with progress indicators
- - ✅ **Automatic Cleanup**: Purges crawler storage after completion
- - ✅ **TypeScript Support**: Full type definitions with excellent IDE support
+ ### HTTP Driver (default)

- ## Use Cases
+ Uses [`ofetch`](https://github.com/unjs/ofetch) for page fetching with up to 20 concurrent requests.

- Perfect for:
- - 📚 **Documentation Sites**: Crawl entire documentation websites (GitBook, Docusaurus, etc.)
- - 🏢 **Company Websites**: Generate comprehensive site overviews for LLM context
- - 📝 **Blogs**: Process entire blog archives with proper categorization
- - 🔗 **Multi-Page Resources**: Any website where you need all pages, not just one
+ - Automatic retry (2 retries with 500ms delay)
+ - 10-second request timeout
+ - Respects `Retry-After` headers on 429 responses (automatically adjusts crawl delay)
+ - Detects `text/markdown` content types and skips HTML-to-Markdown conversion
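The 429 handling can be sketched as follows. This is assumed logic based on the behavior described above, not the driver's actual source, and it only handles the seconds form of `Retry-After` (the header may also carry an HTTP date):

```typescript
// Sketch of the described 429 handling: when a response carries a
// Retry-After header, widen the crawl delay to at least that value.
// Assumed logic -- the real driver lives inside @mdream/crawl.
function adjustCrawlDelay(
  status: number,
  retryAfterHeader: string | null,
  currentDelaySeconds: number,
): number {
  if (status !== 429 || !retryAfterHeader)
    return currentDelaySeconds
  const retryAfter = Number.parseInt(retryAfterHeader, 10)
  // Ignore unparseable values (e.g. the HTTP-date form) in this sketch.
  return Number.isNaN(retryAfter)
    ? currentDelaySeconds
    : Math.max(currentDelaySeconds, retryAfter)
}
```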

- **Not suitable for**: Single-page conversions (use `mdream` binary instead)
+ ### Playwright Driver
+
+ For sites that require a browser to render content. Requires `crawlee` and `playwright` as peer dependencies (see [Setup](#setup)).
+
+ ```bash
+ npx @mdream/crawl -u example.com --driver playwright
+ ```
+
+ ```typescript
+ await crawlAndGenerate({
+ urls: ['https://spa-app.example.com'],
+ outputDir: './output',
+ driver: 'playwright',
+ })
+ ```
+
+ Waits for `networkidle` before extracting content. Automatically detects and uses system Chrome when available, falling back to Playwright's bundled browser.
+
+ ## Sitemap and Robots.txt Discovery
+
+ By default, the crawler performs sitemap discovery before crawling:
+
+ 1. Fetches `robots.txt` to find `Sitemap:` directives and `Crawl-delay` values
+ 2. Loads sitemaps referenced in `robots.txt`
+ 3. Falls back to `/sitemap.xml`
+ 4. Tries common alternatives: `/sitemap_index.xml`, `/sitemaps.xml`, `/sitemap-index.xml`
+ 5. Supports sitemap index files (recursively loads child sitemaps)
+ 6. Filters discovered URLs against glob patterns and exclusion rules
+
+ The home page is always included for metadata extraction (site name, description).
+
+ Disable with `--skip-sitemap` or `skipSitemap: true`.
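The fallback order above amounts to an ordered, de-duplicated candidate list. An illustrative sketch (not the crawler's actual source):

```typescript
// Ordered sitemap candidates per the discovery steps above. URLs from
// robots.txt Sitemap: directives take priority; the rest are the
// well-known fallbacks listed in the README. Illustrative sketch only.
function sitemapCandidates(origin: string, fromRobots: string[]): string[] {
  const fallbacks = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps.xml',
    '/sitemap-index.xml',
  ].map(path => new URL(path, origin).href)
  // De-duplicate while preserving priority order.
  return [...new Set([...fromRobots, ...fallbacks])]
}
```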
+
+ ## Output Formats
+
+ ### Individual Markdown Files
+
+ One `.md` file per crawled page, written to the output directory preserving the URL path structure. For example, `https://example.com/docs/getting-started` becomes `output/docs/getting-started.md`.
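That mapping can be sketched as below. The root path falling back to `index.md` matches the llms.txt example in this README; other edge cases (query strings, trailing slashes) are assumptions in this sketch:

```typescript
// Sketch of the URL-to-file mapping described above. Root-path handling
// follows the llms.txt example in this README; trailing-slash and other
// edge cases are assumed, not taken from the package source.
function urlToFilePath(url: string, outputDir: string): string {
  const pathname = new URL(url).pathname.replace(/\/$/, '')
  const relative = pathname === '' ? 'index' : pathname.slice(1)
  return `${outputDir}/${relative}.md`
}
```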
+
+ ### llms.txt
+
+ A site overview file following the [llms.txt specification](https://llmstxt.org/), listing all crawled pages with titles and links to their markdown files.
+
+ ```markdown
+ # example.com
+
+ ## Pages
+
+ - [Example Domain](index.md): https://example.com/
+ - [About Us](about.md): https://example.com/about
+ ```

- ## License
+ ### llms-full.txt

- MIT
+ Same structure as `llms.txt` but includes the full markdown content of every page inline.