@mdream/crawl 1.0.0-beta.9 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +416 -56
- package/dist/_chunks/crawl.mjs +366 -277
- package/dist/_chunks/playwright-utils.mjs +59 -0
- package/dist/cli.mjs +79 -89
- package/dist/index.d.mts +40 -2
- package/dist/index.mjs +6 -1
- package/package.json +11 -4
package/README.md
CHANGED

# @mdream/crawl

Multi-page website crawler that generates [llms.txt](https://llmstxt.org/) files. Follows internal links and converts HTML to Markdown using [mdream](../mdream).

## Setup

```bash
npm install @mdream/crawl
```

For JavaScript-heavy sites that require browser rendering, install the optional Playwright dependencies:

```bash
npm install crawlee playwright
```

## CLI Usage

### Interactive Mode

Run without arguments to start the interactive prompt-based interface:

```bash
npx @mdream/crawl
```

### Direct Mode

Pass arguments directly to skip interactive prompts:

```bash
npx @mdream/crawl -u https://docs.example.com
```

### CLI Options

| Flag | Alias | Description | Default |
|------|-------|-------------|---------|
| `--url <url>` | `-u` | Website URL to crawl (supports glob patterns) | Required |
| `--output <dir>` | `-o` | Output directory | `output` |
| `--depth <number>` | `-d` | Crawl depth (`0` for a single page, max `10`) | `3` |
| `--single-page` | | Only process the given URL(s), no crawling. Alias for `--depth 0` | |
| `--driver <type>` | | Crawler driver: `http` or `playwright` | `http` |
| `--artifacts <list>` | | Comma-separated output formats: `llms.txt`, `llms-full.txt`, `markdown` | all three |
| `--origin <url>` | | Origin URL for resolving relative paths (overrides auto-detection) | auto-detected |
| `--site-name <name>` | | Override the auto-extracted site name used in `llms.txt` | auto-extracted |
| `--description <desc>` | | Override the auto-extracted site description used in `llms.txt` | auto-extracted |
| `--max-pages <number>` | | Maximum pages to crawl | unlimited |
| `--crawl-delay <seconds>` | | Delay between requests in seconds | from `robots.txt` or none |
| `--exclude <pattern>` | | Exclude URLs matching glob patterns (repeatable) | none |
| `--skip-sitemap` | | Skip `sitemap.xml` and `robots.txt` discovery | `false` |
| `--allow-subdomains` | | Crawl across subdomains of the same root domain | `false` |
| `--verbose` | `-v` | Enable verbose logging | `false` |
| `--help` | `-h` | Show help message | |
| `--version` | | Show version number | |

### CLI Examples

```bash
# Basic crawl with specific artifacts
npx @mdream/crawl -u harlanzw.com --artifacts "llms.txt,markdown"

# Shallow crawl (depth 2) with only llms-full.txt output
npx @mdream/crawl --url https://docs.example.com --depth 2 --artifacts "llms-full.txt"

# Exclude admin and API routes
npx @mdream/crawl -u example.com --exclude "*/admin/*" --exclude "*/api/*"

# Single-page mode (no link following)
npx @mdream/crawl -u example.com/pricing --single-page

# Use Playwright for JavaScript-heavy sites
npx @mdream/crawl -u example.com --driver playwright

# Skip sitemap discovery with verbose output
npx @mdream/crawl -u example.com --skip-sitemap --verbose

# Crawl across subdomains (docs.example.com, blog.example.com, etc.)
npx @mdream/crawl -u example.com --allow-subdomains

# Override site metadata
npx @mdream/crawl -u example.com --site-name "My Company" --description "Company documentation"
```

## Glob Patterns

URLs support glob patterns for targeted crawling. When a glob pattern is provided, the crawler uses sitemap discovery to find all matching URLs.

```bash
# Crawl only the /docs/ section
npx @mdream/crawl -u "docs.example.com/docs/**"

# Crawl pages matching a prefix
npx @mdream/crawl -u "example.com/blog/2024*"
```

Patterns are matched against the URL pathname using [picomatch](https://github.com/micromatch/picomatch) syntax. A trailing single `*` (e.g. `/fieldtypes*`) automatically expands to match both the path itself and all subdirectories.

## Programmatic API

### `crawlAndGenerate(options, onProgress?)`

The main entry point for programmatic use. Returns a `Promise<CrawlResult[]>`.

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

const results = await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
})
```

### `CrawlOptions`

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `urls` | `string[]` | Required | Starting URLs for crawling |
| `outputDir` | `string` | Required | Directory to write output files |
| `driver` | `'http' \| 'playwright'` | `'http'` | Crawler driver to use |
| `maxRequestsPerCrawl` | `number` | `Number.MAX_SAFE_INTEGER` | Maximum total pages to crawl |
| `followLinks` | `boolean` | `false` | Whether to follow internal links discovered on pages |
| `maxDepth` | `number` | `1` | Maximum link-following depth. `0` enables single-page mode |
| `generateLlmsTxt` | `boolean` | `true` | Generate an `llms.txt` file |
| `generateLlmsFullTxt` | `boolean` | `false` | Generate an `llms-full.txt` file with full page content |
| `generateIndividualMd` | `boolean` | `true` | Write individual `.md` files for each page |
| `origin` | `string` | auto-detected | Origin URL for resolving relative paths in HTML |
| `siteNameOverride` | `string` | auto-extracted | Override the site name in the generated `llms.txt` |
| `descriptionOverride` | `string` | auto-extracted | Override the site description in the generated `llms.txt` |
| `globPatterns` | `ParsedUrlPattern[]` | `[]` | Pre-parsed URL glob patterns (advanced usage) |
| `exclude` | `string[]` | `[]` | Glob patterns for URLs to exclude |
| `crawlDelay` | `number` | from `robots.txt` | Delay between requests in seconds |
| `skipSitemap` | `boolean` | `false` | Skip `sitemap.xml` and `robots.txt` discovery |
| `allowSubdomains` | `boolean` | `false` | Crawl across subdomains of the same root domain (e.g. `docs.example.com` + `blog.example.com`). Output files are namespaced by hostname to avoid collisions |
| `useChrome` | `boolean` | `false` | Use system Chrome instead of Playwright's bundled browser (Playwright driver only) |
| `chunkSize` | `number` | | Chunk size passed to mdream for markdown conversion |
| `verbose` | `boolean` | `false` | Enable verbose error logging |
| `hooks` | `Partial<CrawlHooks>` | | Hook functions for the crawl pipeline (see [Hooks](#hooks)) |
| `onPage` | `(page: PageData) => Promise<void> \| void` | | **Deprecated.** Use `hooks['crawl:page']` instead. Still works for backwards compatibility |

### `CrawlResult`

```typescript
interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string // Set when generateIndividualMd is true
  timestamp: number // Unix timestamp of processing time
  success: boolean
  error?: string // Set when success is false
  metadata?: PageMetadata
  depth?: number // Link-following depth at which this page was found
}

interface PageMetadata {
  title: string
  description?: string
  keywords?: string
  author?: string
  links: string[] // Internal links discovered on the page
}
```
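
As a sketch of consuming the returned array, the documented `success`, `url`, and `error` fields are enough to split results into successes and failures (`summarize` is a hypothetical helper, not part of the package; the interface is restated locally, with `metadata` omitted for brevity):

```typescript
// Local restatement of the documented CrawlResult fields (metadata omitted).
interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string
  timestamp: number
  success: boolean
  error?: string
  depth?: number
}

// Hypothetical helper: count successes and collect failure messages.
function summarize(results: CrawlResult[]): { ok: number, failed: string[] } {
  return {
    ok: results.filter(r => r.success).length,
    failed: results
      .filter(r => !r.success)
      .map(r => `${r.url}: ${r.error ?? 'unknown error'}`),
  }
}

// In real use this array would come from `await crawlAndGenerate(...)`.
const sample: CrawlResult[] = [
  { url: 'https://example.com/', title: 'Home', content: '# Home', timestamp: 1700000000, success: true, depth: 0 },
  { url: 'https://example.com/missing', title: '', content: '', timestamp: 1700000001, success: false, error: 'HTTP 404' },
]

console.log(summarize(sample))
```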

### `PageData`

The shape passed to the `onPage` callback:

```typescript
interface PageData {
  url: string
  html: string // Raw HTML (empty string if content was already markdown)
  title: string
  metadata: PageMetadata
  origin: string
}
```

### Progress Callback

The optional second argument to `crawlAndGenerate` receives progress updates:

```typescript
await crawlAndGenerate(options, (progress) => {
  // progress.sitemap.status: 'discovering' | 'processing' | 'completed'
  // progress.sitemap.found: number of sitemap URLs found
  // progress.sitemap.processed: number of URLs after filtering

  // progress.crawling.status: 'starting' | 'processing' | 'completed'
  // progress.crawling.total: total URLs to process
  // progress.crawling.processed: pages completed so far
  // progress.crawling.failed: pages that errored
  // progress.crawling.currentUrl: URL currently being fetched
  // progress.crawling.latency: { total, min, max, count } in ms

  // progress.generation.status: 'idle' | 'generating' | 'completed'
  // progress.generation.current: description of current generation step
})
```
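
Since the `crawling` fields are plain numbers and strings, a one-line status renderer is straightforward (a sketch using only the documented fields; `renderProgress` and the local interface are hypothetical, not part of the package):

```typescript
// Local restatement of the documented progress.crawling fields.
interface CrawlingProgress {
  status: 'starting' | 'processing' | 'completed'
  total: number
  processed: number
  failed: number
  currentUrl?: string
}

// Hypothetical helper: format a progress snapshot as a single status line.
function renderProgress(c: CrawlingProgress): string {
  const pct = c.total > 0 ? Math.round((c.processed / c.total) * 100) : 0
  return `[${c.status}] ${c.processed}/${c.total} pages (${pct}%), ${c.failed} failed`
}

console.log(renderProgress({ status: 'processing', total: 40, processed: 10, failed: 1 }))
// [processing] 10/40 pages (25%), 1 failed
```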

### Examples

#### Custom page processing with `onPage`

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

const pages = []

await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
  generateIndividualMd: false,
  generateLlmsTxt: false,
  onPage: (page) => {
    pages.push({
      url: page.url,
      title: page.title,
      description: page.metadata.description,
    })
  },
})

console.log(`Discovered ${pages.length} pages`)
```

#### Glob filtering with exclusions

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com/docs/**'],
  outputDir: './docs-output',
  exclude: ['/docs/deprecated/*', '/docs/internal/*'],
  followLinks: true,
  maxDepth: 2,
})
```

#### Crawling across subdomains

```typescript
await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  allowSubdomains: true, // Will also crawl docs.example.com, blog.example.com, etc.
  followLinks: true,
  maxDepth: 2,
})
```

#### Single-page mode

Set `maxDepth: 0` to process only the provided URLs without crawling or link following:

```typescript
await crawlAndGenerate({
  urls: ['https://example.com/pricing', 'https://example.com/about'],
  outputDir: './output',
  maxDepth: 0,
})
```

## Config File

Create a `mdream.config.ts` (or `.js`, `.mjs`) file in your project root to set defaults and register hooks. It is loaded via [c12](https://github.com/unjs/c12).

```typescript
import { defineConfig } from '@mdream/crawl'

export default defineConfig({
  exclude: ['*/admin/*', '*/internal/*'],
  driver: 'http',
  maxDepth: 3,
  hooks: {
    'crawl:page': (page) => {
      // Strip branding from all page titles
      page.title = page.title.replace(/ \| My Brand$/, '')
    },
  },
})
```

CLI arguments override config file values. Array options like `exclude` are concatenated (config + CLI).
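
The precedence rule can be pictured as a small merge function (a hypothetical illustration of the stated behavior, not the package's actual loader code):

```typescript
// Hypothetical sketch: CLI values override config scalars, while array
// options (e.g. `exclude`) are concatenated config-first.
function mergeOptions(
  config: Record<string, unknown>,
  cli: Record<string, unknown>,
): Record<string, unknown> {
  const merged: Record<string, unknown> = { ...config }
  for (const [key, value] of Object.entries(cli)) {
    const existing = merged[key]
    merged[key] = Array.isArray(existing) && Array.isArray(value)
      ? [...existing, ...value] // concatenate: config + CLI
      : value // CLI wins for scalars
  }
  return merged
}

const effective = mergeOptions(
  { maxDepth: 3, exclude: ['*/admin/*'] }, // from mdream.config.ts
  { maxDepth: 2, exclude: ['*/api/*'] }, // from CLI flags
)
console.log(effective)
// { maxDepth: 2, exclude: [ '*/admin/*', '*/api/*' ] }
```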

### Config Options

| Option | Type | Description |
|--------|------|-------------|
| `exclude` | `string[]` | Glob patterns for URLs to exclude |
| `driver` | `'http' \| 'playwright'` | Crawler driver |
| `maxDepth` | `number` | Maximum crawl depth |
| `maxPages` | `number` | Maximum pages to crawl |
| `crawlDelay` | `number` | Delay between requests (seconds) |
| `skipSitemap` | `boolean` | Skip sitemap discovery |
| `allowSubdomains` | `boolean` | Crawl across subdomains |
| `verbose` | `boolean` | Enable verbose logging |
| `artifacts` | `string[]` | Output formats: `llms.txt`, `llms-full.txt`, `markdown` |
| `hooks` | `object` | Hook functions (see below) |

## Hooks

Four hooks let you intercept and transform data at each stage of the crawl pipeline. Hooks receive mutable objects; mutate them in place to transform the output.

### `crawl:url`

Called before fetching a URL. Set `ctx.skip = true` to skip it entirely (saving the network request).

```typescript
defineConfig({
  hooks: {
    'crawl:url': (ctx) => {
      // Skip large asset pages
      if (ctx.url.includes('/assets/') || ctx.url.includes('/downloads/'))
        ctx.skip = true
    },
  },
})
```

### `crawl:page`

Called after HTML-to-Markdown conversion, before storage. Mutate `page.title` or other fields. This hook replaces the `onPage` callback (which still works for backwards compatibility).

```typescript
defineConfig({
  hooks: {
    'crawl:page': (page) => {
      // page.url, page.html, page.title, page.metadata, page.origin
      page.title = page.title.replace(/ - Docs$/, '')
    },
  },
})
```

### `crawl:content`

Called before markdown is written to disk. Transform the final output content or change the file path.

```typescript
defineConfig({
  hooks: {
    'crawl:content': (ctx) => {
      // ctx.url, ctx.title, ctx.content, ctx.filePath
      ctx.content = ctx.content.replace(/CONFIDENTIAL/g, '[REDACTED]')
      ctx.filePath = ctx.filePath.replace('.md', '.mdx')
    },
  },
})
```

### `crawl:done`

Called after all pages are crawled, before `llms.txt` generation. Filter or reorder results.

```typescript
defineConfig({
  hooks: {
    'crawl:done': (ctx) => {
      // Remove short pages from the final output
      const filtered = ctx.results.filter(r => r.content.length > 100)
      ctx.results.length = 0
      ctx.results.push(...filtered)
    },
  },
})
```

### Programmatic Hooks

Hooks can also be passed directly to `crawlAndGenerate`:

```typescript
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  hooks: {
    'crawl:page': (page) => {
      page.title = page.title.replace(/ \| Brand$/, '')
    },
    'crawl:done': (ctx) => {
      ctx.results.sort((a, b) => a.url.localeCompare(b.url))
    },
  },
})
```

## Crawl Drivers

### HTTP Driver (default)

Uses [`ofetch`](https://github.com/unjs/ofetch) for page fetching with up to 20 concurrent requests.

- Automatic retry (2 retries with a 500ms delay)
- 10-second request timeout
- Respects `Retry-After` headers on 429 responses (automatically adjusts the crawl delay)
- Detects `text/markdown` content types and skips HTML-to-Markdown conversion

### Playwright Driver

For sites that require a browser to render content. Requires `crawlee` and `playwright` as peer dependencies (see [Setup](#setup)).

```bash
npx @mdream/crawl -u example.com --driver playwright
```

```typescript
await crawlAndGenerate({
  urls: ['https://spa-app.example.com'],
  outputDir: './output',
  driver: 'playwright',
})
```

Waits for `networkidle` before extracting content. Automatically detects and uses system Chrome when available, falling back to Playwright's bundled browser.

## Sitemap and Robots.txt Discovery

By default, the crawler performs sitemap discovery before crawling:

1. Fetches `robots.txt` to find `Sitemap:` directives and `Crawl-delay` values
2. Loads sitemaps referenced in `robots.txt`
3. Falls back to `/sitemap.xml`
4. Tries common alternatives: `/sitemap_index.xml`, `/sitemaps.xml`, `/sitemap-index.xml`
5. Supports sitemap index files (recursively loading child sitemaps)
6. Filters discovered URLs against glob patterns and exclusion rules

The home page is always included for metadata extraction (site name and description).

Disable discovery with `--skip-sitemap` or `skipSitemap: true`.

## Output Formats

### Individual Markdown Files

One `.md` file per crawled page, written to the output directory preserving the URL path structure. For example, `https://example.com/docs/getting-started` becomes `output/docs/getting-started.md`.
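
The mapping can be sketched as a small pure function (`urlToMarkdownPath` is a hypothetical illustration, not the package's code; its handling of the root path follows the `index.md` entry shown in the llms.txt example, and edge cases such as query strings may differ in practice):

```typescript
// Hypothetical sketch of the URL-path-to-file mapping described above.
function urlToMarkdownPath(url: string): string {
  const { pathname } = new URL(url)
  if (pathname === '/')
    return 'index.md' // root page
  // Strip the leading slash and any trailing slash, then append .md
  return `${pathname.replace(/^\//, '').replace(/\/$/, '')}.md`
}

console.log(urlToMarkdownPath('https://example.com/docs/getting-started'))
// docs/getting-started.md
```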

### llms.txt

A site overview file following the [llms.txt specification](https://llmstxt.org/), listing all crawled pages with titles and links to their markdown files.

```markdown
# example.com

## Pages

- [Example Domain](index.md): https://example.com/
- [About Us](about.md): https://example.com/about
```

### llms-full.txt

Same structure as `llms.txt`, but includes the full markdown content of every page inline.