mdream 1.0.0-beta.9 → 1.0.0

package/README.md CHANGED
@@ -4,189 +4,343 @@
4
4
  [![npm downloads](https://img.shields.io/npm/dm/mdream?color=yellow)](https://npm.chart.dev/mdream)
5
5
  [![license](https://img.shields.io/github/license/harlan-zw/mdream?color=yellow)](https://github.com/harlan-zw/mdream/blob/main/LICENSE.md)
6
6
 
7
- > Ultra-performant HTML to Markdown Convertor Optimized for LLMs. Generate llms.txt artifacts using CLI, GitHub Actions, Vite Plugin and more.
8
-
9
- <img src="../../.github/logo.png" alt="mdream logo" width="200">
10
-
11
- <p align="center">
12
- <table>
13
- <tbody>
14
- <td align="center">
15
- <sub>Made possible by my <a href="https://github.com/sponsors/harlan-zw">Sponsor Program 💖</a><br> Follow me <a href="https://twitter.com/harlan_zw">@harlan_zw</a> 🐦 • Join <a href="https://discord.gg/275MBUBvgP">Discord</a> for help</sub><br>
16
- </td>
17
- </tbody>
18
- </table>
19
- </p>
7
+ ## Installation
20
8
 
21
- ## Features
9
+ ```bash
10
+ # npm
11
+ npm install mdream
22
12
 
23
- - 🧠 Optimized HTML To Markdown Conversion (~50% fewer tokens with Minimal preset)
24
- - 🔍 Generates GitHub Flavored Markdown: Frontmatter, Nested & HTML markup support.
25
- - 🚀 Fast: Convert 1.8MB of HTML to markdown in ~8ms (Rust), ~62ms (JS). Up to 7.9x speedup.
26
- - ⚡ Tiny: 10kB gzip JS core, 45kB gzip with Rust WASM engine. Zero dependencies.
27
- - ⚙️ Run anywhere: CLI, edge workers, browsers, Node, etc.
28
- - 🔌 Extensible: Declarative plugin config for both engines, hook-based plugins via `@mdream/js`.
13
+ # pnpm
14
+ pnpm add mdream
29
15
 
30
- ## What is Mdream?
16
+ # yarn
17
+ yarn add mdream
18
+ ```
31
19
 
32
- Traditional HTML to Markdown converters were not built for LLMs or humans. They tend to be slow and bloated and produce output that's poorly suited for LLMs token usage or for
33
- human readability.
20
+ For the JavaScript-only engine (hook-based plugins, splitter, pure HTML parser):
34
21
 
35
- Other LLM specific convertors focus on supporting _all_ document formats, resulting in larger bundles and lower quality Markdown output.
22
+ ```bash
23
+ pnpm add @mdream/js
24
+ ```
36
25
 
37
- Mdream produces high-quality Markdown for LLMs efficiently with no core dependencies. It includes a plugin system to customize the conversion process, allowing you to parse, extract, transform, and filter as needed.
26
+ ## Table of Contents
27
+
28
+ - [API Reference](#api-reference)
29
+ - [htmlToMarkdown()](#htmltomarkdown)
30
+ - [streamHtmlToMarkdown()](#streamhtmltomarkdown)
31
+ - [Engines](#engines)
32
+ - [Options](#options)
33
+ - [MdreamOptions (Rust engine)](#mdreamoptions-rust-engine)
34
+ - [MdreamOptions (JS engine)](#mdreamoptions-js-engine)
35
+ - [CleanOptions](#cleanoptions)
36
+ - [FrontmatterConfig](#frontmatterconfig)
37
+ - [TagOverride](#tagoverride)
38
+ - [FilterOptions](#filteroptions)
39
+ - [Presets](#presets)
40
+ - [Minimal Preset](#minimal-preset)
41
+ - [Built-in Plugins](#built-in-plugins)
42
+ - [Frontmatter](#frontmatter-plugin)
43
+ - [Isolate Main](#isolate-main-plugin)
44
+ - [Tailwind](#tailwind-plugin)
45
+ - [Filter](#filter-plugin)
46
+ - [Extraction](#extraction-plugin)
47
+ - [Hook-Based Plugins (JS Engine)](#hook-based-plugins-js-engine)
48
+ - [Plugin Hooks](#plugin-hooks)
49
+ - [createPlugin()](#createplugin)
50
+ - [Markdown Splitting (JS Engine)](#markdown-splitting-js-engine)
51
+ - [Basic Chunking](#basic-chunking)
52
+ - [Streaming Chunks](#streaming-chunks-memory-efficient)
53
+ - [Splitter Options](#splitter-options)
54
+ - [Chunk Metadata](#chunk-metadata)
55
+ - [Content Negotiation](#content-negotiation)
56
+ - [Pure HTML Parser (JS Engine)](#pure-html-parser-js-engine)
57
+ - [CLI Usage](#cli-usage)
58
+ - [Browser and Edge Usage](#browser-and-edge-usage)
59
+ - [Edge / Cloudflare Workers](#edge--cloudflare-workers)
60
+ - [Browser CDN (IIFE)](#browser-cdn-iife)
61
+ - [Web Worker](#web-worker)
62
+ - [llms.txt Generation](#llmstxt-generation)
63
+ - [Related Packages](#related-packages)
64
+
65
+ ## API Reference
66
+
67
+ ### `htmlToMarkdown()`
68
+
69
+ Converts a complete HTML string to Markdown synchronously.
70
+
71
+ **Rust engine** (`mdream`):
38
72
 
39
- ## Installation
73
+ ```ts
74
+ import { htmlToMarkdown } from 'mdream'
40
75
 
41
- ```bash
42
- pnpm add mdream
76
+ function htmlToMarkdown(html: string, options?: Partial<MdreamOptions>): string
43
77
  ```
44
78
 
45
- ## CLI Usage
79
+ **JS engine** (`@mdream/js`):
46
80
 
47
- Mdream provides a CLI designed to work exclusively with Unix pipes,
48
- providing flexibility and freedom to integrate with other tools.
81
+ ```ts
82
+ import { htmlToMarkdown } from '@mdream/js'
49
83
 
50
- **Pipe Site to Markdown**
84
+ function htmlToMarkdown(html: string, options?: Partial<MdreamOptions>): string
85
+ ```
51
86
 
52
- Fetches the [Markdown Wikipedia page](https://en.wikipedia.org/wiki/Markdown) and converts it to Markdown preserving the original links and images.
87
+ **Example:**
53
88
 
54
- ```bash
55
- curl -s https://en.wikipedia.org/wiki/Markdown \
56
- | npx mdream --origin https://en.wikipedia.org --preset minimal \
57
- | tee streaming.md
89
+ ```ts
90
+ import { htmlToMarkdown } from 'mdream'
91
+
92
+ const markdown = htmlToMarkdown('<h1>Hello World</h1><p>Some content.</p>')
93
+ // # Hello World
94
+ //
95
+ // Some content.
58
96
  ```
59
97
 
60
- _Tip: The `--origin` flag will fix relative image and link paths_
98
+ ### `streamHtmlToMarkdown()`
61
99
 
62
- **Local File to Markdown**
100
+ Converts an HTML `ReadableStream` to Markdown incrementally. Returns an `AsyncIterable<string>` that yields Markdown chunks as they are processed.
63
101
 
64
- Converts a local HTML file to a Markdown file, using `tee` to write the output to a file and display it in the terminal.
65
102
 
66
- ```bash
67
- cat index.html \
68
- | npx mdream --preset minimal \
69
- | tee streaming.md
103
+ ```ts
104
+ import { streamHtmlToMarkdown } from 'mdream'
105
+
106
+ function streamHtmlToMarkdown(
107
+ htmlStream: ReadableStream<Uint8Array | string> | null,
108
+ options?: Partial<MdreamOptions>,
109
+ ): AsyncIterable<string>
70
110
  ```
71
111
 
72
- ### CLI Options
73
112
 
74
- - `--origin <url>`: Base URL for resolving relative links and images
75
- - `--preset <preset>`: Conversion presets: minimal
76
- - `--help`: Display help information
77
- - `--version`: Display version information
113
+ **Example:**
78
114
 
79
- ## API Usage
115
+ ```ts
116
+ import { streamHtmlToMarkdown } from 'mdream'
80
117
 
81
- Mdream provides two main functions for working with HTML:
82
- - `htmlToMarkdown`: Useful if you already have the entire HTML payload you want to convert.
83
- - `streamHtmlToMarkdown`: Best practice if you are fetching or reading from a local file.
118
+ const response = await fetch('https://example.com')
119
+ const stream = response.body
84
120
 
85
- ### Engines
121
+ for await (const chunk of streamHtmlToMarkdown(stream, {
122
+ origin: 'https://example.com',
123
+ })) {
124
+ process.stdout.write(chunk)
125
+ }
126
+ ```
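Reviewer note: because `streamHtmlToMarkdown()` is documented as returning an `AsyncIterable<string>`, any consumer that handles async iterables should work with it. The sketch below gathers the chunks into one string. `fakeMarkdownStream` and `collectMarkdown` are hypothetical names used for illustration; the generator is a stand-in so the snippet runs without the package installed.

```ts
// Hypothetical stand-in for streamHtmlToMarkdown(): yields Markdown chunks.
async function* fakeMarkdownStream(): AsyncGenerator<string> {
  yield '# Hello World\n\n'
  yield 'Some content.\n'
}

// Concatenate every chunk of an async iterable into a single string.
async function collectMarkdown(chunks: AsyncIterable<string>): Promise<string> {
  let markdown = ''
  for await (const chunk of chunks) {
    markdown += chunk
  }
  return markdown
}
```

With the real package, the same helper would presumably take the result of `streamHtmlToMarkdown(response.body, options)`. Buffering the whole output gives up the memory benefit of streaming, so it only makes sense when the full document is needed at once.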
127
+
128
+ ## Engines
86
129
 
87
130
  Mdream includes two rendering engines, automatically selecting the best one for your environment:
88
- - **Rust Engine** (default in Node.js): Native NAPI performance, 5.6-7.9x faster. WASM build for edge/browser runtimes.
89
- - **JavaScript Engine** (`@mdream/js`): Zero-dependencies, supports custom hook-based plugins.
131
+
132
+ | Engine | Package | Plugins | Use case |
133
+ |--------|---------|---------|----------|
134
+ | **Rust** (NAPI) | `mdream` | Declarative config only | Node.js (default) |
135
+ | **Rust** (WASM) | `mdream` | Declarative config only | Edge, browser |
136
+ | **JavaScript** | `@mdream/js` | Hook-based + declarative | Custom plugins, splitter |
90
137
 
91
138
  ```ts
92
- import { htmlToMarkdown } from 'mdream'
139
+ // JavaScript engine (required for hook-based plugins)
140
+ import { htmlToMarkdown } from '@mdream/js'
93
141
 
94
- // Rust NAPI engine used automatically in Node.js
95
- // JS engine used in browser/edge runtimes
96
- const markdown = htmlToMarkdown('<h1>Hello World</h1>')
142
+ // Rust NAPI engine (auto-selected in Node.js)
143
+ import { htmlToMarkdown } from 'mdream'
97
144
  ```
98
145
 
99
- ## Browser & Edge Usage
146
+ Both engines accept the same declarative plugin configuration (`origin`, `minimal`, `frontmatter`, `isolateMain`, `tailwind`, `filter`, `extraction`, `tagOverrides`, `clean`). The JS engine additionally supports `hooks` for imperative plugin transforms.
100
147
 
101
- For browser environments and edge runtimes (Cloudflare Workers, Vercel Edge), mdream compiles to WebAssembly. Export conditions (`workerd`, `edge-light`, `browser`) select the correct build automatically, or use `mdream/worker` directly:
148
+ ## Options
102
149
 
103
- ```ts
104
- import { htmlToMarkdown } from 'mdream/worker'
150
+ ### MdreamOptions (Rust engine)
151
+
152
+ Defined in `mdream`:
105
153
 
106
- const markdown = await htmlToMarkdown('<h1>Hello World</h1>')
154
+ ```ts
155
+ interface MdreamOptions {
156
+ /** Base URL for resolving relative links and images. */
157
+ origin?: string
158
+
159
+ /**
160
+ * Enable minimal preset (frontmatter, isolateMain, tailwind, filter).
161
+ * Default: false
162
+ */
163
+ minimal?: boolean
164
+
165
+ /**
166
+ * Post-processing cleanup. Pass `true` for all cleanup, or an object for specific features.
167
+ * Enabled by default when `minimal` is true.
168
+ */
169
+ clean?: boolean | CleanOptions
170
+
171
+ /**
172
+ * Extract frontmatter from HTML <head>.
173
+ * - `true`: enable with defaults
174
+ * - `(fm) => void`: enable and receive structured data via callback
175
+ * - `FrontmatterConfig`: enable with config and optional callback
176
+ */
177
+ frontmatter?: boolean | ((frontmatter: Record<string, string>) => void) | FrontmatterConfig
178
+
179
+ /** Isolate main content area. Default when minimal: true */
180
+ isolateMain?: boolean
181
+
182
+ /** Convert Tailwind utility classes to Markdown. Default when minimal: true */
183
+ tailwind?: boolean
184
+
185
+ /** Filter elements by CSS selectors. Default when minimal: excludes form, nav, footer, etc. */
186
+ filter?: { include?: string[], exclude?: string[], processChildren?: boolean }
187
+
188
+ /** Extract elements matching CSS selectors during conversion. */
189
+ extraction?: Record<string, (element: ExtractedElement) => void>
190
+
191
+ /** Override tag rendering behavior. String values act as aliases. */
192
+ tagOverrides?: Record<string, TagOverride | string>
193
+ }
107
194
  ```
108
195
 
109
- ### Browser CDN Usage
196
+ ### MdreamOptions (JS engine)
110
197
 
111
- Use mdream directly via CDN with no build step. The IIFE bundle uses the Rust WASM engine. Call `init()` once to load the WASM binary, then use `htmlToMarkdown()` synchronously.
198
+ The JS engine extends the shared `EngineOptions` with hook-based plugin support:
112
199
 
113
- ```html
114
- <script src="https://unpkg.com/mdream/dist/iife.js"></script>
115
- <script>
116
- // init() fetches the .wasm file from the same CDN path automatically
117
- await window.mdream.init()
118
- const markdown = window.mdream.htmlToMarkdown('<h1>Hello</h1><p>World</p>')
119
- console.log(markdown) // # Hello\n\nWorld
120
- </script>
121
- ```
200
+ ```ts
201
+ interface MdreamOptions extends EngineOptions {
202
+ /** Imperative hook-based transform plugins. JS engine only. */
203
+ hooks?: TransformPlugin[]
204
+ }
122
205
 
123
- You can also pass a custom WASM URL or `ArrayBuffer` to `init()`:
206
+ interface EngineOptions {
207
+ origin?: string
208
+ clean?: boolean | CleanOptions
209
+ plugins?: BuiltinPlugins
210
+ }
124
211
 
125
- ```js
126
- await window.mdream.init('https://cdn.example.com/mdream_edge_bg.wasm')
212
+ interface BuiltinPlugins {
213
+ filter?: { include?: (string | number)[], exclude?: (string | number)[], processChildren?: boolean }
214
+ frontmatter?: boolean | ((fm: Record<string, string>) => void) | FrontmatterConfig
215
+ isolateMain?: boolean
216
+ tailwind?: boolean
217
+ extraction?: Record<string, (element: ExtractedElement) => void>
218
+ tagOverrides?: Record<string, TagOverride | string>
219
+ }
127
220
  ```
128
221
 
129
- **CDN Options:**
130
- - **unpkg**: `https://unpkg.com/mdream/dist/iife.js`
131
- - **jsDelivr**: `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`
222
+ Note: The JS engine uses `options.plugins.filter` while the Rust engine uses `options.filter` directly.
132
223
 
133
- **Convert existing HTML**
224
+ ### CleanOptions
134
225
 
135
- ```ts
136
- import { htmlToMarkdown } from 'mdream'
226
+ Post-processing cleanup applied to the final Markdown output. All options default to `false` unless `clean: true` is set.
137
227
 
138
- // Simple conversion
139
- const markdown = htmlToMarkdown('<h1>Hello World</h1>')
140
- console.log(markdown) // # Hello World
228
+ ```ts
229
+ interface CleanOptions {
230
+ /** Strip tracking query parameters (utm_*, fbclid, gclid, etc.) from URLs */
231
+ urls?: boolean
232
+ /** Strip fragment-only links that don't match any heading in the output */
233
+ fragments?: boolean
234
+ /** Strip links with meaningless hrefs (#, javascript:void(0)) to plain text */
235
+ emptyLinks?: boolean
236
+ /** Collapse 3+ consecutive blank lines to 2 */
237
+ blankLines?: boolean
238
+ /** Strip links where text equals URL: [https://x.com](https://x.com) becomes https://x.com */
239
+ redundantLinks?: boolean
240
+ /** Strip self-referencing heading anchors: ## [Title](#title) becomes ## Title */
241
+ selfLinkHeadings?: boolean
242
+ /** Strip images with no alt text (decorative/tracking pixels) */
243
+ emptyImages?: boolean
244
+ /** Drop links that produce no visible text: [](url) is removed entirely */
245
+ emptyLinkText?: boolean
246
+ }
141
247
  ```
142
248
 
143
- **Convert from Fetch**
249
+ **Example:**
144
250
 
145
251
  ```ts
146
- import { streamHtmlToMarkdown } from 'mdream'
147
-
148
- // Using fetch with streaming
149
- const response = await fetch('https://example.com')
150
- const htmlStream = response.body
151
- const markdownGenerator = streamHtmlToMarkdown(htmlStream, {
152
- origin: 'https://example.com'
252
+ const markdown = htmlToMarkdown(html, {
253
+ clean: {
254
+ urls: true,
255
+ emptyLinks: true,
256
+ emptyImages: true,
257
+ },
153
258
  })
259
+ ```
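Reviewer note: the sketch below illustrates what two of the documented cleanup behaviours (`urls` and `redundantLinks`) could look like. It is not mdream's implementation. The function names are hypothetical, and the parameter list is an assumption: the README names only `utm_*`, `fbclid`, and `gclid` and says "etc.", so the real list is likely longer.

```ts
// Assumed subset of tracking parameters; the real list is not given in the README.
const TRACKING_PARAMS = ['fbclid', 'gclid']

// Remove utm_* and other known tracking parameters from a URL.
function stripTrackingParams(rawUrl: string): string {
  const url = new URL(rawUrl)
  // Copy the keys first so deleting does not disturb the iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith('utm_') || TRACKING_PARAMS.includes(key)) {
      url.searchParams.delete(key)
    }
  }
  return url.toString()
}

// Replace [https://x.com](https://x.com) with https://x.com; leave other links alone.
function unwrapRedundantLinks(markdown: string): string {
  return markdown.replace(
    /\[([^\]]+)\]\(([^)\s]+)\)/g,
    (match: string, text: string, href: string) => (text === href ? href : match),
  )
}
```

A regex over Markdown text is a simplification; links containing nested brackets or parentheses would need a real parser.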
260
+
261
+ ### FrontmatterConfig
154
262
 
155
- // Process chunks as they arrive
156
- for await (const chunk of markdownGenerator) {
157
- console.log(chunk)
263
+ ```ts
264
+ interface FrontmatterConfig {
265
+ /** Additional static fields to include in frontmatter */
266
+ additionalFields?: Record<string, string>
267
+ /**
268
+ * Meta tag names to extract beyond the defaults.
269
+ * Defaults: description, keywords, author, date,
270
+ * og:title, og:description, twitter:title, twitter:description
271
+ */
272
+ metaFields?: string[]
273
+ /** Callback to receive structured frontmatter data after conversion */
274
+ onExtract?: (frontmatter: Record<string, string>) => void
158
275
  }
159
276
  ```
160
277
 
161
- **Pure HTML Parser (JS Engine)**
278
+ ### TagOverride
162
279
 
163
- If you only need to parse HTML into a DOM-like AST without converting to Markdown, use `parseHtml` from the JS engine:
280
+ Override how specific HTML tags are rendered in Markdown. String values act as aliases.
164
281
 
165
282
  ```ts
166
- import { parseHtml } from '@mdream/js'
283
+ interface TagOverride {
284
+ /** Markdown string to insert when entering this tag */
285
+ enter?: string
286
+ /** Markdown string to insert when exiting this tag */
287
+ exit?: string
288
+ /** Spacing: [newlines before, newlines after] */
289
+ spacing?: number[]
290
+ /** Whether this tag should be treated as inline */
291
+ isInline?: boolean
292
+ /** Whether this tag is self-closing */
293
+ isSelfClosing?: boolean
294
+ /** Whether whitespace inside this tag should be collapsed */
295
+ collapsesInnerWhiteSpace?: boolean
296
+ /** Alias this tag to another tag's handler */
297
+ alias?: string
298
+ }
299
+ ```
167
300
 
168
- const html = '<div><h1>Title</h1><p>Content</p></div>'
169
- const { events, remainingHtml } = parseHtml(html)
301
+ **Example:**
170
302
 
171
- // Process the parsed events
172
- events.forEach((event) => {
173
- if (event.type === 'enter' && event.node.type === 'element') {
174
- console.log('Entering element:', event.node.tagName)
175
- }
303
+ ```ts
304
+ const markdown = htmlToMarkdown(html, {
305
+ tagOverrides: {
306
+ // Treat <x-heading> like <h2>
307
+ 'x-heading': 'h2',
308
+ // Custom rendering for <callout>
309
+ 'callout': {
310
+ enter: '> **Note:** ',
311
+ exit: '',
312
+ spacing: [2, 2],
313
+ },
314
+ },
176
315
  })
177
316
  ```
178
317
 
179
- The `parseHtml` function provides:
180
- - **Pure AST parsing** - No markdown generation overhead
181
- - **DOM events** - Enter/exit events for each element and text node
182
- - **Plugin support** - Can apply plugins during parsing
183
- - **Streaming compatible** - Works with the same plugin system
318
+ ### FilterOptions
319
+
320
+ ```ts
321
+ interface FilterOptions {
322
+ /** CSS selectors, tag names, or TAG_* constants for elements to include (all others excluded) */
323
+ include?: (string | number)[]
324
+ /** CSS selectors, tag names, or TAG_* constants for elements to exclude */
325
+ exclude?: (string | number)[]
326
+ /** Whether to also process children of matched elements. Default: true */
327
+ processChildren?: boolean
328
+ }
329
+ ```
184
330
 
185
331
  ## Presets
186
332
 
187
333
  ### Minimal Preset
188
334
 
189
- The `minimal` preset optimizes for token reduction and cleaner output by removing non-essential content:
335
+ The `minimal` preset enables the following plugins together:
336
+
337
+ - **frontmatter**: Extracts metadata from HTML `<head>` into YAML frontmatter
338
+ - **isolateMain**: Extracts the main content area, skipping navigation, headers, and footers
339
+ - **tailwind**: Converts Tailwind utility classes to Markdown formatting
340
+ - **filter**: Excludes `form`, `fieldset`, `object`, `embed`, `footer`, `aside`, `iframe`, `input`, `textarea`, `select`, `button`, `nav`
341
+ - **clean**: All post-processing cleanup enabled
342
+
343
+ **Rust engine:**
190
344
 
191
345
  ```ts
192
346
  import { htmlToMarkdown } from 'mdream'
@@ -197,75 +351,170 @@ const markdown = htmlToMarkdown(html, {
197
351
  })
198
352
  ```
199
353
 
200
- **Enables:**
201
- - `isolateMain` - Extracts main content area
202
- - `frontmatter` - Generates YAML frontmatter from meta tags
203
- - `tailwind` - Converts Tailwind classes to Markdown
204
- - `filter` - Excludes forms, navigation, buttons, footers, and other non-content elements
354
+ **JS engine (using `withMinimalPreset`):**
205
355
 
206
- **CLI Usage:**
207
- ```bash
208
- curl -s https://example.com | npx mdream --preset minimal --origin https://example.com
356
+ ```ts
357
+ import { htmlToMarkdown } from '@mdream/js'
358
+ import { withMinimalPreset } from '@mdream/js/preset/minimal'
359
+
360
+ const markdown = htmlToMarkdown(html, withMinimalPreset({
361
+ origin: 'https://example.com',
362
+ }))
363
+ ```
364
+
365
+ `withMinimalPreset()` returns an `EngineOptions` object with all plugin defaults applied. You can override individual plugins:
366
+
367
+ ```ts
368
+ const markdown = htmlToMarkdown(html, withMinimalPreset({
369
+ plugins: {
370
+ frontmatter: false,
371
+ filter: { exclude: ['nav'] },
372
+ },
373
+ }))
209
374
  ```
210
375
 
211
- ## Declarative Options
376
+ ## Built-in Plugins
377
+
378
+ All built-in plugins work with both the Rust and JS engines through declarative configuration.
379
+
380
+ ### Frontmatter Plugin
212
381
 
213
- Both engines accept the same declarative configuration:
382
+ Extracts metadata from the HTML `<head>` element and generates YAML frontmatter.
383
+
384
+ **Extracted fields by default:** `title`, `description`, `keywords`, `author`, `date`, `og:title`, `og:description`, `twitter:title`, `twitter:description`.
214
385
 
215
386
  ```ts
216
- import { htmlToMarkdown } from 'mdream'
387
+ // Enable with defaults
388
+ htmlToMarkdown(html, { frontmatter: true })
389
+
390
+ // With callback to receive structured data
391
+ htmlToMarkdown(html, {
392
+ frontmatter: (fm) => {
393
+ console.log(fm.title)
394
+ console.log(fm.description)
395
+ },
396
+ })
217
397
 
218
- const markdown = htmlToMarkdown(html, {
219
- origin: 'https://example.com',
220
- minimal: true, // enables frontmatter, isolateMain, tailwind, filter
221
- clean: true, // enable all post-processing cleanup
222
- frontmatter: fm => console.log(fm), // callback for extracted frontmatter
223
- filter: { exclude: ['nav', '.sidebar'] },
224
- extraction: {
225
- 'h2': el => console.log('Heading:', el.textContent),
226
- 'img[alt]': el => console.log('Image:', el.attributes.src),
398
+ // With full config
399
+ htmlToMarkdown(html, {
400
+ frontmatter: {
401
+ additionalFields: { source: 'https://example.com' },
402
+ metaFields: ['robots', 'viewport'],
403
+ onExtract: fm => console.log(fm),
227
404
  },
228
- tagOverrides: { 'custom-tag': { alias: 'div' } },
229
405
  })
230
406
  ```
231
407
 
232
- ### Available Options
408
+ **Output example:**
233
409
 
234
- | Option | Type | Description |
235
- |--------|------|-------------|
236
- | `origin` | `string` | Base URL for resolving relative links/images |
237
- | `minimal` | `boolean` | Enable minimal preset (frontmatter, isolateMain, tailwind, filter) |
238
- | `clean` | `boolean \| CleanOptions` | Post-processing cleanup (`true` for all, or pick specific) |
239
- | `frontmatter` | `boolean \| (fm) => void \| FrontmatterConfig` | Extract frontmatter from HTML head |
240
- | `isolateMain` | `boolean` | Isolate main content area |
241
- | `tailwind` | `boolean` | Convert Tailwind classes to Markdown |
242
- | `filter` | `{ include?, exclude?, processChildren? }` | Filter elements by CSS selectors |
243
- | `extraction` | `Record<string, (el) => void>` | Extract elements matching CSS selectors |
244
- | `tagOverrides` | `Record<string, TagOverride \| string>` | Override tag rendering behavior |
410
+ ```yaml
411
+ ---
412
+ title: My Page Title
413
+ meta:
414
+ description: A page description
415
+ 'og:title': My Page Title
416
+ ---
417
+ ```
245
418
 
246
- ### Content Extraction with Readability
419
+ ### Isolate Main Plugin
247
420
 
248
- For advanced content extraction (article detection, boilerplate removal), use [@mozilla/readability](https://github.com/mozilla/readability) before mdream:
421
+ Isolates the main content area using the following priority:
422
+
423
+ 1. If an explicit `<main>` element exists (within 5 depth levels), use its content exclusively
424
+ 2. Otherwise, find content between the first header tag (`h1`-`h6`) and the first `<footer>`
425
+ 3. Headings inside `<header>` tags are skipped during fallback detection
426
+ 4. The `<head>` section is always passed through for other plugins (e.g., frontmatter)
249
427
 
250
428
  ```ts
251
- import { Readability } from '@mozilla/readability'
252
- import { JSDOM } from 'jsdom'
253
- import { htmlToMarkdown } from 'mdream'
429
+ htmlToMarkdown(html, { isolateMain: true })
430
+ ```
254
431
 
255
- const dom = new JSDOM(html, { url: 'https://example.com' })
256
- const article = new Readability(dom.window.document).parse()
432
+ ### Tailwind Plugin
257
433
 
258
- if (article) {
259
- const markdown = htmlToMarkdown(article.content)
260
- // article.title, article.excerpt, article.byline also available
261
- }
434
+ Converts Tailwind CSS utility classes to semantic Markdown formatting:
435
+
436
+ | Tailwind Class | Markdown Output |
437
+ |---|---|
438
+ | `font-bold`, `font-semibold`, `font-medium`, `font-extrabold`, `font-black` | `**bold**` |
439
+ | `italic`, `font-italic` | `*italic*` |
440
+ | `line-through` | `~~strikethrough~~` |
441
+ | `hidden`, `invisible` | Content removed |
442
+ | `absolute`, `fixed`, `sticky` | Content removed |
443
+
444
+ Supports responsive breakpoint prefixes (`sm:`, `md:`, `lg:`, `xl:`, `2xl:`) with mobile-first resolution.
445
+
446
+ ```ts
447
+ htmlToMarkdown(html, { tailwind: true })
262
448
  ```
263
449
 
264
- This pipeline gives you battle-tested content extraction + fast markdown conversion.
450
+ ### Filter Plugin
451
+
452
+ Filters HTML elements by CSS selectors, tag names, or `TAG_*` constants.
453
+
454
+ ```ts
455
+ // Exclude navigation, sidebar, footer
456
+ htmlToMarkdown(html, {
457
+ filter: {
458
+ exclude: ['nav', '#sidebar', '.footer', 'aside'],
459
+ },
460
+ })
461
+
462
+ // Include only specific elements
463
+ htmlToMarkdown(html, {
464
+ filter: {
465
+ include: ['article', 'main'],
466
+ },
467
+ })
468
+ ```
469
+
470
+ The JS engine also supports `TAG_*` integer constants for filtering:
471
+
472
+ ```ts
473
+ import { TAG_FOOTER, TAG_NAV } from '@mdream/js'
474
+
475
+ htmlToMarkdown(html, {
476
+ plugins: {
477
+ filter: { exclude: [TAG_NAV, TAG_FOOTER] },
478
+ },
479
+ })
480
+ ```
481
+
482
+ Elements with `style="position: absolute"` or `style="position: fixed"` are also automatically excluded when the filter plugin is active.
483
+
484
+ ### Extraction Plugin
485
+
486
+ Extracts elements matching CSS selectors during conversion. Callbacks receive the matched element with its accumulated text content and attributes.
487
+
488
+ ```ts
489
+ htmlToMarkdown(html, {
490
+ extraction: {
491
+ 'h2': (el) => {
492
+ console.log('Heading:', el.textContent)
493
+ },
494
+ 'img[alt]': (el) => {
495
+ console.log('Image:', el.attributes.src, el.attributes.alt)
496
+ },
497
+ 'a[href]': (el) => {
498
+ console.log('Link:', el.textContent, el.attributes.href)
499
+ },
500
+ },
501
+ })
502
+ ```
503
+
504
+ The `ExtractedElement` interface:
505
+
506
+ ```ts
507
+ interface ExtractedElement {
508
+ selector: string
509
+ tagName: string
510
+ textContent: string
511
+ attributes: Record<string, string>
512
+ }
513
+ ```
265
514
 
266
515
  ## Hook-Based Plugins (JS Engine)
267
516
 
268
- For custom hook-based plugins, use `@mdream/js`:
517
+ The JS engine (`@mdream/js`) supports imperative hook-based plugins for custom transform logic. These allow you to intercept and modify the conversion pipeline at multiple stages.
269
518
 
270
519
  ```ts
271
520
  import { htmlToMarkdown } from '@mdream/js'
@@ -280,7 +529,7 @@ const myPlugin = createPlugin({
280
529
  if (textNode.parent?.attributes?.id === 'highlight') {
281
530
  return { content: `**${textNode.value}**`, skip: false }
282
531
  }
283
- }
532
+ },
284
533
  })
285
534
 
286
535
  const markdown = htmlToMarkdown(html, { hooks: [myPlugin] })
@@ -288,15 +537,91 @@ const markdown = htmlToMarkdown(html, { hooks: [myPlugin] })
288
537
 
289
538
  ### Plugin Hooks
290
539
 
291
- - `beforeNodeProcess`: Called before any node processing, can skip nodes
292
- - `onNodeEnter`: Called when entering an element node
293
- - `onNodeExit`: Called when exiting an element node
294
- - `processTextNode`: Called for each text node
295
- - `processAttributes`: Called to process element attributes
540
+ ```ts
541
+ interface TransformPlugin {
542
+ /**
543
+ * Called before any node processing. Return { skip: true } to skip the node.
544
+ */
545
+ beforeNodeProcess?: (
546
+ event: NodeEvent,
547
+ state: MdreamRuntimeState,
548
+ ) => undefined | void | { skip: boolean }
549
+
550
+ /**
551
+ * Called when entering an element node.
552
+ * Return a string to prepend to the output.
553
+ */
554
+ onNodeEnter?: (
555
+ node: ElementNode,
556
+ state: MdreamRuntimeState,
557
+ ) => string | undefined | void
558
+
559
+ /**
560
+ * Called when exiting an element node.
561
+ * Return a string to append to the output.
562
+ */
563
+ onNodeExit?: (
564
+ node: ElementNode,
565
+ state: MdreamRuntimeState,
566
+ ) => string | undefined | void
567
+
568
+ /**
569
+ * Called to process element attributes (e.g., extracting Tailwind classes).
570
+ */
571
+ processAttributes?: (
572
+ node: ElementNode,
573
+ state: MdreamRuntimeState,
574
+ ) => void
575
+
576
+ /**
577
+ * Called for each text node. Return { content, skip } to transform text.
578
+ * Return undefined for no transformation.
579
+ */
580
+ processTextNode?: (
581
+ node: TextNode,
582
+ state: MdreamRuntimeState,
583
+ ) => { content: string, skip: boolean } | undefined
584
+ }
585
+ ```
586
+
587
+ ### `createPlugin()`
588
+
589
+ A typed identity function for creating plugins with full TypeScript inference:
590
+
591
+ ```ts
592
+ import { createPlugin } from '@mdream/js/plugins'
593
+
594
+ const plugin = createPlugin({
595
+ beforeNodeProcess({ node }) {
596
+ // Skip all div elements with class "ad"
597
+ if (node.type === 1 && node.attributes?.class?.includes('ad')) {
598
+ return { skip: true }
599
+ }
600
+ },
601
+ })
602
+ ```
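Reviewer note: a "typed identity function" returns its argument unchanged and exists only so TypeScript can infer the hook parameter types from the declared plugin type. The sketch below shows the pattern with a simplified, hypothetical `SketchPlugin` interface; it is not the actual `createPlugin()` source, and the real `TransformPlugin` hooks also receive a `state` argument.

```ts
// Hypothetical, simplified stand-ins for TextNode and TransformPlugin.
interface SketchTextNode {
  value: string
}

interface SketchPlugin {
  processTextNode?: (node: SketchTextNode) => { content: string, skip: boolean } | undefined
}

// Typed identity: no runtime work, the parameter type drives inference.
function createSketchPlugin(plugin: SketchPlugin): SketchPlugin {
  return plugin
}

const shout = createSketchPlugin({
  processTextNode(node) {
    // `node` is inferred as SketchTextNode without an annotation.
    return { content: node.value.toUpperCase(), skip: false }
  },
})
```

The benefit is purely at compile time: a misspelled hook name or a wrong return shape is flagged where the plugin is defined rather than where it is used.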
603
+
604
+ ### Built-in Plugin Functions (JS Engine)
605
+
606
+ The following plugin factory functions are available from `@mdream/js/plugins`:
607
+
608
+ ```ts
609
+ import {
610
+ createPlugin,
611
+ extractionCollectorPlugin,
612
+ extractionPlugin,
613
+ filterPlugin,
614
+ frontmatterPlugin,
615
+ isolateMainPlugin,
616
+ tailwindPlugin,
617
+ } from '@mdream/js/plugins'
618
+ ```
619
+
620
+ ## Markdown Splitting (JS Engine)
296
621
 
297
- ## Markdown Splitting
622
+ Split HTML into Markdown chunks during conversion. Compatible with the LangChain `Document` structure.
298
623
 
299
- Split HTML into chunks during conversion for LLM context windows, vector databases, or document processing.
624
+ Available from `@mdream/js/splitter`.
300
625
 
301
626
  ### Basic Chunking
302
627
 
@@ -313,18 +638,17 @@ const html = `
313
638
  `
314
639
 
315
640
  const chunks = htmlToMarkdownSplitChunks(html, {
316
- headersToSplitOn: [TAG_H2], // Split on h2 headers
317
- chunkSize: 1000, // Max chars per chunk
318
- chunkOverlap: 200, // Overlap for context
319
- stripHeaders: true // Remove headers from content
641
+ headersToSplitOn: [TAG_H2],
642
+ chunkSize: 1000,
643
+ chunkOverlap: 200,
644
+ stripHeaders: true,
320
645
  })
321
646
 
322
- // Each chunk includes content and metadata
323
647
  chunks.forEach((chunk) => {
324
648
  console.log(chunk.content)
325
649
  console.log(chunk.metadata.headers) // { h1: "Documentation", h2: "Installation" }
326
650
  console.log(chunk.metadata.code) // Language if chunk contains code
327
- console.log(chunk.metadata.loc) // Line numbers
651
+ console.log(chunk.metadata.loc) // { lines: { from: 1, to: 5 } }
328
652
  })
329
653
  ```
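Reviewer note: to make the relationship between `chunkSize` and `chunkOverlap` concrete, the sketch below slides a fixed-size character window over a string. This is an assumption-laden illustration only. The real splitter also splits on headers and may pick different boundaries; `splitWithOverlap` is a hypothetical helper, not part of `@mdream/js/splitter`.

```ts
// Split text into windows of `chunkSize` characters, each sharing
// `chunkOverlap` characters with the previous window.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize')
  }
  const chunks: string[] = []
  const step = chunkSize - chunkOverlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) {
      break
    }
  }
  return chunks
}
```

With `chunkSize: 1000` and `chunkOverlap: 200` as in the example above, each window in this sketch would start 800 characters after the previous one.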
330
654
 
@@ -335,54 +659,82 @@ For large documents, use the generator version to process chunks one at a time:
335
659
  ```ts
336
660
  import { htmlToMarkdownSplitChunksStream } from '@mdream/js/splitter'
337
661
 
338
- // Process chunks incrementally - lower memory usage
339
662
  for (const chunk of htmlToMarkdownSplitChunksStream(html, options)) {
340
- await processChunk(chunk) // Handle each chunk as it's generated
663
+ await processChunk(chunk)
341
664
 
342
- // Can break early if you found what you need
665
+ // Early termination supported
343
666
  if (foundTarget)
344
667
  break
345
668
  }
346
669
  ```
347
670
 
348
- **Benefits of streaming:**
349
- - Lower memory usage - chunks aren't stored in an array
350
- - Early termination - stop processing when you find what you need
351
- - Better for large documents
352
-
353
- ### Splitting Options
671
+ ### Splitter Options
354
672
 
355
673
  ```ts
356
674
  interface SplitterOptions {
357
- // Structural splitting
358
- headersToSplitOn?: number[] // TAG_H1, TAG_H2, etc. Default: [TAG_H2-TAG_H6]
675
+ // --- Structural splitting ---
676
+
677
+ /**
678
+ * Header tag IDs to split on (TAG_H1 through TAG_H6).
679
+ * Default: [TAG_H2, TAG_H3, TAG_H4, TAG_H5, TAG_H6]
680
+ */
681
+ headersToSplitOn?: number[]
682
+
683
+ // --- Size-based splitting ---
684
+
685
+ /** Maximum chunk size in characters. Default: 1000 */
686
+ chunkSize?: number
687
+
688
+ /** Overlap between chunks for context preservation. Default: 200 */
689
+ chunkOverlap?: number
690
+
691
+ /**
692
+ * Custom length function (e.g., a token counter for LLM applications).
693
+ * Default: (text) => text.length
694
+ */
695
+ lengthFunction?: (text: string) => number
696
+
697
+ // --- Output formatting ---
698
+
699
+ /** Remove headers from chunk content. Default: true */
700
+ stripHeaders?: boolean
359
701
 
360
- // Size-based splitting
361
- chunkSize?: number // Max chunk size. Default: 1000
362
- chunkOverlap?: number // Overlap between chunks. Default: 200
363
- lengthFunction?: (text: string) => number // Custom length (e.g., token count)
702
+ /** Split into individual lines. Default: false */
703
+ returnEachLine?: boolean
364
704
 
365
- // Output formatting
366
- stripHeaders?: boolean // Remove headers from content. Default: true
367
- returnEachLine?: boolean // Split into individual lines. Default: false
705
+ /** Keep separators in the split chunks. Default: false */
706
+ keepSeparator?: boolean
368
707
 
369
- // Standard options
370
- origin?: string // Base URL for links/images
371
- hooks?: TransformPlugin[] // Apply hook-based plugins during conversion (@mdream/js only)
708
+ // --- Standard options ---
709
+
710
+ /** Base URL for resolving relative links/images */
711
+ origin?: string
712
+
713
+ /** Declarative built-in plugin config */
714
+ plugins?: BuiltinPlugins
715
+
716
+ /** Hook-based plugins (JS engine only) */
717
+ hooks?: TransformPlugin[]
718
+
719
+ /** Post-processing cleanup */
720
+ clean?: boolean | CleanOptions
372
721
  }
373
722
  ```
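As a rough illustration of the `lengthFunction` option, the sketch below estimates token counts instead of counting characters. The `approxTokenLength` helper and the four-characters-per-token ratio are assumptions made for this example, not part of mdream; a real tokenizer for your target model would be more accurate.

```ts
// Hypothetical helper: estimates tokens with a ~4 characters-per-token heuristic.
// The ratio is an assumption and varies by model and language.
function approxTokenLength(text: string): number {
  return Math.ceil(text.length / 4)
}

// Mirrors the size-related fields of SplitterOptions documented above.
const sizeOptions: {
  chunkSize: number
  chunkOverlap: number
  lengthFunction: (text: string) => number
} = {
  chunkSize: 256, // assumed to be interpreted in lengthFunction units
  chunkOverlap: 32,
  lengthFunction: approxTokenLength,
}

console.log(sizeOptions.lengthFunction('Hello, world!')) // 4 (13 characters / 4, rounded up)
```

An object like `sizeOptions` could then be spread into the options passed to `htmlToMarkdownSplitChunks`.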
374
723
 
375
724
  ### Chunk Metadata
376
725
 
377
- Each chunk includes rich metadata for context:
726
+ Each chunk includes metadata for context:
378
727
 
379
728
  ```ts
380
729
  interface MarkdownChunk {
381
730
  content: string
382
731
  metadata: {
383
- headers?: Record<string, string> // Header hierarchy: { h1: "Title", h2: "Section" }
384
- code?: string // Code block language if present
385
- loc?: { // Line number range
732
+ /** Header hierarchy at this chunk position */
733
+ headers?: Record<string, string> // { h1: "Title", h2: "Section" }
734
+ /** Code block language if chunk contains code */
735
+ code?: string
736
+ /** Line number range in the original document */
737
+ loc?: {
386
738
  lines: { from: number, to: number }
387
739
  }
388
740
  }
@@ -391,7 +743,7 @@ interface MarkdownChunk {
391
743
 
392
744
  ### Use with Presets
393
745
 
394
- Combine splitting with presets for optimized output:
746
+ Combine splitting with presets:
395
747
 
396
748
  ```ts
397
749
  import { TAG_H2 } from '@mdream/js'
@@ -401,13 +753,189 @@ import { htmlToMarkdownSplitChunks } from '@mdream/js/splitter'
401
753
  const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
402
754
  headersToSplitOn: [TAG_H2],
403
755
  chunkSize: 500,
404
- origin: 'https://example.com'
756
+ origin: 'https://example.com',
405
757
  }))
406
758
  ```
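Because `stripHeaders` removes headers from chunk content, it can help to fold `metadata.headers` back in before embedding or indexing a chunk. The following is a minimal sketch; `withBreadcrumb` is a hypothetical helper, and the `MarkdownChunk` interface is a local copy of the shape documented under Chunk Metadata.

```ts
// Local copy of the MarkdownChunk shape documented in "Chunk Metadata".
interface MarkdownChunk {
  content: string
  metadata: {
    headers?: Record<string, string>
    code?: string
    loc?: { lines: { from: number, to: number } }
  }
}

// Hypothetical helper: joins the header hierarchy (h1 to h6) into a
// breadcrumb line and prepends it to the chunk content.
function withBreadcrumb(chunk: MarkdownChunk): string {
  const headers = chunk.metadata.headers ?? {}
  const trail = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
    .map(level => headers[level])
    .filter((title): title is string => Boolean(title))
    .join(' > ')
  // Chunks without header metadata are returned unchanged.
  return trail ? `${trail}\n\n${chunk.content}` : chunk.content
}

const sample: MarkdownChunk = {
  content: 'Run the install command.',
  metadata: { headers: { h1: 'Documentation', h2: 'Installation' } },
}

console.log(withBreadcrumb(sample))
// Documentation > Installation
//
// Run the install command.
```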
407
759
 
760
+ ## Content Negotiation
761
+
762
+ The `@mdream/js/negotiate` module provides HTTP content negotiation utilities for serving Markdown to LLM clients:
763
+
764
+ ```ts
765
+ import { parseAcceptHeader, shouldServeMarkdown } from '@mdream/js/negotiate'
766
+
767
+ // Check if client prefers markdown
768
+ const serveMarkdown = shouldServeMarkdown(
769
+ request.headers.get('accept'),
770
+ request.headers.get('sec-fetch-dest'),
771
+ )
772
+
773
+ if (serveMarkdown) {
774
+ return new Response(markdown, {
775
+ headers: { 'Content-Type': 'text/markdown' },
776
+ })
777
+ }
778
+ ```
779
+
780
+ `shouldServeMarkdown()` uses Accept header quality weights and position ordering. It returns `true` when `text/markdown` or `text/plain` has higher priority than `text/html`. Browser navigation requests (`sec-fetch-dest: document`) always return `false`.
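The rule described above can be sketched independently of the library. The code below is a simplified illustration, not mdream's implementation: `parseAccept` and `prefersMarkdown` are hypothetical names, and using position only as a tie-breaker between equal `q` values is an assumption.

```ts
interface AcceptEntry { type: string, q: number, position: number }

// Parse an Accept header into media types with quality weights (default q=1).
function parseAccept(header: string): AcceptEntry[] {
  return header
    .split(',')
    .map((part, position) => {
      const [rawType = '', ...params] = part.trim().split(';')
      const qParam = params.map(p => p.trim()).find(p => p.startsWith('q='))
      const parsed = qParam ? Number.parseFloat(qParam.slice(2)) : 1
      return { type: rawType.trim().toLowerCase(), q: Number.isNaN(parsed) ? 1 : parsed, position }
    })
    .filter(entry => entry.type.length > 0)
}

// Sort order: higher q first; for equal q, earlier position first (assumption).
function rank(a: AcceptEntry, b: AcceptEntry): number {
  return b.q - a.q || a.position - b.position
}

// True when text/markdown or text/plain outranks text/html.
function prefersMarkdown(accept: string | null, secFetchDest: string | null): boolean {
  // Browser navigations always get HTML.
  if (secFetchDest === 'document')
    return false
  if (!accept)
    return false
  const entries = parseAccept(accept)
  const best = (types: string[]): AcceptEntry | undefined =>
    entries.filter(e => types.includes(e.type)).sort(rank)[0]
  const markdown = best(['text/markdown', 'text/plain'])
  if (!markdown)
    return false
  const html = best(['text/html'])
  // With no text/html entry at all, this sketch treats Markdown as preferred.
  return html ? rank(markdown, html) < 0 : true
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.9', null)) // true
console.log(prefersMarkdown('text/html, text/markdown', null)) // false
```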
781
+
782
+ ## Pure HTML Parser (JS Engine)
783
+
784
+ If you only need to parse HTML into a DOM-like event stream without converting to Markdown, use `parseHtml` from the JS engine:
785
+
786
+ ```ts
787
+ import { parseHtml } from '@mdream/js/parse'
788
+
789
+ const html = '<div><h1>Title</h1><p>Content</p></div>'
790
+ const { events, remainingHtml } = parseHtml(html)
791
+
792
+ events.forEach((event) => {
793
+ if (event.type === 0 && event.node.type === 1) { // Enter + Element
794
+ console.log('Entering element:', event.node.name)
795
+ }
796
+ })
797
+ ```
798
+
799
+ The parser provides:
800
+ - Pure AST event stream with no markdown generation overhead
801
+ - Enter/exit events for each element and text node
802
+ - Plugin support during parsing
803
+ - Streaming compatible via `parseHtmlStream()`
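When consuming the event stream, named constants read more clearly than bare numbers. The sketch below takes its two numeric codes from the "Enter + Element" comment in the snippet above; the `ParseEvent` shape, the constant names, and the hand-written sample events are assumptions for illustration, covering only the fields that snippet reads.

```ts
// Codes taken from the "Enter + Element" comment in the README snippet.
// Other codes (exit events, text nodes) are not shown there and are omitted.
const NODE_EVENT_ENTER = 0
const ELEMENT_NODE = 1

// Minimal local event shape (assumption): only the fields used below.
interface ParseEvent {
  type: number
  node: { type: number, name?: string }
}

// Collect the tag names of every element the parser enters, in order.
function enteredElementNames(events: ParseEvent[]): string[] {
  const names: string[] = []
  for (const event of events) {
    if (event.type === NODE_EVENT_ENTER && event.node.type === ELEMENT_NODE && event.node.name)
      names.push(event.node.name)
  }
  return names
}

// Hand-written events standing in for parseHtml() output.
const sampleEvents: ParseEvent[] = [
  { type: NODE_EVENT_ENTER, node: { type: ELEMENT_NODE, name: 'div' } },
  { type: NODE_EVENT_ENTER, node: { type: ELEMENT_NODE, name: 'h1' } },
  { type: NODE_EVENT_ENTER, node: { type: 2 } }, // hypothetical non-element code: skipped
  { type: NODE_EVENT_ENTER, node: { type: ELEMENT_NODE, name: 'p' } },
]

console.log(enteredElementNames(sampleEvents)) // [ 'div', 'h1', 'p' ]
```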
804
+
805
+ ## CLI Usage
806
+
807
+ Mdream provides a CLI that works with Unix pipes.
808
+
809
+ **Pipe a web page to Markdown:**
810
+
811
+ ```bash
812
+ curl -s https://en.wikipedia.org/wiki/Markdown \
813
+ | npx mdream --origin https://en.wikipedia.org --preset minimal \
814
+ | tee output.md
815
+ ```
816
+
817
+ **Local file to Markdown:**
818
+
819
+ ```bash
820
+ cat index.html \
821
+ | npx mdream --preset minimal \
822
+ | tee output.md
823
+ ```
824
+
825
+ ### CLI Options
826
+
827
+ | Option | Description |
828
+ |--------|-------------|
829
+ | `--origin <url>` | Base URL for resolving relative links and images |
830
+ | `--preset minimal` | Enable the minimal preset |
831
+ | `-h`, `--help` | Display help information |
832
+
833
+ The CLI reads HTML from stdin and writes Markdown to stdout. It uses the streaming API internally.
834
+
835
+ ## Browser and Edge Usage
836
+
837
+ ### Edge / Cloudflare Workers
838
+
839
+ For edge runtimes (Cloudflare Workers, Vercel Edge), `mdream` automatically selects the WASM build via export conditions (`workerd`, `edge-light`). Both `htmlToMarkdown` and `streamHtmlToMarkdown` are available:
840
+
841
+ ```ts
842
+ import { htmlToMarkdown, streamHtmlToMarkdown } from 'mdream'
843
+
844
+ // WASM engine auto-selected via export conditions
845
+ const markdown = htmlToMarkdown('<h1>Hello World</h1>')
846
+
847
+ // Streaming works the same as Node.js
848
+ const response = await fetch('https://example.com')
849
+ for await (const chunk of streamHtmlToMarkdown(response.body)) {
850
+ // process chunk
851
+ }
852
+ ```
853
+
854
+ You can also import the edge entry point directly:
855
+
856
+ ```ts
857
+ import { htmlToMarkdown } from 'mdream/worker'
858
+ ```
859
+
860
+ The `mdream/worker` entry provides an async API since WASM must be initialized first:
861
+
862
+ ```ts
863
+ import { htmlToMarkdown, initWorker, terminateWorker } from 'mdream/worker'
864
+
865
+ // Initialize once with the WASM URL
866
+ await initWorker('https://cdn.example.com/mdream_edge_bg.wasm')
867
+
868
+ // Convert (returns Promise<string>)
869
+ const markdown = await htmlToMarkdown('<h1>Hello</h1>')
870
+
871
+ // Clean up when done
872
+ terminateWorker()
873
+ ```
874
+
875
+ ### Browser CDN (IIFE)
876
+
877
+ Use mdream directly via CDN with no build step. Call `init()` once to load the WASM binary, then use `htmlToMarkdown()` synchronously.
878
+
879
+ ```html
880
+ <script src="https://unpkg.com/mdream/dist/iife.js"></script>
881
+ <script type="module">
882
+ await window.mdream.init()
883
+ const markdown = window.mdream.htmlToMarkdown('<h1>Hello</h1><p>World</p>')
884
+ console.log(markdown) // # Hello\n\nWorld
885
+ </script>
886
+ ```
887
+
888
+ You can pass a custom WASM URL or `ArrayBuffer` to `init()`:
889
+
890
+ ```js
891
+ // Custom URL
892
+ await window.mdream.init('https://cdn.example.com/mdream_edge_bg.wasm')
893
+
894
+ // Pre-loaded ArrayBuffer
895
+ const wasmBytes = await fetch('/wasm/mdream_edge_bg.wasm').then(r => r.arrayBuffer())
896
+ await window.mdream.init(wasmBytes)
897
+ ```
898
+
899
+ **CDN Options:**
900
+ - **unpkg**: `https://unpkg.com/mdream/dist/iife.js`
901
+ - **jsDelivr**: `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`
902
+
903
+ ### Web Worker
904
+
905
+ For browser environments, `mdream/worker` runs conversions off the main thread using a Web Worker:
906
+
907
+ ```ts
908
+ import { htmlToMarkdown, initWorker, terminateWorker } from 'mdream/worker'
909
+
910
+ await initWorker('/path/to/mdream_edge_bg.wasm')
911
+
912
+ const markdown = await htmlToMarkdown('<h1>Hello</h1>')
913
+
914
+ // Clean up
915
+ terminateWorker()
916
+ ```
917
+
918
+ ## Content Extraction with Readability
919
+
920
+ For advanced content extraction (article detection, boilerplate removal), use [@mozilla/readability](https://github.com/mozilla/readability) before mdream:
921
+
922
+ ```ts
923
+ import { Readability } from '@mozilla/readability'
924
+ import { JSDOM } from 'jsdom'
925
+ import { htmlToMarkdown } from 'mdream'
926
+
927
+ const dom = new JSDOM(html, { url: 'https://example.com' })
928
+ const article = new Readability(dom.window.document).parse()
929
+
930
+ if (article) {
931
+ const markdown = htmlToMarkdown(article.content)
932
+ // article.title, article.excerpt, article.byline also available
933
+ }
934
+ ```
935
+
408
936
  ## llms.txt Generation
409
937
 
410
- For llms.txt artifact generation, use `@mdream/llms-txt`. It accepts pre-converted markdown and generates `llms.txt` and `llms-full.txt` artifacts.
938
+ For llms.txt artifact generation, use the separate `@mdream/llms-txt` package. It accepts pre-converted Markdown and generates `llms.txt` and `llms-full.txt` artifacts.
411
939
 
412
940
  ```ts
413
941
  import { generateLlmsTxtArtifacts } from '@mdream/llms-txt'
@@ -427,11 +955,18 @@ console.log(result.llmsTxt) // llms.txt content
427
955
  console.log(result.llmsFullTxt) // llms-full.txt content
428
956
  ```
429
957
 
430
- ## Credits
958
+ ## Related Packages
431
959
 
432
- - [ultrahtml](https://github.com/natemoo-re/ultrahtml): HTML parsing inspiration
960
+ | Package | Description |
961
+ |---------|-------------|
962
+ | [`mdream`](https://npmjs.com/package/mdream) | Core HTML to Markdown converter (Rust + WASM engine) |
963
+ | [`@mdream/js`](https://npmjs.com/package/@mdream/js) | JavaScript engine with hook-based plugins and splitter |
964
+ | [`@mdream/llms-txt`](https://github.com/harlan-zw/mdream/tree/main/packages/llms-txt) | Engine-agnostic llms.txt artifact generation |
965
+ | [`@mdream/crawl`](https://github.com/harlan-zw/mdream/tree/main/packages/crawl) | Site-wide crawler for llms.txt generation |
966
+ | [`@mdream/vite`](https://github.com/harlan-zw/mdream/tree/main/packages/vite) | Vite plugin integration |
967
+ | [`@mdream/nuxt`](https://github.com/harlan-zw/mdream/tree/main/packages/nuxt) | Nuxt module integration |
968
+ | [`@mdream/action`](https://github.com/harlan-zw/mdream/tree/main/packages/action) | GitHub Actions integration |
433
969
 
434
970
  ## License
435
971
 
436
972
  Licensed under the [MIT license](https://github.com/harlan-zw/mdream/blob/main/LICENSE.md).
437
-