mdream 1.0.0-beta.9 → 1.0.0
- package/LICENSE.md +9 -0
- package/README.md +741 -206
- package/dist/browser.d.mts +2 -1
- package/dist/browser.mjs +22 -2
- package/dist/edge.d.mts +2 -1
- package/dist/edge.mjs +20 -1
- package/dist/iife.js +368 -1
- package/package.json +15 -15
- package/wasm/mdream_edge_bg.wasm +0 -0
- package/wasm/package.json +1 -1
- package/dist/THIRD-PARTY-LICENSES.md +0 -10
package/README.md
CHANGED
|
@@ -4,189 +4,343 @@
|
|
|
4
4
|
[](https://npm.chart.dev/mdream)
|
|
5
5
|
[](https://github.com/harlan-zw/mdream/blob/main/LICENSE.md)
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
<img src="../../.github/logo.png" alt="mdream logo" width="200">
|
|
10
|
-
|
|
11
|
-
<p align="center">
|
|
12
|
-
<table>
|
|
13
|
-
<tbody>
|
|
14
|
-
<td align="center">
|
|
15
|
-
<sub>Made possible by my <a href="https://github.com/sponsors/harlan-zw">Sponsor Program 💖</a><br> Follow me <a href="https://twitter.com/harlan_zw">@harlan_zw</a> 🐦 • Join <a href="https://discord.gg/275MBUBvgP">Discord</a> for help</sub><br>
|
|
16
|
-
</td>
|
|
17
|
-
</tbody>
|
|
18
|
-
</table>
|
|
19
|
-
</p>
|
|
7
|
+
## Installation
|
|
20
8
|
|
|
21
|
-
|
|
9
|
+
```bash
|
|
10
|
+
# npm
|
|
11
|
+
npm install mdream
|
|
22
12
|
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
- 🚀 Fast: Convert 1.8MB of HTML to markdown in ~8ms (Rust), ~62ms (JS). Up to 7.9x speedup.
|
|
26
|
-
- ⚡ Tiny: 10kB gzip JS core, 45kB gzip with Rust WASM engine. Zero dependencies.
|
|
27
|
-
- ⚙️ Run anywhere: CLI, edge workers, browsers, Node, etc.
|
|
28
|
-
- 🔌 Extensible: Declarative plugin config for both engines, hook-based plugins via `@mdream/js`.
|
|
13
|
+
# pnpm
|
|
14
|
+
pnpm add mdream
|
|
29
15
|
|
|
30
|
-
|
|
16
|
+
# yarn
|
|
17
|
+
yarn add mdream
|
|
18
|
+
```
|
|
31
19
|
|
|
32
|
-
|
|
33
|
-
human readability.
|
|
20
|
+
For the JavaScript-only engine (hook-based plugins, splitter, pure HTML parser):
|
|
34
21
|
|
|
35
|
-
|
|
22
|
+
```bash
|
|
23
|
+
pnpm add @mdream/js
|
|
24
|
+
```
|
|
36
25
|
|
|
37
|
-
|
|
26
|
+
## Table of Contents
|
|
27
|
+
|
|
28
|
+
- [API Reference](#api-reference)
|
|
29
|
+
- [htmlToMarkdown()](#htmltomarkdown)
|
|
30
|
+
- [streamHtmlToMarkdown()](#streamhtmltomarkdown)
|
|
31
|
+
- [Engines](#engines)
|
|
32
|
+
- [Options](#options)
|
|
33
|
+
- [MdreamOptions (Rust engine)](#mdreamoptions-rust-engine)
|
|
34
|
+
- [MdreamOptions (JS engine)](#mdreamoptions-js-engine)
|
|
35
|
+
- [CleanOptions](#cleanoptions)
|
|
36
|
+
- [FrontmatterConfig](#frontmatterconfig)
|
|
37
|
+
- [TagOverride](#tagoverride)
|
|
38
|
+
- [FilterOptions](#filteroptions)
|
|
39
|
+
- [Presets](#presets)
|
|
40
|
+
- [Minimal Preset](#minimal-preset)
|
|
41
|
+
- [Built-in Plugins](#built-in-plugins)
|
|
42
|
+
- [Frontmatter](#frontmatter-plugin)
|
|
43
|
+
- [Isolate Main](#isolate-main-plugin)
|
|
44
|
+
- [Tailwind](#tailwind-plugin)
|
|
45
|
+
- [Filter](#filter-plugin)
|
|
46
|
+
- [Extraction](#extraction-plugin)
|
|
47
|
+
- [Hook-Based Plugins (JS Engine)](#hook-based-plugins-js-engine)
|
|
48
|
+
- [Plugin Hooks](#plugin-hooks)
|
|
49
|
+
- [createPlugin()](#createplugin)
|
|
50
|
+
- [Markdown Splitting (JS Engine)](#markdown-splitting-js-engine)
|
|
51
|
+
- [Basic Chunking](#basic-chunking)
|
|
52
|
+
- [Streaming Chunks](#streaming-chunks-memory-efficient)
|
|
53
|
+
- [Splitter Options](#splitter-options)
|
|
54
|
+
- [Chunk Metadata](#chunk-metadata)
|
|
55
|
+
- [Content Negotiation](#content-negotiation)
|
|
56
|
+
- [Pure HTML Parser (JS Engine)](#pure-html-parser-js-engine)
|
|
57
|
+
- [CLI Usage](#cli-usage)
|
|
58
|
+
- [Browser and Edge Usage](#browser-and-edge-usage)
|
|
59
|
+
- [Edge / Cloudflare Workers](#edge--cloudflare-workers)
|
|
60
|
+
- [Browser CDN (IIFE)](#browser-cdn-iife)
|
|
61
|
+
- [Web Worker](#web-worker)
|
|
62
|
+
- [llms.txt Generation](#llmstxt-generation)
|
|
63
|
+
- [Related Packages](#related-packages)
|
|
64
|
+
|
|
65
|
+
## API Reference
|
|
66
|
+
|
|
67
|
+
### `htmlToMarkdown()`
|
|
68
|
+
|
|
69
|
+
Converts a complete HTML string to Markdown synchronously.
|
|
70
|
+
|
|
71
|
+
**Rust engine** (`mdream`):
|
|
38
72
|
|
|
39
|
-
|
|
73
|
+
```ts
|
|
74
|
+
import { htmlToMarkdown } from 'mdream'
|
|
40
75
|
|
|
41
|
-
|
|
42
|
-
pnpm add mdream
|
|
76
|
+
function htmlToMarkdown(html: string, options?: Partial<MdreamOptions>): string
|
|
43
77
|
```
|
|
44
78
|
|
|
45
|
-
|
|
79
|
+
**JS engine** (`@mdream/js`):
|
|
46
80
|
|
|
47
|
-
|
|
48
|
-
|
|
81
|
+
```ts
|
|
82
|
+
import { htmlToMarkdown } from '@mdream/js'
|
|
49
83
|
|
|
50
|
-
|
|
84
|
+
function htmlToMarkdown(html: string, options?: Partial<MdreamOptions>): string
|
|
85
|
+
```
|
|
51
86
|
|
|
52
|
-
|
|
87
|
+
**Example:**
|
|
53
88
|
|
|
54
|
-
```
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
89
|
+
```ts
|
|
90
|
+
import { htmlToMarkdown } from 'mdream'
|
|
91
|
+
|
|
92
|
+
const markdown = htmlToMarkdown('<h1>Hello World</h1><p>Some content.</p>')
|
|
93
|
+
// # Hello World
|
|
94
|
+
//
|
|
95
|
+
// Some content.
|
|
58
96
|
```
|
|
59
97
|
|
|
60
|
-
|
|
98
|
+
### `streamHtmlToMarkdown()`
|
|
61
99
|
|
|
62
|
-
|
|
100
|
+
Converts an HTML `ReadableStream` to Markdown incrementally. Returns an `AsyncIterable<string>` that yields Markdown chunks as they are processed.
|
|
63
101
|
|
|
64
|
-
Converts a local HTML file to a Markdown file, using `tee` to write the output to a file and display it in the terminal.
|
|
65
102
|
|
|
66
|
-
```
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
103
|
+
```ts
|
|
104
|
+
import { streamHtmlToMarkdown } from 'mdream'
|
|
105
|
+
|
|
106
|
+
function streamHtmlToMarkdown(
|
|
107
|
+
htmlStream: ReadableStream<Uint8Array | string> | null,
|
|
108
|
+
options?: Partial<MdreamOptions>,
|
|
109
|
+
): AsyncIterable<string>
|
|
70
110
|
```
|
|
71
111
|
|
|
72
|
-
### CLI Options
|
|
73
112
|
|
|
74
|
-
|
|
75
|
-
- `--preset <preset>`: Conversion presets: minimal
|
|
76
|
-
- `--help`: Display help information
|
|
77
|
-
- `--version`: Display version information
|
|
113
|
+
**Example:**
|
|
78
114
|
|
|
79
|
-
|
|
115
|
+
```ts
|
|
116
|
+
import { streamHtmlToMarkdown } from 'mdream'
|
|
80
117
|
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
- `streamHtmlToMarkdown`: Best practice if you are fetching or reading from a local file.
|
|
118
|
+
const response = await fetch('https://example.com')
|
|
119
|
+
const stream = response.body
|
|
84
120
|
|
|
85
|
-
|
|
121
|
+
for await (const chunk of streamHtmlToMarkdown(stream, {
|
|
122
|
+
origin: 'https://example.com',
|
|
123
|
+
})) {
|
|
124
|
+
process.stdout.write(chunk)
|
|
125
|
+
}
|
|
126
|
+
```
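When the whole document is needed as one string, the yielded chunks can be concatenated. The helper below is a minimal sketch and not part of mdream: `collectChunks` and `fakeMarkdownStream` are hypothetical names, and the fake generator only stands in for the `AsyncIterable<string>` that `streamHtmlToMarkdown()` returns.

```ts
// Hypothetical helper (not exported by mdream): drains any
// AsyncIterable<string> into a single string.
async function collectChunks(chunks: AsyncIterable<string>): Promise<string> {
  let output = ''
  for await (const chunk of chunks) {
    output += chunk
  }
  return output
}

// Stand-in for streamHtmlToMarkdown(stream, options), used only so the
// sketch runs without a network request.
async function* fakeMarkdownStream(): AsyncGenerator<string> {
  yield '# Hello World\n\n'
  yield 'Some content.\n'
}

collectChunks(fakeMarkdownStream()).then((markdown) => {
  console.log(markdown)
})
```

Buffering gives up the memory benefit of streaming, so this probably only makes sense for small pages or tests.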
|
|
127
|
+
|
|
128
|
+
## Engines
|
|
86
129
|
|
|
87
130
|
Mdream includes two rendering engines, automatically selecting the best one for your environment:
|
|
88
|
-
|
|
89
|
-
|
|
131
|
+
|
|
132
|
+
| Engine | Package | Plugins | Use case |
|
|
133
|
+
|--------|---------|---------|----------|
|
|
134
|
+
| **Rust** (NAPI) | `mdream` | Declarative config only | Node.js (default) |
|
|
135
|
+
| **Rust** (WASM) | `mdream` | Declarative config only | Edge, browser |
|
|
136
|
+
| **JavaScript** | `@mdream/js` | Hook-based + declarative | Custom plugins, splitter |
|
|
90
137
|
|
|
91
138
|
```ts
|
|
92
|
-
|
|
139
|
+
// JavaScript engine (required for hook-based plugins)
|
|
140
|
+
import { htmlToMarkdown } from '@mdream/js'
|
|
93
141
|
|
|
94
|
-
// Rust NAPI engine
|
|
95
|
-
|
|
96
|
-
const markdown = htmlToMarkdown('<h1>Hello World</h1>')
|
|
142
|
+
// Rust NAPI engine (auto-selected in Node.js)
|
|
143
|
+
import { htmlToMarkdown } from 'mdream'
|
|
97
144
|
```
|
|
98
145
|
|
|
99
|
-
|
|
146
|
+
Both engines accept the same declarative plugin configuration (`origin`, `minimal`, `frontmatter`, `isolateMain`, `tailwind`, `filter`, `extraction`, `tagOverrides`, `clean`). The JS engine additionally supports `hooks` for imperative plugin transforms.
|
|
100
147
|
|
|
101
|
-
|
|
148
|
+
## Options
|
|
102
149
|
|
|
103
|
-
|
|
104
|
-
|
|
150
|
+
### MdreamOptions (Rust engine)
|
|
151
|
+
|
|
152
|
+
Defined in `mdream`:
|
|
105
153
|
|
|
106
|
-
|
|
154
|
+
```ts
|
|
155
|
+
interface MdreamOptions {
|
|
156
|
+
/** Base URL for resolving relative links and images. */
|
|
157
|
+
origin?: string
|
|
158
|
+
|
|
159
|
+
/**
|
|
160
|
+
* Enable minimal preset (frontmatter, isolateMain, tailwind, filter).
|
|
161
|
+
* Default: false
|
|
162
|
+
*/
|
|
163
|
+
minimal?: boolean
|
|
164
|
+
|
|
165
|
+
/**
|
|
166
|
+
* Post-processing cleanup. Pass `true` for all cleanup, or an object for specific features.
|
|
167
|
+
* Enabled by default when `minimal` is true.
|
|
168
|
+
*/
|
|
169
|
+
clean?: boolean | CleanOptions
|
|
170
|
+
|
|
171
|
+
/**
|
|
172
|
+
* Extract frontmatter from HTML <head>.
|
|
173
|
+
* - `true`: enable with defaults
|
|
174
|
+
* - `(fm) => void`: enable and receive structured data via callback
|
|
175
|
+
* - `FrontmatterConfig`: enable with config and optional callback
|
|
176
|
+
*/
|
|
177
|
+
frontmatter?: boolean | ((frontmatter: Record<string, string>) => void) | FrontmatterConfig
|
|
178
|
+
|
|
179
|
+
/** Isolate main content area. Default when minimal: true */
|
|
180
|
+
isolateMain?: boolean
|
|
181
|
+
|
|
182
|
+
/** Convert Tailwind utility classes to Markdown. Default when minimal: true */
|
|
183
|
+
tailwind?: boolean
|
|
184
|
+
|
|
185
|
+
/** Filter elements by CSS selectors. Default when minimal: excludes form, nav, footer, etc. */
|
|
186
|
+
filter?: { include?: string[], exclude?: string[], processChildren?: boolean }
|
|
187
|
+
|
|
188
|
+
/** Extract elements matching CSS selectors during conversion. */
|
|
189
|
+
extraction?: Record<string, (element: ExtractedElement) => void>
|
|
190
|
+
|
|
191
|
+
/** Override tag rendering behavior. String values act as aliases. */
|
|
192
|
+
tagOverrides?: Record<string, TagOverride | string>
|
|
193
|
+
}
|
|
107
194
|
```
|
|
108
195
|
|
|
109
|
-
###
|
|
196
|
+
### MdreamOptions (JS engine)
|
|
110
197
|
|
|
111
|
-
|
|
198
|
+
The JS engine extends the shared `EngineOptions` with hook-based plugin support:
|
|
112
199
|
|
|
113
|
-
```
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
const markdown = window.mdream.htmlToMarkdown('<h1>Hello</h1><p>World</p>')
|
|
119
|
-
console.log(markdown) // # Hello\n\nWorld
|
|
120
|
-
</script>
|
|
121
|
-
```
|
|
200
|
+
```ts
|
|
201
|
+
interface MdreamOptions extends EngineOptions {
|
|
202
|
+
/** Imperative hook-based transform plugins. JS engine only. */
|
|
203
|
+
hooks?: TransformPlugin[]
|
|
204
|
+
}
|
|
122
205
|
|
|
123
|
-
|
|
206
|
+
interface EngineOptions {
|
|
207
|
+
origin?: string
|
|
208
|
+
clean?: boolean | CleanOptions
|
|
209
|
+
plugins?: BuiltinPlugins
|
|
210
|
+
}
|
|
124
211
|
|
|
125
|
-
|
|
126
|
-
|
|
212
|
+
interface BuiltinPlugins {
|
|
213
|
+
filter?: { include?: (string | number)[], exclude?: (string | number)[], processChildren?: boolean }
|
|
214
|
+
frontmatter?: boolean | ((fm: Record<string, string>) => void) | FrontmatterConfig
|
|
215
|
+
isolateMain?: boolean
|
|
216
|
+
tailwind?: boolean
|
|
217
|
+
extraction?: Record<string, (element: ExtractedElement) => void>
|
|
218
|
+
tagOverrides?: Record<string, TagOverride | string>
|
|
219
|
+
}
|
|
127
220
|
```
|
|
128
221
|
|
|
129
|
-
|
|
130
|
-
- **unpkg**: `https://unpkg.com/mdream/dist/iife.js`
|
|
131
|
-
- **jsDelivr**: `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`
|
|
222
|
+
Note: The JS engine uses `options.plugins.filter` while the Rust engine uses `options.filter` directly.
|
|
132
223
|
|
|
133
|
-
|
|
224
|
+
### CleanOptions
|
|
134
225
|
|
|
135
|
-
|
|
136
|
-
import { htmlToMarkdown } from 'mdream'
|
|
226
|
+
Post-processing cleanup applied to the final Markdown output. All options default to `false` unless `clean: true` is set.
|
|
137
227
|
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
228
|
+
```ts
|
|
229
|
+
interface CleanOptions {
|
|
230
|
+
/** Strip tracking query parameters (utm_*, fbclid, gclid, etc.) from URLs */
|
|
231
|
+
urls?: boolean
|
|
232
|
+
/** Strip fragment-only links that don't match any heading in the output */
|
|
233
|
+
fragments?: boolean
|
|
234
|
+
/** Strip links with meaningless hrefs (#, javascript:void(0)) to plain text */
|
|
235
|
+
emptyLinks?: boolean
|
|
236
|
+
/** Collapse 3+ consecutive blank lines to 2 */
|
|
237
|
+
blankLines?: boolean
|
|
238
|
+
/** Strip links where text equals URL: [https://x.com](https://x.com) becomes https://x.com */
|
|
239
|
+
redundantLinks?: boolean
|
|
240
|
+
/** Strip self-referencing heading anchors: ## [Title](#title) becomes ## Title */
|
|
241
|
+
selfLinkHeadings?: boolean
|
|
242
|
+
/** Strip images with no alt text (decorative/tracking pixels) */
|
|
243
|
+
emptyImages?: boolean
|
|
244
|
+
/** Drop links that produce no visible text: [](url) is removed entirely */
|
|
245
|
+
emptyLinkText?: boolean
|
|
246
|
+
}
|
|
141
247
|
```
|
|
142
248
|
|
|
143
|
-
**
|
|
249
|
+
**Example:**
|
|
144
250
|
|
|
145
251
|
```ts
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
origin: 'https://example.com'
|
|
252
|
+
const markdown = htmlToMarkdown(html, {
|
|
253
|
+
clean: {
|
|
254
|
+
urls: true,
|
|
255
|
+
emptyLinks: true,
|
|
256
|
+
emptyImages: true,
|
|
257
|
+
},
|
|
153
258
|
})
|
|
259
|
+
```
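To illustrate what the `urls` option describes, here is a minimal sketch of tracking-parameter stripping built on the standard `URL` API. It is not mdream's implementation: `stripTrackingParams` is a hypothetical name, and the parameter list covers only the names given in the `CleanOptions` comment (`utm_*`, `fbclid`, `gclid`), while the library's own list is described as longer.

```ts
// Assumption: only the parameters named in the CleanOptions comment are handled.
const TRACKING_PARAMS = new Set(['fbclid', 'gclid'])

function stripTrackingParams(rawUrl: string): string {
  let url: URL
  try {
    url = new URL(rawUrl)
  }
  catch {
    // Relative or malformed URLs are returned untouched.
    return rawUrl
  }
  // Copy the keys first so that deleting does not disturb the iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith('utm_') || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key)
    }
  }
  return url.toString()
}

console.log(stripTrackingParams('https://example.com/post?utm_source=news&id=7&fbclid=abc'))
// https://example.com/post?id=7
```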
|
|
260
|
+
|
|
261
|
+
### FrontmatterConfig
|
|
154
262
|
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
263
|
+
```ts
|
|
264
|
+
interface FrontmatterConfig {
|
|
265
|
+
/** Additional static fields to include in frontmatter */
|
|
266
|
+
additionalFields?: Record<string, string>
|
|
267
|
+
/**
|
|
268
|
+
* Meta tag names to extract beyond the defaults.
|
|
269
|
+
* Defaults: description, keywords, author, date,
|
|
270
|
+
* og:title, og:description, twitter:title, twitter:description
|
|
271
|
+
*/
|
|
272
|
+
metaFields?: string[]
|
|
273
|
+
/** Callback to receive structured frontmatter data after conversion */
|
|
274
|
+
onExtract?: (frontmatter: Record<string, string>) => void
|
|
158
275
|
}
|
|
159
276
|
```
|
|
160
277
|
|
|
161
|
-
|
|
278
|
+
### TagOverride
|
|
162
279
|
|
|
163
|
-
|
|
280
|
+
Override how specific HTML tags are rendered in Markdown. String values act as aliases.
|
|
164
281
|
|
|
165
282
|
```ts
|
|
166
|
-
|
|
283
|
+
interface TagOverride {
|
|
284
|
+
/** Markdown string to insert when entering this tag */
|
|
285
|
+
enter?: string
|
|
286
|
+
/** Markdown string to insert when exiting this tag */
|
|
287
|
+
exit?: string
|
|
288
|
+
/** Spacing: [newlines before, newlines after] */
|
|
289
|
+
spacing?: number[]
|
|
290
|
+
/** Whether this tag should be treated as inline */
|
|
291
|
+
isInline?: boolean
|
|
292
|
+
/** Whether this tag is self-closing */
|
|
293
|
+
isSelfClosing?: boolean
|
|
294
|
+
/** Whether whitespace inside this tag should be collapsed */
|
|
295
|
+
collapsesInnerWhiteSpace?: boolean
|
|
296
|
+
/** Alias this tag to another tag's handler */
|
|
297
|
+
alias?: string
|
|
298
|
+
}
|
|
299
|
+
```
|
|
167
300
|
|
|
168
|
-
|
|
169
|
-
const { events, remainingHtml } = parseHtml(html)
|
|
301
|
+
**Example:**
|
|
170
302
|
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
303
|
+
```ts
|
|
304
|
+
const markdown = htmlToMarkdown(html, {
|
|
305
|
+
tagOverrides: {
|
|
306
|
+
// Treat <x-heading> like <h2>
|
|
307
|
+
'x-heading': 'h2',
|
|
308
|
+
// Custom rendering for <callout>
|
|
309
|
+
'callout': {
|
|
310
|
+
enter: '> **Note:** ',
|
|
311
|
+
exit: '',
|
|
312
|
+
spacing: [2, 2],
|
|
313
|
+
},
|
|
314
|
+
},
|
|
176
315
|
})
|
|
177
316
|
```
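The string shorthand can be read as a short form of the `alias` field. The sketch below shows that reading under the assumption that `'x-heading': 'h2'` is equivalent to `'x-heading': { alias: 'h2' }`. `normalizeTagOverrides` is a hypothetical helper, not an mdream export, and the interface is copied from the definition above so the block is self-contained.

```ts
interface TagOverride {
  enter?: string
  exit?: string
  spacing?: number[]
  isInline?: boolean
  isSelfClosing?: boolean
  collapsesInnerWhiteSpace?: boolean
  alias?: string
}

// Hypothetical helper: expands the string shorthand into the object form.
// Assumes a bare string means "alias this tag to that tag's handler".
function normalizeTagOverrides(
  overrides: Record<string, TagOverride | string>,
): Record<string, TagOverride> {
  const normalized: Record<string, TagOverride> = {}
  for (const [tag, override] of Object.entries(overrides)) {
    normalized[tag] = typeof override === 'string' ? { alias: override } : override
  }
  return normalized
}

const normalizedOverrides = normalizeTagOverrides({
  'x-heading': 'h2',
  'callout': { enter: '> **Note:** ', exit: '', spacing: [2, 2] },
})
console.log(normalizedOverrides['x-heading'])
```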
|
|
178
317
|
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
318
|
+
### FilterOptions
|
|
319
|
+
|
|
320
|
+
```ts
|
|
321
|
+
interface FilterOptions {
|
|
322
|
+
/** CSS selectors, tag names, or TAG_* constants for elements to include (all others excluded) */
|
|
323
|
+
include?: (string | number)[]
|
|
324
|
+
/** CSS selectors, tag names, or TAG_* constants for elements to exclude */
|
|
325
|
+
exclude?: (string | number)[]
|
|
326
|
+
/** Whether to also process children of matched elements. Default: true */
|
|
327
|
+
processChildren?: boolean
|
|
328
|
+
}
|
|
329
|
+
```
|
|
184
330
|
|
|
185
331
|
## Presets
|
|
186
332
|
|
|
187
333
|
### Minimal Preset
|
|
188
334
|
|
|
189
|
-
The `minimal` preset
|
|
335
|
+
The `minimal` preset enables the following plugins together:
|
|
336
|
+
|
|
337
|
+
- **frontmatter**: Extracts metadata from HTML `<head>` into YAML frontmatter
|
|
338
|
+
- **isolateMain**: Extracts the main content area, skipping navigation, headers, and footers
|
|
339
|
+
- **tailwind**: Converts Tailwind utility classes to Markdown formatting
|
|
340
|
+
- **filter**: Excludes `form`, `fieldset`, `object`, `embed`, `footer`, `aside`, `iframe`, `input`, `textarea`, `select`, `button`, `nav`
|
|
341
|
+
- **clean**: All post-processing cleanup enabled
|
|
342
|
+
|
|
343
|
+
**Rust engine:**
|
|
190
344
|
|
|
191
345
|
```ts
|
|
192
346
|
import { htmlToMarkdown } from 'mdream'
|
|
@@ -197,75 +351,170 @@ const markdown = htmlToMarkdown(html, {
|
|
|
197
351
|
})
|
|
198
352
|
```
|
|
199
353
|
|
|
200
|
-
**
|
|
201
|
-
- `isolateMain` - Extracts main content area
|
|
202
|
-
- `frontmatter` - Generates YAML frontmatter from meta tags
|
|
203
|
-
- `tailwind` - Converts Tailwind classes to Markdown
|
|
204
|
-
- `filter` - Excludes forms, navigation, buttons, footers, and other non-content elements
|
|
354
|
+
**JS engine (using `withMinimalPreset`):**
|
|
205
355
|
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
|
|
356
|
+
```ts
|
|
357
|
+
import { htmlToMarkdown } from '@mdream/js'
|
|
358
|
+
import { withMinimalPreset } from '@mdream/js/preset/minimal'
|
|
359
|
+
|
|
360
|
+
const markdown = htmlToMarkdown(html, withMinimalPreset({
|
|
361
|
+
origin: 'https://example.com',
|
|
362
|
+
}))
|
|
363
|
+
```
|
|
364
|
+
|
|
365
|
+
`withMinimalPreset()` returns an `EngineOptions` object with all plugin defaults applied. You can override individual plugins:
|
|
366
|
+
|
|
367
|
+
```ts
|
|
368
|
+
const markdown = htmlToMarkdown(html, withMinimalPreset({
|
|
369
|
+
plugins: {
|
|
370
|
+
frontmatter: false,
|
|
371
|
+
filter: { exclude: ['nav'] },
|
|
372
|
+
},
|
|
373
|
+
}))
|
|
209
374
|
```
|
|
210
375
|
|
|
211
|
-
##
|
|
376
|
+
## Built-in Plugins
|
|
377
|
+
|
|
378
|
+
All built-in plugins work with both the Rust and JS engines through declarative configuration.
|
|
379
|
+
|
|
380
|
+
### Frontmatter Plugin
|
|
212
381
|
|
|
213
|
-
|
|
382
|
+
Extracts metadata from the HTML `<head>` element and generates YAML frontmatter.
|
|
383
|
+
|
|
384
|
+
**Extracted fields by default:** `title`, `description`, `keywords`, `author`, `date`, `og:title`, `og:description`, `twitter:title`, `twitter:description`.
|
|
214
385
|
|
|
215
386
|
```ts
|
|
216
|
-
|
|
387
|
+
// Enable with defaults
|
|
388
|
+
htmlToMarkdown(html, { frontmatter: true })
|
|
389
|
+
|
|
390
|
+
// With callback to receive structured data
|
|
391
|
+
htmlToMarkdown(html, {
|
|
392
|
+
frontmatter: (fm) => {
|
|
393
|
+
console.log(fm.title)
|
|
394
|
+
console.log(fm.description)
|
|
395
|
+
},
|
|
396
|
+
})
|
|
217
397
|
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
|
|
224
|
-
extraction: {
|
|
225
|
-
'h2': el => console.log('Heading:', el.textContent),
|
|
226
|
-
'img[alt]': el => console.log('Image:', el.attributes.src),
|
|
398
|
+
// With full config
|
|
399
|
+
htmlToMarkdown(html, {
|
|
400
|
+
frontmatter: {
|
|
401
|
+
additionalFields: { source: 'https://example.com' },
|
|
402
|
+
metaFields: ['robots', 'viewport'],
|
|
403
|
+
onExtract: fm => console.log(fm),
|
|
227
404
|
},
|
|
228
|
-
tagOverrides: { 'custom-tag': { alias: 'div' } },
|
|
229
405
|
})
|
|
230
406
|
```
|
|
231
407
|
|
|
232
|
-
|
|
408
|
+
**Output example:**
|
|
233
409
|
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
| `filter` | `{ include?, exclude?, processChildren? }` | Filter elements by CSS selectors |
|
|
243
|
-
| `extraction` | `Record<string, (el) => void>` | Extract elements matching CSS selectors |
|
|
244
|
-
| `tagOverrides` | `Record<string, TagOverride \| string>` | Override tag rendering behavior |
|
|
410
|
+
```yaml
|
|
411
|
+
---
|
|
412
|
+
title: My Page Title
|
|
413
|
+
meta:
|
|
414
|
+
description: A page description
|
|
415
|
+
'og:title': My Page Title
|
|
416
|
+
---
|
|
417
|
+
```
|
|
245
418
|
|
|
246
|
-
###
|
|
419
|
+
### Isolate Main Plugin
|
|
247
420
|
|
|
248
|
-
|
|
421
|
+
Isolates the main content area using the following priority:
|
|
422
|
+
|
|
423
|
+
1. If an explicit `<main>` element exists (within 5 depth levels), use its content exclusively
|
|
424
|
+
2. Otherwise, find content between the first heading tag (`h1`-`h6`) and the first `<footer>`
|
|
425
|
+
3. Headings inside `<header>` tags are skipped during fallback detection
|
|
426
|
+
4. The `<head>` section is always passed through for other plugins (e.g., frontmatter)
|
|
249
427
|
|
|
250
428
|
```ts
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
import { htmlToMarkdown } from 'mdream'
|
|
429
|
+
htmlToMarkdown(html, { isolateMain: true })
|
|
430
|
+
```
|
|
254
431
|
|
|
255
|
-
|
|
256
|
-
const article = new Readability(dom.window.document).parse()
|
|
432
|
+
### Tailwind Plugin
|
|
257
433
|
|
|
258
|
-
|
|
259
|
-
|
|
260
|
-
|
|
261
|
-
|
|
434
|
+
Converts Tailwind CSS utility classes to semantic Markdown formatting:
|
|
435
|
+
|
|
436
|
+
| Tailwind Class | Markdown Output |
|
|
437
|
+
|---|---|
|
|
438
|
+
| `font-bold`, `font-semibold`, `font-medium`, `font-extrabold`, `font-black` | `**bold**` |
|
|
439
|
+
| `italic`, `font-italic` | `*italic*` |
|
|
440
|
+
| `line-through` | `~~strikethrough~~` |
|
|
441
|
+
| `hidden`, `invisible` | Content removed |
|
|
442
|
+
| `absolute`, `fixed`, `sticky` | Content removed |
|
|
443
|
+
|
|
444
|
+
Supports responsive breakpoint prefixes (`sm:`, `md:`, `lg:`, `xl:`, `2xl:`) with mobile-first resolution.
|
|
445
|
+
|
|
446
|
+
```ts
|
|
447
|
+
htmlToMarkdown(html, { tailwind: true })
|
|
262
448
|
```
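The mapping in the table can be pictured as a lookup from class names to Markdown wrappers. The following is a minimal sketch, not the plugin's implementation: `applyTailwind` is a hypothetical function, responsive prefixes such as `md:` are deliberately ignored, and the nesting order of the markers is an arbitrary choice made for the example.

```ts
// Class groups taken from the table above.
const BOLD = new Set(['font-bold', 'font-semibold', 'font-medium', 'font-extrabold', 'font-black'])
const ITALIC = new Set(['italic', 'font-italic'])
const STRIKE = new Set(['line-through'])
const REMOVED = new Set(['hidden', 'invisible', 'absolute', 'fixed', 'sticky'])

// Hypothetical helper: wraps `text` according to the classes on its element.
function applyTailwind(text: string, className: string): string {
  const classes = className.split(/\s+/).filter(Boolean)
  // Hidden or out-of-flow content is dropped entirely.
  if (classes.some(c => REMOVED.has(c))) {
    return ''
  }
  let out = text
  if (classes.some(c => STRIKE.has(c))) {
    out = `~~${out}~~`
  }
  if (classes.some(c => ITALIC.has(c))) {
    out = `*${out}*`
  }
  if (classes.some(c => BOLD.has(c))) {
    out = `**${out}**`
  }
  return out
}

console.log(applyTailwind('Hello', 'font-bold italic')) // ***Hello***
console.log(applyTailwind('Sale', 'line-through text-sm')) // ~~Sale~~
```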
|
|
263
449
|
|
|
264
|
-
|
|
450
|
+
### Filter Plugin
|
|
451
|
+
|
|
452
|
+
Filters HTML elements by CSS selectors, tag names, or `TAG_*` constants.
|
|
453
|
+
|
|
454
|
+
```ts
|
|
455
|
+
// Exclude navigation, sidebar, footer
|
|
456
|
+
htmlToMarkdown(html, {
|
|
457
|
+
filter: {
|
|
458
|
+
exclude: ['nav', '#sidebar', '.footer', 'aside'],
|
|
459
|
+
},
|
|
460
|
+
})
|
|
461
|
+
|
|
462
|
+
// Include only specific elements
|
|
463
|
+
htmlToMarkdown(html, {
|
|
464
|
+
filter: {
|
|
465
|
+
include: ['article', 'main'],
|
|
466
|
+
},
|
|
467
|
+
})
|
|
468
|
+
```
|
|
469
|
+
|
|
470
|
+
The JS engine also supports `TAG_*` integer constants for filtering:
|
|
471
|
+
|
|
472
|
+
```ts
|
|
473
|
+
import { TAG_FOOTER, TAG_NAV } from '@mdream/js'
|
|
474
|
+
|
|
475
|
+
htmlToMarkdown(html, {
|
|
476
|
+
plugins: {
|
|
477
|
+
filter: { exclude: [TAG_NAV, TAG_FOOTER] },
|
|
478
|
+
},
|
|
479
|
+
})
|
|
480
|
+
```
|
|
481
|
+
|
|
482
|
+
Elements with `style="position: absolute"` or `style="position: fixed"` are also automatically excluded when the filter plugin is active.
|
|
483
|
+
|
|
484
|
+
### Extraction Plugin
|
|
485
|
+
|
|
486
|
+
Extracts elements matching CSS selectors during conversion. Callbacks receive the matched element with its accumulated text content and attributes.
|
|
487
|
+
|
|
488
|
+
```ts
|
|
489
|
+
htmlToMarkdown(html, {
|
|
490
|
+
extraction: {
|
|
491
|
+
'h2': (el) => {
|
|
492
|
+
console.log('Heading:', el.textContent)
|
|
493
|
+
},
|
|
494
|
+
'img[alt]': (el) => {
|
|
495
|
+
console.log('Image:', el.attributes.src, el.attributes.alt)
|
|
496
|
+
},
|
|
497
|
+
'a[href]': (el) => {
|
|
498
|
+
console.log('Link:', el.textContent, el.attributes.href)
|
|
499
|
+
},
|
|
500
|
+
},
|
|
501
|
+
})
|
|
502
|
+
```
|
|
503
|
+
|
|
504
|
+
The `ExtractedElement` interface:
|
|
505
|
+
|
|
506
|
+
```ts
|
|
507
|
+
interface ExtractedElement {
|
|
508
|
+
selector: string
|
|
509
|
+
tagName: string
|
|
510
|
+
textContent: string
|
|
511
|
+
attributes: Record<string, string>
|
|
512
|
+
}
|
|
513
|
+
```
|
|
265
514
|
|
|
266
515
|
## Hook-Based Plugins (JS Engine)
|
|
267
516
|
|
|
268
|
-
|
|
517
|
+
The JS engine (`@mdream/js`) supports imperative hook-based plugins for custom transform logic. These allow you to intercept and modify the conversion pipeline at multiple stages.
|
|
269
518
|
|
|
270
519
|
```ts
|
|
271
520
|
import { htmlToMarkdown } from '@mdream/js'
|
|
@@ -280,7 +529,7 @@ const myPlugin = createPlugin({
|
|
|
280
529
|
if (textNode.parent?.attributes?.id === 'highlight') {
|
|
281
530
|
return { content: `**${textNode.value}**`, skip: false }
|
|
282
531
|
}
|
|
283
|
-
}
|
|
532
|
+
},
|
|
284
533
|
})
|
|
285
534
|
|
|
286
535
|
const markdown = htmlToMarkdown(html, { hooks: [myPlugin] })
|
|
@@ -288,15 +537,91 @@ const markdown = htmlToMarkdown(html, { hooks: [myPlugin] })
|
|
|
288
537
|
|
|
289
538
|
### Plugin Hooks
|
|
290
539
|
|
|
291
|
-
|
|
292
|
-
|
|
293
|
-
|
|
294
|
-
|
|
295
|
-
|
|
540
|
+
```ts
|
|
541
|
+
interface TransformPlugin {
|
|
542
|
+
/**
|
|
543
|
+
* Called before any node processing. Return { skip: true } to skip the node.
|
|
544
|
+
*/
|
|
545
|
+
beforeNodeProcess?: (
|
|
546
|
+
event: NodeEvent,
|
|
547
|
+
state: MdreamRuntimeState,
|
|
548
|
+
) => undefined | void | { skip: boolean }
|
|
549
|
+
|
|
550
|
+
/**
|
|
551
|
+
* Called when entering an element node.
|
|
552
|
+
* Return a string to prepend to the output.
|
|
553
|
+
*/
|
|
554
|
+
onNodeEnter?: (
|
|
555
|
+
node: ElementNode,
|
|
556
|
+
state: MdreamRuntimeState,
|
|
557
|
+
) => string | undefined | void
|
|
558
|
+
|
|
559
|
+
/**
|
|
560
|
+
* Called when exiting an element node.
|
|
561
|
+
* Return a string to append to the output.
|
|
562
|
+
*/
|
|
563
|
+
onNodeExit?: (
|
|
564
|
+
node: ElementNode,
|
|
565
|
+
state: MdreamRuntimeState,
|
|
566
|
+
) => string | undefined | void
|
|
567
|
+
|
|
568
|
+
/**
|
|
569
|
+
* Called to process element attributes (e.g., extracting Tailwind classes).
|
|
570
|
+
*/
|
|
571
|
+
processAttributes?: (
|
|
572
|
+
node: ElementNode,
|
|
573
|
+
state: MdreamRuntimeState,
|
|
574
|
+
) => void
|
|
575
|
+
|
|
576
|
+
/**
|
|
577
|
+
* Called for each text node. Return { content, skip } to transform text.
|
|
578
|
+
* Return undefined for no transformation.
|
|
579
|
+
*/
|
|
580
|
+
processTextNode?: (
|
|
581
|
+
node: TextNode,
|
|
582
|
+
state: MdreamRuntimeState,
|
|
583
|
+
) => { content: string, skip: boolean } | undefined
|
|
584
|
+
}
|
|
585
|
+
```
|
|
586
|
+
|
|
587
|
+
### `createPlugin()`
|
|
588
|
+
|
|
589
|
+
A typed identity function for creating plugins with full TypeScript inference:
|
|
590
|
+
|
|
591
|
+
```ts
|
|
592
|
+
import { createPlugin } from '@mdream/js/plugins'
|
|
593
|
+
|
|
594
|
+
const plugin = createPlugin({
|
|
595
|
+
beforeNodeProcess({ node }) {
|
|
596
|
+
// Skip all div elements with class "ad"
|
|
597
|
+
if (node.type === 1 && node.attributes?.class?.includes('ad')) {
|
|
598
|
+
return { skip: true }
|
|
599
|
+
}
|
|
600
|
+
},
|
|
601
|
+
})
|
|
602
|
+
```
|
|
603
|
+
|
|
604
|
+
### Built-in Plugin Functions (JS Engine)
|
|
605
|
+
|
|
606
|
+
The following plugin factory functions are available from `@mdream/js/plugins`:
|
|
607
|
+
|
|
608
|
+
```ts
|
|
609
|
+
import {
|
|
610
|
+
createPlugin,
|
|
611
|
+
extractionCollectorPlugin,
|
|
612
|
+
extractionPlugin,
|
|
613
|
+
filterPlugin,
|
|
614
|
+
frontmatterPlugin,
|
|
615
|
+
isolateMainPlugin,
|
|
616
|
+
tailwindPlugin,
|
|
617
|
+
} from '@mdream/js/plugins'
|
|
618
|
+
```
|
|
619
|
+
|
|
620
|
+
## Markdown Splitting (JS Engine)
|
|
296
621
|
|
|
297
|
-
|
|
622
|
+
Split HTML into Markdown chunks during conversion. Compatible with the LangChain `Document` structure.
|
|
298
623
|
|
|
299
|
-
|
|
624
|
+
Available from `@mdream/js/splitter`.
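If a downstream consumer expects LangChain's `pageContent` field name rather than `content`, a small adapter is enough. The sketch below is an assumption-laden illustration: `toDocument` and `DocumentLike` are hypothetical names, the `MarkdownChunk` shape is copied from the Chunk Metadata section of this README, and `DocumentLike` is declared locally instead of importing LangChain.

```ts
// Chunk shape as documented in the Chunk Metadata section.
interface MarkdownChunk {
  content: string
  metadata: {
    headers?: Record<string, string>
    code?: string
    loc?: { lines: { from: number, to: number } }
  }
}

// Local stand-in for a LangChain Document (pageContent + metadata).
interface DocumentLike {
  pageContent: string
  metadata: Record<string, unknown>
}

// Hypothetical adapter: renames `content` and copies the metadata object.
function toDocument(chunk: MarkdownChunk): DocumentLike {
  return { pageContent: chunk.content, metadata: { ...chunk.metadata } }
}

const exampleDocument = toDocument({
  content: 'Run `npm install mdream`.',
  metadata: { headers: { h1: 'Documentation', h2: 'Installation' } },
})
console.log(exampleDocument.pageContent)
```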
|
|
300
625
|
|
|
301
626
|
### Basic Chunking
|
|
302
627
|
|
|
@@ -313,18 +638,17 @@ const html = `
|
|
|
313
638
|
`
|
|
314
639
|
|
|
315
640
|
const chunks = htmlToMarkdownSplitChunks(html, {
|
|
316
|
-
headersToSplitOn: [TAG_H2],
|
|
317
|
-
chunkSize: 1000,
|
|
318
|
-
chunkOverlap: 200,
|
|
319
|
-
stripHeaders: true
|
|
641
|
+
headersToSplitOn: [TAG_H2],
|
|
642
|
+
chunkSize: 1000,
|
|
643
|
+
chunkOverlap: 200,
|
|
644
|
+
stripHeaders: true,
|
|
320
645
|
})
|
|
321
646
|
|
|
322
|
-
// Each chunk includes content and metadata
|
|
323
647
|
chunks.forEach((chunk) => {
|
|
324
648
|
console.log(chunk.content)
|
|
325
649
|
console.log(chunk.metadata.headers) // { h1: "Documentation", h2: "Installation" }
|
|
326
650
|
console.log(chunk.metadata.code) // Language if chunk contains code
|
|
327
|
-
console.log(chunk.metadata.loc) //
|
|
651
|
+
console.log(chunk.metadata.loc) // { lines: { from: 1, to: 5 } }
|
|
328
652
|
})
|
|
329
653
|
```
|
|
330
654
|
|
|
@@ -335,54 +659,82 @@ For large documents, use the generator version to process chunks one at a time:
|
|
|
335
659
|
```ts
|
|
336
660
|
import { htmlToMarkdownSplitChunksStream } from '@mdream/js/splitter'
|
|
337
661
|
|
|
338
|
-
// Process chunks incrementally - lower memory usage
|
|
339
662
|
for (const chunk of htmlToMarkdownSplitChunksStream(html, options)) {
|
|
340
|
-
await processChunk(chunk)
|
|
663
|
+
await processChunk(chunk)
|
|
341
664
|
|
|
342
|
-
//
|
|
665
|
+
// Early termination supported
|
|
343
666
|
if (foundTarget)
|
|
344
667
|
break
|
|
345
668
|
}
|
|
346
669
|
```
|
|
347
670
|
|
|
348
|
-
|
|
349
|
-
- Lower memory usage - chunks aren't stored in an array
|
|
350
|
-
- Early termination - stop processing when you find what you need
|
|
351
|
-
- Better for large documents
|
|
352
|
-
|
|
353
|
-
### Splitting Options
|
|
671
|
+
### Splitter Options
|
|
354
672
|
|
|
355
673
|
```ts
|
|
356
674
|
interface SplitterOptions {
|
|
357
|
-
// Structural splitting
|
|
358
|
-
|
|
675
|
+
// --- Structural splitting ---
|
|
676
|
+
|
|
677
|
+
/**
|
|
678
|
+
* Header tag IDs to split on (TAG_H1 through TAG_H6).
|
|
679
|
+
* Default: [TAG_H2, TAG_H3, TAG_H4, TAG_H5, TAG_H6]
|
|
680
|
+
*/
|
|
681
|
+
headersToSplitOn?: number[]
|
|
682
|
+
|
|
683
|
+
// --- Size-based splitting ---
|
|
684
|
+
|
|
685
|
+
/** Maximum chunk size in characters. Default: 1000 */
|
|
686
|
+
chunkSize?: number
|
|
687
|
+
|
|
688
|
+
/** Overlap between chunks for context preservation. Default: 200 */
|
|
689
|
+
chunkOverlap?: number
|
|
690
|
+
|
|
691
|
+
/**
|
|
692
|
+
* Custom length function (e.g., a token counter for LLM applications).
|
|
693
|
+
* Default: (text) => text.length
|
|
694
|
+
*/
|
|
695
|
+
lengthFunction?: (text: string) => number
|
|
696
|
+
|
|
697
|
+
// --- Output formatting ---
|
|
698
|
+
|
|
699
|
+
/** Remove headers from chunk content. Default: true */
|
|
700
|
+
stripHeaders?: boolean
|
|
359
701
|
|
|
360
|
-
|
|
361
|
-
|
|
362
|
-
chunkOverlap?: number // Overlap between chunks. Default: 200
|
|
363
|
-
lengthFunction?: (text: string) => number // Custom length (e.g., token count)
|
|
702
|
+
/** Split into individual lines. Default: false */
|
|
703
|
+
returnEachLine?: boolean
|
|
364
704
|
|
|
365
|
-
|
|
366
|
-
|
|
367
|
-
returnEachLine?: boolean // Split into individual lines. Default: false
|
|
705
|
+
/** Keep separators in the split chunks. Default: false */
|
|
706
|
+
keepSeparator?: boolean
|
|
368
707
|
|
|
369
|
-
// Standard options
|
|
370
|
-
|
|
371
|
-
|
|
708
|
+
// --- Standard options ---
|
|
709
|
+
|
|
710
|
+
/** Base URL for resolving relative links/images */
|
|
711
|
+
origin?: string
|
|
712
|
+
|
|
713
|
+
/** Declarative built-in plugin config */
|
|
714
|
+
plugins?: BuiltinPlugins
|
|
715
|
+
|
|
716
|
+
/** Hook-based plugins (JS engine only) */
|
|
717
|
+
hooks?: TransformPlugin[]
|
|
718
|
+
|
|
719
|
+
/** Post-processing cleanup */
|
|
720
|
+
clean?: boolean | CleanOptions
|
|
372
721
|
}
|
|
373
722
|
```
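The `lengthFunction` option is described above as a hook for token counting. The following is a minimal sketch of that idea, assuming a rough four-characters-per-token heuristic. The `approxTokenCount` helper and the ratio are illustrative assumptions rather than part of mdream; a real tokenizer would give exact counts.

```ts
// Hypothetical helper: estimates tokens as roughly 4 characters each.
// The ratio is a rule of thumb for English text, not an mdream API.
function approxTokenCount(text: string): number {
  return Math.ceil(text.length / 4)
}

// Options shaped like the SplitterOptions fields documented above.
const splitterOptions = {
  chunkSize: 256, // presumably measured by lengthFunction, so about 256 tokens
  chunkOverlap: 32,
  lengthFunction: approxTokenCount,
}

console.log(splitterOptions.lengthFunction('Hello, world!')) // 13 characters → 4
```

An object like this could then be passed as the second argument to `htmlToMarkdownSplitChunks(html, splitterOptions)`.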
|
|
374
723
|
|
|
375
724
|
### Chunk Metadata
|
|
376
725
|
|
|
377
|
-
Each chunk includes
|
|
726
|
+
Each chunk includes metadata for context:
|
|
378
727
|
|
|
379
728
|
```ts
|
|
380
729
|
interface MarkdownChunk {
|
|
381
730
|
content: string
|
|
382
731
|
metadata: {
|
|
383
|
-
|
|
384
|
-
|
|
385
|
-
|
|
732
|
+
/** Header hierarchy at this chunk position */
|
|
733
|
+
headers?: Record<string, string> // { h1: "Title", h2: "Section" }
|
|
734
|
+
/** Code block language if chunk contains code */
|
|
735
|
+
code?: string
|
|
736
|
+
/** Line number range in the original document */
|
|
737
|
+
loc?: {
|
|
386
738
|
lines: { from: number, to: number }
|
|
387
739
|
}
|
|
388
740
|
}
|
|
@@ -391,7 +743,7 @@ interface MarkdownChunk {
|
|
|
391
743
|
|
|
392
744
|
### Use with Presets
|
|
393
745
|
|
|
394
|
-
Combine splitting with presets
|
|
746
|
+
Combine splitting with presets:
|
|
395
747
|
|
|
396
748
|
```ts
|
|
397
749
|
import { TAG_H2 } from '@mdream/js'
|
|
@@ -401,13 +753,189 @@ import { htmlToMarkdownSplitChunks } from '@mdream/js/splitter'
|
|
|
401
753
|
const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
|
|
402
754
|
headersToSplitOn: [TAG_H2],
|
|
403
755
|
chunkSize: 500,
|
|
404
|
-
origin: 'https://example.com'
|
|
756
|
+
origin: 'https://example.com',
|
|
405
757
|
}))
|
|
406
758
|
```
|
|
407
759
|
|
|
760
|
+
## Content Negotiation
|
|
761
|
+
|
|
762
|
+
The `@mdream/js/negotiate` module provides HTTP content negotiation utilities for serving Markdown to LLM clients:
|
|
763
|
+
|
|
764
|
+
```ts
|
|
765
|
+
import { parseAcceptHeader, shouldServeMarkdown } from '@mdream/js/negotiate'
|
|
766
|
+
|
|
767
|
+
// Check if client prefers markdown
|
|
768
|
+
const serveMarkdown = shouldServeMarkdown(
|
|
769
|
+
request.headers.get('accept'),
|
|
770
|
+
request.headers.get('sec-fetch-dest'),
|
|
771
|
+
)
|
|
772
|
+
|
|
773
|
+
if (serveMarkdown) {
|
|
774
|
+
return new Response(markdown, {
|
|
775
|
+
headers: { 'Content-Type': 'text/markdown' },
|
|
776
|
+
})
|
|
777
|
+
}
|
|
778
|
+
```
|
|
779
|
+
|
|
780
|
+
`shouldServeMarkdown()` uses Accept header quality weights and position ordering. It returns `true` when `text/markdown` or `text/plain` has higher priority than `text/html`. Browser navigation requests (`sec-fetch-dest: document`) always return `false`.
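To make the ranking rule concrete, here is a simplified sketch of how quality weights and position ordering can be compared. It is not the `@mdream/js/negotiate` implementation: `parseAccept` and `prefersMarkdown` are hypothetical names, and edge cases such as wildcards are deliberately ignored.

```ts
// Illustrative only: a simplified Accept-header ranking, not mdream's code.
interface AcceptEntry { mediaType: string, q: number, position: number }

// Split an Accept header into entries with a quality weight and list position.
function parseAccept(header: string): AcceptEntry[] {
  return header
    .split(',')
    .map(part => part.trim())
    .filter(part => part.length > 0)
    .map((part, position) => {
      const [mediaType, ...params] = part.split(';').map(s => s.trim())
      const qParam = params.find(p => p.startsWith('q='))
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1
      return {
        mediaType: (mediaType ?? '').toLowerCase(),
        q: Number.isNaN(q) ? 1 : q, // missing or malformed q defaults to 1
        position,
      }
    })
}

// Hypothetical stand-in for the decision described above.
function prefersMarkdown(accept: string | null, secFetchDest: string | null): boolean {
  if (secFetchDest === 'document') return false // browser navigation
  if (!accept) return false
  const entries = parseAccept(accept)
  // Best entry among the given types: highest q, then earliest position.
  const best = (types: string[]) => entries
    .filter(e => types.includes(e.mediaType))
    .sort((a, b) => b.q - a.q || a.position - b.position)[0]
  const md = best(['text/markdown', 'text/plain'])
  if (!md || md.q === 0) return false
  const html = best(['text/html'])
  if (!html) return true
  return md.q > html.q || (md.q === html.q && md.position < html.position)
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.9', null)) // true
console.log(prefersMarkdown('text/html, text/markdown', null)) // false
```

In the second call both types carry the default weight of 1, so the earlier position of `text/html` decides the outcome.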
|
|
781
|
+
|
|
782
|
+
## Pure HTML Parser (JS Engine)
|
|
783
|
+
|
|
784
|
+
If you only need to parse HTML into a DOM-like event stream without converting to Markdown, use `parseHtml` from the JS engine:
|
|
785
|
+
|
|
786
|
+
```ts
|
|
787
|
+
import { parseHtml } from '@mdream/js/parse'
|
|
788
|
+
|
|
789
|
+
const html = '<div><h1>Title</h1><p>Content</p></div>'
|
|
790
|
+
const { events, remainingHtml } = parseHtml(html)
|
|
791
|
+
|
|
792
|
+
events.forEach((event) => {
|
|
793
|
+
if (event.type === 0 && event.node.type === 1) { // Enter + Element
|
|
794
|
+
console.log('Entering element:', event.node.name)
|
|
795
|
+
}
|
|
796
|
+
})
|
|
797
|
+
```
|
|
798
|
+
|
|
799
|
+
The parser provides:
|
|
800
|
+
- Pure AST event stream with no markdown generation overhead
|
|
801
|
+
- Enter/exit events for each element and text node
|
|
802
|
+
- Plugin support during parsing
|
|
803
|
+
- Streaming compatible via `parseHtmlStream()`
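
The numeric comparisons in the example above can be easier to read with local names. The sketch below assumes, as that example's inline comment states, that `0` marks an enter event and `1` marks an element node. The constant names, the `ParsedEvent` type, and the `enteredElementNames` helper are all hypothetical and defined locally, not exports of `@mdream/js`.

```ts
// Local names for the numeric values used in the example above.
// Assumption: 0 = enter event, 1 = element node (per the inline comment there).
const EVENT_ENTER = 0
const NODE_ELEMENT = 1

// Minimal structural type covering only the fields this sketch reads.
interface ParsedEvent {
  type: number
  node: { type: number, name?: string }
}

// Collect the tag names of every element entered, in document order.
function enteredElementNames(events: ParsedEvent[]): string[] {
  return events
    .filter(e => e.type === EVENT_ENTER && e.node.type === NODE_ELEMENT)
    .map(e => e.node.name ?? '')
}
```

If the `events` array returned by `parseHtml` matches this shape, it should be possible to pass it to `enteredElementNames` directly.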
|
|
804
|
+
|
|
805
|
+
## CLI Usage
|
|
806
|
+
|
|
807
|
+
Mdream provides a CLI that works with Unix pipes.
|
|
808
|
+
|
|
809
|
+
**Pipe a web page to Markdown:**
|
|
810
|
+
|
|
811
|
+
```bash
|
|
812
|
+
curl -s https://en.wikipedia.org/wiki/Markdown \
|
|
813
|
+
| npx mdream --origin https://en.wikipedia.org --preset minimal \
|
|
814
|
+
| tee output.md
|
|
815
|
+
```
|
|
816
|
+
|
|
817
|
+
**Local file to Markdown:**
|
|
818
|
+
|
|
819
|
+
```bash
|
|
820
|
+
cat index.html \
|
|
821
|
+
| npx mdream --preset minimal \
|
|
822
|
+
| tee output.md
|
|
823
|
+
```
|
|
824
|
+
|
|
825
|
+
### CLI Options
|
|
826
|
+
|
|
827
|
+
| Option | Description |
|
|
828
|
+
|--------|-------------|
|
|
829
|
+
| `--origin <url>` | Base URL for resolving relative links and images |
|
|
830
|
+
| `--preset minimal` | Enable the minimal preset |
|
|
831
|
+
| `-h`, `--help` | Display help information |
|
|
832
|
+
|
|
833
|
+
The CLI reads HTML from stdin and writes Markdown to stdout. It uses the streaming API internally.
|
|
834
|
+
|
|
835
|
+
## Browser and Edge Usage
|
|
836
|
+
|
|
837
|
+
### Edge / Cloudflare Workers
|
|
838
|
+
|
|
839
|
+
For edge runtimes (Cloudflare Workers, Vercel Edge), `mdream` automatically selects the WASM build via export conditions (`workerd`, `edge-light`). Both `htmlToMarkdown` and `streamHtmlToMarkdown` are available:
|
|
840
|
+
|
|
841
|
+
```ts
|
|
842
|
+
import { htmlToMarkdown, streamHtmlToMarkdown } from 'mdream'
|
|
843
|
+
|
|
844
|
+
// WASM engine auto-selected via export conditions
|
|
845
|
+
const markdown = htmlToMarkdown('<h1>Hello World</h1>')
|
|
846
|
+
|
|
847
|
+
// Streaming works the same as Node.js
|
|
848
|
+
const response = await fetch('https://example.com')
|
|
849
|
+
for await (const chunk of streamHtmlToMarkdown(response.body)) {
|
|
850
|
+
// process chunk
|
|
851
|
+
}
|
|
852
|
+
```
|
|
853
|
+
|
|
854
|
+
You can also import the edge entry point directly:
|
|
855
|
+
|
|
856
|
+
```ts
|
|
857
|
+
import { htmlToMarkdown } from 'mdream/worker'
|
|
858
|
+
```
|
|
859
|
+
|
|
860
|
+
The `mdream/worker` entry provides an async API since WASM must be initialized first:
|
|
861
|
+
|
|
862
|
+
```ts
|
|
863
|
+
import { htmlToMarkdown, initWorker, terminateWorker } from 'mdream/worker'
|
|
864
|
+
|
|
865
|
+
// Initialize once with the WASM URL
|
|
866
|
+
await initWorker('https://cdn.example.com/mdream_edge_bg.wasm')
|
|
867
|
+
|
|
868
|
+
// Convert (returns Promise<string>)
|
|
869
|
+
const markdown = await htmlToMarkdown('<h1>Hello</h1>')
|
|
870
|
+
|
|
871
|
+
// Clean up when done
|
|
872
|
+
terminateWorker()
|
|
873
|
+
```
|
|
874
|
+
|
|
875
|
+
### Browser CDN (IIFE)
|
|
876
|
+
|
|
877
|
+
Use mdream directly via CDN with no build step. Call `init()` once to load the WASM binary, then use `htmlToMarkdown()` synchronously.
|
|
878
|
+
|
|
879
|
+
```html
|
|
880
|
+
<script src="https://unpkg.com/mdream/dist/iife.js"></script>
|
|
881
|
+
<script type="module">
|
|
882
|
+
await window.mdream.init()
|
|
883
|
+
const markdown = window.mdream.htmlToMarkdown('<h1>Hello</h1><p>World</p>')
|
|
884
|
+
console.log(markdown) // # Hello\n\nWorld
|
|
885
|
+
</script>
|
|
886
|
+
```
|
|
887
|
+
|
|
888
|
+
You can pass a custom WASM URL or `ArrayBuffer` to `init()`:
|
|
889
|
+
|
|
890
|
+
```js
|
|
891
|
+
// Custom URL
|
|
892
|
+
await window.mdream.init('https://cdn.example.com/mdream_edge_bg.wasm')
|
|
893
|
+
|
|
894
|
+
// Pre-loaded ArrayBuffer
|
|
895
|
+
const wasmBytes = await fetch('/wasm/mdream_edge_bg.wasm').then(r => r.arrayBuffer())
|
|
896
|
+
await window.mdream.init(wasmBytes)
|
|
897
|
+
```
|
|
898
|
+
|
|
899
|
+
**CDN Options:**
|
|
900
|
+
- **unpkg**: `https://unpkg.com/mdream/dist/iife.js`
|
|
901
|
+
- **jsDelivr**: `https://cdn.jsdelivr.net/npm/mdream/dist/iife.js`
|
|
902
|
+
|
|
903
|
+
### Web Worker
|
|
904
|
+
|
|
905
|
+
For browser environments, `mdream/worker` runs conversions off the main thread using a Web Worker:
|
|
906
|
+
|
|
907
|
+
```ts
|
|
908
|
+
import { htmlToMarkdown, initWorker, terminateWorker } from 'mdream/worker'
|
|
909
|
+
|
|
910
|
+
await initWorker('/path/to/mdream_edge_bg.wasm')
|
|
911
|
+
|
|
912
|
+
const markdown = await htmlToMarkdown('<h1>Hello</h1>')
|
|
913
|
+
|
|
914
|
+
// Clean up
|
|
915
|
+
terminateWorker()
|
|
916
|
+
```
|
|
917
|
+
|
|
918
|
+
## Content Extraction with Readability
|
|
919
|
+
|
|
920
|
+
For advanced content extraction (article detection, boilerplate removal), use [@mozilla/readability](https://github.com/mozilla/readability) before mdream:
|
|
921
|
+
|
|
922
|
+
```ts
|
|
923
|
+
import { Readability } from '@mozilla/readability'
|
|
924
|
+
import { JSDOM } from 'jsdom'
|
|
925
|
+
import { htmlToMarkdown } from 'mdream'
|
|
926
|
+
|
|
927
|
+
const dom = new JSDOM(html, { url: 'https://example.com' })
|
|
928
|
+
const article = new Readability(dom.window.document).parse()
|
|
929
|
+
|
|
930
|
+
if (article) {
|
|
931
|
+
const markdown = htmlToMarkdown(article.content)
|
|
932
|
+
// article.title, article.excerpt, article.byline also available
|
|
933
|
+
}
|
|
934
|
+
```
|
|
935
|
+
|
|
408
936
|
## llms.txt Generation
|
|
409
937
|
|
|
410
|
-
For llms.txt artifact generation, use `@mdream/llms-txt
|
|
938
|
+
For llms.txt artifact generation, use the separate `@mdream/llms-txt` package. It accepts pre-converted Markdown and generates `llms.txt` and `llms-full.txt` artifacts.
|
|
411
939
|
|
|
412
940
|
```ts
|
|
413
941
|
import { generateLlmsTxtArtifacts } from '@mdream/llms-txt'
|
|
@@ -427,11 +955,18 @@ console.log(result.llmsTxt) // llms.txt content
|
|
|
427
955
|
console.log(result.llmsFullTxt) // llms-full.txt content
|
|
428
956
|
```
|
|
429
957
|
|
|
430
|
-
##
|
|
958
|
+
## Related Packages
|
|
431
959
|
|
|
432
|
-
|
|
960
|
+
| Package | Description |
|
|
961
|
+
|---------|-------------|
|
|
962
|
+
| [`mdream`](https://npmjs.com/package/mdream) | Core HTML to Markdown converter (Rust + WASM engine) |
|
|
963
|
+
| [`@mdream/js`](https://npmjs.com/package/@mdream/js) | JavaScript engine with hook-based plugins and splitter |
|
|
964
|
+
| [`@mdream/llms-txt`](https://github.com/harlan-zw/mdream/tree/main/packages/llms-txt) | Engine-agnostic llms.txt artifact generation |
|
|
965
|
+
| [`@mdream/crawl`](https://github.com/harlan-zw/mdream/tree/main/packages/crawl) | Site-wide crawler for llms.txt generation |
|
|
966
|
+
| [`@mdream/vite`](https://github.com/harlan-zw/mdream/tree/main/packages/vite) | Vite plugin integration |
|
|
967
|
+
| [`@mdream/nuxt`](https://github.com/harlan-zw/mdream/tree/main/packages/nuxt) | Nuxt module integration |
|
|
968
|
+
| [`@mdream/action`](https://github.com/harlan-zw/mdream/tree/main/packages/action) | GitHub Actions integration |
|
|
433
969
|
|
|
434
970
|
## License
|
|
435
971
|
|
|
436
972
|
Licensed under the [MIT license](https://github.com/harlan-zw/mdream/blob/main/LICENSE.md).
|
|
437
|
-
|