@kreuzberg/html-to-markdown-wasm 3.1.0 → 3.2.0
- package/dist/html_to_markdown_wasm.d.ts +482 -34
- package/dist/html_to_markdown_wasm_bg.js +4206 -307
- package/dist/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist/html_to_markdown_wasm_bg.wasm.d.ts +378 -4
- package/dist/package.json +2 -12
- package/dist-node/html_to_markdown_wasm.d.ts +482 -34
- package/dist-node/html_to_markdown_wasm.js +4243 -310
- package/dist-node/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-node/html_to_markdown_wasm_bg.wasm.d.ts +378 -4
- package/dist-node/package.json +3 -13
- package/dist-web/html_to_markdown_wasm.d.ts +859 -37
- package/dist-web/html_to_markdown_wasm.js +4207 -309
- package/dist-web/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-web/html_to_markdown_wasm_bg.wasm.d.ts +378 -4
- package/dist-web/package.json +2 -12
- package/package.json +3 -3
- package/LICENSE +0 -21
- package/README.md +0 -586
- package/dist/LICENSE +0 -21
- package/dist/README.md +0 -154
- package/dist-node/LICENSE +0 -21
- package/dist-node/README.md +0 -154
- package/dist-web/LICENSE +0 -21
- package/dist-web/README.md +0 -154
package/README.md
DELETED
|
@@ -1,586 +0,0 @@
|
|
|
1
|
-
# @kreuzberg/html-to-markdown-wasm
|
|
2
|
-
|
|
3
|
-
> **npm package:** `@kreuzberg/html-to-markdown-wasm` (this README).
|
|
4
|
-
> Use [`@kreuzberg/html-to-markdown-node`](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) when you only target Node.js or Bun and want native performance.
|
|
5
|
-
|
|
6
|
-
Universal HTML to Markdown converter using WebAssembly.
|
|
7
|
-
|
|
8
|
-
Powered by the same Rust engine as the Node.js, Python, Ruby, and PHP bindings, so Markdown output stays identical regardless of runtime.
|
|
9
|
-
|
|
10
|
-
Runs anywhere: Node.js, Deno, Bun, browsers, and edge runtimes.
|
|
11
|
-
|
|
12
|
-
[crates.io](https://crates.io/crates/html-to-markdown-rs)
|
|
13
|
-
[npm: html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
14
|
-
[npm: html-to-markdown-wasm](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm)
|
|
15
|
-
[PyPI](https://pypi.org/project/html-to-markdown/)
|
|
16
|
-
[Packagist](https://packagist.org/packages/kreuzberg-dev/html-to-markdown)
|
|
17
|
-
[RubyGems](https://rubygems.org/gems/html-to-markdown)
|
|
18
|
-
[NuGet](https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/)
|
|
19
|
-
[Maven Central](https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown)
|
|
20
|
-
[Go Reference](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown)
|
|
21
|
-
[License](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE)
|
|
22
|
-
|
|
23
|
-
## Performance
|
|
24
|
-
|
|
25
|
-
Universal WebAssembly bindings with **excellent performance** across all JavaScript runtimes.
|
|
26
|
-
|
|
27
|
-
### Benchmark Results (Apple M4)
|
|
28
|
-
|
|
29
|
-
| Document Type | ops/sec | Notes |
|
|
30
|
-
| -------------------------- | ---------- | ------------------ |
|
|
31
|
-
| **Small (5 paragraphs)** | **70,300** | Simple documents |
|
|
32
|
-
| **Medium (25 paragraphs)** | **15,282** | Nested formatting |
|
|
33
|
-
| **Large (100 paragraphs)** | **3,836** | Complex structures |
|
|
34
|
-
| **Tables (20 tables)** | **3,748** | Table processing |
|
|
35
|
-
| **Lists (500 items)** | **1,391** | Nested lists |
|
|
36
|
-
| **Wikipedia (129KB)** | **1,022** | Real-world content |
|
|
37
|
-
| **Wikipedia (653KB)** | **147** | Large documents |
|
|
38
|
-
|
|
39
|
-
**Average: ~15,536 ops/sec** across varied workloads.
|
|
40
|
-
|
|
41
|
-
### Comparison
|
|
42
|
-
|
|
43
|
-
- **vs Native NAPI**: ~1.17x slower (WASM has minimal overhead)
|
|
44
|
-
- **vs Python**: ~6.3x faster (no FFI overhead)
|
|
45
|
-
- **Best for**: Universal deployment (browsers, Deno, edge runtimes, cross-platform apps)
|
|
46
|
-
|
|
47
|
-
### Benchmark Fixtures (Apple M4)
|
|
48
|
-
|
|
49
|
-
Numbers captured via the shared fixture harness in `tools/benchmark-harness`:
|
|
50
|
-
|
|
51
|
-
| Document | Size | ops/sec (WASM) |
|
|
52
|
-
| ---------------------- | ------ | -------------- |
|
|
53
|
-
| Lists (Timeline) | 129 KB | 882 |
|
|
54
|
-
| Tables (Countries) | 360 KB | 242 |
|
|
55
|
-
| Medium (Python) | 657 KB | 121 |
|
|
56
|
-
| Large (Rust) | 567 KB | 124 |
|
|
57
|
-
| Small (Intro) | 463 KB | 163 |
|
|
58
|
-
|
|
59
|
-
> Expect slightly higher numbers in long-lived browser/Deno workers once the WASM module is warm.
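
To put the ops/sec figures in more familiar terms, the sketch below converts them to per-call latency and approximate input throughput. The helper names `msPerOp` and `mbPerSec` are illustrative and not part of the package; the inputs are copied from the tables above, and 1 MB is assumed to be 1024 KB.

```typescript
// Convert a throughput figure (operations per second) into mean latency per call, in milliseconds.
function msPerOp(opsPerSec: number): number {
  return 1000 / opsPerSec;
}

// Approximate input throughput in MB/s for a document of `sizeKb` kilobytes (assumes 1 MB = 1024 KB).
function mbPerSec(sizeKb: number, opsPerSec: number): number {
  return (sizeKb * opsPerSec) / 1024;
}

// Wikipedia (653KB) row from the benchmark results table.
const wikipediaLargeMs = msPerOp(147);

// Lists (Timeline) row from the fixture table: 129 KB at 882 ops/sec.
const listsFixtureMbps = mbPerSec(129, 882);

console.log(wikipediaLargeMs.toFixed(1), "ms per conversion");
console.log(listsFixtureMbps.toFixed(1), "MB/s of input");
```

With these inputs, the 653 KB Wikipedia document works out to roughly 6.8 ms per conversion and the Lists (Timeline) fixture to roughly 111 MB/s of input.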
|
|
60
|
-
|
|
61
|
-
## Installation
|
|
62
|
-
|
|
63
|
-
### npm / Yarn / pnpm
|
|
64
|
-
|
|
65
|
-
```bash
|
|
66
|
-
npm install @kreuzberg/html-to-markdown-wasm
|
|
67
|
-
# or
|
|
68
|
-
yarn add @kreuzberg/html-to-markdown-wasm
|
|
69
|
-
# or
|
|
70
|
-
pnpm add @kreuzberg/html-to-markdown-wasm
|
|
71
|
-
```
|
|
72
|
-
|
|
73
|
-
### Deno
|
|
74
|
-
|
|
75
|
-
```typescript
|
|
76
|
-
// Via npm specifier
|
|
77
|
-
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
78
|
-
```
|
|
79
|
-
|
|
80
|
-
## Usage
|
|
81
|
-
|
|
82
|
-
### Basic Conversion
|
|
83
|
-
|
|
84
|
-
```javascript
|
|
85
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
86
|
-
|
|
87
|
-
const html = '<h1>Hello World</h1><p>This is <strong>fast</strong>!</p>';
|
|
88
|
-
const result = convert(html);
|
|
89
|
-
console.log(result.content);
|
|
90
|
-
// # Hello World
|
|
91
|
-
//
|
|
92
|
-
// This is **fast**!
|
|
93
|
-
```
|
|
94
|
-
|
|
95
|
-
> **Heads up for edge runtimes:** Cloudflare Workers, Vite dev servers, and other environments that instantiate `.wasm` files asynchronously must call `await initWasm()` (or `await wasmReady`) once during startup before invoking `convert`. Traditional bundlers (Webpack, Rollup) and Deno/Node imports continue to work without manual initialization.
|
|
96
|
-
|
|
97
|
-
### WasmConversionResult Fields
|
|
98
|
-
|
|
99
|
-
Every call to `convert()` returns a `WasmConversionResult` object with six fields:
|
|
100
|
-
|
|
101
|
-
```typescript
|
|
102
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
103
|
-
|
|
104
|
-
const result = convert(html);
|
|
105
|
-
|
|
106
|
-
result.content; // string | null -- converted Markdown (or djot/plain text)
|
|
107
|
-
result.document; // string | null -- structured document tree as JSON
|
|
108
|
-
result.metadata; // string | null -- extracted HTML metadata as JSON
|
|
109
|
-
result.tables; // Array -- all tables found in document order
|
|
110
|
-
result.images; // Array -- extracted inline images (data URIs, SVGs)
|
|
111
|
-
result.warnings; // Array -- non-fatal processing warnings
|
|
112
|
-
```
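
Because `document` and `metadata` arrive as JSON strings that may be `null`, callers will usually want to parse them defensively. The following is a minimal sketch of one way to do that. `ResultLike` and `parseJsonField` are hypothetical names introduced here, and the sample object is hand-written for illustration rather than produced by `convert()`.

```typescript
// Hypothetical shape mirroring three of the fields listed above; not imported from the package.
interface ResultLike {
  content: string | null;
  document: string | null;
  metadata: string | null;
}

// Parse a nullable JSON string field; returns null for a null field or malformed JSON.
function parseJsonField<T = unknown>(field: string | null): T | null {
  if (field === null) return null;
  try {
    return JSON.parse(field) as T;
  } catch {
    return null;
  }
}

// Hand-written sample standing in for a real conversion result.
const sample: ResultLike = {
  content: "# My Article\n",
  document: null,
  metadata: '{"title":"My Article"}',
};

const meta = parseJsonField<{ title?: string }>(sample.metadata);
const tree = parseJsonField(sample.document);
```

Returning `null` on malformed input keeps the call site to a single null check. Code that needs to distinguish "absent" from "invalid" would want to surface the parse error instead.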
|
|
113
|
-
|
|
114
|
-
### Byte-Based Input (Buffers / Uint8Array)
|
|
115
|
-
|
|
116
|
-
When you already have raw bytes (e.g., from `fs.readFileSync` or a Fetch API response), skip the `TextDecoder` decoding step by calling the byte-friendly helper:
|
|
117
|
-
|
|
118
|
-
```ts
|
|
119
|
-
import { convertBytes } from '@kreuzberg/html-to-markdown-wasm';
|
|
120
|
-
import { readFileSync } from 'node:fs';
|
|
121
|
-
|
|
122
|
-
const htmlBytes = readFileSync('input.html'); // Buffer -> Uint8Array
|
|
123
|
-
const result = convertBytes(htmlBytes);
|
|
124
|
-
console.log(result.content);
|
|
125
|
-
```
|
|
126
|
-
|
|
127
|
-
### With Options
|
|
128
|
-
|
|
129
|
-
```typescript
|
|
130
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
131
|
-
|
|
132
|
-
const result = convert(html, {
|
|
133
|
-
headingStyle: 'atx',
|
|
134
|
-
codeBlockStyle: 'backticks',
|
|
135
|
-
listIndentWidth: 2,
|
|
136
|
-
bullets: '-',
|
|
137
|
-
wrap: true,
|
|
138
|
-
wrapWidth: 80
|
|
139
|
-
});
|
|
140
|
-
console.log(result.content);
|
|
141
|
-
```
|
|
142
|
-
|
|
143
|
-
### Preserve Complex HTML
|
|
144
|
-
|
|
145
|
-
```typescript
|
|
146
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
147
|
-
|
|
148
|
-
const html = `
|
|
149
|
-
<h1>Report</h1>
|
|
150
|
-
<table>
|
|
151
|
-
<tr><th>Name</th><th>Value</th></tr>
|
|
152
|
-
<tr><td>Foo</td><td>Bar</td></tr>
|
|
153
|
-
</table>
|
|
154
|
-
`;
|
|
155
|
-
|
|
156
|
-
const result = convert(html, {
|
|
157
|
-
preserveTags: ['table'] // Keep tables as HTML
|
|
158
|
-
});
|
|
159
|
-
console.log(result.content);
|
|
160
|
-
```
|
|
161
|
-
|
|
162
|
-
### Deno
|
|
163
|
-
|
|
164
|
-
```typescript
|
|
165
|
-
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
166
|
-
|
|
167
|
-
const html = await Deno.readTextFile("input.html");
|
|
168
|
-
const result = convert(html, { headingStyle: "atx" });
|
|
169
|
-
await Deno.writeTextFile("output.md", result.content ?? "");
|
|
170
|
-
```
|
|
171
|
-
|
|
172
|
-
> **Performance Tip:** For Node.js/Bun, use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for 1.17x better performance with native bindings.
|
|
173
|
-
|
|
174
|
-
### Browser (ESM)
|
|
175
|
-
|
|
176
|
-
```html
|
|
177
|
-
<!DOCTYPE html>
|
|
178
|
-
<html>
|
|
179
|
-
<head>
|
|
180
|
-
<title>HTML to Markdown</title>
|
|
181
|
-
</head>
|
|
182
|
-
<body>
|
|
183
|
-
<script type="module">
|
|
184
|
-
import init, { convert } from 'https://unpkg.com/@kreuzberg/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
185
|
-
|
|
186
|
-
// Initialize WASM module
|
|
187
|
-
await init();
|
|
188
|
-
|
|
189
|
-
const html = '<h1>Hello World</h1><p>This runs in the <strong>browser</strong>!</p>';
|
|
190
|
-
const result = convert(html, { headingStyle: 'atx' });
|
|
191
|
-
|
|
192
|
-
console.log(result.content);
|
|
193
|
-
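const pre = document.createElement('pre');
pre.textContent = result.content ?? ''; // textContent avoids re-parsing the Markdown as HTML
document.body.appendChild(pre);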
|
|
194
|
-
</script>
|
|
195
|
-
</body>
|
|
196
|
-
</html>
|
|
197
|
-
```
|
|
198
|
-
|
|
199
|
-
### Vite / Webpack / Bundlers
|
|
200
|
-
|
|
201
|
-
```typescript
|
|
202
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
203
|
-
|
|
204
|
-
const result = convert('<h1>Hello</h1>', {
|
|
205
|
-
headingStyle: 'atx',
|
|
206
|
-
codeBlockStyle: 'backticks'
|
|
207
|
-
});
|
|
208
|
-
console.log(result.content);
|
|
209
|
-
```
|
|
210
|
-
|
|
211
|
-
### Cloudflare Workers
|
|
212
|
-
|
|
213
|
-
```typescript
|
|
214
|
-
import { convert, initWasm, wasmReady } from '@kreuzberg/html-to-markdown-wasm';
|
|
215
|
-
|
|
216
|
-
// Cloudflare Workers / other edge runtimes instantiate WASM asynchronously.
|
|
217
|
-
// Kick off initialization once at module scope.
|
|
218
|
-
const ready = wasmReady ?? initWasm();
|
|
219
|
-
|
|
220
|
-
export default {
|
|
221
|
-
async fetch(request: Request): Promise<Response> {
|
|
222
|
-
await ready;
|
|
223
|
-
const html = await request.text();
|
|
224
|
-
const result = convert(html, { headingStyle: 'atx' });
|
|
225
|
-
|
|
226
|
-
return new Response(result.content ?? "", {
|
|
227
|
-
headers: { 'Content-Type': 'text/markdown' }
|
|
228
|
-
});
|
|
229
|
-
}
|
|
230
|
-
};
|
|
231
|
-
```
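
The "initialize once at module scope" idea used above generalizes to any async initializer. Here is a small sketch of the pattern under stated assumptions: `once` and `fakeInit` are hypothetical names, and `fakeInit` is only a stand-in for the package's `initWasm` so the snippet runs without the WASM module.

```typescript
// Wrap an async initializer so every caller shares the same promise
// and the underlying work is started at most once.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = init();
    }
    return pending;
  };
}

let initCalls = 0;

// Stand-in for `initWasm`; hypothetical, for illustration only.
const fakeInit = async (): Promise<void> => {
  initCalls += 1;
};

const ensureReady = once(fakeInit);

// Both calls return the same promise; the initializer runs a single time.
const first = ensureReady();
const second = ensureReady();
```

In a request handler, `await ensureReady()` would then take the place of `await ready`, with the difference that initialization starts lazily on the first request rather than at import time.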
|
|
232
|
-
|
|
233
|
-
|
|
234
|
-
## TypeScript
|
|
235
|
-
|
|
236
|
-
Full TypeScript support with type definitions:
|
|
237
|
-
|
|
238
|
-
```typescript
|
|
239
|
-
import { convert, type WasmConversionOptions } from '@kreuzberg/html-to-markdown-wasm';
|
|
240
|
-
|
|
241
|
-
const options: WasmConversionOptions = {
|
|
242
|
-
headingStyle: 'atx',
|
|
243
|
-
codeBlockStyle: 'backticks',
|
|
244
|
-
listIndentWidth: 2,
|
|
245
|
-
wrap: true,
|
|
246
|
-
wrapWidth: 80
|
|
247
|
-
};
|
|
248
|
-
|
|
249
|
-
const result = convert('<h1>Hello</h1>', options);
|
|
250
|
-
console.log(result.content);
|
|
251
|
-
```
|
|
252
|
-
|
|
253
|
-
## Metadata and Tables
|
|
254
|
-
|
|
255
|
-
Extract document metadata and structured tables from the conversion result:
|
|
256
|
-
|
|
257
|
-
```typescript
|
|
258
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
259
|
-
|
|
260
|
-
const html = `
|
|
261
|
-
<html lang="en">
|
|
262
|
-
<head><title>My Article</title></head>
|
|
263
|
-
<body>
|
|
264
|
-
<h1>Main Title</h1>
|
|
265
|
-
<p>Content with <a href="https://example.com">a link</a></p>
|
|
266
|
-
<img src="https://example.com/image.jpg" alt="Example image">
|
|
267
|
-
<table>
|
|
268
|
-
<tr><th>Name</th><th>Value</th></tr>
|
|
269
|
-
<tr><td>Foo</td><td>42</td></tr>
|
|
270
|
-
</table>
|
|
271
|
-
</body>
|
|
272
|
-
</html>
|
|
273
|
-
`;
|
|
274
|
-
|
|
275
|
-
const result = convert(html, {
|
|
276
|
-
extractMetadata: true,
|
|
277
|
-
});
|
|
278
|
-
|
|
279
|
-
console.log(result.content); // Markdown output
|
|
280
|
-
console.log(result.metadata); // JSON string with title, links, headers, etc.
|
|
281
|
-
console.log(result.tables.length); // Number of tables found
|
|
282
|
-
console.log(result.images.length); // Number of inline images extracted
|
|
283
|
-
console.log(result.warnings); // Any processing warnings
|
|
284
|
-
```
|
|
285
|
-
|
|
286
|
-
## Build Targets
|
|
287
|
-
|
|
288
|
-
Three build targets are provided for different environments:
|
|
289
|
-
|
|
290
|
-
| Target | Path | Use Case |
|
|
291
|
-
| ----------- | --------------------------------- | ------------------------------ |
|
|
292
|
-
| **Bundler** | `@kreuzberg/html-to-markdown-wasm` | Webpack, Vite, Rollup, esbuild |
|
|
293
|
-
| **Node.js** | `@kreuzberg/html-to-markdown-wasm/dist-node` | Node.js, Bun (CommonJS/ESM) |
|
|
294
|
-
| **Web** | `@kreuzberg/html-to-markdown-wasm/dist-web` | Direct browser ESM imports |
|
|
295
|
-
|
|
296
|
-
## Runtime Compatibility
|
|
297
|
-
|
|
298
|
-
| Runtime | Support | Package |
|
|
299
|
-
| ------------------------- | ---------------------------- | -------------- |
|
|
300
|
-
| **Node.js** 18+ | Full support | `dist-node` |
|
|
301
|
-
| **Deno** | Full support | npm: specifier |
|
|
302
|
-
| **Bun** | Full support (prefer native) | Default export |
|
|
303
|
-
| **Browsers** | Full support | `dist-web` |
|
|
304
|
-
| **Cloudflare Workers** | Full support | Default export |
|
|
305
|
-
| **Deno Deploy** | Full support | npm: specifier |
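
The two tables above can be read as a lookup from runtime to import specifier. The sketch below encodes that reading. The `Runtime` labels and the `specifierFor` helper are hypothetical and not part of the package, and a browser without a bundler would still need a full URL as in the Browser (ESM) example.

```typescript
// Runtime labels are hypothetical names for this sketch; the specifiers follow the tables above.
type Runtime =
  | "node"
  | "deno"
  | "bun"
  | "browser"
  | "cloudflare-workers"
  | "deno-deploy";

const PACKAGE = "@kreuzberg/html-to-markdown-wasm";

const SPECIFIERS: Record<Runtime, string> = {
  node: `${PACKAGE}/dist-node`,      // Node.js target
  deno: `npm:${PACKAGE}`,            // npm: specifier
  bun: PACKAGE,                      // default export
  browser: `${PACKAGE}/dist-web`,    // web target
  "cloudflare-workers": PACKAGE,     // default export
  "deno-deploy": `npm:${PACKAGE}`,   // npm: specifier
};

// Return the import specifier the tables suggest for a given runtime.
function specifierFor(runtime: Runtime): string {
  return SPECIFIERS[runtime];
}

console.log(specifierFor("deno"));
```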
|
|
306
|
-
|
|
307
|
-
## When to Use
|
|
308
|
-
|
|
309
|
-
Choose `@kreuzberg/html-to-markdown-wasm` when:
|
|
310
|
-
|
|
311
|
-
- Running in browsers or edge runtimes
|
|
312
|
-
- Using Deno
|
|
313
|
-
- Deploying to Cloudflare Workers or Deno Deploy
|
|
314
|
-
- Building universal libraries
|
|
315
|
-
- Needing consistent behavior across all platforms
|
|
316
|
-
|
|
317
|
-
Use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for:
|
|
318
|
-
|
|
319
|
-
- Maximum performance in Node.js/Bun (native bindings; ~1.17x faster than WASM in the benchmarks above)
|
|
320
|
-
- Server-side only applications
|
|
321
|
-
|
|
322
|
-
## Visitor Pattern Support
|
|
323
|
-
|
|
324
|
-
**The WebAssembly binding does not support the visitor pattern.** The visitor pattern requires callbacks and stateful execution across the WebAssembly/JavaScript boundary, which runs into the limitations listed below.
|
|
325
|
-
|
|
326
|
-
### Why WASM Does Not Support Visitors
|
|
327
|
-
|
|
328
|
-
1. **Memory safety across FFI boundary**: The WASM/JS boundary cannot safely pass mutable function callbacks that maintain state across multiple invocations
|
|
329
|
-
2. **Single-threaded execution model**: WASM runs on a single thread with no equivalent to Node.js's `ThreadsafeFunction` FFI primitive
|
|
330
|
-
3. **No callback marshaling**: JavaScript callbacks cannot be directly invoked from within WASM without significant overhead and memory leaks
|
|
331
|
-
4. **Serialization overhead**: Converting context objects between WASM and JS for each visitor callback would eliminate performance benefits
|
|
332
|
-
|
|
333
|
-
### Alternatives for WASM Users
|
|
334
|
-
|
|
335
|
-
Choose one of these approaches:
|
|
336
|
-
|
|
337
|
-
#### 1. Use Node.js Binding (Recommended)
|
|
338
|
-
|
|
339
|
-
For best performance with visitor support, use the native Node.js binding:
|
|
340
|
-
|
|
341
|
-
```typescript
|
|
342
|
-
import { convert, type Visitor } from '@kreuzberg/html-to-markdown-node';
|
|
343
|
-
|
|
344
|
-
const visitor: Visitor = {
|
|
345
|
-
visitLink(ctx, href, text, title) {
|
|
346
|
-
// Your visitor logic here
|
|
347
|
-
return { type: 'continue' };
|
|
348
|
-
},
|
|
349
|
-
};
|
|
350
|
-
|
|
351
|
-
const result = convert(html, undefined, visitor);
|
|
352
|
-
console.log(result.content);
|
|
353
|
-
```
|
|
354
|
-
|
|
355
|
-
**Performance:** Faster than WASM (~1.17x in the benchmarks above), with full visitor pattern support.
|
|
356
|
-
**Use when:** Running on Node.js or Bun server-side.
|
|
357
|
-
|
|
358
|
-
#### 2. Use Server-Side Bindings
|
|
359
|
-
|
|
360
|
-
For other platforms, use Python, Ruby, or PHP bindings with visitor support:
|
|
361
|
-
|
|
362
|
-
**Python:**
|
|
363
|
-
|
|
364
|
-
```python
|
|
365
|
-
from html_to_markdown import convert
|
|
366
|
-
|
|
367
|
-
class MyVisitor:
|
|
368
|
-
def visit_link(self, ctx, href, text, title):
|
|
369
|
-
# Your visitor logic here
|
|
370
|
-
return {"type": "continue"}
|
|
371
|
-
|
|
372
|
-
result = convert(html, None, MyVisitor())
|
|
373
|
-
```
|
|
374
|
-
|
|
375
|
-
**Ruby:**
|
|
376
|
-
|
|
377
|
-
```ruby
|
|
378
|
-
require 'html_to_markdown'
|
|
379
|
-
|
|
380
|
-
class MyVisitor
|
|
381
|
-
def visit_link(ctx, href, text, title)
|
|
382
|
-
{ type: :continue }
|
|
383
|
-
end
|
|
384
|
-
end
|
|
385
|
-
|
|
386
|
-
result = HtmlToMarkdown.convert(html, nil, MyVisitor.new)
|
|
387
|
-
```
|
|
388
|
-
|
|
389
|
-
**PHP:**
|
|
390
|
-
|
|
391
|
-
```php
|
|
392
|
-
use HtmlToMarkdown\Converter;
|
|
393
|
-
|
|
394
|
-
class MyVisitor {
|
|
395
|
-
public function visitLink(array $ctx, string $href, string $text, ?string $title): array {
|
|
396
|
-
return ['type' => 'continue'];
|
|
397
|
-
}
|
|
398
|
-
}
|
|
399
|
-
|
|
400
|
-
$result = Converter::convert($html, null, new MyVisitor());
|
|
401
|
-
```
|
|
402
|
-
|
|
403
|
-
#### 3. Preprocess HTML Before Conversion
|
|
404
|
-
|
|
405
|
-
For simple transformations, manipulate the HTML before passing it to the WASM converter:
|
|
406
|
-
|
|
407
|
-
```typescript
|
|
408
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
409
|
-
|
|
410
|
-
// Rewrite URLs before conversion
|
|
411
|
-
const processedHtml = html.replace(
|
|
412
|
-
/https:\/\/old-cdn\.com/g,
|
|
413
|
-
'https://new-cdn.com'
|
|
414
|
-
);
|
|
415
|
-
|
|
416
|
-
const result = convert(processedHtml);
|
|
417
|
-
console.log(result.content);
|
|
418
|
-
```
|
|
419
|
-
|
|
420
|
-
**Use when:** Only simple text replacements are needed.
|
|
421
|
-
|
|
422
|
-
#### 4. Post-Process Markdown
|
|
423
|
-
|
|
424
|
-
Transform the output Markdown after conversion:
|
|
425
|
-
|
|
426
|
-
```typescript
|
|
427
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
428
|
-
|
|
429
|
-
const markdown = convert(html).content ?? "";
|
|
430
|
-
|
|
431
|
-
// Post-process the markdown
|
|
432
|
-
const transformed = markdown
|
|
433
|
-
.replace(/\[(.+?)\]\(https:\/\/old-cdn\.com/g, '[$1](https://new-cdn.com')
|
|
434
|
-
.replace(/!\[(.+?)\]\(https:\/\/old-cdn\.com/g, '![$1](https://new-cdn.com');
|
|
435
|
-
```
|
|
436
|
-
|
|
437
|
-
**Use when:** Transformations can be applied to final Markdown output.
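
As a variation on the post-processing approach, the sketch below rewrites link and image destinations in a single pass by matching the `](` that opens a Markdown destination. `rewriteCdn` is a hypothetical helper, not a package API, and it assumes inline-style links only (reference-style links and raw HTML are not handled).

```typescript
// Replace the origin of inline Markdown link and image destinations.
// `from` and `to` are plain origins such as "https://old-cdn.com".
function rewriteCdn(markdown: string, from: string, to: string): string {
  // Escape regex metacharacters so `from` is matched literally.
  const escaped = from.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // "](" precedes the destination of both [text](url) and ![alt](url).
  const pattern = new RegExp(`\\]\\(${escaped}`, "g");
  // A replacer function avoids special handling of "$" sequences in `to`.
  return markdown.replace(pattern, () => `](${to}`);
}

// Hand-written Markdown standing in for `convert(html).content ?? ""`.
const input =
  "[docs](https://old-cdn.com/guide) ![logo](https://old-cdn.com/logo.png)";

const output = rewriteCdn(input, "https://old-cdn.com", "https://new-cdn.com");
console.log(output);
```

Matching on `](` rather than on the link text keeps the pattern independent of what appears inside the brackets, at the cost of also rewriting any literal `](https://old-cdn.com` that happens to appear in code spans.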
|
|
438
|
-
|
|
439
|
-
### Visitor Pattern Support Matrix
|
|
440
|
-
|
|
441
|
-
| Binding | Visitor Support | Best For |
|
|
442
|
-
|---------|-----------------|----------|
|
|
443
|
-
| **Rust** | Yes | Core library, performance-critical code |
|
|
444
|
-
| **Python** | Yes (sync and async) | Server-side, bulk processing |
|
|
445
|
-
| **TypeScript/Node.js** | Yes (sync and async) | Server-side Node.js/Bun, best performance |
|
|
446
|
-
| **Ruby** | Yes | Server-side Ruby on Rails, Sinatra |
|
|
447
|
-
| **PHP** | Yes | Server-side PHP, content management |
|
|
448
|
-
| **Go** | No | Basic conversion only |
|
|
449
|
-
| **Java** | No | Basic conversion only |
|
|
450
|
-
| **C#** | No | Basic conversion only |
|
|
451
|
-
| **Elixir** | No | Basic conversion only |
|
|
452
|
-
| **WebAssembly** | No | Browser, Edge, Deno (see alternatives above) |
|
|
453
|
-
|
|
454
|
-
For comprehensive visitor pattern documentation with examples, see the [full documentation](https://docs.html-to-markdown.kreuzberg.dev).
|
|
455
|
-
|
|
456
|
-
## Configuration Options
|
|
457
|
-
|
|
458
|
-
Available options:
|
|
459
|
-
|
|
460
|
-
- Heading styles (atx, underlined, atxClosed)
|
|
461
|
-
- Code block styles (indented, backticks, tildes)
|
|
462
|
-
- List formatting (indent width, bullet characters)
|
|
463
|
-
- Text escaping and formatting
|
|
464
|
-
- Tag preservation (`preserveTags`) and stripping (`stripTags`)
|
|
465
|
-
- Metadata extraction (`extractMetadata`)
|
|
466
|
-
- Preprocessing for web scraping
|
|
467
|
-
- And more...
|
|
468
|
-
|
|
469
|
-
## Examples
|
|
470
|
-
|
|
471
|
-
### Preserving HTML Tags
|
|
472
|
-
|
|
473
|
-
Keep specific HTML tags in their original form:
|
|
474
|
-
|
|
475
|
-
```typescript
|
|
476
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
477
|
-
|
|
478
|
-
const html = `
|
|
479
|
-
<p>Before table</p>
|
|
480
|
-
<table class="data">
|
|
481
|
-
<tr><th>Name</th><th>Value</th></tr>
|
|
482
|
-
<tr><td>Item 1</td><td>100</td></tr>
|
|
483
|
-
</table>
|
|
484
|
-
<p>After table</p>
|
|
485
|
-
`;
|
|
486
|
-
|
|
487
|
-
const result = convert(html, {
|
|
488
|
-
preserveTags: ['table']
|
|
489
|
-
});
|
|
490
|
-
|
|
491
|
-
// result.content includes the table as HTML
|
|
492
|
-
```
|
|
493
|
-
|
|
494
|
-
Combine with `stripTags`:
|
|
495
|
-
|
|
496
|
-
```typescript
|
|
497
|
-
const result = convert(html, {
|
|
498
|
-
preserveTags: ['table', 'form'], // Keep as HTML
|
|
499
|
-
stripTags: ['script', 'style'] // Remove entirely
|
|
500
|
-
});
|
|
501
|
-
console.log(result.content);
|
|
502
|
-
```
|
|
503
|
-
|
|
504
|
-
### Deno Web Server
|
|
505
|
-
|
|
506
|
-
```typescript
|
|
507
|
-
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
508
|
-
|
|
509
|
-
Deno.serve(async (req) => {
|
|
510
|
-
const url = new URL(req.url);
|
|
511
|
-
|
|
512
|
-
if (url.pathname === "/convert" && req.method === "POST") {
|
|
513
|
-
const html = await req.text();
|
|
514
|
-
const result = convert(html, { headingStyle: "atx" });
|
|
515
|
-
|
|
516
|
-
return new Response(result.content ?? "", {
|
|
517
|
-
headers: { "Content-Type": "text/markdown" }
|
|
518
|
-
});
|
|
519
|
-
}
|
|
520
|
-
|
|
521
|
-
return new Response("Not found", { status: 404 });
|
|
522
|
-
});
|
|
523
|
-
```
|
|
524
|
-
|
|
525
|
-
### Browser File Conversion
|
|
526
|
-
|
|
527
|
-
```html
|
|
528
|
-
<input type="file" id="htmlFile" accept=".html">
|
|
529
|
-
<button onclick="convertFile()">Convert to Markdown</button>
|
|
530
|
-
<pre id="output"></pre>
|
|
531
|
-
|
|
532
|
-
<script type="module">
|
|
533
|
-
import init, { convert } from 'https://unpkg.com/@kreuzberg/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
534
|
-
|
|
535
|
-
await init();
|
|
536
|
-
|
|
537
|
-
window.convertFile = async () => {
|
|
538
|
-
const file = document.getElementById('htmlFile').files[0];
|
|
539
|
-
const html = await file.text();
|
|
540
|
-
const result = convert(html, { headingStyle: 'atx' });
|
|
541
|
-
document.getElementById('output').textContent = result.content ?? "";
|
|
542
|
-
};
|
|
543
|
-
</script>
|
|
544
|
-
```
|
|
545
|
-
|
|
546
|
-
### Web Scraping (Deno)
|
|
547
|
-
|
|
548
|
-
```typescript
|
|
549
|
-
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
550
|
-
|
|
551
|
-
const response = await fetch("https://example.com");
|
|
552
|
-
const html = await response.text();
|
|
553
|
-
|
|
554
|
-
const result = convert(html, {
|
|
555
|
-
preprocess: true,
|
|
556
|
-
preset: "aggressive",
|
|
557
|
-
keepNavigation: false,
|
|
558
|
-
headingStyle: "atx",
|
|
559
|
-
codeBlockStyle: "backticks"
|
|
560
|
-
});
|
|
561
|
-
|
|
562
|
-
console.log(result.content);
|
|
563
|
-
```
|
|
564
|
-
|
|
565
|
-
## Other Runtimes
|
|
566
|
-
|
|
567
|
-
The same Rust engine ships as native bindings for other ecosystems:
|
|
568
|
-
|
|
569
|
-
- Node.js / Bun: [`@kreuzberg/html-to-markdown-node`](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
570
|
-
- Python: [`html-to-markdown`](https://pypi.org/project/html-to-markdown/)
|
|
571
|
-
- Ruby: [`html-to-markdown`](https://rubygems.org/gems/html-to-markdown)
|
|
572
|
-
- PHP: [`kreuzberg-dev/html-to-markdown`](https://packagist.org/packages/kreuzberg-dev/html-to-markdown)
|
|
573
|
-
- Rust crate and CLI: [`html-to-markdown-rs`](https://crates.io/crates/html-to-markdown-rs)
|
|
574
|
-
|
|
575
|
-
## Links
|
|
576
|
-
|
|
577
|
-
- [GitHub Repository](https://github.com/kreuzberg-dev/html-to-markdown)
|
|
578
|
-
- [Full Documentation](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/README.md)
|
|
579
|
-
- [Native Node Package](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
580
|
-
- [Python Package](https://pypi.org/project/html-to-markdown/)
|
|
581
|
-
- [PHP Extension and Helpers](https://packagist.org/packages/kreuzberg-dev/html-to-markdown)
|
|
582
|
-
- [Rust Crate](https://crates.io/crates/html-to-markdown-rs)
|
|
583
|
-
|
|
584
|
-
## License
|
|
585
|
-
|
|
586
|
-
MIT
|
package/dist/LICENSE
DELETED
|
@@ -1,21 +0,0 @@
|
|
|
1
|
-
The MIT License (MIT)
|
|
2
|
-
|
|
3
|
-
Copyright 2024-2025 Na'aman Hirschfeld
|
|
4
|
-
|
|
5
|
-
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
-
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
-
in the Software without restriction, including without limitation the rights
|
|
8
|
-
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
-
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
-
furnished to do so, subject to the following conditions:
|
|
11
|
-
|
|
12
|
-
The above copyright notice and this permission notice shall be included in all
|
|
13
|
-
copies or substantial portions of the Software.
|
|
14
|
-
|
|
15
|
-
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
-
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
-
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
-
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
-
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
-
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
-
SOFTWARE.
|