@kreuzberg/html-to-markdown-wasm 2.30.0 → 3.0.0
- package/README.md +124 -286
- package/dist/README.md +50 -2
- package/dist/html_to_markdown_wasm.d.ts +15 -163
- package/dist/html_to_markdown_wasm_bg.js +73 -887
- package/dist/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist/html_to_markdown_wasm_bg.wasm.d.ts +0 -45
- package/dist/package.json +1 -1
- package/dist-node/README.md +50 -2
- package/dist-node/html_to_markdown_wasm.d.ts +15 -163
- package/dist-node/html_to_markdown_wasm.js +74 -903
- package/dist-node/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-node/html_to_markdown_wasm_bg.wasm.d.ts +0 -45
- package/dist-node/package.json +1 -1
- package/dist-web/README.md +50 -2
- package/dist-web/html_to_markdown_wasm.d.ts +15 -208
- package/dist-web/html_to_markdown_wasm.js +73 -888
- package/dist-web/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-web/html_to_markdown_wasm_bg.wasm.d.ts +0 -45
- package/dist-web/package.json +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -17,63 +17,9 @@ Runs anywhere: Node.js, Deno, Bun, browsers, and edge runtimes.
|
|
|
17
17
|
[](https://rubygems.org/gems/html-to-markdown)
|
|
18
18
|
[](https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/)
|
|
19
19
|
[](https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown)
|
|
20
|
-
[](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown)
|
|
21
21
|
[](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE)
|
|
22
22
|
|
|
23
|
-
## Migration Guide (v2.18.x → v2.19.0)
|
|
24
|
-
|
|
25
|
-
> **⚠️ BREAKING CHANGE: Package Namespace Update**
|
|
26
|
-
>
|
|
27
|
-
> In v2.19.0, the npm package namespace changed from `html-to-markdown-wasm` to `@kreuzberg/html-to-markdown-wasm` to reflect the new Kreuzberg.dev organization.
|
|
28
|
-
|
|
29
|
-
### Install Updated Package
|
|
30
|
-
|
|
31
|
-
**Before (v2.18.x):**
|
|
32
|
-
```bash
|
|
33
|
-
npm install html-to-markdown-wasm
|
|
34
|
-
```
|
|
35
|
-
|
|
36
|
-
**After (v2.19.0+):**
|
|
37
|
-
```bash
|
|
38
|
-
npm install @kreuzberg/html-to-markdown-wasm
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
### Update Import Statements
|
|
42
|
-
|
|
43
|
-
**Before:**
|
|
44
|
-
```typescript
|
|
45
|
-
import { convert } from 'html-to-markdown-wasm';
|
|
46
|
-
// or
|
|
47
|
-
import { convert } from "npm:html-to-markdown-wasm"; // Deno
|
|
48
|
-
```
|
|
49
|
-
|
|
50
|
-
**After:**
|
|
51
|
-
```typescript
|
|
52
|
-
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
53
|
-
// or
|
|
54
|
-
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm"; // Deno
|
|
55
|
-
```
|
|
56
|
-
|
|
57
|
-
### Update Browser ESM Imports
|
|
58
|
-
|
|
59
|
-
**Before:**
|
|
60
|
-
```javascript
|
|
61
|
-
import init, { convert } from 'https://unpkg.com/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
62
|
-
```
|
|
63
|
-
|
|
64
|
-
**After:**
|
|
65
|
-
```javascript
|
|
66
|
-
import init, { convert } from 'https://unpkg.com/@kreuzberg/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
67
|
-
```
|
|
68
|
-
|
|
69
|
-
### Summary of Changes
|
|
70
|
-
|
|
71
|
-
- Package renamed from `html-to-markdown-wasm` to `@kreuzberg/html-to-markdown-wasm`
|
|
72
|
-
- All APIs remain identical
|
|
73
|
-
- Full backward compatibility after updating package name and imports
|
|
74
|
-
|
|
75
|
-
---
|
|
76
|
-
|
|
77
23
|
## Performance
|
|
78
24
|
|
|
79
25
|
Universal WebAssembly bindings with **excellent performance** across all JavaScript runtimes.
|
|
@@ -94,8 +40,8 @@ Universal WebAssembly bindings with **excellent performance** across all JavaScr
|
|
|
94
40
|
|
|
95
41
|
### Comparison
|
|
96
42
|
|
|
97
|
-
- **vs Native NAPI**: ~1.
|
|
98
|
-
- **vs Python**: ~6.
|
|
43
|
+
- **vs Native NAPI**: ~1.17x slower (WASM has minimal overhead)
|
|
44
|
+
- **vs Python**: ~6.3x faster (no FFI overhead)
|
|
99
45
|
- **Best for**: Universal deployment (browsers, Deno, edge runtimes, cross-platform apps)
|
|
100
46
|
|
|
101
47
|
### Benchmark Fixtures (Apple M4)
|
|
@@ -109,9 +55,6 @@ Numbers captured via the shared fixture harness in `tools/benchmark-harness`:
|
|
|
109
55
|
| Medium (Python) | 657 KB | 121 |
|
|
110
56
|
| Large (Rust) | 567 KB | 124 |
|
|
111
57
|
| Small (Intro) | 463 KB | 163 |
|
|
112
|
-
| hOCR German PDF | 44 KB | 1,637 |
|
|
113
|
-
| hOCR Invoice | 4 KB | 7,775 |
|
|
114
|
-
| hOCR Embedded Tables | 37 KB | 1,667 |
|
|
115
58
|
|
|
116
59
|
> Expect slightly higher numbers in long-lived browser/Deno workers once the WASM module is warm.
|
|
117
60
|
|
|
@@ -142,8 +85,8 @@ import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
|
142
85
|
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
143
86
|
|
|
144
87
|
const html = '<h1>Hello World</h1><p>This is <strong>fast</strong>!</p>';
|
|
145
|
-
const
|
|
146
|
-
console.log(
|
|
88
|
+
const result = convert(html);
|
|
89
|
+
console.log(result.content);
|
|
147
90
|
// # Hello World
|
|
148
91
|
//
|
|
149
92
|
// This is **fast**!
|
|
@@ -151,45 +94,34 @@ console.log(markdown);
|
|
|
151
94
|
|
|
152
95
|
> **Heads up for edge runtimes:** Cloudflare Workers, Vite dev servers, and other environments that instantiate `.wasm` files asynchronously must call `await initWasm()` (or `await wasmReady`) once during startup before invoking `convert`. Traditional bundlers (Webpack, Rollup) and Deno/Node imports continue to work without manual initialization.
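The one-time initialization described in the note above can be wrapped so that concurrent requests share a single in-flight call. This is a minimal sketch, not the package's own mechanism: `once` and `ensureWasmReady` are hypothetical names, and the initializer is injected as a plain function so the helper assumes nothing about the package's exports.

```typescript
// Hypothetical helper: wraps an async initializer so it runs at most once.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      // Cache the promise itself so concurrent callers share one init call.
      pending = init().catch((err) => {
        pending = undefined; // allow a retry after a failed attempt
        throw err;
      });
    }
    return pending;
  };
}

// Assumed wiring, based only on the note above (not verified):
// const ensureWasmReady = once(() => initWasm());
// await ensureWasmReady(); // then call convert(...)
```

Caching the promise rather than a boolean flag means a second request that arrives while initialization is still running waits for the same call instead of starting another.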
|
|
153
96
|
|
|
154
|
-
|
|
155
|
-
- [**Browser with Rollup**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-rollup) - Using dist-web target in browser
|
|
156
|
-
- [**Node.js**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-node) - Using dist-node target
|
|
157
|
-
- [**Cloudflare Workers**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-cloudflare) - Using bundler target with Wrangler
|
|
97
|
+
### WasmConversionResult Fields
|
|
158
98
|
|
|
159
|
-
|
|
99
|
+
Every call to `convert()` returns a `WasmConversionResult` object with six fields:
|
|
160
100
|
|
|
161
|
-
```
|
|
162
|
-
import {
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
} from '@kreuzberg/html-to-markdown-wasm';
|
|
101
|
+
```typescript
|
|
102
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
103
|
+
|
|
104
|
+
const result = convert(html);
|
|
166
105
|
|
|
167
|
-
|
|
168
|
-
|
|
106
|
+
result.content; // string | null -- converted Markdown (or djot/plain text)
|
|
107
|
+
result.document; // string | null -- structured document tree as JSON
|
|
108
|
+
result.metadata; // string | null -- extracted HTML metadata as JSON
|
|
109
|
+
result.tables; // Array -- all tables found in document order
|
|
110
|
+
result.images; // Array -- extracted inline images (data URIs, SVGs)
|
|
111
|
+
result.warnings; // Array -- non-fatal processing warnings
|
|
169
112
|
```
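The removed lines in this diff suggest that 2.x call sites used the return value of `convert()` directly as a Markdown string, whereas 3.0.0 call sites read `content` from the result object. The following is a minimal sketch of a helper that tolerates both shapes during a migration. `ConversionResultLike` is a hypothetical local type mirroring only the `content` field listed above, and `markdownOf` is an illustrative name, not part of the package.

```typescript
// Hypothetical local mirror of the one field this helper needs.
interface ConversionResultLike {
  content: string | null;
}

// Accepts either a 2.x-style string or a 3.x-style result object and
// always returns a string (empty when content is null).
function markdownOf(output: string | ConversionResultLike): string {
  if (typeof output === "string") {
    return output;
  }
  return output.content ?? "";
}
```

Call sites can then be written as `markdownOf(convert(html))` while both versions are in use, and simplified to `convert(html).content ?? ""` once the upgrade is complete.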
|
|
170
113
|
|
|
171
114
|
### Byte-Based Input (Buffers / Uint8Array)
|
|
172
115
|
|
|
173
|
-
When you already have raw bytes (e.g., `fs.readFileSync`, Fetch API responses), skip re-encoding with `TextDecoder` by calling the byte-friendly
|
|
116
|
+
When you already have raw bytes (e.g., `fs.readFileSync`, Fetch API responses), skip re-encoding with `TextDecoder` by calling the byte-friendly helper:
|
|
174
117
|
|
|
175
118
|
```ts
|
|
176
|
-
import {
|
|
177
|
-
convertBytes,
|
|
178
|
-
convertBytesWithOptionsHandle,
|
|
179
|
-
createConversionOptionsHandle,
|
|
180
|
-
convertBytesWithInlineImages,
|
|
181
|
-
} from '@kreuzberg/html-to-markdown-wasm';
|
|
119
|
+
import { convertBytes } from '@kreuzberg/html-to-markdown-wasm';
|
|
182
120
|
import { readFileSync } from 'node:fs';
|
|
183
121
|
|
|
184
122
|
const htmlBytes = readFileSync('input.html'); // Buffer -> Uint8Array
|
|
185
|
-
const
|
|
186
|
-
|
|
187
|
-
const handle = createConversionOptionsHandle({ headingStyle: 'atx' });
|
|
188
|
-
const markdownFromHandle = convertBytesWithOptionsHandle(htmlBytes, handle);
|
|
189
|
-
|
|
190
|
-
const inlineExtraction = convertBytesWithInlineImages(htmlBytes, null, {
|
|
191
|
-
maxDecodedSizeBytes: 5 * 1024 * 1024,
|
|
192
|
-
});
|
|
123
|
+
const result = convertBytes(htmlBytes);
|
|
124
|
+
console.log(result.content);
|
|
193
125
|
```
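Fetch API responses expose their body as an `ArrayBuffer`, which is not itself a `Uint8Array`. This sketch normalizes the common byte containers to a `Uint8Array` view without copying. `asUint8Array` is a hypothetical helper name, and the assumption that `convertBytes` accepts a `Uint8Array` comes only from the `Buffer -> Uint8Array` comment in the example above.

```typescript
// Hypothetical helper: returns a Uint8Array view over the given bytes.
// No data is copied; the result shares memory with the input.
function asUint8Array(input: ArrayBuffer | ArrayBufferView): Uint8Array {
  if (input instanceof Uint8Array) {
    return input; // includes Node.js Buffer, which extends Uint8Array
  }
  if (ArrayBuffer.isView(input)) {
    // Other typed arrays and DataView: respect their offset and length.
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  return new Uint8Array(input);
}

// Assumed usage (not verified against the package):
// const bytes = asUint8Array(await response.arrayBuffer());
// const result = convertBytes(bytes);
```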
|
|
194
126
|
|
|
195
127
|
### With Options
|
|
@@ -197,7 +129,7 @@ const inlineExtraction = convertBytesWithInlineImages(htmlBytes, null, {
|
|
|
197
129
|
```typescript
|
|
198
130
|
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
199
131
|
|
|
200
|
-
const
|
|
132
|
+
const result = convert(html, {
|
|
201
133
|
headingStyle: 'atx',
|
|
202
134
|
codeBlockStyle: 'backticks',
|
|
203
135
|
listIndentWidth: 2,
|
|
@@ -205,9 +137,10 @@ const markdown = convert(html, {
|
|
|
205
137
|
wrap: true,
|
|
206
138
|
wrapWidth: 80
|
|
207
139
|
});
|
|
140
|
+
console.log(result.content);
|
|
208
141
|
```
|
|
209
142
|
|
|
210
|
-
### Preserve Complex HTML
|
|
143
|
+
### Preserve Complex HTML
|
|
211
144
|
|
|
212
145
|
```typescript
|
|
213
146
|
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
@@ -220,22 +153,23 @@ const html = `
|
|
|
220
153
|
</table>
|
|
221
154
|
`;
|
|
222
155
|
|
|
223
|
-
const
|
|
156
|
+
const result = convert(html, {
|
|
224
157
|
preserveTags: ['table'] // Keep tables as HTML
|
|
225
158
|
});
|
|
159
|
+
console.log(result.content);
|
|
226
160
|
```
|
|
227
161
|
|
|
228
162
|
### Deno
|
|
229
163
|
|
|
230
164
|
```typescript
|
|
231
|
-
import { convert } from "npm
|
|
165
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
232
166
|
|
|
233
167
|
const html = await Deno.readTextFile("input.html");
|
|
234
|
-
const
|
|
235
|
-
await Deno.writeTextFile("output.md",
|
|
168
|
+
const result = convert(html, { headingStyle: "atx" });
|
|
169
|
+
await Deno.writeTextFile("output.md", result.content ?? "");
|
|
236
170
|
```
|
|
237
171
|
|
|
238
|
-
> **Performance Tip:** For Node.js/Bun, use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for 1.
|
|
172
|
+
> **Performance Tip:** For Node.js/Bun, use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for 1.17x better performance with native bindings.
|
|
239
173
|
|
|
240
174
|
### Browser (ESM)
|
|
241
175
|
|
|
@@ -253,10 +187,10 @@ await Deno.writeTextFile("output.md", markdown);
|
|
|
253
187
|
await init();
|
|
254
188
|
|
|
255
189
|
const html = '<h1>Hello World</h1><p>This runs in the <strong>browser</strong>!</p>';
|
|
256
|
-
const
|
|
190
|
+
const result = convert(html, { headingStyle: 'atx' });
|
|
257
191
|
|
|
258
|
-
console.log(
|
|
259
|
-
document.body.innerHTML = `<pre>${
|
|
192
|
+
console.log(result.content);
|
|
193
|
+
document.body.innerHTML = `<pre>${result.content}</pre>`;
|
|
260
194
|
</script>
|
|
261
195
|
</body>
|
|
262
196
|
</html>
|
|
@@ -267,10 +201,11 @@ await Deno.writeTextFile("output.md", markdown);
|
|
|
267
201
|
```typescript
|
|
268
202
|
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
269
203
|
|
|
270
|
-
const
|
|
204
|
+
const result = convert('<h1>Hello</h1>', {
|
|
271
205
|
headingStyle: 'atx',
|
|
272
206
|
codeBlockStyle: 'backticks'
|
|
273
207
|
});
|
|
208
|
+
console.log(result.content);
|
|
274
209
|
```
|
|
275
210
|
|
|
276
211
|
### Cloudflare Workers
|
|
@@ -286,28 +221,22 @@ export default {
|
|
|
286
221
|
async fetch(request: Request): Promise<Response> {
|
|
287
222
|
await ready;
|
|
288
223
|
const html = await request.text();
|
|
289
|
-
const
|
|
224
|
+
const result = convert(html, { headingStyle: 'atx' });
|
|
290
225
|
|
|
291
|
-
return new Response(
|
|
226
|
+
return new Response(result.content ?? "", {
|
|
292
227
|
headers: { 'Content-Type': 'text/markdown' }
|
|
293
228
|
});
|
|
294
229
|
}
|
|
295
230
|
};
|
|
296
231
|
```
|
|
297
232
|
|
|
298
|
-
> See the full [Cloudflare Workers example](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-cloudflare) with Wrangler configuration.
|
|
299
233
|
|
|
300
234
|
## TypeScript
|
|
301
235
|
|
|
302
236
|
Full TypeScript support with type definitions:
|
|
303
237
|
|
|
304
238
|
```typescript
|
|
305
|
-
import {
|
|
306
|
-
convert,
|
|
307
|
-
convertWithInlineImages,
|
|
308
|
-
WasmInlineImageConfig,
|
|
309
|
-
type WasmConversionOptions
|
|
310
|
-
} from '@kreuzberg/html-to-markdown-wasm';
|
|
239
|
+
import { convert, type WasmConversionOptions } from '@kreuzberg/html-to-markdown-wasm';
|
|
311
240
|
|
|
312
241
|
const options: WasmConversionOptions = {
|
|
313
242
|
headingStyle: 'atx',
|
|
@@ -317,40 +246,16 @@ const options: WasmConversionOptions = {
|
|
|
317
246
|
wrapWidth: 80
|
|
318
247
|
};
|
|
319
248
|
|
|
320
|
-
const
|
|
249
|
+
const result = convert('<h1>Hello</h1>', options);
|
|
250
|
+
console.log(result.content);
|
|
321
251
|
```
|
|
322
252
|
|
|
323
|
-
##
|
|
253
|
+
## Metadata and Tables
|
|
324
254
|
|
|
325
|
-
Extract and
|
|
255
|
+
Extract document metadata and structured tables from the conversion result:
|
|
326
256
|
|
|
327
257
|
```typescript
|
|
328
|
-
import {
|
|
329
|
-
|
|
330
|
-
const html = '<img src="data:image/png;base64,iVBORw0..." alt="Logo">';
|
|
331
|
-
|
|
332
|
-
const config = new WasmInlineImageConfig(5 * 1024 * 1024); // 5MB max
|
|
333
|
-
config.inferDimensions = true;
|
|
334
|
-
config.filenamePrefix = 'img_';
|
|
335
|
-
config.captureSvg = true;
|
|
336
|
-
|
|
337
|
-
const result = convertWithInlineImages(html, null, config);
|
|
338
|
-
|
|
339
|
-
console.log(result.markdown);
|
|
340
|
-
console.log(`Extracted ${result.inlineImages.length} images`);
|
|
341
|
-
|
|
342
|
-
for (const img of result.inlineImages) {
|
|
343
|
-
console.log(`${img.filename}: ${img.format}, ${img.data.length} bytes`);
|
|
344
|
-
// img.data is a Uint8Array - save to file or upload
|
|
345
|
-
}
|
|
346
|
-
```
|
|
347
|
-
|
|
348
|
-
## Metadata Extraction
|
|
349
|
-
|
|
350
|
-
Extract document metadata (headers, links, images, structured data) alongside Markdown conversion:
|
|
351
|
-
|
|
352
|
-
```typescript
|
|
353
|
-
import { convertWithMetadata, WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
258
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
354
259
|
|
|
355
260
|
const html = `
|
|
356
261
|
<html lang="en">
|
|
@@ -359,97 +264,23 @@ const html = `
|
|
|
359
264
|
<h1>Main Title</h1>
|
|
360
265
|
<p>Content with <a href="https://example.com">a link</a></p>
|
|
361
266
|
<img src="https://example.com/image.jpg" alt="Example image">
|
|
267
|
+
<table>
|
|
268
|
+
<tr><th>Name</th><th>Value</th></tr>
|
|
269
|
+
<tr><td>Foo</td><td>42</td></tr>
|
|
270
|
+
</table>
|
|
362
271
|
</body>
|
|
363
272
|
</html>
|
|
364
273
|
`;
|
|
365
274
|
|
|
366
|
-
const
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
config.extractImages = true;
|
|
370
|
-
config.extractStructuredData = true;
|
|
371
|
-
config.maxStructuredDataSize = 1_000_000; // 1MB limit
|
|
372
|
-
|
|
373
|
-
const result = convertWithMetadata(html, null, config);
|
|
374
|
-
|
|
375
|
-
console.log(result.markdown);
|
|
376
|
-
console.log('Document metadata:', result.metadata.document);
|
|
377
|
-
// {
|
|
378
|
-
// title: 'My Article',
|
|
379
|
-
// language: 'en',
|
|
380
|
-
// ...
|
|
381
|
-
// }
|
|
382
|
-
|
|
383
|
-
console.log('Headers:', result.metadata.headers);
|
|
384
|
-
// [
|
|
385
|
-
// { level: 1, text: 'Main Title', id: undefined, depth: 0, htmlOffset: ... }
|
|
386
|
-
// ]
|
|
387
|
-
|
|
388
|
-
console.log('Links:', result.metadata.links);
|
|
389
|
-
// [
|
|
390
|
-
// {
|
|
391
|
-
// href: 'https://example.com',
|
|
392
|
-
// text: 'a link',
|
|
393
|
-
// linkType: 'external',
|
|
394
|
-
// rel: [],
|
|
395
|
-
// ...
|
|
396
|
-
// }
|
|
397
|
-
// ]
|
|
398
|
-
|
|
399
|
-
console.log('Images:', result.metadata.images);
|
|
400
|
-
// [
|
|
401
|
-
// {
|
|
402
|
-
// src: 'https://example.com/image.jpg',
|
|
403
|
-
// alt: 'Example image',
|
|
404
|
-
// imageType: 'external',
|
|
405
|
-
// ...
|
|
406
|
-
// }
|
|
407
|
-
// ]
|
|
408
|
-
```
|
|
409
|
-
|
|
410
|
-
### Metadata Configuration
|
|
411
|
-
|
|
412
|
-
The `WasmMetadataConfig` class controls what metadata is extracted:
|
|
413
|
-
|
|
414
|
-
```typescript
|
|
415
|
-
import { WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
416
|
-
|
|
417
|
-
const config = new WasmMetadataConfig();
|
|
418
|
-
|
|
419
|
-
// Enable/disable extraction types
|
|
420
|
-
config.extractHeaders = true; // h1-h6 elements
|
|
421
|
-
config.extractLinks = true; // <a> elements with link type classification
|
|
422
|
-
config.extractImages = true; // <img> and <svg> elements
|
|
423
|
-
config.extractStructuredData = true; // JSON-LD, Microdata, RDFa
|
|
424
|
-
|
|
425
|
-
// Limit structured data size to prevent memory exhaustion
|
|
426
|
-
config.maxStructuredDataSize = 1_000_000; // 1MB default
|
|
427
|
-
```
|
|
428
|
-
|
|
429
|
-
### Metadata Structure
|
|
430
|
-
|
|
431
|
-
The returned metadata object includes:
|
|
432
|
-
|
|
433
|
-
- **document**: Document-level metadata (title, description, keywords, language, OG tags, Twitter cards, etc.)
|
|
434
|
-
- **headers**: Array of header elements with level, text, id, and document position
|
|
435
|
-
- **links**: Array of links with href, text, type (anchor/internal/external/email/phone), and rel attributes
|
|
436
|
-
- **images**: Array of images with src, alt text, dimensions, and type classification (dataUri/external/relative/svg)
|
|
437
|
-
- **structuredData**: Array of JSON-LD, Microdata, and RDFa blocks
|
|
438
|
-
|
|
439
|
-
### Byte-Based Input
|
|
440
|
-
|
|
441
|
-
Convert bytes directly with metadata extraction:
|
|
442
|
-
|
|
443
|
-
```typescript
|
|
444
|
-
import { convertBytesWithMetadata, WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
445
|
-
import { readFileSync } from 'node:fs';
|
|
446
|
-
|
|
447
|
-
const htmlBytes = readFileSync('article.html');
|
|
448
|
-
const config = new WasmMetadataConfig();
|
|
275
|
+
const result = convert(html, {
|
|
276
|
+
extractMetadata: true,
|
|
277
|
+
});
|
|
449
278
|
|
|
450
|
-
|
|
451
|
-
console.log(result.
|
|
452
|
-
console.log(result.
|
|
279
|
+
console.log(result.content); // Markdown output
|
|
280
|
+
console.log(result.metadata); // JSON string with title, links, headers, etc.
|
|
281
|
+
console.log(result.tables.length); // Number of tables found
|
|
282
|
+
console.log(result.images.length); // Number of inline images extracted
|
|
283
|
+
console.log(result.warnings); // Any processing warnings
|
|
453
284
|
```
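Because `result.metadata` is described above as a JSON string (or `null`), it has to be parsed before individual fields can be read. The following is a minimal sketch. `parseJsonField` is a hypothetical helper, and the shape of the parsed value is not specified in this README, so it is typed as `unknown` rather than guessed.

```typescript
// Hypothetical helper: parses a JSON-string field such as result.metadata
// or result.document. Returns null for null input or malformed JSON.
function parseJsonField(raw: string | null): unknown {
  if (raw === null) {
    return null;
  }
  try {
    return JSON.parse(raw);
  } catch {
    return null; // malformed JSON is treated like "no data" in this sketch
  }
}

// Assumed usage:
// const metadata = parseJsonField(result.metadata);
```

Returning `unknown` forces callers to check the structure before reading fields, which seems safer than asserting a type the source does not document.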
|
|
454
285
|
|
|
455
286
|
## Build Targets
|
|
@@ -466,33 +297,33 @@ Three build targets are provided for different environments:
|
|
|
466
297
|
|
|
467
298
|
| Runtime | Support | Package |
|
|
468
299
|
| ------------------------- | ---------------------------- | -------------- |
|
|
469
|
-
|
|
|
470
|
-
|
|
|
471
|
-
|
|
|
472
|
-
|
|
|
473
|
-
|
|
|
474
|
-
|
|
|
300
|
+
| **Node.js** 18+ | Full support | `dist-node` |
|
|
301
|
+
| **Deno** | Full support | npm: specifier |
|
|
302
|
+
| **Bun** | Full support (prefer native) | Default export |
|
|
303
|
+
| **Browsers** | Full support | `dist-web` |
|
|
304
|
+
| **Cloudflare Workers** | Full support | Default export |
|
|
305
|
+
| **Deno Deploy** | Full support | npm: specifier |
|
|
475
306
|
|
|
476
307
|
## When to Use
|
|
477
308
|
|
|
478
309
|
Choose `@kreuzberg/html-to-markdown-wasm` when:
|
|
479
310
|
|
|
480
|
-
-
|
|
481
|
-
-
|
|
482
|
-
-
|
|
483
|
-
-
|
|
484
|
-
-
|
|
311
|
+
- Running in browsers or edge runtimes
|
|
312
|
+
- Using Deno
|
|
313
|
+
- Deploying to Cloudflare Workers, Deno Deploy
|
|
314
|
+
- Building universal libraries
|
|
315
|
+
- Need consistent behavior across all platforms
|
|
485
316
|
|
|
486
317
|
Use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for:
|
|
487
318
|
|
|
488
|
-
-
|
|
489
|
-
-
|
|
319
|
+
- Maximum performance in Node.js/Bun (~3x faster)
|
|
320
|
+
- Server-side only applications
|
|
490
321
|
|
|
491
322
|
## Visitor Pattern Support
|
|
492
323
|
|
|
493
324
|
**The WebAssembly binding does not support the visitor pattern.** The visitor pattern requires callbacks and stateful execution across the WebAssembly/JavaScript boundary, which has fundamental limitations:
|
|
494
325
|
|
|
495
|
-
### Why WASM
|
|
326
|
+
### Why WASM Does Not Support Visitors
|
|
496
327
|
|
|
497
328
|
1. **Memory safety across FFI boundary**: The WASM/JS boundary cannot safely pass mutable function callbacks that maintain state across multiple invocations
|
|
498
329
|
2. **Single-threaded execution model**: WASM runs on a single thread with no equivalent to Node.js's `ThreadsafeFunction` FFI primitive
|
|
@@ -504,10 +335,11 @@ Use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/
|
|
|
504
335
|
Choose one of these approaches:
|
|
505
336
|
|
|
506
337
|
#### 1. Use Node.js Binding (Recommended)
|
|
338
|
+
|
|
507
339
|
For best performance with visitor support, use the native Node.js binding:
|
|
508
340
|
|
|
509
341
|
```typescript
|
|
510
|
-
import {
|
|
342
|
+
import { convert, type Visitor } from '@kreuzberg/html-to-markdown-node';
|
|
511
343
|
|
|
512
344
|
const visitor: Visitor = {
|
|
513
345
|
visitLink(ctx, href, text, title) {
|
|
@@ -516,28 +348,32 @@ const visitor: Visitor = {
|
|
|
516
348
|
},
|
|
517
349
|
};
|
|
518
350
|
|
|
519
|
-
const
|
|
351
|
+
const result = convert(html, undefined, visitor);
|
|
352
|
+
console.log(result.content);
|
|
520
353
|
```
|
|
521
354
|
|
|
522
|
-
**Performance:** ~
|
|
355
|
+
**Performance:** ~3x faster than WASM, full visitor pattern support.
|
|
523
356
|
**Use when:** Running on Node.js or Bun server-side.
|
|
524
357
|
|
|
525
358
|
#### 2. Use Server-Side Bindings
|
|
359
|
+
|
|
526
360
|
For other platforms, use Python, Ruby, or PHP bindings with visitor support:
|
|
527
361
|
|
|
528
362
|
**Python:**
|
|
363
|
+
|
|
529
364
|
```python
|
|
530
|
-
from html_to_markdown import
|
|
365
|
+
from html_to_markdown import convert
|
|
531
366
|
|
|
532
367
|
class MyVisitor:
|
|
533
368
|
def visit_link(self, ctx, href, text, title):
|
|
534
369
|
# Your visitor logic here
|
|
535
370
|
return {"type": "continue"}
|
|
536
371
|
|
|
537
|
-
|
|
372
|
+
result = convert(html, None, MyVisitor())
|
|
538
373
|
```
|
|
539
374
|
|
|
540
375
|
**Ruby:**
|
|
376
|
+
|
|
541
377
|
```ruby
|
|
542
378
|
require 'html_to_markdown'
|
|
543
379
|
|
|
@@ -547,10 +383,11 @@ class MyVisitor
|
|
|
547
383
|
end
|
|
548
384
|
end
|
|
549
385
|
|
|
550
|
-
|
|
386
|
+
result = HtmlToMarkdown.convert(html, nil, MyVisitor.new)
|
|
551
387
|
```
|
|
552
388
|
|
|
553
389
|
**PHP:**
|
|
390
|
+
|
|
554
391
|
```php
|
|
555
392
|
use HtmlToMarkdown\Converter;
|
|
556
393
|
|
|
@@ -560,10 +397,11 @@ class MyVisitor {
|
|
|
560
397
|
}
|
|
561
398
|
}
|
|
562
399
|
|
|
563
|
-
$
|
|
400
|
+
$result = Converter::convert($html, null, new MyVisitor());
|
|
564
401
|
```
|
|
565
402
|
|
|
566
403
|
#### 3. Preprocess HTML Before Conversion
|
|
404
|
+
|
|
567
405
|
For simple transformations, manipulate the HTML before passing to WASM:
|
|
568
406
|
|
|
569
407
|
```typescript
|
|
@@ -575,18 +413,20 @@ const processedHtml = html.replace(
|
|
|
575
413
|
'https://new-cdn.com'
|
|
576
414
|
);
|
|
577
415
|
|
|
578
|
-
const
|
|
416
|
+
const result = convert(processedHtml);
|
|
417
|
+
console.log(result.content);
|
|
579
418
|
```
|
|
580
419
|
|
|
581
420
|
**Use when:** Only simple text replacements are needed.
|
|
582
421
|
|
|
583
422
|
#### 4. Post-Process Markdown
|
|
423
|
+
|
|
584
424
|
Transform the output Markdown after conversion:
|
|
585
425
|
|
|
586
426
|
```typescript
|
|
587
427
|
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
588
428
|
|
|
589
|
-
const markdown = convert(html);
|
|
429
|
+
const markdown = convert(html).content ?? "";
|
|
590
430
|
|
|
591
431
|
// Post-process the markdown
|
|
592
432
|
const transformed = markdown
|
|
@@ -600,30 +440,30 @@ const transformed = markdown
|
|
|
600
440
|
|
|
601
441
|
| Binding | Visitor Support | Best For |
|
|
602
442
|
|---------|-----------------|----------|
|
|
603
|
-
| **Rust** |
|
|
604
|
-
| **Python** |
|
|
605
|
-
| **TypeScript/Node.js** |
|
|
606
|
-
| **Ruby** |
|
|
607
|
-
| **PHP** |
|
|
608
|
-
| **Go** |
|
|
609
|
-
| **Java** |
|
|
610
|
-
| **C#** |
|
|
611
|
-
| **Elixir** |
|
|
612
|
-
| **WebAssembly** |
|
|
613
|
-
|
|
614
|
-
For comprehensive visitor pattern documentation with examples, see [
|
|
443
|
+
| **Rust** | Yes | Core library, performance-critical code |
|
|
444
|
+
| **Python** | Yes (sync and async) | Server-side, bulk processing |
|
|
445
|
+
| **TypeScript/Node.js** | Yes (sync and async) | Server-side Node.js/Bun, best performance |
|
|
446
|
+
| **Ruby** | Yes | Server-side Ruby on Rails, Sinatra |
|
|
447
|
+
| **PHP** | Yes | Server-side PHP, content management |
|
|
448
|
+
| **Go** | No | Basic conversion only |
|
|
449
|
+
| **Java** | No | Basic conversion only |
|
|
450
|
+
| **C#** | No | Basic conversion only |
|
|
451
|
+
| **Elixir** | No | Basic conversion only |
|
|
452
|
+
| **WebAssembly** | No | Browser, Edge, Deno (see alternatives above) |
|
|
453
|
+
|
|
454
|
+
For comprehensive visitor pattern documentation with examples, see the [full documentation](https://docs.html-to-markdown.kreuzberg.dev).
|
|
615
455
|
|
|
616
456
|
## Configuration Options
|
|
617
457
|
|
|
618
|
-
|
|
458
|
+
Available options:
|
|
619
459
|
|
|
620
460
|
- Heading styles (atx, underlined, atxClosed)
|
|
621
461
|
- Code block styles (indented, backticks, tildes)
|
|
622
462
|
- List formatting (indent width, bullet characters)
|
|
623
463
|
- Text escaping and formatting
|
|
624
464
|
- Tag preservation (`preserveTags`) and stripping (`stripTags`)
|
|
465
|
+
- Metadata extraction (`extractMetadata`)
|
|
625
466
|
- Preprocessing for web scraping
|
|
626
|
-
- hOCR table extraction
|
|
627
467
|
- And more...
|
|
628
468
|
|
|
629
469
|
## Examples
|
|
@@ -644,35 +484,36 @@ const html = `
|
|
|
644
484
|
<p>After table</p>
|
|
645
485
|
`;
|
|
646
486
|
|
|
647
|
-
const
|
|
487
|
+
const result = convert(html, {
|
|
648
488
|
preserveTags: ['table']
|
|
649
489
|
});
|
|
650
490
|
|
|
651
|
-
//
|
|
491
|
+
// result.content includes the table as HTML
|
|
652
492
|
```
|
|
653
493
|
|
|
654
494
|
Combine with `stripTags`:
|
|
655
495
|
|
|
656
496
|
```typescript
|
|
657
|
-
const
|
|
497
|
+
const result = convert(html, {
|
|
658
498
|
preserveTags: ['table', 'form'], // Keep as HTML
|
|
659
499
|
stripTags: ['script', 'style'] // Remove entirely
|
|
660
500
|
});
|
|
501
|
+
console.log(result.content);
|
|
661
502
|
```
|
|
662
503
|
|
|
663
504
|
### Deno Web Server
|
|
664
505
|
|
|
665
506
|
```typescript
|
|
666
|
-
import { convert } from "npm
|
|
507
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
667
508
|
|
|
668
|
-
Deno.serve((req) => {
|
|
509
|
+
Deno.serve(async (req) => {
|
|
669
510
|
const url = new URL(req.url);
|
|
670
511
|
|
|
671
512
|
if (url.pathname === "/convert" && req.method === "POST") {
|
|
672
513
|
const html = await req.text();
|
|
673
|
-
const
|
|
514
|
+
const result = convert(html, { headingStyle: "atx" });
|
|
674
515
|
|
|
675
|
-
return new Response(
|
|
516
|
+
return new Response(result.content ?? "", {
|
|
676
517
|
headers: { "Content-Type": "text/markdown" }
|
|
677
518
|
});
|
|
678
519
|
}
|
|
@@ -696,8 +537,8 @@ Deno.serve((req) => {
|
|
|
696
537
|
window.convertFile = async () => {
|
|
697
538
|
const file = document.getElementById('htmlFile').files[0];
|
|
698
539
|
const html = await file.text();
|
|
699
|
-
const
|
|
700
|
-
document.getElementById('output').textContent =
|
|
540
|
+
const result = convert(html, { headingStyle: 'atx' });
|
|
541
|
+
document.getElementById('output').textContent = result.content ?? "";
|
|
701
542
|
};
|
|
702
543
|
</script>
|
|
703
544
|
```
|
|
@@ -705,42 +546,39 @@ Deno.serve((req) => {
|
|
|
705
546
|
### Web Scraping (Deno)
|
|
706
547
|
|
|
707
548
|
```typescript
|
|
708
|
-
import { convert } from "npm
|
|
549
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
709
550
|
|
|
710
551
|
const response = await fetch("https://example.com");
|
|
711
552
|
const html = await response.text();
|
|
712
553
|
|
|
713
|
-
const
|
|
714
|
-
|
|
715
|
-
|
|
716
|
-
|
|
717
|
-
removeNavigation: true,
|
|
718
|
-
removeForms: true
|
|
719
|
-
},
|
|
554
|
+
const result = convert(html, {
|
|
555
|
+
preprocess: true,
|
|
556
|
+
preset: "aggressive",
|
|
557
|
+
keepNavigation: false,
|
|
720
558
|
headingStyle: "atx",
|
|
721
559
|
codeBlockStyle: "backticks"
|
|
722
560
|
});
|
|
723
561
|
|
|
724
|
-
console.log(
|
|
562
|
+
console.log(result.content);
|
|
725
563
|
```
|
|
726
564
|
|
|
727
565
|
## Other Runtimes
|
|
728
566
|
|
|
729
567
|
The same Rust engine ships as native bindings for other ecosystems:
|
|
730
568
|
|
|
731
|
-
-
|
|
732
|
-
-
|
|
733
|
-
-
|
|
734
|
-
-
|
|
735
|
-
-
|
|
569
|
+
- Node.js / Bun: [`@kreuzberg/html-to-markdown-node`](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
570
|
+
- Python: [`html-to-markdown`](https://pypi.org/project/html-to-markdown/)
|
|
571
|
+
- Ruby: [`html-to-markdown`](https://rubygems.org/gems/html-to-markdown)
|
|
572
|
+
- PHP: [`kreuzberg-dev/html-to-markdown`](https://packagist.org/packages/kreuzberg-dev/html-to-markdown)
|
|
573
|
+
- Rust crate and CLI: [`html-to-markdown-rs`](https://crates.io/crates/html-to-markdown-rs)
|
|
736
574
|
|
|
737
575
|
## Links
|
|
738
576
|
|
|
739
577
|
- [GitHub Repository](https://github.com/kreuzberg-dev/html-to-markdown)
|
|
740
578
|
- [Full Documentation](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/README.md)
|
|
741
|
-
- [Native Node Package](https://www.npmjs.com/package/html-to-markdown-node)
|
|
579
|
+
- [Native Node Package](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
742
580
|
- [Python Package](https://pypi.org/project/html-to-markdown/)
|
|
743
|
-
- [PHP Extension
|
|
581
|
+
- [PHP Extension and Helpers](https://packagist.org/packages/kreuzberg-dev/html-to-markdown)
|
|
744
582
|
- [Rust Crate](https://crates.io/crates/html-to-markdown-rs)
|
|
745
583
|
|
|
746
584
|
## License
|