@kreuzberg/html-to-markdown-wasm 2.19.0-rc.1
- package/LICENSE +21 -0
- package/README.md +569 -0
- package/dist/LICENSE +21 -0
- package/dist/README.md +202 -0
- package/dist/html_to_markdown_wasm.d.ts +200 -0
- package/dist/html_to_markdown_wasm.js +116 -0
- package/dist/html_to_markdown_wasm_bg.js +1355 -0
- package/dist/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist/html_to_markdown_wasm_bg.wasm.d.ts +55 -0
- package/dist/package.json +27 -0
- package/dist-node/LICENSE +21 -0
- package/dist-node/README.md +202 -0
- package/dist-node/html_to_markdown_wasm.d.ts +197 -0
- package/dist-node/html_to_markdown_wasm.js +1369 -0
- package/dist-node/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-node/html_to_markdown_wasm_bg.wasm.d.ts +55 -0
- package/dist-node/package.json +21 -0
- package/dist-web/LICENSE +21 -0
- package/dist-web/README.md +202 -0
- package/dist-web/html_to_markdown_wasm.d.ts +277 -0
- package/dist-web/html_to_markdown_wasm.js +1395 -0
- package/dist-web/html_to_markdown_wasm_bg.wasm +0 -0
- package/dist-web/html_to_markdown_wasm_bg.wasm.d.ts +55 -0
- package/dist-web/package.json +25 -0
- package/package.json +68 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
The MIT License (MIT)
|
|
2
|
+
|
|
3
|
+
Copyright 2024-2025 Na'aman Hirschfeld
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,569 @@
|
|
|
1
|
+
# @kreuzberg/html-to-markdown-wasm
|
|
2
|
+
|
|
3
|
+
> **npm package:** `@kreuzberg/html-to-markdown-wasm` (this README).
|
|
4
|
+
> Use [`@kreuzberg/html-to-markdown-node`](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) when you only target Node.js or Bun and want native performance.
|
|
5
|
+
|
|
6
|
+
Universal HTML to Markdown converter using WebAssembly.
|
|
7
|
+
|
|
8
|
+
Powered by the same Rust engine as the Node.js, Python, Ruby, and PHP bindings, so Markdown output stays identical regardless of runtime.
|
|
9
|
+
|
|
10
|
+
Runs anywhere: Node.js, Deno, Bun, browsers, and edge runtimes.
|
|
11
|
+
|
|
12
|
+
[crates.io](https://crates.io/crates/html-to-markdown-rs)
|
|
13
|
+
[npm (node)](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node)
|
|
14
|
+
[npm (wasm)](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm)
|
|
15
|
+
[PyPI](https://pypi.org/project/html-to-markdown/)
|
|
16
|
+
[Packagist](https://packagist.org/packages/goldziher/html-to-markdown)
|
|
17
|
+
[RubyGems](https://rubygems.org/gems/html-to-markdown)
|
|
18
|
+
[NuGet](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
|
|
19
|
+
[Maven Central](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
|
|
20
|
+
[Go Reference](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown)
|
|
21
|
+
[License](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE)
|
|
22
|
+
|
|
23
|
+
## Performance
|
|
24
|
+
|
|
25
|
+
Universal WebAssembly bindings for all JavaScript runtimes, benchmarked at roughly 1.17× slower than the native NAPI bindings (see Comparison below).
|
|
26
|
+
|
|
27
|
+
### Benchmark Results (Apple M4)
|
|
28
|
+
|
|
29
|
+
| Document Type | ops/sec | Notes |
|
|
30
|
+
| -------------------------- | ---------- | ------------------ |
|
|
31
|
+
| **Small (5 paragraphs)** | **70,300** | Simple documents |
|
|
32
|
+
| **Medium (25 paragraphs)** | **15,282** | Nested formatting |
|
|
33
|
+
| **Large (100 paragraphs)** | **3,836** | Complex structures |
|
|
34
|
+
| **Tables (20 tables)** | **3,748** | Table processing |
|
|
35
|
+
| **Lists (500 items)** | **1,391** | Nested lists |
|
|
36
|
+
| **Wikipedia (129KB)** | **1,022** | Real-world content |
|
|
37
|
+
| **Wikipedia (653KB)** | **147** | Large documents |
|
|
38
|
+
|
|
39
|
+
**Average: ~13,675 ops/sec** (arithmetic mean of the seven workloads in the table above).
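The ops/sec figures are easier to reason about once converted into time per conversion and input throughput. The sketch below does that arithmetic for the two Wikipedia rows of the table above. The helper names (`msPerConversion`, `throughputMBps`) are illustrative only, and it assumes 1 KB = 1024 bytes and 1 MB = 1024 KB.

```typescript
// Figures copied from the Wikipedia rows of the benchmark table above.
// Assumption: 1 KB = 1024 bytes and 1 MB = 1024 KB.
interface BenchmarkRow {
  name: string;
  sizeKB: number;
  opsPerSec: number;
}

const wikipediaRows: BenchmarkRow[] = [
  { name: "Wikipedia (129KB)", sizeKB: 129, opsPerSec: 1022 },
  { name: "Wikipedia (653KB)", sizeKB: 653, opsPerSec: 147 },
];

// Mean time for one conversion, in milliseconds.
function msPerConversion(opsPerSec: number): number {
  return 1000 / opsPerSec;
}

// Input processed per second, expressed in MB/s.
function throughputMBps(sizeKB: number, opsPerSec: number): number {
  return (sizeKB * opsPerSec) / 1024;
}

for (const row of wikipediaRows) {
  const ms = msPerConversion(row.opsPerSec).toFixed(2);
  const mbps = throughputMBps(row.sizeKB, row.opsPerSec).toFixed(1);
  console.log(`${row.name}: ${ms} ms per conversion, ${mbps} MB/s`);
}
```

Under these assumptions the 129 KB page works out to about 1 ms per conversion (roughly 129 MB/s), and the 653 KB page to about 6.8 ms (roughly 94 MB/s).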
|
|
40
|
+
|
|
41
|
+
### Comparison
|
|
42
|
+
|
|
43
|
+
- **vs Native NAPI**: ~1.17× slower (WASM has minimal overhead)
|
|
44
|
+
- **vs Python**: ~6.3× faster (no FFI overhead)
|
|
45
|
+
- **Best for**: Universal deployment (browsers, Deno, edge runtimes, cross-platform apps)
|
|
46
|
+
|
|
47
|
+
### Benchmark Fixtures (Apple M4)
|
|
48
|
+
|
|
49
|
+
Numbers captured via the shared fixture harness in `tools/benchmark-harness`:
|
|
50
|
+
|
|
51
|
+
| Document | Size | ops/sec (WASM) |
|
|
52
|
+
| ---------------------- | ------ | -------------- |
|
|
53
|
+
| Lists (Timeline) | 129 KB | 882 |
|
|
54
|
+
| Tables (Countries) | 360 KB | 242 |
|
|
55
|
+
| Medium (Python) | 657 KB | 121 |
|
|
56
|
+
| Large (Rust) | 567 KB | 124 |
|
|
57
|
+
| Small (Intro) | 463 KB | 163 |
|
|
58
|
+
| hOCR German PDF | 44 KB | 1,637 |
|
|
59
|
+
| hOCR Invoice | 4 KB | 7,775 |
|
|
60
|
+
| hOCR Embedded Tables | 37 KB | 1,667 |
|
|
61
|
+
|
|
62
|
+
> Expect slightly higher numbers in long-lived browser/Deno workers once the WASM module is warm.
|
|
63
|
+
|
|
64
|
+
## Installation
|
|
65
|
+
|
|
66
|
+
### npm / Yarn / pnpm
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
npm install @kreuzberg/html-to-markdown-wasm
|
|
70
|
+
# or
|
|
71
|
+
yarn add @kreuzberg/html-to-markdown-wasm
|
|
72
|
+
# or
|
|
73
|
+
pnpm add @kreuzberg/html-to-markdown-wasm
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
### Deno
|
|
77
|
+
|
|
78
|
+
```typescript
|
|
79
|
+
// Via npm specifier
|
|
80
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
## Usage
|
|
84
|
+
|
|
85
|
+
### Basic Conversion
|
|
86
|
+
|
|
87
|
+
```javascript
|
|
88
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
89
|
+
|
|
90
|
+
const html = '<h1>Hello World</h1><p>This is <strong>fast</strong>!</p>';
|
|
91
|
+
const markdown = convert(html);
|
|
92
|
+
console.log(markdown);
|
|
93
|
+
// # Hello World
|
|
94
|
+
//
|
|
95
|
+
// This is **fast**!
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
> **Heads up for edge runtimes:** Cloudflare Workers, Vite dev servers, and other environments that instantiate `.wasm` files asynchronously must call `await initWasm()` (or `await wasmReady`) once during startup before invoking `convert`. Traditional bundlers (Webpack, Rollup) and Deno/Node imports continue to work without manual initialization.
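One way to guarantee the "once during startup" requirement is to wrap the initializer so that every caller shares a single promise. The sketch below is a generic pattern, not part of this package: `once` is a hypothetical helper, and `fakeInitWasm` stands in for the real `initWasm` so the snippet runs on its own.

```typescript
// Wraps an async initializer so it runs at most once; later callers share the
// same in-flight (or settled) promise. `once` is a hypothetical helper name.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      // Clear the cached promise on failure so a later call can retry.
      pending = init().catch((error: unknown) => {
        pending = undefined;
        throw error;
      });
    }
    return pending;
  };
}

// Stand-in for the package's initializer so the sketch is self-contained.
let initCalls = 0;
const fakeInitWasm = async (): Promise<void> => {
  initCalls += 1;
};

const ensureReady = once(fakeInitWasm);

// Both calls return the same promise, and the initializer ran only once.
const first = ensureReady();
const second = ensureReady();
console.log(first === second, initCalls); // true 1
```

With the real package you would presumably pass `() => initWasm()` instead of the stand-in and `await ensureReady()` before each `convert` call. This assumes `initWasm` returns a promise, as the `await initWasm()` wording above suggests.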
|
|
99
|
+
|
|
100
|
+
**Working Examples:**
|
|
101
|
+
- [**Browser with Rollup**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-rollup) - Using dist-web target in browser
|
|
102
|
+
- [**Node.js**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-node) - Using dist-node target
|
|
103
|
+
- [**Cloudflare Workers**](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-cloudflare) - Using bundler target with Wrangler
|
|
104
|
+
|
|
105
|
+
### Reusing Options Handles
|
|
106
|
+
|
|
107
|
+
```ts
|
|
108
|
+
import {
|
|
109
|
+
convertWithOptionsHandle,
|
|
110
|
+
createConversionOptionsHandle,
|
|
111
|
+
} from '@kreuzberg/html-to-markdown-wasm';
|
|
112
|
+
|
|
113
|
+
const handle = createConversionOptionsHandle({ hocrSpatialTables: false });
|
|
114
|
+
const markdown = convertWithOptionsHandle('<h1>Reusable</h1>', handle);
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
### Byte-Based Input (Buffers / Uint8Array)
|
|
118
|
+
|
|
119
|
+
When you already have raw bytes (e.g., from `fs.readFileSync` or a Fetch API response), skip the `TextDecoder` decoding step by calling the byte-based helpers:
|
|
120
|
+
|
|
121
|
+
```ts
|
|
122
|
+
import {
|
|
123
|
+
convertBytes,
|
|
124
|
+
convertBytesWithOptionsHandle,
|
|
125
|
+
createConversionOptionsHandle,
|
|
126
|
+
convertBytesWithInlineImages,
|
|
127
|
+
} from '@kreuzberg/html-to-markdown-wasm';
|
|
128
|
+
import { readFileSync } from 'node:fs';
|
|
129
|
+
|
|
130
|
+
const htmlBytes = readFileSync('input.html'); // Buffer is a Uint8Array subclass
|
|
131
|
+
const markdown = convertBytes(htmlBytes);
|
|
132
|
+
|
|
133
|
+
const handle = createConversionOptionsHandle({ headingStyle: 'atx' });
|
|
134
|
+
const markdownFromHandle = convertBytesWithOptionsHandle(htmlBytes, handle);
|
|
135
|
+
|
|
136
|
+
const inlineExtraction = convertBytesWithInlineImages(htmlBytes, null, {
|
|
137
|
+
maxDecodedSizeBytes: 5 * 1024 * 1024,
|
|
138
|
+
});
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
### With Options
|
|
142
|
+
|
|
143
|
+
```typescript
|
|
144
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
145
|
+
|
|
146
|
+
const markdown = convert(html, {
|
|
147
|
+
headingStyle: 'atx',
|
|
148
|
+
codeBlockStyle: 'backticks',
|
|
149
|
+
listIndentWidth: 2,
|
|
150
|
+
bullets: '-',
|
|
151
|
+
wrap: true,
|
|
152
|
+
wrapWidth: 80
|
|
153
|
+
});
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
### Preserve Complex HTML (NEW in v2.5)
|
|
157
|
+
|
|
158
|
+
```typescript
|
|
159
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
160
|
+
|
|
161
|
+
const html = `
|
|
162
|
+
<h1>Report</h1>
|
|
163
|
+
<table>
|
|
164
|
+
<tr><th>Name</th><th>Value</th></tr>
|
|
165
|
+
<tr><td>Foo</td><td>Bar</td></tr>
|
|
166
|
+
</table>
|
|
167
|
+
`;
|
|
168
|
+
|
|
169
|
+
const markdown = convert(html, {
|
|
170
|
+
preserveTags: ['table'] // Keep tables as HTML
|
|
171
|
+
});
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
### Deno
|
|
175
|
+
|
|
176
|
+
```typescript
|
|
177
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
178
|
+
|
|
179
|
+
const html = await Deno.readTextFile("input.html");
|
|
180
|
+
const markdown = convert(html, { headingStyle: "atx" });
|
|
181
|
+
await Deno.writeTextFile("output.md", markdown);
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
> **Performance Tip:** For Node.js/Bun, use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for 1.17× better performance with native bindings.
|
|
185
|
+
|
|
186
|
+
### Browser (ESM)
|
|
187
|
+
|
|
188
|
+
```html
|
|
189
|
+
<!DOCTYPE html>
|
|
190
|
+
<html>
|
|
191
|
+
<head>
|
|
192
|
+
<title>HTML to Markdown</title>
|
|
193
|
+
</head>
|
|
194
|
+
<body>
|
|
195
|
+
<script type="module">
|
|
196
|
+
import init, { convert } from 'https://unpkg.com/@kreuzberg/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
197
|
+
|
|
198
|
+
// Initialize WASM module
|
|
199
|
+
await init();
|
|
200
|
+
|
|
201
|
+
const html = '<h1>Hello World</h1><p>This runs in the <strong>browser</strong>!</p>';
|
|
202
|
+
const markdown = convert(html, { headingStyle: 'atx' });
|
|
203
|
+
|
|
204
|
+
console.log(markdown);
|
|
205
|
+
document.body.appendChild(document.createElement('pre')).textContent = markdown; // textContent avoids re-parsing the Markdown as HTML
|
|
206
|
+
</script>
|
|
207
|
+
</body>
|
|
208
|
+
</html>
|
|
209
|
+
```
|
|
210
|
+
|
|
211
|
+
### Vite / Webpack / Bundlers
|
|
212
|
+
|
|
213
|
+
```typescript
|
|
214
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
215
|
+
|
|
216
|
+
const markdown = convert('<h1>Hello</h1>', {
|
|
217
|
+
headingStyle: 'atx',
|
|
218
|
+
codeBlockStyle: 'backticks'
|
|
219
|
+
});
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
### Cloudflare Workers
|
|
223
|
+
|
|
224
|
+
```typescript
|
|
225
|
+
import { convert, initWasm, wasmReady } from '@kreuzberg/html-to-markdown-wasm';
|
|
226
|
+
|
|
227
|
+
// Cloudflare Workers / other edge runtimes instantiate WASM asynchronously.
|
|
228
|
+
// Kick off initialization once at module scope.
|
|
229
|
+
const ready = wasmReady ?? initWasm();
|
|
230
|
+
|
|
231
|
+
export default {
|
|
232
|
+
async fetch(request: Request): Promise<Response> {
|
|
233
|
+
await ready;
|
|
234
|
+
const html = await request.text();
|
|
235
|
+
const markdown = convert(html, { headingStyle: 'atx' });
|
|
236
|
+
|
|
237
|
+
return new Response(markdown, {
|
|
238
|
+
headers: { 'Content-Type': 'text/markdown' }
|
|
239
|
+
});
|
|
240
|
+
}
|
|
241
|
+
};
|
|
242
|
+
```
|
|
243
|
+
|
|
244
|
+
> See the full [Cloudflare Workers example](https://github.com/kreuzberg-dev/html-to-markdown/tree/main/examples/wasm-cloudflare) with Wrangler configuration.
|
|
245
|
+
|
|
246
|
+
## TypeScript
|
|
247
|
+
|
|
248
|
+
Full TypeScript support with type definitions:
|
|
249
|
+
|
|
250
|
+
```typescript
|
|
251
|
+
import {
|
|
252
|
+
convert,
|
|
253
|
+
convertWithInlineImages,
|
|
254
|
+
WasmInlineImageConfig,
|
|
255
|
+
type WasmConversionOptions
|
|
256
|
+
} from '@kreuzberg/html-to-markdown-wasm';
|
|
257
|
+
|
|
258
|
+
const options: WasmConversionOptions = {
|
|
259
|
+
headingStyle: 'atx',
|
|
260
|
+
codeBlockStyle: 'backticks',
|
|
261
|
+
listIndentWidth: 2,
|
|
262
|
+
wrap: true,
|
|
263
|
+
wrapWidth: 80
|
|
264
|
+
};
|
|
265
|
+
|
|
266
|
+
const markdown = convert('<h1>Hello</h1>', options);
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
## Inline Images
|
|
270
|
+
|
|
271
|
+
Extract and decode inline images (data URIs, SVG):
|
|
272
|
+
|
|
273
|
+
```typescript
|
|
274
|
+
import { convertWithInlineImages, WasmInlineImageConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
275
|
+
|
|
276
|
+
const html = '<img src="..." alt="Logo">';
|
|
277
|
+
|
|
278
|
+
const config = new WasmInlineImageConfig(5 * 1024 * 1024); // 5MB max
|
|
279
|
+
config.inferDimensions = true;
|
|
280
|
+
config.filenamePrefix = 'img_';
|
|
281
|
+
config.captureSvg = true;
|
|
282
|
+
|
|
283
|
+
const result = convertWithInlineImages(html, null, config);
|
|
284
|
+
|
|
285
|
+
console.log(result.markdown);
|
|
286
|
+
console.log(`Extracted ${result.inlineImages.length} images`);
|
|
287
|
+
|
|
288
|
+
for (const img of result.inlineImages) {
|
|
289
|
+
console.log(`${img.filename}: ${img.format}, ${img.data.length} bytes`);
|
|
290
|
+
// img.data is a Uint8Array - save to file or upload
|
|
291
|
+
}
|
|
292
|
+
```
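To make "decode" concrete, the sketch below shows what turning a base64 data URI into bytes involves. It is an illustration of the general technique only: the library performs its own extraction internally, and `decodeBase64DataUri` and `DecodedDataUri` are hypothetical names that are not part of this package. It assumes a runtime with a global `atob` (browsers, Deno, recent Node.js).

```typescript
// Hypothetical result shape for this sketch; not a type exported by the package.
interface DecodedDataUri {
  mimeType: string;
  bytes: Uint8Array;
}

// Decodes a base64 data URI such as "data:image/png;base64,....".
// Returns null for anything that is not a base64-encoded data URI.
function decodeBase64DataUri(uri: string): DecodedDataUri | null {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/=]+)$/.exec(uri);
  if (match === null) {
    return null;
  }
  // atob yields a "binary string": one character per decoded byte.
  const binary = atob(match[2]);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: match[1], bytes };
}

// "R0lGODlh" is the base64 form of the 6-byte GIF header "GIF89a"
// (a header only, not a complete image).
const decoded = decodeBase64DataUri("data:image/gif;base64,R0lGODlh");
if (decoded !== null) {
  console.log(decoded.mimeType, decoded.bytes.length);
}
```

A size cap like the `5 * 1024 * 1024` passed to `WasmInlineImageConfig` above would, in a sketch like this, correspond to rejecting payloads whose decoded length exceeds the limit.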
|
|
293
|
+
|
|
294
|
+
## Metadata Extraction
|
|
295
|
+
|
|
296
|
+
Extract document metadata (headers, links, images, structured data) alongside Markdown conversion:
|
|
297
|
+
|
|
298
|
+
```typescript
|
|
299
|
+
import { convertWithMetadata, WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
300
|
+
|
|
301
|
+
const html = `
|
|
302
|
+
<html lang="en">
|
|
303
|
+
<head><title>My Article</title></head>
|
|
304
|
+
<body>
|
|
305
|
+
<h1>Main Title</h1>
|
|
306
|
+
<p>Content with <a href="https://example.com">a link</a></p>
|
|
307
|
+
<img src="https://example.com/image.jpg" alt="Example image">
|
|
308
|
+
</body>
|
|
309
|
+
</html>
|
|
310
|
+
`;
|
|
311
|
+
|
|
312
|
+
const config = new WasmMetadataConfig();
|
|
313
|
+
config.extractHeaders = true;
|
|
314
|
+
config.extractLinks = true;
|
|
315
|
+
config.extractImages = true;
|
|
316
|
+
config.extractStructuredData = true;
|
|
317
|
+
config.maxStructuredDataSize = 1_000_000; // 1MB limit
|
|
318
|
+
|
|
319
|
+
const result = convertWithMetadata(html, null, config);
|
|
320
|
+
|
|
321
|
+
console.log(result.markdown);
|
|
322
|
+
console.log('Document metadata:', result.metadata.document);
|
|
323
|
+
// {
|
|
324
|
+
// title: 'My Article',
|
|
325
|
+
// language: 'en',
|
|
326
|
+
// ...
|
|
327
|
+
// }
|
|
328
|
+
|
|
329
|
+
console.log('Headers:', result.metadata.headers);
|
|
330
|
+
// [
|
|
331
|
+
// { level: 1, text: 'Main Title', id: undefined, depth: 0, htmlOffset: ... }
|
|
332
|
+
// ]
|
|
333
|
+
|
|
334
|
+
console.log('Links:', result.metadata.links);
|
|
335
|
+
// [
|
|
336
|
+
// {
|
|
337
|
+
// href: 'https://example.com',
|
|
338
|
+
// text: 'a link',
|
|
339
|
+
// linkType: 'external',
|
|
340
|
+
// rel: [],
|
|
341
|
+
// ...
|
|
342
|
+
// }
|
|
343
|
+
// ]
|
|
344
|
+
|
|
345
|
+
console.log('Images:', result.metadata.images);
|
|
346
|
+
// [
|
|
347
|
+
// {
|
|
348
|
+
// src: 'https://example.com/image.jpg',
|
|
349
|
+
// alt: 'Example image',
|
|
350
|
+
// imageType: 'external',
|
|
351
|
+
// ...
|
|
352
|
+
// }
|
|
353
|
+
// ]
|
|
354
|
+
```
|
|
355
|
+
|
|
356
|
+
### Metadata Configuration
|
|
357
|
+
|
|
358
|
+
The `WasmMetadataConfig` class controls what metadata is extracted:
|
|
359
|
+
|
|
360
|
+
```typescript
|
|
361
|
+
import { WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
362
|
+
|
|
363
|
+
const config = new WasmMetadataConfig();
|
|
364
|
+
|
|
365
|
+
// Enable/disable extraction types
|
|
366
|
+
config.extractHeaders = true; // h1-h6 elements
|
|
367
|
+
config.extractLinks = true; // <a> elements with link type classification
|
|
368
|
+
config.extractImages = true; // <img> and <svg> elements
|
|
369
|
+
config.extractStructuredData = true; // JSON-LD, Microdata, RDFa
|
|
370
|
+
|
|
371
|
+
// Limit structured data size to prevent memory exhaustion
|
|
372
|
+
config.maxStructuredDataSize = 1_000_000; // 1MB default
|
|
373
|
+
```
|
|
374
|
+
|
|
375
|
+
### Metadata Structure
|
|
376
|
+
|
|
377
|
+
The returned metadata object includes:
|
|
378
|
+
|
|
379
|
+
- **document**: Document-level metadata (title, description, keywords, language, OG tags, Twitter cards, etc.)
|
|
380
|
+
- **headers**: Array of header elements with level, text, id, and document position
|
|
381
|
+
- **links**: Array of links with href, text, type (anchor/internal/external/email/phone), and rel attributes
|
|
382
|
+
- **images**: Array of images with src, alt text, dimensions, and type classification (dataUri/external/relative/svg)
|
|
383
|
+
- **structuredData**: Array of JSON-LD, Microdata, and RDFa blocks
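As a rough illustration of the link types listed above, the sketch below classifies an `href` with simple prefix checks. This is a heuristic written for this example, not the library's implementation; the actual rules may differ, and `LinkType` and `classifyLink` are hypothetical names.

```typescript
// The five link types named in the list above.
type LinkType = "anchor" | "internal" | "external" | "email" | "phone";

// Heuristic classification by href prefix (an assumption, not the library's rules).
function classifyLink(href: string): LinkType {
  const value = href.trim().toLowerCase();
  if (/^#/.test(value)) {
    return "anchor"; // same-page fragment
  }
  if (/^mailto:/.test(value)) {
    return "email";
  }
  if (/^tel:/.test(value)) {
    return "phone";
  }
  // An explicit scheme ("https://") or protocol-relative URL ("//host").
  if (/^([a-z][a-z0-9+.-]*:)?\/\//.test(value)) {
    return "external";
  }
  return "internal"; // relative or root-relative path
}

for (const href of ["https://example.com", "#intro", "mailto:hi@example.com", "tel:+15550100", "/docs/api"]) {
  console.log(`${href} -> ${classifyLink(href)}`);
}
```

The `https://example.com` case agrees with the `linkType: 'external'` shown in the sample output earlier; the other cases are guesses at how such a classifier might behave.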
|
|
384
|
+
|
|
385
|
+
### Byte-Based Input
|
|
386
|
+
|
|
387
|
+
Convert bytes directly with metadata extraction:
|
|
388
|
+
|
|
389
|
+
```typescript
|
|
390
|
+
import { convertBytesWithMetadata, WasmMetadataConfig } from '@kreuzberg/html-to-markdown-wasm';
|
|
391
|
+
import { readFileSync } from 'node:fs';
|
|
392
|
+
|
|
393
|
+
const htmlBytes = readFileSync('article.html');
|
|
394
|
+
const config = new WasmMetadataConfig();
|
|
395
|
+
|
|
396
|
+
const result = convertBytesWithMetadata(htmlBytes, null, config);
|
|
397
|
+
console.log(result.markdown);
|
|
398
|
+
console.log(result.metadata);
|
|
399
|
+
```
|
|
400
|
+
|
|
401
|
+
## Build Targets
|
|
402
|
+
|
|
403
|
+
Three build targets are provided for different environments:
|
|
404
|
+
|
|
405
|
+
| Target | Path | Use Case |
|
|
406
|
+
| ----------- | --------------------------------- | ------------------------------ |
|
|
407
|
+
| **Bundler** | `@kreuzberg/html-to-markdown-wasm` | Webpack, Vite, Rollup, esbuild |
|
|
408
|
+
| **Node.js** | `@kreuzberg/html-to-markdown-wasm/dist-node` | Node.js, Bun (CommonJS/ESM) |
|
|
409
|
+
| **Web** | `@kreuzberg/html-to-markdown-wasm/dist-web` | Direct browser ESM imports |
|
|
410
|
+
|
|
411
|
+
## Runtime Compatibility
|
|
412
|
+
|
|
413
|
+
| Runtime | Support | Package |
|
|
414
|
+
| ------------------------- | ---------------------------- | -------------- |
|
|
415
|
+
| ✅ **Node.js** 18+ | Full support | `dist-node` |
|
|
416
|
+
| ✅ **Deno** | Full support | npm: specifier |
|
|
417
|
+
| ✅ **Bun** | Full support (prefer native) | Default export |
|
|
418
|
+
| ✅ **Browsers** | Full support | `dist-web` |
|
|
419
|
+
| ✅ **Cloudflare Workers** | Full support | Default export |
|
|
420
|
+
| ✅ **Deno Deploy** | Full support | npm: specifier |
|
|
421
|
+
|
|
422
|
+
## When to Use
|
|
423
|
+
|
|
424
|
+
Choose `@kreuzberg/html-to-markdown-wasm` when:
|
|
425
|
+
|
|
426
|
+
- 🌐 Running in browsers or edge runtimes
|
|
427
|
+
- 🦕 Using Deno
|
|
428
|
+
- ☁️ Deploying to Cloudflare Workers or Deno Deploy
|
|
429
|
+
- 📦 Building universal libraries
|
|
430
|
+
- 🔄 Needing consistent behavior across all platforms
|
|
431
|
+
|
|
432
|
+
Use [@kreuzberg/html-to-markdown-node](https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node) for:
|
|
433
|
+
|
|
434
|
+
- ⚡ Maximum performance in Node.js/Bun (~1.17× faster, per the benchmarks above)
|
|
435
|
+
- 🖥️ Server-side only applications
|
|
436
|
+
|
|
437
|
+
## Configuration Options
|
|
438
|
+
|
|
439
|
+
See the [TypeScript definitions](./dist-node/html_to_markdown_wasm.d.ts) for all available options:
|
|
440
|
+
|
|
441
|
+
- Heading styles (atx, underlined, atxClosed)
|
|
442
|
+
- Code block styles (indented, backticks, tildes)
|
|
443
|
+
- List formatting (indent width, bullet characters)
|
|
444
|
+
- Text escaping and formatting
|
|
445
|
+
- Tag preservation (`preserveTags`) and stripping (`stripTags`)
|
|
446
|
+
- Preprocessing for web scraping
|
|
447
|
+
- hOCR table extraction
|
|
448
|
+
- And more...
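For readers unfamiliar with the heading style names, the sketch below renders one heading in each of the three styles using standard Markdown conventions. It illustrates the output shapes only and is not the library's code: `renderHeading` is a hypothetical function, and falling back to ATX for levels 3 and deeper under `underlined` is an assumption (setext-style underlines exist only for levels 1 and 2).

```typescript
// The three heading styles named in the list above.
type HeadingStyle = "atx" | "atxClosed" | "underlined";

// Renders a heading in the requested style (illustrative only).
function renderHeading(text: string, level: number, style: HeadingStyle): string {
  const hashes = "#".repeat(level);
  if (style === "underlined" && level <= 2) {
    // Setext form: "=" underlines level 1, "-" underlines level 2.
    const underline = (level === 1 ? "=" : "-").repeat(Math.max(text.length, 1));
    return `${text}\n${underline}`;
  }
  if (style === "atxClosed") {
    return `${hashes} ${text} ${hashes}`;
  }
  // Plain ATX, also used here as the assumed fallback for deep underlined headings.
  return `${hashes} ${text}`;
}

console.log(renderHeading("Hello", 1, "atx"));
console.log(renderHeading("Hello", 2, "atxClosed"));
console.log(renderHeading("Hello", 1, "underlined"));
```

The three calls produce `# Hello`, `## Hello ##`, and `Hello` followed by a line of five `=` characters.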
|
|
449
|
+
|
|
450
|
+
## Examples
|
|
451
|
+
|
|
452
|
+
### Preserving HTML Tags
|
|
453
|
+
|
|
454
|
+
Keep specific HTML tags in their original form:
|
|
455
|
+
|
|
456
|
+
```typescript
|
|
457
|
+
import { convert } from '@kreuzberg/html-to-markdown-wasm';
|
|
458
|
+
|
|
459
|
+
const html = `
|
|
460
|
+
<p>Before table</p>
|
|
461
|
+
<table class="data">
|
|
462
|
+
<tr><th>Name</th><th>Value</th></tr>
|
|
463
|
+
<tr><td>Item 1</td><td>100</td></tr>
|
|
464
|
+
</table>
|
|
465
|
+
<p>After table</p>
|
|
466
|
+
`;
|
|
467
|
+
|
|
468
|
+
const markdown = convert(html, {
|
|
469
|
+
preserveTags: ['table']
|
|
470
|
+
});
|
|
471
|
+
|
|
472
|
+
// Result includes the table as HTML
|
|
473
|
+
```
|
|
474
|
+
|
|
475
|
+
Combine with `stripTags`:
|
|
476
|
+
|
|
477
|
+
```typescript
|
|
478
|
+
const markdown = convert(html, {
|
|
479
|
+
preserveTags: ['table', 'form'], // Keep as HTML
|
|
480
|
+
stripTags: ['script', 'style'] // Remove entirely
|
|
481
|
+
});
|
|
482
|
+
```
|
|
483
|
+
|
|
484
|
+
### Deno Web Server
|
|
485
|
+
|
|
486
|
+
```typescript
|
|
487
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
488
|
+
|
|
489
|
+
Deno.serve(async (req) => {
|
|
490
|
+
const url = new URL(req.url);
|
|
491
|
+
|
|
492
|
+
if (url.pathname === "/convert" && req.method === "POST") {
|
|
493
|
+
const html = await req.text();
|
|
494
|
+
const markdown = convert(html, { headingStyle: "atx" });
|
|
495
|
+
|
|
496
|
+
return new Response(markdown, {
|
|
497
|
+
headers: { "Content-Type": "text/markdown" }
|
|
498
|
+
});
|
|
499
|
+
}
|
|
500
|
+
|
|
501
|
+
return new Response("Not found", { status: 404 });
|
|
502
|
+
});
|
|
503
|
+
```
|
|
504
|
+
|
|
505
|
+
### Browser File Conversion
|
|
506
|
+
|
|
507
|
+
```html
|
|
508
|
+
<input type="file" id="htmlFile" accept=".html">
|
|
509
|
+
<button onclick="convertFile()">Convert to Markdown</button>
|
|
510
|
+
<pre id="output"></pre>
|
|
511
|
+
|
|
512
|
+
<script type="module">
|
|
513
|
+
import init, { convert } from 'https://unpkg.com/@kreuzberg/html-to-markdown-wasm/dist-web/html_to_markdown_wasm.js';
|
|
514
|
+
|
|
515
|
+
await init();
|
|
516
|
+
|
|
517
|
+
window.convertFile = async () => {
|
|
518
|
+
const file = document.getElementById('htmlFile').files[0];
|
|
519
|
+
const html = await file.text();
|
|
520
|
+
const markdown = convert(html, { headingStyle: 'atx' });
|
|
521
|
+
document.getElementById('output').textContent = markdown;
|
|
522
|
+
};
|
|
523
|
+
</script>
|
|
524
|
+
```
|
|
525
|
+
|
|
526
|
+
### Web Scraping (Deno)
|
|
527
|
+
|
|
528
|
+
```typescript
|
|
529
|
+
import { convert } from "npm:@kreuzberg/html-to-markdown-wasm";
|
|
530
|
+
|
|
531
|
+
const response = await fetch("https://example.com");
|
|
532
|
+
const html = await response.text();
|
|
533
|
+
|
|
534
|
+
const markdown = convert(html, {
|
|
535
|
+
preprocessing: {
|
|
536
|
+
enabled: true,
|
|
537
|
+
preset: "aggressive",
|
|
538
|
+
removeNavigation: true,
|
|
539
|
+
removeForms: true
|
|
540
|
+
},
|
|
541
|
+
headingStyle: "atx",
|
|
542
|
+
codeBlockStyle: "backticks"
|
|
543
|
+
});
|
|
544
|
+
|
|
545
|
+
console.log(markdown);
|
|
546
|
+
```
|
|
547
|
+
|
|
548
|
+
## Other Runtimes
|
|
549
|
+
|
|
550
|
+
The same Rust engine ships as native bindings for other ecosystems:
|
|
551
|
+
|
|
552
|
+
- 🖥️ Node.js / Bun: [`html-to-markdown-node`](https://www.npmjs.com/package/html-to-markdown-node)
|
|
553
|
+
- 🐍 Python: [`html-to-markdown`](https://pypi.org/project/html-to-markdown/)
|
|
554
|
+
- 💎 Ruby: [`html-to-markdown`](https://rubygems.org/gems/html-to-markdown)
|
|
555
|
+
- 🐘 PHP: [`goldziher/html-to-markdown`](https://packagist.org/packages/goldziher/html-to-markdown)
|
|
556
|
+
- 🦀 Rust crate & CLI: [`html-to-markdown-rs`](https://crates.io/crates/html-to-markdown-rs)
|
|
557
|
+
|
|
558
|
+
## Links
|
|
559
|
+
|
|
560
|
+
- [GitHub Repository](https://github.com/kreuzberg-dev/html-to-markdown)
|
|
561
|
+
- [Full Documentation](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/README.md)
|
|
562
|
+
- [Native Node Package](https://www.npmjs.com/package/html-to-markdown-node)
|
|
563
|
+
- [Python Package](https://pypi.org/project/html-to-markdown/)
|
|
564
|
+
- [PHP Extension & Helpers](https://packagist.org/packages/goldziher/html-to-markdown)
|
|
565
|
+
- [Rust Crate](https://crates.io/crates/html-to-markdown-rs)
|
|
566
|
+
|
|
567
|
+
## License
|
|
568
|
+
|
|
569
|
+
MIT
|
package/dist/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
The MIT License (MIT)
|
|
2
|
+
|
|
3
|
+
Copyright 2024-2025 Na'aman Hirschfeld
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|