@xberg-io/html-to-markdown-wasm 0.0.1 → 3.8.0-rc.2
- package/LICENSE +21 -0
- package/README.md +250 -2
- package/package.json +34 -4
- package/pkg/bundler/LICENSE +21 -0
- package/pkg/bundler/README.md +251 -0
- package/pkg/bundler/html_to_markdown_wasm.d.ts +999 -0
- package/pkg/bundler/html_to_markdown_wasm.js +9 -0
- package/pkg/bundler/html_to_markdown_wasm_bg.js +6227 -0
- package/pkg/bundler/html_to_markdown_wasm_bg.wasm +0 -0
- package/pkg/bundler/html_to_markdown_wasm_bg.wasm.d.ts +466 -0
- package/pkg/bundler/package.json +28 -0
- package/pkg/deno/LICENSE +21 -0
- package/pkg/deno/README.md +251 -0
- package/pkg/deno/html_to_markdown_wasm.d.ts +999 -0
- package/pkg/deno/html_to_markdown_wasm.js +6219 -0
- package/pkg/deno/html_to_markdown_wasm_bg.wasm +0 -0
- package/pkg/deno/html_to_markdown_wasm_bg.wasm.d.ts +466 -0
- package/pkg/nodejs/LICENSE +21 -0
- package/pkg/nodejs/README.md +251 -0
- package/pkg/nodejs/html_to_markdown_wasm.d.ts +999 -0
- package/pkg/nodejs/html_to_markdown_wasm.js +6274 -0
- package/pkg/nodejs/html_to_markdown_wasm_bg.wasm +0 -0
- package/pkg/nodejs/html_to_markdown_wasm_bg.wasm.d.ts +466 -0
- package/pkg/nodejs/package.json +22 -0
- package/pkg/web/LICENSE +21 -0
- package/pkg/web/README.md +251 -0
- package/pkg/web/html_to_markdown_wasm.d.ts +1490 -0
- package/pkg/web/html_to_markdown_wasm.js +6327 -0
- package/pkg/web/html_to_markdown_wasm_bg.wasm +0 -0
- package/pkg/web/html_to_markdown_wasm_bg.wasm.d.ts +466 -0
- package/pkg/web/package.json +26 -0
|
@@ -0,0 +1,251 @@
|
|
|
1
|
+
<p align="center">
|
|
2
|
+
<picture>
|
|
3
|
+
<source media="(prefers-color-scheme: dark)" srcset="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-dark.svg">
|
|
4
|
+
<img alt="Xberg" width="420" src="https://cdn.jsdelivr.net/gh/xberg-io/assets@v1/banner/readme-banner-light.svg">
|
|
5
|
+
</picture>
|
|
6
|
+
</p>
|
|
7
|
+
|
|
8
|
+
# html-to-markdown
|
|
9
|
+
|
|
10
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
11
|
+
<a href="https://github.com/xberg-io/alef">
|
|
12
|
+
<img src="https://img.shields.io/badge/built%20with-alef%20%D7%90-007ec6" alt="Built with alef">
|
|
13
|
+
</a>
|
|
14
|
+
<!-- Language Bindings -->
|
|
15
|
+
<a href="https://crates.io/crates/html-to-markdown-rs">
|
|
16
|
+
<img src="https://img.shields.io/crates/v/html-to-markdown-rs?label=Rust&color=007ec6" alt="Rust">
|
|
17
|
+
</a>
|
|
18
|
+
<a href="https://pypi.org/project/html-to-markdown/">
|
|
19
|
+
<img src="https://img.shields.io/pypi/v/html-to-markdown?label=Python&color=007ec6" alt="Python">
|
|
20
|
+
</a>
|
|
21
|
+
<a href="https://www.npmjs.com/package/@xberg-io/html-to-markdown">
|
|
22
|
+
<img src="https://img.shields.io/npm/v/@xberg-io/html-to-markdown?label=Node.js&color=007ec6" alt="Node.js">
|
|
23
|
+
</a>
|
|
24
|
+
<a href="https://www.npmjs.com/package/@xberg-io/html-to-markdown-wasm">
|
|
25
|
+
<img src="https://img.shields.io/npm/v/@xberg-io/html-to-markdown-wasm?label=WASM&color=007ec6" alt="WASM">
|
|
26
|
+
</a>
|
|
27
|
+
<a href="https://central.sonatype.com/artifact/io.xberg/html-to-markdown">
|
|
28
|
+
<img src="https://img.shields.io/maven-central/v/io.xberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
|
|
29
|
+
</a>
|
|
30
|
+
<a href="https://pkg.go.dev/github.com/xberg-io/html-to-markdown/packages/go/v3">
|
|
31
|
+
<img src="https://img.shields.io/github/v/tag/xberg-io/html-to-markdown?label=Go&color=007ec6&filter=v3*" alt="Go">
|
|
32
|
+
</a>
|
|
33
|
+
<a href="https://www.nuget.org/packages/XbergIo.HtmlToMarkdown/">
|
|
34
|
+
<img src="https://img.shields.io/nuget/v/XbergIo.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
|
|
35
|
+
</a>
|
|
36
|
+
<a href="https://packagist.org/packages/xberg-io/html-to-markdown">
|
|
37
|
+
<img src="https://img.shields.io/packagist/v/xberg-io/html-to-markdown?label=PHP&color=007ec6" alt="PHP">
|
|
38
|
+
</a>
|
|
39
|
+
<a href="https://rubygems.org/gems/html-to-markdown">
|
|
40
|
+
<img src="https://img.shields.io/gem/v/html-to-markdown?label=Ruby&color=007ec6" alt="Ruby">
|
|
41
|
+
</a>
|
|
42
|
+
<a href="https://hex.pm/packages/html_to_markdown">
|
|
43
|
+
<img src="https://img.shields.io/hexpm/v/html_to_markdown?label=Elixir&color=007ec6" alt="Elixir">
|
|
44
|
+
</a>
|
|
45
|
+
<a href="https://xberg-io.r-universe.dev/htmltomarkdown">
|
|
46
|
+
<img src="https://img.shields.io/badge/R-htmltomarkdown-007ec6" alt="R">
|
|
47
|
+
</a>
|
|
48
|
+
<a href="https://pub.dev/packages/h2m">
|
|
49
|
+
<img src="https://img.shields.io/pub/v/h2m?label=Dart&color=007ec6" alt="Dart">
|
|
50
|
+
</a>
|
|
51
|
+
<a href="https://central.sonatype.com/artifact/io.xberg/html-to-markdown-android">
|
|
52
|
+
<img src="https://img.shields.io/maven-central/v/io.xberg/html-to-markdown-android?label=Kotlin&color=007ec6" alt="Kotlin">
|
|
53
|
+
</a>
|
|
54
|
+
<a href="https://github.com/xberg-io/html-to-markdown/tree/main/packages/swift">
|
|
55
|
+
<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
|
|
56
|
+
</a>
|
|
57
|
+
<a href="https://github.com/xberg-io/html-to-markdown/tree/main/packages/zig">
|
|
58
|
+
<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
|
|
59
|
+
</a>
|
|
60
|
+
<a href="https://github.com/xberg-io/html-to-markdown/releases">
|
|
61
|
+
<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
|
|
62
|
+
</a>
|
|
63
|
+
|
|
64
|
+
<!-- Project Info -->
|
|
65
|
+
<a href="https://github.com/xberg-io/html-to-markdown/blob/main/LICENSE">
|
|
66
|
+
<img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
|
|
67
|
+
</a>
|
|
68
|
+
<a href="https://docs.html-to-markdown.xberg.io">
|
|
69
|
+
<img src="https://img.shields.io/badge/Docs-html--to--markdown-007ec6" alt="Documentation">
|
|
70
|
+
</a>
|
|
71
|
+
</div>
|
|
72
|
+
|
|
73
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
|
|
74
|
+
<a href="https://discord.gg/xt9WY3GnKR">
|
|
75
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
|
|
76
|
+
</a>
|
|
77
|
+
<a href="https://docs.html-to-markdown.xberg.io/demo/">
|
|
78
|
+
<img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
|
|
79
|
+
</a>
|
|
80
|
+
</div>
|
|
81
|
+
|
|
82
|
+
Pure WebAssembly build of html-to-markdown for browsers, Deno, Cloudflare Workers, and other JS runtimes without a Node.js native-addon ABI. It ships as a single `.wasm` artifact loaded via `wasm-bindgen`, with TypeScript types and dist targets for nodejs, web, bundler, and deno.
|
|
85
|
+
|
|
86
|
+
## What This Package Provides
|
|
87
|
+
|
|
88
|
+
- **Same renderer as every binding** — output matches Rust, Python, Node.js, Ruby, PHP, Go, Java, .NET, Elixir, R, Dart, Swift, Zig, C FFI, and WASM.
|
|
89
|
+
- **Structured conversion result** — Markdown plus metadata, links, headings, images, tables, and warnings where the binding exposes them.
|
|
90
|
+
- **Production defaults** — HTML is parsed with the Rust core, sanitized by default, and rendered without runtime-specific Markdown drift.
|
|
91
|
+
|
|
92
|
+
## Installation
|
|
93
|
+
|
|
94
|
+
```bash
pnpm add @xberg-io/html-to-markdown-wasm
```
|
|
97
|
+
|
|
98
|
+
## Performance Snapshot
|
|
99
|
+
|
|
100
|
+
## Quick Start
|
|
101
|
+
|
|
102
|
+
Basic conversion:
|
|
103
|
+
|
|
104
|
+
```javascript
import init, { convert } from "@xberg-io/html-to-markdown-wasm";

// Instantiate the WebAssembly module before calling any exported function.
await init();

const html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>";
const result = convert(html);
// The rendered Markdown is on `content`.
const markdown = result.content;
console.log(markdown);
```
|
|
114
|
+
|
|
115
|
+
With conversion options:
|
|
116
|
+
|
|
117
|
+
```javascript
import init, { convert } from "@xberg-io/html-to-markdown-wasm";

// Instantiate the WebAssembly module before calling any exported function.
await init();

// The second argument carries the conversion options.
const result = convert('<h1>Hello</h1><img src="pic.jpg">', {
  headingStyle: "atx", // `#`-prefixed headings
  skipImages: true,
});
const markdown = result.content;
console.log(markdown);
```
|
|
129
|
+
|
|
130
|
+
## Architecture
|
|
131
|
+
|
|
132
|
+
The converter routes each input through one of three tiers based on a fast prescan of the byte stream:
|
|
133
|
+
|
|
134
|
+
1. **Tier-1 — single-pass byte scanner.** Handles 110+ HTML tags directly. Bails on any construct it cannot prove byte-equivalent to Tier-2.
|
|
135
|
+
2. **Tier-2 — DOM walker.** Picks up Tier-1 bails and inputs the classifier rejected up front.
|
|
136
|
+
3. **Tier-3 — standards-conformant parser.** Engaged for malformed HTML requiring full HTML5 repair.
|
|
137
|
+
|
|
138
|
+
The dispatcher is invisible to the caller. Output is byte-identical across tiers — enforced by a 116-snapshot oracle.
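The bail-and-fall-through control flow described above can be sketched in a few lines. This is only an illustration: the three tier functions are toy stand-ins, every name here (`BAIL`, `tier1`, `tier2`, `tier3`, `dispatch`) is hypothetical, and the prescan classifier that the real dispatcher uses is left out.

```javascript
// Hypothetical sketch of a tiered dispatcher; none of these names come from the library.
const BAIL = Symbol("bail");

// Tier 1 stand-in: a fast path that only accepts input with no tags at all.
function tier1(html) {
  return html.includes("<") ? BAIL : html.trim();
}

// Tier 2 stand-in: accepts a run of simple, well-formed <p> paragraphs.
function tier2(html) {
  if (!/^(?:<p>[^<]*<\/p>)+$/.test(html)) return BAIL;
  return [...html.matchAll(/<p>([^<]*)<\/p>/g)]
    .map((match) => match[1].trim())
    .join("\n\n");
}

// Tier 3 stand-in: last resort, never bails; drops tags and collapses whitespace.
function tier3(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Try each tier in order and return the first result that is not a bail.
function dispatch(html) {
  const tiers = [["tier1", tier1], ["tier2", tier2], ["tier3", tier3]];
  for (const [name, tier] of tiers) {
    const content = tier(html);
    if (content !== BAIL) return { tier: name, content };
  }
  throw new Error("unreachable: tier3 never bails");
}
```

A caller only ever sees `dispatch(html).content`; the `tier` field is exposed here purely so the fall-through can be observed. The sketch does not attempt the byte-identical guarantee the real tiers provide.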
|
|
139
|
+
|
|
140
|
+
## Capabilities
|
|
141
|
+
|
|
142
|
+
- **16 languages, one Rust core.** Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, C ABI.
|
|
143
|
+
- **CommonMark-compatible Markdown** with GFM-style tables.
|
|
144
|
+
- **Djot output**: set `output_format = "djot"` (see Djot Output Format section below).
|
|
145
|
+
- **Real-HTML robust**: unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings handled without losing content.
|
|
146
|
+
- **Metadata extraction**, **visitor API**, **inline images**, **configurable preprocessing presets**.
|
|
147
|
+
- **Per-group regression gates in CI**: every PR runs the bench harness against per-group thresholds.
|
|
148
|
+
|
|
149
|
+
## API Reference
|
|
150
|
+
|
|
151
|
+
### Core Function
|
|
152
|
+
|
|
153
|
+
### Options
|
|
154
|
+
|
|
155
|
+
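**`ConversionOptions`** – Key configuration fields (listed in snake_case; the Quick Start example above passes camelCase keys such as `headingStyle` and `skipImages`):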
|
|
156
|
+
|
|
157
|
+
- `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"atx"`
|
|
158
|
+
- `list_indent_width`: Spaces per indent level — default: `2`
|
|
159
|
+
- `bullets`: Bullet characters cycle — default: `"-*+"`
|
|
160
|
+
- `wrap`: Enable text wrapping — default: `false`
|
|
161
|
+
- `wrap_width`: Wrap at column — default: `80`
|
|
162
|
+
- `code_language`: Default fenced code block language — default: none
|
|
163
|
+
- `extract_metadata`: Enable metadata extraction into `result.metadata` — default: `true`
|
|
164
|
+
- `output_format`: Output markup format (`"markdown"` | `"djot"` | `"plain"`) — default: `"markdown"`
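As a rough sketch of how these defaults combine with caller overrides, the helper below merges a partial options object onto the documented defaults and rejects unknown keys or enum values. `DEFAULT_OPTIONS`, `ALLOWED`, and `resolveOptions` are hypothetical names and not part of the package; the keys follow the snake_case spelling of the list above, while the JavaScript binding may expect camelCase as in the Quick Start.

```javascript
// Defaults transcribed from the list above; `code_language` has no default,
// which this sketch models as undefined.
const DEFAULT_OPTIONS = Object.freeze({
  heading_style: "atx",
  list_indent_width: 2,
  bullets: "-*+",
  wrap: false,
  wrap_width: 80,
  code_language: undefined,
  extract_metadata: true,
  output_format: "markdown",
});

// Enumerated values as documented for the two enum-like fields.
const ALLOWED = {
  heading_style: ["underlined", "atx", "atx_closed"],
  output_format: ["markdown", "djot", "plain"],
};

// Hypothetical helper: validate overrides, then merge them onto the defaults.
function resolveOptions(overrides = {}) {
  const knownKeys = Object.keys(DEFAULT_OPTIONS);
  for (const [key, value] of Object.entries(overrides)) {
    if (!knownKeys.includes(key)) {
      throw new TypeError(`unknown option: ${key}`);
    }
    if (ALLOWED[key] && !ALLOWED[key].includes(value)) {
      throw new TypeError(`invalid ${key}: ${value}`);
    }
  }
  return { ...DEFAULT_OPTIONS, ...overrides };
}
```

Validating up front keeps a misspelled key from being silently ignored, which is the usual failure mode with plain option objects.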
|
|
165
|
+
|
|
166
|
+
## Djot Output Format
|
|
167
|
+
|
|
168
|
+
The library supports converting HTML to [Djot](https://djot.net/), a lightweight markup language similar to Markdown but with a different syntax for some elements. Set `output_format` to `"djot"` to use this format.
|
|
169
|
+
|
|
170
|
+
### Syntax Differences
|
|
171
|
+
|
|
172
|
+
| Element        | Markdown   | Djot       |
| -------------- | ---------- | ---------- |
| Strong         | `**text**` | `*text*`   |
| Emphasis       | `*text*`   | `_text_`   |
| Strikethrough  | `~~text~~` | `{-text-}` |
| Inserted/Added | N/A        | `{+text+}` |
| Highlighted    | N/A        | `{=text=}` |
| Subscript      | N/A        | `~text~`   |
| Superscript    | N/A        | `^text^`   |
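To make the table concrete, here is a small lookup that wraps a piece of text in the delimiters for each format. It is a sketch built only from the rows above: `INLINE_SYNTAX` and `wrapInline` are hypothetical names, and returning the bare text where Markdown has no equivalent is a choice made for this example, not a statement about what the library emits.

```javascript
// Delimiter pairs transcribed from the table above; null marks "N/A".
const INLINE_SYNTAX = {
  strong: { markdown: ["**", "**"], djot: ["*", "*"] },
  emphasis: { markdown: ["*", "*"], djot: ["_", "_"] },
  strikethrough: { markdown: ["~~", "~~"], djot: ["{-", "-}"] },
  inserted: { markdown: null, djot: ["{+", "+}"] },
  highlighted: { markdown: null, djot: ["{=", "=}"] },
  subscript: { markdown: null, djot: ["~", "~"] },
  superscript: { markdown: null, djot: ["^", "^"] },
};

// Hypothetical helper: wrap `text` in the delimiters for `kind` in `format`.
function wrapInline(kind, text, format) {
  const entry = INLINE_SYNTAX[kind];
  if (!entry) throw new TypeError(`unknown inline kind: ${kind}`);
  if (format !== "markdown" && format !== "djot") {
    throw new TypeError(`unknown format: ${format}`);
  }
  const delimiters = entry[format];
  // No native syntax in this format: fall back to the bare text (sketch-only choice).
  if (!delimiters) return text;
  const [open, close] = delimiters;
  return `${open}${text}${close}`;
}
```

For example, `wrapInline("strong", "hi", "djot")` yields `*hi*`, the same characters Markdown would read as emphasis, which is why the two formats cannot be mixed in one document.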
|
|
181
|
+
|
|
182
|
+
### Example Usage
|
|
183
|
+
|
|
184
|
+
Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.
|
|
185
|
+
|
|
186
|
+
## Plain Text Output
|
|
187
|
+
|
|
188
|
+
Set `output_format` to `"plain"` to strip all markup and return only visible text. This bypasses the Markdown conversion pipeline entirely for maximum speed.
|
|
189
|
+
|
|
190
|
+
Plain text mode is useful for search indexing, text extraction, and feeding content to LLMs.
|
|
191
|
+
|
|
192
|
+
## Metadata Extraction
|
|
193
|
+
|
|
194
|
+
Metadata extraction collects document properties, headings, links, images, and structured data in the same pass as conversion, all through the standard `convert()` function.
|
|
195
|
+
|
|
196
|
+
**Use Cases:**
|
|
197
|
+
|
|
198
|
+
- **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
|
|
199
|
+
- **Table of contents generation** – Build structured outlines from heading hierarchy
|
|
200
|
+
- **Content migration** – Document all external links and resources
|
|
201
|
+
- **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
|
|
202
|
+
- **Link validation** – Classify and validate anchor, internal, external, email, and phone links
|
|
203
|
+
|
|
204
|
+
**Zero Overhead When Disabled:** Metadata extraction runs during the HTML parsing pass and adds negligible overhead when enabled. It is controlled by `extract_metadata` in `ConversionOptions`, which the Options list gives as `true` by default; set it to `false` to skip extraction. The result is available at `result.metadata`.
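Two of the use cases above, table-of-contents generation and heading-hierarchy checks, can be sketched against a list of extracted headings. The input shape (`{ level, text }` objects in document order) is an assumption made for this example; the actual layout of `result.metadata` is not documented here, so map the real fields onto it before use. `buildToc` and `findSkippedLevels` are hypothetical helpers.

```javascript
// Assumed input shape (not confirmed by the package): [{ level: 1..6, text }].

// Render a nested Markdown bullet list, indenting two spaces per heading level
// relative to the shallowest heading present.
function buildToc(headings) {
  const minLevel = Math.min(...headings.map((h) => h.level));
  return headings
    .map(({ level, text }) => `${"  ".repeat(level - minLevel)}- ${text}`)
    .join("\n");
}

// Return headings that jump more than one level deeper than their predecessor,
// e.g. an h4 directly after an h2.
function findSkippedLevels(headings) {
  const skipped = [];
  for (let i = 1; i < headings.length; i++) {
    if (headings[i].level > headings[i - 1].level + 1) {
      skipped.push(headings[i]);
    }
  }
  return skipped;
}

const sampleHeadings = [
  { level: 1, text: "Guide" },
  { level: 2, text: "Install" },
  { level: 4, text: "Notes" },
];
console.log(buildToc(sampleHeadings));
console.log(findSkippedLevels(sampleHeadings));
```

Here the `Notes` heading is reported as a skipped level because it follows an h2 without an intervening h3.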
|
|
205
|
+
|
|
206
|
+
### Example: Quick Start
|
|
207
|
+
|
|
208
|
+
## Examples
|
|
209
|
+
|
|
210
|
+
## Links
|
|
211
|
+
|
|
212
|
+
- **GitHub:** [github.com/xberg-io/html-to-markdown](https://github.com/xberg-io/html-to-markdown)
|
|
213
|
+
- **Discord:** [discord.gg/xt9WY3GnKR](https://discord.gg/xt9WY3GnKR)
|
|
214
|
+
|
|
215
|
+
## Part of Xberg
|
|
216
|
+
|
|
217
|
+
- [Xberg](https://github.com/xberg-io/xberg) — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
|
|
218
|
+
- [Xberg Enterprise](https://github.com/xberg-io/xberg-enterprise) — managed extraction API with SDKs, dashboards, and observability.
|
|
219
|
+
- [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
|
|
220
|
+
- [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
|
|
221
|
+
- [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
|
|
222
|
+
- [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
|
|
223
|
+
- [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
|
|
224
|
+
|
|
225
|
+
## Contributing
|
|
226
|
+
|
|
227
|
+
We welcome contributions! Please see our [Contributing Guide](https://github.com/xberg-io/html-to-markdown/blob/main/CONTRIBUTING.md) for details on:
|
|
228
|
+
|
|
229
|
+
- Setting up the development environment
|
|
230
|
+
- Running tests locally
|
|
231
|
+
- Submitting pull requests
|
|
232
|
+
- Reporting issues
|
|
233
|
+
|
|
234
|
+
All contributions must follow our code quality standards (enforced via pre-commit hooks):
|
|
235
|
+
|
|
236
|
+
- Proper test coverage (Rust 95%+, language bindings 80%+)
|
|
237
|
+
- Formatting and linting checks
|
|
238
|
+
- Documentation for public APIs
|
|
239
|
+
|
|
240
|
+
## License
|
|
241
|
+
|
|
242
|
+
MIT License – see [LICENSE](https://github.com/xberg-io/html-to-markdown/blob/main/LICENSE). Copyright © Kreuzberg, Inc.
|
|
243
|
+
|
|
244
|
+
## Support
|
|
245
|
+
|
|
246
|
+
If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/xberg-io).
|
|
247
|
+
|
|
248
|
+
Have questions or run into issues? We're here to help:
|
|
249
|
+
|
|
250
|
+
- **GitHub Issues:** [github.com/xberg-io/html-to-markdown/issues](https://github.com/xberg-io/html-to-markdown/issues)
|
|
251
|
+
- **Discord Community:** [discord.gg/xt9WY3GnKR](https://discord.gg/xt9WY3GnKR)
|