html-to-markdown 2.16.1 → 2.19.0
- checksums.yaml +4 -4
- data/Gemfile.lock +66 -11
- data/README.md +168 -369
- data/bin/benchmark.rb +100 -12
- data/ext/html-to-markdown-rb/native/Cargo.toml +7 -5
- data/ext/html-to-markdown-rb/native/README.md +5 -5
- data/ext/html-to-markdown-rb/native/src/lib.rs +951 -0
- data/ext/html-to-markdown-rb/native/src/profiling.rs +4 -0
- data/html-to-markdown-rb.gemspec +6 -6
- data/lib/html_to_markdown/version.rb +1 -1
- data/sig/html_to_markdown.rbs +110 -0
- data/spec/visitor_spec.rb +1149 -0
- metadata +9 -8
data/README.md
CHANGED
|
@@ -1,477 +1,276 @@
|
|
|
1
|
-
# html-to-markdown
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
1
|
+
# html-to-markdown
|
|
2
|
+
|
|
3
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
4
|
+
<!-- Language Bindings -->
|
|
5
|
+
<a href="https://crates.io/crates/html-to-markdown-rs">
|
|
6
|
+
<img src="https://img.shields.io/crates/v/html-to-markdown-rs?label=Rust&color=007ec6" alt="Rust">
|
|
7
|
+
</a>
|
|
8
|
+
<a href="https://pypi.org/project/html-to-markdown/">
|
|
9
|
+
<img src="https://img.shields.io/pypi/v/html-to-markdown?label=Python&color=007ec6" alt="Python">
|
|
10
|
+
</a>
|
|
11
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node">
|
|
12
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-node?label=Node.js&color=007ec6" alt="Node.js">
|
|
13
|
+
</a>
|
|
14
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm">
|
|
15
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-wasm?label=WASM&color=007ec6" alt="WASM">
|
|
16
|
+
</a>
|
|
17
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
|
|
18
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
|
|
19
|
+
</a>
|
|
20
|
+
<a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown">
|
|
21
|
+
<img src="https://img.shields.io/badge/Go-v2.19.0-007ec6" alt="Go">
|
|
22
|
+
</a>
|
|
23
|
+
<a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
|
|
24
|
+
<img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
|
|
25
|
+
</a>
|
|
26
|
+
<a href="https://packagist.org/packages/goldziher/html-to-markdown">
|
|
27
|
+
<img src="https://img.shields.io/packagist/v/goldziher/html-to-markdown?label=PHP&color=007ec6" alt="PHP">
|
|
28
|
+
</a>
|
|
29
|
+
<a href="https://rubygems.org/gems/html-to-markdown">
|
|
30
|
+
<img src="https://img.shields.io/gem/v/html-to-markdown?label=Ruby&color=007ec6" alt="Ruby">
|
|
31
|
+
</a>
|
|
32
|
+
<a href="https://hex.pm/packages/html_to_markdown">
|
|
33
|
+
<img src="https://img.shields.io/hexpm/v/html_to_markdown?label=Elixir&color=007ec6" alt="Elixir">
|
|
34
|
+
</a>
|
|
35
|
+
|
|
36
|
+
<!-- Project Info -->
|
|
37
|
+
<a href="https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE">
|
|
38
|
+
<img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
|
|
39
|
+
</a>
|
|
40
|
+
</div>
|
|
41
|
+
|
|
42
|
+
<img width="1128" height="191" alt="html-to-markdown" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
|
|
43
|
+
|
|
44
|
+
<div align="center" style="margin-top: 20px;">
|
|
45
|
+
<a href="https://discord.gg/pXxagNK2zN">
|
|
46
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
|
|
47
|
+
</a>
|
|
48
|
+
</div>
|
|
49
|
+
|
|
50
|
+
|
|
51
|
+
Blazing-fast HTML to Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages.
|
|
52
|
+
Ship identical Markdown across every runtime while enjoying native extension performance with Magnus bindings.
|
|
53
|
+
|
|
33
54
|
|
|
34
55
|
## Installation
|
|
35
56
|
|
|
36
57
|
```bash
|
|
37
|
-
bundle add html-to-markdown
|
|
38
|
-
# or
|
|
39
58
|
gem install html-to-markdown
|
|
40
59
|
```
|
|
41
60
|
|
|
42
|
-
Add the gem to your project and Bundler will compile the native Rust extension on first install.
|
|
43
61
|
|
|
44
|
-
### Requirements
|
|
45
62
|
|
|
46
|
-
|
|
47
|
-
- Rust toolchain **1.85+** with Cargo available on your `$PATH`
|
|
48
|
-
- Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
|
|
63
|
+
Requires Ruby 3.2+ with Magnus native extension bindings. Published for Linux and macOS.
|
|
49
64
|
|
|
50
|
-
**Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
|
|
51
65
|
|
|
52
|
-
```powershell
|
|
53
|
-
ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
|
|
54
|
-
```
|
|
55
66
|
|
|
56
|
-
This provides the standard headers (including `strings.h`) required for the bindgen step.
|
|
57
67
|
|
|
58
|
-
## Performance Snapshot
|
|
59
68
|
|
|
60
|
-
Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
|
|
61
69
|
|
|
62
|
-
|
|
63
|
-
| ------------------- | ----- | ------- | ---------- | -------- |
|
|
64
|
-
| Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
|
|
65
|
-
| Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
|
|
66
|
-
| Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
|
|
70
|
+
## Performance Snapshot
|
|
67
71
|
|
|
68
|
-
|
|
72
|
+
Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
|
|
69
73
|
|
|
70
|
-
|
|
74
|
+
| Document | Size | Latency | Throughput |
|
|
75
|
+
| -------- | ---- | ------- | ---------- |
|
|
76
|
+
| Lists (Timeline) | 129KB | 0.71ms | 182 MB/s |
|
|
77
|
+
| Tables (Countries) | 360KB | 2.15ms | 167 MB/s |
|
|
78
|
+
| Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |
|
|
71
79
|
|
|
72
|
-
Measured via `task bench:harness` with the shared Wikipedia + hOCR suite:
|
|
73
80
|
|
|
74
|
-
|
|
75
|
-
| ---------------------- | ------ | -------------- |
|
|
76
|
-
| Lists (Timeline) | 129 KB | 3,156 |
|
|
77
|
-
| Tables (Countries) | 360 KB | 921 |
|
|
78
|
-
| Medium (Python) | 657 KB | 469 |
|
|
79
|
-
| Large (Rust) | 567 KB | 534 |
|
|
80
|
-
| Small (Intro) | 463 KB | 629 |
|
|
81
|
-
| hOCR German PDF | 44 KB | 7,250 |
|
|
82
|
-
| hOCR Invoice | 4 KB | 83,883 |
|
|
83
|
-
| hOCR Embedded Tables | 37 KB | 7,890 |
|
|
81
|
+
See [Performance Guide](../../examples/performance/) for detailed benchmarks.
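The throughput column in the new table can be cross-checked from the size and latency columns. The sketch below is not part of the gem; it assumes 1 KB = 1,000 bytes and 1 MB = 1,000,000 bytes (the README does not state its units), and `throughput_mb_per_s` is a hypothetical helper name.

```ruby
# Recompute the throughput column from the size and latency columns of the
# 2.19.0 table. Unit convention (decimal KB/MB) is an assumption.
ROWS = [
  ['Lists (Timeline)', 129, 0.71],
  ['Tables (Countries)', 360, 2.15],
  ['Mixed (Python wiki)', 656, 4.89]
].freeze

# Hypothetical helper: KB per millisecond is numerically equal to MB per second.
def throughput_mb_per_s(size_kb, latency_ms)
  (size_kb / latency_ms.to_f).round
end

ROWS.each do |name, size_kb, latency_ms|
  puts format('%-20s %3d MB/s', name, throughput_mb_per_s(size_kb, latency_ms))
end
```

Under that unit assumption the recomputed values agree with the published 182, 167, and 134 MB/s figures.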
|
|
84
82
|
|
|
85
|
-
> These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
|
|
86
83
|
|
|
87
84
|
## Quick Start
|
|
88
85
|
|
|
86
|
+
Basic conversion:
|
|
87
|
+
|
|
89
88
|
```ruby
|
|
90
89
|
require 'html_to_markdown'
|
|
91
90
|
|
|
92
|
-
html =
|
|
93
|
-
<h1>Welcome</h1>
|
|
94
|
-
<p>This is <strong>Rust-fast</strong> conversion!</p>
|
|
95
|
-
<ul>
|
|
96
|
-
<li>Native extension</li>
|
|
97
|
-
<li>Identical output across languages</li>
|
|
98
|
-
</ul>
|
|
99
|
-
HTML
|
|
100
|
-
|
|
91
|
+
html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
|
|
101
92
|
markdown = HtmlToMarkdown.convert(html)
|
|
102
|
-
puts markdown
|
|
103
|
-
# # Welcome
|
|
104
|
-
#
|
|
105
|
-
# This is **Rust-fast** conversion!
|
|
106
|
-
#
|
|
107
|
-
# - Native extension
|
|
108
|
-
# - Identical output across languages
|
|
109
93
|
```
|
|
110
94
|
|
|
111
|
-
## API
|
|
112
95
|
|
|
113
|
-
### Conversion Options
|
|
114
96
|
|
|
115
|
-
|
|
97
|
+
With conversion options:
|
|
116
98
|
|
|
117
99
|
```ruby
|
|
118
100
|
require 'html_to_markdown'
|
|
119
101
|
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
heading_style: :atx,
|
|
123
|
-
code_block_style: :fenced,
|
|
124
|
-
bullets: '*+-',
|
|
125
|
-
list_indent_type: :spaces,
|
|
126
|
-
list_indent_width: 2,
|
|
127
|
-
whitespace_mode: :normalized,
|
|
128
|
-
highlight_style: :double_equal
|
|
129
|
-
)
|
|
130
|
-
|
|
131
|
-
puts markdown
|
|
102
|
+
html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
|
|
103
|
+
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
|
|
132
104
|
```
|
|
133
105
|
|
|
134
|
-
### Reusing Options
|
|
135
106
|
|
|
136
|
-
If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
|
|
137
107
|
|
|
138
|
-
```ruby
|
|
139
|
-
handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
|
|
140
108
|
|
|
141
|
-
100.times do
|
|
142
|
-
HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
|
|
143
|
-
end
|
|
144
|
-
```
|
|
145
109
|
|
|
146
|
-
### HTML Preprocessing
|
|
147
110
|
|
|
148
|
-
|
|
111
|
+
## API Reference
|
|
149
112
|
|
|
150
|
-
|
|
151
|
-
require 'html_to_markdown'
|
|
113
|
+
### Core Functions
|
|
152
114
|
|
|
153
|
-
markdown = HtmlToMarkdown.convert(
|
|
154
|
-
html,
|
|
155
|
-
preprocessing: {
|
|
156
|
-
enabled: true,
|
|
157
|
-
preset: :aggressive, # :minimal, :standard, :aggressive
|
|
158
|
-
remove_navigation: true,
|
|
159
|
-
remove_forms: true
|
|
160
|
-
}
|
|
161
|
-
)
|
|
162
|
-
```
|
|
163
115
|
|
|
164
|
-
|
|
116
|
+
**`convert(html, options: nil) -> String`**
|
|
165
117
|
|
|
166
|
-
|
|
118
|
+
Basic HTML-to-Markdown conversion. Fast and simple.
|
|
167
119
|
|
|
168
|
-
|
|
169
|
-
require 'html_to_markdown'
|
|
120
|
+
**`convert_with_metadata(html, options: nil, config: nil) -> [String, Hash]`**
|
|
170
121
|
|
|
171
|
-
|
|
172
|
-
'<img src="data:image/png;base64,iVBORw0..." alt="Pixel">',
|
|
173
|
-
image_config: {
|
|
174
|
-
max_decoded_size_bytes: 1 * 1024 * 1024,
|
|
175
|
-
infer_dimensions: true,
|
|
176
|
-
filename_prefix: 'img_',
|
|
177
|
-
capture_svg: true
|
|
178
|
-
}
|
|
179
|
-
)
|
|
180
|
-
|
|
181
|
-
puts result.markdown
|
|
182
|
-
result.inline_images.each do |img|
|
|
183
|
-
puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
|
|
184
|
-
end
|
|
185
|
-
```
|
|
122
|
+
Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See [Metadata Extraction Guide](../../examples/metadata-extraction/).
|
|
186
123
|
|
|
187
|
-
|
|
124
|
+
**`convert_with_visitor(html, visitor:, options: nil) -> String`**
|
|
188
125
|
|
|
189
|
-
|
|
126
|
+
Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
|
|
190
127
|
|
|
191
|
-
|
|
128
|
+
**`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
|
|
192
129
|
|
|
193
|
-
|
|
194
|
-
require 'html_to_markdown'
|
|
130
|
+
Extract base64-encoded inline images with metadata.
|
|
195
131
|
|
|
196
|
-
html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
|
|
197
|
-
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
|
|
198
132
|
|
|
199
|
-
puts markdown
|
|
200
|
-
puts metadata[:document][:title] # "Test"
|
|
201
|
-
puts metadata[:headers].length # 1
|
|
202
|
-
```
|
|
203
133
|
|
|
204
|
-
|
|
134
|
+
### Options
|
|
205
135
|
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
|
|
209
|
-
|
|
210
|
-
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
|
|
214
|
-
)
|
|
215
|
-
```
|
|
136
|
+
**`ConversionOptions`** – Key configuration fields:
|
|
137
|
+
- `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
|
|
138
|
+
- `list_indent_width`: Spaces per indent level — default: `2`
|
|
139
|
+
- `bullets`: Bullet characters cycle — default: `"*+-"`
|
|
140
|
+
- `wrap`: Enable text wrapping — default: `false`
|
|
141
|
+
- `wrap_width`: Wrap at column — default: `80`
|
|
142
|
+
- `code_language`: Default fenced code block language — default: none
|
|
143
|
+
- `extract_metadata`: Embed metadata as YAML frontmatter — default: `false`
|
|
216
144
|
|
|
217
|
-
|
|
145
|
+
**`MetadataConfig`** – Selective metadata extraction:
|
|
146
|
+
- `extract_headers`: h1-h6 elements — default: `true`
|
|
147
|
+
- `extract_links`: Hyperlinks — default: `true`
|
|
148
|
+
- `extract_images`: Image elements — default: `true`
|
|
149
|
+
- `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
|
|
150
|
+
- `max_structured_data_size`: Size limit in bytes — default: `100KB`
|
|
218
151
|
|
|
219
|
-
```ruby
|
|
220
|
-
require 'html_to_markdown'
|
|
221
152
|
|
|
222
|
-
html = <<~HTML
|
|
223
|
-
<html>
|
|
224
|
-
<head>
|
|
225
|
-
<title>Example</title>
|
|
226
|
-
<meta name="description" content="Demo page">
|
|
227
|
-
<link rel="canonical" href="https://example.com/page">
|
|
228
|
-
<meta property="og:image" content="https://example.com/og.jpg">
|
|
229
|
-
<meta name="twitter:card" content="summary_large_image">
|
|
230
|
-
</head>
|
|
231
|
-
<body>
|
|
232
|
-
<h1 id="welcome">Welcome</h1>
|
|
233
|
-
<a href="https://example.com" rel="nofollow external">Example link</a>
|
|
234
|
-
<img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
|
|
235
|
-
<script type="application/ld+json">
|
|
236
|
-
{"@context": "https://schema.org", "@type": "Article"}
|
|
237
|
-
</script>
|
|
238
|
-
</body>
|
|
239
|
-
</html>
|
|
240
|
-
HTML
|
|
241
|
-
|
|
242
|
-
markdown, metadata = HtmlToMarkdown.convert_with_metadata(
|
|
243
|
-
html,
|
|
244
|
-
{ heading_style: :atx },
|
|
245
|
-
{ extract_links: true, extract_images: true, extract_headers: true, extract_structured_data: true }
|
|
246
|
-
)
|
|
247
|
-
|
|
248
|
-
puts markdown
|
|
249
|
-
puts metadata[:document][:title] # "Example"
|
|
250
|
-
puts metadata[:document][:description] # "Demo page"
|
|
251
|
-
puts metadata[:document][:open_graph] # {"og:image" => "https://example.com/og.jpg"}
|
|
252
|
-
puts metadata[:links].first[:rel] # ["nofollow", "external"]
|
|
253
|
-
puts metadata[:images].first[:dimensions] # [640, 480]
|
|
254
|
-
puts metadata[:headers].first[:id] # "welcome"
|
|
255
|
-
```
|
|
256
153
|
|
|
257
|
-
|
|
154
|
+
## Metadata Extraction
|
|
258
155
|
|
|
259
|
-
|
|
156
|
+
The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
|
|
260
157
|
|
|
261
|
-
|
|
158
|
+
**Use Cases:**
|
|
159
|
+
- **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
|
|
160
|
+
- **Table of contents generation** – Build structured outlines from heading hierarchy
|
|
161
|
+
- **Content migration** – Document all external links and resources
|
|
162
|
+
- **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
|
|
163
|
+
- **Link validation** – Classify and validate anchor, internal, external, email, and phone links
|
|
262
164
|
|
|
263
|
-
|
|
264
|
-
{
|
|
265
|
-
document: {
|
|
266
|
-
title: String?,
|
|
267
|
-
description: String?,
|
|
268
|
-
keywords: Array[String],
|
|
269
|
-
author: String?,
|
|
270
|
-
canonical_url: String?,
|
|
271
|
-
base_href: String?,
|
|
272
|
-
language: String?,
|
|
273
|
-
text_direction: "ltr" | "rtl" | "auto" | nil,
|
|
274
|
-
open_graph: Hash[String, String],
|
|
275
|
-
twitter_card: Hash[String, String],
|
|
276
|
-
meta_tags: Hash[String, String]
|
|
277
|
-
},
|
|
278
|
-
headers: [
|
|
279
|
-
{
|
|
280
|
-
level: Integer, # 1-6
|
|
281
|
-
text: String,
|
|
282
|
-
id: String?,
|
|
283
|
-
depth: Integer,
|
|
284
|
-
html_offset: Integer
|
|
285
|
-
}
|
|
286
|
-
],
|
|
287
|
-
links: [
|
|
288
|
-
{
|
|
289
|
-
href: String,
|
|
290
|
-
text: String,
|
|
291
|
-
title: String?,
|
|
292
|
-
link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
|
|
293
|
-
rel: Array[String],
|
|
294
|
-
attributes: Hash[String, String]
|
|
295
|
-
}
|
|
296
|
-
],
|
|
297
|
-
images: [
|
|
298
|
-
{
|
|
299
|
-
src: String,
|
|
300
|
-
alt: String?,
|
|
301
|
-
title: String?,
|
|
302
|
-
dimensions: [Integer, Integer]?,
|
|
303
|
-
image_type: "data_uri" | "inline_svg" | "external" | "relative",
|
|
304
|
-
attributes: Hash[String, String]
|
|
305
|
-
}
|
|
306
|
-
],
|
|
307
|
-
structured_data: [
|
|
308
|
-
{
|
|
309
|
-
data_type: "json_ld" | "microdata" | "rdfa",
|
|
310
|
-
raw_json: String,
|
|
311
|
-
schema_type: String?
|
|
312
|
-
}
|
|
313
|
-
]
|
|
314
|
-
}
|
|
315
|
-
```
|
|
165
|
+
**Zero Overhead When Disabled:** Metadata is collected during the HTML parsing pass, so enabling it adds negligible overhead. Disable unused metadata types in `MetadataConfig` to reduce it further.
|
|
316
166
|
|
|
317
|
-
|
|
167
|
+
### Example: Quick Start
|
|
318
168
|
|
|
319
|
-
Pass a hash with the following options to control which metadata types are extracted:
|
|
320
169
|
|
|
321
170
|
```ruby
|
|
322
|
-
|
|
323
|
-
extract_headers: true, # Extract h1-h6 elements (default: true)
|
|
324
|
-
extract_links: true, # Extract <a> elements (default: true)
|
|
325
|
-
extract_images: true, # Extract <img> elements (default: true)
|
|
326
|
-
extract_structured_data: true, # Extract JSON-LD/Microdata/RDFa (default: true)
|
|
327
|
-
max_structured_data_size: 1_000_000 # Max bytes for structured data (default: 1MB)
|
|
328
|
-
}
|
|
329
|
-
|
|
330
|
-
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
|
|
331
|
-
```
|
|
332
|
-
|
|
333
|
-
#### Features
|
|
334
|
-
|
|
335
|
-
The Ruby binding provides comprehensive metadata extraction during HTML-to-Markdown conversion:
|
|
336
|
-
|
|
337
|
-
- **Document Metadata**: title, description, keywords, author, canonical URL, language, text direction
|
|
338
|
-
- **Open Graph & Twitter Card**: social media metadata extraction
|
|
339
|
-
- **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
|
|
340
|
-
- **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
|
|
341
|
-
- **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
|
|
342
|
-
- **Structured Data**: JSON-LD, Microdata, and RDFa extraction
|
|
343
|
-
|
|
344
|
-
#### Type Safety with RBS
|
|
345
|
-
|
|
346
|
-
All types are defined in RBS format in `sig/html_to_markdown.rbs`:
|
|
347
|
-
|
|
348
|
-
- `document_metadata` - Document-level metadata structure
|
|
349
|
-
- `header_metadata` - Individual header element
|
|
350
|
-
- `link_metadata` - Individual link element
|
|
351
|
-
- `image_metadata` - Individual image element
|
|
352
|
-
- `structured_data` - Structured data block
|
|
353
|
-
- `extended_metadata` - Complete metadata extraction result
|
|
171
|
+
require 'html_to_markdown'
|
|
354
172
|
|
|
355
|
-
|
|
173
|
+
html = '<h1>Article</h1><img src="test.jpg" alt="test">'
|
|
174
|
+
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
|
|
356
175
|
|
|
357
|
-
|
|
358
|
-
|
|
176
|
+
puts metadata[:document][:title] # Document title
|
|
177
|
+
puts metadata[:headers] # All h1-h6 elements
|
|
178
|
+
puts metadata[:links] # All hyperlinks
|
|
179
|
+
puts metadata[:images] # All images with alt text
|
|
180
|
+
puts metadata[:structured_data] # JSON-LD, Microdata, RDFa
|
|
359
181
|
```
|
|
360
182
|
|
|
361
|
-
#### Implementation Architecture
|
|
362
183
|
|
|
363
|
-
The Rust implementation uses a single-pass collector pattern for efficient metadata extraction:
|
|
364
184
|
|
|
365
|
-
|
|
366
|
-
2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
|
|
367
|
-
3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
|
|
368
|
-
4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
|
|
185
|
+
For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
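As a rough illustration of the table-of-contents use case, the sketch below builds a nested Markdown outline from a headers array. It runs on hand-written sample data rather than real gem output: the `:level`, `:text`, and `:id` keys follow the metadata structure documented in the 2.16.1 README (the removed lines of this diff), and it is an assumption that 2.19.0 returns the same shape. `build_toc` is a hypothetical helper, not part of the gem.

```ruby
# Hand-written sample shaped like metadata[:headers]; in real use this would
# come from HtmlToMarkdown.convert_with_metadata(html). The key names are
# assumed from the 2.16.1 README.
SAMPLE_HEADERS = [
  { level: 1, text: 'Article', id: 'article' },
  { level: 2, text: 'Background', id: 'background' },
  { level: 3, text: 'Prior work', id: nil },
  { level: 2, text: 'Results', id: 'results' }
].freeze

# Hypothetical helper: one list item per header, indented two spaces per
# level below h1, linked to its anchor when an id is present.
def build_toc(headers)
  headers.map do |h|
    indent = '  ' * (h[:level] - 1)
    label = h[:id] ? "[#{h[:text]}](##{h[:id]})" : h[:text]
    "#{indent}- #{label}"
  end.join("\n")
end

puts build_toc(SAMPLE_HEADERS)
```

Headers without an `id` are emitted as plain text, since there is no anchor to link to.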
|
|
369
186
|
|
|
370
|
-
The metadata feature is gated by a Cargo feature in `Cargo.toml`:
|
|
371
187
|
|
|
372
|
-
```toml
|
|
373
|
-
[features]
|
|
374
|
-
metadata = ["html-to-markdown-rs/metadata"]
|
|
375
|
-
```
|
|
376
188
|
|
|
377
|
-
This ensures:
|
|
378
|
-
- Zero overhead when metadata is not needed
|
|
379
|
-
- Clean integration with feature flag detection
|
|
380
|
-
- Consistent with Python binding implementation
|
|
381
189
|
|
|
382
|
-
|
|
190
|
+
## Visitor Pattern
|
|
383
191
|
|
|
384
|
-
|
|
192
|
+
The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.
|
|
385
193
|
|
|
386
|
-
|
|
387
|
-
-
|
|
388
|
-
-
|
|
389
|
-
-
|
|
194
|
+
**Use Cases:**
|
|
195
|
+
- **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
|
|
196
|
+
- **Content filtering** – Remove tracking pixels, ads, or unwanted elements
|
|
197
|
+
- **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
|
|
198
|
+
- **Accessibility validation** – Check alt text, heading hierarchy, link text
|
|
199
|
+
- **Analytics** – Track element usage, link destinations, image sources
|
|
390
200
|
|
|
391
|
-
|
|
201
|
+
**Supported Visitor Methods:** 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
|
|
392
202
|
|
|
393
|
-
|
|
203
|
+
### Example: Quick Start
|
|
394
204
|
|
|
395
|
-
Single-pass collection during tree traversal:
|
|
396
|
-
- No additional parsing passes
|
|
397
|
-
- Minimal memory overhead
|
|
398
|
-
- Configurable extraction granularity
|
|
399
|
-
- Built-in size limits for safety
|
|
400
205
|
|
|
401
|
-
|
|
206
|
+
```ruby
|
|
207
|
+
require 'html_to_markdown'
|
|
402
208
|
|
|
403
|
-
|
|
209
|
+
class MyVisitor
|
|
210
|
+
def visit_link(ctx, href, text, title = nil)
|
|
211
|
+
# Rewrite CDN URLs
|
|
212
|
+
if href.start_with?('https://old-cdn.com')
|
|
213
|
+
href = href.sub('https://old-cdn.com', 'https://new-cdn.com')
|
|
214
|
+
end
|
|
215
|
+
{ type: :custom, output: "[#{text}](#{href})" }
|
|
216
|
+
end
|
|
217
|
+
|
|
218
|
+
def visit_image(ctx, src, alt = nil, title = nil)
|
|
219
|
+
# Skip tracking pixels
|
|
220
|
+
src.include?('tracking') ? { type: :skip } : { type: :continue }
|
|
221
|
+
end
|
|
222
|
+
end
|
|
404
223
|
|
|
405
|
-
|
|
406
|
-
|
|
407
|
-
bundle exec rake compile -- --release --features metadata
|
|
408
|
-
bundle exec rspec spec/metadata_extraction_spec.rb
|
|
224
|
+
html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
|
|
225
|
+
markdown = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
|
|
409
226
|
```
|
|
410
227
|
|
|
411
|
-
Tests cover:
|
|
412
|
-
- All metadata types extraction
|
|
413
|
-
- Configuration flags
|
|
414
|
-
- Edge cases (empty HTML, malformed input, special characters)
|
|
415
|
-
- Return value structure validation
|
|
416
|
-
- Integration with conversion options
|
|
417
228
|
|
|
418
|
-
## CLI
|
|
419
229
|
|
|
420
|
-
|
|
230
|
+
For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
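Because the visitor callbacks in the README example are plain Ruby methods, their logic can be checked in isolation before handing the object to `convert_with_visitor`. The sketch below is a variant of that example and does not load the native extension. The result hashes (`type: :custom` with `output:`, `type: :skip`, `type: :continue`) are copied from the README example; returning `:continue` from `visit_link` and passing `nil` for `ctx` are assumptions, and the class name and CDN hosts are illustrative.

```ruby
# A variant of the README's visitor, written so it can be unit-tested without
# the native extension. Class name and hosts are illustrative only.
class CdnRewriteVisitor
  OLD_HOST = 'https://old-cdn.com'
  NEW_HOST = 'https://new-cdn.com'

  # Rewrite links on the old CDN; leave every other link to the default
  # conversion (assumes :continue is accepted for links as it is for images).
  def visit_link(_ctx, href, text, _title = nil)
    return { type: :continue } unless href.start_with?(OLD_HOST)

    { type: :custom, output: "[#{text}](#{href.sub(OLD_HOST, NEW_HOST)})" }
  end

  # Skip images whose URL contains "tracking"; keep the rest.
  def visit_image(_ctx, src, _alt = nil, _title = nil)
    src.include?('tracking') ? { type: :skip } : { type: :continue }
  end
end

visitor = CdnRewriteVisitor.new
# ctx is passed as nil here because its contents are not shown in the README.
p visitor.visit_link(nil, 'https://old-cdn.com/file.pdf', 'Download')
p visitor.visit_image(nil, 'https://example.com/tracking.gif')
```

Keeping the callbacks free of side effects, as here, makes them straightforward to cover with ordinary specs.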
|
|
421
231
|
|
|
422
|
-
```ruby
|
|
423
|
-
require 'html_to_markdown/cli'
|
|
424
232
|
|
|
425
|
-
HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
|
|
426
|
-
# => writes converted Markdown to STDOUT
|
|
427
|
-
```
|
|
428
233
|
|
|
429
|
-
|
|
234
|
+
## Examples
|
|
430
235
|
|
|
431
|
-
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
```
|
|
236
|
+
- [Visitor Pattern Guide](../../examples/visitor-pattern/)
|
|
237
|
+
- [Metadata Extraction Guide](../../examples/metadata-extraction/)
|
|
238
|
+
- [Performance Guide](../../examples/performance/)
|
|
435
239
|
|
|
436
|
-
|
|
240
|
+
## Links
|
|
437
241
|
|
|
438
|
-
|
|
439
|
-
bundle exec rake compile # builds the extension
|
|
440
|
-
bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
|
|
441
|
-
```
|
|
442
|
-
|
|
443
|
-
## Error Handling
|
|
242
|
+
- **GitHub:** [github.com/kreuzberg-dev/html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown)
|
|
444
243
|
|
|
445
|
-
|
|
244
|
+
- **RubyGems:** [rubygems.org/gems/html-to-markdown](https://rubygems.org/gems/html-to-markdown)
|
|
446
245
|
|
|
447
|
-
-
|
|
448
|
-
-
|
|
246
|
+
- **Kreuzberg Ecosystem:** [kreuzberg.dev](https://kreuzberg.dev)
|
|
247
|
+
- **Discord:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
|
|
449
248
|
|
|
450
|
-
|
|
249
|
+
## Contributing
|
|
451
250
|
|
|
452
|
-
|
|
453
|
-
`Invalid input` message.
|
|
251
|
+
We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/CONTRIBUTING.md) for details on:
|
|
454
252
|
|
|
455
|
-
|
|
253
|
+
- Setting up the development environment
|
|
254
|
+
- Running tests locally
|
|
255
|
+
- Submitting pull requests
|
|
256
|
+
- Reporting issues
|
|
456
257
|
|
|
457
|
-
|
|
258
|
+
All contributions must follow our code quality standards (enforced via pre-commit hooks):
|
|
458
259
|
|
|
459
|
-
-
|
|
460
|
-
-
|
|
461
|
-
-
|
|
462
|
-
- The Rust crate and CLI
|
|
260
|
+
- Proper test coverage (Rust 95%+, language bindings 80%+)
|
|
261
|
+
- Formatting and linting checks
|
|
262
|
+
- Documentation for public APIs
|
|
463
263
|
|
|
464
|
-
|
|
264
|
+
## License
|
|
465
265
|
|
|
466
|
-
|
|
266
|
+
MIT License – see [LICENSE](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE).
|
|
467
267
|
|
|
468
|
-
|
|
469
|
-
bundle exec rake compile # build the native extension
|
|
470
|
-
bundle exec rspec # run test suite
|
|
471
|
-
```
|
|
268
|
+
## Support
|
|
472
269
|
|
|
473
|
-
|
|
270
|
+
If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/kreuzberg-dev).
|
|
474
271
|
|
|
475
|
-
|
|
272
|
+
Have questions or run into issues? We're here to help:
|
|
476
273
|
|
|
477
|
-
|
|
274
|
+
- **GitHub Issues:** [github.com/kreuzberg-dev/html-to-markdown/issues](https://github.com/kreuzberg-dev/html-to-markdown/issues)
|
|
275
|
+
- **Discussions:** [github.com/kreuzberg-dev/html-to-markdown/discussions](https://github.com/kreuzberg-dev/html-to-markdown/discussions)
|
|
276
|
+
- **Discord Community:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
|