html-to-markdown 2.16.0 → 2.18.0
- checksums.yaml +4 -4
- data/Gemfile.lock +90 -10
- data/README.md +99 -302
- data/bin/benchmark.rb +100 -12
- data/ext/html-to-markdown-rb/native/Cargo.toml +7 -5
- data/ext/html-to-markdown-rb/native/README.md +5 -5
- data/ext/html-to-markdown-rb/native/src/lib.rs +951 -0
- data/ext/html-to-markdown-rb/native/src/profiling.rs +4 -0
- data/html-to-markdown-rb.gemspec +6 -6
- data/lib/html_to_markdown/version.rb +1 -1
- data/sig/html_to_markdown.rbs +110 -0
- data/spec/visitor_spec.rb +1149 -0
- metadata +9 -8
data/README.md
CHANGED
|
@@ -11,8 +11,8 @@ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust eng
|
|
|
11
11
|
[](https://hex.pm/packages/html_to_markdown)
|
|
12
12
|
[](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
|
|
13
13
|
[](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
|
|
14
|
-
[](https://github.com/
|
|
14
|
+
[](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown)
|
|
15
|
+
[](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE)
|
|
16
16
|
[](https://discord.gg/pXxagNK2zN)
|
|
17
17
|
|
|
18
18
|
## Features
|
|
@@ -26,10 +26,10 @@ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust eng
|
|
|
26
26
|
|
|
27
27
|
## Documentation & Support
|
|
28
28
|
|
|
29
|
-
- [GitHub repository](https://github.com/
|
|
30
|
-
- [Issue tracker](https://github.com/
|
|
31
|
-
- [Changelog](https://github.com/
|
|
32
|
-
- [Live demo (WASM)](https://
|
|
29
|
+
- [GitHub repository](https://github.com/kreuzberg-dev/html-to-markdown)
|
|
30
|
+
- [Issue tracker](https://github.com/kreuzberg-dev/html-to-markdown/issues)
|
|
31
|
+
- [Changelog](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/CHANGELOG.md)
|
|
32
|
+
- [Live demo (WASM)](https://kreuzberg-dev.github.io/html-to-markdown/)
|
|
33
33
|
|
|
34
34
|
## Installation
|
|
35
35
|
|
|
@@ -55,35 +55,6 @@ ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolch
|
|
|
55
55
|
|
|
56
56
|
This provides the standard headers (including `strings.h`) required for the bindgen step.
|
|
57
57
|
|
|
58
|
-
## Performance Snapshot
|
|
59
|
-
|
|
60
|
-
Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
|
|
61
|
-
|
|
62
|
-
| Document | Size | Latency | Throughput | Docs/sec |
|
|
63
|
-
| ------------------- | ----- | ------- | ---------- | -------- |
|
|
64
|
-
| Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
|
|
65
|
-
| Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
|
|
66
|
-
| Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
|
|
67
|
-
|
|
68
|
-
> Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
|
|
69
|
-
|
|
70
|
-
### Benchmark Fixtures (Apple M4)
|
|
71
|
-
|
|
72
|
-
Measured via `task bench:harness` with the shared Wikipedia + hOCR suite:
|
|
73
|
-
|
|
74
|
-
| Document | Size | ops/sec (Ruby) |
|
|
75
|
-
| ---------------------- | ------ | -------------- |
|
|
76
|
-
| Lists (Timeline) | 129 KB | 3,156 |
|
|
77
|
-
| Tables (Countries) | 360 KB | 921 |
|
|
78
|
-
| Medium (Python) | 657 KB | 469 |
|
|
79
|
-
| Large (Rust) | 567 KB | 534 |
|
|
80
|
-
| Small (Intro) | 463 KB | 629 |
|
|
81
|
-
| hOCR German PDF | 44 KB | 7,250 |
|
|
82
|
-
| hOCR Invoice | 4 KB | 83,883 |
|
|
83
|
-
| hOCR Embedded Tables | 37 KB | 7,890 |
|
|
84
|
-
|
|
85
|
-
> These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
|
|
86
|
-
|
|
87
58
|
## Quick Start
|
|
88
59
|
|
|
89
60
|
```ruby
|
|
@@ -108,349 +79,174 @@ puts markdown
|
|
|
108
79
|
# - Identical output across languages
|
|
109
80
|
```
|
|
110
81
|
|
|
111
|
-
## API
|
|
82
|
+
## API Reference
|
|
112
83
|
|
|
113
|
-
### Conversion
|
|
114
|
-
|
|
115
|
-
Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
|
|
84
|
+
### Basic Conversion
|
|
116
85
|
|
|
117
86
|
```ruby
|
|
118
|
-
|
|
87
|
+
# Simple conversion
|
|
88
|
+
markdown = HtmlToMarkdown.convert(html)
|
|
119
89
|
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
heading_style: :atx,
|
|
123
|
-
code_block_style: :fenced,
|
|
124
|
-
bullets: '*+-',
|
|
125
|
-
list_indent_type: :spaces,
|
|
126
|
-
list_indent_width: 2,
|
|
127
|
-
whitespace_mode: :normalized,
|
|
128
|
-
highlight_style: :double_equal
|
|
129
|
-
)
|
|
90
|
+
# With options (pass a Ruby hash with symbol keys)
|
|
91
|
+
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
|
|
130
92
|
|
|
131
|
-
|
|
132
|
-
|
|
93
|
+
# With inline images
|
|
94
|
+
result = HtmlToMarkdown.convert_with_inline_images(html, image_config: {...})
|
|
95
|
+
markdown = result.markdown
|
|
96
|
+
images = result.inline_images
|
|
133
97
|
|
|
134
|
-
|
|
98
|
+
# With metadata extraction
|
|
99
|
+
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, options, metadata_config)
|
|
135
100
|
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
```ruby
|
|
139
|
-
handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
|
|
140
|
-
|
|
141
|
-
100.times do
|
|
142
|
-
HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
|
|
143
|
-
end
|
|
101
|
+
# With visitor pattern (custom callbacks)
|
|
102
|
+
result = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new, options: {...})
|
|
144
103
|
```
|
|
145
104
|
|
|
146
|
-
###
|
|
147
|
-
|
|
148
|
-
Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
|
|
105
|
+
### Conversion Options Hash
|
|
149
106
|
|
|
150
107
|
```ruby
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
108
|
+
{
|
|
109
|
+
heading_style: :atx, # :atx or :setext
|
|
110
|
+
code_block_style: :fenced, # :fenced or :indented
|
|
111
|
+
bullets: '*+-', # List bullet chars
|
|
112
|
+
list_indent_type: :spaces, # :spaces or :tabs
|
|
113
|
+
list_indent_width: 2, # Number of indent spaces
|
|
114
|
+
whitespace_mode: :normalized, # :normalized, :preserve, or :collapse
|
|
115
|
+
highlight_style: :double_equal, # Code highlighting style
|
|
116
|
+
hocr_spatial_tables: false, # Special hOCR table handling
|
|
155
117
|
preprocessing: {
|
|
156
118
|
enabled: true,
|
|
157
|
-
preset: :aggressive,
|
|
119
|
+
preset: :aggressive, # :minimal, :standard, :aggressive
|
|
158
120
|
remove_navigation: true,
|
|
159
121
|
remove_forms: true
|
|
160
122
|
}
|
|
161
|
-
|
|
123
|
+
}
|
|
162
124
|
```
|
|
163
125
|
|
|
164
|
-
###
|
|
126
|
+
### Performance: Reusing Options
|
|
165
127
|
|
|
166
|
-
|
|
128
|
+
For tight loops, build an options handle once:
|
|
167
129
|
|
|
168
130
|
```ruby
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
result = HtmlToMarkdown.convert_with_inline_images(
|
|
172
|
-
'<img src="..." alt="Pixel">',
|
|
173
|
-
image_config: {
|
|
174
|
-
max_decoded_size_bytes: 1 * 1024 * 1024,
|
|
175
|
-
infer_dimensions: true,
|
|
176
|
-
filename_prefix: 'img_',
|
|
177
|
-
capture_svg: true
|
|
178
|
-
}
|
|
179
|
-
)
|
|
131
|
+
handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
|
|
180
132
|
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
|
|
133
|
+
100.times do
|
|
134
|
+
HtmlToMarkdown.convert_with_options(html, handle)
|
|
184
135
|
end
|
|
185
136
|
```
|
|
186
137
|
|
|
187
138
|
### Metadata Extraction
|
|
188
139
|
|
|
189
|
-
Extract
|
|
190
|
-
|
|
191
|
-
#### Basic Usage
|
|
140
|
+
Extract document properties (title, description, author, language), social metadata (Open Graph, Twitter cards), heading hierarchy, link analysis, image metadata, and structured data (JSON-LD, Microdata, RDFa):
|
|
192
141
|
|
|
193
142
|
```ruby
|
|
194
|
-
require 'html_to_markdown'
|
|
195
|
-
|
|
196
143
|
html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
|
|
197
144
|
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
|
|
198
145
|
|
|
199
|
-
puts
|
|
200
|
-
puts metadata[:
|
|
201
|
-
puts metadata[:headers].length # 1
|
|
202
|
-
```
|
|
203
|
-
|
|
204
|
-
#### With Conversion Options
|
|
205
|
-
|
|
206
|
-
```ruby
|
|
207
|
-
conv_opts = { heading_style: :atx_closed }
|
|
208
|
-
metadata_opts = { extract_headers: true, extract_links: false }
|
|
209
|
-
|
|
210
|
-
markdown, metadata = HtmlToMarkdown.convert_with_metadata(
|
|
211
|
-
html,
|
|
212
|
-
conv_opts,
|
|
213
|
-
metadata_opts
|
|
214
|
-
)
|
|
215
|
-
```
|
|
216
|
-
|
|
217
|
-
#### Full Example
|
|
218
|
-
|
|
219
|
-
```ruby
|
|
220
|
-
require 'html_to_markdown'
|
|
221
|
-
|
|
222
|
-
html = <<~HTML
|
|
223
|
-
<html>
|
|
224
|
-
<head>
|
|
225
|
-
<title>Example</title>
|
|
226
|
-
<meta name="description" content="Demo page">
|
|
227
|
-
<link rel="canonical" href="https://example.com/page">
|
|
228
|
-
<meta property="og:image" content="https://example.com/og.jpg">
|
|
229
|
-
<meta name="twitter:card" content="summary_large_image">
|
|
230
|
-
</head>
|
|
231
|
-
<body>
|
|
232
|
-
<h1 id="welcome">Welcome</h1>
|
|
233
|
-
<a href="https://example.com" rel="nofollow external">Example link</a>
|
|
234
|
-
<img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
|
|
235
|
-
<script type="application/ld+json">
|
|
236
|
-
{"@context": "https://schema.org", "@type": "Article"}
|
|
237
|
-
</script>
|
|
238
|
-
</body>
|
|
239
|
-
</html>
|
|
240
|
-
HTML
|
|
241
|
-
|
|
242
|
-
markdown, metadata = HtmlToMarkdown.convert_with_metadata(
|
|
243
|
-
html,
|
|
244
|
-
{ heading_style: :atx },
|
|
245
|
-
{ extract_links: true, extract_images: true, extract_headers: true, extract_structured_data: true }
|
|
246
|
-
)
|
|
247
|
-
|
|
248
|
-
puts markdown
|
|
249
|
-
puts metadata[:document][:title] # "Example"
|
|
250
|
-
puts metadata[:document][:description] # "Demo page"
|
|
251
|
-
puts metadata[:document][:open_graph] # {"og:image" => "https://example.com/og.jpg"}
|
|
252
|
-
puts metadata[:links].first[:rel] # ["nofollow", "external"]
|
|
253
|
-
puts metadata[:images].first[:dimensions] # [640, 480]
|
|
254
|
-
puts metadata[:headers].first[:id] # "welcome"
|
|
146
|
+
puts metadata[:document][:title] # "Test"
|
|
147
|
+
puts metadata[:headers].first[:text] # "Hello"
|
|
255
148
|
```
|
|
256
149
|
|
|
257
|
-
|
|
150
|
+
For detailed examples (SEO extraction, heading hierarchy analysis, structured data) and full metadata structure reference, see [Metadata Extraction Guide](../../examples/metadata-extraction/).
|
|
258
151
|
|
|
259
|
-
|
|
152
|
+
### Visitor Pattern
|
|
260
153
|
|
|
261
|
-
|
|
154
|
+
Customize conversion with fine-grained element callbacks. Perfect for custom element handling, analytics during conversion, domain-specific markdown dialects, and conditional rendering:
|
|
262
155
|
|
|
263
156
|
```ruby
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
text_direction: "ltr" | "rtl" | "auto" | nil,
|
|
274
|
-
open_graph: Hash[String, String],
|
|
275
|
-
twitter_card: Hash[String, String],
|
|
276
|
-
meta_tags: Hash[String, String]
|
|
277
|
-
},
|
|
278
|
-
headers: [
|
|
279
|
-
{
|
|
280
|
-
level: Integer, # 1-6
|
|
281
|
-
text: String,
|
|
282
|
-
id: String?,
|
|
283
|
-
depth: Integer,
|
|
284
|
-
html_offset: Integer
|
|
285
|
-
}
|
|
286
|
-
],
|
|
287
|
-
links: [
|
|
288
|
-
{
|
|
289
|
-
href: String,
|
|
290
|
-
text: String,
|
|
291
|
-
title: String?,
|
|
292
|
-
link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
|
|
293
|
-
rel: Array[String],
|
|
294
|
-
attributes: Hash[String, String]
|
|
295
|
-
}
|
|
296
|
-
],
|
|
297
|
-
images: [
|
|
298
|
-
{
|
|
299
|
-
src: String,
|
|
300
|
-
alt: String?,
|
|
301
|
-
title: String?,
|
|
302
|
-
dimensions: [Integer, Integer]?,
|
|
303
|
-
image_type: "data_uri" | "inline_svg" | "external" | "relative",
|
|
304
|
-
attributes: Hash[String, String]
|
|
305
|
-
}
|
|
306
|
-
],
|
|
307
|
-
structured_data: [
|
|
308
|
-
{
|
|
309
|
-
data_type: "json_ld" | "microdata" | "rdfa",
|
|
310
|
-
raw_json: String,
|
|
311
|
-
schema_type: String?
|
|
312
|
-
}
|
|
313
|
-
]
|
|
314
|
-
}
|
|
315
|
-
```
|
|
316
|
-
|
|
317
|
-
#### Metadata Configuration
|
|
318
|
-
|
|
319
|
-
Pass a hash with the following options to control which metadata types are extracted:
|
|
320
|
-
|
|
321
|
-
```ruby
|
|
322
|
-
config = {
|
|
323
|
-
extract_headers: true, # Extract h1-h6 elements (default: true)
|
|
324
|
-
extract_links: true, # Extract <a> elements (default: true)
|
|
325
|
-
extract_images: true, # Extract <img> elements (default: true)
|
|
326
|
-
extract_structured_data: true, # Extract JSON-LD/Microdata/RDFa (default: true)
|
|
327
|
-
max_structured_data_size: 1_000_000 # Max bytes for structured data (default: 1MB)
|
|
328
|
-
}
|
|
157
|
+
class MyVisitor
|
|
158
|
+
def visit_link(ctx, href, text, title = nil)
|
|
159
|
+
{ type: :custom, output: "[#{text}](#{href})" }
|
|
160
|
+
end
|
|
161
|
+
|
|
162
|
+
def visit_image(ctx, src, alt, title = nil)
|
|
163
|
+
{ type: :skip } # Remove images
|
|
164
|
+
end
|
|
165
|
+
end
|
|
329
166
|
|
|
330
|
-
|
|
167
|
+
result = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
|
|
331
168
|
```
|
|
332
169
|
|
|
333
|
-
|
|
170
|
+
**Return types**: `{ type: :continue }` (default), `{ type: :custom, output: "..." }` (replace), `{ type: :skip }` (omit), `{ type: :preserve_html }` (keep HTML), `{ type: :error, message: "..." }` (halt).
|
|
334
171
|
|
|
335
|
-
|
|
172
|
+
**40+ visitor methods** for text, inline formatting, blocks, lists, tables, advanced elements, and lifecycle hooks. Callback parameters include `NodeContext` with element metadata (tag_name, attributes, depth, parent_tag, is_inline).
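The return types listed above can be sketched with a small visitor. This is an illustration only, not code from the gem: the class name `LinkAuditVisitor` and its filtering rule are hypothetical, and only the `visit_link(ctx, href, text, title = nil)` parameter list and the `{ type: ... }` result hashes come from the README text. The callbacks are invoked by hand (with `ctx` set to `nil`) so the sketch runs without the native extension installed.

```ruby
# Hypothetical visitor: the class name and the filtering rule are illustrative.
class LinkAuditVisitor
  attr_reader :hrefs

  def initialize
    @hrefs = []
  end

  # Same parameter list as the README's visit_link example.
  def visit_link(ctx, href, text, title = nil)
    @hrefs << href
    if href.start_with?('javascript:')
      # Replace script links with their plain text.
      { type: :custom, output: text }
    else
      # Let the converter render the link as usual.
      { type: :continue }
    end
  end
end

visitor = LinkAuditVisitor.new
# Called directly here; in real use the instance would be passed as
# `visitor:` to HtmlToMarkdown.convert_with_visitor.
safe = visitor.visit_link(nil, 'https://example.com', 'Example')
unsafe = visitor.visit_link(nil, 'javascript:void(0)', 'Click me')
```

Keeping the decision logic inside plain methods like this makes a visitor easy to unit-test on its own before wiring it into a conversion.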
|
|
336
173
|
|
|
337
|
-
|
|
338
|
-
- **Open Graph & Twitter Card**: social media metadata extraction
|
|
339
|
-
- **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
|
|
340
|
-
- **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
|
|
341
|
-
- **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
|
|
342
|
-
- **Structured Data**: JSON-LD, Microdata, and RDFa extraction
|
|
174
|
+
For advanced examples (image filtering, link analytics, footnote dialects), RBS type-safety patterns, and full method reference, see [Visitor Pattern Guide](../../examples/visitor-pattern/).
|
|
343
175
|
|
|
344
|
-
|
|
176
|
+
## RBS Types & Strict Type Checking
|
|
345
177
|
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
- `document_metadata` - Document-level metadata structure
|
|
349
|
-
- `header_metadata` - Individual header element
|
|
350
|
-
- `link_metadata` - Individual link element
|
|
351
|
-
- `image_metadata` - Individual image element
|
|
352
|
-
- `structured_data` - Structured data block
|
|
353
|
-
- `extended_metadata` - Complete metadata extraction result
|
|
354
|
-
|
|
355
|
-
Uses strict RBS type checking with Steep for full type safety:
|
|
178
|
+
Full RBS type definitions in `sig/html_to_markdown.rbs` enable strict type checking with [Steep](https://github.com/soutaro/steep):
|
|
356
179
|
|
|
357
180
|
```bash
|
|
358
181
|
steep check
|
|
359
182
|
```
|
|
360
183
|
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
|
|
184
|
+
Key types:
|
|
185
|
+
- `HtmlToMarkdown::NodeContext` - Element metadata in visitor callbacks (tag_name, attributes, depth, etc.)
|
|
186
|
+
- `HtmlToMarkdown::visitor_result` - Return type union for visitor methods
|
|
187
|
+
- `HtmlToMarkdown::extended_metadata` - Metadata extraction result
|
|
364
188
|
|
|
365
|
-
|
|
366
|
-
2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
|
|
367
|
-
3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
|
|
368
|
-
4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
|
|
189
|
+
Type-safe visitor implementation:
|
|
369
190
|
|
|
370
|
-
|
|
371
|
-
|
|
372
|
-
|
|
373
|
-
|
|
374
|
-
|
|
191
|
+
```ruby
|
|
192
|
+
class TypedVisitor
|
|
193
|
+
def visit_link(
|
|
194
|
+
ctx : HtmlToMarkdown::NodeContext,
|
|
195
|
+
href : String,
|
|
196
|
+
text : String,
|
|
197
|
+
title : String | nil = nil
|
|
198
|
+
) : HtmlToMarkdown::visitor_result
|
|
199
|
+
{ type: :custom, output: "[#{text}](#{href})" }
|
|
200
|
+
end
|
|
201
|
+
end
|
|
375
202
|
```
|
|
376
203
|
|
|
377
|
-
|
|
378
|
-
- Zero overhead when metadata is not needed
|
|
379
|
-
- Clean integration with feature flag detection
|
|
380
|
-
- Consistent with Python binding implementation
|
|
381
|
-
|
|
382
|
-
#### Language Parity
|
|
383
|
-
|
|
384
|
-
Implements the same API as the Python binding:
|
|
385
|
-
|
|
386
|
-
- Same method signature: `convert_with_metadata(html, options, metadata_config)`
|
|
387
|
-
- Same return type: `[markdown, metadata_dict]`
|
|
388
|
-
- Same metadata structures and field names
|
|
389
|
-
- Same enum values (link_type, image_type, data_type, text_direction)
|
|
390
|
-
|
|
391
|
-
Enables seamless migration and multi-language development.
|
|
204
|
+
All public methods are typed for early error detection and LSP editor support (Ruby 3+).
|
|
392
205
|
|
|
393
|
-
|
|
206
|
+
## Magnus Native Extension
|
|
394
207
|
|
|
395
|
-
|
|
396
|
-
- No additional parsing passes
|
|
397
|
-
- Minimal memory overhead
|
|
398
|
-
- Configurable extraction granularity
|
|
399
|
-
- Built-in size limits for safety
|
|
208
|
+
The gem compiles a native Rust extension via [Magnus](https://github.com/matsadler/magnus) FFI bindings:
|
|
400
209
|
|
|
401
|
-
|
|
210
|
+
- **Zero-copy interop**: String and hash data flows directly between Ruby and Rust
|
|
211
|
+
- **Safe bindings**: No segfaults; Rust's type system ensures memory safety
|
|
212
|
+
- **Automatic error mapping**: Rust errors convert to Ruby exceptions with full context
|
|
213
|
+
- **Native performance**: Compiled to `.so` (Linux/macOS) or `.dll` (Windows)
|
|
214
|
+
- **Smart compilation**: Prebuilt binaries for common platforms; falls back to on-install compilation
|
|
402
215
|
|
|
403
|
-
|
|
216
|
+
Build manually:
|
|
404
217
|
|
|
405
218
|
```bash
|
|
406
|
-
|
|
407
|
-
bundle exec rake compile -- --release --features metadata
|
|
408
|
-
bundle exec rspec spec/metadata_extraction_spec.rb
|
|
219
|
+
bundle exec rake compile
|
|
409
220
|
```
|
|
410
221
|
|
|
411
|
-
|
|
412
|
-
- All metadata types extraction
|
|
413
|
-
- Configuration flags
|
|
414
|
-
- Edge cases (empty HTML, malformed input, special characters)
|
|
415
|
-
- Return value structure validation
|
|
416
|
-
- Integration with conversion options
|
|
222
|
+
## CLI Proxy
|
|
417
223
|
|
|
418
|
-
|
|
419
|
-
|
|
420
|
-
The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
|
|
224
|
+
Call the Rust CLI directly from Ruby or shell:
|
|
421
225
|
|
|
422
226
|
```ruby
|
|
423
227
|
require 'html_to_markdown/cli'
|
|
424
228
|
|
|
425
229
|
HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
|
|
426
|
-
# => writes converted Markdown to STDOUT
|
|
427
|
-
```
|
|
428
|
-
|
|
429
|
-
You can also call the CLI binary directly for scripting:
|
|
430
230
|
|
|
431
|
-
|
|
231
|
+
# Or call the binary directly
|
|
432
232
|
HtmlToMarkdown::CLIProxy.call(['--version'])
|
|
433
|
-
# => "html-to-markdown 2.5.7"
|
|
434
|
-
```
|
|
435
|
-
|
|
436
|
-
Rebuild the CLI locally if you see `CLI binary not built` during tests:
|
|
437
|
-
|
|
438
|
-
```bash
|
|
439
|
-
bundle exec rake compile # builds the extension
|
|
440
|
-
bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
|
|
441
233
|
```
|
|
442
234
|
|
|
443
235
|
## Error Handling
|
|
444
236
|
|
|
445
|
-
|
|
237
|
+
- `HtmlToMarkdown::Error` - Conversion errors with Rust error context
|
|
238
|
+
- `HtmlToMarkdown::CLIProxy::MissingBinaryError` - CLI binary not found
|
|
239
|
+
- `HtmlToMarkdown::CLIProxy::CLIExecutionError` - Command execution failed
|
|
240
|
+
|
|
241
|
+
Binary data inputs (e.g., PDF bytes as string) raise `HtmlToMarkdown::Error` with "Invalid input" message.
|
|
446
242
|
|
|
447
|
-
|
|
448
|
-
- `HtmlToMarkdown::CLIProxy::CLIExecutionError`
|
|
243
|
+
## Examples
|
|
449
244
|
|
|
450
|
-
|
|
245
|
+
Comprehensive guides with real-world patterns (Ruby examples included):
|
|
451
246
|
|
|
452
|
-
|
|
453
|
-
|
|
247
|
+
- **[Visitor Pattern](../../examples/visitor-pattern/)** - Custom callbacks, element-by-element control, analytics, domain-specific markdown dialects
|
|
248
|
+
- **[Metadata Extraction](../../examples/metadata-extraction/)** - SEO data, heading hierarchy, link classification, structured data parsing
|
|
249
|
+
- **[Performance Guide](../../examples/performance/)** - Benchmarking, profiling, throughput optimization
|
|
454
250
|
|
|
455
251
|
## Consistent Across Languages
|
|
456
252
|
|
|
@@ -459,6 +255,7 @@ The Ruby gem shares the exact Rust core with:
|
|
|
459
255
|
- [Python wheels](https://pypi.org/project/html-to-markdown/)
|
|
460
256
|
- [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
|
|
461
257
|
- [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
|
|
258
|
+
- [PHP extension](https://packagist.org/packages/goldziher/html-to-markdown)
|
|
462
259
|
- The Rust crate and CLI
|
|
463
260
|
|
|
464
261
|
Use whichever runtime fits your stack while keeping formatting behaviour identical.
|
|
@@ -470,7 +267,7 @@ bundle exec rake compile # build the native extension
|
|
|
470
267
|
bundle exec rspec # run test suite
|
|
471
268
|
```
|
|
472
269
|
|
|
473
|
-
|
|
270
|
+
When editing Rust code under `src/`, rerun `rake compile`.
|
|
474
271
|
|
|
475
272
|
## License
|
|
476
273
|
|