html-to-markdown 2.16.1 → 2.19.0

data/README.md CHANGED
@@ -1,477 +1,276 @@
1
- # html-to-markdown-rb
2
-
3
- Blazing-fast HTML to Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
4
-
5
- [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg?logo=rust&label=crates.io)](https://crates.io/crates/html-to-markdown-rs)
6
- [![npm (node)](https://img.shields.io/npm/v/html-to-markdown-node.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-node)
7
- [![npm (wasm)](https://img.shields.io/npm/v/html-to-markdown-wasm.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-wasm)
8
- [![PyPI](https://img.shields.io/pypi/v/html-to-markdown.svg?logo=pypi)](https://pypi.org/project/html-to-markdown/)
9
- [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
10
- [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
11
- [![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
12
- [![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
13
- [![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
14
- [![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown)
15
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
16
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
17
-
18
- ## Features
19
-
20
- - ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
21
- - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
22
- - ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
23
- - 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
24
- - 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
25
- - 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.
26
-
27
- ## Documentation & Support
28
-
29
- - [GitHub repository](https://github.com/Goldziher/html-to-markdown)
30
- - [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
31
- - [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
32
- - [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
1
+ # html-to-markdown
2
+
3
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
4
+ <!-- Language Bindings -->
5
+ <a href="https://crates.io/crates/html-to-markdown-rs">
6
+ <img src="https://img.shields.io/crates/v/html-to-markdown-rs?label=Rust&color=007ec6" alt="Rust">
7
+ </a>
8
+ <a href="https://pypi.org/project/html-to-markdown/">
9
+ <img src="https://img.shields.io/pypi/v/html-to-markdown?label=Python&color=007ec6" alt="Python">
10
+ </a>
11
+ <a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node">
12
+ <img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-node?label=Node.js&color=007ec6" alt="Node.js">
13
+ </a>
14
+ <a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm">
15
+ <img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-wasm?label=WASM&color=007ec6" alt="WASM">
16
+ </a>
17
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
18
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
19
+ </a>
20
+ <a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown">
21
+ <img src="https://img.shields.io/badge/Go-v2.19.0-007ec6" alt="Go">
22
+ </a>
23
+ <a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
24
+ <img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
25
+ </a>
26
+ <a href="https://packagist.org/packages/goldziher/html-to-markdown">
27
+ <img src="https://img.shields.io/packagist/v/goldziher/html-to-markdown?label=PHP&color=007ec6" alt="PHP">
28
+ </a>
29
+ <a href="https://rubygems.org/gems/html-to-markdown">
30
+ <img src="https://img.shields.io/gem/v/html-to-markdown?label=Ruby&color=007ec6" alt="Ruby">
31
+ </a>
32
+ <a href="https://hex.pm/packages/html_to_markdown">
33
+ <img src="https://img.shields.io/hexpm/v/html_to_markdown?label=Elixir&color=007ec6" alt="Elixir">
34
+ </a>
35
+
36
+ <!-- Project Info -->
37
+ <a href="https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE">
38
+ <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
39
+ </a>
40
+ </div>
41
+
42
+ <img width="1128" height="191" alt="html-to-markdown" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
43
+
44
+ <div align="center" style="margin-top: 20px;">
45
+ <a href="https://discord.gg/pXxagNK2zN">
46
+ <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
47
+ </a>
48
+ </div>
49
+
50
+
51
+ Blazing-fast HTML to Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages.
52
+ Ship identical Markdown across every runtime while enjoying native extension performance with Magnus bindings.
53
+
33
54
 
34
55
  ## Installation
35
56
 
36
57
  ```bash
37
- bundle add html-to-markdown
38
- # or
39
58
  gem install html-to-markdown
40
59
  ```
41
60
 
42
- Add the gem to your project and Bundler will compile the native Rust extension on first install.
43
61
 
44
- ### Requirements
45
62
 
46
- - Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
47
- - Rust toolchain **1.85+** with Cargo available on your `$PATH`
48
- - Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
63
+ Requires Ruby 3.2+ with Magnus native extension bindings. Published for Linux and macOS.
49
64
 
50
- **Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
51
65
 
52
- ```powershell
53
- ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
54
- ```
55
66
 
56
- This provides the standard headers (including `strings.h`) required for the bindgen step.
57
67
 
58
- ## Performance Snapshot
59
68
 
60
- Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
61
69
 
62
- | Document | Size | Latency | Throughput | Docs/sec |
63
- | ------------------- | ----- | ------- | ---------- | -------- |
64
- | Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
65
- | Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
66
- | Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
70
+ ## Performance Snapshot
67
71
 
68
- > Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
72
+ Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
69
73
 
70
- ### Benchmark Fixtures (Apple M4)
74
+ | Document | Size | Latency | Throughput |
75
+ | -------- | ---- | ------- | ---------- |
76
+ | Lists (Timeline) | 129KB | 0.71ms | 182 MB/s |
77
+ | Tables (Countries) | 360KB | 2.15ms | 167 MB/s |
78
+ | Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |
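The throughput column follows from the other two: size divided by latency. Below is a minimal Ruby sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), so KB per millisecond equals MB per second. The `ROWS` constant and the `throughput_mb_per_s` helper are illustrative names, not part of the gem.

```ruby
# Rows copied from the table above: [document, size in KB, latency in ms].
ROWS = [
  ['Lists (Timeline)',    129, 0.71],
  ['Tables (Countries)',  360, 2.15],
  ['Mixed (Python wiki)', 656, 4.89]
].freeze

# KB per millisecond is numerically equal to MB per second (decimal units).
def throughput_mb_per_s(size_kb, latency_ms)
  (size_kb / latency_ms).round
end

ROWS.each do |name, size_kb, latency_ms|
  puts format('%-20s %3d MB/s', name, throughput_mb_per_s(size_kb, latency_ms))
end
```

Rounded to whole numbers, the three results agree with the 182, 167, and 134 MB/s figures in the table.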
71
79
 
72
- Measured via `task bench:harness` with the shared Wikipedia + hOCR suite:
73
80
 
74
- | Document | Size | ops/sec (Ruby) |
75
- | ---------------------- | ------ | -------------- |
76
- | Lists (Timeline) | 129 KB | 3,156 |
77
- | Tables (Countries) | 360 KB | 921 |
78
- | Medium (Python) | 657 KB | 469 |
79
- | Large (Rust) | 567 KB | 534 |
80
- | Small (Intro) | 463 KB | 629 |
81
- | hOCR German PDF | 44 KB | 7,250 |
82
- | hOCR Invoice | 4 KB | 83,883 |
83
- | hOCR Embedded Tables | 37 KB | 7,890 |
81
+ See [Performance Guide](../../examples/performance/) for detailed benchmarks.
84
82
 
85
- > These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
86
83
 
87
84
  ## Quick Start
88
85
 
86
+ Basic conversion:
87
+
89
88
  ```ruby
90
89
  require 'html_to_markdown'
91
90
 
92
- html = <<~HTML
93
- <h1>Welcome</h1>
94
- <p>This is <strong>Rust-fast</strong> conversion!</p>
95
- <ul>
96
- <li>Native extension</li>
97
- <li>Identical output across languages</li>
98
- </ul>
99
- HTML
100
-
91
+ html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
101
92
  markdown = HtmlToMarkdown.convert(html)
102
- puts markdown
103
- # # Welcome
104
- #
105
- # This is **Rust-fast** conversion!
106
- #
107
- # - Native extension
108
- # - Identical output across languages
109
93
  ```
110
94
 
111
- ## API
112
95
 
113
- ### Conversion Options
114
96
 
115
- Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
97
+ With conversion options:
116
98
 
117
99
  ```ruby
118
100
  require 'html_to_markdown'
119
101
 
120
- markdown = HtmlToMarkdown.convert(
121
- '<pre><code class="language-ruby">puts "hi"</code></pre>',
122
- heading_style: :atx,
123
- code_block_style: :fenced,
124
- bullets: '*+-',
125
- list_indent_type: :spaces,
126
- list_indent_width: 2,
127
- whitespace_mode: :normalized,
128
- highlight_style: :double_equal
129
- )
130
-
131
- puts markdown
102
+ html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
103
+ markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
132
104
  ```
133
105
 
134
- ### Reusing Options
135
106
 
136
- If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
137
107
 
138
- ```ruby
139
- handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
140
108
 
141
- 100.times do
142
- HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
143
- end
144
- ```
145
109
 
146
- ### HTML Preprocessing
147
110
 
148
- Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
111
+ ## API Reference
149
112
 
150
- ```ruby
151
- require 'html_to_markdown'
113
+ ### Core Functions
152
114
 
153
- markdown = HtmlToMarkdown.convert(
154
- html,
155
- preprocessing: {
156
- enabled: true,
157
- preset: :aggressive, # :minimal, :standard, :aggressive
158
- remove_navigation: true,
159
- remove_forms: true
160
- }
161
- )
162
- ```
163
115
 
164
- ### Inline Images
116
+ **`convert(html, options: nil) -> String`**
165
117
 
166
- Extract inline binary data (data URIs, SVG) together with the converted Markdown.
118
+ Basic HTML-to-Markdown conversion. Fast and simple.
167
119
 
168
- ```ruby
169
- require 'html_to_markdown'
120
+ **`convert_with_metadata(html, options: nil, config: nil) -> [String, Hash]`**
170
121
 
171
- result = HtmlToMarkdown.convert_with_inline_images(
172
- '<img src="data:image/png;base64,iVBORw0..." alt="Pixel">',
173
- image_config: {
174
- max_decoded_size_bytes: 1 * 1024 * 1024,
175
- infer_dimensions: true,
176
- filename_prefix: 'img_',
177
- capture_svg: true
178
- }
179
- )
180
-
181
- puts result.markdown
182
- result.inline_images.each do |img|
183
- puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
184
- end
185
- ```
122
+ Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See [Metadata Extraction Guide](../../examples/metadata-extraction/).
186
123
 
187
- ### Metadata Extraction
124
+ **`convert_with_visitor(html, visitor:, options: nil) -> String`**
188
125
 
189
- Extract comprehensive metadata alongside Markdown conversion: document properties (title, description, author, language), social metadata (Open Graph, Twitter cards), heading hierarchy, link analysis (type classification, rel attributes), image metadata (dimensions, type detection), and structured data (JSON-LD, Microdata, RDFa).
126
+ Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
190
127
 
191
- #### Basic Usage
128
+ **`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
192
129
 
193
- ```ruby
194
- require 'html_to_markdown'
130
+ Extract base64-encoded inline images with metadata.
195
131
 
196
- html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
197
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
198
132
 
199
- puts markdown
200
- puts metadata[:document][:title] # "Test"
201
- puts metadata[:headers].length # 1
202
- ```
203
133
 
204
- #### With Conversion Options
134
+ ### Options
205
135
 
206
- ```ruby
207
- conv_opts = { heading_style: :atx_closed }
208
- metadata_opts = { extract_headers: true, extract_links: false }
209
-
210
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(
211
- html,
212
- conv_opts,
213
- metadata_opts
214
- )
215
- ```
136
+ **`ConversionOptions`** – Key configuration fields:
137
+ - `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
138
+ - `list_indent_width`: Spaces per indent level — default: `2`
139
+ - `bullets`: Bullet characters cycle — default: `"*+-"`
140
+ - `wrap`: Enable text wrapping — default: `false`
141
+ - `wrap_width`: Wrap at column — default: `80`
142
+ - `code_language`: Default fenced code block language — default: none
143
+ - `extract_metadata`: Embed metadata as YAML frontmatter — default: `false`
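To illustrate what a bullet "cycle" means, here is a small Ruby sketch. It assumes the converter advances one character of `bullets` per list nesting level and wraps around at the end of the string; that behaviour is an assumption made for illustration, and `bullet_for` is a hypothetical helper, not a gem API.

```ruby
# Hypothetical helper: pick the bullet for a nesting depth (0 = top level),
# wrapping around once the bullet string is exhausted.
def bullet_for(depth, bullets = '*+-')
  bullets[depth % bullets.length]
end

# Print one sample list item per nesting level, indented by two spaces
# per level (the default `list_indent_width` listed above).
4.times do |depth|
  puts "#{'  ' * depth}#{bullet_for(depth)} level #{depth}"
end
```

Under that assumption, the fourth nesting level reuses `*` with the default setting.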
216
144
 
217
- #### Full Example
145
+ **`MetadataConfig`** – Selective metadata extraction:
146
+ - `extract_headers`: h1-h6 elements — default: `true`
147
+ - `extract_links`: Hyperlinks — default: `true`
148
+ - `extract_images`: Image elements — default: `true`
149
+ - `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
150
+ - `max_structured_data_size`: Size limit in bytes — default: `100KB`
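One way to keep selective extraction explicit is to start from the documented defaults and override only what you want to turn off. The sketch below assumes the binding accepts a plain Ruby hash for the metadata configuration, as the 2.16.1 README did. `METADATA_DEFAULTS` and `metadata_config` are hypothetical names introduced here, not part of the gem.

```ruby
# Boolean defaults as listed above. `max_structured_data_size` is omitted
# on purpose so the library's own default applies.
METADATA_DEFAULTS = {
  extract_headers: true,
  extract_links: true,
  extract_images: true,
  extract_structured_data: true
}.freeze

# Hypothetical helper: merge overrides onto the defaults and reject
# misspelled keys early instead of silently ignoring them.
def metadata_config(**overrides)
  unknown = overrides.keys - METADATA_DEFAULTS.keys
  raise ArgumentError, "unknown metadata option(s): #{unknown.join(', ')}" unless unknown.empty?

  METADATA_DEFAULTS.merge(overrides)
end

config = metadata_config(extract_links: false, extract_structured_data: false)
p config
```

The resulting hash would then be handed to `convert_with_metadata` as its metadata configuration argument.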
218
151
 
219
- ```ruby
220
- require 'html_to_markdown'
221
152
 
222
- html = <<~HTML
223
- <html>
224
- <head>
225
- <title>Example</title>
226
- <meta name="description" content="Demo page">
227
- <link rel="canonical" href="https://example.com/page">
228
- <meta property="og:image" content="https://example.com/og.jpg">
229
- <meta name="twitter:card" content="summary_large_image">
230
- </head>
231
- <body>
232
- <h1 id="welcome">Welcome</h1>
233
- <a href="https://example.com" rel="nofollow external">Example link</a>
234
- <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
235
- <script type="application/ld+json">
236
- {"@context": "https://schema.org", "@type": "Article"}
237
- </script>
238
- </body>
239
- </html>
240
- HTML
241
-
242
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(
243
- html,
244
- { heading_style: :atx },
245
- { extract_links: true, extract_images: true, extract_headers: true, extract_structured_data: true }
246
- )
247
-
248
- puts markdown
249
- puts metadata[:document][:title] # "Example"
250
- puts metadata[:document][:description] # "Demo page"
251
- puts metadata[:document][:open_graph] # {"og:image" => "https://example.com/og.jpg"}
252
- puts metadata[:links].first[:rel] # ["nofollow", "external"]
253
- puts metadata[:images].first[:dimensions] # [640, 480]
254
- puts metadata[:headers].first[:id] # "welcome"
255
- ```
256
153
 
257
- #### Return Value Structure
154
+ ## Metadata Extraction
258
155
 
259
- Returns a 2-element array: `[markdown_string, metadata_hash]`
156
+ The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
260
157
 
261
- The metadata hash contains:
158
+ **Use Cases:**
159
+ - **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
160
+ - **Table of contents generation** – Build structured outlines from heading hierarchy
161
+ - **Content migration** – Document all external links and resources
162
+ - **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
163
+ - **Link validation** – Classify and validate anchor, internal, external, email, and phone links
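As a rough illustration of the audit and link-classification use cases, the following Ruby sketch works on a hand-written sample hash rather than real converter output. The keys (`:images` with `:src`/`:alt`, `:links` with `:href`/`:text`/`:link_type`) follow the return-value structure documented in the 2.16.1 README; that the same shape applies in 2.19.0 is an assumption.

```ruby
# Hand-written sample in the shape of the metadata hash (not real output).
metadata = {
  images: [
    { src: 'hero.jpg',   alt: 'Hero' },
    { src: 'spacer.gif', alt: nil },
    { src: 'chart.png',  alt: '' }
  ],
  links: [
    { href: 'https://example.com',   text: 'Example', link_type: 'external' },
    { href: '#top',                  text: '',        link_type: 'anchor' },
    { href: 'mailto:hi@example.com', text: 'Mail us', link_type: 'email' }
  ]
}

# Images whose alt text is missing or blank.
missing_alt = metadata[:images].select { |img| img[:alt].to_s.strip.empty? }

# Links with no visible text.
empty_links = metadata[:links].select { |link| link[:text].to_s.strip.empty? }

# Count of links per classification.
links_by_type = metadata[:links].group_by { |link| link[:link_type] }
                                .transform_values(&:count)

puts "Images without alt text: #{missing_alt.map { |img| img[:src] }.join(', ')}"
puts "Links without text: #{empty_links.map { |link| link[:href] }.join(', ')}"
puts "Links by type: #{links_by_type}"
```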
262
164
 
263
- ```ruby
264
- {
265
- document: {
266
- title: String?,
267
- description: String?,
268
- keywords: Array[String],
269
- author: String?,
270
- canonical_url: String?,
271
- base_href: String?,
272
- language: String?,
273
- text_direction: "ltr" | "rtl" | "auto" | nil,
274
- open_graph: Hash[String, String],
275
- twitter_card: Hash[String, String],
276
- meta_tags: Hash[String, String]
277
- },
278
- headers: [
279
- {
280
- level: Integer, # 1-6
281
- text: String,
282
- id: String?,
283
- depth: Integer,
284
- html_offset: Integer
285
- }
286
- ],
287
- links: [
288
- {
289
- href: String,
290
- text: String,
291
- title: String?,
292
- link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
293
- rel: Array[String],
294
- attributes: Hash[String, String]
295
- }
296
- ],
297
- images: [
298
- {
299
- src: String,
300
- alt: String?,
301
- title: String?,
302
- dimensions: [Integer, Integer]?,
303
- image_type: "data_uri" | "inline_svg" | "external" | "relative",
304
- attributes: Hash[String, String]
305
- }
306
- ],
307
- structured_data: [
308
- {
309
- data_type: "json_ld" | "microdata" | "rdfa",
310
- raw_json: String,
311
- schema_type: String?
312
- }
313
- ]
314
- }
315
- ```
165
+ **Zero Overhead When Disabled:** Metadata is collected during the same HTML parsing pass as the conversion, so enabling it adds negligible overhead. Disable unused metadata types in `MetadataConfig` to reduce it further.
316
166
 
317
- #### Metadata Configuration
167
+ ### Example: Quick Start
318
168
 
319
- Pass a hash with the following options to control which metadata types are extracted:
320
169
 
321
170
  ```ruby
322
- config = {
323
- extract_headers: true, # Extract h1-h6 elements (default: true)
324
- extract_links: true, # Extract <a> elements (default: true)
325
- extract_images: true, # Extract <img> elements (default: true)
326
- extract_structured_data: true, # Extract JSON-LD/Microdata/RDFa (default: true)
327
- max_structured_data_size: 1_000_000 # Max bytes for structured data (default: 1MB)
328
- }
329
-
330
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
331
- ```
332
-
333
- #### Features
334
-
335
- The Ruby binding provides comprehensive metadata extraction during HTML-to-Markdown conversion:
336
-
337
- - **Document Metadata**: title, description, keywords, author, canonical URL, language, text direction
338
- - **Open Graph & Twitter Card**: social media metadata extraction
339
- - **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
340
- - **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
341
- - **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
342
- - **Structured Data**: JSON-LD, Microdata, and RDFa extraction
343
-
344
- #### Type Safety with RBS
345
-
346
- All types are defined in RBS format in `sig/html_to_markdown.rbs`:
347
-
348
- - `document_metadata` - Document-level metadata structure
349
- - `header_metadata` - Individual header element
350
- - `link_metadata` - Individual link element
351
- - `image_metadata` - Individual image element
352
- - `structured_data` - Structured data block
353
- - `extended_metadata` - Complete metadata extraction result
171
+ require 'html_to_markdown'
354
172
 
355
- Uses strict RBS type checking with Steep for full type safety:
173
+ html = '<h1>Article</h1><img src="test.jpg" alt="test">'
174
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
356
175
 
357
- ```bash
358
- steep check
176
+ puts metadata[:document][:title] # Document title
177
+ puts metadata[:headers] # All h1-h6 elements
178
+ puts metadata[:links] # All hyperlinks
179
+ puts metadata[:images] # All images with alt text
180
+ puts metadata[:structured_data] # JSON-LD, Microdata, RDFa
359
181
  ```
360
182
 
361
- #### Implementation Architecture
362
183
 
363
- The Rust implementation uses a single-pass collector pattern for efficient metadata extraction:
364
184
 
365
- 1. **No duplication**: Core logic lives in Rust (`crates/html-to-markdown/src/metadata.rs`)
366
- 2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
367
- 3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
368
- 4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
185
+ For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
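A table of contents can be sketched from the `headers` array alone. The example below uses hand-written sample entries instead of real converter output. The `:level`, `:text`, and `:id` keys are taken from the header structure documented in the 2.16.1 README and are assumed to be unchanged; `build_toc` is a hypothetical helper, not a gem API.

```ruby
# Hand-written sample entries shaped like metadata[:headers].
headers = [
  { level: 1, text: 'Welcome',     id: 'welcome' },
  { level: 2, text: 'Install',     id: 'install' },
  { level: 3, text: 'From source', id: nil },
  { level: 2, text: 'Usage',       id: 'usage' }
]

# Hypothetical helper: render a nested Markdown list, indenting two spaces
# per heading level and linking to the heading's id when one is present.
def build_toc(headers)
  headers.map do |h|
    indent = '  ' * (h[:level] - 1)
    label = h[:id] ? "[#{h[:text]}](##{h[:id]})" : h[:text]
    "#{indent}- #{label}"
  end.join("\n")
end

toc = build_toc(headers)
puts toc
```

Headings without an `id` are listed as plain text, since there is no anchor to link to.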
369
186
 
370
- The metadata feature is gated by a Cargo feature in `Cargo.toml`:
371
187
 
372
- ```toml
373
- [features]
374
- metadata = ["html-to-markdown-rs/metadata"]
375
- ```
376
188
 
377
- This ensures:
378
- - Zero overhead when metadata is not needed
379
- - Clean integration with feature flag detection
380
- - Consistent with Python binding implementation
381
189
 
382
- #### Language Parity
190
+ ## Visitor Pattern
383
191
 
384
- Implements the same API as the Python binding:
192
+ The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.
385
193
 
386
- - Same method signature: `convert_with_metadata(html, options, metadata_config)`
387
- - Same return type: `[markdown, metadata_dict]`
388
- - Same metadata structures and field names
389
- - Same enum values (link_type, image_type, data_type, text_direction)
194
+ **Use Cases:**
195
+ - **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
196
+ - **Content filtering** – Remove tracking pixels, ads, or unwanted elements
197
+ - **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
198
+ - **Accessibility validation** – Check alt text, heading hierarchy, link text
199
+ - **Analytics** – Track element usage, link destinations, image sources
390
200
 
391
- Enables seamless migration and multi-language development.
201
+ **Supported Visitor Methods:** 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
392
202
 
393
- #### Performance
203
+ ### Example: Quick Start
394
204
 
395
- Single-pass collection during tree traversal:
396
- - No additional parsing passes
397
- - Minimal memory overhead
398
- - Configurable extraction granularity
399
- - Built-in size limits for safety
400
205
 
401
- #### Testing
206
+ ```ruby
207
+ require 'html_to_markdown'
402
208
 
403
- Comprehensive RSpec test suite in `spec/metadata_extraction_spec.rb`:
209
+ class MyVisitor
210
+ def visit_link(ctx, href, text, title = nil)
211
+ # Rewrite CDN URLs
212
+ if href.start_with?('https://old-cdn.com')
213
+ href = href.sub('https://old-cdn.com', 'https://new-cdn.com')
214
+ end
215
+ { type: :custom, output: "[#{text}](#{href})" }
216
+ end
217
+
218
+ def visit_image(ctx, src, alt = nil, title = nil)
219
+ # Skip tracking pixels
220
+ src.include?('tracking') ? { type: :skip } : { type: :continue }
221
+ end
222
+ end
404
223
 
405
- ```bash
406
- cd packages/ruby
407
- bundle exec rake compile -- --release --features metadata
408
- bundle exec rspec spec/metadata_extraction_spec.rb
224
+ html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
225
+ markdown = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
409
226
  ```
410
227
 
411
- Tests cover:
412
- - All metadata types extraction
413
- - Configuration flags
414
- - Edge cases (empty HTML, malformed input, special characters)
415
- - Return value structure validation
416
- - Integration with conversion options
417
228
 
418
- ## CLI
419
229
 
420
- The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
230
+ For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
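Because a visitor is a plain Ruby object, its callbacks can be exercised directly without loading the native extension. The sketch below covers the analytics use case. It reuses the `visit_link` and `visit_image` signatures and the `{ type: :continue }` / `{ type: :skip }` return values from the example above. The `LinkCollector` class is hypothetical, and passing `nil` as the context is an assumption made only for this standalone check.

```ruby
# Sketch of an analytics visitor: records link targets and leaves the
# default Markdown output untouched.
class LinkCollector
  attr_reader :hrefs

  def initialize
    @hrefs = []
  end

  # Record the link target, then let the converter render it as usual.
  def visit_link(_ctx, href, _text, _title = nil)
    @hrefs << href
    { type: :continue }
  end

  # Drop images whose source looks like a tracking pixel.
  def visit_image(_ctx, src, _alt = nil, _title = nil)
    src.include?('tracking') ? { type: :skip } : { type: :continue }
  end
end

collector = LinkCollector.new

# Call the callbacks by hand; the context argument is unused here.
link_result  = collector.visit_link(nil, 'https://example.com/file.pdf', 'Download')
pixel_result = collector.visit_image(nil, 'https://example.com/tracking/p.gif')
photo_result = collector.visit_image(nil, 'https://example.com/photo.jpg', 'Photo')

puts collector.hrefs.inspect
```

In real use, the same object would be passed as `visitor:` to `convert_with_visitor`, and `collector.hrefs` inspected afterwards.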
421
231
 
422
- ```ruby
423
- require 'html_to_markdown/cli'
424
232
 
425
- HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
426
- # => writes converted Markdown to STDOUT
427
- ```
428
233
 
429
- You can also call the CLI binary directly for scripting:
234
+ ## Examples
430
235
 
431
- ```ruby
432
- HtmlToMarkdown::CLIProxy.call(['--version'])
433
- # => "html-to-markdown 2.5.7"
434
- ```
236
+ - [Visitor Pattern Guide](../../examples/visitor-pattern/)
237
+ - [Metadata Extraction Guide](../../examples/metadata-extraction/)
238
+ - [Performance Guide](../../examples/performance/)
435
239
 
436
- Rebuild the CLI locally if you see `CLI binary not built` during tests:
240
+ ## Links
437
241
 
438
- ```bash
439
- bundle exec rake compile # builds the extension
440
- bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
441
- ```
442
-
443
- ## Error Handling
242
+ - **GitHub:** [github.com/kreuzberg-dev/html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown)
444
243
 
445
- Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
244
+ - **RubyGems:** [rubygems.org/gems/html-to-markdown](https://rubygems.org/gems/html-to-markdown)
446
245
 
447
- - `HtmlToMarkdown::CLIProxy::MissingBinaryError`
448
- - `HtmlToMarkdown::CLIProxy::CLIExecutionError`
246
+ - **Kreuzberg Ecosystem:** [kreuzberg.dev](https://kreuzberg.dev)
247
+ - **Discord:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
449
248
 
450
- Rescue them to provide clearer feedback in your application.
249
+ ## Contributing
451
250
 
452
- Inputs that look like binary data (e.g., PDF bytes coerced to a string) raise `HtmlToMarkdown::Error` with an
453
- `Invalid input` message.
251
+ We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/CONTRIBUTING.md) for details on:
454
252
 
455
- ## Consistent Across Languages
253
+ - Setting up the development environment
254
+ - Running tests locally
255
+ - Submitting pull requests
256
+ - Reporting issues
456
257
 
457
- The Ruby gem shares the exact Rust core with:
258
+ All contributions must follow our code quality standards (enforced via pre-commit hooks):
458
259
 
459
- - [Python wheels](https://pypi.org/project/html-to-markdown/)
460
- - [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
461
- - [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
462
- - The Rust crate and CLI
260
+ - Proper test coverage (Rust 95%+, language bindings 80%+)
261
+ - Formatting and linting checks
262
+ - Documentation for public APIs
463
263
 
464
- Use whichever runtime fits your stack while keeping formatting behaviour identical.
264
+ ## License
465
265
 
466
- ## Development
266
+ MIT License – see [LICENSE](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE).
467
267
 
468
- ```bash
469
- bundle exec rake compile # build the native extension
470
- bundle exec rspec # run test suite
471
- ```
268
+ ## Support
472
269
 
473
- The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
270
+ If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/kreuzberg-dev).
474
271
 
475
- ## License
272
+ Have questions or run into issues? We're here to help:
476
273
 
477
- MIT © Na'aman Hirschfeld
274
+ - **GitHub Issues:** [github.com/kreuzberg-dev/html-to-markdown/issues](https://github.com/kreuzberg-dev/html-to-markdown/issues)
275
+ - **Discussions:** [github.com/kreuzberg-dev/html-to-markdown/discussions](https://github.com/kreuzberg-dev/html-to-markdown/discussions)
276
+ - **Discord Community:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)