RubyGems - html-to-markdown - Versions diffs - 2.16.1 → 2.19.0 - Mend

html-to-markdown 2.16.1 → 2.19.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml +4 -4
data/Gemfile.lock +66 -11
data/README.md +168 -369
data/bin/benchmark.rb +100 -12
data/ext/html-to-markdown-rb/native/Cargo.toml +7 -5
data/ext/html-to-markdown-rb/native/README.md +5 -5
data/ext/html-to-markdown-rb/native/src/lib.rs +951 -0
data/ext/html-to-markdown-rb/native/src/profiling.rs +4 -0
data/html-to-markdown-rb.gemspec +6 -6
data/lib/html_to_markdown/version.rb +1 -1
data/sig/html_to_markdown.rbs +110 -0
data/spec/visitor_spec.rb +1149 -0
metadata +9 -8
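
To gauge the overall size of this release, the per-file counts in the list above can be totalled. The following is a minimal sketch: the `DIFFSTAT` constant and the `diffstat_totals` helper are hypothetical names introduced here for illustration, and the rows are copied verbatim from the list.

```ruby
# Per-file counts copied from the "Files changed" list above.
DIFFSTAT = <<~STAT
  checksums.yaml +4 -4
  data/Gemfile.lock +66 -11
  data/README.md +168 -369
  data/bin/benchmark.rb +100 -12
  data/ext/html-to-markdown-rb/native/Cargo.toml +7 -5
  data/ext/html-to-markdown-rb/native/README.md +5 -5
  data/ext/html-to-markdown-rb/native/src/lib.rs +951 -0
  data/ext/html-to-markdown-rb/native/src/profiling.rs +4 -0
  data/html-to-markdown-rb.gemspec +6 -6
  data/lib/html_to_markdown/version.rb +1 -1
  data/sig/html_to_markdown.rbs +110 -0
  data/spec/visitor_spec.rb +1149 -0
  metadata +9 -8
STAT

# Hypothetical helper: each row is "<path> +<added> -<removed>".
def diffstat_totals(text)
  rows = text.each_line.map(&:split).reject(&:empty?)
  {
    files: rows.size,
    added: rows.sum { |_path, added, _removed| added.to_i },
    removed: rows.sum { |_path, _added, removed| removed.to_i.abs }
  }
end

totals = diffstat_totals(DIFFSTAT)
puts "#{totals[:files]} files, +#{totals[:added]} -#{totals[:removed]}"
# prints "13 files, +2580 -421"
```

Most of the added lines come from two new files, `lib.rs` and `visitor_spec.rb`, which matches the visitor feature documented in the README diff below.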

data/README.md CHANGED

@@ -1,477 +1,276 @@
-# html-to-markdown-rb
-Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
-[![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg?logo=rust&label=crates.io)](https://crates.io/crates/html-to-markdown-rs)
-[![npm (node)](https://img.shields.io/npm/v/html-to-markdown-node.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-node)
-[![npm (wasm)](https://img.shields.io/npm/v/html-to-markdown-wasm.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-wasm)
-[![PyPI](https://img.shields.io/pypi/v/html-to-markdown.svg?logo=pypi)](https://pypi.org/project/html-to-markdown/)
-[![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
-[![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
-[![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
-[![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
-[![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
-[![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown)
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
-[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
-## Features
-- ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
-- 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
-- ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
-- 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
-- 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
-- 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.
-## Documentation & Support
-- [GitHub repository](https://github.com/Goldziher/html-to-markdown)
-- [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
-- [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
-- [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
+# html-to-markdown
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+  <!-- Language Bindings -->
+  <a href="https://crates.io/crates/html-to-markdown-rs">
+    <img src="https://img.shields.io/crates/v/html-to-markdown-rs?label=Rust&color=007ec6" alt="Rust">
+  </a>
+  <a href="https://pypi.org/project/html-to-markdown/">
+    <img src="https://img.shields.io/pypi/v/html-to-markdown?label=Python&color=007ec6" alt="Python">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-node">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-node?label=Node.js&color=007ec6" alt="Node.js">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/html-to-markdown-wasm">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/html-to-markdown-wasm?label=WASM&color=007ec6" alt="WASM">
+  </a>
+  <a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
+    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
+  </a>
+  <a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown">
+    <img src="https://img.shields.io/badge/Go-v2.19.0-007ec6" alt="Go">
+  </a>
+  <a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
+    <img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
+  </a>
+  <a href="https://packagist.org/packages/goldziher/html-to-markdown">
+    <img src="https://img.shields.io/packagist/v/goldziher/html-to-markdown?label=PHP&color=007ec6" alt="PHP">
+  </a>
+  <a href="https://rubygems.org/gems/html-to-markdown">
+    <img src="https://img.shields.io/gem/v/html-to-markdown?label=Ruby&color=007ec6" alt="Ruby">
+  </a>
+  <a href="https://hex.pm/packages/html_to_markdown">
+    <img src="https://img.shields.io/hexpm/v/html_to_markdown?label=Elixir&color=007ec6" alt="Elixir">
+  </a>
+  <!-- Project Info -->
+  <a href="https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE">
+    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
+  </a>
+</div>
+<img width="1128" height="191" alt="html-to-markdown" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
+<div align="center" style="margin-top: 20px;">
+  <a href="https://discord.gg/pXxagNK2zN">
+      <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+Blazing-fast HTML to Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages.
+Ship identical Markdown across every runtime while enjoying native extension performance with Magnus bindings.
 ## Installation
 ```bash
-bundle add html-to-markdown
-# or
 gem install html-to-markdown
 ```
-Add the gem to your project and Bundler will compile the native Rust extension on first install.
-### Requirements
-- Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
-- Rust toolchain **1.85+** with Cargo available on your `$PATH`
-- Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
+Requires Ruby 3.2+ with Magnus native extension bindings. Published for Linux, macOS.
-**Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
-```powershell
-ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
-```
-This provides the standard headers (including `strings.h`) required for the bindgen step.
-## Performance Snapshot
-Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
-| Document            | Size  | Latency | Throughput | Docs/sec |
-| ------------------- | ----- | ------- | ---------- | -------- |
-| Lists (Timeline)    | 129KB | 0.69ms  | 187 MB/s   | 1,450    |
-| Tables (Countries)  | 360KB | 2.19ms  | 164 MB/s   | 456      |
-| Mixed (Python wiki) | 656KB | 4.88ms  | 134 MB/s   | 205      |
+## Performance Snapshot
-> Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
+Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
-### Benchmark Fixtures (Apple M4)
+| Document | Size | Latency | Throughput |
+| -------- | ---- | ------- | ---------- |
+| Lists (Timeline) | 129KB | 0.71ms | 182 MB/s |
+| Tables (Countries) | 360KB | 2.15ms | 167 MB/s |
+| Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |
-Measured via `task bench:harness` with the shared Wikipedia + hOCR suite:
-| Document               | Size   | ops/sec (Ruby) |
-| ---------------------- | ------ | -------------- |
-| Lists (Timeline)       | 129 KB | 3,156          |
-| Tables (Countries)     | 360 KB | 921            |
-| Medium (Python)        | 657 KB | 469            |
-| Large (Rust)           | 567 KB | 534            |
-| Small (Intro)          | 463 KB | 629            |
-| hOCR German PDF        | 44 KB  | 7,250          |
-| hOCR Invoice           | 4 KB   | 83,883         |
-| hOCR Embedded Tables   | 37 KB  | 7,890          |
+See [Performance Guide](../../examples/performance/) for detailed benchmarks.
-> These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
 ## Quick Start
+Basic conversion:
 ```ruby
 require 'html_to_markdown'
-html = <<~HTML
-  <h1>Welcome</h1>
-  <p>This is <strong>Rust-fast</strong> conversion!</p>
-  <ul>
-    <li>Native extension</li>
-    <li>Identical output across languages</li>
-  </ul>
-HTML
+html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
 markdown = HtmlToMarkdown.convert(html)
-puts markdown
-# # Welcome
-#
-# This is **Rust-fast** conversion!
-#
-# - Native extension
-# - Identical output across languages
 ```
-## API
-### Conversion Options
-Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
+With conversion options:
 ```ruby
 require 'html_to_markdown'
-markdown = HtmlToMarkdown.convert(
-  '<pre><code class="language-ruby">puts "hi"</code></pre>',
-  heading_style: :atx,
-  code_block_style: :fenced,
-  bullets: '*+-',
-  list_indent_type: :spaces,
-  list_indent_width: 2,
-  whitespace_mode: :normalized,
-  highlight_style: :double_equal
-)
-puts markdown
+html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
+markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
 ```
-### Reusing Options
-If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
-```ruby
-handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
-100.times do
-  HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
-end
-```
-### HTML Preprocessing
-Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
+## API Reference
-```ruby
-require 'html_to_markdown'
+### Core Functions
-markdown = HtmlToMarkdown.convert(
-  html,
-  preprocessing: {
-    enabled: true,
-    preset: :aggressive, # :minimal, :standard, :aggressive
-    remove_navigation: true,
-    remove_forms: true
-  }
-)
-```
-### Inline Images
+**`convert(html, options: nil) -> String`**
-Extract inline binary data (data URIs, SVG) together with the converted Markdown.
+Basic HTML-to-Markdown conversion. Fast and simple.
-```ruby
-require 'html_to_markdown'
+**`convert_with_metadata(html, options: nil, config: nil) -> [String, Hash]`**
-result = HtmlToMarkdown.convert_with_inline_images(
-  '<img src="data:image/png;base64,iVBORw0..." alt="Pixel">',
-  image_config: {
-    max_decoded_size_bytes: 1 * 1024 * 1024,
-    infer_dimensions: true,
-    filename_prefix: 'img_',
-    capture_svg: true
-  }
-)
-puts result.markdown
-result.inline_images.each do |img|
-  puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
-end
-```
+Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See [Metadata Extraction Guide](../../examples/metadata-extraction/).
-### Metadata Extraction
+**`convert_with_visitor(html, visitor:, options: nil) -> String`**
-Extract comprehensive metadata alongside Markdown conversion: document properties (title, description, author, language), social metadata (Open Graph, Twitter cards), heading hierarchy, link analysis (type classification, rel attributes), image metadata (dimensions, type detection), and structured data (JSON-LD, Microdata, RDFa).
+Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
-#### Basic Usage
+**`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
-```ruby
-require 'html_to_markdown'
+Extract base64-encoded inline images with metadata.
-html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
-markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
-puts markdown
-puts metadata[:document][:title]  # "Test"
-puts metadata[:headers].length     # 1
-```
-#### With Conversion Options
+### Options
-```ruby
-conv_opts = { heading_style: :atx_closed }
-metadata_opts = { extract_headers: true, extract_links: false }
-markdown, metadata = HtmlToMarkdown.convert_with_metadata(
-  html,
-  conv_opts,
-  metadata_opts
-)
-```
+**`ConversionOptions`** – Key configuration fields:
+- `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
+- `list_indent_width`: Spaces per indent level — default: `2`
+- `bullets`: Bullet characters cycle — default: `"*+-"`
+- `wrap`: Enable text wrapping — default: `false`
+- `wrap_width`: Wrap at column — default: `80`
+- `code_language`: Default fenced code block language — default: none
+- `extract_metadata`: Embed metadata as YAML frontmatter — default: `false`
-#### Full Example
+**`MetadataConfig`** – Selective metadata extraction:
+- `extract_headers`: h1-h6 elements — default: `true`
+- `extract_links`: Hyperlinks — default: `true`
+- `extract_images`: Image elements — default: `true`
+- `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
+- `max_structured_data_size`: Size limit in bytes — default: `100KB`
-```ruby
-require 'html_to_markdown'
-html = <<~HTML
-  <html>
-    <head>
-      <title>Example</title>
-      <meta name="description" content="Demo page">
-      <link rel="canonical" href="https://example.com/page">
-      <meta property="og:image" content="https://example.com/og.jpg">
-      <meta name="twitter:card" content="summary_large_image">
-    </head>
-    <body>
-      <h1 id="welcome">Welcome</h1>
-      <a href="https://example.com" rel="nofollow external">Example link</a>
-      <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
-      <script type="application/ld+json">
-        {"@context": "https://schema.org", "@type": "Article"}
-      </script>
-    </body>
-  </html>
-HTML
-markdown, metadata = HtmlToMarkdown.convert_with_metadata(
-  html,
-  { heading_style: :atx },
-  { extract_links: true, extract_images: true, extract_headers: true, extract_structured_data: true }
-)
-puts markdown
-puts metadata[:document][:title]         # "Example"
-puts metadata[:document][:description]   # "Demo page"
-puts metadata[:document][:open_graph]    # {"og:image" => "https://example.com/og.jpg"}
-puts metadata[:links].first[:rel]        # ["nofollow", "external"]
-puts metadata[:images].first[:dimensions] # [640, 480]
-puts metadata[:headers].first[:id]       # "welcome"
-```
-#### Return Value Structure
+## Metadata Extraction
-Returns a 2-element array: `[markdown_string, metadata_hash]`
+The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
-The metadata hash contains:
+**Use Cases:**
+- **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
+- **Table of contents generation** – Build structured outlines from heading hierarchy
+- **Content migration** – Document all external links and resources
+- **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
+- **Link validation** – Classify and validate anchor, internal, external, email, and phone links
-```ruby
-{
-  document: {
-    title: String?,
-    description: String?,
-    keywords: Array[String],
-    author: String?,
-    canonical_url: String?,
-    base_href: String?,
-    language: String?,
-    text_direction: "ltr" | "rtl" | "auto" | nil,
-    open_graph: Hash[String, String],
-    twitter_card: Hash[String, String],
-    meta_tags: Hash[String, String]
-  },
-  headers: [
-    {
-      level: Integer,          # 1-6
-      text: String,
-      id: String?,
-      depth: Integer,
-      html_offset: Integer
-    }
-  ],
-  links: [
-    {
-      href: String,
-      text: String,
-      title: String?,
-      link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
-      rel: Array[String],
-      attributes: Hash[String, String]
-    }
-  ],
-  images: [
-    {
-      src: String,
-      alt: String?,
-      title: String?,
-      dimensions: [Integer, Integer]?,
-      image_type: "data_uri" | "inline_svg" | "external" | "relative",
-      attributes: Hash[String, String]
-    }
-  ],
-  structured_data: [
-    {
-      data_type: "json_ld" | "microdata" | "rdfa",
-      raw_json: String,
-      schema_type: String?
-    }
-  ]
-}
-```
+**Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Disable unused metadata types in `MetadataConfig` to optimize further.
-#### Metadata Configuration
+### Example: Quick Start
-Pass a hash with the following options to control which metadata types are extracted:
 ```ruby
-config = {
-  extract_headers: true,           # Extract h1-h6 elements (default: true)
-  extract_links: true,             # Extract <a> elements (default: true)
-  extract_images: true,            # Extract <img> elements (default: true)
-  extract_structured_data: true,   # Extract JSON-LD/Microdata/RDFa (default: true)
-  max_structured_data_size: 1_000_000  # Max bytes for structured data (default: 1MB)
-}
-markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
-```
-#### Features
-The Ruby binding provides comprehensive metadata extraction during HTML-to-Markdown conversion:
-- **Document Metadata**: title, description, keywords, author, canonical URL, language, text direction
-- **Open Graph & Twitter Card**: social media metadata extraction
-- **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
-- **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
-- **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
-- **Structured Data**: JSON-LD, Microdata, and RDFa extraction
-#### Type Safety with RBS
-All types are defined in RBS format in `sig/html_to_markdown.rbs`:
-- `document_metadata` - Document-level metadata structure
-- `header_metadata` - Individual header element
-- `link_metadata` - Individual link element
-- `image_metadata` - Individual image element
-- `structured_data` - Structured data block
-- `extended_metadata` - Complete metadata extraction result
+require 'html_to_markdown'
-Uses strict RBS type checking with Steep for full type safety:
+html = '<h1>Article</h1><img src="test.jpg" alt="test">'
+markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
-```bash
-steep check
+puts metadata[:document][:title]           # Document title
+puts metadata[:headers]                    # All h1-h6 elements
+puts metadata[:links]                      # All hyperlinks
+puts metadata[:images]                     # All images with alt text
+puts metadata[:structured_data]            # JSON-LD, Microdata, RDFa
 ```
-#### Implementation Architecture
-The Rust implementation uses a single-pass collector pattern for efficient metadata extraction:
-1. **No duplication**: Core logic lives in Rust (`crates/html-to-markdown/src/metadata.rs`)
-2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
-3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
-4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
+For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
-The metadata feature is gated by a Cargo feature in `Cargo.toml`:
-```toml
-[features]
-metadata = ["html-to-markdown-rs/metadata"]
-```
-This ensures:
-- Zero overhead when metadata is not needed
-- Clean integration with feature flag detection
-- Consistent with Python binding implementation
-#### Language Parity
+## Visitor Pattern
-Implements the same API as the Python binding:
+The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.
-- Same method signature: `convert_with_metadata(html, options, metadata_config)`
-- Same return type: `[markdown, metadata_dict]`
-- Same metadata structures and field names
-- Same enum values (link_type, image_type, data_type, text_direction)
+**Use Cases:**
+- **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
+- **Content filtering** – Remove tracking pixels, ads, or unwanted elements
+- **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
+- **Accessibility validation** – Check alt text, heading hierarchy, link text
+- **Analytics** – Track element usage, link destinations, image sources
-Enables seamless migration and multi-language development.
+**Supported Visitor Methods:** 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
-#### Performance
+### Example: Quick Start
-Single-pass collection during tree traversal:
-- No additional parsing passes
-- Minimal memory overhead
-- Configurable extraction granularity
-- Built-in size limits for safety
-#### Testing
+```ruby
+require 'html_to_markdown'
-Comprehensive RSpec test suite in `spec/metadata_extraction_spec.rb`:
+class MyVisitor
+  def visit_link(ctx, href, text, title = nil)
+    # Rewrite CDN URLs
+    if href.start_with?('https://old-cdn.com')
+      href = href.sub('https://old-cdn.com', 'https://new-cdn.com')
+    end
+    { type: :custom, output: "[#{text}](#{href})" }
+  end
+  def visit_image(ctx, src, alt = nil, title = nil)
+    # Skip tracking pixels
+    src.include?('tracking') ? { type: :skip } : { type: :continue }
+  end
+end
-```bash
-cd packages/ruby
-bundle exec rake compile -- --release --features metadata
-bundle exec rspec spec/metadata_extraction_spec.rb
+html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
+markdown = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
 ```
-Tests cover:
-- All metadata types extraction
-- Configuration flags
-- Edge cases (empty HTML, malformed input, special characters)
-- Return value structure validation
-- Integration with conversion options
-## CLI
-The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
+For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
-```ruby
-require 'html_to_markdown/cli'
-HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
-# => writes converted Markdown to STDOUT
-```
-You can also call the CLI binary directly for scripting:
+## Examples
-```ruby
-HtmlToMarkdown::CLIProxy.call(['--version'])
-# => "html-to-markdown 2.5.7"
-```
+- [Visitor Pattern Guide](../../examples/visitor-pattern/)
+- [Metadata Extraction Guide](../../examples/metadata-extraction/)
+- [Performance Guide](../../examples/performance/)
-Rebuild the CLI locally if you see `CLI binary not built` during tests:
+## Links
-```bash
-bundle exec rake compile          # builds the extension
-bundle exec ruby scripts/prepare_ruby_gem.rb  # copies the CLI into lib/bin/
-```
-## Error Handling
+- **GitHub:** [github.com/kreuzberg-dev/html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown)
-Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
+- **RubyGems:** [rubygems.org/gems/html-to-markdown](https://rubygems.org/gems/html-to-markdown)
-- `HtmlToMarkdown::CLIProxy::MissingBinaryError`
-- `HtmlToMarkdown::CLIProxy::CLIExecutionError`
+- **Kreuzberg Ecosystem:** [kreuzberg.dev](https://kreuzberg.dev)
+- **Discord:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
-Rescue them to provide clearer feedback in your application.
+## Contributing
-Inputs that look like binary data (e.g., PDF bytes coerced to a string) raise `HtmlToMarkdown::Error` with an
-`Invalid input` message.
+We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/CONTRIBUTING.md) for details on:
-## Consistent Across Languages
+- Setting up the development environment
+- Running tests locally
+- Submitting pull requests
+- Reporting issues
-The Ruby gem shares the exact Rust core with:
+All contributions must follow our code quality standards (enforced via pre-commit hooks):
-- [Python wheels](https://pypi.org/project/html-to-markdown/)
-- [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
-- [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
-- The Rust crate and CLI
+- Proper test coverage (Rust 95%+, language bindings 80%+)
+- Formatting and linting checks
+- Documentation for public APIs
-Use whichever runtime fits your stack while keeping formatting behaviour identical.
+## License
-## Development
+MIT License – see [LICENSE](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE).
-```bash
-bundle exec rake compile   # build the native extension
-bundle exec rspec          # run test suite
-```
+## Support
-The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
+If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/kreuzberg-dev).
-## License
+Have questions or run into issues? We're here to help:
-MIT © Na'aman Hirschfeld
+- **GitHub Issues:** [github.com/kreuzberg-dev/html-to-markdown/issues](https://github.com/kreuzberg-dev/html-to-markdown/issues)
+- **Discussions:** [github.com/kreuzberg-dev/html-to-markdown/discussions](https://github.com/kreuzberg-dev/html-to-markdown/discussions)
+- **Discord Community:** [discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
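
The visitor callbacks shown in the new README (`visit_link`, `visit_image`) are ordinary Ruby methods that return hashes, so their logic can be checked without loading the native extension. The sketch below mirrors the README's `MyVisitor` example. It assumes the callback signatures and the `:custom` / `:skip` / `:continue` result types are exactly as the README shows. The class name `CdnRewritingVisitor` is hypothetical, and `nil` stands in for the context argument, whose real shape is not documented in this diff.

```ruby
# Plain-Ruby visitor modelled on the README's MyVisitor example.
# Assumption: the gem calls these methods with (ctx, ...) as shown in the README.
class CdnRewritingVisitor
  OLD_CDN = 'https://old-cdn.com'
  NEW_CDN = 'https://new-cdn.com'

  # Rewrite links that point at the old CDN and emit custom Markdown.
  def visit_link(_ctx, href, text, _title = nil)
    href = href.sub(OLD_CDN, NEW_CDN) if href.start_with?(OLD_CDN)
    { type: :custom, output: "[#{text}](#{href})" }
  end

  # Drop images whose URL looks like a tracking pixel; leave others untouched.
  def visit_image(_ctx, src, _alt = nil, _title = nil)
    src.include?('tracking') ? { type: :skip } : { type: :continue }
  end
end

visitor = CdnRewritingVisitor.new

# Call the callbacks directly; nil is a placeholder for the context argument.
link_result = visitor.visit_link(nil, 'https://old-cdn.com/file.pdf', 'Download')
puts link_result[:output]
# prints "[Download](https://new-cdn.com/file.pdf)"

p visitor.visit_image(nil, 'https://example.com/tracking.gif')
p visitor.visit_image(nil, 'https://example.com/photo.jpg', 'A photo')
```

With the gem installed, the same object would presumably be passed as `HtmlToMarkdown.convert_with_visitor(html, visitor: CdnRewritingVisitor.new)`, as in the README example.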