html-to-markdown 2.6.5 → 2.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 32f9140116181176bc78edbbe936ecbb679656734cc205e46a2f1e0717068f72
4
- data.tar.gz: 3eaaa3b6bc8332df402a2f5405460d28d5176edb18216b7afaa7cae634ce64d7
3
+ metadata.gz: 56ca0b6b6d1c9e67dadddfa1865e693bcf986859cc203f7987a5f787203ff40f
4
+ data.tar.gz: 932a473d64548a6d976c452b4d82a357e10709f7975a43e5d699e543a9d3a372
5
5
  SHA512:
6
- metadata.gz: c9fca57c280d7c413b9789a550e598741bda9151bf02e5991b1ffa8b42a1f75940621d62e75b0eb13c24646c22e71fe316cbd9ebc77324f7486d6a8773fcc456
7
- data.tar.gz: 162a253371ec62b6854173eb1559986b127a16fb5655cf2c26e2ef6ae5f4e80bc1f222bdfe6d8ada87c53871f24982c35f24606bc1fcf0a335d4dcd2e9566fba
6
+ metadata.gz: 0e36518ae77b2c25f0ac0c26686524d152c687a026fc0f2f4c7b3ed03bd219cc8a5a799ddb8f9b698a5d3e64db7d05ac7d936b535b797aafd806a276af97e993
7
+ data.tar.gz: 31ab40f1ff1daae0e2b06e485e3e85d15f59fca17c07a8b6b64785b2326e9a6d208381183b043dcac92e01c83fd98b6a8607438faffb2b1e01d4953cb96e19f0
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- html-to-markdown (2.6.5)
4
+ html-to-markdown (2.7.0)
5
5
  rb_sys (>= 0.9, < 1.0)
6
6
 
7
7
  GEM
@@ -50,7 +50,7 @@ GEM
50
50
  rubocop-ast (>= 1.47.1, < 2.0)
51
51
  ruby-progressbar (~> 1.7)
52
52
  unicode-display_width (>= 2.4.0, < 4.0)
53
- rubocop-ast (1.47.1)
53
+ rubocop-ast (1.48.0)
54
54
  parser (>= 3.3.7.2)
55
55
  prism (~> 1.4)
56
56
  rubocop-rspec (3.7.0)
data/README.md CHANGED
@@ -1,11 +1,10 @@
1
1
  # html-to-markdown-rb
2
2
 
3
- Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, and WebAssembly packages. Ship identical Markdown across every runtime while enjoying native extension performance.
3
+ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
4
4
 
5
5
  [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
6
6
  [![npm (node)](https://badge.fury.io/js/html-to-markdown-node.svg)](https://www.npmjs.com/package/html-to-markdown-node)
7
7
  [![npm (wasm)](https://badge.fury.io/js/html-to-markdown-wasm.svg)](https://www.npmjs.com/package/html-to-markdown-wasm)
8
- [![npm (typescript)](https://badge.fury.io/js/html-to-markdown.svg)](https://www.npmjs.com/package/html-to-markdown)
9
8
  [![PyPI](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
10
9
  [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
11
10
  [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
@@ -14,7 +13,7 @@ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust eng
14
13
  ## Features
15
14
 
16
15
  - ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
17
- - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, WASM package, and CLI — consistent Markdown everywhere.
16
+ - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
18
17
  - ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
19
18
  - 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
20
19
  - 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
@@ -110,6 +109,18 @@ markdown = HtmlToMarkdown.convert(
110
109
  puts markdown
111
110
  ```
112
111
 
112
+ ### Reusing Options
113
+
114
+ If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
115
+
116
+ ```ruby
117
+ handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
118
+
119
+ 100.times do
120
+ HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
121
+ end
122
+ ```
123
+
113
124
  ### HTML Preprocessing
114
125
 
115
126
  Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
data/bin/benchmark.rb ADDED
@@ -0,0 +1,94 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require 'optparse'
5
+ require 'time'
6
+
7
+ $LOAD_PATH.unshift(File.expand_path('../lib', __dir__))
8
+ require 'html_to_markdown'
9
+
10
+ def json_escape(value)
11
+ value.to_s.gsub(/["\\\n\r]/) do |char|
12
+ case char
13
+ when '"', '\\'
14
+ "\\#{char}"
15
+ when "\n"
16
+ '\\n'
17
+ when "\r"
18
+ '\\r'
19
+ end
20
+ end
21
+ end
22
+
23
+ options = {
24
+ iterations: 50,
25
+ format: 'html'
26
+ }
27
+
28
+ OptionParser.new do |parser|
29
+ parser.banner = 'ruby benchmark.rb --file path/to/fixture.html [--iterations 200]'
30
+
31
+ parser.on('--file FILE', 'HTML fixture to convert repeatedly') do |file|
32
+ options[:file] = file
33
+ end
34
+
35
+ parser.on('--iterations N', Integer, 'Number of conversion iterations (default: 50)') do |n|
36
+ options[:iterations] = n.positive? ? n : 1
37
+ end
38
+
39
+ parser.on('--format FORMAT', 'Fixture format (html or hocr)') do |format|
40
+ options[:format] = format.downcase
41
+ end
42
+ end.parse!
43
+
44
+ fixture = options.fetch(:file) do
45
+ warn 'Missing --file parameter'
46
+ exit 1
47
+ end
48
+
49
+ unless File.exist?(fixture)
50
+ warn "Fixture not found: #{fixture}"
51
+ exit 1
52
+ end
53
+
54
+ unless %w[html hocr].include?(options[:format])
55
+ warn "Unsupported format: #{options[:format]}"
56
+ exit 1
57
+ end
58
+
59
+ html = File.binread(fixture)
60
+ html.force_encoding(Encoding::UTF_8)
61
+ html.freeze
62
+ iterations = options[:iterations]
63
+ options_handle = HtmlToMarkdown.options(
64
+ options[:format] == 'hocr' ? { hocr_spatial_tables: false } : nil
65
+ )
66
+
67
+ def convert_document(html, options_handle)
68
+ HtmlToMarkdown.convert_with_options(html, options_handle)
69
+ end
70
+
71
+ convert_document(html, options_handle)
72
+
73
+ start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
74
+ iterations.times { convert_document(html, options_handle) }
75
+ elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
76
+
77
+ payload_size_bytes = html.bytesize
78
+ bytes_processed = payload_size_bytes * iterations
79
+ ops_per_sec = iterations / elapsed
80
+ mb_per_sec = (bytes_processed.to_f / (1024 * 1024)) / elapsed
81
+
82
+ payload = %({
83
+ "language":"ruby",
84
+ "fixture":"#{json_escape(File.basename(fixture))}",
85
+ "fixture_path":"#{json_escape(fixture)}",
86
+ "iterations":#{iterations},
87
+ "elapsed_seconds":#{format('%.8f', elapsed)},
88
+ "ops_per_sec":#{format('%.4f', ops_per_sec)},
89
+ "mb_per_sec":#{format('%.4f', mb_per_sec)},
90
+ "bytes_processed":#{bytes_processed},
91
+ "payload_size_bytes":#{payload_size_bytes}
92
+ })
93
+
94
+ puts payload.strip
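Reviewer note: the measurement pattern in `bin/benchmark.rb` can be tried without the native extension. The block below is a minimal sketch under stated assumptions, not the gem's script. It times a stand-in converter with the monotonic clock and emits the result through the stdlib `JSON.generate` instead of the hand-written `json_escape` helper, which only escapes quotes, backslashes, `\n` and `\r`. The names `measure` and `fake_convert` are hypothetical and introduced here for illustration.

```ruby
# frozen_string_literal: true

require 'json'

# Times `iterations` calls of `converter` on `input` and returns throughput stats.
# `converter` is any callable; in the real script it would wrap
# HtmlToMarkdown.convert_with_options (assumption: not exercised here).
def measure(input, iterations, converter)
  converter.call(input) # warm-up call, excluded from the timed loop

  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { converter.call(input) }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  elapsed = Float::EPSILON if elapsed <= 0 # guard against a zero reading on coarse clocks

  bytes_processed = input.bytesize * iterations
  {
    iterations: iterations,
    elapsed_seconds: elapsed,
    ops_per_sec: iterations / elapsed,
    mb_per_sec: (bytes_processed.to_f / (1024 * 1024)) / elapsed,
    bytes_processed: bytes_processed,
    payload_size_bytes: input.bytesize
  }
end

# Hypothetical stand-in for the native converter so the sketch runs anywhere.
fake_convert = ->(html) { html.gsub(/<[^>]+>/, '') }

stats = measure('<h1>Handles</h1>', 50, fake_convert)
puts JSON.generate(stats)
```

Serialising a Hash with `JSON.generate` also covers tabs and other control characters, which the manual escaper in the script leaves unescaped.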
data/ext/html-to-markdown-rb/extconf.rb CHANGED
@@ -24,5 +24,5 @@ default_profile = ENV.fetch('CARGO_PROFILE', 'release')
24
24
 
25
25
  create_rust_makefile('html_to_markdown_rb') do |config|
26
26
  config.profile = default_profile.to_sym
27
- config.ext_dir = '../../../../crates/html-to-markdown-rb'
27
+ config.ext_dir = File.expand_path('native', __dir__)
28
28
  end
data/ext/html-to-markdown-rb/native/Cargo.toml ADDED
@@ -0,0 +1,28 @@
1
+ [package]
2
+ name = "html-to-markdown-rb"
3
+ version = "2.7.0"
4
+ edition.workspace = true
5
+ authors = ["Na'aman Hirschfeld <nhirschfeld@gmail.com>"]
6
+ license = "MIT"
7
+ repository = "https://github.com/Goldziher/html-to-markdown"
8
+ homepage = "https://github.com/Goldziher/html-to-markdown"
9
+ documentation = "https://docs.rs/html-to-markdown-rs"
10
+ readme = "README.md"
11
+ rust-version.workspace = true
12
+ description = "Ruby bindings (Magnus) for html-to-markdown - high-performance HTML to Markdown converter"
13
+ keywords = ["html", "markdown", "ruby", "magnus", "bindings"]
14
+ categories = ["api-bindings"]
15
+
16
+ [lib]
17
+ name = "html_to_markdown_rb"
18
+ crate-type = ["cdylib", "rlib"]
19
+
20
+ [features]
21
+ default = []
22
+
23
+ [dependencies]
24
+ html-to-markdown-rs = { version = "2.7.0", features = ["inline-images"] }
25
+ magnus = { git = "https://github.com/matsadler/magnus", rev = "f6db11769efb517427bf7f121f9c32e18b059b38", features = ["rb-sys"] }
26
+
27
+ [dev-dependencies]
28
+ pretty_assertions = "1.4"
data/ext/html-to-markdown-rb/native/README.md ADDED
@@ -0,0 +1,209 @@
1
+ # html-to-markdown-rb
2
+
3
+ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
4
+
5
+ [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
6
+ [![npm (node)](https://badge.fury.io/js/html-to-markdown-node.svg)](https://www.npmjs.com/package/html-to-markdown-node)
7
+ [![npm (wasm)](https://badge.fury.io/js/html-to-markdown-wasm.svg)](https://www.npmjs.com/package/html-to-markdown-wasm)
8
+ [![PyPI](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
9
+ [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
10
+ [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
11
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
12
+
13
+ ## Features
14
+
15
+ - ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
16
+ - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
17
+ - ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
18
+ - 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
19
+ - 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
20
+ - 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.
21
+
22
+ ## Documentation & Support
23
+
24
+ - [GitHub repository](https://github.com/Goldziher/html-to-markdown)
25
+ - [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
26
+ - [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
27
+ - [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ bundle add html-to-markdown
33
+ # or
34
+ gem install html-to-markdown
35
+ ```
36
+
37
+ Add the gem to your project and Bundler will compile the native Rust extension on first install.
38
+
39
+ ### Requirements
40
+
41
+ - Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
42
+ - Rust toolchain **1.85+** with Cargo available on your `$PATH`
43
+ - Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
44
+
45
+ **Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
46
+
47
+ ```powershell
48
+ ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
49
+ ```
50
+
51
+ This provides the standard headers (including `strings.h`) required for the bindgen step.
52
+
53
+ ## Performance Snapshot
54
+
55
+ Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
56
+
57
+ | Document | Size | Latency | Throughput | Docs/sec |
58
+ | ------------------- | ----- | ------- | ---------- | -------- |
59
+ | Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
60
+ | Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
61
+ | Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
62
+
63
+ > Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
64
+
65
+ ## Quick Start
66
+
67
+ ```ruby
68
+ require 'html_to_markdown'
69
+
70
+ html = <<~HTML
71
+ <h1>Welcome</h1>
72
+ <p>This is <strong>Rust-fast</strong> conversion!</p>
73
+ <ul>
74
+ <li>Native extension</li>
75
+ <li>Identical output across languages</li>
76
+ </ul>
77
+ HTML
78
+
79
+ markdown = HtmlToMarkdown.convert(html)
80
+ puts markdown
81
+ # # Welcome
82
+ #
83
+ # This is **Rust-fast** conversion!
84
+ #
85
+ # - Native extension
86
+ # - Identical output across languages
87
+ ```
88
+
89
+ ## API
90
+
91
+ ### Conversion Options
92
+
93
+ Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
94
+
95
+ ```ruby
96
+ require 'html_to_markdown'
97
+
98
+ markdown = HtmlToMarkdown.convert(
99
+ '<pre><code class="language-ruby">puts "hi"</code></pre>',
100
+ heading_style: :atx,
101
+ code_block_style: :fenced,
102
+ bullets: '*+-',
103
+ list_indent_type: :spaces,
104
+ list_indent_width: 2,
105
+ whitespace_mode: :normalized,
106
+ highlight_style: :double_equal
107
+ )
108
+
109
+ puts markdown
110
+ ```
111
+
112
+ ### HTML Preprocessing
113
+
114
+ Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
115
+
116
+ ```ruby
117
+ require 'html_to_markdown'
118
+
119
+ markdown = HtmlToMarkdown.convert(
120
+ html,
121
+ preprocessing: {
122
+ enabled: true,
123
+ preset: :aggressive, # :minimal, :standard, :aggressive
124
+ remove_navigation: true,
125
+ remove_forms: true
126
+ }
127
+ )
128
+ ```
129
+
130
+ ### Inline Images
131
+
132
+ Extract inline binary data (data URIs, SVG) together with the converted Markdown.
133
+
134
+ ```ruby
135
+ require 'html_to_markdown'
136
+
137
+ result = HtmlToMarkdown.convert_with_inline_images(
138
+ '<img src="..." alt="Pixel">',
139
+ image_config: {
140
+ max_decoded_size_bytes: 1 * 1024 * 1024,
141
+ infer_dimensions: true,
142
+ filename_prefix: 'img_',
143
+ capture_svg: true
144
+ }
145
+ )
146
+
147
+ puts result.markdown
148
+ result.inline_images.each do |img|
149
+ puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
150
+ end
151
+ ```
152
+
153
+ ## CLI
154
+
155
+ The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
156
+
157
+ ```ruby
158
+ require 'html_to_markdown/cli'
159
+
160
+ HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
161
+ # => writes converted Markdown to STDOUT
162
+ ```
163
+
164
+ You can also call the CLI binary directly for scripting:
165
+
166
+ ```ruby
167
+ HtmlToMarkdown::CLIProxy.call(['--version'])
168
+ # => "html-to-markdown 2.5.7"
169
+ ```
170
+
171
+ Rebuild the CLI locally if you see `CLI binary not built` during tests:
172
+
173
+ ```bash
174
+ bundle exec rake compile # builds the extension
175
+ bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
176
+ ```
177
+
178
+ ## Error Handling
179
+
180
+ Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
181
+
182
+ - `HtmlToMarkdown::CLIProxy::MissingBinaryError`
183
+ - `HtmlToMarkdown::CLIProxy::CLIExecutionError`
184
+
185
+ Rescue them to provide clearer feedback in your application.
186
+
187
+ ## Consistent Across Languages
188
+
189
+ The Ruby gem shares the exact Rust core with:
190
+
191
+ - [Python wheels](https://pypi.org/project/html-to-markdown/)
192
+ - [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
193
+ - [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
194
+ - The Rust crate and CLI
195
+
196
+ Use whichever runtime fits your stack while keeping formatting behaviour identical.
197
+
198
+ ## Development
199
+
200
+ ```bash
201
+ bundle exec rake compile # build the native extension
202
+ bundle exec rspec # run test suite
203
+ ```
204
+
205
+ The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
206
+
207
+ ## License
208
+
209
+ MIT © Na'aman Hirschfeld
data/ext/html-to-markdown-rb/native/extconf.rb ADDED
@@ -0,0 +1,3 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative '../extconf'
data/ext/html-to-markdown-rb/native/src/lib.rs ADDED
@@ -0,0 +1,432 @@
1
+ use html_to_markdown_rs::{
2
+ CodeBlockStyle, ConversionOptions, HeadingStyle, HighlightStyle, HtmlExtraction, InlineImage, InlineImageConfig,
3
+ InlineImageFormat, InlineImageSource, InlineImageWarning, ListIndentType, NewlineStyle, PreprocessingOptions,
4
+ PreprocessingPreset, WhitespaceMode, convert as convert_inner,
5
+ convert_with_inline_images as convert_with_inline_images_inner, error::ConversionError,
6
+ };
7
+ use magnus::prelude::*;
8
+ use magnus::r_hash::ForEach;
9
+ use magnus::{Error, RArray, RHash, Ruby, Symbol, TryConvert, Value, function, scan_args::scan_args};
10
+
11
+ #[derive(Clone)]
12
+ #[magnus::wrap(class = "HtmlToMarkdown::Options", free_immediately)]
13
+ struct OptionsHandle(ConversionOptions);
14
+
15
+ const DEFAULT_INLINE_IMAGE_LIMIT: u64 = 5 * 1024 * 1024;
16
+
17
+ fn conversion_error(err: ConversionError) -> Error {
18
+ match err {
19
+ ConversionError::ConfigError(msg) => arg_error(msg),
20
+ other => runtime_error(other.to_string()),
21
+ }
22
+ }
23
+
24
+ fn arg_error(message: impl Into<String>) -> Error {
25
+ let ruby = Ruby::get().expect("Ruby not initialised");
26
+ Error::new(ruby.exception_arg_error(), message.into())
27
+ }
28
+
29
+ fn runtime_error(message: impl Into<String>) -> Error {
30
+ let ruby = Ruby::get().expect("Ruby not initialised");
31
+ Error::new(ruby.exception_runtime_error(), message.into())
32
+ }
33
+
34
+ fn symbol_to_string(value: Value) -> Result<String, Error> {
35
+ if let Some(symbol) = Symbol::from_value(value) {
36
+ Ok(symbol.name()?.to_string())
37
+ } else {
38
+ String::try_convert(value)
39
+ }
40
+ }
41
+
42
+ fn parse_heading_style(value: Value) -> Result<HeadingStyle, Error> {
43
+ match symbol_to_string(value)?.as_str() {
44
+ "underlined" => Ok(HeadingStyle::Underlined),
45
+ "atx" => Ok(HeadingStyle::Atx),
46
+ "atx_closed" => Ok(HeadingStyle::AtxClosed),
47
+ other => Err(arg_error(format!("invalid heading_style: {other}"))),
48
+ }
49
+ }
50
+
51
+ fn parse_list_indent_type(value: Value) -> Result<ListIndentType, Error> {
52
+ match symbol_to_string(value)?.as_str() {
53
+ "spaces" => Ok(ListIndentType::Spaces),
54
+ "tabs" => Ok(ListIndentType::Tabs),
55
+ other => Err(arg_error(format!("invalid list_indent_type: {other}"))),
56
+ }
57
+ }
58
+
59
+ fn parse_highlight_style(value: Value) -> Result<HighlightStyle, Error> {
60
+ match symbol_to_string(value)?.as_str() {
61
+ "double_equal" => Ok(HighlightStyle::DoubleEqual),
62
+ "html" => Ok(HighlightStyle::Html),
63
+ "bold" => Ok(HighlightStyle::Bold),
64
+ "none" => Ok(HighlightStyle::None),
65
+ other => Err(arg_error(format!("invalid highlight_style: {other}"))),
66
+ }
67
+ }
68
+
69
+ fn parse_whitespace_mode(value: Value) -> Result<WhitespaceMode, Error> {
70
+ match symbol_to_string(value)?.as_str() {
71
+ "normalized" => Ok(WhitespaceMode::Normalized),
72
+ "strict" => Ok(WhitespaceMode::Strict),
73
+ other => Err(arg_error(format!("invalid whitespace_mode: {other}"))),
74
+ }
75
+ }
76
+
77
+ fn parse_newline_style(value: Value) -> Result<NewlineStyle, Error> {
78
+ match symbol_to_string(value)?.as_str() {
79
+ "spaces" => Ok(NewlineStyle::Spaces),
80
+ "backslash" => Ok(NewlineStyle::Backslash),
81
+ other => Err(arg_error(format!("invalid newline_style: {other}"))),
82
+ }
83
+ }
84
+
85
+ fn parse_code_block_style(value: Value) -> Result<CodeBlockStyle, Error> {
86
+ match symbol_to_string(value)?.as_str() {
87
+ "indented" => Ok(CodeBlockStyle::Indented),
88
+ "backticks" => Ok(CodeBlockStyle::Backticks),
89
+ "tildes" => Ok(CodeBlockStyle::Tildes),
90
+ other => Err(arg_error(format!("invalid code_block_style: {other}"))),
91
+ }
92
+ }
93
+
94
+ fn parse_preset(value: Value) -> Result<PreprocessingPreset, Error> {
95
+ match symbol_to_string(value)?.as_str() {
96
+ "minimal" => Ok(PreprocessingPreset::Minimal),
97
+ "standard" => Ok(PreprocessingPreset::Standard),
98
+ "aggressive" => Ok(PreprocessingPreset::Aggressive),
99
+ other => Err(arg_error(format!("invalid preprocessing preset: {other}"))),
100
+ }
101
+ }
102
+
103
+ fn parse_vec_of_strings(value: Value) -> Result<Vec<String>, Error> {
104
+ let array = RArray::from_value(value).ok_or_else(|| arg_error("expected an Array of strings"))?;
105
+
106
+ array.to_vec::<String>()
107
+ }
108
+
109
+ fn parse_preprocessing_options(_ruby: &Ruby, value: Value) -> Result<PreprocessingOptions, Error> {
110
+ let hash = RHash::from_value(value).ok_or_else(|| arg_error("expected preprocessing to be a Hash"))?;
111
+
112
+ let mut opts = PreprocessingOptions::default();
113
+
114
+ hash.foreach(|key: Value, val: Value| {
115
+ let key_name = symbol_to_string(key)?;
116
+ match key_name.as_str() {
117
+ "enabled" => {
118
+ opts.enabled = bool::try_convert(val)?;
119
+ }
120
+ "preset" => {
121
+ opts.preset = parse_preset(val)?;
122
+ }
123
+ "remove_navigation" => {
124
+ opts.remove_navigation = bool::try_convert(val)?;
125
+ }
126
+ "remove_forms" => {
127
+ opts.remove_forms = bool::try_convert(val)?;
128
+ }
129
+ _ => {}
130
+ }
131
+ Ok(ForEach::Continue)
132
+ })?;
133
+
134
+ Ok(opts)
135
+ }
136
+
137
+ fn build_conversion_options(ruby: &Ruby, options: Option<Value>) -> Result<ConversionOptions, Error> {
138
+ let mut opts = ConversionOptions::default();
139
+
140
+ let Some(options) = options else {
141
+ return Ok(opts);
142
+ };
143
+
144
+ if options.is_nil() {
145
+ return Ok(opts);
146
+ }
147
+
148
+ let hash = RHash::from_value(options).ok_or_else(|| arg_error("options must be provided as a Hash"))?;
149
+
150
+ hash.foreach(|key: Value, val: Value| {
151
+ let key_name = symbol_to_string(key)?;
152
+ match key_name.as_str() {
153
+ "heading_style" => {
154
+ opts.heading_style = parse_heading_style(val)?;
155
+ }
156
+ "list_indent_type" => {
157
+ opts.list_indent_type = parse_list_indent_type(val)?;
158
+ }
159
+ "list_indent_width" => {
160
+ opts.list_indent_width = usize::try_convert(val)?;
161
+ }
162
+ "bullets" => {
163
+ opts.bullets = String::try_convert(val)?;
164
+ }
165
+ "strong_em_symbol" => {
166
+ let value = String::try_convert(val)?;
167
+ let mut chars = value.chars();
168
+ let ch = chars
169
+ .next()
170
+ .ok_or_else(|| arg_error("strong_em_symbol must not be empty"))?;
171
+ if chars.next().is_some() {
172
+ return Err(arg_error("strong_em_symbol must be a single character"));
173
+ }
174
+ opts.strong_em_symbol = ch;
175
+ }
176
+ "escape_asterisks" => {
177
+ opts.escape_asterisks = bool::try_convert(val)?;
178
+ }
179
+ "escape_underscores" => {
180
+ opts.escape_underscores = bool::try_convert(val)?;
181
+ }
182
+ "escape_misc" => {
183
+ opts.escape_misc = bool::try_convert(val)?;
184
+ }
185
+ "escape_ascii" => {
186
+ opts.escape_ascii = bool::try_convert(val)?;
187
+ }
188
+ "code_language" => {
189
+ opts.code_language = String::try_convert(val)?;
190
+ }
191
+ "autolinks" => {
192
+ opts.autolinks = bool::try_convert(val)?;
193
+ }
194
+ "default_title" => {
195
+ opts.default_title = bool::try_convert(val)?;
196
+ }
197
+ "br_in_tables" => {
198
+ opts.br_in_tables = bool::try_convert(val)?;
199
+ }
200
+ "hocr_spatial_tables" => {
201
+ opts.hocr_spatial_tables = bool::try_convert(val)?;
202
+ }
203
+ "highlight_style" => {
204
+ opts.highlight_style = parse_highlight_style(val)?;
205
+ }
206
+ "extract_metadata" => {
207
+ opts.extract_metadata = bool::try_convert(val)?;
208
+ }
209
+ "whitespace_mode" => {
210
+ opts.whitespace_mode = parse_whitespace_mode(val)?;
211
+ }
212
+ "strip_newlines" => {
213
+ opts.strip_newlines = bool::try_convert(val)?;
214
+ }
215
+ "wrap" => {
216
+ opts.wrap = bool::try_convert(val)?;
217
+ }
218
+ "wrap_width" => {
219
+ opts.wrap_width = usize::try_convert(val)?;
220
+ }
221
+ "convert_as_inline" => {
222
+ opts.convert_as_inline = bool::try_convert(val)?;
223
+ }
224
+ "sub_symbol" => {
225
+ opts.sub_symbol = String::try_convert(val)?;
226
+ }
227
+ "sup_symbol" => {
228
+ opts.sup_symbol = String::try_convert(val)?;
229
+ }
230
+ "newline_style" => {
231
+ opts.newline_style = parse_newline_style(val)?;
232
+ }
233
+ "code_block_style" => {
234
+ opts.code_block_style = parse_code_block_style(val)?;
235
+ }
236
+ "keep_inline_images_in" => {
237
+ opts.keep_inline_images_in = parse_vec_of_strings(val)?;
238
+ }
239
+ "preprocessing" => {
240
+ opts.preprocessing = parse_preprocessing_options(ruby, val)?;
241
+ }
242
+ "encoding" => {
243
+ opts.encoding = String::try_convert(val)?;
244
+ }
245
+ "debug" => {
246
+ opts.debug = bool::try_convert(val)?;
247
+ }
248
+ "strip_tags" => {
249
+ opts.strip_tags = parse_vec_of_strings(val)?;
250
+ }
251
+ "preserve_tags" => {
252
+ opts.preserve_tags = parse_vec_of_strings(val)?;
253
+ }
254
+ _ => {}
255
+ }
256
+ Ok(ForEach::Continue)
257
+ })?;
258
+
259
+ Ok(opts)
260
+ }
261
+
262
+ fn build_inline_image_config(_ruby: &Ruby, config: Option<Value>) -> Result<InlineImageConfig, Error> {
263
+ let mut cfg = InlineImageConfig::new(DEFAULT_INLINE_IMAGE_LIMIT);
264
+
265
+ let Some(config) = config else {
266
+ return Ok(cfg);
267
+ };
268
+
269
+ if config.is_nil() {
270
+ return Ok(cfg);
271
+ }
272
+
273
+ let hash = RHash::from_value(config).ok_or_else(|| arg_error("inline image config must be provided as a Hash"))?;
274
+
275
+ hash.foreach(|key: Value, val: Value| {
276
+ let key_name = symbol_to_string(key)?;
277
+ match key_name.as_str() {
278
+ "max_decoded_size_bytes" => {
279
+ cfg.max_decoded_size_bytes = u64::try_convert(val)?;
280
+ }
281
+ "filename_prefix" => {
282
+ cfg.filename_prefix = if val.is_nil() {
283
+ None
284
+ } else {
285
+ Some(String::try_convert(val)?)
286
+ };
287
+ }
288
+ "capture_svg" => {
289
+ cfg.capture_svg = bool::try_convert(val)?;
290
+ }
291
+ "infer_dimensions" => {
292
+ cfg.infer_dimensions = bool::try_convert(val)?;
293
+ }
294
+ _ => {}
295
+ }
296
+ Ok(ForEach::Continue)
297
+ })?;
298
+
299
+ Ok(cfg)
300
+ }
301
+
302
+ fn inline_image_to_value(ruby: &Ruby, image: InlineImage) -> Result<Value, Error> {
303
+ let InlineImage {
304
+ data,
305
+ format,
306
+ filename,
307
+ description,
308
+ dimensions,
309
+ source,
310
+ attributes,
311
+ } = image;
312
+
313
+ let hash = ruby.hash_new();
314
+ let data_value = ruby.str_from_slice(&data);
315
+ hash.aset(ruby.intern("data"), data_value)?;
316
+
317
+ let format_value = match format {
318
+ InlineImageFormat::Png => "png".to_string(),
319
+ InlineImageFormat::Jpeg => "jpeg".to_string(),
320
+ InlineImageFormat::Gif => "gif".to_string(),
321
+ InlineImageFormat::Bmp => "bmp".to_string(),
322
+ InlineImageFormat::Webp => "webp".to_string(),
323
+ InlineImageFormat::Svg => "svg".to_string(),
324
+ InlineImageFormat::Other(other) => other,
325
+ };
326
+ hash.aset(ruby.intern("format"), format_value)?;
327
+
328
+ match filename {
329
+ Some(name) => hash.aset(ruby.intern("filename"), name)?,
330
+ None => hash.aset(ruby.intern("filename"), ruby.qnil())?,
331
+ }
332
+
333
+ match description {
334
+ Some(desc) => hash.aset(ruby.intern("description"), desc)?,
335
+ None => hash.aset(ruby.intern("description"), ruby.qnil())?,
336
+ }
337
+
338
+ if let Some((width, height)) = dimensions {
339
+ let dims = ruby.ary_new();
340
+ dims.push(width as i64)?;
341
+ dims.push(height as i64)?;
342
+ hash.aset(ruby.intern("dimensions"), dims)?;
343
+ } else {
344
+ hash.aset(ruby.intern("dimensions"), ruby.qnil())?;
345
+ }
346
+
347
+ let source_value = match source {
348
+ InlineImageSource::ImgDataUri => "img_data_uri",
349
+ InlineImageSource::SvgElement => "svg_element",
350
+ };
351
+ hash.aset(ruby.intern("source"), source_value)?;
352
+
353
+ let attrs = ruby.hash_new();
354
+ for (key, value) in attributes {
355
+ attrs.aset(key, value)?;
356
+ }
357
+ hash.aset(ruby.intern("attributes"), attrs)?;
358
+
359
+ Ok(hash.as_value())
360
+ }
361
+
362
+ fn warning_to_value(ruby: &Ruby, warning: InlineImageWarning) -> Result<Value, Error> {
363
+ let hash = ruby.hash_new();
364
+ hash.aset(ruby.intern("index"), warning.index as i64)?;
365
+ hash.aset(ruby.intern("message"), warning.message)?;
366
+ Ok(hash.as_value())
367
+ }
368
+
369
+ fn extraction_to_value(ruby: &Ruby, extraction: HtmlExtraction) -> Result<Value, Error> {
370
+ let hash = ruby.hash_new();
371
+ hash.aset(ruby.intern("markdown"), extraction.markdown)?;
372
+
373
+ let inline_images = ruby.ary_new();
374
+ for image in extraction.inline_images {
375
+ inline_images.push(inline_image_to_value(ruby, image)?)?;
376
+ }
377
+ hash.aset(ruby.intern("inline_images"), inline_images)?;
378
+
379
+ let warnings = ruby.ary_new();
380
+ for warning in extraction.warnings {
381
+ warnings.push(warning_to_value(ruby, warning)?)?;
382
+ }
383
+ hash.aset(ruby.intern("warnings"), warnings)?;
384
+
385
+ Ok(hash.as_value())
386
+ }
387
+
388
+ fn convert_fn(ruby: &Ruby, args: &[Value]) -> Result<String, Error> {
389
+ let parsed = scan_args::<(String,), (Option<Value>,), (), (), (), ()>(args)?;
390
+ let html = parsed.required.0;
391
+ let options = build_conversion_options(ruby, parsed.optional.0)?;
392
+
393
+ convert_inner(&html, Some(options)).map_err(conversion_error)
394
+ }
395
+
396
+ fn options_handle_fn(ruby: &Ruby, args: &[Value]) -> Result<OptionsHandle, Error> {
397
+ let parsed = scan_args::<(), (Option<Value>,), (), (), (), ()>(args)?;
398
+ let options = build_conversion_options(ruby, parsed.optional.0)?;
399
+ Ok(OptionsHandle(options))
400
+ }
401
+
402
+ fn convert_with_options_handle_fn(_ruby: &Ruby, args: &[Value]) -> Result<String, Error> {
403
+ let parsed = scan_args::<(String, &OptionsHandle), (), (), (), (), ()>(args)?;
404
+ let html = parsed.required.0;
405
+ let handle = parsed.required.1;
406
+ convert_inner(&html, Some(handle.0.clone())).map_err(conversion_error)
407
+ }
408
+
409
+ fn convert_with_inline_images_fn(ruby: &Ruby, args: &[Value]) -> Result<Value, Error> {
410
+ let parsed = scan_args::<(String,), (Option<Value>, Option<Value>), (), (), (), ()>(args)?;
411
+ let html = parsed.required.0;
412
+ let options = build_conversion_options(ruby, parsed.optional.0)?;
413
+ let config = build_inline_image_config(ruby, parsed.optional.1)?;
414
+
415
+ let extraction = convert_with_inline_images_inner(&html, Some(options), config).map_err(conversion_error)?;
416
+
417
+ extraction_to_value(ruby, extraction)
418
+ }
419
+
420
+ #[magnus::init]
421
+ fn init(ruby: &Ruby) -> Result<(), Error> {
422
+ let module = ruby.define_module("HtmlToMarkdown")?;
423
+ module.define_singleton_method("convert", function!(convert_fn, -1))?;
424
+ module.define_singleton_method("options", function!(options_handle_fn, -1))?;
425
+ module.define_singleton_method("convert_with_options", function!(convert_with_options_handle_fn, -1))?;
426
+ module.define_singleton_method(
427
+ "convert_with_inline_images",
428
+ function!(convert_with_inline_images_fn, -1),
429
+ )?;
430
+
431
+ Ok(())
432
+ }
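Reviewer note: judging from `extraction_to_value` above and from `spec/convert_spec.rb`, which reads `extraction[:inline_images].first[:description]`, `convert_with_inline_images` appears to return a plain Hash with symbol keys (`:markdown`, `:inline_images`, `:warnings`) rather than an object with reader methods. The sketch below consumes that shape. It uses a hand-built stand-in Hash with made-up values instead of calling the gem, and `describe_images` is a hypothetical helper, so treat it as a reading of this diff rather than verified behaviour.

```ruby
# frozen_string_literal: true

# Stand-in for the Hash assembled by extraction_to_value; all values are invented.
extraction = {
  markdown: '![Pixel](img_0.png)',
  inline_images: [
    {
      data: "\x89PNG".b, # binary string, 4 bytes
      format: 'png',
      filename: 'img_0.png',
      description: 'Pixel',
      dimensions: [1, 1],
      source: 'img_data_uri',
      attributes: { 'alt' => 'Pixel' } # attribute keys are Strings in lib.rs
    }
  ],
  warnings: []
}

# Builds one summary line per extracted image, using Hash access throughout.
def describe_images(extraction)
  extraction.fetch(:inline_images).map do |img|
    "#{img[:filename]} -> #{img[:format]} (#{img[:data].bytesize} bytes)"
  end
end

puts extraction[:markdown]
puts describe_images(extraction)
```

If this reading is right, the method-style access in the README's inline image example (`result.markdown`, `img.filename`) would need Hash lookups instead.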
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module HtmlToMarkdown
4
- VERSION = '2.6.5'
4
+ VERSION = '2.7.0'
5
5
  end
data/lib/html_to_markdown.rb CHANGED
@@ -7,9 +7,13 @@ module HtmlToMarkdown
7
7
  autoload :CLI, 'html_to_markdown/cli'
8
8
  autoload :CLIProxy, 'html_to_markdown/cli_proxy'
9
9
 
10
+ class Options; end # rubocop:disable Lint/EmptyClass
11
+
10
12
  class << self
11
13
  alias native_convert convert
12
14
  alias native_convert_with_inline_images convert_with_inline_images
15
+ alias native_options options
16
+ alias native_convert_with_options convert_with_options
13
17
  end
14
18
 
15
19
  module_function
@@ -18,7 +22,15 @@ module HtmlToMarkdown
18
22
  native_convert(html.to_s, options)
19
23
  end
20
24
 
25
+ def convert_with_options(html, options_handle)
26
+ native_convert_with_options(html.to_s, options_handle)
27
+ end
28
+
21
29
  def convert_with_inline_images(html, options = nil, image_config = nil)
22
30
  native_convert_with_inline_images(html.to_s, options, image_config)
23
31
  end
32
+
33
+ def options(options_hash = nil)
34
+ native_options(options_hash)
35
+ end
24
36
  end
data/spec/convert_spec.rb CHANGED
@@ -26,4 +26,13 @@ RSpec.describe HtmlToMarkdown do
26
26
  expect(extraction[:inline_images].first[:description]).to eq('fake')
27
27
  end
28
28
  end
29
+
30
+ describe '.options' do
31
+ it 'returns a reusable options handle' do
32
+ handle = described_class.options(heading_style: :atx_closed)
33
+ expect(handle).to be_a(HtmlToMarkdown::Options)
34
+ result = described_class.convert_with_options('<h1>Hello</h1>', handle)
35
+ expect(result).to include('# Hello #')
36
+ end
37
+ end
29
38
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: html-to-markdown
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.6.5
4
+ version: 2.7.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Na'aman Hirschfeld
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2025-11-08 00:00:00.000000000 Z
11
+ date: 2025-11-11 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rb_sys
@@ -47,8 +47,13 @@ files:
47
47
  - Gemfile.lock
48
48
  - README.md
49
49
  - Rakefile
50
+ - bin/benchmark.rb
50
51
  - exe/html-to-markdown
51
52
  - ext/html-to-markdown-rb/extconf.rb
53
+ - ext/html-to-markdown-rb/native/Cargo.toml
54
+ - ext/html-to-markdown-rb/native/README.md
55
+ - ext/html-to-markdown-rb/native/extconf.rb
56
+ - ext/html-to-markdown-rb/native/src/lib.rs
52
57
  - html-to-markdown-rb.gemspec
53
58
  - lib/html_to_markdown.rb
54
59
  - lib/html_to_markdown/cli.rb