html-to-markdown 2.11.1 → 2.11.3

This diff shows the changes between two publicly released versions of the package, as they appear in their public registries. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 835588a279a73ee3cc97d2010b023d96e29454a7a6aaaa81dfb849d1d0a6acde
- data.tar.gz: c1b533d208f4c061f1724cdb87c22f0aca309ca6b02767252812a839c86cc29e
+ metadata.gz: b469768fb1f1e42d77a4acaa11bb173c9f97aa584dd9a3315f5593ec459c7125
+ data.tar.gz: cd82653deecd6b0a5a2c2ba31ce0a5ebb16b87292b5ef8c6e9e2a402715c229c
  SHA512:
- metadata.gz: 2d7381ee6e1e90726ff4a502aaf7768f7128b541ede801d0fc1c9ee717de7f24518c80b57a9a2ccca0285e37f6a749f51dbb94f3d7254e570d220e772afe534f
- data.tar.gz: aca530dadbc1c0e1a927da375071eaf17edc90f04ef2a4102848cf0ab4432bcf051a48c37cc2372e8497c17d157dce5fcb747b567525c51dad4d834ee916fdfc
+ metadata.gz: 38240a994c44c0497c5b3faf62b3d74dcecdf5b4d960d8754179ebc7d8c0344be050b0b099258d20fd7eef4e533a681d94d5238d965544841547145d346603ad
+ data.tar.gz: 4094481d2dda7846b4e89d346982039952ec05d574f6e0c19430162be3a7f0f0c4a15bdb44075dad43259e9f2543b965fa44c05f517550093a327245a03b0aa8
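The `checksums.yaml` entries above are hex digests of the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. A minimal sketch of how such digests could be checked follows; the helper name `checksum_mismatches` and the stand-in file contents are assumptions for illustration, not part of the gem or of RubyGems.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: returns "ALGORITHM file" for every digest that does not match.
# `checksums` is the parsed checksums.yaml hash; `files` maps file name => raw bytes.
def checksum_mismatches(checksums, files)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }
  mismatches = []
  algorithms.each do |name, digest_class|
    (checksums[name] || {}).each do |file, expected|
      actual = digest_class.hexdigest(files.fetch(file))
      mismatches << "#{name} #{file}" unless actual == expected
    end
  end
  mismatches
end

# Stand-in bytes (assumption): a real check would read metadata.gz and
# data.tar.gz out of the unpacked .gem archive instead.
files = { 'metadata.gz' => 'example metadata', 'data.tar.gz' => 'example data' }

# Build a checksums document in the same shape as the one in the diff.
checksums = YAML.safe_load(<<~YAML)
  ---
  SHA256:
    metadata.gz: "#{Digest::SHA256.hexdigest(files['metadata.gz'])}"
    data.tar.gz: "#{Digest::SHA256.hexdigest(files['data.tar.gz'])}"
YAML

puts checksum_mismatches(checksums, files).empty?
```

An empty result means every listed digest matched; any entry in the returned array names the algorithm and file that differ.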
data/.bundle/config CHANGED
@@ -1,2 +1,2 @@
- ---
- BUNDLE_PATH: "vendor/bundle"
+ ---
+ BUNDLE_PATH: "vendor/bundle"
data/.rubocop.yml CHANGED
@@ -1,29 +1,29 @@
- plugins:
- - rubocop-rspec
-
- AllCops:
- NewCops: enable
- TargetRubyVersion: 3.2
- Exclude:
- - "tmp/**/*"
- - "vendor/**/*"
-
- Style/Documentation:
- Enabled: false
-
- Metrics/BlockLength:
- Exclude:
- - "spec/**/*"
- - "*.gemspec"
-
- Metrics/MethodLength:
- Max: 15
-
- RSpec/MultipleExpectations:
- Enabled: false
-
- RSpec/ExampleLength:
- Enabled: false
-
- RSpec/SpecFilePathFormat:
- Enabled: false
+ plugins:
+ - rubocop-rspec
+
+ AllCops:
+ NewCops: enable
+ TargetRubyVersion: 3.2
+ Exclude:
+ - "tmp/**/*"
+ - "vendor/**/*"
+
+ Style/Documentation:
+ Enabled: false
+
+ Metrics/BlockLength:
+ Exclude:
+ - "spec/**/*"
+ - "*.gemspec"
+
+ Metrics/MethodLength:
+ Max: 15
+
+ RSpec/MultipleExpectations:
+ Enabled: false
+
+ RSpec/ExampleLength:
+ Enabled: false
+
+ RSpec/SpecFilePathFormat:
+ Enabled: false
data/Gemfile CHANGED
@@ -1,17 +1,17 @@
- # frozen_string_literal: true
-
- source 'https://rubygems.org'
-
- ruby '>= 3.2'
-
- gemspec
-
- group :development, :test do
- gem 'rake-compiler'
- gem 'rbs', require: false
- gem 'rb_sys' # provides build tooling when developing locally
- gem 'rspec'
- gem 'rubocop', require: false
- gem 'rubocop-rspec', require: false
- gem 'steep', require: false
- end
+ # frozen_string_literal: true
+
+ source 'https://rubygems.org'
+
+ ruby '>= 3.2'
+
+ gemspec
+
+ group :development, :test do
+ gem 'rake-compiler'
+ gem 'rbs', require: false
+ gem 'rb_sys' # provides build tooling when developing locally
+ gem 'rspec'
+ gem 'rubocop', require: false
+ gem 'rubocop-rspec', require: false
+ gem 'steep', require: false
+ end
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- html-to-markdown (2.11.1)
+ html-to-markdown (2.11.3)
  rb_sys (>= 0.9, < 1.0)
 
  GEM
@@ -24,17 +24,16 @@ GEM
  base64 (0.3.0)
  bigdecimal (3.3.1)
  concurrent-ruby (1.3.5)
- connection_pool (2.5.5)
+ connection_pool (3.0.2)
  csv (3.3.5)
  diff-lcs (1.6.2)
  drb (2.2.3)
  ffi (1.17.2)
  ffi (1.17.2-arm64-darwin)
- ffi (1.17.2-x64-mingw-ucrt)
  fileutils (1.8.0)
  i18n (1.14.7)
  concurrent-ruby (~> 1.0)
- json (2.16.0)
+ json (2.17.1)
  language_server-protocol (3.17.0.5)
  lint_roller (1.1.0)
  listen (3.9.0)
@@ -53,12 +52,12 @@ GEM
  rake (13.3.1)
  rake-compiler (1.3.0)
  rake
- rake-compiler-dock (1.9.1)
+ rake-compiler-dock (1.10.0)
  rb-fsevent (0.11.2)
  rb-inotify (0.11.1)
  ffi (~> 1.0)
- rb_sys (0.9.117)
- rake-compiler-dock (= 1.9.1)
+ rb_sys (0.9.119)
+ rake-compiler-dock (= 1.10.0)
  rbs (3.9.5)
  logger
  regexp_parser (2.11.3)
@@ -124,7 +123,6 @@ GEM
  PLATFORMS
  arm64-darwin-24
  ruby
- x64-mingw-ucrt
 
  DEPENDENCIES
  html-to-markdown!
@@ -137,7 +135,7 @@ DEPENDENCIES
  steep
 
  RUBY VERSION
- ruby 3.2.9p248
+ ruby 3.2.9p248
 
  BUNDLED WITH
- 2.7.2
+ 4.0.0
data/README.md CHANGED
@@ -1,243 +1,243 @@
- # html-to-markdown-rb
-
- Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
-
- [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg?logo=rust&label=crates.io)](https://crates.io/crates/html-to-markdown-rs)
- [![npm (node)](https://img.shields.io/npm/v/html-to-markdown-node.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-node)
- [![npm (wasm)](https://img.shields.io/npm/v/html-to-markdown-wasm.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-wasm)
- [![PyPI](https://img.shields.io/pypi/v/html-to-markdown.svg?logo=pypi)](https://pypi.org/project/html-to-markdown/)
- [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
- [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
- [![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
- [![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
- [![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
- [![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
-
- ## Features
-
- - ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
- - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
- - ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
- - 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
- - 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
- - 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.
-
- ## Documentation & Support
-
- - [GitHub repository](https://github.com/Goldziher/html-to-markdown)
- - [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
- - [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
- - [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
-
- ## Installation
-
- ```bash
- bundle add html-to-markdown
- # or
- gem install html-to-markdown
- ```
-
- Add the gem to your project and Bundler will compile the native Rust extension on first install.
-
- ### Requirements
-
- - Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
- - Rust toolchain **1.85+** with Cargo available on your `$PATH`
- - Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
-
- **Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
-
- ```powershell
- ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
- ```
-
- This provides the standard headers (including `strings.h`) required for the bindgen step.
-
- ## Performance Snapshot
-
- Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
-
- | Document | Size | Latency | Throughput | Docs/sec |
- | ------------------- | ----- | ------- | ---------- | -------- |
- | Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
- | Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
- | Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
-
- > Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
-
- ### Benchmark Fixtures (Apple M4)
-
- Measured via `task bench:bindings -- --language ruby` with the shared Wikipedia + hOCR suite:
-
- | Document | Size | ops/sec (Ruby) |
- | ---------------------- | ------ | -------------- |
- | Lists (Timeline) | 129 KB | 1,349 |
- | Tables (Countries) | 360 KB | 326 |
- | Medium (Python) | 657 KB | 157 |
- | Large (Rust) | 567 KB | 174 |
- | Small (Intro) | 463 KB | 214 |
- | hOCR German PDF | 44 KB | 2,936 |
- | hOCR Invoice | 4 KB | 25,740 |
- | hOCR Embedded Tables | 37 KB | 3,328 |
-
- > These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
-
- ## Quick Start
-
- ```ruby
- require 'html_to_markdown'
-
- html = <<~HTML
- <h1>Welcome</h1>
- <p>This is <strong>Rust-fast</strong> conversion!</p>
- <ul>
- <li>Native extension</li>
- <li>Identical output across languages</li>
- </ul>
- HTML
-
- markdown = HtmlToMarkdown.convert(html)
- puts markdown
- # # Welcome
- #
- # This is **Rust-fast** conversion!
- #
- # - Native extension
- # - Identical output across languages
- ```
-
- ## API
-
- ### Conversion Options
-
- Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
-
- ```ruby
- require 'html_to_markdown'
-
- markdown = HtmlToMarkdown.convert(
- '<pre><code class="language-ruby">puts "hi"</code></pre>',
- heading_style: :atx,
- code_block_style: :fenced,
- bullets: '*+-',
- list_indent_type: :spaces,
- list_indent_width: 2,
- whitespace_mode: :normalized,
- highlight_style: :double_equal
- )
-
- puts markdown
- ```
-
- ### Reusing Options
-
- If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
-
- ```ruby
- handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
-
- 100.times do
- HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
- end
- ```
-
- ### HTML Preprocessing
-
- Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
-
- ```ruby
- require 'html_to_markdown'
-
- markdown = HtmlToMarkdown.convert(
- html,
- preprocessing: {
- enabled: true,
- preset: :aggressive, # :minimal, :standard, :aggressive
- remove_navigation: true,
- remove_forms: true
- }
- )
- ```
-
- ### Inline Images
-
- Extract inline binary data (data URIs, SVG) together with the converted Markdown.
-
- ```ruby
- require 'html_to_markdown'
-
- result = HtmlToMarkdown.convert_with_inline_images(
- '<img src="..." alt="Pixel">',
- image_config: {
- max_decoded_size_bytes: 1 * 1024 * 1024,
- infer_dimensions: true,
- filename_prefix: 'img_',
- capture_svg: true
- }
- )
-
- puts result.markdown
- result.inline_images.each do |img|
- puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
- end
- ```
-
- ## CLI
-
- The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
-
- ```ruby
- require 'html_to_markdown/cli'
-
- HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
- # => writes converted Markdown to STDOUT
- ```
-
- You can also call the CLI binary directly for scripting:
-
- ```ruby
- HtmlToMarkdown::CLIProxy.call(['--version'])
- # => "html-to-markdown 2.5.7"
- ```
-
- Rebuild the CLI locally if you see `CLI binary not built` during tests:
-
- ```bash
- bundle exec rake compile # builds the extension
- bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
- ```
-
- ## Error Handling
-
- Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
-
- - `HtmlToMarkdown::CLIProxy::MissingBinaryError`
- - `HtmlToMarkdown::CLIProxy::CLIExecutionError`
-
- Rescue them to provide clearer feedback in your application.
-
- ## Consistent Across Languages
-
- The Ruby gem shares the exact Rust core with:
-
- - [Python wheels](https://pypi.org/project/html-to-markdown/)
- - [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
- - [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
- - The Rust crate and CLI
-
- Use whichever runtime fits your stack while keeping formatting behaviour identical.
-
- ## Development
-
- ```bash
- bundle exec rake compile # build the native extension
- bundle exec rspec # run test suite
- ```
-
- The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
-
- ## License
-
- MIT © Na'aman Hirschfeld
+ # html-to-markdown-rb
+
+ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance.
+
+ [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg?logo=rust&label=crates.io)](https://crates.io/crates/html-to-markdown-rs)
+ [![npm (node)](https://img.shields.io/npm/v/html-to-markdown-node.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-node)
+ [![npm (wasm)](https://img.shields.io/npm/v/html-to-markdown-wasm.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-wasm)
+ [![PyPI](https://img.shields.io/pypi/v/html-to-markdown.svg?logo=pypi)](https://pypi.org/project/html-to-markdown/)
+ [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
+ [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
+ [![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
+ [![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
+ [![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
+ [![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
+
+ ## Features
+
+ - ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60‑80× faster than BeautifulSoup-based converters).
+ - 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, PHP extension, WASM package, and CLI — consistent Markdown everywhere.
+ - ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
+ - 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
+ - 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
+ - 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.
+
+ ## Documentation & Support
+
+ - [GitHub repository](https://github.com/Goldziher/html-to-markdown)
+ - [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
+ - [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
+ - [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
+
+ ## Installation
+
+ ```bash
+ bundle add html-to-markdown
+ # or
+ gem install html-to-markdown
+ ```
+
+ Add the gem to your project and Bundler will compile the native Rust extension on first install.
+
+ ### Requirements
+
+ - Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
+ - Rust toolchain **1.85+** with Cargo available on your `$PATH`
+ - Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)
+
+ **Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:
+
+ ```powershell
+ ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
+ ```
+
+ This provides the standard headers (including `strings.h`) required for the bindgen step.
+
+ ## Performance Snapshot
+
+ Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
+
+ | Document | Size | Latency | Throughput | Docs/sec |
+ | ------------------- | ----- | ------- | ---------- | -------- |
+ | Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
+ | Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
+ | Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
+
+ > Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
+
+ ### Benchmark Fixtures (Apple M4)
+
+ Measured via `task bench:bindings -- --language ruby` with the shared Wikipedia + hOCR suite:
+
+ | Document | Size | ops/sec (Ruby) |
+ | ---------------------- | ------ | -------------- |
+ | Lists (Timeline) | 129 KB | 1,349 |
+ | Tables (Countries) | 360 KB | 326 |
+ | Medium (Python) | 657 KB | 157 |
+ | Large (Rust) | 567 KB | 174 |
+ | Small (Intro) | 463 KB | 214 |
+ | hOCR German PDF | 44 KB | 2,936 |
+ | hOCR Invoice | 4 KB | 25,740 |
+ | hOCR Embedded Tables | 37 KB | 3,328 |
+
+ > These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
+
+ ## Quick Start
+
+ ```ruby
+ require 'html_to_markdown'
+
+ html = <<~HTML
+ <h1>Welcome</h1>
+ <p>This is <strong>Rust-fast</strong> conversion!</p>
+ <ul>
+ <li>Native extension</li>
+ <li>Identical output across languages</li>
+ </ul>
+ HTML
+
+ markdown = HtmlToMarkdown.convert(html)
+ puts markdown
+ # # Welcome
+ #
+ # This is **Rust-fast** conversion!
+ #
+ # - Native extension
+ # - Identical output across languages
+ ```
+
+ ## API
+
+ ### Conversion Options
+
+ Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
+
+ ```ruby
+ require 'html_to_markdown'
+
+ markdown = HtmlToMarkdown.convert(
+ '<pre><code class="language-ruby">puts "hi"</code></pre>',
+ heading_style: :atx,
+ code_block_style: :fenced,
+ bullets: '*+-',
+ list_indent_type: :spaces,
+ list_indent_width: 2,
+ whitespace_mode: :normalized,
+ highlight_style: :double_equal
+ )
+
+ puts markdown
+ ```
+
+ ### Reusing Options
+
+ If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
+
+ ```ruby
+ handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
+
+ 100.times do
+ HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
+ end
+ ```
+
+ ### HTML Preprocessing
+
+ Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
+
+ ```ruby
+ require 'html_to_markdown'
+
+ markdown = HtmlToMarkdown.convert(
+ html,
+ preprocessing: {
+ enabled: true,
+ preset: :aggressive, # :minimal, :standard, :aggressive
+ remove_navigation: true,
+ remove_forms: true
+ }
+ )
+ ```
+
+ ### Inline Images
+
+ Extract inline binary data (data URIs, SVG) together with the converted Markdown.
+
+ ```ruby
+ require 'html_to_markdown'
+
+ result = HtmlToMarkdown.convert_with_inline_images(
+ '<img src="..." alt="Pixel">',
+ image_config: {
+ max_decoded_size_bytes: 1 * 1024 * 1024,
+ infer_dimensions: true,
+ filename_prefix: 'img_',
+ capture_svg: true
+ }
+ )
+
+ puts result.markdown
+ result.inline_images.each do |img|
+ puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
+ end
+ ```
+
+ ## CLI
+
+ The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
+
+ ```ruby
+ require 'html_to_markdown/cli'
+
+ HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
+ # => writes converted Markdown to STDOUT
+ ```
+
+ You can also call the CLI binary directly for scripting:
+
+ ```ruby
+ HtmlToMarkdown::CLIProxy.call(['--version'])
+ # => "html-to-markdown 2.5.7"
+ ```
+
+ Rebuild the CLI locally if you see `CLI binary not built` during tests:
+
+ ```bash
+ bundle exec rake compile # builds the extension
+ bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
+ ```
+
+ ## Error Handling
+
+ Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
+
+ - `HtmlToMarkdown::CLIProxy::MissingBinaryError`
+ - `HtmlToMarkdown::CLIProxy::CLIExecutionError`
+
+ Rescue them to provide clearer feedback in your application.
+
+ ## Consistent Across Languages
+
+ The Ruby gem shares the exact Rust core with:
+
+ - [Python wheels](https://pypi.org/project/html-to-markdown/)
+ - [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
+ - [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
+ - The Rust crate and CLI
+
+ Use whichever runtime fits your stack while keeping formatting behaviour identical.
+
+ ## Development
+
+ ```bash
+ bundle exec rake compile # build the native extension
+ bundle exec rspec # run test suite
+ ```
+
+ The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
+
+ ## License
+
+ MIT © Na'aman Hirschfeld
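Several files in this diff (`.bundle/config`, `.rubocop.yml`, `Gemfile`, `README.md`) show every line as both removed and added with identical visible text. That pattern is typical of a whitespace-only change such as a switch between CRLF and LF line endings, though the diff itself does not confirm the cause. A minimal sketch of how one might test that assumption on two versions of a file follows; the helper name `line_ending_only_change?` and the sample strings are hypothetical.

```ruby
# Hypothetical helper: true when two texts differ, but only in their line endings.
def line_ending_only_change?(old_text, new_text)
  return false if old_text == new_text

  # Normalize CRLF and lone CR to LF before comparing.
  normalize = ->(text) { text.gsub(/\r\n?/, "\n") }
  normalize.call(old_text) == normalize.call(new_text)
end

# Sample inputs (assumption): the same two-line file with CRLF and with LF endings.
old_config = "---\r\nBUNDLE_PATH: \"vendor/bundle\"\r\n"
new_config = "---\nBUNDLE_PATH: \"vendor/bundle\"\n"

puts line_ending_only_change?(old_config, new_config)
```

To apply this to the real packages, read each file from the two unpacked gem versions in binary mode so that Ruby does not translate the line endings on read.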