html-to-markdown 2.16.0 → 2.18.0

data/README.md CHANGED
@@ -11,8 +11,8 @@ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust eng
11
11
  [![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
12
12
  [![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
13
13
  [![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
14
- [![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown)
15
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
14
+ [![Go Reference](https://pkg.go.dev/badge/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown.svg)](https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown)
15
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/LICENSE)
16
16
  [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
17
17
 
18
18
  ## Features
@@ -26,10 +26,10 @@ Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust eng
26
26
 
27
27
  ## Documentation & Support
28
28
 
29
- - [GitHub repository](https://github.com/Goldziher/html-to-markdown)
30
- - [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
31
- - [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
32
- - [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)
29
+ - [GitHub repository](https://github.com/kreuzberg-dev/html-to-markdown)
30
+ - [Issue tracker](https://github.com/kreuzberg-dev/html-to-markdown/issues)
31
+ - [Changelog](https://github.com/kreuzberg-dev/html-to-markdown/blob/main/CHANGELOG.md)
32
+ - [Live demo (WASM)](https://kreuzberg-dev.github.io/html-to-markdown/)
33
33
 
34
34
  ## Installation
35
35
 
@@ -55,35 +55,6 @@ ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolch
55
55
 
56
56
  This provides the standard headers (including `strings.h`) required for the bindgen step.
57
57
 
58
- ## Performance Snapshot
59
-
60
- Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)
61
-
62
- | Document | Size | Latency | Throughput | Docs/sec |
63
- | ------------------- | ----- | ------- | ---------- | -------- |
64
- | Lists (Timeline) | 129KB | 0.69ms | 187 MB/s | 1,450 |
65
- | Tables (Countries) | 360KB | 2.19ms | 164 MB/s | 456 |
66
- | Mixed (Python wiki) | 656KB | 4.88ms | 134 MB/s | 205 |
67
-
68
- > Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.
69
-
70
- ### Benchmark Fixtures (Apple M4)
71
-
72
- Measured via `task bench:harness` with the shared Wikipedia + hOCR suite:
73
-
74
- | Document | Size | ops/sec (Ruby) |
75
- | ---------------------- | ------ | -------------- |
76
- | Lists (Timeline) | 129 KB | 3,156 |
77
- | Tables (Countries) | 360 KB | 921 |
78
- | Medium (Python) | 657 KB | 469 |
79
- | Large (Rust) | 567 KB | 534 |
80
- | Small (Intro) | 463 KB | 629 |
81
- | hOCR German PDF | 44 KB | 7,250 |
82
- | hOCR Invoice | 4 KB | 83,883 |
83
- | hOCR Embedded Tables | 37 KB | 7,890 |
84
-
85
- > These numbers line up with the Python/Node bindings because everything flows through the same Rust engine.
86
-
87
58
  ## Quick Start
88
59
 
89
60
  ```ruby
@@ -108,349 +79,174 @@ puts markdown
108
79
  # - Identical output across languages
109
80
  ```
110
81
 
111
- ## API
82
+ ## API Reference
112
83
 
113
- ### Conversion Options
114
-
115
- Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.
84
+ ### Basic Conversion
116
85
 
117
86
  ```ruby
118
- require 'html_to_markdown'
87
+ # Simple conversion
88
+ markdown = HtmlToMarkdown.convert(html)
119
89
 
120
- markdown = HtmlToMarkdown.convert(
121
- '<pre><code class="language-ruby">puts "hi"</code></pre>',
122
- heading_style: :atx,
123
- code_block_style: :fenced,
124
- bullets: '*+-',
125
- list_indent_type: :spaces,
126
- list_indent_width: 2,
127
- whitespace_mode: :normalized,
128
- highlight_style: :double_equal
129
- )
90
+ # With options (pass a Ruby hash with symbol keys)
91
+ markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
130
92
 
131
- puts markdown
132
- ```
93
+ # With inline images
94
+ result = HtmlToMarkdown.convert_with_inline_images(html, image_config: {...})
95
+ markdown = result.markdown
96
+ images = result.inline_images
133
97
 
134
- ### Reusing Options
98
+ # With metadata extraction
99
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, options, metadata_config)
135
100
 
136
- If you’re running tight loops or benchmarks, build the options once and pass the handle back into `convert_with_options`:
137
-
138
- ```ruby
139
- handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
140
-
141
- 100.times do
142
- HtmlToMarkdown.convert_with_options('<h1>Handles</h1>', handle)
143
- end
101
+ # With visitor pattern (custom callbacks)
102
+ result = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new, options: {...})
144
103
  ```
145
104
 
146
- ### HTML Preprocessing
147
-
148
- Clean up scraped HTML (navigation, forms, malformed markup) before conversion:
105
+ ### Conversion Options Hash
149
106
 
150
107
  ```ruby
151
- require 'html_to_markdown'
152
-
153
- markdown = HtmlToMarkdown.convert(
154
- html,
108
+ {
109
+ heading_style: :atx, # :atx or :setext
110
+ code_block_style: :fenced, # :fenced or :indented
111
+ bullets: '*+-', # List bullet chars
112
+ list_indent_type: :spaces, # :spaces or :tabs
113
+ list_indent_width: 2, # Number of indent spaces
114
+ whitespace_mode: :normalized, # :normalized, :preserve, or :collapse
115
+ highlight_style: :double_equal, # Rendering of <mark> highlights (==text==)
116
+ hocr_spatial_tables: false, # Special hOCR table handling
155
117
  preprocessing: {
156
118
  enabled: true,
157
- preset: :aggressive, # :minimal, :standard, :aggressive
119
+ preset: :aggressive, # :minimal, :standard, :aggressive
158
120
  remove_navigation: true,
159
121
  remove_forms: true
160
122
  }
161
- )
123
+ }
162
124
  ```
163
125
 
164
- ### Inline Images
126
+ ### Performance: Reusing Options
165
127
 
166
- Extract inline binary data (data URIs, SVG) together with the converted Markdown.
128
+ For tight loops, build an options handle once:
167
129
 
168
130
  ```ruby
169
- require 'html_to_markdown'
170
-
171
- result = HtmlToMarkdown.convert_with_inline_images(
172
- '<img src="data:image/png;base64,iVBORw0..." alt="Pixel">',
173
- image_config: {
174
- max_decoded_size_bytes: 1 * 1024 * 1024,
175
- infer_dimensions: true,
176
- filename_prefix: 'img_',
177
- capture_svg: true
178
- }
179
- )
131
+ handle = HtmlToMarkdown.options(hocr_spatial_tables: false)
180
132
 
181
- puts result.markdown
182
- result.inline_images.each do |img|
183
- puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
133
+ 100.times do
134
+ HtmlToMarkdown.convert_with_options(html, handle)
184
135
  end
185
136
  ```
186
137
 
187
138
  ### Metadata Extraction
188
139
 
189
- Extract comprehensive metadata alongside Markdown conversion: document properties (title, description, author, language), social metadata (Open Graph, Twitter cards), heading hierarchy, link analysis (type classification, rel attributes), image metadata (dimensions, type detection), and structured data (JSON-LD, Microdata, RDFa).
190
-
191
- #### Basic Usage
140
+ Extract document properties (title, description, author, language), social metadata (Open Graph, Twitter cards), heading hierarchy, link analysis, image metadata, and structured data (JSON-LD, Microdata, RDFa):
192
141
 
193
142
  ```ruby
194
- require 'html_to_markdown'
195
-
196
143
  html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
197
144
  markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
198
145
 
199
- puts markdown
200
- puts metadata[:document][:title] # "Test"
201
- puts metadata[:headers].length # 1
202
- ```
203
-
204
- #### With Conversion Options
205
-
206
- ```ruby
207
- conv_opts = { heading_style: :atx_closed }
208
- metadata_opts = { extract_headers: true, extract_links: false }
209
-
210
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(
211
- html,
212
- conv_opts,
213
- metadata_opts
214
- )
215
- ```
216
-
217
- #### Full Example
218
-
219
- ```ruby
220
- require 'html_to_markdown'
221
-
222
- html = <<~HTML
223
- <html>
224
- <head>
225
- <title>Example</title>
226
- <meta name="description" content="Demo page">
227
- <link rel="canonical" href="https://example.com/page">
228
- <meta property="og:image" content="https://example.com/og.jpg">
229
- <meta name="twitter:card" content="summary_large_image">
230
- </head>
231
- <body>
232
- <h1 id="welcome">Welcome</h1>
233
- <a href="https://example.com" rel="nofollow external">Example link</a>
234
- <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
235
- <script type="application/ld+json">
236
- {"@context": "https://schema.org", "@type": "Article"}
237
- </script>
238
- </body>
239
- </html>
240
- HTML
241
-
242
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(
243
- html,
244
- { heading_style: :atx },
245
- { extract_links: true, extract_images: true, extract_headers: true, extract_structured_data: true }
246
- )
247
-
248
- puts markdown
249
- puts metadata[:document][:title] # "Example"
250
- puts metadata[:document][:description] # "Demo page"
251
- puts metadata[:document][:open_graph] # {"og:image" => "https://example.com/og.jpg"}
252
- puts metadata[:links].first[:rel] # ["nofollow", "external"]
253
- puts metadata[:images].first[:dimensions] # [640, 480]
254
- puts metadata[:headers].first[:id] # "welcome"
146
+ puts metadata[:document][:title] # "Test"
147
+ puts metadata[:headers].first[:text] # "Hello"
255
148
  ```
256
149
 
257
- #### Return Value Structure
150
+ For detailed examples (SEO extraction, heading hierarchy analysis, structured data) and full metadata structure reference, see [Metadata Extraction Guide](../../examples/metadata-extraction/).
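To make the heading-hierarchy use case concrete, here is a minimal sketch that turns `metadata[:headers]` into an indented outline. It assumes each header entry is a hash with `:level` (1-6) and `:text` keys, as listed in the return-value structure removed from the 2.16.0 README; whether 2.18.0 keeps exactly that shape is an assumption. `heading_outline` is a hypothetical helper, and the sample hash is hand-built so the snippet runs without the gem installed.

```ruby
# Hypothetical helper: renders the extracted headers as an indented outline.
# Assumes each entry carries :level (1-6) and :text, per the 2.16.0 README.
def heading_outline(metadata)
  headers = metadata.fetch(:headers, [])
  headers.map do |header|
    indent = '  ' * (header[:level] - 1)
    "#{indent}- #{header[:text]}"
  end.join("\n")
end

# Hand-built stand-in for the second value returned by convert_with_metadata.
sample = {
  document: { title: 'Test' },
  headers: [
    { level: 1, text: 'Hello' },
    { level: 2, text: 'Details' }
  ]
}

outline = heading_outline(sample)
puts outline
```

With the gem loaded, the hash passed in would be the `metadata` value from `markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)`.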
258
151
 
259
- Returns a 2-element array: `[markdown_string, metadata_hash]`
152
+ ### Visitor Pattern
260
153
 
261
- The metadata hash contains:
154
+ Customize conversion with fine-grained element callbacks. Perfect for custom element handling, analytics during conversion, domain-specific markdown dialects, and conditional rendering:
262
155
 
263
156
  ```ruby
264
- {
265
- document: {
266
- title: String?,
267
- description: String?,
268
- keywords: Array[String],
269
- author: String?,
270
- canonical_url: String?,
271
- base_href: String?,
272
- language: String?,
273
- text_direction: "ltr" | "rtl" | "auto" | nil,
274
- open_graph: Hash[String, String],
275
- twitter_card: Hash[String, String],
276
- meta_tags: Hash[String, String]
277
- },
278
- headers: [
279
- {
280
- level: Integer, # 1-6
281
- text: String,
282
- id: String?,
283
- depth: Integer,
284
- html_offset: Integer
285
- }
286
- ],
287
- links: [
288
- {
289
- href: String,
290
- text: String,
291
- title: String?,
292
- link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
293
- rel: Array[String],
294
- attributes: Hash[String, String]
295
- }
296
- ],
297
- images: [
298
- {
299
- src: String,
300
- alt: String?,
301
- title: String?,
302
- dimensions: [Integer, Integer]?,
303
- image_type: "data_uri" | "inline_svg" | "external" | "relative",
304
- attributes: Hash[String, String]
305
- }
306
- ],
307
- structured_data: [
308
- {
309
- data_type: "json_ld" | "microdata" | "rdfa",
310
- raw_json: String,
311
- schema_type: String?
312
- }
313
- ]
314
- }
315
- ```
316
-
317
- #### Metadata Configuration
318
-
319
- Pass a hash with the following options to control which metadata types are extracted:
320
-
321
- ```ruby
322
- config = {
323
- extract_headers: true, # Extract h1-h6 elements (default: true)
324
- extract_links: true, # Extract <a> elements (default: true)
325
- extract_images: true, # Extract <img> elements (default: true)
326
- extract_structured_data: true, # Extract JSON-LD/Microdata/RDFa (default: true)
327
- max_structured_data_size: 1_000_000 # Max bytes for structured data (default: 1MB)
328
- }
157
+ class MyVisitor
158
+ def visit_link(ctx, href, text, title = nil)
159
+ { type: :custom, output: "[#{text}](#{href})" }
160
+ end
161
+
162
+ def visit_image(ctx, src, alt, title = nil)
163
+ { type: :skip } # Remove images
164
+ end
165
+ end
329
166
 
330
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
167
+ result = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
331
168
  ```
332
169
 
333
- #### Features
170
+ **Return types**: `{ type: :continue }` (default), `{ type: :custom, output: "..." }` (replace), `{ type: :skip }` (omit), `{ type: :preserve_html }` (keep HTML), `{ type: :error, message: "..." }` (halt).
334
171
 
335
- The Ruby binding provides comprehensive metadata extraction during HTML-to-Markdown conversion:
172
+ **40+ visitor methods** for text, inline formatting, blocks, lists, tables, advanced elements, and lifecycle hooks. Callback parameters include `NodeContext` with element metadata (tag_name, attributes, depth, parent_tag, is_inline).
336
173
 
337
- - **Document Metadata**: title, description, keywords, author, canonical URL, language, text direction
338
- - **Open Graph & Twitter Card**: social media metadata extraction
339
- - **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
340
- - **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
341
- - **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
342
- - **Structured Data**: JSON-LD, Microdata, and RDFa extraction
174
+ For advanced examples (image filtering, link analytics, footnote dialects), RBS type-safety patterns, and full method reference, see [Visitor Pattern Guide](../../examples/visitor-pattern/).
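As a rough illustration of the visitor contract described in this hunk, the sketch below collects link targets while leaving the default rendering untouched. `LinkCollector` is a hypothetical name; the `visit_link` signature and the `{ type: :continue }` / `{ type: :skip }` result hashes are taken from the README text above. The callbacks are invoked by hand so the snippet runs without the gem installed, and dropping `javascript:` links is an illustrative policy, not library behaviour.

```ruby
# Hypothetical visitor: records every link it sees and drops javascript: links.
class LinkCollector
  attr_reader :hrefs

  def initialize
    @hrefs = []
  end

  # Signature mirrors the README's visit_link(ctx, href, text, title = nil).
  def visit_link(_ctx, href, _text, _title = nil)
    return { type: :skip } if href.to_s.start_with?('javascript:')

    @hrefs << href
    { type: :continue } # fall back to the default link rendering
  end
end

# Drive the callbacks by hand; with the gem loaded, the same object would be
# passed as `visitor:` to HtmlToMarkdown.convert_with_visitor.
collector = LinkCollector.new
kept    = collector.visit_link(nil, 'https://example.com', 'Example')
dropped = collector.visit_link(nil, 'javascript:void(0)', 'Click')
```

Keeping state on the visitor instance means the collected `hrefs` can be read after conversion without changing the Markdown output.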
343
175
 
344
- #### Type Safety with RBS
176
+ ## RBS Types & Strict Type Checking
345
177
 
346
- All types are defined in RBS format in `sig/html_to_markdown.rbs`:
347
-
348
- - `document_metadata` - Document-level metadata structure
349
- - `header_metadata` - Individual header element
350
- - `link_metadata` - Individual link element
351
- - `image_metadata` - Individual image element
352
- - `structured_data` - Structured data block
353
- - `extended_metadata` - Complete metadata extraction result
354
-
355
- Uses strict RBS type checking with Steep for full type safety:
178
+ Full RBS type definitions in `sig/html_to_markdown.rbs` enable strict type checking with [Steep](https://github.com/soutaro/steep):
356
179
 
357
180
  ```bash
358
181
  steep check
359
182
  ```
360
183
 
361
- #### Implementation Architecture
362
-
363
- The Rust implementation uses a single-pass collector pattern for efficient metadata extraction:
184
+ Key types:
185
+ - `HtmlToMarkdown::NodeContext` - Element metadata in visitor callbacks (tag_name, attributes, depth, etc.)
186
+ - `HtmlToMarkdown::visitor_result` - Return type union for visitor methods
187
+ - `HtmlToMarkdown::extended_metadata` - Metadata extraction result
364
188
 
365
- 1. **No duplication**: Core logic lives in Rust (`crates/html-to-markdown/src/metadata.rs`)
366
- 2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
367
- 3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
368
- 4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
189
+ Type-safe visitor implementation:
369
190
 
370
- The metadata feature is gated by a Cargo feature in `Cargo.toml`:
371
-
372
- ```toml
373
- [features]
374
- metadata = ["html-to-markdown-rs/metadata"]
191
+ ```ruby
192
+ class TypedVisitor
193
+ # Types live in sig/html_to_markdown.rbs, not in the Ruby source:
194
+ # ctx: HtmlToMarkdown::NodeContext
195
+ # href: String
196
+ # text: String
197
+ # title: String | nil
198
+ def visit_link(ctx, href, text, title = nil) # returns HtmlToMarkdown::visitor_result
199
+ { type: :custom, output: "[#{text}](#{href})" }
200
+ end
201
+ end
375
202
  ```
376
203
 
377
- This ensures:
378
- - Zero overhead when metadata is not needed
379
- - Clean integration with feature flag detection
380
- - Consistent with Python binding implementation
381
-
382
- #### Language Parity
383
-
384
- Implements the same API as the Python binding:
385
-
386
- - Same method signature: `convert_with_metadata(html, options, metadata_config)`
387
- - Same return type: `[markdown, metadata_dict]`
388
- - Same metadata structures and field names
389
- - Same enum values (link_type, image_type, data_type, text_direction)
390
-
391
- Enables seamless migration and multi-language development.
204
+ All public methods are typed for early error detection and LSP editor support (Ruby 3+).
392
205
 
393
- #### Performance
206
+ ## Magnus Native Extension
394
207
 
395
- Single-pass collection during tree traversal:
396
- - No additional parsing passes
397
- - Minimal memory overhead
398
- - Configurable extraction granularity
399
- - Built-in size limits for safety
208
+ The gem compiles a native Rust extension via [Magnus](https://github.com/matsadler/magnus) FFI bindings:
400
209
 
401
- #### Testing
210
+ - **Zero-copy interop**: String and hash data flows directly between Ruby and Rust
211
+ - **Safe bindings**: No segfaults; Rust's type system ensures memory safety
212
+ - **Automatic error mapping**: Rust errors convert to Ruby exceptions with full context
213
+ - **Native performance**: Compiled to a native extension (`.so` on Linux and Windows, `.bundle` on macOS)
214
+ - **Smart compilation**: Prebuilt binaries for common platforms; falls back to on-install compilation
402
215
 
403
- Comprehensive RSpec test suite in `spec/metadata_extraction_spec.rb`:
216
+ Build manually:
404
217
 
405
218
  ```bash
406
- cd packages/ruby
407
- bundle exec rake compile -- --release --features metadata
408
- bundle exec rspec spec/metadata_extraction_spec.rb
219
+ bundle exec rake compile
409
220
  ```
410
221
 
411
- Tests cover:
412
- - All metadata types extraction
413
- - Configuration flags
414
- - Edge cases (empty HTML, malformed input, special characters)
415
- - Return value structure validation
416
- - Integration with conversion options
222
+ ## CLI Proxy
417
223
 
418
- ## CLI
419
-
420
- The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
224
+ Call the Rust CLI directly from Ruby or shell:
421
225
 
422
226
  ```ruby
423
227
  require 'html_to_markdown/cli'
424
228
 
425
229
  HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
426
- # => writes converted Markdown to STDOUT
427
- ```
428
-
429
- You can also call the CLI binary directly for scripting:
430
230
 
431
- ```ruby
231
+ # Or call the binary directly
432
232
  HtmlToMarkdown::CLIProxy.call(['--version'])
433
- # => "html-to-markdown 2.5.7"
434
- ```
435
-
436
- Rebuild the CLI locally if you see `CLI binary not built` during tests:
437
-
438
- ```bash
439
- bundle exec rake compile # builds the extension
440
- bundle exec ruby scripts/prepare_ruby_gem.rb # copies the CLI into lib/bin/
441
233
  ```
442
234
 
443
235
  ## Error Handling
444
236
 
445
- Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:
237
+ - `HtmlToMarkdown::Error` - Conversion errors with Rust error context
238
+ - `HtmlToMarkdown::CLIProxy::MissingBinaryError` - CLI binary not found
239
+ - `HtmlToMarkdown::CLIProxy::CLIExecutionError` - Command execution failed
240
+
241
+ Binary data inputs (e.g., PDF bytes as string) raise `HtmlToMarkdown::Error` with "Invalid input" message.
446
242
 
447
- - `HtmlToMarkdown::CLIProxy::MissingBinaryError`
448
- - `HtmlToMarkdown::CLIProxy::CLIExecutionError`
243
+ ## Examples
449
244
 
450
- Rescue them to provide clearer feedback in your application.
245
+ Comprehensive guides with real-world patterns (Ruby examples included):
451
246
 
452
- Inputs that look like binary data (e.g., PDF bytes coerced to a string) raise `HtmlToMarkdown::Error` with an
453
- `Invalid input` message.
247
+ - **[Visitor Pattern](../../examples/visitor-pattern/)** - Custom callbacks, element-by-element control, analytics, domain-specific markdown dialects
248
+ - **[Metadata Extraction](../../examples/metadata-extraction/)** - SEO data, heading hierarchy, link classification, structured data parsing
249
+ - **[Performance Guide](../../examples/performance/)** - Benchmarking, profiling, throughput optimization
454
250
 
455
251
  ## Consistent Across Languages
456
252
 
@@ -459,6 +255,7 @@ The Ruby gem shares the exact Rust core with:
459
255
  - [Python wheels](https://pypi.org/project/html-to-markdown/)
460
256
  - [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
461
257
  - [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
258
+ - [PHP extension](https://packagist.org/packages/goldziher/html-to-markdown)
462
259
  - The Rust crate and CLI
463
260
 
464
261
  Use whichever runtime fits your stack while keeping formatting behaviour identical.
@@ -470,7 +267,7 @@ bundle exec rake compile # build the native extension
470
267
  bundle exec rspec # run test suite
471
268
  ```
472
269
 
473
- The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.
270
+ When editing Rust code under `src/`, rerun `rake compile`.
474
271
 
475
272
  ## License
476
273