html-to-markdown 2.12.0 → 2.13.0

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 64b0bc7bc7ba063d5e3bdcdf4ac3edde703e9358a208ea73f54cfa0d48de53e0
4
- data.tar.gz: 3bb71b6453bd3d8fc0095d22e9da317ba2a20aef0ca29bfd2c53b18ec0d212a2
3
+ metadata.gz: 5643900ad2e99a827fb3814cba27fae1ca60f33dda4d0ff82af524b08afd9ad8
4
+ data.tar.gz: '029a84a8b9f7e2f08453651807313f0b4d40c83cf9af1ebb1399a8bcfbf1ff00'
5
5
  SHA512:
6
- metadata.gz: cd5e3ca8fd4c06817a129587b30df66b2a5c114c329047ca4aa0f691627b0a057bcaa8ed930460dc7399808c32779bc7a8f576e6eee8739c482d0fe3bbca4783
7
- data.tar.gz: 5f049999f764f98f9980a5aa3bdb057c18f0a7914ed844a80613d88a7ae84b83a3e9824752daa60f82b25458ee871b028bd87c02713e0422e1abbaf0b2d6c709
6
+ metadata.gz: 8ee229c8b956f68d316ee17a08163d88ed89beb36a912e2ab03a8b52a625d217eb97c6428c12610014f89cc583515dc56531f7622c17ccdf9c66d8f10d3467db
7
+ data.tar.gz: e4ceff847ce455aa4c925cfd3c7084540649a45e85665948f3faaa1d8f0c4d31e69cb1efdb4eee2b185d6553db1c6eb5bbf376f96e8e5aa394d630c4f46bf93c
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- html-to-markdown (2.12.0)
4
+ html-to-markdown (2.13.0)
5
5
  rb_sys (>= 0.9, < 1.0)
6
6
 
7
7
  GEM
data/METADATA.md ADDED
@@ -0,0 +1,227 @@
1
+ # Metadata Extraction for Ruby Bindings
2
+
3
+ Complete Ruby Magnus binding implementation for HTML-to-Markdown metadata extraction with full RBS type signatures.
4
+
5
+ ## Features
6
+
7
+ The Ruby binding provides comprehensive metadata extraction during HTML-to-Markdown conversion:
8
+
9
+ - **Document Metadata**: title, description, keywords, author, canonical URL, language, text direction
10
+ - **Open Graph & Twitter Card**: social media metadata extraction
11
+ - **Headers**: h1-h6 extraction with hierarchy, ids, and depth tracking
12
+ - **Links**: hyperlink extraction with type classification (anchor, internal, external, email, phone)
13
+ - **Images**: image extraction with source type (data_uri, inline_svg, external, relative) and dimensions
14
+ - **Structured Data**: JSON-LD, Microdata, and RDFa extraction
15
+
16
+ ## API
17
+
18
+ ### Basic Usage
19
+
20
+ ```ruby
21
+ require 'html_to_markdown'
22
+
23
+ html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
24
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
25
+
26
+ puts markdown
27
+ puts metadata[:document][:title] # "Test"
28
+ puts metadata[:headers].length # 1
29
+ ```
30
+
31
+ ### With Conversion Options
32
+
33
+ ```ruby
34
+ conv_opts = { heading_style: :atx_closed }
35
+ metadata_opts = { extract_headers: true, extract_links: false }
36
+
37
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(
38
+ html,
39
+ conv_opts,
40
+ metadata_opts
41
+ )
42
+ ```
43
+
44
+ ### Return Value
45
+
46
+ Returns a 2-element array: `[markdown_string, metadata_hash]`
47
+
48
+ The metadata hash contains:
49
+
50
+ ```ruby
51
+ {
52
+ document: {
53
+ title: String?,
54
+ description: String?,
55
+ keywords: Array[String],
56
+ author: String?,
57
+ canonical_url: String?,
58
+ base_href: String?,
59
+ language: String?,
60
+ text_direction: "ltr" | "rtl" | "auto" | nil,
61
+ open_graph: Hash[String, String],
62
+ twitter_card: Hash[String, String],
63
+ meta_tags: Hash[String, String]
64
+ },
65
+ headers: [
66
+ {
67
+ level: Integer, # 1-6
68
+ text: String,
69
+ id: String?,
70
+ depth: Integer,
71
+ html_offset: Integer
72
+ }
73
+ ],
74
+ links: [
75
+ {
76
+ href: String,
77
+ text: String,
78
+ title: String?,
79
+ link_type: "anchor" | "internal" | "external" | "email" | "phone" | "other",
80
+ rel: Array[String],
81
+ attributes: Hash[String, String]
82
+ }
83
+ ],
84
+ images: [
85
+ {
86
+ src: String,
87
+ alt: String?,
88
+ title: String?,
89
+ dimensions: [Integer, Integer]?,
90
+ image_type: "data_uri" | "inline_svg" | "external" | "relative",
91
+ attributes: Hash[String, String]
92
+ }
93
+ ],
94
+ structured_data: [
95
+ {
96
+ data_type: "json_ld" | "microdata" | "rdfa",
97
+ raw_json: String,
98
+ schema_type: String?
99
+ }
100
+ ]
101
+ }
102
+ ```
103
+
104
+ ## Metadata Configuration
105
+
106
+ Pass a hash with the following options to control which metadata types are extracted:
107
+
108
+ ```ruby
109
+ config = {
110
+ extract_headers: true, # Extract h1-h6 elements (default: true)
111
+ extract_links: true, # Extract <a> elements (default: true)
112
+ extract_images: true, # Extract <img> elements (default: true)
113
+ extract_structured_data: true, # Extract JSON-LD/Microdata/RDFa (default: true)
114
+ max_structured_data_size: 1_000_000 # Max bytes for structured data (default: 1MB)
115
+ }
116
+
117
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
118
+ ```
119
+
120
+ ## Type Signatures
121
+
122
+ All types are defined in RBS format in `sig/html_to_markdown.rbs`:
123
+
124
+ - `document_metadata` - Document-level metadata structure
125
+ - `header_metadata` - Individual header element
126
+ - `link_metadata` - Individual link element
127
+ - `image_metadata` - Individual image element
128
+ - `structured_data` - Structured data block
129
+ - `extended_metadata` - Complete metadata extraction result
130
+
131
+ Uses strict RBS type checking with Steep for full type safety.
132
+
133
+ ## Implementation Details
134
+
135
+ ### Architecture
136
+
137
+ The Rust implementation uses a single-pass collector pattern for efficient metadata extraction:
138
+
139
+ 1. **No duplication**: Core logic lives in Rust (`crates/html-to-markdown/src/metadata.rs`)
140
+ 2. **Minimal wrapper layer**: Ruby binding in `crates/html-to-markdown-rb/src/lib.rs`
141
+ 3. **Type translation**: Rust types → Ruby hashes with proper Magnus bindings
142
+ 4. **Hash conversion**: Uses Magnus `RHash` API for efficient Ruby hash construction
143
+
144
+ ### Hash Conversion Pattern
145
+
146
+ Following the inline_images pattern:
147
+
148
+ ```rust
149
+ fn document_metadata_to_ruby(ruby: &Ruby, doc: RustDocumentMetadata) -> Result<Value, Error> {
150
+ let hash = ruby.hash_new();
151
+ hash.aset(ruby.intern("title"), opt_string_to_ruby(ruby, doc.title)?)?;
152
+ hash.aset(ruby.intern("keywords"), keywords_array)?;
153
+ // ... more fields
154
+ Ok(hash.as_value())
155
+ }
156
+ ```
157
+
158
+ ### Feature Flag
159
+
160
+ The metadata feature is gated by a Cargo feature in `Cargo.toml` (this crate enables it by default):
161
+
162
+ ```toml
163
+ [features]
164
+ metadata = ["html-to-markdown-rs/metadata"]
165
+ ```
166
+
167
+ This ensures:
168
+ - Zero overhead when metadata is not needed
169
+ - Clean integration with feature flag detection
170
+ - Consistent with Python binding implementation
171
+
172
+ ## Tests
173
+
174
+ Comprehensive RSpec test suite in `spec/metadata_extraction_spec.rb`:
175
+
176
+ ```bash
177
+ cd packages/ruby
178
+ bundle exec rake compile
179
+ bundle exec rspec spec/metadata_extraction_spec.rb
180
+ ```
181
+
182
+ Tests cover:
183
+ - All metadata types extraction
184
+ - Configuration flags
185
+ - Edge cases (empty HTML, malformed input, special characters)
186
+ - Return value structure validation
187
+ - Integration with conversion options
188
+
189
+ ## Language Parity
190
+
191
+ Implements the same API as the Python binding:
192
+
193
+ - Same method signature: `convert_with_metadata(html, options, metadata_config)`
194
+ - Same return type: `[markdown, metadata_dict]`
195
+ - Same metadata structures and field names
196
+ - Same enum values (link_type, image_type, data_type, text_direction)
197
+
198
+ Enables seamless migration and multi-language development.
199
+
200
+ ## Performance
201
+
202
+ Single-pass collection during tree traversal:
203
+ - No additional parsing passes
204
+ - Minimal memory overhead
205
+ - Configurable extraction granularity
206
+ - Built-in size limits for safety
207
+
208
+ ## Building and Testing
209
+
210
+ Build the extension with metadata support:
211
+
212
+ ```bash
213
+ cd packages/ruby
214
+ bundle exec rake compile -- --release --features metadata
215
+ ```
216
+
217
+ Run type checking:
218
+
219
+ ```bash
220
+ steep check
221
+ ```
222
+
223
+ Run tests:
224
+
225
+ ```bash
226
+ bundle exec rspec spec/metadata_extraction_spec.rb
227
+ ```
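Reviewer note on METADATA.md: each entry in the documented `headers` array carries `level`, `text`, and an optional `id`, which is enough to rebuild a document outline. The sketch below is a minimal illustration under stated assumptions: `sample_headers` is hand-written data shaped like the documented return value (it is not real gem output), and `build_toc` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical sample shaped like metadata[:headers] as documented in METADATA.md.
# The depth and html_offset values are made up for illustration.
sample_headers = [
  { level: 1, text: 'Main Title', id: 'main-title', depth: 0, html_offset: 0 },
  { level: 2, text: 'Section', id: nil, depth: 1, html_offset: 40 },
  { level: 3, text: 'Subsection', id: 'sub', depth: 2, html_offset: 80 }
]

# build_toc is an illustrative helper, not a gem API.
# It indents each entry by heading level and links it when an id is present.
def build_toc(headers)
  lines = headers.map do |header|
    indent = '  ' * (header[:level] - 1)
    label = header[:id] ? "[#{header[:text]}](##{header[:id]})" : header[:text]
    "#{indent}- #{label}"
  end
  lines.join("\n")
end

toc = build_toc(sample_headers)
puts toc
```

Because `id` is typed `String?`, the helper falls back to plain text when a heading has no `id` attribute.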
data/README.md CHANGED
@@ -184,6 +184,40 @@ result.inline_images.each do |img|
184
184
  end
185
185
  ```
186
186
 
187
+ ### Metadata extraction
188
+
189
+ ```ruby
190
+ require 'html_to_markdown'
191
+
192
+ html = <<~HTML
193
+ <html>
194
+ <head>
195
+ <title>Example</title>
196
+ <meta name="description" content="Demo page">
197
+ <link rel="canonical" href="https://example.com/page">
198
+ </head>
199
+ <body>
200
+ <h1 id="welcome">Welcome</h1>
201
+ <a href="https://example.com" rel="nofollow external">Example link</a>
202
+ <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
203
+ </body>
204
+ </html>
205
+ HTML
206
+
207
+ markdown, metadata = HtmlToMarkdown.convert_with_metadata(
208
+ html,
209
+ { heading_style: :atx },
210
+ { extract_links: true, extract_images: true, extract_headers: true }
211
+ )
212
+
213
+ puts markdown
214
+ puts metadata[:document][:title] # "Example"
215
+ puts metadata[:links].first[:rel] # ["nofollow", "external"]
216
+ puts metadata[:images].first[:dimensions] # [640, 480]
217
+ ```
218
+
219
+ `metadata` contains document tags (title, description, canonical URL, Open Graph/Twitter cards), enriched links (type, text, `rel`, raw attributes), images (alt, title, inferred dimensions, attributes), headers with depth/offsets, and optional structured data when enabled.
220
+
187
221
  ## CLI
188
222
 
189
223
  The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.
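Reviewer note on the README hunk above: the new "Metadata extraction" example reads `rel` from a link and `dimensions` from an image. A short sketch of how a caller might summarize those collections follows. The `sample` hash is hand-written to match the documented shape and is an assumption, not output captured from the gem.

```ruby
# Hypothetical sample shaped like the metadata hash described in the README.
sample = {
  links: [
    { href: 'https://example.com', text: 'Example link', title: nil,
      link_type: 'external', rel: %w[nofollow external], attributes: {} },
    { href: '#welcome', text: 'Top', title: nil,
      link_type: 'anchor', rel: [], attributes: {} }
  ],
  images: [
    { src: 'https://example.com/image.jpg', alt: 'Hero', title: nil,
      dimensions: [640, 480], image_type: 'external', attributes: {} },
    { src: 'logo.png', alt: nil, title: nil,
      dimensions: nil, image_type: 'relative', attributes: {} }
  ]
}

# Count links per documented link_type string.
links_by_type = sample[:links].group_by { |link| link[:link_type] }.transform_values(&:length)

# Collect hrefs of links whose rel list contains "nofollow".
nofollow_hrefs = sample[:links].select { |link| link[:rel].include?('nofollow') }.map { |link| link[:href] }

# dimensions is nil when width and height are unknown, so guard before formatting.
image_sizes = sample[:images].map do |image|
  width, height = image[:dimensions]
  size = width && height ? "#{width}x#{height}" : 'unknown'
  "#{image[:src]} (#{size})"
end

puts links_by_type.inspect
puts nofollow_hrefs.inspect
puts image_sizes.inspect
```

The enum-like fields (`link_type`, `image_type`) are plain strings in this binding, so the sketch compares against string literals rather than symbols.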
@@ -1,6 +1,6 @@
1
1
  [package]
2
2
  name = "html-to-markdown-rb"
3
- version = "2.12.0"
3
+ version = "2.13.0"
4
4
  edition = "2024"
5
5
  authors = ["Na'aman Hirschfeld <nhirschfeld@gmail.com>"]
6
6
  license = "MIT"
@@ -18,10 +18,11 @@ name = "html_to_markdown_rb"
18
18
  crate-type = ["cdylib", "rlib"]
19
19
 
20
20
  [features]
21
- default = []
21
+ default = ["metadata"]
22
+ metadata = ["html-to-markdown-rs/metadata"]
22
23
 
23
24
  [dependencies]
24
- html-to-markdown-rs = { version = "2.12.0", features = ["inline-images"] }
25
+ html-to-markdown-rs = { version = "2.13.0", features = ["inline-images"] }
25
26
  magnus = { git = "https://github.com/matsadler/magnus", rev = "f6db11769efb517427bf7f121f9c32e18b059b38", features = ["rb-sys"] }
26
27
 
27
28
  [dev-dependencies]
@@ -4,6 +4,17 @@ use html_to_markdown_rs::{
4
4
  PreprocessingPreset, WhitespaceMode, convert as convert_inner,
5
5
  convert_with_inline_images as convert_with_inline_images_inner, error::ConversionError, safety::guard_panic,
6
6
  };
7
+
8
+ #[cfg(feature = "metadata")]
9
+ use html_to_markdown_rs::convert_with_metadata as convert_with_metadata_inner;
10
+ #[cfg(feature = "metadata")]
11
+ use html_to_markdown_rs::metadata::{
12
+ DocumentMetadata as RustDocumentMetadata, ExtendedMetadata as RustExtendedMetadata,
13
+ HeaderMetadata as RustHeaderMetadata, ImageMetadata as RustImageMetadata, ImageType as RustImageType,
14
+ LinkMetadata as RustLinkMetadata, LinkType as RustLinkType, MetadataConfig as RustMetadataConfig,
15
+ StructuredData as RustStructuredData, StructuredDataType as RustStructuredDataType,
16
+ TextDirection as RustTextDirection,
17
+ };
7
18
  use magnus::prelude::*;
8
19
  use magnus::r_hash::ForEach;
9
20
  use magnus::{Error, RArray, RHash, Ruby, Symbol, TryConvert, Value, function, scan_args::scan_args};
@@ -423,6 +434,261 @@ fn convert_with_inline_images_fn(ruby: &Ruby, args: &[Value]) -> Result<Value, E
423
434
  extraction_to_value(ruby, extraction)
424
435
  }
425
436
 
437
+ #[cfg(feature = "metadata")]
438
+ fn build_metadata_config(_ruby: &Ruby, config: Option<Value>) -> Result<RustMetadataConfig, Error> {
439
+ let mut cfg = RustMetadataConfig::default();
440
+
441
+ let Some(config) = config else {
442
+ return Ok(cfg);
443
+ };
444
+
445
+ if config.is_nil() {
446
+ return Ok(cfg);
447
+ }
448
+
449
+ let hash = RHash::from_value(config).ok_or_else(|| arg_error("metadata_config must be provided as a Hash"))?;
450
+
451
+ hash.foreach(|key: Value, val: Value| {
452
+ let key_name = symbol_to_string(key)?;
453
+ match key_name.as_str() {
454
+ "extract_headers" => {
455
+ cfg.extract_headers = bool::try_convert(val)?;
456
+ }
457
+ "extract_links" => {
458
+ cfg.extract_links = bool::try_convert(val)?;
459
+ }
460
+ "extract_images" => {
461
+ cfg.extract_images = bool::try_convert(val)?;
462
+ }
463
+ "extract_structured_data" => {
464
+ cfg.extract_structured_data = bool::try_convert(val)?;
465
+ }
466
+ "max_structured_data_size" => {
467
+ cfg.max_structured_data_size = usize::try_convert(val)?;
468
+ }
469
+ _ => {}
470
+ }
471
+ Ok(ForEach::Continue)
472
+ })?;
473
+
474
+ Ok(cfg)
475
+ }
476
+
477
+ #[cfg(feature = "metadata")]
478
+ fn opt_string_to_ruby(ruby: &Ruby, opt: Option<String>) -> Result<Value, Error> {
479
+ match opt {
480
+ Some(val) => Ok(ruby.str_from_slice(val.as_bytes()).as_value()),
481
+ None => Ok(ruby.qnil().as_value()),
482
+ }
483
+ }
484
+
485
+ #[cfg(feature = "metadata")]
486
+ fn btreemap_to_ruby_hash(ruby: &Ruby, map: std::collections::BTreeMap<String, String>) -> Result<Value, Error> {
487
+ let hash = ruby.hash_new();
488
+ for (k, v) in map {
489
+ hash.aset(k, v)?;
490
+ }
491
+ Ok(hash.as_value())
492
+ }
493
+
494
+ #[cfg(feature = "metadata")]
495
+ fn text_direction_to_string(text_direction: Option<RustTextDirection>) -> Option<&'static str> {
496
+ match text_direction {
497
+ Some(RustTextDirection::LeftToRight) => Some("ltr"),
498
+ Some(RustTextDirection::RightToLeft) => Some("rtl"),
499
+ Some(RustTextDirection::Auto) => Some("auto"),
500
+ None => None,
501
+ }
502
+ }
503
+
504
+ #[cfg(feature = "metadata")]
505
+ fn link_type_to_string(link_type: &RustLinkType) -> &'static str {
506
+ match link_type {
507
+ RustLinkType::Anchor => "anchor",
508
+ RustLinkType::Internal => "internal",
509
+ RustLinkType::External => "external",
510
+ RustLinkType::Email => "email",
511
+ RustLinkType::Phone => "phone",
512
+ RustLinkType::Other => "other",
513
+ }
514
+ }
515
+
516
+ #[cfg(feature = "metadata")]
517
+ fn image_type_to_string(image_type: &RustImageType) -> &'static str {
518
+ match image_type {
519
+ RustImageType::DataUri => "data_uri",
520
+ RustImageType::InlineSvg => "inline_svg",
521
+ RustImageType::External => "external",
522
+ RustImageType::Relative => "relative",
523
+ }
524
+ }
525
+
526
+ #[cfg(feature = "metadata")]
527
+ fn structured_data_type_to_string(data_type: &RustStructuredDataType) -> &'static str {
528
+ match data_type {
529
+ RustStructuredDataType::JsonLd => "json_ld",
530
+ RustStructuredDataType::Microdata => "microdata",
531
+ RustStructuredDataType::RDFa => "rdfa",
532
+ }
533
+ }
534
+
535
+ #[cfg(feature = "metadata")]
536
+ fn document_metadata_to_ruby(ruby: &Ruby, doc: RustDocumentMetadata) -> Result<Value, Error> {
537
+ let hash = ruby.hash_new();
538
+
539
+ hash.aset(ruby.intern("title"), opt_string_to_ruby(ruby, doc.title)?)?;
540
+ hash.aset(ruby.intern("description"), opt_string_to_ruby(ruby, doc.description)?)?;
541
+
542
+ let keywords = ruby.ary_new();
543
+ for keyword in doc.keywords {
544
+ keywords.push(keyword)?;
545
+ }
546
+ hash.aset(ruby.intern("keywords"), keywords)?;
547
+
548
+ hash.aset(ruby.intern("author"), opt_string_to_ruby(ruby, doc.author)?)?;
549
+ hash.aset(
550
+ ruby.intern("canonical_url"),
551
+ opt_string_to_ruby(ruby, doc.canonical_url)?,
552
+ )?;
553
+ hash.aset(ruby.intern("base_href"), opt_string_to_ruby(ruby, doc.base_href)?)?;
554
+ hash.aset(ruby.intern("language"), opt_string_to_ruby(ruby, doc.language)?)?;
555
+
556
+ match text_direction_to_string(doc.text_direction) {
557
+ Some(dir) => hash.aset(ruby.intern("text_direction"), dir)?,
558
+ None => hash.aset(ruby.intern("text_direction"), ruby.qnil())?,
559
+ }
560
+
561
+ hash.aset(ruby.intern("open_graph"), btreemap_to_ruby_hash(ruby, doc.open_graph)?)?;
562
+ hash.aset(
563
+ ruby.intern("twitter_card"),
564
+ btreemap_to_ruby_hash(ruby, doc.twitter_card)?,
565
+ )?;
566
+ hash.aset(ruby.intern("meta_tags"), btreemap_to_ruby_hash(ruby, doc.meta_tags)?)?;
567
+
568
+ Ok(hash.as_value())
569
+ }
570
+
571
+ #[cfg(feature = "metadata")]
572
+ fn headers_to_ruby(ruby: &Ruby, headers: Vec<RustHeaderMetadata>) -> Result<Value, Error> {
573
+ let array = ruby.ary_new();
574
+ for header in headers {
575
+ let hash = ruby.hash_new();
576
+ hash.aset(ruby.intern("level"), header.level)?;
577
+ hash.aset(ruby.intern("text"), header.text)?;
578
+ hash.aset(ruby.intern("id"), opt_string_to_ruby(ruby, header.id)?)?;
579
+ hash.aset(ruby.intern("depth"), header.depth as i64)?;
580
+ hash.aset(ruby.intern("html_offset"), header.html_offset as i64)?;
581
+ array.push(hash)?;
582
+ }
583
+ Ok(array.as_value())
584
+ }
585
+
586
+ #[cfg(feature = "metadata")]
587
+ fn links_to_ruby(ruby: &Ruby, links: Vec<RustLinkMetadata>) -> Result<Value, Error> {
588
+ let array = ruby.ary_new();
589
+ for link in links {
590
+ let hash = ruby.hash_new();
591
+ hash.aset(ruby.intern("href"), link.href)?;
592
+ hash.aset(ruby.intern("text"), link.text)?;
593
+ hash.aset(ruby.intern("title"), opt_string_to_ruby(ruby, link.title)?)?;
594
+ hash.aset(ruby.intern("link_type"), link_type_to_string(&link.link_type))?;
595
+
596
+ let rel_array = ruby.ary_new();
597
+ for r in link.rel {
598
+ rel_array.push(r)?;
599
+ }
600
+ hash.aset(ruby.intern("rel"), rel_array)?;
601
+
602
+ hash.aset(ruby.intern("attributes"), btreemap_to_ruby_hash(ruby, link.attributes)?)?;
603
+ array.push(hash)?;
604
+ }
605
+ Ok(array.as_value())
606
+ }
607
+
608
+ #[cfg(feature = "metadata")]
609
+ fn images_to_ruby(ruby: &Ruby, images: Vec<RustImageMetadata>) -> Result<Value, Error> {
610
+ let array = ruby.ary_new();
611
+ for image in images {
612
+ let hash = ruby.hash_new();
613
+ hash.aset(ruby.intern("src"), image.src)?;
614
+ hash.aset(ruby.intern("alt"), opt_string_to_ruby(ruby, image.alt)?)?;
615
+ hash.aset(ruby.intern("title"), opt_string_to_ruby(ruby, image.title)?)?;
616
+
617
+ match image.dimensions {
618
+ Some((width, height)) => {
619
+ let dims = ruby.ary_new();
620
+ dims.push(width as i64)?;
621
+ dims.push(height as i64)?;
622
+ hash.aset(ruby.intern("dimensions"), dims)?;
623
+ }
624
+ None => {
625
+ hash.aset(ruby.intern("dimensions"), ruby.qnil())?;
626
+ }
627
+ }
628
+
629
+ hash.aset(ruby.intern("image_type"), image_type_to_string(&image.image_type))?;
630
+ hash.aset(
631
+ ruby.intern("attributes"),
632
+ btreemap_to_ruby_hash(ruby, image.attributes)?,
633
+ )?;
634
+ array.push(hash)?;
635
+ }
636
+ Ok(array.as_value())
637
+ }
638
+
639
+ #[cfg(feature = "metadata")]
640
+ fn structured_data_to_ruby(ruby: &Ruby, data: Vec<RustStructuredData>) -> Result<Value, Error> {
641
+ let array = ruby.ary_new();
642
+ for item in data {
643
+ let hash = ruby.hash_new();
644
+ hash.aset(
645
+ ruby.intern("data_type"),
646
+ structured_data_type_to_string(&item.data_type),
647
+ )?;
648
+ hash.aset(ruby.intern("raw_json"), item.raw_json)?;
649
+ hash.aset(ruby.intern("schema_type"), opt_string_to_ruby(ruby, item.schema_type)?)?;
650
+ array.push(hash)?;
651
+ }
652
+ Ok(array.as_value())
653
+ }
654
+
655
+ #[cfg(feature = "metadata")]
656
+ fn extended_metadata_to_ruby(ruby: &Ruby, metadata: RustExtendedMetadata) -> Result<Value, Error> {
657
+ let hash = ruby.hash_new();
658
+
659
+ hash.aset(
660
+ ruby.intern("document"),
661
+ document_metadata_to_ruby(ruby, metadata.document)?,
662
+ )?;
663
+ hash.aset(ruby.intern("headers"), headers_to_ruby(ruby, metadata.headers)?)?;
664
+ hash.aset(ruby.intern("links"), links_to_ruby(ruby, metadata.links)?)?;
665
+ hash.aset(ruby.intern("images"), images_to_ruby(ruby, metadata.images)?)?;
666
+ hash.aset(
667
+ ruby.intern("structured_data"),
668
+ structured_data_to_ruby(ruby, metadata.structured_data)?,
669
+ )?;
670
+
671
+ Ok(hash.as_value())
672
+ }
673
+
674
+ #[cfg(feature = "metadata")]
675
+ fn convert_with_metadata_fn(ruby: &Ruby, args: &[Value]) -> Result<Value, Error> {
676
+ let parsed = scan_args::<(String,), (Option<Value>, Option<Value>), (), (), (), ()>(args)?;
677
+ let html = parsed.required.0;
678
+ let options = build_conversion_options(ruby, parsed.optional.0)?;
679
+ let metadata_config = build_metadata_config(ruby, parsed.optional.1)?;
680
+
681
+ let (markdown, metadata) =
682
+ guard_panic(|| convert_with_metadata_inner(&html, Some(options), metadata_config)).map_err(conversion_error)?;
683
+
684
+ // Convert to Ruby array [markdown, metadata_hash]
685
+ let array = ruby.ary_new();
686
+ array.push(markdown)?;
687
+ array.push(extended_metadata_to_ruby(ruby, metadata)?)?;
688
+
689
+ Ok(array.as_value())
690
+ }
691
+
426
692
  #[magnus::init]
427
693
  fn init(ruby: &Ruby) -> Result<(), Error> {
428
694
  let module = ruby.define_module("HtmlToMarkdown")?;
@@ -434,5 +700,8 @@ fn init(ruby: &Ruby) -> Result<(), Error> {
434
700
  function!(convert_with_inline_images_fn, -1),
435
701
  )?;
436
702
 
703
+ #[cfg(feature = "metadata")]
704
+ module.define_singleton_method("convert_with_metadata", function!(convert_with_metadata_fn, -1))?;
705
+
437
706
  Ok(())
438
707
  }
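Reviewer note on `build_metadata_config`: the Rust function starts from the default config, returns it unchanged for `nil`, rejects non-Hash input, and silently ignores unknown keys. The Ruby sketch below mirrors that control flow for readers who want to reason about it without reading Rust. It is an approximation under assumptions: `default_metadata_config` and `normalize_metadata_config` are hypothetical helpers, the default values are taken from METADATA.md, keys are assumed to be symbols or strings, and boolean coercion is assumed to follow Ruby truthiness.

```ruby
# Defaults as documented in METADATA.md (all extractors on, 1 MB structured-data cap).
def default_metadata_config
  {
    extract_headers: true,
    extract_links: true,
    extract_images: true,
    extract_structured_data: true,
    max_structured_data_size: 1_000_000
  }
end

# Illustrative helper, not part of the gem: approximates build_metadata_config.
def normalize_metadata_config(config)
  cfg = default_metadata_config
  return cfg if config.nil?
  raise ArgumentError, 'metadata_config must be provided as a Hash' unless config.is_a?(Hash)

  config.each do |key, value|
    name = key.to_sym
    next unless cfg.key?(name) # unknown keys are ignored, like the `_ => {}` match arm

    # Assumption: the size must be an integer; the flags follow Ruby truthiness.
    cfg[name] = name == :max_structured_data_size ? Integer(value) : !!value
  end
  cfg
end

normalized = normalize_metadata_config({ extract_links: false, bogus: 1 })
puts normalized.inspect
```

One consequence of ignoring unknown keys is that a misspelled option such as `extract_link:` has no effect and raises no error, so callers may want their own validation.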
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module HtmlToMarkdown
4
- VERSION = '2.12.0'
4
+ VERSION = '2.13.0'
5
5
  end
@@ -14,6 +14,7 @@ module HtmlToMarkdown
14
14
  alias native_convert_with_inline_images convert_with_inline_images
15
15
  alias native_options options
16
16
  alias native_convert_with_options convert_with_options
17
+ alias native_convert_with_metadata convert_with_metadata
17
18
  end
18
19
 
19
20
  module_function
@@ -33,4 +34,14 @@ module HtmlToMarkdown
33
34
  def options(options_hash = nil)
34
35
  native_options(options_hash)
35
36
  end
37
+
38
+ # Convert HTML to Markdown with comprehensive metadata extraction
39
+ #
40
+ # @param html [String] HTML string to convert
41
+ # @param options [Hash, nil] Optional conversion configuration
42
+ # @param metadata_config [Hash, nil] Optional metadata extraction configuration
43
+ # @return [Array<String, Hash>] Array containing [markdown_string, metadata_hash]
44
+ def convert_with_metadata(html, options = nil, metadata_config = nil)
45
+ native_convert_with_metadata(html.to_s, options, metadata_config)
46
+ end
36
47
  end
@@ -87,6 +87,74 @@ module HtmlToMarkdown
87
87
  warnings: Array[inline_image_warning]
88
88
  }
89
89
 
90
+ type metadata_config = {
91
+ extract_headers?: bool,
92
+ extract_links?: bool,
93
+ extract_images?: bool,
94
+ extract_structured_data?: bool,
95
+ max_structured_data_size?: Integer
96
+ }
97
+
98
+ type text_direction = "ltr" | "rtl" | "auto" | nil
99
+
100
+ type document_metadata = {
101
+ title: String?,
102
+ description: String?,
103
+ keywords: Array[String],
104
+ author: String?,
105
+ canonical_url: String?,
106
+ base_href: String?,
107
+ language: String?,
108
+ text_direction: text_direction,
109
+ open_graph: Hash[String, String],
110
+ twitter_card: Hash[String, String],
111
+ meta_tags: Hash[String, String]
112
+ }
113
+
114
+ type header_metadata = {
115
+ level: Integer,
116
+ text: String,
117
+ id: String?,
118
+ depth: Integer,
119
+ html_offset: Integer
120
+ }
121
+
122
+ type link_type = "anchor" | "internal" | "external" | "email" | "phone" | "other"
123
+
124
+ type link_metadata = {
125
+ href: String,
126
+ text: String,
127
+ title: String?,
128
+ link_type: link_type,
129
+ rel: Array[String],
130
+ attributes: Hash[String, String]
131
+ }
132
+
133
+ type image_type = "data_uri" | "inline_svg" | "external" | "relative"
134
+
135
+ type image_metadata = {
136
+ src: String,
137
+ alt: String?,
138
+ title: String?,
139
+ dimensions: [Integer, Integer]?,
140
+ image_type: image_type,
141
+ attributes: Hash[String, String]
142
+ }
143
+
144
+ type structured_data = {
145
+ data_type: "json_ld" | "microdata" | "rdfa",
146
+ raw_json: String,
147
+ schema_type: String?
148
+ }
149
+
150
+ type extended_metadata = {
151
+ document: document_metadata,
152
+ headers: Array[header_metadata],
153
+ links: Array[link_metadata],
154
+ images: Array[image_metadata],
155
+ structured_data: Array[structured_data]
156
+ }
157
+
90
158
  # Native methods (implemented in Rust via Magnus/rb-sys)
91
159
  # These are aliased from the Rust extension and available as both module and instance methods
92
160
  private
@@ -99,6 +167,11 @@ module HtmlToMarkdown
99
167
  conversion_options? options,
100
168
  inline_image_config? image_config
101
169
  ) -> html_extraction
170
+ def self.native_convert_with_metadata: (
171
+ String html,
172
+ conversion_options? options,
173
+ metadata_config? metadata_config
174
+ ) -> [String, extended_metadata]
102
175
 
103
176
  def native_convert: (String html, conversion_options? options) -> String
104
177
  def native_options: (conversion_options? options_hash) -> Options
@@ -108,14 +181,19 @@ module HtmlToMarkdown
108
181
  conversion_options? options,
109
182
  inline_image_config? image_config
110
183
  ) -> html_extraction
184
+ def native_convert_with_metadata: (
185
+ String html,
186
+ conversion_options? options,
187
+ metadata_config? metadata_config
188
+ ) -> [String, extended_metadata]
111
189
 
112
190
  public
113
191
 
114
192
  # Convert HTML to Markdown with optional configuration
115
- def self.convert: (String html, ?conversion_options? options) -> String
193
+ def self.convert: (String html, ?conversion_options options) -> String
116
194
 
117
195
  # Create a reusable options handle for performance
118
- def self.options: (?conversion_options? options_hash) -> Options
196
+ def self.options: (?conversion_options options_hash) -> Options
119
197
 
120
198
  # Convert HTML using a pre-built options handle
121
199
  def self.convert_with_options: (String html, Options options_handle) -> String
@@ -123,17 +201,54 @@ module HtmlToMarkdown
123
201
  # Convert HTML with inline image extraction
124
202
  def self.convert_with_inline_images: (
125
203
  String html,
126
- ?conversion_options? options,
127
- ?inline_image_config? image_config
204
+ ?conversion_options options,
205
+ ?inline_image_config image_config
128
206
  ) -> html_extraction
129
207
 
208
+ # Convert HTML to Markdown with metadata extraction
209
+ #
210
+ # Extracts comprehensive metadata (headers, links, images, structured data) during conversion.
211
+ #
212
+ # Args:
213
+ # html: HTML string to convert
214
+ # options: Optional conversion configuration
215
+ # metadata_config: Optional metadata extraction configuration
216
+ #
217
+ # Returns:
218
+ # Array containing:
219
+ # - [0] markdown: String - Converted markdown output
220
+ # - [1] metadata: Hash - Extracted metadata with document, headers, links, images, structured_data
221
+ #
222
+ # The metadata hash contains:
223
+ # - document: Document-level metadata (title, description, language, etc.)
224
+ # - headers: List of header elements with hierarchy
225
+ # - links: List of extracted hyperlinks with classification
226
+ # - images: List of extracted images with metadata
227
+ # - structured_data: List of JSON-LD, Microdata, or RDFa blocks
228
+ #
229
+ # Example:
230
+ # html = '<html lang="en"><head><title>Test</title></head><body><h1>Hello</h1></body></html>'
231
+ # markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
232
+ # puts "Title: #{metadata[:document][:title]}"
233
+ # puts "Headers: #{metadata[:headers].length}"
234
+ def self.convert_with_metadata: (
235
+ String html,
236
+ ?conversion_options options,
237
+ ?metadata_config metadata_config
238
+ ) -> [String, extended_metadata]
239
+
130
240
  # Instance method versions (created by module_function)
131
- def convert: (String html, ?conversion_options? options) -> String
132
- def options: (?conversion_options? options_hash) -> Options
241
+ def convert: (String html, ?conversion_options options) -> String
242
+ def options: (?conversion_options options_hash) -> Options
133
243
  def convert_with_options: (String html, Options options_handle) -> String
134
244
  def convert_with_inline_images: (
135
245
  String html,
136
- ?conversion_options? options,
137
- ?inline_image_config? image_config
246
+ ?conversion_options options,
247
+ ?inline_image_config image_config
138
248
  ) -> html_extraction
249
+ def convert_with_metadata: (
250
+ String html,
251
+ ?conversion_options options,
252
+ ?metadata_config metadata_config
253
+ ) -> [String, extended_metadata]
139
254
  end
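Reviewer note on the RBS signatures: the string unions (`link_type`, `image_type`, the `data_type` field, and `text_direction`) are only enforced statically by Steep. A caller who wants a runtime check could use something like the sketch below. `allowed_values` and `enum_violations` are hypothetical helpers written for this note, and `sample` is hand-written data with one deliberately invalid value.

```ruby
# Allowed string values copied from the RBS type aliases in this diff.
def allowed_values
  {
    link_type: %w[anchor internal external email phone other],
    image_type: %w[data_uri inline_svg external relative],
    data_type: %w[json_ld microdata rdfa]
  }
end

# Illustrative helper, not part of the gem.
# Returns a message for every value that falls outside the RBS unions.
def enum_violations(metadata)
  checks = { links: :link_type, images: :image_type, structured_data: :data_type }
  errors = []
  checks.each do |collection, field|
    Array(metadata[collection]).each_with_index do |item, index|
      value = item[field]
      next if allowed_values[field].include?(value)

      errors << "#{collection}[#{index}].#{field} = #{value.inspect}"
    end
  end
  # text_direction is the only union that also admits nil.
  direction = metadata.dig(:document, :text_direction)
  unless [nil, 'ltr', 'rtl', 'auto'].include?(direction)
    errors << "document.text_direction = #{direction.inspect}"
  end
  errors
end

# Hypothetical sample; 'mailto' is intentionally not a valid link_type.
sample = {
  document: { text_direction: 'ltr' },
  links: [{ link_type: 'external' }, { link_type: 'mailto' }],
  images: [{ image_type: 'relative' }],
  structured_data: []
}

violations = enum_violations(sample)
puts violations.inspect
```

A check like this is mainly useful in tests that compare output across language bindings, since the Rust side maps each enum variant to exactly one of these strings.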
@@ -0,0 +1,440 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'spec_helper'
4
+
5
+ RSpec.describe HtmlToMarkdown do
6
+ describe '.convert_with_metadata' do
7
+ it 'returns array with markdown and metadata' do
8
+ html = '<html><head><title>Test</title></head><body><p>Content</p></body></html>'
9
+ result = described_class.convert_with_metadata(html)
10
+
11
+ expect(result).to be_an(Array)
12
+ expect(result.length).to eq(2)
13
+ expect(result[0]).to be_a(String)
14
+ expect(result[1]).to be_a(Hash)
15
+ end
16
+
17
+ context 'when extracting document metadata' do
18
+ it 'extracts title' do
19
+ html = '<html><head><title>My Page Title</title></head><body><p>Content</p></body></html>'
20
+ _, metadata = described_class.convert_with_metadata(html)
21
+
22
+ expect(metadata[:document][:title]).to eq('My Page Title')
23
+ end
24
+
25
+ it 'extracts description' do
26
+ html = <<~HTML
27
+ <html>
28
+ <head><meta name="description" content="Page description"></head>
29
+ <body><p>Content</p></body>
30
+ </html>
31
+ HTML
32
+ _, metadata = described_class.convert_with_metadata(html)
33
+
34
+ expect(metadata[:document][:description]).to eq('Page description')
35
+ end
36
+
37
+ it 'extracts keywords' do
38
+ html = <<~HTML
39
+ <html>
40
+ <head><meta name="keywords" content="keyword1, keyword2, keyword3"></head>
41
+ <body><p>Content</p></body>
42
+ </html>
43
+ HTML
44
+ _, metadata = described_class.convert_with_metadata(html)
45
+
46
+ expect(metadata[:document][:keywords]).to include('keyword1', 'keyword2', 'keyword3')
47
+ end
48
+
49
+ it 'extracts author' do
50
+ html = '<html><head><meta name="author" content="John Doe"></head><body><p>Content</p></body></html>'
51
+ _, metadata = described_class.convert_with_metadata(html)
52
+
53
+ expect(metadata[:document][:author]).to eq('John Doe')
54
+ end
55
+
56
+ it 'extracts base href' do
57
+ html = '<html><head><base href="https://example.com/"></head><body><p>Content</p></body></html>'
58
+ _, metadata = described_class.convert_with_metadata(html)
59
+
60
+ expect(metadata[:document][:base_href]).to eq('https://example.com/')
61
+ end
62
+
63
+ it 'extracts canonical URL' do
64
+ html = '<html><head><link rel="canonical" href="https://example.com/page"></head><body><p>Content</p></body></html>'
65
+ _, metadata = described_class.convert_with_metadata(html)
66
+
67
+ expect(metadata[:document][:canonical_url]).to eq('https://example.com/page')
68
+ end
69
+
70
+ it 'extracts language' do
71
+ html = '<html lang="en"><head></head><body><p>Content</p></body></html>'
72
+ _, metadata = described_class.convert_with_metadata(html)
73
+
74
+ expect(metadata[:document][:language]).to eq('en')
75
+ end
76
+
77
+ it 'extracts text direction' do
78
+ html = '<html dir="ltr"><head></head><body><p>Content</p></body></html>'
79
+ _, metadata = described_class.convert_with_metadata(html)
80
+
81
+ expect(metadata[:document][:text_direction]).to eq('ltr')
82
+ end
83
+
84
+ it 'extracts open graph metadata' do
85
+ html = <<~HTML
86
+ <html>
87
+ <head>
88
+ <meta property="og:title" content="OG Title">
89
+ <meta property="og:description" content="OG Description">
90
+ <meta property="og:image" content="https://example.com/image.jpg">
91
+ </head>
92
+ <body><p>Content</p></body>
93
+ </html>
94
+ HTML
95
+ _, metadata = described_class.convert_with_metadata(html)
96
+
97
+ expect(metadata[:document][:open_graph]).to include(
98
+ 'title' => 'OG Title',
99
+ 'description' => 'OG Description',
100
+ 'image' => 'https://example.com/image.jpg'
101
+ )
102
+ end
103
+
104
+ it 'extracts twitter card metadata' do
105
+ html = <<~HTML
106
+ <html>
107
+ <head>
108
+ <meta name="twitter:card" content="summary_large_image">
109
+ <meta name="twitter:title" content="Twitter Title">
110
+ </head>
111
+ <body><p>Content</p></body>
112
+ </html>
113
+ HTML
114
+ _, metadata = described_class.convert_with_metadata(html)
115
+
116
+ expect(metadata[:document][:twitter_card]).to include(
117
+ 'card' => 'summary_large_image',
118
+ 'title' => 'Twitter Title'
119
+ )
120
+ end
121
+
122
+ it 'returns empty arrays and hashes for missing metadata' do
123
+ html = '<p>Content</p>'
124
+ _, metadata = described_class.convert_with_metadata(html)
125
+
126
+ expect(metadata[:document][:title]).to be_nil
127
+ expect(metadata[:document][:description]).to be_nil
128
+ expect(metadata[:document][:keywords]).to eq([])
129
+ expect(metadata[:document][:open_graph]).to eq({})
130
+ expect(metadata[:document][:twitter_card]).to eq({})
131
+ expect(metadata[:document][:meta_tags]).to eq({})
132
+ end
133
+ end
134
+
135
+ context 'when extracting header metadata' do
136
+ it 'extracts headers with hierarchy' do
137
+ html = <<~HTML
138
+ <html>
139
+ <body>
140
+ <h1>Main Title</h1>
141
+ <h2>Section</h2>
142
+ <h3>Subsection</h3>
143
+ </body>
144
+ </html>
145
+ HTML
146
+ _, metadata = described_class.convert_with_metadata(html)
147
+
148
+ expect(metadata[:headers].length).to eq(3)
149
+ expect(metadata[:headers][0][:level]).to eq(1)
150
+ expect(metadata[:headers][0][:text]).to eq('Main Title')
151
+ expect(metadata[:headers][1][:level]).to eq(2)
152
+ expect(metadata[:headers][1][:text]).to eq('Section')
153
+ expect(metadata[:headers][2][:level]).to eq(3)
154
+ expect(metadata[:headers][2][:text]).to eq('Subsection')
155
+ end
156
+
157
+ it 'includes header id' do
158
+ html = '<html><body><h1 id="main-title">Title</h1></body></html>'
159
+ _, metadata = described_class.convert_with_metadata(html)
160
+
161
+ expect(metadata[:headers][0][:id]).to eq('main-title')
162
+ end
163
+
164
+ it 'includes depth and html_offset' do
165
+ html = '<html><body><h1>Title</h1></body></html>'
166
+ _, metadata = described_class.convert_with_metadata(html)
167
+
168
+ header = metadata[:headers][0]
169
+ expect(header).to include(:depth, :html_offset)
170
+ expect(header[:depth]).to be_a(Integer)
171
+ expect(header[:html_offset]).to be_a(Integer)
172
+ end
173
+ end
174
+
175
+ context 'when extracting link metadata' do
176
+ it 'extracts links with classification' do
177
+ html = <<~HTML
178
+ <html>
179
+ <body>
180
+ <a href="#section">Anchor</a>
181
+ <a href="https://example.com">External</a>
182
+ <a href="/page">Internal</a>
183
+ <a href="mailto:test@example.com">Email</a>
184
+ <a href="tel:+1234567890">Phone</a>
185
+ </body>
186
+ </html>
187
+ HTML
188
+ _, metadata = described_class.convert_with_metadata(html)
189
+
190
+ links = metadata[:links]
191
+ expect(links.length).to eq(5)
192
+
193
+ expect(links[0][:link_type]).to eq('anchor')
194
+ expect(links[1][:link_type]).to eq('external')
195
+ expect(links[2][:link_type]).to eq('internal')
196
+ expect(links[3][:link_type]).to eq('email')
197
+ expect(links[4][:link_type]).to eq('phone')
198
+ end
199
+
200
+ it 'includes link text and href' do
201
+ html = '<html><body><a href="https://example.com">Click here</a></body></html>'
202
+ _, metadata = described_class.convert_with_metadata(html)
203
+
204
+ link = metadata[:links][0]
205
+ expect(link[:href]).to eq('https://example.com')
206
+ expect(link[:text]).to eq('Click here')
207
+ end
208
+
209
+ it 'includes link title attribute' do
210
+ html = '<html><body><a href="https://example.com" title="Example Site">Link</a></body></html>'
211
+ _, metadata = described_class.convert_with_metadata(html)
212
+
213
+ link = metadata[:links][0]
214
+ expect(link[:title]).to eq('Example Site')
215
+ end
216
+
217
+ it 'includes link rel attributes' do
218
+ html = '<html><body><a href="https://example.com" rel="nofollow external">Link</a></body></html>'
219
+ _, metadata = described_class.convert_with_metadata(html)
220
+
221
+ link = metadata[:links][0]
222
+ expect(link[:rel]).to include('nofollow', 'external')
223
+ end
224
+
225
+ it 'includes link attributes' do
226
+ html = '<html><body><a href="https://example.com" data-custom="value">Link</a></body></html>'
227
+ _, metadata = described_class.convert_with_metadata(html)
228
+
229
+ link = metadata[:links][0]
230
+ expect(link[:attributes]).to include('data-custom' => 'value')
231
+ end
232
+ end
233
+
234
+ context 'when extracting image metadata' do
235
+ it 'extracts images with source type' do
236
+ html = <<~HTML
237
+ <html>
238
+ <body>
239
+ <img src="https://example.com/image.jpg" alt="External">
240
+ <img src="/images/local.jpg" alt="Relative">
241
+ <img src="data:image/png;base64,..." alt="Data URI">
242
+ </body>
243
+ </html>
244
+ HTML
245
+ _, metadata = described_class.convert_with_metadata(html)
246
+
247
+ images = metadata[:images]
248
+ expect(images.length).to eq(3)
249
+
250
+ expect(images[0][:image_type]).to eq('external')
251
+ expect(images[1][:image_type]).to eq('relative')
252
+ expect(images[2][:image_type]).to eq('data_uri')
253
+ end
254
+
255
+ it 'includes image alt and title' do
256
+ html = '<html><body><img src="image.jpg" alt="Alt text" title="Image title"></body></html>'
257
+ _, metadata = described_class.convert_with_metadata(html)
258
+
259
+ image = metadata[:images][0]
260
+ expect(image[:alt]).to eq('Alt text')
261
+ expect(image[:title]).to eq('Image title')
262
+ end
263
+
264
+ it 'includes image dimensions' do
265
+ html = '<html><body><img src="image.jpg" width="800" height="600"></body></html>'
266
+ _, metadata = described_class.convert_with_metadata(html)
267
+
268
+ image = metadata[:images][0]
269
+ expect(image[:dimensions]).to be_an(Array)
270
+ expect(image[:dimensions].length).to eq(2)
271
+ end
272
+
273
+ it 'handles missing image attributes' do
274
+ html = '<html><body><img src="image.jpg"></body></html>'
275
+ _, metadata = described_class.convert_with_metadata(html)
276
+
277
+ image = metadata[:images][0]
278
+ expect(image[:alt]).to be_nil
279
+ expect(image[:title]).to be_nil
280
+ end
281
+ end
282
+
283
+ context 'with metadata configuration flags' do
284
+ it 'respects extract_headers flag' do
285
+ html = '<html><body><h1>Title</h1><p>Content</p></body></html>'
286
+ config = { extract_headers: false }
287
+ _, metadata = described_class.convert_with_metadata(html, nil, config)
288
+
289
+ expect(metadata[:headers]).to eq([])
290
+ end
291
+
292
+ it 'respects extract_links flag' do
293
+ html = '<html><body><a href="https://example.com">Link</a></body></html>'
294
+ config = { extract_links: false }
295
+ _, metadata = described_class.convert_with_metadata(html, nil, config)
296
+
297
+ expect(metadata[:links]).to eq([])
298
+ end
299
+
300
+ it 'respects extract_images flag' do
301
+ html = '<html><body><img src="image.jpg" alt="test"></body></html>'
302
+ config = { extract_images: false }
303
+ _, metadata = described_class.convert_with_metadata(html, nil, config)
304
+
305
+ expect(metadata[:images]).to eq([])
306
+ end
307
+
308
+ it 'respects extract_structured_data flag' do
309
+ html = '<html><body><script type="application/ld+json">{"@type":"Article"}</script></body></html>'
310
+ config = { extract_structured_data: false }
311
+ _, metadata = described_class.convert_with_metadata(html, nil, config)
312
+
313
+ expect(metadata[:structured_data]).to eq([])
314
+ end
315
+ end
316
+
317
+ context 'with conversion options and metadata config' do
318
+ it 'accepts both conversion options and metadata config' do
319
+ html = '<html><head><title>Test</title></head><body><h1>Heading</h1></body></html>'
320
+ conv_opts = { heading_style: :atx_closed }
321
+ meta_opts = { extract_headers: true }
322
+
323
+ markdown, metadata = described_class.convert_with_metadata(html, conv_opts, meta_opts)
324
+
325
+ expect(markdown).to include('# Heading #')
326
+ expect(metadata[:headers].length).to eq(1)
327
+ end
328
+
329
+ it 'works with nil options' do
330
+ html = '<html><head><title>Test</title></head><body><p>Content</p></body></html>'
331
+ result = described_class.convert_with_metadata(html, nil, nil)
332
+
333
+ expect(result).to be_an(Array)
334
+ expect(result.length).to eq(2)
335
+ end
336
+ end
337
+
338
+ context 'when extracting structured data' do
339
+ it 'extracts JSON-LD blocks' do
340
+ html = <<~HTML
341
+ <html>
342
+ <head>
343
+ <script type="application/ld+json">
344
+ {"@context":"https://schema.org","@type":"Article","headline":"Test"}
345
+ </script>
346
+ </head>
347
+ <body><p>Content</p></body>
348
+ </html>
349
+ HTML
350
+ _, metadata = described_class.convert_with_metadata(html)
351
+
352
+ # Structured data extraction may vary by implementation
353
+ expect(metadata[:structured_data]).to be_an(Array)
354
+ end
355
+ end
356
+
357
+ context 'with edge cases' do
358
+ it 'handles empty HTML' do
359
+ html = ''
360
+ markdown, metadata = described_class.convert_with_metadata(html)
361
+
362
+ expect(markdown).to be_a(String)
363
+ expect(metadata).to be_a(Hash)
364
+ end
365
+
366
+ it 'handles malformed HTML' do
367
+ html = '<html><head><title>Unclosed'
368
+ markdown, metadata = described_class.convert_with_metadata(html)
369
+
370
+ expect(markdown).to be_a(String)
371
+ expect(metadata).to be_a(Hash)
372
+ end
373
+
374
+ it 'handles special characters in metadata' do
375
+ html = '<html><head><title>Title with "quotes" & <brackets></title></head><body><p>Content</p></body></html>'
376
+ _, metadata = described_class.convert_with_metadata(html)
377
+
378
+ expect(metadata[:document][:title]).to be_a(String)
379
+ end
380
+
381
+ it 'handles whitespace in metadata' do
382
+ html = '<html><head><title> Title with spaces </title></head><body><p>Content</p></body></html>'
383
+ _, metadata = described_class.convert_with_metadata(html)
384
+
385
+ # Whitespace may be normalized
386
+ expect(metadata[:document][:title]).to match(/Title.*spaces/)
387
+ end
388
+
389
+ it 'handles multiple values for same metadata key' do
390
+ html = <<~HTML
391
+ <html>
392
+ <head>
393
+ <meta name="author" content="Author 1">
394
+ <meta name="author" content="Author 2">
395
+ </head>
396
+ <body><p>Content</p></body>
397
+ </html>
398
+ HTML
399
+ _, metadata = described_class.convert_with_metadata(html)
400
+
401
+ # Last value typically wins, but implementation may vary
402
+ expect(metadata[:document][:author]).to be_a(String)
403
+ end
404
+ end
405
+
406
+ context 'when returning value structure' do
407
+ it 'returns proper metadata hash structure' do
408
+ html = <<~HTML
409
+ <html>
410
+ <head><title>Test</title><base href="https://example.com"></head>
411
+ <body><h1>H1</h1><a href="link">Link</a><img src="img.jpg"></body>
412
+ </html>
413
+ HTML
414
+ _, metadata = described_class.convert_with_metadata(html)
415
+
416
+ expect(metadata).to include(
417
+ :document,
418
+ :headers,
419
+ :links,
420
+ :images,
421
+ :structured_data
422
+ )
423
+
424
+ expect(metadata[:document]).to include(
425
+ :title,
426
+ :description,
427
+ :keywords,
428
+ :author,
429
+ :canonical_url,
430
+ :base_href,
431
+ :language,
432
+ :text_direction,
433
+ :open_graph,
434
+ :twitter_card,
435
+ :meta_tags
436
+ )
437
+ end
438
+ end
439
+ end
440
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: html-to-markdown
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.12.0
4
+ version: 2.13.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Na'aman Hirschfeld
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2025-12-08 00:00:00.000000000 Z
11
+ date: 2025-12-10 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rb_sys
@@ -46,6 +46,7 @@ files:
46
46
  - ".rubocop.yml"
47
47
  - Gemfile
48
48
  - Gemfile.lock
49
+ - METADATA.md
49
50
  - README.md
50
51
  - Rakefile
51
52
  - Steepfile
@@ -67,6 +68,7 @@ files:
67
68
  - sig/open3.rbs
68
69
  - spec/cli_proxy_spec.rb
69
70
  - spec/convert_spec.rb
71
+ - spec/metadata_extraction_spec.rb
70
72
  - spec/spec_helper.rb
71
73
  homepage: https://github.com/Goldziher/html-to-markdown
72
74
  licenses: