html-to-markdown 2.30.0 → 3.0.0

Files changed (166)
  1. checksums.yaml +4 -4
  2. data/Gemfile.lock +4 -14
  3. data/README.md +37 -50
  4. data/ext/html-to-markdown-rb/native/Cargo.lock +13 -701
  5. data/ext/html-to-markdown-rb/native/Cargo.toml +1 -4
  6. data/ext/html-to-markdown-rb/native/README.md +4 -13
  7. data/ext/html-to-markdown-rb/native/src/conversion/inline_images.rs +2 -73
  8. data/ext/html-to-markdown-rb/native/src/conversion/metadata.rs +5 -49
  9. data/ext/html-to-markdown-rb/native/src/conversion/mod.rs +0 -6
  10. data/ext/html-to-markdown-rb/native/src/lib.rs +76 -213
  11. data/ext/html-to-markdown-rb/native/src/options.rs +0 -3
  12. data/lib/html_to_markdown/version.rb +1 -1
  13. data/lib/html_to_markdown.rb +13 -194
  14. data/sig/html_to_markdown.rbs +12 -373
  15. data/vendor/Cargo.toml +5 -2
  16. data/vendor/html-to-markdown-rs/Cargo.toml +4 -10
  17. data/vendor/html-to-markdown-rs/README.md +126 -52
  18. data/vendor/html-to-markdown-rs/examples/basic.rs +6 -1
  19. data/vendor/html-to-markdown-rs/examples/table.rs +6 -1
  20. data/vendor/html-to-markdown-rs/examples/test_escape.rs +6 -1
  21. data/vendor/html-to-markdown-rs/examples/test_inline_formatting.rs +8 -2
  22. data/vendor/html-to-markdown-rs/examples/test_lists.rs +6 -1
  23. data/vendor/html-to-markdown-rs/examples/test_semantic_tags.rs +6 -1
  24. data/vendor/html-to-markdown-rs/examples/test_tables.rs +6 -1
  25. data/vendor/html-to-markdown-rs/examples/test_task_lists.rs +6 -1
  26. data/vendor/html-to-markdown-rs/examples/test_whitespace.rs +6 -1
  27. data/vendor/html-to-markdown-rs/src/convert_api.rs +151 -745
  28. data/vendor/html-to-markdown-rs/src/converter/block/blockquote.rs +3 -5
  29. data/vendor/html-to-markdown-rs/src/converter/block/div.rs +1 -7
  30. data/vendor/html-to-markdown-rs/src/converter/block/heading.rs +18 -5
  31. data/vendor/html-to-markdown-rs/src/converter/block/paragraph.rs +10 -0
  32. data/vendor/html-to-markdown-rs/src/converter/block/preformatted.rs +3 -5
  33. data/vendor/html-to-markdown-rs/src/converter/block/table/builder.rs +16 -11
  34. data/vendor/html-to-markdown-rs/src/converter/block/table/cell.rs +20 -0
  35. data/vendor/html-to-markdown-rs/src/converter/block/table/cells.rs +4 -17
  36. data/vendor/html-to-markdown-rs/src/converter/block/table/mod.rs +140 -0
  37. data/vendor/html-to-markdown-rs/src/converter/block/table/scanner.rs +4 -18
  38. data/vendor/html-to-markdown-rs/src/converter/block/table/utils.rs +2 -18
  39. data/vendor/html-to-markdown-rs/src/converter/context.rs +8 -0
  40. data/vendor/html-to-markdown-rs/src/converter/dom_context.rs +1 -6
  41. data/vendor/html-to-markdown-rs/src/converter/form/elements.rs +14 -14
  42. data/vendor/html-to-markdown-rs/src/converter/handlers/blockquote.rs +4 -5
  43. data/vendor/html-to-markdown-rs/src/converter/handlers/code_block.rs +5 -10
  44. data/vendor/html-to-markdown-rs/src/converter/handlers/graphic.rs +3 -5
  45. data/vendor/html-to-markdown-rs/src/converter/handlers/image.rs +3 -5
  46. data/vendor/html-to-markdown-rs/src/converter/handlers/link.rs +3 -5
  47. data/vendor/html-to-markdown-rs/src/converter/inline/code.rs +3 -5
  48. data/vendor/html-to-markdown-rs/src/converter/inline/emphasis.rs +4 -10
  49. data/vendor/html-to-markdown-rs/src/converter/inline/link.rs +4 -170
  50. data/vendor/html-to-markdown-rs/src/converter/inline/semantic/marks.rs +7 -19
  51. data/vendor/html-to-markdown-rs/src/converter/list/item.rs +3 -5
  52. data/vendor/html-to-markdown-rs/src/converter/list/ordered.rs +4 -10
  53. data/vendor/html-to-markdown-rs/src/converter/list/unordered.rs +6 -12
  54. data/vendor/html-to-markdown-rs/src/converter/list/utils.rs +1 -12
  55. data/vendor/html-to-markdown-rs/src/converter/main.rs +85 -56
  56. data/vendor/html-to-markdown-rs/src/converter/main_helpers.rs +4 -68
  57. data/vendor/html-to-markdown-rs/src/converter/media/embedded.rs +1 -5
  58. data/vendor/html-to-markdown-rs/src/converter/media/graphic.rs +3 -40
  59. data/vendor/html-to-markdown-rs/src/converter/media/image.rs +0 -8
  60. data/vendor/html-to-markdown-rs/src/converter/media/svg.rs +3 -13
  61. data/vendor/html-to-markdown-rs/src/converter/metadata.rs +1 -1
  62. data/vendor/html-to-markdown-rs/src/converter/mod.rs +0 -8
  63. data/vendor/html-to-markdown-rs/src/converter/plain_text.rs +37 -12
  64. data/vendor/html-to-markdown-rs/src/converter/semantic/attributes.rs +5 -30
  65. data/vendor/html-to-markdown-rs/src/converter/semantic/figure.rs +29 -0
  66. data/vendor/html-to-markdown-rs/src/converter/text/escaping.rs +1 -36
  67. data/vendor/html-to-markdown-rs/src/converter/text/mod.rs +1 -3
  68. data/vendor/html-to-markdown-rs/src/converter/text/normalization.rs +0 -53
  69. data/vendor/html-to-markdown-rs/src/converter/text_node.rs +1 -1
  70. data/vendor/html-to-markdown-rs/src/converter/utility/attributes.rs +0 -41
  71. data/vendor/html-to-markdown-rs/src/converter/utility/caching.rs +2 -1
  72. data/vendor/html-to-markdown-rs/src/converter/utility/content.rs +15 -98
  73. data/vendor/html-to-markdown-rs/src/converter/utility/preprocessing.rs +113 -4
  74. data/vendor/html-to-markdown-rs/src/converter/utility/serialization.rs +3 -0
  75. data/vendor/html-to-markdown-rs/src/converter/visitor_hooks.rs +4 -10
  76. data/vendor/html-to-markdown-rs/src/exports.rs +1 -4
  77. data/vendor/html-to-markdown-rs/src/inline_images.rs +1 -1
  78. data/vendor/html-to-markdown-rs/src/lib.rs +13 -133
  79. data/vendor/html-to-markdown-rs/src/metadata/collector.rs +4 -4
  80. data/vendor/html-to-markdown-rs/src/metadata/mod.rs +22 -22
  81. data/vendor/html-to-markdown-rs/src/metadata/types.rs +3 -3
  82. data/vendor/html-to-markdown-rs/src/options/conversion.rs +351 -323
  83. data/vendor/html-to-markdown-rs/src/options/preprocessing.rs +8 -2
  84. data/vendor/html-to-markdown-rs/src/prelude.rs +1 -15
  85. data/vendor/html-to-markdown-rs/src/rcdom.rs +7 -1
  86. data/vendor/html-to-markdown-rs/src/text.rs +25 -14
  87. data/vendor/html-to-markdown-rs/src/types/document.rs +175 -0
  88. data/vendor/html-to-markdown-rs/src/types/mod.rs +17 -0
  89. data/vendor/html-to-markdown-rs/src/types/result.rs +49 -0
  90. data/vendor/html-to-markdown-rs/src/types/structure_builder.rs +790 -0
  91. data/vendor/html-to-markdown-rs/src/types/structure_collector.rs +442 -0
  92. data/vendor/html-to-markdown-rs/src/types/tables.rs +47 -0
  93. data/vendor/html-to-markdown-rs/src/types/warnings.rs +28 -0
  94. data/vendor/html-to-markdown-rs/src/visitor/mod.rs +0 -6
  95. data/vendor/html-to-markdown-rs/src/visitor/traits.rs +0 -1
  96. data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/mod.rs +1 -21
  97. data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/mod.rs +0 -5
  98. data/vendor/html-to-markdown-rs/src/visitor_helpers.rs +1 -845
  99. data/vendor/html-to-markdown-rs/tests/br_in_inline_test.rs +8 -1
  100. data/vendor/html-to-markdown-rs/tests/commonmark_compliance_test.rs +8 -8
  101. data/vendor/html-to-markdown-rs/tests/djot_output_test.rs +8 -2
  102. data/vendor/html-to-markdown-rs/tests/integration_test.rs +23 -6
  103. data/vendor/html-to-markdown-rs/tests/issue_121_regressions.rs +8 -1
  104. data/vendor/html-to-markdown-rs/tests/issue_127_regressions.rs +8 -2
  105. data/vendor/html-to-markdown-rs/tests/issue_128_regressions.rs +6 -1
  106. data/vendor/html-to-markdown-rs/tests/issue_131_regressions.rs +8 -1
  107. data/vendor/html-to-markdown-rs/tests/issue_134_regressions.rs +8 -1
  108. data/vendor/html-to-markdown-rs/tests/issue_139_regressions.rs +8 -1
  109. data/vendor/html-to-markdown-rs/tests/issue_140_regressions.rs +8 -1
  110. data/vendor/html-to-markdown-rs/tests/issue_143_regressions.rs +8 -1
  111. data/vendor/html-to-markdown-rs/tests/issue_145_regressions.rs +8 -7
  112. data/vendor/html-to-markdown-rs/tests/issue_146_regressions.rs +8 -7
  113. data/vendor/html-to-markdown-rs/tests/issue_176_regressions.rs +12 -2
  114. data/vendor/html-to-markdown-rs/tests/issue_190_regressions.rs +8 -1
  115. data/vendor/html-to-markdown-rs/tests/issue_199_regressions.rs +6 -1
  116. data/vendor/html-to-markdown-rs/tests/issue_200_regressions.rs +6 -1
  117. data/vendor/html-to-markdown-rs/tests/issue_212_regressions.rs +6 -1
  118. data/vendor/html-to-markdown-rs/tests/issue_216_217_regressions.rs +6 -1
  119. data/vendor/html-to-markdown-rs/tests/json_ld_script_extraction.rs +4 -6
  120. data/vendor/html-to-markdown-rs/tests/lists_test.rs +8 -1
  121. data/vendor/html-to-markdown-rs/tests/plain_output_test.rs +8 -2
  122. data/vendor/html-to-markdown-rs/tests/preprocessing_tests.rs +8 -1
  123. data/vendor/html-to-markdown-rs/tests/skip_images_test.rs +8 -11
  124. data/vendor/html-to-markdown-rs/tests/tables_test.rs +12 -2
  125. data/vendor/html-to-markdown-rs/tests/test_custom_elements.rs +8 -1
  126. data/vendor/html-to-markdown-rs/tests/test_nested_simple.rs +8 -1
  127. data/vendor/html-to-markdown-rs/tests/test_script_style_stripping.rs +17 -28
  128. data/vendor/html-to-markdown-rs/tests/test_spa_bisect.rs +8 -1
  129. data/vendor/html-to-markdown-rs/tests/visitor_integration_test.rs +29 -33
  130. data/vendor/html-to-markdown-rs/tests/xml_tables_test.rs +8 -1
  131. metadata +9 -37
  132. data/bin/benchmark.rb +0 -232
  133. data/ext/html-to-markdown-rb/native/src/conversion/tables.rs +0 -71
  134. data/ext/html-to-markdown-rb/native/src/profiling.rs +0 -215
  135. data/ext/html-to-markdown-rb/native/src/visitor/bridge.rs +0 -252
  136. data/ext/html-to-markdown-rb/native/src/visitor/callbacks.rs +0 -640
  137. data/ext/html-to-markdown-rb/native/src/visitor/mod.rs +0 -12
  138. data/spec/convert_spec.rb +0 -77
  139. data/spec/convert_with_tables_spec.rb +0 -194
  140. data/spec/metadata_extraction_spec.rb +0 -437
  141. data/spec/visitor_issue_187_spec.rb +0 -605
  142. data/spec/visitor_spec.rb +0 -1149
  143. data/vendor/html-to-markdown-rs/src/hocr/converter/code_analysis.rs +0 -254
  144. data/vendor/html-to-markdown-rs/src/hocr/converter/core.rs +0 -249
  145. data/vendor/html-to-markdown-rs/src/hocr/converter/elements.rs +0 -382
  146. data/vendor/html-to-markdown-rs/src/hocr/converter/hierarchy.rs +0 -379
  147. data/vendor/html-to-markdown-rs/src/hocr/converter/keywords.rs +0 -55
  148. data/vendor/html-to-markdown-rs/src/hocr/converter/layout.rs +0 -313
  149. data/vendor/html-to-markdown-rs/src/hocr/converter/mod.rs +0 -26
  150. data/vendor/html-to-markdown-rs/src/hocr/converter/output.rs +0 -78
  151. data/vendor/html-to-markdown-rs/src/hocr/extractor.rs +0 -232
  152. data/vendor/html-to-markdown-rs/src/hocr/mod.rs +0 -42
  153. data/vendor/html-to-markdown-rs/src/hocr/parser.rs +0 -333
  154. data/vendor/html-to-markdown-rs/src/hocr/spatial/coords.rs +0 -129
  155. data/vendor/html-to-markdown-rs/src/hocr/spatial/grouping.rs +0 -165
  156. data/vendor/html-to-markdown-rs/src/hocr/spatial/layout.rs +0 -335
  157. data/vendor/html-to-markdown-rs/src/hocr/spatial/mod.rs +0 -15
  158. data/vendor/html-to-markdown-rs/src/hocr/spatial/output.rs +0 -63
  159. data/vendor/html-to-markdown-rs/src/hocr/types.rs +0 -269
  160. data/vendor/html-to-markdown-rs/src/visitor/async_traits.rs +0 -249
  161. data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge.rs +0 -189
  162. data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge_visitor.rs +0 -343
  163. data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/macros.rs +0 -217
  164. data/vendor/html-to-markdown-rs/tests/async_visitor_test.rs +0 -57
  165. data/vendor/html-to-markdown-rs/tests/convert_with_metadata_no_frontmatter.rs +0 -100
  166. data/vendor/html-to-markdown-rs/tests/hocr_compliance_test.rs +0 -509
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a59dd088c63edcda3711c276290e4377db037bd7a7cab8beb8bcbe83cf52a6f7
- data.tar.gz: f49297af30be7e708bbca200099d471dbaadc0c98b2916489f17e83c07644ba7
+ metadata.gz: 823ad44845a919191f6b64697599dab71cd1a6e0d6a9a1f222fa0500ef66eb9a
+ data.tar.gz: 360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52
  SHA512:
- metadata.gz: 94a8a04d0c146886a27c184ec5aacd83ed1619611b6666d9a98e3088c1fae5421b38a7b83dc557decb14a49e05f2b70d9ae142d3db61ab8dbf2bde165ee67dcc
- data.tar.gz: b6676434b9dcf908f84d1803de65e2bbc86c4413eae4bd861d7732acca3b6938f9184ffb6b6e32b5326e7f56003fd44c67b851863f3b63931b69a388fb2bcfa6
+ metadata.gz: 664d63d8e57a17085da777089fff023db75b840f686ea84e12d9529bedc8d0405c4c9027f565d7c5cc8a5e01e42d3e7d81c356668b858f1f22451dd80349bceb
+ data.tar.gz: cd46a114ee8a200cbfa42e653b4ca24a2a2ef884c4981d366ab01dbd23e2bdc13f304bf12c9d6b367b480c3bca5e4e98f82805a8862320ac40931a42904b5ee6
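
The digests in the checksums.yaml diff can be checked against a locally downloaded copy of the gem. The following is a minimal sketch rather than an official procedure: it assumes the `.gem` archive has already been unpacked so that `data.tar.gz` sits in the current directory, and the helper name `sha256_matches?` and the `path` value are hypothetical.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA-256 digest of the file at
# `path` equals `expected_hex` (the expected value is lowercased first).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Expected value copied from the 3.0.0 side of the checksums.yaml diff.
EXPECTED_DATA_SHA256 = '360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52'

# Assumed location of the unpacked archive; adjust to wherever the file lives.
path = 'data.tar.gz'
if File.exist?(path)
  puts sha256_matches?(path, EXPECTED_DATA_SHA256) ? 'checksum OK' : 'checksum MISMATCH'
else
  puts "#{path} not found; nothing to verify"
end
```

The same helper works for `metadata.gz` by swapping in the corresponding SHA256 value from the diff.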
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- html-to-markdown (2.30.0)
+ html-to-markdown (3.0.0)
  rb_sys (>= 0.9, < 1.0)
 
  GEM
@@ -22,16 +22,13 @@ GEM
  uri (>= 0.13.1)
  ast (2.4.3)
  base64 (0.3.0)
- bigdecimal (4.0.1)
+ bigdecimal (4.1.0)
  concurrent-ruby (1.3.6)
  connection_pool (3.0.2)
  csv (3.3.5)
  diff-lcs (1.6.2)
  drb (2.2.3)
- ffi (1.17.4-aarch64-linux-gnu)
  ffi (1.17.4-arm64-darwin)
- ffi (1.17.4-x64-mingw-ucrt)
- ffi (1.17.4-x86_64-darwin)
  ffi (1.17.4-x86_64-linux-gnu)
  fileutils (1.8.0)
  i18n (1.14.8)
@@ -129,12 +126,8 @@ GEM
  uri (1.1.1)
 
  PLATFORMS
- aarch64-linux
  arm64-darwin
- x64-mingw-ucrt
- x86_64-darwin
  x86_64-linux
- x86_64-linux-gnu
 
  DEPENDENCIES
  html-to-markdown!
@@ -150,19 +143,16 @@ CHECKSUMS
  activesupport (8.1.3) sha256=21a5e0dfbd4c3ddd9e1317ec6a4d782fa226e7867dc70b0743acda81a1dca20e
  ast (2.4.3) sha256=954615157c1d6a382bc27d690d973195e79db7f55e9765ac7c481c60bdb4d383
  base64 (0.3.0) sha256=27337aeabad6ffae05c265c450490628ef3ebd4b67be58257393227588f5a97b
- bigdecimal (4.0.1) sha256=8b07d3d065a9f921c80ceaea7c9d4ae596697295b584c296fe599dd0ad01c4a7
+ bigdecimal (4.1.0) sha256=6dc07767aa3dc456ccd48e7ae70a07b474e9afd7c5bc576f80bd6da5c8dd6cae
  concurrent-ruby (1.3.6) sha256=6b56837e1e7e5292f9864f34b69c5a2cbc75c0cf5338f1ce9903d10fa762d5ab
  connection_pool (3.0.2) sha256=33fff5ba71a12d2aa26cb72b1db8bba2a1a01823559fb01d29eb74c286e62e0a
  csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
  diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
  drb (2.2.3) sha256=0b00d6fdb50995fe4a45dea13663493c841112e4068656854646f418fda13373
- ffi (1.17.4-aarch64-linux-gnu) sha256=b208f06f91ffd8f5e1193da3cae3d2ccfc27fc36fba577baf698d26d91c080df
  ffi (1.17.4-arm64-darwin) sha256=19071aaf1419251b0a46852abf960e77330a3b334d13a4ab51d58b31a937001b
- ffi (1.17.4-x64-mingw-ucrt) sha256=f6ff9618cfccc494138bddade27aa06c74c6c7bc367a1ea1103d80c2fcb9ed35
- ffi (1.17.4-x86_64-darwin) sha256=aa70390523cf3235096cf64962b709b4cfbd5c082a2cb2ae714eb0fe2ccda496
  ffi (1.17.4-x86_64-linux-gnu) sha256=9d3db14c2eae074b382fa9c083fe95aec6e0a1451da249eab096c34002bc752d
  fileutils (1.8.0) sha256=8c6b1df54e2540bdb2f39258f08af78853aa70bad52b4d394bbc6424593c6e02
- html-to-markdown (2.30.0)
+ html-to-markdown (3.0.0)
  i18n (1.14.8) sha256=285778639134865c5e0f6269e0b818256017e8cde89993fdfcbfb64d088824a5
  json (2.19.3) sha256=289b0bb53052a1fa8c34ab33cc750b659ba14a5c45f3fcf4b18762dc67c78646
  language_server-protocol (3.17.0.5) sha256=fd1e39a51a28bf3eec959379985a72e296e9f9acfce46f6a79d31ca8760803cc
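
The README diff that follows shows `HtmlToMarkdown.convert` changing from returning a Markdown String (2.30.0) to returning a result that is read with `result[:content]` (3.0.0). For code that has to run against both versions during an upgrade, one option is a small normalizing helper. The sketch below is an illustration under assumptions, not part of the gem: `markdown_from` is a hypothetical name, and it relies only on the result answering `[:content]`, as the updated README examples do. Plain Ruby values stand in for the two return shapes so the snippet runs without the gem installed.

```ruby
# Hypothetical helper (not part of html-to-markdown): normalizes the return
# value of HtmlToMarkdown.convert across the 2.x and 3.x shapes shown in the diff.
def markdown_from(result)
  # 2.x: convert returned the Markdown String directly.
  return result if result.is_a?(String)

  # 3.x (per the updated README): the Markdown is read via result[:content].
  content = result[:content]
  raise ArgumentError, 'conversion result has no :content' if content.nil?

  content
end

# Stand-ins for the two return shapes, so the sketch runs without the gem.
legacy_result = "# Hello\n"
new_result = { content: "# Hello\n", warnings: [] }

puts markdown_from(legacy_result) == markdown_from(new_result)
```

In real use the argument would be whatever `HtmlToMarkdown.convert(html)` returns; once every caller is on 3.0.0 the helper can be dropped in favor of reading `result[:content]` directly.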
data/README.md CHANGED
@@ -17,8 +17,8 @@
  <a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
  <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
  </a>
- <a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v2/htmltomarkdown">
- <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=v2.29.0" alt="Go">
+ <a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown">
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=v3.0.0" alt="Go">
  </a>
  <a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
  <img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
@@ -87,7 +87,6 @@ Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
  | Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |
 
 
- See [Performance Guide](../../examples/performance/) for detailed benchmarks.
 
 
  ## Quick Start
@@ -98,7 +97,8 @@ Basic conversion:
  require 'html_to_markdown'
 
  html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
- markdown = HtmlToMarkdown.convert(html)
+ result = HtmlToMarkdown.convert(html)
+ markdown = result[:content]
  ```
 
 
@@ -109,60 +109,50 @@ With conversion options:
  require 'html_to_markdown'
 
  html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
- markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
+ result = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
+ markdown = result[:content]
  ```
 
 
 
 
-
-
  ## API Reference
 
- ### Core Functions
-
-
- **`convert(html, options: nil) -> String`**
+ ### Core Function
 
- Basic HTML-to-Markdown conversion. Fast and simple.
 
- **`convert_with_metadata(html, options: nil, config: nil) -> [String, Hash]`**
+ **`convert(html, options: nil, visitor: nil) -> ConversionResult`**
 
- Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See [Metadata Extraction Guide](../../examples/metadata-extraction/).
+ Converts HTML to Markdown. Returns a `ConversionResult` hash with all results in a single call.
 
- **`convert_with_visitor(html, visitor:, options: nil) -> String`**
-
- Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
-
- **`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
-
- Extract base64-encoded inline images with metadata.
-
- **`convert_with_tables(html, options: nil, config: nil) -> ConversionWithTables`**
+ ```ruby
+ require 'html_to_markdown'
 
- Extract structured table data (cells, headers, rendered markdown) alongside conversion.
+ result = HtmlToMarkdown.convert(html)
+ markdown = result[:content] # Converted Markdown string
+ metadata = result[:metadata] # Metadata (when extract_metadata: true)
+ tables = result[:tables] # Structured table data (when extract_tables: true)
+ document = result[:document] # Document-level info
+ images = result[:images] # Extracted images
+ warnings = result[:warnings] # Any conversion warnings
+ ```
 
 
 
  ### Options
 
  **`ConversionOptions`** – Key configuration fields:
+
  - `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
  - `list_indent_width`: Spaces per indent level — default: `2`
  - `bullets`: Bullet characters cycle — default: `"*+-"`
  - `wrap`: Enable text wrapping — default: `false`
  - `wrap_width`: Wrap at column — default: `80`
  - `code_language`: Default fenced code block language — default: none
- - `extract_metadata`: Embed metadata as YAML frontmatter — default: `false`
+ - `extract_metadata`: Enable metadata extraction into `result.metadata` — default: `false`
+ - `extract_tables`: Enable structured table extraction into `result.tables` — default: `false`
  - `output_format`: Output markup format (`"markdown"` | `"djot"` | `"plain"`) — default: `"markdown"`
 
- **`MetadataConfig`** – Selective metadata extraction:
- - `extract_headers`: h1-h6 elements — default: `true`
- - `extract_links`: Hyperlinks — default: `true`
- - `extract_images`: Image elements — default: `true`
- - `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
- - `max_structured_data_size`: Size limit in bytes — default: `100KB`
-
 
  ## Djot Output Format
 
@@ -222,16 +212,17 @@ Plain text mode is useful for search indexing, text extraction, and feeding cont
 
  ## Metadata Extraction
 
- The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
+ The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass — all via the standard `convert()` function.
 
  **Use Cases:**
+
  - **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
  - **Table of contents generation** – Build structured outlines from heading hierarchy
  - **Content migration** – Document all external links and resources
  - **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
  - **Link validation** – Classify and validate anchor, internal, external, email, and phone links
 
- **Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Disable unused metadata types in `MetadataConfig` to optimize further.
+ **Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Pass `extract_metadata: true` in `ConversionOptions` to enable it; the result is available at `result.metadata`.
 
  ### Example: Quick Start
 
@@ -240,27 +231,27 @@ The metadata extraction feature enables comprehensive document analysis during c
  require 'html_to_markdown'
 
  html = '<h1>Article</h1><img src="test.jpg" alt="test">'
- markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
-
- puts metadata[:document][:title] # Document title
- puts metadata[:headers] # All h1-h6 elements
- puts metadata[:links] # All hyperlinks
- puts metadata[:images] # All images with alt text
- puts metadata[:structured_data] # JSON-LD, Microdata, RDFa
+ result = HtmlToMarkdown.convert(html, extract_metadata: true)
+
+ puts result[:content] # Converted Markdown
+ puts result[:metadata][:document][:title] # Document title
+ puts result[:metadata][:headers] # All h1-h6 elements
+ puts result[:metadata][:links] # All hyperlinks
+ puts result[:metadata][:images] # All images with alt text
+ puts result[:metadata][:structured_data] # JSON-LD, Microdata, RDFa
  ```
 
 
 
- For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
-
 
 
 
  ## Visitor Pattern
 
- The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.
+ The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Pass a visitor as the third argument to `convert()`.
 
  **Use Cases:**
+
  - **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
  - **Content filtering** – Remove tracking pixels, ads, or unwanted elements
  - **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
@@ -291,20 +282,16 @@ class MyVisitor
  end
 
  html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
- markdown = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
+ result = HtmlToMarkdown.convert(html, visitor: MyVisitor.new)
+ markdown = result[:content]
  ```
 
 
 
- For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
-
 
 
  ## Examples
 
- - [Visitor Pattern Guide](../../examples/visitor-pattern/)
- - [Metadata Extraction Guide](../../examples/metadata-extraction/)
- - [Performance Guide](../../examples/performance/)
 
  ## Links