html-to-markdown 2.30.0 → 3.0.0
- checksums.yaml +4 -4
- data/Gemfile.lock +4 -14
- data/README.md +37 -50
- data/ext/html-to-markdown-rb/native/Cargo.lock +13 -701
- data/ext/html-to-markdown-rb/native/Cargo.toml +1 -4
- data/ext/html-to-markdown-rb/native/README.md +4 -13
- data/ext/html-to-markdown-rb/native/src/conversion/inline_images.rs +2 -73
- data/ext/html-to-markdown-rb/native/src/conversion/metadata.rs +5 -49
- data/ext/html-to-markdown-rb/native/src/conversion/mod.rs +0 -6
- data/ext/html-to-markdown-rb/native/src/lib.rs +76 -213
- data/ext/html-to-markdown-rb/native/src/options.rs +0 -3
- data/lib/html_to_markdown/version.rb +1 -1
- data/lib/html_to_markdown.rb +13 -194
- data/sig/html_to_markdown.rbs +12 -373
- data/vendor/Cargo.toml +5 -2
- data/vendor/html-to-markdown-rs/Cargo.toml +4 -10
- data/vendor/html-to-markdown-rs/README.md +126 -52
- data/vendor/html-to-markdown-rs/examples/basic.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/table.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_escape.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_inline_formatting.rs +8 -2
- data/vendor/html-to-markdown-rs/examples/test_lists.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_semantic_tags.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_tables.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_task_lists.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_whitespace.rs +6 -1
- data/vendor/html-to-markdown-rs/src/convert_api.rs +151 -745
- data/vendor/html-to-markdown-rs/src/converter/block/blockquote.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/block/div.rs +1 -7
- data/vendor/html-to-markdown-rs/src/converter/block/heading.rs +18 -5
- data/vendor/html-to-markdown-rs/src/converter/block/paragraph.rs +10 -0
- data/vendor/html-to-markdown-rs/src/converter/block/preformatted.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/block/table/builder.rs +16 -11
- data/vendor/html-to-markdown-rs/src/converter/block/table/cell.rs +20 -0
- data/vendor/html-to-markdown-rs/src/converter/block/table/cells.rs +4 -17
- data/vendor/html-to-markdown-rs/src/converter/block/table/mod.rs +140 -0
- data/vendor/html-to-markdown-rs/src/converter/block/table/scanner.rs +4 -18
- data/vendor/html-to-markdown-rs/src/converter/block/table/utils.rs +2 -18
- data/vendor/html-to-markdown-rs/src/converter/context.rs +8 -0
- data/vendor/html-to-markdown-rs/src/converter/dom_context.rs +1 -6
- data/vendor/html-to-markdown-rs/src/converter/form/elements.rs +14 -14
- data/vendor/html-to-markdown-rs/src/converter/handlers/blockquote.rs +4 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/code_block.rs +5 -10
- data/vendor/html-to-markdown-rs/src/converter/handlers/graphic.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/image.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/link.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/inline/code.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/inline/emphasis.rs +4 -10
- data/vendor/html-to-markdown-rs/src/converter/inline/link.rs +4 -170
- data/vendor/html-to-markdown-rs/src/converter/inline/semantic/marks.rs +7 -19
- data/vendor/html-to-markdown-rs/src/converter/list/item.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/list/ordered.rs +4 -10
- data/vendor/html-to-markdown-rs/src/converter/list/unordered.rs +6 -12
- data/vendor/html-to-markdown-rs/src/converter/list/utils.rs +1 -12
- data/vendor/html-to-markdown-rs/src/converter/main.rs +85 -56
- data/vendor/html-to-markdown-rs/src/converter/main_helpers.rs +4 -68
- data/vendor/html-to-markdown-rs/src/converter/media/embedded.rs +1 -5
- data/vendor/html-to-markdown-rs/src/converter/media/graphic.rs +3 -40
- data/vendor/html-to-markdown-rs/src/converter/media/image.rs +0 -8
- data/vendor/html-to-markdown-rs/src/converter/media/svg.rs +3 -13
- data/vendor/html-to-markdown-rs/src/converter/metadata.rs +1 -1
- data/vendor/html-to-markdown-rs/src/converter/mod.rs +0 -8
- data/vendor/html-to-markdown-rs/src/converter/plain_text.rs +37 -12
- data/vendor/html-to-markdown-rs/src/converter/semantic/attributes.rs +5 -30
- data/vendor/html-to-markdown-rs/src/converter/semantic/figure.rs +29 -0
- data/vendor/html-to-markdown-rs/src/converter/text/escaping.rs +1 -36
- data/vendor/html-to-markdown-rs/src/converter/text/mod.rs +1 -3
- data/vendor/html-to-markdown-rs/src/converter/text/normalization.rs +0 -53
- data/vendor/html-to-markdown-rs/src/converter/text_node.rs +1 -1
- data/vendor/html-to-markdown-rs/src/converter/utility/attributes.rs +0 -41
- data/vendor/html-to-markdown-rs/src/converter/utility/caching.rs +2 -1
- data/vendor/html-to-markdown-rs/src/converter/utility/content.rs +15 -98
- data/vendor/html-to-markdown-rs/src/converter/utility/preprocessing.rs +113 -4
- data/vendor/html-to-markdown-rs/src/converter/utility/serialization.rs +3 -0
- data/vendor/html-to-markdown-rs/src/converter/visitor_hooks.rs +4 -10
- data/vendor/html-to-markdown-rs/src/exports.rs +1 -4
- data/vendor/html-to-markdown-rs/src/inline_images.rs +1 -1
- data/vendor/html-to-markdown-rs/src/lib.rs +13 -133
- data/vendor/html-to-markdown-rs/src/metadata/collector.rs +4 -4
- data/vendor/html-to-markdown-rs/src/metadata/mod.rs +22 -22
- data/vendor/html-to-markdown-rs/src/metadata/types.rs +3 -3
- data/vendor/html-to-markdown-rs/src/options/conversion.rs +351 -323
- data/vendor/html-to-markdown-rs/src/options/preprocessing.rs +8 -2
- data/vendor/html-to-markdown-rs/src/prelude.rs +1 -15
- data/vendor/html-to-markdown-rs/src/rcdom.rs +7 -1
- data/vendor/html-to-markdown-rs/src/text.rs +25 -14
- data/vendor/html-to-markdown-rs/src/types/document.rs +175 -0
- data/vendor/html-to-markdown-rs/src/types/mod.rs +17 -0
- data/vendor/html-to-markdown-rs/src/types/result.rs +49 -0
- data/vendor/html-to-markdown-rs/src/types/structure_builder.rs +790 -0
- data/vendor/html-to-markdown-rs/src/types/structure_collector.rs +442 -0
- data/vendor/html-to-markdown-rs/src/types/tables.rs +47 -0
- data/vendor/html-to-markdown-rs/src/types/warnings.rs +28 -0
- data/vendor/html-to-markdown-rs/src/visitor/mod.rs +0 -6
- data/vendor/html-to-markdown-rs/src/visitor/traits.rs +0 -1
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/mod.rs +1 -21
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/mod.rs +0 -5
- data/vendor/html-to-markdown-rs/src/visitor_helpers.rs +1 -845
- data/vendor/html-to-markdown-rs/tests/br_in_inline_test.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/commonmark_compliance_test.rs +8 -8
- data/vendor/html-to-markdown-rs/tests/djot_output_test.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/integration_test.rs +23 -6
- data/vendor/html-to-markdown-rs/tests/issue_121_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_127_regressions.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/issue_128_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_131_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_134_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_139_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_140_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_143_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_145_regressions.rs +8 -7
- data/vendor/html-to-markdown-rs/tests/issue_146_regressions.rs +8 -7
- data/vendor/html-to-markdown-rs/tests/issue_176_regressions.rs +12 -2
- data/vendor/html-to-markdown-rs/tests/issue_190_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_199_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_200_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_212_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_216_217_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/json_ld_script_extraction.rs +4 -6
- data/vendor/html-to-markdown-rs/tests/lists_test.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/plain_output_test.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/preprocessing_tests.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/skip_images_test.rs +8 -11
- data/vendor/html-to-markdown-rs/tests/tables_test.rs +12 -2
- data/vendor/html-to-markdown-rs/tests/test_custom_elements.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/test_nested_simple.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/test_script_style_stripping.rs +17 -28
- data/vendor/html-to-markdown-rs/tests/test_spa_bisect.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/visitor_integration_test.rs +29 -33
- data/vendor/html-to-markdown-rs/tests/xml_tables_test.rs +8 -1
- metadata +9 -37
- data/bin/benchmark.rb +0 -232
- data/ext/html-to-markdown-rb/native/src/conversion/tables.rs +0 -71
- data/ext/html-to-markdown-rb/native/src/profiling.rs +0 -215
- data/ext/html-to-markdown-rb/native/src/visitor/bridge.rs +0 -252
- data/ext/html-to-markdown-rb/native/src/visitor/callbacks.rs +0 -640
- data/ext/html-to-markdown-rb/native/src/visitor/mod.rs +0 -12
- data/spec/convert_spec.rb +0 -77
- data/spec/convert_with_tables_spec.rb +0 -194
- data/spec/metadata_extraction_spec.rb +0 -437
- data/spec/visitor_issue_187_spec.rb +0 -605
- data/spec/visitor_spec.rb +0 -1149
- data/vendor/html-to-markdown-rs/src/hocr/converter/code_analysis.rs +0 -254
- data/vendor/html-to-markdown-rs/src/hocr/converter/core.rs +0 -249
- data/vendor/html-to-markdown-rs/src/hocr/converter/elements.rs +0 -382
- data/vendor/html-to-markdown-rs/src/hocr/converter/hierarchy.rs +0 -379
- data/vendor/html-to-markdown-rs/src/hocr/converter/keywords.rs +0 -55
- data/vendor/html-to-markdown-rs/src/hocr/converter/layout.rs +0 -313
- data/vendor/html-to-markdown-rs/src/hocr/converter/mod.rs +0 -26
- data/vendor/html-to-markdown-rs/src/hocr/converter/output.rs +0 -78
- data/vendor/html-to-markdown-rs/src/hocr/extractor.rs +0 -232
- data/vendor/html-to-markdown-rs/src/hocr/mod.rs +0 -42
- data/vendor/html-to-markdown-rs/src/hocr/parser.rs +0 -333
- data/vendor/html-to-markdown-rs/src/hocr/spatial/coords.rs +0 -129
- data/vendor/html-to-markdown-rs/src/hocr/spatial/grouping.rs +0 -165
- data/vendor/html-to-markdown-rs/src/hocr/spatial/layout.rs +0 -335
- data/vendor/html-to-markdown-rs/src/hocr/spatial/mod.rs +0 -15
- data/vendor/html-to-markdown-rs/src/hocr/spatial/output.rs +0 -63
- data/vendor/html-to-markdown-rs/src/hocr/types.rs +0 -269
- data/vendor/html-to-markdown-rs/src/visitor/async_traits.rs +0 -249
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge.rs +0 -189
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge_visitor.rs +0 -343
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/macros.rs +0 -217
- data/vendor/html-to-markdown-rs/tests/async_visitor_test.rs +0 -57
- data/vendor/html-to-markdown-rs/tests/convert_with_metadata_no_frontmatter.rs +0 -100
- data/vendor/html-to-markdown-rs/tests/hocr_compliance_test.rs +0 -509
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 823ad44845a919191f6b64697599dab71cd1a6e0d6a9a1f222fa0500ef66eb9a
+data.tar.gz: 360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 664d63d8e57a17085da777089fff023db75b840f686ea84e12d9529bedc8d0405c4c9027f565d7c5cc8a5e01e42d3e7d81c356668b858f1f22451dd80349bceb
+data.tar.gz: cd46a114ee8a200cbfa42e653b4ca24a2a2ef884c4981d366ab01dbd23e2bdc13f304bf12c9d6b367b480c3bca5e4e98f82805a8862320ac40931a42904b5ee6
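The SHA256 entries in checksums.yaml can be compared against a downloaded package. What follows is a minimal sketch, assuming `data.tar.gz` has already been extracted from the `.gem` archive into the working directory. The helper name `checksum_matches?` and the file path are illustrative and are not part of RubyGems or of this gem.

```ruby
require 'digest/sha2'

# Hypothetical helper: compares a file's SHA256 hex digest with an expected value.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Expected value copied from the 3.0.0 checksums.yaml shown in this diff.
EXPECTED_DATA_SHA256 =
  '360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52'

# Only run the check when the extracted file is present (the path is an assumption).
if File.exist?('data.tar.gz')
  status = checksum_matches?('data.tar.gz', EXPECTED_DATA_SHA256) ? 'OK' : 'MISMATCH'
  puts "data.tar.gz #{status}"
end
```

The same helper can be pointed at `metadata.gz` with the corresponding SHA256 value.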
data/Gemfile.lock
CHANGED
@@ -1,7 +1,7 @@
 PATH
 remote: .
 specs:
-html-to-markdown (
+html-to-markdown (3.0.0)
 rb_sys (>= 0.9, < 1.0)

 GEM
@@ -22,16 +22,13 @@ GEM
 uri (>= 0.13.1)
 ast (2.4.3)
 base64 (0.3.0)
-bigdecimal (4.0
+bigdecimal (4.1.0)
 concurrent-ruby (1.3.6)
 connection_pool (3.0.2)
 csv (3.3.5)
 diff-lcs (1.6.2)
 drb (2.2.3)
-ffi (1.17.4-aarch64-linux-gnu)
 ffi (1.17.4-arm64-darwin)
-ffi (1.17.4-x64-mingw-ucrt)
-ffi (1.17.4-x86_64-darwin)
 ffi (1.17.4-x86_64-linux-gnu)
 fileutils (1.8.0)
 i18n (1.14.8)
@@ -129,12 +126,8 @@ GEM
 uri (1.1.1)

 PLATFORMS
-aarch64-linux
 arm64-darwin
-x64-mingw-ucrt
-x86_64-darwin
 x86_64-linux
-x86_64-linux-gnu

 DEPENDENCIES
 html-to-markdown!
@@ -150,19 +143,16 @@ CHECKSUMS
 activesupport (8.1.3) sha256=21a5e0dfbd4c3ddd9e1317ec6a4d782fa226e7867dc70b0743acda81a1dca20e
 ast (2.4.3) sha256=954615157c1d6a382bc27d690d973195e79db7f55e9765ac7c481c60bdb4d383
 base64 (0.3.0) sha256=27337aeabad6ffae05c265c450490628ef3ebd4b67be58257393227588f5a97b
-bigdecimal (4.0
+bigdecimal (4.1.0) sha256=6dc07767aa3dc456ccd48e7ae70a07b474e9afd7c5bc576f80bd6da5c8dd6cae
 concurrent-ruby (1.3.6) sha256=6b56837e1e7e5292f9864f34b69c5a2cbc75c0cf5338f1ce9903d10fa762d5ab
 connection_pool (3.0.2) sha256=33fff5ba71a12d2aa26cb72b1db8bba2a1a01823559fb01d29eb74c286e62e0a
 csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
 diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
 drb (2.2.3) sha256=0b00d6fdb50995fe4a45dea13663493c841112e4068656854646f418fda13373
-ffi (1.17.4-aarch64-linux-gnu) sha256=b208f06f91ffd8f5e1193da3cae3d2ccfc27fc36fba577baf698d26d91c080df
 ffi (1.17.4-arm64-darwin) sha256=19071aaf1419251b0a46852abf960e77330a3b334d13a4ab51d58b31a937001b
-ffi (1.17.4-x64-mingw-ucrt) sha256=f6ff9618cfccc494138bddade27aa06c74c6c7bc367a1ea1103d80c2fcb9ed35
-ffi (1.17.4-x86_64-darwin) sha256=aa70390523cf3235096cf64962b709b4cfbd5c082a2cb2ae714eb0fe2ccda496
 ffi (1.17.4-x86_64-linux-gnu) sha256=9d3db14c2eae074b382fa9c083fe95aec6e0a1451da249eab096c34002bc752d
 fileutils (1.8.0) sha256=8c6b1df54e2540bdb2f39258f08af78853aa70bad52b4d394bbc6424593c6e02
-html-to-markdown (
+html-to-markdown (3.0.0)
 i18n (1.14.8) sha256=285778639134865c5e0f6269e0b818256017e8cde89993fdfcbfb64d088824a5
 json (2.19.3) sha256=289b0bb53052a1fa8c34ab33cc750b659ba14a5c45f3fcf4b18762dc67c78646
 language_server-protocol (3.17.0.5) sha256=fd1e39a51a28bf3eec959379985a72e296e9f9acfce46f6a79d31ca8760803cc
data/README.md
CHANGED
@@ -17,8 +17,8 @@
 <a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
 <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
 </a>
-<a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/
-<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=
+<a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown">
+<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=v3.0.0" alt="Go">
 </a>
 <a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
 <img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
@@ -87,7 +87,6 @@ Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
 | Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |


-See [Performance Guide](../../examples/performance/) for detailed benchmarks.


 ## Quick Start
@@ -98,7 +97,8 @@ Basic conversion:
 require 'html_to_markdown'

 html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
-
+result = HtmlToMarkdown.convert(html)
+markdown = result[:content]
 ```


@@ -109,60 +109,50 @@ With conversion options:
 require 'html_to_markdown'

 html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
-
+result = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
+markdown = result[:content]
 ```




-
-
 ## API Reference

-### Core
-
-
-**`convert(html, options: nil) -> String`**
+### Core Function

-Basic HTML-to-Markdown conversion. Fast and simple.

-**`
+**`convert(html, options: nil, visitor: nil) -> ConversionResult`**

-
+Converts HTML to Markdown. Returns a `ConversionResult` hash with all results in a single call.

-
-
-Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
-
-**`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
-
-Extract base64-encoded inline images with metadata.
-
-**`convert_with_tables(html, options: nil, config: nil) -> ConversionWithTables`**
+```ruby
+require 'html_to_markdown'

-
+result = HtmlToMarkdown.convert(html)
+markdown = result[:content] # Converted Markdown string
+metadata = result[:metadata] # Metadata (when extract_metadata: true)
+tables = result[:tables] # Structured table data (when extract_tables: true)
+document = result[:document] # Document-level info
+images = result[:images] # Extracted images
+warnings = result[:warnings] # Any conversion warnings
+```



 ### Options

 **`ConversionOptions`** – Key configuration fields:
+
 - `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
 - `list_indent_width`: Spaces per indent level — default: `2`
 - `bullets`: Bullet characters cycle — default: `"*+-"`
 - `wrap`: Enable text wrapping — default: `false`
 - `wrap_width`: Wrap at column — default: `80`
 - `code_language`: Default fenced code block language — default: none
-- `extract_metadata`:
+- `extract_metadata`: Enable metadata extraction into `result.metadata` — default: `false`
+- `extract_tables`: Enable structured table extraction into `result.tables` — default: `false`
 - `output_format`: Output markup format (`"markdown"` | `"djot"` | `"plain"`) — default: `"markdown"`

-**`MetadataConfig`** – Selective metadata extraction:
-- `extract_headers`: h1-h6 elements — default: `true`
-- `extract_links`: Hyperlinks — default: `true`
-- `extract_images`: Image elements — default: `true`
-- `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
-- `max_structured_data_size`: Size limit in bytes — default: `100KB`
-

 ## Djot Output Format

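The API change in this hunk means that code written against 2.x, where `convert` returned a String, now needs to read `result[:content]`. What follows is a minimal sketch of a compatibility shim, assuming the 3.0.0 result is a Hash with a `:content` key, as the README text states. `markdown_from` is a hypothetical helper and not part of the gem, and the sample values are stand-ins rather than real converter output.

```ruby
# Hypothetical helper (not part of html-to-markdown): accepts either the
# 2.x-style String or the 3.0.0-style result hash and returns the Markdown text.
def markdown_from(result)
  case result
  when String then result
  when Hash then result.fetch(:content)
  else
    raise ArgumentError, "unexpected convert result: #{result.class}"
  end
end

# Stand-in values; a real call would be HtmlToMarkdown.convert(html).
legacy_result = "# Hello\n"
v3_result = { content: "# Hello\n", warnings: [] }

puts markdown_from(legacy_result) == markdown_from(v3_result) # prints "true"
```

A shim like this can let call sites migrate gradually. Once every caller reads `result[:content]` directly, the helper can be removed.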
@@ -222,16 +212,17 @@ Plain text mode is useful for search indexing, text extraction, and feeding cont

 ## Metadata Extraction

-The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
+The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass — all via the standard `convert()` function.

 **Use Cases:**
+
 - **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
 - **Table of contents generation** – Build structured outlines from heading hierarchy
 - **Content migration** – Document all external links and resources
 - **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
 - **Link validation** – Classify and validate anchor, internal, external, email, and phone links

-**Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass.
+**Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Pass `extract_metadata: true` in `ConversionOptions` to enable it; the result is available at `result.metadata`.

 ### Example: Quick Start

@@ -240,27 +231,27 @@ The metadata extraction feature enables comprehensive document analysis during c
 require 'html_to_markdown'

 html = '<h1>Article</h1><img src="test.jpg" alt="test">'
-
-
-puts
-puts metadata[:
-puts metadata[:
-puts metadata[:
-puts metadata[:
+result = HtmlToMarkdown.convert(html, extract_metadata: true)
+
+puts result[:content] # Converted Markdown
+puts result[:metadata][:document][:title] # Document title
+puts result[:metadata][:headers] # All h1-h6 elements
+puts result[:metadata][:links] # All hyperlinks
+puts result[:metadata][:images] # All images with alt text
+puts result[:metadata][:structured_data] # JSON-LD, Microdata, RDFa
 ```



-For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
-



 ## Visitor Pattern

-The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal.
+The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Pass a visitor as the third argument to `convert()`.

 **Use Cases:**
+
 - **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
 - **Content filtering** – Remove tracking pixels, ads, or unwanted elements
 - **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
@@ -291,20 +282,16 @@ class MyVisitor
 end

 html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
-
+result = HtmlToMarkdown.convert(html, visitor: MyVisitor.new)
+markdown = result[:content]
 ```



-For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
-


 ## Examples

-- [Visitor Pattern Guide](../../examples/visitor-pattern/)
-- [Metadata Extraction Guide](../../examples/metadata-extraction/)
-- [Performance Guide](../../examples/performance/)

 ## Links
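The metadata example in this README diff indexes nested keys directly, which raises `NoMethodError` if any level is `nil`. What follows is a minimal sketch of a more defensive read using `Hash#dig`. It assumes `result[:metadata]` may be `nil` when `extract_metadata` is not enabled, which the diff does not confirm. `document_title` is a hypothetical helper, and the `sample` hash is hand-written stand-in data shaped after the README, not real converter output.

```ruby
# Hypothetical helper: returns the document title, or a fallback when metadata
# (or any nested level) is missing.
def document_title(result, fallback: '(untitled)')
  result.dig(:metadata, :document, :title) || fallback
end

# Stand-in data shaped after the README example; not real converter output.
sample = {
  content: "# Article\n",
  metadata: { document: { title: 'Article' }, headers: [], links: [], images: [] }
}

puts document_title(sample)              # prints "Article"
puts document_title({ content: 'text' }) # prints "(untitled)"
```

The same `dig` pattern applies to the other keys the README lists, such as `:headers` or `:links`.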