html-to-markdown 2.29.0 → 3.0.0
- checksums.yaml +4 -4
- data/Gemfile.lock +18 -41
- data/README.md +37 -50
- data/ext/html-to-markdown-rb/native/Cargo.lock +17 -705
- data/ext/html-to-markdown-rb/native/Cargo.toml +1 -4
- data/ext/html-to-markdown-rb/native/README.md +4 -13
- data/ext/html-to-markdown-rb/native/src/conversion/inline_images.rs +2 -73
- data/ext/html-to-markdown-rb/native/src/conversion/metadata.rs +5 -49
- data/ext/html-to-markdown-rb/native/src/conversion/mod.rs +0 -6
- data/ext/html-to-markdown-rb/native/src/lib.rs +76 -213
- data/ext/html-to-markdown-rb/native/src/options.rs +0 -3
- data/lib/html_to_markdown/version.rb +1 -1
- data/lib/html_to_markdown.rb +13 -194
- data/sig/html_to_markdown.rbs +12 -373
- data/vendor/Cargo.toml +7 -4
- data/vendor/html-to-markdown-rs/Cargo.toml +4 -10
- data/vendor/html-to-markdown-rs/README.md +127 -51
- data/vendor/html-to-markdown-rs/examples/basic.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/table.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_escape.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_inline_formatting.rs +8 -2
- data/vendor/html-to-markdown-rs/examples/test_lists.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_semantic_tags.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_tables.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_task_lists.rs +6 -1
- data/vendor/html-to-markdown-rs/examples/test_whitespace.rs +6 -1
- data/vendor/html-to-markdown-rs/src/convert_api.rs +151 -745
- data/vendor/html-to-markdown-rs/src/converter/block/blockquote.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/block/div.rs +1 -7
- data/vendor/html-to-markdown-rs/src/converter/block/heading.rs +18 -5
- data/vendor/html-to-markdown-rs/src/converter/block/paragraph.rs +10 -0
- data/vendor/html-to-markdown-rs/src/converter/block/preformatted.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/block/table/builder.rs +16 -11
- data/vendor/html-to-markdown-rs/src/converter/block/table/cell.rs +20 -0
- data/vendor/html-to-markdown-rs/src/converter/block/table/cells.rs +4 -17
- data/vendor/html-to-markdown-rs/src/converter/block/table/mod.rs +140 -0
- data/vendor/html-to-markdown-rs/src/converter/block/table/scanner.rs +4 -18
- data/vendor/html-to-markdown-rs/src/converter/block/table/utils.rs +2 -18
- data/vendor/html-to-markdown-rs/src/converter/context.rs +8 -0
- data/vendor/html-to-markdown-rs/src/converter/dom_context.rs +1 -6
- data/vendor/html-to-markdown-rs/src/converter/form/elements.rs +14 -14
- data/vendor/html-to-markdown-rs/src/converter/handlers/blockquote.rs +4 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/code_block.rs +5 -10
- data/vendor/html-to-markdown-rs/src/converter/handlers/graphic.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/image.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/handlers/link.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/inline/code.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/inline/emphasis.rs +4 -10
- data/vendor/html-to-markdown-rs/src/converter/inline/link.rs +4 -170
- data/vendor/html-to-markdown-rs/src/converter/inline/semantic/marks.rs +7 -19
- data/vendor/html-to-markdown-rs/src/converter/list/item.rs +3 -5
- data/vendor/html-to-markdown-rs/src/converter/list/ordered.rs +4 -10
- data/vendor/html-to-markdown-rs/src/converter/list/unordered.rs +6 -12
- data/vendor/html-to-markdown-rs/src/converter/list/utils.rs +1 -12
- data/vendor/html-to-markdown-rs/src/converter/main.rs +85 -56
- data/vendor/html-to-markdown-rs/src/converter/main_helpers.rs +4 -67
- data/vendor/html-to-markdown-rs/src/converter/media/embedded.rs +1 -5
- data/vendor/html-to-markdown-rs/src/converter/media/graphic.rs +3 -40
- data/vendor/html-to-markdown-rs/src/converter/media/image.rs +0 -8
- data/vendor/html-to-markdown-rs/src/converter/media/svg.rs +3 -13
- data/vendor/html-to-markdown-rs/src/converter/metadata.rs +1 -1
- data/vendor/html-to-markdown-rs/src/converter/mod.rs +0 -8
- data/vendor/html-to-markdown-rs/src/converter/plain_text.rs +37 -12
- data/vendor/html-to-markdown-rs/src/converter/semantic/attributes.rs +5 -30
- data/vendor/html-to-markdown-rs/src/converter/semantic/figure.rs +29 -0
- data/vendor/html-to-markdown-rs/src/converter/text/escaping.rs +1 -36
- data/vendor/html-to-markdown-rs/src/converter/text/mod.rs +1 -3
- data/vendor/html-to-markdown-rs/src/converter/text/normalization.rs +0 -53
- data/vendor/html-to-markdown-rs/src/converter/text_node.rs +1 -1
- data/vendor/html-to-markdown-rs/src/converter/utility/attributes.rs +0 -41
- data/vendor/html-to-markdown-rs/src/converter/utility/caching.rs +2 -1
- data/vendor/html-to-markdown-rs/src/converter/utility/content.rs +15 -98
- data/vendor/html-to-markdown-rs/src/converter/utility/preprocessing.rs +113 -4
- data/vendor/html-to-markdown-rs/src/converter/utility/serialization.rs +3 -0
- data/vendor/html-to-markdown-rs/src/converter/visitor_hooks.rs +4 -10
- data/vendor/html-to-markdown-rs/src/exports.rs +1 -4
- data/vendor/html-to-markdown-rs/src/inline_images.rs +1 -1
- data/vendor/html-to-markdown-rs/src/lib.rs +13 -133
- data/vendor/html-to-markdown-rs/src/metadata/collector.rs +4 -4
- data/vendor/html-to-markdown-rs/src/metadata/mod.rs +22 -22
- data/vendor/html-to-markdown-rs/src/metadata/types.rs +3 -3
- data/vendor/html-to-markdown-rs/src/options/conversion.rs +351 -319
- data/vendor/html-to-markdown-rs/src/options/preprocessing.rs +8 -2
- data/vendor/html-to-markdown-rs/src/prelude.rs +1 -15
- data/vendor/html-to-markdown-rs/src/rcdom.rs +7 -1
- data/vendor/html-to-markdown-rs/src/text.rs +25 -14
- data/vendor/html-to-markdown-rs/src/types/document.rs +175 -0
- data/vendor/html-to-markdown-rs/src/types/mod.rs +17 -0
- data/vendor/html-to-markdown-rs/src/types/result.rs +49 -0
- data/vendor/html-to-markdown-rs/src/types/structure_builder.rs +790 -0
- data/vendor/html-to-markdown-rs/src/types/structure_collector.rs +442 -0
- data/vendor/html-to-markdown-rs/src/types/tables.rs +47 -0
- data/vendor/html-to-markdown-rs/src/types/warnings.rs +28 -0
- data/vendor/html-to-markdown-rs/src/visitor/mod.rs +0 -6
- data/vendor/html-to-markdown-rs/src/visitor/traits.rs +0 -1
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/mod.rs +1 -21
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/mod.rs +0 -5
- data/vendor/html-to-markdown-rs/src/visitor_helpers.rs +1 -845
- data/vendor/html-to-markdown-rs/tests/br_in_inline_test.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/commonmark_compliance_test.rs +8 -8
- data/vendor/html-to-markdown-rs/tests/djot_output_test.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/integration_test.rs +23 -6
- data/vendor/html-to-markdown-rs/tests/issue_121_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_127_regressions.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/issue_128_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_131_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_134_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_139_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_140_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_143_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_145_regressions.rs +8 -7
- data/vendor/html-to-markdown-rs/tests/issue_146_regressions.rs +8 -7
- data/vendor/html-to-markdown-rs/tests/issue_176_regressions.rs +12 -2
- data/vendor/html-to-markdown-rs/tests/issue_190_regressions.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/issue_199_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_200_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_212_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/issue_216_217_regressions.rs +6 -1
- data/vendor/html-to-markdown-rs/tests/json_ld_script_extraction.rs +4 -6
- data/vendor/html-to-markdown-rs/tests/lists_test.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/plain_output_test.rs +8 -2
- data/vendor/html-to-markdown-rs/tests/preprocessing_tests.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/skip_images_test.rs +8 -11
- data/vendor/html-to-markdown-rs/tests/tables_test.rs +12 -2
- data/vendor/html-to-markdown-rs/tests/test_custom_elements.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/test_nested_simple.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/test_script_style_stripping.rs +17 -28
- data/vendor/html-to-markdown-rs/tests/test_spa_bisect.rs +8 -1
- data/vendor/html-to-markdown-rs/tests/visitor_integration_test.rs +29 -33
- data/vendor/html-to-markdown-rs/tests/xml_tables_test.rs +8 -1
- metadata +9 -37
- data/bin/benchmark.rb +0 -232
- data/ext/html-to-markdown-rb/native/src/conversion/tables.rs +0 -71
- data/ext/html-to-markdown-rb/native/src/profiling.rs +0 -215
- data/ext/html-to-markdown-rb/native/src/visitor/bridge.rs +0 -252
- data/ext/html-to-markdown-rb/native/src/visitor/callbacks.rs +0 -640
- data/ext/html-to-markdown-rb/native/src/visitor/mod.rs +0 -12
- data/spec/convert_spec.rb +0 -77
- data/spec/convert_with_tables_spec.rb +0 -194
- data/spec/metadata_extraction_spec.rb +0 -437
- data/spec/visitor_issue_187_spec.rb +0 -605
- data/spec/visitor_spec.rb +0 -1149
- data/vendor/html-to-markdown-rs/src/hocr/converter/code_analysis.rs +0 -254
- data/vendor/html-to-markdown-rs/src/hocr/converter/core.rs +0 -249
- data/vendor/html-to-markdown-rs/src/hocr/converter/elements.rs +0 -382
- data/vendor/html-to-markdown-rs/src/hocr/converter/hierarchy.rs +0 -379
- data/vendor/html-to-markdown-rs/src/hocr/converter/keywords.rs +0 -55
- data/vendor/html-to-markdown-rs/src/hocr/converter/layout.rs +0 -313
- data/vendor/html-to-markdown-rs/src/hocr/converter/mod.rs +0 -26
- data/vendor/html-to-markdown-rs/src/hocr/converter/output.rs +0 -78
- data/vendor/html-to-markdown-rs/src/hocr/extractor.rs +0 -232
- data/vendor/html-to-markdown-rs/src/hocr/mod.rs +0 -31
- data/vendor/html-to-markdown-rs/src/hocr/parser.rs +0 -333
- data/vendor/html-to-markdown-rs/src/hocr/spatial/coords.rs +0 -129
- data/vendor/html-to-markdown-rs/src/hocr/spatial/grouping.rs +0 -165
- data/vendor/html-to-markdown-rs/src/hocr/spatial/layout.rs +0 -335
- data/vendor/html-to-markdown-rs/src/hocr/spatial/mod.rs +0 -15
- data/vendor/html-to-markdown-rs/src/hocr/spatial/output.rs +0 -63
- data/vendor/html-to-markdown-rs/src/hocr/types.rs +0 -269
- data/vendor/html-to-markdown-rs/src/visitor/async_traits.rs +0 -249
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge.rs +0 -189
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/bridge_visitor.rs +0 -343
- data/vendor/html-to-markdown-rs/src/visitor_helpers/helpers/callbacks/macros.rs +0 -217
- data/vendor/html-to-markdown-rs/tests/async_visitor_test.rs +0 -57
- data/vendor/html-to-markdown-rs/tests/convert_with_metadata_no_frontmatter.rs +0 -100
- data/vendor/html-to-markdown-rs/tests/hocr_compliance_test.rs +0 -509
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 823ad44845a919191f6b64697599dab71cd1a6e0d6a9a1f222fa0500ef66eb9a
+  data.tar.gz: 360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 664d63d8e57a17085da777089fff023db75b840f686ea84e12d9529bedc8d0405c4c9027f565d7c5cc8a5e01e42d3e7d81c356668b858f1f22451dd80349bceb
+  data.tar.gz: cd46a114ee8a200cbfa42e653b4ca24a2a2ef884c4981d366ab01dbd23e2bdc13f304bf12c9d6b367b480c3bca5e4e98f82805a8862320ac40931a42904b5ee6
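The checksums.yaml entries above record digests for the two archives packed inside the .gem file. As a minimal sketch of how one of those archives could be checked against the recorded SHA256, the helper below uses only Ruby's standard `digest` library. The name `sha256_matches?` is illustrative and not part of RubyGems or this gem, and the path passed to it is whatever location the archive was extracted to.

```ruby
require 'digest'

# Illustrative helper (not part of RubyGems or html-to-markdown):
# returns true when the file at +path+ hashes to +expected_hex+ under SHA256.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

Assuming `data.tar.gz` has been extracted from the 3.0.0 .gem archive into the current directory, `sha256_matches?('data.tar.gz', '360148ff88f88e404ef19f14baccec24b0fa2dfe21b5f601ffc2ae56a9420f52')` would be expected to return `true`.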
data/Gemfile.lock
CHANGED
@@ -1,13 +1,13 @@
 PATH
   remote: .
   specs:
-    html-to-markdown (
+    html-to-markdown (3.0.0)
       rb_sys (>= 0.9, < 1.0)

 GEM
   remote: https://rubygems.org/
   specs:
-    activesupport (8.1.
+    activesupport (8.1.3)
       base64
       bigdecimal
       concurrent-ruby (~> 1.0, >= 1.3.1)
@@ -20,28 +20,20 @@ GEM
       securerandom (>= 0.3)
       tzinfo (~> 2.0, >= 2.0.5)
       uri (>= 0.13.1)
-    addressable (2.8.9)
-      public_suffix (>= 2.0.2, < 8.0)
     ast (2.4.3)
     base64 (0.3.0)
-    bigdecimal (4.0
+    bigdecimal (4.1.0)
     concurrent-ruby (1.3.6)
     connection_pool (3.0.2)
     csv (3.3.5)
     diff-lcs (1.6.2)
     drb (2.2.3)
-    ffi (1.17.
-    ffi (1.17.
-    ffi (1.17.3-x64-mingw-ucrt)
-    ffi (1.17.3-x86_64-darwin)
-    ffi (1.17.3-x86_64-linux-gnu)
+    ffi (1.17.4-arm64-darwin)
+    ffi (1.17.4-x86_64-linux-gnu)
     fileutils (1.8.0)
     i18n (1.14.8)
       concurrent-ruby (~> 1.0)
-    json (2.19.
-    json-schema (6.2.0)
-      addressable (~> 2.8)
-      bigdecimal (>= 3.1, < 5)
+    json (2.19.3)
     language_server-protocol (3.17.0.5)
     lint_roller (1.1.0)
     listen (3.10.0)
@@ -49,18 +41,15 @@ GEM
       rb-fsevent (~> 0.10, >= 0.10.3)
       rb-inotify (~> 0.9, >= 0.9.10)
     logger (1.7.0)
-    mcp (0.9.0)
-      json-schema (>= 4.1)
     minitest (6.0.2)
       drb (~> 2.0)
       prism (~> 1.5)
     mutex_m (0.3.0)
     parallel (1.27.0)
-    parser (3.3.
+    parser (3.3.11.1)
       ast (~> 2.4.1)
       racc
     prism (1.9.0)
-    public_suffix (7.0.5)
     racc (1.8.1)
     rainbow (3.1.1)
     rake (13.3.1)
@@ -72,7 +61,7 @@ GEM
       ffi (~> 1.0)
     rb_sys (0.9.124)
       rake-compiler-dock (= 1.11.0)
-    rbs (3.10.
+    rbs (3.10.4)
       logger
       tsort
     regexp_parser (2.11.3)
@@ -89,11 +78,10 @@ GEM
       diff-lcs (>= 1.2.0, < 2.0)
       rspec-support (~> 3.13.0)
     rspec-support (3.13.7)
-    rubocop (1.
+    rubocop (1.86.0)
       json (~> 2.3)
       language_server-protocol (~> 3.17.0.2)
       lint_roller (~> 1.1.0)
-      mcp (~> 0.6)
       parallel (~> 1.10)
       parser (>= 3.3.0.2)
       rainbow (>= 2.2.2, < 4.0)
@@ -138,12 +126,8 @@ GEM
     uri (1.1.1)

 PLATFORMS
-  aarch64-linux
   arm64-darwin
-  x64-mingw-ucrt
-  x86_64-darwin
   x86_64-linux
-  x86_64-linux-gnu

 DEPENDENCIES
   html-to-markdown!
@@ -156,37 +140,30 @@ DEPENDENCIES
   steep

 CHECKSUMS
-  activesupport (8.1.
-  addressable (2.8.9) sha256=cc154fcbe689711808a43601dee7b980238ce54368d23e127421753e46895485
+  activesupport (8.1.3) sha256=21a5e0dfbd4c3ddd9e1317ec6a4d782fa226e7867dc70b0743acda81a1dca20e
   ast (2.4.3) sha256=954615157c1d6a382bc27d690d973195e79db7f55e9765ac7c481c60bdb4d383
   base64 (0.3.0) sha256=27337aeabad6ffae05c265c450490628ef3ebd4b67be58257393227588f5a97b
-  bigdecimal (4.0
+  bigdecimal (4.1.0) sha256=6dc07767aa3dc456ccd48e7ae70a07b474e9afd7c5bc576f80bd6da5c8dd6cae
   concurrent-ruby (1.3.6) sha256=6b56837e1e7e5292f9864f34b69c5a2cbc75c0cf5338f1ce9903d10fa762d5ab
   connection_pool (3.0.2) sha256=33fff5ba71a12d2aa26cb72b1db8bba2a1a01823559fb01d29eb74c286e62e0a
   csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
   diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
   drb (2.2.3) sha256=0b00d6fdb50995fe4a45dea13663493c841112e4068656854646f418fda13373
-  ffi (1.17.
-  ffi (1.17.
-  ffi (1.17.3-x64-mingw-ucrt) sha256=5f1d7d067a9a1058ad183dba25b05557cd51c85fc1768c49338eabc1cf242d7c
-  ffi (1.17.3-x86_64-darwin) sha256=1f211811eb5cfaa25998322cdd92ab104bfbd26d1c4c08471599c511f2c00bb5
-  ffi (1.17.3-x86_64-linux-gnu) sha256=3746b01f677aae7b16dc1acb7cb3cc17b3e35bdae7676a3f568153fb0e2c887f
+  ffi (1.17.4-arm64-darwin) sha256=19071aaf1419251b0a46852abf960e77330a3b334d13a4ab51d58b31a937001b
+  ffi (1.17.4-x86_64-linux-gnu) sha256=9d3db14c2eae074b382fa9c083fe95aec6e0a1451da249eab096c34002bc752d
   fileutils (1.8.0) sha256=8c6b1df54e2540bdb2f39258f08af78853aa70bad52b4d394bbc6424593c6e02
-  html-to-markdown (
+  html-to-markdown (3.0.0)
   i18n (1.14.8) sha256=285778639134865c5e0f6269e0b818256017e8cde89993fdfcbfb64d088824a5
-  json (2.19.
-  json-schema (6.2.0) sha256=e8bff46ed845a22c1ab2bd0d7eccf831c01fe23bb3920caa4c74db4306813666
+  json (2.19.3) sha256=289b0bb53052a1fa8c34ab33cc750b659ba14a5c45f3fcf4b18762dc67c78646
   language_server-protocol (3.17.0.5) sha256=fd1e39a51a28bf3eec959379985a72e296e9f9acfce46f6a79d31ca8760803cc
   lint_roller (1.1.0) sha256=2c0c845b632a7d172cb849cc90c1bce937a28c5c8ccccb50dfd46a485003cc87
   listen (3.10.0) sha256=c6e182db62143aeccc2e1960033bebe7445309c7272061979bb098d03760c9d2
   logger (1.7.0) sha256=196edec7cc44b66cfb40f9755ce11b392f21f7967696af15d274dde7edff0203
-  mcp (0.9.0) sha256=a0a3737b0ac9df0772f4ef7e2b013c260ddbcf217a5d50a66bff0baeddf03e47
   minitest (6.0.2) sha256=db6e57956f6ecc6134683b4c87467d6dd792323c7f0eea7b93f66bd284adbc3d
   mutex_m (0.3.0) sha256=cfcb04ac16b69c4813777022fdceda24e9f798e48092a2b817eb4c0a782b0751
   parallel (1.27.0) sha256=4ac151e1806b755fb4e2dc2332cbf0e54f2e24ba821ff2d3dcf86bf6dc4ae130
-  parser (3.3.
+  parser (3.3.11.1) sha256=d17ace7aabe3e72c3cc94043714be27cc6f852f104d81aa284c2281aecc65d54
   prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
-  public_suffix (7.0.5) sha256=1a8bb08f1bbea19228d3bed6e5ed908d1cb4f7c2726d18bd9cadf60bc676f623
   racc (1.8.1) sha256=4a7f6929691dbec8b5209a0b373bc2614882b55fc5d2e447a21aaa691303d62f
   rainbow (3.1.1) sha256=039491aa3a89f42efa1d6dec2fc4e62ede96eb6acd95e52f1ad581182b79bc6a
   rake (13.3.1) sha256=8c9e89d09f66a26a01264e7e3480ec0607f0c497a861ef16063604b1b08eb19c
@@ -195,14 +172,14 @@ CHECKSUMS
   rb-fsevent (0.11.2) sha256=43900b972e7301d6570f64b850a5aa67833ee7d87b458ee92805d56b7318aefe
   rb-inotify (0.11.1) sha256=a0a700441239b0ff18eb65e3866236cd78613d6b9f78fea1f9ac47a85e47be6e
   rb_sys (0.9.124) sha256=513476557b12eaf73764b3da9f8746024558fe8699bda785fb548c9aa3877ae7
-  rbs (3.10.
+  rbs (3.10.4) sha256=b17d7c4be4bb31a11a3b529830f0aa206a807ca42f2e7921a3027dfc6b7e5ce8
   regexp_parser (2.11.3) sha256=ca13f381a173b7a93450e53459075c9b76a10433caadcb2f1180f2c741fc55a4
   rspec (3.13.2) sha256=206284a08ad798e61f86d7ca3e376718d52c0bc944626b2349266f239f820587
   rspec-core (3.13.6) sha256=a8823c6411667b60a8bca135364351dda34cd55e44ff94c4be4633b37d828b2d
   rspec-expectations (3.13.5) sha256=33a4d3a1d95060aea4c94e9f237030a8f9eae5615e9bd85718fe3a09e4b58836
   rspec-mocks (3.13.8) sha256=086ad3d3d17533f4237643de0b5c42f04b66348c28bf6b9c2d3f4a3b01af1d47
   rspec-support (3.13.7) sha256=0640e5570872aafefd79867901deeeeb40b0c9875a36b983d85f54fb7381c47c
-  rubocop (1.
+  rubocop (1.86.0) sha256=4ff1186fe16ebe9baff5e7aad66bb0ad4cabf5cdcd419f773146dbba2565d186
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   rubocop-rspec (3.9.0) sha256=8fa70a3619408237d789aeecfb9beef40576acc855173e60939d63332fdb55e2
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
data/README.md
CHANGED
@@ -17,8 +17,8 @@
 <a href="https://central.sonatype.com/artifact/dev.kreuzberg/html-to-markdown">
 <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/html-to-markdown?label=Java&color=007ec6" alt="Java">
 </a>
-<a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/
-<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=
+<a href="https://pkg.go.dev/github.com/kreuzberg-dev/html-to-markdown/packages/go/v3/htmltomarkdown">
+<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/html-to-markdown?label=Go&color=007ec6&filter=v3.0.0" alt="Go">
 </a>
 <a href="https://www.nuget.org/packages/KreuzbergDev.HtmlToMarkdown/">
 <img src="https://img.shields.io/nuget/v/KreuzbergDev.HtmlToMarkdown?label=C%23&color=007ec6" alt="C#">
@@ -87,7 +87,6 @@ Apple M4 • Real Wikipedia documents • `convert()` (Ruby)
 | Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |


-See [Performance Guide](../../examples/performance/) for detailed benchmarks.


 ## Quick Start
@@ -98,7 +97,8 @@ Basic conversion:
 require 'html_to_markdown'

 html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
-
+result = HtmlToMarkdown.convert(html)
+markdown = result[:content]
 ```


@@ -109,60 +109,50 @@ With conversion options:
 require 'html_to_markdown'

 html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
-
+result = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
+markdown = result[:content]
 ```




-
-
 ## API Reference

-### Core
-
-
-**`convert(html, options: nil) -> String`**
+### Core Function

-Basic HTML-to-Markdown conversion. Fast and simple.

-**`
+**`convert(html, options: nil, visitor: nil) -> ConversionResult`**

-
+Converts HTML to Markdown. Returns a `ConversionResult` hash with all results in a single call.

-
-
-Customize conversion with visitor callbacks for element interception. See [Visitor Pattern Guide](../../examples/visitor-pattern/).
-
-**`convert_with_inline_images(html, config: nil) -> [String, Array, Array]`**
-
-Extract base64-encoded inline images with metadata.
-
-**`convert_with_tables(html, options: nil, config: nil) -> ConversionWithTables`**
+```ruby
+require 'html_to_markdown'

-
+result = HtmlToMarkdown.convert(html)
+markdown = result[:content] # Converted Markdown string
+metadata = result[:metadata] # Metadata (when extract_metadata: true)
+tables = result[:tables] # Structured table data (when extract_tables: true)
+document = result[:document] # Document-level info
+images = result[:images] # Extracted images
+warnings = result[:warnings] # Any conversion warnings
+```



 ### Options

 **`ConversionOptions`** – Key configuration fields:
+
 - `heading_style`: Heading format (`"underlined"` | `"atx"` | `"atx_closed"`) — default: `"underlined"`
 - `list_indent_width`: Spaces per indent level — default: `2`
 - `bullets`: Bullet characters cycle — default: `"*+-"`
 - `wrap`: Enable text wrapping — default: `false`
 - `wrap_width`: Wrap at column — default: `80`
 - `code_language`: Default fenced code block language — default: none
-- `extract_metadata`:
+- `extract_metadata`: Enable metadata extraction into `result.metadata` — default: `false`
+- `extract_tables`: Enable structured table extraction into `result.tables` — default: `false`
 - `output_format`: Output markup format (`"markdown"` | `"djot"` | `"plain"`) — default: `"markdown"`

-**`MetadataConfig`** – Selective metadata extraction:
-- `extract_headers`: h1-h6 elements — default: `true`
-- `extract_links`: Hyperlinks — default: `true`
-- `extract_images`: Image elements — default: `true`
-- `extract_structured_data`: JSON-LD, Microdata, RDFa — default: `true`
-- `max_structured_data_size`: Size limit in bytes — default: `100KB`
-

 ## Djot Output Format

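The API Reference hunk above changes what `convert` returns: 2.29.0 documented a `String`, while 3.0.0 documents a `ConversionResult` hash whose Markdown is read with `result[:content]`. For code that has to run against both versions during an upgrade, a small shim along the following lines may help. `markdown_from` is a hypothetical helper, not part of the gem, and the sketch assumes the 3.x result is a plain `Hash` with symbol keys, as the README examples suggest.

```ruby
# Hypothetical helper (not part of html-to-markdown): normalizes the value
# returned by HtmlToMarkdown.convert to a Markdown string on 2.x and 3.x.
def markdown_from(result)
  case result
  when String
    result                   # 2.x: convert returned the Markdown directly
  when Hash
    result.fetch(:content)   # 3.x: Markdown is stored under :content (assumed Hash)
  else
    raise ArgumentError, "unexpected convert result: #{result.class}"
  end
end
```

A call site would then read `markdown = markdown_from(HtmlToMarkdown.convert(html))` regardless of which gem version is installed.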
@@ -222,16 +212,17 @@ Plain text mode is useful for search indexing, text extraction, and feeding cont

 ## Metadata Extraction

-The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
+The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass — all via the standard `convert()` function.

 **Use Cases:**
+
 - **SEO analysis** – Extract title, description, Open Graph tags, Twitter cards
 - **Table of contents generation** – Build structured outlines from heading hierarchy
 - **Content migration** – Document all external links and resources
 - **Accessibility audits** – Check for images without alt text, empty links, invalid heading hierarchy
 - **Link validation** – Classify and validate anchor, internal, external, email, and phone links

-**Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass.
+**Zero Overhead When Disabled:** Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Pass `extract_metadata: true` in `ConversionOptions` to enable it; the result is available at `result.metadata`.

 ### Example: Quick Start

@@ -240,27 +231,27 @@ The metadata extraction feature enables comprehensive document analysis during c
 require 'html_to_markdown'

 html = '<h1>Article</h1><img src="test.jpg" alt="test">'
-
-
-puts
-puts metadata[:
-puts metadata[:
-puts metadata[:
-puts metadata[:
+result = HtmlToMarkdown.convert(html, extract_metadata: true)
+
+puts result[:content] # Converted Markdown
+puts result[:metadata][:document][:title] # Document title
+puts result[:metadata][:headers] # All h1-h6 elements
+puts result[:metadata][:links] # All hyperlinks
+puts result[:metadata][:images] # All images with alt text
+puts result[:metadata][:structured_data] # JSON-LD, Microdata, RDFa
 ```



-For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the [Metadata Extraction Guide](../../examples/metadata-extraction/).
-



 ## Visitor Pattern

-The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal.
+The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Pass a visitor as the third argument to `convert()`.

 **Use Cases:**
+
 - **Custom Markdown dialects** – Convert to Obsidian, Notion, or other flavors
 - **Content filtering** – Remove tracking pixels, ads, or unwanted elements
 - **URL rewriting** – Rewrite CDN URLs, add query parameters, validate links
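The new metadata example above reads the title through three nested keys, `result[:metadata][:document][:title]`. The README states that metadata is populated when `extract_metadata: true` is passed; what the `:metadata` key holds otherwise is not shown in this diff. A defensive reader can use Ruby's `Hash#dig`, which returns `nil` instead of raising when an intermediate level is missing. The sketch below uses hand-built hashes in place of real conversion results, and `document_title` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper (not part of html-to-markdown): reads the document
# title from a 3.x-style result hash, returning nil when any level is absent.
def document_title(result)
  result.dig(:metadata, :document, :title)
end

# Hand-built stand-ins for conversion results (not produced by the gem).
with_metadata    = { content: "# Article", metadata: { document: { title: "Article" } } }
without_metadata = { content: "# Article", metadata: nil }

puts document_title(with_metadata).inspect     # prints "Article"
puts document_title(without_metadata).inspect  # prints nil
```

Chained `[]` calls would raise `NoMethodError` on the second hash, which is the case `dig` is meant to absorb.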
@@ -291,20 +282,16 @@ class MyVisitor
 end

 html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
-
+result = HtmlToMarkdown.convert(html, visitor: MyVisitor.new)
+markdown = result[:content]
 ```



-For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the [Visitor Pattern Guide](../../examples/visitor-pattern/).
-


 ## Examples

-- [Visitor Pattern Guide](../../examples/visitor-pattern/)
-- [Metadata Extraction Guide](../../examples/metadata-extraction/)
-- [Performance Guide](../../examples/performance/)

 ## Links
