html-to-markdown 2.0.1__tar.gz → 2.1.0__tar.gz
Potentially problematic release: this version of html-to-markdown might be problematic.
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/Cargo.lock +177 -9
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/Cargo.toml +2 -2
- html_to_markdown-2.1.0/PKG-INFO +196 -0
- html_to_markdown-2.1.0/README_PYPI.md +162 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/Cargo.toml +5 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/README.md +39 -54
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/benches/conversion_benchmark.rs +2 -19
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/benches/profiling_benchmark.rs +2 -20
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/converter.rs +351 -58
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/hocr/converter.rs +67 -9
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/hocr/extractor.rs +1 -1
- html_to_markdown-2.1.0/crates/html-to-markdown/src/hocr/mod.rs +30 -0
- html_to_markdown-2.0.1/crates/html-to-markdown/src/hocr.rs → html_to_markdown-2.1.0/crates/html-to-markdown/src/hocr/spatial.rs +3 -31
- html_to_markdown-2.1.0/crates/html-to-markdown/src/inline_images.rs +221 -0
- html_to_markdown-2.1.0/crates/html-to-markdown/src/lib.rs +114 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/options.rs +3 -34
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/tests/hocr_compliance_test.rs +5 -5
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown-py/Cargo.toml +5 -1
- html_to_markdown-2.1.0/crates/html-to-markdown-py/README.md +86 -0
- html_to_markdown-2.1.0/crates/html-to-markdown-py/python/html_to_markdown/__init__.py +18 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown-py/python/html_to_markdown/_html_to_markdown.pyi +35 -19
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown-py/src/lib.rs +240 -44
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/__init__.py +3 -9
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/_rust.pyi +4 -12
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/api.py +7 -34
- html_to_markdown-2.1.0/html_to_markdown/cli.py +3 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/cli_proxy.py +23 -33
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/exceptions.py +8 -16
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/options.py +3 -76
- html_to_markdown-2.1.0/html_to_markdown/v1_compat.py +189 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/pyproject.toml +8 -8
- html_to_markdown-2.0.1/PKG-INFO +0 -243
- html_to_markdown-2.0.1/README_PYPI.md +0 -210
- html_to_markdown-2.0.1/crates/html-to-markdown/src/lib.rs +0 -57
- html_to_markdown-2.0.1/crates/html-to-markdown-py/README.md +0 -438
- html_to_markdown-2.0.1/crates/html-to-markdown-py/python/html_to_markdown/__init__.py +0 -5
- html_to_markdown-2.0.1/html_to_markdown/cli.py +0 -9
- html_to_markdown-2.0.1/html_to_markdown/v1_compat.py +0 -161
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/LICENSE +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/benches/micro_benchmark.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/basic.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/table.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_escape.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_inline_formatting.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_lists.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_semantic_tags.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_tables.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_task_lists.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/examples/test_whitespace.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/error.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/hocr/parser.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/hocr/types.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/sanitizer.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/text.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/src/wrapper.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/tests/commonmark_compliance_test.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown/tests/integration_test.rs +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/crates/html-to-markdown-py/uv.lock +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/__main__.py +0 -0
- {html_to_markdown-2.0.1 → html_to_markdown-2.1.0}/html_to_markdown/py.typed +0 -0
|
@@ -2,6 +2,12 @@
|
|
|
2
2
|
# It is not intended for manual editing.
|
|
3
3
|
version = 3
|
|
4
4
|
|
|
5
|
+
[[package]]
|
|
6
|
+
name = "adler2"
|
|
7
|
+
version = "2.0.1"
|
|
8
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
9
|
+
checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
|
|
10
|
+
|
|
5
11
|
[[package]]
|
|
6
12
|
name = "aho-corasick"
|
|
7
13
|
version = "1.1.3"
|
|
@@ -131,6 +137,18 @@ version = "3.19.0"
|
|
|
131
137
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
132
138
|
checksum = "46c5e41b57b8bba42a04676d81cb89e9ee8e859a1a66f80a5a72e1cb76b34d43"
|
|
133
139
|
|
|
140
|
+
[[package]]
|
|
141
|
+
name = "bytemuck"
|
|
142
|
+
version = "1.24.0"
|
|
143
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
144
|
+
checksum = "1fbdf580320f38b612e485521afda1ee26d10cc9884efaaa750d383e13e3c5f4"
|
|
145
|
+
|
|
146
|
+
[[package]]
|
|
147
|
+
name = "byteorder-lite"
|
|
148
|
+
version = "0.1.0"
|
|
149
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
150
|
+
checksum = "8f1fe948ff07f4bd06c30984e69f5b4899c516a3ef74f34df92a2df2ab535495"
|
|
151
|
+
|
|
134
152
|
[[package]]
|
|
135
153
|
name = "cast"
|
|
136
154
|
version = "0.3.0"
|
|
@@ -229,12 +247,27 @@ dependencies = [
|
|
|
229
247
|
"roff",
|
|
230
248
|
]
|
|
231
249
|
|
|
250
|
+
[[package]]
|
|
251
|
+
name = "color_quant"
|
|
252
|
+
version = "1.1.0"
|
|
253
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
254
|
+
checksum = "3d7b894f5411737b7867f4827955924d7c254fc9f4d91a6aad6b097804b1018b"
|
|
255
|
+
|
|
232
256
|
[[package]]
|
|
233
257
|
name = "colorchoice"
|
|
234
258
|
version = "1.0.4"
|
|
235
259
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
236
260
|
checksum = "b05b61dc5112cbb17e4b6cd61790d9845d13888356391624cbe7e41efeac1e75"
|
|
237
261
|
|
|
262
|
+
[[package]]
|
|
263
|
+
name = "crc32fast"
|
|
264
|
+
version = "1.5.0"
|
|
265
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
266
|
+
checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511"
|
|
267
|
+
dependencies = [
|
|
268
|
+
"cfg-if",
|
|
269
|
+
]
|
|
270
|
+
|
|
238
271
|
[[package]]
|
|
239
272
|
name = "criterion"
|
|
240
273
|
version = "0.5.1"
|
|
@@ -394,6 +427,25 @@ version = "2.3.0"
|
|
|
394
427
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
395
428
|
checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
|
|
396
429
|
|
|
430
|
+
[[package]]
|
|
431
|
+
name = "fdeflate"
|
|
432
|
+
version = "0.3.7"
|
|
433
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
434
|
+
checksum = "1e6853b52649d4ac5c0bd02320cddc5ba956bdb407c4b75a2c6b75bf51500f8c"
|
|
435
|
+
dependencies = [
|
|
436
|
+
"simd-adler32",
|
|
437
|
+
]
|
|
438
|
+
|
|
439
|
+
[[package]]
|
|
440
|
+
name = "flate2"
|
|
441
|
+
version = "1.1.4"
|
|
442
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
443
|
+
checksum = "dc5a4e564e38c699f2880d3fda590bedc2e69f3f84cd48b457bd892ce61d0aa9"
|
|
444
|
+
dependencies = [
|
|
445
|
+
"crc32fast",
|
|
446
|
+
"miniz_oxide",
|
|
447
|
+
]
|
|
448
|
+
|
|
397
449
|
[[package]]
|
|
398
450
|
name = "float-cmp"
|
|
399
451
|
version = "0.10.0"
|
|
@@ -434,6 +486,16 @@ dependencies = [
|
|
|
434
486
|
"wasi",
|
|
435
487
|
]
|
|
436
488
|
|
|
489
|
+
[[package]]
|
|
490
|
+
name = "gif"
|
|
491
|
+
version = "0.13.3"
|
|
492
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
493
|
+
checksum = "4ae047235e33e2829703574b54fdec96bfbad892062d97fed2f76022287de61b"
|
|
494
|
+
dependencies = [
|
|
495
|
+
"color_quant",
|
|
496
|
+
"weezl",
|
|
497
|
+
]
|
|
498
|
+
|
|
437
499
|
[[package]]
|
|
438
500
|
name = "half"
|
|
439
501
|
version = "2.7.0"
|
|
@@ -468,7 +530,7 @@ dependencies = [
|
|
|
468
530
|
|
|
469
531
|
[[package]]
|
|
470
532
|
name = "html-to-markdown-cli"
|
|
471
|
-
version = "2.0
|
|
533
|
+
version = "2.1.0"
|
|
472
534
|
dependencies = [
|
|
473
535
|
"assert_cmd",
|
|
474
536
|
"clap",
|
|
@@ -482,20 +544,23 @@ dependencies = [
|
|
|
482
544
|
|
|
483
545
|
[[package]]
|
|
484
546
|
name = "html-to-markdown-py"
|
|
485
|
-
version = "2.0
|
|
547
|
+
version = "2.1.0"
|
|
486
548
|
dependencies = [
|
|
549
|
+
"base64",
|
|
487
550
|
"html-to-markdown-rs",
|
|
551
|
+
"image",
|
|
488
552
|
"pyo3",
|
|
489
553
|
]
|
|
490
554
|
|
|
491
555
|
[[package]]
|
|
492
556
|
name = "html-to-markdown-rs"
|
|
493
|
-
version = "2.0
|
|
557
|
+
version = "2.1.0"
|
|
494
558
|
dependencies = [
|
|
495
559
|
"ammonia",
|
|
496
560
|
"base64",
|
|
497
561
|
"criterion",
|
|
498
562
|
"html-escape",
|
|
563
|
+
"image",
|
|
499
564
|
"once_cell",
|
|
500
565
|
"regex",
|
|
501
566
|
"serde",
|
|
@@ -622,6 +687,34 @@ dependencies = [
|
|
|
622
687
|
"icu_properties",
|
|
623
688
|
]
|
|
624
689
|
|
|
690
|
+
[[package]]
|
|
691
|
+
name = "image"
|
|
692
|
+
version = "0.25.8"
|
|
693
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
694
|
+
checksum = "529feb3e6769d234375c4cf1ee2ce713682b8e76538cb13f9fc23e1400a591e7"
|
|
695
|
+
dependencies = [
|
|
696
|
+
"bytemuck",
|
|
697
|
+
"byteorder-lite",
|
|
698
|
+
"color_quant",
|
|
699
|
+
"gif",
|
|
700
|
+
"image-webp",
|
|
701
|
+
"moxcms",
|
|
702
|
+
"num-traits",
|
|
703
|
+
"png",
|
|
704
|
+
"zune-core",
|
|
705
|
+
"zune-jpeg",
|
|
706
|
+
]
|
|
707
|
+
|
|
708
|
+
[[package]]
|
|
709
|
+
name = "image-webp"
|
|
710
|
+
version = "0.2.4"
|
|
711
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
712
|
+
checksum = "525e9ff3e1a4be2fbea1fdf0e98686a6d98b4d8f937e1bf7402245af1909e8c3"
|
|
713
|
+
dependencies = [
|
|
714
|
+
"byteorder-lite",
|
|
715
|
+
"quick-error",
|
|
716
|
+
]
|
|
717
|
+
|
|
625
718
|
[[package]]
|
|
626
719
|
name = "indoc"
|
|
627
720
|
version = "2.0.6"
|
|
@@ -752,6 +845,26 @@ dependencies = [
|
|
|
752
845
|
"autocfg",
|
|
753
846
|
]
|
|
754
847
|
|
|
848
|
+
[[package]]
|
|
849
|
+
name = "miniz_oxide"
|
|
850
|
+
version = "0.8.9"
|
|
851
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
852
|
+
checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316"
|
|
853
|
+
dependencies = [
|
|
854
|
+
"adler2",
|
|
855
|
+
"simd-adler32",
|
|
856
|
+
]
|
|
857
|
+
|
|
858
|
+
[[package]]
|
|
859
|
+
name = "moxcms"
|
|
860
|
+
version = "0.7.7"
|
|
861
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
862
|
+
checksum = "c588e11a3082784af229e23e8e4ecf5bcc6fbe4f69101e0421ce8d79da7f0b40"
|
|
863
|
+
dependencies = [
|
|
864
|
+
"num-traits",
|
|
865
|
+
"pxfm",
|
|
866
|
+
]
|
|
867
|
+
|
|
755
868
|
[[package]]
|
|
756
869
|
name = "new_debug_unreachable"
|
|
757
870
|
version = "1.0.6"
|
|
@@ -900,6 +1013,19 @@ dependencies = [
|
|
|
900
1013
|
"plotters-backend",
|
|
901
1014
|
]
|
|
902
1015
|
|
|
1016
|
+
[[package]]
|
|
1017
|
+
name = "png"
|
|
1018
|
+
version = "0.18.0"
|
|
1019
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1020
|
+
checksum = "97baced388464909d42d89643fe4361939af9b7ce7a31ee32a168f832a70f2a0"
|
|
1021
|
+
dependencies = [
|
|
1022
|
+
"bitflags",
|
|
1023
|
+
"crc32fast",
|
|
1024
|
+
"fdeflate",
|
|
1025
|
+
"flate2",
|
|
1026
|
+
"miniz_oxide",
|
|
1027
|
+
]
|
|
1028
|
+
|
|
903
1029
|
[[package]]
|
|
904
1030
|
name = "portable-atomic"
|
|
905
1031
|
version = "1.11.1"
|
|
@@ -960,6 +1086,15 @@ dependencies = [
|
|
|
960
1086
|
"unicode-ident",
|
|
961
1087
|
]
|
|
962
1088
|
|
|
1089
|
+
[[package]]
|
|
1090
|
+
name = "pxfm"
|
|
1091
|
+
version = "0.1.25"
|
|
1092
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1093
|
+
checksum = "a3cbdf373972bf78df4d3b518d07003938e2c7d1fb5891e55f9cb6df57009d84"
|
|
1094
|
+
dependencies = [
|
|
1095
|
+
"num-traits",
|
|
1096
|
+
]
|
|
1097
|
+
|
|
963
1098
|
[[package]]
|
|
964
1099
|
name = "pyo3"
|
|
965
1100
|
version = "0.26.0"
|
|
@@ -1021,6 +1156,12 @@ dependencies = [
|
|
|
1021
1156
|
"syn",
|
|
1022
1157
|
]
|
|
1023
1158
|
|
|
1159
|
+
[[package]]
|
|
1160
|
+
name = "quick-error"
|
|
1161
|
+
version = "2.0.1"
|
|
1162
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1163
|
+
checksum = "a993555f31e5a609f617c12db6250dedcac1b0a85076912c436e6fc9b2c8e6a3"
|
|
1164
|
+
|
|
1024
1165
|
[[package]]
|
|
1025
1166
|
name = "quote"
|
|
1026
1167
|
version = "1.0.41"
|
|
@@ -1082,9 +1223,9 @@ dependencies = [
|
|
|
1082
1223
|
|
|
1083
1224
|
[[package]]
|
|
1084
1225
|
name = "regex"
|
|
1085
|
-
version = "1.
|
|
1226
|
+
version = "1.12.1"
|
|
1086
1227
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1087
|
-
checksum = "
|
|
1228
|
+
checksum = "4a52d8d02cacdb176ef4678de6c052efb4b3da14b78e4db683a4252762be5433"
|
|
1088
1229
|
dependencies = [
|
|
1089
1230
|
"aho-corasick",
|
|
1090
1231
|
"memchr",
|
|
@@ -1094,9 +1235,9 @@ dependencies = [
|
|
|
1094
1235
|
|
|
1095
1236
|
[[package]]
|
|
1096
1237
|
name = "regex-automata"
|
|
1097
|
-
version = "0.4.
|
|
1238
|
+
version = "0.4.12"
|
|
1098
1239
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1099
|
-
checksum = "
|
|
1240
|
+
checksum = "722166aa0d7438abbaa4d5cc2c649dac844e8c56d82fb3d33e9c34b5cd268fc6"
|
|
1100
1241
|
dependencies = [
|
|
1101
1242
|
"aho-corasick",
|
|
1102
1243
|
"memchr",
|
|
@@ -1105,9 +1246,9 @@ dependencies = [
|
|
|
1105
1246
|
|
|
1106
1247
|
[[package]]
|
|
1107
1248
|
name = "regex-syntax"
|
|
1108
|
-
version = "0.8.
|
|
1249
|
+
version = "0.8.7"
|
|
1109
1250
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1110
|
-
checksum = "
|
|
1251
|
+
checksum = "c3160422bbd54dd5ecfdca71e5fd59b7b8fe2b1697ab2baf64f6d05dcc66d298"
|
|
1111
1252
|
|
|
1112
1253
|
[[package]]
|
|
1113
1254
|
name = "roff"
|
|
@@ -1198,6 +1339,12 @@ dependencies = [
|
|
|
1198
1339
|
"serde_core",
|
|
1199
1340
|
]
|
|
1200
1341
|
|
|
1342
|
+
[[package]]
|
|
1343
|
+
name = "simd-adler32"
|
|
1344
|
+
version = "0.3.7"
|
|
1345
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1346
|
+
checksum = "d66dc143e6b11c1eddc06d5c423cfc97062865baf299914ab64caa38182078fe"
|
|
1347
|
+
|
|
1201
1348
|
[[package]]
|
|
1202
1349
|
name = "siphasher"
|
|
1203
1350
|
version = "1.0.1"
|
|
@@ -1517,6 +1664,12 @@ dependencies = [
|
|
|
1517
1664
|
"string_cache_codegen",
|
|
1518
1665
|
]
|
|
1519
1666
|
|
|
1667
|
+
[[package]]
|
|
1668
|
+
name = "weezl"
|
|
1669
|
+
version = "0.1.10"
|
|
1670
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1671
|
+
checksum = "a751b3277700db47d3e574514de2eced5e54dc8a5436a3bf7a0b248b2cee16f3"
|
|
1672
|
+
|
|
1520
1673
|
[[package]]
|
|
1521
1674
|
name = "winapi-util"
|
|
1522
1675
|
version = "0.1.11"
|
|
@@ -1797,3 +1950,18 @@ dependencies = [
|
|
|
1797
1950
|
"quote",
|
|
1798
1951
|
"syn",
|
|
1799
1952
|
]
|
|
1953
|
+
|
|
1954
|
+
[[package]]
|
|
1955
|
+
name = "zune-core"
|
|
1956
|
+
version = "0.4.12"
|
|
1957
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1958
|
+
checksum = "3f423a2c17029964870cfaabb1f13dfab7d092a62a29a89264f4d36990ca414a"
|
|
1959
|
+
|
|
1960
|
+
[[package]]
|
|
1961
|
+
name = "zune-jpeg"
|
|
1962
|
+
version = "0.4.21"
|
|
1963
|
+
source = "registry+https://github.com/rust-lang/crates.io-index"
|
|
1964
|
+
checksum = "29ce2c8a9384ad323cf564b67da86e21d3cfdff87908bc1223ed5c99bc792713"
|
|
1965
|
+
dependencies = [
|
|
1966
|
+
"zune-core",
|
|
1967
|
+
]
|
|
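Read together, the Cargo.lock hunks above suggest that the new lock entries all arrive through the new `image` dependency. The following is a minimal sketch, not part of the package, that copies the `dependencies` lists from the added `[[package]]` entries shown in this diff into a dictionary and walks them breadth-first. Packages whose entries are not shown as added (for example `num-traits`, `bitflags`, `cfg-if`) are treated as leaves, which is an assumption; the names `ADDED` and `reachable` are hypothetical.

```python
from collections import deque

# Dependency lists copied from the added [[package]] entries in the hunks above.
# Any package that is not a key here is treated as a leaf (an assumption).
ADDED = {
    "adler2": [],
    "bytemuck": [],
    "byteorder-lite": [],
    "color_quant": [],
    "crc32fast": ["cfg-if"],
    "fdeflate": ["simd-adler32"],
    "flate2": ["crc32fast", "miniz_oxide"],
    "gif": ["color_quant", "weezl"],
    "image": [
        "bytemuck", "byteorder-lite", "color_quant", "gif", "image-webp",
        "moxcms", "num-traits", "png", "zune-core", "zune-jpeg",
    ],
    "image-webp": ["byteorder-lite", "quick-error"],
    "miniz_oxide": ["adler2", "simd-adler32"],
    "moxcms": ["num-traits", "pxfm"],
    "png": ["bitflags", "crc32fast", "fdeflate", "flate2", "miniz_oxide"],
    "pxfm": ["num-traits"],
    "quick-error": [],
    "simd-adler32": [],
    "weezl": [],
    "zune-core": [],
    "zune-jpeg": ["zune-core"],
}


def reachable(root, graph):
    """Return every package reachable from `root` by following dependency lists."""
    seen = set()
    queue = deque([root])
    while queue:
        name = queue.popleft()
        for dep in graph.get(name, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


from_image = reachable("image", ADDED)
# Added packages that are NOT pulled in (directly or transitively) by `image`.
not_via_image = sorted(set(ADDED) - from_image - {"image"})
print(len(ADDED), not_via_image)  # prints: 19 []
```

Under these assumptions, all 18 other added lock entries are transitive dependencies of `image`, which `html-to-markdown-rs` and `html-to-markdown-py` both list as a new dependency in this release.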
@@ -3,7 +3,7 @@ resolver = "2"
|
|
|
3
3
|
members = ["crates/html-to-markdown-py"]
|
|
4
4
|
|
|
5
5
|
[workspace.package]
|
|
6
|
-
version = "2.0
|
|
6
|
+
version = "2.1.0"
|
|
7
7
|
edition = "2021"
|
|
8
8
|
authors = ["Na'aman Hirschfeld <nhirschfeld@gmail.com>"]
|
|
9
9
|
license = "MIT"
|
|
@@ -15,7 +15,7 @@ rust-version = "1.80"
|
|
|
15
15
|
|
|
16
16
|
[workspace.dependencies]
|
|
17
17
|
# Core library
|
|
18
|
-
html-to-markdown-rs = { version = "2.0
|
|
18
|
+
html-to-markdown-rs = { version = "2.1.0", path = "crates/html-to-markdown" }
|
|
19
19
|
|
|
20
20
|
# HTML parsing and sanitization
|
|
21
21
|
tl = "0.7"
|
|
@@ -0,0 +1,196 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: html-to-markdown
|
|
3
|
+
Version: 2.1.0
|
|
4
|
+
Classifier: Development Status :: 5 - Production/Stable
|
|
5
|
+
Classifier: Environment :: Console
|
|
6
|
+
Classifier: Intended Audience :: Developers
|
|
7
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
8
|
+
Classifier: Operating System :: OS Independent
|
|
9
|
+
Classifier: Programming Language :: Python :: 3 :: Only
|
|
10
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
12
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.14
|
|
15
|
+
Classifier: Programming Language :: Rust
|
|
16
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
17
|
+
Classifier: Topic :: Text Processing
|
|
18
|
+
Classifier: Topic :: Text Processing :: Markup
|
|
19
|
+
Classifier: Topic :: Text Processing :: Markup :: HTML
|
|
20
|
+
Classifier: Topic :: Text Processing :: Markup :: Markdown
|
|
21
|
+
Classifier: Typing :: Typed
|
|
22
|
+
License-File: LICENSE
|
|
23
|
+
Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
|
|
24
|
+
Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
|
|
25
|
+
Home-Page: https://github.com/Goldziher/html-to-markdown
|
|
26
|
+
Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
|
|
27
|
+
Requires-Python: >=3.10
|
|
28
|
+
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
|
|
29
|
+
Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
|
|
30
|
+
Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
|
|
31
|
+
Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
|
|
32
|
+
Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
|
|
33
|
+
|
|
34
|
+
# html-to-markdown
|
|
35
|
+
|
|
36
|
+
High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). Wheels are published for Linux, macOS, and Windows.
|
|
37
|
+
|
|
38
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
39
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
40
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
41
|
+
[](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
|
|
42
|
+
|
|
43
|
+
## Installation
|
|
44
|
+
|
|
45
|
+
```bash
|
|
46
|
+
pip install html-to-markdown
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Performance Snapshot
|
|
50
|
+
|
|
51
|
+
Apple M4 • Real Wikipedia documents • `convert()` (Python)
|
|
52
|
+
|
|
53
|
+
| Document | Size | Latency | Throughput | Docs/sec |
|
|
54
|
+
| ------------------- | ----- | ------- | ---------- | -------- |
|
|
55
|
+
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
|
|
56
|
+
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
|
|
57
|
+
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
|
|
58
|
+
|
|
59
|
+
> V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80× higher throughput.
|
|
60
|
+
|
|
61
|
+
## Quick Start
|
|
62
|
+
|
|
63
|
+
```python
|
|
64
|
+
from html_to_markdown import convert
|
|
65
|
+
|
|
66
|
+
html = """
|
|
67
|
+
<h1>Welcome</h1>
|
|
68
|
+
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
|
|
69
|
+
<ul>
|
|
70
|
+
<li>Blazing fast</li>
|
|
71
|
+
<li>Type safe</li>
|
|
72
|
+
<li>Easy to use</li>
|
|
73
|
+
</ul>
|
|
74
|
+
"""
|
|
75
|
+
|
|
76
|
+
markdown = convert(html)
|
|
77
|
+
print(markdown)
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
## Configuration (v2 API)
|
|
81
|
+
|
|
82
|
+
```python
|
|
83
|
+
from html_to_markdown import ConversionOptions, convert
|
|
84
|
+
|
|
85
|
+
options = ConversionOptions(
|
|
86
|
+
heading_style="atx",
|
|
87
|
+
list_indent_width=2,
|
|
88
|
+
bullets="*+-",
|
|
89
|
+
)
|
|
90
|
+
options.escape_asterisks = True
|
|
91
|
+
options.code_language = "python"
|
|
92
|
+
options.extract_metadata = True
|
|
93
|
+
|
|
94
|
+
markdown = convert(html, options)
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
### HTML Preprocessing
|
|
98
|
+
|
|
99
|
+
```python
|
|
100
|
+
from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
|
|
101
|
+
|
|
102
|
+
options = ConversionOptions(
|
|
103
|
+
preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
|
|
104
|
+
)
|
|
105
|
+
|
|
106
|
+
markdown = convert(scraped_html, options)
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
### Inline Image Extraction
|
|
110
|
+
|
|
111
|
+
```python
|
|
112
|
+
from html_to_markdown import InlineImageConfig, convert_with_inline_images
|
|
113
|
+
|
|
114
|
+
markdown, inline_images, warnings = convert_with_inline_images(
|
|
115
|
+
'<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
|
|
116
|
+
image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
|
|
117
|
+
)
|
|
118
|
+
|
|
119
|
+
if inline_images:
|
|
120
|
+
first = inline_images[0]
|
|
121
|
+
print(first["format"], first["dimensions"], first["attributes"]) # e.g. "png", (1, 1), {"width": "1"}
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
|
|
125
|
+
|
|
126
|
+
### hOCR (HTML OCR) Support
|
|
127
|
+
|
|
128
|
+
```python
|
|
129
|
+
from html_to_markdown import ConversionOptions, convert
|
|
130
|
+
|
|
131
|
+
# Default: emit structured Markdown directly
|
|
132
|
+
markdown = convert(hocr_html)
|
|
133
|
+
|
|
134
|
+
# hOCR documents are detected automatically; tables are reconstructed without extra configuration.
|
|
135
|
+
markdown = convert(hocr_html)
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
## CLI (same engine)
|
|
139
|
+
|
|
140
|
+
```bash
|
|
141
|
+
pipx install html-to-markdown # or: pip install html-to-markdown
|
|
142
|
+
|
|
143
|
+
html-to-markdown page.html > page.md
|
|
144
|
+
cat page.html | html-to-markdown --heading-style atx > page.md
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
## API Surface
|
|
148
|
+
|
|
149
|
+
### `ConversionOptions`
|
|
150
|
+
|
|
151
|
+
Key fields (see docstring for full matrix):
|
|
152
|
+
|
|
153
|
+
- `heading_style`: `"underlined" | "atx" | "atx_closed"`
|
|
154
|
+
- `list_indent_width`: spaces per indent level (default 2)
|
|
155
|
+
- `bullets`: cycle of bullet characters (`"*+-"`)
|
|
156
|
+
- `strong_em_symbol`: `"*"` or `"_"`
|
|
157
|
+
- `code_language`: default fenced code block language
|
|
158
|
+
- `wrap`, `wrap_width`: wrap Markdown output
|
|
159
|
+
- `strip_tags`: remove specific HTML tags
|
|
160
|
+
- `preprocessing`: `PreprocessingOptions`
|
|
161
|
+
- `encoding`: input character encoding (informational)
|
|
162
|
+
|
|
163
|
+
### `PreprocessingOptions`
|
|
164
|
+
|
|
165
|
+
- `enabled`: enable HTML sanitisation
|
|
166
|
+
- `preset`: `"minimal" | "standard" | "aggressive"`
|
|
167
|
+
- `remove_navigation`, `remove_forms`
|
|
168
|
+
|
|
169
|
+
### `InlineImageConfig`
|
|
170
|
+
|
|
171
|
+
- `max_decoded_size_bytes`: reject larger payloads
|
|
172
|
+
- `filename_prefix`: generated name prefix (`embedded_image` default)
|
|
173
|
+
- `capture_svg`: collect inline `<svg>` (default `True`)
|
|
174
|
+
- `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
|
|
175
|
+
|
|
176
|
+
## v1 Compatibility
|
|
177
|
+
|
|
178
|
+
- **Performance**: V1 averaged ~2.5 MB/s; V2 sustains 150–210 MB/s with identical Markdown output.
|
|
179
|
+
- **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify` to ease migration. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
|
|
180
|
+
- **CLI**: The Rust CLI replaces the Python script. New flags are documented via `html-to-markdown --help`.
|
|
181
|
+
- **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
|
|
182
|
+
|
|
183
|
+
## Links
|
|
184
|
+
|
|
185
|
+
- GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
|
|
186
|
+
- Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
|
|
187
|
+
- Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)
|
|
188
|
+
|
|
189
|
+
## License
|
|
190
|
+
|
|
191
|
+
MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).
|
|
192
|
+
|
|
193
|
+
## Support
|
|
194
|
+
|
|
195
|
+
If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
|
|
196
|
+
|
|
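The figures in the "Performance Snapshot" table of the PKG-INFO above can be cross-checked with simple arithmetic. This is a minimal sketch, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and taking the README's ~2.5 MB/s figure for v1 at face value; the helper names are hypothetical and it does not re-run any benchmark.

```python
# Rows copied from the table: (document, size in KB, latency in ms).
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]
V1_MB_PER_S = 2.5  # v1 average quoted in the README


def throughput_mb_per_s(size_kb, latency_ms):
    # KB per millisecond equals MB per second when both use decimal units.
    return size_kb / latency_ms


def docs_per_sec(latency_ms):
    # One document takes latency_ms milliseconds, so 1000 ms hold this many.
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in ROWS:
    mbps = throughput_mb_per_s(size_kb, latency_ms)
    speedup = mbps / V1_MB_PER_S
    print(f"{name}: {mbps:.0f} MB/s, {docs_per_sec(latency_ms):.0f} docs/s, {speedup:.0f}x v1")
```

The computed throughputs (208, 178, 144 MB/s) and document rates (1,613, 495, 219 per second) match the table, and the implied speedups of roughly 58× to 83× are consistent with the rounded "60–80×" claim.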
@@ -0,0 +1,162 @@
|
|
|
1
|
+
# html-to-markdown
|
|
2
|
+
|
|
3
|
+
High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). Wheels are published for Linux, macOS, and Windows.
|
|
4
|
+
|
|
5
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
6
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
7
|
+
[](https://github.com/Goldziher/html-to-markdown)
|
|
8
|
+
[](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
|
|
9
|
+
|
|
10
|
+
## Installation
|
|
11
|
+
|
|
12
|
+
```bash
|
|
13
|
+
pip install html-to-markdown
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
## Performance Snapshot
|
|
17
|
+
|
|
18
|
+
Apple M4 • Real Wikipedia documents • `convert()` (Python)
|
|
19
|
+
|
|
20
|
+
| Document | Size | Latency | Throughput | Docs/sec |
|
|
21
|
+
| ------------------- | ----- | ------- | ---------- | -------- |
|
|
22
|
+
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
|
|
23
|
+
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
|
|
24
|
+
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
|
|
25
|
+
|
|
26
|
+
> V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80× higher throughput.
|
|
27
|
+
|
|
28
|
+
## Quick Start
|
|
29
|
+
|
|
30
|
+
```python
|
|
31
|
+
from html_to_markdown import convert
|
|
32
|
+
|
|
33
|
+
html = """
|
|
34
|
+
<h1>Welcome</h1>
|
|
35
|
+
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
|
|
36
|
+
<ul>
|
|
37
|
+
<li>Blazing fast</li>
|
|
38
|
+
<li>Type safe</li>
|
|
39
|
+
<li>Easy to use</li>
|
|
40
|
+
</ul>
|
|
41
|
+
"""
|
|
42
|
+
|
|
43
|
+
markdown = convert(html)
|
|
44
|
+
print(markdown)
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
## Configuration (v2 API)
|
|
48
|
+
|
|
49
|
+
```python
|
|
50
|
+
from html_to_markdown import ConversionOptions, convert
|
|
51
|
+
|
|
52
|
+
options = ConversionOptions(
|
|
53
|
+
heading_style="atx",
|
|
54
|
+
list_indent_width=2,
|
|
55
|
+
bullets="*+-",
|
|
56
|
+
)
|
|
57
|
+
options.escape_asterisks = True
|
|
58
|
+
options.code_language = "python"
|
|
59
|
+
options.extract_metadata = True
|
|
60
|
+
|
|
61
|
+
markdown = convert(html, options)
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
### HTML Preprocessing
|
|
65
|
+
|
|
66
|
+
```python
|
|
67
|
+
from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
|
|
68
|
+
|
|
69
|
+
options = ConversionOptions(
|
|
70
|
+
preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
|
|
71
|
+
)
|
|
72
|
+
|
|
73
|
+
markdown = convert(scraped_html, options)
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
### Inline Image Extraction
|
|
77
|
+
|
|
78
|
+
```python
|
|
79
|
+
from html_to_markdown import InlineImageConfig, convert_with_inline_images
|
|
80
|
+
|
|
81
|
+
markdown, inline_images, warnings = convert_with_inline_images(
|
|
82
|
+
'<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
|
|
83
|
+
image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
|
|
84
|
+
)
|
|
85
|
+
|
|
86
|
+
if inline_images:
|
|
87
|
+
first = inline_images[0]
|
|
88
|
+
print(first["format"], first["dimensions"], first["attributes"]) # e.g. "png", (1, 1), {"width": "1"}
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
|
|
92
|
+
|
|
93
|
+
### hOCR (HTML OCR) Support
|
|
94
|
+
|
|
95
|
+
```python
|
|
96
|
+
from html_to_markdown import ConversionOptions, convert
|
|
97
|
+
|
|
98
|
+
# Default: emit structured Markdown directly
|
|
99
|
+
markdown = convert(hocr_html)
|
|
100
|
+
|
|
101
|
+
# hOCR documents are detected automatically; tables are reconstructed without extra configuration.
|
|
102
|
+
markdown = convert(hocr_html)
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
## CLI (same engine)
|
|
106
|
+
|
|
107
|
+
```bash
|
|
108
|
+
pipx install html-to-markdown # or: pip install html-to-markdown
|
|
109
|
+
|
|
110
|
+
html-to-markdown page.html > page.md
|
|
111
|
+
cat page.html | html-to-markdown --heading-style atx > page.md
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
## API Surface
|
|
115
|
+
|
|
116
|
+
### `ConversionOptions`
|
|
117
|
+
|
|
118
|
+
Key fields (see docstring for full matrix):
|
|
119
|
+
|
|
120
|
+
- `heading_style`: `"underlined" | "atx" | "atx_closed"`
|
|
121
|
+
- `list_indent_width`: spaces per indent level (default 2)
|
|
122
|
+
- `bullets`: cycle of bullet characters (`"*+-"`)
|
|
123
|
+
- `strong_em_symbol`: `"*"` or `"_"`
|
|
124
|
+
- `code_language`: default fenced code block language
|
|
125
|
+
- `wrap`, `wrap_width`: wrap Markdown output
|
|
126
|
+
- `strip_tags`: remove specific HTML tags
|
|
127
|
+
- `preprocessing`: `PreprocessingOptions`
|
|
128
|
+
- `encoding`: input character encoding (informational)
|
|
129
|
+
|
|
130
|
+
### `PreprocessingOptions`
|
|
131
|
+
|
|
132
|
+
- `enabled`: enable HTML sanitisation
|
|
133
|
+
- `preset`: `"minimal" | "standard" | "aggressive"`
|
|
134
|
+
- `remove_navigation`, `remove_forms`
|
|
135
|
+
|
|
136
|
+
### `InlineImageConfig`
|
|
137
|
+
|
|
138
|
+
- `max_decoded_size_bytes`: reject larger payloads
|
|
139
|
+
- `filename_prefix`: generated name prefix (`embedded_image` default)
|
|
140
|
+
- `capture_svg`: collect inline `<svg>` (default `True`)
|
|
141
|
+
- `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
|
|
142
|
+
|
|
143
|
+
## v1 Compatibility
|
|
144
|
+
|
|
145
|
+
- **Performance**: V1 averaged ~2.5 MB/s; V2 sustains 150–210 MB/s with identical Markdown output.
|
|
146
|
+
- **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify` to ease migration. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
|
|
147
|
+
- **CLI**: The Rust CLI replaces the Python script. New flags are documented via `html-to-markdown --help`.
|
|
148
|
+
- **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
|
|
149
|
+
|
|
150
|
+
## Links
|
|
151
|
+
|
|
152
|
+
- GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
|
|
153
|
+
- Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
|
|
154
|
+
- Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)
|
|
155
|
+
|
|
156
|
+
## License
|
|
157
|
+
|
|
158
|
+
MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).
|
|
159
|
+
|
|
160
|
+
## Support
|
|
161
|
+
|
|
162
|
+
If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
|
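The `InlineImageConfig` fields described in the README above (`max_decoded_size_bytes` rejecting larger payloads, `infer_dimensions` decoding raster images to obtain dimensions) can be illustrated with a standard-library sketch. This is not the library's implementation, which per the Cargo.lock changes relies on the Rust `image` crate; the function `png_dimensions_from_data_uri` is hypothetical, handles only base64 PNG data URIs, and returns `None` where the real API would presumably emit a warning.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions_from_data_uri(uri, max_decoded_size_bytes=1024):
    """Return (width, height) for a base64 PNG data URI, or None if it is skipped."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:image/png") or not header.endswith(";base64"):
        return None  # not a base64-encoded PNG data URI
    data = base64.b64decode(payload)
    if len(data) > max_decoded_size_bytes:
        return None  # payload exceeds the configured size limit
    # A PNG starts with an 8-byte signature, then the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", 4-byte width, 4-byte height (big-endian).
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Build only the first 24 bytes of a 1x1 PNG (signature plus the start of IHDR).
# This is enough for header parsing but is not a complete, renderable image.
head = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 1, 1)
uri = "data:image/png;base64," + base64.b64encode(head).decode("ascii")

print(png_dimensions_from_data_uri(uri))  # (1, 1)
print(png_dimensions_from_data_uri(uri, max_decoded_size_bytes=16))  # None
```

Checking the decoded size before inspecting the bytes mirrors the idea of rejecting oversized payloads early; reading the header alone is a simplification of full image decoding.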