html-to-markdown 2.0.1__tar.gz → 2.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of html-to-markdown might be problematic.

Files changed (60)
  1. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/Cargo.lock +177 -9
  2. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/Cargo.toml +2 -2
  3. html_to_markdown-2.1.2/PKG-INFO +196 -0
  4. html_to_markdown-2.1.2/README_PYPI.md +162 -0
  5. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/Cargo.toml +5 -0
  6. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/README.md +39 -54
  7. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/benches/conversion_benchmark.rs +2 -19
  8. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/benches/profiling_benchmark.rs +2 -20
  9. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/converter.rs +351 -58
  10. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/hocr/converter.rs +74 -9
  11. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/hocr/extractor.rs +1 -1
  12. html_to_markdown-2.1.2/crates/html-to-markdown/src/hocr/mod.rs +30 -0
  13. html_to_markdown-2.0.1/crates/html-to-markdown/src/hocr.rs → html_to_markdown-2.1.2/crates/html-to-markdown/src/hocr/spatial.rs +3 -31
  14. html_to_markdown-2.1.2/crates/html-to-markdown/src/inline_images.rs +221 -0
  15. html_to_markdown-2.1.2/crates/html-to-markdown/src/lib.rs +114 -0
  16. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/options.rs +3 -34
  17. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/tests/hocr_compliance_test.rs +5 -5
  18. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown-py/Cargo.toml +5 -1
  19. html_to_markdown-2.1.2/crates/html-to-markdown-py/README.md +86 -0
  20. html_to_markdown-2.1.2/crates/html-to-markdown-py/python/html_to_markdown/__init__.py +18 -0
  21. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown-py/python/html_to_markdown/_html_to_markdown.pyi +35 -19
  22. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown-py/src/lib.rs +240 -44
  23. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/__init__.py +3 -9
  24. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/_rust.pyi +4 -12
  25. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/api.py +7 -34
  26. html_to_markdown-2.1.2/html_to_markdown/cli.py +3 -0
  27. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/cli_proxy.py +23 -33
  28. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/exceptions.py +8 -16
  29. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/options.py +3 -76
  30. html_to_markdown-2.1.2/html_to_markdown/v1_compat.py +189 -0
  31. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/pyproject.toml +8 -8
  32. html_to_markdown-2.0.1/PKG-INFO +0 -243
  33. html_to_markdown-2.0.1/README_PYPI.md +0 -210
  34. html_to_markdown-2.0.1/crates/html-to-markdown/src/lib.rs +0 -57
  35. html_to_markdown-2.0.1/crates/html-to-markdown-py/README.md +0 -438
  36. html_to_markdown-2.0.1/crates/html-to-markdown-py/python/html_to_markdown/__init__.py +0 -5
  37. html_to_markdown-2.0.1/html_to_markdown/cli.py +0 -9
  38. html_to_markdown-2.0.1/html_to_markdown/v1_compat.py +0 -161
  39. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/LICENSE +0 -0
  40. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/benches/micro_benchmark.rs +0 -0
  41. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/basic.rs +0 -0
  42. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/table.rs +0 -0
  43. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_escape.rs +0 -0
  44. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_inline_formatting.rs +0 -0
  45. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_lists.rs +0 -0
  46. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_semantic_tags.rs +0 -0
  47. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_tables.rs +0 -0
  48. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_task_lists.rs +0 -0
  49. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/examples/test_whitespace.rs +0 -0
  50. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/error.rs +0 -0
  51. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/hocr/parser.rs +0 -0
  52. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/hocr/types.rs +0 -0
  53. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/sanitizer.rs +0 -0
  54. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/text.rs +0 -0
  55. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/src/wrapper.rs +0 -0
  56. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/tests/commonmark_compliance_test.rs +0 -0
  57. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown/tests/integration_test.rs +0 -0
  58. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/crates/html-to-markdown-py/uv.lock +0 -0
  59. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/__main__.py +0 -0
  60. {html_to_markdown-2.0.1 → html_to_markdown-2.1.2}/html_to_markdown/py.typed +0 -0
@@ -2,6 +2,12 @@
  # It is not intended for manual editing.
  version = 3

+ [[package]]
+ name = "adler2"
+ version = "2.0.1"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
+
  [[package]]
  name = "aho-corasick"
  version = "1.1.3"
@@ -131,6 +137,18 @@ version = "3.19.0"
  source = "registry+https://github.com/rust-lang/crates.io-index"
  checksum = "46c5e41b57b8bba42a04676d81cb89e9ee8e859a1a66f80a5a72e1cb76b34d43"

+ [[package]]
+ name = "bytemuck"
+ version = "1.24.0"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "1fbdf580320f38b612e485521afda1ee26d10cc9884efaaa750d383e13e3c5f4"
+
+ [[package]]
+ name = "byteorder-lite"
+ version = "0.1.0"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "8f1fe948ff07f4bd06c30984e69f5b4899c516a3ef74f34df92a2df2ab535495"
+
  [[package]]
  name = "cast"
  version = "0.3.0"
@@ -229,12 +247,27 @@ dependencies = [
   "roff",
  ]

+ [[package]]
+ name = "color_quant"
+ version = "1.1.0"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "3d7b894f5411737b7867f4827955924d7c254fc9f4d91a6aad6b097804b1018b"
+
  [[package]]
  name = "colorchoice"
  version = "1.0.4"
  source = "registry+https://github.com/rust-lang/crates.io-index"
  checksum = "b05b61dc5112cbb17e4b6cd61790d9845d13888356391624cbe7e41efeac1e75"

+ [[package]]
+ name = "crc32fast"
+ version = "1.5.0"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511"
+ dependencies = [
+  "cfg-if",
+ ]
+
  [[package]]
  name = "criterion"
  version = "0.5.1"
@@ -394,6 +427,25 @@ version = "2.3.0"
  source = "registry+https://github.com/rust-lang/crates.io-index"
  checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"

+ [[package]]
+ name = "fdeflate"
+ version = "0.3.7"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "1e6853b52649d4ac5c0bd02320cddc5ba956bdb407c4b75a2c6b75bf51500f8c"
+ dependencies = [
+  "simd-adler32",
+ ]
+
+ [[package]]
+ name = "flate2"
+ version = "1.1.4"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "dc5a4e564e38c699f2880d3fda590bedc2e69f3f84cd48b457bd892ce61d0aa9"
+ dependencies = [
+  "crc32fast",
+  "miniz_oxide",
+ ]
+
  [[package]]
  name = "float-cmp"
  version = "0.10.0"
@@ -434,6 +486,16 @@ dependencies = [
   "wasi",
  ]

+ [[package]]
+ name = "gif"
+ version = "0.13.3"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "4ae047235e33e2829703574b54fdec96bfbad892062d97fed2f76022287de61b"
+ dependencies = [
+  "color_quant",
+  "weezl",
+ ]
+
  [[package]]
  name = "half"
  version = "2.7.0"
@@ -468,7 +530,7 @@ dependencies = [

  [[package]]
  name = "html-to-markdown-cli"
- version = "2.0.1"
+ version = "2.1.2"
  dependencies = [
   "assert_cmd",
   "clap",
@@ -482,20 +544,23 @@ dependencies = [

  [[package]]
  name = "html-to-markdown-py"
- version = "2.0.1"
+ version = "2.1.2"
  dependencies = [
+  "base64",
   "html-to-markdown-rs",
+  "image",
   "pyo3",
  ]

  [[package]]
  name = "html-to-markdown-rs"
- version = "2.0.1"
+ version = "2.1.2"
  dependencies = [
   "ammonia",
   "base64",
   "criterion",
   "html-escape",
+  "image",
   "once_cell",
   "regex",
   "serde",
@@ -622,6 +687,34 @@ dependencies = [
   "icu_properties",
  ]

+ [[package]]
+ name = "image"
+ version = "0.25.8"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "529feb3e6769d234375c4cf1ee2ce713682b8e76538cb13f9fc23e1400a591e7"
+ dependencies = [
+  "bytemuck",
+  "byteorder-lite",
+  "color_quant",
+  "gif",
+  "image-webp",
+  "moxcms",
+  "num-traits",
+  "png",
+  "zune-core",
+  "zune-jpeg",
+ ]
+
+ [[package]]
+ name = "image-webp"
+ version = "0.2.4"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "525e9ff3e1a4be2fbea1fdf0e98686a6d98b4d8f937e1bf7402245af1909e8c3"
+ dependencies = [
+  "byteorder-lite",
+  "quick-error",
+ ]
+
  [[package]]
  name = "indoc"
  version = "2.0.6"
@@ -752,6 +845,26 @@ dependencies = [
   "autocfg",
  ]

+ [[package]]
+ name = "miniz_oxide"
+ version = "0.8.9"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316"
+ dependencies = [
+  "adler2",
+  "simd-adler32",
+ ]
+
+ [[package]]
+ name = "moxcms"
+ version = "0.7.7"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "c588e11a3082784af229e23e8e4ecf5bcc6fbe4f69101e0421ce8d79da7f0b40"
+ dependencies = [
+  "num-traits",
+  "pxfm",
+ ]
+
  [[package]]
  name = "new_debug_unreachable"
  version = "1.0.6"
@@ -900,6 +1013,19 @@ dependencies = [
   "plotters-backend",
  ]

+ [[package]]
+ name = "png"
+ version = "0.18.0"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "97baced388464909d42d89643fe4361939af9b7ce7a31ee32a168f832a70f2a0"
+ dependencies = [
+  "bitflags",
+  "crc32fast",
+  "fdeflate",
+  "flate2",
+  "miniz_oxide",
+ ]
+
  [[package]]
  name = "portable-atomic"
  version = "1.11.1"
@@ -960,6 +1086,15 @@ dependencies = [
   "unicode-ident",
  ]

+ [[package]]
+ name = "pxfm"
+ version = "0.1.25"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "a3cbdf373972bf78df4d3b518d07003938e2c7d1fb5891e55f9cb6df57009d84"
+ dependencies = [
+  "num-traits",
+ ]
+
  [[package]]
  name = "pyo3"
  version = "0.26.0"
@@ -1021,6 +1156,12 @@ dependencies = [
   "syn",
  ]

+ [[package]]
+ name = "quick-error"
+ version = "2.0.1"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "a993555f31e5a609f617c12db6250dedcac1b0a85076912c436e6fc9b2c8e6a3"
+
  [[package]]
  name = "quote"
  version = "1.0.41"
@@ -1082,9 +1223,9 @@ dependencies = [

  [[package]]
  name = "regex"
- version = "1.11.3"
+ version = "1.12.1"
  source = "registry+https://github.com/rust-lang/crates.io-index"
- checksum = "8b5288124840bee7b386bc413c487869b360b2b4ec421ea56425128692f2a82c"
+ checksum = "4a52d8d02cacdb176ef4678de6c052efb4b3da14b78e4db683a4252762be5433"
  dependencies = [
   "aho-corasick",
   "memchr",
@@ -1094,9 +1235,9 @@ dependencies = [

  [[package]]
  name = "regex-automata"
- version = "0.4.11"
+ version = "0.4.12"
  source = "registry+https://github.com/rust-lang/crates.io-index"
- checksum = "833eb9ce86d40ef33cb1306d8accf7bc8ec2bfea4355cbdebb3df68b40925cad"
+ checksum = "722166aa0d7438abbaa4d5cc2c649dac844e8c56d82fb3d33e9c34b5cd268fc6"
  dependencies = [
   "aho-corasick",
   "memchr",
@@ -1105,9 +1246,9 @@ dependencies = [

  [[package]]
  name = "regex-syntax"
- version = "0.8.6"
+ version = "0.8.7"
  source = "registry+https://github.com/rust-lang/crates.io-index"
- checksum = "caf4aa5b0f434c91fe5c7f1ecb6a5ece2130b02ad2a590589dda5146df959001"
+ checksum = "c3160422bbd54dd5ecfdca71e5fd59b7b8fe2b1697ab2baf64f6d05dcc66d298"

  [[package]]
  name = "roff"
@@ -1198,6 +1339,12 @@ dependencies = [
   "serde_core",
  ]

+ [[package]]
+ name = "simd-adler32"
+ version = "0.3.7"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "d66dc143e6b11c1eddc06d5c423cfc97062865baf299914ab64caa38182078fe"
+
  [[package]]
  name = "siphasher"
  version = "1.0.1"
@@ -1517,6 +1664,12 @@ dependencies = [
   "string_cache_codegen",
  ]

+ [[package]]
+ name = "weezl"
+ version = "0.1.10"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "a751b3277700db47d3e574514de2eced5e54dc8a5436a3bf7a0b248b2cee16f3"
+
  [[package]]
  name = "winapi-util"
  version = "0.1.11"
@@ -1797,3 +1950,18 @@ dependencies = [
   "quote",
   "syn",
  ]
+
+ [[package]]
+ name = "zune-core"
+ version = "0.4.12"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "3f423a2c17029964870cfaabb1f13dfab7d092a62a29a89264f4d36990ca414a"
+
+ [[package]]
+ name = "zune-jpeg"
+ version = "0.4.21"
+ source = "registry+https://github.com/rust-lang/crates.io-index"
+ checksum = "29ce2c8a9384ad323cf564b67da86e21d3cfdff87908bc1223ed5c99bc792713"
+ dependencies = [
+  "zune-core",
+ ]
@@ -3,7 +3,7 @@ resolver = "2"
  members = ["crates/html-to-markdown-py"]

  [workspace.package]
- version = "2.0.1"
+ version = "2.1.2"
  edition = "2021"
  authors = ["Na'aman Hirschfeld <nhirschfeld@gmail.com>"]
  license = "MIT"
@@ -15,7 +15,7 @@ rust-version = "1.80"

  [workspace.dependencies]
  # Core library
- html-to-markdown-rs = { version = "2.0.1", path = "crates/html-to-markdown" }
+ html-to-markdown-rs = { version = "2.1.2", path = "crates/html-to-markdown" }

  # HTML parsing and sanitization
  tl = "0.7"
@@ -0,0 +1,196 @@
+ Metadata-Version: 2.4
+ Name: html-to-markdown
+ Version: 2.1.2
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
+ Classifier: Programming Language :: Rust
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: Text Processing
+ Classifier: Topic :: Text Processing :: Markup
+ Classifier: Topic :: Text Processing :: Markup :: HTML
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
+ Classifier: Typing :: Typed
+ License-File: LICENSE
+ Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
+ Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
+ Home-Page: https://github.com/Goldziher/html-to-markdown
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
+ Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
+ Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
+ Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
+
+ # html-to-markdown
+
+ High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). Wheels are published for Linux, macOS, and Windows.
+
+ [![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![Rust crate](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
+
+ ## Installation
+
+ ```bash
+ pip install html-to-markdown
+ ```
+
+ ## Performance Snapshot
+
+ Apple M4 • Real Wikipedia documents • `convert()` (Python)
+
+ | Document | Size | Latency | Throughput | Docs/sec |
+ | ------------------- | ----- | ------- | ---------- | -------- |
+ | Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
+ | Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
+ | Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
+
+ > V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80× higher throughput.
+
+ ## Quick Start
+
+ ```python
+ from html_to_markdown import convert
+
+ html = """
+ <h1>Welcome</h1>
+ <p>This is <strong>fast</strong> Rust-powered conversion!</p>
+ <ul>
+ <li>Blazing fast</li>
+ <li>Type safe</li>
+ <li>Easy to use</li>
+ </ul>
+ """
+
+ markdown = convert(html)
+ print(markdown)
+ ```
+
+ ## Configuration (v2 API)
+
+ ```python
+ from html_to_markdown import ConversionOptions, convert
+
+ options = ConversionOptions(
+     heading_style="atx",
+     list_indent_width=2,
+     bullets="*+-",
+ )
+ options.escape_asterisks = True
+ options.code_language = "python"
+ options.extract_metadata = True
+
+ markdown = convert(html, options)
+ ```
+
+ ### HTML Preprocessing
+
+ ```python
+ from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
+
+ options = ConversionOptions(
+     preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
+ )
+
+ markdown = convert(scraped_html, options)
+ ```
+
+ ### Inline Image Extraction
+
+ ```python
+ from html_to_markdown import InlineImageConfig, convert_with_inline_images
+
+ markdown, inline_images, warnings = convert_with_inline_images(
+     '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
+     image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
+ )
+
+ if inline_images:
+     first = inline_images[0]
+     print(first["format"], first["dimensions"], first["attributes"]) # e.g. "png", (1, 1), {"width": "1"}
+ ```
+
+ Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
+
+ ### hOCR (HTML OCR) Support
+
+ ```python
+ from html_to_markdown import ConversionOptions, convert
+
+ # Default: emit structured Markdown directly
+ markdown = convert(hocr_html)
+
+ # hOCR documents are detected automatically; tables are reconstructed without extra configuration.
+ markdown = convert(hocr_html)
+ ```
+
+ ## CLI (same engine)
+
+ ```bash
+ pipx install html-to-markdown # or: pip install html-to-markdown
+
+ html-to-markdown page.html > page.md
+ cat page.html | html-to-markdown --heading-style atx > page.md
+ ```
+
+ ## API Surface
+
+ ### `ConversionOptions`
+
+ Key fields (see docstring for full matrix):
+
+ - `heading_style`: `"underlined" | "atx" | "atx_closed"`
+ - `list_indent_width`: spaces per indent level (default 2)
+ - `bullets`: cycle of bullet characters (`"*+-"`)
+ - `strong_em_symbol`: `"*"` or `"_"`
+ - `code_language`: default fenced code block language
+ - `wrap`, `wrap_width`: wrap Markdown output
+ - `strip_tags`: remove specific HTML tags
+ - `preprocessing`: `PreprocessingOptions`
+ - `encoding`: input character encoding (informational)
+
+ ### `PreprocessingOptions`
+
+ - `enabled`: enable HTML sanitisation
+ - `preset`: `"minimal" | "standard" | "aggressive"`
+ - `remove_navigation`, `remove_forms`
+
+ ### `InlineImageConfig`
+
+ - `max_decoded_size_bytes`: reject larger payloads
+ - `filename_prefix`: generated name prefix (`embedded_image` default)
+ - `capture_svg`: collect inline `<svg>` (default `True`)
+ - `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
+
+ ## v1 Compatibility
+
+ - **Performance**: V1 averaged ~2.5 MB/s; V2 sustains 150–210 MB/s with identical Markdown output.
+ - **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify` to ease migration. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
+ - **CLI**: The Rust CLI replaces the Python script. New flags are documented via `html-to-markdown --help`.
+ - **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
+
+ ## Links
+
+ - GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
+ - Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
+ - Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)
+
+ ## License
+
+ MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).
+
+ ## Support
+
+ If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
+
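The Performance Snapshot table in the README text above can be cross-checked with simple arithmetic: throughput is size divided by latency, and documents per second is the reciprocal of latency. The sketch below takes the sizes and latencies from the table and assumes decimal units (1 MB = 1000 KB), which the README does not state; the names `ROWS`, `throughput_mb_s`, and `docs_per_sec` are illustrative only.

```python
# (document, size in KB, latency in ms), copied from the README table.
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]

V1_MB_S = 2.5  # V1 average quoted in the README


def throughput_mb_s(size_kb: float, latency_ms: float) -> float:
    """KB per millisecond equals MB per second when 1 MB = 1000 KB."""
    return size_kb / latency_ms


def docs_per_sec(latency_ms: float) -> float:
    """Documents converted per second at the given per-document latency."""
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in ROWS:
    mb_s = throughput_mb_s(size_kb, latency_ms)
    print(
        f"{name}: {mb_s:.0f} MB/s, "
        f"{docs_per_sec(latency_ms):.0f} docs/sec, "
        f"{mb_s / V1_MB_S:.0f}x the quoted V1 average"
    )
```

Under that unit assumption the computed values round to the table's throughput and docs/sec columns, and the ratios against 2.5 MB/s land roughly between 58× and 83×, in line with the README's rounded "60–80×" claim.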
@@ -0,0 +1,162 @@
+ # html-to-markdown
+
+ High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). Wheels are published for Linux, macOS, and Windows.
+
+ [![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![Rust crate](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://github.com/Goldziher/html-to-markdown)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
+
+ ## Installation
+
+ ```bash
+ pip install html-to-markdown
+ ```
+
+ ## Performance Snapshot
+
+ Apple M4 • Real Wikipedia documents • `convert()` (Python)
+
+ | Document | Size | Latency | Throughput | Docs/sec |
+ | ------------------- | ----- | ------- | ---------- | -------- |
+ | Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
+ | Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
+ | Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
+
+ > V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80× higher throughput.
+
+ ## Quick Start
+
+ ```python
+ from html_to_markdown import convert
+
+ html = """
+ <h1>Welcome</h1>
+ <p>This is <strong>fast</strong> Rust-powered conversion!</p>
+ <ul>
+ <li>Blazing fast</li>
+ <li>Type safe</li>
+ <li>Easy to use</li>
+ </ul>
+ """
+
+ markdown = convert(html)
+ print(markdown)
+ ```
+
+ ## Configuration (v2 API)
+
+ ```python
+ from html_to_markdown import ConversionOptions, convert
+
+ options = ConversionOptions(
+     heading_style="atx",
+     list_indent_width=2,
+     bullets="*+-",
+ )
+ options.escape_asterisks = True
+ options.code_language = "python"
+ options.extract_metadata = True
+
+ markdown = convert(html, options)
+ ```
+
+ ### HTML Preprocessing
+
+ ```python
+ from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
+
+ options = ConversionOptions(
+     preprocessing=PreprocessingOptions(enabled=True, preset="aggressive"),
+ )
+
+ markdown = convert(scraped_html, options)
+ ```
+
+ ### Inline Image Extraction
+
+ ```python
+ from html_to_markdown import InlineImageConfig, convert_with_inline_images
+
+ markdown, inline_images, warnings = convert_with_inline_images(
+     '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
+     image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
+ )
+
+ if inline_images:
+     first = inline_images[0]
+     print(first["format"], first["dimensions"], first["attributes"]) # e.g. "png", (1, 1), {"width": "1"}
+ ```
+
+ Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
+
+ ### hOCR (HTML OCR) Support
+
+ ```python
+ from html_to_markdown import ConversionOptions, convert
+
+ # Default: emit structured Markdown directly
+ markdown = convert(hocr_html)
+
+ # hOCR documents are detected automatically; tables are reconstructed without extra configuration.
+ markdown = convert(hocr_html)
+ ```
+
+ ## CLI (same engine)
+
+ ```bash
+ pipx install html-to-markdown # or: pip install html-to-markdown
+
+ html-to-markdown page.html > page.md
+ cat page.html | html-to-markdown --heading-style atx > page.md
+ ```
+
+ ## API Surface
+
+ ### `ConversionOptions`
+
+ Key fields (see docstring for full matrix):
+
+ - `heading_style`: `"underlined" | "atx" | "atx_closed"`
+ - `list_indent_width`: spaces per indent level (default 2)
+ - `bullets`: cycle of bullet characters (`"*+-"`)
+ - `strong_em_symbol`: `"*"` or `"_"`
+ - `code_language`: default fenced code block language
+ - `wrap`, `wrap_width`: wrap Markdown output
+ - `strip_tags`: remove specific HTML tags
+ - `preprocessing`: `PreprocessingOptions`
+ - `encoding`: input character encoding (informational)
+
+ ### `PreprocessingOptions`
+
+ - `enabled`: enable HTML sanitisation
+ - `preset`: `"minimal" | "standard" | "aggressive"`
+ - `remove_navigation`, `remove_forms`
+
+ ### `InlineImageConfig`
+
+ - `max_decoded_size_bytes`: reject larger payloads
+ - `filename_prefix`: generated name prefix (`embedded_image` default)
+ - `capture_svg`: collect inline `<svg>` (default `True`)
+ - `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
+
+ ## v1 Compatibility
+
+ - **Performance**: V1 averaged ~2.5 MB/s; V2 sustains 150–210 MB/s with identical Markdown output.
+ - **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify` to ease migration. Keyword mappings are listed in the [changelog](CHANGELOG.md#v200).
+ - **CLI**: The Rust CLI replaces the Python script. New flags are documented via `html-to-markdown --help`.
+ - **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
+
+ ## Links
+
+ - GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
+ - Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
+ - Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)
+
+ ## License
+
+ MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).
+
+ ## Support
+
+ If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
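The README added in this release describes inline-image extraction in terms of a decoded-size cap (`max_decoded_size_bytes`) and optional dimension inference (`infer_dimensions`). To make those two ideas concrete, here is a standard-library sketch of the general technique: decode a base64 `data:` URI, reject oversized payloads, and read a PNG's width and height from its IHDR chunk. This is not the package's implementation (the diff shows that work moving into the Rust core with the `image` crate); `decode_data_uri` and `png_dimensions` are hypothetical helpers, and the sample payload is a header-only PNG fragment built in code rather than a complete image.

```python
import base64
import struct
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_data_uri(uri: str, max_decoded_size_bytes: int) -> tuple[str, bytes]:
    """Return (mime_type, payload) for a base64 data: URI, or raise ValueError."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, encoded = uri[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("only base64 data URIs are handled in this sketch")
    payload = base64.b64decode(encoded, validate=True)
    if len(payload) > max_decoded_size_bytes:
        raise ValueError("payload exceeds max_decoded_size_bytes")
    mime_type = header[: -len(";base64")] or "text/plain"
    return mime_type, payload


def png_dimensions(payload: bytes) -> Optional[tuple[int, int]]:
    """Read (width, height) from the IHDR chunk; None if this is not a PNG."""
    if len(payload) < 24:
        return None
    if payload[:8] != PNG_SIGNATURE or payload[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", payload[16:24])
    return width, height


# Header-only fragment: signature, IHDR length (13), chunk type, then a
# 1x1, 8-bit RGBA header. No pixel data or CRC, so this is NOT a valid image.
header_only_png = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)
    + b"IHDR"
    + struct.pack(">IIBBBBB", 1, 1, 8, 6, 0, 0, 0)
)
uri = "data:image/png;base64," + base64.b64encode(header_only_png).decode("ascii")

mime_type, payload = decode_data_uri(uri, max_decoded_size_bytes=1024)
print(mime_type, len(payload), png_dimensions(payload))
```

Reading only the IHDR header avoids decoding pixel data; a real extractor that supports several raster formats would need a full decoder.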