html-to-markdown 2.0.0__cp310-abi3-win_amd64.whl → 2.0.1__cp310-abi3-win_amd64.whl

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of html-to-markdown might be problematic.

Binary file
Binary file
@@ -0,0 +1,243 @@
1
+ Metadata-Version: 2.4
2
+ Name: html-to-markdown
3
+ Version: 2.0.1
4
+ Classifier: Development Status :: 5 - Production/Stable
5
+ Classifier: Environment :: Console
6
+ Classifier: Intended Audience :: Developers
7
+ Classifier: License :: OSI Approved :: MIT License
8
+ Classifier: Operating System :: OS Independent
9
+ Classifier: Programming Language :: Python :: 3 :: Only
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Classifier: Programming Language :: Python :: 3.13
14
+ Classifier: Programming Language :: Rust
15
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
16
+ Classifier: Topic :: Text Processing
17
+ Classifier: Topic :: Text Processing :: Markup
18
+ Classifier: Topic :: Text Processing :: Markup :: HTML
19
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
20
+ Classifier: Typing :: Typed
21
+ License-File: LICENSE
22
+ Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
23
+ Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
24
+ Home-Page: https://github.com/Goldziher/html-to-markdown
25
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
26
+ Requires-Python: >=3.10
27
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
28
+ Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
29
+ Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
30
+ Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
31
+ Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
32
+
33
+ # html-to-markdown
34
+
35
+ High-performance HTML to Markdown converter powered by Rust with a clean Python API. Available via PyPI with pre-built wheels for all major platforms.
36
+
37
+ [![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
38
+ [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
39
+ [![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
40
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
41
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
42
+
43
+ Part of the [Kreuzberg](https://kreuzberg.dev) ecosystem for document intelligence.
44
+
45
+ ## Installation
46
+
47
+ ```bash
48
+ pip install html-to-markdown
49
+ ```
50
+
51
+ Pre-built wheels available for:
52
+
53
+ - **Linux**: x86_64, aarch64
54
+ - **macOS**: x86_64 (Intel), arm64 (Apple Silicon)
55
+ - **Windows**: x86_64
56
+
57
+ ## ⚡ Performance
58
+
59
+ Real Wikipedia documents on Apple M4:
60
+
61
+ | Document | Size | Latency | Throughput | Docs/sec |
62
+ | ------------------- | ----- | ------- | ---------- | -------- |
63
+ | Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
64
+ | Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
65
+ | Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
66
+
67
+ **19-30x faster** than pure Python implementations.
68
+
69
+ ## Quick Start
70
+
71
+ ```python
72
+ from html_to_markdown import convert_to_markdown
73
+
74
+ html = """
75
+ <h1>Welcome</h1>
76
+ <p>This is <strong>fast</strong> Rust-powered conversion!</p>
77
+ <ul>
78
+ <li>Blazing fast</li>
79
+ <li>Type safe</li>
80
+ <li>Easy to use</li>
81
+ </ul>
82
+ """
83
+
84
+ markdown = convert_to_markdown(html)
85
+ print(markdown)
86
+ ```
87
+
88
+ Output:
89
+
90
+ ```markdown
91
+ # Welcome
92
+
93
+ This is **fast** Rust-powered conversion!
94
+
95
+ - Blazing fast
96
+ - Type safe
97
+ - Easy to use
98
+ ```
99
+
100
+ ## Configuration
101
+
102
+ ```python
103
+ from html_to_markdown import convert_to_markdown
104
+
105
+ markdown = convert_to_markdown(
106
+ html,
107
+ heading_style="atx", # "atx", "atx_closed", "underlined"
108
+ list_indent_width=2, # Discord/Slack: use 2
109
+ bullets="*+-", # Bullet characters
110
+ strong_em_symbol="*", # "*" or "_"
111
+ escape_asterisks=True, # Escape * in text
112
+ code_language="python", # Default code block language
113
+ extract_metadata=True, # Extract HTML metadata
114
+ )
115
+ ```
116
+
117
+ ### HTML Preprocessing
118
+
119
+ Clean web-scraped HTML before conversion:
120
+
121
+ ```python
122
+ from html_to_markdown import convert_to_markdown
123
+
124
+ markdown = convert_to_markdown(
125
+ scraped_html,
126
+ preprocess=True,
127
+ preprocessing_preset="aggressive", # "minimal", "standard", "aggressive"
128
+ )
129
+ ```
130
+
131
+ ## Features
132
+
133
+ - **🚀 Blazing Fast**: Pure Rust core with ultra-fast `tl` HTML parser
134
+ - **🐍 Type Safe**: Full type hints and `.pyi` stubs for excellent IDE support
135
+ - **📊 hOCR 1.2 Compliant**: Full support for all 40+ elements and 20+ properties
136
+ - **📝 CommonMark Compliant**: Follows CommonMark specification for list formatting
137
+ - **🌍 Cross-Platform**: Pre-built wheels for Linux, macOS, and Windows
138
+ - **✅ Well-Tested**: 900+ tests with dual Python + Rust coverage
139
+ - **🔧 Zero Dependencies**: No BeautifulSoup or lxml required
140
+
141
+ ## hOCR 1.2 Support
142
+
143
+ Complete hOCR 1.2 specification compliance with support for all elements, properties, and metadata:
144
+
145
+ ```python
146
+ from html_to_markdown import convert_to_markdown
147
+
148
+ # Option 1: Document structure extraction (NEW in v2)
149
+ # Extracts all hOCR elements and converts to structured markdown
150
+ markdown = convert_to_markdown(hocr_html)
151
+
152
+ # Option 2: Legacy table extraction (spatial reconstruction)
153
+ # Reconstructs tables from word bounding boxes
154
+ markdown = convert_to_markdown(
155
+ hocr_html,
156
+ hocr_extract_tables=True,
157
+ hocr_table_column_threshold=50,
158
+ hocr_table_row_threshold_ratio=0.5,
159
+ )
160
+ ```
161
+
162
+ **Full hOCR 1.2 Spec Coverage:**
163
+
164
+ - ✅ **All 40 Element Types** - Logical structure, typesetting, floats, inline, engine-specific
165
+ - ✅ **All 20+ Properties** - bbox, baseline, textangle, poly, x_wconf, x_font, x_fsize, and more
166
+ - ✅ **All 5 Metadata Fields** - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
167
+
168
+ ## Configuration Reference
169
+
170
+ ### ConversionOptions
171
+
172
+ | Option | Type | Default | Description |
173
+ | -------------------------------- | ----- | ------------- | ----------------------------------------------------------------------- |
174
+ | `heading_style` | str | `"atx"` | Heading format: `"atx"` (#), `"atx_closed"` (# #), `"underlined"` (===) |
175
+ | `list_indent_width` | int | `2` | Spaces per list indent level (CommonMark: 2) |
176
+ | `list_indent_type` | str | `"spaces"` | `"spaces"` or `"tabs"` |
177
+ | `bullets` | str | `"*+-"` | Bullet chars for unordered lists (cycles through levels) |
178
+ | `strong_em_symbol` | str | `"*"` | Symbol for bold/italic: `"*"` or `"_"` |
179
+ | `escape_asterisks` | bool | `True` | Escape `*` in text |
180
+ | `escape_underscores` | bool | `True` | Escape `_` in text |
181
+ | `code_language` | str | `""` | Default language for code blocks |
182
+ | `code_block_style` | str | `"backticks"` | `"indented"` (4 spaces), `"backticks"` (\`\`\`), `"tildes"` (\~~~) |
183
+ | `extract_metadata` | bool | `True` | Extract HTML metadata as comment |
184
+ | `hocr_extract_tables` | bool | `True` | Enable hOCR table extraction |
185
+ | `hocr_table_column_threshold` | int | `50` | Column detection threshold (pixels) |
186
+ | `hocr_table_row_threshold_ratio` | float | `0.5` | Row grouping threshold ratio |
187
+
188
+ ### Preprocessing Options
189
+
190
+ | Option | Type | Default | Description |
191
+ | ---------------------- | ---- | ------------ | ----------------------------------------- |
192
+ | `preprocess` | bool | `False` | Enable HTML preprocessing |
193
+ | `preprocessing_preset` | str | `"standard"` | `"minimal"`, `"standard"`, `"aggressive"` |
194
+
195
+ ## CLI Tool
196
+
197
+ A native Rust CLI binary is also available:
198
+
199
+ ```bash
200
+ # Install via pipx (recommended for CLI tools)
201
+ pipx install html-to-markdown
202
+
203
+ # Or install with pip
204
+ pip install html-to-markdown
205
+
206
+ # Use the CLI
207
+ html-to-markdown input.html > output.md
208
+ echo "<h1>Test</h1>" | html-to-markdown
209
+ ```
210
+
211
+ **For Rust library usage and comprehensive documentation**, see the [GitHub repository](https://github.com/Goldziher/html-to-markdown).
212
+
213
+ ## Upgrading from v1.x
214
+
215
+ All v1 code works without changes. v2 is a complete Rust rewrite with **19-30x performance improvements**:
216
+
217
+ **What Changed:**
218
+
219
+ - Complete Rust rewrite using `tl` HTML parser
220
+ - CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
221
+ - No BeautifulSoup or lxml dependencies
222
+
223
+ **Removed Features:**
224
+
225
+ - `code_language_callback` - use `code_language` for default language
226
+ - `strip` / `convert` options - use preprocessing instead
227
+ - `convert_to_markdown_stream()` - not supported in v2
228
+
229
+ ## Links
230
+
231
+ - **GitHub Repository**: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
232
+ - **Rust Crate**: [https://crates.io/crates/html-to-markdown-rs](https://crates.io/crates/html-to-markdown-rs)
233
+ - **Discord Community**: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
234
+ - **Kreuzberg Ecosystem**: [https://kreuzberg.dev](https://kreuzberg.dev)
235
+
236
+ ## License
237
+
238
+ MIT License - see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE) for details.
239
+
240
+ ## Support
241
+
242
+ If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
243
+
@@ -1,17 +1,17 @@
1
- html_to_markdown-2.0.0.data/scripts/html-to-markdown.exe,sha256=GLfymdFNhpp4IkYO3-ALbkz7SKXMd5k8iDETaisroho,4380672
2
- html_to_markdown-2.0.0.dist-info/METADATA,sha256=Nd34pEmUGy9zrNDC0dSDEYSnASENKwToXBelV19z8XY,14673
3
- html_to_markdown-2.0.0.dist-info/WHEEL,sha256=4EDp_7DiFfWl1yYv5M4wSosAn5L_xgD1dyrQxQxfCx8,95
4
- html_to_markdown-2.0.0.dist-info/licenses/LICENSE,sha256=QhKFMkQLa4mSUlOsyG9VElzC7GYbAKtiS_EwOCyH-b4,1107
1
+ html_to_markdown-2.0.1.data/scripts/html-to-markdown.exe,sha256=4IXz5ITYrahD_3Qze4ldQ1R9AdrxiEO9xWF56VnaMJE,4380672
2
+ html_to_markdown-2.0.1.dist-info/METADATA,sha256=altO23YUZgJVPuW2sbzObGYEQAWS9MDDkAPMIZjlHWg,10057
3
+ html_to_markdown-2.0.1.dist-info/WHEEL,sha256=4EDp_7DiFfWl1yYv5M4wSosAn5L_xgD1dyrQxQxfCx8,95
4
+ html_to_markdown-2.0.1.dist-info/licenses/LICENSE,sha256=QhKFMkQLa4mSUlOsyG9VElzC7GYbAKtiS_EwOCyH-b4,1107
5
5
  html_to_markdown/__init__.py,sha256=c8bbkGkjR0bQs9ZZzhNevTULM4vzQIDXv3aaHSfVtmQ,1314
6
6
  html_to_markdown/__main__.py,sha256=5objj9lB7hhpSpZsDok5tv9o9yztVR63Ccww-pXsAyY,343
7
- html_to_markdown/_html_to_markdown.pyd,sha256=MG6Z56UUBTOGpg9zH_0sATEQ7DkSTBn-sCdzZgsvCJ4,3401216
7
+ html_to_markdown/_html_to_markdown.pyd,sha256=djqk4pMy4X6ioYreksFwKsmAtGuZzhBNhg2Mtw4AcPM,3401216
8
8
  html_to_markdown/_rust.pyi,sha256=bA1lfF_pRWpMFXApXVl7VXUcV1q8-T4-jVrEqLDwJ1Y,2220
9
9
  html_to_markdown/api.py,sha256=HfxYYebne2bZU62a-VALhurWfqDy5PEfUYNVnU6suac,3720
10
- html_to_markdown/bin/html-to-markdown.exe,sha256=GLfymdFNhpp4IkYO3-ALbkz7SKXMd5k8iDETaisroho,4380672
10
+ html_to_markdown/bin/html-to-markdown.exe,sha256=4IXz5ITYrahD_3Qze4ldQ1R9AdrxiEO9xWF56VnaMJE,4380672
11
11
  html_to_markdown/cli.py,sha256=4bQ44HhOKtQJY9t-TjSul-J7pqSHEpg9NKtZ0AeAqZA,263
12
12
  html_to_markdown/cli_proxy.py,sha256=mq-1BTjnr380N-mCU4cx8DyTIbY6ycq-_2DA10sLlfk,4442
13
13
  html_to_markdown/exceptions.py,sha256=L2YmpDWsvxTu2b-M_oT3mQdyyNhDM1jsvdWr18ZvD2Q,2707
14
14
  html_to_markdown/options.py,sha256=cenv2-vwduteQdIt6IXqDD0JyRFfa05oqhWgxKVdlB4,7452
15
15
  html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
16
16
  html_to_markdown/v1_compat.py,sha256=Qts9BqmWMcZt5CZ8P20_vf8oc43tMgZxkj3-W0YI2VE,5682
17
- html_to_markdown-2.0.0.dist-info/RECORD,,
17
+ html_to_markdown-2.0.1.dist-info/RECORD,,
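
Each RECORD row above has the form `path,sha256=<digest>,size`, where, in the wheel format, the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. The sketch below shows one way such a row could be checked using only the Python standard library. The helper names `record_digest` and `parse_record_line` are hypothetical and are not part of html-to-markdown; the example row is the `py.typed` entry from the listing, which can be checked without the wheel because the file is empty (size 0).

```python
import base64
import csv
import hashlib
from typing import Optional, Tuple


def record_digest(data: bytes) -> str:
    """Return a RECORD-style digest: URL-safe base64 of SHA-256, '=' padding stripped."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str) -> Tuple[str, str, str, Optional[int]]:
    """Split one RECORD row into (path, algorithm, digest, size).

    The row for RECORD itself has empty hash and size fields, so both
    are allowed to be missing.
    """
    path, hash_field, size_field = next(csv.reader([line]))
    algorithm, _, digest = hash_field.partition("=")
    size = int(size_field) if size_field else None
    return path, algorithm, digest, size


# Row copied from the RECORD listing; py.typed is an empty marker file.
entry = "html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
path, algorithm, digest, size = parse_record_line(entry)

# Compare the recorded digest with one computed from the (empty) file contents.
print(path, algorithm, size, digest == record_digest(b""))
```

For the binary entries (`.exe`, `.pyd`), the same comparison would apply to the bytes read from the unpacked wheel. In this diff their sizes are unchanged between 2.0.0 and 2.0.1 while their digests differ, so a size check alone would not detect the change.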
@@ -1,422 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: html-to-markdown
3
- Version: 2.0.0
4
- Classifier: Development Status :: 5 - Production/Stable
5
- Classifier: Environment :: Console
6
- Classifier: Intended Audience :: Developers
7
- Classifier: License :: OSI Approved :: MIT License
8
- Classifier: Operating System :: OS Independent
9
- Classifier: Programming Language :: Python :: 3 :: Only
10
- Classifier: Programming Language :: Python :: 3.10
11
- Classifier: Programming Language :: Python :: 3.11
12
- Classifier: Programming Language :: Python :: 3.12
13
- Classifier: Programming Language :: Python :: 3.13
14
- Classifier: Programming Language :: Rust
15
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
16
- Classifier: Topic :: Text Processing
17
- Classifier: Topic :: Text Processing :: Markup
18
- Classifier: Topic :: Text Processing :: Markup :: HTML
19
- Classifier: Topic :: Text Processing :: Markup :: Markdown
20
- Classifier: Typing :: Typed
21
- License-File: LICENSE
22
- Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
23
- Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
24
- Home-Page: https://github.com/Goldziher/html-to-markdown
25
- Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
26
- License: MIT
27
- Requires-Python: >=3.10
28
- Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
29
- Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
30
- Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
31
- Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
32
- Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
33
-
34
- # html-to-markdown
35
-
36
- High-performance HTML to Markdown converter Rust crate and CLI with Python bindings and CLI. Available via PyPI, Homebrew, and Cargo. Cross-platform support for Linux, macOS, and Windows.
37
-
38
- [![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
39
- [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
40
- [![Python Versions](https://img.shields.io/pypi/pyversions/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
41
- [![Documentation](https://img.shields.io/badge/docs-github-blue)](https://github.com/Goldziher/html-to-markdown)
42
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
43
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
44
-
45
- Part of the [Kreuzberg](https://kreuzberg.dev) ecosystem for document intelligence.
46
-
47
- **📚 [Full V2 Documentation](crates/html-to-markdown/README.md)** - Comprehensive guide for Rust, Python, and CLI usage.
48
-
49
- ## ⚡ Benchmarks
50
-
51
- ### Throughput (Python API)
52
-
53
- Real Wikipedia documents on Apple M1 Pro:
54
-
55
- | Document | Size | Latency | Throughput | Docs/sec |
56
- | ------------------- | ----- | ------- | ---------- | -------- |
57
- | Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
58
- | Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
59
- | Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
60
-
61
- **Throughput scales linearly** from 144-208 MB/s across all document sizes.
62
-
63
- ### Memory Usage
64
-
65
- | Document Size | Memory Delta | Peak RSS | Leak Detection |
66
- | ------------- | ------------ | -------- | -------------- |
67
- | 10KB | < 2 MB | < 20 MB | ✅ None |
68
- | 50KB | < 8 MB | < 35 MB | ✅ None |
69
- | 500KB | < 40 MB | < 80 MB | ✅ None |
70
-
71
- Memory usage is linear and stable across 50+ repeated conversions.
72
-
73
- **V2 is 19-30x faster** than v1 Python/BeautifulSoup implementation.
74
-
75
- 📊 **[Benchmark Results](BENCHMARK_RESULTS.md)** - Detailed Python API comparison
76
- 📈 **[Performance Analysis](PERFORMANCE.md)** - Rust core benchmarks and profiling
77
- 🔧 **[Benchmarking Guide](BENCHMARKS.md)** - How to run benchmarks
78
- ✅ **[CommonMark Compliance](COMMONMARK_COMPLIANCE.md)** - CommonMark specification compliance
79
-
80
- ## Features
81
-
82
- - **🚀 Blazing Fast**: Pure Rust core with ultra-fast `tl` HTML parser
83
- - **🐍 Python Bindings**: Clean Python API via PyO3 with full type hints
84
- - **🦀 Native CLI**: Rust CLI binary with comprehensive options
85
- - **📊 hOCR 1.2 Compliant**: Full support for all 40+ elements and 20+ properties
86
- - **📝 CommonMark Compliant**: Follows CommonMark specification for list formatting
87
- - **🎯 Type Safe**: Full type hints and `.pyi` stubs for excellent IDE support
88
- - **🌍 Cross-Platform**: Wheels for Linux, macOS, Windows (x86_64 + ARM64)
89
- - **✅ Well-Tested**: 900+ tests with dual Python + Rust coverage
90
-
91
- ## Installation
92
-
93
- > **📦 Package Names**: Due to a naming conflict on crates.io, the Rust crate is published as `html-to-markdown-rs`, while the Python package remains `html-to-markdown` on PyPI. The CLI binary name is `html-to-markdown` for both.
94
-
95
- ### Python Package
96
-
97
- ```bash
98
- pip install html-to-markdown
99
- ```
100
-
101
- ### Rust Library
102
-
103
- ```bash
104
- cargo add html-to-markdown-rs
105
- ```
106
-
107
- ### CLI Binary
108
-
109
- #### via Homebrew (macOS/Linux)
110
-
111
- ```bash
112
- brew tap goldziher/tap
113
- brew install html-to-markdown
114
- ```
115
-
116
- #### via Cargo
117
-
118
- ```bash
119
- cargo install html-to-markdown-cli
120
- ```
121
-
122
- #### Direct Download
123
-
124
- Download pre-built binaries from [GitHub Releases](https://github.com/Goldziher/html-to-markdown/releases).
125
-
126
- ## Quick Start
127
-
128
- ### Python API
129
-
130
- Clean, type-safe configuration with dataclasses:
131
-
132
- ```python
133
- from html_to_markdown import convert, ConversionOptions
134
-
135
- html = """
136
- <h1>Welcome</h1>
137
- <p>This is <strong>fast</strong> Rust-powered conversion!</p>
138
- <ul>
139
- <li>Blazing fast</li>
140
- <li>Type safe</li>
141
- <li>Easy to use</li>
142
- </ul>
143
- """
144
-
145
- options = ConversionOptions(
146
- heading_style="atx",
147
- strong_em_symbol="*",
148
- bullets="*+-",
149
- )
150
-
151
- markdown = convert(html, options)
152
- print(markdown)
153
- ```
154
-
155
- Output:
156
-
157
- ```markdown
158
- # Welcome
159
-
160
- This is **fast** Rust-powered conversion!
161
-
162
- * Blazing fast
163
- + Type safe
164
- - Easy to use
165
- ```
166
-
167
- ### Rust API
168
-
169
- ```rust
170
- use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};
171
-
172
- fn main() {
173
- let html = r#"
174
- <h1>Welcome</h1>
175
- <p>This is <strong>fast</strong> conversion!</p>
176
- "#;
177
-
178
- let options = ConversionOptions {
179
- heading_style: HeadingStyle::Atx,
180
- ..Default::default()
181
- };
182
-
183
- let markdown = convert(html, Some(options)).unwrap();
184
- println!("{}", markdown);
185
- }
186
- ```
187
-
188
- ### CLI Usage
189
-
190
- ```bash
191
- # Convert file
192
- html-to-markdown input.html > output.md
193
-
194
- # From stdin
195
- cat input.html | html-to-markdown > output.md
196
-
197
- # With options
198
- html-to-markdown --heading-style atx --list-indent-width 2 input.html
199
-
200
- # Clean web-scraped content
201
- html-to-markdown \
202
- --preprocess \
203
- --preset aggressive \
204
- --no-extract-metadata \
205
- scraped.html > clean.md
206
- ```
207
-
208
- ## Configuration
209
-
210
- ### Python: Dataclass Configuration
211
-
212
- ```python
213
- from html_to_markdown import (
214
- convert,
215
- ConversionOptions,
216
- PreprocessingOptions,
217
- )
218
-
219
- # Conversion settings
220
- options = ConversionOptions(
221
- heading_style="atx", # "atx", "atx_closed", "underlined"
222
- list_indent_width=2, # Discord/Slack: use 2
223
- bullets="*+-", # Bullet characters
224
- strong_em_symbol="*", # "*" or "_"
225
- escape_asterisks=True, # Escape * in text
226
- code_language="python", # Default code block language
227
- extract_metadata=True, # Extract HTML metadata
228
- highlight_style="double-equal", # "double-equal", "html", "bold"
229
- )
230
-
231
- # HTML preprocessing
232
- preprocessing = PreprocessingOptions(
233
- enabled=True,
234
- preset="standard", # "minimal", "standard", "aggressive"
235
- remove_navigation=True,
236
- remove_forms=True,
237
- )
238
-
239
- markdown = convert(html, options, preprocessing)
240
- ```
241
-
242
- ### Python: Legacy API (v1 compatibility)
243
-
244
- For backward compatibility with existing v1 code:
245
-
246
- ```python
247
- from html_to_markdown import convert_to_markdown
248
-
249
- markdown = convert_to_markdown(
250
- html,
251
- heading_style="atx",
252
- list_indent_width=2,
253
- preprocess=True,
254
- preprocessing_preset="standard",
255
- )
256
- ```
257
-
258
- ## Common Use Cases
259
-
260
- ### Discord/Slack Compatible Lists
261
-
262
- ```python
263
- from html_to_markdown import convert, ConversionOptions
264
-
265
- options = ConversionOptions(list_indent_width=2)
266
- markdown = convert(html, options)
267
- ```
268
-
269
- ### Clean Web-Scraped HTML
270
-
271
- ```python
272
- from html_to_markdown import convert, PreprocessingOptions
273
-
274
- preprocessing = PreprocessingOptions(
275
- enabled=True,
276
- preset="aggressive", # Heavy cleaning
277
- remove_navigation=True,
278
- remove_forms=True,
279
- )
280
-
281
- markdown = convert(html, preprocessing=preprocessing)
282
- ```
283
-
284
- ### hOCR 1.2 Support
285
-
286
- **Complete hOCR 1.2 specification compliance** with support for all elements, properties, and metadata:
287
-
288
- ```python
289
- from html_to_markdown import convert, ConversionOptions
290
-
291
- # Option 1: Document structure extraction (NEW in v2)
292
- # Extracts all hOCR elements and converts to structured markdown
293
- # Supports: paragraphs, sections, chapters, headers/footers, images, math, etc.
294
- markdown = convert(hocr_html)
295
-
296
- # Option 2: Legacy table extraction (spatial reconstruction)
297
- # Reconstructs tables from word bounding boxes
298
- options = ConversionOptions(
299
- hocr_extract_tables=True,
300
- hocr_table_column_threshold=50,
301
- hocr_table_row_threshold_ratio=0.5,
302
- )
303
- markdown = convert(hocr_html, options)
304
- ```
305
-
306
- **Full hOCR 1.2 Spec Coverage:**
307
-
308
- - ✅ **All 40 Element Types** - Logical structure (12), typesetting (6), float (13), inline (6), engine-specific (3)
309
- - ✅ **All 20+ Properties** - bbox, baseline, textangle, poly, x_wconf, x_confs, x_font, x_fsize, order, cflow, cuts, x_bboxes, image, ppageno, lpageno, scan_res, and more
310
- - ✅ **All 5 Metadata Fields** - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
311
- - ✅ **37 Tests** - Complete coverage of all elements and properties
312
-
313
- **Semantic Markdown Conversion:**
314
-
315
- | Element Category | Examples | Markdown Output |
316
- | ---------------- | ------------------------------- | ----------------------------------------- |
317
- | Headings | `ocr_title`, `ocr_chapter` | `# Heading` |
318
- | Sections | `ocr_section`, `ocr_subsection` | `##`, `###` |
319
- | Structure | `ocr_par`, `ocr_blockquote` | Paragraphs, `> quotes` |
320
- | Metadata | `ocr_abstract`, `ocr_author` | `**Abstract**`, `*Author*` |
321
- | Floats | `ocr_header`, `ocr_footer` | `*Header*`, `*Footer*` |
322
- | Images | `ocr_image`, `ocr_photo` | `![alt](path)` with image property |
323
- | Math | `ocr_math`, `ocr_display` | `` `formula` ``, ```` ```equation``` ```` |
324
- | Layout | `ocr_separator` | `---` horizontal rule |
325
- | Inline | `ocrx_word`, `ocr_dropcap` | Text, `**Letter**` |
326
-
327
- **HTML Entity Handling:** Automatically decodes `&quot;`, `&apos;`, `&lt;`, `&gt;`, `&amp;` in title attributes for proper property parsing.
328
-
329
- ## Configuration Reference
330
-
331
- **V2 Defaults (CommonMark-compliant):**
332
-
333
- - `list_indent_width`: 2 (CommonMark standard)
334
- - `bullets`: "\*+-" (cycles through `*`, `+`, `-` for nested levels)
335
- - `escape_asterisks`: false (minimal escaping)
336
- - `escape_underscores`: false (minimal escaping)
337
- - `escape_misc`: false (minimal escaping)
338
- - `newline_style`: "spaces" (CommonMark: two trailing spaces)
339
- - `code_block_style`: "backticks" (fenced code blocks with \`\`\`, better whitespace preservation)
340
- - `heading_style`: "atx" (CommonMark: `#`)
341
- - `preprocessing.enabled`: false (no preprocessing by default)
342
-
343
- For complete configuration reference, see **[Full Documentation](crates/html-to-markdown/README.md#configuration-reference)**.
344
-
345
- ## Upgrading from v1.x
346
-
347
- ### Backward Compatibility
348
-
349
- Existing v1 code works without changes:
350
-
351
- ```python
352
- from html_to_markdown import convert_to_markdown
353
-
354
- markdown = convert_to_markdown(html, heading_style="atx") # Still works!
355
- ```
356
-
357
- ### Modern API (Recommended)
358
-
359
- For new projects, use the dataclass-based API:
360
-
361
- ```python
362
- from html_to_markdown import convert, ConversionOptions
363
-
364
- options = ConversionOptions(heading_style="atx", list_indent_width=2)
365
- markdown = convert(html, options)
366
- ```
367
-
368
- ### What Changed in v2
369
-
370
- **Core Rewrite:**
371
-
372
- - Complete Rust rewrite using `tl` HTML parser
373
- - 19-30x performance improvement over v1
374
- - CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
375
- - No BeautifulSoup or lxml dependencies
376
-
377
- **Removed Features:**
378
-
379
- - `code_language_callback` - use `code_language` for default language
380
- - `strip` / `convert` options - use `strip_tags` or preprocessing
381
- - `convert_to_markdown_stream()` - not supported in v2
382
-
383
- **Planned:**
384
-
385
- - `custom_converters` - planned for future release
386
-
387
- See **[CHANGELOG.md](CHANGELOG.md)** for complete v1 vs v2 comparison and migration guide.
388
-
389
- ## Kreuzberg Ecosystem
390
-
391
- html-to-markdown is part of the [Kreuzberg](https://kreuzberg.dev) ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:
392
-
393
- - **Document Extraction**: Extract text, images, and metadata from 50+ document formats
394
- - **OCR Processing**: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
395
- - **Table Extraction**: Vision-based and OCR-based table detection
396
- - **Document Classification**: Automatic detection of contracts, forms, invoices, etc.
397
- - **RAG Pipelines**: Integration with retrieval-augmented generation workflows
398
-
399
- Learn more at [kreuzberg.dev](https://kreuzberg.dev) or join our [Discord community](https://discord.gg/pXxagNK2zN).
400
-
401
- ## Contributing
402
-
403
- See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.
404
-
405
- ## License
406
-
407
- MIT License - see [LICENSE](LICENSE) for details.
408
-
409
- ## Acknowledgments
410
-
411
- Version 1 started as a fork of [markdownify](https://pypi.org/project/markdownify/), rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.
412
-
413
- ## Support
414
-
415
- If you find this library useful, consider:
416
-
417
- <a href="https://github.com/sponsors/Goldziher">
418
- <img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor" height="32">
419
- </a>
420
-
421
- Your support helps maintain and improve this library!
422
-