html-to-markdown 2.0.0__cp310-abi3-macosx_11_0_arm64.whl → 2.0.1__cp310-abi3-macosx_11_0_arm64.whl
Potentially problematic release.
This version of html-to-markdown might be problematic.
- html_to_markdown/_html_to_markdown.abi3.so +0 -0
- html_to_markdown/bin/html-to-markdown +0 -0
- {html_to_markdown-2.0.0.data → html_to_markdown-2.0.1.data}/scripts/html-to-markdown +0 -0
- html_to_markdown-2.0.1.dist-info/METADATA +243 -0
- {html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/RECORD +7 -7
- html_to_markdown-2.0.0.dist-info/METADATA +0 -422
- {html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/WHEEL +0 -0
- {html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/licenses/LICENSE +0 -0
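html_to_markdown/_html_to_markdown.abi3.so: Binary file
html_to_markdown/bin/html-to-markdown: Binary file
{html_to_markdown-2.0.0.data → html_to_markdown-2.0.1.data}/scripts/html-to-markdown: Binary file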
html_to_markdown-2.0.1.dist-info/METADATA
@@ -0,0 +1,243 @@
+Metadata-Version: 2.4
+Name: html-to-markdown
+Version: 2.0.1
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Rust
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Text Processing
+Classifier: Topic :: Text Processing :: Markup
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Typing :: Typed
+License-File: LICENSE
+Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
+Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
+Home-Page: https://github.com/Goldziher/html-to-markdown
+Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
+Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
+Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
+Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
+
+# html-to-markdown
+
+High-performance HTML to Markdown converter powered by Rust with a clean Python API. Available via PyPI with pre-built wheels for all major platforms.
+
+[](https://pypi.org/project/html-to-markdown/)
+[](https://crates.io/crates/html-to-markdown-rs)
+[](https://pypi.org/project/html-to-markdown/)
+[](https://opensource.org/licenses/MIT)
+[](https://discord.gg/pXxagNK2zN)
+
+Part of the [Kreuzberg](https://kreuzberg.dev) ecosystem for document intelligence.
+
+## Installation
+
+```bash
+pip install html-to-markdown
+```
+
+Pre-built wheels available for:
+
+- **Linux**: x86_64, aarch64
+- **macOS**: x86_64 (Intel), arm64 (Apple Silicon)
+- **Windows**: x86_64
+
+## ⚡ Performance
+
+Real Wikipedia documents on Apple M4:
+
+| Document | Size | Latency | Throughput | Docs/sec |
+| ------------------- | ----- | ------- | ---------- | -------- |
+| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
+| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
+| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
+
+**19-30x faster** than pure Python implementations.
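The Throughput and Docs/sec columns in the table above appear to follow directly from Size and Latency. A minimal sketch that re-derives them, assuming decimal units (1 MB = 1000 KB), which the README does not state; the `derive` helper is hypothetical and only serves this check:

```python
# Re-derive the Throughput and Docs/sec columns from Size and Latency.
# Assumption: decimal units (1 MB = 1000 KB); the README does not say which it uses.

ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def derive(size_kb: float, latency_ms: float) -> tuple[int, int]:
    """Return (throughput in MB/s, documents per second), both rounded."""
    # KB per millisecond equals MB per second under decimal units.
    throughput_mb_s = round(size_kb / latency_ms)
    docs_per_sec = round(1000 / latency_ms)
    return throughput_mb_s, docs_per_sec


for name, size_kb, latency_ms in ROWS:
    mb_s, docs = derive(size_kb, latency_ms)
    print(f"{name}: {mb_s} MB/s, {docs} docs/sec")
```

Under that assumption the three rows come out at 208, 178, and 144 MB/s, matching the published figures.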
+
+## Quick Start
+
+```python
+from html_to_markdown import convert_to_markdown
+
+html = """
+<h1>Welcome</h1>
+<p>This is <strong>fast</strong> Rust-powered conversion!</p>
+<ul>
+<li>Blazing fast</li>
+<li>Type safe</li>
+<li>Easy to use</li>
+</ul>
+"""
+
+markdown = convert_to_markdown(html)
+print(markdown)
+```
+
+Output:
+
+```markdown
+# Welcome
+
+This is **fast** Rust-powered conversion!
+
+- Blazing fast
+- Type safe
+- Easy to use
+```
+
+## Configuration
+
+```python
+from html_to_markdown import convert_to_markdown
+
+markdown = convert_to_markdown(
+    html,
+    heading_style="atx", # "atx", "atx_closed", "underlined"
+    list_indent_width=2, # Discord/Slack: use 2
+    bullets="*+-", # Bullet characters
+    strong_em_symbol="*", # "*" or "_"
+    escape_asterisks=True, # Escape * in text
+    code_language="python", # Default code block language
+    extract_metadata=True, # Extract HTML metadata
+)
+```
+
+### HTML Preprocessing
+
+Clean web-scraped HTML before conversion:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+markdown = convert_to_markdown(
+    scraped_html,
+    preprocess=True,
+    preprocessing_preset="aggressive", # "minimal", "standard", "aggressive"
+)
+```
+
+## Features
+
+- **🚀 Blazing Fast**: Pure Rust core with ultra-fast `tl` HTML parser
+- **🐍 Type Safe**: Full type hints and `.pyi` stubs for excellent IDE support
+- **📊 hOCR 1.2 Compliant**: Full support for all 40+ elements and 20+ properties
+- **📝 CommonMark Compliant**: Follows CommonMark specification for list formatting
+- **🌍 Cross-Platform**: Pre-built wheels for Linux, macOS, and Windows
+- **✅ Well-Tested**: 900+ tests with dual Python + Rust coverage
+- **🔧 Zero Dependencies**: No BeautifulSoup or lxml required
+
+## hOCR 1.2 Support
+
+Complete hOCR 1.2 specification compliance with support for all elements, properties, and metadata:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+# Option 1: Document structure extraction (NEW in v2)
+# Extracts all hOCR elements and converts to structured markdown
+markdown = convert_to_markdown(hocr_html)
+
+# Option 2: Legacy table extraction (spatial reconstruction)
+# Reconstructs tables from word bounding boxes
+markdown = convert_to_markdown(
+    hocr_html,
+    hocr_extract_tables=True,
+    hocr_table_column_threshold=50,
+    hocr_table_row_threshold_ratio=0.5,
+)
+```
+
+**Full hOCR 1.2 Spec Coverage:**
+
+- ✅ **All 40 Element Types** - Logical structure, typesetting, floats, inline, engine-specific
+- ✅ **All 20+ Properties** - bbox, baseline, textangle, poly, x_wconf, x_font, x_fsize, and more
+- ✅ **All 5 Metadata Fields** - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
+
+## Configuration Reference
+
+### ConversionOptions
+
+| Option | Type | Default | Description |
+| -------------------------------- | ----- | ------------- | ----------------------------------------------------------------------- |
+| `heading_style` | str | `"atx"` | Heading format: `"atx"` (#), `"atx_closed"` (# #), `"underlined"` (===) |
+| `list_indent_width` | int | `2` | Spaces per list indent level (CommonMark: 2) |
+| `list_indent_type` | str | `"spaces"` | `"spaces"` or `"tabs"` |
+| `bullets` | str | `"*+-"` | Bullet chars for unordered lists (cycles through levels) |
+| `strong_em_symbol` | str | `"*"` | Symbol for bold/italic: `"*"` or `"_"` |
+| `escape_asterisks` | bool | `True` | Escape `*` in text |
+| `escape_underscores` | bool | `True` | Escape `_` in text |
+| `code_language` | str | `""` | Default language for code blocks |
+| `code_block_style` | str | `"backticks"` | `"indented"` (4 spaces), `"backticks"` (\`\`\`), `"tildes"` (\~~~) |
+| `extract_metadata` | bool | `True` | Extract HTML metadata as comment |
+| `hocr_extract_tables` | bool | `True` | Enable hOCR table extraction |
+| `hocr_table_column_threshold` | int | `50` | Column detection threshold (pixels) |
+| `hocr_table_row_threshold_ratio` | float | `0.5` | Row grouping threshold ratio |
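To make the three `heading_style` values in the table concrete, here is a small sketch of what each format looks like in Markdown. `render_heading` is a hypothetical helper written for illustration, not part of html-to-markdown, and the fallback to ATX for underlined headings deeper than level 2 is an assumption of this sketch:

```python
def render_heading(text: str, level: int = 1, style: str = "atx") -> str:
    """Illustrate the three heading formats named in the table (not the library's code)."""
    hashes = "#" * level
    if style == "atx":
        return f"{hashes} {text}"
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    if style == "underlined":
        # Underlined (setext) headings only exist for level 1 (=) and level 2 (-).
        # Falling back to ATX for deeper levels is an assumption of this sketch.
        if level == 1:
            return f"{text}\n{'=' * len(text)}"
        if level == 2:
            return f"{text}\n{'-' * len(text)}"
        return f"{hashes} {text}"
    raise ValueError(f"unknown heading style: {style!r}")


for style in ("atx", "atx_closed", "underlined"):
    print(render_heading("Welcome", level=1, style=style))
```

The helper only shows the target syntax; how the library itself handles deep underlined headings is not stated in the README.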
+
+### Preprocessing Options
+
+| Option | Type | Default | Description |
+| ---------------------- | ---- | ------------ | ----------------------------------------- |
+| `preprocess` | bool | `False` | Enable HTML preprocessing |
+| `preprocessing_preset` | str | `"standard"` | `"minimal"`, `"standard"`, `"aggressive"` |
+
+## CLI Tool
+
+A native Rust CLI binary is also available:
+
+```bash
+# Install via pipx (recommended for CLI tools)
+pipx install html-to-markdown
+
+# Or install with pip
+pip install html-to-markdown
+
+# Use the CLI
+html-to-markdown input.html > output.md
+echo "<h1>Test</h1>" | html-to-markdown
+```
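The stdin form shown above (`echo ... | html-to-markdown`) can also be driven from Python. A minimal sketch, assuming the `html-to-markdown` executable is on `PATH`; the wrapper name `html_to_markdown_cli` is hypothetical, and only the stdin invocation shown in the README is used:

```python
import shutil
import subprocess
from typing import Optional


def html_to_markdown_cli(html: str, executable: str = "html-to-markdown") -> Optional[str]:
    """Pipe HTML to the CLI on stdin and return its stdout, or None if it is not installed."""
    path = shutil.which(executable)
    if path is None:
        # The binary is not on PATH; let the caller decide how to handle that.
        return None
    result = subprocess.run(
        [path],
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


markdown = html_to_markdown_cli("<h1>Test</h1>")
print(markdown if markdown is not None else "html-to-markdown is not installed")
```

Returning `None` when the binary is missing keeps the sketch usable in environments where only the Python API is wanted.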
+
+**For Rust library usage and comprehensive documentation**, see the [GitHub repository](https://github.com/Goldziher/html-to-markdown).
+
+## Upgrading from v1.x
+
+All v1 code works without changes. v2 is a complete Rust rewrite with **19-30x performance improvements**:
+
+**What Changed:**
+
+- Complete Rust rewrite using `tl` HTML parser
+- CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
+- No BeautifulSoup or lxml dependencies
+
+**Removed Features:**
+
+- `code_language_callback` - use `code_language` for default language
+- `strip` / `convert` options - use preprocessing instead
+- `convert_to_markdown_stream()` - not supported in v2
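When migrating, it can help to scan existing v1 keyword arguments for the removed options listed above. A minimal sketch: `find_removed_options` and the `REMOVED_V1_OPTIONS` mapping are hypothetical helpers built only from that list, not part of the library:

```python
# Removed v1 options and the replacement the README suggests for each.
REMOVED_V1_OPTIONS = {
    "code_language_callback": "use `code_language` for a default language",
    "strip": "use preprocessing instead",
    "convert": "use preprocessing instead",
}


def find_removed_options(kwargs: dict) -> dict:
    """Return {option: advice} for every removed v1 option present in kwargs."""
    return {name: advice for name, advice in REMOVED_V1_OPTIONS.items() if name in kwargs}


# Example v1-style call arguments (illustrative values).
v1_kwargs = {"heading_style": "atx", "strip": ["script"], "code_language_callback": None}
for name, advice in find_removed_options(v1_kwargs).items():
    print(f"{name}: {advice}")
```

`convert_to_markdown_stream()` is a function, not a keyword argument, so calls to it have to be found by searching the code base.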
+
+## Links
+
+- **GitHub Repository**: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
+- **Rust Crate**: [https://crates.io/crates/html-to-markdown-rs](https://crates.io/crates/html-to-markdown-rs)
+- **Discord Community**: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
+- **Kreuzberg Ecosystem**: [https://kreuzberg.dev](https://kreuzberg.dev)
+
+## License
+
+MIT License - see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE) for details.
+
+## Support
+
+If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
+
{html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/RECORD
@@ -1,10 +1,10 @@
-html_to_markdown-2.0.
-html_to_markdown-2.0.
-html_to_markdown-2.0.
-html_to_markdown-2.0.
-html_to_markdown-2.0.
+html_to_markdown-2.0.1.data/scripts/html-to-markdown,sha256=jhtwKs5YAVs1Vkx4e1hHuSfH3rLNRP0JyJC7DCsGr_I,3734448
+html_to_markdown-2.0.1.dist-info/RECORD,,
+html_to_markdown-2.0.1.dist-info/WHEEL,sha256=HtAbUhtjhH1WdiDuIy2CapdoAiKCwe6bij_Tlxr1lEg,131
+html_to_markdown-2.0.1.dist-info/METADATA,sha256=n-MONvlBJmSDWmejjTWLwzIV-PzlbWh1hWDLywyv9oc,9847
+html_to_markdown-2.0.1.dist-info/licenses/LICENSE,sha256=oQvPC-0UWvfg0WaeUBe11OJMtX60An-TW1ev_oaAA0k,1086
 html_to_markdown/options.py,sha256=LXOUDqWwuvC-ryE118LttnATDO6-rlogYbbEGVfynhM,7241
-html_to_markdown/_html_to_markdown.abi3.so,sha256=
+html_to_markdown/_html_to_markdown.abi3.so,sha256=jvvpeF0sSDJ5bBeiR5INT91cIPIwGjc6CxRgT5dDp2g,2989792
 html_to_markdown/__init__.py,sha256=0r7a2ruI_9xqj0Ko-5O4yCGrQ4Nga89qSUY4lTSyiDE,1266
 html_to_markdown/api.py,sha256=0KgVWCDX-pWxrADxxxnqzk5_IhYc4fDxRytgeHttCKQ,3620
 html_to_markdown/_rust.pyi,sha256=6GZ5fXfQ7VqglKB-kSZ395cysOdLIdQidDq6yoAHICA,2141
@@ -14,4 +14,4 @@ html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 html_to_markdown/exceptions.py,sha256=0Yrzndw1kSqN-HMnE34TjZzo21iihiD1TZG1k2dmpdI,2626
 html_to_markdown/cli_proxy.py,sha256=nuBMky_q_ArDUKGgWW6Vrxf2JwOa_RgmUPH8qYBIcRQ,4298
 html_to_markdown/__main__.py,sha256=3Ic_EbOt2h6W88q084pkz5IKU6iY5z_woBygH6u9aw0,327
-html_to_markdown/bin/html-to-markdown,sha256=
+html_to_markdown/bin/html-to-markdown,sha256=jhtwKs5YAVs1Vkx4e1hHuSfH3rLNRP0JyJC7DCsGr_I,3734448
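Each RECORD entry above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed. A minimal sketch for recomputing an entry; `record_entry` is a hypothetical helper name, and the `py.typed` example relies on that file being listed with size 0 in this wheel:

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD line: path, urlsafe-base64 SHA-256 digest without padding, size in bytes."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# py.typed is listed with size 0, so its entry is the hash of empty input.
print(record_entry("html_to_markdown/py.typed", b""))
```

The printed line should match the `py.typed` entry in the hunk header above. The same function applied to the bytes of any other file in an unpacked wheel can be compared against its RECORD line.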
html_to_markdown-2.0.0.dist-info/METADATA
@@ -1,422 +0,0 @@
-Metadata-Version: 2.4
-Name: html-to-markdown
-Version: 2.0.0
-Classifier: Development Status :: 5 - Production/Stable
-Classifier: Environment :: Console
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Programming Language :: Python :: 3.13
-Classifier: Programming Language :: Rust
-Classifier: Topic :: Software Development :: Libraries :: Python Modules
-Classifier: Topic :: Text Processing
-Classifier: Topic :: Text Processing :: Markup
-Classifier: Topic :: Text Processing :: Markup :: HTML
-Classifier: Topic :: Text Processing :: Markup :: Markdown
-Classifier: Typing :: Typed
-License-File: LICENSE
-Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
-Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
-Home-Page: https://github.com/Goldziher/html-to-markdown
-Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
-License: MIT
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
-Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
-Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
-Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
-Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
-
-# html-to-markdown
-
-High-performance HTML to Markdown converter Rust crate and CLI with Python bindings and CLI. Available via PyPI, Homebrew, and Cargo. Cross-platform support for Linux, macOS, and Windows.
-
-[](https://pypi.org/project/html-to-markdown/)
-[](https://crates.io/crates/html-to-markdown-rs)
-[](https://pypi.org/project/html-to-markdown/)
-[](https://github.com/Goldziher/html-to-markdown)
-[](https://opensource.org/licenses/MIT)
-[](https://discord.gg/pXxagNK2zN)
-
-Part of the [Kreuzberg](https://kreuzberg.dev) ecosystem for document intelligence.
-
-**📚 [Full V2 Documentation](crates/html-to-markdown/README.md)** - Comprehensive guide for Rust, Python, and CLI usage.
-
-## ⚡ Benchmarks
-
-### Throughput (Python API)
-
-Real Wikipedia documents on Apple M1 Pro:
-
-| Document | Size | Latency | Throughput | Docs/sec |
-| ------------------- | ----- | ------- | ---------- | -------- |
-| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
-| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
-| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
-
-**Throughput scales linearly** from 144-208 MB/s across all document sizes.
-
-### Memory Usage
-
-| Document Size | Memory Delta | Peak RSS | Leak Detection |
-| ------------- | ------------ | -------- | -------------- |
-| 10KB | < 2 MB | < 20 MB | ✅ None |
-| 50KB | < 8 MB | < 35 MB | ✅ None |
-| 500KB | < 40 MB | < 80 MB | ✅ None |
-
-Memory usage is linear and stable across 50+ repeated conversions.
-
-**V2 is 19-30x faster** than v1 Python/BeautifulSoup implementation.
-
-📊 **[Benchmark Results](BENCHMARK_RESULTS.md)** - Detailed Python API comparison
-📈 **[Performance Analysis](PERFORMANCE.md)** - Rust core benchmarks and profiling
-🔧 **[Benchmarking Guide](BENCHMARKS.md)** - How to run benchmarks
-✅ **[CommonMark Compliance](COMMONMARK_COMPLIANCE.md)** - CommonMark specification compliance
-
-## Features
-
-- **🚀 Blazing Fast**: Pure Rust core with ultra-fast `tl` HTML parser
-- **🐍 Python Bindings**: Clean Python API via PyO3 with full type hints
-- **🦀 Native CLI**: Rust CLI binary with comprehensive options
-- **📊 hOCR 1.2 Compliant**: Full support for all 40+ elements and 20+ properties
-- **📝 CommonMark Compliant**: Follows CommonMark specification for list formatting
-- **🎯 Type Safe**: Full type hints and `.pyi` stubs for excellent IDE support
-- **🌍 Cross-Platform**: Wheels for Linux, macOS, Windows (x86_64 + ARM64)
-- **✅ Well-Tested**: 900+ tests with dual Python + Rust coverage
-
-## Installation
-
-> **📦 Package Names**: Due to a naming conflict on crates.io, the Rust crate is published as `html-to-markdown-rs`, while the Python package remains `html-to-markdown` on PyPI. The CLI binary name is `html-to-markdown` for both.
-
-### Python Package
-
-```bash
-pip install html-to-markdown
-```
-
-### Rust Library
-
-```bash
-cargo add html-to-markdown-rs
-```
-
-### CLI Binary
-
-#### via Homebrew (macOS/Linux)
-
-```bash
-brew tap goldziher/tap
-brew install html-to-markdown
-```
-
-#### via Cargo
-
-```bash
-cargo install html-to-markdown-cli
-```
-
-#### Direct Download
-
-Download pre-built binaries from [GitHub Releases](https://github.com/Goldziher/html-to-markdown/releases).
-
-## Quick Start
-
-### Python API
-
-Clean, type-safe configuration with dataclasses:
-
-```python
-from html_to_markdown import convert, ConversionOptions
-
-html = """
-<h1>Welcome</h1>
-<p>This is <strong>fast</strong> Rust-powered conversion!</p>
-<ul>
-<li>Blazing fast</li>
-<li>Type safe</li>
-<li>Easy to use</li>
-</ul>
-"""
-
-options = ConversionOptions(
-    heading_style="atx",
-    strong_em_symbol="*",
-    bullets="*+-",
-)
-
-markdown = convert(html, options)
-print(markdown)
-```
-
-Output:
-
-```markdown
-# Welcome
-
-This is **fast** Rust-powered conversion!
-
-* Blazing fast
-+ Type safe
-- Easy to use
-```
-
-### Rust API
-
-```rust
-use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};
-
-fn main() {
-    let html = r#"
-<h1>Welcome</h1>
-<p>This is <strong>fast</strong> conversion!</p>
-"#;
-
-    let options = ConversionOptions {
-        heading_style: HeadingStyle::Atx,
-        ..Default::default()
-    };
-
-    let markdown = convert(html, Some(options)).unwrap();
-    println!("{}", markdown);
-}
-```
-
-### CLI Usage
-
-```bash
-# Convert file
-html-to-markdown input.html > output.md
-
-# From stdin
-cat input.html | html-to-markdown > output.md
-
-# With options
-html-to-markdown --heading-style atx --list-indent-width 2 input.html
-
-# Clean web-scraped content
-html-to-markdown \
-    --preprocess \
-    --preset aggressive \
-    --no-extract-metadata \
-    scraped.html > clean.md
-```
-
-## Configuration
-
-### Python: Dataclass Configuration
-
-```python
-from html_to_markdown import (
-    convert,
-    ConversionOptions,
-    PreprocessingOptions,
-)
-
-# Conversion settings
-options = ConversionOptions(
-    heading_style="atx", # "atx", "atx_closed", "underlined"
-    list_indent_width=2, # Discord/Slack: use 2
-    bullets="*+-", # Bullet characters
-    strong_em_symbol="*", # "*" or "_"
-    escape_asterisks=True, # Escape * in text
-    code_language="python", # Default code block language
-    extract_metadata=True, # Extract HTML metadata
-    highlight_style="double-equal", # "double-equal", "html", "bold"
-)
-
-# HTML preprocessing
-preprocessing = PreprocessingOptions(
-    enabled=True,
-    preset="standard", # "minimal", "standard", "aggressive"
-    remove_navigation=True,
-    remove_forms=True,
-)
-
-markdown = convert(html, options, preprocessing)
-```
-
-### Python: Legacy API (v1 compatibility)
-
-For backward compatibility with existing v1 code:
-
-```python
-from html_to_markdown import convert_to_markdown
-
-markdown = convert_to_markdown(
-    html,
-    heading_style="atx",
-    list_indent_width=2,
-    preprocess=True,
-    preprocessing_preset="standard",
-)
-```
-
-## Common Use Cases
-
-### Discord/Slack Compatible Lists
-
-```python
-from html_to_markdown import convert, ConversionOptions
-
-options = ConversionOptions(list_indent_width=2)
-markdown = convert(html, options)
-```
-
-### Clean Web-Scraped HTML
-
-```python
-from html_to_markdown import convert, PreprocessingOptions
-
-preprocessing = PreprocessingOptions(
-    enabled=True,
-    preset="aggressive", # Heavy cleaning
-    remove_navigation=True,
-    remove_forms=True,
-)
-
-markdown = convert(html, preprocessing=preprocessing)
-```
-
-### hOCR 1.2 Support
-
-**Complete hOCR 1.2 specification compliance** with support for all elements, properties, and metadata:
-
-```python
-from html_to_markdown import convert, ConversionOptions
-
-# Option 1: Document structure extraction (NEW in v2)
-# Extracts all hOCR elements and converts to structured markdown
-# Supports: paragraphs, sections, chapters, headers/footers, images, math, etc.
-markdown = convert(hocr_html)
-
-# Option 2: Legacy table extraction (spatial reconstruction)
-# Reconstructs tables from word bounding boxes
-options = ConversionOptions(
-    hocr_extract_tables=True,
-    hocr_table_column_threshold=50,
-    hocr_table_row_threshold_ratio=0.5,
-)
-markdown = convert(hocr_html, options)
-```
-
-**Full hOCR 1.2 Spec Coverage:**
-
-- ✅ **All 40 Element Types** - Logical structure (12), typesetting (6), float (13), inline (6), engine-specific (3)
-- ✅ **All 20+ Properties** - bbox, baseline, textangle, poly, x_wconf, x_confs, x_font, x_fsize, order, cflow, cuts, x_bboxes, image, ppageno, lpageno, scan_res, and more
-- ✅ **All 5 Metadata Fields** - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
-- ✅ **37 Tests** - Complete coverage of all elements and properties
-
-**Semantic Markdown Conversion:**
-
-| Element Category | Examples | Markdown Output |
-| ---------------- | ------------------------------- | ----------------------------------------- |
-| Headings | `ocr_title`, `ocr_chapter` | `# Heading` |
-| Sections | `ocr_section`, `ocr_subsection` | `##`, `###` |
-| Structure | `ocr_par`, `ocr_blockquote` | Paragraphs, `> quotes` |
-| Metadata | `ocr_abstract`, `ocr_author` | `**Abstract**`, `*Author*` |
-| Floats | `ocr_header`, `ocr_footer` | `*Header*`, `*Footer*` |
-| Images | `ocr_image`, `ocr_photo` | `` with image property |
-| Math | `ocr_math`, `ocr_display` | `` `formula` ``, ```` ```equation``` ```` |
-| Layout | `ocr_separator` | `---` horizontal rule |
-| Inline | `ocrx_word`, `ocr_dropcap` | Text, `**Letter**` |
-
-**HTML Entity Handling:** Automatically decodes `"`, `'`, `<`, `>`, `&` in title attributes for proper property parsing.
-
-## Configuration Reference
-
-**V2 Defaults (CommonMark-compliant):**
-
-- `list_indent_width`: 2 (CommonMark standard)
-- `bullets`: "\*+-" (cycles through `*`, `+`, `-` for nested levels)
-- `escape_asterisks`: false (minimal escaping)
-- `escape_underscores`: false (minimal escaping)
-- `escape_misc`: false (minimal escaping)
-- `newline_style`: "spaces" (CommonMark: two trailing spaces)
-- `code_block_style`: "backticks" (fenced code blocks with \`\`\`, better whitespace preservation)
-- `heading_style`: "atx" (CommonMark: `#`)
-- `preprocessing.enabled`: false (no preprocessing by default)
-
-For complete configuration reference, see **[Full Documentation](crates/html-to-markdown/README.md#configuration-reference)**.
-
-## Upgrading from v1.x
-
-### Backward Compatibility
-
-Existing v1 code works without changes:
-
-```python
-from html_to_markdown import convert_to_markdown
-
-markdown = convert_to_markdown(html, heading_style="atx") # Still works!
-```
-
-### Modern API (Recommended)
-
-For new projects, use the dataclass-based API:
-
-```python
-from html_to_markdown import convert, ConversionOptions
-
-options = ConversionOptions(heading_style="atx", list_indent_width=2)
-markdown = convert(html, options)
-```
-
-### What Changed in v2
-
-**Core Rewrite:**
-
-- Complete Rust rewrite using `tl` HTML parser
-- 19-30x performance improvement over v1
-- CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
-- No BeautifulSoup or lxml dependencies
-
-**Removed Features:**
-
-- `code_language_callback` - use `code_language` for default language
-- `strip` / `convert` options - use `strip_tags` or preprocessing
-- `convert_to_markdown_stream()` - not supported in v2
-
-**Planned:**
-
-- `custom_converters` - planned for future release
-
-See **[CHANGELOG.md](CHANGELOG.md)** for complete v1 vs v2 comparison and migration guide.
-
-## Kreuzberg Ecosystem
-
-html-to-markdown is part of the [Kreuzberg](https://kreuzberg.dev) ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:
-
-- **Document Extraction**: Extract text, images, and metadata from 50+ document formats
-- **OCR Processing**: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
-- **Table Extraction**: Vision-based and OCR-based table detection
-- **Document Classification**: Automatic detection of contracts, forms, invoices, etc.
-- **RAG Pipelines**: Integration with retrieval-augmented generation workflows
-
-Learn more at [kreuzberg.dev](https://kreuzberg.dev) or join our [Discord community](https://discord.gg/pXxagNK2zN).
-
-## Contributing
-
-See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.
-
-## License
-
-MIT License - see [LICENSE](LICENSE) for details.
-
-## Acknowledgments
-
-Version 1 started as a fork of [markdownify](https://pypi.org/project/markdownify/), rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.
-
-## Support
-
-If you find this library useful, consider:
-
-<a href="https://github.com/sponsors/Goldziher">
-<img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor" height="32">
-</a>
-
-Your support helps maintain and improve this library!
-
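{html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/WHEEL: File without changes
{html_to_markdown-2.0.0.dist-info → html_to_markdown-2.0.1.dist-info}/licenses/LICENSE: File without changes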