html-to-markdown 2.6.3__cp310-abi3-macosx_11_0_arm64.whl → 2.14.2__cp310-abi3-macosx_11_0_arm64.whl
- html_to_markdown/__init__.py +14 -1
- html_to_markdown/_html_to_markdown.abi3.so +0 -0
- html_to_markdown/_html_to_markdown.pyi +196 -0
- html_to_markdown/api.py +83 -31
- html_to_markdown/bin/html-to-markdown +0 -0
- html_to_markdown/v1_compat.py +11 -13
- html_to_markdown-2.14.2.data/scripts/html-to-markdown +0 -0
- html_to_markdown-2.14.2.dist-info/METADATA +634 -0
- html_to_markdown-2.14.2.dist-info/RECORD +17 -0
- {html_to_markdown-2.6.3.dist-info → html_to_markdown-2.14.2.dist-info}/WHEEL +1 -1
- html_to_markdown/_rust.pyi +0 -73
- html_to_markdown-2.6.3.data/scripts/html-to-markdown +0 -0
- html_to_markdown-2.6.3.dist-info/METADATA +0 -242
- html_to_markdown-2.6.3.dist-info/RECORD +0 -17
- {html_to_markdown-2.6.3.dist-info → html_to_markdown-2.14.2.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,634 @@
Metadata-Version: 2.4
Name: html-to-markdown
Version: 2.14.2
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Rust
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Typing :: Typed
License-File: LICENSE
Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
Home-Page: https://github.com/Goldziher/html-to-markdown
Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git

# html-to-markdown

High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). The same engine also drives the Node.js, Ruby, PHP, and WebAssembly bindings, so rendered Markdown stays identical across runtimes. Wheels are published for Linux, macOS, and Windows.

[crates.io](https://crates.io/crates/html-to-markdown-rs)
[npm (Node.js)](https://www.npmjs.com/package/html-to-markdown-node)
[npm (WASM)](https://www.npmjs.com/package/html-to-markdown-wasm)
[PyPI](https://pypi.org/project/html-to-markdown/)
[Packagist](https://packagist.org/packages/goldziher/html-to-markdown)
[RubyGems](https://rubygems.org/gems/html-to-markdown)
[Hex.pm](https://hex.pm/packages/html_to_markdown)
[NuGet](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
[Maven Central](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
[pkg.go.dev](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown)
[License](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
[Discord](https://discord.gg/pXxagNK2zN)

## Installation

```bash
pip install html-to-markdown
```

## Performance Snapshot

Apple M4 • Real Wikipedia documents • `convert()` (Python)

| Document            | Size  | Latency | Throughput | Docs/sec |
| ------------------- | ----- | ------- | ---------- | -------- |
| Lists (Timeline)    | 129KB | 0.62ms  | 208 MB/s   | 1,613    |
| Tables (Countries)  | 360KB | 2.02ms  | 178 MB/s   | 495      |
| Mixed (Python wiki) | 656KB | 4.56ms  | 144 MB/s   | 219      |
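
The Throughput and Docs/sec columns follow from Size and Latency. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); the helper names are illustrative and not part of the package:

```python
def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """Bytes processed per second, expressed in MB/s (decimal units assumed)."""
    size_bytes = size_kb * 1_000
    latency_s = latency_ms / 1_000
    return size_bytes / latency_s / 1_000_000


def docs_per_second(latency_ms: float) -> float:
    """How many documents of this latency fit into one second."""
    return 1_000 / latency_ms


# (document, size in KB, latency in ms) copied from the table above
rows = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]

for name, size_kb, latency_ms in rows:
    mb_s = throughput_mb_per_s(size_kb, latency_ms)
    docs = docs_per_second(latency_ms)
    print(f"{name}: {mb_s:.0f} MB/s, {docs:,.0f} docs/sec")
```

Rounded to whole numbers, the results reproduce the three table rows.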

> V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.

### Benchmark Fixtures (Apple M4)

Pulled directly from `tools/runtime-bench` (`task bench:bindings -- --language python`) so they stay in lockstep with the Rust core:

| Document               | Size   | ops/sec (Python) |
| ---------------------- | ------ | ---------------- |
| Lists (Timeline)       | 129 KB | 1,405            |
| Tables (Countries)     | 360 KB | 352              |
| Medium (Python)        | 657 KB | 158              |
| Large (Rust)           | 567 KB | 183              |
| Small (Intro)          | 463 KB | 223              |
| hOCR German PDF        | 44 KB  | 2,991            |
| hOCR Invoice           | 4 KB   | 23,500           |
| hOCR Embedded Tables   | 37 KB  | 3,464            |

> Re-run locally with `task bench:bindings -- --language python --output tmp.json` to compare against CI history.

## Quick Start

```python
from html_to_markdown import convert

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

markdown = convert(html)
print(markdown)
```

## Configuration (v2 API)

```python
from html_to_markdown import ConversionOptions, convert

options = ConversionOptions(
    heading_style="atx",
    list_indent_width=2,
    bullets="*+-",
)
options.escape_asterisks = True
options.code_language = "python"
options.extract_metadata = True

markdown = convert(html, options)
```

### Reusing Parsed Options

Avoid re-parsing the same option dictionaries inside hot loops by building a reusable handle:

```python
from html_to_markdown import ConversionOptions, convert_with_handle, create_options_handle

handle = create_options_handle(ConversionOptions(hocr_spatial_tables=False))

for html in documents:
    markdown = convert_with_handle(html, handle)
```

### HTML Preprocessing

```python
from html_to_markdown import ConversionOptions, PreprocessingOptions, convert

options = ConversionOptions(
    ...
)

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",
)

markdown = convert(scraped_html, options, preprocessing)
```

### Inline Image Extraction

```python
from html_to_markdown import InlineImageConfig, convert_with_inline_images

markdown, inline_images, warnings = convert_with_inline_images(
    '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)

if inline_images:
    first = inline_images[0]
    print(first["format"], first["dimensions"], first["attributes"])  # e.g. "png", (1, 1), {"width": "1"}
```

Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
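
As its name suggests, `max_decoded_size_bytes` bounds the decoded payload rather than the base64 text in the `src` attribute. The sketch below illustrates that kind of check with the standard library only; it is not the library's implementation, and `decoded_data_uri_size` and `within_limit` are hypothetical helpers introduced here.

```python
import base64
import binascii
from typing import Optional


def decoded_data_uri_size(src: str) -> Optional[int]:
    """Return the decoded payload size in bytes for a base64 data URI, else None."""
    if not src.startswith("data:"):
        return None
    # "data:image/png;base64,AAAA" -> header "image/png;base64", payload "AAAA"
    header, sep, payload = src[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        return None
    try:
        return len(base64.b64decode(payload, validate=True))
    except (binascii.Error, ValueError):
        return None


def within_limit(src: str, max_decoded_size_bytes: int) -> bool:
    """True only when the URI decodes cleanly and fits within the limit."""
    size = decoded_data_uri_size(src)
    return size is not None and size <= max_decoded_size_bytes


# The 8-byte PNG signature stands in for a real image payload.
sample = "data:image/png;base64," + base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
print(decoded_data_uri_size(sample))
print(within_limit(sample, 1024), within_limit(sample, 4))
```

Checking the size before any image decoding keeps the cost of rejecting an oversized payload low.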

### Metadata Extraction

Extract comprehensive metadata (title, description, headers, links, images, structured data) during conversion in a single pass.

#### Basic Usage

```python
from html_to_markdown import convert_with_metadata

html = """
<html>
  <head>
    <title>Example Article</title>
    <meta name="description" content="Demo page">
    <link rel="canonical" href="https://example.com/article">
  </head>
  <body>
    <h1 id="welcome">Welcome</h1>
    <a href="https://example.com" rel="nofollow external">Example link</a>
    <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
  </body>
</html>
"""

markdown, metadata = convert_with_metadata(html)

print(markdown)
print(metadata["document"]["title"])        # "Example Article"
print(metadata["headers"][0]["text"])       # "Welcome"
print(metadata["links"][0]["href"])         # "https://example.com"
print(metadata["images"][0]["dimensions"])  # (640, 480)
```

#### Configuration

Control which metadata types are extracted using `MetadataConfig`:

```python
from html_to_markdown import ConversionOptions, MetadataConfig, convert_with_metadata

options = ConversionOptions(heading_style="atx")
config = MetadataConfig(
    extract_headers=True,                # h1-h6 elements (default: True)
    extract_links=True,                  # <a> hyperlinks (default: True)
    extract_images=True,                 # <img> elements (default: True)
    extract_structured_data=True,        # JSON-LD, Microdata, RDFa (default: True)
    max_structured_data_size=1_000_000,  # Max bytes for structured data (default: 100KB)
)

markdown, metadata = convert_with_metadata(html, options, config)
```

#### Metadata Structure

The `metadata` dictionary contains five categories:

```python
metadata = {
    "document": {  # Document-level metadata from <head>
        "title": str | None,
        "description": str | None,
        "keywords": list[str],  # Comma-separated keywords from meta tags
        "author": str | None,
        "canonical_url": str | None,  # link[rel="canonical"] href
        "base_href": str | None,
        "language": str | None,  # lang attribute (e.g., "en")
        "text_direction": str | None,  # "ltr", "rtl", or "auto"
        "open_graph": dict[str, str],  # og:* meta properties
        "twitter_card": dict[str, str],  # twitter:* meta properties
        "meta_tags": dict[str, str],  # Other meta tag properties
    },
    "headers": [  # h1-h6 elements with hierarchy
        {
            "level": int,  # 1-6
            "text": str,  # Normalized text content
            "id": str | None,  # HTML id attribute
            "depth": int,  # Nesting depth in document tree
            "html_offset": int,  # Byte offset in original HTML
        },
        # ... more headers
    ],
    "links": [  # Extracted <a> elements
        {
            "href": str,
            "text": str,
            "title": str | None,
            "link_type": str,  # "anchor" | "internal" | "external" | "email" | "phone" | "other"
            "rel": list[str],  # rel attribute values
            "attributes": dict[str, str],  # Other HTML attributes
        },
        # ... more links
    ],
    "images": [  # Extracted <img> elements
        {
            "src": str,  # Image source (URL or data URI)
            "alt": str | None,
            "title": str | None,
            "dimensions": tuple[int, int] | None,  # (width, height)
            "image_type": str,  # "data_uri" | "inline_svg" | "external" | "relative"
            "attributes": dict[str, str],
        },
        # ... more images
    ],
    "structured_data": [  # JSON-LD, Microdata, RDFa blocks
        {
            "data_type": str,  # "json_ld" | "microdata" | "rdfa"
            "raw_json": str,  # JSON string representation
            "schema_type": str | None,  # Detected schema type (e.g., "Article")
        },
        # ... more structured data
    ],
}
```

#### Real-World Use Cases

**Extract Article Metadata for SEO**

```python
from html_to_markdown import convert_with_metadata

def extract_article_metadata(html: str) -> dict:
    markdown, metadata = convert_with_metadata(html)
    doc = metadata["document"]

    return {
        "title": doc.get("title"),
        "description": doc.get("description"),
        "keywords": doc.get("keywords", []),
        "author": doc.get("author"),
        "canonical_url": doc.get("canonical_url"),
        "language": doc.get("language"),
        "open_graph": doc.get("open_graph", {}),
        "twitter_card": doc.get("twitter_card", {}),
        "markdown": markdown,
    }

# Usage
seo_data = extract_article_metadata(html)
print(f"Title: {seo_data['title']}")
print(f"Language: {seo_data['language']}")
print(f"OG Image: {seo_data['open_graph'].get('image')}")
```

**Build Table of Contents**

```python
from html_to_markdown import convert_with_metadata

def build_table_of_contents(html: str) -> list[dict]:
    """Generate a nested TOC from header structure."""
    markdown, metadata = convert_with_metadata(html)
    headers = metadata["headers"]

    toc = []
    for header in headers:
        toc.append({
            "level": header["level"],
            "text": header["text"],
            "anchor": header.get("id") or header["text"].lower().replace(" ", "-"),
        })
    return toc

# Usage
toc = build_table_of_contents(html)
for item in toc:
    indent = " " * (item["level"] - 1)
    print(f"{indent}- [{item['text']}](#{item['anchor']})")
```
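
The function above returns a flat list and leaves nesting to the indentation at print time. If an actual tree is needed, one possible approach is a stack of open sections keyed on `level`. This is a sketch over hand-written sample data in the documented `headers` shape; `nest_headers` is a hypothetical helper, not part of the package.

```python
def nest_headers(headers: list[dict]) -> list[dict]:
    """Turn a flat, document-ordered header list into a tree keyed on `level`."""
    root: list[dict] = []
    stack: list[dict] = []  # currently open ancestors, shallowest first
    for header in headers:
        node = {"level": header["level"], "text": header["text"], "children": []}
        # Close every open section at the same or a deeper level.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        parent_children = stack[-1]["children"] if stack else root
        parent_children.append(node)
        stack.append(node)
    return root


# Sample data only; real entries come from metadata["headers"].
sample_headers = [
    {"level": 1, "text": "Welcome"},
    {"level": 2, "text": "Install"},
    {"level": 3, "text": "From PyPI"},
    {"level": 2, "text": "Usage"},
]

tree = nest_headers(sample_headers)
print([child["text"] for child in tree[0]["children"]])
```

Skipped levels (an `h3` directly under an `h1`, say) simply attach to the nearest shallower header.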

**Validate Links and Accessibility**

```python
from html_to_markdown import convert_with_metadata

def check_accessibility(html: str) -> dict:
    """Find common accessibility and SEO issues."""
    markdown, metadata = convert_with_metadata(html)

    return {
        "images_without_alt": [
            img for img in metadata["images"]
            if not img.get("alt")
        ],
        "links_without_text": [
            link for link in metadata["links"]
            if not link.get("text", "").strip()
        ],
        "external_links_count": len([
            link for link in metadata["links"]
            if link["link_type"] == "external"
        ]),
        "broken_anchors": [
            link for link in metadata["links"]
            if link["link_type"] == "anchor"
        ],
    }

# Usage
issues = check_accessibility(html)
if issues["images_without_alt"]:
    print(f"Found {len(issues['images_without_alt'])} images without alt text")
```
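
Note that `broken_anchors` above collects every anchor link without checking its target. A stricter check could compare each fragment with the header `id` values in the same metadata. The sketch below assumes anchor hrefs take the form `#fragment` and only knows about header ids, so anchors that point at other elements would be reported as missing; `find_broken_anchors` is a hypothetical helper.

```python
def find_broken_anchors(metadata: dict) -> list[str]:
    """Return '#fragment' hrefs that match no header `id` in the same document."""
    known_ids = {h["id"] for h in metadata.get("headers", []) if h.get("id")}
    broken = []
    for link in metadata.get("links", []):
        if link.get("link_type") != "anchor":
            continue
        fragment = link["href"].lstrip("#")
        if fragment and fragment not in known_ids:
            broken.append(link["href"])
    return broken


# Hand-written sample in the documented metadata shape.
sample_metadata = {
    "headers": [{"level": 1, "text": "Welcome", "id": "welcome"}],
    "links": [
        {"href": "#welcome", "text": "Top", "link_type": "anchor"},
        {"href": "#missing", "text": "Nowhere", "link_type": "anchor"},
        {"href": "https://example.com", "text": "Site", "link_type": "external"},
    ],
}

print(find_broken_anchors(sample_metadata))
```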

**Extract Structured Data (JSON-LD, Microdata)**

```python
from html_to_markdown import convert_with_metadata
import json

def extract_json_ld_schemas(html: str) -> list[dict]:
    """Extract all JSON-LD structured data blocks."""
    markdown, metadata = convert_with_metadata(html)

    schemas = []
    for block in metadata["structured_data"]:
        if block["data_type"] == "json_ld":
            try:
                schema = json.loads(block["raw_json"])
                schemas.append({
                    "type": block.get("schema_type"),
                    "data": schema,
                })
            except json.JSONDecodeError:
                continue
    return schemas

# Usage
schemas = extract_json_ld_schemas(html)
for schema in schemas:
    print(f"Found {schema['type']} schema:")
    print(json.dumps(schema["data"], indent=2))
```

**Migrate Content with Preservation of Links and Images**

```python
from html_to_markdown import convert_with_metadata

def migrate_with_manifest(html: str, base_url: str) -> tuple[str, dict]:
    """Convert to Markdown while capturing all external references."""
    markdown, metadata = convert_with_metadata(html)

    manifest = {
        "title": metadata["document"].get("title"),
        "external_links": [
            {"url": link["href"], "text": link["text"]}
            for link in metadata["links"]
            if link["link_type"] == "external"
        ],
        "external_images": [
            {"url": img["src"], "alt": img.get("alt")}
            for img in metadata["images"]
            if img["image_type"] == "external"
        ],
    }
    return markdown, manifest

# Usage
md, manifest = migrate_with_manifest(html, "https://example.com")
print(f"Converted: {manifest['title']}")
print(f"External resources: {len(manifest['external_links'])} links, {len(manifest['external_images'])} images")
```
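
The `base_url` argument is accepted above but not used. One plausible use during a migration is resolving relative references against it so the manifest holds absolute URLs. A minimal sketch with the standard library; `absolutize` and the sample URLs are illustrative only.

```python
from urllib.parse import urljoin


def absolutize(urls: list[str], base_url: str) -> list[str]:
    """Resolve each URL against base_url; already-absolute URLs pass through."""
    return [urljoin(base_url, url) for url in urls]


resolved = absolutize(
    ["/img/hero.jpg", "docs/page.html", "https://cdn.example.org/a.png"],
    "https://example.com/blog/",
)
for url in resolved:
    print(url)
```

The trailing slash on the base matters: without it, `urljoin` treats the last path segment as a file name and replaces it.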

#### Feature Detection

Check if metadata extraction is available at runtime:

```python
from html_to_markdown import convert_with_metadata, convert

try:
    # Try to use metadata extraction
    markdown, metadata = convert_with_metadata(html)
    print(f"Metadata available: {metadata['document'].get('title')}")
except (NameError, TypeError):
    # Fallback for builds without metadata feature
    markdown = convert(html)
    print("Metadata feature not available, using basic conversion")
```
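
If a build did not export the symbol at all, the `import` line itself would fail with `ImportError` before the `try` block runs. A more defensive variant probes the module first. This is a generic sketch; `has_feature` is a hypothetical helper, and how a build without the metadata feature actually behaves is an assumption here.

```python
import importlib


def has_feature(module_name: str, attribute: str) -> bool:
    """True when the module imports cleanly and exposes the named attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attribute)


if has_feature("html_to_markdown", "convert_with_metadata"):
    print("Metadata extraction available")
else:
    print("Falling back to basic conversion")
```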

#### Error Handling

Metadata extraction is designed to be robust:

```python
from html_to_markdown import convert_with_metadata, MetadataConfig

# Handle large structured data safely
config = MetadataConfig(
    extract_structured_data=True,
    max_structured_data_size=500_000,  # 500KB limit
)

try:
    markdown, metadata = convert_with_metadata(html, metadata_config=config)

    # Safe access with defaults
    title = metadata["document"].get("title", "Untitled")
    headers = metadata["headers"] or []
    images = metadata["images"] or []

except Exception as e:
    # Handle parsing errors gracefully
    print(f"Extraction error: {e}")
    # Fallback to basic conversion
    from html_to_markdown import convert
    markdown = convert(html)
```

#### Performance Considerations

1. **Single-Pass Collection**: Metadata extraction happens during HTML parsing with zero overhead when disabled.
2. **Memory Efficient**: Collections use reasonable pre-allocations (32 headers, 64 links, 16 images typical).
3. **Selective Extraction**: Disable unused metadata types in `MetadataConfig` to reduce overhead.
4. **Structured Data Limits**: Large JSON-LD blocks are skipped if they exceed the size limit to prevent memory exhaustion.

```python
from html_to_markdown import MetadataConfig, convert_with_metadata

# Optimize for performance
config = MetadataConfig(
    extract_headers=True,
    extract_links=False,            # Skip if not needed
    extract_images=False,           # Skip if not needed
    extract_structured_data=False,  # Skip if not needed
)

markdown, metadata = convert_with_metadata(html, metadata_config=config)
```

#### Differences from Basic Conversion

When `extract_metadata=True` (default in `ConversionOptions`), basic metadata is embedded in a YAML frontmatter block:

```python
from html_to_markdown import convert, ConversionOptions

# Basic metadata as YAML frontmatter
options = ConversionOptions(extract_metadata=True)
markdown = convert(html, options)
# Output: "---\ntitle: ...\n---\n\nContent..."

# Rich metadata extraction (all metadata types)
from html_to_markdown import convert_with_metadata
markdown, full_metadata = convert_with_metadata(html)
# Returns structured data dict with headers, links, images, etc.
```

The two approaches serve different purposes:
- `extract_metadata=True`: Embeds basic metadata in the output Markdown
- `convert_with_metadata()`: Returns structured metadata for programmatic access
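
When the embedded frontmatter needs to be separated from the body afterwards, a small splitter is enough. The sketch below assumes the output shape shown in the comment above (`---`, metadata lines, `---`, blank line, content) and leaves the YAML text unparsed; `split_frontmatter` is a hypothetical helper, not a package function.

```python
def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Return (frontmatter_text, body); frontmatter is empty when none is present."""
    opening = "---\n"
    closing = "\n---\n"
    if not markdown.startswith(opening):
        return "", markdown
    # Search from the newline that ends the opening fence so an empty block matches.
    end = markdown.find(closing, len(opening) - 1)
    if end == -1:
        return "", markdown
    frontmatter = markdown[len(opening):end]
    body = markdown[end + len(closing):].lstrip("\n")
    return frontmatter, body


sample = "---\ntitle: Example Article\n---\n\nContent..."
frontmatter, body = split_frontmatter(sample)
print(repr(frontmatter))
print(repr(body))
```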

### hOCR (HTML OCR) Support

```python
from html_to_markdown import convert

# hOCR documents are detected automatically: structured Markdown is emitted
# directly and tables are reconstructed without extra configuration.
markdown = convert(hocr_html)
```

## CLI (same engine)

```bash
pipx install html-to-markdown  # or: pip install html-to-markdown

html-to-markdown page.html > page.md
cat page.html | html-to-markdown --heading-style atx > page.md
```
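
The same pipe-based invocation can be driven from Python when a script needs the CLI rather than the bindings. This is a sketch that relies only on what is shown above (HTML on stdin, Markdown on stdout, the `--heading-style` flag); `build_cli_command` and `convert_via_cli` are hypothetical helpers.

```python
import shutil
import subprocess
from typing import Optional


def build_cli_command(heading_style: Optional[str] = None) -> list[str]:
    """Assemble the argument list; only --heading-style (shown above) is used."""
    command = ["html-to-markdown"]
    if heading_style is not None:
        command += ["--heading-style", heading_style]
    return command


def convert_via_cli(html: str, heading_style: Optional[str] = None) -> str:
    """Pipe HTML to the CLI on stdin and return the Markdown written to stdout."""
    if shutil.which("html-to-markdown") is None:
        raise FileNotFoundError("html-to-markdown is not on PATH")
    result = subprocess.run(
        build_cli_command(heading_style),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(build_cli_command("atx"))
```

For in-process use, `convert()` avoids the cost of spawning a process per document.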

## API Surface

### `ConversionOptions`

Key fields (see docstring for full matrix):

- `heading_style`: `"underlined" | "atx" | "atx_closed"`
- `list_indent_width`: spaces per indent level (default 2)
- `bullets`: cycle of bullet characters (`"*+-"`)
- `strong_em_symbol`: `"*"` or `"_"`
- `code_language`: default fenced code block language
- `wrap`, `wrap_width`: wrap Markdown output
- `strip_tags`: remove specific HTML tags
- `preprocessing`: `PreprocessingOptions`
- `encoding`: input character encoding (informational)

### `PreprocessingOptions`

- `enabled`: enable HTML sanitisation (default: `True` since v2.4.2 for robust malformed HTML handling)
- `preset`: `"minimal" | "standard" | "aggressive"` (default: `"standard"`)
- `remove_navigation`: remove navigation elements (default: `True`)
- `remove_forms`: remove form elements (default: `True`)

**Note:** As of v2.4.2, preprocessing is enabled by default to ensure robust handling of malformed HTML (e.g., bare angle brackets like `1<2` in content). Set `enabled=False` if you need minimal preprocessing.

### `InlineImageConfig`

- `max_decoded_size_bytes`: reject larger payloads
- `filename_prefix`: generated name prefix (`embedded_image` default)
- `capture_svg`: collect inline `<svg>` (default `True`)
- `infer_dimensions`: decode raster images to obtain dimensions (default `False`)

## Performance: V2 vs V1 Compatibility Layer

### ⚠️ Important: Always Use V2 API

The v2 API (`convert()`) is **strongly recommended** for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:

```python
# ✅ RECOMMENDED - V2 Direct API (Fast)
from html_to_markdown import convert, ConversionOptions

markdown = convert(html)  # Simple conversion - FAST
markdown = convert(html, ConversionOptions(heading_style="atx"))  # With options - FAST

# ❌ AVOID - V1 Compatibility Layer (Slow)
from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Adds 77% overhead
```

### Performance Comparison

Benchmarked on Apple M4 with 25-paragraph HTML document:

| API                      | ops/sec          | Relative Performance | Recommendation      |
| ------------------------ | ---------------- | -------------------- | ------------------- |
| **V2 API** (`convert()`) | **129,822**      | baseline             | ✅ **Use this**     |
| **V1 Compat Layer**      | **67,673**       | **77% slower**       | ⚠️ Migration only   |
| **CLI**                  | **150-210 MB/s** | Fastest              | ✅ Batch processing |

The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.

### When to Use Each

- **V2 API (`convert()`)**: All new code, production systems, performance-critical applications ← **Use this**
- **V1 Compat (`convert_to_markdown()`)**: Only for gradual migration from legacy codebases
- **CLI (`html-to-markdown`)**: Batch processing, shell scripts, maximum throughput

## v1 Compatibility

A compatibility layer is provided to ease migration from v1.x:

- **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify`. Keyword mappings are listed in the [changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md#v200).
- **⚠️ Performance warning**: These compatibility functions add 77% overhead. Migrate to v2 API as soon as possible.
- **CLI**: The Rust CLI replaces the old Python script. New flags are documented via `html-to-markdown --help`.
- **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.

## Links

- GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
- Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
- Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)

## License

MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).

## Support

If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).

@@ -0,0 +1,17 @@
html_to_markdown-2.14.2.data/scripts/html-to-markdown,sha256=hniSeml124eJXvYbQsC3GLsUlS-TX93fFYgLtxogCn8,6263856
html_to_markdown-2.14.2.dist-info/RECORD,,
html_to_markdown-2.14.2.dist-info/WHEEL,sha256=WvP__evn8XoyZeDO32cKBm5BQTOFbdB1WoQ-d3AzYdw,132
html_to_markdown-2.14.2.dist-info/METADATA,sha256=KceoGs__CWCDYonJi8jTkXIyTSczsxeQP6IT-niDOPE,23246
html_to_markdown-2.14.2.dist-info/licenses/LICENSE,sha256=oQvPC-0UWvfg0WaeUBe11OJMtX60An-TW1ev_oaAA0k,1086
html_to_markdown/options.py,sha256=vImRfeHAeyAy0Lnt6cTPHGbj7mTdw8AEUgo19u7MAA0,5080
html_to_markdown/_html_to_markdown.pyi,sha256=IPD6CegtaanBsKTmK30v4nvWZ5HUlCajS6jkiOsoVj8,5875
html_to_markdown/_html_to_markdown.abi3.so,sha256=aL3Cy8W9rUaEyom4OTw9D1NQJ01ELgSzXSr-aDjVPc4,3503168
html_to_markdown/__init__.py,sha256=heUlsM_dzRMTxzDPQtvEHO-9g85GtWXyLucGfkk_wp0,1692
html_to_markdown/api.py,sha256=zXXoFpdDbMIQXl65NT7BjjYu_1xwEM7VNGNUK2zQNfQ,6934
html_to_markdown/v1_compat.py,sha256=kn5GYvgn3dTW_Zksu9PzWVk-5CYhvXxsqAeyTdDYZSY,8001
html_to_markdown/cli.py,sha256=Rn-s3FZPea1jgCJtDzH_TFvOEiA_uZFVfgjhr6xyL_g,64
html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
html_to_markdown/exceptions.py,sha256=aTASOzbywgfqOYjlw18ZkOWSxKff4EbUbmMua_73TGA,2370
html_to_markdown/cli_proxy.py,sha256=HPYKH5Mf5OUvkbEQISJvAkxrbjWKxE5GokA44HoQ6z8,3858
html_to_markdown/__main__.py,sha256=3Ic_EbOt2h6W88q084pkz5IKU6iY5z_woBygH6u9aw0,327
html_to_markdown/bin/html-to-markdown,sha256=hniSeml124eJXvYbQsC3GLsUlS-TX93fFYgLtxogCn8,6263856