html-to-markdown 2.6.3__cp310-abi3-macosx_11_0_arm64.whl → 2.14.2__cp310-abi3-macosx_11_0_arm64.whl

@@ -0,0 +1,634 @@
1
+ Metadata-Version: 2.4
2
+ Name: html-to-markdown
3
+ Version: 2.14.2
4
+ Classifier: Development Status :: 5 - Production/Stable
5
+ Classifier: Environment :: Console
6
+ Classifier: Intended Audience :: Developers
7
+ Classifier: License :: OSI Approved :: MIT License
8
+ Classifier: Operating System :: OS Independent
9
+ Classifier: Programming Language :: Python :: 3 :: Only
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Classifier: Programming Language :: Python :: 3.13
14
+ Classifier: Programming Language :: Python :: 3.14
15
+ Classifier: Programming Language :: Rust
16
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
17
+ Classifier: Topic :: Text Processing
18
+ Classifier: Topic :: Text Processing :: Markup
19
+ Classifier: Topic :: Text Processing :: Markup :: HTML
20
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
21
+ Classifier: Typing :: Typed
22
+ License-File: LICENSE
23
+ Summary: High-performance HTML to Markdown converter powered by Rust with a clean Python API
24
+ Keywords: cli-tool,converter,html,html2markdown,html5,markdown,markup,parser,rust,text-processing
25
+ Home-Page: https://github.com/Goldziher/html-to-markdown
26
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
27
+ Requires-Python: >=3.10
28
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
29
+ Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
30
+ Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
31
+ Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
32
+ Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
33
+
34
+ # html-to-markdown
35
+
36
+ High-performance HTML to Markdown converter with a clean Python API (powered by a Rust core). The same engine also drives the Node.js, Ruby, PHP, and WebAssembly bindings, so rendered Markdown stays identical across runtimes. Wheels are published for Linux, macOS, and Windows.
37
+
38
+ [![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg?logo=rust&label=crates.io)](https://crates.io/crates/html-to-markdown-rs)
39
+ [![npm (node)](https://img.shields.io/npm/v/html-to-markdown-node.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-node)
40
+ [![npm (wasm)](https://img.shields.io/npm/v/html-to-markdown-wasm.svg?logo=npm)](https://www.npmjs.com/package/html-to-markdown-wasm)
41
+ [![PyPI](https://img.shields.io/pypi/v/html-to-markdown.svg?logo=pypi)](https://pypi.org/project/html-to-markdown/)
42
+ [![Packagist](https://img.shields.io/packagist/v/goldziher/html-to-markdown.svg)](https://packagist.org/packages/goldziher/html-to-markdown)
43
+ [![RubyGems](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
44
+ [![Hex.pm](https://img.shields.io/hexpm/v/html_to_markdown.svg)](https://hex.pm/packages/html_to_markdown)
45
+ [![NuGet](https://img.shields.io/nuget/v/Goldziher.HtmlToMarkdown.svg)](https://www.nuget.org/packages/Goldziher.HtmlToMarkdown/)
46
+ [![Maven Central](https://img.shields.io/maven-central/v/io.github.goldziher/html-to-markdown.svg)](https://central.sonatype.com/artifact/io.github.goldziher/html-to-markdown)
47
+ [![Go Reference](https://pkg.go.dev/badge/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown.svg)](https://pkg.go.dev/github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown)
48
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)
49
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
50
+
51
+ ## Installation
52
+
53
+ ```bash
54
+ pip install html-to-markdown
55
+ ```
56
+
57
+ ## Performance Snapshot
58
+
59
+ Apple M4 • Real Wikipedia documents • `convert()` (Python)
60
+
61
+ | Document | Size | Latency | Throughput | Docs/sec |
62
+ | ------------------- | ----- | ------- | ---------- | -------- |
63
+ | Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
64
+ | Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
65
+ | Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
66
+
67
+ > V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2's Rust engine delivers 60–80× higher throughput.
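The throughput and docs/sec columns follow directly from size and latency. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1,000 KB) and rounding to whole numbers; the helper name `derive_rates` is illustrative and not part of the library:

```python
# Rows copied from the Performance Snapshot table: (label, size in KB, latency in ms).
SNAPSHOT = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def derive_rates(size_kb, latency_ms):
    """Return (throughput in MB/s, documents per second), both rounded."""
    # KB per ms equals MB per s when 1 MB = 1,000 KB.
    throughput_mb_s = round(size_kb / latency_ms)
    docs_per_sec = round(1000 / latency_ms)
    return throughput_mb_s, docs_per_sec


for label, size_kb, latency_ms in SNAPSHOT:
    mb_s, docs = derive_rates(size_kb, latency_ms)
    print(f"{label}: {mb_s} MB/s, {docs} docs/sec")
```

Running it reproduces the table's last two columns from the first two, which is a quick way to sanity-check locally measured latencies.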
68
+
69
+ ### Benchmark Fixtures (Apple M4)
70
+
71
+ These figures are pulled directly from `tools/runtime-bench` (`task bench:bindings -- --language python`), so they stay in lockstep with the Rust core:
72
+
73
+ | Document | Size | ops/sec (Python) |
74
+ | ---------------------- | ------ | ---------------- |
75
+ | Lists (Timeline) | 129 KB | 1,405 |
76
+ | Tables (Countries) | 360 KB | 352 |
77
+ | Medium (Python) | 657 KB | 158 |
78
+ | Large (Rust) | 567 KB | 183 |
79
+ | Small (Intro) | 463 KB | 223 |
80
+ | hOCR German PDF | 44 KB | 2,991 |
81
+ | hOCR Invoice | 4 KB | 23,500 |
82
+ | hOCR Embedded Tables | 37 KB | 3,464 |
83
+
84
+ > Re-run locally with `task bench:bindings -- --language python --output tmp.json` to compare against CI history.
85
+
86
+ ## Quick Start
87
+
88
+ ```python
89
+ from html_to_markdown import convert
90
+
91
+ html = """
92
+ <h1>Welcome</h1>
93
+ <p>This is <strong>fast</strong> Rust-powered conversion!</p>
94
+ <ul>
95
+ <li>Blazing fast</li>
96
+ <li>Type safe</li>
97
+ <li>Easy to use</li>
98
+ </ul>
99
+ """
100
+
101
+ markdown = convert(html)
102
+ print(markdown)
103
+ ```
104
+
105
+ ## Configuration (v2 API)
106
+
107
+ ```python
108
+ from html_to_markdown import ConversionOptions, convert
109
+
110
+ options = ConversionOptions(
111
+     heading_style="atx",
112
+     list_indent_width=2,
113
+     bullets="*+-",
114
+ )
115
+ options.escape_asterisks = True
116
+ options.code_language = "python"
117
+ options.extract_metadata = True
118
+
119
+ markdown = convert(html, options)
120
+ ```
121
+
122
+ ### Reusing Parsed Options
123
+
124
+ Avoid re-parsing the same option dictionaries inside hot loops by building a reusable handle:
125
+
126
+ ```python
127
+ from html_to_markdown import ConversionOptions, convert_with_handle, create_options_handle
128
+
129
+ handle = create_options_handle(ConversionOptions(hocr_spatial_tables=False))
130
+
131
+ for html in documents:
132
+     markdown = convert_with_handle(html, handle)
133
+ ```
134
+
135
+ ### HTML Preprocessing
136
+
137
+ ```python
138
+ from html_to_markdown import ConversionOptions, PreprocessingOptions, convert
139
+
140
+ options = ConversionOptions(
141
+     ...
141
+ )
142
+
143
+ preprocessing = PreprocessingOptions(
144
+     enabled=True,
145
+     preset="aggressive",
147
+ )
148
+
149
+ markdown = convert(scraped_html, options, preprocessing)
150
+ ```
151
+
152
+ ### Inline Image Extraction
153
+
154
+ ```python
155
+ from html_to_markdown import InlineImageConfig, convert_with_inline_images
156
+
157
+ markdown, inline_images, warnings = convert_with_inline_images(
158
+     '<p><img src="data:image/png;base64,...==" alt="Pixel" width="1" height="1"></p>',
159
+     image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
160
+ )
161
+
162
+ if inline_images:
163
+     first = inline_images[0]
164
+     print(first["format"], first["dimensions"], first["attributes"])  # e.g. "png", (1, 1), {"width": "1"}
165
+ ```
166
+
167
+ Each inline image is returned as a typed dictionary (`bytes` payload, metadata, and relevant HTML attributes). Warnings are human-readable skip reasons.
168
+
169
+ ### Metadata Extraction
170
+
171
+ Extract comprehensive metadata (title, description, headers, links, images, structured data) during conversion in a single pass.
172
+
173
+ #### Basic Usage
174
+
175
+ ```python
176
+ from html_to_markdown import convert_with_metadata
177
+
178
+ html = """
179
+ <html>
180
+ <head>
181
+ <title>Example Article</title>
182
+ <meta name="description" content="Demo page">
183
+ <link rel="canonical" href="https://example.com/article">
184
+ </head>
185
+ <body>
186
+ <h1 id="welcome">Welcome</h1>
187
+ <a href="https://example.com" rel="nofollow external">Example link</a>
188
+ <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
189
+ </body>
190
+ </html>
191
+ """
192
+
193
+ markdown, metadata = convert_with_metadata(html)
194
+
195
+ print(markdown)
196
+ print(metadata["document"]["title"]) # "Example Article"
197
+ print(metadata["headers"][0]["text"]) # "Welcome"
198
+ print(metadata["links"][0]["href"]) # "https://example.com"
199
+ print(metadata["images"][0]["dimensions"]) # (640, 480)
200
+ ```
201
+
202
+ #### Configuration
203
+
204
+ Control which metadata types are extracted using `MetadataConfig`:
205
+
206
+ ```python
207
+ from html_to_markdown import ConversionOptions, MetadataConfig, convert_with_metadata
208
+
209
+ options = ConversionOptions(heading_style="atx")
210
+ config = MetadataConfig(
211
+     extract_headers=True,  # h1-h6 elements (default: True)
212
+     extract_links=True,  # <a> hyperlinks (default: True)
213
+     extract_images=True,  # <img> elements (default: True)
214
+     extract_structured_data=True,  # JSON-LD, Microdata, RDFa (default: True)
215
+     max_structured_data_size=1_000_000,  # Max bytes for structured data (default: 100KB)
216
+ )
217
+
218
+ markdown, metadata = convert_with_metadata(html, options, config)
219
+ ```
220
+
221
+ #### Metadata Structure
222
+
223
+ The `metadata` dictionary contains five categories:
224
+
225
+ ```python
226
+ metadata = {
227
+     "document": {  # Document-level metadata from <head>
228
+         "title": str | None,
229
+         "description": str | None,
230
+         "keywords": list[str],  # Comma-separated keywords from meta tags
231
+         "author": str | None,
232
+         "canonical_url": str | None,  # link[rel="canonical"] href
233
+         "base_href": str | None,
234
+         "language": str | None,  # lang attribute (e.g., "en")
235
+         "text_direction": str | None,  # "ltr", "rtl", or "auto"
236
+         "open_graph": dict[str, str],  # og:* meta properties
237
+         "twitter_card": dict[str, str],  # twitter:* meta properties
238
+         "meta_tags": dict[str, str],  # Other meta tag properties
239
+     },
240
+     "headers": [  # h1-h6 elements with hierarchy
241
+         {
242
+             "level": int,  # 1-6
243
+             "text": str,  # Normalized text content
244
+             "id": str | None,  # HTML id attribute
245
+             "depth": int,  # Nesting depth in document tree
246
+             "html_offset": int,  # Byte offset in original HTML
247
+         },
248
+         # ... more headers
249
+     ],
250
+     "links": [  # Extracted <a> elements
251
+         {
252
+             "href": str,
253
+             "text": str,
254
+             "title": str | None,
255
+             "link_type": str,  # "anchor" | "internal" | "external" | "email" | "phone" | "other"
256
+             "rel": list[str],  # rel attribute values
257
+             "attributes": dict[str, str],  # Other HTML attributes
258
+         },
259
+         # ... more links
260
+     ],
261
+     "images": [  # Extracted <img> elements
262
+         {
263
+             "src": str,  # Image source (URL or data URI)
264
+             "alt": str | None,
265
+             "title": str | None,
266
+             "dimensions": tuple[int, int] | None,  # (width, height)
267
+             "image_type": str,  # "data_uri" | "inline_svg" | "external" | "relative"
268
+             "attributes": dict[str, str],
269
+         },
270
+         # ... more images
271
+     ],
272
+     "structured_data": [  # JSON-LD, Microdata, RDFa blocks
273
+         {
274
+             "data_type": str,  # "json_ld" | "microdata" | "rdfa"
275
+             "raw_json": str,  # JSON string representation
276
+             "schema_type": str | None,  # Detected schema type (e.g., "Article")
277
+         },
278
+         # ... more structured data
279
+     ],
280
+ }
281
+ ```
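Because the result is a plain dictionary, it can be summarised with the standard library alone. The sketch below runs on a hand-written sample that follows the structure above rather than on real converter output; `summarize_metadata` is an illustrative helper, not a library function.

```python
from collections import Counter


def summarize_metadata(metadata):
    """Count links by type, images by type, and headers by level."""
    return {
        "links_by_type": dict(Counter(link["link_type"] for link in metadata["links"])),
        "images_by_type": dict(Counter(img["image_type"] for img in metadata["images"])),
        "headers_by_level": dict(Counter(h["level"] for h in metadata["headers"])),
    }


# Hand-written sample shaped like the documented structure (not real output).
sample = {
    "document": {"title": "Example Article"},
    "headers": [
        {"level": 1, "text": "Welcome", "id": "welcome", "depth": 0, "html_offset": 0},
        {"level": 2, "text": "Details", "id": None, "depth": 0, "html_offset": 40},
    ],
    "links": [
        {"href": "https://example.com", "text": "Example link", "title": None,
         "link_type": "external", "rel": ["nofollow"], "attributes": {}},
        {"href": "#welcome", "text": "Top", "title": None,
         "link_type": "anchor", "rel": [], "attributes": {}},
    ],
    "images": [],
    "structured_data": [],
}

summary = summarize_metadata(sample)
print(summary["links_by_type"])
```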
282
+
283
+ #### Real-World Use Cases
284
+
285
+ **Extract Article Metadata for SEO**
286
+
287
+ ```python
288
+ from html_to_markdown import convert_with_metadata
289
+
290
+ def extract_article_metadata(html: str) -> dict:
291
+     markdown, metadata = convert_with_metadata(html)
292
+     doc = metadata["document"]
293
+
294
+     return {
295
+         "title": doc.get("title"),
296
+         "description": doc.get("description"),
297
+         "keywords": doc.get("keywords", []),
298
+         "author": doc.get("author"),
299
+         "canonical_url": doc.get("canonical_url"),
300
+         "language": doc.get("language"),
301
+         "open_graph": doc.get("open_graph", {}),
302
+         "twitter_card": doc.get("twitter_card", {}),
303
+         "markdown": markdown,
304
+     }
305
+
306
+ # Usage
307
+ seo_data = extract_article_metadata(html)
308
+ print(f"Title: {seo_data['title']}")
309
+ print(f"Language: {seo_data['language']}")
310
+ print(f"OG Image: {seo_data['open_graph'].get('image')}")
311
+ ```
312
+
313
+ **Build Table of Contents**
314
+
315
+ ```python
316
+ from html_to_markdown import convert_with_metadata
317
+
318
+ def build_table_of_contents(html: str) -> list[dict]:
319
+     """Generate a nested TOC from header structure."""
320
+     markdown, metadata = convert_with_metadata(html)
321
+     headers = metadata["headers"]
322
+
323
+     toc = []
324
+     for header in headers:
325
+         toc.append({
326
+             "level": header["level"],
327
+             "text": header["text"],
328
+             "anchor": header.get("id") or header["text"].lower().replace(" ", "-"),
329
+         })
330
+     return toc
331
+
332
+ # Usage
333
+ toc = build_table_of_contents(html)
334
+ for item in toc:
335
+     indent = "  " * (item["level"] - 1)
336
+     print(f"{indent}- [{item['text']}](#{item['anchor']})")
337
+ ```
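The fallback anchor above only lowercases the text and swaps spaces for hyphens, so punctuation would end up in the fragment. A slightly stricter slug function might look like the following. It is a generic sketch: the exact anchor rules depend on the Markdown renderer, which this does not attempt to match, and `slugify` / `anchor_for` are illustrative names.

```python
import re


def slugify(text):
    """Lowercase, drop punctuation, and join words with single hyphens."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s-]", "", text)  # remove punctuation, keep word characters
    return re.sub(r"[\s_-]+", "-", text).strip("-")


def anchor_for(header):
    """Prefer the HTML id; otherwise derive a slug from the header text."""
    return header.get("id") or slugify(header["text"])


print(anchor_for({"level": 2, "text": "What's New in v2?", "id": None}))
```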
338
+
339
+ **Validate Links and Accessibility**
340
+
341
+ ```python
342
+ from html_to_markdown import convert_with_metadata
343
+
344
+ def check_accessibility(html: str) -> dict:
345
+     """Find common accessibility and SEO issues."""
346
+     markdown, metadata = convert_with_metadata(html)
347
+
348
+     return {
349
+         "images_without_alt": [
350
+             img for img in metadata["images"]
351
+             if not img.get("alt")
352
+         ],
353
+         "links_without_text": [
354
+             link for link in metadata["links"]
355
+             if not link.get("text", "").strip()
356
+         ],
357
+         "external_links_count": len([
358
+             link for link in metadata["links"]
359
+             if link["link_type"] == "external"
360
+         ]),
361
+         "broken_anchors": [  # all anchor links; their targets are not verified here
362
+             link for link in metadata["links"]
363
+             if link["link_type"] == "anchor"
364
+         ],
365
+     }
366
+
367
+ # Usage
368
+ issues = check_accessibility(html)
369
+ if issues["images_without_alt"]:
370
+     print(f"Found {len(issues['images_without_alt'])} images without alt text")
371
+ ```
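The `broken_anchors` entry above collects every in-page anchor link without checking its target. One way to verify them is sketched below on a hand-written sample. It assumes anchor `href` values take the form `#fragment` and that targets are header `id`s (ids on other elements are not visible in the metadata); `find_broken_anchors` is a hypothetical helper.

```python
def find_broken_anchors(metadata):
    """Return anchor links whose fragment matches no header id."""
    header_ids = {h["id"] for h in metadata["headers"] if h.get("id")}
    return [
        link for link in metadata["links"]
        if link["link_type"] == "anchor"
        and link["href"].lstrip("#") not in header_ids
    ]


# Hand-written sample following the documented metadata shape.
sample = {
    "headers": [{"level": 1, "text": "Welcome", "id": "welcome"}],
    "links": [
        {"href": "#welcome", "text": "Top", "link_type": "anchor"},
        {"href": "#missing", "text": "Nowhere", "link_type": "anchor"},
        {"href": "https://example.com", "text": "Out", "link_type": "external"},
    ],
}

for link in find_broken_anchors(sample):
    print(f"No header id matches {link['href']}")
```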
372
+
373
+ **Extract Structured Data (JSON-LD, Microdata)**
374
+
375
+ ```python
376
+ from html_to_markdown import convert_with_metadata
377
+ import json
378
+
379
+ def extract_json_ld_schemas(html: str) -> list[dict]:
380
+     """Extract all JSON-LD structured data blocks."""
381
+     markdown, metadata = convert_with_metadata(html)
382
+
383
+     schemas = []
384
+     for block in metadata["structured_data"]:
385
+         if block["data_type"] == "json_ld":
386
+             try:
387
+                 schema = json.loads(block["raw_json"])
388
+                 schemas.append({
389
+                     "type": block.get("schema_type"),
390
+                     "data": schema,
391
+                 })
392
+             except json.JSONDecodeError:
393
+                 continue
394
+     return schemas
395
+
396
+ # Usage
397
+ schemas = extract_json_ld_schemas(html)
398
+ for schema in schemas:
399
+     print(f"Found {schema['type']} schema:")
400
+     print(json.dumps(schema["data"], indent=2))
401
+ ```
402
+
403
+ **Migrate Content with Preservation of Links and Images**
404
+
405
+ ```python
406
+ from html_to_markdown import convert_with_metadata
407
+
408
+ def migrate_with_manifest(html: str, base_url: str) -> tuple[str, dict]:
409
+     """Convert to Markdown while capturing all external references."""
410
+     markdown, metadata = convert_with_metadata(html)
411
+
412
+     manifest = {
413
+         "title": metadata["document"].get("title"),
414
+         "external_links": [
415
+             {"url": link["href"], "text": link["text"]}
416
+             for link in metadata["links"]
417
+             if link["link_type"] == "external"
418
+         ],
419
+         "external_images": [
420
+             {"url": img["src"], "alt": img.get("alt")}
421
+             for img in metadata["images"]
422
+             if img["image_type"] == "external"
423
+         ],
424
+     }
425
+     return markdown, manifest
426
+
427
+ # Usage
428
+ md, manifest = migrate_with_manifest(html, "https://example.com")
429
+ print(f"Converted: {manifest['title']}")
430
+ print(f"External resources: {len(manifest['external_links'])} links, {len(manifest['external_images'])} images")
431
+ ```
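The `base_url` parameter is accepted above but not used. If relative references should be made absolute during migration, the standard library's `urllib.parse.urljoin` can do it. This is a sketch on hand-written link records; `absolutize_links` is a hypothetical helper, not part of the library.

```python
from urllib.parse import urljoin


def absolutize_links(links, base_url):
    """Return copies of the link records with hrefs resolved against base_url."""
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return [{**link, "href": urljoin(base_url, link["href"])} for link in links]


sample_links = [
    {"href": "/docs/intro", "text": "Intro"},
    {"href": "guide.html", "text": "Guide"},
    {"href": "https://other.example/page", "text": "Elsewhere"},
]

resolved = absolutize_links(sample_links, "https://example.com/blog/post")
for link in resolved:
    print(link["href"])
```

The original records are copied rather than mutated, so the manifest and the raw metadata can be kept side by side.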
432
+
433
+ #### Feature Detection
434
+
435
+ Check if metadata extraction is available at runtime:
436
+
437
+ ```python
438
+ from html_to_markdown import convert_with_metadata, convert
439
+
440
+ try:
441
+     # Try to use metadata extraction
442
+     markdown, metadata = convert_with_metadata(html)
443
+     print(f"Metadata available: {metadata['document'].get('title')}")
444
+ except (NameError, TypeError):
445
+     # Fallback for builds without metadata feature
446
+     markdown = convert(html)
447
+     print("Metadata feature not available, using basic conversion")
448
+ ```
449
+
450
+ #### Error Handling
451
+
452
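+ Metadata extraction is designed to be robust, but it is still worth capping structured-data size and falling back to basic conversion if extraction raises: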
453
+
454
+ ```python
455
+ from html_to_markdown import convert_with_metadata, MetadataConfig
456
+
457
+ # Handle large structured data safely
458
+ config = MetadataConfig(
459
+     extract_structured_data=True,
460
+     max_structured_data_size=500_000,  # 500KB limit
461
+ )
462
+
463
+ try:
464
+     markdown, metadata = convert_with_metadata(html, metadata_config=config)
465
+
466
+     # Safe access with defaults
467
+     title = metadata["document"].get("title", "Untitled")
468
+     headers = metadata["headers"] or []
469
+     images = metadata["images"] or []
470
+
471
+ except Exception as e:
472
+     # Handle parsing errors gracefully
473
+     print(f"Extraction error: {e}")
474
+     # Fallback to basic conversion
475
+     from html_to_markdown import convert
476
+     markdown = convert(html)
477
+ ```
478
+
479
+ #### Performance Considerations
480
+
481
+ 1. **Single-Pass Collection**: Metadata extraction happens during HTML parsing with zero overhead when disabled.
482
+ 2. **Memory Efficient**: Collections use reasonable pre-allocations (32 headers, 64 links, 16 images typical).
483
+ 3. **Selective Extraction**: Disable unused metadata types in `MetadataConfig` to reduce overhead.
484
+ 4. **Structured Data Limits**: Large JSON-LD blocks are skipped if they exceed the size limit to prevent memory exhaustion.
485
+
486
+ ```python
487
+ from html_to_markdown import MetadataConfig, convert_with_metadata
488
+
489
+ # Optimize for performance
490
+ config = MetadataConfig(
491
+     extract_headers=True,
492
+     extract_links=False,  # Skip if not needed
493
+     extract_images=False,  # Skip if not needed
494
+     extract_structured_data=False,  # Skip if not needed
495
+ )
496
+
497
+ markdown, metadata = convert_with_metadata(html, metadata_config=config)
498
+ ```
499
+
500
+ #### Differences from Basic Conversion
501
+
502
+ When `extract_metadata=True` (default in `ConversionOptions`), basic metadata is embedded in a YAML frontmatter block:
503
+
504
+ ```python
505
+ from html_to_markdown import convert, ConversionOptions
506
+
507
+ # Basic metadata as YAML frontmatter
508
+ options = ConversionOptions(extract_metadata=True)
509
+ markdown = convert(html, options)
510
+ # Output: "---\ntitle: ...\n---\n\nContent..."
511
+
512
+ # Rich metadata extraction (all metadata types)
513
+ from html_to_markdown import convert_with_metadata
514
+ markdown, full_metadata = convert_with_metadata(html)
515
+ # Returns structured data dict with headers, links, images, etc.
516
+ ```
517
+
518
+ The two approaches serve different purposes:
519
+ - `extract_metadata=True`: Embeds basic metadata in the output Markdown
520
+ - `convert_with_metadata()`: Returns structured metadata for programmatic access
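When the frontmatter form is used, downstream code may need to separate it from the body. A minimal sketch, assuming the output starts with a `---` line, closes the block with another `---` line (as in the output comment above), and contains only simple `key: value` pairs. It is not a YAML parser, and `split_frontmatter` is an illustrative name.

```python
def split_frontmatter(markdown):
    """Split a leading '---' block of 'key: value' lines from the Markdown body."""
    if not markdown.startswith("---\n"):
        return {}, markdown
    header, sep, body = markdown[4:].partition("\n---\n")
    if not sep:  # no closing delimiter: treat everything as body
        return {}, markdown
    fields = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            fields[key.strip()] = value.strip()
    return fields, body.lstrip("\n")


fields, body = split_frontmatter("---\ntitle: Example Article\n---\n\n# Welcome\n")
print(fields, body)
```

For anything beyond flat string fields, `convert_with_metadata()` is the more reliable route, since it returns the values already structured.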
521
+
522
+ ### hOCR (HTML OCR) Support
523
+
524
+ ```python
525
+ from html_to_markdown import ConversionOptions, convert
526
+
527
+ # Default: emit structured Markdown directly
528
+ markdown = convert(hocr_html)
529
+
530
+ # hOCR documents are detected automatically; tables are reconstructed without extra configuration.
531
+ markdown = convert(hocr_html)
532
+ ```
533
+
534
+ ## CLI (same engine)
535
+
536
+ ```bash
537
+ pipx install html-to-markdown # or: pip install html-to-markdown
538
+
539
+ html-to-markdown page.html > page.md
540
+ cat page.html | html-to-markdown --heading-style atx > page.md
541
+ ```
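The same commands can be driven from Python with `subprocess`. The sketch below uses only the invocation forms shown above (stdin input and `--heading-style`); it assumes the `html-to-markdown` executable is on `PATH`, and `build_command` / `convert_via_cli` are hypothetical helpers.

```python
import subprocess
from typing import List, Optional


def build_command(heading_style: Optional[str] = None) -> List[str]:
    """Build the argv for the CLI, which reads HTML from stdin."""
    cmd = ["html-to-markdown"]
    if heading_style:
        cmd += ["--heading-style", heading_style]
    return cmd


def convert_via_cli(html: str, heading_style: Optional[str] = None) -> str:
    """Pipe HTML to the CLI and return the Markdown it prints."""
    result = subprocess.run(
        build_command(heading_style),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(build_command("atx"))
```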
542
+
543
+ ## API Surface
544
+
545
+ ### `ConversionOptions`
546
+
547
+ Key fields (see docstring for full matrix):
548
+
549
+ - `heading_style`: `"underlined" | "atx" | "atx_closed"`
550
+ - `list_indent_width`: spaces per indent level (default 2)
551
+ - `bullets`: cycle of bullet characters (`"*+-"`)
552
+ - `strong_em_symbol`: `"*"` or `"_"`
553
+ - `code_language`: default fenced code block language
554
+ - `wrap`, `wrap_width`: wrap Markdown output
555
+ - `strip_tags`: remove specific HTML tags
556
+ - `preprocessing`: `PreprocessingOptions`
557
+ - `encoding`: input character encoding (informational)
558
+
559
+ ### `PreprocessingOptions`
560
+
561
+ - `enabled`: enable HTML sanitisation (default: `True` since v2.4.2 for robust malformed HTML handling)
562
+ - `preset`: `"minimal" | "standard" | "aggressive"` (default: `"standard"`)
563
+ - `remove_navigation`: remove navigation elements (default: `True`)
564
+ - `remove_forms`: remove form elements (default: `True`)
565
+
566
+ **Note:** As of v2.4.2, preprocessing is enabled by default to ensure robust handling of malformed HTML (e.g., bare angle brackets like `1<2` in content). Set `enabled=False` if you need minimal preprocessing.
567
+
568
+ ### `InlineImageConfig`
569
+
570
+ - `max_decoded_size_bytes`: reject larger payloads
571
+ - `filename_prefix`: generated name prefix (`embedded_image` default)
572
+ - `capture_svg`: collect inline `<svg>` (default `True`)
573
+ - `infer_dimensions`: decode raster images to obtain dimensions (default `False`)
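To get a feel for what a decoded-size limit means for a `data:` URI, the decoded length can be computed with the standard library. This illustrates the idea only; it is not the library's implementation, and `decoded_size` / `within_limit` are hypothetical names.

```python
import base64


def decoded_size(data_uri):
    """Return the number of bytes in a base64 data URI's payload."""
    header, _, payload = data_uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled here")
    return len(base64.b64decode(payload))


def within_limit(data_uri, max_decoded_size_bytes):
    """True when the decoded payload fits inside the limit."""
    return decoded_size(data_uri) <= max_decoded_size_bytes


# A 64-byte dummy payload stands in for real image data.
uri = "data:image/png;base64," + base64.b64encode(b"\x89PNG" + b"\x00" * 60).decode()
print(decoded_size(uri), within_limit(uri, 1024), within_limit(uri, 32))
```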
574
+
575
+ ## Performance: V2 vs V1 Compatibility Layer
576
+
577
+ ### ⚠️ Important: Always Use V2 API
578
+
579
+ The v2 API (`convert()`) is **strongly recommended** for all code. The v1 compatibility layer adds significant overhead and should only be used for gradual migration:
580
+
581
+ ```python
582
+ # ✅ RECOMMENDED - V2 Direct API (Fast)
583
+ from html_to_markdown import convert, ConversionOptions
584
+
585
+ markdown = convert(html) # Simple conversion - FAST
586
+ markdown = convert(html, ConversionOptions(heading_style="atx")) # With options - FAST
587
+
588
+ # ❌ AVOID - V1 Compatibility Layer (Slow)
589
+ from html_to_markdown import convert_to_markdown
590
+
591
+ markdown = convert_to_markdown(html, heading_style="atx") # Adds 77% overhead
592
+ ```
593
+
594
+ ### Performance Comparison
595
+
596
+ Benchmarked on Apple M4 with 25-paragraph HTML document:
597
+
598
+ | API | ops/sec | Relative Performance | Recommendation |
599
+ | ------------------------ | ---------------- | -------------------- | ------------------- |
600
+ | **V2 API** (`convert()`) | **129,822** | baseline | ✅ **Use this** |
601
+ | **V1 Compat Layer** | **67,673** | **77% slower** | ⚠️ Migration only |
602
+ | **CLI** | **150-210 MB/s** | Fastest | ✅ Batch processing |
603
+
604
+ The v1 compatibility layer creates extra Python objects and performs additional conversions, significantly impacting performance.
605
+
606
+ ### When to Use Each
607
+
608
+ - **V2 API (`convert()`)**: All new code, production systems, performance-critical applications ← **Use this**
609
+ - **V1 Compat (`convert_to_markdown()`)**: Only for gradual migration from legacy codebases
610
+ - **CLI (`html-to-markdown`)**: Batch processing, shell scripts, maximum throughput
611
+
612
+ ## v1 Compatibility
613
+
614
+ A compatibility layer is provided to ease migration from v1.x:
615
+
616
+ - **Compat shim**: `html_to_markdown.v1_compat` exposes `convert_to_markdown`, `convert_to_markdown_stream`, and `markdownify`. Keyword mappings are listed in the [changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md#v200).
617
+ - **⚠️ Performance warning**: These compatibility functions add 77% overhead. Migrate to v2 API as soon as possible.
618
+ - **CLI**: The Rust CLI replaces the old Python script. New flags are documented via `html-to-markdown --help`.
619
+ - **Removed options**: `code_language_callback`, `strip`, and streaming APIs were removed; use `ConversionOptions`, `PreprocessingOptions`, and the inline-image helpers instead.
620
+
621
+ ## Links
622
+
623
+ - GitHub: [https://github.com/Goldziher/html-to-markdown](https://github.com/Goldziher/html-to-markdown)
624
+ - Discord: [https://discord.gg/pXxagNK2zN](https://discord.gg/pXxagNK2zN)
625
+ - Kreuzberg ecosystem: [https://kreuzberg.dev](https://kreuzberg.dev)
626
+
627
+ ## License
628
+
629
+ MIT License – see [LICENSE](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE).
630
+
631
+ ## Support
632
+
633
+ If you find this library useful, consider [sponsoring the project](https://github.com/sponsors/Goldziher).
634
+
@@ -0,0 +1,17 @@
1
+ html_to_markdown-2.14.2.data/scripts/html-to-markdown,sha256=hniSeml124eJXvYbQsC3GLsUlS-TX93fFYgLtxogCn8,6263856
2
+ html_to_markdown-2.14.2.dist-info/RECORD,,
3
+ html_to_markdown-2.14.2.dist-info/WHEEL,sha256=WvP__evn8XoyZeDO32cKBm5BQTOFbdB1WoQ-d3AzYdw,132
4
+ html_to_markdown-2.14.2.dist-info/METADATA,sha256=KceoGs__CWCDYonJi8jTkXIyTSczsxeQP6IT-niDOPE,23246
5
+ html_to_markdown-2.14.2.dist-info/licenses/LICENSE,sha256=oQvPC-0UWvfg0WaeUBe11OJMtX60An-TW1ev_oaAA0k,1086
6
+ html_to_markdown/options.py,sha256=vImRfeHAeyAy0Lnt6cTPHGbj7mTdw8AEUgo19u7MAA0,5080
7
+ html_to_markdown/_html_to_markdown.pyi,sha256=IPD6CegtaanBsKTmK30v4nvWZ5HUlCajS6jkiOsoVj8,5875
8
+ html_to_markdown/_html_to_markdown.abi3.so,sha256=aL3Cy8W9rUaEyom4OTw9D1NQJ01ELgSzXSr-aDjVPc4,3503168
9
+ html_to_markdown/__init__.py,sha256=heUlsM_dzRMTxzDPQtvEHO-9g85GtWXyLucGfkk_wp0,1692
10
+ html_to_markdown/api.py,sha256=zXXoFpdDbMIQXl65NT7BjjYu_1xwEM7VNGNUK2zQNfQ,6934
11
+ html_to_markdown/v1_compat.py,sha256=kn5GYvgn3dTW_Zksu9PzWVk-5CYhvXxsqAeyTdDYZSY,8001
12
+ html_to_markdown/cli.py,sha256=Rn-s3FZPea1jgCJtDzH_TFvOEiA_uZFVfgjhr6xyL_g,64
13
+ html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
14
+ html_to_markdown/exceptions.py,sha256=aTASOzbywgfqOYjlw18ZkOWSxKff4EbUbmMua_73TGA,2370
15
+ html_to_markdown/cli_proxy.py,sha256=HPYKH5Mf5OUvkbEQISJvAkxrbjWKxE5GokA44HoQ6z8,3858
16
+ html_to_markdown/__main__.py,sha256=3Ic_EbOt2h6W88q084pkz5IKU6iY5z_woBygH6u9aw0,327
17
+ html_to_markdown/bin/html-to-markdown,sha256=hniSeml124eJXvYbQsC3GLsUlS-TX93fFYgLtxogCn8,6263856
@@ -1,5 +1,5 @@
1
1
  Wheel-Version: 1.0
2
- Generator: maturin (1.9.6)
2
+ Generator: maturin (1.10.2)
3
3
  Root-Is-Purelib: false
4
4
  Tag: cp310-abi3-macosx_11_0_arm64
5
5
  Generator: delocate 0.13.0