html-to-markdown 1.11.0__py3-none-any.whl → 1.12.1__py3-none-any.whl
- html_to_markdown/cli.py +28 -2
- html_to_markdown/converters.py +208 -130
- html_to_markdown/exceptions.py +5 -0
- html_to_markdown/preprocessor.py +96 -86
- html_to_markdown/processing.py +63 -48
- html_to_markdown/utils.py +1 -3
- html_to_markdown/whitespace.py +23 -33
- {html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/METADATA +143 -2
- html_to_markdown-1.12.1.dist-info/RECORD +17 -0
- html_to_markdown-1.11.0.dist-info/RECORD +0 -17
- {html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/WHEEL +0 -0
- {html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/entry_points.txt +0 -0
- {html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/licenses/LICENSE +0 -0
- {html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/top_level.txt +0 -0
{html_to_markdown-1.11.0.dist-info → html_to_markdown-1.12.1.dist-info}/METADATA
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.
+Version: 1.12.1
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -320,6 +320,132 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 
 Custom converters take precedence over built-in converters and can be used alongside other configuration options.
 
+### Streaming API
+
+For processing large documents with memory constraints, use the streaming API:
+
+```python
+from html_to_markdown import convert_to_markdown_stream
+
+# Process large HTML in chunks
+with open("large_document.html", "r") as f:
+    html_content = f.read()
+
+# Returns a generator that yields markdown chunks
+for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+    print(chunk, end="")
+```
+
+With progress tracking:
+
+```python
+def show_progress(processed: int, total: int):
+    if total > 0:
+        percent = (processed / total) * 100
+        print(f"\rProgress: {percent:.1f}%", end="")
+
+# Stream with progress callback
+markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+```
+
+#### When to Use Streaming vs Regular Processing
+
+Based on comprehensive performance analysis, here are our recommendations:
+
+**📄 Use Regular Processing When:**
+
+- Files < 100KB (simplicity preferred)
+- Simple scripts and one-off conversions
+- Memory is not a concern
+- You want the simplest API
+
+**🌊 Use Streaming Processing When:**
+
+- Files > 100KB (memory efficiency)
+- Processing many files in batch
+- Memory is constrained
+- You need progress reporting
+- You want to process results incrementally
+- Running in production environments
+
+**📋 Specific Recommendations by File Size:**
+
+| File Size  | Recommendation                                  | Reason                                 |
+| ---------- | ----------------------------------------------- | -------------------------------------- |
+| < 50KB     | Regular (simplicity) or Streaming (3-5% faster) | Either works well                      |
+| 50KB-100KB | Either (streaming slightly preferred)           | Minimal difference                     |
+| 100KB-1MB  | Streaming preferred                             | Better performance + memory efficiency |
+| > 1MB      | Streaming strongly recommended                  | Significant memory advantages          |
+
+**🔧 Configuration Recommendations:**
+
+- **Default chunk_size: 2048 bytes** (optimal performance balance)
+- **For very large files (>10MB)**: Consider `chunk_size=4096`
+- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`
+
+**📈 Performance Benefits:**
+
+Streaming provides consistent **3-5% performance improvement** across all file sizes:
+
+- **Streaming throughput**: ~0.47-0.48 MB/s
+- **Regular throughput**: ~0.44-0.47 MB/s
+- **Memory usage**: Streaming uses less peak memory for large files
+- **Latency**: Streaming allows processing results before completion
+
+### Preprocessing API
+
+The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+
+```python
+from html_to_markdown import preprocess_html, create_preprocessor
+
+# Direct preprocessing with custom options
+cleaned_html = preprocess_html(
+    raw_html,
+    remove_navigation=True,
+    remove_forms=True,
+    remove_scripts=True,
+    remove_styles=True,
+    remove_comments=True,
+    preserve_semantic_structure=True,
+    preserve_tables=True,
+    preserve_media=True,
+)
+markdown = convert_to_markdown(cleaned_html)
+
+# Create a preprocessor configuration from presets
+config = create_preprocessor(preset="aggressive", preserve_tables=False) # or "minimal", "standard" # Override preset settings
+markdown = convert_to_markdown(html, **config)
+```
+
+### Exception Handling
+
+The library provides specific exception classes for better error handling:
+
+````python
+from html_to_markdown import (
+    convert_to_markdown,
+    HtmlToMarkdownError,
+    EmptyHtmlError,
+    InvalidParserError,
+    ConflictingOptionsError,
+    MissingDependencyError
+)
+
+try:
+    markdown = convert_to_markdown(html, parser='lxml')
+except MissingDependencyError:
+    # lxml not installed
+    markdown = convert_to_markdown(html, parser='html.parser')
+except EmptyHtmlError:
+    print("No HTML content to convert")
+except InvalidParserError as e:
+    print(f"Parser error: {e}")
+except ConflictingOptionsError as e:
+    print(f"Conflicting options: {e}")
+except HtmlToMarkdownError as e:
+    print(f"Conversion error: {e}")
+
 ## CLI Usage
 
 Convert HTML files directly from the command line with full access to all API options:
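The size table and the `chunk_size` bullets in the README text added by this hunk amount to a small decision rule. A minimal sketch of that rule follows. `recommend_processing` is a hypothetical helper, not part of html-to-markdown, and both the 1 KB = 1024 bytes convention and the treatment of a file of exactly 100KB are assumptions the README does not settle.

```python
KB = 1024  # assumption: the README does not say whether KB means 1000 or 1024 bytes
MB = 1024 * KB


def recommend_processing(size_bytes: int, memory_constrained: bool = False) -> dict:
    """Hypothetical helper that encodes the README's recommendations.

    Returns keyword names matching those used in the README example
    (stream_processing, chunk_size).
    """
    # Below 100KB the README calls either mode acceptable; this sketch
    # picks regular processing for simplicity unless memory is tight.
    use_streaming = memory_constrained or size_bytes >= 100 * KB

    if memory_constrained:
        chunk_size = 1024  # "smaller chunks" bullet
    elif size_bytes > 10 * MB:
        chunk_size = 4096  # "very large files (>10MB)" bullet
    else:
        chunk_size = 2048  # documented default

    return {"stream_processing": use_streaming, "chunk_size": chunk_size}


if __name__ == "__main__":
    for size in (10 * KB, 500 * KB, 20 * MB):
        print(size, recommend_processing(size))
```

The returned keys mirror the keyword arguments shown in the README's progress-tracking example, so the dictionary could plausibly be unpacked into `convert_to_markdown(html, **options)`. That usage is untested here.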
@@ -340,7 +466,7 @@ html_to_markdown \
 --preprocess-html \
 --preprocessing-preset aggressive \
 input.html > output.md
-
+````
 
 ### Key CLI Options
 
@@ -353,6 +479,20 @@ html_to_markdown \
 --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
 --heading-style {atx,atx_closed,underlined} # Header style
 --no-extract-metadata # Disable metadata extraction
+--br-in-tables # Use <br> tags for line breaks in table cells
+--source-encoding ENCODING # Override auto-detected encoding (rarely needed)
+```
+
+**File Encoding:**
+
+The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+
+```shell
+# Override auto-detection for Latin-1 encoded file
+html_to_markdown --source-encoding latin-1 input.html > output.md
+
+# Force UTF-16 encoding when auto-detection fails
+html_to_markdown --source-encoding utf-16 input.html > output.md
 ```
 
 **All Available Options:**
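The `--source-encoding` flag added in this hunk applies to the CLI. When calling the library from Python, the README examples pass an already-decoded string, so the equivalent step is to decode the file yourself before conversion. The sketch below shows one way to do that with the standard library. `read_html_text` is a hypothetical helper, and its UTF-8-then-Latin-1 fallback is an assumption for illustration; it does not reproduce the CLI's detection logic.

```python
from pathlib import Path
from typing import Optional


def read_html_text(path: str, source_encoding: Optional[str] = None) -> str:
    """Hypothetical helper: decode an HTML file, optionally forcing an encoding."""
    raw = Path(path).read_bytes()

    if source_encoding is not None:
        # Explicit override, analogous in spirit to --source-encoding.
        return raw.decode(source_encoding)

    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises,
        # but it can yield wrong characters for other encodings.
        return raw.decode("latin-1")
```

The resulting string can then be handed to `convert_to_markdown` as in the README examples.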
@@ -393,6 +533,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 
 ### Parser Options
 
html_to_markdown-1.12.1.dist-info/RECORD
@@ -0,0 +1,17 @@
+html_to_markdown/__init__.py,sha256=TzZzhZDJHeXW_3B9zceYehz2zlttqdLsDr5un8stZLM,653
+html_to_markdown/__main__.py,sha256=E9d62nVceR_5TUWgVu5L5CnSZxKcnT_7a6ScWZUGE-s,292
+html_to_markdown/cli.py,sha256=qB8-1jqJPW-YrOmlyOdJnLM6DpKSUIA3iyn1SJaJgKg,9418
+html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
+html_to_markdown/converters.py,sha256=fdFT9WwDd3hGpYn0jVbPDcB8OmLPvQUmanbM7aQmzms,35821
+html_to_markdown/exceptions.py,sha256=ytUOIL0D8r0Jd59RzUPqzmk73i-Mg63zDQYo6S6DBg4,1389
+html_to_markdown/preprocessor.py,sha256=otnTOhoivJkxaip1Lb9xNMl8q-x9aGFXSYkSrxsTW8g,9591
+html_to_markdown/processing.py,sha256=xchoTwKZHQW8ejjwLAiMb_AY6XcgPQ6zhLShlduYVuY,35213
+html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+html_to_markdown/utils.py,sha256=s3A4ET_XyKC-WxzJtH4W0S7cIBGF5fTYIf4JJrqTX8Q,1069
+html_to_markdown/whitespace.py,sha256=rl3eEwqfMpNWx4FBmbkZ1RxO_Od45p3EZ_7UgKcDAtg,7710
+html_to_markdown-1.12.1.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
+html_to_markdown-1.12.1.dist-info/METADATA,sha256=5PoGUeYuGtGmh5q_XwxKlSzq7572CUw3yAVBNmVxDTc,22694
+html_to_markdown-1.12.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+html_to_markdown-1.12.1.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
+html_to_markdown-1.12.1.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
+html_to_markdown-1.12.1.dist-info/RECORD,,
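Each RECORD row has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, as described in the Python packaging specification for recording installed files. The sketch below recomputes such an entry with the standard library. `record_hash` is an illustrative name, not something shipped in the wheel.

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the RECORD-style hash field for the given file contents."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 without the trailing '=' padding.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


if __name__ == "__main__":
    # py.typed is listed with size 0, so its entry is the hash of empty input.
    print(record_hash(b""))
```

For empty input this yields `sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU`, which matches the `html_to_markdown/py.typed` row in both RECORD listings.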
html_to_markdown-1.11.0.dist-info/RECORD
@@ -1,17 +0,0 @@
-html_to_markdown/__init__.py,sha256=TzZzhZDJHeXW_3B9zceYehz2zlttqdLsDr5un8stZLM,653
-html_to_markdown/__main__.py,sha256=E9d62nVceR_5TUWgVu5L5CnSZxKcnT_7a6ScWZUGE-s,292
-html_to_markdown/cli.py,sha256=ilnrJN2XMhPDQ4UkkG4cjLXTvglu_ZJj-bBsohVF3fw,8541
-html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
-html_to_markdown/converters.py,sha256=CbChkRIlOPe0d1MK5-txDE56IG4Ea_dcCV6KRCTjeKY,32497
-html_to_markdown/exceptions.py,sha256=YjfwVCWE_oZakr9iy0E-_aPSYHNaocJZgWeQ9Enty7Q,1212
-html_to_markdown/preprocessor.py,sha256=acmuJJvx1RaXE3c0F_aWsartQE0cEpa3AOnJYGnPzqw,9708
-html_to_markdown/processing.py,sha256=sOIIFNyRkRYAH8Q4ehrh66RY71bkvttSuqzXYsMC5JM,34334
-html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-html_to_markdown/utils.py,sha256=4Vzk2cCjxN0LAZ1DXQCufYtxE7a6739TYgPbje-VM_E,1086
-html_to_markdown/whitespace.py,sha256=EJ0gEsfLB_wZAk5d5qP4UPhPg0pJJ8LZLRRr_QoL01o,8186
-html_to_markdown-1.11.0.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
-html_to_markdown-1.11.0.dist-info/METADATA,sha256=Cej6bnqT9JVFzACZvND6Z5-kD0QoabiLi46opAaC11U,17814
-html_to_markdown-1.11.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
-html_to_markdown-1.11.0.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
-html_to_markdown-1.11.0.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
-html_to_markdown-1.11.0.dist-info/RECORD,,
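Comparing the two RECORD listings row by row shows which files actually changed between the releases. The following is a minimal sketch, assuming the RECORD text has already been read into strings. `parse_record` and `changed_paths` are illustrative names, and the sample rows are copied from the 1.11.0 and 1.12.1 listings in this diff.

```python
import csv
import io


def parse_record(text: str) -> dict:
    """Map each path in a wheel RECORD file to its (hash, size) pair."""
    entries = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 3:  # path, hash, size
            path, digest, size = row
            entries[path] = (digest, size)
    return entries


def changed_paths(old: dict, new: dict) -> list:
    """Paths present in both listings whose hash or size differs."""
    return sorted(p for p in old.keys() & new.keys() if old[p] != new[p])


OLD = """\
html_to_markdown/cli.py,sha256=ilnrJN2XMhPDQ4UkkG4cjLXTvglu_ZJj-bBsohVF3fw,8541
html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
"""
NEW = """\
html_to_markdown/cli.py,sha256=qB8-1jqJPW-YrOmlyOdJnLM6DpKSUIA3iyn1SJaJgKg,9418
html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
"""

if __name__ == "__main__":
    print(changed_paths(parse_record(OLD), parse_record(NEW)))
```

With these sample rows only `html_to_markdown/cli.py` is reported. Rows under the `.dist-info` directory never match across releases because the directory name embeds the version, so they would need their prefix normalized before comparison.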
Files without changes: WHEEL, entry_points.txt, licenses/LICENSE, top_level.txt