PyPI - html-to-markdown - Versions diffs - 1.11.0__tar.gz → 1.12.1__tar.gz - Mend

html-to-markdown 1.11.0__tar.gz → 1.12.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (22)

{html_to_markdown-1.11.0 → html_to_markdown-1.12.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.11.0
+Version: 1.12.1
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -320,6 +320,132 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 Custom converters take precedence over built-in converters and can be used alongside other configuration options.
+### Streaming API
+For processing large documents with memory constraints, use the streaming API:
+```python
+from html_to_markdown import convert_to_markdown_stream
+# Process large HTML in chunks
+with open("large_document.html", "r") as f:
+    html_content = f.read()
+# Returns a generator that yields markdown chunks
+for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+    print(chunk, end="")
+```
+With progress tracking:
+```python
+def show_progress(processed: int, total: int):
+    if total > 0:
+        percent = (processed / total) * 100
+        print(f"\rProgress: {percent:.1f}%", end="")
+# Stream with progress callback
+markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+```
+#### When to Use Streaming vs Regular Processing
+Based on comprehensive performance analysis, here are our recommendations:
+**📄 Use Regular Processing When:**
+- Files < 100KB (simplicity preferred)
+- Simple scripts and one-off conversions
+- Memory is not a concern
+- You want the simplest API
+**🌊 Use Streaming Processing When:**
+- Files > 100KB (memory efficiency)
+- Processing many files in batch
+- Memory is constrained
+- You need progress reporting
+- You want to process results incrementally
+- Running in production environments
+**📋 Specific Recommendations by File Size:**
+| File Size  | Recommendation                                  | Reason                                 |
+| ---------- | ----------------------------------------------- | -------------------------------------- |
+| < 50KB     | Regular (simplicity) or Streaming (3-5% faster) | Either works well                      |
+| 50KB-100KB | Either (streaming slightly preferred)           | Minimal difference                     |
+| 100KB-1MB  | Streaming preferred                             | Better performance + memory efficiency |
+| > 1MB      | Streaming strongly recommended                  | Significant memory advantages          |
+**🔧 Configuration Recommendations:**
+- **Default chunk_size: 2048 bytes** (optimal performance balance)
+- **For very large files (>10MB)**: Consider `chunk_size=4096`
+- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`
+**📈 Performance Benefits:**
+Streaming provides consistent **3-5% performance improvement** across all file sizes:
+- **Streaming throughput**: ~0.47-0.48 MB/s
+- **Regular throughput**: ~0.44-0.47 MB/s
+- **Memory usage**: Streaming uses less peak memory for large files
+- **Latency**: Streaming allows processing results before completion
+### Preprocessing API
+The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+```python
+from html_to_markdown import preprocess_html, create_preprocessor
+# Direct preprocessing with custom options
+cleaned_html = preprocess_html(
+    raw_html,
+    remove_navigation=True,
+    remove_forms=True,
+    remove_scripts=True,
+    remove_styles=True,
+    remove_comments=True,
+    preserve_semantic_structure=True,
+    preserve_tables=True,
+    preserve_media=True,
+)
+markdown = convert_to_markdown(cleaned_html)
+# Create a preprocessor configuration from presets
+config = create_preprocessor(preset="aggressive", preserve_tables=False)  # or "minimal", "standard"  # Override preset settings
+markdown = convert_to_markdown(html, **config)
+```
+### Exception Handling
+The library provides specific exception classes for better error handling:
+````python
+from html_to_markdown import (
+    convert_to_markdown,
+    HtmlToMarkdownError,
+    EmptyHtmlError,
+    InvalidParserError,
+    ConflictingOptionsError,
+    MissingDependencyError
+)
+try:
+    markdown = convert_to_markdown(html, parser='lxml')
+except MissingDependencyError:
+    # lxml not installed
+    markdown = convert_to_markdown(html, parser='html.parser')
+except EmptyHtmlError:
+    print("No HTML content to convert")
+except InvalidParserError as e:
+    print(f"Parser error: {e}")
+except ConflictingOptionsError as e:
+    print(f"Conflicting options: {e}")
+except HtmlToMarkdownError as e:
+    print(f"Conversion error: {e}")
 ## CLI Usage
 Convert HTML files directly from the command line with full access to all API options:
@@ -340,7 +466,7 @@ html_to_markdown \
   --preprocess-html \
   --preprocessing-preset aggressive \
   input.html > output.md
-```
+````
 ### Key CLI Options
@@ -353,6 +479,20 @@ html_to_markdown \
 --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
 --heading-style {atx,atx_closed,underlined} # Header style
 --no-extract-metadata               # Disable metadata extraction
+--br-in-tables                      # Use <br> tags for line breaks in table cells
+--source-encoding ENCODING          # Override auto-detected encoding (rarely needed)
+```
+**File Encoding:**
+The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+```shell
+# Override auto-detection for Latin-1 encoded file
+html_to_markdown --source-encoding latin-1 input.html > output.md
+# Force UTF-16 encoding when auto-detection fails
+html_to_markdown --source-encoding utf-16 input.html > output.md
 ```
 **All Available Options:**
@@ -393,6 +533,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 ### Parser Options

{html_to_markdown-1.11.0 → html_to_markdown-1.12.1}/README.md RENAMED

@@ -282,6 +282,132 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 Custom converters take precedence over built-in converters and can be used alongside other configuration options.
+### Streaming API
+For processing large documents with memory constraints, use the streaming API:
+```python
+from html_to_markdown import convert_to_markdown_stream
+# Process large HTML in chunks
+with open("large_document.html", "r") as f:
+    html_content = f.read()
+# Returns a generator that yields markdown chunks
+for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+    print(chunk, end="")
+```
+With progress tracking:
+```python
+def show_progress(processed: int, total: int):
+    if total > 0:
+        percent = (processed / total) * 100
+        print(f"\rProgress: {percent:.1f}%", end="")
+# Stream with progress callback
+markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+```
+#### When to Use Streaming vs Regular Processing
+Based on comprehensive performance analysis, here are our recommendations:
+**📄 Use Regular Processing When:**
+- Files < 100KB (simplicity preferred)
+- Simple scripts and one-off conversions
+- Memory is not a concern
+- You want the simplest API
+**🌊 Use Streaming Processing When:**
+- Files > 100KB (memory efficiency)
+- Processing many files in batch
+- Memory is constrained
+- You need progress reporting
+- You want to process results incrementally
+- Running in production environments
+**📋 Specific Recommendations by File Size:**
+| File Size  | Recommendation                                  | Reason                                 |
+| ---------- | ----------------------------------------------- | -------------------------------------- |
+| < 50KB     | Regular (simplicity) or Streaming (3-5% faster) | Either works well                      |
+| 50KB-100KB | Either (streaming slightly preferred)           | Minimal difference                     |
+| 100KB-1MB  | Streaming preferred                             | Better performance + memory efficiency |
+| > 1MB      | Streaming strongly recommended                  | Significant memory advantages          |
+**🔧 Configuration Recommendations:**
+- **Default chunk_size: 2048 bytes** (optimal performance balance)
+- **For very large files (>10MB)**: Consider `chunk_size=4096`
+- **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`
+**📈 Performance Benefits:**
+Streaming provides consistent **3-5% performance improvement** across all file sizes:
+- **Streaming throughput**: ~0.47-0.48 MB/s
+- **Regular throughput**: ~0.44-0.47 MB/s
+- **Memory usage**: Streaming uses less peak memory for large files
+- **Latency**: Streaming allows processing results before completion
+### Preprocessing API
+The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+```python
+from html_to_markdown import preprocess_html, create_preprocessor
+# Direct preprocessing with custom options
+cleaned_html = preprocess_html(
+    raw_html,
+    remove_navigation=True,
+    remove_forms=True,
+    remove_scripts=True,
+    remove_styles=True,
+    remove_comments=True,
+    preserve_semantic_structure=True,
+    preserve_tables=True,
+    preserve_media=True,
+)
+markdown = convert_to_markdown(cleaned_html)
+# Create a preprocessor configuration from presets
+config = create_preprocessor(preset="aggressive", preserve_tables=False)  # or "minimal", "standard"  # Override preset settings
+markdown = convert_to_markdown(html, **config)
+```
+### Exception Handling
+The library provides specific exception classes for better error handling:
+````python
+from html_to_markdown import (
+    convert_to_markdown,
+    HtmlToMarkdownError,
+    EmptyHtmlError,
+    InvalidParserError,
+    ConflictingOptionsError,
+    MissingDependencyError
+)
+try:
+    markdown = convert_to_markdown(html, parser='lxml')
+except MissingDependencyError:
+    # lxml not installed
+    markdown = convert_to_markdown(html, parser='html.parser')
+except EmptyHtmlError:
+    print("No HTML content to convert")
+except InvalidParserError as e:
+    print(f"Parser error: {e}")
+except ConflictingOptionsError as e:
+    print(f"Conflicting options: {e}")
+except HtmlToMarkdownError as e:
+    print(f"Conversion error: {e}")
 ## CLI Usage
 Convert HTML files directly from the command line with full access to all API options:
@@ -302,7 +428,7 @@ html_to_markdown \
   --preprocess-html \
   --preprocessing-preset aggressive \
   input.html > output.md
-```
+````
 ### Key CLI Options
@@ -315,6 +441,20 @@ html_to_markdown \
 --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
 --heading-style {atx,atx_closed,underlined} # Header style
 --no-extract-metadata               # Disable metadata extraction
+--br-in-tables                      # Use <br> tags for line breaks in table cells
+--source-encoding ENCODING          # Override auto-detected encoding (rarely needed)
+```
+**File Encoding:**
+The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+```shell
+# Override auto-detection for Latin-1 encoded file
+html_to_markdown --source-encoding latin-1 input.html > output.md
+# Force UTF-16 encoding when auto-detection fails
+html_to_markdown --source-encoding utf-16 input.html > output.md
 ```
 **All Available Options:**
@@ -355,6 +495,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 ### Parser Options
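
The README hunk above describes chunked processing with a `(processed, total)` progress callback. A minimal, library-independent sketch of that pattern follows. `iter_chunks` is a hypothetical helper written for illustration, not part of html-to-markdown; it only slices a string and does no HTML conversion.

```python
from typing import Callable, Iterator, Optional


def iter_chunks(
    text: str,
    chunk_size: int = 2048,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Iterator[str]:
    """Yield successive slices of `text`, reporting progress for each one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(text)
    for start in range(0, total, chunk_size):
        chunk = text[start : start + chunk_size]
        if progress_callback is not None:
            # Report how many characters have been handed out so far.
            progress_callback(min(start + chunk_size, total), total)
        yield chunk


# Usage: record every progress report while consuming the generator.
calls = []
pieces = list(
    iter_chunks("a" * 5000, chunk_size=2048, progress_callback=lambda p, t: calls.append((p, t)))
)
```

Because the helper is a generator, nothing runs until the caller iterates, which is the property that lets a consumer handle early chunks before the whole input is processed.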

{html_to_markdown-1.11.0 → html_to_markdown-1.12.1}/html_to_markdown/cli.py RENAMED

@@ -1,5 +1,6 @@
 import sys
 from argparse import ArgumentParser, FileType
+from pathlib import Path
 from html_to_markdown.constants import (
     ASTERISK,
@@ -13,6 +14,7 @@ from html_to_markdown.constants import (
     WHITESPACE_NORMALIZED,
     WHITESPACE_STRICT,
 )
+from html_to_markdown.exceptions import InvalidEncodingError
 from html_to_markdown.processing import convert_to_markdown
@@ -131,6 +133,12 @@ def main(argv: list[str]) -> str:
         help="Parent tags where images remain inline (not converted to alt-text).",
     )
+    parser.add_argument(
+        "--br-in-tables",
+        action="store_true",
+        help="Use <br> tags for line breaks in table cells instead of spaces.",
+    )
     parser.add_argument("-w", "--wrap", action="store_true", help="Enable text wrapping at --wrap-width characters.")
     parser.add_argument(
@@ -235,10 +243,18 @@ def main(argv: list[str]) -> str:
         help="Keep navigation elements when preprocessing (normally removed).",
     )
+    parser.add_argument(
+        "--source-encoding",
+        type=str,
+        default=None,
+        help="Source file encoding (e.g. 'utf-8', 'latin-1'). Defaults to system default.",
+    )
     args = parser.parse_args(argv)
     base_args = {
         "autolinks": args.autolinks,
+        "br_in_tables": args.br_in_tables,
         "bullets": args.bullets,
         "code_language": args.code_language,
         "convert": args.convert,
@@ -278,7 +294,7 @@ def main(argv: list[str]) -> str:
         if args.show_progress:
             def progress_callback(processed: int, total: int) -> None:
-                if total > 0:
+                if total > 0:  # pragma: no cover
                     percent = (processed / total) * 100
                     sys.stderr.write(f"\rProgress: {percent:.1f}% ({processed}/{total} bytes)")
@@ -286,4 +302,14 @@ def main(argv: list[str]) -> str:
             base_args["progress_callback"] = progress_callback
-    return convert_to_markdown(args.html.read(), **base_args)
+    if args.source_encoding and args.html.name != "<stdin>":
+        args.html.close()
+        try:
+            with Path(args.html.name).open(encoding=args.source_encoding) as f:
+                html_content = f.read()
+        except LookupError as e:
+            raise InvalidEncodingError(args.source_encoding) from e
+    else:
+        html_content = args.html.read()
+    return convert_to_markdown(html_content, **base_args)
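
The last hunk re-opens the input file with the requested encoding and turns a `LookupError` (raised for an unknown codec name) into `InvalidEncodingError`. The standalone sketch below shows that pattern under stated assumptions: `InvalidEncodingError` here is a local stand-in class, since the package's own definition is not shown in this diff, and `read_html` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path
from typing import Optional


class InvalidEncodingError(ValueError):
    """Local stand-in; the real exception's base class and message are not shown in the diff."""

    def __init__(self, encoding: str) -> None:
        super().__init__(f"Invalid encoding: {encoding}")
        self.encoding = encoding


def read_html(path: Path, source_encoding: Optional[str] = None) -> str:
    """Read `path` as text, using `source_encoding` when one is given."""
    try:
        # encoding=None lets Python fall back to its default text encoding.
        with path.open(encoding=source_encoding) as f:
            return f.read()
    except LookupError as e:
        # open() raises LookupError when the codec name is unknown.
        raise InvalidEncodingError(source_encoding or "") from e


# Usage: a Latin-1 file read with an explicit override, then a bad codec name.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_bytes("<p>café</p>".encode("latin-1"))
    text = read_html(sample, "latin-1")
    try:
        read_html(sample, "not-a-codec")
        bad_encoding = None
    except InvalidEncodingError as err:
        bad_encoding = err.encoding
```

Note that in the hunk itself the override applies only when the input is a named file; input from `<stdin>` is read as-is.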
