PyPI - html-to-markdown - Versions diffs - 1.11.0__tar.gz → 1.12.0__tar.gz - Mend

html-to-markdown 1.11.0__tar.gz → 1.12.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (22)

{html_to_markdown-1.11.0 → html_to_markdown-1.12.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.11.0
+Version: 1.12.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -320,6 +320,88 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 Custom converters take precedence over built-in converters and can be used alongside other configuration options.
+### Streaming API
+For processing large documents with memory constraints, use the streaming API:
+```python
+from html_to_markdown import convert_to_markdown_stream
+# Process large HTML in chunks
+with open("large_document.html", "r") as f:
+    html_content = f.read()
+# Returns a generator that yields markdown chunks
+for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+    print(chunk, end="")
+```
+With progress tracking:
+```python
+def show_progress(processed: int, total: int):
+    if total > 0:
+        percent = (processed / total) * 100
+        print(f"\rProgress: {percent:.1f}%", end="")
+# Stream with progress callback
+markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+```
+### Preprocessing API
+The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+```python
+from html_to_markdown import preprocess_html, create_preprocessor
+# Direct preprocessing with custom options
+cleaned_html = preprocess_html(
+    raw_html,
+    remove_navigation=True,
+    remove_forms=True,
+    remove_scripts=True,
+    remove_styles=True,
+    remove_comments=True,
+    preserve_semantic_structure=True,
+    preserve_tables=True,
+    preserve_media=True,
+)
+markdown = convert_to_markdown(cleaned_html)
+# Create a preprocessor configuration from presets
+config = create_preprocessor(preset="aggressive", preserve_tables=False)  # or "minimal", "standard"  # Override preset settings
+markdown = convert_to_markdown(html, **config)
+```
+### Exception Handling
+The library provides specific exception classes for better error handling:
+````python
+from html_to_markdown import (
+    convert_to_markdown,
+    HtmlToMarkdownError,
+    EmptyHtmlError,
+    InvalidParserError,
+    ConflictingOptionsError,
+    MissingDependencyError
+)
+try:
+    markdown = convert_to_markdown(html, parser='lxml')
+except MissingDependencyError:
+    # lxml not installed
+    markdown = convert_to_markdown(html, parser='html.parser')
+except EmptyHtmlError:
+    print("No HTML content to convert")
+except InvalidParserError as e:
+    print(f"Parser error: {e}")
+except ConflictingOptionsError as e:
+    print(f"Conflicting options: {e}")
+except HtmlToMarkdownError as e:
+    print(f"Conversion error: {e}")
 ## CLI Usage
 Convert HTML files directly from the command line with full access to all API options:
@@ -340,7 +422,7 @@ html_to_markdown \
   --preprocess-html \
   --preprocessing-preset aggressive \
   input.html > output.md
-```
+````
 ### Key CLI Options
@@ -353,6 +435,20 @@ html_to_markdown \
 --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
 --heading-style {atx,atx_closed,underlined} # Header style
 --no-extract-metadata               # Disable metadata extraction
+--br-in-tables                      # Use <br> tags for line breaks in table cells
+--source-encoding ENCODING          # Override auto-detected encoding (rarely needed)
+```
+**File Encoding:**
+The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+```shell
+# Override auto-detection for Latin-1 encoded file
+html_to_markdown --source-encoding latin-1 input.html > output.md
+# Force UTF-16 encoding when auto-detection fails
+html_to_markdown --source-encoding utf-16 input.html > output.md
 ```
 **All Available Options:**
@@ -393,6 +489,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 ### Parser Options

{html_to_markdown-1.11.0 → html_to_markdown-1.12.0}/README.md RENAMED

@@ -282,6 +282,88 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 Custom converters take precedence over built-in converters and can be used alongside other configuration options.
+### Streaming API
+For processing large documents with memory constraints, use the streaming API:
+```python
+from html_to_markdown import convert_to_markdown_stream
+# Process large HTML in chunks
+with open("large_document.html", "r") as f:
+    html_content = f.read()
+# Returns a generator that yields markdown chunks
+for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+    print(chunk, end="")
+```
+With progress tracking:
+```python
+def show_progress(processed: int, total: int):
+    if total > 0:
+        percent = (processed / total) * 100
+        print(f"\rProgress: {percent:.1f}%", end="")
+# Stream with progress callback
+markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+```
+### Preprocessing API
+The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+```python
+from html_to_markdown import preprocess_html, create_preprocessor
+# Direct preprocessing with custom options
+cleaned_html = preprocess_html(
+    raw_html,
+    remove_navigation=True,
+    remove_forms=True,
+    remove_scripts=True,
+    remove_styles=True,
+    remove_comments=True,
+    preserve_semantic_structure=True,
+    preserve_tables=True,
+    preserve_media=True,
+)
+markdown = convert_to_markdown(cleaned_html)
+# Create a preprocessor configuration from presets
+config = create_preprocessor(preset="aggressive", preserve_tables=False)  # or "minimal", "standard"  # Override preset settings
+markdown = convert_to_markdown(html, **config)
+```
+### Exception Handling
+The library provides specific exception classes for better error handling:
+````python
+from html_to_markdown import (
+    convert_to_markdown,
+    HtmlToMarkdownError,
+    EmptyHtmlError,
+    InvalidParserError,
+    ConflictingOptionsError,
+    MissingDependencyError
+)
+try:
+    markdown = convert_to_markdown(html, parser='lxml')
+except MissingDependencyError:
+    # lxml not installed
+    markdown = convert_to_markdown(html, parser='html.parser')
+except EmptyHtmlError:
+    print("No HTML content to convert")
+except InvalidParserError as e:
+    print(f"Parser error: {e}")
+except ConflictingOptionsError as e:
+    print(f"Conflicting options: {e}")
+except HtmlToMarkdownError as e:
+    print(f"Conversion error: {e}")
 ## CLI Usage
 Convert HTML files directly from the command line with full access to all API options:
@@ -302,7 +384,7 @@ html_to_markdown \
   --preprocess-html \
   --preprocessing-preset aggressive \
   input.html > output.md
-```
+````
 ### Key CLI Options
@@ -315,6 +397,20 @@ html_to_markdown \
 --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
 --heading-style {atx,atx_closed,underlined} # Header style
 --no-extract-metadata               # Disable metadata extraction
+--br-in-tables                      # Use <br> tags for line breaks in table cells
+--source-encoding ENCODING          # Override auto-detected encoding (rarely needed)
+```
+**File Encoding:**
+The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+```shell
+# Override auto-detection for Latin-1 encoded file
+html_to_markdown --source-encoding latin-1 input.html > output.md
+# Force UTF-16 encoding when auto-detection fails
+html_to_markdown --source-encoding utf-16 input.html > output.md
 ```
 **All Available Options:**
@@ -355,6 +451,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+- `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 ### Parser Options
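
Note on the progress callback added in the README hunk above: the callback receives two integers, `processed` and `total`, and guards against a zero total before dividing. The sketch below reproduces that contract with the standard library only. `format_progress` and `feed_in_chunks` are hypothetical names introduced for illustration; `feed_in_chunks` is a stand-in for a chunked converter, not the library's implementation.

```python
from typing import Callable, List


def format_progress(processed: int, total: int) -> str:
    """Return a one-line progress message, or an empty string if total is unknown."""
    if total <= 0:
        return ""
    percent = (processed / total) * 100
    return f"Progress: {percent:.1f}% ({processed}/{total})"


def feed_in_chunks(text: str, chunk_size: int, callback: Callable[[int, int], None]) -> int:
    """Hypothetical stand-in for a chunked converter: walk `text` and report progress."""
    total = len(text)
    processed = 0
    for start in range(0, total, chunk_size):
        processed += len(text[start : start + chunk_size])
        callback(processed, total)  # same (processed, total) shape as in the README
    return processed


messages: List[str] = []
feed_in_chunks("x" * 10, chunk_size=4, callback=lambda p, t: messages.append(format_progress(p, t)))
print("\n".join(messages))
```

Returning a string instead of printing keeps the formatting testable; a caller can still write it to `sys.stderr` with a leading `\r`, as the CLI hunk below does.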

{html_to_markdown-1.11.0 → html_to_markdown-1.12.0}/html_to_markdown/cli.py RENAMED

@@ -1,5 +1,6 @@
 import sys
 from argparse import ArgumentParser, FileType
+from pathlib import Path
 from html_to_markdown.constants import (
     ASTERISK,
@@ -13,6 +14,7 @@ from html_to_markdown.constants import (
     WHITESPACE_NORMALIZED,
     WHITESPACE_STRICT,
 )
+from html_to_markdown.exceptions import InvalidEncodingError
 from html_to_markdown.processing import convert_to_markdown
@@ -131,6 +133,12 @@ def main(argv: list[str]) -> str:
         help="Parent tags where images remain inline (not converted to alt-text).",
     )
+    parser.add_argument(
+        "--br-in-tables",
+        action="store_true",
+        help="Use <br> tags for line breaks in table cells instead of spaces.",
+    )
     parser.add_argument("-w", "--wrap", action="store_true", help="Enable text wrapping at --wrap-width characters.")
     parser.add_argument(
@@ -235,10 +243,18 @@ def main(argv: list[str]) -> str:
         help="Keep navigation elements when preprocessing (normally removed).",
     )
+    parser.add_argument(
+        "--source-encoding",
+        type=str,
+        default=None,
+        help="Source file encoding (e.g. 'utf-8', 'latin-1'). Defaults to system default.",
+    )
     args = parser.parse_args(argv)
     base_args = {
         "autolinks": args.autolinks,
+        "br_in_tables": args.br_in_tables,
         "bullets": args.bullets,
         "code_language": args.code_language,
         "convert": args.convert,
@@ -278,7 +294,7 @@ def main(argv: list[str]) -> str:
         if args.show_progress:
             def progress_callback(processed: int, total: int) -> None:
-                if total > 0:
+                if total > 0:  # pragma: no cover
                     percent = (processed / total) * 100
                     sys.stderr.write(f"\rProgress: {percent:.1f}% ({processed}/{total} bytes)")
@@ -286,4 +302,14 @@ def main(argv: list[str]) -> str:
             base_args["progress_callback"] = progress_callback
-    return convert_to_markdown(args.html.read(), **base_args)
+    if args.source_encoding and args.html.name != "<stdin>":
+        args.html.close()
+        try:
+            with Path(args.html.name).open(encoding=args.source_encoding) as f:
+                html_content = f.read()
+        except LookupError as e:
+            raise InvalidEncodingError(args.source_encoding) from e
+    else:
+        html_content = args.html.read()
+    return convert_to_markdown(html_content, **base_args)
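
Note on the last hunk: when `--source-encoding` is given and the input is not stdin, the file is reopened with that encoding, and a `LookupError` (which Python raises for an unknown codec name) is re-raised as `InvalidEncodingError`. A minimal standalone sketch of that pattern follows. The `InvalidEncodingError` class here is a local stand-in, since the library's own definition is not part of this diff, and `read_html` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path
from typing import Optional


class InvalidEncodingError(ValueError):
    """Local stand-in for the library exception of the same name (assumed shape)."""


def read_html(path: Path, source_encoding: Optional[str] = None) -> str:
    """Read `path` as text, translating an unknown codec name into InvalidEncodingError."""
    try:
        # encoding=None lets Python fall back to its default text encoding
        with path.open(encoding=source_encoding) as f:
            return f.read()
    except LookupError as e:
        raise InvalidEncodingError(source_encoding) from e


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "input.html"
    sample.write_bytes("<p>café</p>".encode("latin-1"))

    decoded = read_html(sample, source_encoding="latin-1")

    try:
        read_html(sample, source_encoding="not-a-codec")
        bad_encoding_rejected = False
    except InvalidEncodingError:
        bad_encoding_rejected = True
```

Only `LookupError` is translated here, as in the hunk; a file whose bytes do not match a valid codec would still surface as `UnicodeDecodeError`.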
