PyPI - html-to-markdown - Versions diffs - 1.13.0__tar.gz → 1.14.0__tar.gz - Mend

html-to-markdown 1.13.0tar.gz → 1.14.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic. Click here for more details.

Files changed (22) hide show

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.13.0
+Version: 1.14.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -32,6 +32,8 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: beautifulsoup4>=4.13.5
 Requires-Dist: nh3>=0.3
+Provides-Extra: html5lib
+Requires-Dist: beautifulsoup4[html5lib]>=4.13.5; extra == "html5lib"
 Provides-Extra: lxml
 Requires-Dist: beautifulsoup4[lxml]>=4.13.5; extra == "lxml"
 Dynamic: license-file
@@ -40,7 +42,7 @@ Dynamic: license-file
 A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
 of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
-Python 3.9+.
+Python 3.10+.
 ## Support This Project
@@ -64,8 +66,10 @@ Your support helps maintain and improve this library for the community.
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
 - **List Formatting**: Configurable list indentation with Discord/Slack compatibility
 - **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
+- **Bytes Input Support**: Direct handling of bytes input with automatic encoding detection and configurable source encoding
 - **Whitespace Control**: Normalized or strict whitespace preservation modes
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
+- **Parser Normalization**: Consistent output across all supported parsers (html.parser, lxml, html5lib)
 - **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios
 ## Installation
@@ -74,17 +78,27 @@ Your support helps maintain and improve this library for the community.
 pip install html-to-markdown
 ```
-### Optional lxml Parser
+### Optional Parsers
-For improved performance, you can install with the optional lxml parser:
+For improved performance and compatibility, you can install with optional parsers:
 ```shell
+# Fast lxml parser (recommended)
 pip install html-to-markdown[lxml]
+# Standards-compliant html5lib parser
+pip install html-to-markdown[html5lib]
 ```
-The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.
+**Parser Options:**
+- **html.parser** (default): Built-in Python parser, no dependencies
+- **lxml**: **Recommended** - Fastest parser with good malformed HTML handling
+- **html5lib**: Most standards-compliant, handles edge cases best
-The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.
+The library automatically uses lxml when available and **normalizes output to ensure consistent results regardless of parser choice**. We recommend using the **lxml parser for optimal performance** - it's significantly faster than the other options while maintaining excellent compatibility.
+You can explicitly specify a parser using the `parser` parameter.
 ## Quick Start
@@ -149,6 +163,39 @@ soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installatio
 markdown = convert_to_markdown(soup)
 ```
+### Working with Bytes and Encodings
+The library can directly handle bytes input, which is useful when working with HTTP responses or files:
+```python
+import requests
+from html_to_markdown import convert_to_markdown
+# Working with HTTP responses (bytes)
+response = requests.get("https://example.com")
+markdown = convert_to_markdown(response.content)  # response.content returns bytes
+# Specify encoding for non-UTF-8 content
+response = requests.get("https://example.fr")
+markdown = convert_to_markdown(response.content, source_encoding="latin-1")
+# Common encoding examples
+html_bytes = b"<p>Hello World</p>"
+markdown = convert_to_markdown(html_bytes)  # UTF-8 by default
+# Latin-1 encoded content
+html_latin1 = "<p>Café résumé</p>".encode("latin-1")
+markdown = convert_to_markdown(html_latin1, source_encoding="latin-1")
+# Windows-1252 encoded content
+html_windows = '<p>Smart quotes: "Hello"</p>'.encode("windows-1252")
+markdown = convert_to_markdown(html_windows, source_encoding="windows-1252")
+# Piping bytes from command line
+# echo '<p>Hello</p>' | python -m html_to_markdown
+# cat file.html | python -m html_to_markdown --source-encoding latin-1
+```
 ## Common Use Cases
 ### Discord/Slack Compatible Lists
@@ -681,6 +728,131 @@ This library provides comprehensive support for all modern HTML5 elements:
 - `<math>` (MathML support)
+## Command Line Interface
+The library includes a full-featured CLI tool with complete API parity:
+### Basic Usage
+```bash
+# Convert HTML file to Markdown
+html-to-markdown document.html
+# Convert from stdin
+echo '<h1>Title</h1><p>Content</p>' | html-to-markdown
+# Read HTML file with specific encoding
+html-to-markdown document.html --source-encoding latin-1
+# Pipe bytes with encoding specification
+cat document.html | html-to-markdown --source-encoding utf-8
+```
+### Advanced CLI Options
+```bash
+# Discord/Slack compatible lists (2-space indent)
+html-to-markdown file.html --list-indent-width 2
+# Clean messy HTML before conversion
+html-to-markdown file.html --preprocess-html --preprocessing-preset aggressive
+# Custom heading style
+html-to-markdown file.html --heading-style atx
+# Strip specific tags
+html-to-markdown file.html --strip nav aside footer
+# Convert only specific tags
+html-to-markdown file.html --convert h1 h2 p a strong em
+# Enable streaming for large files with progress
+html-to-markdown large.html --stream-processing --show-progress
+# Use specific parser (lxml recommended for best performance)
+html-to-markdown file.html --parser lxml
+```
+### Real-World CLI Examples
+```bash
+# Download and convert a webpage
+curl -s https://example.com | html-to-markdown --preprocess-html > output.md
+# Process multiple files with different encodings
+for file in *.html; do
+    html-to-markdown "$file" --source-encoding latin-1 > "${file%.html}.md"
+done
+# Convert with custom formatting for documentation
+html-to-markdown docs.html \
+    --heading-style atx \
+    --list-indent-width 2 \
+    --highlight-style bold \
+    --no-extract-metadata > docs.md
+```
+## Differences from markdownify
+html-to-markdown is a modern, completely rewritten library inspired by markdownify but with significant improvements:
+### Key Advantages
+| Feature                 | markdownify      | html-to-markdown                                                       |
+| ----------------------- | ---------------- | ---------------------------------------------------------------------- |
+| **Type Safety**         | No type hints    | Full MyPy compliance with strict typing                                |
+| **Python Support**      | Python 3.6+      | Python 3.10+ with modern features                                      |
+| **HTML5 Elements**      | Basic support    | Comprehensive HTML5 support (semantic, form, table, interactive, etc.) |
+| **Table Handling**      | Simple tables    | Advanced rowspan/colspan support                                       |
+| **Streaming**           | Memory-intensive | Memory-efficient streaming for large documents                         |
+| **CLI Tool**            | Basic            | Full-featured CLI with all API options                                 |
+| **Preprocessing**       | None             | Built-in HTML cleaning with configurable presets                       |
+| **Metadata Extraction** | None             | Automatic title/meta extraction as comments                            |
+| **Task Lists**          | None             | GitHub-compatible checkbox conversion                                  |
+| **Bytes Input**         | None             | Direct bytes support with configurable encoding                        |
+| **Custom Converters**   | Class-based      | Function-based with simpler API                                        |
+| **Testing**             | Basic            | Comprehensive test suite with 100% coverage                            |
+| **Performance**         | Standard         | Significantly faster with recommended lxml parser                      |
+### API Compatibility
+While inspired by markdownify, html-to-markdown uses a more modern, explicit API:
+```python
+# markdownify style
+from markdownify import markdownify
+result = markdownify(html, heading_style="atx", strip=["nav"])
+# html-to-markdown style (more explicit)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx", strip=["nav"])
+```
+### Migration from markdownify
+Most markdownify code can be easily migrated:
+```python
+# Before (markdownify)
+from markdownify import markdownify as md
+result = md(html, heading_style="atx")
+# After (html-to-markdown)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx")
+```
+Key changes when migrating:
+- Import path: `markdownify` → `html_to_markdown`
+- Function name: `markdownify()` → `convert_to_markdown()`
+- All parameter names remain the same for common options
+- New parameters available for advanced features (preprocessing, streaming, etc.)
 ## Acknowledgments
-Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
+Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors for the inspiration and foundation that made this modern implementation possible.

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/README.md RENAMED Viewed

@@ -2,7 +2,7 @@
 A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
 of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
-Python 3.9+.
+Python 3.10+.
 ## Support This Project
@@ -26,8 +26,10 @@ Your support helps maintain and improve this library for the community.
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
 - **List Formatting**: Configurable list indentation with Discord/Slack compatibility
 - **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
+- **Bytes Input Support**: Direct handling of bytes input with automatic encoding detection and configurable source encoding
 - **Whitespace Control**: Normalized or strict whitespace preservation modes
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
+- **Parser Normalization**: Consistent output across all supported parsers (html.parser, lxml, html5lib)
 - **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios
 ## Installation
@@ -36,17 +38,27 @@ Your support helps maintain and improve this library for the community.
 pip install html-to-markdown
 ```
-### Optional lxml Parser
+### Optional Parsers
-For improved performance, you can install with the optional lxml parser:
+For improved performance and compatibility, you can install with optional parsers:
 ```shell
+# Fast lxml parser (recommended)
 pip install html-to-markdown[lxml]
+# Standards-compliant html5lib parser
+pip install html-to-markdown[html5lib]
 ```
-The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.
+**Parser Options:**
+- **html.parser** (default): Built-in Python parser, no dependencies
+- **lxml**: **Recommended** - Fastest parser with good malformed HTML handling
+- **html5lib**: Most standards-compliant, handles edge cases best
-The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.
+The library automatically uses lxml when available and **normalizes output to ensure consistent results regardless of parser choice**. We recommend using the **lxml parser for optimal performance** - it's significantly faster than the other options while maintaining excellent compatibility.
+You can explicitly specify a parser using the `parser` parameter.
 ## Quick Start
@@ -111,6 +123,39 @@ soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installatio
 markdown = convert_to_markdown(soup)
 ```
+### Working with Bytes and Encodings
+The library can directly handle bytes input, which is useful when working with HTTP responses or files:
+```python
+import requests
+from html_to_markdown import convert_to_markdown
+# Working with HTTP responses (bytes)
+response = requests.get("https://example.com")
+markdown = convert_to_markdown(response.content)  # response.content returns bytes
+# Specify encoding for non-UTF-8 content
+response = requests.get("https://example.fr")
+markdown = convert_to_markdown(response.content, source_encoding="latin-1")
+# Common encoding examples
+html_bytes = b"<p>Hello World</p>"
+markdown = convert_to_markdown(html_bytes)  # UTF-8 by default
+# Latin-1 encoded content
+html_latin1 = "<p>Café résumé</p>".encode("latin-1")
+markdown = convert_to_markdown(html_latin1, source_encoding="latin-1")
+# Windows-1252 encoded content
+html_windows = '<p>Smart quotes: "Hello"</p>'.encode("windows-1252")
+markdown = convert_to_markdown(html_windows, source_encoding="windows-1252")
+# Piping bytes from command line
+# echo '<p>Hello</p>' | python -m html_to_markdown
+# cat file.html | python -m html_to_markdown --source-encoding latin-1
+```
 ## Common Use Cases
 ### Discord/Slack Compatible Lists
@@ -643,6 +688,131 @@ This library provides comprehensive support for all modern HTML5 elements:
 - `<math>` (MathML support)
+## Command Line Interface
+The library includes a full-featured CLI tool with complete API parity:
+### Basic Usage
+```bash
+# Convert HTML file to Markdown
+html-to-markdown document.html
+# Convert from stdin
+echo '<h1>Title</h1><p>Content</p>' | html-to-markdown
+# Read HTML file with specific encoding
+html-to-markdown document.html --source-encoding latin-1
+# Pipe bytes with encoding specification
+cat document.html | html-to-markdown --source-encoding utf-8
+```
+### Advanced CLI Options
+```bash
+# Discord/Slack compatible lists (2-space indent)
+html-to-markdown file.html --list-indent-width 2
+# Clean messy HTML before conversion
+html-to-markdown file.html --preprocess-html --preprocessing-preset aggressive
+# Custom heading style
+html-to-markdown file.html --heading-style atx
+# Strip specific tags
+html-to-markdown file.html --strip nav aside footer
+# Convert only specific tags
+html-to-markdown file.html --convert h1 h2 p a strong em
+# Enable streaming for large files with progress
+html-to-markdown large.html --stream-processing --show-progress
+# Use specific parser (lxml recommended for best performance)
+html-to-markdown file.html --parser lxml
+```
+### Real-World CLI Examples
+```bash
+# Download and convert a webpage
+curl -s https://example.com | html-to-markdown --preprocess-html > output.md
+# Process multiple files with different encodings
+for file in *.html; do
+    html-to-markdown "$file" --source-encoding latin-1 > "${file%.html}.md"
+done
+# Convert with custom formatting for documentation
+html-to-markdown docs.html \
+    --heading-style atx \
+    --list-indent-width 2 \
+    --highlight-style bold \
+    --no-extract-metadata > docs.md
+```
+## Differences from markdownify
+html-to-markdown is a modern, completely rewritten library inspired by markdownify but with significant improvements:
+### Key Advantages
+| Feature                 | markdownify      | html-to-markdown                                                       |
+| ----------------------- | ---------------- | ---------------------------------------------------------------------- |
+| **Type Safety**         | No type hints    | Full MyPy compliance with strict typing                                |
+| **Python Support**      | Python 3.6+      | Python 3.10+ with modern features                                      |
+| **HTML5 Elements**      | Basic support    | Comprehensive HTML5 support (semantic, form, table, interactive, etc.) |
+| **Table Handling**      | Simple tables    | Advanced rowspan/colspan support                                       |
+| **Streaming**           | Memory-intensive | Memory-efficient streaming for large documents                         |
+| **CLI Tool**            | Basic            | Full-featured CLI with all API options                                 |
+| **Preprocessing**       | None             | Built-in HTML cleaning with configurable presets                       |
+| **Metadata Extraction** | None             | Automatic title/meta extraction as comments                            |
+| **Task Lists**          | None             | GitHub-compatible checkbox conversion                                  |
+| **Bytes Input**         | None             | Direct bytes support with configurable encoding                        |
+| **Custom Converters**   | Class-based      | Function-based with simpler API                                        |
+| **Testing**             | Basic            | Comprehensive test suite with 100% coverage                            |
+| **Performance**         | Standard         | Significantly faster with recommended lxml parser                      |
+### API Compatibility
+While inspired by markdownify, html-to-markdown uses a more modern, explicit API:
+```python
+# markdownify style
+from markdownify import markdownify
+result = markdownify(html, heading_style="atx", strip=["nav"])
+# html-to-markdown style (more explicit)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx", strip=["nav"])
+```
+### Migration from markdownify
+Most markdownify code can be easily migrated:
+```python
+# Before (markdownify)
+from markdownify import markdownify as md
+result = md(html, heading_style="atx")
+# After (html-to-markdown)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx")
+```
+Key changes when migrating:
+- Import path: `markdownify` → `html_to_markdown`
+- Function name: `markdownify()` → `convert_to_markdown()`
+- All parameter names remain the same for common options
+- New parameters available for advanced features (preprocessing, streaming, etc.)
 ## Acknowledgments
-Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
+Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors for the inspiration and foundation that made this modern implementation possible.

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/html_to_markdown/cli.py RENAMED Viewed

@@ -1,5 +1,5 @@
 import sys
-from argparse import ArgumentParser, FileType
+from argparse import ArgumentParser
 from pathlib import Path
 from html_to_markdown.constants import (
@@ -27,8 +27,7 @@ def main(argv: list[str]) -> str:
     parser.add_argument(
         "html",
         nargs="?",
-        type=FileType("r"),
-        default=sys.stdin,
+        default="-",
         help="The HTML file to convert. Defaults to STDIN if not provided.",
     )
@@ -247,7 +246,7 @@ def main(argv: list[str]) -> str:
         "--source-encoding",
         type=str,
         default=None,
-        help="Source file encoding (e.g. 'utf-8', 'latin-1'). Defaults to system default.",
+        help="Encoding for reading input files and decoding bytes (e.g. 'utf-8', 'latin-1'). Default: utf-8.",
     )
     args = parser.parse_args(argv)
@@ -260,6 +259,7 @@ def main(argv: list[str]) -> str:
         "convert": args.convert,
         "convert_as_inline": args.convert_as_inline,
         "default_title": args.default_title,
+        "source_encoding": args.source_encoding,
         "escape_asterisks": args.escape_asterisks,
         "escape_misc": args.escape_misc,
         "escape_underscores": args.escape_underscores,
@@ -302,14 +302,20 @@ def main(argv: list[str]) -> str:
             base_args["progress_callback"] = progress_callback
-    if args.source_encoding and args.html.name != "<stdin>":
-        args.html.close()
-        try:
-            with Path(args.html.name).open(encoding=args.source_encoding) as f:
-                html_content = f.read()
-        except LookupError as e:
-            raise InvalidEncodingError(args.source_encoding) from e
+    if args.html == "-":
+        html_content = sys.stdin.buffer.read()
     else:
-        html_content = args.html.read()
+        try:
+            file_path = Path(args.html)
+            if args.source_encoding:
+                with file_path.open(encoding=args.source_encoding, errors="replace") as f:
+                    html_content = f.read()
+            else:
+                with file_path.open("rb") as f:
+                    html_content = f.read()
+        except (OSError, LookupError) as e:
+            if isinstance(e, LookupError):
+                raise InvalidEncodingError(args.source_encoding) from e
+            raise
     return convert_to_markdown(html_content, **base_args)

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/html_to_markdown/converters.py RENAMED Viewed

@@ -414,8 +414,6 @@ def _convert_li(*, tag: Tag, text: str, bullets: str, list_indent_str: str) -> s
             return "".join(result_parts)
-    # Ensure consistent whitespace handling for list items, especially with strip_newlines=True
-    # Strip any leading whitespace that may have been inherited from parent containers
     clean_text = (text or "").strip()
     return f"{bullet} {clean_text}\n"

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/html_to_markdown/processing.py RENAMED Viewed

@@ -445,13 +445,14 @@ def _format_metadata_comment(metadata: dict[str, str]) -> str:
 def convert_to_markdown(
-    source: str | BeautifulSoup,
+    source: str | bytes | BeautifulSoup,
     *,
     stream_processing: bool = False,
     chunk_size: int = 1024,
     chunk_callback: Callable[[str], None] | None = None,
     progress_callback: Callable[[int, int], None] | None = None,
     parser: str | None = None,
+    source_encoding: str = "utf-8",
     autolinks: bool = True,
     br_in_tables: bool = False,
     bullets: str = "*+-",
@@ -489,12 +490,13 @@ def convert_to_markdown(
     various customization options for controlling the conversion behavior.
     Args:
-        source: HTML string or BeautifulSoup object to convert.
+        source: HTML string, bytes, or BeautifulSoup object to convert.
         stream_processing: Enable streaming mode for large documents.
         chunk_size: Size of chunks for streaming processing.
         chunk_callback: Callback for processing chunks in streaming mode.
         progress_callback: Callback for progress updates (current, total).
         parser: HTML parser to use ('html.parser', 'lxml', 'html5lib').
+        source_encoding: Character encoding to use when decoding bytes (default: 'utf-8').
         autolinks: Convert URLs to automatic links.
         br_in_tables: Use <br> tags for line breaks in table cells instead of spaces.
         bullets: Characters to use for unordered list bullets.
@@ -548,11 +550,12 @@ def convert_to_markdown(
         >>> convert_to_markdown(html, list_indent_width=2)
         '* Item 1\\n* Item 2\\n\\n'
     """
-    # Initialize original input string for Windows lxml fix
     original_input_str = None
+    if isinstance(source, bytes):
+        source = source.decode(source_encoding or "utf-8", errors="replace")
     if isinstance(source, str):
-        # Store original string for plain text detection (Windows lxml fix)
         original_input_str = source
         if (
@@ -613,6 +616,37 @@ def convert_to_markdown(
                         new_text = NavigableString(leading_ws + str(first_child))
                         first_child.replace_with(new_text)
                         needs_leading_space_fix = False
+            # Fix html5lib whitespace handling to match other parsers
+            if parser == "html5lib":
+                body = source.find("body")
+                if body and isinstance(body, Tag):
+                    children = list(body.children)
+                    if (
+                        len(children) == 1
+                        and isinstance(children[0], NavigableString)
+                        and original_source.startswith((" ", "\t", "\n", "\r"))
+                        and not str(children[0]).startswith((" ", "\t", "\n", "\r"))
+                    ):
+                        first_child = children[0]
+                        original_text = str(first_child)
+                        # Preserve leading whitespace from original if html5lib stripped it
+                        leading_ws = ""
+                        for char in original_source:
+                            if char in " \t\n\r":
+                                leading_ws += char
+                            else:
+                                break
+                        # Create normalized text: restore leading whitespace only
+                        normalized_text = original_text
+                        if leading_ws and not normalized_text.startswith(leading_ws):
+                            normalized_text = leading_ws + normalized_text
+                        new_text = NavigableString(normalized_text)
+                        first_child.replace_with(new_text)
         else:
             raise EmptyHtmlError
@@ -626,6 +660,7 @@ def convert_to_markdown(
             chunk_size=chunk_size,
             progress_callback=progress_callback,
             parser=parser,
+            source_encoding=source_encoding,
             autolinks=autolinks,
             bullets=bullets,
             code_language=code_language,
@@ -673,6 +708,7 @@ def convert_to_markdown(
         sink,
         whitespace_handler=whitespace_handler,
         parser=parser,
+        source_encoding=source_encoding,
         autolinks=autolinks,
         br_in_tables=br_in_tables,
         bullets=bullets,
@@ -703,8 +739,6 @@ def convert_to_markdown(
     result = sink.get_result()
-    # Parser-agnostic behavior: handle leading whitespace differences between parsers
-    # lxml may either add unwanted whitespace or strip meaningful whitespace compared to html.parser
     if "needs_leading_whitespace_fix" in locals() and needs_leading_whitespace_fix:
         original_input = sink.original_source if hasattr(sink, "original_source") else original_source
         if isinstance(original_input, str):
@@ -713,19 +747,14 @@ def convert_to_markdown(
                 original_leading_whitespace_match.group(0) if original_leading_whitespace_match else ""
             )
-            # Case 1: lxml added leading newlines (like "\n<figure>") - strip them
             if result.startswith("\n") and not original_input.lstrip().startswith(result.strip()):
                 result = result.lstrip("\n\r")
-            # Case 2: lxml stripped meaningful leading whitespace (like " <b>") - restore it
-            # However, don't restore whitespace if strip_newlines=True was used, as the user
-            # explicitly requested to remove formatting whitespace
             elif (
                 not strip_newlines
                 and not result.startswith((" ", "\t"))
                 and original_leading_whitespace.startswith((" ", "\t"))
             ):
-                # Only restore spaces/tabs, not newlines (which are usually formatting)
                 leading_spaces_tabs_match = re.match(r"^[ \t]*", original_leading_whitespace)
                 leading_spaces_tabs = leading_spaces_tabs_match.group(0) if leading_spaces_tabs_match else ""
                 if leading_spaces_tabs:
@@ -758,9 +787,6 @@ def convert_to_markdown(
     if convert_as_inline:
         result = result.rstrip("\n")
-    # Windows-specific fix: For plain text input (no HTML tags), lxml may add extra trailing newlines
-    # This ensures consistent behavior across platforms when processing plain text
-    # Only apply to cases where lxml adds extra newlines (\n\n) at the end
     if (
         "original_input_str" in locals()
         and original_input_str
@@ -768,19 +794,11 @@ def convert_to_markdown(
         and not original_input_str.strip().endswith(">")
         and result.endswith("\n\n")
     ):
-        # Input appears to be plain text, not HTML - normalize trailing newlines only
         result = result.rstrip("\n")
-    # If the original input contained no block-level elements, normalize any
-    # accidental trailing newlines for cross-platform consistency.
-    # This guards cases like inline-only inputs (e.g., "text <strong>bold</strong>")
-    # and head-only documents (e.g., "<head>head</head>") where output should
-    # not end with extra blank lines.
     if "original_input_str" in locals() and original_input_str:
         from html_to_markdown.whitespace import BLOCK_ELEMENTS  # noqa: PLC0415
-        # Treat additional tags as block-producing for trailing newline purposes.
-        # These may be inline in HTML spec but produce block output in our Markdown conversion.
         blockish = set(BLOCK_ELEMENTS) | {
             "textarea",
             "dialog",
@@ -880,11 +898,12 @@ class StreamingSink(OutputSink):
 def _process_html_core(
-    source: str | BeautifulSoup,
+    source: str | bytes | BeautifulSoup,
     sink: OutputSink,
     *,
     whitespace_handler: WhitespaceHandler,
     parser: str | None = None,
+    source_encoding: str = "utf-8",
     autolinks: bool,
     br_in_tables: bool,
     bullets: str,
@@ -915,7 +934,12 @@ def _process_html_core(
     token = _ancestor_cache.set({})
     try:
-        if isinstance(source, str):
+        if isinstance(source, (str, bytes)):
+            original_source = source
+            if isinstance(source, bytes):
+                source = source.decode(source_encoding or "utf-8", errors="replace")
+                original_source = source
             if strip_newlines:
                 source = source.replace("\n", " ").replace("\r", " ")  # pragma: no cover
@@ -926,7 +950,36 @@ def _process_html_core(
                 if parser == "lxml" and not LXML_AVAILABLE:  # pragma: no cover
                     raise MissingDependencyError("lxml", "pip install html-to-markdown[lxml]")
+                needs_leading_whitespace_fix = (
+                    parser == "lxml"
+                    and isinstance(original_source, str)
+                    and original_source.startswith((" ", "\t", "\n", "\r"))
+                )
                 source = BeautifulSoup(source, parser)
+                if parser == "lxml" and needs_leading_whitespace_fix and isinstance(original_source, str):
+                    body = source.find("body")
+                    if body and isinstance(body, Tag):
+                        children = list(body.children)
+                        if (
+                            len(children) == 1
+                            and isinstance(children[0], NavigableString)
+                            and original_source.startswith((" ", "\t", "\n", "\r"))
+                            and not str(children[0]).startswith((" ", "\t", "\n", "\r"))
+                        ):
+                            first_child = children[0]
+                            leading_ws = ""
+                            for char in original_source:
+                                if char in " \t":
+                                    leading_ws += char
+                                else:
+                                    break
+                            new_text = NavigableString(leading_ws + str(first_child))
+                            first_child.replace_with(new_text)
             else:
                 raise EmptyHtmlError
@@ -998,11 +1051,12 @@ def _process_html_core(
 def convert_to_markdown_stream(
-    source: str | BeautifulSoup,
+    source: str | bytes | BeautifulSoup,
     *,
     chunk_size: int = 1024,
     progress_callback: Callable[[int, int], None] | None = None,
     parser: str | None = None,
+    source_encoding: str = "utf-8",
     autolinks: bool = True,
     br_in_tables: bool = False,
     bullets: str = "*+-",
@@ -1022,6 +1076,10 @@ def convert_to_markdown_stream(
     list_indent_type: Literal["spaces", "tabs"] = "spaces",
     list_indent_width: int = 4,
     newline_style: Literal["spaces", "backslash"] = SPACES,
+    preprocess_html: bool = False,
+    preprocessing_preset: Literal["minimal", "standard", "aggressive"] = "standard",
+    remove_forms: bool = True,
+    remove_navigation: bool = True,
     strip: str | Iterable[str] | None = None,
     strip_newlines: bool = False,
     strong_em_symbol: Literal["*", "_"] = ASTERISK,
@@ -1033,8 +1091,22 @@ def convert_to_markdown_stream(
 ) -> Generator[str, None, None]:
     sink = StreamingSink(chunk_size, progress_callback)
-    if isinstance(source, str):
-        sink.total_bytes = len(source)
+    if isinstance(source, bytes):
+        source = source.decode(source_encoding or "utf-8", errors="replace")
+    if isinstance(source, str) and preprocess_html and create_preprocessor is not None and preprocess_fn is not None:
+        config = create_preprocessor(
+            preset=preprocessing_preset,
+            remove_navigation=remove_navigation,
+            remove_forms=remove_forms,
+        )
+        source = preprocess_fn(source, **config)
+    if isinstance(source, (str, bytes)):
+        if isinstance(source, bytes):
+            sink.total_bytes = len(source)
+        else:
+            sink.total_bytes = len(source)
     elif isinstance(source, BeautifulSoup):
         sink.total_bytes = len(str(source))
@@ -1045,6 +1117,7 @@ def convert_to_markdown_stream(
         sink,
         whitespace_handler=whitespace_handler,
         parser=parser,
+        source_encoding=source_encoding,
         autolinks=autolinks,
         br_in_tables=br_in_tables,
         bullets=bullets,

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/html_to_markdown.egg-info/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.13.0
+Version: 1.14.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -32,6 +32,8 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: beautifulsoup4>=4.13.5
 Requires-Dist: nh3>=0.3
+Provides-Extra: html5lib
+Requires-Dist: beautifulsoup4[html5lib]>=4.13.5; extra == "html5lib"
 Provides-Extra: lxml
 Requires-Dist: beautifulsoup4[lxml]>=4.13.5; extra == "lxml"
 Dynamic: license-file
@@ -40,7 +42,7 @@ Dynamic: license-file
 A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
 of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
-Python 3.9+.
+Python 3.10+.
 ## Support This Project
@@ -64,8 +66,10 @@ Your support helps maintain and improve this library for the community.
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
 - **List Formatting**: Configurable list indentation with Discord/Slack compatibility
 - **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
+- **Bytes Input Support**: Direct handling of bytes input with automatic encoding detection and configurable source encoding
 - **Whitespace Control**: Normalized or strict whitespace preservation modes
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
+- **Parser Normalization**: Consistent output across all supported parsers (html.parser, lxml, html5lib)
 - **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios
 ## Installation
@@ -74,17 +78,27 @@ Your support helps maintain and improve this library for the community.
 pip install html-to-markdown
 ```
-### Optional lxml Parser
+### Optional Parsers
-For improved performance, you can install with the optional lxml parser:
+For improved performance and compatibility, you can install with optional parsers:
 ```shell
+# Fast lxml parser (recommended)
 pip install html-to-markdown[lxml]
+# Standards-compliant html5lib parser
+pip install html-to-markdown[html5lib]
 ```
-The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.
+**Parser Options:**
+- **html.parser** (default): Built-in Python parser, no dependencies
+- **lxml**: **Recommended** - Fastest parser with good malformed HTML handling
+- **html5lib**: Most standards-compliant, handles edge cases best
-The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.
+The library automatically uses lxml when available and **normalizes output to ensure consistent results regardless of parser choice**. We recommend using the **lxml parser for optimal performance** - it's significantly faster than the other options while maintaining excellent compatibility.
+You can explicitly specify a parser using the `parser` parameter.
 ## Quick Start
@@ -149,6 +163,39 @@ soup = BeautifulSoup(html, "lxml")  # Note: lxml requires additional installatio
 markdown = convert_to_markdown(soup)
 ```
+### Working with Bytes and Encodings
+The library can directly handle bytes input, which is useful when working with HTTP responses or files:
+```python
+import requests
+from html_to_markdown import convert_to_markdown
+# Working with HTTP responses (bytes)
+response = requests.get("https://example.com")
+markdown = convert_to_markdown(response.content)  # response.content returns bytes
+# Specify encoding for non-UTF-8 content
+response = requests.get("https://example.fr")
+markdown = convert_to_markdown(response.content, source_encoding="latin-1")
+# Common encoding examples
+html_bytes = b"<p>Hello World</p>"
+markdown = convert_to_markdown(html_bytes)  # UTF-8 by default
+# Latin-1 encoded content
+html_latin1 = "<p>Café résumé</p>".encode("latin-1")
+markdown = convert_to_markdown(html_latin1, source_encoding="latin-1")
+# Windows-1252 encoded content
+html_windows = '<p>Smart quotes: "Hello"</p>'.encode("windows-1252")
+markdown = convert_to_markdown(html_windows, source_encoding="windows-1252")
+# Piping bytes from command line
+# echo '<p>Hello</p>' | python -m html_to_markdown
+# cat file.html | python -m html_to_markdown --source-encoding latin-1
+```
 ## Common Use Cases
 ### Discord/Slack Compatible Lists
@@ -681,6 +728,131 @@ This library provides comprehensive support for all modern HTML5 elements:
 - `<math>` (MathML support)
+## Command Line Interface
+The library includes a full-featured CLI tool with complete API parity:
+### Basic Usage
+```bash
+# Convert HTML file to Markdown
+html-to-markdown document.html
+# Convert from stdin
+echo '<h1>Title</h1><p>Content</p>' | html-to-markdown
+# Read HTML file with specific encoding
+html-to-markdown document.html --source-encoding latin-1
+# Pipe bytes with encoding specification
+cat document.html | html-to-markdown --source-encoding utf-8
+```
+### Advanced CLI Options
+```bash
+# Discord/Slack compatible lists (2-space indent)
+html-to-markdown file.html --list-indent-width 2
+# Clean messy HTML before conversion
+html-to-markdown file.html --preprocess-html --preprocessing-preset aggressive
+# Custom heading style
+html-to-markdown file.html --heading-style atx
+# Strip specific tags
+html-to-markdown file.html --strip nav aside footer
+# Convert only specific tags
+html-to-markdown file.html --convert h1 h2 p a strong em
+# Enable streaming for large files with progress
+html-to-markdown large.html --stream-processing --show-progress
+# Use specific parser (lxml recommended for best performance)
+html-to-markdown file.html --parser lxml
+```
+### Real-World CLI Examples
+```bash
+# Download and convert a webpage
+curl -s https://example.com | html-to-markdown --preprocess-html > output.md
+# Process multiple files with different encodings
+for file in *.html; do
+    html-to-markdown "$file" --source-encoding latin-1 > "${file%.html}.md"
+done
+# Convert with custom formatting for documentation
+html-to-markdown docs.html \
+    --heading-style atx \
+    --list-indent-width 2 \
+    --highlight-style bold \
+    --no-extract-metadata > docs.md
+```
+## Differences from markdownify
+html-to-markdown is a modern, completely rewritten library inspired by markdownify but with significant improvements:
+### Key Advantages
+| Feature                 | markdownify      | html-to-markdown                                                       |
+| ----------------------- | ---------------- | ---------------------------------------------------------------------- |
+| **Type Safety**         | No type hints    | Full MyPy compliance with strict typing                                |
+| **Python Support**      | Python 3.6+      | Python 3.10+ with modern features                                      |
+| **HTML5 Elements**      | Basic support    | Comprehensive HTML5 support (semantic, form, table, interactive, etc.) |
+| **Table Handling**      | Simple tables    | Advanced rowspan/colspan support                                       |
+| **Streaming**           | Memory-intensive | Memory-efficient streaming for large documents                         |
+| **CLI Tool**            | Basic            | Full-featured CLI with all API options                                 |
+| **Preprocessing**       | None             | Built-in HTML cleaning with configurable presets                       |
+| **Metadata Extraction** | None             | Automatic title/meta extraction as comments                            |
+| **Task Lists**          | None             | GitHub-compatible checkbox conversion                                  |
+| **Bytes Input**         | None             | Direct bytes support with configurable encoding                        |
+| **Custom Converters**   | Class-based      | Function-based with simpler API                                        |
+| **Testing**             | Basic            | Comprehensive test suite with 100% coverage                            |
+| **Performance**         | Standard         | Significantly faster with recommended lxml parser                      |
+### API Compatibility
+While inspired by markdownify, html-to-markdown uses a more modern, explicit API:
+```python
+# markdownify style
+from markdownify import markdownify
+result = markdownify(html, heading_style="atx", strip=["nav"])
+# html-to-markdown style (more explicit)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx", strip=["nav"])
+```
+### Migration from markdownify
+Most markdownify code can be easily migrated:
+```python
+# Before (markdownify)
+from markdownify import markdownify as md
+result = md(html, heading_style="atx")
+# After (html-to-markdown)
+from html_to_markdown import convert_to_markdown
+result = convert_to_markdown(html, heading_style="atx")
+```
+Key changes when migrating:
+- Import path: `markdownify` → `html_to_markdown`
+- Function name: `markdownify()` → `convert_to_markdown()`
+- All parameter names remain the same for common options
+- New parameters available for advanced features (preprocessing, streaming, etc.)
 ## Acknowledgments
-Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
+Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors for the inspiration and foundation that made this modern implementation possible.

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/html_to_markdown.egg-info/requires.txt RENAMED Viewed

@@ -1,5 +1,8 @@
 beautifulsoup4>=4.13.5
 nh3>=0.3
+[html5lib]
+beautifulsoup4[html5lib]>=4.13.5
 [lxml]
 beautifulsoup4[lxml]>=4.13.5

{html_to_markdown-1.13.0 → html_to_markdown-1.14.0}/pyproject.toml RENAMED Viewed

@@ -5,7 +5,7 @@ requires = [ "setuptools>=78.1" ]
 [project]
 name = "html-to-markdown"
-version = "1.13.0"
+version = "1.14.0"
 description = "A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options"
 readme = "README.md"
 keywords = [
@@ -46,8 +46,9 @@ dependencies = [
   "beautifulsoup4>=4.13.5",
   "nh3>=0.3",
 ]
-optional-dependencies.lxml = [ "beautifulsoup4[lxml]>=4.13.5" ]
+optional-dependencies.html5lib = [ "beautifulsoup4[html5lib]>=4.13.5" ]
+optional-dependencies.lxml = [ "beautifulsoup4[lxml]>=4.13.5" ]
 urls.Changelog = "https://github.com/Goldziher/html-to-markdown/releases"
 urls.Homepage = "https://github.com/Goldziher/html-to-markdown"
 urls.Issues = "https://github.com/Goldziher/html-to-markdown/issues"
@@ -57,14 +58,16 @@ scripts.html_to_markdown = "html_to_markdown.__main__:cli"
 [dependency-groups]
 dev = [
+  "beautifulsoup4[html5lib]>=4.13.5",
+  "beautifulsoup4[lxml]>=4.13.5",
   "covdefaults>=2.3",
-  "mypy>=1.18.1",
+  "mypy>=1.18.2",
   "pre-commit>=4.3",
   "pytest>=8.4.2",
   "pytest-benchmark>=5.1",
   "pytest-cov>=7",
-  "pytest-mock>=3.15",
-  "ruff>=0.13",
+  "pytest-mock>=3.15.1",
+  "ruff>=0.13.1",
   "types-beautifulsoup4>=4.12.0.20250516",
   "types-psutil>=7.0.0.20250822",
   "uv-bump",