PyPI - html-to-markdown - Versions diffs - 1.5.0__tar.gz → 1.8.0__tar.gz - Mend

html-to-markdown 1.5.0tar.gz → 1.8.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (23)

{html_to_markdown-1.5.0 → html_to_markdown-1.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.5.0
+Version: 1.8.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -32,6 +32,9 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: beautifulsoup4>=4.13.4
+Requires-Dist: nh3>=0.2.21
+Provides-Extra: lxml
+Requires-Dist: lxml>=5; extra == "lxml"
 Dynamic: license-file
 # html-to-markdown
@@ -60,6 +63,28 @@ Python 3.9+.
 pip install html-to-markdown
 ```
+### Optional lxml Parser
+For improved performance, you can install with the optional lxml parser:
+```shell
+pip install html-to-markdown[lxml]
+```
+The lxml parser offers:
+- **~30% faster HTML parsing** compared to the default html.parser
+- Better handling of malformed HTML
+- More robust parsing for complex documents
+Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
+```python
+result = convert_to_markdown(html)  # Auto-detects: uses lxml if available, otherwise html.parser
+result = convert_to_markdown(html, parser="lxml")  # Force lxml (requires installation)
+result = convert_to_markdown(html, parser="html.parser")  # Force built-in parser
+```
 ## Quick Start
 Convert HTML to Markdown with a single function call:
@@ -180,18 +205,19 @@ Custom converters take precedence over the built-in converters and can be used a
 ### Key Configuration Options
-| Option              | Type | Default          | Description                                            |
-| ------------------- | ---- | ---------------- | ------------------------------------------------------ |
-| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header            |
-| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                  |
-| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
-| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
-| `stream_processing` | bool | `False`          | Enable streaming for large documents                   |
-| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                    |
-| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                    |
-| `escape_asterisks`  | bool | `True`           | Escape * characters                                    |
-| `wrap`              | bool | `False`          | Enable text wrapping                                   |
-| `wrap_width`        | int  | `80`             | Text wrap width                                        |
+| Option              | Type | Default          | Description                                                     |
+| ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
+| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header                     |
+| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                           |
+| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`)          |
+| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`)          |
+| `stream_processing` | bool | `False`          | Enable streaming for large documents                            |
+| `parser`            | str  | auto-detect      | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
+| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                             |
+| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                             |
+| `escape_asterisks`  | bool | `True`           | Escape * characters                                             |
+| `wrap`              | bool | `False`          | Enable text wrapping                                            |
+| `wrap_width`        | int  | `80`             | Text wrap width                                                 |
 For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
@@ -379,6 +405,17 @@ uv run python -m html_to_markdown input.html
 uv build
 ```
+## Performance
+The library is optimized for performance with several key features:
+- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
+- **Streaming support**: Process large documents in chunks to minimize memory usage
+- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
+- **Optimized string operations**: Minimizes string concatenations in hot paths
+Typical throughput: ~2 MB/s for regular processing on modern hardware.
 ## License
 This library uses the MIT license.
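
The README text added in this file says the parser is auto-detected: lxml when it is installed, otherwise the built-in html.parser. The diff does not show how the library implements that check, so the following is only a minimal sketch of one way such a fallback could work. The helper name `pick_parser` is hypothetical and is not part of html-to-markdown.

```python
import importlib.util


def pick_parser(requested=None):
    """Return a BeautifulSoup parser name, preferring lxml when it is installed.

    Hypothetical helper for illustration; not the library's own code.
    """
    if requested is not None:
        # An explicit choice is passed through unchanged.
        return requested
    # find_spec returns None when the top-level module cannot be located.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


print(pick_parser())               # "lxml" or "html.parser", depending on the environment
print(pick_parser("html.parser"))  # explicit choice wins
```

Checking with `find_spec` avoids importing lxml just to learn whether it exists; the actual package may well use a different mechanism.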

{html_to_markdown-1.5.0 → html_to_markdown-1.8.0}/README.md RENAMED

@@ -24,6 +24,28 @@ Python 3.9+.
 pip install html-to-markdown
 ```
+### Optional lxml Parser
+For improved performance, you can install with the optional lxml parser:
+```shell
+pip install html-to-markdown[lxml]
+```
+The lxml parser offers:
+- **~30% faster HTML parsing** compared to the default html.parser
+- Better handling of malformed HTML
+- More robust parsing for complex documents
+Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
+```python
+result = convert_to_markdown(html)  # Auto-detects: uses lxml if available, otherwise html.parser
+result = convert_to_markdown(html, parser="lxml")  # Force lxml (requires installation)
+result = convert_to_markdown(html, parser="html.parser")  # Force built-in parser
+```
 ## Quick Start
 Convert HTML to Markdown with a single function call:
@@ -144,18 +166,19 @@ Custom converters take precedence over the built-in converters and can be used a
 ### Key Configuration Options
-| Option              | Type | Default          | Description                                            |
-| ------------------- | ---- | ---------------- | ------------------------------------------------------ |
-| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header            |
-| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                  |
-| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
-| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
-| `stream_processing` | bool | `False`          | Enable streaming for large documents                   |
-| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                    |
-| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                    |
-| `escape_asterisks`  | bool | `True`           | Escape * characters                                    |
-| `wrap`              | bool | `False`          | Enable text wrapping                                   |
-| `wrap_width`        | int  | `80`             | Text wrap width                                        |
+| Option              | Type | Default          | Description                                                     |
+| ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
+| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header                     |
+| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                           |
+| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`)          |
+| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`)          |
+| `stream_processing` | bool | `False`          | Enable streaming for large documents                            |
+| `parser`            | str  | auto-detect      | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
+| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                             |
+| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                             |
+| `escape_asterisks`  | bool | `True`           | Escape * characters                                             |
+| `wrap`              | bool | `False`          | Enable text wrapping                                            |
+| `wrap_width`        | int  | `80`             | Text wrap width                                                 |
 For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
@@ -343,6 +366,17 @@ uv run python -m html_to_markdown input.html
 uv build
 ```
+## Performance
+The library is optimized for performance with several key features:
+- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
+- **Streaming support**: Process large documents in chunks to minimize memory usage
+- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
+- **Optimized string operations**: Minimizes string concatenations in hot paths
+Typical throughput: ~2 MB/s for regular processing on modern hardware.
 ## License
 This library uses the MIT license.
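
The new Performance section quotes a throughput of roughly 2 MB/s without saying how it was measured. A figure like that can be checked locally with a small timing harness. The sketch below is an assumption about how one might do so, not the project's benchmark: `throughput_mb_per_s` and `placeholder_convert` are hypothetical names, and the placeholder stands in for `convert_to_markdown` so the code runs without the package installed.

```python
import time


def throughput_mb_per_s(convert, html, repeats=5):
    """Return the best observed throughput of convert(html) in MB/s."""
    size_mb = len(html.encode("utf-8")) / 1_000_000
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        convert(html)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return size_mb / best if best > 0 else float("inf")


def placeholder_convert(html):
    # Stand-in converter so the sketch is runnable anywhere (hypothetical).
    return html.replace("<p>", "").replace("</p>", "\n\n")


sample = "<p>Hello, world.</p>" * 50_000
print(f"{throughput_mb_per_s(placeholder_convert, sample):.1f} MB/s")
```

To measure the library itself, pass `convert_to_markdown` in place of `placeholder_convert`. Results will vary with the document shape and the parser in use.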

html_to_markdown-1.8.0/html_to_markdown/__init__.py ADDED

@@ -0,0 +1,24 @@
+from html_to_markdown.exceptions import (
+    ConflictingOptionsError,
+    EmptyHtmlError,
+    HtmlToMarkdownError,
+    InvalidParserError,
+    MissingDependencyError,
+)
+from html_to_markdown.preprocessor import create_preprocessor, preprocess_html
+from html_to_markdown.processing import convert_to_markdown, convert_to_markdown_stream
+markdownify = convert_to_markdown
+__all__ = [
+    "ConflictingOptionsError",
+    "EmptyHtmlError",
+    "HtmlToMarkdownError",
+    "InvalidParserError",
+    "MissingDependencyError",
+    "convert_to_markdown",
+    "convert_to_markdown_stream",
+    "create_preprocessor",
+    "markdownify",
+    "preprocess_html",
+]

{html_to_markdown-1.5.0 → html_to_markdown-1.8.0}/html_to_markdown/cli.py RENAMED

@@ -191,7 +191,6 @@ def main(argv: list[str]) -> str:
     args = parser.parse_args(argv)
-    # Prepare base arguments
     base_args = {
         "strip": args.strip,
         "convert": args.convert,
@@ -216,18 +215,16 @@ def main(argv: list[str]) -> str:
         "highlight_style": args.highlight_style,
     }
-    # Add streaming parameters only if streaming is enabled
     if args.stream_processing:
         base_args["stream_processing"] = True
         base_args["chunk_size"] = args.chunk_size
-        # Progress callback for CLI
         if args.show_progress:
             def progress_callback(processed: int, total: int) -> None:
                 if total > 0:
                     percent = (processed / total) * 100
-                    # Use sys.stderr to avoid ruff T201 error for progress output
                     sys.stderr.write(f"\rProgress: {percent:.1f}% ({processed}/{total} bytes)")
                     sys.stderr.flush()
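
The hunk above is cut off inside `main`, so the progress callback cannot be run on its own. As a rough illustration of what it does, the sketch below lifts the same body into a standalone factory. `make_progress_callback` and its `stream` parameter are hypothetical additions so that the output can be captured; the CLI itself writes to `sys.stderr`.

```python
import io
import sys


def make_progress_callback(stream=sys.stderr):
    """Build a callback that writes a carriage-return progress line to stream."""

    def progress_callback(processed: int, total: int) -> None:
        # Skip the update when the total size is unknown or zero.
        if total > 0:
            percent = (processed / total) * 100
            stream.write(f"\rProgress: {percent:.1f}% ({processed}/{total} bytes)")
            stream.flush()

    return progress_callback


buffer = io.StringIO()
report = make_progress_callback(buffer)
report(512, 2048)  # writes one progress line
report(10, 0)      # writes nothing because total is 0
print(repr(buffer.getvalue()))
```

The leading `\r` returns the cursor to the start of the line, so successive updates overwrite one another in a terminal instead of scrolling.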
