PyPI - html-to-markdown - Versions diffs - 1.5.0__py3-none-any.whl → 1.8.0__py3-none-any.whl - Mend

html-to-markdown 1.5.0__py3-none-any.whl → 1.8.0__py3-none-any.whl


Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (14)

html_to_markdown/__init__.py +20 -2
html_to_markdown/cli.py +1 -4
html_to_markdown/converters.py +36 -92
html_to_markdown/exceptions.py +49 -0
html_to_markdown/preprocessor.py +407 -0
html_to_markdown/processing.py +447 -210
html_to_markdown/utils.py +12 -5
{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/METADATA +50 -13
html_to_markdown-1.8.0.dist-info/RECORD +16 -0
html_to_markdown-1.5.0.dist-info/RECORD +0 -14
{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/WHEEL +0 -0
{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/entry_points.txt +0 -0
{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/licenses/LICENSE +0 -0
{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/top_level.txt +0 -0

html_to_markdown/utils.py CHANGED

@@ -6,18 +6,25 @@ from html_to_markdown.constants import line_beginning_re
 def chomp(text: str) -> tuple[str, str, str]:
-    """If the text in an inline tag like b, a, or em contains a leading or trailing
-    space, strip the string and return a space as suffix of prefix, if needed.
+    """Simplified whitespace handling for inline elements.
+    For semantic markdown output, preserves leading/trailing spaces as single spaces
+    and normalizes internal whitespace.
     Args:
         text: The text to chomp.
     Returns:
-        A tuple containing the prefix, suffix, and the stripped text.
+        A tuple containing the prefix, suffix, and the normalized text.
     """
-    prefix = " " if text and text[0] == " " else ""
-    suffix = " " if text and text[-1] == " " else ""
+    if not text:
+        return "", "", ""
+    prefix = " " if text.startswith((" ", "\t")) else ""
+    suffix = " " if text.endswith((" ", "\t")) else ""
     text = text.strip()
     return prefix, suffix, text
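
The behavioral difference in this hunk is easiest to see side by side. The sketch below copies the two bodies from the diff into standalone functions; the names `chomp_1_5_0` and `chomp_1_8_0` are hypothetical labels used only here, and the real function in both releases is `chomp`.

```python
def chomp_1_5_0(text: str) -> tuple[str, str, str]:
    # Removed lines: only a literal space yields a prefix or suffix.
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    return prefix, suffix, text.strip()


def chomp_1_8_0(text: str) -> tuple[str, str, str]:
    # Added lines: early return for empty input, and a tab also counts.
    if not text:
        return "", "", ""
    prefix = " " if text.startswith((" ", "\t")) else ""
    suffix = " " if text.endswith((" ", "\t")) else ""
    return prefix, suffix, text.strip()


if __name__ == "__main__":
    for sample in ["\tbold ", " bold\t", "bold", ""]:
        print(repr(sample), chomp_1_5_0(sample), chomp_1_8_0(sample))
```

In the lines shown, a leading or trailing tab now produces a single-space prefix or suffix, where 1.5.0 dropped it. The text itself is still only stripped at both ends; no internal whitespace handling appears in this hunk.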

{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.5.0
+Version: 1.8.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -32,6 +32,9 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: beautifulsoup4>=4.13.4
+Requires-Dist: nh3>=0.2.21
+Provides-Extra: lxml
+Requires-Dist: lxml>=5; extra == "lxml"
 Dynamic: license-file
 # html-to-markdown
@@ -60,6 +63,28 @@ Python 3.9+.
 pip install html-to-markdown
 ```
+### Optional lxml Parser
+For improved performance, you can install with the optional lxml parser:
+```shell
+pip install html-to-markdown[lxml]
+```
+The lxml parser offers:
+- **~30% faster HTML parsing** compared to the default html.parser
+- Better handling of malformed HTML
+- More robust parsing for complex documents
+Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
+```python
+result = convert_to_markdown(html)  # Auto-detects: uses lxml if available, otherwise html.parser
+result = convert_to_markdown(html, parser="lxml")  # Force lxml (requires installation)
+result = convert_to_markdown(html, parser="html.parser")  # Force built-in parser
+```
 ## Quick Start
 Convert HTML to Markdown with a single function call:
@@ -180,18 +205,19 @@ Custom converters take precedence over the built-in converters and can be used a
 ### Key Configuration Options
-| Option              | Type | Default          | Description                                            |
-| ------------------- | ---- | ---------------- | ------------------------------------------------------ |
-| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header            |
-| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                  |
-| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
-| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
-| `stream_processing` | bool | `False`          | Enable streaming for large documents                   |
-| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                    |
-| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                    |
-| `escape_asterisks`  | bool | `True`           | Escape * characters                                    |
-| `wrap`              | bool | `False`          | Enable text wrapping                                   |
-| `wrap_width`        | int  | `80`             | Text wrap width                                        |
+| Option              | Type | Default          | Description                                                     |
+| ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
+| `extract_metadata`  | bool | `True`           | Extract document metadata as comment header                     |
+| `convert_as_inline` | bool | `False`          | Treat content as inline elements only                           |
+| `heading_style`     | str  | `'underlined'`   | Header style (`'underlined'`, `'atx'`, `'atx_closed'`)          |
+| `highlight_style`   | str  | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`)          |
+| `stream_processing` | bool | `False`          | Enable streaming for large documents                            |
+| `parser`            | str  | auto-detect      | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
+| `autolinks`         | bool | `True`           | Auto-convert URLs to Markdown links                             |
+| `bullets`           | str  | `'*+-'`          | Characters to use for bullet points                             |
+| `escape_asterisks`  | bool | `True`           | Escape * characters                                             |
+| `wrap`              | bool | `False`          | Enable text wrapping                                            |
+| `wrap_width`        | int  | `80`             | Text wrap width                                                 |
 For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
@@ -379,6 +405,17 @@ uv run python -m html_to_markdown input.html
 uv build
 ```
+## Performance
+The library is optimized for performance with several key features:
+- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
+- **Streaming support**: Process large documents in chunks to minimize memory usage
+- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
+- **Optimized string operations**: Minimizes string concatenations in hot paths
+Typical throughput: ~2 MB/s for regular processing on modern hardware.
 ## License
 This library uses the MIT license.
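
The README text added above says the parser is auto-detected, preferring lxml when it is installed. A minimal sketch of how such detection could work is shown below. It is an illustration only, not the library's implementation: `pick_parser` is a hypothetical helper, and the only names taken from the diff are the parser strings `"lxml"` and `"html.parser"`.

```python
import importlib.util
from typing import Optional


def pick_parser(preferred: Optional[str] = None) -> str:
    """Return a BeautifulSoup parser name (hypothetical helper).

    An explicit choice wins; otherwise prefer lxml when it can be imported
    and fall back to the standard-library parser.
    """
    if preferred is not None:
        return preferred
    # find_spec returns None when the top-level module is not installed.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


if __name__ == "__main__":
    print(pick_parser())               # depends on the environment
    print(pick_parser("html.parser"))  # explicit choice is passed through
```

The returned string could then be passed as the `parser` argument documented in the options table, for example `convert_to_markdown(html, parser=pick_parser())`.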

html_to_markdown-1.8.0.dist-info/RECORD ADDED

@@ -0,0 +1,16 @@
+html_to_markdown/__init__.py,sha256=TzZzhZDJHeXW_3B9zceYehz2zlttqdLsDr5un8stZLM,653
+html_to_markdown/__main__.py,sha256=DJyJX7NIK0BVPNS2r3BYJ0Ci_lKHhgVOpw7ZEqACH3c,323
+html_to_markdown/cli.py,sha256=8xlgSEcnqsSM_dr1TCSgPDAo09YvUtO78PvDFivFFdg,6973
+html_to_markdown/constants.py,sha256=8vqANd-7wYvDzBm1VXZvdIxS4Xom4Ov_Yghg6jvmyio,584
+html_to_markdown/converters.py,sha256=COC2KqPelJlMCY5eXUS5gdiPOG8Yzx0U719FeXPw3GA,55514
+html_to_markdown/exceptions.py,sha256=s1DaG6A23rOurF91e4jryuUzplWcC_JIAuK9_bw_4jQ,1558
+html_to_markdown/preprocessor.py,sha256=S4S1ZfLC_hkJVgmA5atImTyWQDOxfHctPbaep2QtyrQ,11248
+html_to_markdown/processing.py,sha256=wkbhLg42U3aeVQSZFuzGt5irtN037XzRKpCE71QYZXI,36520
+html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+html_to_markdown/utils.py,sha256=QgWPzmpZKFd6wDTe8IY3gbVT3xNzoGV3PBgd17J0O-w,2066
+html_to_markdown-1.8.0.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
+html_to_markdown-1.8.0.dist-info/METADATA,sha256=6pgiK4p0A77axLfD8MH1EGgzifP06koVV8KWS_5-iYk,17175
+html_to_markdown-1.8.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+html_to_markdown-1.8.0.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
+html_to_markdown-1.8.0.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
+html_to_markdown-1.8.0.dist-info/RECORD,,
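
Each RECORD row has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. The sketch below shows how a row could be checked against file contents; `record_digest` and `matches_record_row` are hypothetical helper names, and the only value taken from the listing above is the `py.typed` row (size 0).

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Encode the SHA-256 of `data` the way wheel RECORD files store it."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def matches_record_row(row: str, data: bytes) -> bool:
    """Check file bytes against one `path,sha256=...,size` row."""
    _path, hash_field, size_field = row.rsplit(",", 2)
    algorithm, _, digest = hash_field.partition("=")
    return (
        algorithm == "sha256"
        and digest == record_digest(data)
        and int(size_field) == len(data)
    )


if __name__ == "__main__":
    row = "html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
    print(matches_record_row(row, b""))  # py.typed is an empty file
```

The final `RECORD,,` row has empty hash and size fields because a file cannot contain its own hash, so this check would not apply to it.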

html_to_markdown-1.5.0.dist-info/RECORD DELETED

@@ -1,14 +0,0 @@
-html_to_markdown/__init__.py,sha256=ZfPBBhhxQJTFQiOX-5OtgSMP2xFs5UUJeYmLL-AawoQ,265
-html_to_markdown/__main__.py,sha256=DJyJX7NIK0BVPNS2r3BYJ0Ci_lKHhgVOpw7ZEqACH3c,323
-html_to_markdown/cli.py,sha256=WzQVr97jKECEZwW-xIJofSl3v4EhqU-De7XRQjmgc08,7179
-html_to_markdown/constants.py,sha256=8vqANd-7wYvDzBm1VXZvdIxS4Xom4Ov_Yghg6jvmyio,584
-html_to_markdown/converters.py,sha256=xEVT0rQGWBU4V-HBF7Mmm-2XGPB1cboAmKlF6vcxS4k,59456
-html_to_markdown/processing.py,sha256=nqpPiRZu5B--E9dJ9AOwH2r1alg-ynv7ie63rtIb9Ls,28661
-html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-html_to_markdown/utils.py,sha256=HJUDej5HSpXRtYv-OkCyD0hwnPnVfQCwY6rBRlIOt9s,1989
-html_to_markdown-1.5.0.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
-html_to_markdown-1.5.0.dist-info/METADATA,sha256=nGVi7PSapoEUNTn5WGBW2g744dZTxaXCcFxl_ILeb9s,15641
-html_to_markdown-1.5.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
-html_to_markdown-1.5.0.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
-html_to_markdown-1.5.0.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
-html_to_markdown-1.5.0.dist-info/RECORD,,
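
Comparing the two RECORD files is a quick way to see which packaged files were added or modified between releases. The sketch below does this for a four-row excerpt copied from the listings above; `parse_record` and `compare_records` are hypothetical helpers, and a full comparison would read both RECORD files in their entirety.

```python
import csv

# Excerpts copied from the 1.5.0 and 1.8.0 RECORD listings in this diff.
OLD_RECORD = """\
html_to_markdown/constants.py,sha256=8vqANd-7wYvDzBm1VXZvdIxS4Xom4Ov_Yghg6jvmyio,584
html_to_markdown/utils.py,sha256=HJUDej5HSpXRtYv-OkCyD0hwnPnVfQCwY6rBRlIOt9s,1989
"""

NEW_RECORD = """\
html_to_markdown/constants.py,sha256=8vqANd-7wYvDzBm1VXZvdIxS4Xom4Ov_Yghg6jvmyio,584
html_to_markdown/exceptions.py,sha256=s1DaG6A23rOurF91e4jryuUzplWcC_JIAuK9_bw_4jQ,1558
html_to_markdown/preprocessor.py,sha256=S4S1ZfLC_hkJVgmA5atImTyWQDOxfHctPbaep2QtyrQ,11248
html_to_markdown/utils.py,sha256=QgWPzmpZKFd6wDTe8IY3gbVT3xNzoGV3PBgd17J0O-w,2066
"""


def parse_record(text: str) -> dict:
    """Map each path to its (hash, size) pair."""
    rows = csv.reader(text.splitlines())
    return {row[0]: (row[1], row[2]) for row in rows if row}


def compare_records(old: dict, new: dict):
    """Return sorted lists of added, removed, and changed paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed


if __name__ == "__main__":
    added, removed, changed = compare_records(
        parse_record(OLD_RECORD), parse_record(NEW_RECORD)
    )
    print("added:", added)
    print("removed:", removed)
    print("changed:", changed)
```

For this excerpt the result agrees with the file list at the top of the page: `exceptions.py` and `preprocessor.py` are new, `utils.py` changed, and `constants.py` is identical in both releases.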

{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/WHEEL RENAMED

File without changes

{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/entry_points.txt RENAMED

File without changes

{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/licenses/LICENSE RENAMED

File without changes

{html_to_markdown-1.5.0.dist-info → html_to_markdown-1.8.0.dist-info}/top_level.txt RENAMED

File without changes
