PyPI - html-to-markdown - Versions diffs - 1.0.0__tar.gz → 1.2.0__tar.gz - Mend

html-to-markdown 1.0.0__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (20)

html_to_markdown-1.2.0/PKG-INFO ADDED

@@ -0,0 +1,102 @@
+Metadata-Version: 2.4
+Name: html-to-markdown
+Version: 1.2.0
+Summary: Convert HTML to markdown
+Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+License: MIT
+License-File: LICENSE
+Keywords: converter,html,markdown,text-extraction,text-processing
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Text Processing
+Classifier: Topic :: Text Processing :: Markup
+Classifier: Topic :: Text Processing :: Markup :: HTML
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Classifier: Topic :: Utilities
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Requires-Dist: beautifulsoup4>=4.12.3
+Description-Content-Type: text/markdown
+# html_to_markdown
+This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
+Python 3.9 and above.
+### Differences from Markdownify
+- The refactored codebase uses a strict functional approach - no classes are involved.
+- The code is fully typed, adheres to strict MyPy settings, and includes a py.typed file.
+- The `convert_to_markdown` function allows passing a pre-configured instance of `BeautifulSoup` instead of html.
+- Releases of this library follow standard semver. Version v1.0.0 was branched from markdownify's v0.13.1, after which
+  the two projects' version numbers are no longer aligned.
+## Installation
+```shell
+pip install html_to_markdown
+```
+## Usage
+Convert an HTML string to Markdown:
+```python
+from html_to_markdown import convert_to_markdown
+convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'
+```
+Or pass a pre-configured instance of `BeautifulSoup`:
+```python
+from bs4 import BeautifulSoup
+from html_to_markdown import convert_to_markdown
+soup = BeautifulSoup('<b>Yay</b> <a href="http://github.com">GitHub</a>', 'lxml')  # lxml requires an extra dependency.
+convert_to_markdown(soup)  # > '**Yay** [GitHub](http://github.com)'
+```
+### Options
+The `convert_to_markdown` function accepts the following kwargs:
+- autolinks (bool): Automatically convert valid URLs into Markdown links. Defaults to True.
+- bullets (str): A string of characters to use for bullet points in lists. Defaults to '\*+-'.
+- code_language (str): Default language identifier for fenced code blocks. Defaults to an empty string.
+- code_language_callback (Callable[[Any], str] | None): Function to dynamically determine the language for code blocks.
+- convert (Iterable[str] | None): A list of tag names to convert to Markdown. If None, all supported tags are converted.
+- default_title (bool): Use the default title when converting certain elements (e.g., links). Defaults to False.
+- escape_asterisks (bool): Escape asterisks (\*) to prevent unintended Markdown formatting. Defaults to True.
+- escape_misc (bool): Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
+- escape_underscores (bool): Escape underscores (\_) to prevent unintended italic formatting. Defaults to True.
+- heading_style (Literal["underlined", "atx", "atx_closed"]): The style to use for Markdown headings. Defaults to
+  "underlined".
+- keep_inline_images_in (Iterable[str] | None): Tags in which inline images should be preserved. Defaults to None.
+- newline_style (Literal["spaces", "backslash"]): Style for handling newlines in text content. Defaults to "spaces".
+- strip (Iterable[str] | None): Tags to strip from the output. Defaults to None.
+- strong_em_symbol (Literal["\*", "\_"]): Symbol to use for strong/emphasized text. Defaults to "\*".
+- sub_symbol (str): Custom symbol for subscript text. Defaults to an empty string.
+- sup_symbol (str): Custom symbol for superscript text. Defaults to an empty string.
+- wrap (bool): Wrap text to the specified width. Defaults to False.
+- wrap_width (int): The number of characters at which to wrap text. Defaults to 80.
+- convert_as_inline (bool): Treat the content as inline elements (no block elements like paragraphs). Defaults to False.
+## CLI
+For compatibility with the original markdownify, a CLI is provided. Use `html_to_markdown example.html > example.md` or
+pipe input from stdin:
+```shell
+cat example.html | html_to_markdown > example.md
+```
+Use `html_to_markdown -h` to see all available options. They are the same as listed above and take the same arguments.

html_to_markdown-1.2.0/README.md ADDED

@@ -0,0 +1,75 @@
+# html_to_markdown
+This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
+Python 3.9 and above.
+### Differences from Markdownify
+- The refactored codebase uses a strict functional approach - no classes are involved.
+- The code is fully typed, adheres to strict MyPy settings, and includes a py.typed file.
+- The `convert_to_markdown` function allows passing a pre-configured instance of `BeautifulSoup` instead of html.
+- Releases of this library follow standard semver. Version v1.0.0 was branched from markdownify's v0.13.1, after which
+  the two projects' version numbers are no longer aligned.
+## Installation
+```shell
+pip install html_to_markdown
+```
+## Usage
+Convert an HTML string to Markdown:
+```python
+from html_to_markdown import convert_to_markdown
+convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'
+```
+Or pass a pre-configured instance of `BeautifulSoup`:
+```python
+from bs4 import BeautifulSoup
+from html_to_markdown import convert_to_markdown
+soup = BeautifulSoup('<b>Yay</b> <a href="http://github.com">GitHub</a>', 'lxml')  # lxml requires an extra dependency.
+convert_to_markdown(soup)  # > '**Yay** [GitHub](http://github.com)'
+```
+### Options
+The `convert_to_markdown` function accepts the following kwargs:
+- autolinks (bool): Automatically convert valid URLs into Markdown links. Defaults to True.
+- bullets (str): A string of characters to use for bullet points in lists. Defaults to '\*+-'.
+- code_language (str): Default language identifier for fenced code blocks. Defaults to an empty string.
+- code_language_callback (Callable[[Any], str] | None): Function to dynamically determine the language for code blocks.
+- convert (Iterable[str] | None): A list of tag names to convert to Markdown. If None, all supported tags are converted.
+- default_title (bool): Use the default title when converting certain elements (e.g., links). Defaults to False.
+- escape_asterisks (bool): Escape asterisks (\*) to prevent unintended Markdown formatting. Defaults to True.
+- escape_misc (bool): Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
+- escape_underscores (bool): Escape underscores (\_) to prevent unintended italic formatting. Defaults to True.
+- heading_style (Literal["underlined", "atx", "atx_closed"]): The style to use for Markdown headings. Defaults to
+  "underlined".
+- keep_inline_images_in (Iterable[str] | None): Tags in which inline images should be preserved. Defaults to None.
+- newline_style (Literal["spaces", "backslash"]): Style for handling newlines in text content. Defaults to "spaces".
+- strip (Iterable[str] | None): Tags to strip from the output. Defaults to None.
+- strong_em_symbol (Literal["\*", "\_"]): Symbol to use for strong/emphasized text. Defaults to "\*".
+- sub_symbol (str): Custom symbol for subscript text. Defaults to an empty string.
+- sup_symbol (str): Custom symbol for superscript text. Defaults to an empty string.
+- wrap (bool): Wrap text to the specified width. Defaults to False.
+- wrap_width (int): The number of characters at which to wrap text. Defaults to 80.
+- convert_as_inline (bool): Treat the content as inline elements (no block elements like paragraphs). Defaults to False.
+## CLI
+For compatibility with the original markdownify, a CLI is provided. Use `html_to_markdown example.html > example.md` or
+pipe input from stdin:
+```shell
+cat example.html | html_to_markdown > example.md
+```
+Use `html_to_markdown -h` to see all available options. They are the same as listed above and take the same arguments.

html_to_markdown-1.2.0/html_to_markdown/__init__.py ADDED

@@ -0,0 +1,5 @@
+from html_to_markdown.processing import convert_to_markdown
+from .legacy import Markdownify
+__all__ = ["Markdownify", "convert_to_markdown"]

html_to_markdown-1.2.0/html_to_markdown/__main__.py ADDED

@@ -0,0 +1,11 @@
+import sys
+if __name__ == "__main__":
+    from html_to_markdown.cli import main
+    try:
+        result = main(sys.argv[1:])
+        print(result)  # noqa: T201
+    except ValueError as e:
+        print(str(e), file=sys.stderr)  # noqa: T201
+        sys.exit(1)
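
The entry-point pattern in `__main__.py` (print the string returned by `main`, report a `ValueError` on stderr, exit with status 1) can be sketched in isolation. This is a minimal sketch rather than the package's code: `run`, `ok_main`, and `failing_main` are hypothetical names, and the stand-in functions perform no real conversion.

```python
import io
import sys
from typing import Callable, TextIO


def run(main: Callable[[list[str]], str], argv: list[str],
        out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> int:
    """Call main(argv); print its result, or report ValueError and return 1."""
    try:
        result = main(argv)
    except ValueError as e:
        print(str(e), file=err)
        return 1
    print(result, file=out)
    return 0


def ok_main(argv: list[str]) -> str:
    # Hypothetical stand-in for a CLI main() that returns Markdown text.
    return "**Yay**"


def failing_main(argv: list[str]) -> str:
    # Hypothetical stand-in that rejects its arguments.
    raise ValueError("invalid arguments")


# In-memory streams make the behavior easy to inspect.
out, err = io.StringIO(), io.StringIO()
ok_status = run(ok_main, [], out, err)
fail_status = run(failing_main, [], out, err)
```

Returning a status code instead of calling `sys.exit` directly is a choice made here so the sketch can be exercised without terminating the interpreter.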

html_to_markdown-1.2.0/html_to_markdown/cli.py ADDED

@@ -0,0 +1,150 @@
+def main(argv: list[str]) -> str:
+    """Command-line entry point."""
+    from argparse import ArgumentParser, FileType
+    from sys import stdin
+    from html_to_markdown.constants import ASTERISK, ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERLINED, UNDERSCORE
+    from html_to_markdown.processing import convert_to_markdown
+    parser = ArgumentParser(
+        prog="html_to_markdown",
+        description="Converts HTML to Markdown.",
+    )
+    parser.add_argument(
+        "html",
+        nargs="?",
+        type=FileType("r"),
+        default=stdin,
+        help="The HTML file to convert. Defaults to STDIN if not provided.",
+    )
+    parser.add_argument(
+        "-s",
+        "--strip",
+        nargs="*",
+        help="A list of tags to strip from the conversion. Incompatible with the --convert option.",
+    )
+    parser.add_argument(
+        "-c",
+        "--convert",
+        nargs="*",
+        help="A list of HTML tags to explicitly convert. Incompatible with the --strip option.",
+    )
+    parser.add_argument(
+        "-a",
+        "--autolinks",
+        action="store_true",
+        help="Automatically convert anchor links where the content matches the href.",
+    )
+    parser.add_argument(
+        "--default-title",
+        action="store_false",
+        help="Use this flag to disable setting the link title to its href when no title is provided.",
+    )
+    parser.add_argument(
+        "--heading-style",
+        default=UNDERLINED,
+        choices=(ATX, ATX_CLOSED, UNDERLINED),
+        help="Defines the heading conversion style: 'atx', 'atx_closed', or 'underlined'. Defaults to 'underlined'.",
+    )
+    parser.add_argument(
+        "-b",
+        "--bullets",
+        default="*+-",
+        help="A string of bullet styles to use for list items. The style alternates based on nesting level. Defaults to '*+-'.",
+    )
+    parser.add_argument(
+        "--strong-em-symbol",
+        default=ASTERISK,
+        choices=(ASTERISK, UNDERSCORE),
+        help="Choose between '*' or '_' for strong and emphasized text. Defaults to '*'.",
+    )
+    parser.add_argument(
+        "--sub-symbol",
+        default="",
+        help="Define the characters used to surround <sub> text. Defaults to empty.",
+    )
+    parser.add_argument(
+        "--sup-symbol",
+        default="",
+        help="Define the characters used to surround <sup> text. Defaults to empty.",
+    )
+    parser.add_argument(
+        "--newline-style",
+        default=SPACES,
+        choices=(SPACES, BACKSLASH),
+        help="Specify the <br> conversion style: two spaces (default) or a backslash at the end of the line.",
+    )
+    parser.add_argument(
+        "--code-language",
+        default="",
+        help="Specify the default language for code blocks inside <pre> tags. Defaults to empty.",
+    )
+    parser.add_argument(
+        "--no-escape-asterisks",
+        dest="escape_asterisks",
+        action="store_false",
+        help="Disable escaping of '*' characters in text to '\\*'.",
+    )
+    parser.add_argument(
+        "--no-escape-underscores",
+        dest="escape_underscores",
+        action="store_false",
+        help="Disable escaping of '_' characters in text to '\\_'.",
+    )
+    parser.add_argument(
+        "-i",
+        "--keep-inline-images-in",
+        nargs="*",
+        help="Specify parent tags where inline images should be preserved as images, rather than converted to alt-text. Defaults to None.",
+    )
+    parser.add_argument(
+        "-w",
+        "--wrap",
+        action="store_true",
+        help="Enable word wrapping for paragraphs at --wrap-width characters.",
+    )
+    parser.add_argument(
+        "--wrap-width",
+        type=int,
+        default=80,
+        help="The number of characters at which text paragraphs should wrap. Defaults to 80.",
+    )
+    args = parser.parse_args(argv)
+    return convert_to_markdown(
+        args.html.read(),
+        strip=args.strip,
+        convert=args.convert,
+        autolinks=args.autolinks,
+        default_title=args.default_title,
+        heading_style=args.heading_style,
+        bullets=args.bullets,
+        strong_em_symbol=args.strong_em_symbol,
+        sub_symbol=args.sub_symbol,
+        sup_symbol=args.sup_symbol,
+        newline_style=args.newline_style,
+        code_language=args.code_language,
+        escape_asterisks=args.escape_asterisks,
+        escape_underscores=args.escape_underscores,
+        keep_inline_images_in=args.keep_inline_images_in,
+        wrap=args.wrap,
+        wrap_width=args.wrap_width,
+    )
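
Two of the flags in `cli.py` rely on argparse's implicit defaults: `store_true` defaults to `False` and `store_false` defaults to `True`. As far as this diff shows, that means the CLI passes `autolinks=False` and `default_title=True` when neither flag is given, whereas the README lists `True` and `False` as the function defaults. The sketch below reproduces only those two arguments with the standard library. It illustrates argparse behavior, not the full parser.

```python
from argparse import ArgumentParser

# Only two of the flags from cli.py are reproduced here.
parser = ArgumentParser(prog="html_to_markdown")
parser.add_argument("-a", "--autolinks", action="store_true")
parser.add_argument("--default-title", action="store_false")

# With no flags, store_true yields False and store_false yields True.
defaults = parser.parse_args([])

# Passing both flags flips each value.
flipped = parser.parse_args(["--autolinks", "--default-title"])
```

Under these defaults, passing `--default-title` turns the option off, which matches the flag's help text.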

html_to_markdown-1.2.0/html_to_markdown/constants.py ADDED

@@ -0,0 +1,18 @@
+from __future__ import annotations
+import re
+from re import Pattern
+from typing import Final
+convert_heading_re: Final[Pattern[str]] = re.compile(r"convert_h(\d+)")
+line_beginning_re: Final[Pattern[str]] = re.compile(r"^", re.MULTILINE)
+whitespace_re: Final[Pattern[str]] = re.compile(r"[\t ]+")
+html_heading_re: Final[Pattern[str]] = re.compile(r"h[1-6]")
+ASTERISK: Final = "*"
+ATX: Final = "atx"
+ATX_CLOSED: Final = "atx_closed"
+BACKSLASH: Final = "backslash"
+UNDERLINED: Final = "underlined"
+SPACES: Final = "spaces"
+UNDERSCORE: Final = "_"
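
The four compiled patterns in `constants.py` can be exercised on their own. The patterns below are copied from the file, but how the library applies them is not shown in this diff, so the uses here (extracting a heading level, prefixing lines, collapsing whitespace) are assumptions for illustration.

```python
import re

# Same patterns as in constants.py.
convert_heading_re = re.compile(r"convert_h(\d+)")
line_beginning_re = re.compile(r"^", re.MULTILINE)
whitespace_re = re.compile(r"[\t ]+")
html_heading_re = re.compile(r"h[1-6]")

# Capture the heading level from a converter-style name.
level = int(convert_heading_re.match("convert_h3").group(1))

# Prefix every line, e.g. to quote or indent a block of text.
quoted = line_beginning_re.sub("> ", "first\nsecond")

# Collapse runs of tabs and spaces into a single space; newlines are kept.
collapsed = whitespace_re.sub(" ", "a \t  b\nc")

# Check whether a tag name is one of h1..h6.
is_heading = html_heading_re.fullmatch("h2") is not None
not_heading = html_heading_re.fullmatch("h7") is not None
```

`fullmatch` is used for the tag-name check because `match` only anchors at the start and would accept a longer name that merely begins with `h1`.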

{html_to_markdown-1.0.0 → html_to_markdown-1.2.0}/html_to_markdown/converters.py RENAMED

@@ -55,17 +55,19 @@ SupportedElements = Literal[
     "kbd",
 ]
-ConvertsMap = Mapping[SupportedElements, Callable[[str, Tag], str]]
+ConvertersMap = Mapping[SupportedElements, Callable[[str, Tag], str]]
 T = TypeVar("T")
 def _create_inline_converter(markup_prefix: str) -> Callable[[Tag, str], str]:
-    """This abstracts all simple inline tags like b, em, del, ...
-    Returns a function that wraps the chomped text in a pair of the string
-    that is returned by markup_fn, with '/' inserted in the string used after
-    the text if it looks like an HTML tag. markup_fn is necessary to allow for
-    references to self.strong_em_symbol etc.
+    """Create an inline converter for a markup pattern or tag.
+    Args:
+        markup_prefix: The markup prefix to insert.
+    Returns:
+        A function that can be used to convert HTML to Markdown.
     """
     def implementation(*, tag: Tag, text: str) -> str:
@@ -147,9 +149,9 @@ def _convert_hn(
 def _convert_img(*, tag: Tag, convert_as_inline: bool, keep_inline_images_in: Iterable[str] | None) -> str:
-    alt = tag.attrs.get("alt", None) or ""
-    src = tag.attrs.get("src", None) or ""
-    title = tag.attrs.get("title", None) or ""
+    alt = tag.attrs.get("alt", "")
+    src = tag.attrs.get("src", "")
+    title = tag.attrs.get("title", "")
     title_part = ' "{}"'.format(title.replace('"', r"\"")) if title else ""
     parent_name = tag.parent.name if tag.parent else ""
     if convert_as_inline and parent_name not in (keep_inline_images_in or []):
@@ -295,7 +297,7 @@ def create_converters_map(
     sup_symbol: str,
     wrap: bool,
     wrap_width: int,
-) -> ConvertsMap:
+) -> ConvertersMap:
     """Create a mapping of HTML elements to their corresponding conversion functions.
     Args:
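
The change in `_convert_img` from `.get(key, None) or ""` to `.get(key, "")` is not purely cosmetic for a mapping that can hold falsy values. The sketch below uses a plain `dict` as a stand-in for `tag.attrs`. Whether BeautifulSoup ever stores `None` for an attribute is not established by this diff, so the `None` case is a hypothetical input.

```python
# A plain dict stands in for tag.attrs; this is an assumption for illustration.
attrs_missing = {}
attrs_none = {"alt": None}

# 1.0.0 form: any falsy value, including an explicit None, becomes "".
old_missing = attrs_missing.get("alt", None) or ""
old_none = attrs_none.get("alt", None) or ""

# 1.2.0 form: the default applies only when the key is absent.
new_missing = attrs_missing.get("alt", "")
new_none = attrs_none.get("alt", "")
```

For a missing key both forms return an empty string. They differ only when the key is present with a value of `None`, which the newer form passes through unchanged.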

html_to_markdown-1.2.0/html_to_markdown/legacy.py ADDED

@@ -0,0 +1,89 @@
+from __future__ import annotations
+from typing import TYPE_CHECKING, Literal
+from html_to_markdown.constants import ASTERISK, SPACES, UNDERLINED
+from html_to_markdown.converters import create_converters_map
+if TYPE_CHECKING:
+    from collections.abc import Callable, Iterable
+    from bs4 import Tag
+def _create_legacy_class(
+    autolinks: bool,
+    bullets: str,
+    code_language: str,
+    code_language_callback: Callable[[Tag], str] | None,
+    default_title: bool,
+    heading_style: Literal["atx", "atx_closed", "underlined"],
+    keep_inline_images_in: Iterable[str] | None,
+    newline_style: str,
+    strong_em_symbol: str,
+    sub_symbol: str,
+    sup_symbol: str,
+    wrap: bool,
+    wrap_width: int,
+) -> type:
+    """Create a legacy class for Markdownify.
+    Deprecated: Use the new hooks api instead.
+    Args:
+        autolinks: Whether to convert URLs into links.
+        bullets: The bullet characters to use for unordered lists.
+        code_language: The default code language to use.
+        code_language_callback: A callback to get the code language.
+        default_title: Whether to use the URL as the title for links.
+        heading_style: The style of headings.
+        keep_inline_images_in: The tags to keep inline images in.
+        newline_style: The style of newlines.
+        strong_em_symbol: The symbol to use for strong and emphasis text.
+        sub_symbol: The symbol to use for subscript text.
+        sup_symbol: The symbol to use for superscript text.
+        wrap: Whether to wrap text.
+        wrap_width: The width to wrap text at.
+    Returns:
+        A class that can be used to convert HTML to Markdown.
+    """
+    return type(
+        "Markdownify",
+        (),
+        {
+            k.removeprefix("_"): v
+            for k, v in create_converters_map(
+                autolinks=autolinks,
+                bullets=bullets,
+                code_language=code_language,
+                code_language_callback=code_language_callback,
+                default_title=default_title,
+                heading_style=heading_style,
+                keep_inline_images_in=keep_inline_images_in,
+                newline_style=newline_style,
+                strong_em_symbol=strong_em_symbol,
+                sub_symbol=sub_symbol,
+                sup_symbol=sup_symbol,
+                wrap=wrap,
+                wrap_width=wrap_width,
+            ).items()
+        },
+    )
+Markdownify = _create_legacy_class(
+    autolinks=True,
+    bullets="*+-",
+    code_language="",
+    code_language_callback=None,
+    default_title=False,
+    heading_style=UNDERLINED,
+    keep_inline_images_in=None,
+    newline_style=SPACES,
+    strong_em_symbol=ASTERISK,
+    sub_symbol="",
+    sup_symbol="",
+    wrap=False,
+    wrap_width=80,
+)
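
`_create_legacy_class` builds its class with the three-argument form of `type()`, stripping a leading underscore from each key with `str.removeprefix` (available from Python 3.9). A minimal sketch of that technique follows. `make_legacy_class`, `_bold`, and the `"_convert_b"` key are hypothetical. The real mapping comes from `create_converters_map`, whose keys and function signatures are not fully visible in this diff.

```python
from typing import Callable, Mapping


def make_legacy_class(converters: Mapping[str, Callable[..., str]]) -> type:
    """Build a class whose attributes are the given functions, minus a leading '_'."""
    return type(
        "Markdownify",
        (),
        {name.removeprefix("_"): fn for name, fn in converters.items()},
    )


def _bold(*, text: str) -> str:
    # Hypothetical converter: wrap text in a strong-emphasis marker.
    return f"**{text}**"


# Hypothetical mapping; the real one comes from create_converters_map.
Legacy = make_legacy_class({"_convert_b": _bold})

# Accessed on the class, the function is a plain function (no self is bound).
result = Legacy.convert_b(text="Yay")
```

Calling the same attribute through an instance would pass the instance as a first positional argument, so keyword-only functions like the one in this sketch are only callable through the class itself.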
