html-to-markdown 1.0.0__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.



@@ -0,0 +1,101 @@
+ Metadata-Version: 2.3
+ Name: html-to-markdown
+ Version: 1.1.0
+ Summary: Convert HTML to markdown
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+ License: MIT
+ License-File: LICENSE
+ Keywords: beautifulsoup,converter,html,markdown,text-processing
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Text Processing
+ Classifier: Topic :: Text Processing :: Markup
+ Classifier: Topic :: Text Processing :: Markup :: HTML
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
+ Classifier: Topic :: Utilities
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.9
+ Requires-Dist: beautifulsoup4>=4.12.3
+ Description-Content-Type: text/markdown
+
+ # html_to_markdown
+
+ This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
+ Python 3.9 and above.
+
+ ### Differences with the Markdownify
+
+ - The refactored codebase uses a strict functional approach - no classes are involved.
+ - There is full typing with strict MyPy strict adherence and a py.typed file included.
+ - The `convert_to_markdown` function allows passing a pre-configured instance of `BeautifulSoup` instead of html.
+ - This library releases follows standard semver. Its version v1.0.0 was branched from markdownify's v0.13.1, at which
+ point versioning is no longer aligned.
+
+ ## Installation
+
+ ```shell
+ pip install html_to_markdown
+ ```
+
+ ## Usage
+
+ Convert an string HTML to Markdown:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
+ ```
+
+ Or pass a pre-configured instance of `BeautifulSoup`:
+
+ ```python
+ from bs4 import BeautifulSoup
+ from html_to_markdown import convert_to_markdown
+
+ soup = BeautifulSoup('<b>Yay</b> <a href="http://github.com">GitHub</a>', 'lxml') # lxml requires an extra dependency.
+
+ convert_to_markdown(soup) # > '**Yay** [GitHub](http://github.com)'
+ ```
+
+ ### Options
+
+ The `convert_to_markdown` function accepts the following kwargs:
+
+ - autolinks (bool): Automatically convert valid URLs into Markdown links. Defaults to True.
+ - bullets (str): A string of characters to use for bullet points in lists. Defaults to '*+-'.
+ - code_language (str): Default language identifier for fenced code blocks. Defaults to an empty string.
+ - code_language_callback (Callable[[Any], str] | None): Function to dynamically determine the language for code blocks.
+ - convert (Iterable[str] | None): A list of tag names to convert to Markdown. If None, all supported tags are converted.
+ - default_title (bool): Use the default title when converting certain elements (e.g., links). Defaults to False.
+ - escape_asterisks (bool): Escape asterisks (*) to prevent unintended Markdown formatting. Defaults to True.
+ - escape_misc (bool): Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
+ - escape_underscores (bool): Escape underscores (_) to prevent unintended italic formatting. Defaults to True.
+ - heading_style (Literal["underlined", "atx", "atx_closed"]): The style to use for Markdown headings. Defaults to "
+ underlined".
+ - keep_inline_images_in (Iterable[str] | None): Tags in which inline images should be preserved. Defaults to None.
+ - newline_style (Literal["spaces", "backslash"]): Style for handling newlines in text content. Defaults to "spaces".
+ - strip (Iterable[str] | None): Tags to strip from the output. Defaults to None.
+ - strong_em_symbol (Literal["*", "_"]): Symbol to use for strong/emphasized text. Defaults to "*".
+ - sub_symbol (str): Custom symbol for subscript text. Defaults to an empty string.
+ - sup_symbol (str): Custom symbol for superscript text. Defaults to an empty string.
+ - wrap (bool): Wrap text to the specified width. Defaults to False.
+ - wrap_width (int): The number of characters at which to wrap text. Defaults to 80.
+ - convert_as_inline (bool): Treat the content as inline elements (no block elements like paragraphs). Defaults to False.
+
+ ## CLI
+
+ For compatibility with the original markdownify, a CLI is provided. Use `html_to_markdown example.html > example.md` or
+ pipe input from stdin:
+
+ ```shell
+ cat example.html | html_to_markdown > example.md
+ ```
+
+ Use `html_to_markdown -h` to see all available options. They are the same as listed above and take the same arguments.
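Note on the `bullets` option in the hunk above: the CLI help text elsewhere in this diff says the bullet style alternates with nesting level. The sketch below illustrates one plausible reading of that, assuming the character is chosen by cycling through the string by depth. `pick_bullet` and `render_list` are hypothetical helpers written for this note, not functions of html_to_markdown.

```python
def pick_bullet(bullets: str, depth: int) -> str:
    """Cycle through the bullet characters by nesting depth (assumed rule)."""
    return bullets[depth % len(bullets)]


def render_list(items: list, bullets: str = "*+-", depth: int = 0) -> str:
    """Render a nested list of strings and sub-lists as Markdown bullet lines."""
    lines = []
    for item in items:
        if isinstance(item, list):
            # A nested list is rendered one level deeper.
            lines.append(render_list(item, bullets, depth + 1))
        else:
            lines.append(f"{'  ' * depth}{pick_bullet(bullets, depth)} {item}")
    return "\n".join(lines)


print(render_list(["a", ["b", ["c"]], "d"]))
```

With a single-character string such as `"*"`, the same bullet is used at every depth under this rule.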
@@ -0,0 +1,75 @@
+ # html_to_markdown
+
+ This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
+ Python 3.9 and above.
+
+ ### Differences with the Markdownify
+
+ - The refactored codebase uses a strict functional approach - no classes are involved.
+ - There is full typing with strict MyPy strict adherence and a py.typed file included.
+ - The `convert_to_markdown` function allows passing a pre-configured instance of `BeautifulSoup` instead of html.
+ - This library releases follows standard semver. Its version v1.0.0 was branched from markdownify's v0.13.1, at which
+ point versioning is no longer aligned.
+
+ ## Installation
+
+ ```shell
+ pip install html_to_markdown
+ ```
+
+ ## Usage
+
+ Convert an string HTML to Markdown:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
+ ```
+
+ Or pass a pre-configured instance of `BeautifulSoup`:
+
+ ```python
+ from bs4 import BeautifulSoup
+ from html_to_markdown import convert_to_markdown
+
+ soup = BeautifulSoup('<b>Yay</b> <a href="http://github.com">GitHub</a>', 'lxml') # lxml requires an extra dependency.
+
+ convert_to_markdown(soup) # > '**Yay** [GitHub](http://github.com)'
+ ```
+
+ ### Options
+
+ The `convert_to_markdown` function accepts the following kwargs:
+
+ - autolinks (bool): Automatically convert valid URLs into Markdown links. Defaults to True.
+ - bullets (str): A string of characters to use for bullet points in lists. Defaults to '*+-'.
+ - code_language (str): Default language identifier for fenced code blocks. Defaults to an empty string.
+ - code_language_callback (Callable[[Any], str] | None): Function to dynamically determine the language for code blocks.
+ - convert (Iterable[str] | None): A list of tag names to convert to Markdown. If None, all supported tags are converted.
+ - default_title (bool): Use the default title when converting certain elements (e.g., links). Defaults to False.
+ - escape_asterisks (bool): Escape asterisks (*) to prevent unintended Markdown formatting. Defaults to True.
+ - escape_misc (bool): Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
+ - escape_underscores (bool): Escape underscores (_) to prevent unintended italic formatting. Defaults to True.
+ - heading_style (Literal["underlined", "atx", "atx_closed"]): The style to use for Markdown headings. Defaults to "
+ underlined".
+ - keep_inline_images_in (Iterable[str] | None): Tags in which inline images should be preserved. Defaults to None.
+ - newline_style (Literal["spaces", "backslash"]): Style for handling newlines in text content. Defaults to "spaces".
+ - strip (Iterable[str] | None): Tags to strip from the output. Defaults to None.
+ - strong_em_symbol (Literal["*", "_"]): Symbol to use for strong/emphasized text. Defaults to "*".
+ - sub_symbol (str): Custom symbol for subscript text. Defaults to an empty string.
+ - sup_symbol (str): Custom symbol for superscript text. Defaults to an empty string.
+ - wrap (bool): Wrap text to the specified width. Defaults to False.
+ - wrap_width (int): The number of characters at which to wrap text. Defaults to 80.
+ - convert_as_inline (bool): Treat the content as inline elements (no block elements like paragraphs). Defaults to False.
+
+ ## CLI
+
+ For compatibility with the original markdownify, a CLI is provided. Use `html_to_markdown example.html > example.md` or
+ pipe input from stdin:
+
+ ```shell
+ cat example.html | html_to_markdown > example.md
+ ```
+
+ Use `html_to_markdown -h` to see all available options. They are the same as listed above and take the same arguments.
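Note on `heading_style` in the hunk above: the three accepted values correspond to the usual Markdown heading notations. The sketch below shows what each one looks like. `render_heading` is a hypothetical helper for illustration only. The fallback to ATX for underlined headings deeper than level 2 is an assumption, made because Setext underlines only exist for levels 1 and 2.

```python
def render_heading(text: str, level: int, style: str = "underlined") -> str:
    """Render a heading in one of the three documented styles (illustrative only)."""
    hashes = "#" * level
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    if style == "underlined" and level <= 2:
        # Setext underlines exist only for level 1 ("=") and level 2 ("-").
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    # Plain "atx", and the assumed fallback for deeper underlined headings.
    return f"{hashes} {text}"


for style in ("underlined", "atx", "atx_closed"):
    print(render_heading("Title", 1, style))
```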
@@ -0,0 +1,7 @@
+ import sys
+
+ from html_to_markdown.dli import cli
+
+ if __name__ == "__main__":
+     result = cli(sys.argv[1:])
+     print(result)  # noqa: T201
@@ -0,0 +1,150 @@
+ def main(argv: list[str]) -> str:
+     """Command-line entry point."""
+     from argparse import ArgumentParser, FileType
+     from sys import stdin
+
+     from html_to_markdown.constants import ASTERISK, ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERLINED, UNDERSCORE
+     from html_to_markdown.processing import convert_to_markdown
+
+     parser = ArgumentParser(
+         prog="html_to_markdown",
+         description="Converts HTML to Markdown.",
+     )
+
+     parser.add_argument(
+         "html",
+         nargs="?",
+         type=FileType("r"),
+         default=stdin,
+         help="The HTML file to convert. Defaults to STDIN if not provided.",
+     )
+
+     parser.add_argument(
+         "-s",
+         "--strip",
+         nargs="*",
+         help="A list of tags to strip from the conversion. Incompatible with the --convert option.",
+     )
+
+     parser.add_argument(
+         "-c",
+         "--convert",
+         nargs="*",
+         help="A list of HTML tags to explicitly convert. Incompatible with the --strip option.",
+     )
+
+     parser.add_argument(
+         "-a",
+         "--autolinks",
+         action="store_true",
+         help="Automatically convert anchor links where the content matches the href.",
+     )
+
+     parser.add_argument(
+         "--default-title",
+         action="store_false",
+         help="Use this flag to disable setting the link title to its href when no title is provided.",
+     )
+
+     parser.add_argument(
+         "--heading-style",
+         default=UNDERLINED,
+         choices=(ATX, ATX_CLOSED, UNDERLINED),
+         help="Defines the heading conversion style: 'atx', 'atx_closed', or 'underlined'. Defaults to 'underlined'.",
+     )
+
+     parser.add_argument(
+         "-b",
+         "--bullets",
+         default="*+-",
+         help="A string of bullet styles to use for list items. The style alternates based on nesting level. Defaults to '*+-'.",
+     )
+
+     parser.add_argument(
+         "--strong-em-symbol",
+         default=ASTERISK,
+         choices=(ASTERISK, UNDERSCORE),
+         help="Choose between '*' or '_' for strong and emphasized text. Defaults to '*'.",
+     )
+
+     parser.add_argument(
+         "--sub-symbol",
+         default="",
+         help="Define the characters used to surround <sub> text. Defaults to empty.",
+     )
+
+     parser.add_argument(
+         "--sup-symbol",
+         default="",
+         help="Define the characters used to surround <sup> text. Defaults to empty.",
+     )
+
+     parser.add_argument(
+         "--newline-style",
+         default=SPACES,
+         choices=(SPACES, BACKSLASH),
+         help="Specify the <br> conversion style: two spaces (default) or a backslash at the end of the line.",
+     )
+
+     parser.add_argument(
+         "--code-language",
+         default="",
+         help="Specify the default language for code blocks inside <pre> tags. Defaults to empty.",
+     )
+
+     parser.add_argument(
+         "--no-escape-asterisks",
+         dest="escape_asterisks",
+         action="store_false",
+         help="Disable escaping of '*' characters in text to '\\*'.",
+     )
+
+     parser.add_argument(
+         "--no-escape-underscores",
+         dest="escape_underscores",
+         action="store_false",
+         help="Disable escaping of '_' characters in text to '\\_'.",
+     )
+
+     parser.add_argument(
+         "-i",
+         "--keep-inline-images-in",
+         nargs="*",
+         help="Specify parent tags where inline images should be preserved as images, rather than converted to alt-text. Defaults to None.",
+     )
+
+     parser.add_argument(
+         "-w",
+         "--wrap",
+         action="store_true",
+         help="Enable word wrapping for paragraphs at --wrap-width characters.",
+     )
+
+     parser.add_argument(
+         "--wrap-width",
+         type=int,
+         default=80,
+         help="The number of characters at which text paragraphs should wrap. Defaults to 80.",
+     )
+
+     args = parser.parse_args(argv)
+
+     return convert_to_markdown(
+         args.html.read(),
+         strip=args.strip,
+         convert=args.convert,
+         autolinks=args.autolinks,
+         default_title=args.default_title,
+         heading_style=args.heading_style,
+         bullets=args.bullets,
+         strong_em_symbol=args.strong_em_symbol,
+         sub_symbol=args.sub_symbol,
+         sup_symbol=args.sup_symbol,
+         newline_style=args.newline_style,
+         code_language=args.code_language,
+         escape_asterisks=args.escape_asterisks,
+         escape_underscores=args.escape_underscores,
+         keep_inline_images_in=args.keep_inline_images_in,
+         wrap=args.wrap,
+         wrap_width=args.wrap_width,
+     )
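Note on the flag definitions in the hunk above: `--autolinks` uses `action="store_true"` and `--default-title` uses `action="store_false"`, so the values the CLI passes when neither flag is given are `False` and `True` respectively. That appears to be the opposite of the keyword defaults listed in the README (`autolinks` True, `default_title` False). The standalone sketch below reproduces only those two arguments with the standard `argparse` module; the `prog` name is made up.

```python
from argparse import ArgumentParser

# Only the two boolean flags from the hunk, copied with the same actions.
parser = ArgumentParser(prog="flag_demo")
parser.add_argument("-a", "--autolinks", action="store_true")
parser.add_argument("--default-title", action="store_false")

no_flags = parser.parse_args([])
both_flags = parser.parse_args(["-a", "--default-title"])

# store_true defaults to False; store_false defaults to True.
print(no_flags.autolinks, no_flags.default_title)
print(both_flags.autolinks, both_flags.default_title)
```

Whether this difference between CLI and function defaults is intended is not stated in the diff.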
@@ -55,7 +55,7 @@ SupportedElements = Literal[
      "kbd",
  ]

- ConvertsMap = Mapping[SupportedElements, Callable[[str, Tag], str]]
+ ConverterssMap = Mapping[SupportedElements, Callable[[str, Tag], str]]

  T = TypeVar("T")

@@ -147,9 +147,9 @@ def _convert_hn(


  def _convert_img(*, tag: Tag, convert_as_inline: bool, keep_inline_images_in: Iterable[str] | None) -> str:
-     alt = tag.attrs.get("alt", None) or ""
-     src = tag.attrs.get("src", None) or ""
-     title = tag.attrs.get("title", None) or ""
+     alt = tag.attrs.get("alt", "")
+     src = tag.attrs.get("src", "")
+     title = tag.attrs.get("title", "")
      title_part = ' "{}"'.format(title.replace('"', r"\"")) if title else ""
      parent_name = tag.parent.name if tag.parent else ""
      if convert_as_inline and parent_name not in (keep_inline_images_in or []):
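Note on the `_convert_img` change above: the two lookup patterns agree when the key is missing, but differ when the key is present with a `None` value. The sketch below shows only the dictionary semantics on plain dicts. Whether BeautifulSoup ever stores `None` for an attribute depends on the parser and on how the tree was built, so this is not a claim that the change alters behavior in practice. `old_lookup` and `new_lookup` are hypothetical names.

```python
def old_lookup(attrs: dict, key: str) -> str:
    """1.0.0 pattern: a missing, None, or empty value all become ""."""
    return attrs.get(key, None) or ""


def new_lookup(attrs: dict, key: str):
    """1.1.0 pattern: only a missing key falls back to ""."""
    return attrs.get(key, "")


missing = {}
explicit_none = {"title": None}  # hypothetical input, e.g. an attribute set by hand

print(repr(old_lookup(missing, "title")), repr(new_lookup(missing, "title")))
print(repr(old_lookup(explicit_none, "title")), repr(new_lookup(explicit_none, "title")))
```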
@@ -295,7 +295,7 @@ def create_converters_map(
      sup_symbol: str,
      wrap: bool,
      wrap_width: int,
- ) -> ConvertsMap:
+ ) -> ConverterssMap:
      """Create a mapping of HTML elements to their corresponding conversion functions.

      Args:
@@ -11,7 +11,7 @@ from html_to_markdown.constants import (
      html_heading_re,
      whitespace_re,
  )
- from html_to_markdown.converters import ConvertsMap, create_converters_map
+ from html_to_markdown.converters import ConverterssMap, create_converters_map
  from html_to_markdown.utils import escape

  if TYPE_CHECKING:
@@ -76,45 +76,15 @@ def _is_nested_tag(el: PageElement) -> bool:

  def _process_tag(
      tag: Tag,
+     converters_map: ConverterssMap,
      *,
-     autolinks: bool,
-     bullets: str,
-     code_language: str,
-     code_language_callback: Callable[[Any], str] | None,
      convert: Iterable[str] | None,
      convert_as_inline: bool = False,
-     converters_map: ConvertsMap | None = None,
-     default_title: bool,
      escape_asterisks: bool,
      escape_misc: bool,
      escape_underscores: bool,
-     heading_style: Literal["atx", "atx_closed", "underlined"],
-     keep_inline_images_in: Iterable[str] | None,
-     newline_style: str,
      strip: Iterable[str] | None,
-     strong_em_symbol: str,
-     sub_symbol: str,
-     sup_symbol: str,
-     wrap: bool,
-     wrap_width: int,
  ) -> str:
-     if converters_map is None:
-         converters_map = create_converters_map(
-             autolinks=autolinks,
-             bullets=bullets,
-             code_language=code_language,
-             code_language_callback=code_language_callback,
-             default_title=default_title,
-             heading_style=heading_style,
-             keep_inline_images_in=keep_inline_images_in,
-             newline_style=newline_style,
-             strong_em_symbol=strong_em_symbol,
-             sub_symbol=sub_symbol,
-             sup_symbol=sup_symbol,
-             wrap=wrap,
-             wrap_width=wrap_width,
-         )
-
      text = ""
      is_heading = html_heading_re.match(tag.name) is not None
      is_cell = tag.name in {"td", "th"}
@@ -141,27 +111,14 @@ def _process_tag(
              )
          elif isinstance(el, Tag):
              text += _process_tag(
-                 tag=el,
+                 el,
+                 converters_map,
                  convert_as_inline=convert_children_as_inline,
-                 strip=strip,
                  convert=convert,
-                 escape_misc=escape_misc,
                  escape_asterisks=escape_asterisks,
+                 escape_misc=escape_misc,
                  escape_underscores=escape_underscores,
-                 converters_map=converters_map,
-                 autolinks=autolinks,
-                 bullets=bullets,
-                 code_language=code_language,
-                 code_language_callback=code_language_callback,
-                 default_title=default_title,
-                 heading_style=heading_style,
-                 keep_inline_images_in=keep_inline_images_in,
-                 newline_style=newline_style,
-                 strong_em_symbol=strong_em_symbol,
-                 sub_symbol=sub_symbol,
-                 sup_symbol=sup_symbol,
-                 wrap=wrap,
-                 wrap_width=wrap_width,
+                 strip=strip,
              )

      tag_name: SupportedTag | None = cast(SupportedTag, tag.name.lower()) if tag.name.lower() in converters_map else None
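Note on the `_process_tag` refactor above: the formatting options are no longer threaded through every recursive call. The converters map is built once and passed down instead. The toy sketch below shows the same pattern on a made-up tree of `(tag_name, children)` tuples. `create_toy_converters` and `process` are hypothetical and far simpler than the library's functions.

```python
from __future__ import annotations

from typing import Callable, Mapping

Node = tuple  # (tag_name, children), where children is a list of Node or str
Converter = Callable[[str], str]


def create_toy_converters(*, strong_em_symbol: str) -> Mapping[str, Converter]:
    """Bake the formatting option into closures once, up front."""
    return {
        "b": lambda text: f"{strong_em_symbol * 2}{text}{strong_em_symbol * 2}",
        "i": lambda text: f"{strong_em_symbol}{text}{strong_em_symbol}",
    }


def process(node: Node, converters: Mapping[str, Converter]) -> str:
    """Recurse over children, passing the prebuilt map instead of every option."""
    name, children = node
    text = "".join(
        child if isinstance(child, str) else process(child, converters)
        for child in children
    )
    converter = converters.get(name)
    return converter(text) if converter else text


tree = ("p", ["Yay ", ("b", ["bold ", ("i", ["it"])])])
print(process(tree, create_toy_converters(strong_em_symbol="*")))
```

The recursive function only needs the map and the handful of options it uses directly, which is the shape the shortened `_process_tag` signature takes in this release.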
@@ -218,9 +175,8 @@ def _should_convert_tag(*, tag_name: str, strip: Iterable[str] | None, convert:


  def convert_to_markdown(
-     html: str,
+     source: str | BeautifulSoup,
      *,
-     soup: BeautifulSoup | None = None,
      autolinks: bool = True,
      bullets: str = "*+-",
      code_language: str = "",
@@ -244,55 +200,58 @@ def convert_to_markdown(
      """Convert HTML to Markdown.

      Args:
-         html: The HTML to convert.
-         soup: The BeautifulSoup object to convert.
-         autolinks: Whether to convert links to Markdown.
-         bullets: The bullet characters to use for unordered lists.
-         code_language: The default code language to use.
-         code_language_callback: A callback function to determine the code language.
-         convert: The HTML elements to convert.
-         default_title: Whether to use the default title.
-         escape_asterisks: Whether to escape asterisks.
-         escape_misc: Whether to escape miscellaneous characters.
-         escape_underscores: Whether to escape underscores.
-         heading_style: The style to use for headings.
-         keep_inline_images_in: The tags to keep inline images in.
-         newline_style: The style to use for newlines.
-         strip: The HTML elements to strip.
-         strong_em_symbol: The symbol to use for strong and emphasis.
-         sub_symbol: The symbol to use for subscript.
-         sup_symbol: The symbol to use for superscript.
-         wrap: Whether to wrap text.
-         wrap_width: The width to wrap text at.
-         convert_as_inline: Whether to convert elements as inline.
+         source: An HTML document or a an initialized instance of BeautifulSoup.
+         autolinks: Automatically convert valid URLs into Markdown links. Defaults to True.
+         bullets: A string of characters to use for bullet points in lists. Defaults to '*+-'.
+         code_language: Default language identifier for fenced code blocks. Defaults to an empty string.
+         code_language_callback: Function to dynamically determine the language for code blocks.
+         convert: A list of tag names to convert to Markdown. If None, all supported tags are converted.
+         default_title: Use the default title when converting certain elements (e.g., links). Defaults to False.
+         escape_asterisks: Escape asterisks (*) to prevent unintended Markdown formatting. Defaults to True.
+         escape_misc: Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
+         escape_underscores: Escape underscores (_) to prevent unintended italic formatting. Defaults to True.
+         heading_style: The style to use for Markdown headings. Defaults to "underlined".
+         keep_inline_images_in: Tags in which inline images should be preserved. Defaults to None.
+         newline_style: Style for handling newlines in text content. Defaults to "spaces".
+         strip: Tags to strip from the output. Defaults to None.
+         strong_em_symbol: Symbol to use for strong/emphasized text. Defaults to "*".
+         sub_symbol: Custom symbol for subscript text. Defaults to an empty string.
+         sup_symbol: Custom symbol for superscript text. Defaults to an empty string.
+         wrap: Wrap text to the specified width. Defaults to False.
+         wrap_width: The number of characters at which to wrap text. Defaults to 80.
+         convert_as_inline: Treat the content as inline elements (no block elements like paragraphs). Defaults to False.

      Returns:
-         The Markdown.
+         str: A string of Markdown-formatted text converted from the given HTML.
      """
-     if soup is None:
+     if isinstance(source, str):
          from bs4 import BeautifulSoup

-         soup = BeautifulSoup(html, "html.parser")
+         source = BeautifulSoup(source, "html.parser")

-     return _process_tag(
+     converters_map = create_converters_map(
          autolinks=autolinks,
          bullets=bullets,
          code_language=code_language,
          code_language_callback=code_language_callback,
-         convert=convert,
-         convert_as_inline=convert_as_inline,
          default_title=default_title,
-         escape_asterisks=escape_asterisks,
-         escape_misc=escape_misc,
-         escape_underscores=escape_underscores,
          heading_style=heading_style,
          keep_inline_images_in=keep_inline_images_in,
          newline_style=newline_style,
-         strip=strip,
          strong_em_symbol=strong_em_symbol,
          sub_symbol=sub_symbol,
          sup_symbol=sup_symbol,
-         tag=soup,
          wrap=wrap,
          wrap_width=wrap_width,
      )
+
+     return _process_tag(
+         source,
+         converters_map,
+         convert=convert,
+         convert_as_inline=convert_as_inline,
+         escape_asterisks=escape_asterisks,
+         escape_misc=escape_misc,
+         escape_underscores=escape_underscores,
+         strip=strip,
+     )
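Note on the `convert_to_markdown` signature change above: the separate `html` and `soup` parameters are merged into one `source` parameter, and an `isinstance` check decides whether parsing is still needed. The sketch below shows that dispatch pattern using only the standard library. `TagCollector` is a hypothetical stand-in for a pre-parsed document (it is not BeautifulSoup), and `tag_names` is a made-up function.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical stand-in for a pre-parsed document; records start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        self.tags.append(tag)


def tag_names(source: str | TagCollector) -> list[str]:
    """Accept raw HTML or an already-fed collector, mirroring the `source` parameter."""
    if isinstance(source, str):
        # Raw markup: parse it here, as the str branch in the hunk does.
        collector = TagCollector()
        collector.feed(source)
        source = collector
    return source.tags


print(tag_names('<b>Yay</b> <a href="http://github.com">GitHub</a>'))
```

Callers who passed `soup=` as a keyword in 1.0.0 would, judging by the new signature, pass the object as the first positional argument instead.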
@@ -1,6 +1,6 @@
  [project]
  name = "html-to-markdown"
- version = "1.0.0"
+ version = "1.1.0"
  description = "Convert HTML to markdown"
  authors = [{ name = "Na'aman Hirschfeld", email = "nhirschfeld@gmail.com" }]
  requires-python = ">=3.9"
@@ -95,7 +95,7 @@ lint.ignore = [
  src = ["html_to_markdown", "tests"]

  [tool.ruff.lint.per-file-ignores]
- "tests/**/*.*" = ["S", "D", "PT006", "PT013", "PD"]
+ "tests/**/*.*" = ["S", "D", "PT006", "PT013", "PD", "ARG"]

  [tool.ruff.format]
  docstring-code-format = true
@@ -1,194 +0,0 @@
- Metadata-Version: 2.3
- Name: html-to-markdown
- Version: 1.0.0
- Summary: Convert HTML to markdown
- Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
- License: MIT
- License-File: LICENSE
- Keywords: beautifulsoup,converter,html,markdown,text-processing
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Programming Language :: Python :: 3.13
- Classifier: Topic :: Text Processing
- Classifier: Topic :: Text Processing :: Markup
- Classifier: Topic :: Text Processing :: Markup :: HTML
- Classifier: Topic :: Text Processing :: Markup :: Markdown
- Classifier: Topic :: Utilities
- Classifier: Typing :: Typed
- Requires-Python: >=3.9
- Requires-Dist: beautifulsoup4>=4.12.3
- Description-Content-Type: text/markdown
-
- # html_to_markdown
-
- This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
- Python 3.9 and offering strong typing.
-
- ### Differences from the Markdownify
-
- - The refactored codebase uses a strict functional approach - no classes are involved.
- - There is full typing with strict MyPy adherence in place.
- - The `convert_to_markdown` allows passing a pre-configured instance of `Beautifulsoup`.
- - This library releases follows standard semver. Its version v1.0.0 was branched from markdownify's v0.13.1, at which
- point versioning is no longer aligned.
-
- ## Installation
-
- ```shell
- pip install html_to_markdown
- ```
-
- ## Usage
-
- Convert some HTML to Markdown:
-
- ```python
- from html_to_markdown import convert_to_markdown
-
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
- ```
-
- Specify tags to exclude:
-
- ```python
- from html_to_markdown import convert_to_markdown
-
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
- ```
-
- \...or specify the tags you want to include:
-
- ```python
- from html_to_markdown import convert_to_markdown
-
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b']) # > '**Yay** GitHub'
- ```
-
- # Options
-
- html_to_markdown supports the following options:
-
- strip
-
- : A list of tags to strip. This option can\'t be used with the
- `convert` option.
-
- convert
-
- : A list of tags to convert. This option can\'t be used with the
- `strip` option.
-
- autolinks
-
- : A boolean indicating whether the \"automatic link\" style should be
- used when a `a` tag\'s contents match its href. Defaults to `True`.
-
- default_title
-
- : A boolean to enable setting the title of a link to its href, if no
- title is given. Defaults to `False`.
-
- heading_style
-
- : Defines how headings should be converted. Accepted values are `ATX`,
- `ATX_CLOSED`, `SETEXT`, and `UNDERLINED` (which is an alias for
- `SETEXT`). Defaults to `UNDERLINED`.
-
- bullets
-
- : An iterable (string, list, or tuple) of bullet styles to be used. If
- the iterable only contains one item, it will be used regardless of
- how deeply lists are nested. Otherwise, the bullet will alternate
- based on nesting level. Defaults to `'*+-'`.
-
- strong_em_symbol
-
- : In markdown, both `*` and `_` are used to encode **strong** or
- *emphasized* texts. Either of these symbols can be chosen by the
- options `ASTERISK` (default) or `UNDERSCORE` respectively.
-
- sub_symbol, sup_symbol
-
- : Define the chars that surround `<sub>` and `<sup>` text. Defaults to
- an empty string, because this is non-standard behavior. Could be
- something like `~` and `^` to result in `~sub~` and `^sup^`. If the
- value starts with `<` and ends with `>`, it is treated as an HTML
- tag and a `/` is inserted after the `<` in the string used after the
- text; this allows specifying `<sub>` to use raw HTML in the output
- for subscripts, for example.
-
- newline_style
-
- : Defines the style of marking linebreaks (`<br>`) in markdown. The
- default value `SPACES` of this option will adopt the usual two
- spaces and a newline, while `BACKSLASH` will convert a linebreak to
- `\\n` (a backslash and a newline). While the latter convention is
- non-standard, it is commonly preferred and supported by a lot of
- interpreters.
-
- code_language
-
- : Defines the language that should be assumed for all `<pre>`
- sections. Useful, if all code on a page is in the same programming
- language and should be annotated with ``[python]{.title-ref}[ or
- similar. Defaults to ]{.title-ref}[\'\']{.title-ref}\` (empty
- string) and can be any string.
-
- code_language_callback
-
- : When the HTML code contains `pre` tags that in some way provide the
- code language, for example as class, this callback can be used to
- extract the language from the tag and prefix it to the converted
- `pre` tag. The callback gets one single argument, an BeautifylSoup
- object, and returns a string containing the code language, or
- `None`. An example to use the class name as code language could be:
-
- def callback(el):
-     return el['class'][0] if el.has_attr('class') else None
-
- Defaults to `None`.
-
- escape_asterisks
-
- : If set to `False`, do not escape `*` to `\*` in text. Defaults to
- `True`.
-
- escape_underscores
-
- : If set to `False`, do not escape `_` to `\_` in text. Defaults to
- `True`.
-
- escape_misc
-
- : If set to `False`, do not escape miscellaneous punctuation
- characters that sometimes have Markdown significance in text.
- Defaults to `True`.
-
- keep_inline_images_in
-
- : Images are converted to their alt-text when the images are located
- inside headlines or table cells. If some inline images should be
- converted to markdown images instead, this option can be set to a
- list of parent tags that should be allowed to contain inline images,
- for example `['td']`. Defaults to an empty list.
-
- wrap, wrap_width
-
- : If `wrap` is set to `True`, all text paragraphs are wrapped at
- `wrap_width` characters. Defaults to `False` and `80`. Use with
- `newline_style=BACKSLASH` to keep line breaks in paragraphs.
-
- Options may be specified as kwargs to the `html_to_markdown` function, or as
- a nested `Options` class in `MarkdownConverter` subclasses.
-
- # CLI
-
- Use `html_to_markdown example.html > example.md` or pipe input from stdin
- (`cat example.html | html_to_markdown > example.md`). Call `html_to_markdown -h`
- to see all available options. They are the same as listed above and take
- the same arguments.
@@ -1,168 +0,0 @@
1
- # html_to_markdown
2
-
3
- This library is a refactored and modernized fork of [markdownify](https://pypi.org/project/markdownify/), supporting
4
- Python 3.9 and offering strong typing.
5
-
6
- ### Differences from the Markdownify
7
-
8
- - The refactored codebase uses a strict functional approach - no classes are involved.
9
- - There is full typing with strict MyPy adherence in place.
10
- - The `convert_to_markdown` allows passing a pre-configured instance of `Beautifulsoup`.
11
- - This library releases follows standard semver. Its version v1.0.0 was branched from markdownify's v0.13.1, at which
12
- point versioning is no longer aligned.
13
-
14
- ## Installation
15
-
16
- ```shell
17
- pip install html_to_markdown
18
- ```
19
-
20
- ## Usage
21
-
22
- Convert some HTML to Markdown:
23
-
24
- ```python
25
- from html_to_markdown import convert_to_markdown
26
-
27
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
28
- ```
29
-
30
- Specify tags to exclude:
31
-
32
- ```python
33
- from html_to_markdown import convert_to_markdown
34
-
35
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
36
- ```
37
-
38
- ...or specify the tags you want to include:
39
-
40
- ```python
41
- from html_to_markdown import convert_to_markdown
42
-
43
- convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b']) # > '**Yay** GitHub'
44
- ```
45
-
46
- # Options
47
-
48
- html_to_markdown supports the following options:
49
-
50
- strip
51
-
52
- : A list of tags to strip. This option can't be used with the
53
- `convert` option.
54
-
55
- convert
56
-
57
- : A list of tags to convert. This option can't be used with the
58
- `strip` option.
59
-
60
- autolinks
61
-
62
- : A boolean indicating whether the "automatic link" style should be
63
- used when an `a` tag's contents match its href. Defaults to `True`.
64
-
65
- default_title
66
-
67
- : A boolean to enable setting the title of a link to its href, if no
68
- title is given. Defaults to `False`.
69
-
70
- heading_style
71
-
72
- : Defines how headings should be converted. Accepted values are `ATX`,
73
- `ATX_CLOSED`, `SETEXT`, and `UNDERLINED` (which is an alias for
74
- `SETEXT`). Defaults to `UNDERLINED`.
75
-
76
- bullets
77
-
78
- : An iterable (string, list, or tuple) of bullet styles to be used. If
79
- the iterable only contains one item, it will be used regardless of
80
- how deeply lists are nested. Otherwise, the bullet will alternate
81
- based on nesting level. Defaults to `'*+-'`.
82
-
83
- strong_em_symbol
84
-
85
- : In markdown, both `*` and `_` are used to encode **strong** or
86
- *emphasized* texts. Either of these symbols can be chosen by the
87
- options `ASTERISK` (default) or `UNDERSCORE` respectively.
88
-
89
- sub_symbol, sup_symbol
90
-
91
- : Define the chars that surround `<sub>` and `<sup>` text. Defaults to
92
- an empty string, because this is non-standard behavior. Could be
93
- something like `~` and `^` to result in `~sub~` and `^sup^`. If the
94
- value starts with `<` and ends with `>`, it is treated as an HTML
95
- tag and a `/` is inserted after the `<` in the string used after the
96
- text; this allows specifying `<sub>` to use raw HTML in the output
97
- for subscripts, for example.
98
-
99
- newline_style
100
-
101
- : Defines the style of marking linebreaks (`<br>`) in markdown. The
102
- default value `SPACES` of this option will adopt the usual two
103
- spaces and a newline, while `BACKSLASH` will convert a linebreak to
104
- `\\n` (a backslash and a newline). While the latter convention is
105
- non-standard, it is commonly preferred and supported by a lot of
106
- interpreters.
107
-
108
- code_language
109
-
110
- : Defines the language that should be assumed for all `<pre>`
111
- sections. Useful if all code on a page is in the same programming
112
- language and should be annotated with `python` or similar.
113
- Defaults to `''` (empty
114
- string) and can be any string.
115
-
116
- code_language_callback
117
-
118
- : When the HTML code contains `pre` tags that in some way provide the
119
- code language, for example as class, this callback can be used to
120
- extract the language from the tag and prefix it to the converted
121
- `pre` tag. The callback takes a single argument, a BeautifulSoup
122
- object, and returns a string containing the code language, or
123
- `None`. An example to use the class name as code language could be:
124
-
125
- def callback(el):
126
-     return el['class'][0] if el.has_attr('class') else None
127
-
128
- Defaults to `None`.
129
-
130
- escape_asterisks
131
-
132
- : If set to `False`, do not escape `*` to `\*` in text. Defaults to
133
- `True`.
134
-
135
- escape_underscores
136
-
137
- : If set to `False`, do not escape `_` to `\_` in text. Defaults to
138
- `True`.
139
-
140
- escape_misc
141
-
142
- : If set to `False`, do not escape miscellaneous punctuation
143
- characters that sometimes have Markdown significance in text.
144
- Defaults to `True`.
145
-
146
- keep_inline_images_in
147
-
148
- : Images are converted to their alt-text when the images are located
149
- inside headlines or table cells. If some inline images should be
150
- converted to markdown images instead, this option can be set to a
151
- list of parent tags that should be allowed to contain inline images,
152
- for example `['td']`. Defaults to an empty list.
153
-
154
- wrap, wrap_width
155
-
156
- : If `wrap` is set to `True`, all text paragraphs are wrapped at
157
- `wrap_width` characters. Defaults to `False` and `80`. Use with
158
- `newline_style=BACKSLASH` to keep line breaks in paragraphs.
159
-
160
- Options may be specified as keyword arguments to the `convert_to_markdown`
161
- function.
162
-
163
- # CLI
164
-
165
- Use `html_to_markdown example.html > example.md` or pipe input from stdin
166
- (`cat example.html | html_to_markdown > example.md`). Call `html_to_markdown -h`
167
- to see all available options. They are the same as listed above and take
168
- the same arguments.
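The `sub_symbol` / `sup_symbol` rule described in the Options section of the README above (a value shaped like an HTML tag gets a `/` inserted after the `<` for the closing side) can be sketched as follows. The function names `closing_symbol` and `surround` are hypothetical and exist only for this illustration; the package's own implementation may differ.

```python
def closing_symbol(symbol: str) -> str:
    """Return the string placed after the text for a given sub/sup symbol.

    If the symbol looks like an HTML tag ("<...>"), insert "/" after the "<";
    otherwise the same symbol is reused on both sides.
    """
    if symbol.startswith("<") and symbol.endswith(">"):
        return "</" + symbol[1:]
    return symbol


def surround(text: str, symbol: str) -> str:
    """Wrap text in the opening symbol and its matching closing symbol."""
    return f"{symbol}{text}{closing_symbol(symbol)}"


print(surround("sub", "~"))    # ~sub~
print(surround("2", "<sub>"))  # <sub>2</sub>
```

With the default empty string, the text is returned unchanged, which matches the README's note that the default adds no surrounding characters.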
@@ -1,131 +0,0 @@
1
- import argparse
2
- import sys
3
-
4
- from html_to_markdown import convert_to_markdown
5
- from html_to_markdown.constants import ASTERISK, ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERLINED, UNDERSCORE
6
-
7
-
8
- def cli(argv: list[str]) -> None:
9
-     """Command-line interface for html_to_markdown."""
10
-     parser = argparse.ArgumentParser(
11
-         prog="html_to_markdown",
12
-         description="Converts html to markdown.",
13
-     )
14
-
15
-     parser.add_argument(
16
-         "html",
17
-         nargs="?",
18
-         type=argparse.FileType("r"),
19
-         default=sys.stdin,
20
-         help="The html file to convert. Defaults to STDIN if not " "provided.",
21
-     )
22
-     parser.add_argument(
23
-         "-s",
24
-         "--strip",
25
-         nargs="*",
26
-         help="A list of tags to strip. This option can't be used with " "the --convert option.",
27
-     )
28
-     parser.add_argument(
29
-         "-c",
30
-         "--convert",
31
-         nargs="*",
32
-         help="A list of tags to convert. This option can't be used with " "the --strip option.",
33
-     )
34
-     parser.add_argument(
35
-         "-a",
36
-         "--autolinks",
37
-         action="store_true",
38
-         help="A boolean indicating whether the 'automatic link' style "
39
-         "should be used when an 'a' tag's contents match its href.",
40
-     )
41
-     parser.add_argument(
42
-         "--default-title",
43
-         action="store_true",  # the flag enables the option; the documented default is False
44
-         help="A boolean to enable setting the title of a link to its " "href, if no title is given.",
45
-     )
46
-     parser.add_argument(
47
-         "--heading-style",
48
-         default=UNDERLINED,
49
-         choices=(ATX, ATX_CLOSED, UNDERLINED),
50
-         help="Defines how headings should be converted.",
51
-     )
52
-     parser.add_argument(
53
-         "-b",
54
-         "--bullets",
55
-         default="*+-",
56
-         help="A string of bullet styles to use; the bullet will " "alternate based on nesting level.",
57
-     )
58
-     (
59
-         parser.add_argument(
60
-             "--strong-em-symbol",
61
-             default=ASTERISK,
62
-             choices=(ASTERISK, UNDERSCORE),
63
-             help="Use * or _ to convert strong and italics text",
64
-         ),
65
-     )
66
-     parser.add_argument("--sub-symbol", default="", help="Define the chars that surround '<sub>'.")
67
-     parser.add_argument("--sup-symbol", default="", help="Define the chars that surround '<sup>'.")
68
-     parser.add_argument(
69
-         "--newline-style",
70
-         default=SPACES,
71
-         choices=(SPACES, BACKSLASH),
72
-         help="Defines the style of <br> conversions: two spaces "
73
-         "or backslash at the end of the line that should break.",
74
-     )
75
-     parser.add_argument(
76
-         "--code-language", default="", help="Defines the language that should be assumed for all " "'<pre>' sections."
77
-     )
78
-     parser.add_argument(
79
-         "--no-escape-asterisks",
80
-         dest="escape_asterisks",
81
-         action="store_false",
82
-         help="Do not escape '*' to '\\*' in text.",
83
-     )
84
-     parser.add_argument(
85
-         "--no-escape-underscores",
86
-         dest="escape_underscores",
87
-         action="store_false",
88
-         help="Do not escape '_' to '\\_' in text.",
89
-     )
90
-     parser.add_argument(
91
-         "-i",
92
-         "--keep-inline-images-in",
93
-         nargs="*",
94
-         help="Images are converted to their alt-text when the images are "
95
-         "located inside headlines or table cells. If some inline images "
96
-         "should be converted to markdown images instead, this option can "
97
-         "be set to a list of parent tags that should be allowed to "
98
-         "contain inline images.",
99
-     )
100
-     parser.add_argument(
101
-         "-w", "--wrap", action="store_true", help="Wrap all text paragraphs at --wrap-width characters."
102
-     )
103
-     parser.add_argument("--wrap-width", type=int, default=80)
104
-
105
-     args = parser.parse_args(argv)
106
-
107
-     result = convert_to_markdown(
108
-         args.html.read(),
109
-         strip=args.strip,
110
-         convert=args.convert,
111
-         autolinks=args.autolinks,
112
-         default_title=args.default_title,
113
-         heading_style=args.heading_style,
114
-         bullets=args.bullets,
115
-         strong_em_symbol=args.strong_em_symbol,
116
-         sub_symbol=args.sub_symbol,
117
-         sup_symbol=args.sup_symbol,
118
-         newline_style=args.newline_style,
119
-         code_language=args.code_language,
120
-         escape_asterisks=args.escape_asterisks,
121
-         escape_underscores=args.escape_underscores,
122
-         keep_inline_images_in=args.keep_inline_images_in,
123
-         wrap=args.wrap,
124
-         wrap_width=args.wrap_width,
125
-     )
126
-
127
-     print(result)  # noqa: T201
128
-
129
-
130
- if __name__ == "__main__":
131
-     cli(sys.argv[1:])
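The removed CLI module relies on two `argparse` patterns: `store_true` flags that default to `False`, and `--no-...` flags that use `dest` with `store_false` so the parsed attribute defaults to `True`. The standalone sketch below reproduces only three of those flags to show how they parse. The `build_parser` helper and the `demo` program name are hypothetical and not part of the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with three of the flags from the module above."""
    parser = argparse.ArgumentParser(prog="demo")
    # store_false with an explicit dest: attribute is True unless the flag is given.
    parser.add_argument("--no-escape-asterisks", dest="escape_asterisks", action="store_false")
    # store_true: attribute is False unless the flag is given.
    parser.add_argument("-w", "--wrap", action="store_true")
    parser.add_argument("--wrap-width", type=int, default=80)
    return parser


defaults = build_parser().parse_args([])
print(defaults.escape_asterisks, defaults.wrap, defaults.wrap_width)  # True False 80

custom = build_parser().parse_args(["--no-escape-asterisks", "-w", "--wrap-width", "60"])
print(custom.escape_asterisks, custom.wrap, custom.wrap_width)  # False True 60
```

Because a `store_true` flag can only switch an option on, an option whose documented default is `True` needs the `--no-...` form to be switchable from the command line.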