html-to-markdown 1.3.3__tar.gz → 1.5.0__tar.gz

Potentially problematic release.

Files changed (26)
  1. html_to_markdown-1.5.0/PKG-INFO +436 -0
  2. html_to_markdown-1.5.0/README.md +400 -0
  3. html_to_markdown-1.5.0/html_to_markdown/__init__.py +6 -0
  4. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown/__main__.py +5 -2
  5. html_to_markdown-1.5.0/html_to_markdown/cli.py +236 -0
  6. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown/constants.py +1 -0
  7. html_to_markdown-1.5.0/html_to_markdown/converters.py +1922 -0
  8. html_to_markdown-1.5.0/html_to_markdown/processing.py +795 -0
  9. html_to_markdown-1.5.0/html_to_markdown.egg-info/PKG-INFO +436 -0
  10. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown.egg-info/entry_points.txt +1 -0
  11. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/pyproject.toml +57 -21
  12. html_to_markdown-1.3.3/PKG-INFO +0 -242
  13. html_to_markdown-1.3.3/README.md +0 -213
  14. html_to_markdown-1.3.3/html_to_markdown/__init__.py +0 -5
  15. html_to_markdown-1.3.3/html_to_markdown/cli.py +0 -150
  16. html_to_markdown-1.3.3/html_to_markdown/converters.py +0 -381
  17. html_to_markdown-1.3.3/html_to_markdown/processing.py +0 -309
  18. html_to_markdown-1.3.3/html_to_markdown.egg-info/PKG-INFO +0 -242
  19. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/LICENSE +0 -0
  20. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown/py.typed +0 -0
  21. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown/utils.py +0 -0
  22. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown.egg-info/SOURCES.txt +0 -0
  23. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown.egg-info/dependency_links.txt +0 -0
  24. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown.egg-info/requires.txt +0 -0
  25. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/html_to_markdown.egg-info/top_level.txt +0 -0
  26. {html_to_markdown-1.3.3 → html_to_markdown-1.5.0}/setup.cfg +0 -0
@@ -0,0 +1,436 @@
1
+ Metadata-Version: 2.4
2
+ Name: html-to-markdown
3
+ Version: 1.5.0
4
+ Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
5
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
6
+ License: MIT
7
+ Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
8
+ Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
9
+ Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
10
+ Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
11
+ Keywords: beautifulsoup,cli-tool,converter,html,html2markdown,markdown,markup,text-extraction,text-processing
12
+ Classifier: Development Status :: 5 - Production/Stable
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3 :: Only
18
+ Classifier: Programming Language :: Python :: 3.9
19
+ Classifier: Programming Language :: Python :: 3.10
20
+ Classifier: Programming Language :: Python :: 3.11
21
+ Classifier: Programming Language :: Python :: 3.12
22
+ Classifier: Programming Language :: Python :: 3.13
23
+ Classifier: Topic :: Internet :: WWW/HTTP
24
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
25
+ Classifier: Topic :: Text Processing
26
+ Classifier: Topic :: Text Processing :: Markup
27
+ Classifier: Topic :: Text Processing :: Markup :: HTML
28
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
29
+ Classifier: Topic :: Utilities
30
+ Classifier: Typing :: Typed
31
+ Requires-Python: >=3.9
32
+ Description-Content-Type: text/markdown
33
+ License-File: LICENSE
34
+ Requires-Dist: beautifulsoup4>=4.13.4
35
+ Dynamic: license-file
36
+
37
+ # html-to-markdown
38
+
39
+ A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
40
+ of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
41
+ Python 3.9+.
42
+
43
+ ## Features
44
+
45
+ - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
46
+ - **Type Safety**: Strict MyPy adherence with comprehensive type hints
47
+ - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
48
+ - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
49
+ - **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
50
+ - **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
51
+ - **Flexible Configuration**: 20+ configuration options for customizing conversion behavior
52
+ - **CLI Tool**: Full-featured command-line interface with all API options exposed
53
+ - **Custom Converters**: Extensible converter system for custom HTML tag handling
54
+ - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
55
+ - **Extensive Test Coverage**: 100% test coverage requirement with comprehensive test suite
56
+
57
+ ## Installation
58
+
59
+ ```shell
60
+ pip install html-to-markdown
61
+ ```
62
+
63
+ ## Quick Start
64
+
65
+ Convert HTML to Markdown with a single function call:
66
+
67
+ ```python
68
+ from html_to_markdown import convert_to_markdown
69
+
70
+ html = """
71
+ <!DOCTYPE html>
72
+ <html>
73
+ <head>
74
+ <title>Sample Document</title>
75
+ <meta name="description" content="A sample HTML document">
76
+ </head>
77
+ <body>
78
+ <article>
79
+ <h1>Welcome</h1>
80
+ <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
81
+ <p>Here's some <mark>highlighted text</mark> and a task list:</p>
82
+ <ul>
83
+ <li><input type="checkbox" checked> Completed task</li>
84
+ <li><input type="checkbox"> Pending task</li>
85
+ </ul>
86
+ </article>
87
+ </body>
88
+ </html>
89
+ """
90
+
91
+ markdown = convert_to_markdown(html)
92
+ print(markdown)
93
+ ```
94
+
95
+ Output:
96
+
97
+ ```markdown
98
+ <!--
99
+ title: Sample Document
100
+ meta-description: A sample HTML document
101
+ -->
102
+
103
+ # Welcome
104
+
105
+ This is a **sample** with a [link](https://example.com).
106
+
107
+ Here's some ==highlighted text== and a task list:
108
+
109
+ * [x] Completed task
110
+ * [ ] Pending task
111
+ ```
112
+
113
+ ### Working with BeautifulSoup
114
+
115
+ If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
116
+
117
+ ```python
118
+ from bs4 import BeautifulSoup
119
+ from html_to_markdown import convert_to_markdown
120
+
121
+ # Configure BeautifulSoup with your preferred parser
122
+ soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
123
+ markdown = convert_to_markdown(soup)
124
+ ```
125
+
126
+ ## Advanced Usage
127
+
128
+ ### Customizing Conversion Options
129
+
130
+ The library offers extensive customization through various options:
131
+
132
+ ```python
133
+ from html_to_markdown import convert_to_markdown
134
+
135
+ html = "<div>Your content here...</div>"
136
+ markdown = convert_to_markdown(
137
+ html,
138
+ # Document processing
139
+ extract_metadata=True, # Extract metadata as comment header
140
+ convert_as_inline=False, # Treat as block-level content
141
+ strip_newlines=False, # Preserve original newlines
142
+ # Formatting options
143
+ heading_style="atx", # Use # style headers
144
+ strong_em_symbol="*", # Use * for bold/italic
145
+ bullets="*+-", # Define bullet point characters
146
+ highlight_style="double-equal", # Use == for highlighted text
147
+ # Text processing
148
+ wrap=True, # Enable text wrapping
149
+ wrap_width=100, # Set wrap width
150
+ escape_asterisks=True, # Escape * characters
151
+ escape_underscores=True, # Escape _ characters
152
+ escape_misc=True, # Escape other special characters
153
+ # Code blocks
154
+ code_language="python", # Default code block language
155
+ # Streaming for large documents
156
+ stream_processing=False, # Enable for memory efficiency
157
+ chunk_size=1024, # Chunk size for streaming
158
+ )
159
+ ```
160
+
161
+ ### Custom Converters
162
+
163
+ You can provide your own conversion functions for specific HTML tags:
164
+
165
+ ```python
166
+ from bs4.element import Tag
167
+ from html_to_markdown import convert_to_markdown
168
+
169
+ # Define a custom converter for the <b> tag
170
+ def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
171
+ return f"IMPORTANT: {text}"
172
+
173
+ html = "<p>This is a <b>bold statement</b>.</p>"
174
+ markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
175
+ print(markdown)
176
+ # Output: This is a IMPORTANT: bold statement.
177
+ ```
178
+
179
+ Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
180
+
181
+ ### Key Configuration Options
182
+
183
+ | Option | Type | Default | Description |
184
+ | ------------------- | ---- | ---------------- | ------------------------------------------------------ |
185
+ | `extract_metadata` | bool | `True` | Extract document metadata as comment header |
186
+ | `convert_as_inline` | bool | `False` | Treat content as inline elements only |
187
+ | `heading_style` | str | `'underlined'` | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
188
+ | `highlight_style` | str | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
189
+ | `stream_processing` | bool | `False` | Enable streaming for large documents |
190
+ | `autolinks` | bool | `True` | Auto-convert URLs to Markdown links |
191
+ | `bullets` | str | `'*+-'` | Characters to use for bullet points |
192
+ | `escape_asterisks` | bool | `True` | Escape * characters |
193
+ | `wrap` | bool | `False` | Enable text wrapping |
194
+ | `wrap_width` | int | `80` | Text wrap width |
195
+
196
+ For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
197
+
198
+ ## CLI Usage
199
+
200
+ Convert HTML files directly from the command line with full access to all API options:
201
+
202
+ ```shell
203
+ # Convert a file
204
+ html_to_markdown input.html > output.md
205
+
206
+ # Process stdin
207
+ cat input.html | html_to_markdown > output.md
208
+
209
+ # Use custom options
210
+ html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
211
+
212
+ # Advanced options
213
+ html_to_markdown \
214
+ --no-extract-metadata \
215
+ --convert-as-inline \
216
+ --highlight-style html \
217
+ --stream-processing \
218
+ --show-progress \
219
+ input.html > output.md
220
+ ```
221
+
222
+ ### Key CLI Options
223
+
224
+ ```shell
225
+ # Content processing
226
+ --convert-as-inline # Treat content as inline elements
227
+ --no-extract-metadata # Disable metadata extraction
228
+ --strip-newlines # Remove newlines from input
229
+
230
+ # Formatting
231
+ --heading-style {atx,atx_closed,underlined}
232
+ --highlight-style {double-equal,html,bold}
233
+ --strong-em-symbol {*,_}
234
+ --bullets CHARS # e.g., "*+-"
235
+
236
+ # Text escaping
237
+ --no-escape-asterisks # Disable * escaping
238
+ --no-escape-underscores # Disable _ escaping
239
+ --no-escape-misc # Disable misc character escaping
240
+
241
+ # Large document processing
242
+ --stream-processing # Enable streaming mode
243
+ --chunk-size SIZE # Set chunk size (default: 1024)
244
+ --show-progress # Show progress for large files
245
+
246
+ # Text wrapping
247
+ --wrap # Enable text wrapping
248
+ --wrap-width WIDTH # Set wrap width (default: 80)
249
+ ```
250
+
251
+ View all available options:
252
+
253
+ ```shell
254
+ html_to_markdown --help
255
+ ```
256
+
257
+ ## Migration from Markdownify
258
+
259
+ For existing projects using Markdownify, a compatibility layer is provided:
260
+
261
+ ```python
262
+ # Old code
263
+ from markdownify import markdownify as md
264
+
265
+ # New code - works the same way
266
+ from html_to_markdown import markdownify as md
267
+ ```
268
+
269
+ The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.
270
+
271
+ **Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.
272
+
273
+ ## Configuration Reference
274
+
275
+ Complete list of all configuration options:
276
+
277
+ ### Document Processing
278
+
279
+ - `extract_metadata` (bool, default: `True`): Extract document metadata (title, meta tags) as comment header
280
+ - `convert_as_inline` (bool, default: `False`): Treat content as inline elements only (no block elements)
281
+ - `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
282
+ - `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
283
+ - `strip` (list, default: `None`): List of HTML tags to remove from output
284
+ - `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
285
+
286
+ ### Streaming Support
287
+
288
+ - `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
289
+ - `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
290
+ - `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
291
+ - `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)
292
+
293
+ ### Text Formatting
294
+
295
+ - `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
296
+ - `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
297
+ - `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
298
+ - `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
299
+ - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
300
+ - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
301
+ - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
302
+
303
+ ### Text Escaping
304
+
305
+ - `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
306
+ - `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
307
+ - `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts
308
+
309
+ ### Links and Media
310
+
311
+ - `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
312
+ - `default_title` (bool, default: `False`): Use default titles for elements like links
313
+ - `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved
314
+
315
+ ### Code Blocks
316
+
317
+ - `code_language` (str, default: `''`): Default language identifier for fenced code blocks
318
+ - `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language
319
+
320
+ ### Text Wrapping
321
+
322
+ - `wrap` (bool, default: `False`): Enable text wrapping
323
+ - `wrap_width` (int, default: `80`): Width for text wrapping
324
+
325
+ ## Contribution
326
+
327
+ This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
328
+ submitting PRs to avoid disappointment.
329
+
330
+ ### Local Development
331
+
332
+ 1. Clone the repo
333
+
334
+ 1. Install system dependencies (requires Python 3.9+)
335
+
336
+ 1. Install the project dependencies:
337
+
338
+ ```shell
339
+ uv sync --all-extras --dev
340
+ ```
341
+
342
+ 1. Install pre-commit hooks:
343
+
344
+ ```shell
345
+ uv run pre-commit install
346
+ ```
347
+
348
+ 1. Run tests to ensure everything works:
349
+
350
+ ```shell
351
+ uv run pytest
352
+ ```
353
+
354
+ 1. Run code quality checks:
355
+
356
+ ```shell
357
+ uv run pre-commit run --all-files
358
+ ```
359
+
360
+ 1. Make your changes and submit a PR
361
+
362
+ ### Development Commands
363
+
364
+ ```shell
365
+ # Run tests with coverage
366
+ uv run pytest --cov=html_to_markdown --cov-report=term-missing
367
+
368
+ # Lint and format code
369
+ uv run ruff check --fix .
370
+ uv run ruff format .
371
+
372
+ # Type checking
373
+ uv run mypy
374
+
375
+ # Test CLI during development
376
+ uv run python -m html_to_markdown input.html
377
+
378
+ # Build package
379
+ uv build
380
+ ```
381
+
382
+ ## License
383
+
384
+ This library uses the MIT license.
385
+
386
+ ## HTML5 Element Support
387
+
388
+ This library provides comprehensive support for all modern HTML5 elements:
389
+
390
+ ### Semantic Elements
391
+
392
+ - `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
393
+ - `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
394
+ - `<del>`, `<ins>` (strikethrough and insertion tracking)
395
+
396
+ ### Form Elements
397
+
398
+ - `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
399
+ - `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
400
+ - Task list support: `<input type="checkbox">` converts to `- [x]` / `- [ ]`
401
+
402
+ ### Table Elements
403
+
404
+ - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`, `<col>`, `<colgroup>`
405
+
406
+ ### Interactive Elements
407
+
408
+ - `<details>`, `<summary>`, `<dialog>`, `<menu>`
409
+
410
+ ### Ruby Annotations
411
+
412
+ - `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)
413
+
414
+ ### Media Elements
415
+
416
+ - `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
417
+ - SVG support with data URI conversion
418
+
419
+ ### Math Elements
420
+
421
+ - `<math>` (MathML support)
422
+
423
+ ## Breaking Changes (Major Version)
424
+
425
+ This version introduces several breaking changes for improved consistency and functionality:
426
+
427
+ 1. **Enhanced Metadata Extraction**: Now enabled by default with comprehensive extraction of title, meta tags, and link relations
428
+ 1. **Improved Newline Handling**: Better normalization of excessive newlines (max 2 consecutive)
429
+ 1. **Extended HTML5 Support**: Added support for 40+ new HTML5 elements
430
+ 1. **Streaming API**: New streaming parameters for large document processing
431
+ 1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
432
+ 1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
433
+
434
+ ## Acknowledgments
435
+
436
+ Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.
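
Note on the streaming options added in this README: `chunk_callback` is described as being called with each processed chunk, and `progress_callback` as being called with `(processed_bytes, total_bytes)`. The sketch below shows what such callbacks might look like. It is an illustration only: the helper names `make_chunk_collector`, `format_progress`, and `on_progress` are hypothetical, and it assumes each chunk arrives as a `str`, which the README does not state.

```python
from typing import Callable, List, Tuple


def make_chunk_collector() -> Tuple[List[str], Callable[[str], None]]:
    """Return a list plus a callback that appends each processed chunk to it."""
    chunks: List[str] = []

    def on_chunk(chunk: str) -> None:
        # Assumption: each chunk is a Markdown string fragment.
        chunks.append(chunk)

    return chunks, on_chunk


def format_progress(processed_bytes: int, total_bytes: int) -> str:
    """Render progress as a whole-number percentage, capped at 100%."""
    if total_bytes <= 0:
        # Nothing to process, so report completion instead of dividing by zero.
        return "100%"
    percent = min(100, processed_bytes * 100 // total_bytes)
    return f"{percent}%"


def on_progress(processed_bytes: int, total_bytes: int) -> None:
    """Callback matching the documented (processed_bytes, total_bytes) arguments."""
    print(format_progress(processed_bytes, total_bytes))
```

With the package installed, these would presumably be passed as `convert_to_markdown(html, stream_processing=True, chunk_callback=on_chunk, progress_callback=on_progress)`, using the parameter names listed in the Configuration Reference. Collecting chunks in a list defeats the memory benefit of streaming, so a real callback would more likely write each chunk to a file or socket.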