html-to-markdown 1.4.0__tar.gz → 1.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of html-to-markdown might be problematic.

Files changed (26)
  1. html_to_markdown-1.6.0/PKG-INFO +472 -0
  2. html_to_markdown-1.6.0/README.md +434 -0
  3. html_to_markdown-1.6.0/html_to_markdown/__init__.py +22 -0
  4. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown/cli.py +103 -25
  5. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown/constants.py +1 -0
  6. html_to_markdown-1.6.0/html_to_markdown/converters.py +1929 -0
  7. html_to_markdown-1.6.0/html_to_markdown/exceptions.py +49 -0
  8. html_to_markdown-1.6.0/html_to_markdown/processing.py +988 -0
  9. html_to_markdown-1.6.0/html_to_markdown.egg-info/PKG-INFO +472 -0
  10. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown.egg-info/SOURCES.txt +1 -0
  11. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown.egg-info/requires.txt +3 -0
  12. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/pyproject.toml +15 -5
  13. html_to_markdown-1.4.0/PKG-INFO +0 -249
  14. html_to_markdown-1.4.0/README.md +0 -213
  15. html_to_markdown-1.4.0/html_to_markdown/__init__.py +0 -5
  16. html_to_markdown-1.4.0/html_to_markdown/converters.py +0 -387
  17. html_to_markdown-1.4.0/html_to_markdown/processing.py +0 -315
  18. html_to_markdown-1.4.0/html_to_markdown.egg-info/PKG-INFO +0 -249
  19. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/LICENSE +0 -0
  20. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown/__main__.py +0 -0
  21. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown/py.typed +0 -0
  22. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown/utils.py +0 -0
  23. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown.egg-info/dependency_links.txt +0 -0
  24. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown.egg-info/entry_points.txt +0 -0
  25. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/html_to_markdown.egg-info/top_level.txt +0 -0
  26. {html_to_markdown-1.4.0 → html_to_markdown-1.6.0}/setup.cfg +0 -0
@@ -0,0 +1,472 @@
1
+ Metadata-Version: 2.4
2
+ Name: html-to-markdown
3
+ Version: 1.6.0
4
+ Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
5
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
6
+ License: MIT
7
+ Project-URL: Changelog, https://github.com/Goldziher/html-to-markdown/releases
8
+ Project-URL: Homepage, https://github.com/Goldziher/html-to-markdown
9
+ Project-URL: Issues, https://github.com/Goldziher/html-to-markdown/issues
10
+ Project-URL: Repository, https://github.com/Goldziher/html-to-markdown.git
11
+ Keywords: beautifulsoup,cli-tool,converter,html,html2markdown,markdown,markup,text-extraction,text-processing
12
+ Classifier: Development Status :: 5 - Production/Stable
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3 :: Only
18
+ Classifier: Programming Language :: Python :: 3.9
19
+ Classifier: Programming Language :: Python :: 3.10
20
+ Classifier: Programming Language :: Python :: 3.11
21
+ Classifier: Programming Language :: Python :: 3.12
22
+ Classifier: Programming Language :: Python :: 3.13
23
+ Classifier: Topic :: Internet :: WWW/HTTP
24
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
25
+ Classifier: Topic :: Text Processing
26
+ Classifier: Topic :: Text Processing :: Markup
27
+ Classifier: Topic :: Text Processing :: Markup :: HTML
28
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
29
+ Classifier: Topic :: Utilities
30
+ Classifier: Typing :: Typed
31
+ Requires-Python: >=3.9
32
+ Description-Content-Type: text/markdown
33
+ License-File: LICENSE
34
+ Requires-Dist: beautifulsoup4>=4.13.4
35
+ Provides-Extra: lxml
36
+ Requires-Dist: lxml>=5; extra == "lxml"
37
+ Dynamic: license-file
38
+
39
+ # html-to-markdown
40
+
41
+ A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork
42
+ of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
43
+ Python 3.9+.
44
+
45
+ ## Features
46
+
47
+ - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
48
+ - **Type Safety**: Strict MyPy adherence with comprehensive type hints
49
+ - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
50
+ - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
51
+ - **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
52
+ - **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
53
+ - **Flexible Configuration**: 20+ configuration options for customizing conversion behavior
54
+ - **CLI Tool**: Full-featured command-line interface with all API options exposed
55
+ - **Custom Converters**: Extensible converter system for custom HTML tag handling
56
+ - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
57
+ - **Extensive Test Coverage**: 100% test coverage requirement with comprehensive test suite
58
+
59
+ ## Installation
60
+
61
+ ```shell
62
+ pip install html-to-markdown
63
+ ```
64
+
65
+ ### Optional lxml Parser
66
+
67
+ For improved performance, you can install with the optional lxml parser:
68
+
69
+ ```shell
70
+ pip install html-to-markdown[lxml]
71
+ ```
72
+
73
+ The lxml parser offers:
74
+
75
+ - **~30% faster HTML parsing** compared to the default html.parser
76
+ - Better handling of malformed HTML
77
+ - More robust parsing for complex documents
78
+
79
+ Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
80
+
81
+ ```python
82
+ result = convert_to_markdown(html) # Auto-detects: uses lxml if available, otherwise html.parser
83
+ result = convert_to_markdown(html, parser="lxml") # Force lxml (requires installation)
84
+ result = convert_to_markdown(html, parser="html.parser") # Force built-in parser
85
+ ```
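The auto-detection described above can be pictured with a small sketch. This is not the library's code: `pick_parser` is a hypothetical helper, and it assumes "available" simply means that the `lxml` package can be imported.

```python
import importlib.util
from typing import Optional


def pick_parser(preferred: Optional[str] = None) -> str:
    """Return a BeautifulSoup parser name (hypothetical helper, not the library's code)."""
    if preferred is not None:
        # An explicit choice always wins, mirroring parser="lxml" / parser="html.parser".
        return preferred
    # Assumption: lxml counts as "available" when the package can be located for import.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    # Otherwise fall back to the parser that ships with the standard library.
    return "html.parser"


print(pick_parser())               # "lxml" or "html.parser", depending on the environment
print(pick_parser("html.parser"))  # explicit choice is returned unchanged
```

Checking for the module spec instead of importing it avoids the import cost when the caller only wants to know which parser name to pass on.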
86
+
87
+ ## Quick Start
88
+
89
+ Convert HTML to Markdown with a single function call:
90
+
91
+ ```python
92
+ from html_to_markdown import convert_to_markdown
93
+
94
+ html = """
95
+ <!DOCTYPE html>
96
+ <html>
97
+ <head>
98
+ <title>Sample Document</title>
99
+ <meta name="description" content="A sample HTML document">
100
+ </head>
101
+ <body>
102
+ <article>
103
+ <h1>Welcome</h1>
104
+ <p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
105
+ <p>Here's some <mark>highlighted text</mark> and a task list:</p>
106
+ <ul>
107
+ <li><input type="checkbox" checked> Completed task</li>
108
+ <li><input type="checkbox"> Pending task</li>
109
+ </ul>
110
+ </article>
111
+ </body>
112
+ </html>
113
+ """
114
+
115
+ markdown = convert_to_markdown(html)
116
+ print(markdown)
117
+ ```
118
+
119
+ Output:
120
+
121
+ ```markdown
122
+ <!--
123
+ title: Sample Document
124
+ meta-description: A sample HTML document
125
+ -->
126
+
127
+ # Welcome
128
+
129
+ This is a **sample** with a [link](https://example.com).
130
+
131
+ Here's some ==highlighted text== and a task list:
132
+
133
+ * [x] Completed task
134
+ * [ ] Pending task
135
+ ```
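If the metadata header needs to be read back from the converted text, a short parser is enough. The sketch below assumes the header has exactly the shape shown in the sample output (an HTML comment of `key: value` lines at the top of the document); `split_metadata_header` is a hypothetical helper and not part of the package.

```python
import re
from typing import Dict, Tuple


def split_metadata_header(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a leading <!-- ... --> comment into (metadata, remaining Markdown)."""
    match = re.match(r"\s*<!--\n(.*?)\n-->\n*", markdown, flags=re.DOTALL)
    if match is None:
        # No header present, e.g. when extract_metadata=False was used.
        return {}, markdown
    metadata: Dict[str, str] = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata, markdown[match.end():]


sample = """<!--
title: Sample Document
meta-description: A sample HTML document
-->

# Welcome
"""
meta, body = split_metadata_header(sample)
print(meta)
print(body)
```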
136
+
137
+ ### Working with BeautifulSoup
138
+
139
+ If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
140
+
141
+ ```python
142
+ from bs4 import BeautifulSoup
143
+ from html_to_markdown import convert_to_markdown
144
+
145
+ # Configure BeautifulSoup with your preferred parser
146
+ soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
147
+ markdown = convert_to_markdown(soup)
148
+ ```
149
+
150
+ ## Advanced Usage
151
+
152
+ ### Customizing Conversion Options
153
+
154
+ The library offers extensive customization through various options:
155
+
156
+ ```python
157
+ from html_to_markdown import convert_to_markdown
158
+
159
+ html = "<div>Your content here...</div>"
160
+ markdown = convert_to_markdown(
161
+ html,
162
+ # Document processing
163
+ extract_metadata=True, # Extract metadata as comment header
164
+ convert_as_inline=False, # Treat as block-level content
165
+ strip_newlines=False, # Preserve original newlines
166
+ # Formatting options
167
+ heading_style="atx", # Use # style headers
168
+ strong_em_symbol="*", # Use * for bold/italic
169
+ bullets="*+-", # Define bullet point characters
170
+ highlight_style="double-equal", # Use == for highlighted text
171
+ # Text processing
172
+ wrap=True, # Enable text wrapping
173
+ wrap_width=100, # Set wrap width
174
+ escape_asterisks=True, # Escape * characters
175
+ escape_underscores=True, # Escape _ characters
176
+ escape_misc=True, # Escape other special characters
177
+ # Code blocks
178
+ code_language="python", # Default code block language
179
+ # Streaming for large documents
180
+ stream_processing=False, # Enable for memory efficiency
181
+ chunk_size=1024, # Chunk size for streaming
182
+ )
183
+ ```
184
+
185
+ ### Custom Converters
186
+
187
+ You can provide your own conversion functions for specific HTML tags:
188
+
189
+ ```python
190
+ from bs4.element import Tag
191
+ from html_to_markdown import convert_to_markdown
192
+
193
+ # Define a custom converter for the <b> tag
194
+ def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
195
+ return f"IMPORTANT: {text}"
196
+
197
+ html = "<p>This is a <b>bold statement</b>.</p>"
198
+ markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
199
+ print(markdown)
200
+ # Output: This is a IMPORTANT: bold statement.
201
+ ```
202
+
203
+ Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
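A converter can also inspect the tag itself, not just its text. The following is a minimal sketch that follows the keyword-only signature used in the example above; `abbr_with_expansion` is a made-up name, and a plain dict stands in for the BeautifulSoup `Tag` so the snippet runs without any third-party package.

```python
from typing import Any


def abbr_with_expansion(*, tag: Any, text: str, **kwargs: Any) -> str:
    """Render <abbr title="..."> as 'text (title)'; fall back to the plain text."""
    # bs4 Tag objects expose attributes through .get(), just like a dict does.
    title = tag.get("title")
    return f"{text} ({title})" if title else text


# Stand-in for a bs4 Tag (assumption: only .get() is needed by this converter).
fake_tag = {"title": "HyperText Markup Language"}
print(abbr_with_expansion(tag=fake_tag, text="HTML"))
print(abbr_with_expansion(tag={}, text="HTML"))
```

Following the pattern shown earlier, such a function would be registered with `custom_converters={"abbr": abbr_with_expansion}`. Accepting `**kwargs` keeps the converter tolerant of any extra keyword arguments that may be passed to it.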
204
+
205
+ ### Key Configuration Options
206
+
207
+ | Option | Type | Default | Description |
208
+ | ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
209
+ | `extract_metadata` | bool | `True` | Extract document metadata as comment header |
210
+ | `convert_as_inline` | bool | `False` | Treat content as inline elements only |
211
+ | `heading_style` | str | `'underlined'` | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
212
+ | `highlight_style` | str | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
213
+ | `stream_processing` | bool | `False` | Enable streaming for large documents |
214
+ | `parser` | str | auto-detect | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
215
+ | `autolinks` | bool | `True` | Auto-convert URLs to Markdown links |
216
+ | `bullets` | str | `'*+-'` | Characters to use for bullet points |
217
+ | `escape_asterisks` | bool | `True` | Escape * characters |
218
+ | `wrap` | bool | `False` | Enable text wrapping |
219
+ | `wrap_width` | int | `80` | Text wrap width |
220
+
221
+ For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
222
+
223
+ ## CLI Usage
224
+
225
+ Convert HTML files directly from the command line with full access to all API options:
226
+
227
+ ```shell
228
+ # Convert a file
229
+ html_to_markdown input.html > output.md
230
+
231
+ # Process stdin
232
+ cat input.html | html_to_markdown > output.md
233
+
234
+ # Use custom options
235
+ html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
236
+
237
+ # Advanced options
238
+ html_to_markdown \
239
+ --no-extract-metadata \
240
+ --convert-as-inline \
241
+ --highlight-style html \
242
+ --stream-processing \
243
+ --show-progress \
244
+ input.html > output.md
245
+ ```
246
+
247
+ ### Key CLI Options
248
+
249
+ ```shell
250
+ # Content processing
251
+ --convert-as-inline # Treat content as inline elements
252
+ --no-extract-metadata # Disable metadata extraction
253
+ --strip-newlines # Remove newlines from input
254
+
255
+ # Formatting
256
+ --heading-style {atx,atx_closed,underlined}
257
+ --highlight-style {double-equal,html,bold}
258
+ --strong-em-symbol {*,_}
259
+ --bullets CHARS # e.g., "*+-"
260
+
261
+ # Text escaping
262
+ --no-escape-asterisks # Disable * escaping
263
+ --no-escape-underscores # Disable _ escaping
264
+ --no-escape-misc # Disable misc character escaping
265
+
266
+ # Large document processing
267
+ --stream-processing # Enable streaming mode
268
+ --chunk-size SIZE # Set chunk size (default: 1024)
269
+ --show-progress # Show progress for large files
270
+
271
+ # Text wrapping
272
+ --wrap # Enable text wrapping
273
+ --wrap-width WIDTH # Set wrap width (default: 80)
274
+ ```
275
+
276
+ View all available options:
277
+
278
+ ```shell
279
+ html_to_markdown --help
280
+ ```
281
+
282
+ ## Migration from Markdownify
283
+
284
+ For existing projects using Markdownify, a compatibility layer is provided:
285
+
286
+ ```python
287
+ # Old code
288
+ from markdownify import markdownify as md
289
+
290
+ # New code - works the same way
291
+ from html_to_markdown import markdownify as md
292
+ ```
293
+
294
+ The `markdownify` function is an alias for `convert_to_markdown` and provides identical functionality.
295
+
296
+ **Note**: While the compatibility layer ensures existing code continues to work, new projects should use `convert_to_markdown` directly as it provides better type hints and clearer naming.
297
+
298
+ ## Configuration Reference
299
+
300
+ Complete list of all configuration options:
301
+
302
+ ### Document Processing
303
+
304
+ - `extract_metadata` (bool, default: `True`): Extract document metadata (title, meta tags) as comment header
305
+ - `convert_as_inline` (bool, default: `False`): Treat content as inline elements only (no block elements)
306
+ - `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
307
+ - `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
308
+ - `strip` (list, default: `None`): List of HTML tags to remove from output
309
+ - `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
310
+
311
+ ### Streaming Support
312
+
313
+ - `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
314
+ - `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
315
+ - `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
316
+ - `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)
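A progress callback only needs to accept the two integers listed above. The sketch below builds one that records whole-percent steps; `make_progress_recorder` is a hypothetical helper, and the loop at the end merely simulates calls, since the README does not state how often the callback is invoked.

```python
from typing import Callable, List


def make_progress_recorder(log: List[int]) -> Callable[[int, int], None]:
    """Build a callback that appends each new whole-percent value to `log`."""

    def on_progress(processed_bytes: int, total_bytes: int) -> None:
        # Guard against a zero or unknown total, and clamp at 100 percent.
        if total_bytes <= 0:
            percent = 100
        else:
            percent = min(100, processed_bytes * 100 // total_bytes)
        # Skip repeats so the log is not flooded with identical values.
        if not log or log[-1] != percent:
            log.append(percent)

    return on_progress


seen: List[int] = []
report = make_progress_recorder(seen)
for done in (0, 512, 1024, 1024, 2048):  # simulated calls, not real library output
    report(done, 2048)
print(seen)  # [0, 25, 50, 100]
```

Based on the option names above, the callback would be passed as `progress_callback=report` together with `stream_processing=True`.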
317
+
318
+ ### Text Formatting
319
+
320
+ - `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
321
+ - `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
322
+ - `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
323
+ - `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
324
+ - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
325
+ - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
326
+ - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
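The three `highlight_style` values can be pictured as a simple mapping. Only the `double-equal` form (`==text==`) appears in the sample output; the `html` and `bold` renderings below are assumptions inferred from the option names, and `render_highlight` is a hypothetical function, not the library's implementation.

```python
def render_highlight(text: str, style: str = "double-equal") -> str:
    """Illustrate how a <mark> element's text might be rendered per style."""
    if style == "double-equal":
        return f"=={text}=="  # matches the sample output in Quick Start
    if style == "html":
        return f"<mark>{text}</mark>"  # assumption: the tag is kept as raw HTML
    if style == "bold":
        return f"**{text}**"  # assumption: falls back to strong emphasis
    raise ValueError(f"unknown highlight style: {style!r}")


for style in ("double-equal", "html", "bold"):
    print(style, "->", render_highlight("highlighted text", style))
```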
327
+
328
+ ### Text Escaping
329
+
330
+ - `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
331
+ - `escape_underscores` (bool, default: `True`): Escape `_` characters to prevent unintended formatting
332
+ - `escape_misc` (bool, default: `True`): Escape miscellaneous characters to prevent Markdown conflicts
333
+
334
+ ### Links and Media
335
+
336
+ - `autolinks` (bool, default: `True`): Automatically convert valid URLs to Markdown links
337
+ - `default_title` (bool, default: `False`): Use default titles for elements like links
338
+ - `keep_inline_images_in` (list, default: `None`): Tags where inline images should be preserved
339
+
340
+ ### Code Blocks
341
+
342
+ - `code_language` (str, default: `''`): Default language identifier for fenced code blocks
343
+ - `code_language_callback` (callable, default: `None`): Function to dynamically determine code block language
344
+
345
+ ### Text Wrapping
346
+
347
+ - `wrap` (bool, default: `False`): Enable text wrapping
348
+ - `wrap_width` (int, default: `80`): Width for text wrapping
349
+
350
+ ## Contribution
351
+
352
+ This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
353
+ submitting PRs to avoid disappointment.
354
+
355
+ ### Local Development
356
+
357
+ 1. Clone the repo
358
+
359
+ 1. Install system dependencies (requires Python 3.9+)
360
+
361
+ 1. Install the project dependencies:
362
+
363
+ ```shell
364
+ uv sync --all-extras --dev
365
+ ```
366
+
367
+ 1. Install pre-commit hooks:
368
+
369
+ ```shell
370
+ uv run pre-commit install
371
+ ```
372
+
373
+ 1. Run tests to ensure everything works:
374
+
375
+ ```shell
376
+ uv run pytest
377
+ ```
378
+
379
+ 1. Run code quality checks:
380
+
381
+ ```shell
382
+ uv run pre-commit run --all-files
383
+ ```
384
+
385
+ 1. Make your changes and submit a PR
386
+
387
+ ### Development Commands
388
+
389
+ ```shell
390
+ # Run tests with coverage
391
+ uv run pytest --cov=html_to_markdown --cov-report=term-missing
392
+
393
+ # Lint and format code
394
+ uv run ruff check --fix .
395
+ uv run ruff format .
396
+
397
+ # Type checking
398
+ uv run mypy
399
+
400
+ # Test CLI during development
401
+ uv run python -m html_to_markdown input.html
402
+
403
+ # Build package
404
+ uv build
405
+ ```
406
+
407
+ ## Performance
408
+
409
+ The library is optimized for performance with several key features:
410
+
411
+ - **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
412
+ - **Streaming support**: Process large documents in chunks to minimize memory usage
413
+ - **Optional lxml parser**: ~30% faster parsing for complex HTML documents
414
+ - **Optimized string operations**: Minimizes string concatenations in hot paths
415
+
416
+ Typical throughput: ~2 MB/s for regular processing on modern hardware.
417
+
418
+ ## License
419
+
420
+ This library uses the MIT license.
421
+
422
+ ## HTML5 Element Support
423
+
424
+ This library provides comprehensive support for all modern HTML5 elements:
425
+
426
+ ### Semantic Elements
427
+
428
+ - `<article>`, `<aside>`, `<figcaption>`, `<figure>`, `<footer>`, `<header>`, `<hgroup>`, `<main>`, `<nav>`, `<section>`
429
+ - `<abbr>`, `<bdi>`, `<bdo>`, `<cite>`, `<data>`, `<dfn>`, `<kbd>`, `<mark>`, `<samp>`, `<small>`, `<time>`, `<var>`
430
+ - `<del>`, `<ins>` (strikethrough and insertion tracking)
431
+
432
+ ### Form Elements
433
+
434
+ - `<form>`, `<fieldset>`, `<legend>`, `<label>`, `<input>`, `<textarea>`, `<select>`, `<option>`, `<optgroup>`
435
+ - `<button>`, `<datalist>`, `<output>`, `<progress>`, `<meter>`
436
+ - Task list support: `<input type="checkbox">` inside a list item converts to a `[x]` / `[ ]` task marker after the list bullet
437
+
438
+ ### Table Elements
439
+
440
+ - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`, `<col>`, `<colgroup>`
441
+
442
+ ### Interactive Elements
443
+
444
+ - `<details>`, `<summary>`, `<dialog>`, `<menu>`
445
+
446
+ ### Ruby Annotations
447
+
448
+ - `<ruby>`, `<rb>`, `<rt>`, `<rtc>`, `<rp>` (for East Asian typography)
449
+
450
+ ### Media Elements
451
+
452
+ - `<img>`, `<picture>`, `<audio>`, `<video>`, `<iframe>`
453
+ - SVG support with data URI conversion
454
+
455
+ ### Math Elements
456
+
457
+ - `<math>` (MathML support)
458
+
459
+ ## Breaking Changes (Major Version)
460
+
461
+ This version introduces several breaking changes for improved consistency and functionality:
462
+
463
+ 1. **Enhanced Metadata Extraction**: Now enabled by default with comprehensive extraction of title, meta tags, and link relations
464
+ 1. **Improved Newline Handling**: Better normalization of excessive newlines (max 2 consecutive)
465
+ 1. **Extended HTML5 Support**: Added support for 40+ new HTML5 elements
466
+ 1. **Streaming API**: New streaming parameters for large document processing
467
+ 1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
468
+ 1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
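The newline normalization in item 2 (at most two consecutive newlines) can be approximated with one regular expression. This is a sketch of the stated rule only; `collapse_newlines` is a hypothetical name and the library may implement the rule differently.

```python
import re


def collapse_newlines(text: str) -> str:
    """Reduce any run of three or more newlines to exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


print(repr(collapse_newlines("a\n\n\n\nb\n\nc\nd")))  # 'a\n\nb\n\nc\nd'
```

Runs of one or two newlines are left untouched, so paragraph breaks and single line breaks survive unchanged.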
469
+
470
+ ## Acknowledgments
471
+
472
+ Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.