html-to-markdown 1.8.0__tar.gz → 1.9.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of html-to-markdown might be problematic.
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/PKG-INFO +96 -16
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/README.md +91 -10
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/converters.py +305 -562
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/processing.py +120 -45
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/PKG-INFO +96 -16
- html_to_markdown-1.9.1/html_to_markdown.egg-info/requires.txt +5 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/pyproject.toml +11 -12
- html_to_markdown-1.8.0/html_to_markdown.egg-info/requires.txt +0 -5
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/LICENSE +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/__init__.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/__main__.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/cli.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/constants.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/exceptions.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/preprocessor.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/py.typed +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/utils.py +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/SOURCES.txt +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/dependency_links.txt +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/entry_points.txt +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/top_level.txt +0 -0
- {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.
+Version: 1.9.1
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -15,7 +15,6 @@ Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
@@ -28,13 +27,13 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
 Classifier: Topic :: Text Processing :: Markup :: Markdown
 Classifier: Topic :: Utilities
 Classifier: Typing :: Typed
-Requires-Python: >=3.
+Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: beautifulsoup4>=4.13.
-Requires-Dist: nh3>=0.
+Requires-Dist: beautifulsoup4>=4.13.5
+Requires-Dist: nh3>=0.3
 Provides-Extra: lxml
-Requires-Dist: lxml>=
+Requires-Dist: lxml>=6.0.1; extra == "lxml"
 Dynamic: license-file
 
 # html-to-markdown
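The metadata hunks above drop the Python 3.9 classifier and set `Requires-Python: >=3.10`. As a rough illustration only (this is not code from the package), a pre-upgrade check could compare an interpreter version against that floor. The names `MIN_PYTHON` and `can_install_1_9_1` are hypothetical.

```python
import sys

# Floor taken from "Requires-Python: >=3.10" in the 1.9.1 metadata.
MIN_PYTHON = (3, 10)


def can_install_1_9_1(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version satisfies >=3.10."""
    # Compare only (major, minor); the patch level is irrelevant to this floor.
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    # Reports whether the running interpreter meets the new requirement.
    print(can_install_1_9_1())
```

Environments still on Python 3.9 would presumably need to stay on an earlier release of the package, since installers honor `Requires-Python` when selecting a version.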
@@ -43,9 +42,18 @@ A modern, fully typed Python library for converting HTML to Markdown. This libra
 of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
 Python 3.9+.
 
+## Support This Project
+
+If you find html-to-markdown useful, please consider sponsoring the development:
+
+<a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
+
+Your support helps maintain and improve this library for the community! 🚀
+
 ## Features
 
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+- **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
 - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
@@ -55,7 +63,7 @@ Python 3.9+.
 - **CLI Tool**: Full-featured command-line interface with all API options exposed
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
-- **
+- **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
 
 ## Installation
 
@@ -203,6 +211,51 @@ print(markdown)
 
 Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
 
+### Enhanced Table Support
+
+The library now provides better handling of complex tables with merged cells:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+# HTML table with merged cells
+html = """
+<table>
+<tr>
+<th rowspan="2">Category</th>
+<th colspan="2">Sales Data</th>
+</tr>
+<tr>
+<th>Q1</th>
+<th>Q2</th>
+</tr>
+<tr>
+<td>Product A</td>
+<td>$100K</td>
+<td>$150K</td>
+</tr>
+</table>
+"""
+
+markdown = convert_to_markdown(html)
+print(markdown)
+```
+
+Output:
+
+```markdown
+| Category | Sales Data | |
+| --- | --- | --- |
+| | Q1 | Q2 |
+| Product A | $100K | $150K |
+```
+
+The library handles:
+
+- **Rowspan**: Inserts empty cells in subsequent rows
+- **Colspan**: Properly manages column spanning
+- **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+
 ### Key Configuration Options
 
 | Option | Type | Default | Description |
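The README text added in the hunk above says that a rowspan inserts empty cells in subsequent rows and that a colspan is spread across columns. The sketch below shows one way such an expansion could work on plain Python data. It is an assumption-laden illustration, not the code in `converters.py`; `Cell`, `expand_spans`, and `render_table` are hypothetical names.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """A table cell as it might be read from <th>/<td> attributes."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand merged cells into a rectangular grid padded with empty strings."""
    grid: list[list[str]] = []
    # Column index -> number of upcoming rows still covered by a rowspan.
    pending: dict[int, int] = {}
    for row in rows:
        out: list[str] = []
        col = 0
        for cell in row:
            # Columns occupied by a rowspan from an earlier row get an empty cell.
            while pending.get(col, 0) > 0:
                out.append("")
                pending[col] -= 1
                col += 1
            # Keep the text once, then pad the remaining colspan columns.
            out.append(cell.text)
            out.extend([""] * (cell.colspan - 1))
            if cell.rowspan > 1:
                for covered in range(col, col + cell.colspan):
                    pending[covered] = cell.rowspan - 1
            col += cell.colspan
        # Trailing columns that are still covered by earlier rowspans.
        while pending.get(col, 0) > 0:
            out.append("")
            pending[col] -= 1
            col += 1
        grid.append(out)
    return grid


def render_table(grid: list[list[str]]) -> str:
    """Render the grid as a pipe table, treating the first row as the header."""
    lines = []
    for index, row in enumerate(grid):
        lines.append("|" + "".join(f" {text} |" if text else " |" for text in row))
        if index == 0:
            lines.append("|" + " --- |" * len(row))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        [Cell("Category", rowspan=2), Cell("Sales Data", colspan=2)],
        [Cell("Q1"), Cell("Q2")],
        [Cell("Product A"), Cell("$100K"), Cell("$150K")],
    ]
    print(render_table(expand_spans(sample)))
```

Fed the cells of the README sample table, this sketch should print the same four rows as the documented output. Tracking pending rowspans per column keeps the grid rectangular without a second pass over the table.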
@@ -438,7 +491,9 @@ This library provides comprehensive support for all modern HTML5 elements:
 
 ### Table Elements
 
-- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption
+- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
+- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
+- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
 
 ### Interactive Elements
 
@@ -457,16 +512,41 @@ This library provides comprehensive support for all modern HTML5 elements:
 
 - `<math>` (MathML support)
 
-##
+## Advanced Table Support
+
+The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+# Complex table with merged cells
+html = """
+<table>
+<caption>Sales Report</caption>
+<tr>
+<th rowspan="2">Product</th>
+<th colspan="2">Quarterly Sales</th>
+</tr>
+<tr>
+<th>Q1</th>
+<th>Q2</th>
+</tr>
+<tr>
+<td>Widget A</td>
+<td>$50K</td>
+<td>$75K</td>
+</tr>
+</table>
+"""
+
+result = convert_to_markdown(html)
+```
 
-
+**Features:**
 
-
-
-
-1. **Streaming API**: New streaming parameters for large document processing
-1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
-1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
+- **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
+- **Clean output**: Automatically removes table styling elements that don't translate to Markdown
+- **Structure preservation**: Maintains table hierarchy and relationships
 
 ## Acknowledgments
 
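The README hunks above mention removing table styling elements such as `<colgroup>` and `<col>` before conversion. As a standard-library sketch of that idea, the hypothetical `ColumnStripper` below re-emits an HTML fragment while skipping those two tags. The package itself depends on BeautifulSoup, so its actual approach may well differ; comments and doctype declarations are ignored here for brevity.

```python
from html.parser import HTMLParser


class ColumnStripper(HTMLParser):
    """Re-emit HTML while dropping <colgroup> and <col> tags (illustrative only)."""

    SKIPPED = {"colgroup", "col"}

    def __init__(self) -> None:
        # Keep character references unconverted so they round-trip unchanged.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.SKIPPED:
            # get_starttag_text() returns the raw source of the current start tag.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit the raw text once, with no separate end tag.
        if tag not in self.SKIPPED:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.SKIPPED:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_column_markup(html: str) -> str:
    """Return the fragment without column styling tags."""
    parser = ColumnStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_column_markup(
        '<table><colgroup><col span="2"></colgroup><tr><td>a</td></tr></table>'
    ))
```

Dropping these tags loses nothing in the Markdown result, because pipe tables have no notion of column groups or per-column styling.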
@@ -4,9 +4,18 @@ A modern, fully typed Python library for converting HTML to Markdown. This libra
 of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
 Python 3.9+.
 
+## Support This Project
+
+If you find html-to-markdown useful, please consider sponsoring the development:
+
+<a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
+
+Your support helps maintain and improve this library for the community! 🚀
+
 ## Features
 
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+- **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
 - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
@@ -16,7 +25,7 @@ Python 3.9+.
 - **CLI Tool**: Full-featured command-line interface with all API options exposed
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
-- **
+- **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
 
 ## Installation
 
@@ -164,6 +173,51 @@ print(markdown)
 
 Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
 
+### Enhanced Table Support
+
+The library now provides better handling of complex tables with merged cells:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+# HTML table with merged cells
+html = """
+<table>
+<tr>
+<th rowspan="2">Category</th>
+<th colspan="2">Sales Data</th>
+</tr>
+<tr>
+<th>Q1</th>
+<th>Q2</th>
+</tr>
+<tr>
+<td>Product A</td>
+<td>$100K</td>
+<td>$150K</td>
+</tr>
+</table>
+"""
+
+markdown = convert_to_markdown(html)
+print(markdown)
+```
+
+Output:
+
+```markdown
+| Category | Sales Data | |
+| --- | --- | --- |
+| | Q1 | Q2 |
+| Product A | $100K | $150K |
+```
+
+The library handles:
+
+- **Rowspan**: Inserts empty cells in subsequent rows
+- **Colspan**: Properly manages column spanning
+- **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+
 ### Key Configuration Options
 
 | Option | Type | Default | Description |
@@ -399,7 +453,9 @@ This library provides comprehensive support for all modern HTML5 elements:
 
 ### Table Elements
 
-- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption
+- `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
+- **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
+- **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
 
 ### Interactive Elements
 
@@ -418,16 +474,41 @@ This library provides comprehensive support for all modern HTML5 elements:
 
 - `<math>` (MathML support)
 
-##
+## Advanced Table Support
+
+The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
+
+```python
+from html_to_markdown import convert_to_markdown
+
+# Complex table with merged cells
+html = """
+<table>
+<caption>Sales Report</caption>
+<tr>
+<th rowspan="2">Product</th>
+<th colspan="2">Quarterly Sales</th>
+</tr>
+<tr>
+<th>Q1</th>
+<th>Q2</th>
+</tr>
+<tr>
+<td>Widget A</td>
+<td>$50K</td>
+<td>$75K</td>
+</tr>
+</table>
+"""
+
+result = convert_to_markdown(html)
+```
 
-
+**Features:**
 
-
-
-
-1. **Streaming API**: New streaming parameters for large document processing
-1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
-1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
+- **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
+- **Clean output**: Automatically removes table styling elements that don't translate to Markdown
+- **Structure preservation**: Maintains table hierarchy and relationships
 
 ## Acknowledgments
 