html-to-markdown 1.8.0__tar.gz → 1.9.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of html-to-markdown might be problematic.

Files changed (22)
  1. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/PKG-INFO +96 -16
  2. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/README.md +91 -10
  3. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/converters.py +305 -562
  4. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/processing.py +120 -45
  5. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/PKG-INFO +96 -16
  6. html_to_markdown-1.9.1/html_to_markdown.egg-info/requires.txt +5 -0
  7. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/pyproject.toml +11 -12
  8. html_to_markdown-1.8.0/html_to_markdown.egg-info/requires.txt +0 -5
  9. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/LICENSE +0 -0
  10. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/__init__.py +0 -0
  11. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/__main__.py +0 -0
  12. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/cli.py +0 -0
  13. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/constants.py +0 -0
  14. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/exceptions.py +0 -0
  15. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/preprocessor.py +0 -0
  16. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/py.typed +0 -0
  17. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown/utils.py +0 -0
  18. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/SOURCES.txt +0 -0
  19. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/dependency_links.txt +0 -0
  20. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/entry_points.txt +0 -0
  21. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/html_to_markdown.egg-info/top_level.txt +0 -0
  22. {html_to_markdown-1.8.0 → html_to_markdown-1.9.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: html-to-markdown
- Version: 1.8.0
+ Version: 1.9.1
  Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
  Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
  License: MIT
@@ -15,7 +15,6 @@ Classifier: Intended Audience :: Developers
  Classifier: License :: OSI Approved :: MIT License
  Classifier: Operating System :: OS Independent
  Classifier: Programming Language :: Python :: 3 :: Only
- Classifier: Programming Language :: Python :: 3.9
  Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
@@ -28,13 +27,13 @@ Classifier: Topic :: Text Processing :: Markup :: HTML
  Classifier: Topic :: Text Processing :: Markup :: Markdown
  Classifier: Topic :: Utilities
  Classifier: Typing :: Typed
- Requires-Python: >=3.9
+ Requires-Python: >=3.10
  Description-Content-Type: text/markdown
  License-File: LICENSE
- Requires-Dist: beautifulsoup4>=4.13.4
- Requires-Dist: nh3>=0.2.21
+ Requires-Dist: beautifulsoup4>=4.13.5
+ Requires-Dist: nh3>=0.3
  Provides-Extra: lxml
- Requires-Dist: lxml>=5; extra == "lxml"
+ Requires-Dist: lxml>=6.0.1; extra == "lxml"
  Dynamic: license-file
 
  # html-to-markdown
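The metadata hunk above raises the supported floor: the Python 3.9 classifier is dropped, `Requires-Python` moves to `>=3.10`, and the minimum versions of beautifulsoup4 and nh3 go up. An environment could be checked against those minimums before upgrading with something like the sketch below. It uses only the standard library; `parse_release`, `meets_minimum`, and `check_environment` are hypothetical helper names, not part of html-to-markdown, and the comparison only looks at the leading numeric release segment, so pre-release and local version suffixes are ignored.

```python
import re
import sys
from importlib import metadata

# Minimums copied from the 1.9.1 PKG-INFO hunk above.
MIN_PYTHON = (3, 10)
MIN_DEPENDENCIES = {"beautifulsoup4": "4.13.5", "nh3": "0.3"}


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '4.13.5' -> (4, 13, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two release strings after padding them to the same length."""
    have, need = parse_release(installed), parse_release(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment() -> list[str]:
    """Collect human-readable problems; an empty list means the floor is met."""
    problems: list[str] = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >=3.10 is required by 1.9.1")
    for name, minimum in MIN_DEPENDENCIES.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (needs >={minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The `lxml>=6.0.1` bound is left out of the sketch because it only applies when the optional `lxml` extra is requested.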
@@ -43,9 +42,18 @@ A modern, fully typed Python library for converting HTML to Markdown. This libra
  of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
  Python 3.9+.
 
+ ## Support This Project
+
+ If you find html-to-markdown useful, please consider sponsoring the development:
+
+ <a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
+
+ Your support helps maintain and improve this library for the community! 🚀
+
  ## Features
 
  - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+ - **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
  - **Type Safety**: Strict MyPy adherence with comprehensive type hints
  - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
  - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
@@ -55,7 +63,7 @@ Python 3.9+.
  - **CLI Tool**: Full-featured command-line interface with all API options exposed
  - **Custom Converters**: Extensible converter system for custom HTML tag handling
  - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- - **Extensive Test Coverage**: 100% test coverage requirement with comprehensive test suite
+ - **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
 
  ## Installation
 
@@ -203,6 +211,51 @@ print(markdown)
 
  Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
 
+ ### Enhanced Table Support
+
+ The library now provides better handling of complex tables with merged cells:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ # HTML table with merged cells
+ html = """
+ <table>
+ <tr>
+ <th rowspan="2">Category</th>
+ <th colspan="2">Sales Data</th>
+ </tr>
+ <tr>
+ <th>Q1</th>
+ <th>Q2</th>
+ </tr>
+ <tr>
+ <td>Product A</td>
+ <td>$100K</td>
+ <td>$150K</td>
+ </tr>
+ </table>
+ """
+
+ markdown = convert_to_markdown(html)
+ print(markdown)
+ ```
+
+ Output:
+
+ ```markdown
+ | Category | Sales Data | |
+ | --- | --- | --- |
+ | | Q1 | Q2 |
+ | Product A | $100K | $150K |
+ ```
+
+ The library handles:
+
+ - **Rowspan**: Inserts empty cells in subsequent rows
+ - **Colspan**: Properly manages column spanning
+ - **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+
  ### Key Configuration Options
 
  | Option | Type | Default | Description |
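The README text added in this hunk says that a `rowspan` inserts empty cells in subsequent rows and that `colspan` is managed across columns. One way such an expansion could work is sketched below. This is not the code from `converters.py` or `processing.py` (those hunks are not shown here); `Cell`, `expand_spans`, and `render_table` are hypothetical names, and the input is assumed to be already parsed into rows of cells with well-formed, non-overlapping spans.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """One <th>/<td> cell; an illustrative type, not a library class."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand merged cells into a grid, leaving spanned slots as empty strings."""
    grid: list[list[str]] = []
    # Column index -> number of later rows still covered by an earlier rowspan.
    pending: dict[int, int] = {}
    for row in rows:
        queue = list(row)
        out: list[str] = []
        col = 0
        while queue or any(left > 0 and c >= col for c, left in pending.items()):
            if pending.get(col, 0) > 0:
                # Slot is covered by a rowspan from a previous row.
                out.append("")
                pending[col] -= 1
            elif queue:
                cell = queue.pop(0)
                out.append(cell.text)
                pending[col] = cell.rowspan - 1
                # A colspan claims the following columns as empty slots.
                for _ in range(cell.colspan - 1):
                    col += 1
                    out.append("")
                    pending[col] = cell.rowspan - 1
            else:
                # Gap before a column that is still covered further right.
                out.append("")
            col += 1
        grid.append(out)
    return grid


def render_table(grid: list[list[str]]) -> list[str]:
    """Render a non-empty grid as Markdown table lines, first row as header."""
    width = max(len(row) for row in grid)

    def line(cells: list[str]) -> str:
        padded = cells + [""] * (width - len(cells))
        return "|" + "|".join(f" {text} " if text else " " for text in padded) + "|"

    header, *body = grid
    return [line(header), line(["---"] * width), *(line(row) for row in body)]


rows = [
    [Cell("Category", rowspan=2), Cell("Sales Data", colspan=2)],
    [Cell("Q1"), Cell("Q2")],
    [Cell("Product A"), Cell("$100K"), Cell("$150K")],
]
print("\n".join(render_table(expand_spans(rows))))
```

For the three-row table from the README example, this sketch prints the same four lines that the hunk lists under `Output:`.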
@@ -438,7 +491,9 @@ This library provides comprehensive support for all modern HTML5 elements:
 
  ### Table Elements
 
- - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`, `<col>`, `<colgroup>`
+ - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
+ - **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
+ - **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
 
  ### Interactive Elements
 
@@ -457,16 +512,41 @@ This library provides comprehensive support for all modern HTML5 elements:
 
  - `<math>` (MathML support)
 
- ## Breaking Changes (Major Version)
+ ## Advanced Table Support
+
+ The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ # Complex table with merged cells
+ html = """
+ <table>
+ <caption>Sales Report</caption>
+ <tr>
+ <th rowspan="2">Product</th>
+ <th colspan="2">Quarterly Sales</th>
+ </tr>
+ <tr>
+ <th>Q1</th>
+ <th>Q2</th>
+ </tr>
+ <tr>
+ <td>Widget A</td>
+ <td>$50K</td>
+ <td>$75K</td>
+ </tr>
+ </table>
+ """
+
+ result = convert_to_markdown(html)
+ ```
 
- This version introduces several breaking changes for improved consistency and functionality:
+ **Features:**
 
- 1. **Enhanced Metadata Extraction**: Now enabled by default with comprehensive extraction of title, meta tags, and link relations
- 1. **Improved Newline Handling**: Better normalization of excessive newlines (max 2 consecutive)
- 1. **Extended HTML5 Support**: Added support for 40+ new HTML5 elements
- 1. **Streaming API**: New streaming parameters for large document processing
- 1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
- 1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
+ - **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
+ - **Clean output**: Automatically removes table styling elements that don't translate to Markdown
+ - **Structure preservation**: Maintains table hierarchy and relationships
 
  ## Acknowledgments
 
@@ -4,9 +4,18 @@ A modern, fully typed Python library for converting HTML to Markdown. This libra
  of [markdownify](https://pypi.org/project/markdownify/) with a modernized codebase, strict type safety and support for
  Python 3.9+.
 
+ ## Support This Project
+
+ If you find html-to-markdown useful, please consider sponsoring the development:
+
+ <a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
+
+ Your support helps maintain and improve this library for the community! 🚀
+
  ## Features
 
  - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+ - **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
  - **Type Safety**: Strict MyPy adherence with comprehensive type hints
  - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
  - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
@@ -16,7 +25,7 @@ Python 3.9+.
  - **CLI Tool**: Full-featured command-line interface with all API options exposed
  - **Custom Converters**: Extensible converter system for custom HTML tag handling
  - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- - **Extensive Test Coverage**: 100% test coverage requirement with comprehensive test suite
+ - **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
 
  ## Installation
 
@@ -164,6 +173,51 @@ print(markdown)
 
  Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
 
+ ### Enhanced Table Support
+
+ The library now provides better handling of complex tables with merged cells:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ # HTML table with merged cells
+ html = """
+ <table>
+ <tr>
+ <th rowspan="2">Category</th>
+ <th colspan="2">Sales Data</th>
+ </tr>
+ <tr>
+ <th>Q1</th>
+ <th>Q2</th>
+ </tr>
+ <tr>
+ <td>Product A</td>
+ <td>$100K</td>
+ <td>$150K</td>
+ </tr>
+ </table>
+ """
+
+ markdown = convert_to_markdown(html)
+ print(markdown)
+ ```
+
+ Output:
+
+ ```markdown
+ | Category | Sales Data | |
+ | --- | --- | --- |
+ | | Q1 | Q2 |
+ | Product A | $100K | $150K |
+ ```
+
+ The library handles:
+
+ - **Rowspan**: Inserts empty cells in subsequent rows
+ - **Colspan**: Properly manages column spanning
+ - **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+
  ### Key Configuration Options
 
  | Option | Type | Default | Description |
@@ -399,7 +453,9 @@ This library provides comprehensive support for all modern HTML5 elements:
 
  ### Table Elements
 
- - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`, `<col>`, `<colgroup>`
+ - `<table>`, `<thead>`, `<tbody>`, `<tfoot>`, `<tr>`, `<th>`, `<td>`, `<caption>`
+ - **Merged cell support**: Handles `rowspan` and `colspan` attributes for complex table layouts
+ - **Smart cleanup**: Automatically handles table styling elements for clean Markdown output
 
  ### Interactive Elements
 
@@ -418,16 +474,41 @@ This library provides comprehensive support for all modern HTML5 elements:
 
  - `<math>` (MathML support)
 
- ## Breaking Changes (Major Version)
+ ## Advanced Table Support
+
+ The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
+
+ ```python
+ from html_to_markdown import convert_to_markdown
+
+ # Complex table with merged cells
+ html = """
+ <table>
+ <caption>Sales Report</caption>
+ <tr>
+ <th rowspan="2">Product</th>
+ <th colspan="2">Quarterly Sales</th>
+ </tr>
+ <tr>
+ <th>Q1</th>
+ <th>Q2</th>
+ </tr>
+ <tr>
+ <td>Widget A</td>
+ <td>$50K</td>
+ <td>$75K</td>
+ </tr>
+ </table>
+ """
+
+ result = convert_to_markdown(html)
+ ```
 
- This version introduces several breaking changes for improved consistency and functionality:
+ **Features:**
 
- 1. **Enhanced Metadata Extraction**: Now enabled by default with comprehensive extraction of title, meta tags, and link relations
- 1. **Improved Newline Handling**: Better normalization of excessive newlines (max 2 consecutive)
- 1. **Extended HTML5 Support**: Added support for 40+ new HTML5 elements
- 1. **Streaming API**: New streaming parameters for large document processing
- 1. **Task List Support**: Automatic conversion of HTML checkboxes to GitHub-compatible task lists
- 1. **Highlight Styles**: New `highlight_style` parameter with multiple options for `<mark>` elements
+ - **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
+ - **Clean output**: Automatically removes table styling elements that don't translate to Markdown
+ - **Structure preservation**: Maintains table hierarchy and relationships
 
  ## Acknowledgments
 