html-to-markdown 1.9.1__tar.gz → 1.10.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of html-to-markdown might be problematic.
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/PKG-INFO +195 -203
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/README.md +194 -202
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/__main__.py +0 -1
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/cli.py +101 -45
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/constants.py +3 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/converters.py +31 -502
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/exceptions.py +1 -11
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/preprocessor.py +0 -37
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/processing.py +104 -181
- html_to_markdown-1.10.0/html_to_markdown/utils.py +39 -0
- html_to_markdown-1.10.0/html_to_markdown/whitespace.py +292 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/PKG-INFO +195 -203
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/SOURCES.txt +1 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/pyproject.toml +10 -6
- html_to_markdown-1.9.1/html_to_markdown/utils.py +0 -79
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/LICENSE +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/__init__.py +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown/py.typed +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/dependency_links.txt +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/entry_points.txt +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/requires.txt +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/html_to_markdown.egg-info/top_level.txt +0 -0
- {html_to_markdown-1.9.1 → html_to_markdown-1.10.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.
+Version: 1.10.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -48,22 +48,25 @@ If you find html-to-markdown useful, please consider sponsoring the development:
 
 <a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>
 
-Your support helps maintain and improve this library for the community
+Your support helps maintain and improve this library for the community.
 
 ## Features
 
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
-- **
+- **Table Support**: Advanced handling of complex tables with rowspan/colspan support
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
 - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
 - **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
 - **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
-- **Flexible Configuration**:
-- **CLI Tool**: Full-featured command-line interface with
+- **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior
+- **CLI Tool**: Full-featured command-line interface with complete API parity
 - **Custom Converters**: Extensible converter system for custom HTML tag handling
+- **List Formatting**: Configurable list indentation with Discord/Slack compatibility
+- **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
+- **Whitespace Control**: Normalized or strict whitespace preservation modes
 - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
-- **Comprehensive
+- **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios
 
 ## Installation
 
@@ -79,19 +82,9 @@ For improved performance, you can install with the optional lxml parser:
 pip install html-to-markdown[lxml]
 ```
 
-The lxml parser offers
+The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.
 
-
-- Better handling of malformed HTML
-- More robust parsing for complex documents
-
-Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
-
-```python
-result = convert_to_markdown(html) # Auto-detects: uses lxml if available, otherwise html.parser
-result = convert_to_markdown(html, parser="lxml") # Force lxml (requires installation)
-result = convert_to_markdown(html, parser="html.parser") # Force built-in parser
-```
+The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.
 
 ## Quick Start
 
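The README text in the hunk above says lxml is used automatically when it is installed. A minimal sketch of how such detection could work is shown below. `pick_parser` is a hypothetical helper written for illustration, not the package's actual code; it relies only on the standard-library `importlib.util.find_spec`.

```python
from importlib.util import find_spec


def pick_parser(preferred=None):
    """Return a BeautifulSoup parser name (hypothetical helper).

    An explicit choice wins; otherwise prefer lxml when it is importable
    and fall back to the built-in html.parser.
    """
    if preferred is not None:
        return preferred
    # find_spec returns None when a top-level module is not installed.
    return "lxml" if find_spec("lxml") is not None else "html.parser"


print(pick_parser())               # depends on the environment
print(pick_parser("html.parser"))  # explicit choice is returned unchanged
```

The returned name could then be passed as the `parser` argument mentioned in the README.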
@@ -156,123 +149,176 @@ soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installatio
 markdown = convert_to_markdown(soup)
 ```
 
-##
+## Common Use Cases
+
+### Discord/Slack Compatible Lists
 
-
+Discord and Slack require 2-space indentation for nested lists:
 
-
+**Python:**
 
 ```python
 from html_to_markdown import convert_to_markdown
 
-html = "<
-markdown = convert_to_markdown(
-
-    # Document processing
-    extract_metadata=True, # Extract metadata as comment header
-    convert_as_inline=False, # Treat as block-level content
-    strip_newlines=False, # Preserve original newlines
-    # Formatting options
-    heading_style="atx", # Use # style headers
-    strong_em_symbol="*", # Use * for bold/italic
-    bullets="*+-", # Define bullet point characters
-    highlight_style="double-equal", # Use == for highlighted text
-    # Text processing
-    wrap=True, # Enable text wrapping
-    wrap_width=100, # Set wrap width
-    escape_asterisks=True, # Escape * characters
-    escape_underscores=True, # Escape _ characters
-    escape_misc=True, # Escape other special characters
-    # Code blocks
-    code_language="python", # Default code block language
-    # Streaming for large documents
-    stream_processing=False, # Enable for memory efficiency
-    chunk_size=1024, # Chunk size for streaming
-)
+html = "<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>"
+markdown = convert_to_markdown(html, list_indent_width=2)
+# Output: * Item 1\n + Nested item
 ```
 
-
+**CLI:**
+
+```shell
+html_to_markdown --list-indent-width 2 input.html
+```
 
-
+### Cleaning Web-Scraped HTML
+
+Remove navigation, advertisements, and forms from scraped content:
+
+**Python:**
 
 ```python
-
-
+markdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset="aggressive")
+```
 
-
-def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
-    return f"IMPORTANT: {text}"
+**CLI:**
 
-
-
-
-
+```shell
+html_to_markdown --preprocess-html --preprocessing-preset aggressive input.html
+```
+
+### Preserving Whitespace for Documentation
+
+Maintain exact whitespace for code documentation or technical content:
+
+**Python:**
+
+```python
+markdown = convert_to_markdown(html, whitespace_mode="strict")
 ```
 
-
+**CLI:**
 
-
+```shell
+html_to_markdown --whitespace-mode strict input.html
+```
+
+### Using Tabs for List Indentation
+
+Some editors and platforms prefer tab-based indentation:
+
+**Python:**
+
+```python
+markdown = convert_to_markdown(html, list_indent_type="tabs")
+```
+
+**CLI:**
+
+```shell
+html_to_markdown --list-indent-type tabs input.html
+```
+
+## Advanced Usage
 
-
+### Configuration Example
 
 ```python
 from html_to_markdown import convert_to_markdown
 
-
-html
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-"""
+markdown = convert_to_markdown(
+    html,
+    # Headers and formatting
+    heading_style="atx",
+    strong_em_symbol="*",
+    bullets="*+-",
+    highlight_style="double-equal",
+    # List indentation
+    list_indent_type="spaces",
+    list_indent_width=4,
+    # Whitespace handling
+    whitespace_mode="normalized",
+    # HTML preprocessing
+    preprocess_html=True,
+    preprocessing_preset="standard",
+)
+```
 
-
+### Custom Converters
+
+Custom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.
+
+#### Basic Example: Custom Header Formatting
+
+```python
+from bs4.element import Tag
+from html_to_markdown import convert_to_markdown
+
+def custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:
+    """Convert h1 tags with custom formatting."""
+    return f"### {text.upper()} ###\n\n"
+
+def custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:
+    """Convert h2 tags with underline."""
+    return f"{text}\n{'=' * len(text)}\n\n"
+
+html = "<h1>Title</h1><h2>Subtitle</h2><p>Content</p>"
+markdown = convert_to_markdown(html, custom_converters={"h1": custom_h1_converter, "h2": custom_h2_converter})
 print(markdown)
+# Output:
+# ### TITLE ###
+#
+# Subtitle
+# ========
+#
+# Content
 ```
 
-
+#### Advanced Example: Context-Aware Link Conversion
 
-```
-
-
-
-
+```python
+def smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:
+    """Convert links based on their attributes."""
+    href = tag.get("href", "")
+    title = tag.get("title", "")
+
+    # Handle different link types
+    if href.startswith("http"):
+        # External link
+        return f"[{text}]({href} \"{title or 'External link'}\")"
+    elif href.startswith("#"):
+        # Anchor link
+        return f"[{text}]({href})"
+    elif href.startswith("mailto:"):
+        # Email link
+        return f"[{text}]({href})"
+    else:
+        # Relative link
+        return f"[{text}]({href})"
+
+html = '<a href="https://example.com">External</a> <a href="#section">Anchor</a>'
+markdown = convert_to_markdown(html, custom_converters={"a": smart_link_converter})
 ```
 
-
-
-- **Rowspan**: Inserts empty cells in subsequent rows
-- **Colspan**: Properly manages column spanning
-- **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+#### Converter Function Signature
 
-
+All converter functions must follow this signature:
 
-
-
-
-
-
-
-
-
-
-
-
-
-
+```python
+def converter(*, tag: Tag, text: str, **kwargs) -> str:
+    """
+    Args:
+        tag: BeautifulSoup Tag object with access to all HTML attributes
+        text: Pre-processed text content of the tag
+        **kwargs: Additional context passed through from conversion
+
+    Returns:
+        Markdown formatted string
+    """
+    pass
+```
 
-
+Custom converters take precedence over built-in converters and can be used alongside other configuration options.
 
 ## CLI Usage
 
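Because converters with the keyword-only signature documented above are plain functions, they can be checked in isolation before being passed to `custom_converters`. The sketch below restates the README's h2 and link converters and calls them directly. A plain `dict` stands in for the BeautifulSoup `Tag` (an assumption made only for this illustration, relying on both exposing `.get`), and the three identical non-external branches are collapsed into one return.

```python
from typing import Any, Mapping


def custom_h2_converter(*, tag: Any, text: str, **kwargs: Any) -> str:
    """Underline the heading text with '=' characters of the same length."""
    return f"{text}\n{'=' * len(text)}\n\n"


def smart_link_converter(*, tag: Mapping[str, str], text: str, **kwargs: Any) -> str:
    """Add a title to external links; emit all other links unchanged."""
    href = tag.get("href", "")
    title = tag.get("title", "")
    if href.startswith("http"):
        label = title or "External link"
        return f'[{text}]({href} "{label}")'
    # Anchor, mailto and relative links share the same plain form.
    return f"[{text}]({href})"


# Stand-in "tags": dicts mapping attribute names to values.
print(custom_h2_converter(tag=None, text="Subtitle"))
print(smart_link_converter(tag={"href": "https://example.com"}, text="External"))
print(smart_link_converter(tag={"href": "#section"}, text="Anchor"))
```

Calling the functions this way exercises only the formatting logic; how the library invokes them during a real conversion is not shown here.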
@@ -288,51 +334,30 @@ cat input.html | html_to_markdown > output.md
 # Use custom options
 html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md
 
-#
+# Discord-compatible lists with HTML preprocessing
 html_to_markdown \
---
---
---
---stream-processing \
---show-progress \
+--list-indent-width 2 \
+--preprocess-html \
+--preprocessing-preset aggressive \
 input.html > output.md
 ```
 
 ### Key CLI Options
 
-
-# Content processing
---convert-as-inline # Treat content as inline elements
---no-extract-metadata # Disable metadata extraction
---strip-newlines # Remove newlines from input
-
-# Formatting
---heading-style {atx,atx_closed,underlined}
---highlight-style {double-equal,html,bold}
---strong-em-symbol {*,_}
---bullets CHARS # e.g., "*+-"
-
-# Text escaping
---no-escape-asterisks # Disable * escaping
---no-escape-underscores # Disable _ escaping
---no-escape-misc # Disable misc character escaping
-
-# Large document processing
---stream-processing # Enable streaming mode
---chunk-size SIZE # Set chunk size (default: 1024)
---show-progress # Show progress for large files
-
-# Text wrapping
---wrap # Enable text wrapping
---wrap-width WIDTH # Set wrap width (default: 80)
-```
-
-View all available options:
+**Most Common Options:**
 
 ```shell
-
+--list-indent-width WIDTH # Spaces per indent (default: 4, use 2 for Discord)
+--list-indent-type {spaces,tabs} # Indentation type (default: spaces)
+--preprocess-html # Enable HTML cleaning for web scraping
+--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
+--heading-style {atx,atx_closed,underlined} # Header style
+--no-extract-metadata # Disable metadata extraction
 ```
 
+**All Available Options:**
+The CLI supports all Python API parameters. Use `html_to_markdown --help` to see the complete list.
+
 ## Migration from Markdownify
 
 For existing projects using Markdownify, a compatibility layer is provided:
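The flags in the hunk above mirror API parameter names with underscores replaced by hyphens (`list_indent_width` and `--list-indent-width`, `preprocess_html` and `--preprocess-html`). A rough sketch of that mapping follows. `build_cli_args` is a hypothetical helper, the naming rule is an assumption inferred only from the flags shown in this diff, and negated switches such as `--no-extract-metadata` are deliberately not handled.

```python
def build_cli_args(options, input_path):
    """Translate a dict of API-style options into a CLI argument list.

    Hypothetical helper; assumes flag name == parameter name with
    underscores replaced by hyphens.
    """
    args = ["html_to_markdown"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch, takes no value
        elif value is False or value is None:
            continue  # leave the CLI default untouched
        else:
            args.extend([flag, str(value)])
    args.append(input_path)
    return args


print(build_cli_args(
    {"list_indent_width": 2, "preprocess_html": True, "preprocessing_preset": "aggressive"},
    "input.html",
))
```

The resulting list has the shape expected by `subprocess.run`; `html_to_markdown --help` remains the authority on which flags actually exist.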
@@ -351,27 +376,17 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
 
 ## Configuration Reference
 
-
-
-### Document Processing
+### Most Common Parameters
 
-- `
-- `
-- `
-- `
-- `
-- `
-
-### Streaming Support
-
-- `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
-- `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
-- `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
-- `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)
+- `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)
+- `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation
+- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
+- `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)
+- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
+- `extract_metadata` (bool, default: `True`): Extract document metadata as comment header
 
 ### Text Formatting
 
-- `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
 - `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
 - `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
 - `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
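To make the interaction of `list_indent_width`, `list_indent_type`, and `bullets` concrete, the sketch below renders a nested list from `(text, children)` pairs. It is an illustration only, not the library's implementation: `render_list` is a hypothetical function, and two behaviors are assumptions (one tab per nesting level in `'tabs'` mode, and bullet characters cycling by depth).

```python
def render_list(items, *, list_indent_type="spaces", list_indent_width=4,
                bullets="*+-", depth=0):
    """Render nested (text, children) pairs as Markdown list lines."""
    # One indentation unit per nesting level (assumed: a single tab in tabs mode).
    unit = "\t" if list_indent_type == "tabs" else " " * list_indent_width
    lines = []
    for text, children in items:
        bullet = bullets[depth % len(bullets)]  # cycle bullet characters by depth
        lines.append(f"{unit * depth}{bullet} {text}")
        lines.extend(render_list(
            children,
            list_indent_type=list_indent_type,
            list_indent_width=list_indent_width,
            bullets=bullets,
            depth=depth + 1,
        ))
    return lines


tree = [("Item 1", [("Nested item", [])])]
print("\n".join(render_list(tree, list_indent_width=2)))    # Discord/Slack style
print("\n".join(render_list(tree, list_indent_type="tabs")))
```

With a width of 2 the nested bullet sits two columns in, which is the layout the README describes for Discord and Slack.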
@@ -379,6 +394,21 @@ Complete list of all configuration options:
 - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
 - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
 
+### Parser Options
+
+- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
+- `preprocessing_preset` (str, default: `'standard'`): Preprocessing level (`'minimal'`, `'standard'`, `'aggressive'`)
+- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
+- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing
+
+### Document Processing
+
+- `convert_as_inline` (bool, default: `False`): Treat content as inline elements only
+- `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
+- `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
+- `strip` (list, default: `None`): List of HTML tags to remove from output
+- `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
+
 ### Text Escaping
 
 - `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
@@ -401,6 +431,15 @@ Complete list of all configuration options:
 - `wrap` (bool, default: `False`): Enable text wrapping
 - `wrap_width` (int, default: `80`): Width for text wrapping
 
+### HTML Processing
+
+- `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
+- `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)
+- `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
+- `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)
+- `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
+- `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing
+
 ## Contribution
 
 This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
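Several of the options added in the hunk above accept only a fixed set of strings. The sketch below shows one way calling code might check those values before handing them to the converter. `ALLOWED_VALUES` and `check_options` are hypothetical names, the allowed sets are copied from the README text in this diff, and whether the library performs its own validation is not something the diff shows.

```python
# Allowed string values as listed in the README hunk above.
ALLOWED_VALUES = {
    "parser": {"lxml", "html.parser", "html5lib"},
    "whitespace_mode": {"normalized", "strict"},
    "preprocessing_preset": {"minimal", "standard", "aggressive"},
}


def check_options(**options):
    """Raise ValueError for a documented option given an undocumented value."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options  # unchanged, ready to be splatted into a call


opts = check_options(whitespace_mode="strict", preprocess_html=True)
print(opts)
```

Options that are not in the table, such as `preprocess_html`, pass through unchecked.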
@@ -458,17 +497,6 @@ uv run python -m html_to_markdown input.html
 uv build
 ```
 
-## Performance
-
-The library is optimized for performance with several key features:
-
-- **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
-- **Streaming support**: Process large documents in chunks to minimize memory usage
-- **Optional lxml parser**: ~30% faster parsing for complex HTML documents
-- **Optimized string operations**: Minimizes string concatenations in hot paths
-
-Typical throughput: ~2 MB/s for regular processing on modern hardware.
-
 ## License
 
 This library uses the MIT license.
@@ -512,42 +540,6 @@ This library provides comprehensive support for all modern HTML5 elements:
 
 - `<math>` (MathML support)
 
-## Advanced Table Support
-
-The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
-
-```python
-from html_to_markdown import convert_to_markdown
-
-# Complex table with merged cells
-html = """
-<table>
-<caption>Sales Report</caption>
-<tr>
-<th rowspan="2">Product</th>
-<th colspan="2">Quarterly Sales</th>
-</tr>
-<tr>
-<th>Q1</th>
-<th>Q2</th>
-</tr>
-<tr>
-<td>Widget A</td>
-<td>$50K</td>
-<td>$75K</td>
-</tr>
-</table>
-"""
-
-result = convert_to_markdown(html)
-```
-
-**Features:**
-
-- **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
-- **Clean output**: Automatically removes table styling elements that don't translate to Markdown
-- **Structure preservation**: Maintains table hierarchy and relationships
-
 ## Acknowledgments
 
 Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.