html-to-markdown 1.9.1 (tar.gz) → 1.11.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of html-to-markdown might be problematic.

Files changed (23)
  1. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/PKG-INFO +196 -204
  2. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/README.md +194 -202
  3. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/__main__.py +0 -1
  4. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/cli.py +101 -45
  5. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/constants.py +3 -0
  6. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/converters.py +34 -502
  7. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/exceptions.py +1 -11
  8. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/preprocessor.py +0 -37
  9. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/processing.py +117 -191
  10. html_to_markdown-1.11.0/html_to_markdown/utils.py +39 -0
  11. html_to_markdown-1.11.0/html_to_markdown/whitespace.py +303 -0
  12. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/PKG-INFO +196 -204
  13. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/SOURCES.txt +1 -0
  14. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/requires.txt +1 -1
  15. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/pyproject.toml +11 -8
  16. html_to_markdown-1.9.1/html_to_markdown/utils.py +0 -79
  17. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/LICENSE +0 -0
  18. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/__init__.py +0 -0
  19. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/py.typed +0 -0
  20. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/dependency_links.txt +0 -0
  21. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/entry_points.txt +0 -0
  22. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown.egg-info/top_level.txt +0 -0
  23. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/setup.cfg +0 -0
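The per-file counts in the list above can be tallied mechanically. The following is a minimal sketch, assuming each entry has the shape `<path> +<added> -<removed>` with an optional leading item number, as the entries above do; `parse_entry` and `summarize` are hypothetical helper names for illustration, not part of html-to-markdown or of any registry tooling.

```python
import re

# Assumed entry shape, based on the list above: "<n>. <path> +<added> -<removed>".
ENTRY = re.compile(
    r"^\s*(?:\d+\.\s+)?(?P<path>.+?)\s+\+(?P<added>\d+)\s+-(?P<removed>\d+)\s*$"
)


def parse_entry(line):
    """Return (path, added, removed) for one file-list line, or None if it does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("added")), int(match.group("removed"))


def summarize(lines):
    """Sum added/removed counts and collect the paths that have no changes."""
    added = removed = 0
    unchanged = []
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue  # skip headings and anything else that is not an entry
        path, plus, minus = parsed
        added += plus
        removed += minus
        if plus == 0 and minus == 0:
            unchanged.append(path)
    return added, removed, unchanged


# Three entries copied from the list above, used as sample input.
sample = [
    "4. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/html_to_markdown/cli.py +101 -45",
    "11. html_to_markdown-1.11.0/html_to_markdown/whitespace.py +303 -0",
    "17. {html_to_markdown-1.9.1 → html_to_markdown-1.11.0}/LICENSE +0 -0",
]
print(summarize(sample))
```

Entries with `+0 -0` (LICENSE, `py.typed`, `setup.cfg`, and so on) only changed their containing directory name, so they are collected separately rather than counted as edits.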
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: html-to-markdown
- Version: 1.9.1
+ Version: 1.11.0
  Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
  Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
  License: MIT
@@ -33,7 +33,7 @@ License-File: LICENSE
  Requires-Dist: beautifulsoup4>=4.13.5
  Requires-Dist: nh3>=0.3
  Provides-Extra: lxml
- Requires-Dist: lxml>=6.0.1; extra == "lxml"
+ Requires-Dist: beautifulsoup4[lxml]>=4.13.5; extra == "lxml"
  Dynamic: license-file

  # html-to-markdown
@@ -48,22 +48,25 @@ If you find html-to-markdown useful, please consider sponsoring the development:

  <a href="https://github.com/sponsors/Goldziher"><img src="https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github-sponsors" alt="Sponsor on GitHub" height="32"></a>

- Your support helps maintain and improve this library for the community! 🚀
+ Your support helps maintain and improve this library for the community.

  ## Features

  - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- - **Enhanced Table Support**: Advanced handling of merged cells with rowspan/colspan support for better table representation
+ - **Table Support**: Advanced handling of complex tables with rowspan/colspan support
  - **Type Safety**: Strict MyPy adherence with comprehensive type hints
  - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
  - **Streaming Support**: Memory-efficient processing for large documents with progress callbacks
  - **Highlight Support**: Multiple styles for highlighted text (`<mark>` elements)
  - **Task List Support**: Converts HTML checkboxes to GitHub-compatible task list syntax
- - **Flexible Configuration**: 20+ configuration options for customizing conversion behavior
- - **CLI Tool**: Full-featured command-line interface with all API options exposed
+ - **Flexible Configuration**: Comprehensive configuration options for customizing conversion behavior
+ - **CLI Tool**: Full-featured command-line interface with complete API parity
  - **Custom Converters**: Extensible converter system for custom HTML tag handling
+ - **List Formatting**: Configurable list indentation with Discord/Slack compatibility
+ - **HTML Preprocessing**: Clean messy HTML with configurable aggressiveness levels
+ - **Whitespace Control**: Normalized or strict whitespace preservation modes
  - **BeautifulSoup Integration**: Support for pre-configured BeautifulSoup instances
- - **Comprehensive Test Coverage**: 91%+ test coverage with 623+ comprehensive tests
+ - **Robustly Tested**: Comprehensive unit tests and integration tests covering all conversion scenarios

  ## Installation

@@ -79,19 +82,9 @@ For improved performance, you can install with the optional lxml parser:
  pip install html-to-markdown[lxml]
  ```

- The lxml parser offers:
+ The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.

- - **~30% faster HTML parsing** compared to the default html.parser
- - Better handling of malformed HTML
- - More robust parsing for complex documents
-
- Once installed, lxml is automatically used by default for better performance. You can explicitly specify a parser if needed:
-
- ```python
- result = convert_to_markdown(html) # Auto-detects: uses lxml if available, otherwise html.parser
- result = convert_to_markdown(html, parser="lxml") # Force lxml (requires installation)
- result = convert_to_markdown(html, parser="html.parser") # Force built-in parser
- ```
+ The library automatically uses lxml when available. You can explicitly specify a parser using the `parser` parameter.

  ## Quick Start

@@ -156,123 +149,176 @@ soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installatio
  markdown = convert_to_markdown(soup)
  ```

- ## Advanced Usage
+ ## Common Use Cases
+
+ ### Discord/Slack Compatible Lists

- ### Customizing Conversion Options
+ Discord and Slack require 2-space indentation for nested lists:

- The library offers extensive customization through various options:
+ **Python:**

  ```python
  from html_to_markdown import convert_to_markdown

- html = "<div>Your content here...</div>"
- markdown = convert_to_markdown(
-     html,
-     # Document processing
-     extract_metadata=True, # Extract metadata as comment header
-     convert_as_inline=False, # Treat as block-level content
-     strip_newlines=False, # Preserve original newlines
-     # Formatting options
-     heading_style="atx", # Use # style headers
-     strong_em_symbol="*", # Use * for bold/italic
-     bullets="*+-", # Define bullet point characters
-     highlight_style="double-equal", # Use == for highlighted text
-     # Text processing
-     wrap=True, # Enable text wrapping
-     wrap_width=100, # Set wrap width
-     escape_asterisks=True, # Escape * characters
-     escape_underscores=True, # Escape _ characters
-     escape_misc=True, # Escape other special characters
-     # Code blocks
-     code_language="python", # Default code block language
-     # Streaming for large documents
-     stream_processing=False, # Enable for memory efficiency
-     chunk_size=1024, # Chunk size for streaming
- )
+ html = "<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>"
+ markdown = convert_to_markdown(html, list_indent_width=2)
+ # Output: * Item 1\n + Nested item
  ```

- ### Custom Converters
+ **CLI:**
+
+ ```shell
+ html_to_markdown --list-indent-width 2 input.html
+ ```

- You can provide your own conversion functions for specific HTML tags:
+ ### Cleaning Web-Scraped HTML
+
+ Remove navigation, advertisements, and forms from scraped content:
+
+ **Python:**

  ```python
- from bs4.element import Tag
- from html_to_markdown import convert_to_markdown
+ markdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset="aggressive")
+ ```

- # Define a custom converter for the <b> tag
- def custom_bold_converter(*, tag: Tag, text: str, **kwargs) -> str:
-     return f"IMPORTANT: {text}"
+ **CLI:**

- html = "<p>This is a <b>bold statement</b>.</p>"
- markdown = convert_to_markdown(html, custom_converters={"b": custom_bold_converter})
- print(markdown)
- # Output: This is a IMPORTANT: bold statement.
+ ```shell
+ html_to_markdown --preprocess-html --preprocessing-preset aggressive input.html
+ ```
+
+ ### Preserving Whitespace for Documentation
+
+ Maintain exact whitespace for code documentation or technical content:
+
+ **Python:**
+
+ ```python
+ markdown = convert_to_markdown(html, whitespace_mode="strict")
  ```

- Custom converters take precedence over the built-in converters and can be used alongside other configuration options.
+ **CLI:**

- ### Enhanced Table Support
+ ```shell
+ html_to_markdown --whitespace-mode strict input.html
+ ```
+
+ ### Using Tabs for List Indentation
+
+ Some editors and platforms prefer tab-based indentation:
+
+ **Python:**
+
+ ```python
+ markdown = convert_to_markdown(html, list_indent_type="tabs")
+ ```
+
+ **CLI:**
+
+ ```shell
+ html_to_markdown --list-indent-type tabs input.html
+ ```
+
+ ## Advanced Usage

- The library now provides better handling of complex tables with merged cells:
+ ### Configuration Example

  ```python
  from html_to_markdown import convert_to_markdown

- # HTML table with merged cells
- html = """
- <table>
- <tr>
- <th rowspan="2">Category</th>
- <th colspan="2">Sales Data</th>
- </tr>
- <tr>
- <th>Q1</th>
- <th>Q2</th>
- </tr>
- <tr>
- <td>Product A</td>
- <td>$100K</td>
- <td>$150K</td>
- </tr>
- </table>
- """
+ markdown = convert_to_markdown(
+     html,
+     # Headers and formatting
+     heading_style="atx",
+     strong_em_symbol="*",
+     bullets="*+-",
+     highlight_style="double-equal",
+     # List indentation
+     list_indent_type="spaces",
+     list_indent_width=4,
+     # Whitespace handling
+     whitespace_mode="normalized",
+     # HTML preprocessing
+     preprocess_html=True,
+     preprocessing_preset="standard",
+ )
+ ```

- markdown = convert_to_markdown(html)
+ ### Custom Converters
+
+ Custom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.
+
+ #### Basic Example: Custom Header Formatting
+
+ ```python
+ from bs4.element import Tag
+ from html_to_markdown import convert_to_markdown
+
+ def custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:
+     """Convert h1 tags with custom formatting."""
+     return f"### {text.upper()} ###\n\n"
+
+ def custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:
+     """Convert h2 tags with underline."""
+     return f"{text}\n{'=' * len(text)}\n\n"
+
+ html = "<h1>Title</h1><h2>Subtitle</h2><p>Content</p>"
+ markdown = convert_to_markdown(html, custom_converters={"h1": custom_h1_converter, "h2": custom_h2_converter})
  print(markdown)
+ # Output:
+ # ### TITLE ###
+ #
+ # Subtitle
+ # ========
+ #
+ # Content
  ```

- Output:
+ #### Advanced Example: Context-Aware Link Conversion

- ```markdown
- | Category | Sales Data | |
- | --- | --- | --- |
- | | Q1 | Q2 |
- | Product A | $100K | $150K |
+ ```python
+ def smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:
+     """Convert links based on their attributes."""
+     href = tag.get("href", "")
+     title = tag.get("title", "")
+
+     # Handle different link types
+     if href.startswith("http"):
+         # External link
+         return f"[{text}]({href} \"{title or 'External link'}\")"
+     elif href.startswith("#"):
+         # Anchor link
+         return f"[{text}]({href})"
+     elif href.startswith("mailto:"):
+         # Email link
+         return f"[{text}]({href})"
+     else:
+         # Relative link
+         return f"[{text}]({href})"
+
+ html = '<a href="https://example.com">External</a> <a href="#section">Anchor</a>'
+ markdown = convert_to_markdown(html, custom_converters={"a": smart_link_converter})
  ```

- The library handles:
-
- - **Rowspan**: Inserts empty cells in subsequent rows
- - **Colspan**: Properly manages column spanning
- - **Clean output**: Removes `<colgroup>` and `<col>` elements that have no Markdown equivalent
+ #### Converter Function Signature

- ### Key Configuration Options
+ All converter functions must follow this signature:

- | Option | Type | Default | Description |
- | ------------------- | ---- | ---------------- | --------------------------------------------------------------- |
- | `extract_metadata` | bool | `True` | Extract document metadata as comment header |
- | `convert_as_inline` | bool | `False` | Treat content as inline elements only |
- | `heading_style` | str | `'underlined'` | Header style (`'underlined'`, `'atx'`, `'atx_closed'`) |
- | `highlight_style` | str | `'double-equal'` | Highlight style (`'double-equal'`, `'html'`, `'bold'`) |
- | `stream_processing` | bool | `False` | Enable streaming for large documents |
- | `parser` | str | auto-detect | BeautifulSoup parser (auto-detects `'lxml'` or `'html.parser'`) |
- | `autolinks` | bool | `True` | Auto-convert URLs to Markdown links |
- | `bullets` | str | `'*+-'` | Characters to use for bullet points |
- | `escape_asterisks` | bool | `True` | Escape * characters |
- | `wrap` | bool | `False` | Enable text wrapping |
- | `wrap_width` | int | `80` | Text wrap width |
+ ```python
+ def converter(*, tag: Tag, text: str, **kwargs) -> str:
+     """
+     Args:
+         tag: BeautifulSoup Tag object with access to all HTML attributes
+         text: Pre-processed text content of the tag
+         **kwargs: Additional context passed through from conversion
+
+     Returns:
+         Markdown formatted string
+     """
+     pass
+ ```

- For a complete list of all 20+ options, see the [Configuration Reference](#configuration-reference) section below.
+ Custom converters take precedence over built-in converters and can be used alongside other configuration options.

  ## CLI Usage

@@ -288,51 +334,30 @@ cat input.html | html_to_markdown > output.md
  # Use custom options
  html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md

- # Advanced options
+ # Discord-compatible lists with HTML preprocessing
  html_to_markdown \
- --no-extract-metadata \
- --convert-as-inline \
- --highlight-style html \
- --stream-processing \
- --show-progress \
+ --list-indent-width 2 \
+ --preprocess-html \
+ --preprocessing-preset aggressive \
  input.html > output.md
  ```

  ### Key CLI Options

- ```shell
- # Content processing
- --convert-as-inline # Treat content as inline elements
- --no-extract-metadata # Disable metadata extraction
- --strip-newlines # Remove newlines from input
-
- # Formatting
- --heading-style {atx,atx_closed,underlined}
- --highlight-style {double-equal,html,bold}
- --strong-em-symbol {*,_}
- --bullets CHARS # e.g., "*+-"
-
- # Text escaping
- --no-escape-asterisks # Disable * escaping
- --no-escape-underscores # Disable _ escaping
- --no-escape-misc # Disable misc character escaping
-
- # Large document processing
- --stream-processing # Enable streaming mode
- --chunk-size SIZE # Set chunk size (default: 1024)
- --show-progress # Show progress for large files
-
- # Text wrapping
- --wrap # Enable text wrapping
- --wrap-width WIDTH # Set wrap width (default: 80)
- ```
-
- View all available options:
+ **Most Common Options:**

  ```shell
- html_to_markdown --help
+ --list-indent-width WIDTH # Spaces per indent (default: 4, use 2 for Discord)
+ --list-indent-type {spaces,tabs} # Indentation type (default: spaces)
+ --preprocess-html # Enable HTML cleaning for web scraping
+ --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
+ --heading-style {atx,atx_closed,underlined} # Header style
+ --no-extract-metadata # Disable metadata extraction
  ```

+ **All Available Options:**
+ The CLI supports all Python API parameters. Use `html_to_markdown --help` to see the complete list.
+
  ## Migration from Markdownify

  For existing projects using Markdownify, a compatibility layer is provided:
@@ -351,27 +376,17 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id

  ## Configuration Reference

- Complete list of all configuration options:
-
- ### Document Processing
+ ### Most Common Parameters

- - `extract_metadata` (bool, default: `True`): Extract document metadata (title, meta tags) as comment header
- - `convert_as_inline` (bool, default: `False`): Treat content as inline elements only (no block elements)
- - `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
- - `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
- - `strip` (list, default: `None`): List of HTML tags to remove from output
- - `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
-
- ### Streaming Support
-
- - `stream_processing` (bool, default: `False`): Enable streaming processing for large documents
- - `chunk_size` (int, default: `1024`): Size of chunks when using streaming processing
- - `chunk_callback` (callable, default: `None`): Callback function called with each processed chunk
- - `progress_callback` (callable, default: `None`): Callback function called with (processed_bytes, total_bytes)
+ - `list_indent_width` (int, default: `4`): Number of spaces per indentation level (use 2 for Discord/Slack)
+ - `list_indent_type` (str, default: `'spaces'`): Use `'spaces'` or `'tabs'` for list indentation
+ - `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
+ - `whitespace_mode` (str, default: `'normalized'`): Whitespace handling (`'normalized'` or `'strict'`)
+ - `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
+ - `extract_metadata` (bool, default: `True`): Extract document metadata as comment header

  ### Text Formatting

- - `heading_style` (str, default: `'underlined'`): Header style (`'underlined'`, `'atx'`, `'atx_closed'`)
  - `highlight_style` (str, default: `'double-equal'`): Style for highlighted text (`'double-equal'`, `'html'`, `'bold'`)
  - `strong_em_symbol` (str, default: `'*'`): Symbol for strong/emphasized text (`'*'` or `'_'`)
  - `bullets` (str, default: `'*+-'`): Characters to use for bullet points in lists
@@ -379,6 +394,21 @@ Complete list of all configuration options:
  - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
  - `sup_symbol` (str, default: `''`): Custom symbol for superscript text

+ ### Parser Options
+
+ - `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
+ - `preprocessing_preset` (str, default: `'standard'`): Preprocessing level (`'minimal'`, `'standard'`, `'aggressive'`)
+ - `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
+ - `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing
+
+ ### Document Processing
+
+ - `convert_as_inline` (bool, default: `False`): Treat content as inline elements only
+ - `strip_newlines` (bool, default: `False`): Remove newlines from HTML input before processing
+ - `convert` (list, default: `None`): List of HTML tags to convert (None = all supported tags)
+ - `strip` (list, default: `None`): List of HTML tags to remove from output
+ - `custom_converters` (dict, default: `None`): Mapping of HTML tag names to custom converter functions
+
  ### Text Escaping

  - `escape_asterisks` (bool, default: `True`): Escape `*` characters to prevent unintended formatting
@@ -401,6 +431,15 @@ Complete list of all configuration options:
  - `wrap` (bool, default: `False`): Enable text wrapping
  - `wrap_width` (int, default: `80`): Width for text wrapping

+ ### HTML Processing
+
+ - `parser` (str, default: auto-detect): BeautifulSoup parser to use (`'lxml'`, `'html.parser'`, `'html5lib'`)
+ - `whitespace_mode` (str, default: `'normalized'`): How to handle whitespace (`'normalized'` intelligently cleans whitespace, `'strict'` preserves original)
+ - `preprocess_html` (bool, default: `False`): Enable HTML preprocessing to clean messy HTML
+ - `preprocessing_preset` (str, default: `'standard'`): Preprocessing aggressiveness (`'minimal'` for basic cleaning, `'standard'` for balanced, `'aggressive'` for heavy cleaning)
+ - `remove_forms` (bool, default: `True`): Remove form elements during preprocessing
+ - `remove_navigation` (bool, default: `True`): Remove navigation elements during preprocessing
+
  ## Contribution

  This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
@@ -458,17 +497,6 @@ uv run python -m html_to_markdown input.html
  uv build
  ```

- ## Performance
-
- The library is optimized for performance with several key features:
-
- - **Efficient ancestor caching**: Reduces repeated DOM traversals using context-aware caching
- - **Streaming support**: Process large documents in chunks to minimize memory usage
- - **Optional lxml parser**: ~30% faster parsing for complex HTML documents
- - **Optimized string operations**: Minimizes string concatenations in hot paths
-
- Typical throughput: ~2 MB/s for regular processing on modern hardware.
-
  ## License

  This library uses the MIT license.
@@ -512,42 +540,6 @@ This library provides comprehensive support for all modern HTML5 elements:

  - `<math>` (MathML support)

- ## Advanced Table Support
-
- The library provides sophisticated handling of complex HTML tables, including merged cells and proper structure conversion:
-
- ```python
- from html_to_markdown import convert_to_markdown
-
- # Complex table with merged cells
- html = """
- <table>
- <caption>Sales Report</caption>
- <tr>
- <th rowspan="2">Product</th>
- <th colspan="2">Quarterly Sales</th>
- </tr>
- <tr>
- <th>Q1</th>
- <th>Q2</th>
- </tr>
- <tr>
- <td>Widget A</td>
- <td>$50K</td>
- <td>$75K</td>
- </tr>
- </table>
- """
-
- result = convert_to_markdown(html)
- ```
-
- **Features:**
-
- - **Merged cell support**: Handles `rowspan` and `colspan` attributes intelligently
- - **Clean output**: Automatically removes table styling elements that don't translate to Markdown
- - **Structure preservation**: Maintains table hierarchy and relationships
-
  ## Acknowledgments

  Special thanks to the original [markdownify](https://pypi.org/project/markdownify/) project creators and contributors.