html-to-markdown 1.11.0__py3-none-any.whl → 1.12.1__py3-none-any.whl


@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: html-to-markdown
- Version: 1.11.0
+ Version: 1.12.1
  Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
  Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
  License: MIT
@@ -320,6 +320,132 @@ def converter(*, tag: Tag, text: str, **kwargs) -> str:
 
  Custom converters take precedence over built-in converters and can be used alongside other configuration options.
 
+ ### Streaming API
+
+ For processing large documents with memory constraints, use the streaming API:
+
+ ```python
+ from html_to_markdown import convert_to_markdown_stream
+
+ # Process large HTML in chunks
+ with open("large_document.html", "r") as f:
+     html_content = f.read()
+
+ # Returns a generator that yields markdown chunks
+ for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
+     print(chunk, end="")
+ ```
+
+ With progress tracking:
+
+ ```python
+ def show_progress(processed: int, total: int):
+     if total > 0:
+         percent = (processed / total) * 100
+         print(f"\rProgress: {percent:.1f}%", end="")
+
+ # Stream with progress callback
+ markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)
+ ```
+
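Review note: the callback in the added example receives `(processed, total)`. The sketch below is not the library's implementation; it is a standard-library stand-in (the `iter_chunks` helper is hypothetical and does no HTML conversion) that shows how a generator can yield fixed-size pieces while reporting progress through a callback of that shape. It assumes both numbers count characters of input, which the README does not state.

```python
from typing import Callable, Iterator, Optional


def iter_chunks(
    text: str,
    chunk_size: int = 2048,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Iterator[str]:
    """Yield `text` in pieces of at most `chunk_size` characters.

    Hypothetical helper for illustration only; it performs no conversion.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(text)
    for start in range(0, total, chunk_size):
        chunk = text[start:start + chunk_size]
        if progress_callback is not None:
            # Report how much of the input has been consumed so far.
            progress_callback(min(start + chunk_size, total), total)
        yield chunk


calls = []
pieces = list(
    iter_chunks(
        "x" * 5000,
        chunk_size=2048,
        progress_callback=lambda done, total: calls.append((done, total)),
    )
)
```

Because the generator is lazy, the callback only fires as the consumer pulls chunks, which is what makes incremental progress reporting possible.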
+ #### When to Use Streaming vs Regular Processing
+
+ Based on comprehensive performance analysis, here are our recommendations:
+
+ **📄 Use Regular Processing When:**
+
+ - Files < 100KB (simplicity preferred)
+ - Simple scripts and one-off conversions
+ - Memory is not a concern
+ - You want the simplest API
+
+ **🌊 Use Streaming Processing When:**
+
+ - Files > 100KB (memory efficiency)
+ - Processing many files in batch
+ - Memory is constrained
+ - You need progress reporting
+ - You want to process results incrementally
+ - Running in production environments
+
+ **📋 Specific Recommendations by File Size:**
+
+ | File Size | Recommendation | Reason |
+ | ---------- | ----------------------------------------------- | -------------------------------------- |
+ | < 50KB | Regular (simplicity) or Streaming (3-5% faster) | Either works well |
+ | 50KB-100KB | Either (streaming slightly preferred) | Minimal difference |
+ | 100KB-1MB | Streaming preferred | Better performance + memory efficiency |
+ | > 1MB | Streaming strongly recommended | Significant memory advantages |
+
+ **🔧 Configuration Recommendations:**
+
+ - **Default chunk_size: 2048 bytes** (optimal performance balance)
+ - **For very large files (>10MB)**: Consider `chunk_size=4096`
+ - **For memory-constrained environments**: Use smaller chunks `chunk_size=1024`
+
+ **📈 Performance Benefits:**
+
+ Streaming provides consistent **3-5% performance improvement** across all file sizes:
+
+ - **Streaming throughput**: ~0.47-0.48 MB/s
+ - **Regular throughput**: ~0.44-0.47 MB/s
+ - **Memory usage**: Streaming uses less peak memory for large files
+ - **Latency**: Streaming allows processing results before completion
+
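Review note: the quoted throughput ranges can be checked against the "3-5%" claim with simple arithmetic. The sketch below divides range endpoints to get the smallest and largest speedup the figures allow. Pairing the endpoints this way is an assumption, since the README does not say which measurements correspond.

```python
# Throughput ranges quoted in the added README text, in MB/s.
streaming = (0.47, 0.48)
regular = (0.44, 0.47)


def speedup_percent(fast: float, slow: float) -> float:
    """Relative improvement of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


# Smallest gain: slowest streaming figure vs fastest regular figure.
lowest = speedup_percent(streaming[0], regular[1])
# Largest gain: fastest streaming figure vs slowest regular figure.
highest = speedup_percent(streaming[1], regular[0])
# Gain between the midpoints of the two ranges.
midpoint = speedup_percent(sum(streaming) / 2, sum(regular) / 2)

print(f"{lowest:.1f}% to {highest:.1f}% (midpoints: {midpoint:.1f}%)")
```

The ranges are compatible with anything from no gain to roughly 9%, with the midpoints about 4.4% apart, so the 3-5% figure sits in the middle of what the quoted numbers permit rather than being a guaranteed floor.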
+ ### Preprocessing API
+
+ The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
+
+ ```python
+ from html_to_markdown import preprocess_html, create_preprocessor
+
+ # Direct preprocessing with custom options
+ cleaned_html = preprocess_html(
+     raw_html,
+     remove_navigation=True,
+     remove_forms=True,
+     remove_scripts=True,
+     remove_styles=True,
+     remove_comments=True,
+     preserve_semantic_structure=True,
+     preserve_tables=True,
+     preserve_media=True,
+ )
+ markdown = convert_to_markdown(cleaned_html)
+
+ # Create a preprocessor configuration from presets
+ config = create_preprocessor(preset="aggressive", preserve_tables=False) # or "minimal", "standard" # Override preset settings
+ markdown = convert_to_markdown(html, **config)
+ ```
+
+ ### Exception Handling
+
+ The library provides specific exception classes for better error handling:
+
+ ````python
+ from html_to_markdown import (
+     convert_to_markdown,
+     HtmlToMarkdownError,
+     EmptyHtmlError,
+     InvalidParserError,
+     ConflictingOptionsError,
+     MissingDependencyError
+ )
+
+ try:
+     markdown = convert_to_markdown(html, parser='lxml')
+ except MissingDependencyError:
+     # lxml not installed
+     markdown = convert_to_markdown(html, parser='html.parser')
+ except EmptyHtmlError:
+     print("No HTML content to convert")
+ except InvalidParserError as e:
+     print(f"Parser error: {e}")
+ except ConflictingOptionsError as e:
+     print(f"Conflicting options: {e}")
+ except HtmlToMarkdownError as e:
+     print(f"Conversion error: {e}")
+
  ## CLI Usage
 
  Convert HTML files directly from the command line with full access to all API options:
@@ -340,7 +466,7 @@ html_to_markdown \
  --preprocess-html \
  --preprocessing-preset aggressive \
  input.html > output.md
- ```
+ ````
 
  ### Key CLI Options
 
@@ -353,6 +479,20 @@ html_to_markdown \
  --whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
  --heading-style {atx,atx_closed,underlined} # Header style
  --no-extract-metadata # Disable metadata extraction
+ --br-in-tables # Use <br> tags for line breaks in table cells
+ --source-encoding ENCODING # Override auto-detected encoding (rarely needed)
+ ```
+
+ **File Encoding:**
+
+ The CLI automatically detects file encoding in most cases. Use `--source-encoding` only when automatic detection fails (typically on some Windows systems or with unusual encodings):
+
+ ```shell
+ # Override auto-detection for Latin-1 encoded file
+ html_to_markdown --source-encoding latin-1 input.html > output.md
+
+ # Force UTF-16 encoding when auto-detection fails
+ html_to_markdown --source-encoding utf-16 input.html > output.md
  ```
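Review note: the new `--source-encoding` flag matters when the bytes on disk are not what automatic detection guesses. The sketch below shows the same idea in Python using only the standard library. The `read_html` helper and its UTF-8-then-Latin-1 fallback order are assumptions for illustration, not a description of how the CLI detects encodings.

```python
from pathlib import Path
from typing import Optional
import tempfile


def read_html(path: Path, source_encoding: Optional[str] = None) -> str:
    """Decode an HTML file, honoring an explicit encoding when given."""
    data = path.read_bytes()
    if source_encoding is not None:
        # Explicit override, analogous to passing --source-encoding.
        return data.decode(source_encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # though the text may be wrong if the file uses another encoding.
        return data.decode("latin-1")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "input.html"
    sample.write_bytes("<p>café</p>".encode("latin-1"))
    fallback_text = read_html(sample)             # UTF-8 fails, Latin-1 is used
    explicit_text = read_html(sample, "latin-1")  # explicit override
```

A silent fallback can produce mojibake for encodings such as UTF-16, which is one reason an explicit override is useful.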
 
  **All Available Options:**
@@ -393,6 +533,7 @@ The `markdownify` function is an alias for `convert_to_markdown` and provides id
  - `newline_style` (str, default: `'spaces'`): Style for handling newlines (`'spaces'` or `'backslash'`)
  - `sub_symbol` (str, default: `''`): Custom symbol for subscript text
  - `sup_symbol` (str, default: `''`): Custom symbol for superscript text
+ - `br_in_tables` (bool, default: `False`): Use `<br>` tags for line breaks in table cells instead of spaces
 
  ### Parser Options
 
@@ -0,0 +1,17 @@
+ html_to_markdown/__init__.py,sha256=TzZzhZDJHeXW_3B9zceYehz2zlttqdLsDr5un8stZLM,653
+ html_to_markdown/__main__.py,sha256=E9d62nVceR_5TUWgVu5L5CnSZxKcnT_7a6ScWZUGE-s,292
+ html_to_markdown/cli.py,sha256=qB8-1jqJPW-YrOmlyOdJnLM6DpKSUIA3iyn1SJaJgKg,9418
+ html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
+ html_to_markdown/converters.py,sha256=fdFT9WwDd3hGpYn0jVbPDcB8OmLPvQUmanbM7aQmzms,35821
+ html_to_markdown/exceptions.py,sha256=ytUOIL0D8r0Jd59RzUPqzmk73i-Mg63zDQYo6S6DBg4,1389
+ html_to_markdown/preprocessor.py,sha256=otnTOhoivJkxaip1Lb9xNMl8q-x9aGFXSYkSrxsTW8g,9591
+ html_to_markdown/processing.py,sha256=xchoTwKZHQW8ejjwLAiMb_AY6XcgPQ6zhLShlduYVuY,35213
+ html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ html_to_markdown/utils.py,sha256=s3A4ET_XyKC-WxzJtH4W0S7cIBGF5fTYIf4JJrqTX8Q,1069
+ html_to_markdown/whitespace.py,sha256=rl3eEwqfMpNWx4FBmbkZ1RxO_Od45p3EZ_7UgKcDAtg,7710
+ html_to_markdown-1.12.1.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
+ html_to_markdown-1.12.1.dist-info/METADATA,sha256=5PoGUeYuGtGmh5q_XwxKlSzq7572CUw3yAVBNmVxDTc,22694
+ html_to_markdown-1.12.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ html_to_markdown-1.12.1.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
+ html_to_markdown-1.12.1.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
+ html_to_markdown-1.12.1.dist-info/RECORD,,
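Review note: each RECORD row has the form `path,sha256=<digest>,size`, where the digest is the file's SHA-256 hash in URL-safe base64 with the trailing `=` padding removed. The sketch below reproduces that encoding (the `record_digest` name is a hypothetical helper). The zero-byte `py.typed` row makes a convenient check, because its digest is simply that of empty input.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style digest: URL-safe base64 of SHA-256, unpadded."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# py.typed is listed with size 0, so its digest is that of empty input.
empty_digest = record_digest(b"")
print(f"html_to_markdown/py.typed,sha256={empty_digest},0")
```

Running the same function over any extracted file and comparing against its row is enough to confirm that the wheel contents match the listing.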
@@ -1,17 +0,0 @@
- html_to_markdown/__init__.py,sha256=TzZzhZDJHeXW_3B9zceYehz2zlttqdLsDr5un8stZLM,653
- html_to_markdown/__main__.py,sha256=E9d62nVceR_5TUWgVu5L5CnSZxKcnT_7a6ScWZUGE-s,292
- html_to_markdown/cli.py,sha256=ilnrJN2XMhPDQ4UkkG4cjLXTvglu_ZJj-bBsohVF3fw,8541
- html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
- html_to_markdown/converters.py,sha256=CbChkRIlOPe0d1MK5-txDE56IG4Ea_dcCV6KRCTjeKY,32497
- html_to_markdown/exceptions.py,sha256=YjfwVCWE_oZakr9iy0E-_aPSYHNaocJZgWeQ9Enty7Q,1212
- html_to_markdown/preprocessor.py,sha256=acmuJJvx1RaXE3c0F_aWsartQE0cEpa3AOnJYGnPzqw,9708
- html_to_markdown/processing.py,sha256=sOIIFNyRkRYAH8Q4ehrh66RY71bkvttSuqzXYsMC5JM,34334
- html_to_markdown/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- html_to_markdown/utils.py,sha256=4Vzk2cCjxN0LAZ1DXQCufYtxE7a6739TYgPbje-VM_E,1086
- html_to_markdown/whitespace.py,sha256=EJ0gEsfLB_wZAk5d5qP4UPhPg0pJJ8LZLRRr_QoL01o,8186
- html_to_markdown-1.11.0.dist-info/licenses/LICENSE,sha256=3J_HR5BWvUM1mlIrlkF32-uC1FM64gy8JfG17LBuheQ,1122
- html_to_markdown-1.11.0.dist-info/METADATA,sha256=Cej6bnqT9JVFzACZvND6Z5-kD0QoabiLi46opAaC11U,17814
- html_to_markdown-1.11.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- html_to_markdown-1.11.0.dist-info/entry_points.txt,sha256=xmFijrTfgYW7lOrZxZGRPciicQHa5KiXKkUhBCmICtQ,116
- html_to_markdown-1.11.0.dist-info/top_level.txt,sha256=Ev6djb1c4dSKr_-n4K-FpEGDkzBigXY6LuZ5onqS7AE,17
- html_to_markdown-1.11.0.dist-info/RECORD,,
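Review note: comparing the two RECORD listings by digest shows which modules actually changed between the releases. The sketch below does this for three rows copied from the listings above; the `parse_record` helper is hypothetical, and the sample leaves out the `RECORD,,` row, whose empty size field would need special handling.

```python
import csv
import io

# Three rows copied from the 1.11.0 listing.
OLD_RECORD = """\
html_to_markdown/cli.py,sha256=ilnrJN2XMhPDQ4UkkG4cjLXTvglu_ZJj-bBsohVF3fw,8541
html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
html_to_markdown/utils.py,sha256=4Vzk2cCjxN0LAZ1DXQCufYtxE7a6739TYgPbje-VM_E,1086
"""
# The same three paths from the 1.12.1 listing.
NEW_RECORD = """\
html_to_markdown/cli.py,sha256=qB8-1jqJPW-YrOmlyOdJnLM6DpKSUIA3iyn1SJaJgKg,9418
html_to_markdown/constants.py,sha256=CKFVHjUZKgi8-lgU6AHPic7X5ChlTkbZt4Jv6VaVjjs,665
html_to_markdown/utils.py,sha256=s3A4ET_XyKC-WxzJtH4W0S7cIBGF5fTYIf4JJrqTX8Q,1069
"""


def parse_record(text: str) -> dict:
    """Map each path to its (digest, size) pair."""
    rows = csv.reader(io.StringIO(text))
    return {path: (digest, int(size)) for path, digest, size in rows}


old, new = parse_record(OLD_RECORD), parse_record(NEW_RECORD)
# A file changed if its digest differs between the two listings.
size_deltas = {
    path: new[path][1] - old[path][1]
    for path in sorted(old.keys() & new.keys())
    if old[path][0] != new[path][0]
}
for path, delta in size_deltas.items():
    print(f"{path}: {delta:+d} bytes")
```

In this sample `constants.py` drops out because its digest is identical in both releases, while `cli.py` grows and `utils.py` shrinks slightly.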