PyPI - html-to-markdown - Versions diffs - 1.15.0__tar.gz → 1.16.0__tar.gz - Mend

html-to-markdown 1.15.0.tar.gz → 1.16.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of html-to-markdown might be problematic.

Files changed (23)

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.15.0
+Version: 1.16.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -55,6 +55,7 @@ Your support helps maintain and improve this library for the community.
 ## Features
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+- **HOCR Support**: Automatic detection and processing of HOCR (HTML-based OCR) documents with clean text extraction and proper spacing
 - **Table Support**: Advanced handling of complex tables with rowspan/colspan support
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
@@ -266,6 +267,63 @@ markdown = convert_to_markdown(html, list_indent_type="tabs")
 html_to_markdown --list-indent-type tabs input.html
 ```
+### Working with HOCR Documents
+HOCR (HTML-based OCR) is a standard format used by OCR software like Tesseract to output structured text with positioning and confidence information. The library automatically detects and processes HOCR documents, extracting clean text while preserving proper spacing and structure.
+**Python:**
+```python
+from html_to_markdown import convert_to_markdown
+# HOCR from Tesseract OCR
+hocr_content = """<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+    <meta name='ocr-system' content='tesseract 5.5.1' />
+    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
+</head>
+<body>
+    <div class='ocr_page' id='page_1'>
+        <div class='ocr_carea' id='block_1_1'>
+            <p class='ocr_par' id='par_1_1'>
+                <span class='ocr_line' id='line_1_1'>
+                    <span class='ocrx_word' id='word_1_1'>Hello</span>
+                    <span class='ocrx_word' id='word_1_2'>world</span>
+                </span>
+            </p>
+        </div>
+    </div>
+</body>
+</html>"""
+# Automatically detected as HOCR and converted to clean text
+markdown = convert_to_markdown(hocr_content)
+print(markdown)  # Output: "Hello world"
+```
+**CLI:**
+```shell
+# Process HOCR files directly
+tesseract image.png output hocr
+html_to_markdown output.hocr
+# Or pipe directly from Tesseract
+tesseract image.png - hocr | html_to_markdown
+```
+**Features:**
+- **Automatic Detection**: No configuration needed - HOCR documents are detected automatically
+- **Clean Output**: Removes OCR metadata, bounding boxes, and confidence scores
+- **Proper Spacing**: Maintains correct word spacing and text structure
+- **Multi-language Support**: Works with HOCR output in any language
+- **Performance Optimized**: Efficient processing of large OCR documents
+- **Error Resilient**: Handles malformed or incomplete HOCR gracefully
 ## Advanced Usage
 ### Configuration Example

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/README.md RENAMED

@@ -15,6 +15,7 @@ Your support helps maintain and improve this library for the community.
 ## Features
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+- **HOCR Support**: Automatic detection and processing of HOCR (HTML-based OCR) documents with clean text extraction and proper spacing
 - **Table Support**: Advanced handling of complex tables with rowspan/colspan support
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
@@ -226,6 +227,63 @@ markdown = convert_to_markdown(html, list_indent_type="tabs")
 html_to_markdown --list-indent-type tabs input.html
 ```
+### Working with HOCR Documents
+HOCR (HTML-based OCR) is a standard format used by OCR software like Tesseract to output structured text with positioning and confidence information. The library automatically detects and processes HOCR documents, extracting clean text while preserving proper spacing and structure.
+**Python:**
+```python
+from html_to_markdown import convert_to_markdown
+# HOCR from Tesseract OCR
+hocr_content = """<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+    <meta name='ocr-system' content='tesseract 5.5.1' />
+    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
+</head>
+<body>
+    <div class='ocr_page' id='page_1'>
+        <div class='ocr_carea' id='block_1_1'>
+            <p class='ocr_par' id='par_1_1'>
+                <span class='ocr_line' id='line_1_1'>
+                    <span class='ocrx_word' id='word_1_1'>Hello</span>
+                    <span class='ocrx_word' id='word_1_2'>world</span>
+                </span>
+            </p>
+        </div>
+    </div>
+</body>
+</html>"""
+# Automatically detected as HOCR and converted to clean text
+markdown = convert_to_markdown(hocr_content)
+print(markdown)  # Output: "Hello world"
+```
+**CLI:**
+```shell
+# Process HOCR files directly
+tesseract image.png output hocr
+html_to_markdown output.hocr
+# Or pipe directly from Tesseract
+tesseract image.png - hocr | html_to_markdown
+```
+**Features:**
+- **Automatic Detection**: No configuration needed - HOCR documents are detected automatically
+- **Clean Output**: Removes OCR metadata, bounding boxes, and confidence scores
+- **Proper Spacing**: Maintains correct word spacing and text structure
+- **Multi-language Support**: Works with HOCR output in any language
+- **Performance Optimized**: Efficient processing of large OCR documents
+- **Error Resilient**: Handles malformed or incomplete HOCR gracefully
 ## Advanced Usage
 ### Configuration Example

html_to_markdown-1.16.0/html_to_markdown/hocr_processor.py ADDED

@@ -0,0 +1,128 @@
+"""HOCR (HTML-based OCR) document processing utilities.
+This module handles the conversion of HOCR documents to clean markdown text,
+including proper spacing, layout preservation, and metadata suppression.
+"""
+from __future__ import annotations
+import re
+from typing import TYPE_CHECKING, ClassVar
+from bs4 import Tag
+from bs4.element import NavigableString, PageElement
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+class HOCRProcessor:
+    """Handles HOCR-specific document processing."""
+    _HOCR_PATTERNS: ClassVar[list[re.Pattern[str]]] = [
+        re.compile(r'class\s*=\s*["\'].*?ocr_page.*?["\']', re.IGNORECASE),
+        re.compile(r'class\s*=\s*["\'].*?ocrx_word.*?["\']', re.IGNORECASE),
+        re.compile(r'name\s*=\s*["\']ocr-system["\']', re.IGNORECASE),
+        re.compile(r'class\s*=\s*["\'].*?ocr_carea.*?["\']', re.IGNORECASE),
+        re.compile(r'class\s*=\s*["\'].*?ocr_par.*?["\']', re.IGNORECASE),
+        re.compile(r'class\s*=\s*["\'].*?ocr_line.*?["\']', re.IGNORECASE),
+    ]
+    @classmethod
+    def is_hocr_document(cls, content: str) -> bool:
+        """Check if content is an HOCR document.
+        Args:
+            content: Raw HTML/XML content to check
+        Returns:
+            True if content appears to be HOCR format
+        Raises:
+            ValueError: If content is too large (>10MB)
+        """
+        if len(content) > 10_000_000:
+            raise ValueError("Document too large for HOCR processing")
+        content_sample = content[:50000]
+        return any(pattern.search(content_sample) for pattern in cls._HOCR_PATTERNS)
+    @classmethod
+    def is_hocr_word_element(cls, tag: Tag | None) -> bool:
+        """Check if a tag is an HOCR word element.
+        Args:
+            tag: BeautifulSoup tag to check
+        Returns:
+            True if tag is a span with ocrx_word class
+        """
+        if not tag or tag.name != "span":
+            return False
+        class_attr = tag.get("class")
+        if isinstance(class_attr, list):
+            return "ocrx_word" in class_attr
+        return class_attr == "ocrx_word"
+    @classmethod
+    def should_add_space_before_word(cls, children: list[PageElement], current_index: int) -> bool:
+        """Determine if space should be added before an HOCR word.
+        Args:
+            children: List of child elements
+            current_index: Index of current element
+        Returns:
+            True if a space should be added before this word
+        """
+        if not (0 < current_index < len(children)):
+            return False
+        prev_element = children[current_index - 1]
+        if isinstance(prev_element, NavigableString):
+            text_content = str(prev_element)
+            return not (text_content.strip() or " " in text_content)
+        return isinstance(prev_element, Tag) and cls.is_hocr_word_element(prev_element)
+    @classmethod
+    def is_hocr_element_in_soup(cls, soup: BeautifulSoup) -> bool:
+        """Check if parsed soup contains HOCR elements.
+        Args:
+            soup: Parsed BeautifulSoup document
+        Returns:
+            True if soup contains HOCR elements
+        """
+        return bool(
+            soup.find("meta", attrs={"name": "ocr-system"})
+            or soup.find("meta", attrs={"name": "ocr-capabilities"})
+            or soup.find(class_="ocr_page")
+            or soup.find(class_="ocrx_word")
+            or soup.find(class_="ocr_carea")
+            or soup.find(class_="ocr_par")
+            or soup.find(class_="ocr_line")
+        )
+    @classmethod
+    def get_optimal_parser(cls, content: str, lxml_available: bool) -> str:
+        """Get optimal parser for HOCR content.
+        Args:
+            content: Document content
+            lxml_available: Whether lxml is available
+        Returns:
+            Parser name to use ('xml', 'lxml', or 'html.parser')
+        """
+        try:
+            if cls.is_hocr_document(content) and lxml_available:
+                return "xml"
+        except ValueError:
+            pass
+        return "lxml" if lxml_available else "html.parser"
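
Note on the detection heuristic added above: `is_hocr_document` returns true as soon as any one pattern matches within the first 50,000 characters, and `get_optimal_parser` swallows the size-guard `ValueError`. The sketch below is a minimal standalone approximation of that behaviour, not the library's code. It copies two of the six patterns verbatim from the diff; the names `looks_like_hocr` and `pick_parser` and the sample strings are hypothetical and exist only for illustration.

```python
import re

# Two of the six patterns from the diff, copied verbatim.
HOCR_PATTERNS = [
    re.compile(r'class\s*=\s*["\'].*?ocr_page.*?["\']', re.IGNORECASE),
    re.compile(r'name\s*=\s*["\']ocr-system["\']', re.IGNORECASE),
]

MAX_CHARS = 10_000_000  # size guard from the diff; len() counts characters, not bytes
SAMPLE_CHARS = 50_000   # only this prefix of the document is scanned


def looks_like_hocr(content: str) -> bool:
    """Hypothetical stand-in for HOCRProcessor.is_hocr_document."""
    if len(content) > MAX_CHARS:
        raise ValueError("Document too large for HOCR processing")
    sample = content[:SAMPLE_CHARS]
    return any(pattern.search(sample) for pattern in HOCR_PATTERNS)


def pick_parser(content: str, lxml_available: bool) -> str:
    """Hypothetical stand-in for HOCRProcessor.get_optimal_parser."""
    try:
        if looks_like_hocr(content) and lxml_available:
            return "xml"
    except ValueError:
        pass  # oversized input falls through to the default parser choice
    return "lxml" if lxml_available else "html.parser"


hocr = "<div class='ocr_page' id='page_1'><span class='ocrx_word'>Hi</span></div>"
plain = '<p class="intro">Hi</p>'
# The lazy `.*?` may run past the closing quote of the class attribute, so a
# later mention of ocr_page followed by any quote on the same line also matches.
lookalike = '<p class="note">see ocr_page "docs"</p>'
# A marker that appears only after the sampled prefix is not seen.
late_marker = "x" * SAMPLE_CHARS + hocr

for label, doc in [("hocr", hocr), ("plain", plain),
                   ("lookalike", lookalike), ("late_marker", late_marker)]:
    print(label, looks_like_hocr(doc), pick_parser(doc, lxml_available=True))
```

Under these assumptions, the `"xml"` parser is chosen only when lxml is installed, so without lxml an HOCR document is parsed with `html.parser` like any other input.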

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/html_to_markdown/processing.py RENAMED

@@ -38,6 +38,7 @@ from html_to_markdown.constants import (
 )
 from html_to_markdown.converters import Converter, ConvertersMap, SupportedElements, create_converters_map
 from html_to_markdown.exceptions import ConflictingOptionsError, EmptyHtmlError, MissingDependencyError
+from html_to_markdown.hocr_processor import HOCRProcessor
 from html_to_markdown.utils import escape
 from html_to_markdown.whitespace import WhitespaceHandler
@@ -150,6 +151,11 @@ def _get_list_indent(list_indent_type: str, list_indent_width: int) -> str:
     return " " * list_indent_width
+_is_hocr_document = HOCRProcessor.is_hocr_document
+_is_hocr_word_element = HOCRProcessor.is_hocr_word_element
+_should_add_space_before_hocr_word = HOCRProcessor.should_add_space_before_word
 def _is_nested_tag(el: PageElement) -> bool:
     return isinstance(el, Tag) and el.name in {
         "ol",
@@ -244,6 +250,10 @@ def _process_tag(
             )
         elif isinstance(el, Tag):
             current_text = "".join(text_parts)
+            if _is_hocr_word_element(el) and _should_add_space_before_hocr_word(children, i):
+                text_parts.append(" ")
             text_parts.append(
                 _process_tag(
                     el,
@@ -588,7 +598,7 @@ def convert_to_markdown(
         if "".join(source.split("\n")):
             if parser is None:
-                parser = "lxml" if LXML_AVAILABLE else "html.parser"
+                parser = HOCRProcessor.get_optimal_parser(source, LXML_AVAILABLE)
             if parser == "lxml" and not LXML_AVAILABLE:
                 raise MissingDependencyError("lxml", "pip install html-to-markdown[lxml]")
@@ -949,7 +959,7 @@ def _process_html_core(
             if "".join(source.split("\n")):
                 if parser is None:
-                    parser = "lxml" if LXML_AVAILABLE else "html.parser"
+                    parser = HOCRProcessor.get_optimal_parser(source, LXML_AVAILABLE)
                 if parser == "lxml" and not LXML_AVAILABLE:  # pragma: no cover
                     raise MissingDependencyError("lxml", "pip install html-to-markdown[lxml]")
@@ -1012,7 +1022,7 @@ def _process_html_core(
         if custom_converters:
             converters_map.update(cast("ConvertersMap", custom_converters))
-        if extract_metadata and not convert_as_inline:
+        if extract_metadata and not convert_as_inline and not HOCRProcessor.is_hocr_element_in_soup(source):
             metadata = _extract_metadata(source)
             metadata_comment = _format_metadata_comment(metadata)
             if metadata_comment:
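
Note on the spacing rule wired into `_process_tag` above: a space is prepended to an `ocrx_word` span depending on its previous sibling. The sketch below restates that rule on plain Python values so it can be run without BeautifulSoup. It is an illustration under stated assumptions, not the library's implementation: the `Word` dataclass is a hypothetical stand-in for a `<span class='ocrx_word'>` tag, a plain `str` stands in for a `NavigableString`, and `needs_space_before` is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a <span class='ocrx_word'> tag."""
    text: str


def needs_space_before(children: list[Word | str], index: int) -> bool:
    """Restates the rule of should_add_space_before_word on plain values."""
    # The first child, or an out-of-range index, never gets a leading space.
    if not (0 < index < len(children)):
        return False
    prev = children[index - 1]
    if isinstance(prev, str):
        # Text node: true only when it is pure whitespace that contains
        # no literal space character (for example a bare newline).
        return not (prev.strip() or " " in prev)
    # Otherwise a space is added only directly after another word element.
    return isinstance(prev, Word)


adjacent = [Word("Hello"), Word("world")]
newline_between = [Word("Hello"), "\n", Word("world")]
indented_between = [Word("Hello"), "\n    ", Word("world")]
after_text = ["Total:", Word("42")]

print(needs_space_before(adjacent, 1))          # True
print(needs_space_before(newline_between, 2))   # True
print(needs_space_before(indented_between, 2))  # False
print(needs_space_before(after_text, 1))        # False
```

In the indented case the rule returns false, presumably because the text node already contains a space that the converter's regular whitespace handling can use; that reading is an inference from the diff rather than something it states.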

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/html_to_markdown.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: html-to-markdown
-Version: 1.15.0
+Version: 1.16.0
 Summary: A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options
 Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
 License: MIT
@@ -55,6 +55,7 @@ Your support helps maintain and improve this library for the community.
 ## Features
 - **Full HTML5 Support**: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
+- **HOCR Support**: Automatic detection and processing of HOCR (HTML-based OCR) documents with clean text extraction and proper spacing
 - **Table Support**: Advanced handling of complex tables with rowspan/colspan support
 - **Type Safety**: Strict MyPy adherence with comprehensive type hints
 - **Metadata Extraction**: Automatic extraction of document metadata (title, meta tags) as comment headers
@@ -266,6 +267,63 @@ markdown = convert_to_markdown(html, list_indent_type="tabs")
 html_to_markdown --list-indent-type tabs input.html
 ```
+### Working with HOCR Documents
+HOCR (HTML-based OCR) is a standard format used by OCR software like Tesseract to output structured text with positioning and confidence information. The library automatically detects and processes HOCR documents, extracting clean text while preserving proper spacing and structure.
+**Python:**
+```python
+from html_to_markdown import convert_to_markdown
+# HOCR from Tesseract OCR
+hocr_content = """<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+    <meta name='ocr-system' content='tesseract 5.5.1' />
+    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
+</head>
+<body>
+    <div class='ocr_page' id='page_1'>
+        <div class='ocr_carea' id='block_1_1'>
+            <p class='ocr_par' id='par_1_1'>
+                <span class='ocr_line' id='line_1_1'>
+                    <span class='ocrx_word' id='word_1_1'>Hello</span>
+                    <span class='ocrx_word' id='word_1_2'>world</span>
+                </span>
+            </p>
+        </div>
+    </div>
+</body>
+</html>"""
+# Automatically detected as HOCR and converted to clean text
+markdown = convert_to_markdown(hocr_content)
+print(markdown)  # Output: "Hello world"
+```
+**CLI:**
+```shell
+# Process HOCR files directly
+tesseract image.png output hocr
+html_to_markdown output.hocr
+# Or pipe directly from Tesseract
+tesseract image.png - hocr | html_to_markdown
+```
+**Features:**
+- **Automatic Detection**: No configuration needed - HOCR documents are detected automatically
+- **Clean Output**: Removes OCR metadata, bounding boxes, and confidence scores
+- **Proper Spacing**: Maintains correct word spacing and text structure
+- **Multi-language Support**: Works with HOCR output in any language
+- **Performance Optimized**: Efficient processing of large OCR documents
+- **Error Resilient**: Handles malformed or incomplete HOCR gracefully
 ## Advanced Usage
 ### Configuration Example

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/html_to_markdown.egg-info/SOURCES.txt RENAMED

@@ -7,6 +7,7 @@ html_to_markdown/cli.py
 html_to_markdown/constants.py
 html_to_markdown/converters.py
 html_to_markdown/exceptions.py
+html_to_markdown/hocr_processor.py
 html_to_markdown/preprocessor.py
 html_to_markdown/processing.py
 html_to_markdown/py.typed

{html_to_markdown-1.15.0 → html_to_markdown-1.16.0}/pyproject.toml RENAMED

@@ -5,7 +5,7 @@ requires = [ "setuptools>=78.1" ]
 [project]
 name = "html-to-markdown"
-version = "1.15.0"
+version = "1.16.0"
 description = "A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options"
 readme = "README.md"
 keywords = [
@@ -69,7 +69,7 @@ dev = [
   "pytest-benchmark>=5.1",
   "pytest-cov>=7",
   "pytest-mock>=3.15.1",
-  "ruff>=0.13.1",
+  "ruff>=0.13.2",
   "types-beautifulsoup4>=4.12.0.20250516",
   "types-psutil>=7.0.0.20250822",
   "uv-bump",