PyPI - kreuzberg - Versions diffs - 3.15.0__tar.gz → 3.16.0__tar.gz - Mend

kreuzberg 3.15.0__tar.gz → 3.16.0__tar.gz

Files changed (295)

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/.github/workflows/ci.yaml RENAMED

@@ -212,7 +212,7 @@ jobs:
         uses: actions/checkout@v5
       - name: Download Coverage Artifacts
-        uses: actions/download-artifact@v4
+        uses: actions/download-artifact@v5
         with:
           pattern: coverage-*-${{ github.sha }}
           merge-multiple: true

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/.pre-commit-config.yaml RENAMED

@@ -11,7 +11,7 @@ repos:
       - id: name-tests-test
         args:
           - --pytest
-        exclude: factories|test_utils|completion.py|test_data
+        exclude: factories|test_utils|completion.py|test_data|docker_e2e.py
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: check-toml

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kreuzberg
-Version: 3.15.0
+Version: 3.16.0
 Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
 Project-URL: documentation, https://kreuzberg.dev
 Project-URL: homepage, https://github.com/Goldziher/kreuzberg
@@ -31,7 +31,7 @@ Requires-Python: >=3.10
 Requires-Dist: anyio>=4.10.0
 Requires-Dist: chardetng-py>=0.3.5
 Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
-Requires-Dist: html-to-markdown[lxml]>=1.11.0
+Requires-Dist: html-to-markdown[lxml]>=1.13.0
 Requires-Dist: mcp>=1.14.0
 Requires-Dist: msgspec>=0.18.0
 Requires-Dist: numpy>=2.0.0
@@ -109,7 +109,7 @@ Description-Content-Type: text/markdown
 - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
 - **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
 - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
-- **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
+- **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
 - **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
 - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)
@@ -227,14 +227,15 @@ claude mcp add kreuzberg uvx kreuzberg-mcp
 ## Supported Formats
-| Category          | Formats                        |
-| ----------------- | ------------------------------ |
-| **Documents**     | PDF, DOCX, DOC, RTF, TXT, EPUB |
-| **Images**        | JPG, PNG, TIFF, BMP, GIF, WEBP |
-| **Spreadsheets**  | XLSX, XLS, CSV, ODS            |
-| **Presentations** | PPTX, PPT, ODP                 |
-| **Web**           | HTML, XML, MHTML               |
-| **Archives**      | Support via extraction         |
+| Category            | Formats                        |
+| ------------------- | ------------------------------ |
+| **Documents**       | PDF, DOCX, DOC, RTF, TXT, EPUB |
+| **Images**          | JPG, PNG, TIFF, BMP, GIF, WEBP |
+| **Spreadsheets**    | XLSX, XLS, CSV, ODS            |
+| **Presentations**   | PPTX, PPT, ODP                 |
+| **Web**             | HTML, XML, MHTML               |
+| **Structured Data** | JSON, YAML, TOML               |
+| **Archives**        | Support via extraction         |
 ## 📊 Performance Characteristics

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/README.md RENAMED

@@ -18,7 +18,7 @@
 - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
 - **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
 - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
-- **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
+- **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
 - **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
 - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)
@@ -136,14 +136,15 @@ claude mcp add kreuzberg uvx kreuzberg-mcp
 ## Supported Formats
-| Category          | Formats                        |
-| ----------------- | ------------------------------ |
-| **Documents**     | PDF, DOCX, DOC, RTF, TXT, EPUB |
-| **Images**        | JPG, PNG, TIFF, BMP, GIF, WEBP |
-| **Spreadsheets**  | XLSX, XLS, CSV, ODS            |
-| **Presentations** | PPTX, PPT, ODP                 |
-| **Web**           | HTML, XML, MHTML               |
-| **Archives**      | Support via extraction         |
+| Category            | Formats                        |
+| ------------------- | ------------------------------ |
+| **Documents**       | PDF, DOCX, DOC, RTF, TXT, EPUB |
+| **Images**          | JPG, PNG, TIFF, BMP, GIF, WEBP |
+| **Spreadsheets**    | XLSX, XLS, CSV, ODS            |
+| **Presentations**   | PPTX, PPT, ODP                 |
+| **Web**             | HTML, XML, MHTML               |
+| **Structured Data** | JSON, YAML, TOML               |
+| **Archives**        | Support via extraction         |
 ## 📊 Performance Characteristics

kreuzberg-3.16.0/Taskfile.yml ADDED

@@ -0,0 +1,50 @@
+version: "3"
+env:
+  DOCKER_BUILDKIT: 1
+  BUILDKIT_PROGRESS: plain
+tasks:
+  setup:
+    desc: "Install dependencies with uv"
+    cmds:
+      - uv sync --all-extras --all-packages
+      - pre-commit install && pre-commit install --hook-type commit-msg
+  update:
+    desc: "Update the dependencies"
+    cmds:
+      - uv run uv-bump
+      - cd benchmarks && uv run uv-bump && cd -
+      - uv sync --all-extras --all-packages --upgrade
+      - pre-commit autoupdate
+  test:
+    desc: "Run tests with pytest"
+    cmds:
+      - uv run pytest
+  test:cov:
+    desc: "Run tests with coverage"
+    cmds:
+      - uv run pytest --cov
+  lint:
+    desc: "Lint code with ruff and docs with markdownlint"
+    cmds:
+      - pre-commit run --all-files
+  docs:build:
+    desc: "Build documentation"
+    cmds:
+      - uv run mkdocs build --clean --strict
+  docs:serve:
+    desc: "Serve documentation locally"
+    cmds:
+      - uv run mkdocs serve
+  default:
+    desc: "Show available tasks"
+    cmds:
+      - task --list

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/docs/api-reference/types.md RENAMED

@@ -72,6 +72,18 @@ Configuration options for automatic language detection:
 ::: kreuzberg.LanguageDetectionConfig
+## JSON Extraction Configuration
+Configuration for enhanced JSON document processing:
+::: kreuzberg.JSONExtractionConfig
+## HTML to Markdown Configuration
+Configuration options for converting HTML content to Markdown:
+::: kreuzberg.HTMLToMarkdownConfig
 ## PSMMode (Page Segmentation Mode)
 ::: kreuzberg.PSMMode

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/docs/examples/extraction-examples.md RENAMED

@@ -525,13 +525,95 @@ async def comprehensive_extraction():
     print(f"Total text (including OCR): {len(all_text)} characters")
 ```
+## JSON and Structured Data Extraction
+### Basic JSON Extraction
+```python
+from kreuzberg import extract_file_sync
+# Simple JSON extraction
+result = extract_file_sync("data.json")
+print(result.content)
+# Metadata includes detected text fields
+print(f"Title: {result.metadata.get('title')}")
+print(f"Description: {result.metadata.get('description')}")
+```
+### Advanced JSON with Schema Extraction
+```python
+from kreuzberg import extract_file_sync, ExtractionConfig, JSONExtractionConfig
+# Configure advanced JSON extraction
+json_config = JSONExtractionConfig(
+    extract_schema=True,  # Extract JSON structure
+    custom_text_field_patterns=frozenset({"summary", "abstract"}),  # Custom fields
+    include_type_info=True,  # Add type annotations
+    flatten_nested_objects=True,  # Flatten nested structures
+    max_depth=5,  # Limit schema depth
+    array_item_limit=100,  # Limit array processing
+)
+config = ExtractionConfig(json_config=json_config)
+result = extract_file_sync("complex.json", config=config)
+# Access schema information
+if "json_schema" in result.metadata:
+    schema = result.metadata["json_schema"]
+    print(f"Root type: {schema['type']}")
+    print(f"Properties: {list(schema.get('properties', {}).keys())}")
+# Access nested attributes with dotted notation
+if "attributes" in result.metadata:
+    attrs = result.metadata["attributes"]
+    # Nested fields like {"info": {"title": "Example"}} become "info.title"
+    print(f"Nested title: {attrs.get('info.title')}")
+```
+### YAML and TOML Processing
+```python
+from kreuzberg import extract_file_sync
+# YAML extraction (similar to JSON)
+yaml_result = extract_file_sync("config.yaml")
+print(yaml_result.content)
+# TOML extraction
+toml_result = extract_file_sync("pyproject.toml")
+print(toml_result.content)
+# Both formats support the same metadata extraction as JSON
+print(f"Package name: {toml_result.metadata.get('name')}")
+```
+### Working with API Responses
+```python
+import httpx
+from kreuzberg import extract_bytes_sync, ExtractionConfig, JSONExtractionConfig
+# Fetch JSON from API
+response = httpx.get("https://api.example.com/data")
+# Extract with schema
+config = ExtractionConfig(json_config=JSONExtractionConfig(extract_schema=True))
+result = extract_bytes_sync(response.content, mime_type="application/json", config=config)
+print(f"API Response: {result.content}")
+print(f"Schema: {result.metadata.get('json_schema')}")
+```
 ## Batch Processing
 ```python
 from kreuzberg import batch_extract_file, ExtractionConfig
 async def process_documents():
-    file_paths = ["document1.pdf", "document2.docx", "image.jpg"]
+    file_paths = ["document1.pdf", "document2.docx", "data.json", "image.jpg"]
     config = ExtractionConfig()  # Optional: configure extraction options
     results = await batch_extract_file(file_paths, config=config)

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/docs/user-guide/extraction-configuration.md RENAMED

@@ -94,6 +94,14 @@ strong_em_symbol = "_"
 escape_underscores = false
 wrap = true
 wrap_width = 100
+list_indent_width = 2                # Use 2 spaces for Discord/Slack compatibility
+list_indent_type = "spaces"          # Use spaces instead of tabs
+whitespace_mode = "normalized"       # Handle whitespace intelligently
+br_in_tables = false                 # Use spaces instead of <br> in tables
+highlight_style = "double-equal"     # Style for highlighted text
+newline_style = "spaces"             # Style for line breaks
+preprocess_html = true               # Clean messy HTML before conversion
+preprocessing_preset = "standard"    # Level of HTML cleaning
 ```
 ### pyproject.toml Example
@@ -623,6 +631,58 @@ For better performance in production:
 - Enable deduplication to avoid redundant processing
 - Use selective extraction based on document types
+### JSON Extraction Configuration
+Kreuzberg provides enhanced JSON document processing with schema extraction and customizable field detection:
+```python
+from kreuzberg import extract_file, ExtractionConfig, JSONExtractionConfig
+# Advanced JSON extraction with schema
+result = await extract_file(
+    "data.json",
+    config=ExtractionConfig(
+        json_config=JSONExtractionConfig(
+            extract_schema=True,  # Extract JSON structure schema
+            include_type_info=True,  # Add type annotations to output
+            flatten_nested_objects=True,  # Flatten nested objects in output
+            custom_text_field_patterns=frozenset({"summary", "abstract"}),  # Additional text fields
+            max_depth=10,  # Maximum nesting depth for schema
+            array_item_limit=1000,  # Limit array processing for performance
+        )
+    ),
+)
+# Access schema and nested attributes
+if result.metadata.get("json_schema"):
+    print(f"JSON Schema: {result.metadata['json_schema']}")
+if result.metadata.get("attributes"):
+    print(f"Nested fields: {result.metadata['attributes']}")
+```
+#### Configuration File Support
+Add JSON configuration to your `kreuzberg.toml`:
+```toml
+[json_config]
+extract_schema = true              # Extract JSON structure schema
+include_type_info = false          # Add type annotations to output
+flatten_nested_objects = true      # Flatten nested objects in output
+custom_text_field_patterns = ["summary", "abstract"]  # Additional text fields to extract
+max_depth = 10                     # Maximum nesting depth for schema extraction
+array_item_limit = 1000           # Limit array processing for performance
+```
+#### Key Features
+- **High Performance**: Uses msgspec for fast JSON parsing, significantly faster than standard library
+- **Schema Extraction**: Automatically extracts the structure of your JSON data, useful for understanding complex documents
+- **Custom Field Detection**: Configure additional text fields beyond defaults (title, name, description, content, body, text, message)
+- **Type Information**: Optionally include data type annotations in extracted content for better understanding
+- **Nested Object Control**: Choose between flattened or hierarchical output based on your needs
+- **Memory Protection**: Array item limits prevent memory issues with large datasets
 ### Entity and Keyword Extraction
 Kreuzberg can extract named entities and keywords from documents using spaCy for entity recognition and KeyBERT for keyword extraction:
@@ -833,7 +893,14 @@ html_config = HTMLToMarkdownConfig(
     escape_underscores=False,
     wrap=True,
     wrap_width=100,
-    preprocessing_preset="standard",
+    list_indent_width=2,  # Discord/Slack compatible spacing
+    list_indent_type="spaces",  # Use spaces for indentation
+    whitespace_mode="normalized",  # Smart whitespace handling
+    br_in_tables=False,  # Use spaces in table cells
+    highlight_style="double-equal",  # ==highlighted== text style
+    newline_style="spaces",  # Line break style
+    preprocess_html=True,  # Clean HTML before conversion
+    preprocessing_preset="standard",  # HTML cleaning level
 )
 result = await extract_file(

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/docs/user-guide/metadata-extraction.md RENAMED

@@ -49,6 +49,54 @@ For PDF documents, Kreuzberg extracts a rich set of metadata including:
 If a PDF document contains UTF-16BE encoded strings (often present in PDF metadata with a byte order mark `\xfe\xff`), Kreuzberg will automatically detect and decode these properly.
+## Structured Data Metadata
+For JSON, YAML, and TOML files, Kreuzberg provides specialized metadata extraction:
+### Text Field Detection
+Kreuzberg automatically identifies and extracts common text fields:
+- **Default fields**: `title`, `name`, `description`, `content`, `body`, `text`, `message`
+- **Custom fields**: Configure additional patterns via `JSONExtractionConfig`
+### Nested Attributes
+Complex nested fields are stored in `metadata.attributes` with dotted key notation:
+```python
+from kreuzberg import extract_file_sync
+# Example JSON with nested structure
+result = extract_file_sync("complex.json")
+# Access nested fields via attributes
+if "attributes" in result.metadata:
+    # Nested fields like {"info": {"title": "Example"}} become "info.title"
+    nested_title = result.metadata["attributes"].get("info.title")
+    # Array items are indexed: {"items": [{"name": "first"}]} becomes "items[0].name"
+    first_item = result.metadata["attributes"].get("items[0].name")
+```
+### Schema Extraction
+When enabled, Kreuzberg extracts the JSON structure:
+```python
+from kreuzberg import extract_file_sync, ExtractionConfig, JSONExtractionConfig
+config = ExtractionConfig(json_config=JSONExtractionConfig(extract_schema=True))
+result = extract_file_sync("data.json", config=config)
+# Access the schema
+if "json_schema" in result.metadata:
+    schema = result.metadata["json_schema"]
+    print(f"Root type: {schema['type']}")
+    if "properties" in schema:
+        print(f"Properties: {list(schema['properties'].keys())}")
+```
 ## Working with Multiple Document Types
 When working with multiple document types, it's important to remember that different document formats may provide different metadata fields. Always use defensive programming (like using `.get()` with a default value) when accessing metadata fields:
@@ -57,6 +105,9 @@ When working with multiple document types, it's important to remember that diffe
 # Safe way to access metadata across different document types
 author = result.metadata.get("authors", ["Unknown"])[0] if "authors" in result.metadata else "Unknown"
 creation_date = result.metadata.get("created_at", "Unknown date")
+# For structured data with nested attributes
+nested_fields = result.metadata.get("attributes", {})
 ```
 ## Viewing Available Metadata
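
The "Nested Attributes" documentation added in this file describes a dotted key notation (`info.title`, `items[0].name`) for `metadata.attributes`. The sketch below shows one way such keys could be produced from a parsed JSON value. It is an illustration of the notation only, not kreuzberg's implementation; the `flatten` helper is a hypothetical name.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into a single dict with dotted keys.

    Hypothetical helper: dict keys are joined with ".", list items are
    addressed as "[index]", matching the notation shown in the docs above.
    """
    flat: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Leaf value: store it under the path accumulated so far.
        flat[prefix] = value
    return flat


document = {"info": {"title": "Example"}, "items": [{"name": "first"}]}
attributes = flatten(document)
print(attributes.get("info.title"))
print(attributes.get("items[0].name"))
```

Because the keys are plain strings, lookups on the result work with `.get()` in the same defensive style the documentation recommends for other metadata fields.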

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/docs/user-guide/supported-formats.md RENAMED

@@ -1,6 +1,6 @@
 # Supported Formats
-Kreuzberg handles a wide range of document, image, and text formats.
+Kreuzberg handles a wide range of document, image, text, and structured data formats.
 ## Document Formats
@@ -36,6 +36,19 @@ Kreuzberg handles a wide range of document, image, and text formats.
 - EndNote and JATS XML (`.xml`)
 - RIS (`.ris`)
+## Structured Data Formats
+- JSON (`.json`) - High-performance extraction using msgspec with schema analysis
+- YAML (`.yaml`, `.yml`) - Full YAML 1.2 support with nested structure extraction
+- TOML (`.toml`) - Configuration and metadata files with type-aware processing
+These formats benefit from:
+- **Schema extraction**: Automatically analyze and extract the structure of your data
+- **Custom field detection**: Configure additional text fields for specialized extraction
+- **Type information**: Optionally include data type annotations in extracted content
+- **Performance optimization**: Uses msgspec for efficient JSON parsing
 ## Image Formats
 - JPEG (`.jpg`, `.jpeg`, `.pjpeg`)

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/kreuzberg/__init__.py RENAMED

@@ -8,8 +8,10 @@ from ._types import (
     ExtractionConfig,
     ExtractionResult,
     GMFTConfig,
+    HTMLToMarkdownConfig,
     ImageOCRConfig,
     ImageOCRResult,
+    JSONExtractionConfig,
     LanguageDetectionConfig,
     Metadata,
     PaddleOCRConfig,
@@ -40,8 +42,10 @@ __all__ = [
     "ExtractionResult",
     "ExtractorRegistry",
     "GMFTConfig",
+    "HTMLToMarkdownConfig",
     "ImageOCRConfig",
     "ImageOCRResult",
+    "JSONExtractionConfig",
     "KreuzbergError",
     "LanguageDetectionConfig",
     "Metadata",

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/kreuzberg/_api/main.py RENAMED

@@ -13,10 +13,8 @@ from typing_extensions import TypedDict
 from kreuzberg import (
     EasyOCRConfig,
-    ExtractedImage,
     ExtractionConfig,
     ExtractionResult,
-    ImageOCRResult,
     KreuzbergError,
     MissingDependencyError,
     PaddleOCRConfig,
@@ -40,30 +38,6 @@ if TYPE_CHECKING:
     from litestar.datastructures import UploadFile
-class ExtractedImageDict(TypedDict):
-    """TypedDict for extracted image JSON representation."""
-    data: str
-    format: str
-    filename: str | None
-    page_number: int | None
-    dimensions: tuple[int, int] | None
-    colorspace: str | None
-    bits_per_component: int | None
-    is_mask: bool
-    description: str | None
-class ImageOCRResultDict(TypedDict):
-    """TypedDict for image OCR result JSON representation."""
-    image: ExtractedImageDict
-    ocr_result: Any
-    confidence_score: float | None
-    processing_time: float | None
-    skipped_reason: str | None
 class HealthResponse(TypedDict):
     """Response model for health check endpoint."""
@@ -384,31 +358,6 @@ def _pil_image_encoder(obj: Any) -> str:
     return f"data:image/png;base64,{img_str}"
-def _extracted_image_encoder(obj: ExtractedImage) -> ExtractedImageDict:
-    encoded_data = base64.b64encode(obj.data).decode()
-    return ExtractedImageDict(
-        data=f"data:image/{obj.format};base64,{encoded_data}",
-        format=obj.format,
-        filename=obj.filename,
-        page_number=obj.page_number,
-        dimensions=obj.dimensions,
-        colorspace=obj.colorspace,
-        bits_per_component=obj.bits_per_component,
-        is_mask=obj.is_mask,
-        description=obj.description,
-    )
-def _image_ocr_result_encoder(obj: ImageOCRResult) -> ImageOCRResultDict:
-    return ImageOCRResultDict(
-        image=_extracted_image_encoder(obj.image),
-        ocr_result=obj.ocr_result,
-        confidence_score=obj.confidence_score,
-        processing_time=obj.processing_time,
-        skipped_reason=obj.skipped_reason,
-    )
 openapi_config = OpenAPIConfig(
     title="Kreuzberg API",
     version="3.14.0",
@@ -428,8 +377,6 @@ openapi_config = OpenAPIConfig(
 type_encoders = {
     pl.DataFrame: _polars_dataframe_encoder,
     Image.Image: _pil_image_encoder,
-    ExtractedImage: _extracted_image_encoder,
-    ImageOCRResult: _image_ocr_result_encoder,
 }
 app = Litestar(

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/kreuzberg/_config.py RENAMED

@@ -69,7 +69,17 @@ def _build_ocr_config_from_cli(
     try:
         match ocr_backend:
             case "tesseract":
-                return TesseractConfig(**backend_args)
+                # Handle PSM mode conversion from int to enum
+                processed_args = backend_args.copy()
+                if "psm" in processed_args and isinstance(processed_args["psm"], int):
+                    try:
+                        processed_args["psm"] = PSMMode(processed_args["psm"])
+                    except ValueError as e:
+                        raise ValidationError(
+                            f"Invalid PSM mode value: {processed_args['psm']}",
+                            context={"psm_value": processed_args["psm"], "error": str(e)},
+                        ) from e
+                return TesseractConfig(**processed_args)
             case "easyocr":
                 return EasyOCRConfig(**backend_args)
             case "paddleocr":
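
The `tesseract` branch above now converts an integer `psm` argument into a `PSMMode` enum member before building `TesseractConfig`, and wraps a failed conversion in a `ValidationError`. The standalone sketch below reproduces that conversion so it can be run without kreuzberg installed. `PSMMode` and `ValidationError` here are local stand-ins (only three illustrative enum members are defined), and `coerce_psm` is a hypothetical helper name, not a function in the package.

```python
from enum import Enum
from typing import Any, Optional


class PSMMode(Enum):
    """Stand-in for kreuzberg.PSMMode with a few illustrative members."""

    AUTO = 3
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7


class ValidationError(ValueError):
    """Stand-in for kreuzberg's ValidationError, carrying a context dict."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.context = context or {}


def coerce_psm(backend_args: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of backend_args with an integer 'psm' converted to PSMMode."""
    processed_args = backend_args.copy()  # leave the caller's dict untouched
    psm = processed_args.get("psm")
    if isinstance(psm, int):
        try:
            processed_args["psm"] = PSMMode(psm)
        except ValueError as e:
            raise ValidationError(
                f"Invalid PSM mode value: {psm}",
                context={"psm_value": psm, "error": str(e)},
            ) from e
    return processed_args


print(coerce_psm({"psm": 6, "language": "eng"}))
```

Copying the arguments first means a failed conversion cannot leave the caller's dictionary half-modified, which appears to be the intent of `backend_args.copy()` in the diff.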

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/kreuzberg/_document_classification.py RENAMED

@@ -132,7 +132,7 @@ def classify_document_from_layout(
             if not found_words.is_empty():
                 scores[doc_type] += 1.0
                 word_top = found_words[0, "top"]
-                if word_top < page_height * 0.3:
+                if word_top is not None and word_top < page_height * 0.3:
                     scores[doc_type] += 0.5
     total_score = sum(scores.values())

{kreuzberg-3.15.0 → kreuzberg-3.16.0}/kreuzberg/_extractors/_email.py RENAMED

@@ -27,6 +27,8 @@ except ImportError:  # pragma: no cover
     html2text = None
 _HTML_TAG_PATTERN = re.compile(r"<[^>]+>")
+_UNICODE_QUOTES_PATTERN = re.compile(r"[\u201c\u201d]")
+_UNICODE_SINGLE_QUOTES_PATTERN = re.compile(r"[\u2018\u2019]")
 class EmailExtractor(Extractor):
@@ -86,7 +88,14 @@ class EmailExtractor(Extractor):
     def _format_email_field(self, field: Any) -> str:
         match field:
             case list():
-                return ", ".join(str(item.get("email", "")) if isinstance(item, dict) else str(item) for item in field)
+                emails = []
+                for item in field:
+                    if isinstance(item, dict):
+                        if email := item.get("email", ""):
+                            emails.append(str(email))
+                    else:
+                        emails.append(str(item))
+                return ", ".join(emails)
             case dict():
                 return str(field.get("email", ""))
             case _:
@@ -111,12 +120,8 @@ class EmailExtractor(Extractor):
                 cleaned = re.sub(r"<style[^>]*>.*?</style>", "", cleaned, flags=re.IGNORECASE | re.DOTALL)
                 clean_html = _HTML_TAG_PATTERN.sub("", cleaned)
                 clean_html = unescape(clean_html)
-                clean_html = (
-                    clean_html.replace("\u201c", '"')
-                    .replace("\u201d", '"')
-                    .replace("\u2019", "'")
-                    .replace("\u2018", "'")
-                )
+                clean_html = _UNICODE_QUOTES_PATTERN.sub('"', clean_html)
+                clean_html = _UNICODE_SINGLE_QUOTES_PATTERN.sub("'", clean_html)
                 text_parts.append(clean_html)
     def _extract_email_attachments(
@@ -129,12 +134,12 @@ class EmailExtractor(Extractor):
         for att in attachments:
             name_val: str = "unknown"
             if isinstance(att, dict):
-                n = att.get("name")
+                n = att.get("name") or att.get("filename")
                 if isinstance(n, str) and n:
                     name_val = n
             names.append(name_val)
-        metadata["attachments"] = names
         if names:
+            metadata["attachments"] = names
             text_parts.append("Attachments: " + ", ".join(names))
     def _extract_images_from_attachments(self, parsed_email: dict[str, Any]) -> list[ExtractedImage]:
@@ -151,7 +156,8 @@ class EmailExtractor(Extractor):
             if not isinstance(mime, str) or not mime.startswith("image/"):
                 continue
-            name = att.get("name") if isinstance(att.get("name"), str) else None
+            name = att.get("name") or att.get("filename")
+            name = name if isinstance(name, str) else None
             data = att.get("data") or att.get("content") or att.get("payload")
             raw: bytes | None = None
             if isinstance(data, (bytes, bytearray)):
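
Two behaviours change in the `_email.py` hunks above: list entries without an `email` value are now skipped instead of contributing empty strings, and typographic quotes are normalized with two precompiled patterns rather than chained `.replace()` calls. The sketch below restates both as free functions so the effect can be checked in isolation. The function names are hypothetical, and the fallback `str(field)` for the final case is an assumption, since that branch is cut off in the diff.

```python
import re
from typing import Any

# Same character classes as the patterns added in the diff.
_UNICODE_QUOTES_PATTERN = re.compile(r"[\u201c\u201d]")
_UNICODE_SINGLE_QUOTES_PATTERN = re.compile(r"[\u2018\u2019]")


def format_email_field(field: Any) -> str:
    """Join address entries, skipping dict items that have no email value."""
    if isinstance(field, list):
        emails = []
        for item in field:
            if isinstance(item, dict):
                email = item.get("email", "")
                if email:
                    emails.append(str(email))
            else:
                emails.append(str(item))
        return ", ".join(emails)
    if isinstance(field, dict):
        return str(field.get("email", ""))
    return str(field)  # assumed fallback; not visible in the diff


def normalize_quotes(text: str) -> str:
    """Replace curly double and single quotes with their ASCII equivalents."""
    text = _UNICODE_QUOTES_PATTERN.sub('"', text)
    return _UNICODE_SINGLE_QUOTES_PATTERN.sub("'", text)


recipients = [{"email": "a@example.org"}, {"name": "no address"}, "b@example.org"]
print(format_email_field(recipients))
print(normalize_quotes("\u201cit\u2019s fine\u201d"))
```

With the previous one-line generator, the second entry in `recipients` would have produced an empty segment between two commas.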
