PyPI - kreuzberg - Versions diffs - 3.10.1__tar.gz → 3.11.0__tar.gz - Mend

kreuzberg 3.10.1__tar.gz → 3.11.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (221)

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kreuzberg
-Version: 3.10.1
+Version: 3.11.0
 Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
 Project-URL: documentation, https://kreuzberg.dev
 Project-URL: homepage, https://github.com/Goldziher/kreuzberg
@@ -32,7 +32,7 @@ Requires-Dist: anyio>=4.9.0
 Requires-Dist: chardetng-py>=0.3.5
 Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
 Requires-Dist: html-to-markdown[lxml]>=1.9.0
-Requires-Dist: mcp>=1.12.2
+Requires-Dist: mcp>=1.12.3
 Requires-Dist: msgspec>=0.18.0
 Requires-Dist: playa-pdf>=0.6.4
 Requires-Dist: psutil>=7.0.0
@@ -45,6 +45,7 @@ Requires-Dist: mailparse>=1.0.15; extra == 'additional-extensions'
 Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'additional-extensions'
 Provides-Extra: all
 Requires-Dist: click>=8.2.1; extra == 'all'
+Requires-Dist: deep-translator>=1.11.4; extra == 'all'
 Requires-Dist: easyocr>=1.7.2; extra == 'all'
 Requires-Dist: fast-langdetect>=0.3.2; extra == 'all'
 Requires-Dist: gmft>=0.4.2; extra == 'all'
@@ -53,6 +54,7 @@ Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.16.0; extra == 'all
 Requires-Dist: mailparse>=1.0.15; extra == 'all'
 Requires-Dist: paddleocr>=3.1.0; extra == 'all'
 Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
+Requires-Dist: pandas>=2.3.1; extra == 'all'
 Requires-Dist: playa-pdf[crypto]>=0.6.4; extra == 'all'
 Requires-Dist: rich>=14.1.0; extra == 'all'
 Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
@@ -61,9 +63,6 @@ Requires-Dist: spacy>=3.8.7; extra == 'all'
 Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
 Provides-Extra: api
 Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.16.0; extra == 'api'
-Provides-Extra: auto-classify-document-type
-Requires-Dist: deep-translator>=1.11.4; extra == 'auto-classify-document-type'
-Requires-Dist: pandas>=2.3.1; extra == 'auto-classify-document-type'
 Provides-Extra: chunking
 Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
 Provides-Extra: cli
@@ -72,6 +71,9 @@ Requires-Dist: rich>=14.1.0; extra == 'cli'
 Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
 Provides-Extra: crypto
 Requires-Dist: playa-pdf[crypto]>=0.6.4; extra == 'crypto'
+Provides-Extra: document-classification
+Requires-Dist: deep-translator>=1.11.4; extra == 'document-classification'
+Requires-Dist: pandas>=2.3.1; extra == 'document-classification'
 Provides-Extra: easyocr
 Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
 Provides-Extra: entity-extraction
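
The PKG-INFO hunks above rename the `auto-classify-document-type` extra to `document-classification`. As a rough sketch of how such a rename can be detected, the snippet below parses two heavily trimmed metadata excerpts (only a few of the real headers are reproduced) with the standard-library `email` parser and compares the declared extras. The `extras` helper and the `OLD`/`NEW` constants are names invented for this illustration.

```python
from email import message_from_string

# Trimmed excerpts of the two PKG-INFO files; most headers are omitted.
OLD = """\
Metadata-Version: 2.4
Name: kreuzberg
Version: 3.10.1
Provides-Extra: api
Provides-Extra: auto-classify-document-type
Provides-Extra: chunking
"""

NEW = """\
Metadata-Version: 2.4
Name: kreuzberg
Version: 3.11.0
Provides-Extra: api
Provides-Extra: chunking
Provides-Extra: document-classification
"""


def extras(pkg_info: str) -> set[str]:
    """Collect every Provides-Extra value from core-metadata text."""
    return set(message_from_string(pkg_info).get_all("Provides-Extra") or [])


removed = extras(OLD) - extras(NEW)
added = extras(NEW) - extras(OLD)
print(f"removed: {sorted(removed)}, added: {sorted(added)}")
```

An install command that still names the old extra would need to be updated to the new name.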

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/docs/contributing.md RENAMED

@@ -34,7 +34,7 @@ All commands run through `uv run`:
 # Testing
 uv run pytest                      # Run all tests
 uv run pytest tests/foo_test.py    # Run specific test
-uv run pytest --cov                # With coverage (must be ≥95%)
+uv run pytest --cov                # With coverage (must be ≥85%)
 # Code quality
 uv run ruff format                 # Format code

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/docs/getting-started/installation.md RENAMED

@@ -134,6 +134,16 @@ python -m spacy download es_core_news_sm  # Spanish
     spaCy language models are large (50-500MB each) and are downloaded separately. Only download the models for languages you actually need to process. See the [spaCy models documentation](https://spacy.io/models) for a complete list of available models.
+### Document Classification
+For automatic document type detection (invoice, contract, receipt, etc.), install the document classification extra:
+```shell
+pip install "kreuzberg[document-classification]"
+```
+This feature uses Google Translate for multi-language support and requires explicit opt-in by setting `auto_detect_document_type=True` in your configuration.
 ### All Optional Dependencies
 To install Kreuzberg with all optional dependencies, you can use the `all` extra group:
@@ -145,5 +155,5 @@ pip install "kreuzberg[all]"
 This is equivalent to:
 ```shell
-pip install "kreuzberg[chunking,easyocr,entity-extraction,gmft,langdetect,paddleocr]"
+pip install "kreuzberg[chunking,document-classification,easyocr,entity-extraction,gmft,langdetect,paddleocr]"
 ```

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/docs/index.md RENAMED

@@ -22,7 +22,7 @@ Kreuzberg addresses the complete document intelligence pipeline through a modula
 ### Engineering Principles
-- **Test Coverage**: 95%+ coverage with comprehensive test suites
+- **Test Coverage**: Comprehensive test suites ensuring code reliability
 - **API Design**: True async/await implementation alongside synchronous APIs
 - **Error Handling**: Consistent exception hierarchy with detailed context
 - **Type Safety**: Full type annotations for enhanced developer experience

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/docs/user-guide/document-classification.md RENAMED

@@ -2,9 +2,17 @@
 Kreuzberg can automatically classify documents into common types like invoices, contracts, and receipts. This allows you to build custom processing pipelines tailored to each document type.
+## Installation
+Document classification requires the `document-classification` extra to be installed:
+```bash
+pip install "kreuzberg[document-classification]"
+```
 ## Enabling Document Classification
-To enable this feature, set `auto_detect_document_type=True` in your `ExtractionConfig`:
+Document classification is disabled by default. To enable this feature, set `auto_detect_document_type=True` in your `ExtractionConfig`:
 ```python
 from kreuzberg import ExtractionConfig, extract_file

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_config.py RENAMED

@@ -97,19 +97,21 @@ def parse_ocr_backend_config(
     if not isinstance(backend_config, dict):
         return None
-    if backend == "tesseract":
-        # Convert psm integer to PSMMode enum if needed
-        processed_config = backend_config.copy()
-        if "psm" in processed_config and isinstance(processed_config["psm"], int):
-            from kreuzberg._ocr._tesseract import PSMMode  # noqa: PLC0415
-            processed_config["psm"] = PSMMode(processed_config["psm"])
-        return TesseractConfig(**processed_config)
-    if backend == "easyocr":
-        return EasyOCRConfig(**backend_config)
-    if backend == "paddleocr":
-        return PaddleOCRConfig(**backend_config)
-    return None
+    match backend:
+        case "tesseract":
+            # Convert psm integer to PSMMode enum if needed
+            processed_config = backend_config.copy()
+            if "psm" in processed_config and isinstance(processed_config["psm"], int):
+                from kreuzberg._ocr._tesseract import PSMMode  # noqa: PLC0415
+                processed_config["psm"] = PSMMode(processed_config["psm"])
+            return TesseractConfig(**processed_config)
+        case "easyocr":
+            return EasyOCRConfig(**backend_config)
+        case "paddleocr":
+            return PaddleOCRConfig(**backend_config)
+        case _:
+            return None
 def build_extraction_config_from_dict(config_dict: dict[str, Any]) -> ExtractionConfig:
@@ -140,7 +142,9 @@ def build_extraction_config_from_dict(config_dict: dict[str, Any]) -> Extraction
         "document_classification_mode",
         "keyword_count",
     }
-    extraction_config.update({field: config_dict[field] for field in basic_fields if field in config_dict})
+    extraction_config = extraction_config | {
+        field: config_dict[field] for field in basic_fields if field in config_dict
+    }
     # Handle OCR backend configuration
     ocr_backend = extraction_config.get("ocr_backend")
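
The second hunk above swaps an in-place `dict.update()` call for the `|` union operator (available since Python 3.9). A minimal sketch of the behavioural difference follows. The keys are made up for illustration and are not kreuzberg's real configuration. The union builds a new dict and leaves both operands untouched, with the right-hand operand winning on duplicate keys.

```python
defaults = {"ocr_backend": "tesseract", "force_ocr": False}
overrides = {"force_ocr": True}

alias = defaults  # a second reference to the same dict object

# Union creates a new dict; the right-hand operand wins on duplicate keys.
merged = defaults | overrides

# The original dict, and anything aliasing it, is unchanged.
assert defaults == {"ocr_backend": "tesseract", "force_ocr": False}
assert merged == {"ocr_backend": "tesseract", "force_ocr": True}
assert merged is not defaults

# update(), by contrast, mutates in place, so the change shows through the alias.
defaults.update(overrides)
assert alias["force_ocr"] is True
```

Rebinding a name to the result of `a | b` therefore does not affect other holders of the original dict, which is the main observable difference from `update()`.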

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_document_classification.py RENAMED

@@ -62,7 +62,7 @@ def _get_translated_text(result: ExtractionResult) -> str:
         from deep_translator import GoogleTranslator  # noqa: PLC0415
     except ImportError as e:  # pragma: no cover
         raise MissingDependencyError(
-            "The 'deep-translator' library is not installed. Please install it with: pip install 'kreuzberg[auto-classify-document-type]'"
+            "The 'deep-translator' library is not installed. Please install it with: pip install 'kreuzberg[document-classification]'"
         ) from e
     try:
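
The hunk above only changes the install hint, but the surrounding pattern (import an optional dependency lazily and re-raise a descriptive error naming the extra) may be worth seeing in isolation. The sketch below is a generic version, not kreuzberg's code: `require_module` is a hypothetical helper, and the `MissingDependencyError` class defined here is a local stand-in for the one kreuzberg exports.

```python
import importlib
from types import ModuleType


class MissingDependencyError(Exception):
    """Stand-in for illustration; kreuzberg ships its own exception class."""


def require_module(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an error naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original error so the root cause stays visible in tracebacks.
        raise MissingDependencyError(
            f"The '{module_name}' library is not installed. "
            f"Please install it with: pip install 'kreuzberg[{extra}]'"
        ) from e
```

Calling `require_module("deep_translator", "document-classification")` would then either return the module or fail with a message that points at the renamed extra.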

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_extractors/_base.py RENAMED

@@ -116,8 +116,7 @@ class Extractor(ABC):
         quality_score = calculate_quality_score(cleaned_content, dict(result.metadata) if result.metadata else None)
         # Add quality metadata
-        enhanced_metadata = dict(result.metadata) if result.metadata else {}
-        enhanced_metadata["quality_score"] = quality_score
+        enhanced_metadata = (dict(result.metadata) if result.metadata else {}) | {"quality_score": quality_score}
         # Return enhanced result
         return ExtractionResult(

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_extractors/_image.py RENAMED

@@ -85,23 +85,24 @@ class ImageExtractor(Extractor):
         backend = get_ocr_backend(self.config.ocr_backend)
-        if self.config.ocr_backend == "tesseract":
-            config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, TesseractConfig) else TesseractConfig()
-            )
-            result = backend.process_file_sync(path, **asdict(config))
-        elif self.config.ocr_backend == "paddleocr":
-            paddle_config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, PaddleOCRConfig) else PaddleOCRConfig()
-            )
-            result = backend.process_file_sync(path, **asdict(paddle_config))
-        elif self.config.ocr_backend == "easyocr":
-            easy_config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, EasyOCRConfig) else EasyOCRConfig()
-            )
-            result = backend.process_file_sync(path, **asdict(easy_config))
-        else:
-            raise NotImplementedError(f"Sync OCR not implemented for {self.config.ocr_backend}")
+        match self.config.ocr_backend:
+            case "tesseract":
+                config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, TesseractConfig) else TesseractConfig()
+                )
+                result = backend.process_file_sync(path, **asdict(config))
+            case "paddleocr":
+                paddle_config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, PaddleOCRConfig) else PaddleOCRConfig()
+                )
+                result = backend.process_file_sync(path, **asdict(paddle_config))
+            case "easyocr":
+                easy_config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, EasyOCRConfig) else EasyOCRConfig()
+                )
+                result = backend.process_file_sync(path, **asdict(easy_config))
+            case _:
+                raise NotImplementedError(f"Sync OCR not implemented for {self.config.ocr_backend}")
         return self._apply_quality_processing(result)
     def _get_extension_from_mime_type(self, mime_type: str) -> str:

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_extractors/_pdf.py RENAMED

@@ -88,14 +88,12 @@ class PDFExtractor(Extractor):
             # Enhance metadata with table information
             if result.tables:
                 table_summary = generate_table_summary(result.tables)
-                result.metadata.update(
-                    {
-                        "table_count": table_summary["table_count"],
-                        "tables_summary": f"Document contains {table_summary['table_count']} tables "
-                        f"across {table_summary['pages_with_tables']} pages with "
-                        f"{table_summary['total_rows']} total rows",
-                    }
-                )
+                result.metadata = result.metadata | {
+                    "table_count": table_summary["table_count"],
+                    "tables_summary": f"Document contains {table_summary['table_count']} tables "
+                    f"across {table_summary['pages_with_tables']} pages with "
+                    f"{table_summary['total_rows']} total rows",
+                }
         return self._apply_quality_processing(result)
@@ -153,14 +151,12 @@ class PDFExtractor(Extractor):
         # Enhance metadata with table information
         if tables:
             table_summary = generate_table_summary(tables)
-            result.metadata.update(
-                {
-                    "table_count": table_summary["table_count"],
-                    "tables_summary": f"Document contains {table_summary['table_count']} tables "
-                    f"across {table_summary['pages_with_tables']} pages with "
-                    f"{table_summary['total_rows']} total rows",
-                }
-            )
+            result.metadata = result.metadata | {
+                "table_count": table_summary["table_count"],
+                "tables_summary": f"Document contains {table_summary['table_count']} tables "
+                f"across {table_summary['pages_with_tables']} pages with "
+                f"{table_summary['total_rows']} total rows",
+            }
         # Apply quality processing
         return self._apply_quality_processing(result)
@@ -386,23 +382,24 @@ class PDFExtractor(Extractor):
         backend = get_ocr_backend(self.config.ocr_backend)
         paths = [Path(p) for p in image_paths]
-        if self.config.ocr_backend == "tesseract":
-            config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, TesseractConfig) else TesseractConfig()
-            )
-            results = backend.process_batch_sync(paths, **asdict(config))
-        elif self.config.ocr_backend == "paddleocr":
-            paddle_config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, PaddleOCRConfig) else PaddleOCRConfig()
-            )
-            results = backend.process_batch_sync(paths, **asdict(paddle_config))
-        elif self.config.ocr_backend == "easyocr":
-            easy_config = (
-                self.config.ocr_config if isinstance(self.config.ocr_config, EasyOCRConfig) else EasyOCRConfig()
-            )
-            results = backend.process_batch_sync(paths, **asdict(easy_config))
-        else:
-            raise NotImplementedError(f"Sync OCR not implemented for {self.config.ocr_backend}")
+        match self.config.ocr_backend:
+            case "tesseract":
+                config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, TesseractConfig) else TesseractConfig()
+                )
+                results = backend.process_batch_sync(paths, **asdict(config))
+            case "paddleocr":
+                paddle_config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, PaddleOCRConfig) else PaddleOCRConfig()
+                )
+                results = backend.process_batch_sync(paths, **asdict(paddle_config))
+            case "easyocr":
+                easy_config = (
+                    self.config.ocr_config if isinstance(self.config.ocr_config, EasyOCRConfig) else EasyOCRConfig()
+                )
+                results = backend.process_batch_sync(paths, **asdict(easy_config))
+            case _:
+                raise NotImplementedError(f"Sync OCR not implemented for {self.config.ocr_backend}")
         # Use list comprehension and join for efficient string building
         return "\n\n".join(result.content for result in results)

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_mcp/server.py RENAMED

@@ -51,7 +51,7 @@ def _create_config_with_overrides(**kwargs: Any) -> ExtractionConfig:
     }
     # Override with provided parameters
-    config_dict.update(kwargs)
+    config_dict = config_dict | kwargs
     return ExtractionConfig(**config_dict)

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/kreuzberg/_types.py RENAMED

@@ -349,7 +349,7 @@ class ExtractionConfig:
     """Configuration for language detection. If None, uses default settings."""
     spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None
     """Configuration for spaCy entity extraction. If None, uses default settings."""
-    auto_detect_document_type: bool = True
+    auto_detect_document_type: bool = False
     """Whether to automatically detect the document type."""
     document_type_confidence_threshold: float = 0.5
     """Confidence threshold for document type detection."""
@@ -398,15 +398,16 @@ class ExtractionConfig:
             return asdict(self.ocr_config)
         # Lazy load and cache default configs instead of creating new instances
-        if self.ocr_backend == "tesseract":
-            from kreuzberg._ocr._tesseract import TesseractConfig  # noqa: PLC0415
+        match self.ocr_backend:
+            case "tesseract":
+                from kreuzberg._ocr._tesseract import TesseractConfig  # noqa: PLC0415
-            return asdict(TesseractConfig())
-        if self.ocr_backend == "easyocr":
-            from kreuzberg._ocr._easyocr import EasyOCRConfig  # noqa: PLC0415
+                return asdict(TesseractConfig())
+            case "easyocr":
+                from kreuzberg._ocr._easyocr import EasyOCRConfig  # noqa: PLC0415
-            return asdict(EasyOCRConfig())
-        # paddleocr
-        from kreuzberg._ocr._paddleocr import PaddleOCRConfig  # noqa: PLC0415
+                return asdict(EasyOCRConfig())
+            case _:  # paddleocr or any other backend
+                from kreuzberg._ocr._paddleocr import PaddleOCRConfig  # noqa: PLC0415
-        return asdict(PaddleOCRConfig())
+                return asdict(PaddleOCRConfig())
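
The first `_types.py` hunk flips the `auto_detect_document_type` default from `True` to `False`, so code that relied on the implicit default no longer requests classification after upgrading. A minimal sketch of that effect is shown below, using a hypothetical trimmed-down dataclass (`ConfigSketch`) in place of the real `ExtractionConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical, trimmed-down stand-in for ExtractionConfig."""

    # Default as of 3.11.0 per the hunk above; it was True in 3.10.1.
    auto_detect_document_type: bool = False
    document_type_confidence_threshold: float = 0.5


def classification_requested(config: ConfigSketch) -> bool:
    """Classification is requested only when the caller opted in explicitly."""
    return config.auto_detect_document_type


implicit = ConfigSketch()  # relies on the default: now opt-out
explicit = ConfigSketch(auto_detect_document_type=True)  # explicit opt-in
```

This is why the test changes further down pass `auto_detect_document_type=True` wherever they previously constructed a bare `ExtractionConfig()`.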

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/mkdocs.yaml RENAMED

@@ -158,4 +158,3 @@ nav:
       - Custom Hooks: advanced/custom-hooks.md
       - Custom Extractors: advanced/custom-extractors.md
   - Contributing: contributing.md
-  - Changelog: changelog.md

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/pyproject.toml RENAMED

@@ -5,7 +5,7 @@ requires = [ "hatchling" ]
 [project]
 name = "kreuzberg"
-version = "3.10.1"
+version = "3.11.0"
 description = "Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats"
 readme = "README.md"
 keywords = [
@@ -61,7 +61,7 @@ dependencies = [
   "chardetng-py>=0.3.5",
   "exceptiongroup>=1.2.2; python_version<'3.11'",
   "html-to-markdown[lxml]>=1.9.0",
-  "mcp>=1.12.2",
+  "mcp>=1.12.3",
   "msgspec>=0.18.0",
   "playa-pdf>=0.6.4",                                 # pinned due to breaking changes in 0.5.0
   "psutil>=7.0.0",
@@ -76,15 +76,11 @@ optional-dependencies.additional-extensions = [
   "tomli>=2.0.0; python_version<'3.11'",
 ]
 optional-dependencies.all = [
-  "kreuzberg[additional-extensions,api,chunking,cli,crypto,easyocr,entity-extraction,gmft,langdetect,paddleocr]",
+  "kreuzberg[additional-extensions,api,chunking,cli,crypto,document-classification,easyocr,entity-extraction,gmft,langdetect,paddleocr]",
 ]
 optional-dependencies.api = [
   "litestar[standard,structlog,opentelemetry]>=2.16.0",
 ]
-optional-dependencies.auto-classify-document-type = [
-  "deep-translator>=1.11.4",
-  "pandas>=2.3.1",
-]
 optional-dependencies.chunking = [ "semantic-text-splitter>=0.27.0" ]
 optional-dependencies.cli = [
   "click>=8.2.1",
@@ -92,6 +88,10 @@ optional-dependencies.cli = [
   "tomli>=2.0.0; python_version<'3.11'",
 ]
 optional-dependencies.crypto = [ "playa-pdf[crypto]>=0.6.4" ]
+optional-dependencies.document-classification = [
+  "deep-translator>=1.11.4",
+  "pandas>=2.3.1",
+]
 optional-dependencies.easyocr = [ "easyocr>=1.7.2" ]
 optional-dependencies.entity-extraction = [ "keybert>=0.9.0", "spacy>=3.8.7" ]
 optional-dependencies.gmft = [ "gmft>=0.4.2" ]
@@ -256,7 +256,7 @@ exclude_lines = [
   "class .*\\bProtocol\\):",
   "@(abc\\.)?abstractmethod",
 ]
-fail_under = 95
+fail_under = 85
 [tool.mypy]
 packages = [ "kreuzberg", "tests", "benchmarks.src.kreuzberg_benchmarks" ]

{kreuzberg-3.10.1 → kreuzberg-3.11.0}/tests/document_classification_test.py RENAMED

@@ -2,8 +2,10 @@
 from __future__ import annotations
+import builtins
+import sys
 from pathlib import Path
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any
 import pandas as pd
 import pytest
@@ -15,6 +17,7 @@ from kreuzberg._document_classification import (
     classify_document_from_layout,
 )
 from kreuzberg._types import ExtractionConfig, ExtractionResult
+from kreuzberg.exceptions import MissingDependencyError
 if TYPE_CHECKING:
     from pytest_mock import MockerFixture
@@ -112,7 +115,7 @@ def test_classify_document_with_metadata() -> None:
         mime_type="text/plain",
         metadata={"title": "Invoice #12345", "subject": "Payment Due"},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document(result, config)
@@ -142,7 +145,7 @@ def test_classify_document_empty_content() -> None:
         mime_type="text/plain",
         metadata={},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document(result, config)
@@ -158,7 +161,7 @@ def test_classify_document_with_exclusions() -> None:
         mime_type="text/plain",
         metadata={},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document(result, config)
@@ -184,7 +187,7 @@ def test_classify_document_from_layout_basic() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -200,7 +203,7 @@ def test_classify_document_from_layout_no_layout() -> None:
         mime_type="text/plain",
         metadata={},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -218,7 +221,7 @@ def test_classify_document_from_layout_empty_layout() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -236,7 +239,7 @@ def test_classify_document_from_layout_missing_columns() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -260,7 +263,7 @@ def test_classify_document_from_layout_no_pattern_matches() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -285,7 +288,7 @@ def test_classify_document_from_layout_header_patterns() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -312,7 +315,7 @@ def test_classify_document_from_layout_position_scoring() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     doc_type, confidence = classify_document_from_layout(result, config)
@@ -327,7 +330,7 @@ def test_auto_detect_document_type_from_content() -> None:
         mime_type="text/plain",
         metadata={},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     detection_result = auto_detect_document_type(result, config)
@@ -352,7 +355,7 @@ def test_auto_detect_document_type_from_layout() -> None:
         metadata={},
         layout=layout_df,
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     detection_result = auto_detect_document_type(result, config)
@@ -382,7 +385,7 @@ def test_auto_detect_document_type_no_matches() -> None:
         mime_type="text/plain",
         metadata={},
     )
-    config = ExtractionConfig()
+    config = ExtractionConfig(auto_detect_document_type=True)
     detection_result = auto_detect_document_type(result, config)
@@ -884,3 +887,35 @@ def test_classify_document_confidence_calculation(mocker: MockerFixture) -> None
     assert doc_type == "invoice"
     assert confidence == 1.0  # All 3 matches are for invoice, so 3/3 = 1.0
+def test_missing_deep_translator_import_error(mocker: MockerFixture) -> None:
+    """Test that MissingDependencyError is raised when deep-translator is not installed."""
+    # Temporarily remove deep_translator from sys.modules if it exists
+    original_module = sys.modules.pop("deep_translator", None)
+    try:
+        # Mock the import to raise ImportError when importing deep_translator
+        def mock_import(name: str, *args: Any, **kwargs: Any) -> Any:
+            if name == "deep_translator":
+                raise ImportError("No module named 'deep_translator'")
+            return original_import(name, *args, **kwargs)
+        original_import = builtins.__import__
+        mocker.patch("builtins.__import__", side_effect=mock_import)
+        # Import _get_translated_text after setting up the mock
+        from kreuzberg._document_classification import _get_translated_text
+        result = ExtractionResult(content="Test content", mime_type="text/plain", metadata={})
+        # Should raise MissingDependencyError when trying to import deep_translator
+        with pytest.raises(MissingDependencyError) as exc_info:
+            _get_translated_text(result)
+        assert "deep-translator" in str(exc_info.value)
+        assert "pip install 'kreuzberg[document-classification]'" in str(exc_info.value)
+    finally:
+        # Restore original module if it existed
+        if original_module is not None:
+            sys.modules["deep_translator"] = original_module
