kreuzberg 2.0.0__tar.gz → 2.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/PKG-INFO +48 -20
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/README.md +47 -19
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/__init__.py +14 -1
- kreuzberg-2.1.0/kreuzberg/_constants.py +8 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_html.py +1 -2
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_pandoc.py +37 -73
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_pdf.py +5 -6
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_string.py +1 -1
- kreuzberg-2.1.0/kreuzberg/_sync.py +74 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_tesseract.py +55 -176
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_xlsx.py +34 -36
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/exceptions.py +20 -1
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/extraction.py +13 -15
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg.egg-info/PKG-INFO +48 -20
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/pyproject.toml +4 -5
- kreuzberg-2.0.0/kreuzberg/_constants.py +0 -6
- kreuzberg-2.0.0/kreuzberg/_sync.py +0 -33
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/LICENSE +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_mime_types.py +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_pptx.py +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_tmp.py +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_types.py +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/py.typed +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg.egg-info/SOURCES.txt +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg.egg-info/dependency_links.txt +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg.egg-info/requires.txt +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg.egg-info/top_level.txt +0 -0
- {kreuzberg-2.0.0 → kreuzberg-2.1.0}/setup.cfg +0 -0
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: kreuzberg
-Version: 2.0.0
+Version: 2.1.0
 Summary: A text extraction library supporting PDFs, images, office documents and more
 Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
 License: MIT
@@ -42,7 +42,7 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
 - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
 - **Local Processing**: No external API calls or cloud dependencies required
 - **Resource Efficient**: Lightweight processing without GPU requirements
-- **
+- **Small Package Size**: Has few curated dependencies and a minimal footprint
 - **Format Support**: Comprehensive support for documents, images, and text formats
 - **Modern Python**: Built with async/await, type hints, and functional first approach
 - **Permissive OSS**: Kreuzberg and its dependencies have a permissive OSS license
@@ -61,10 +61,34 @@ pip install kreuzberg
 
 Kreuzberg requires two system level dependencies:
 
-- [Pandoc](https://pandoc.org/installing.html) - For document format conversion
-- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
+- [Pandoc](https://pandoc.org/installing.html) - For document format conversion. Minimum required version is Pandoc 2.
+- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR. Minimum required version is Tesseract 4.
 
-
+You can install these with:
+
+#### Linux (Ubuntu)
+
+```shell
+sudo apt-get install pandoc tesseract-ocr
+```
+
+#### MacOS
+
+```shell
+#
+brew install tesseract pandoc
+```
+
+#### Windows
+
+```shell
+choco install -y tesseract pandoc
+```
+
+Notes:
+
+- In most distributions the tesseract-ocr package is split into multiple packages; you may need to install any language models you need other than English separately.
+- Please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
 
 ## Architecture
 
@@ -152,26 +176,30 @@ All extraction functions accept the following optional parameters for configuring
 
 #### OCR Configuration
 
-- `
-
-
--
+- `force_ocr` (default: `False`): Forces OCR processing even for searchable PDFs.
+- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
+
+  - `eng` for English
+  - `deu` for German
+  - `eng+deu` for English and German
 
-
+  Note: the order of languages affects processing time; the first language is the primary language, the second is the secondary language, and so on.
 
-- `psm` (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+- `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
 
-
+Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
 
-
+#### Processing Configuration
+
+- `max_processes` (default: CPU count): Maximum number of concurrent processes for Tesseract.
 
 ### Quick Start
 
 ```python
 from pathlib import Path
 from kreuzberg import extract_file
-from kreuzberg
-from kreuzberg
+from kreuzberg import ExtractionResult
+from kreuzberg import PSMMode
 
 
 # Basic file extraction
@@ -193,14 +221,14 @@ async def extract_document():
     docx_result = await extract_file(Path("document.docx"))
     if docx_result.metadata:
         print(f"Title: {docx_result.metadata.get('title')}")
-        print(f"Author: {docx_result.metadata.get('
+        print(f"Author: {docx_result.metadata.get('creator')}")
 ```
 
 ### Extracting Bytes
 
 ```python
 from kreuzberg import extract_bytes
-from kreuzberg
+from kreuzberg import ExtractionResult
 
 
 async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
@@ -236,7 +264,7 @@ Kreuzberg supports efficient batch processing of multiple files or byte contents
 
 ```python
 from pathlib import Path
-from kreuzberg import batch_extract_file, batch_extract_bytes
+from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
 
 
 # Process multiple files concurrently
@@ -346,8 +374,8 @@ async def process_document(path: str) -> tuple[str, str, Metadata]:
 Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.
 
 ```python
-from kreuzberg import
-
+from kreuzberg import (
+    extract_file,
     ValidationError,
     ParsingError,
     OCRError,
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/README.md

@@ -7,7 +7,7 @@ Kreuzberg is a Python library for text extraction from documents. It provides a
 - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
 - **Local Processing**: No external API calls or cloud dependencies required
 - **Resource Efficient**: Lightweight processing without GPU requirements
-- **
+- **Small Package Size**: Has few curated dependencies and a minimal footprint
 - **Format Support**: Comprehensive support for documents, images, and text formats
 - **Modern Python**: Built with async/await, type hints, and functional first approach
 - **Permissive OSS**: Kreuzberg and its dependencies have a permissive OSS license
@@ -26,10 +26,34 @@ pip install kreuzberg
 
 Kreuzberg requires two system level dependencies:
 
-- [Pandoc](https://pandoc.org/installing.html) - For document format conversion
-- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
+- [Pandoc](https://pandoc.org/installing.html) - For document format conversion. Minimum required version is Pandoc 2.
+- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR. Minimum required version is Tesseract 4.
 
-
+You can install these with:
+
+#### Linux (Ubuntu)
+
+```shell
+sudo apt-get install pandoc tesseract-ocr
+```
+
+#### MacOS
+
+```shell
+#
+brew install tesseract pandoc
+```
+
+#### Windows
+
+```shell
+choco install -y tesseract pandoc
+```
+
+Notes:
+
+- In most distributions the tesseract-ocr package is split into multiple packages; you may need to install any language models you need other than English separately.
+- Please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
 
 ## Architecture
 
@@ -117,26 +141,30 @@ All extraction functions accept the following optional parameters for configuring
 
 #### OCR Configuration
 
-- `
-
-
--
+- `force_ocr` (default: `False`): Forces OCR processing even for searchable PDFs.
+- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
+
+  - `eng` for English
+  - `deu` for German
+  - `eng+deu` for English and German
 
-
+  Note: the order of languages affects processing time; the first language is the primary language, the second is the secondary language, and so on.
 
-- `psm` (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+- `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
 
-
+Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
 
-
+#### Processing Configuration
+
+- `max_processes` (default: CPU count): Maximum number of concurrent processes for Tesseract.
 
 ### Quick Start
 
 ```python
 from pathlib import Path
 from kreuzberg import extract_file
-from kreuzberg
-from kreuzberg
+from kreuzberg import ExtractionResult
+from kreuzberg import PSMMode
 
 
 # Basic file extraction
@@ -158,14 +186,14 @@ async def extract_document():
     docx_result = await extract_file(Path("document.docx"))
     if docx_result.metadata:
         print(f"Title: {docx_result.metadata.get('title')}")
-        print(f"Author: {docx_result.metadata.get('
+        print(f"Author: {docx_result.metadata.get('creator')}")
 ```
 
 ### Extracting Bytes
 
 ```python
 from kreuzberg import extract_bytes
-from kreuzberg
+from kreuzberg import ExtractionResult
 
 
 async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
@@ -201,7 +229,7 @@ Kreuzberg supports efficient batch processing of multiple files or byte contents
 
 ```python
 from pathlib import Path
-from kreuzberg import batch_extract_file, batch_extract_bytes
+from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
 
 
 # Process multiple files concurrently
@@ -311,8 +339,8 @@ async def process_document(path: str) -> tuple[str, str, Metadata]:
 Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.
 
 ```python
-from kreuzberg import
-
+from kreuzberg import (
+    extract_file,
     ValidationError,
     ParsingError,
     OCRError,
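The OCR and processing options documented in the README changes above (`force_ocr`, `language`, `psm`, `max_processes`) are described as optional keyword arguments to the extraction functions. A minimal sketch of combining them; the call shape follows the README, but the snippet itself is illustrative and not part of the package:

```python
import asyncio
from pathlib import Path

from kreuzberg import PSMMode, extract_file


async def main() -> None:
    # A scanned English/German document is assumed; language order sets priority.
    result = await extract_file(
        Path("scanned-report.pdf"),
        force_ocr=True,      # run OCR even if the PDF has a text layer
        language="eng+deu",  # primary + secondary Tesseract language models
        psm=PSMMode.AUTO,    # default page segmentation mode
        max_processes=4,     # cap concurrent Tesseract processes
    )
    print(result.mime_type, len(result.content))


asyncio.run(main())
```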
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/__init__.py

@@ -1,6 +1,14 @@
+from ._tesseract import PSMMode
 from ._types import ExtractionResult, Metadata
 from .exceptions import KreuzbergError, MissingDependencyError, OCRError, ParsingError, ValidationError
-from .extraction import
+from .extraction import (
+    batch_extract_bytes,
+    batch_extract_bytes_sync,
+    batch_extract_file,
+    batch_extract_file_sync,
+    extract_bytes,
+    extract_file,
+)
 
 __all__ = [
     "ExtractionResult",
@@ -8,8 +16,13 @@ __all__ = [
     "Metadata",
     "MissingDependencyError",
     "OCRError",
+    "PSMMode",
     "ParsingError",
     "ValidationError",
+    "batch_extract_bytes",
+    "batch_extract_bytes_sync",
+    "batch_extract_file",
+    "batch_extract_file_sync",
     "extract_bytes",
     "extract_file",
 ]
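Version 2.1.0 also begins exporting synchronous batch helpers (`batch_extract_file_sync`, `batch_extract_bytes_sync`) alongside `PSMMode`. Their exact signatures are not shown in this diff; the sketch below assumes they mirror the async counterparts (paths in, a list of `ExtractionResult` out):

```python
from pathlib import Path

from kreuzberg import batch_extract_file_sync

paths = [Path("report.pdf"), Path("notes.docx")]

# Assumption: same inputs/outputs as the async batch_extract_file, but blocking,
# so it can be called from code that has no running event loop.
results = batch_extract_file_sync(paths)
for path, result in zip(paths, results):
    print(path.name, result.mime_type, len(result.content))
```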
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_html.py

@@ -8,7 +8,6 @@ from anyio import Path as AsyncPath
 from kreuzberg import ExtractionResult
 from kreuzberg._mime_types import MARKDOWN_MIME_TYPE
 from kreuzberg._string import normalize_spaces, safe_decode
-from kreuzberg._sync import run_sync
 
 if TYPE_CHECKING:
     from pathlib import Path
@@ -28,5 +27,5 @@ async def extract_html_string(file_path_or_contents: Path | bytes) -> ExtractionResult:
         if isinstance(file_path_or_contents, bytes)
         else await AsyncPath(file_path_or_contents).read_text()
     )
-    result =
+    result = html_to_markdown.convert_to_markdown(content)
     return ExtractionResult(content=normalize_spaces(result), mime_type=MARKDOWN_MIME_TYPE, metadata={})
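`_html.py` now calls `html_to_markdown.convert_to_markdown` directly instead of wrapping it in `run_sync`. A standalone sketch of that conversion step; the HTML string is made up, but the call is the one shown in the diff:

```python
import html_to_markdown

html = "<h1>Changelog</h1><p>Version <strong>2.1.0</strong> adds sync batch APIs.</p>"

# Synchronous conversion; in kreuzberg the result is then passed through
# normalize_spaces and wrapped in an ExtractionResult.
markdown = html_to_markdown.convert_to_markdown(html)
print(markdown)
```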
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_pandoc.py

@@ -1,21 +1,22 @@
 from __future__ import annotations
 
-import
+import re
 import sys
 from functools import partial
 from json import JSONDecodeError, loads
 from typing import TYPE_CHECKING, Any, Final, Literal, cast
 
-from anyio import CapacityLimiter, create_task_group, to_process
 from anyio import Path as AsyncPath
+from anyio import run_process
 
-from kreuzberg
+from kreuzberg import ValidationError
+from kreuzberg._constants import MINIMAL_SUPPORTED_PANDOC_VERSION
 from kreuzberg._mime_types import MARKDOWN_MIME_TYPE
 from kreuzberg._string import normalize_spaces
-from kreuzberg._sync import
+from kreuzberg._sync import run_taskgroup
 from kreuzberg._tmp import create_temp_file
 from kreuzberg._types import ExtractionResult, Metadata
-from kreuzberg.exceptions import MissingDependencyError, ParsingError
+from kreuzberg.exceptions import MissingDependencyError, ParsingError
 
 if TYPE_CHECKING:  # pragma: no cover
     from collections.abc import Mapping
@@ -24,10 +25,8 @@ if TYPE_CHECKING:  # pragma: no cover
 if sys.version_info < (3, 11):  # pragma: no cover
     from exceptiongroup import ExceptionGroup  # type: ignore[import-not-found]
 
-
 version_ref: Final[dict[str, bool]] = {"checked": False}
 
-
 # Block-level node types in Pandoc AST
 BLOCK_HEADER: Final = "Header"  # Header with level, attributes and inline content
 BLOCK_PARA: Final = "Para"  # Paragraph containing inline content
@@ -229,20 +228,15 @@ def _extract_metadata(raw_meta: dict[str, Any]) -> Metadata:
 
 
 def _get_pandoc_type_from_mime_type(mime_type: str) -> str:
-    if
-
-
-
-
-                "mime_type": mime_type,
-                "supported_mimetypes": ",".join(sorted(MIMETYPE_TO_PANDOC_TYPE_MAPPING)),
-            },
+    if pandoc_type := (MIMETYPE_TO_PANDOC_TYPE_MAPPING.get(mime_type, "")):
+        return pandoc_type
+
+    if any(k.startswith(mime_type) for k in MIMETYPE_TO_PANDOC_TYPE_MAPPING):
+        return next(
+            MIMETYPE_TO_PANDOC_TYPE_MAPPING[k] for k in MIMETYPE_TO_PANDOC_TYPE_MAPPING if k.startswith(mime_type)
         )
 
-
-            MIMETYPE_TO_PANDOC_TYPE_MAPPING[k] for k in MIMETYPE_TO_PANDOC_TYPE_MAPPING if k.startswith(mime_type)
-        )
+    raise ValidationError(f"Unsupported mime type: {mime_type}")
 
 
 async def _validate_pandoc_version() -> None:
@@ -251,20 +245,19 @@ async def _validate_pandoc_version() -> None:
             return
 
         command = ["pandoc", "--version"]
-        result = await
-
-
-
+        result = await run_process(command)
+
+        version_match = re.search(r"pandoc\s+v?(\d+)\.\d+\.\d+", result.stdout.decode())
+        if not version_match or int(version_match.group(1)) < MINIMAL_SUPPORTED_PANDOC_VERSION:
+            raise MissingDependencyError("Pandoc version 2 or above is required")
 
         version_ref["checked"] = True
 
     except FileNotFoundError as e:
-        raise MissingDependencyError("Pandoc is not installed
+        raise MissingDependencyError("Pandoc is not installed") from e
 
 
-async def _handle_extract_metadata(
-    input_file: str | PathLike[str], *, mime_type: str, max_processes: int = DEFAULT_MAX_PROCESSES
-) -> Metadata:
+async def _handle_extract_metadata(input_file: str | PathLike[str], *, mime_type: str) -> Metadata:
     pandoc_type = _get_pandoc_type_from_mime_type(mime_type)
     metadata_file, unlink = await create_temp_file(".json")
     try:
@@ -276,15 +269,10 @@ async def _handle_extract_metadata(
             "--standalone",
             "--quiet",
             "--output",
-            metadata_file,
+            str(metadata_file),
         ]
 
-        result = await
-            partial(subprocess.run, capture_output=True),
-            command,
-            cancellable=True,
-            limiter=CapacityLimiter(max_processes),
-        )
+        result = await run_process(command)
 
         if result.returncode != 0:
             raise ParsingError("Failed to extract file data", context={"file": str(input_file), "error": result.stderr})
@@ -297,9 +285,7 @@ async def _handle_extract_metadata(
         await unlink()
 
 
-async def _handle_extract_file(
-    input_file: str | PathLike[str], *, mime_type: str, max_processes: int = DEFAULT_MAX_PROCESSES
-) -> str:
+async def _handle_extract_file(input_file: str | PathLike[str], *, mime_type: str) -> str:
     pandoc_type = _get_pandoc_type_from_mime_type(mime_type)
     output_path, unlink = await create_temp_file(".md")
     try:
@@ -315,12 +301,7 @@ async def _handle_extract_file(
 
         command.extend(["--output", str(output_path)])
 
-        result = await
-            partial(subprocess.run, capture_output=True),
-            command,
-            cancellable=True,
-            limiter=CapacityLimiter(max_processes),
-        )
+        result = await run_process(command)
 
         if result.returncode != 0:
             raise ParsingError("Failed to extract file data", context={"file": str(input_file), "error": result.stderr})
@@ -334,15 +315,12 @@ async def _handle_extract_file(
         await unlink()
 
 
-async def process_file_with_pandoc(
-    input_file: str | PathLike[str], *, mime_type: str, max_processes: int = DEFAULT_MAX_PROCESSES
-) -> ExtractionResult:
+async def process_file_with_pandoc(input_file: str | PathLike[str], *, mime_type: str) -> ExtractionResult:
     """Process a single file using Pandoc and convert to markdown.
 
     Args:
         input_file: The path to the file to process.
         mime_type: The mime type of the file.
-        max_processes: Maximum number of concurrent processes. Defaults to CPU count / 2 (minimum 1).
 
     Raises:
         ParsingError: If the file data could not be extracted.
@@ -354,41 +332,27 @@ async def process_file_with_pandoc(
 
     _get_pandoc_type_from_mime_type(mime_type)
 
-    metadata: Metadata = {}
-    content: str = ""
-
     try:
-
-
-
-
-            metadata = await _handle_extract_metadata(input_file, mime_type=mime_type, max_processes=max_processes)
-
-        async def _get_content() -> None:
-            nonlocal content
-            content = await _handle_extract_file(input_file, mime_type=mime_type, max_processes=max_processes)
+        metadata, content = await run_taskgroup(
+            partial(_handle_extract_metadata, input_file, mime_type=mime_type),
+            partial(_handle_extract_file, input_file, mime_type=mime_type),
+        )
 
-
-
+        return ExtractionResult(
+            content=normalize_spaces(cast(str, content)),
+            metadata=cast(Metadata, metadata),
+            mime_type=MARKDOWN_MIME_TYPE,
+        )
     except ExceptionGroup as eg:
-        raise ParsingError("Failed to
-
-    return ExtractionResult(
-        content=normalize_spaces(content),
-        metadata=metadata,
-        mime_type=MARKDOWN_MIME_TYPE,
-    )
+        raise ParsingError("Failed to process file", context={"file": str(input_file), "errors": eg.exceptions}) from eg
 
 
-async def process_content_with_pandoc(
-    content: bytes, *, mime_type: str, max_processes: int = DEFAULT_MAX_PROCESSES
-) -> ExtractionResult:
+async def process_content_with_pandoc(content: bytes, *, mime_type: str) -> ExtractionResult:
     """Process content using Pandoc and convert to markdown.
 
     Args:
         content: The content to process.
         mime_type: The mime type of the content.
-        max_processes: Maximum number of concurrent processes. Defaults to CPU count / 2 (minimum 1).
 
     Returns:
         ExtractionResult
@@ -397,7 +361,7 @@ async def process_content_with_pandoc(
     input_file, unlink = await create_temp_file(f".{extension}")
 
     await AsyncPath(input_file).write_bytes(content)
-    result = await process_file_with_pandoc(input_file, mime_type=mime_type
+    result = await process_file_with_pandoc(input_file, mime_type=mime_type)
 
     await unlink()
     return result
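The rewritten `_validate_pandoc_version` shells out via `anyio.run_process` and compares only the major version against `MINIMAL_SUPPORTED_PANDOC_VERSION`. A worked example of that regex on a sample `pandoc --version` banner; the sample output string and the constant's value (assumed to be 2, per the "Minimum required version is Pandoc 2" note in the README) are illustrative:

```python
import re

MINIMAL_SUPPORTED_PANDOC_VERSION = 2  # assumed value of the new _constants.py entry

# The first line of `pandoc --version` output typically looks like "pandoc 3.1.11".
sample_output = "pandoc 3.1.11\nFeatures: +server +lua\n"

match = re.search(r"pandoc\s+v?(\d+)\.\d+\.\d+", sample_output)
assert match is not None
print(int(match.group(1)) >= MINIMAL_SUPPORTED_PANDOC_VERSION)  # True -> Pandoc 2+ requirement met
```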
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_pdf.py

@@ -11,7 +11,7 @@ from kreuzberg import ExtractionResult
 from kreuzberg._mime_types import PLAIN_TEXT_MIME_TYPE
 from kreuzberg._string import normalize_spaces
 from kreuzberg._sync import run_sync
-from kreuzberg._tesseract import PSMMode,
+from kreuzberg._tesseract import PSMMode, batch_process_images
 from kreuzberg.exceptions import ParsingError
 
 if TYPE_CHECKING:  # pragma: no cover
@@ -67,7 +67,7 @@ async def _convert_pdf_to_images(input_file: Path) -> list[Image]:
     document: pypdfium2.PdfDocument | None = None
     try:
         document = await run_sync(pypdfium2.PdfDocument, str(input_file))
-        return [page.render(scale=
+        return [page.render(scale=4.25).to_pil() for page in cast(pypdfium2.PdfDocument, document)]
     except pypdfium2.PdfiumError as e:
         raise ParsingError(
             "Could not convert PDF to images", context={"file_path": str(input_file), "error": str(e)}
@@ -80,7 +80,7 @@ async def _convert_pdf_to_images(input_file: Path) -> list[Image]:
 async def _extract_pdf_text_with_ocr(
     input_file: Path,
     *,
-    language:
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -132,7 +132,7 @@ async def extract_pdf_file(
     input_file: Path,
     *,
     force_ocr: bool,
-    language:
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -154,7 +154,6 @@ async def extract_pdf_file(
         and _validate_extracted_text(content)
     ):
         return ExtractionResult(content=content, mime_type=PLAIN_TEXT_MIME_TYPE, metadata={})
-
     return await _extract_pdf_text_with_ocr(input_file, max_processes=max_processes, language=language, psm=psm)
 
 
@@ -162,7 +161,7 @@ async def extract_pdf_content(
     content: bytes,
     *,
     force_ocr: bool,
-    language:
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
{kreuzberg-2.0.0 → kreuzberg-2.1.0}/kreuzberg/_string.py

@@ -22,7 +22,7 @@ def safe_decode(byte_data: bytes, encoding: str | None = None) -> str:
     encodings = [encoding, detect(byte_data).get("encoding", ""), "utf-8"]
 
     for enc in [e for e in encodings if e]:  # pragma: no cover
-        with suppress(UnicodeDecodeError):
+        with suppress(UnicodeDecodeError, LookupError):
             return byte_data.decode(enc)
 
     # If all encodings fail, fall back to latin-1 which can handle any byte
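`safe_decode` now also suppresses `LookupError`, presumably because charset detection can return an encoding label that Python's codec registry does not recognise; `bytes.decode` raises `LookupError` (not `UnicodeDecodeError`) in that case. A small illustration:

```python
data = "café".encode("utf-8")

try:
    # An unrecognised codec name raises LookupError rather than UnicodeDecodeError.
    data.decode("definitely-not-a-codec")
except LookupError as exc:
    print(exc)  # unknown encoding: definitely-not-a-codec

# With LookupError suppressed as well, safe_decode simply moves on to the next
# candidate encoding and ultimately falls back to latin-1.
```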
kreuzberg-2.1.0/kreuzberg/_sync.py

@@ -0,0 +1,74 @@
+from __future__ import annotations
+
+import sys
+from functools import partial
+from typing import TYPE_CHECKING, TypeVar, cast
+
+from anyio import create_task_group
+from anyio.to_thread import run_sync as any_io_run_sync
+
+if TYPE_CHECKING:  # pragma: no cover
+    from collections.abc import Callable, Coroutine
+
+if sys.version_info >= (3, 10):
+    from typing import ParamSpec
+else:  # pragma: no cover
+    from typing_extensions import ParamSpec
+
+T = TypeVar("T")
+P = ParamSpec("P")
+
+
+async def run_sync(sync_fn: Callable[P, T], *args: P.args, **kwargs: P.kwargs) -> T:
+    """Run a synchronous function in an asynchronous context.
+
+    Args:
+        sync_fn: The synchronous function to run.
+        *args: The positional arguments to pass to the function.
+        **kwargs: The keyword arguments to pass to the function.
+
+    Returns:
+        The result of the synchronous function.
+    """
+    handler = partial(sync_fn, **kwargs)
+    return cast(T, await any_io_run_sync(handler, *args, abandon_on_cancel=True))  # pyright: ignore [reportCallIssue]
+
+
+async def run_taskgroup(*async_tasks: Callable[[], Coroutine[None, None, T]]) -> list[T]:
+    """Run a list of coroutines concurrently.
+
+    Args:
+        *async_tasks: The list of coroutines to run.
+
+    Returns:
+        The results of the coroutines.
+    """
+    results = cast(list[T], [None] * len(async_tasks))
+
+    async def run_task(index: int, task: Callable[[], Coroutine[None, None, T]]) -> None:
+        results[index] = await task()
+
+    async with create_task_group() as tg:
+        for i, t in enumerate(async_tasks):
+            tg.start_soon(run_task, i, t)
+
+    return results
+
+
+async def run_taskgroup_batched(*async_tasks: Callable[[], Coroutine[None, None, T]], batch_size: int) -> list[T]:
+    """Run a list of coroutines concurrently in batches.
+
+    Args:
+        *async_tasks: The list of coroutines to run.
+        batch_size: The size of each batch.
+
+    Returns:
+        The results of the coroutines.
+    """
+    results: list[T] = []
+
+    for i in range(0, len(async_tasks), batch_size):
+        batch = async_tasks[i : i + batch_size]
+        results.extend(await run_taskgroup(*batch))
+
+    return results