kreuzberg 2.0.0.tar.gz → 2.0.1.tar.gz

Files changed (26)

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: kreuzberg
-Version: 2.0.0
+Version: 2.0.1
 Summary: A text extraction library supporting PDFs, images, office documents and more
 Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
 License: MIT
@@ -64,7 +64,31 @@ Kreuzberg requires two system level dependencies:
 - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
 - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
-Please install these using their respective installation guides.
+You can install these with:
+#### Linux (Ubuntu)
+```shell
+sudo apt-get install pandoc tesseract-ocr
+```
+#### MacOS
+```shell
+# MacOS
+brew install tesseract pandoc
+```
+#### Windows
+```shell
+choco install -y tesseract pandoc
+```
+Notes:
+- in most distributions the tesseract-ocr package is split into multiple packages, you may need to install any language models you need aside from English separately.
+- please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
 ## Architecture
@@ -152,18 +176,26 @@ All extraction functions accept the following optional parameters for configurin
 #### OCR Configuration
-- `language` (default: "eng"): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
-  - "eng" for English
-  - "deu" for German
-  - "fra" for French
+- `force_ocr`(default: `False`): Forces OCR processing even for searchable PDFs.
+- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
+  - `eng` for English
+  - `deu` for German
+  - `eng+deu` for English and German
+  Notes: - the order of languages effect processing time, the first language is the primary language and the second language is the secondary language etc.
+- `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
-Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information.
+#### Processing Configuration
-- `psm` (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc.
-#### Performance Configuration
+  Notes:
-- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can lead to performance improvements, but may cause resource exhaustion and deadlocks (especially for tesseract).
+  - Higher values can lead to performance improvements when batch processing especially with OCR, but may cause resource exhaustion and deadlocks (especially for tesseract).
 ### Quick Start
@@ -171,7 +203,7 @@ Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/)
 from pathlib import Path
 from kreuzberg import extract_file
 from kreuzberg.extraction import ExtractionResult
-from kreuzberg._tesseract import PSMMode, SupportedLanguage
+from kreuzberg._tesseract import PSMMode
 # Basic file extraction
@@ -193,7 +225,7 @@ async def extract_document():
     docx_result = await extract_file(Path("document.docx"))
     if docx_result.metadata:
         print(f"Title: {docx_result.metadata.get('title')}")
-        print(f"Author: {docx_result.metadata.get('author')}")
+        print(f"Author: {docx_result.metadata.get('creator')}")
 ```
 ### Extracting Bytes
@@ -236,7 +268,7 @@ Kreuzberg supports efficient batch processing of multiple files or byte contents
 ```python
 from pathlib import Path
-from kreuzberg import batch_extract_file, batch_extract_bytes
+from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
 # Process multiple files concurrently

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/README.md RENAMED

@@ -29,7 +29,31 @@ Kreuzberg requires two system level dependencies:
 - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
 - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
-Please install these using their respective installation guides.
+You can install these with:
+#### Linux (Ubuntu)
+```shell
+sudo apt-get install pandoc tesseract-ocr
+```
+#### MacOS
+```shell
+# MacOS
+brew install tesseract pandoc
+```
+#### Windows
+```shell
+choco install -y tesseract pandoc
+```
+Notes:
+- in most distributions the tesseract-ocr package is split into multiple packages, you may need to install any language models you need aside from English separately.
+- please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
 ## Architecture
@@ -117,18 +141,26 @@ All extraction functions accept the following optional parameters for configurin
 #### OCR Configuration
-- `language` (default: "eng"): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
-  - "eng" for English
-  - "deu" for German
-  - "fra" for French
+- `force_ocr`(default: `False`): Forces OCR processing even for searchable PDFs.
+- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
+  - `eng` for English
+  - `deu` for German
+  - `eng+deu` for English and German
+  Notes: - the order of languages effect processing time, the first language is the primary language and the second language is the secondary language etc.
+- `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
-Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information.
+#### Processing Configuration
-- `psm` (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc.
-#### Performance Configuration
+  Notes:
-- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can lead to performance improvements, but may cause resource exhaustion and deadlocks (especially for tesseract).
+  - Higher values can lead to performance improvements when batch processing especially with OCR, but may cause resource exhaustion and deadlocks (especially for tesseract).
 ### Quick Start
@@ -136,7 +168,7 @@ Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/)
 from pathlib import Path
 from kreuzberg import extract_file
 from kreuzberg.extraction import ExtractionResult
-from kreuzberg._tesseract import PSMMode, SupportedLanguage
+from kreuzberg._tesseract import PSMMode
 # Basic file extraction
@@ -158,7 +190,7 @@ async def extract_document():
     docx_result = await extract_file(Path("document.docx"))
     if docx_result.metadata:
         print(f"Title: {docx_result.metadata.get('title')}")
-        print(f"Author: {docx_result.metadata.get('author')}")
+        print(f"Author: {docx_result.metadata.get('creator')}")
 ```
 ### Extracting Bytes
@@ -201,7 +233,7 @@ Kreuzberg supports efficient batch processing of multiple files or byte contents
 ```python
 from pathlib import Path
-from kreuzberg import batch_extract_file, batch_extract_bytes
+from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
 # Process multiple files concurrently

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/kreuzberg/__init__.py RENAMED

@@ -1,6 +1,13 @@
 from ._types import ExtractionResult, Metadata
 from .exceptions import KreuzbergError, MissingDependencyError, OCRError, ParsingError, ValidationError
-from .extraction import extract_bytes, extract_file
+from .extraction import (
+    batch_extract_bytes,
+    batch_extract_bytes_sync,
+    batch_extract_file,
+    batch_extract_file_sync,
+    extract_bytes,
+    extract_file,
+)
 __all__ = [
     "ExtractionResult",
@@ -10,6 +17,10 @@ __all__ = [
     "OCRError",
     "ParsingError",
     "ValidationError",
+    "batch_extract_bytes",
+    "batch_extract_bytes_sync",
+    "batch_extract_file",
+    "batch_extract_file_sync",
     "extract_bytes",
     "extract_file",
 ]
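
Note: the hunk above re-exports the four batch helpers from the package root. A minimal sketch for checking whether an installed version exposes them follows; `missing_exports` and `BATCH_HELPERS` are hypothetical names introduced here for illustration and are not part of kreuzberg.

```python
import importlib
import importlib.util

# Names that kreuzberg/__init__.py re-exports as of 2.0.1, per the diff above.
BATCH_HELPERS = [
    "batch_extract_bytes",
    "batch_extract_bytes_sync",
    "batch_extract_file",
    "batch_extract_file_sync",
]


def missing_exports(module_name: str, names: list[str]) -> list[str]:
    """Return the names that the module does not expose as attributes.

    Hypothetical helper for illustration; not part of kreuzberg.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Only run the check when kreuzberg is actually installed.
if importlib.util.find_spec("kreuzberg") is not None:
    # With 2.0.1 or later installed, this should report nothing missing.
    print(missing_exports("kreuzberg", BATCH_HELPERS))
```

On 2.0.0 the same names were presumably still reachable through `kreuzberg.extraction`, since only the root-level re-export is new in this hunk.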

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/kreuzberg/_pdf.py RENAMED

@@ -11,7 +11,7 @@ from kreuzberg import ExtractionResult
 from kreuzberg._mime_types import PLAIN_TEXT_MIME_TYPE
 from kreuzberg._string import normalize_spaces
 from kreuzberg._sync import run_sync
-from kreuzberg._tesseract import PSMMode, SupportedLanguage, batch_process_images
+from kreuzberg._tesseract import PSMMode, batch_process_images
 from kreuzberg.exceptions import ParsingError
 if TYPE_CHECKING:  # pragma: no cover
@@ -80,7 +80,7 @@ async def _convert_pdf_to_images(input_file: Path) -> list[Image]:
 async def _extract_pdf_text_with_ocr(
     input_file: Path,
     *,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -132,7 +132,7 @@ async def extract_pdf_file(
     input_file: Path,
     *,
     force_ocr: bool,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -162,7 +162,7 @@ async def extract_pdf_content(
     content: bytes,
     *,
     force_ocr: bool,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/kreuzberg/_tesseract.py RENAMED

@@ -6,7 +6,7 @@ import sys
 from enum import Enum
 from functools import partial
 from os import PathLike
-from typing import Final, Literal, TypeVar, Union, cast
+from typing import Final, TypeVar, Union, cast
 from anyio import CapacityLimiter, create_task_group, to_process
 from anyio import Path as AsyncPath
@@ -29,136 +29,6 @@ version_ref = {"checked": False}
 T = TypeVar("T", bound=Union[Image, PathLike[str], str])
-SupportedLanguage = Literal[
-    "afr",
-    "amh",
-    "ara",
-    "asm",
-    "aze",
-    "aze_cyrl",
-    "bel",
-    "ben",
-    "bod",
-    "bos",
-    "bre",
-    "bul",
-    "cat",
-    "ceb",
-    "ces",
-    "chi_sim",
-    "chi_tra",
-    "chr",
-    "cos",
-    "cym",
-    "dan",
-    "dan_frak",
-    "deu",
-    "deu_frak",
-    "deu_latf",
-    "dzo",
-    "ell",
-    "eng",
-    "enm",
-    "epo",
-    "equ",
-    "est",
-    "eus",
-    "fao",
-    "fas",
-    "fil",
-    "fin",
-    "fra",
-    "frk",
-    "frm",
-    "fry",
-    "gla",
-    "gle",
-    "glg",
-    "grc",
-    "guj",
-    "hat",
-    "heb",
-    "hin",
-    "hrv",
-    "hun",
-    "hye",
-    "iku",
-    "ind",
-    "isl",
-    "ita",
-    "ita_old",
-    "jav",
-    "jpn",
-    "kan",
-    "kat",
-    "kat_old",
-    "kaz",
-    "khm",
-    "kir",
-    "kmr",
-    "kor",
-    "kor_vert",
-    "kur",
-    "lao",
-    "lat",
-    "lav",
-    "lit",
-    "ltz",
-    "mal",
-    "mar",
-    "mkd",
-    "mlt",
-    "mon",
-    "mri",
-    "msa",
-    "mya",
-    "nep",
-    "nld",
-    "nor",
-    "oci",
-    "ori",
-    "osd",
-    "pan",
-    "pol",
-    "por",
-    "pus",
-    "que",
-    "ron",
-    "rus",
-    "san",
-    "sin",
-    "slk",
-    "slk_frak",
-    "slv",
-    "snd",
-    "spa",
-    "spa_old",
-    "sqi",
-    "srp",
-    "srp_latn",
-    "sun",
-    "swa",
-    "swe",
-    "syr",
-    "tam",
-    "tat",
-    "tel",
-    "tgk",
-    "tgl",
-    "tha",
-    "tir",
-    "ton",
-    "tur",
-    "uig",
-    "ukr",
-    "urd",
-    "uzb",
-    "uzb_cyrl",
-    "vie",
-    "yid",
-    "yor",
-]
 class PSMMode(Enum):
     """Enum for Tesseract Page Segmentation Modes (PSM) with human-readable values."""
@@ -211,7 +81,7 @@ async def validate_tesseract_version() -> None:
 async def process_file(
     input_file: str | PathLike[str],
     *,
-    language: SupportedLanguage,
+    language: str,
     psm: PSMMode,
     max_processes: int = DEFAULT_MAX_PROCESSES,
 ) -> ExtractionResult:
@@ -263,7 +133,7 @@ async def process_file(
 async def process_image(
     image: Image,
     *,
-    language: SupportedLanguage,
+    language: str,
     psm: PSMMode,
     max_processes: int = DEFAULT_MAX_PROCESSES,
 ) -> ExtractionResult:
@@ -288,7 +158,7 @@ async def process_image(
 async def process_image_with_tesseract(
     image: Image | PathLike[str] | str,
     *,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     psm: PSMMode = PSMMode.AUTO,
     max_processes: int = DEFAULT_MAX_PROCESSES,
 ) -> ExtractionResult:
@@ -320,7 +190,7 @@ async def process_image_with_tesseract(
 async def batch_process_images(
     images: list[T],
     *,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     psm: PSMMode = PSMMode.AUTO,
     max_processes: int = DEFAULT_MAX_PROCESSES,
 ) -> list[ExtractionResult]:
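
Note: one plausible reading of the `SupportedLanguage` → `str` change is that a `Literal` of single language codes cannot express combined values such as `eng+deu`, which the updated README now documents. The sketch below illustrates this with an abbreviated stand-in alias; `SupportedLanguageSketch` and `accepted_by_literal` are assumptions made for the example and are not kreuzberg code.

```python
from typing import Literal, get_args

# Abbreviated stand-in for the removed alias; the real one listed over a hundred codes.
SupportedLanguageSketch = Literal["deu", "eng", "fra"]


def accepted_by_literal(language: str) -> bool:
    """Report whether the value is one of the members of the sketch alias."""
    return language in get_args(SupportedLanguageSketch)


print(accepted_by_literal("eng"))      # a single code is a member
print(accepted_by_literal("eng+deu"))  # a combined value is not
```

With `language: str`, combined values type-check, at the cost of no longer catching misspelled codes statically.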

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/kreuzberg/extraction.py RENAMED

@@ -38,7 +38,7 @@ from kreuzberg._pdf import (
 )
 from kreuzberg._pptx import extract_pptx_file_content
 from kreuzberg._string import safe_decode
-from kreuzberg._tesseract import PSMMode, SupportedLanguage, process_image_with_tesseract
+from kreuzberg._tesseract import PSMMode, process_image_with_tesseract
 from kreuzberg._xlsx import extract_xlsx_content, extract_xlsx_file
 from kreuzberg.exceptions import ValidationError
@@ -52,7 +52,7 @@ async def extract_bytes(
     mime_type: str,
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -114,7 +114,7 @@ async def extract_file(
     mime_type: str | None = None,
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -170,7 +170,7 @@ async def batch_extract_file(
     file_paths: Sequence[PathLike[str] | str],
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> list[ExtractionResult]:
@@ -209,7 +209,7 @@ async def batch_extract_bytes(
     contents: Sequence[tuple[bytes, str]],
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> list[ExtractionResult]:
@@ -253,7 +253,7 @@ def extract_bytes_sync(
     mime_type: str,
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -281,7 +281,7 @@ def extract_file_sync(
     mime_type: str | None = None,
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> ExtractionResult:
@@ -308,7 +308,7 @@ def batch_extract_file_sync(
     file_paths: Sequence[PathLike[str] | str],
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> list[ExtractionResult]:
@@ -339,7 +339,7 @@ def batch_extract_bytes_sync(
     contents: Sequence[tuple[bytes, str]],
     *,
     force_ocr: bool = False,
-    language: SupportedLanguage = "eng",
+    language: str = "eng",
     max_processes: int = DEFAULT_MAX_PROCESSES,
     psm: PSMMode = PSMMode.AUTO,
 ) -> list[ExtractionResult]:
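
Note: with `language` typed as `str` across the public functions, a caller can pass several Tesseract languages at once. The following is a rough sketch under stated assumptions: `join_languages` and `extract_all` are hypothetical helpers, and only `batch_extract_file_sync` and its `language` keyword come from the diff above.

```python
from collections.abc import Sequence


def join_languages(codes: Sequence[str]) -> str:
    """Join Tesseract language codes with '+', primary language first.

    Hypothetical helper for illustration; not part of kreuzberg.
    """
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def extract_all(paths: Sequence[str], codes: Sequence[str]) -> list:
    """Run the synchronous batch API with a combined language string."""
    # Imported lazily so that this module loads without kreuzberg installed.
    from kreuzberg import batch_extract_file_sync

    return batch_extract_file_sync(list(paths), language=join_languages(codes))
```

Calling `extract_all(["scan.pdf"], ["eng", "deu"])` would pass `language="eng+deu"`; it requires kreuzberg 2.0.1 or later plus the matching Tesseract language data, and the file name is a placeholder.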

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/kreuzberg.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: kreuzberg
-Version: 2.0.0
+Version: 2.0.1
 Summary: A text extraction library supporting PDFs, images, office documents and more
 Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
 License: MIT
@@ -64,7 +64,31 @@ Kreuzberg requires two system level dependencies:
 - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
 - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
-Please install these using their respective installation guides.
+You can install these with:
+#### Linux (Ubuntu)
+```shell
+sudo apt-get install pandoc tesseract-ocr
+```
+#### MacOS
+```shell
+# MacOS
+brew install tesseract pandoc
+```
+#### Windows
+```shell
+choco install -y tesseract pandoc
+```
+Notes:
+- in most distributions the tesseract-ocr package is split into multiple packages, you may need to install any language models you need aside from English separately.
+- please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
 ## Architecture
@@ -152,18 +176,26 @@ All extraction functions accept the following optional parameters for configurin
 #### OCR Configuration
-- `language` (default: "eng"): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
-  - "eng" for English
-  - "deu" for German
-  - "fra" for French
+- `force_ocr`(default: `False`): Forces OCR processing even for searchable PDFs.
+- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
+  - `eng` for English
+  - `deu` for German
+  - `eng+deu` for English and German
+  Notes: - the order of languages effect processing time, the first language is the primary language and the second language is the secondary language etc.
+- `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
-Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information.
+#### Processing Configuration
-- `psm` (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
+- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc.
-#### Performance Configuration
+  Notes:
-- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can lead to performance improvements, but may cause resource exhaustion and deadlocks (especially for tesseract).
+  - Higher values can lead to performance improvements when batch processing especially with OCR, but may cause resource exhaustion and deadlocks (especially for tesseract).
 ### Quick Start
@@ -171,7 +203,7 @@ Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/)
 from pathlib import Path
 from kreuzberg import extract_file
 from kreuzberg.extraction import ExtractionResult
-from kreuzberg._tesseract import PSMMode, SupportedLanguage
+from kreuzberg._tesseract import PSMMode
 # Basic file extraction
@@ -193,7 +225,7 @@ async def extract_document():
     docx_result = await extract_file(Path("document.docx"))
     if docx_result.metadata:
         print(f"Title: {docx_result.metadata.get('title')}")
-        print(f"Author: {docx_result.metadata.get('author')}")
+        print(f"Author: {docx_result.metadata.get('creator')}")
 ```
 ### Extracting Bytes
@@ -236,7 +268,7 @@ Kreuzberg supports efficient batch processing of multiple files or byte contents
 ```python
 from pathlib import Path
-from kreuzberg import batch_extract_file, batch_extract_bytes
+from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
 # Process multiple files concurrently

{kreuzberg-2.0.0 → kreuzberg-2.0.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "kreuzberg"
-version = "2.0.0"
+version = "2.0.1"
 description = "A text extraction library supporting PDFs, images, office documents and more"
 readme = "README.md"
 keywords = [