kreuzberg 3.15.0__py3-none-any.whl → 3.17.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- kreuzberg/__init__.py +6 -0
- kreuzberg/_api/main.py +0 -53
- kreuzberg/_config.py +17 -8
- kreuzberg/_document_classification.py +1 -1
- kreuzberg/_extractors/_base.py +0 -46
- kreuzberg/_extractors/_email.py +16 -10
- kreuzberg/_extractors/_html.py +39 -12
- kreuzberg/_extractors/_pandoc.py +2 -2
- kreuzberg/_extractors/_pdf.py +6 -7
- kreuzberg/_extractors/_presentation.py +4 -0
- kreuzberg/_extractors/_spread_sheet.py +0 -1
- kreuzberg/_extractors/_structured.py +83 -15
- kreuzberg/_gmft.py +7 -2
- kreuzberg/_mcp/server.py +1 -22
- kreuzberg/_mime_types.py +1 -1
- kreuzberg/_ocr/_easyocr.py +47 -20
- kreuzberg/_ocr/_paddleocr.py +1 -1
- kreuzberg/_ocr/_tesseract.py +27 -26
- kreuzberg/_token_reduction/__init__.py +11 -0
- kreuzberg/_token_reduction/_reducer.py +439 -0
- kreuzberg/_token_reduction/_stopwords.py +116 -0
- kreuzberg/_token_reduction/stopwords/af_stopwords.json +53 -0
- kreuzberg/_token_reduction/stopwords/ar_stopwords.json +482 -0
- kreuzberg/_token_reduction/stopwords/bg_stopwords.json +261 -0
- kreuzberg/_token_reduction/stopwords/bn_stopwords.json +400 -0
- kreuzberg/_token_reduction/stopwords/br_stopwords.json +1205 -0
- kreuzberg/_token_reduction/stopwords/ca_stopwords.json +280 -0
- kreuzberg/_token_reduction/stopwords/cs_stopwords.json +425 -0
- kreuzberg/_token_reduction/stopwords/da_stopwords.json +172 -0
- kreuzberg/_token_reduction/stopwords/de_stopwords.json +622 -0
- kreuzberg/_token_reduction/stopwords/el_stopwords.json +849 -0
- kreuzberg/_token_reduction/stopwords/en_stopwords.json +1300 -0
- kreuzberg/_token_reduction/stopwords/eo_stopwords.json +175 -0
- kreuzberg/_token_reduction/stopwords/es_stopwords.json +734 -0
- kreuzberg/_token_reduction/stopwords/et_stopwords.json +37 -0
- kreuzberg/_token_reduction/stopwords/eu_stopwords.json +100 -0
- kreuzberg/_token_reduction/stopwords/fa_stopwords.json +801 -0
- kreuzberg/_token_reduction/stopwords/fi_stopwords.json +849 -0
- kreuzberg/_token_reduction/stopwords/fr_stopwords.json +693 -0
- kreuzberg/_token_reduction/stopwords/ga_stopwords.json +111 -0
- kreuzberg/_token_reduction/stopwords/gl_stopwords.json +162 -0
- kreuzberg/_token_reduction/stopwords/gu_stopwords.json +226 -0
- kreuzberg/_token_reduction/stopwords/ha_stopwords.json +41 -0
- kreuzberg/_token_reduction/stopwords/he_stopwords.json +196 -0
- kreuzberg/_token_reduction/stopwords/hi_stopwords.json +227 -0
- kreuzberg/_token_reduction/stopwords/hr_stopwords.json +181 -0
- kreuzberg/_token_reduction/stopwords/hu_stopwords.json +791 -0
- kreuzberg/_token_reduction/stopwords/hy_stopwords.json +47 -0
- kreuzberg/_token_reduction/stopwords/id_stopwords.json +760 -0
- kreuzberg/_token_reduction/stopwords/it_stopwords.json +634 -0
- kreuzberg/_token_reduction/stopwords/ja_stopwords.json +136 -0
- kreuzberg/_token_reduction/stopwords/kn_stopwords.json +84 -0
- kreuzberg/_token_reduction/stopwords/ko_stopwords.json +681 -0
- kreuzberg/_token_reduction/stopwords/ku_stopwords.json +64 -0
- kreuzberg/_token_reduction/stopwords/la_stopwords.json +51 -0
- kreuzberg/_token_reduction/stopwords/lt_stopwords.json +476 -0
- kreuzberg/_token_reduction/stopwords/lv_stopwords.json +163 -0
- kreuzberg/_token_reduction/stopwords/ml_stopwords.json +11 -0
- kreuzberg/_token_reduction/stopwords/mr_stopwords.json +101 -0
- kreuzberg/_token_reduction/stopwords/ms_stopwords.json +477 -0
- kreuzberg/_token_reduction/stopwords/ne_stopwords.json +490 -0
- kreuzberg/_token_reduction/stopwords/nl_stopwords.json +415 -0
- kreuzberg/_token_reduction/stopwords/no_stopwords.json +223 -0
- kreuzberg/_token_reduction/stopwords/pl_stopwords.json +331 -0
- kreuzberg/_token_reduction/stopwords/pt_stopwords.json +562 -0
- kreuzberg/_token_reduction/stopwords/ro_stopwords.json +436 -0
- kreuzberg/_token_reduction/stopwords/ru_stopwords.json +561 -0
- kreuzberg/_token_reduction/stopwords/si_stopwords.json +193 -0
- kreuzberg/_token_reduction/stopwords/sk_stopwords.json +420 -0
- kreuzberg/_token_reduction/stopwords/sl_stopwords.json +448 -0
- kreuzberg/_token_reduction/stopwords/so_stopwords.json +32 -0
- kreuzberg/_token_reduction/stopwords/st_stopwords.json +33 -0
- kreuzberg/_token_reduction/stopwords/sv_stopwords.json +420 -0
- kreuzberg/_token_reduction/stopwords/sw_stopwords.json +76 -0
- kreuzberg/_token_reduction/stopwords/ta_stopwords.json +129 -0
- kreuzberg/_token_reduction/stopwords/te_stopwords.json +54 -0
- kreuzberg/_token_reduction/stopwords/th_stopwords.json +118 -0
- kreuzberg/_token_reduction/stopwords/tl_stopwords.json +149 -0
- kreuzberg/_token_reduction/stopwords/tr_stopwords.json +506 -0
- kreuzberg/_token_reduction/stopwords/uk_stopwords.json +75 -0
- kreuzberg/_token_reduction/stopwords/ur_stopwords.json +519 -0
- kreuzberg/_token_reduction/stopwords/vi_stopwords.json +647 -0
- kreuzberg/_token_reduction/stopwords/yo_stopwords.json +62 -0
- kreuzberg/_token_reduction/stopwords/zh_stopwords.json +796 -0
- kreuzberg/_token_reduction/stopwords/zu_stopwords.json +31 -0
- kreuzberg/_types.py +146 -43
- kreuzberg/_utils/_html_streaming.py +20 -0
- kreuzberg/_utils/_image_preprocessing.py +1 -1
- kreuzberg/_utils/_ref.py +14 -6
- kreuzberg/_utils/_serialization.py +13 -6
- kreuzberg/_utils/_sync.py +15 -16
- kreuzberg/exceptions.py +0 -1
- kreuzberg/extraction.py +27 -11
- {kreuzberg-3.15.0.dist-info → kreuzberg-3.17.0.dist-info}/METADATA +15 -13
- kreuzberg-3.17.0.dist-info/RECORD +128 -0
- kreuzberg-3.15.0.dist-info/RECORD +0 -60
- {kreuzberg-3.15.0.dist-info → kreuzberg-3.17.0.dist-info}/WHEEL +0 -0
- {kreuzberg-3.15.0.dist-info → kreuzberg-3.17.0.dist-info}/entry_points.txt +0 -0
- {kreuzberg-3.15.0.dist-info → kreuzberg-3.17.0.dist-info}/licenses/LICENSE +0 -0
kreuzberg/_token_reduction/stopwords/zu_stopwords.json
ADDED
@@ -0,0 +1,31 @@
+[
+  "futhi",
+  "kahle",
+  "kakhulu",
+  "kanye",
+  "khona",
+  "kodwa",
+  "kungani",
+  "kusho",
+  "la",
+  "lakhe",
+  "lapho",
+  "mina",
+  "ngesikhathi",
+  "nje",
+  "phansi",
+  "phezulu",
+  "u",
+  "ukuba",
+  "ukuthi",
+  "ukuze",
+  "uma",
+  "wahamba",
+  "wakhe",
+  "wami",
+  "wase",
+  "wathi",
+  "yakhe",
+  "zakhe",
+  "zonke"
+]
kreuzberg/_types.py
CHANGED
@@ -1,12 +1,13 @@
 from __future__ import annotations

 import sys
-from collections.abc import Awaitable, Callable,
+from collections.abc import Awaitable, Callable, Mapping
 from dataclasses import asdict, dataclass, field
 from enum import Enum
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Literal, NamedTuple, TypedDict

+import langcodes
 import msgspec

 from kreuzberg._constants import DEFAULT_MAX_CHARACTERS, DEFAULT_MAX_OVERLAP
@@ -591,6 +592,8 @@ class ImagePreprocessingMetadata(NamedTuple):


 class Metadata(TypedDict, total=False):
+    abstract: NotRequired[str]
+    """Document abstract or summary."""
     authors: NotRequired[list[str]]
     """List of document authors."""
     categories: NotRequired[list[str]]
@@ -677,9 +680,28 @@ class Metadata(TypedDict, total=False):
     """Error message if extraction failed."""
     error_context: NotRequired[dict[str, Any]]
     """Error context information for debugging."""
+    json_schema: NotRequired[dict[str, Any]]
+    """JSON schema information extracted from structured data."""
+    notes: NotRequired[list[str]]
+    """Notes or additional information extracted from documents."""
+    note: NotRequired[str]
+    """Single note or annotation."""
+    name: NotRequired[str]
+    """Name field from structured data."""
+    body: NotRequired[str]
+    """Body text content."""
+    text: NotRequired[str]
+    """Generic text content."""
+    message: NotRequired[str]
+    """Message or communication content."""
+    attributes: NotRequired[dict[str, Any]]
+    """Additional attributes extracted from structured data (e.g., custom text fields with dotted keys)."""
+    token_reduction: NotRequired[dict[str, float]]
+    """Token reduction statistics including reduction ratios and counts."""


 _VALID_METADATA_KEYS = {
+    "abstract",
     "authors",
     "categories",
     "citations",
@@ -722,6 +744,15 @@ _VALID_METADATA_KEYS = {
     "source_format",
     "error",
     "error_context",
+    "json_schema",
+    "notes",
+    "note",
+    "name",
+    "body",
+    "text",
+    "message",
+    "attributes",
+    "token_reduction",
 }

@@ -730,9 +761,29 @@ def normalize_metadata(data: dict[str, Any] | None) -> Metadata:
         return {}

     normalized: Metadata = {}
+    attributes: dict[str, Any] = {}
+
     for key, value in data.items():
-        if
-
+        if value is not None:
+            if key in _VALID_METADATA_KEYS:
+                normalized[key] = value  # type: ignore[literal-required]
+            elif "." in key and key.split(".")[-1] in {
+                "title",
+                "name",
+                "subject",
+                "description",
+                "content",
+                "body",
+                "text",
+                "message",
+                "note",
+                "abstract",
+                "summary",
+            }:
+                attributes[key] = value
+
+    if attributes:
+        normalized["attributes"] = attributes

     return normalized
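
For orientation, a small sketch of what the reworked `normalize_metadata` now does with a raw metadata dict. The input keys and values below are invented for illustration, and the import is from the private `kreuzberg._types` module:

```python
from kreuzberg._types import normalize_metadata

# Hypothetical raw metadata as an extractor might emit it.
raw = {
    "authors": ["Jane Doe"],      # whitelisted key: copied through
    "custom.title": "Q3 Report",  # dotted key ending in a text-like field: collected
    "internal_id": "abc123",      # dropped (assuming it is not a whitelisted key)
    "subject": None,              # None values are always skipped
}

print(normalize_metadata(raw))
# {'authors': ['Jane Doe'], 'attributes': {'custom.title': 'Q3 Report'}}
```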
@@ -835,6 +886,30 @@ PostProcessingHook = Callable[[ExtractionResult], ExtractionResult | Awaitable[ExtractionResult]]
 ValidationHook = Callable[[ExtractionResult], None | Awaitable[None]]


+@dataclass(unsafe_hash=True, frozen=True, slots=True)
+class JSONExtractionConfig(ConfigDict):
+    extract_schema: bool = False
+    """Extract and include JSON schema information in metadata."""
+    custom_text_field_patterns: frozenset[str] | None = None
+    """Custom patterns to identify text fields beyond default keywords."""
+    max_depth: int = 10
+    """Maximum nesting depth to process in JSON structures."""
+    array_item_limit: int = 1000
+    """Maximum number of array items to process to prevent memory issues."""
+    include_type_info: bool = False
+    """Include data type information in extracted content."""
+    flatten_nested_objects: bool = True
+    """Flatten nested objects using dot notation for better text extraction."""
+
+    def __post_init__(self) -> None:
+        if self.max_depth <= 0:
+            raise ValidationError("max_depth must be positive", context={"max_depth": self.max_depth})
+        if self.array_item_limit <= 0:
+            raise ValidationError(
+                "array_item_limit must be positive", context={"array_item_limit": self.array_item_limit}
+            )
+
+
 @dataclass(unsafe_hash=True, frozen=True, slots=True)
 class ExtractionConfig(ConfigDict):
     force_ocr: bool = False
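
A minimal usage sketch for the new `JSONExtractionConfig`, assuming imports from the private `kreuzberg._types` module (the class may also be re-exported at the package root, given the `kreuzberg/__init__.py` additions); `ValidationError` is kreuzberg's own exception type:

```python
from kreuzberg._types import ExtractionConfig, JSONExtractionConfig
from kreuzberg.exceptions import ValidationError

json_config = JSONExtractionConfig(
    extract_schema=True,          # adds a "json_schema" entry to result metadata
    flatten_nested_objects=True,  # nested objects become dotted keys
    max_depth=5,
)
config = ExtractionConfig(json_config=json_config)

# __post_init__ validates bounds eagerly, at construction time:
try:
    JSONExtractionConfig(max_depth=0)
except ValidationError as exc:
    print(exc)  # max_depth must be positive
```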
@@ -924,6 +999,8 @@
     """Password(s) for encrypted PDF files. Can be a single password or list of passwords to try in sequence. Only used when crypto extra is installed."""
     html_to_markdown_config: HTMLToMarkdownConfig | None = None
     """Configuration for HTML to Markdown conversion. If None, uses default settings."""
+    json_config: JSONExtractionConfig | None = None
+    """Configuration for enhanced JSON extraction features. If None, uses standard JSON processing."""
     use_cache: bool = True
     """Whether to use caching for extraction results. Set to False to disable all caching."""
     target_dpi: int = 150
@@ -936,6 +1013,8 @@
     """Minimum DPI threshold when auto-adjusting DPI."""
     max_dpi: int = 600
     """Maximum DPI threshold when auto-adjusting DPI."""
+    token_reduction: TokenReductionConfig | None = None
+    """Configuration for token reduction to optimize output size while preserving meaning."""

     def __post_init__(self) -> None:
         if self.custom_entity_patterns is not None and isinstance(self.custom_entity_patterns, dict):
@@ -1060,71 +1139,95 @@

 @dataclass(unsafe_hash=True, frozen=True, slots=True)
 class HTMLToMarkdownConfig:
-    stream_processing: bool = False
-    """Enable streaming mode for processing large HTML documents."""
-    chunk_size: int = 1024
-    """Size of chunks when stream_processing is enabled."""
-    chunk_callback: Callable[[str], None] | None = None
-    """Callback function invoked for each chunk during stream processing."""
-    progress_callback: Callable[[int, int], None] | None = None
-    """Callback function for progress updates (current, total)."""
-    parser: str | None = "lxml"
-    """BeautifulSoup parser to use. Defaults to 'lxml' for ~30% better performance. Falls back to 'html.parser' if lxml not available."""
     autolinks: bool = True
-    """
+    """Automatically convert valid URLs to Markdown links."""
+    br_in_tables: bool = False
+    """Use <br> tags for line breaks in table cells instead of spaces."""
     bullets: str = "*+-"
     """Characters to use for unordered list bullets."""
     code_language: str = ""
-    """Default language for code blocks."""
+    """Default language identifier for fenced code blocks."""
     code_language_callback: Callable[[Any], str] | None = None
-    """
-    convert:
-    """HTML tags to convert
+    """Function to dynamically determine code block language."""
+    convert: list[str] | None = None
+    """List of HTML tags to convert (None = all supported tags)."""
     convert_as_inline: bool = False
-    """
-    custom_converters: Mapping[
-    """
+    """Treat content as inline elements only."""
+    custom_converters: Mapping[str, Callable[..., str]] | None = None
+    """Mapping of HTML tag names to custom converter functions."""
     default_title: bool = False
-    """Use
-    escape_asterisks: bool =
-    """Escape
-    escape_misc: bool =
-    """Escape miscellaneous characters
-    escape_underscores: bool =
-    """Escape
+    """Use default titles for elements like links."""
+    escape_asterisks: bool = False
+    """Escape * characters to prevent unintended formatting."""
+    escape_misc: bool = False
+    """Escape miscellaneous characters to prevent Markdown conflicts."""
+    escape_underscores: bool = False
+    """Escape _ characters to prevent unintended formatting."""
     extract_metadata: bool = True
-    """Extract metadata
+    """Extract document metadata as comment header."""
     heading_style: Literal["underlined", "atx", "atx_closed"] = "underlined"
     """Style for markdown headings."""
     highlight_style: Literal["double-equal", "html", "bold"] = "double-equal"
     """Style for highlighting text."""
-    keep_inline_images_in:
-    """
+    keep_inline_images_in: list[str] | None = None
+    """Tags where inline images should be preserved."""
+    list_indent_type: Literal["spaces", "tabs"] = "spaces"
+    """Type of indentation to use for lists."""
+    list_indent_width: int = 4
+    """Number of spaces per indentation level (use 2 for Discord/Slack)."""
     newline_style: Literal["spaces", "backslash"] = "spaces"
     """Style for line breaks in markdown."""
-
-    """HTML
+    preprocess_html: bool = False
+    """Enable HTML preprocessing to clean messy HTML."""
+    preprocessing_preset: Literal["minimal", "standard", "aggressive"] = "standard"
+    """Preprocessing level for cleaning HTML."""
+    remove_forms: bool = True
+    """Remove form elements during preprocessing."""
+    remove_navigation: bool = True
+    """Remove navigation elements during preprocessing."""
+    strip: list[str] | None = None
+    """List of HTML tags to remove from output."""
     strip_newlines: bool = False
-    """
+    """Remove newlines from HTML input before processing."""
     strong_em_symbol: Literal["*", "_"] = "*"
     """Symbol to use for strong/emphasis formatting."""
     sub_symbol: str = ""
     """Symbol to use for subscript text."""
     sup_symbol: str = ""
     """Symbol to use for superscript text."""
+    whitespace_mode: Literal["normalized", "strict"] = "normalized"
+    """Whitespace handling mode."""
     wrap: bool = False
     """Enable text wrapping."""
     wrap_width: int = 80
-    """Width for text wrapping
-    preprocess_html: bool = True
-    """Enable HTML preprocessing to clean up the input."""
-    preprocessing_preset: Literal["minimal", "standard", "aggressive"] = "aggressive"
-    """Preprocessing level for cleaning HTML."""
-    remove_navigation: bool = True
-    """Remove navigation elements from HTML."""
-    remove_forms: bool = True
-    """Remove form elements from HTML."""
+    """Width for text wrapping."""

     def to_dict(self) -> dict[str, Any]:
         result = msgspec.to_builtins(self, builtin_types=(type(None),), order="deterministic")
         return {k: v for k, v in result.items() if v is not None}
+
+
+@dataclass(unsafe_hash=True, frozen=True, slots=True)
+class TokenReductionConfig:
+    mode: Literal["off", "light", "moderate"] = "off"
+    preserve_markdown: bool = True
+    custom_stopwords: dict[str, list[str]] | None = field(default=None, compare=False, hash=False)
+    language_hint: str | None = None
+
+    def __post_init__(self) -> None:
+        if self.language_hint:
+            hint = self.language_hint.strip()
+
+            if not hint or len(hint) > 50 or any(c in hint for c in "\x00\r\n\t"):
+                object.__setattr__(self, "language_hint", None)
+                return
+
+            try:
+                normalized = langcodes.standardize_tag(hint)

+                lang = langcodes.Language.get(normalized).language
+
+                if lang and lang != hint:
+                    object.__setattr__(self, "language_hint", lang)
+            except (ValueError, AttributeError, TypeError):
+                object.__setattr__(self, "language_hint", None)
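
One detail worth calling out from the hunk above: `TokenReductionConfig.__post_init__` sanitizes `language_hint` instead of raising. A short sketch of the observable behavior (the values are illustrative):

```python
from kreuzberg._types import TokenReductionConfig

# A BCP-47 tag is normalized to its bare language subtag via langcodes.
cfg = TokenReductionConfig(mode="moderate", language_hint="en-US")
print(cfg.language_hint)  # "en"

# Hints containing control characters (or longer than 50 chars)
# are silently reset to None rather than raising.
cfg = TokenReductionConfig(language_hint="bad\x00tag")
print(cfg.language_hint)  # None
```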
kreuzberg/_utils/_html_streaming.py
ADDED
@@ -0,0 +1,20 @@
+from __future__ import annotations
+
+_STREAMING_THRESHOLD_KB = 10
+_LARGE_FILE_THRESHOLD_MB = 1
+_DEFAULT_CHUNK_SIZE = 2048
+_LARGE_FILE_CHUNK_SIZE = 4096
+
+_STREAMING_THRESHOLD_BYTES = _STREAMING_THRESHOLD_KB * 1024
+_LARGE_FILE_THRESHOLD_BYTES = _LARGE_FILE_THRESHOLD_MB * 1024 * 1024
+
+
+def should_use_streaming(content_size: int) -> tuple[bool, int]:
+    if content_size < 0:
+        return False, _DEFAULT_CHUNK_SIZE
+
+    if content_size > _STREAMING_THRESHOLD_BYTES:
+        if content_size > _LARGE_FILE_THRESHOLD_BYTES:
+            return True, _LARGE_FILE_CHUNK_SIZE
+        return True, _DEFAULT_CHUNK_SIZE
+    return False, _DEFAULT_CHUNK_SIZE
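
The thresholds above give a simple decision rule; a quick sketch of the resulting behavior (the module is private, so this import path is internal):

```python
from kreuzberg._utils._html_streaming import should_use_streaming

print(should_use_streaming(4 * 1024))         # (False, 2048): under the 10 KB threshold
print(should_use_streaming(200 * 1024))       # (True, 2048):  streams with default chunks
print(should_use_streaming(5 * 1024 * 1024))  # (True, 4096):  above 1 MB, larger chunks
```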
kreuzberg/_utils/_image_preprocessing.py
CHANGED
@@ -198,7 +198,7 @@ def normalize_image_dpi(
            calculated_dpi=calculated_dpi,
        )

-    except OSError as e:
+    except OSError as e:  # pragma: no cover
        return image, ImagePreprocessingMetadata(
            original_dimensions=(original_width, original_height),
            original_dpi=original_dpi,
kreuzberg/_utils/_ref.py
CHANGED
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import threading
 from typing import TYPE_CHECKING, Any, ClassVar, Generic, TypeVar, cast

 if TYPE_CHECKING:
@@ -10,23 +11,30 @@ T = TypeVar("T")

 class Ref(Generic[T]):
     _instances: ClassVar[dict[str, Any]] = {}
+    _lock: ClassVar[threading.Lock] = threading.Lock()

     def __init__(self, name: str, factory: Callable[[], T]) -> None:
         self.name = name
         self.factory = factory

     def get(self) -> T:
-        if self.name
-        self._instances[self.name]
-
+        if self.name in self._instances:
+            return cast("T", self._instances[self.name])
+
+        with self._lock:
+            if self.name not in self._instances:
+                self._instances[self.name] = self.factory()
+            return cast("T", self._instances[self.name])

     def clear(self) -> None:
-
+        with self._lock:
+            if self.name in self._instances:
+                del self._instances[self.name]

     def is_initialized(self) -> bool:
         return self.name in self._instances

     @classmethod
     def clear_all(cls) -> None:
-        cls.
+        with cls._lock:
+            cls._instances.clear()
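
The change above turns `Ref` into a thread-safe lazy singleton: `get()` takes a lock-free fast path once the instance exists, and otherwise re-checks under the class-level lock (double-checked locking). A minimal sketch with a hypothetical factory:

```python
import threading

from kreuzberg._utils._ref import Ref

def build_engine() -> dict[str, bool]:
    # Stand-in for an expensive, share-once resource (hypothetical).
    return {"ready": True}

engine_ref: Ref[dict[str, bool]] = Ref("engine", build_engine)

# Concurrent first access runs the factory at most once.
threads = [threading.Thread(target=engine_ref.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert engine_ref.is_initialized()
engine_ref.clear()  # also lock-protected now
```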
kreuzberg/_utils/_serialization.py
CHANGED
@@ -1,11 +1,10 @@
 from __future__ import annotations

 from dataclasses import is_dataclass
-from typing import Any, TypeVar
+from typing import Any, TypeVar

 import msgspec
 from msgspec import MsgspecError
-from msgspec.msgpack import decode, encode

 T = TypeVar("T")

@@ -42,18 +41,26 @@ def encode_hook(obj: Any) -> Any:
     raise TypeError(f"Unsupported type: {type(obj)!r}")


-def deserialize(value: str | bytes, target_type: type[T]) -> T:
+def deserialize(value: str | bytes, target_type: type[T], json: bool = False) -> T:
+    decoder = msgspec.json.decode if json else msgspec.msgpack.decode
+
+    if json:
+        data = value.encode() if isinstance(value, str) else value
+    else:
+        data = value.encode() if isinstance(value, str) else value
+
     try:
-        return
+        return decoder(data, type=target_type, strict=False)
     except MsgspecError as e:
         raise ValueError(f"Failed to deserialize to {target_type.__name__}: {e}") from e


-def serialize(value: Any, **kwargs: Any) -> bytes:
+def serialize(value: Any, json: bool = False, **kwargs: Any) -> bytes:
     if isinstance(value, dict) and kwargs:
         value = value | kwargs

+    encoder = msgspec.json.encode if json else msgspec.msgpack.encode
     try:
-        return
+        return encoder(value, enc_hook=encode_hook)
     except (MsgspecError, TypeError) as e:
         raise ValueError(f"Failed to serialize {type(value).__name__}: {e}") from e
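
Both helpers now accept a `json` flag that swaps msgpack for `msgspec.json` while keeping the same hooks and error wrapping. A round-trip sketch:

```python
from kreuzberg._utils._serialization import deserialize, serialize

payload = {"title": "Report", "pages": 3}

blob = serialize(payload, json=True)           # msgspec.json.encode under the hood
print(blob)                                    # b'{"title":"Report","pages":3}'

restored = deserialize(blob, dict, json=True)  # msgspec.json.decode, strict=False
assert restored == payload

packed = serialize(payload)                    # default path is still msgpack
assert deserialize(packed, dict) == payload
```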
kreuzberg/_utils/_sync.py
CHANGED
@@ -1,19 +1,16 @@
 from __future__ import annotations

-import asyncio
 from functools import partial
 from inspect import isawaitable, iscoroutinefunction
-from typing import TYPE_CHECKING, Any, TypeVar, cast
+from typing import TYPE_CHECKING, Any, ParamSpec, TypeVar, cast

 import anyio
-from anyio import create_task_group
+from anyio import CapacityLimiter, create_task_group
 from anyio.to_thread import run_sync as any_io_run_sync

 if TYPE_CHECKING:  # pragma: no cover
     from collections.abc import Awaitable, Callable

-    from typing import ParamSpec
-
 T = TypeVar("T")
 P = ParamSpec("P")

@@ -57,24 +54,26 @@ async def run_taskgroup_batched(
         return []

     if len(async_tasks) <= batch_size or not use_semaphore:
-
+        batch_results: list[Any] = []
         for i in range(0, len(async_tasks), batch_size):
             batch = async_tasks[i : i + batch_size]
-
-        return
+            batch_results.extend(await run_taskgroup(*batch))
+        return batch_results

-
+    limiter = CapacityLimiter(batch_size)
+    results: list[tuple[int, Any]] = []

-    async def run_with_semaphore(task: Awaitable[Any], index: int) ->
-        async with
+    async def run_with_semaphore(task: Awaitable[Any], index: int) -> None:
+        async with limiter:
             result = await task
-
+            results.append((index, result))

-
-
+    async with create_task_group() as tg:
+        for i, task in enumerate(async_tasks):
+            tg.start_soon(run_with_semaphore, task, i)

-
-    return [result for _, result in
+    results.sort(key=lambda x: x[0])
+    return [result for _, result in results]


 async def run_maybe_sync(fn: Callable[P, T | Awaitable[T]], *args: P.args, **kwargs: P.kwargs) -> T:
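
With `use_semaphore=True`, the rewrite bounds concurrency with an `anyio.CapacityLimiter` and restores submission order by sorting on the captured index. A usage sketch; the exact signature of `run_taskgroup_batched` is inferred from this hunk:

```python
import anyio

from kreuzberg._utils._sync import run_taskgroup_batched

async def square(n: int) -> int:
    await anyio.sleep(0)  # simulate async work
    return n * n

async def main() -> None:
    tasks = [square(n) for n in range(10)]
    # At most batch_size tasks run concurrently; results come back in order.
    results = await run_taskgroup_batched(*tasks, batch_size=3, use_semaphore=True)
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

anyio.run(main)
```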
kreuzberg/exceptions.py
CHANGED
@@ -17,7 +17,6 @@ class KreuzbergError(Exception):
         super().__init__(message)

     def _serialize_context(self, obj: Any) -> Any:
-        """Recursively serialize context objects to ensure JSON compatibility."""
         if isinstance(obj, bytes):
             return obj.decode("utf-8", errors="replace")
         if isinstance(obj, dict):
kreuzberg/extraction.py
CHANGED
@@ -15,6 +15,7 @@ from kreuzberg._mime_types import (
     validate_mime_type,
 )
 from kreuzberg._registry import ExtractorRegistry
+from kreuzberg._token_reduction import get_reduction_stats, reduce_tokens
 from kreuzberg._types import ExtractionConfig, ExtractionResult
 from kreuzberg._utils._document_cache import get_document_cache
 from kreuzberg._utils._errors import create_error_context
@@ -31,15 +32,6 @@ DEFAULT_CONFIG: Final[ExtractionConfig] = ExtractionConfig()


 async def _handle_cache_async(path: Path, config: ExtractionConfig) -> ExtractionResult | None:
-    """Handle cache lookup and coordination with other processes.
-
-    Args:
-        path: Path to the file being processed
-        config: Extraction configuration
-
-    Returns:
-        Cached result if available, None otherwise
-    """
     cache = get_document_cache()

     cached_result = cache.get(path, config)
@@ -47,7 +39,7 @@ async def _handle_cache_async(path: Path, config: ExtractionConfig) -> ExtractionResult | None:
         return cached_result

     if cache.is_processing(path, config):
-        event = cache.mark_processing(path, config)
+        event = cache.mark_processing(path, config)  # pragma: no cover
         await anyio.to_thread.run_sync(event.wait)  # pragma: no cover

     return cache.get(path, config)  # pragma: no cover
@@ -92,6 +84,30 @@ def _validate_and_post_process_helper(
     if config.auto_detect_document_type:
         result = auto_detect_document_type(result, config, file_path=file_path)

+    if config.token_reduction is not None and config.token_reduction.mode != "off":
+        original_content = result.content
+
+        language_hint = None
+        if result.detected_languages and len(result.detected_languages) > 0:
+            language_hint = result.detected_languages[0]
+
+        reduced_content = reduce_tokens(
+            original_content,
+            config=config.token_reduction,
+            language=language_hint,
+        )
+        reduction_stats = get_reduction_stats(original_content, reduced_content)
+
+        result.content = reduced_content
+        result.metadata["token_reduction"] = {
+            "character_reduction_ratio": reduction_stats["character_reduction_ratio"],
+            "token_reduction_ratio": reduction_stats["token_reduction_ratio"],
+            "original_characters": reduction_stats["original_characters"],
+            "reduced_characters": reduction_stats["reduced_characters"],
+            "original_tokens": reduction_stats["original_tokens"],
+            "reduced_tokens": reduction_stats["reduced_tokens"],
+        }
+
     return result


@@ -362,7 +378,7 @@ def extract_file_sync(
         return cached_result

     if cache.is_processing(path, config):
-        event = cache.mark_processing(path, config)
+        event = cache.mark_processing(path, config)  # pragma: no cover
         event.wait()  # pragma: no cover

     # Try cache again after waiting for other process to complete  # ~keep
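
End to end, the new post-processing step means token reduction can be enabled per extraction, with statistics landing under `metadata["token_reduction"]`. A sketch (the file path is illustrative):

```python
from kreuzberg import ExtractionConfig, extract_file_sync
from kreuzberg._types import TokenReductionConfig

config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="moderate", language_hint="en"),
)

result = extract_file_sync("report.pdf", config=config)  # illustrative path

stats = result.metadata["token_reduction"]
print(f"tokens: {stats['original_tokens']} -> {stats['reduced_tokens']}")
print(f"reduction ratio: {stats['token_reduction_ratio']:.2%}")
```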
{kreuzberg-3.15.0.dist-info → kreuzberg-3.17.0.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kreuzberg
-Version: 3.15.0
+Version: 3.17.0
 Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
 Project-URL: documentation, https://kreuzberg.dev
 Project-URL: homepage, https://github.com/Goldziher/kreuzberg
@@ -31,7 +31,8 @@ Requires-Python: >=3.10
 Requires-Dist: anyio>=4.10.0
 Requires-Dist: chardetng-py>=0.3.5
 Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
-Requires-Dist: html-to-markdown[lxml]>=1.
+Requires-Dist: html-to-markdown[lxml]>=1.13.0
+Requires-Dist: langcodes>=3.5.0
 Requires-Dist: mcp>=1.14.0
 Requires-Dist: msgspec>=0.18.0
 Requires-Dist: numpy>=2.0.0
@@ -49,7 +50,7 @@ Provides-Extra: all
 Requires-Dist: click>=8.2.1; extra == 'all'
 Requires-Dist: deep-translator>=1.11.4; extra == 'all'
 Requires-Dist: easyocr>=1.7.2; extra == 'all'
-Requires-Dist: fast-langdetect>=0.
+Requires-Dist: fast-langdetect>=1.0.0; extra == 'all'
 Requires-Dist: gmft>=0.4.2; extra == 'all'
 Requires-Dist: keybert>=0.9.0; extra == 'all'
 Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.17.0; extra == 'all'
@@ -82,7 +83,7 @@ Requires-Dist: spacy>=3.8.7; extra == 'entity-extraction'
 Provides-Extra: gmft
 Requires-Dist: gmft>=0.4.2; extra == 'gmft'
 Provides-Extra: langdetect
-Requires-Dist: fast-langdetect>=0.
+Requires-Dist: fast-langdetect>=1.0.0; extra == 'langdetect'
 Provides-Extra: paddleocr
 Requires-Dist: paddleocr>=3.2.0; extra == 'paddleocr'
 Requires-Dist: paddlepaddle>=3.2.0; extra == 'paddleocr'
@@ -109,7 +110,7 @@ Description-Content-Type: text/markdown
 - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
 - **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
 - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
-- **Format Support**:
+- **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
 - **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
 - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)

@@ -227,14 +228,15 @@ claude mcp add kreuzberg uvx kreuzberg-mcp

 ## Supported Formats

-| Category
-|
-| **Documents**
-| **Images**
-| **Spreadsheets**
-| **Presentations**
-| **Web**
-| **
+| Category            | Formats                        |
+| ------------------- | ------------------------------ |
+| **Documents**       | PDF, DOCX, DOC, RTF, TXT, EPUB |
+| **Images**          | JPG, PNG, TIFF, BMP, GIF, WEBP |
+| **Spreadsheets**    | XLSX, XLS, CSV, ODS            |
+| **Presentations**   | PPTX, PPT, ODP                 |
+| **Web**             | HTML, XML, MHTML               |
+| **Structured Data** | JSON, YAML, TOML               |
+| **Archives**        | Support via extraction         |

 ## 📊 Performance Characteristics
