PyPI - kreuzberg - Versions diffs - 3.7.0__tar.gz → 3.8.1__tar.gz - Mend

kreuzberg 3.7.0tar.gz → 3.8.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (222)

{kreuzberg-3.7.0 → kreuzberg-3.8.1}/.gitignore RENAMED

@@ -33,3 +33,5 @@ GEMINI.md
 prompt_template.egg-info/
 requirements.txt
 site/
+.cache/
+dist/

{kreuzberg-3.7.0 → kreuzberg-3.8.1}/PKG-INFO RENAMED

@@ -1,14 +1,16 @@
 Metadata-Version: 2.4
 Name: kreuzberg
-Version: 3.7.0
-Summary: A text extraction library supporting PDFs, images, office documents and more
+Version: 3.8.1
+Summary: Advanced document intelligence framework for extracting structured content from PDFs, images, and office documents
 Project-URL: homepage, https://github.com/Goldziher/kreuzberg
 Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
 License: MIT
 License-File: LICENSE
-Keywords: document-processing,entity-extraction,image-to-text,keyword-extraction,named-entity-recognition,ner,ocr,pandoc,pdf-extraction,rag,spacy,table-extraction,tesseract,text-extraction,text-processing
+Keywords: automation,content-extraction,data-processing,document-analysis,document-intelligence,document-processing,entity-extraction,image-to-text,information-extraction,ocr,pdf-extraction,rag,structured-data,table-extraction,text-extraction
 Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3 :: Only
@@ -16,16 +18,19 @@ Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Database
+Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
+Classifier: Topic :: Office/Business :: Office Suites
 Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Topic :: Text Processing :: General
-Classifier: Topic :: Utilities
 Classifier: Typing :: Typed
 Requires-Python: >=3.10
 Requires-Dist: anyio>=4.9.0
-Requires-Dist: charset-normalizer>=3.4.2
+Requires-Dist: chardetng-py>=0.3.4
 Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
-Requires-Dist: html-to-markdown[lxml]>=1.6.0
+Requires-Dist: html-to-markdown[lxml]>=1.8.0
 Requires-Dist: mcp>=1.11.0
 Requires-Dist: msgspec>=0.18.0
 Requires-Dist: playa-pdf>=0.6.1
@@ -34,6 +39,9 @@ Requires-Dist: pypdfium2==4.30.0
 Requires-Dist: python-calamine>=0.3.2
 Requires-Dist: python-pptx>=1.0.2
 Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
+Provides-Extra: additional-extensions
+Requires-Dist: mailparse>=1.0.15; extra == 'additional-extensions'
+Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'additional-extensions'
 Provides-Extra: all
 Requires-Dist: click>=8.2.1; extra == 'all'
 Requires-Dist: easyocr>=1.7.2; extra == 'all'
@@ -41,6 +49,7 @@ Requires-Dist: fast-langdetect>=0.3.2; extra == 'all'
 Requires-Dist: gmft>=0.4.2; extra == 'all'
 Requires-Dist: keybert>=0.9.0; extra == 'all'
 Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.16.0; extra == 'all'
+Requires-Dist: mailparse>=1.0.15; extra == 'all'
 Requires-Dist: paddleocr>=3.1.0; extra == 'all'
 Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
 Requires-Dist: rich>=14.0.0; extra == 'all'
@@ -77,22 +86,33 @@ Description-Content-Type: text/markdown
 [![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
 [![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Test Coverage](https://img.shields.io/badge/coverage-95%25-green)](https://github.com/Goldziher/kreuzberg)
-**High-performance Python library for text extraction from documents.** Extract text from PDFs, images, office documents, and more with both async and sync APIs.
+**Advanced Document Intelligence for Modern Python Applications.** Transform PDFs, images, and office documents into structured data with production-grade performance. Built by engineers who understand that speed, reliability, and developer experience matter.
 📖 **[Complete Documentation](https://goldziher.github.io/kreuzberg/)**
-## Why Kreuzberg?
+## Why Choose Kreuzberg?
-- **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library
-- **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+) with lowest memory usage (~530MB)
-- **⚡ Dual APIs**: Only library with both sync and async support
-- **🔧 Zero Configuration**: Works out of the box with sane defaults
-- **🏠 Local Processing**: No cloud dependencies or external API calls
-- **📦 Rich Format Support**: PDFs, images, Office docs, HTML, and more
-- **🔍 Multiple OCR Engines**: Tesseract, EasyOCR, and PaddleOCR support
-- **🤖 AI Integration**: Native MCP server for Claude and other AI tools
-- **🐳 Production Ready**: CLI, REST API, MCP server, and Docker images included
+### ⚡ Proven Performance
+[Benchmarked](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) 6-126x faster than alternatives while using minimal resources. Process up to 14 files per second with 87MB install size and ~360MB memory usage. Optimized for production workloads and resource-constrained environments.
+### 🏗️ Production Engineering
+Comprehensive test coverage (95%+), robust error handling, and true async/await support. Built with modern Python practices for reliability in production environments.
+### 🔧 Developer Experience
+Works immediately with smart defaults, scales as you grow. Native MCP integration for AI tools, full type safety, and clear documentation.
+### 🚀 Flexible Deployment
+Deploy on serverless platforms, containers, or traditional servers. Supports both CPU and GPU processing (via PaddleOCR and EasyOCR). No external API dependencies. Multiple deployment modes: CLI, REST API, MCP server.
+### 📄 Comprehensive Format Support
+Extract from PDFs, images, Office documents, HTML, spreadsheets, and presentations. Multiple OCR engines with intelligent fallbacks, table extraction, and content preparation for RAG workflows.
 ## Quick Start
@@ -128,7 +148,7 @@ import asyncio
 from kreuzberg import extract_file
 async def main():
-    # Extract from any document type
+    # Extract content from files
     result = await extract_file("document.pdf")
     print(result.content)
     print(result.metadata)
@@ -197,7 +217,7 @@ docker run -p 8000:8000 goldziher/kreuzberg:latest
 curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
 ```
-Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`
+Available variants: `latest`, `v3.8.0`, `v3.8.0-easyocr`, `v3.8.0-paddle`, `v3.8.0-gmft`, `v3.8.0-all`
 ### 🌐 REST API
@@ -240,23 +260,28 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 | **Web**           | HTML, XML, MHTML               |
 | **Archives**      | Support via extraction         |
-## Performance
+## 📊 Performance Comparison
-**[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):
+[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) across ~100 real-world documents • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://goldziher.github.io/kreuzberg/performance-analysis/):
-| Library       | Speed           | Memory    | Install Size | Dependencies | Success Rate |
-| ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
-| **Kreuzberg** | **35+ files/s** | **530MB** | **71MB**     | **20**       | High\*       |
-| Unstructured  | Moderate        | ~1GB      | 146MB        | 54           | 88%+         |
-| MarkItDown    | Good†           | ~1.5GB    | 251MB        | 25           | 80%†         |
-| Docling       | 60+ min/file‡   | ~5GB      | 1,032MB      | 88           | Low‡         |
+| Framework     | Speed        | Memory | Install Size | Dependencies | Success Rate |
+| ------------- | ------------ | ------ | ------------ | ------------ | ------------ |
+| **Kreuzberg** | 14.4 files/s | 360MB  | 87MB         | 43           | 100%         |
+| Unstructured  | ~12 files/s  | ~1GB   | 146MB        | 54           | 88%+         |
+| MarkItDown    | ~15 files/s  | ~1.5GB | 251MB        | 25           | 80%\*        |
+| Docling       | ~1 file/min  | ~5GB   | 1,032MB      | 88           | 45%\*        |
-\*_Can achieve 75% reliability with 15% performance trade-off when configured_
-†_Good on simple documents, struggles with large/complex files (>10MB)_
-‡_Frequently fails/times out on medium files (>1MB)_
+\*_Performance varies significantly with document complexity and size_
-> **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
-> **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
+**Key strengths:**
+- 6-126x faster processing than comparable frameworks
+- Smallest installation footprint and memory usage
+- Only framework with built-in async/await support
+- Supports both CPU and GPU processing
+- Built by software engineers for production reliability
+> **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
 ## Documentation
@@ -264,34 +289,13 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 - [Installation Guide](https://goldziher.github.io/kreuzberg/getting-started/installation/) - Setup and dependencies
 - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - Comprehensive usage guide
+- [Performance Analysis](https://goldziher.github.io/kreuzberg/performance-analysis/) - Detailed benchmark results
 - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Complete API documentation
 - [Docker Guide](https://goldziher.github.io/kreuzberg/user-guide/docker/) - Container deployment
 - [REST API](https://goldziher.github.io/kreuzberg/user-guide/api-server/) - HTTP endpoints
 - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line usage
 - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - OCR engine setup
-## Advanced Features
-- **🤖 MCP Server**: Native integration with Claude Desktop and AI tools
-- **📊 Table Extraction**: Extract tables from PDFs with GMFT
-- **🧩 Content Chunking**: Split documents for RAG applications
-- **🎯 Custom Extractors**: Extend with your own document handlers
-- **🔧 Configuration**: Flexible TOML-based configuration
-- **🪝 Hooks**: Pre/post-processing customization
-- **🌍 Multi-language OCR**: 100+ languages supported
-- **⚙️ Metadata Extraction**: Rich document metadata
-- **🔄 Batch Processing**: Efficient bulk document processing
 ## License
 MIT License - see [LICENSE](LICENSE) for details.
-______________________________________________________________________
-<div align="center">
-**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**
-Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
-</div>
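
The PKG-INFO hunks above add an `additional-extensions` extra and extend `all` with `mailparse`. As a rough sketch of how such `Requires-Dist` lines can be grouped by extra, the snippet below uses a handful of 3.8.1 lines copied from the diff. The helper name `group_by_extra` is hypothetical, the regex only handles simple `extra == '...'` markers, and other marker parts (such as `python_version < '3.11'`) are discarded; for real metadata, a proper parser such as the `packaging` library would be the safer choice.

```python
import re
from collections import defaultdict

# A subset of the Requires-Dist lines from the 3.8.1 PKG-INFO shown in the diff.
REQUIRES_DIST = [
    "anyio>=4.9.0",
    "chardetng-py>=0.3.4",
    "html-to-markdown[lxml]>=1.8.0",
    "mailparse>=1.0.15; extra == 'additional-extensions'",
    "tomli>=2.0.0; (python_version < '3.11') and extra == 'additional-extensions'",
    "mailparse>=1.0.15; extra == 'all'",
]

# Matches the extra name inside a marker such as: extra == 'all'
_EXTRA_RE = re.compile(r"extra\s*==\s*['\"]([^'\"]+)['\"]")


def group_by_extra(lines):
    """Map each extra name ("" for core dependencies) to its requirement specs.

    Hypothetical helper for illustration: it keeps only the part before ';'
    and ignores every marker other than the extra name.
    """
    groups = defaultdict(list)
    for line in lines:
        spec, _, marker = line.partition(";")
        match = _EXTRA_RE.search(marker)
        groups[match.group(1) if match else ""].append(spec.strip())
    return dict(groups)


groups = group_by_extra(REQUIRES_DIST)
for extra, specs in groups.items():
    print(extra or "(core)", specs)
```

Per the metadata above, the new optional dependencies would be pulled in with `pip install "kreuzberg[additional-extensions]"`.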

{kreuzberg-3.7.0 → kreuzberg-3.8.1}/README.md RENAMED

@@ -4,22 +4,33 @@
 [![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
 [![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Test Coverage](https://img.shields.io/badge/coverage-95%25-green)](https://github.com/Goldziher/kreuzberg)
-**High-performance Python library for text extraction from documents.** Extract text from PDFs, images, office documents, and more with both async and sync APIs.
+**Advanced Document Intelligence for Modern Python Applications.** Transform PDFs, images, and office documents into structured data with production-grade performance. Built by engineers who understand that speed, reliability, and developer experience matter.
 📖 **[Complete Documentation](https://goldziher.github.io/kreuzberg/)**
-## Why Kreuzberg?
+## Why Choose Kreuzberg?
-- **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library
-- **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+) with lowest memory usage (~530MB)
-- **⚡ Dual APIs**: Only library with both sync and async support
-- **🔧 Zero Configuration**: Works out of the box with sane defaults
-- **🏠 Local Processing**: No cloud dependencies or external API calls
-- **📦 Rich Format Support**: PDFs, images, Office docs, HTML, and more
-- **🔍 Multiple OCR Engines**: Tesseract, EasyOCR, and PaddleOCR support
-- **🤖 AI Integration**: Native MCP server for Claude and other AI tools
-- **🐳 Production Ready**: CLI, REST API, MCP server, and Docker images included
+### ⚡ Proven Performance
+[Benchmarked](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) 6-126x faster than alternatives while using minimal resources. Process up to 14 files per second with 87MB install size and ~360MB memory usage. Optimized for production workloads and resource-constrained environments.
+### 🏗️ Production Engineering
+Comprehensive test coverage (95%+), robust error handling, and true async/await support. Built with modern Python practices for reliability in production environments.
+### 🔧 Developer Experience
+Works immediately with smart defaults, scales as you grow. Native MCP integration for AI tools, full type safety, and clear documentation.
+### 🚀 Flexible Deployment
+Deploy on serverless platforms, containers, or traditional servers. Supports both CPU and GPU processing (via PaddleOCR and EasyOCR). No external API dependencies. Multiple deployment modes: CLI, REST API, MCP server.
+### 📄 Comprehensive Format Support
+Extract from PDFs, images, Office documents, HTML, spreadsheets, and presentations. Multiple OCR engines with intelligent fallbacks, table extraction, and content preparation for RAG workflows.
 ## Quick Start
@@ -55,7 +66,7 @@ import asyncio
 from kreuzberg import extract_file
 async def main():
-    # Extract from any document type
+    # Extract content from files
     result = await extract_file("document.pdf")
     print(result.content)
     print(result.metadata)
@@ -124,7 +135,7 @@ docker run -p 8000:8000 goldziher/kreuzberg:latest
 curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
 ```
-Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`
+Available variants: `latest`, `v3.8.0`, `v3.8.0-easyocr`, `v3.8.0-paddle`, `v3.8.0-gmft`, `v3.8.0-all`
 ### 🌐 REST API
@@ -167,23 +178,28 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 | **Web**           | HTML, XML, MHTML               |
 | **Archives**      | Support via extraction         |
-## Performance
+## 📊 Performance Comparison
-**[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):
+[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) across ~100 real-world documents • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://goldziher.github.io/kreuzberg/performance-analysis/):
-| Library       | Speed           | Memory    | Install Size | Dependencies | Success Rate |
-| ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
-| **Kreuzberg** | **35+ files/s** | **530MB** | **71MB**     | **20**       | High\*       |
-| Unstructured  | Moderate        | ~1GB      | 146MB        | 54           | 88%+         |
-| MarkItDown    | Good†           | ~1.5GB    | 251MB        | 25           | 80%†         |
-| Docling       | 60+ min/file‡   | ~5GB      | 1,032MB      | 88           | Low‡         |
+| Framework     | Speed        | Memory | Install Size | Dependencies | Success Rate |
+| ------------- | ------------ | ------ | ------------ | ------------ | ------------ |
+| **Kreuzberg** | 14.4 files/s | 360MB  | 87MB         | 43           | 100%         |
+| Unstructured  | ~12 files/s  | ~1GB   | 146MB        | 54           | 88%+         |
+| MarkItDown    | ~15 files/s  | ~1.5GB | 251MB        | 25           | 80%\*        |
+| Docling       | ~1 file/min  | ~5GB   | 1,032MB      | 88           | 45%\*        |
-\*_Can achieve 75% reliability with 15% performance trade-off when configured_
-†_Good on simple documents, struggles with large/complex files (>10MB)_
-‡_Frequently fails/times out on medium files (>1MB)_
+\*_Performance varies significantly with document complexity and size_
-> **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
-> **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
+**Key strengths:**
+- 6-126x faster processing than comparable frameworks
+- Smallest installation footprint and memory usage
+- Only framework with built-in async/await support
+- Supports both CPU and GPU processing
+- Built by software engineers for production reliability
+> **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
 ## Documentation
@@ -191,34 +207,13 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 - [Installation Guide](https://goldziher.github.io/kreuzberg/getting-started/installation/) - Setup and dependencies
 - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - Comprehensive usage guide
+- [Performance Analysis](https://goldziher.github.io/kreuzberg/performance-analysis/) - Detailed benchmark results
 - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Complete API documentation
 - [Docker Guide](https://goldziher.github.io/kreuzberg/user-guide/docker/) - Container deployment
 - [REST API](https://goldziher.github.io/kreuzberg/user-guide/api-server/) - HTTP endpoints
 - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line usage
 - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - OCR engine setup
-## Advanced Features
-- **🤖 MCP Server**: Native integration with Claude Desktop and AI tools
-- **📊 Table Extraction**: Extract tables from PDFs with GMFT
-- **🧩 Content Chunking**: Split documents for RAG applications
-- **🎯 Custom Extractors**: Extend with your own document handlers
-- **🔧 Configuration**: Flexible TOML-based configuration
-- **🪝 Hooks**: Pre/post-processing customization
-- **🌍 Multi-language OCR**: 100+ languages supported
-- **⚙️ Metadata Extraction**: Rich document metadata
-- **🔄 Batch Processing**: Efficient bulk document processing
 ## License
 MIT License - see [LICENSE](LICENSE) for details.
-______________________________________________________________________
-<div align="center">
-**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**
-Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
-</div>

{kreuzberg-3.7.0 → kreuzberg-3.8.1}/benchmarks/pyproject.toml RENAMED

@@ -3,9 +3,12 @@ name = "kreuzberg-benchmarks"
 version = "0.1.0"
 description = "Performance benchmarking suite for Kreuzberg text extraction library"
 readme = "README.md"
-requires-python = ">=3.13"
+requires-python = ">=3.10"
 classifiers = [
   "Programming Language :: Python :: 3 :: Only",
+  "Programming Language :: Python :: 3.10",
+  "Programming Language :: Python :: 3.11",
+  "Programming Language :: Python :: 3.12",
   "Programming Language :: Python :: 3.13",
 ]
 dependencies = [
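
The hunk above relaxes the benchmark suite's `requires-python` from `>=3.13` to `>=3.10`. A minimal sketch of what that means for a given interpreter is below; the constants and the `satisfies` helper are illustrative names, not part of the package, and the check covers only a plain `>=X.Y` bound rather than full version-specifier syntax.

```python
import sys

# Lower bounds taken from the requires-python values in the hunk above.
OLD_MINIMUM = (3, 13)  # kreuzberg-benchmarks in 3.7.0
NEW_MINIMUM = (3, 10)  # kreuzberg-benchmarks in 3.8.1


def satisfies(minimum, version=None):
    """Return True if (major, minor) meets a simple '>=major.minor' bound.

    Hypothetical helper: defaults to the running interpreter's version.
    """
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)


# Python 3.11 was excluded by the old bound and is accepted by the new one.
print(satisfies(OLD_MINIMUM, (3, 11)), satisfies(NEW_MINIMUM, (3, 11)))
```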
