PyPI - kreuzberg - Versions diffs - 4.0.6__cp310-abi3-macosx_14_0_arm64.whl - Mend

kreuzberg 4.0.6__cp310-abi3-macosx_14_0_arm64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of kreuzberg might be problematic.

Files changed (17)

kreuzberg/__init__.py +931 -0
kreuzberg/__main__.py +160 -0
kreuzberg/_internal_bindings.abi3.so +0 -0
kreuzberg/_setup_lib_path.py +143 -0
kreuzberg/exceptions.py +254 -0
kreuzberg/ocr/__init__.py +25 -0
kreuzberg/ocr/easyocr.py +371 -0
kreuzberg/ocr/paddleocr.py +284 -0
kreuzberg/ocr/protocol.py +150 -0
kreuzberg/postprocessors/__init__.py +61 -0
kreuzberg/postprocessors/protocol.py +83 -0
kreuzberg/py.typed +0 -0
kreuzberg/types.py +509 -0
kreuzberg-4.0.6.dist-info/METADATA +470 -0
kreuzberg-4.0.6.dist-info/RECORD +17 -0
kreuzberg-4.0.6.dist-info/WHEEL +4 -0
kreuzberg-4.0.6.dist-info/entry_points.txt +2 -0

kreuzberg-4.0.6.dist-info/METADATA ADDED

@@ -0,0 +1,470 @@
+Metadata-Version: 2.4
+Name: kreuzberg
+Version: 4.0.6
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Programming Language :: Python :: Implementation :: CPython
+Classifier: Programming Language :: Rust
+Classifier: Topic :: Office/Business
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Text Processing
+Classifier: Topic :: Text Processing :: Filters
+Classifier: Topic :: Text Processing :: General
+Classifier: Typing :: Typed
+Requires-Dist: kreuzberg[easyocr,paddleocr] ; extra == 'all'
+Requires-Dist: easyocr>=1.7.2 ; python_full_version < '3.14' and extra == 'easyocr'
+Requires-Dist: torch>=2.9.1 ; python_full_version < '3.14' and extra == 'easyocr'
+Requires-Dist: paddleocr>=3.3.2 ; python_full_version < '3.14' and extra == 'paddleocr'
+Requires-Dist: paddlepaddle>=3.2.1,<3.2.2 ; python_full_version < '3.14' and extra == 'paddleocr'
+Requires-Dist: setuptools>=80.9 ; python_full_version < '3.14' and extra == 'paddleocr'
+Provides-Extra: all
+Provides-Extra: easyocr
+Provides-Extra: paddleocr
+Summary: High-performance document intelligence library for Python. Extract text, metadata, and structured data from PDFs, Office documents, images, and 50+ formats. Powered by Rust core for 10-50x speed improvements.
+Keywords: document-extraction,document-intelligence,document-parsing,document-processing,docx,easyocr,email-parsing,html,markdown,metadata-extraction,ocr,office-documents,paddleocr,pdf,pdf-extraction,performance,pptx,rust,table-extraction,tesseract,text-extraction,xlsx,xml
+Home-Page: https://goldziher.github.io/kreuzberg/
+Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+Maintainer-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
+License: MIT
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Changelog, https://kreuzberg.dev/CHANGELOG/
+Project-URL: Documentation, https://kreuzberg.dev
+Project-URL: Homepage, https://kreuzberg.dev
+Project-URL: Issues, https://github.com/kreuzberg-dev/kreuzberg/issues
+Project-URL: Repository, https://github.com/kreuzberg-dev/kreuzberg
+Project-URL: Source, https://github.com/kreuzberg-dev/kreuzberg
+# Python
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+  <!-- Language Bindings -->
+  <a href="https://crates.io/crates/kreuzberg">
+    <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
+  </a>
+  <a href="https://hex.pm/packages/kreuzberg">
+    <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
+  </a>
+  <a href="https://pypi.org/project/kreuzberg/">
+    <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/node">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
+  </a>
+  <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
+    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
+  </a>
+  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
+    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
+  </a>
+  <a href="https://www.nuget.org/packages/Kreuzberg/">
+    <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
+  </a>
+  <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
+    <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
+  </a>
+  <a href="https://rubygems.org/gems/kreuzberg">
+    <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
+  </a>
+  <!-- Project Info -->
+  <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
+    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
+  </a>
+  <a href="https://docs.kreuzberg.dev">
+    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
+  </a>
+</div>
+<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
+<div align="center" style="margin-top: 20px;">
+  <a href="https://discord.gg/pXxagNK2zN">
+      <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Native Python bindings with async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and extensible plugin system.
+## Installation
+### Package Installation
+Install via pip:
+```bash
+pip install kreuzberg
+```
+For the optional Python OCR backends (EasyOCR and PaddleOCR):
+```bash
+pip install "kreuzberg[all]"
+```
+### System Requirements
+- **Python 3.10+** required
+- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
+- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
+## Quick Start
+### Basic Extraction
+Extract text, metadata, and structure from any supported document format:
+```python
+import asyncio
+from kreuzberg import extract_file, ExtractionConfig
+async def main() -> None:
+    config = ExtractionConfig(
+        use_cache=True,
+        enable_quality_processing=True
+    )
+    result = await extract_file("document.pdf", config=config)
+    print(result.content)
+asyncio.run(main())
+```
+### Common Use Cases
+#### Extract with Custom Configuration
+Most use cases benefit from configuration to control extraction behavior:
+**With OCR (for scanned documents):**
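+```python
+import asyncio
+from kreuzberg import extract_file, ExtractionConfig, OcrConfig
+async def main() -> None:
+    # Force OCR with the Tesseract backend, as needed for a scanned PDF.
+    config = ExtractionConfig(
+        force_ocr=True,
+        ocr=OcrConfig(backend="tesseract", language="eng"),
+    )
+    result = await extract_file("scanned.pdf", config=config)
+    print(result.content)
+asyncio.run(main())
+```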
+#### Table Extraction
+```python
+import asyncio
+from kreuzberg import extract_file
+async def main() -> None:
+    result = await extract_file("document.pdf")
+    content: str = result.content
+    tables: int = len(result.tables)
+    format_type: str | None = result.metadata.format_type
+    print(f"Content length: {len(content)} characters")
+    print(f"Tables found: {tables}")
+    print(f"Format: {format_type}")
+asyncio.run(main())
+```
+#### Processing Multiple Files
+```python
+import asyncio
+from kreuzberg import extract_file
+async def main() -> None:
+    # Example paths; replace with your own documents.
+    paths = ["document.pdf", "scanned.pdf"]
+    # Schedule one extraction per file and await them concurrently.
+    results = await asyncio.gather(*(extract_file(path) for path in paths))
+    for path, result in zip(paths, results):
+        print(f"{path}: {len(result.content)} characters")
+asyncio.run(main())
+```
+#### Async Processing
+For non-blocking document processing:
+```python
+import asyncio
+from pathlib import Path
+from kreuzberg import extract_file
+async def main() -> None:
+    file_path: Path = Path("document.pdf")
+    result = await extract_file(file_path)
+    print(f"Content: {result.content}")
+    print(f"Format: {result.metadata.format_type}")
+    print(f"Tables: {len(result.tables)}")
+asyncio.run(main())
+```
+### Next Steps
+- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
+- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
+- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
+- **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)** - Advanced configuration options
+## Features
+### Supported File Formats (56+)
+56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
+#### Office Documents
+| Category | Formats | Capabilities |
+|----------|---------|--------------|
+| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
+| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
+| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
+| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
+| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
+#### Images (OCR-Enabled)
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
+| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
+| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
+#### Web & Data
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
+| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
+| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
+#### Email & Archives
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
+| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
+#### Academic & Scientific
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
+| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
+| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
+**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
+### Key Capabilities
+- **Text Extraction** - Extract all text content with position and formatting information
+- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
+- **Table Extraction** - Parse tables with structure and cell content preservation
+- **Image Extraction** - Extract embedded images and render page previews
+- **OCR Support** - Integrate multiple OCR backends for scanned documents
+- **Async/Await** - Non-blocking document processing with concurrent operations
+- **Plugin System** - Extensible post-processing for custom text transformation
+- **Embeddings** - Generate vector embeddings using ONNX Runtime models
+- **Batch Processing** - Efficiently process multiple documents in parallel
+- **Memory Efficient** - Stream large files without loading entirely into memory
+- **Language Detection** - Detect and support multiple languages in documents
+- **Configuration** - Fine-grained control over extraction behavior
+### Performance Characteristics
+| Format | Speed | Memory | Notes |
+|--------|-------|--------|-------|
+| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
+| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
+| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
+| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
+| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
+## OCR Support
+Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
+- **Tesseract**
+- **EasyOCR**
+- **PaddleOCR**
+### OCR Configuration Example
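+```python
+import asyncio
+from kreuzberg import extract_file, ExtractionConfig, OcrConfig
+async def main() -> None:
+    # Select the OCR backend and language through OcrConfig.
+    config = ExtractionConfig(
+        force_ocr=True,
+        ocr=OcrConfig(backend="tesseract", language="eng"),
+    )
+    result = await extract_file("scanned.pdf", config=config)
+    print(result.content)
+asyncio.run(main())
+```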
+## Async Support
+This binding provides full async/await support for non-blocking document processing:
+```python
+import asyncio
+from pathlib import Path
+from kreuzberg import extract_file
+async def main() -> None:
+    file_path: Path = Path("document.pdf")
+    result = await extract_file(file_path)
+    print(f"Content: {result.content}")
+    print(f"Format: {result.metadata.format_type}")
+    print(f"Tables: {len(result.tables)}")
+asyncio.run(main())
+```
+## Plugin System
+Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
+For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/guides/plugins/).
+## Embeddings Support
+Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
+**[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
+## Batch Processing
+Process multiple documents efficiently:
+```python
+import asyncio
+from kreuzberg import extract_file, ExtractionConfig
+async def main() -> None:
+    # Example paths; replace with your own documents.
+    paths = ["document.pdf", "scanned.pdf"]
+    config = ExtractionConfig(use_cache=True)
+    # Run one extraction per file concurrently; results keep the input order.
+    results = await asyncio.gather(
+        *(extract_file(path, config=config) for path in paths)
+    )
+    for path, result in zip(paths, results):
+        print(f"{path}: {len(result.content)} characters")
+asyncio.run(main())
+```
+## Configuration
+For advanced configuration options including language detection, table extraction, OCR settings, and more:
+**[Configuration Guide](https://kreuzberg.dev/guides/configuration/)**
+## Documentation
+- **[Official Documentation](https://kreuzberg.dev/)**
+- **[API Reference](https://kreuzberg.dev/reference/api-python/)**
+- **[Examples & Guides](https://kreuzberg.dev/guides/)**
+## Contributing
+Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
+## License
+MIT License - see LICENSE file for details.
+## Support
+- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
+- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
+- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
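
The `Provides-Extra` and `Requires-Dist` headers in the METADATA hunk above determine which extras a `kreuzberg[...]` install request can resolve. As a minimal sketch (not part of the package), these headers can be read with Python's standard `email` parser, since core metadata uses an RFC 822-style header layout. The constant `METADATA_EXCERPT` and the helpers `declared_extras` and `requirements_for` are hypothetical names introduced for this illustration; the header values are copied from the hunk. The marker check is a plain substring match, which is an assumption that holds for these lines but is not a general marker parser.

```python
from email import message_from_string

# Excerpt of the headers from the METADATA hunk (leading "+" diff markers removed).
METADATA_EXCERPT = """\
Metadata-Version: 2.4
Name: kreuzberg
Version: 4.0.6
Requires-Dist: kreuzberg[easyocr,paddleocr] ; extra == 'all'
Requires-Dist: easyocr>=1.7.2 ; python_full_version < '3.14' and extra == 'easyocr'
Requires-Dist: torch>=2.9.1 ; python_full_version < '3.14' and extra == 'easyocr'
Requires-Dist: paddleocr>=3.3.2 ; python_full_version < '3.14' and extra == 'paddleocr'
Requires-Dist: paddlepaddle>=3.2.1,<3.2.2 ; python_full_version < '3.14' and extra == 'paddleocr'
Requires-Dist: setuptools>=80.9 ; python_full_version < '3.14' and extra == 'paddleocr'
Provides-Extra: all
Provides-Extra: easyocr
Provides-Extra: paddleocr
Requires-Python: >=3.10
"""


def declared_extras(metadata_text: str) -> list[str]:
    """Return the extras named by Provides-Extra headers, in file order."""
    message = message_from_string(metadata_text)
    return message.get_all("Provides-Extra") or []


def requirements_for(metadata_text: str, extra: str) -> list[str]:
    """Return the Requires-Dist specifiers whose marker names the given extra."""
    message = message_from_string(metadata_text)
    needle = f"extra == '{extra}'"  # naive substring match on the marker text
    return [
        value.split(";", 1)[0].strip()
        for value in message.get_all("Requires-Dist") or []
        if needle in value
    ]


print(declared_extras(METADATA_EXCERPT))
print(requirements_for(METADATA_EXCERPT, "paddleocr"))
# A name that has no Provides-Extra header is not a declared extra.
print("async" in declared_extras(METADATA_EXCERPT))
```

For anything beyond a quick inspection, a dedicated requirement parser is the safer choice, because environment markers can be combined in ways a substring match does not handle.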

kreuzberg-4.0.6.dist-info/RECORD ADDED

@@ -0,0 +1,17 @@
+kreuzberg/__init__.py,sha256=4djrQ4GGY3NiWztqrqFdOOuVXsIXxZaPnf1rSWV04OQ,32008
+kreuzberg/__main__.py,sha256=wQJIcjFj9mGv54ea5T3XKAlXMMXjeMfDMWX7cSQyS4E,4977
+kreuzberg/_internal_bindings.abi3.so,sha256=OUQDDFnKWDP_vji5_TN1DDBPFcZB7Py_FD_xkPYX-pg,30574288
+kreuzberg/_setup_lib_path.py,sha256=pGAKaVRXq2eNSXUgnJZkUwYBdUwWla43cg0RSdotbNs,4570
+kreuzberg/exceptions.py,sha256=ZX9aBhaxCzjPWn5P5eFq02KbIDzQcxWPUoyS2p38pks,7967
+kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+kreuzberg/types.py,sha256=g6tYGnj139eB-74TiV-iZN1C-d2-WG6gtitE88Ci-OM,12876
+kreuzberg/ocr/__init__.py,sha256=8MrVCtEu-5HXGAJL9wubKwNNP0U8Cr35zd0EyzW9gdY,791
+kreuzberg/ocr/easyocr.py,sha256=XwxXC6t5JpDzc-3gVR-Ns2xy-j4e0ofklgmKB7VNuUs,10168
+kreuzberg/ocr/paddleocr.py,sha256=nTP9Xi5nQzVDotQ7O-hobLjTWHyaYjKPEYoIhsHMgCo,8806
+kreuzberg/ocr/protocol.py,sha256=6nzTEIo3jK6i9O5EhoHaaZiWQ1P5VPF1bW6sAxSGshQ,5453
+kreuzberg/postprocessors/__init__.py,sha256=xvzJ_NxzzuThWlTyFVc0pyuSv4R28f7oOkIhT7Dk9yQ,2283
+kreuzberg/postprocessors/protocol.py,sha256=sU5prEkg3kuTnygv8j1GBPY-nevZgm4FrKzDPjOnOjU,2545
+kreuzberg-4.0.6.dist-info/METADATA,sha256=ZKe888vfN82MKnBW6ywBKf-qxiXxfb3BKKjIfGMIs6M,15186
+kreuzberg-4.0.6.dist-info/WHEEL,sha256=6rvbSekKj8Ky-umb-C-gAUiNALwG7Ly9wjfiqiV9R_M,104
+kreuzberg-4.0.6.dist-info/entry_points.txt,sha256=OpqEOa3KCMZvGWMUSYMkBIXji-LZcgqnuBknEImvWJY,52
+kreuzberg-4.0.6.dist-info/RECORD,,

kreuzberg-4.0.6.dist-info/WHEEL ADDED

@@ -0,0 +1,4 @@
+Wheel-Version: 1.0
+Generator: maturin (1.11.5)
+Root-Is-Purelib: false
+Tag: cp310-abi3-macosx_14_0_arm64
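
The `Tag` value above is a compatibility tag made of three dash-separated parts: the Python tag (`cp310`), the ABI tag (`abi3`, CPython's stable ABI) and the platform tag (`macosx_14_0_arm64`). As a rough sketch, the tag can be split like this; the `WheelTag` type and the two helper functions are hypothetical names for illustration, and the macOS split assumes the `macosx_<major>_<minor>_<arch>` shape seen here.

```python
from typing import NamedTuple


class WheelTag(NamedTuple):
    python: str
    abi: str
    platform: str


def parse_wheel_tag(tag: str) -> WheelTag:
    """Split a single 'python-abi-platform' compatibility tag."""
    python, abi, platform = tag.split("-")
    return WheelTag(python, abi, platform)


def parse_macos_platform(platform: str) -> tuple[int, int, str]:
    """Return (major, minor, arch) from a 'macosx_<major>_<minor>_<arch>' tag."""
    prefix, major, minor, arch = platform.split("_", 3)
    if prefix != "macosx":
        raise ValueError(f"not a macOS platform tag: {platform!r}")
    return int(major), int(minor), arch


tag = parse_wheel_tag("cp310-abi3-macosx_14_0_arm64")
print(tag)
print(parse_macos_platform(tag.platform))
```

Read this way, the wheel targets CPython 3.10 or newer through the stable ABI, on macOS 14.0 or later for arm64, which is consistent with `Root-Is-Purelib: false` and the compiled `_internal_bindings.abi3.so` in the file list.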

kreuzberg-4.0.6.dist-info/entry_points.txt ADDED

@@ -0,0 +1,2 @@
+[console_scripts]
+kreuzberg=kreuzberg.__main__:main
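
The `console_scripts` entry above maps the command name `kreuzberg` to the object reference `kreuzberg.__main__:main`, so an installer creates a `kreuzberg` executable that calls `main` from `kreuzberg/__main__.py`. Because `entry_points.txt` uses an INI-style layout, a minimal sketch can read it with `configparser`; the constant `ENTRY_POINTS_TXT` and the helper `console_scripts` are hypothetical names used only for this illustration.

```python
from configparser import ConfigParser

# Contents of entry_points.txt from the hunk (leading "+" diff markers removed).
ENTRY_POINTS_TXT = """\
[console_scripts]
kreuzberg=kreuzberg.__main__:main
"""


def console_scripts(text: str) -> dict[str, tuple[str, str]]:
    """Map each console script name to its (module, attribute) target."""
    parser = ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)
    if not parser.has_section("console_scripts"):
        return {}
    scripts = {}
    for name, target in parser.items("console_scripts"):
        module, _, attribute = target.partition(":")
        scripts[name] = (module.strip(), attribute.strip())
    return scripts


print(console_scripts(ENTRY_POINTS_TXT))
```

For an installed distribution, `importlib.metadata.entry_points()` exposes the same information without parsing the file by hand.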