kreuzberg-3.4.0-py3-none-any.whl → kreuzberg-3.4.1-py3-none-any.whl

kreuzberg/__init__.py CHANGED
@@ -1,3 +1,5 @@
+ from importlib.metadata import version
+
  from kreuzberg._gmft import GMFTConfig
  from kreuzberg._ocr._easyocr import EasyOCRConfig
  from kreuzberg._ocr._paddleocr import PaddleOCRConfig
@@ -18,7 +20,7 @@ from .extraction import (
      extract_file_sync,
  )

- __version__ = "3.2.0"
+ __version__ = version("kreuzberg")

  __all__ = [
      "EasyOCRConfig",
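The change above replaces a hard-coded `__version__` string (which still read `3.2.0` in the 3.4.0 wheel) with a lookup against the installed distribution's metadata. A minimal sketch of that pattern follows. The `resolve_version` helper and its fallback value are illustrative assumptions and are not part of kreuzberg; the diff itself calls `version("kreuzberg")` directly, without a fallback.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0.0.0") -> str:
    """Return the installed version of dist_name, or fallback if it is not installed."""
    try:
        # Reads the Version field from the distribution's METADATA file.
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when no installed distribution matches dist_name,
        # e.g. when running from a source checkout that was never installed.
        return fallback


if __name__ == "__main__":
    print(resolve_version("kreuzberg"))
```

With this approach the reported version follows the `Version:` field of the installed METADATA, so it cannot drift from the release number the way a string literal can.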
kreuzberg-3.4.1.dist-info/METADATA ADDED
@@ -0,0 +1,233 @@
+ Metadata-Version: 2.4
+ Name: kreuzberg
+ Version: 3.4.1
+ Summary: A text extraction library supporting PDFs, images, office documents and more
+ Project-URL: homepage, https://github.com/Goldziher/kreuzberg
+ Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
+ License: MIT
+ License-File: LICENSE
+ Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,table-extraction,tesseract,text-extraction,text-processing
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: Text Processing :: General
+ Classifier: Topic :: Utilities
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.9
+ Requires-Dist: anyio>=4.9.0
+ Requires-Dist: charset-normalizer>=3.4.2
+ Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
+ Requires-Dist: html-to-markdown>=1.4.0
+ Requires-Dist: msgspec>=0.18.0
+ Requires-Dist: playa-pdf>=0.6.1
+ Requires-Dist: psutil>=7.0.0
+ Requires-Dist: pypdfium2==4.30.0
+ Requires-Dist: python-calamine>=0.3.2
+ Requires-Dist: python-pptx>=1.0.2
+ Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
+ Provides-Extra: all
+ Requires-Dist: click>=8.2.1; extra == 'all'
+ Requires-Dist: easyocr>=1.7.2; extra == 'all'
+ Requires-Dist: gmft>=0.4.2; extra == 'all'
+ Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'all'
+ Requires-Dist: paddleocr>=3.1.0; extra == 'all'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
+ Requires-Dist: rich>=14.0.0; extra == 'all'
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
+ Requires-Dist: setuptools>=80.9.0; extra == 'all'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
+ Provides-Extra: api
+ Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'api'
+ Provides-Extra: chunking
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
+ Provides-Extra: cli
+ Requires-Dist: click>=8.2.1; extra == 'cli'
+ Requires-Dist: rich>=14.0.0; extra == 'cli'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
+ Provides-Extra: easyocr
+ Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
+ Provides-Extra: gmft
+ Requires-Dist: gmft>=0.4.2; extra == 'gmft'
+ Provides-Extra: paddleocr
+ Requires-Dist: paddleocr>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
+ Description-Content-Type: text/markdown
+
+ # Kreuzberg
+
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
+ [![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
+ [![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ **High-performance Python library for text extraction from documents.** Extract text from PDFs, images, office documents, and more with both async and sync APIs.
+
+ 📖 **[Complete Documentation](https://goldziher.github.io/kreuzberg/)**
+
+ ## Why Kreuzberg?
+
+ - **🚀 Fastest Performance**: [Benchmarked](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) as the fastest text extraction library
+ - **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+)
+ - **⚡ Dual APIs**: Only library with both sync and async support
+ - **🔧 Zero Configuration**: Works out of the box with sane defaults
+ - **🏠 Local Processing**: No cloud dependencies or external API calls
+ - **📦 Rich Format Support**: PDFs, images, Office docs, HTML, and more
+ - **🔍 Multiple OCR Engines**: Tesseract, EasyOCR, and PaddleOCR support
+ - **🐳 Production Ready**: CLI, REST API, and Docker images included
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ # Basic installation
+ pip install kreuzberg
+
+ # With optional features
+ pip install "kreuzberg[cli,api]" # CLI + REST API
+ pip install "kreuzberg[easyocr,gmft]" # EasyOCR + table extraction
+ pip install "kreuzberg[all]" # Everything
+ ```
+
+ ### System Dependencies
+
+ ```bash
+ # Ubuntu/Debian
+ sudo apt-get install tesseract-ocr pandoc
+
+ # macOS
+ brew install tesseract pandoc
+
+ # Windows
+ choco install tesseract pandoc
+ ```
+
+ ### Basic Usage
+
+ ```python
+ import asyncio
+ from kreuzberg import extract_file
+
+ async def main():
+     # Extract from any document type
+     result = await extract_file("document.pdf")
+     print(result.content)
+     print(result.metadata)
+
+ asyncio.run(main())
+ ```
+
+ ## Deployment Options
+
+ ### 🐳 Docker (Recommended)
+
+ ```bash
+ # Run API server
+ docker run -p 8000:8000 goldziher/kreuzberg:3.4.0
+
+ # Extract files
+ curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
+ ```
+
+ Available variants: `3.4.0`, `3.4.0-easyocr`, `3.4.0-paddle`, `3.4.0-gmft`, `3.4.0-all`
+
+ ### 🌐 REST API
+
+ ```bash
+ # Install and run
+ pip install "kreuzberg[api]"
+ litestar --app kreuzberg._api.main:app run
+
+ # Health check
+ curl http://localhost:8000/health
+
+ # Extract files
+ curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
+ ```
+
+ ### 💻 Command Line
+
+ ```bash
+ # Install CLI
+ pip install "kreuzberg[cli]"
+
+ # Extract to stdout
+ kreuzberg extract document.pdf
+
+ # JSON output with metadata
+ kreuzberg extract document.pdf --output-format json --show-metadata
+
+ # Batch processing
+ kreuzberg extract *.pdf --output-dir ./extracted/
+ ```
+
+ ## Supported Formats
+
+ | Category | Formats |
+ | ----------------- | ------------------------------ |
+ | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
+ | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
+ | **Spreadsheets** | XLSX, XLS, CSV, ODS |
+ | **Presentations** | PPTX, PPT, ODP |
+ | **Web** | HTML, XML, MHTML |
+ | **Archives** | Support via extraction |
+
+ ## Performance
+
+ **Fastest extraction speeds** with minimal resource usage:
+
+ | Library | Speed | Memory | Size | Success Rate |
+ | ------------- | -------------- | ------------- | ----------- | ------------ |
+ | **Kreuzberg** | ⚡ **Fastest** | 💾 **Lowest** | 📦 **71MB** | ✅ **100%** |
+ | Unstructured | 2-3x slower | 2x higher | 146MB | 95% |
+ | MarkItDown | 3-4x slower | 3x higher | 251MB | 90% |
+ | Docling | 4-5x slower | 10x higher | 1,032MB | 85% |
+
+ > **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
+
+ ## Documentation
+
+ ### Quick Links
+
+ - [Installation Guide](https://goldziher.github.io/kreuzberg/getting-started/installation/) - Setup and dependencies
+ - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - Comprehensive usage guide
+ - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Complete API documentation
+ - [Docker Guide](https://goldziher.github.io/kreuzberg/user-guide/docker/) - Container deployment
+ - [REST API](https://goldziher.github.io/kreuzberg/user-guide/api-server/) - HTTP endpoints
+ - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line usage
+ - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - OCR engine setup
+
+ ## Advanced Features
+
+ - **📊 Table Extraction**: Extract tables from PDFs with GMFT
+ - **🧩 Content Chunking**: Split documents for RAG applications
+ - **🎯 Custom Extractors**: Extend with your own document handlers
+ - **🔧 Configuration**: Flexible TOML-based configuration
+ - **🪝 Hooks**: Pre/post-processing customization
+ - **🌍 Multi-language OCR**: 100+ languages supported
+ - **⚙️ Metadata Extraction**: Rich document metadata
+ - **🔄 Batch Processing**: Efficient bulk document processing
+
+ ## License
+
+ MIT License - see [LICENSE](LICENSE) for details.
+
+ ______________________________________________________________________
+
+ <div align="center">
+
+ **[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Discord](https://discord.gg/pXxagNK2zN)**
+
+ Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
+
+ </div>
kreuzberg-3.4.0.dist-info/RECORD → kreuzberg-3.4.1.dist-info/RECORD CHANGED
@@ -1,4 +1,4 @@
- kreuzberg/__init__.py,sha256=jRm2U-loiKWwJpgOFgZ8Ev2mfz9sI1qJOZ2V3OoJUlg,1258
+ kreuzberg/__init__.py,sha256=5GP2j8PI3P_ZNSEhLpm8iqseY3i4nye6iUmVGUnfzno,1311
  kreuzberg/__main__.py,sha256=s2qM1nPEkRHAQP-G3P7sf5l6qA_KJeIEHS5LpPz04lg,183
  kreuzberg/_chunker.py,sha256=2eHSRHcZdJ2ZjR3in49y3o9tPl5HMO3vkbnMqaVCbHI,1887
  kreuzberg/_cli_config.py,sha256=WD_seFjbuay_NJv77vGLBW6BVV9WZNujdzf3zQkhzPc,5691
@@ -43,8 +43,8 @@ kreuzberg/_utils/_serialization.py,sha256=AhZvyAu4KsjAqyZDh--Kn2kSWGgCuH7udio8lT
  kreuzberg/_utils/_string.py,sha256=owIVkUtP0__GiJD9RIJzPdvyIigT5sQho3mOXPbsnW0,958
  kreuzberg/_utils/_sync.py,sha256=IsKkR_YmseZKY6Asz6w3k-dgMXcrVaI06jWfDY7Bol4,4842
  kreuzberg/_utils/_tmp.py,sha256=5rqG_Nlb9xweaLqJA8Kc5csHDase9_eY_Fq93rNQGWc,1044
- kreuzberg-3.4.0.dist-info/METADATA,sha256=Rg939xe9b-H0TExRcJKkf9MFg7-kWM_fvzGvV6VDG0Q,11215
- kreuzberg-3.4.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- kreuzberg-3.4.0.dist-info/entry_points.txt,sha256=VdoFaTl3QSvVWOZcIlPpDd47o6kn7EvmXSs8FI0ExLc,48
- kreuzberg-3.4.0.dist-info/licenses/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
- kreuzberg-3.4.0.dist-info/RECORD,,
+ kreuzberg-3.4.1.dist-info/METADATA,sha256=g3DwLXNiDzvPDBApPnDp3BeZ4SbVN0NTrEzN9cyKy34,8751
+ kreuzberg-3.4.1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ kreuzberg-3.4.1.dist-info/entry_points.txt,sha256=VdoFaTl3QSvVWOZcIlPpDd47o6kn7EvmXSs8FI0ExLc,48
+ kreuzberg-3.4.1.dist-info/licenses/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
+ kreuzberg-3.4.1.dist-info/RECORD,,
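Each RECORD row above has the form `path,sha256=<digest>,<size>`. Per the Python packaging specification for recording installed projects, the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, and the size is in bytes (the RECORD file's own row leaves both fields empty). The sketch below shows how such a row could be recomputed to check a file against the listing; the `record_entry` helper name and the sample input are hypothetical.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style row for a file path and its raw bytes."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the trailing "=" padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


if __name__ == "__main__":
    # Sample bytes only; a real check would read the file from the unpacked wheel.
    print(record_entry("example/__init__.py", b"__version__ = '0.0.0'\n"))
```

Comparing the recomputed row with the listed one confirms that only `kreuzberg/__init__.py` and the `dist-info` METADATA changed between the two wheels, while unchanged files such as `WHEEL` keep identical digests.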
kreuzberg-3.4.0.dist-info/METADATA REMOVED
@@ -1,290 +0,0 @@
- Metadata-Version: 2.4
- Name: kreuzberg
- Version: 3.4.0
- Summary: A text extraction library supporting PDFs, images, office documents and more
- Project-URL: homepage, https://github.com/Goldziher/kreuzberg
- Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
- License: MIT
- License-File: LICENSE
- Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,table-extraction,tesseract,text-extraction,text-processing
- Classifier: Development Status :: 4 - Beta
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3 :: Only
- Classifier: Programming Language :: Python :: 3.13
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Classifier: Topic :: Text Processing :: General
- Classifier: Topic :: Utilities
- Classifier: Typing :: Typed
- Requires-Python: >=3.13
- Requires-Dist: anyio>=4.9.0
- Requires-Dist: charset-normalizer>=3.4.2
- Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
- Requires-Dist: html-to-markdown>=1.4.0
- Requires-Dist: msgspec>=0.18.0
- Requires-Dist: playa-pdf>=0.6.1
- Requires-Dist: psutil>=7.0.0
- Requires-Dist: pypdfium2==4.30.0
- Requires-Dist: python-calamine>=0.3.2
- Requires-Dist: python-pptx>=1.0.2
- Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
- Provides-Extra: all
- Requires-Dist: click>=8.2.1; extra == 'all'
- Requires-Dist: easyocr>=1.7.2; extra == 'all'
- Requires-Dist: gmft>=0.4.2; extra == 'all'
- Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'all'
- Requires-Dist: paddleocr>=3.1.0; extra == 'all'
- Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
- Requires-Dist: rich>=14.0.0; extra == 'all'
- Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
- Requires-Dist: setuptools>=80.9.0; extra == 'all'
- Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
- Provides-Extra: api
- Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'api'
- Provides-Extra: chunking
- Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
- Provides-Extra: cli
- Requires-Dist: click>=8.2.1; extra == 'cli'
- Requires-Dist: rich>=14.0.0; extra == 'cli'
- Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
- Provides-Extra: easyocr
- Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
- Provides-Extra: gmft
- Requires-Dist: gmft>=0.4.2; extra == 'gmft'
- Provides-Extra: paddleocr
- Requires-Dist: paddleocr>=3.1.0; extra == 'paddleocr'
- Requires-Dist: paddlepaddle>=3.1.0; extra == 'paddleocr'
- Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
- Description-Content-Type: text/markdown
-
- # Kreuzberg
-
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
- [![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
- [![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-
- Kreuzberg is a **high-performance** Python library for text extraction from documents. **Benchmarked as one of the fastest text extraction libraries available**, it provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs optimized for speed and efficiency.
-
- ## Why Kreuzberg?
-
- - **🚀 Substantially Faster**: Extraction speeds that significantly outperform other text extraction libraries
- - **⚡ Unique Dual API**: The only framework supporting both sync and async APIs for maximum flexibility
- - **💾 Memory Efficient**: Lower memory footprint compared to competing libraries
- - **📊 Proven Performance**: [Comprehensive benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) demonstrate superior performance across formats
- - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- - **Local Processing**: No external API calls or cloud dependencies required
- - **Resource Efficient**: Lightweight processing without GPU requirements
- - **Format Support**: Comprehensive support for documents, images, and text formats
- - **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
- - **Command Line Interface**: Powerful CLI for batch processing and automation
- - **Metadata Extraction**: Get document metadata alongside text content
- - **Table Extraction**: Extract tables from documents using the excellent GMFT library
- - **Modern Python**: Built with async/await, type hints, and a functional-first approach
- - **Permissive OSS**: MIT licensed with permissively licensed dependencies
-
- ## Quick Start
-
- ```bash
- pip install kreuzberg
-
- # Or install with CLI support
- pip install "kreuzberg[cli]"
-
- # Or install with API server
- pip install "kreuzberg[api]"
- ```
-
- Install pandoc:
-
- ```bash
- # Ubuntu/Debian
- sudo apt-get install tesseract-ocr pandoc
-
- # macOS
- brew install tesseract pandoc
-
- # Windows
- choco install -y tesseract pandoc
- ```
-
- The tesseract OCR engine is the default OCR engine. You can decide not to use it - and then either use one of the two alternative OCR engines, or have no OCR at all.
-
- ### Alternative OCR engines
-
- ```bash
- # Install with EasyOCR support
- pip install "kreuzberg[easyocr]"
-
- # Install with PaddleOCR support
- pip install "kreuzberg[paddleocr]"
- ```
-
- ## Quick Example
-
- ```python
- import asyncio
- from kreuzberg import extract_file
-
- async def main():
-     # Extract text from a PDF
-     result = await extract_file("document.pdf")
-     print(result.content)
-
-     # Extract text from an image
-     result = await extract_file("scan.jpg")
-     print(result.content)
-
-     # Extract text from a Word document
-     result = await extract_file("report.docx")
-     print(result.content)
-
- asyncio.run(main())
- ```
-
- ## Docker
-
- Docker images are available for easy deployment:
-
- ```bash
- # Run the API server
- docker run -p 8000:8000 goldziher/kreuzberg:latest
-
- # Extract files via API
- curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
- ```
-
- See the [Docker documentation](https://goldziher.github.io/kreuzberg/user-guide/docker/) for more options.
-
- ## REST API
-
- Run Kreuzberg as a REST API server:
-
- ```bash
- pip install "kreuzberg[api]"
- litestar --app kreuzberg._api.main:app run
- ```
-
- See the [API documentation](https://goldziher.github.io/kreuzberg/user-guide/api-server/) for endpoints and usage.
-
- ## Command Line Interface
-
- Kreuzberg includes a powerful CLI for processing documents from the command line:
-
- ```bash
- # Extract text from a file
- kreuzberg extract document.pdf
-
- # Extract with JSON output and metadata
- kreuzberg extract document.pdf --output-format json --show-metadata
-
- # Extract from stdin
- cat document.html | kreuzberg extract
-
- # Use specific OCR backend
- kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de
-
- # Extract with configuration file
- kreuzberg extract document.pdf --config config.toml
- ```
-
- ### CLI Configuration
-
- Configure via `pyproject.toml`:
-
- ```toml
- [tool.kreuzberg]
- force_ocr = true
- chunk_content = false
- extract_tables = true
- max_chars = 4000
- ocr_backend = "tesseract"
-
- [tool.kreuzberg.tesseract]
- language = "eng+deu"
- psm = 3
- ```
-
- For full CLI documentation, see the [CLI Guide](https://goldziher.github.io/kreuzberg/cli/).
-
- ## Documentation
-
- For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):
-
- - [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
- - [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
- - [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line interface documentation
- - [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
- - [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
- - [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
- - [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) - Choose the right OCR engine
-
- ## Supported Formats
-
- Kreuzberg supports a wide range of document formats:
-
- - **Documents**: PDF, DOCX, RTF, TXT, EPUB, etc.
- - **Images**: JPG, PNG, TIFF, BMP, GIF, etc.
- - **Spreadsheets**: XLSX, XLS, CSV, etc.
- - **Presentations**: PPTX, PPT, etc.
- - **Web Content**: HTML, XML, etc.
-
- ## OCR Engines
-
- Kreuzberg supports multiple OCR engines:
-
- - **Tesseract** (Default): Lightweight, fast startup, requires system installation
- - **EasyOCR**: Good for many languages, pure Python, but downloads models on first use
- - **PaddleOCR**: Excellent for Asian languages, pure Python, but downloads models on first use
-
- For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.
-
- ## Performance
-
- Kreuzberg delivers **exceptional performance** compared to other text extraction libraries:
-
- ### 🏆 Competitive Benchmarks
-
- [Comprehensive benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) comparing Kreuzberg against other popular Python text extraction libraries show:
-
- - **Fastest Extraction**: Consistently fastest processing times across file formats
- - **Lowest Memory Usage**: Most memory-efficient text extraction solution
- - **100% Success Rate**: Reliable extraction across all tested document types
- - **Optimal for High-Throughput**: Designed for real-time, production applications
-
- ### 💾 Installation Size Efficiency
-
- Kreuzberg delivers maximum performance with minimal overhead:
-
- 1. **Kreuzberg**: 71.0 MB (20 deps) - Most lightweight
- 1. **Unstructured**: 145.8 MB (54 deps) - Moderate footprint
- 1. **MarkItDown**: 250.7 MB (25 deps) - ML inference overhead
- 1. **Docling**: 1,031.9 MB (88 deps) - Full ML stack included
-
- **Kreuzberg is up to 14x smaller** than competing solutions while delivering superior performance.
-
- ### ⚡ Sync vs Async Performance
-
- Kreuzberg is the only library offering both sync and async APIs. Choose based on your use case:
-
- | Operation | Sync Time | Async Time | Async Advantage |
- | ---------------------- | --------- | ---------- | ------------------ |
- | Simple text (Markdown) | 0.4ms | 17.5ms | **❌ 41x slower** |
- | HTML documents | 1.6ms | 1.1ms | **✅ 1.5x faster** |
- | Complex PDFs | 39.0s | 8.5s | **✅ 4.6x faster** |
- | OCR processing | 0.4s | 0.7s | **✅ 1.7x faster** |
- | Batch operations | 38.6s | 8.5s | **✅ 4.5x faster** |
-
- **Rule of thumb:** Use async for complex documents, OCR, batch processing, and backend APIs.
-
- For detailed benchmarks and methodology, see our [Performance Documentation](https://goldziher.github.io/kreuzberg/advanced/performance/).
-
- ## Contributing
-
- We welcome contributions! Please see our [Contributing Guide](docs/contributing.md) for details on setting up your development environment and submitting pull requests.
-
- ## License
-
- This library is released under the MIT license.