kreuzberg 2.1.2__py3-none-any.whl → 3.0.1__py3-none-any.whl

Files changed (40)
  1. kreuzberg/__init__.py +16 -2
  2. kreuzberg/_chunker.py +51 -0
  3. kreuzberg/_constants.py +2 -3
  4. kreuzberg/_extractors/__init__.py +0 -0
  5. kreuzberg/_extractors/_base.py +92 -0
  6. kreuzberg/_extractors/_html.py +34 -0
  7. kreuzberg/_extractors/_image.py +74 -0
  8. kreuzberg/_extractors/_pandoc.py +613 -0
  9. kreuzberg/_extractors/_pdf.py +163 -0
  10. kreuzberg/_extractors/_presentation.py +233 -0
  11. kreuzberg/_extractors/_spread_sheet.py +125 -0
  12. kreuzberg/_mime_types.py +19 -26
  13. kreuzberg/_ocr/__init__.py +17 -0
  14. kreuzberg/_ocr/_base.py +54 -0
  15. kreuzberg/_ocr/_easyocr.py +376 -0
  16. kreuzberg/_ocr/_paddleocr.py +291 -0
  17. kreuzberg/_ocr/_tesseract.py +342 -0
  18. kreuzberg/_playa.py +276 -0
  19. kreuzberg/_registry.py +108 -0
  20. kreuzberg/_types.py +133 -36
  21. kreuzberg/_utils/__init__.py +0 -0
  22. kreuzberg/{_string.py → _utils/_string.py} +0 -2
  23. kreuzberg/_utils/_sync.py +121 -0
  24. kreuzberg/{_tmp.py → _utils/_tmp.py} +1 -1
  25. kreuzberg/exceptions.py +25 -0
  26. kreuzberg/extraction.py +114 -227
  27. kreuzberg-3.0.1.dist-info/METADATA +178 -0
  28. kreuzberg-3.0.1.dist-info/RECORD +32 -0
  29. {kreuzberg-2.1.2.dist-info → kreuzberg-3.0.1.dist-info}/WHEEL +1 -1
  30. kreuzberg/_html.py +0 -31
  31. kreuzberg/_pandoc.py +0 -366
  32. kreuzberg/_pdf.py +0 -190
  33. kreuzberg/_pptx.py +0 -88
  34. kreuzberg/_sync.py +0 -74
  35. kreuzberg/_tesseract.py +0 -231
  36. kreuzberg/_xlsx.py +0 -88
  37. kreuzberg-2.1.2.dist-info/METADATA +0 -446
  38. kreuzberg-2.1.2.dist-info/RECORD +0 -21
  39. {kreuzberg-2.1.2.dist-info → kreuzberg-3.0.1.dist-info/licenses}/LICENSE +0 -0
  40. {kreuzberg-2.1.2.dist-info → kreuzberg-3.0.1.dist-info}/top_level.txt +0 -0
@@ -1,446 +0,0 @@
1
- Metadata-Version: 2.2
2
- Name: kreuzberg
3
- Version: 2.1.2
4
- Summary: A text extraction library supporting PDFs, images, office documents and more
5
- Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
6
- License: MIT
7
- Project-URL: homepage, https://github.com/Goldziher/kreuzberg
8
- Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,tesseract,text-extraction,text-processing
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: Intended Audience :: Developers
11
- Classifier: License :: OSI Approved :: MIT License
12
- Classifier: Operating System :: OS Independent
13
- Classifier: Programming Language :: Python :: 3 :: Only
14
- Classifier: Programming Language :: Python :: 3.9
15
- Classifier: Programming Language :: Python :: 3.10
16
- Classifier: Programming Language :: Python :: 3.11
17
- Classifier: Programming Language :: Python :: 3.12
18
- Classifier: Programming Language :: Python :: 3.13
19
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
- Classifier: Topic :: Text Processing :: General
22
- Classifier: Topic :: Utilities
23
- Classifier: Typing :: Typed
24
- Requires-Python: >=3.9
25
- Description-Content-Type: text/markdown
26
- License-File: LICENSE
27
- Requires-Dist: anyio>=4.8.0
28
- Requires-Dist: charset-normalizer>=3.4.1
29
- Requires-Dist: exceptiongroup>=1.2.2; python_version < "3.11"
30
- Requires-Dist: html-to-markdown>=1.2.0
31
- Requires-Dist: pypdfium2==4.30.0
32
- Requires-Dist: python-calamine>=0.3.1
33
- Requires-Dist: python-pptx>=1.0.2
34
- Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
35
-
36
- # Kreuzberg
37
-
38
- Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.
39
-
40
- ## Why Kreuzberg?
41
-
42
- - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
43
- - **Local Processing**: No external API calls or cloud dependencies required
44
- - **Resource Efficient**: Lightweight processing without GPU requirements
45
- - **Small Package Size**: Has few curated dependencies and a minimal footprint
46
- - **Format Support**: Comprehensive support for documents, images, and text formats
47
- - **Modern Python**: Built with async/await, type hints, and functional first approach
48
- - **Permissive OSS**: Kreuzberg and its dependencies have a permissive OSS license
49
-
50
- Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. Its designed for modern async applications, serverless functions, and dockerized applications.
51
-
52
- ## Installation
53
-
54
- ### 1. Install the Python Package
55
-
56
- ```shell
57
- pip install kreuzberg
58
- ```
59
-
60
- ### 2. Install System Dependencies
61
-
62
- Kreuzberg requires two system level dependencies:
63
-
64
- - [Pandoc](https://pandoc.org/installing.html) - For document format conversion. Minimum required version is Pandoc 2.
65
- - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR. Minimum required version is Tesseract 5.
66
-
67
- You can install these with:
68
-
69
- #### Linux (Ubuntu)
70
-
71
- ```shell
72
- sudo apt-get install pandoc tesseract-ocr
73
- ```
74
-
75
- #### MacOS
76
-
77
- ```shell
78
- brew install tesseract pandoc
79
- ```
80
-
81
- #### Windows
82
-
83
- ```shell
84
- choco install -y tesseract pandoc
85
- ```
86
-
87
- Notes:
88
-
89
- - in most distributions the tesseract-ocr package is split into multiple packages, you may need to install any language models you need aside from English separately.
90
- - please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
91
-
92
- ## Architecture
93
-
94
- Kreuzberg integrates:
95
-
96
- - **PDF Processing**:
97
- - `pdfium2` for searchable PDFs
98
- - Tesseract OCR for scanned content
99
- - **Document Conversion**:
100
- - Pandoc for many document and markup formats
101
- - `python-pptx` for PowerPoint files
102
- - `html-to-markdown` for HTML content
103
- - `calamine` for Excel spreadsheets (with multi-sheet support)
104
- - **Text Processing**:
105
- - Smart encoding detection
106
- - Markdown and plain text handling
107
-
108
- ### Supported Formats
109
-
110
- #### Document Formats
111
-
112
- - PDF (`.pdf`, both searchable and scanned)
113
- - Microsoft Word (`.docx`)
114
- - PowerPoint presentations (`.pptx`)
115
- - OpenDocument Text (`.odt`)
116
- - Rich Text Format (`.rtf`)
117
- - EPUB (`.epub`)
118
- - DocBook XML (`.dbk`, `.xml`)
119
- - FictionBook (`.fb2`)
120
- - LaTeX (`.tex`, `.latex`)
121
- - Typst (`.typ`)
122
-
123
- #### Markup and Text Formats
124
-
125
- - HTML (`.html`, `.htm`)
126
- - Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
127
- - reStructuredText (`.rst`)
128
- - Org-mode (`.org`)
129
- - DokuWiki (`.txt`)
130
- - Pod (`.pod`)
131
- - Troff/Man (`.1`, `.2`, etc.)
132
-
133
- #### Data and Research Formats
134
-
135
- - Spreadsheets (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlam`, `.xla`, `.ods`)
136
- - CSV (`.csv`) and TSV (`.tsv`) files
137
- - OPML files (`.opml`)
138
- - Jupyter Notebooks (`.ipynb`)
139
- - BibTeX (`.bib`) and BibLaTeX (`.bib`)
140
- - CSL-JSON (`.json`)
141
- - EndNote and JATS XML (`.xml`)
142
- - RIS (`.ris`)
143
-
144
- #### Image Formats
145
-
146
- - JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
147
- - PNG (`.png`)
148
- - TIFF (`.tiff`, `.tif`)
149
- - BMP (`.bmp`)
150
- - GIF (`.gif`)
151
- - JPEG 2000 family (`.jp2`, `.jpm`, `.jpx`, `.mj2`)
152
- - WebP (`.webp`)
153
- - Portable anymap formats (`.pbm`, `.pgm`, `.ppm`, `.pnm`)
154
-
155
- ## Usage
156
-
157
- Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:
158
-
159
- - Single Item Processing:
160
-
161
- - `extract_file()`: Async function to extract text from a file (accepts string path or `pathlib.Path`)
162
- - `extract_bytes()`: Async function to extract text from bytes (accepts a byte string)
163
- - `extract_file_sync()`: Synchronous version of `extract_file()`
164
- - `extract_bytes_sync()`: Synchronous version of `extract_bytes()`
165
-
166
- - Batch Processing:
167
- - `batch_extract_file()`: Async function to extract text from multiple files concurrently
168
- - `batch_extract_bytes()`: Async function to extract text from multiple byte contents concurrently
169
- - `batch_extract_file_sync()`: Synchronous version of `batch_extract_file()`
170
- - `batch_extract_bytes_sync()`: Synchronous version of `batch_extract_bytes()`
171
-
172
- ### Configuration Parameters
173
-
174
- All extraction functions accept the following optional parameters for configuring OCR and performance:
175
-
176
- #### OCR Configuration
177
-
178
- - `force_ocr`(default: `False`): Forces OCR processing even for searchable PDFs.
179
- - `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
180
-
181
- - `eng` for English
182
- - `deu` for German
183
- - `eng+deu` for English and German
184
-
185
- Notes: - the order of languages effect processing time, the first language is the primary language and the second language is the secondary language etc.
186
-
187
- - `psm` (Page Segmentation Mode, default: `PSM.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
188
-
189
- Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on both options.
190
-
191
- #### Processing Configuration
192
-
193
- - `max_processes` (default: CPU count): Maximum number of concurrent processes for Tesseract.
194
-
195
- ### Quick Start
196
-
197
- ```python
198
- from pathlib import Path
199
- from kreuzberg import extract_file
200
- from kreuzberg import ExtractionResult
201
- from kreuzberg import PSMMode
202
-
203
-
204
- # Basic file extraction
205
- async def extract_document():
206
- # Extract from a PDF file with default settings
207
- pdf_result: ExtractionResult = await extract_file("document.pdf")
208
- print(f"Content: {pdf_result.content}")
209
-
210
- # Extract from an image with German language model
211
- img_result = await extract_file(
212
- "scan.png",
213
- language="deu", # German language model
214
- psm=PSMMode.SINGLE_BLOCK, # Treat as single block of text
215
- max_processes=4 # Limit concurrent processes
216
- )
217
- print(f"Image text: {img_result.content}")
218
-
219
- # Extract from Word document with metadata
220
- docx_result = await extract_file(Path("document.docx"))
221
- if docx_result.metadata:
222
- print(f"Title: {docx_result.metadata.get('title')}")
223
- print(f"Author: {docx_result.metadata.get('creator')}")
224
- ```
225
-
226
- ### Extracting Bytes
227
-
228
- ```python
229
- from kreuzberg import extract_bytes
230
- from kreuzberg import ExtractionResult
231
-
232
-
233
- async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
234
- """Process uploaded file content with known MIME type."""
235
- return await extract_bytes(
236
- file_content,
237
- mime_type=mime_type,
238
- )
239
-
240
-
241
- # Example usage with different file types
242
- async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
243
- # Process PDF upload
244
- pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
245
- print(f"PDF content: {pdf_result.content}")
246
- print(f"PDF metadata: {pdf_result.metadata}")
247
-
248
- # Process image upload (will use OCR)
249
- img_result = await process_upload(image_bytes, mime_type="image/jpeg")
250
- print(f"Image text: {img_result.content}")
251
-
252
- # Process Word document upload
253
- docx_result = await process_upload(
254
- docx_bytes,
255
- mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
256
- )
257
- print(f"Word content: {docx_result.content}")
258
- ```
259
-
260
- ### Batch Processing
261
-
262
- Kreuzberg supports efficient batch processing of multiple files or byte contents:
263
-
264
- ```python
265
- from pathlib import Path
266
- from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync
267
-
268
-
269
- # Process multiple files concurrently
270
- async def process_documents(file_paths: list[Path]) -> None:
271
- # Extract from multiple files
272
- results = await batch_extract_file(file_paths)
273
- for path, result in zip(file_paths, results):
274
- print(f"File {path}: {result.content[:100]}...")
275
-
276
-
277
- # Process multiple uploaded files concurrently
278
- async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
279
- # Each item is a tuple of (content, mime_type)
280
- results = await batch_extract_bytes(contents)
281
- for (_, mime_type), result in zip(contents, results):
282
- print(f"Upload {mime_type}: {result.content[:100]}...")
283
-
284
-
285
- # Synchronous batch processing is also available
286
- def process_documents_sync(file_paths: list[Path]) -> None:
287
- results = batch_extract_file_sync(file_paths)
288
- for path, result in zip(file_paths, results):
289
- print(f"File {path}: {result.content[:100]}...")
290
- ```
291
-
292
- Features:
293
-
294
- - Ordered results
295
- - Concurrent processing
296
- - Error handling per item
297
- - Async and sync interfaces
298
- - Same options as single extraction
299
-
300
- ### PDF Processing
301
-
302
- Kreuzberg employs a smart approach to PDF text extraction:
303
-
304
- 1. **Searchable Text Detection**: First attempts to extract text directly from searchable PDFs using `pdfium2`.
305
-
306
- 2. **Text Validation**: Extracted text is validated for corruption by checking for:
307
-
308
- - Control and non-printable characters
309
- - Unicode replacement characters (�)
310
- - Zero-width spaces and other invisible characters
311
- - Empty or whitespace-only content
312
-
313
- 3. **Automatic OCR Fallback**: If the extracted text appears corrupted or if the PDF is image-based, automatically falls back to OCR using Tesseract.
314
-
315
- This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.
316
-
317
- ### PDF Processing Options
318
-
319
- You can control PDF processing behavior using optional parameters:
320
-
321
- ```python
322
- from kreuzberg import extract_file
323
-
324
-
325
- async def process_pdf():
326
- # Default behavior: auto-detect and use OCR if needed
327
- # By default, max_processes=1 for safe operation
328
- result = await extract_file("document.pdf")
329
- print(result.content)
330
-
331
- # Force OCR even for searchable PDFs
332
- result = await extract_file("document.pdf", force_ocr=True)
333
- print(result.content)
334
-
335
- # Control OCR concurrency for large documents
336
- # Warning: High concurrency values can cause system resource exhaustion
337
- # Start with a low value and increase based on your system's capabilities
338
- result = await extract_file(
339
- "large_document.pdf",
340
- max_processes=4 # Process up to 4 pages concurrently
341
- )
342
- print(result.content)
343
-
344
- # Process a scanned PDF (automatically uses OCR)
345
- result = await extract_file("scanned.pdf")
346
- print(result.content)
347
- ```
348
-
349
- ### ExtractionResult Object
350
-
351
- All extraction functions return an `ExtractionResult` or a list thereof (for batch functions). The `ExtractionResult` object is a `NamedTuple`:
352
-
353
- - `content`: The extracted text (str)
354
- - `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
355
- - `metadata`: A metadata dictionary. Currently this dictionary is only populated when extracting documents using pandoc.
356
-
357
- ```python
358
- from kreuzberg import extract_file, ExtractionResult, Metadata
359
-
360
- async def process_document(path: str) -> tuple[str, str, Metadata]:
361
- # Access as a named tuple
362
- result: ExtractionResult = await extract_file(path)
363
- print(f"Content: {result.content}")
364
- print(f"Format: {result.mime_type}")
365
-
366
- # Or unpack as a tuple
367
- content, mime_type, metadata = await extract_file(path)
368
- return content, mime_type, metadata
369
- ```
370
-
371
- ### Error Handling
372
-
373
- Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.
374
-
375
- ```python
376
- from kreuzberg import (
377
- extract_file,
378
- ValidationError,
379
- ParsingError,
380
- OCRError,
381
- MissingDependencyError
382
- )
383
-
384
- async def safe_extract(path: str) -> str:
385
- try:
386
- result = await extract_file(path)
387
- return result.content
388
-
389
- except ValidationError as e:
390
- # Input validation issues
391
- # - Unsupported or undetectable MIME types
392
- # - Missing files
393
- # - Invalid input parameters
394
- print(f"Validation failed: {e}")
395
-
396
- except OCRError as e:
397
- # OCR-specific issues
398
- # - Tesseract processing failures
399
- # - Image conversion problems
400
- print(f"OCR failed: {e}")
401
-
402
- except MissingDependencyError as e:
403
- # System dependency issues
404
- # - Missing Tesseract OCR
405
- # - Missing Pandoc
406
- # - Incompatible versions
407
- print(f"Dependency missing: {e}")
408
-
409
- except ParsingError as e:
410
- # General processing errors
411
- # - PDF parsing failures
412
- # - Format conversion issues
413
- # - Encoding problems
414
- print(f"Processing failed: {e}")
415
-
416
- return ""
417
- ```
418
-
419
- All exceptions include:
420
-
421
- - Error message
422
- - Context in the `context` attribute
423
- - String representation
424
- - Exception chaining
425
-
426
- ## Contribution
427
-
428
- This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
429
- submitting PRs to avoid disappointment.
430
-
431
- ### Local Development
432
-
433
- 1. Clone the repo
434
- 2. Install the system dependencies
435
- 3. Install the full dependencies with `uv sync`
436
- 4. Install the pre-commit hooks with:
437
-
438
- ```shell
439
- pre-commit install && pre-commit install --hook-type commit-msg
440
- ```
441
-
442
- 5. Make your changes and submit a PR
443
-
444
- ## License
445
-
446
- This library uses the MIT license.
@@ -1,21 +0,0 @@
1
- kreuzberg/__init__.py,sha256=WgGo3x09JKCk89htZuodbnYysu0ZYpkAP29dcRl5Sg0,694
2
- kreuzberg/_constants.py,sha256=N61ZF8xuEso8GzRGiVpqIv5yfMkQmLeH_EN9fVARYV0,249
3
- kreuzberg/_html.py,sha256=yM78bPjyKRaXqMp5QW9xOYe0CBd9uUhDZfjnFB1tZOY,925
4
- kreuzberg/_mime_types.py,sha256=Kuu0yWY4p0Eck8b_vdp9oamqRZc1RJaS_ZKikVD2Z2o,6431
5
- kreuzberg/_pandoc.py,sha256=YIXaFC11N2tgVHjBd3JD_21GZ6OOVQ0UY3aKrWNfK-I,12531
6
- kreuzberg/_pdf.py,sha256=AIwxlydZkJOU4878SaeF9cKUmzSN7o3X40Hye7z017U,6479
7
- kreuzberg/_pptx.py,sha256=oX1WYabKQ02Hla2jYnkEBjJXCPvrcRnzLi3MeY86TN0,3028
8
- kreuzberg/_string.py,sha256=pE92BF2E7BXrQ5if3uATM2enwH82ntViBpshxK-797E,1106
9
- kreuzberg/_sync.py,sha256=sDVH4GrpYW9SOnmu3BqKPL76xl0hxzHjTAC78aovbQA,2122
10
- kreuzberg/_tesseract.py,sha256=0BkguZJIKlOFHkrN2mjVgaycWwolmuEv6DwpQY7n7Os,7610
11
- kreuzberg/_tmp.py,sha256=y0PxKJXsRsDCwpFqtJAMl05lMNu3N_E2yaUVL93h7g0,1037
12
- kreuzberg/_types.py,sha256=Qxlk6qfdtvEsCfjsXU57qgZiONfwF7wUgbCJK8QXNZ4,2195
13
- kreuzberg/_xlsx.py,sha256=kSH7PJ33vdLgoh5LmL_bqbc4I0VgZlZUeF4ckKl6NJM,2675
14
- kreuzberg/exceptions.py,sha256=syDCjy8PNqVMGhD-zAuhkurLMg9bk1j1yJtvJN8cN9A,1679
15
- kreuzberg/extraction.py,sha256=7oc2C1_bIxrLx2r4NEyGrL9Jt6YpPxfQKMRJm6QQayo,13076
16
- kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
17
- kreuzberg-2.1.2.dist-info/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
18
- kreuzberg-2.1.2.dist-info/METADATA,sha256=0MEegHP8F5ur-wafeprL9UEN6Utipml1SuCF_xF6daA,14842
19
- kreuzberg-2.1.2.dist-info/WHEEL,sha256=jB7zZ3N9hIM9adW7qlTAyycLYW9npaWKLRzaoVcLKcM,91
20
- kreuzberg-2.1.2.dist-info/top_level.txt,sha256=rbGkygffkZiyKhL8UN41ZOjLfem0jJPA1Whtndne0rE,10
21
- kreuzberg-2.1.2.dist-info/RECORD,,
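Each entry in the removed RECORD file above has the form `path,sha256=<digest>,<size>`. Under the wheel RECORD convention, the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding stripped, and the RECORD file lists itself with empty hash and size fields. The following is a minimal sketch of checking one such entry against a file's bytes. The helper names `record_digest` and `verify_record_entry` are hypothetical and are not part of kreuzberg or any packaging tool.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """Return a RECORD-style digest: URL-safe base64 SHA-256 without '=' padding."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def verify_record_entry(entry: str, data: bytes) -> bool:
    """Check one 'path,sha256=<digest>,<size>' line against the file's bytes."""
    # RECORD is a CSV file, so let the csv module handle any quoted paths.
    _path, hash_field, size_field = next(csv.reader([entry]))
    if not hash_field:
        # Entries such as the RECORD file itself carry no hash; nothing to verify.
        return True
    algorithm, _, expected = hash_field.partition("=")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    return expected == record_digest(data) and int(size_field) == len(data)


# The py.typed marker listed above is an empty file (size 0).
entry = "kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
print(verify_record_entry(entry, b""))
```

The `py.typed` entry is a convenient check because its contents are known from the listed size of 0; verifying any other entry requires the corresponding file from the unpacked wheel.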