kreuzberg 1.4.0__tar.gz → 1.6.0__tar.gz

@@ -0,0 +1,317 @@
1
+ Metadata-Version: 2.2
2
+ Name: kreuzberg
3
+ Version: 1.6.0
4
+ Summary: A text extraction library supporting PDFs, images, office documents and more
5
+ Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
6
+ License: MIT
7
+ Project-URL: homepage, https://github.com/Goldziher/kreuzberg
8
+ Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,tesseract,text-extraction,text-processing
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Programming Language :: Python :: 3 :: Only
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Classifier: Topic :: Text Processing :: General
22
+ Classifier: Topic :: Utilities
23
+ Classifier: Typing :: Typed
24
+ Requires-Python: >=3.9
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Requires-Dist: anyio>=4.8.0
28
+ Requires-Dist: charset-normalizer>=3.4.1
29
+ Requires-Dist: html-to-markdown>=1.2.0
30
+ Requires-Dist: pypdfium2>=4.30.1
31
+ Requires-Dist: python-pptx>=1.0.2
32
+ Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
33
+ Requires-Dist: xlsx2csv>=0.8.4
34
+
35
+ # Kreuzberg
36
+
37
+ Kreuzberg is a modern Python library for text extraction from documents, designed for simplicity and efficiency. It provides a unified async interface for extracting text from a wide range of file formats including PDFs, images, office documents, and more.
38
+
39
+ ## Why Kreuzberg?
40
+
41
+ - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
42
+ - **Local Processing**: No external API calls or cloud dependencies required
43
+ - **Resource Efficient**: Lightweight processing without GPU requirements
44
+ - **Format Support**: Comprehensive support for documents, images, and text formats
45
+ - **Modern Python**: Built with async/await, type hints, and current best practices
46
+
47
+ Kreuzberg was created to solve text extraction needs in RAG (Retrieval Augmented Generation) applications, but it's suitable for any text extraction use case. Unlike many commercial solutions that require API calls or complex setups, Kreuzberg focuses on local processing with minimal dependencies.
48
+
49
+ ## Features
50
+
51
+ - **Universal Text Extraction**: Extract text from PDFs (both searchable and scanned), images, office documents, and more
52
+ - **Smart Processing**: Automatic OCR for scanned documents, encoding detection for text files
53
+ - **Modern Python Design**:
+   - Async-first API using `anyio`
+   - Comprehensive type hints for better IDE support
+   - Detailed error handling with context information
+ - **Production Ready**:
+   - Robust error handling
+   - Detailed debugging information
+   - Memory-efficient processing
61
+
62
+ ## Installation
63
+
64
+ ### 1. Install the Python Package
65
+
66
+ ```shell
67
+ pip install kreuzberg
68
+ ```
69
+
70
+ ### 2. Install System Dependencies
71
+
72
+ Kreuzberg requires two system-level dependencies:
73
+
74
+ - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
75
+ - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
76
+
77
+ Please install these using their respective installation guides.
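
A quick way to confirm that both tools are reachable is to look them up on `PATH` before running an extraction. The sketch below uses only the standard library. `missing_system_tools` is a hypothetical helper name, not part of Kreuzberg, and it assumes the executables are installed under their usual names, `pandoc` and `tesseract`.

```python
import shutil
from typing import Iterable, List

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("pandoc", "tesseract")


def missing_system_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("Pandoc and Tesseract were both found on PATH.")
```

This only checks that the executables exist. It does not verify their versions or that Tesseract language data is installed.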
78
+
79
+ ## Architecture
80
+
81
+ Kreuzberg is designed as a high-level async abstraction over established open-source tools. It integrates:
82
+
83
+ - **PDF Processing**:
+   - `pypdfium2` for searchable PDFs
+   - Tesseract OCR for scanned content
+ - **Document Conversion**:
+   - Pandoc for many document and markup formats
+   - `python-pptx` for PowerPoint files
+   - `html-to-markdown` for HTML content
+   - `xlsx2csv` for Excel spreadsheets
+ - **Text Processing**:
+   - Smart encoding detection
+   - Markdown and plain text handling
94
+
95
+ ### Supported Formats
96
+
97
+ #### Document Formats
98
+
99
+ - PDF (`.pdf`, both searchable and scanned documents)
100
+ - Microsoft Word (`.docx`, `.doc`)
101
+ - PowerPoint presentations (`.pptx`)
102
+ - OpenDocument Text (`.odt`)
103
+ - Rich Text Format (`.rtf`)
104
+ - EPUB (`.epub`)
105
+ - DocBook XML (`.dbk`, `.xml`)
106
+ - FictionBook (`.fb2`)
107
+ - LaTeX (`.tex`, `.latex`)
108
+ - Typst (`.typ`)
109
+
110
+ #### Markup and Text Formats
111
+
112
+ - HTML (`.html`, `.htm`)
113
+ - Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
114
+ - reStructuredText (`.rst`)
115
+ - Org-mode (`.org`)
116
+ - DokuWiki (`.txt`)
117
+ - Pod (`.pod`)
118
+ - Man pages (`.1`, `.2`, etc.)
119
+
120
+ #### Data and Research Formats
121
+
122
+ - Excel spreadsheets (`.xlsx`)
123
+ - CSV (`.csv`) and TSV (`.tsv`) files
124
+ - Jupyter Notebooks (`.ipynb`)
125
+ - BibTeX (`.bib`) and BibLaTeX (`.bib`)
126
+ - CSL-JSON (`.json`)
127
+ - EndNote XML (`.xml`)
128
+ - RIS (`.ris`)
129
+ - JATS XML (`.xml`)
130
+
131
+ #### Image Formats
132
+
133
+ - JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
134
+ - PNG (`.png`)
135
+ - TIFF (`.tiff`, `.tif`)
136
+ - BMP (`.bmp`)
137
+ - GIF (`.gif`)
138
+ - WebP (`.webp`)
139
+ - JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
140
+ - Portable Anymap (`.pnm`)
141
+ - Portable Bitmap (`.pbm`)
142
+ - Portable Graymap (`.pgm`)
143
+ - Portable Pixmap (`.ppm`)
144
+
145
+ ## Usage
146
+
147
+ Kreuzberg provides a simple, async-first API for text extraction. The library exports two main functions:
148
+
149
+ - `extract_file()`: Extract text from a file (accepts string path or `pathlib.Path`)
150
+ - `extract_bytes()`: Extract text from bytes (accepts a byte string)
151
+
152
+ ### Quick Start
153
+
154
+ ```python
+ from pathlib import Path
+ from kreuzberg import extract_file, extract_bytes
+
+ # Basic file extraction
+ async def extract_document():
+     # Extract from a PDF file
+     pdf_result = await extract_file("document.pdf")
+     print(f"PDF text: {pdf_result.content}")
+
+     # Extract from an image
+     img_result = await extract_file("scan.png")
+     print(f"Image text: {img_result.content}")
+
+     # Extract from Word document
+     docx_result = await extract_file(Path("document.docx"))
+     print(f"Word text: {docx_result.content}")
+ ```
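
The functions above are coroutines, so they need an event loop to run. The following is a minimal sketch of calling them from synchronous code with `asyncio.run`, and of processing several files concurrently with `asyncio.gather`. `extract_many` and `fake_extract` are hypothetical names introduced for illustration. The extractor is passed in as a parameter so the sketch runs without Kreuzberg installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
) -> Dict[str, Any]:
    """Run `extract` on every path concurrently and map each path to its result."""
    results = await asyncio.gather(*(extract(path) for path in paths))
    return dict(zip(paths, results))


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for an extractor; it performs no real extraction."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"text from {path}"


if __name__ == "__main__":
    # With Kreuzberg installed, pass `extract_file` instead of `fake_extract`.
    print(asyncio.run(extract_many(["a.pdf", "b.png"], fake_extract)))
```

`asyncio.gather` returns results in the order of its arguments, which is why zipping them back onto `paths` is safe.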
172
+
173
+ ### Processing Uploaded Files
174
+
175
+ ```python
+ from kreuzberg import extract_bytes
+
+ async def process_upload(file_content: bytes, mime_type: str):
+     """Process uploaded file content with known MIME type."""
+     result = await extract_bytes(file_content, mime_type=mime_type)
+     return result.content
+
+ # Example usage with different file types
+ async def handle_uploads(pdf_bytes: bytes, image_bytes: bytes, docx_bytes: bytes):
+     # Process PDF upload
+     pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")
+
+     # Process image upload
+     img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")
+
+     # Process Word document upload
+     docx_result = await extract_bytes(
+         docx_bytes,
+         mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+     )
+ ```
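
`extract_bytes()` is called with an explicit MIME type, which an upload handler may not always have. One option, sketched here with the standard-library `mimetypes` module, is to guess it from the uploaded file name. `guess_upload_mime_type` is a hypothetical helper, not part of Kreuzberg, and whether a guessed type is one Kreuzberg accepts is not guaranteed.

```python
import mimetypes


def guess_upload_mime_type(filename: str) -> str:
    """Guess a MIME type from the file name, raising if none can be determined."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {filename!r}")
    return mime_type


if __name__ == "__main__":
    print(guess_upload_mime_type("report.pdf"))
```

A guess based on the file name is only as reliable as the name itself, so a client-supplied extension should be treated as a hint rather than proof of the content type.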
195
+
196
+ ### Advanced Features
197
+
198
+ #### PDF Processing Options
199
+
200
+ ```python
+ from kreuzberg import extract_file
+
+ async def process_pdf():
+     # Force OCR for PDFs with embedded images or scanned content
+     result = await extract_file("document.pdf", force_ocr=True)
+
+     # Process a scanned PDF (automatically uses OCR)
+     scanned = await extract_file("scanned.pdf")
+ ```
210
+
211
+ #### ExtractionResult Object
212
+
213
+ All extraction functions return an `ExtractionResult` containing:
214
+
215
+ - `content`: The extracted text (str)
216
+ - `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
217
+
218
+ ```python
+ from kreuzberg import ExtractionResult, extract_file
+
+ async def process_document(path: str) -> tuple[str, str]:
+     # Access as a named tuple
+     result: ExtractionResult = await extract_file(path)
+     print(f"Content: {result.content}")
+     print(f"Format: {result.mime_type}")
+
+     # Or unpack as a tuple
+     content, mime_type = await extract_file(path)
+     return content, mime_type
+ ```
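
Because the result carries its output format, calling code can branch on `mime_type`, for example to choose a file suffix when saving the extracted text. The sketch below uses a stand-in named tuple with the two documented fields so it runs without Kreuzberg. `FakeResult` and `output_suffix` are hypothetical names, and the real `ExtractionResult` may differ in details not shown in this README.

```python
from typing import NamedTuple


class FakeResult(NamedTuple):
    """Hypothetical stand-in mirroring the two documented fields."""

    content: str
    mime_type: str


def output_suffix(mime_type: str) -> str:
    """Pick a file suffix for the extracted text based on its output format."""
    return ".md" if mime_type == "text/markdown" else ".txt"


result = FakeResult(content="# Title", mime_type="text/markdown")
content, mime_type = result  # tuple unpacking works as well as attribute access
print(output_suffix(result.mime_type))
```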
231
+
232
+ ### Error Handling
233
+
234
+ Kreuzberg provides detailed error handling with two main exception types:
235
+
236
+ ```python
+ from kreuzberg import extract_file
+ from kreuzberg.exceptions import ValidationError, ParsingError
+
+ async def safe_extract(path: str) -> str:
+     try:
+         result = await extract_file(path)
+         return result.content
+
+     except ValidationError as e:
+         # Handles input validation issues:
+         # - Unsupported file types
+         # - Missing files
+         # - Invalid MIME types
+         print(f"Invalid input: {e.message}")
+         print(f"Details: {e.context}")
+
+     except ParsingError as e:
+         # Handles processing errors:
+         # - PDF parsing failures
+         # - OCR errors
+         # - Format conversion issues
+         print(f"Processing failed: {e.message}")
+         print(f"Details: {e.context}")
+
+     return ""
+
+ # Example error contexts
+ # (wrapped in a coroutine because `await` is only valid inside `async def`)
+ async def show_error_contexts() -> None:
+     try:
+         result = await extract_file("document.xyz")
+     except ValidationError as e:
+         # e.context might contain:
+         # {
+         #     "file_path": "document.xyz",
+         #     "error": "Unsupported file type",
+         #     "supported_types": ["pdf", "docx", ...]
+         # }
+         print(e.context)
+
+     try:
+         result = await extract_file("scan.pdf")
+     except ParsingError as e:
+         # e.context might contain:
+         # {
+         #     "file_path": "scan.pdf",
+         #     "error": "OCR processing failed",
+         #     "details": "Tesseract error: Unable to process image"
+         # }
+         print(e.context)
+ ```
284
+
285
+ ## Roadmap
286
+
287
+ V1:
288
+
289
+ - [x] - html file text extraction
290
+ - [ ] - better PDF table extraction
291
+ - [ ] - batch APIs
292
+ - [ ] - sync APIs
293
+
294
+ V2:
295
+
296
+ - [ ] - metadata extraction (breaking change)
297
+ - [ ] - TBD
298
+
299
+ ## Contribution
300
+
301
+ This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
303
+
304
+ ### Local Development
305
+
306
+ 1. Clone the repo
307
+ 2. Install the system dependencies
308
+ 3. Install the full dependencies with `uv sync`
309
+ 4. Install the pre-commit hooks with:
310
+ ```shell
311
+ pre-commit install && pre-commit install --hook-type commit-msg
312
+ ```
313
+ 5. Make your changes and submit a PR
314
+
315
+ ## License
316
+
317
+ This library uses the MIT license.
@@ -0,0 +1,283 @@
1
+ # Kreuzberg
2
+
3
+ Kreuzberg is a modern Python library for text extraction from documents, designed for simplicity and efficiency. It provides a unified async interface for extracting text from a wide range of file formats including PDFs, images, office documents, and more.
4
+
5
+ ## Why Kreuzberg?
6
+
7
+ - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
8
+ - **Local Processing**: No external API calls or cloud dependencies required
9
+ - **Resource Efficient**: Lightweight processing without GPU requirements
10
+ - **Format Support**: Comprehensive support for documents, images, and text formats
11
+ - **Modern Python**: Built with async/await, type hints, and current best practices
12
+
13
+ Kreuzberg was created to solve text extraction needs in RAG (Retrieval Augmented Generation) applications, but it's suitable for any text extraction use case. Unlike many commercial solutions that require API calls or complex setups, Kreuzberg focuses on local processing with minimal dependencies.
14
+
15
+ ## Features
16
+
17
+ - **Universal Text Extraction**: Extract text from PDFs (both searchable and scanned), images, office documents, and more
18
+ - **Smart Processing**: Automatic OCR for scanned documents, encoding detection for text files
19
+ - **Modern Python Design**:
+   - Async-first API using `anyio`
+   - Comprehensive type hints for better IDE support
+   - Detailed error handling with context information
+ - **Production Ready**:
+   - Robust error handling
+   - Detailed debugging information
+   - Memory-efficient processing
27
+
28
+ ## Installation
29
+
30
+ ### 1. Install the Python Package
31
+
32
+ ```shell
33
+ pip install kreuzberg
34
+ ```
35
+
36
+ ### 2. Install System Dependencies
37
+
38
+ Kreuzberg requires two system-level dependencies:
39
+
40
+ - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
41
+ - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
42
+
43
+ Please install these using their respective installation guides.
44
+
45
+ ## Architecture
46
+
47
+ Kreuzberg is designed as a high-level async abstraction over established open-source tools. It integrates:
48
+
49
+ - **PDF Processing**:
+   - `pypdfium2` for searchable PDFs
+   - Tesseract OCR for scanned content
+ - **Document Conversion**:
+   - Pandoc for many document and markup formats
+   - `python-pptx` for PowerPoint files
+   - `html-to-markdown` for HTML content
+   - `xlsx2csv` for Excel spreadsheets
+ - **Text Processing**:
+   - Smart encoding detection
+   - Markdown and plain text handling
60
+
61
+ ### Supported Formats
62
+
63
+ #### Document Formats
64
+
65
+ - PDF (`.pdf`, both searchable and scanned documents)
66
+ - Microsoft Word (`.docx`, `.doc`)
67
+ - PowerPoint presentations (`.pptx`)
68
+ - OpenDocument Text (`.odt`)
69
+ - Rich Text Format (`.rtf`)
70
+ - EPUB (`.epub`)
71
+ - DocBook XML (`.dbk`, `.xml`)
72
+ - FictionBook (`.fb2`)
73
+ - LaTeX (`.tex`, `.latex`)
74
+ - Typst (`.typ`)
75
+
76
+ #### Markup and Text Formats
77
+
78
+ - HTML (`.html`, `.htm`)
79
+ - Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
80
+ - reStructuredText (`.rst`)
81
+ - Org-mode (`.org`)
82
+ - DokuWiki (`.txt`)
83
+ - Pod (`.pod`)
84
+ - Man pages (`.1`, `.2`, etc.)
85
+
86
+ #### Data and Research Formats
87
+
88
+ - Excel spreadsheets (`.xlsx`)
89
+ - CSV (`.csv`) and TSV (`.tsv`) files
90
+ - Jupyter Notebooks (`.ipynb`)
91
+ - BibTeX (`.bib`) and BibLaTeX (`.bib`)
92
+ - CSL-JSON (`.json`)
93
+ - EndNote XML (`.xml`)
94
+ - RIS (`.ris`)
95
+ - JATS XML (`.xml`)
96
+
97
+ #### Image Formats
98
+
99
+ - JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
100
+ - PNG (`.png`)
101
+ - TIFF (`.tiff`, `.tif`)
102
+ - BMP (`.bmp`)
103
+ - GIF (`.gif`)
104
+ - WebP (`.webp`)
105
+ - JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
106
+ - Portable Anymap (`.pnm`)
107
+ - Portable Bitmap (`.pbm`)
108
+ - Portable Graymap (`.pgm`)
109
+ - Portable Pixmap (`.ppm`)
110
+
111
+ ## Usage
112
+
113
+ Kreuzberg provides a simple, async-first API for text extraction. The library exports two main functions:
114
+
115
+ - `extract_file()`: Extract text from a file (accepts string path or `pathlib.Path`)
116
+ - `extract_bytes()`: Extract text from bytes (accepts a byte string)
117
+
118
+ ### Quick Start
119
+
120
+ ```python
+ from pathlib import Path
+ from kreuzberg import extract_file, extract_bytes
+
+ # Basic file extraction
+ async def extract_document():
+     # Extract from a PDF file
+     pdf_result = await extract_file("document.pdf")
+     print(f"PDF text: {pdf_result.content}")
+
+     # Extract from an image
+     img_result = await extract_file("scan.png")
+     print(f"Image text: {img_result.content}")
+
+     # Extract from Word document
+     docx_result = await extract_file(Path("document.docx"))
+     print(f"Word text: {docx_result.content}")
+ ```
138
+
139
+ ### Processing Uploaded Files
140
+
141
+ ```python
+ from kreuzberg import extract_bytes
+
+ async def process_upload(file_content: bytes, mime_type: str):
+     """Process uploaded file content with known MIME type."""
+     result = await extract_bytes(file_content, mime_type=mime_type)
+     return result.content
+
+ # Example usage with different file types
+ async def handle_uploads(pdf_bytes: bytes, image_bytes: bytes, docx_bytes: bytes):
+     # Process PDF upload
+     pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")
+
+     # Process image upload
+     img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")
+
+     # Process Word document upload
+     docx_result = await extract_bytes(
+         docx_bytes,
+         mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+     )
+ ```
161
+
162
+ ### Advanced Features
163
+
164
+ #### PDF Processing Options
165
+
166
+ ```python
+ from kreuzberg import extract_file
+
+ async def process_pdf():
+     # Force OCR for PDFs with embedded images or scanned content
+     result = await extract_file("document.pdf", force_ocr=True)
+
+     # Process a scanned PDF (automatically uses OCR)
+     scanned = await extract_file("scanned.pdf")
+ ```
176
+
177
+ #### ExtractionResult Object
178
+
179
+ All extraction functions return an `ExtractionResult` containing:
180
+
181
+ - `content`: The extracted text (str)
182
+ - `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
183
+
184
+ ```python
+ from kreuzberg import ExtractionResult, extract_file
+
+ async def process_document(path: str) -> tuple[str, str]:
+     # Access as a named tuple
+     result: ExtractionResult = await extract_file(path)
+     print(f"Content: {result.content}")
+     print(f"Format: {result.mime_type}")
+
+     # Or unpack as a tuple
+     content, mime_type = await extract_file(path)
+     return content, mime_type
+ ```
197
+
198
+ ### Error Handling
199
+
200
+ Kreuzberg provides detailed error handling with two main exception types:
201
+
202
+ ```python
+ from kreuzberg import extract_file
+ from kreuzberg.exceptions import ValidationError, ParsingError
+
+ async def safe_extract(path: str) -> str:
+     try:
+         result = await extract_file(path)
+         return result.content
+
+     except ValidationError as e:
+         # Handles input validation issues:
+         # - Unsupported file types
+         # - Missing files
+         # - Invalid MIME types
+         print(f"Invalid input: {e.message}")
+         print(f"Details: {e.context}")
+
+     except ParsingError as e:
+         # Handles processing errors:
+         # - PDF parsing failures
+         # - OCR errors
+         # - Format conversion issues
+         print(f"Processing failed: {e.message}")
+         print(f"Details: {e.context}")
+
+     return ""
+
+ # Example error contexts
+ # (wrapped in a coroutine because `await` is only valid inside `async def`)
+ async def show_error_contexts() -> None:
+     try:
+         result = await extract_file("document.xyz")
+     except ValidationError as e:
+         # e.context might contain:
+         # {
+         #     "file_path": "document.xyz",
+         #     "error": "Unsupported file type",
+         #     "supported_types": ["pdf", "docx", ...]
+         # }
+         print(e.context)
+
+     try:
+         result = await extract_file("scan.pdf")
+     except ParsingError as e:
+         # e.context might contain:
+         # {
+         #     "file_path": "scan.pdf",
+         #     "error": "OCR processing failed",
+         #     "details": "Tesseract error: Unable to process image"
+         # }
+         print(e.context)
+ ```
250
+
251
+ ## Roadmap
252
+
253
+ V1:
254
+
255
+ - [x] - html file text extraction
256
+ - [ ] - better PDF table extraction
257
+ - [ ] - batch APIs
258
+ - [ ] - sync APIs
259
+
260
+ V2:
261
+
262
+ - [ ] - metadata extraction (breaking change)
263
+ - [ ] - TBD
264
+
265
+ ## Contribution
266
+
267
+ This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
269
+
270
+ ### Local Development
271
+
272
+ 1. Clone the repo
273
+ 2. Install the system dependencies
274
+ 3. Install the full dependencies with `uv sync`
275
+ 4. Install the pre-commit hooks with:
276
+ ```shell
277
+ pre-commit install && pre-commit install --hook-type commit-msg
278
+ ```
279
+ 5. Make your changes and submit a PR
280
+
281
+ ## License
282
+
283
+ This library uses the MIT license.