kreuzberg 1.3.0__py3-none-any.whl → 1.5.0__py3-none-any.whl
- kreuzberg/_extractors.py +46 -81
- kreuzberg/_mime_types.py +22 -31
- kreuzberg/_pandoc.py +416 -0
- kreuzberg/_string.py +9 -12
- kreuzberg/_tesseract.py +318 -0
- kreuzberg/exceptions.py +9 -1
- kreuzberg/extraction.py +16 -16
- kreuzberg-1.5.0.dist-info/METADATA +318 -0
- kreuzberg-1.5.0.dist-info/RECORD +15 -0
- kreuzberg-1.3.0.dist-info/METADATA +0 -306
- kreuzberg-1.3.0.dist-info/RECORD +0 -13
- {kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/LICENSE +0 -0
- {kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/WHEEL +0 -0
- {kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,318 @@
+Metadata-Version: 2.2
+Name: kreuzberg
+Version: 1.5.0
+Summary: A text extraction library supporting PDFs, images, office documents and more
+Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
+License: MIT
+Project-URL: homepage, https://github.com/Goldziher/kreuzberg
+Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,tesseract,text-extraction,text-processing
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Text Processing :: General
+Classifier: Topic :: Utilities
+Classifier: Typing :: Typed
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: anyio>=4.8.0
+Requires-Dist: charset-normalizer>=3.4.1
+Requires-Dist: html-to-markdown>=1.2.0
+Requires-Dist: pypdfium2>=4.30.1
+Requires-Dist: python-pptx>=1.0.2
+Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
+
+# Kreuzberg
+
+Kreuzberg is a modern Python library for text extraction from documents, designed for simplicity and efficiency. It provides a unified async interface for extracting text from a wide range of file formats including PDFs, images, office documents, and more.
+
+## Why Kreuzberg?
+
+- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
+- **Local Processing**: No external API calls or cloud dependencies required
+- **Resource Efficient**: Lightweight processing without GPU requirements
+- **Format Support**: Comprehensive support for documents, images, and text formats
+- **Modern Python**: Built with async/await, type hints, and current best practices
+
+Kreuzberg was created to solve text extraction needs in RAG (Retrieval Augmented Generation) applications, but it's suitable for any text extraction use case. Unlike many commercial solutions that require API calls or complex setups, Kreuzberg focuses on local processing with minimal dependencies.
+
+## Features
+
+- **Universal Text Extraction**: Extract text from PDFs (both searchable and scanned), images, office documents, and more
+- **Smart Processing**: Automatic OCR for scanned documents, encoding detection for text files
+- **Modern Python Design**:
+  - Async-first API using `anyio`
+  - Comprehensive type hints for better IDE support
+  - Detailed error handling with context information
+- **Production Ready**:
+  - Robust error handling
+  - Detailed debugging information
+  - Memory efficient processing
+
+## Installation
+
+### 1. Install the Python Package
+
+```shell
+pip install kreuzberg
+```
+
+### 2. Install System Dependencies
+
+Kreuzberg requires two open-source tools:
+
+- [Pandoc](https://pandoc.org/installing.html) - For document format conversion
+
+  - GPL v2.0 licensed (used via CLI only)
+  - Handles office documents and markup formats
+
+- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
+  - Apache License
+  - Required for scanned documents and images
+
+## Architecture
+
+Kreuzberg is designed as a high-level async abstraction over established open-source tools. It integrates:
+
+- **PDF Processing**:
+  - `pdfium2` for searchable PDFs
+  - Tesseract OCR for scanned content
+- **Document Conversion**:
+  - Pandoc for office documents and markup
+  - `python-pptx` for PowerPoint files
+  - `html-to-markdown` for HTML content
+- **Text Processing**:
+  - Smart encoding detection
+  - Markdown and plain text handling
+
+### Supported Formats
+
+#### Document Formats
+
+- PDF (`.pdf`, both searchable and scanned documents)
+- Microsoft Word (`.docx`, `.doc`)
+- PowerPoint presentations (`.pptx`)
+- OpenDocument Text (`.odt`)
+- Rich Text Format (`.rtf`)
+- EPUB (`.epub`)
+- DocBook XML (`.dbk`, `.xml`)
+- FictionBook (`.fb2`)
+- LaTeX (`.tex`, `.latex`)
+- Typst (`.typ`)
+
+#### Markup and Text Formats
+
+- HTML (`.html`, `.htm`)
+- Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
+- reStructuredText (`.rst`)
+- Org-mode (`.org`)
+- DokuWiki (`.txt`)
+- Pod (`.pod`)
+- Man pages (`.1`, `.2`, etc.)
+
+#### Data and Research Formats
+
+- CSV (`.csv`) and TSV (`.tsv`) files
+- Jupyter Notebooks (`.ipynb`)
+- BibTeX (`.bib`) and BibLaTeX (`.bib`)
+- CSL-JSON (`.json`)
+- EndNote XML (`.xml`)
+- RIS (`.ris`)
+- JATS XML (`.xml`)
+
+#### Image Formats
+
+- JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
+- PNG (`.png`)
+- TIFF (`.tiff`, `.tif`)
+- BMP (`.bmp`)
+- GIF (`.gif`)
+- WebP (`.webp`)
+- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
+- Portable Anymap (`.pnm`)
+- Portable Bitmap (`.pbm`)
+- Portable Graymap (`.pgm`)
+- Portable Pixmap (`.ppm`)
+
+## Usage
+
+Kreuzberg provides a simple, async-first API for text extraction. The library exports two main functions:
+
+- `extract_file()`: Extract text from a file (accepts string path or `pathlib.Path`)
+- `extract_bytes()`: Extract text from bytes (accepts a byte string)
+
+### Quick Start
+
+```python
+from pathlib import Path
+from kreuzberg import extract_file, extract_bytes
+
+# Basic file extraction
+async def extract_document():
+    # Extract from a PDF file
+    pdf_result = await extract_file("document.pdf")
+    print(f"PDF text: {pdf_result.content}")
+
+    # Extract from an image
+    img_result = await extract_file("scan.png")
+    print(f"Image text: {img_result.content}")
+
+    # Extract from Word document
+    docx_result = await extract_file(Path("document.docx"))
+    print(f"Word text: {docx_result.content}")
+```
+
+### Processing Uploaded Files
+
+```python
+from kreuzberg import extract_bytes
+
+async def process_upload(file_content: bytes, mime_type: str):
+    """Process uploaded file content with known MIME type."""
+    result = await extract_bytes(file_content, mime_type=mime_type)
+    return result.content
+
+# Example usage with different file types
+async def handle_uploads():
+    # Process PDF upload
+    pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")
+
+    # Process image upload
+    img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")
+
+    # Process Word document upload
+    docx_result = await extract_bytes(docx_bytes,
+        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document")
+```
+
+### Advanced Features
+
+#### PDF Processing Options
+
+```python
+from kreuzberg import extract_file
+
+async def process_pdf():
+    # Force OCR for PDFs with embedded images or scanned content
+    result = await extract_file("document.pdf", force_ocr=True)
+
+    # Process a scanned PDF (automatically uses OCR)
+    scanned = await extract_file("scanned.pdf")
+```
+
+#### ExtractionResult Object
+
+All extraction functions return an `ExtractionResult` containing:
+
+- `content`: The extracted text (str)
+- `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
+
+```python
+from kreuzberg import ExtractionResult
+
+async def process_document(path: str) -> tuple[str, str]:
+    # Access as a named tuple
+    result: ExtractionResult = await extract_file(path)
+    print(f"Content: {result.content}")
+    print(f"Format: {result.mime_type}")
+
+    # Or unpack as a tuple
+    content, mime_type = await extract_file(path)
+    return content, mime_type
+```
+
+### Error Handling
+
+Kreuzberg provides detailed error handling with two main exception types:
+
+```python
+from kreuzberg import extract_file
+from kreuzberg.exceptions import ValidationError, ParsingError
+
+async def safe_extract(path: str) -> str:
+    try:
+        result = await extract_file(path)
+        return result.content
+
+    except ValidationError as e:
+        # Handles input validation issues:
+        # - Unsupported file types
+        # - Missing files
+        # - Invalid MIME types
+        print(f"Invalid input: {e.message}")
+        print(f"Details: {e.context}")
+
+    except ParsingError as e:
+        # Handles processing errors:
+        # - PDF parsing failures
+        # - OCR errors
+        # - Format conversion issues
+        print(f"Processing failed: {e.message}")
+        print(f"Details: {e.context}")
+
+    return ""
+
+# Example error contexts
+try:
+    result = await extract_file("document.xyz")
+except ValidationError as e:
+    # e.context might contain:
+    # {
+    # "file_path": "document.xyz",
+    # "error": "Unsupported file type",
+    # "supported_types": ["pdf", "docx", ...]
+    # }
+
+try:
+    result = await extract_file("scan.pdf")
+except ParsingError as e:
+    # e.context might contain:
+    # {
+    # "file_path": "scan.pdf",
+    # "error": "OCR processing failed",
+    # "details": "Tesseract error: Unable to process image"
+    # }
+```
+
+## Roadmap
+
+V1:
+
+- [x] - html file text extraction
+- [ ] - better PDF table extraction
+- [ ] - batch APIs
+- [ ] - sync APIs
+
+V2:
+
+- [ ] - metadata extraction (breaking change)
+- [ ] - TBD
+
+## Contribution
+
+This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
+submitting PRs to avoid disappointment.
+
+### Local Development
+
+1. Clone the repo
+2. Install the system dependencies
+3. Install the full dependencies with `uv sync`
+4. Install the pre-commit hooks with:
+```shell
+pre-commit install && pre-commit install --hook-type commit-msg
+```
+5. Make your changes and submit a PR
+
+## License
+
+This library uses the MIT license.
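The README examples in the hunk above define coroutines but never show one being driven to completion, and the roadmap lists "batch APIs" as still open. The following is a minimal sketch of how a caller might fan out over several files in the meantime. `FakeResult`, `fake_extract_file`, `extract_many` and the `limit` parameter are hypothetical names introduced here for illustration; the stub only mimics the `(content, mime_type)` shape the README describes, so the block runs without kreuzberg installed.

```python
import asyncio
from typing import Awaitable, Callable, NamedTuple


class FakeResult(NamedTuple):
    """Hypothetical stand-in shaped like ExtractionResult (content, mime_type)."""

    content: str
    mime_type: str


async def fake_extract_file(path: str) -> FakeResult:
    """Hypothetical stub; pass kreuzberg.extract_file instead when it is installed."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FakeResult(content=f"text of {path}", mime_type="text/plain")


async def extract_many(
    paths: list[str],
    extract: Callable[[str], Awaitable[FakeResult]] = fake_extract_file,
    limit: int = 4,
) -> dict[str, str]:
    """Extract several files concurrently, running at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(path: str) -> tuple[str, str]:
        async with semaphore:
            # Tuple unpacking works because the result is a named tuple.
            content, _mime_type = await extract(path)
            return path, content

    return dict(await asyncio.gather(*(one(p) for p in paths)))


if __name__ == "__main__":
    print(asyncio.run(extract_many(["document.pdf", "scan.png"])))
```

The semaphore is an assumption about what a caller would want: OCR is CPU-heavy, so capping concurrency keeps a large batch from starting every extraction at once.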
@@ -0,0 +1,15 @@
+kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
+kreuzberg/_extractors.py,sha256=k6xO_2ItaftPmlqzfXyxTn8rdaWdwrJHGziBbo7gCio,6599
+kreuzberg/_mime_types.py,sha256=0ZYtRrMAaKpCMDkhpTbWAXHCsVob5MFRMGlbni8iYSA,2573
+kreuzberg/_pandoc.py,sha256=DC6y_NN_CG9dF6fhAj3WumXqKIJLjYmnql2H53_KHnE,13766
+kreuzberg/_string.py,sha256=4txRDnkdR12oO6G8V-jXEMlA9ivgmw8E8EbjyhfL-W4,1106
+kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
+kreuzberg/_tesseract.py,sha256=nnhkjRIS0BSoovjMIqOlBEXlzngE0QJeFDe7BIqUik8,7872
+kreuzberg/exceptions.py,sha256=pxoEPS0T9e5QSgxsfXn1VmxsY_EGXvTwY0gETPiNn8E,945
+kreuzberg/extraction.py,sha256=gux3fkPIs8IbIKtRGuPFWJBLB5jO6Y9JsBfhHRcpQ0k,6160
+kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+kreuzberg-1.5.0.dist-info/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
+kreuzberg-1.5.0.dist-info/METADATA,sha256=O462ss7M6Cb8cO6fJXwqsOdzkzaZekqa1oGwb7Vrgx8,9641
+kreuzberg-1.5.0.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
+kreuzberg-1.5.0.dist-info/top_level.txt,sha256=rbGkygffkZiyKhL8UN41ZOjLfem0jJPA1Whtndne0rE,10
+kreuzberg-1.5.0.dist-info/RECORD,,
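Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. As a rough sketch of how one entry could be checked against file bytes, the helpers below (`record_digest` and `matches_record`, both names chosen here for illustration) recompute the digest using only the standard library. The `py.typed` entry is convenient to test with because its recorded size is 0.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the SHA-256 digest as RECORD writes it: urlsafe base64, no '=' padding."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def matches_record(line: str, data: bytes) -> bool:
    """Check file bytes against one 'path,sha256=<digest>,<size>' RECORD line."""
    _path, hash_field, size = line.rsplit(",", 2)
    algorithm, _, digest = hash_field.partition("=")
    # Entries without a hash (such as RECORD itself) fail the first test
    # and short-circuit before the size is parsed.
    return algorithm == "sha256" and digest == record_digest(data) and int(size) == len(data)


# kreuzberg/py.typed is recorded with size 0, so it should match empty input.
line = "kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
print(matches_record(line, b""))
```

The same check applied to the bytes of any other file in the wheel would confirm that the archive contents agree with the RECORD listing.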
@@ -1,306 +0,0 @@
-Metadata-Version: 2.2
-Name: kreuzberg
-Version: 1.3.0
-Summary: A text extraction library supporting PDFs, images, office documents and more
-Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
-License: MIT
-Project-URL: homepage, https://github.com/Goldziher/kreuzberg
-Keywords: document-processing,docx,image-to-text,latex,markdown,ocr,odt,office-documents,pandoc,pdf,pdf-extraction,rag,text-extraction,text-processing
-Classifier: Development Status :: 4 - Beta
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.9
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
-Classifier: Topic :: Software Development :: Libraries :: Python Modules
-Classifier: Topic :: Text Processing :: General
-Classifier: Topic :: Utilities
-Classifier: Typing :: Typed
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: anyio>=4.8.0
-Requires-Dist: charset-normalizer>=3.4.1
-Requires-Dist: html-to-markdown>=1.2.0
-Requires-Dist: pypandoc>=1.15
-Requires-Dist: pypdfium2>=4.30.1
-Requires-Dist: pytesseract>=0.3.13
-Requires-Dist: python-pptx>=1.0.2
-Requires-Dist: typing-extensions>=4.12.2
-
-# Kreuzberg
-
-Kreuzberg is a library for simplified text extraction from PDF files. It's meant to offer simple, hassle free text
-extraction.
-
-Why?
-
-I am building, like many do now, a RAG focused service (checkout https://grantflow.ai). I have text extraction needs.
-There are quite a lot of commercial options out there, and several open-source + paid options.
-But I wanted something simple, which does not require expansive round-trips to an external API.
-Furthermore, I wanted something that is easy to run locally and isn't very heavy / requires a GPU.
-
-Hence, this library.
-
-## Features
-
-- Extract text from PDFs, images, office documents and more (see supported formats below)
-- Use modern Python with async (via `anyio`) and proper type hints
-- Extensive error handling for easy debugging
-
-## Installation
-
-1. Begin by installing the python package:
-
-```shell
-
-pip install kreuzberg
-
-```
-
-2. Install the system dependencies:
-
-- [pandoc](https://pandoc.org/installing.html) (non-pdf text extraction, GPL v2.0 licensed but used via CLI only)
-- [tesseract-ocr](https://tesseract-ocr.github.io/) (for image/PDF OCR, Apache License)
-
-## Dependencies and Philosophy
-
-This library is built to be minimalist and simple. It also aims to utilize OSS tools for the job. Its fundamentally a
-high order async abstraction on top of other tools, think of it like the library you would bake in your code base, but
-polished and well maintained.
-
-### Dependencies
-
-- PDFs are processed using pdfium2 for searchable PDFs + Tesseract OCR for scanned documents
-- Images are processed using Tesseract OCR
-- Office documents and other formats are processed using Pandoc
-- PPTX files are converted using python-pptx
-- HTML files are converted using html-to-markdown
-- Plain text files are read directly with appropriate encoding detection
-
-### Roadmap
-
-V1:
-
-- [x] - html file text extraction
-- [ ] - better PDF table extraction
-- [ ] - TBD
-
-V2:
-
-- [ ] - extra install groups (to make dependencies optional)
-- [ ] - metadata extraction (possible breaking change)
-- [ ] - TBD
-
-### Feature Requests
-
-Feel free to open a discussion in GitHub or an issue if you have any feature requests
-
-### Contribution
-
-Is welcome! Read guidelines below.
-
-## Supported File Types
-
-Kreuzberg supports a wide range of file formats:
-
-### Document Formats
-
-- PDF (`.pdf`) - both searchable and scanned documents
-- Word Documents (`.docx`, `.doc`)
-- Power Point Presentations (`.pptx`)
-- OpenDocument Text (`.odt`)
-- Rich Text Format (`.rtf`)
-
-### Image Formats
-
-- JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)
-- PNG (`.png`)
-- TIFF (`.tiff`, `.tif`)
-- BMP (`.bmp`)
-- GIF (`.gif`)
-- WebP (`.webp`)
-- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
-- Portable Anymap (`.pnm`)
-- Portable Bitmap (`.pbm`)
-- Portable Graymap (`.pgm`)
-- Portable Pixmap (`.ppm`)
-
-#### Text and Markup Formats
-
-- HTML (`.html`, `.htm`)
-- Plain Text (`.txt`)
-- Markdown (`.md`)
-- reStructuredText (`.rst`)
-- LaTeX (`.tex`)
-
-#### Data Formats
-
-- Comma-Separated Values (`.csv`)
-- Tab-Separated Values (`.tsv`)
-
-## Usage
-
-Kreuzberg exports two async functions:
-
-- Extract text from a file (string path or `pathlib.Path`) using `extract_file()`
-- Extract text from a byte-string using `extract_bytes()`
-
-### Extract from File
-
-```python
-from pathlib import Path
-from kreuzberg import extract_file
-
-
-# Extract text from a PDF file
-async def extract_pdf():
-    result = await extract_file("document.pdf")
-    print(f"Extracted text: {result.content}")
-    print(f"Output mime type: {result.mime_type}")
-
-
-# Extract text from an image
-async def extract_image():
-    result = await extract_file("scan.png")
-    print(f"Extracted text: {result.content}")
-
-
-# or use Path
-
-async def extract_pdf():
-    result = await extract_file(Path("document.pdf"))
-    print(f"Extracted text: {result.content}")
-    print(f"Output mime type: {result.mime_type}")
-```
-
-### Extract from Bytes
-
-```python
-from kreuzberg import extract_bytes
-
-
-# Extract text from PDF bytes
-async def process_uploaded_pdf(pdf_content: bytes):
-    result = await extract_bytes(pdf_content, mime_type="application/pdf")
-    return result.content
-
-
-# Extract text from image bytes
-async def process_uploaded_image(image_content: bytes):
-    result = await extract_bytes(image_content, mime_type="image/jpeg")
-    return result.content
-```
-
-### Forcing OCR
-
-When extracting a PDF file or bytes, you might want to force OCR - for example, if the PDF includes images that have text that should be extracted etc.
-You can do this by passing `force_ocr=True`:
-
-```python
-from kreuzberg import extract_bytes
-
-
-# Extract text from PDF bytes and force OCR
-async def process_uploaded_pdf(pdf_content: bytes):
-    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
-    return result.content
-```
-
-### Error Handling
-
-Kreuzberg raises two exception types:
-
-#### ValidationError
-
-Raised when there are issues with input validation:
-
-- Unsupported mime types
-- Undetectable mime types
-- Path doesn't point at an exist file
-
-#### ParsingError
-
-Raised when there are issues during the text extraction process:
-
-- PDF parsing failures
-- OCR errors
-- Pandoc conversion errors
-
-```python
-from kreuzberg import extract_file
-from kreuzberg.exceptions import ValidationError, ParsingError
-
-
-async def safe_extract():
-    try:
-        result = await extract_file("document.doc")
-        return result.content
-    except ValidationError as e:
-        print(f"Validation error: {e.message}")
-        print(f"Context: {e.context}")
-    except ParsingError as e:
-        print(f"Parsing error: {e.message}")
-        print(f"Context: {e.context}") # Contains detailed error information
-```
-
-Both error types include helpful context information for debugging:
-
-```python
-try:
-    result = await extract_file("scanned.pdf")
-except ParsingError as e:
-    # e.context might contain:
-    # {
-    # "file_path": "scanned.pdf",
-    # "error": "Tesseract OCR failed: Unable to process image"
-    # }
-```
-
-### ExtractionResult
-
-All extraction functions return an ExtractionResult named tuple containing:
-
-- `content`: The extracted text as a string
-- `mime_type`: The mime type of the output (either "text/plain" or, if pandoc is used- "text/markdown")
-
-```python
-from kreuzberg import ExtractionResult
-
-
-async def process_document(path: str) -> str:
-    result: ExtractionResult = await extract_file(path)
-    return result.content
-
-
-# or access the result as tuple
-
-async def process_document(path: str) -> str:
-    content, mime_type = await extract_file(path)
-    # do something with mime_type
-    return content
-```
-
-## Contribution
-
-This library is open to contribution. Feel free to open issues or submit PRs. Its better to discuss issues before
-submitting PRs to avoid disappointment.
-
-### Local Development
-
-1. Clone the repo
-2. Install the system dependencies
-3. Install the full dependencies with `uv sync`
-4. Install the pre-commit hooks with:
-```shell
-pre-commit install && pre-commit install --hook-type commit-msg
-```
-5. Make your changes and submit a PR
-
-## License
-
-This library uses the MIT license.
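The dependency change between the two METADATA files is easy to miss among the README edits. METADATA headers use an email-style `Key: value` layout, so a rough way to compare the `Requires-Dist` entries is to parse both header blocks with the standard library's `email.parser`. The sketch below copies only the relevant header lines from the two hunks; `requires_dist` is a helper name chosen here, and the regular expression is a simplification that assumes each value starts with the project name.

```python
import re
from email.parser import HeaderParser

OLD_METADATA = """\
Name: kreuzberg
Version: 1.3.0
Requires-Dist: anyio>=4.8.0
Requires-Dist: charset-normalizer>=3.4.1
Requires-Dist: html-to-markdown>=1.2.0
Requires-Dist: pypandoc>=1.15
Requires-Dist: pypdfium2>=4.30.1
Requires-Dist: pytesseract>=0.3.13
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.12.2
"""

NEW_METADATA = """\
Name: kreuzberg
Version: 1.5.0
Requires-Dist: anyio>=4.8.0
Requires-Dist: charset-normalizer>=3.4.1
Requires-Dist: html-to-markdown>=1.2.0
Requires-Dist: pypdfium2>=4.30.1
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
"""


def requires_dist(metadata_text: str) -> dict[str, str]:
    """Map each requirement's project name to its full Requires-Dist value."""
    headers = HeaderParser().parsestr(metadata_text)
    requirements = {}
    for value in headers.get_all("Requires-Dist", []):
        value = value.strip()
        # Simplified name match: leading run of name characters, lowercased.
        name = re.match(r"[A-Za-z0-9._-]+", value).group(0).lower()
        requirements[name] = value
    return requirements


old, new = requires_dist(OLD_METADATA), requires_dist(NEW_METADATA)
removed = sorted(old.keys() - new.keys())
added = sorted(new.keys() - old.keys())
changed = sorted(name for name in old.keys() & new.keys() if old[name] != new[name])

print("removed:", removed)
print("added:", added)
print("changed:", changed)
```

On this input the comparison reports `pypandoc` and `pytesseract` as removed and `typing-extensions` as changed (it gains a `python_version < "3.10"` marker), which lines up with the new `kreuzberg/_pandoc.py` and `kreuzberg/_tesseract.py` modules in the file list.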
kreuzberg-1.3.0.dist-info/RECORD
DELETED
@@ -1,13 +0,0 @@
-kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
-kreuzberg/_extractors.py,sha256=eiWPpjnZOZFDwlQL4XsgavJEWqxGtzLVvS8YU28RBAo,8095
-kreuzberg/_mime_types.py,sha256=hR6LFXWn8dtCDB05PkADYk2l__HpmETNyf4YFixhecE,2918
-kreuzberg/_string.py,sha256=O023sxdYoC4DhFCU1z430UBdbxqwXKmyymUDDx3J_i8,1156
-kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
-kreuzberg/exceptions.py,sha256=jrXyvcuSU-694OEtXPZfHYcUbpoRZzNKw9Lo3wIZwL0,770
-kreuzberg/extraction.py,sha256=cgX8uoCVXf-Va30g8T8DwrZUqsSPHIzmPfDgnWOqNNU,6148
-kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-kreuzberg-1.3.0.dist-info/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
-kreuzberg-1.3.0.dist-info/METADATA,sha256=3wiaAuaiA865lg5oCjwlAKaZqRQn1w8VqaQXeoEdip4,8579
-kreuzberg-1.3.0.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
-kreuzberg-1.3.0.dist-info/top_level.txt,sha256=rbGkygffkZiyKhL8UN41ZOjLfem0jJPA1Whtndne0rE,10
-kreuzberg-1.3.0.dist-info/RECORD,,
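Because every RECORD line carries a content hash, the two RECORD files are enough to tell which package modules were added, removed, or modified without opening them. The sketch below assumes the simple `path,hash,size` layout seen in both hunks; `parse_record` and `compare_records` are illustrative helper names, and the `dist-info` entries are left out of the sample data because their directory name changes with every version.

```python
OLD_RECORD = """\
kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
kreuzberg/_extractors.py,sha256=eiWPpjnZOZFDwlQL4XsgavJEWqxGtzLVvS8YU28RBAo,8095
kreuzberg/_mime_types.py,sha256=hR6LFXWn8dtCDB05PkADYk2l__HpmETNyf4YFixhecE,2918
kreuzberg/_string.py,sha256=O023sxdYoC4DhFCU1z430UBdbxqwXKmyymUDDx3J_i8,1156
kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
kreuzberg/exceptions.py,sha256=jrXyvcuSU-694OEtXPZfHYcUbpoRZzNKw9Lo3wIZwL0,770
kreuzberg/extraction.py,sha256=cgX8uoCVXf-Va30g8T8DwrZUqsSPHIzmPfDgnWOqNNU,6148
kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
"""

NEW_RECORD = """\
kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
kreuzberg/_extractors.py,sha256=k6xO_2ItaftPmlqzfXyxTn8rdaWdwrJHGziBbo7gCio,6599
kreuzberg/_mime_types.py,sha256=0ZYtRrMAaKpCMDkhpTbWAXHCsVob5MFRMGlbni8iYSA,2573
kreuzberg/_pandoc.py,sha256=DC6y_NN_CG9dF6fhAj3WumXqKIJLjYmnql2H53_KHnE,13766
kreuzberg/_string.py,sha256=4txRDnkdR12oO6G8V-jXEMlA9ivgmw8E8EbjyhfL-W4,1106
kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
kreuzberg/_tesseract.py,sha256=nnhkjRIS0BSoovjMIqOlBEXlzngE0QJeFDe7BIqUik8,7872
kreuzberg/exceptions.py,sha256=pxoEPS0T9e5QSgxsfXn1VmxsY_EGXvTwY0gETPiNn8E,945
kreuzberg/extraction.py,sha256=gux3fkPIs8IbIKtRGuPFWJBLB5jO6Y9JsBfhHRcpQ0k,6160
kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
"""


def parse_record(text: str) -> dict[str, str]:
    """Map each path in a RECORD listing to its hash field."""
    entries = {}
    for line in text.strip().splitlines():
        path, digest, _size = line.rsplit(",", 2)
        entries[path] = digest
    return entries


def compare_records(old_text: str, new_text: str) -> dict[str, list[str]]:
    """Classify paths as added, removed, changed, or unchanged by hash."""
    old, new = parse_record(old_text), parse_record(new_text)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in shared if old[p] != new[p]),
        "unchanged": sorted(p for p in shared if old[p] == new[p]),
    }


report = compare_records(OLD_RECORD, NEW_RECORD)
for status, paths in report.items():
    print(status, paths)
```

The result agrees with the file list at the top of this diff: `_pandoc.py` and `_tesseract.py` are new, five modules changed, and `__init__.py`, `_sync.py` and `py.typed` keep their hashes, which is why they do not appear in that list.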
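{kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/LICENSE
File without changes
{kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/WHEEL
File without changes
{kreuzberg-1.3.0.dist-info → kreuzberg-1.5.0.dist-info}/top_level.txt
File without changes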