kreuzberg 1.3.0__py3-none-any.whl → 1.5.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,318 @@
1
+ Metadata-Version: 2.2
2
+ Name: kreuzberg
3
+ Version: 1.5.0
4
+ Summary: A text extraction library supporting PDFs, images, office documents and more
5
+ Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
6
+ License: MIT
7
+ Project-URL: homepage, https://github.com/Goldziher/kreuzberg
8
+ Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,tesseract,text-extraction,text-processing
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Programming Language :: Python :: 3 :: Only
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Classifier: Topic :: Text Processing :: General
22
+ Classifier: Topic :: Utilities
23
+ Classifier: Typing :: Typed
24
+ Requires-Python: >=3.9
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Requires-Dist: anyio>=4.8.0
28
+ Requires-Dist: charset-normalizer>=3.4.1
29
+ Requires-Dist: html-to-markdown>=1.2.0
30
+ Requires-Dist: pypdfium2>=4.30.1
31
+ Requires-Dist: python-pptx>=1.0.2
32
+ Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
33
+
34
+ # Kreuzberg
35
+
36
+ Kreuzberg is a modern Python library for text extraction from documents, designed for simplicity and efficiency. It provides a unified async interface for extracting text from a wide range of file formats including PDFs, images, office documents, and more.
37
+
38
+ ## Why Kreuzberg?
39
+
40
+ - **Simple and Hassle-Free**: Clean API that just works, without complex configuration
41
+ - **Local Processing**: No external API calls or cloud dependencies required
42
+ - **Resource Efficient**: Lightweight processing without GPU requirements
43
+ - **Format Support**: Comprehensive support for documents, images, and text formats
44
+ - **Modern Python**: Built with async/await, type hints, and current best practices
45
+
46
+ Kreuzberg was created to solve text extraction needs in RAG (Retrieval Augmented Generation) applications, but it's suitable for any text extraction use case. Unlike many commercial solutions that require API calls or complex setups, Kreuzberg focuses on local processing with minimal dependencies.
47
+
48
+ ## Features
49
+
50
+ - **Universal Text Extraction**: Extract text from PDFs (both searchable and scanned), images, office documents, and more
51
+ - **Smart Processing**: Automatic OCR for scanned documents, encoding detection for text files
52
+ - **Modern Python Design**:
53
+   - Async-first API using `anyio`
54
+   - Comprehensive type hints for better IDE support
55
+   - Detailed error handling with context information
56
+ - **Production Ready**:
57
+   - Robust error handling
58
+   - Detailed debugging information
59
+   - Memory efficient processing
60
+
61
+ ## Installation
62
+
63
+ ### 1. Install the Python Package
64
+
65
+ ```shell
66
+ pip install kreuzberg
67
+ ```
68
+
69
+ ### 2. Install System Dependencies
70
+
71
+ Kreuzberg requires two open-source tools:
72
+
73
+ - [Pandoc](https://pandoc.org/installing.html) - For document format conversion
74
+
75
+   - GPL v2.0 licensed (used via CLI only)
76
+   - Handles office documents and markup formats
77
+
78
+ - [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR
79
+   - Apache License
80
+   - Required for scanned documents and images
81
+
82
+ ## Architecture
83
+
84
+ Kreuzberg is designed as a high-level async abstraction over established open-source tools. It integrates:
85
+
86
+ - **PDF Processing**:
87
+   - `pypdfium2` for searchable PDFs
88
+   - Tesseract OCR for scanned content
89
+ - **Document Conversion**:
90
+   - Pandoc for office documents and markup
91
+   - `python-pptx` for PowerPoint files
92
+   - `html-to-markdown` for HTML content
93
+ - **Text Processing**:
94
+   - Smart encoding detection
95
+   - Markdown and plain text handling
96
+
97
+ ### Supported Formats
98
+
99
+ #### Document Formats
100
+
101
+ - PDF (`.pdf`, both searchable and scanned documents)
102
+ - Microsoft Word (`.docx`, `.doc`)
103
+ - PowerPoint presentations (`.pptx`)
104
+ - OpenDocument Text (`.odt`)
105
+ - Rich Text Format (`.rtf`)
106
+ - EPUB (`.epub`)
107
+ - DocBook XML (`.dbk`, `.xml`)
108
+ - FictionBook (`.fb2`)
109
+ - LaTeX (`.tex`, `.latex`)
110
+ - Typst (`.typ`)
111
+
112
+ #### Markup and Text Formats
113
+
114
+ - HTML (`.html`, `.htm`)
115
+ - Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
116
+ - reStructuredText (`.rst`)
117
+ - Org-mode (`.org`)
118
+ - DokuWiki (`.txt`)
119
+ - Pod (`.pod`)
120
+ - Man pages (`.1`, `.2`, etc.)
121
+
122
+ #### Data and Research Formats
123
+
124
+ - CSV (`.csv`) and TSV (`.tsv`) files
125
+ - Jupyter Notebooks (`.ipynb`)
126
+ - BibTeX (`.bib`) and BibLaTeX (`.bib`)
127
+ - CSL-JSON (`.json`)
128
+ - EndNote XML (`.xml`)
129
+ - RIS (`.ris`)
130
+ - JATS XML (`.xml`)
131
+
132
+ #### Image Formats
133
+
134
+ - JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
135
+ - PNG (`.png`)
136
+ - TIFF (`.tiff`, `.tif`)
137
+ - BMP (`.bmp`)
138
+ - GIF (`.gif`)
139
+ - WebP (`.webp`)
140
+ - JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
141
+ - Portable Anymap (`.pnm`)
142
+ - Portable Bitmap (`.pbm`)
143
+ - Portable Graymap (`.pgm`)
144
+ - Portable Pixmap (`.ppm`)
145
+
146
+ ## Usage
147
+
148
+ Kreuzberg provides a simple, async-first API for text extraction. The library exports two main functions:
149
+
150
+ - `extract_file()`: Extract text from a file (accepts string path or `pathlib.Path`)
151
+ - `extract_bytes()`: Extract text from bytes (accepts a byte string)
152
+
153
+ ### Quick Start
154
+
155
+ ```python
156
+ from pathlib import Path
157
+ from kreuzberg import extract_file, extract_bytes
158
+
159
+ # Basic file extraction
160
+ async def extract_document():
161
+     # Extract from a PDF file
162
+     pdf_result = await extract_file("document.pdf")
163
+     print(f"PDF text: {pdf_result.content}")
164
+
165
+     # Extract from an image
166
+     img_result = await extract_file("scan.png")
167
+     print(f"Image text: {img_result.content}")
168
+
169
+     # Extract from Word document
170
+     docx_result = await extract_file(Path("document.docx"))
171
+     print(f"Word text: {docx_result.content}")
172
+ ```
173
+
174
+ ### Processing Uploaded Files
175
+
176
+ ```python
177
+ from kreuzberg import extract_bytes
178
+
179
+ async def process_upload(file_content: bytes, mime_type: str) -> str:
180
+     """Process uploaded file content with known MIME type."""
181
+     result = await extract_bytes(file_content, mime_type=mime_type)
182
+     return result.content
183
+
184
+ # Example usage with different file types
185
+ async def handle_uploads(pdf_bytes: bytes, image_bytes: bytes, docx_bytes: bytes):
186
+     # Process PDF upload
187
+     pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")
188
+
189
+     # Process image upload
190
+     img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")
191
+
192
+     # Process Word document upload
193
+     docx_result = await extract_bytes(docx_bytes,
194
+         mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document")
195
+ ```
196
+
197
+ ### Advanced Features
198
+
199
+ #### PDF Processing Options
200
+
201
+ ```python
202
+ from kreuzberg import extract_file
203
+
204
+ async def process_pdf():
205
+     # Force OCR for PDFs with embedded images or scanned content
206
+     result = await extract_file("document.pdf", force_ocr=True)
207
+
208
+     # Process a scanned PDF (automatically uses OCR)
209
+     scanned = await extract_file("scanned.pdf")
210
+ ```
211
+
212
+ #### ExtractionResult Object
213
+
214
+ All extraction functions return an `ExtractionResult` containing:
215
+
216
+ - `content`: The extracted text (str)
217
+ - `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
218
+
219
+ ```python
220
+ from kreuzberg import ExtractionResult, extract_file
221
+
222
+ async def process_document(path: str) -> tuple[str, str]:
223
+     # Access as a named tuple
224
+     result: ExtractionResult = await extract_file(path)
225
+     print(f"Content: {result.content}")
226
+     print(f"Format: {result.mime_type}")
227
+
228
+     # Or unpack as a tuple
229
+     content, mime_type = await extract_file(path)
230
+     return content, mime_type
231
+ ```
232
+
233
+ ### Error Handling
234
+
235
+ Kreuzberg provides detailed error handling with two main exception types:
236
+
237
+ ```python
238
+ from kreuzberg import extract_file
239
+ from kreuzberg.exceptions import ValidationError, ParsingError
240
+
241
+ async def safe_extract(path: str) -> str:
242
+     try:
243
+         result = await extract_file(path)
244
+         return result.content
245
+
246
+     except ValidationError as e:
247
+         # Handles input validation issues:
248
+         # - Unsupported file types
249
+         # - Missing files
250
+         # - Invalid MIME types
251
+         print(f"Invalid input: {e.message}")
252
+         print(f"Details: {e.context}")
253
+
254
+     except ParsingError as e:
255
+         # Handles processing errors:
256
+         # - PDF parsing failures
257
+         # - OCR errors
258
+         # - Format conversion issues
259
+         print(f"Processing failed: {e.message}")
260
+         print(f"Details: {e.context}")
261
+
262
+     return ""
263
+
264
+ # Example error contexts (run inside an async function or an async REPL)
265
+ try:
266
+     result = await extract_file("document.xyz")
267
+ except ValidationError as e:
268
+     print(e.context)  # e.context might contain:
269
+     # {
270
+     #     "file_path": "document.xyz",
271
+     #     "error": "Unsupported file type",
272
+     #     "supported_types": ["pdf", "docx", ...]
273
+     # }
274
+
275
+ try:
276
+     result = await extract_file("scan.pdf")
277
+ except ParsingError as e:
278
+     print(e.context)  # e.context might contain:
279
+     # {
280
+     #     "file_path": "scan.pdf",
281
+     #     "error": "OCR processing failed",
282
+     #     "details": "Tesseract error: Unable to process image"
283
+     # }
284
+ ```
285
+
286
+ ## Roadmap
287
+
288
+ V1:
289
+
290
+ - [x] - html file text extraction
291
+ - [ ] - better PDF table extraction
292
+ - [ ] - batch APIs
293
+ - [ ] - sync APIs
294
+
295
+ V2:
296
+
297
+ - [ ] - metadata extraction (breaking change)
298
+ - [ ] - TBD
299
+
300
+ ## Contribution
301
+
302
+ This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
303
+ submitting PRs to avoid disappointment.
304
+
305
+ ### Local Development
306
+
307
+ 1. Clone the repo
308
+ 2. Install the system dependencies
309
+ 3. Install the full dependencies with `uv sync`
310
+ 4. Install the pre-commit hooks with:
311
+ ```shell
312
+ pre-commit install && pre-commit install --hook-type commit-msg
313
+ ```
314
+ 5. Make your changes and submit a PR
315
+
316
+ ## License
317
+
318
+ This library uses the MIT license.
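The README examples in the hunk above define coroutines but never show how to drive them from synchronous code. The following is a minimal sketch of one way to do that with the standard library's `asyncio`, including concurrent extraction of several files. It does not import kreuzberg: `FakeResult` and `fake_extract_file` are hypothetical stand-ins that only mirror the documented `(content, mime_type)` result shape, and `extract_many` is an illustrative helper, not part of the package.

```python
import asyncio
from typing import NamedTuple


class FakeResult(NamedTuple):
    """Hypothetical stand-in mirroring the documented (content, mime_type) result."""

    content: str
    mime_type: str


async def fake_extract_file(path: str) -> FakeResult:
    """Hypothetical stand-in for an async extractor; performs no real extraction."""
    await asyncio.sleep(0)  # yield to the event loop, as an I/O-bound call would
    return FakeResult(content=f"text of {path}", mime_type="text/plain")


async def extract_many(paths: list[str]) -> dict[str, str]:
    """Run one extraction per path concurrently and map each path to its text."""
    results = await asyncio.gather(*(fake_extract_file(p) for p in paths))
    return {path: result.content for path, result in zip(paths, results)}


def main() -> dict[str, str]:
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(extract_many(["a.pdf", "b.png"]))


if __name__ == "__main__":
    print(main())
```

In real use, the stand-in would be replaced by the package's own `extract_file`; whether concurrent OCR calls are advisable depends on the machine, so treat the `gather` call as an assumption rather than a recommendation.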
@@ -0,0 +1,15 @@
1
+ kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
2
+ kreuzberg/_extractors.py,sha256=k6xO_2ItaftPmlqzfXyxTn8rdaWdwrJHGziBbo7gCio,6599
3
+ kreuzberg/_mime_types.py,sha256=0ZYtRrMAaKpCMDkhpTbWAXHCsVob5MFRMGlbni8iYSA,2573
4
+ kreuzberg/_pandoc.py,sha256=DC6y_NN_CG9dF6fhAj3WumXqKIJLjYmnql2H53_KHnE,13766
5
+ kreuzberg/_string.py,sha256=4txRDnkdR12oO6G8V-jXEMlA9ivgmw8E8EbjyhfL-W4,1106
6
+ kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
7
+ kreuzberg/_tesseract.py,sha256=nnhkjRIS0BSoovjMIqOlBEXlzngE0QJeFDe7BIqUik8,7872
8
+ kreuzberg/exceptions.py,sha256=pxoEPS0T9e5QSgxsfXn1VmxsY_EGXvTwY0gETPiNn8E,945
9
+ kreuzberg/extraction.py,sha256=gux3fkPIs8IbIKtRGuPFWJBLB5jO6Y9JsBfhHRcpQ0k,6160
10
+ kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
11
+ kreuzberg-1.5.0.dist-info/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
12
+ kreuzberg-1.5.0.dist-info/METADATA,sha256=O462ss7M6Cb8cO6fJXwqsOdzkzaZekqa1oGwb7Vrgx8,9641
13
+ kreuzberg-1.5.0.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
14
+ kreuzberg-1.5.0.dist-info/top_level.txt,sha256=rbGkygffkZiyKhL8UN41ZOjLfem0jJPA1Whtndne0rE,10
15
+ kreuzberg-1.5.0.dist-info/RECORD,,
@@ -1,306 +0,0 @@
1
- Metadata-Version: 2.2
2
- Name: kreuzberg
3
- Version: 1.3.0
4
- Summary: A text extraction library supporting PDFs, images, office documents and more
5
- Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
6
- License: MIT
7
- Project-URL: homepage, https://github.com/Goldziher/kreuzberg
8
- Keywords: document-processing,docx,image-to-text,latex,markdown,ocr,odt,office-documents,pandoc,pdf,pdf-extraction,rag,text-extraction,text-processing
9
- Classifier: Development Status :: 4 - Beta
10
- Classifier: Intended Audience :: Developers
11
- Classifier: License :: OSI Approved :: MIT License
12
- Classifier: Operating System :: OS Independent
13
- Classifier: Programming Language :: Python :: 3 :: Only
14
- Classifier: Programming Language :: Python :: 3.9
15
- Classifier: Programming Language :: Python :: 3.10
16
- Classifier: Programming Language :: Python :: 3.11
17
- Classifier: Programming Language :: Python :: 3.12
18
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
20
- Classifier: Topic :: Text Processing :: General
21
- Classifier: Topic :: Utilities
22
- Classifier: Typing :: Typed
23
- Requires-Python: >=3.9
24
- Description-Content-Type: text/markdown
25
- License-File: LICENSE
26
- Requires-Dist: anyio>=4.8.0
27
- Requires-Dist: charset-normalizer>=3.4.1
28
- Requires-Dist: html-to-markdown>=1.2.0
29
- Requires-Dist: pypandoc>=1.15
30
- Requires-Dist: pypdfium2>=4.30.1
31
- Requires-Dist: pytesseract>=0.3.13
32
- Requires-Dist: python-pptx>=1.0.2
33
- Requires-Dist: typing-extensions>=4.12.2
34
-
35
- # Kreuzberg
36
-
37
- Kreuzberg is a library for simplified text extraction from PDF files. It's meant to offer simple, hassle-free text
38
- extraction.
39
-
40
- Why?
41
-
42
- Like many others right now, I am building a RAG-focused service (check out https://grantflow.ai), and I have text extraction needs.
43
- There are quite a lot of commercial options out there, and several open-source + paid options.
44
- But I wanted something simple that does not require expensive round-trips to an external API.
45
- Furthermore, I wanted something that is easy to run locally, is not very heavy, and does not require a GPU.
46
-
47
- Hence, this library.
48
-
49
- ## Features
50
-
51
- - Extract text from PDFs, images, office documents and more (see supported formats below)
52
- - Use modern Python with async (via `anyio`) and proper type hints
53
- - Extensive error handling for easy debugging
54
-
55
- ## Installation
56
-
57
- 1. Begin by installing the python package:
58
-
59
- ```shell
60
-
61
- pip install kreuzberg
62
-
63
- ```
64
-
65
- 2. Install the system dependencies:
66
-
67
- - [pandoc](https://pandoc.org/installing.html) (non-pdf text extraction, GPL v2.0 licensed but used via CLI only)
68
- - [tesseract-ocr](https://tesseract-ocr.github.io/) (for image/PDF OCR, Apache License)
69
-
70
- ## Dependencies and Philosophy
71
-
72
- This library is built to be minimalist and simple. It also aims to use OSS tools for the job. It is fundamentally a
73
- high-level async abstraction on top of other tools; think of it as the library you would bake into your code base, but
74
- polished and well maintained.
75
-
76
- ### Dependencies
77
-
78
- PDFs are processed using pypdfium2 for searchable PDFs and Tesseract OCR for scanned documents
79
- - Images are processed using Tesseract OCR
80
- - Office documents and other formats are processed using Pandoc
81
- - PPTX files are converted using python-pptx
82
- - HTML files are converted using html-to-markdown
83
- - Plain text files are read directly with appropriate encoding detection
84
-
85
- ### Roadmap
86
-
87
- V1:
88
-
89
- - [x] - html file text extraction
90
- - [ ] - better PDF table extraction
91
- - [ ] - TBD
92
-
93
- V2:
94
-
95
- - [ ] - extra install groups (to make dependencies optional)
96
- - [ ] - metadata extraction (possible breaking change)
97
- - [ ] - TBD
98
-
99
- ### Feature Requests
100
-
101
- Feel free to open a GitHub discussion or an issue if you have any feature requests.
102
-
103
- ### Contribution
104
-
105
- Contributions are welcome! Read the guidelines below.
106
-
107
- ## Supported File Types
108
-
109
- Kreuzberg supports a wide range of file formats:
110
-
111
- ### Document Formats
112
-
113
- - PDF (`.pdf`) - both searchable and scanned documents
114
- - Word Documents (`.docx`, `.doc`)
115
- PowerPoint Presentations (`.pptx`)
116
- - OpenDocument Text (`.odt`)
117
- - Rich Text Format (`.rtf`)
118
-
119
- ### Image Formats
120
-
121
- - JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)
122
- - PNG (`.png`)
123
- - TIFF (`.tiff`, `.tif`)
124
- - BMP (`.bmp`)
125
- - GIF (`.gif`)
126
- - WebP (`.webp`)
127
- - JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
128
- - Portable Anymap (`.pnm`)
129
- - Portable Bitmap (`.pbm`)
130
- - Portable Graymap (`.pgm`)
131
- - Portable Pixmap (`.ppm`)
132
-
133
- #### Text and Markup Formats
134
-
135
- - HTML (`.html`, `.htm`)
136
- - Plain Text (`.txt`)
137
- - Markdown (`.md`)
138
- - reStructuredText (`.rst`)
139
- - LaTeX (`.tex`)
140
-
141
- #### Data Formats
142
-
143
- - Comma-Separated Values (`.csv`)
144
- - Tab-Separated Values (`.tsv`)
145
-
146
- ## Usage
147
-
148
- Kreuzberg exports two async functions:
149
-
150
- - Extract text from a file (string path or `pathlib.Path`) using `extract_file()`
151
- - Extract text from a byte-string using `extract_bytes()`
152
-
153
- ### Extract from File
154
-
155
- ```python
156
- from pathlib import Path
157
- from kreuzberg import extract_file
158
-
159
-
160
- # Extract text from a PDF file
161
- async def extract_pdf():
162
-     result = await extract_file("document.pdf")
163
-     print(f"Extracted text: {result.content}")
164
-     print(f"Output mime type: {result.mime_type}")
165
-
166
-
167
- # Extract text from an image
168
- async def extract_image():
169
-     result = await extract_file("scan.png")
170
-     print(f"Extracted text: {result.content}")
171
-
172
-
173
- # or use Path
174
-
175
- async def extract_pdf():
176
-     result = await extract_file(Path("document.pdf"))
177
-     print(f"Extracted text: {result.content}")
178
-     print(f"Output mime type: {result.mime_type}")
179
- ```
180
-
181
- ### Extract from Bytes
182
-
183
- ```python
184
- from kreuzberg import extract_bytes
185
-
186
-
187
- # Extract text from PDF bytes
188
- async def process_uploaded_pdf(pdf_content: bytes):
189
-     result = await extract_bytes(pdf_content, mime_type="application/pdf")
190
-     return result.content
191
-
192
-
193
- # Extract text from image bytes
194
- async def process_uploaded_image(image_content: bytes):
195
-     result = await extract_bytes(image_content, mime_type="image/jpeg")
196
-     return result.content
197
- ```
198
-
199
- ### Forcing OCR
200
-
201
- When extracting a PDF file or bytes, you might want to force OCR, for example when the PDF contains images with text that should also be extracted.
202
- You can do this by passing `force_ocr=True`:
203
-
204
- ```python
205
- from kreuzberg import extract_bytes
206
-
207
-
208
- # Extract text from PDF bytes and force OCR
209
- async def process_uploaded_pdf(pdf_content: bytes):
210
-     result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
211
-     return result.content
212
- ```
213
-
214
- ### Error Handling
215
-
216
- Kreuzberg raises two exception types:
217
-
218
- #### ValidationError
219
-
220
- Raised when there are issues with input validation:
221
-
222
- - Unsupported mime types
223
- - Undetectable mime types
224
- Path doesn't point to an existing file
225
-
226
- #### ParsingError
227
-
228
- Raised when there are issues during the text extraction process:
229
-
230
- - PDF parsing failures
231
- - OCR errors
232
- - Pandoc conversion errors
233
-
234
- ```python
235
- from kreuzberg import extract_file
236
- from kreuzberg.exceptions import ValidationError, ParsingError
237
-
238
-
239
- async def safe_extract():
240
-     try:
241
-         result = await extract_file("document.doc")
242
-         return result.content
243
-     except ValidationError as e:
244
-         print(f"Validation error: {e.message}")
245
-         print(f"Context: {e.context}")
246
-     except ParsingError as e:
247
-         print(f"Parsing error: {e.message}")
248
-         print(f"Context: {e.context}")  # Contains detailed error information
249
- ```
250
-
251
- Both error types include helpful context information for debugging:
252
-
253
- ```python
254
- try:
255
-     result = await extract_file("scanned.pdf")
256
- except ParsingError as e:
257
-     print(e.context)  # e.context might contain:
258
-     # {
259
-     #     "file_path": "scanned.pdf",
260
-     #     "error": "Tesseract OCR failed: Unable to process image"
261
-     # }
262
- ```
263
-
264
- ### ExtractionResult
265
-
266
- All extraction functions return an ExtractionResult named tuple containing:
267
-
268
- - `content`: The extracted text as a string
269
- `mime_type`: The mime type of the output (either "text/plain" or, if Pandoc is used, "text/markdown")
270
-
271
- ```python
272
- from kreuzberg import ExtractionResult, extract_file
273
-
274
-
275
- async def process_document(path: str) -> str:
276
-     result: ExtractionResult = await extract_file(path)
277
-     return result.content
278
-
279
-
280
- # or access the result as tuple
281
-
282
- async def process_document(path: str) -> str:
283
-     content, mime_type = await extract_file(path)
284
-     # do something with mime_type
285
-     return content
286
- ```
287
-
288
- ## Contribution
289
-
290
- This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
291
- submitting PRs to avoid disappointment.
292
-
293
- ### Local Development
294
-
295
- 1. Clone the repo
296
- 2. Install the system dependencies
297
- 3. Install the full dependencies with `uv sync`
298
- 4. Install the pre-commit hooks with:
299
- ```shell
300
- pre-commit install && pre-commit install --hook-type commit-msg
301
- ```
302
- 5. Make your changes and submit a PR
303
-
304
- ## License
305
-
306
- This library uses the MIT license.
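The two METADATA hunks above differ in their `Requires-Dist` headers, which is easy to miss among the README changes. The sketch below shows one way to compare them programmatically, relying on the fact that METADATA files use an RFC 822 style header layout that the standard library's `email.parser` can read. The header values are copied from the two hunks; the helper name `requirements` is illustrative only.

```python
from email.parser import Parser

# Header excerpts copied from the 1.3.0 and 1.5.0 METADATA hunks.
OLD_METADATA = """\
Metadata-Version: 2.2
Name: kreuzberg
Version: 1.3.0
Requires-Dist: anyio>=4.8.0
Requires-Dist: charset-normalizer>=3.4.1
Requires-Dist: html-to-markdown>=1.2.0
Requires-Dist: pypandoc>=1.15
Requires-Dist: pypdfium2>=4.30.1
Requires-Dist: pytesseract>=0.3.13
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.12.2
"""

NEW_METADATA = """\
Metadata-Version: 2.2
Name: kreuzberg
Version: 1.5.0
Requires-Dist: anyio>=4.8.0
Requires-Dist: charset-normalizer>=3.4.1
Requires-Dist: html-to-markdown>=1.2.0
Requires-Dist: pypdfium2>=4.30.1
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.12.2; python_version < "3.10"
"""


def requirements(metadata_text: str) -> set[str]:
    """Return the set of Requires-Dist values found in a METADATA header block."""
    headers = Parser().parsestr(metadata_text, headersonly=True)
    return set(headers.get_all("Requires-Dist", []))


old, new = requirements(OLD_METADATA), requirements(NEW_METADATA)
removed = sorted(old - new)  # requirement strings present only in 1.3.0
added = sorted(new - old)    # requirement strings present only in 1.5.0

if __name__ == "__main__":
    print("removed:", removed)
    print("added:", added)
```

Read against the hunks, this shows that 1.5.0 no longer declares `pypandoc` or `pytesseract` and that `typing-extensions` gained a `python_version < "3.10"` marker. The comparison is on raw strings, so a changed marker shows up as one removal plus one addition.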
@@ -1,13 +0,0 @@
1
- kreuzberg/__init__.py,sha256=5IBPjPsZ7faK15gFB9ZEROHhkEX7KKQmrHPCZuGnhb0,285
2
- kreuzberg/_extractors.py,sha256=eiWPpjnZOZFDwlQL4XsgavJEWqxGtzLVvS8YU28RBAo,8095
3
- kreuzberg/_mime_types.py,sha256=hR6LFXWn8dtCDB05PkADYk2l__HpmETNyf4YFixhecE,2918
4
- kreuzberg/_string.py,sha256=O023sxdYoC4DhFCU1z430UBdbxqwXKmyymUDDx3J_i8,1156
5
- kreuzberg/_sync.py,sha256=ovsFHFdkcczz7gNEUJsbZzY8KHG0_GAOOYipQNE4hIY,874
6
- kreuzberg/exceptions.py,sha256=jrXyvcuSU-694OEtXPZfHYcUbpoRZzNKw9Lo3wIZwL0,770
7
- kreuzberg/extraction.py,sha256=cgX8uoCVXf-Va30g8T8DwrZUqsSPHIzmPfDgnWOqNNU,6148
8
- kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
9
- kreuzberg-1.3.0.dist-info/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
10
- kreuzberg-1.3.0.dist-info/METADATA,sha256=3wiaAuaiA865lg5oCjwlAKaZqRQn1w8VqaQXeoEdip4,8579
11
- kreuzberg-1.3.0.dist-info/WHEEL,sha256=In9FTNxeP60KnTkGw7wk6mJPYd_dQSjEZmXdBdMCI-8,91
12
- kreuzberg-1.3.0.dist-info/top_level.txt,sha256=rbGkygffkZiyKhL8UN41ZOjLfem0jJPA1Whtndne0rE,10
13
- kreuzberg-1.3.0.dist-info/RECORD,,
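Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the file's SHA-256 hash encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below recomputes such a digest and checks it against one entry. It uses `kreuzberg/py.typed` because the listing gives its size as 0, so its contents are known to be empty; the helper names `record_digest` and `parse_record` are illustrative and not part of any packaging tool.

```python
import base64
import csv
import hashlib
import io


def record_digest(data: bytes) -> str:
    """Hash data the way RECORD entries do: sha256, URL-safe base64, no padding."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record(record_text: str) -> dict[str, tuple[str, str]]:
    """Map each path in a RECORD file to its (hash, size) fields as strings."""
    rows = csv.reader(io.StringIO(record_text))
    return {row[0]: (row[1], row[2]) for row in rows if row}


# Two lines copied from the 1.5.0 RECORD hunk.
RECORD_SAMPLE = """\
kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
kreuzberg-1.5.0.dist-info/RECORD,,
"""

entries = parse_record(RECORD_SAMPLE)
expected_hash, expected_size = entries["kreuzberg/py.typed"]

py_typed_contents = b""  # the listed size is 0, so the file is empty
hash_ok = record_digest(py_typed_contents) == expected_hash
size_ok = len(py_typed_contents) == int(expected_size)

if __name__ == "__main__":
    print("hash matches:", hash_ok, "| size matches:", size_ok)
```

The same comparison explains how to read the two RECORD hunks side by side: `kreuzberg/__init__.py`, `kreuzberg/_sync.py`, and `kreuzberg/py.typed` carry identical hashes in both versions and are therefore unchanged, while `_pandoc.py` and `_tesseract.py` appear only in 1.5.0. The RECORD file lists itself with empty hash and size fields.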