kreuzberg 4.0.6__cp310-abi3-macosx_14_0_arm64.whl

Potentially problematic release.


@@ -0,0 +1,470 @@
1
+ Metadata-Version: 2.4
2
+ Name: kreuzberg
3
+ Version: 4.0.6
4
+ Classifier: Development Status :: 4 - Beta
5
+ Classifier: Intended Audience :: Developers
6
+ Classifier: Intended Audience :: Information Technology
7
+ Classifier: Intended Audience :: Science/Research
8
+ Classifier: License :: OSI Approved :: MIT License
9
+ Classifier: Operating System :: OS Independent
10
+ Classifier: Programming Language :: Python :: 3 :: Only
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Programming Language :: Python :: 3.14
16
+ Classifier: Programming Language :: Python :: Implementation :: CPython
17
+ Classifier: Programming Language :: Rust
18
+ Classifier: Topic :: Office/Business
19
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Classifier: Topic :: Text Processing
22
+ Classifier: Topic :: Text Processing :: Filters
23
+ Classifier: Topic :: Text Processing :: General
24
+ Classifier: Typing :: Typed
25
+ Requires-Dist: kreuzberg[easyocr,paddleocr] ; extra == 'all'
26
+ Requires-Dist: easyocr>=1.7.2 ; python_full_version < '3.14' and extra == 'easyocr'
27
+ Requires-Dist: torch>=2.9.1 ; python_full_version < '3.14' and extra == 'easyocr'
28
+ Requires-Dist: paddleocr>=3.3.2 ; python_full_version < '3.14' and extra == 'paddleocr'
29
+ Requires-Dist: paddlepaddle>=3.2.1,<3.2.2 ; python_full_version < '3.14' and extra == 'paddleocr'
30
+ Requires-Dist: setuptools>=80.9 ; python_full_version < '3.14' and extra == 'paddleocr'
31
+ Provides-Extra: all
32
+ Provides-Extra: easyocr
33
+ Provides-Extra: paddleocr
34
+ Summary: High-performance document intelligence library for Python. Extract text, metadata, and structured data from PDFs, Office documents, images, and 50+ formats. Powered by Rust core for 10-50x speed improvements.
35
+ Keywords: document-extraction,document-intelligence,document-parsing,document-processing,docx,easyocr,email-parsing,html,markdown,metadata-extraction,ocr,office-documents,paddleocr,pdf,pdf-extraction,performance,pptx,rust,table-extraction,tesseract,text-extraction,xlsx,xml
36
+ Home-Page: https://goldziher.github.io/kreuzberg/
37
+ Author-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
38
+ Maintainer-email: Na'aman Hirschfeld <nhirschfeld@gmail.com>
39
+ License: MIT
40
+ Requires-Python: >=3.10
41
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
42
+ Project-URL: Changelog, https://kreuzberg.dev/CHANGELOG/
43
+ Project-URL: Documentation, https://kreuzberg.dev
44
+ Project-URL: Homepage, https://kreuzberg.dev
45
+ Project-URL: Issues, https://github.com/kreuzberg-dev/kreuzberg/issues
46
+ Project-URL: Repository, https://github.com/kreuzberg-dev/kreuzberg
47
+ Project-URL: Source, https://github.com/kreuzberg-dev/kreuzberg
48
+
49
+ # Python
50
+
51
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
52
+ <!-- Language Bindings -->
53
+ <a href="https://crates.io/crates/kreuzberg">
54
+ <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
55
+ </a>
56
+ <a href="https://hex.pm/packages/kreuzberg">
57
+ <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
58
+ </a>
59
+ <a href="https://pypi.org/project/kreuzberg/">
60
+ <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
61
+ </a>
62
+ <a href="https://www.npmjs.com/package/@kreuzberg/node">
63
+ <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
64
+ </a>
65
+ <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
66
+ <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
67
+ </a>
68
+
69
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
70
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
71
+ </a>
72
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
73
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
74
+ </a>
75
+ <a href="https://www.nuget.org/packages/Kreuzberg/">
76
+ <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
77
+ </a>
78
+ <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
79
+ <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
80
+ </a>
81
+ <a href="https://rubygems.org/gems/kreuzberg">
82
+ <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
83
+ </a>
84
+
85
+ <!-- Project Info -->
86
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
87
+ <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
88
+ </a>
89
+ <a href="https://docs.kreuzberg.dev">
90
+ <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
91
+ </a>
92
+ </div>
93
+
94
+ <img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
95
+
96
+ <div align="center" style="margin-top: 20px;">
97
+ <a href="https://discord.gg/pXxagNK2zN">
98
+ <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
99
+ </a>
100
+ </div>
101
+
102
+
103
+ Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Native Python bindings with async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and extensible plugin system.
104
+
105
+
106
+ ## Installation
107
+
108
+ ### Package Installation
109
+
110
+
111
+
112
+
113
+ Install via pip:
114
+
115
+ ```bash
116
+ pip install kreuzberg
117
+ ```
118
+
119
+ For async support and additional features:
120
+
121
+ ```bash
122
+ pip install kreuzberg[async]
123
+ ```
124
+
125
+
126
+
127
+
128
+ ### System Requirements
129
+
130
+ - **Python 3.10+** required
131
+ - Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
132
+ - Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
133
+
134
+
135
+
136
+ ## Quick Start
137
+
138
+ ### Basic Extraction
139
+
140
+ Extract text, metadata, and structure from any supported document format:
141
+
142
+ ```python
143
+ import asyncio
144
+ from kreuzberg import extract_file, ExtractionConfig
145
+
146
+ async def main() -> None:
147
+     config = ExtractionConfig(
148
+         use_cache=True,
149
+         enable_quality_processing=True
150
+     )
151
+     result = await extract_file("document.pdf", config=config)
152
+     print(result.content)
153
+
154
+ asyncio.run(main())
155
+ ```
156
+
157
+
158
+ ### Common Use Cases
159
+
160
+ #### Extract with Custom Configuration
161
+
162
+ Most use cases benefit from configuration to control extraction behavior:
163
+
164
+
165
+ **With OCR (for scanned documents):**
166
+
167
+ ```python
168
+ import asyncio
169
+ from kreuzberg import extract_file
170
+
171
+ async def main() -> None:
172
+     result = await extract_file("document.pdf")
172
+     print(result.content)
174
+
175
+ asyncio.run(main())
176
+ ```
177
+
178
+
179
+
180
+
181
+ #### Table Extraction
182
+
183
+
184
+ ```python
185
+ import asyncio
186
+ from kreuzberg import extract_file
187
+
188
+ async def main() -> None:
189
+     result = await extract_file("document.pdf")
189
+
190
+     content: str = result.content
191
+     tables: int = len(result.tables)
192
+     format_type: str | None = result.metadata.format_type
193
+
194
+     print(f"Content length: {len(content)} characters")
195
+     print(f"Tables found: {tables}")
196
+     print(f"Format: {format_type}")
198
+
199
+ asyncio.run(main())
200
+ ```
201
+
202
+
203
+
204
+
205
+ #### Processing Multiple Files
206
+
207
+
208
+ ```python
209
+ import asyncio
210
+ from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig
211
+
212
+ async def main() -> None:
213
+     config = ExtractionConfig(
213
+         force_ocr=True,
214
+         ocr=OcrConfig(
215
+             backend="tesseract",
216
+             language="eng",
217
+             tesseract_config=TesseractConfig(psm=3)
218
+         )
219
+     )
220
+     result = await extract_file("scanned.pdf", config=config)
221
+     print(result.content)
222
+     print(f"Detected Languages: {result.detected_languages}")
224
+
225
+ asyncio.run(main())
226
+ ```
227
+
228
+
229
+
230
+
231
+
232
+ #### Async Processing
233
+
234
+ For non-blocking document processing:
235
+
236
+ ```python
237
+ import asyncio
238
+ from pathlib import Path
239
+ from kreuzberg import extract_file
240
+
241
+ async def main() -> None:
242
+     file_path: Path = Path("document.pdf")
242
+
243
+     result = await extract_file(file_path)
244
+
245
+     print(f"Content: {result.content}")
246
+     print(f"MIME Type: {result.metadata.format_type}")
247
+     print(f"Tables: {len(result.tables)}")
249
+
250
+ asyncio.run(main())
251
+ ```
252
+
253
+
254
+
255
+
256
+
257
+
258
+ ### Next Steps
259
+
260
+ - **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
261
+ - **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
262
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
263
+ - **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)** - Advanced configuration options
264
+
265
+
266
+
267
+ ## Features
268
+
269
+ ### Supported File Formats (56+)
270
+
271
+ 56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
272
+
273
+ #### Office Documents
274
+
275
+ | Category | Formats | Capabilities |
276
+ |----------|---------|--------------|
277
+ | **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
278
+ | **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
279
+ | **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
280
+ | **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
281
+ | **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
282
+
283
+ #### Images (OCR-Enabled)
284
+
285
+ | Category | Formats | Features |
286
+ |----------|---------|----------|
287
+ | **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
288
+ | **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
289
+ | **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
290
+
291
+ #### Web & Data
292
+
293
+ | Category | Formats | Features |
294
+ |----------|---------|----------|
295
+ | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
296
+ | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
297
+ | **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
298
+
299
+ #### Email & Archives
300
+
301
+ | Category | Formats | Features |
302
+ |----------|---------|----------|
303
+ | **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
304
+ | **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
305
+
306
+ #### Academic & Scientific
307
+
308
+ | Category | Formats | Features |
309
+ |----------|---------|----------|
310
+ | **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
311
+ | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
312
+ | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
313
+
314
+ **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
315
+
316
+ ### Key Capabilities
317
+
318
+ - **Text Extraction** - Extract all text content with position and formatting information
319
+ - **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
320
+ - **Table Extraction** - Parse tables with structure and cell content preservation
321
+ - **Image Extraction** - Extract embedded images and render page previews
322
+ - **OCR Support** - Integrate multiple OCR backends for scanned documents
323
+
324
+ - **Async/Await** - Non-blocking document processing with concurrent operations
325
+
326
+
327
+ - **Plugin System** - Extensible post-processing for custom text transformation
328
+
329
+
330
+ - **Embeddings** - Generate vector embeddings using ONNX Runtime models
331
+
332
+ - **Batch Processing** - Efficiently process multiple documents in parallel
333
+ - **Memory Efficient** - Stream large files without loading entirely into memory
334
+ - **Language Detection** - Detect and support multiple languages in documents
335
+ - **Configuration** - Fine-grained control over extraction behavior
336
+
337
+ ### Performance Characteristics
338
+
339
+ | Format | Speed | Memory | Notes |
340
+ |--------|-------|--------|-------|
341
+ | **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
342
+ | **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
343
+ | **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
344
+ | **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
345
+ | **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
346
+
347
+
348
+
349
+ ## OCR Support
350
+
351
+ Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
352
+
353
+
354
+ - **Tesseract**
355
+
356
+ - **Easyocr**
357
+
358
+ - **Paddleocr**
359
+
360
+
361
+ ### OCR Configuration Example
362
+
363
+ ```python
364
+ import asyncio
365
+ from kreuzberg import extract_file
366
+
367
+ async def main() -> None:
368
+     result = await extract_file("document.pdf")
368
+     print(result.content)
370
+
371
+ asyncio.run(main())
372
+ ```
373
+
374
+
375
+
376
+
377
+ ## Async Support
378
+
379
+ This binding provides full async/await support for non-blocking document processing:
380
+
381
+ ```python
382
+ import asyncio
383
+ from pathlib import Path
384
+ from kreuzberg import extract_file
385
+
386
+ async def main() -> None:
387
+     file_path: Path = Path("document.pdf")
387
+
388
+     result = await extract_file(file_path)
389
+
390
+     print(f"Content: {result.content}")
391
+     print(f"MIME Type: {result.metadata.format_type}")
392
+     print(f"Tables: {len(result.tables)}")
394
+
395
+ asyncio.run(main())
396
+ ```
397
+
398
+
399
+
400
+
401
+ ## Plugin System
402
+
403
+ Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
404
+
405
+ For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/guides/plugins/).
406
+
407
+
408
+
409
+
410
+ ## Embeddings Support
411
+
412
+ Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
413
+
414
+ **[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
415
+
416
+
417
+
418
+ ## Batch Processing
419
+
420
+ Process multiple documents efficiently:
421
+
422
+ ```python
423
+ import asyncio
424
+ from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig
425
+
426
+ async def main() -> None:
427
+     config = ExtractionConfig(
427
+         force_ocr=True,
428
+         ocr=OcrConfig(
429
+             backend="tesseract",
430
+             language="eng",
431
+             tesseract_config=TesseractConfig(psm=3)
432
+         )
433
+     )
434
+     result = await extract_file("scanned.pdf", config=config)
435
+     print(result.content)
436
+     print(f"Detected Languages: {result.detected_languages}")
438
+
439
+ asyncio.run(main())
440
+ ```
441
+
442
+
443
+
444
+
445
+ ## Configuration
446
+
447
+ For advanced configuration options including language detection, table extraction, OCR settings, and more:
448
+
449
+ **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)**
450
+
451
+ ## Documentation
452
+
453
+ - **[Official Documentation](https://kreuzberg.dev/)**
454
+ - **[API Reference](https://kreuzberg.dev/reference/api-python/)**
455
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)**
456
+
457
+ ## Contributing
458
+
459
+ Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
460
+
461
+ ## License
462
+
463
+ MIT License - see LICENSE file for details.
464
+
465
+ ## Support
466
+
467
+ - **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
468
+ - **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
469
+ - **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
470
+
@@ -0,0 +1,17 @@
1
+ kreuzberg/__init__.py,sha256=4djrQ4GGY3NiWztqrqFdOOuVXsIXxZaPnf1rSWV04OQ,32008
2
+ kreuzberg/__main__.py,sha256=wQJIcjFj9mGv54ea5T3XKAlXMMXjeMfDMWX7cSQyS4E,4977
3
+ kreuzberg/_internal_bindings.abi3.so,sha256=OUQDDFnKWDP_vji5_TN1DDBPFcZB7Py_FD_xkPYX-pg,30574288
4
+ kreuzberg/_setup_lib_path.py,sha256=pGAKaVRXq2eNSXUgnJZkUwYBdUwWla43cg0RSdotbNs,4570
5
+ kreuzberg/exceptions.py,sha256=ZX9aBhaxCzjPWn5P5eFq02KbIDzQcxWPUoyS2p38pks,7967
6
+ kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
7
+ kreuzberg/types.py,sha256=g6tYGnj139eB-74TiV-iZN1C-d2-WG6gtitE88Ci-OM,12876
8
+ kreuzberg/ocr/__init__.py,sha256=8MrVCtEu-5HXGAJL9wubKwNNP0U8Cr35zd0EyzW9gdY,791
9
+ kreuzberg/ocr/easyocr.py,sha256=XwxXC6t5JpDzc-3gVR-Ns2xy-j4e0ofklgmKB7VNuUs,10168
10
+ kreuzberg/ocr/paddleocr.py,sha256=nTP9Xi5nQzVDotQ7O-hobLjTWHyaYjKPEYoIhsHMgCo,8806
11
+ kreuzberg/ocr/protocol.py,sha256=6nzTEIo3jK6i9O5EhoHaaZiWQ1P5VPF1bW6sAxSGshQ,5453
12
+ kreuzberg/postprocessors/__init__.py,sha256=xvzJ_NxzzuThWlTyFVc0pyuSv4R28f7oOkIhT7Dk9yQ,2283
13
+ kreuzberg/postprocessors/protocol.py,sha256=sU5prEkg3kuTnygv8j1GBPY-nevZgm4FrKzDPjOnOjU,2545
14
+ kreuzberg-4.0.6.dist-info/METADATA,sha256=ZKe888vfN82MKnBW6ywBKf-qxiXxfb3BKKjIfGMIs6M,15186
15
+ kreuzberg-4.0.6.dist-info/WHEEL,sha256=6rvbSekKj8Ky-umb-C-gAUiNALwG7Ly9wjfiqiV9R_M,104
16
+ kreuzberg-4.0.6.dist-info/entry_points.txt,sha256=OpqEOa3KCMZvGWMUSYMkBIXji-LZcgqnuBknEImvWJY,52
17
+ kreuzberg-4.0.6.dist-info/RECORD,,
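
Reviewer note: each RECORD row above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed. The sketch below shows one way to recompute and parse such a field. The helper names `record_digest` and `parse_record_line` are illustrative and are not part of the package. As a sanity check, the empty `kreuzberg/py.typed` file (size 0) should hash to the value listed in its row.

```python
from __future__ import annotations

import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style hash field for the given file contents."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record_line(line: str) -> tuple[str, str, int | None]:
    """Split one RECORD row into (path, hash field, size).

    Splitting from the right keeps any commas inside the path intact.
    The RECORD file's own row has empty hash and size fields.
    """
    path, hash_field, size = line.rsplit(",", 2)
    return path, hash_field, int(size) if size else None


if __name__ == "__main__":
    row = "kreuzberg/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
    path, hash_field, size = parse_record_line(row)
    # py.typed is empty, so hashing b"" should reproduce the listed digest.
    print(path, size, record_digest(b"") == hash_field)
```

To check a real file, read it in binary mode and compare `record_digest(contents)` against the hash field of the matching row.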
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: maturin (1.11.5)
3
+ Root-Is-Purelib: false
4
+ Tag: cp310-abi3-macosx_14_0_arm64
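
Reviewer note: the `Tag` value above is a wheel compatibility tag made of three dash-separated parts: Python tag, ABI tag, and platform tag. Here `cp310-abi3` indicates a CPython stable-ABI build usable on CPython 3.10 and later, which is consistent with `Requires-Python: >=3.10`, and `macosx_14_0_arm64` indicates macOS 14.0 or newer on Apple Silicon. A minimal sketch of splitting the tag follows. The names `WheelTag`, `parse_tag`, and `macos_requirement` are illustrative only.

```python
from __future__ import annotations

from typing import NamedTuple


class WheelTag(NamedTuple):
    python: str    # e.g. "cp310"
    abi: str       # e.g. "abi3"
    platform: str  # e.g. "macosx_14_0_arm64"


def parse_tag(tag: str) -> WheelTag:
    """Split a single 'python-abi-platform' tag into its three parts."""
    python, abi, platform = tag.split("-", 2)
    return WheelTag(python, abi, platform)


def macos_requirement(platform: str) -> tuple[int, int, str]:
    """Return (major, minor, arch) from a 'macosx_<major>_<minor>_<arch>' tag."""
    prefix, major, minor, arch = platform.split("_", 3)
    if prefix != "macosx":
        raise ValueError(f"not a macOS platform tag: {platform!r}")
    return int(major), int(minor), arch


if __name__ == "__main__":
    tag = parse_tag("cp310-abi3-macosx_14_0_arm64")
    print(tag)
    print(macos_requirement(tag.platform))
```

Limiting the final split to three separators keeps architectures that contain an underscore, such as `x86_64`, in one piece.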
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ kreuzberg=kreuzberg.__main__:main
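
Reviewer note: the `console_scripts` entry above maps the command name `kreuzberg` to the callable `main` in the module `kreuzberg.__main__`. The sketch below parses the same text with the standard library and shows how a `module:attr` target can be resolved in principle. The helper names `parse_console_scripts` and `resolve` are illustrative, and the resolution step is demonstrated on a standard-library target rather than on the package itself.

```python
from __future__ import annotations

import configparser
import importlib

ENTRY_POINTS_TXT = """\
[console_scripts]
kreuzberg=kreuzberg.__main__:main
"""


def parse_console_scripts(text: str) -> dict[str, tuple[str, str]]:
    """Map each console script name to its (module, attribute) target."""
    # Only '=' is a delimiter, so the ':' inside the target is left alone.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # entry point names are case-sensitive
    parser.read_string(text)
    scripts: dict[str, tuple[str, str]] = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


def resolve(module: str, attr: str):
    """Import the module and walk the dotted attribute path to the object."""
    obj = importlib.import_module(module)
    for part in attr.split("."):
        obj = getattr(obj, part)
    return obj


if __name__ == "__main__":
    print(parse_console_scripts(ENTRY_POINTS_TXT))
    # Resolution shown on a stdlib target so the sketch runs without the wheel.
    print(resolve("json", "dumps"))
```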