kreuzberg 3.6.1__py3-none-any.whl → 3.6.2__py3-none-any.whl
- {kreuzberg-3.6.1.dist-info → kreuzberg-3.6.2.dist-info}/METADATA +19 -14
- {kreuzberg-3.6.1.dist-info → kreuzberg-3.6.2.dist-info}/RECORD +5 -5
- {kreuzberg-3.6.1.dist-info → kreuzberg-3.6.2.dist-info}/WHEEL +0 -0
- {kreuzberg-3.6.1.dist-info → kreuzberg-3.6.2.dist-info}/entry_points.txt +0 -0
- {kreuzberg-3.6.1.dist-info → kreuzberg-3.6.2.dist-info}/licenses/LICENSE +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kreuzberg
-Version: 3.6.
+Version: 3.6.2
 Summary: A text extraction library supporting PDFs, images, office documents and more
 Project-URL: homepage, https://github.com/Goldziher/kreuzberg
 Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
@@ -25,7 +25,7 @@ Requires-Python: >=3.10
 Requires-Dist: anyio>=4.9.0
 Requires-Dist: charset-normalizer>=3.4.2
 Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
-Requires-Dist: html-to-markdown>=1.
+Requires-Dist: html-to-markdown[lxml]>=1.6.0
 Requires-Dist: msgspec>=0.18.0
 Requires-Dist: playa-pdf>=0.6.1
 Requires-Dist: psutil>=7.0.0
@@ -83,8 +83,8 @@ Description-Content-Type: text/markdown
 
 ## Why Kreuzberg?
 
-- **🚀 Fastest Performance**: [
-- **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+)
+- **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library
+- **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+) with lowest memory usage (~530MB)
 - **⚡ Dual APIs**: Only library with both sync and async support
 - **🔧 Zero Configuration**: Works out of the box with sane defaults
 - **🏠 Local Processing**: No cloud dependencies or external API calls
@@ -140,13 +140,13 @@ asyncio.run(main())
 
 ```bash
 # Run API server
-docker run -p 8000:8000 goldziher/kreuzberg:
+docker run -p 8000:8000 goldziher/kreuzberg:latest
 
 # Extract files
 curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
 ```
 
-Available variants: `3.
+Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`
 
 ### 🌐 REST API
 
@@ -191,15 +191,20 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 
 ## Performance
 
-**
+**[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):
 
-| Library | Speed
-| ------------- |
-| **Kreuzberg** |
-| Unstructured |
-| MarkItDown |
-| Docling |
+| Library | Speed | Memory | Install Size | Dependencies | Success Rate |
+| ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
+| **Kreuzberg** | **35+ files/s** | **530MB** | **71MB** | **20** | High\* |
+| Unstructured | Moderate | ~1GB | 146MB | 54 | 88%+ |
+| MarkItDown | Good† | ~1.5GB | 251MB | 25 | 80%† |
+| Docling | 60+ min/file‡ | ~5GB | 1,032MB | 88 | Low‡ |
 
+\*_Can achieve 75% reliability with 15% performance trade-off when configured_
+†_Good on simple documents, struggles with large/complex files (>10MB)_
+‡_Frequently fails/times out on medium files (>1MB)_
+
+> **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
 > **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
 
 ## Documentation
@@ -233,7 +238,7 @@ ______________________________________________________________________
 
 <div align="center">
 
-**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Discord](https://discord.gg/pXxagNK2zN)**
+**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**
 
 Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
 
@@ -47,8 +47,8 @@ kreuzberg/_utils/_serialization.py,sha256=AhZvyAu4KsjAqyZDh--Kn2kSWGgCuH7udio8lT
 kreuzberg/_utils/_string.py,sha256=owIVkUtP0__GiJD9RIJzPdvyIigT5sQho3mOXPbsnW0,958
 kreuzberg/_utils/_sync.py,sha256=oT4Y_cDBKtE_BFEoLTae3rSisqlYXzW-jlUG_x-dmLM,4725
 kreuzberg/_utils/_tmp.py,sha256=hVn-VVijIg2FM7EZJ899gc7wZg-TGoJZoeAcxMX-Cxg,1044
-kreuzberg-3.6.
-kreuzberg-3.6.
-kreuzberg-3.6.
-kreuzberg-3.6.
-kreuzberg-3.6.
+kreuzberg-3.6.2.dist-info/METADATA,sha256=shguv5yge8FkD9aT0x02dRdLpuLi1PW4SmczFYiILmU,9910
+kreuzberg-3.6.2.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+kreuzberg-3.6.2.dist-info/entry_points.txt,sha256=VdoFaTl3QSvVWOZcIlPpDd47o6kn7EvmXSs8FI0ExLc,48
+kreuzberg-3.6.2.dist-info/licenses/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
+kreuzberg-3.6.2.dist-info/RECORD,,
File without changes
File without changes
File without changes
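
The RECORD rows in the last hunk follow the wheel convention of `path,sha256=<digest>,<size>`, where the digest is URL-safe base64 with the trailing `=` padding stripped, and the RECORD file's own row leaves both fields empty. Below is a minimal sketch of how such a row could be checked against a file's bytes. The helper names (`record_hash`, `parse_record_row`, `row_matches`) are hypothetical and are not part of kreuzberg or any packaging tool.

```python
import base64
import csv
import hashlib
from typing import Optional


def record_hash(data: bytes) -> str:
    """Return the RECORD-style hash field for the given file contents."""
    digest = hashlib.sha256(data).digest()
    # RECORD stores the digest as URL-safe base64 without '=' padding.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def parse_record_row(row: str) -> tuple[str, str, Optional[int]]:
    """Split one RECORD row into (path, hash field, size)."""
    # RECORD is a CSV file, so use the csv module rather than str.split.
    path, hash_field, size = next(csv.reader([row]))
    return path, hash_field, int(size) if size else None


def row_matches(row: str, data: bytes) -> bool:
    """Check that a RECORD row agrees with the bytes of the file it names."""
    _path, hash_field, size = parse_record_row(row)
    if not hash_field:
        # The RECORD file's own row carries no hash or size.
        return True
    return hash_field == record_hash(data) and size == len(data)
```

With the wheel unpacked, `row_matches(row, open(path, "rb").read())` could be run for each row; a changed hash between two versions, as in the METADATA row above, simply means the file's bytes differ.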