kreuzberg 3.6.0__py3-none-any.whl → 3.6.2__py3-none-any.whl

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: kreuzberg
- Version: 3.6.0
+ Version: 3.6.2
  Summary: A text extraction library supporting PDFs, images, office documents and more
  Project-URL: homepage, https://github.com/Goldziher/kreuzberg
  Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
@@ -25,7 +25,7 @@ Requires-Python: >=3.10
  Requires-Dist: anyio>=4.9.0
  Requires-Dist: charset-normalizer>=3.4.2
  Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
- Requires-Dist: html-to-markdown>=1.4.0
+ Requires-Dist: html-to-markdown[lxml]>=1.6.0
  Requires-Dist: msgspec>=0.18.0
  Requires-Dist: playa-pdf>=0.6.1
  Requires-Dist: psutil>=7.0.0
@@ -83,8 +83,8 @@ Description-Content-Type: text/markdown
 
  ## Why Kreuzberg?
 
- - **🚀 Fastest Performance**: [Benchmarked](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) as the fastest text extraction library
- - **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+)
+ - **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library
+ - **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+) with lowest memory usage (~530MB)
  - **⚡ Dual APIs**: Only library with both sync and async support
  - **🔧 Zero Configuration**: Works out of the box with sane defaults
  - **🏠 Local Processing**: No cloud dependencies or external API calls
@@ -140,13 +140,13 @@ asyncio.run(main())
 
  ```bash
  # Run API server
- docker run -p 8000:8000 goldziher/kreuzberg:3.4.0
+ docker run -p 8000:8000 goldziher/kreuzberg:latest
 
  # Extract files
  curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
  ```
 
- Available variants: `3.4.0`, `3.4.0-easyocr`, `3.4.0-paddle`, `3.4.0-gmft`, `3.4.0-all`
+ Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`
 
  ### 🌐 REST API
 
@@ -191,15 +191,20 @@ kreuzberg extract *.pdf --output-dir ./extracted/
 
  ## Performance
 
- **Fastest extraction speeds** with minimal resource usage:
+ **[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):
 
- | Library | Speed | Memory | Size | Success Rate |
- | ------------- | -------------- | ------------- | ----------- | ------------ |
- | **Kreuzberg** | **Fastest** | 💾 **Lowest** | 📦 **71MB** | **100%** |
- | Unstructured | 2-3x slower | 2x higher | 146MB | 95% |
- | MarkItDown | 3-4x slower | 3x higher | 251MB | 90% |
- | Docling | 4-5x slower | 10x higher | 1,032MB | 85% |
+ | Library | Speed | Memory | Install Size | Dependencies | Success Rate |
+ | ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
+ | **Kreuzberg** | **35+ files/s** | **530MB** | **71MB** | **20** | High\* |
+ | Unstructured | Moderate | ~1GB | 146MB | 54 | 88%+ |
+ | MarkItDown | Good† | ~1.5GB | 251MB | 25 | 80%† |
+ | Docling | 60+ min/file‡ | ~5GB | 1,032MB | 88 | Low‡ |
 
+ \*_Can achieve 75% reliability with 15% performance trade-off when configured_
+ †_Good on simple documents, struggles with large/complex files (>10MB)_
+ ‡_Frequently fails/times out on medium files (>1MB)_
+
+ > **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
  > **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
 
  ## Documentation
  ## Documentation
@@ -233,7 +238,7 @@ ______________________________________________________________________
 
  <div align="center">
 
- **[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Discord](https://discord.gg/pXxagNK2zN)**
+ **[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**
 
  Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
 
@@ -47,8 +47,8 @@ kreuzberg/_utils/_serialization.py,sha256=AhZvyAu4KsjAqyZDh--Kn2kSWGgCuH7udio8lT
  kreuzberg/_utils/_string.py,sha256=owIVkUtP0__GiJD9RIJzPdvyIigT5sQho3mOXPbsnW0,958
  kreuzberg/_utils/_sync.py,sha256=oT4Y_cDBKtE_BFEoLTae3rSisqlYXzW-jlUG_x-dmLM,4725
  kreuzberg/_utils/_tmp.py,sha256=hVn-VVijIg2FM7EZJ899gc7wZg-TGoJZoeAcxMX-Cxg,1044
- kreuzberg-3.6.0.dist-info/METADATA,sha256=zlqw5yTQit-jYeZVnM27kPsn2mCfulpL8wssptrQR8Q,9160
- kreuzberg-3.6.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- kreuzberg-3.6.0.dist-info/entry_points.txt,sha256=VdoFaTl3QSvVWOZcIlPpDd47o6kn7EvmXSs8FI0ExLc,48
- kreuzberg-3.6.0.dist-info/licenses/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
- kreuzberg-3.6.0.dist-info/RECORD,,
+ kreuzberg-3.6.2.dist-info/METADATA,sha256=shguv5yge8FkD9aT0x02dRdLpuLi1PW4SmczFYiILmU,9910
+ kreuzberg-3.6.2.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ kreuzberg-3.6.2.dist-info/entry_points.txt,sha256=VdoFaTl3QSvVWOZcIlPpDd47o6kn7EvmXSs8FI0ExLc,48
+ kreuzberg-3.6.2.dist-info/licenses/LICENSE,sha256=-8caMvpCK8SgZ5LlRKhGCMtYDEXqTKH9X8pFEhl91_4,1066
+ kreuzberg-3.6.2.dist-info/RECORD,,
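
The RECORD rows in the last hunk use the wheel layout `path,sha256=<digest>,size`, where the digest is the SHA-256 of the file encoded as URL-safe base64 with the trailing `=` padding removed; the row for RECORD itself leaves both fields empty. Only METADATA changes hash and size here (9160 to 9910 bytes), while WHEEL, entry_points.txt and LICENSE keep identical digests under the renamed directory. A minimal sketch of checking a file against its row could look like the following. The helper names `record_digest`, `parse_record_line` and `matches_record` are hypothetical and are not part of kreuzberg or any packaging tool.

```python
from __future__ import annotations

import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """Hash bytes the way RECORD stores them: sha256, URL-safe base64, no '=' padding."""
    raw = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record_line(line: str) -> tuple[str, str, int | None]:
    """Split one RECORD row into (path, hash, size).

    RECORD is a CSV file, so the csv module handles quoted paths.
    Empty hash/size fields (as in the RECORD row itself) become "" and None.
    """
    path, digest, size = next(csv.reader([line]))
    return path, digest, int(size) if size else None


def matches_record(line: str, data: bytes) -> bool:
    """Return True when both the digest and the byte size agree with the row."""
    _, digest, size = parse_record_line(line)
    return digest == record_digest(data) and size == len(data)
```

To use it, read the file's bytes out of the unpacked wheel and pass them together with the corresponding RECORD row to `matches_record`.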