kreuzberg 3.8.1__tar.gz → 3.9.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (227)
  1. kreuzberg-3.9.0/.deepsource.toml +54 -0
  2. kreuzberg-3.9.0/.github/workflows/ci.yaml +197 -0
  3. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.gitignore +3 -0
  4. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.pre-commit-config.yaml +1 -1
  5. kreuzberg-3.9.0/PKG-INFO +269 -0
  6. kreuzberg-3.9.0/README.md +183 -0
  7. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/benchmark_baseline.py +1 -2
  8. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/end_to_end_benchmark.py +3 -4
  9. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/serialization_benchmark.py +2 -3
  10. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/models.py +7 -7
  11. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/runner.py +1 -1
  12. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/statistical_benchmark.py +2 -3
  13. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/advanced/custom-extractors.md +1 -1
  14. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/advanced/index.md +1 -1
  15. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/advanced/performance.md +6 -6
  16. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/index.md +1 -1
  17. kreuzberg-3.9.0/docs/changelog.md +49 -0
  18. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/getting-started/installation.md +2 -2
  19. kreuzberg-3.9.0/docs/index.md +59 -0
  20. kreuzberg-3.9.0/docs/performance-analysis.md +168 -0
  21. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/basic-usage.md +29 -1
  22. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/docker.md +99 -4
  23. kreuzberg-3.9.0/docs/user-guide/document-classification.md +53 -0
  24. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/extraction-configuration.md +151 -2
  25. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/index.md +2 -1
  26. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/mcp-server.md +13 -13
  27. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/__init__.py +4 -0
  28. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_api/main.py +22 -1
  29. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_chunker.py +3 -3
  30. kreuzberg-3.9.0/kreuzberg/_config.py +404 -0
  31. kreuzberg-3.9.0/kreuzberg/_document_classification.py +156 -0
  32. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_entity_extraction.py +6 -6
  33. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_image.py +4 -3
  34. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_pdf.py +40 -29
  35. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_spread_sheet.py +6 -8
  36. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_structured.py +34 -25
  37. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_gmft.py +33 -42
  38. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_language_detection.py +1 -1
  39. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_mcp/server.py +58 -8
  40. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_mime_types.py +1 -1
  41. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_ocr/_base.py +1 -1
  42. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_ocr/_easyocr.py +5 -5
  43. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_ocr/_paddleocr.py +4 -4
  44. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_ocr/_tesseract.py +12 -21
  45. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_playa.py +2 -3
  46. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_types.py +65 -27
  47. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_cache.py +14 -17
  48. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_device.py +17 -27
  49. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_errors.py +41 -38
  50. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_quality.py +7 -11
  51. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_serialization.py +21 -16
  52. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_string.py +22 -12
  53. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_table.py +3 -4
  54. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/cli.py +5 -5
  55. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/exceptions.py +10 -0
  56. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/extraction.py +20 -11
  57. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/mkdocs.yaml +2 -1
  58. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/pyproject.toml +28 -11
  59. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/cli_test.py +1 -1
  60. kreuzberg-3.9.0/tests/config_test.py +401 -0
  61. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/conftest.py +5 -0
  62. kreuzberg-3.9.0/tests/document_classification_test.py +86 -0
  63. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/entity_extraction_test.py +2 -2
  64. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/image_test.py +35 -8
  65. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/pdf_test.py +0 -2
  66. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/gmft_test.py +3 -3
  67. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/mcp_server_test.py +8 -0
  68. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/multiprocessing/gmft_integration_test.py +2 -1
  69. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/multiprocessing/gmft_isolated_test.py +55 -50
  70. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/device_integration_test.py +14 -13
  71. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/paddleocr_test.py +0 -5
  72. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/tesseract_test.py +2 -2
  73. kreuzberg-3.9.0/tests/test_source_files/contract_test.txt +4 -0
  74. kreuzberg-3.9.0/tests/test_source_files/form_test.txt +5 -0
  75. kreuzberg-3.9.0/tests/test_source_files/images/test_hello_world.png +0 -0
  76. kreuzberg-3.9.0/tests/test_source_files/invoice_image.png +0 -0
  77. kreuzberg-3.9.0/tests/test_source_files/invoice_test.txt +4 -0
  78. kreuzberg-3.9.0/tests/test_source_files/receipt_test.txt +5 -0
  79. kreuzberg-3.9.0/tests/test_source_files/report_test.txt +4 -0
  80. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/types_test.py +60 -1
  81. kreuzberg-3.9.0/tests/utils/__init__.py +0 -0
  82. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/uv.lock +223 -199
  83. kreuzberg-3.8.1/.github/workflows/ci.yaml +0 -124
  84. kreuzberg-3.8.1/PKG-INFO +0 -301
  85. kreuzberg-3.8.1/README.md +0 -219
  86. kreuzberg-3.8.1/docs/changelog.md +0 -32
  87. kreuzberg-3.8.1/docs/index.md +0 -54
  88. kreuzberg-3.8.1/docs/performance-analysis.md +0 -140
  89. kreuzberg-3.8.1/kreuzberg/_cli_config.py +0 -175
  90. kreuzberg-3.8.1/tests/test_source_files/toml/sample-config.toml +0 -33
  91. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.commitlintrc +0 -0
  92. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.docker/Dockerfile +0 -0
  93. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.docker/README.md +0 -0
  94. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.dockerignore +0 -0
  95. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.github/dependabot.yaml +0 -0
  96. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.github/workflows/docs.yml +0 -0
  97. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.github/workflows/pr-title.yaml +0 -0
  98. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.github/workflows/publish-docker.yml +0 -0
  99. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.github/workflows/release.yaml +0 -0
  100. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/.markdownlint.yaml +0 -0
  101. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/LICENSE +0 -0
  102. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/ai-rulez.yaml +0 -0
  103. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/README.md +0 -0
  104. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/final_benchmark.py +0 -0
  105. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/pyproject.toml +0 -0
  106. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/baseline_results.json +0 -0
  107. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/benchmark_msgpack_20250702_003800.json +0 -0
  108. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/comprehensive_caching_results.json +0 -0
  109. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/final_benchmark_results.json +0 -0
  110. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/latest.json +0 -0
  111. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/mime_caching_results.json +0 -0
  112. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/msgspec_caching_results.json +0 -0
  113. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/ocr_caching_results.json +0 -0
  114. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/serialization_benchmark_results.json +0 -0
  115. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/statistical_benchmark_results.json +0 -0
  116. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/results/table_caching_results.json +0 -0
  117. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/__init__.py +0 -0
  118. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/__main__.py +0 -0
  119. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/benchmarks.py +0 -0
  120. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/cli.py +0 -0
  121. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/benchmarks/src/kreuzberg_benchmarks/profiler.py +0 -0
  122. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/advanced/custom-hooks.md +0 -0
  123. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/advanced/error-handling.md +0 -0
  124. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/exceptions.md +0 -0
  125. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/extraction-functions.md +0 -0
  126. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/extractor-registry.md +0 -0
  127. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/ocr-configuration.md +0 -0
  128. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/api-reference/types.md +0 -0
  129. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/assets/favicon.png +0 -0
  130. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/assets/logo.png +0 -0
  131. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/cli.md +0 -0
  132. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/contributing.md +0 -0
  133. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/css/extra.css +0 -0
  134. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/examples/extraction-examples.md +0 -0
  135. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/examples/index.md +0 -0
  136. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/getting-started/index.md +0 -0
  137. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/getting-started/quick-start.md +0 -0
  138. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/api-server.md +0 -0
  139. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/chunking.md +0 -0
  140. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/metadata-extraction.md +0 -0
  141. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/ocr-backends.md +0 -0
  142. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/ocr-configuration.md +0 -0
  143. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/docs/user-guide/supported-formats.md +0 -0
  144. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/__main__.py +0 -0
  145. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_api/__init__.py +0 -0
  146. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_constants.py +0 -0
  147. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/__init__.py +0 -0
  148. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_base.py +0 -0
  149. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_email.py +0 -0
  150. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_html.py +0 -0
  151. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_pandoc.py +0 -0
  152. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_extractors/_presentation.py +0 -0
  153. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_mcp/__init__.py +0 -0
  154. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_ocr/__init__.py +0 -0
  155. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_registry.py +0 -0
  156. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/__init__.py +0 -0
  157. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_document_cache.py +0 -0
  158. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_pdf_lock.py +0 -0
  159. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_process_pool.py +0 -0
  160. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_sync.py +0 -0
  161. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/_utils/_tmp.py +0 -0
  162. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/kreuzberg/py.typed +0 -0
  163. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/__init__.py +0 -0
  164. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/api/__init__.py +0 -0
  165. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/api/main_test.py +0 -0
  166. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/chunker_test.py +0 -0
  167. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/cli_integration_test.py +0 -0
  168. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/exceptions_test.py +0 -0
  169. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extraction_batch_test.py +0 -0
  170. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extraction_test.py +0 -0
  171. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/__init__.py +0 -0
  172. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/email_comprehensive_test.py +0 -0
  173. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/email_test.py +0 -0
  174. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/html_test.py +0 -0
  175. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/pandoc_metadata_test.py +0 -0
  176. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/pandoc_test.py +0 -0
  177. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/presentation_test.py +0 -0
  178. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/spreed_sheet_test.py +0 -0
  179. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/extractors/structured_test.py +0 -0
  180. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/gmft_extended_test.py +0 -0
  181. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/hooks_test.py +0 -0
  182. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/language_detection_test.py +0 -0
  183. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/mime_types_test.py +0 -0
  184. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/multiprocessing/__init__.py +0 -0
  185. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/multiprocessing/process_manager_test.py +0 -0
  186. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/multiprocessing/tesseract_pool_test.py +0 -0
  187. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/__init__.py +0 -0
  188. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/base_test.py +0 -0
  189. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/easyocr_test.py +0 -0
  190. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/ocr/init_test.py +0 -0
  191. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/playa_test.py +0 -0
  192. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/registry_test.py +0 -0
  193. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/better-ocr-image.jpg +0 -0
  194. /kreuzberg-3.8.1/tests/utils/__init__.py → /kreuzberg-3.9.0/tests/test_source_files/contract.txt +0 -0
  195. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/document.docx +0 -0
  196. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/email/sample-email.eml +0 -0
  197. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/excel-multi-sheet.xlsx +0 -0
  198. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/excel.xlsx +0 -0
  199. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/french-text.txt +0 -0
  200. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/german-text.txt +0 -0
  201. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/html.html +0 -0
  202. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/json/sample-document.json +0 -0
  203. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/layout-parser-ocr.jpg +0 -0
  204. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/markdown.md +0 -0
  205. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/non-ascii-text.pdf +0 -0
  206. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/non-searchable.pdf +0 -0
  207. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/ocr-image.jpg +0 -0
  208. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/pdfs_with_tables/large.pdf +0 -0
  209. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/pdfs_with_tables/medium.pdf +0 -0
  210. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/pdfs_with_tables/tiny.pdf +0 -0
  211. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/pitch-deck-presentation.pptx +0 -0
  212. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/sample-contract.pdf +0 -0
  213. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/scanned.pdf +0 -0
  214. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/searchable.pdf +0 -0
  215. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/spanish-text.txt +0 -0
  216. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/test-article.pdf +0 -0
  217. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/test_source_files/yaml/sample-config.yaml +0 -0
  218. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/cache_test.py +0 -0
  219. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/device_test.py +0 -0
  220. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/errors_test.py +0 -0
  221. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/pdf_lock_test.py +0 -0
  222. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/process_pool_test.py +0 -0
  223. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/serialization_test.py +0 -0
  224. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/string_test.py +0 -0
  225. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/sync_test.py +0 -0
  226. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/table_test.py +0 -0
  227. {kreuzberg-3.8.1 → kreuzberg-3.9.0}/tests/utils/tmp_test.py +0 -0
@@ -0,0 +1,54 @@
+ version = 1
+
+ test_patterns = ["tests/**"]
+
+ exclude_patterns = [
+ # Virtual environments
+ ".venv/**",
+ "venv/**",
+
+ # Build and distribution artifacts
+ "dist/**",
+ "build/**",
+ "*.egg-info/**",
+
+ # Documentation
+ "docs/**",
+ "site/**",
+
+ # Cache directories
+ "**/__pycache__/**",
+ ".pytest_cache/**",
+ ".mypy_cache/**",
+ ".ruff_cache/**",
+ ".coverage",
+ "htmlcov/**",
+
+ # Benchmarks and performance tests
+ "benchmarks/**",
+
+ # IDE and editor files
+ ".idea/**",
+ ".vscode/**",
+
+ # Version control
+ ".git/**",
+
+ # Temporary and generated files
+ "*.pyc",
+ ".DS_Store",
+ "*.swp",
+ "*.swo",
+ ]
+
+ [[analyzers]]
+ name = "test-coverage"
+
+ [[analyzers]]
+ name = "python"
+
+ [analyzers.meta]
+ runtime_version = "3.x.x"
+
+ [[transformers]]
+ name = "ruff"
@@ -0,0 +1,197 @@
+ name: CI
+
+ on:
+   pull_request:
+     branches:
+       - main
+   push:
+     branches:
+       - main
+       - feat/smart-multiprocessing
+
+ jobs:
+   validate:
+     runs-on: ubuntu-latest
+     timeout-minutes: 10
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v6
+         with:
+           enable-cache: true
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version-file: "pyproject.toml"
+
+       - name: Install Dependencies
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 5
+           max_attempts: 3
+           retry_wait_seconds: 30
+           command: |
+             if [[ "${{ runner.os }}" == "Windows" ]] && [[ -d ".venv" ]]; then
+               echo "Removing existing .venv directory on Windows"
+               rm -rf .venv
+             fi
+             uv sync --all-packages --all-extras --dev
+           shell: bash
+
+       - name: Load Cached Pre-Commit Dependencies
+         id: cached-pre-commit-dependencies
+         uses: actions/cache@v4
+         with:
+           path: ~/.cache/pre-commit/
+           key: pre-commit|${{ env.pythonLocation }}|${{ hashFiles('.pre-commit-config.yaml') }}
+
+       - name: Execute Pre-Commit
+         run: uv run pre-commit run --show-diff-on-failure --color=always --all-files
+
+   test:
+     strategy:
+       matrix:
+         os: [ ubuntu-latest, macOS-latest, windows-latest ]
+         python: ${{ github.event_name == 'pull_request' && fromJSON('["3.13"]') || fromJSON('["3.10", "3.11", "3.12", "3.13"]') }}
+     runs-on: ${{ matrix.os }}
+     timeout-minutes: 30
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v6
+         with:
+           enable-cache: true
+
+       - name: Install Python
+         uses: actions/setup-python@v5
+         id: setup-python
+         with:
+           python-version: ${{ matrix.python }}
+
+       - name: Cache Python Dependencies
+         id: python-cache
+         uses: actions/cache@v4
+         with:
+           path: |
+             ~/.cache/uv
+             .venv
+           key: python-dependencies-${{ matrix.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('uv.lock') }}
+           restore-keys: |
+             python-dependencies-${{ matrix.os }}-${{ matrix.python }}-
+
+       - name: Install Dependencies
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 5
+           max_attempts: 3
+           retry_wait_seconds: 30
+           command: |
+             if [[ "${{ runner.os }}" == "Windows" ]] && [[ -d ".venv" ]]; then
+               echo "Removing existing .venv directory on Windows"
+               rm -rf .venv
+             fi
+             uv sync --all-packages --all-extras --dev
+           shell: bash
+
+       - name: Cache Test Artifacts
+         uses: actions/cache@v4
+         with:
+           path: .pytest_cache/
+           key: pytest-cache-${{ matrix.os }}-${{ matrix.python }}
+
+       - name: Cache and Install Homebrew (macOS)
+         if: runner.os == 'macOS'
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 10
+           max_attempts: 3
+           retry_wait_seconds: 30
+           command: |
+             # Using the underlying homebrew commands instead of the action
+             brew update || true
+             brew install tesseract tesseract-lang pandoc || brew upgrade tesseract tesseract-lang pandoc || true
+             brew list tesseract tesseract-lang pandoc
+           shell: bash
+
+       - name: Cache and Install APT Packages (Linux)
+         if: runner.os == 'Linux'
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 5
+           max_attempts: 3
+           retry_wait_seconds: 30
+           command: |
+             sudo apt-get update
+             sudo apt-get install -y tesseract-ocr tesseract-ocr-deu pandoc
+           shell: bash
+
+       - name: Install System Dependencies (Windows)
+         if: runner.os == 'Windows'
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 10
+           max_attempts: 3
+           retry_wait_seconds: 30
+           command: |
+             choco install -y tesseract pandoc --no-progress
+             Write-Output "C:\Program Files\Tesseract-OCR" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+             Write-Output "C:\Program Files\Pandoc" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+             $env:PATH = "C:\Program Files\Tesseract-OCR;C:\Program Files\Pandoc;" + $env:PATH
+             tesseract --version
+             pandoc --version
+           shell: pwsh
+
+       - name: Clean Coverage Data
+         run: |
+           rm -f .coverage .coverage.* coverage.lcov htmlcov/* || true
+         shell: bash
+
+       - name: Run Tests with Coverage
+         run: |
+           uv run coverage erase
+           uv run pytest -s -vvv --cov=kreuzberg --cov-report=lcov:coverage.lcov --cov-report=term --cov-config=pyproject.toml
+
+       - name: Upload Coverage Artifacts
+         if: matrix.os == 'ubuntu-latest' && matrix.python == '3.13'
+         uses: actions/upload-artifact@v4
+         with:
+           name: coverage-report
+           path: coverage.lcov
+           retention-days: 1
+
+   upload-coverage:
+     needs: test
+     runs-on: ubuntu-latest
+     if: github.event_name == 'push' || github.event_name == 'pull_request'
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+         with:
+           ref: ${{ github.event.pull_request.head.sha || github.sha }}
+
+       - name: Download Coverage Artifacts
+         uses: actions/download-artifact@v4
+         with:
+           name: coverage-report
+           path: .
+
+       - name: Install DeepSource CLI
+         uses: nick-fields/retry@v3
+         with:
+           timeout_minutes: 3
+           max_attempts: 3
+           retry_wait_seconds: 10
+           command: |
+             curl -fsSL https://deepsource.io/cli | sh
+           shell: bash
+
+       - name: Upload Coverage to DeepSource
+         env:
+           DEEPSOURCE_DSN: ${{ secrets.DEEPSOURCE_DSN }}
+         run: |
+           ./bin/deepsource report --analyzer test-coverage --key python --value-file ./coverage.lcov
@@ -1,5 +1,6 @@
  *$py.class
  *.Cache
+ .clause/
  *.cscfg
  *.egg-info/
  *.log
@@ -9,6 +10,8 @@
  *temp/
  .coverage
  .coverage*
+ coverage.lcov
+ htmlcov/
  .cursorrules
  .dist/
  .DS_store
@@ -53,7 +53,7 @@ repos:
      hooks:
        - id: pyproject-fmt
    - repo: https://github.com/astral-sh/ruff-pre-commit
-     rev: v0.12.2
+     rev: v0.12.3
      hooks:
        - id: ruff
          args: ["--fix", "--unsafe-fixes"]
@@ -0,0 +1,269 @@
+ Metadata-Version: 2.4
+ Name: kreuzberg
+ Version: 3.9.0
+ Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
+ Project-URL: documentation, https://kreuzberg.dev
+ Project-URL: homepage, https://github.com/Goldziher/kreuzberg
+ Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
+ License: MIT
+ License-File: LICENSE
+ Keywords: async,document-analysis,document-classification,document-intelligence,document-processing,extensible,information-extraction,mcp,metadata-extraction,model-context-protocol,ocr,pandoc,pdf-extraction,pdfium,plugin-architecture,rag,retrieval-augmented-generation,structured-data,table-extraction,tesseract,text-extraction
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Information Technology
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Database
+ Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
+ Classifier: Topic :: Office/Business :: Office Suites
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: Text Processing :: General
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.10
+ Requires-Dist: anyio>=4.9.0
+ Requires-Dist: chardetng-py>=0.3.4
+ Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
+ Requires-Dist: html-to-markdown[lxml]>=1.8.0
+ Requires-Dist: mcp>=1.11.0
+ Requires-Dist: msgspec>=0.18.0
+ Requires-Dist: playa-pdf>=0.6.1
+ Requires-Dist: psutil>=7.0.0
+ Requires-Dist: pypdfium2==4.30.0
+ Requires-Dist: python-calamine>=0.3.2
+ Requires-Dist: python-pptx>=1.0.2
+ Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
+ Provides-Extra: additional-extensions
+ Requires-Dist: mailparse>=1.0.15; extra == 'additional-extensions'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'additional-extensions'
+ Provides-Extra: all
+ Requires-Dist: click>=8.2.1; extra == 'all'
+ Requires-Dist: easyocr>=1.7.2; extra == 'all'
+ Requires-Dist: fast-langdetect>=0.3.2; extra == 'all'
+ Requires-Dist: gmft>=0.4.2; extra == 'all'
+ Requires-Dist: keybert>=0.9.0; extra == 'all'
+ Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.16.0; extra == 'all'
+ Requires-Dist: mailparse>=1.0.15; extra == 'all'
+ Requires-Dist: paddleocr>=3.1.0; extra == 'all'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
+ Requires-Dist: rich>=14.0.0; extra == 'all'
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
+ Requires-Dist: setuptools>=80.9.0; extra == 'all'
+ Requires-Dist: spacy>=3.8.7; extra == 'all'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
+ Provides-Extra: api
+ Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.16.0; extra == 'api'
+ Provides-Extra: auto-classify-document-type
+ Requires-Dist: deep-translator>=1.11.4; extra == 'auto-classify-document-type'
+ Requires-Dist: pandas>=2.3.1; extra == 'auto-classify-document-type'
+ Provides-Extra: chunking
+ Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
+ Provides-Extra: cli
+ Requires-Dist: click>=8.2.1; extra == 'cli'
+ Requires-Dist: rich>=14.0.0; extra == 'cli'
+ Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
+ Provides-Extra: easyocr
+ Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
+ Provides-Extra: entity-extraction
+ Requires-Dist: keybert>=0.9.0; extra == 'entity-extraction'
+ Requires-Dist: spacy>=3.8.7; extra == 'entity-extraction'
+ Provides-Extra: gmft
+ Requires-Dist: gmft>=0.4.2; extra == 'gmft'
+ Provides-Extra: langdetect
+ Requires-Dist: fast-langdetect>=0.3.2; extra == 'langdetect'
+ Provides-Extra: paddleocr
+ Requires-Dist: paddleocr>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: paddlepaddle>=3.1.0; extra == 'paddleocr'
+ Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
+ Description-Content-Type: text/markdown
+
87
+ # Kreuzberg
88
+
89
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
90
+ [![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
91
+ [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
92
+ [![Benchmarks](https://img.shields.io/badge/benchmarks-fastest%20CPU-orange)](https://benchmarks.kreuzberg.dev/)
93
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
94
+ [![DeepSource](https://app.deepsource.com/gh/Goldziher/kreuzberg.svg/?label=code+coverage&show_trend=true&token=U8AW1VWWSLwVhrbtL8LmLBDN)](https://app.deepsource.com/gh/Goldziher/kreuzberg/)
95
+
96
+ **A document intelligence framework for Python.** Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.
97
+
98
+ 📖 **[Complete Documentation](https://kreuzberg.dev/)**
99
+
100
+ ## Framework Overview
101
+
102
+ ### Document Intelligence Capabilities
103
+
104
+ - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
105
+ - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
106
+ - **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
107
+ - **OCR Integration**: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
108
+ - **Table Detection**: Structured table extraction with cell-level precision via GMFT integration
109
+ - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)
110
+
111
+ ### Technical Architecture
112
+
113
+ - **Performance**: Highest benchmarked throughput among Python document processing frameworks (30+ files/second on tiny files with the sync API)
114
+ - **Resource Efficiency**: 71 MB installation, ~360 MB runtime memory footprint (sync API)
115
+ - **Extensibility**: Plugin architecture for custom extractors via the Extractor base class
116
+ - **API Design**: Synchronous and asynchronous APIs with consistent interfaces
117
+ - **Type Safety**: Complete type annotations throughout the codebase
118
+
119
+ ### Open Source Foundation
120
+
121
+ Kreuzberg leverages established open source technologies:
122
+
123
+ - **Pandoc**: Universal document converter for robust format support
124
+ - **PDFium**: Google's PDF rendering engine for accurate PDF processing
125
+ - **Tesseract**: Open source OCR engine, long sponsored by Google, for text recognition
126
+ - **python-docx / python-pptx**: Native Microsoft Office format support
127
+
128
+ ## Quick Start
129
+
130
+ ### Extract Text with CLI
131
+
132
+ ```bash
133
+ # Extract text from any file to markdown
134
+ uvx kreuzberg extract document.pdf > output.md
135
+
136
+ # With all features (OCR, table extraction, etc.)
137
+ uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown
138
+
139
+ # Extract with rich metadata
140
+ uvx kreuzberg extract report.pdf --show-metadata --format json
141
+ ```
142
+
143
+ ### Python Usage
144
+
145
+ **Async (recommended for web apps):**
146
+
147
+ ```python
148
+ from kreuzberg import extract_file
149
+
150
+ # In your async function
151
+ result = await extract_file("presentation.pptx")
152
+ print(result.content)
153
+
154
+ # Rich metadata extraction
155
+ print(f"Title: {result.metadata.title}")
156
+ print(f"Author: {result.metadata.author}")
157
+ print(f"Page count: {result.metadata.page_count}")
158
+ print(f"Created: {result.metadata.created_at}")
159
+ ```
160
+
161
+ **Sync (for scripts and CLI tools):**
162
+
163
+ ```python
164
+ from kreuzberg import extract_file_sync
165
+
166
+ result = extract_file_sync("report.docx")
167
+ print(result.content)
168
+
169
+ # Access rich metadata
170
+ print(f"Language: {result.metadata.language}")
171
+ print(f"Word count: {result.metadata.word_count}")
172
+ print(f"Keywords: {result.metadata.keywords}")
173
+ ```
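When many files need processing, the async API can be combined with bounded concurrency. The following is a minimal sketch, not part of Kreuzberg itself: `extract_many` and `max_concurrency` are hypothetical names, and `extract` is assumed to be any coroutine function that takes a path, such as `extract_file` above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> dict[str, Any]:
    """Run an async extractor over many paths with bounded concurrency.

    `extract` is assumed to be a coroutine function taking a single path.
    A failed extraction is returned as the exception object for that path,
    so one bad file does not abort the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> Any:
        # At most `max_concurrency` extractions run at the same time.
        async with semaphore:
            return await extract(path)

    path_list = list(paths)
    results = await asyncio.gather(
        *(guarded(p) for p in path_list), return_exceptions=True
    )
    return dict(zip(path_list, results))
```

A call might look like `asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))`; callers can then check each value with `isinstance(value, Exception)` before using it.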
174
+
175
+ ### Docker
176
+
177
+ ```bash
178
+ # Run the REST API
179
+ docker run -p 8000:8000 goldziher/kreuzberg
180
+
181
+ # Extract via API
182
+ curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
183
+ ```
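The same request can be made from Python using only the standard library. This is a sketch under assumptions: it mirrors the `curl` call above (form field `file`, endpoint `http://localhost:8000/extract`), the helper names `build_multipart` and `extract_via_api` are hypothetical, and the response is returned as raw bytes because its schema is not described here.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST a local file to the endpoint and return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the container running, `extract_via_api("document.pdf")` sends the same upload as the `curl` command.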
184
+
185
+ 📖 **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** • **[CLI Documentation](https://kreuzberg.dev/cli/)** • **[API Reference](https://kreuzberg.dev/api-reference/)**
186
+
187
+ ## Deployment Options
188
+
189
+ ### 🤖 MCP Server (AI Integration)
190
+
191
+ **Add with one command using the Claude Code CLI:**
192
+
193
+ ```bash
194
+ claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp
195
+ ```
196
+
197
+ **Or configure manually in `claude_desktop_config.json`:**
198
+
199
+ ```json
200
+ {
201
+   "mcpServers": {
202
+     "kreuzberg": {
203
+       "command": "uvx",
204
+       "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
205
+     }
206
+   }
207
+ }
208
+ ```
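If the config file already lists other MCP servers, the entry can be merged in rather than pasted by hand. The sketch below is illustrative only: `add_kreuzberg_server` is a hypothetical helper, and the location of `claude_desktop_config.json` varies by platform, so the path is passed in by the caller.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config_path: Path) -> dict:
    """Insert the kreuzberg MCP entry into a JSON config, keeping other servers."""
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["kreuzberg"] = {
        "command": "uvx",
        "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"],
    }
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Existing entries under `mcpServers` are left untouched; only the `kreuzberg` key is added or overwritten.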
209
+
210
+ **MCP capabilities:**
211
+
212
+ - Extract text from PDFs, images, Office docs, and more
213
+ - Full OCR support with multiple engines
214
+ - Table extraction and metadata parsing
215
+
216
+ 📖 **[MCP Documentation](https://kreuzberg.dev/user-guide/mcp-server/)**
217
+
218
+ ## Supported Formats
219
+
220
+ | Category | Formats |
221
+ | ----------------- | ------------------------------ |
222
+ | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
223
+ | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
224
+ | **Spreadsheets** | XLSX, XLS, CSV, ODS |
225
+ | **Presentations** | PPTX, PPT, ODP |
226
+ | **Web** | HTML, XML, MHTML |
227
+ | **Archives** | Support via extraction |
228
+
229
+ ## 📊 Performance Characteristics
230
+
231
+ [View comprehensive benchmarks](https://benchmarks.kreuzberg.dev/) • [Benchmark methodology](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://kreuzberg.dev/performance-analysis/)
232
+
233
+ ### Technical Specifications
234
+
235
+ | Metric | Kreuzberg Sync | Kreuzberg Async | Benchmark ranking |
236
+ | ---------------------------- | -------------- | --------------- | ------------------ |
237
+ | **Throughput (tiny files)** | 31.78 files/s | 23.94 files/s | Highest throughput |
238
+ | **Throughput (small files)** | 8.91 files/s | 9.31 files/s | Highest throughput |
239
+ | **Memory footprint** | 359.8 MB | 395.2 MB | Lowest usage |
240
+ | **Installation size** | 71 MB | 71 MB | Smallest size |
241
+ | **Success rate** | 100% | 100% | Perfect |
242
+ | **Supported formats** | 18 | 18 | Comprehensive |
243
+
244
+ ### Architecture Advantages
245
+
246
+ - **Native libraries**: Built on PDFium and Tesseract (native C/C++ code) for high performance
247
+ - **Async/await support**: True asynchronous processing with intelligent task scheduling
248
+ - **Memory efficiency**: Streaming architecture minimizes memory allocation
249
+ - **Process pooling**: Automatic multiprocessing for CPU-intensive operations
250
+ - **Optimized data flow**: Efficient data handling with minimal transformations
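The general idea behind combining async code with pooled workers can be sketched with the standard library alone. This is not Kreuzberg's internal implementation: `run_in_pool` and `count_words` are hypothetical names, and `count_words` merely stands in for a CPU-bound step such as OCR.

```python
import asyncio
from concurrent.futures import Executor


def count_words(text: str) -> int:
    """Stand-in for a CPU-bound step such as OCR or page rendering."""
    return len(text.split())


async def run_in_pool(executor: Executor, func, items):
    """Dispatch func(item) to the executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, func, item) for item in items]
    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*futures)
```

Passing a `concurrent.futures.ProcessPoolExecutor` gives process-level parallelism for CPU-bound work; in that case the function must be defined at module level so it can be pickled. A `ThreadPoolExecutor` works with the same code for I/O-bound steps.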
251
+
252
+ > **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
253
+
254
+ ## Documentation
255
+
256
+ ### Quick Links
257
+
258
+ - [Installation Guide](https://kreuzberg.dev/getting-started/installation/) - Setup and dependencies
259
+ - [User Guide](https://kreuzberg.dev/user-guide/) - Comprehensive usage guide
260
+ - [Performance Analysis](https://kreuzberg.dev/performance-analysis/) - Detailed benchmark results
261
+ - [API Reference](https://kreuzberg.dev/api-reference/) - Complete API documentation
262
+ - [Docker Guide](https://kreuzberg.dev/user-guide/docker/) - Container deployment
263
+ - [REST API](https://kreuzberg.dev/user-guide/api-server/) - HTTP endpoints
264
+ - [CLI Guide](https://kreuzberg.dev/cli/) - Command-line usage
265
+ - [OCR Configuration](https://kreuzberg.dev/user-guide/ocr-configuration/) - OCR engine setup
266
+
267
+ ## License
268
+
269
+ MIT License - see [LICENSE](LICENSE) for details.