kreuzberg 3.15.0__tar.gz → 3.17.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (368)
  1. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/ci.yaml +43 -73
  2. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/docker-e2e-tests.yml +4 -0
  3. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/docs.yml +1 -1
  4. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/test-docker-builds.yml +4 -0
  5. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.pre-commit-config.yaml +4 -3
  6. kreuzberg-3.17.0/.prettierignore +1 -0
  7. kreuzberg-3.17.0/ATTRIBUTIONS.md +47 -0
  8. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/PKG-INFO +15 -13
  9. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/README.md +10 -9
  10. kreuzberg-3.17.0/Taskfile.yml +50 -0
  11. kreuzberg-3.17.0/benchmarks/token_reduction_compression_benchmark.py +268 -0
  12. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/types.md +18 -0
  13. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/examples/extraction-examples.md +83 -1
  14. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/extraction-configuration.md +68 -1
  15. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/index.md +1 -0
  16. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/metadata-extraction.md +51 -0
  17. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/supported-formats.md +14 -1
  18. kreuzberg-3.17.0/docs/user-guide/token-reduction.md +251 -0
  19. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/__init__.py +6 -0
  20. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_api/main.py +0 -53
  21. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_config.py +17 -8
  22. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_document_classification.py +1 -1
  23. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_base.py +0 -46
  24. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_email.py +16 -10
  25. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_html.py +39 -12
  26. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_pandoc.py +2 -2
  27. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_pdf.py +6 -7
  28. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_presentation.py +4 -0
  29. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_spread_sheet.py +0 -1
  30. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_structured.py +83 -15
  31. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_gmft.py +7 -2
  32. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_mcp/server.py +1 -22
  33. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_mime_types.py +1 -1
  34. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/_easyocr.py +47 -20
  35. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/_paddleocr.py +1 -1
  36. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/_tesseract.py +27 -26
  37. kreuzberg-3.17.0/kreuzberg/_token_reduction/__init__.py +11 -0
  38. kreuzberg-3.17.0/kreuzberg/_token_reduction/_reducer.py +439 -0
  39. kreuzberg-3.17.0/kreuzberg/_token_reduction/_stopwords.py +116 -0
  40. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/af_stopwords.json +53 -0
  41. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ar_stopwords.json +482 -0
  42. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/bg_stopwords.json +261 -0
  43. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/bn_stopwords.json +400 -0
  44. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/br_stopwords.json +1205 -0
  45. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ca_stopwords.json +280 -0
  46. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/cs_stopwords.json +425 -0
  47. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/da_stopwords.json +172 -0
  48. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/de_stopwords.json +622 -0
  49. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/el_stopwords.json +849 -0
  50. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/en_stopwords.json +1300 -0
  51. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/eo_stopwords.json +175 -0
  52. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/es_stopwords.json +734 -0
  53. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/et_stopwords.json +37 -0
  54. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/eu_stopwords.json +100 -0
  55. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/fa_stopwords.json +801 -0
  56. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/fi_stopwords.json +849 -0
  57. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/fr_stopwords.json +693 -0
  58. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ga_stopwords.json +111 -0
  59. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/gl_stopwords.json +162 -0
  60. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/gu_stopwords.json +226 -0
  61. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ha_stopwords.json +41 -0
  62. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/he_stopwords.json +196 -0
  63. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/hi_stopwords.json +227 -0
  64. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/hr_stopwords.json +181 -0
  65. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/hu_stopwords.json +791 -0
  66. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/hy_stopwords.json +47 -0
  67. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/id_stopwords.json +760 -0
  68. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/it_stopwords.json +634 -0
  69. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ja_stopwords.json +136 -0
  70. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/kn_stopwords.json +84 -0
  71. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ko_stopwords.json +681 -0
  72. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ku_stopwords.json +64 -0
  73. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/la_stopwords.json +51 -0
  74. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/lt_stopwords.json +476 -0
  75. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/lv_stopwords.json +163 -0
  76. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ml_stopwords.json +11 -0
  77. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/mr_stopwords.json +101 -0
  78. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ms_stopwords.json +477 -0
  79. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ne_stopwords.json +490 -0
  80. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/nl_stopwords.json +415 -0
  81. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/no_stopwords.json +223 -0
  82. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/pl_stopwords.json +331 -0
  83. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/pt_stopwords.json +562 -0
  84. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ro_stopwords.json +436 -0
  85. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ru_stopwords.json +561 -0
  86. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/si_stopwords.json +193 -0
  87. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/sk_stopwords.json +420 -0
  88. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/sl_stopwords.json +448 -0
  89. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/so_stopwords.json +32 -0
  90. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/st_stopwords.json +33 -0
  91. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/sv_stopwords.json +420 -0
  92. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/sw_stopwords.json +76 -0
  93. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ta_stopwords.json +129 -0
  94. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/te_stopwords.json +54 -0
  95. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/th_stopwords.json +118 -0
  96. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/tl_stopwords.json +149 -0
  97. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/tr_stopwords.json +506 -0
  98. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/uk_stopwords.json +75 -0
  99. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/ur_stopwords.json +519 -0
  100. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/vi_stopwords.json +647 -0
  101. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/yo_stopwords.json +62 -0
  102. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/zh_stopwords.json +796 -0
  103. kreuzberg-3.17.0/kreuzberg/_token_reduction/stopwords/zu_stopwords.json +31 -0
  104. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_types.py +146 -43
  105. kreuzberg-3.17.0/kreuzberg/_utils/_html_streaming.py +20 -0
  106. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_image_preprocessing.py +1 -1
  107. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_ref.py +14 -6
  108. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_serialization.py +13 -6
  109. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_sync.py +15 -16
  110. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/exceptions.py +0 -1
  111. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/extraction.py +27 -11
  112. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/pyproject.toml +7 -5
  113. kreuzberg-3.17.0/tests/api/config_cache_test.py +224 -0
  114. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/image_extraction_test.py +4 -1
  115. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/main_test.py +7 -7
  116. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/runtime_config_test.py +4 -1
  117. kreuzberg-3.17.0/tests/core/comprehensive_config_test.py +664 -0
  118. kreuzberg-3.17.0/tests/core/constants_test.py +22 -0
  119. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/dpi_configuration_test.py +19 -0
  120. kreuzberg-3.17.0/tests/core/exceptions_test.py +159 -0
  121. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/extraction_batch_test.py +8 -65
  122. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/extraction_test.py +75 -38
  123. kreuzberg-3.17.0/tests/core/init_test.py +85 -0
  124. kreuzberg-3.17.0/tests/core/main_test.py +35 -0
  125. kreuzberg-3.17.0/tests/core/mime_types_test.py +242 -0
  126. kreuzberg-3.17.0/tests/core/registry_test.py +225 -0
  127. kreuzberg-3.17.0/tests/core/types_test.py +465 -0
  128. kreuzberg-3.17.0/tests/extractors/base_extractor_test.py +420 -0
  129. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/base_ocr_processing_test.py +6 -18
  130. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/email_test.py +1 -1
  131. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/image_error_handling_test.py +5 -3
  132. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/image_test.py +2 -19
  133. kreuzberg-3.17.0/tests/extractors/json_test.py +427 -0
  134. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/pandoc_test.py +27 -29
  135. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/pdf_test.py +12 -7
  136. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/spreadsheet_test.py +17 -13
  137. kreuzberg-3.17.0/tests/features/chunker_test.py +94 -0
  138. kreuzberg-3.17.0/tests/features/document_classification_test.py +747 -0
  139. kreuzberg-3.17.0/tests/features/entity_extraction_test.py +348 -0
  140. kreuzberg-3.17.0/tests/features/gmft_test.py +1496 -0
  141. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/features/language_detection_test.py +6 -34
  142. kreuzberg-3.17.0/tests/features/token_reduction_test.py +813 -0
  143. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/all_extractors_images_test.py +45 -24
  144. kreuzberg-3.17.0/tests/integration/token_reduction_integration_test.py +173 -0
  145. kreuzberg-3.17.0/tests/interfaces/cli_test.py +527 -0
  146. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/interfaces/mcp_server_test.py +44 -203
  147. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/multiprocessing/gmft_isolated_test.py +1 -0
  148. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/easyocr_test.py +6 -0
  149. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/paddleocr_test.py +1 -0
  150. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/tesseract_test.py +64 -0
  151. kreuzberg-3.17.0/tests/test_source_files/json/complex_nested.json +41 -0
  152. kreuzberg-3.17.0/tests/test_source_files/json/real_world/aws_policy.json +43 -0
  153. kreuzberg-3.17.0/tests/test_source_files/json/real_world/earthquakes.geojson +6 -0
  154. kreuzberg-3.17.0/tests/test_source_files/json/real_world/github_emojis.json +111 -0
  155. kreuzberg-3.17.0/tests/test_source_files/json/real_world/iss_location.json +1 -0
  156. kreuzberg-3.17.0/tests/test_source_files/json/real_world/openapi_spec.json +84 -0
  157. kreuzberg-3.17.0/tests/test_source_files/json/real_world/package.json +33 -0
  158. kreuzberg-3.17.0/tests/test_source_files/json/real_world/rick_morty_character.json +1 -0
  159. kreuzberg-3.17.0/tests/test_source_files/json/schema_test.json +25 -0
  160. kreuzberg-3.17.0/tests/utils/playa_metadata_test.py +753 -0
  161. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/playa_test.py +68 -17
  162. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/process_pool_test.py +1 -1
  163. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/serialization_test.py +82 -0
  164. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/uv.lock +147 -127
  165. kreuzberg-3.15.0/Taskfile.yml +0 -161
  166. kreuzberg-3.15.0/docker-logs/docker-info.txt +0 -60
  167. kreuzberg-3.15.0/docker-logs/docker-version.txt +0 -27
  168. kreuzberg-3.15.0/tests/core/mime_types_test.py +0 -0
  169. kreuzberg-3.15.0/tests/core/registry_test.py +0 -0
  170. kreuzberg-3.15.0/tests/core/types_test.py +0 -23
  171. kreuzberg-3.15.0/tests/features/chunker_test.py +0 -0
  172. kreuzberg-3.15.0/tests/features/document_classification_test.py +0 -0
  173. kreuzberg-3.15.0/tests/features/entity_extraction_test.py +0 -0
  174. kreuzberg-3.15.0/tests/features/gmft_test.py +0 -528
  175. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.commitlintrc +0 -0
  176. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.deepsource.toml +0 -0
  177. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.docker/Dockerfile +0 -0
  178. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.docker/README.md +0 -0
  179. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.dockerignore +0 -0
  180. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/dependabot.yaml +0 -0
  181. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/pr-title.yaml +0 -0
  182. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/publish-docker.yml +0 -0
  183. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.github/workflows/release.yaml +0 -0
  184. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.gitignore +0 -0
  185. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/.markdownlint.yaml +0 -0
  186. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/LICENSE +0 -0
  187. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/ai-rulez.yaml +0 -0
  188. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/README.md +0 -0
  189. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/__init__.py +0 -0
  190. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/batch_size_benchmark.py +0 -0
  191. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/batch_validation_benchmark.py +0 -0
  192. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/py.typed +0 -0
  193. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/pyproject.toml +0 -0
  194. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/__init__.py +0 -0
  195. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/__main__.py +0 -0
  196. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/benchmarks.py +0 -0
  197. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/cli.py +0 -0
  198. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/models.py +0 -0
  199. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/profiler.py +0 -0
  200. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/benchmarks/src/runner.py +0 -0
  201. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/advanced/custom-extractors.md +0 -0
  202. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/advanced/custom-hooks.md +0 -0
  203. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/advanced/error-handling.md +0 -0
  204. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/advanced/index.md +0 -0
  205. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/advanced/performance.md +0 -0
  206. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/exceptions.md +0 -0
  207. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/extraction-functions.md +0 -0
  208. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/extractor-registry.md +0 -0
  209. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/index.md +0 -0
  210. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/api-reference/ocr-configuration.md +0 -0
  211. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/assets/favicon.png +0 -0
  212. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/assets/logo.png +0 -0
  213. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/cli.md +0 -0
  214. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/contributing.md +0 -0
  215. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/css/extra.css +0 -0
  216. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/examples/index.md +0 -0
  217. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/getting-started/index.md +0 -0
  218. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/getting-started/installation.md +0 -0
  219. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/getting-started/quick-start.md +0 -0
  220. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/index.md +0 -0
  221. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/api-server.md +0 -0
  222. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/basic-usage.md +0 -0
  223. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/chunking.md +0 -0
  224. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/docker.md +0 -0
  225. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/document-classification.md +0 -0
  226. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/mcp-server.md +0 -0
  227. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/ocr-backends.md +0 -0
  228. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/docs/user-guide/ocr-configuration.md +0 -0
  229. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/__main__.py +0 -0
  230. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_api/__init__.py +0 -0
  231. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_api/_config_cache.py +0 -0
  232. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_chunker.py +0 -0
  233. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_constants.py +0 -0
  234. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_entity_extraction.py +0 -0
  235. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/__init__.py +0 -0
  236. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_extractors/_image.py +0 -0
  237. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_language_detection.py +0 -0
  238. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_mcp/__init__.py +0 -0
  239. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/__init__.py +0 -0
  240. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/_base.py +0 -0
  241. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_ocr/_table_extractor.py +0 -0
  242. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_playa.py +0 -0
  243. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_registry.py +0 -0
  244. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/__init__.py +0 -0
  245. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_cache.py +0 -0
  246. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_device.py +0 -0
  247. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_document_cache.py +0 -0
  248. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_errors.py +0 -0
  249. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_ocr_cache.py +0 -0
  250. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_pdf_lock.py +0 -0
  251. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_process_pool.py +0 -0
  252. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_quality.py +0 -0
  253. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_resource_managers.py +0 -0
  254. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_string.py +0 -0
  255. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_table.py +0 -0
  256. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/_utils/_tmp.py +0 -0
  257. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/cli.py +0 -0
  258. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/kreuzberg/py.typed +0 -0
  259. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/mkdocs.yaml +0 -0
  260. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/__init__.py +0 -0
  261. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/__init__.py +0 -0
  262. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/conftest.py +0 -0
  263. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/api/header_config_hashing_test.py +0 -0
  264. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/conftest.py +0 -0
  265. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/__init__.py +0 -0
  266. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/config_test.py +0 -0
  267. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/html_to_markdown_config_test.py +0 -0
  268. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/core/image_ocr_result_test.py +0 -0
  269. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/e2e/__init__.py +0 -0
  270. /kreuzberg-3.15.0/tests/e2e/docker_e2e_test.py → /kreuzberg-3.17.0/tests/e2e/docker_e2e.py +0 -0
  271. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/README_image_tests.md +0 -0
  272. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/__init__.py +0 -0
  273. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/base_memory_limits_test.py +0 -0
  274. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/base_ocr_simple_test.py +0 -0
  275. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/email_error_paths_test.py +0 -0
  276. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/html_invalid_base64_test.py +0 -0
  277. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/html_test.py +0 -0
  278. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/image_deduplication_test.py +0 -0
  279. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/image_error_simple_test.py +0 -0
  280. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/pandoc_metadata_test.py +0 -0
  281. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/pdf_images_test.py +0 -0
  282. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/pdf_sync_images_test.py +0 -0
  283. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/presentation_test.py +0 -0
  284. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/extractors/structured_test.py +0 -0
  285. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/features/__init__.py +0 -0
  286. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/features/hooks_test.py +0 -0
  287. /kreuzberg-3.15.0/tests/core/exceptions_test.py → /kreuzberg-3.17.0/tests/features/table_extraction_test.py +0 -0
  288. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/__init__.py +0 -0
  289. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/api/__init__.py +0 -0
  290. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/api/large_file_test.py +0 -0
  291. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/api/mounted_config_test.py +0 -0
  292. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/dpi_integration_test.py +0 -0
  293. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/multiprocessing/__init__.py +0 -0
  294. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/multiprocessing/gmft_integration_test.py +0 -0
  295. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/ocr/__init__.py +0 -0
  296. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/ocr/device_integration_test.py +0 -0
  297. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/ocr/tesseract_sync_formats_test.py +0 -0
  298. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/ocr/tesseract_tsv_integration_test.py +0 -0
  299. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/pandoc_images_test.py +0 -0
  300. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/pdf_images_test.py +0 -0
  301. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/pdf_real_images_test.py +0 -0
  302. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/pptx_complex_test.py +0 -0
  303. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/pptx_images_test.py +0 -0
  304. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/integration/regression_test.py +0 -0
  305. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/interfaces/__init__.py +0 -0
  306. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/mcp/__init__.py +0 -0
  307. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/mcp/mcp_server_test.py +0 -0
  308. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/multiprocessing/__init__.py +0 -0
  309. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/multiprocessing/process_manager_test.py +0 -0
  310. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/multiprocessing/tesseract_pool_test.py +0 -0
  311. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/__init__.py +0 -0
  312. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/base_test.py +0 -0
  313. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/init_test.py +0 -0
  314. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/ocr/tesseract_tsv_test.py +0 -0
  315. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/performance/__init__.py +0 -0
  316. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/performance/large_pdf_perf_test.py +0 -0
  317. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/Xerox_AltaLink_series_mfp_sag_en-US 2.pdf +0 -0
  318. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/contract.txt +0 -0
  319. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/contract_test.txt +0 -0
  320. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/document.docx +0 -0
  321. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/email/sample-email.eml +0 -0
  322. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/excel-multi-sheet.xlsx +0 -0
  323. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/excel.xlsx +0 -0
  324. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/flower-no-text.jpg +0 -0
  325. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/form_test.txt +0 -0
  326. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/french-text.txt +0 -0
  327. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/german-text.txt +0 -0
  328. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/google-doc-document.pdf +0 -0
  329. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/html.html +0 -0
  330. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/images/test_hello_world.png +0 -0
  331. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/invoice_image.png +0 -0
  332. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/invoice_test.txt +0 -0
  333. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/json/sample-document.json +0 -0
  334. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/layout-parser-ocr.jpg +0 -0
  335. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/markdown.md +0 -0
  336. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/non-ascii-text.pdf +0 -0
  337. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/non-searchable.pdf +0 -0
  338. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/ocr-image.jpg +0 -0
  339. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/pdfs_with_tables/large.pdf +0 -0
  340. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/pdfs_with_tables/medium.pdf +0 -0
  341. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/pdfs_with_tables/tiny.pdf +0 -0
  342. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/pitch-deck-presentation.pptx +0 -0
  343. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/receipt_test.txt +0 -0
  344. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/report_test.txt +0 -0
  345. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/sample-contract.pdf +0 -0
  346. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/scanned.pdf +0 -0
  347. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/searchable.pdf +0 -0
  348. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/sharable-web-guide.pdf +0 -0
  349. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/spanish-text.txt +0 -0
  350. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/tables/borderless_table.png +0 -0
  351. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/tables/complex_document.png +0 -0
  352. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/tables/simple_table.png +0 -0
  353. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/test-article.pdf +0 -0
  354. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/test-excel.xls +0 -0
  355. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/test_source_files/yaml/sample-config.yaml +0 -0
  356. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/__init__.py +0 -0
  357. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/cache_test.py +0 -0
  358. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/device_test.py +0 -0
  359. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/errors_test.py +0 -0
  360. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/ocr_cache_test.py +0 -0
  361. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/pdf_lock_test.py +0 -0
  362. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/playa_helpers_test.py +0 -0
  363. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/quality_test.py +0 -0
  364. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/ref_test.py +0 -0
  365. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/string_test.py +0 -0
  366. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/sync_test.py +0 -0
  367. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/table_test.py +0 -0
  368. {kreuzberg-3.15.0 → kreuzberg-3.17.0}/tests/utils/tmp_test.py +0 -0
@@ -8,6 +8,10 @@ on:
     branches:
       - main
 
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
+
 jobs:
   validate:
     runs-on: ubuntu-latest
@@ -138,27 +142,7 @@ jobs:
     needs: validate
     if: github.event_name == 'pull_request' && needs.validate.result == 'success'
     runs-on: ubuntu-latest
-    strategy:
-      fail-fast: false
-      matrix:
-        test-category:
-          - name: "core"
-            path: "tests/core,tests/utils"
-            system-deps: false
-            timeout: 15
-          - name: "extractors"
-            path: "tests/extractors"
-            system-deps: true
-            timeout: 20
-          - name: "integration"
-            path: "tests/integration,tests/api"
-            system-deps: true
-            timeout: 25
-          - name: "features"
-            path: "tests/features,tests/interfaces,tests/mcp,tests/multiprocessing,tests/ocr"
-            system-deps: true
-            timeout: 20
-    timeout-minutes: ${{ matrix.test-category.timeout }}
+    timeout-minutes: 45
     steps:
       - name: Checkout
         uses: actions/checkout@v5
@@ -170,36 +154,62 @@ jobs:
 
       - name: Install Python
         uses: actions/setup-python@v6
+        id: setup-python
         with:
           python-version: "3.13"
 
       - name: Cache Python Dependencies
+        id: python-cache
         uses: actions/cache@v4
         with:
           path: |
             ~/.cache/uv
             .venv
-          key: python-dependencies-ubuntu-latest-3.13-${{ matrix.test-category.name }}-${{ hashFiles('uv.lock') }}
+          key: python-dependencies-ubuntu-latest-3.13-${{ hashFiles('uv.lock') }}
           restore-keys: |
             python-dependencies-ubuntu-latest-3.13-
 
       - name: Install Dependencies
-        run: uv sync --all-extras --dev
+        uses: nick-fields/retry@v3
+        with:
+          timeout_minutes: 5
+          max_attempts: 3
+          retry_wait_seconds: 30
+          command: |
+            uv sync --all-extras --dev
+          shell: bash
 
       - name: Install System Dependencies
-        if: matrix.test-category.system-deps
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y tesseract-ocr tesseract-ocr-deu pandoc
+        uses: nick-fields/retry@v3
+        with:
+          timeout_minutes: 5
+          max_attempts: 3
+          retry_wait_seconds: 30
+          command: |
+            sudo apt-get update
+            sudo apt-get install -y tesseract-ocr tesseract-ocr-deu pandoc
+          shell: bash
 
-      - name: Run Tests - ${{ matrix.test-category.name }}
-        run: uv run pytest $(echo "${{ matrix.test-category.path }}" | tr ',' ' ') -v --reruns 1 --reruns-delay 1 --cov=kreuzberg --cov-append --cov-report=lcov:coverage-${{ matrix.test-category.name }}.lcov
+      - name: Run All Tests with Coverage
+        uses: nick-fields/retry@v3
+        with:
+          timeout_minutes: 15
+          max_attempts: 3
+          retry_wait_seconds: 10
+          command: |
+            uv run coverage erase
+            uv run pytest -s -vvv --cov=kreuzberg --cov-report=lcov:coverage.lcov --cov-report=term --cov-config=pyproject.toml --reruns 2 --reruns-delay 1
+            uv run coverage report --precision=2
+          shell: bash
 
       - name: Upload Coverage Artifacts
+        if: always()
         uses: actions/upload-artifact@v4
         with:
-          name: coverage-${{ matrix.test-category.name }}-${{ github.sha }}
-          path: coverage-${{ matrix.test-category.name }}.lcov
209
+ name: coverage-pr-${{ github.sha }}
210
+ path: |
211
+ coverage.lcov
212
+ .coverage
203
213
  retention-days: 1
204
214
 
205
215
  coverage-pr:
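The steps above wrap their commands in `nick-fields/retry@v3` with `max_attempts`, `retry_wait_seconds`, and `timeout_minutes`. This step-level retry reruns the whole command, which is separate from pytest's `--reruns 2`, which reruns individual failing tests. A rough sketch of the retry behaviour those inputs describe, assuming the timeout applies per attempt; `run_with_retry` is a hypothetical helper and not the action's implementation:

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_with_retry(
    command: Sequence[str],
    max_attempts: int = 3,
    retry_wait_seconds: float = 30.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command` until it exits 0; return the attempt number that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            completed = subprocess.run(command, timeout=timeout_seconds, check=False)
            if completed.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            # A timed-out attempt counts as a failure and is retried.
            pass
        if attempt < max_attempts:
            # Wait between attempts, but not after the final one.
            sleep(retry_wait_seconds)
    raise RuntimeError(f"command failed after {max_attempts} attempts")


# Usage: a command that succeeds immediately needs only one attempt.
attempts = run_with_retry([sys.executable, "-c", "pass"], retry_wait_seconds=0)
print(attempts)
```

Injecting `sleep` keeps the sketch testable without real waits.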
@@ -212,49 +222,9 @@ jobs:
212
222
  uses: actions/checkout@v5
213
223
 
214
224
  - name: Download Coverage Artifacts
215
- uses: actions/download-artifact@v4
216
- with:
217
- pattern: coverage-*-${{ github.sha }}
218
- merge-multiple: true
219
-
220
- - name: Install uv
221
- uses: astral-sh/setup-uv@v6
222
- with:
223
- enable-cache: true
224
-
225
- - name: Install Python
226
- uses: actions/setup-python@v6
225
+ uses: actions/download-artifact@v5
227
226
  with:
228
- python-version: "3.13"
229
-
230
- - name: Install Dependencies
231
- run: uv sync --dev
232
-
233
- - name: Combine Coverage Reports
234
- run: |
235
- # Install lcov for combining reports
236
- sudo apt-get update && sudo apt-get install -y lcov
237
-
238
- # List available coverage files
239
- echo "Available coverage files:"
240
- find . -name "coverage-*.lcov" -type f || echo "No coverage files found"
241
-
242
- # Combine all lcov files if they exist
243
- coverage_files=($(find . -name "coverage-*.lcov" -type f))
244
- if [ ${#coverage_files[@]} -gt 0 ]; then
245
- echo "Combining ${#coverage_files[@]} coverage files..."
246
- if [ ${#coverage_files[@]} -eq 1 ]; then
247
- # Only one file, just copy it
248
- cp "${coverage_files[0]}" coverage.lcov
249
- else
250
- # Multiple files, combine them
251
- lcov --rc branch_coverage=1 $(printf " -a %s" "${coverage_files[@]}") -o coverage.lcov
252
- fi
253
- else
254
- echo "No coverage files to combine, creating empty coverage.lcov"
255
- echo "TN:" > coverage.lcov
256
- echo "end_of_record" >> coverage.lcov
257
- fi
227
+ name: coverage-pr-${{ github.sha }}
258
228
 
259
229
  - name: Upload Coverage to DeepSource
260
230
  if: always()
@@ -4,6 +4,10 @@ on:
4
4
  workflow_dispatch:
5
5
  workflow_call:
6
6
 
7
+ concurrency:
8
+ group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
9
+ cancel-in-progress: true
10
+
7
11
  jobs:
8
12
  test-docker-images:
9
13
  runs-on: ubuntu-latest
@@ -17,7 +17,7 @@ permissions:
17
17
 
18
18
  concurrency:
19
19
  group: "pages"
20
- cancel-in-progress: false
20
+ cancel-in-progress: true
21
21
 
22
22
  jobs:
23
23
  build:
@@ -3,6 +3,10 @@ name: Test Docker Builds (No Push)
3
3
  on:
4
4
  workflow_dispatch:
5
5
 
6
+ concurrency:
7
+ group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
8
+ cancel-in-progress: true
9
+
6
10
  jobs:
7
11
  test-build-all-images:
8
12
  runs-on: ubuntu-latest
@@ -11,7 +11,7 @@ repos:
11
11
  - id: name-tests-test
12
12
  args:
13
13
  - --pytest
14
- exclude: factories|test_utils|completion.py|test_data
14
+ exclude: factories|test_utils|completion.py|test_data|docker_e2e.py
15
15
  - id: trailing-whitespace
16
16
  - id: end-of-file-fixer
17
17
  - id: check-toml
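pre-commit treats `exclude` as a Python regular expression searched anywhere in each file path, so the new `docker_e2e.py` alternative skips any path containing that text. Because the `.` is unescaped, it also matches any single character. A small sketch of that matching; the file paths below are hypothetical examples, not taken from the repository:

```python
import re

# The updated pattern from the name-tests-test hook, verbatim.
EXCLUDE = re.compile(r"factories|test_utils|completion.py|test_data|docker_e2e.py")


def is_excluded(path: str) -> bool:
    """Return True when the hook would skip `path` (search, not full match)."""
    return EXCLUDE.search(path) is not None


for path in (
    "tests/e2e/docker_e2e.py",             # hypothetical path: excluded
    "tests/utils/string_test.py",          # still checked by the hook
    "tests/e2e/docker_e2e_py_helpers.py",  # excluded too: "." matches "_"
):
    print(path, is_excluded(path))
```

Escaping the dot (`docker_e2e\.py`) would restrict the match to the literal file name, should the looser match ever matter.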
@@ -26,7 +26,7 @@ repos:
26
26
  hooks:
27
27
  - id: mdformat
28
28
  additional_dependencies:
29
- - mdformat-mkdocs==4.0.0
29
+ - mdformat-mkdocs==4.1.0
30
30
  - repo: https://github.com/igorshubovych/markdownlint-cli
31
31
  rev: v0.45.0
32
32
  hooks:
@@ -36,6 +36,7 @@ repos:
36
36
  hooks:
37
37
  - id: blacken-docs
38
38
  args: ["--pyi", "--line-length", "130"]
39
+ exclude: tests/features/token_reduction_test.py
39
40
  additional_dependencies:
40
41
  - black==25.1.0
41
42
  - repo: https://github.com/rbubley/mirrors-prettier
@@ -59,7 +60,7 @@ repos:
59
60
  - id: codespell
60
61
  exclude: ^tests|^scripts|^kreuzberg/_tesseract|^kreuzberg/_mime_types
61
62
  additional_dependencies:
62
- - tomli
63
+ - tomli==2.2.1
63
64
  - repo: https://github.com/jsh9/pydoclint
64
65
  rev: 0.7.3
65
66
  hooks:
@@ -0,0 +1 @@
1
+ ATTRIBUTIONS.md
@@ -0,0 +1,47 @@
1
+ # Third-Party Attributions
2
+
3
+ This file contains attributions for third-party code, data, and libraries used in Kreuzberg.
4
+
5
+ ## Stopwords Data
6
+
7
+ The stopwords data in `kreuzberg/_token_reduction/stop_words.json` is derived from the [stopwords-iso](https://github.com/stopwords-iso/stopwords-iso) project.
8
+
9
+ **Original Author:** Gene Diaz and contributors
10
+ **License:** MIT License
11
+ **Source:** <https://github.com/stopwords-iso/stopwords-iso>
12
+
13
+ ### MIT License (stopwords-iso)
14
+
15
+ ```text
16
+ MIT License
17
+
18
+ Copyright (c) stopwords-iso contributors
19
+
20
+ Permission is hereby granted, free of charge, to any person obtaining a copy
21
+ of this software and associated documentation files (the "Software"), to deal
22
+ in the Software without restriction, including without limitation the rights
23
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
24
+ copies of the Software, and to permit persons to whom the Software is
25
+ furnished to do so, subject to the following conditions:
26
+
27
+ The above copyright notice and this permission notice shall be included in all
28
+ copies or substantial portions of the Software.
29
+
30
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
31
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
32
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
33
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
34
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
35
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
36
+ SOFTWARE.
37
+ ```
38
+
39
+ ### Changes Made
40
+
41
+ The original stopwords-iso data was used as-is with no modifications to the word lists themselves. The data was packaged into Kreuzberg's `_token_reduction` module for use in the token reduction feature.
42
+
43
+ ______________________________________________________________________
44
+
45
+ ## Other Third-Party Dependencies
46
+
47
+ All other third-party dependencies are listed in `pyproject.toml` with their respective licenses. This section is specifically for bundled/vendored code and data.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: kreuzberg
3
- Version: 3.15.0
3
+ Version: 3.17.0
4
4
  Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
5
5
  Project-URL: documentation, https://kreuzberg.dev
6
6
  Project-URL: homepage, https://github.com/Goldziher/kreuzberg
@@ -31,7 +31,8 @@ Requires-Python: >=3.10
31
31
  Requires-Dist: anyio>=4.10.0
32
32
  Requires-Dist: chardetng-py>=0.3.5
33
33
  Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
34
- Requires-Dist: html-to-markdown[lxml]>=1.11.0
34
+ Requires-Dist: html-to-markdown[lxml]>=1.13.0
35
+ Requires-Dist: langcodes>=3.5.0
35
36
  Requires-Dist: mcp>=1.14.0
36
37
  Requires-Dist: msgspec>=0.18.0
37
38
  Requires-Dist: numpy>=2.0.0
@@ -49,7 +50,7 @@ Provides-Extra: all
49
50
  Requires-Dist: click>=8.2.1; extra == 'all'
50
51
  Requires-Dist: deep-translator>=1.11.4; extra == 'all'
51
52
  Requires-Dist: easyocr>=1.7.2; extra == 'all'
52
- Requires-Dist: fast-langdetect>=0.3.2; extra == 'all'
53
+ Requires-Dist: fast-langdetect>=1.0.0; extra == 'all'
53
54
  Requires-Dist: gmft>=0.4.2; extra == 'all'
54
55
  Requires-Dist: keybert>=0.9.0; extra == 'all'
55
56
  Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.17.0; extra == 'all'
@@ -82,7 +83,7 @@ Requires-Dist: spacy>=3.8.7; extra == 'entity-extraction'
82
83
  Provides-Extra: gmft
83
84
  Requires-Dist: gmft>=0.4.2; extra == 'gmft'
84
85
  Provides-Extra: langdetect
85
- Requires-Dist: fast-langdetect>=0.3.2; extra == 'langdetect'
86
+ Requires-Dist: fast-langdetect>=1.0.0; extra == 'langdetect'
86
87
  Provides-Extra: paddleocr
87
88
  Requires-Dist: paddleocr>=3.2.0; extra == 'paddleocr'
88
89
  Requires-Dist: paddlepaddle>=3.2.0; extra == 'paddleocr'
@@ -109,7 +110,7 @@ Description-Content-Type: text/markdown
109
110
  - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
110
111
  - **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
111
112
  - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
112
- - **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
113
+ - **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
113
114
  - **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
114
115
  - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)
115
116
 
@@ -227,14 +228,15 @@ claude mcp add kreuzberg uvx kreuzberg-mcp
227
228
 
228
229
  ## Supported Formats
229
230
 
230
- | Category | Formats |
231
- | ----------------- | ------------------------------ |
232
- | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
233
- | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
234
- | **Spreadsheets** | XLSX, XLS, CSV, ODS |
235
- | **Presentations** | PPTX, PPT, ODP |
236
- | **Web** | HTML, XML, MHTML |
237
- | **Archives** | Support via extraction |
231
+ | Category | Formats |
232
+ | ------------------- | ------------------------------ |
233
+ | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
234
+ | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
235
+ | **Spreadsheets** | XLSX, XLS, CSV, ODS |
236
+ | **Presentations** | PPTX, PPT, ODP |
237
+ | **Web** | HTML, XML, MHTML |
238
+ | **Structured Data** | JSON, YAML, TOML |
239
+ | **Archives** | Support via extraction |
238
240
 
239
241
  ## 📊 Performance Characteristics
240
242
 
@@ -18,7 +18,7 @@
18
18
  - **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
19
19
  - **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
20
20
  - **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
21
- - **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
21
+ - **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
22
22
  - **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
23
23
  - **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)
24
24
 
@@ -136,14 +136,15 @@ claude mcp add kreuzberg uvx kreuzberg-mcp
136
136
 
137
137
  ## Supported Formats
138
138
 
139
- | Category | Formats |
140
- | ----------------- | ------------------------------ |
141
- | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
142
- | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
143
- | **Spreadsheets** | XLSX, XLS, CSV, ODS |
144
- | **Presentations** | PPTX, PPT, ODP |
145
- | **Web** | HTML, XML, MHTML |
146
- | **Archives** | Support via extraction |
139
+ | Category | Formats |
140
+ | ------------------- | ------------------------------ |
141
+ | **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
142
+ | **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
143
+ | **Spreadsheets** | XLSX, XLS, CSV, ODS |
144
+ | **Presentations** | PPTX, PPT, ODP |
145
+ | **Web** | HTML, XML, MHTML |
146
+ | **Structured Data** | JSON, YAML, TOML |
147
+ | **Archives** | Support via extraction |
147
148
 
148
149
  ## 📊 Performance Characteristics
149
150
 
@@ -0,0 +1,50 @@
1
+ version: "3"
2
+
3
+ env:
4
+ DOCKER_BUILDKIT: 1
5
+ BUILDKIT_PROGRESS: plain
6
+
7
+ tasks:
8
+ setup:
9
+ desc: "Install dependencies with uv"
10
+ cmds:
11
+ - uv sync --all-extras --all-packages
12
+ - pre-commit install && pre-commit install --hook-type commit-msg
13
+
14
+ update:
15
+ desc: "Update the dependencies"
16
+ cmds:
17
+ - uv run uv-bump
18
+ - cd benchmarks && uv run uv-bump && cd -
19
+ - uv sync --all-extras --all-packages --upgrade
20
+ - pre-commit autoupdate
21
+
22
+ test:
23
+ desc: "Run tests with pytest"
24
+ cmds:
25
+ - uv run pytest
26
+
27
+ test:cov:
28
+ desc: "Run tests with coverage"
29
+ cmds:
30
+ - uv run pytest --cov
31
+
32
+ lint:
33
+ desc: "Lint code with ruff and docs with markdownlint"
34
+ cmds:
35
+ - pre-commit run --all-files
36
+
37
+ docs:build:
38
+ desc: "Build documentation"
39
+ cmds:
40
+ - uv run mkdocs build --clean --strict
41
+
42
+ docs:serve:
43
+ desc: "Serve documentation locally"
44
+ cmds:
45
+ - uv run mkdocs serve
46
+
47
+ default:
48
+ desc: "Show available tasks"
49
+ cmds:
50
+ - task --list