kreuzberg 4.0.0.pre.rc.29 → 4.0.0.rc1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (321)
  1. checksums.yaml +4 -4
  2. data/.gitignore +0 -6
  3. data/.rubocop.yaml +534 -1
  4. data/Gemfile +2 -1
  5. data/Gemfile.lock +28 -116
  6. data/README.md +269 -629
  7. data/Rakefile +0 -9
  8. data/Steepfile +4 -8
  9. data/examples/async_patterns.rb +58 -1
  10. data/ext/kreuzberg_rb/extconf.rb +5 -35
  11. data/ext/kreuzberg_rb/native/Cargo.toml +16 -55
  12. data/ext/kreuzberg_rb/native/build.rs +14 -12
  13. data/ext/kreuzberg_rb/native/include/ieeefp.h +1 -1
  14. data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +1 -1
  15. data/ext/kreuzberg_rb/native/include/strings.h +2 -2
  16. data/ext/kreuzberg_rb/native/include/unistd.h +1 -1
  17. data/ext/kreuzberg_rb/native/src/lib.rs +34 -897
  18. data/extconf.rb +6 -38
  19. data/kreuzberg.gemspec +20 -114
  20. data/lib/kreuzberg/api_proxy.rb +18 -2
  21. data/lib/kreuzberg/cache_api.rb +0 -22
  22. data/lib/kreuzberg/cli.rb +10 -2
  23. data/lib/kreuzberg/cli_proxy.rb +10 -0
  24. data/lib/kreuzberg/config.rb +22 -274
  25. data/lib/kreuzberg/errors.rb +7 -73
  26. data/lib/kreuzberg/extraction_api.rb +8 -237
  27. data/lib/kreuzberg/mcp_proxy.rb +11 -2
  28. data/lib/kreuzberg/ocr_backend_protocol.rb +73 -0
  29. data/lib/kreuzberg/post_processor_protocol.rb +71 -0
  30. data/lib/kreuzberg/result.rb +33 -151
  31. data/lib/kreuzberg/setup_lib_path.rb +2 -22
  32. data/lib/kreuzberg/validator_protocol.rb +73 -0
  33. data/lib/kreuzberg/version.rb +1 -1
  34. data/lib/kreuzberg.rb +13 -27
  35. data/pkg/kreuzberg-4.0.0.rc1.gem +0 -0
  36. data/sig/kreuzberg.rbs +12 -105
  37. data/spec/binding/cache_spec.rb +22 -22
  38. data/spec/binding/cli_proxy_spec.rb +4 -2
  39. data/spec/binding/cli_spec.rb +11 -12
  40. data/spec/binding/config_spec.rb +0 -74
  41. data/spec/binding/config_validation_spec.rb +6 -100
  42. data/spec/binding/error_handling_spec.rb +97 -283
  43. data/spec/binding/plugins/ocr_backend_spec.rb +8 -8
  44. data/spec/binding/plugins/postprocessor_spec.rb +11 -11
  45. data/spec/binding/plugins/validator_spec.rb +13 -12
  46. data/spec/examples.txt +104 -0
  47. data/spec/fixtures/config.toml +1 -0
  48. data/spec/fixtures/config.yaml +1 -0
  49. data/spec/fixtures/invalid_config.toml +1 -0
  50. data/spec/smoke/package_spec.rb +3 -2
  51. data/spec/spec_helper.rb +3 -1
  52. data/vendor/kreuzberg/Cargo.toml +67 -192
  53. data/vendor/kreuzberg/README.md +9 -97
  54. data/vendor/kreuzberg/build.rs +194 -516
  55. data/vendor/kreuzberg/src/api/handlers.rs +9 -130
  56. data/vendor/kreuzberg/src/api/mod.rs +3 -18
  57. data/vendor/kreuzberg/src/api/server.rs +71 -236
  58. data/vendor/kreuzberg/src/api/types.rs +7 -43
  59. data/vendor/kreuzberg/src/bin/profile_extract.rs +455 -0
  60. data/vendor/kreuzberg/src/cache/mod.rs +3 -27
  61. data/vendor/kreuzberg/src/chunking/mod.rs +79 -1705
  62. data/vendor/kreuzberg/src/core/batch_mode.rs +0 -60
  63. data/vendor/kreuzberg/src/core/config.rs +23 -905
  64. data/vendor/kreuzberg/src/core/extractor.rs +106 -403
  65. data/vendor/kreuzberg/src/core/io.rs +2 -4
  66. data/vendor/kreuzberg/src/core/mime.rs +12 -2
  67. data/vendor/kreuzberg/src/core/mod.rs +3 -22
  68. data/vendor/kreuzberg/src/core/pipeline.rs +78 -395
  69. data/vendor/kreuzberg/src/embeddings.rs +21 -169
  70. data/vendor/kreuzberg/src/error.rs +2 -2
  71. data/vendor/kreuzberg/src/extraction/archive.rs +31 -36
  72. data/vendor/kreuzberg/src/extraction/docx.rs +1 -365
  73. data/vendor/kreuzberg/src/extraction/email.rs +11 -12
  74. data/vendor/kreuzberg/src/extraction/excel.rs +129 -138
  75. data/vendor/kreuzberg/src/extraction/html.rs +170 -1447
  76. data/vendor/kreuzberg/src/extraction/image.rs +14 -138
  77. data/vendor/kreuzberg/src/extraction/libreoffice.rs +3 -13
  78. data/vendor/kreuzberg/src/extraction/mod.rs +5 -21
  79. data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +0 -2
  80. data/vendor/kreuzberg/src/extraction/pandoc/batch.rs +275 -0
  81. data/vendor/kreuzberg/src/extraction/pandoc/mime_types.rs +178 -0
  82. data/vendor/kreuzberg/src/extraction/pandoc/mod.rs +491 -0
  83. data/vendor/kreuzberg/src/extraction/pandoc/server.rs +496 -0
  84. data/vendor/kreuzberg/src/extraction/pandoc/subprocess.rs +1188 -0
  85. data/vendor/kreuzberg/src/extraction/pandoc/version.rs +162 -0
  86. data/vendor/kreuzberg/src/extraction/pptx.rs +94 -196
  87. data/vendor/kreuzberg/src/extraction/structured.rs +4 -5
  88. data/vendor/kreuzberg/src/extraction/table.rs +1 -2
  89. data/vendor/kreuzberg/src/extraction/text.rs +10 -18
  90. data/vendor/kreuzberg/src/extractors/archive.rs +0 -22
  91. data/vendor/kreuzberg/src/extractors/docx.rs +148 -69
  92. data/vendor/kreuzberg/src/extractors/email.rs +9 -37
  93. data/vendor/kreuzberg/src/extractors/excel.rs +40 -81
  94. data/vendor/kreuzberg/src/extractors/html.rs +173 -182
  95. data/vendor/kreuzberg/src/extractors/image.rs +8 -32
  96. data/vendor/kreuzberg/src/extractors/mod.rs +10 -171
  97. data/vendor/kreuzberg/src/extractors/pandoc.rs +201 -0
  98. data/vendor/kreuzberg/src/extractors/pdf.rs +64 -329
  99. data/vendor/kreuzberg/src/extractors/pptx.rs +34 -79
  100. data/vendor/kreuzberg/src/extractors/structured.rs +0 -16
  101. data/vendor/kreuzberg/src/extractors/text.rs +7 -30
  102. data/vendor/kreuzberg/src/extractors/xml.rs +8 -27
  103. data/vendor/kreuzberg/src/keywords/processor.rs +1 -9
  104. data/vendor/kreuzberg/src/keywords/rake.rs +1 -0
  105. data/vendor/kreuzberg/src/language_detection/mod.rs +51 -94
  106. data/vendor/kreuzberg/src/lib.rs +5 -17
  107. data/vendor/kreuzberg/src/mcp/mod.rs +1 -4
  108. data/vendor/kreuzberg/src/mcp/server.rs +21 -145
  109. data/vendor/kreuzberg/src/ocr/mod.rs +0 -2
  110. data/vendor/kreuzberg/src/ocr/processor.rs +8 -19
  111. data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +0 -2
  112. data/vendor/kreuzberg/src/pdf/error.rs +1 -93
  113. data/vendor/kreuzberg/src/pdf/metadata.rs +100 -263
  114. data/vendor/kreuzberg/src/pdf/mod.rs +2 -33
  115. data/vendor/kreuzberg/src/pdf/rendering.rs +12 -12
  116. data/vendor/kreuzberg/src/pdf/table.rs +64 -61
  117. data/vendor/kreuzberg/src/pdf/text.rs +24 -416
  118. data/vendor/kreuzberg/src/plugins/extractor.rs +8 -40
  119. data/vendor/kreuzberg/src/plugins/mod.rs +0 -3
  120. data/vendor/kreuzberg/src/plugins/ocr.rs +14 -22
  121. data/vendor/kreuzberg/src/plugins/processor.rs +1 -10
  122. data/vendor/kreuzberg/src/plugins/registry.rs +0 -15
  123. data/vendor/kreuzberg/src/plugins/validator.rs +8 -20
  124. data/vendor/kreuzberg/src/stopwords/mod.rs +2 -2
  125. data/vendor/kreuzberg/src/text/mod.rs +0 -8
  126. data/vendor/kreuzberg/src/text/quality.rs +15 -28
  127. data/vendor/kreuzberg/src/text/string_utils.rs +10 -22
  128. data/vendor/kreuzberg/src/text/token_reduction/core.rs +50 -86
  129. data/vendor/kreuzberg/src/text/token_reduction/filters.rs +16 -37
  130. data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +1 -2
  131. data/vendor/kreuzberg/src/types.rs +67 -907
  132. data/vendor/kreuzberg/src/utils/mod.rs +0 -14
  133. data/vendor/kreuzberg/src/utils/quality.rs +3 -12
  134. data/vendor/kreuzberg/tests/api_tests.rs +0 -506
  135. data/vendor/kreuzberg/tests/archive_integration.rs +0 -2
  136. data/vendor/kreuzberg/tests/batch_orchestration.rs +12 -57
  137. data/vendor/kreuzberg/tests/batch_processing.rs +8 -32
  138. data/vendor/kreuzberg/tests/chunking_offset_demo.rs +92 -0
  139. data/vendor/kreuzberg/tests/concurrency_stress.rs +8 -40
  140. data/vendor/kreuzberg/tests/config_features.rs +1 -33
  141. data/vendor/kreuzberg/tests/config_loading_tests.rs +39 -16
  142. data/vendor/kreuzberg/tests/core_integration.rs +9 -35
  143. data/vendor/kreuzberg/tests/csv_integration.rs +81 -71
  144. data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +25 -23
  145. data/vendor/kreuzberg/tests/email_integration.rs +1 -3
  146. data/vendor/kreuzberg/tests/error_handling.rs +34 -43
  147. data/vendor/kreuzberg/tests/format_integration.rs +1 -7
  148. data/vendor/kreuzberg/tests/helpers/mod.rs +0 -60
  149. data/vendor/kreuzberg/tests/image_integration.rs +0 -2
  150. data/vendor/kreuzberg/tests/mime_detection.rs +16 -17
  151. data/vendor/kreuzberg/tests/ocr_configuration.rs +0 -4
  152. data/vendor/kreuzberg/tests/ocr_errors.rs +0 -22
  153. data/vendor/kreuzberg/tests/ocr_quality.rs +0 -2
  154. data/vendor/kreuzberg/tests/pandoc_integration.rs +503 -0
  155. data/vendor/kreuzberg/tests/pdf_integration.rs +0 -2
  156. data/vendor/kreuzberg/tests/pipeline_integration.rs +2 -36
  157. data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +0 -5
  158. data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +1 -17
  159. data/vendor/kreuzberg/tests/plugin_system.rs +0 -6
  160. data/vendor/kreuzberg/tests/registry_integration_tests.rs +22 -2
  161. data/vendor/kreuzberg/tests/security_validation.rs +1 -13
  162. data/vendor/kreuzberg/tests/test_fastembed.rs +23 -45
  163. metadata +25 -171
  164. data/.rubocop.yml +0 -543
  165. data/ext/kreuzberg_rb/native/.cargo/config.toml +0 -23
  166. data/ext/kreuzberg_rb/native/Cargo.lock +0 -7619
  167. data/lib/kreuzberg/error_context.rb +0 -136
  168. data/lib/kreuzberg/types.rb +0 -170
  169. data/lib/libpdfium.so +0 -0
  170. data/spec/binding/async_operations_spec.rb +0 -473
  171. data/spec/binding/batch_operations_spec.rb +0 -595
  172. data/spec/binding/batch_spec.rb +0 -359
  173. data/spec/binding/config_result_spec.rb +0 -377
  174. data/spec/binding/embeddings_spec.rb +0 -816
  175. data/spec/binding/error_recovery_spec.rb +0 -488
  176. data/spec/binding/font_config_spec.rb +0 -220
  177. data/spec/binding/images_spec.rb +0 -738
  178. data/spec/binding/keywords_extraction_spec.rb +0 -600
  179. data/spec/binding/metadata_types_spec.rb +0 -1228
  180. data/spec/binding/pages_extraction_spec.rb +0 -471
  181. data/spec/binding/tables_spec.rb +0 -641
  182. data/spec/unit/config/chunking_config_spec.rb +0 -213
  183. data/spec/unit/config/embedding_config_spec.rb +0 -343
  184. data/spec/unit/config/extraction_config_spec.rb +0 -438
  185. data/spec/unit/config/font_config_spec.rb +0 -285
  186. data/spec/unit/config/hierarchy_config_spec.rb +0 -314
  187. data/spec/unit/config/image_extraction_config_spec.rb +0 -209
  188. data/spec/unit/config/image_preprocessing_config_spec.rb +0 -249
  189. data/spec/unit/config/keyword_config_spec.rb +0 -229
  190. data/spec/unit/config/language_detection_config_spec.rb +0 -258
  191. data/spec/unit/config/ocr_config_spec.rb +0 -171
  192. data/spec/unit/config/page_config_spec.rb +0 -221
  193. data/spec/unit/config/pdf_config_spec.rb +0 -267
  194. data/spec/unit/config/postprocessor_config_spec.rb +0 -290
  195. data/spec/unit/config/tesseract_config_spec.rb +0 -181
  196. data/spec/unit/config/token_reduction_config_spec.rb +0 -251
  197. data/test/metadata_types_test.rb +0 -959
  198. data/vendor/Cargo.toml +0 -61
  199. data/vendor/kreuzberg/examples/bench_fixes.rs +0 -71
  200. data/vendor/kreuzberg/examples/test_pdfium_fork.rs +0 -62
  201. data/vendor/kreuzberg/src/chunking/processor.rs +0 -219
  202. data/vendor/kreuzberg/src/core/batch_optimizations.rs +0 -385
  203. data/vendor/kreuzberg/src/core/config_validation.rs +0 -949
  204. data/vendor/kreuzberg/src/core/formats.rs +0 -235
  205. data/vendor/kreuzberg/src/core/server_config.rs +0 -1220
  206. data/vendor/kreuzberg/src/extraction/capacity.rs +0 -263
  207. data/vendor/kreuzberg/src/extraction/markdown.rs +0 -216
  208. data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +0 -284
  209. data/vendor/kreuzberg/src/extractors/bibtex.rs +0 -470
  210. data/vendor/kreuzberg/src/extractors/docbook.rs +0 -504
  211. data/vendor/kreuzberg/src/extractors/epub.rs +0 -696
  212. data/vendor/kreuzberg/src/extractors/fictionbook.rs +0 -492
  213. data/vendor/kreuzberg/src/extractors/jats.rs +0 -1054
  214. data/vendor/kreuzberg/src/extractors/jupyter.rs +0 -368
  215. data/vendor/kreuzberg/src/extractors/latex.rs +0 -653
  216. data/vendor/kreuzberg/src/extractors/markdown.rs +0 -701
  217. data/vendor/kreuzberg/src/extractors/odt.rs +0 -628
  218. data/vendor/kreuzberg/src/extractors/opml.rs +0 -635
  219. data/vendor/kreuzberg/src/extractors/orgmode.rs +0 -529
  220. data/vendor/kreuzberg/src/extractors/rst.rs +0 -577
  221. data/vendor/kreuzberg/src/extractors/rtf.rs +0 -809
  222. data/vendor/kreuzberg/src/extractors/security.rs +0 -484
  223. data/vendor/kreuzberg/src/extractors/security_tests.rs +0 -367
  224. data/vendor/kreuzberg/src/extractors/typst.rs +0 -651
  225. data/vendor/kreuzberg/src/language_detection/processor.rs +0 -218
  226. data/vendor/kreuzberg/src/ocr/language_registry.rs +0 -520
  227. data/vendor/kreuzberg/src/panic_context.rs +0 -154
  228. data/vendor/kreuzberg/src/pdf/bindings.rs +0 -306
  229. data/vendor/kreuzberg/src/pdf/bundled.rs +0 -408
  230. data/vendor/kreuzberg/src/pdf/fonts.rs +0 -358
  231. data/vendor/kreuzberg/src/pdf/hierarchy.rs +0 -903
  232. data/vendor/kreuzberg/src/text/quality_processor.rs +0 -231
  233. data/vendor/kreuzberg/src/text/utf8_validation.rs +0 -193
  234. data/vendor/kreuzberg/src/utils/pool.rs +0 -503
  235. data/vendor/kreuzberg/src/utils/pool_sizing.rs +0 -364
  236. data/vendor/kreuzberg/src/utils/string_pool.rs +0 -761
  237. data/vendor/kreuzberg/tests/api_embed.rs +0 -360
  238. data/vendor/kreuzberg/tests/api_extract_multipart.rs +0 -52
  239. data/vendor/kreuzberg/tests/api_large_pdf_extraction.rs +0 -471
  240. data/vendor/kreuzberg/tests/api_large_pdf_extraction_diagnostics.rs +0 -289
  241. data/vendor/kreuzberg/tests/batch_pooling_benchmark.rs +0 -154
  242. data/vendor/kreuzberg/tests/bibtex_parity_test.rs +0 -421
  243. data/vendor/kreuzberg/tests/config_integration_test.rs +0 -753
  244. data/vendor/kreuzberg/tests/data/hierarchy_ground_truth.json +0 -294
  245. data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +0 -500
  246. data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +0 -370
  247. data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +0 -275
  248. data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +0 -228
  249. data/vendor/kreuzberg/tests/html_table_test.rs +0 -551
  250. data/vendor/kreuzberg/tests/instrumentation_test.rs +0 -139
  251. data/vendor/kreuzberg/tests/jats_extractor_tests.rs +0 -639
  252. data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +0 -704
  253. data/vendor/kreuzberg/tests/latex_extractor_tests.rs +0 -496
  254. data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +0 -490
  255. data/vendor/kreuzberg/tests/ocr_language_registry.rs +0 -191
  256. data/vendor/kreuzberg/tests/odt_extractor_tests.rs +0 -674
  257. data/vendor/kreuzberg/tests/opml_extractor_tests.rs +0 -616
  258. data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +0 -822
  259. data/vendor/kreuzberg/tests/page_markers.rs +0 -297
  260. data/vendor/kreuzberg/tests/pdf_hierarchy_detection.rs +0 -301
  261. data/vendor/kreuzberg/tests/pdf_hierarchy_quality.rs +0 -589
  262. data/vendor/kreuzberg/tests/pdf_ocr_triggering.rs +0 -301
  263. data/vendor/kreuzberg/tests/pdf_text_merging.rs +0 -475
  264. data/vendor/kreuzberg/tests/pdfium_linking.rs +0 -340
  265. data/vendor/kreuzberg/tests/rst_extractor_tests.rs +0 -694
  266. data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +0 -775
  267. data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +0 -1260
  268. data/vendor/kreuzberg/tests/typst_extractor_tests.rs +0 -648
  269. data/vendor/kreuzberg-ffi/Cargo.toml +0 -67
  270. data/vendor/kreuzberg-ffi/README.md +0 -851
  271. data/vendor/kreuzberg-ffi/benches/result_view_benchmark.rs +0 -227
  272. data/vendor/kreuzberg-ffi/build.rs +0 -168
  273. data/vendor/kreuzberg-ffi/cbindgen.toml +0 -37
  274. data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +0 -12
  275. data/vendor/kreuzberg-ffi/kreuzberg.h +0 -3012
  276. data/vendor/kreuzberg-ffi/src/batch_streaming.rs +0 -588
  277. data/vendor/kreuzberg-ffi/src/config.rs +0 -1341
  278. data/vendor/kreuzberg-ffi/src/error.rs +0 -901
  279. data/vendor/kreuzberg-ffi/src/extraction.rs +0 -555
  280. data/vendor/kreuzberg-ffi/src/helpers.rs +0 -879
  281. data/vendor/kreuzberg-ffi/src/lib.rs +0 -977
  282. data/vendor/kreuzberg-ffi/src/memory.rs +0 -493
  283. data/vendor/kreuzberg-ffi/src/mime.rs +0 -329
  284. data/vendor/kreuzberg-ffi/src/panic_shield.rs +0 -265
  285. data/vendor/kreuzberg-ffi/src/plugins/document_extractor.rs +0 -442
  286. data/vendor/kreuzberg-ffi/src/plugins/mod.rs +0 -14
  287. data/vendor/kreuzberg-ffi/src/plugins/ocr_backend.rs +0 -628
  288. data/vendor/kreuzberg-ffi/src/plugins/post_processor.rs +0 -438
  289. data/vendor/kreuzberg-ffi/src/plugins/validator.rs +0 -329
  290. data/vendor/kreuzberg-ffi/src/result.rs +0 -510
  291. data/vendor/kreuzberg-ffi/src/result_pool.rs +0 -639
  292. data/vendor/kreuzberg-ffi/src/result_view.rs +0 -773
  293. data/vendor/kreuzberg-ffi/src/string_intern.rs +0 -568
  294. data/vendor/kreuzberg-ffi/src/types.rs +0 -363
  295. data/vendor/kreuzberg-ffi/src/util.rs +0 -210
  296. data/vendor/kreuzberg-ffi/src/validation.rs +0 -848
  297. data/vendor/kreuzberg-ffi/tests.disabled/README.md +0 -48
  298. data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +0 -299
  299. data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +0 -346
  300. data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +0 -232
  301. data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +0 -470
  302. data/vendor/kreuzberg-tesseract/.commitlintrc.json +0 -13
  303. data/vendor/kreuzberg-tesseract/.crate-ignore +0 -2
  304. data/vendor/kreuzberg-tesseract/Cargo.lock +0 -2933
  305. data/vendor/kreuzberg-tesseract/Cargo.toml +0 -57
  306. data/vendor/kreuzberg-tesseract/LICENSE +0 -22
  307. data/vendor/kreuzberg-tesseract/README.md +0 -399
  308. data/vendor/kreuzberg-tesseract/build.rs +0 -1127
  309. data/vendor/kreuzberg-tesseract/patches/README.md +0 -71
  310. data/vendor/kreuzberg-tesseract/patches/tesseract.diff +0 -199
  311. data/vendor/kreuzberg-tesseract/src/api.rs +0 -1371
  312. data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +0 -77
  313. data/vendor/kreuzberg-tesseract/src/enums.rs +0 -297
  314. data/vendor/kreuzberg-tesseract/src/error.rs +0 -81
  315. data/vendor/kreuzberg-tesseract/src/lib.rs +0 -145
  316. data/vendor/kreuzberg-tesseract/src/monitor.rs +0 -57
  317. data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +0 -197
  318. data/vendor/kreuzberg-tesseract/src/page_iterator.rs +0 -253
  319. data/vendor/kreuzberg-tesseract/src/result_iterator.rs +0 -286
  320. data/vendor/kreuzberg-tesseract/src/result_renderer.rs +0 -183
  321. data/vendor/kreuzberg-tesseract/tests/integration_test.rs +0 -211
@@ -1,57 +0,0 @@
1
- [package]
2
- name = "kreuzberg-tesseract"
3
- version = "4.0.0-rc.29"
4
- edition = "2024"
5
- rust-version = "1.91"
6
- authors = ["Na'aman Hirschfeld <nhirschfeld@gmail.com>"]
7
- description = "Rust bindings for Tesseract OCR with cross-compilation, C++17, and caching improvements"
8
- license = "MIT"
9
- repository.workspace = true
10
- homepage = "https://kreuzberg.dev"
11
- documentation = "https://docs.kreuzberg.dev"
12
- readme = "README.md"
13
- keywords = ["tesseract", "ocr", "bindings", "vision", "recognition"]
14
- categories = ["external-ffi-bindings", "computer-vision", "text-processing"]
15
- build = "build.rs"
16
- links = "kreuzberg_tesseract"
17
- exclude = ["tessdata/*", "third_party/*"]
18
-
19
- [dependencies]
20
- libc = { workspace = true }
21
- thiserror = { workspace = true }
22
-
23
- [dev-dependencies]
24
- image = { workspace = true }
25
-
26
- [build-dependencies]
27
- cc = { version = "^1.2.52", optional = true }
28
- cmake = { version = "0.1.57", optional = true }
29
- zip = { version = "7.0.0", optional = true }
30
-
31
- # Use native-tls on Windows to avoid aws-lc-sys CMake build issues with MinGW
32
- [target.'cfg(target_os = "windows")'.build-dependencies]
33
- reqwest = { workspace = true, default-features = false, features = [
34
- "blocking",
35
- "native-tls",
36
- ], optional = true }
37
-
38
- [target.'cfg(not(target_os = "windows"))'.build-dependencies]
39
- reqwest = { workspace = true, default-features = false, features = [
40
- "blocking",
41
- "rustls",
42
- ], optional = true }
43
-
44
- [features]
45
- default = ["static-linking"]
46
- build-tesseract = ["cc", "cmake", "reqwest", "zip"]
47
- build-tesseract-wasm = ["cmake", "reqwest", "zip"]
48
- static-linking = ["build-tesseract"]
49
- dynamic-linking = []
50
-
51
- [package.metadata.docs.rs]
52
- features = ["docs-only"]
53
- rustdoc-args = ["--cfg", "docsrs"]
54
-
55
- [lib]
56
- name = "kreuzberg_tesseract"
57
- crate-type = ["lib"]
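The `[features]` table in the removed manifest above makes `static-linking` the default and has it pull in `build-tesseract`, while `dynamic-linking` enables nothing extra. A minimal sketch of how a build script could branch on those features follows. It relies only on Cargo's documented behavior of exposing each enabled feature to build scripts as `CARGO_FEATURE_<NAME>` (uppercased, with `-` replaced by `_`). The `LinkMode` enum, the `link_mode` function, and the rule that static linking wins when both features are enabled are assumptions made for illustration; the crate's actual `build.rs` is not shown in this diff and may work differently.

```rust
/// Linking strategy implied by the enabled Cargo features (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum LinkMode {
    Static,
    Dynamic,
}

/// Decide the link mode from a lookup that reports whether an environment
/// variable is set. Cargo sets `CARGO_FEATURE_<NAME>` for every enabled
/// feature when it runs a build script.
fn link_mode(is_set: impl Fn(&str) -> bool) -> LinkMode {
    // Translate a feature name such as "static-linking" into the variable
    // name Cargo uses, e.g. "CARGO_FEATURE_STATIC_LINKING".
    let enabled = |feature: &str| {
        let key = format!(
            "CARGO_FEATURE_{}",
            feature.to_uppercase().replace('-', "_")
        );
        is_set(&key)
    };

    // Assumption: static linking takes precedence if both features are on.
    if enabled("static-linking") || enabled("build-tesseract") {
        LinkMode::Static
    } else {
        LinkMode::Dynamic
    }
}

fn main() {
    // Inside a real build script this reads the variables Cargo provides.
    let mode = link_mode(|key| std::env::var_os(key).is_some());
    println!("link mode: {:?}", mode);
}
```

Passing the lookup as a closure keeps the decision logic independent of the process environment, so it can be exercised without running Cargo.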
@@ -1,22 +0,0 @@
1
- MIT License
2
-
3
- Copyright (c) 2024 Cafer Can Gündoğdu
4
- Copyright (c) 2025 Na'aman Hirschfeld
5
-
6
- Permission is hereby granted, free of charge, to any person obtaining a copy
7
- of this software and associated documentation files (the "Software"), to deal
8
- in the Software without restriction, including without limitation the rights
9
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
- copies of the Software, and to permit persons to whom the Software is
11
- furnished to do so, subject to the following conditions:
12
-
13
- The above copyright notice and this permission notice shall be included in all
14
- copies or substantial portions of the Software.
15
-
16
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
- SOFTWARE.
@@ -1,399 +0,0 @@
1
- # kreuzberg-tesseract
2
-
3
- Rust bindings for Tesseract OCR with built-in compilation of Tesseract and Leptonica libraries. Provides a safe and idiomatic Rust interface to Tesseract's functionality while handling the complexity of compiling the underlying C++ libraries.
4
-
5
- Based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by Cafer Can Gündoğdu, this maintained version adds critical improvements for production use:
6
-
7
- - **C++17 Support**: Upgraded for Tesseract 5.5.1 which requires C++17 filesystem
8
- - **Cross-Compilation**: Fixed CXX compiler detection for cross-platform builds
9
- - **Architecture Validation**: Validates target architecture before using cached libraries
10
- - **Windows Static Linking**: Fixed MSVC static linking issues
11
- - **Build Caching**: Improved caching with OUT_DIR-based cache directory
12
- - **MinGW Support**: Added support for MinGW toolchains
13
-
14
- ## Features
15
-
16
- - Safe Rust bindings for Tesseract OCR
17
- - **Multiple linking options:**
18
- - **Static linking** (default): Built-in compilation with no runtime dependencies
19
- - **Dynamic linking**: Link to system-installed libraries for faster builds
20
- - Uses existing Tesseract training data (expects English data for tests)
21
- - High-level Rust API for common OCR tasks
22
- - Caching of compiled libraries for faster subsequent builds
23
- - Support for multiple operating systems (Linux, macOS, Windows)
24
-
25
- ## Installation
26
-
27
- ### Static Linking (Default)
28
-
29
- Static linking builds Tesseract and Leptonica from source and embeds them in your binary. No runtime dependencies required:
30
-
31
- ```toml
32
- [dependencies]
33
- kreuzberg-tesseract = "1.0.0-rc.1"
34
- # or explicitly:
35
- kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["static-linking"] }
36
- ```
37
-
38
- ### Dynamic Linking
39
-
40
- Dynamic linking uses system-installed Tesseract and Leptonica libraries. Faster builds, but requires libraries installed on the system:
41
-
42
- ```toml
43
- [dependencies]
44
- kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["dynamic-linking"], default-features = false }
45
- ```
46
-
47
- **System requirements for dynamic linking:**
48
- - Tesseract 5.x libraries installed (`libtesseract`, `libleptonica`)
49
- - macOS: `brew install tesseract leptonica`
50
- - Ubuntu/Debian: `sudo apt-get install libtesseract-dev libleptonica-dev`
51
- - RHEL/CentOS/Fedora: `sudo dnf install tesseract-devel leptonica-devel`
52
- - Windows: Install from [Tesseract releases](https://github.com/tesseract-ocr/tesseract/releases) or vcpkg
53
-
54
- ### Development Dependencies
55
-
56
- For development and testing, you'll also need these dependencies:
57
-
58
- ```toml
59
- [dev-dependencies]
60
- image = "0.25.5"
61
- ```
62
-
63
- ## System Requirements
64
-
65
- ### For Static Linking (Default)
66
-
67
- When building with static linking, the crate will compile Tesseract and Leptonica from source. You need:
68
-
69
- - Rust 1.85.0 or later
70
- - A C++ compiler (e.g., gcc, clang, MSVC on Windows)
71
- - CMake 3.x or later
72
- - Internet connection (for downloading Tesseract source code)
73
-
74
- ### For Dynamic Linking
75
-
76
- When using dynamic linking with system-installed libraries, you need:
77
-
78
- - Rust 1.85.0 or later
79
- - Tesseract 5.x and Leptonica libraries installed on your system (see Installation section)
80
- - Internet connection (for downloading Tesseract source code)
81
-
82
- No C++ compiler or CMake required for dynamic linking builds.
83
-
84
- For a full development environment checklist (including optional tooling suggestions), see [CONTRIBUTING.md](CONTRIBUTING.md).
85
-
86
- ## Environment Variables
87
-
88
- The following environment variables affect the build and test process:
89
-
90
- ### Build Variables
91
-
92
- - `CARGO_CLEAN`: If set, cleans the cache directory before building
93
- - `RUSTC_WRAPPER`: If set to "sccache", enables compiler caching with sccache
94
- - `CC`: Compiler selection for C code (affects Linux builds)
95
- - `HOME` (Unix) or `APPDATA` (Windows): Used to determine cache directory location
96
- - `TESSERACT_RS_CACHE_DIR`: Optional override for the cache root. When unset or not writable, the build falls back to the default OS-specific directory, and if that still fails, a temporary directory under the system temp folder is used automatically.
97
-
98
- ### Test Variables
99
-
100
- - `TESSDATA_PREFIX` (Optional): Path to override the default tessdata directory. If not set, the crate will use its default cache directory.
101
-
102
- ## Cache and Data Directories
103
-
104
- The crate uses the following directory structure based on your operating system:
105
-
106
- - macOS: `~/Library/Application Support/tesseract-rs`
107
- - Linux: `~/.tesseract-rs`
108
- - Windows: `%APPDATA%/tesseract-rs`
109
-
110
- The cache includes:
111
-
112
- - Compiled Tesseract and Leptonica libraries
113
- - Third-party source code
114
-
115
- Training data is not downloaded during the build. Provide `eng.traineddata` (and any other languages you need) via `TESSDATA_PREFIX` or your system Tesseract installation.
116
-
117
- ## Testing
118
-
119
- The project includes several integration tests that verify OCR functionality. To run the tests:
120
-
121
- 1. Ensure you have the required test dependencies:
122
-
123
- ```toml
124
- [dev-dependencies]
125
- image = "0.25.9"
126
- ```
127
-
128
- 2. Run the tests:
129
- ```bash
130
- cargo test
131
- ```
132
-
133
- Note: Make sure `eng.traineddata` is available in your tessdata directory before running tests. If `TESSDATA_PREFIX` is not set, the tests look in the default cache location. You can point the tests at a custom tessdata directory by setting:
134
-
135
- ```bash
136
- # Linux/macOS
137
- export TESSDATA_PREFIX=/path/to/custom/tessdata
138
-
139
- # Windows (PowerShell)
140
- $env:TESSDATA_PREFIX="C:\path\to\custom\tessdata"
141
- ```
142
-
143
- Available test cases:
144
-
145
- - OCR on English sample images
146
- - Error handling and invalid input coverage
147
-
148
- Test images are sourced from the shared `test_documents/` directory in the repository:
149
-
150
- - `images/test_hello_world.png`: Simple English text
151
- - `tables/simple_table.png`: Basic table with English headers
152
-
153
- ## Usage
154
-
155
- Here's a basic example of how to use `tesseract-rs`:
156
-
157
- ```rust
158
- use std::path::PathBuf;
159
- use std::error::Error;
160
- use kreuzberg_tesseract::TesseractAPI;
161
-
162
- fn get_default_tessdata_dir() -> PathBuf {
163
- if cfg!(target_os = "macos") {
164
- let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
165
- PathBuf::from(home_dir)
166
- .join("Library")
167
- .join("Application Support")
168
- .join("tesseract-rs")
169
- .join("tessdata")
170
- } else if cfg!(target_os = "linux") {
171
- let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
172
- PathBuf::from(home_dir)
173
- .join(".tesseract-rs")
174
- .join("tessdata")
175
- } else if cfg!(target_os = "windows") {
176
- PathBuf::from(std::env::var("APPDATA").expect("APPDATA environment variable not set"))
177
- .join("tesseract-rs")
178
- .join("tessdata")
179
- } else {
180
- panic!("Unsupported operating system");
181
- }
182
- }
183
-
184
- fn get_tessdata_dir() -> PathBuf {
185
- match std::env::var("TESSDATA_PREFIX") {
186
- Ok(dir) => {
187
- let path = PathBuf::from(dir);
188
- println!("Using TESSDATA_PREFIX directory: {:?}", path);
189
- path
190
- }
191
-         Err(_) => {
-             let default_dir = get_default_tessdata_dir();
-             println!(
-                 "TESSDATA_PREFIX not set, using default directory: {:?}",
-                 default_dir
-             );
-             default_dir
-         }
-     }
- }
-
- fn main() -> Result<(), Box<dyn Error>> {
-     let api = TesseractAPI::new()?;
-
-     // Get tessdata directory (uses default location or TESSDATA_PREFIX if set)
-     let tessdata_dir = get_tessdata_dir();
-     api.init(tessdata_dir.to_str().unwrap(), "eng")?;
-
-     let width = 24;
-     let height = 24;
-     let bytes_per_pixel = 1;
-     let bytes_per_line = width * bytes_per_pixel;
-
-     // Initialize image data with all white pixels
-     let mut image_data = vec![255u8; width * height];
-
-     // Draw number 9 with clearer distinction
-     for y in 4..19 {
-         for x in 7..17 {
-             // Top bar
-             if y == 4 && x >= 8 && x <= 15 {
-                 image_data[y * width + x] = 0;
-             }
-             // Top curve left side
-             if y >= 4 && y <= 10 && x == 7 {
-                 image_data[y * width + x] = 0;
-             }
-             // Top curve right side
-             if y >= 4 && y <= 11 && x == 16 {
-                 image_data[y * width + x] = 0;
-             }
-             // Middle bar
-             if y == 11 && x >= 8 && x <= 15 {
-                 image_data[y * width + x] = 0;
-             }
-             // Bottom right vertical line
-             if y >= 11 && y <= 18 && x == 16 {
-                 image_data[y * width + x] = 0;
-             }
-             // Bottom bar
-             if y == 18 && x >= 8 && x <= 15 {
-                 image_data[y * width + x] = 0;
-             }
-         }
-     }
-
-     // Set the image data
-     api.set_image(
-         &image_data,
-         width.try_into().unwrap(),
-         height.try_into().unwrap(),
-         bytes_per_pixel.try_into().unwrap(),
-         bytes_per_line.try_into().unwrap(),
-     )?;
-
-     // Set whitelist for digits only
-     api.set_variable("tessedit_char_whitelist", "0123456789")?;
-
-     // Set PSM mode to single character
-     api.set_variable("tessedit_pageseg_mode", "10")?;
-
-     // Get the recognized text
-     let text = api.get_utf8_text()?;
-     println!("Recognized text: {}", text.trim());
-
-     Ok(())
- }
- ```
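The pixel-drawing part of the removed example can be checked without Tesseract installed. The following is a minimal sketch, assuming the same 24x24 single-channel layout (255 = white, 0 = black) and the same stroke coordinates as above. The helper names `draw_nine` and `render_ascii` are hypothetical and are not part of `kreuzberg-tesseract`.

```rust
/// Build a `width * height` 8-bit grayscale buffer (255 = white, 0 = black)
/// containing the same "9" strokes as the README example.
/// Hypothetical helper for illustration only.
fn draw_nine(width: usize, height: usize) -> Vec<u8> {
    // The strokes reach column 16 and row 18, so the canvas must cover them.
    assert!(width >= 17 && height >= 19, "canvas too small for the glyph");

    let mut image_data = vec![255u8; width * height];
    for y in 4..19 {
        for x in 7..17 {
            // Top, middle and bottom horizontal bars
            let horizontal_bar = (y == 4 || y == 11 || y == 18) && (8..=15).contains(&x);
            // Left side of the upper loop
            let left_stroke = (4..=10).contains(&y) && x == 7;
            // Right side: upper loop plus the lower vertical line
            let right_stroke = (4..=18).contains(&y) && x == 16;

            if horizontal_bar || left_stroke || right_stroke {
                // Row-major indexing: one byte per pixel
                image_data[y * width + x] = 0;
            }
        }
    }
    image_data
}

/// Render the buffer as text, one line per pixel row ('#' = black, '.' = white).
fn render_ascii(image_data: &[u8], width: usize) -> String {
    image_data
        .chunks(width)
        .map(|row| {
            row.iter()
                .map(|&p| if p == 0 { '#' } else { '.' })
                .collect::<String>()
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let image_data = draw_nine(24, 24);
    println!("{}", render_ascii(&image_data, 24));
}
```

Printing the buffer this way makes it easy to confirm that the glyph looks like a "9" before handing the same bytes to `set_image`.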
-
- ## Advanced Usage
-
- The API provides additional functionality for more complex OCR tasks, including thread-safe operations:
-
- ```rust
- use kreuzberg_tesseract::TesseractAPI;
- use std::sync::Arc;
- use std::thread;
- use std::error::Error;
-
- fn main() -> Result<(), Box<dyn Error>> {
-     let tessdata_dir = get_tessdata_dir();
-     let api = TesseractAPI::new()?;
-
-     // Initialize the main API
-     api.init(tessdata_dir.to_str().unwrap(), "eng")?;
-     api.set_variable("tessedit_pageseg_mode", "1")?;
-
-     // Load and prepare image data
-     let (image_data, width, height) = load_test_image("sample_text.png")?;
-
-     // Share image data across threads
-     let image_data = Arc::new(image_data);
-     let mut handles = vec![];
-
-     // Spawn multiple threads for parallel OCR processing
-     for _ in 0..3 {
-         let api_clone = api.clone(); // Clones the API with all configurations
-         let image_data = Arc::clone(&image_data);
-
-         let handle = thread::spawn(move || {
-             // Set image in each thread
-             let res = api_clone.set_image(
-                 &image_data,
-                 width as i32,
-                 height as i32,
-                 3,
-                 3 * width as i32,
-             );
-             assert!(res.is_ok());
-
-             // Perform OCR in parallel
-             let text = api_clone.get_utf8_text()
-                 .expect("Failed to get text");
-             println!("Thread result: {}", text);
-         });
-         handles.push(handle);
-     }
-
-     // Wait for all threads to complete
-     for handle in handles {
-         handle.join().unwrap();
-     }
-
-     Ok(())
- }
-
- // Helper function to get tessdata directory
- fn get_tessdata_dir() -> PathBuf {
-     // ... (implementation as shown in basic example)
- }
-
- // Helper function to load test image
- fn load_test_image(filename: &str) -> Result<(Vec<u8>, u32, u32), Box<dyn Error>> {
-     let img = image::open(filename)?
-         .to_rgb8();
-     let (width, height) = img.dimensions();
-     Ok((img.into_raw(), width, height))
- }
- ```
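The threaded example passes `width as i32` and `3 * width as i32` straight to `set_image`, which assumes the buffer really holds `height` rows of tightly packed RGB pixels. As a minimal sketch of how that assumption could be checked first, the hypothetical helper below (`checked_stride`, not part of `kreuzberg-tesseract`) computes the bytes-per-line value and rejects buffers whose length does not match, using only the standard library.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not part of kreuzberg-tesseract): compute the row
/// stride in bytes and verify that the buffer holds exactly `height` rows.
fn checked_stride(
    buffer_len: usize,
    width: u32,
    height: u32,
    bytes_per_pixel: u32,
) -> Result<i32, String> {
    // Bytes per row; checked so that huge dimensions cannot wrap around.
    let stride = width
        .checked_mul(bytes_per_pixel)
        .ok_or("row stride overflows u32")?;

    // Total bytes expected for a tightly packed image.
    let expected = (stride as usize)
        .checked_mul(height as usize)
        .ok_or("image size overflows usize")?;

    if buffer_len != expected {
        return Err(format!(
            "buffer holds {} bytes, expected {}",
            buffer_len, expected
        ));
    }

    // The example passes the stride as an i32, so make sure it fits.
    i32::try_from(stride).map_err(|_| "row stride does not fit in i32".to_string())
}

fn main() {
    // A 4x2 RGB image needs 4 * 2 * 3 = 24 bytes.
    let (width, height) = (4u32, 2u32);
    let rgb = vec![0u8; 24];

    match checked_stride(rgb.len(), width, height, 3) {
        Ok(stride) => println!("stride = {} bytes per line", stride),
        Err(e) => eprintln!("invalid image buffer: {}", e),
    }
}
```

Running a check like this once, before the buffer is wrapped in `Arc` and shared, keeps the validation out of the worker threads.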
-
- ## Building
-
- ### Static Linking (Default)
-
- With static linking, the crate will automatically download and compile Tesseract and Leptonica during the build process. This may take some time on the first build (5-10 minutes), but subsequent builds will use the cached libraries.
-
- To clean the cache and force a rebuild:
-
- ```bash
- CARGO_CLEAN=1 cargo build
- ```
-
- ### Dynamic Linking
-
- With dynamic linking, the build is much faster (seconds instead of minutes) since it only links against system-installed libraries:
-
- ```bash
- cargo build --no-default-features --features dynamic-linking
- ```
-
- **Note**: Dynamic linking requires Tesseract and Leptonica to be installed on your system (see Installation section).
-
- ## Documentation
-
- For more detailed information, please check the [API documentation](https://docs.rs/kreuzberg-tesseract).
-
- ## License
-
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
- ## Acknowledgements
-
- This project is based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by [Cafer Can Gündoğdu](https://github.com/cafercangundogdu). We are grateful for the foundational work that made this project possible.
-
- ## Contributing
-
- We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
-
- ### Quick Start for Contributors
-
- 1. Fork and clone the repository
- 2. Install uv and set up git hooks:
-    ```bash
-    curl -LsSf https://astral.sh/uv/install.sh | sh
-    uvx prek install
-    ```
- 3. Make your changes following our commit message format
- 4. Run tests: `cargo test`
- 5. Submit a Pull Request
-
- Our commit messages follow the [Conventional Commits](https://www.conventionalcommits.org/) specification.
-
- ## Acknowledgements
-
- This project uses [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) and [Leptonica](http://leptonica.org/). We are grateful to the maintainers and contributors of these projects.
-
- ```
-
- ```