kreuzberg 4.0.0.pre.rc.29 → 4.0.0.rc1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (321)
  1. checksums.yaml +4 -4
  2. data/.gitignore +0 -6
  3. data/.rubocop.yaml +534 -1
  4. data/Gemfile +2 -1
  5. data/Gemfile.lock +28 -116
  6. data/README.md +269 -629
  7. data/Rakefile +0 -9
  8. data/Steepfile +4 -8
  9. data/examples/async_patterns.rb +58 -1
  10. data/ext/kreuzberg_rb/extconf.rb +5 -35
  11. data/ext/kreuzberg_rb/native/Cargo.toml +16 -55
  12. data/ext/kreuzberg_rb/native/build.rs +14 -12
  13. data/ext/kreuzberg_rb/native/include/ieeefp.h +1 -1
  14. data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +1 -1
  15. data/ext/kreuzberg_rb/native/include/strings.h +2 -2
  16. data/ext/kreuzberg_rb/native/include/unistd.h +1 -1
  17. data/ext/kreuzberg_rb/native/src/lib.rs +34 -897
  18. data/extconf.rb +6 -38
  19. data/kreuzberg.gemspec +20 -114
  20. data/lib/kreuzberg/api_proxy.rb +18 -2
  21. data/lib/kreuzberg/cache_api.rb +0 -22
  22. data/lib/kreuzberg/cli.rb +10 -2
  23. data/lib/kreuzberg/cli_proxy.rb +10 -0
  24. data/lib/kreuzberg/config.rb +22 -274
  25. data/lib/kreuzberg/errors.rb +7 -73
  26. data/lib/kreuzberg/extraction_api.rb +8 -237
  27. data/lib/kreuzberg/mcp_proxy.rb +11 -2
  28. data/lib/kreuzberg/ocr_backend_protocol.rb +73 -0
  29. data/lib/kreuzberg/post_processor_protocol.rb +71 -0
  30. data/lib/kreuzberg/result.rb +33 -151
  31. data/lib/kreuzberg/setup_lib_path.rb +2 -22
  32. data/lib/kreuzberg/validator_protocol.rb +73 -0
  33. data/lib/kreuzberg/version.rb +1 -1
  34. data/lib/kreuzberg.rb +13 -27
  35. data/pkg/kreuzberg-4.0.0.rc1.gem +0 -0
  36. data/sig/kreuzberg.rbs +12 -105
  37. data/spec/binding/cache_spec.rb +22 -22
  38. data/spec/binding/cli_proxy_spec.rb +4 -2
  39. data/spec/binding/cli_spec.rb +11 -12
  40. data/spec/binding/config_spec.rb +0 -74
  41. data/spec/binding/config_validation_spec.rb +6 -100
  42. data/spec/binding/error_handling_spec.rb +97 -283
  43. data/spec/binding/plugins/ocr_backend_spec.rb +8 -8
  44. data/spec/binding/plugins/postprocessor_spec.rb +11 -11
  45. data/spec/binding/plugins/validator_spec.rb +13 -12
  46. data/spec/examples.txt +104 -0
  47. data/spec/fixtures/config.toml +1 -0
  48. data/spec/fixtures/config.yaml +1 -0
  49. data/spec/fixtures/invalid_config.toml +1 -0
  50. data/spec/smoke/package_spec.rb +3 -2
  51. data/spec/spec_helper.rb +3 -1
  52. data/vendor/kreuzberg/Cargo.toml +67 -192
  53. data/vendor/kreuzberg/README.md +9 -97
  54. data/vendor/kreuzberg/build.rs +194 -516
  55. data/vendor/kreuzberg/src/api/handlers.rs +9 -130
  56. data/vendor/kreuzberg/src/api/mod.rs +3 -18
  57. data/vendor/kreuzberg/src/api/server.rs +71 -236
  58. data/vendor/kreuzberg/src/api/types.rs +7 -43
  59. data/vendor/kreuzberg/src/bin/profile_extract.rs +455 -0
  60. data/vendor/kreuzberg/src/cache/mod.rs +3 -27
  61. data/vendor/kreuzberg/src/chunking/mod.rs +79 -1705
  62. data/vendor/kreuzberg/src/core/batch_mode.rs +0 -60
  63. data/vendor/kreuzberg/src/core/config.rs +23 -905
  64. data/vendor/kreuzberg/src/core/extractor.rs +106 -403
  65. data/vendor/kreuzberg/src/core/io.rs +2 -4
  66. data/vendor/kreuzberg/src/core/mime.rs +12 -2
  67. data/vendor/kreuzberg/src/core/mod.rs +3 -22
  68. data/vendor/kreuzberg/src/core/pipeline.rs +78 -395
  69. data/vendor/kreuzberg/src/embeddings.rs +21 -169
  70. data/vendor/kreuzberg/src/error.rs +2 -2
  71. data/vendor/kreuzberg/src/extraction/archive.rs +31 -36
  72. data/vendor/kreuzberg/src/extraction/docx.rs +1 -365
  73. data/vendor/kreuzberg/src/extraction/email.rs +11 -12
  74. data/vendor/kreuzberg/src/extraction/excel.rs +129 -138
  75. data/vendor/kreuzberg/src/extraction/html.rs +170 -1447
  76. data/vendor/kreuzberg/src/extraction/image.rs +14 -138
  77. data/vendor/kreuzberg/src/extraction/libreoffice.rs +3 -13
  78. data/vendor/kreuzberg/src/extraction/mod.rs +5 -21
  79. data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +0 -2
  80. data/vendor/kreuzberg/src/extraction/pandoc/batch.rs +275 -0
  81. data/vendor/kreuzberg/src/extraction/pandoc/mime_types.rs +178 -0
  82. data/vendor/kreuzberg/src/extraction/pandoc/mod.rs +491 -0
  83. data/vendor/kreuzberg/src/extraction/pandoc/server.rs +496 -0
  84. data/vendor/kreuzberg/src/extraction/pandoc/subprocess.rs +1188 -0
  85. data/vendor/kreuzberg/src/extraction/pandoc/version.rs +162 -0
  86. data/vendor/kreuzberg/src/extraction/pptx.rs +94 -196
  87. data/vendor/kreuzberg/src/extraction/structured.rs +4 -5
  88. data/vendor/kreuzberg/src/extraction/table.rs +1 -2
  89. data/vendor/kreuzberg/src/extraction/text.rs +10 -18
  90. data/vendor/kreuzberg/src/extractors/archive.rs +0 -22
  91. data/vendor/kreuzberg/src/extractors/docx.rs +148 -69
  92. data/vendor/kreuzberg/src/extractors/email.rs +9 -37
  93. data/vendor/kreuzberg/src/extractors/excel.rs +40 -81
  94. data/vendor/kreuzberg/src/extractors/html.rs +173 -182
  95. data/vendor/kreuzberg/src/extractors/image.rs +8 -32
  96. data/vendor/kreuzberg/src/extractors/mod.rs +10 -171
  97. data/vendor/kreuzberg/src/extractors/pandoc.rs +201 -0
  98. data/vendor/kreuzberg/src/extractors/pdf.rs +64 -329
  99. data/vendor/kreuzberg/src/extractors/pptx.rs +34 -79
  100. data/vendor/kreuzberg/src/extractors/structured.rs +0 -16
  101. data/vendor/kreuzberg/src/extractors/text.rs +7 -30
  102. data/vendor/kreuzberg/src/extractors/xml.rs +8 -27
  103. data/vendor/kreuzberg/src/keywords/processor.rs +1 -9
  104. data/vendor/kreuzberg/src/keywords/rake.rs +1 -0
  105. data/vendor/kreuzberg/src/language_detection/mod.rs +51 -94
  106. data/vendor/kreuzberg/src/lib.rs +5 -17
  107. data/vendor/kreuzberg/src/mcp/mod.rs +1 -4
  108. data/vendor/kreuzberg/src/mcp/server.rs +21 -145
  109. data/vendor/kreuzberg/src/ocr/mod.rs +0 -2
  110. data/vendor/kreuzberg/src/ocr/processor.rs +8 -19
  111. data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +0 -2
  112. data/vendor/kreuzberg/src/pdf/error.rs +1 -93
  113. data/vendor/kreuzberg/src/pdf/metadata.rs +100 -263
  114. data/vendor/kreuzberg/src/pdf/mod.rs +2 -33
  115. data/vendor/kreuzberg/src/pdf/rendering.rs +12 -12
  116. data/vendor/kreuzberg/src/pdf/table.rs +64 -61
  117. data/vendor/kreuzberg/src/pdf/text.rs +24 -416
  118. data/vendor/kreuzberg/src/plugins/extractor.rs +8 -40
  119. data/vendor/kreuzberg/src/plugins/mod.rs +0 -3
  120. data/vendor/kreuzberg/src/plugins/ocr.rs +14 -22
  121. data/vendor/kreuzberg/src/plugins/processor.rs +1 -10
  122. data/vendor/kreuzberg/src/plugins/registry.rs +0 -15
  123. data/vendor/kreuzberg/src/plugins/validator.rs +8 -20
  124. data/vendor/kreuzberg/src/stopwords/mod.rs +2 -2
  125. data/vendor/kreuzberg/src/text/mod.rs +0 -8
  126. data/vendor/kreuzberg/src/text/quality.rs +15 -28
  127. data/vendor/kreuzberg/src/text/string_utils.rs +10 -22
  128. data/vendor/kreuzberg/src/text/token_reduction/core.rs +50 -86
  129. data/vendor/kreuzberg/src/text/token_reduction/filters.rs +16 -37
  130. data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +1 -2
  131. data/vendor/kreuzberg/src/types.rs +67 -907
  132. data/vendor/kreuzberg/src/utils/mod.rs +0 -14
  133. data/vendor/kreuzberg/src/utils/quality.rs +3 -12
  134. data/vendor/kreuzberg/tests/api_tests.rs +0 -506
  135. data/vendor/kreuzberg/tests/archive_integration.rs +0 -2
  136. data/vendor/kreuzberg/tests/batch_orchestration.rs +12 -57
  137. data/vendor/kreuzberg/tests/batch_processing.rs +8 -32
  138. data/vendor/kreuzberg/tests/chunking_offset_demo.rs +92 -0
  139. data/vendor/kreuzberg/tests/concurrency_stress.rs +8 -40
  140. data/vendor/kreuzberg/tests/config_features.rs +1 -33
  141. data/vendor/kreuzberg/tests/config_loading_tests.rs +39 -16
  142. data/vendor/kreuzberg/tests/core_integration.rs +9 -35
  143. data/vendor/kreuzberg/tests/csv_integration.rs +81 -71
  144. data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +25 -23
  145. data/vendor/kreuzberg/tests/email_integration.rs +1 -3
  146. data/vendor/kreuzberg/tests/error_handling.rs +34 -43
  147. data/vendor/kreuzberg/tests/format_integration.rs +1 -7
  148. data/vendor/kreuzberg/tests/helpers/mod.rs +0 -60
  149. data/vendor/kreuzberg/tests/image_integration.rs +0 -2
  150. data/vendor/kreuzberg/tests/mime_detection.rs +16 -17
  151. data/vendor/kreuzberg/tests/ocr_configuration.rs +0 -4
  152. data/vendor/kreuzberg/tests/ocr_errors.rs +0 -22
  153. data/vendor/kreuzberg/tests/ocr_quality.rs +0 -2
  154. data/vendor/kreuzberg/tests/pandoc_integration.rs +503 -0
  155. data/vendor/kreuzberg/tests/pdf_integration.rs +0 -2
  156. data/vendor/kreuzberg/tests/pipeline_integration.rs +2 -36
  157. data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +0 -5
  158. data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +1 -17
  159. data/vendor/kreuzberg/tests/plugin_system.rs +0 -6
  160. data/vendor/kreuzberg/tests/registry_integration_tests.rs +22 -2
  161. data/vendor/kreuzberg/tests/security_validation.rs +1 -13
  162. data/vendor/kreuzberg/tests/test_fastembed.rs +23 -45
  163. metadata +25 -171
  164. data/.rubocop.yml +0 -543
  165. data/ext/kreuzberg_rb/native/.cargo/config.toml +0 -23
  166. data/ext/kreuzberg_rb/native/Cargo.lock +0 -7619
  167. data/lib/kreuzberg/error_context.rb +0 -136
  168. data/lib/kreuzberg/types.rb +0 -170
  169. data/lib/libpdfium.so +0 -0
  170. data/spec/binding/async_operations_spec.rb +0 -473
  171. data/spec/binding/batch_operations_spec.rb +0 -595
  172. data/spec/binding/batch_spec.rb +0 -359
  173. data/spec/binding/config_result_spec.rb +0 -377
  174. data/spec/binding/embeddings_spec.rb +0 -816
  175. data/spec/binding/error_recovery_spec.rb +0 -488
  176. data/spec/binding/font_config_spec.rb +0 -220
  177. data/spec/binding/images_spec.rb +0 -738
  178. data/spec/binding/keywords_extraction_spec.rb +0 -600
  179. data/spec/binding/metadata_types_spec.rb +0 -1228
  180. data/spec/binding/pages_extraction_spec.rb +0 -471
  181. data/spec/binding/tables_spec.rb +0 -641
  182. data/spec/unit/config/chunking_config_spec.rb +0 -213
  183. data/spec/unit/config/embedding_config_spec.rb +0 -343
  184. data/spec/unit/config/extraction_config_spec.rb +0 -438
  185. data/spec/unit/config/font_config_spec.rb +0 -285
  186. data/spec/unit/config/hierarchy_config_spec.rb +0 -314
  187. data/spec/unit/config/image_extraction_config_spec.rb +0 -209
  188. data/spec/unit/config/image_preprocessing_config_spec.rb +0 -249
  189. data/spec/unit/config/keyword_config_spec.rb +0 -229
  190. data/spec/unit/config/language_detection_config_spec.rb +0 -258
  191. data/spec/unit/config/ocr_config_spec.rb +0 -171
  192. data/spec/unit/config/page_config_spec.rb +0 -221
  193. data/spec/unit/config/pdf_config_spec.rb +0 -267
  194. data/spec/unit/config/postprocessor_config_spec.rb +0 -290
  195. data/spec/unit/config/tesseract_config_spec.rb +0 -181
  196. data/spec/unit/config/token_reduction_config_spec.rb +0 -251
  197. data/test/metadata_types_test.rb +0 -959
  198. data/vendor/Cargo.toml +0 -61
  199. data/vendor/kreuzberg/examples/bench_fixes.rs +0 -71
  200. data/vendor/kreuzberg/examples/test_pdfium_fork.rs +0 -62
  201. data/vendor/kreuzberg/src/chunking/processor.rs +0 -219
  202. data/vendor/kreuzberg/src/core/batch_optimizations.rs +0 -385
  203. data/vendor/kreuzberg/src/core/config_validation.rs +0 -949
  204. data/vendor/kreuzberg/src/core/formats.rs +0 -235
  205. data/vendor/kreuzberg/src/core/server_config.rs +0 -1220
  206. data/vendor/kreuzberg/src/extraction/capacity.rs +0 -263
  207. data/vendor/kreuzberg/src/extraction/markdown.rs +0 -216
  208. data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +0 -284
  209. data/vendor/kreuzberg/src/extractors/bibtex.rs +0 -470
  210. data/vendor/kreuzberg/src/extractors/docbook.rs +0 -504
  211. data/vendor/kreuzberg/src/extractors/epub.rs +0 -696
  212. data/vendor/kreuzberg/src/extractors/fictionbook.rs +0 -492
  213. data/vendor/kreuzberg/src/extractors/jats.rs +0 -1054
  214. data/vendor/kreuzberg/src/extractors/jupyter.rs +0 -368
  215. data/vendor/kreuzberg/src/extractors/latex.rs +0 -653
  216. data/vendor/kreuzberg/src/extractors/markdown.rs +0 -701
  217. data/vendor/kreuzberg/src/extractors/odt.rs +0 -628
  218. data/vendor/kreuzberg/src/extractors/opml.rs +0 -635
  219. data/vendor/kreuzberg/src/extractors/orgmode.rs +0 -529
  220. data/vendor/kreuzberg/src/extractors/rst.rs +0 -577
  221. data/vendor/kreuzberg/src/extractors/rtf.rs +0 -809
  222. data/vendor/kreuzberg/src/extractors/security.rs +0 -484
  223. data/vendor/kreuzberg/src/extractors/security_tests.rs +0 -367
  224. data/vendor/kreuzberg/src/extractors/typst.rs +0 -651
  225. data/vendor/kreuzberg/src/language_detection/processor.rs +0 -218
  226. data/vendor/kreuzberg/src/ocr/language_registry.rs +0 -520
  227. data/vendor/kreuzberg/src/panic_context.rs +0 -154
  228. data/vendor/kreuzberg/src/pdf/bindings.rs +0 -306
  229. data/vendor/kreuzberg/src/pdf/bundled.rs +0 -408
  230. data/vendor/kreuzberg/src/pdf/fonts.rs +0 -358
  231. data/vendor/kreuzberg/src/pdf/hierarchy.rs +0 -903
  232. data/vendor/kreuzberg/src/text/quality_processor.rs +0 -231
  233. data/vendor/kreuzberg/src/text/utf8_validation.rs +0 -193
  234. data/vendor/kreuzberg/src/utils/pool.rs +0 -503
  235. data/vendor/kreuzberg/src/utils/pool_sizing.rs +0 -364
  236. data/vendor/kreuzberg/src/utils/string_pool.rs +0 -761
  237. data/vendor/kreuzberg/tests/api_embed.rs +0 -360
  238. data/vendor/kreuzberg/tests/api_extract_multipart.rs +0 -52
  239. data/vendor/kreuzberg/tests/api_large_pdf_extraction.rs +0 -471
  240. data/vendor/kreuzberg/tests/api_large_pdf_extraction_diagnostics.rs +0 -289
  241. data/vendor/kreuzberg/tests/batch_pooling_benchmark.rs +0 -154
  242. data/vendor/kreuzberg/tests/bibtex_parity_test.rs +0 -421
  243. data/vendor/kreuzberg/tests/config_integration_test.rs +0 -753
  244. data/vendor/kreuzberg/tests/data/hierarchy_ground_truth.json +0 -294
  245. data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +0 -500
  246. data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +0 -370
  247. data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +0 -275
  248. data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +0 -228
  249. data/vendor/kreuzberg/tests/html_table_test.rs +0 -551
  250. data/vendor/kreuzberg/tests/instrumentation_test.rs +0 -139
  251. data/vendor/kreuzberg/tests/jats_extractor_tests.rs +0 -639
  252. data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +0 -704
  253. data/vendor/kreuzberg/tests/latex_extractor_tests.rs +0 -496
  254. data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +0 -490
  255. data/vendor/kreuzberg/tests/ocr_language_registry.rs +0 -191
  256. data/vendor/kreuzberg/tests/odt_extractor_tests.rs +0 -674
  257. data/vendor/kreuzberg/tests/opml_extractor_tests.rs +0 -616
  258. data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +0 -822
  259. data/vendor/kreuzberg/tests/page_markers.rs +0 -297
  260. data/vendor/kreuzberg/tests/pdf_hierarchy_detection.rs +0 -301
  261. data/vendor/kreuzberg/tests/pdf_hierarchy_quality.rs +0 -589
  262. data/vendor/kreuzberg/tests/pdf_ocr_triggering.rs +0 -301
  263. data/vendor/kreuzberg/tests/pdf_text_merging.rs +0 -475
  264. data/vendor/kreuzberg/tests/pdfium_linking.rs +0 -340
  265. data/vendor/kreuzberg/tests/rst_extractor_tests.rs +0 -694
  266. data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +0 -775
  267. data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +0 -1260
  268. data/vendor/kreuzberg/tests/typst_extractor_tests.rs +0 -648
  269. data/vendor/kreuzberg-ffi/Cargo.toml +0 -67
  270. data/vendor/kreuzberg-ffi/README.md +0 -851
  271. data/vendor/kreuzberg-ffi/benches/result_view_benchmark.rs +0 -227
  272. data/vendor/kreuzberg-ffi/build.rs +0 -168
  273. data/vendor/kreuzberg-ffi/cbindgen.toml +0 -37
  274. data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +0 -12
  275. data/vendor/kreuzberg-ffi/kreuzberg.h +0 -3012
  276. data/vendor/kreuzberg-ffi/src/batch_streaming.rs +0 -588
  277. data/vendor/kreuzberg-ffi/src/config.rs +0 -1341
  278. data/vendor/kreuzberg-ffi/src/error.rs +0 -901
  279. data/vendor/kreuzberg-ffi/src/extraction.rs +0 -555
  280. data/vendor/kreuzberg-ffi/src/helpers.rs +0 -879
  281. data/vendor/kreuzberg-ffi/src/lib.rs +0 -977
  282. data/vendor/kreuzberg-ffi/src/memory.rs +0 -493
  283. data/vendor/kreuzberg-ffi/src/mime.rs +0 -329
  284. data/vendor/kreuzberg-ffi/src/panic_shield.rs +0 -265
  285. data/vendor/kreuzberg-ffi/src/plugins/document_extractor.rs +0 -442
  286. data/vendor/kreuzberg-ffi/src/plugins/mod.rs +0 -14
  287. data/vendor/kreuzberg-ffi/src/plugins/ocr_backend.rs +0 -628
  288. data/vendor/kreuzberg-ffi/src/plugins/post_processor.rs +0 -438
  289. data/vendor/kreuzberg-ffi/src/plugins/validator.rs +0 -329
  290. data/vendor/kreuzberg-ffi/src/result.rs +0 -510
  291. data/vendor/kreuzberg-ffi/src/result_pool.rs +0 -639
  292. data/vendor/kreuzberg-ffi/src/result_view.rs +0 -773
  293. data/vendor/kreuzberg-ffi/src/string_intern.rs +0 -568
  294. data/vendor/kreuzberg-ffi/src/types.rs +0 -363
  295. data/vendor/kreuzberg-ffi/src/util.rs +0 -210
  296. data/vendor/kreuzberg-ffi/src/validation.rs +0 -848
  297. data/vendor/kreuzberg-ffi/tests.disabled/README.md +0 -48
  298. data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +0 -299
  299. data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +0 -346
  300. data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +0 -232
  301. data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +0 -470
  302. data/vendor/kreuzberg-tesseract/.commitlintrc.json +0 -13
  303. data/vendor/kreuzberg-tesseract/.crate-ignore +0 -2
  304. data/vendor/kreuzberg-tesseract/Cargo.lock +0 -2933
  305. data/vendor/kreuzberg-tesseract/Cargo.toml +0 -57
  306. data/vendor/kreuzberg-tesseract/LICENSE +0 -22
  307. data/vendor/kreuzberg-tesseract/README.md +0 -399
  308. data/vendor/kreuzberg-tesseract/build.rs +0 -1127
  309. data/vendor/kreuzberg-tesseract/patches/README.md +0 -71
  310. data/vendor/kreuzberg-tesseract/patches/tesseract.diff +0 -199
  311. data/vendor/kreuzberg-tesseract/src/api.rs +0 -1371
  312. data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +0 -77
  313. data/vendor/kreuzberg-tesseract/src/enums.rs +0 -297
  314. data/vendor/kreuzberg-tesseract/src/error.rs +0 -81
  315. data/vendor/kreuzberg-tesseract/src/lib.rs +0 -145
  316. data/vendor/kreuzberg-tesseract/src/monitor.rs +0 -57
  317. data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +0 -197
  318. data/vendor/kreuzberg-tesseract/src/page_iterator.rs +0 -253
  319. data/vendor/kreuzberg-tesseract/src/result_iterator.rs +0 -286
  320. data/vendor/kreuzberg-tesseract/src/result_renderer.rs +0 -183
  321. data/vendor/kreuzberg-tesseract/tests/integration_test.rs +0 -211
data/README.md CHANGED
@@ -1,781 +1,421 @@
- # Ruby
-
- <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
- <!-- Language Bindings -->
- <a href="https://crates.io/crates/kreuzberg">
- <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
- </a>
- <a href="https://hex.pm/packages/kreuzberg">
- <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
- </a>
- <a href="https://pypi.org/project/kreuzberg/">
- <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
- </a>
- <a href="https://www.npmjs.com/package/@kreuzberg/node">
- <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
- </a>
- <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
- <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
- </a>
-
- <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
- <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
- </a>
- <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
- <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
- </a>
- <a href="https://www.nuget.org/packages/Kreuzberg/">
- <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
- </a>
- <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
- <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
- </a>
- <a href="https://rubygems.org/gems/kreuzberg">
- <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
- </a>
-
- <!-- Project Info -->
-
- <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
- <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
- </a>
- <a href="https://docs.kreuzberg.dev">
- <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
- </a>
- </div>
-
- <img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
-
- <div align="center" style="margin-top: 20px;">
- <a href="https://discord.gg/pXxagNK2zN">
- <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
- </a>
- </div>
-
- Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Ruby bindings with idiomatic Ruby API and native performance.
-
- > **Version 4.0.0 Release Candidate**
- > Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
- > This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
+ # Kreuzberg for Ruby
 
- ## Installation
+ [![RubyGems](https://img.shields.io/gem/v/kreuzberg)](https://rubygems.org/gems/kreuzberg)
+ [![Crates.io](https://img.shields.io/crates/v/kreuzberg)](https://crates.io/crates/kreuzberg)
+ [![PyPI](https://img.shields.io/pypi/v/kreuzberg)](https://pypi.org/project/kreuzberg/)
+ [![npm](https://img.shields.io/npm/v/@goldziher/kreuzberg)](https://www.npmjs.com/package/@goldziher/kreuzberg)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev)
 
- ### Package Installation
+ High-performance document intelligence for Ruby, powered by Rust.
 
- Install via one of the supported package managers:
+ Extract text, tables, images, and metadata from 30+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
 
- **gem:**
+ > **🚀 Version 4.0.0 Release Candidate**
+ > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/Goldziher/kreuzberg/issues) you encounter.
 
- ```bash
- gem install kreuzberg
- ```
-
- **Bundler:**
-
- ```ruby
- gem 'kreuzberg'
- ```
+ ## Features
 
- ### System Requirements
+ - **30+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
+ - **OCR Support**: Built-in Tesseract OCR for scanned documents and images
+ - **High Performance**: Rust-powered extraction for native-level performance
+ - **Table Extraction**: Extract structured tables from documents
+ - **Language Detection**: Automatic language detection for extracted text
+ - **Text Chunking**: Split long documents into manageable chunks
+ - **Caching**: Built-in result caching for faster repeated extractions
+ - **Type-Safe**: Comprehensive typed configuration and result objects
 
- - **Ruby 2.7+** required
- - Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- - Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
+ ## Requirements
 
- ### Platform Support
+ - Ruby 3.2 or higher
+ - Rust toolchain (for building from source)
 
- Precompiled native extensions are available for the following platforms, providing instant installation without compilation:
+ ### Optional System Dependencies
 
- - Linux x86_64
- - Linux aarch64 (ARM64)
- - macOS aarch64 (Apple Silicon)
+ - **Tesseract**: For OCR functionality
+ - macOS: `brew install tesseract`
+ - Ubuntu: `sudo apt-get install tesseract-ocr`
+ - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
 
- On these platforms, no C compiler or Rust toolchain is required for installation.
+ - **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
+ - macOS: `brew install libreoffice`
+ - Ubuntu: `sudo apt-get install libreoffice`
 
- ## Quick Start
+ - **Pandoc**: For advanced document conversion
+ - macOS: `brew install pandoc`
+ - Ubuntu: `sudo apt-get install pandoc`
 
- ### Basic Extraction
+ ## Installation
 
- Extract text, metadata, and structure from any supported document format:
+ Add to your Gemfile:
 
  ```ruby
- require 'kreuzberg'
-
- result = Kreuzberg.extract_file_sync(path: 'document.pdf')
-
- puts "Content:"
- puts result.content
-
- puts "\nMetadata:"
- puts "Title: #{result.metadata&.dig('title')}"
- puts "Author: #{result.metadata&.dig('author')}"
-
- puts "\nTables found: #{result.tables.length}"
- puts "Images found: #{result.images.length}"
+ gem 'kreuzberg'
  ```
 
- ### Common Use Cases
-
- #### Extract with Custom Configuration
+ Then run:
 
- Most use cases benefit from configuration to control extraction behavior:
-
- **With OCR (for scanned documents):**
-
- ```ruby
- require 'kreuzberg'
-
- ocr_config = Kreuzberg::Config::OCR.new(
- backend: 'tesseract',
- language: 'eng'
- )
+ ```bash
+ bundle install
+ ```
 
- config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
- result = Kreuzberg.extract_file_sync(path: 'scanned.pdf', config: config)
+ Or install directly:
 
- puts "Extracted text from scanned document:"
- puts result.content
- puts "Used OCR backend: tesseract"
+ ```bash
+ gem install kreuzberg
  ```
 
- #### Table Extraction
-
- See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
+ ## Quick Start
 
- #### Processing Multiple Files
+ ### Basic Extraction
 
  ```ruby
  require 'kreuzberg'
 
- puts "Kreuzberg version: #{Kreuzberg::VERSION}"
- puts "FFI bindings loaded successfully"
-
- result = Kreuzberg.extract_file_sync(path: 'sample.pdf')
- puts "Installation verified! Extracted #{result.content.length} characters"
+ # Extract from a file
+ result = Kreuzberg.extract_file_sync("document.pdf")
+ puts result.content
+ puts "MIME type: #{result.mime_type}"
  ```
 
- #### Async Processing
-
- For non-blocking document processing:
+ ### With Configuration
 
  ```ruby
- require 'kreuzberg'
-
+ # Create configuration
  config = Kreuzberg::Config::Extraction.new(
  use_cache: true,
- enable_quality_processing: true
+ force_ocr: false
  )
 
- result = Kreuzberg.extract_file_sync(path: 'contract.pdf', config: config)
-
- puts "Extracted #{result.content.length} characters"
- puts "Quality score: #{result.metadata&.dig('quality_score')}"
- puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
+ result = Kreuzberg.extract_file_sync("document.pdf", config: config)
  ```
 
- ### Next Steps
-
- - **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- - **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
- - **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
- - **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
- - **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
-
- ## Features
-
- ### Supported File Formats (56+)
-
- 56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
-
- #### Office Documents
-
- | Category | Formats | Capabilities |
- |----------|---------|--------------|
- | **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
- | **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
- | **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
- | **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
- | **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
-
- #### Images (OCR-Enabled)
-
- | Category | Formats | Features |
- |----------|---------|----------|
- | **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
- | **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
- | **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
-
- #### Web & Data
-
- | Category | Formats | Features |
- |----------|---------|----------|
- | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
- | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
- | **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
-
- #### Email & Archives
-
- | Category | Formats | Features |
- |----------|---------|----------|
- | **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
- | **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
-
- #### Academic & Scientific
-
- | Category | Formats | Features |
- |----------|---------|----------|
- | **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
- | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
- | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
-
- **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
-
- ### Key Capabilities
-
- - **Text Extraction** - Extract all text content with position and formatting information
-
- - **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
-
- - **Table Extraction** - Parse tables with structure and cell content preservation
-
- - **Image Extraction** - Extract embedded images and render page previews
-
- - **OCR Support** - Integrate multiple OCR backends for scanned documents
-
- - **Async/Await** - Non-blocking document processing with concurrent operations
-
- - **Plugin System** - Extensible post-processing for custom text transformation
-
- - **Embeddings** - Generate vector embeddings using ONNX Runtime models
-
- - **Batch Processing** - Efficiently process multiple documents in parallel
-
- - **Memory Efficient** - Stream large files without loading entirely into memory
-
- - **Language Detection** - Detect and support multiple languages in documents
-
- - **Configuration** - Fine-grained control over extraction behavior
-
- ### Performance Characteristics
-
- | Format | Speed | Memory | Notes |
- |--------|-------|--------|-------|
- | **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
- | **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
- | **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
- | **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
- | **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
-
269
- ## OCR Support
270
-
271
- Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
272
-
273
- - **Tesseract**
274
-
275
- ### OCR Configuration Example
93
+ ### With OCR
276
94
 
277
95
  ```ruby
278
- require 'kreuzberg'
279
-
96
+ # Configure OCR
280
97
  ocr_config = Kreuzberg::Config::OCR.new(
281
- backend: 'tesseract',
282
- language: 'eng'
98
+ backend: "tesseract",
99
+ language: "eng",
100
+ preprocessing: true
283
101
  )
284
102
 
285
103
  config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
286
- result = Kreuzberg.extract_file_sync(path: 'scanned.pdf', config: config)
287
-
288
- puts "Extracted text from scanned document:"
289
- puts result.content
290
- puts "Used OCR backend: tesseract"
104
+ result = Kreuzberg.extract_file_sync("scanned.pdf", config: config)
291
105
  ```
292
106
 
293
- ## Async Support
294
-
295
- This binding provides full async/await support for non-blocking document processing:
107
+ ### Extract from Bytes
296
108
 
297
109
  ```ruby
298
- require 'kreuzberg'
299
-
300
- config = Kreuzberg::Config::Extraction.new(
301
- use_cache: true,
302
- enable_quality_processing: true
303
- )
304
-
305
- result = Kreuzberg.extract_file_sync(path: 'contract.pdf', config: config)
306
-
307
- puts "Extracted #{result.content.length} characters"
308
- puts "Quality score: #{result.metadata&.dig('quality_score')}"
309
- puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
110
+ data = File.binread("document.pdf")
111
+ result = Kreuzberg.extract_bytes_sync(data, "application/pdf")
112
+ puts result.content
310
113
  ```
311
114
 
312
- ## Plugin System
313
-
314
- Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
315
-
316
- For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).
317
-
318
- ## Embeddings Support
319
-
320
- Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
321
-
322
- **[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
115
+ ### Batch Processing
323
116
 
324
- ## Advanced Examples
117
+ ```ruby
118
+ paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
119
+ results = Kreuzberg.batch_extract_files_sync(paths)
325
120
 
326
- ### Embeddings with Model Configuration
121
+ results.each do |result|
122
+ puts "Content: #{result.content[0..100]}"
123
+ puts "MIME: #{result.mime_type}"
124
+ end
125
+ ```
327
126
 
328
- Generate embeddings for document chunks with custom model configuration:
127
+ ### Structured Results (Chunks & Images)
329
128
 
330
129
  ```ruby
331
- require 'kreuzberg'
332
-
333
- # Configure embedding model with custom parameters
334
- embedding_config = Kreuzberg::Config::Embedding.new(
335
- model: { type: :preset, name: 'balanced' },
336
- normalize: true,
337
- batch_size: 32,
338
- show_download_progress: false
339
- )
130
+ result = Kreuzberg.extract_file_sync("long-report.pdf", config: {
131
+ chunking: { max_chars: 750 },
132
+ image_extraction: { extract_images: true }
133
+ })
340
134
 
341
- # Enable chunking with embeddings
342
- chunking_config = Kreuzberg::Config::Chunking.new(
343
- max_chars: 1024,
344
- max_overlap: 256,
345
- embedding: embedding_config
346
- )
135
+ result.chunks&.each do |chunk|
136
+ puts "[#{chunk.chunk_index + 1}/#{chunk.total_chunks}] #{chunk.content[0..80]}"
137
+ end
347
138
 
348
- config = Kreuzberg::Config::Extraction.new(chunking: chunking_config)
349
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
350
-
351
- # Access chunks with embeddings
352
- result.chunks.each_with_index do |chunk, idx|
353
- puts "Chunk #{idx}:"
354
- puts " Content: #{chunk.content[0..50]}..."
355
- puts " Tokens: #{chunk.token_count}"
356
- puts " Pages: #{chunk.first_page}-#{chunk.last_page}"
357
- if chunk.embedding
358
- puts " Embedding dimensions: #{chunk.embedding.length}"
139
+ result.images&.each do |image|
140
+ File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
141
+ if image.ocr_result
142
+ puts "Embedded OCR content: #{image.ocr_result.content[0..60]}"
359
143
  end
360
144
  end
361
145
  ```
362
146
 
363
- ### Keywords Extraction (YAKE and RAKE)
147
+ ## Configuration
364
148
 
365
- Extract keywords using YAKE and RAKE algorithms:
149
+ ### Load From File
366
150
 
367
151
  ```ruby
368
- require 'kreuzberg'
152
+ config = Kreuzberg::Config::Extraction.from_file("config.toml")
153
+ result = Kreuzberg.extract_file_sync("report.pdf", config: config)
154
+ ```
369
155
 
370
- # Extract keywords using YAKE algorithm
371
- yake_config = Kreuzberg::Config::Keywords.new(
372
- algorithm: 'yake',
373
- max_keywords: 10,
374
- min_score: 0.1,
375
- yake_params: Kreuzberg::Config::KeywordYakeParams.new(window_size: 3)
376
- )
156
+ ### Extraction Configuration
377
157
 
378
- config = Kreuzberg::Config::Extraction.new(keywords: yake_config)
379
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
380
-
381
- # Extract keywords using RAKE algorithm
382
- rake_config = Kreuzberg::Config::Keywords.new(
383
- algorithm: 'rake',
384
- max_keywords: 15,
385
- language: 'english',
386
- rake_params: Kreuzberg::Config::KeywordRakeParams.new(
387
- min_word_length: 3,
388
- max_words_per_phrase: 5
389
- )
158
+ ```ruby
159
+ config = Kreuzberg::Config::Extraction.new(
160
+ use_cache: true, # Enable result caching
161
+ enable_quality_processing: false, # Enable text quality processing
162
+ force_ocr: false # Force OCR even for digital PDFs
390
163
  )
391
-
392
- config = Kreuzberg::Config::Extraction.new(keywords: rake_config)
393
- result = Kreuzberg.extract_file_sync(path: 'report.docx', config: config)
394
-
395
- puts "Keywords extracted for document"
396
164
  ```
397
165
 
398
- ### Pages Extraction with PageConfig
399
-
400
- Extract and organize content by pages:
166
+ ### OCR Configuration
401
167
 
402
168
  ```ruby
403
- require 'kreuzberg'
404
-
405
- # Enable per-page extraction with markers
406
- page_config = Kreuzberg::Config::PageConfig.new(
407
- extract_pages: true,
408
- insert_page_markers: true,
409
- marker_format: "\n\n=== PAGE {page_num} ===\n\n"
169
+ ocr = Kreuzberg::Config::OCR.new(
170
+ backend: "tesseract", # OCR backend (tesseract, easyocr, paddleocr)
171
+ language: "eng", # Language code (eng, deu, fra, etc.)
172
+ tesseract_config: {
173
+ psm: 6,
174
+ enable_table_detection: true,
175
+ preprocessing: Kreuzberg::Config::ImagePreprocessing.new(auto_rotate: true).to_h
176
+ }
410
177
  )
411
178
 
412
- config = Kreuzberg::Config::Extraction.new(pages: page_config)
413
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
414
-
415
- # Access extracted pages
416
- if result.pages
417
- result.pages.each do |page|
418
- puts "Page #{page.page_number}:"
419
- puts " Content length: #{page.content.length}"
420
- puts " Tables: #{page.tables.length}"
421
- puts " Images: #{page.images.length}"
422
- end
423
- end
424
-
425
- puts "Total pages: #{result.page_count}"
179
+ config = Kreuzberg::Config::Extraction.new(ocr: ocr)
426
180
  ```
427
181
 
428
- ### Custom PostProcessor Implementation
429
-
430
- Create and register custom post-processors for text transformation:
182
+ ### Chunking Configuration
431
183
 
432
184
  ```ruby
433
- require 'kreuzberg'
434
-
435
- # Define a custom post-processor class
436
- class MarkdownEnhancerPostProcessor
437
- include Kreuzberg::PostProcessorProtocol
438
-
439
- def call(result)
440
- # Enhance extracted content with markdown formatting
441
- enhanced = result.dup
442
-
443
- if enhanced['content']
444
- # Add markdown headers for detected structure
445
- enhanced['content'] = enhance_with_markdown(enhanced['content'])
446
- end
447
-
448
- enhanced
449
- end
185
+ chunking = Kreuzberg::Config::Chunking.new(
186
+ enabled: true,
187
+ chunk_size: 1000, # Characters per chunk
188
+ chunk_overlap: 200, # Overlap between chunks
189
+ embedding: {
190
+ model: { type: :preset, name: "balanced" },
191
+ normalize: true
192
+ }
193
+ )
450
194
 
451
- private
195
+ config = Kreuzberg::Config::Extraction.new(chunking: chunking)
196
+ result = Kreuzberg.extract_file_sync("long_document.pdf", config: config)
452
197
 
453
- def enhance_with_markdown(content)
454
- # Example: Convert section breaks to markdown headers
455
- content
456
- .split("\n\n")
457
- .map { |paragraph| paragraph.length > 100 ? "## #{paragraph[0..30]}...\n\n#{paragraph}" : paragraph }
458
- .join("\n\n")
459
- end
198
+ result.chunks&.each do |chunk|
199
+ puts "Chunk: #{chunk.content}"
200
+ puts "Tokens: #{chunk.token_count}"
460
201
  end
461
-
462
- # Use custom post-processor in configuration
463
- processor = MarkdownEnhancerPostProcessor.new
464
- postprocessor_config = Kreuzberg::Config::PostProcessor.new(enabled: true)
465
- config = Kreuzberg::Config::Extraction.new(postprocessor: postprocessor_config)
466
-
467
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
468
- puts result.content
469
202
  ```
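To make the `chunk_size` and `chunk_overlap` settings above concrete, here is a minimal pure-Ruby sketch of character-based chunking with overlap. It is an illustration only and not Kreuzberg's implementation, which may split on different boundaries. `chunk_text` is a hypothetical helper name, and the assumption is that each chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```ruby
# Hypothetical helper (not part of Kreuzberg): split text into windows of
# `chunk_size` characters, each starting `chunk_size - chunk_overlap`
# characters after the previous window.
def chunk_text(text, chunk_size:, chunk_overlap:)
  raise ArgumentError, "chunk_overlap must not be negative" if chunk_overlap.negative?
  raise ArgumentError, "chunk_overlap must be smaller than chunk_size" unless chunk_overlap < chunk_size

  step = chunk_size - chunk_overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    # Stop once the current window already reaches the end of the text.
    break if start + chunk_size >= text.length

    start += step
  end
  chunks
end

chunks = chunk_text("a" * 2500, chunk_size: 1000, chunk_overlap: 200)
chunks.each_with_index do |chunk, index|
  puts "Chunk #{index + 1}/#{chunks.length}: #{chunk.length} chars"
end
```

With these values, consecutive chunks share their last and first 200 characters, which is the usual reason to configure an overlap: text that straddles a chunk boundary still appears whole in at least one chunk.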
470
203
 
471
- ### Custom Validator Implementation
472
-
473
- Create and register validators to ensure extraction quality:
204
+ ### HTML Conversion Options
474
205
 
475
206
  ```ruby
476
- require 'kreuzberg'
477
-
478
- # Define a custom validator class
479
- class ContentQualityValidator
480
- include Kreuzberg::ValidatorProtocol
481
-
482
- MIN_CONTENT_LENGTH = 100
483
- MIN_METADATA_FIELDS = 2
484
-
485
- def call(result)
486
- # Validate extracted content meets quality standards
487
- content = result['content'].to_s
488
- metadata = result['metadata'].to_h
489
-
490
- if content.length < MIN_CONTENT_LENGTH
491
- raise Kreuzberg::Errors::ValidationError,
492
- "Content too short: #{content.length} bytes (minimum: #{MIN_CONTENT_LENGTH})"
493
- end
494
-
495
- if metadata.length < MIN_METADATA_FIELDS
496
- raise Kreuzberg::Errors::ValidationError,
497
- "Insufficient metadata: #{metadata.length} fields (minimum: #{MIN_METADATA_FIELDS})"
498
- end
499
-
500
- # Validation passed
501
- nil
502
- end
503
- end
504
-
505
- # Use validator in extraction workflow
506
- validator = ContentQualityValidator.new
507
- config = Kreuzberg::Config::Extraction.new(enable_quality_processing: true)
207
+ html_options = Kreuzberg::Config::HtmlOptions.new(
208
+ heading_style: :atx_closed,
209
+ wrap: true,
210
+ wrap_width: 100,
211
+ preprocessing: { enabled: true, preset: :standard }
212
+ )
508
213
 
509
- begin
510
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
511
- validator.call(result.to_h)
512
- puts "Extraction passed quality validation"
513
- rescue Kreuzberg::Errors::ValidationError => e
514
- puts "Validation failed: #{e.message}"
515
- end
214
+ config = Kreuzberg::Config::Extraction.new(html_options: html_options)
215
+ result = Kreuzberg.extract_file_sync("page.html", config: config)
516
216
  ```
517
217
 
518
- ### Config File Loading (from_file and discover)
519
-
520
- Load configuration from TOML, YAML, or JSON files:
218
+ ### Keyword Extraction
521
219
 
522
220
  ```ruby
523
- require 'kreuzberg'
221
+ keywords = Kreuzberg::Config::Keywords.new(
222
+ algorithm: :yake,
223
+ max_keywords: 8,
224
+ min_score: 0.2,
225
+ ngram_range: [1, 3]
226
+ )
524
227
 
525
- # Load configuration from a specific file
526
- # Supports: .toml, .yaml/.yml, .json
527
- config = Kreuzberg::Config::Extraction.from_file('config/kreuzberg.toml')
528
-
529
- # Example: config/kreuzberg.toml
530
- # use_cache = true
531
- # force_ocr = false
532
- # enable_quality_processing = true
533
- #
534
- # [chunking]
535
- # max_chars = 1024
536
- # max_overlap = 256
537
- #
538
- # [ocr]
539
- # backend = "tesseract"
540
- # language = "eng"
541
- #
542
- # [language_detection]
543
- # enabled = true
544
- # min_confidence = 0.7
545
-
546
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
547
- puts "Extracted with config from file"
548
-
549
- # Auto-discover configuration in project hierarchy
550
- discovered_config = Kreuzberg::Config::Extraction.discover
551
- if discovered_config
552
- puts "Found configuration at project root"
553
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: discovered_config)
554
- else
555
- puts "No configuration file found, using defaults"
556
- result = Kreuzberg.extract_file_sync(path: 'document.pdf')
557
- end
228
+ config = Kreuzberg::Config::Extraction.new(keywords: keywords)
229
+ result = Kreuzberg.extract_file_sync("research.pdf", config: config)
558
230
  ```
559
231
 
560
- ### Fiber-Based Async Patterns
561
-
562
- Use Ruby Fibers for efficient async extraction workflows:
232
+ ### Language Detection
563
233
 
564
234
  ```ruby
565
- require 'kreuzberg'
566
-
567
- # Create async extraction workflow using Fibers
568
- def extract_documents_async(file_paths)
569
- fibers = file_paths.map do |path|
570
- Fiber.new do
571
- config = Kreuzberg::Config::Extraction.new(
572
- use_cache: true,
573
- enable_quality_processing: true
574
- )
575
-
576
- # Extract asynchronously
577
- result = Kreuzberg.extract_file(path: path, config: config)
578
-
579
- {
580
- path: path,
581
- content_length: result.content.length,
582
- tables: result.tables.length,
583
- languages: result.detected_languages
584
- }
585
- end
586
- end
235
+ lang_detection = Kreuzberg::Config::LanguageDetection.new(
236
+ enabled: true,
237
+ min_confidence: 0.8,
238
+ detect_multiple: true
239
+ )
587
240
 
588
- # Resume all fibers and collect results
589
- results = fibers.map do |fiber|
590
- Fiber.yield fiber.resume if fiber.alive?
591
- end
241
+ config = Kreuzberg::Config::Extraction.new(language_detection: lang_detection)
242
+ result = Kreuzberg.extract_file_sync("multilingual.pdf", config: config)
592
243
 
593
- results.compact
244
+ result.detected_languages&.each do |lang|
245
+ puts "Language: #{lang.lang}, Confidence: #{lang.confidence}"
594
246
  end
247
+ ```
595
248
 
596
- # Usage
597
- file_paths = ['document1.pdf', 'document2.docx', 'document3.xlsx']
598
- results = extract_documents_async(file_paths)
249
+ ### PDF Options
599
250
 
600
- results.each do |result|
601
- puts "#{result[:path]}: #{result[:content_length]} characters"
602
- end
603
- ```
251
+ ```ruby
252
+ pdf_options = Kreuzberg::Config::PDF.new(
253
+ extract_images: true,
254
+ image_min_size: 10000, # Minimum image size in bytes
255
+ password: "secret" # PDF password
256
+ )
604
257
 
605
- ### Table Extraction Detailed Usage
258
+ config = Kreuzberg::Config::Extraction.new(pdf_options: pdf_options)
259
+ ```
606
260
 
607
- Extract and access table structure and cell data:
261
+ ## Working with Results
608
262
 
609
263
  ```ruby
610
- require 'kreuzberg'
264
+ result = Kreuzberg.extract_file_sync("invoice.pdf")
611
265
 
612
- # Configure table extraction
613
- config = Kreuzberg::Config::Extraction.new(
614
- pdf_options: Kreuzberg::Config::PDF.new(extract_images: true)
615
- )
266
+ # Access extracted text
267
+ puts result.content
268
+
269
+ # Access MIME type
270
+ puts result.mime_type
616
271
 
617
- result = Kreuzberg.extract_file_sync(path: 'spreadsheet.pdf', config: config)
272
+ # Access metadata
273
+ puts result.metadata.inspect
618
274
 
619
275
  # Access extracted tables
620
- result.tables.each_with_index do |table, table_idx|
621
- puts "Table #{table_idx} (Page #{table.page_number}):"
622
-
623
- # Access table cells (2D array)
624
- table.cells.each_with_index do |row, row_idx|
625
- puts " Row #{row_idx}:"
626
- row.each_with_index do |cell, col_idx|
627
- puts " [#{col_idx}] #{cell}"
628
- end
276
+ result.tables.each do |table|
277
+ puts "Headers: #{table.headers.join(', ')}"
278
+ table.rows.each do |row|
279
+ puts row.join(', ')
629
280
  end
281
+ end
630
282
 
631
- # Access markdown representation
632
- puts "\nMarkdown format:"
633
- puts table.markdown
283
+ # Access text chunks and metadata
284
+ result.chunks&.each do |chunk|
285
+ puts "Chunk #{chunk.chunk_index + 1}/#{chunk.total_chunks}"
286
+ puts "Chars: #{chunk.char_start}-#{chunk.char_end}"
287
+ puts "Embedding length: #{chunk.embedding&.length}"
634
288
  end
635
289
 
636
- # Extract tables from specific pages
637
- page_config = Kreuzberg::Config::PageConfig.new(extract_pages: true)
638
- config = Kreuzberg::Config::Extraction.new(pages: page_config)
639
- result = Kreuzberg.extract_file_sync(path: 'data.xlsx', config: config)
640
-
641
- if result.pages
642
- result.pages.each do |page|
643
- page.tables.each do |table|
644
- puts "Table on page #{page.page_number}:"
645
- puts " Dimensions: #{table.cells.length} rows x #{table.cells.first&.length || 0} columns"
646
- end
647
- end
290
+ # Access extracted images
291
+ result.images&.each do |image|
292
+ File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
293
+ puts "Image #{image.image_index} on page #{image.page_number}"
648
294
  end
649
- ```
650
295
 
651
- ### Image Extraction and Saving
296
+ # Convert to hash
297
+ hash = result.to_h
652
298
 
653
- Extract images and save them to disk:
299
+ # Convert to JSON
300
+ json = result.to_json
301
+ ```
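The table loop above prints headers and rows as comma-separated text. As a sketch of one way to reuse the same two pieces of data, the helper below renders them as a Markdown table. `table_to_markdown` is a hypothetical name, not a Kreuzberg method, and it assumes only that headers and rows are arrays whose cells respond to `to_s`.

```ruby
# Hypothetical helper (not part of Kreuzberg): render headers and rows
# as a Markdown table, escaping pipe characters inside cells.
def table_to_markdown(headers, rows)
  # Block form of gsub avoids any special handling of backslashes in the replacement.
  escape = ->(cell) { cell.to_s.gsub("|") { "\\|" } }

  lines = []
  lines << "| #{headers.map(&escape).join(' | ')} |"
  lines << "| #{headers.map { '---' }.join(' | ')} |"
  rows.each do |row|
    lines << "| #{row.map(&escape).join(' | ')} |"
  end
  lines.join("\n")
end

puts table_to_markdown(%w[Item Qty], [["Widget", 2], ["Gadget", 5]])
```

Given a table object from the snippet above, a call would look like `table_to_markdown(table.headers, table.rows)`.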
654
302
 
655
- ```ruby
656
- require 'kreuzberg'
303
+ ## CLI Usage
657
304
 
658
- # Configure image extraction with high DPI
659
- image_config = Kreuzberg::Config::ImageExtraction.new(
660
- extract_images: true,
661
- target_dpi: 300,
662
- max_image_dimension: 2000,
663
- auto_adjust_dpi: true
664
- )
305
+ Kreuzberg provides a Ruby wrapper for the CLI:
665
306
 
666
- config = Kreuzberg::Config::Extraction.new(image_extraction: image_config)
667
- result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
307
+ ```ruby
308
+ # Extract content
309
+ output = Kreuzberg::CLI.extract("document.pdf", output: "text")
668
310
 
669
- # Save extracted images
670
- output_dir = 'extracted_images'
671
- Dir.mkdir(output_dir) unless Dir.exist?(output_dir)
311
+ # Detect MIME type
312
+ mime_type = Kreuzberg::CLI.detect("document.pdf")
672
313
 
673
- result.images.each_with_index do |image, idx|
674
- # Generate filename
675
- filename = "image_p#{image.page_number}_#{image.image_index}.#{image.format}"
676
- filepath = File.join(output_dir, filename)
314
+ # Get version
315
+ version = Kreuzberg::CLI.version
316
+ ```
677
317
 
678
- # Save image data
679
- File.write(filepath, image.data, mode: 'wb')
318
+ ## API Server
680
319
 
681
- puts "Saved: #{filename}"
682
- puts " Page: #{image.page_number}"
683
- puts " Format: #{image.format}"
684
- puts " Dimensions: #{image.width}x#{image.height}"
685
- puts " Colorspace: #{image.colorspace}"
320
+ Start an API server (requires the `kreuzberg` CLI):
686
321
 
687
- # Process OCR result if available
688
- if image.ocr_result
689
- puts " OCR Text: #{image.ocr_result['text'][0..50]}..."
690
- end
322
+ ```ruby
323
+ Kreuzberg::APIProxy.run(port: 8000) do |server|
324
+ # Server runs in background
325
+ # Make HTTP requests to http://localhost:8000
691
326
  end
692
327
  ```
693
328
 
694
- ### Language Detection Configuration
329
+ ## MCP Server
695
330
 
696
- Configure and use language detection:
331
+ Start a Model Context Protocol server for Claude Desktop:
697
332
 
698
333
  ```ruby
699
- require 'kreuzberg'
700
-
701
- # Enable language detection with confidence threshold
702
- lang_detection_config = Kreuzberg::Config::LanguageDetection.new(
703
- enabled: true,
704
- min_confidence: 0.8,
705
- detect_multiple: true
706
- )
707
-
708
- config = Kreuzberg::Config::Extraction.new(
709
- language_detection: lang_detection_config
710
- )
711
-
712
- result = Kreuzberg.extract_file_sync(path: 'multilingual.pdf', config: config)
334
+ server = Kreuzberg::MCPProxy::Server.new(transport: "stdio")
335
+ server.start
713
336
 
714
- # Access detected languages
715
- puts "Primary language: #{result.detected_language}"
716
- puts "All detected languages: #{result.detected_languages.join(', ')}"
717
-
718
- # Access language from metadata
719
- if result.metadata.is_a?(Hash)
720
- puts "Language from metadata: #{result.metadata['language']}"
721
- end
337
+ # Use with Claude Desktop integration
338
+ ```
722
339
 
723
- # Combine with keyword extraction for specific language
724
- keywords_config = Kreuzberg::Config::Keywords.new(
725
- algorithm: 'yake',
726
- language: 'de', # German keywords
727
- max_keywords: 10
728
- )
340
+ ## Cache Management
729
341
 
730
- config = Kreuzberg::Config::Extraction.new(
731
- language_detection: lang_detection_config,
732
- keywords: keywords_config
733
- )
342
+ ```ruby
343
+ # Get cache statistics
344
+ stats = Kreuzberg.cache_stats
345
+ puts "Entries: #{stats[:total_entries]}"
346
+ puts "Size: #{stats[:total_size_bytes]} bytes"
734
347
 
735
- result = Kreuzberg.extract_file_sync(path: 'german_document.pdf', config: config)
736
- puts "Keywords extracted for: #{result.detected_language}"
348
+ # Clear cache
349
+ Kreuzberg.clear_cache
737
350
  ```
738
351
 
739
- ## Batch Processing
740
-
741
- Process multiple documents efficiently:
352
+ ## Error Handling
742
353
 
743
354
  ```ruby
744
- require 'kreuzberg'
355
+ begin
356
+ result = Kreuzberg.extract_file_sync("document.pdf")
357
+ rescue Kreuzberg::Errors::ParsingError => e
358
+ puts "Parsing failed: #{e.message}"
359
+ puts "Context: #{e.context}"
360
+ rescue Kreuzberg::Errors::OCRError => e
361
+ puts "OCR failed: #{e.message}"
362
+ rescue Kreuzberg::Errors::MissingDependencyError => e
363
+ puts "Missing dependency: #{e.dependency}"
364
+ rescue Kreuzberg::Errors::Error => e
365
+ puts "Kreuzberg error: #{e.message}"
366
+ end
367
+ ```
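The order of the `rescue` clauses above matters: Ruby tries them top to bottom, so the most general class has to come last. The sketch below shows this with a stand-in hierarchy. The `Demo` module and the `classify` helper are hypothetical and exist only for illustration. That Kreuzberg's specific errors inherit from `Kreuzberg::Errors::Error` is an assumption inferred from the rescue order, not something this README states.

```ruby
# Stand-in hierarchy for illustration only; these are NOT Kreuzberg's classes.
module Demo
  class Error < StandardError; end
  class ParsingError < Error; end
  class OCRError < Error; end
end

# Returns a symbol naming which rescue clause handled the block's error.
def classify
  yield
  :ok
rescue Demo::ParsingError
  :parsing
rescue Demo::OCRError
  :ocr
rescue Demo::Error
  # Catch-all for the hierarchy; placed first, it would shadow the clauses above.
  :generic
end

puts classify { raise Demo::ParsingError, "bad xref table" }
puts classify { raise Demo::Error, "something else" }
puts classify { "no error" }
```

If the `Demo::Error` clause were moved to the top, every error in the hierarchy would be reported as `:generic`, which is why the README's example rescues the base class last.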
745
368
 
746
- puts "Kreuzberg version: #{Kreuzberg::VERSION}"
747
- puts "FFI bindings loaded successfully"
369
+ ## Supported Formats
748
370
 
749
- result = Kreuzberg.extract_file_sync(path: 'sample.pdf')
750
- puts "Installation verified! Extracted #{result.content.length} characters"
751
- ```
371
+ - **Documents**: PDF, DOCX, DOC, PPTX, PPT, ODT, ODP
372
+ - **Spreadsheets**: XLSX, XLS, ODS, CSV
373
+ - **Images**: PNG, JPEG, TIFF, BMP, GIF
374
+ - **Web**: HTML, MHTML, Markdown
375
+ - **Data**: JSON, YAML, TOML, XML
376
+ - **Email**: EML, MSG
377
+ - **Archives**: ZIP, TAR, 7Z
378
+ - **Text**: TXT, RTF, MD
752
379
 
753
- ## Configuration
380
+ ## Performance
754
381
 
755
- For advanced configuration options including language detection, table extraction, OCR settings, and more:
382
+ Kreuzberg's Rust core provides significant performance improvements:
756
383
 
757
- **[Configuration Guide](https://kreuzberg.dev/configuration/)**
384
+ - **PDF extraction**: 10-50x faster than pure Ruby solutions
385
+ - **Batch processing**: Parallel extraction with Tokio async runtime
386
+ - **Memory efficient**: Streaming parsers for large files
387
+ - **Caching**: Automatic result caching for repeated extractions
758
388
 
759
- ## Documentation
389
+ ## Development
760
390
 
761
- - **[Official Documentation](https://kreuzberg.dev/)**
762
- - **[API Reference](https://kreuzberg.dev/reference/api-ruby/)**
763
- - **[Examples & Guides](https://kreuzberg.dev/guides/)**
391
+ ```bash
392
+ # Clone the repository
393
+ git clone https://github.com/Goldziher/kreuzberg.git
394
+ cd kreuzberg/packages/ruby
764
395
 
765
- ## Troubleshooting
396
+ # Install dependencies
397
+ bundle install
766
398
 
767
- For common issues and solutions, visit [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
399
+ # Build the Rust extension
400
+ bundle exec rake compile
768
401
 
769
- ## Contributing
402
+ # Run tests
403
+ bundle exec rspec
770
404
 
771
- Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
405
+ # Run RuboCop
406
+ bundle exec rubocop
407
+ ```
772
408
 
773
409
  ## License
774
410
 
775
- MIT License - see LICENSE file for details.
411
+ MIT License. See [LICENSE](../../LICENSE) for details.
412
+
413
+ ## Contributing
414
+
415
+ Contributions are welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.
776
416
 
777
- ## Support
417
+ ## Links
778
418
 
779
- - **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
780
- - **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
781
- - **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
419
+ - **Documentation**: https://docs.kreuzberg.dev
420
+ - **GitHub**: https://github.com/Goldziher/kreuzberg
421
+ - **Issues**: https://github.com/Goldziher/kreuzberg/issues