kreuzberg 4.0.0.pre.rc.8 → 4.0.0.pre.rc.13

Files changed (370)
  1. checksums.yaml +4 -4
  2. data/.gitignore +14 -14
  3. data/.rspec +3 -3
  4. data/.rubocop.yaml +1 -1
  5. data/.rubocop.yml +538 -538
  6. data/Gemfile +8 -8
  7. data/Gemfile.lock +4 -104
  8. data/README.md +454 -432
  9. data/Rakefile +25 -25
  10. data/Steepfile +47 -47
  11. data/examples/async_patterns.rb +341 -341
  12. data/ext/kreuzberg_rb/extconf.rb +45 -45
  13. data/ext/kreuzberg_rb/native/.cargo/config.toml +2 -2
  14. data/ext/kreuzberg_rb/native/Cargo.lock +6941 -6721
  15. data/ext/kreuzberg_rb/native/Cargo.toml +54 -54
  16. data/ext/kreuzberg_rb/native/README.md +425 -425
  17. data/ext/kreuzberg_rb/native/build.rs +15 -15
  18. data/ext/kreuzberg_rb/native/include/ieeefp.h +11 -11
  19. data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +14 -14
  20. data/ext/kreuzberg_rb/native/include/strings.h +20 -20
  21. data/ext/kreuzberg_rb/native/include/unistd.h +47 -47
  22. data/ext/kreuzberg_rb/native/src/lib.rs +3158 -3135
  23. data/extconf.rb +28 -28
  24. data/kreuzberg.gemspec +214 -182
  25. data/lib/kreuzberg/api_proxy.rb +142 -142
  26. data/lib/kreuzberg/cache_api.rb +81 -46
  27. data/lib/kreuzberg/cli.rb +55 -55
  28. data/lib/kreuzberg/cli_proxy.rb +127 -127
  29. data/lib/kreuzberg/config.rb +724 -724
  30. data/lib/kreuzberg/error_context.rb +80 -32
  31. data/lib/kreuzberg/errors.rb +118 -118
  32. data/lib/kreuzberg/extraction_api.rb +340 -85
  33. data/lib/kreuzberg/mcp_proxy.rb +186 -186
  34. data/lib/kreuzberg/ocr_backend_protocol.rb +113 -113
  35. data/lib/kreuzberg/post_processor_protocol.rb +86 -86
  36. data/lib/kreuzberg/result.rb +279 -279
  37. data/lib/kreuzberg/setup_lib_path.rb +80 -80
  38. data/lib/kreuzberg/validator_protocol.rb +89 -89
  39. data/lib/kreuzberg/version.rb +5 -5
  40. data/lib/kreuzberg.rb +109 -103
  41. data/lib/pdfium.dll +0 -0
  42. data/sig/kreuzberg/internal.rbs +184 -184
  43. data/sig/kreuzberg.rbs +546 -537
  44. data/spec/binding/cache_spec.rb +227 -227
  45. data/spec/binding/cli_proxy_spec.rb +85 -85
  46. data/spec/binding/cli_spec.rb +55 -55
  47. data/spec/binding/config_spec.rb +345 -345
  48. data/spec/binding/config_validation_spec.rb +283 -283
  49. data/spec/binding/error_handling_spec.rb +213 -213
  50. data/spec/binding/errors_spec.rb +66 -66
  51. data/spec/binding/plugins/ocr_backend_spec.rb +307 -307
  52. data/spec/binding/plugins/postprocessor_spec.rb +269 -269
  53. data/spec/binding/plugins/validator_spec.rb +274 -274
  54. data/spec/fixtures/config.toml +39 -39
  55. data/spec/fixtures/config.yaml +41 -41
  56. data/spec/fixtures/invalid_config.toml +4 -4
  57. data/spec/smoke/package_spec.rb +178 -178
  58. data/spec/spec_helper.rb +42 -42
  59. data/vendor/Cargo.toml +45 -0
  60. data/vendor/kreuzberg/Cargo.toml +61 -38
  61. data/vendor/kreuzberg/README.md +230 -221
  62. data/vendor/kreuzberg/benches/otel_overhead.rs +48 -48
  63. data/vendor/kreuzberg/build.rs +843 -891
  64. data/vendor/kreuzberg/src/api/error.rs +81 -81
  65. data/vendor/kreuzberg/src/api/handlers.rs +199 -199
  66. data/vendor/kreuzberg/src/api/mod.rs +79 -79
  67. data/vendor/kreuzberg/src/api/server.rs +353 -353
  68. data/vendor/kreuzberg/src/api/types.rs +170 -170
  69. data/vendor/kreuzberg/src/cache/mod.rs +1167 -1167
  70. data/vendor/kreuzberg/src/chunking/mod.rs +1877 -1877
  71. data/vendor/kreuzberg/src/chunking/processor.rs +220 -220
  72. data/vendor/kreuzberg/src/core/batch_mode.rs +95 -95
  73. data/vendor/kreuzberg/src/core/config.rs +1080 -1080
  74. data/vendor/kreuzberg/src/core/extractor.rs +1156 -1156
  75. data/vendor/kreuzberg/src/core/io.rs +329 -329
  76. data/vendor/kreuzberg/src/core/mime.rs +605 -605
  77. data/vendor/kreuzberg/src/core/mod.rs +47 -47
  78. data/vendor/kreuzberg/src/core/pipeline.rs +1184 -1171
  79. data/vendor/kreuzberg/src/embeddings.rs +500 -432
  80. data/vendor/kreuzberg/src/error.rs +431 -431
  81. data/vendor/kreuzberg/src/extraction/archive.rs +954 -954
  82. data/vendor/kreuzberg/src/extraction/docx.rs +398 -398
  83. data/vendor/kreuzberg/src/extraction/email.rs +854 -854
  84. data/vendor/kreuzberg/src/extraction/excel.rs +688 -688
  85. data/vendor/kreuzberg/src/extraction/html.rs +601 -569
  86. data/vendor/kreuzberg/src/extraction/image.rs +491 -491
  87. data/vendor/kreuzberg/src/extraction/libreoffice.rs +574 -562
  88. data/vendor/kreuzberg/src/extraction/markdown.rs +213 -213
  89. data/vendor/kreuzberg/src/extraction/mod.rs +81 -81
  90. data/vendor/kreuzberg/src/extraction/office_metadata/app_properties.rs +398 -398
  91. data/vendor/kreuzberg/src/extraction/office_metadata/core_properties.rs +247 -247
  92. data/vendor/kreuzberg/src/extraction/office_metadata/custom_properties.rs +240 -240
  93. data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +130 -130
  94. data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +284 -284
  95. data/vendor/kreuzberg/src/extraction/pptx.rs +3100 -3100
  96. data/vendor/kreuzberg/src/extraction/structured.rs +490 -490
  97. data/vendor/kreuzberg/src/extraction/table.rs +328 -328
  98. data/vendor/kreuzberg/src/extraction/text.rs +269 -269
  99. data/vendor/kreuzberg/src/extraction/xml.rs +333 -333
  100. data/vendor/kreuzberg/src/extractors/archive.rs +447 -447
  101. data/vendor/kreuzberg/src/extractors/bibtex.rs +470 -470
  102. data/vendor/kreuzberg/src/extractors/docbook.rs +504 -504
  103. data/vendor/kreuzberg/src/extractors/docx.rs +400 -400
  104. data/vendor/kreuzberg/src/extractors/email.rs +157 -157
  105. data/vendor/kreuzberg/src/extractors/epub.rs +708 -708
  106. data/vendor/kreuzberg/src/extractors/excel.rs +345 -345
  107. data/vendor/kreuzberg/src/extractors/fictionbook.rs +492 -492
  108. data/vendor/kreuzberg/src/extractors/html.rs +407 -407
  109. data/vendor/kreuzberg/src/extractors/image.rs +219 -219
  110. data/vendor/kreuzberg/src/extractors/jats.rs +1054 -1054
  111. data/vendor/kreuzberg/src/extractors/jupyter.rs +368 -368
  112. data/vendor/kreuzberg/src/extractors/latex.rs +653 -653
  113. data/vendor/kreuzberg/src/extractors/markdown.rs +701 -701
  114. data/vendor/kreuzberg/src/extractors/mod.rs +429 -429
  115. data/vendor/kreuzberg/src/extractors/odt.rs +628 -628
  116. data/vendor/kreuzberg/src/extractors/opml.rs +635 -635
  117. data/vendor/kreuzberg/src/extractors/orgmode.rs +529 -529
  118. data/vendor/kreuzberg/src/extractors/pdf.rs +749 -673
  119. data/vendor/kreuzberg/src/extractors/pptx.rs +267 -267
  120. data/vendor/kreuzberg/src/extractors/rst.rs +577 -577
  121. data/vendor/kreuzberg/src/extractors/rtf.rs +809 -809
  122. data/vendor/kreuzberg/src/extractors/security.rs +484 -484
  123. data/vendor/kreuzberg/src/extractors/security_tests.rs +367 -367
  124. data/vendor/kreuzberg/src/extractors/structured.rs +142 -142
  125. data/vendor/kreuzberg/src/extractors/text.rs +265 -265
  126. data/vendor/kreuzberg/src/extractors/typst.rs +651 -651
  127. data/vendor/kreuzberg/src/extractors/xml.rs +147 -147
  128. data/vendor/kreuzberg/src/image/dpi.rs +164 -164
  129. data/vendor/kreuzberg/src/image/mod.rs +6 -6
  130. data/vendor/kreuzberg/src/image/preprocessing.rs +417 -417
  131. data/vendor/kreuzberg/src/image/resize.rs +89 -89
  132. data/vendor/kreuzberg/src/keywords/config.rs +154 -154
  133. data/vendor/kreuzberg/src/keywords/mod.rs +237 -237
  134. data/vendor/kreuzberg/src/keywords/processor.rs +275 -275
  135. data/vendor/kreuzberg/src/keywords/rake.rs +293 -293
  136. data/vendor/kreuzberg/src/keywords/types.rs +68 -68
  137. data/vendor/kreuzberg/src/keywords/yake.rs +163 -163
  138. data/vendor/kreuzberg/src/language_detection/mod.rs +985 -985
  139. data/vendor/kreuzberg/src/language_detection/processor.rs +219 -219
  140. data/vendor/kreuzberg/src/lib.rs +113 -113
  141. data/vendor/kreuzberg/src/mcp/mod.rs +35 -35
  142. data/vendor/kreuzberg/src/mcp/server.rs +2076 -2076
  143. data/vendor/kreuzberg/src/ocr/cache.rs +469 -469
  144. data/vendor/kreuzberg/src/ocr/error.rs +37 -37
  145. data/vendor/kreuzberg/src/ocr/hocr.rs +216 -216
  146. data/vendor/kreuzberg/src/ocr/mod.rs +58 -58
  147. data/vendor/kreuzberg/src/ocr/processor.rs +863 -863
  148. data/vendor/kreuzberg/src/ocr/table/mod.rs +4 -4
  149. data/vendor/kreuzberg/src/ocr/table/tsv_parser.rs +144 -144
  150. data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +452 -452
  151. data/vendor/kreuzberg/src/ocr/types.rs +393 -393
  152. data/vendor/kreuzberg/src/ocr/utils.rs +47 -47
  153. data/vendor/kreuzberg/src/ocr/validation.rs +206 -206
  154. data/vendor/kreuzberg/src/panic_context.rs +154 -154
  155. data/vendor/kreuzberg/src/pdf/bindings.rs +44 -0
  156. data/vendor/kreuzberg/src/pdf/bundled.rs +346 -328
  157. data/vendor/kreuzberg/src/pdf/error.rs +130 -130
  158. data/vendor/kreuzberg/src/pdf/images.rs +139 -139
  159. data/vendor/kreuzberg/src/pdf/metadata.rs +489 -489
  160. data/vendor/kreuzberg/src/pdf/mod.rs +68 -66
  161. data/vendor/kreuzberg/src/pdf/rendering.rs +368 -368
  162. data/vendor/kreuzberg/src/pdf/table.rs +420 -417
  163. data/vendor/kreuzberg/src/pdf/text.rs +240 -240
  164. data/vendor/kreuzberg/src/plugins/extractor.rs +1044 -1044
  165. data/vendor/kreuzberg/src/plugins/mod.rs +212 -212
  166. data/vendor/kreuzberg/src/plugins/ocr.rs +639 -639
  167. data/vendor/kreuzberg/src/plugins/processor.rs +650 -650
  168. data/vendor/kreuzberg/src/plugins/registry.rs +1339 -1339
  169. data/vendor/kreuzberg/src/plugins/traits.rs +258 -258
  170. data/vendor/kreuzberg/src/plugins/validator.rs +967 -967
  171. data/vendor/kreuzberg/src/stopwords/mod.rs +1470 -1470
  172. data/vendor/kreuzberg/src/text/mod.rs +25 -25
  173. data/vendor/kreuzberg/src/text/quality.rs +697 -697
  174. data/vendor/kreuzberg/src/text/quality_processor.rs +219 -219
  175. data/vendor/kreuzberg/src/text/string_utils.rs +217 -217
  176. data/vendor/kreuzberg/src/text/token_reduction/cjk_utils.rs +164 -164
  177. data/vendor/kreuzberg/src/text/token_reduction/config.rs +100 -100
  178. data/vendor/kreuzberg/src/text/token_reduction/core.rs +796 -796
  179. data/vendor/kreuzberg/src/text/token_reduction/filters.rs +902 -902
  180. data/vendor/kreuzberg/src/text/token_reduction/mod.rs +160 -160
  181. data/vendor/kreuzberg/src/text/token_reduction/semantic.rs +619 -619
  182. data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +147 -147
  183. data/vendor/kreuzberg/src/types.rs +1055 -1055
  184. data/vendor/kreuzberg/src/utils/mod.rs +17 -17
  185. data/vendor/kreuzberg/src/utils/quality.rs +959 -959
  186. data/vendor/kreuzberg/src/utils/string_utils.rs +381 -381
  187. data/vendor/kreuzberg/stopwords/af_stopwords.json +53 -53
  188. data/vendor/kreuzberg/stopwords/ar_stopwords.json +482 -482
  189. data/vendor/kreuzberg/stopwords/bg_stopwords.json +261 -261
  190. data/vendor/kreuzberg/stopwords/bn_stopwords.json +400 -400
  191. data/vendor/kreuzberg/stopwords/br_stopwords.json +1205 -1205
  192. data/vendor/kreuzberg/stopwords/ca_stopwords.json +280 -280
  193. data/vendor/kreuzberg/stopwords/cs_stopwords.json +425 -425
  194. data/vendor/kreuzberg/stopwords/da_stopwords.json +172 -172
  195. data/vendor/kreuzberg/stopwords/de_stopwords.json +622 -622
  196. data/vendor/kreuzberg/stopwords/el_stopwords.json +849 -849
  197. data/vendor/kreuzberg/stopwords/en_stopwords.json +1300 -1300
  198. data/vendor/kreuzberg/stopwords/eo_stopwords.json +175 -175
  199. data/vendor/kreuzberg/stopwords/es_stopwords.json +734 -734
  200. data/vendor/kreuzberg/stopwords/et_stopwords.json +37 -37
  201. data/vendor/kreuzberg/stopwords/eu_stopwords.json +100 -100
  202. data/vendor/kreuzberg/stopwords/fa_stopwords.json +801 -801
  203. data/vendor/kreuzberg/stopwords/fi_stopwords.json +849 -849
  204. data/vendor/kreuzberg/stopwords/fr_stopwords.json +693 -693
  205. data/vendor/kreuzberg/stopwords/ga_stopwords.json +111 -111
  206. data/vendor/kreuzberg/stopwords/gl_stopwords.json +162 -162
  207. data/vendor/kreuzberg/stopwords/gu_stopwords.json +226 -226
  208. data/vendor/kreuzberg/stopwords/ha_stopwords.json +41 -41
  209. data/vendor/kreuzberg/stopwords/he_stopwords.json +196 -196
  210. data/vendor/kreuzberg/stopwords/hi_stopwords.json +227 -227
  211. data/vendor/kreuzberg/stopwords/hr_stopwords.json +181 -181
  212. data/vendor/kreuzberg/stopwords/hu_stopwords.json +791 -791
  213. data/vendor/kreuzberg/stopwords/hy_stopwords.json +47 -47
  214. data/vendor/kreuzberg/stopwords/id_stopwords.json +760 -760
  215. data/vendor/kreuzberg/stopwords/it_stopwords.json +634 -634
  216. data/vendor/kreuzberg/stopwords/ja_stopwords.json +136 -136
  217. data/vendor/kreuzberg/stopwords/kn_stopwords.json +84 -84
  218. data/vendor/kreuzberg/stopwords/ko_stopwords.json +681 -681
  219. data/vendor/kreuzberg/stopwords/ku_stopwords.json +64 -64
  220. data/vendor/kreuzberg/stopwords/la_stopwords.json +51 -51
  221. data/vendor/kreuzberg/stopwords/lt_stopwords.json +476 -476
  222. data/vendor/kreuzberg/stopwords/lv_stopwords.json +163 -163
  223. data/vendor/kreuzberg/stopwords/ml_stopwords.json +1 -1
  224. data/vendor/kreuzberg/stopwords/mr_stopwords.json +101 -101
  225. data/vendor/kreuzberg/stopwords/ms_stopwords.json +477 -477
  226. data/vendor/kreuzberg/stopwords/ne_stopwords.json +490 -490
  227. data/vendor/kreuzberg/stopwords/nl_stopwords.json +415 -415
  228. data/vendor/kreuzberg/stopwords/no_stopwords.json +223 -223
  229. data/vendor/kreuzberg/stopwords/pl_stopwords.json +331 -331
  230. data/vendor/kreuzberg/stopwords/pt_stopwords.json +562 -562
  231. data/vendor/kreuzberg/stopwords/ro_stopwords.json +436 -436
  232. data/vendor/kreuzberg/stopwords/ru_stopwords.json +561 -561
  233. data/vendor/kreuzberg/stopwords/si_stopwords.json +193 -193
  234. data/vendor/kreuzberg/stopwords/sk_stopwords.json +420 -420
  235. data/vendor/kreuzberg/stopwords/sl_stopwords.json +448 -448
  236. data/vendor/kreuzberg/stopwords/so_stopwords.json +32 -32
  237. data/vendor/kreuzberg/stopwords/st_stopwords.json +33 -33
  238. data/vendor/kreuzberg/stopwords/sv_stopwords.json +420 -420
  239. data/vendor/kreuzberg/stopwords/sw_stopwords.json +76 -76
  240. data/vendor/kreuzberg/stopwords/ta_stopwords.json +129 -129
  241. data/vendor/kreuzberg/stopwords/te_stopwords.json +54 -54
  242. data/vendor/kreuzberg/stopwords/th_stopwords.json +118 -118
  243. data/vendor/kreuzberg/stopwords/tl_stopwords.json +149 -149
  244. data/vendor/kreuzberg/stopwords/tr_stopwords.json +506 -506
  245. data/vendor/kreuzberg/stopwords/uk_stopwords.json +75 -75
  246. data/vendor/kreuzberg/stopwords/ur_stopwords.json +519 -519
  247. data/vendor/kreuzberg/stopwords/vi_stopwords.json +647 -647
  248. data/vendor/kreuzberg/stopwords/yo_stopwords.json +62 -62
  249. data/vendor/kreuzberg/stopwords/zh_stopwords.json +796 -796
  250. data/vendor/kreuzberg/stopwords/zu_stopwords.json +31 -31
  251. data/vendor/kreuzberg/tests/api_extract_multipart.rs +52 -52
  252. data/vendor/kreuzberg/tests/api_tests.rs +966 -966
  253. data/vendor/kreuzberg/tests/archive_integration.rs +545 -545
  254. data/vendor/kreuzberg/tests/batch_orchestration.rs +556 -556
  255. data/vendor/kreuzberg/tests/batch_processing.rs +318 -318
  256. data/vendor/kreuzberg/tests/bibtex_parity_test.rs +421 -421
  257. data/vendor/kreuzberg/tests/concurrency_stress.rs +533 -533
  258. data/vendor/kreuzberg/tests/config_features.rs +612 -612
  259. data/vendor/kreuzberg/tests/config_loading_tests.rs +416 -416
  260. data/vendor/kreuzberg/tests/core_integration.rs +510 -510
  261. data/vendor/kreuzberg/tests/csv_integration.rs +414 -414
  262. data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +500 -500
  263. data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +122 -122
  264. data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +370 -370
  265. data/vendor/kreuzberg/tests/email_integration.rs +327 -327
  266. data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +275 -275
  267. data/vendor/kreuzberg/tests/error_handling.rs +402 -402
  268. data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +228 -228
  269. data/vendor/kreuzberg/tests/format_integration.rs +164 -161
  270. data/vendor/kreuzberg/tests/helpers/mod.rs +142 -142
  271. data/vendor/kreuzberg/tests/html_table_test.rs +551 -551
  272. data/vendor/kreuzberg/tests/image_integration.rs +255 -255
  273. data/vendor/kreuzberg/tests/instrumentation_test.rs +139 -139
  274. data/vendor/kreuzberg/tests/jats_extractor_tests.rs +639 -639
  275. data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +704 -704
  276. data/vendor/kreuzberg/tests/keywords_integration.rs +479 -479
  277. data/vendor/kreuzberg/tests/keywords_quality.rs +509 -509
  278. data/vendor/kreuzberg/tests/latex_extractor_tests.rs +496 -496
  279. data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +490 -490
  280. data/vendor/kreuzberg/tests/mime_detection.rs +429 -429
  281. data/vendor/kreuzberg/tests/ocr_configuration.rs +514 -514
  282. data/vendor/kreuzberg/tests/ocr_errors.rs +698 -698
  283. data/vendor/kreuzberg/tests/ocr_quality.rs +629 -629
  284. data/vendor/kreuzberg/tests/ocr_stress.rs +469 -469
  285. data/vendor/kreuzberg/tests/odt_extractor_tests.rs +674 -674
  286. data/vendor/kreuzberg/tests/opml_extractor_tests.rs +616 -616
  287. data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +822 -822
  288. data/vendor/kreuzberg/tests/pdf_integration.rs +45 -45
  289. data/vendor/kreuzberg/tests/pdfium_linking.rs +374 -374
  290. data/vendor/kreuzberg/tests/pipeline_integration.rs +1436 -1436
  291. data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +776 -776
  292. data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +560 -560
  293. data/vendor/kreuzberg/tests/plugin_system.rs +927 -927
  294. data/vendor/kreuzberg/tests/plugin_validator_test.rs +783 -783
  295. data/vendor/kreuzberg/tests/registry_integration_tests.rs +587 -587
  296. data/vendor/kreuzberg/tests/rst_extractor_tests.rs +694 -694
  297. data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +775 -775
  298. data/vendor/kreuzberg/tests/security_validation.rs +416 -416
  299. data/vendor/kreuzberg/tests/stopwords_integration_test.rs +888 -888
  300. data/vendor/kreuzberg/tests/test_fastembed.rs +631 -631
  301. data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +1260 -1260
  302. data/vendor/kreuzberg/tests/typst_extractor_tests.rs +648 -648
  303. data/vendor/kreuzberg/tests/xlsx_metadata_extraction_test.rs +87 -87
  304. data/vendor/kreuzberg-ffi/Cargo.toml +63 -0
  305. data/vendor/kreuzberg-ffi/README.md +851 -0
  306. data/vendor/kreuzberg-ffi/build.rs +176 -0
  307. data/vendor/kreuzberg-ffi/cbindgen.toml +27 -0
  308. data/vendor/kreuzberg-ffi/kreuzberg-ffi-install.pc +12 -0
  309. data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +12 -0
  310. data/vendor/kreuzberg-ffi/kreuzberg.h +1087 -0
  311. data/vendor/kreuzberg-ffi/src/lib.rs +3616 -0
  312. data/vendor/kreuzberg-ffi/src/panic_shield.rs +247 -0
  313. data/vendor/kreuzberg-ffi/tests.disabled/README.md +48 -0
  314. data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +299 -0
  315. data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +346 -0
  316. data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +232 -0
  317. data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +470 -0
  318. data/vendor/kreuzberg-tesseract/.commitlintrc.json +13 -0
  319. data/vendor/kreuzberg-tesseract/.crate-ignore +2 -0
  320. data/vendor/kreuzberg-tesseract/Cargo.lock +2933 -0
  321. data/vendor/kreuzberg-tesseract/Cargo.toml +48 -0
  322. data/vendor/kreuzberg-tesseract/LICENSE +22 -0
  323. data/vendor/kreuzberg-tesseract/README.md +399 -0
  324. data/vendor/kreuzberg-tesseract/build.rs +1354 -0
  325. data/vendor/kreuzberg-tesseract/patches/README.md +71 -0
  326. data/vendor/kreuzberg-tesseract/patches/tesseract.diff +199 -0
  327. data/vendor/kreuzberg-tesseract/src/api.rs +1371 -0
  328. data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +77 -0
  329. data/vendor/kreuzberg-tesseract/src/enums.rs +297 -0
  330. data/vendor/kreuzberg-tesseract/src/error.rs +81 -0
  331. data/vendor/kreuzberg-tesseract/src/lib.rs +145 -0
  332. data/vendor/kreuzberg-tesseract/src/monitor.rs +57 -0
  333. data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +197 -0
  334. data/vendor/kreuzberg-tesseract/src/page_iterator.rs +253 -0
  335. data/vendor/kreuzberg-tesseract/src/result_iterator.rs +286 -0
  336. data/vendor/kreuzberg-tesseract/src/result_renderer.rs +183 -0
  337. data/vendor/kreuzberg-tesseract/tests/integration_test.rs +211 -0
  338. data/vendor/rb-sys/.cargo_vcs_info.json +5 -5
  339. data/vendor/rb-sys/Cargo.lock +393 -393
  340. data/vendor/rb-sys/Cargo.toml +70 -70
  341. data/vendor/rb-sys/Cargo.toml.orig +57 -57
  342. data/vendor/rb-sys/LICENSE-APACHE +190 -190
  343. data/vendor/rb-sys/LICENSE-MIT +21 -21
  344. data/vendor/rb-sys/build/features.rs +111 -111
  345. data/vendor/rb-sys/build/main.rs +286 -286
  346. data/vendor/rb-sys/build/stable_api_config.rs +155 -155
  347. data/vendor/rb-sys/build/version.rs +50 -50
  348. data/vendor/rb-sys/readme.md +36 -36
  349. data/vendor/rb-sys/src/bindings.rs +21 -21
  350. data/vendor/rb-sys/src/hidden.rs +11 -11
  351. data/vendor/rb-sys/src/lib.rs +35 -35
  352. data/vendor/rb-sys/src/macros.rs +371 -371
  353. data/vendor/rb-sys/src/memory.rs +53 -53
  354. data/vendor/rb-sys/src/ruby_abi_version.rs +38 -38
  355. data/vendor/rb-sys/src/special_consts.rs +31 -31
  356. data/vendor/rb-sys/src/stable_api/compiled.c +179 -179
  357. data/vendor/rb-sys/src/stable_api/compiled.rs +257 -257
  358. data/vendor/rb-sys/src/stable_api/ruby_2_7.rs +324 -324
  359. data/vendor/rb-sys/src/stable_api/ruby_3_0.rs +332 -332
  360. data/vendor/rb-sys/src/stable_api/ruby_3_1.rs +325 -325
  361. data/vendor/rb-sys/src/stable_api/ruby_3_2.rs +323 -323
  362. data/vendor/rb-sys/src/stable_api/ruby_3_3.rs +339 -339
  363. data/vendor/rb-sys/src/stable_api/ruby_3_4.rs +339 -339
  364. data/vendor/rb-sys/src/stable_api.rs +260 -260
  365. data/vendor/rb-sys/src/symbol.rs +31 -31
  366. data/vendor/rb-sys/src/tracking_allocator.rs +330 -330
  367. data/vendor/rb-sys/src/utils.rs +89 -89
  368. data/vendor/rb-sys/src/value_type.rs +7 -7
  369. metadata +44 -81
  370. data/vendor/rb-sys/bin/release.sh +0 -21
data/README.md CHANGED
@@ -1,432 +1,454 @@
1
- # Kreuzberg
2
-
3
- [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
- [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
- [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
- [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
- [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
- [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
- [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
- [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
-
12
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
- [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
-
16
- High-performance document intelligence for Ruby, powered by Rust.
17
-
18
- Extract text, tables, images, and metadata from 56 file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
-
20
- > **🚀 Version 4.0.0 Release Candidate**
21
- > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
22
-
23
- ## Features
24
-
25
- - **56 File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
26
- - **OCR Support**: Built-in Tesseract OCR for scanned documents and images
27
- - **High Performance**: Rust-powered extraction for native-level performance
28
- - **Table Extraction**: Extract structured tables from documents
29
- - **Language Detection**: Automatic language detection for extracted text
30
- - **Text Chunking**: Split long documents into manageable chunks
31
- - **Caching**: Built-in result caching for faster repeated extractions
32
- - **Type-Safe**: Comprehensive typed configuration and result objects
33
-
34
- ## Requirements
35
-
36
- - Ruby 3.2 or higher
37
- - Rust toolchain (for building from source)
38
-
39
- ### Optional System Dependencies
40
-
41
- - **Tesseract**: For OCR functionality
42
- - macOS: `brew install tesseract`
43
- - Ubuntu: `sudo apt-get install tesseract-ocr`
44
- - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
45
-
46
- - **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
47
- - macOS: `brew install libreoffice`
48
- - Ubuntu: `sudo apt-get install libreoffice`
49
-
50
- - **Pandoc**: For advanced document conversion
51
- - macOS: `brew install pandoc`
52
- - Ubuntu: `sudo apt-get install pandoc`
53
-
54
- ## Installation
55
-
56
- Add to your Gemfile:
57
-
58
- ```ruby
59
- gem 'kreuzberg'
60
- ```
61
-
62
- Then run:
63
-
64
- ```bash
65
- bundle install
66
- ```
67
-
68
- Or install directly:
69
-
70
- ```bash
71
- gem install kreuzberg
72
- ```
73
-
74
- ## Quick Start
75
-
76
- ### Basic Extraction
77
-
78
- ```ruby
79
- require 'kreuzberg'
80
-
81
- # Extract from a file
82
- result = Kreuzberg.extract_file_sync("document.pdf")
83
- puts result.content
84
- puts "MIME type: #{result.mime_type}"
85
- ```
86
-
87
- ### With Configuration
88
-
89
- ```ruby
90
- # Create configuration
91
- config = Kreuzberg::Config::Extraction.new(
92
- use_cache: true,
93
- force_ocr: false
94
- )
95
-
96
- result = Kreuzberg.extract_file_sync("document.pdf", config: config)
97
- ```
98
-
99
- ### With OCR
100
-
101
- ```ruby
102
- # Configure OCR
103
- ocr_config = Kreuzberg::Config::OCR.new(
104
- backend: "tesseract",
105
- language: "eng",
106
- preprocessing: true
107
- )
108
-
109
- config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
110
- result = Kreuzberg.extract_file_sync("scanned.pdf", config: config)
111
- ```
112
-
113
- ### Extract from Bytes
114
-
115
- ```ruby
116
- data = File.binread("document.pdf")
117
- result = Kreuzberg.extract_bytes_sync(data, "application/pdf")
118
- puts result.content
119
- ```
120
-
121
- ### Batch Processing
122
-
123
- ```ruby
124
- paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
125
- results = Kreuzberg.batch_extract_files_sync(paths)
126
-
127
- results.each do |result|
128
- puts "Content: #{result.content[0..100]}"
129
- puts "MIME: #{result.mime_type}"
130
- end
131
- ```
132
-
133
- ### Structured Results (Chunks & Images)
134
-
135
- ```ruby
136
- result = Kreuzberg.extract_file_sync("long-report.pdf", config: {
137
- chunking: { max_chars: 750 },
138
- image_extraction: { extract_images: true }
139
- })
140
-
141
- result.chunks&.each do |chunk|
142
- puts "[#{chunk.chunk_index + 1}/#{chunk.total_chunks}] #{chunk.content[0..80]}"
143
- end
144
-
145
- result.images&.each do |image|
146
- File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
147
- if image.ocr_result
148
- puts "Embedded OCR content: #{image.ocr_result.content[0..60]}"
149
- end
150
- end
151
- ```
152
-
153
- ## Configuration
154
-
155
- ### Load From File
156
-
157
- ```ruby
158
- config = Kreuzberg::Config::Extraction.from_file("config.toml")
159
- result = Kreuzberg.extract_file_sync("report.pdf", config: config)
160
- ```
161
-
162
- ### Extraction Configuration
163
-
164
- ```ruby
165
- config = Kreuzberg::Config::Extraction.new(
166
- use_cache: true, # Enable result caching
167
- enable_quality_processing: false, # Enable text quality processing
168
- force_ocr: false # Force OCR even for digital PDFs
169
- )
170
- ```
171
-
172
- ### OCR Configuration
173
-
174
- ```ruby
175
- ocr = Kreuzberg::Config::OCR.new(
176
- backend: "tesseract", # OCR backend (tesseract, easyocr, paddleocr)
177
- language: "eng", # Language code (eng, deu, fra, etc.)
178
- tesseract_config: {
179
- psm: 6,
180
- enable_table_detection: true,
181
- preprocessing: Kreuzberg::Config::ImagePreprocessing.new(auto_rotate: true).to_h
182
- }
183
- )
184
-
185
- config = Kreuzberg::Config::Extraction.new(ocr: ocr)
186
- ```
187
-
188
- ### Chunking Configuration
189
-
190
- ```ruby
191
- chunking = Kreuzberg::Config::Chunking.new(
192
- enabled: true,
193
- chunk_size: 1000, # Characters per chunk
194
- chunk_overlap: 200, # Overlap between chunks
195
- embedding: {
196
- model: { type: :preset, name: "balanced" },
197
- normalize: true
198
- }
199
- )
200
-
201
- config = Kreuzberg::Config::Extraction.new(chunking: chunking)
202
- result = Kreuzberg.extract_file_sync("long_document.pdf", config: config)
203
-
204
- result.chunks.each do |chunk|
205
- puts "Chunk: #{chunk.content}"
206
- puts "Tokens: #{chunk.token_count}"
207
- end
208
- ```
209
-
210
- ### HTML Conversion Options
211
-
212
- ```ruby
213
- html_options = Kreuzberg::Config::HtmlOptions.new(
214
- heading_style: :atx_closed,
215
- wrap: true,
216
- wrap_width: 100,
217
- preprocessing: { enabled: true, preset: :standard }
218
- )
219
-
220
- config = Kreuzberg::Config::Extraction.new(html_options: html_options)
221
- result = Kreuzberg.extract_file_sync("page.html", config: config)
222
- ```
223
-
224
- ### Keyword Extraction
225
-
226
- ```ruby
227
- keywords = Kreuzberg::Config::Keywords.new(
228
- algorithm: :yake,
229
- max_keywords: 8,
230
- min_score: 0.2,
231
- ngram_range: [1, 3]
232
- )
233
-
234
- config = Kreuzberg::Config::Extraction.new(keywords: keywords)
235
- result = Kreuzberg.extract_file_sync("research.pdf", config: config)
236
- ```
237
-
238
- ### Language Detection
239
-
240
- ```ruby
241
- lang_detection = Kreuzberg::Config::LanguageDetection.new(
242
- enabled: true,
243
- min_confidence: 0.8,
244
- detect_multiple: true
245
- )
246
-
247
- config = Kreuzberg::Config::Extraction.new(language_detection: lang_detection)
248
- result = Kreuzberg.extract_file_sync("multilingual.pdf", config: config)
249
-
250
- result.detected_languages&.each do |lang|
251
- puts "Language: #{lang.lang}, Confidence: #{lang.confidence}"
252
- end
253
- ```
254
-
255
- ### PDF Options
256
-
257
- ```ruby
258
- pdf_options = Kreuzberg::Config::PDF.new(
259
- extract_images: true,
260
- image_min_size: 10000, # Minimum image size in bytes
261
- password: "secret" # PDF password
262
- )
263
-
264
- config = Kreuzberg::Config::Extraction.new(pdf_options: pdf_options)
265
- ```
266
-
267
- ## Working with Results
268
-
269
- ```ruby
270
- result = Kreuzberg.extract_file_sync("invoice.pdf")
271
-
272
- # Access extracted text
273
- puts result.content
274
-
275
- # Access MIME type
276
- puts result.mime_type
277
-
278
- # Access metadata
279
- puts result.metadata.inspect
280
-
281
- # Access extracted tables
282
- result.tables.each do |table|
283
- puts "Headers: #{table.headers.join(', ')}"
284
- table.rows.each do |row|
285
- puts row.join(', ')
286
- end
287
- end
288
-
289
- # Access text chunks and metadata
290
- result.chunks&.each do |chunk|
291
- puts "Chunk #{chunk.chunk_index + 1}/#{chunk.total_chunks}"
292
- puts "Chars: #{chunk.char_start}-#{chunk.char_end}"
293
- puts "Embedding length: #{chunk.embedding&.length}"
294
- end
295
-
296
- # Access extracted images
297
- result.images&.each do |image|
298
- File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
299
- puts "Image #{image.image_index} on page #{image.page_number}"
300
- end
301
-
302
- # Convert to hash
303
- hash = result.to_h
304
-
305
- # Convert to JSON
306
- json = result.to_json
307
- ```
308
-
309
- ## CLI Usage
310
-
311
- Kreuzberg provides a Ruby wrapper for the CLI:
312
-
313
- ```ruby
314
- # Extract content
315
- output = Kreuzberg::CLI.extract("document.pdf", output: "text")
316
-
317
- # Detect MIME type
318
- mime_type = Kreuzberg::CLI.detect("document.pdf")
319
-
320
- # Get version
321
- version = Kreuzberg::CLI.version
322
- ```
323
-
324
- ## API Server
325
-
326
- Start an API server (requires kreuzberg CLI):
327
-
328
- ```ruby
329
- Kreuzberg::APIProxy.run(port: 8000) do |server|
330
- # Server runs in background
331
- # Make HTTP requests to http://localhost:8000
332
- end
333
- ```
334
-
335
- ## MCP Server
336
-
337
- Start a Model Context Protocol server for Claude Desktop:
338
-
339
- ```ruby
340
- server = Kreuzberg::MCPProxy::Server.new(transport: 'stdio')
341
- server.start
342
-
343
- # Use with Claude Desktop integration
344
- ```
345
-
346
- ## Cache Management
347
-
348
- ```ruby
349
- # Get cache statistics
350
- stats = Kreuzberg.cache_stats
351
- puts "Entries: #{stats[:total_entries]}"
352
- puts "Size: #{stats[:total_size_bytes]} bytes"
353
-
354
- # Clear cache
355
- Kreuzberg.clear_cache
356
- ```
357
-
358
- ## Error Handling
359
-
360
- ```ruby
361
- begin
362
- result = Kreuzberg.extract_file_sync("document.pdf")
363
- rescue Kreuzberg::Errors::ParsingError => e
364
- puts "Parsing failed: #{e.message}"
365
- puts "Context: #{e.context}"
366
- rescue Kreuzberg::Errors::OCRError => e
367
- puts "OCR failed: #{e.message}"
368
- rescue Kreuzberg::Errors::MissingDependencyError => e
369
- puts "Missing dependency: #{e.dependency}"
370
- rescue Kreuzberg::Errors::Error => e
371
- puts "Kreuzberg error: #{e.message}"
372
- end
373
- ```
374
-
375
- ## Supported Formats
376
-
377
- - **Documents**: PDF, DOCX, DOC, PPTX, PPT, ODT, ODP
378
- - **Spreadsheets**: XLSX, XLS, ODS, CSV
379
- - **Images**: PNG, JPEG, TIFF, BMP, GIF
380
- - **Web**: HTML, MHTML, Markdown
381
- - **Data**: JSON, YAML, TOML, XML
382
- - **Email**: EML, MSG
383
- - **Archives**: ZIP, TAR, 7Z
384
- - **Text**: TXT, RTF, MD
385
-
386
- ## Performance
387
-
388
- Kreuzberg's Rust core provides significant performance improvements:
389
-
390
- - **PDF extraction**: 10-50x faster than pure Ruby solutions
391
- - **Batch processing**: Parallel extraction with Tokio async runtime
392
- - **Memory efficient**: Streaming parsers for large files
393
- - **Caching**: Automatic result caching for repeated extractions
394
-
395
- ## Development
396
-
397
- ```bash
398
- # Clone the repository
399
- git clone https://github.com/kreuzberg-dev/kreuzberg.git
400
- cd kreuzberg/packages/ruby
401
-
402
- # Install dependencies
403
- bundle install
404
-
405
- # Set up vendor symlink for local development (required for building)
406
- ln -sfn ../../crates/kreuzberg vendor/kreuzberg
407
-
408
- # Build the Rust extension
409
- bundle exec rake compile
410
-
411
- # Run tests
412
- bundle exec rspec
413
-
414
- # Run RuboCop
415
- bundle exec rubocop
416
- ```
417
-
418
- **Note**: The Ruby bindings use a vendored copy of the core `kreuzberg` Rust crate. For local development, create a symlink at `vendor/kreuzberg` pointing to `../../crates/kreuzberg`. In CI and gem packaging, the actual vendored files are copied to this location.
419
-
420
- ## License
421
-
422
- MIT License. See [LICENSE](../../LICENSE) for details.
423
-
424
- ## Contributing
425
-
426
- Contributions are welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.
427
-
428
- ## Links
429
-
430
- - **Documentation**: https://kreuzberg.dev
431
- - **GitHub**: https://github.com/kreuzberg-dev/kreuzberg
432
- - **Issues**: https://github.com/kreuzberg-dev/kreuzberg/issues
1
+ # Kreuzberg
2
+
3
+ [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
+ [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
+ [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
+ [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
+ [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
+ [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
+ [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
+ [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
+
12
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
+ [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
+
16
+ High-performance document intelligence for Ruby, powered by Rust.
17
+
18
+ Extract text, tables, images, and metadata from 56 file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
+
20
+ > **🚀 Version 4.0.0 Release Candidate**
21
+ > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
22
+
23
+ ## Features
24
+
25
+ - **56 File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
26
+ - **OCR Support**: Built-in Tesseract OCR for scanned documents and images
27
+ - **High Performance**: Rust-powered extraction for native-level performance
28
+ - **Table Extraction**: Extract structured tables from documents
29
+ - **Language Detection**: Automatic language detection for extracted text
30
+ - **Text Chunking**: Split long documents into manageable chunks
31
+ - **Caching**: Built-in result caching for faster repeated extractions
32
+ - **Type-Safe**: Comprehensive typed configuration and result objects
33
+
34
+ ## Requirements
35
+
36
+ - Ruby 3.2 or higher
37
+ - Rust toolchain (for building from source)
38
+
39
+ ### Optional System Dependencies
40
+
41
+ - **ONNX Runtime**: For embeddings functionality
42
+ - macOS: `brew install onnxruntime`
43
+ - Ubuntu: `sudo apt-get install libonnxruntime libonnxruntime-dev`
44
+ - Windows: `scoop install onnxruntime` or download from [GitHub](https://github.com/microsoft/onnxruntime/releases)
45
+
46
+ - **Tesseract**: For OCR functionality
47
+ - macOS: `brew install tesseract`
48
+ - Ubuntu: `sudo apt-get install tesseract-ocr`
49
+ - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
50
+
51
+ - **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
52
+ - macOS: `brew install libreoffice`
53
+ - Ubuntu: `sudo apt-get install libreoffice`
54
+
55
+ - **Pandoc**: For advanced document conversion
56
+ - macOS: `brew install pandoc`
57
+ - Ubuntu: `sudo apt-get install pandoc`
58
+
59
+ ## Installation
60
+
61
+ Add to your Gemfile:
62
+
63
+ ```ruby
64
+ gem 'kreuzberg'
65
+ ```
66
+
67
+ Then run:
68
+
69
+ ```bash
70
+ bundle install
71
+ ```
72
+
73
+ Or install directly:
74
+
75
+ ```bash
76
+ gem install kreuzberg
77
+ ```
78
+
79
+ ## Quick Start
80
+
81
+ ### Basic Extraction
82
+
83
+ ```ruby
84
+ require 'kreuzberg'
85
+
86
+ # Extract from a file
87
+ result = Kreuzberg.extract_file_sync("document.pdf")
88
+ puts result.content
89
+ puts "MIME type: #{result.mime_type}"
90
+ ```
91
+
92
+ ### With Configuration
93
+
94
+ ```ruby
95
+ # Create configuration
96
+ config = Kreuzberg::Config::Extraction.new(
97
+ use_cache: true,
98
+ force_ocr: false
99
+ )
100
+
101
+ result = Kreuzberg.extract_file_sync("document.pdf", config: config)
102
+ ```
103
+
104
+ ### With OCR
105
+
106
+ ```ruby
107
+ # Configure OCR
108
+ ocr_config = Kreuzberg::Config::OCR.new(
109
+ backend: "tesseract",
110
+ language: "eng",
111
+ preprocessing: true
112
+ )
113
+
114
+ config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
115
+ result = Kreuzberg.extract_file_sync("scanned.pdf", config: config)
116
+ ```
117
+
118
+ ### Extract from Bytes
119
+
120
+ ```ruby
121
+ data = File.binread("document.pdf")
122
+ result = Kreuzberg.extract_bytes_sync(data, "application/pdf")
123
+ puts result.content
124
+ ```
125
+
126
+ ### Batch Processing
127
+
128
+ ```ruby
129
+ paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
130
+ results = Kreuzberg.batch_extract_files_sync(paths)
131
+
132
+ results.each do |result|
133
+ puts "Content: #{result.content[0..100]}"
134
+ puts "MIME: #{result.mime_type}"
135
+ end
136
+ ```
137
+
138
+ ### Structured Results (Chunks & Images)
139
+
140
+ ```ruby
141
+ result = Kreuzberg.extract_file_sync("long-report.pdf", config: {
142
+ chunking: { max_chars: 750 },
143
+ image_extraction: { extract_images: true }
144
+ })
145
+
146
+ result.chunks&.each do |chunk|
147
+ puts "[#{chunk.chunk_index + 1}/#{chunk.total_chunks}] #{chunk.content[0..80]}"
148
+ end
149
+
150
+ result.images&.each do |image|
151
+ File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
152
+ if image.ocr_result
153
+ puts "Embedded OCR content: #{image.ocr_result.content[0..60]}"
154
+ end
155
+ end
156
+ ```
157
+
158
+ ## Configuration
159
+
160
+ ### Load From File
161
+
162
+ ```ruby
163
+ config = Kreuzberg::Config::Extraction.from_file("config.toml")
164
+ result = Kreuzberg.extract_file_sync("report.pdf", config: config)
165
+ ```
166
+
167
+ ### Extraction Configuration
168
+
169
+ ```ruby
170
+ config = Kreuzberg::Config::Extraction.new(
171
+ use_cache: true, # Enable result caching
172
+ enable_quality_processing: false, # Enable text quality processing
173
+ force_ocr: false # Force OCR even for digital PDFs
174
+ )
175
+ ```
176
+
177
+ ### OCR Configuration
178
+
179
+ ```ruby
180
+ ocr = Kreuzberg::Config::OCR.new(
181
+ backend: "tesseract", # OCR backend (tesseract, easyocr, paddleocr)
182
+ language: "eng", # Language code (eng, deu, fra, etc.)
183
+ tesseract_config: {
184
+ psm: 6,
185
+ enable_table_detection: true,
186
+ preprocessing: Kreuzberg::Config::ImagePreprocessing.new(auto_rotate: true).to_h
187
+ }
188
+ )
189
+
190
+ config = Kreuzberg::Config::Extraction.new(ocr: ocr)
191
+ ```
192
+
193
+ ### Chunking Configuration
194
+
195
+ ```ruby
196
+ chunking = Kreuzberg::Config::Chunking.new(
197
+ enabled: true,
198
+ chunk_size: 1000, # Characters per chunk
199
+ chunk_overlap: 200, # Overlap between chunks
200
+ embedding: {
201
+ model: { type: :preset, name: "balanced" },
202
+ normalize: true
203
+ }
204
+ )
205
+
206
+ config = Kreuzberg::Config::Extraction.new(chunking: chunking)
207
+ result = Kreuzberg.extract_file_sync("long_document.pdf", config: config)
208
+
209
+ result.chunks&.each do |chunk|
210
+ puts "Chunk: #{chunk.content}"
211
+ puts "Tokens: #{chunk.token_count}"
212
+ end
213
+ ```
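To make the relationship between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows in plain Ruby. The `sliding_chunks` lambda is a hypothetical illustration and not part of the Kreuzberg API; the library's own chunker may split on different boundaries (for example sentences or tokens), so treat this only as a mental model for how the two numbers interact.

```ruby
# Hypothetical helper (not a Kreuzberg method): each window starts
# (chunk_size - chunk_overlap) characters after the previous one, so
# consecutive windows share chunk_overlap characters.
sliding_chunks = lambda do |text, chunk_size:, chunk_overlap:|
  raise ArgumentError, "chunk_overlap must be smaller than chunk_size" if chunk_overlap >= chunk_size

  step = chunk_size - chunk_overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    # Stop once the current window already reaches the end of the text.
    break if start + chunk_size >= text.length

    start += step
  end
  chunks
end

sample = ("a".."z").to_a.join # 26 characters
chunks = sliding_chunks.call(sample, chunk_size: 10, chunk_overlap: 4)
chunks.each_with_index { |chunk, index| puts "#{index}: #{chunk}" }
```

With the values from the configuration above (1000 and 200), each window in this model would begin 800 characters after the previous one.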
214
+
215
+ ### HTML Conversion Options
216
+
217
+ ```ruby
218
+ html_options = Kreuzberg::Config::HtmlOptions.new(
219
+ heading_style: :atx_closed,
220
+ wrap: true,
221
+ wrap_width: 100,
222
+ preprocessing: { enabled: true, preset: :standard }
223
+ )
224
+
225
+ config = Kreuzberg::Config::Extraction.new(html_options: html_options)
226
+ result = Kreuzberg.extract_file_sync("page.html", config: config)
227
+ ```
228
+
229
+ ### Keyword Extraction
230
+
231
+ ```ruby
232
+ keywords = Kreuzberg::Config::Keywords.new(
233
+ algorithm: :yake,
234
+ max_keywords: 8,
235
+ min_score: 0.2,
236
+ ngram_range: [1, 3]
237
+ )
238
+
239
+ config = Kreuzberg::Config::Extraction.new(keywords: keywords)
240
+ result = Kreuzberg.extract_file_sync("research.pdf", config: config)
241
+ ```
242
+
243
+ ### Language Detection
244
+
245
+ ```ruby
246
+ lang_detection = Kreuzberg::Config::LanguageDetection.new(
247
+ enabled: true,
248
+ min_confidence: 0.8,
249
+ detect_multiple: true
250
+ )
251
+
252
+ config = Kreuzberg::Config::Extraction.new(language_detection: lang_detection)
253
+ result = Kreuzberg.extract_file_sync("multilingual.pdf", config: config)
254
+
255
+ result.detected_languages&.each do |lang|
256
+ puts "Language: #{lang.lang}, Confidence: #{lang.confidence}"
257
+ end
258
+ ```
259
+
260
+ ### PDF Options
261
+
262
+ ```ruby
263
+ pdf_options = Kreuzberg::Config::PDF.new(
264
+ extract_images: true,
265
+ image_min_size: 10000, # Minimum image size in bytes
266
+ password: "secret" # PDF password
267
+ )
268
+
269
+ config = Kreuzberg::Config::Extraction.new(pdf_options: pdf_options)
270
+ ```
271
+
272
+ ## Working with Results
273
+
274
+ ```ruby
275
+ result = Kreuzberg.extract_file_sync("invoice.pdf")
276
+
277
+ # Access extracted text
278
+ puts result.content
279
+
280
+ # Access MIME type
281
+ puts result.mime_type
282
+
283
+ # Access metadata
284
+ puts result.metadata.inspect
285
+
286
+ # Access extracted tables
287
+ result.tables.each do |table|
288
+ puts "Headers: #{table.headers.join(', ')}"
289
+ table.rows.each do |row|
290
+ puts row.join(', ')
291
+ end
292
+ end
293
+
294
+ # Access text chunks and metadata
295
+ result.chunks&.each do |chunk|
296
+ puts "Chunk #{chunk.chunk_index + 1}/#{chunk.total_chunks}"
297
+ puts "Chars: #{chunk.char_start}-#{chunk.char_end}"
298
+ puts "Embedding length: #{chunk.embedding&.length}"
299
+ end
300
+
301
+ # Access extracted images
302
+ result.images&.each do |image|
303
+ File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
304
+ puts "Image #{image.image_index} on page #{image.page_number}"
305
+ end
306
+
307
+ # Convert to hash
308
+ hash = result.to_h
309
+
310
+ # Convert to JSON
311
+ json = result.to_json
312
+ ```
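File names such as the image paths above are built with Ruby string interpolation. As a small aside, the sketch below shows that a backslash before `#{` switches interpolation off, so the placeholder is written out literally instead of being replaced. The struct is a stand-in for an extracted image and is an assumption for illustration only; it is not a Kreuzberg class.

```ruby
# Stand-in for an extracted image (hypothetical, for illustration only).
image_struct = Struct.new(:image_index, :format)
image = image_struct.new(3, "png")

# Normal interpolation: both placeholders are evaluated.
interpolated = "image-#{image.image_index}.#{image.format}"

# Escaped "#": the first placeholder is kept as literal text.
escaped = "image-\#{image.image_index}.#{image.format}"

puts interpolated
puts escaped
```

If every image should land in its own file, the unescaped form is the one to use; the escaped form would give all images with the same format the same file name.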
313
+
314
+ ## CLI Usage
315
+
316
+ Kreuzberg provides a Ruby wrapper for the CLI:
317
+
318
+ ```ruby
319
+ # Extract content
320
+ output = Kreuzberg::CLI.extract("document.pdf", output: "text")
321
+
322
+ # Detect MIME type
323
+ mime_type = Kreuzberg::CLI.detect("document.pdf")
324
+
325
+ # Get version
326
+ version = Kreuzberg::CLI.version
327
+ ```
328
+
329
+ ## API Server
330
+
331
+ Start an API server (requires kreuzberg CLI):
332
+
333
+ ```ruby
334
+ Kreuzberg::APIProxy.run(port: 8000) do |server|
335
+ # Server runs in background
336
+ # Make HTTP requests to http://localhost:8000
337
+ end
338
+ ```
339
+
340
+ ## MCP Server
341
+
342
+ Start a Model Context Protocol server for Claude Desktop:
343
+
344
+ ```ruby
345
+ server = Kreuzberg::MCPProxy::Server.new(transport: 'stdio')
346
+ server.start
347
+
348
+ # Use with Claude Desktop integration
349
+ ```
350
+
351
+ ## Cache Management
352
+
353
+ ```ruby
354
+ # Get cache statistics
355
+ stats = Kreuzberg.cache_stats
356
+ puts "Entries: #{stats[:total_entries]}"
357
+ puts "Size: #{stats[:total_size_bytes]} bytes"
358
+
359
+ # Clear cache
360
+ Kreuzberg.clear_cache
361
+ ```
362
+
363
+ ## Error Handling
364
+
365
+ ```ruby
366
+ begin
367
+ result = Kreuzberg.extract_file_sync("document.pdf")
368
+ rescue Kreuzberg::Errors::ParsingError => e
369
+ puts "Parsing failed: #{e.message}"
370
+ puts "Context: #{e.context}"
371
+ rescue Kreuzberg::Errors::OCRError => e
372
+ puts "OCR failed: #{e.message}"
373
+ rescue Kreuzberg::Errors::MissingDependencyError => e
374
+ puts "Missing dependency: #{e.dependency}"
375
+ rescue Kreuzberg::Errors::Error => e
376
+ puts "Kreuzberg error: #{e.message}"
377
+ end
378
+ ```
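The order of the `rescue` clauses above matters: Ruby tries them from top to bottom and runs the first one that matches, so specific error classes need to come before the shared base class. The sketch below demonstrates this with a hypothetical three-class hierarchy built from anonymous classes; these are stand-ins that only mirror the shape of the example and are not the Kreuzberg error classes themselves.

```ruby
# Hypothetical hierarchy: two specific errors inheriting from one base error.
base_error    = Class.new(StandardError)
parsing_error = Class.new(base_error)
ocr_error     = Class.new(base_error)

# Raises the given class and reports which rescue clause handled it.
classify = lambda do |error_class|
  begin
    raise error_class, "boom"
  rescue parsing_error
    :parsing
  rescue ocr_error
    :ocr
  rescue base_error
    # Catch-all for anything else in the hierarchy; must come last.
    :generic
  end
end

puts classify.call(parsing_error)
puts classify.call(ocr_error)
puts classify.call(base_error)
```

Placing the base class first would route every error in the hierarchy to that clause, and the more specific handlers would never run.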
379
+
380
+ ## Supported Formats
381
+
382
+ - **Documents**: PDF, DOCX, DOC, PPTX, PPT, ODT, ODP
383
+ - **Spreadsheets**: XLSX, XLS, ODS, CSV
384
+ - **Images**: PNG, JPEG, TIFF, BMP, GIF
385
+ - **Web**: HTML, MHTML, Markdown
386
+ - **Data**: JSON, YAML, TOML, XML
387
+ - **Email**: EML, MSG
388
+ - **Archives**: ZIP, TAR, 7Z
389
+ - **Text**: TXT, RTF, MD
390
+
391
+ ## Performance
392
+
393
+ Kreuzberg's Rust core provides significant performance improvements:
394
+
395
+ - **PDF extraction**: 10-50x faster than pure Ruby solutions
396
+ - **Batch processing**: Parallel extraction with Tokio async runtime
397
+ - **Memory efficient**: Streaming parsers for large files
398
+ - **Caching**: Automatic result caching for repeated extractions
399
+
400
+ ## Development
401
+
402
+ ```bash
403
+ # Clone the repository
404
+ git clone https://github.com/kreuzberg-dev/kreuzberg.git
405
+ cd kreuzberg/packages/ruby
406
+
407
+ # Install dependencies
408
+ bundle install
409
+
410
+ # Set up vendor symlink for local development (required for building)
411
+ ln -sfn ../../crates/kreuzberg vendor/kreuzberg
412
+
413
+ # Build the Rust extension
414
+ bundle exec rake compile
415
+
416
+ # Run tests
417
+ bundle exec rspec
418
+
419
+ # Run RuboCop
420
+ bundle exec rubocop
421
+ ```
422
+
423
+ **Note**: The Ruby bindings use a vendored copy of the core `kreuzberg` Rust crate. For local development, create a symlink at `vendor/kreuzberg` pointing to `../../crates/kreuzberg`. In CI and gem packaging, the actual vendored files are copied to this location.
424
+
425
+ ## PDFium Integration
426
+
427
+ PDF extraction is powered by PDFium, which is bundled automatically with this package. No system installation is required.
428
+
429
+ ### Platform Support
430
+
431
+ | Platform | Status | Notes |
432
+ |----------|--------|-------|
433
+ | Linux x86_64 | ✅ | Bundled |
434
+ | macOS ARM64 | ✅ | Bundled |
435
+ | macOS x86_64 | ✅ | Bundled |
436
+ | Windows x86_64 | ✅ | Bundled |
437
+
438
+ ### Binary Size Impact
439
+
440
+ PDFium adds approximately 8-15 MB to the package size depending on platform. This ensures consistent PDF extraction across all environments without external dependencies.
441
+
442
+ ## License
443
+
444
+ MIT License. See [LICENSE](../../LICENSE) for details.
445
+
446
+ ## Contributing
447
+
448
+ Contributions are welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.
449
+
450
+ ## Links
451
+
452
+ - **Documentation**: https://kreuzberg.dev
453
+ - **GitHub**: https://github.com/kreuzberg-dev/kreuzberg
454
+ - **Issues**: https://github.com/kreuzberg-dev/kreuzberg/issues