kreuzberg 4.0.0.pre.rc.13 → 4.0.0.pre.rc.14
- checksums.yaml +4 -4
- data/.gitignore +14 -14
- data/.rspec +3 -3
- data/.rubocop.yaml +1 -1
- data/.rubocop.yml +538 -538
- data/Gemfile +8 -8
- data/Gemfile.lock +105 -2
- data/README.md +454 -454
- data/Rakefile +33 -25
- data/Steepfile +47 -47
- data/examples/async_patterns.rb +341 -341
- data/ext/kreuzberg_rb/extconf.rb +45 -45
- data/ext/kreuzberg_rb/native/.cargo/config.toml +2 -2
- data/ext/kreuzberg_rb/native/Cargo.lock +6940 -6941
- data/ext/kreuzberg_rb/native/Cargo.toml +54 -54
- data/ext/kreuzberg_rb/native/README.md +425 -425
- data/ext/kreuzberg_rb/native/build.rs +15 -15
- data/ext/kreuzberg_rb/native/include/ieeefp.h +11 -11
- data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +14 -14
- data/ext/kreuzberg_rb/native/include/strings.h +20 -20
- data/ext/kreuzberg_rb/native/include/unistd.h +47 -47
- data/ext/kreuzberg_rb/native/src/lib.rs +3158 -3158
- data/extconf.rb +28 -28
- data/kreuzberg.gemspec +214 -214
- data/lib/kreuzberg/api_proxy.rb +142 -142
- data/lib/kreuzberg/cache_api.rb +81 -81
- data/lib/kreuzberg/cli.rb +55 -55
- data/lib/kreuzberg/cli_proxy.rb +127 -127
- data/lib/kreuzberg/config.rb +724 -724
- data/lib/kreuzberg/error_context.rb +80 -80
- data/lib/kreuzberg/errors.rb +118 -118
- data/lib/kreuzberg/extraction_api.rb +340 -340
- data/lib/kreuzberg/mcp_proxy.rb +186 -186
- data/lib/kreuzberg/ocr_backend_protocol.rb +113 -113
- data/lib/kreuzberg/post_processor_protocol.rb +86 -86
- data/lib/kreuzberg/result.rb +279 -279
- data/lib/kreuzberg/setup_lib_path.rb +80 -80
- data/lib/kreuzberg/validator_protocol.rb +89 -89
- data/lib/kreuzberg/version.rb +5 -5
- data/lib/kreuzberg.rb +109 -109
- data/lib/{pdfium.dll → libpdfium.dylib} +0 -0
- data/sig/kreuzberg/internal.rbs +184 -184
- data/sig/kreuzberg.rbs +546 -546
- data/spec/binding/cache_spec.rb +227 -227
- data/spec/binding/cli_proxy_spec.rb +85 -85
- data/spec/binding/cli_spec.rb +55 -55
- data/spec/binding/config_spec.rb +345 -345
- data/spec/binding/config_validation_spec.rb +283 -283
- data/spec/binding/error_handling_spec.rb +213 -213
- data/spec/binding/errors_spec.rb +66 -66
- data/spec/binding/plugins/ocr_backend_spec.rb +307 -307
- data/spec/binding/plugins/postprocessor_spec.rb +269 -269
- data/spec/binding/plugins/validator_spec.rb +274 -274
- data/spec/fixtures/config.toml +39 -39
- data/spec/fixtures/config.yaml +41 -41
- data/spec/fixtures/invalid_config.toml +4 -4
- data/spec/smoke/package_spec.rb +178 -178
- data/spec/spec_helper.rb +42 -42
- data/vendor/Cargo.toml +1 -1
- data/vendor/kreuzberg/Cargo.toml +5 -5
- data/vendor/kreuzberg/README.md +230 -230
- data/vendor/kreuzberg/benches/otel_overhead.rs +48 -48
- data/vendor/kreuzberg/build.rs +843 -843
- data/vendor/kreuzberg/src/api/error.rs +81 -81
- data/vendor/kreuzberg/src/api/handlers.rs +199 -199
- data/vendor/kreuzberg/src/api/mod.rs +79 -79
- data/vendor/kreuzberg/src/api/server.rs +353 -353
- data/vendor/kreuzberg/src/api/types.rs +170 -170
- data/vendor/kreuzberg/src/cache/mod.rs +1167 -1167
- data/vendor/kreuzberg/src/chunking/mod.rs +1877 -1877
- data/vendor/kreuzberg/src/chunking/processor.rs +220 -220
- data/vendor/kreuzberg/src/core/batch_mode.rs +95 -95
- data/vendor/kreuzberg/src/core/config.rs +1080 -1080
- data/vendor/kreuzberg/src/core/extractor.rs +1156 -1156
- data/vendor/kreuzberg/src/core/io.rs +329 -329
- data/vendor/kreuzberg/src/core/mime.rs +605 -605
- data/vendor/kreuzberg/src/core/mod.rs +47 -47
- data/vendor/kreuzberg/src/core/pipeline.rs +1184 -1184
- data/vendor/kreuzberg/src/embeddings.rs +500 -500
- data/vendor/kreuzberg/src/error.rs +431 -431
- data/vendor/kreuzberg/src/extraction/archive.rs +954 -954
- data/vendor/kreuzberg/src/extraction/docx.rs +398 -398
- data/vendor/kreuzberg/src/extraction/email.rs +854 -854
- data/vendor/kreuzberg/src/extraction/excel.rs +688 -688
- data/vendor/kreuzberg/src/extraction/html.rs +601 -601
- data/vendor/kreuzberg/src/extraction/image.rs +491 -491
- data/vendor/kreuzberg/src/extraction/libreoffice.rs +574 -574
- data/vendor/kreuzberg/src/extraction/markdown.rs +213 -213
- data/vendor/kreuzberg/src/extraction/mod.rs +81 -81
- data/vendor/kreuzberg/src/extraction/office_metadata/app_properties.rs +398 -398
- data/vendor/kreuzberg/src/extraction/office_metadata/core_properties.rs +247 -247
- data/vendor/kreuzberg/src/extraction/office_metadata/custom_properties.rs +240 -240
- data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +130 -130
- data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +284 -284
- data/vendor/kreuzberg/src/extraction/pptx.rs +3100 -3100
- data/vendor/kreuzberg/src/extraction/structured.rs +490 -490
- data/vendor/kreuzberg/src/extraction/table.rs +328 -328
- data/vendor/kreuzberg/src/extraction/text.rs +269 -269
- data/vendor/kreuzberg/src/extraction/xml.rs +333 -333
- data/vendor/kreuzberg/src/extractors/archive.rs +447 -447
- data/vendor/kreuzberg/src/extractors/bibtex.rs +470 -470
- data/vendor/kreuzberg/src/extractors/docbook.rs +504 -504
- data/vendor/kreuzberg/src/extractors/docx.rs +400 -400
- data/vendor/kreuzberg/src/extractors/email.rs +157 -157
- data/vendor/kreuzberg/src/extractors/epub.rs +708 -708
- data/vendor/kreuzberg/src/extractors/excel.rs +345 -345
- data/vendor/kreuzberg/src/extractors/fictionbook.rs +492 -492
- data/vendor/kreuzberg/src/extractors/html.rs +407 -407
- data/vendor/kreuzberg/src/extractors/image.rs +219 -219
- data/vendor/kreuzberg/src/extractors/jats.rs +1054 -1054
- data/vendor/kreuzberg/src/extractors/jupyter.rs +368 -368
- data/vendor/kreuzberg/src/extractors/latex.rs +653 -653
- data/vendor/kreuzberg/src/extractors/markdown.rs +701 -701
- data/vendor/kreuzberg/src/extractors/mod.rs +429 -429
- data/vendor/kreuzberg/src/extractors/odt.rs +628 -628
- data/vendor/kreuzberg/src/extractors/opml.rs +635 -635
- data/vendor/kreuzberg/src/extractors/orgmode.rs +529 -529
- data/vendor/kreuzberg/src/extractors/pdf.rs +749 -749
- data/vendor/kreuzberg/src/extractors/pptx.rs +267 -267
- data/vendor/kreuzberg/src/extractors/rst.rs +577 -577
- data/vendor/kreuzberg/src/extractors/rtf.rs +809 -809
- data/vendor/kreuzberg/src/extractors/security.rs +484 -484
- data/vendor/kreuzberg/src/extractors/security_tests.rs +367 -367
- data/vendor/kreuzberg/src/extractors/structured.rs +142 -142
- data/vendor/kreuzberg/src/extractors/text.rs +265 -265
- data/vendor/kreuzberg/src/extractors/typst.rs +651 -651
- data/vendor/kreuzberg/src/extractors/xml.rs +147 -147
- data/vendor/kreuzberg/src/image/dpi.rs +164 -164
- data/vendor/kreuzberg/src/image/mod.rs +6 -6
- data/vendor/kreuzberg/src/image/preprocessing.rs +417 -417
- data/vendor/kreuzberg/src/image/resize.rs +89 -89
- data/vendor/kreuzberg/src/keywords/config.rs +154 -154
- data/vendor/kreuzberg/src/keywords/mod.rs +237 -237
- data/vendor/kreuzberg/src/keywords/processor.rs +275 -275
- data/vendor/kreuzberg/src/keywords/rake.rs +293 -293
- data/vendor/kreuzberg/src/keywords/types.rs +68 -68
- data/vendor/kreuzberg/src/keywords/yake.rs +163 -163
- data/vendor/kreuzberg/src/language_detection/mod.rs +985 -985
- data/vendor/kreuzberg/src/language_detection/processor.rs +219 -219
- data/vendor/kreuzberg/src/lib.rs +113 -113
- data/vendor/kreuzberg/src/mcp/mod.rs +35 -35
- data/vendor/kreuzberg/src/mcp/server.rs +2076 -2076
- data/vendor/kreuzberg/src/ocr/cache.rs +469 -469
- data/vendor/kreuzberg/src/ocr/error.rs +37 -37
- data/vendor/kreuzberg/src/ocr/hocr.rs +216 -216
- data/vendor/kreuzberg/src/ocr/mod.rs +58 -58
- data/vendor/kreuzberg/src/ocr/processor.rs +863 -863
- data/vendor/kreuzberg/src/ocr/table/mod.rs +4 -4
- data/vendor/kreuzberg/src/ocr/table/tsv_parser.rs +144 -144
- data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +452 -452
- data/vendor/kreuzberg/src/ocr/types.rs +393 -393
- data/vendor/kreuzberg/src/ocr/utils.rs +47 -47
- data/vendor/kreuzberg/src/ocr/validation.rs +206 -206
- data/vendor/kreuzberg/src/panic_context.rs +154 -154
- data/vendor/kreuzberg/src/pdf/bindings.rs +44 -44
- data/vendor/kreuzberg/src/pdf/bundled.rs +346 -346
- data/vendor/kreuzberg/src/pdf/error.rs +130 -130
- data/vendor/kreuzberg/src/pdf/images.rs +139 -139
- data/vendor/kreuzberg/src/pdf/metadata.rs +489 -489
- data/vendor/kreuzberg/src/pdf/mod.rs +68 -68
- data/vendor/kreuzberg/src/pdf/rendering.rs +368 -368
- data/vendor/kreuzberg/src/pdf/table.rs +420 -420
- data/vendor/kreuzberg/src/pdf/text.rs +240 -240
- data/vendor/kreuzberg/src/plugins/extractor.rs +1044 -1044
- data/vendor/kreuzberg/src/plugins/mod.rs +212 -212
- data/vendor/kreuzberg/src/plugins/ocr.rs +639 -639
- data/vendor/kreuzberg/src/plugins/processor.rs +650 -650
- data/vendor/kreuzberg/src/plugins/registry.rs +1339 -1339
- data/vendor/kreuzberg/src/plugins/traits.rs +258 -258
- data/vendor/kreuzberg/src/plugins/validator.rs +967 -967
- data/vendor/kreuzberg/src/stopwords/mod.rs +1470 -1470
- data/vendor/kreuzberg/src/text/mod.rs +25 -25
- data/vendor/kreuzberg/src/text/quality.rs +697 -697
- data/vendor/kreuzberg/src/text/quality_processor.rs +219 -219
- data/vendor/kreuzberg/src/text/string_utils.rs +217 -217
- data/vendor/kreuzberg/src/text/token_reduction/cjk_utils.rs +164 -164
- data/vendor/kreuzberg/src/text/token_reduction/config.rs +100 -100
- data/vendor/kreuzberg/src/text/token_reduction/core.rs +796 -796
- data/vendor/kreuzberg/src/text/token_reduction/filters.rs +902 -902
- data/vendor/kreuzberg/src/text/token_reduction/mod.rs +160 -160
- data/vendor/kreuzberg/src/text/token_reduction/semantic.rs +619 -619
- data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +147 -147
- data/vendor/kreuzberg/src/types.rs +1055 -1055
- data/vendor/kreuzberg/src/utils/mod.rs +17 -17
- data/vendor/kreuzberg/src/utils/quality.rs +959 -959
- data/vendor/kreuzberg/src/utils/string_utils.rs +381 -381
- data/vendor/kreuzberg/stopwords/af_stopwords.json +53 -53
- data/vendor/kreuzberg/stopwords/ar_stopwords.json +482 -482
- data/vendor/kreuzberg/stopwords/bg_stopwords.json +261 -261
- data/vendor/kreuzberg/stopwords/bn_stopwords.json +400 -400
- data/vendor/kreuzberg/stopwords/br_stopwords.json +1205 -1205
- data/vendor/kreuzberg/stopwords/ca_stopwords.json +280 -280
- data/vendor/kreuzberg/stopwords/cs_stopwords.json +425 -425
- data/vendor/kreuzberg/stopwords/da_stopwords.json +172 -172
- data/vendor/kreuzberg/stopwords/de_stopwords.json +622 -622
- data/vendor/kreuzberg/stopwords/el_stopwords.json +849 -849
- data/vendor/kreuzberg/stopwords/en_stopwords.json +1300 -1300
- data/vendor/kreuzberg/stopwords/eo_stopwords.json +175 -175
- data/vendor/kreuzberg/stopwords/es_stopwords.json +734 -734
- data/vendor/kreuzberg/stopwords/et_stopwords.json +37 -37
- data/vendor/kreuzberg/stopwords/eu_stopwords.json +100 -100
- data/vendor/kreuzberg/stopwords/fa_stopwords.json +801 -801
- data/vendor/kreuzberg/stopwords/fi_stopwords.json +849 -849
- data/vendor/kreuzberg/stopwords/fr_stopwords.json +693 -693
- data/vendor/kreuzberg/stopwords/ga_stopwords.json +111 -111
- data/vendor/kreuzberg/stopwords/gl_stopwords.json +162 -162
- data/vendor/kreuzberg/stopwords/gu_stopwords.json +226 -226
- data/vendor/kreuzberg/stopwords/ha_stopwords.json +41 -41
- data/vendor/kreuzberg/stopwords/he_stopwords.json +196 -196
- data/vendor/kreuzberg/stopwords/hi_stopwords.json +227 -227
- data/vendor/kreuzberg/stopwords/hr_stopwords.json +181 -181
- data/vendor/kreuzberg/stopwords/hu_stopwords.json +791 -791
- data/vendor/kreuzberg/stopwords/hy_stopwords.json +47 -47
- data/vendor/kreuzberg/stopwords/id_stopwords.json +760 -760
- data/vendor/kreuzberg/stopwords/it_stopwords.json +634 -634
- data/vendor/kreuzberg/stopwords/ja_stopwords.json +136 -136
- data/vendor/kreuzberg/stopwords/kn_stopwords.json +84 -84
- data/vendor/kreuzberg/stopwords/ko_stopwords.json +681 -681
- data/vendor/kreuzberg/stopwords/ku_stopwords.json +64 -64
- data/vendor/kreuzberg/stopwords/la_stopwords.json +51 -51
- data/vendor/kreuzberg/stopwords/lt_stopwords.json +476 -476
- data/vendor/kreuzberg/stopwords/lv_stopwords.json +163 -163
- data/vendor/kreuzberg/stopwords/ml_stopwords.json +1 -1
- data/vendor/kreuzberg/stopwords/mr_stopwords.json +101 -101
- data/vendor/kreuzberg/stopwords/ms_stopwords.json +477 -477
- data/vendor/kreuzberg/stopwords/ne_stopwords.json +490 -490
- data/vendor/kreuzberg/stopwords/nl_stopwords.json +415 -415
- data/vendor/kreuzberg/stopwords/no_stopwords.json +223 -223
- data/vendor/kreuzberg/stopwords/pl_stopwords.json +331 -331
- data/vendor/kreuzberg/stopwords/pt_stopwords.json +562 -562
- data/vendor/kreuzberg/stopwords/ro_stopwords.json +436 -436
- data/vendor/kreuzberg/stopwords/ru_stopwords.json +561 -561
- data/vendor/kreuzberg/stopwords/si_stopwords.json +193 -193
- data/vendor/kreuzberg/stopwords/sk_stopwords.json +420 -420
- data/vendor/kreuzberg/stopwords/sl_stopwords.json +448 -448
- data/vendor/kreuzberg/stopwords/so_stopwords.json +32 -32
- data/vendor/kreuzberg/stopwords/st_stopwords.json +33 -33
- data/vendor/kreuzberg/stopwords/sv_stopwords.json +420 -420
- data/vendor/kreuzberg/stopwords/sw_stopwords.json +76 -76
- data/vendor/kreuzberg/stopwords/ta_stopwords.json +129 -129
- data/vendor/kreuzberg/stopwords/te_stopwords.json +54 -54
- data/vendor/kreuzberg/stopwords/th_stopwords.json +118 -118
- data/vendor/kreuzberg/stopwords/tl_stopwords.json +149 -149
- data/vendor/kreuzberg/stopwords/tr_stopwords.json +506 -506
- data/vendor/kreuzberg/stopwords/uk_stopwords.json +75 -75
- data/vendor/kreuzberg/stopwords/ur_stopwords.json +519 -519
- data/vendor/kreuzberg/stopwords/vi_stopwords.json +647 -647
- data/vendor/kreuzberg/stopwords/yo_stopwords.json +62 -62
- data/vendor/kreuzberg/stopwords/zh_stopwords.json +796 -796
- data/vendor/kreuzberg/stopwords/zu_stopwords.json +31 -31
- data/vendor/kreuzberg/tests/api_extract_multipart.rs +52 -52
- data/vendor/kreuzberg/tests/api_tests.rs +966 -966
- data/vendor/kreuzberg/tests/archive_integration.rs +545 -545
- data/vendor/kreuzberg/tests/batch_orchestration.rs +556 -556
- data/vendor/kreuzberg/tests/batch_processing.rs +318 -318
- data/vendor/kreuzberg/tests/bibtex_parity_test.rs +421 -421
- data/vendor/kreuzberg/tests/concurrency_stress.rs +533 -533
- data/vendor/kreuzberg/tests/config_features.rs +612 -612
- data/vendor/kreuzberg/tests/config_loading_tests.rs +416 -416
- data/vendor/kreuzberg/tests/core_integration.rs +510 -510
- data/vendor/kreuzberg/tests/csv_integration.rs +414 -414
- data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +500 -500
- data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +122 -122
- data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +370 -370
- data/vendor/kreuzberg/tests/email_integration.rs +327 -327
- data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +275 -275
- data/vendor/kreuzberg/tests/error_handling.rs +402 -402
- data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +228 -228
- data/vendor/kreuzberg/tests/format_integration.rs +164 -164
- data/vendor/kreuzberg/tests/helpers/mod.rs +142 -142
- data/vendor/kreuzberg/tests/html_table_test.rs +551 -551
- data/vendor/kreuzberg/tests/image_integration.rs +255 -255
- data/vendor/kreuzberg/tests/instrumentation_test.rs +139 -139
- data/vendor/kreuzberg/tests/jats_extractor_tests.rs +639 -639
- data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +704 -704
- data/vendor/kreuzberg/tests/keywords_integration.rs +479 -479
- data/vendor/kreuzberg/tests/keywords_quality.rs +509 -509
- data/vendor/kreuzberg/tests/latex_extractor_tests.rs +496 -496
- data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +490 -490
- data/vendor/kreuzberg/tests/mime_detection.rs +429 -429
- data/vendor/kreuzberg/tests/ocr_configuration.rs +514 -514
- data/vendor/kreuzberg/tests/ocr_errors.rs +698 -698
- data/vendor/kreuzberg/tests/ocr_quality.rs +629 -629
- data/vendor/kreuzberg/tests/ocr_stress.rs +469 -469
- data/vendor/kreuzberg/tests/odt_extractor_tests.rs +674 -674
- data/vendor/kreuzberg/tests/opml_extractor_tests.rs +616 -616
- data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +822 -822
- data/vendor/kreuzberg/tests/pdf_integration.rs +45 -45
- data/vendor/kreuzberg/tests/pdfium_linking.rs +374 -374
- data/vendor/kreuzberg/tests/pipeline_integration.rs +1436 -1436
- data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +776 -776
- data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +560 -560
- data/vendor/kreuzberg/tests/plugin_system.rs +927 -927
- data/vendor/kreuzberg/tests/plugin_validator_test.rs +783 -783
- data/vendor/kreuzberg/tests/registry_integration_tests.rs +587 -587
- data/vendor/kreuzberg/tests/rst_extractor_tests.rs +694 -694
- data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +775 -775
- data/vendor/kreuzberg/tests/security_validation.rs +416 -416
- data/vendor/kreuzberg/tests/stopwords_integration_test.rs +888 -888
- data/vendor/kreuzberg/tests/test_fastembed.rs +631 -631
- data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +1260 -1260
- data/vendor/kreuzberg/tests/typst_extractor_tests.rs +648 -648
- data/vendor/kreuzberg/tests/xlsx_metadata_extraction_test.rs +87 -87
- data/vendor/kreuzberg-ffi/Cargo.toml +1 -1
- data/vendor/kreuzberg-ffi/README.md +851 -851
- data/vendor/kreuzberg-ffi/build.rs +176 -176
- data/vendor/kreuzberg-ffi/cbindgen.toml +27 -27
- data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +12 -12
- data/vendor/kreuzberg-ffi/kreuzberg.h +1087 -1087
- data/vendor/kreuzberg-ffi/src/lib.rs +3616 -3616
- data/vendor/kreuzberg-ffi/src/panic_shield.rs +247 -247
- data/vendor/kreuzberg-ffi/tests.disabled/README.md +48 -48
- data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +299 -299
- data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +346 -346
- data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +232 -232
- data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +470 -470
- data/vendor/kreuzberg-tesseract/.commitlintrc.json +13 -13
- data/vendor/kreuzberg-tesseract/.crate-ignore +2 -2
- data/vendor/kreuzberg-tesseract/Cargo.lock +2933 -2933
- data/vendor/kreuzberg-tesseract/Cargo.toml +2 -2
- data/vendor/kreuzberg-tesseract/LICENSE +22 -22
- data/vendor/kreuzberg-tesseract/README.md +399 -399
- data/vendor/kreuzberg-tesseract/build.rs +1354 -1354
- data/vendor/kreuzberg-tesseract/patches/README.md +71 -71
- data/vendor/kreuzberg-tesseract/patches/tesseract.diff +199 -199
- data/vendor/kreuzberg-tesseract/src/api.rs +1371 -1371
- data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +77 -77
- data/vendor/kreuzberg-tesseract/src/enums.rs +297 -297
- data/vendor/kreuzberg-tesseract/src/error.rs +81 -81
- data/vendor/kreuzberg-tesseract/src/lib.rs +145 -145
- data/vendor/kreuzberg-tesseract/src/monitor.rs +57 -57
- data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +197 -197
- data/vendor/kreuzberg-tesseract/src/page_iterator.rs +253 -253
- data/vendor/kreuzberg-tesseract/src/result_iterator.rs +286 -286
- data/vendor/kreuzberg-tesseract/src/result_renderer.rs +183 -183
- data/vendor/kreuzberg-tesseract/tests/integration_test.rs +211 -211
- data/vendor/rb-sys/.cargo_vcs_info.json +5 -5
- data/vendor/rb-sys/Cargo.lock +393 -393
- data/vendor/rb-sys/Cargo.toml +70 -70
- data/vendor/rb-sys/Cargo.toml.orig +57 -57
- data/vendor/rb-sys/LICENSE-APACHE +190 -190
- data/vendor/rb-sys/LICENSE-MIT +21 -21
- data/vendor/rb-sys/build/features.rs +111 -111
- data/vendor/rb-sys/build/main.rs +286 -286
- data/vendor/rb-sys/build/stable_api_config.rs +155 -155
- data/vendor/rb-sys/build/version.rs +50 -50
- data/vendor/rb-sys/readme.md +36 -36
- data/vendor/rb-sys/src/bindings.rs +21 -21
- data/vendor/rb-sys/src/hidden.rs +11 -11
- data/vendor/rb-sys/src/lib.rs +35 -35
- data/vendor/rb-sys/src/macros.rs +371 -371
- data/vendor/rb-sys/src/memory.rs +53 -53
- data/vendor/rb-sys/src/ruby_abi_version.rs +38 -38
- data/vendor/rb-sys/src/special_consts.rs +31 -31
- data/vendor/rb-sys/src/stable_api/compiled.c +179 -179
- data/vendor/rb-sys/src/stable_api/compiled.rs +257 -257
- data/vendor/rb-sys/src/stable_api/ruby_2_7.rs +324 -324
- data/vendor/rb-sys/src/stable_api/ruby_3_0.rs +332 -332
- data/vendor/rb-sys/src/stable_api/ruby_3_1.rs +325 -325
- data/vendor/rb-sys/src/stable_api/ruby_3_2.rs +323 -323
- data/vendor/rb-sys/src/stable_api/ruby_3_3.rs +339 -339
- data/vendor/rb-sys/src/stable_api/ruby_3_4.rs +339 -339
- data/vendor/rb-sys/src/stable_api.rs +260 -260
- data/vendor/rb-sys/src/symbol.rs +31 -31
- data/vendor/rb-sys/src/tracking_allocator.rs +330 -330
- data/vendor/rb-sys/src/utils.rs +89 -89
- data/vendor/rb-sys/src/value_type.rs +7 -7
- metadata +73 -4
- data/vendor/kreuzberg-ffi/kreuzberg-ffi-install.pc +0 -12
@@ -1,399 +1,399 @@
-# kreuzberg-tesseract
-
-Rust bindings for Tesseract OCR with built-in compilation of Tesseract and Leptonica libraries. Provides a safe and idiomatic Rust interface to Tesseract's functionality while handling the complexity of compiling the underlying C++ libraries.
-
-Based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by Cafer Can Gündoğdu, this maintained version adds critical improvements for production use:
-
-- **C++17 Support**: Upgraded for Tesseract 5.5.1 which requires C++17 filesystem
-- **Cross-Compilation**: Fixed CXX compiler detection for cross-platform builds
-- **Architecture Validation**: Validates target architecture before using cached libraries
-- **Windows Static Linking**: Fixed MSVC static linking issues
-- **Build Caching**: Improved caching with OUT_DIR-based cache directory
-- **MinGW Support**: Added support for MinGW toolchains
-
-## Features
-
-- Safe Rust bindings for Tesseract OCR
-- **Multiple linking options:**
-  - **Static linking** (default): Built-in compilation with no runtime dependencies
-  - **Dynamic linking**: Link to system-installed libraries for faster builds
-- Uses existing Tesseract training data (expects English data for tests)
-- High-level Rust API for common OCR tasks
-- Caching of compiled libraries for faster subsequent builds
-- Support for multiple operating systems (Linux, macOS, Windows)
-
-## Installation
-
-### Static Linking (Default)
-
-Static linking builds Tesseract and Leptonica from source and embeds them in your binary. No runtime dependencies required:
-
-```toml
-[dependencies]
-kreuzberg-tesseract = "1.0.0-rc.1"
-# or explicitly:
-kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["static-linking"] }
-```
-
-### Dynamic Linking
-
-Dynamic linking uses system-installed Tesseract and Leptonica libraries. Faster builds, but requires libraries installed on the system:
-
-```toml
-[dependencies]
-kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["dynamic-linking"], default-features = false }
-```
-
-**System requirements for dynamic linking:**
-- Tesseract 5.x libraries installed (`libtesseract`, `libleptonica`)
-  - macOS: `brew install tesseract leptonica`
-  - Ubuntu/Debian: `sudo apt-get install libtesseract-dev libleptonica-dev`
-  - RHEL/CentOS/Fedora: `sudo dnf install tesseract-devel leptonica-devel`
-  - Windows: Install from [Tesseract releases](https://github.com/tesseract-ocr/tesseract/releases) or vcpkg
-
-### Development Dependencies
-
-For development and testing, you'll also need these dependencies:
-
-```toml
-[dev-dependencies]
-image = "0.25.5"
-```
-
-## System Requirements
-
-### For Static Linking (Default)
-
-When building with static linking, the crate will compile Tesseract and Leptonica from source. You need:
-
-- Rust 1.85.0 or later
-- A C++ compiler (e.g., gcc, clang, MSVC on Windows)
-- CMake 3.x or later
-- Internet connection (for downloading Tesseract source code)
-
-### For Dynamic Linking
-
-When using dynamic linking with system-installed libraries, you need:
-
-- Rust 1.85.0 or later
-- Tesseract 5.x and Leptonica libraries installed on your system (see Installation section)
-- Internet connection (for downloading Tesseract source code)
-
-No C++ compiler or CMake required for dynamic linking builds.
-
-For a full development environment checklist (including optional tooling suggestions), see [CONTRIBUTING.md](CONTRIBUTING.md).
-
-## Environment Variables
-
-The following environment variables affect the build and test process:
-
-### Build Variables
-
-- `CARGO_CLEAN`: If set, cleans the cache directory before building
-- `RUSTC_WRAPPER`: If set to "sccache", enables compiler caching with sccache
-- `CC`: Compiler selection for C code (affects Linux builds)
-- `HOME` (Unix) or `APPDATA` (Windows): Used to determine cache directory location
-- `TESSERACT_RS_CACHE_DIR`: Optional override for the cache root. When unset or not writable, the build falls back to the default OS-specific directory, and if that still fails, a temporary directory under the system temp folder is used automatically.
-
-### Test Variables
-
-- `TESSDATA_PREFIX` (Optional): Path to override the default tessdata directory. If not set, the crate will use its default cache directory.
-
-## Cache and Data Directories
-
-The crate uses the following directory structure based on your operating system:
-
-- macOS: `~/Library/Application Support/tesseract-rs`
-- Linux: `~/.tesseract-rs`
-- Windows: `%APPDATA%/tesseract-rs`
-
-The cache includes:
-
-- Compiled Tesseract and Leptonica libraries
-- Third-party source code
-
-Training data is not downloaded during the build. Provide `eng.traineddata` (and any other languages you need) via `TESSDATA_PREFIX` or your system Tesseract installation.
-
-## Testing
-
-The project includes several integration tests that verify OCR functionality. To run the tests:
-
-1. Ensure you have the required test dependencies:
-
-```toml
-[dev-dependencies]
-image = "0.25.9"
-```
-
-2. Run the tests:
-```bash
-cargo test
-```
-
-Note: Make sure `eng.traineddata` is available in your tessdata directory before running tests. If `TESSDATA_PREFIX` is not set, the tests look in the default cache location. You can point the tests at a custom tessdata directory by setting:
-
-```bash
-# Linux/macOS
-export TESSDATA_PREFIX=/path/to/custom/tessdata
-
-# Windows (PowerShell)
-$env:TESSDATA_PREFIX="C:\path\to\custom\tessdata"
-```
-
-Available test cases:
-
-- OCR on English sample images
-- Error handling and invalid input coverage
-
-Test images are sourced from the shared `test_documents/` directory in the repository:
-
-- `images/test_hello_world.png`: Simple English text
-- `tables/simple_table.png`: Basic table with English headers
-
-## Usage
-
-Here's a basic example of how to use `tesseract-rs`:
-
-```rust
-use std::path::PathBuf;
-use std::error::Error;
-use kreuzberg_tesseract::TesseractAPI;
-
-fn get_default_tessdata_dir() -> PathBuf {
-    if cfg!(target_os = "macos") {
-        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
-        PathBuf::from(home_dir)
-            .join("Library")
-            .join("Application Support")
-            .join("tesseract-rs")
-            .join("tessdata")
-    } else if cfg!(target_os = "linux") {
-        let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
-        PathBuf::from(home_dir)
-            .join(".tesseract-rs")
-            .join("tessdata")
-    } else if cfg!(target_os = "windows") {
-        PathBuf::from(std::env::var("APPDATA").expect("APPDATA environment variable not set"))
-            .join("tesseract-rs")
-            .join("tessdata")
-    } else {
-        panic!("Unsupported operating system");
-    }
-}
-
-fn get_tessdata_dir() -> PathBuf {
-    match std::env::var("TESSDATA_PREFIX") {
-        Ok(dir) => {
-            let path = PathBuf::from(dir);
-            println!("Using TESSDATA_PREFIX directory: {:?}", path);
-            path
-        }
-        Err(_) => {
-            let default_dir = get_default_tessdata_dir();
-            println!(
-                "TESSDATA_PREFIX not set, using default directory: {:?}",
-                default_dir
-            );
-            default_dir
-        }
-    }
-}
-
-fn main() -> Result<(), Box<dyn Error>> {
-    let api = TesseractAPI::new()?;
-
-    // Get tessdata directory (uses default location or TESSDATA_PREFIX if set)
-    let tessdata_dir = get_tessdata_dir();
-    api.init(tessdata_dir.to_str().unwrap(), "eng")?;
-
-    let width = 24;
-    let height = 24;
-    let bytes_per_pixel = 1;
-    let bytes_per_line = width * bytes_per_pixel;
-
-    // Initialize image data with all white pixels
-    let mut image_data = vec![255u8; width * height];
-
-    // Draw number 9 with clearer distinction
-    for y in 4..19 {
-        for x in 7..17 {
-            // Top bar
-            if y == 4 && x >= 8 && x <= 15 {
-                image_data[y * width + x] = 0;
-            }
-            // Top curve left side
-            if y >= 4 && y <= 10 && x == 7 {
-                image_data[y * width + x] = 0;
-            }
-            // Top curve right side
-            if y >= 4 && y <= 11 && x == 16 {
-                image_data[y * width + x] = 0;
-            }
-            // Middle bar
-            if y == 11 && x >= 8 && x <= 15 {
-                image_data[y * width + x] = 0;
-            }
-            // Bottom right vertical line
-            if y >= 11 && y <= 18 && x == 16 {
-                image_data[y * width + x] = 0;
-            }
-            // Bottom bar
-            if y == 18 && x >= 8 && x <= 15 {
-                image_data[y * width + x] = 0;
-            }
-        }
-    }
-
-    // Set the image data
-    api.set_image(
-        &image_data,
-        width.try_into().unwrap(),
-        height.try_into().unwrap(),
-        bytes_per_pixel.try_into().unwrap(),
-        bytes_per_line.try_into().unwrap(),
-    )?;
-
-    // Set whitelist for digits only
-    api.set_variable("tessedit_char_whitelist", "0123456789")?;
-
-    // Set PSM mode to single character
-    api.set_variable("tessedit_pageseg_mode", "10")?;
-
-    // Get the recognized text
-    let text = api.get_utf8_text()?;
-    println!("Recognized text: {}", text.trim());
-
-    Ok(())
-}
-```
-
|
270
|
-
## Advanced Usage
|
|
271
|
-
|
|
272
|
-
The API provides additional functionality for more complex OCR tasks, including thread-safe operations:
|
|
273
|
-
|
|
274
|
-
```rust
|
|
275
|
-
use kreuzberg_tesseract::TesseractAPI;
|
|
276
|
-
use std::sync::Arc;
|
|
277
|
-
use std::thread;
|
|
278
|
-
use std::{error::Error, path::PathBuf};
|
|
279
|
-
|
|
280
|
-
fn main() -> Result<(), Box<dyn Error>> {
|
|
281
|
-
let tessdata_dir = get_tessdata_dir();
|
|
282
|
-
let api = TesseractAPI::new()?;
|
|
283
|
-
|
|
284
|
-
// Initialize the main API
|
|
285
|
-
api.init(tessdata_dir.to_str().unwrap(), "eng")?;
|
|
286
|
-
api.set_variable("tessedit_pageseg_mode", "1")?;
|
|
287
|
-
|
|
288
|
-
// Load and prepare image data
|
|
289
|
-
let (image_data, width, height) = load_test_image("sample_text.png")?;
|
|
290
|
-
|
|
291
|
-
// Share image data across threads
|
|
292
|
-
let image_data = Arc::new(image_data);
|
|
293
|
-
let mut handles = vec![];
|
|
294
|
-
|
|
295
|
-
// Spawn multiple threads for parallel OCR processing
|
|
296
|
-
for _ in 0..3 {
|
|
297
|
-
let api_clone = api.clone(); // Clones the API with all configurations
|
|
298
|
-
let image_data = Arc::clone(&image_data);
|
|
299
|
-
|
|
300
|
-
let handle = thread::spawn(move || {
|
|
301
|
-
// Set image in each thread
|
|
302
|
-
let res = api_clone.set_image(
|
|
303
|
-
&image_data,
|
|
304
|
-
width as i32,
|
|
305
|
-
height as i32,
|
|
306
|
-
3,
|
|
307
|
-
3 * width as i32,
|
|
308
|
-
);
|
|
309
|
-
assert!(res.is_ok());
|
|
310
|
-
|
|
311
|
-
// Perform OCR in parallel
|
|
312
|
-
let text = api_clone.get_utf8_text()
|
|
313
|
-
.expect("Failed to get text");
|
|
314
|
-
println!("Thread result: {}", text);
|
|
315
|
-
});
|
|
316
|
-
handles.push(handle);
|
|
317
|
-
}
|
|
318
|
-
|
|
319
|
-
// Wait for all threads to complete
|
|
320
|
-
for handle in handles {
|
|
321
|
-
handle.join().unwrap();
|
|
322
|
-
}
|
|
323
|
-
|
|
324
|
-
Ok(())
|
|
325
|
-
}
|
|
326
|
-
|
|
327
|
-
// Helper function to get tessdata directory
|
|
328
|
-
fn get_tessdata_dir() -> PathBuf {
|
|
329
|
-
// ... (implementation as shown in basic example)
|
|
330
|
-
}
|
|
331
|
-
|
|
332
|
-
// Helper function to load test image
|
|
333
|
-
fn load_test_image(filename: &str) -> Result<(Vec<u8>, u32, u32), Box<dyn Error>> {
|
|
334
|
-
let img = image::open(filename)?
|
|
335
|
-
.to_rgb8();
|
|
336
|
-
let (width, height) = img.dimensions();
|
|
337
|
-
Ok((img.into_raw(), width, height))
|
|
338
|
-
}
|
|
339
|
-
```
|
|
340
|
-
|
|
341
|
-
## Building
|
|
342
|
-
|
|
343
|
-
### Static Linking (Default)
|
|
344
|
-
|
|
345
|
-
With static linking, the crate will automatically download and compile Tesseract and Leptonica during the build process. This may take some time on the first build (5-10 minutes), but subsequent builds will use the cached libraries.
|
|
346
|
-
|
|
347
|
-
To clean the cache and force a rebuild:
|
|
348
|
-
|
|
349
|
-
```bash
|
|
350
|
-
CARGO_CLEAN=1 cargo build
|
|
351
|
-
```
|
|
352
|
-
|
|
353
|
-
### Dynamic Linking
|
|
354
|
-
|
|
355
|
-
With dynamic linking, the build is much faster (seconds instead of minutes) since it only links against system-installed libraries:
|
|
356
|
-
|
|
357
|
-
```bash
|
|
358
|
-
cargo build --no-default-features --features dynamic-linking
|
|
359
|
-
```
|
|
360
|
-
|
|
361
|
-
**Note**: Dynamic linking requires Tesseract and Leptonica to be installed on your system (see Installation section).
|
|
362
|
-
|
|
363
|
-
## Documentation
|
|
364
|
-
|
|
365
|
-
For more detailed information, please check the [API documentation](https://docs.rs/kreuzberg-tesseract).
|
|
366
|
-
|
|
367
|
-
## License
|
|
368
|
-
|
|
369
|
-
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
370
|
-
|
|
371
|
-
## Acknowledgements
|
|
372
|
-
|
|
373
|
-
This project is based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by [Cafer Can Gündoğdu](https://github.com/cafercangundogdu). We are grateful for the foundational work that made this project possible.
|
|
374
|
-
|
|
375
|
-
## Contributing
|
|
376
|
-
|
|
377
|
-
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
|
|
378
|
-
|
|
379
|
-
### Quick Start for Contributors
|
|
380
|
-
|
|
381
|
-
1. Fork and clone the repository
|
|
382
|
-
2. Install uv and set up git hooks:
|
|
383
|
-
```bash
|
|
384
|
-
curl -LsSf https://astral.sh/uv/install.sh | sh
|
|
385
|
-
uvx prek install
|
|
386
|
-
```
|
|
387
|
-
3. Make your changes following our commit message format
|
|
388
|
-
4. Run tests: `cargo test`
|
|
389
|
-
5. Submit a Pull Request
|
|
390
|
-
|
|
391
|
-
Our commit messages follow the [Conventional Commits](https://www.conventionalcommits.org/) specification.
|
|
392
|
-
|
|
393
|
-
## Acknowledgements
|
|
394
|
-
|
|
395
|
-
This project uses [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) and [Leptonica](http://leptonica.org/). We are grateful to the maintainers and contributors of these projects.
|
|
396
|
-
|
|
397
|
-
```
|
|
398
|
-
|
|
399
|
-
```
|
|
1
|
+
# kreuzberg-tesseract
|
|
2
|
+
|
|
3
|
+
Rust bindings for Tesseract OCR with built-in compilation of Tesseract and Leptonica libraries. Provides a safe and idiomatic Rust interface to Tesseract's functionality while handling the complexity of compiling the underlying C++ libraries.
|
|
4
|
+
|
|
5
|
+
Based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by Cafer Can Gündoğdu, this maintained version adds critical improvements for production use:
|
|
6
|
+
|
|
7
|
+
- **C++17 Support**: Upgraded for Tesseract 5.5.1 which requires C++17 filesystem
|
|
8
|
+
- **Cross-Compilation**: Fixed CXX compiler detection for cross-platform builds
|
|
9
|
+
- **Architecture Validation**: Validates target architecture before using cached libraries
|
|
10
|
+
- **Windows Static Linking**: Fixed MSVC static linking issues
|
|
11
|
+
- **Build Caching**: Improved caching with OUT_DIR-based cache directory
|
|
12
|
+
- **MinGW Support**: Added support for MinGW toolchains
|
|
13
|
+
|
|
14
|
+
## Features
|
|
15
|
+
|
|
16
|
+
- Safe Rust bindings for Tesseract OCR
|
|
17
|
+
- **Multiple linking options:**
|
|
18
|
+
- **Static linking** (default): Built-in compilation with no runtime dependencies
|
|
19
|
+
- **Dynamic linking**: Link to system-installed libraries for faster builds
|
|
20
|
+
- Uses existing Tesseract training data (expects English data for tests)
|
|
21
|
+
- High-level Rust API for common OCR tasks
|
|
22
|
+
- Caching of compiled libraries for faster subsequent builds
|
|
23
|
+
- Support for multiple operating systems (Linux, macOS, Windows)
|
|
24
|
+
|
|
25
|
+
## Installation
|
|
26
|
+
|
|
27
|
+
### Static Linking (Default)
|
|
28
|
+
|
|
29
|
+
Static linking builds Tesseract and Leptonica from source and embeds them in your binary. No runtime dependencies required:
|
|
30
|
+
|
|
31
|
+
```toml
|
|
32
|
+
[dependencies]
|
|
33
|
+
kreuzberg-tesseract = "1.0.0-rc.1"
|
|
34
|
+
# or explicitly:
|
|
35
|
+
kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["static-linking"] }
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
### Dynamic Linking
|
|
39
|
+
|
|
40
|
+
Dynamic linking uses system-installed Tesseract and Leptonica libraries. Faster builds, but requires libraries installed on the system:
|
|
41
|
+
|
|
42
|
+
```toml
|
|
43
|
+
[dependencies]
|
|
44
|
+
kreuzberg-tesseract = { version = "1.0.0-rc.1", features = ["dynamic-linking"], default-features = false }
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
**System requirements for dynamic linking:**
|
|
48
|
+
- Tesseract 5.x libraries installed (`libtesseract`, `libleptonica`)
|
|
49
|
+
- macOS: `brew install tesseract leptonica`
|
|
50
|
+
- Ubuntu/Debian: `sudo apt-get install libtesseract-dev libleptonica-dev`
|
|
51
|
+
- RHEL/CentOS/Fedora: `sudo dnf install tesseract-devel leptonica-devel`
|
|
52
|
+
- Windows: Install from [Tesseract releases](https://github.com/tesseract-ocr/tesseract/releases) or vcpkg
|
|
53
|
+
|
|
54
|
+
### Development Dependencies
|
|
55
|
+
|
|
56
|
+
For development and testing, you'll also need these dependencies:
|
|
57
|
+
|
|
58
|
+
```toml
|
|
59
|
+
[dev-dependencies]
|
|
60
|
+
image = "0.25.5"
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
## System Requirements
|
|
64
|
+
|
|
65
|
+
### For Static Linking (Default)
|
|
66
|
+
|
|
67
|
+
When building with static linking, the crate will compile Tesseract and Leptonica from source. You need:
|
|
68
|
+
|
|
69
|
+
- Rust 1.85.0 or later
|
|
70
|
+
- A C++ compiler (e.g., gcc, clang, MSVC on Windows)
|
|
71
|
+
- CMake 3.x or later
|
|
72
|
+
- Internet connection (for downloading Tesseract source code)
|
|
73
|
+
|
|
74
|
+
### For Dynamic Linking
|
|
75
|
+
|
|
76
|
+
When using dynamic linking with system-installed libraries, you need:
|
|
77
|
+
|
|
78
|
+
- Rust 1.85.0 or later
|
|
79
|
+
- Tesseract 5.x and Leptonica libraries installed on your system (see Installation section)
|
|
80
|
+
- Internet connection (for downloading Tesseract source code)
|
|
81
|
+
|
|
82
|
+
No C++ compiler or CMake required for dynamic linking builds.
|
|
83
|
+
|
|
84
|
+
For a full development environment checklist (including optional tooling suggestions), see [CONTRIBUTING.md](CONTRIBUTING.md).
|
|
85
|
+
|
|
86
|
+
## Environment Variables
|
|
87
|
+
|
|
88
|
+
The following environment variables affect the build and test process:
|
|
89
|
+
|
|
90
|
+
### Build Variables
|
|
91
|
+
|
|
92
|
+
- `CARGO_CLEAN`: If set, cleans the cache directory before building
|
|
93
|
+
- `RUSTC_WRAPPER`: If set to "sccache", enables compiler caching with sccache
|
|
94
|
+
- `CC`: Compiler selection for C code (affects Linux builds)
|
|
95
|
+
- `HOME` (Unix) or `APPDATA` (Windows): Used to determine cache directory location
|
|
96
|
+
- `TESSERACT_RS_CACHE_DIR`: Optional override for the cache root. When unset or not writable, the build falls back to the default OS-specific directory, and if that still fails, a temporary directory under the system temp folder is used automatically.
|
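The fallback order described for `TESSERACT_RS_CACHE_DIR` (override, then the OS-specific default, then a folder under the system temp directory) can be sketched roughly as below. This is a std-only illustration and not the crate's actual build script: `resolve_cache_root`, `is_writable`, the probe file name, and the `tesseract-rs-cache` temp folder name are all hypothetical. A caller would typically pass `std::env::var_os("TESSERACT_RS_CACHE_DIR").map(PathBuf::from)` as the override.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical helper: a directory counts as writable if it can be
/// created and an empty probe file can be written inside it.
fn is_writable(dir: &Path) -> bool {
    if fs::create_dir_all(dir).is_err() {
        return false;
    }
    let probe = dir.join(".write_probe");
    match fs::write(&probe, b"") {
        Ok(()) => {
            // Best-effort cleanup; a leftover probe file is harmless.
            let _ = fs::remove_file(&probe);
            true
        }
        Err(_) => false,
    }
}

/// Hypothetical helper: try the override first, then the OS default,
/// and finally fall back to a folder under the system temp directory.
fn resolve_cache_root(override_dir: Option<PathBuf>, os_default: Option<PathBuf>) -> PathBuf {
    override_dir
        .into_iter()
        .chain(os_default)
        .find(|dir| is_writable(dir))
        // Assumed folder name; the real build may choose a different one.
        .unwrap_or_else(|| env::temp_dir().join("tesseract-rs-cache"))
}
```

Probing with a real write, rather than inspecting permissions, keeps the check portable across Linux, macOS, and Windows.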
|
97
|
+
|
|
98
|
+
### Test Variables
|
|
99
|
+
|
|
100
|
+
- `TESSDATA_PREFIX` (Optional): Path to override the default tessdata directory. If not set, the crate will use its default cache directory.
|
|
101
|
+
|
|
102
|
+
## Cache and Data Directories
|
|
103
|
+
|
|
104
|
+
The crate uses the following directory structure based on your operating system:
|
|
105
|
+
|
|
106
|
+
- macOS: `~/Library/Application Support/tesseract-rs`
|
|
107
|
+
- Linux: `~/.tesseract-rs`
|
|
108
|
+
- Windows: `%APPDATA%/tesseract-rs`
|
|
109
|
+
|
|
110
|
+
The cache includes:
|
|
111
|
+
|
|
112
|
+
- Compiled Tesseract and Leptonica libraries
|
|
113
|
+
- Third-party source code
|
|
114
|
+
|
|
115
|
+
Training data is not downloaded during the build. Provide `eng.traineddata` (and any other languages you need) via `TESSDATA_PREFIX` or your system Tesseract installation.
|
|
116
|
+
|
|
117
|
+
## Testing
|
|
118
|
+
|
|
119
|
+
The project includes several integration tests that verify OCR functionality. To run the tests:
|
|
120
|
+
|
|
121
|
+
1. Ensure you have the required test dependencies:
|
|
122
|
+
|
|
123
|
+
```toml
|
|
124
|
+
[dev-dependencies]
|
|
125
|
+
image = "0.25.9"
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
2. Run the tests:
|
|
129
|
+
```bash
|
|
130
|
+
cargo test
|
|
131
|
+
```
|
|
132
|
+
|
|
133
|
+
Note: Make sure `eng.traineddata` is available in your tessdata directory before running tests. If `TESSDATA_PREFIX` is not set, the tests look in the default cache location. You can point the tests at a custom tessdata directory by setting:
|
|
134
|
+
|
|
135
|
+
```bash
|
|
136
|
+
# Linux/macOS
|
|
137
|
+
export TESSDATA_PREFIX=/path/to/custom/tessdata
|
|
138
|
+
|
|
139
|
+
# Windows (PowerShell)
|
|
140
|
+
$env:TESSDATA_PREFIX="C:\path\to\custom\tessdata"
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
Available test cases:
|
|
144
|
+
|
|
145
|
+
- OCR on English sample images
|
|
146
|
+
- Error handling and invalid input coverage
|
|
147
|
+
|
|
148
|
+
Test images are sourced from the shared `test_documents/` directory in the repository:
|
|
149
|
+
|
|
150
|
+
- `images/test_hello_world.png`: Simple English text
|
|
151
|
+
- `tables/simple_table.png`: Basic table with English headers
|
|
152
|
+
|
|
153
|
+
## Usage
|
|
154
|
+
|
|
155
|
+
Here's a basic example of how to use `kreuzberg-tesseract`:
|
|
156
|
+
|
|
157
|
+
```rust
|
|
158
|
+
use std::path::PathBuf;
|
|
159
|
+
use std::error::Error;
|
|
160
|
+
use kreuzberg_tesseract::TesseractAPI;
|
|
161
|
+
|
|
162
|
+
fn get_default_tessdata_dir() -> PathBuf {
|
|
163
|
+
if cfg!(target_os = "macos") {
|
|
164
|
+
let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
|
|
165
|
+
PathBuf::from(home_dir)
|
|
166
|
+
.join("Library")
|
|
167
|
+
.join("Application Support")
|
|
168
|
+
.join("tesseract-rs")
|
|
169
|
+
.join("tessdata")
|
|
170
|
+
} else if cfg!(target_os = "linux") {
|
|
171
|
+
let home_dir = std::env::var("HOME").expect("HOME environment variable not set");
|
|
172
|
+
PathBuf::from(home_dir)
|
|
173
|
+
.join(".tesseract-rs")
|
|
174
|
+
.join("tessdata")
|
|
175
|
+
} else if cfg!(target_os = "windows") {
|
|
176
|
+
PathBuf::from(std::env::var("APPDATA").expect("APPDATA environment variable not set"))
|
|
177
|
+
.join("tesseract-rs")
|
|
178
|
+
.join("tessdata")
|
|
179
|
+
} else {
|
|
180
|
+
panic!("Unsupported operating system");
|
|
181
|
+
}
|
|
182
|
+
}
|
|
183
|
+
|
|
184
|
+
fn get_tessdata_dir() -> PathBuf {
|
|
185
|
+
match std::env::var("TESSDATA_PREFIX") {
|
|
186
|
+
Ok(dir) => {
|
|
187
|
+
let path = PathBuf::from(dir);
|
|
188
|
+
println!("Using TESSDATA_PREFIX directory: {:?}", path);
|
|
189
|
+
path
|
|
190
|
+
}
|
|
191
|
+
Err(_) => {
|
|
192
|
+
let default_dir = get_default_tessdata_dir();
|
|
193
|
+
println!(
|
|
194
|
+
"TESSDATA_PREFIX not set, using default directory: {:?}",
|
|
195
|
+
default_dir
|
|
196
|
+
);
|
|
197
|
+
default_dir
|
|
198
|
+
}
|
|
199
|
+
}
|
|
200
|
+
}
|
|
201
|
+
|
|
202
|
+
fn main() -> Result<(), Box<dyn Error>> {
|
|
203
|
+
let api = TesseractAPI::new()?;
|
|
204
|
+
|
|
205
|
+
// Get tessdata directory (uses default location or TESSDATA_PREFIX if set)
|
|
206
|
+
let tessdata_dir = get_tessdata_dir();
|
|
207
|
+
api.init(tessdata_dir.to_str().unwrap(), "eng")?;
|
|
208
|
+
|
|
209
|
+
let width = 24;
|
|
210
|
+
let height = 24;
|
|
211
|
+
let bytes_per_pixel = 1;
|
|
212
|
+
let bytes_per_line = width * bytes_per_pixel;
|
|
213
|
+
|
|
214
|
+
// Initialize image data with all white pixels
|
|
215
|
+
let mut image_data = vec![255u8; width * height];
|
|
216
|
+
|
|
217
|
+
// Draw number 9 with clearer distinction
|
|
218
|
+
for y in 4..19 {
|
|
219
|
+
for x in 7..17 {
|
|
220
|
+
// Top bar
|
|
221
|
+
if y == 4 && x >= 8 && x <= 15 {
|
|
222
|
+
image_data[y * width + x] = 0;
|
|
223
|
+
}
|
|
224
|
+
// Top curve left side
|
|
225
|
+
if y >= 4 && y <= 10 && x == 7 {
|
|
226
|
+
image_data[y * width + x] = 0;
|
|
227
|
+
}
|
|
228
|
+
// Top curve right side
|
|
229
|
+
if y >= 4 && y <= 11 && x == 16 {
|
|
230
|
+
image_data[y * width + x] = 0;
|
|
231
|
+
}
|
|
232
|
+
// Middle bar
|
|
233
|
+
if y == 11 && x >= 8 && x <= 15 {
|
|
234
|
+
image_data[y * width + x] = 0;
|
|
235
|
+
}
|
|
236
|
+
// Bottom right vertical line
|
|
237
|
+
if y >= 11 && y <= 18 && x == 16 {
|
|
238
|
+
image_data[y * width + x] = 0;
|
|
239
|
+
}
|
|
240
|
+
// Bottom bar
|
|
241
|
+
if y == 18 && x >= 8 && x <= 15 {
|
|
242
|
+
image_data[y * width + x] = 0;
|
|
243
|
+
}
|
|
244
|
+
}
|
|
245
|
+
}
|
|
246
|
+
|
|
247
|
+
// Set the image data
|
|
248
|
+
api.set_image(
|
|
249
|
+
&image_data,
|
|
250
|
+
width.try_into().unwrap(),
|
|
251
|
+
height.try_into().unwrap(),
|
|
252
|
+
bytes_per_pixel.try_into().unwrap(),
|
|
253
|
+
bytes_per_line.try_into().unwrap(),
|
|
254
|
+
)?;
|
|
255
|
+
|
|
256
|
+
// Set whitelist for digits only
|
|
257
|
+
api.set_variable("tessedit_char_whitelist", "0123456789")?;
|
|
258
|
+
|
|
259
|
+
// Set PSM mode to single character
|
|
260
|
+
api.set_variable("tessedit_pageseg_mode", "10")?;
|
|
261
|
+
|
|
262
|
+
// Get the recognized text
|
|
263
|
+
let text = api.get_utf8_text()?;
|
|
264
|
+
println!("Recognized text: {}", text.trim());
|
|
265
|
+
|
|
266
|
+
Ok(())
|
|
267
|
+
}
|
|
268
|
+
```
|
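The basic example stores the image as a row-major, 8-bit grayscale buffer, so pixel `(x, y)` lives at index `y * width + x` and `bytes_per_line` equals `width * bytes_per_pixel`. A quick way to check the drawing logic without running OCR is to render the buffer as text. The sketch below uses only the standard library; `draw_nine` and `to_ascii` are hypothetical helper names, and the strokes mirror the conditions in the example above (the two right-hand segments are merged into one range).

```rust
/// Draw the same strokes as the basic example into a fresh 8-bit
/// grayscale buffer (255 = white background, 0 = black ink).
/// Assumes `width >= 17` and `height >= 19` so every stroke fits.
fn draw_nine(width: usize, height: usize) -> Vec<u8> {
    let mut image_data = vec![255u8; width * height];
    for y in 4..19 {
        for x in 7..17 {
            let top_bar = y == 4 && (8..=15).contains(&x);
            let left_side = (4..=10).contains(&y) && x == 7;
            let right_side = (4..=18).contains(&y) && x == 16;
            let middle_bar = y == 11 && (8..=15).contains(&x);
            let bottom_bar = y == 18 && (8..=15).contains(&x);
            if top_bar || left_side || right_side || middle_bar || bottom_bar {
                // Row-major layout: skip `y` full rows, then move `x` pixels in.
                image_data[y * width + x] = 0;
            }
        }
    }
    image_data
}

/// Render the buffer as text, one line per pixel row ('#' = ink, '.' = background).
fn to_ascii(image_data: &[u8], width: usize) -> String {
    image_data
        .chunks(width)
        .map(|row| row.iter().map(|&p| if p == 0 { '#' } else { '.' }).collect::<String>())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Printing `to_ascii(&draw_nine(24, 24), 24)` shows the glyph that would be handed to `set_image`, which makes off-by-one errors in the stroke ranges easy to spot.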
|
269
|
+
|
|
270
|
+
## Advanced Usage
|
|
271
|
+
|
|
272
|
+
The API provides additional functionality for more complex OCR tasks, including thread-safe operations:
|
|
273
|
+
|
|
274
|
+
```rust
|
|
275
|
+
use kreuzberg_tesseract::TesseractAPI;
|
|
276
|
+
use std::sync::Arc;
|
|
277
|
+
use std::thread;
|
|
278
|
+
use std::{error::Error, path::PathBuf};
|
|
279
|
+
|
|
280
|
+
fn main() -> Result<(), Box<dyn Error>> {
|
|
281
|
+
let tessdata_dir = get_tessdata_dir();
|
|
282
|
+
let api = TesseractAPI::new()?;
|
|
283
|
+
|
|
284
|
+
// Initialize the main API
|
|
285
|
+
api.init(tessdata_dir.to_str().unwrap(), "eng")?;
|
|
286
|
+
api.set_variable("tessedit_pageseg_mode", "1")?;
|
|
287
|
+
|
|
288
|
+
// Load and prepare image data
|
|
289
|
+
let (image_data, width, height) = load_test_image("sample_text.png")?;
|
|
290
|
+
|
|
291
|
+
// Share image data across threads
|
|
292
|
+
let image_data = Arc::new(image_data);
|
|
293
|
+
let mut handles = vec![];
|
|
294
|
+
|
|
295
|
+
// Spawn multiple threads for parallel OCR processing
|
|
296
|
+
for _ in 0..3 {
|
|
297
|
+
let api_clone = api.clone(); // Clones the API with all configurations
|
|
298
|
+
let image_data = Arc::clone(&image_data);
|
|
299
|
+
|
|
300
|
+
let handle = thread::spawn(move || {
|
|
301
|
+
// Set image in each thread
|
|
302
|
+
let res = api_clone.set_image(
|
|
303
|
+
&image_data,
|
|
304
|
+
width as i32,
|
|
305
|
+
height as i32,
|
|
306
|
+
3,
|
|
307
|
+
3 * width as i32,
|
|
308
|
+
);
|
|
309
|
+
assert!(res.is_ok());
|
|
310
|
+
|
|
311
|
+
// Perform OCR in parallel
|
|
312
|
+
let text = api_clone.get_utf8_text()
|
|
313
|
+
.expect("Failed to get text");
|
|
314
|
+
println!("Thread result: {}", text);
|
|
315
|
+
});
|
|
316
|
+
handles.push(handle);
|
|
317
|
+
}
|
|
318
|
+
|
|
319
|
+
// Wait for all threads to complete
|
|
320
|
+
for handle in handles {
|
|
321
|
+
handle.join().unwrap();
|
|
322
|
+
}
|
|
323
|
+
|
|
324
|
+
Ok(())
|
|
325
|
+
}
|
|
326
|
+
|
|
327
|
+
// Helper function to get tessdata directory
|
|
328
|
+
fn get_tessdata_dir() -> PathBuf {
|
|
329
|
+
// ... (implementation as shown in basic example)
|
|
330
|
+
}
|
|
331
|
+
|
|
332
|
+
// Helper function to load test image
|
|
333
|
+
fn load_test_image(filename: &str) -> Result<(Vec<u8>, u32, u32), Box<dyn Error>> {
|
|
334
|
+
let img = image::open(filename)?
|
|
335
|
+
.to_rgb8();
|
|
336
|
+
let (width, height) = img.dimensions();
|
|
337
|
+
Ok((img.into_raw(), width, height))
|
|
338
|
+
}
|
|
339
|
+
```
|
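In the threaded example the buffer comes from `to_rgb8()`, so each pixel takes 3 bytes and each row takes `3 * width` bytes with no padding. Since a mismatched buffer length or stride is an easy mistake to make, the values can be validated before they reach `set_image`. The following is a minimal std-only sketch: `packed_layout` is a hypothetical helper, and the `i32` return types are inferred from the `as i32` casts in the example rather than from the crate's documented signature.

```rust
/// Hypothetical helper: for a tightly packed buffer of `len` bytes, return
/// `(width, height, bytes_per_pixel, bytes_per_line)` as `i32` values, or an
/// error if the length does not match the dimensions or a value overflows.
fn packed_layout(
    len: usize,
    width: u32,
    height: u32,
    bytes_per_pixel: u32,
) -> Result<(i32, i32, i32, i32), String> {
    // One row of a packed image: width pixels times bytes per pixel.
    let bytes_per_line = width
        .checked_mul(bytes_per_pixel)
        .ok_or("row size overflows u32")?;
    let expected = (bytes_per_line as usize)
        .checked_mul(height as usize)
        .ok_or("image size overflows usize")?;
    if len != expected {
        return Err(format!("buffer has {len} bytes, expected {expected}"));
    }
    // Checked conversion instead of `as i32`, which would silently wrap.
    let to_i32 = |v: u32| i32::try_from(v).map_err(|_| format!("{v} does not fit in i32"));
    Ok((
        to_i32(width)?,
        to_i32(height)?,
        to_i32(bytes_per_pixel)?,
        to_i32(bytes_per_line)?,
    ))
}
```

For the RGB case this would be called as `packed_layout(image_data.len(), width, height, 3)`, and for the grayscale buffer in the basic example with `bytes_per_pixel` set to 1.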
|
340
|
+
|
|
341
|
+
## Building
|
|
342
|
+
|
|
343
|
+
### Static Linking (Default)
|
|
344
|
+
|
|
345
|
+
With static linking, the crate will automatically download and compile Tesseract and Leptonica during the build process. This may take some time on the first build (5-10 minutes), but subsequent builds will use the cached libraries.
|
|
346
|
+
|
|
347
|
+
To clean the cache and force a rebuild:
|
|
348
|
+
|
|
349
|
+
```bash
|
|
350
|
+
CARGO_CLEAN=1 cargo build
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
### Dynamic Linking
|
|
354
|
+
|
|
355
|
+
With dynamic linking, the build is much faster (seconds instead of minutes) since it only links against system-installed libraries:
|
|
356
|
+
|
|
357
|
+
```bash
|
|
358
|
+
cargo build --no-default-features --features dynamic-linking
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
**Note**: Dynamic linking requires Tesseract and Leptonica to be installed on your system (see Installation section).
|
|
362
|
+
|
|
363
|
+
## Documentation
|
|
364
|
+
|
|
365
|
+
For more detailed information, please check the [API documentation](https://docs.rs/kreuzberg-tesseract).
|
|
366
|
+
|
|
367
|
+
## License
|
|
368
|
+
|
|
369
|
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
370
|
+
|
|
371
|
+
## Acknowledgements
|
|
372
|
+
|
|
373
|
+
This project is based on the original [tesseract-rs](https://github.com/cafercangundogdu/tesseract-rs) by [Cafer Can Gündoğdu](https://github.com/cafercangundogdu). We are grateful for the foundational work that made this project possible.
|
|
374
|
+
|
|
375
|
+
## Contributing
|
|
376
|
+
|
|
377
|
+
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
|
|
378
|
+
|
|
379
|
+
### Quick Start for Contributors
|
|
380
|
+
|
|
381
|
+
1. Fork and clone the repository
|
|
382
|
+
2. Install uv and set up git hooks:
|
|
383
|
+
```bash
|
|
384
|
+
curl -LsSf https://astral.sh/uv/install.sh | sh
|
|
385
|
+
uvx prek install
|
|
386
|
+
```
|
|
387
|
+
3. Make your changes following our commit message format
|
|
388
|
+
4. Run tests: `cargo test`
|
|
389
|
+
5. Submit a Pull Request
|
|
390
|
+
|
|
391
|
+
Our commit messages follow the [Conventional Commits](https://www.conventionalcommits.org/) specification.
|
|
392
|
+
|
|
393
|
+
## Acknowledgements
|
|
394
|
+
|
|
395
|
+
This project uses [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) and [Leptonica](http://leptonica.org/). We are grateful to the maintainers and contributors of these projects.
|
|
396
|
+
|
|
397
|
+
```
|
|
398
|
+
|
|
399
|
+
```
|