kreuzberg 4.0.0.pre.rc.29 → 4.0.0.rc1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.gitignore +0 -6
- data/.rubocop.yaml +534 -1
- data/Gemfile +2 -1
- data/Gemfile.lock +28 -116
- data/README.md +269 -629
- data/Rakefile +0 -9
- data/Steepfile +4 -8
- data/examples/async_patterns.rb +58 -1
- data/ext/kreuzberg_rb/extconf.rb +5 -35
- data/ext/kreuzberg_rb/native/Cargo.toml +16 -55
- data/ext/kreuzberg_rb/native/build.rs +14 -12
- data/ext/kreuzberg_rb/native/include/ieeefp.h +1 -1
- data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +1 -1
- data/ext/kreuzberg_rb/native/include/strings.h +2 -2
- data/ext/kreuzberg_rb/native/include/unistd.h +1 -1
- data/ext/kreuzberg_rb/native/src/lib.rs +34 -897
- data/extconf.rb +6 -38
- data/kreuzberg.gemspec +20 -114
- data/lib/kreuzberg/api_proxy.rb +18 -2
- data/lib/kreuzberg/cache_api.rb +0 -22
- data/lib/kreuzberg/cli.rb +10 -2
- data/lib/kreuzberg/cli_proxy.rb +10 -0
- data/lib/kreuzberg/config.rb +22 -274
- data/lib/kreuzberg/errors.rb +7 -73
- data/lib/kreuzberg/extraction_api.rb +8 -237
- data/lib/kreuzberg/mcp_proxy.rb +11 -2
- data/lib/kreuzberg/ocr_backend_protocol.rb +73 -0
- data/lib/kreuzberg/post_processor_protocol.rb +71 -0
- data/lib/kreuzberg/result.rb +33 -151
- data/lib/kreuzberg/setup_lib_path.rb +2 -22
- data/lib/kreuzberg/validator_protocol.rb +73 -0
- data/lib/kreuzberg/version.rb +1 -1
- data/lib/kreuzberg.rb +13 -27
- data/pkg/kreuzberg-4.0.0.rc1.gem +0 -0
- data/sig/kreuzberg.rbs +12 -105
- data/spec/binding/cache_spec.rb +22 -22
- data/spec/binding/cli_proxy_spec.rb +4 -2
- data/spec/binding/cli_spec.rb +11 -12
- data/spec/binding/config_spec.rb +0 -74
- data/spec/binding/config_validation_spec.rb +6 -100
- data/spec/binding/error_handling_spec.rb +97 -283
- data/spec/binding/plugins/ocr_backend_spec.rb +8 -8
- data/spec/binding/plugins/postprocessor_spec.rb +11 -11
- data/spec/binding/plugins/validator_spec.rb +13 -12
- data/spec/examples.txt +104 -0
- data/spec/fixtures/config.toml +1 -0
- data/spec/fixtures/config.yaml +1 -0
- data/spec/fixtures/invalid_config.toml +1 -0
- data/spec/smoke/package_spec.rb +3 -2
- data/spec/spec_helper.rb +3 -1
- data/vendor/kreuzberg/Cargo.toml +67 -192
- data/vendor/kreuzberg/README.md +9 -97
- data/vendor/kreuzberg/build.rs +194 -516
- data/vendor/kreuzberg/src/api/handlers.rs +9 -130
- data/vendor/kreuzberg/src/api/mod.rs +3 -18
- data/vendor/kreuzberg/src/api/server.rs +71 -236
- data/vendor/kreuzberg/src/api/types.rs +7 -43
- data/vendor/kreuzberg/src/bin/profile_extract.rs +455 -0
- data/vendor/kreuzberg/src/cache/mod.rs +3 -27
- data/vendor/kreuzberg/src/chunking/mod.rs +79 -1705
- data/vendor/kreuzberg/src/core/batch_mode.rs +0 -60
- data/vendor/kreuzberg/src/core/config.rs +23 -905
- data/vendor/kreuzberg/src/core/extractor.rs +106 -403
- data/vendor/kreuzberg/src/core/io.rs +2 -4
- data/vendor/kreuzberg/src/core/mime.rs +12 -2
- data/vendor/kreuzberg/src/core/mod.rs +3 -22
- data/vendor/kreuzberg/src/core/pipeline.rs +78 -395
- data/vendor/kreuzberg/src/embeddings.rs +21 -169
- data/vendor/kreuzberg/src/error.rs +2 -2
- data/vendor/kreuzberg/src/extraction/archive.rs +31 -36
- data/vendor/kreuzberg/src/extraction/docx.rs +1 -365
- data/vendor/kreuzberg/src/extraction/email.rs +11 -12
- data/vendor/kreuzberg/src/extraction/excel.rs +129 -138
- data/vendor/kreuzberg/src/extraction/html.rs +170 -1447
- data/vendor/kreuzberg/src/extraction/image.rs +14 -138
- data/vendor/kreuzberg/src/extraction/libreoffice.rs +3 -13
- data/vendor/kreuzberg/src/extraction/mod.rs +5 -21
- data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +0 -2
- data/vendor/kreuzberg/src/extraction/pandoc/batch.rs +275 -0
- data/vendor/kreuzberg/src/extraction/pandoc/mime_types.rs +178 -0
- data/vendor/kreuzberg/src/extraction/pandoc/mod.rs +491 -0
- data/vendor/kreuzberg/src/extraction/pandoc/server.rs +496 -0
- data/vendor/kreuzberg/src/extraction/pandoc/subprocess.rs +1188 -0
- data/vendor/kreuzberg/src/extraction/pandoc/version.rs +162 -0
- data/vendor/kreuzberg/src/extraction/pptx.rs +94 -196
- data/vendor/kreuzberg/src/extraction/structured.rs +4 -5
- data/vendor/kreuzberg/src/extraction/table.rs +1 -2
- data/vendor/kreuzberg/src/extraction/text.rs +10 -18
- data/vendor/kreuzberg/src/extractors/archive.rs +0 -22
- data/vendor/kreuzberg/src/extractors/docx.rs +148 -69
- data/vendor/kreuzberg/src/extractors/email.rs +9 -37
- data/vendor/kreuzberg/src/extractors/excel.rs +40 -81
- data/vendor/kreuzberg/src/extractors/html.rs +173 -182
- data/vendor/kreuzberg/src/extractors/image.rs +8 -32
- data/vendor/kreuzberg/src/extractors/mod.rs +10 -171
- data/vendor/kreuzberg/src/extractors/pandoc.rs +201 -0
- data/vendor/kreuzberg/src/extractors/pdf.rs +64 -329
- data/vendor/kreuzberg/src/extractors/pptx.rs +34 -79
- data/vendor/kreuzberg/src/extractors/structured.rs +0 -16
- data/vendor/kreuzberg/src/extractors/text.rs +7 -30
- data/vendor/kreuzberg/src/extractors/xml.rs +8 -27
- data/vendor/kreuzberg/src/keywords/processor.rs +1 -9
- data/vendor/kreuzberg/src/keywords/rake.rs +1 -0
- data/vendor/kreuzberg/src/language_detection/mod.rs +51 -94
- data/vendor/kreuzberg/src/lib.rs +5 -17
- data/vendor/kreuzberg/src/mcp/mod.rs +1 -4
- data/vendor/kreuzberg/src/mcp/server.rs +21 -145
- data/vendor/kreuzberg/src/ocr/mod.rs +0 -2
- data/vendor/kreuzberg/src/ocr/processor.rs +8 -19
- data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +0 -2
- data/vendor/kreuzberg/src/pdf/error.rs +1 -93
- data/vendor/kreuzberg/src/pdf/metadata.rs +100 -263
- data/vendor/kreuzberg/src/pdf/mod.rs +2 -33
- data/vendor/kreuzberg/src/pdf/rendering.rs +12 -12
- data/vendor/kreuzberg/src/pdf/table.rs +64 -61
- data/vendor/kreuzberg/src/pdf/text.rs +24 -416
- data/vendor/kreuzberg/src/plugins/extractor.rs +8 -40
- data/vendor/kreuzberg/src/plugins/mod.rs +0 -3
- data/vendor/kreuzberg/src/plugins/ocr.rs +14 -22
- data/vendor/kreuzberg/src/plugins/processor.rs +1 -10
- data/vendor/kreuzberg/src/plugins/registry.rs +0 -15
- data/vendor/kreuzberg/src/plugins/validator.rs +8 -20
- data/vendor/kreuzberg/src/stopwords/mod.rs +2 -2
- data/vendor/kreuzberg/src/text/mod.rs +0 -8
- data/vendor/kreuzberg/src/text/quality.rs +15 -28
- data/vendor/kreuzberg/src/text/string_utils.rs +10 -22
- data/vendor/kreuzberg/src/text/token_reduction/core.rs +50 -86
- data/vendor/kreuzberg/src/text/token_reduction/filters.rs +16 -37
- data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +1 -2
- data/vendor/kreuzberg/src/types.rs +67 -907
- data/vendor/kreuzberg/src/utils/mod.rs +0 -14
- data/vendor/kreuzberg/src/utils/quality.rs +3 -12
- data/vendor/kreuzberg/tests/api_tests.rs +0 -506
- data/vendor/kreuzberg/tests/archive_integration.rs +0 -2
- data/vendor/kreuzberg/tests/batch_orchestration.rs +12 -57
- data/vendor/kreuzberg/tests/batch_processing.rs +8 -32
- data/vendor/kreuzberg/tests/chunking_offset_demo.rs +92 -0
- data/vendor/kreuzberg/tests/concurrency_stress.rs +8 -40
- data/vendor/kreuzberg/tests/config_features.rs +1 -33
- data/vendor/kreuzberg/tests/config_loading_tests.rs +39 -16
- data/vendor/kreuzberg/tests/core_integration.rs +9 -35
- data/vendor/kreuzberg/tests/csv_integration.rs +81 -71
- data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +25 -23
- data/vendor/kreuzberg/tests/email_integration.rs +1 -3
- data/vendor/kreuzberg/tests/error_handling.rs +34 -43
- data/vendor/kreuzberg/tests/format_integration.rs +1 -7
- data/vendor/kreuzberg/tests/helpers/mod.rs +0 -60
- data/vendor/kreuzberg/tests/image_integration.rs +0 -2
- data/vendor/kreuzberg/tests/mime_detection.rs +16 -17
- data/vendor/kreuzberg/tests/ocr_configuration.rs +0 -4
- data/vendor/kreuzberg/tests/ocr_errors.rs +0 -22
- data/vendor/kreuzberg/tests/ocr_quality.rs +0 -2
- data/vendor/kreuzberg/tests/pandoc_integration.rs +503 -0
- data/vendor/kreuzberg/tests/pdf_integration.rs +0 -2
- data/vendor/kreuzberg/tests/pipeline_integration.rs +2 -36
- data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +0 -5
- data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +1 -17
- data/vendor/kreuzberg/tests/plugin_system.rs +0 -6
- data/vendor/kreuzberg/tests/registry_integration_tests.rs +22 -2
- data/vendor/kreuzberg/tests/security_validation.rs +1 -13
- data/vendor/kreuzberg/tests/test_fastembed.rs +23 -45
- metadata +25 -171
- data/.rubocop.yml +0 -543
- data/ext/kreuzberg_rb/native/.cargo/config.toml +0 -23
- data/ext/kreuzberg_rb/native/Cargo.lock +0 -7619
- data/lib/kreuzberg/error_context.rb +0 -136
- data/lib/kreuzberg/types.rb +0 -170
- data/lib/libpdfium.so +0 -0
- data/spec/binding/async_operations_spec.rb +0 -473
- data/spec/binding/batch_operations_spec.rb +0 -595
- data/spec/binding/batch_spec.rb +0 -359
- data/spec/binding/config_result_spec.rb +0 -377
- data/spec/binding/embeddings_spec.rb +0 -816
- data/spec/binding/error_recovery_spec.rb +0 -488
- data/spec/binding/font_config_spec.rb +0 -220
- data/spec/binding/images_spec.rb +0 -738
- data/spec/binding/keywords_extraction_spec.rb +0 -600
- data/spec/binding/metadata_types_spec.rb +0 -1228
- data/spec/binding/pages_extraction_spec.rb +0 -471
- data/spec/binding/tables_spec.rb +0 -641
- data/spec/unit/config/chunking_config_spec.rb +0 -213
- data/spec/unit/config/embedding_config_spec.rb +0 -343
- data/spec/unit/config/extraction_config_spec.rb +0 -438
- data/spec/unit/config/font_config_spec.rb +0 -285
- data/spec/unit/config/hierarchy_config_spec.rb +0 -314
- data/spec/unit/config/image_extraction_config_spec.rb +0 -209
- data/spec/unit/config/image_preprocessing_config_spec.rb +0 -249
- data/spec/unit/config/keyword_config_spec.rb +0 -229
- data/spec/unit/config/language_detection_config_spec.rb +0 -258
- data/spec/unit/config/ocr_config_spec.rb +0 -171
- data/spec/unit/config/page_config_spec.rb +0 -221
- data/spec/unit/config/pdf_config_spec.rb +0 -267
- data/spec/unit/config/postprocessor_config_spec.rb +0 -290
- data/spec/unit/config/tesseract_config_spec.rb +0 -181
- data/spec/unit/config/token_reduction_config_spec.rb +0 -251
- data/test/metadata_types_test.rb +0 -959
- data/vendor/Cargo.toml +0 -61
- data/vendor/kreuzberg/examples/bench_fixes.rs +0 -71
- data/vendor/kreuzberg/examples/test_pdfium_fork.rs +0 -62
- data/vendor/kreuzberg/src/chunking/processor.rs +0 -219
- data/vendor/kreuzberg/src/core/batch_optimizations.rs +0 -385
- data/vendor/kreuzberg/src/core/config_validation.rs +0 -949
- data/vendor/kreuzberg/src/core/formats.rs +0 -235
- data/vendor/kreuzberg/src/core/server_config.rs +0 -1220
- data/vendor/kreuzberg/src/extraction/capacity.rs +0 -263
- data/vendor/kreuzberg/src/extraction/markdown.rs +0 -216
- data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +0 -284
- data/vendor/kreuzberg/src/extractors/bibtex.rs +0 -470
- data/vendor/kreuzberg/src/extractors/docbook.rs +0 -504
- data/vendor/kreuzberg/src/extractors/epub.rs +0 -696
- data/vendor/kreuzberg/src/extractors/fictionbook.rs +0 -492
- data/vendor/kreuzberg/src/extractors/jats.rs +0 -1054
- data/vendor/kreuzberg/src/extractors/jupyter.rs +0 -368
- data/vendor/kreuzberg/src/extractors/latex.rs +0 -653
- data/vendor/kreuzberg/src/extractors/markdown.rs +0 -701
- data/vendor/kreuzberg/src/extractors/odt.rs +0 -628
- data/vendor/kreuzberg/src/extractors/opml.rs +0 -635
- data/vendor/kreuzberg/src/extractors/orgmode.rs +0 -529
- data/vendor/kreuzberg/src/extractors/rst.rs +0 -577
- data/vendor/kreuzberg/src/extractors/rtf.rs +0 -809
- data/vendor/kreuzberg/src/extractors/security.rs +0 -484
- data/vendor/kreuzberg/src/extractors/security_tests.rs +0 -367
- data/vendor/kreuzberg/src/extractors/typst.rs +0 -651
- data/vendor/kreuzberg/src/language_detection/processor.rs +0 -218
- data/vendor/kreuzberg/src/ocr/language_registry.rs +0 -520
- data/vendor/kreuzberg/src/panic_context.rs +0 -154
- data/vendor/kreuzberg/src/pdf/bindings.rs +0 -306
- data/vendor/kreuzberg/src/pdf/bundled.rs +0 -408
- data/vendor/kreuzberg/src/pdf/fonts.rs +0 -358
- data/vendor/kreuzberg/src/pdf/hierarchy.rs +0 -903
- data/vendor/kreuzberg/src/text/quality_processor.rs +0 -231
- data/vendor/kreuzberg/src/text/utf8_validation.rs +0 -193
- data/vendor/kreuzberg/src/utils/pool.rs +0 -503
- data/vendor/kreuzberg/src/utils/pool_sizing.rs +0 -364
- data/vendor/kreuzberg/src/utils/string_pool.rs +0 -761
- data/vendor/kreuzberg/tests/api_embed.rs +0 -360
- data/vendor/kreuzberg/tests/api_extract_multipart.rs +0 -52
- data/vendor/kreuzberg/tests/api_large_pdf_extraction.rs +0 -471
- data/vendor/kreuzberg/tests/api_large_pdf_extraction_diagnostics.rs +0 -289
- data/vendor/kreuzberg/tests/batch_pooling_benchmark.rs +0 -154
- data/vendor/kreuzberg/tests/bibtex_parity_test.rs +0 -421
- data/vendor/kreuzberg/tests/config_integration_test.rs +0 -753
- data/vendor/kreuzberg/tests/data/hierarchy_ground_truth.json +0 -294
- data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +0 -500
- data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +0 -370
- data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +0 -275
- data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +0 -228
- data/vendor/kreuzberg/tests/html_table_test.rs +0 -551
- data/vendor/kreuzberg/tests/instrumentation_test.rs +0 -139
- data/vendor/kreuzberg/tests/jats_extractor_tests.rs +0 -639
- data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +0 -704
- data/vendor/kreuzberg/tests/latex_extractor_tests.rs +0 -496
- data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +0 -490
- data/vendor/kreuzberg/tests/ocr_language_registry.rs +0 -191
- data/vendor/kreuzberg/tests/odt_extractor_tests.rs +0 -674
- data/vendor/kreuzberg/tests/opml_extractor_tests.rs +0 -616
- data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +0 -822
- data/vendor/kreuzberg/tests/page_markers.rs +0 -297
- data/vendor/kreuzberg/tests/pdf_hierarchy_detection.rs +0 -301
- data/vendor/kreuzberg/tests/pdf_hierarchy_quality.rs +0 -589
- data/vendor/kreuzberg/tests/pdf_ocr_triggering.rs +0 -301
- data/vendor/kreuzberg/tests/pdf_text_merging.rs +0 -475
- data/vendor/kreuzberg/tests/pdfium_linking.rs +0 -340
- data/vendor/kreuzberg/tests/rst_extractor_tests.rs +0 -694
- data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +0 -775
- data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +0 -1260
- data/vendor/kreuzberg/tests/typst_extractor_tests.rs +0 -648
- data/vendor/kreuzberg-ffi/Cargo.toml +0 -67
- data/vendor/kreuzberg-ffi/README.md +0 -851
- data/vendor/kreuzberg-ffi/benches/result_view_benchmark.rs +0 -227
- data/vendor/kreuzberg-ffi/build.rs +0 -168
- data/vendor/kreuzberg-ffi/cbindgen.toml +0 -37
- data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +0 -12
- data/vendor/kreuzberg-ffi/kreuzberg.h +0 -3012
- data/vendor/kreuzberg-ffi/src/batch_streaming.rs +0 -588
- data/vendor/kreuzberg-ffi/src/config.rs +0 -1341
- data/vendor/kreuzberg-ffi/src/error.rs +0 -901
- data/vendor/kreuzberg-ffi/src/extraction.rs +0 -555
- data/vendor/kreuzberg-ffi/src/helpers.rs +0 -879
- data/vendor/kreuzberg-ffi/src/lib.rs +0 -977
- data/vendor/kreuzberg-ffi/src/memory.rs +0 -493
- data/vendor/kreuzberg-ffi/src/mime.rs +0 -329
- data/vendor/kreuzberg-ffi/src/panic_shield.rs +0 -265
- data/vendor/kreuzberg-ffi/src/plugins/document_extractor.rs +0 -442
- data/vendor/kreuzberg-ffi/src/plugins/mod.rs +0 -14
- data/vendor/kreuzberg-ffi/src/plugins/ocr_backend.rs +0 -628
- data/vendor/kreuzberg-ffi/src/plugins/post_processor.rs +0 -438
- data/vendor/kreuzberg-ffi/src/plugins/validator.rs +0 -329
- data/vendor/kreuzberg-ffi/src/result.rs +0 -510
- data/vendor/kreuzberg-ffi/src/result_pool.rs +0 -639
- data/vendor/kreuzberg-ffi/src/result_view.rs +0 -773
- data/vendor/kreuzberg-ffi/src/string_intern.rs +0 -568
- data/vendor/kreuzberg-ffi/src/types.rs +0 -363
- data/vendor/kreuzberg-ffi/src/util.rs +0 -210
- data/vendor/kreuzberg-ffi/src/validation.rs +0 -848
- data/vendor/kreuzberg-ffi/tests.disabled/README.md +0 -48
- data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +0 -299
- data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +0 -346
- data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +0 -232
- data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +0 -470
- data/vendor/kreuzberg-tesseract/.commitlintrc.json +0 -13
- data/vendor/kreuzberg-tesseract/.crate-ignore +0 -2
- data/vendor/kreuzberg-tesseract/Cargo.lock +0 -2933
- data/vendor/kreuzberg-tesseract/Cargo.toml +0 -57
- data/vendor/kreuzberg-tesseract/LICENSE +0 -22
- data/vendor/kreuzberg-tesseract/README.md +0 -399
- data/vendor/kreuzberg-tesseract/build.rs +0 -1127
- data/vendor/kreuzberg-tesseract/patches/README.md +0 -71
- data/vendor/kreuzberg-tesseract/patches/tesseract.diff +0 -199
- data/vendor/kreuzberg-tesseract/src/api.rs +0 -1371
- data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +0 -77
- data/vendor/kreuzberg-tesseract/src/enums.rs +0 -297
- data/vendor/kreuzberg-tesseract/src/error.rs +0 -81
- data/vendor/kreuzberg-tesseract/src/lib.rs +0 -145
- data/vendor/kreuzberg-tesseract/src/monitor.rs +0 -57
- data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +0 -197
- data/vendor/kreuzberg-tesseract/src/page_iterator.rs +0 -253
- data/vendor/kreuzberg-tesseract/src/result_iterator.rs +0 -286
- data/vendor/kreuzberg-tesseract/src/result_renderer.rs +0 -183
- data/vendor/kreuzberg-tesseract/tests/integration_test.rs +0 -211
data/README.md
CHANGED
@@ -1,781 +1,421 @@
-# Ruby
-
-<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
-<!-- Language Bindings -->
-<a href="https://crates.io/crates/kreuzberg">
-<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
-</a>
-<a href="https://hex.pm/packages/kreuzberg">
-<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
-</a>
-<a href="https://pypi.org/project/kreuzberg/">
-<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
-</a>
-<a href="https://www.npmjs.com/package/@kreuzberg/node">
-<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
-</a>
-<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
-<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
-</a>
-
-<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
-<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
-</a>
-<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
-<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
-</a>
-<a href="https://www.nuget.org/packages/Kreuzberg/">
-<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
-</a>
-<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
-<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
-</a>
-<a href="https://rubygems.org/gems/kreuzberg">
-<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
-</a>
-
-<!-- Project Info -->
-
-<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
-<img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
-</a>
-<a href="https://docs.kreuzberg.dev">
-<img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
-</a>
-</div>
-
-<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
-
-<div align="center" style="margin-top: 20px;">
-<a href="https://discord.gg/pXxagNK2zN">
-<img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
-</a>
-</div>
-
-Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Ruby bindings with idiomatic Ruby API and native performance.
-
-> **Version 4.0.0 Release Candidate**
-> Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
-> This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
+# Kreuzberg for Ruby
 
-
+[](https://rubygems.org/gems/kreuzberg)
+[](https://crates.io/crates/kreuzberg)
+[](https://pypi.org/project/kreuzberg/)
+[](https://www.npmjs.com/package/@goldziher/kreuzberg)
+[](https://opensource.org/licenses/MIT)
+[](https://kreuzberg.dev)
 
-
+High-performance document intelligence for Ruby, powered by Rust.
 
-
+Extract text, tables, images, and metadata from 30+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
 
-**
+> **🚀 Version 4.0.0 Release Candidate**
+> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/Goldziher/kreuzberg/issues) you encounter.
 
-
-gem install kreuzberg
-```
-
-**Bundler:**
-
-```ruby
-gem 'kreuzberg'
-```
+## Features
 
-
+- **30+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
+- **OCR Support**: Built-in Tesseract OCR for scanned documents and images
+- **High Performance**: Rust-powered extraction for native-level performance
+- **Table Extraction**: Extract structured tables from documents
+- **Language Detection**: Automatic language detection for extracted text
+- **Text Chunking**: Split long documents into manageable chunks
+- **Caching**: Built-in result caching for faster repeated extractions
+- **Type-Safe**: Comprehensive typed configuration and result objects
 
-
-- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
-- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
+## Requirements
 
-
+- Ruby 3.2 or higher
+- Rust toolchain (for building from source)
 
-
+### Optional System Dependencies
 
--
--
--
+- **Tesseract**: For OCR functionality
+  - macOS: `brew install tesseract`
+  - Ubuntu: `sudo apt-get install tesseract-ocr`
+  - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
 
-
+- **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
+  - macOS: `brew install libreoffice`
+  - Ubuntu: `sudo apt-get install libreoffice`
 
-
+- **Pandoc**: For advanced document conversion
+  - macOS: `brew install pandoc`
+  - Ubuntu: `sudo apt-get install pandoc`
 
-
+## Installation
 
-
+Add to your Gemfile:
 
 ```ruby
-
-
-result = Kreuzberg.extract_file_sync(path: 'document.pdf')
-
-puts "Content:"
-puts result.content
-
-puts "\nMetadata:"
-puts "Title: #{result.metadata&.dig('title')}"
-puts "Author: #{result.metadata&.dig('author')}"
-
-puts "\nTables found: #{result.tables.length}"
-puts "Images found: #{result.images.length}"
+gem 'kreuzberg'
 ```
 
-
-
-#### Extract with Custom Configuration
+Then run:
 
-
-
-
-
-```ruby
-require 'kreuzberg'
-
-ocr_config = Kreuzberg::Config::OCR.new(
-  backend: 'tesseract',
-  language: 'eng'
-)
+```bash
+bundle install
+```
 
-
-result = Kreuzberg.extract_file_sync(path: 'scanned.pdf', config: config)
+Or install directly:
 
-
-
-puts "Used OCR backend: tesseract"
+```bash
+gem install kreuzberg
 ```
 
-
-
-See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
+## Quick Start
 
-
+### Basic Extraction
 
 ```ruby
 require 'kreuzberg'
 
-
-
-
-
-puts "Installation verified! Extracted #{result.content.length} characters"
+# Extract from a file
+result = Kreuzberg.extract_file_sync("document.pdf")
+puts result.content
+puts "MIME type: #{result.mime_type}"
 ```
 
-
-
-For non-blocking document processing:
+### With Configuration
 
 ```ruby
-
-
+# Create configuration
 config = Kreuzberg::Config::Extraction.new(
   use_cache: true,
-
+  force_ocr: false
 )
 
-result = Kreuzberg.extract_file_sync(
-
-puts "Extracted #{result.content.length} characters"
-puts "Quality score: #{result.metadata&.dig('quality_score')}"
-puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
+result = Kreuzberg.extract_file_sync("document.pdf", config: config)
 ```
 
-###
-
-- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
-- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
-- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
-- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
-- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
-
-## Features
-
-### Supported File Formats (56+)
-
-56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
-
-#### Office Documents
-
-| Category | Formats | Capabilities |
-|----------|---------|--------------|
-| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
-| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
-| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
-| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
-| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
-
-#### Images (OCR-Enabled)
-
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
-| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
-| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
-
-#### Web & Data
-
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
-| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
-| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
-
-#### Email & Archives
-
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
-| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
-
-#### Academic & Scientific
-
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
-| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
-| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
-
-**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
-
-### Key Capabilities
-
-- **Text Extraction** - Extract all text content with position and formatting information
-
-- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
-
-- **Table Extraction** - Parse tables with structure and cell content preservation
-
-- **Image Extraction** - Extract embedded images and render page previews
-
-- **OCR Support** - Integrate multiple OCR backends for scanned documents
-
-- **Async/Await** - Non-blocking document processing with concurrent operations
-
-- **Plugin System** - Extensible post-processing for custom text transformation
-
-- **Embeddings** - Generate vector embeddings using ONNX Runtime models
-
-- **Batch Processing** - Efficiently process multiple documents in parallel
-
-- **Memory Efficient** - Stream large files without loading entirely into memory
-
-- **Language Detection** - Detect and support multiple languages in documents
-
-- **Configuration** - Fine-grained control over extraction behavior
-
-### Performance Characteristics
-
-| Format | Speed | Memory | Notes |
-|--------|-------|--------|-------|
-| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
-| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
-| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
-| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
-| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
-
-## OCR Support
-
-Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
-
-- **Tesseract**
-
-### OCR Configuration Example
+### With OCR
 
 ```ruby
-
-
+# Configure OCR
 ocr_config = Kreuzberg::Config::OCR.new(
-  backend:
-  language:
+  backend: "tesseract",
+  language: "eng",
+  preprocessing: true
 )
 
 config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
-result = Kreuzberg.extract_file_sync(
-
-puts "Extracted text from scanned document:"
-puts result.content
-puts "Used OCR backend: tesseract"
+result = Kreuzberg.extract_file_sync("scanned.pdf", config: config)
 ```
 
-
-
-This binding provides full async/await support for non-blocking document processing:
+### Extract from Bytes
 
 ```ruby
-
-
-
-  use_cache: true,
-  enable_quality_processing: true
-)
-
-result = Kreuzberg.extract_file_sync(path: 'contract.pdf', config: config)
-
-puts "Extracted #{result.content.length} characters"
-puts "Quality score: #{result.metadata&.dig('quality_score')}"
-puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
+data = File.binread("document.pdf")
+result = Kreuzberg.extract_bytes_sync(data, "application/pdf")
+puts result.content
 ```
 
-
-
-Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
|
|
315
|
-
|
|
316
|
-
For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).
|
|
317
|
-
|
|
318
|
-
## Embeddings Support
|
|
319
|
-
|
|
320
|
-
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
|
|
321
|
-
|
|
322
|
-
**[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
|
|
115
|
+
### Batch Processing
|
|
323
116
|
|
|
324
|
-
|
|
117
|
+
```ruby
|
|
118
|
+
paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
|
|
119
|
+
results = Kreuzberg.batch_extract_files_sync(paths)
|
|
325
120
|
|
|
326
|
-
|
|
121
|
+
results.each do |result|
|
|
122
|
+
puts "Content: #{result.content[0..100]}"
|
|
123
|
+
puts "MIME: #{result.mime_type}"
|
|
124
|
+
end
|
|
125
|
+
```
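The batch call above returns one result per input path. Below is a minimal sketch of pairing each path with a short summary of its result. It assumes results come back in input order, which the README does not state, so verify that before relying on it. `FakeResult` and `summarize_batch` are hypothetical names; the struct stands in for the real result objects so the snippet runs without the gem.

```ruby
# Hypothetical stand-in exposing only the two readers used in the README example.
FakeResult = Struct.new(:content, :mime_type)

# Pairs each path with a short summary of its result.
# Assumes results[i] corresponds to paths[i] (an assumption, not a documented guarantee).
def summarize_batch(paths, results, preview: 100)
  unless paths.length == results.length
    raise ArgumentError, "got #{results.length} results for #{paths.length} paths"
  end

  paths.zip(results).to_h do |path, result|
    [path, { mime: result.mime_type, preview: result.content[0, preview] }]
  end
end

# Usage with stand-in data; with the gem installed, `results` would come from
# Kreuzberg.batch_extract_files_sync(paths) as shown above.
paths = ["doc1.pdf", "doc2.docx"]
results = [
  FakeResult.new("First document text", "application/pdf"),
  FakeResult.new("Second document text", "text/plain")
]
summarize_batch(paths, results, preview: 5).each do |path, info|
  puts "#{path}: #{info[:mime]} #{info[:preview].inspect}"
end
```

Raising on a length mismatch keeps a silently misaligned pairing from going unnoticed.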
|
|
327
126
|
|
|
328
|
-
|
|
127
|
+
### Structured Results (Chunks & Images)
|
|
329
128
|
|
|
330
129
|
```ruby
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
334
|
-
|
|
335
|
-
model: { type: :preset, name: 'balanced' },
|
|
336
|
-
normalize: true,
|
|
337
|
-
batch_size: 32,
|
|
338
|
-
show_download_progress: false
|
|
339
|
-
)
|
|
130
|
+
result = Kreuzberg.extract_file_sync("long-report.pdf", config: {
|
|
131
|
+
chunking: { max_chars: 750 },
|
|
132
|
+
image_extraction: { extract_images: true }
|
|
133
|
+
})
|
|
340
134
|
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
max_overlap: 256,
|
|
345
|
-
embedding: embedding_config
|
|
346
|
-
)
|
|
135
|
+
result.chunks&.each do |chunk|
|
|
136
|
+
puts "[#{chunk.chunk_index + 1}/#{chunk.total_chunks}] #{chunk.content[0..80]}"
|
|
137
|
+
end
|
|
347
138
|
|
|
348
|
-
|
|
349
|
-
|
|
350
|
-
|
|
351
|
-
|
|
352
|
-
result.chunks.each_with_index do |chunk, idx|
|
|
353
|
-
puts "Chunk #{idx}:"
|
|
354
|
-
puts " Content: #{chunk.content[0..50]}..."
|
|
355
|
-
puts " Tokens: #{chunk.token_count}"
|
|
356
|
-
puts " Pages: #{chunk.first_page}-#{chunk.last_page}"
|
|
357
|
-
if chunk.embedding
|
|
358
|
-
puts " Embedding dimensions: #{chunk.embedding.length}"
|
|
139
|
+
result.images&.each do |image|
|
|
140
|
+
File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
|
|
141
|
+
if image.ocr_result
|
|
142
|
+
puts "Embedded OCR content: #{image.ocr_result.content[0..60]}"
|
|
359
143
|
end
|
|
360
144
|
end
|
|
361
145
|
```
|
|
362
146
|
|
|
363
|
-
|
|
147
|
+
## Configuration
|
|
364
148
|
|
|
365
|
-
|
|
149
|
+
### Load From File
|
|
366
150
|
|
|
367
151
|
```ruby
|
|
368
|
-
|
|
152
|
+
config = Kreuzberg::Config::Extraction.from_file("config.toml")
|
|
153
|
+
result = Kreuzberg.extract_file_sync("report.pdf", config: config)
|
|
154
|
+
```
|
|
369
155
|
|
|
370
|
-
|
|
371
|
-
yake_config = Kreuzberg::Config::Keywords.new(
|
|
372
|
-
algorithm: 'yake',
|
|
373
|
-
max_keywords: 10,
|
|
374
|
-
min_score: 0.1,
|
|
375
|
-
yake_params: Kreuzberg::Config::KeywordYakeParams.new(window_size: 3)
|
|
376
|
-
)
|
|
156
|
+
### Extraction Configuration
|
|
377
157
|
|
|
378
|
-
|
|
379
|
-
|
|
380
|
-
|
|
381
|
-
#
|
|
382
|
-
|
|
383
|
-
algorithm: 'rake',
|
|
384
|
-
max_keywords: 15,
|
|
385
|
-
language: 'english',
|
|
386
|
-
rake_params: Kreuzberg::Config::KeywordRakeParams.new(
|
|
387
|
-
min_word_length: 3,
|
|
388
|
-
max_words_per_phrase: 5
|
|
389
|
-
)
|
|
158
|
+
```ruby
|
|
159
|
+
config = Kreuzberg::Config::Extraction.new(
|
|
160
|
+
use_cache: true, # Enable result caching
|
|
161
|
+
enable_quality_processing: false, # Enable text quality processing
|
|
162
|
+
force_ocr: false # Force OCR even for digital PDFs
|
|
390
163
|
)
|
|
391
|
-
|
|
392
|
-
config = Kreuzberg::Config::Extraction.new(keywords: rake_config)
|
|
393
|
-
result = Kreuzberg.extract_file_sync(path: 'report.docx', config: config)
|
|
394
|
-
|
|
395
|
-
puts "Keywords extracted for document"
|
|
396
164
|
```
|
|
397
165
|
|
|
398
|
-
###
|
|
399
|
-
|
|
400
|
-
Extract and organize content by pages:
|
|
166
|
+
### OCR Configuration
|
|
401
167
|
|
|
402
168
|
```ruby
|
|
403
|
-
|
|
404
|
-
|
|
405
|
-
#
|
|
406
|
-
|
|
407
|
-
|
|
408
|
-
|
|
409
|
-
|
|
169
|
+
ocr = Kreuzberg::Config::OCR.new(
|
|
170
|
+
backend: "tesseract", # OCR backend (tesseract, easyocr, paddleocr)
|
|
171
|
+
language: "eng", # Language code (eng, deu, fra, etc.)
|
|
172
|
+
tesseract_config: {
|
|
173
|
+
psm: 6,
|
|
174
|
+
enable_table_detection: true,
|
|
175
|
+
preprocessing: Kreuzberg::Config::ImagePreprocessing.new(auto_rotate: true).to_h
|
|
176
|
+
}
|
|
410
177
|
)
|
|
411
178
|
|
|
412
|
-
config = Kreuzberg::Config::Extraction.new(
|
|
413
|
-
result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
|
|
414
|
-
|
|
415
|
-
# Access extracted pages
|
|
416
|
-
if result.pages
|
|
417
|
-
result.pages.each do |page|
|
|
418
|
-
puts "Page #{page.page_number}:"
|
|
419
|
-
puts " Content length: #{page.content.length}"
|
|
420
|
-
puts " Tables: #{page.tables.length}"
|
|
421
|
-
puts " Images: #{page.images.length}"
|
|
422
|
-
end
|
|
423
|
-
end
|
|
424
|
-
|
|
425
|
-
puts "Total pages: #{result.page_count}"
|
|
179
|
+
config = Kreuzberg::Config::Extraction.new(ocr: ocr)
|
|
426
180
|
```
|
|
427
181
|
|
|
428
|
-
###
|
|
429
|
-
|
|
430
|
-
Create and register custom post-processors for text transformation:
|
|
182
|
+
### Chunking Configuration
|
|
431
183
|
|
|
432
184
|
```ruby
|
|
433
|
-
|
|
434
|
-
|
|
435
|
-
#
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
if enhanced['content']
|
|
444
|
-
# Add markdown headers for detected structure
|
|
445
|
-
enhanced['content'] = enhance_with_markdown(enhanced['content'])
|
|
446
|
-
end
|
|
447
|
-
|
|
448
|
-
enhanced
|
|
449
|
-
end
|
|
185
|
+
chunking = Kreuzberg::Config::Chunking.new(
|
|
186
|
+
enabled: true,
|
|
187
|
+
chunk_size: 1000, # Characters per chunk
|
|
188
|
+
chunk_overlap: 200, # Overlap between chunks
|
|
189
|
+
embedding: {
|
|
190
|
+
model: { type: :preset, name: "balanced" },
|
|
191
|
+
normalize: true
|
|
192
|
+
}
|
|
193
|
+
)
|
|
450
194
|
|
|
451
|
-
|
|
195
|
+
config = Kreuzberg::Config::Extraction.new(chunking: chunking)
|
|
196
|
+
result = Kreuzberg.extract_file_sync("long_document.pdf", config: config)
|
|
452
197
|
|
|
453
|
-
|
|
454
|
-
|
|
455
|
-
|
|
456
|
-
.split("\n\n")
|
|
457
|
-
.map { |paragraph| paragraph.length > 100 ? "## #{paragraph[0..30]}...\n\n#{paragraph}" : paragraph }
|
|
458
|
-
.join("\n\n")
|
|
459
|
-
end
|
|
198
|
+
result.chunks.each do |chunk|
|
|
199
|
+
puts "Chunk: #{chunk.content}"
|
|
200
|
+
puts "Tokens: #{chunk.token_count}"
|
|
460
201
|
end
|
|
461
|
-
|
|
462
|
-
# Use custom post-processor in configuration
|
|
463
|
-
processor = MarkdownEnhancerPostProcessor.new
|
|
464
|
-
postprocessor_config = Kreuzberg::Config::PostProcessor.new(enabled: true)
|
|
465
|
-
config = Kreuzberg::Config::Extraction.new(postprocessor: postprocessor_config)
|
|
466
|
-
|
|
467
|
-
result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
|
|
468
|
-
puts result.content
|
|
469
202
|
```
|
|
470
203
|
|
|
471
|
-
###
|
|
472
|
-
|
|
473
|
-
Create and register validators to ensure extraction quality:
|
|
204
|
+
### HTML Conversion Options
|
|
474
205
|
|
|
475
206
|
```ruby
|
|
476
|
-
|
|
477
|
-
|
|
478
|
-
|
|
479
|
-
|
|
480
|
-
|
|
481
|
-
|
|
482
|
-
MIN_CONTENT_LENGTH = 100
|
|
483
|
-
MIN_METADATA_FIELDS = 2
|
|
484
|
-
|
|
485
|
-
def call(result)
|
|
486
|
-
# Validate extracted content meets quality standards
|
|
487
|
-
content = result['content'].to_s
|
|
488
|
-
metadata = result['metadata'].to_h
|
|
489
|
-
|
|
490
|
-
if content.length < MIN_CONTENT_LENGTH
|
|
491
|
-
raise Kreuzberg::Errors::ValidationError,
|
|
492
|
-
"Content too short: #{content.length} bytes (minimum: #{MIN_CONTENT_LENGTH})"
|
|
493
|
-
end
|
|
494
|
-
|
|
495
|
-
if metadata.length < MIN_METADATA_FIELDS
|
|
496
|
-
raise Kreuzberg::Errors::ValidationError,
|
|
497
|
-
"Insufficient metadata: #{metadata.length} fields (minimum: #{MIN_METADATA_FIELDS})"
|
|
498
|
-
end
|
|
499
|
-
|
|
500
|
-
# Validation passed
|
|
501
|
-
nil
|
|
502
|
-
end
|
|
503
|
-
end
|
|
504
|
-
|
|
505
|
-
# Use validator in extraction workflow
|
|
506
|
-
validator = ContentQualityValidator.new
|
|
507
|
-
config = Kreuzberg::Config::Extraction.new(enable_quality_processing: true)
|
|
207
|
+
html_options = Kreuzberg::Config::HtmlOptions.new(
|
|
208
|
+
heading_style: :atx_closed,
|
|
209
|
+
wrap: true,
|
|
210
|
+
wrap_width: 100,
|
|
211
|
+
preprocessing: { enabled: true, preset: :standard }
|
|
212
|
+
)
|
|
508
213
|
|
|
509
|
-
|
|
510
|
-
|
|
511
|
-
validator.call(result.to_h)
|
|
512
|
-
puts "Extraction passed quality validation"
|
|
513
|
-
rescue Kreuzberg::Errors::ValidationError => e
|
|
514
|
-
puts "Validation failed: #{e.message}"
|
|
515
|
-
end
|
|
214
|
+
config = Kreuzberg::Config::Extraction.new(html_options: html_options)
|
|
215
|
+
result = Kreuzberg.extract_file_sync("page.html", config: config)
|
|
516
216
|
```
|
|
517
217
|
|
|
518
|
-
###
|
|
519
|
-
|
|
520
|
-
Load configuration from TOML, YAML, or JSON files:
|
|
218
|
+
### Keyword Extraction
|
|
521
219
|
|
|
522
220
|
```ruby
|
|
523
|
-
|
|
221
|
+
keywords = Kreuzberg::Config::Keywords.new(
|
|
222
|
+
algorithm: :yake,
|
|
223
|
+
max_keywords: 8,
|
|
224
|
+
min_score: 0.2,
|
|
225
|
+
ngram_range: [1, 3]
|
|
226
|
+
)
|
|
524
227
|
|
|
525
|
-
|
|
526
|
-
|
|
527
|
-
config = Kreuzberg::Config::Extraction.from_file('config/kreuzberg.toml')
|
|
528
|
-
|
|
529
|
-
# Example: config/kreuzberg.toml
|
|
530
|
-
# use_cache = true
|
|
531
|
-
# force_ocr = false
|
|
532
|
-
# enable_quality_processing = true
|
|
533
|
-
#
|
|
534
|
-
# [chunking]
|
|
535
|
-
# max_chars = 1024
|
|
536
|
-
# max_overlap = 256
|
|
537
|
-
#
|
|
538
|
-
# [ocr]
|
|
539
|
-
# backend = "tesseract"
|
|
540
|
-
# language = "eng"
|
|
541
|
-
#
|
|
542
|
-
# [language_detection]
|
|
543
|
-
# enabled = true
|
|
544
|
-
# min_confidence = 0.7
|
|
545
|
-
|
|
546
|
-
result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
|
|
547
|
-
puts "Extracted with config from file"
|
|
548
|
-
|
|
549
|
-
# Auto-discover configuration in project hierarchy
|
|
550
|
-
discovered_config = Kreuzberg::Config::Extraction.discover
|
|
551
|
-
if discovered_config
|
|
552
|
-
puts "Found configuration at project root"
|
|
553
|
-
result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: discovered_config)
|
|
554
|
-
else
|
|
555
|
-
puts "No configuration file found, using defaults"
|
|
556
|
-
result = Kreuzberg.extract_file_sync(path: 'document.pdf')
|
|
557
|
-
end
|
|
228
|
+
config = Kreuzberg::Config::Extraction.new(keywords: keywords)
|
|
229
|
+
result = Kreuzberg.extract_file_sync("research.pdf", config: config)
|
|
558
230
|
```
|
|
559
231
|
|
|
560
|
-
###
|
|
561
|
-
|
|
562
|
-
Use Ruby Fibers for efficient async extraction workflows:
|
|
232
|
+
### Language Detection
|
|
563
233
|
|
|
564
234
|
```ruby
|
|
565
|
-
|
|
566
|
-
|
|
567
|
-
|
|
568
|
-
|
|
569
|
-
|
|
570
|
-
Fiber.new do
|
|
571
|
-
config = Kreuzberg::Config::Extraction.new(
|
|
572
|
-
use_cache: true,
|
|
573
|
-
enable_quality_processing: true
|
|
574
|
-
)
|
|
575
|
-
|
|
576
|
-
# Extract asynchronously
|
|
577
|
-
result = Kreuzberg.extract_file(path: path, config: config)
|
|
578
|
-
|
|
579
|
-
{
|
|
580
|
-
path: path,
|
|
581
|
-
content_length: result.content.length,
|
|
582
|
-
tables: result.tables.length,
|
|
583
|
-
languages: result.detected_languages
|
|
584
|
-
}
|
|
585
|
-
end
|
|
586
|
-
end
|
|
235
|
+
lang_detection = Kreuzberg::Config::LanguageDetection.new(
|
|
236
|
+
enabled: true,
|
|
237
|
+
min_confidence: 0.8,
|
|
238
|
+
detect_multiple: true
|
|
239
|
+
)
|
|
587
240
|
|
|
588
|
-
|
|
589
|
-
|
|
590
|
-
Fiber.yield fiber.resume if fiber.alive?
|
|
591
|
-
end
|
|
241
|
+
config = Kreuzberg::Config::Extraction.new(language_detection: lang_detection)
|
|
242
|
+
result = Kreuzberg.extract_file_sync("multilingual.pdf", config: config)
|
|
592
243
|
|
|
593
|
-
|
|
244
|
+
result.detected_languages&.each do |lang|
|
|
245
|
+
puts "Language: #{lang.lang}, Confidence: #{lang.confidence}"
|
|
594
246
|
end
|
|
247
|
+
```
|
|
595
248
|
|
|
596
|
-
|
|
597
|
-
file_paths = ['document1.pdf', 'document2.docx', 'document3.xlsx']
|
|
598
|
-
results = extract_documents_async(file_paths)
|
|
249
|
+
### PDF Options
|
|
599
250
|
|
|
600
|
-
|
|
601
|
-
|
|
602
|
-
|
|
603
|
-
|
|
251
|
+
```ruby
|
|
252
|
+
pdf_options = Kreuzberg::Config::PDF.new(
|
|
253
|
+
extract_images: true,
|
|
254
|
+
image_min_size: 10000, # Minimum image size in bytes
|
|
255
|
+
password: "secret" # PDF password
|
|
256
|
+
)
|
|
604
257
|
|
|
605
|
-
|
|
258
|
+
config = Kreuzberg::Config::Extraction.new(pdf_options: pdf_options)
|
|
259
|
+
```
|
|
606
260
|
|
|
607
|
-
|
|
261
|
+
## Working with Results
|
|
608
262
|
|
|
609
263
|
```ruby
|
|
610
|
-
|
|
264
|
+
result = Kreuzberg.extract_file_sync("invoice.pdf")
|
|
611
265
|
|
|
612
|
-
#
|
|
613
|
-
|
|
614
|
-
|
|
615
|
-
|
|
266
|
+
# Access extracted text
|
|
267
|
+
puts result.content
|
|
268
|
+
|
|
269
|
+
# Access MIME type
|
|
270
|
+
puts result.mime_type
|
|
616
271
|
|
|
617
|
-
|
|
272
|
+
# Access metadata
|
|
273
|
+
puts result.metadata.inspect
|
|
618
274
|
|
|
619
275
|
# Access extracted tables
|
|
620
|
-
result.tables.
|
|
621
|
-
puts "
|
|
622
|
-
|
|
623
|
-
|
|
624
|
-
table.cells.each_with_index do |row, row_idx|
|
|
625
|
-
puts " Row #{row_idx}:"
|
|
626
|
-
row.each_with_index do |cell, col_idx|
|
|
627
|
-
puts " [#{col_idx}] #{cell}"
|
|
628
|
-
end
|
|
276
|
+
result.tables.each do |table|
|
|
277
|
+
puts "Headers: #{table.headers.join(', ')}"
|
|
278
|
+
table.rows.each do |row|
|
|
279
|
+
puts row.join(', ')
|
|
629
280
|
end
|
|
281
|
+
end
|
|
630
282
|
|
|
631
|
-
|
|
632
|
-
|
|
633
|
-
puts
|
|
283
|
+
# Access text chunks and metadata
|
|
284
|
+
result.chunks&.each do |chunk|
|
|
285
|
+
puts "Chunk #{chunk.chunk_index + 1}/#{chunk.total_chunks}"
|
|
286
|
+
puts "Chars: #{chunk.char_start}-#{chunk.char_end}"
|
|
287
|
+
puts "Embedding length: #{chunk.embedding&.length}"
|
|
634
288
|
end
|
|
635
289
|
|
|
636
|
-
#
|
|
637
|
-
|
|
638
|
-
|
|
639
|
-
|
|
640
|
-
|
|
641
|
-
if result.pages
|
|
642
|
-
result.pages.each do |page|
|
|
643
|
-
page.tables.each do |table|
|
|
644
|
-
puts "Table on page #{page.page_number}:"
|
|
645
|
-
puts " Dimensions: #{table.cells.length} rows x #{table.cells.first&.length || 0} columns"
|
|
646
|
-
end
|
|
647
|
-
end
|
|
290
|
+
# Access extracted images
|
|
291
|
+
result.images&.each do |image|
|
|
292
|
+
File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
|
|
293
|
+
puts "Image #{image.image_index} on page #{image.page_number}"
|
|
648
294
|
end
|
|
649
|
-
```
|
|
650
295
|
|
|
651
|
-
|
|
296
|
+
# Convert to hash
|
|
297
|
+
hash = result.to_h
|
|
652
298
|
|
|
653
|
-
|
|
299
|
+
# Convert to JSON
|
|
300
|
+
json = result.to_json
|
|
301
|
+
```
|
|
654
302
|
|
|
655
|
-
|
|
656
|
-
require 'kreuzberg'
|
|
303
|
+
## CLI Usage
|
|
657
304
|
|
|
658
|
-
|
|
659
|
-
image_config = Kreuzberg::Config::ImageExtraction.new(
|
|
660
|
-
extract_images: true,
|
|
661
|
-
target_dpi: 300,
|
|
662
|
-
max_image_dimension: 2000,
|
|
663
|
-
auto_adjust_dpi: true
|
|
664
|
-
)
|
|
305
|
+
Kreuzberg provides a Ruby wrapper for the CLI:
|
|
665
306
|
|
|
666
|
-
|
|
667
|
-
|
|
307
|
+
```ruby
|
|
308
|
+
# Extract content
|
|
309
|
+
output = Kreuzberg::CLI.extract("document.pdf", output: "text")
|
|
668
310
|
|
|
669
|
-
#
|
|
670
|
-
|
|
671
|
-
Dir.mkdir(output_dir) unless Dir.exist?(output_dir)
|
|
311
|
+
# Detect MIME type
|
|
312
|
+
mime_type = Kreuzberg::CLI.detect("document.pdf")
|
|
672
313
|
|
|
673
|
-
|
|
674
|
-
|
|
675
|
-
|
|
676
|
-
filepath = File.join(output_dir, filename)
|
|
314
|
+
# Get version
|
|
315
|
+
version = Kreuzberg::CLI.version
|
|
316
|
+
```
|
|
677
317
|
|
|
678
|
-
|
|
679
|
-
File.write(filepath, image.data, mode: 'wb')
|
|
318
|
+
## API Server
|
|
680
319
|
|
|
681
|
-
|
|
682
|
-
puts " Page: #{image.page_number}"
|
|
683
|
-
puts " Format: #{image.format}"
|
|
684
|
-
puts " Dimensions: #{image.width}x#{image.height}"
|
|
685
|
-
puts " Colorspace: #{image.colorspace}"
|
|
320
|
+
Start an API server (requires the kreuzberg CLI):
|
|
686
321
|
|
|
687
|
-
|
|
688
|
-
|
|
689
|
-
|
|
690
|
-
|
|
322
|
+
```ruby
|
|
323
|
+
Kreuzberg::APIProxy.run(port: 8000) do |server|
|
|
324
|
+
# Server runs in background
|
|
325
|
+
# Make HTTP requests to http://localhost:8000
|
|
691
326
|
end
|
|
692
327
|
```
|
|
693
328
|
|
|
694
|
-
|
|
329
|
+
## MCP Server
|
|
695
330
|
|
|
696
|
-
|
|
331
|
+
Start a Model Context Protocol server for Claude Desktop:
|
|
697
332
|
|
|
698
333
|
```ruby
|
|
699
|
-
|
|
700
|
-
|
|
701
|
-
# Enable language detection with confidence threshold
|
|
702
|
-
lang_detection_config = Kreuzberg::Config::LanguageDetection.new(
|
|
703
|
-
enabled: true,
|
|
704
|
-
min_confidence: 0.8,
|
|
705
|
-
detect_multiple: true
|
|
706
|
-
)
|
|
707
|
-
|
|
708
|
-
config = Kreuzberg::Config::Extraction.new(
|
|
709
|
-
language_detection: lang_detection_config
|
|
710
|
-
)
|
|
711
|
-
|
|
712
|
-
result = Kreuzberg.extract_file_sync(path: 'multilingual.pdf', config: config)
|
|
334
|
+
server = Kreuzberg::MCPProxy::Server.new(transport: 'stdio')
|
|
335
|
+
server.start
|
|
713
336
|
|
|
714
|
-
#
|
|
715
|
-
|
|
716
|
-
puts "All detected languages: #{result.detected_languages.join(', ')}"
|
|
717
|
-
|
|
718
|
-
# Access language from metadata
|
|
719
|
-
if result.metadata.is_a?(Hash)
|
|
720
|
-
puts "Language from metadata: #{result.metadata['language']}"
|
|
721
|
-
end
|
|
337
|
+
# Use with Claude Desktop integration
|
|
338
|
+
```
|
|
722
339
|
|
|
723
|
-
|
|
724
|
-
keywords_config = Kreuzberg::Config::Keywords.new(
|
|
725
|
-
algorithm: 'yake',
|
|
726
|
-
language: 'de', # German keywords
|
|
727
|
-
max_keywords: 10
|
|
728
|
-
)
|
|
340
|
+
## Cache Management
|
|
729
341
|
|
|
730
|
-
|
|
731
|
-
|
|
732
|
-
|
|
733
|
-
|
|
342
|
+
```ruby
|
|
343
|
+
# Get cache statistics
|
|
344
|
+
stats = Kreuzberg.cache_stats
|
|
345
|
+
puts "Entries: #{stats[:total_entries]}"
|
|
346
|
+
puts "Size: #{stats[:total_size_bytes]} bytes"
|
|
734
347
|
|
|
735
|
-
|
|
736
|
-
|
|
348
|
+
# Clear cache
|
|
349
|
+
Kreuzberg.clear_cache
|
|
737
350
|
```
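The statistics above can drive a simple cleanup policy. The sketch below is an illustration only: `cache_over_budget?` is a hypothetical helper, the 512 MiB budget is arbitrary, and a plain hash stands in for the value returned by `Kreuzberg.cache_stats` so the snippet runs without the gem.

```ruby
# Returns true when the reported cache size exceeds max_bytes.
# `stats` is any hash with the :total_size_bytes key used above;
# a missing key is treated as an empty cache.
def cache_over_budget?(stats, max_bytes:)
  stats.fetch(:total_size_bytes, 0) > max_bytes
end

# Stand-in for the hash returned by Kreuzberg.cache_stats.
stats = { total_entries: 42, total_size_bytes: 750 * 1024 * 1024 }

if cache_over_budget?(stats, max_bytes: 512 * 1024 * 1024)
  # With the gem loaded, this is where Kreuzberg.clear_cache would be called.
  puts "Cache over budget, clearing"
end
```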
|
|
738
351
|
|
|
739
|
-
##
|
|
740
|
-
|
|
741
|
-
Process multiple documents efficiently:
|
|
352
|
+
## Error Handling
|
|
742
353
|
|
|
743
354
|
```ruby
|
|
744
|
-
|
|
355
|
+
begin
|
|
356
|
+
result = Kreuzberg.extract_file_sync("document.pdf")
|
|
357
|
+
rescue Kreuzberg::Errors::ParsingError => e
|
|
358
|
+
puts "Parsing failed: #{e.message}"
|
|
359
|
+
puts "Context: #{e.context}"
|
|
360
|
+
rescue Kreuzberg::Errors::OCRError => e
|
|
361
|
+
puts "OCR failed: #{e.message}"
|
|
362
|
+
rescue Kreuzberg::Errors::MissingDependencyError => e
|
|
363
|
+
puts "Missing dependency: #{e.dependency}"
|
|
364
|
+
rescue Kreuzberg::Errors::Error => e
|
|
365
|
+
puts "Kreuzberg error: #{e.message}"
|
|
366
|
+
end
|
|
367
|
+
```
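The rescue chain above can be combined with a small retry helper when a failure may be temporary. This is a sketch, not part of Kreuzberg: `with_retries` and `TransientError` are hypothetical names, and the stand-in error class lets the snippet run without the gem. In real code you would pass whichever `Kreuzberg::Errors` class you consider retryable via `on:`.

```ruby
# Hypothetical stand-in for an error class you consider retryable.
class TransientError < StandardError; end

# Runs the block, retrying on `on` errors until `attempts` tries have been made.
# Re-raises the last error once the attempts are exhausted.
def with_retries(attempts: 3, on: TransientError, base_delay: 0.0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue on
    raise if tries >= attempts

    sleep(base_delay * tries) # linear backoff; 0.0 disables waiting
    retry
  end
end

# Usage: the block fails twice, then succeeds on the third attempt.
value = with_retries(attempts: 3) do |try|
  raise TransientError, "attempt #{try} failed" if try < 3

  "ok after #{try} tries"
end
puts value
```

Errors that are not instances of the `on:` class propagate immediately, so the more specific `rescue` clauses shown above still apply around the helper.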
|
|
745
368
|
|
|
746
|
-
|
|
747
|
-
puts "FFI bindings loaded successfully"
|
|
369
|
+
## Supported Formats
|
|
748
370
|
|
|
749
|
-
|
|
750
|
-
|
|
751
|
-
|
|
371
|
+
- **Documents**: PDF, DOCX, DOC, PPTX, PPT, ODT, ODP
|
|
372
|
+
- **Spreadsheets**: XLSX, XLS, ODS, CSV
|
|
373
|
+
- **Images**: PNG, JPEG, TIFF, BMP, GIF
|
|
374
|
+
- **Web**: HTML, MHTML, Markdown
|
|
375
|
+
- **Data**: JSON, YAML, TOML, XML
|
|
376
|
+
- **Email**: EML, MSG
|
|
377
|
+
- **Archives**: ZIP, TAR, 7Z
|
|
378
|
+
- **Text**: TXT, RTF, MD
|
|
752
379
|
|
|
753
|
-
##
|
|
380
|
+
## Performance
|
|
754
381
|
|
|
755
|
-
|
|
382
|
+
Kreuzberg's Rust core provides significant performance improvements:
|
|
756
383
|
|
|
757
|
-
**
|
|
384
|
+
- **PDF extraction**: 10-50x faster than pure Ruby solutions
|
|
385
|
+
- **Batch processing**: Parallel extraction with Tokio async runtime
|
|
386
|
+
- **Memory efficient**: Streaming parsers for large files
|
|
387
|
+
- **Caching**: Automatic result caching for repeated extractions
|
|
758
388
|
|
|
759
|
-
##
|
|
389
|
+
## Development
|
|
760
390
|
|
|
761
|
-
|
|
762
|
-
|
|
763
|
-
|
|
391
|
+
```bash
|
|
392
|
+
# Clone the repository
|
|
393
|
+
git clone https://github.com/Goldziher/kreuzberg.git
|
|
394
|
+
cd kreuzberg/packages/ruby
|
|
764
395
|
|
|
765
|
-
|
|
396
|
+
# Install dependencies
|
|
397
|
+
bundle install
|
|
766
398
|
|
|
767
|
-
|
|
399
|
+
# Build the Rust extension
|
|
400
|
+
bundle exec rake compile
|
|
768
401
|
|
|
769
|
-
|
|
402
|
+
# Run tests
|
|
403
|
+
bundle exec rspec
|
|
770
404
|
|
|
771
|
-
|
|
405
|
+
# Run RuboCop
|
|
406
|
+
bundle exec rubocop
|
|
407
|
+
```
|
|
772
408
|
|
|
773
409
|
## License
|
|
774
410
|
|
|
775
|
-
MIT License
|
|
411
|
+
MIT License. See [LICENSE](../../LICENSE) for details.
|
|
412
|
+
|
|
413
|
+
## Contributing
|
|
414
|
+
|
|
415
|
+
Contributions are welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.
|
|
776
416
|
|
|
777
|
-
##
|
|
417
|
+
## Links
|
|
778
418
|
|
|
779
|
-
- **
|
|
780
|
-
- **GitHub
|
|
781
|
-
- **
|
|
419
|
+
- **Documentation**: https://docs.kreuzberg.dev
|
|
420
|
+
- **GitHub**: https://github.com/Goldziher/kreuzberg
|
|
421
|
+
- **Issues**: https://github.com/Goldziher/kreuzberg/issues
|