RubyGems - kreuzberg - Versions diffs - 4.0.0.pre.rc.29 → 4.0.0.rc1 - Mend

kreuzberg 4.0.0.pre.rc.29 → 4.0.0.rc1

This diff compares the contents of two publicly released package versions as they appear in their public registry. It is provided for informational purposes only.

Files changed (321)

checksums.yaml +4 -4
data/.gitignore +0 -6
data/.rubocop.yaml +534 -1
data/Gemfile +2 -1
data/Gemfile.lock +28 -116
data/README.md +269 -629
data/Rakefile +0 -9
data/Steepfile +4 -8
data/examples/async_patterns.rb +58 -1
data/ext/kreuzberg_rb/extconf.rb +5 -35
data/ext/kreuzberg_rb/native/Cargo.toml +16 -55
data/ext/kreuzberg_rb/native/build.rs +14 -12
data/ext/kreuzberg_rb/native/include/ieeefp.h +1 -1
data/ext/kreuzberg_rb/native/include/msvc_compat/strings.h +1 -1
data/ext/kreuzberg_rb/native/include/strings.h +2 -2
data/ext/kreuzberg_rb/native/include/unistd.h +1 -1
data/ext/kreuzberg_rb/native/src/lib.rs +34 -897
data/extconf.rb +6 -38
data/kreuzberg.gemspec +20 -114
data/lib/kreuzberg/api_proxy.rb +18 -2
data/lib/kreuzberg/cache_api.rb +0 -22
data/lib/kreuzberg/cli.rb +10 -2
data/lib/kreuzberg/cli_proxy.rb +10 -0
data/lib/kreuzberg/config.rb +22 -274
data/lib/kreuzberg/errors.rb +7 -73
data/lib/kreuzberg/extraction_api.rb +8 -237
data/lib/kreuzberg/mcp_proxy.rb +11 -2
data/lib/kreuzberg/ocr_backend_protocol.rb +73 -0
data/lib/kreuzberg/post_processor_protocol.rb +71 -0
data/lib/kreuzberg/result.rb +33 -151
data/lib/kreuzberg/setup_lib_path.rb +2 -22
data/lib/kreuzberg/validator_protocol.rb +73 -0
data/lib/kreuzberg/version.rb +1 -1
data/lib/kreuzberg.rb +13 -27
data/pkg/kreuzberg-4.0.0.rc1.gem +0 -0
data/sig/kreuzberg.rbs +12 -105
data/spec/binding/cache_spec.rb +22 -22
data/spec/binding/cli_proxy_spec.rb +4 -2
data/spec/binding/cli_spec.rb +11 -12
data/spec/binding/config_spec.rb +0 -74
data/spec/binding/config_validation_spec.rb +6 -100
data/spec/binding/error_handling_spec.rb +97 -283
data/spec/binding/plugins/ocr_backend_spec.rb +8 -8
data/spec/binding/plugins/postprocessor_spec.rb +11 -11
data/spec/binding/plugins/validator_spec.rb +13 -12
data/spec/examples.txt +104 -0
data/spec/fixtures/config.toml +1 -0
data/spec/fixtures/config.yaml +1 -0
data/spec/fixtures/invalid_config.toml +1 -0
data/spec/smoke/package_spec.rb +3 -2
data/spec/spec_helper.rb +3 -1
data/vendor/kreuzberg/Cargo.toml +67 -192
data/vendor/kreuzberg/README.md +9 -97
data/vendor/kreuzberg/build.rs +194 -516
data/vendor/kreuzberg/src/api/handlers.rs +9 -130
data/vendor/kreuzberg/src/api/mod.rs +3 -18
data/vendor/kreuzberg/src/api/server.rs +71 -236
data/vendor/kreuzberg/src/api/types.rs +7 -43
data/vendor/kreuzberg/src/bin/profile_extract.rs +455 -0
data/vendor/kreuzberg/src/cache/mod.rs +3 -27
data/vendor/kreuzberg/src/chunking/mod.rs +79 -1705
data/vendor/kreuzberg/src/core/batch_mode.rs +0 -60
data/vendor/kreuzberg/src/core/config.rs +23 -905
data/vendor/kreuzberg/src/core/extractor.rs +106 -403
data/vendor/kreuzberg/src/core/io.rs +2 -4
data/vendor/kreuzberg/src/core/mime.rs +12 -2
data/vendor/kreuzberg/src/core/mod.rs +3 -22
data/vendor/kreuzberg/src/core/pipeline.rs +78 -395
data/vendor/kreuzberg/src/embeddings.rs +21 -169
data/vendor/kreuzberg/src/error.rs +2 -2
data/vendor/kreuzberg/src/extraction/archive.rs +31 -36
data/vendor/kreuzberg/src/extraction/docx.rs +1 -365
data/vendor/kreuzberg/src/extraction/email.rs +11 -12
data/vendor/kreuzberg/src/extraction/excel.rs +129 -138
data/vendor/kreuzberg/src/extraction/html.rs +170 -1447
data/vendor/kreuzberg/src/extraction/image.rs +14 -138
data/vendor/kreuzberg/src/extraction/libreoffice.rs +3 -13
data/vendor/kreuzberg/src/extraction/mod.rs +5 -21
data/vendor/kreuzberg/src/extraction/office_metadata/mod.rs +0 -2
data/vendor/kreuzberg/src/extraction/pandoc/batch.rs +275 -0
data/vendor/kreuzberg/src/extraction/pandoc/mime_types.rs +178 -0
data/vendor/kreuzberg/src/extraction/pandoc/mod.rs +491 -0
data/vendor/kreuzberg/src/extraction/pandoc/server.rs +496 -0
data/vendor/kreuzberg/src/extraction/pandoc/subprocess.rs +1188 -0
data/vendor/kreuzberg/src/extraction/pandoc/version.rs +162 -0
data/vendor/kreuzberg/src/extraction/pptx.rs +94 -196
data/vendor/kreuzberg/src/extraction/structured.rs +4 -5
data/vendor/kreuzberg/src/extraction/table.rs +1 -2
data/vendor/kreuzberg/src/extraction/text.rs +10 -18
data/vendor/kreuzberg/src/extractors/archive.rs +0 -22
data/vendor/kreuzberg/src/extractors/docx.rs +148 -69
data/vendor/kreuzberg/src/extractors/email.rs +9 -37
data/vendor/kreuzberg/src/extractors/excel.rs +40 -81
data/vendor/kreuzberg/src/extractors/html.rs +173 -182
data/vendor/kreuzberg/src/extractors/image.rs +8 -32
data/vendor/kreuzberg/src/extractors/mod.rs +10 -171
data/vendor/kreuzberg/src/extractors/pandoc.rs +201 -0
data/vendor/kreuzberg/src/extractors/pdf.rs +64 -329
data/vendor/kreuzberg/src/extractors/pptx.rs +34 -79
data/vendor/kreuzberg/src/extractors/structured.rs +0 -16
data/vendor/kreuzberg/src/extractors/text.rs +7 -30
data/vendor/kreuzberg/src/extractors/xml.rs +8 -27
data/vendor/kreuzberg/src/keywords/processor.rs +1 -9
data/vendor/kreuzberg/src/keywords/rake.rs +1 -0
data/vendor/kreuzberg/src/language_detection/mod.rs +51 -94
data/vendor/kreuzberg/src/lib.rs +5 -17
data/vendor/kreuzberg/src/mcp/mod.rs +1 -4
data/vendor/kreuzberg/src/mcp/server.rs +21 -145
data/vendor/kreuzberg/src/ocr/mod.rs +0 -2
data/vendor/kreuzberg/src/ocr/processor.rs +8 -19
data/vendor/kreuzberg/src/ocr/tesseract_backend.rs +0 -2
data/vendor/kreuzberg/src/pdf/error.rs +1 -93
data/vendor/kreuzberg/src/pdf/metadata.rs +100 -263
data/vendor/kreuzberg/src/pdf/mod.rs +2 -33
data/vendor/kreuzberg/src/pdf/rendering.rs +12 -12
data/vendor/kreuzberg/src/pdf/table.rs +64 -61
data/vendor/kreuzberg/src/pdf/text.rs +24 -416
data/vendor/kreuzberg/src/plugins/extractor.rs +8 -40
data/vendor/kreuzberg/src/plugins/mod.rs +0 -3
data/vendor/kreuzberg/src/plugins/ocr.rs +14 -22
data/vendor/kreuzberg/src/plugins/processor.rs +1 -10
data/vendor/kreuzberg/src/plugins/registry.rs +0 -15
data/vendor/kreuzberg/src/plugins/validator.rs +8 -20
data/vendor/kreuzberg/src/stopwords/mod.rs +2 -2
data/vendor/kreuzberg/src/text/mod.rs +0 -8
data/vendor/kreuzberg/src/text/quality.rs +15 -28
data/vendor/kreuzberg/src/text/string_utils.rs +10 -22
data/vendor/kreuzberg/src/text/token_reduction/core.rs +50 -86
data/vendor/kreuzberg/src/text/token_reduction/filters.rs +16 -37
data/vendor/kreuzberg/src/text/token_reduction/simd_text.rs +1 -2
data/vendor/kreuzberg/src/types.rs +67 -907
data/vendor/kreuzberg/src/utils/mod.rs +0 -14
data/vendor/kreuzberg/src/utils/quality.rs +3 -12
data/vendor/kreuzberg/tests/api_tests.rs +0 -506
data/vendor/kreuzberg/tests/archive_integration.rs +0 -2
data/vendor/kreuzberg/tests/batch_orchestration.rs +12 -57
data/vendor/kreuzberg/tests/batch_processing.rs +8 -32
data/vendor/kreuzberg/tests/chunking_offset_demo.rs +92 -0
data/vendor/kreuzberg/tests/concurrency_stress.rs +8 -40
data/vendor/kreuzberg/tests/config_features.rs +1 -33
data/vendor/kreuzberg/tests/config_loading_tests.rs +39 -16
data/vendor/kreuzberg/tests/core_integration.rs +9 -35
data/vendor/kreuzberg/tests/csv_integration.rs +81 -71
data/vendor/kreuzberg/tests/docx_metadata_extraction_test.rs +25 -23
data/vendor/kreuzberg/tests/email_integration.rs +1 -3
data/vendor/kreuzberg/tests/error_handling.rs +34 -43
data/vendor/kreuzberg/tests/format_integration.rs +1 -7
data/vendor/kreuzberg/tests/helpers/mod.rs +0 -60
data/vendor/kreuzberg/tests/image_integration.rs +0 -2
data/vendor/kreuzberg/tests/mime_detection.rs +16 -17
data/vendor/kreuzberg/tests/ocr_configuration.rs +0 -4
data/vendor/kreuzberg/tests/ocr_errors.rs +0 -22
data/vendor/kreuzberg/tests/ocr_quality.rs +0 -2
data/vendor/kreuzberg/tests/pandoc_integration.rs +503 -0
data/vendor/kreuzberg/tests/pdf_integration.rs +0 -2
data/vendor/kreuzberg/tests/pipeline_integration.rs +2 -36
data/vendor/kreuzberg/tests/plugin_ocr_backend_test.rs +0 -5
data/vendor/kreuzberg/tests/plugin_postprocessor_test.rs +1 -17
data/vendor/kreuzberg/tests/plugin_system.rs +0 -6
data/vendor/kreuzberg/tests/registry_integration_tests.rs +22 -2
data/vendor/kreuzberg/tests/security_validation.rs +1 -13
data/vendor/kreuzberg/tests/test_fastembed.rs +23 -45
metadata +25 -171
data/.rubocop.yml +0 -543
data/ext/kreuzberg_rb/native/.cargo/config.toml +0 -23
data/ext/kreuzberg_rb/native/Cargo.lock +0 -7619
data/lib/kreuzberg/error_context.rb +0 -136
data/lib/kreuzberg/types.rb +0 -170
data/lib/libpdfium.so +0 -0
data/spec/binding/async_operations_spec.rb +0 -473
data/spec/binding/batch_operations_spec.rb +0 -595
data/spec/binding/batch_spec.rb +0 -359
data/spec/binding/config_result_spec.rb +0 -377
data/spec/binding/embeddings_spec.rb +0 -816
data/spec/binding/error_recovery_spec.rb +0 -488
data/spec/binding/font_config_spec.rb +0 -220
data/spec/binding/images_spec.rb +0 -738
data/spec/binding/keywords_extraction_spec.rb +0 -600
data/spec/binding/metadata_types_spec.rb +0 -1228
data/spec/binding/pages_extraction_spec.rb +0 -471
data/spec/binding/tables_spec.rb +0 -641
data/spec/unit/config/chunking_config_spec.rb +0 -213
data/spec/unit/config/embedding_config_spec.rb +0 -343
data/spec/unit/config/extraction_config_spec.rb +0 -438
data/spec/unit/config/font_config_spec.rb +0 -285
data/spec/unit/config/hierarchy_config_spec.rb +0 -314
data/spec/unit/config/image_extraction_config_spec.rb +0 -209
data/spec/unit/config/image_preprocessing_config_spec.rb +0 -249
data/spec/unit/config/keyword_config_spec.rb +0 -229
data/spec/unit/config/language_detection_config_spec.rb +0 -258
data/spec/unit/config/ocr_config_spec.rb +0 -171
data/spec/unit/config/page_config_spec.rb +0 -221
data/spec/unit/config/pdf_config_spec.rb +0 -267
data/spec/unit/config/postprocessor_config_spec.rb +0 -290
data/spec/unit/config/tesseract_config_spec.rb +0 -181
data/spec/unit/config/token_reduction_config_spec.rb +0 -251
data/test/metadata_types_test.rb +0 -959
data/vendor/Cargo.toml +0 -61
data/vendor/kreuzberg/examples/bench_fixes.rs +0 -71
data/vendor/kreuzberg/examples/test_pdfium_fork.rs +0 -62
data/vendor/kreuzberg/src/chunking/processor.rs +0 -219
data/vendor/kreuzberg/src/core/batch_optimizations.rs +0 -385
data/vendor/kreuzberg/src/core/config_validation.rs +0 -949
data/vendor/kreuzberg/src/core/formats.rs +0 -235
data/vendor/kreuzberg/src/core/server_config.rs +0 -1220
data/vendor/kreuzberg/src/extraction/capacity.rs +0 -263
data/vendor/kreuzberg/src/extraction/markdown.rs +0 -216
data/vendor/kreuzberg/src/extraction/office_metadata/odt_properties.rs +0 -284
data/vendor/kreuzberg/src/extractors/bibtex.rs +0 -470
data/vendor/kreuzberg/src/extractors/docbook.rs +0 -504
data/vendor/kreuzberg/src/extractors/epub.rs +0 -696
data/vendor/kreuzberg/src/extractors/fictionbook.rs +0 -492
data/vendor/kreuzberg/src/extractors/jats.rs +0 -1054
data/vendor/kreuzberg/src/extractors/jupyter.rs +0 -368
data/vendor/kreuzberg/src/extractors/latex.rs +0 -653
data/vendor/kreuzberg/src/extractors/markdown.rs +0 -701
data/vendor/kreuzberg/src/extractors/odt.rs +0 -628
data/vendor/kreuzberg/src/extractors/opml.rs +0 -635
data/vendor/kreuzberg/src/extractors/orgmode.rs +0 -529
data/vendor/kreuzberg/src/extractors/rst.rs +0 -577
data/vendor/kreuzberg/src/extractors/rtf.rs +0 -809
data/vendor/kreuzberg/src/extractors/security.rs +0 -484
data/vendor/kreuzberg/src/extractors/security_tests.rs +0 -367
data/vendor/kreuzberg/src/extractors/typst.rs +0 -651
data/vendor/kreuzberg/src/language_detection/processor.rs +0 -218
data/vendor/kreuzberg/src/ocr/language_registry.rs +0 -520
data/vendor/kreuzberg/src/panic_context.rs +0 -154
data/vendor/kreuzberg/src/pdf/bindings.rs +0 -306
data/vendor/kreuzberg/src/pdf/bundled.rs +0 -408
data/vendor/kreuzberg/src/pdf/fonts.rs +0 -358
data/vendor/kreuzberg/src/pdf/hierarchy.rs +0 -903
data/vendor/kreuzberg/src/text/quality_processor.rs +0 -231
data/vendor/kreuzberg/src/text/utf8_validation.rs +0 -193
data/vendor/kreuzberg/src/utils/pool.rs +0 -503
data/vendor/kreuzberg/src/utils/pool_sizing.rs +0 -364
data/vendor/kreuzberg/src/utils/string_pool.rs +0 -761
data/vendor/kreuzberg/tests/api_embed.rs +0 -360
data/vendor/kreuzberg/tests/api_extract_multipart.rs +0 -52
data/vendor/kreuzberg/tests/api_large_pdf_extraction.rs +0 -471
data/vendor/kreuzberg/tests/api_large_pdf_extraction_diagnostics.rs +0 -289
data/vendor/kreuzberg/tests/batch_pooling_benchmark.rs +0 -154
data/vendor/kreuzberg/tests/bibtex_parity_test.rs +0 -421
data/vendor/kreuzberg/tests/config_integration_test.rs +0 -753
data/vendor/kreuzberg/tests/data/hierarchy_ground_truth.json +0 -294
data/vendor/kreuzberg/tests/docbook_extractor_tests.rs +0 -500
data/vendor/kreuzberg/tests/docx_vs_pandoc_comparison.rs +0 -370
data/vendor/kreuzberg/tests/epub_native_extractor_tests.rs +0 -275
data/vendor/kreuzberg/tests/fictionbook_extractor_tests.rs +0 -228
data/vendor/kreuzberg/tests/html_table_test.rs +0 -551
data/vendor/kreuzberg/tests/instrumentation_test.rs +0 -139
data/vendor/kreuzberg/tests/jats_extractor_tests.rs +0 -639
data/vendor/kreuzberg/tests/jupyter_extractor_tests.rs +0 -704
data/vendor/kreuzberg/tests/latex_extractor_tests.rs +0 -496
data/vendor/kreuzberg/tests/markdown_extractor_tests.rs +0 -490
data/vendor/kreuzberg/tests/ocr_language_registry.rs +0 -191
data/vendor/kreuzberg/tests/odt_extractor_tests.rs +0 -674
data/vendor/kreuzberg/tests/opml_extractor_tests.rs +0 -616
data/vendor/kreuzberg/tests/orgmode_extractor_tests.rs +0 -822
data/vendor/kreuzberg/tests/page_markers.rs +0 -297
data/vendor/kreuzberg/tests/pdf_hierarchy_detection.rs +0 -301
data/vendor/kreuzberg/tests/pdf_hierarchy_quality.rs +0 -589
data/vendor/kreuzberg/tests/pdf_ocr_triggering.rs +0 -301
data/vendor/kreuzberg/tests/pdf_text_merging.rs +0 -475
data/vendor/kreuzberg/tests/pdfium_linking.rs +0 -340
data/vendor/kreuzberg/tests/rst_extractor_tests.rs +0 -694
data/vendor/kreuzberg/tests/rtf_extractor_tests.rs +0 -775
data/vendor/kreuzberg/tests/typst_behavioral_tests.rs +0 -1260
data/vendor/kreuzberg/tests/typst_extractor_tests.rs +0 -648
data/vendor/kreuzberg-ffi/Cargo.toml +0 -67
data/vendor/kreuzberg-ffi/README.md +0 -851
data/vendor/kreuzberg-ffi/benches/result_view_benchmark.rs +0 -227
data/vendor/kreuzberg-ffi/build.rs +0 -168
data/vendor/kreuzberg-ffi/cbindgen.toml +0 -37
data/vendor/kreuzberg-ffi/kreuzberg-ffi.pc.in +0 -12
data/vendor/kreuzberg-ffi/kreuzberg.h +0 -3012
data/vendor/kreuzberg-ffi/src/batch_streaming.rs +0 -588
data/vendor/kreuzberg-ffi/src/config.rs +0 -1341
data/vendor/kreuzberg-ffi/src/error.rs +0 -901
data/vendor/kreuzberg-ffi/src/extraction.rs +0 -555
data/vendor/kreuzberg-ffi/src/helpers.rs +0 -879
data/vendor/kreuzberg-ffi/src/lib.rs +0 -977
data/vendor/kreuzberg-ffi/src/memory.rs +0 -493
data/vendor/kreuzberg-ffi/src/mime.rs +0 -329
data/vendor/kreuzberg-ffi/src/panic_shield.rs +0 -265
data/vendor/kreuzberg-ffi/src/plugins/document_extractor.rs +0 -442
data/vendor/kreuzberg-ffi/src/plugins/mod.rs +0 -14
data/vendor/kreuzberg-ffi/src/plugins/ocr_backend.rs +0 -628
data/vendor/kreuzberg-ffi/src/plugins/post_processor.rs +0 -438
data/vendor/kreuzberg-ffi/src/plugins/validator.rs +0 -329
data/vendor/kreuzberg-ffi/src/result.rs +0 -510
data/vendor/kreuzberg-ffi/src/result_pool.rs +0 -639
data/vendor/kreuzberg-ffi/src/result_view.rs +0 -773
data/vendor/kreuzberg-ffi/src/string_intern.rs +0 -568
data/vendor/kreuzberg-ffi/src/types.rs +0 -363
data/vendor/kreuzberg-ffi/src/util.rs +0 -210
data/vendor/kreuzberg-ffi/src/validation.rs +0 -848
data/vendor/kreuzberg-ffi/tests.disabled/README.md +0 -48
data/vendor/kreuzberg-ffi/tests.disabled/config_loading_tests.rs +0 -299
data/vendor/kreuzberg-ffi/tests.disabled/config_tests.rs +0 -346
data/vendor/kreuzberg-ffi/tests.disabled/extractor_tests.rs +0 -232
data/vendor/kreuzberg-ffi/tests.disabled/plugin_registration_tests.rs +0 -470
data/vendor/kreuzberg-tesseract/.commitlintrc.json +0 -13
data/vendor/kreuzberg-tesseract/.crate-ignore +0 -2
data/vendor/kreuzberg-tesseract/Cargo.lock +0 -2933
data/vendor/kreuzberg-tesseract/Cargo.toml +0 -57
data/vendor/kreuzberg-tesseract/LICENSE +0 -22
data/vendor/kreuzberg-tesseract/README.md +0 -399
data/vendor/kreuzberg-tesseract/build.rs +0 -1127
data/vendor/kreuzberg-tesseract/patches/README.md +0 -71
data/vendor/kreuzberg-tesseract/patches/tesseract.diff +0 -199
data/vendor/kreuzberg-tesseract/src/api.rs +0 -1371
data/vendor/kreuzberg-tesseract/src/choice_iterator.rs +0 -77
data/vendor/kreuzberg-tesseract/src/enums.rs +0 -297
data/vendor/kreuzberg-tesseract/src/error.rs +0 -81
data/vendor/kreuzberg-tesseract/src/lib.rs +0 -145
data/vendor/kreuzberg-tesseract/src/monitor.rs +0 -57
data/vendor/kreuzberg-tesseract/src/mutable_iterator.rs +0 -197
data/vendor/kreuzberg-tesseract/src/page_iterator.rs +0 -253
data/vendor/kreuzberg-tesseract/src/result_iterator.rs +0 -286
data/vendor/kreuzberg-tesseract/src/result_renderer.rs +0 -183
data/vendor/kreuzberg-tesseract/tests/integration_test.rs +0 -211

data/README.md CHANGED

@@ -1,781 +1,421 @@
-# Ruby
-<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
-  <!-- Language Bindings -->
-  <a href="https://crates.io/crates/kreuzberg">
-    <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
-  </a>
-  <a href="https://hex.pm/packages/kreuzberg">
-    <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
-  </a>
-  <a href="https://pypi.org/project/kreuzberg/">
-    <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
-  </a>
-  <a href="https://www.npmjs.com/package/@kreuzberg/node">
-    <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
-  </a>
-  <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
-    <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
-  </a>
-<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
-    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
-  </a>
-  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
-    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
-  </a>
-  <a href="https://www.nuget.org/packages/Kreuzberg/">
-    <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
-  </a>
-  <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
-    <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
-  </a>
-  <a href="https://rubygems.org/gems/kreuzberg">
-    <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
-  </a>
-<!-- Project Info -->
-<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
-    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
-  </a>
-  <a href="https://docs.kreuzberg.dev">
-    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
-  </a>
-</div>
-<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
-<div align="center" style="margin-top: 20px;">
-  <a href="https://discord.gg/pXxagNK2zN">
-      <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
-  </a>
-</div>
-Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Ruby bindings with idiomatic Ruby API and native performance.
-> **Version 4.0.0 Release Candidate**
-> Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
-> This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
+# Kreuzberg for Ruby
-## Installation
+[![RubyGems](https://img.shields.io/gem/v/kreuzberg)](https://rubygems.org/gems/kreuzberg)
+[![Crates.io](https://img.shields.io/crates/v/kreuzberg)](https://crates.io/crates/kreuzberg)
+[![PyPI](https://img.shields.io/pypi/v/kreuzberg)](https://pypi.org/project/kreuzberg/)
+[![npm](https://img.shields.io/npm/v/@goldziher/kreuzberg)](https://www.npmjs.com/package/@goldziher/kreuzberg)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev)
-### Package Installation
+High-performance document intelligence for Ruby, powered by Rust.
-Install via one of the supported package managers:
+Extract text, tables, images, and metadata from 30+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
-**gem:**
+> **🚀 Version 4.0.0 Release Candidate**
+> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/Goldziher/kreuzberg/issues) you encounter.
-```bash
-gem install kreuzberg
-```
-**Bundler:**
-```ruby
-gem 'kreuzberg'
-```
+## Features
-### System Requirements
+- **30+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
+- **OCR Support**: Built-in Tesseract OCR for scanned documents and images
+- **High Performance**: Rust-powered extraction for native-level performance
+- **Table Extraction**: Extract structured tables from documents
+- **Language Detection**: Automatic language detection for extracted text
+- **Text Chunking**: Split long documents into manageable chunks
+- **Caching**: Built-in result caching for faster repeated extractions
+- **Type-Safe**: Comprehensive typed configuration and result objects
-- **Ruby 2.7+** required
-- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
-- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
+## Requirements
-### Platform Support
+- Ruby 3.2 or higher
+- Rust toolchain (for building from source)
-Precompiled native extensions are available for the following platforms, providing instant installation without compilation:
+### Optional System Dependencies
-- ✅ Linux x86_64
-- ✅ Linux aarch64 (ARM64)
-- ✅ macOS aarch64 (Apple Silicon)
+- **Tesseract**: For OCR functionality
+  - macOS: `brew install tesseract`
+  - Ubuntu: `sudo apt-get install tesseract-ocr`
+  - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
-On these platforms, no C compiler or Rust toolchain is required for installation.
+- **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
+  - macOS: `brew install libreoffice`
+  - Ubuntu: `sudo apt-get install libreoffice`
-## Quick Start
+- **Pandoc**: For advanced document conversion
+  - macOS: `brew install pandoc`
+  - Ubuntu: `sudo apt-get install pandoc`
-### Basic Extraction
+## Installation
-Extract text, metadata, and structure from any supported document format:
+Add to your Gemfile:
 ```ruby
-require 'kreuzberg'
-result = Kreuzberg.extract_file_sync(path: 'document.pdf')
-puts "Content:"
-puts result.content
-puts "\nMetadata:"
-puts "Title: #{result.metadata&.dig('title')}"
-puts "Author: #{result.metadata&.dig('author')}"
-puts "\nTables found: #{result.tables.length}"
-puts "Images found: #{result.images.length}"
+gem 'kreuzberg'
 ```
-### Common Use Cases
-#### Extract with Custom Configuration
+Then run:
-Most use cases benefit from configuration to control extraction behavior:
-**With OCR (for scanned documents):**
-```ruby
-require 'kreuzberg'
-ocr_config = Kreuzberg::Config::OCR.new(
-  backend: 'tesseract',
-  language: 'eng'
-)
+```bash
+bundle install
+```
-config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
-result = Kreuzberg.extract_file_sync(path: 'scanned.pdf', config: config)
+Or install directly:
-puts "Extracted text from scanned document:"
-puts result.content
-puts "Used OCR backend: tesseract"
+```bash
+gem install kreuzberg
 ```
-#### Table Extraction
-See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
+## Quick Start
-#### Processing Multiple Files
+### Basic Extraction
 ```ruby
 require 'kreuzberg'
-puts "Kreuzberg version: #{Kreuzberg::VERSION}"
-puts "FFI bindings loaded successfully"
-result = Kreuzberg.extract_file_sync(path: 'sample.pdf')
-puts "Installation verified! Extracted #{result.content.length} characters"
+# Extract from a file
+result = Kreuzberg.extract_file_sync("document.pdf")
+puts result.content
+puts "MIME type: #{result.mime_type}"
 ```
-#### Async Processing
-For non-blocking document processing:
+### With Configuration
 ```ruby
-require 'kreuzberg'
+# Create configuration
 config = Kreuzberg::Config::Extraction.new(
   use_cache: true,
-  enable_quality_processing: true
+  force_ocr: false
 )
-result = Kreuzberg.extract_file_sync(path: 'contract.pdf', config: config)
-puts "Extracted #{result.content.length} characters"
-puts "Quality score: #{result.metadata&.dig('quality_score')}"
-puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
+result = Kreuzberg.extract_file_sync("document.pdf", config: config)
 ```
-### Next Steps
-- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
-- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
-- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
-- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
-- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
-## Features
-### Supported File Formats (56+)
-56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
-#### Office Documents
-| Category | Formats | Capabilities |
-|----------|---------|--------------|
-| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
-| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
-| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
-| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
-| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
-#### Images (OCR-Enabled)
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
-| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
-| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
-#### Web & Data
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
-| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
-| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
-#### Email & Archives
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
-| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
-#### Academic & Scientific
-| Category | Formats | Features |
-|----------|---------|----------|
-| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
-| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
-| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
-**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
-### Key Capabilities
-- **Text Extraction** - Extract all text content with position and formatting information
-- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
-- **Table Extraction** - Parse tables with structure and cell content preservation
-- **Image Extraction** - Extract embedded images and render page previews
-- **OCR Support** - Integrate multiple OCR backends for scanned documents
-- **Async/Await** - Non-blocking document processing with concurrent operations
-- **Plugin System** - Extensible post-processing for custom text transformation
-- **Embeddings** - Generate vector embeddings using ONNX Runtime models
-- **Batch Processing** - Efficiently process multiple documents in parallel
-- **Memory Efficient** - Stream large files without loading entirely into memory
-- **Language Detection** - Detect and support multiple languages in documents
-- **Configuration** - Fine-grained control over extraction behavior
-### Performance Characteristics
-| Format | Speed | Memory | Notes |
-|--------|-------|--------|-------|
-| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
-| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
-| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
-| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
-| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
-## OCR Support
-Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
-- **Tesseract**
-### OCR Configuration Example
+### With OCR
 ```ruby
-require 'kreuzberg'
+# Configure OCR
 ocr_config = Kreuzberg::Config::OCR.new(
-  backend: 'tesseract',
-  language: 'eng'
+  backend: "tesseract",
+  language: "eng",
+  preprocessing: true
 )
 config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
-result = Kreuzberg.extract_file_sync(path: 'scanned.pdf', config: config)
-puts "Extracted text from scanned document:"
-puts result.content
-puts "Used OCR backend: tesseract"
+result = Kreuzberg.extract_file_sync("scanned.pdf", config: config)
 ```
-## Async Support
-This binding provides full async/await support for non-blocking document processing:
+### Extract from Bytes
 ```ruby
-require 'kreuzberg'
-config = Kreuzberg::Config::Extraction.new(
-  use_cache: true,
-  enable_quality_processing: true
-)
-result = Kreuzberg.extract_file_sync(path: 'contract.pdf', config: config)
-puts "Extracted #{result.content.length} characters"
-puts "Quality score: #{result.metadata&.dig('quality_score')}"
-puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
+data = File.binread("document.pdf")
+result = Kreuzberg.extract_bytes_sync(data, "application/pdf")
+puts result.content
 ```
-## Plugin System
-Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
-For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).
-## Embeddings Support
-Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
-**[Embeddings Guide](https://kreuzberg.dev/features/#embeddings)**
+### Batch Processing
-## Advanced Examples
+```ruby
+paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
+results = Kreuzberg.batch_extract_files_sync(paths)
-### Embeddings with Model Configuration
+results.each do |result|
+  puts "Content: #{result.content[0..100]}"
+  puts "MIME: #{result.mime_type}"
+end
+```
-Generate embeddings for document chunks with custom model configuration:
+### Structured Results (Chunks & Images)
 ```ruby
-require 'kreuzberg'
-# Configure embedding model with custom parameters
-embedding_config = Kreuzberg::Config::Embedding.new(
-  model: { type: :preset, name: 'balanced' },
-  normalize: true,
-  batch_size: 32,
-  show_download_progress: false
-)
+result = Kreuzberg.extract_file_sync("long-report.pdf", config: {
+  chunking: { max_chars: 750 },
+  image_extraction: { extract_images: true }
+})
-# Enable chunking with embeddings
-chunking_config = Kreuzberg::Config::Chunking.new(
-  max_chars: 1024,
-  max_overlap: 256,
-  embedding: embedding_config
-)
+result.chunks&.each do |chunk|
+  puts "[#{chunk.chunk_index + 1}/#{chunk.total_chunks}] #{chunk.content[0..80]}"
+end
-config = Kreuzberg::Config::Extraction.new(chunking: chunking_config)
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-# Access chunks with embeddings
-result.chunks.each_with_index do |chunk, idx|
-  puts "Chunk #{idx}:"
-  puts "  Content: #{chunk.content[0..50]}..."
-  puts "  Tokens: #{chunk.token_count}"
-  puts "  Pages: #{chunk.first_page}-#{chunk.last_page}"
-  if chunk.embedding
-    puts "  Embedding dimensions: #{chunk.embedding.length}"
+result.images&.each do |image|
+  File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
+  if image.ocr_result
+    puts "Embedded OCR content: #{image.ocr_result.content[0..60]}"
   end
 end
 ```
-### Keywords Extraction (YAKE and RAKE)
+## Configuration
-Extract keywords using YAKE and RAKE algorithms:
+### Load From File
 ```ruby
-require 'kreuzberg'
+config = Kreuzberg::Config::Extraction.from_file("config.toml")
+result = Kreuzberg.extract_file_sync("report.pdf", config: config)
+```
-# Extract keywords using YAKE algorithm
-yake_config = Kreuzberg::Config::Keywords.new(
-  algorithm: 'yake',
-  max_keywords: 10,
-  min_score: 0.1,
-  yake_params: Kreuzberg::Config::KeywordYakeParams.new(window_size: 3)
-)
+### Extraction Configuration
-config = Kreuzberg::Config::Extraction.new(keywords: yake_config)
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-# Extract keywords using RAKE algorithm
-rake_config = Kreuzberg::Config::Keywords.new(
-  algorithm: 'rake',
-  max_keywords: 15,
-  language: 'english',
-  rake_params: Kreuzberg::Config::KeywordRakeParams.new(
-    min_word_length: 3,
-    max_words_per_phrase: 5
-  )
+```ruby
+config = Kreuzberg::Config::Extraction.new(
+  use_cache: true,                      # Enable result caching
+  enable_quality_processing: false,     # Enable text quality processing
+  force_ocr: false                      # Force OCR even for digital PDFs
 )
-config = Kreuzberg::Config::Extraction.new(keywords: rake_config)
-result = Kreuzberg.extract_file_sync(path: 'report.docx', config: config)
-puts "Keywords extracted for document"
 ```
-### Pages Extraction with PageConfig
-Extract and organize content by pages:
+### OCR Configuration
 ```ruby
-require 'kreuzberg'
-# Enable per-page extraction with markers
-page_config = Kreuzberg::Config::PageConfig.new(
-  extract_pages: true,
-  insert_page_markers: true,
-  marker_format: "\n\n=== PAGE {page_num} ===\n\n"
+ocr = Kreuzberg::Config::OCR.new(
+  backend: "tesseract",           # OCR backend (tesseract, easyocr, paddleocr)
+  language: "eng",                # Language code (eng, deu, fra, etc.)
+  tesseract_config: {
+    psm: 6,
+    enable_table_detection: true,
+    preprocessing: Kreuzberg::Config::ImagePreprocessing.new(auto_rotate: true).to_h
+  }
 )
-config = Kreuzberg::Config::Extraction.new(pages: page_config)
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-# Access extracted pages
-if result.pages
-  result.pages.each do |page|
-    puts "Page #{page.page_number}:"
-    puts "  Content length: #{page.content.length}"
-    puts "  Tables: #{page.tables.length}"
-    puts "  Images: #{page.images.length}"
-  end
-end
-puts "Total pages: #{result.page_count}"
+config = Kreuzberg::Config::Extraction.new(ocr: ocr)
 ```
-### Custom PostProcessor Implementation
-Create and register custom post-processors for text transformation:
+### Chunking Configuration
 ```ruby
-require 'kreuzberg'
-# Define a custom post-processor class
-class MarkdownEnhancerPostProcessor
-  include Kreuzberg::PostProcessorProtocol
-  def call(result)
-    # Enhance extracted content with markdown formatting
-    enhanced = result.dup
-    if enhanced['content']
-      # Add markdown headers for detected structure
-      enhanced['content'] = enhance_with_markdown(enhanced['content'])
-    end
-    enhanced
-  end
+chunking = Kreuzberg::Config::Chunking.new(
+  enabled: true,
+  chunk_size: 1000,       # Characters per chunk
+  chunk_overlap: 200,     # Overlap between chunks
+  embedding: {
+    model: { type: :preset, name: "balanced" },
+    normalize: true
+  }
+)
-  private
+config = Kreuzberg::Config::Extraction.new(chunking: chunking)
+result = Kreuzberg.extract_file_sync("long_document.pdf", config: config)
-  def enhance_with_markdown(content)
-    # Example: Convert section breaks to markdown headers
-    content
-      .split("\n\n")
-      .map { |paragraph| paragraph.length > 100 ? "## #{paragraph[0..30]}...\n\n#{paragraph}" : paragraph }
-      .join("\n\n")
-  end
+result.chunks.each do |chunk|
+  puts "Chunk: #{chunk.content}"
+  puts "Tokens: #{chunk.token_count}"
 end
-# Use custom post-processor in configuration
-processor = MarkdownEnhancerPostProcessor.new
-postprocessor_config = Kreuzberg::Config::PostProcessor.new(enabled: true)
-config = Kreuzberg::Config::Extraction.new(postprocessor: postprocessor_config)
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-puts result.content
 ```
-### Custom Validator Implementation
-Create and register validators to ensure extraction quality:
+### HTML Conversion Options
 ```ruby
-require 'kreuzberg'
-# Define a custom validator class
-class ContentQualityValidator
-  include Kreuzberg::ValidatorProtocol
-  MIN_CONTENT_LENGTH = 100
-  MIN_METADATA_FIELDS = 2
-  def call(result)
-    # Validate extracted content meets quality standards
-    content = result['content'].to_s
-    metadata = result['metadata'].to_h
-    if content.length < MIN_CONTENT_LENGTH
-      raise Kreuzberg::Errors::ValidationError,
-            "Content too short: #{content.length} bytes (minimum: #{MIN_CONTENT_LENGTH})"
-    end
-    if metadata.length < MIN_METADATA_FIELDS
-      raise Kreuzberg::Errors::ValidationError,
-            "Insufficient metadata: #{metadata.length} fields (minimum: #{MIN_METADATA_FIELDS})"
-    end
-    # Validation passed
-    nil
-  end
-end
-# Use validator in extraction workflow
-validator = ContentQualityValidator.new
-config = Kreuzberg::Config::Extraction.new(enable_quality_processing: true)
+html_options = Kreuzberg::Config::HtmlOptions.new(
+  heading_style: :atx_closed,
+  wrap: true,
+  wrap_width: 100,
+  preprocessing: { enabled: true, preset: :standard }
+)
-begin
-  result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-  validator.call(result.to_h)
-  puts "Extraction passed quality validation"
-rescue Kreuzberg::Errors::ValidationError => e
-  puts "Validation failed: #{e.message}"
-end
+config = Kreuzberg::Config::Extraction.new(html_options: html_options)
+result = Kreuzberg.extract_file_sync("page.html", config: config)
 ```
-### Config File Loading (from_file and discover)
-Load configuration from TOML, YAML, or JSON files:
+### Keyword Extraction
 ```ruby
-require 'kreuzberg'
+keywords = Kreuzberg::Config::Keywords.new(
+  algorithm: :yake,
+  max_keywords: 8,
+  min_score: 0.2,
+  ngram_range: [1, 3]
+)
-# Load configuration from a specific file
-# Supports: .toml, .yaml/.yml, .json
-config = Kreuzberg::Config::Extraction.from_file('config/kreuzberg.toml')
-# Example: config/kreuzberg.toml
-# use_cache = true
-# force_ocr = false
-# enable_quality_processing = true
-#
-# [chunking]
-# max_chars = 1024
-# max_overlap = 256
-#
-# [ocr]
-# backend = "tesseract"
-# language = "eng"
-#
-# [language_detection]
-# enabled = true
-# min_confidence = 0.7
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
-puts "Extracted with config from file"
-# Auto-discover configuration in project hierarchy
-discovered_config = Kreuzberg::Config::Extraction.discover
-if discovered_config
-  puts "Found configuration at project root"
-  result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: discovered_config)
-else
-  puts "No configuration file found, using defaults"
-  result = Kreuzberg.extract_file_sync(path: 'document.pdf')
-end
+config = Kreuzberg::Config::Extraction.new(keywords: keywords)
+result = Kreuzberg.extract_file_sync("research.pdf", config: config)
 ```
-### Fiber-Based Async Patterns
-Use Ruby Fibers for efficient async extraction workflows:
+### Language Detection
 ```ruby
-require 'kreuzberg'
-# Create async extraction workflow using Fibers
-def extract_documents_async(file_paths)
-  fibers = file_paths.map do |path|
-    Fiber.new do
-      config = Kreuzberg::Config::Extraction.new(
-        use_cache: true,
-        enable_quality_processing: true
-      )
-      # Extract asynchronously
-      result = Kreuzberg.extract_file(path: path, config: config)
-      {
-        path: path,
-        content_length: result.content.length,
-        tables: result.tables.length,
-        languages: result.detected_languages
-      }
-    end
-  end
+lang_detection = Kreuzberg::Config::LanguageDetection.new(
+  enabled: true,
+  min_confidence: 0.8,
+  detect_multiple: true
+)
-  # Resume all fibers and collect results
-  results = fibers.map do |fiber|
-    Fiber.yield fiber.resume if fiber.alive?
-  end
+config = Kreuzberg::Config::Extraction.new(language_detection: lang_detection)
+result = Kreuzberg.extract_file_sync("multilingual.pdf", config: config)
-  results.compact
+result.detected_languages&.each do |lang|
+  puts "Language: #{lang.lang}, Confidence: #{lang.confidence}"
 end
+```
-# Usage
-file_paths = ['document1.pdf', 'document2.docx', 'document3.xlsx']
-results = extract_documents_async(file_paths)
+### PDF Options
-results.each do |result|
-  puts "#{result[:path]}: #{result[:content_length]} characters"
-end
-```
+```ruby
+pdf_options = Kreuzberg::Config::PDF.new(
+  extract_images: true,
+  image_min_size: 10000,    # Minimum image size in bytes
+  password: "secret"        # PDF password
+)
-### Table Extraction Detailed Usage
+config = Kreuzberg::Config::Extraction.new(pdf_options: pdf_options)
+```
-Extract and access table structure and cell data:
+## Working with Results
 ```ruby
-require 'kreuzberg'
+result = Kreuzberg.extract_file_sync("invoice.pdf")
-# Configure table extraction
-config = Kreuzberg::Config::Extraction.new(
-  pdf_options: Kreuzberg::Config::PDF.new(extract_images: true)
-)
+# Access extracted text
+puts result.content
+# Access MIME type
+puts result.mime_type
-result = Kreuzberg.extract_file_sync(path: 'spreadsheet.pdf', config: config)
+# Access metadata
+puts result.metadata.inspect
 # Access extracted tables
-result.tables.each_with_index do |table, table_idx|
-  puts "Table #{table_idx} (Page #{table.page_number}):"
-  # Access table cells (2D array)
-  table.cells.each_with_index do |row, row_idx|
-    puts "  Row #{row_idx}:"
-    row.each_with_index do |cell, col_idx|
-      puts "    [#{col_idx}] #{cell}"
-    end
+result.tables.each do |table|
+  puts "Headers: #{table.headers.join(', ')}"
+  table.rows.each do |row|
+    puts row.join(', ')
   end
+end
-  # Access markdown representation
-  puts "\nMarkdown format:"
-  puts table.markdown
+# Access text chunks and metadata
+result.chunks&.each do |chunk|
+  puts "Chunk #{chunk.chunk_index + 1}/#{chunk.total_chunks}"
+  puts "Chars: #{chunk.char_start}-#{chunk.char_end}"
+  puts "Embedding length: #{chunk.embedding&.length}"
 end
-# Extract tables from specific pages
-page_config = Kreuzberg::Config::PageConfig.new(extract_pages: true)
-config = Kreuzberg::Config::Extraction.new(pages: page_config)
-result = Kreuzberg.extract_file_sync(path: 'data.xlsx', config: config)
-if result.pages
-  result.pages.each do |page|
-    page.tables.each do |table|
-      puts "Table on page #{page.page_number}:"
-      puts "  Dimensions: #{table.cells.length} rows x #{table.cells.first&.length || 0} columns"
-    end
-  end
+# Access extracted images
+result.images&.each do |image|
+  File.binwrite("image-#{image.image_index}.#{image.format}", image.data)
+  puts "Image #{image.image_index} on page #{image.page_number}"
 end
-```
-### Image Extraction and Saving
+# Convert to hash
+hash = result.to_h
-Extract images and save them to disk:
+# Convert to JSON
+json = result.to_json
+```
-```ruby
-require 'kreuzberg'
+## CLI Usage
-# Configure image extraction with high DPI
-image_config = Kreuzberg::Config::ImageExtraction.new(
-  extract_images: true,
-  target_dpi: 300,
-  max_image_dimension: 2000,
-  auto_adjust_dpi: true
-)
+Kreuzberg provides a Ruby wrapper for the CLI:
-config = Kreuzberg::Config::Extraction.new(image_extraction: image_config)
-result = Kreuzberg.extract_file_sync(path: 'document.pdf', config: config)
+```ruby
+# Extract content
+output = Kreuzberg::CLI.extract("document.pdf", output: "text")
-# Save extracted images
-output_dir = 'extracted_images'
-Dir.mkdir(output_dir) unless Dir.exist?(output_dir)
+# Detect MIME type
+mime_type = Kreuzberg::CLI.detect("document.pdf")
-result.images.each_with_index do |image, idx|
-  # Generate filename
-  filename = "image_p#{image.page_number}_#{image.image_index}.#{image.format}"
-  filepath = File.join(output_dir, filename)
+# Get version
+version = Kreuzberg::CLI.version
+```
-  # Save image data
-  File.write(filepath, image.data, mode: 'wb')
+## API Server
-  puts "Saved: #{filename}"
-  puts "  Page: #{image.page_number}"
-  puts "  Format: #{image.format}"
-  puts "  Dimensions: #{image.width}x#{image.height}"
-  puts "  Colorspace: #{image.colorspace}"
+Start an API server (requires the kreuzberg CLI):
-  # Process OCR result if available
-  if image.ocr_result
-    puts "  OCR Text: #{image.ocr_result['text'][0..50]}..."
-  end
+```ruby
+Kreuzberg::APIProxy.run(port: 8000) do |server|
+  # Server runs in background
+  # Make HTTP requests to http://localhost:8000
 end
 ```
-### Language Detection Configuration
+## MCP Server
-Configure and use language detection:
+Start a Model Context Protocol server for Claude Desktop:
 ```ruby
-require 'kreuzberg'
-# Enable language detection with confidence threshold
-lang_detection_config = Kreuzberg::Config::LanguageDetection.new(
-  enabled: true,
-  min_confidence: 0.8,
-  detect_multiple: true
-)
-config = Kreuzberg::Config::Extraction.new(
-  language_detection: lang_detection_config
-)
-result = Kreuzberg.extract_file_sync(path: 'multilingual.pdf', config: config)
+server = Kreuzberg::MCPProxy::Server.new(transport: 'stdio')
+server.start
-# Access detected languages
-puts "Primary language: #{result.detected_language}"
-puts "All detected languages: #{result.detected_languages.join(', ')}"
-# Access language from metadata
-if result.metadata.is_a?(Hash)
-  puts "Language from metadata: #{result.metadata['language']}"
-end
+# Use with Claude Desktop integration
+```
-# Combine with keyword extraction for specific language
-keywords_config = Kreuzberg::Config::Keywords.new(
-  algorithm: 'yake',
-  language: 'de',  # German keywords
-  max_keywords: 10
-)
+## Cache Management
-config = Kreuzberg::Config::Extraction.new(
-  language_detection: lang_detection_config,
-  keywords: keywords_config
-)
+```ruby
+# Get cache statistics
+stats = Kreuzberg.cache_stats
+puts "Entries: #{stats[:total_entries]}"
+puts "Size: #{stats[:total_size_bytes]} bytes"
-result = Kreuzberg.extract_file_sync(path: 'german_document.pdf', config: config)
-puts "Keywords extracted for: #{result.detected_language}"
+# Clear cache
+Kreuzberg.clear_cache
 ```
-## Batch Processing
-Process multiple documents efficiently:
+## Error Handling
 ```ruby
-require 'kreuzberg'
+begin
+  result = Kreuzberg.extract_file_sync("document.pdf")
+rescue Kreuzberg::Errors::ParsingError => e
+  puts "Parsing failed: #{e.message}"
+  puts "Context: #{e.context}"
+rescue Kreuzberg::Errors::OCRError => e
+  puts "OCR failed: #{e.message}"
+rescue Kreuzberg::Errors::MissingDependencyError => e
+  puts "Missing dependency: #{e.dependency}"
+rescue Kreuzberg::Errors::Error => e
+  puts "Kreuzberg error: #{e.message}"
+end
+```
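The rescue chain above lists the specific error classes first and `Kreuzberg::Errors::Error` last. Ruby tries `rescue` clauses top to bottom and takes the first match, so a base class placed first would swallow everything below it. The sketch below shows that ordering with stand-in classes. The `DemoErrors` module and `classify_failure` helper are hypothetical, and the inheritance shown is an assumption inferred from the rescue order, not a confirmed description of the gem's hierarchy.

```ruby
# Stand-in error classes for illustration only; these are NOT the gem's classes.
module DemoErrors
  class Error < StandardError; end
  class ParsingError < Error; end
  class OCRError < Error; end
end

# Runs the block and returns a symbol naming the rescue clause that handled it.
# Specific subclasses are listed before the base class so they match first.
def classify_failure
  yield
  :ok
rescue DemoErrors::ParsingError
  :parsing
rescue DemoErrors::OCRError
  :ocr
rescue DemoErrors::Error
  :generic
end

puts classify_failure { raise DemoErrors::ParsingError, "malformed input" }
puts classify_failure { raise DemoErrors::Error, "something else" }
```

Moving the `DemoErrors::Error` clause to the top would make every call return `:generic`, which is why the catch-all clause belongs at the end.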
-puts "Kreuzberg version: #{Kreuzberg::VERSION}"
-puts "FFI bindings loaded successfully"
+## Supported Formats
-result = Kreuzberg.extract_file_sync(path: 'sample.pdf')
-puts "Installation verified! Extracted #{result.content.length} characters"
-```
+- **Documents**: PDF, DOCX, DOC, PPTX, PPT, ODT, ODP
+- **Spreadsheets**: XLSX, XLS, ODS, CSV
+- **Images**: PNG, JPEG, TIFF, BMP, GIF
+- **Web**: HTML, MHTML, Markdown
+- **Data**: JSON, YAML, TOML, XML
+- **Email**: EML, MSG
+- **Archives**: ZIP, TAR, 7Z
+- **Text**: TXT, RTF, MD
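As a rough illustration, the format list above can be transcribed into a lookup table for pre-filtering files before extraction. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `format_category` are hypothetical names, not part of the gem, and treating each listed format name as a lowercase file extension is an assumption (for example, `jpg` is not mapped because the list only names JPEG).

```ruby
# Hypothetical lookup table transcribed from the README's format list.
# Not part of the kreuzberg gem; extensions are assumed from the format names.
SUPPORTED_EXTENSIONS = {
  documents: %w[pdf docx doc pptx ppt odt odp],
  spreadsheets: %w[xlsx xls ods csv],
  images: %w[png jpeg tiff bmp gif],
  web: %w[html mhtml],
  data: %w[json yaml toml xml],
  email: %w[eml msg],
  archives: %w[zip tar 7z],
  text: %w[txt rtf md]
}.freeze

# Returns the category symbol for a path, or nil if the extension is not listed.
def format_category(path)
  ext = File.extname(path).delete_prefix(".").downcase
  SUPPORTED_EXTENSIONS.each do |category, extensions|
    return category if extensions.include?(ext)
  end
  nil
end

puts format_category("report.PDF").inspect
puts format_category("notes.unknown").inspect
```

A table like this only guards against obviously unsupported inputs. The library's own MIME detection remains the authority on what it can actually parse.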
-## Configuration
+## Performance
-For advanced configuration options including language detection, table extraction, OCR settings, and more:
+Kreuzberg's Rust core provides significant performance improvements:
-**[Configuration Guide](https://kreuzberg.dev/configuration/)**
+- **PDF extraction**: 10-50x faster than pure Ruby solutions
+- **Batch processing**: Parallel extraction with Tokio async runtime
+- **Memory efficient**: Streaming parsers for large files
+- **Caching**: Automatic result caching for repeated extractions
-## Documentation
+## Development
-- **[Official Documentation](https://kreuzberg.dev/)**
-- **[API Reference](https://kreuzberg.dev/reference/api-ruby/)**
-- **[Examples & Guides](https://kreuzberg.dev/guides/)**
+```bash
+# Clone the repository
+git clone https://github.com/Goldziher/kreuzberg.git
+cd kreuzberg/packages/ruby
-## Troubleshooting
+# Install dependencies
+bundle install
-For common issues and solutions, visit [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
+# Build the Rust extension
+bundle exec rake compile
-## Contributing
+# Run tests
+bundle exec rspec
-Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
+# Run RuboCop
+bundle exec rubocop
+```
 ## License
-MIT License - see LICENSE file for details.
+MIT License. See [LICENSE](../../LICENSE) for details.
+## Contributing
+Contributions are welcome! Please see [CONTRIBUTING.md](../../CONTRIBUTING.md) for guidelines.
-## Support
+## Links
-- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
-- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
-- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
+- **Documentation**: https://docs.kreuzberg.dev
+- **GitHub**: https://github.com/Goldziher/kreuzberg
+- **Issues**: https://github.com/Goldziher/kreuzberg/issues