kreuzberg 5.0.0.pre.rc.1 → 5.0.0.pre.rc.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/LICENSE +93 -0
- data/README.md +467 -0
- data/ext/kreuzberg_rb/native/Cargo.lock +661 -268
- data/ext/kreuzberg_rb/native/Cargo.toml +8 -5
- data/ext/kreuzberg_rb/native/extconf.rb +14 -0
- data/ext/kreuzberg_rb/src/lib.rs +16815 -11874
- data/lib/kreuzberg/native.rb +508 -1248
- data/lib/kreuzberg/version.rb +3 -3
- data/lib/kreuzberg.rb +7 -2
- data/lib/kreuzberg_rb.so +0 -0
- data/sig/types.rbs +466 -99
- metadata +15 -6
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 6b530ee610c625a8bd7e88f12e281aee8d5814d9ad7d26d6c2c839db83fc1563
|
|
4
|
+
data.tar.gz: 3ad761c213024ce192541e198da052302d66de1f7e682e872043fdd02d53dad9
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 183448573cf8e5c42f4ea9a620a1d4e390871d82750b9b703d066ad7146dc8f3f67b27da0afb5e45714544f00c53e4beaf54024406b8cc44c4f7a8ca96bb3266
|
|
7
|
+
data.tar.gz: 6a2e7a72318b0ddb2de2044940d7ef4ab84d2f20dd19bee0ff6ddcb2479ef5e1d5cf5118a83586214d4ba379a320d4bcc23ca04406c31f174b4b05752f6a8713
|
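The hunk above changes only the recorded digests, so the practical question is whether a downloaded package matches them. A minimal Ruby sketch, assuming the `.gem` archive has already been unpacked into a directory containing `metadata.gz` and `data.tar.gz`, and that the text of `checksums.yaml` is passed in as a string. `checksum_mismatches` is a hypothetical helper name, not part of RubyGems or Kreuzberg:

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compares files from an unpacked .gem against the
# digests listed in its checksums.yaml. Returns a Hash of mismatches,
# keyed by "ALGORITHM:filename", with the digest that was actually computed.
def checksum_mismatches(checksums_yaml, dir)
  checksums = YAML.safe_load(checksums_yaml)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }
  mismatches = {}

  checksums.each do |algo, files|
    digest_class = algorithms.fetch(algo) # raises KeyError on an unknown algorithm
    files.each do |name, expected|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      mismatches["#{algo}:#{name}"] = actual unless actual == expected
    end
  end

  mismatches
end
```

An empty Hash means every listed file matched its recorded digest; any entry names the algorithm and file that differ.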
data/LICENSE
ADDED
|
@@ -0,0 +1,93 @@
|
|
|
1
|
+
Elastic License 2.0 (ELv2)
|
|
2
|
+
|
|
3
|
+
Copyright 2025-2026 Kreuzberg, Inc.
|
|
4
|
+
|
|
5
|
+
Acceptance
|
|
6
|
+
|
|
7
|
+
By using the software, you agree to all of the terms and conditions below.
|
|
8
|
+
|
|
9
|
+
Copyright License
|
|
10
|
+
|
|
11
|
+
The licensor grants you a non-exclusive, royalty-free, worldwide,
|
|
12
|
+
non-sublicensable, non-transferable license to use, copy, distribute, make
|
|
13
|
+
available, and prepare derivative works of the software, in each case subject to
|
|
14
|
+
the limitations and conditions below.
|
|
15
|
+
|
|
16
|
+
Limitations
|
|
17
|
+
|
|
18
|
+
You may not provide the software to third parties as a hosted or managed
|
|
19
|
+
service, where the service provides users with access to any substantial set of
|
|
20
|
+
the features or functionality of the software.
|
|
21
|
+
|
|
22
|
+
You may not move, change, disable, or circumvent the license key functionality
|
|
23
|
+
in the software, and you may not remove or obscure any functionality in the
|
|
24
|
+
software that is protected by the license key.
|
|
25
|
+
|
|
26
|
+
You may not alter, remove, or obscure any licensing, copyright, or other notices
|
|
27
|
+
of the licensor in the software. Any use of the licensor's trademarks is subject
|
|
28
|
+
to applicable law.
|
|
29
|
+
|
|
30
|
+
Patents
|
|
31
|
+
|
|
32
|
+
The licensor grants you a license, under any patent claims the licensor can
|
|
33
|
+
license, or becomes able to license, to make, have made, use, sell, offer for
|
|
34
|
+
sale, import and have imported the software, in each case subject to the
|
|
35
|
+
limitations and conditions in this license. This license does not cover any
|
|
36
|
+
patent claims that you cause to be infringed by modifications or additions to the
|
|
37
|
+
software. If you or your company make any written claim that the software
|
|
38
|
+
infringes or contributes to infringement of any patent, your patent license for
|
|
39
|
+
the software granted under these terms ends immediately. If your company makes
|
|
40
|
+
such a claim, your patent license ends immediately for work on behalf of your
|
|
41
|
+
company.
|
|
42
|
+
|
|
43
|
+
Notices
|
|
44
|
+
|
|
45
|
+
You must ensure that anyone who gets a copy of any part of the software from you
|
|
46
|
+
also gets a copy of these terms.
|
|
47
|
+
|
|
48
|
+
If you modify the software, you must include in any modified copies of the
|
|
49
|
+
software prominent notices stating that you have modified the software.
|
|
50
|
+
|
|
51
|
+
No Other Rights
|
|
52
|
+
|
|
53
|
+
These terms do not imply any licenses other than those expressly granted in
|
|
54
|
+
these terms.
|
|
55
|
+
|
|
56
|
+
Termination
|
|
57
|
+
|
|
58
|
+
If you use the software in violation of these terms, such use is not licensed,
|
|
59
|
+
and your licenses will automatically terminate. If the licensor provides you with
|
|
60
|
+
a notice of your violation, and you cease all violation of this license no later
|
|
61
|
+
than 30 days after you receive that notice, your licenses will be reinstated
|
|
62
|
+
retroactively. However, if you violate these terms after such reinstatement, any
|
|
63
|
+
additional violation of these terms will cause your licenses to terminate
|
|
64
|
+
automatically and permanently.
|
|
65
|
+
|
|
66
|
+
No Liability
|
|
67
|
+
|
|
68
|
+
As far as the law allows, the software comes as is, without any warranty or
|
|
69
|
+
condition, and the licensor will not be liable to you for any damages arising out
|
|
70
|
+
of these terms or the use or nature of the software, under any kind of legal
|
|
71
|
+
claim.
|
|
72
|
+
|
|
73
|
+
Definitions
|
|
74
|
+
|
|
75
|
+
The licensor is the entity offering these terms, and the software is the
|
|
76
|
+
software the licensor makes available under these terms, including any portion
|
|
77
|
+
of it.
|
|
78
|
+
|
|
79
|
+
you refers to the individual or entity agreeing to these terms.
|
|
80
|
+
|
|
81
|
+
your company is any legal entity, sole proprietorship, or other kind of
|
|
82
|
+
organization that you work for, plus all organizations that have control over,
|
|
83
|
+
are under the control of, or are under common control with that organization.
|
|
84
|
+
control means ownership of substantially all the assets of an entity, or the
|
|
85
|
+
power to direct its management and policies by vote, contract, or otherwise.
|
|
86
|
+
Control can be direct or indirect.
|
|
87
|
+
|
|
88
|
+
your licenses are all the licenses granted to you for the software under these
|
|
89
|
+
terms.
|
|
90
|
+
|
|
91
|
+
use means anything you do with the software requiring one of your licenses.
|
|
92
|
+
|
|
93
|
+
trademark means trademarks, service marks, and similar rights.
|
data/README.md
ADDED
|
@@ -0,0 +1,467 @@
|
|
|
1
|
+
# Kreuzberg for Ruby
|
|
2
|
+
|
|
3
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
4
|
+
<a href="https://github.com/kreuzberg-dev/alef">
|
|
5
|
+
<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
|
|
6
|
+
</a>
|
|
7
|
+
<!-- Language Bindings -->
|
|
8
|
+
<a href="https://crates.io/crates/kreuzberg">
|
|
9
|
+
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
|
|
10
|
+
</a>
|
|
11
|
+
<a href="https://pypi.org/project/kreuzberg/">
|
|
12
|
+
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
|
|
13
|
+
</a>
|
|
14
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/node">
|
|
15
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
|
|
16
|
+
</a>
|
|
17
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
|
|
18
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
|
|
19
|
+
</a>
|
|
20
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
|
|
21
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
|
|
22
|
+
</a>
|
|
23
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
|
|
24
|
+
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
|
|
25
|
+
</a>
|
|
26
|
+
<a href="https://www.nuget.org/packages/Kreuzberg/">
|
|
27
|
+
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
|
|
28
|
+
</a>
|
|
29
|
+
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
|
|
30
|
+
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
|
|
31
|
+
</a>
|
|
32
|
+
<a href="https://rubygems.org/gems/kreuzberg">
|
|
33
|
+
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
|
|
34
|
+
</a>
|
|
35
|
+
<a href="https://hex.pm/packages/kreuzberg">
|
|
36
|
+
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
|
|
37
|
+
</a>
|
|
38
|
+
<a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
|
|
39
|
+
<img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
|
|
40
|
+
</a>
|
|
41
|
+
<a href="https://pub.dev/packages/kreuzberg">
|
|
42
|
+
<img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
|
|
43
|
+
</a>
|
|
44
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
|
|
45
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
|
|
46
|
+
</a>
|
|
47
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
|
|
48
|
+
<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
|
|
49
|
+
</a>
|
|
50
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
|
|
51
|
+
<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
|
|
52
|
+
</a>
|
|
53
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
|
|
54
|
+
<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
|
|
55
|
+
</a>
|
|
56
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
|
|
57
|
+
<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
|
|
58
|
+
</a>
|
|
59
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
|
|
60
|
+
<img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
|
|
61
|
+
</a>
|
|
62
|
+
|
|
63
|
+
<!-- Project Info -->
|
|
64
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
|
|
65
|
+
<img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
|
|
66
|
+
</a>
|
|
67
|
+
<a href="https://docs.kreuzberg.dev">
|
|
68
|
+
<img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
|
|
69
|
+
</a>
|
|
70
|
+
<a href="https://huggingface.co/Kreuzberg">
|
|
71
|
+
<img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
|
|
72
|
+
</a>
|
|
73
|
+
</div>
|
|
74
|
+
|
|
75
|
+
<div align="center" style="margin: 24px 0 0;">
|
|
76
|
+
<a href="https://kreuzberg.dev">
|
|
77
|
+
<img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
|
|
78
|
+
</a>
|
|
79
|
+
</div>
|
|
80
|
+
|
|
81
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
|
|
82
|
+
<a href="https://discord.gg/xt9WY3GnKR">
|
|
83
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
|
|
84
|
+
</a>
|
|
85
|
+
<a href="https://docs.kreuzberg.dev/demo.html">
|
|
86
|
+
<img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
|
|
87
|
+
</a>
|
|
88
|
+
</div>
|
|
89
|
+
|
|
90
|
+
Extract text, tables, images, and metadata from 90+ file formats (including PDF, Office documents, and images) and from source code in 300+ programming languages. Ruby bindings with an idiomatic Ruby API and native performance.
|
|
91
|
+
|
|
92
|
+
## What This Package Provides
|
|
93
|
+
|
|
94
|
+
- **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
|
|
95
|
+
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
|
|
96
|
+
- **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
|
|
97
|
+
- **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
|
|
98
|
+
|
|
99
|
+
## Installation
|
|
100
|
+
|
|
101
|
+
Add to your Gemfile:
|
|
102
|
+
|
|
103
|
+
```ruby
|
|
104
|
+
gem 'kreuzberg'
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
Then execute:
|
|
108
|
+
|
|
109
|
+
```bash
|
|
110
|
+
bundle install
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
Or install it directly:
|
|
114
|
+
|
|
115
|
+
```bash
|
|
116
|
+
gem install kreuzberg
|
|
117
|
+
```
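Both versions compared in this diff are prereleases (`5.0.0.pre.rc.1` and `5.0.0.pre.rc.7`). RubyGems and Bundler normally skip prereleases unless asked, so the unpinned commands above are likely to resolve to the latest stable release. Getting a release candidate would need `gem install kreuzberg --pre` or an explicit version in the Gemfile. The sketch below uses `Gem::Version` from RubyGems to show how these version strings are classified and ordered:

```ruby
require 'rubygems'

old_rc = Gem::Version.new('5.0.0.pre.rc.1')
new_rc = Gem::Version.new('5.0.0.pre.rc.7')
stable = Gem::Version.new('5.0.0')

# A version containing letters is treated as a prerelease.
puts new_rc.prerelease?  # true

# Release candidates sort by their trailing number...
puts new_rc > old_rc     # true

# ...and every prerelease sorts below the final release.
puts new_rc < stable     # true
```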
|
|
118
|
+
|
|
119
|
+
## Quick Start
|
|
120
|
+
|
|
121
|
+
### Basic Usage
|
|
122
|
+
|
|
123
|
+
```ruby
|
|
124
|
+
require 'kreuzberg'
|
|
125
|
+
|
|
126
|
+
# Simple synchronous extraction
|
|
127
|
+
result = Kreuzberg.extract_file("document.pdf")
|
|
128
|
+
puts result.content
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
### Async Extraction
|
|
132
|
+
|
|
133
|
+
```ruby
|
|
134
|
+
require 'kreuzberg'
|
|
135
|
+
|
|
136
|
+
# Run the async variant inside a Fiber (the gem requires Ruby 3.2+)
|
|
137
|
+
Fiber.new do
|
|
138
|
+
result = Kreuzberg.extract_file_async("document.pdf")
|
|
139
|
+
puts result.content
|
|
140
|
+
end.resume
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
### Batch Processing
|
|
144
|
+
|
|
145
|
+
```ruby
|
|
146
|
+
require 'kreuzberg'
|
|
147
|
+
|
|
148
|
+
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
|
|
149
|
+
|
|
150
|
+
results = files.map { |file| Kreuzberg.extract_file(file) }
|
|
151
|
+
|
|
152
|
+
results.each do |result|
|
|
153
|
+
puts "Content length: #{result.content.length}"
|
|
154
|
+
end
|
|
155
|
+
```
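The `map` call above stops at the first file that raises. A minimal sketch of a more tolerant loop, assuming extraction failures surface as Ruby exceptions (the README does not say which classes). `extract_each` is a hypothetical helper, and the extractor is passed in as a callable so the snippet runs without the gem installed:

```ruby
# Hypothetical helper: runs `extractor` on every path and separates
# successes from failures instead of aborting on the first error.
def extract_each(paths, extractor:)
  paths.each_with_object({ ok: {}, failed: {} }) do |path, acc|
    acc[:ok][path] = extractor.call(path)
  rescue StandardError => e
    acc[:failed][path] = e.message
  end
end

# Stand-in extractor for demonstration. With the gem loaded one would pass
# Kreuzberg.method(:extract_file) instead (assumed to be a module method).
fake_extractor = lambda do |path|
  raise ArgumentError, "unsupported file: #{path}" if path.end_with?('.bin')

  "text of #{path}"
end

report = extract_each(%w[doc1.pdf blob.bin], extractor: fake_extractor)
puts report[:ok].keys.inspect
puts report[:failed].inspect
```

Keeping the failures in the returned Hash lets the caller decide whether to retry, log, or skip them.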
|
|
156
|
+
|
|
157
|
+
## Configuration
|
|
158
|
+
|
|
159
|
+
```ruby
|
|
160
|
+
require 'kreuzberg'
|
|
161
|
+
|
|
162
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
163
|
+
use_cache: true,
|
|
164
|
+
enable_quality_processing: true,
|
|
165
|
+
ocr: Kreuzberg::OcrConfig.new(
|
|
166
|
+
backend: 'tesseract',
|
|
167
|
+
language: 'eng'
|
|
168
|
+
)
|
|
169
|
+
)
|
|
170
|
+
|
|
171
|
+
result = Kreuzberg.extract_file("document.pdf", config: config)
|
|
172
|
+
puts result.content
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
## OCR Support
|
|
176
|
+
|
|
177
|
+
### Tesseract Configuration
|
|
178
|
+
|
|
179
|
+
```ruby
|
|
180
|
+
require 'kreuzberg'
|
|
181
|
+
|
|
182
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
183
|
+
ocr: Kreuzberg::OcrConfig.new(
|
|
184
|
+
backend: 'tesseract',
|
|
185
|
+
language: 'eng',
|
|
186
|
+
tesseract_config: Kreuzberg::TesseractConfig.new(
|
|
187
|
+
psm: 6,
|
|
188
|
+
enable_table_detection: true
|
|
189
|
+
)
|
|
190
|
+
)
|
|
191
|
+
)
|
|
192
|
+
|
|
193
|
+
result = Kreuzberg.extract_file("scanned.pdf", config: config)
|
|
194
|
+
puts result.content
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
## Table Extraction
|
|
198
|
+
|
|
199
|
+
```ruby
|
|
200
|
+
require 'kreuzberg'
|
|
201
|
+
|
|
202
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
203
|
+
ocr: Kreuzberg::OcrConfig.new(
|
|
204
|
+
backend: 'tesseract',
|
|
205
|
+
tesseract_config: Kreuzberg::TesseractConfig.new(
|
|
206
|
+
enable_table_detection: true
|
|
207
|
+
)
|
|
208
|
+
)
|
|
209
|
+
)
|
|
210
|
+
|
|
211
|
+
result = Kreuzberg.extract_file("invoice.pdf", config: config)
|
|
212
|
+
|
|
213
|
+
result.tables.each_with_index do |table, index|
|
|
214
|
+
puts "Table #{index}:"
|
|
215
|
+
puts table.markdown
|
|
216
|
+
end
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
## Metadata Extraction
|
|
220
|
+
|
|
221
|
+
```ruby
|
|
222
|
+
require 'kreuzberg'
|
|
223
|
+
|
|
224
|
+
result = Kreuzberg.extract_file("document.pdf")
|
|
225
|
+
|
|
226
|
+
# PDF metadata
|
|
227
|
+
if result.metadata[:pdf]
|
|
228
|
+
pdf_meta = result.metadata[:pdf]
|
|
229
|
+
puts "Title: #{pdf_meta[:title]}"
|
|
230
|
+
puts "Author: #{pdf_meta[:author]}"
|
|
231
|
+
puts "Pages: #{pdf_meta[:page_count]}"
|
|
232
|
+
end
|
|
233
|
+
|
|
234
|
+
# Detected languages
|
|
235
|
+
puts "Languages: #{result.detected_languages}"
|
|
236
|
+
|
|
237
|
+
# Images
|
|
238
|
+
if result.images
|
|
239
|
+
puts "Images found: #{result.images.count}"
|
|
240
|
+
end
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
## Text Chunking
|
|
244
|
+
|
|
245
|
+
```ruby
|
|
246
|
+
require 'kreuzberg'
|
|
247
|
+
|
|
248
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
249
|
+
chunking: Kreuzberg::ChunkingConfig.new(
|
|
250
|
+
max_chars: 1000,
|
|
251
|
+
max_overlap: 200
|
|
252
|
+
)
|
|
253
|
+
)
|
|
254
|
+
|
|
255
|
+
result = Kreuzberg.extract_file("long_document.pdf", config: config)
|
|
256
|
+
|
|
257
|
+
result.chunks.each_with_index do |chunk, index|
|
|
258
|
+
puts "Chunk #{index}: #{chunk.length} characters"
|
|
259
|
+
end
|
|
260
|
+
```
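To make the two settings concrete, here is a plain-Ruby sketch of fixed-size chunking with overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, and Kreuzberg's own chunker may split on different boundaries and treat `max_chars` and `max_overlap` differently.

```ruby
# Hypothetical helper (not Kreuzberg's algorithm): cuts `text` into windows
# of at most `max_chars` characters, where each window starts
# `max_chars - max_overlap` characters after the previous one.
def sliding_chunks(text, max_chars:, max_overlap:)
  raise ArgumentError, 'max_overlap must be smaller than max_chars' unless max_overlap < max_chars
  return [] if text.empty?

  step = max_chars - max_overlap
  chunks = []
  start = 0
  loop do
    chunks << text[start, max_chars]
    break if start + max_chars >= text.length # this window reached the end

    start += step
  end
  chunks
end

sample = (0...2500).map { |i| (97 + i % 26).chr }.join
chunks = sliding_chunks(sample, max_chars: 1000, max_overlap: 200)
p chunks.map(&:length) # prints [1000, 1000, 900]
```

With these values each chunk repeats the last 200 characters of the previous one, which is the effect an overlap setting is meant to have.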
|
|
261
|
+
|
|
262
|
+
## Password-Protected PDFs
|
|
263
|
+
|
|
264
|
+
```ruby
|
|
265
|
+
require 'kreuzberg'
|
|
266
|
+
|
|
267
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
268
|
+
pdf_options: Kreuzberg::PdfConfig.new(
|
|
269
|
+
passwords: ["password1", "password2"]
|
|
270
|
+
)
|
|
271
|
+
)
|
|
272
|
+
|
|
273
|
+
result = Kreuzberg.extract_file("protected.pdf", config: config)
|
|
274
|
+
puts result.content
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
## Language Detection
|
|
278
|
+
|
|
279
|
+
```ruby
|
|
280
|
+
require 'kreuzberg'
|
|
281
|
+
|
|
282
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
283
|
+
language_detection: Kreuzberg::LanguageDetectionConfig.new(
|
|
284
|
+
enabled: true
|
|
285
|
+
)
|
|
286
|
+
)
|
|
287
|
+
|
|
288
|
+
result = Kreuzberg.extract_file("multilingual.pdf", config: config)
|
|
289
|
+
puts "Detected languages: #{result.detected_languages}"
|
|
290
|
+
```
|
|
291
|
+
|
|
292
|
+
## API Reference
|
|
293
|
+
|
|
294
|
+
### Main Methods
|
|
295
|
+
|
|
296
|
+
- `Kreuzberg.extract_file(path, config: nil)` – Extract from file
|
|
297
|
+
- `Kreuzberg.extract_file_async(path, config: nil)` – Async extraction
|
|
298
|
+
- `Kreuzberg.extract_bytes(data, mime_type, config: nil)` – Extract from bytes
|
|
299
|
+
- `Kreuzberg.batch_extract_files(paths, config: nil)` – Batch processing
|
|
300
|
+
|
|
301
|
+
### Configuration Classes
|
|
302
|
+
|
|
303
|
+
- `ExtractionConfig` – Main configuration
|
|
304
|
+
- `OcrConfig` – OCR settings
|
|
305
|
+
- `TesseractConfig` – Tesseract-specific options
|
|
306
|
+
- `ChunkingConfig` – Text chunking settings
|
|
307
|
+
- `PdfConfig` – PDF-specific options
|
|
308
|
+
- `LanguageDetectionConfig` – Language detection settings
|
|
309
|
+
|
|
310
|
+
### Result Object
|
|
311
|
+
|
|
312
|
+
- `content` – Extracted text
|
|
313
|
+
- `metadata` – File metadata as Hash
|
|
314
|
+
- `tables` – Array of ExtractedTable objects
|
|
315
|
+
- `detected_languages` – Array of language codes
|
|
316
|
+
- `chunks` – Array of text chunks
|
|
317
|
+
- `images` – Array of extracted images (if enabled)
|
|
318
|
+
|
|
319
|
+
## System Requirements
|
|
320
|
+
|
|
321
|
+
### Ruby Version
|
|
322
|
+
|
|
323
|
+
- **Ruby 3.2.0 or higher** (including Ruby 4.x)
|
|
324
|
+
- Ruby 4.0+ is fully supported with no code changes required
|
|
325
|
+
- Magnus bindings compile successfully on all supported Ruby versions
|
|
326
|
+
|
|
327
|
+
### Required
|
|
328
|
+
|
|
329
|
+
- Rust toolchain (for native extension compilation)
|
|
330
|
+
|
|
331
|
+
### Optional
|
|
332
|
+
|
|
333
|
+
```bash
|
|
334
|
+
# Tesseract OCR
|
|
335
|
+
brew install tesseract # macOS
|
|
336
|
+
sudo apt-get install tesseract-ocr # Ubuntu/Debian
|
|
337
|
+
```
|
|
338
|
+
|
|
339
|
+
### Ruby 4.0 Compatibility
|
|
340
|
+
|
|
341
|
+
Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
|
|
342
|
+
|
|
343
|
+
- **Ruby Box** - Experimental isolation of definitions, enabled with `RUBY_BOX=1`
|
|
344
|
+
- **ZJIT Compiler** - Enhanced JIT compilation for faster execution
|
|
345
|
+
- **Ractor Improvements** - Better multi-threaded document processing
|
|
346
|
+
- **Set Promoted to Core** - No changes needed for Kreuzberg
|
|
347
|
+
|
|
348
|
+
All tests pass on Ruby 4.0.1, and the gem compiles without modification.
|
|
349
|
+
|
|
350
|
+
## Development
|
|
351
|
+
|
|
352
|
+
Clone and set up:
|
|
353
|
+
|
|
354
|
+
```bash
|
|
355
|
+
git clone https://github.com/kreuzberg-dev/kreuzberg.git
|
|
356
|
+
cd kreuzberg
|
|
357
|
+
bundle install
|
|
358
|
+
```
|
|
359
|
+
|
|
360
|
+
Run tests:
|
|
361
|
+
|
|
362
|
+
```bash
|
|
363
|
+
rake test
|
|
364
|
+
```
|
|
365
|
+
|
|
366
|
+
## Troubleshooting
|
|
367
|
+
|
|
368
|
+
### Native extension compilation error
|
|
369
|
+
|
|
370
|
+
Ensure build tools are installed:
|
|
371
|
+
|
|
372
|
+
```bash
|
|
373
|
+
# macOS
|
|
374
|
+
xcode-select --install
|
|
375
|
+
|
|
376
|
+
# Ubuntu/Debian
|
|
377
|
+
sudo apt-get install build-essential ruby-dev
|
|
378
|
+
|
|
379
|
+
# Windows (via RubyInstaller)
|
|
380
|
+
ridk install
|
|
381
|
+
```
|
|
382
|
+
|
|
383
|
+
### "Could not find Kreuzberg"
|
|
384
|
+
|
|
385
|
+
Reinstall the gem:
|
|
386
|
+
|
|
387
|
+
```bash
|
|
388
|
+
gem uninstall kreuzberg
|
|
389
|
+
gem install kreuzberg --no-document
|
|
390
|
+
```
|
|
391
|
+
|
|
392
|
+
### OCR not working
|
|
393
|
+
|
|
394
|
+
Verify Tesseract is installed:
|
|
395
|
+
|
|
396
|
+
```bash
|
|
397
|
+
tesseract --version
|
|
398
|
+
```
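The same check can be made from Ruby before enabling the Tesseract backend. A minimal sketch, assuming a Unix-like system where the binary is named `tesseract` (Windows would also need the `PATHEXT` extensions). `executable_on_path?` is a hypothetical helper, not part of Kreuzberg:

```ruby
# Hypothetical helper: true if an executable file called `name` exists in
# any directory listed in `path_env` (defaults to the current PATH).
def executable_on_path?(name, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

unless executable_on_path?('tesseract')
  warn 'tesseract was not found on PATH; OCR with the tesseract backend will likely fail'
end
```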
|
|
399
|
+
|
|
400
|
+
## Examples
|
|
401
|
+
|
|
402
|
+
### Process Directory of PDFs
|
|
403
|
+
|
|
404
|
+
```ruby
|
|
405
|
+
require 'kreuzberg'
|
|
406
|
+
require 'pathname'
|
|
407
|
+
|
|
408
|
+
Dir.glob("documents/*.pdf").each do |file|
|
|
409
|
+
puts "Processing: #{file}"
|
|
410
|
+
result = Kreuzberg.extract_file(file)
|
|
411
|
+
puts " Content length: #{result.content.length}"
|
|
412
|
+
puts " Languages: #{result.detected_languages}"
|
|
413
|
+
end
|
|
414
|
+
```
|
|
415
|
+
|
|
416
|
+
### Extract and Parse Structured Data
|
|
417
|
+
|
|
418
|
+
```ruby
|
|
419
|
+
require 'kreuzberg'
|
|
420
|
+
require 'json'
|
|
421
|
+
|
|
422
|
+
result = Kreuzberg.extract_file("data.pdf")
|
|
423
|
+
|
|
424
|
+
# Parse content as JSON (if applicable)
|
|
425
|
+
begin
|
|
426
|
+
data = JSON.parse(result.content)
|
|
427
|
+
puts "Parsed data: #{data}"
|
|
428
|
+
rescue JSON::ParserError
|
|
429
|
+
puts "Content is not JSON"
|
|
430
|
+
end
|
|
431
|
+
```
|
|
432
|
+
|
|
433
|
+
### Save Extracted Images
|
|
434
|
+
|
|
435
|
+
```ruby
|
|
436
|
+
require 'kreuzberg'
|
|
437
|
+
|
|
438
|
+
config = Kreuzberg::ExtractionConfig.new(
|
|
439
|
+
images: Kreuzberg::ImageExtractionConfig.new(
|
|
440
|
+
extract_images: true
|
|
441
|
+
)
|
|
442
|
+
)
|
|
443
|
+
|
|
444
|
+
result = Kreuzberg.extract_file("document.pdf", config: config)
|
|
445
|
+
|
|
446
|
+
result.images&.each_with_index do |image, index|
|
|
447
|
+
File.binwrite("image_#{index}.png", image.data)
|
|
448
|
+
end
|
|
449
|
+
```
|
|
450
|
+
|
|
451
|
+
## Documentation
|
|
452
|
+
|
|
453
|
+
For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev).
|
|
454
|
+
|
|
455
|
+
## Part of Kreuzberg.dev
|
|
456
|
+
|
|
457
|
+
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
|
|
458
|
+
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
|
|
459
|
+
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
|
|
460
|
+
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
|
|
461
|
+
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
|
|
462
|
+
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
|
|
463
|
+
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
|
|
464
|
+
|
|
465
|
+
## License
|
|
466
|
+
|
|
467
|
+
Elastic License 2.0 (Elastic-2.0). See [LICENSE](../../LICENSE) for details.
|