kreuzberg 5.0.0.pre.rc.1 → 5.0.0.pre.rc.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1a69253c4b216e8cde6ee74a09384688fb16d3d6feffc038f6ce08799aa8e4d1
4
- data.tar.gz: 6341d37dc7bfe4073461b89ae53ec5ccb6924f95745190299a3809ed2e3514ec
3
+ metadata.gz: 6b530ee610c625a8bd7e88f12e281aee8d5814d9ad7d26d6c2c839db83fc1563
4
+ data.tar.gz: 3ad761c213024ce192541e198da052302d66de1f7e682e872043fdd02d53dad9
5
5
  SHA512:
6
- metadata.gz: c3127d63e11cd51f5b961d9ccc15b822af486f16ae3fc67cd651d3403f6192f6d12ea550070d2ce4c45bc2a0472e1e4ff930853315c03f6a32a91883e7ae7f59
7
- data.tar.gz: 2b4a0bb3d9a44a2d66c7b58f4b91f2d51bd7ef59e5d0f69bd3561e012580c3548372f0ae007f9a9b07381509992bf01dca80116d3f828a3b54a44396e9c73e5d
6
+ metadata.gz: 183448573cf8e5c42f4ea9a620a1d4e390871d82750b9b703d066ad7146dc8f3f67b27da0afb5e45714544f00c53e4beaf54024406b8cc44c4f7a8ca96bb3266
7
+ data.tar.gz: 6a2e7a72318b0ddb2de2044940d7ef4ab84d2f20dd19bee0ff6ddcb2479ef5e1d5cf5118a83586214d4ba379a320d4bcc23ca04406c31f174b4b05752f6a8713
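The entries above are SHA256 and SHA512 digests of the `metadata.gz` and `data.tar.gz` members of the `.gem` file, which is a plain tar archive. A minimal sketch of checking one member against its recorded digest follows. It assumes the member's bytes have already been read from the unpacked archive, and `checksum_matches?` is a hypothetical helper name, not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper: compare the digest of `bytes` with the hex string
# recorded in checksums.yaml. `algorithm` is "SHA256" or "SHA512".
def checksum_matches?(bytes, expected_hex, algorithm: "SHA256")
  digest_class = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }.fetch(algorithm)
  digest_class.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in data for demonstration (not the real gem contents).
sample = "hello"
expected = Digest::SHA256.hexdigest(sample)
puts checksum_matches?(sample, expected)        # digest of unchanged bytes
puts checksum_matches?(sample + "!", expected)  # digest of altered bytes
```

In practice the bytes would come from something like `File.binread("data.tar.gz")` after extracting the `.gem` with `tar`.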
data/LICENSE ADDED
@@ -0,0 +1,93 @@
1
+ Elastic License 2.0 (ELv2)
2
+
3
+ Copyright 2025-2026 Kreuzberg, Inc.
4
+
5
+ Acceptance
6
+
7
+ By using the software, you agree to all of the terms and conditions below.
8
+
9
+ Copyright License
10
+
11
+ The licensor grants you a non-exclusive, royalty-free, worldwide,
12
+ non-sublicensable, non-transferable license to use, copy, distribute, make
13
+ available, and prepare derivative works of the software, in each case subject to
14
+ the limitations and conditions below.
15
+
16
+ Limitations
17
+
18
+ You may not provide the software to third parties as a hosted or managed
19
+ service, where the service provides users with access to any substantial set of
20
+ the features or functionality of the software.
21
+
22
+ You may not move, change, disable, or circumvent the license key functionality
23
+ in the software, and you may not remove or obscure any functionality in the
24
+ software that is protected by the license key.
25
+
26
+ You may not alter, remove, or obscure any licensing, copyright, or other notices
27
+ of the licensor in the software. Any use of the licensor's trademarks is subject
28
+ to applicable law.
29
+
30
+ Patents
31
+
32
+ The licensor grants you a license, under any patent claims the licensor can
33
+ license, or becomes able to license, to make, have made, use, sell, offer for
34
+ sale, import and have imported the software, in each case subject to the
35
+ limitations and conditions in this license. This license does not cover any
36
+ patent claims that you cause to be infringed by modifications or additions to the
37
+ software. If you or your company make any written claim that the software
38
+ infringes or contributes to infringement of any patent, your patent license for
39
+ the software granted under these terms ends immediately. If your company makes
40
+ such a claim, your patent license ends immediately for work on behalf of your
41
+ company.
42
+
43
+ Notices
44
+
45
+ You must ensure that anyone who gets a copy of any part of the software from you
46
+ also gets a copy of these terms.
47
+
48
+ If you modify the software, you must include in any modified copies of the
49
+ software prominent notices stating that you have modified the software.
50
+
51
+ No Other Rights
52
+
53
+ These terms do not imply any licenses other than those expressly granted in
54
+ these terms.
55
+
56
+ Termination
57
+
58
+ If you use the software in violation of these terms, such use is not licensed,
59
+ and your licenses will automatically terminate. If the licensor provides you with
60
+ a notice of your violation, and you cease all violation of this license no later
61
+ than 30 days after you receive that notice, your licenses will be reinstated
62
+ retroactively. However, if you violate these terms after such reinstatement, any
63
+ additional violation of these terms will cause your licenses to terminate
64
+ automatically and permanently.
65
+
66
+ No Liability
67
+
68
+ As far as the law allows, the software comes as is, without any warranty or
69
+ condition, and the licensor will not be liable to you for any damages arising out
70
+ of these terms or the use or nature of the software, under any kind of legal
71
+ claim.
72
+
73
+ Definitions
74
+
75
+ The licensor is the entity offering these terms, and the software is the
76
+ software the licensor makes available under these terms, including any portion
77
+ of it.
78
+
79
+ you refers to the individual or entity agreeing to these terms.
80
+
81
+ your company is any legal entity, sole proprietorship, or other kind of
82
+ organization that you work for, plus all organizations that have control over,
83
+ are under the control of, or are under common control with that organization.
84
+ control means ownership of substantially all the assets of an entity, or the
85
+ power to direct its management and policies by vote, contract, or otherwise.
86
+ Control can be direct or indirect.
87
+
88
+ your licenses are all the licenses granted to you for the software under these
89
+ terms.
90
+
91
+ use means anything you do with the software requiring one of your licenses.
92
+
93
+ trademark means trademarks, service marks, and similar rights.
data/README.md ADDED
@@ -0,0 +1,467 @@
1
+ # Kreuzberg for Ruby
2
+
3
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
4
+ <a href="https://github.com/kreuzberg-dev/alef">
5
+ <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
6
+ </a>
7
+ <!-- Language Bindings -->
8
+ <a href="https://crates.io/crates/kreuzberg">
9
+ <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
10
+ </a>
11
+ <a href="https://pypi.org/project/kreuzberg/">
12
+ <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
13
+ </a>
14
+ <a href="https://www.npmjs.com/package/@kreuzberg/node">
15
+ <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
16
+ </a>
17
+ <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
18
+ <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
19
+ </a>
20
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
21
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
22
+ </a>
23
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
24
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
25
+ </a>
26
+ <a href="https://www.nuget.org/packages/Kreuzberg/">
27
+ <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
28
+ </a>
29
+ <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
30
+ <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
31
+ </a>
32
+ <a href="https://rubygems.org/gems/kreuzberg">
33
+ <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
34
+ </a>
35
+ <a href="https://hex.pm/packages/kreuzberg">
36
+ <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
37
+ </a>
38
+ <a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
39
+ <img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
40
+ </a>
41
+ <a href="https://pub.dev/packages/kreuzberg">
42
+ <img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
43
+ </a>
44
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
45
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
46
+ </a>
47
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
48
+ <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
49
+ </a>
50
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
51
+ <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
52
+ </a>
53
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
54
+ <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
55
+ </a>
56
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
57
+ <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
58
+ </a>
59
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
60
+ <img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
61
+ </a>
62
+
63
+ <!-- Project Info -->
64
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
65
+ <img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
66
+ </a>
67
+ <a href="https://docs.kreuzberg.dev">
68
+ <img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
69
+ </a>
70
+ <a href="https://huggingface.co/Kreuzberg">
71
+ <img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
72
+ </a>
73
+ </div>
74
+
75
+ <div align="center" style="margin: 24px 0 0;">
76
+ <a href="https://kreuzberg.dev">
77
+ <img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
78
+ </a>
79
+ </div>
80
+
81
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
82
+ <a href="https://discord.gg/xt9WY3GnKR">
83
+ <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
84
+ </a>
85
+ <a href="https://docs.kreuzberg.dev/demo.html">
86
+ <img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
87
+ </a>
88
+ </div>
89
+
90
+ Extract text, tables, images, and metadata from 90+ file formats (including PDF, Office documents, and images) and 300+ programming languages. The Ruby bindings provide an idiomatic API with native performance.
91
+
92
+ ## What This Package Provides
93
+
94
+ - **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
95
+ - **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
96
+ - **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
97
+ - **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
98
+
99
+ ## Installation
100
+
101
+ Add to your Gemfile:
102
+
103
+ ```ruby
104
+ gem 'kreuzberg'
105
+ ```
106
+
107
+ Then execute:
108
+
109
+ ```bash
110
+ bundle install
111
+ ```
112
+
113
+ Or install it directly:
114
+
115
+ ```bash
116
+ gem install kreuzberg
117
+ ```
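Both versions compared in this diff are pre-releases, and RubyGems skips pre-release versions unless asked for them (for example with `gem install kreuzberg --pre`, or an explicit pre-release version constraint in the Gemfile). The sketch below uses `Gem::Version` from the standard library to show how these version strings are classified and ordered. The variable names are illustrative only.

```ruby
require "rubygems"

old_version = Gem::Version.new("5.0.0.pre.rc.1")
new_version = Gem::Version.new("5.0.0.pre.rc.7")
final       = Gem::Version.new("5.0.0")

puts new_version.prerelease?    # true: the version string contains letters
puts new_version > old_version  # true: rc.7 sorts after rc.1
puts new_version < final        # true: pre-releases sort before 5.0.0
```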
118
+
119
+ ## Quick Start
120
+
121
+ ### Basic Usage
122
+
123
+ ```ruby
124
+ require 'kreuzberg'
125
+
126
+ # Simple synchronous extraction
127
+ result = Kreuzberg.extract_file("document.pdf")
128
+ puts result.content
129
+ ```
130
+
131
+ ### Async Extraction
132
+
133
+ ```ruby
134
+ require 'kreuzberg'
135
+
136
+ # Using Fiber for concurrency (Ruby 3.0+)
137
+ Fiber.new do
138
+   result = Kreuzberg.extract_file_async("document.pdf")
139
+   puts result.content
140
+ end.resume
141
+ ```
142
+
143
+ ### Batch Processing
144
+
145
+ ```ruby
146
+ require 'kreuzberg'
147
+
148
+ files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
149
+
150
+ results = files.map { |file| Kreuzberg.extract_file(file) }
151
+
152
+ results.each do |result|
153
+   puts "Content length: #{result.content.length}"
154
+ end
155
+ ```
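The `map` loop above stops at the first file that raises. A minimal sketch of a more forgiving batch loop follows, which records failures per file instead of aborting. `extract_each` and `fake_extractor` are hypothetical names introduced here; the stand-in lambda takes the place of a real call such as `Kreuzberg.extract_file(path)`, and the error classes the gem actually raises are not shown in this diff.

```ruby
# Run `extractor` over each path, collecting results and errors separately
# so that one unreadable file does not abort the whole batch.
def extract_each(paths, extractor)
  results = {}
  errors = {}
  paths.each do |path|
    results[path] = extractor.call(path)
  rescue StandardError => e
    errors[path] = e.message
  end
  [results, errors]
end

# Stand-in extractor for demonstration only; in real use this might be
# ->(path) { Kreuzberg.extract_file(path) } (assumption, not verified here).
fake_extractor = lambda do |path|
  raise ArgumentError, "unsupported file: #{path}" if path.end_with?(".bin")

  "text of #{path}"
end

results, errors = extract_each(["doc1.pdf", "blob.bin", "doc3.xlsx"], fake_extractor)
puts "Extracted: #{results.keys.join(', ')}"
puts "Failed: #{errors.keys.join(', ')}"
```

Keeping results and errors in separate hashes keyed by path makes it easy to retry only the failed files later.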
156
+
157
+ ## Configuration
158
+
159
+ ```ruby
160
+ require 'kreuzberg'
161
+
162
+ config = Kreuzberg::ExtractionConfig.new(
163
+   use_cache: true,
164
+   enable_quality_processing: true,
165
+   ocr: Kreuzberg::OcrConfig.new(
166
+     backend: 'tesseract',
167
+     language: 'eng'
168
+   )
169
+ )
170
+
171
+ result = Kreuzberg.extract_file("document.pdf", config: config)
172
+ puts result.content
173
+ ```
174
+
175
+ ## OCR Support
176
+
177
+ ### Tesseract Configuration
178
+
179
+ ```ruby
180
+ require 'kreuzberg'
181
+
182
+ config = Kreuzberg::ExtractionConfig.new(
183
+   ocr: Kreuzberg::OcrConfig.new(
184
+     backend: 'tesseract',
185
+     language: 'eng',
186
+     tesseract_config: Kreuzberg::TesseractConfig.new(
187
+       psm: 6,
188
+       enable_table_detection: true
189
+     )
190
+   )
191
+ )
192
+
193
+ result = Kreuzberg.extract_file("scanned.pdf", config: config)
194
+ puts result.content
195
+ ```
196
+
197
+ ## Table Extraction
198
+
199
+ ```ruby
200
+ require 'kreuzberg'
201
+
202
+ config = Kreuzberg::ExtractionConfig.new(
203
+   ocr: Kreuzberg::OcrConfig.new(
204
+     backend: 'tesseract',
205
+     tesseract_config: Kreuzberg::TesseractConfig.new(
206
+       enable_table_detection: true
207
+     )
208
+   )
209
+ )
210
+
211
+ result = Kreuzberg.extract_file("invoice.pdf", config: config)
212
+
213
+ result.tables.each_with_index do |table, index|
214
+   puts "Table #{index}:"
215
+   puts table.markdown
216
+ end
217
+ ```
218
+
219
+ ## Metadata Extraction
220
+
221
+ ```ruby
222
+ require 'kreuzberg'
223
+
224
+ result = Kreuzberg.extract_file("document.pdf")
225
+
226
+ # PDF metadata
227
+ if result.metadata[:pdf]
228
+   pdf_meta = result.metadata[:pdf]
229
+   puts "Title: #{pdf_meta[:title]}"
230
+   puts "Author: #{pdf_meta[:author]}"
231
+   puts "Pages: #{pdf_meta[:page_count]}"
232
+ end
233
+
234
+ # Detected languages
235
+ puts "Languages: #{result.detected_languages}"
236
+
237
+ # Images
238
+ if result.images
239
+   puts "Images found: #{result.images.count}"
240
+ end
241
+ ```
242
+
243
+ ## Text Chunking
244
+
245
+ ```ruby
246
+ require 'kreuzberg'
247
+
248
+ config = Kreuzberg::ExtractionConfig.new(
249
+   chunking: Kreuzberg::ChunkingConfig.new(
250
+     max_chars: 1000,
251
+     max_overlap: 200
252
+   )
253
+ )
254
+
255
+ result = Kreuzberg.extract_file("long_document.pdf", config: config)
256
+
257
+ result.chunks.each_with_index do |chunk, index|
258
+   puts "Chunk #{index}: #{chunk.length} characters"
259
+ end
260
+ ```
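To make the two parameters above concrete, here is a small pure-Ruby illustration of what a size limit and an overlap mean for a character-window splitter. It is only a sketch of the general idea: `sliding_chunks` is a hypothetical function, and it is not Kreuzberg's chunking algorithm, which may split on different boundaries.

```ruby
# Illustration only: split `text` into windows of at most `max_chars`
# characters, where each window starts `max_chars - max_overlap` characters
# after the previous one, so neighbours share up to `max_overlap` characters.
def sliding_chunks(text, max_chars:, max_overlap:)
  unless max_chars.positive? && max_overlap >= 0 && max_overlap < max_chars
    raise ArgumentError, "need 0 <= max_overlap < max_chars"
  end

  step = max_chars - max_overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, max_chars]
    break if start + max_chars >= text.length # last window reached the end

    start += step
  end
  chunks
end

p sliding_chunks("abcdefghij", max_chars: 4, max_overlap: 1)
# prints ["abcd", "defg", "ghij"]
```

Each chunk shares its first character with the end of the previous one, which is the role `max_overlap` plays: context at a chunk boundary is repeated rather than cut.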
261
+
262
+ ## Password-Protected PDFs
263
+
264
+ ```ruby
265
+ require 'kreuzberg'
266
+
267
+ config = Kreuzberg::ExtractionConfig.new(
268
+   pdf_options: Kreuzberg::PdfConfig.new(
269
+     passwords: ["password1", "password2"]
270
+   )
271
+ )
272
+
273
+ result = Kreuzberg.extract_file("protected.pdf", config: config)
274
+ puts result.content
275
+ ```
276
+
277
+ ## Language Detection
278
+
279
+ ```ruby
280
+ require 'kreuzberg'
281
+
282
+ config = Kreuzberg::ExtractionConfig.new(
283
+   language_detection: Kreuzberg::LanguageDetectionConfig.new(
284
+     enabled: true
285
+   )
286
+ )
287
+
288
+ result = Kreuzberg.extract_file("multilingual.pdf", config: config)
289
+ puts "Detected languages: #{result.detected_languages}"
290
+ ```
291
+
292
+ ## API Reference
293
+
294
+ ### Main Methods
295
+
296
+ - `Kreuzberg.extract_file(path, config: nil)` – Extract from file
297
+ - `Kreuzberg.extract_file_async(path, config: nil)` – Async extraction
298
+ - `Kreuzberg.extract_bytes(data, mime_type, config: nil)` – Extract from bytes
299
+ - `Kreuzberg.batch_extract_files(paths, config: nil)` – Batch processing
300
+
301
+ ### Configuration Classes
302
+
303
+ - `ExtractionConfig` – Main configuration
304
+ - `OcrConfig` – OCR settings
305
+ - `TesseractConfig` – Tesseract-specific options
306
+ - `ChunkingConfig` – Text chunking settings
307
+ - `PdfConfig` – PDF-specific options
308
+ - `LanguageDetectionConfig` – Language detection settings
309
+
310
+ ### Result Object
311
+
312
+ - `content` – Extracted text
313
+ - `metadata` – File metadata as Hash
314
+ - `tables` – Array of ExtractedTable objects
315
+ - `detected_languages` – Array of language codes
316
+ - `chunks` – Array of text chunks
317
+ - `images` – Array of extracted images (if enabled)
318
+
319
+ ## System Requirements
320
+
321
+ ### Ruby Version
322
+
323
+ - **Ruby 3.2.0 or higher** (including Ruby 4.x)
324
+ - Ruby 4.0+ is fully supported with no code changes required
325
+ - Magnus bindings compile successfully on all supported Ruby versions
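A minimal sketch of checking the Ruby 3.2.0 minimum at runtime, using `Gem::Version` from the standard library. `ruby_supported?` is a hypothetical helper introduced for illustration; the gem's own gemspec presumably enforces the requirement at install time.

```ruby
require "rubygems"

# Return true when `version_string` meets the README's stated minimum
# of Ruby 3.2.0. Defaults to the running interpreter's version.
def ruby_supported?(version_string = RUBY_VERSION)
  Gem::Version.new(version_string) >= Gem::Version.new("3.2.0")
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
```

Comparing `Gem::Version` objects avoids the pitfalls of comparing version strings directly, where `"3.10.0" < "3.2.0"` would be true.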
326
+
327
+ ### Required
328
+
329
+ - Rust toolchain (for native extension compilation)
330
+
331
+ ### Optional
332
+
333
+ ```bash
334
+ # Tesseract OCR
335
+ brew install tesseract # macOS
336
+ sudo apt-get install tesseract-ocr # Ubuntu/Debian
337
+ ```
338
+
339
+ ### Ruby 4.0 Compatibility
340
+
341
+ Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
342
+
343
+ - **Ruby Box** - Experimental, opt-in isolation of definitions
344
+ - **ZJIT Compiler** - Enhanced JIT compilation for faster execution
345
+ - **Ractor Improvements** - Better multi-threaded document processing
346
+ - **Set Promoted to Core** - No changes needed for Kreuzberg
347
+
348
+ All tests pass on Ruby 4.0.1, and the gem compiles without any code changes.
349
+
350
+ ## Development
351
+
352
+ Clone and set up:
353
+
354
+ ```bash
355
+ git clone https://github.com/kreuzberg-dev/kreuzberg.git
356
+ cd kreuzberg
357
+ bundle install
358
+ ```
359
+
360
+ Run tests:
361
+
362
+ ```bash
363
+ rake test
364
+ ```
365
+
366
+ ## Troubleshooting
367
+
368
+ ### Native extension compilation error
369
+
370
+ Ensure build tools are installed:
371
+
372
+ ```bash
373
+ # macOS
374
+ xcode-select --install
375
+
376
+ # Ubuntu/Debian
377
+ sudo apt-get install build-essential ruby-dev
378
+
379
+ # Windows (via RubyInstaller)
380
+ ridk install
381
+ ```
382
+
383
+ ### "Could not find Kreuzberg"
384
+
385
+ Reinstall the gem:
386
+
387
+ ```bash
388
+ gem uninstall kreuzberg
389
+ gem install kreuzberg --no-document
390
+ ```
391
+
392
+ ### OCR not working
393
+
394
+ Verify Tesseract is installed:
395
+
396
+ ```bash
397
+ tesseract --version
398
+ ```
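The same check can be made from Ruby before enabling OCR. The sketch below searches `PATH` for an executable by name; `find_executable` is a hypothetical helper written here for illustration and is not part of the Kreuzberg API.

```ruby
# Return the full path of `name` if an executable with that name is found
# in one of the directories in `path_env`, or nil otherwise.
# On Windows, the extensions listed in PATHEXT are tried as well.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  extensions = ENV["PATHEXT"] ? ENV["PATHEXT"].split(";") : [""]
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

tesseract = find_executable("tesseract")
puts tesseract ? "Tesseract found at #{tesseract}" : "Tesseract not found on PATH"
```

A `nil` result suggests installing Tesseract with one of the commands under "Optional" above, or fixing `PATH`.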
399
+
400
+ ## Examples
401
+
402
+ ### Process Directory of PDFs
403
+
404
+ ```ruby
405
+ require 'kreuzberg'
406
+ require 'pathname'
407
+
408
+ Dir.glob("documents/*.pdf").each do |file|
409
+   puts "Processing: #{file}"
410
+   result = Kreuzberg.extract_file(file)
411
+   puts "  Content length: #{result.content.length}"
412
+   puts "  Language: #{result.detected_languages}"
413
+ end
414
+ ```
415
+
416
+ ### Extract and Parse Structured Data
417
+
418
+ ```ruby
419
+ require 'kreuzberg'
420
+ require 'json'
421
+
422
+ result = Kreuzberg.extract_file("data.pdf")
423
+
424
+ # Parse content as JSON (if applicable)
425
+ begin
426
+   data = JSON.parse(result.content)
427
+   puts "Parsed data: #{data}"
428
+ rescue JSON::ParserError
429
+   puts "Content is not JSON"
430
+ end
431
+ ```
432
+
433
+ ### Save Extracted Images
434
+
435
+ ```ruby
436
+ require 'kreuzberg'
437
+
438
+ config = Kreuzberg::ExtractionConfig.new(
439
+   images: Kreuzberg::ImageExtractionConfig.new(
440
+     extract_images: true
441
+   )
442
+ )
443
+
444
+ result = Kreuzberg.extract_file("document.pdf", config: config)
445
+
446
+ result.images&.each_with_index do |image, index|
447
+   # Binary mode avoids newline translation of image bytes on Windows
448
+   File.binwrite("image_#{index}.png", image.data)
449
+ end
449
+ ```
450
+
451
+ ## Documentation
452
+
453
+ For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev).
454
+
455
+ ## Part of Kreuzberg.dev
456
+
457
+ - [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
458
+ - [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
459
+ - [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
460
+ - [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
461
+ - [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
462
+ - [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
463
+ - [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
464
+
465
+ ## License
466
+
467
+ Elastic-2.0 License - see [LICENSE](../../LICENSE) for details.