xberg 0.1.0 → 1.0.0.pre.rc.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 4613ff8d41c5a493c91850c6232ad4284e88c7357c00d01280ed5fc6e96037ef
-   data.tar.gz: f1ce432eae2afeaffd51e295f38662860f08952287104e1e7378f22620590f21
+   metadata.gz: 6640de3913281d734137cc0362ade67d82a742950c938060882e8a86acb7a855
+   data.tar.gz: ae86759fa1d1d9a68470bf43bedad36565ac5e9ed8a2eebac0d05370223b74d8
  SHA512:
-   metadata.gz: efd689846e2edd2d73f6ddaa722a3be3d9ac20aa14c38cfb3ae15c6d3fae081872242a81173de03dffaafba6075370281fe9a51079ea387a53585e4937d4e8b1
-   data.tar.gz: d654949cb1cd1524a4e7302d8724aa344f056cdec560144ba6f3a496b4086a28ce89a50deb9acb95e7129c2daa75f825ef7cdf9ca9b0fb085553a963903e86a7
+   metadata.gz: 320bf2e2ee031b84aa12615552f603c20c7c80578df0675ce7fd0d5911a7967699dd5ad28b895decf523251f82f0a79408750e8e16895c733538cb20c6443569
+   data.tar.gz: b5350869b0bea09c769f1b7bd6a250911ce7dc400b4f7477a7050cd2163c469248b6a885704eea5469e6575284bdc7b648d032a39a65849ee4e105b3d2d5293f
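The hunk above swaps the SHA256 and SHA512 digests recorded for `metadata.gz` and `data.tar.gz`. As a minimal sketch of how such an entry could be checked by hand, the helper below hashes a byte string with Ruby's standard `digest` library and compares it to an expected hex digest. The helper name `checksum_matches?` and the sample payload are illustrative assumptions, not part of the package.

```ruby
require 'digest'

# Hypothetical helper (not part of xberg or RubyGems): returns true when the
# hex digest of `bytes` under the chosen algorithm equals `expected_hex`.
def checksum_matches?(bytes, expected_hex, algorithm: :sha256)
  digest_class = { sha256: Digest::SHA256, sha512: Digest::SHA512 }.fetch(algorithm)
  digest_class.hexdigest(bytes) == expected_hex.downcase
end

# Illustrative payload. For a real check, read the archive member instead,
# for example File.binread("data.tar.gz") after unpacking the .gem file.
payload = "hello"
expected = Digest::SHA256.hexdigest(payload)

puts checksum_matches?(payload, expected)        # unchanged bytes
puts checksum_matches?(payload + "!", expected)  # altered bytes
puts checksum_matches?(payload, Digest::SHA512.hexdigest(payload), algorithm: :sha512)
```

Comparing both digests, as `checksums.yaml` records them, guards against a mismatch that a single algorithm check could miss only in theory, so one of the two is usually enough for a manual spot check.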
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025-2026 Kreuzberg, Inc.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md CHANGED
@@ -1,11 +1,479 @@
- # xberg
+ # Xberg for Ruby
 
- `xberg` is the Xberg package alias for
- [Kreuzberg](https://rubygems.org/gems/kreuzberg).
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+   <a href="https://github.com/xberg-io/alef">
+     <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+   </a>
+   <!-- Language Bindings -->
+   <a href="https://crates.io/crates/xberg">
+     <img src="https://img.shields.io/crates/v/xberg?label=Rust&color=007ec6" alt="Rust">
+   </a>
+   <a href="https://pypi.org/project/xberg/">
+     <img src="https://img.shields.io/pypi/v/xberg?label=Python&color=007ec6" alt="Python">
+   </a>
+   <a href="https://www.npmjs.com/package/@xberg-io/xberg">
+     <img src="https://img.shields.io/npm/v/@xberg-io/xberg?label=Node.js&color=007ec6" alt="Node.js">
+   </a>
+   <a href="https://www.npmjs.com/package/@xberg-io/xberg-wasm">
+     <img src="https://img.shields.io/npm/v/@xberg-io/xberg-wasm?label=WASM&color=007ec6" alt="WASM">
+   </a>
+   <a href="https://central.sonatype.com/artifact/io.xberg/xberg">
+     <img src="https://img.shields.io/maven-central/v/io.xberg/xberg?label=Java&color=007ec6" alt="Java">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/tree/main/packages/go">
+     <img src="https://img.shields.io/github/v/tag/xberg-io/xberg?label=Go&color=007ec6&filter=v1*" alt="Go">
+   </a>
+   <a href="https://www.nuget.org/packages/Xberg/">
+     <img src="https://img.shields.io/nuget/v/Xberg?label=C%23&color=007ec6" alt="C#">
+   </a>
+   <a href="https://packagist.org/packages/xberg-io/xberg">
+     <img src="https://img.shields.io/packagist/v/xberg-io/xberg?label=PHP&color=007ec6" alt="PHP">
+   </a>
+   <a href="https://rubygems.org/gems/xberg">
+     <img src="https://img.shields.io/gem/v/xberg?label=Ruby&color=007ec6" alt="Ruby">
+   </a>
+   <a href="https://hex.pm/packages/xberg">
+     <img src="https://img.shields.io/hexpm/v/xberg?label=Elixir&color=007ec6" alt="Elixir">
+   </a>
+   <a href="https://xberg-io.r-universe.dev/xberg">
+     <img src="https://img.shields.io/badge/R-xberg-007ec6" alt="R">
+   </a>
+   <a href="https://pub.dev/packages/xberg">
+     <img src="https://img.shields.io/pub/v/xberg?label=Dart&color=007ec6" alt="Dart">
+   </a>
+   <a href="https://central.sonatype.com/artifact/io.xberg/xberg-android">
+     <img src="https://img.shields.io/maven-central/v/io.xberg/xberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/tree/main/packages/swift">
+     <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/tree/main/packages/zig">
+     <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/releases">
+     <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/pkgs/container/xberg">
+     <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+   </a>
+   <!-- Project Info -->
+   <a href="https://github.com/xberg-io/xberg/blob/main/LICENSE">
+     <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
+   </a>
+   <a href="https://docs.xberg.io">
+     <img src="https://img.shields.io/badge/Docs-xberg-007ec6" alt="Documentation">
+   </a>
+   <a href="https://huggingface.co/xberg-io">
+     <img src="https://img.shields.io/badge/Hugging%20Face-Xberg-007ec6" alt="Hugging Face">
+   </a>
+ </div>
 
- The canonical Ruby gem remains `kreuzberg` today. This gem reserves the Xberg
- namespace and loads the same API.
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+   <a href="https://discord.gg/xt9WY3GnKR">
+     <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+   </a>
+   <a href="https://docs.xberg.io/demo.html">
+     <img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
+   </a>
+   <a href="https://github.com/xberg-io/xberg/stargazers">
+     <img height="22" src="https://img.shields.io/github/stars/xberg-io/xberg?style=social" alt="GitHub Stars">
+   </a>
+ </div>
+
+ Extract text, tables, images, metadata, and code intelligence from 96 file formats and 306 programming languages, including PDF, Office documents, images, and audio/video transcripts where native transcription is available. The Ruby bindings pair an idiomatic Ruby API with native performance.
+
+ ## What This Package Provides
+
+ - **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
+ - **Structured results** — an `ExtractionResult` envelope with `ExtractedDocument` items, errors, and summary counts.
+ - **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
+ - **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
+
+ ## Installation
+
+ Add to your Gemfile:
+
+ ```ruby
+ gem 'xberg'
+ ```
+
+ Then execute:
+
+ ```bash
+ bundle install
+ ```
+
+ Or install it directly:
+
+ ```bash
+ gem install xberg
+ ```
+
+ ## Quick Start
+
+ ### Basic Usage
+
+ ```ruby
+ require 'xberg'
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+ output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+ document = output.results.first
+
+ puts document.content
+ puts "Results: #{output.summary.results}"
+ ```
+
+ ### Batch Processing
+
+ ```ruby
+ require 'xberg'
+
+ bytes = File.binread("doc3.txt")
+ inputs = [
+   Xberg::ExtractInput.new(kind: "uri", uri: "doc1.pdf"),
+   Xberg::ExtractInput.new(kind: "uri", uri: "doc2.docx"),
+   Xberg::ExtractInput.new(
+     kind: "bytes",
+     bytes: bytes,
+     mime_type: "text/plain",
+     filename: "doc3.txt"
+   ),
+ ]
+
+ output = Xberg.extract_batch(inputs, Xberg::ExtractionConfig.new)
+
+ output.results.each do |document|
+   puts "Content length: #{document.content.length}"
+ end
+ ```
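The examples in this README call `output.results.first` and then read `content` directly, which raises `NoMethodError` if `results` is empty. The sketch below shows one defensive pattern. It assumes that a failed input leaves `results` empty and is reported in `errors`, which the README implies but does not spell out. `FakeDocument`, `FakeOutput`, `first_content`, and the sample error string are hypothetical stand-ins so the snippet runs without the gem.

```ruby
# Stand-ins that mirror only the fields named in this README
# (`results`, `errors`, `content`); they are not the real Xberg classes.
FakeDocument = Struct.new(:content)
FakeOutput = Struct.new(:results, :errors)

# Returns the first document's content, or nil when nothing was extracted.
def first_content(output)
  document = output.results.first
  return nil if document.nil?

  document.content
end

ok = FakeOutput.new([FakeDocument.new("hello")], [])
failed = FakeOutput.new([], ["doc2.docx: unsupported format"]) # made-up error text

[ok, failed].each do |output|
  content = first_content(output)
  if content
    puts "Content length: #{content.length}"
  else
    puts "No document extracted (#{output.errors.length} error(s))"
  end
end
```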
+
+ ## Configuration
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   use_cache: true,
+   enable_quality_processing: true,
+   ocr: Xberg::OcrConfig.new(
+     backend: 'tesseract',
+     language: 'eng'
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ puts document.content
+ ```
+
+ ## OCR Support
+
+ ### Tesseract Configuration
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   ocr: Xberg::OcrConfig.new(
+     backend: 'tesseract',
+     language: 'eng',
+     tesseract_config: Xberg::TesseractConfig.new(
+       psm: 6,
+       enable_table_detection: true
+     )
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "scanned.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ puts document.content
+ ```
+
+ ## Table Extraction
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   ocr: Xberg::OcrConfig.new(
+     backend: 'tesseract',
+     tesseract_config: Xberg::TesseractConfig.new(
+       enable_table_detection: true
+     )
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "invoice.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ document.tables.each_with_index do |table, index|
+   puts "Table #{index}:"
+   puts table.markdown
+ end
+ ```
+
+ ## Metadata Extraction
+
+ ```ruby
+ require 'xberg'
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+ output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+ document = output.results.first
+
+ metadata = document.metadata
+ puts "Title: #{metadata.title}" if metadata&.title
+ if metadata&.authors
+   puts "Authors: #{metadata.authors.join(', ')}"
+ end
+
+ puts "Languages: #{document.detected_languages}"
+
+ if document.images
+   puts "Images found: #{document.images.count}"
+ end
+ ```
+
+ ## Text Chunking
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   chunking: Xberg::ChunkingConfig.new(
+     max_chars: 1000,
+     max_overlap: 200
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "long_document.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ document.chunks.each_with_index do |chunk, index|
+   puts "Chunk #{index}: #{chunk.content.length} characters"
+ end
+ ```
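To make the `max_chars` and `max_overlap` settings concrete, here is a minimal pure-Ruby sketch of fixed-size chunking with overlap. It is an illustration of the general technique only: the function name `sliding_chunks` is hypothetical, and Xberg's actual chunker may split differently, for example on sentence or paragraph boundaries.

```ruby
# Illustrative only: splits text into windows of at most `max_chars`
# characters, where each window starts `max_chars - max_overlap` characters
# after the previous one, so neighbours share up to `max_overlap` characters.
def sliding_chunks(text, max_chars:, max_overlap:)
  raise ArgumentError, "max_overlap must not be negative" if max_overlap.negative?
  raise ArgumentError, "max_overlap must be smaller than max_chars" unless max_overlap < max_chars

  step = max_chars - max_overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, max_chars]
    # Stop once the current window already reaches the end of the text.
    break if start + max_chars >= text.length

    start += step
  end
  chunks
end

text = "a" * 2500
sliding_chunks(text, max_chars: 1000, max_overlap: 200).each_with_index do |chunk, index|
  puts "Chunk #{index}: #{chunk.length} characters"
end
```

With these settings each window advances by 800 characters, so a larger `max_overlap` preserves more context across chunk boundaries at the cost of more chunks.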
+
+ ## Password-Protected PDFs
 
  ```ruby
- require "xberg"
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   pdf_options: Xberg::PdfConfig.new(
+     passwords: ["password1", "password2"]
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "protected.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ puts document.content
  ```
+
+ ## Language Detection
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   language_detection: Xberg::LanguageDetectionConfig.new(
+     enabled: true
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "multilingual.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ puts "Detected languages: #{document.detected_languages}"
+ ```
+
+ ## API Reference
+
+ ### Main Methods
+
+ - `Xberg.extract(input, config)` – Extract one URI or bytes input.
+ - `Xberg.extract_batch(inputs, config)` – Extract multiple URI or bytes inputs.
+ - `Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")` – Local path, `file://`, or HTTP(S) URI input.
+ - `Xberg::ExtractInput.new(kind: "bytes", bytes: data, mime_type: "application/pdf")` – In-memory bytes input.
+
+ ### Configuration Classes
+
+ - `ExtractionConfig` – Main configuration
+ - `OcrConfig` – OCR settings
+ - `TesseractConfig` – Tesseract-specific options
+ - `ChunkingConfig` – Text chunking settings
+ - `PdfConfig` – PDF-specific options
+ - `LanguageDetectionConfig` – Language detection settings
+
+ ### Result Types
+
+ - `ExtractionResult` – Envelope with `results`, `errors`, and `summary`.
+ - `ExtractedDocument` – Per-document item at `output.results.first` with content, metadata, tables, and chunks.
+ - `Table` – Table with `cells`, `markdown`, and `page_number`.
+ - `Metadata` – Typed document metadata.
+
+ ## System Requirements
+
+ ### Ruby Version
+
+ - **Ruby 3.2.0 or higher** (including Ruby 4.x)
+ - Ruby 4.0+ is fully supported with no code changes required
+ - Magnus bindings compile successfully on all supported Ruby versions
+
+ ### Required
+
+ - Rust toolchain (for native extension compilation)
+
+ ### Optional
+
+ ```bash
+ # Tesseract OCR
+ brew install tesseract # macOS
+ sudo apt-get install tesseract-ocr # Ubuntu/Debian
+ ```
+
+ ### Ruby 4.0 Compatibility
+
+ Xberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
+
+ - **Ruby Box** - Experimental isolation of class and method definitions
+ - **ZJIT Compiler** - Enhanced JIT compilation for faster execution
+ - **Ractor Improvements** - Better multi-threaded document processing
+ - **Set Promoted to Core** - No changes needed for Xberg
+
+ All tests pass on Ruby 4.0.1, and the gem compiles without code changes.
+
+ ## Development
+
+ Clone and set up:
+
+ ```bash
+ git clone https://github.com/xberg-io/xberg.git
+ cd xberg
+ bundle install
+ ```
+
+ Run tests:
+
+ ```bash
+ rake test
+ ```
+
+ ## Troubleshooting
+
+ ### Native extension compilation error
+
+ Ensure build tools are installed:
+
+ ```bash
+ # macOS
+ xcode-select --install
+
+ # Ubuntu/Debian
+ sudo apt-get install build-essential ruby-dev
+
+ # Windows (via RubyInstaller)
+ ridk install
+ ```
+
+ ### "Could not find Xberg"
+
+ Reinstall the gem:
+
+ ```bash
+ gem uninstall xberg
+ gem install xberg --no-document
+ ```
+
+ ### OCR not working
+
+ Verify Tesseract is installed:
+
+ ```bash
+ tesseract --version
+ ```
+
+ ## Examples
+
+ ### Process Directory of PDFs
+
+ ```ruby
+ require 'xberg'
+ require 'pathname'
+
+ Dir.glob("documents/*.pdf").each do |file|
+   puts "Processing: #{file}"
+   input = Xberg::ExtractInput.new(kind: "uri", uri: file)
+   output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+   document = output.results.first
+
+   puts " Content length: #{document.content.length}"
+   puts " Language: #{document.detected_languages}"
+ end
+ ```
+
+ ### Extract and Parse Structured Data
+
+ ```ruby
+ require 'xberg'
+ require 'json'
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "data.pdf")
+ output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+ document = output.results.first
+
+ # Parse content as JSON (if applicable)
+ begin
+   data = JSON.parse(document.content)
+   puts "Parsed data: #{data}"
+ rescue JSON::ParserError
+   puts "Content is not JSON"
+ end
+ ```
+
+ ### Save Extracted Images
+
+ ```ruby
+ require 'xberg'
+
+ config = Xberg::ExtractionConfig.new(
+   images: Xberg::ImageExtractionConfig.new(
+     extract_images: true
+   )
+ )
+
+ input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+ output = Xberg.extract(input, config)
+ document = output.results.first
+
+ document.images&.each_with_index do |image, index|
+   File.binwrite("image_#{index}.png", image.data)
+ end
+ ```
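The loop above writes `image.data` straight to disk. Image bytes are binary, so a text-mode write can alter them on platforms that translate newlines. The sketch below shows a binary-safe variant with zero-padded file names. It assumes `image.data` returns the raw bytes as a `String`; `FakeImage` and `save_images` are hypothetical stand-ins so the snippet runs without the gem.

```ruby
require 'tmpdir'

# Stand-in for an extracted image; only the `data` reader used in the
# example above is assumed here.
FakeImage = Struct.new(:data)

# Writes each image's bytes in binary mode and returns the created paths.
# Accepts nil (no images) and returns an empty list in that case.
def save_images(images, dir)
  Array(images).each_with_index.map do |image, index|
    path = File.join(dir, format("image_%03d.png", index))
    File.binwrite(path, image.data)
    path
  end
end

Dir.mktmpdir do |dir|
  images = [FakeImage.new("\x89PNG\r\n\x1A\n".b), FakeImage.new("\x00\xFF".b)]
  save_images(images, dir).each do |path|
    puts "#{File.basename(path)}: #{File.size(path)} bytes"
  end
end
```

Zero-padded names keep the files in extraction order when a directory listing sorts them alphabetically.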
+
+ ## Documentation
+
+ For comprehensive documentation, visit [https://xberg.io](https://xberg.io)
+
+ ## Part of Xberg.dev
+
+ - [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
+ - [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
+ - [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+ - [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+ - [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces this README and all per-language bindings.
+ - [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+ ## License
+
+ MIT License - see [LICENSE](../../LICENSE) for details.
data/Steepfile ADDED
@@ -0,0 +1,14 @@
+ # frozen_string_literal: true
+
+ target :lib do
+   signature "sig"
+   check "lib"
+   # The generated `lib/xberg/native.rb` carries inline Sorbet
+   # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+   # provides those via `extend T::Sig`, but Steep does not understand the
+   # extension (it relies on RBS, not Sorbet sigs) and reports
+   # `Type `self` does not have method `sig`` on every block. RBS coverage
+   # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+   # RBS file by ignoring the .rb.
+   ignore "lib/xberg/native.rb"
+ end