xberg 0.1.0 → 1.0.0.pre.rc.1
- checksums.yaml +4 -4
- data/LICENSE +21 -0
- data/README.md +474 -6
- data/Steepfile +14 -0
- data/ext/xberg_rb/native/Cargo.lock +8575 -0
- data/ext/xberg_rb/native/Cargo.toml +59 -0
- data/ext/xberg_rb/native/extconf.rb +14 -0
- data/ext/xberg_rb/src/lib.rs +27640 -0
- data/lib/xberg/native.rb +4332 -0
- data/lib/xberg/version.rb +10 -0
- data/lib/xberg.rb +22 -1
- data/lib/xberg_rb.so +0 -0
- data/sig/types.rbs +2560 -0
- metadata +53 -15
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6640de3913281d734137cc0362ade67d82a742950c938060882e8a86acb7a855
+  data.tar.gz: ae86759fa1d1d9a68470bf43bedad36565ac5e9ed8a2eebac0d05370223b74d8
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 320bf2e2ee031b84aa12615552f603c20c7c80578df0675ce7fd0d5911a7967699dd5ad28b895decf523251f82f0a79408750e8e16895c733538cb20c6443569
+  data.tar.gz: b5350869b0bea09c769f1b7bd6a250911ce7dc400b4f7477a7050cd2163c469248b6a885704eea5469e6575284bdc7b648d032a39a65849ee4e105b3d2d5293f
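Note on checking these digests: a `.gem` file is a tar archive whose members include `metadata.gz`, `data.tar.gz`, and a gzipped `checksums.yaml`, so the values above can be recomputed locally. The following is a minimal sketch, not RubyGems' own verification code. The helper name `verify_gem_checksums` and its `members` argument (a hash of member name to raw bytes, already read out of the archive) are assumptions made for illustration.

```ruby
require "digest"
require "yaml"

# Compare the digests recorded in checksums.yaml with digests computed
# from the archive members. `members` maps a member name such as
# "data.tar.gz" to its raw bytes (hypothetical input shape).
# Returns a list of "<ALGORITHM> <member>" strings that did not match.
def verify_gem_checksums(checksums_yaml, members)
  recorded = YAML.safe_load(checksums_yaml)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }

  mismatches = []
  algorithms.each do |name, digest_class|
    (recorded[name] || {}).each do |member, expected|
      actual = digest_class.hexdigest(members.fetch(member))
      mismatches << "#{name} #{member}" unless actual == expected.to_s
    end
  end
  mismatches
end
```

An empty return value means every recorded digest matched; a missing member raises `KeyError` rather than being silently skipped.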
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025-2026 Kreuzberg, Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
data/README.md
CHANGED
@@ -1,11 +1,479 @@
-#
+# Xberg for Ruby
 
-
-
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+  <a href="https://github.com/xberg-io/alef">
+    <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
+  </a>
+  <!-- Language Bindings -->
+  <a href="https://crates.io/crates/xberg">
+    <img src="https://img.shields.io/crates/v/xberg?label=Rust&color=007ec6" alt="Rust">
+  </a>
+  <a href="https://pypi.org/project/xberg/">
+    <img src="https://img.shields.io/pypi/v/xberg?label=Python&color=007ec6" alt="Python">
+  </a>
+  <a href="https://www.npmjs.com/package/@xberg-io/xberg">
+    <img src="https://img.shields.io/npm/v/@xberg-io/xberg?label=Node.js&color=007ec6" alt="Node.js">
+  </a>
+  <a href="https://www.npmjs.com/package/@xberg-io/xberg-wasm">
+    <img src="https://img.shields.io/npm/v/@xberg-io/xberg-wasm?label=WASM&color=007ec6" alt="WASM">
+  </a>
+  <a href="https://central.sonatype.com/artifact/io.xberg/xberg">
+    <img src="https://img.shields.io/maven-central/v/io.xberg/xberg?label=Java&color=007ec6" alt="Java">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/tree/main/packages/go">
+    <img src="https://img.shields.io/github/v/tag/xberg-io/xberg?label=Go&color=007ec6&filter=v1*" alt="Go">
+  </a>
+  <a href="https://www.nuget.org/packages/Xberg/">
+    <img src="https://img.shields.io/nuget/v/Xberg?label=C%23&color=007ec6" alt="C#">
+  </a>
+  <a href="https://packagist.org/packages/xberg-io/xberg">
+    <img src="https://img.shields.io/packagist/v/xberg-io/xberg?label=PHP&color=007ec6" alt="PHP">
+  </a>
+  <a href="https://rubygems.org/gems/xberg">
+    <img src="https://img.shields.io/gem/v/xberg?label=Ruby&color=007ec6" alt="Ruby">
+  </a>
+  <a href="https://hex.pm/packages/xberg">
+    <img src="https://img.shields.io/hexpm/v/xberg?label=Elixir&color=007ec6" alt="Elixir">
+  </a>
+  <a href="https://xberg-io.r-universe.dev/xberg">
+    <img src="https://img.shields.io/badge/R-xberg-007ec6" alt="R">
+  </a>
+  <a href="https://pub.dev/packages/xberg">
+    <img src="https://img.shields.io/pub/v/xberg?label=Dart&color=007ec6" alt="Dart">
+  </a>
+  <a href="https://central.sonatype.com/artifact/io.xberg/xberg-android">
+    <img src="https://img.shields.io/maven-central/v/io.xberg/xberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/tree/main/packages/swift">
+    <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/tree/main/packages/zig">
+    <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/releases">
+    <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/pkgs/container/xberg">
+    <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
+  </a>
+  <!-- Project Info -->
+  <a href="https://github.com/xberg-io/xberg/blob/main/LICENSE">
+    <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
+  </a>
+  <a href="https://docs.xberg.io">
+    <img src="https://img.shields.io/badge/Docs-xberg-007ec6" alt="Documentation">
+  </a>
+  <a href="https://huggingface.co/xberg-io">
+    <img src="https://img.shields.io/badge/Hugging%20Face-Xberg-007ec6" alt="Hugging Face">
+  </a>
+</div>
 
-
-
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
+  <a href="https://discord.gg/xt9WY3GnKR">
+    <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
+  </a>
+  <a href="https://docs.xberg.io/demo.html">
+    <img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
+  </a>
+  <a href="https://github.com/xberg-io/xberg/stargazers">
+    <img height="22" src="https://img.shields.io/github/stars/xberg-io/xberg?style=social" alt="GitHub Stars">
+  </a>
+</div>
+
+Extract text, tables, images, metadata, and code intelligence from 96 file formats and 306 programming languages including PDF, Office documents, images, and audio/video transcripts where native transcription is available. Ruby bindings with idiomatic Ruby API and native performance.
+
+## What This Package Provides
+
+- **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
+- **Structured results** — an `ExtractionResult` envelope with `ExtractedDocument` items, errors, and summary counts.
+- **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
+- **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
+
+## Installation
+
+Add to your Gemfile:
+
+```ruby
+gem 'xberg'
+```
+
+Then execute:
+
+```bash
+bundle install
+```
+
+Or install it directly:
+
+```bash
+gem install xberg
+```
+
+## Quick Start
+
+### Basic Usage
+
+```ruby
+require 'xberg'
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+document = output.results.first
+
+puts document.content
+puts "Results: #{output.summary.results}"
+```
+
+### Batch Processing
+
+```ruby
+require 'xberg'
+
+bytes = File.binread("doc3.txt")
+inputs = [
+  Xberg::ExtractInput.new(kind: "uri", uri: "doc1.pdf"),
+  Xberg::ExtractInput.new(kind: "uri", uri: "doc2.docx"),
+  Xberg::ExtractInput.new(
+    kind: "bytes",
+    bytes: bytes,
+    mime_type: "text/plain",
+    filename: "doc3.txt"
+  ),
+]
+
+output = Xberg.extract_batch(inputs, Xberg::ExtractionConfig.new)
+
+output.results.each do |document|
+  puts "Content length: #{document.content.length}"
+end
+```
+
+## Configuration
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  use_cache: true,
+  enable_quality_processing: true,
+  ocr: Xberg::OcrConfig.new(
+    backend: 'tesseract',
+    language: 'eng'
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+puts document.content
+```
+
+## OCR Support
+
+### Tesseract Configuration
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  ocr: Xberg::OcrConfig.new(
+    backend: 'tesseract',
+    language: 'eng',
+    tesseract_config: Xberg::TesseractConfig.new(
+      psm: 6,
+      enable_table_detection: true
+    )
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "scanned.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+puts document.content
+```
+
+## Table Extraction
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  ocr: Xberg::OcrConfig.new(
+    backend: 'tesseract',
+    tesseract_config: Xberg::TesseractConfig.new(
+      enable_table_detection: true
+    )
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "invoice.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+document.tables.each_with_index do |table, index|
+  puts "Table #{index}:"
+  puts table.markdown
+end
+```
+
+## Metadata Extraction
+
+```ruby
+require 'xberg'
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+document = output.results.first
+
+metadata = document.metadata
+puts "Title: #{metadata.title}" if metadata&.title
+if metadata&.authors
+  puts "Authors: #{metadata.authors.join(', ')}"
+end
+
+puts "Languages: #{document.detected_languages}"
+
+if document.images
+  puts "Images found: #{document.images.count}"
+end
+```
+
+## Text Chunking
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  chunking: Xberg::ChunkingConfig.new(
+    max_chars: 1000,
+    max_overlap: 200
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "long_document.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+document.chunks.each_with_index do |chunk, index|
+  puts "Chunk #{index}: #{chunk.content.length} characters"
+end
+```
+
+## Password-Protected PDFs
 
 ```ruby
-require
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  pdf_options: Xberg::PdfConfig.new(
+    passwords: ["password1", "password2"]
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "protected.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+puts document.content
 ```
+
+## Language Detection
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  language_detection: Xberg::LanguageDetectionConfig.new(
+    enabled: true
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "multilingual.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+puts "Detected languages: #{document.detected_languages}"
+```
+
+## API Reference
+
+### Main Methods
+
+- `Xberg.extract(input, config)` – Extract one URI or bytes input.
+- `Xberg.extract_batch(inputs, config)` – Extract multiple URI or bytes inputs.
+- `Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")` – Local path, `file://`, or HTTP(S) URI input.
+- `Xberg::ExtractInput.new(kind: "bytes", bytes: data, mime_type: "application/pdf")` – In-memory bytes input.
+
+### Configuration Classes
+
+- `ExtractionConfig` – Main configuration
+- `OcrConfig` – OCR settings
+- `TesseractConfig` – Tesseract-specific options
+- `ChunkingConfig` – Text chunking settings
+- `PdfConfig` – PDF-specific options
+- `LanguageDetectionConfig` – Language detection settings
+
+### Result Types
+
+- `ExtractionResult` – Envelope with `results`, `errors`, and `summary`.
+- `ExtractedDocument` – Per-document item at `output.results.first` with content, metadata, tables, and chunks.
+- `Table` – Table with `cells`, `markdown`, and `page_number`.
+- `Metadata` – Typed document metadata.
+
+## System Requirements
+
+### Ruby Version
+
+- **Ruby 3.2.0 or higher** (including Ruby 4.x)
+- Ruby 4.0+ is fully supported with no code changes required
+- Magnus bindings compile successfully on all supported Ruby versions
+
+### Required
+
+- Rust toolchain (for native extension compilation)
+
+### Optional
+
+```bash
+# Tesseract OCR
+brew install tesseract # macOS
+sudo apt-get install tesseract-ocr # Ubuntu/Debian
+```
+
+### Ruby 4.0 Compatibility
+
+Xberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
+
+- **Ruby Box** - Improved memory efficiency and performance
+- **ZJIT Compiler** - Enhanced JIT compilation for faster execution
+- **Ractor Improvements** - Better multi-threaded document processing
+- **Set Promoted to Core** - No changes needed for Xberg
+
+All tests pass with Ruby 4.0.1 with 100% compatibility. The gem compiles without any breaking changes.
+
+## Development
+
+Clone and setup:
+
+```bash
+git clone https://github.com/xberg-io/xberg.git
+cd xberg
+bundle install
+```
+
+Run tests:
+
+```bash
+rake test
+```
+
+## Troubleshooting
+
+### Native extension compilation error
+
+Ensure build tools are installed:
+
+```bash
+# macOS
+xcode-select --install
+
+# Ubuntu/Debian
+sudo apt-get install build-essential ruby-dev
+
+# Windows (via RubyInstaller)
+ridk install
+```
+
+### "Could not find Xberg"
+
+Reinstall the gem:
+
+```bash
+gem uninstall xberg
+gem install xberg --no-document
+```
+
+### OCR not working
+
+Verify Tesseract is installed:
+
+```bash
+tesseract --version
+```
+
+## Examples
+
+### Process Directory of PDFs
+
+```ruby
+require 'xberg'
+require 'pathname'
+
+Dir.glob("documents/*.pdf").each do |file|
+  puts "Processing: #{file}"
+  input = Xberg::ExtractInput.new(kind: "uri", uri: file)
+  output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+  document = output.results.first
+
+  puts " Content length: #{document.content.length}"
+  puts " Language: #{document.detected_languages}"
+end
+```
+
+### Extract and Parse Structured Data
+
+```ruby
+require 'xberg'
+require 'json'
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "data.pdf")
+output = Xberg.extract(input, Xberg::ExtractionConfig.new)
+document = output.results.first
+
+# Parse content as JSON (if applicable)
+begin
+  data = JSON.parse(document.content)
+  puts "Parsed data: #{data}"
+rescue JSON::ParserError
+  puts "Content is not JSON"
+end
+```
+
+### Save Extracted Images
+
+```ruby
+require 'xberg'
+
+config = Xberg::ExtractionConfig.new(
+  images: Xberg::ImageExtractionConfig.new(
+    extract_images: true
+  )
+)
+
+input = Xberg::ExtractInput.new(kind: "uri", uri: "document.pdf")
+output = Xberg.extract(input, config)
+document = output.results.first
+
+document.images&.each_with_index do |image, index|
+  File.write("image_#{index}.png", image.data)
+end
+```
+
+## Documentation
+
+For comprehensive documentation, visit [https://xberg.io](https://xberg.io)
+
+## Part of Xberg.dev
+
+- [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
+- [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
+- [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
+- [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
+- [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces this README and all per-language bindings.
+- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
+
+## License
+
+MIT License - see [LICENSE](../../LICENSE) for details.
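Note on the new result envelope: the README examples call `output.results.first` directly, although the envelope it describes also carries `errors`. The sketch below shows one defensive way to consume that shape. It relies only on the `results`, `errors`, and `content` readers named in the README; the helper name `report_extraction` and the use of `inspect` for error items are assumptions, since the diff does not show what an error entry looks like.

```ruby
# Print a short report for any object shaped like the envelope the README
# describes: `results` (items responding to `content`) and `errors`.
# Returns the number of documents seen. Hypothetical helper, not part of xberg.
def report_extraction(output, io: $stdout)
  documents = output.results || []
  errors = output.errors || []

  if documents.empty?
    io.puts "No documents extracted"
  else
    documents.each_with_index do |document, index|
      io.puts "Document #{index}: #{document.content.to_s.length} characters"
    end
  end

  # The structure of an error entry is not shown in the diff, so it is
  # printed with `inspect` instead of guessing at field names.
  errors.each { |error| io.puts "Error: #{error.inspect}" }
  documents.length
end
```

With the gem installed, this would presumably be called as `report_extraction(Xberg.extract(input, config))`; checking `results` for emptiness first avoids a `NoMethodError` on `nil` when extraction fails.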
data/Steepfile
ADDED
@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+
+target :lib do
+  signature "sig"
+  check "lib"
+  # The generated `lib/xberg/native.rb` carries inline Sorbet
+  # `sig { ... }` blocks on tagged-enum variant Data classes. Sorbet's runtime
+  # provides those via `extend T::Sig`, but Steep does not understand the
+  # extension (it relies on RBS, not Sorbet sigs) and reports
+  # `Type `self` does not have method `sig`` on every block. RBS coverage
+  # for the same surface lives in `sig/types.rbs`, so we steer Steep to the
+  # RBS file by ignoring the .rb.
+  ignore "lib/xberg/native.rb"
+end
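Note on the Steepfile comment: a rough illustration of the mismatch it describes. A `sig { ... }` line is an ordinary method call evaluated in the class body, and the method only exists because the class was extended at runtime. A checker that reads RBS signatures alone has no record of it. `SigStandIn` and `Variant` are hypothetical names; `SigStandIn` stands in for sorbet-runtime's `T::Sig` so the sketch runs without that gem, and the real contents of `native.rb` are not shown in this part of the diff.

```ruby
# Hypothetical stand-in for sorbet-runtime's T::Sig: it accepts the
# signature block and ignores it.
module SigStandIn
  def sig(&_block); end
end

# A Data class in the style the Steepfile comment describes.
Variant = Data.define(:value) do
  extend SigStandIn # the comment says native.rb uses `extend T::Sig`

  sig { } # plain method call on the class; a no-op in this sketch
  def doubled
    value * 2
  end
end
```

Removing the `extend` line makes the `sig` call raise `NoMethodError` when the class body runs, which is the runtime counterpart of the error Steep reports statically.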