fastembed 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -3,156 +3,477 @@
  [![Gem Version](https://badge.fury.io/rb/fastembed.svg)](https://rubygems.org/gems/fastembed)
  [![CI](https://github.com/khasinski/fastembed-rb/actions/workflows/ci.yml/badge.svg)](https://github.com/khasinski/fastembed-rb/actions/workflows/ci.yml)

- Fast, lightweight text embeddings in Ruby. Convert text into vectors for semantic search, similarity matching, clustering, and RAG applications.
+ Fast, lightweight text embeddings in Ruby. A port of [FastEmbed](https://github.com/qdrant/fastembed) by Qdrant.

  ```ruby
  embedding = Fastembed::TextEmbedding.new
- vectors = embedding.embed(["Hello world", "Ruby is great"]).to_a
- # => [[0.123, -0.456, ...], [0.789, 0.012, ...]] (384-dimensional vectors)
+ vectors = embedding.embed(["The quick brown fox", "jumps over the lazy dog"]).to_a
  ```

- ## What are embeddings?
+ Supports dense embeddings, sparse embeddings (SPLADE), late interaction (ColBERT), reranking, and image embeddings - all running locally with ONNX Runtime.

- Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling:
+ ## Table of Contents

- - **Semantic search** - Find relevant documents by meaning, not just keywords
- - **Similarity matching** - Compare texts to find duplicates or related content
- - **RAG applications** - Retrieve context for LLMs like ChatGPT
- - **Clustering** - Group similar documents together
+ - [Installation](#installation)
+ - [Getting Started](#getting-started)
+ - [Text Embeddings](#text-embeddings)
+ - [Reranking](#reranking)
+ - [Sparse Embeddings](#sparse-embeddings)
+ - [Late Interaction (ColBERT)](#late-interaction-colbert)
+ - [Image Embeddings](#image-embeddings)
+ - [Async Processing](#async-processing)
+ - [Progress Tracking](#progress-tracking)
+ - [CLI](#cli)
+ - [Custom Models](#custom-models)
+ - [Configuration](#configuration)
+ - [Performance](#performance)

  ## Installation

+ Add to your Gemfile:
+
+ ```ruby
+ gem "fastembed"
+ ```
+
+ For image embeddings, also add:
+
  ```ruby
- gem 'fastembed'
+ gem "mini_magick"
  ```

- ## Quick Start
+ ## Getting Started

  ```ruby
- require 'fastembed'
+ require "fastembed"

- # Create embedding model (downloads automatically on first use, ~67MB)
+ # Create an embedding model (downloads ~67MB on first use)
  embedding = Fastembed::TextEmbedding.new

- # Embed your texts
- docs = ["Ruby is a programming language", "Python is also a programming language"]
- vectors = embedding.embed(docs).to_a
+ # Embed some text
+ documents = [
+   "Ruby is a dynamic programming language",
+   "Python is great for data science",
+   "JavaScript runs in the browser"
+ ]
+ vectors = embedding.embed(documents).to_a

- # Find similarity between texts (cosine similarity via dot product)
- similarity = vectors[0].zip(vectors[1]).sum { |a, b| a * b }
- puts similarity # => 0.89 (high similarity!)
+ # Each vector is 384 floats (for the default model)
+ vectors.first.length # => 384
  ```

- ## Semantic Search Example
+ ### Semantic Search
+
+ Find documents by meaning, not just keywords:

  ```ruby
+ embedding = Fastembed::TextEmbedding.new
+
  # Your document corpus
  documents = [
-   "The quick brown fox jumps over the lazy dog",
-   "Machine learning is a subset of artificial intelligence",
-   "Ruby on Rails is a web application framework",
-   "Neural networks are inspired by biological brains"
+   "The cat sat on the mat",
+   "Machine learning powers modern AI",
+   "Ruby on Rails is a web framework",
+   "Deep learning uses neural networks"
  ]
-
- # Create embeddings for all documents
- embedding = Fastembed::TextEmbedding.new
  doc_vectors = embedding.embed(documents).to_a

- # Search query
- query = "AI and deep learning"
- query_vector = embedding.embed(query).first
+ # Search for a concept
+ query = "artificial intelligence and neural nets"
+ query_vector = embedding.embed([query]).first
+
+ # Find the most similar document (cosine similarity)
+ scores = doc_vectors.map { |v| query_vector.zip(v).sum { |a, b| a * b } }
+ best_idx = scores.each_with_index.max.last
+
+ puts documents[best_idx] # => "Deep learning uses neural networks"
+ ```
+
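+ The dot product above stands in for cosine similarity, which is valid when vectors are unit-normalized (as the README's own examples assume for the default model). If you mix in vectors that may not be normalized, compute cosine explicitly - a minimal sketch in plain Ruby, no gem API assumed:
+
+ ```ruby
+ # Cosine similarity that does not assume unit-normalized inputs
+ def cosine_similarity(a, b)
+   dot    = a.zip(b).sum { |x, y| x * y }
+   norm_a = Math.sqrt(a.sum { |x| x * x })
+   norm_b = Math.sqrt(b.sum { |x| x * x })
+   dot / (norm_a * norm_b)
+ end
+
+ cosine_similarity(query_vector, doc_vectors.first) # => value in -1.0..1.0
+ ```
+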
+ ### Integration with Vector Databases
+
+ ```ruby
+ # With Qdrant
+ require "qdrant"

- # Find most similar document (highest dot product)
- similarities = doc_vectors.map.with_index do |doc_vec, i|
-   score = query_vector.zip(doc_vec).sum { |a, b| a * b }
-   [i, score]
+ embedding = Fastembed::TextEmbedding.new
+ client = Qdrant::Client.new(url: "http://localhost:6333")
+
+ # Index documents
+ documents.each_with_index do |doc, i|
+   vector = embedding.embed([doc]).first
+   client.points.upsert(
+     collection_name: "docs",
+     points: [{ id: i, vector: vector, payload: { text: doc } }]
+   )
  end

- best_match = similarities.max_by { |_, score| score }
- puts documents[best_match[0]] # => "Machine learning is a subset of artificial intelligence"
+ # Search
+ query_vector = embedding.embed(["your search query"]).first
+ results = client.points.search(collection_name: "docs", vector: query_vector, limit: 5)
  ```
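+
+ For larger corpora it is usually cheaper to embed everything in one call and upsert the points in a single request rather than one point per request - a sketch under the same assumptions as above (qdrant-ruby client, existing "docs" collection):
+
+ ```ruby
+ # Batch: one embed call, one upsert request
+ vectors = embedding.embed(documents).to_a
+ points = documents.each_index.map do |i|
+   { id: i, vector: vectors[i], payload: { text: documents[i] } }
+ end
+ client.points.upsert(collection_name: "docs", points: points)
+ ```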

- ## Usage
+ ## Text Embeddings

  ### Choose a Model

  ```ruby
- # Default: Fast and accurate (384 dimensions, 67MB)
+ # Default: fast and accurate (384 dimensions, 67MB)
  embedding = Fastembed::TextEmbedding.new

  # Higher accuracy (768 dimensions, 210MB)
  embedding = Fastembed::TextEmbedding.new(model_name: "BAAI/bge-base-en-v1.5")

- # Multilingual support (100+ languages)
+ # Multilingual - 100+ languages (384 dimensions)
  embedding = Fastembed::TextEmbedding.new(model_name: "intfloat/multilingual-e5-small")

- # Long documents (8192 tokens vs default 512)
+ # Long documents - 8192 token context (768 dimensions)
  embedding = Fastembed::TextEmbedding.new(model_name: "nomic-ai/nomic-embed-text-v1.5")
  ```

- ### Process Large Datasets
+ ### Supported Models
+
+ | Model | Dimensions | Size | Notes |
+ |-------|------------|------|-------|
+ | `BAAI/bge-small-en-v1.5` | 384 | 67MB | Default, fast |
+ | `BAAI/bge-base-en-v1.5` | 768 | 210MB | Better accuracy |
+ | `BAAI/bge-large-en-v1.5` | 1024 | 1.2GB | Best accuracy |
+ | `sentence-transformers/all-MiniLM-L6-v2` | 384 | 90MB | General purpose |
+ | `sentence-transformers/all-mpnet-base-v2` | 768 | 440MB | High quality |
+ | `intfloat/multilingual-e5-small` | 384 | 450MB | 100+ languages |
+ | `intfloat/multilingual-e5-base` | 768 | 1.1GB | Multilingual, better accuracy |
+ | `nomic-ai/nomic-embed-text-v1.5` | 768 | 520MB | 8192 token context |
+ | `jinaai/jina-embeddings-v2-base-en` | 768 | 520MB | 8192 token context |
+
+ ### Query vs Passage Embeddings
+
+ For asymmetric search (short queries, long documents), use the specialized methods:

  ```ruby
- # Lazy evaluation - memory efficient for large datasets
- documents = File.readlines("corpus.txt")
+ # For search queries
+ query_vectors = embedding.query_embed(["What is Ruby?"]).to_a

- embedding.embed(documents, batch_size: 64).each_slice(100) do |batch|
-   store_in_vector_database(batch)
+ # For documents/passages
+ doc_vectors = embedding.passage_embed(documents).to_a
+ ```
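+
+ Putting the two together, asymmetric retrieval is the same dot-product ranking as before - a sketch, assuming `documents` holds your corpus and the normalized-vector convention used earlier:
+
+ ```ruby
+ doc_vectors  = embedding.passage_embed(documents).to_a
+ query_vector = embedding.query_embed(["What is Ruby?"]).first
+
+ # Rank passages by similarity to the query, best three first
+ top = documents.zip(doc_vectors)
+                .max_by(3) { |_, v| query_vector.zip(v).sum { |a, b| a * b } }
+                .map(&:first)
+ ```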
+
+ ### Lazy Evaluation
+
+ Embeddings are generated lazily, which keeps memory use low for large datasets:
+
+ ```ruby
+ # Process millions of documents without loading all vectors into memory
+ File.foreach("documents.txt").lazy.each_slice(1000) do |batch|
+   embedding.embed(batch).each do |vector|
+     store_in_database(vector)
+   end
  end
  ```

- ### List Available Models
+ ## Reranking
+
+ Rerankers score query-document pairs for more accurate relevance ranking. Use them after initial retrieval:

  ```ruby
- Fastembed::TextEmbedding.list_supported_models.each do |model|
-   puts "#{model[:model_name]} - #{model[:dim]}d - #{model[:description]}"
+ reranker = Fastembed::TextCrossEncoder.new
+
+ query = "What is machine learning?"
+ documents = [
+   "Machine learning is a branch of AI",
+   "The weather is nice today",
+   "Deep learning uses neural networks"
+ ]
+
+ # Get raw scores (higher = more relevant)
+ scores = reranker.rerank(query: query, documents: documents)
+ # => [8.5, -10.2, 5.3]
+
+ # Get sorted results with metadata
+ results = reranker.rerank_with_scores(query: query, documents: documents, top_k: 2)
+ # => [
+ #   { document: "Machine learning is...", score: 8.5, index: 0 },
+ #   { document: "Deep learning uses...", score: 5.3, index: 2 }
+ # ]
+ ```
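+
+ A common pattern is two-stage retrieval: fast embedding search shortlists candidates, then the cross-encoder reorders the shortlist. A sketch built only from the calls shown above (assumes `corpus` and `query` are defined):
+
+ ```ruby
+ embedding = Fastembed::TextEmbedding.new
+ reranker  = Fastembed::TextCrossEncoder.new
+
+ # Stage 1: cheap vector retrieval - top 20 candidates by dot product
+ doc_vectors  = embedding.embed(corpus).to_a
+ query_vector = embedding.embed([query]).first
+ candidates = corpus.zip(doc_vectors)
+                    .max_by(20) { |_, v| query_vector.zip(v).sum { |a, b| a * b } }
+                    .map(&:first)
+
+ # Stage 2: precise reranking of the shortlist
+ results = reranker.rerank_with_scores(query: query, documents: candidates, top_k: 5)
+ ```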
+
+ ### Reranker Models
+
+ | Model | Size | Notes |
+ |-------|------|-------|
+ | `cross-encoder/ms-marco-MiniLM-L-6-v2` | 80MB | Default, fast |
+ | `cross-encoder/ms-marco-MiniLM-L-12-v2` | 120MB | Better accuracy |
+ | `BAAI/bge-reranker-base` | 1.1GB | High accuracy |
+ | `BAAI/bge-reranker-large` | 2.2GB | Best accuracy |
+
+ ## Sparse Embeddings
+
+ SPLADE models produce sparse vectors where each dimension corresponds to a vocabulary term. Great for hybrid search:
+
+ ```ruby
+ sparse = Fastembed::TextSparseEmbedding.new
+
+ result = sparse.embed(["Ruby programming language"]).first
+ # => #<SparseEmbedding indices=[1234, 5678, ...] values=[0.8, 1.2, ...]>
+
+ result.indices # vocabulary token IDs with non-zero weights
+ result.values  # corresponding weights
+ result.nnz     # number of non-zero elements
+ ```
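+
+ Since `indices` and `values` are plain arrays, scoring two sparse embeddings is just a dot product over shared token IDs - a minimal sketch in plain Ruby:
+
+ ```ruby
+ # Multiply weights wherever both embeddings carry the same token ID
+ def sparse_dot(a, b)
+   b_weights = b.indices.zip(b.values).to_h
+   a.indices.zip(a.values).sum { |idx, val| val * b_weights.fetch(idx, 0.0) }
+ end
+
+ query_vec = sparse.embed(["fast Ruby embeddings"]).first
+ doc_vec   = sparse.embed(["Ruby programming language"]).first
+ sparse_dot(query_vec, doc_vec) # higher = more lexical overlap
+ ```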
+
+ ### Hybrid Search
+
+ Combine dense and sparse embeddings for better results:
+
+ ```ruby
+ dense = Fastembed::TextEmbedding.new
+ sparse = Fastembed::TextSparseEmbedding.new
+
+ documents = ["your documents here"]
+
+ # Generate both types of embeddings
+ dense_vectors = dense.embed(documents).to_a
+ sparse_vectors = sparse.embed(documents).to_a
+
+ # Store both in your vector database and combine scores at query time
+ ```
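+
+ One score-free way to combine the two result lists at query time is Reciprocal Rank Fusion (RRF), which needs only each list's ordering - a sketch, assuming `dense_ranking` and `sparse_ranking` are arrays of document IDs ordered best-first:
+
+ ```ruby
+ # RRF: score(doc) = sum over rankings of 1 / (k + rank); k = 60 is conventional
+ def rrf(rankings, k: 60)
+   scores = Hash.new(0.0)
+   rankings.each do |ranking|
+     ranking.each_with_index { |doc_id, rank| scores[doc_id] += 1.0 / (k + rank + 1) }
+   end
+   scores.sort_by { |_, s| -s }.map(&:first)
+ end
+
+ fused = rrf([dense_ranking, sparse_ranking])
+ ```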
+
+ ## Late Interaction (ColBERT)
+
+ ColBERT produces token-level embeddings for fine-grained matching:
+
+ ```ruby
+ colbert = Fastembed::LateInteractionTextEmbedding.new
+
+ query = colbert.query_embed(["What is Ruby?"]).first
+ doc = colbert.embed(["Ruby is a programming language"]).first
+
+ # MaxSim scoring - sum of max similarities per query token
+ score = query.max_sim(doc)
+ ```
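+
+ For intuition, MaxSim takes each query token vector, finds its best dot product against any document token vector, and sums those maxima. A sketch of the math in plain Ruby (assumes token embeddings exposed as arrays of float arrays; `max_sim` above is the supported API):
+
+ ```ruby
+ # MaxSim(Q, D) = sum over q in Q of ( max over d in D of q . d )
+ def max_sim(query_tokens, doc_tokens)
+   query_tokens.sum do |q|
+     doc_tokens.map { |d| q.zip(d).sum { |a, b| a * b } }.max
+   end
+ end
+ ```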
+
+ ### Late Interaction Models
+
+ | Model | Dimensions | Notes |
+ |-------|------------|-------|
+ | `colbert-ir/colbertv2.0` | 128 | Default |
+ | `jinaai/jina-colbert-v1-en` | 768 | 8192 token context |
+
+ ## Image Embeddings
+
+ Convert images to vectors for visual search:
+
+ ```ruby
+ # Requires mini_magick gem
+ image_embed = Fastembed::ImageEmbedding.new
+
+ # From file paths
+ vectors = image_embed.embed(["photo1.jpg", "photo2.png"]).to_a
+
+ # From URLs
+ vectors = image_embed.embed(["https://example.com/image.jpg"]).to_a
+ ```
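+
+ Visual similarity search then follows the same dot-product pattern as text - a sketch with hypothetical file names:
+
+ ```ruby
+ corpus_paths   = ["photo1.jpg", "photo2.png", "photo3.jpg"]
+ corpus_vectors = image_embed.embed(corpus_paths).to_a
+ query_vector   = image_embed.embed(["query.jpg"]).first
+
+ # Most similar image to the query
+ best = corpus_paths.zip(corpus_vectors)
+                    .max_by { |_, v| query_vector.zip(v).sum { |a, b| a * b } }
+                    .first
+ ```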
+
+ ### Image Models
+
+ | Model | Dimensions | Notes |
+ |-------|------------|-------|
+ | `Qdrant/clip-ViT-B-32-vision` | 512 | Default, CLIP |
+ | `Qdrant/resnet50-onnx` | 2048 | ResNet50 |
+ | `jinaai/jina-clip-v1` | 768 | Jina CLIP |
+
+ ## Async Processing
+
+ Run embeddings in background threads:
+
+ ```ruby
+ embedding = Fastembed::TextEmbedding.new
+
+ # Start async embedding
+ future = embedding.embed_async(large_document_list)
+
+ # Do other work...
+
+ # Get results when ready (blocks until complete)
+ vectors = future.value
+ ```
+
+ ### Parallel Processing
+
+ ```ruby
+ # Process multiple batches concurrently
+ futures = documents.each_slice(1000).map do |batch|
+   embedding.embed_async(batch)
  end
+
+ # Wait for all and combine results
+ all_vectors = futures.flat_map(&:value)
  ```

- ## Supported Models
+ ### Future Methods

- | Model | Dim | Use Case |
- |-------|-----|----------|
- | `BAAI/bge-small-en-v1.5` | 384 | Default, fast English embeddings |
- | `BAAI/bge-base-en-v1.5` | 768 | Higher accuracy English |
- | `BAAI/bge-large-en-v1.5` | 1024 | Highest accuracy English |
- | `sentence-transformers/all-MiniLM-L6-v2` | 384 | General purpose, lightweight |
- | `sentence-transformers/all-mpnet-base-v2` | 768 | High quality general purpose |
- | `intfloat/multilingual-e5-small` | 384 | 100+ languages |
- | `intfloat/multilingual-e5-base` | 768 | 100+ languages, higher accuracy |
- | `nomic-ai/nomic-embed-text-v1.5` | 768 | Long context (8192 tokens) |
- | `jinaai/jina-embeddings-v2-base-en` | 768 | Long context (8192 tokens) |
+ ```ruby
+ future.complete?        # check if done
+ future.pending?         # check if still running
+ future.success?         # completed without error?
+ future.failure?         # completed with error?
+ future.error            # get the error if failed
+ future.wait(timeout: 5) # wait up to 5 seconds
+
+ # Chaining
+ future.then { |vectors| vectors.map(&:first) }
+       .rescue { |e| puts "Error: #{e}" }
+ ```

- ## Performance
+ ### Async Utilities
+
+ ```ruby
+ # Wait for all futures
+ results = Fastembed::Async.all(futures)
+
+ # Get first completed result
+ result = Fastembed::Async.race(futures, timeout: 10)
+ ```
+
+ ## Progress Tracking
+
+ Track progress for large embedding jobs:
+
+ ```ruby
+ embedding = Fastembed::TextEmbedding.new
+
+ documents = Array.new(10_000) { "document text" }
+
+ embedding.embed(documents, batch_size: 256) do |progress|
+   puts "Batch #{progress.current}/#{progress.total}"
+   puts "#{(progress.percentage * 100).round}% complete"
+   puts "~#{progress.documents_processed} documents processed"
+ end.to_a
+ ```
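+
+ The same callback carries enough information for a rough time-remaining estimate - a sketch using only `progress.current` and `progress.total` from above:
+
+ ```ruby
+ started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+
+ embedding.embed(documents, batch_size: 256) do |progress|
+   elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
+   eta = elapsed / progress.current * (progress.total - progress.current)
+   puts "#{progress.current}/#{progress.total} batches, ~#{eta.round}s remaining"
+ end.to_a
+ ```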
+
+ ## CLI
+
+ FastEmbed includes a command-line tool:
+
+ ```bash
+ # List available models
+ fastembed list          # embedding models
+ fastembed list-reranker # reranker models
+ fastembed list-sparse   # sparse models
+ fastembed list-image    # image models
+
+ # Get model info
+ fastembed info "BAAI/bge-small-en-v1.5"
+
+ # Pre-download a model
+ fastembed download "BAAI/bge-base-en-v1.5"
+
+ # Embed text (outputs JSON)
+ fastembed embed "Hello world" "Another text"

- On Apple M1 Max with the default model:
+ # Different output formats
+ fastembed embed -f ndjson "Hello world"
+ fastembed embed -f csv "Hello world"

- | Batch Size | Throughput |
- |------------|------------|
- | 1 document | ~6.5ms |
- | 100 documents | ~500 docs/sec |
- | 1000 documents | ~500 docs/sec |
+ # Read from file
+ fastembed embed -i documents.txt

- Larger models are slower but more accurate. See [benchmarks](BENCHMARKS.md) for details.
+ # Use a different model
+ fastembed embed -m "BAAI/bge-base-en-v1.5" "Hello"
+
+ # Rerank documents
+ fastembed rerank "query" "doc1" "doc2" "doc3"
+
+ # Benchmark a model
+ fastembed benchmark -m "BAAI/bge-small-en-v1.5" -n 100
+ ```
+
+ ## Custom Models
+
+ Register custom models from HuggingFace:
+
+ ```ruby
+ Fastembed.register_model(
+   model_name: "my-org/my-model",
+   dim: 768,
+   description: "My custom model",
+   sources: { hf: "my-org/my-model" },
+   model_file: "onnx/model.onnx"
+ )
+
+ # Now use it like any other model
+ embedding = Fastembed::TextEmbedding.new(model_name: "my-org/my-model")
+ ```
+
+ ### Load from Local Directory
+
+ ```ruby
+ embedding = Fastembed::TextEmbedding.new(
+   local_model_dir: "/path/to/model",
+   model_file: "model.onnx",
+   tokenizer_file: "tokenizer.json"
+ )
+ ```

  ## Configuration

+ ### Initialization Options
+
  ```ruby
  Fastembed::TextEmbedding.new(
-   model_name: "BAAI/bge-small-en-v1.5", # Model to use
-   cache_dir: "~/.cache/fastembed", # Where to store models
+   model_name: "BAAI/bge-small-en-v1.5", # model to use
+   cache_dir: "~/.cache/fastembed",      # where to store models
    threads: 4, # ONNX Runtime threads
-   providers: ["CUDAExecutionProvider"] # GPU acceleration (Linux/Windows)
+   providers: ["CUDAExecutionProvider"], # GPU acceleration
+   show_progress: true, # show download progress
+   quantization: :q4    # use quantized model
+ )
+ ```
+
+ ### Quantization
+
+ Use smaller, faster models with quantization:
+
+ ```ruby
+ # Available: :fp32 (default), :fp16, :int8, :uint8, :q4
+ embedding = Fastembed::TextEmbedding.new(quantization: :int8)
+ ```
+
+ ### Environment Variables
+
+ | Variable | Description |
+ |----------|-------------|
+ | `FASTEMBED_CACHE_PATH` | Custom model cache directory |
+ | `HF_TOKEN` | HuggingFace token for private models |
+
+ ### GPU Acceleration
+
+ ```ruby
+ # CUDA (Linux/Windows with NVIDIA GPU)
+ embedding = Fastembed::TextEmbedding.new(
+   providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
+ )
+
+ # CoreML (macOS)
+ embedding = Fastembed::TextEmbedding.new(
+   providers: ["CoreMLExecutionProvider", "CPUExecutionProvider"]
  )
  ```

- **Environment variables:**
- - `FASTEMBED_CACHE_PATH` - Custom model cache directory
+ ## Performance
+
+ On an Apple M1 Max with the default model (BAAI/bge-small-en-v1.5):
+
+ | Batch Size | Documents/sec | Latency per Batch |
+ |------------|---------------|-------------------|
+ | 1 | ~150 | ~6.5ms |
+ | 32 | ~500 | ~64ms |
+ | 256 | ~550 | ~465ms |
+
+ Larger models are slower but more accurate. See [BENCHMARKS.md](BENCHMARKS.md) for detailed comparisons.

  ## Requirements

  - Ruby >= 3.3
- - ~70MB disk space for default model (varies by model)
+ - ~70MB-2GB of disk space (varies by model)

  ## Acknowledgments