fastembed 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop.yml +1 -0
- data/.yardopts +6 -0
- data/BENCHMARKS.md +124 -1
- data/CHANGELOG.md +14 -0
- data/README.md +395 -74
- data/benchmark/compare_all.rb +167 -0
- data/benchmark/compare_python.py +60 -0
- data/benchmark/memory_profile.rb +70 -0
- data/benchmark/profile.rb +198 -0
- data/benchmark/reranker_benchmark.rb +158 -0
- data/exe/fastembed +6 -0
- data/fastembed.gemspec +3 -0
- data/lib/fastembed/async.rb +193 -0
- data/lib/fastembed/base_model.rb +247 -0
- data/lib/fastembed/base_model_info.rb +61 -0
- data/lib/fastembed/cli.rb +745 -0
- data/lib/fastembed/custom_model_registry.rb +255 -0
- data/lib/fastembed/image_embedding.rb +313 -0
- data/lib/fastembed/late_interaction_embedding.rb +260 -0
- data/lib/fastembed/late_interaction_model_info.rb +91 -0
- data/lib/fastembed/model_info.rb +59 -19
- data/lib/fastembed/model_management.rb +82 -23
- data/lib/fastembed/onnx_embedding_model.rb +25 -4
- data/lib/fastembed/pooling.rb +39 -3
- data/lib/fastembed/progress.rb +52 -0
- data/lib/fastembed/quantization.rb +75 -0
- data/lib/fastembed/reranker_model_info.rb +91 -0
- data/lib/fastembed/sparse_embedding.rb +261 -0
- data/lib/fastembed/sparse_model_info.rb +80 -0
- data/lib/fastembed/text_cross_encoder.rb +217 -0
- data/lib/fastembed/text_embedding.rb +161 -28
- data/lib/fastembed/validators.rb +59 -0
- data/lib/fastembed/version.rb +1 -1
- data/lib/fastembed.rb +42 -1
- data/plan.md +257 -0
- data/scripts/verify_models.rb +229 -0
- metadata +70 -3
data/README.md
CHANGED
@@ -3,156 +3,477 @@

[Gem Version](https://rubygems.org/gems/fastembed)
[CI](https://github.com/khasinski/fastembed-rb/actions/workflows/ci.yml)

Fast, lightweight text embeddings in Ruby. A port of [FastEmbed](https://github.com/qdrant/fastembed) by Qdrant.

```ruby
embedding = Fastembed::TextEmbedding.new
vectors = embedding.embed(["The quick brown fox", "jumps over the lazy dog"]).to_a
```

Supports dense embeddings, sparse embeddings (SPLADE), late interaction (ColBERT), reranking, and image embeddings - all running locally with ONNX Runtime.

## Table of Contents

- [Installation](#installation)
- [Getting Started](#getting-started)
- [Text Embeddings](#text-embeddings)
- [Reranking](#reranking)
- [Sparse Embeddings](#sparse-embeddings)
- [Late Interaction (ColBERT)](#late-interaction-colbert)
- [Image Embeddings](#image-embeddings)
- [Async Processing](#async-processing)
- [Progress Tracking](#progress-tracking)
- [CLI](#cli)
- [Custom Models](#custom-models)
- [Configuration](#configuration)
- [Performance](#performance)

## Installation

Add to your Gemfile:

```ruby
gem "fastembed"
```

For image embeddings, also add:

```ruby
gem "mini_magick"
```

## Getting Started

```ruby
require "fastembed"

# Create an embedding model (downloads ~67MB on first use)
embedding = Fastembed::TextEmbedding.new

# Embed some text
documents = [
  "Ruby is a dynamic programming language",
  "Python is great for data science",
  "JavaScript runs in the browser"
]
vectors = embedding.embed(documents).to_a

# Each vector is 384 floats (for the default model)
vectors.first.length # => 384
```

### Semantic Search

Find documents by meaning, not just keywords:

```ruby
embedding = Fastembed::TextEmbedding.new

# Your document corpus
documents = [
  "The cat sat on the mat",
  "Machine learning powers modern AI",
  "Ruby on Rails is a web framework",
  "Deep learning uses neural networks"
]
doc_vectors = embedding.embed(documents).to_a

# Search for a concept
query = "artificial intelligence and neural nets"
query_vector = embedding.embed([query]).first

# Find the most similar document (a dot product, which equals
# cosine similarity because these vectors are normalized)
scores = doc_vectors.map { |v| query_vector.zip(v).sum { |a, b| a * b } }
best_idx = scores.each_with_index.max.last

puts documents[best_idx] # => "Deep learning uses neural networks"
```
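
The dot product above stands in for cosine similarity only because the default model returns normalized vectors. For a model that does not normalize, a full cosine helper is a few lines; this is a minimal sketch over plain Ruby arrays, not part of the gem's API:

```ruby
# Cosine similarity between two plain arrays of floats.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

scores = doc_vectors.map { |v| cosine_similarity(query_vector, v) }
```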

### Integration with Vector Databases

```ruby
# With Qdrant
require "qdrant"

embedding = Fastembed::TextEmbedding.new
client = Qdrant::Client.new(url: "http://localhost:6333")

# Index documents
documents.each_with_index do |doc, i|
  vector = embedding.embed([doc]).first
  client.points.upsert(
    collection_name: "docs",
    points: [{ id: i, vector: vector, payload: { text: doc } }]
  )
end

# Search
query_vector = embedding.embed(["your search query"]).first
results = client.points.search(collection_name: "docs", vector: query_vector, limit: 5)
```

## Text Embeddings

### Choose a Model

```ruby
# Default: fast and accurate (384 dimensions, 67MB)
embedding = Fastembed::TextEmbedding.new

# Higher accuracy (768 dimensions, 210MB)
embedding = Fastembed::TextEmbedding.new(model_name: "BAAI/bge-base-en-v1.5")

# Multilingual - 100+ languages (384 dimensions)
embedding = Fastembed::TextEmbedding.new(model_name: "intfloat/multilingual-e5-small")

# Long documents - 8192 token context (768 dimensions)
embedding = Fastembed::TextEmbedding.new(model_name: "nomic-ai/nomic-embed-text-v1.5")
```

### Supported Models

| Model | Dimensions | Size | Notes |
|-------|------------|------|-------|
| `BAAI/bge-small-en-v1.5` | 384 | 67MB | Default, fast |
| `BAAI/bge-base-en-v1.5` | 768 | 210MB | Better accuracy |
| `BAAI/bge-large-en-v1.5` | 1024 | 1.2GB | Best accuracy |
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | 90MB | General purpose |
| `sentence-transformers/all-mpnet-base-v2` | 768 | 440MB | High quality |
| `intfloat/multilingual-e5-small` | 384 | 450MB | 100+ languages |
| `intfloat/multilingual-e5-base` | 768 | 1.1GB | Multilingual, better accuracy |
| `nomic-ai/nomic-embed-text-v1.5` | 768 | 520MB | 8192 token context |
| `jinaai/jina-embeddings-v2-base-en` | 768 | 520MB | 8192 token context |

### Query vs Passage Embeddings

For asymmetric search (short queries, long documents), use the specialized methods:

```ruby
# For search queries
query_vectors = embedding.query_embed(["What is Ruby?"]).to_a

# For documents/passages
doc_vectors = embedding.passage_embed(documents).to_a
```

### Lazy Evaluation

Embeddings are generated lazily, which keeps memory use low on large datasets:

```ruby
# Process millions of documents without loading all vectors into memory
File.foreach("documents.txt").lazy.each_slice(1000) do |batch|
  embedding.embed(batch).each do |vector|
    store_in_database(vector)
  end
end
```

## Reranking

Rerankers score query-document pairs for more accurate relevance ranking. Use them after initial retrieval:

```ruby
reranker = Fastembed::TextCrossEncoder.new

query = "What is machine learning?"
documents = [
  "Machine learning is a branch of AI",
  "The weather is nice today",
  "Deep learning uses neural networks"
]

# Get raw scores (higher = more relevant)
scores = reranker.rerank(query: query, documents: documents)
# => [8.5, -10.2, 5.3]

# Get sorted results with metadata
results = reranker.rerank_with_scores(query: query, documents: documents, top_k: 2)
# => [
#   { document: "Machine learning is...", score: 8.5, index: 0 },
#   { document: "Deep learning uses...", score: 5.3, index: 2 }
# ]
```
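
A common pattern is two-stage retrieval: pull candidates with the fast dense embeddings, then rerank only those candidates with the cross-encoder. A minimal sketch built from the calls shown above; `corpus`, `query`, and the candidate count are placeholders:

```ruby
embedding = Fastembed::TextEmbedding.new
reranker = Fastembed::TextCrossEncoder.new

# Stage 1: dense retrieval - keep the 20 best candidates by dot product
doc_vectors = embedding.embed(corpus).to_a
query_vector = embedding.embed([query]).first
candidates = corpus.zip(doc_vectors)
                   .max_by(20) { |_doc, v| query_vector.zip(v).sum { |a, b| a * b } }
                   .map(&:first)

# Stage 2: let the slower, more accurate reranker order the finalists
top = reranker.rerank_with_scores(query: query, documents: candidates, top_k: 5)
```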

### Reranker Models

| Model | Size | Notes |
|-------|------|-------|
| `cross-encoder/ms-marco-MiniLM-L-6-v2` | 80MB | Default, fast |
| `cross-encoder/ms-marco-MiniLM-L-12-v2` | 120MB | Better accuracy |
| `BAAI/bge-reranker-base` | 1.1GB | High accuracy |
| `BAAI/bge-reranker-large` | 2.2GB | Best accuracy |

## Sparse Embeddings

SPLADE models produce sparse vectors where each dimension corresponds to a vocabulary term. Great for hybrid search:

```ruby
sparse = Fastembed::TextSparseEmbedding.new

result = sparse.embed(["Ruby programming language"]).first
# => #<SparseEmbedding indices=[1234, 5678, ...] values=[0.8, 1.2, ...]>

result.indices # vocabulary token IDs with non-zero weights
result.values  # corresponding weights
result.nnz     # number of non-zero elements
```
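
Scoring two sparse vectors is a dot product over the indices they share. A sketch using only the `indices`/`values` accessors above; the helper itself is hypothetical, not part of the gem:

```ruby
# Dot product of two sparse embeddings: multiply weights at shared indices.
def sparse_dot(a, b)
  weights = a.indices.zip(a.values).to_h
  b.indices.zip(b.values).sum { |idx, val| (weights[idx] || 0.0) * val }
end

query_sparse = sparse.embed(["programming in Ruby"]).first
score = sparse_dot(query_sparse, result)
```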

### Hybrid Search

Combine dense and sparse embeddings for better results:

```ruby
dense = Fastembed::TextEmbedding.new
sparse = Fastembed::TextSparseEmbedding.new

documents = ["your documents here"]

# Generate both types of embeddings
dense_vectors = dense.embed(documents).to_a
sparse_vectors = sparse.embed(documents).to_a

# Store both in your vector database and combine scores at query time
```
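
One way to combine the two result lists at query time is reciprocal rank fusion, which sidesteps the fact that dense and sparse scores live on different scales. A sketch, assuming `dense_scores` and `sparse_scores` are per-document score arrays computed as in the sections above:

```ruby
# Reciprocal rank fusion: rank documents under each scorer and sum
# 1 / (k + rank); k = 60 is the conventional smoothing constant.
def rrf(score_lists, k: 60)
  fused = Hash.new(0.0)
  score_lists.each do |scores|
    ranking = scores.each_with_index.sort_by { |score, _i| -score }
    ranking.each_with_index { |(_score, doc_idx), rank| fused[doc_idx] += 1.0 / (k + rank + 1) }
  end
  fused.sort_by { |_idx, fused_score| -fused_score }
end

best_doc_idx, _ = rrf([dense_scores, sparse_scores]).first
```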

## Late Interaction (ColBERT)

ColBERT produces token-level embeddings for fine-grained matching:

```ruby
colbert = Fastembed::LateInteractionTextEmbedding.new

query = colbert.query_embed(["What is Ruby?"]).first
doc = colbert.embed(["Ruby is a programming language"]).first

# MaxSim scoring - sum of max similarities per query token
score = query.max_sim(doc)
```
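
For intuition: MaxSim takes each query token vector, finds its best-matching document token vector, and sums those maxima. The same computation over plain nested arrays (one row per token) looks like this; the standalone `max_sim` below is a sketch, not the gem's method:

```ruby
# MaxSim over nested arrays: for each query token, take the best
# dot product against any document token, then sum across the query.
def max_sim(query_tokens, doc_tokens)
  query_tokens.sum do |q|
    doc_tokens.map { |d| q.zip(d).sum { |a, b| a * b } }.max
  end
end
```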

### Late Interaction Models

| Model | Dimensions | Notes |
|-------|------------|-------|
| `colbert-ir/colbertv2.0` | 128 | Default |
| `jinaai/jina-colbert-v1-en` | 768 | 8192 token context |

## Image Embeddings

Convert images to vectors for visual search:

```ruby
# Requires the mini_magick gem
image_embed = Fastembed::ImageEmbedding.new

# From file paths
vectors = image_embed.embed(["photo1.jpg", "photo2.png"]).to_a

# From URLs
vectors = image_embed.embed(["https://example.com/image.jpg"]).to_a
```

### Image Models

| Model | Dimensions | Notes |
|-------|------------|-------|
| `Qdrant/clip-ViT-B-32-vision` | 512 | Default, CLIP |
| `Qdrant/resnet50-onnx` | 2048 | ResNet50 |
| `jinaai/jina-clip-v1` | 768 | Jina CLIP |

## Async Processing

Run embeddings in background threads:

```ruby
embedding = Fastembed::TextEmbedding.new

# Start async embedding
future = embedding.embed_async(large_document_list)

# Do other work...

# Get results when ready (blocks until complete)
vectors = future.value
```

### Parallel Processing

```ruby
# Process multiple batches concurrently
futures = documents.each_slice(1000).map do |batch|
  embedding.embed_async(batch)
end

# Wait for all and combine results
all_vectors = futures.flat_map(&:value)
```

### Future Methods

```ruby
future.complete?        # check if done
future.pending?         # check if still running
future.success?         # completed without error?
future.failure?         # completed with error?
future.error            # get the error if failed
future.wait(timeout: 5) # wait up to 5 seconds

# Chaining
future.then { |vectors| vectors.map(&:first) }
      .rescue { |e| puts "Error: #{e}" }
```

### Async Utilities

```ruby
# Wait for all futures
results = Fastembed::Async.all(futures)

# Get first completed result
result = Fastembed::Async.race(futures, timeout: 10)
```
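
Putting the pieces together: fan batches out, wait for everything, then inspect each future rather than letting one failure sink the whole job. A sketch built only from the methods listed above, assuming `Fastembed::Async.all` simply waits for completion:

```ruby
futures = documents.each_slice(1000).map { |batch| embedding.embed_async(batch) }
Fastembed::Async.all(futures)

vectors = futures.flat_map do |future|
  if future.success?
    future.value
  else
    warn "batch failed: #{future.error}"
    [] # skip failed batches (or re-enqueue them)
  end
end
```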

## Progress Tracking

Track progress for large embedding jobs:

```ruby
embedding = Fastembed::TextEmbedding.new

documents = Array.new(10_000) { "document text" }

embedding.embed(documents, batch_size: 256) do |progress|
  puts "Batch #{progress.current}/#{progress.total}"
  puts "#{(progress.percentage * 100).round}% complete"
  puts "~#{progress.documents_processed} documents processed"
end.to_a
```
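
The same callback can drive a single-line progress bar instead of printing three lines per batch. A plain-Ruby sketch, assuming `percentage` is a 0..1 fraction as in the example above:

```ruby
embedding.embed(documents, batch_size: 256) do |progress|
  filled = (progress.percentage * 30).round
  print "\r[#{'#' * filled}#{'-' * (30 - filled)}] #{(progress.percentage * 100).round}%"
end.to_a
puts
```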

## CLI

FastEmbed includes a command-line tool:

```bash
# List available models
fastembed list            # embedding models
fastembed list-reranker   # reranker models
fastembed list-sparse     # sparse models
fastembed list-image      # image models

# Get model info
fastembed info "BAAI/bge-small-en-v1.5"

# Pre-download a model
fastembed download "BAAI/bge-base-en-v1.5"

# Embed text (outputs JSON)
fastembed embed "Hello world" "Another text"

# Different output formats
fastembed embed -f ndjson "Hello world"
fastembed embed -f csv "Hello world"

# Read from file
fastembed embed -i documents.txt

# Use a different model
fastembed embed -m "BAAI/bge-base-en-v1.5" "Hello"

# Rerank documents
fastembed rerank "query" "doc1" "doc2" "doc3"

# Benchmark a model
fastembed benchmark -m "BAAI/bge-small-en-v1.5" -n 100
```

## Custom Models

Register custom models from HuggingFace:

```ruby
Fastembed.register_model(
  model_name: "my-org/my-model",
  dim: 768,
  description: "My custom model",
  sources: { hf: "my-org/my-model" },
  model_file: "onnx/model.onnx"
)

# Now use it like any other model
embedding = Fastembed::TextEmbedding.new(model_name: "my-org/my-model")
```

### Load from Local Directory

```ruby
embedding = Fastembed::TextEmbedding.new(
  local_model_dir: "/path/to/model",
  model_file: "model.onnx",
  tokenizer_file: "tokenizer.json"
)
```

## Configuration

### Initialization Options

```ruby
Fastembed::TextEmbedding.new(
  model_name: "BAAI/bge-small-en-v1.5",  # model to use
  cache_dir: "~/.cache/fastembed",       # where to store models
  threads: 4,                            # ONNX Runtime threads
  providers: ["CUDAExecutionProvider"],  # GPU acceleration
  show_progress: true,                   # show download progress
  quantization: :q4                      # use quantized model
)
```

### Quantization

Use smaller, faster models with quantization:

```ruby
# Available: :fp32 (default), :fp16, :int8, :uint8, :q4
embedding = Fastembed::TextEmbedding.new(quantization: :int8)
```

### Environment Variables

| Variable | Description |
|----------|-------------|
| `FASTEMBED_CACHE_PATH` | Custom model cache directory |
| `HF_TOKEN` | HuggingFace token for private models |

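Both are ordinary environment variables, so they can also be set from Ruby before the first model is created; a hypothetical setup (path and token are placeholders):

```ruby
ENV["FASTEMBED_CACHE_PATH"] = "/data/models"
ENV["HF_TOKEN"] = "hf_..." # only needed for private models

embedding = Fastembed::TextEmbedding.new
```
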
### GPU Acceleration

```ruby
# CUDA (Linux/Windows with NVIDIA GPU)
embedding = Fastembed::TextEmbedding.new(
  providers: ["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# CoreML (macOS)
embedding = Fastembed::TextEmbedding.new(
  providers: ["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
```

## Performance

On Apple M1 Max with the default model (BAAI/bge-small-en-v1.5):

| Batch Size | Documents/sec | Latency |
|------------|---------------|---------|
| 1 | ~150 | ~6.5ms |
| 32 | ~500 | ~64ms |
| 256 | ~550 | ~465ms |

Larger models are slower but more accurate. See [BENCHMARKS.md](BENCHMARKS.md) for detailed comparisons.

## Requirements

- Ruby >= 3.3
- ~70MB-2GB disk space (varies by model)

## Acknowledgments