lancelot 0.1.1 → 0.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 9524a69ce1b7a5065839b41b5619585bc4e1643c79ec90cfc69fc82a3a8eed7e
-   data.tar.gz: e2c8fc98de615913297b7b06e909b8d85179cb2018e3a08c9b70f4bb2756d55f
+   metadata.gz: 59a61f845bead9178dc6b7ca831b0cd1f3577969c189d723b085a706527bb9a9
+   data.tar.gz: f6941d534cc770393803c152f1718dad15639086b4ae78e84da93f8f8ab6f43f
  SHA512:
-   metadata.gz: 043fcea4635008454b61999055be80b21c0ddb7c67462e0f90d242f7b81395068ff7ed1cabeabf1f63f7c1823593f2405fca083c3aa9330c7ac7a7b1f09ae4c7
-   data.tar.gz: 4efedcef3af02685c97531162aa6adea5beb3084759033a0b32bdf535663aa12c4246cb9d67649bd836f04bf317e59101326e17c8a8474ace059908d0b36074d
+   metadata.gz: 04f63038fe699b2441c22618daac20ef710df82a5d95c6c8258317a2600b23ea2844ab5c2f1bc01245fb4ff90d531f3ec09f7ddcb28876e758b983a0bcb1fdc2
+   data.tar.gz: 20b8c06914dc993b6868b7264cfcfba0eeb0a0a0c70b0473ac6e3b9e163c8ed3855c2eeae1ad02ab3f19125f80c844a0aebd2c26a91aafc005af0dfebd8d37d6
data/CLAUDE.md ADDED
@@ -0,0 +1,176 @@
+ # Lancelot - Project Instructions
+
+ You are working on lancelot, a Ruby gem that provides native bindings to the Lance columnar data format through Rust and Magnus.
+
+ ## Project Context
+
+ Lancelot is part of a Ruby-native NLP/ML ecosystem:
+ - **lance**: Rust crate providing columnar storage with vector and text search
+ - **lancelot**: THIS PROJECT - Ruby bindings to Lance via Magnus
+ - **red-candle**: Ruby gem providing LLMs, embeddings, and rerankers
+ - **candle-agent**: Future agent capabilities for the ecosystem
+
+ ## Core Architecture
+
+ ### Ruby-Rust Bridge
+ - Uses Magnus 0.7 for Ruby-Rust interop
+ - RefCell pattern for interior mutability
+ - Embedded Tokio runtime for async Lance operations
+ - Clean separation between Ruby API and Rust implementation
+
+ ### Key Components
+ - `lib/lancelot/dataset.rb`: Ruby-idiomatic API layer
+ - `ext/lancelot/src/dataset.rs`: Core Lance operations
+ - `ext/lancelot/src/schema.rs`: Schema building and type mapping
+ - `ext/lancelot/src/conversion.rs`: Ruby-Rust data conversion
+
+ ## Design Principles
+
+ 1. **Ruby-First API**: Make it feel native to Ruby developers
+    - Use symbols for type names (`:string`, `:vector`)
+    - Support operator overloading (`<<` for append)
+    - Include Enumerable for iteration
+    - Return Ruby hashes/arrays, not foreign objects
+
+ 2. **Schema-First Design**: Lance requires schemas at creation
+    - Clear schema definition API
+    - Helpful error messages for type mismatches
+    - Support all Arrow types Lance supports
+
+ 3. **Performance Without Complexity**: Hide async/columnar details
+    - Embed Tokio runtime, don't expose it
+    - Convert RecordBatches to Ruby transparently
+    - Efficient batch operations when possible
+
+ 4. **Error Handling**: Rust errors become Ruby exceptions
+    - Use Magnus's error conversion
+    - Provide context in error messages
+    - Never panic in Rust code
+
+ ## Current Features
+
+ ### Implemented
+ - Dataset creation with schema
+ - CRUD operations (add, update, delete, get)
+ - Vector search with ANN indices
+ - Full-text search with inverted indices
+ - Multi-column text search
+ - SQL-like filtering
+ - Enumerable support
+
+ ### Not Yet Implemented
+ - Schema evolution
+ - Hybrid search (vector + text fusion)
+ - Streaming operations
+ - Transaction support
+ - Data versioning/time travel
+
+ ## Development Guidelines
+
+ ### Adding New Features
+
+ 1. **Start with the Ruby API**: Design how it should feel in Ruby first
+ 2. **Implement in Rust**: Keep the Rust layer focused on Lance operations
+ 3. **Type Conversion**: Use conversion.rs patterns for new types
+ 4. **Testing**: Add Ruby tests in spec/ and Rust tests in src/
+
+ ### Magnus Best Practices
+
+ 1. **Memory Management**:
+    ```rust
+    #[magnus::wrap(class = "Lancelot::Dataset", free_immediately, size)]
+    ```
+
+ 2. **Method Definition**:
+    ```rust
+    class.define_method("method_name", method!(LancelotDataset::method_name, arity))?;
+    ```
+
+ 3. **Error Handling**:
+    ```rust
+    pub fn operation(&self) -> Result<Value, Error> {
+        self.with_dataset(|dataset| {
+            // Lance operation
+        }).map_err(|e| Error::new(exception::runtime_error(), e.to_string()))
+    }
+    ```
+
+ 4. **Async Operations**:
+    ```rust
+    self.with_runtime(|runtime| {
+        runtime.block_on(async {
+            // Async Lance operation
+        })
+    })
+    ```
+
+ ### Type Mappings
+
+ Ruby → Arrow/Lance:
+ - `:string` → Utf8
+ - `:integer` → Int64
+ - `:float` → Float64
+ - `:vector` → FixedSizeList with dimension
+ - `:boolean` → Bool
+ - `:date` → Date32
+ - `:datetime` → Timestamp
+
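Review note: the type-mapping list can be pictured as a plain lookup table. The sketch below is illustrative only. `ARROW_TYPES` and `arrow_type_for` are hypothetical names that do not appear in the gem, and CLAUDE.md places the real mapping in `ext/lancelot/src/schema.rs`. The `"FixedSizeList(768)"` string format is likewise an assumption made for readability.

```ruby
# Hypothetical lookup table mirroring the Ruby → Arrow/Lance list.
# Illustration only: the gem performs this mapping in its Rust extension.
ARROW_TYPES = {
  string:   "Utf8",
  integer:  "Int64",
  float:    "Float64",
  vector:   "FixedSizeList",
  boolean:  "Bool",
  date:     "Date32",
  datetime: "Timestamp"
}.freeze

# Resolve a schema entry: either a Symbol, or a Hash with :type and :dimension.
def arrow_type_for(spec)
  type = spec.is_a?(Hash) ? spec.fetch(:type).to_sym : spec.to_sym
  name = ARROW_TYPES.fetch(type) { raise ArgumentError, "unsupported type: #{type.inspect}" }
  return name unless type == :vector

  # Vectors are fixed-size lists, so a dimension is required.
  dimension = spec.is_a?(Hash) ? spec[:dimension] : nil
  raise ArgumentError, "vector types need a :dimension" unless dimension

  "#{name}(#{dimension})"
end

puts arrow_type_for(:string)
puts arrow_type_for({ type: "vector", dimension: 768 })
```

Raising on unknown symbols follows the "helpful error messages for type mismatches" principle stated earlier in the file.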
+ ### Testing
+
+ - Ruby specs use RSpec
+ - Test both successful operations and error cases
+ - Use temporary directories for test datasets
+ - Clean up resources in after blocks
+
+ ## Common Tasks
+
+ ### Adding a New Search Method
+ 1. Define Ruby API in dataset.rb
+ 2. Add Rust implementation in dataset.rs
+ 3. Handle type conversion if needed
+ 4. Add index support if applicable
+ 5. Write comprehensive tests
+
+ ### Exposing Lance Features
+ 1. Check Lance API documentation
+ 2. Design Ruby-idiomatic wrapper
+ 3. Consider if it needs async handling
+ 4. Implement with proper error handling
+ 5. Document with YARD comments
+
+ ## Integration Points
+
+ ### With Red-Candle
+ - Lancelot stores embeddings from red-candle
+ - Vector dimensions must match model output
+ - Consider batch operations for efficiency
+
+ ### Future: Hybrid Search
+ - Will combine vector and text search results
+ - RRF (Reciprocal Rank Fusion) planned
+ - Consider implementing in Ruby first
+
+ ## Performance Considerations
+
+ 1. **Batch Operations**: Always prefer batch over individual ops
+ 2. **Index Building**: Build indices after bulk loading
+ 3. **Memory Usage**: Lance uses memory mapping efficiently
+ 4. **Ruby GC**: Use free_immediately for deterministic cleanup
+
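Review note: the batch-operations advice could look roughly like this in caller code. This is a sketch under assumptions. `CountingDataset` is a hypothetical stand-in so the example runs without the gem, `add_in_batches` is not a lancelot method, and the batch size of 1000 is an arbitrary illustrative value. Only `add_documents` is taken from the README in this diff.

```ruby
# Hypothetical stand-in that only counts calls, so the sketch runs without lancelot.
class CountingDataset
  attr_reader :calls, :rows

  def initialize
    @calls = 0
    @rows = 0
  end

  # Same name as the method the README uses on Lancelot::Dataset.
  def add_documents(docs)
    @calls += 1
    @rows += docs.size
  end
end

# Write records in fixed-size batches instead of one call per record.
def add_in_batches(dataset, records, batch_size: 1000)
  records.each_slice(batch_size) { |batch| dataset.add_documents(batch) }
end

dataset = CountingDataset.new
records = Array.new(2500) { |i| { text: "doc #{i}" } }
add_in_batches(dataset, records, batch_size: 1000)

puts "#{dataset.rows} rows written in #{dataset.calls} calls"
```

With a real dataset, index creation would follow the last batch, in line with item 2 of the list.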
+ ## Debugging Tips
+
+ 1. **Rust Panics**: Use `RUST_BACKTRACE=1` for stack traces
+ 2. **Lance Logs**: Set `RUST_LOG=lance=debug`
+ 3. **Ruby-Rust Bridge**: Check type conversions first
+ 4. **Async Issues**: Ensure operations run on the runtime
+
+ ## Release Process
+
+ 1. Update version.rb
+ 2. Run full test suite
+ 3. Build gem locally and test
+ 4. Update CHANGELOG.md
+ 5. Tag release and push
+ 6. `rake release` to publish
+
+ Remember: The goal is to make Lance's power accessible to Ruby developers without them needing to understand columnar formats, Rust, or async programming.
data/README.md CHANGED
@@ -2,23 +2,59 @@

  Ruby bindings for [Lance](https://github.com/lancedb/lance), a modern columnar data format for ML. Lancelot provides a Ruby-native interface to Lance, enabling efficient storage and search of multimodal data including text, vectors, and more.

+ ## Quickstart
+
+ ```ruby
+ require 'lancelot'
+ require 'red-candle'
+
+ strings = [
+   "apple",
+   "orange",
+   "google"
+ ]
+
+ model = Candle::EmbeddingModel.from_pretrained
+
+ dataset = Lancelot::Dataset.open_or_create("words", schema: {
+   text: :string,
+   embedding: { type: "vector", dimension: 768 }
+ })
+
+ records = strings.collect do |string|
+   embedding = model.embedding(string).first.to_a
+   { text: string, embedding: embedding }
+ end
+
+ dataset.add_documents(records)
+
+ dataset.create_vector_index("embedding")
+ dataset.create_text_index("text")
+
+
+ query = "fruit"
+ query_embedding = model.embedding(query).first.to_a
+ dataset.vector_search(query_embedding, column: "embedding", limit: 5).each { |r| puts r[:text] }; nil
+
+ dataset.text_search("apple", column: "text", limit: 5).each { |r| puts r[:text] }; nil
+
+ query = "tech company"
+ query_embedding = model.embedding(query).first.to_a
+ dataset.vector_search(query_embedding, column: "embedding", limit: 5).each { |r| puts r[:text] }; nil
+ ```
+
  ## Features

  ### Implemented
  - **Dataset Creation**: Create Lance datasets with schemas
- - **Data Storage**: Add documents to datasets
+ - **Data Storage**: Add documents to datasets
  - **Document Retrieval**: Read documents from datasets with enumerable support
  - **Vector Search**: Create vector indices and perform similarity search
+ - **Full-Text Search**: Built-in full-text search with inverted indices
+ - **Hybrid Search**: Combine text and vector search with Reciprocal Rank Fusion (RRF)
  - **Schema Support**: Define schemas with string, float32, and vector types
  - **Row Counting**: Get the number of rows in a dataset

- ### Planned
-
- - **Full-Text Search**: Built-in full-text search capabilities
- - **Hybrid Search**: Combine text and vector search with RRF and other fusion methods
- - **Multimodal Support**: Store and search across different data types beyond text and vectors
- - **Schema Evolution**: Add new columns to existing datasets without rewriting data
-
  ## Installation

  Install the gem and add to the application's Gemfile by executing:
@@ -121,10 +157,109 @@ results = dataset.text_search("programming", columns: ["title", "content", "tags

  **Note**: Full-text search requires creating inverted indices first. For simple pattern matching without indices, use SQL-like filtering with `where`.

+ ### Hybrid Search with Reciprocal Rank Fusion (RRF)
+
+ Lancelot now supports hybrid search, combining vector and text search results using Reciprocal Rank Fusion:
+
+ ```ruby
+ # Example 1: Using the same query for both vector and text search
+ # First, let's assume we have a function that converts text to embeddings
+ def text_to_embedding(text)
+   # Your embedding model here (e.g., using red-candle or another embedding service)
+   # Returns a vector representation of the text
+ end
+
+ # Search using both modalities with the same query
+ query = "machine learning frameworks"
+ query_embedding = text_to_embedding(query)
+
+ results = dataset.hybrid_search(
+   query,                       # Text query
+   vector: query_embedding,     # Vector query (same content, embedded)
+   vector_column: "embedding",  # Vector column to search
+   text_column: "content",      # Text column to search
+   limit: 10
+ )
+
+ # Results are fused using RRF and include an rrf_score
+ results.each do |doc|
+   puts "#{doc[:title]} - RRF Score: #{doc[:rrf_score]}"
+ end
+ ```
+
+ ```ruby
+ # Example 2: Multiple queries across different modalities
+ # You can use different queries for vector and text search
+
+ # Semantic vector search for conceptually similar content
+ concept_embedding = text_to_embedding("deep learning neural networks")
+
+ # Keyword text search for specific terms
+ keyword_query = "PyTorch TensorFlow"
+
+ results = dataset.hybrid_search(
+   keyword_query,               # Specific keyword search
+   vector: concept_embedding,   # Broader semantic search
+   vector_column: "embedding",
+   text_column: "content",
+   limit: 20
+ )
+ ```
+
+ ```ruby
+ # Example 3: Multi-column text search with vector search
+ # Search across multiple text columns while also doing vector similarity
+
+ results = dataset.hybrid_search(
+   "ruby programming",
+   vector: text_to_embedding("object-oriented scripting language"),
+   vector_column: "embedding",
+   text_columns: ["title", "content", "tags"],  # Search multiple text columns
+   limit: 15
+ )
+ ```
+
+ ```ruby
+ # Example 4: Advanced RRF with custom k parameter
+ # The k parameter (default 60) controls the fusion behavior
+ # Lower k values give more weight to top-ranked results
+
+ results = dataset.hybrid_search(
+   "distributed systems",
+   vector: text_to_embedding("distributed systems"),
+   vector_column: "embedding",
+   text_column: "content",
+   limit: 10,
+   rrf_k: 30  # More aggressive fusion, emphasizes top results
+ )
+ ```
+
+ ```ruby
+ # Example 5: Using RankFusion module directly for custom fusion
+ # Useful when you want to combine results from multiple separate searches
+ require 'lancelot/rank_fusion'
+
+ # Perform multiple searches with different queries
+ vector_results1 = dataset.vector_search(embedding1, column: "embedding", limit: 20)
+ vector_results2 = dataset.vector_search(embedding2, column: "embedding", limit: 20)
+ text_results1 = dataset.text_search("machine learning", column: "content", limit: 20)
+ text_results2 = dataset.text_search("neural networks", column: "title", limit: 20)
+
+ # Fuse all results using RRF
+ fused_results = Lancelot::RankFusion.reciprocal_rank_fusion(
+   [vector_results1, vector_results2, text_results1, text_results2],
+   k: 60
+ )
+
+ # Take top 10 fused results
+ top_results = fused_results.first(10)
+ ```
+
+ **RRF Algorithm**: Reciprocal Rank Fusion calculates scores as `Σ(1/(k+rank))` across all result lists, where k=60 by default. Documents appearing in multiple result lists with high ranks get higher RRF scores.
+
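Review note: a small worked example of the `Σ(1/(k+rank))` formula. `rrf_score` is a hypothetical helper written for this note, not part of the gem. It scores one document from its 1-based ranks, and then compares how much a rank-1 hit outweighs a rank-10 hit for two values of `k`.

```ruby
# Score a single document from its 1-based ranks in each result list.
def rrf_score(ranks, k: 60)
  ranks.sum { |rank| 1.0 / (k + rank) }
end

both_lists = rrf_score([1, 3])  # ranked 1st and 3rd: 1/61 + 1/63
one_list   = rrf_score([1])     # ranked 1st in a single list: 1/61

# How much more a rank-1 hit contributes than a rank-10 hit.
ratio_k60 = rrf_score([1], k: 60) / rrf_score([10], k: 60)  # 70/61
ratio_k30 = rrf_score([1], k: 30) / rrf_score([10], k: 30)  # 40/31

puts format("both lists: %.5f, one list: %.5f", both_lists, one_list)
puts format("rank 1 vs rank 10: k=60 ratio %.3f, k=30 ratio %.3f", ratio_k60, ratio_k30)
```

The larger ratio at `k: 30` is the arithmetic behind the README's comment that lower `k` values emphasize top-ranked results.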
  **Current Limitations:**
  - Schema must be defined when creating a dataset
  - Schema evolution is not yet implemented (Lance supports it, but our bindings don't expose it yet)
- - Hybrid search (RRF) is not yet implemented
  - Supported field types: string, float32, float64, int32, int64, boolean, and fixed-size vectors

  **Note on Lance's Schema Flexibility:**
data/examples/idempotent_create.rb ADDED
@@ -0,0 +1,66 @@
+ #!/usr/bin/env ruby
+ # Demonstrates idempotent dataset creation with open_or_create
+
+ require 'bundler/setup'
+ require 'lancelot'
+ require 'fileutils'
+
+ dataset_path = "words"
+
+ puts "="*60
+ puts "Idempotent Dataset Creation Demo"
+ puts "="*60
+
+ schema = {
+   text: :string,
+   embedding: { type: "vector", dimension: 768 }
+ }
+
+ # First call - will CREATE the dataset
+ puts "\n1. First call to open_or_create (should create)..."
+ dataset = Lancelot::Dataset.open_or_create(dataset_path, schema: schema)
+ puts " Dataset opened/created. Current count: #{dataset.count}"
+
+ # Add some data
+ dataset.add_documents([
+   { text: "hello", embedding: Array.new(768) { rand } },
+   { text: "world", embedding: Array.new(768) { rand } }
+ ])
+ puts " Added 2 documents. New count: #{dataset.count}"
+
+ # Second call - will OPEN the existing dataset
+ puts "\n2. Second call to open_or_create (should open existing)..."
+ dataset2 = Lancelot::Dataset.open_or_create(dataset_path, schema: schema)
+ puts " Dataset opened. Current count: #{dataset2.count}"
+ puts " ✓ Data persisted from previous session!"
+
+ # Third call - still idempotent
+ puts "\n3. Third call - still works..."
+ dataset3 = Lancelot::Dataset.open_or_create(dataset_path, schema: schema)
+ dataset3.add_documents([
+   { text: "more", embedding: Array.new(768) { rand } }
+ ])
+ puts " Added 1 more document. New count: #{dataset3.count}"
+
+ # Demonstrate the OLD way that would fail
+ puts "\n4. Compare with non-idempotent create (would fail)..."
+ begin
+   # This will fail because dataset already exists
+   failing_dataset = Lancelot::Dataset.create(dataset_path, schema: schema)
+   puts " ✗ This shouldn't happen!"
+ rescue => e
+   puts " ✓ Dataset.create correctly failed: #{e.class}"
+   puts " Message: #{e.message[0..50]}..."
+ end
+
+ # Clean up
+ FileUtils.rm_rf(dataset_path)
+
+ puts "\n" + "="*60
+ puts "Summary: Use open_or_create for idempotent operations!"
+ puts "="*60
+ puts "\nInstead of:"
+ puts ' dataset = Lancelot::Dataset.create("words", schema: {...})'
+ puts "\nUse:"
+ puts ' dataset = Lancelot::Dataset.open_or_create("words", schema: {...})'
+ puts "\nThis way your code works whether the dataset exists or not!"
data/examples/optional_fields_demo.rb ADDED
@@ -0,0 +1,83 @@
+ #!/usr/bin/env ruby
+ # This example demonstrates optional field support in lancelot
+ # After the fix in conversion.rs, documents can have missing fields
+
+ require 'bundler/setup'
+ require 'lancelot'
+ require 'fileutils'
+
+ dataset_path = "example_optional_fields"
+ FileUtils.rm_rf(dataset_path)
+
+ puts "="*60
+ puts "Lancelot Optional Fields Demo"
+ puts "="*60
+
+ # Step 1: Create dataset with initial schema
+ puts "\n1. Creating dataset with 3 fields (id, text, score)..."
+ schema = {
+   id: :string,
+   text: :string,
+   score: :float32
+ }
+ dataset = Lancelot::Dataset.create(dataset_path, schema: schema)
+
+ # Add initial documents
+ initial_docs = [
+   { id: "1", text: "First document", score: 0.9 },
+   { id: "2", text: "Second document", score: 0.8 }
+ ]
+ dataset.add_documents(initial_docs)
+ puts " Added #{dataset.count} documents"
+
+ # Step 2: Simulate schema evolution (adding a new field)
+ puts "\n2. Simulating schema evolution (adding 'category' field)..."
+
+ # Get existing data
+ all_docs = dataset.to_a
+
+ # Recreate with expanded schema
+ FileUtils.rm_rf(dataset_path)
+ expanded_schema = {
+   id: :string,
+   text: :string,
+   score: :float32,
+   category: :string # NEW FIELD
+ }
+ dataset = Lancelot::Dataset.create(dataset_path, schema: expanded_schema)
+
+ # Re-add existing docs with the new field
+ docs_with_category = all_docs.map { |doc| doc.merge(category: "original") }
+ dataset.add_documents(docs_with_category)
+ puts " Recreated dataset with expanded schema"
+
+ # Step 3: Add new documents WITHOUT the new field
+ puts "\n3. Adding new documents WITHOUT the 'category' field..."
+ new_docs = [
+   { id: "3", text: "Third document", score: 0.7 }, # No category!
+   { id: "4", text: "Fourth document", score: 0.6 } # No category!
+ ]
+
+ begin
+   dataset.add_documents(new_docs)
+   puts " ✅ SUCCESS! Added #{new_docs.size} documents with missing fields"
+ rescue => e
+   puts " ❌ FAILED: #{e.message}"
+   puts " (This would have failed before the fix in conversion.rs)"
+ end
+
+ # Step 4: Verify the data
+ puts "\n4. Verifying all documents..."
+ dataset.to_a.each do |doc|
+   category = doc[:category] || "nil"
+   puts " Doc #{doc[:id]}: category=#{category}"
+ end
+
+ puts "\nTotal documents: #{dataset.count}"
+
+ # Cleanup
+ FileUtils.rm_rf(dataset_path)
+
+ puts "\n" + "="*60
+ puts "Demo complete! Optional fields work correctly."
+ puts "="*60
data/ext/lancelot/src/conversion.rs CHANGED
@@ -39,11 +39,13 @@ pub fn build_record_batch(
  let item = RHash::try_convert(item)?;
  for field in schema.fields() {
      let key = Symbol::new(field.name());
-     let value: Value = item.fetch(key)
-         .or_else(|_| {
+     // Make fields optional - use get instead of fetch
+     let value: Value = item.get(key)
+         .or_else(|| {
              // Try with string key
-             item.fetch(field.name().as_str())
-         })?;
+             item.get(field.name().as_str())
+         })
+         .unwrap_or_else(|| Ruby::get().unwrap().qnil().as_value());

      match field.data_type() {
          DataType::Utf8 => {
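Review note: for readers who know Ruby better than Rust, the change from `fetch` to `get` resembles moving from `Hash#fetch` (raises on a missing key) to a lookup that falls back to `nil`. The sketch below is a Ruby analogue only. `lookup_field` is a hypothetical helper, and the exact behavior of the Rust calls should be checked against the Magnus documentation.

```ruby
# Ruby analogue of the new lookup order: symbol key first, then string key,
# then nil when the document has neither.
def lookup_field(doc, name)
  return doc[name.to_sym] if doc.key?(name.to_sym)
  return doc[name.to_s] if doc.key?(name.to_s)

  nil
end

doc = { id: "3", "text" => "Third document", flag: false }

puts lookup_field(doc, :id).inspect        # symbol key
puts lookup_field(doc, :text).inspect      # falls back to the string key
puts lookup_field(doc, :category).inspect  # missing field
```

`key?` is used instead of `doc[:name] || doc["name"]` so that a stored `false` is returned as-is and not mistaken for a missing field.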
data/lib/lancelot/dataset.rb CHANGED
@@ -15,6 +15,14 @@ module Lancelot
          dataset
        end

+       def open_or_create(path, schema:)
+         if File.exist?(path)
+           open(path)
+         else
+           create(path, schema: schema)
+         end
+       end
+
        private

        def normalize_schema(schema)
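Review note: `open_or_create` branches on `File.exist?`, which only reports that something is present at the path, and the `schema:` argument is not passed to `open`. The sketch below uses only the Ruby standard library (no lancelot calls) to show which kinds of path would take the `open` branch. How `open` then behaves on a path that is not a Lance dataset is not visible in this diff.

```ruby
require "tmpdir"

checks = {}
Dir.mktmpdir do |root|
  missing   = File.join(root, "missing")
  empty_dir = File.join(root, "empty_dir")
  plain     = File.join(root, "notes.txt")
  Dir.mkdir(empty_dir)
  File.write(plain, "not a dataset")

  checks[:missing]   = File.exist?(missing)    # nothing there: the create branch runs
  checks[:empty_dir] = File.exist?(empty_dir)  # a directory holding no dataset: open branch
  checks[:plain]     = File.exist?(plain)      # a regular file: open branch
end

checks.each { |name, exists| puts "#{name}: #{exists ? 'open' : 'create'}" }
```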
@@ -106,6 +114,38 @@ module Lancelot
      end
    end

+   def hybrid_search(query, vector_column: "vector", text_column: nil, text_columns: nil,
+                     vector: nil, limit: 10, rrf_k: 60)
+     require 'lancelot/rank_fusion'
+
+     result_lists = []
+
+     # Perform vector search if vector is provided
+     if vector
+       unless vector.is_a?(Array)
+         raise ArgumentError, "Vector must be an array of numbers"
+       end
+
+       vector_results = vector_search(vector, column: vector_column, limit: limit * 2)
+       result_lists << vector_results if vector_results.any?
+     end
+
+     # Perform text search if query is provided
+     if query && !query.empty?
+       text_results = text_search(query, column: text_column, columns: text_columns, limit: limit * 2)
+       result_lists << text_results if text_results.any?
+     end
+
+     # Return empty array if no searches were performed
+     return [] if result_lists.empty?
+
+     # Return single result list if only one search was performed
+     return result_lists.first[0...limit] if result_lists.size == 1
+
+     # Perform RRF fusion and limit results
+     Lancelot::RankFusion.reciprocal_rank_fusion(result_lists, k: rrf_k)[0...limit]
+   end
+
    def where(filter_expression, limit: nil)
      filter_scan(filter_expression.to_s, limit)
    end
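Review note: the tail of `hybrid_search` has two exits, and only the fused one adds `:rrf_score`. The sketch below reproduces that control flow in isolation. `finish_hybrid` and `tagging_fuse` are hypothetical stand-ins written for this note, and the placeholder fusion does not compute real RRF scores.

```ruby
# Simplified stand-in for the last lines of hybrid_search. `fuse` is a
# placeholder for Lancelot::RankFusion.reciprocal_rank_fusion.
def finish_hybrid(result_lists, limit:, fuse:)
  result_lists = result_lists.reject(&:empty?)  # empty searches are never added
  return [] if result_lists.empty?
  return result_lists.first[0...limit] if result_lists.size == 1  # passthrough

  fuse.call(result_lists)[0...limit]
end

# Placeholder fusion: tags every document so the two paths are easy to tell apart.
tagging_fuse = ->(lists) { lists.flatten.uniq.map { |doc| doc.merge(rrf_score: 0.0) } }

vector_hits = [{ text: "apple" }, { text: "orange" }]
text_hits   = [{ text: "apple" }]

text_only = finish_hybrid([[], text_hits], limit: 10, fuse: tagging_fuse)
fused     = finish_hybrid([vector_hits, text_hits], limit: 10, fuse: tagging_fuse)

puts "single list keys: #{text_only.first.keys.inspect}"
puts "fused keys:       #{fused.first.keys.inspect}"
```

Callers that read `doc[:rrf_score]` may therefore want to handle `nil` when only one of the two searches returns results.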
data/lib/lancelot/rank_fusion.rb ADDED
@@ -0,0 +1,84 @@
+ # frozen_string_literal: true
+
+ module Lancelot
+   module RankFusion
+     class << self
+       def reciprocal_rank_fusion(result_lists, k: 60)
+         return [] if result_lists.nil? || result_lists.empty?
+
+         # Validate inputs
+         result_lists = Array(result_lists)
+         validate_result_lists(result_lists)
+
+         return [] if result_lists.all?(&:empty?)
+
+         # Build document to rank mapping for each result list
+         doc_ranks = build_document_ranks(result_lists)
+
+         # Calculate RRF scores
+         rrf_scores = calculate_rrf_scores(doc_ranks, result_lists.size, k)
+
+         # Sort by RRF score descending and return results
+         rrf_scores.sort_by { |_, score| -score }.map do |doc, score|
+           doc.merge(rrf_score: score)
+         end
+       end
+
+       private
+
+       def validate_result_lists(result_lists)
+         result_lists.each_with_index do |list, i|
+           unless list.is_a?(Array)
+             raise ArgumentError, "Result list at index #{i} must be an Array, got #{list.class}"
+           end
+
+           list.each_with_index do |doc, j|
+             unless doc.is_a?(Hash)
+               raise ArgumentError, "Document at position #{j} in result list #{i} must be a Hash, got #{doc.class}"
+             end
+           end
+         end
+       end
+
+       def build_document_ranks(result_lists)
+         doc_ranks = {}
+
+         result_lists.each_with_index do |list, list_idx|
+           list.each_with_index do |doc, rank|
+             # Use the document content as the key (excluding metadata like distance/score)
+             doc_key = normalize_document(doc)
+             doc_ranks[doc_key] ||= {document: doc, ranks: {}}
+             doc_ranks[doc_key][:ranks][list_idx] = rank + 1 # 1-based ranking
+           end
+         end
+
+         doc_ranks
+       end
+
+       def calculate_rrf_scores(doc_ranks, num_lists, k)
+         doc_ranks.map do |doc_key, data|
+           score = 0.0
+
+           num_lists.times do |list_idx|
+             rank = data[:ranks][list_idx]
+             if rank
+               score += 1.0 / (k + rank)
+             else
+               # Document doesn't appear in this list, treat as infinite rank
+               # RRF score contribution is effectively 0
+             end
+           end
+
+           [data[:document], score]
+         end
+       end
+
+       def normalize_document(doc)
+         # Create a normalized version for comparison, excluding metadata fields
+         # that might differ between search types (like _distance, _score, etc.)
+         doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
+       end
+     end
+   end
+ end
+
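Review note: fusion identifies "the same document" by its content hash after dropping underscore-prefixed keys and `:rrf_score`. The standalone sketch below copies that rule into a hypothetical `content_key` helper and repeats the rank-grouping loop on made-up hits, to show that hits differing only in `_distance` or `_score` are merged and that the first-seen document is the one kept.

```ruby
# Same rule as normalize_document: drop keys starting with "_" and :rrf_score.
def content_key(doc)
  doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
end

from_vector = { text: "apple", _distance: 0.12 }
from_text   = { text: "apple", _score: 7.3 }
other       = { text: "orange", _distance: 0.40 }

# Group 1-based ranks by content key, keeping the first-seen document.
seen = {}
[[from_vector, other], [from_text]].each_with_index do |list, list_idx|
  list.each_with_index do |doc, rank|
    entry = (seen[content_key(doc)] ||= { document: doc, ranks: {} })
    entry[:ranks][list_idx] = rank + 1
  end
end

seen.each_value { |entry| puts "#{entry[:document][:text]}: ranks #{entry[:ranks]}" }
```

Two hits are merged only when every remaining key and value matches exactly, so a field that differs between search types without an underscore prefix would keep them apart.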
data/lib/lancelot/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Lancelot
-   VERSION = "0.1.1"
+   VERSION = "0.3.1"
  end
data/lib/lancelot.rb CHANGED
@@ -3,6 +3,7 @@
  require_relative "lancelot/version"
  require_relative "lancelot/lancelot"
  require_relative "lancelot/dataset"
+ require_relative "lancelot/rank_fusion"

  module Lancelot
    class Error < StandardError; end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: lancelot
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.3.1
  platform: ruby
  authors:
  - Chris Petersen
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-07-26 00:00:00.000000000 Z
+ date: 2025-08-10 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rb_sys
@@ -106,12 +106,15 @@ files:
  - ".rspec"
  - ".standard.yml"
  - CHANGELOG.md
+ - CLAUDE.md
  - CODE_OF_CONDUCT.md
  - LICENSE.txt
  - README.md
  - Rakefile
  - examples/basic_usage.rb
  - examples/full_text_search.rb
+ - examples/idempotent_create.rb
+ - examples/optional_fields_demo.rb
  - examples/red_candle_integration.rb
  - examples/vector_search.rb
  - ext/lancelot/.gitignore
@@ -123,6 +126,7 @@ files:
  - ext/lancelot/src/schema.rs
  - lib/lancelot.rb
  - lib/lancelot/dataset.rb
+ - lib/lancelot/rank_fusion.rb
  - lib/lancelot/version.rb
  - sig/lancelot.rbs
  homepage: https://github.com/cpetersen/lancelot