lancelot 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 9524a69ce1b7a5065839b41b5619585bc4e1643c79ec90cfc69fc82a3a8eed7e
-   data.tar.gz: e2c8fc98de615913297b7b06e909b8d85179cb2018e3a08c9b70f4bb2756d55f
+   metadata.gz: 7d1c9e53cfc1948e847b195b8d8860c6dba8a85813714b60a7d7747966533de3
+   data.tar.gz: 7c130c04c01a8eebce16755a1da082f4db6ec1f85ac2857d84943272c695eb89
  SHA512:
-   metadata.gz: 043fcea4635008454b61999055be80b21c0ddb7c67462e0f90d242f7b81395068ff7ed1cabeabf1f63f7c1823593f2405fca083c3aa9330c7ac7a7b1f09ae4c7
-   data.tar.gz: 4efedcef3af02685c97531162aa6adea5beb3084759033a0b32bdf535663aa12c4246cb9d67649bd836f04bf317e59101326e17c8a8474ace059908d0b36074d
+   metadata.gz: fb1a939d21c8935cec9fb9eb0ba1ddbef7ab9a8edd380149e0026db00ce67167609c99ed98597ccd692baa59f046dcd041fc76f653a5a635c152b82c343310d4
+   data.tar.gz: d35819fbfc3c2dec2b2cf65236b2bbf74eaef9f501d32f2a0a0ca7724ffc4bc1bdef56222cb50a3390d59d91e6e492a1170114b2762cebac38bcf4529bb27925
data/CLAUDE.md ADDED
@@ -0,0 +1,176 @@
+ # Lancelot - Project Instructions
+
+ You are working on lancelot, a Ruby gem that provides native bindings to the Lance columnar data format through Rust and Magnus.
+
+ ## Project Context
+
+ Lancelot is part of a Ruby-native NLP/ML ecosystem:
+ - **lance**: Rust crate providing columnar storage with vector and text search
+ - **lancelot**: THIS PROJECT - Ruby bindings to Lance via Magnus
+ - **red-candle**: Ruby gem providing LLMs, embeddings, and rerankers
+ - **candle-agent**: Future agent capabilities for the ecosystem
+
+ ## Core Architecture
+
+ ### Ruby-Rust Bridge
+ - Uses Magnus 0.7 for Ruby-Rust interop
+ - RefCell pattern for interior mutability
+ - Embedded Tokio runtime for async Lance operations
+ - Clean separation between Ruby API and Rust implementation
+
+ ### Key Components
+ - `lib/lancelot/dataset.rb`: Ruby-idiomatic API layer
+ - `ext/lancelot/src/dataset.rs`: Core Lance operations
+ - `ext/lancelot/src/schema.rs`: Schema building and type mapping
+ - `ext/lancelot/src/conversion.rs`: Ruby-Rust data conversion
+
+ ## Design Principles
+
+ 1. **Ruby-First API**: Make it feel native to Ruby developers
+    - Use symbols for type names (`:string`, `:vector`)
+    - Support operator overloading (`<<` for append)
+    - Include Enumerable for iteration
+    - Return Ruby hashes/arrays, not foreign objects
+
+ 2. **Schema-First Design**: Lance requires schemas at creation
+    - Clear schema definition API
+    - Helpful error messages for type mismatches
+    - Support all Arrow types Lance supports
+
+ 3. **Performance Without Complexity**: Hide async/columnar details
+    - Embed Tokio runtime, don't expose it
+    - Convert RecordBatches to Ruby transparently
+    - Efficient batch operations when possible
+
+ 4. **Error Handling**: Rust errors become Ruby exceptions
+    - Use Magnus's error conversion
+    - Provide context in error messages
+    - Never panic in Rust code
+
+ ## Current Features
+
+ ### Implemented
+ - Dataset creation with schema
+ - CRUD operations (add, update, delete, get)
+ - Vector search with ANN indices
+ - Full-text search with inverted indices
+ - Multi-column text search
+ - SQL-like filtering
+ - Enumerable support
+
+ ### Not Yet Implemented
+ - Schema evolution
+ - Hybrid search (vector + text fusion)
+ - Streaming operations
+ - Transaction support
+ - Data versioning/time travel
+
+ ## Development Guidelines
+
+ ### Adding New Features
+
+ 1. **Start with the Ruby API**: Design how it should feel in Ruby first
+ 2. **Implement in Rust**: Keep the Rust layer focused on Lance operations
+ 3. **Type Conversion**: Use conversion.rs patterns for new types
+ 4. **Testing**: Add Ruby tests in spec/ and Rust tests in src/
+
+ ### Magnus Best Practices
+
+ 1. **Memory Management**:
+    ```rust
+    #[magnus::wrap(class = "Lancelot::Dataset", free_immediately, size)]
+    ```
+
+ 2. **Method Definition**:
+    ```rust
+    class.define_method("method_name", method!(LancelotDataset::method_name, arity))?;
+    ```
+
+ 3. **Error Handling**:
+    ```rust
+    pub fn operation(&self) -> Result<Value, Error> {
+        self.with_dataset(|dataset| {
+            // Lance operation
+        }).map_err(|e| Error::new(exception::runtime_error(), e.to_string()))
+    }
+    ```
+
+ 4. **Async Operations**:
+    ```rust
+    self.with_runtime(|runtime| {
+        runtime.block_on(async {
+            // Async Lance operation
+        })
+    })
+    ```
+
+ ### Type Mappings
+
+ Ruby → Arrow/Lance:
+ - `:string` → Utf8
+ - `:integer` → Int64
+ - `:float` → Float64
+ - `:vector` → FixedSizeList with dimension
+ - `:boolean` → Bool
+ - `:date` → Date32
+ - `:datetime` → Timestamp
+
+ ### Testing
+
+ - Ruby specs use RSpec
+ - Test both successful operations and error cases
+ - Use temporary directories for test datasets
+ - Clean up resources in after blocks
+
+ ## Common Tasks
+
+ ### Adding a New Search Method
+ 1. Define Ruby API in dataset.rb
+ 2. Add Rust implementation in dataset.rs
+ 3. Handle type conversion if needed
+ 4. Add index support if applicable
+ 5. Write comprehensive tests
+
+ ### Exposing Lance Features
+ 1. Check Lance API documentation
+ 2. Design Ruby-idiomatic wrapper
+ 3. Consider if it needs async handling
+ 4. Implement with proper error handling
+ 5. Document with YARD comments
+
+ ## Integration Points
+
+ ### With Red-Candle
+ - Lancelot stores embeddings from red-candle
+ - Vector dimensions must match model output
+ - Consider batch operations for efficiency
+
+ ### Future: Hybrid Search
+ - Will combine vector and text search results
+ - RRF (Reciprocal Rank Fusion) planned
+ - Consider implementing in Ruby first
+
+ ## Performance Considerations
+
+ 1. **Batch Operations**: Always prefer batch over individual ops
+ 2. **Index Building**: Build indices after bulk loading
+ 3. **Memory Usage**: Lance uses memory mapping efficiently
+ 4. **Ruby GC**: Use free_immediately for deterministic cleanup
+
+ ## Debugging Tips
+
+ 1. **Rust Panics**: Use `RUST_BACKTRACE=1` for stack traces
+ 2. **Lance Logs**: Set `RUST_LOG=lance=debug`
+ 3. **Ruby-Rust Bridge**: Check type conversions first
+ 4. **Async Issues**: Ensure operations run on the runtime
+
+ ## Release Process
+
+ 1. Update version.rb
+ 2. Run full test suite
+ 3. Build gem locally and test
+ 4. Update CHANGELOG.md
+ 5. Tag release and push
+ 6. `rake release` to publish
+
+ Remember: The goal is to make Lance's power accessible to Ruby developers without them needing to understand columnar formats, Rust, or async programming.
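
Note on the "Type Mappings" list in CLAUDE.md: the symbol-to-Arrow lookup it describes, combined with the "helpful error messages for type mismatches" principle, might look roughly like the plain-Ruby sketch below. This is an illustration under stated assumptions, not the gem's code: `ARROW_TYPE_NAMES` and `arrow_type_for` are hypothetical names, and according to the Key Components list the actual mapping lives in `ext/lancelot/src/schema.rs`.

```ruby
# Hypothetical lookup table mirroring the "Type Mappings" list in CLAUDE.md.
# It is not taken from lancelot's source.
ARROW_TYPE_NAMES = {
  string: "Utf8",
  integer: "Int64",
  float: "Float64",
  vector: "FixedSizeList",
  boolean: "Bool",
  date: "Date32",
  datetime: "Timestamp"
}.freeze

# Hypothetical helper: resolve one schema field to an Arrow type name.
# Raises a descriptive ArgumentError for an unknown type symbol or for a
# :vector field that has no positive integer dimension.
def arrow_type_for(field_name, type, dimension: nil)
  arrow = ARROW_TYPE_NAMES.fetch(type.to_sym) do
    raise ArgumentError,
      "Unknown type #{type.inspect} for field #{field_name.inspect}; " \
      "expected one of #{ARROW_TYPE_NAMES.keys.inspect}"
  end
  return arrow unless type.to_sym == :vector

  unless dimension.is_a?(Integer) && dimension.positive?
    raise ArgumentError,
      "Field #{field_name.inspect} is a :vector and needs a positive integer dimension"
  end
  "#{arrow}[#{dimension}]"
end

title_type = arrow_type_for(:title, :string)
embedding_type = arrow_type_for(:embedding, :vector, dimension: 3)
```

Naming the offending field and listing the accepted symbols in the message is one way to meet the "provide context in error messages" guideline on the Ruby side, before a value ever reaches the Rust layer.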
data/README.md CHANGED
@@ -9,15 +9,16 @@ Ruby bindings for [Lance](https://github.com/lancedb/lance), a modern columnar d
  - **Data Storage**: Add documents to datasets
  - **Document Retrieval**: Read documents from datasets with enumerable support
  - **Vector Search**: Create vector indices and perform similarity search
+ - **Full-Text Search**: Built-in full-text search with inverted indices
+ - **Hybrid Search**: Combine text and vector search with Reciprocal Rank Fusion (RRF)
  - **Schema Support**: Define schemas with string, float32, and vector types
  - **Row Counting**: Get the number of rows in a dataset
 
  ### Planned
 
- - **Full-Text Search**: Built-in full-text search capabilities
- - **Hybrid Search**: Combine text and vector search with RRF and other fusion methods
  - **Multimodal Support**: Store and search across different data types beyond text and vectors
  - **Schema Evolution**: Add new columns to existing datasets without rewriting data
+ - **Additional Fusion Methods**: Support for other fusion algorithms beyond RRF
 
  ## Installation
 
@@ -121,10 +122,109 @@ results = dataset.text_search("programming", columns: ["title", "content", "tags
 
  **Note**: Full-text search requires creating inverted indices first. For simple pattern matching without indices, use SQL-like filtering with `where`.
 
+ ### Hybrid Search with Reciprocal Rank Fusion (RRF)
+
+ Lancelot now supports hybrid search, combining vector and text search results using Reciprocal Rank Fusion:
+
+ ```ruby
+ # Example 1: Using the same query for both vector and text search
+ # First, let's assume we have a function that converts text to embeddings
+ def text_to_embedding(text)
+   # Your embedding model here (e.g., using red-candle or another embedding service)
+   # Returns a vector representation of the text
+ end
+
+ # Search using both modalities with the same query
+ query = "machine learning frameworks"
+ query_embedding = text_to_embedding(query)
+
+ results = dataset.hybrid_search(
+   query,                       # Text query
+   vector: query_embedding,     # Vector query (same content, embedded)
+   vector_column: "embedding",  # Vector column to search
+   text_column: "content",      # Text column to search
+   limit: 10
+ )
+
+ # Results are fused using RRF and include an rrf_score
+ results.each do |doc|
+   puts "#{doc[:title]} - RRF Score: #{doc[:rrf_score]}"
+ end
+ ```
+
+ ```ruby
+ # Example 2: Multiple queries across different modalities
+ # You can use different queries for vector and text search
+
+ # Semantic vector search for conceptually similar content
+ concept_embedding = text_to_embedding("deep learning neural networks")
+
+ # Keyword text search for specific terms
+ keyword_query = "PyTorch TensorFlow"
+
+ results = dataset.hybrid_search(
+   keyword_query,               # Specific keyword search
+   vector: concept_embedding,   # Broader semantic search
+   vector_column: "embedding",
+   text_column: "content",
+   limit: 20
+ )
+ ```
+
+ ```ruby
+ # Example 3: Multi-column text search with vector search
+ # Search across multiple text columns while also doing vector similarity
+
+ results = dataset.hybrid_search(
+   "ruby programming",
+   vector: text_to_embedding("object-oriented scripting language"),
+   vector_column: "embedding",
+   text_columns: ["title", "content", "tags"],  # Search multiple text columns
+   limit: 15
+ )
+ ```
+
+ ```ruby
+ # Example 4: Advanced RRF with custom k parameter
+ # The k parameter (default 60) controls the fusion behavior
+ # Lower k values give more weight to top-ranked results
+
+ results = dataset.hybrid_search(
+   "distributed systems",
+   vector: text_to_embedding("distributed systems"),
+   vector_column: "embedding",
+   text_column: "content",
+   limit: 10,
+   rrf_k: 30  # More aggressive fusion, emphasizes top results
+ )
+ ```
+
+ ```ruby
+ # Example 5: Using RankFusion module directly for custom fusion
+ # Useful when you want to combine results from multiple separate searches
+ require 'lancelot/rank_fusion'
+
+ # Perform multiple searches with different queries
+ vector_results1 = dataset.vector_search(embedding1, column: "embedding", limit: 20)
+ vector_results2 = dataset.vector_search(embedding2, column: "embedding", limit: 20)
+ text_results1 = dataset.text_search("machine learning", column: "content", limit: 20)
+ text_results2 = dataset.text_search("neural networks", column: "title", limit: 20)
+
+ # Fuse all results using RRF
+ fused_results = Lancelot::RankFusion.reciprocal_rank_fusion(
+   [vector_results1, vector_results2, text_results1, text_results2],
+   k: 60
+ )
+
+ # Take top 10 fused results
+ top_results = fused_results.first(10)
+ ```
+
+ **RRF Algorithm**: Reciprocal Rank Fusion calculates scores as `Σ(1/(k+rank))` across all result lists, where k=60 by default. Documents appearing in multiple result lists with high ranks get higher RRF scores.
+
  **Current Limitations:**
  - Schema must be defined when creating a dataset
  - Schema evolution is not yet implemented (Lance supports it, but our bindings don't expose it yet)
- - Hybrid search (RRF) is not yet implemented
  - Supported field types: string, float32, float64, int32, int64, boolean, and fixed-size vectors
 
  **Note on Lance's Schema Flexibility:**
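
Note on the README's RRF formula: the score `Σ(1/(k+rank))` and the claim in Example 4 that a lower `k` favors top-ranked results can be checked with a little arithmetic. The sketch below is standalone and does not call the gem; the document names, their ranks, and the helpers `rrf_score` and `top_vs_tenth` are made up for illustration.

```ruby
# 1-based ranks of three hypothetical documents in [vector list, text list].
# nil means the document is absent from that list and contributes nothing.
ranks = {
  "doc_a" => [1, 3],
  "doc_b" => [2, nil],
  "doc_c" => [nil, 1]
}

# RRF score as stated in the README: sum of 1 / (k + rank) over all lists.
def rrf_score(doc_ranks, k: 60)
  doc_ranks.compact.sum { |rank| 1.0 / (k + rank) }
end

scores = ranks.transform_values { |r| rrf_score(r) }
ordered = scores.sort_by { |_, score| -score }.map(&:first)

# How much more a rank-1 hit contributes than a rank-10 hit for a given k.
def top_vs_tenth(k)
  (1.0 / (k + 1)) / (1.0 / (k + 10))
end

ratio_k60 = top_vs_tenth(60) # 70/61, roughly 1.15
ratio_k30 = top_vs_tenth(30) # 40/31, roughly 1.29
```

Here `doc_a` comes first because it appears in both lists (1/61 + 1/63), ahead of `doc_c` (1/61) and `doc_b` (1/62). The two ratios show the effect of `rrf_k`: with `k = 30` the gap between rank 1 and rank 10 is wider than with the default `k = 60`.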
data/lib/lancelot/dataset.rb CHANGED
@@ -106,6 +106,38 @@ module Lancelot
        end
      end
 
+     def hybrid_search(query, vector_column: "vector", text_column: nil, text_columns: nil,
+                       vector: nil, limit: 10, rrf_k: 60)
+       require 'lancelot/rank_fusion'
+
+       result_lists = []
+
+       # Perform vector search if vector is provided
+       if vector
+         unless vector.is_a?(Array)
+           raise ArgumentError, "Vector must be an array of numbers"
+         end
+
+         vector_results = vector_search(vector, column: vector_column, limit: limit * 2)
+         result_lists << vector_results if vector_results.any?
+       end
+
+       # Perform text search if query is provided
+       if query && !query.empty?
+         text_results = text_search(query, column: text_column, columns: text_columns, limit: limit * 2)
+         result_lists << text_results if text_results.any?
+       end
+
+       # Return empty array if no searches were performed
+       return [] if result_lists.empty?
+
+       # Return single result list if only one search was performed
+       return result_lists.first[0...limit] if result_lists.size == 1
+
+       # Perform RRF fusion and limit results
+       Lancelot::RankFusion.reciprocal_rank_fusion(result_lists, k: rrf_k)[0...limit]
+     end
+
      def where(filter_expression, limit: nil)
        filter_scan(filter_expression.to_s, limit)
      end
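
Note on the `hybrid_search` hunk: when only one of the two searches yields results, the method returns that list as-is and skips fusion, so those documents carry no `:rrf_score` key. The standalone sketch below mirrors just that branch logic so the behavior can be seen without a dataset. `combine` and `fuse` are hypothetical stand-ins (for the method body and for `Lancelot::RankFusion.reciprocal_rank_fusion` respectively), and the `{id: ...}` documents are invented.

```ruby
# Stand-in for the branching in hybrid_search. vector_hits and text_hits are
# pre-computed result lists; the block plays the role of the fusion step.
def combine(vector_hits, text_hits, limit: 10)
  lists = [vector_hits, text_hits].reject(&:empty?)
  return [] if lists.empty?
  return lists.first[0...limit] if lists.size == 1 # no fusion, no :rrf_score

  yield(lists)[0...limit]
end

# Simplified fusion used only for this demo: RRF with k = 60, keyed on the
# whole document hash.
fuse = lambda do |lists|
  scores = Hash.new(0.0)
  lists.each do |list|
    list.each_with_index { |doc, i| scores[doc] += 1.0 / (60 + i + 1) }
  end
  scores.sort_by { |_, score| -score }.map { |doc, score| doc.merge(rrf_score: score) }
end

only_text = combine([], [{id: 1}], limit: 5, &fuse)
both = combine([{id: 1}, {id: 2}], [{id: 2}], limit: 5, &fuse)
neither = combine([], [], limit: 5, &fuse)
```

In `only_text` the single document comes back untouched, whereas every entry of `both` has an `:rrf_score`. Code that prints `doc[:rrf_score]`, as README Example 1 does, would therefore see `nil` whenever one side of the hybrid query returns nothing.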
data/lib/lancelot/rank_fusion.rb ADDED
@@ -0,0 +1,84 @@
+ # frozen_string_literal: true
+
+ module Lancelot
+   module RankFusion
+     class << self
+       def reciprocal_rank_fusion(result_lists, k: 60)
+         return [] if result_lists.nil? || result_lists.empty?
+
+         # Validate inputs
+         result_lists = Array(result_lists)
+         validate_result_lists(result_lists)
+
+         return [] if result_lists.all?(&:empty?)
+
+         # Build document to rank mapping for each result list
+         doc_ranks = build_document_ranks(result_lists)
+
+         # Calculate RRF scores
+         rrf_scores = calculate_rrf_scores(doc_ranks, result_lists.size, k)
+
+         # Sort by RRF score descending and return results
+         rrf_scores.sort_by { |_, score| -score }.map do |doc, score|
+           doc.merge(rrf_score: score)
+         end
+       end
+
+       private
+
+       def validate_result_lists(result_lists)
+         result_lists.each_with_index do |list, i|
+           unless list.is_a?(Array)
+             raise ArgumentError, "Result list at index #{i} must be an Array, got #{list.class}"
+           end
+
+           list.each_with_index do |doc, j|
+             unless doc.is_a?(Hash)
+               raise ArgumentError, "Document at position #{j} in result list #{i} must be a Hash, got #{doc.class}"
+             end
+           end
+         end
+       end
+
+       def build_document_ranks(result_lists)
+         doc_ranks = {}
+
+         result_lists.each_with_index do |list, list_idx|
+           list.each_with_index do |doc, rank|
+             # Use the document content as the key (excluding metadata like distance/score)
+             doc_key = normalize_document(doc)
+             doc_ranks[doc_key] ||= {document: doc, ranks: {}}
+             doc_ranks[doc_key][:ranks][list_idx] = rank + 1 # 1-based ranking
+           end
+         end
+
+         doc_ranks
+       end
+
+       def calculate_rrf_scores(doc_ranks, num_lists, k)
+         doc_ranks.map do |doc_key, data|
+           score = 0.0
+
+           num_lists.times do |list_idx|
+             rank = data[:ranks][list_idx]
+             if rank
+               score += 1.0 / (k + rank)
+             else
+               # Document doesn't appear in this list, treat as infinite rank
+               # RRF score contribution is effectively 0
+             end
+           end
+
+           [data[:document], score]
+         end
+       end
+
+       def normalize_document(doc)
+         # Create a normalized version for comparison, excluding metadata fields
+         # that might differ between search types (like _distance, _score, etc.)
+         doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
+       end
+     end
+   end
+ end
+
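
Note on `normalize_document`: fusion decides that two hits are "the same document" by comparing every field except those whose key starts with `_` and the `:rrf_score` key. The standalone sketch below reproduces that rule outside the module to show its consequences. `identity_key` is a hypothetical name for the copied rule, and the sample documents are invented; `_distance` and `_score` are the metadata names mentioned in the source comment.

```ruby
# Same rule as normalize_document: drop keys that start with "_" and the
# :rrf_score symbol key before comparing documents.
def identity_key(doc)
  doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
end

vector_hit = {id: 7, title: "Intro to Lance", _distance: 0.12}
text_hit   = {id: 7, title: "Intro to Lance", _score: 4.5}
edited_hit = {id: 7, title: "Intro to Lance (draft)", _score: 3.9}

same_doc = identity_key(vector_hit) == identity_key(text_hit)
other_doc = identity_key(vector_hit) == identity_key(edited_hit)

# Ruby hashes compare by content when used as hash keys, so hits with equal
# identity keys fall into one bucket, as in build_document_ranks.
buckets = Hash.new { |hash, key| hash[key] = [] }
[vector_hit, text_hit, edited_hit].each { |doc| buckets[identity_key(doc)] << doc }
```

The first two hits share a bucket even though their metadata differs, while the third does not, despite having the same `id`. One consequence worth keeping in mind is that result lists fetched with different column selections would presumably not merge under this rule, since identity is the full set of non-metadata fields rather than a primary key.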
data/lib/lancelot/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Lancelot
-   VERSION = "0.1.1"
+   VERSION = "0.2.0"
  end
data/lib/lancelot.rb CHANGED
@@ -3,6 +3,7 @@
  require_relative "lancelot/version"
  require_relative "lancelot/lancelot"
  require_relative "lancelot/dataset"
+ require_relative "lancelot/rank_fusion"
 
  module Lancelot
    class Error < StandardError; end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: lancelot
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Chris Petersen
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-07-26 00:00:00.000000000 Z
+ date: 2025-07-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rb_sys
@@ -106,6 +106,7 @@ files:
  - ".rspec"
  - ".standard.yml"
  - CHANGELOG.md
+ - CLAUDE.md
  - CODE_OF_CONDUCT.md
  - LICENSE.txt
  - README.md
@@ -123,6 +124,7 @@ files:
  - ext/lancelot/src/schema.rs
  - lib/lancelot.rb
  - lib/lancelot/dataset.rb
+ - lib/lancelot/rank_fusion.rb
  - lib/lancelot/version.rb
  - sig/lancelot.rbs
  homepage: https://github.com/cpetersen/lancelot