RubyGems - lancelot - Versions diffs - 0.1.0 → 0.2.0 - Mend

lancelot 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 0e84754226b79241530dd10c90c67d97589e79a217ccbe636d836fa42cdcd450
-  data.tar.gz: fcbbe98ba47a8276d1e1f95a4cc34a84232f7886eef33d955d335eed36248b65
+  metadata.gz: 7d1c9e53cfc1948e847b195b8d8860c6dba8a85813714b60a7d7747966533de3
+  data.tar.gz: 7c130c04c01a8eebce16755a1da082f4db6ec1f85ac2857d84943272c695eb89
 SHA512:
-  metadata.gz: 5a4e28401f3167189d55cd9d543fa1a4d5ecff5f99423aecf371b6d9dbf0126c106a2b52bd67352a364d98c9edce66d4d3e3beea06bedfdc99bf92c065040fec
-  data.tar.gz: 939bdfd0e3ffaf0b21ccf88562bf4dac73656c5dc3fe511ed3b0696646e47d4a67a25b1d3d4b0c0d8e586a74514fdddb252b5e27616f90890f1a0abc8508b68b
+  metadata.gz: fb1a939d21c8935cec9fb9eb0ba1ddbef7ab9a8edd380149e0026db00ce67167609c99ed98597ccd692baa59f046dcd041fc76f653a5a635c152b82c343310d4
+  data.tar.gz: d35819fbfc3c2dec2b2cf65236b2bbf74eaef9f501d32f2a0a0ca7724ffc4bc1bdef56222cb50a3390d59d91e6e492a1170114b2762cebac38bcf4529bb27925

data/CLAUDE.md ADDED

@@ -0,0 +1,176 @@
+# Lancelot - Project Instructions
+You are working on lancelot, a Ruby gem that provides native bindings to the Lance columnar data format through Rust and Magnus.
+## Project Context
+Lancelot is part of a Ruby-native NLP/ML ecosystem:
+- **lance**: Rust crate providing columnar storage with vector and text search
+- **lancelot**: THIS PROJECT - Ruby bindings to Lance via Magnus
+- **red-candle**: Ruby gem providing LLMs, embeddings, and rerankers
+- **candle-agent**: Future agent capabilities for the ecosystem
+## Core Architecture
+### Ruby-Rust Bridge
+- Uses Magnus 0.7 for Ruby-Rust interop
+- RefCell pattern for interior mutability
+- Embedded Tokio runtime for async Lance operations
+- Clean separation between Ruby API and Rust implementation
+### Key Components
+- `lib/lancelot/dataset.rb`: Ruby-idiomatic API layer
+- `ext/lancelot/src/dataset.rs`: Core Lance operations
+- `ext/lancelot/src/schema.rs`: Schema building and type mapping
+- `ext/lancelot/src/conversion.rs`: Ruby-Rust data conversion
+## Design Principles
+1. **Ruby-First API**: Make it feel native to Ruby developers
+   - Use symbols for type names (`:string`, `:vector`)
+   - Support operator overloading (`<<` for append)
+   - Include Enumerable for iteration
+   - Return Ruby hashes/arrays, not foreign objects
+2. **Schema-First Design**: Lance requires schemas at creation
+   - Clear schema definition API
+   - Helpful error messages for type mismatches
+   - Support all Arrow types Lance supports
+3. **Performance Without Complexity**: Hide async/columnar details
+   - Embed Tokio runtime, don't expose it
+   - Convert RecordBatches to Ruby transparently
+   - Efficient batch operations when possible
+4. **Error Handling**: Rust errors become Ruby exceptions
+   - Use Magnus's error conversion
+   - Provide context in error messages
+   - Never panic in Rust code
+## Current Features
+### Implemented
+- Dataset creation with schema
+- CRUD operations (add, update, delete, get)
+- Vector search with ANN indices
+- Full-text search with inverted indices
+- Multi-column text search
+- SQL-like filtering
+- Enumerable support
+### Not Yet Implemented
+- Schema evolution
+- Hybrid search (vector + text fusion)
+- Streaming operations
+- Transaction support
+- Data versioning/time travel
+## Development Guidelines
+### Adding New Features
+1. **Start with the Ruby API**: Design how it should feel in Ruby first
+2. **Implement in Rust**: Keep the Rust layer focused on Lance operations
+3. **Type Conversion**: Use conversion.rs patterns for new types
+4. **Testing**: Add Ruby tests in spec/ and Rust tests in src/
+### Magnus Best Practices
+1. **Memory Management**:
+   ```rust
+   #[magnus::wrap(class = "Lancelot::Dataset", free_immediately, size)]
+   ```
+2. **Method Definition**:
+   ```rust
+   class.define_method("method_name", method!(LancelotDataset::method_name, arity))?;
+   ```
+3. **Error Handling**:
+   ```rust
+   pub fn operation(&self) -> Result<Value, Error> {
+       self.with_dataset(|dataset| {
+           // Lance operation
+       }).map_err(|e| Error::new(exception::runtime_error(), e.to_string()))
+   }
+   ```
+4. **Async Operations**:
+   ```rust
+   self.with_runtime(|runtime| {
+       runtime.block_on(async {
+           // Async Lance operation
+       })
+   })
+   ```
+### Type Mappings
+Ruby → Arrow/Lance:
+- `:string` → Utf8
+- `:integer` → Int64
+- `:float` → Float64
+- `:vector` → FixedSizeList with dimension
+- `:boolean` → Bool
+- `:date` → Date32
+- `:datetime` → Timestamp
+### Testing
+- Ruby specs use RSpec
+- Test both successful operations and error cases
+- Use temporary directories for test datasets
+- Clean up resources in after blocks
+## Common Tasks
+### Adding a New Search Method
+1. Define Ruby API in dataset.rb
+2. Add Rust implementation in dataset.rs
+3. Handle type conversion if needed
+4. Add index support if applicable
+5. Write comprehensive tests
+### Exposing Lance Features
+1. Check Lance API documentation
+2. Design Ruby-idiomatic wrapper
+3. Consider if it needs async handling
+4. Implement with proper error handling
+5. Document with YARD comments
+## Integration Points
+### With Red-Candle
+- Lancelot stores embeddings from red-candle
+- Vector dimensions must match model output
+- Consider batch operations for efficiency
+### Future: Hybrid Search
+- Will combine vector and text search results
+- RRF (Reciprocal Rank Fusion) planned
+- Consider implementing in Ruby first
+## Performance Considerations
+1. **Batch Operations**: Always prefer batch over individual ops
+2. **Index Building**: Build indices after bulk loading
+3. **Memory Usage**: Lance uses memory mapping efficiently
+4. **Ruby GC**: Use free_immediately for deterministic cleanup
+## Debugging Tips
+1. **Rust Panics**: Use `RUST_BACKTRACE=1` for stack traces
+2. **Lance Logs**: Set `RUST_LOG=lance=debug`
+3. **Ruby-Rust Bridge**: Check type conversions first
+4. **Async Issues**: Ensure operations run on the runtime
+## Release Process
+1. Update version.rb
+2. Run full test suite
+3. Build gem locally and test
+4. Update CHANGELOG.md
+5. Tag release and push
+6. `rake release` to publish
+Remember: The goal is to make Lance's power accessible to Ruby developers without them needing to understand columnar formats, Rust, or async programming.
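Note on the "Ruby-First API" principle in CLAUDE.md above: the `<<` and Enumerable conventions it names can be sketched with a toy in-memory class. `ToyDataset` is a hypothetical stand-in written for this note; it is not the gem's `Lancelot::Dataset`, which is backed by the native extension.

```ruby
# Toy in-memory stand-in (hypothetical, not the gem's Dataset class) showing
# the two conventions: `<<` for append and Enumerable for iteration.
class ToyDataset
  include Enumerable

  def initialize
    @rows = []
  end

  # Returning self lets appends chain: ds << a << b
  def <<(doc)
    @rows << doc
    self
  end

  # Enumerable only needs #each; map, select, count, etc. come for free.
  def each(&block)
    return enum_for(:each) unless block

    @rows.each(&block)
    self
  end
end

ds = ToyDataset.new
ds << { title: "Lance" } << { title: "Magnus" }
puts ds.map { |doc| doc[:title] }.inspect
puts ds.count
```

Rows come back as plain Ruby hashes, which matches the "Return Ruby hashes/arrays, not foreign objects" bullet.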

data/README.md CHANGED

@@ -9,15 +9,16 @@ Ruby bindings for [Lance](https://github.com/lancedb/lance), a modern columnar d
 - **Data Storage**: Add documents to datasets
 - **Document Retrieval**: Read documents from datasets with enumerable support
 - **Vector Search**: Create vector indices and perform similarity search
+- **Full-Text Search**: Built-in full-text search with inverted indices
+- **Hybrid Search**: Combine text and vector search with Reciprocal Rank Fusion (RRF)
 - **Schema Support**: Define schemas with string, float32, and vector types
 - **Row Counting**: Get the number of rows in a dataset
 ### Planned
-- **Full-Text Search**: Built-in full-text search capabilities
-- **Hybrid Search**: Combine text and vector search with RRF and other fusion methods
 - **Multimodal Support**: Store and search across different data types beyond text and vectors
 - **Schema Evolution**: Add new columns to existing datasets without rewriting data
+- **Additional Fusion Methods**: Support for other fusion algorithms beyond RRF
 ## Installation
@@ -121,10 +122,109 @@ results = dataset.text_search("programming", columns: ["title", "content", "tags
 **Note**: Full-text search requires creating inverted indices first. For simple pattern matching without indices, use SQL-like filtering with `where`.
+### Hybrid Search with Reciprocal Rank Fusion (RRF)
+Lancelot now supports hybrid search, combining vector and text search results using Reciprocal Rank Fusion:
+```ruby
+# Example 1: Using the same query for both vector and text search
+# First, let's assume we have a function that converts text to embeddings
+def text_to_embedding(text)
+  # Your embedding model here (e.g., using red-candle or another embedding service)
+  # Returns a vector representation of the text
+end
+# Search using both modalities with the same query
+query = "machine learning frameworks"
+query_embedding = text_to_embedding(query)
+results = dataset.hybrid_search(
+  query,                           # Text query
+  vector: query_embedding,         # Vector query (same content, embedded)
+  vector_column: "embedding",      # Vector column to search
+  text_column: "content",          # Text column to search
+  limit: 10
+)
+# Results are fused using RRF and include an rrf_score
+results.each do |doc|
+  puts "#{doc[:title]} - RRF Score: #{doc[:rrf_score]}"
+end
+```
+```ruby
+# Example 2: Multiple queries across different modalities
+# You can use different queries for vector and text search
+# Semantic vector search for conceptually similar content
+concept_embedding = text_to_embedding("deep learning neural networks")
+# Keyword text search for specific terms
+keyword_query = "PyTorch TensorFlow"
+results = dataset.hybrid_search(
+  keyword_query,                   # Specific keyword search
+  vector: concept_embedding,       # Broader semantic search
+  vector_column: "embedding",
+  text_column: "content",
+  limit: 20
+)
+```
+```ruby
+# Example 3: Multi-column text search with vector search
+# Search across multiple text columns while also doing vector similarity
+results = dataset.hybrid_search(
+  "ruby programming",
+  vector: text_to_embedding("object-oriented scripting language"),
+  vector_column: "embedding",
+  text_columns: ["title", "content", "tags"],  # Search multiple text columns
+  limit: 15
+)
+```
+```ruby
+# Example 4: Advanced RRF with custom k parameter
+# The k parameter (default 60) controls the fusion behavior
+# Lower k values give more weight to top-ranked results
+results = dataset.hybrid_search(
+  "distributed systems",
+  vector: text_to_embedding("distributed systems"),
+  vector_column: "embedding",
+  text_column: "content",
+  limit: 10,
+  rrf_k: 30  # More aggressive fusion, emphasizes top results
+)
+```
+```ruby
+# Example 5: Using RankFusion module directly for custom fusion
+# Useful when you want to combine results from multiple separate searches
+require 'lancelot/rank_fusion'
+# Perform multiple searches with different queries
+vector_results1 = dataset.vector_search(embedding1, column: "embedding", limit: 20)
+vector_results2 = dataset.vector_search(embedding2, column: "embedding", limit: 20)
+text_results1 = dataset.text_search("machine learning", column: "content", limit: 20)
+text_results2 = dataset.text_search("neural networks", column: "title", limit: 20)
+# Fuse all results using RRF
+fused_results = Lancelot::RankFusion.reciprocal_rank_fusion(
+  [vector_results1, vector_results2, text_results1, text_results2],
+  k: 60
+)
+# Take top 10 fused results
+top_results = fused_results.first(10)
+```
+**RRF Algorithm**: Reciprocal Rank Fusion calculates scores as `Σ(1/(k+rank))` across all result lists, where k=60 by default. Documents appearing in multiple result lists with high ranks get higher RRF scores.
 **Current Limitations:**
 - Schema must be defined when creating a dataset
 - Schema evolution is not yet implemented (Lance supports it, but our bindings don't expose it yet)
-- Hybrid search (RRF) is not yet implemented
 - Supported field types: string, float32, float64, int32, int64, boolean, and fixed-size vectors
 **Note on Lance's Schema Flexibility:**
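Note on the `Σ(1/(k+rank))` formula in the README hunk above: the sketch below works the arithmetic for a few hand-picked 1-based ranks. The `rrf_score` helper and the rank values are made up for illustration and are not part of the gem.

```ruby
# Illustrative helper (not part of lancelot): sum 1 / (k + rank) over the
# lists in which a document appears; ranks are 1-based.
def rrf_score(ranks, k: 60)
  ranks.sum { |rank| 1.0 / (k + rank) }
end

both_lists = rrf_score([1, 3])   # ranked 1st in one list, 3rd in the other
one_list   = rrf_score([1])      # ranked 1st, but only in one list

puts format("both lists: %.5f", both_lists)
puts format("one list:   %.5f", one_list)

# Lower k widens the gap between rank 1 and rank 10.
[60, 30].each do |k|
  ratio = rrf_score([1], k: k) / rrf_score([10], k: k)
  puts format("k=%d: rank 1 scores %.3fx rank 10", k, ratio)
end
```

A document found by both searches scores 1/61 + 1/63, roughly double a document found by only one. The rank-1 to rank-10 ratio is 70/61 at k=60 and 40/31 at k=30, which is the sense in which a lower `rrf_k` emphasizes top results.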

data/lib/lancelot/dataset.rb CHANGED

@@ -106,10 +106,56 @@ module Lancelot
       end
     end
+    def hybrid_search(query, vector_column: "vector", text_column: nil, text_columns: nil,
+                      vector: nil, limit: 10, rrf_k: 60)
+      require 'lancelot/rank_fusion'
+      result_lists = []
+      # Perform vector search if vector is provided
+      if vector
+        unless vector.is_a?(Array)
+          raise ArgumentError, "Vector must be an array of numbers"
+        end
+        vector_results = vector_search(vector, column: vector_column, limit: limit * 2)
+        result_lists << vector_results if vector_results.any?
+      end
+      # Perform text search if query is provided
+      if query && !query.empty?
+        text_results = text_search(query, column: text_column, columns: text_columns, limit: limit * 2)
+        result_lists << text_results if text_results.any?
+      end
+      # Return empty array if no searches were performed
+      return [] if result_lists.empty?
+      # Return single result list if only one search was performed
+      return result_lists.first[0...limit] if result_lists.size == 1
+      # Perform RRF fusion and limit results
+      Lancelot::RankFusion.reciprocal_rank_fusion(result_lists, k: rrf_k)[0...limit]
+    end
     def where(filter_expression, limit: nil)
       filter_scan(filter_expression.to_s, limit)
     end
+    def to_s
+      "#<Lancelot::Dataset path=\"#{path}\" count=#{count}>"
+    end
+    alias inspect to_s
+    def ==(other)
+      other.is_a?(Dataset) && other.path == path
+    end
+    alias eql? ==
+    def hash
+      path.hash
+    end
     private
     def normalize_document(doc)
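Note on the `==`, `eql?`, and `hash` additions in the dataset.rb hunk above: together they make two dataset objects with the same `path` interchangeable in `uniq` and as Hash keys. The sketch below copies only those three methods into a hypothetical `FakeDataset` class so it runs without the native extension; it assumes `path` returns the string the dataset was opened with.

```ruby
# Stand-in class (hypothetical) that copies only the equality methods from the
# diff; the real Lancelot::Dataset is backed by a native extension.
class FakeDataset
  attr_reader :path

  def initialize(path)
    @path = path
  end

  def ==(other)
    other.is_a?(FakeDataset) && other.path == path
  end
  alias eql? ==

  def hash
    path.hash
  end
end

a = FakeDataset.new("/tmp/docs.lance")
b = FakeDataset.new("/tmp/docs.lance")
c = FakeDataset.new("/tmp/other.lance")

puts a == b                 # same path, so equal
puts [a, b, c].uniq.size    # uniq relies on hash and eql?
cache = { a => :opened }
puts cache.key?(b)          # b finds the entry stored under a
```

Because the comparison is a plain string match on `path`, a relative and an absolute path to the same directory would not compare equal under this definition.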

data/lib/lancelot/rank_fusion.rb ADDED

@@ -0,0 +1,84 @@
+# frozen_string_literal: true
+module Lancelot
+  module RankFusion
+    class << self
+      def reciprocal_rank_fusion(result_lists, k: 60)
+        return [] if result_lists.nil? || result_lists.empty?
+        # Validate inputs
+        result_lists = Array(result_lists)
+        validate_result_lists(result_lists)
+        return [] if result_lists.all?(&:empty?)
+        # Build document to rank mapping for each result list
+        doc_ranks = build_document_ranks(result_lists)
+        # Calculate RRF scores
+        rrf_scores = calculate_rrf_scores(doc_ranks, result_lists.size, k)
+        # Sort by RRF score descending and return results
+        rrf_scores.sort_by { |_, score| -score }.map do |doc, score|
+          doc.merge(rrf_score: score)
+        end
+      end
+      private
+      def validate_result_lists(result_lists)
+        result_lists.each_with_index do |list, i|
+          unless list.is_a?(Array)
+            raise ArgumentError, "Result list at index #{i} must be an Array, got #{list.class}"
+          end
+          list.each_with_index do |doc, j|
+            unless doc.is_a?(Hash)
+              raise ArgumentError, "Document at position #{j} in result list #{i} must be a Hash, got #{doc.class}"
+            end
+          end
+        end
+      end
+      def build_document_ranks(result_lists)
+        doc_ranks = {}
+        result_lists.each_with_index do |list, list_idx|
+          list.each_with_index do |doc, rank|
+            # Use the document content as the key (excluding metadata like distance/score)
+            doc_key = normalize_document(doc)
+            doc_ranks[doc_key] ||= {document: doc, ranks: {}}
+            doc_ranks[doc_key][:ranks][list_idx] = rank + 1  # 1-based ranking
+          end
+        end
+        doc_ranks
+      end
+      def calculate_rrf_scores(doc_ranks, num_lists, k)
+        doc_ranks.map do |doc_key, data|
+          score = 0.0
+          num_lists.times do |list_idx|
+            rank = data[:ranks][list_idx]
+            if rank
+              score += 1.0 / (k + rank)
+            else
+              # Document doesn't appear in this list, treat as infinite rank
+              # RRF score contribution is effectively 0
+            end
+          end
+          [data[:document], score]
+        end
+      end
+      def normalize_document(doc)
+        # Create a normalized version for comparison, excluding metadata fields
+        # that might differ between search types (like _distance, _score, etc.)
+        doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
+      end
+    end
+  end
+end
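Note on document identity in rank_fusion.rb above: two hits count as the same document when they are equal after `normalize_document` strips underscore-prefixed keys and `:rrf_score`. The sketch below reproduces that one rule as a standalone `doc_key` helper; the field names (`id`, `title`, `_distance`, `_score`) are assumed for illustration.

```ruby
# Same rule as normalize_document in the diff: drop keys starting with "_"
# and the :rrf_score symbol key before comparing documents.
def doc_key(doc)
  doc.reject { |k, _| k.to_s.start_with?("_") || k == :rrf_score }
end

vector_hit = { id: 1, title: "Ruby", _distance: 0.12 }
text_hit   = { id: 1, title: "Ruby", _score: 3.4 }
other_hit  = { id: 2, title: "Rust", _score: 1.1 }

puts doc_key(vector_hit) == doc_key(text_hit)   # same document
puts doc_key(vector_hit) == doc_key(other_hit)  # different document

# Hash keys compare by content, so both hits for id 1 land in one bucket.
buckets = [vector_hit, text_hit, other_hit].group_by { |doc| doc_key(doc) }
puts buckets.size

# A string-keyed "rrf_score" is not stripped; only the symbol is.
puts doc_key({ "id" => 1, "rrf_score" => 0.03 }).key?("rrf_score")
```

Two consequences follow from the diffed code: because of the `||=` in `build_document_ranks`, the fused result keeps the metadata of whichever list mentioned the document first, and any non-underscore field that differs between searches makes the hits count as separate documents.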

data/lib/lancelot/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Lancelot
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

data/lib/lancelot.rb CHANGED

@@ -3,6 +3,7 @@
 require_relative "lancelot/version"
 require_relative "lancelot/lancelot"
 require_relative "lancelot/dataset"
+require_relative "lancelot/rank_fusion"
 module Lancelot
   class Error < StandardError; end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: lancelot
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Chris Petersen
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-07-26 00:00:00.000000000 Z
+date: 2025-07-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys
@@ -80,6 +80,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Lancelot provides a Ruby-native interface to Lance, enabling efficient
   storage and search of multimodal data including text, vectors, and more.
 email:
@@ -92,6 +106,7 @@ files:
 - ".rspec"
 - ".standard.yml"
 - CHANGELOG.md
+- CLAUDE.md
 - CODE_OF_CONDUCT.md
 - LICENSE.txt
 - README.md
@@ -109,6 +124,7 @@ files:
 - ext/lancelot/src/schema.rs
 - lib/lancelot.rb
 - lib/lancelot/dataset.rb
+- lib/lancelot/rank_fusion.rb
 - lib/lancelot/version.rb
 - sig/lancelot.rbs
 homepage: https://github.com/cpetersen/lancelot
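Note on the new `simplecov` entry in the metadata above: a development dependency with a `">= 0"` requirement is what RubyGems records when a gemspec declares the dependency without a version constraint. The gem's actual gemspec is not part of this diff, so the spec below is a hypothetical minimal one used only to show that behavior.

```ruby
require "rubygems"

# Hypothetical minimal spec; the gem's real gemspec is not shown in this diff.
spec = Gem::Specification.new do |s|
  s.name    = "example"
  s.version = "0.0.0"
  s.summary = "illustration only"
  s.authors = ["someone"]
  # No version constraint given, so RubyGems records the default requirement.
  s.add_development_dependency "simplecov"
end

dep = spec.dependencies.find { |d| d.name == "simplecov" }
puts dep.type
puts dep.requirement
```

Development dependencies are not installed for consumers of the gem, so this change should not affect runtime installs.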