lancelot 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 0e84754226b79241530dd10c90c67d97589e79a217ccbe636d836fa42cdcd450
+   data.tar.gz: fcbbe98ba47a8276d1e1f95a4cc34a84232f7886eef33d955d335eed36248b65
+ SHA512:
+   metadata.gz: 5a4e28401f3167189d55cd9d543fa1a4d5ecff5f99423aecf371b6d9dbf0126c106a2b52bd67352a364d98c9edce66d4d3e3beea06bedfdc99bf92c065040fec
+   data.tar.gz: 939bdfd0e3ffaf0b21ccf88562bf4dac73656c5dc3fe511ed3b0696646e47d4a67a25b1d3d4b0c0d8e586a74514fdddb252b5e27616f90890f1a0abc8508b68b
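Note on checksums.yaml: the digests above can be checked against the archive members. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the gem into one directory and the checksums file has been decompressed to a YAML string. The helper name `verify_checksums` is hypothetical and not part of RubyGems.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: compares each file's digest with the value recorded
# in checksums.yaml. Returns a report such as
# { "SHA256" => { "data.tar.gz" => true } }.
def verify_checksums(checksums_yaml, dir)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  expected = YAML.safe_load(checksums_yaml)

  expected.each_with_object({}) do |(algo, files), report|
    digest_class = algorithms.fetch(algo) # raises KeyError on an unknown algorithm
    report[algo] = files.to_h do |name, hex|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      [name, actual == hex.to_s]
    end
  end
end
```

A `false` entry in the report means the file on disk does not match the recorded digest.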
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.standard.yml ADDED
@@ -0,0 +1,3 @@
+ # For available configuration options, see:
+ #   https://github.com/standardrb/standard
+ ruby_version: 3.1
data/CHANGELOG.md ADDED
@@ -0,0 +1,18 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [Unreleased]
+
+ ### Added
+ - Initial gem structure with Magnus/Rust integration
+ - Basic Dataset class with create/open functionality
+ - Core data storage with Lance integration
+ - Schema definition support for string and float32 types
+ - Document addition with add_documents and << methods
+ - Row counting functionality
+ - Comprehensive test suite for Phase 2 features
+ - Project setup and configuration
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,132 @@
+ # Contributor Covenant Code of Conduct
+
+ ## Our Pledge
+
+ We as members, contributors, and leaders pledge to make participation in our
+ community a harassment-free experience for everyone, regardless of age, body
+ size, visible or invisible disability, ethnicity, sex characteristics, gender
+ identity and expression, level of experience, education, socio-economic status,
+ nationality, personal appearance, race, caste, color, religion, or sexual
+ identity and orientation.
+
+ We pledge to act and interact in ways that contribute to an open, welcoming,
+ diverse, inclusive, and healthy community.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to a positive environment for our
+ community include:
+
+ * Demonstrating empathy and kindness toward other people
+ * Being respectful of differing opinions, viewpoints, and experiences
+ * Giving and gracefully accepting constructive feedback
+ * Accepting responsibility and apologizing to those affected by our mistakes,
+   and learning from the experience
+ * Focusing on what is best not just for us as individuals, but for the overall
+   community
+
+ Examples of unacceptable behavior include:
+
+ * The use of sexualized language or imagery, and sexual attention or advances of
+   any kind
+ * Trolling, insulting or derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or email address,
+   without their explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+   professional setting
+
+ ## Enforcement Responsibilities
+
+ Community leaders are responsible for clarifying and enforcing our standards of
+ acceptable behavior and will take appropriate and fair corrective action in
+ response to any behavior that they deem inappropriate, threatening, offensive,
+ or harmful.
+
+ Community leaders have the right and responsibility to remove, edit, or reject
+ comments, commits, code, wiki edits, issues, and other contributions that are
+ not aligned to this Code of Conduct, and will communicate reasons for moderation
+ decisions when appropriate.
+
+ ## Scope
+
+ This Code of Conduct applies within all community spaces, and also applies when
+ an individual is officially representing the community in public spaces.
+ Examples of representing our community include using an official email address,
+ posting via an official social media account, or acting as an appointed
+ representative at an online or offline event.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported to the community leaders responsible for enforcement at
+ [INSERT CONTACT METHOD].
+ All complaints will be reviewed and investigated promptly and fairly.
+
+ All community leaders are obligated to respect the privacy and security of the
+ reporter of any incident.
+
+ ## Enforcement Guidelines
+
+ Community leaders will follow these Community Impact Guidelines in determining
+ the consequences for any action they deem in violation of this Code of Conduct:
+
+ ### 1. Correction
+
+ **Community Impact**: Use of inappropriate language or other behavior deemed
+ unprofessional or unwelcome in the community.
+
+ **Consequence**: A private, written warning from community leaders, providing
+ clarity around the nature of the violation and an explanation of why the
+ behavior was inappropriate. A public apology may be requested.
+
+ ### 2. Warning
+
+ **Community Impact**: A violation through a single incident or series of
+ actions.
+
+ **Consequence**: A warning with consequences for continued behavior. No
+ interaction with the people involved, including unsolicited interaction with
+ those enforcing the Code of Conduct, for a specified period of time. This
+ includes avoiding interactions in community spaces as well as external channels
+ like social media. Violating these terms may lead to a temporary or permanent
+ ban.
+
+ ### 3. Temporary Ban
+
+ **Community Impact**: A serious violation of community standards, including
+ sustained inappropriate behavior.
+
+ **Consequence**: A temporary ban from any sort of interaction or public
+ communication with the community for a specified period of time. No public or
+ private interaction with the people involved, including unsolicited interaction
+ with those enforcing the Code of Conduct, is allowed during this period.
+ Violating these terms may lead to a permanent ban.
+
+ ### 4. Permanent Ban
+
+ **Community Impact**: Demonstrating a pattern of violation of community
+ standards, including sustained inappropriate behavior, harassment of an
+ individual, or aggression toward or disparagement of classes of individuals.
+
+ **Consequence**: A permanent ban from any sort of public interaction within the
+ community.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+ version 2.1, available at
+ [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+ Community Impact Guidelines were inspired by
+ [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
+
+ For answers to common questions about this code of conduct, see the FAQ at
+ [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
+ [https://www.contributor-covenant.org/translations][translations].
+
+ [homepage]: https://www.contributor-covenant.org
+ [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+ [Mozilla CoC]: https://github.com/mozilla/diversity
+ [FAQ]: https://www.contributor-covenant.org/faq
+ [translations]: https://www.contributor-covenant.org/translations
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2025 Chris Petersen
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,152 @@
+ # Lancelot
+
+ Ruby bindings for [Lance](https://github.com/lancedb/lance), a modern columnar data format for ML. Lancelot provides a Ruby-native interface to Lance, enabling efficient storage and search of multimodal data including text, vectors, and more.
+
+ ## Features
+
+ ### Implemented
+ - **Dataset Creation**: Create Lance datasets with schemas
+ - **Data Storage**: Add documents to datasets
+ - **Document Retrieval**: Read documents from datasets with enumerable support
+ - **Vector Search**: Create vector indices and perform similarity search
+ - **Schema Support**: Define schemas with string, float32, float64, int32, int64, boolean, and fixed-size vector types
+ - **Row Counting**: Get the number of rows in a dataset
+ - **Full-Text Search**: Create inverted text indices and search one or more columns
+
+ ### Planned
+
+ - **Hybrid Search**: Combine text and vector search with RRF and other fusion methods
+ - **Multimodal Support**: Store and search across different data types beyond text and vectors
+ - **Schema Evolution**: Add new columns to existing datasets without rewriting data
+
+ ## Installation
+
+ Install the gem and add it to the application's Gemfile by executing:
+
+ ```bash
+ bundle add lancelot
+ ```
+
+ If bundler is not being used to manage dependencies, install the gem by executing:
+
+ ```bash
+ gem install lancelot
+ ```
+
+ ## Usage
+
+ ```ruby
+ require 'lancelot'
+
+ # Create a dataset with a schema including vectors
+ dataset = Lancelot::Dataset.create("path/to/dataset", schema: {
+   text: :string,
+   score: :float32,
+   embedding: { type: "vector", dimension: 128 }
+ })
+
+ # Add documents with embeddings
+ dataset.add_documents([
+   { text: "Ruby is a dynamic programming language", score: 0.95, embedding: [0.1, 0.2, ...] },
+   { text: "Python is great for data science", score: 0.88, embedding: [0.2, 0.3, ...] }
+ ])
+
+ # Or use the << operator
+ dataset << { text: "JavaScript runs everywhere", score: 0.92, embedding: [0.3, 0.4, ...] }
+
+ # Open an existing dataset
+ dataset = Lancelot::Dataset.open("path/to/dataset")
+
+ # Get the count
+ puts dataset.count # => 3
+
+ # Get the schema
+ puts dataset.schema # => { text: "string", score: "float32", ... }
+
+ # Retrieve documents
+ dataset.all # => Returns all documents
+ dataset.first # => Returns first document
+ dataset.first(2) # => Returns first 2 documents
+
+ # Enumerable support
+ dataset.each { |doc| puts doc[:text] }
+ dataset.map { |doc| doc[:score] }
+ dataset.select { |doc| doc[:score] > 0.9 }
+
+ # Vector search
+ dataset.create_vector_index("embedding") # Create vector index
+ results = dataset.vector_search([0.15, 0.25, ...], column: "embedding", limit: 5) # Find 5 nearest neighbors
+
+ # Or use the nearest_neighbors alias
+ similar = dataset.nearest_neighbors([0.1, 0.2, ...], k: 10, column: "embedding")
+
+ # Full-text search with inverted indices (assumes a dataset with title, content, and tags columns)
+ # First create text indices on the columns you want to search
+ dataset.create_text_index("title")
+ dataset.create_text_index("content")
+ dataset.create_text_index("tags")
+
+ # Single column search
+ results = dataset.text_search("ruby programming", column: "content", limit: 10)
+
+ # Multi-column search
+ results = dataset.text_search("machine learning", columns: ["title", "content"], limit: 10)
+
+ # SQL-like filtering (uses Lance's SQL engine, not full-text indices)
+ results = dataset.where("score > 0.9")
+ results = dataset.where("category = 'tutorial' AND year >= 2023", limit: 5) # assumes category and year columns
+ ```
+
+ ### Full-Text Search
+
+ Lancelot supports Lance's full-text search capabilities with inverted indices:
+
+ ```ruby
+ # Create indices before searching
+ dataset.create_text_index("title")
+ dataset.create_text_index("content")
+
+ # Search a single column
+ results = dataset.text_search("ruby", column: "title")
+
+ # Search multiple columns (returns union of results)
+ results = dataset.text_search("programming", columns: ["title", "content", "tags"])
+
+ # The underlying Lance engine provides:
+ # - BM25 scoring for relevance ranking
+ # - Tokenization with language support
+ # - Case-insensitive search
+ # - Multi-word queries
+ ```
+
+ **Note**: Full-text search requires creating inverted indices first. For simple pattern matching without indices, use SQL-like filtering with `where`.
+
+ **Current Limitations:**
+ - Schema must be defined when creating a dataset
+ - Schema evolution is not yet implemented (Lance supports it, but our bindings don't expose it yet)
+ - Hybrid search (RRF) is not yet implemented
+ - Supported field types: string, float32, float64, int32, int64, boolean, and fixed-size vectors
+
+ **Note on Lance's Schema Flexibility:**
+ Lance itself supports schema evolution - you can add new columns without rewriting data. However, our current Ruby bindings have simplified this and require an upfront schema. This will be improved in future releases to expose Lance's full flexibility.
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To compile the Rust extension:
+ ```bash
+ bundle exec rake compile
+ ```
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/lancelot. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/cpetersen/lancelot/blob/main/CODE_OF_CONDUCT.md).
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+
+ ## Code of Conduct
+
+ Everyone interacting in the Lancelot project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/cpetersen/lancelot/blob/main/CODE_OF_CONDUCT.md).
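Note on the README's planned hybrid search: the README lists RRF as not yet implemented. As a rough illustration of what Reciprocal Rank Fusion does, here is a plain-Ruby sketch that fuses two ranked lists of document ids. The function name `reciprocal_rank_fusion` and the default `k: 60` are assumptions for illustration; this is not Lancelot or Lance code.

```ruby
# Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank)
# over every ranked list it appears in, with rank starting at 1.
# Hypothetical helper; k = 60 is a commonly used constant, assumed here.
def reciprocal_rank_fusion(*rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |doc_id, index|
      scores[doc_id] += 1.0 / (k + index + 1)
    end
  end
  # Highest fused score first; ties broken by id for a stable order.
  scores.sort_by { |doc_id, score| [-score, doc_id.to_s] }.map(&:first)
end

text_hits = ["a", "b", "c"]    # e.g. ids ranked by a text search
vector_hits = ["c", "a", "d"]  # e.g. ids ranked by a vector search
puts reciprocal_rank_fusion(text_hits, vector_hits).inspect
```

Documents that rank well in both lists rise to the top, without the two searches needing comparable score scales.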
data/Rakefile ADDED
@@ -0,0 +1,20 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ require "standard/rake"
+
+ require "rake/extensiontask"
+
+ task build: :compile
+
+ GEMSPEC = Gem::Specification.load("lancelot.gemspec")
+
+ Rake::ExtensionTask.new("lancelot", GEMSPEC) do |ext|
+   ext.lib_dir = "lib/lancelot"
+ end
+
+ task default: %i[clobber compile spec standard]
@@ -0,0 +1,52 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "lancelot"
+ require "tmpdir"
+
+ # Create a temporary directory for our dataset
+ Dir.mktmpdir do |dir|
+   dataset_path = File.join(dir, "my_dataset")
+
+   puts "Creating dataset at: #{dataset_path}"
+
+   # Create a dataset with a schema
+   dataset = Lancelot::Dataset.create(dataset_path, schema: {
+     text: :string,
+     score: :float32
+   })
+
+   # Add documents
+   dataset.add_documents([
+     { text: "Ruby is a dynamic programming language", score: 0.95 },
+     { text: "Python is great for data science", score: 0.88 }
+   ])
+
+   # Or use the << operator
+   dataset << { text: "JavaScript runs everywhere", score: 0.92 }
+
+   # Get the count
+   puts "Document count: #{dataset.count}"
+
+   # Get the schema
+   puts "Schema: #{dataset.schema.inspect}"
+
+   # Retrieve documents
+   puts "\nRetrieving documents:"
+   puts "First document: #{dataset.first.inspect}"
+   puts "First 2 documents: #{dataset.first(2).map { |d| d[:text] }.join(', ')}"
+
+   # Use Enumerable methods
+   puts "\nUsing Enumerable methods:"
+   texts = dataset.map { |doc| doc[:text] }
+   puts "All texts: #{texts.join(', ')}"
+
+   high_scores = dataset.select { |doc| doc[:score] > 0.9 }
+   puts "High scoring documents: #{high_scores.map { |d| d[:text] }.join(', ')}"
+
+   # Open an existing dataset
+   dataset2 = Lancelot::Dataset.open(dataset_path)
+   puts "\nOpened dataset has #{dataset2.count} documents"
+ end
+
+ puts "Done!"
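Note on the script above: it relies on the dataset responding to `map`, `select`, `first`, and `count`. In Ruby this usually comes from including `Enumerable` and defining `each`. The sketch below shows that pattern with a hypothetical in-memory stand-in, `InMemoryDataset`; it is not Lancelot's implementation, which reads rows from Lance.

```ruby
# Hypothetical in-memory stand-in for a dataset; not Lancelot's class.
class InMemoryDataset
  include Enumerable

  def initialize
    @rows = []
  end

  # Appends an array of row hashes and returns self so calls can be chained.
  def add_documents(docs)
    @rows.concat(docs)
    self
  end

  # Single-row convenience, mirroring the << usage in the script above.
  def <<(doc)
    add_documents([doc])
  end

  # Enumerable only needs #each; map, select, first, and count build on it.
  def each(&block)
    return enum_for(:each) unless block

    @rows.each(&block)
    self
  end
end

dataset = InMemoryDataset.new
dataset.add_documents([{ text: "Ruby", score: 0.95 }, { text: "Python", score: 0.88 }])
dataset << { text: "JavaScript", score: 0.92 }
puts dataset.select { |doc| doc[:score] > 0.9 }.map { |doc| doc[:text] }.inspect
```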
@@ -0,0 +1,146 @@
+ #!/usr/bin/env ruby
+
+ require 'bundler/setup'
+ require 'lancelot'
+ require 'tmpdir'
+
+ Dir.mktmpdir do |dir|
+   dataset_path = File.join(dir, "articles_dataset")
+
+   puts "Creating dataset with article data..."
+
+   # Create a dataset with multiple text fields
+   dataset = Lancelot::Dataset.create(dataset_path, schema: {
+     title: :string,
+     content: :string,
+     category: :string,
+     author: :string,
+     year: :int64,
+     tags: :string
+   })
+
+   # Sample articles
+   articles = [
+     {
+       title: "Getting Started with Ruby on Rails",
+       content: "Ruby on Rails is a powerful web framework that makes building applications fast and enjoyable. It follows the MVC pattern and emphasizes convention over configuration.",
+       category: "web development",
+       author: "Alice Johnson",
+       year: 2024,
+       tags: "ruby rails web mvc framework"
+     },
+     {
+       title: "Advanced Ruby Metaprogramming Techniques",
+       content: "Ruby's metaprogramming capabilities allow you to write code that writes code. Learn about method_missing, define_method, and dynamic class creation.",
+       category: "programming",
+       author: "Bob Smith",
+       year: 2024,
+       tags: "ruby metaprogramming advanced dynamic"
+     },
+     {
+       title: "Building RESTful APIs with Rails",
+       content: "Learn how to build robust RESTful APIs using Ruby on Rails. We'll cover routing, controllers, serialization, and authentication.",
+       category: "web development",
+       author: "Alice Johnson",
+       year: 2023,
+       tags: "ruby rails api rest web services"
+     },
+     {
+       title: "Python vs Ruby: A Comprehensive Comparison",
+       content: "Both Python and Ruby are dynamic, interpreted languages. This article compares their syntax, performance, ecosystem, and use cases.",
+       category: "programming",
+       author: "Charlie Davis",
+       year: 2024,
+       tags: "python ruby comparison languages programming"
+     },
+     {
+       title: "Machine Learning with Python",
+       content: "Python has become the de facto language for machine learning. Explore popular libraries like scikit-learn, TensorFlow, and PyTorch.",
+       category: "data science",
+       author: "David Lee",
+       year: 2024,
+       tags: "python machine learning ml ai data science"
+     },
+     {
+       title: "Rust for Systems Programming",
+       content: "Rust provides memory safety without garbage collection. Learn how Rust is revolutionizing systems programming with its ownership model.",
+       category: "systems",
+       author: "Eve Wilson",
+       year: 2023,
+       tags: "rust systems programming memory safety performance"
+     }
+   ]
+
+   # Add articles
+   dataset.add_documents(articles)
+   puts "Added #{dataset.count} articles\n\n"
+
+   # Create text indices on multiple columns
+   puts "Creating text indices..."
+   dataset.create_text_index("title")
+   dataset.create_text_index("content")
+   dataset.create_text_index("tags")
+   puts "Text indices created\n\n"
+
+   # Test 1: Single column full-text search
+   puts "=== Single Column Full-Text Search ==="
+
+   puts "\nSearching for 'ruby' in content:"
+   results = dataset.text_search("ruby", column: "content", limit: 5)
+   results.each do |doc|
+     puts " - #{doc[:title]}"
+     puts " #{doc[:content][0..80]}..."
+   end
+
+   # Test 2: Search in title
+   puts "\n\nSearching for 'python' in title:"
+   results = dataset.text_search("python", column: "title", limit: 5)
+   results.each do |doc|
+     puts " - #{doc[:title]} (#{doc[:year]})"
+   end
+
+   # Test 3: Search in tags
+   puts "\n\nSearching for 'programming' in tags:"
+   results = dataset.text_search("programming", column: "tags", limit: 5)
+   results.each do |doc|
+     puts " - #{doc[:title]}"
+     puts " Tags: #{doc[:tags]}"
+   end
+
+   # Test 4: Multi-column search
+   puts "\n\n=== Multi-Column Full-Text Search ==="
+
+   puts "\nSearching for 'ruby' across title and content:"
+   results = dataset.text_search("ruby", columns: ["title", "content"], limit: 10)
+   results.each do |doc|
+     puts " - #{doc[:title]} by #{doc[:author]}"
+   end
+
+   # Test 5: Complex multi-word queries
+   puts "\n\nSearching for 'machine learning' across all text fields:"
+   results = dataset.text_search("machine learning", columns: ["title", "content", "tags"], limit: 5)
+   results.each do |doc|
+     puts " - #{doc[:title]}"
+     puts " Category: #{doc[:category]}"
+   end
+
+   # Test 6: Combining with SQL filters
+   puts "\n\n=== Combining Full-Text Search with Filters ==="
+
+   # First do a text search, then filter by year
+   puts "\nArticles about 'programming' from 2024:"
+   all_results = dataset.text_search("programming", column: "content", limit: 20)
+   filtered = all_results.select { |doc| doc[:year] == 2024 }
+   filtered.each do |doc|
+     puts " - #{doc[:title]} (#{doc[:year]})"
+   end
+
+   # Or use SQL filter for category
+   puts "\n\nWeb development articles:"
+   results = dataset.where("category = 'web development'")
+   results.each do |doc|
+     puts " - #{doc[:title]}"
+   end
+ end
+
+ puts "\nDone!"
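Note on the script above: `create_text_index` and `text_search` are backed by inverted indices inside Lance. To give a feel for the idea, here is a toy plain-Ruby inverted index with case-insensitive tokenization and a multi-column union. Everything in it (`ToyTextIndex`, `search_columns`, the AND semantics for multi-word queries, the absence of BM25 ranking) is a simplifying assumption and does not describe Lance's actual behavior.

```ruby
# Toy inverted index: maps each lowercase token to the row ids containing it.
# Hypothetical illustration only; Lance's index also ranks results with BM25.
class ToyTextIndex
  def initialize
    @postings = Hash.new { |hash, token| hash[token] = [] }
  end

  def add(row_id, text)
    tokenize(text).uniq.each { |token| @postings[token] << row_id }
  end

  # Returns ids of rows containing every query token (assumed AND semantics).
  def search(query)
    tokens = tokenize(query)
    return [] if tokens.empty?

    tokens.map { |token| @postings.fetch(token, []) }.reduce(:&)
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

# Union of per-column results, mirroring the "columns:" form used above.
def search_columns(indexes, query)
  indexes.values.map { |index| index.search(query) }.reduce([], :|)
end

rows = [
  { title: "Getting Started with Ruby on Rails", tags: "ruby rails web" },
  { title: "Machine Learning with Python", tags: "python machine learning" },
  { title: "Python vs Ruby", tags: "comparison languages" }
]
indexes = { title: ToyTextIndex.new, tags: ToyTextIndex.new }
rows.each_with_index do |row, id|
  indexes.each { |column, index| index.add(id, row[column]) }
end
puts search_columns(indexes, "python").inspect
```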
@@ -0,0 +1,87 @@
+ #!/usr/bin/env ruby
+
+ require 'bundler/setup'
+ require 'lancelot'
+ require 'red-candle'
+ require 'tmpdir'
+
+ Dir.mktmpdir do |dir|
+   dataset_path = File.join(dir, "embeddings.lance")
+
+   puts "Creating dataset at: #{dataset_path}"
+
+   # Create a dataset with text and embedding columns
+   dataset = Lancelot::Dataset.create(dataset_path, schema: {
+     text: :string,
+     embedding: { type: "vector", dimension: 768 }
+   })
+
+   # Initialize the embedding model
+   embedding_model = Candle::EmbeddingModel.new
+   puts "Using embedding model: jinaai/jina-embeddings-v2-base-en"
+
+   # Sample documents
+   documents = [
+     "Ruby is a dynamic, object-oriented programming language",
+     "Python is great for data science and machine learning",
+     "JavaScript runs in browsers and on servers with Node.js",
+     "Rust provides memory safety without garbage collection",
+     "Go makes concurrent programming easy with goroutines",
+     "Java is widely used in enterprise applications",
+     "C++ offers high performance and low-level control",
+     "TypeScript adds static typing to JavaScript",
+   ]
+
+   # Add documents with embeddings
+   puts "\nAdding documents..."
+   documents.each do |text|
+     embedding = embedding_model.embedding(text)
+
+     # Convert tensor to array (remove batch dimension)
+     embedding_array = embedding.squeeze(0).to_a
+
+     dataset.add_documents([
+       { text: text, embedding: embedding_array }
+     ])
+
+     puts " Added: #{text[0..50]}..."
+   end
+
+   puts "\nTotal documents: #{dataset.count}"
+
+   # Create vector index
+   puts "\nCreating vector index..."
+   dataset.create_vector_index("embedding")
+
+   # Perform vector search
+   query = "Which languages are good for systems programming?"
+   puts "\nSearching for: '#{query}'"
+
+   # Generate embedding for query
+   query_embedding = embedding_model.embedding(query)
+   query_array = query_embedding.squeeze(0).to_a
+
+   # Search for similar documents
+   results = dataset.vector_search(query_array, column: "embedding", limit: 3)
+
+   puts "\nTop 3 most similar documents:"
+   results.each_with_index do |doc, i|
+     puts "#{i + 1}. #{doc[:text]}"
+   end
+
+   # Another search
+   query2 = "dynamic typing and interpreted languages"
+   puts "\n\nSearching for: '#{query2}'"
+
+   query_embedding2 = embedding_model.embedding(query2)
+   query_array2 = query_embedding2.squeeze(0).to_a
+
+   similar = dataset.nearest_neighbors(query_array2, k: 3, column: "embedding")
+
+   puts "\nTop 3 nearest neighbors:"
+   similar.each_with_index do |doc, i|
+     puts "#{i + 1}. #{doc[:text]}"
+   end
+ end
+
+ puts "\nDone!"
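Note on the script above: `vector_search` and `nearest_neighbors` return the rows whose embeddings are closest to the query vector. For intuition, and as a way to sanity-check results on a small dataset, an exact brute-force search can be sketched in plain Ruby. The helper `brute_force_neighbors` is hypothetical, and the use of squared Euclidean (L2) distance is an assumption; the metric and index type Lancelot actually configures are not stated in this diff.

```ruby
# Exact k-nearest-neighbor search by squared Euclidean (L2) distance.
# Hypothetical helper for illustration; an index-backed search may be
# approximate and may use a different metric.
def brute_force_neighbors(rows, query, k:, column: :embedding)
  rows.min_by(k) do |row|
    row[column].zip(query).sum { |a, b| (a - b)**2 }
  end
end

rows = [
  { text: "a", embedding: [0.0, 0.0] },
  { text: "b", embedding: [1.0, 1.0] },
  { text: "c", embedding: [0.2, 0.1] }
]
nearest = brute_force_neighbors(rows, [0.1, 0.1], k: 2)
puts nearest.map { |row| row[:text] }.inspect
```

The scan is linear in the number of rows, which is fine for a handful of documents; avoiding that cost on larger datasets is the reason for building a vector index.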