vectorstore 0.0.1

Files changed (4)
  1. checksums.yaml +7 -0
  2. data/README.md +101 -0
  3. data/lib/vectorstore.rb +181 -0
  4. metadata +59 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 38d0ad1caf146e2a67210910d242bed9032027e2cc7687840f3351db11cd3a6f
+   data.tar.gz: '06509c3c9626ba84274b434cb0c53bcf3cf48f683db07d7ce736dd7180d6bbdf'
+ SHA512:
+   metadata.gz: 2694664e11a08dcc2df37939a5e785f71eb6ae62d1f4a54b03f80fd352f70ade8225038958a591d63d541b66affb558db1a1f29350ea7920349e7c1c15a14d60
+   data.tar.gz: a92e847ba349a1b0cd6e5065ce951b9a46794e50cb22bc88a4692f5ea08918cdb42e9e64407bbec1cf0f280f0b1756b1a428556c6234cb1d659b18a7a6e252b4
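The digests above cover the two archives packed inside the `.gem` file. As a rough sketch of checking one by hand, the snippet below assumes the `.gem` (a plain tar archive) has already been unpacked so that `data.tar.gz` sits in the current directory. The helper name `digest_matches?` and the file location are made up for illustration.

```ruby
require "digest"

# Hypothetical helper: true when the file's SHA256 hex digest equals the expected value.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Expected value copied from the SHA256 section of checksums.yaml.
expected = "06509c3c9626ba84274b434cb0c53bcf3cf48f683db07d7ce736dd7180d6bbdf"
path = "data.tar.gz" # assumed location of the unpacked archive

if File.exist?(path)
  puts digest_matches?(path, expected) ? "#{path}: OK" : "#{path}: MISMATCH"
else
  puts "#{path} not found; unpack the .gem first"
end
```

The same check works for the SHA512 entries by swapping in `Digest::SHA512`.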
data/README.md ADDED
@@ -0,0 +1,101 @@
+ # VectorStore
+
+ A pure Ruby library for storing and querying vectors with optional 1-bit quantization. It provides an easy-to-use interface for adding vectors, computing cosine similarity, finding the closest vectors, and serializing to JSON. It also features quantized storage, which keeps one bit per dimension, for **significantly** reduced memory requirements.
+
+ ## Features
+
+ - **Vector Storage:** Easily add and retrieve vectors with unique keys.
+ - **Closest Match:** Find the closest vectors to a given query vector using cosine similarity.
+ - **Serialization:** Serialize the vector store to JSON for persistence.
+ - **Quantization:** Optional 1-bit quantization to significantly reduce the memory footprint.
+ - **Save/Load:** Persist the vector store to disk and reload it.
+
+ ## Installation
+
+ In a `Gemfile`:
+
+ ```ruby
+ gem 'vectorstore'
+ ```
+
+ Or directly:
+ ```bash
+ gem install vectorstore
+ ```
+
+ ## Basic Usage
+
+ ```ruby
+ require 'vectorstore'
+
+ # Create a new VectorStore
+ store = VectorStore.new
+
+ # Add some vectors
+ store.add("vector1", [1.0, 2.0, 3.0])
+ store.add("vector2", [2.0, 3.0, 4.0])
+ store.add("vector3", [3.0, 4.0, 5.0])
+ store.add("vector4", [0.0, 0.0, 0.0]) # Zero vector edge case
+
+ # Calculate cosine similarity between two vectors
+ similarity = store.cosine_similarity([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
+ puts "Cosine similarity: #{similarity}"
+
+ # Find the closest vectors to a query vector
+ closest = store.find_closest([2.0, 3.0, 4.0], 2)
+ puts "Closest vectors: #{closest.inspect}"
+
+ # Save the store to disk
+ store.save("vector_store.json")
+
+ # Load the store from disk
+ loaded_store = VectorStore.new
+ loaded_store.load("vector_store.json")
+ ```
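To make the ranking step in the usage example concrete, here is a standalone sketch that does not load the gem. It reimplements cosine similarity on plain arrays and ranks the same four vectors against the same query. The `cosine` helper and the plain `Hash` standing in for the store are illustrative assumptions, not the library's API.

```ruby
# Cosine similarity; returns 0.0 when either vector has zero length,
# which is the same convention the store applies to the zero vector.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Hypothetical stand-in for the store: a plain Hash of key => vector.
vectors = {
  "vector1" => [1.0, 2.0, 3.0],
  "vector2" => [2.0, 3.0, 4.0],
  "vector3" => [3.0, 4.0, 5.0],
  "vector4" => [0.0, 0.0, 0.0]
}
query = [2.0, 3.0, 4.0]

# Score every vector, then keep the two highest-scoring [key, similarity] pairs.
ranked = vectors.map { |key, vec| [key, cosine(query, vec)] }
                .max_by(2) { |_, sim| sim }

ranked.each { |key, sim| puts format("%-8s %.4f", key, sim) }
```

Here `vector2` ranks first because it is identical to the query, followed by `vector3`; the zero vector scores 0.0 and so never outranks a vector that points in a similar direction.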
+
+ ### Using OpenAI to embed text
+
+ VectorStore can integrate with OpenAI's API to generate embeddings for text inputs and queries. To use this feature with quantization (STRONGLY RECOMMENDED), initialize the store with `quantized: true`.
+
+ > [!TIP]
+ > To reiterate: using quantization with OpenAI embeddings is strongly recommended, because the way VectorStore currently stores unquantized vectors is very space inefficient, particularly when serializing to disk.
+
+ Example:
+
+ ```ruby
+ store = VectorStore.new(quantized: true)
+
+ store.add_with_openai("example", "Your sample text to embed")
+
+ # You can query with text and have the text embedded automatically
+ store.find_closest_with_openai("Your query text", 3)
+
+ # You can also query by the key of the vector
+ store.find_closest_with_key("example")
+ ```
+
+ Supporting other embedding systems in a nice way would be good for the future, but I like OpenAI's embedding mechanism and it's cheap, so this is just step one. See the example scripts in `examples/example_openai_*.rb` for a broader demo.
+
+ > [!NOTE]
+ > For now, your API key is assumed to be in the `OPENAI_API_KEY` environment variable. The `text-embedding-3-small` model is used by default, but you can override it with the `embedding_model` keyword argument on `find_closest_with_openai` and `add_with_openai` calls.
+
+ ### Working with quantized vectors
+
+ VectorStore supports 1-bit vector quantization, so vectors can be stored in a bitfield (a binary, ASCII-8BIT encoded string holding 8 bits per character for portability) for a significant reduction in memory use. The cost is accuracy, especially on low-dimension vectors; high-dimension vectors, such as the text embeddings from OpenAI's API (see above), will fare a LOT better. Initialize the store with the `quantized: true` option:
+
+ ```ruby
+ store = VectorStore.new(quantized: true)
+ store.add("vectorQ", [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
+ ```
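The accuracy cost on low-dimension vectors is easy to see in a standalone sketch. The code below does not use the gem. It keeps only the sign of each component, as 1-bit quantization does, and compares the resulting bit patterns. The helper names `sign_bits` and `bit_cosine` are invented for this illustration.

```ruby
# Sign-bit quantization: one bit per dimension, 1 when the component is >= 0.
def sign_bits(vector)
  vector.map { |x| x >= 0 ? 1 : 0 }
end

# Cosine similarity of two equal-length 0/1 arrays.
def bit_cosine(bits_a, bits_b)
  shared = bits_a.zip(bits_b).count { |a, b| a == 1 && b == 1 }
  ones_a = bits_a.count(1)
  ones_b = bits_b.count(1)
  return 0.0 if ones_a.zero? || ones_b.zero?

  shared / Math.sqrt(ones_a * ones_b)
end

# Two vectors that point in clearly different directions...
a = [0.1, 0.9, 0.2]
b = [0.9, 0.1, 0.8]

# ...collapse to the same bit pattern because every component is non-negative.
puts sign_bits(a).inspect
puts sign_bits(b).inspect
puts bit_cosine(sign_bits(a), sign_bits(b))
```

Both vectors quantize to `[1, 1, 1]`, so they look identical after quantization. With only three dimensions there are just eight possible bit patterns; with hundreds or thousands of dimensions, as in text embeddings, far more of the directional information survives.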
+
+ ## Running the Tests
+
+ This project uses Minitest. To run the tests:
+
+ ```bash
+ rake test
+ ```
+
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
data/lib/vectorstore.rb ADDED
@@ -0,0 +1,181 @@
+ require 'json'
+ require 'base64'
+
+ # If openai is available, offer it as an option for embedding text directly
+ begin
+   require 'openai'
+ rescue LoadError
+   # The openai gem is optional; the *_with_openai methods return false without it
+ end
+
+
+ class VectorStore
+   attr_reader :vectors
+   def initialize(quantized: false)
+     # Internal store mapping primary key to vector (array of numbers)
+     @vectors = {}
+     @quantized = quantized
+   end
+
+   # Add a vector with the given primary key. Overwrites any existing vector.
+   def add(key, vector)
+     if @quantized
+       @vectors[key] = quantize(vector)
+     else
+       @vectors[key] = vector
+     end
+   end
+
+   def add_with_openai(key, text, embedding_model: "text-embedding-3-small")
+     return false unless defined?(OpenAI)
+     add(key, get_openai_embedding(text, embedding_model: embedding_model))
+   end
+
+   # Remove a vector by its primary key.
+   def remove(key)
+     @vectors.delete(key)
+   end
+
+   # Retrieve a vector by its primary key.
+   def get(key)
+     @vectors[key]
+   end
+
+   # Compute the cosine similarity between two vectors.
+   def cosine_similarity(vec1, vec2)
+     if @quantized
+       vec1 = vec1.is_a?(String) ? vec1 : quantize(vec1)
+       vec2 = vec2.is_a?(String) ? vec2 : quantize(vec2)
+       return cosine_similarity_quantized(vec1, vec2)
+     end
+     if vec1.is_a?(String) && vec2.is_a?(String)
+       return cosine_similarity_quantized(vec1, vec2)
+     end
+
+     # Ensure vectors are of the same size
+     raise "Vector dimensions do not match" if vec1.size != vec2.size
+
+     dot_product = vec1.zip(vec2).map { |a, b| a * b }.sum
+     norm1 = Math.sqrt(vec1.map { |x| x * x }.sum)
+     norm2 = Math.sqrt(vec2.map { |x| x * x }.sum)
+     return 0.0 if norm1 == 0 || norm2 == 0
+
+     dot_product / (norm1 * norm2)
+   end
+
+   # Find the top k closest vectors to the query vector using cosine similarity.
+   # Returns an array of [key, similarity] pairs.
+   def find_closest(query_vector, k=1)
+     if @quantized
+       query_vector = quantize(query_vector)
+     end
+
+     similarities = @vectors.map do |key, vector|
+       similarity = cosine_similarity(query_vector, vector)
+       [key, similarity]
+     end
+     similarities.sort_by { |_, sim| -sim }.first(k)
+   end
+
+   def find_closest_with_key(key, k=1)
+     # fetch raises KeyError for an unknown key instead of passing nil along
+     query_vector = @vectors.fetch(key)
+     find_closest(query_vector, k)
+   end
+
+   def get_openai_embedding(text, embedding_model: "text-embedding-3-small")
+     return false unless defined?(OpenAI)
+
+     client = OpenAI::Client.new(
+       access_token: ENV["OPENAI_API_KEY"],
+       log_errors: true
+     )
+     response = client.embeddings(
+       parameters: {
+         model: embedding_model,
+         input: text
+       }
+     )
+
+     response.dig("data", 0, "embedding")
+   end
+
+   def find_closest_with_openai(query_text, k=1, embedding_model: "text-embedding-3-small")
+     return false unless defined?(OpenAI)
+     query_vector = get_openai_embedding(query_text, embedding_model: embedding_model)
+     find_closest(query_vector, k)
+   end
+
+   # Compute cosine similarity for quantized vectors (bit strings).
+   def cosine_similarity_quantized(str1, str2)
+     # Packed strings of different lengths cannot come from same-sized vectors
+     raise "Vector dimensions do not match" if str1.bytesize != str2.bytesize
+
+     dot = 0
+     total_ones_str1 = 0
+     total_ones_str2 = 0
+     str1.each_byte.with_index do |byte1, index|
+       byte2 = str2.getbyte(index)
+       # Count set bits: shared ones give the dot product, per-string ones the norms
+       dot += (byte1 & byte2).to_s(2).count("1")
+       total_ones_str1 += byte1.to_s(2).count("1")
+       total_ones_str2 += byte2.to_s(2).count("1")
+     end
+     return 0.0 if total_ones_str1 == 0 || total_ones_str2 == 0
+     sim = dot.to_f / (Math.sqrt(total_ones_str1) * Math.sqrt(total_ones_str2))
+     sim = 1.0 if (1.0 - sim).abs < 1e-6
+     sim
+   end
+
+   # Convert an array of floats to a 1-bit quantized bit string.
+   def quantize(vector)
+     # If it's already a string, it's already quantized
+     return vector if vector.is_a?(String)
+
+     # One bit per dimension: 1 for non-negative components, 0 for negative ones
+     bits = vector.map { |x| x >= 0 ? 1 : 0 }
+     bytes = []
+     bits.each_slice(8) do |slice|
+       byte = slice.join.to_i(2)
+       bytes << byte.chr("ASCII-8BIT")
+     end
+     result = bytes.join
+     result.force_encoding("ASCII-8BIT")
+     result
+   end
+
+   # Serialize the internal vector store to a JSON string.
+   def serialize
+     if @quantized
+       encoded = {}
+       @vectors.each do |k, v|
+         encoded[k] = Base64.strict_encode64(v)
+       end
+       JSON.dump(encoded)
+     else
+       JSON.dump(@vectors)
+     end
+   end
+
+   # Deserialize a JSON string and update the internal store.
+   def deserialize(json_string)
+     data = JSON.parse(json_string)
+     # We need to detect if the data is quantized or not
+     # by seeing if the values are strings and not arrays.
+     # An empty store carries no evidence, so keep the current mode.
+     @quantized = data.values.first.is_a?(String) unless data.empty?
+
+     if @quantized
+       decoded = {}
+       data.each do |k, v|
+         decoded[k] = Base64.decode64(v)
+       end
+       @vectors = decoded
+     else
+       @vectors = data
+     end
+   end
+
+   # Save the internal vector store to a file.
+   def save(filename)
+     File.write(filename, serialize)
+   end
+
+   # Load the internal vector store from a file.
+   def load(filename)
+     json_string = File.read(filename)
+     deserialize(json_string)
+   end
+ end
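The packing and serialization steps in this file can be traced on the README's eight-dimension example with a small standalone sketch. It does not load the gem. `pack_sign_bits` is a hypothetical helper that mirrors the idea behind `quantize`, and it uses `Array#pack("m0")` / `String#unpack1("m0")` for Base64 (the same encoding `Base64.strict_encode64` produces) so the sketch has no dependency on the `base64` gem, which is no longer a default gem as of Ruby 3.4.

```ruby
require "json"

# Pack sign bits into bytes, 8 dimensions per byte (hypothetical helper).
def pack_sign_bits(vector)
  vector.map { |x| x >= 0 ? 1 : 0 }
        .each_slice(8)
        .map { |slice| slice.join.to_i(2) }
        .pack("C*")
end

packed = pack_sign_bits([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Raw bytes are not valid JSON text, so encode them as Base64 first.
json = JSON.dump("vectorQ" => [packed].pack("m0"))

# Reverse the steps: parse the JSON, then decode the Base64 payload.
restored = JSON.parse(json)["vectorQ"].unpack1("m0")

puts packed.bytes.inspect # [170], since the bit pattern 10101010 is 170
puts json                 # {"vectorQ":"qg=="}
puts restored == packed   # true
```

Eight floats shrink to a single byte here, which is where the memory saving comes from. The four-character Base64 payload is still far smaller than the JSON array of floats the unquantized path would write.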
metadata ADDED
@@ -0,0 +1,59 @@
+ --- !ruby/object:Gem::Specification
+ name: vectorstore
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - Peter Cooper
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2025-02-09 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.0'
+ description: A library for storing and handling vectors with optional quantization.
+ email:
+ - git@peterc.org
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README.md
+ - lib/vectorstore.rb
+ homepage: http://github.com/peterc/vectorstore
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.23
+ signing_key:
+ specification_version: 4
+ summary: A simple vector storage and querying library
+ test_files: []