vectorstore 0.0.1
- checksums.yaml +7 -0
- data/README.md +101 -0
- data/lib/vectorstore.rb +181 -0
- metadata +59 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: 38d0ad1caf146e2a67210910d242bed9032027e2cc7687840f3351db11cd3a6f
  data.tar.gz: '06509c3c9626ba84274b434cb0c53bcf3cf48f683db07d7ce736dd7180d6bbdf'
SHA512:
  metadata.gz: 2694664e11a08dcc2df37939a5e785f71eb6ae62d1f4a54b03f80fd352f70ade8225038958a591d63d541b66affb558db1a1f29350ea7920349e7c1c15a14d60
  data.tar.gz: a92e847ba349a1b0cd6e5065ce951b9a46794e50cb22bc88a4692f5ea08918cdb42e9e64407bbec1cf0f280f0b1756b1a428556c6234cb1d659b18a7a6e252b4
data/README.md
ADDED
@@ -0,0 +1,101 @@
# VectorStore

A pure Ruby library for storing and querying vectors with optional 1-bit quantization. It provides an easy-to-use interface for adding vectors, computing cosine similarity, finding the closest vectors, and serializing to JSON. It also features quantized storage for **significantly** reduced memory requirements.

## Features

- **Vector Storage:** Easily add and retrieve vectors with unique keys.
- **Closest Match:** Find the closest vectors to a given query vector using cosine similarity.
- **Serialization:** Serialize the vector store to JSON for persistence.
- **Quantization:** Optional 1-bit quantization to reduce memory footprint (significantly).
- **Save/Load:** Persist vector store to disk and reload it.

## Installation

In a `Gemfile`:

```ruby
gem 'vectorstore'
```

Or directly:

```bash
gem install vectorstore
```

## Basic Usage

```ruby
require 'vectorstore'

# Create a new VectorStore
store = VectorStore.new

# Add some vectors
store.add("vector1", [1.0, 2.0, 3.0])
store.add("vector2", [2.0, 3.0, 4.0])
store.add("vector3", [3.0, 4.0, 5.0])
store.add("vector4", [0.0, 0.0, 0.0]) # Zero vector edge case

# Calculate cosine similarity between two vectors
similarity = store.cosine_similarity([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
puts "Cosine similarity: #{similarity}"

# Find the closest vectors to a query vector
closest = store.find_closest([2.0, 3.0, 4.0], 2)
puts "Closest vectors: #{closest.inspect}"

# Save the store to disk
store.save("vector_store.json")

# Load the store from disk
loaded_store = VectorStore.new
loaded_store.load("vector_store.json")
```
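
To make the similarity and ranking calls above concrete, here is a minimal standalone sketch that does not use the gem. It computes cosine similarity by hand for the same four vectors and ranks them the way a top-k search would. The helper names `cosine` and `top_k` are invented for this illustration and are not part of the VectorStore API.

```ruby
# Standalone sketch; `cosine` and `top_k` are hypothetical helpers, not gem API.
def cosine(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # A zero vector has no direction, so report 0.0 instead of dividing by zero.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Score every stored vector against the query and keep the k best,
# highest similarity first.
def top_k(vectors, query, k)
  scored = vectors.map { |key, vec| [key, cosine(query, vec)] }
  scored.sort_by { |_, sim| -sim }.first(k)
end

vectors = {
  "vector1" => [1.0, 2.0, 3.0],
  "vector2" => [2.0, 3.0, 4.0],
  "vector3" => [3.0, 4.0, 5.0],
  "vector4" => [0.0, 0.0, 0.0]
}

top_k(vectors, [2.0, 3.0, 4.0], 2).each do |key, sim|
  puts format("%s: %.4f", key, sim)
end
# vector2: 1.0000
# vector3: 0.9979
```

The result is an array of `[key, similarity]` pairs. The zero vector scores 0.0 against everything, so it always sorts last.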

### Using OpenAI for vector embedding text

VectorStore can integrate with OpenAI's API to generate embeddings for text inputs and queries. To use this feature with quantization (STRONGLY RECOMMENDED), initialize the store with quantized mode.

> [!TIP]
> Let me reiterate: using quantization with OpenAI embeddings is strongly recommended with VectorStore, as the normal way we currently store the vectors is very space-inefficient, particularly when serializing to disk.

Example:

```ruby
store = VectorStore.new(quantized: true)

store.add_with_openai("example", "Your sample text to embed")

# You can query with text and have the text embedded automatically
store.find_closest_with_openai("Your query text", 3)

# You can also query by the key of the vector
store.find_closest_with_key("example")
```

Supporting other embedding systems in a nice way would be good for the future, but I like OpenAI's embedding mechanism and it's cheap, so this is just step one. You can see example scripts in `examples/example_openai_*.rb` for a broader demo.

> [!NOTE]
> For now, your API key is assumed to be in the `OPENAI_API_KEY` environment variable. The `text-embedding-3-small` model is used by default, but this can be overridden with the `embedding_model` keyword argument on `find_closest_with_openai` and `add_with_openai` calls.

### Working with quantized vectors

VectorStore supports 1-bit vector quantization so that vectors can be stored in a bitfield (using an ASCII-encoded string with 8 bits per character for portability) for a significant memory use reduction. The cost is accuracy, especially on low-dimension vectors; high-dimension vectors, such as those used for text embeddings from OpenAI's API (see above), will fare a LOT better. Initialize the store with the `quantized: true` option:

```ruby
store = VectorStore.new(quantized: true)
store.add("vectorQ", [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```
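
To see why low-dimension vectors lose accuracy, here is a rough standalone sketch of sign-based 1-bit quantization. It assumes the scheme visible in `lib/vectorstore.rb`: one bit per component, set when the value is `>= 0`, with similarity computed over the bits. The helper names `sign_bits`, `pack_bits`, and `binary_cosine` are invented for this example, and the layout of a partial final byte may differ from the gem's.

```ruby
# Standalone sketch; helper names are hypothetical, not part of the gem's API.

# One bit per component: 1 when the value is >= 0, otherwise 0.
def sign_bits(vector)
  vector.map { |x| x >= 0 ? 1 : 0 }
end

# Pack the bits into a binary string, 8 per byte (most significant bit first).
# Array#pack("B*") zero-pads a partial final byte on the right.
def pack_bits(bits)
  [bits.join].pack("B*")
end

# Cosine similarity over 0/1 bit vectors: the number of shared set bits
# divided by the square root of the product of each side's set-bit count.
def binary_cosine(str1, str2)
  ones1 = str1.unpack1("B*").count("1")
  ones2 = str2.unpack1("B*").count("1")
  return 0.0 if ones1.zero? || ones2.zero?

  shared = str1.bytes.zip(str2.bytes).sum { |x, y| (x & y).to_s(2).count("1") }
  shared / Math.sqrt(ones1 * ones2)
end

a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]

# Exact cosine similarity on the original floats, for comparison.
dot    = a.zip(b).sum { |x, y| x * y }
norm_a = Math.sqrt(a.sum { |x| x * x })
norm_b = Math.sqrt(b.sum { |x| x * x })
exact  = dot / (norm_a * norm_b)

quantized = binary_cosine(pack_bits(sign_bits(a)), pack_bits(sign_bits(b)))

puts exact.round(3) # 0.714
puts quantized      # 1.0
```

Both vectors are all-positive, so they collapse to the same bit pattern and score 1.0 even though their exact cosine similarity is about 0.714. With hundreds or thousands of dimensions, sign patterns are far less likely to coincide by accident, which is why embeddings hold up better. On storage, eight components fit in one byte, so a 1536-dimension embedding would pack into 192 bytes.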

## Running the Tests

This project uses Minitest. To run the tests:

```bash
rake test
```

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
data/lib/vectorstore.rb
ADDED
@@ -0,0 +1,181 @@
require 'json'
require 'base64'

# If openai is available, offer it as an option for embedding text directly
begin
  require 'openai'
rescue LoadError
end


class VectorStore
  attr_reader :vectors
  def initialize(quantized: false)
    # Internal store mapping primary key to vector (array of numbers)
    @vectors = {}
    @quantized = quantized
  end

  # Add a vector with the given primary key. Overwrites any existing vector.
  def add(key, vector)
    if @quantized
      @vectors[key] = quantize(vector)
    else
      @vectors[key] = vector
    end
  end

  def add_with_openai(key, text, embedding_model: "text-embedding-3-small")
    return false unless defined?(OpenAI)
    add(key, get_openai_embedding(text, embedding_model: embedding_model))
  end

  # Remove a vector by its primary key.
  def remove(key)
    @vectors.delete(key)
  end

  # Retrieve a vector by its primary key.
  def get(key)
    @vectors[key]
  end

  # Compute the cosine similarity between two vectors.
  def cosine_similarity(vec1, vec2)
    if @quantized
      vec1 = vec1.is_a?(String) ? vec1 : quantize(vec1)
      vec2 = vec2.is_a?(String) ? vec2 : quantize(vec2)
      return cosine_similarity_quantized(vec1, vec2)
    end
    if vec1.is_a?(String) && vec2.is_a?(String)
      return cosine_similarity_quantized(vec1, vec2)
    end

    # Ensure vectors are of the same size
    raise "Vector dimensions do not match" if vec1.size != vec2.size

    dot_product = vec1.zip(vec2).map { |a, b| a * b }.sum
    norm1 = Math.sqrt(vec1.map { |x| x * x }.sum)
    norm2 = Math.sqrt(vec2.map { |x| x * x }.sum)
    return 0.0 if norm1 == 0 || norm2 == 0

    dot_product / (norm1 * norm2)
  end

  # Find the top k closest vectors to the query vector using cosine similarity.
  # Returns an array of [key, similarity] pairs.
  def find_closest(query_vector, k=1)
    if @quantized
      query_vector = quantize(query_vector)
    end

    similarities = @vectors.map do |key, vector|
      similarity = cosine_similarity(query_vector, vector)
      [key, similarity]
    end
    similarities.sort_by { |_, sim| -sim }.first(k)
  end

  def find_closest_with_key(key, k=1)
    query_vector = @vectors[key]
    find_closest(query_vector, k)
  end

  def get_openai_embedding(text, embedding_model: "text-embedding-3-small")
    return false unless defined?(OpenAI)

    client = OpenAI::Client.new(
      access_token: ENV["OPENAI_API_KEY"],
      log_errors: true
    )
    response = client.embeddings(
      parameters: {
        model: embedding_model,
        input: text
      }
    )

    response.dig("data", 0, "embedding")
  end

  def find_closest_with_openai(query_text, k=1, embedding_model: "text-embedding-3-small")
    return false unless defined?(OpenAI)
    query_vector = get_openai_embedding(query_text, embedding_model: embedding_model)
    find_closest(query_vector, k)
  end

  # Compute cosine similarity for quantized vectors (bit strings).
  def cosine_similarity_quantized(str1, str2)
    dot = 0
    total_ones_str1 = 0
    total_ones_str2 = 0
    str1.each_byte.with_index do |byte1, index|
      byte2 = str2.getbyte(index)
      dot += (byte1 & byte2).to_s(2).count("1")
      total_ones_str1 += byte1.to_s(2).count("1")
      total_ones_str2 += byte2.to_s(2).count("1")
    end
    return 0.0 if total_ones_str1 == 0 || total_ones_str2 == 0
    sim = dot.to_f / (Math.sqrt(total_ones_str1) * Math.sqrt(total_ones_str2))
    sim = 1.0 if (1.0 - sim).abs < 1e-6
    sim
  end

  # Convert an array of floats to a 1-bit quantized bit string.
  def quantize(vector)
    # If it's already a string, it's already quantized
    return vector if vector.is_a?(String)

    bits = vector.map { |x| x >= 0 ? 1 : 0 }
    bytes = []
    bits.each_slice(8) do |slice|
      byte = slice.join.to_i(2)
      bytes << byte.chr("ASCII-8BIT")
    end
    result = bytes.join
    result.force_encoding("ASCII-8BIT")
    result
  end

  # Serialize the internal vector store to a JSON string.
  def serialize
    if @quantized
      encoded = {}
      @vectors.each do |k, v|
        encoded[k] = Base64.strict_encode64(v)
      end
      JSON.dump(encoded)
    else
      JSON.dump(@vectors)
    end
  end

  # Deserialize a JSON string and update the internal store.
  def deserialize(json_string)
    data = JSON.parse(json_string)
    # We need to detect if the data is quantized or not
    # by seeing if the values are strings and not arrays
    @quantized = data.values.first.is_a?(String)

    if @quantized
      decoded = {}
      data.each do |k, v|
        decoded[k] = Base64.decode64(v)
      end
      @vectors = decoded
    else
      @vectors = data
    end
  end

  # Save the internal vector store to a file.
  def save(filename)
    File.write(filename, serialize)
  end

  # Load the internal vector store from a file.
  def load(filename)
    json_string = File.read(filename)
    deserialize(json_string)
  end
end
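
The quantized branch of `serialize` Base64-encodes each value before dumping JSON, presumably because packed bytes are generally not valid UTF-8, which JSON strings require. The round trip can be sketched on its own as below. The helper names `dump_quantized` and `load_quantized` are hypothetical, and the sketch swaps in `Array#pack("m0")` and `String#unpack1("m0")`, the core primitives for strict Base64, in place of the `Base64` module calls used in the file.

```ruby
require "json"

# Hypothetical helpers; the gem's own methods are `serialize` and `deserialize`.

# Base64-encode every binary value so the hash can be dumped as JSON text.
def dump_quantized(vectors)
  JSON.dump(vectors.transform_values { |bytes| [bytes].pack("m0") })
end

# Parse the JSON and decode every value back into a binary string.
def load_quantized(json_string)
  JSON.parse(json_string).transform_values { |text| text.unpack1("m0") }
end

store = { "vectorQ" => "\xAA\xFF".b } # 16 quantized components in 2 bytes

json = dump_quantized(store)
puts json                          # {"vectorQ":"qv8="}
puts load_quantized(json) == store # true
```

Because quantized values serialize as strings and unquantized values as arrays, the type of the first value is enough for `deserialize` to tell the two formats apart.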
metadata
ADDED
@@ -0,0 +1,59 @@
--- !ruby/object:Gem::Specification
name: vectorstore
version: !ruby/object:Gem::Version
  version: 0.0.1
platform: ruby
authors:
- Peter Cooper
autorequire:
bindir: bin
cert_chain: []
date: 2025-02-09 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: minitest
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '5.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '5.0'
description: A library for storing and handling vectors with optional quantization.
email:
- git@peterc.org
executables: []
extensions: []
extra_rdoc_files: []
files:
- README.md
- lib/vectorstore.rb
homepage: http://github.com/peterc/vectorstore
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubygems_version: 3.5.23
signing_key:
specification_version: 4
summary: A simple vector storage and querying library
test_files: []