neighbor-s3 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 3d4c40ed4dac5051282fef9538abb3f74c7dbaa25e2acf5cbef7fdb43b72f9c5
+   data.tar.gz: cfe0700757e352d2eeb27dd869c11562fa1a31d53a9c99270c84c14ec72c9e4f
+ SHA512:
+   metadata.gz: 8fc63cd4fd5030bfa9500f3f664d99a9860a60131aaff5dbd02c9072c0d43009b06394ca7ba5f70b1835d361e27c1e9a5815a352dbdf0edb373615d630dfefc3
+   data.tar.gz: 77a5d9c7cb2ca0ade3f2728ad63e05d8f024cf21fc10f925126afece477933e244dd04baf6f14a799e57aa60acfe6da1940d2feb86578533d2d4efeff4fb210b
data/CHANGELOG.md ADDED
@@ -0,0 +1,3 @@
+ ## 0.1.0 (2025-10-02)
+
+ - First release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2025 Andrew Kane
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,199 @@
+ # Neighbor S3
+
+ Nearest neighbor search for Ruby and S3 Vectors
+
+ ## Installation
+
+ Add this line to your application’s Gemfile:
+
+ ```ruby
+ gem "neighbor-s3"
+ ```
+
+ ## Getting Started
+
+ First, create a [vector bucket](https://console.aws.amazon.com/s3/vector-buckets)
+
+ Create an index
+
+ ```ruby
+ index = Neighbor::S3::Index.new("items", bucket: "my-bucket", dimensions: 3, distance: "cosine")
+ index.create
+ ```
+
+ Add vectors
+
+ ```ruby
+ index.add(1, [1, 1, 1])
+ index.add(2, [2, 2, 2])
+ index.add(3, [1, 1, 2])
+ ```
+
+ Search for nearest neighbors to a vector
+
+ ```ruby
+ index.search([1, 1, 1], count: 5)
+ ```
+
+ Search for nearest neighbors to a vector in the index
+
+ ```ruby
+ index.search_id(1, count: 5)
+ ```
+
+ IDs are treated as strings by default, but can also be treated as integers
+
+ ```ruby
+ Neighbor::S3::Index.new("items", id_type: "integer", ...)
+ ```
+
+ ## Operations
+
+ Add or update a vector
+
+ ```ruby
+ index.add(id, vector)
+ ```
+
+ Add or update multiple vectors
+
+ ```ruby
+ index.add_all([{id: 1, vector: [1, 2, 3]}, {id: 2, vector: [4, 5, 6]}])
+ ```
+
+ Get a vector
+
+ ```ruby
+ index.find(id)
+ ```
+
+ Get all vectors
+
+ ```ruby
+ index.find_in_batches do |batch|
+   # ...
+ end
+ ```
+
+ Remove a vector
+
+ ```ruby
+ index.remove(id)
+ ```
+
+ Remove multiple vectors
+
+ ```ruby
+ index.remove_all(ids)
+ ```
+
+ ## Metadata
+
+ Add a vector with metadata
+
+ ```ruby
+ index.add(id, vector, metadata: {category: "A"})
+ ```
+
+ Add multiple vectors with metadata
+
+ ```ruby
+ index.add_all([
+   {id: 1, vector: [1, 2, 3], metadata: {category: "A"}},
+   {id: 2, vector: [4, 5, 6], metadata: {category: "B"}}
+ ])
+ ```
+
+ Get metadata with search results
+
+ ```ruby
+ index.search(vector, with_metadata: true)
+ ```
+
+ Filter by metadata
+
+ ```ruby
+ index.search(vector, filter: {category: "A"})
+ ```
+
+ Supports [these operators](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-metadata-filtering.html#s3-vectors-metadata-filtering-filterable)
+
+ Specify non-filterable metadata on index creation
+
+ ```ruby
+ Neighbor::S3::Index.new(name, non_filterable: ["category"], ...)
+ ```
+
+ ## Example
+
+ You can use Neighbor S3 for online item-based recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.
+
+ Create an index
+
+ ```ruby
+ index = Neighbor::S3::Index.new("movies", bucket: "my-bucket", dimensions: 20, distance: "cosine")
+ ```
+
+ Fit the recommender
+
+ ```ruby
+ data = Disco.load_movielens
+ recommender = Disco::Recommender.new(factors: 20)
+ recommender.fit(data)
+ ```
+
+ Store the item factors
+
+ ```ruby
+ index.add_all(recommender.item_ids.map { |v| {id: v, vector: recommender.item_factors(v)} })
+ ```
+
+ And get similar movies
+
+ ```ruby
+ index.search_id("Star Wars (1977)").map { |v| v[:id] }
+ ```
+
+ See the [complete code](examples/disco_item_recs.rb)
+
+ ## Reference
+
+ Get index info
+
+ ```ruby
+ index.info
+ ```
+
+ Check if an index exists
+
+ ```ruby
+ index.exists?
+ ```
+
+ Drop an index
+
+ ```ruby
+ index.drop
+ ```
+
+ ## History
+
+ View the [changelog](https://github.com/ankane/neighbor-s3/blob/master/CHANGELOG.md)
+
+ ## Contributing
+
+ Everyone is encouraged to help improve this project. Here are a few ways you can help:
+
+ - [Report bugs](https://github.com/ankane/neighbor-s3/issues)
+ - Fix bugs and [submit pull requests](https://github.com/ankane/neighbor-s3/pulls)
+ - Write, clarify, or fix documentation
+ - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/neighbor-s3.git
+ cd neighbor-s3
+ bundle install
+ bundle exec rake test
+ ```
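Reviewer note on the README's Getting Started vectors: the sketch below ranks them locally to show roughly what a `"cosine"` search over `[1, 1, 1]`, `[2, 2, 2]`, and `[1, 1, 2]` might return. It assumes cosine distance means 1 minus cosine similarity; the real distances come from the S3 Vectors service, not from Ruby. The names `cosine_distance`, `rank`, and `VECTORS` are hypothetical and not part of the gem.

```ruby
# Local illustration only: real distances are computed by S3 Vectors.
# Assumption: cosine distance = 1 - cosine similarity.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# The three vectors added in the README's Getting Started section.
VECTORS = {1 => [1, 1, 1], 2 => [2, 2, 2], 3 => [1, 1, 2]}

# Returns hashes shaped like the gem's search results ({id:, distance:}),
# sorted from nearest to farthest.
def rank(query, count: 5)
  VECTORS
    .map { |id, vector| {id: id, distance: cosine_distance(query, vector)} }
    .sort_by { |item| item[:distance] }
    .first(count)
end

rank([1, 1, 1]).each { |item| puts item.inspect }
```

Under this definition, vectors 1 and 2 point in the same direction as the query and so sit at a distance of about zero, while vector 3 ranks last.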
data/lib/neighbor/s3/index.rb ADDED
@@ -0,0 +1,237 @@
+ module Neighbor
+   module S3
+     class Index
+       def initialize(name, bucket:, dimensions:, distance:, id_type: "string", non_filterable: nil)
+         @name = name
+         @bucket = bucket
+         @dimensions = dimensions.to_i
+
+         @distance_metric =
+           case distance.to_s
+           when "euclidean"
+             "euclidean"
+           when "cosine"
+             "cosine"
+           else
+             raise ArgumentError, "invalid distance"
+           end
+
+         @int_ids =
+           case id_type.to_s
+           when "string"
+             false
+           when "integer"
+             true
+           else
+             raise ArgumentError, "invalid id_type"
+           end
+
+         @non_filterable = non_filterable.to_a
+       end
+
+       def self.create(*args, **options)
+         index = new(*args, **options)
+         index.create
+         index
+       end
+
+       def create
+         options = {
+           vector_bucket_name: @bucket,
+           index_name: @name,
+           data_type: "float32",
+           dimension: @dimensions,
+           distance_metric: @distance_metric
+         }
+         if @non_filterable.any?
+           options[:metadata_configuration] = {
+             non_filterable_metadata_keys: @non_filterable
+           }
+         end
+         client.create_index(options)
+         nil
+       end
+
+       def exists?
+         client.get_index({
+           vector_bucket_name: @bucket,
+           index_name: @name
+         })
+         true
+       rescue Aws::S3Vectors::Errors::NotFoundException
+         false
+       end
+
+       def info
+         client.get_index({
+           vector_bucket_name: @bucket,
+           index_name: @name
+         }).index.to_h
+       end
+
+       def add(id, vector, metadata: nil)
+         add_all([{id: id, vector: vector, metadata: metadata}])
+       end
+
+       def add_all(items)
+         # perform checks first to reduce chance of non-atomic updates
+         vectors =
+           items.map do |item|
+             vector = item.fetch(:vector).to_a
+             check_dimensions(vector)
+
+             {
+               key: item_id(item.fetch(:id)).to_s,
+               data: {float32: vector},
+               metadata: item[:metadata]
+             }
+           end
+
+         vectors.each_slice(500) do |batch|
+           client.put_vectors({
+             vector_bucket_name: @bucket,
+             index_name: @name,
+             vectors: batch
+           })
+         end
+         nil
+       end
+
+       def member?(id)
+         id = item_id(id)
+
+         client.get_vectors({
+           vector_bucket_name: @bucket,
+           index_name: @name,
+           keys: [id.to_s],
+           return_data: false,
+           return_metadata: false
+         }).vectors.any?
+       end
+       alias_method :include?, :member?
+
+       def remove(id)
+         remove_all([id])
+       end
+
+       def remove_all(ids)
+         ids = ids.to_a.map { |id| item_id(id) }
+
+         ids.each_slice(500) do |batch|
+           client.delete_vectors({
+             vector_bucket_name: @bucket,
+             index_name: @name,
+             keys: batch.map(&:to_s)
+           })
+         end
+         nil
+       end
+
+       def find(id, with_metadata: true)
+         id = item_id(id)
+
+         v =
+           client.get_vectors({
+             vector_bucket_name: @bucket,
+             index_name: @name,
+             keys: [id.to_s],
+             return_data: true,
+             return_metadata: with_metadata
+           }).vectors.first
+
+         if v
+           item = {
+             id: item_id(v.key),
+             vector: v.data.float32
+           }
+           item[:metadata] = v.metadata if with_metadata
+           item
+         end
+       end
+
+       def find_in_batches(batch_size: 1000, with_metadata: true)
+         options = {
+           vector_bucket_name: @bucket,
+           index_name: @name,
+           max_results: batch_size,
+           return_data: true,
+           return_metadata: with_metadata
+         }
+
+         begin
+           resp = client.list_vectors(options)
+           batch =
+             resp.vectors.map do |v|
+               item = {
+                 id: item_id(v.key),
+                 vector: v.data.float32
+               }
+               item[:metadata] = v.metadata if with_metadata
+               item
+             end
+           yield batch
+           options[:next_token] = resp.next_token
+         end while resp.next_token
+       end
+
+       def search(vector, count: 5, with_metadata: false, filter: nil)
+         check_dimensions(vector)
+
+         client.query_vectors({
+           vector_bucket_name: @bucket,
+           index_name: @name,
+           top_k: count,
+           query_vector: {
+             float32: vector,
+           },
+           filter: filter,
+           return_metadata: with_metadata,
+           return_distance: true
+         }).vectors.map do |v|
+           item = {
+             id: item_id(v.key),
+             distance: @distance_metric == "euclidean" ? Math.sqrt(v.distance) : v.distance
+           }
+           item[:metadata] = v.metadata if with_metadata
+           item
+         end
+       end
+
+       def search_id(id, count: 5, with_metadata: false, filter: nil)
+         id = item_id(id)
+
+         item = find(id)
+         unless item
+           raise Error, "Could not find item #{id}"
+         end
+
+         result = search(item[:vector], count: count + 1, with_metadata:, filter:)
+         result.reject { |v| v[:id] == id }.first(count)
+       end
+
+       def drop
+         client.delete_index({
+           vector_bucket_name: @bucket,
+           index_name: @name
+         })
+         nil
+       end
+
+       private
+
+       def check_dimensions(vector)
+         if vector.size != @dimensions
+           raise ArgumentError, "expected #{@dimensions} dimensions"
+         end
+       end
+
+       def item_id(id)
+         @int_ids ? Integer(id) : id.to_s
+       end
+
+       def client
+         S3.client
+       end
+     end
+   end
+ end
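Reviewer note on `add_all`: it validates every vector before sending anything, then writes in slices of 500. The sketch below imitates that shape against a stand-in client so the batching can be seen without AWS credentials. `RecordingClient` and `put_in_batches` are hypothetical names for this illustration and are not part of the gem or the AWS SDK.

```ruby
# Hypothetical stand-in for the S3 Vectors client: it only records batch sizes.
class RecordingClient
  attr_reader :batch_sizes

  def initialize
    @batch_sizes = []
  end

  def put_vectors(params)
    @batch_sizes << params.fetch(:vectors).size
  end
end

# Mirrors the shape of add_all: check every item first, then send slices.
def put_in_batches(client, items, dimensions:, batch_size: 500)
  vectors = items.map do |item|
    vector = item.fetch(:vector).to_a
    raise ArgumentError, "expected #{dimensions} dimensions" if vector.size != dimensions

    {key: item.fetch(:id).to_s, data: {float32: vector}}
  end

  # Only reached when every item passed the dimension check above.
  vectors.each_slice(batch_size) { |batch| client.put_vectors({vectors: batch}) }
  nil
end

client = RecordingClient.new
items = (1..1200).map { |i| {id: i, vector: [i, i, i]} }
put_in_batches(client, items, dimensions: 3)
p client.batch_sizes
```

Because validation runs over the whole input first, a malformed item at the end of the list stops the call before any batch is written. A failure partway through the network calls could still leave earlier batches stored, which is presumably why the source comment says "reduce chance" rather than promising atomic updates.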
data/lib/neighbor/s3/version.rb ADDED
@@ -0,0 +1,5 @@
+ module Neighbor
+   module S3
+     VERSION = "0.1.0"
+   end
+ end
data/lib/neighbor/s3.rb ADDED
@@ -0,0 +1,20 @@
+ # dependencies
+ require "aws-sdk-s3vectors"
+
+ # modules
+ require_relative "s3/index"
+ require_relative "s3/version"
+
+ module Neighbor
+   module S3
+     class Error < StandardError; end
+
+     class << self
+       attr_writer :client
+
+       def client
+         @client ||= Aws::S3Vectors::Client.new
+       end
+     end
+   end
+ end
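Reviewer note on the module-level client: the `attr_writer :client` plus the memoized reader means a default client is built on first use and callers can swap in their own. The sketch below reproduces that pattern in isolation. `Demo` and `FakeClient` are hypothetical stand-ins so the example runs without the AWS SDK.

```ruby
# Hypothetical stand-in for Aws::S3Vectors::Client.
class FakeClient; end

module Demo
  class << self
    # Lets callers replace the shared client, for example with a test double.
    attr_writer :client

    # Builds the default client on first use and reuses it afterwards.
    def client
      @client ||= FakeClient.new
    end
  end
end

default = Demo.client
puts default.equal?(Demo.client) # the same object is reused

Demo.client = :custom_client
puts Demo.client.inspect
```

With the real gem, the same writer would presumably be used along the lines of `Neighbor::S3.client = Aws::S3Vectors::Client.new(region: "us-east-1")`, where the region value is only an example.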
data/lib/neighbor-s3.rb ADDED
@@ -0,0 +1 @@
+ require_relative "neighbor/s3"
metadata ADDED
@@ -0,0 +1,59 @@
+ --- !ruby/object:Gem::Specification
+ name: neighbor-s3
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Andrew Kane
+ bindir: bin
+ cert_chain: []
+ date: 1980-01-02 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: aws-sdk-s3vectors
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ email: andrew@ankane.org
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - lib/neighbor-s3.rb
+ - lib/neighbor/s3.rb
+ - lib/neighbor/s3/index.rb
+ - lib/neighbor/s3/version.rb
+ homepage: https://github.com/ankane/neighbor-s3
+ licenses:
+ - MIT
+ metadata: {}
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.2'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.6.9
+ specification_version: 4
+ summary: Nearest neighbor search for Ruby and S3 Vectors
+ test_files: []