neighbor 0.3.2 → 0.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 0c8b5d19222742f33f51f2c30f9d03108ebd3ed99908a7e9dd5f4e49caa2e225
4
- data.tar.gz: c9cfa942f2cdd8b9757c9ecfe5e89d0aced11263f8a559004ee15fa0c8adb3f4
3
+ metadata.gz: 8aa6de2790d94de9411b0142836b2ad181a411e299fce4b98357b96ac4161183
4
+ data.tar.gz: 2924d7f15f5b36bc89ee72372c1bfeb373d99481269696a9a9dcc41f90201f38
5
5
  SHA512:
6
- metadata.gz: e9e0050031ce7691baa9242b3b6b5aa76afb1fe7c63575129e68b2f5c027143b3c08f68a7babfcf2a9b02f1d9327679f75e9c40b95ac2245ea7c8dd3025d3cdb
7
- data.tar.gz: a9c505740cba454437617733d4025360848a16ef9a4c9c83fc16d5bc82a3e5521c77e3cba874ef3cf318cf3a1e319567958a6156481f7fd82ef72ebaa87d97eb
6
+ metadata.gz: 2bc1b3ee6d5b1ee0ab175b017e753cf958bd8ceb1ef2a23ba769770dfebf54eec251ac59c8f5f3b6ca56efcbad1763c34622b94924a017622c2f78fc8740f762
7
+ data.tar.gz: d946dda99833964582f63863b2d898fea6bf065312cf60aec873631df96195e1a54375606ad9c9cc0f767937cdb7ea38b0d9990efcbbeab15ccbb11f8a2020ef
data/CHANGELOG.md CHANGED
@@ -1,3 +1,22 @@
1
+ ## 0.4.1 (2024-08-26)
2
+
3
+ - Added `precision` option
4
+ - Added support for `bit` dimensions to model generator
5
+ - Fixed error with Numo arrays
6
+
7
+ ## 0.4.0 (2024-06-25)
8
+
9
+ - Added support for `halfvec` and `sparsevec` types
10
+ - Added support for `taxicab`, `hamming`, and `jaccard` distances with `vector` extension
11
+ - Added deserialization for `cube` and `vector` columns without `has_neighbor`
12
+ - Added support for composite primary keys
13
+ - Changed `nearest_neighbors` to replace previous `order` scopes
14
+ - Changed `normalize` option to use `before_save` callback
15
+ - Changed dimensions and finite values checks to use Active Record validations
16
+ - Fixed issue with `nearest_neighbors` scope overriding `select` values
17
+ - Removed default attribute name
18
+ - Dropped support for Ruby < 3.1
19
+
1
20
  ## 0.3.2 (2023-12-12)
2
21
 
3
22
  - Added deprecation warning for `has_neighbors` without an attribute name
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
1
1
  The MIT License (MIT)
2
2
 
3
- Copyright (c) 2021-2023 Andrew Kane
3
+ Copyright (c) 2021-2024 Andrew Kane
4
4
 
5
5
  Permission is hereby granted, free of charge, to any person obtaining a copy
6
6
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  Nearest neighbor search for Rails and Postgres
4
4
 
5
- [![Build Status](https://github.com/ankane/neighbor/workflows/build/badge.svg?branch=master)](https://github.com/ankane/neighbor/actions)
5
+ [![Build Status](https://github.com/ankane/neighbor/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/neighbor/actions)
6
6
 
7
7
  ## Installation
8
8
 
@@ -35,7 +35,7 @@ rails db:migrate
35
35
  Create a migration
36
36
 
37
37
  ```ruby
38
- class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
38
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
39
39
  def change
40
40
  add_column :items, :embedding, :cube
41
41
  # or
@@ -76,9 +76,11 @@ Supported values are:
76
76
 
77
77
  - `euclidean`
78
78
  - `cosine`
79
- - `taxicab` (cube only)
79
+ - `taxicab`
80
80
  - `chebyshev` (cube only)
81
81
  - `inner_product` (vector only)
82
+ - `hamming` (vector only)
83
+ - `jaccard` (vector only)
82
84
 
83
85
  For cosine distance with cube, vectors must be normalized before being stored.
84
86
 
@@ -114,32 +116,114 @@ end
114
116
  For vector, add an approximate index to speed up queries. Create a migration with:
115
117
 
116
118
  ```ruby
117
- class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
119
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
118
120
  def change
119
- add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
120
- # or with pgvector 0.5.0+
121
121
  add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
122
+ # or
123
+ add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
122
124
  end
123
125
  end
124
126
  ```
125
127
 
126
128
  Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
127
129
 
128
- Set the number of probes with IVFFlat
130
+ Set the size of the dynamic candidate list with HNSW
131
+
132
+ ```ruby
133
+ Item.connection.execute("SET hnsw.ef_search = 100")
134
+ ```
135
+
136
+ Or the number of probes with IVFFlat
129
137
 
130
138
  ```ruby
131
139
  Item.connection.execute("SET ivfflat.probes = 3")
132
140
  ```
133
141
 
134
- Or the size of the dynamic candidate list with HNSW
142
+ ## Half-Precision Vectors
143
+
144
+ Use the `halfvec` type to store half-precision vectors
135
145
 
136
146
  ```ruby
137
- Item.connection.execute("SET hnsw.ef_search = 100")
147
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
148
+ def change
149
+ add_column :items, :embedding, :halfvec, limit: 3 # dimensions
150
+ end
151
+ end
152
+ ```
153
+
154
+ ## Half-Precision Indexing
155
+
156
+ Index vectors at half precision for smaller indexes
157
+
158
+ ```ruby
159
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
160
+ def change
161
+ add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
162
+ end
163
+ end
164
+ ```
165
+
166
+ Get the nearest neighbors
167
+
168
+ ```ruby
169
+ Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
170
+ ```
171
+
172
+ ## Binary Vectors
173
+
174
+ Use the `bit` type to store binary vectors
175
+
176
+ ```ruby
177
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
178
+ def change
179
+ add_column :items, :embedding, :bit, limit: 3 # dimensions
180
+ end
181
+ end
182
+ ```
183
+
184
+ Get the nearest neighbors by Hamming distance
185
+
186
+ ```ruby
187
+ Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
188
+ ```
189
+
190
+ ## Binary Quantization
191
+
192
+ Use expression indexing for binary quantization
193
+
194
+ ```ruby
195
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
196
+ def change
197
+ add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
198
+ end
199
+ end
200
+ ```
201
+
202
+ ## Sparse Vectors
203
+
204
+ Use the `sparsevec` type to store sparse vectors
205
+
206
+ ```ruby
207
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
208
+ def change
209
+ add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
210
+ end
211
+ end
212
+ ```
213
+
214
+ Get the nearest neighbors
215
+
216
+ ```ruby
217
+ embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
218
+ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
138
219
  ```
139
220
 
140
221
  ## Examples
141
222
 
142
223
  - [OpenAI Embeddings](#openai-embeddings)
224
+ - [Cohere Embeddings](#cohere-embeddings)
225
+ - [Sentence Embeddings](#sentence-embeddings)
226
+ - [Sparse Embeddings](#sparse-embeddings)
143
227
  - [Disco Recommendations](#disco-recommendations)
144
228
 
145
229
  ### OpenAI Embeddings
@@ -170,10 +254,10 @@ def fetch_embeddings(input)
170
254
  }
171
255
  data = {
172
256
  input: input,
173
- model: "text-embedding-ada-002"
257
+ model: "text-embedding-3-small"
174
258
  }
175
259
 
176
- response = Net::HTTP.post(URI(url), data.to_json, headers)
260
+ response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
177
261
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
178
262
  end
179
263
  ```
@@ -199,14 +283,221 @@ end
199
283
  Document.insert_all!(documents)
200
284
  ```
201
285
 
202
- And get similar articles
286
+ And get similar documents
287
+
288
+ ```ruby
289
+ document = Document.first
290
+ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
291
+ ```
292
+
293
+ See the [complete code](examples/openai/example.rb)
294
+
295
+ ### Cohere Embeddings
296
+
297
+ Generate a model
298
+
299
+ ```sh
300
+ rails generate model Document content:text embedding:bit{1024}
301
+ rails db:migrate
302
+ ```
303
+
304
+ And add `has_neighbors`
305
+
306
+ ```ruby
307
+ class Document < ApplicationRecord
308
+ has_neighbors :embedding
309
+ end
310
+ ```
311
+
312
+ Create a method to call the [embed API](https://docs.cohere.com/reference/embed)
313
+
314
+ ```ruby
315
+ def fetch_embeddings(input, input_type)
316
+ url = "https://api.cohere.com/v1/embed"
317
+ headers = {
318
+ "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
319
+ "Content-Type" => "application/json"
320
+ }
321
+ data = {
322
+ texts: input,
323
+ model: "embed-english-v3.0",
324
+ input_type: input_type,
325
+ embedding_types: ["ubinary"]
326
+ }
327
+
328
+ response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
329
+ JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
330
+ end
331
+ ```
332
+
333
+ Pass your input
334
+
335
+ ```ruby
336
+ input = [
337
+ "The dog is barking",
338
+ "The cat is purring",
339
+ "The bear is growling"
340
+ ]
341
+ embeddings = fetch_embeddings(input, "search_document")
342
+ ```
343
+
344
+ Store the embeddings
345
+
346
+ ```ruby
347
+ documents = []
348
+ input.zip(embeddings) do |content, embedding|
349
+ documents << {content: content, embedding: embedding}
350
+ end
351
+ Document.insert_all!(documents)
352
+ ```
353
+
354
+ Embed the search query
355
+
356
+ ```ruby
357
+ query = "forest"
358
+ query_embedding = fetch_embeddings([query], "search_query")[0]
359
+ ```
360
+
361
+ And search the documents
362
+
363
+ ```ruby
364
+ Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
365
+ ```
366
+
367
+ See the [complete code](examples/cohere/example.rb)
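The Cohere example above turns each `ubinary` embedding (an array of unsigned bytes) into a bit string with `v.chr.unpack1("B*")` before storing it in a `bit` column. A minimal sketch of that conversion in plain Ruby follows. Both helper names are hypothetical and not part of the gem, and `hamming` is included only to illustrate what the `hamming` distance counts; in practice Postgres computes it.

```ruby
# Hypothetical helper: expand an array of unsigned bytes (0..255) into a
# string of "0"/"1" characters, most significant bit first.
def bytes_to_bit_string(bytes)
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

# Hypothetical helper: count the positions where two equal-length
# bit strings differ.
def hamming(a, b)
  raise ArgumentError, "length mismatch" unless a.size == b.size
  a.chars.zip(b.chars).count { |x, y| x != y }
end

doc = bytes_to_bit_string([5, 255])   # "0000010111111111"
query = bytes_to_bit_string([4, 254]) # "0000010011111110"
puts hamming(doc, query)              # prints 2
```

Each byte contributes 8 bits, so a 1024-dimension `bit` column corresponds to 128 bytes per embedding.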
368
+
369
+ ### Sentence Embeddings
370
+
371
+ You can generate embeddings locally with [Informers](https://github.com/ankane/informers).
372
+
373
+ Generate a model
374
+
375
+ ```sh
376
+ rails generate model Document content:text embedding:vector{384}
377
+ rails db:migrate
378
+ ```
379
+
380
+ And add `has_neighbors`
381
+
382
+ ```ruby
383
+ class Document < ApplicationRecord
384
+ has_neighbors :embedding
385
+ end
386
+ ```
387
+
388
+ Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
389
+
390
+ ```ruby
391
+ model = Informers::Model.new("sentence-transformers/all-MiniLM-L6-v2")
392
+ ```
393
+
394
+ Pass your input
395
+
396
+ ```ruby
397
+ input = [
398
+ "The dog is barking",
399
+ "The cat is purring",
400
+ "The bear is growling"
401
+ ]
402
+ embeddings = model.embed(input)
403
+ ```
404
+
405
+ Store the embeddings
406
+
407
+ ```ruby
408
+ documents = []
409
+ input.zip(embeddings) do |content, embedding|
410
+ documents << {content: content, embedding: embedding}
411
+ end
412
+ Document.insert_all!(documents)
413
+ ```
414
+
415
+ And get similar documents
203
416
 
204
417
  ```ruby
205
418
  document = Document.first
206
419
  document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
207
420
  ```
208
421
 
209
- See the [complete code](examples/openai_embeddings.rb)
422
+ See the [complete code](examples/informers/example.rb)
423
+
424
+ ### Sparse Embeddings
425
+
426
+ You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
427
+
428
+ Generate a model
429
+
430
+ ```sh
431
+ rails generate model Document content:text embedding:sparsevec{30522}
432
+ rails db:migrate
433
+ ```
434
+
435
+ And add `has_neighbors`
436
+
437
+ ```ruby
438
+ class Document < ApplicationRecord
439
+ has_neighbors :embedding
440
+ end
441
+ ```
442
+
443
+ Load a [model](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) to generate embeddings
444
+
445
+ ```ruby
446
+ class EmbeddingModel
447
+ def initialize(model_id)
448
+ @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
449
+ @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
450
+ @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
451
+ end
452
+
453
+ def embed(input)
454
+ feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
455
+ output = @model.(**feature)[0]
456
+ values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
457
+ values = Torch.log(1 + Torch.relu(values))
458
+ values[0.., @special_token_ids] = 0
459
+ values.to_a
460
+ end
461
+ end
462
+
463
+ model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
464
+ ```
465
+
466
+ Pass your input
467
+
468
+ ```ruby
469
+ input = [
470
+ "The dog is barking",
471
+ "The cat is purring",
472
+ "The bear is growling"
473
+ ]
474
+ embeddings = model.embed(input)
475
+ ```
476
+
477
+ Store the embeddings
478
+
479
+ ```ruby
480
+ documents = []
481
+ input.zip(embeddings) do |content, embedding|
482
+ documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
483
+ end
484
+ Document.insert_all!(documents)
485
+ ```
486
+
487
+ Embed the search query
488
+
489
+ ```ruby
490
+ query = "forest"
491
+ query_embedding = model.embed([query])[0]
492
+ ```
493
+
494
+ And search the documents
495
+
496
+ ```ruby
497
+ Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
498
+ ```
499
+
500
+ See the [complete code](examples/sparse/example.rb)
210
501
 
211
502
  ### Disco Recommendations
212
503
 
@@ -242,7 +533,7 @@ movies = []
242
533
  recommender.item_ids.each do |item_id|
243
534
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
244
535
  end
245
- Movie.insert_all!(movies) # use create! for Active Record < 6
536
+ Movie.insert_all!(movies)
246
537
  ```
247
538
 
248
539
  And get similar movies
@@ -252,19 +543,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
252
543
  movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
253
544
  ```
254
545
 
255
- See the complete code for [cube](examples/disco_item_recs_cube.rb) and [vector](examples/disco_item_recs_vector.rb)
256
-
257
- ## Upgrading
258
-
259
- ### 0.2.0
260
-
261
- The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
262
-
263
- ```ruby
264
- class Item < ApplicationRecord
265
- has_neighbors normalize: true
266
- end
267
- ```
546
+ See the complete code for [cube](examples/disco/item_recs_cube.rb) and [vector](examples/disco/item_recs_vector.rb)
268
547
 
269
548
  ## History
270
549
 
@@ -286,10 +565,5 @@ git clone https://github.com/ankane/neighbor.git
286
565
  cd neighbor
287
566
  bundle install
288
567
  createdb neighbor_test
289
-
290
- # cube
291
568
  bundle exec rake test
292
-
293
- # vector
294
- EXT=vector bundle exec rake test
295
569
  ```
@@ -1,3 +1,4 @@
1
+ require "rails/generators"
1
2
  require "rails/generators/active_record"
2
3
 
3
4
  module Neighbor
@@ -1,3 +1,4 @@
1
+ require "rails/generators"
1
2
  require "rails/generators/active_record"
2
3
 
3
4
  module Neighbor
@@ -2,11 +2,9 @@ module Neighbor
2
2
  module Model
3
3
  def has_neighbors(*attribute_names, dimensions: nil, normalize: nil)
4
4
  if attribute_names.empty?
5
- warn "[neighbor] has_neighbors without an attribute name is deprecated"
6
- attribute_names << :neighbor_vector
7
- else
8
- attribute_names.map!(&:to_sym)
5
+ raise ArgumentError, "has_neighbors requires an attribute name"
9
6
  end
7
+ attribute_names.map!(&:to_sym)
10
8
 
11
9
  class_eval do
12
10
  @neighbor_attributes ||= {}
@@ -27,30 +25,46 @@ module Neighbor
27
25
  attribute_names.each do |attribute_name|
28
26
  raise Error, "has_neighbors already called for #{attribute_name.inspect}" if neighbor_attributes[attribute_name]
29
27
  @neighbor_attributes[attribute_name] = {dimensions: dimensions, normalize: normalize}
30
-
31
- attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)
32
28
  end
33
29
 
34
30
  return if @neighbor_attributes.size != attribute_names.size
35
31
 
36
- scope :nearest_neighbors, ->(attribute_name, vector = nil, options = nil) {
37
- # cannot use keyword arguments with scope with Ruby 3.2 and Active Record 6.1
38
- # https://github.com/rails/rails/issues/46934
39
- if options.nil? && vector.is_a?(Hash)
40
- options = vector
41
- vector = nil
32
+ validate do
33
+ self.class.neighbor_attributes.each do |k, v|
34
+ value = read_attribute(k)
35
+ next if value.nil?
36
+
37
+ column_info = self.class.columns_hash[k.to_s]
38
+ dimensions = v[:dimensions] || column_info&.limit
39
+
40
+ if !Neighbor::Utils.validate_dimensions(value, column_info&.type, dimensions).nil?
41
+ errors.add(k, "must have #{dimensions} dimensions")
42
+ end
43
+ if !Neighbor::Utils.validate_finite(value, column_info&.type)
44
+ errors.add(k, "must have finite values")
45
+ end
42
46
  end
47
+ end
48
+
49
+ # TODO move to normalizes when Active Record < 7.1 no longer supported
50
+ before_save do
51
+ self.class.neighbor_attributes.each do |k, v|
52
+ next unless v[:normalize] && attribute_changed?(k)
53
+ value = read_attribute(k)
54
+ next if value.nil?
55
+ self[k] = Neighbor::Utils.normalize(value, column_info: self.class.columns_hash[k.to_s])
56
+ end
57
+ end
58
+
59
+ # cannot use keyword arguments with scope with Ruby 3.2 and Active Record 6.1
60
+ # https://github.com/rails/rails/issues/46934
61
+ scope :nearest_neighbors, ->(attribute_name, vector, options = nil) {
43
62
  raise ArgumentError, "missing keyword: :distance" unless options.is_a?(Hash) && options.key?(:distance)
44
63
  distance = options.delete(:distance)
64
+ precision = options.delete(:precision)
45
65
  raise ArgumentError, "unknown keywords: #{options.keys.map(&:inspect).join(", ")}" if options.any?
46
66
 
47
- if vector.nil? && !attribute_name.nil? && attribute_name.respond_to?(:to_a)
48
- warn "[neighbor] nearest_neighbors without an attribute name is deprecated"
49
- vector = attribute_name
50
- attribute_name = :neighbor_vector
51
- end
52
67
  attribute_name = attribute_name.to_sym
53
-
54
68
  options = neighbor_attributes[attribute_name]
55
69
  raise ArgumentError, "Invalid attribute" unless options
56
70
  normalize = options[:normalize]
@@ -62,10 +76,21 @@ module Neighbor
62
76
 
63
77
  quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"
64
78
 
65
- column_info = klass.type_for_attribute(attribute_name).column_info
79
+ column_info = columns_hash[attribute_name.to_s]
80
+ column_type = column_info&.type
66
81
 
67
82
  operator =
68
- if column_info[:type] == :vector
83
+ case column_type
84
+ when :bit
85
+ case distance
86
+ when "hamming"
87
+ "<~>"
88
+ when "jaccard"
89
+ "<%>"
90
+ when "hamming2"
91
+ "#"
92
+ end
93
+ when :vector, :halfvec, :sparsevec
69
94
  case distance
70
95
  when "inner_product"
71
96
  "<#>"
@@ -73,8 +98,10 @@ module Neighbor
73
98
  "<=>"
74
99
  when "euclidean"
75
100
  "<->"
101
+ when "taxicab"
102
+ "<+>"
76
103
  end
77
- else
104
+ when :cube
78
105
  case distance
79
106
  when "taxicab"
80
107
  "<#>"
@@ -83,27 +110,39 @@ module Neighbor
83
110
  when "euclidean", "cosine"
84
111
  "<->"
85
112
  end
113
+ else
114
+ raise ArgumentError, "Unsupported type: #{column_type}"
86
115
  end
87
116
 
88
117
  raise ArgumentError, "Invalid distance: #{distance}" unless operator
89
118
 
90
119
  # ensure normalize set (can be true or false)
91
- if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
120
+ if distance == "cosine" && column_type == :cube && normalize.nil?
92
121
  raise Neighbor::Error, "Set normalize for cosine distance with cube"
93
122
  end
94
123
 
95
- vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
124
+ column_attribute = klass.type_for_attribute(attribute_name)
125
+ vector = column_attribute.cast(vector)
126
+ Neighbor::Utils.validate(vector, dimensions: dimensions, column_info: column_info)
127
+ vector = Neighbor::Utils.normalize(vector, column_info: column_info) if normalize
128
+
129
+ query = connection.quote(column_attribute.serialize(vector))
96
130
 
97
- # important! neighbor_vector should already be typecast
98
- # but use to_f as extra safeguard against SQL injection
99
- query =
100
- if column_info[:type] == :vector
101
- connection.quote("[#{vector.map(&:to_f).join(", ")}]")
131
+ if !precision.nil?
132
+ case precision.to_s
133
+ when "half"
134
+ cast_dimensions = dimensions || column_info&.limit
135
+ raise ArgumentError, "Unknown dimensions" unless cast_dimensions
136
+ quoted_attribute += "::halfvec(#{connection.quote(cast_dimensions.to_i)})"
102
137
  else
103
- "cube(array[#{vector.map(&:to_f).join(", ")}])"
138
+ raise ArgumentError, "Invalid precision"
104
139
  end
140
+ end
105
141
 
106
142
  order = "#{quoted_attribute} #{operator} #{query}"
143
+ if operator == "#"
144
+ order = "bit_count(#{order})"
145
+ end
107
146
 
108
147
  # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
109
148
  # with normalized vectors:
@@ -111,31 +150,28 @@ module Neighbor
111
150
  # cosine distance = 1 - cosine similarity
112
151
  # this transformation doesn't change the order, so only needed for select
113
152
  neighbor_distance =
114
- if column_info[:type] != :vector && distance == "cosine"
153
+ if column_type == :cube && distance == "cosine"
115
154
  "POWER(#{order}, 2) / 2.0"
116
- elsif column_info[:type] == :vector && distance == "inner_product"
155
+ elsif [:vector, :halfvec, :sparsevec].include?(column_type) && distance == "inner_product"
117
156
  "(#{order}) * -1"
118
157
  else
119
158
  order
120
159
  end
121
160
 
122
161
  # for select, use column_names instead of * to account for ignored columns
123
- select(*column_names, "#{neighbor_distance} AS neighbor_distance")
162
+ select_columns = select_values.any? ? [] : column_names
163
+ select(*select_columns, "#{neighbor_distance} AS neighbor_distance")
124
164
  .where.not(attribute_name => nil)
125
- .order(Arel.sql(order))
165
+ .reorder(Arel.sql(order))
126
166
  }
127
167
 
128
- def nearest_neighbors(attribute_name = nil, **options)
129
- if attribute_name.nil?
130
- warn "[neighbor] nearest_neighbors without an attribute name is deprecated"
131
- attribute_name = :neighbor_vector
132
- end
168
+ def nearest_neighbors(attribute_name, **options)
133
169
  attribute_name = attribute_name.to_sym
134
- # important! check if neighbor attribute before calling send
170
+ # important! check if neighbor attribute before accessing
135
171
  raise ArgumentError, "Invalid attribute" unless self.class.neighbor_attributes[attribute_name]
136
172
 
137
173
  self.class
138
- .where.not(self.class.primary_key => self[self.class.primary_key])
174
+ .where.not(Array(self.class.primary_key).to_h { |k| [k, self[k]] })
139
175
  .nearest_neighbors(attribute_name, self[attribute_name], **options)
140
176
  end
141
177
  end
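The comments in the hunk above rely on the fact that, for normalized vectors, cosine distance can be derived from Euclidean distance, which is why the `cube` branch selects `POWER(order, 2) / 2.0`. A small numeric check in plain Ruby, assuming unit-length inputs, is sketched below. All helper names are illustrative and not part of the gem.

```ruby
# Illustrative helpers, not part of the gem.
def l2_normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  norm > 0 ? v.map { |x| x / norm } : v
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([3.0, 1.0, 2.0])

# For unit vectors: |a - b|^2 = 2 - 2 * (a . b), so |a - b|^2 / 2 = 1 - (a . b)
from_euclidean = euclidean_distance(a, b)**2 / 2.0
same = (from_euclidean - cosine_distance(a, b)).abs < 1e-9
puts same
```

Because squaring and halving is monotonic for non-negative distances, the ordering by Euclidean distance already matches the ordering by cosine distance; the transformation only matters for the reported `neighbor_distance` value.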
@@ -1,16 +1,16 @@
1
1
  module Neighbor
2
2
  class Railtie < Rails::Railtie
3
3
  generators do
4
+ require "rails/generators/generated_attribute"
5
+
4
6
  # rails generate model Item embedding:vector{3}
5
- if defined?(Rails::Generators::GeneratedAttribute)
6
- Rails::Generators::GeneratedAttribute.singleton_class.prepend(Neighbor::GeneratedAttribute)
7
- end
7
+ Rails::Generators::GeneratedAttribute.singleton_class.prepend(Neighbor::GeneratedAttribute)
8
8
  end
9
9
  end
10
10
 
11
11
  module GeneratedAttribute
12
12
  def parse_type_and_options(type, *, **)
13
- if type =~ /\A(vector)\{(\d+)\}\z/
13
+ if type =~ /\A(vector|halfvec|bit|sparsevec)\{(\d+)\}\z/
14
14
  return $1, limit: $2.to_i
15
15
  end
16
16
  super
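The updated pattern above is what lets `rails generate model` accept types such as `halfvec{3}` or `bit{1024}`. A standalone sketch of the same match is shown below. The constant and method names are hypothetical; only the regular expression is taken from the hunk.

```ruby
# Same pattern as in the hunk above; the names are hypothetical.
TYPE_WITH_DIMENSIONS = /\A(vector|halfvec|bit|sparsevec)\{(\d+)\}\z/

# Returns [type, options] for a recognized type, or nil otherwise.
def parse_vector_type(type)
  match = TYPE_WITH_DIMENSIONS.match(type)
  return nil unless match
  [match[1], {limit: match[2].to_i}]
end

p parse_vector_type("halfvec{3}")
p parse_vector_type("bit{1024}")
p parse_vector_type("cube{3}")
```

The last call returns `nil` because `cube` is not in the alternation; in the gem that case falls through to `super`.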
@@ -0,0 +1,79 @@
1
+ module Neighbor
2
+ class SparseVector
3
+ attr_reader :dimensions, :indices, :values
4
+
5
+ NO_DEFAULT = Object.new
6
+
7
+ def initialize(value, dimensions = NO_DEFAULT)
8
+ if value.is_a?(Hash)
9
+ if dimensions == NO_DEFAULT
10
+ raise ArgumentError, "missing dimensions"
11
+ end
12
+ from_hash(value, dimensions)
13
+ else
14
+ unless dimensions == NO_DEFAULT
15
+ raise ArgumentError, "extra argument"
16
+ end
17
+ from_array(value)
18
+ end
19
+ end
20
+
21
+ def to_s
22
+ "{#{@indices.zip(@values).map { |i, v| "#{i.to_i + 1}:#{v.to_f}" }.join(",")}}/#{@dimensions.to_i}"
23
+ end
24
+
25
+ def to_a
26
+ arr = Array.new(dimensions, 0.0)
27
+ @indices.zip(@values) do |i, v|
28
+ arr[i] = v
29
+ end
30
+ arr
31
+ end
32
+
33
+ private
34
+
35
+ def from_hash(data, dimensions)
36
+ elements = data.select { |_, v| v != 0 }.sort
37
+ @dimensions = dimensions.to_i
38
+ @indices = elements.map { |v| v[0].to_i }
39
+ @values = elements.map { |v| v[1].to_f }
40
+ end
41
+
42
+ def from_array(arr)
43
+ arr = arr.to_a
44
+ @dimensions = arr.size
45
+ @indices = []
46
+ @values = []
47
+ arr.each_with_index do |v, i|
48
+ if v != 0
49
+ @indices << i
50
+ @values << v.to_f
51
+ end
52
+ end
53
+ end
54
+
55
+ class << self
56
+ def from_text(string)
57
+ elements, dimensions = string.split("/", 2)
58
+ indices = []
59
+ values = []
60
+ elements[1..-2].split(",").each do |e|
61
+ index, value = e.split(":", 2)
62
+ indices << index.to_i - 1
63
+ values << value.to_f
64
+ end
65
+ from_parts(dimensions.to_i, indices, values)
66
+ end
67
+
68
+ private
69
+
70
+ def from_parts(dimensions, indices, values)
71
+ vec = allocate
72
+ vec.instance_variable_set(:@dimensions, dimensions)
73
+ vec.instance_variable_set(:@indices, indices)
74
+ vec.instance_variable_set(:@values, values)
75
+ vec
76
+ end
77
+ end
78
+ end
79
+ end
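The class above serializes to the `{index:value,...}/dimensions` text form with 1-based indices, and `from_text` reverses it. A minimal round-trip sketch in plain Ruby, working on dense arrays rather than `Neighbor::SparseVector` instances, is shown below. The two method names are hypothetical.

```ruby
# Hypothetical helpers mirroring the text format used in the class above.
# Indices in the text form are 1-based; zero entries are omitted.
def sparse_to_text(array)
  pairs = array.each_with_index.reject { |v, _| v == 0 }
  "{#{pairs.map { |v, i| "#{i + 1}:#{v.to_f}" }.join(",")}}/#{array.size}"
end

def sparse_from_text(string)
  elements, dimensions = string.split("/", 2)
  dense = Array.new(dimensions.to_i, 0.0)
  # Strip the surrounding braces, then read each "index:value" pair.
  elements[1..-2].split(",").each do |e|
    index, value = e.split(":", 2)
    dense[index.to_i - 1] = value.to_f
  end
  dense
end

text = sparse_to_text([0.9, 0, 1.1])
puts text                 # prints {1:0.9,3:1.1}/3
p sparse_from_text(text)  # prints [0.9, 0.0, 1.1]
```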
@@ -1,36 +1,41 @@
1
1
  module Neighbor
2
2
  module Type
3
- class Cube < ActiveRecord::Type::String
3
+ class Cube < ActiveRecord::Type::Value
4
4
  def type
5
5
  :cube
6
6
  end
7
7
 
8
- def cast(value)
9
- if value.is_a?(Array)
8
+ def serialize(value)
9
+ if value.respond_to?(:to_a)
10
+ value = value.to_a
10
11
  if value.first.is_a?(Array)
11
- value.map { |v| cast_point(v) }.join(", ")
12
+ value = value.map { |v| serialize_point(v) }.join(", ")
12
13
  else
13
- cast_point(value)
14
+ value = serialize_point(value)
14
15
  end
15
- else
16
- super
17
16
  end
17
+ super(value)
18
18
  end
19
19
 
20
- # TODO uncomment in 0.4.0
21
- # def deserialize(value)
22
- # if value.nil?
23
- # super
24
- # elsif value.include?("),(")
25
- # value[1..-1].split("),(").map { |v| v.split(",").map(&:to_f) }
26
- # else
27
- # value[1..-1].split(",").map(&:to_f)
28
- # end
29
- # end
30
-
31
20
  private
32
21
 
33
- def cast_point(value)
22
+ def cast_value(value)
23
+ if value.respond_to?(:to_a)
24
+ value.to_a
25
+ elsif value.is_a?(Numeric)
26
+ [value]
27
+ elsif value.is_a?(String)
28
+ if value.include?("),(")
29
+ value[1..-1].split("),(").map { |v| v.split(",").map(&:to_f) }
30
+ else
31
+ value[1..-1].split(",").map(&:to_f)
32
+ end
33
+ else
34
+ raise "can't cast #{value.class.name} to cube"
35
+ end
36
+ end
37
+
38
+ def serialize_point(value)
34
39
  "(#{value.map(&:to_f).join(", ")})"
35
40
  end
36
41
  end
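The new `cast_value` above parses the text that Postgres returns for a `cube` column: a single point such as `(1, 2, 3)`, or two corner points separated by `),(`. A standalone sketch of that string branch is given below; the method name is hypothetical.

```ruby
# Hypothetical helper reproducing the String branch of cast_value above.
def parse_cube_text(value)
  if value.include?("),(")
    # Two corner points: "(1, 2),(3, 4)"
    value[1..-1].split("),(").map { |v| v.split(",").map(&:to_f) }
  else
    # Single point: "(1, 2, 3)"
    value[1..-1].split(",").map(&:to_f)
  end
end

p parse_cube_text("(1, 2, 3)")      # prints [1.0, 2.0, 3.0]
p parse_cube_text("(1, 2),(3, 4)")  # prints [[1.0, 2.0], [3.0, 4.0]]
```

The slice `[1..-1]` drops only the opening parenthesis. The closing one is left on the last element and ignored by `String#to_f`, which stops at the first character that is not part of a number.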
@@ -0,0 +1,28 @@
1
+ module Neighbor
2
+ module Type
3
+ class Halfvec < ActiveRecord::Type::Value
4
+ def type
5
+ :halfvec
6
+ end
7
+
8
+ def serialize(value)
9
+ if value.respond_to?(:to_a)
10
+ value = "[#{value.to_a.map(&:to_f).join(",")}]"
11
+ end
12
+ super(value)
13
+ end
14
+
15
+ private
16
+
17
+ def cast_value(value)
18
+ if value.is_a?(String)
19
+ value[1..-1].split(",").map(&:to_f)
20
+ elsif value.respond_to?(:to_a)
21
+ value.to_a
22
+ else
23
+ raise "can't cast #{value.class.name} to halfvec"
24
+ end
25
+ end
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,30 @@
1
+ module Neighbor
2
+ module Type
3
+ class Sparsevec < ActiveRecord::Type::Value
4
+ def type
5
+ :sparsevec
6
+ end
7
+
8
+ def serialize(value)
9
+ if value.is_a?(SparseVector)
10
+ value = "{#{value.indices.zip(value.values).map { |i, v| "#{i.to_i + 1}:#{v.to_f}" }.join(",")}}/#{value.dimensions.to_i}"
11
+ end
12
+ super(value)
13
+ end
14
+
15
+ private
16
+
17
+ def cast_value(value)
18
+ if value.is_a?(SparseVector)
19
+ value
20
+ elsif value.is_a?(String)
21
+ SparseVector.from_text(value)
22
+ elsif value.respond_to?(:to_a)
23
+ value = SparseVector.new(value.to_a)
24
+ else
25
+ raise "can't cast #{value.class.name} to sparsevec"
26
+ end
27
+ end
28
+ end
29
+ end
30
+ end
@@ -1,14 +1,28 @@
1
1
  module Neighbor
2
2
  module Type
3
- class Vector < ActiveRecord::Type::String
3
+ class Vector < ActiveRecord::Type::Value
4
4
  def type
5
5
  :vector
6
6
  end
7
7
 
8
- # TODO uncomment in 0.4.0
9
- # def deserialize(value)
10
- # value[1..-1].split(",").map(&:to_f) unless value.nil?
11
- # end
8
+ def serialize(value)
9
+ if value.respond_to?(:to_a)
10
+ value = "[#{value.to_a.map(&:to_f).join(",")}]"
11
+ end
12
+ super(value)
13
+ end
14
+
15
+ private
16
+
17
+ def cast_value(value)
18
+ if value.is_a?(String)
19
+ value[1..-1].split(",").map(&:to_f)
20
+ elsif value.respond_to?(:to_a)
21
+ value.to_a
22
+ else
23
+ raise "can't cast #{value.class.name} to vector"
24
+ end
25
+ end
12
26
  end
13
27
  end
14
28
  end
@@ -0,0 +1,42 @@
1
+ module Neighbor
2
+ module Utils
3
+ def self.validate_dimensions(value, type, expected)
4
+ dimensions = type == :sparsevec ? value.dimensions : value.size
5
+ if expected && dimensions != expected
6
+ "Expected #{expected} dimensions, not #{dimensions}"
7
+ end
8
+ end
9
+
10
+ def self.validate_finite(value, type)
11
+ case type
12
+ when :bit
13
+ true
14
+ when :sparsevec
15
+ value.values.all?(&:finite?)
16
+ else
17
+ value.all?(&:finite?)
18
+ end
19
+ end
20
+
21
+ def self.validate(value, dimensions:, column_info:)
22
+ if (message = validate_dimensions(value, column_info&.type, dimensions || column_info&.limit))
23
+ raise Error, message
24
+ end
25
+
26
+ if !validate_finite(value, column_info&.type)
27
+ raise Error, "Values must be finite"
28
+ end
29
+ end
30
+
31
+ def self.normalize(value, column_info:)
32
+ raise Error, "Normalize not supported for type" unless [:cube, :vector, :halfvec].include?(column_info&.type)
33
+
34
+ norm = Math.sqrt(value.sum { |v| v * v })
35
+
36
+ # store zero vector as all zeros
37
+ # since NaN makes the distance always 0
38
+ # could also throw error
39
+ norm > 0 ? value.map { |v| v / norm } : value
40
+ end
41
+ end
42
+ end
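The comment in `normalize` above notes that a zero vector is stored as all zeros rather than divided by its norm. The sketch below, in plain Ruby, illustrates why: dividing by a zero norm yields `NaN` values, which would then fail a finite-values check like the one in this module. The helper name is hypothetical.

```ruby
# Hypothetical stand-in for the finite check on dense vectors.
def finite_values?(value)
  value.all?(&:finite?)
end

zero = [0.0, 0.0]
norm = Math.sqrt(zero.sum { |v| v * v })

# Naive normalization: 0.0 / 0.0 is NaN for floats.
naive = zero.map { |v| v / norm }

# Guarded normalization, as in the module above: keep the zero vector as-is.
guarded = norm > 0 ? zero.map { |v| v / norm } : zero

puts naive.all?(&:nan?)      # prints true
puts finite_values?(naive)   # prints false
puts finite_values?(guarded) # prints true
```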
@@ -1,3 +1,3 @@
1
1
  module Neighbor
2
- VERSION = "0.3.2"
2
+ VERSION = "0.4.1"
3
3
  end
data/lib/neighbor.rb CHANGED
@@ -2,6 +2,8 @@
2
2
  require "active_support"
3
3
 
4
4
  # modules
5
+ require_relative "neighbor/sparse_vector"
6
+ require_relative "neighbor/utils"
5
7
  require_relative "neighbor/version"
6
8
 
7
9
  module Neighbor
@@ -11,6 +13,14 @@ module Neighbor
11
13
  def initialize_type_map(m = type_map)
12
14
  super
13
15
  m.register_type "cube", Type::Cube.new
16
+ m.register_type "halfvec" do |_, _, sql_type|
17
+ limit = extract_limit(sql_type)
18
+ Type::Halfvec.new(limit: limit)
19
+ end
20
+ m.register_type "sparsevec" do |_, _, sql_type|
21
+ limit = extract_limit(sql_type)
22
+ Type::Sparsevec.new(limit: limit)
23
+ end
14
24
  m.register_type "vector" do |_, _, sql_type|
15
25
  limit = extract_limit(sql_type)
16
26
  Type::Vector.new(limit: limit)
@@ -21,8 +31,9 @@ end
21
31
 
22
32
  ActiveSupport.on_load(:active_record) do
23
33
  require_relative "neighbor/model"
24
- require_relative "neighbor/vector"
25
34
  require_relative "neighbor/type/cube"
35
+ require_relative "neighbor/type/halfvec"
36
+ require_relative "neighbor/type/sparsevec"
26
37
  require_relative "neighbor/type/vector"
27
38
 
28
39
  extend Neighbor::Model
@@ -31,10 +42,12 @@ ActiveSupport.on_load(:active_record) do

    # ensure schema can be dumped
    ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:halfvec] = {name: "halfvec"}
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:sparsevec] = {name: "sparsevec"}
    ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}

    # ensure schema can be loaded
-   ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
+   ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :halfvec, :sparsevec, :vector)

    # prevent unknown OID warning
    if ActiveRecord::VERSION::MAJOR >= 7
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.3.2
+   version: 0.4.1
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-12-12 00:00:00.000000000 Z
+ date: 2024-08-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -40,9 +40,12 @@ files:
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/railtie.rb
+ - lib/neighbor/sparse_vector.rb
  - lib/neighbor/type/cube.rb
+ - lib/neighbor/type/halfvec.rb
+ - lib/neighbor/type/sparsevec.rb
  - lib/neighbor/type/vector.rb
- - lib/neighbor/vector.rb
+ - lib/neighbor/utils.rb
  - lib/neighbor/version.rb
  homepage: https://github.com/ankane/neighbor
  licenses:
@@ -56,14 +59,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: '3'
+       version: '3.1'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.4.10
+ rubygems_version: 3.5.11
  signing_key:
  specification_version: 4
  summary: Nearest neighbor search for Rails and Postgres
@@ -1,65 +0,0 @@
- module Neighbor
-   class Vector < ActiveRecord::Type::Value
-     def initialize(dimensions:, normalize:, model:, attribute_name:)
-       super()
-       @dimensions = dimensions
-       @normalize = normalize
-       @model = model
-       @attribute_name = attribute_name
-     end
-
-     def self.cast(value, dimensions:, normalize:, column_info:)
-       value = value.to_a.map(&:to_f)
-
-       dimensions ||= column_info[:dimensions]
-       raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
-
-       raise Error, "Values must be finite" unless value.all?(&:finite?)
-
-       if normalize
-         norm = Math.sqrt(value.sum { |v| v * v })
-
-         # store zero vector as all zeros
-         # since NaN makes the distance always 0
-         # could also throw error
-
-         # safe to update in-place since earlier map dups
-         value.map! { |v| v / norm } if norm > 0
-       end
-
-       value
-     end
-
-     def self.column_info(model, attribute_name)
-       attribute_name = attribute_name.to_s
-       column = model.columns.detect { |c| c.name == attribute_name }
-       {
-         type: column.try(:type),
-         dimensions: column.try(:limit)
-       }
-     end
-
-     # need to be careful to avoid loading column info before needed
-     def column_info
-       @column_info ||= self.class.column_info(@model, @attribute_name)
-     end
-
-     def cast(value)
-       self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
-     end
-
-     def serialize(value)
-       unless value.nil?
-         if column_info[:type] == :vector
-           "[#{cast(value).join(", ")}]"
-         else
-           "(#{cast(value).join(", ")})"
-         end
-       end
-     end
-
-     def deserialize(value)
-       value[1..-1].split(",").map(&:to_f) unless value.nil?
-     end
-   end
- end