neighbor 0.1.2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e64a3292445759187b7c08fd9cc54fe0da340ea3e3ab5de909620de6395b4984
- data.tar.gz: 274d73ed464a0503973c5549022dca8820257fe470c9746b5599da9a153d46e3
+ metadata.gz: 9a8839eeaeeef5b3ff8deb63165791929a81f9a9c1c807dd3d094e9bf770e881
+ data.tar.gz: 14035d37686f7035bc0e874de7933259dc0b8be9d046fca1d31867cacaf1c922
  SHA512:
- metadata.gz: 17df9f8a5848337fc570c6fa685f54f4aa9d49bea8e52e74fbec098e0bc9e450b655c7d96cb66c6e8b6aaae4824f2c831ea26a14f7f46b50ef64c955cfb9839b
- data.tar.gz: a47165d174ec86ebfdcd128e62c50b85d097aac535b6b6d424ea31988428a57237cdade2829d8e22a15e566ac7597371c95fabafad42cdbc5d56f063abab55e4
+ metadata.gz: bfbf7bca1b4290abf2ec91dd5ea032dd1878ab9d7a0938dfed4e6e63441aa557397345090713222a735beb32bc9f1b6cd7971536de9ca427db8650f053917040
+ data.tar.gz: 1ba36cba82aafbb2bb003e016f0817884b1129a7d0a8b2be7385e438999e6e7728224f7e146d4de611e171fc851ae64ebea3aa7bfa65f55868cb0c5477d8ecff
data/CHANGELOG.md CHANGED
@@ -1,3 +1,16 @@
+ ## 0.2.0 (2021-04-21)
+
+ - Added support for pgvector
+ - Added `normalize` option
+ - Made `dimensions` optional
+ - Raise an error if `nearest_neighbors` already defined
+ - Raise an error for non-finite values
+ - Fixed NaN with zero vectors and cosine distance
+
+ Breaking changes
+
+ - The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
  ## 0.1.2 (2021-02-21)

  - Added `nearest_neighbors` scope
data/README.md CHANGED
@@ -12,15 +12,23 @@ Add this line to your application’s Gemfile:
  gem 'neighbor'
  ```

- And run:
+ ## Choose An Extension
+
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+ For cube, run:

  ```sh
- bundle install
- rails generate neighbor:install
+ rails generate neighbor:cube
  rails db:migrate
  ```

- This enables the [cube extension](https://www.postgresql.org/docs/current/cube.html) in Postgres
+ For vector, install [pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+ ```sh
+ rails generate neighbor:vector
+ rails db:migrate
+ ```

  ## Getting Started

@@ -30,6 +38,8 @@ Create a migration
  class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
  def change
  add_column :items, :neighbor_vector, :cube
+ # or
+ add_column :items, :neighbor_vector, :vector, limit: 3
  end
  end
  ```
@@ -38,7 +48,7 @@ Add to your model

  ```ruby
  class Item < ApplicationRecord
- has_neighbors dimensions: 3
+ has_neighbors
  end
  ```

@@ -48,49 +58,76 @@ Update the vectors
  item.update(neighbor_vector: [1.0, 1.2, 0.5])
  ```

- > With cosine distance (the default), vectors are normalized before being stored
-
  Get the nearest neighbors to a record

  ```ruby
- item.nearest_neighbors.first(5)
+ item.nearest_neighbors(distance: "euclidean").first(5)
  ```

  Get the nearest neighbors to a vector

  ```ruby
- Item.nearest_neighbors([1, 2, 3])
+ Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```

  ## Distance

- Specify the distance metric
+ Supported values are:
+
+ - `euclidean`
+ - `cosine`
+ - `taxicab` (cube only)
+ - `chebyshev` (cube only)
+ - `inner_product` (vector only)
+
+ For cosine distance with cube, vectors must be normalized before being stored.

  ```ruby
  class Item < ApplicationRecord
- has_neighbors dimensions: 20, distance: "euclidean"
+ has_neighbors normalize: true
  end
  ```

- Supported values are:
-
- - `cosine` (default)
- - `euclidean`
- - `taxicab`
- - `chebyshev`
-
- For inner product, see [this example](examples/disco_user_recs.rb)
+ For inner product with cube, see [this example](examples/disco_user_recs.rb).

  Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute

  ```ruby
- nearest_item = item.nearest_neighbors.first
+ nearest_item = item.nearest_neighbors(distance: "euclidean").first
  nearest_item.neighbor_distance
  ```

  ## Dimensions

- By default, Postgres limits the `cube` data type to 100 dimensions. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
+ The cube data type is limited to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+ For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+ ```ruby
+ class Movie < ApplicationRecord
+ has_neighbors dimensions: 3
+ end
+ ```
+
+ ## Indexing
+
+ For vector, add an approximate index to speed up queries. Create a migration with:
+
+ ```ruby
+ class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+ def change
+ add_index :items, :neighbor_vector, using: :ivfflat
+ end
+ end
+ ```
+
+ Add `opclass: :vector_cosine_ops` for cosine distance and `opclass: :vector_ip_ops` for inner product.
+
+ Set the number of probes
+
+ ```ruby
+ Item.connection.execute("SET ivfflat.probes = 3")
+ ```

  ## Example

@@ -107,7 +144,7 @@ And add `has_neighbors`

  ```ruby
  class Movie < ApplicationRecord
- has_neighbors dimensions: 20
+ has_neighbors dimensions: 20, normalize: true
  end
  ```

@@ -131,11 +168,23 @@ And get similar movies

  ```ruby
  movie = Movie.find_by(name: "Star Wars (1977)")
- movie.nearest_neighbors.first(5).map(&:name)
+ movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
  ```

  [Complete code](examples/disco_item_recs.rb)

+ ## Upgrading
+
+ ### 0.2.0
+
+ The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+ ```ruby
+ class Item < ApplicationRecord
+ has_neighbors normalize: true
+ end
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb} RENAMED
@@ -2,12 +2,12 @@ require "rails/generators/active_record"

  module Neighbor
  module Generators
- class InstallGenerator < Rails::Generators::Base
+ class CubeGenerator < Rails::Generators::Base
  include ActiveRecord::Generators::Migration
  source_root File.join(__dir__, "templates")

  def copy_migration
- migration_template "migration.rb", "db/migrate/install_neighbor.rb", migration_version: migration_version
+ migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
  end

  def migration_version
data/lib/generators/neighbor/templates/vector.rb.tt ADDED
@@ -0,0 +1,5 @@
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
+ def change
+ enable_extension "vector"
+ end
+ end
data/lib/generators/neighbor/vector_generator.rb ADDED
@@ -0,0 +1,18 @@
+ require "rails/generators/active_record"
+
+ module Neighbor
+ module Generators
+ class VectorGenerator < Rails::Generators::Base
+ include ActiveRecord::Generators::Migration
+ source_root File.join(__dir__, "templates")
+
+ def copy_migration
+ migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+ end
+
+ def migration_version
+ "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+ end
+ end
+ end
+ end
data/lib/neighbor.rb CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
  module Neighbor
  class Error < StandardError; end

- module RegisterCubeType
+ module RegisterTypes
  def initialize_type_map(m = type_map)
  super
  m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+ m.register_type "vector" do |_, _, sql_type|
+ limit = extract_limit(sql_type)
+ ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+ end
  end
  end
  end
@@ -25,16 +29,20 @@ ActiveSupport.on_load(:active_record) do

  # ensure schema can be dumped
  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}

  # ensure schema can be loaded
  if ActiveRecord::VERSION::MAJOR >= 6
- ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube)
+ ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
  else
  ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
  args.each { |name| column(name, :cube, options) }
  end
+ ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+ args.each { |name| column(name, :vector, options) }
+ end
  end

  # prevent unknown OID warning
- ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterCubeType)
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
  end
data/lib/neighbor/model.rb CHANGED
@@ -1,42 +1,78 @@
  module Neighbor
  module Model
- def has_neighbors(dimensions:, distance: "cosine")
- distance = distance.to_s
- raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
-
+ def has_neighbors(dimensions: nil, normalize: nil)
  # TODO make configurable
  # likely use argument
  attribute_name = :neighbor_vector

  class_eval do
- attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, distance: distance)
+ raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)
+
+ attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)

- scope :nearest_neighbors, ->(vector) {
+ scope :nearest_neighbors, ->(vector, distance:) {
  return none if vector.nil?

+ distance = distance.to_s
+
  quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"

+ column_info = klass.type_for_attribute(attribute_name).column_info
+
  operator =
- case distance
- when "taxicab"
- "<#>"
- when "chebyshev"
- "<=>"
+ if column_info[:type] == :vector
+ case distance
+ when "inner_product"
+ "<#>"
+ when "cosine"
+ "<=>"
+ when "euclidean"
+ "<->"
+ end
  else
- "<->"
+ case distance
+ when "taxicab"
+ "<#>"
+ when "chebyshev"
+ "<=>"
+ when "euclidean", "cosine"
+ "<->"
+ end
  end

+ raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+ # ensure normalize set (can be true or false)
+ if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+ raise Neighbor::Error, "Set normalize for cosine distance with cube"
+ end
+
+ vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
  # important! neighbor_vector should already be typecast
  # but use to_f as extra safeguard against SQL injection
- vector = Neighbor::Vector.cast(vector, dimensions: dimensions, distance: distance)
- order = "#{quoted_attribute} #{operator} cube(array[#{vector.map(&:to_f).join(", ")}])"
+ query =
+ if column_info[:type] == :vector
+ connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+ else
+ "cube(array[#{vector.map(&:to_f).join(", ")}])"
+ end
+
+ order = "#{quoted_attribute} #{operator} #{query}"

  # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
  # with normalized vectors:
  # cosine similarity = 1 - (euclidean distance)**2 / 2
  # cosine distance = 1 - cosine similarity
  # this transformation doesn't change the order, so only needed for select
- neighbor_distance = distance == "cosine" ? "POWER(#{order}, 2) / 2.0" : order
+ neighbor_distance =
+ if column_info[:type] != :vector && distance == "cosine"
+ "POWER(#{order}, 2) / 2.0"
+ elsif column_info[:type] == :vector && distance == "inner_product"
+ "(#{order}) * -1"
+ else
+ order
+ end

  # for select, use column_names instead of * to account for ignored columns
  select(*column_names, "#{neighbor_distance} AS neighbor_distance")
@@ -44,10 +80,10 @@ module Neighbor
44
80
  .order(Arel.sql(order))
45
81
  }
46
82
 
47
- define_method :nearest_neighbors do
83
+ define_method :nearest_neighbors do |**options|
48
84
  self.class
49
85
  .where.not(self.class.primary_key => send(self.class.primary_key))
50
- .nearest_neighbors(send(attribute_name))
86
+ .nearest_neighbors(send(attribute_name), **options)
51
87
  end
52
88
  end
53
89
  end
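
The comments in the `nearest_neighbors` scope rely on an identity: for unit-length vectors, cosine distance equals the squared Euclidean distance divided by two, which is why the cube branch selects `POWER(..., 2) / 2.0`. The following is a minimal plain-Ruby sketch that checks the identity numerically. The helper names (`normalize`, `euclidean_distance`, `cosine_distance`) are illustrative assumptions and are not part of the gem.

```ruby
# Hypothetical helpers for illustration only; none of these exist in neighbor.

# Scale a vector to unit length (a zero vector is returned unchanged).
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  norm > 0 ? v.map { |x| x / norm } : v
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

a = normalize([1.0, 1.2, 0.5])
b = normalize([0.9, 1.3, 1.1])

# Same shape as the SQL expression POWER(euclidean, 2) / 2.0
from_euclidean = euclidean_distance(a, b)**2 / 2.0
direct = cosine_distance(a, b)

# The two values should agree up to floating-point rounding.
identity_holds = (from_euclidean - direct).abs < 1e-12
puts identity_holds
```

Because squaring and halving are monotonic for non-negative distances, ordering by the Euclidean operator already yields cosine order, so the transformation is only needed for the selected `neighbor_distance` value.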
data/lib/neighbor/vector.rb CHANGED
@@ -1,29 +1,61 @@
  module Neighbor
  class Vector < ActiveRecord::Type::Value
- def initialize(dimensions:, distance:)
+ def initialize(dimensions:, normalize:, model:, attribute_name:)
  super()
  @dimensions = dimensions
- @distance = distance
+ @normalize = normalize
+ @model = model
+ @attribute_name = attribute_name
  end

- def self.cast(value, dimensions:, distance:)
+ def self.cast(value, dimensions:, normalize:, column_info:)
  value = value.to_a.map(&:to_f)
- raise Error, "Expected #{dimensions} dimensions, not #{value.size}" unless value.size == dimensions

- if distance == "cosine"
+ dimensions ||= column_info[:dimensions]
+ raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+ raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+ if normalize
  norm = Math.sqrt(value.sum { |v| v * v })
- value.map { |v| v / norm }
- else
- value
+
+ # store zero vector as all zeros
+ # since NaN makes the distance always 0
+ # could also throw error
+
+ # safe to update in-place since earlier map dups
+ value.map! { |v| v / norm } if norm > 0
  end
+
+ value
+ end
+
+ def self.column_info(model, attribute_name)
+ attribute_name = attribute_name.to_s
+ column = model.columns.detect { |c| c.name == attribute_name }
+ {
+ type: column.try(:type),
+ dimensions: column.try(:limit)
+ }
+ end
+
+ # need to be careful to avoid loading column info before needed
+ def column_info
+ @column_info ||= self.class.column_info(@model, @attribute_name)
  end

  def cast(value)
- self.class.cast(value, dimensions: @dimensions, distance: @distance) unless value.nil?
+ self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
  end

  def serialize(value)
- "(#{cast(value).join(", ")})" unless value.nil?
+ unless value.nil?
+ if column_info[:type] == :vector
+ "[#{cast(value).join(", ")}]"
+ else
+ "(#{cast(value).join(", ")})"
+ end
+ end
  end

  def deserialize(value)
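
The casting rules introduced in this hunk (optional dimension check, rejection of non-finite values, and normalization that leaves a zero vector untouched) can be tried outside Rails. The following is a rough standalone sketch under stated assumptions: `cast_vector` is a hypothetical name, it raises `ArgumentError` where the gem raises `Neighbor::Error`, and it omits the `column_info` fallback for dimensions.

```ruby
# Hypothetical stand-in for the casting logic shown in the diff;
# not the gem's API.
def cast_vector(value, dimensions: nil, normalize: false)
  # map returns a new array, so the in-place map! below does not
  # mutate the caller's array
  value = value.to_a.map(&:to_f)

  if dimensions && value.size != dimensions
    raise ArgumentError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise ArgumentError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # keep a zero vector as all zeros rather than dividing by zero,
    # which would produce NaN
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

unit = cast_vector([3, 4], normalize: true)
zero = cast_vector([0, 0], normalize: true)
p unit
p zero
```

Passing a vector containing `Float::NAN` or `Float::INFINITY`, or one whose size differs from `dimensions`, raises in this sketch, mirroring the two error paths added in 0.2.0.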
data/lib/neighbor/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Neighbor
- VERSION = "0.1.2"
+ VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
- version: 0.1.2
+ version: 0.2.0
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-22 00:00:00.000000000 Z
+ date: 2021-04-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activerecord
@@ -16,14 +16,14 @@ dependencies:
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: '5.2'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: '5.2'
  description:
  email: andrew@ankane.org
  executables: []
@@ -33,8 +33,10 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
- - lib/generators/neighbor/install_generator.rb
- - lib/generators/neighbor/templates/migration.rb.tt
+ - lib/generators/neighbor/cube_generator.rb
+ - lib/generators/neighbor/templates/cube.rb.tt
+ - lib/generators/neighbor/templates/vector.rb.tt
+ - lib/generators/neighbor/vector_generator.rb
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/vector.rb