neighbor 0.1.2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e64a3292445759187b7c08fd9cc54fe0da340ea3e3ab5de909620de6395b4984
-   data.tar.gz: 274d73ed464a0503973c5549022dca8820257fe470c9746b5599da9a153d46e3
+   metadata.gz: 9a8839eeaeeef5b3ff8deb63165791929a81f9a9c1c807dd3d094e9bf770e881
+   data.tar.gz: 14035d37686f7035bc0e874de7933259dc0b8be9d046fca1d31867cacaf1c922
  SHA512:
-   metadata.gz: 17df9f8a5848337fc570c6fa685f54f4aa9d49bea8e52e74fbec098e0bc9e450b655c7d96cb66c6e8b6aaae4824f2c831ea26a14f7f46b50ef64c955cfb9839b
-   data.tar.gz: a47165d174ec86ebfdcd128e62c50b85d097aac535b6b6d424ea31988428a57237cdade2829d8e22a15e566ac7597371c95fabafad42cdbc5d56f063abab55e4
+   metadata.gz: bfbf7bca1b4290abf2ec91dd5ea032dd1878ab9d7a0938dfed4e6e63441aa557397345090713222a735beb32bc9f1b6cd7971536de9ca427db8650f053917040
+   data.tar.gz: 1ba36cba82aafbb2bb003e016f0817884b1129a7d0a8b2be7385e438999e6e7728224f7e146d4de611e171fc851ae64ebea3aa7bfa65f55868cb0c5477d8ecff
data/CHANGELOG.md CHANGED
@@ -1,3 +1,16 @@
+ ## 0.2.0 (2021-04-21)
+
+ - Added support for pgvector
+ - Added `normalize` option
+ - Made `dimensions` optional
+ - Raise an error if `nearest_neighbors` already defined
+ - Raise an error for non-finite values
+ - Fixed NaN with zero vectors and cosine distance
+
+ Breaking changes
+
+ - The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
  ## 0.1.2 (2021-02-21)

  - Added `nearest_neighbors` scope
data/README.md CHANGED
@@ -12,15 +12,23 @@ Add this line to your application’s Gemfile:
  gem 'neighbor'
  ```

- And run:
+ ## Choose An Extension
+
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+ For cube, run:

  ```sh
- bundle install
- rails generate neighbor:install
+ rails generate neighbor:cube
  rails db:migrate
  ```

- This enables the [cube extension](https://www.postgresql.org/docs/current/cube.html) in Postgres
+ For vector, install [pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+ ```sh
+ rails generate neighbor:vector
+ rails db:migrate
+ ```

  ## Getting Started

@@ -30,6 +38,8 @@ Create a migration
  class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
    def change
      add_column :items, :neighbor_vector, :cube
+     # or
+     add_column :items, :neighbor_vector, :vector, limit: 3
    end
  end
  ```
@@ -38,7 +48,7 @@ Add to your model

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 3
+   has_neighbors
  end
  ```

@@ -48,49 +58,76 @@ Update the vectors
  item.update(neighbor_vector: [1.0, 1.2, 0.5])
  ```

- > With cosine distance (the default), vectors are normalized before being stored
-
  Get the nearest neighbors to a record

  ```ruby
- item.nearest_neighbors.first(5)
+ item.nearest_neighbors(distance: "euclidean").first(5)
  ```

  Get the nearest neighbors to a vector

  ```ruby
- Item.nearest_neighbors([1, 2, 3])
+ Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```

  ## Distance

- Specify the distance metric
+ Supported values are:
+
+ - `euclidean`
+ - `cosine`
+ - `taxicab` (cube only)
+ - `chebyshev` (cube only)
+ - `inner_product` (vector only)
+
+ For cosine distance with cube, vectors must be normalized before being stored.

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 20, distance: "euclidean"
+   has_neighbors normalize: true
  end
  ```

- Supported values are:
-
- - `cosine` (default)
- - `euclidean`
- - `taxicab`
- - `chebyshev`
-
- For inner product, see [this example](examples/disco_user_recs.rb)
+ For inner product with cube, see [this example](examples/disco_user_recs.rb).

  Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute

  ```ruby
- nearest_item = item.nearest_neighbors.first
+ nearest_item = item.nearest_neighbors(distance: "euclidean").first
  nearest_item.neighbor_distance
  ```

  ## Dimensions

- By default, Postgres limits the `cube` data type to 100 dimensions. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
+ The cube data type is limited to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+ For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+ ```ruby
+ class Movie < ApplicationRecord
+   has_neighbors dimensions: 3
+ end
+ ```
+
+ ## Indexing
+
+ For vector, add an approximate index to speed up queries. Create a migration with:
+
+ ```ruby
+ class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+   def change
+     add_index :items, :neighbor_vector, using: :ivfflat
+   end
+ end
+ ```
+
+ Add `opclass: :vector_cosine_ops` for cosine distance and `opclass: :vector_ip_ops` for inner product.
+
+ Set the number of probes
+
+ ```ruby
+ Item.connection.execute("SET ivfflat.probes = 3")
+ ```

  ## Example

@@ -107,7 +144,7 @@ And add `has_neighbors`

  ```ruby
  class Movie < ApplicationRecord
-   has_neighbors dimensions: 20
+   has_neighbors dimensions: 20, normalize: true
  end
  ```

@@ -131,11 +168,23 @@ And get similar movies

  ```ruby
  movie = Movie.find_by(name: "Star Wars (1977)")
- movie.nearest_neighbors.first(5).map(&:name)
+ movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
  ```

  [Complete code](examples/disco_item_recs.rb)

+ ## Upgrading
+
+ ### 0.2.0
+
+ The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+ ```ruby
+ class Item < ApplicationRecord
+   has_neighbors normalize: true
+ end
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
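Reviewer note: the five `distance` values named in the README's Distance section correspond to the standard definitions sketched below in plain Ruby. This is for intuition only. The gem delegates the computation to Postgres operators, and `DistanceSketch` is a hypothetical module that is not part of the gem.

```ruby
# Plain-Ruby definitions of the metrics named in the README.
# DistanceSketch is a hypothetical module, not part of the gem.
module DistanceSketch
  module_function

  def euclidean(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end

  def taxicab(a, b)
    a.zip(b).sum { |x, y| (x - y).abs }
  end

  def chebyshev(a, b)
    a.zip(b).map { |x, y| (x - y).abs }.max
  end

  def inner_product(a, b)
    a.zip(b).sum { |x, y| x * y }
  end

  # Cosine distance = 1 - cosine similarity.
  def cosine(a, b)
    norms = Math.sqrt(inner_product(a, a)) * Math.sqrt(inner_product(b, b))
    1 - inner_product(a, b) / norms
  end
end

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
DistanceSketch.euclidean(a, b)     # => 5.0
DistanceSketch.taxicab(a, b)       # => 7.0
DistanceSketch.chebyshev(a, b)     # => 4.0
DistanceSketch.inner_product(a, b) # => 25.0
```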
data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb} RENAMED
@@ -2,12 +2,12 @@ require "rails/generators/active_record"

  module Neighbor
    module Generators
-     class InstallGenerator < Rails::Generators::Base
+     class CubeGenerator < Rails::Generators::Base
        include ActiveRecord::Generators::Migration
        source_root File.join(__dir__, "templates")

        def copy_migration
-         migration_template "migration.rb", "db/migrate/install_neighbor.rb", migration_version: migration_version
+         migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
        end

        def migration_version
data/lib/generators/neighbor/templates/vector.rb.tt ADDED
@@ -0,0 +1,5 @@
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
+   def change
+     enable_extension "vector"
+   end
+ end
data/lib/generators/neighbor/vector_generator.rb ADDED
@@ -0,0 +1,18 @@
+ require "rails/generators/active_record"
+
+ module Neighbor
+   module Generators
+     class VectorGenerator < Rails::Generators::Base
+       include ActiveRecord::Generators::Migration
+       source_root File.join(__dir__, "templates")
+
+       def copy_migration
+         migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+       end
+
+       def migration_version
+         "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+       end
+     end
+   end
+ end
data/lib/neighbor.rb CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
  module Neighbor
    class Error < StandardError; end

-   module RegisterCubeType
+   module RegisterTypes
      def initialize_type_map(m = type_map)
        super
        m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+       m.register_type "vector" do |_, _, sql_type|
+         limit = extract_limit(sql_type)
+         ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+       end
      end
    end
  end
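Reviewer note: the `register_type "vector"` block relies on Active Record's `extract_limit` helper to pull the dimension count out of a SQL type string such as `vector(3)`. Below is a standalone sketch of that parsing step, assuming the helper does little more than read the parenthesized number. The method name `limit_from_sql_type` is hypothetical.

```ruby
# Hypothetical stand-in for the limit parsing used when registering the type.
# Returns the integer inside the parentheses, or nil when there is none.
def limit_from_sql_type(sql_type)
  match = sql_type.match(/\((\d+)\)/)
  match && match[1].to_i
end

limit_from_sql_type("vector(3)") # => 3
limit_from_sql_type("vector")    # => nil
```

The parsed limit is what later surfaces as `column.limit`, which `Neighbor::Vector.column_info` reads as the default number of dimensions.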
@@ -25,16 +29,20 @@ ActiveSupport.on_load(:active_record) do

    # ensure schema can be dumped
    ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}

    # ensure schema can be loaded
    if ActiveRecord::VERSION::MAJOR >= 6
-     ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube)
+     ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
    else
      ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
        args.each { |name| column(name, :cube, options) }
      end
+     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+       args.each { |name| column(name, :vector, options) }
+     end
    end

    # prevent unknown OID warning
-   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterCubeType)
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
  end
data/lib/neighbor/model.rb CHANGED
@@ -1,42 +1,78 @@
  module Neighbor
    module Model
-     def has_neighbors(dimensions:, distance: "cosine")
-       distance = distance.to_s
-       raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
-
+     def has_neighbors(dimensions: nil, normalize: nil)
        # TODO make configurable
        # likely use argument
        attribute_name = :neighbor_vector

        class_eval do
-         attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, distance: distance)
+         raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)
+
+         attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)

-         scope :nearest_neighbors, ->(vector) {
+         scope :nearest_neighbors, ->(vector, distance:) {
            return none if vector.nil?

+           distance = distance.to_s
+
            quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"

+           column_info = klass.type_for_attribute(attribute_name).column_info
+
            operator =
-             case distance
-             when "taxicab"
-               "<#>"
-             when "chebyshev"
-               "<=>"
+             if column_info[:type] == :vector
+               case distance
+               when "inner_product"
+                 "<#>"
+               when "cosine"
+                 "<=>"
+               when "euclidean"
+                 "<->"
+               end
              else
-               "<->"
+               case distance
+               when "taxicab"
+                 "<#>"
+               when "chebyshev"
+                 "<=>"
+               when "euclidean", "cosine"
+                 "<->"
+               end
              end

+           raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+           # ensure normalize set (can be true or false)
+           if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+             raise Neighbor::Error, "Set normalize for cosine distance with cube"
+           end
+
+           vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
            # important! neighbor_vector should already be typecast
            # but use to_f as extra safeguard against SQL injection
-           vector = Neighbor::Vector.cast(vector, dimensions: dimensions, distance: distance)
-           order = "#{quoted_attribute} #{operator} cube(array[#{vector.map(&:to_f).join(", ")}])"
+           query =
+             if column_info[:type] == :vector
+               connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+             else
+               "cube(array[#{vector.map(&:to_f).join(", ")}])"
+             end
+
+           order = "#{quoted_attribute} #{operator} #{query}"

            # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
            # with normalized vectors:
            # cosine similarity = 1 - (euclidean distance)**2 / 2
            # cosine distance = 1 - cosine similarity
            # this transformation doesn't change the order, so only needed for select
-           neighbor_distance = distance == "cosine" ? "POWER(#{order}, 2) / 2.0" : order
+           neighbor_distance =
+             if column_info[:type] != :vector && distance == "cosine"
+               "POWER(#{order}, 2) / 2.0"
+             elsif column_info[:type] == :vector && distance == "inner_product"
+               "(#{order}) * -1"
+             else
+               order
+             end

            # for select, use column_names instead of * to account for ignored columns
            select(*column_names, "#{neighbor_distance} AS neighbor_distance")
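Reviewer note: the comment block in this hunk relies on an identity for unit-length vectors: cosine distance equals the squared Euclidean distance divided by two. That is why `POWER(..., 2) / 2.0` appears in the select for cube columns. The check below reproduces the identity numerically in plain Ruby. The helper names are hypothetical and nothing here touches the database.

```ruby
# Scale a vector to unit length.
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine distance = 1 - cosine similarity.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  1 - dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

a = normalize([1.0, 1.2, 0.5])
b = normalize([0.9, 1.3, 1.1])

from_euclidean = euclidean(a, b)**2 / 2.0
direct = cosine_distance(a, b)
(from_euclidean - direct).abs < 1e-12 # => true
```

The `* -1` branch for vector columns follows from pgvector's `<#>` operator, which returns the negative inner product so that ascending order puts the largest inner product first.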
@@ -44,10 +80,10 @@ module Neighbor
              .order(Arel.sql(order))
          }

-         define_method :nearest_neighbors do
+         define_method :nearest_neighbors do |**options|
            self.class
              .where.not(self.class.primary_key => send(self.class.primary_key))
-             .nearest_neighbors(send(attribute_name))
+             .nearest_neighbors(send(attribute_name), **options)
          end
        end
      end
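Reviewer note: the operator choice in the `nearest_neighbors` scope of this file can be read as a lookup keyed by column type and distance. The table below restates that branching as a sketch. `OPERATORS` and `operator_for` are hypothetical names, and the gem itself uses nested `case` statements.

```ruby
# Hypothetical lookup table mirroring the branching in the scope.
OPERATORS = {
  vector: {
    "inner_product" => "<#>",
    "cosine" => "<=>",
    "euclidean" => "<->"
  },
  cube: {
    "taxicab" => "<#>",
    "chebyshev" => "<=>",
    "euclidean" => "<->",
    "cosine" => "<->" # ordering matches Euclidean once vectors are normalized
  }
}.freeze

def operator_for(column_type, distance)
  # anything other than :vector falls through to the cube operators, as in the scope
  table = column_type == :vector ? OPERATORS[:vector] : OPERATORS[:cube]
  table.fetch(distance.to_s) { raise ArgumentError, "Invalid distance: #{distance}" }
end

operator_for(:vector, :cosine) # => "<=>"
operator_for(:cube, "cosine")  # => "<->"
operator_for(:cube, "taxicab") # => "<#>"
```

Note that `<#>` and `<=>` mean different things for the two extensions, so a distance that is valid for one column type raises `ArgumentError` for the other.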
data/lib/neighbor/vector.rb CHANGED
@@ -1,29 +1,61 @@
  module Neighbor
    class Vector < ActiveRecord::Type::Value
-     def initialize(dimensions:, distance:)
+     def initialize(dimensions:, normalize:, model:, attribute_name:)
        super()
        @dimensions = dimensions
-       @distance = distance
+       @normalize = normalize
+       @model = model
+       @attribute_name = attribute_name
      end

-     def self.cast(value, dimensions:, distance:)
+     def self.cast(value, dimensions:, normalize:, column_info:)
        value = value.to_a.map(&:to_f)
-       raise Error, "Expected #{dimensions} dimensions, not #{value.size}" unless value.size == dimensions

-       if distance == "cosine"
+       dimensions ||= column_info[:dimensions]
+       raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+       raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+       if normalize
          norm = Math.sqrt(value.sum { |v| v * v })
-         value.map { |v| v / norm }
-       else
-         value
+
+         # store zero vector as all zeros
+         # since NaN makes the distance always 0
+         # could also throw error
+
+         # safe to update in-place since earlier map dups
+         value.map! { |v| v / norm } if norm > 0
        end
+
+       value
+     end
+
+     def self.column_info(model, attribute_name)
+       attribute_name = attribute_name.to_s
+       column = model.columns.detect { |c| c.name == attribute_name }
+       {
+         type: column.try(:type),
+         dimensions: column.try(:limit)
+       }
+     end
+
+     # need to be careful to avoid loading column info before needed
+     def column_info
+       @column_info ||= self.class.column_info(@model, @attribute_name)
      end

      def cast(value)
-       self.class.cast(value, dimensions: @dimensions, distance: @distance) unless value.nil?
+       self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
      end

      def serialize(value)
-       "(#{cast(value).join(", ")})" unless value.nil?
+       unless value.nil?
+         if column_info[:type] == :vector
+           "[#{cast(value).join(", ")}]"
+         else
+           "(#{cast(value).join(", ")})"
+         end
+       end
      end

      def deserialize(value)
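Reviewer note: a standalone version of the new casting rules may make the behavior easier to follow. It checks the dimension count if one is known, rejects non-finite values, and normalizes only when asked, leaving a zero vector untouched. `cast_vector` and `CastError` are hypothetical names used for illustration and are not part of the gem.

```ruby
# Hypothetical error class standing in for Neighbor::Error.
class CastError < StandardError; end

# Sketch of the casting rules, without any Active Record dependency.
def cast_vector(value, dimensions: nil, normalize: false)
  value = value.to_a.map(&:to_f)

  if dimensions && value.size != dimensions
    raise CastError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise CastError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # a zero vector has norm 0; dividing would produce NaN, so leave it as-is
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

cast_vector([3, 4], normalize: true) # => [0.6, 0.8]
cast_vector([0, 0], normalize: true) # => [0.0, 0.0]
```

In the diff, the serialized form then depends on the column type: square brackets for vector columns and parentheses for cube columns.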
data/lib/neighbor/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Neighbor
-   VERSION = "0.1.2"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.0
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-22 00:00:00.000000000 Z
+ date: 2021-04-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
  description:
  email: andrew@ankane.org
  executables: []
@@ -33,8 +33,10 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
- - lib/generators/neighbor/install_generator.rb
- - lib/generators/neighbor/templates/migration.rb.tt
+ - lib/generators/neighbor/cube_generator.rb
+ - lib/generators/neighbor/templates/cube.rb.tt
+ - lib/generators/neighbor/templates/vector.rb.tt
+ - lib/generators/neighbor/vector_generator.rb
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/vector.rb