neighbor 0.1.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: f346d121745bd39998267c45f77009970c014de887376022f8b6425192d37354
-   data.tar.gz: '0293bc9fd3633ca3ee811940c59995407357d4b880fcf8ea8cd9223e5df0e75b'
+   metadata.gz: 8a71d9aff5133b9551dc47cfe490d5d6a0ee63d2b77ac1d3d2f381f58aef67d6
+   data.tar.gz: d22cab3140b979ee0d13ddd2c99f064282e896d49c265bf9b1946dafbda7783a
  SHA512:
-   metadata.gz: cf080c46ad8133a460453c773faabac02e229821625fa9f2d51a320195b1e821b230c94c7fe45bc7f7d12b0b51ce6d3e5f1d4c0d74e446b0ce22dde2f5171cde
-   data.tar.gz: f2e0ae2247979bd3dd083b1869b56f7b6ed879ce8dd1c0ef31d965392d8250548acb304a78ba99a1d6cdfea116c10e0d67aff9c06185a793e576bddae4d624f1
+   metadata.gz: 3e24a71c6ace693525fc4866aaa21ee7c75f7251d5d6e0820da3b6abc0bd6fd5921b20c2edbb7fc656b75991e1fc5607357dc02870072900e7b6ef972c840280
+   data.tar.gz: 5b75200355c62c549c2d61820e74e1cc71e4ebce68a1e390af60dbe76b6e1b7a69e71d73c8121db42ea05fc296db4ecc1f5776c71ad988685f13f71fc00fec66
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+ ## 0.2.1 (2021-12-15)
+
+ - Added support for Active Record 7
+
+ ## 0.2.0 (2021-04-21)
+
+ - Added support for pgvector
+ - Added `normalize` option
+ - Made `dimensions` optional
+ - Raise an error if `nearest_neighbors` already defined
+ - Raise an error for non-finite values
+ - Fixed NaN with zero vectors and cosine distance
+
+ Breaking changes
+
+ - The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
+ ## 0.1.2 (2021-02-21)
+
+ - Added `nearest_neighbors` scope
+
+ ## 0.1.1 (2021-02-16)
+
+ - Fixed `Could not dump table` error
+
  ## 0.1.0 (2021-02-15)

  - First release
data/README.md CHANGED
@@ -12,11 +12,21 @@ Add this line to your application’s Gemfile:
  gem 'neighbor'
  ```

- And run:
+ ## Choose An Extension
+
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+ For cube, run:

  ```sh
- bundle install
- rails generate neighbor:install
+ rails generate neighbor:cube
+ rails db:migrate
+ ```
+
+ For vector, [install pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+ ```sh
+ rails generate neighbor:vector
  rails db:migrate
  ```
 
@@ -28,6 +38,8 @@ Create a migration
  class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
    def change
      add_column :items, :neighbor_vector, :cube
+     # or
+     add_column :items, :neighbor_vector, :vector, limit: 3 # dimensions
    end
  end
  ```
@@ -36,7 +48,7 @@ Add to your model

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 3
+   has_neighbors
  end
  ```
 
@@ -46,40 +58,80 @@ Update the vectors
  item.update(neighbor_vector: [1.0, 1.2, 0.5])
  ```

- > With cosine distance (the default), vectors are normalized before being stored
+ Get the nearest neighbors to a record
+
+ ```ruby
+ item.nearest_neighbors(distance: "euclidean").first(5)
+ ```

- And get the nearest neighbors
+ Get the nearest neighbors to a vector

  ```ruby
- item.nearest_neighbors.first(5)
+ Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```
 
- ## Distances
+ ## Distance

- Specify the distance metric
+ Supported values are:
+
+ - `euclidean`
+ - `cosine`
+ - `taxicab` (cube only)
+ - `chebyshev` (cube only)
+ - `inner_product` (vector only)
+
+ For cosine distance with cube, vectors must be normalized before being stored.

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 20, distance: "euclidean"
+   has_neighbors normalize: true
  end
  ```
 
- Supported distances are:
+ For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).

- - `cosine` (default)
- - `euclidean`
- - `taxicab`
- - `chebyshev`
+ Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+ ```ruby
+ nearest_item = item.nearest_neighbors(distance: "euclidean").first
+ nearest_item.neighbor_distance
+ ```
+
+ ## Dimensions
+
+ The cube data type is limited to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+ For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+ ```ruby
+ class Movie < ApplicationRecord
+   has_neighbors dimensions: 3
+ end
+ ```
+
+ ## Indexing
+
+ For vector, add an approximate index to speed up queries. Create a migration with:
+
+ ```ruby
+ class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+   def change
+     add_index :items, :neighbor_vector, using: :ivfflat, opclass: :vector_l2_ops
+   end
+ end
+ ```
 
- Returned records will have a `neighbor_distance` attribute
+ Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
+
+ Set the number of probes

  ```ruby
- returned_item.neighbor_distance
+ Item.connection.execute("SET ivfflat.probes = 3")
  ```

  ## Example

- You can use Neighbor for online item recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.
+ You can use Neighbor for online item-based recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.

  Generate a model
 
@@ -92,7 +144,7 @@ And add `has_neighbors`

  ```ruby
  class Movie < ApplicationRecord
-   has_neighbors dimensions: 20
+   has_neighbors dimensions: 20, normalize: true
  end
  ```
 
@@ -107,16 +159,32 @@ recommender.fit(data)
  Use item factors for the neighbor vector

  ```ruby
+ movies = []
  recommender.item_ids.each do |item_id|
-   Movie.create!(name: item_id, neighbor_vector: recommender.item_factors(item_id))
+   movies << {name: item_id, neighbor_vector: recommender.item_factors(item_id)}
  end
+ Movie.insert_all!(movies) # use create! for Active Record < 6
  ```

  And get similar movies

  ```ruby
  movie = Movie.find_by(name: "Star Wars (1977)")
- movie.nearest_neighbors.first(5).map(&:name)
+ movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
+ ```
+
+ See the complete code for [cube](examples/disco_item_recs_cube.rb) and [vector](examples/disco_item_recs_vector.rb)
+
+ ## Upgrading
+
+ ### 0.2.0
+
+ The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+ ```ruby
+ class Item < ApplicationRecord
+   has_neighbors normalize: true
+ end
  ```

  ## History
@@ -138,5 +206,11 @@ To get started with development:
  git clone https://github.com/ankane/neighbor.git
  cd neighbor
  bundle install
+ createdb neighbor_test
+
+ # cube
  bundle exec rake test
+
+ # vector
+ EXT=vector bundle exec rake test
  ```
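The README's distance list maps onto Postgres operators differently for each extension, and the same operator symbol can mean different things for cube and vector. The lookup below is a minimal standalone sketch of that mapping, with operator strings copied from the `case` statements in the model changes further down in this diff. `OPERATORS` and `operator_for` are hypothetical names used only for illustration; they are not part of the gem.

```ruby
# Illustrative only: OPERATORS and operator_for are hypothetical names,
# not part of the neighbor gem.
OPERATORS = {
  vector: {
    "inner_product" => "<#>",
    "cosine" => "<=>",
    "euclidean" => "<->"
  },
  cube: {
    "taxicab" => "<#>",
    "chebyshev" => "<=>",
    "euclidean" => "<->",
    "cosine" => "<->" # same operator as euclidean; relies on normalized vectors
  }
}.freeze

# Returns the SQL operator for a column type and distance name,
# raising for combinations the extension does not support.
def operator_for(column_type, distance)
  operator = OPERATORS.fetch(column_type, {})[distance.to_s]
  raise ArgumentError, "Invalid distance: #{distance}" unless operator
  operator
end

puts operator_for(:vector, "cosine")
puts operator_for(:cube, "cosine")
```

Note that `<#>` is taxicab distance for cube but inner product for vector, which is why the distance name has to be resolved against the column type rather than hard-coded.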
@@ -2,12 +2,12 @@ require "rails/generators/active_record"

  module Neighbor
    module Generators
-     class InstallGenerator < Rails::Generators::Base
+     class CubeGenerator < Rails::Generators::Base
        include ActiveRecord::Generators::Migration
        source_root File.join(__dir__, "templates")

        def copy_migration
-         migration_template "migration.rb", "db/migrate/install_neighbor.rb", migration_version: migration_version
+         migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
        end

        def migration_version
@@ -0,0 +1,5 @@
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
+   def change
+     enable_extension "vector"
+   end
+ end
@@ -0,0 +1,18 @@
+ require "rails/generators/active_record"
+
+ module Neighbor
+   module Generators
+     class VectorGenerator < Rails::Generators::Base
+       include ActiveRecord::Generators::Migration
+       source_root File.join(__dir__, "templates")
+
+       def copy_migration
+         migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+       end
+
+       def migration_version
+         "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+       end
+     end
+   end
+ end
@@ -1,42 +1,89 @@
  module Neighbor
    module Model
-     def has_neighbors(dimensions:, distance: "cosine")
-       distance = distance.to_s
-       raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
+     def has_neighbors(dimensions: nil, normalize: nil)
+       # TODO make configurable
+       # likely use argument
+       attribute_name = :neighbor_vector

        class_eval do
-         attribute :neighbor_vector, Neighbor::Vector.new(dimensions: dimensions, distance: distance)
+         raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)

-         define_method :nearest_neighbors do
-           return self.class.none if neighbor_vector.nil?
+         attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)
+
+         scope :nearest_neighbors, ->(vector, distance:) {
+           return none if vector.nil?
+
+           distance = distance.to_s
+
+           quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"
+
+           column_info = klass.type_for_attribute(attribute_name).column_info
 
            operator =
-             case distance
-             when "taxicab"
-               "<#>"
-             when "chebyshev"
-               "<=>"
+             if column_info[:type] == :vector
+               case distance
+               when "inner_product"
+                 "<#>"
+               when "cosine"
+                 "<=>"
+               when "euclidean"
+                 "<->"
+               end
              else
-               "<->"
+               case distance
+               when "taxicab"
+                 "<#>"
+               when "chebyshev"
+                 "<=>"
+               when "euclidean", "cosine"
+                 "<->"
+               end
              end

+           raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+           # ensure normalize set (can be true or false)
+           if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+             raise Neighbor::Error, "Set normalize for cosine distance with cube"
+           end
+
+           vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
            # important! neighbor_vector should already be typecast
            # but use to_f as extra safeguard against SQL injection
-           order = "neighbor_vector #{operator} cube(array[#{neighbor_vector.map(&:to_f).join(", ")}])"
+           query =
+             if column_info[:type] == :vector
+               connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+             else
+               "cube(array[#{vector.map(&:to_f).join(", ")}])"
+             end
+
+           order = "#{quoted_attribute} #{operator} #{query}"

            # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
            # with normalized vectors:
            # cosine similarity = 1 - (euclidean distance)**2 / 2
            # cosine distance = 1 - cosine similarity
            # this transformation doesn't change the order, so only needed for select
-           neighbor_distance = distance == "cosine" ? "POWER(#{order}, 2) / 2.0" : order
+           neighbor_distance =
+             if column_info[:type] != :vector && distance == "cosine"
+               "POWER(#{order}, 2) / 2.0"
+             elsif column_info[:type] == :vector && distance == "inner_product"
+               "(#{order}) * -1"
+             else
+               order
+             end
 
            # for select, use column_names instead of * to account for ignored columns
+           select(*column_names, "#{neighbor_distance} AS neighbor_distance")
+             .where.not(attribute_name => nil)
+             .order(Arel.sql(order))
+         }
+
+         define_method :nearest_neighbors do |**options|
            self.class
-             .select(*self.class.column_names, "#{neighbor_distance} AS neighbor_distance")
              .where.not(self.class.primary_key => send(self.class.primary_key))
-             .where.not(neighbor_vector: nil)
-             .order(Arel.sql(order))
+             .nearest_neighbors(send(attribute_name), **options)
          end
        end
      end
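The comment in the scope relies on an identity for unit-length vectors: cosine distance equals the squared Euclidean distance divided by two, which is why the cube path selects `POWER(..., 2) / 2.0` while still ordering by the plain `<->` result. The following plain-Ruby check is a sketch of that identity only; the helper names are made up for illustration and nothing here touches Postgres or the gem.

```ruby
# Hypothetical helpers for illustration; not part of the neighbor gem.
def unit_vector(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

a = unit_vector([1.0, 1.2, 0.5])
b = unit_vector([0.9, 1.3, 1.1])

# For unit vectors: |a - b|^2 = 2 - 2 * (a . b), so |a - b|^2 / 2 = 1 - cos
from_euclidean = euclidean_distance(a, b)**2 / 2.0
direct = cosine_distance(a, b)

puts((from_euclidean - direct).abs < 1e-12)
```

Because squaring and halving is monotonic for non-negative distances, the ordering of neighbors is unchanged, so the transformation is only needed for the selected `neighbor_distance` value.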
@@ -1,31 +1,61 @@
  module Neighbor
    class Vector < ActiveRecord::Type::Value
-     def initialize(dimensions:, distance:)
+     def initialize(dimensions:, normalize:, model:, attribute_name:)
        super()
        @dimensions = dimensions
-       @distance = distance
+       @normalize = normalize
+       @model = model
+       @attribute_name = attribute_name
      end

-     def cast(value)
-       return if value.nil?
-
+     def self.cast(value, dimensions:, normalize:, column_info:)
        value = value.to_a.map(&:to_f)
-       raise Error, "Expected #{@dimensions} dimensions, not #{value.size}" unless value.size == @dimensions

-       if @distance == "cosine"
-         norm = 0.0
-         value.each do |v|
-           norm += v * v
-         end
-         norm = Math.sqrt(norm)
-         value.map { |v| v / norm }
-       else
-         value
+       dimensions ||= column_info[:dimensions]
+       raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+       raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+       if normalize
+         norm = Math.sqrt(value.sum { |v| v * v })
+
+         # store zero vector as all zeros
+         # since NaN makes the distance always 0
+         # could also throw error
+
+         # safe to update in-place since earlier map dups
+         value.map! { |v| v / norm } if norm > 0
        end
+
+       value
+     end
+
+     def self.column_info(model, attribute_name)
+       attribute_name = attribute_name.to_s
+       column = model.columns.detect { |c| c.name == attribute_name }
+       {
+         type: column.try(:type),
+         dimensions: column.try(:limit)
+       }
+     end
+
+     # need to be careful to avoid loading column info before needed
+     def column_info
+       @column_info ||= self.class.column_info(@model, @attribute_name)
+     end
+
+     def cast(value)
+       self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
      end

      def serialize(value)
-       "(#{cast(value).join(", ")})" unless value.nil?
+       unless value.nil?
+         if column_info[:type] == :vector
+           "[#{cast(value).join(", ")}]"
+         else
+           "(#{cast(value).join(", ")})"
+         end
+       end
      end

      def deserialize(value)
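The new `Vector.cast` validates before it normalizes, and it leaves a zero vector untouched instead of dividing by zero (the source of the NaN bug noted in the changelog). The sketch below mirrors that order of checks in plain Ruby without Active Record. `cast_vector` and `VectorError` are hypothetical stand-ins for `Neighbor::Vector.cast` and `Neighbor::Error`, and the `column_info` fallback for dimensions is left out for brevity.

```ruby
# Hypothetical stand-ins for illustration; not the gem's own classes.
class VectorError < StandardError; end

def cast_vector(value, dimensions: nil, normalize: false)
  # map returns a new array, so the in-place map! below never mutates the input
  value = value.to_a.map(&:to_f)

  if dimensions && value.size != dimensions
    raise VectorError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise VectorError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # keep a zero vector as all zeros rather than producing NaN
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

p cast_vector([3, 4], normalize: true)
p cast_vector([0, 0], normalize: true)
```

With the old code path, the second call would have divided `0.0` by `0.0` and stored NaN components; the `norm > 0` guard is what avoids that.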
@@ -1,3 +1,3 @@
  module Neighbor
-   VERSION = "0.1.0"
+   VERSION = "0.2.1"
  end
data/lib/neighbor.rb CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
  module Neighbor
    class Error < StandardError; end

-   module RegisterCubeType
+   module RegisterTypes
      def initialize_type_map(m = type_map)
        super
        m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+       m.register_type "vector" do |_, _, sql_type|
+         limit = extract_limit(sql_type)
+         ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+       end
      end
    end
  end
@@ -21,7 +25,28 @@ ActiveSupport.on_load(:active_record) do

    extend Neighbor::Model

-   # prevent unknown OID warning
    require "active_record/connection_adapters/postgresql_adapter"
-   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterCubeType)
+
+   # ensure schema can be dumped
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}
+
+   # ensure schema can be loaded
+   if ActiveRecord::VERSION::MAJOR >= 6
+     ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
+   else
+     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
+       args.each { |name| column(name, :cube, options) }
+     end
+     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+       args.each { |name| column(name, :vector, options) }
+     end
+   end
+
+   # prevent unknown OID warning
+   if ActiveRecord::VERSION::MAJOR >= 7
+     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.singleton_class.prepend(Neighbor::RegisterTypes)
+   else
+     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
+   end
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.1
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-16 00:00:00.000000000 Z
+ date: 2021-12-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
  description:
  email: andrew@ankane.org
  executables: []
@@ -33,8 +33,10 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
- - lib/generators/neighbor/install_generator.rb
- - lib/generators/neighbor/templates/migration.rb.tt
+ - lib/generators/neighbor/cube_generator.rb
+ - lib/generators/neighbor/templates/cube.rb.tt
+ - lib/generators/neighbor/templates/vector.rb.tt
+ - lib/generators/neighbor/vector_generator.rb
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/vector.rb
@@ -58,7 +60,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.4
+ rubygems_version: 3.2.32
  signing_key:
  specification_version: 4
  summary: Nearest neighbor search for Rails and Postgres