neighbor 0.1.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: f346d121745bd39998267c45f77009970c014de887376022f8b6425192d37354
-   data.tar.gz: '0293bc9fd3633ca3ee811940c59995407357d4b880fcf8ea8cd9223e5df0e75b'
+   metadata.gz: 8a71d9aff5133b9551dc47cfe490d5d6a0ee63d2b77ac1d3d2f381f58aef67d6
+   data.tar.gz: d22cab3140b979ee0d13ddd2c99f064282e896d49c265bf9b1946dafbda7783a
  SHA512:
-   metadata.gz: cf080c46ad8133a460453c773faabac02e229821625fa9f2d51a320195b1e821b230c94c7fe45bc7f7d12b0b51ce6d3e5f1d4c0d74e446b0ce22dde2f5171cde
-   data.tar.gz: f2e0ae2247979bd3dd083b1869b56f7b6ed879ce8dd1c0ef31d965392d8250548acb304a78ba99a1d6cdfea116c10e0d67aff9c06185a793e576bddae4d624f1
+   metadata.gz: 3e24a71c6ace693525fc4866aaa21ee7c75f7251d5d6e0820da3b6abc0bd6fd5921b20c2edbb7fc656b75991e1fc5607357dc02870072900e7b6ef972c840280
+   data.tar.gz: 5b75200355c62c549c2d61820e74e1cc71e4ebce68a1e390af60dbe76b6e1b7a69e71d73c8121db42ea05fc296db4ecc1f5776c71ad988685f13f71fc00fec66
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+ ## 0.2.1 (2021-12-15)
+
+ - Added support for Active Record 7
+
+ ## 0.2.0 (2021-04-21)
+
+ - Added support for pgvector
+ - Added `normalize` option
+ - Made `dimensions` optional
+ - Raise an error if `nearest_neighbors` already defined
+ - Raise an error for non-finite values
+ - Fixed NaN with zero vectors and cosine distance
+
+ Breaking changes
+
+ - The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
+ ## 0.1.2 (2021-02-21)
+
+ - Added `nearest_neighbors` scope
+
+ ## 0.1.1 (2021-02-16)
+
+ - Fixed `Could not dump table` error
+
  ## 0.1.0 (2021-02-15)

  - First release
data/README.md CHANGED
@@ -12,11 +12,21 @@ Add this line to your application’s Gemfile:
  gem 'neighbor'
  ```

- And run:
+ ## Choose An Extension
+
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+ For cube, run:

  ```sh
- bundle install
- rails generate neighbor:install
+ rails generate neighbor:cube
+ rails db:migrate
+ ```
+
+ For vector, [install pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+ ```sh
+ rails generate neighbor:vector
  rails db:migrate
  ```

@@ -28,6 +38,8 @@ Create a migration
  class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
    def change
      add_column :items, :neighbor_vector, :cube
+     # or
+     add_column :items, :neighbor_vector, :vector, limit: 3 # dimensions
    end
  end
  ```
@@ -36,7 +48,7 @@ Add to your model

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 3
+   has_neighbors
  end
  ```

@@ -46,40 +58,80 @@ Update the vectors
  item.update(neighbor_vector: [1.0, 1.2, 0.5])
  ```

- > With cosine distance (the default), vectors are normalized before being stored
+ Get the nearest neighbors to a record
+
+ ```ruby
+ item.nearest_neighbors(distance: "euclidean").first(5)
+ ```

- And get the nearest neighbors
+ Get the nearest neighbors to a vector

  ```ruby
- item.nearest_neighbors.first(5)
+ Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```

- ## Distances
+ ## Distance

- Specify the distance metric
+ Supported values are:
+
+ - `euclidean`
+ - `cosine`
+ - `taxicab` (cube only)
+ - `chebyshev` (cube only)
+ - `inner_product` (vector only)
+
+ For cosine distance with cube, vectors must be normalized before being stored.

  ```ruby
  class Item < ApplicationRecord
-   has_neighbors dimensions: 20, distance: "euclidean"
+   has_neighbors normalize: true
  end
  ```

- Supported distances are:
+ For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).

- - `cosine` (default)
- - `euclidean`
- - `taxicab`
- - `chebyshev`
+ Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+ ```ruby
+ nearest_item = item.nearest_neighbors(distance: "euclidean").first
+ nearest_item.neighbor_distance
+ ```
+
+ ## Dimensions
+
+ The cube data type is limited 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+ For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+ ```ruby
+ class Movie < ApplicationRecord
+   has_neighbors dimensions: 3
+ end
+ ```
+
+ ## Indexing
+
+ For vector, add an approximate index to speed up queries. Create a migration with:
+
+ ```ruby
+ class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+   def change
+     add_index :items, :neighbor_vector, using: :ivfflat, opclass: :vector_l2_ops
+   end
+ end
+ ```

- Returned records will have a `neighbor_distance` attribute
+ Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
+
+ Set the number of probes

  ```ruby
- returned_item.neighbor_distance
+ Item.connection.execute("SET ivfflat.probes = 3")
  ```

  ## Example

- You can use Neighbor for online item recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.
+ You can use Neighbor for online item-based recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.

  Generate a model

@@ -92,7 +144,7 @@ And add `has_neighbors`

  ```ruby
  class Movie < ApplicationRecord
-   has_neighbors dimensions: 20
+   has_neighbors dimensions: 20, normalize: true
  end
  ```

@@ -107,16 +159,32 @@ recommender.fit(data)
  Use item factors for the neighbor vector

  ```ruby
+ movies = []
  recommender.item_ids.each do |item_id|
-   Movie.create!(name: item_id, neighbor_vector: recommender.item_factors(item_id))
+   movies << {name: item_id, neighbor_vector: recommender.item_factors(item_id)}
  end
+ Movie.insert_all!(movies) # use create! for Active Record < 6
  ```

  And get similar movies

  ```ruby
  movie = Movie.find_by(name: "Star Wars (1977)")
- movie.nearest_neighbors.first(5).map(&:name)
+ movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
+ ```
+
+ See the complete code for [cube](examples/disco_item_recs_cube.rb) and [vector](examples/disco_item_recs_vector.rb)
+
+ ## Upgrading
+
+ ### 0.2.0
+
+ The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+ ```ruby
+ class Item < ApplicationRecord
+   has_neighbors normalize: true
+ end
  ```

  ## History
@@ -138,5 +206,11 @@ To get started with development:
  git clone https://github.com/ankane/neighbor.git
  cd neighbor
  bundle install
+ createdb neighbor_test
+
+ # cube
  bundle exec rake test
+
+ # vector
+ EXT=vector bundle exec rake test
  ```
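The README's requirement that cube vectors be normalized for cosine distance follows from an identity: for unit-length vectors, cosine distance equals the squared Euclidean distance divided by two. A minimal plain-Ruby check of that identity is sketched below. The helper names (`normalize`, `euclidean_distance`, `cosine_distance`) are illustrative and are not part of the gem.

```ruby
# Illustrative helpers; none of these names come from the neighbor gem.

# Scale a vector to unit length (assumes a non-zero vector).
def normalize(vector)
  norm = Math.sqrt(vector.sum { |v| v * v })
  vector.map { |v| v / norm }
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |v| v * v })
  norm_b = Math.sqrt(b.sum { |v| v * v })
  1.0 - dot / (norm_a * norm_b)
end

a = normalize([1.0, 1.2, 0.5])
b = normalize([0.9, 1.3, 1.1])

# For unit vectors: cosine distance = (euclidean distance)**2 / 2
from_euclidean = euclidean_distance(a, b)**2 / 2.0
direct = cosine_distance(a, b)

# The two values should agree up to floating-point rounding.
puts((from_euclidean - direct).abs < 1e-12)
```

Because the conversion is monotonic, ordering by Euclidean distance on normalized vectors already yields cosine order; the conversion only affects the reported distance value.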
data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb} RENAMED
@@ -2,12 +2,12 @@ require "rails/generators/active_record"

  module Neighbor
    module Generators
-     class InstallGenerator < Rails::Generators::Base
+     class CubeGenerator < Rails::Generators::Base
        include ActiveRecord::Generators::Migration
        source_root File.join(__dir__, "templates")

        def copy_migration
-         migration_template "migration.rb", "db/migrate/install_neighbor.rb", migration_version: migration_version
+         migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
        end

        def migration_version
data/lib/generators/neighbor/templates/vector.rb.tt ADDED
@@ -0,0 +1,5 @@
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
+   def change
+     enable_extension "vector"
+   end
+ end
data/lib/generators/neighbor/vector_generator.rb ADDED
@@ -0,0 +1,18 @@
+ require "rails/generators/active_record"
+
+ module Neighbor
+   module Generators
+     class VectorGenerator < Rails::Generators::Base
+       include ActiveRecord::Generators::Migration
+       source_root File.join(__dir__, "templates")
+
+       def copy_migration
+         migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+       end
+
+       def migration_version
+         "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+       end
+     end
+   end
+ end
data/lib/neighbor/model.rb CHANGED
@@ -1,42 +1,89 @@
  module Neighbor
    module Model
-     def has_neighbors(dimensions:, distance: "cosine")
-       distance = distance.to_s
-       raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
+     def has_neighbors(dimensions: nil, normalize: nil)
+       # TODO make configurable
+       # likely use argument
+       attribute_name = :neighbor_vector

        class_eval do
-         attribute :neighbor_vector, Neighbor::Vector.new(dimensions: dimensions, distance: distance)
+         raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)

-         define_method :nearest_neighbors do
-           return self.class.none if neighbor_vector.nil?
+         attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)
+
+         scope :nearest_neighbors, ->(vector, distance:) {
+           return none if vector.nil?
+
+           distance = distance.to_s
+
+           quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"
+
+           column_info = klass.type_for_attribute(attribute_name).column_info

            operator =
-             case distance
-             when "taxicab"
-               "<#>"
-             when "chebyshev"
-               "<=>"
+             if column_info[:type] == :vector
+               case distance
+               when "inner_product"
+                 "<#>"
+               when "cosine"
+                 "<=>"
+               when "euclidean"
+                 "<->"
+               end
              else
-               "<->"
+               case distance
+               when "taxicab"
+                 "<#>"
+               when "chebyshev"
+                 "<=>"
+               when "euclidean", "cosine"
+                 "<->"
+               end
              end

+           raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+           # ensure normalize set (can be true or false)
+           if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+             raise Neighbor::Error, "Set normalize for cosine distance with cube"
+           end
+
+           vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
            # important! neighbor_vector should already be typecast
            # but use to_f as extra safeguard against SQL injection
-           order = "neighbor_vector #{operator} cube(array[#{neighbor_vector.map(&:to_f).join(", ")}])"
+           query =
+             if column_info[:type] == :vector
+               connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+             else
+               "cube(array[#{vector.map(&:to_f).join(", ")}])"
+             end
+
+           order = "#{quoted_attribute} #{operator} #{query}"

            # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
            # with normalized vectors:
            # cosine similarity = 1 - (euclidean distance)**2 / 2
            # cosine distance = 1 - cosine similarity
            # this transformation doesn't change the order, so only needed for select
-           neighbor_distance = distance == "cosine" ? "POWER(#{order}, 2) / 2.0" : order
+           neighbor_distance =
+             if column_info[:type] != :vector && distance == "cosine"
+               "POWER(#{order}, 2) / 2.0"
+             elsif column_info[:type] == :vector && distance == "inner_product"
+               "(#{order}) * -1"
+             else
+               order
+             end

            # for select, use column_names instead of * to account for ignored columns
+           select(*column_names, "#{neighbor_distance} AS neighbor_distance")
+             .where.not(attribute_name => nil)
+             .order(Arel.sql(order))
+         }
+
+         define_method :nearest_neighbors do |**options|
            self.class
-             .select(*self.class.column_names, "#{neighbor_distance} AS neighbor_distance")
              .where.not(self.class.primary_key => send(self.class.primary_key))
-             .where.not(neighbor_vector: nil)
-             .order(Arel.sql(order))
+             .nearest_neighbors(send(attribute_name), **options)
          end
        end
      end
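The operator choice in the new scope depends on both the column type and the requested distance, and the same symbol means different things per extension (`<#>` is taxicab for cube but inner product for vector). A sketch of that lookup as a standalone table follows. The `OPERATORS` constant and `operator_for` helper are hypothetical names that do not exist in the gem.

```ruby
# Hypothetical lookup table mirroring the case statements in the diff above.
OPERATORS = {
  vector: {
    "inner_product" => "<#>",
    "cosine" => "<=>",
    "euclidean" => "<->"
  },
  cube: {
    "taxicab" => "<#>",
    "chebyshev" => "<=>",
    "euclidean" => "<->",
    "cosine" => "<->" # relies on the stored vectors being normalized
  }
}.freeze

# Returns the SQL operator, or raises for unsupported combinations.
def operator_for(column_type, distance)
  # like the diff, treat any non-vector column as cube
  table = column_type == :vector ? OPERATORS[:vector] : OPERATORS[:cube]
  table.fetch(distance.to_s) do
    raise ArgumentError, "Invalid distance: #{distance}"
  end
end

puts operator_for(:vector, "cosine") # prints "<=>"
puts operator_for(:cube, :cosine)    # prints "<->"
```

A table makes the per-extension asymmetry easy to audit, at the cost of separating the mapping from the query-building code that uses it.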
data/lib/neighbor/vector.rb CHANGED
@@ -1,31 +1,61 @@
  module Neighbor
    class Vector < ActiveRecord::Type::Value
-     def initialize(dimensions:, distance:)
+     def initialize(dimensions:, normalize:, model:, attribute_name:)
        super()
        @dimensions = dimensions
-       @distance = distance
+       @normalize = normalize
+       @model = model
+       @attribute_name = attribute_name
      end

-     def cast(value)
-       return if value.nil?
-
+     def self.cast(value, dimensions:, normalize:, column_info:)
        value = value.to_a.map(&:to_f)
-       raise Error, "Expected #{@dimensions} dimensions, not #{value.size}" unless value.size == @dimensions

-       if @distance == "cosine"
-         norm = 0.0
-         value.each do |v|
-           norm += v * v
-         end
-         norm = Math.sqrt(norm)
-         value.map { |v| v / norm }
-       else
-         value
+       dimensions ||= column_info[:dimensions]
+       raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+       raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+       if normalize
+         norm = Math.sqrt(value.sum { |v| v * v })
+
+         # store zero vector as all zeros
+         # since NaN makes the distance always 0
+         # could also throw error
+
+         # safe to update in-place since earlier map dups
+         value.map! { |v| v / norm } if norm > 0
        end
+
+       value
+     end
+
+     def self.column_info(model, attribute_name)
+       attribute_name = attribute_name.to_s
+       column = model.columns.detect { |c| c.name == attribute_name }
+       {
+         type: column.try(:type),
+         dimensions: column.try(:limit)
+       }
+     end
+
+     # need to be careful to avoid loading column info before needed
+     def column_info
+       @column_info ||= self.class.column_info(@model, @attribute_name)
+     end
+
+     def cast(value)
+       self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
      end

      def serialize(value)
-       "(#{cast(value).join(", ")})" unless value.nil?
+       unless value.nil?
+         if column_info[:type] == :vector
+           "[#{cast(value).join(", ")}]"
+         else
+           "(#{cast(value).join(", ")})"
+         end
+       end
      end

      def deserialize(value)
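The reworked `cast` and `serialize` can be exercised outside Active Record. The sketch below reimplements the same steps as free functions: dimension check, finiteness check, optional normalization that leaves zero vectors untouched, and the per-extension literal format. `cast_vector`, `serialize_vector`, and `VectorError` are made-up names, and the column metadata that the gem reads from the database is replaced here by plain arguments.

```ruby
# Stand-in for Neighbor::Error; hypothetical name.
class VectorError < StandardError; end

def cast_vector(value, dimensions: nil, normalize: false)
  value = value.to_a.map(&:to_f) # map returns a new array, so map! below is safe

  if dimensions && value.size != dimensions
    raise VectorError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise VectorError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # leave a zero vector as all zeros instead of dividing by zero
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

# cube literals use parentheses, vector literals use square brackets
def serialize_vector(value, type:)
  body = value.join(", ")
  type == :vector ? "[#{body}]" : "(#{body})"
end

unit = cast_vector([3, 4], dimensions: 2, normalize: true)
puts serialize_vector(unit, type: :cube) # prints "(0.6, 0.8)"

zero = cast_vector([0, 0], normalize: true)
puts serialize_vector(zero, type: :vector) # prints "[0.0, 0.0]"
```

Without the `norm > 0` guard, normalizing a zero vector would divide 0.0 by 0.0 and store NaN components, which is the case the changelog entry for 0.2.0 describes as fixed.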
data/lib/neighbor/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Neighbor
-   VERSION = "0.1.0"
+   VERSION = "0.2.1"
  end
data/lib/neighbor.rb CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
  module Neighbor
    class Error < StandardError; end

-   module RegisterCubeType
+   module RegisterTypes
      def initialize_type_map(m = type_map)
        super
        m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+       m.register_type "vector" do |_, _, sql_type|
+         limit = extract_limit(sql_type)
+         ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+       end
      end
    end
  end
@@ -21,7 +25,28 @@ ActiveSupport.on_load(:active_record) do

    extend Neighbor::Model

-   # prevent unknown OID warning
    require "active_record/connection_adapters/postgresql_adapter"
-   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterCubeType)
+
+   # ensure schema can be dumped
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}
+
+   # ensure schema can be loaded
+   if ActiveRecord::VERSION::MAJOR >= 6
+     ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
+   else
+     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
+       args.each { |name| column(name, :cube, options) }
+     end
+     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+       args.each { |name| column(name, :vector, options) }
+     end
+   end
+
+   # prevent unknown OID warning
+   if ActiveRecord::VERSION::MAJOR >= 7
+     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.singleton_class.prepend(Neighbor::RegisterTypes)
+   else
+     ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
+   end
  end
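The version check at the end of this hunk switches between `prepend` and `singleton_class.prepend`. The former wraps an instance method and the latter wraps a class-level method, which suggests that `initialize_type_map` is defined on the adapter class itself in Active Record 7. That is an inference from the diff rather than something it states. A plain-Ruby sketch of the difference, using made-up `InstanceAdapter` and `ClassAdapter` stand-ins and a simplified `RegisterTypes` module:

```ruby
# Made-up stand-ins; these are not the real Active Record classes.
module RegisterTypes
  def initialize_type_map(map = [])
    super # bare super forwards the same map to the wrapped method
    map << "vector"
  end
end

class InstanceAdapter
  # hook defined as an instance method
  def initialize_type_map(map = [])
    map << "builtin"
  end
end

class ClassAdapter
  # hook defined as a class-level method
  def self.initialize_type_map(map = [])
    map << "builtin"
  end
end

# wraps the instance method
InstanceAdapter.prepend(RegisterTypes)
# wraps the class-level method
ClassAdapter.singleton_class.prepend(RegisterTypes)

p InstanceAdapter.new.initialize_type_map # prints ["builtin", "vector"]
p ClassAdapter.initialize_type_map        # prints ["builtin", "vector"]
```

Prepending to the wrong target would not raise; the module's method would simply never sit in front of the real hook, so the extra types would silently go unregistered.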
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.1
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-16 00:00:00.000000000 Z
+ date: 2021-12-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '5.2'
  description:
  email: andrew@ankane.org
  executables: []
@@ -33,8 +33,10 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
- - lib/generators/neighbor/install_generator.rb
- - lib/generators/neighbor/templates/migration.rb.tt
+ - lib/generators/neighbor/cube_generator.rb
+ - lib/generators/neighbor/templates/cube.rb.tt
+ - lib/generators/neighbor/templates/vector.rb.tt
+ - lib/generators/neighbor/vector_generator.rb
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/vector.rb
@@ -58,7 +60,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.4
+ rubygems_version: 3.2.32
  signing_key:
  specification_version: 4
  summary: Nearest neighbor search for Rails and Postgres