neighbor 0.1.2 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -0
- data/README.md +72 -23
- data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb} +2 -2
- data/lib/generators/neighbor/templates/{migration.rb.tt → cube.rb.tt} +0 -0
- data/lib/generators/neighbor/templates/vector.rb.tt +5 -0
- data/lib/generators/neighbor/vector_generator.rb +18 -0
- data/lib/neighbor.rb +11 -3
- data/lib/neighbor/model.rb +53 -17
- data/lib/neighbor/vector.rb +42 -10
- data/lib/neighbor/version.rb +1 -1
- metadata +8 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9a8839eeaeeef5b3ff8deb63165791929a81f9a9c1c807dd3d094e9bf770e881
+  data.tar.gz: 14035d37686f7035bc0e874de7933259dc0b8be9d046fca1d31867cacaf1c922
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bfbf7bca1b4290abf2ec91dd5ea032dd1878ab9d7a0938dfed4e6e63441aa557397345090713222a735beb32bc9f1b6cd7971536de9ca427db8650f053917040
+  data.tar.gz: 1ba36cba82aafbb2bb003e016f0817884b1129a7d0a8b2be7385e438999e6e7728224f7e146d4de611e171fc851ae64ebea3aa7bfa65f55868cb0c5477d8ecff
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,16 @@
+## 0.2.0 (2021-04-21)
+
+- Added support for pgvector
+- Added `normalize` option
+- Made `dimensions` optional
+- Raise an error if `nearest_neighbors` already defined
+- Raise an error for non-finite values
+- Fixed NaN with zero vectors and cosine distance
+
+Breaking changes
+
+- The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
 ## 0.1.2 (2021-02-21)
 
 - Added `nearest_neighbors` scope
data/README.md
CHANGED
@@ -12,15 +12,23 @@ Add this line to your application’s Gemfile:
 gem 'neighbor'
 ```
 
-
+## Choose An Extension
+
+Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+For cube, run:
 
 ```sh
-
-rails generate neighbor:install
+rails generate neighbor:cube
 rails db:migrate
 ```
 
-
+For vector, install [pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+```sh
+rails generate neighbor:vector
+rails db:migrate
+```
 
 ## Getting Started
 
@@ -30,6 +38,8 @@ Create a migration
 class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
   def change
     add_column :items, :neighbor_vector, :cube
+    # or
+    add_column :items, :neighbor_vector, :vector, limit: 3
   end
 end
 ```
@@ -38,7 +48,7 @@ Add to your model
 
 ```ruby
 class Item < ApplicationRecord
-  has_neighbors
+  has_neighbors
 end
 ```
 
@@ -48,49 +58,76 @@ Update the vectors
 item.update(neighbor_vector: [1.0, 1.2, 0.5])
 ```
 
-> With cosine distance (the default), vectors are normalized before being stored
-
 Get the nearest neighbors to a record
 
 ```ruby
-item.nearest_neighbors.first(5)
+item.nearest_neighbors(distance: "euclidean").first(5)
 ```
 
 Get the nearest neighbors to a vector
 
 ```ruby
-Item.nearest_neighbors([1,
+Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
 ```
 
 ## Distance
 
-
+Supported values are:
+
+- `euclidean`
+- `cosine`
+- `taxicab` (cube only)
+- `chebyshev` (cube only)
+- `inner_product` (vector only)
+
+For cosine distance with cube, vectors must be normalized before being stored.
 
 ```ruby
 class Item < ApplicationRecord
-  has_neighbors
+  has_neighbors normalize: true
 end
 ```
 
-
-
-- `cosine` (default)
-- `euclidean`
-- `taxicab`
-- `chebyshev`
-
-For inner product, see [this example](examples/disco_user_recs.rb)
+For inner product with cube, see [this example](examples/disco_user_recs.rb).
 
 Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
 
 ```ruby
-nearest_item = item.nearest_neighbors.first
+nearest_item = item.nearest_neighbors(distance: "euclidean").first
 nearest_item.neighbor_distance
 ```
 
 ## Dimensions
 
-
+The cube data type is limited 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+```ruby
+class Movie < ApplicationRecord
+  has_neighbors dimensions: 3
+end
+```
+
+## Indexing
+
+For vector, add an approximate index to speed up queries. Create a migration with:
+
+```ruby
+class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+  def change
+    add_index :items, :neighbor_vector, using: :ivfflat
+  end
+end
+```
+
+Add `opclass: :vector_cosine_ops` for cosine distance and `opclass: :vector_ip_ops` for inner product.
+
+Set the number of probes
+
+```ruby
+Item.connection.execute("SET ivfflat.probes = 3")
+```
 
 ## Example
 
@@ -107,7 +144,7 @@ And add `has_neighbors`
 
 ```ruby
 class Movie < ApplicationRecord
-  has_neighbors dimensions: 20
+  has_neighbors dimensions: 20, normalize: true
 end
 ```
 
@@ -131,11 +168,23 @@ And get similar movies
 
 ```ruby
 movie = Movie.find_by(name: "Star Wars (1977)")
-movie.nearest_neighbors.first(5).map(&:name)
+movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
 ```
 
 [Complete code](examples/disco_item_recs.rb)
 
+## Upgrading
+
+### 0.2.0
+
+The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+```ruby
+class Item < ApplicationRecord
+  has_neighbors normalize: true
+end
+```
+
 ## History
 
 View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
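For reference, the five distance names listed in the README hunk above correspond to the standard definitions sketched below. This is plain Ruby for illustration only: the gem computes distances inside Postgres through cube or pgvector operators, and the `Distances` module and its method names are hypothetical, not part of neighbor.

```ruby
# Hypothetical helper module; not part of the neighbor gem.
module Distances
  module_function

  # Straight-line (L2) distance
  def euclidean(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end

  # Sum of absolute coordinate differences (L1)
  def taxicab(a, b)
    a.zip(b).sum { |x, y| (x - y).abs }
  end

  # Largest absolute coordinate difference (L-infinity)
  def chebyshev(a, b)
    a.zip(b).map { |x, y| (x - y).abs }.max
  end

  def inner_product(a, b)
    a.zip(b).sum { |x, y| x * y }
  end

  # 1 - cosine similarity; undefined when either vector is all zeros
  def cosine(a, b)
    norms = Math.sqrt(inner_product(a, a)) * Math.sqrt(inner_product(b, b))
    1 - inner_product(a, b) / norms
  end
end

a = [1.0, 2.0, 2.0]
b = [4.0, 6.0, 2.0]

puts Distances.euclidean(a, b) # => 5.0
puts Distances.taxicab(a, b)   # => 7.0
puts Distances.chebyshev(a, b) # => 4.0
puts Distances.inner_product(a, b)
puts Distances.cosine(a, b)
```

Inputs are floats here so that the cosine division is floating-point.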
@@ -2,12 +2,12 @@ require "rails/generators/active_record"
 
 module Neighbor
   module Generators
-    class
+    class CubeGenerator < Rails::Generators::Base
       include ActiveRecord::Generators::Migration
       source_root File.join(__dir__, "templates")
 
       def copy_migration
-        migration_template "
+        migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
       end
 
       def migration_version
File without changes
@@ -0,0 +1,18 @@
+require "rails/generators/active_record"
+
+module Neighbor
+  module Generators
+    class VectorGenerator < Rails::Generators::Base
+      include ActiveRecord::Generators::Migration
+      source_root File.join(__dir__, "templates")
+
+      def copy_migration
+        migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+      end
+
+      def migration_version
+        "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+      end
+    end
+  end
+end
data/lib/neighbor.rb
CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
 module Neighbor
   class Error < StandardError; end
 
-  module
+  module RegisterTypes
     def initialize_type_map(m = type_map)
       super
       m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+      m.register_type "vector" do |_, _, sql_type|
+        limit = extract_limit(sql_type)
+        ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+      end
     end
   end
 end
@@ -25,16 +29,20 @@ ActiveSupport.on_load(:active_record) do
 
   # ensure schema can be dumped
   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}
 
   # ensure schema can be loaded
   if ActiveRecord::VERSION::MAJOR >= 6
-    ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube)
+    ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
   else
     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
       args.each { |name| column(name, :cube, options) }
     end
+    ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+      args.each { |name| column(name, :vector, options) }
+    end
   end
 
   # prevent unknown OID warning
-  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::
+  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
 end
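The `vector` type registration above passes the column's SQL type to `extract_limit`, and `vector.rb` later reads that limit as the number of dimensions. As a rough illustration of the idea, and assuming the SQL type string looks like `vector(3)` when a limit was given, a stand-in for that parsing step could be written as follows. `limit_from_sql_type` is a hypothetical name, not the adapter's own method.

```ruby
# Hypothetical stand-in for the adapter's limit extraction; assumes the SQL
# type is "vector(3)" when a limit was declared and "vector" otherwise.
def limit_from_sql_type(sql_type)
  match = sql_type.match(/\((\d+)\)/)
  match && match[1].to_i
end

p limit_from_sql_type("vector(3)") # => 3
p limit_from_sql_type("vector")    # => nil
```

A `nil` limit corresponds to a column declared without `limit:`, in which case no dimension count is available from the column.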
data/lib/neighbor/model.rb
CHANGED
@@ -1,42 +1,78 @@
 module Neighbor
   module Model
-    def has_neighbors(dimensions
-      distance = distance.to_s
-      raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
-
+    def has_neighbors(dimensions: nil, normalize: nil)
       # TODO make configurable
       # likely use argument
       attribute_name = :neighbor_vector
 
       class_eval do
-
+        raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)
+
+        attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)
 
-        scope :nearest_neighbors, ->(vector) {
+        scope :nearest_neighbors, ->(vector, distance:) {
           return none if vector.nil?
 
+          distance = distance.to_s
+
           quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"
 
+          column_info = klass.type_for_attribute(attribute_name).column_info
+
           operator =
-
-
-            "
-
-            "
+            if column_info[:type] == :vector
+              case distance
+              when "inner_product"
+                "<#>"
+              when "cosine"
+                "<=>"
+              when "euclidean"
+                "<->"
+              end
             else
-
+              case distance
+              when "taxicab"
+                "<#>"
+              when "chebyshev"
+                "<=>"
+              when "euclidean", "cosine"
+                "<->"
+              end
             end
 
+          raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+          # ensure normalize set (can be true or false)
+          if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+            raise Neighbor::Error, "Set normalize for cosine distance with cube"
+          end
+
+          vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
           # important! neighbor_vector should already be typecast
           # but use to_f as extra safeguard against SQL injection
-
-
+          query =
+            if column_info[:type] == :vector
+              connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+            else
+              "cube(array[#{vector.map(&:to_f).join(", ")}])"
+            end
+
+          order = "#{quoted_attribute} #{operator} #{query}"
 
           # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
           # with normalized vectors:
           # cosine similarity = 1 - (euclidean distance)**2 / 2
           # cosine distance = 1 - cosine similarity
           # this transformation doesn't change the order, so only needed for select
-          neighbor_distance =
+          neighbor_distance =
+            if column_info[:type] != :vector && distance == "cosine"
+              "POWER(#{order}, 2) / 2.0"
+            elsif column_info[:type] == :vector && distance == "inner_product"
+              "(#{order}) * -1"
+            else
+              order
+            end
 
           # for select, use column_names instead of * to account for ignored columns
           select(*column_names, "#{neighbor_distance} AS neighbor_distance")
@@ -44,10 +80,10 @@ module Neighbor
             .order(Arel.sql(order))
         }
 
-        define_method :nearest_neighbors do
+        define_method :nearest_neighbors do |**options|
           self.class
             .where.not(self.class.primary_key => send(self.class.primary_key))
-            .nearest_neighbors(send(attribute_name))
+            .nearest_neighbors(send(attribute_name), **options)
         end
       end
     end
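The comments in the `model.rb` hunk above rely on an identity: for unit-length vectors, cosine distance equals the squared Euclidean distance divided by two, which is why the cube branch selects `POWER(order, 2) / 2.0`. A small plain-Ruby check of that identity, using hypothetical helper names and no database, might look like this:

```ruby
# Hypothetical helpers for illustration; the gem does this arithmetic in SQL.
def unit_vector(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

a = unit_vector([1.0, 1.2, 0.5])
b = unit_vector([0.9, 1.3, 1.1])

from_euclidean = euclidean_distance(a, b)**2 / 2.0
direct = cosine_distance(a, b)

# The two values agree up to floating-point rounding.
agree = (from_euclidean - direct).abs < 1e-12
puts agree
```

Because squaring and halving is monotonic for non-negative distances, ordering by the Euclidean operator already yields the cosine ordering; the transformation only affects the reported `neighbor_distance`.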
data/lib/neighbor/vector.rb
CHANGED
@@ -1,29 +1,61 @@
 module Neighbor
   class Vector < ActiveRecord::Type::Value
-    def initialize(dimensions:,
+    def initialize(dimensions:, normalize:, model:, attribute_name:)
       super()
       @dimensions = dimensions
-      @
+      @normalize = normalize
+      @model = model
+      @attribute_name = attribute_name
     end
 
-    def self.cast(value, dimensions:,
+    def self.cast(value, dimensions:, normalize:, column_info:)
       value = value.to_a.map(&:to_f)
-      raise Error, "Expected #{dimensions} dimensions, not #{value.size}" unless value.size == dimensions
 
-
+      dimensions ||= column_info[:dimensions]
+      raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+      raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+      if normalize
         norm = Math.sqrt(value.sum { |v| v * v })
-
-
-
+
+        # store zero vector as all zeros
+        # since NaN makes the distance always 0
+        # could also throw error
+
+        # safe to update in-place since earlier map dups
+        value.map! { |v| v / norm } if norm > 0
       end
+
+      value
+    end
+
+    def self.column_info(model, attribute_name)
+      attribute_name = attribute_name.to_s
+      column = model.columns.detect { |c| c.name == attribute_name }
+      {
+        type: column.try(:type),
+        dimensions: column.try(:limit)
+      }
+    end
+
+    # need to be careful to avoid loading column info before needed
+    def column_info
+      @column_info ||= self.class.column_info(@model, @attribute_name)
     end
 
     def cast(value)
-      self.class.cast(value, dimensions: @dimensions,
+      self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
     end
 
     def serialize(value)
-
+      unless value.nil?
+        if column_info[:type] == :vector
+          "[#{cast(value).join(", ")}]"
+        else
+          "(#{cast(value).join(", ")})"
+        end
+      end
     end
 
     def deserialize(value)
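To see how the new checks in `Vector.cast` behave, the following plain-Ruby sketch mirrors the logic from the hunk above without ActiveRecord. `cast_vector` is a hypothetical name, `ArgumentError` stands in for `Neighbor::Error`, and the `column_info` fallback for dimensions is left out.

```ruby
# Plain-Ruby sketch of the cast checks; cast_vector is a hypothetical name
# and ArgumentError stands in for Neighbor::Error.
def cast_vector(value, dimensions: nil, normalize: false)
  value = value.to_a.map(&:to_f)

  if dimensions && value.size != dimensions
    raise ArgumentError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise ArgumentError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # leave a zero vector as all zeros instead of dividing by zero
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

p cast_vector([3, 4], normalize: true) # => [0.6, 0.8]
p cast_vector([0, 0], normalize: true) # => [0.0, 0.0]

begin
  cast_vector([1.0, Float::INFINITY])
rescue ArgumentError => e
  puts e.message # => Values must be finite
end
```

The zero-vector case is the one the changelog entry "Fixed NaN with zero vectors and cosine distance" refers to: dividing by a zero norm would otherwise produce NaN components.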
data/lib/neighbor/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: neighbor
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-
+date: 2021-04-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '
+        version: '5.2'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '
+        version: '5.2'
 description:
 email: andrew@ankane.org
 executables: []
@@ -33,8 +33,10 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
-- lib/generators/neighbor/
-- lib/generators/neighbor/templates/
+- lib/generators/neighbor/cube_generator.rb
+- lib/generators/neighbor/templates/cube.rb.tt
+- lib/generators/neighbor/templates/vector.rb.tt
+- lib/generators/neighbor/vector_generator.rb
 - lib/neighbor.rb
 - lib/neighbor/model.rb
 - lib/neighbor/vector.rb