neighbor 0.1.2 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -0
- data/README.md +72 -23
- data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb} +2 -2
- data/lib/generators/neighbor/templates/{migration.rb.tt → cube.rb.tt} +0 -0
- data/lib/generators/neighbor/templates/vector.rb.tt +5 -0
- data/lib/generators/neighbor/vector_generator.rb +18 -0
- data/lib/neighbor.rb +11 -3
- data/lib/neighbor/model.rb +53 -17
- data/lib/neighbor/vector.rb +42 -10
- data/lib/neighbor/version.rb +1 -1
- metadata +8 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9a8839eeaeeef5b3ff8deb63165791929a81f9a9c1c807dd3d094e9bf770e881
+  data.tar.gz: 14035d37686f7035bc0e874de7933259dc0b8be9d046fca1d31867cacaf1c922
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bfbf7bca1b4290abf2ec91dd5ea032dd1878ab9d7a0938dfed4e6e63441aa557397345090713222a735beb32bc9f1b6cd7971536de9ca427db8650f053917040
+  data.tar.gz: 1ba36cba82aafbb2bb003e016f0817884b1129a7d0a8b2be7385e438999e6e7728224f7e146d4de611e171fc851ae64ebea3aa7bfa65f55868cb0c5477d8ecff
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,16 @@
+## 0.2.0 (2021-04-21)
+
+- Added support for pgvector
+- Added `normalize` option
+- Made `dimensions` optional
+- Raise an error if `nearest_neighbors` already defined
+- Raise an error for non-finite values
+- Fixed NaN with zero vectors and cosine distance
+
+Breaking changes
+
+- The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default
+
 ## 0.1.2 (2021-02-21)
 
 - Added `nearest_neighbors` scope
data/README.md
CHANGED
@@ -12,15 +12,23 @@ Add this line to your application’s Gemfile:
 gem 'neighbor'
 ```
 
-
+## Choose An Extension
+
+Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/ankane/pgvector). cube ships with Postgres, while vector supports approximate nearest neighbor search.
+
+For cube, run:
 
 ```sh
-
-rails generate neighbor:install
+rails generate neighbor:cube
 rails db:migrate
 ```
 
-
+For vector, install [pgvector](https://github.com/ankane/pgvector#installation) and run:
+
+```sh
+rails generate neighbor:vector
+rails db:migrate
+```
 
 ## Getting Started
 
@@ -30,6 +38,8 @@ Create a migration
 class AddNeighborVectorToItems < ActiveRecord::Migration[6.1]
   def change
     add_column :items, :neighbor_vector, :cube
+    # or
+    add_column :items, :neighbor_vector, :vector, limit: 3
   end
 end
 ```
@@ -38,7 +48,7 @@ Add to your model
 
 ```ruby
 class Item < ApplicationRecord
-  has_neighbors
+  has_neighbors
 end
 ```
 
@@ -48,49 +58,76 @@ Update the vectors
 item.update(neighbor_vector: [1.0, 1.2, 0.5])
 ```
 
-> With cosine distance (the default), vectors are normalized before being stored
-
 Get the nearest neighbors to a record
 
 ```ruby
-item.nearest_neighbors.first(5)
+item.nearest_neighbors(distance: "euclidean").first(5)
 ```
 
 Get the nearest neighbors to a vector
 
 ```ruby
-Item.nearest_neighbors([1,
+Item.nearest_neighbors([0.9, 1.3, 1.1], distance: "euclidean").first(5)
 ```
 
 ## Distance
 
-
+Supported values are:
+
+- `euclidean`
+- `cosine`
+- `taxicab` (cube only)
+- `chebyshev` (cube only)
+- `inner_product` (vector only)
+
+For cosine distance with cube, vectors must be normalized before being stored.
 
 ```ruby
 class Item < ApplicationRecord
-  has_neighbors
+  has_neighbors normalize: true
 end
 ```
 
-
-
-- `cosine` (default)
-- `euclidean`
-- `taxicab`
-- `chebyshev`
-
-For inner product, see [this example](examples/disco_user_recs.rb)
+For inner product with cube, see [this example](examples/disco_user_recs.rb).
 
 Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
 
 ```ruby
-nearest_item = item.nearest_neighbors.first
+nearest_item = item.nearest_neighbors(distance: "euclidean").first
 nearest_item.neighbor_distance
 ```
 
 ## Dimensions
 
-
+The cube data type is limited 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type is limited to 1024 dimensions.
+
+For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
+
+```ruby
+class Movie < ApplicationRecord
+  has_neighbors dimensions: 3
+end
+```
+
+## Indexing
+
+For vector, add an approximate index to speed up queries. Create a migration with:
+
+```ruby
+class AddIndexToItemsNeighborVector < ActiveRecord::Migration[6.1]
+  def change
+    add_index :items, :neighbor_vector, using: :ivfflat
+  end
+end
+```
+
+Add `opclass: :vector_cosine_ops` for cosine distance and `opclass: :vector_ip_ops` for inner product.
+
+Set the number of probes
+
+```ruby
+Item.connection.execute("SET ivfflat.probes = 3")
+```
 
 ## Example
 
@@ -107,7 +144,7 @@ And add `has_neighbors`
 
 ```ruby
 class Movie < ApplicationRecord
-  has_neighbors dimensions: 20
+  has_neighbors dimensions: 20, normalize: true
 end
 ```
 
@@ -131,11 +168,23 @@ And get similar movies
 
 ```ruby
 movie = Movie.find_by(name: "Star Wars (1977)")
-movie.nearest_neighbors.first(5).map(&:name)
+movie.nearest_neighbors(distance: "cosine").first(5).map(&:name)
 ```
 
 [Complete code](examples/disco_item_recs.rb)
 
+## Upgrading
+
+### 0.2.0
+
+The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
+
+```ruby
+class Item < ApplicationRecord
+  has_neighbors normalize: true
+end
+```
+
 ## History
 
 View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
data/lib/generators/neighbor/{install_generator.rb → cube_generator.rb}
@@ -2,12 +2,12 @@ require "rails/generators/active_record"
 
 module Neighbor
   module Generators
-    class
+    class CubeGenerator < Rails::Generators::Base
       include ActiveRecord::Generators::Migration
       source_root File.join(__dir__, "templates")
 
       def copy_migration
-        migration_template "
+        migration_template "cube.rb", "db/migrate/install_neighbor_cube.rb", migration_version: migration_version
       end
 
       def migration_version
data/lib/generators/neighbor/templates/{migration.rb.tt → cube.rb.tt}
File without changes
data/lib/generators/neighbor/vector_generator.rb
@@ -0,0 +1,18 @@
+require "rails/generators/active_record"
+
+module Neighbor
+  module Generators
+    class VectorGenerator < Rails::Generators::Base
+      include ActiveRecord::Generators::Migration
+      source_root File.join(__dir__, "templates")
+
+      def copy_migration
+        migration_template "vector.rb", "db/migrate/install_neighbor_vector.rb", migration_version: migration_version
+      end
+
+      def migration_version
+        "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
+      end
+    end
+  end
+end
data/lib/neighbor.rb
CHANGED
@@ -7,10 +7,14 @@ require "neighbor/version"
 module Neighbor
   class Error < StandardError; end
 
-  module
+  module RegisterTypes
     def initialize_type_map(m = type_map)
       super
       m.register_type "cube", ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:cube)
+      m.register_type "vector" do |_, _, sql_type|
+        limit = extract_limit(sql_type)
+        ActiveRecord::ConnectionAdapters::PostgreSQL::OID::SpecializedString.new(:vector, limit: limit)
+      end
     end
   end
 end
@@ -25,16 +29,20 @@ ActiveSupport.on_load(:active_record) do
 
   # ensure schema can be dumped
   ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:cube] = {name: "cube"}
+  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:vector] = {name: "vector"}
 
   # ensure schema can be loaded
   if ActiveRecord::VERSION::MAJOR >= 6
-    ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube)
+    ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :cube, :vector)
   else
     ActiveRecord::ConnectionAdapters::TableDefinition.define_method :cube do |*args, **options|
       args.each { |name| column(name, :cube, options) }
     end
+    ActiveRecord::ConnectionAdapters::TableDefinition.define_method :vector do |*args, **options|
+      args.each { |name| column(name, :vector, options) }
+    end
   end
 
   # prevent unknown OID warning
-  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::
+  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(Neighbor::RegisterTypes)
 end
data/lib/neighbor/model.rb
CHANGED
@@ -1,42 +1,78 @@
 module Neighbor
   module Model
-    def has_neighbors(dimensions
-      distance = distance.to_s
-      raise ArgumentError, "Invalid distance: #{distance}" unless %w(cosine euclidean taxicab chebyshev).include?(distance)
-
+    def has_neighbors(dimensions: nil, normalize: nil)
       # TODO make configurable
       # likely use argument
       attribute_name = :neighbor_vector
 
       class_eval do
-
+        raise Error, "nearest_neighbors already defined" if method_defined?(:nearest_neighbors)
+
+        attribute attribute_name, Neighbor::Vector.new(dimensions: dimensions, normalize: normalize, model: self, attribute_name: attribute_name)
 
-        scope :nearest_neighbors, ->(vector) {
+        scope :nearest_neighbors, ->(vector, distance:) {
           return none if vector.nil?
 
+          distance = distance.to_s
+
           quoted_attribute = "#{connection.quote_table_name(table_name)}.#{connection.quote_column_name(attribute_name)}"
 
+          column_info = klass.type_for_attribute(attribute_name).column_info
+
           operator =
-
-
-              "
-
-              "
+            if column_info[:type] == :vector
+              case distance
+              when "inner_product"
+                "<#>"
+              when "cosine"
+                "<=>"
+              when "euclidean"
+                "<->"
+              end
             else
-
+              case distance
+              when "taxicab"
+                "<#>"
+              when "chebyshev"
+                "<=>"
+              when "euclidean", "cosine"
+                "<->"
+              end
             end
 
+          raise ArgumentError, "Invalid distance: #{distance}" unless operator
+
+          # ensure normalize set (can be true or false)
+          if distance == "cosine" && column_info[:type] == :cube && normalize.nil?
+            raise Neighbor::Error, "Set normalize for cosine distance with cube"
+          end
+
+          vector = Neighbor::Vector.cast(vector, dimensions: dimensions, normalize: normalize, column_info: column_info)
+
           # important! neighbor_vector should already be typecast
           # but use to_f as extra safeguard against SQL injection
-
-
+          query =
+            if column_info[:type] == :vector
+              connection.quote("[#{vector.map(&:to_f).join(", ")}]")
+            else
+              "cube(array[#{vector.map(&:to_f).join(", ")}])"
+            end
+
+          order = "#{quoted_attribute} #{operator} #{query}"
 
           # https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance
           # with normalized vectors:
           # cosine similarity = 1 - (euclidean distance)**2 / 2
           # cosine distance = 1 - cosine similarity
           # this transformation doesn't change the order, so only needed for select
-          neighbor_distance =
+          neighbor_distance =
+            if column_info[:type] != :vector && distance == "cosine"
+              "POWER(#{order}, 2) / 2.0"
+            elsif column_info[:type] == :vector && distance == "inner_product"
+              "(#{order}) * -1"
+            else
+              order
+            end
 
           # for select, use column_names instead of * to account for ignored columns
           select(*column_names, "#{neighbor_distance} AS neighbor_distance")
@@ -44,10 +80,10 @@ module Neighbor
             .order(Arel.sql(order))
         }
 
-        define_method :nearest_neighbors do
+        define_method :nearest_neighbors do |**options|
           self.class
             .where.not(self.class.primary_key => send(self.class.primary_key))
-            .nearest_neighbors(send(attribute_name))
+            .nearest_neighbors(send(attribute_name), **options)
         end
       end
     end
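Reviewer note: the operator selection added to model.rb reads as a lookup from column type and distance name to a Postgres operator. Below is a minimal standalone sketch in plain Ruby. `OPERATORS` and `operator_for` are hypothetical names used for illustration only and are not part of the gem; the operator strings are copied from the diff.

```ruby
# Hypothetical lookup table mirroring the two case statements in the diff.
# The names OPERATORS and operator_for are illustrative, not gem API.
OPERATORS = {
  vector: {"inner_product" => "<#>", "cosine" => "<=>", "euclidean" => "<->"},
  cube: {"taxicab" => "<#>", "chebyshev" => "<=>", "euclidean" => "<->", "cosine" => "<->"}
}.freeze

def operator_for(column_type, distance)
  # any non-vector column falls through to the cube branch, as in the diff
  table = column_type == :vector ? OPERATORS[:vector] : OPERATORS[:cube]
  table.fetch(distance.to_s) { raise ArgumentError, "Invalid distance: #{distance}" }
end

puts operator_for(:vector, "cosine")
puts operator_for(:cube, :cosine)
```

The same symbol maps to different distances depending on the extension (`<#>` is inner product for vector but taxicab for cube), which is why the lookup is keyed by column type first.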
data/lib/neighbor/vector.rb
CHANGED
@@ -1,29 +1,61 @@
 module Neighbor
   class Vector < ActiveRecord::Type::Value
-    def initialize(dimensions:,
+    def initialize(dimensions:, normalize:, model:, attribute_name:)
       super()
       @dimensions = dimensions
-      @
+      @normalize = normalize
+      @model = model
+      @attribute_name = attribute_name
     end
 
-    def self.cast(value, dimensions:,
+    def self.cast(value, dimensions:, normalize:, column_info:)
       value = value.to_a.map(&:to_f)
-      raise Error, "Expected #{dimensions} dimensions, not #{value.size}" unless value.size == dimensions
 
-
+      dimensions ||= column_info[:dimensions]
+      raise Error, "Expected #{dimensions} dimensions, not #{value.size}" if dimensions && value.size != dimensions
+
+      raise Error, "Values must be finite" unless value.all?(&:finite?)
+
+      if normalize
         norm = Math.sqrt(value.sum { |v| v * v })
-
-
-
+
+        # store zero vector as all zeros
+        # since NaN makes the distance always 0
+        # could also throw error
+
+        # safe to update in-place since earlier map dups
+        value.map! { |v| v / norm } if norm > 0
       end
+
+      value
+    end
+
+    def self.column_info(model, attribute_name)
+      attribute_name = attribute_name.to_s
+      column = model.columns.detect { |c| c.name == attribute_name }
+      {
+        type: column.try(:type),
+        dimensions: column.try(:limit)
+      }
+    end
+
+    # need to be careful to avoid loading column info before needed
+    def column_info
+      @column_info ||= self.class.column_info(@model, @attribute_name)
     end
 
     def cast(value)
-      self.class.cast(value, dimensions: @dimensions,
+      self.class.cast(value, dimensions: @dimensions, normalize: @normalize, column_info: column_info) unless value.nil?
     end
 
     def serialize(value)
-
+      unless value.nil?
+        if column_info[:type] == :vector
+          "[#{cast(value).join(", ")}]"
+        else
+          "(#{cast(value).join(", ")})"
+        end
+      end
     end
 
     def deserialize(value)
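Reviewer note: the checks added to `self.cast` can be tried outside Rails. The sketch below is a standalone approximation under stated assumptions: `cast_vector` is a hypothetical helper name, `ArgumentError` stands in for `Neighbor::Error`, and the `column_info` fallback for `dimensions` is left out. It also checks numerically the identity quoted in the model.rb comments, that for normalized vectors cosine distance equals squared Euclidean distance divided by 2.

```ruby
# Hypothetical standalone version of the checks in Neighbor::Vector.cast.
# ArgumentError replaces Neighbor::Error so the sketch runs without the gem.
def cast_vector(value, dimensions: nil, normalize: false)
  value = value.to_a.map(&:to_f)

  if dimensions && value.size != dimensions
    raise ArgumentError, "Expected #{dimensions} dimensions, not #{value.size}"
  end
  raise ArgumentError, "Values must be finite" unless value.all?(&:finite?)

  if normalize
    norm = Math.sqrt(value.sum { |v| v * v })
    # a zero vector is left as all zeros instead of dividing by zero
    value.map! { |v| v / norm } if norm > 0
  end

  value
end

a = cast_vector([3, 4], normalize: true)
b = cast_vector([1, 0], normalize: true)

euclidean = Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
cosine_distance = 1 - a.zip(b).sum { |x, y| x * y }

# for unit vectors the two expressions agree up to floating point error
difference = (euclidean**2 / 2.0 - cosine_distance).abs
puts difference < 1e-12
```

This is the relationship the cube branch appears to rely on when it selects `POWER(..., 2) / 2.0` as `neighbor_distance` for cosine distance, and it only holds when the stored vectors are normalized.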
data/lib/neighbor/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: neighbor
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-
+date: 2021-04-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '
+        version: '5.2'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '
+        version: '5.2'
 description:
 email: andrew@ankane.org
 executables: []
@@ -33,8 +33,10 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
-- lib/generators/neighbor/
-- lib/generators/neighbor/templates/
+- lib/generators/neighbor/cube_generator.rb
+- lib/generators/neighbor/templates/cube.rb.tt
+- lib/generators/neighbor/templates/vector.rb.tt
+- lib/generators/neighbor/vector_generator.rb
 - lib/neighbor.rb
 - lib/neighbor/model.rb
 - lib/neighbor/vector.rb