neighbor 0.4.1 → 0.4.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +137 -34
- data/lib/neighbor/reranking.rb +27 -0
- data/lib/neighbor/type/cube.rb +2 -2
- data/lib/neighbor/type/halfvec.rb +2 -2
- data/lib/neighbor/type/sparsevec.rb +1 -1
- data/lib/neighbor/type/vector.rb +2 -2
- data/lib/neighbor/utils.rb +4 -0
- data/lib/neighbor/version.rb +1 -1
- data/lib/neighbor.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e8d611fd277cd48d309b2a087fdeb22f39f43d8ef81fcab57763bd5b4b2e48b3
+  data.tar.gz: fe0a5f7e4aa1ebd81f8c5849be67dc0f3c948d53d313b80259def04b9b7e9e84
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0d9d9d0be9f2929f1eab7e5df52a0606ca8c220ea74e6ef349c2ab77e5da44bd4b98ed4d2271345b2d679882f1790ac54b8373c13a69f75607935322bdb68754
+  data.tar.gz: 834c7d6e26be9b6d8fc262280048e5104cd022587fed30d030f6c0edd849966a49a08bc793f1b49764b1fc5afc014c1463dd8acacfb5739bc507f4a77d3281a1
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -14,7 +14,7 @@ gem "neighbor"
 
 ## Choose An Extension
 
-Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [
+Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
 
 For cube, run:
 
@@ -23,7 +23,7 @@ rails generate neighbor:cube
 rails db:migrate
 ```
 
-For
+For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
 
 ```sh
 rails generate neighbor:vector
@@ -70,17 +70,30 @@ Get the nearest neighbors to a vector
 Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
 ```
 
-
+Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+```ruby
+nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
+nearest_item.neighbor_distance
+```
+
+See the additional docs for:
+
+- [cube](#cube)
+- [pgvector](#pgvector)
+
+Or check out some [examples](#examples)
+
+## cube
+
+### Distance
 
 Supported values are:
 
 - `euclidean`
 - `cosine`
 - `taxicab`
-- `chebyshev`
-- `inner_product` (vector only)
-- `hamming` (vector only)
-- `jaccard` (vector only)
+- `chebyshev`
 
 For cosine distance with cube, vectors must be normalized before being stored.
 
@@ -90,18 +103,11 @@ class Item < ApplicationRecord
 end
 ```
 
-For inner product with cube, see [this example](examples/
-
-Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
-
-```ruby
-nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
-nearest_item.neighbor_distance
-```
+For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
 
-
+### Dimensions
 
-The cube
+The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
 
 For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
 
@@ -111,9 +117,26 @@ class Item < ApplicationRecord
 end
 ```
 
-##
+## pgvector
 
-
+### Distance
+
+Supported values are:
+
+- `euclidean`
+- `inner_product`
+- `cosine`
+- `taxicab`
+- `hamming`
+- `jaccard`
+
+### Dimensions
+
+The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
+
+### Indexing
+
+Add an approximate index to speed up queries. Create a migration with:
 
 ```ruby
 class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
@@ -139,7 +162,7 @@ Or the number of probes with IVFFlat
 Item.connection.execute("SET ivfflat.probes = 3")
 ```
 
-
+### Half-Precision Vectors
 
 Use the `halfvec` type to store half-precision vectors
 
@@ -151,7 +174,7 @@ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
 end
 ```
 
-
+### Half-Precision Indexing
 
 Index vectors at half precision for smaller indexes
 
@@ -169,7 +192,7 @@ Get the nearest neighbors
 Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
 ```
 
-
+### Binary Vectors
 
 Use the `bit` type to store binary vectors
 
@@ -187,7 +210,7 @@ Get the nearest neighbors by Hamming distance
 Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
 ```
 
-
+### Binary Quantization
 
 Use expression indexing for binary quantization
 
@@ -199,7 +222,7 @@ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
 end
 ```
 
-
+### Sparse Vectors
 
 Use the `sparsevec` type to store sparse vectors
 
@@ -220,11 +243,12 @@ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
 
 ## Examples
 
-- [
-- [
-- [Sentence
-- [
-- [
+- [Embeddings](#openai-embeddings) with OpenAI
+- [Binary embeddings](#cohere-embeddings) with Cohere
+- [Sentence embeddings](#sentence-embeddings) with Informers
+- [Hybrid search](#hybrid-search) with Informers
+- [Sparse search](#sparse-search) with Transformers.rb
+- [Recommendations](#disco-recommendations) with Disco
 
 ### OpenAI Embeddings
 
@@ -388,7 +412,7 @@ end
 Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 
 ```ruby
-model = Informers
+model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
 ```
 
 Pass your input
@@ -399,7 +423,7 @@ input = [
   "The cat is purring",
   "The bear is growling"
 ]
-embeddings = model.
+embeddings = model.(input)
 ```
 
 Store the embeddings
@@ -421,7 +445,86 @@ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:conten
 
 See the [complete code](examples/informers/example.rb)
 
-###
+### Hybrid Search
+
+You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
+
+Generate a model
+
+```sh
+rails generate model Document content:text embedding:vector{768}
+rails db:migrate
+```
+
+And add `has_neighbors` and a scope for keyword search
+
+```ruby
+class Document < ApplicationRecord
+  has_neighbors :embedding
+
+  scope :search, ->(query) {
+    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
+      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
+  }
+end
+```
+
+Create some documents
+
+```ruby
+texts = [
+  "The dog is barking",
+  "The cat is purring",
+  "The bear is growling"
+]
+documents = Document.create!(texts.map { |v| {content: v} })
+```
+
+Generate an embedding for each document
+
+```ruby
+embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
+embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
+embeddings = embed.(documents.map(&:content), **embed_options)
+
+documents.zip(embeddings) do |document, embedding|
+  document.update!(embedding: embedding)
+end
+```
+
+Perform keyword search
+
+```ruby
+query = "growling bear"
+keyword_results = Document.search(query).limit(20).load_async
+```
+
+And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
+
+```ruby
+query_prefix = "Represent this sentence for searching relevant passages: "
+query_embedding = embed.(query_prefix + query, **embed_options)
+semantic_results =
+  Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
+```
+
+To combine the results, use Reciprocal Rank Fusion (RRF)
+
+```ruby
+Neighbor::Reranking.rrf(keyword_results, semantic_results)
+```
+
+Or a reranking model
+
+```ruby
+rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
+results = (keyword_results + semantic_results).uniq
+rerank.(query, results.map(&:content), top_k: 5).map { |v| results[v[:doc_id]] }
+```
+
+See the [complete code](examples/hybrid/example.rb)
+
+### Sparse Search
 
 You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
 
@@ -533,7 +636,7 @@ movies = []
 recommender.item_ids.each do |item_id|
   movies << {name: item_id, factors: recommender.item_factors(item_id)}
 end
-Movie.
+Movie.create!(movies)
 ```
 
 And get similar movies
@@ -543,7 +646,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
 movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
 ```
 
-See the complete code for [cube](examples/disco/item_recs_cube.rb) and [
+See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
 
 ## History
 
data/lib/neighbor/reranking.rb
ADDED
@@ -0,0 +1,27 @@
+module Neighbor
+  module Reranking
+    def self.rrf(first_ranking, *rankings, k: 60)
+      rankings.unshift(first_ranking)
+
+      ranks = []
+      results = []
+      rankings.each do |ranking|
+        ranks << ranking.map.with_index.to_h { |v, i| [v, i + 1] }
+        results.concat(ranking)
+      end
+
+      results =
+        results.uniq.map do |result|
+          score =
+            ranks.sum do |rank|
+              r = rank[result]
+              r ? 1.0 / (k + r) : 0.0
+            end
+
+          {result: result, score: score}
+        end
+
+      results.sort_by { |v| -v[:score] }
+    end
+  end
+end
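Review note: the new `rrf` method implements Reciprocal Rank Fusion. Each distinct result scores the sum of `1 / (k + rank)` over the rankings that contain it, and the output is sorted by that score, highest first. Below is a minimal standalone sketch of the same logic that runs without the gem or a database. The method name `rrf_sketch` and the sample strings are illustrative assumptions and are not part of Neighbor.

```ruby
# Standalone sketch of Reciprocal Rank Fusion, mirroring the logic in the diff.
# `rrf_sketch` is a hypothetical name; the gem exposes Neighbor::Reranking.rrf.
def rrf_sketch(first_ranking, *rankings, k: 60)
  rankings = [first_ranking, *rankings]

  # For each ranking, map every item to its 1-based position.
  ranks = rankings.map do |ranking|
    ranking.each_with_index.to_h { |item, index| [item, index + 1] }
  end

  # Score each distinct item by summing 1 / (k + rank) across rankings;
  # rankings that do not contain the item contribute nothing.
  scored = rankings.flatten(1).uniq.map do |item|
    score = ranks.sum { |rank| rank[item] ? 1.0 / (k + rank[item]) : 0.0 }
    {result: item, score: score}
  end

  # Highest fused score first.
  scored.sort_by { |entry| -entry[:score] }
end

# Made-up rankings standing in for keyword and semantic results.
keyword = ["bear doc", "dog doc", "cat doc"]
semantic = ["dog doc", "cat doc", "fish doc"]
fused = rrf_sketch(keyword, semantic)
fused.each { |entry| puts format("%-8s %.5f", entry[:result], entry[:score]) }
```

With these inputs, "dog doc" ranks first because it sits near the top of both lists, while "bear doc" falls to third despite leading the keyword list. As in the diff, the return value is an array of `{result:, score:}` hashes rather than bare records.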
data/lib/neighbor/type/cube.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = value.to_a
           if value.first.is_a?(Array)
             value = value.map { |v| serialize_point(v) }.join(", ")
@@ -20,7 +20,7 @@ module Neighbor
       private
 
       def cast_value(value)
-        if
+        if Utils.array?(value)
           value.to_a
         elsif value.is_a?(Numeric)
           [value]
data/lib/neighbor/type/halfvec.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = "[#{value.to_a.map(&:to_f).join(",")}]"
         end
         super(value)
@@ -17,7 +17,7 @@ module Neighbor
       def cast_value(value)
         if value.is_a?(String)
           value[1..-1].split(",").map(&:to_f)
-        elsif
+        elsif Utils.array?(value)
           value.to_a
         else
           raise "can't cast #{value.class.name} to halfvec"
data/lib/neighbor/type/vector.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = "[#{value.to_a.map(&:to_f).join(",")}]"
         end
         super(value)
@@ -17,7 +17,7 @@ module Neighbor
       def cast_value(value)
         if value.is_a?(String)
           value[1..-1].split(",").map(&:to_f)
-        elsif
+        elsif Utils.array?(value)
           value.to_a
         else
           raise "can't cast #{value.class.name} to vector"
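Review note: the `vector` and `halfvec` types convert between Ruby arrays and a bracketed, comma-separated text literal. The sketch below reproduces that round trip outside Active Record. The body of `Utils.array?` is not shown in this diff, so `array_like?` is a hypothetical stand-in that only checks for `Array`. The real `serialize` also passes its result to `super`, which this sketch omits.

```ruby
# Sketch of the text round trip used by the vector and halfvec types in this diff.
# `array_like?` is a hypothetical stand-in for Utils.array?, whose body is not shown.
def array_like?(value)
  value.is_a?(Array)
end

# Ruby array -> bracketed text literal of floats.
def serialize_vector(value)
  value = "[#{value.to_a.map(&:to_f).join(",")}]" if array_like?(value)
  value
end

# Bracketed text literal -> array of floats.
def cast_vector(value)
  if value.is_a?(String)
    # Drop the leading "["; String#to_f ignores the trailing "]" on the last element.
    value[1..-1].split(",").map(&:to_f)
  elsif array_like?(value)
    value.to_a
  else
    raise "can't cast #{value.class.name} to vector"
  end
end

literal = serialize_vector([1, 2.5, 3])
puts literal
puts cast_vector(literal).inspect
```

Integers become floats on the way out, so `[1, 2.5, 3]` round-trips as `[1.0, 2.5, 3.0]`. Array input to the cast path is returned unchanged.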
data/lib/neighbor/utils.rb
CHANGED
data/lib/neighbor/version.rb
CHANGED
data/lib/neighbor.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: neighbor
 version: !ruby/object:Gem::Version
-  version: 0.4.
+  version: 0.4.3
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-
+date: 2024-09-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -40,6 +40,7 @@ files:
 - lib/neighbor.rb
 - lib/neighbor/model.rb
 - lib/neighbor/railtie.rb
+- lib/neighbor/reranking.rb
 - lib/neighbor/sparse_vector.rb
 - lib/neighbor/type/cube.rb
 - lib/neighbor/type/halfvec.rb