neighbor 0.4.1 → 0.4.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +137 -34
- data/lib/neighbor/reranking.rb +27 -0
- data/lib/neighbor/type/cube.rb +2 -2
- data/lib/neighbor/type/halfvec.rb +2 -2
- data/lib/neighbor/type/sparsevec.rb +1 -1
- data/lib/neighbor/type/vector.rb +2 -2
- data/lib/neighbor/utils.rb +4 -0
- data/lib/neighbor/version.rb +1 -1
- data/lib/neighbor.rb +1 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e8d611fd277cd48d309b2a087fdeb22f39f43d8ef81fcab57763bd5b4b2e48b3
+  data.tar.gz: fe0a5f7e4aa1ebd81f8c5849be67dc0f3c948d53d313b80259def04b9b7e9e84
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0d9d9d0be9f2929f1eab7e5df52a0606ca8c220ea74e6ef349c2ab77e5da44bd4b98ed4d2271345b2d679882f1790ac54b8373c13a69f75607935322bdb68754
+  data.tar.gz: 834c7d6e26be9b6d8fc262280048e5104cd022587fed30d030f6c0edd849966a49a08bc793f1b49764b1fc5afc014c1463dd8acacfb5739bc507f4a77d3281a1
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -14,7 +14,7 @@ gem "neighbor"
 
 ## Choose An Extension
 
-Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [
+Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
 
 For cube, run:
 
@@ -23,7 +23,7 @@ rails generate neighbor:cube
 rails db:migrate
 ```
 
-For
+For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
 
 ```sh
 rails generate neighbor:vector
@@ -70,17 +70,30 @@ Get the nearest neighbors to a vector
 Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
 ```
 
-
+Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+```ruby
+nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
+nearest_item.neighbor_distance
+```
+
+See the additional docs for:
+
+- [cube](#cube)
+- [pgvector](#pgvector)
+
+Or check out some [examples](#examples)
+
+## cube
+
+### Distance
 
 Supported values are:
 
 - `euclidean`
 - `cosine`
 - `taxicab`
-- `chebyshev`
-- `inner_product` (vector only)
-- `hamming` (vector only)
-- `jaccard` (vector only)
+- `chebyshev`
 
 For cosine distance with cube, vectors must be normalized before being stored.
 
@@ -90,18 +103,11 @@ class Item < ApplicationRecord
 end
 ```
 
-For inner product with cube, see [this example](examples/
-
-Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
-
-```ruby
-nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
-nearest_item.neighbor_distance
-```
+For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
 
-
+### Dimensions
 
-The cube
+The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
 
 For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
 
@@ -111,9 +117,26 @@ class Item < ApplicationRecord
 end
 ```
 
-##
+## pgvector
 
-
+### Distance
+
+Supported values are:
+
+- `euclidean`
+- `inner_product`
+- `cosine`
+- `taxicab`
+- `hamming`
+- `jaccard`
+
+### Dimensions
+
+The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
+
+### Indexing
+
+Add an approximate index to speed up queries. Create a migration with:
 
 ```ruby
 class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
@@ -139,7 +162,7 @@ Or the number of probes with IVFFlat
 Item.connection.execute("SET ivfflat.probes = 3")
 ```
 
-
+### Half-Precision Vectors
 
 Use the `halfvec` type to store half-precision vectors
 
@@ -151,7 +174,7 @@ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
 end
 ```
 
-
+### Half-Precision Indexing
 
 Index vectors at half precision for smaller indexes
 
@@ -169,7 +192,7 @@ Get the nearest neighbors
 Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
 ```
 
-
+### Binary Vectors
 
 Use the `bit` type to store binary vectors
 
@@ -187,7 +210,7 @@ Get the nearest neighbors by Hamming distance
 Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
 ```
 
-
+### Binary Quantization
 
 Use expression indexing for binary quantization
 
@@ -199,7 +222,7 @@ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
 end
 ```
 
-
+### Sparse Vectors
 
 Use the `sparsevec` type to store sparse vectors
 
@@ -220,11 +243,12 @@ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
 
 ## Examples
 
-- [
-- [
-- [Sentence
-- [
-- [
+- [Embeddings](#openai-embeddings) with OpenAI
+- [Binary embeddings](#cohere-embeddings) with Cohere
+- [Sentence embeddings](#sentence-embeddings) with Informers
+- [Hybrid search](#hybrid-search) with Informers
+- [Sparse search](#sparse-search) with Transformers.rb
+- [Recommendations](#disco-recommendations) with Disco
 
 ### OpenAI Embeddings
 
@@ -388,7 +412,7 @@ end
 Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 
 ```ruby
-model = Informers
+model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
 ```
 
 Pass your input
@@ -399,7 +423,7 @@ input = [
   "The cat is purring",
   "The bear is growling"
 ]
-embeddings = model.
+embeddings = model.(input)
 ```
 
 Store the embeddings
@@ -421,7 +445,86 @@ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:conten
 
 See the [complete code](examples/informers/example.rb)
 
-###
+### Hybrid Search
+
+You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
+
+Generate a model
+
+```sh
+rails generate model Document content:text embedding:vector{768}
+rails db:migrate
+```
+
+And add `has_neighbors` and a scope for keyword search
+
+```ruby
+class Document < ApplicationRecord
+  has_neighbors :embedding
+
+  scope :search, ->(query) {
+    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
+      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
+  }
+end
+```
+
+Create some documents
+
+```ruby
+texts = [
+  "The dog is barking",
+  "The cat is purring",
+  "The bear is growling"
+]
+documents = Document.create!(texts.map { |v| {content: v} })
+```
+
+Generate an embedding for each document
+
+```ruby
+embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
+embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
+embeddings = embed.(documents.map(&:content), **embed_options)
+
+documents.zip(embeddings) do |document, embedding|
+  document.update!(embedding: embedding)
+end
+```
+
+Perform keyword search
+
+```ruby
+query = "growling bear"
+keyword_results = Document.search(query).limit(20).load_async
+```
+
+And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
+
+```ruby
+query_prefix = "Represent this sentence for searching relevant passages: "
+query_embedding = embed.(query_prefix + query, **embed_options)
+semantic_results =
+  Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
+```
+
+To combine the results, use Reciprocal Rank Fusion (RRF)
+
+```ruby
+Neighbor::Reranking.rrf(keyword_results, semantic_results)
+```
+
+Or a reranking model
+
+```ruby
+rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
+results = (keyword_results + semantic_results).uniq
+rerank.(query, results.map(&:content), top_k: 5).map { |v| results[v[:doc_id]] }
+```
+
+See the [complete code](examples/hybrid/example.rb)
+
+### Sparse Search
 
 You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
 
@@ -533,7 +636,7 @@ movies = []
 recommender.item_ids.each do |item_id|
   movies << {name: item_id, factors: recommender.item_factors(item_id)}
 end
-Movie.
+Movie.create!(movies)
 ```
 
 And get similar movies
@@ -543,7 +646,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
 movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
 ```
 
-See the complete code for [cube](examples/disco/item_recs_cube.rb) and [
+See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
 
 ## History
 
data/lib/neighbor/reranking.rb
ADDED
@@ -0,0 +1,27 @@
+module Neighbor
+  module Reranking
+    def self.rrf(first_ranking, *rankings, k: 60)
+      rankings.unshift(first_ranking)
+
+      ranks = []
+      results = []
+      rankings.each do |ranking|
+        ranks << ranking.map.with_index.to_h { |v, i| [v, i + 1] }
+        results.concat(ranking)
+      end
+
+      results =
+        results.uniq.map do |result|
+          score =
+            ranks.sum do |rank|
+              r = rank[result]
+              r ? 1.0 / (k + r) : 0.0
+            end
+
+          {result: result, score: score}
+        end
+
+      results.sort_by { |v| -v[:score] }
+    end
+  end
+end
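Reviewer note: the scoring in the added `rrf` method can be tried outside Rails. The sketch below is not the gem's code. It assumes a hypothetical standalone helper, `rrf_sketch`, that applies the same formula (each item scores the sum of `1.0 / (k + rank)` over the rankings that contain it, with 1-based ranks) to plain arrays of strings. The sample rankings are made up for illustration.

```ruby
# Hypothetical standalone helper (not part of the gem) that mirrors the
# scoring used by the added Neighbor::Reranking.rrf, for plain arrays.
def rrf_sketch(*rankings, k: 60)
  # Map each item to its 1-based position within each ranking.
  ranks = rankings.map do |ranking|
    ranking.each_with_index.to_h { |v, i| [v, i + 1] }
  end

  # Score every distinct item by summing 1 / (k + rank) over the
  # rankings that contain it; a missing item contributes 0.
  scored = rankings.flatten(1).uniq.map do |item|
    score = ranks.sum { |rank| rank[item] ? 1.0 / (k + rank[item]) : 0.0 }
    {result: item, score: score}
  end

  # Highest fused score first.
  scored.sort_by { |v| -v[:score] }
end

# Made-up sample rankings: one from keyword search, one from semantic search.
keyword = ["bear", "dog"]
semantic = ["bear", "cat", "dog"]
fused = rrf_sketch(keyword, semantic)
```

With these inputs, "bear" is first in both lists and scores `2.0 / 61`; "dog" appears in both lists and outranks "cat", which appears in only one.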
data/lib/neighbor/type/cube.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = value.to_a
           if value.first.is_a?(Array)
             value = value.map { |v| serialize_point(v) }.join(", ")
@@ -20,7 +20,7 @@ module Neighbor
       private
 
       def cast_value(value)
-        if
+        if Utils.array?(value)
           value.to_a
         elsif value.is_a?(Numeric)
           [value]
data/lib/neighbor/type/halfvec.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = "[#{value.to_a.map(&:to_f).join(",")}]"
         end
         super(value)
@@ -17,7 +17,7 @@ module Neighbor
       def cast_value(value)
         if value.is_a?(String)
           value[1..-1].split(",").map(&:to_f)
-        elsif
+        elsif Utils.array?(value)
           value.to_a
         else
           raise "can't cast #{value.class.name} to halfvec"
data/lib/neighbor/type/vector.rb
CHANGED
@@ -6,7 +6,7 @@ module Neighbor
       end
 
       def serialize(value)
-        if
+        if Utils.array?(value)
           value = "[#{value.to_a.map(&:to_f).join(",")}]"
         end
         super(value)
@@ -17,7 +17,7 @@ module Neighbor
       def cast_value(value)
         if value.is_a?(String)
           value[1..-1].split(",").map(&:to_f)
-        elsif
+        elsif Utils.array?(value)
           value.to_a
         else
           raise "can't cast #{value.class.name} to vector"
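Reviewer note: the `serialize` and `cast_value` context lines in the `halfvec.rb` and `vector.rb` hunks convert between a Ruby array and a bracketed text form. A minimal sketch of that round trip, using the same two expressions that appear in the context lines. The helper names `to_vector_text` and `from_vector_text` are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helpers; the names are illustrative and not part of the gem.
# The expressions are copied from the diff's context lines.
def to_vector_text(values)
  # Coerce every element to Float and join inside square brackets.
  "[#{values.to_a.map(&:to_f).join(",")}]"
end

def from_vector_text(text)
  # Drop the leading "["; String#to_f ignores the trailing "]"
  # on the last element, so no explicit strip is needed.
  text[1..-1].split(",").map(&:to_f)
end

text = to_vector_text([1, 2.5, 3])
values = from_vector_text(text)
```

Here `text` is `"[1.0,2.5,3.0]"` and `values` is `[1.0, 2.5, 3.0]`, so integers come back as floats after a round trip.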
data/lib/neighbor/utils.rb
CHANGED
data/lib/neighbor/version.rb
CHANGED
data/lib/neighbor.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: neighbor
 version: !ruby/object:Gem::Version
-  version: 0.4.
+  version: 0.4.3
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-
+date: 2024-09-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -40,6 +40,7 @@ files:
 - lib/neighbor.rb
 - lib/neighbor/model.rb
 - lib/neighbor/railtie.rb
+- lib/neighbor/reranking.rb
 - lib/neighbor/sparse_vector.rb
 - lib/neighbor/type/cube.rb
 - lib/neighbor/type/halfvec.rb