neighbor 0.4.1 → 0.4.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8aa6de2790d94de9411b0142836b2ad181a411e299fce4b98357b96ac4161183
-   data.tar.gz: 2924d7f15f5b36bc89ee72372c1bfeb373d99481269696a9a9dcc41f90201f38
+   metadata.gz: e8d611fd277cd48d309b2a087fdeb22f39f43d8ef81fcab57763bd5b4b2e48b3
+   data.tar.gz: fe0a5f7e4aa1ebd81f8c5849be67dc0f3c948d53d313b80259def04b9b7e9e84
  SHA512:
-   metadata.gz: 2bc1b3ee6d5b1ee0ab175b017e753cf958bd8ceb1ef2a23ba769770dfebf54eec251ac59c8f5f3b6ca56efcbad1763c34622b94924a017622c2f78fc8740f762
-   data.tar.gz: d946dda99833964582f63863b2d898fea6bf065312cf60aec873631df96195e1a54375606ad9c9cc0f767937cdb7ea38b0d9990efcbbeab15ccbb11f8a2020ef
+   metadata.gz: 0d9d9d0be9f2929f1eab7e5df52a0606ca8c220ea74e6ef349c2ab77e5da44bd4b98ed4d2271345b2d679882f1790ac54b8373c13a69f75607935322bdb68754
+   data.tar.gz: 834c7d6e26be9b6d8fc262280048e5104cd022587fed30d030f6c0edd849966a49a08bc793f1b49764b1fc5afc014c1463dd8acacfb5739bc507f4a77d3281a1
data/CHANGELOG.md CHANGED
@@ -1,3 +1,11 @@
+ ## 0.4.3 (2024-09-02)
+
+ - Added `rrf` method
+
+ ## 0.4.2 (2024-08-27)
+
+ - Fixed error with `nil` values
+
  ## 0.4.1 (2024-08-26)
 
  - Added `precision` option
data/README.md CHANGED
@@ -14,7 +14,7 @@ gem "neighbor"
 
  ## Choose An Extension
 
- Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/pgvector/pgvector). cube ships with Postgres, while vector supports more dimensions and approximate nearest neighbor search.
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
 
  For cube, run:
 
@@ -23,7 +23,7 @@ rails generate neighbor:cube
  rails db:migrate
  ```
 
- For vector, [install pgvector](https://github.com/pgvector/pgvector#installation) and run:
+ For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
 
  ```sh
  rails generate neighbor:vector
@@ -70,17 +70,30 @@ Get the nearest neighbors to a vector
  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```
 
- ## Distance
+ Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+ ```ruby
+ nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
+ nearest_item.neighbor_distance
+ ```
+
+ See the additional docs for:
+
+ - [cube](#cube)
+ - [pgvector](#pgvector)
+
+ Or check out some [examples](#examples)
+
+ ## cube
+
+ ### Distance
 
  Supported values are:
 
  - `euclidean`
  - `cosine`
  - `taxicab`
- - `chebyshev` (cube only)
- - `inner_product` (vector only)
- - `hamming` (vector only)
- - `jaccard` (vector only)
+ - `chebyshev`
 
  For cosine distance with cube, vectors must be normalized before being stored.
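
As a rough illustration of what normalizing before storage means, the sketch below scales a vector to unit (L2) length in plain Ruby. The `normalize` method here is a hypothetical helper written for this example, not the gem's public API; it leaves a zero vector unchanged rather than dividing by zero.

```ruby
# Hypothetical helper for illustration: scale a vector to unit (L2) length.
def normalize(vector)
  norm = Math.sqrt(vector.sum { |v| v * v })
  # A zero vector has no direction, so return it unchanged
  norm > 0 ? vector.map { |v| v / norm } : vector
end

unit = normalize([3.0, 4.0])
# unit is [0.6, 0.8], which has length 1
```

Once every stored vector has unit length, ranking by Euclidean distance gives the same order as ranking by cosine distance, which is one plausible reason the requirement exists for cube.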
 
@@ -90,18 +103,11 @@ class Item < ApplicationRecord
  end
  ```
 
- For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).
-
- Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
-
- ```ruby
- nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
- nearest_item.neighbor_distance
- ```
+ For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
 
- ## Dimensions
+ ### Dimensions
 
- The cube data type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
+ The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
 
  For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
 
@@ -111,9 +117,26 @@ class Item < ApplicationRecord
  end
  ```
 
- ## Indexing
+ ## pgvector
 
- For vector, add an approximate index to speed up queries. Create a migration with:
+ ### Distance
+
+ Supported values are:
+
+ - `euclidean`
+ - `inner_product`
+ - `cosine`
+ - `taxicab`
+ - `hamming`
+ - `jaccard`
+
+ ### Dimensions
+
+ The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
+
+ ### Indexing
+
+ Add an approximate index to speed up queries. Create a migration with:
 
  ```ruby
  class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
@@ -139,7 +162,7 @@ Or the number of probes with IVFFlat
  Item.connection.execute("SET ivfflat.probes = 3")
  ```
 
- ## Half-Precision Vectors
+ ### Half-Precision Vectors
 
  Use the `halfvec` type to store half-precision vectors
 
@@ -151,7 +174,7 @@ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  end
  ```
 
- ## Half-Precision Indexing
+ ### Half-Precision Indexing
 
  Index vectors at half precision for smaller indexes
 
@@ -169,7 +192,7 @@ Get the nearest neighbors
  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
  ```
 
- ## Binary Vectors
+ ### Binary Vectors
 
  Use the `bit` type to store binary vectors
 
@@ -187,7 +210,7 @@ Get the nearest neighbors by Hamming distance
  Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
  ```
 
- ## Binary Quantization
+ ### Binary Quantization
 
  Use expression indexing for binary quantization
 
@@ -199,7 +222,7 @@ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
  end
  ```
 
- ## Sparse Vectors
+ ### Sparse Vectors
 
  Use the `sparsevec` type to store sparse vectors
 
@@ -220,11 +243,12 @@ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
 
  ## Examples
 
- - [OpenAI Embeddings](#openai-embeddings)
- - [Cohere Embeddings](#cohere-embeddings)
- - [Sentence Embeddings](#sentence-embeddings)
- - [Sparse Embeddings](#sparse-embeddings)
- - [Disco Recommendations](#disco-recommendations)
+ - [Embeddings](#openai-embeddings) with OpenAI
+ - [Binary embeddings](#cohere-embeddings) with Cohere
+ - [Sentence embeddings](#sentence-embeddings) with Informers
+ - [Hybrid search](#hybrid-search) with Informers
+ - [Sparse search](#sparse-search) with Transformers.rb
+ - [Recommendations](#disco-recommendations) with Disco
 
  ### OpenAI Embeddings
 
@@ -388,7 +412,7 @@ end
  Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 
  ```ruby
- model = Informers::Model.new("sentence-transformers/all-MiniLM-L6-v2")
+ model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
  ```
 
  Pass your input
@@ -399,7 +423,7 @@ input = [
    "The cat is purring",
    "The bear is growling"
  ]
- embeddings = model.embed(input)
+ embeddings = model.(input)
  ```
 
  Store the embeddings
@@ -421,7 +445,86 @@ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:conten
 
  See the [complete code](examples/informers/example.rb)
 
- ### Sparse Embeddings
+ ### Hybrid Search
+
+ You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
+
+ Generate a model
+
+ ```sh
+ rails generate model Document content:text embedding:vector{768}
+ rails db:migrate
+ ```
+
+ And add `has_neighbors` and a scope for keyword search
+
+ ```ruby
+ class Document < ApplicationRecord
+   has_neighbors :embedding
+
+   scope :search, ->(query) {
+     where("to_tsvector(content) @@ plainto_tsquery(?)", query)
+       .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
+   }
+ end
+ ```
+
+ Create some documents
+
+ ```ruby
+ texts = [
+   "The dog is barking",
+   "The cat is purring",
+   "The bear is growling"
+ ]
+ documents = Document.create!(texts.map { |v| {content: v} })
+ ```
+
+ Generate an embedding for each document
+
+ ```ruby
+ embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
+ embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
+ embeddings = embed.(documents.map(&:content), **embed_options)
+
+ documents.zip(embeddings) do |document, embedding|
+   document.update!(embedding: embedding)
+ end
+ ```
+
+ Perform keyword search
+
+ ```ruby
+ query = "growling bear"
+ keyword_results = Document.search(query).limit(20).load_async
+ ```
+
+ And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
+
+ ```ruby
+ query_prefix = "Represent this sentence for searching relevant passages: "
+ query_embedding = embed.(query_prefix + query, **embed_options)
+ semantic_results =
+   Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
+ ```
+
+ To combine the results, use Reciprocal Rank Fusion (RRF)
+
+ ```ruby
+ Neighbor::Reranking.rrf(keyword_results, semantic_results)
+ ```
+
+ Or a reranking model
+
+ ```ruby
+ rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
+ results = (keyword_results + semantic_results).uniq
+ rerank.(query, results.map(&:content), top_k: 5).map { |v| results[v[:doc_id]] }
+ ```
+
+ See the [complete code](examples/hybrid/example.rb)
+
+ ### Sparse Search
 
  You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
 
@@ -533,7 +636,7 @@ movies = []
  recommender.item_ids.each do |item_id|
    movies << {name: item_id, factors: recommender.item_factors(item_id)}
  end
- Movie.insert_all!(movies)
+ Movie.create!(movies)
  ```
 
  And get similar movies
@@ -543,7 +646,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
  movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
  ```
 
- See the complete code for [cube](examples/disco/item_recs_cube.rb) and [vector](examples/disco/item_recs_vector.rb)
+ See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
 
  ## History
 
@@ -0,0 +1,27 @@
+ module Neighbor
+   module Reranking
+     def self.rrf(first_ranking, *rankings, k: 60)
+       rankings.unshift(first_ranking)
+
+       ranks = []
+       results = []
+       rankings.each do |ranking|
+         ranks << ranking.map.with_index.to_h { |v, i| [v, i + 1] }
+         results.concat(ranking)
+       end
+
+       results =
+         results.uniq.map do |result|
+           score =
+             ranks.sum do |rank|
+               r = rank[result]
+               r ? 1.0 / (k + r) : 0.0
+             end
+
+           {result: result, score: score}
+         end
+
+       results.sort_by { |v| -v[:score] }
+     end
+   end
+ end
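
To make the scoring concrete, here is a small standalone sketch of the same reciprocal rank fusion logic applied to plain arrays of strings. It restates the added method as a local `rrf` method so it runs without the gem or a database; the result names and rankings are made up for illustration.

```ruby
# Standalone restatement of the RRF logic, so the example runs without the gem.
def rrf(first_ranking, *rankings, k: 60)
  rankings = [first_ranking, *rankings]

  # Map each result to its 1-based rank within each ranking
  ranks = rankings.map { |ranking| ranking.each_with_index.to_h { |v, i| [v, i + 1] } }

  # Each result scores 1 / (k + rank) for every ranking it appears in
  scored =
    rankings.flatten(1).uniq.map do |result|
      score = ranks.sum { |rank| rank[result] ? 1.0 / (k + rank[result]) : 0.0 }
      {result: result, score: score}
    end

  # Highest fused score first
  scored.sort_by { |v| -v[:score] }
end

# Made-up rankings for illustration
keyword_results = ["bear", "dog", "cat"]
semantic_results = ["bear", "cat", "fox"]

fused = rrf(keyword_results, semantic_results)
order = fused.map { |v| v[:result] }
# order is ["bear", "cat", "dog", "fox"]
```

Results that appear in both rankings ("bear", "cat") outrank those that appear in only one, and each entry comes back as a hash with `:result` and `:score` keys, matching the shape returned by the method in the diff. A larger `k` flattens the difference between adjacent ranks.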
@@ -6,7 +6,7 @@ module Neighbor
  end
 
  def serialize(value)
-   if value.respond_to?(:to_a)
+   if Utils.array?(value)
      value = value.to_a
      if value.first.is_a?(Array)
        value = value.map { |v| serialize_point(v) }.join(", ")
@@ -20,7 +20,7 @@ module Neighbor
  private
 
  def cast_value(value)
-   if value.respond_to?(:to_a)
+   if Utils.array?(value)
      value.to_a
    elsif value.is_a?(Numeric)
      [value]
@@ -6,7 +6,7 @@ module Neighbor
  end
 
  def serialize(value)
-   if value.respond_to?(:to_a)
+   if Utils.array?(value)
      value = "[#{value.to_a.map(&:to_f).join(",")}]"
    end
    super(value)
@@ -17,7 +17,7 @@ module Neighbor
  def cast_value(value)
    if value.is_a?(String)
      value[1..-1].split(",").map(&:to_f)
-   elsif value.respond_to?(:to_a)
+   elsif Utils.array?(value)
      value.to_a
    else
      raise "can't cast #{value.class.name} to halfvec"
@@ -19,7 +19,7 @@ module Neighbor
    value
  elsif value.is_a?(String)
    SparseVector.from_text(value)
- elsif value.respond_to?(:to_a)
+ elsif Utils.array?(value)
    value = SparseVector.new(value.to_a)
  else
    raise "can't cast #{value.class.name} to sparsevec"
@@ -6,7 +6,7 @@ module Neighbor
  end
 
  def serialize(value)
-   if value.respond_to?(:to_a)
+   if Utils.array?(value)
      value = "[#{value.to_a.map(&:to_f).join(",")}]"
    end
    super(value)
@@ -17,7 +17,7 @@ module Neighbor
  def cast_value(value)
    if value.is_a?(String)
      value[1..-1].split(",").map(&:to_f)
-   elsif value.respond_to?(:to_a)
+   elsif Utils.array?(value)
      value.to_a
    else
      raise "can't cast #{value.class.name} to vector"
@@ -38,5 +38,9 @@ module Neighbor
        # could also throw error
        norm > 0 ? value.map { |v| v / norm } : value
      end
+
+     def self.array?(value)
+       !value.nil? && value.respond_to?(:to_a)
+     end
    end
  end
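
The extra `nil?` guard matters because `nil` itself responds to `to_a` in Ruby (`nil.to_a` is `[]`), so a bare `respond_to?(:to_a)` check treats a `nil` value like an array. This is presumably related to the `nil` fix listed for 0.4.2, though the diff does not say so explicitly. The sketch below mirrors the helper as a local lambda, named `array_like` only for this example, so it runs without the gem.

```ruby
# Local copy of the check for illustration (the diff defines it as Utils.array?)
array_like = ->(value) { !value.nil? && value.respond_to?(:to_a) }

old_check = nil.respond_to?(:to_a)     # true, since NilClass defines to_a
new_check = array_like.call(nil)       # false, nil is rejected up front
list_check = array_like.call([1, 2])   # true
scalar_check = array_like.call(1.5)    # false, Float has no to_a
```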
@@ -1,3 +1,3 @@
  module Neighbor
-   VERSION = "0.4.1"
+   VERSION = "0.4.3"
  end
data/lib/neighbor.rb CHANGED
@@ -2,6 +2,7 @@
  require "active_support"
 
  # modules
+ require_relative "neighbor/reranking"
  require_relative "neighbor/sparse_vector"
  require_relative "neighbor/utils"
  require_relative "neighbor/version"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.4.1
+   version: 0.4.3
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-08-27 00:00:00.000000000 Z
+ date: 2024-09-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -40,6 +40,7 @@ files:
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/railtie.rb
+ - lib/neighbor/reranking.rb
  - lib/neighbor/sparse_vector.rb
  - lib/neighbor/type/cube.rb
  - lib/neighbor/type/halfvec.rb