neighbor 0.4.1 → 0.4.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8aa6de2790d94de9411b0142836b2ad181a411e299fce4b98357b96ac4161183
-   data.tar.gz: 2924d7f15f5b36bc89ee72372c1bfeb373d99481269696a9a9dcc41f90201f38
+   metadata.gz: e8d611fd277cd48d309b2a087fdeb22f39f43d8ef81fcab57763bd5b4b2e48b3
+   data.tar.gz: fe0a5f7e4aa1ebd81f8c5849be67dc0f3c948d53d313b80259def04b9b7e9e84
  SHA512:
-   metadata.gz: 2bc1b3ee6d5b1ee0ab175b017e753cf958bd8ceb1ef2a23ba769770dfebf54eec251ac59c8f5f3b6ca56efcbad1763c34622b94924a017622c2f78fc8740f762
-   data.tar.gz: d946dda99833964582f63863b2d898fea6bf065312cf60aec873631df96195e1a54375606ad9c9cc0f767937cdb7ea38b0d9990efcbbeab15ccbb11f8a2020ef
+   metadata.gz: 0d9d9d0be9f2929f1eab7e5df52a0606ca8c220ea74e6ef349c2ab77e5da44bd4b98ed4d2271345b2d679882f1790ac54b8373c13a69f75607935322bdb68754
+   data.tar.gz: 834c7d6e26be9b6d8fc262280048e5104cd022587fed30d030f6c0edd849966a49a08bc793f1b49764b1fc5afc014c1463dd8acacfb5739bc507f4a77d3281a1
data/CHANGELOG.md CHANGED
@@ -1,3 +1,11 @@
+ ## 0.4.3 (2024-09-02)
+
+ - Added `rrf` method
+
+ ## 0.4.2 (2024-08-27)
+
+ - Fixed error with `nil` values
+
  ## 0.4.1 (2024-08-26)

  - Added `precision` option
data/README.md CHANGED
@@ -14,7 +14,7 @@ gem "neighbor"

  ## Choose An Extension

- Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/pgvector/pgvector). cube ships with Postgres, while vector supports more dimensions and approximate nearest neighbor search.
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.

  For cube, run:

@@ -23,7 +23,7 @@ rails generate neighbor:cube
  rails db:migrate
  ```

- For vector, [install pgvector](https://github.com/pgvector/pgvector#installation) and run:
+ For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:

  ```sh
  rails generate neighbor:vector
@@ -70,17 +70,30 @@ Get the nearest neighbors to a vector
  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
  ```

- ## Distance
+ Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
+
+ ```ruby
+ nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
+ nearest_item.neighbor_distance
+ ```
+
+ See the additional docs for:
+
+ - [cube](#cube)
+ - [pgvector](#pgvector)
+
+ Or check out some [examples](#examples)
+
+ ## cube
+
+ ### Distance

  Supported values are:

  - `euclidean`
  - `cosine`
  - `taxicab`
- - `chebyshev` (cube only)
- - `inner_product` (vector only)
- - `hamming` (vector only)
- - `jaccard` (vector only)
+ - `chebyshev`

  For cosine distance with cube, vectors must be normalized before being stored.

@@ -90,18 +103,11 @@ class Item < ApplicationRecord
  end
  ```

- For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).
-
- Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
-
- ```ruby
- nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
- nearest_item.neighbor_distance
- ```
+ For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).

- ## Dimensions
+ ### Dimensions

- The cube data type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
+ The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.

  For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

114
- ## Indexing
120
+ ## pgvector
115
121
 
116
- For vector, add an approximate index to speed up queries. Create a migration with:
122
+ ### Distance
123
+
124
+ Supported values are:
125
+
126
+ - `euclidean`
127
+ - `inner_product`
128
+ - `cosine`
129
+ - `taxicab`
130
+ - `hamming`
131
+ - `jaccard`
132
+
133
+ ### Dimensions
134
+
135
+ The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
136
+
137
+ ### Indexing
138
+
139
+ Add an approximate index to speed up queries. Create a migration with:
117
140
 
118
141
  ```ruby
119
142
  class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
@@ -139,7 +162,7 @@ Or the number of probes with IVFFlat
  Item.connection.execute("SET ivfflat.probes = 3")
  ```

- ## Half-Precision Vectors
+ ### Half-Precision Vectors

  Use the `halfvec` type to store half-precision vectors

@@ -151,7 +174,7 @@ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  end
  ```

- ## Half-Precision Indexing
+ ### Half-Precision Indexing

  Index vectors at half precision for smaller indexes

@@ -169,7 +192,7 @@ Get the nearest neighbors
  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
  ```

- ## Binary Vectors
+ ### Binary Vectors

  Use the `bit` type to store binary vectors

@@ -187,7 +210,7 @@ Get the nearest neighbors by Hamming distance
  Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
  ```

- ## Binary Quantization
+ ### Binary Quantization

  Use expression indexing for binary quantization

@@ -199,7 +222,7 @@ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
  end
  ```

- ## Sparse Vectors
+ ### Sparse Vectors

  Use the `sparsevec` type to store sparse vectors

@@ -220,11 +243,12 @@ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)

  ## Examples

- - [OpenAI Embeddings](#openai-embeddings)
- - [Cohere Embeddings](#cohere-embeddings)
- - [Sentence Embeddings](#sentence-embeddings)
- - [Sparse Embeddings](#sparse-embeddings)
- - [Disco Recommendations](#disco-recommendations)
+ - [Embeddings](#openai-embeddings) with OpenAI
+ - [Binary embeddings](#cohere-embeddings) with Cohere
+ - [Sentence embeddings](#sentence-embeddings) with Informers
+ - [Hybrid search](#hybrid-search) with Informers
+ - [Sparse search](#sparse-search) with Transformers.rb
+ - [Recommendations](#disco-recommendations) with Disco

  ### OpenAI Embeddings

@@ -388,7 +412,7 @@ end
  Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

  ```ruby
- model = Informers::Model.new("sentence-transformers/all-MiniLM-L6-v2")
+ model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
  ```

  Pass your input
@@ -399,7 +423,7 @@ input = [
    "The cat is purring",
    "The bear is growling"
  ]
- embeddings = model.embed(input)
+ embeddings = model.(input)
  ```

  Store the embeddings
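
Review note: the README now calls the pipeline object with `model.(input)` instead of `model.embed(input)`. In Ruby, `obj.(args)` is shorthand for `obj.call(args)`. The sketch below shows that equivalence with a hypothetical stand-in class (`FakePipeline` is not part of Informers and returns string lengths rather than embeddings).

```ruby
# Hypothetical stand-in for a callable pipeline object (not the Informers API)
class FakePipeline
  # Any object that defines #call can be invoked with the .() shorthand
  def call(input)
    input.map(&:length)
  end
end

model = FakePipeline.new
input = ["The dog is barking", "The cat is purring"]

short_form = model.(input)    # shorthand syntax
long_form = model.call(input) # explicit method call

puts short_form == long_form
```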
@@ -421,7 +445,86 @@ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:conten

  See the [complete code](examples/informers/example.rb)

- ### Sparse Embeddings
+ ### Hybrid Search
+
+ You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
+
+ Generate a model
+
+ ```sh
+ rails generate model Document content:text embedding:vector{768}
+ rails db:migrate
+ ```
+
+ And add `has_neighbors` and a scope for keyword search
+
+ ```ruby
+ class Document < ApplicationRecord
+   has_neighbors :embedding
+
+   scope :search, ->(query) {
+     where("to_tsvector(content) @@ plainto_tsquery(?)", query)
+       .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
+   }
+ end
+ ```
+
+ Create some documents
+
+ ```ruby
+ texts = [
+   "The dog is barking",
+   "The cat is purring",
+   "The bear is growling"
+ ]
+ documents = Document.create!(texts.map { |v| {content: v} })
+ ```
+
+ Generate an embedding for each document
+
+ ```ruby
+ embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
+ embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
+ embeddings = embed.(documents.map(&:content), **embed_options)
+
+ documents.zip(embeddings) do |document, embedding|
+   document.update!(embedding: embedding)
+ end
+ ```
+
+ Perform keyword search
+
+ ```ruby
+ query = "growling bear"
+ keyword_results = Document.search(query).limit(20).load_async
+ ```
+
+ And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
+
+ ```ruby
+ query_prefix = "Represent this sentence for searching relevant passages: "
+ query_embedding = embed.(query_prefix + query, **embed_options)
+ semantic_results =
+   Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
+ ```
+
+ To combine the results, use Reciprocal Rank Fusion (RRF)
+
+ ```ruby
+ Neighbor::Reranking.rrf(keyword_results, semantic_results)
+ ```
+
+ Or a reranking model
+
+ ```ruby
+ rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
+ results = (keyword_results + semantic_results).uniq
+ rerank.(query, results.map(&:content), top_k: 5).map { |v| results[v[:doc_id]] }
+ ```
+
+ See the [complete code](examples/hybrid/example.rb)
+
+ ### Sparse Search

  You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).

@@ -533,7 +636,7 @@ movies = []
  recommender.item_ids.each do |item_id|
    movies << {name: item_id, factors: recommender.item_factors(item_id)}
  end
- Movie.insert_all!(movies)
+ Movie.create!(movies)
  ```

  And get similar movies
@@ -543,7 +646,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
  movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
  ```

- See the complete code for [cube](examples/disco/item_recs_cube.rb) and [vector](examples/disco/item_recs_vector.rb)
+ See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)

  ## History

data/lib/neighbor/reranking.rb ADDED
@@ -0,0 +1,27 @@
+ module Neighbor
+   module Reranking
+     def self.rrf(first_ranking, *rankings, k: 60)
+       rankings.unshift(first_ranking)
+
+       ranks = []
+       results = []
+       rankings.each do |ranking|
+         ranks << ranking.map.with_index.to_h { |v, i| [v, i + 1] }
+         results.concat(ranking)
+       end
+
+       results =
+         results.uniq.map do |result|
+           score =
+             ranks.sum do |rank|
+               r = rank[result]
+               r ? 1.0 / (k + r) : 0.0
+             end
+
+           {result: result, score: score}
+         end
+
+       results.sort_by { |v| -v[:score] }
+     end
+   end
+ end
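
Review note: the added `rrf` method gives each result a score equal to the sum, over all rankings, of `1 / (k + rank)`, where `rank` is the 1-based position and a result missing from a ranking adds 0. The following is a minimal standalone sketch of the same scoring on plain arrays, for experimenting outside the gem. `RrfSketch` and the sample rankings are hypothetical names introduced here for illustration.

```ruby
# Standalone sketch of reciprocal rank fusion; RrfSketch is a hypothetical
# name used for illustration and is not part of the gem.
module RrfSketch
  def self.rrf(first_ranking, *rankings, k: 60)
    rankings = [first_ranking, *rankings].map(&:to_a)

    # For each ranking, map every result to its 1-based position
    ranks = rankings.map do |ranking|
      ranking.each_with_index.to_h { |v, i| [v, i + 1] }
    end

    # Score = sum over rankings of 1 / (k + rank); absent results add 0
    scored = rankings.flatten(1).uniq.map do |result|
      score = ranks.sum do |rank|
        r = rank[result]
        r ? 1.0 / (k + r) : 0.0
      end
      {result: result, score: score}
    end

    # Highest fused score first
    scored.sort_by { |v| -v[:score] }
  end
end

# Made-up rankings standing in for keyword and semantic results
keyword_results = ["bear", "dog"]
semantic_results = ["bear", "cat", "dog"]

fused = RrfSketch.rrf(keyword_results, semantic_results)
fused.each { |v| puts "#{v[:result]}: #{v[:score].round(5)}" }
```

With the default `k: 60`, "bear" scores 1/61 + 1/61, "dog" scores 1/62 + 1/63, and "cat" scores 1/62, so the fused order is bear, dog, cat. A result that appears in both lists can outrank one placed higher in a single list.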
data/lib/neighbor/type/cube.rb CHANGED
@@ -6,7 +6,7 @@ module Neighbor
        end

        def serialize(value)
-         if value.respond_to?(:to_a)
+         if Utils.array?(value)
            value = value.to_a
            if value.first.is_a?(Array)
              value = value.map { |v| serialize_point(v) }.join(", ")
@@ -20,7 +20,7 @@ module Neighbor
        private

        def cast_value(value)
-         if value.respond_to?(:to_a)
+         if Utils.array?(value)
            value.to_a
          elsif value.is_a?(Numeric)
            [value]
data/lib/neighbor/type/halfvec.rb CHANGED
@@ -6,7 +6,7 @@ module Neighbor
        end

        def serialize(value)
-         if value.respond_to?(:to_a)
+         if Utils.array?(value)
            value = "[#{value.to_a.map(&:to_f).join(",")}]"
          end
          super(value)
@@ -17,7 +17,7 @@ module Neighbor
        def cast_value(value)
          if value.is_a?(String)
            value[1..-1].split(",").map(&:to_f)
-         elsif value.respond_to?(:to_a)
+         elsif Utils.array?(value)
            value.to_a
          else
            raise "can't cast #{value.class.name} to halfvec"
data/lib/neighbor/type/sparsevec.rb CHANGED
@@ -19,7 +19,7 @@ module Neighbor
            value
          elsif value.is_a?(String)
            SparseVector.from_text(value)
-         elsif value.respond_to?(:to_a)
+         elsif Utils.array?(value)
            value = SparseVector.new(value.to_a)
          else
            raise "can't cast #{value.class.name} to sparsevec"
data/lib/neighbor/type/vector.rb CHANGED
@@ -6,7 +6,7 @@ module Neighbor
        end

        def serialize(value)
-         if value.respond_to?(:to_a)
+         if Utils.array?(value)
            value = "[#{value.to_a.map(&:to_f).join(",")}]"
          end
          super(value)
@@ -17,7 +17,7 @@ module Neighbor
        def cast_value(value)
          if value.is_a?(String)
            value[1..-1].split(",").map(&:to_f)
-         elsif value.respond_to?(:to_a)
+         elsif Utils.array?(value)
            value.to_a
          else
            raise "can't cast #{value.class.name} to vector"
data/lib/neighbor/utils.rb CHANGED
@@ -38,5 +38,9 @@ module Neighbor
        # could also throw error
        norm > 0 ? value.map { |v| v / norm } : value
      end
+
+     def self.array?(value)
+       !value.nil? && value.respond_to?(:to_a)
+     end
    end
  end
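
Review note: the switch from `value.respond_to?(:to_a)` to `Utils.array?(value)` matters because `nil` itself responds to `to_a` in Ruby, so the old check sent `nil` down the array branch. This is presumably the `nil` fix listed in the changelog for 0.4.2, though the diff does not say so explicitly. A minimal sketch of the difference, using a local lambda (`array_like`, a name introduced here) that mirrors the new guard:

```ruby
# NilClass defines to_a, so respond_to? alone treats nil as array-like
puts nil.respond_to?(:to_a) # true
puts nil.to_a.inspect       # []

# Mirrors the guard added in Utils.array?; array_like is a local name
# introduced for this sketch
array_like = ->(value) { !value.nil? && value.respond_to?(:to_a) }

checks = {
  nil_value: array_like.call(nil),     # nil is now rejected
  array: array_like.call([1.0, 2.0]),  # arrays pass
  range: array_like.call(1..3),        # other objects with to_a still pass
  string: array_like.call("[1,2,3]")   # String does not define to_a
}

checks.each { |name, result| puts "#{name}: #{result}" }
```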
data/lib/neighbor/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Neighbor
-   VERSION = "0.4.1"
+   VERSION = "0.4.3"
  end
data/lib/neighbor.rb CHANGED
@@ -2,6 +2,7 @@
  require "active_support"

  # modules
+ require_relative "neighbor/reranking"
  require_relative "neighbor/sparse_vector"
  require_relative "neighbor/utils"
  require_relative "neighbor/version"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: neighbor
  version: !ruby/object:Gem::Version
-   version: 0.4.1
+   version: 0.4.3
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-08-27 00:00:00.000000000 Z
+ date: 2024-09-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -40,6 +40,7 @@ files:
  - lib/neighbor.rb
  - lib/neighbor/model.rb
  - lib/neighbor/railtie.rb
+ - lib/neighbor/reranking.rb
  - lib/neighbor/sparse_vector.rb
  - lib/neighbor/type/cube.rb
  - lib/neighbor/type/halfvec.rb