neighbor 0.3.2 → 0.5.0

data/README.md CHANGED
@@ -1,8 +1,15 @@
1
1
  # Neighbor
2
2
 
3
- Nearest neighbor search for Rails and Postgres
3
+ Nearest neighbor search for Rails
4
4
 
5
- [![Build Status](https://github.com/ankane/neighbor/workflows/build/badge.svg?branch=master)](https://github.com/ankane/neighbor/actions)
5
+ Supports:
6
+
7
+ - Postgres (cube and pgvector)
8
+ - SQLite (sqlite-vec) - experimental
9
+ - MariaDB 11.6 Vector - experimental
10
+ - MySQL 9 (searching requires HeatWave) - experimental
11
+
12
+ [![Build Status](https://github.com/ankane/neighbor/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/neighbor/actions)
6
13
 
7
14
  ## Installation
8
15
 
@@ -12,9 +19,9 @@ Add this line to your application’s Gemfile:
12
19
  gem "neighbor"
13
20
  ```
14
21
 
15
- ## Choose An Extension
22
+ ### For Postgres
16
23
 
17
- Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/pgvector/pgvector). cube ships with Postgres, while vector supports more dimensions and approximate nearest neighbor search.
24
+ Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
18
25
 
19
26
  For cube, run:
20
27
 
@@ -23,23 +30,42 @@ rails generate neighbor:cube
23
30
  rails db:migrate
24
31
  ```
25
32
 
26
- For vector, [install pgvector](https://github.com/pgvector/pgvector#installation) and run:
33
+ For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
27
34
 
28
35
  ```sh
29
36
  rails generate neighbor:vector
30
37
  rails db:migrate
31
38
  ```
32
39
 
40
+ ### For SQLite
41
+
42
+ Add this line to your application’s Gemfile:
43
+
44
+ ```ruby
45
+ gem "sqlite-vec"
46
+ ```
47
+
48
+ And run:
49
+
50
+ ```sh
51
+ rails generate neighbor:sqlite
52
+ ```
53
+
33
54
  ## Getting Started
34
55
 
35
56
  Create a migration
36
57
 
37
58
  ```ruby
38
- class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
59
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
39
60
  def change
61
+ # cube
40
62
  add_column :items, :embedding, :cube
41
- # or
63
+
64
+ # pgvector and MySQL
42
65
  add_column :items, :embedding, :vector, limit: 3 # dimensions
66
+
67
+ # sqlite-vec and MariaDB
68
+ add_column :items, :embedding, :binary
43
69
  end
44
70
  end
45
71
  ```
@@ -70,15 +96,33 @@ Get the nearest neighbors to a vector
70
96
  Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
71
97
  ```
72
98
 
73
- ## Distance
99
+ Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
100
+
101
+ ```ruby
102
+ nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
103
+ nearest_item.neighbor_distance
104
+ ```
105
+
106
+ See the additional docs for:
107
+
108
+ - [cube](#cube)
109
+ - [pgvector](#pgvector)
110
+ - [sqlite-vec](#sqlite-vec)
111
+ - [MariaDB](#mariadb)
112
+ - [MySQL](#mysql)
113
+
114
+ Or check out some [examples](#examples)
115
+
116
+ ## cube
117
+
118
+ ### Distance
74
119
 
75
120
  Supported values are:
76
121
 
77
122
  - `euclidean`
78
123
  - `cosine`
79
- - `taxicab` (cube only)
80
- - `chebyshev` (cube only)
81
- - `inner_product` (vector only)
124
+ - `taxicab`
125
+ - `chebyshev`
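
For intuition, the four metrics listed above can be sketched in plain Ruby. This is only an illustration: the method names are hypothetical and not part of Neighbor, which computes distances inside the database.

```ruby
# Plain-Ruby definitions of the metrics, for intuition only.
# These helper names are hypothetical and not part of Neighbor.

# Straight-line (L2) distance
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Sum of absolute differences (L1)
def taxicab(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Largest absolute difference along any dimension (L-infinity)
def chebyshev(a, b)
  a.zip(b).map { |x, y| (x - y).abs }.max
end

# One minus the cosine of the angle between the vectors
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1 - dot / (norm.(a) * norm.(b))
end

a = [0.0, 3.0]
b = [4.0, 0.0]

euclidean(a, b) # 5.0
taxicab(a, b)   # 7.0
chebyshev(a, b) # 4.0
cosine(a, b)    # 1.0 (the vectors are orthogonal)
```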
82
126
 
83
127
  For cosine distance with cube, vectors must be normalized before being stored.
84
128
 
@@ -88,18 +132,11 @@ class Item < ApplicationRecord
88
132
  end
89
133
  ```
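
One way to see why normalization matters: for unit-length vectors, squared Euclidean distance equals twice the cosine distance, so ordering by Euclidean distance gives the same ranking as ordering by cosine distance. The sketch below checks that identity in plain Ruby. The helper names are hypothetical, and it makes no claim about how Neighbor implements cosine distance internally.

```ruby
# Scale a vector to unit length (L2 norm of 1)
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1 - dot / (norm.(a) * norm.(b))
end

a = normalize([0.9, 1.3, 1.1])
b = normalize([2.0, 0.5, 1.0])

# For unit vectors: |a - b|^2 = 2 - 2 * (a . b) = 2 * cosine_distance
difference = (euclidean(a, b)**2 - 2 * cosine_distance(a, b)).abs
difference < 1e-9 # true
```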
90
134
 
91
- For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).
92
-
93
- Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
135
+ For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
94
136
 
95
- ```ruby
96
- nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
97
- nearest_item.neighbor_distance
98
- ```
99
-
100
- ## Dimensions
137
+ ### Dimensions
101
138
 
102
- The cube data type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
139
+ The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
103
140
 
104
141
  For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
105
142
 
@@ -109,38 +146,328 @@ class Item < ApplicationRecord
109
146
  end
110
147
  ```
111
148
 
112
- ## Indexing
149
+ ## pgvector
150
+
151
+ ### Distance
152
+
153
+ Supported values are:
154
+
155
+ - `euclidean`
156
+ - `inner_product`
157
+ - `cosine`
158
+ - `taxicab`
159
+ - `hamming`
160
+ - `jaccard`
161
+
162
+ ### Dimensions
163
+
164
+ The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
113
165
 
114
- For vector, add an approximate index to speed up queries. Create a migration with:
166
+ The `halfvec` type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
167
+
168
+ The `bit` type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
169
+
170
+ The `sparsevec` type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
171
+
172
+ ### Indexing
173
+
174
+ Add an approximate index to speed up queries. Create a migration with:
115
175
 
116
176
  ```ruby
117
- class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
177
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
118
178
  def change
119
- add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
120
- # or with pgvector 0.5.0+
121
179
  add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
180
+ # or
181
+ add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
122
182
  end
123
183
  end
124
184
  ```
125
185
 
126
186
  Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
127
187
 
128
- Set the number of probes with IVFFlat
188
+ Set the size of the dynamic candidate list with HNSW
189
+
190
+ ```ruby
191
+ Item.connection.execute("SET hnsw.ef_search = 100")
192
+ ```
193
+
194
+ Or the number of probes with IVFFlat
129
195
 
130
196
  ```ruby
131
197
  Item.connection.execute("SET ivfflat.probes = 3")
132
198
  ```
133
199
 
134
- Or the size of the dynamic candidate list with HNSW
200
+ ### Half-Precision Vectors
201
+
202
+ Use the `halfvec` type to store half-precision vectors
203
+
204
+ ```ruby
205
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
206
+ def change
207
+ add_column :items, :embedding, :halfvec, limit: 3 # dimensions
208
+ end
209
+ end
210
+ ```
211
+
212
+ ### Half-Precision Indexing
213
+
214
+ Index vectors at half precision for smaller indexes
135
215
 
136
216
  ```ruby
137
- Item.connection.execute("SET hnsw.ef_search = 100")
217
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
218
+ def change
219
+ add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
220
+ end
221
+ end
222
+ ```
223
+
224
+ Get the nearest neighbors
225
+
226
+ ```ruby
227
+ Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
228
+ ```
229
+
230
+ ### Binary Vectors
231
+
232
+ Use the `bit` type to store binary vectors
233
+
234
+ ```ruby
235
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
236
+ def change
237
+ add_column :items, :embedding, :bit, limit: 3 # dimensions
238
+ end
239
+ end
240
+ ```
241
+
242
+ Get the nearest neighbors by Hamming distance
243
+
244
+ ```ruby
245
+ Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
246
+ ```
247
+
248
+ ### Binary Quantization
249
+
250
+ Use expression indexing for binary quantization
251
+
252
+ ```ruby
253
+ class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
254
+ def change
255
+ add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
256
+ end
257
+ end
258
+ ```
259
+
260
+ ### Sparse Vectors
261
+
262
+ Use the `sparsevec` type to store sparse vectors
263
+
264
+ ```ruby
265
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
266
+ def change
267
+ add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
268
+ end
269
+ end
270
+ ```
271
+
272
+ Get the nearest neighbors
273
+
274
+ ```ruby
275
+ embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
276
+ Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
277
+ ```
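
If your embeddings start out as dense arrays, the hash form shown above can be derived by keeping only the non-zero elements. A minimal sketch, assuming zero-based indices as in the example above (`sparse_elements` is a hypothetical helper, not part of Neighbor):

```ruby
# Hypothetical helper: keep only the non-zero elements of a dense array,
# keyed by zero-based index.
def sparse_elements(dense)
  dense.each_with_index
       .reject { |value, _| value.zero? }
       .to_h { |value, index| [index, value] }
end

dense = [0.9, 0.0, 0.0, 1.3, 0.0]
elements = sparse_elements(dense) # {0 => 0.9, 3 => 1.3}
dimensions = dense.size           # 5

# With Neighbor loaded, these could be passed as
# Neighbor::SparseVector.new(elements, dimensions)
```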
278
+
279
+ ## sqlite-vec
280
+
281
+ ### Distance
282
+
283
+ Supported values are:
284
+
285
+ - `euclidean`
286
+ - `cosine`
287
+ - `taxicab`
288
+ - `hamming`
289
+
290
+ ### Dimensions
291
+
292
+ For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
293
+
294
+ ```ruby
295
+ class Item < ApplicationRecord
296
+ has_neighbors :embedding, dimensions: 3
297
+ end
298
+ ```
299
+
300
+ ### Virtual Tables
301
+
302
+ You can also use [virtual tables](https://alexgarcia.xyz/sqlite-vec/features/knn.html)
303
+
304
+ ```ruby
305
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
306
+ def change
307
+ # Rails < 8
308
+ execute <<~SQL
309
+ CREATE VIRTUAL TABLE items USING vec0(
310
+ embedding float[3] distance_metric=L2
311
+ )
312
+ SQL
313
+
314
+ # Rails 8+
315
+ create_virtual_table :items, :vec0, [
316
+ "embedding float[3] distance_metric=L2"
317
+ ]
318
+ end
319
+ end
320
+ ```
321
+
322
+ Use `distance_metric=cosine` for cosine distance
323
+
324
+ You can optionally ignore any shadow tables that are created
325
+
326
+ ```ruby
327
+ ActiveRecord::SchemaDumper.ignore_tables += [
328
+ "items_chunks", "items_rowids", "items_vector_chunks00"
329
+ ]
330
+ ```
331
+
332
+ Create a model with `rowid` as the primary key
333
+
334
+ ```ruby
335
+ class Item < ApplicationRecord
336
+ self.primary_key = "rowid"
337
+
338
+ has_neighbors :embedding, dimensions: 3
339
+ end
340
+ ```
341
+
342
+ Get the `k` nearest neighbors
343
+
344
+ ```ruby
345
+ Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
346
+ ```
347
+
348
+ Filter by primary key
349
+
350
+ ```ruby
351
+ Item.where(rowid: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
352
+ ```
353
+
354
+ ### Int8 Vectors
355
+
356
+ Use the `type` option for int8 vectors
357
+
358
+ ```ruby
359
+ class Item < ApplicationRecord
360
+ has_neighbors :embedding, dimensions: 3, type: :int8
361
+ end
362
+ ```
363
+
364
+ ### Binary Vectors
365
+
366
+ Use the `type` option for binary vectors
367
+
368
+ ```ruby
369
+ class Item < ApplicationRecord
370
+ has_neighbors :embedding, dimensions: 8, type: :bit
371
+ end
372
+ ```
373
+
374
+ Get the nearest neighbors by Hamming distance
375
+
376
+ ```ruby
377
+ Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
378
+ ```
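
In the query above, `"\x05"` is a single byte that packs eight binary dimensions (`00000101`). Hamming distance counts the bit positions where two such values differ. A plain-Ruby sketch for intuition (`hamming_distance` is a hypothetical helper; the actual comparison happens in the database):

```ruby
# Count differing bits between two equal-length byte strings.
def hamming_distance(a, b)
  raise ArgumentError, "length mismatch" unless a.bytesize == b.bytesize

  # XOR leaves a 1 wherever the bits differ; count those 1s per byte
  a.bytes.zip(b.bytes).sum { |x, y| (x ^ y).to_s(2).count("1") }
end

query = "\x05"                    # 00000101
stored = ["\x05", "\x07", "\xF0"] # 00000101, 00000111, 11110000

distances = stored.map { |v| hamming_distance(query, v) } # [0, 1, 6]
```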
379
+
380
+ ## MariaDB
381
+
382
+ ### Distance
383
+
384
+ Supported values are:
385
+
386
+ - `euclidean`
387
+ - `cosine`
388
+ - `hamming`
389
+
390
+ For cosine distance with MariaDB, vectors must be normalized before being stored.
391
+
392
+ ```ruby
393
+ class Item < ApplicationRecord
394
+ has_neighbors :embedding, normalize: true
395
+ end
396
+ ```
397
+
398
+ ### Indexing
399
+
400
+ Vector columns must use `null: false` to add a vector index
401
+
402
+ ```ruby
403
+ class CreateItems < ActiveRecord::Migration[7.2]
404
+ def change
405
+ create_table :items do |t|
406
+ t.binary :embedding, null: false
407
+ t.index :embedding, type: :vector
408
+ end
409
+ end
410
+ end
411
+ ```
412
+
413
+ ### Binary Vectors
414
+
415
+ Use the `bigint` type to store binary vectors
416
+
417
+ ```ruby
418
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
419
+ def change
420
+ add_column :items, :embedding, :bigint
421
+ end
422
+ end
423
+ ```
424
+
425
+ Note: Binary vectors can have up to 64 dimensions
426
+
427
+ Get the nearest neighbors by Hamming distance
428
+
429
+ ```ruby
430
+ Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
431
+ ```
432
+
433
+ ## MySQL
434
+
435
+ ### Distance
436
+
437
+ Supported values are:
438
+
439
+ - `euclidean`
440
+ - `cosine`
441
+ - `hamming`
442
+
443
+ Note: The `DISTANCE()` function is [only available on HeatWave](https://dev.mysql.com/doc/refman/9.0/en/vector-functions.html)
444
+
445
+ ### Binary Vectors
446
+
447
+ Use the `binary` type to store binary vectors
448
+
449
+ ```ruby
450
+ class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
451
+ def change
452
+ add_column :items, :embedding, :binary
453
+ end
454
+ end
455
+ ```
456
+
457
+ Get the nearest neighbors by Hamming distance
458
+
459
+ ```ruby
460
+ Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
138
461
  ```
139
462
 
140
463
  ## Examples
141
464
 
142
- - [OpenAI Embeddings](#openai-embeddings)
143
- - [Disco Recommendations](#disco-recommendations)
465
+ - [Embeddings](#openai-embeddings) with OpenAI
466
+ - [Binary embeddings](#cohere-embeddings) with Cohere
467
+ - [Sentence embeddings](#sentence-embeddings) with Informers
468
+ - [Hybrid search](#hybrid-search) with Informers
469
+ - [Sparse search](#sparse-search) with Transformers.rb
470
+ - [Recommendations](#disco-recommendations) with Disco
144
471
 
145
472
  ### OpenAI Embeddings
146
473
 
@@ -170,10 +497,10 @@ def fetch_embeddings(input)
170
497
  }
171
498
  data = {
172
499
  input: input,
173
- model: "text-embedding-ada-002"
500
+ model: "text-embedding-3-small"
174
501
  }
175
502
 
176
- response = Net::HTTP.post(URI(url), data.to_json, headers)
503
+ response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
177
504
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
178
505
  end
179
506
  ```
@@ -199,14 +526,297 @@ end
199
526
  Document.insert_all!(documents)
200
527
  ```
201
528
 
202
- And get similar articles
529
+ And get similar documents
530
+
531
+ ```ruby
532
+ document = Document.first
533
+ document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
534
+ ```
535
+
536
+ See the [complete code](examples/openai/example.rb)
537
+
538
+ ### Cohere Embeddings
539
+
540
+ Generate a model
541
+
542
+ ```sh
543
+ rails generate model Document content:text embedding:bit{1024}
544
+ rails db:migrate
545
+ ```
546
+
547
+ And add `has_neighbors`
548
+
549
+ ```ruby
550
+ class Document < ApplicationRecord
551
+ has_neighbors :embedding
552
+ end
553
+ ```
554
+
555
+ Create a method to call the [embed API](https://docs.cohere.com/reference/embed)
556
+
557
+ ```ruby
558
+ def fetch_embeddings(input, input_type)
559
+ url = "https://api.cohere.com/v1/embed"
560
+ headers = {
561
+ "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
562
+ "Content-Type" => "application/json"
563
+ }
564
+ data = {
565
+ texts: input,
566
+ model: "embed-english-v3.0",
567
+ input_type: input_type,
568
+ embedding_types: ["ubinary"]
569
+ }
570
+
571
+ response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
572
+ JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
573
+ end
574
+ ```
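
The last line of `fetch_embeddings` turns each `ubinary` embedding, an array of byte values, into a string of `0`s and `1`s. The conversion can be tried in isolation (`bytes_to_bit_string` is a hypothetical name for that step):

```ruby
# Each value is one byte (0..255) that packs eight dimensions.
# unpack1("B*") renders a byte as eight bits, most significant bit first.
def bytes_to_bit_string(bytes)
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

bytes_to_bit_string([5, 255]) # "0000010111111111"

# 128 bytes yield the 1024 bits expected by the bit{1024} column
bytes_to_bit_string(Array.new(128, 0)).length # 1024
```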
575
+
576
+ Pass your input
577
+
578
+ ```ruby
579
+ input = [
580
+ "The dog is barking",
581
+ "The cat is purring",
582
+ "The bear is growling"
583
+ ]
584
+ embeddings = fetch_embeddings(input, "search_document")
585
+ ```
586
+
587
+ Store the embeddings
588
+
589
+ ```ruby
590
+ documents = []
591
+ input.zip(embeddings) do |content, embedding|
592
+ documents << {content: content, embedding: embedding}
593
+ end
594
+ Document.insert_all!(documents)
595
+ ```
596
+
597
+ Embed the search query
598
+
599
+ ```ruby
600
+ query = "forest"
601
+ query_embedding = fetch_embeddings([query], "search_query")[0]
602
+ ```
603
+
604
+ And search the documents
605
+
606
+ ```ruby
607
+ Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
608
+ ```
609
+
610
+ See the [complete code](examples/cohere/example.rb)
611
+
612
+ ### Sentence Embeddings
613
+
614
+ You can generate embeddings locally with [Informers](https://github.com/ankane/informers).
615
+
616
+ Generate a model
617
+
618
+ ```sh
619
+ rails generate model Document content:text embedding:vector{384}
620
+ rails db:migrate
621
+ ```
622
+
623
+ And add `has_neighbors`
624
+
625
+ ```ruby
626
+ class Document < ApplicationRecord
627
+ has_neighbors :embedding
628
+ end
629
+ ```
630
+
631
+ Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
632
+
633
+ ```ruby
634
+ model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
635
+ ```
636
+
637
+ Pass your input
638
+
639
+ ```ruby
640
+ input = [
641
+ "The dog is barking",
642
+ "The cat is purring",
643
+ "The bear is growling"
644
+ ]
645
+ embeddings = model.(input)
646
+ ```
647
+
648
+ Store the embeddings
649
+
650
+ ```ruby
651
+ documents = []
652
+ input.zip(embeddings) do |content, embedding|
653
+ documents << {content: content, embedding: embedding}
654
+ end
655
+ Document.insert_all!(documents)
656
+ ```
657
+
658
+ And get similar documents
203
659
 
204
660
  ```ruby
205
661
  document = Document.first
206
662
  document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
207
663
  ```
208
664
 
209
- See the [complete code](examples/openai_embeddings.rb)
665
+ See the [complete code](examples/informers/example.rb)
666
+
667
+ ### Hybrid Search
668
+
669
+ You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
670
+
671
+ Generate a model
672
+
673
+ ```sh
674
+ rails generate model Document content:text embedding:vector{768}
675
+ rails db:migrate
676
+ ```
677
+
678
+ And add `has_neighbors` and a scope for keyword search
679
+
680
+ ```ruby
681
+ class Document < ApplicationRecord
682
+ has_neighbors :embedding
683
+
684
+ scope :search, ->(query) {
685
+ where("to_tsvector(content) @@ plainto_tsquery(?)", query)
686
+ .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
687
+ }
688
+ end
689
+ ```
690
+
691
+ Create some documents
692
+
693
+ ```ruby
694
+ Document.create!(content: "The dog is barking")
695
+ Document.create!(content: "The cat is purring")
696
+ Document.create!(content: "The bear is growling")
697
+ ```
698
+
699
+ Generate an embedding for each document
700
+
701
+ ```ruby
702
+ embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
703
+ embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
704
+
705
+ Document.find_each do |document|
706
+ embedding = embed.(document.content, **embed_options)
707
+ document.update!(embedding: embedding)
708
+ end
709
+ ```
710
+
711
+ Perform keyword search
712
+
713
+ ```ruby
714
+ query = "growling bear"
715
+ keyword_results = Document.search(query).limit(20).load_async
716
+ ```
717
+
718
+ And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
719
+
720
+ ```ruby
721
+ query_prefix = "Represent this sentence for searching relevant passages: "
722
+ query_embedding = embed.(query_prefix + query, **embed_options)
723
+ semantic_results =
724
+ Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
725
+ ```
726
+
727
+ To combine the results, use Reciprocal Rank Fusion (RRF)
728
+
729
+ ```ruby
730
+ Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
731
+ ```
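
For intuition, RRF scores each result by summing `1 / (k + rank)` over every list it appears in, then sorts by that score. The sketch below is a stand-in written for illustration, not Neighbor's implementation; the `rrf` helper and the constant `k = 60` (a commonly used value) are assumptions.

```ruby
# Hypothetical stand-in for illustration; not Neighbor's implementation.
# Each list is ordered best-first. k = 60 is an assumed, commonly used constant.
def rrf(*lists, k: 60)
  scores = Hash.new(0.0)
  lists.each do |list|
    list.each_with_index do |item, index|
      scores[item] += 1.0 / (k + index + 1) # ranks are 1-based
    end
  end
  # Highest combined score first
  scores.sort_by { |_, score| -score }.map(&:first)
end

keyword_results = ["bear", "dog"]
semantic_results = ["bear", "cat", "dog"]

fused = rrf(keyword_results, semantic_results) # ["bear", "dog", "cat"]
```

Items that rank well in both lists rise to the top, and an item found in only one list still appears, just lower.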
732
+
733
+ Or a reranking model
734
+
735
+ ```ruby
736
+ rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
737
+ results = (keyword_results + semantic_results).uniq
738
+ rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }
739
+ ```
740
+
741
+ See the [complete code](examples/hybrid/example.rb)
742
+
743
+ ### Sparse Search
744
+
745
+ You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
746
+
747
+ Generate a model
748
+
749
+ ```sh
750
+ rails generate model Document content:text embedding:sparsevec{30522}
751
+ rails db:migrate
752
+ ```
753
+
754
+ And add `has_neighbors`
755
+
756
+ ```ruby
757
+ class Document < ApplicationRecord
758
+ has_neighbors :embedding
759
+ end
760
+ ```
761
+
762
+ Load a [model](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) to generate embeddings
763
+
764
+ ```ruby
765
+ class EmbeddingModel
766
+ def initialize(model_id)
767
+ @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
768
+ @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
769
+ @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
770
+ end
771
+
772
+ def embed(input)
773
+ feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
774
+ output = @model.(**feature)[0]
775
+ values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
776
+ values = Torch.log(1 + Torch.relu(values))
777
+ values[0.., @special_token_ids] = 0
778
+ values.to_a
779
+ end
780
+ end
781
+
782
+ model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
783
+ ```
784
+
785
+ Pass your input
786
+
787
+ ```ruby
788
+ input = [
789
+ "The dog is barking",
790
+ "The cat is purring",
791
+ "The bear is growling"
792
+ ]
793
+ embeddings = model.embed(input)
794
+ ```
795
+
796
+ Store the embeddings
797
+
798
+ ```ruby
799
+ documents = []
800
+ input.zip(embeddings) do |content, embedding|
801
+ documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
802
+ end
803
+ Document.insert_all!(documents)
804
+ ```
805
+
806
+ Embed the search query
807
+
808
+ ```ruby
809
+ query = "forest"
810
+ query_embedding = model.embed([query])[0]
811
+ ```
812
+
813
+ And search the documents
814
+
815
+ ```ruby
816
+ Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
817
+ ```
818
+
819
+ See the [complete code](examples/sparse/example.rb)
210
820
 
211
821
  ### Disco Recommendations
212
822
 
@@ -242,7 +852,7 @@ movies = []
242
852
  recommender.item_ids.each do |item_id|
243
853
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
244
854
  end
245
- Movie.insert_all!(movies) # use create! for Active Record < 6
855
+ Movie.create!(movies)
246
856
  ```
247
857
 
248
858
  And get similar movies
@@ -252,19 +862,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
252
862
  movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
253
863
  ```
254
864
 
255
- See the complete code for [cube](examples/disco_item_recs_cube.rb) and [vector](examples/disco_item_recs_vector.rb)
256
-
257
- ## Upgrading
258
-
259
- ### 0.2.0
260
-
261
- The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
262
-
263
- ```ruby
264
- class Item < ApplicationRecord
265
- has_neighbors normalize: true
266
- end
267
- ```
865
+ See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
268
866
 
269
867
  ## History
270
868
 
@@ -285,11 +883,19 @@ To get started with development:
285
883
  git clone https://github.com/ankane/neighbor.git
286
884
  cd neighbor
287
885
  bundle install
886
+
887
+ # Postgres
288
888
  createdb neighbor_test
889
+ bundle exec rake test:postgresql
890
+
891
+ # SQLite
892
+ bundle exec rake test:sqlite
289
893
 
290
- # cube
291
- bundle exec rake test
894
+ # MariaDB
895
+ docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
896
+ bundle exec rake test:mariadb
292
897
 
293
- # vector
294
- EXT=vector bundle exec rake test
898
+ # MySQL
899
+ docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
900
+ bundle exec rake test:mysql
295
901
  ```