neighbor 0.3.2 → 0.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +36 -0
- data/LICENSE.txt +1 -1
- data/README.md +659 -53
- data/lib/generators/neighbor/cube_generator.rb +1 -0
- data/lib/generators/neighbor/sqlite_generator.rb +13 -0
- data/lib/generators/neighbor/templates/sqlite.rb.tt +2 -0
- data/lib/generators/neighbor/vector_generator.rb +1 -0
- data/lib/neighbor/attribute.rb +48 -0
- data/lib/neighbor/model.rb +93 -66
- data/lib/neighbor/mysql.rb +37 -0
- data/lib/neighbor/normalized_attribute.rb +21 -0
- data/lib/neighbor/postgresql.rb +43 -0
- data/lib/neighbor/railtie.rb +4 -4
- data/lib/neighbor/reranking.rb +27 -0
- data/lib/neighbor/sparse_vector.rb +79 -0
- data/lib/neighbor/sqlite.rb +28 -0
- data/lib/neighbor/type/cube.rb +24 -19
- data/lib/neighbor/type/halfvec.rb +28 -0
- data/lib/neighbor/type/mysql_vector.rb +33 -0
- data/lib/neighbor/type/sparsevec.rb +30 -0
- data/lib/neighbor/type/sqlite_int8_vector.rb +29 -0
- data/lib/neighbor/type/sqlite_vector.rb +29 -0
- data/lib/neighbor/type/vector.rb +19 -5
- data/lib/neighbor/utils.rb +201 -0
- data/lib/neighbor/version.rb +1 -1
- data/lib/neighbor.rb +16 -28
- metadata +22 -8
- data/lib/neighbor/vector.rb +0 -65
data/README.md
CHANGED
@@ -1,8 +1,15 @@
|
|
1
1
|
# Neighbor
|
2
2
|
|
3
|
-
Nearest neighbor search for Rails
|
3
|
+
Nearest neighbor search for Rails
|
4
4
|
|
5
|
-
|
5
|
+
Supports:
|
6
|
+
|
7
|
+
- Postgres (cube and pgvector)
|
8
|
+
- SQLite (sqlite-vec) - experimental
|
9
|
+
- MariaDB 11.6 Vector - experimental
|
10
|
+
- MySQL 9 (searching requires HeatWave) - experimental
|
11
|
+
|
12
|
+
[Build Status](https://github.com/ankane/neighbor/actions)
|
6
13
|
|
7
14
|
## Installation
|
8
15
|
|
@@ -12,9 +19,9 @@ Add this line to your application’s Gemfile:
|
|
12
19
|
gem "neighbor"
|
13
20
|
```
|
14
21
|
|
15
|
-
|
22
|
+
### For Postgres
|
16
23
|
|
17
|
-
Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [
|
24
|
+
Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
|
18
25
|
|
19
26
|
For cube, run:
|
20
27
|
|
@@ -23,23 +30,42 @@ rails generate neighbor:cube
|
|
23
30
|
rails db:migrate
|
24
31
|
```
|
25
32
|
|
26
|
-
For
|
33
|
+
For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
|
27
34
|
|
28
35
|
```sh
|
29
36
|
rails generate neighbor:vector
|
30
37
|
rails db:migrate
|
31
38
|
```
|
32
39
|
|
40
|
+
### For SQLite
|
41
|
+
|
42
|
+
Add this line to your application’s Gemfile:
|
43
|
+
|
44
|
+
```ruby
|
45
|
+
gem "sqlite-vec"
|
46
|
+
```
|
47
|
+
|
48
|
+
And run:
|
49
|
+
|
50
|
+
```sh
|
51
|
+
rails generate neighbor:sqlite
|
52
|
+
```
|
53
|
+
|
33
54
|
## Getting Started
|
34
55
|
|
35
56
|
Create a migration
|
36
57
|
|
37
58
|
```ruby
|
38
|
-
class AddEmbeddingToItems < ActiveRecord::Migration[7.
|
59
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
39
60
|
def change
|
61
|
+
# cube
|
40
62
|
add_column :items, :embedding, :cube
|
41
|
-
|
63
|
+
|
64
|
+
# pgvector and MySQL
|
42
65
|
add_column :items, :embedding, :vector, limit: 3 # dimensions
|
66
|
+
|
67
|
+
# sqlite-vec and MariaDB
|
68
|
+
add_column :items, :embedding, :binary
|
43
69
|
end
|
44
70
|
end
|
45
71
|
```
|
@@ -70,15 +96,33 @@ Get the nearest neighbors to a vector
|
|
70
96
|
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
|
71
97
|
```
|
72
98
|
|
73
|
-
|
99
|
+
Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
|
100
|
+
|
101
|
+
```ruby
|
102
|
+
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
|
103
|
+
nearest_item.neighbor_distance
|
104
|
+
```
|
105
|
+
|
106
|
+
See the additional docs for:
|
107
|
+
|
108
|
+
- [cube](#cube)
|
109
|
+
- [pgvector](#pgvector)
|
110
|
+
- [sqlite-vec](#sqlite-vec)
|
111
|
+
- [MariaDB](#mariadb)
|
112
|
+
- [MySQL](#mysql)
|
113
|
+
|
114
|
+
Or check out some [examples](#examples)
|
115
|
+
|
116
|
+
## cube
|
117
|
+
|
118
|
+
### Distance
|
74
119
|
|
75
120
|
Supported values are:
|
76
121
|
|
77
122
|
- `euclidean`
|
78
123
|
- `cosine`
|
79
|
-
- `taxicab`
|
80
|
-
- `chebyshev`
|
81
|
-
- `inner_product` (vector only)
|
124
|
+
- `taxicab`
|
125
|
+
- `chebyshev`
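
As a rough illustration of what the four metrics listed above measure, here is a plain-Ruby sketch. The method names are hypothetical helpers and not part of Neighbor; the database computes these distances itself, so this only shows the arithmetic.

```ruby
# Hypothetical helpers, for illustration only.

# Straight-line distance: square root of the summed squared differences.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Taxicab (L1) distance: sum of absolute differences.
def taxicab(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Chebyshev distance: largest absolute difference in any dimension.
def chebyshev(a, b)
  a.zip(b).map { |x, y| (x - y).abs }.max
end

# Cosine distance: 1 minus the cosine of the angle between the vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

a = [1.0, 2.0, 2.0]
b = [4.0, 6.0, 2.0]

euclidean(a, b) # => 5.0
taxicab(a, b)   # => 7.0
chebyshev(a, b) # => 4.0
```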
|
82
126
|
|
83
127
|
For cosine distance with cube, vectors must be normalized before being stored.
|
84
128
|
|
@@ -88,18 +132,11 @@ class Item < ApplicationRecord
|
|
88
132
|
end
|
89
133
|
```
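
To see why normalization matters for cosine distance, here is a minimal sketch in plain Ruby. The `normalize` and `dot` helpers are hypothetical and only approximate what `normalize: true` is expected to do; the gem's own implementation may differ.

```ruby
# Hypothetical helper: scale a vector to unit (L2) length.
def normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  return vector.map(&:to_f) if norm.zero? # an all-zero vector has no direction
  vector.map { |x| x / norm }
end

def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

a = normalize([3.0, 4.0]) # => [0.6, 0.8]
b = normalize([2.0, 0.0]) # => [1.0, 0.0]

cosine_distance = 1 - dot(a, b)
squared_euclidean = a.zip(b).sum { |x, y| (x - y)**2 }

# For unit vectors, squared Euclidean distance is twice the cosine distance,
# so ranking by either one gives the same order.
(squared_euclidean - 2 * cosine_distance).abs < 1e-12 # => true
```

Once every stored vector has unit length, a Euclidean ordering and a cosine ordering agree, which is presumably why normalized storage is required here.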
|
90
134
|
|
91
|
-
For inner product with cube, see [this example](examples/
|
92
|
-
|
93
|
-
Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
|
135
|
+
For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
|
94
136
|
|
95
|
-
|
96
|
-
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
|
97
|
-
nearest_item.neighbor_distance
|
98
|
-
```
|
99
|
-
|
100
|
-
## Dimensions
|
137
|
+
### Dimensions
|
101
138
|
|
102
|
-
The cube
|
139
|
+
The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
|
103
140
|
|
104
141
|
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
|
105
142
|
|
@@ -109,38 +146,328 @@ class Item < ApplicationRecord
|
|
109
146
|
end
|
110
147
|
```
|
111
148
|
|
112
|
-
##
|
149
|
+
## pgvector
|
150
|
+
|
151
|
+
### Distance
|
152
|
+
|
153
|
+
Supported values are:
|
154
|
+
|
155
|
+
- `euclidean`
|
156
|
+
- `inner_product`
|
157
|
+
- `cosine`
|
158
|
+
- `taxicab`
|
159
|
+
- `hamming`
|
160
|
+
- `jaccard`
|
161
|
+
|
162
|
+
### Dimensions
|
163
|
+
|
164
|
+
The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
|
113
165
|
|
114
|
-
|
166
|
+
The `halfvec` type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
|
167
|
+
|
168
|
+
The `bit` type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
|
169
|
+
|
170
|
+
The `sparsevec` type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
|
171
|
+
|
172
|
+
### Indexing
|
173
|
+
|
174
|
+
Add an approximate index to speed up queries. Create a migration with:
|
115
175
|
|
116
176
|
```ruby
|
117
|
-
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.
|
177
|
+
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
|
118
178
|
def change
|
119
|
-
add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
|
120
|
-
# or with pgvector 0.5.0+
|
121
179
|
add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
|
180
|
+
# or
|
181
|
+
add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
|
122
182
|
end
|
123
183
|
end
|
124
184
|
```
|
125
185
|
|
126
186
|
Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
|
127
187
|
|
128
|
-
Set the
|
188
|
+
Set the size of the dynamic candidate list with HNSW
|
189
|
+
|
190
|
+
```ruby
|
191
|
+
Item.connection.execute("SET hnsw.ef_search = 100")
|
192
|
+
```
|
193
|
+
|
194
|
+
Or the number of probes with IVFFlat
|
129
195
|
|
130
196
|
```ruby
|
131
197
|
Item.connection.execute("SET ivfflat.probes = 3")
|
132
198
|
```
|
133
199
|
|
134
|
-
|
200
|
+
### Half-Precision Vectors
|
201
|
+
|
202
|
+
Use the `halfvec` type to store half-precision vectors
|
203
|
+
|
204
|
+
```ruby
|
205
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
206
|
+
def change
|
207
|
+
add_column :items, :embedding, :halfvec, limit: 3 # dimensions
|
208
|
+
end
|
209
|
+
end
|
210
|
+
```
|
211
|
+
|
212
|
+
### Half-Precision Indexing
|
213
|
+
|
214
|
+
Index vectors at half precision for smaller indexes
|
135
215
|
|
136
216
|
```ruby
|
137
|
-
|
217
|
+
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
|
218
|
+
def change
|
219
|
+
add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
|
220
|
+
end
|
221
|
+
end
|
222
|
+
```
|
223
|
+
|
224
|
+
Get the nearest neighbors
|
225
|
+
|
226
|
+
```ruby
|
227
|
+
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
|
228
|
+
```
|
229
|
+
|
230
|
+
### Binary Vectors
|
231
|
+
|
232
|
+
Use the `bit` type to store binary vectors
|
233
|
+
|
234
|
+
```ruby
|
235
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
236
|
+
def change
|
237
|
+
add_column :items, :embedding, :bit, limit: 3 # dimensions
|
238
|
+
end
|
239
|
+
end
|
240
|
+
```
|
241
|
+
|
242
|
+
Get the nearest neighbors by Hamming distance
|
243
|
+
|
244
|
+
```ruby
|
245
|
+
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
|
246
|
+
```
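
For reference, Hamming distance between two bit strings is the number of positions where they differ. A small plain-Ruby sketch (the `hamming` helper is hypothetical and not part of Neighbor):

```ruby
# Hypothetical helper: count the positions where two bit strings differ.
def hamming(a, b)
  raise ArgumentError, "bit strings must have the same length" unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y }
end

hamming("101", "101") # => 0
hamming("101", "100") # => 1
hamming("101", "010") # => 3
```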
|
247
|
+
|
248
|
+
### Binary Quantization
|
249
|
+
|
250
|
+
Use expression indexing for binary quantization
|
251
|
+
|
252
|
+
```ruby
|
253
|
+
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
|
254
|
+
def change
|
255
|
+
add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
|
256
|
+
end
|
257
|
+
end
|
258
|
+
```
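
As a sketch of the idea behind binary quantization: each element is reduced to a single bit. The version below assumes positive elements map to `1` and everything else to `0`, which is how pgvector's `binary_quantize` is understood to behave; treat that rule as an assumption and check the pgvector docs. The Ruby method is a hypothetical stand-in, not a Neighbor API.

```ruby
# Hypothetical stand-in for the database function:
# one bit per dimension, set when the element is positive.
def binary_quantize(vector)
  vector.map { |x| x > 0 ? "1" : "0" }.join
end

binary_quantize([0.9, -1.3, 1.1]) # => "101"
binary_quantize([0.0, 2.5, -0.1]) # => "010"
```

The resulting bit strings are much smaller than the original vectors and can be compared by Hamming distance.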
|
259
|
+
|
260
|
+
### Sparse Vectors
|
261
|
+
|
262
|
+
Use the `sparsevec` type to store sparse vectors
|
263
|
+
|
264
|
+
```ruby
|
265
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
266
|
+
def change
|
267
|
+
add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
|
268
|
+
end
|
269
|
+
end
|
270
|
+
```
|
271
|
+
|
272
|
+
Get the nearest neighbors
|
273
|
+
|
274
|
+
```ruby
|
275
|
+
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
|
276
|
+
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
|
277
|
+
```
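
A sparse vector keeps only the non-zero elements, keyed by index, plus the total number of dimensions. The sketch below converts between a dense array and that index-to-value hash shape. Both helpers are hypothetical and only mirror the arguments shown for `Neighbor::SparseVector.new` above.

```ruby
# Hypothetical helper: keep only non-zero elements, keyed by zero-based index.
def to_sparse(dense)
  dense.each_with_index.filter_map { |value, index| [index, value] unless value.zero? }.to_h
end

# Hypothetical helper: rebuild the dense array, filling missing indices with 0.0.
def to_dense(elements, dimensions)
  Array.new(dimensions) { |index| elements.fetch(index, 0.0) }
end

elements = to_sparse([0.9, 0.0, 1.1]) # => {0 => 0.9, 2 => 1.1}
to_dense(elements, 3)                 # => [0.9, 0.0, 1.1]
```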
|
278
|
+
|
279
|
+
## sqlite-vec
|
280
|
+
|
281
|
+
### Distance
|
282
|
+
|
283
|
+
Supported values are:
|
284
|
+
|
285
|
+
- `euclidean`
|
286
|
+
- `cosine`
|
287
|
+
- `taxicab`
|
288
|
+
- `hamming`
|
289
|
+
|
290
|
+
### Dimensions
|
291
|
+
|
292
|
+
For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
|
293
|
+
|
294
|
+
```ruby
|
295
|
+
class Item < ApplicationRecord
|
296
|
+
has_neighbors :embedding, dimensions: 3
|
297
|
+
end
|
298
|
+
```
|
299
|
+
|
300
|
+
### Virtual Tables
|
301
|
+
|
302
|
+
You can also use [virtual tables](https://alexgarcia.xyz/sqlite-vec/features/knn.html)
|
303
|
+
|
304
|
+
```ruby
|
305
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
306
|
+
def change
|
307
|
+
# Rails < 8
|
308
|
+
execute <<~SQL
|
309
|
+
CREATE VIRTUAL TABLE items USING vec0(
|
310
|
+
embedding float[3] distance_metric=L2
|
311
|
+
)
|
312
|
+
SQL
|
313
|
+
|
314
|
+
# Rails 8+
|
315
|
+
create_virtual_table :items, :vec0, [
|
316
|
+
"embedding float[3] distance_metric=L2"
|
317
|
+
]
|
318
|
+
end
|
319
|
+
end
|
320
|
+
```
|
321
|
+
|
322
|
+
Use `distance_metric=cosine` for cosine distance
|
323
|
+
|
324
|
+
You can optionally ignore any shadow tables that are created
|
325
|
+
|
326
|
+
```ruby
|
327
|
+
ActiveRecord::SchemaDumper.ignore_tables += [
|
328
|
+
"items_chunks", "items_rowids", "items_vector_chunks00"
|
329
|
+
]
|
330
|
+
```
|
331
|
+
|
332
|
+
Create a model with `rowid` as the primary key
|
333
|
+
|
334
|
+
```ruby
|
335
|
+
class Item < ApplicationRecord
|
336
|
+
self.primary_key = "rowid"
|
337
|
+
|
338
|
+
has_neighbors :embedding, dimensions: 3
|
339
|
+
end
|
340
|
+
```
|
341
|
+
|
342
|
+
Get the `k` nearest neighbors
|
343
|
+
|
344
|
+
```ruby
|
345
|
+
Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
|
346
|
+
```
|
347
|
+
|
348
|
+
Filter by primary key
|
349
|
+
|
350
|
+
```ruby
|
351
|
+
Item.where(rowid: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
|
352
|
+
```
|
353
|
+
|
354
|
+
### Int8 Vectors
|
355
|
+
|
356
|
+
Use the `type` option for int8 vectors
|
357
|
+
|
358
|
+
```ruby
|
359
|
+
class Item < ApplicationRecord
|
360
|
+
has_neighbors :embedding, dimensions: 3, type: :int8
|
361
|
+
end
|
362
|
+
```
|
363
|
+
|
364
|
+
### Binary Vectors
|
365
|
+
|
366
|
+
Use the `type` option for binary vectors
|
367
|
+
|
368
|
+
```ruby
|
369
|
+
class Item < ApplicationRecord
|
370
|
+
has_neighbors :embedding, dimensions: 8, type: :bit
|
371
|
+
end
|
372
|
+
```
|
373
|
+
|
374
|
+
Get the nearest neighbors by Hamming distance
|
375
|
+
|
376
|
+
```ruby
|
377
|
+
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
|
378
|
+
```
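
The `"\x05"` query above is a packed byte string: 8 dimensions fit in one byte. As a minimal sketch, Ruby's `pack` and `unpack1` can convert between a bit string and bytes. The bit order within each byte (most significant bit first, matching the `"B*"` directive) is an assumption here, and `hamming_bytes` is a hypothetical helper.

```ruby
bits = "00000101"

# Pack 8 bits into a single byte, most significant bit first.
packed = [bits].pack("B*")
packed.bytes         # => [5]
packed.unpack1("B*") # => "00000101"

# Hypothetical helper: Hamming distance over two packed byte strings,
# counting the set bits in the XOR of each byte pair.
def hamming_bytes(a, b)
  a.bytes.zip(b.bytes).sum { |x, y| (x ^ y).to_s(2).count("1") }
end

hamming_bytes("\x05", "\x04") # => 1
```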
|
379
|
+
|
380
|
+
## MariaDB
|
381
|
+
|
382
|
+
### Distance
|
383
|
+
|
384
|
+
Supported values are:
|
385
|
+
|
386
|
+
- `euclidean`
|
387
|
+
- `cosine`
|
388
|
+
- `hamming`
|
389
|
+
|
390
|
+
For cosine distance with MariaDB, vectors must be normalized before being stored.
|
391
|
+
|
392
|
+
```ruby
|
393
|
+
class Item < ApplicationRecord
|
394
|
+
has_neighbors :embedding, normalize: true
|
395
|
+
end
|
396
|
+
```
|
397
|
+
|
398
|
+
### Indexing
|
399
|
+
|
400
|
+
Vector columns must use `null: false` to add a vector index
|
401
|
+
|
402
|
+
```ruby
|
403
|
+
class CreateItems < ActiveRecord::Migration[7.2]
|
404
|
+
def change
|
405
|
+
create_table :items do |t|
|
406
|
+
t.binary :embedding, null: false
|
407
|
+
t.index :embedding, type: :vector
|
408
|
+
end
|
409
|
+
end
|
410
|
+
end
|
411
|
+
```
|
412
|
+
|
413
|
+
### Binary Vectors
|
414
|
+
|
415
|
+
Use the `bigint` type to store binary vectors
|
416
|
+
|
417
|
+
```ruby
|
418
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
419
|
+
def change
|
420
|
+
add_column :items, :embedding, :bigint
|
421
|
+
end
|
422
|
+
end
|
423
|
+
```
|
424
|
+
|
425
|
+
Note: Binary vectors can have up to 64 dimensions
|
426
|
+
|
427
|
+
Get the nearest neighbors by Hamming distance
|
428
|
+
|
429
|
+
```ruby
|
430
|
+
Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
|
431
|
+
```
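
Here the binary vector is stored as an integer, so `5` stands for the bits `101`. A small sketch of how such values relate to bit strings and to Hamming distance, assuming non-negative integers; `hamming_int` is a hypothetical helper:

```ruby
# Convert between a bit string and its integer form.
"101".to_i(2) # => 5
5.to_s(2)     # => "101"

# Hypothetical helper: XOR the integers, then count the set bits.
def hamming_int(a, b)
  (a ^ b).to_s(2).count("1")
end

hamming_int(5, 4) # => 1
hamming_int(5, 2) # => 3
```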
|
432
|
+
|
433
|
+
## MySQL
|
434
|
+
|
435
|
+
### Distance
|
436
|
+
|
437
|
+
Supported values are:
|
438
|
+
|
439
|
+
- `euclidean`
|
440
|
+
- `cosine`
|
441
|
+
- `hamming`
|
442
|
+
|
443
|
+
Note: The `DISTANCE()` function is [only available on HeatWave](https://dev.mysql.com/doc/refman/9.0/en/vector-functions.html)
|
444
|
+
|
445
|
+
### Binary Vectors
|
446
|
+
|
447
|
+
Use the `binary` type to store binary vectors
|
448
|
+
|
449
|
+
```ruby
|
450
|
+
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
|
451
|
+
def change
|
452
|
+
add_column :items, :embedding, :binary
|
453
|
+
end
|
454
|
+
end
|
455
|
+
```
|
456
|
+
|
457
|
+
Get the nearest neighbors by Hamming distance
|
458
|
+
|
459
|
+
```ruby
|
460
|
+
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
|
138
461
|
```
|
139
462
|
|
140
463
|
## Examples
|
141
464
|
|
142
|
-
- [
|
143
|
-
- [
|
465
|
+
- [Embeddings](#openai-embeddings) with OpenAI
|
466
|
+
- [Binary embeddings](#cohere-embeddings) with Cohere
|
467
|
+
- [Sentence embeddings](#sentence-embeddings) with Informers
|
468
|
+
- [Hybrid search](#hybrid-search) with Informers
|
469
|
+
- [Sparse search](#sparse-search) with Transformers.rb
|
470
|
+
- [Recommendations](#disco-recommendations) with Disco
|
144
471
|
|
145
472
|
### OpenAI Embeddings
|
146
473
|
|
@@ -170,10 +497,10 @@ def fetch_embeddings(input)
|
|
170
497
|
}
|
171
498
|
data = {
|
172
499
|
input: input,
|
173
|
-
model: "text-embedding-
|
500
|
+
model: "text-embedding-3-small"
|
174
501
|
}
|
175
502
|
|
176
|
-
response = Net::HTTP.post(URI(url), data.to_json, headers)
|
503
|
+
response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
|
177
504
|
JSON.parse(response.body)["data"].map { |v| v["embedding"] }
|
178
505
|
end
|
179
506
|
```
|
@@ -199,14 +526,297 @@ end
|
|
199
526
|
Document.insert_all!(documents)
|
200
527
|
```
|
201
528
|
|
202
|
-
And get similar
|
529
|
+
And get similar documents
|
530
|
+
|
531
|
+
```ruby
|
532
|
+
document = Document.first
|
533
|
+
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
|
534
|
+
```
|
535
|
+
|
536
|
+
See the [complete code](examples/openai/example.rb)
|
537
|
+
|
538
|
+
### Cohere Embeddings
|
539
|
+
|
540
|
+
Generate a model
|
541
|
+
|
542
|
+
```sh
|
543
|
+
rails generate model Document content:text embedding:bit{1024}
|
544
|
+
rails db:migrate
|
545
|
+
```
|
546
|
+
|
547
|
+
And add `has_neighbors`
|
548
|
+
|
549
|
+
```ruby
|
550
|
+
class Document < ApplicationRecord
|
551
|
+
has_neighbors :embedding
|
552
|
+
end
|
553
|
+
```
|
554
|
+
|
555
|
+
Create a method to call the [embed API](https://docs.cohere.com/reference/embed)
|
556
|
+
|
557
|
+
```ruby
|
558
|
+
def fetch_embeddings(input, input_type)
|
559
|
+
url = "https://api.cohere.com/v1/embed"
|
560
|
+
headers = {
|
561
|
+
"Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
|
562
|
+
"Content-Type" => "application/json"
|
563
|
+
}
|
564
|
+
data = {
|
565
|
+
texts: input,
|
566
|
+
model: "embed-english-v3.0",
|
567
|
+
input_type: input_type,
|
568
|
+
embedding_types: ["ubinary"]
|
569
|
+
}
|
570
|
+
|
571
|
+
response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
|
572
|
+
JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
|
573
|
+
end
|
574
|
+
```
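
The last line of `fetch_embeddings` turns each embedding, an array of unsigned bytes, into one long bit string. A standalone sketch of just that step, using made-up byte values rather than a real API response (`ubinary_to_bits` is a hypothetical name):

```ruby
# Hypothetical helper: expand each byte (0..255) into 8 bits, most significant first.
def ubinary_to_bits(bytes)
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

ubinary_to_bits([5, 255]) # => "0000010111111111"

# 128 bytes expand to the 1024 bits declared in the migration.
ubinary_to_bits(Array.new(128, 0)).length # => 1024
```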
|
575
|
+
|
576
|
+
Pass your input
|
577
|
+
|
578
|
+
```ruby
|
579
|
+
input = [
|
580
|
+
"The dog is barking",
|
581
|
+
"The cat is purring",
|
582
|
+
"The bear is growling"
|
583
|
+
]
|
584
|
+
embeddings = fetch_embeddings(input, "search_document")
|
585
|
+
```
|
586
|
+
|
587
|
+
Store the embeddings
|
588
|
+
|
589
|
+
```ruby
|
590
|
+
documents = []
|
591
|
+
input.zip(embeddings) do |content, embedding|
|
592
|
+
documents << {content: content, embedding: embedding}
|
593
|
+
end
|
594
|
+
Document.insert_all!(documents)
|
595
|
+
```
|
596
|
+
|
597
|
+
Embed the search query
|
598
|
+
|
599
|
+
```ruby
|
600
|
+
query = "forest"
|
601
|
+
query_embedding = fetch_embeddings([query], "search_query")[0]
|
602
|
+
```
|
603
|
+
|
604
|
+
And search the documents
|
605
|
+
|
606
|
+
```ruby
|
607
|
+
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
|
608
|
+
```
|
609
|
+
|
610
|
+
See the [complete code](examples/cohere/example.rb)
|
611
|
+
|
612
|
+
### Sentence Embeddings
|
613
|
+
|
614
|
+
You can generate embeddings locally with [Informers](https://github.com/ankane/informers).
|
615
|
+
|
616
|
+
Generate a model
|
617
|
+
|
618
|
+
```sh
|
619
|
+
rails generate model Document content:text embedding:vector{384}
|
620
|
+
rails db:migrate
|
621
|
+
```
|
622
|
+
|
623
|
+
And add `has_neighbors`
|
624
|
+
|
625
|
+
```ruby
|
626
|
+
class Document < ApplicationRecord
|
627
|
+
has_neighbors :embedding
|
628
|
+
end
|
629
|
+
```
|
630
|
+
|
631
|
+
Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
|
632
|
+
|
633
|
+
```ruby
|
634
|
+
model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
|
635
|
+
```
|
636
|
+
|
637
|
+
Pass your input
|
638
|
+
|
639
|
+
```ruby
|
640
|
+
input = [
|
641
|
+
"The dog is barking",
|
642
|
+
"The cat is purring",
|
643
|
+
"The bear is growling"
|
644
|
+
]
|
645
|
+
embeddings = model.(input)
|
646
|
+
```
|
647
|
+
|
648
|
+
Store the embeddings
|
649
|
+
|
650
|
+
```ruby
|
651
|
+
documents = []
|
652
|
+
input.zip(embeddings) do |content, embedding|
|
653
|
+
documents << {content: content, embedding: embedding}
|
654
|
+
end
|
655
|
+
Document.insert_all!(documents)
|
656
|
+
```
|
657
|
+
|
658
|
+
And get similar documents
|
203
659
|
|
204
660
|
```ruby
|
205
661
|
document = Document.first
|
206
662
|
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
|
207
663
|
```
|
208
664
|
|
209
|
-
See the [complete code](examples/
|
665
|
+
See the [complete code](examples/informers/example.rb)
|
666
|
+
|
667
|
+
### Hybrid Search
|
668
|
+
|
669
|
+
You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
|
670
|
+
|
671
|
+
Generate a model
|
672
|
+
|
673
|
+
```sh
|
674
|
+
rails generate model Document content:text embedding:vector{768}
|
675
|
+
rails db:migrate
|
676
|
+
```
|
677
|
+
|
678
|
+
And add `has_neighbors` and a scope for keyword search
|
679
|
+
|
680
|
+
```ruby
|
681
|
+
class Document < ApplicationRecord
|
682
|
+
has_neighbors :embedding
|
683
|
+
|
684
|
+
scope :search, ->(query) {
|
685
|
+
where("to_tsvector(content) @@ plainto_tsquery(?)", query)
|
686
|
+
.order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
|
687
|
+
}
|
688
|
+
end
|
689
|
+
```
|
690
|
+
|
691
|
+
Create some documents
|
692
|
+
|
693
|
+
```ruby
|
694
|
+
Document.create!(content: "The dog is barking")
|
695
|
+
Document.create!(content: "The cat is purring")
|
696
|
+
Document.create!(content: "The bear is growling")
|
697
|
+
```
|
698
|
+
|
699
|
+
Generate an embedding for each document
|
700
|
+
|
701
|
+
```ruby
|
702
|
+
embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
|
703
|
+
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
|
704
|
+
|
705
|
+
Document.find_each do |document|
|
706
|
+
embedding = embed.(document.content, **embed_options)
|
707
|
+
document.update!(embedding: embedding)
|
708
|
+
end
|
709
|
+
```
|
710
|
+
|
711
|
+
Perform keyword search
|
712
|
+
|
713
|
+
```ruby
|
714
|
+
query = "growling bear"
|
715
|
+
keyword_results = Document.search(query).limit(20).load_async
|
716
|
+
```
|
717
|
+
|
718
|
+
And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
|
719
|
+
|
720
|
+
```ruby
|
721
|
+
query_prefix = "Represent this sentence for searching relevant passages: "
|
722
|
+
query_embedding = embed.(query_prefix + query, **embed_options)
|
723
|
+
semantic_results =
|
724
|
+
Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
|
725
|
+
```
|
726
|
+
|
727
|
+
To combine the results, use Reciprocal Rank Fusion (RRF)
|
728
|
+
|
729
|
+
```ruby
|
730
|
+
Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
|
731
|
+
```
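
For intuition, Reciprocal Rank Fusion gives each result a score of `1 / (k + rank)` in every ranking where it appears and sums those scores. The sketch below is a plain-Ruby illustration with a hypothetical `rrf_sketch` method. The constant `k = 60` is the value commonly used for RRF; whether `Neighbor::Reranking.rrf` uses the same default or return format is not shown here.

```ruby
# Hypothetical illustration of RRF over any number of ranked lists.
def rrf_sketch(*rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |item, index|
      # index is zero-based, so the rank is index + 1
      scores[item] += 1.0 / (k + index + 1)
    end
  end
  # Highest combined score first
  scores.sort_by { |_, score| -score }.map(&:first)
end

keyword = ["bear", "dog"]
semantic = ["bear", "cat", "dog"]

rrf_sketch(keyword, semantic) # => ["bear", "dog", "cat"]
```

Results that rank well in both lists rise to the top, and no score normalization across the two searches is needed.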
|
732
|
+
|
733
|
+
Or a reranking model
|
734
|
+
|
735
|
+
```ruby
|
736
|
+
rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
|
737
|
+
results = (keyword_results + semantic_results).uniq
|
738
|
+
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }
|
739
|
+
```
|
740
|
+
|
741
|
+
See the [complete code](examples/hybrid/example.rb)
|
742
|
+
|
743
|
+
### Sparse Search
|
744
|
+
|
745
|
+
You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
|
746
|
+
|
747
|
+
Generate a model
|
748
|
+
|
749
|
+
```sh
|
750
|
+
rails generate model Document content:text embedding:sparsevec{30522}
|
751
|
+
rails db:migrate
|
752
|
+
```
|
753
|
+
|
754
|
+
And add `has_neighbors`
|
755
|
+
|
756
|
+
```ruby
|
757
|
+
class Document < ApplicationRecord
|
758
|
+
has_neighbors :embedding
|
759
|
+
end
|
760
|
+
```
|
761
|
+
|
762
|
+
Load a [model](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) to generate embeddings
|
763
|
+
|
764
|
+
```ruby
|
765
|
+
class EmbeddingModel
|
766
|
+
def initialize(model_id)
|
767
|
+
@model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
|
768
|
+
@tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
|
769
|
+
@special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
|
770
|
+
end
|
771
|
+
|
772
|
+
def embed(input)
|
773
|
+
feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
|
774
|
+
output = @model.(**feature)[0]
|
775
|
+
values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
|
776
|
+
values = Torch.log(1 + Torch.relu(values))
|
777
|
+
values[0.., @special_token_ids] = 0
|
778
|
+
values.to_a
|
779
|
+
end
|
780
|
+
end
|
781
|
+
|
782
|
+
model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
|
783
|
+
```
|
784
|
+
|
785
|
+
Pass your input
|
786
|
+
|
787
|
+
```ruby
|
788
|
+
input = [
|
789
|
+
"The dog is barking",
|
790
|
+
"The cat is purring",
|
791
|
+
"The bear is growling"
|
792
|
+
]
|
793
|
+
embeddings = model.embed(input)
|
794
|
+
```
|
795
|
+
|
796
|
+
Store the embeddings
|
797
|
+
|
798
|
+
```ruby
|
799
|
+
documents = []
|
800
|
+
input.zip(embeddings) do |content, embedding|
|
801
|
+
documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
|
802
|
+
end
|
803
|
+
Document.insert_all!(documents)
|
804
|
+
```
|
805
|
+
|
806
|
+
Embed the search query
|
807
|
+
|
808
|
+
```ruby
|
809
|
+
query = "forest"
|
810
|
+
query_embedding = model.embed([query])[0]
|
811
|
+
```
|
812
|
+
|
813
|
+
And search the documents
|
814
|
+
|
815
|
+
```ruby
|
816
|
+
Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
|
817
|
+
```
|
818
|
+
|
819
|
+
See the [complete code](examples/sparse/example.rb)
|
210
820
|
|
211
821
|
### Disco Recommendations
|
212
822
|
|
@@ -242,7 +852,7 @@ movies = []
|
|
242
852
|
recommender.item_ids.each do |item_id|
|
243
853
|
movies << {name: item_id, factors: recommender.item_factors(item_id)}
|
244
854
|
end
|
245
|
-
Movie.
|
855
|
+
Movie.create!(movies)
|
246
856
|
```
|
247
857
|
|
248
858
|
And get similar movies
|
@@ -252,19 +862,7 @@ movie = Movie.find_by(name: "Star Wars (1977)")
|
|
252
862
|
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
|
253
863
|
```
|
254
864
|
|
255
|
-
See the complete code for [cube](examples/
|
256
|
-
|
257
|
-
## Upgrading
|
258
|
-
|
259
|
-
### 0.2.0
|
260
|
-
|
261
|
-
The `distance` option has been moved from `has_neighbors` to `nearest_neighbors`, and there is no longer a default. If you use cosine distance, set:
|
262
|
-
|
263
|
-
```ruby
|
264
|
-
class Item < ApplicationRecord
|
265
|
-
has_neighbors normalize: true
|
266
|
-
end
|
267
|
-
```
|
865
|
+
See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
|
268
866
|
|
269
867
|
## History
|
270
868
|
|
@@ -285,11 +883,19 @@ To get started with development:
|
|
285
883
|
git clone https://github.com/ankane/neighbor.git
|
286
884
|
cd neighbor
|
287
885
|
bundle install
|
886
|
+
|
887
|
+
# Postgres
|
288
888
|
createdb neighbor_test
|
889
|
+
bundle exec rake test:postgresql
|
890
|
+
|
891
|
+
# SQLite
|
892
|
+
bundle exec rake test:sqlite
|
289
893
|
|
290
|
-
#
|
291
|
-
|
894
|
+
# MariaDB
|
895
|
+
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
|
896
|
+
bundle exec rake test:mariadb
|
292
897
|
|
293
|
-
#
|
294
|
-
|
898
|
+
# MySQL
|
899
|
+
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
|
900
|
+
bundle exec rake test:mysql
|
295
901
|
```
|