disco 0.2.5 → 0.2.8
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -0
- data/README.md +36 -19
- data/lib/disco/model.rb +1 -0
- data/lib/disco/recommender.rb +89 -45
- data/lib/disco/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0ec370f448c5cc8aebb2860b0580466687874eb34c165e1a3d0254a1c6e701d7
+  data.tar.gz: 1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1802e0fbf68ee489891f94e468c3b24df0eb8463de4e55a7f50a9dbc86bda7631ca15d5b84d0d584dbd341e994db84a82f00f6753c70eda4a7400376f0443df5
+  data.tar.gz: bee0645357a5fc4eb226e75d4b56208e7377952911dd19ccad8f3072eee3eaf4d4a318933dcb4ec4d75273630e278dbf98b3ac21f5987baea90261d15cc2d851
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,18 @@
+## 0.2.8 (2022-03-13)
+
+- Fixed error with `top_items` with all same rating
+
+## 0.2.7 (2021-08-06)
+
+- Added warning for `value`
+
+## 0.2.6 (2021-02-24)
+
+- Improved performance
+- Improved `inspect` method
+- Fixed issue with `similar_users` and `item_recs` returning the original user/item
+- Fixed error with `fit` after loading
+
 ## 0.2.5 (2021-02-20)
 
 - Added `top_items` method
data/README.md
CHANGED
@@ -35,16 +35,16 @@ recommender.fit([
 
 > IDs can be integers, strings, or any other data type
 
-If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating
+If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.
 
 ```ruby
 recommender.fit([
-  {user_id: 1, item_id: 1
-  {user_id: 2, item_id: 1
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
 ])
 ```
 
->
+> Each `user_id`/`item_id` combination should only appear once
 
 Get user-based recommendations - “users like you also liked”
 
@@ -99,18 +99,13 @@ recommender.item_recs("Star Wars (1977)")
 [Ahoy](https://github.com/ankane/ahoy) is a great source for implicit feedback
 
 ```ruby
-views = Ahoy::Event.
-  where(name: "Viewed post").
-  group(:user_id).
-  group("properties->>'post_id'"). # postgres syntax
-  count
+views = Ahoy::Event.where(name: "Viewed post").group(:user_id).group_prop(:post_id).count
 
 data =
-  views.map do |(user_id, post_id),
+  views.map do |(user_id, post_id), _|
     {
       user_id: user_id,
-      item_id: post_id
-      value: count
+      item_id: post_id
     }
   end
 ```
@@ -201,7 +196,7 @@ bin = File.binread("recommender.bin")
 recommender = Marshal.load(bin)
 ```
 
-Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor)
+Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
 
 ## Algorithms
 
@@ -247,7 +242,7 @@ recommender.fit(data)
 recommender.top_items
 ```
 
-This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) your application’s Gemfile) and item frequency for implicit feedback.
+This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
 
 ## Data
 
@@ -269,7 +264,7 @@ Or a Daru data frame
 Daru::DataFrame.from_csv("ratings.csv")
 ```
 
-## Performance
+## Performance
 
 If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
 
@@ -282,22 +277,22 @@ gem 'faiss'
 Speed up the `user_recs` method with:
 
 ```ruby
-
+recommender.optimize_user_recs
 ```
 
 Speed up the `item_recs` method with:
 
 ```ruby
-
+recommender.optimize_item_recs
 ```
 
 Speed up the `similar_users` method with:
 
 ```ruby
-
+recommender.optimize_similar_users
 ```
 
-This should be called after fitting or loading the
+This should be called after fitting or loading the recommender.
 
 ## Reference
 
@@ -336,6 +331,28 @@ Thanks to:
 - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
 - [@dasch](https://github.com/dasch) for the gem name
 
+## Upgrading
+
+### 0.2.7
+
+There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1, value: 1},
+  {user_id: 2, item_id: 1, value: 3}
+])
+```
+
+to:
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
+])
+```
+
 ## History
 
 View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
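The README changes above imply two preparation steps for implicit feedback: drop any `:value` key, and keep each `user_id`/`item_id` pair only once. A minimal plain-Ruby sketch of that cleanup follows. `prepare_implicit` is a hypothetical helper name used for illustration and is not part of Disco.

```ruby
# Hypothetical helper (not part of Disco): prepares rows for implicit feedback.
def prepare_implicit(rows)
  rows
    .map { |row| {user_id: row[:user_id], item_id: row[:item_id]} } # drop :value and any other keys
    .uniq                                                            # keep each user/item pair once
end

events = [
  {user_id: 1, item_id: 1, value: 1},
  {user_id: 2, item_id: 1, value: 3},
  {user_id: 1, item_id: 1, value: 2}
]

data = prepare_implicit(events)
# data can then be passed to recommender.fit(data)
```

Cleaning the rows before calling `fit` should also avoid the new 0.2.7 warning, since no row carries `:value` any more.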
data/lib/disco/model.rb
CHANGED
@@ -10,6 +10,7 @@ module Disco
 
         has_many :"recommended_#{name}", -> { where("disco_recommendations.context = ?", name).order("disco_recommendations.score DESC") }, through: :recommendations, source: :item, source_type: class_name
 
+        # TODO use fetch for item_id and score in 0.3.0
         define_method("update_recommended_#{name}") do |items|
           now = Time.now
           items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item[:item_id], context: name, score: item[:score], created_at: now, updated_at: now} }
data/lib/disco/recommender.rb
CHANGED
@@ -17,38 +17,54 @@ module Disco
 
       check_training_set(train_set)
 
+      # TODO option to set in initializer to avoid pass
+      # could also just check first few values
+      # but may be confusing if they are all missing and later ones aren't
       @implicit = !train_set.any? { |v| v[:rating] }
+
+      if @implicit && train_set.any? { |v| v[:value] }
+        warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+      end
+
+      # TODO improve performance
+      # (catch exception instead of checking ahead of time)
       unless @implicit
         check_ratings(train_set)
-        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
 
         if validation_set
           check_ratings(validation_set)
         end
       end
 
-      update_maps(train_set)
-
       @rated = Hash.new { |hash, key| hash[key] = {} }
       input = []
-      value_key = @implicit ? :value : :rating
       train_set.each do |v|
-
-
+        # update maps and build matrix in single pass
+        u = (@user_map[v[:user_id]] ||= @user_map.size)
+        i = (@item_map[v[:item_id]] ||= @item_map.size)
         @rated[u][i] = true
 
         # explicit will always have a value due to check_ratings
-        input << [u, i,
+        input << [u, i, @implicit ? 1 : v[:rating]]
       end
       @rated.default = nil
 
+      # much more efficient than checking every value in another pass
+      raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+      raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+      # TODO improve performance
+      unless @implicit
+        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+      end
+
       if @top_items
         @item_count = [0] * @item_map.size
         @item_sum = [0.0] * @item_map.size
         train_set.each do |v|
           i = @item_map[v[:item_id]]
           @item_count[i] += 1
-          @item_sum[i] += (v[
+          @item_sum[i] += (@implicit ? 1 : v[:rating])
         end
       end
 
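The hunk above folds the old `update_maps` pass into the main loop and replaces the two up-front `any?` scans with a single `key?(nil)` lookup per map. The idea can be sketched in plain Ruby as below. `build_maps` is an illustrative name, and the sketch leaves out ratings, `@rated`, and the rest of `fit`.

```ruby
# Plain-Ruby sketch of the single-pass id mapping; not the gem's full fit method.
def build_maps(train_set)
  user_map = {}
  item_map = {}
  input = []

  train_set.each do |v|
    # assign the next free index the first time an id is seen
    u = (user_map[v[:user_id]] ||= user_map.size)
    i = (item_map[v[:item_id]] ||= item_map.size)
    input << [u, i]
  end

  # a missing id ends up as a nil key, so one lookup replaces a second pass
  raise ArgumentError, "Missing user_id" if user_map.key?(nil)
  raise ArgumentError, "Missing item_id" if item_map.key?(nil)

  [user_map, item_map, input]
end

user_map, item_map, input = build_maps([
  {user_id: "a", item_id: :x},
  {user_id: "b", item_id: :x},
  {user_id: "a", item_id: :y}
])
```

The trade-off is that validation now happens after the maps are built, so a bad row is reported only once the loop has finished.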
@@ -63,7 +79,7 @@ module Disco
           u ||= -1
           i ||= -1
 
-          eval_set << [u, i,
+          eval_set << [u, i, @implicit ? 1 : v[:rating]]
         end
       end
 
@@ -78,6 +94,9 @@ module Disco
       @user_factors = model.p_factors(format: :numo)
       @item_factors = model.q_factors(format: :numo)
 
+      @normalized_user_factors = nil
+      @normalized_item_factors = nil
+
       @user_recs_index = nil
       @similar_users_index = nil
       @similar_items_index = nil
@@ -122,8 +141,7 @@ module Disco
           predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
         else
           predictions = @item_factors.inner(@user_factors[u, true])
-
-          indexes = predictions.sort_index.reverse
+          indexes = predictions.sort_index.reverse # reverse just creates view
           indexes = indexes[0...[count + rated.size, indexes.size].min] if count
           predictions = predictions[indexes]
           ids = indexes
@@ -149,13 +167,13 @@ module Disco
 
     def similar_items(item_id, count: 5)
       check_fit
-      similar(item_id, @item_map,
+      similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
     end
     alias_method :item_recs, :similar_items
 
     def similar_users(user_id, count: 5)
       check_fit
-      similar(user_id, @user_map,
+      similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
     end
 
     def top_items(count: 5)
@@ -163,19 +181,38 @@ module Disco
       raise "top_items not computed" unless @top_items
 
       if @implicit
-        scores = @item_count
+        scores = Numo::UInt64.cast(@item_count)
       else
         require "wilson_score"
 
-        range =
-
+        range =
+          if @min_rating == @max_rating
+            # TODO remove temp fix
+            (@min_rating - 1)..@max_rating
+          else
+            @min_rating..@max_rating
+          end
+        scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+        # TODO uncomment in 0.3.0
+        # wilson score with continuity correction
+        # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+        # z = 1.96 # 95% confidence
+        # range = @max_rating - @min_rating
+        # n = Numo::DFloat.cast(@item_count)
+        # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+        # phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
+        # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+        # scores = scores * range + @min_rating
       end
 
-
-
-
-
-
+      indexes = scores.sort_index.reverse
+      indexes = indexes[0...[count, indexes.size].min] if count
+      scores = scores[indexes]
+
+      keys = @item_map.keys
+      indexes.size.times.map do |i|
+        {item_id: keys[indexes[i]], score: scores[i]}
       end
     end
 
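The commented-out block in the hunk above describes a Wilson lower bound rescaled to the rating range, and the temporary fix widens the range when every rating is the same so that the width is never zero. A plain-Ruby, single-item sketch of both ideas follows. It transcribes the commented formula rather than the `wilson_score` gem's implementation, and `rating_bounds` and `wilson_lower_bound` are hypothetical names.

```ruby
# Mirrors the temporary fix in the diff: widen the range when all ratings are equal.
def rating_bounds(min_rating, max_rating)
  min_rating == max_rating ? [min_rating - 1, max_rating] : [min_rating, max_rating]
end

# Single-item transcription of the commented-out formula (an assumption-laden sketch,
# not the wilson_score gem's code). sum and count describe one item's ratings.
def wilson_lower_bound(sum, count, min_rating, max_rating, z: 1.96)
  range = (max_rating - min_rating).to_f
  raise ArgumentError, "min_rating and max_rating must differ" if range.zero?

  n = count.to_f
  phat = (sum - min_rating * n) / range / n # average rating scaled to 0..1
  phat = [phat - 1 / (2 * n), 0.0].max      # continuity correction, clipped at 0
  score = (phat + z**2 / (2 * n) - z * Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
  score * range + min_rating                # back to the rating scale
end

few = wilson_lower_bound(8, 2, 1, 5)      # two ratings averaging 4
many = wilson_lower_bound(800, 200, 1, 5) # two hundred ratings averaging 4

low, high = rating_bounds(3, 3)
same = wilson_lower_bound(9, 3, low, high) # every rating is 3
```

With the same average, the item with more ratings gets the higher lower bound, which is why this score is preferred over a raw mean for ranking.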
@@ -212,13 +249,17 @@ module Disco
 
     def optimize_similar_items(library: nil)
       check_fit
-      @similar_items_index = create_index(
+      @similar_items_index = create_index(normalized_item_factors, library: library)
     end
     alias_method :optimize_item_recs, :optimize_similar_items
 
     def optimize_similar_users(library: nil)
       check_fit
-      @similar_users_index = create_index(
+      @similar_users_index = create_index(normalized_user_factors, library: library)
+    end
+
+    def inspect
+      to_s # for now
     end
 
     private
@@ -235,8 +276,9 @@ module Disco
         # inner product is cosine similarity with normalized vectors
         # https://github.com/facebookresearch/faiss/issues/95
         #
-        # TODO use non-exact index
+        # TODO use non-exact index in 0.3.0
         # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+        # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
         index = Faiss::IndexFlatIP.new(factors.shape[1])
 
         # ids are from 0...total
@@ -251,7 +293,7 @@ module Disco
         # https://github.com/yahoojapan/NGT/issues/36
         index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
 
-        # NGT normalizes so could call create_index
+        # NGT normalizes so could call create_index without normalized factors
         # but keep code simple for now
         ids = index.batch_insert(factors)
         raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
@@ -262,15 +304,15 @@ module Disco
       end
     end
 
-    def
-      @
+    def normalized_user_factors
+      @normalized_user_factors ||= normalize(@user_factors)
     end
 
-    def
-      @
+    def normalized_item_factors
+      @normalized_item_factors ||= normalize(@item_factors)
     end
 
-    def
+    def normalize(factors)
       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
       factors / norms.expand_dims(1)
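The hunk above computes normalized factors lazily with `||=`, and an earlier hunk resets both caches to `nil` inside `fit`. A plain-Ruby sketch of that memoize-and-invalidate pattern follows, using nested arrays in place of Numo matrices. `FactorCache` is an illustrative class and does not exist in Disco.

```ruby
# Sketch with plain arrays in place of Numo matrices; class and method names are illustrative.
class FactorCache
  def initialize(factors)
    @factors = factors
    @normalized = nil
  end

  # replacing the factors invalidates the cache, as fit does in the diff
  def factors=(factors)
    @factors = factors
    @normalized = nil
  end

  # computed on first use, then reused until the factors change
  def normalized
    @normalized ||= normalize(@factors)
  end

  private

  # scale each row to unit length so an inner product equals cosine similarity
  def normalize(factors)
    factors.map do |row|
      norm = Math.sqrt(row.sum { |x| x * x })
      norm = 1e-10 if norm.zero? # no zeros
      row.map { |x| x / norm }
    end
  end
end

cache = FactorCache.new([[3.0, 4.0], [0.0, 0.0]])
first = cache.normalized
```

Calling `cache.normalized` again returns the same object, so repeated similarity lookups pay the normalization cost only once per fit.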
@@ -303,30 +345,26 @@ module Disco
         # TODO use user_id for similar_users in 0.3.0
         key = :item_id
 
-
-
+        result = []
+        # items can have the same score
+        # so original item may not be at index 0
+        ids.each_with_index do |id, j|
+          next if id == i
+
+          result << {key => keys[id], score: predictions[j]}
         end
+        result
       else
         []
       end
     end
 
-    def update_maps(train_set)
-      raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
-      raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }
-
-      train_set.each do |v|
-        @user_map[v[:user_id]] ||= @user_map.size
-        @item_map[v[:item_id]] ||= @item_map.size
-      end
-    end
-
     def check_ratings(ratings)
       unless ratings.all? { |r| !r[:rating].nil? }
-        raise ArgumentError, "Missing
+        raise ArgumentError, "Missing rating"
       end
       unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
-        raise ArgumentError, "
+        raise ArgumentError, "Rating must be numeric"
       end
     end
 
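The `similar` change in the hunk above skips the query item by id instead of assuming it sits first in the results, which matters when another item ties with it on score. A plain-Ruby sketch of that filter follows. `neighbors` is a hypothetical name, and trimming with `first(count)` is an assumption made for the sketch.

```ruby
# Sketch: ids and predictions are parallel arrays already sorted by score,
# as a nearest-neighbor search would return them.
def neighbors(ids, predictions, keys, i, count)
  result = []
  ids.each_with_index do |id, j|
    next if id == i # skip the query item wherever it appears

    result << {item_id: keys[id], score: predictions[j]}
  end
  result.first(count) # assumption for this sketch: trim after filtering
end

keys = ["A", "B", "C"]
# item 1 ("B") is the query, but a tie puts item 0 ahead of it
similar = neighbors([0, 1, 2], [1.0, 1.0, 0.4], keys, 1, 2)
```

Dropping the first result unconditionally would have removed "A" here and returned the query item "B" as its own neighbor.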
@@ -365,7 +403,10 @@ module Disco
         rated: @rated,
         global_mean: @global_mean,
         user_factors: @user_factors,
-        item_factors: @item_factors
+        item_factors: @item_factors,
+        factors: @factors,
+        epochs: @epochs,
+        verbose: @verbose
       }
 
       unless @implicit
@@ -389,6 +430,9 @@ module Disco
       @global_mean = obj[:global_mean]
       @user_factors = obj[:user_factors]
       @item_factors = obj[:item_factors]
+      @factors = obj[:factors]
+      @epochs = obj[:epochs]
+      @verbose = obj[:verbose]
 
       unless @implicit
         @min_rating = obj[:min_rating]
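The last two hunks add `factors`, `epochs`, and `verbose` to the marshaled hash, which lines up with the changelog entry "Fixed error with `fit` after loading": `Marshal.load` does not call `initialize`, so any setting left out of `marshal_dump` comes back as `nil`. A self-contained sketch of the round trip follows. `TinyModel` is an illustrative class, and its default values are placeholders rather than a statement of Disco's defaults.

```ruby
# Illustrative class, not Disco::Recommender: shows settings surviving a Marshal round trip.
class TinyModel
  attr_reader :factors, :epochs, :verbose

  def initialize(factors: 8, epochs: 20, verbose: nil)
    @factors = factors
    @epochs = epochs
    @verbose = verbose
  end

  # everything a later fit call needs must be listed here
  def marshal_dump
    {factors: @factors, epochs: @epochs, verbose: @verbose}
  end

  # called on a freshly allocated object; initialize is not run
  def marshal_load(obj)
    @factors = obj[:factors]
    @epochs = obj[:epochs]
    @verbose = obj[:verbose]
  end
end

loaded = Marshal.load(Marshal.dump(TinyModel.new(factors: 16, epochs: 5)))
```

After loading, `loaded.factors` and `loaded.epochs` carry the values set before dumping, so code that reads them later does not hit `nil`.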
data/lib/disco/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.8
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2022-03-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -76,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
 summary: Recommendations for Ruby and Rails using collaborative filtering