disco 0.2.3 → 0.2.7
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -0
- data/LICENSE.txt +1 -1
- data/README.md +70 -19
- data/lib/disco.rb +1 -0
- data/lib/disco/metrics.rb +10 -0
- data/lib/disco/recommender.rb +251 -103
- data/lib/disco/version.rb +1 -1
- metadata +5 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5f400f07839587b574ddcfa4c88335bfe20fcd876164b943e8094a35c3c1cfef
+  data.tar.gz: e2426b283146837d14be154ff0e67eb2505fd6587958b39212bf2dfe3bfccd80
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2be9f24184036ec5b093de55640aebb60887ac59c566f37698fcba7a18daa15cf586566708def0060f80fc0747a50447538cf42fdf36024ae19ddac0de8b415c
+  data.tar.gz: 4682a5524a8cad4a247ec53f99c78e317d56ee55433bb2ad7806af4f2a9854bc016fd23564003f009dc69d0fdcf81949dc88c64d3cbe824a8e76fc5cae8abc7d
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,28 @@
+## 0.2.7 (2021-08-06)
+
+- Added warning for `value`
+
+## 0.2.6 (2021-02-24)
+
+- Improved performance
+- Improved `inspect` method
+- Fixed issue with `similar_users` and `item_recs` returning the original user/item
+- Fixed error with `fit` after loading
+
+## 0.2.5 (2021-02-20)
+
+- Added `top_items` method
+- Added `optimize_similar_users` method
+- Added support for Faiss for `optimize_item_recs` and `optimize_similar_users` methods
+- Added `rmse` method
+- Improved performance
+
+## 0.2.4 (2021-02-15)
+
+- Added `user_ids` and `item_ids` methods
+- Added `user_id` argument to `user_factors`
+- Added `item_id` argument to `item_factors`
+
 ## 0.2.3 (2020-11-28)

 - Added `predict` method
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -35,24 +35,22 @@ recommender.fit([

 > IDs can be integers, strings, or any other data type

-If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating
+If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.

 ```ruby
 recommender.fit([
-  {user_id: 1, item_id: 1
-  {user_id: 2, item_id: 1
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
 ])
 ```

-
-
-Get user-based (user-item) recommendations - “users like you also liked”
+Get user-based recommendations - “users like you also liked”

 ```ruby
 recommender.user_recs(user_id)
 ```

-Get item-based
+Get item-based recommendations - “users who liked this item also liked”

 ```ruby
 recommender.item_recs(item_id)
@@ -106,11 +104,10 @@ views = Ahoy::Event.
   count

 data =
-  views.map do |(user_id, post_id),
+  views.map do |(user_id, post_id), _|
     {
       user_id: user_id,
-      item_id: post_id
-      value: count
+      item_id: post_id
     }
   end
 ```
@@ -201,6 +198,8 @@ bin = File.binread("recommender.bin")
 recommender = Marshal.load(bin)
 ```

+Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
+
 ## Algorithms

 Disco uses high-performance matrix factorization.
@@ -237,6 +236,16 @@ There are a number of ways to deal with this, but here are some common ones:
 - For user-based recommendations, show new users the most popular items.
 - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).

+Get top items with:
+
+```ruby
+recommender = Disco::Recommender.new(top_items: true)
+recommender.fit(data)
+recommender.top_items
+```
+
+This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
+
 ## Data

 Data can be an array of hashes
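The README text added in this hunk says `top_items` ranks by Wilson score for explicit feedback. As a rough illustration of why a lower confidence bound is used instead of a plain average, here is a minimal pure-Ruby sketch. `wilson_lower_bound` is a hypothetical helper, not the `wilson_score` gem's API. It rescales the mean rating to a proportion and applies the standard Wilson interval without continuity correction (the same shape as the commented-out expression in `top_items` in `recommender.rb`), so its numbers may differ from what the gem returns.

```ruby
# Hypothetical helper for illustration only (not part of disco or wilson_score).
# Returns the lower bound of the Wilson score interval for an average rating,
# rescaled back to the original rating range.
def wilson_lower_bound(avg, n, min_rating, max_rating, z: 1.96)
  return min_rating.to_f if n.zero?

  range = (max_rating - min_rating).to_f
  phat = (avg - min_rating) / range # mean rating as a proportion in 0..1
  centre = phat + z**2 / (2.0 * n)
  margin = z * Math.sqrt((phat * (1 - phat) + z**2 / (4.0 * n)) / n)
  bound = (centre - margin) / (1 + z**2 / n.to_f)
  bound * range + min_rating
end

few  = wilson_lower_bound(5.0, 2, 1, 5)   # two perfect ratings
many = wilson_lower_bound(4.6, 200, 1, 5) # many strong ratings
puts few < many # the item with more evidence ranks higher
```

With only two ratings the bound stays far below the raw average, so a handful of perfect scores cannot outrank an item with many strong ratings.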
@@ -257,45 +266,65 @@ Or a Daru data frame
 Daru::DataFrame.from_csv("ratings.csv")
 ```

-##
+## Performance

-If you have a large number of users
+If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.

 Add this line to your application’s Gemfile:

 ```ruby
-gem '
+gem 'faiss'
 ```

-Speed up
+Speed up the `user_recs` method with:

 ```ruby
-
+recommender.optimize_user_recs
 ```

-Speed up
+Speed up the `item_recs` method with:

 ```ruby
-
+recommender.optimize_item_recs
 ```

-
+Speed up the `similar_users` method with:
+
+```ruby
+recommender.optimize_similar_users
+```
+
+This should be called after fitting or loading the recommender.

 ## Reference

+Get ids
+
+```ruby
+recommender.user_ids
+recommender.item_ids
+```
+
 Get the global mean

 ```ruby
 recommender.global_mean
 ```

-Get
+Get factors

 ```ruby
 recommender.user_factors
 recommender.item_factors
 ```

+Get factors for specific users and items
+
+```ruby
+recommender.user_factors(user_id)
+recommender.item_factors(item_id)
+```
+
 ## Credits

 Thanks to:
@@ -304,6 +333,28 @@ Thanks to:
 - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
 - [@dasch](https://github.com/dasch) for the gem name

+## Upgrading
+
+### 0.2.7
+
+There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1, value: 1},
+  {user_id: 2, item_id: 1, value: 3}
+])
+```
+
+to:
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
+])
+```
+
 ## History

 View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
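The upgrade note above describes when the new warning fires: the data has no `:rating` on any row (implicit feedback) but at least one row carries `:value`. A minimal standalone sketch of that condition follows. `implicit_value_warning?` is a hypothetical name used only for illustration; in the gem the check lives inline in `fit`, as the `recommender.rb` hunk shows.

```ruby
# Hypothetical helper mirroring the condition described in the upgrade note.
# rows is an array of hashes such as {user_id: 1, item_id: 1, value: 3}.
def implicit_value_warning?(rows)
  implicit = rows.none? { |v| v[:rating] } # no ratings at all => implicit feedback
  implicit && rows.any? { |v| v[:value] }  # :value present but ignored
end

old_style = [{user_id: 1, item_id: 1, value: 1}, {user_id: 2, item_id: 1, value: 3}]
new_style = [{user_id: 1, item_id: 1}, {user_id: 2, item_id: 1}]
explicit  = [{user_id: 1, item_id: 1, rating: 5}]

puts implicit_value_warning?(old_style)
puts implicit_value_warning?(new_style)
puts implicit_value_warning?(explicit)
```

Only the first data set would trigger the warning; dropping `:value`, as the note recommends, silences it without changing recommendations.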
data/lib/disco.rb
CHANGED
data/lib/disco/recommender.rb
CHANGED
@@ -1,46 +1,73 @@
 module Disco
   class Recommender
-    attr_reader :global_mean
+    attr_reader :global_mean

-    def initialize(factors: 8, epochs: 20, verbose: nil)
+    def initialize(factors: 8, epochs: 20, verbose: nil, top_items: false)
       @factors = factors
       @epochs = epochs
       @verbose = verbose
+      @user_map = {}
+      @item_map = {}
+      @top_items = top_items
     end

     def fit(train_set, validation_set: nil)
       train_set = to_dataset(train_set)
       validation_set = to_dataset(validation_set) if validation_set

+      check_training_set(train_set)
+
+      # TODO option to set in initializer to avoid pass
+      # could also just check first few values
+      # but may be confusing if they are all missing and later ones aren't
       @implicit = !train_set.any? { |v| v[:rating] }

+      if @implicit && train_set.any? { |v| v[:value] }
+        warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+      end
+
+      # TODO improve performance
+      # (catch exception instead of checking ahead of time)
       unless @implicit
-
-        check_ratings(ratings)
-        @min_rating = ratings.min
-        @max_rating = ratings.max
+        check_ratings(train_set)

         if validation_set
-          check_ratings(validation_set
+          check_ratings(validation_set)
         end
       end

-      check_training_set(train_set)
-      create_maps(train_set)
-
       @rated = Hash.new { |hash, key| hash[key] = {} }
       input = []
-      value_key = @implicit ? :value : :rating
       train_set.each do |v|
-
-
+        # update maps and build matrix in single pass
+        u = (@user_map[v[:user_id]] ||= @user_map.size)
+        i = (@item_map[v[:item_id]] ||= @item_map.size)
         @rated[u][i] = true

         # explicit will always have a value due to check_ratings
-        input << [u, i,
+        input << [u, i, @implicit ? 1 : v[:rating]]
       end
       @rated.default = nil

+      # much more efficient than checking every value in another pass
+      raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+      raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+      # TODO improve performance
+      unless @implicit
+        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+      end
+
+      if @top_items
+        @item_count = [0] * @item_map.size
+        @item_sum = [0.0] * @item_map.size
+        train_set.each do |v|
+          i = @item_map[v[:item_id]]
+          @item_count[i] += 1
+          @item_sum[i] += (@implicit ? 1 : v[:rating])
+        end
+      end
+
       eval_set = nil
       if validation_set
         eval_set = []
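The hunk above replaces the separate `create_maps` pass with index assignment inside the training loop, using `map[key] ||= map.size`. The following standalone sketch shows that idiom outside the class. `build_maps` is a hypothetical function written for illustration, and treating rows without a rating as `1` is a simplification of the `@implicit ? 1 : v[:rating]` expression in the diff.

```ruby
# Hypothetical standalone version of the single-pass mapping idea.
# Each new id gets the next free index, because the hash size is read
# before the key is inserted.
def build_maps(rows)
  user_map = {}
  item_map = {}
  triples = rows.map do |v|
    u = (user_map[v[:user_id]] ||= user_map.size)
    i = (item_map[v[:item_id]] ||= item_map.size)
    [u, i, v[:rating] || 1]
  end

  # a nil key means at least one row lacked an id; checking once here
  # avoids validating every row in a separate pass
  raise ArgumentError, "Missing user_id" if user_map.key?(nil)
  raise ArgumentError, "Missing item_id" if item_map.key?(nil)

  [user_map, item_map, triples]
end

user_map, item_map, triples = build_maps([
  {user_id: "a", item_id: :x, rating: 4},
  {user_id: "b", item_id: :x, rating: 2},
  {user_id: "a", item_id: :y, rating: 5}
])
p user_map, item_map, triples
```

The idiom relies on `0` being truthy in Ruby, so an id already mapped to index 0 is not reassigned on later rows.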
@@ -52,7 +79,7 @@ module Disco
           u ||= -1
           i ||= -1

-          eval_set << [u, i,
+          eval_set << [u, i, @implicit ? 1 : v[:rating]]
         end
       end

@@ -67,8 +94,12 @@ module Disco
       @user_factors = model.p_factors(format: :numo)
       @item_factors = model.q_factors(format: :numo)

-      @
-      @
+      @normalized_user_factors = nil
+      @normalized_item_factors = nil
+
+      @user_recs_index = nil
+      @similar_users_index = nil
+      @similar_items_index = nil
     end

     # generates a prediction even if a user has already rated the item
@@ -95,139 +126,239 @@ module Disco
       u = @user_map[user_id]

       if u
-
-
-        predictions =
-        @item_map.keys.zip(predictions).map do |item_id, pred|
-          {item_id: item_id, score: pred}
-        end
+        rated = item_ids ? {} : @rated[u]

         if item_ids
-
-
+          ids = Numo::NArray.cast(item_ids.map { |i| @item_map[i] }.compact)
+          return [] if ids.size == 0
+
+          predictions = @item_factors[ids, true].inner(@user_factors[u, true])
+          indexes = predictions.sort_index.reverse
+          indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = ids[indexes]
+        elsif @user_recs_index && count
+          predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
         else
-          @
-
-
+          predictions = @item_factors.inner(@user_factors[u, true])
+          indexes = predictions.sort_index.reverse # reverse just creates view
+          indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = indexes
         end

-        predictions.
-        predictions = predictions.first(count) if count && !item_ids
+        predictions.inplace.clip(@min_rating, @max_rating) if @min_rating

-
-
-
-
-            pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
-          end
-        end
+        keys = @item_map.keys
+        result = []
+        ids.each_with_index do |item_id, i|
+          next if rated[item_id]

-
+          result << {item_id: keys[item_id], score: predictions[i]}
+          break if result.size == count
+        end
+        result
+      elsif @top_items
+        top_items(count: count)
       else
-        # no items if user is unknown
-        # TODO maybe most popular items
         []
       end
     end

-    def
+    def similar_items(item_id, count: 5)
       check_fit
-      @
+      similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
     end
-    alias_method :
+    alias_method :item_recs, :similar_items

-    def
+    def similar_users(user_id, count: 5)
       check_fit
-      @
+      similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
     end

-    def
+    def top_items(count: 5)
       check_fit
-
+      raise "top_items not computed" unless @top_items
+
+      if @implicit
+        scores = Numo::UInt64.cast(@item_count)
+      else
+        require "wilson_score"
+
+        range = @min_rating..@max_rating
+        scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+        # TODO uncomment in 0.3.0
+        # wilson score with continuity correction
+        # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+        # z = 1.96 # 95% confidence
+        # range = @max_rating - @min_rating
+        # n = Numo::DFloat.cast(@item_count)
+        # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+        # phat = (phat - (1 / 2 * n)).clip(0, 100) # continuity correction
+        # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+        # scores = scores * range + @min_rating
+      end
+
+      indexes = scores.sort_index.reverse
+      indexes = indexes[0...[count, indexes.size].min] if count
+      scores = scores[indexes]
+
+      keys = @item_map.keys
+      indexes.size.times.map do |i|
+        {item_id: keys[indexes[i]], score: scores[i]}
+      end
     end
-    alias_method :item_recs, :similar_items

-    def
+    def user_ids
+      @user_map.keys
+    end
+
+    def item_ids
+      @item_map.keys
+    end
+
+    def user_factors(user_id = nil)
+      if user_id
+        u = @user_map[user_id]
+        @user_factors[u, true] if u
+      else
+        @user_factors
+      end
+    end
+
+    def item_factors(item_id = nil)
+      if item_id
+        i = @item_map[item_id]
+        @item_factors[i, true] if i
+      else
+        @item_factors
+      end
+    end
+
+    def optimize_user_recs
       check_fit
-
+      @user_recs_index = create_index(item_factors, library: "faiss")
     end

-
+    def optimize_similar_items(library: nil)
+      check_fit
+      @similar_items_index = create_index(normalized_item_factors, library: library)
+    end
+    alias_method :optimize_item_recs, :optimize_similar_items
+
+    def optimize_similar_users(library: nil)
+      check_fit
+      @similar_users_index = create_index(normalized_user_factors, library: library)
+    end
+
+    def inspect
+      to_s # for now
+    end

-
-      require "ngt"
+    private

-
-
-
+    # factors should already be normalized for similar users/items
+    def create_index(factors, library:)
+      # TODO make Faiss the default in 0.3.0
+      library ||= defined?(Faiss) && !defined?(Ngt) ? "faiss" : "ngt"
+
+      case library
+      when "faiss"
+        require "faiss"
+
+        # inner product is cosine similarity with normalized vectors
+        # https://github.com/facebookresearch/faiss/issues/95
+        #
+        # TODO use non-exact index in 0.3.0
+        # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+        # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
+        index = Faiss::IndexFlatIP.new(factors.shape[1])
+
+        # ids are from 0...total
+        # https://github.com/facebookresearch/faiss/blob/96b740abedffc8f67389f29c2a180913941534c6/faiss/Index.h#L89
+        index.add(factors)
+
+        index
+      when "ngt"
+        require "ngt"
+
+        # could speed up search with normalized cosine
+        # https://github.com/yahoojapan/NGT/issues/36
+        index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+
+        # NGT normalizes so could call create_index without normalized factors
+        # but keep code simple for now
+        ids = index.batch_insert(factors)
+        raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
+
+        index
+      else
+        raise ArgumentError, "Invalid library: #{library}"
+      end
     end

-    def
-      @
+    def normalized_user_factors
+      @normalized_user_factors ||= normalize(@user_factors)
     end

-    def
-      @
+    def normalized_item_factors
+      @normalized_item_factors ||= normalize(@item_factors)
     end

-    def
+    def normalize(factors)
       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
-      norms
+      factors / norms.expand_dims(1)
     end

-    def similar(id, map,
+    def similar(id, map, norm_factors, count, index)
       i = map[id]
-
+
+      if i && norm_factors.shape[0] > 1
         if index && count
-
-
-
-
-
-
-
-
-          }
+          if defined?(Faiss) && index.is_a?(Faiss::Index)
+            predictions, ids = index.search(norm_factors[i, true].expand_dims(0), count + 1).map { |v| v.to_a[0] }
+          else
+            result = index.search(norm_factors[i, true], size: count + 1)
+            # ids from batch_insert start at 1 instead of 0
+            ids = result.map { |v| v[:id] - 1 }
+            # convert cosine distance to cosine similarity
+            predictions = result.map { |v| 1 - v[:distance] }
           end
         else
-          predictions =
-
-
-
-
-          end
-
-        max_score = predictions.delete_at(i)[:score]
-        predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-        predictions = predictions.first(count) if count
-        # divide by max score to get cosine similarity
-        # only need to do for returned records
-        predictions.each { |pred| pred[:score] /= max_score }
-        predictions
+          predictions = norm_factors.inner(norm_factors[i, true])
+          indexes = predictions.sort_index.reverse
+          indexes = indexes[0...[count + 1, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = indexes
         end
-      else
-        []
-      end
-    end

-
-      user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
-      item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
+        keys = map.keys

-
-
+        # TODO use user_id for similar_users in 0.3.0
+        key = :item_id

-
-
+        result = []
+        # items can have the same score
+        # so original item may not be at index 0
+        ids.each_with_index do |id, j|
+          next if id == i
+
+          result << {key => keys[id], score: predictions[j]}
+        end
+        result
+      else
+        []
+      end
     end

     def check_ratings(ratings)
-      unless ratings.all? { |r| !r.nil? }
-        raise ArgumentError, "Missing
+      unless ratings.all? { |r| !r[:rating].nil? }
+        raise ArgumentError, "Missing rating"
       end
-      unless ratings.all? { |r| r.is_a?(Numeric) }
-        raise ArgumentError, "
+      unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
+        raise ArgumentError, "Rating must be numeric"
       end
     end

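The rewritten `similar` method normalizes factor rows once, takes inner products to get cosine similarity, and then skips the queried row by index instead of assuming it sorts first. The sketch below reproduces that idea with plain Ruby arrays. It is an analogue for illustration, not the gem's Numo-based code, and the two top-level functions and the `:id` key are hypothetical.

```ruby
# Scale every row to unit length so that a dot product equals cosine similarity.
def normalize(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |x| x * x })
    norm = 1e-10 if norm.zero? # avoid dividing by zero, as in the diff
    row.map { |x| x / norm }
  end
end

# Return up to count neighbours of rows[index], best first, never the row itself.
def similar(index, rows, count)
  unit = normalize(rows)
  target = unit[index]
  scores = unit.map { |row| row.zip(target).sum { |a, b| a * b } }
  order = scores.each_index.sort_by { |j| -scores[j] }
  # filter by index rather than dropping the first hit: a tied row may sort first
  order.reject { |j| j == index }.first(count).map { |j| {id: j, score: scores[j]} }
end

factors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
result = similar(0, factors, 2)
p result
```

Rows 0 and 1 point in the same direction, so they tie at similarity 1. Dropping whatever sorts first could discard row 1 and return row 0 itself, which is the bug the 0.2.6 changelog entry describes.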
@@ -266,7 +397,10 @@ module Disco
         rated: @rated,
         global_mean: @global_mean,
         user_factors: @user_factors,
-        item_factors: @item_factors
+        item_factors: @item_factors,
+        factors: @factors,
+        epochs: @epochs,
+        verbose: @verbose
       }

       unless @implicit
@@ -274,6 +408,11 @@ module Disco
         obj[:max_rating] = @max_rating
       end

+      if @top_items
+        obj[:item_count] = @item_count
+        obj[:item_sum] = @item_sum
+      end
+
       obj
     end

@@ -285,11 +424,20 @@ module Disco
       @global_mean = obj[:global_mean]
       @user_factors = obj[:user_factors]
       @item_factors = obj[:item_factors]
+      @factors = obj[:factors]
+      @epochs = obj[:epochs]
+      @verbose = obj[:verbose]

       unless @implicit
         @min_rating = obj[:min_rating]
         @max_rating = obj[:max_rating]
       end
+
+      @top_items = obj.key?(:item_count)
+      if @top_items
+        @item_count = obj[:item_count]
+        @item_sum = obj[:item_sum]
+      end
     end
   end
 end
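The last two hunks add optional keys to the serialized hash and detect them on load with `obj.key?(:item_count)`, so dumps written without top-item data still load. Below is a small sketch of that pattern, assuming it sits in Ruby's `marshal_dump` / `marshal_load` hooks (the hunk context does not show the method names). `TinyModel` and its fields are hypothetical and exist only to make the example runnable.

```ruby
# Hypothetical class; only the optional-key pattern is taken from the diff.
class TinyModel
  attr_reader :item_count

  def initialize(top_items: false)
    @top_items = top_items
    @item_count = [3, 1] if top_items
  end

  def top_items?
    @top_items
  end

  # Called by Marshal.dump: write optional data only when it exists.
  def marshal_dump
    obj = {name: "tiny"}
    obj[:item_count] = @item_count if @top_items
    obj
  end

  # Called by Marshal.load: a missing key means the feature was off
  # (or the dump predates it), so it stays disabled.
  def marshal_load(obj)
    @top_items = obj.key?(:item_count)
    @item_count = obj[:item_count] if @top_items
  end
end

with_top = Marshal.load(Marshal.dump(TinyModel.new(top_items: true)))
without_top = Marshal.load(Marshal.dump(TinyModel.new))
puts with_top.top_items?
puts without_top.top_items?
```

Deriving the flag from the presence of the key, rather than storing it separately, keeps the serialized format compatible in both directions.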
data/lib/disco/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.7
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2021-08-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -39,7 +39,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 description:
-email: andrew@
+email: andrew@ankane.org
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -51,6 +51,7 @@ files:
 - lib/disco.rb
 - lib/disco/data.rb
 - lib/disco/engine.rb
+- lib/disco/metrics.rb
 - lib/disco/model.rb
 - lib/disco/recommender.rb
 - lib/disco/version.rb
@@ -75,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.2.22
 signing_key:
 specification_version: 4
 summary: Recommendations for Ruby and Rails using collaborative filtering