disco 0.1.3 → 0.2.5
- checksums.yaml +4 -4
- data/CHANGELOG.md +32 -0
- data/LICENSE.txt +1 -1
- data/README.md +56 -14
- data/lib/disco.rb +1 -0
- data/lib/disco/data.rb +1 -2
- data/lib/disco/metrics.rb +10 -0
- data/lib/disco/recommender.rb +227 -82
- data/lib/disco/version.rb +1 -1
- metadata +10 -121
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 8fbecb858b316ed39a9cb726263e182561cba6df498e6253d88c79ebec5cab05
|
|
4
|
+
data.tar.gz: 42eb38a6e4e0b3fc5a9452deae5a48676ae9a53e78eeb6197718a0c94bd02b6b
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: d0250346d75fba75064a29578f6bfd39f09ecf712ba2e505b97a4952b5ff8b31af307eb1b912e9b25cc3dc28dee0d096bea44b47bb2ef268859bb4171f0ef8b2
|
|
7
|
+
data.tar.gz: 7b341328c12885efd0ffece4201036bb9457caee80a48a99ba110af9a81bcf832bbc1e8f8f5f14e7fddffef2dd3f4643837e0d569c997ab0c2d9ae85e12422f7
|
data/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,35 @@
|
|
|
1
|
+
## 0.2.5 (2021-02-20)
|
|
2
|
+
|
|
3
|
+
- Added `top_items` method
|
|
4
|
+
- Added `optimize_similar_users` method
|
|
5
|
+
- Added support for Faiss for `optimize_item_recs` and `optimize_similar_users` methods
|
|
6
|
+
- Added `rmse` method
|
|
7
|
+
- Improved performance
|
|
8
|
+
|
|
9
|
+
## 0.2.4 (2021-02-15)
|
|
10
|
+
|
|
11
|
+
- Added `user_ids` and `item_ids` methods
|
|
12
|
+
- Added `user_id` argument to `user_factors`
|
|
13
|
+
- Added `item_id` argument to `item_factors`
|
|
14
|
+
|
|
15
|
+
## 0.2.3 (2020-11-28)
|
|
16
|
+
|
|
17
|
+
- Added `predict` method
|
|
18
|
+
- Fixed bad recommendations and scores with `user_recs` and explicit feedback
|
|
19
|
+
- Fixed `item_ids` option for `user_recs`
|
|
20
|
+
|
|
21
|
+
## 0.2.2 (n/a)
|
|
22
|
+
|
|
23
|
+
- Not available (released by previous gem owner)
|
|
24
|
+
|
|
25
|
+
## 0.2.1 (2020-10-28)
|
|
26
|
+
|
|
27
|
+
- Fixed issue with `user_recs` returning rated items
|
|
28
|
+
|
|
29
|
+
## 0.2.0 (2020-07-31)
|
|
30
|
+
|
|
31
|
+
- Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
|
|
32
|
+
|
|
1
33
|
## 0.1.3 (2020-06-28)
|
|
2
34
|
|
|
3
35
|
- Added support for Rover
|
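
The 0.2.0 entry above says that `item_recs` and `similar_users` scores are cosine similarities, which is why they always fall between -1 and 1. A minimal plain-Ruby sketch of that measure follows; `cosine_similarity` is a hypothetical helper written for illustration and is not part of Disco's API.

```ruby
# Hypothetical helper (not part of Disco): cosine similarity of two
# equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # Avoid dividing by zero for all-zero vectors by using a tiny floor.
  norm_a = 1e-10 if norm_a.zero?
  norm_b = 1e-10 if norm_b.zero?
  dot / (norm_a * norm_b)
end

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
```

Vectors pointing the same way score close to 1, opposite vectors close to -1, and orthogonal vectors close to 0, regardless of their lengths.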
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
|
@@ -1,12 +1,12 @@
|
|
|
1
1
|
# Disco
|
|
2
2
|
|
|
3
|
-
:fire:
|
|
3
|
+
:fire: Recommendations for Ruby and Rails using collaborative filtering
|
|
4
4
|
|
|
5
5
|
- Supports user-based and item-based recommendations
|
|
6
6
|
- Works with explicit and implicit feedback
|
|
7
7
|
- Uses high-performance matrix factorization
|
|
8
8
|
|
|
9
|
-
[](https://github.com/ankane/disco/actions)
|
|
10
10
|
|
|
11
11
|
## Installation
|
|
12
12
|
|
|
@@ -46,13 +46,13 @@ recommender.fit([
|
|
|
46
46
|
|
|
47
47
|
> Use `value` instead of rating for implicit feedback
|
|
48
48
|
|
|
49
|
-
Get user-based
|
|
49
|
+
Get user-based recommendations - “users like you also liked”
|
|
50
50
|
|
|
51
51
|
```ruby
|
|
52
52
|
recommender.user_recs(user_id)
|
|
53
53
|
```
|
|
54
54
|
|
|
55
|
-
Get item-based
|
|
55
|
+
Get item-based recommendations - “users who liked this item also liked”
|
|
56
56
|
|
|
57
57
|
```ruby
|
|
58
58
|
recommender.item_recs(item_id)
|
|
@@ -64,10 +64,10 @@ Use the `count` option to specify the number of recommendations (default is 5)
|
|
|
64
64
|
recommender.user_recs(user_id, count: 3)
|
|
65
65
|
```
|
|
66
66
|
|
|
67
|
-
Get predicted ratings for specific items
|
|
67
|
+
Get predicted ratings for specific users and items
|
|
68
68
|
|
|
69
69
|
```ruby
|
|
70
|
-
recommender.
|
|
70
|
+
recommender.predict([{user_id: 1, item_id: 2}, {user_id: 2, item_id: 4}])
|
|
71
71
|
```
|
|
72
72
|
|
|
73
73
|
Get similar users
|
|
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
|
|
|
101
101
|
```ruby
|
|
102
102
|
views = Ahoy::Event.
|
|
103
103
|
where(name: "Viewed post").
|
|
104
|
-
group(:user_id
|
|
104
|
+
group(:user_id).
|
|
105
|
+
group("properties->>'post_id'"). # postgres syntax
|
|
105
106
|
count
|
|
106
107
|
|
|
107
108
|
data =
|
|
108
109
|
views.map do |(user_id, post_id), count|
|
|
109
110
|
{
|
|
110
111
|
user_id: user_id,
|
|
111
|
-
|
|
112
|
+
item_id: post_id,
|
|
112
113
|
value: count
|
|
113
114
|
}
|
|
114
115
|
end
|
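
The README hunk above groups by two columns, so `count` yields a hash keyed by `[user_id, post_id]` pairs, and the block destructures each key. The sketch below replays that mapping on a plain Hash standing in for the query result; the sample values are made up, and post ids are strings because the Postgres `->>` operator returns text.

```ruby
# Stand-in for the result of the grouped `count` query: a Hash keyed by
# [user_id, post_id] pairs. These sample values are invented.
views = {
  [1, "10"] => 3,
  [1, "11"] => 1,
  [2, "10"] => 2
}

# Destructure each composite key into the fields the recommender expects.
data =
  views.map do |(user_id, post_id), count|
    { user_id: user_id, item_id: post_id, value: count }
  end
```

Each entry becomes one implicit-feedback row, with the view count carried in `value`.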
|
@@ -200,6 +201,8 @@ bin = File.binread("recommender.bin")
|
|
|
200
201
|
recommender = Marshal.load(bin)
|
|
201
202
|
```
|
|
202
203
|
|
|
204
|
+
Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor)
|
|
205
|
+
|
|
203
206
|
## Algorithms
|
|
204
207
|
|
|
205
208
|
Disco uses high-performance matrix factorization.
|
|
@@ -236,6 +239,16 @@ There are a number of ways to deal with this, but here are some common ones:
|
|
|
236
239
|
- For user-based recommendations, show new users the most popular items.
|
|
237
240
|
- For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
|
|
238
241
|
|
|
242
|
+
Get top items with:
|
|
243
|
+
|
|
244
|
+
```ruby
|
|
245
|
+
recommender = Disco::Recommender.new(top_items: true)
|
|
246
|
+
recommender.fit(data)
|
|
247
|
+
recommender.top_items
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
|
|
251
|
+
|
|
239
252
|
## Data
|
|
240
253
|
|
|
241
254
|
Data can be an array of hashes
|
|
@@ -256,23 +269,29 @@ Or a Daru data frame
|
|
|
256
269
|
Daru::DataFrame.from_csv("ratings.csv")
|
|
257
270
|
```
|
|
258
271
|
|
|
259
|
-
##
|
|
272
|
+
## Performance [master]
|
|
260
273
|
|
|
261
|
-
If you have a large number of users
|
|
274
|
+
If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
|
|
262
275
|
|
|
263
276
|
Add this line to your application’s Gemfile:
|
|
264
277
|
|
|
265
278
|
```ruby
|
|
266
|
-
gem '
|
|
279
|
+
gem 'faiss'
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
Speed up the `user_recs` method with:
|
|
283
|
+
|
|
284
|
+
```ruby
|
|
285
|
+
model.optimize_user_recs
|
|
267
286
|
```
|
|
268
287
|
|
|
269
|
-
Speed up
|
|
288
|
+
Speed up the `item_recs` method with:
|
|
270
289
|
|
|
271
290
|
```ruby
|
|
272
291
|
model.optimize_item_recs
|
|
273
292
|
```
|
|
274
293
|
|
|
275
|
-
Speed up
|
|
294
|
+
Speed up the `similar_users` method with:
|
|
276
295
|
|
|
277
296
|
```ruby
|
|
278
297
|
model.optimize_similar_users
|
|
@@ -282,19 +301,33 @@ This should be called after fitting or loading the model.
|
|
|
282
301
|
|
|
283
302
|
## Reference
|
|
284
303
|
|
|
304
|
+
Get ids
|
|
305
|
+
|
|
306
|
+
```ruby
|
|
307
|
+
recommender.user_ids
|
|
308
|
+
recommender.item_ids
|
|
309
|
+
```
|
|
310
|
+
|
|
285
311
|
Get the global mean
|
|
286
312
|
|
|
287
313
|
```ruby
|
|
288
314
|
recommender.global_mean
|
|
289
315
|
```
|
|
290
316
|
|
|
291
|
-
Get
|
|
317
|
+
Get factors
|
|
292
318
|
|
|
293
319
|
```ruby
|
|
294
320
|
recommender.user_factors
|
|
295
321
|
recommender.item_factors
|
|
296
322
|
```
|
|
297
323
|
|
|
324
|
+
Get factors for specific users and items
|
|
325
|
+
|
|
326
|
+
```ruby
|
|
327
|
+
recommender.user_factors(user_id)
|
|
328
|
+
recommender.item_factors(item_id)
|
|
329
|
+
```
|
|
330
|
+
|
|
298
331
|
## Credits
|
|
299
332
|
|
|
300
333
|
Thanks to:
|
|
@@ -315,3 +348,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
|
|
|
315
348
|
- Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
|
|
316
349
|
- Write, clarify, or fix documentation
|
|
317
350
|
- Suggest or add new features
|
|
351
|
+
|
|
352
|
+
To get started with development:
|
|
353
|
+
|
|
354
|
+
```sh
|
|
355
|
+
git clone https://github.com/ankane/disco.git
|
|
356
|
+
cd disco
|
|
357
|
+
bundle install
|
|
358
|
+
bundle exec rake test
|
|
359
|
+
```
|
data/lib/disco.rb
CHANGED
data/lib/disco/data.rb
CHANGED
|
@@ -36,8 +36,7 @@ module Disco
|
|
|
36
36
|
|
|
37
37
|
return dest if File.exist?(dest)
|
|
38
38
|
|
|
39
|
-
|
|
40
|
-
temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
|
|
39
|
+
temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name
|
|
41
40
|
|
|
42
41
|
digest = Digest::SHA2.new
|
|
43
42
|
|
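
The changed line above builds the temporary path from `Dir.tmpdir` plus a timestamp and carries a `TODO better name` note. One possible direction, shown only as a sketch, is a random suffix; `unique_temp_path` is a hypothetical helper and does not appear in the gem. Note that `Dir.tmpdir` is only available after `require "tmpdir"`.

```ruby
require "tmpdir"
require "securerandom"

# Hypothetical helper (not in the gem): build a temp file path with a
# random suffix so that two downloads started at the same instant are
# very unlikely to pick the same name.
def unique_temp_path(prefix = "disco")
  File.join(Dir.tmpdir, "#{prefix}-#{SecureRandom.hex(8)}")
end

path_a = unique_temp_path
path_b = unique_temp_path
```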
data/lib/disco/recommender.rb
CHANGED
|
@@ -1,32 +1,33 @@
|
|
|
1
1
|
module Disco
|
|
2
2
|
class Recommender
|
|
3
|
-
attr_reader :global_mean
|
|
3
|
+
attr_reader :global_mean
|
|
4
4
|
|
|
5
|
-
def initialize(factors: 8, epochs: 20, verbose: nil)
|
|
5
|
+
def initialize(factors: 8, epochs: 20, verbose: nil, top_items: false)
|
|
6
6
|
@factors = factors
|
|
7
7
|
@epochs = epochs
|
|
8
8
|
@verbose = verbose
|
|
9
|
+
@user_map = {}
|
|
10
|
+
@item_map = {}
|
|
11
|
+
@top_items = top_items
|
|
9
12
|
end
|
|
10
13
|
|
|
11
14
|
def fit(train_set, validation_set: nil)
|
|
12
15
|
train_set = to_dataset(train_set)
|
|
13
16
|
validation_set = to_dataset(validation_set) if validation_set
|
|
14
17
|
|
|
15
|
-
|
|
18
|
+
check_training_set(train_set)
|
|
16
19
|
|
|
20
|
+
@implicit = !train_set.any? { |v| v[:rating] }
|
|
17
21
|
unless @implicit
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
@min_rating = ratings.min
|
|
21
|
-
@max_rating = ratings.max
|
|
22
|
+
check_ratings(train_set)
|
|
23
|
+
@min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
|
|
22
24
|
|
|
23
25
|
if validation_set
|
|
24
|
-
check_ratings(validation_set
|
|
26
|
+
check_ratings(validation_set)
|
|
25
27
|
end
|
|
26
28
|
end
|
|
27
29
|
|
|
28
|
-
|
|
29
|
-
create_maps(train_set)
|
|
30
|
+
update_maps(train_set)
|
|
30
31
|
|
|
31
32
|
@rated = Hash.new { |hash, key| hash[key] = {} }
|
|
32
33
|
input = []
|
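
In the hunk above, `fit` now decides between implicit and explicit feedback by looking for a `:rating` key, and derives the rating bounds in one pass with `minmax_by`. The following plain-Ruby sketch replays those two expressions on made-up rows; the local variable names are illustrative.

```ruby
# Made-up explicit-feedback rows in the shape `fit` receives.
train_set = [
  { user_id: 1, item_id: "a", rating: 4 },
  { user_id: 1, item_id: "b", rating: 2 },
  { user_id: 2, item_id: "a", rating: 5 }
]

# Feedback counts as implicit only when no row carries a :rating.
implicit = !train_set.any? { |v| v[:rating] }

# minmax_by returns the rows with the smallest and largest rating;
# the trailing map extracts the rating values themselves.
min_rating, max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
```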
|
@@ -41,6 +42,16 @@ module Disco
|
|
|
41
42
|
end
|
|
42
43
|
@rated.default = nil
|
|
43
44
|
|
|
45
|
+
if @top_items
|
|
46
|
+
@item_count = [0] * @item_map.size
|
|
47
|
+
@item_sum = [0.0] * @item_map.size
|
|
48
|
+
train_set.each do |v|
|
|
49
|
+
i = @item_map[v[:item_id]]
|
|
50
|
+
@item_count[i] += 1
|
|
51
|
+
@item_sum[i] += (v[value_key] || 1)
|
|
52
|
+
end
|
|
53
|
+
end
|
|
54
|
+
|
|
44
55
|
eval_set = nil
|
|
45
56
|
if validation_set
|
|
46
57
|
eval_set = []
|
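
The hunk above accumulates a per-item count (and sum) while fitting, which `top_items` later ranks. A minimal sketch of the implicit-feedback case, where the score is simply how often an item appears, could look like this; the data and the pre-built `item_map` are assumptions for illustration.

```ruby
# Made-up implicit-feedback rows (no :rating key).
train_set = [
  { user_id: 1, item_id: "a" },
  { user_id: 2, item_id: "a" },
  { user_id: 2, item_id: "b" },
  { user_id: 3, item_id: "a" },
  { user_id: 3, item_id: "c" },
  { user_id: 4, item_id: "c" }
]

# Assumed item id => dense index mapping, in the role of @item_map.
item_map = { "a" => 0, "b" => 1, "c" => 2 }

# Count how often each item appears.
item_count = [0] * item_map.size
train_set.each { |v| item_count[item_map[v[:item_id]]] += 1 }

# Rank by descending count and translate indexes back to item ids.
item_ids = item_map.keys
top = item_count.each_with_index.sort_by { |c, _| -c }.first(2).map do |c, i|
  { item_id: item_ids[i], score: c }
end
```

For explicit feedback the diff scores items with a Wilson lower bound from the `wilson_score` gem instead; that part is left out here to keep the sketch dependency-free.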
|
@@ -67,67 +78,188 @@ module Disco
|
|
|
67
78
|
@user_factors = model.p_factors(format: :numo)
|
|
68
79
|
@item_factors = model.q_factors(format: :numo)
|
|
69
80
|
|
|
70
|
-
@
|
|
71
|
-
@
|
|
81
|
+
@user_recs_index = nil
|
|
82
|
+
@similar_users_index = nil
|
|
83
|
+
@similar_items_index = nil
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
# generates a prediction even if a user has already rated the item
|
|
87
|
+
def predict(data)
|
|
88
|
+
data = to_dataset(data)
|
|
89
|
+
|
|
90
|
+
u = data.map { |v| @user_map[v[:user_id]] }
|
|
91
|
+
i = data.map { |v| @item_map[v[:item_id]] }
|
|
92
|
+
|
|
93
|
+
new_index = data.each_index.select { |index| u[index].nil? || i[index].nil? }
|
|
94
|
+
new_index.each do |j|
|
|
95
|
+
u[j] = 0
|
|
96
|
+
i[j] = 0
|
|
97
|
+
end
|
|
98
|
+
|
|
99
|
+
predictions = @user_factors[u, true].inner(@item_factors[i, true])
|
|
100
|
+
predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
|
|
101
|
+
predictions[new_index] = @global_mean
|
|
102
|
+
predictions.to_a
|
|
72
103
|
end
|
|
73
104
|
|
|
74
105
|
def user_recs(user_id, count: 5, item_ids: nil)
|
|
106
|
+
check_fit
|
|
75
107
|
u = @user_map[user_id]
|
|
76
108
|
|
|
77
109
|
if u
|
|
78
|
-
|
|
79
|
-
predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
|
|
80
|
-
|
|
81
|
-
predictions =
|
|
82
|
-
@item_map.keys.zip(predictions).map do |item_id, pred|
|
|
83
|
-
{item_id: item_id, score: pred}
|
|
84
|
-
end
|
|
110
|
+
rated = item_ids ? {} : @rated[u]
|
|
85
111
|
|
|
86
112
|
if item_ids
|
|
87
|
-
|
|
88
|
-
|
|
113
|
+
ids = Numo::NArray.cast(item_ids.map { |i| @item_map[i] }.compact)
|
|
114
|
+
return [] if ids.size == 0
|
|
115
|
+
|
|
116
|
+
predictions = @item_factors[ids, true].inner(@user_factors[u, true])
|
|
117
|
+
indexes = predictions.sort_index.reverse
|
|
118
|
+
indexes = indexes[0...[count + rated.size, indexes.size].min] if count
|
|
119
|
+
predictions = predictions[indexes]
|
|
120
|
+
ids = ids[indexes]
|
|
121
|
+
elsif @user_recs_index && count
|
|
122
|
+
predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
|
|
89
123
|
else
|
|
90
|
-
@
|
|
91
|
-
|
|
92
|
-
|
|
124
|
+
predictions = @item_factors.inner(@user_factors[u, true])
|
|
125
|
+
# TODO make sure reverse isn't hurting performance
|
|
126
|
+
indexes = predictions.sort_index.reverse
|
|
127
|
+
indexes = indexes[0...[count + rated.size, indexes.size].min] if count
|
|
128
|
+
predictions = predictions[indexes]
|
|
129
|
+
ids = indexes
|
|
93
130
|
end
|
|
94
131
|
|
|
95
|
-
predictions.
|
|
96
|
-
|
|
97
|
-
|
|
132
|
+
predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
|
|
133
|
+
|
|
134
|
+
keys = @item_map.keys
|
|
135
|
+
result = []
|
|
136
|
+
ids.each_with_index do |item_id, i|
|
|
137
|
+
next if rated[item_id]
|
|
138
|
+
|
|
139
|
+
result << {item_id: keys[item_id], score: predictions[i]}
|
|
140
|
+
break if result.size == count
|
|
141
|
+
end
|
|
142
|
+
result
|
|
143
|
+
elsif @top_items
|
|
144
|
+
top_items(count: count)
|
|
98
145
|
else
|
|
99
|
-
# no items if user is unknown
|
|
100
|
-
# TODO maybe most popular items
|
|
101
146
|
[]
|
|
102
147
|
end
|
|
103
148
|
end
|
|
104
149
|
|
|
105
|
-
def
|
|
106
|
-
|
|
150
|
+
def similar_items(item_id, count: 5)
|
|
151
|
+
check_fit
|
|
152
|
+
similar(item_id, @item_map, item_norms, count, @similar_items_index)
|
|
107
153
|
end
|
|
108
|
-
alias_method :
|
|
154
|
+
alias_method :item_recs, :similar_items
|
|
109
155
|
|
|
110
|
-
def
|
|
111
|
-
|
|
156
|
+
def similar_users(user_id, count: 5)
|
|
157
|
+
check_fit
|
|
158
|
+
similar(user_id, @user_map, user_norms, count, @similar_users_index)
|
|
112
159
|
end
|
|
113
160
|
|
|
114
|
-
def
|
|
115
|
-
|
|
161
|
+
def top_items(count: 5)
|
|
162
|
+
check_fit
|
|
163
|
+
raise "top_items not computed" unless @top_items
|
|
164
|
+
|
|
165
|
+
if @implicit
|
|
166
|
+
scores = @item_count
|
|
167
|
+
else
|
|
168
|
+
require "wilson_score"
|
|
169
|
+
|
|
170
|
+
range = @min_rating..@max_rating
|
|
171
|
+
scores = @item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) }
|
|
172
|
+
end
|
|
173
|
+
|
|
174
|
+
scores = scores.map.with_index.sort_by { |s, _| -s }
|
|
175
|
+
scores = scores.first(count) if count
|
|
176
|
+
item_ids = item_ids()
|
|
177
|
+
scores.map do |s, i|
|
|
178
|
+
{item_id: item_ids[i], score: s}
|
|
179
|
+
end
|
|
116
180
|
end
|
|
117
|
-
alias_method :item_recs, :similar_items
|
|
118
181
|
|
|
119
|
-
def
|
|
120
|
-
|
|
182
|
+
def user_ids
|
|
183
|
+
@user_map.keys
|
|
184
|
+
end
|
|
185
|
+
|
|
186
|
+
def item_ids
|
|
187
|
+
@item_map.keys
|
|
188
|
+
end
|
|
189
|
+
|
|
190
|
+
def user_factors(user_id = nil)
|
|
191
|
+
if user_id
|
|
192
|
+
u = @user_map[user_id]
|
|
193
|
+
@user_factors[u, true] if u
|
|
194
|
+
else
|
|
195
|
+
@user_factors
|
|
196
|
+
end
|
|
197
|
+
end
|
|
198
|
+
|
|
199
|
+
def item_factors(item_id = nil)
|
|
200
|
+
if item_id
|
|
201
|
+
i = @item_map[item_id]
|
|
202
|
+
@item_factors[i, true] if i
|
|
203
|
+
else
|
|
204
|
+
@item_factors
|
|
205
|
+
end
|
|
206
|
+
end
|
|
207
|
+
|
|
208
|
+
def optimize_user_recs
|
|
209
|
+
check_fit
|
|
210
|
+
@user_recs_index = create_index(item_factors, library: "faiss")
|
|
211
|
+
end
|
|
212
|
+
|
|
213
|
+
def optimize_similar_items(library: nil)
|
|
214
|
+
check_fit
|
|
215
|
+
@similar_items_index = create_index(item_norms, library: library)
|
|
216
|
+
end
|
|
217
|
+
alias_method :optimize_item_recs, :optimize_similar_items
|
|
218
|
+
|
|
219
|
+
def optimize_similar_users(library: nil)
|
|
220
|
+
check_fit
|
|
221
|
+
@similar_users_index = create_index(user_norms, library: library)
|
|
121
222
|
end
|
|
122
223
|
|
|
123
224
|
private
|
|
124
225
|
|
|
125
|
-
|
|
126
|
-
|
|
226
|
+
# factors should already be normalized for similar users/items
|
|
227
|
+
def create_index(factors, library:)
|
|
228
|
+
# TODO make Faiss the default in 0.3.0
|
|
229
|
+
library ||= defined?(Faiss) && !defined?(Ngt) ? "faiss" : "ngt"
|
|
230
|
+
|
|
231
|
+
case library
|
|
232
|
+
when "faiss"
|
|
233
|
+
require "faiss"
|
|
234
|
+
|
|
235
|
+
# inner product is cosine similarity with normalized vectors
|
|
236
|
+
# https://github.com/facebookresearch/faiss/issues/95
|
|
237
|
+
#
|
|
238
|
+
# TODO use non-exact index
|
|
239
|
+
# https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
|
|
240
|
+
index = Faiss::IndexFlatIP.new(factors.shape[1])
|
|
241
|
+
|
|
242
|
+
# ids are from 0...total
|
|
243
|
+
# https://github.com/facebookresearch/faiss/blob/96b740abedffc8f67389f29c2a180913941534c6/faiss/Index.h#L89
|
|
244
|
+
index.add(factors)
|
|
245
|
+
|
|
246
|
+
index
|
|
247
|
+
when "ngt"
|
|
248
|
+
require "ngt"
|
|
127
249
|
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
250
|
+
# could speed up search with normalized cosine
|
|
251
|
+
# https://github.com/yahoojapan/NGT/issues/36
|
|
252
|
+
index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
|
|
253
|
+
|
|
254
|
+
# NGT normalizes so could call create_index with factors instead of norms
|
|
255
|
+
# but keep code simple for now
|
|
256
|
+
ids = index.batch_insert(factors)
|
|
257
|
+
raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
|
|
258
|
+
|
|
259
|
+
index
|
|
260
|
+
else
|
|
261
|
+
raise ArgumentError, "Invalid library: #{library}"
|
|
262
|
+
end
|
|
131
263
|
end
|
|
132
264
|
|
|
133
265
|
def user_norms
|
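
The new `predict` method in the hunk above takes the inner product of a user's and an item's factor rows, clips it to the observed rating range, and falls back to the global mean when either id is unknown. The sketch below mirrors that logic with plain arrays instead of Numo; all values and the `predict` lambda are invented for illustration.

```ruby
# Made-up factor matrices: one row per user / item, two latent factors each.
user_factors = [[1.0, 0.5], [0.2, 1.5]]
item_factors = [[2.0, 1.0], [0.5, 4.0]]
user_map = { "alice" => 0, "bob" => 1 }
item_map = { "x" => 0, "y" => 1 }
global_mean = 3.0
min_rating = 1.0
max_rating = 5.0

# Hypothetical plain-Ruby stand-in for the Numo-based method in the diff.
predict = lambda do |user_id, item_id|
  u = user_map[user_id]
  i = item_map[item_id]
  # Unknown user or item: fall back to the global mean.
  next global_mean if u.nil? || i.nil?

  # Inner product of the two factor rows, clipped to the rating range.
  score = user_factors[u].zip(item_factors[i]).sum { |a, b| a * b }
  score.clamp(min_rating, max_rating)
end

known = predict.call("alice", "x")    # 1.0 * 2.0 + 0.5 * 1.0
clipped = predict.call("bob", "y")    # raw score 6.1 exceeds max_rating
unknown = predict.call("carol", "x")  # "carol" was never seen
```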
|
@@ -139,63 +271,61 @@ module Disco
|
|
|
139
271
|
end
|
|
140
272
|
|
|
141
273
|
def norms(factors)
|
|
142
|
-
norms = Numo::
|
|
274
|
+
norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
|
|
143
275
|
norms[norms.eq(0)] = 1e-10 # no zeros
|
|
144
|
-
norms
|
|
276
|
+
factors / norms.expand_dims(1)
|
|
145
277
|
end
|
|
146
278
|
|
|
147
|
-
def similar(id, map,
|
|
279
|
+
def similar(id, map, norm_factors, count, index)
|
|
148
280
|
i = map[id]
|
|
149
|
-
|
|
281
|
+
|
|
282
|
+
if i && norm_factors.shape[0] > 1
|
|
150
283
|
if index && count
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
}
|
|
284
|
+
if defined?(Faiss) && index.is_a?(Faiss::Index)
|
|
285
|
+
predictions, ids = index.search(norm_factors[i, true].expand_dims(0), count + 1).map { |v| v.to_a[0] }
|
|
286
|
+
else
|
|
287
|
+
result = index.search(norm_factors[i, true], size: count + 1)
|
|
288
|
+
# ids from batch_insert start at 1 instead of 0
|
|
289
|
+
ids = result.map { |v| v[:id] - 1 }
|
|
290
|
+
# convert cosine distance to cosine similarity
|
|
291
|
+
predictions = result.map { |v| 1 - v[:distance] }
|
|
160
292
|
end
|
|
161
293
|
else
|
|
162
|
-
predictions =
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
predictions
|
|
294
|
+
predictions = norm_factors.inner(norm_factors[i, true])
|
|
295
|
+
indexes = predictions.sort_index.reverse
|
|
296
|
+
indexes = indexes[0...[count + 1, indexes.size].min] if count
|
|
297
|
+
predictions = predictions[indexes]
|
|
298
|
+
ids = indexes
|
|
299
|
+
end
|
|
300
|
+
|
|
301
|
+
keys = map.keys
|
|
302
|
+
|
|
303
|
+
# TODO use user_id for similar_users in 0.3.0
|
|
304
|
+
key = :item_id
|
|
305
|
+
|
|
306
|
+
(1...ids.size).map do |i|
|
|
307
|
+
{key => keys[ids[i]], score: predictions[i]}
|
|
177
308
|
end
|
|
178
309
|
else
|
|
179
310
|
[]
|
|
180
311
|
end
|
|
181
312
|
end
|
|
182
313
|
|
|
183
|
-
def
|
|
184
|
-
|
|
185
|
-
|
|
314
|
+
def update_maps(train_set)
|
|
315
|
+
raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
|
|
316
|
+
raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }
|
|
186
317
|
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
@item_map = item_ids.zip(item_ids.size.times).to_h
|
|
318
|
+
train_set.each do |v|
|
|
319
|
+
@user_map[v[:user_id]] ||= @user_map.size
|
|
320
|
+
@item_map[v[:item_id]] ||= @item_map.size
|
|
321
|
+
end
|
|
192
322
|
end
|
|
193
323
|
|
|
194
324
|
def check_ratings(ratings)
|
|
195
|
-
unless ratings.all? { |r| !r.nil? }
|
|
325
|
+
unless ratings.all? { |r| !r[:rating].nil? }
|
|
196
326
|
raise ArgumentError, "Missing ratings"
|
|
197
327
|
end
|
|
198
|
-
unless ratings.all? { |r| r.is_a?(Numeric) }
|
|
328
|
+
unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
|
|
199
329
|
raise ArgumentError, "Ratings must be numeric"
|
|
200
330
|
end
|
|
201
331
|
end
|
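
`update_maps` in the hunk above replaces the old map construction with incremental assignment: each unseen id receives the next free index via `||=`. A stand-alone sketch of that bookkeeping follows; the lambda and local hashes are hypothetical stand-ins for the method and its instance variables.

```ruby
# Stand-ins for @user_map and @item_map.
user_map = {}
item_map = {}

update_maps = lambda do |rows|
  raise ArgumentError, "Missing user_id" if rows.any? { |v| v[:user_id].nil? }
  raise ArgumentError, "Missing item_id" if rows.any? { |v| v[:item_id].nil? }

  rows.each do |v|
    # ||= only assigns when the id is new, so known ids keep their index.
    user_map[v[:user_id]] ||= user_map.size
    item_map[v[:item_id]] ||= item_map.size
  end
end

update_maps.call([
  { user_id: "u1", item_id: "a" },
  { user_id: "u2", item_id: "a" },
  { user_id: "u1", item_id: "b" }
])
```

Because Ruby treats `0` as truthy, the first id keeps index 0 on later passes instead of being reassigned.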
|
@@ -204,6 +334,10 @@ module Disco
|
|
|
204
334
|
raise ArgumentError, "No training data" if train_set.empty?
|
|
205
335
|
end
|
|
206
336
|
|
|
337
|
+
def check_fit
|
|
338
|
+
raise "Not fit" unless defined?(@implicit)
|
|
339
|
+
end
|
|
340
|
+
|
|
207
341
|
def to_dataset(dataset)
|
|
208
342
|
if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
|
|
209
343
|
# convert keys to symbols
|
|
@@ -239,6 +373,11 @@ module Disco
|
|
|
239
373
|
obj[:max_rating] = @max_rating
|
|
240
374
|
end
|
|
241
375
|
|
|
376
|
+
if @top_items
|
|
377
|
+
obj[:item_count] = @item_count
|
|
378
|
+
obj[:item_sum] = @item_sum
|
|
379
|
+
end
|
|
380
|
+
|
|
242
381
|
obj
|
|
243
382
|
end
|
|
244
383
|
|
|
@@ -255,6 +394,12 @@ module Disco
|
|
|
255
394
|
@min_rating = obj[:min_rating]
|
|
256
395
|
@max_rating = obj[:max_rating]
|
|
257
396
|
end
|
|
397
|
+
|
|
398
|
+
@top_items = obj.key?(:item_count)
|
|
399
|
+
if @top_items
|
|
400
|
+
@item_count = obj[:item_count]
|
|
401
|
+
@item_sum = obj[:item_sum]
|
|
402
|
+
end
|
|
258
403
|
end
|
|
259
404
|
end
|
|
260
405
|
end
|
data/lib/disco/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: disco
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.2.5
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Andrew Kane
|
|
8
|
-
autorequire:
|
|
8
|
+
autorequire:
|
|
9
9
|
bindir: bin
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date:
|
|
11
|
+
date: 2021-02-20 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: libmf
|
|
@@ -38,120 +38,8 @@ dependencies:
|
|
|
38
38
|
- - ">="
|
|
39
39
|
- !ruby/object:Gem::Version
|
|
40
40
|
version: '0'
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
requirement: !ruby/object:Gem::Requirement
|
|
44
|
-
requirements:
|
|
45
|
-
- - ">="
|
|
46
|
-
- !ruby/object:Gem::Version
|
|
47
|
-
version: '0'
|
|
48
|
-
type: :development
|
|
49
|
-
prerelease: false
|
|
50
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
51
|
-
requirements:
|
|
52
|
-
- - ">="
|
|
53
|
-
- !ruby/object:Gem::Version
|
|
54
|
-
version: '0'
|
|
55
|
-
- !ruby/object:Gem::Dependency
|
|
56
|
-
name: rake
|
|
57
|
-
requirement: !ruby/object:Gem::Requirement
|
|
58
|
-
requirements:
|
|
59
|
-
- - ">="
|
|
60
|
-
- !ruby/object:Gem::Version
|
|
61
|
-
version: '0'
|
|
62
|
-
type: :development
|
|
63
|
-
prerelease: false
|
|
64
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
65
|
-
requirements:
|
|
66
|
-
- - ">="
|
|
67
|
-
- !ruby/object:Gem::Version
|
|
68
|
-
version: '0'
|
|
69
|
-
- !ruby/object:Gem::Dependency
|
|
70
|
-
name: minitest
|
|
71
|
-
requirement: !ruby/object:Gem::Requirement
|
|
72
|
-
requirements:
|
|
73
|
-
- - ">="
|
|
74
|
-
- !ruby/object:Gem::Version
|
|
75
|
-
version: '5'
|
|
76
|
-
type: :development
|
|
77
|
-
prerelease: false
|
|
78
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
79
|
-
requirements:
|
|
80
|
-
- - ">="
|
|
81
|
-
- !ruby/object:Gem::Version
|
|
82
|
-
version: '5'
|
|
83
|
-
- !ruby/object:Gem::Dependency
|
|
84
|
-
name: activerecord
|
|
85
|
-
requirement: !ruby/object:Gem::Requirement
|
|
86
|
-
requirements:
|
|
87
|
-
- - ">="
|
|
88
|
-
- !ruby/object:Gem::Version
|
|
89
|
-
version: '0'
|
|
90
|
-
type: :development
|
|
91
|
-
prerelease: false
|
|
92
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
93
|
-
requirements:
|
|
94
|
-
- - ">="
|
|
95
|
-
- !ruby/object:Gem::Version
|
|
96
|
-
version: '0'
|
|
97
|
-
- !ruby/object:Gem::Dependency
|
|
98
|
-
name: sqlite3
|
|
99
|
-
requirement: !ruby/object:Gem::Requirement
|
|
100
|
-
requirements:
|
|
101
|
-
- - ">="
|
|
102
|
-
- !ruby/object:Gem::Version
|
|
103
|
-
version: '0'
|
|
104
|
-
type: :development
|
|
105
|
-
prerelease: false
|
|
106
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
107
|
-
requirements:
|
|
108
|
-
- - ">="
|
|
109
|
-
- !ruby/object:Gem::Version
|
|
110
|
-
version: '0'
|
|
111
|
-
- !ruby/object:Gem::Dependency
|
|
112
|
-
name: daru
|
|
113
|
-
requirement: !ruby/object:Gem::Requirement
|
|
114
|
-
requirements:
|
|
115
|
-
- - ">="
|
|
116
|
-
- !ruby/object:Gem::Version
|
|
117
|
-
version: '0'
|
|
118
|
-
type: :development
|
|
119
|
-
prerelease: false
|
|
120
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
121
|
-
requirements:
|
|
122
|
-
- - ">="
|
|
123
|
-
- !ruby/object:Gem::Version
|
|
124
|
-
version: '0'
|
|
125
|
-
- !ruby/object:Gem::Dependency
|
|
126
|
-
name: rover-df
|
|
127
|
-
requirement: !ruby/object:Gem::Requirement
|
|
128
|
-
requirements:
|
|
129
|
-
- - ">="
|
|
130
|
-
- !ruby/object:Gem::Version
|
|
131
|
-
version: '0'
|
|
132
|
-
type: :development
|
|
133
|
-
prerelease: false
|
|
134
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
135
|
-
requirements:
|
|
136
|
-
- - ">="
|
|
137
|
-
- !ruby/object:Gem::Version
|
|
138
|
-
version: '0'
|
|
139
|
-
- !ruby/object:Gem::Dependency
|
|
140
|
-
name: ngt
|
|
141
|
-
requirement: !ruby/object:Gem::Requirement
|
|
142
|
-
requirements:
|
|
143
|
-
- - ">="
|
|
144
|
-
- !ruby/object:Gem::Version
|
|
145
|
-
version: 0.2.3
|
|
146
|
-
type: :development
|
|
147
|
-
prerelease: false
|
|
148
|
-
version_requirements: !ruby/object:Gem::Requirement
|
|
149
|
-
requirements:
|
|
150
|
-
- - ">="
|
|
151
|
-
- !ruby/object:Gem::Version
|
|
152
|
-
version: 0.2.3
|
|
153
|
-
description:
|
|
154
|
-
email: andrew@chartkick.com
|
|
41
|
+
description:
|
|
42
|
+
email: andrew@ankane.org
|
|
155
43
|
executables: []
|
|
156
44
|
extensions: []
|
|
157
45
|
extra_rdoc_files: []
|
|
@@ -163,6 +51,7 @@ files:
|
|
|
163
51
|
- lib/disco.rb
|
|
164
52
|
- lib/disco/data.rb
|
|
165
53
|
- lib/disco/engine.rb
|
|
54
|
+
- lib/disco/metrics.rb
|
|
166
55
|
- lib/disco/model.rb
|
|
167
56
|
- lib/disco/recommender.rb
|
|
168
57
|
- lib/disco/version.rb
|
|
@@ -172,7 +61,7 @@ homepage: https://github.com/ankane/disco
|
|
|
172
61
|
licenses:
|
|
173
62
|
- MIT
|
|
174
63
|
metadata: {}
|
|
175
|
-
post_install_message:
|
|
64
|
+
post_install_message:
|
|
176
65
|
rdoc_options: []
|
|
177
66
|
require_paths:
|
|
178
67
|
- lib
|
|
@@ -187,8 +76,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
187
76
|
- !ruby/object:Gem::Version
|
|
188
77
|
version: '0'
|
|
189
78
|
requirements: []
|
|
190
|
-
rubygems_version: 3.
|
|
191
|
-
signing_key:
|
|
79
|
+
rubygems_version: 3.2.3
|
|
80
|
+
signing_key:
|
|
192
81
|
specification_version: 4
|
|
193
|
-
summary:
|
|
82
|
+
summary: Recommendations for Ruby and Rails using collaborative filtering
|
|
194
83
|
test_files: []
|