disco 0.2.3 → 0.2.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e9b8792d465e2bd894ce9aaa5dabf79dd89e93337d838917c709ac7747b85772
-  data.tar.gz: 9d34a5124dc26f8a2ecb7e2ed3cbf524fe586c37c693d885e668974e24dfaf0a
+  metadata.gz: 5f400f07839587b574ddcfa4c88335bfe20fcd876164b943e8094a35c3c1cfef
+  data.tar.gz: e2426b283146837d14be154ff0e67eb2505fd6587958b39212bf2dfe3bfccd80
 SHA512:
-  metadata.gz: 658b48b75994a295382eb22908d4a5f1825b01bfc26f52428e993802c42f7ebb435a59e7f1262d17400d65eda886f1a4f38edff82cdeda96c2a0ce280602742f
-  data.tar.gz: c9acce77cae8a575c5814456247600367d5fea4eb85a52309e26cac643d107bc03c42c99ce094c2f9ec46a329fd2dfa97d2f58f0020b337feb88b56346630942
+  metadata.gz: 2be9f24184036ec5b093de55640aebb60887ac59c566f37698fcba7a18daa15cf586566708def0060f80fc0747a50447538cf42fdf36024ae19ddac0de8b415c
+  data.tar.gz: 4682a5524a8cad4a247ec53f99c78e317d56ee55433bb2ad7806af4f2a9854bc016fd23564003f009dc69d0fdcf81949dc88c64d3cbe824a8e76fc5cae8abc7d
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+## 0.2.7 (2021-08-06)
+
+- Added warning for `value`
+
+## 0.2.6 (2021-02-24)
+
+- Improved performance
+- Improved `inspect` method
+- Fixed issue with `similar_users` and `item_recs` returning the original user/item
+- Fixed error with `fit` after loading
+
+## 0.2.5 (2021-02-20)
+
+- Added `top_items` method
+- Added `optimize_similar_users` method
+- Added support for Faiss for `optimize_item_recs` and `optimize_similar_users` methods
+- Added `rmse` method
+- Improved performance
+
+## 0.2.4 (2021-02-15)
+
+- Added `user_ids` and `item_ids` methods
+- Added `user_id` argument to `user_factors`
+- Added `item_id` argument to `item_factors`
+
 ## 0.2.3 (2020-11-28)
 
 - Added `predict` method
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
-Copyright (c) 2019-2020 Andrew Kane
+Copyright (c) 2019-2021 Andrew Kane
 
 MIT License
 
data/README.md CHANGED
@@ -35,24 +35,22 @@ recommender.fit([
 
 > IDs can be integers, strings, or any other data type
 
-If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating, or use a value like number of purchases, number of page views, or time spent on page:
+If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.
 
 ```ruby
 recommender.fit([
-  {user_id: 1, item_id: 1, value: 1},
-  {user_id: 2, item_id: 1, value: 1}
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
 ])
 ```
 
-> Use `value` instead of rating for implicit feedback
-
-Get user-based (user-item) recommendations - “users like you also liked”
+Get user-based recommendations - “users like you also liked”
 
 ```ruby
 recommender.user_recs(user_id)
 ```
 
-Get item-based (item-item) recommendations - “users who liked this item also liked”
+Get item-based recommendations - “users who liked this item also liked”
 
 ```ruby
 recommender.item_recs(item_id)
@@ -106,11 +104,10 @@ views = Ahoy::Event.
   count
 
 data =
-  views.map do |(user_id, post_id), count|
+  views.map do |(user_id, post_id), _|
     {
       user_id: user_id,
-      item_id: post_id,
-      value: count
+      item_id: post_id
     }
   end
 ```
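The updated snippet above discards the per-pair event count when building implicit training data. A self-contained sketch of the same transformation, where the `views` hash is a hypothetical stand-in for the result of the Ahoy `group(...).count` query:

```ruby
# Hypothetical stand-in for the Ahoy query result:
# group(...).count returns a hash mapping [user_id, post_id] pairs to counts.
views = {
  [1, 10] => 3,
  [2, 10] => 1
}

# As in the updated README, the count is ignored:
# with implicit feedback, only the interaction itself matters.
data =
  views.map do |(user_id, post_id), _|
    {user_id: user_id, item_id: post_id}
  end

data # => [{:user_id=>1, :item_id=>10}, {:user_id=>2, :item_id=>10}]
```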
@@ -201,6 +198,8 @@ bin = File.binread("recommender.bin")
 recommender = Marshal.load(bin)
 ```
 
+Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
+
 ## Algorithms
 
 Disco uses high-performance matrix factorization.
@@ -237,6 +236,16 @@ There are a number of ways to deal with this, but here are some common ones:
 - For user-based recommendations, show new users the most popular items.
 - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
 
+Get top items with:
+
+```ruby
+recommender = Disco::Recommender.new(top_items: true)
+recommender.fit(data)
+recommender.top_items
+```
+
+This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
+
 ## Data
 
 Data can be an array of hashes
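The Wilson score referenced above ranks items by the lower bound of a confidence interval rather than the raw average, so items with little evidence sort lower. A plain-Ruby sketch of the idea for a 0..1 success proportion (the gem itself delegates the computation to the wilson_score library; this standalone function is an illustration, not the gem's API):

```ruby
# Wilson score lower bound for a Bernoulli proportion (z = 1.96 for ~95%).
def wilson_lower_bound(positive, total, z = 1.96)
  return 0.0 if total == 0
  phat = positive.to_f / total
  (phat + z**2 / (2 * total) -
    z * Math.sqrt((phat * (1 - phat) + z**2 / (4 * total)) / total)) /
    (1 + z**2 / total)
end

# An item with 9/10 positives ranks below one with 90/100:
# less evidence means a wider interval and a lower bound.
wilson_lower_bound(9, 10)   # ≈ 0.60
wilson_lower_bound(90, 100) # ≈ 0.83
```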
@@ -257,45 +266,65 @@ Or a Daru data frame
 Daru::DataFrame.from_csv("ratings.csv")
 ```
 
-## Faster Similarity
+## Performance
 
-If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
 
 Add this line to your application’s Gemfile:
 
 ```ruby
-gem 'ngt', '>= 0.3.0'
+gem 'faiss'
 ```
 
-Speed up item-based recommendations with:
+Speed up the `user_recs` method with:
 
 ```ruby
-model.optimize_item_recs
+recommender.optimize_user_recs
 ```
 
-Speed up similar users with:
+Speed up the `item_recs` method with:
 
 ```ruby
-model.optimize_similar_users
+recommender.optimize_item_recs
 ```
 
-This should be called after fitting or loading the model.
+Speed up the `similar_users` method with:
+
+```ruby
+recommender.optimize_similar_users
+```
+
+This should be called after fitting or loading the recommender.
 
 ## Reference
 
+Get ids
+
+```ruby
+recommender.user_ids
+recommender.item_ids
+```
+
 Get the global mean
 
 ```ruby
 recommender.global_mean
 ```
 
-Get the factors
+Get factors
 
 ```ruby
 recommender.user_factors
 recommender.item_factors
 ```
 
+Get factors for specific users and items
+
+```ruby
+recommender.user_factors(user_id)
+recommender.item_factors(item_id)
+```
+
 ## Credits
 
 Thanks to:
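The `optimize_*` methods in this release build similarity indexes over unit-normalized factors, because the inner product of unit vectors equals cosine similarity (the property noted in the recommender source that lets Faiss use a flat inner-product index). A plain-Ruby sketch of that property, with no gem dependencies:

```ruby
# Scale a vector to unit length.
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

# Plain inner product.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# Inner product of normalized vectors = cosine similarity
# of the original vectors: 24 / (5 * 5) = 0.96.
dot(a, b)
```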
@@ -304,6 +333,28 @@ Thanks to:
 - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
 - [@dasch](https://github.com/dasch) for the gem name
 
+## Upgrading
+
+### 0.2.7
+
+There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1, value: 1},
+  {user_id: 2, item_id: 1, value: 3}
+])
+```
+
+to:
+
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
+])
+```
+
 ## History
 
 View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
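The 0.2.7 warning above fires only for implicit datasets. Per the `fit` change in `lib/disco/recommender.rb` later in this diff, a dataset is treated as implicit when no training example carries a `:rating` key; a minimal sketch of that check:

```ruby
# Mirrors the detection in fit: a dataset is implicit
# when no training example has a rating.
implicit = ->(train_set) { !train_set.any? { |v| v[:rating] } }

implicit.call([{user_id: 1, item_id: 1}])            # => true (implicit)
implicit.call([{user_id: 1, item_id: 1, rating: 5}]) # => false (explicit)
```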
data/lib/disco.rb CHANGED
@@ -9,6 +9,7 @@ require "net/http"
 
 # modules
 require "disco/data"
+require "disco/metrics"
 require "disco/recommender"
 require "disco/version"
 
data/lib/disco/metrics.rb ADDED
@@ -0,0 +1,10 @@
+module Disco
+  module Metrics
+    class << self
+      def rmse(act, exp)
+        raise ArgumentError, "Size mismatch" if act.size != exp.size
+        Math.sqrt(act.zip(exp).sum { |a, e| (a - e)**2 } / act.size.to_f)
+      end
+    end
+  end
+end
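The new `Disco::Metrics.rmse` above is the standard root-mean-square error. A self-contained sketch of the same formula, runnable without the gem:

```ruby
# Same formula as Disco::Metrics.rmse above:
# the square root of the mean squared difference.
def rmse(act, exp)
  raise ArgumentError, "Size mismatch" if act.size != exp.size
  Math.sqrt(act.zip(exp).sum { |a, e| (a - e)**2 } / act.size.to_f)
end

rmse([3, 4, 5], [3, 4, 5]) # => 0.0
rmse([1, 2, 3], [1, 2, 5]) # sqrt(4/3) ≈ 1.15
```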
data/lib/disco/recommender.rb CHANGED
@@ -1,46 +1,73 @@
 module Disco
   class Recommender
-    attr_reader :global_mean, :item_factors, :user_factors
+    attr_reader :global_mean
 
-    def initialize(factors: 8, epochs: 20, verbose: nil)
+    def initialize(factors: 8, epochs: 20, verbose: nil, top_items: false)
       @factors = factors
       @epochs = epochs
       @verbose = verbose
+      @user_map = {}
+      @item_map = {}
+      @top_items = top_items
     end
 
     def fit(train_set, validation_set: nil)
       train_set = to_dataset(train_set)
       validation_set = to_dataset(validation_set) if validation_set
 
+      check_training_set(train_set)
+
+      # TODO option to set in initializer to avoid pass
+      # could also just check first few values
+      # but may be confusing if they are all missing and later ones aren't
       @implicit = !train_set.any? { |v| v[:rating] }
 
+      if @implicit && train_set.any? { |v| v[:value] }
+        warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+      end
+
+      # TODO improve performance
+      # (catch exception instead of checking ahead of time)
       unless @implicit
-        ratings = train_set.map { |o| o[:rating] }
-        check_ratings(ratings)
-        @min_rating = ratings.min
-        @max_rating = ratings.max
+        check_ratings(train_set)
 
         if validation_set
-          check_ratings(validation_set.map { |o| o[:rating] })
+          check_ratings(validation_set)
         end
       end
 
-      check_training_set(train_set)
-      create_maps(train_set)
-
       @rated = Hash.new { |hash, key| hash[key] = {} }
       input = []
-      value_key = @implicit ? :value : :rating
       train_set.each do |v|
-        u = @user_map[v[:user_id]]
-        i = @item_map[v[:item_id]]
+        # update maps and build matrix in single pass
+        u = (@user_map[v[:user_id]] ||= @user_map.size)
+        i = (@item_map[v[:item_id]] ||= @item_map.size)
         @rated[u][i] = true
 
         # explicit will always have a value due to check_ratings
-        input << [u, i, v[value_key] || 1]
+        input << [u, i, @implicit ? 1 : v[:rating]]
       end
       @rated.default = nil
 
+      # much more efficient than checking every value in another pass
+      raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+      raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+      # TODO improve performance
+      unless @implicit
+        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+      end
+
+      if @top_items
+        @item_count = [0] * @item_map.size
+        @item_sum = [0.0] * @item_map.size
+        train_set.each do |v|
+          i = @item_map[v[:item_id]]
+          @item_count[i] += 1
+          @item_sum[i] += (@implicit ? 1 : v[:rating])
+        end
+      end
+
       eval_set = nil
       if validation_set
         eval_set = []
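The `fit` rewrite above replaces the separate `create_maps` pass with inline `||=` assignments, so each id receives a dense 0-based index the first time it appears. A minimal sketch of the idiom:

```ruby
# Each new id is assigned the map's current size, yielding
# dense 0-based indexes in a single pass (as in fit above).
user_map = {}
indexes = [10, 20, 10, 30].map { |id| user_map[id] ||= user_map.size }

user_map # => {10=>0, 20=>1, 30=>2}
indexes  # => [0, 1, 0, 2]
```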
@@ -52,7 +79,7 @@ module Disco
           u ||= -1
           i ||= -1
 
-          eval_set << [u, i, v[value_key] || 1]
+          eval_set << [u, i, @implicit ? 1 : v[:rating]]
         end
       end
 
@@ -67,8 +94,12 @@ module Disco
       @user_factors = model.p_factors(format: :numo)
       @item_factors = model.q_factors(format: :numo)
 
-      @user_index = nil
-      @item_index = nil
+      @normalized_user_factors = nil
+      @normalized_item_factors = nil
+
+      @user_recs_index = nil
+      @similar_users_index = nil
+      @similar_items_index = nil
     end
 
     # generates a prediction even if a user has already rated the item
@@ -95,139 +126,239 @@
       u = @user_map[user_id]
 
       if u
-        predictions = @item_factors.inner(@user_factors[u, true])
-
-        predictions =
-          @item_map.keys.zip(predictions).map do |item_id, pred|
-            {item_id: item_id, score: pred}
-          end
+        rated = item_ids ? {} : @rated[u]
 
         if item_ids
-          idx = item_ids.map { |i| @item_map[i] }.compact
-          predictions = predictions.values_at(*idx)
+          ids = Numo::NArray.cast(item_ids.map { |i| @item_map[i] }.compact)
+          return [] if ids.size == 0
+
+          predictions = @item_factors[ids, true].inner(@user_factors[u, true])
+          indexes = predictions.sort_index.reverse
+          indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = ids[indexes]
+        elsif @user_recs_index && count
+          predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
         else
-          @rated[u].keys.sort_by { |v| -v }.each do |i|
-            predictions.delete_at(i)
-          end
+          predictions = @item_factors.inner(@user_factors[u, true])
+          indexes = predictions.sort_index.reverse # reverse just creates view
+          indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = indexes
         end
 
-        predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-        predictions = predictions.first(count) if count && !item_ids
+        predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
 
-        # clamp *after* sorting
-        # also, only needed for returned predictions
-        if @min_rating
-          predictions.each do |pred|
-            pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
-          end
-        end
+        keys = @item_map.keys
+        result = []
+        ids.each_with_index do |item_id, i|
+          next if rated[item_id]
 
-        predictions
+          result << {item_id: keys[item_id], score: predictions[i]}
+          break if result.size == count
+        end
+        result
+      elsif @top_items
+        top_items(count: count)
       else
-        # no items if user is unknown
-        # TODO maybe most popular items
         []
       end
     end
 
-    def optimize_similar_items
+    def similar_items(item_id, count: 5)
       check_fit
-      @item_index = create_index(@item_factors)
+      similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
     end
-    alias_method :optimize_item_recs, :optimize_similar_items
+    alias_method :item_recs, :similar_items
 
-    def optimize_similar_users
+    def similar_users(user_id, count: 5)
       check_fit
-      @user_index = create_index(@user_factors)
+      similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
     end
 
-    def similar_items(item_id, count: 5)
+    def top_items(count: 5)
       check_fit
-      similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
+      raise "top_items not computed" unless @top_items
+
+      if @implicit
+        scores = Numo::UInt64.cast(@item_count)
+      else
+        require "wilson_score"
+
+        range = @min_rating..@max_rating
+        scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+        # TODO uncomment in 0.3.0
+        # wilson score with continuity correction
+        # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+        # z = 1.96 # 95% confidence
+        # range = @max_rating - @min_rating
+        # n = Numo::DFloat.cast(@item_count)
+        # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+        # phat = (phat - (1 / 2 * n)).clip(0, 100) # continuity correction
+        # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+        # scores = scores * range + @min_rating
+      end
+
+      indexes = scores.sort_index.reverse
+      indexes = indexes[0...[count, indexes.size].min] if count
+      scores = scores[indexes]
+
+      keys = @item_map.keys
+      indexes.size.times.map do |i|
+        {item_id: keys[indexes[i]], score: scores[i]}
+      end
     end
-    alias_method :item_recs, :similar_items
 
-    def similar_users(user_id, count: 5)
+    def user_ids
+      @user_map.keys
+    end
+
+    def item_ids
+      @item_map.keys
+    end
+
+    def user_factors(user_id = nil)
+      if user_id
+        u = @user_map[user_id]
+        @user_factors[u, true] if u
+      else
+        @user_factors
+      end
+    end
+
+    def item_factors(item_id = nil)
+      if item_id
+        i = @item_map[item_id]
+        @item_factors[i, true] if i
+      else
+        @item_factors
+      end
+    end
+
+    def optimize_user_recs
       check_fit
-      similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
+      @user_recs_index = create_index(item_factors, library: "faiss")
     end
 
-    private
+    def optimize_similar_items(library: nil)
+      check_fit
+      @similar_items_index = create_index(normalized_item_factors, library: library)
+    end
+    alias_method :optimize_item_recs, :optimize_similar_items
+
+    def optimize_similar_users(library: nil)
+      check_fit
+      @similar_users_index = create_index(normalized_user_factors, library: library)
+    end
+
+    def inspect
+      to_s # for now
+    end
 
-    def create_index(factors)
-      require "ngt"
+    private
 
-      index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
-      index.batch_insert(factors)
-      index
+    # factors should already be normalized for similar users/items
+    def create_index(factors, library:)
+      # TODO make Faiss the default in 0.3.0
+      library ||= defined?(Faiss) && !defined?(Ngt) ? "faiss" : "ngt"
+
+      case library
+      when "faiss"
+        require "faiss"
+
+        # inner product is cosine similarity with normalized vectors
+        # https://github.com/facebookresearch/faiss/issues/95
+        #
+        # TODO use non-exact index in 0.3.0
+        # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+        # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
+        index = Faiss::IndexFlatIP.new(factors.shape[1])
+
+        # ids are from 0...total
+        # https://github.com/facebookresearch/faiss/blob/96b740abedffc8f67389f29c2a180913941534c6/faiss/Index.h#L89
+        index.add(factors)
+
+        index
+      when "ngt"
+        require "ngt"
+
+        # could speed up search with normalized cosine
+        # https://github.com/yahoojapan/NGT/issues/36
+        index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+
+        # NGT normalizes so could call create_index without normalized factors
+        # but keep code simple for now
+        ids = index.batch_insert(factors)
+        raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
+
+        index
+      else
+        raise ArgumentError, "Invalid library: #{library}"
+      end
     end
 
-    def user_norms
-      @user_norms ||= norms(@user_factors)
+    def normalized_user_factors
+      @normalized_user_factors ||= normalize(@user_factors)
     end
 
-    def item_norms
-      @item_norms ||= norms(@item_factors)
+    def normalized_item_factors
+      @normalized_item_factors ||= normalize(@item_factors)
     end
 
-    def norms(factors)
+    def normalize(factors)
       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
-      norms
+      factors / norms.expand_dims(1)
     end
 
-    def similar(id, map, factors, norms, count, index)
+    def similar(id, map, norm_factors, count, index)
       i = map[id]
-      if i
+
+      if i && norm_factors.shape[0] > 1
         if index && count
-          keys = map.keys
-          result = index.search(factors[i, true], size: count + 1)[1..-1]
-          result.map do |v|
-            {
-              # ids from batch_insert start at 1 instead of 0
-              item_id: keys[v[:id] - 1],
-              # convert cosine distance to cosine similarity
-              score: 1 - v[:distance]
-            }
+          if defined?(Faiss) && index.is_a?(Faiss::Index)
+            predictions, ids = index.search(norm_factors[i, true].expand_dims(0), count + 1).map { |v| v.to_a[0] }
+          else
+            result = index.search(norm_factors[i, true], size: count + 1)
+            # ids from batch_insert start at 1 instead of 0
+            ids = result.map { |v| v[:id] - 1 }
+            # convert cosine distance to cosine similarity
+            predictions = result.map { |v| 1 - v[:distance] }
           end
         else
-          predictions = factors.dot(factors[i, true]) / norms
-
-          predictions =
-            map.keys.zip(predictions).map do |item_id, pred|
-              {item_id: item_id, score: pred}
-            end
-
-          max_score = predictions.delete_at(i)[:score]
-          predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-          predictions = predictions.first(count) if count
-          # divide by max score to get cosine similarity
-          # only need to do for returned records
-          predictions.each { |pred| pred[:score] /= max_score }
-          predictions
+          predictions = norm_factors.inner(norm_factors[i, true])
+          indexes = predictions.sort_index.reverse
+          indexes = indexes[0...[count + 1, indexes.size].min] if count
+          predictions = predictions[indexes]
+          ids = indexes
         end
-      else
-        []
-      end
-    end
 
-    def create_maps(train_set)
-      user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
-      item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
+        keys = map.keys
 
-      raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
-      raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+        # TODO use user_id for similar_users in 0.3.0
+        key = :item_id
 
-      @user_map = user_ids.zip(user_ids.size.times).to_h
-      @item_map = item_ids.zip(item_ids.size.times).to_h
+        result = []
+        # items can have the same score
+        # so original item may not be at index 0
+        ids.each_with_index do |id, j|
+          next if id == i
+
+          result << {key => keys[id], score: predictions[j]}
+        end
+        result
+      else
+        []
+      end
     end
 
     def check_ratings(ratings)
-      unless ratings.all? { |r| !r.nil? }
-        raise ArgumentError, "Missing ratings"
+      unless ratings.all? { |r| !r[:rating].nil? }
+        raise ArgumentError, "Missing rating"
       end
-      unless ratings.all? { |r| r.is_a?(Numeric) }
-        raise ArgumentError, "Ratings must be numeric"
+      unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
+        raise ArgumentError, "Rating must be numeric"
       end
     end
@@ -266,7 +397,10 @@ module Disco
         rated: @rated,
         global_mean: @global_mean,
         user_factors: @user_factors,
-        item_factors: @item_factors
+        item_factors: @item_factors,
+        factors: @factors,
+        epochs: @epochs,
+        verbose: @verbose
       }
 
       unless @implicit
@@ -274,6 +408,11 @@ module Disco
         obj[:max_rating] = @max_rating
       end
 
+      if @top_items
+        obj[:item_count] = @item_count
+        obj[:item_sum] = @item_sum
+      end
+
       obj
     end
 
@@ -285,11 +424,20 @@ module Disco
       @global_mean = obj[:global_mean]
       @user_factors = obj[:user_factors]
       @item_factors = obj[:item_factors]
+      @factors = obj[:factors]
+      @epochs = obj[:epochs]
+      @verbose = obj[:verbose]
 
       unless @implicit
         @min_rating = obj[:min_rating]
         @max_rating = obj[:max_rating]
       end
+
+      @top_items = obj.key?(:item_count)
+      if @top_items
+        @item_count = obj[:item_count]
+        @item_sum = obj[:item_sum]
+      end
     end
   end
 end
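The `_dump`/`_load` changes above add the hyperparameters to the serialized state so they survive a save/load cycle (which is what fixes `fit` after loading, per the 0.2.6 changelog). A minimal sketch of the underlying Marshal round trip, where `state` is a hypothetical stand-in for the dumped object:

```ruby
# Hypothetical stand-in for the recommender's dumped state,
# now carrying hyperparameters per the diff above.
state = {factors: 8, epochs: 20, verbose: nil}

# Marshal serializes to a binary string and restores an equal object.
bin = Marshal.dump(state)
restored = Marshal.load(bin)

restored[:factors] # => 8
```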
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Disco
-  VERSION = "0.2.3"
+  VERSION = "0.2.7"
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.2.3
+  version: 0.2.7
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-11-28 00:00:00.000000000 Z
+date: 2021-08-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -39,7 +39,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: '0'
 description:
-email: andrew@chartkick.com
+email: andrew@ankane.org
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -51,6 +51,7 @@ files:
 - lib/disco.rb
 - lib/disco/data.rb
 - lib/disco/engine.rb
+- lib/disco/metrics.rb
 - lib/disco/model.rb
 - lib/disco/recommender.rb
 - lib/disco/version.rb
@@ -75,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version: 3.1.4
+rubygems_version: 3.2.22
 signing_key:
 specification_version: 4
 summary: Recommendations for Ruby and Rails using collaborative filtering