disco 0.2.3 → 0.2.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: e9b8792d465e2bd894ce9aaa5dabf79dd89e93337d838917c709ac7747b85772
- data.tar.gz: 9d34a5124dc26f8a2ecb7e2ed3cbf524fe586c37c693d885e668974e24dfaf0a
+ metadata.gz: 5f400f07839587b574ddcfa4c88335bfe20fcd876164b943e8094a35c3c1cfef
+ data.tar.gz: e2426b283146837d14be154ff0e67eb2505fd6587958b39212bf2dfe3bfccd80
  SHA512:
- metadata.gz: 658b48b75994a295382eb22908d4a5f1825b01bfc26f52428e993802c42f7ebb435a59e7f1262d17400d65eda886f1a4f38edff82cdeda96c2a0ce280602742f
- data.tar.gz: c9acce77cae8a575c5814456247600367d5fea4eb85a52309e26cac643d107bc03c42c99ce094c2f9ec46a329fd2dfa97d2f58f0020b337feb88b56346630942
+ metadata.gz: 2be9f24184036ec5b093de55640aebb60887ac59c566f37698fcba7a18daa15cf586566708def0060f80fc0747a50447538cf42fdf36024ae19ddac0de8b415c
+ data.tar.gz: 4682a5524a8cad4a247ec53f99c78e317d56ee55433bb2ad7806af4f2a9854bc016fd23564003f009dc69d0fdcf81949dc88c64d3cbe824a8e76fc5cae8abc7d
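The SHA-256 values above can be reproduced locally with Ruby's standard `digest` library; a minimal sketch (the input here is an illustrative string, not the actual gem artifact):

```ruby
require "digest"

# Compute the same kind of digest the registry publishes for each artifact.
# A real check would read the downloaded .gem contents instead of this string.
data = "example bytes"
hex = Digest::SHA256.hexdigest(data)

# SHA-256 always yields 64 hex characters, like the checksums above
# (SHA-512, also listed above, yields 128)
hex.length # => 64
```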
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+ ## 0.2.7 (2021-08-06)
+
+ - Added warning for `value`
+
+ ## 0.2.6 (2021-02-24)
+
+ - Improved performance
+ - Improved `inspect` method
+ - Fixed issue with `similar_users` and `item_recs` returning the original user/item
+ - Fixed error with `fit` after loading
+
+ ## 0.2.5 (2021-02-20)
+
+ - Added `top_items` method
+ - Added `optimize_similar_users` method
+ - Added support for Faiss for `optimize_item_recs` and `optimize_similar_users` methods
+ - Added `rmse` method
+ - Improved performance
+
+ ## 0.2.4 (2021-02-15)
+
+ - Added `user_ids` and `item_ids` methods
+ - Added `user_id` argument to `user_factors`
+ - Added `item_id` argument to `item_factors`
+
  ## 0.2.3 (2020-11-28)

  - Added `predict` method
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2019-2020 Andrew Kane
+ Copyright (c) 2019-2021 Andrew Kane

  MIT License

data/README.md CHANGED
@@ -35,24 +35,22 @@ recommender.fit([

  > IDs can be integers, strings, or any other data type

- If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating, or use a value like number of purchases, number of page views, or time spent on page:
+ If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.

  ```ruby
  recommender.fit([
- {user_id: 1, item_id: 1, value: 1},
- {user_id: 2, item_id: 1, value: 1}
+ {user_id: 1, item_id: 1},
+ {user_id: 2, item_id: 1}
  ])
  ```

- > Use `value` instead of rating for implicit feedback
-
- Get user-based (user-item) recommendations - “users like you also liked”
+ Get user-based recommendations - “users like you also liked”

  ```ruby
  recommender.user_recs(user_id)
  ```

- Get item-based (item-item) recommendations - “users who liked this item also liked”
+ Get item-based recommendations - “users who liked this item also liked”

  ```ruby
  recommender.item_recs(item_id)
@@ -106,11 +104,10 @@ views = Ahoy::Event.
  count

  data =
- views.map do |(user_id, post_id), count|
+ views.map do |(user_id, post_id), _|
  {
  user_id: user_id,
- item_id: post_id,
- value: count
+ item_id: post_id
  }
  end
  ```
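The reshaping in this hunk can be checked in isolation; a sketch with a hand-built hash standing in for the grouped Ahoy query result:

```ruby
# Stand-in for the grouped query: {[user_id, post_id] => count}
views = {[1, 10] => 7, [2, 10] => 3}

# The updated README discards the count, since a value no longer
# affects implicit-feedback recommendations
data =
  views.map do |(user_id, post_id), _|
    {user_id: user_id, item_id: post_id}
  end

data # => [{user_id: 1, item_id: 10}, {user_id: 2, item_id: 10}]
```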
@@ -201,6 +198,8 @@ bin = File.binread("recommender.bin")
  recommender = Marshal.load(bin)
  ```

+ Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
+
  ## Algorithms

  Disco uses high-performance matrix factorization.
@@ -237,6 +236,16 @@ There are a number of ways to deal with this, but here are some common ones:
  - For user-based recommendations, show new users the most popular items.
  - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).

+ Get top items with:
+
+ ```ruby
+ recommender = Disco::Recommender.new(top_items: true)
+ recommender.fit(data)
+ recommender.top_items
+ ```
+
+ This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
+
  ## Data

  Data can be an array of hashes
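The Wilson score mentioned above (used for explicit feedback) ranks items by the lower bound of a confidence interval rather than the raw average, so an item with many good ratings outranks one with a single perfect rating. A hedged sketch of the standard lower-bound formula, not the wilson_score gem's exact API:

```ruby
# Lower bound of the Wilson score interval for a binomial proportion
# (z = 1.96 for 95% confidence)
def wilson_lower_bound(positive, total, z: 1.96)
  return 0.0 if total.zero?
  phat = positive.to_f / total
  (phat + z**2 / (2 * total) -
    z * Math.sqrt((phat * (1 - phat) + z**2 / (4 * total)) / total)) /
    (1 + z**2 / total)
end

# 100/100 positive ratings beats 1/1, even though both average to 1.0
wilson_lower_bound(100, 100) > wilson_lower_bound(1, 1) # => true
```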
@@ -257,45 +266,65 @@ Or a Daru data frame
  Daru::DataFrame.from_csv("ratings.csv")
  ```

- ## Faster Similarity
+ ## Performance

- If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+ If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.

  Add this line to your application’s Gemfile:

  ```ruby
- gem 'ngt', '>= 0.3.0'
+ gem 'faiss'
  ```

- Speed up item-based recommendations with:
+ Speed up the `user_recs` method with:

  ```ruby
- model.optimize_item_recs
+ recommender.optimize_user_recs
  ```

- Speed up similar users with:
+ Speed up the `item_recs` method with:

  ```ruby
- model.optimize_similar_users
+ recommender.optimize_item_recs
  ```

- This should be called after fitting or loading the model.
+ Speed up the `similar_users` method with:
+
+ ```ruby
+ recommender.optimize_similar_users
+ ```
+
+ This should be called after fitting or loading the recommender.

  ## Reference

+ Get ids
+
+ ```ruby
+ recommender.user_ids
+ recommender.item_ids
+ ```
+
  Get the global mean

  ```ruby
  recommender.global_mean
  ```

- Get the factors
+ Get factors

  ```ruby
  recommender.user_factors
  recommender.item_factors
  ```

+ Get factors for specific users and items
+
+ ```ruby
+ recommender.user_factors(user_id)
+ recommender.item_factors(item_id)
+ ```
+
  ## Credits

  Thanks to:
@@ -304,6 +333,28 @@ Thanks to:
  - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
  - [@dasch](https://github.com/dasch) for the gem name

+ ## Upgrading
+
+ ### 0.2.7
+
+ There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+ ```ruby
+ recommender.fit([
+ {user_id: 1, item_id: 1, value: 1},
+ {user_id: 2, item_id: 1, value: 3}
+ ])
+ ```
+
+ to:
+
+ ```ruby
+ recommender.fit([
+ {user_id: 1, item_id: 1},
+ {user_id: 2, item_id: 1}
+ ])
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
data/lib/disco.rb CHANGED
@@ -9,6 +9,7 @@ require "net/http"

  # modules
  require "disco/data"
+ require "disco/metrics"
  require "disco/recommender"
  require "disco/version"

data/lib/disco/metrics.rb ADDED
@@ -0,0 +1,10 @@
+ module Disco
+ module Metrics
+ class << self
+ def rmse(act, exp)
+ raise ArgumentError, "Size mismatch" if act.size != exp.size
+ Math.sqrt(act.zip(exp).sum { |a, e| (a - e)**2 } / act.size.to_f)
+ end
+ end
+ end
+ end
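The new `Disco::Metrics.rmse` above is small enough to exercise standalone; this re-implements the same formula outside the gem so its behavior is easy to verify:

```ruby
# Root-mean-square error, same formula as the new Disco::Metrics.rmse
def rmse(act, exp)
  raise ArgumentError, "Size mismatch" if act.size != exp.size
  Math.sqrt(act.zip(exp).sum { |a, e| (a - e)**2 } / act.size.to_f)
end

rmse([3, 4, 5], [3, 4, 5]) # => 0.0
rmse([1, 2], [3, 4])       # => 2.0
```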
data/lib/disco/recommender.rb CHANGED
@@ -1,46 +1,73 @@
  module Disco
  class Recommender
- attr_reader :global_mean, :item_factors, :user_factors
+ attr_reader :global_mean

- def initialize(factors: 8, epochs: 20, verbose: nil)
+ def initialize(factors: 8, epochs: 20, verbose: nil, top_items: false)
  @factors = factors
  @epochs = epochs
  @verbose = verbose
+ @user_map = {}
+ @item_map = {}
+ @top_items = top_items
  end

  def fit(train_set, validation_set: nil)
  train_set = to_dataset(train_set)
  validation_set = to_dataset(validation_set) if validation_set

+ check_training_set(train_set)
+
+ # TODO option to set in initializer to avoid pass
+ # could also just check first few values
+ # but may be confusing if they are all missing and later ones aren't
  @implicit = !train_set.any? { |v| v[:rating] }

+ if @implicit && train_set.any? { |v| v[:value] }
+ warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+ end
+
+ # TODO improve performance
+ # (catch exception instead of checking ahead of time)
  unless @implicit
- ratings = train_set.map { |o| o[:rating] }
- check_ratings(ratings)
- @min_rating = ratings.min
- @max_rating = ratings.max
+ check_ratings(train_set)

  if validation_set
- check_ratings(validation_set.map { |o| o[:rating] })
+ check_ratings(validation_set)
  end
  end

- check_training_set(train_set)
- create_maps(train_set)
-
  @rated = Hash.new { |hash, key| hash[key] = {} }
  input = []
- value_key = @implicit ? :value : :rating
  train_set.each do |v|
- u = @user_map[v[:user_id]]
- i = @item_map[v[:item_id]]
+ # update maps and build matrix in single pass
+ u = (@user_map[v[:user_id]] ||= @user_map.size)
+ i = (@item_map[v[:item_id]] ||= @item_map.size)
  @rated[u][i] = true

  # explicit will always have a value due to check_ratings
- input << [u, i, v[value_key] || 1]
+ input << [u, i, @implicit ? 1 : v[:rating]]
  end
  @rated.default = nil

+ # much more efficient than checking every value in another pass
+ raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+ raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+ # TODO improve performance
+ unless @implicit
+ @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+ end
+
+ if @top_items
+ @item_count = [0] * @item_map.size
+ @item_sum = [0.0] * @item_map.size
+ train_set.each do |v|
+ i = @item_map[v[:item_id]]
+ @item_count[i] += 1
+ @item_sum[i] += (@implicit ? 1 : v[:rating])
+ end
+ end
+
  eval_set = nil
  if validation_set
  eval_set = []
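The single-pass map building in the hunk above replaces the old sort-based `create_maps`; the interning idiom it relies on can be seen in isolation:

```ruby
# Each unseen id is assigned the hash's current size, so ids are interned
# to dense integer indexes in one pass, in first-seen order. A nil id
# would also be interned, which is why the rewrite checks key?(nil) after.
user_map = {}
indexes = [:a, :b, :a, :c].map { |id| user_map[id] ||= user_map.size }

indexes  # => [0, 1, 0, 2]
user_map # => {:a=>0, :b=>1, :c=>2}
```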
@@ -52,7 +79,7 @@ module Disco
  u ||= -1
  i ||= -1

- eval_set << [u, i, v[value_key] || 1]
+ eval_set << [u, i, @implicit ? 1 : v[:rating]]
  end
  end

@@ -67,8 +94,12 @@ module Disco
  @user_factors = model.p_factors(format: :numo)
  @item_factors = model.q_factors(format: :numo)

- @user_index = nil
- @item_index = nil
+ @normalized_user_factors = nil
+ @normalized_item_factors = nil
+
+ @user_recs_index = nil
+ @similar_users_index = nil
+ @similar_items_index = nil
  end

  # generates a prediction even if a user has already rated the item
@@ -95,139 +126,239 @@ module Disco
  u = @user_map[user_id]

  if u
- predictions = @item_factors.inner(@user_factors[u, true])
-
- predictions =
- @item_map.keys.zip(predictions).map do |item_id, pred|
- {item_id: item_id, score: pred}
- end
+ rated = item_ids ? {} : @rated[u]

  if item_ids
- idx = item_ids.map { |i| @item_map[i] }.compact
- predictions = predictions.values_at(*idx)
+ ids = Numo::NArray.cast(item_ids.map { |i| @item_map[i] }.compact)
+ return [] if ids.size == 0
+
+ predictions = @item_factors[ids, true].inner(@user_factors[u, true])
+ indexes = predictions.sort_index.reverse
+ indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+ predictions = predictions[indexes]
+ ids = ids[indexes]
+ elsif @user_recs_index && count
+ predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
  else
- @rated[u].keys.sort_by { |v| -v }.each do |i|
- predictions.delete_at(i)
- end
+ predictions = @item_factors.inner(@user_factors[u, true])
+ indexes = predictions.sort_index.reverse # reverse just creates view
+ indexes = indexes[0...[count + rated.size, indexes.size].min] if count
+ predictions = predictions[indexes]
+ ids = indexes
  end

- predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
- predictions = predictions.first(count) if count && !item_ids
+ predictions.inplace.clip(@min_rating, @max_rating) if @min_rating

- # clamp *after* sorting
- # also, only needed for returned predictions
- if @min_rating
- predictions.each do |pred|
- pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
- end
- end
+ keys = @item_map.keys
+ result = []
+ ids.each_with_index do |item_id, i|
+ next if rated[item_id]

- predictions
+ result << {item_id: keys[item_id], score: predictions[i]}
+ break if result.size == count
+ end
+ result
+ elsif @top_items
+ top_items(count: count)
  else
- # no items if user is unknown
- # TODO maybe most popular items
  []
  end
  end

- def optimize_similar_items
+ def similar_items(item_id, count: 5)
  check_fit
- @item_index = create_index(@item_factors)
+ similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
  end
- alias_method :optimize_item_recs, :optimize_similar_items
+ alias_method :item_recs, :similar_items

- def optimize_similar_users
+ def similar_users(user_id, count: 5)
  check_fit
- @user_index = create_index(@user_factors)
+ similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
  end

- def similar_items(item_id, count: 5)
+ def top_items(count: 5)
  check_fit
- similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
+ raise "top_items not computed" unless @top_items
+
+ if @implicit
+ scores = Numo::UInt64.cast(@item_count)
+ else
+ require "wilson_score"
+
+ range = @min_rating..@max_rating
+ scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+ # TODO uncomment in 0.3.0
+ # wilson score with continuity correction
+ # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+ # z = 1.96 # 95% confidence
+ # range = @max_rating - @min_rating
+ # n = Numo::DFloat.cast(@item_count)
+ # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+ # phat = (phat - (1 / 2 * n)).clip(0, 100) # continuity correction
+ # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+ # scores = scores * range + @min_rating
+ end
+
+ indexes = scores.sort_index.reverse
+ indexes = indexes[0...[count, indexes.size].min] if count
+ scores = scores[indexes]
+
+ keys = @item_map.keys
+ indexes.size.times.map do |i|
+ {item_id: keys[indexes[i]], score: scores[i]}
+ end
  end
- alias_method :item_recs, :similar_items

- def similar_users(user_id, count: 5)
+ def user_ids
+ @user_map.keys
+ end
+
+ def item_ids
+ @item_map.keys
+ end
+
+ def user_factors(user_id = nil)
+ if user_id
+ u = @user_map[user_id]
+ @user_factors[u, true] if u
+ else
+ @user_factors
+ end
+ end
+
+ def item_factors(item_id = nil)
+ if item_id
+ i = @item_map[item_id]
+ @item_factors[i, true] if i
+ else
+ @item_factors
+ end
+ end
+
+ def optimize_user_recs
  check_fit
- similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
+ @user_recs_index = create_index(item_factors, library: "faiss")
  end

- private
+ def optimize_similar_items(library: nil)
+ check_fit
+ @similar_items_index = create_index(normalized_item_factors, library: library)
+ end
+ alias_method :optimize_item_recs, :optimize_similar_items
+
+ def optimize_similar_users(library: nil)
+ check_fit
+ @similar_users_index = create_index(normalized_user_factors, library: library)
+ end
+
+ def inspect
+ to_s # for now
+ end

- def create_index(factors)
- require "ngt"
+ private

- index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
- index.batch_insert(factors)
- index
+ # factors should already be normalized for similar users/items
+ def create_index(factors, library:)
+ # TODO make Faiss the default in 0.3.0
+ library ||= defined?(Faiss) && !defined?(Ngt) ? "faiss" : "ngt"
+
+ case library
+ when "faiss"
+ require "faiss"
+
+ # inner product is cosine similarity with normalized vectors
+ # https://github.com/facebookresearch/faiss/issues/95
+ #
+ # TODO use non-exact index in 0.3.0
+ # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+ # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
+ index = Faiss::IndexFlatIP.new(factors.shape[1])
+
+ # ids are from 0...total
+ # https://github.com/facebookresearch/faiss/blob/96b740abedffc8f67389f29c2a180913941534c6/faiss/Index.h#L89
+ index.add(factors)
+
+ index
+ when "ngt"
+ require "ngt"
+
+ # could speed up search with normalized cosine
+ # https://github.com/yahoojapan/NGT/issues/36
+ index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+
+ # NGT normalizes so could call create_index without normalized factors
+ # but keep code simple for now
+ ids = index.batch_insert(factors)
+ raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
+
+ index
+ else
+ raise ArgumentError, "Invalid library: #{library}"
+ end
  end

- def user_norms
- @user_norms ||= norms(@user_factors)
+ def normalized_user_factors
+ @normalized_user_factors ||= normalize(@user_factors)
  end

- def item_norms
- @item_norms ||= norms(@item_factors)
+ def normalized_item_factors
+ @normalized_item_factors ||= normalize(@item_factors)
  end

- def norms(factors)
+ def normalize(factors)
  norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
  norms[norms.eq(0)] = 1e-10 # no zeros
- norms
+ factors / norms.expand_dims(1)
  end

- def similar(id, map, factors, norms, count, index)
+ def similar(id, map, norm_factors, count, index)
  i = map[id]
- if i
+
+ if i && norm_factors.shape[0] > 1
  if index && count
- keys = map.keys
- result = index.search(factors[i, true], size: count + 1)[1..-1]
- result.map do |v|
- {
- # ids from batch_insert start at 1 instead of 0
- item_id: keys[v[:id] - 1],
- # convert cosine distance to cosine similarity
- score: 1 - v[:distance]
- }
+ if defined?(Faiss) && index.is_a?(Faiss::Index)
+ predictions, ids = index.search(norm_factors[i, true].expand_dims(0), count + 1).map { |v| v.to_a[0] }
+ else
+ result = index.search(norm_factors[i, true], size: count + 1)
+ # ids from batch_insert start at 1 instead of 0
+ ids = result.map { |v| v[:id] - 1 }
+ # convert cosine distance to cosine similarity
+ predictions = result.map { |v| 1 - v[:distance] }
  end
  else
- predictions = factors.dot(factors[i, true]) / norms
-
- predictions =
- map.keys.zip(predictions).map do |item_id, pred|
- {item_id: item_id, score: pred}
- end
-
- max_score = predictions.delete_at(i)[:score]
- predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
- predictions = predictions.first(count) if count
- # divide by max score to get cosine similarity
- # only need to do for returned records
- predictions.each { |pred| pred[:score] /= max_score }
- predictions
+ predictions = norm_factors.inner(norm_factors[i, true])
+ indexes = predictions.sort_index.reverse
+ indexes = indexes[0...[count + 1, indexes.size].min] if count
+ predictions = predictions[indexes]
+ ids = indexes
  end
- else
- []
- end
- end

- def create_maps(train_set)
- user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
- item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
+ keys = map.keys

- raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
- raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+ # TODO use user_id for similar_users in 0.3.0
+ key = :item_id

- @user_map = user_ids.zip(user_ids.size.times).to_h
- @item_map = item_ids.zip(item_ids.size.times).to_h
+ result = []
+ # items can have the same score
+ # so original item may not be at index 0
+ ids.each_with_index do |id, j|
+ next if id == i
+
+ result << {key => keys[id], score: predictions[j]}
+ end
+ result
+ else
+ []
+ end
  end

  def check_ratings(ratings)
- unless ratings.all? { |r| !r.nil? }
- raise ArgumentError, "Missing ratings"
+ unless ratings.all? { |r| !r[:rating].nil? }
+ raise ArgumentError, "Missing rating"
  end
- unless ratings.all? { |r| r.is_a?(Numeric) }
- raise ArgumentError, "Ratings must be numeric"
+ unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
+ raise ArgumentError, "Rating must be numeric"
  end
  end

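The switch above from cached norms to cached normalized factors works because, for unit-length vectors, a plain inner product already equals cosine similarity. A sketch of that identity with plain arrays (not the gem's Numo-based code):

```ruby
# Normalize to unit length, guarding against zero vectors like the code above
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  norm = 1e-10 if norm.zero?
  v.map { |x| x / norm }
end

# Inner product of normalized vectors == cosine similarity
def cosine(a, b)
  normalize(a).zip(normalize(b)).sum { |x, y| x * y }
end

cosine([3.0, 0.0], [5.0, 0.0]) # => 1.0 (same direction, any magnitude)
cosine([1.0, 0.0], [0.0, 1.0]) # => 0.0 (orthogonal)
```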
@@ -266,7 +397,10 @@ module Disco
  rated: @rated,
  global_mean: @global_mean,
  user_factors: @user_factors,
- item_factors: @item_factors
+ item_factors: @item_factors,
+ factors: @factors,
+ epochs: @epochs,
+ verbose: @verbose
  }

  unless @implicit
@@ -274,6 +408,11 @@ module Disco
  obj[:max_rating] = @max_rating
  end

+ if @top_items
+ obj[:item_count] = @item_count
+ obj[:item_sum] = @item_sum
+ end
+
  obj
  end

@@ -285,11 +424,20 @@ module Disco
  @global_mean = obj[:global_mean]
  @user_factors = obj[:user_factors]
  @item_factors = obj[:item_factors]
+ @factors = obj[:factors]
+ @epochs = obj[:epochs]
+ @verbose = obj[:verbose]

  unless @implicit
  @min_rating = obj[:min_rating]
  @max_rating = obj[:max_rating]
  end
+
+ @top_items = obj.key?(:item_count)
+ if @top_items
+ @item_count = obj[:item_count]
+ @item_sum = obj[:item_sum]
+ end
  end
  end
  end
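The marshal hunks above add the hyperparameters and top-items counters to the dumped hash, and detect `@top_items` on load via `key?(:item_count)`. A minimal round-trip sketch of that pattern with a plain hash, not an actual recommender:

```ruby
# Dump a hash of state and restore it, as marshal_dump/marshal_load do
obj = {factors: 8, epochs: 20, item_count: [3, 1], item_sum: [12.0, 4.0]}
restored = Marshal.load(Marshal.dump(obj))

restored == obj            # => true
restored.key?(:item_count) # => true, mirrors the @top_items detection on load
```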
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Disco
- VERSION = "0.2.3"
+ VERSION = "0.2.7"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
- version: 0.2.3
+ version: 0.2.7
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-11-28 00:00:00.000000000 Z
+ date: 2021-08-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: libmf
@@ -39,7 +39,7 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  description:
- email: andrew@chartkick.com
+ email: andrew@ankane.org
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -51,6 +51,7 @@ files:
  - lib/disco.rb
  - lib/disco/data.rb
  - lib/disco/engine.rb
+ - lib/disco/metrics.rb
  - lib/disco/model.rb
  - lib/disco/recommender.rb
  - lib/disco/version.rb
@@ -75,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.4
+ rubygems_version: 3.2.22
  signing_key:
  specification_version: 4
  summary: Recommendations for Ruby and Rails using collaborative filtering