disco 0.2.5 → 0.2.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8fbecb858b316ed39a9cb726263e182561cba6df498e6253d88c79ebec5cab05
-   data.tar.gz: 42eb38a6e4e0b3fc5a9452deae5a48676ae9a53e78eeb6197718a0c94bd02b6b
+   metadata.gz: 0ec370f448c5cc8aebb2860b0580466687874eb34c165e1a3d0254a1c6e701d7
+   data.tar.gz: 1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab
  SHA512:
-   metadata.gz: d0250346d75fba75064a29578f6bfd39f09ecf712ba2e505b97a4952b5ff8b31af307eb1b912e9b25cc3dc28dee0d096bea44b47bb2ef268859bb4171f0ef8b2
-   data.tar.gz: 7b341328c12885efd0ffece4201036bb9457caee80a48a99ba110af9a81bcf832bbc1e8f8f5f14e7fddffef2dd3f4643837e0d569c997ab0c2d9ae85e12422f7
+   metadata.gz: 1802e0fbf68ee489891f94e468c3b24df0eb8463de4e55a7f50a9dbc86bda7631ca15d5b84d0d584dbd341e994db84a82f00f6753c70eda4a7400376f0443df5
+   data.tar.gz: bee0645357a5fc4eb226e75d4b56208e7377952911dd19ccad8f3072eee3eaf4d4a318933dcb4ec4d75273630e278dbf98b3ac21f5987baea90261d15cc2d851
data/CHANGELOG.md CHANGED
@@ -1,3 +1,18 @@
+ ## 0.2.8 (2022-03-13)
+
+ - Fixed error with `top_items` with all same rating
+
+ ## 0.2.7 (2021-08-06)
+
+ - Added warning for `value`
+
+ ## 0.2.6 (2021-02-24)
+
+ - Improved performance
+ - Improved `inspect` method
+ - Fixed issue with `similar_users` and `item_recs` returning the original user/item
+ - Fixed error with `fit` after loading
+
  ## 0.2.5 (2021-02-20)
 
  - Added `top_items` method
data/README.md CHANGED
@@ -35,16 +35,16 @@ recommender.fit([
 
  > IDs can be integers, strings, or any other data type
 
- If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating, or use a value like number of purchases, number of page views, or time spent on page:
+ If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.
 
  ```ruby
  recommender.fit([
-   {user_id: 1, item_id: 1, value: 1},
-   {user_id: 2, item_id: 1, value: 1}
+   {user_id: 1, item_id: 1},
+   {user_id: 2, item_id: 1}
  ])
  ```
 
- > Use `value` instead of rating for implicit feedback
+ > Each `user_id`/`item_id` combination should only appear once
 
  Get user-based recommendations - “users like you also liked”
 
@@ -99,18 +99,13 @@ recommender.item_recs("Star Wars (1977)")
  [Ahoy](https://github.com/ankane/ahoy) is a great source for implicit feedback
 
  ```ruby
- views = Ahoy::Event.
-   where(name: "Viewed post").
-   group(:user_id).
-   group("properties->>'post_id'"). # postgres syntax
-   count
+ views = Ahoy::Event.where(name: "Viewed post").group(:user_id).group_prop(:post_id).count
 
  data =
-   views.map do |(user_id, post_id), count|
+   views.map do |(user_id, post_id), _|
      {
        user_id: user_id,
-       item_id: post_id,
-       value: count
+       item_id: post_id
      }
    end
  ```
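The README note added in this release says each `user_id`/`item_id` combination should appear only once. The grouped `count` in the hunk above already yields one key per pair. For raw event rows that have not been grouped, a minimal plain-Ruby sketch of collapsing repeats could look like this. The `events` data and the `unique_pairs` helper are hypothetical and are not part of Disco or Ahoy.

```ruby
# Hypothetical raw events: the same user can view the same post many times.
events = [
  {user_id: 1, post_id: 10},
  {user_id: 1, post_id: 10},
  {user_id: 2, post_id: 10},
  {user_id: 2, post_id: 11}
]

# Hypothetical helper: keep one row per user/item pair and drop other fields.
def unique_pairs(events)
  events
    .map { |e| {user_id: e[:user_id], item_id: e[:post_id]} }
    .uniq # hashes with equal contents count as duplicates
end

data = unique_pairs(events)
```

The resulting `data` has the shape that `recommender.fit` takes for implicit feedback in the README example, with no `value` key.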
@@ -201,7 +196,7 @@ bin = File.binread("recommender.bin")
  recommender = Marshal.load(bin)
  ```
 
- Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor)
+ Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
 
  ## Algorithms
 
@@ -247,7 +242,7 @@ recommender.fit(data)
  recommender.top_items
  ```
 
- This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) your application’s Gemfile) and item frequency for implicit feedback.
+ This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
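To make the Wilson score idea concrete, here is a minimal plain-Ruby sketch of a lower bound on an average rating: the average is mapped into 0..1 using the rating range, the standard Wilson lower bound (without continuity correction) is computed, and the result is mapped back to the rating scale. The helper name `wilson_rating_lower_bound` and its signature are assumptions for illustration. This is not the wilson_score gem's implementation. It also shows why a data set where every rating is identical needs special care: the range has zero width, so the rescaling step would divide by zero, which is presumably the situation behind the 0.2.8 changelog entry.

```ruby
# Hypothetical helper, not the wilson_score gem's implementation.
# Lower bound of the Wilson score interval (no continuity correction)
# for an average rating, rescaled to the given rating range.
def wilson_rating_lower_bound(avg, n, range, z: 1.96)
  width = (range.end - range.begin).to_f
  raise ArgumentError, "rating range must have nonzero width" if width.zero?
  raise ArgumentError, "n must be positive" unless n.positive?

  phat = (avg - range.begin) / width # map the average into 0..1
  z2 = z * z
  center = phat + z2 / (2.0 * n)
  margin = z * Math.sqrt((phat * (1 - phat) + z2 / (4.0 * n)) / n)
  bound = (center - margin) / (1 + z2 / n)

  bound * width + range.begin # map back to the rating scale
end

# Same average, different amounts of evidence.
few = wilson_rating_lower_bound(5.0, 2, 1..5)
many = wilson_rating_lower_bound(5.0, 200, 1..5)
```

With the same average, the item with more ratings gets the higher lower bound, which is the property that makes this a better sort key than the raw average.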
 
  ## Data
 
@@ -269,7 +264,7 @@ Or a Daru data frame
  Daru::DataFrame.from_csv("ratings.csv")
  ```
 
- ## Performance [master]
+ ## Performance
 
  If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
 
@@ -282,22 +277,22 @@ gem 'faiss'
  Speed up the `user_recs` method with:
 
  ```ruby
- model.optimize_user_recs
+ recommender.optimize_user_recs
  ```
 
  Speed up the `item_recs` method with:
 
  ```ruby
- model.optimize_item_recs
+ recommender.optimize_item_recs
  ```
 
  Speed up the `similar_users` method with:
 
  ```ruby
- model.optimize_similar_users
+ recommender.optimize_similar_users
  ```
 
- This should be called after fitting or loading the model.
+ This should be called after fitting or loading the recommender.
 
  ## Reference
 
@@ -336,6 +331,28 @@ Thanks to:
  - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
  - [@dasch](https://github.com/dasch) for the gem name
 
+ ## Upgrading
+
+ ### 0.2.7
+
+ There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+ ```ruby
+ recommender.fit([
+   {user_id: 1, item_id: 1, value: 1},
+   {user_id: 2, item_id: 1, value: 3}
+ ])
+ ```
+
+ to:
+
+ ```ruby
+ recommender.fit([
+   {user_id: 1, item_id: 1},
+   {user_id: 2, item_id: 1}
+ ])
+ ```
+
  ## History
 
  View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
data/lib/disco/model.rb CHANGED
@@ -10,6 +10,7 @@ module Disco
 
  has_many :"recommended_#{name}", -> { where("disco_recommendations.context = ?", name).order("disco_recommendations.score DESC") }, through: :recommendations, source: :item, source_type: class_name
 
+ # TODO use fetch for item_id and score in 0.3.0
  define_method("update_recommended_#{name}") do |items|
    now = Time.now
    items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item[:item_id], context: name, score: item[:score], created_at: now, updated_at: now} }
data/lib/disco/recommender.rb CHANGED
@@ -17,38 +17,54 @@ module Disco
 
  check_training_set(train_set)
 
+ # TODO option to set in initializer to avoid pass
+ # could also just check first few values
+ # but may be confusing if they are all missing and later ones aren't
  @implicit = !train_set.any? { |v| v[:rating] }
+
+ if @implicit && train_set.any? { |v| v[:value] }
+   warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+ end
+
+ # TODO improve performance
+ # (catch exception instead of checking ahead of time)
  unless @implicit
    check_ratings(train_set)
-   @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
 
    if validation_set
      check_ratings(validation_set)
    end
  end
 
- update_maps(train_set)
-
  @rated = Hash.new { |hash, key| hash[key] = {} }
  input = []
- value_key = @implicit ? :value : :rating
  train_set.each do |v|
-   u = @user_map[v[:user_id]]
-   i = @item_map[v[:item_id]]
+   # update maps and build matrix in single pass
+   u = (@user_map[v[:user_id]] ||= @user_map.size)
+   i = (@item_map[v[:item_id]] ||= @item_map.size)
    @rated[u][i] = true
 
    # explicit will always have a value due to check_ratings
-   input << [u, i, v[value_key] || 1]
+   input << [u, i, @implicit ? 1 : v[:rating]]
  end
  @rated.default = nil
 
+ # much more efficient than checking every value in another pass
+ raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+ raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+ # TODO improve performance
+ unless @implicit
+   @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+ end
+
  if @top_items
    @item_count = [0] * @item_map.size
    @item_sum = [0.0] * @item_map.size
    train_set.each do |v|
      i = @item_map[v[:item_id]]
      @item_count[i] += 1
-     @item_sum[i] += (v[value_key] || 1)
+     @item_sum[i] += (@implicit ? 1 : v[:rating])
    end
  end
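The hunk above folds the removed `update_maps` pass into the main loop and replaces two full scans for missing IDs with a single `key?(nil)` lookup per map. A minimal standalone sketch of that pattern follows. The helper name `build_index_maps` is hypothetical, and defaulting a missing rating to 1 is a simplification of the `@implicit ? 1 : v[:rating]` expression in the diff.

```ruby
# Hypothetical helper illustrating the single-pass idea; not Disco's API.
def build_index_maps(rows)
  user_map = {}
  item_map = {}
  triples = []

  rows.each do |row|
    # Assign the next free index the first time an ID is seen.
    u = (user_map[row[:user_id]] ||= user_map.size)
    i = (item_map[row[:item_id]] ||= item_map.size)
    # Simplification: rows without a rating are treated as implicit (1).
    triples << [u, i, row.fetch(:rating, 1)]
  end

  # A missing ID ends up as a nil key, so one lookup per map replaces a scan.
  raise ArgumentError, "Missing user_id" if user_map.key?(nil)
  raise ArgumentError, "Missing item_id" if item_map.key?(nil)

  [user_map, item_map, triples]
end

users, items, triples = build_index_maps([
  {user_id: "a", item_id: "x", rating: 5},
  {user_id: "b", item_id: "x", rating: 3},
  {user_id: "a", item_id: "y", rating: 4}
])
```

The trade-off is that validation now happens after the maps have been mutated, so a failed call leaves partially filled maps behind. That matters only if the object is reused after the error.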
@@ -63,7 +79,7 @@ module Disco
      u ||= -1
      i ||= -1
 
-     eval_set << [u, i, v[value_key] || 1]
+     eval_set << [u, i, @implicit ? 1 : v[:rating]]
    end
  end
 
@@ -78,6 +94,9 @@ module Disco
  @user_factors = model.p_factors(format: :numo)
  @item_factors = model.q_factors(format: :numo)
 
+ @normalized_user_factors = nil
+ @normalized_item_factors = nil
+
  @user_recs_index = nil
  @similar_users_index = nil
  @similar_items_index = nil
@@ -122,8 +141,7 @@ module Disco
    predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
  else
    predictions = @item_factors.inner(@user_factors[u, true])
-   # TODO make sure reverse isn't hurting performance
-   indexes = predictions.sort_index.reverse
+   indexes = predictions.sort_index.reverse # reverse just creates view
    indexes = indexes[0...[count + rated.size, indexes.size].min] if count
    predictions = predictions[indexes]
    ids = indexes
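Both branches in this hunk fetch `count + rated.size` candidates rather than `count`. The filtering step is outside the hunk, but the likely reason is that items the user has already rated are dropped afterwards, and over-fetching guarantees enough candidates survive. A minimal sketch of that idea with plain arrays instead of Numo follows. The helper `top_unrated` is hypothetical and not part of Disco.

```ruby
# Hypothetical helper using plain arrays instead of Numo; not Disco's API.
def top_unrated(scores, rated, count)
  # Sort item indexes by descending score.
  indexes = scores.each_index.sort_by { |i| -scores[i] }
  # Over-fetch so that `count` results remain after rated items are dropped.
  indexes = indexes.first(count + rated.size)
  indexes.reject { |i| rated[i] }.first(count)
end

scores = [0.9, 0.1, 0.8, 0.7, 0.3] # predicted score per item index
rated = {0 => true, 2 => true}     # items the user has already rated
recs = top_unrated(scores, rated, 2)
```

Here the two best-scoring items are already rated, so the result is the next two indexes in score order.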
@@ -149,13 +167,13 @@ module Disco
 
  def similar_items(item_id, count: 5)
    check_fit
-   similar(item_id, @item_map, item_norms, count, @similar_items_index)
+   similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
  end
  alias_method :item_recs, :similar_items
 
  def similar_users(user_id, count: 5)
    check_fit
-   similar(user_id, @user_map, user_norms, count, @similar_users_index)
+   similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
  end
 
  def top_items(count: 5)
@@ -163,19 +181,38 @@ module Disco
    raise "top_items not computed" unless @top_items
 
    if @implicit
-     scores = @item_count
+     scores = Numo::UInt64.cast(@item_count)
    else
      require "wilson_score"
 
-     range = @min_rating..@max_rating
-     scores = @item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) }
+     range =
+       if @min_rating == @max_rating
+         # TODO remove temp fix
+         (@min_rating - 1)..@max_rating
+       else
+         @min_rating..@max_rating
+       end
+     scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+     # TODO uncomment in 0.3.0
+     # wilson score with continuity correction
+     # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+     # z = 1.96 # 95% confidence
+     # range = @max_rating - @min_rating
+     # n = Numo::DFloat.cast(@item_count)
+     # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+     # phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
+     # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+     # scores = scores * range + @min_rating
    end
 
-   scores = scores.map.with_index.sort_by { |s, _| -s }
-   scores = scores.first(count) if count
-   item_ids = item_ids()
-   scores.map do |s, i|
-     {item_id: item_ids[i], score: s}
+   indexes = scores.sort_index.reverse
+   indexes = indexes[0...[count, indexes.size].min] if count
+   scores = scores[indexes]
+
+   keys = @item_map.keys
+   indexes.size.times.map do |i|
+     {item_id: keys[indexes[i]], score: scores[i]}
    end
  end
 
@@ -212,13 +249,17 @@ module Disco
 
  def optimize_similar_items(library: nil)
    check_fit
-   @similar_items_index = create_index(item_norms, library: library)
+   @similar_items_index = create_index(normalized_item_factors, library: library)
  end
  alias_method :optimize_item_recs, :optimize_similar_items
 
  def optimize_similar_users(library: nil)
    check_fit
-   @similar_users_index = create_index(user_norms, library: library)
+   @similar_users_index = create_index(normalized_user_factors, library: library)
+ end
+
+ def inspect
+   to_s # for now
  end
 
  private
@@ -235,8 +276,9 @@ module Disco
  # inner product is cosine similarity with normalized vectors
  # https://github.com/facebookresearch/faiss/issues/95
  #
- # TODO use non-exact index
+ # TODO use non-exact index in 0.3.0
  # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+ # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
  index = Faiss::IndexFlatIP.new(factors.shape[1])
 
  # ids are from 0...total
@@ -251,7 +293,7 @@ module Disco
  # https://github.com/yahoojapan/NGT/issues/36
  index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
 
- # NGT normalizes so could call create_index with factors instead of norms
+ # NGT normalizes so could call create_index without normalized factors
  # but keep code simple for now
  ids = index.batch_insert(factors)
  raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
@@ -262,15 +304,15 @@ module Disco
    end
  end
 
- def user_norms
-   @user_norms ||= norms(@user_factors)
+ def normalized_user_factors
+   @normalized_user_factors ||= normalize(@user_factors)
  end
 
- def item_norms
-   @item_norms ||= norms(@item_factors)
+ def normalized_item_factors
+   @normalized_item_factors ||= normalize(@item_factors)
  end
 
- def norms(factors)
+ def normalize(factors)
    norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
    norms[norms.eq(0)] = 1e-10 # no zeros
    factors / norms.expand_dims(1)
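The rename in this hunk reflects what the method returns: the factor rows scaled to unit length, not the norms themselves. Once rows are unit length, their inner product equals cosine similarity, which is what the similarity lookups rely on. A minimal plain-Ruby sketch of the same steps follows. The helpers `normalize_rows` and `dot` are hypothetical stand-ins for the Numo operations and are not part of Disco.

```ruby
# Hypothetical plain-Ruby stand-ins for the Numo operations; not Disco's API.
def normalize_rows(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |x| x * x })
    norm = 1e-10 if norm.zero? # avoid dividing by zero, as in the hunk
    row.map { |x| x / norm }
  end
end

# Inner product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# The first two rows point in the same direction; the third is all zeros.
a, b, zero = normalize_rows([[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]])
similarity = dot(a, b) # close to 1.0 for parallel vectors
```

The zero row stays all zeros instead of becoming `NaN`, so its inner product with any other row is 0.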
@@ -303,30 +345,26 @@ module Disco
      # TODO use user_id for similar_users in 0.3.0
      key = :item_id
 
-     (1...ids.size).map do |i|
-       {key => keys[ids[i]], score: predictions[i]}
+     result = []
+     # items can have the same score
+     # so original item may not be at index 0
+     ids.each_with_index do |id, j|
+       next if id == i
+
+       result << {key => keys[id], score: predictions[j]}
      end
+     result
    else
      []
    end
  end
 
- def update_maps(train_set)
-   raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
-   raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }
-
-   train_set.each do |v|
-     @user_map[v[:user_id]] ||= @user_map.size
-     @item_map[v[:item_id]] ||= @item_map.size
-   end
- end
-
  def check_ratings(ratings)
    unless ratings.all? { |r| !r[:rating].nil? }
-     raise ArgumentError, "Missing ratings"
+     raise ArgumentError, "Missing rating"
    end
    unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
-     raise ArgumentError, "Ratings must be numeric"
+     raise ArgumentError, "Rating must be numeric"
    end
  end
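The first change in this hunk fixes the 0.2.6 changelog item about `similar_users` and `item_recs` returning the original user or item. The old code dropped position 0 on the assumption that the query item always ranks first. With tied scores another item can take that slot, so the new code filters by index instead. A minimal sketch of the difference follows. The helper `without_self` and the sample data are hypothetical.

```ruby
# Hypothetical helper mirroring the new filtering logic; not Disco's API.
def without_self(ids, scores, self_index)
  result = []
  ids.each_with_index do |id, j|
    next if id == self_index # skip the query item wherever it was ranked

    result << {index: id, score: scores[j]}
  end
  result
end

ids = [4, 2, 7]          # neighbour indexes, best first
scores = [1.0, 1.0, 0.5] # item 4 ties with the query item (index 2)

neighbours = without_self(ids, scores, 2)
# Old behaviour for comparison: drop position 0 regardless of what it holds.
old_style = ids.drop(1)
```

In this example `old_style` still contains index 2, the query item itself, while `neighbours` does not.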
@@ -365,7 +403,10 @@ module Disco
    rated: @rated,
    global_mean: @global_mean,
    user_factors: @user_factors,
-   item_factors: @item_factors
+   item_factors: @item_factors,
+   factors: @factors,
+   epochs: @epochs,
+   verbose: @verbose
  }
 
  unless @implicit
@@ -389,6 +430,9 @@ module Disco
  @global_mean = obj[:global_mean]
  @user_factors = obj[:user_factors]
  @item_factors = obj[:item_factors]
+ @factors = obj[:factors]
+ @epochs = obj[:epochs]
+ @verbose = obj[:verbose]
 
  unless @implicit
    @min_rating = obj[:min_rating]
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Disco
-   VERSION = "0.2.5"
+   VERSION = "0.2.8"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
-   version: 0.2.5
+   version: 0.2.8
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-20 00:00:00.000000000 Z
+ date: 2022-03-13 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: libmf
@@ -76,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    - !ruby/object:Gem::Version
      version: '0'
  requirements: []
- rubygems_version: 3.2.3
+ rubygems_version: 3.3.7
  signing_key:
  specification_version: 4
  summary: Recommendations for Ruby and Rails using collaborative filtering