disco 0.2.5 → 0.2.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 8fbecb858b316ed39a9cb726263e182561cba6df498e6253d88c79ebec5cab05
- data.tar.gz: 42eb38a6e4e0b3fc5a9452deae5a48676ae9a53e78eeb6197718a0c94bd02b6b
+ metadata.gz: 0ec370f448c5cc8aebb2860b0580466687874eb34c165e1a3d0254a1c6e701d7
+ data.tar.gz: 1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab
  SHA512:
- metadata.gz: d0250346d75fba75064a29578f6bfd39f09ecf712ba2e505b97a4952b5ff8b31af307eb1b912e9b25cc3dc28dee0d096bea44b47bb2ef268859bb4171f0ef8b2
- data.tar.gz: 7b341328c12885efd0ffece4201036bb9457caee80a48a99ba110af9a81bcf832bbc1e8f8f5f14e7fddffef2dd3f4643837e0d569c997ab0c2d9ae85e12422f7
+ metadata.gz: 1802e0fbf68ee489891f94e468c3b24df0eb8463de4e55a7f50a9dbc86bda7631ca15d5b84d0d584dbd341e994db84a82f00f6753c70eda4a7400376f0443df5
+ data.tar.gz: bee0645357a5fc4eb226e75d4b56208e7377952911dd19ccad8f3072eee3eaf4d4a318933dcb4ec4d75273630e278dbf98b3ac21f5987baea90261d15cc2d851
data/CHANGELOG.md CHANGED
@@ -1,3 +1,18 @@
+ ## 0.2.8 (2022-03-13)
+
+ - Fixed error with `top_items` with all same rating
+
+ ## 0.2.7 (2021-08-06)
+
+ - Added warning for `value`
+
+ ## 0.2.6 (2021-02-24)
+
+ - Improved performance
+ - Improved `inspect` method
+ - Fixed issue with `similar_users` and `item_recs` returning the original user/item
+ - Fixed error with `fit` after loading
+
  ## 0.2.5 (2021-02-20)
 
  - Added `top_items` method
data/README.md CHANGED
@@ -35,16 +35,16 @@ recommender.fit([
 
  > IDs can be integers, strings, or any other data type
 
- If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating, or use a value like number of purchases, number of page views, or time spent on page:
+ If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.
 
  ```ruby
  recommender.fit([
- {user_id: 1, item_id: 1, value: 1},
- {user_id: 2, item_id: 1, value: 1}
+ {user_id: 1, item_id: 1},
+ {user_id: 2, item_id: 1}
  ])
  ```
 
- > Use `value` instead of rating for implicit feedback
+ > Each `user_id`/`item_id` combination should only appear once
 
  Get user-based recommendations - “users like you also liked”
 
@@ -99,18 +99,13 @@ recommender.item_recs("Star Wars (1977)")
  [Ahoy](https://github.com/ankane/ahoy) is a great source for implicit feedback
 
  ```ruby
- views = Ahoy::Event.
- where(name: "Viewed post").
- group(:user_id).
- group("properties->>'post_id'"). # postgres syntax
- count
+ views = Ahoy::Event.where(name: "Viewed post").group(:user_id).group_prop(:post_id).count
 
  data =
- views.map do |(user_id, post_id), count|
+ views.map do |(user_id, post_id), _|
  {
  user_id: user_id,
- item_id: post_id,
- value: count
+ item_id: post_id
  }
  end
  ```
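The reshaping step in this hunk can be tried without Rails or Ahoy. The sketch below assumes the grouped count is a Hash keyed by `[user_id, post_id]` pairs, which is what the block destructuring in the README example implies; the literal `views` hash is made-up stand-in data.

```ruby
# Stand-in for the grouped count result; keys are [user_id, post_id] pairs.
# These values are invented for illustration only.
views = {[1, "10"] => 3, [2, "10"] => 1, [2, "11"] => 7}

# The count is ignored: each pair contributes one implicit feedback row.
data = views.map { |(user_id, post_id), _| {user_id: user_id, item_id: post_id} }
```

Because the pairs are hash keys, each `user_id`/`item_id` combination appears only once in `data`.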
@@ -201,7 +196,7 @@ bin = File.binread("recommender.bin")
  recommender = Marshal.load(bin)
  ```
 
- Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor)
+ Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
 
  ## Algorithms
 
@@ -247,7 +242,7 @@ recommender.fit(data)
  recommender.top_items
  ```
 
- This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) your application’s Gemfile) and item frequency for implicit feedback.
+ This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
 
  ## Data
 
@@ -269,7 +264,7 @@ Or a Daru data frame
  Daru::DataFrame.from_csv("ratings.csv")
  ```
 
- ## Performance [master]
+ ## Performance
 
  If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
 
@@ -282,22 +277,22 @@ gem 'faiss'
  Speed up the `user_recs` method with:
 
  ```ruby
- model.optimize_user_recs
+ recommender.optimize_user_recs
  ```
 
  Speed up the `item_recs` method with:
 
  ```ruby
- model.optimize_item_recs
+ recommender.optimize_item_recs
  ```
 
  Speed up the `similar_users` method with:
 
  ```ruby
- model.optimize_similar_users
+ recommender.optimize_similar_users
  ```
 
- This should be called after fitting or loading the model.
+ This should be called after fitting or loading the recommender.
 
  ## Reference
 
@@ -336,6 +331,28 @@ Thanks to:
  - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
  - [@dasch](https://github.com/dasch) for the gem name
 
+ ## Upgrading
+
+ ### 0.2.7
+
+ There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+
+ ```ruby
+ recommender.fit([
+ {user_id: 1, item_id: 1, value: 1},
+ {user_id: 2, item_id: 1, value: 3}
+ ])
+ ```
+
+ to:
+
+ ```ruby
+ recommender.fit([
+ {user_id: 1, item_id: 1},
+ {user_id: 2, item_id: 1}
+ ])
+ ```
+
  ## History
 
  View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
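For data prepared under the earlier README guidance, the cleanup described in the Upgrading note could look roughly like the sketch below. It assumes rows are plain Ruby hashes; `clean_implicit_feedback` is a hypothetical helper name and is not part of the gem.

```ruby
# Hypothetical helper (not part of disco): prepares implicit feedback rows.
def clean_implicit_feedback(rows)
  rows
    .map { |row| {user_id: row[:user_id], item_id: row[:item_id]} } # drop :value and any other keys
    .uniq # keep each user_id/item_id combination once
end

rows = [
  {user_id: 1, item_id: 1, value: 1},
  {user_id: 1, item_id: 1, value: 3},
  {user_id: 2, item_id: 1, value: 1}
]
cleaned = clean_implicit_feedback(rows)
```

Passing `cleaned` to `fit` should avoid the new `:value` warning and also satisfies the README note that each pair appears only once.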
data/lib/disco/model.rb CHANGED
@@ -10,6 +10,7 @@ module Disco
 
  has_many :"recommended_#{name}", -> { where("disco_recommendations.context = ?", name).order("disco_recommendations.score DESC") }, through: :recommendations, source: :item, source_type: class_name
 
+ # TODO use fetch for item_id and score in 0.3.0
  define_method("update_recommended_#{name}") do |items|
  now = Time.now
  items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item[:item_id], context: name, score: item[:score], created_at: now, updated_at: now} }
data/lib/disco/recommender.rb CHANGED
@@ -17,38 +17,54 @@ module Disco
 
  check_training_set(train_set)
 
+ # TODO option to set in initializer to avoid pass
+ # could also just check first few values
+ # but may be confusing if they are all missing and later ones aren't
  @implicit = !train_set.any? { |v| v[:rating] }
+
+ if @implicit && train_set.any? { |v| v[:value] }
+ warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+ end
+
+ # TODO improve performance
+ # (catch exception instead of checking ahead of time)
  unless @implicit
  check_ratings(train_set)
- @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
 
  if validation_set
  check_ratings(validation_set)
  end
  end
 
- update_maps(train_set)
-
  @rated = Hash.new { |hash, key| hash[key] = {} }
  input = []
- value_key = @implicit ? :value : :rating
  train_set.each do |v|
- u = @user_map[v[:user_id]]
- i = @item_map[v[:item_id]]
+ # update maps and build matrix in single pass
+ u = (@user_map[v[:user_id]] ||= @user_map.size)
+ i = (@item_map[v[:item_id]] ||= @item_map.size)
  @rated[u][i] = true
 
  # explicit will always have a value due to check_ratings
- input << [u, i, v[value_key] || 1]
+ input << [u, i, @implicit ? 1 : v[:rating]]
  end
  @rated.default = nil
 
+ # much more efficient than checking every value in another pass
+ raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+ raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+
+ # TODO improve performance
+ unless @implicit
+ @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+ end
+
  if @top_items
  @item_count = [0] * @item_map.size
  @item_sum = [0.0] * @item_map.size
  train_set.each do |v|
  i = @item_map[v[:item_id]]
  @item_count[i] += 1
- @item_sum[i] += (v[value_key] || 1)
+ @item_sum[i] += (@implicit ? 1 : v[:rating])
  end
  end
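The single-pass ID mapping introduced in this hunk can be illustrated in isolation. The sketch below is a stand-alone approximation rather than the gem's code: `build_index_maps` is a hypothetical name, and it returns only the index pairs instead of building the training matrix.

```ruby
# Stand-alone illustration of the `||= map.size` idiom; not the gem's code.
def build_index_maps(rows)
  user_map = {}
  item_map = {}

  pairs = rows.map do |row|
    # The first time an ID is seen, it is assigned the next free index.
    u = (user_map[row[:user_id]] ||= user_map.size)
    i = (item_map[row[:item_id]] ||= item_map.size)
    [u, i]
  end

  # A nil ID would have been stored as a key, so one lookup per map
  # replaces a separate scan over every row.
  raise ArgumentError, "Missing user_id" if user_map.key?(nil)
  raise ArgumentError, "Missing item_id" if item_map.key?(nil)

  [user_map, item_map, pairs]
end

rows = [
  {user_id: "a", item_id: 10},
  {user_id: "b", item_id: 10},
  {user_id: "a", item_id: 20}
]
user_map, item_map, pairs = build_index_maps(rows)
```

The idiom relies on `0` being truthy in Ruby, so an ID that was assigned index 0 is not reassigned on later lookups.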
 
@@ -63,7 +79,7 @@ module Disco
  u ||= -1
  i ||= -1
 
- eval_set << [u, i, v[value_key] || 1]
+ eval_set << [u, i, @implicit ? 1 : v[:rating]]
  end
  end
 
@@ -78,6 +94,9 @@ module Disco
  @user_factors = model.p_factors(format: :numo)
  @item_factors = model.q_factors(format: :numo)
 
+ @normalized_user_factors = nil
+ @normalized_item_factors = nil
+
  @user_recs_index = nil
  @similar_users_index = nil
  @similar_items_index = nil
@@ -122,8 +141,7 @@ module Disco
  predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
  else
  predictions = @item_factors.inner(@user_factors[u, true])
- # TODO make sure reverse isn't hurting performance
- indexes = predictions.sort_index.reverse
+ indexes = predictions.sort_index.reverse # reverse just creates view
  indexes = indexes[0...[count + rated.size, indexes.size].min] if count
  predictions = predictions[indexes]
  ids = indexes
@@ -149,13 +167,13 @@ module Disco
 
  def similar_items(item_id, count: 5)
  check_fit
- similar(item_id, @item_map, item_norms, count, @similar_items_index)
+ similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
  end
  alias_method :item_recs, :similar_items
 
  def similar_users(user_id, count: 5)
  check_fit
- similar(user_id, @user_map, user_norms, count, @similar_users_index)
+ similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
  end
 
  def top_items(count: 5)
@@ -163,19 +181,38 @@ module Disco
  raise "top_items not computed" unless @top_items
 
  if @implicit
- scores = @item_count
+ scores = Numo::UInt64.cast(@item_count)
  else
  require "wilson_score"
 
- range = @min_rating..@max_rating
- scores = @item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) }
+ range =
+ if @min_rating == @max_rating
+ # TODO remove temp fix
+ (@min_rating - 1)..@max_rating
+ else
+ @min_rating..@max_rating
+ end
+ scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+
+ # TODO uncomment in 0.3.0
+ # wilson score with continuity correction
+ # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+ # z = 1.96 # 95% confidence
+ # range = @max_rating - @min_rating
+ # n = Numo::DFloat.cast(@item_count)
+ # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+ # phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
+ # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+ # scores = scores * range + @min_rating
  end
 
- scores = scores.map.with_index.sort_by { |s, _| -s }
- scores = scores.first(count) if count
- item_ids = item_ids()
- scores.map do |s, i|
- {item_id: item_ids[i], score: s}
+ indexes = scores.sort_index.reverse
+ indexes = indexes[0...[count, indexes.size].min] if count
+ scores = scores[indexes]
+
+ keys = @item_map.keys
+ indexes.size.times.map do |i|
+ {item_id: keys[indexes[i]], score: scores[i]}
  end
  end
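The commented-out formula in this hunk can be read as a per-item calculation. The sketch below is a scalar, plain-Ruby transcription of those commented lines, not the gem's implementation. `wilson_lower_bound` is a hypothetical name, and raising on a zero-width rating range is an assumption made here for illustration; the released code instead widens the range as a temporary fix.

```ruby
# Scalar sketch of the commented-out formula; not the gem's implementation.
def wilson_lower_bound(sum, count, min_rating, max_rating, z: 1.96)
  range = (max_rating - min_rating).to_f
  # Assumption: reject a zero-width range instead of widening it,
  # since the rescaling below divides by `range`.
  raise ArgumentError, "min_rating and max_rating must differ" if range.zero?

  n = count.to_f
  phat = (sum - min_rating * n) / range / n # mean rating rescaled to 0..1
  phat = [phat - 1.0 / (2 * n), 0.0].max    # continuity correction
  score = (phat + z**2 / (2 * n) - z * Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
  score * range + min_rating                # back to the rating scale
end

few = wilson_lower_bound(20.0, 5, 1, 5)    # five ratings averaging 4
many = wilson_lower_bound(200.0, 50, 1, 5) # fifty ratings averaging 4
```

With the same average rating, the item with more ratings gets the higher lower bound, which is the property that makes this score useful for ranking top items.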
 
@@ -212,13 +249,17 @@ module Disco
 
  def optimize_similar_items(library: nil)
  check_fit
- @similar_items_index = create_index(item_norms, library: library)
+ @similar_items_index = create_index(normalized_item_factors, library: library)
  end
  alias_method :optimize_item_recs, :optimize_similar_items
 
  def optimize_similar_users(library: nil)
  check_fit
- @similar_users_index = create_index(user_norms, library: library)
+ @similar_users_index = create_index(normalized_user_factors, library: library)
+ end
+
+ def inspect
+ to_s # for now
  end
 
  private
@@ -235,8 +276,9 @@ module Disco
  # inner product is cosine similarity with normalized vectors
  # https://github.com/facebookresearch/faiss/issues/95
  #
- # TODO use non-exact index
+ # TODO use non-exact index in 0.3.0
  # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+ # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
  index = Faiss::IndexFlatIP.new(factors.shape[1])
 
  # ids are from 0...total
@@ -251,7 +293,7 @@ module Disco
  # https://github.com/yahoojapan/NGT/issues/36
  index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
 
- # NGT normalizes so could call create_index with factors instead of norms
+ # NGT normalizes so could call create_index without normalized factors
  # but keep code simple for now
  ids = index.batch_insert(factors)
  raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
@@ -262,15 +304,15 @@ module Disco
  end
  end
 
- def user_norms
- @user_norms ||= norms(@user_factors)
+ def normalized_user_factors
+ @normalized_user_factors ||= normalize(@user_factors)
  end
 
- def item_norms
- @item_norms ||= norms(@item_factors)
+ def normalized_item_factors
+ @normalized_item_factors ||= normalize(@item_factors)
  end
 
- def norms(factors)
+ def normalize(factors)
  norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
  norms[norms.eq(0)] = 1e-10 # no zeros
  factors / norms.expand_dims(1)
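The rename from `norms` to `normalize` reflects what the method returns: unit-length factor rows, not the norms themselves. A plain-Ruby sketch on single vectors (no Numo, so only an approximation of the matrix version above) shows why the inner product of normalized vectors equals cosine similarity. The `inner` helper is introduced here for illustration.

```ruby
# Plain-Ruby, single-vector version of the idea; the gem works on Numo matrices.
def normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  norm = 1e-10 if norm.zero? # same guard as the hunk: avoid dividing by zero
  vector.map { |x| x / norm }
end

# Illustrative helper: dot product of two equal-length arrays.
def inner(a, b)
  a.zip(b).sum { |x, y| x * y }
end

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a
c = [-4.0, 3.0] # orthogonal to a

same_direction = inner(normalize(a), normalize(b))
orthogonal = inner(normalize(a), normalize(c))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score close to 0.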
@@ -303,30 +345,26 @@ module Disco
  # TODO use user_id for similar_users in 0.3.0
  key = :item_id
 
- (1...ids.size).map do |i|
- {key => keys[ids[i]], score: predictions[i]}
+ result = []
+ # items can have the same score
+ # so original item may not be at index 0
+ ids.each_with_index do |id, j|
+ next if id == i
+
+ result << {key => keys[id], score: predictions[j]}
  end
+ result
  else
  []
  end
  end
 
- def update_maps(train_set)
- raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
- raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }
-
- train_set.each do |v|
- @user_map[v[:user_id]] ||= @user_map.size
- @item_map[v[:item_id]] ||= @item_map.size
- end
- end
-
  def check_ratings(ratings)
  unless ratings.all? { |r| !r[:rating].nil? }
- raise ArgumentError, "Missing ratings"
+ raise ArgumentError, "Missing rating"
  end
  unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
- raise ArgumentError, "Rating must be numeric"
+ raise ArgumentError, "Rating must be numeric"
  end
  end
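The change to `similar` in this hunk fixes the case where a neighbor ties with the query item and sorts ahead of it. The sketch below reproduces that situation with plain arrays; `without_query` is a hypothetical helper and the data is invented.

```ruby
# Hypothetical helper mirroring the idea in the hunk; not the gem's method.
def without_query(ids, scores, keys, query_index)
  result = []
  ids.each_with_index do |id, j|
    next if id == query_index # skip the query wherever it appears

    result << {item_id: keys[id], score: scores[j]}
  end
  result
end

keys = ["A", "B", "C"]
ids = [1, 0, 2]          # neighbor 1 ties with the query (index 0) and sorts first
scores = [1.0, 1.0, 0.4]

neighbors = without_query(ids, scores, keys, 0)

# The earlier approach dropped position 0, which here would drop "B" and keep "A".
old_result = (1...ids.size).map { |j| {item_id: keys[ids[j]], score: scores[j]} }
```

In this example `old_result` still contains the query item "A", which matches the changelog entry about `similar_users` and `item_recs` returning the original user or item.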
 
@@ -365,7 +403,10 @@ module Disco
  rated: @rated,
  global_mean: @global_mean,
  user_factors: @user_factors,
- item_factors: @item_factors
+ item_factors: @item_factors,
+ factors: @factors,
+ epochs: @epochs,
+ verbose: @verbose
  }
 
  unless @implicit
@@ -389,6 +430,9 @@ module Disco
  @global_mean = obj[:global_mean]
  @user_factors = obj[:user_factors]
  @item_factors = obj[:item_factors]
+ @factors = obj[:factors]
+ @epochs = obj[:epochs]
+ @verbose = obj[:verbose]
 
  unless @implicit
  @min_rating = obj[:min_rating]
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Disco
- VERSION = "0.2.5"
+ VERSION = "0.2.8"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
- version: 0.2.5
+ version: 0.2.8
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-20 00:00:00.000000000 Z
+ date: 2022-03-13 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: libmf
@@ -76,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.2.3
+ rubygems_version: 3.3.7
  signing_key:
  specification_version: 4
  summary: Recommendations for Ruby and Rails using collaborative filtering