disco 0.2.8 → 0.3.1

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 0ec370f448c5cc8aebb2860b0580466687874eb34c165e1a3d0254a1c6e701d7
4
- data.tar.gz: 1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab
3
+ metadata.gz: a4af4d7df56f884618557fd98f97da2686cecbfaf3ce1f1f52b6ba1a3a9155f5
4
+ data.tar.gz: ddbc7551c3534c41284e958042a94d988cdd5c52248f3ca7e4a8d8a72c6b168e
5
5
  SHA512:
6
- metadata.gz: 1802e0fbf68ee489891f94e468c3b24df0eb8463de4e55a7f50a9dbc86bda7631ca15d5b84d0d584dbd341e994db84a82f00f6753c70eda4a7400376f0443df5
7
- data.tar.gz: bee0645357a5fc4eb226e75d4b56208e7377952911dd19ccad8f3072eee3eaf4d4a318933dcb4ec4d75273630e278dbf98b3ac21f5987baea90261d15cc2d851
6
+ metadata.gz: 5f40f125fe4096dcf09eaf1d1295f23e68fef9bbe1e5651a1ecfa1ed06748df7a0f4c9ea048b767b429b67b4d0add1f32663388707616571a6936a55ffd1d6b7
7
+ data.tar.gz: baf3caa4deec5422bd85e9372bc717f683d4cbb067f468eb7d4c964556885f7bb273fa0b6339201aef917c8f046346c20356358462fc73103243608a5245ebe6
data/CHANGELOG.md CHANGED
@@ -1,3 +1,19 @@
1
+ ## 0.3.1 (2022-07-10)
2
+
3
+ - Added support for JSON serialization
4
+
5
+ ## 0.3.0 (2022-03-22)
6
+
7
+ - Changed `item_id` to `user_id` for `similar_users`
8
+ - Changed warning to an error when `value` passed to `fit`
9
+ - Changed to use Faiss over NGT for `optimize_item_recs` and `optimize_similar_users` when both are installed
10
+ - Removed dependency on `wilson_score` gem for `top_items`
11
+ - Dropped support for Ruby < 2.6
12
+
13
+ ## 0.2.9 (2022-03-22)
14
+
15
+ - Fixed error with `load_movielens`
16
+
1
17
  ## 0.2.8 (2022-03-13)
2
18
 
3
19
  - Fixed error with `top_items` with all same rating
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2019-2021 Andrew Kane
1
+ Copyright (c) 2019-2022 Andrew Kane
2
2
 
3
3
  MIT License
4
4
 
data/README.md CHANGED
@@ -13,7 +13,7 @@
13
13
  Add this line to your application’s Gemfile:
14
14
 
15
15
  ```ruby
16
- gem 'disco'
16
+ gem "disco"
17
17
  ```
18
18
 
19
19
  ## Getting Started
@@ -183,17 +183,19 @@ For Rails < 6, speed up inserts by adding [activerecord-import](https://github.c
183
183
  If you’d prefer to perform recommendations on-the-fly, store the recommender
184
184
 
185
185
  ```ruby
186
- bin = Marshal.dump(recommender)
187
- File.binwrite("recommender.bin", bin)
186
+ json = recommender.to_json
187
+ File.write("recommender.json", json)
188
188
  ```
189
189
 
190
- > You can save it to a file, database, or any other storage system
190
+ > You can save it to a file, database, or any other storage system. Also, user and item IDs should be integers or strings for this.
191
+
192
+ The serialized recommender includes user activity from the training data (to avoid recommending previously rated items), so be sure to protect it.
191
193
 
192
194
  Load a recommender
193
195
 
194
196
  ```ruby
195
- bin = File.binread("recommender.bin")
196
- recommender = Marshal.load(bin)
197
+ json = File.read("recommender.json")
198
+ recommender = Disco::Recommender.load_json(json)
197
199
  ```
198
200
 
199
201
  Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
@@ -242,7 +244,7 @@ recommender.fit(data)
242
244
  recommender.top_items
243
245
  ```
244
246
 
245
- This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
247
+ This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback and item frequency for implicit feedback.
246
248
 
247
249
  ## Data
248
250
 
@@ -271,7 +273,7 @@ If you have a large number of users or items, you can use an approximate nearest
271
273
  Add this line to your application’s Gemfile:
272
274
 
273
275
  ```ruby
274
- gem 'faiss'
276
+ gem "faiss"
275
277
  ```
276
278
 
277
279
  Speed up the `user_recs` method with:
data/lib/disco/data.rb CHANGED
@@ -1,9 +1,11 @@
1
1
  module Disco
2
2
  module Data
3
3
  def load_movielens
4
- item_path = download_file("ml-100k/u.item", "http://files.grouplens.org/datasets/movielens/ml-100k/u.item",
4
+ require "csv"
5
+
6
+ item_path = download_file("ml-100k/u.item", "https://files.grouplens.org/datasets/movielens/ml-100k/u.item",
5
7
  file_hash: "553841ebc7de3a0fd0d6b62a204ea30c1e651aacfb2814c7a6584ac52f2c5701")
6
- data_path = download_file("ml-100k/u.data", "http://files.grouplens.org/datasets/movielens/ml-100k/u.data",
8
+ data_path = download_file("ml-100k/u.data", "https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
7
9
  file_hash: "06416e597f82b7342361e41163890c81036900f418ad91315590814211dca490")
8
10
 
9
11
  # convert u.item to utf-8
@@ -29,6 +31,11 @@ module Disco
29
31
  private
30
32
 
31
33
  def download_file(fname, origin, file_hash:)
34
+ require "digest"
35
+ require "fileutils"
36
+ require "net/http"
37
+ require "tmpdir"
38
+
32
39
  # TODO handle this better
33
40
  raise "No HOME" unless ENV["HOME"]
34
41
  dest = "#{ENV["HOME"]}/.disco/#{fname}"
data/lib/disco/model.rb CHANGED
@@ -10,10 +10,9 @@ module Disco
10
10
 
11
11
  has_many :"recommended_#{name}", -> { where("disco_recommendations.context = ?", name).order("disco_recommendations.score DESC") }, through: :recommendations, source: :item, source_type: class_name
12
12
 
13
- # TODO use fetch for item_id and score in 0.3.0
14
13
  define_method("update_recommended_#{name}") do |items|
15
14
  now = Time.now
16
- items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item[:item_id], context: name, score: item[:score], created_at: now, updated_at: now} }
15
+ items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item.fetch(:item_id), context: name, score: item.fetch(:score), created_at: now, updated_at: now} }
17
16
 
18
17
  self.class.transaction do
19
18
  recommendations.where(context: name).delete_all
@@ -23,7 +23,7 @@ module Disco
23
23
  @implicit = !train_set.any? { |v| v[:rating] }
24
24
 
25
25
  if @implicit && train_set.any? { |v| v[:value] }
26
- warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
26
+ raise ArgumentError, "Passing `:value` with implicit feedback has no effect on recommendations and should be removed. Earlier versions of the library incorrectly stated this was used."
27
27
  end
28
28
 
29
29
  # TODO improve performance
@@ -167,13 +167,13 @@ module Disco
167
167
 
168
168
  def similar_items(item_id, count: 5)
169
169
  check_fit
170
- similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
170
+ similar(item_id, :item_id, @item_map, normalized_item_factors, count, @similar_items_index)
171
171
  end
172
172
  alias_method :item_recs, :similar_items
173
173
 
174
174
  def similar_users(user_id, count: 5)
175
175
  check_fit
176
- similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
176
+ similar(user_id, :user_id, @user_map, normalized_user_factors, count, @similar_users_index)
177
177
  end
178
178
 
179
179
  def top_items(count: 5)
@@ -183,27 +183,20 @@ module Disco
183
183
  if @implicit
184
184
  scores = Numo::UInt64.cast(@item_count)
185
185
  else
186
- require "wilson_score"
186
+ min_rating = @min_rating
187
187
 
188
- range =
189
- if @min_rating == @max_rating
190
- # TODO remove temp fix
191
- (@min_rating - 1)..@max_rating
192
- else
193
- @min_rating..@max_rating
194
- end
195
- scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
188
+ # TODO remove temp fix
189
+ min_rating -= 1 if @min_rating == @max_rating
196
190
 
197
- # TODO uncomment in 0.3.0
198
191
  # wilson score with continuity correction
199
192
  # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
200
- # z = 1.96 # 95% confidence
201
- # range = @max_rating - @min_rating
202
- # n = Numo::DFloat.cast(@item_count)
203
- # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
204
- # phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
205
- # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
206
- # scores = scores * range + @min_rating
193
+ z = 1.96 # 95% confidence
194
+ range = @max_rating - @min_rating
195
+ n = Numo::DFloat.cast(@item_count)
196
+ phat = (Numo::DFloat.cast(@item_sum) - (min_rating * n)) / range / n
197
+ phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
198
+ scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
199
+ scores = scores * range + min_rating
207
200
  end
208
201
 
209
202
  indexes = scores.sort_index.reverse
@@ -262,12 +255,51 @@ module Disco
262
255
  to_s # for now
263
256
  end
264
257
 
258
+ def to_json
259
+ require "base64"
260
+ require "json"
261
+
262
+ obj = {
263
+ implicit: @implicit,
264
+ user_ids: @user_map.keys,
265
+ item_ids: @item_map.keys,
266
+ rated: @user_map.map { |_, u| (@rated[u] || {}).keys },
267
+ global_mean: @global_mean,
268
+ user_factors: Base64.strict_encode64(@user_factors.to_binary),
269
+ item_factors: Base64.strict_encode64(@item_factors.to_binary),
270
+ factors: @factors,
271
+ epochs: @epochs,
272
+ verbose: @verbose
273
+ }
274
+
275
+ unless @implicit
276
+ obj[:min_rating] = @min_rating
277
+ obj[:max_rating] = @max_rating
278
+ end
279
+
280
+ if @top_items
281
+ obj[:item_count] = @item_count
282
+ obj[:item_sum] = @item_sum
283
+ end
284
+
285
+ JSON.generate(obj)
286
+ end
287
+
288
+ def self.load_json(json)
289
+ require "json"
290
+
291
+ obj = JSON.parse(json)
292
+
293
+ recommender = new
294
+ recommender.send(:json_load, obj)
295
+ recommender
296
+ end
297
+
265
298
  private
266
299
 
267
300
  # factors should already be normalized for similar users/items
268
301
  def create_index(factors, library:)
269
- # TODO make Faiss the default in 0.3.0
270
- library ||= defined?(Faiss) && !defined?(Ngt) ? "faiss" : "ngt"
302
+ library ||= defined?(Ngt) && !defined?(Faiss) ? "ngt" : "faiss"
271
303
 
272
304
  case library
273
305
  when "faiss"
@@ -276,7 +308,7 @@ module Disco
276
308
  # inner product is cosine similarity with normalized vectors
277
309
  # https://github.com/facebookresearch/faiss/issues/95
278
310
  #
279
- # TODO use non-exact index in 0.3.0
311
+ # TODO add option for index type
280
312
  # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
281
313
  # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
282
314
  index = Faiss::IndexFlatIP.new(factors.shape[1])
@@ -318,7 +350,7 @@ module Disco
318
350
  factors / norms.expand_dims(1)
319
351
  end
320
352
 
321
- def similar(id, map, norm_factors, count, index)
353
+ def similar(id, key, map, norm_factors, count, index)
322
354
  i = map[id]
323
355
 
324
356
  if i && norm_factors.shape[0] > 1
@@ -342,9 +374,6 @@ module Disco
342
374
 
343
375
  keys = map.keys
344
376
 
345
- # TODO use user_id for similar_users in 0.3.0
346
- key = :item_id
347
-
348
377
  result = []
349
378
  # items can have the same score
350
379
  # so original item may not be at index 0
@@ -445,5 +474,31 @@ module Disco
445
474
  @item_sum = obj[:item_sum]
446
475
  end
447
476
  end
477
+
478
+ def json_load(obj)
479
+ require "base64"
480
+
481
+ @implicit = obj["implicit"]
482
+ @user_map = obj["user_ids"].map.with_index.to_h
483
+ @item_map = obj["item_ids"].map.with_index.to_h
484
+ @rated = obj["rated"].map.with_index.to_h { |r, i| [i, r.to_h { |v| [v, true] }] }
485
+ @global_mean = obj["global_mean"].to_f
486
+ @factors = obj["factors"].to_i
487
+ @user_factors = Numo::SFloat.from_binary(Base64.strict_decode64(obj["user_factors"]), [@user_map.size, @factors])
488
+ @item_factors = Numo::SFloat.from_binary(Base64.strict_decode64(obj["item_factors"]), [@item_map.size, @factors])
489
+ @epochs = obj["epochs"].to_i
490
+ @verbose = obj["verbose"]
491
+
492
+ unless @implicit
493
+ @min_rating = obj["min_rating"]
494
+ @max_rating = obj["max_rating"]
495
+ end
496
+
497
+ @top_items = obj.key?("item_count")
498
+ if @top_items
499
+ @item_count = obj["item_count"]
500
+ @item_sum = obj["item_sum"]
501
+ end
502
+ end
448
503
  end
449
504
  end
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Disco
2
- VERSION = "0.2.8"
2
+ VERSION = "0.3.1"
3
3
  end
data/lib/disco.rb CHANGED
@@ -2,11 +2,6 @@
2
2
  require "libmf"
3
3
  require "numo/narray"
4
4
 
5
- # stdlib
6
- require "csv"
7
- require "fileutils"
8
- require "net/http"
9
-
10
5
  # modules
11
6
  require "disco/data"
12
7
  require "disco/metrics"
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: disco
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.8
4
+ version: 0.3.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2022-03-13 00:00:00.000000000 Z
11
+ date: 2022-07-10 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: libmf
@@ -69,7 +69,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
69
69
  requirements:
70
70
  - - ">="
71
71
  - !ruby/object:Gem::Version
72
- version: '2.4'
72
+ version: '2.6'
73
73
  required_rubygems_version: !ruby/object:Gem::Requirement
74
74
  requirements:
75
75
  - - ">="