RubyGems - disco - Versions diffs - 0.2.5 → 0.2.8 - Mend

disco 0.2.5 → 0.2.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 8fbecb858b316ed39a9cb726263e182561cba6df498e6253d88c79ebec5cab05
-  data.tar.gz: 42eb38a6e4e0b3fc5a9452deae5a48676ae9a53e78eeb6197718a0c94bd02b6b
+  metadata.gz: 0ec370f448c5cc8aebb2860b0580466687874eb34c165e1a3d0254a1c6e701d7
+  data.tar.gz: 1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab
 SHA512:
-  metadata.gz: d0250346d75fba75064a29578f6bfd39f09ecf712ba2e505b97a4952b5ff8b31af307eb1b912e9b25cc3dc28dee0d096bea44b47bb2ef268859bb4171f0ef8b2
-  data.tar.gz: 7b341328c12885efd0ffece4201036bb9457caee80a48a99ba110af9a81bcf832bbc1e8f8f5f14e7fddffef2dd3f4643837e0d569c997ab0c2d9ae85e12422f7
+  metadata.gz: 1802e0fbf68ee489891f94e468c3b24df0eb8463de4e55a7f50a9dbc86bda7631ca15d5b84d0d584dbd341e994db84a82f00f6753c70eda4a7400376f0443df5
+  data.tar.gz: bee0645357a5fc4eb226e75d4b56208e7377952911dd19ccad8f3072eee3eaf4d4a318933dcb4ec4d75273630e278dbf98b3ac21f5987baea90261d15cc2d851
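
The SHA256 entries above can be checked against a downloaded archive. A minimal sketch, assuming the `.gem` file has already been unpacked so that `data.tar.gz` is available locally; the helper name `sha256_matches?` and the file path in the usage comment are illustrative, not part of the gem:

```ruby
require "digest"

# Returns true when the SHA256 digest of the given bytes equals the
# expected hex string (case-insensitive comparison).
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Hypothetical usage, assuming data.tar.gz sits in the current directory:
# sha256_matches?(
#   File.binread("data.tar.gz"),
#   "1b080c37206371ee59ce184ae420c5fa1de60da0714ef17ea9459b31fcdd22ab"
# )
```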

data/CHANGELOG.md CHANGED

@@ -1,3 +1,18 @@
+## 0.2.8 (2022-03-13)
+- Fixed error with `top_items` with all same rating
+## 0.2.7 (2021-08-06)
+- Added warning for `value`
+## 0.2.6 (2021-02-24)
+- Improved performance
+- Improved `inspect` method
+- Fixed issue with `similar_users` and `item_recs` returning the original user/item
+- Fixed error with `fit` after loading
 ## 0.2.5 (2021-02-20)
 - Added `top_items` method
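
The 0.2.6 entry about `similar_users` and `item_recs` returning the original user/item comes down to how the query is removed from its own neighbor list. The sketch below contrasts the two approaches in plain Ruby. The method names and the sample data are hypothetical, and it only mirrors the idea visible in the `recommender.rb` diff: dropping position 0 fails when scores tie, while skipping by index does not.

```ruby
# Old idea: assume the query itself is always ranked first and drop position 0.
def neighbors_dropping_first(ids, predictions, keys)
  (1...ids.size).map { |j| {item_id: keys[ids[j]], score: predictions[j]} }
end

# New idea: skip the query by its index, wherever it lands in the ranking.
def neighbors_excluding(query_index, ids, predictions, keys)
  result = []
  ids.each_with_index do |id, j|
    next if id == query_index
    result << {item_id: keys[id], score: predictions[j]}
  end
  result
end
```

With keys `["A", "B", "C"]`, query index `0`, and a tie that ranks the neighbors as `[1, 0, 2]`, the first method still returns "A" while the second returns only "B" and "C".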

data/README.md CHANGED

@@ -35,16 +35,16 @@ recommender.fit([
 > IDs can be integers, strings, or any other data type
-If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating, or use a value like number of purchases, number of page views, or time spent on page:
+If users don’t rate items directly (for instance, they’re purchasing items or reading posts), this is known as implicit feedback. Leave out the rating.
 ```ruby
 recommender.fit([
-  {user_id: 1, item_id: 1, value: 1},
-  {user_id: 2, item_id: 1, value: 1}
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
 ])
 ```
-> Use `value` instead of rating for implicit feedback
+> Each `user_id`/`item_id` combination should only appear once
 Get user-based recommendations - “users like you also liked”
@@ -99,18 +99,13 @@ recommender.item_recs("Star Wars (1977)")
 [Ahoy](https://github.com/ankane/ahoy) is a great source for implicit feedback
 ```ruby
-views = Ahoy::Event.
-  where(name: "Viewed post").
-  group(:user_id).
-  group("properties->>'post_id'"). # postgres syntax
-  count
+views = Ahoy::Event.where(name: "Viewed post").group(:user_id).group_prop(:post_id).count
 data =
-  views.map do |(user_id, post_id), count|
+  views.map do |(user_id, post_id), _|
     {
       user_id: user_id,
-      item_id: post_id,
-      value: count
+      item_id: post_id
     }
   end
 ```
@@ -201,7 +196,7 @@ bin = File.binread("recommender.bin")
 recommender = Marshal.load(bin)
 ```
-Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor)
+Alternatively, you can store only the factors and use a library like [Neighbor](https://github.com/ankane/neighbor). See the [examples](https://github.com/ankane/neighbor/tree/master/examples).
 ## Algorithms
@@ -247,7 +242,7 @@ recommender.fit(data)
 recommender.top_items
 ```
-This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) your application’s Gemfile) and item frequency for implicit feedback.
+This uses [Wilson score](https://www.evanmiller.org/how-not-to-sort-by-average-rating.html) for explicit feedback (add [wilson_score](https://github.com/instacart/wilson_score) to your application’s Gemfile) and item frequency for implicit feedback.
 ## Data
@@ -269,7 +264,7 @@ Or a Daru data frame
 Daru::DataFrame.from_csv("ratings.csv")
 ```
-## Performance [master]
+## Performance
 If you have a large number of users or items, you can use an approximate nearest neighbors library like [Faiss](https://github.com/ankane/faiss) to improve the performance of certain methods.
@@ -282,22 +277,22 @@ gem 'faiss'
 Speed up the `user_recs` method with:
 ```ruby
-model.optimize_user_recs
+recommender.optimize_user_recs
 ```
 Speed up the `item_recs` method with:
 ```ruby
-model.optimize_item_recs
+recommender.optimize_item_recs
 ```
 Speed up the `similar_users` method with:
 ```ruby
-model.optimize_similar_users
+recommender.optimize_similar_users
 ```
-This should be called after fitting or loading the model.
+This should be called after fitting or loading the recommender.
 ## Reference
@@ -336,6 +331,28 @@ Thanks to:
 - [Implicit](https://github.com/benfred/implicit/) for serving as an initial reference for user and item similarity
 - [@dasch](https://github.com/dasch) for the gem name
+## Upgrading
+### 0.2.7
+There’s now a warning when passing `:value` with implicit feedback, as this has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used.
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1, value: 1},
+  {user_id: 2, item_id: 1, value: 3}
+])
+```
+to:
+```ruby
+recommender.fit([
+  {user_id: 1, item_id: 1},
+  {user_id: 2, item_id: 1}
+])
+```
 ## History
 View the [changelog](https://github.com/ankane/disco/blob/master/CHANGELOG.md)
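
For callers affected by the 0.2.7 upgrade note, existing implicit-feedback rows can be cleaned before calling `fit`. A minimal sketch, assuming rows are plain hashes with symbol keys; the helper name `to_implicit_feedback` is hypothetical and not part of Disco. It drops `:value` and keeps each `user_id`/`item_id` combination once, as the README now asks.

```ruby
# Keep only the keys used for implicit feedback, then remove duplicate pairs.
def to_implicit_feedback(rows)
  rows.map { |row| {user_id: row[:user_id], item_id: row[:item_id]} }.uniq
end

# Hypothetical usage with a fitted-from-scratch recommender:
# recommender.fit(to_implicit_feedback(old_rows))
```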

data/lib/disco/model.rb CHANGED

@@ -10,6 +10,7 @@ module Disco
         has_many :"recommended_#{name}", -> { where("disco_recommendations.context = ?", name).order("disco_recommendations.score DESC") }, through: :recommendations, source: :item, source_type: class_name
+        # TODO use fetch for item_id and score in 0.3.0
         define_method("update_recommended_#{name}") do |items|
           now = Time.now
           items = items.map { |item| {subject_type: model_name.name, subject_id: id, item_type: class_name, item_id: item[:item_id], context: name, score: item[:score], created_at: now, updated_at: now} }
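
The TODO added above mentions using `fetch` for `item_id` and `score`. The difference is that `Hash#[]` returns `nil` for a missing key while `Hash#fetch` raises `KeyError`. A small sketch of that behavior; the helper name `recommendation_attributes` is hypothetical and this is not the gem's planned implementation:

```ruby
# Builds the two per-item attributes, failing loudly if either key is absent.
def recommendation_attributes(item)
  {item_id: item.fetch(:item_id), score: item.fetch(:score)}
end
```

Passing `{item_id: 1}` to this helper raises `KeyError` instead of silently producing a `nil` score.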

data/lib/disco/recommender.rb CHANGED

@@ -17,38 +17,54 @@ module Disco
       check_training_set(train_set)
+      # TODO option to set in initializer to avoid pass
+      # could also just check first few values
+      # but may be confusing if they are all missing and later ones aren't
       @implicit = !train_set.any? { |v| v[:rating] }
+      if @implicit && train_set.any? { |v| v[:value] }
+        warn "[disco] WARNING: Passing `:value` with implicit feedback has no effect on recommendations and can be removed. Earlier versions of the library incorrectly stated this was used."
+      end
+      # TODO improve performance
+      # (catch exception instead of checking ahead of time)
       unless @implicit
         check_ratings(train_set)
-        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
         if validation_set
           check_ratings(validation_set)
         end
       end
-      update_maps(train_set)
       @rated = Hash.new { |hash, key| hash[key] = {} }
       input = []
-      value_key = @implicit ? :value : :rating
       train_set.each do |v|
-        u = @user_map[v[:user_id]]
-        i = @item_map[v[:item_id]]
+        # update maps and build matrix in single pass
+        u = (@user_map[v[:user_id]] ||= @user_map.size)
+        i = (@item_map[v[:item_id]] ||= @item_map.size)
         @rated[u][i] = true
         # explicit will always have a value due to check_ratings
-        input << [u, i, v[value_key] || 1]
+        input << [u, i, @implicit ? 1 : v[:rating]]
       end
       @rated.default = nil
+      # much more efficient than checking every value in another pass
+      raise ArgumentError, "Missing user_id" if @user_map.key?(nil)
+      raise ArgumentError, "Missing item_id" if @item_map.key?(nil)
+      # TODO improve performance
+      unless @implicit
+        @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }
+      end
       if @top_items
         @item_count = [0] * @item_map.size
         @item_sum = [0.0] * @item_map.size
         train_set.each do |v|
           i = @item_map[v[:item_id]]
           @item_count[i] += 1
-          @item_sum[i] += (v[value_key] || 1)
+          @item_sum[i] += (@implicit ? 1 : v[:rating])
         end
       end
@@ -63,7 +79,7 @@ module Disco
           u ||= -1
           i ||= -1
-          eval_set << [u, i, v[value_key] || 1]
+          eval_set << [u, i, @implicit ? 1 : v[:rating]]
         end
       end
@@ -78,6 +94,9 @@ module Disco
       @user_factors = model.p_factors(format: :numo)
       @item_factors = model.q_factors(format: :numo)
+      @normalized_user_factors = nil
+      @normalized_item_factors = nil
       @user_recs_index = nil
       @similar_users_index = nil
       @similar_items_index = nil
@@ -122,8 +141,7 @@ module Disco
           predictions, ids = @user_recs_index.search(@user_factors[u, true].expand_dims(0), count + rated.size).map { |v| v[0, true] }
         else
           predictions = @item_factors.inner(@user_factors[u, true])
-          # TODO make sure reverse isn't hurting performance
-          indexes = predictions.sort_index.reverse
+          indexes = predictions.sort_index.reverse # reverse just creates view
           indexes = indexes[0...[count + rated.size, indexes.size].min] if count
           predictions = predictions[indexes]
           ids = indexes
@@ -149,13 +167,13 @@ module Disco
     def similar_items(item_id, count: 5)
       check_fit
-      similar(item_id, @item_map, item_norms, count, @similar_items_index)
+      similar(item_id, @item_map, normalized_item_factors, count, @similar_items_index)
     end
     alias_method :item_recs, :similar_items
     def similar_users(user_id, count: 5)
       check_fit
-      similar(user_id, @user_map, user_norms, count, @similar_users_index)
+      similar(user_id, @user_map, normalized_user_factors, count, @similar_users_index)
     end
     def top_items(count: 5)
@@ -163,19 +181,38 @@ module Disco
       raise "top_items not computed" unless @top_items
       if @implicit
-        scores = @item_count
+        scores = Numo::UInt64.cast(@item_count)
       else
         require "wilson_score"
-        range = @min_rating..@max_rating
-        scores = @item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) }
+        range =
+          if @min_rating == @max_rating
+            # TODO remove temp fix
+            (@min_rating - 1)..@max_rating
+          else
+            @min_rating..@max_rating
+          end
+        scores = Numo::DFloat.cast(@item_sum.zip(@item_count).map { |s, c| WilsonScore.rating_lower_bound(s / c, c, range) })
+        # TODO uncomment in 0.3.0
+        # wilson score with continuity correction
+        # https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval_with_continuity_correction
+        # z = 1.96 # 95% confidence
+        # range = @max_rating - @min_rating
+        # n = Numo::DFloat.cast(@item_count)
+        # phat = (Numo::DFloat.cast(@item_sum) - (@min_rating * n)) / range / n
+        # phat = (phat - (1 / (2 * n))).clip(0, nil) # continuity correction
+        # scores = (phat + z**2 / (2 * n) - z * Numo::DFloat::Math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)) / (1 + z**2 / n)
+        # scores = scores * range + @min_rating
       end
-      scores = scores.map.with_index.sort_by { |s, _| -s }
-      scores = scores.first(count) if count
-      item_ids = item_ids()
-      scores.map do |s, i|
-        {item_id: item_ids[i], score: s}
+      indexes = scores.sort_index.reverse
+      indexes = indexes[0...[count, indexes.size].min] if count
+      scores = scores[indexes]
+      keys = @item_map.keys
+      indexes.size.times.map do |i|
+        {item_id: keys[indexes[i]], score: scores[i]}
       end
     end
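
The degenerate-range branch in the hunk above exists because a rating range with equal minimum and maximum has zero width, so the rating cannot be rescaled to a proportion. The sketch below uses the standard Wilson score lower bound (without continuity correction) in plain Ruby to show the effect of widening the range by one. It is not the `wilson_score` gem's implementation, and both method names are hypothetical.

```ruby
# Wilson score lower bound for a proportion phat observed over n trials.
def wilson_lower_bound(phat, n, z: 1.96)
  z2 = z**2
  center = phat + z2 / (2 * n)
  margin = z * Math.sqrt((phat * (1 - phat) + z2 / (4 * n)) / n)
  (center - margin) / (1 + z2 / n)
end

# Rescales an average rating to [0, 1], applies the bound, and scales back.
def rating_lower_bound(avg, n, min_rating, max_rating)
  # Widen a zero-width range, mirroring the temporary fix in the diff.
  min_rating -= 1 if min_rating == max_rating
  width = (max_rating - min_rating).to_f
  phat = (avg - min_rating) / width
  wilson_lower_bound(phat, n) * width + min_rating
end
```

When every rating is 4, the widened range is 3..4, so the score stays finite and moves toward 4 as the number of ratings grows.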
@@ -212,13 +249,17 @@ module Disco
     def optimize_similar_items(library: nil)
       check_fit
-      @similar_items_index = create_index(item_norms, library: library)
+      @similar_items_index = create_index(normalized_item_factors, library: library)
     end
     alias_method :optimize_item_recs, :optimize_similar_items
     def optimize_similar_users(library: nil)
       check_fit
-      @similar_users_index = create_index(user_norms, library: library)
+      @similar_users_index = create_index(normalized_user_factors, library: library)
+    end
+    def inspect
+      to_s # for now
     end
     private
@@ -235,8 +276,9 @@ module Disco
         # inner product is cosine similarity with normalized vectors
         # https://github.com/facebookresearch/faiss/issues/95
         #
-        # TODO use non-exact index
+        # TODO use non-exact index in 0.3.0
         # https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
+        # index = Faiss::IndexHNSWFlat.new(factors.shape[1], 32, :inner_product)
         index = Faiss::IndexFlatIP.new(factors.shape[1])
         # ids are from 0...total
@@ -251,7 +293,7 @@ module Disco
         # https://github.com/yahoojapan/NGT/issues/36
         index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
-        # NGT normalizes so could call create_index with factors instead of norms
+        # NGT normalizes so could call create_index without normalized factors
         # but keep code simple for now
         ids = index.batch_insert(factors)
         raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
@@ -262,15 +304,15 @@ module Disco
       end
     end
-    def user_norms
-      @user_norms ||= norms(@user_factors)
+    def normalized_user_factors
+      @normalized_user_factors ||= normalize(@user_factors)
     end
-    def item_norms
-      @item_norms ||= norms(@item_factors)
+    def normalized_item_factors
+      @normalized_item_factors ||= normalize(@item_factors)
     end
-    def norms(factors)
+    def normalize(factors)
       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
       factors / norms.expand_dims(1)
@@ -303,30 +345,26 @@ module Disco
         # TODO use user_id for similar_users in 0.3.0
         key = :item_id
-        (1...ids.size).map do |i|
-          {key => keys[ids[i]], score: predictions[i]}
+        result = []
+        # items can have the same score
+        # so original item may not be at index 0
+        ids.each_with_index do |id, j|
+          next if id == i
+          result << {key => keys[id], score: predictions[j]}
         end
+        result
       else
         []
       end
     end
-    def update_maps(train_set)
-      raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
-      raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }
-      train_set.each do |v|
-        @user_map[v[:user_id]] ||= @user_map.size
-        @item_map[v[:item_id]] ||= @item_map.size
-      end
-    end
     def check_ratings(ratings)
       unless ratings.all? { |r| !r[:rating].nil? }
-        raise ArgumentError, "Missing ratings"
+        raise ArgumentError, "Missing rating"
       end
       unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
-        raise ArgumentError, "Ratings must be numeric"
+        raise ArgumentError, "Rating must be numeric"
       end
     end
@@ -365,7 +403,10 @@ module Disco
         rated: @rated,
         global_mean: @global_mean,
         user_factors: @user_factors,
-        item_factors: @item_factors
+        item_factors: @item_factors,
+        factors: @factors,
+        epochs: @epochs,
+        verbose: @verbose
       }
       unless @implicit
@@ -389,6 +430,9 @@ module Disco
       @global_mean = obj[:global_mean]
       @user_factors = obj[:user_factors]
       @item_factors = obj[:item_factors]
+      @factors = obj[:factors]
+      @epochs = obj[:epochs]
+      @verbose = obj[:verbose]
       unless @implicit
         @min_rating = obj[:min_rating]

data/lib/disco/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Disco
-  VERSION = "0.2.5"
+  VERSION = "0.2.8"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.2.5
+  version: 0.2.8
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-20 00:00:00.000000000 Z
+date: 2022-03-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -76,7 +76,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.2.3
+rubygems_version: 3.3.7
 signing_key:
 specification_version: 4
 summary: Recommendations for Ruby and Rails using collaborative filtering