disco 0.1.1 → 0.2.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e16ea9f41bc910a0966c4f16f0e48df98f40abb70d2d3b3bd2e8ba2080e57599
-   data.tar.gz: '05490bda394fa0edf02ab33cb16cc65093d5b66b6317671cbdbfc429c3bd196c'
+   metadata.gz: e9b8792d465e2bd894ce9aaa5dabf79dd89e93337d838917c709ac7747b85772
+   data.tar.gz: 9d34a5124dc26f8a2ecb7e2ed3cbf524fe586c37c693d885e668974e24dfaf0a
  SHA512:
-   metadata.gz: b12681372323e4bc323915923f91b5883ea97161e2f2a2846548657b25932ddc7ce364e09a6b932506e15e3cb45ff03c8c2cc22022f3a2cfa62d64bbe77e988d
-   data.tar.gz: ea2d4200bfb4d4aed3481ebda8de4133c7af5bb4c4bb7c65685a726bbf661b2285579a507ea464ed49b3609d5350b45404dd6cf99401664e013870a6dee54b7f
+   metadata.gz: 658b48b75994a295382eb22908d4a5f1825b01bfc26f52428e993802c42f7ebb435a59e7f1262d17400d65eda886f1a4f38edff82cdeda96c2a0ce280602742f
+   data.tar.gz: c9acce77cae8a575c5814456247600367d5fea4eb85a52309e26cac643d107bc03c42c99ce094c2f9ec46a329fd2dfa97d2f58f0020b337feb88b56346630942
@@ -1,3 +1,32 @@
+ ## 0.2.3 (2020-11-28)
+
+ - Added `predict` method
+ - Fixed bad recommendations and scores with `user_recs` and explicit feedback
+ - Fixed `item_ids` option for `user_recs`
+
+ ## 0.2.2 (n/a)
+
+ - Not available (released by previous gem owner)
+
+ ## 0.2.1 (2020-10-28)
+
+ - Fixed issue with `user_recs` returning rated items
+
+ ## 0.2.0 (2020-07-31)
+
+ - Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+ ## 0.1.3 (2020-06-28)
+
+ - Added support for Rover
+ - Raise error when missing user or item ids
+ - Fixed string keys for Daru data frames
+ - `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
+ ## 0.1.2 (2020-03-26)
+
+ - Added experimental `optimize_item_recs` and `optimize_similar_users` methods
+
  ## 0.1.1 (2019-11-14)

  - Fixed Rails integration
@@ -1,4 +1,4 @@
- Copyright (c) 2019 Andrew Kane
+ Copyright (c) 2019-2020 Andrew Kane

  MIT License

data/README.md CHANGED
@@ -1,12 +1,12 @@
  # Disco

- :fire: Collaborative filtering for Ruby
+ :fire: Recommendations for Ruby and Rails using collaborative filtering

  - Supports user-based and item-based recommendations
  - Works with explicit and implicit feedback
- - Uses matrix factorization
+ - Uses high-performance matrix factorization

- [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)
+ [![Build Status](https://github.com/ankane/disco/workflows/build/badge.svg?branch=master)](https://github.com/ankane/disco/actions)

  ## Installation

@@ -64,10 +64,10 @@ Use the `count` option to specify the number of recommendations (default is 5)
  recommender.user_recs(user_id, count: 3)
  ```

- Get predicted ratings for specific items
+ Get predicted ratings for specific users and items

  ```ruby
- recommender.user_recs(user_id, item_ids: [1, 2, 3])
+ recommender.predict([{user_id: 1, item_id: 2}, {user_id: 2, item_id: 4}])
  ```

  Get similar users
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
  ```ruby
  views = Ahoy::Event.
    where(name: "Viewed post").
-   group(:user_id, "properties->>'post_id'") # postgres syntax
+   group(:user_id).
+   group("properties->>'post_id'"). # postgres syntax
    count

  data =
    views.map do |(user_id, post_id), count|
      {
        user_id: user_id,
-       post_id: post_id,
+       item_id: post_id,
        value: count
      }
    end
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)

  ## Algorithms

- Disco uses matrix factorization.
+ Disco uses high-performance matrix factorization.

  - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
  - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
  - For user-based recommendations, show new users the most popular items.
  - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).

- ## Daru
+ ## Data

- Disco works with Daru data frames
+ Data can be an array of hashes

  ```ruby
- data = Daru::DataFrame.from_csv("ratings.csv")
- recommender.fit(data)
+ [{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
+ ```
+
+ Or a Rover data frame
+
+ ```ruby
+ Rover.read_csv("ratings.csv")
  ```

+ Or a Daru data frame
+
+ ```ruby
+ Daru::DataFrame.from_csv("ratings.csv")
+ ```
+
+ ## Faster Similarity
+
+ If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+
+ Add this line to your application’s Gemfile:
+
+ ```ruby
+ gem 'ngt', '>= 0.3.0'
+ ```
+
+ Speed up item-based recommendations with:
+
+ ```ruby
+ model.optimize_item_recs
+ ```
+
+ Speed up similar users with:
+
+ ```ruby
+ model.optimize_similar_users
+ ```
+
+ This should be called after fitting or loading the model.
+
  ## Reference

  Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
  - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
  - Write, clarify, or fix documentation
  - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/disco.git
+ cd disco
+ bundle install
+ bundle exec rake test
+ ```
@@ -36,8 +36,7 @@ module Disco

        return dest if File.exist?(dest)

-       temp_dir ||= File.dirname(Tempfile.new("disco"))
-       temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+       temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name

        digest = Digest::SHA2.new

@@ -9,14 +9,8 @@ module Disco
      end

      def fit(train_set, validation_set: nil)
-       if defined?(Daru)
-         if train_set.is_a?(Daru::DataFrame)
-           train_set = train_set.to_a[0]
-         end
-         if validation_set.is_a?(Daru::DataFrame)
-           validation_set = validation_set.to_a[0]
-         end
-       end
+       train_set = to_dataset(train_set)
+       validation_set = to_dataset(validation_set) if validation_set

        @implicit = !train_set.any? { |v| v[:rating] }

@@ -70,17 +64,38 @@ module Disco

        @global_mean = model.bias

-       # TODO read from LIBMF directly to Numo for performance
-       @user_factors = Numo::DFloat.cast(model.p_factors)
-       @item_factors = Numo::DFloat.cast(model.q_factors)
+       @user_factors = model.p_factors(format: :numo)
+       @item_factors = model.q_factors(format: :numo)
+
+       @user_index = nil
+       @item_index = nil
+     end
+
+     # generates a prediction even if a user has already rated the item
+     def predict(data)
+       data = to_dataset(data)
+
+       u = data.map { |v| @user_map[v[:user_id]] }
+       i = data.map { |v| @item_map[v[:item_id]] }
+
+       new_index = data.each_index.select { |index| u[index].nil? || i[index].nil? }
+       new_index.each do |j|
+         u[j] = 0
+         i[j] = 0
+       end
+
+       predictions = @user_factors[u, true].inner(@item_factors[i, true])
+       predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
+       predictions[new_index] = @global_mean
+       predictions.to_a
      end

      def user_recs(user_id, count: 5, item_ids: nil)
+       check_fit
        u = @user_map[user_id]

        if u
-         predictions = @global_mean + @item_factors.dot(@user_factors[u, true])
-         predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
+         predictions = @item_factors.inner(@user_factors[u, true])

          predictions =
            @item_map.keys.zip(predictions).map do |item_id, pred|
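
Note on the hunk above: the new `predict` method appears to fall back to the global mean for any pair whose user or item was not seen during fitting. The following is a minimal plain-Ruby sketch of that fallback, not the gem's code. The hashes, constants, and `sketch_predict` are hypothetical stand-ins; the real method works on Numo arrays and also clips to the rating range when one is set.

```ruby
# Hypothetical stand-ins for a fitted model; none of these names come from disco.
USER_FACTORS = { "alice" => [1.0, 0.5], "bob" => [0.2, 0.9] }.freeze
ITEM_FACTORS = { "matrix" => [0.8, 0.1], "amelie" => [0.1, 1.0] }.freeze
GLOBAL_MEAN = 3.5

# Returns one score per {user_id:, item_id:} pair.
def sketch_predict(pairs)
  pairs.map do |pair|
    u = USER_FACTORS[pair[:user_id]]
    i = ITEM_FACTORS[pair[:item_id]]

    # Unknown user or item: no factors to multiply, so use the global mean.
    next GLOBAL_MEAN if u.nil? || i.nil?

    # Known pair: inner product of the user and item factor rows.
    u.zip(i).sum { |a, b| a * b }
  end
end

scores = sketch_predict([
  { user_id: "alice", item_id: "matrix" },
  { user_id: "zed", item_id: "matrix" } # "zed" was never seen
])
```

The second pair gets `GLOBAL_MEAN` because its user has no factor row, which mirrors the `new_index` handling in the diff.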
@@ -89,15 +104,24 @@ module Disco

          if item_ids
            idx = item_ids.map { |i| @item_map[i] }.compact
-           predictions.values_at(*idx)
+           predictions = predictions.values_at(*idx)
          else
-           @rated[u].keys.each do |i|
+           @rated[u].keys.sort_by { |v| -v }.each do |i|
              predictions.delete_at(i)
            end
          end

          predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
          predictions = predictions.first(count) if count && !item_ids
+
+         # clamp *after* sorting
+         # also, only needed for returned predictions
+         if @min_rating
+           predictions.each do |pred|
+             pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
+           end
+         end
+
          predictions
        else
          # no items if user is unknown
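
Note on the hunk above: the change from `@rated[u].keys.each` to `@rated[u].keys.sort_by { |v| -v }.each` matters because `Array#delete_at` shifts every later element down by one. The sketch below uses made-up data (the `predictions` and `rated` values are illustrative only) to show why deleting by index in ascending order removes the wrong entries, which is consistent with the 0.2.1 changelog entry about `user_recs` returning rated items.

```ruby
predictions = %w[a b c d e]
rated = [1, 3] # positions that should be dropped: "b" and "d"

# Ascending order: after index 1 is removed, "d" slides to index 2,
# so deleting index 3 removes "e" instead.
ascending = predictions.dup
rated.each { |i| ascending.delete_at(i) }

# Descending order: removing the highest index first leaves the
# lower indices untouched, so both intended entries are dropped.
descending = predictions.dup
rated.sort_by { |v| -v }.each { |i| descending.delete_at(i) }
```

Here `ascending` still contains `"d"`, while `descending` keeps only the unrated entries.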
@@ -106,17 +130,38 @@ module Disco
        end
      end

+     def optimize_similar_items
+       check_fit
+       @item_index = create_index(@item_factors)
+     end
+     alias_method :optimize_item_recs, :optimize_similar_items
+
+     def optimize_similar_users
+       check_fit
+       @user_index = create_index(@user_factors)
+     end
+
      def similar_items(item_id, count: 5)
-       similar(item_id, @item_map, @item_factors, item_norms, count)
+       check_fit
+       similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
      end
      alias_method :item_recs, :similar_items

      def similar_users(user_id, count: 5)
-       similar(user_id, @user_map, @user_factors, user_norms, count)
+       check_fit
+       similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
      end

      private

+     def create_index(factors)
+       require "ngt"
+
+       index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+       index.batch_insert(factors)
+       index
+     end
+
      def user_norms
        @user_norms ||= norms(@user_factors)
      end
@@ -126,25 +171,41 @@ module Disco
      end

      def norms(factors)
-       norms = Numo::DFloat::Math.sqrt((factors * factors).sum(axis: 1))
+       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
        norms[norms.eq(0)] = 1e-10 # no zeros
        norms
      end

-     def similar(id, map, factors, norms, count)
+     def similar(id, map, factors, norms, count, index)
        i = map[id]
        if i
-         predictions = factors.dot(factors[i, true]) / norms
-
-         predictions =
-           map.keys.zip(predictions).map do |item_id, pred|
-             {item_id: item_id, score: pred}
+         if index && count
+           keys = map.keys
+           result = index.search(factors[i, true], size: count + 1)[1..-1]
+           result.map do |v|
+             {
+               # ids from batch_insert start at 1 instead of 0
+               item_id: keys[v[:id] - 1],
+               # convert cosine distance to cosine similarity
+               score: 1 - v[:distance]
+             }
            end
-
-         predictions.delete_at(i)
-         predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-         predictions = predictions.first(count) if count
-         predictions
+         else
+           predictions = factors.dot(factors[i, true]) / norms
+
+           predictions =
+             map.keys.zip(predictions).map do |item_id, pred|
+               {item_id: item_id, score: pred}
+             end
+
+           max_score = predictions.delete_at(i)[:score]
+           predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
+           predictions = predictions.first(count) if count
+           # divide by max score to get cosine similarity
+           # only need to do for returned records
+           predictions.each { |pred| pred[:score] /= max_score }
+           predictions
+         end
        else
          []
        end
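
Note on the hunk above: in the exact (non-index) path, each raw score is the dot product with the query row divided by the candidate row's own norm. The query's own entry therefore equals the query norm, and dividing the other scores by it yields cosine similarity in the range -1 to 1, matching the 0.2.0 changelog entry. Below is a small plain-Ruby sketch of that arithmetic; the `dot` and `norm` helpers and the sample `factors` are illustrative assumptions, and unlike the gem code the sketch keeps the query's own entry in the result.

```ruby
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def norm(a)
  Math.sqrt(dot(a, a))
end

# Made-up factor rows; row 0 is the query.
factors = [[3.0, 4.0], [1.0, 0.0], [-3.0, -4.0]]
i = 0

# Dot product with the query row, divided by each row's own norm.
scores = factors.map { |f| dot(f, factors[i]) / norm(f) }

# The query's own score is (q . q) / |q| = |q|, so dividing by it
# supplies the missing 1 / |q| factor of the cosine formula.
max_score = scores[i]
cosine = scores.map { |s| s / max_score }
```

Each value in `cosine` equals `dot(f, q) / (norm(f) * norm(q))` for the corresponding row.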
@@ -154,6 +215,9 @@ module Disco
        user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
        item_ids = train_set.map { |v| v[:item_id] }.uniq.sort

+       raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
+       raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+
        @user_map = user_ids.zip(user_ids.size.times).to_h
        @item_map = item_ids.zip(item_ids.size.times).to_h
      end
@@ -171,6 +235,29 @@ module Disco
        raise ArgumentError, "No training data" if train_set.empty?
      end

+     def check_fit
+       raise "Not fit" unless defined?(@implicit)
+     end
+
+     def to_dataset(dataset)
+       if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         dataset.keys.each do |k, v|
+           dataset[k.to_sym] ||= dataset.delete(k)
+         end
+         dataset.to_a
+       elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+         dataset.rename_vectors!(new_names)
+         dataset.to_a[0]
+       else
+         dataset
+       end
+     end
+
      def marshal_dump
        obj = {
          implicit: @implicit,
@@ -1,3 +1,3 @@
  module Disco
-   VERSION = "0.1.1"
+   VERSION = "0.2.3"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.3
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-11-14 00:00:00.000000000 Z
+ date: 2020-11-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: libmf
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
  - !ruby/object:Gem::Dependency
    name: numo-narray
    requirement: !ruby/object:Gem::Requirement
@@ -38,91 +38,7 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rake
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: minitest
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
- - !ruby/object:Gem::Dependency
-   name: activerecord
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: sqlite3
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: daru
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- description:
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -144,7 +60,7 @@ homepage: https://github.com/ankane/disco
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -159,8 +75,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    - !ruby/object:Gem::Version
      version: '0'
  requirements: []
- rubygems_version: 3.0.6
- signing_key:
+ rubygems_version: 3.1.4
+ signing_key:
  specification_version: 4
- summary: Collaborative filtering for Ruby
+ summary: Recommendations for Ruby and Rails using collaborative filtering
  test_files: []