disco 0.1.1 → 0.2.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e16ea9f41bc910a0966c4f16f0e48df98f40abb70d2d3b3bd2e8ba2080e57599
4
- data.tar.gz: '05490bda394fa0edf02ab33cb16cc65093d5b66b6317671cbdbfc429c3bd196c'
3
+ metadata.gz: e9b8792d465e2bd894ce9aaa5dabf79dd89e93337d838917c709ac7747b85772
4
+ data.tar.gz: 9d34a5124dc26f8a2ecb7e2ed3cbf524fe586c37c693d885e668974e24dfaf0a
5
5
  SHA512:
6
- metadata.gz: b12681372323e4bc323915923f91b5883ea97161e2f2a2846548657b25932ddc7ce364e09a6b932506e15e3cb45ff03c8c2cc22022f3a2cfa62d64bbe77e988d
7
- data.tar.gz: ea2d4200bfb4d4aed3481ebda8de4133c7af5bb4c4bb7c65685a726bbf661b2285579a507ea464ed49b3609d5350b45404dd6cf99401664e013870a6dee54b7f
6
+ metadata.gz: 658b48b75994a295382eb22908d4a5f1825b01bfc26f52428e993802c42f7ebb435a59e7f1262d17400d65eda886f1a4f38edff82cdeda96c2a0ce280602742f
7
+ data.tar.gz: c9acce77cae8a575c5814456247600367d5fea4eb85a52309e26cac643d107bc03c42c99ce094c2f9ec46a329fd2dfa97d2f58f0020b337feb88b56346630942
data/CHANGELOG.md CHANGED
@@ -1,3 +1,32 @@
1
+ ## 0.2.3 (2020-11-28)
2
+
3
+ - Added `predict` method
4
+ - Fixed bad recommendations and scores with `user_recs` and explicit feedback
5
+ - Fixed `item_ids` option for `user_recs`
6
+
7
+ ## 0.2.2 (n/a)
8
+
9
+ - Not available (released by previous gem owner)
10
+
11
+ ## 0.2.1 (2020-10-28)
12
+
13
+ - Fixed issue with `user_recs` returning rated items
14
+
15
+ ## 0.2.0 (2020-07-31)
16
+
17
+ - Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
18
+
19
+ ## 0.1.3 (2020-06-28)
20
+
21
+ - Added support for Rover
22
+ - Raise error when missing user or item ids
23
+ - Fixed string keys for Daru data frames
24
+ - `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
25
+
26
+ ## 0.1.2 (2020-03-26)
27
+
28
+ - Added experimental `optimize_item_recs` and `optimize_similar_users` methods
29
+
1
30
  ## 0.1.1 (2019-11-14)
2
31
 
3
32
  - Fixed Rails integration
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2019 Andrew Kane
1
+ Copyright (c) 2019-2020 Andrew Kane
2
2
 
3
3
  MIT License
4
4
 
data/README.md CHANGED
@@ -1,12 +1,12 @@
1
1
  # Disco
2
2
 
3
- :fire: Collaborative filtering for Ruby
3
+ :fire: Recommendations for Ruby and Rails using collaborative filtering
4
4
 
5
5
  - Supports user-based and item-based recommendations
6
6
  - Works with explicit and implicit feedback
7
- - Uses matrix factorization
7
+ - Uses high-performance matrix factorization
8
8
 
9
- [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)
9
+ [![Build Status](https://github.com/ankane/disco/workflows/build/badge.svg?branch=master)](https://github.com/ankane/disco/actions)
10
10
 
11
11
  ## Installation
12
12
 
@@ -64,10 +64,10 @@ Use the `count` option to specify the number of recommendations (default is 5)
64
64
  recommender.user_recs(user_id, count: 3)
65
65
  ```
66
66
 
67
- Get predicted ratings for specific items
67
+ Get predicted ratings for specific users and items
68
68
 
69
69
  ```ruby
70
- recommender.user_recs(user_id, item_ids: [1, 2, 3])
70
+ recommender.predict([{user_id: 1, item_id: 2}, {user_id: 2, item_id: 4}])
71
71
  ```
72
72
 
73
73
  Get similar users
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
101
101
  ```ruby
102
102
  views = Ahoy::Event.
103
103
  where(name: "Viewed post").
104
- group(:user_id, "properties->>'post_id'") # postgres syntax
104
+ group(:user_id).
105
+ group("properties->>'post_id'"). # postgres syntax
105
106
  count
106
107
 
107
108
  data =
108
109
  views.map do |(user_id, post_id), count|
109
110
  {
110
111
  user_id: user_id,
111
- post_id: post_id,
112
+ item_id: post_id,
112
113
  value: count
113
114
  }
114
115
  end
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)
202
203
 
203
204
  ## Algorithms
204
205
 
205
- Disco uses matrix factorization.
206
+ Disco uses high-performance matrix factorization.
206
207
 
207
208
  - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
208
209
  - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
236
237
  - For user-based recommendations, show new users the most popular items.
237
238
  - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
238
239
 
239
- ## Daru
240
+ ## Data
240
241
 
241
- Disco works with Daru data frames
242
+ Data can be an array of hashes
242
243
 
243
244
  ```ruby
244
- data = Daru::DataFrame.from_csv("ratings.csv")
245
- recommender.fit(data)
245
+ [{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
246
+ ```
247
+
248
+ Or a Rover data frame
249
+
250
+ ```ruby
251
+ Rover.read_csv("ratings.csv")
246
252
  ```
247
253
 
254
+ Or a Daru data frame
255
+
256
+ ```ruby
257
+ Daru::DataFrame.from_csv("ratings.csv")
258
+ ```
259
+
260
+ ## Faster Similarity
261
+
262
+ If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
263
+
264
+ Add this line to your application’s Gemfile:
265
+
266
+ ```ruby
267
+ gem 'ngt', '>= 0.3.0'
268
+ ```
269
+
270
+ Speed up item-based recommendations with:
271
+
272
+ ```ruby
273
+ model.optimize_item_recs
274
+ ```
275
+
276
+ Speed up similar users with:
277
+
278
+ ```ruby
279
+ model.optimize_similar_users
280
+ ```
281
+
282
+ This should be called after fitting or loading the model.
283
+
248
284
  ## Reference
249
285
 
250
286
  Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
280
316
  - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
281
317
  - Write, clarify, or fix documentation
282
318
  - Suggest or add new features
319
+
320
+ To get started with development:
321
+
322
+ ```sh
323
+ git clone https://github.com/ankane/disco.git
324
+ cd disco
325
+ bundle install
326
+ bundle exec rake test
327
+ ```
data/lib/disco/data.rb CHANGED
@@ -36,8 +36,7 @@ module Disco
36
36
 
37
37
  return dest if File.exist?(dest)
38
38
 
39
- temp_dir ||= File.dirname(Tempfile.new("disco"))
40
- temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
39
+ temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name
41
40
 
42
41
  digest = Digest::SHA2.new
43
42
 
data/lib/disco/recommender.rb CHANGED
@@ -9,14 +9,8 @@ module Disco
9
9
  end
10
10
 
11
11
  def fit(train_set, validation_set: nil)
12
- if defined?(Daru)
13
- if train_set.is_a?(Daru::DataFrame)
14
- train_set = train_set.to_a[0]
15
- end
16
- if validation_set.is_a?(Daru::DataFrame)
17
- validation_set = validation_set.to_a[0]
18
- end
19
- end
12
+ train_set = to_dataset(train_set)
13
+ validation_set = to_dataset(validation_set) if validation_set
20
14
 
21
15
  @implicit = !train_set.any? { |v| v[:rating] }
22
16
 
@@ -70,17 +64,38 @@ module Disco
70
64
 
71
65
  @global_mean = model.bias
72
66
 
73
- # TODO read from LIBMF directly to Numo for performance
74
- @user_factors = Numo::DFloat.cast(model.p_factors)
75
- @item_factors = Numo::DFloat.cast(model.q_factors)
67
+ @user_factors = model.p_factors(format: :numo)
68
+ @item_factors = model.q_factors(format: :numo)
69
+
70
+ @user_index = nil
71
+ @item_index = nil
72
+ end
73
+
74
+ # generates a prediction even if a user has already rated the item
75
+ def predict(data)
76
+ data = to_dataset(data)
77
+
78
+ u = data.map { |v| @user_map[v[:user_id]] }
79
+ i = data.map { |v| @item_map[v[:item_id]] }
80
+
81
+ new_index = data.each_index.select { |index| u[index].nil? || i[index].nil? }
82
+ new_index.each do |j|
83
+ u[j] = 0
84
+ i[j] = 0
85
+ end
86
+
87
+ predictions = @user_factors[u, true].inner(@item_factors[i, true])
88
+ predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
89
+ predictions[new_index] = @global_mean
90
+ predictions.to_a
76
91
  end
77
92
 
78
93
  def user_recs(user_id, count: 5, item_ids: nil)
94
+ check_fit
79
95
  u = @user_map[user_id]
80
96
 
81
97
  if u
82
- predictions = @global_mean + @item_factors.dot(@user_factors[u, true])
83
- predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
98
+ predictions = @item_factors.inner(@user_factors[u, true])
84
99
 
85
100
  predictions =
86
101
  @item_map.keys.zip(predictions).map do |item_id, pred|
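Note on the hunk above: the new `predict` method scores each (user, item) pair with a row-wise inner product, clips the result to the rating range, and returns the global mean for any pair whose user or item was not seen during fitting. A minimal plain-Ruby sketch of that behavior follows. The constants and the `sketch_predict` name are hypothetical stand-ins for the fitted model's state, not part of Disco's API, and plain arrays are used in place of Numo arrays.

```ruby
# Hypothetical stand-ins for a fitted model's state; not Disco's real internals.
USER_MAP     = {1 => 0, 2 => 1}.freeze
ITEM_MAP     = {"a" => 0, "b" => 1}.freeze
USER_FACTORS = [[1.0, 2.0], [0.5, 0.5]].freeze
ITEM_FACTORS = [[2.0, 1.0], [4.0, 4.0]].freeze
GLOBAL_MEAN  = 3.0
MIN_RATING   = 1.0
MAX_RATING   = 5.0

def sketch_predict(data)
  data.map do |row|
    u = USER_MAP[row[:user_id]]
    i = ITEM_MAP[row[:item_id]]

    # unknown user or item: fall back to the global mean
    next GLOBAL_MEAN if u.nil? || i.nil?

    # dot product of the user's and the item's factor rows
    score = USER_FACTORS[u].zip(ITEM_FACTORS[i]).sum { |a, b| a * b }

    # keep the prediction inside the observed rating range
    score.clamp(MIN_RATING, MAX_RATING)
  end
end
```

Calling `sketch_predict([{user_id: 1, item_id: "a"}, {user_id: 9, item_id: "a"}])` scores the first pair from the factors and falls back to `GLOBAL_MEAN` for the second, since user 9 is unknown.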
@@ -89,15 +104,24 @@ module Disco
89
104
 
90
105
  if item_ids
91
106
  idx = item_ids.map { |i| @item_map[i] }.compact
92
- predictions.values_at(*idx)
107
+ predictions = predictions.values_at(*idx)
93
108
  else
94
- @rated[u].keys.each do |i|
109
+ @rated[u].keys.sort_by { |v| -v }.each do |i|
95
110
  predictions.delete_at(i)
96
111
  end
97
112
  end
98
113
 
99
114
  predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
100
115
  predictions = predictions.first(count) if count && !item_ids
116
+
117
+ # clamp *after* sorting
118
+ # also, only needed for returned predictions
119
+ if @min_rating
120
+ predictions.each do |pred|
121
+ pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
122
+ end
123
+ end
124
+
101
125
  predictions
102
126
  else
103
127
  # no items if user is unknown
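Note on the hunk above: the change from `@rated[u].keys.each` to `@rated[u].keys.sort_by { |v| -v }.each` matters because `Array#delete_at` shifts every later element down by one. Deleting in ascending index order therefore removes the wrong elements after the first deletion. This appears to be the cause of the "returning rated items" fix listed in the changelog, though the diff itself does not say so. A small sketch with hypothetical helper names:

```ruby
# Remove the given indexes in the order supplied (buggy when ascending).
def remove_in_given_order(items, indexes)
  result = items.dup
  indexes.each { |i| result.delete_at(i) }
  result
end

# Remove the given indexes from highest to lowest, so earlier
# deletions never shift the positions still waiting to be removed.
def remove_descending(items, indexes)
  result = items.dup
  indexes.sort_by { |v| -v }.each { |i| result.delete_at(i) }
  result
end

items = %w[a b c d e]
rated = [1, 3] # intent: drop "b" and "d"

wrong = remove_in_given_order(items, rated) # drops "b", then "e"
right = remove_descending(items, rated)     # drops "d", then "b"
```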
@@ -106,17 +130,38 @@ module Disco
106
130
  end
107
131
  end
108
132
 
133
+ def optimize_similar_items
134
+ check_fit
135
+ @item_index = create_index(@item_factors)
136
+ end
137
+ alias_method :optimize_item_recs, :optimize_similar_items
138
+
139
+ def optimize_similar_users
140
+ check_fit
141
+ @user_index = create_index(@user_factors)
142
+ end
143
+
109
144
  def similar_items(item_id, count: 5)
110
- similar(item_id, @item_map, @item_factors, item_norms, count)
145
+ check_fit
146
+ similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
111
147
  end
112
148
  alias_method :item_recs, :similar_items
113
149
 
114
150
  def similar_users(user_id, count: 5)
115
- similar(user_id, @user_map, @user_factors, user_norms, count)
151
+ check_fit
152
+ similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
116
153
  end
117
154
 
118
155
  private
119
156
 
157
+ def create_index(factors)
158
+ require "ngt"
159
+
160
+ index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
161
+ index.batch_insert(factors)
162
+ index
163
+ end
164
+
120
165
  def user_norms
121
166
  @user_norms ||= norms(@user_factors)
122
167
  end
@@ -126,25 +171,41 @@ module Disco
126
171
  end
127
172
 
128
173
  def norms(factors)
129
- norms = Numo::DFloat::Math.sqrt((factors * factors).sum(axis: 1))
174
+ norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
130
175
  norms[norms.eq(0)] = 1e-10 # no zeros
131
176
  norms
132
177
  end
133
178
 
134
- def similar(id, map, factors, norms, count)
179
+ def similar(id, map, factors, norms, count, index)
135
180
  i = map[id]
136
181
  if i
137
- predictions = factors.dot(factors[i, true]) / norms
138
-
139
- predictions =
140
- map.keys.zip(predictions).map do |item_id, pred|
141
- {item_id: item_id, score: pred}
182
+ if index && count
183
+ keys = map.keys
184
+ result = index.search(factors[i, true], size: count + 1)[1..-1]
185
+ result.map do |v|
186
+ {
187
+ # ids from batch_insert start at 1 instead of 0
188
+ item_id: keys[v[:id] - 1],
189
+ # convert cosine distance to cosine similarity
190
+ score: 1 - v[:distance]
191
+ }
142
192
  end
143
-
144
- predictions.delete_at(i)
145
- predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
146
- predictions = predictions.first(count) if count
147
- predictions
193
+ else
194
+ predictions = factors.dot(factors[i, true]) / norms
195
+
196
+ predictions =
197
+ map.keys.zip(predictions).map do |item_id, pred|
198
+ {item_id: item_id, score: pred}
199
+ end
200
+
201
+ max_score = predictions.delete_at(i)[:score]
202
+ predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
203
+ predictions = predictions.first(count) if count
204
+ # divide by max score to get cosine similarity
205
+ # only need to do for returned records
206
+ predictions.each { |pred| pred[:score] /= max_score }
207
+ predictions
208
+ end
148
209
  else
149
210
  []
150
211
  end
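Note on the hunk above: in the non-indexed branch, each raw score is the dot product with the query row divided by the *other* row's norm. The query row's own score therefore equals the query row's norm, and dividing every score by it yields cosine similarity, which is why the scores now fall between -1 and 1. A plain-Ruby sketch of the arithmetic follows; the helper names are hypothetical, and plain arrays stand in for the Numo arrays used in the gem.

```ruby
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def norm(a)
  Math.sqrt(dot(a, a))
end

# Returns the cosine similarity of every row in `factors` to row `i`.
def cosine_to_row(factors, i)
  # step 1: dot product with row i, divided by each row's own norm
  scores = factors.map { |f| dot(f, factors[i]) / norm(f) }

  # step 2: row i's own score is dot(f_i, f_i) / norm(f_i) = norm(f_i)
  max_score = scores[i]

  # step 3: dividing by norm(f_i) completes the cosine formula
  scores.map { |s| s / max_score }
end

factors = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
similarities = cosine_to_row(factors, 0)
```

Unlike the gem code, this sketch keeps the query row in the result (with similarity 1) rather than deleting it, and it does not guard against zero-norm rows.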
@@ -154,6 +215,9 @@ module Disco
154
215
  user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
155
216
  item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
156
217
 
218
+ raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
219
+ raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
220
+
157
221
  @user_map = user_ids.zip(user_ids.size.times).to_h
158
222
  @item_map = item_ids.zip(item_ids.size.times).to_h
159
223
  end
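Note on the hunk above: the id maps assign each distinct user or item id a zero-based row position, and the new guards reject data with a missing id. A minimal sketch of that mapping follows. `build_index_map` is a hypothetical helper, not a Disco method, and it checks for `nil` before sorting because sorting a mix of `nil` and integers would itself raise.

```ruby
# Map each distinct id to its position in sorted order.
def build_index_map(ids, label)
  uniq = ids.uniq
  raise ArgumentError, "Missing #{label}" if uniq.any?(&:nil?)

  uniq.sort.each_with_index.to_h
end

user_map = build_index_map([7, 3, 7, 5], "user_id")
```

Here `user_map` maps the smallest id to row 0, the next to row 1, and so on.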
@@ -171,6 +235,29 @@ module Disco
171
235
  raise ArgumentError, "No training data" if train_set.empty?
172
236
  end
173
237
 
238
+ def check_fit
239
+ raise "Not fit" unless defined?(@implicit)
240
+ end
241
+
242
+ def to_dataset(dataset)
243
+ if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
244
+ # convert keys to symbols
245
+ dataset = dataset.dup
246
+ dataset.keys.each do |k, v|
247
+ dataset[k.to_sym] ||= dataset.delete(k)
248
+ end
249
+ dataset.to_a
250
+ elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
251
+ # convert keys to symbols
252
+ dataset = dataset.dup
253
+ new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
254
+ dataset.rename_vectors!(new_names)
255
+ dataset.to_a[0]
256
+ else
257
+ dataset
258
+ end
259
+ end
260
+
174
261
  def marshal_dump
175
262
  obj = {
176
263
  implicit: @implicit,
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Disco
2
- VERSION = "0.1.1"
2
+ VERSION = "0.2.3"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: disco
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.2.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-11-14 00:00:00.000000000 Z
11
+ date: 2020-11-28 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: libmf
@@ -16,14 +16,14 @@ dependencies:
16
16
  requirements:
17
17
  - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: 0.1.3
19
+ version: 0.2.0
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: 0.1.3
26
+ version: 0.2.0
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: numo-narray
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -38,91 +38,7 @@ dependencies:
38
38
  - - ">="
39
39
  - !ruby/object:Gem::Version
40
40
  version: '0'
41
- - !ruby/object:Gem::Dependency
42
- name: bundler
43
- requirement: !ruby/object:Gem::Requirement
44
- requirements:
45
- - - ">="
46
- - !ruby/object:Gem::Version
47
- version: '0'
48
- type: :development
49
- prerelease: false
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - ">="
53
- - !ruby/object:Gem::Version
54
- version: '0'
55
- - !ruby/object:Gem::Dependency
56
- name: rake
57
- requirement: !ruby/object:Gem::Requirement
58
- requirements:
59
- - - ">="
60
- - !ruby/object:Gem::Version
61
- version: '0'
62
- type: :development
63
- prerelease: false
64
- version_requirements: !ruby/object:Gem::Requirement
65
- requirements:
66
- - - ">="
67
- - !ruby/object:Gem::Version
68
- version: '0'
69
- - !ruby/object:Gem::Dependency
70
- name: minitest
71
- requirement: !ruby/object:Gem::Requirement
72
- requirements:
73
- - - ">="
74
- - !ruby/object:Gem::Version
75
- version: '5'
76
- type: :development
77
- prerelease: false
78
- version_requirements: !ruby/object:Gem::Requirement
79
- requirements:
80
- - - ">="
81
- - !ruby/object:Gem::Version
82
- version: '5'
83
- - !ruby/object:Gem::Dependency
84
- name: activerecord
85
- requirement: !ruby/object:Gem::Requirement
86
- requirements:
87
- - - ">="
88
- - !ruby/object:Gem::Version
89
- version: '0'
90
- type: :development
91
- prerelease: false
92
- version_requirements: !ruby/object:Gem::Requirement
93
- requirements:
94
- - - ">="
95
- - !ruby/object:Gem::Version
96
- version: '0'
97
- - !ruby/object:Gem::Dependency
98
- name: sqlite3
99
- requirement: !ruby/object:Gem::Requirement
100
- requirements:
101
- - - ">="
102
- - !ruby/object:Gem::Version
103
- version: '0'
104
- type: :development
105
- prerelease: false
106
- version_requirements: !ruby/object:Gem::Requirement
107
- requirements:
108
- - - ">="
109
- - !ruby/object:Gem::Version
110
- version: '0'
111
- - !ruby/object:Gem::Dependency
112
- name: daru
113
- requirement: !ruby/object:Gem::Requirement
114
- requirements:
115
- - - ">="
116
- - !ruby/object:Gem::Version
117
- version: '0'
118
- type: :development
119
- prerelease: false
120
- version_requirements: !ruby/object:Gem::Requirement
121
- requirements:
122
- - - ">="
123
- - !ruby/object:Gem::Version
124
- version: '0'
125
- description:
41
+ description:
126
42
  email: andrew@chartkick.com
127
43
  executables: []
128
44
  extensions: []
@@ -144,7 +60,7 @@ homepage: https://github.com/ankane/disco
144
60
  licenses:
145
61
  - MIT
146
62
  metadata: {}
147
- post_install_message:
63
+ post_install_message:
148
64
  rdoc_options: []
149
65
  require_paths:
150
66
  - lib
@@ -159,8 +75,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
159
75
  - !ruby/object:Gem::Version
160
76
  version: '0'
161
77
  requirements: []
162
- rubygems_version: 3.0.6
163
- signing_key:
78
+ rubygems_version: 3.1.4
79
+ signing_key:
164
80
  specification_version: 4
165
- summary: Collaborative filtering for Ruby
81
+ summary: Recommendations for Ruby and Rails using collaborative filtering
166
82
  test_files: []