disco 0.1.2 → 0.2.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 04d278a7daf8187ac8a5eadaa279c98a0a51a8cf0ad596e793198dcc9141233a
-   data.tar.gz: '0916f7cfb91d5bf48ce1186502f15647c102eba54e07bdc33eb042b75e1fb0c6'
+   metadata.gz: e4a978d2eec39ca280142c49fb4ef4be2e1ad4f35dfa4d977941f46d5d34b466
+   data.tar.gz: 8a29a54bba5ac8b715294e2fce4e34fa1b11442b1800c388807c60b9520ced23
  SHA512:
-   metadata.gz: a8e977bcf2988e8e4cb85b13959446d068e3a41feeca26f3789ff7aa0a454258340bc81fb3adb470e0143cc6027cd803ef034900cc29db4648b01f855f6cb011
-   data.tar.gz: defc71dd93461a114338f0737cfa3eccae47605e2922aaf12d960a0cb6309131dbba497f7c7d125e962edd055ff7df898cd406544971ed75906cb8c1db6004cf
+   metadata.gz: 99376dd48cce340a4fdcb0d76c93b03af494d88167e2caaca0d186fcf5d2303f2524884e0c712c2f8e3d7be79a92b029a8d5fa726bb94826315f283afea0f74b
+   data.tar.gz: eeb8c480098616f93d6c7e39a1bb57e2feefa6af3696c407791ff6f052450eb035f1d1659ded70d7b5fbbbe8cff9f7309118828a454b1d4f9d459321b90035cf
data/CHANGELOG.md CHANGED
@@ -1,3 +1,34 @@
+ ## 0.2.4 (2021-02-15)
+
+ - Added `user_ids` and `item_ids` methods
+ - Added `user_id` argument to `user_factors`
+ - Added `item_id` argument to `item_factors`
+
+ ## 0.2.3 (2020-11-28)
+
+ - Added `predict` method
+ - Fixed bad recommendations and scores with `user_recs` and explicit feedback
+ - Fixed `item_ids` option for `user_recs`
+
+ ## 0.2.2 (n/a)
+
+ - Not available (released by previous gem owner)
+
+ ## 0.2.1 (2020-10-28)
+
+ - Fixed issue with `user_recs` returning rated items
+
+ ## 0.2.0 (2020-07-31)
+
+ - Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+ ## 0.1.3 (2020-06-28)
+
+ - Added support for Rover
+ - Raise error when missing user or item ids
+ - Fixed string keys for Daru data frames
+ - `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
  ## 0.1.2 (2020-03-26)

  - Added experimental `optimize_item_recs` and `optimize_similar_users` methods
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2019 Andrew Kane
+ Copyright (c) 2019-2021 Andrew Kane

  MIT License

data/README.md CHANGED
@@ -1,12 +1,12 @@
  # Disco

- :fire: Collaborative filtering for Ruby
+ :fire: Recommendations for Ruby and Rails using collaborative filtering

  - Supports user-based and item-based recommendations
  - Works with explicit and implicit feedback
  - Uses high-performance matrix factorization

- [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)
+ [![Build Status](https://github.com/ankane/disco/workflows/build/badge.svg?branch=master)](https://github.com/ankane/disco/actions)

  ## Installation

@@ -46,13 +46,13 @@ recommender.fit([

  > Use `value` instead of rating for implicit feedback

- Get user-based (user-item) recommendations - “users like you also liked”
+ Get user-based recommendations - “users like you also liked”

  ```ruby
  recommender.user_recs(user_id)
  ```

- Get item-based (item-item) recommendations - “users who liked this item also liked”
+ Get item-based recommendations - “users who liked this item also liked”

  ```ruby
  recommender.item_recs(item_id)
@@ -64,10 +64,10 @@ Use the `count` option to specify the number of recommendations (default is 5)
  recommender.user_recs(user_id, count: 3)
  ```

- Get predicted ratings for specific items
+ Get predicted ratings for specific users and items

  ```ruby
- recommender.user_recs(user_id, item_ids: [1, 2, 3])
+ recommender.predict([{user_id: 1, item_id: 2}, {user_id: 2, item_id: 4}])
  ```

  Get similar users
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
  ```ruby
  views = Ahoy::Event.
    where(name: "Viewed post").
-   group(:user_id, "properties->>'post_id'") # postgres syntax
+   group(:user_id).
+   group("properties->>'post_id'"). # postgres syntax
    count

  data =
    views.map do |(user_id, post_id), count|
      {
        user_id: user_id,
-       post_id: post_id,
+       item_id: post_id,
        value: count
      }
    end
@@ -244,20 +245,26 @@ Data can be an array of hashes
  [{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
  ```

+ Or a Rover data frame
+
+ ```ruby
+ Rover.read_csv("ratings.csv")
+ ```
+
  Or a Daru data frame

  ```ruby
  Daru::DataFrame.from_csv("ratings.csv")
  ```

- ## Faster Similarity [experimental]
+ ## Faster Similarity

  If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.

  Add this line to your application’s Gemfile:

  ```ruby
- gem 'ngt', '>= 0.2.3'
+ gem 'ngt', '>= 0.3.0'
  ```

  Speed up item-based recommendations with:
@@ -276,19 +283,33 @@ This should be called after fitting or loading the model.

  ## Reference

+ Get ids
+
+ ```ruby
+ recommender.user_ids
+ recommender.item_ids
+ ```
+
  Get the global mean

  ```ruby
  recommender.global_mean
  ```

- Get the factors
+ Get factors

  ```ruby
  recommender.user_factors
  recommender.item_factors
  ```

+ Get factors for specific users and items
+
+ ```ruby
+ recommender.user_factors(user_id)
+ recommender.item_factors(item_id)
+ ```
+
  ## Credits

  Thanks to:
@@ -309,3 +330,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
  - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
  - Write, clarify, or fix documentation
  - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/disco.git
+ cd disco
+ bundle install
+ bundle exec rake test
+ ```
data/lib/disco/data.rb CHANGED
@@ -36,8 +36,7 @@ module Disco

      return dest if File.exist?(dest)

-     temp_dir ||= File.dirname(Tempfile.new("disco"))
-     temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+     temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name

      digest = Digest::SHA2.new

data/lib/disco/recommender.rb CHANGED
@@ -1,38 +1,32 @@
  module Disco
    class Recommender
-     attr_reader :global_mean, :item_factors, :user_factors
+     attr_reader :global_mean

      def initialize(factors: 8, epochs: 20, verbose: nil)
        @factors = factors
        @epochs = epochs
        @verbose = verbose
+       @user_map = {}
+       @item_map = {}
      end

      def fit(train_set, validation_set: nil)
-       if defined?(Daru)
-         if train_set.is_a?(Daru::DataFrame)
-           train_set = train_set.to_a[0]
-         end
-         if validation_set.is_a?(Daru::DataFrame)
-           validation_set = validation_set.to_a[0]
-         end
-       end
+       train_set = to_dataset(train_set)
+       validation_set = to_dataset(validation_set) if validation_set

-       @implicit = !train_set.any? { |v| v[:rating] }
+       check_training_set(train_set)

+       @implicit = !train_set.any? { |v| v[:rating] }
        unless @implicit
-         ratings = train_set.map { |o| o[:rating] }
-         check_ratings(ratings)
-         @min_rating = ratings.min
-         @max_rating = ratings.max
+         check_ratings(train_set)
+         @min_rating, @max_rating = train_set.minmax_by { |o| o[:rating] }.map { |o| o[:rating] }

          if validation_set
-           check_ratings(validation_set.map { |o| o[:rating] })
+           check_ratings(validation_set)
          end
        end

-       check_training_set(train_set)
-       create_maps(train_set)
+       update_maps(train_set)

        @rated = Hash.new { |hash, key| hash[key] = {} }
        input = []
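The reworked `fit` passes the rows themselves to `check_ratings` and reads the rating range with `minmax_by` instead of building a separate `ratings` array. A minimal sketch of that pattern on plain hashes follows; `rating_bounds` is a hypothetical helper written for illustration and is not part of the gem.

```ruby
# Hypothetical helper (not part of Disco): validate the ratings in an array
# of hashes, then return [min_rating, max_rating].
def rating_bounds(rows)
  raise ArgumentError, "Missing ratings" unless rows.all? { |r| !r[:rating].nil? }
  raise ArgumentError, "Ratings must be numeric" unless rows.all? { |r| r[:rating].is_a?(Numeric) }

  # minmax_by returns the rows with the smallest and largest rating;
  # map then extracts the two rating values
  rows.minmax_by { |r| r[:rating] }.map { |r| r[:rating] }
end

rows = [
  {user_id: 1, item_id: 1, rating: 5},
  {user_id: 2, item_id: 1, rating: 3},
  {user_id: 2, item_id: 2, rating: 4}
]
min_rating, max_rating = rating_bounds(rows)
```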
@@ -77,12 +71,31 @@ module Disco
        @item_index = nil
      end

+     # generates a prediction even if a user has already rated the item
+     def predict(data)
+       data = to_dataset(data)
+
+       u = data.map { |v| @user_map[v[:user_id]] }
+       i = data.map { |v| @item_map[v[:item_id]] }
+
+       new_index = data.each_index.select { |index| u[index].nil? || i[index].nil? }
+       new_index.each do |j|
+         u[j] = 0
+         i[j] = 0
+       end
+
+       predictions = @user_factors[u, true].inner(@item_factors[i, true])
+       predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
+       predictions[new_index] = @global_mean
+       predictions.to_a
+     end
+
      def user_recs(user_id, count: 5, item_ids: nil)
+       check_fit
        u = @user_map[user_id]

        if u
-         predictions = @global_mean + @item_factors.dot(@user_factors[u, true])
-         predictions.inplace.clip(@min_rating, @max_rating) if @min_rating
+         predictions = @item_factors.inner(@user_factors[u, true])

          predictions =
            @item_map.keys.zip(predictions).map do |item_id, pred|
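The new `predict` method takes the inner product of a user's and an item's factor vectors, clips it to the observed rating range, and substitutes the global mean when either id is unknown. The sketch below mirrors that control flow with plain Ruby arrays instead of Numo matrices. The `model` hash, its keys, and all numbers are made up for illustration.

```ruby
# Illustrative only: a hash of plain arrays stands in for the fitted model.
def predict(model, rows)
  rows.map do |row|
    u = model[:user_map][row[:user_id]]
    i = model[:item_map][row[:item_id]]

    # unknown user or item: fall back to the global mean
    next model[:global_mean] if u.nil? || i.nil?

    # inner product of the two factor vectors, clamped to the rating range
    score = model[:user_factors][u].zip(model[:item_factors][i]).sum { |a, b| a * b }
    score.clamp(model[:min_rating], model[:max_rating])
  end
end

model = {
  user_map: {"alice" => 0, "bob" => 1},
  item_map: {"matrix" => 0, "alien" => 1},
  user_factors: [[1.0, 2.0], [0.5, 0.5]],
  item_factors: [[2.0, 1.0], [6.0, 6.0]],
  global_mean: 3.5,
  min_rating: 1.0,
  max_rating: 5.0
}

predictions = predict(model, [
  {user_id: "alice", item_id: "matrix"}, # 1*2 + 2*1 = 4.0
  {user_id: "bob", item_id: "alien"},    # 0.5*6 + 0.5*6 = 6.0, clamped to 5.0
  {user_id: "carol", item_id: "alien"}   # unknown user, global mean
])
```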
@@ -91,15 +104,24 @@ module Disco

          if item_ids
            idx = item_ids.map { |i| @item_map[i] }.compact
-           predictions.values_at(*idx)
+           predictions = predictions.values_at(*idx)
          else
-           @rated[u].keys.each do |i|
+           @rated[u].keys.sort_by { |v| -v }.each do |i|
              predictions.delete_at(i)
            end
          end

          predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
          predictions = predictions.first(count) if count && !item_ids
+
+         # clamp *after* sorting
+         # also, only needed for returned predictions
+         if @min_rating
+           predictions.each do |pred|
+             pred[:score] = pred[:score].clamp(@min_rating, @max_rating)
+           end
+         end
+
          predictions
        else
          # no items if user is unknown
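The switch to `@rated[u].keys.sort_by { |v| -v }` matters because `Array#delete_at` shifts every later element one position to the left. Deleting in ascending index order therefore removes the wrong elements after the first deletion. A small standalone demonstration with made-up data:

```ruby
predictions = %w[a b c d e]
rated = [1, 3] # indexes of items the user has already rated ("b" and "d")

# Ascending order: after deleting index 1, "d" has moved to index 2,
# so deleting index 3 removes "e" instead.
ascending = predictions.dup
rated.each { |i| ascending.delete_at(i) }

# Descending order: higher indexes go first, so lower ones stay valid.
descending = predictions.dup
rated.sort_by { |v| -v }.each { |i| descending.delete_at(i) }
```

With ascending deletion the rated item "d" survives and the unrated item "e" is dropped. Descending deletion leaves exactly the unrated items.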
@@ -109,21 +131,51 @@ module Disco
      end

      def optimize_similar_items
+       check_fit
        @item_index = create_index(@item_factors)
      end
      alias_method :optimize_item_recs, :optimize_similar_items

      def optimize_similar_users
+       check_fit
        @user_index = create_index(@user_factors)
      end

      def similar_items(item_id, count: 5)
-       similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
+       check_fit
+       similar(item_id, @item_map, @item_factors, @item_index ? nil : item_norms, count, @item_index)
      end
      alias_method :item_recs, :similar_items

      def similar_users(user_id, count: 5)
-       similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
+       check_fit
+       similar(user_id, @user_map, @user_factors, @user_index ? nil : user_norms, count, @user_index)
+     end
+
+     def user_ids
+       @user_map.keys
+     end
+
+     def item_ids
+       @item_map.keys
+     end
+
+     def user_factors(user_id = nil)
+       if user_id
+         u = @user_map[user_id]
+         @user_factors[u, true] if u
+       else
+         @user_factors
+       end
+     end
+
+     def item_factors(item_id = nil)
+       if item_id
+         i = @item_map[item_id]
+         @item_factors[i, true] if i
+       else
+         @item_factors
+       end
      end

      private
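`user_factors` and `item_factors` change from plain `attr_reader`s to methods with an optional id: no argument returns every row, a known id returns that row, and an unknown id returns `nil`. The sketch below shows the same lookup shape using arrays in place of Numo matrices. `TinyModel` is a hypothetical stand-in, not a class from the gem.

```ruby
# Hypothetical stand-in for the recommender; arrays replace Numo matrices.
class TinyModel
  def initialize(user_map, user_factors)
    @user_map = user_map         # id => row index
    @user_factors = user_factors # one factor row per user
  end

  def user_ids
    @user_map.keys
  end

  # no argument: all rows; known id: that row; unknown id: nil
  def user_factors(user_id = nil)
    if user_id
      u = @user_map[user_id]
      @user_factors[u] if u
    else
      @user_factors
    end
  end
end

model = TinyModel.new({"alice" => 0, "bob" => 1}, [[0.1, 0.2], [0.3, 0.4]])
bob = model.user_factors("bob")
missing = model.user_factors("zed")
```

Because the guard is `if user_id`, passing `nil` or `false` as the id behaves like passing no argument at all.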
@@ -131,8 +183,11 @@ module Disco
      def create_index(factors)
        require "ngt"

+       # could speed up search with normalized cosine
+       # https://github.com/yahoojapan/NGT/issues/36
        index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
-       index.batch_insert(factors)
+       ids = index.batch_insert(factors)
+       raise "Unexpected ids. Please report a bug." if ids.first != 1 || ids.last != factors.shape[0]
        index
      end

@@ -145,7 +200,7 @@ module Disco
      end

      def norms(factors)
-       norms = Numo::DFloat::Math.sqrt((factors * factors).sum(axis: 1))
+       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
        norms[norms.eq(0)] = 1e-10 # no zeros
        norms
      end
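`norms` computes one Euclidean norm per factor row and replaces exact zeros with `1e-10` so that later divisions stay finite. The same computation in plain Ruby, as a sketch that uses arrays rather than the Numo types in the gem (`row_norms` is an illustrative name):

```ruby
# Row-wise Euclidean norms; zero norms become 1e-10 so that dividing by
# them later does not produce NaN or Infinity.
def row_norms(factors)
  factors.map do |row|
    norm = Math.sqrt(row.sum { |v| v * v })
    norm.zero? ? 1e-10 : norm
  end
end

norms = row_norms([[3.0, 4.0], [0.0, 0.0]])
```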
@@ -165,20 +220,21 @@ module Disco
              }
            end
          else
-           predictions = factors.dot(factors[i, true]) / norms
+           # cosine similarity without norms[i]
+           # otherwise, denominator would be (norms[i] * norms)
+           predictions = factors.inner(factors[i, true]) / norms

            predictions =
              map.keys.zip(predictions).map do |item_id, pred|
                {item_id: item_id, score: pred}
              end

-           max_score = predictions.delete_at(i)[:score]
+           predictions.delete_at(i)
            predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
            predictions = predictions.first(count) if count
-           # divide by max score to get cosine similarity
+           # divide by norms[i] to get cosine similarity
            # only need to do for returned records
-           # could alternatively do cosine distance = 1 - cosine similarity
-           # predictions.each { |pred| pred[:score] /= max_score }
+           predictions.each { |pred| pred[:score] /= norms[i] }
            predictions
          end
        else
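The non-indexed branch of `similar` ranks candidates by the inner product divided by each candidate's norm, then divides only the returned rows by `norms[i]`. Since `norms[i]` is the same positive constant for every candidate, omitting it does not change the ordering, and the final division yields the cosine similarity. A plain-Ruby sketch of this deferred normalization follows. It assumes nonzero vectors, and the helper names and sample data are illustrative.

```ruby
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Returns the `count` rows most similar to row i, scored by cosine similarity.
def similar(factors, i, count)
  norms = factors.map { |row| Math.sqrt(dot(row, row)) }

  # divide by the candidate's norm only; norms[i] is applied later
  scored = factors.each_with_index.map do |row, j|
    {index: j, score: dot(row, factors[i]) / norms[j]}
  end

  scored.delete_at(i) # drop the query row itself
  top = scored.sort_by { |s| -s[:score] }.first(count)

  # finish the cosine similarity for the returned rows only
  top.each { |s| s[:score] /= norms[i] }
end

factors = [[2.0, 0.0], [4.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
result = similar(factors, 0, 2)
```

Row 1 points in the same direction as the query and scores 1.0, and the orthogonal row 2 scores 0.0, so both stay inside the [-1, 1] range that the 0.2.0 changelog entry describes.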
@@ -186,19 +242,21 @@ module Disco
        end
      end

-     def create_maps(train_set)
-       user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
-       item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
+     def update_maps(train_set)
+       raise ArgumentError, "Missing user_id" if train_set.any? { |v| v[:user_id].nil? }
+       raise ArgumentError, "Missing item_id" if train_set.any? { |v| v[:item_id].nil? }

-       @user_map = user_ids.zip(user_ids.size.times).to_h
-       @item_map = item_ids.zip(item_ids.size.times).to_h
+       train_set.each do |v|
+         @user_map[v[:user_id]] ||= @user_map.size
+         @item_map[v[:item_id]] ||= @item_map.size
+       end
      end

      def check_ratings(ratings)
-       unless ratings.all? { |r| !r.nil? }
+       unless ratings.all? { |r| !r[:rating].nil? }
          raise ArgumentError, "Missing ratings"
        end
-       unless ratings.all? { |r| r.is_a?(Numeric) }
+       unless ratings.all? { |r| r[:rating].is_a?(Numeric) }
          raise ArgumentError, "Ratings must be numeric"
        end
      end
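`update_maps` replaces the sorted mapping built by `create_maps` with an incremental one: each id receives the next free index the first time it is seen, and existing entries keep their index. This no longer requires the ids to be sortable. A standalone sketch of the same idiom, with the maps passed in explicitly instead of held in instance variables:

```ruby
def update_maps(user_map, item_map, rows)
  raise ArgumentError, "Missing user_id" if rows.any? { |v| v[:user_id].nil? }
  raise ArgumentError, "Missing item_id" if rows.any? { |v| v[:item_id].nil? }

  rows.each do |v|
    # a new id gets the next free index; a known id keeps its index
    # (0 is truthy in Ruby, so ||= does not overwrite index 0)
    user_map[v[:user_id]] ||= user_map.size
    item_map[v[:item_id]] ||= item_map.size
  end
end

user_map = {}
item_map = {}
update_maps(user_map, item_map, [
  {user_id: "zoe", item_id: 10},
  {user_id: "amy", item_id: 10},
  {user_id: "zoe", item_id: 7}
])
```

Indexes follow first-seen order here ("zoe" is 0, "amy" is 1), whereas the removed `create_maps` would have assigned them in sorted order.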
@@ -207,6 +265,29 @@ module Disco
        raise ArgumentError, "No training data" if train_set.empty?
      end

+     def check_fit
+       raise "Not fit" unless defined?(@implicit)
+     end
+
+     def to_dataset(dataset)
+       if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         dataset.keys.each do |k, v|
+           dataset[k.to_sym] ||= dataset.delete(k)
+         end
+         dataset.to_a
+       elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+         dataset.rename_vectors!(new_names)
+         dataset.to_a[0]
+       else
+         dataset
+       end
+     end
+
      def marshal_dump
        obj = {
          implicit: @implicit,
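`check_fit` tests `defined?(@implicit)` rather than the value of `@implicit`. That distinction matters because `@implicit` is legitimately `false` after fitting on explicit ratings, so a truthiness check would report a fitted model as unfit. A minimal sketch of the guard follows; `TinyRecommender` and its `implicit?` method are hypothetical and exist only to exercise it.

```ruby
# Hypothetical class used only to demonstrate the guard.
class TinyRecommender
  def fit(rows)
    # assigning the variable, even to false, marks the model as fit
    @implicit = !rows.any? { |v| v[:rating] }
    self
  end

  def implicit?
    check_fit
    @implicit
  end

  private

  def check_fit
    raise "Not fit" unless defined?(@implicit)
  end
end

fitted = TinyRecommender.new.fit([{user_id: 1, item_id: 1, rating: 5}])
explicit_result = fitted.implicit?
```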
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Disco
-   VERSION = "0.1.2"
+   VERSION = "0.2.4"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
-   version: 0.1.2
+   version: 0.2.4
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-03-26 00:00:00.000000000 Z
+ date: 2021-02-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: libmf
@@ -38,106 +38,8 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rake
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: minitest
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
- - !ruby/object:Gem::Dependency
-   name: activerecord
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: sqlite3
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: daru
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: ngt
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: 0.2.3
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: 0.2.3
- description:
- email: andrew@chartkick.com
+ description:
+ email: andrew@ankane.org
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -158,7 +60,7 @@ homepage: https://github.com/ankane/disco
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -173,8 +75,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
- summary: Collaborative filtering for Ruby
+ summary: Recommendations for Ruby and Rails using collaborative filtering
  test_files: []