disco 0.1.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8f0494a12f2efe7d077d989960b4f40bae39e9d433b8c38d45be367f79ad1a3e
-   data.tar.gz: d5ef0e761f5cc3a409e0b7c3d8cdc14a33d5c03ca4eca7b476159c82954ad163
+   metadata.gz: 1ed4e5ff1d50d49068cfc9dc6b26df6be24aca1827b3f3724c3e351b06d768ac
+   data.tar.gz: 24756addf552956497889a9898eb9f93cdeec996c2a8a3cc1808c61425ea3d39
  SHA512:
-   metadata.gz: 0a4995986ac209da39ff9f7c449225f1fc560364936f6df76647e33500f8f07749dc857f2a334b7bb59dadd941403feeee4f43f51647a7df799051a095f0cb3b
-   data.tar.gz: e1bef5d6f9d3272cbca8a6fc56d7d14bcc2bc2b269ddda7931a256f6a4c019419b2e590e06841e5a09914b9523a5b4c09c9ff8bfd817b3d3bd2e99ecf3368752
+   metadata.gz: 50d5a9ba262c8c751f77fb2037235d9186e7616af55c653518752b2300d0c0b1a70d59dd935f0eaed9e1bd09881e547c66446dd0af8bfe641a679acae76fddd0
+   data.tar.gz: 00b260249dbc2831ad4948d531176c53df965b28f317b0e46987c5be7209da63126033cf75116b01f78ceefd8da977a9fb14bf014e67ae71e18bb634a9852335
@@ -1,3 +1,26 @@
- ## 0.1.0
+ ## 0.2.1 (2020-10-28)
+
+ - Fixed issue with `user_recs` returning rated items
+
+ ## 0.2.0 (2020-07-31)
+
+ - Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+ ## 0.1.3 (2020-06-28)
+
+ - Added support for Rover
+ - Raise error when missing user or item ids
+ - Fixed string keys for Daru data frames
+ - `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
+ ## 0.1.2 (2020-03-26)
+
+ - Added experimental `optimize_item_recs` and `optimize_similar_users` methods
+
+ ## 0.1.1 (2019-11-14)
+
+ - Fixed Rails integration
+
+ ## 0.1.0 (2019-11-14)
  
  - First release
@@ -1,4 +1,4 @@
- Copyright (c) 2019 Andrew Kane
+ Copyright (c) 2019-2020 Andrew Kane
  
  MIT License
  
data/README.md CHANGED
@@ -1,10 +1,10 @@
  # Disco
  
- :fire: Collaborative filtering for Ruby
+ :fire: Recommendations for Ruby and Rails using collaborative filtering
  
  - Supports user-based and item-based recommendations
  - Works with explicit and implicit feedback
- - Uses matrix factorization
+ - Uses high-performance matrix factorization
  
  [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)
  
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
  ```ruby
  views = Ahoy::Event.
    where(name: "Viewed post").
-   group(:user_id, "properties->>'post_id'") # postgres syntax
+   group(:user_id).
+   group("properties->>'post_id'"). # postgres syntax
    count
  
  data =
    views.map do |(user_id, post_id), count|
      {
        user_id: user_id,
-       post_id: post_id,
+       item_id: post_id,
        value: count
      }
    end
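A grouped `count` over two columns returns a hash keyed by `[user_id, post_id]` pairs, which is why the block destructures its first argument. The sketch below uses a hand-written stand-in hash (the `views` literal is hypothetical sample data, not real Ahoy output) to show how those pairs become the `item_id` rows introduced by this change:

```ruby
# Stand-in for the result of the grouped count query. Keys are
# [user_id, post_id] pairs and values are view counts (sample data).
# The Postgres ->> operator returns text, so post ids arrive as strings.
views = {
  [1, "10"] => 3,
  [1, "11"] => 1,
  [2, "10"] => 2
}

# Destructure each composite key into the fields the recommender reads.
data =
  views.map do |(user_id, post_id), count|
    {user_id: user_id, item_id: post_id, value: count}
  end
```

Each row now carries `user_id`, `item_id`, and `value`, matching the key names used elsewhere in this diff.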
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)
  
  ## Algorithms
  
- Disco uses matrix factorization.
+ Disco uses high-performance matrix factorization.
  
  - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
  - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
  - For user-based recommendations, show new users the most popular items.
  - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
  
- ## Daru
+ ## Data
  
- Disco works with Daru data frames
+ Data can be an array of hashes
  
  ```ruby
- data = Daru::DataFrame.from_csv("ratings.csv")
- recommender.fit(data)
+ [{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
+ ```
+
+ Or a Rover data frame
+
+ ```ruby
+ Rover.read_csv("ratings.csv")
  ```
  
+ Or a Daru data frame
+
+ ```ruby
+ Daru::DataFrame.from_csv("ratings.csv")
+ ```
+
+ ## Faster Similarity
+
+ If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+
+ Add this line to your application’s Gemfile:
+
+ ```ruby
+ gem 'ngt', '>= 0.3.0'
+ ```
+
+ Speed up item-based recommendations with:
+
+ ```ruby
+ model.optimize_item_recs
+ ```
+
+ Speed up similar users with:
+
+ ```ruby
+ model.optimize_similar_users
+ ```
+
+ This should be called after fitting or loading the model.
+
  ## Reference
  
  Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
  - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
  - Write, clarify, or fix documentation
  - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/disco.git
+ cd disco
+ bundle install
+ bundle exec rake test
+ ```
@@ -0,0 +1,8 @@
+ module Disco
+   class Recommendation < ActiveRecord::Base
+     self.table_name = "disco_recommendations"
+
+     belongs_to :subject, polymorphic: true
+     belongs_to :item, polymorphic: true
+   end
+ end
@@ -36,8 +36,7 @@ module Disco
  
        return dest if File.exist?(dest)
  
-       temp_dir ||= File.dirname(Tempfile.new("disco"))
-       temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+       temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name
  
        digest = Digest::SHA2.new
  
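The new line still carries a `# TODO better name` comment, since a timestamp alone can collide when two downloads start at nearly the same moment. One possible direction, sketched here with a hypothetical helper named `unique_temp_path` that is not part of Disco, is to replace the clock with random hex from the standard library:

```ruby
require "tmpdir"       # provides Dir.tmpdir
require "securerandom" # provides SecureRandom.hex

# Hypothetical helper (not part of Disco): builds a temp file path from
# random hex instead of the clock, so concurrent callers are very
# unlikely to pick the same name.
def unique_temp_path(prefix = "disco")
  File.join(Dir.tmpdir, "#{prefix}-#{SecureRandom.hex(8)}")
end

path = unique_temp_path
```

This only builds the path string; it does not create the file, which matches how `temp_path` is used in the hunk above.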
@@ -9,14 +9,8 @@ module Disco
      end
  
      def fit(train_set, validation_set: nil)
-       if defined?(Daru)
-         if train_set.is_a?(Daru::DataFrame)
-           train_set = train_set.to_a[0]
-         end
-         if validation_set.is_a?(Daru::DataFrame)
-           validation_set = validation_set.to_a[0]
-         end
-       end
+       train_set = to_dataset(train_set)
+       validation_set = to_dataset(validation_set) if validation_set
  
        @implicit = !train_set.any? { |v| v[:rating] }
  
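The context line `@implicit = !train_set.any? { |v| v[:rating] }` decides between explicit and implicit feedback by looking for a `:rating` key with a symbol name. A small sketch of that check on plain hashes follows; `implicit?` is a hypothetical wrapper added for illustration only:

```ruby
# Hypothetical wrapper around the check used in fit: feedback counts as
# implicit when no row carries a truthy :rating value.
def implicit?(train_set)
  !train_set.any? { |v| v[:rating] }
end

explicit_rows = [{user_id: 1, item_id: 1, rating: 5}]
implicit_rows = [{user_id: 1, item_id: 1, value: 1}]
string_keyed  = [{"user_id" => 1, "item_id" => 1, "rating" => 5}]

explicit_result = implicit?(explicit_rows) # rating present
implicit_result = implicit?(implicit_rows) # no rating key at all
string_result   = implicit?(string_keyed)  # "rating" is not :rating
```

The last case shows why string-keyed rows would be treated as implicit feedback, which is consistent with the changelog entry about string keys for Daru data frames.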
@@ -70,9 +64,11 @@ module Disco
  
        @global_mean = model.bias
  
-       # TODO read from LIBMF directly to Numo for performance
-       @user_factors = Numo::DFloat.cast(model.p_factors)
-       @item_factors = Numo::DFloat.cast(model.q_factors)
+       @user_factors = model.p_factors(format: :numo)
+       @item_factors = model.q_factors(format: :numo)
+
+       @user_index = nil
+       @item_index = nil
      end
  
      def user_recs(user_id, count: 5, item_ids: nil)
@@ -91,7 +87,7 @@ module Disco
          idx = item_ids.map { |i| @item_map[i] }.compact
          predictions.values_at(*idx)
        else
-         @rated[u].keys.each do |i|
+         @rated[u].keys.sort_by { |v| -v }.each do |i|
          predictions.delete_at(i)
        end
      end
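The one-line change above is the fix behind the 0.2.1 changelog entry: `delete_at` shifts every later element one position to the left, so deleting indices in ascending order removes the wrong elements after the first deletion. A minimal sketch with sample data (the letters and indices are illustrative only):

```ruby
predictions = %w[a b c d e]
rated = [1, 3] # indices of items the user has already rated ("b" and "d")

# Ascending order: after removing index 1, the old index 3 ("d") has
# moved to index 2, so the second delete removes "e" instead.
ascending = predictions.dup
rated.each { |i| ascending.delete_at(i) }

# Descending order: higher indices are removed first, so the lower
# indices still point at the intended elements.
descending = predictions.dup
rated.sort_by { |v| -v }.each { |i| descending.delete_at(i) }
```

Here `ascending` ends up as `["a", "c", "d"]`, which still contains the rated item `"d"`, while `descending` is `["a", "c", "e"]`.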
@@ -106,17 +102,34 @@ module Disco
        end
      end
  
+     def optimize_similar_items
+       @item_index = create_index(@item_factors)
+     end
+     alias_method :optimize_item_recs, :optimize_similar_items
+
+     def optimize_similar_users
+       @user_index = create_index(@user_factors)
+     end
+
      def similar_items(item_id, count: 5)
-       similar(item_id, @item_map, @item_factors, item_norms, count)
+       similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
      end
      alias_method :item_recs, :similar_items
  
      def similar_users(user_id, count: 5)
-       similar(user_id, @user_map, @user_factors, user_norms, count)
+       similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
      end
  
      private
  
+     def create_index(factors)
+       require "ngt"
+
+       index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+       index.batch_insert(factors)
+       index
+     end
+
      def user_norms
        @user_norms ||= norms(@user_factors)
      end
@@ -126,25 +139,41 @@ module Disco
      end
  
      def norms(factors)
-       norms = Numo::DFloat::Math.sqrt((factors * factors).sum(axis: 1))
+       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
        norms[norms.eq(0)] = 1e-10 # no zeros
        norms
      end
  
-     def similar(id, map, factors, norms, count)
+     def similar(id, map, factors, norms, count, index)
        i = map[id]
        if i
-         predictions = factors.dot(factors[i, true]) / norms
-
-         predictions =
-           map.keys.zip(predictions).map do |item_id, pred|
-             {item_id: item_id, score: pred}
+         if index && count
+           keys = map.keys
+           result = index.search(factors[i, true], size: count + 1)[1..-1]
+           result.map do |v|
+             {
+               # ids from batch_insert start at 1 instead of 0
+               item_id: keys[v[:id] - 1],
+               # convert cosine distance to cosine similarity
+               score: 1 - v[:distance]
+             }
            end
-
-         predictions.delete_at(i)
-         predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-         predictions = predictions.first(count) if count
-         predictions
+         else
+           predictions = factors.dot(factors[i, true]) / norms
+
+           predictions =
+             map.keys.zip(predictions).map do |item_id, pred|
+               {item_id: item_id, score: pred}
+             end
+
+           max_score = predictions.delete_at(i)[:score]
+           predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
+           predictions = predictions.first(count) if count
+           # divide by max score to get cosine similarity
+           # only need to do for returned records
+           predictions.each { |pred| pred[:score] /= max_score }
+           predictions
+         end
        else
          []
        end
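The "divide by max score" comment in the non-indexed branch can be checked by hand. `factors.dot(factors[i, true]) / norms` divides each dot product by the norm of the other row only, so the entry for row `i` itself equals the norm of row `i`. Dividing every score by that value supplies the missing factor and gives cosine similarity. The sketch below reproduces the arithmetic with plain Ruby arrays standing in for the Numo matrices (the sample factors are made up):

```ruby
# Plain-Ruby stand-ins for the Numo operations (sample factors).
factors = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
i = 1 # row whose neighbours we want

dot  = ->(a, b) { a.zip(b).sum { |x, y| x * y } }
norm = ->(a) { Math.sqrt(dot.(a, a)) }

# Same shape as `factors.dot(factors[i, true]) / norms`:
# entry j is (f_j . f_i) / |f_j|.
predictions = factors.map { |f| dot.(f, factors[i]) / norm.(f) }

# The self entry is |f_i|^2 / |f_i| = |f_i|, called max_score in the diff.
max_score = predictions[i]

# Dividing by |f_i| yields (f_j . f_i) / (|f_j| |f_i|), the cosine similarity.
scores = predictions.map { |p| p / max_score }

# Direct computation for comparison.
direct = factors.map { |f| dot.(f, factors[i]) / (norm.(f) * norm.(factors[i])) }
```

With these sample rows, `scores` and `direct` agree, and the self entry is 1, the largest value a cosine similarity can take.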
@@ -154,6 +183,9 @@ module Disco
        user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
        item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
  
+       raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
+       raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+
        @user_map = user_ids.zip(user_ids.size.times).to_h
        @item_map = item_ids.zip(item_ids.size.times).to_h
      end
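The map built here assigns each distinct id a position, and the new guard rejects rows whose id is missing before that happens. A minimal sketch of the same steps on plain hashes, using a hypothetical helper `build_id_map` that is not part of Disco:

```ruby
# Hypothetical helper mirroring the mapping logic in the hunk above.
def build_id_map(train_set, key)
  ids = train_set.map { |v| v[key] }.uniq.sort
  raise ArgumentError, "Missing #{key}" if ids.any?(&:nil?)
  # Pair each id with its position in the sorted list.
  ids.zip(ids.size.times).to_h
end

train_set = [
  {user_id: "b", item_id: 20},
  {user_id: "a", item_id: 10},
  {user_id: "b", item_id: 10}
]

user_map = build_id_map(train_set, :user_id)
item_map = build_id_map(train_set, :item_id)

# A row without a user_id now fails fast instead of being mapped.
error =
  begin
    build_id_map([{item_id: 10}], :user_id)
    nil
  rescue ArgumentError => e
    e.message
  end
```

Sorting before zipping keeps the positions stable for a given set of ids regardless of the row order in the training data.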
@@ -171,6 +203,25 @@ module Disco
        raise ArgumentError, "No training data" if train_set.empty?
      end
  
+     def to_dataset(dataset)
+       if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         dataset.keys.each do |k, v|
+           dataset[k.to_sym] ||= dataset.delete(k)
+         end
+         dataset.to_a
+       elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+         dataset.rename_vectors!(new_names)
+         dataset.to_a[0]
+       else
+         dataset
+       end
+     end
+
      def marshal_dump
        obj = {
          implicit: @implicit,
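Both data frame branches of `to_dataset` exist to turn column names into symbols, because the rest of the recommender reads rows with symbol keys such as `v[:user_id]`. The effect can be illustrated on plain hashes with `Hash#transform_keys` from the standard library. This is a stand-in for the data frame calls in the diff and makes no claim about the Rover or Daru APIs:

```ruby
# Rows as they might look after reading a CSV: string keys (sample data).
rows = [
  {"user_id" => 1, "item_id" => 1, "rating" => 5},
  {"user_id" => 2, "item_id" => 1, "rating" => 3}
]

# Return new hashes with symbol keys; the input rows are left untouched,
# similar in spirit to the `dataset.dup` calls in the diff.
symbolized = rows.map { |row| row.transform_keys(&:to_sym) }
```

After conversion, `symbolized.first[:rating]` is found, whereas the same lookup on the original string-keyed row returns `nil`.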
@@ -1,3 +1,3 @@
  module Disco
-   VERSION = "0.1.0"
+   VERSION = "0.2.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.1
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-11-14 00:00:00.000000000 Z
+ date: 2020-10-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: libmf
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
  - !ruby/object:Gem::Dependency
    name: numo-narray
    requirement: !ruby/object:Gem::Requirement
@@ -122,7 +122,35 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- description:
+ - !ruby/object:Gem::Dependency
+   name: rover-df
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: ngt
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.0
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -131,6 +159,7 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
+ - app/models/disco/recommendation.rb
  - lib/disco.rb
  - lib/disco/data.rb
  - lib/disco/engine.rb
@@ -143,7 +172,7 @@ homepage: https://github.com/ankane/disco
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -158,8 +187,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.0.3
- signing_key:
+ rubygems_version: 3.1.4
+ signing_key:
  specification_version: 4
- summary: Collaborative filtering for Ruby
+ summary: Recommendations for Ruby and Rails using collaborative filtering
  test_files: []