disco 0.1.0 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8f0494a12f2efe7d077d989960b4f40bae39e9d433b8c38d45be367f79ad1a3e
-   data.tar.gz: d5ef0e761f5cc3a409e0b7c3d8cdc14a33d5c03ca4eca7b476159c82954ad163
+   metadata.gz: 1ed4e5ff1d50d49068cfc9dc6b26df6be24aca1827b3f3724c3e351b06d768ac
+   data.tar.gz: 24756addf552956497889a9898eb9f93cdeec996c2a8a3cc1808c61425ea3d39
  SHA512:
-   metadata.gz: 0a4995986ac209da39ff9f7c449225f1fc560364936f6df76647e33500f8f07749dc857f2a334b7bb59dadd941403feeee4f43f51647a7df799051a095f0cb3b
-   data.tar.gz: e1bef5d6f9d3272cbca8a6fc56d7d14bcc2bc2b269ddda7931a256f6a4c019419b2e590e06841e5a09914b9523a5b4c09c9ff8bfd817b3d3bd2e99ecf3368752
+   metadata.gz: 50d5a9ba262c8c751f77fb2037235d9186e7616af55c653518752b2300d0c0b1a70d59dd935f0eaed9e1bd09881e547c66446dd0af8bfe641a679acae76fddd0
+   data.tar.gz: 00b260249dbc2831ad4948d531176c53df965b28f317b0e46987c5be7209da63126033cf75116b01f78ceefd8da977a9fb14bf014e67ae71e18bb634a9852335
data/CHANGELOG.md CHANGED
@@ -1,3 +1,26 @@
- ## 0.1.0
+ ## 0.2.1 (2020-10-28)
+
+ - Fixed issue with `user_recs` returning rated items
+
+ ## 0.2.0 (2020-07-31)
+
+ - Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+ ## 0.1.3 (2020-06-28)
+
+ - Added support for Rover
+ - Raise error when missing user or item ids
+ - Fixed string keys for Daru data frames
+ - `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
+ ## 0.1.2 (2020-03-26)
+
+ - Added experimental `optimize_item_recs` and `optimize_similar_users` methods
+
+ ## 0.1.1 (2019-11-14)
+
+ - Fixed Rails integration
+
+ ## 0.1.0 (2019-11-14)

  - First release
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2019 Andrew Kane
+ Copyright (c) 2019-2020 Andrew Kane

  MIT License

data/README.md CHANGED
@@ -1,10 +1,10 @@
  # Disco

- :fire: Collaborative filtering for Ruby
+ :fire: Recommendations for Ruby and Rails using collaborative filtering

  - Supports user-based and item-based recommendations
  - Works with explicit and implicit feedback
- - Uses matrix factorization
+ - Uses high-performance matrix factorization

  [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)

@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
  ```ruby
  views = Ahoy::Event.
    where(name: "Viewed post").
-   group(:user_id, "properties->>'post_id'") # postgres syntax
+   group(:user_id).
+   group("properties->>'post_id'"). # postgres syntax
    count

  data =
    views.map do |(user_id, post_id), count|
      {
        user_id: user_id,
-       post_id: post_id,
+       item_id: post_id,
        value: count
      }
    end
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)

  ## Algorithms

- Disco uses matrix factorization.
+ Disco uses high-performance matrix factorization.

  - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
  - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
  - For user-based recommendations, show new users the most popular items.
  - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).

- ## Daru
+ ## Data

- Disco works with Daru data frames
+ Data can be an array of hashes

  ```ruby
- data = Daru::DataFrame.from_csv("ratings.csv")
- recommender.fit(data)
+ [{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
+ ```
+
+ Or a Rover data frame
+
+ ```ruby
+ Rover.read_csv("ratings.csv")
  ```

+ Or a Daru data frame
+
+ ```ruby
+ Daru::DataFrame.from_csv("ratings.csv")
+ ```
+
+ ## Faster Similarity
+
+ If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+
+ Add this line to your application’s Gemfile:
+
+ ```ruby
+ gem 'ngt', '>= 0.3.0'
+ ```
+
+ Speed up item-based recommendations with:
+
+ ```ruby
+ model.optimize_item_recs
+ ```
+
+ Speed up similar users with:
+
+ ```ruby
+ model.optimize_similar_users
+ ```
+
+ This should be called after fitting or loading the model.
+
  ## Reference

  Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
  - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
  - Write, clarify, or fix documentation
  - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/disco.git
+ cd disco
+ bundle install
+ bundle exec rake test
+ ```
data/app/models/disco/recommendation.rb ADDED
@@ -0,0 +1,8 @@
+ module Disco
+   class Recommendation < ActiveRecord::Base
+     self.table_name = "disco_recommendations"
+
+     belongs_to :subject, polymorphic: true
+     belongs_to :item, polymorphic: true
+   end
+ end
data/lib/disco/data.rb CHANGED
@@ -36,8 +36,7 @@ module Disco

        return dest if File.exist?(dest)

-       temp_dir ||= File.dirname(Tempfile.new("disco"))
-       temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+       temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name

        digest = Digest::SHA2.new

data/lib/disco/recommender.rb CHANGED
@@ -9,14 +9,8 @@ module Disco
      end

      def fit(train_set, validation_set: nil)
-       if defined?(Daru)
-         if train_set.is_a?(Daru::DataFrame)
-           train_set = train_set.to_a[0]
-         end
-         if validation_set.is_a?(Daru::DataFrame)
-           validation_set = validation_set.to_a[0]
-         end
-       end
+       train_set = to_dataset(train_set)
+       validation_set = to_dataset(validation_set) if validation_set

        @implicit = !train_set.any? { |v| v[:rating] }

@@ -70,9 +64,11 @@ module Disco

        @global_mean = model.bias

-       # TODO read from LIBMF directly to Numo for performance
-       @user_factors = Numo::DFloat.cast(model.p_factors)
-       @item_factors = Numo::DFloat.cast(model.q_factors)
+       @user_factors = model.p_factors(format: :numo)
+       @item_factors = model.q_factors(format: :numo)
+
+       @user_index = nil
+       @item_index = nil
      end

      def user_recs(user_id, count: 5, item_ids: nil)
@@ -91,7 +87,7 @@ module Disco
            idx = item_ids.map { |i| @item_map[i] }.compact
            predictions.values_at(*idx)
          else
-           @rated[u].keys.each do |i|
+           @rated[u].keys.sort_by { |v| -v }.each do |i|
              predictions.delete_at(i)
            end
          end
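The switch to `sort_by { |v| -v }` matters because `Array#delete_at` shifts every later element down by one, so any index larger than one already deleted points at the wrong entry. This appears to be the fix the changelog lists as `user_recs` returning rated items. The sketch below uses made-up data; `remove_ascending` and `remove_descending` are hypothetical helpers, not part of Disco, and ascending order stands in for whatever order the old code happened to iterate the keys in.

```ruby
# Hypothetical helpers for illustration; not part of the Disco API.

# Old-style removal: delete rated positions in ascending order.
# Each delete_at shifts later elements left, so later indexes go stale.
def remove_ascending(predictions, rated_indexes)
  result = predictions.dup
  rated_indexes.sort.each { |i| result.delete_at(i) }
  result
end

# New-style removal: delete from the highest index down, so the
# positions still waiting to be deleted are never shifted.
def remove_descending(predictions, rated_indexes)
  result = predictions.dup
  rated_indexes.sort_by { |v| -v }.each { |i| result.delete_at(i) }
  result
end

predictions = [
  {item_id: "a", score: 0.9},
  {item_id: "b", score: 0.8},
  {item_id: "c", score: 0.7},
  {item_id: "d", score: 0.6}
]
rated = [0, 1] # the user already rated items "a" and "b"

ascending_ids = remove_ascending(predictions, rated).map { |pred| pred[:item_id] }
descending_ids = remove_descending(predictions, rated).map { |pred| pred[:item_id] }

puts ascending_ids.inspect  # rated item "b" survives and unrated "c" is dropped
puts descending_ids.inspect # only the unrated items remain
```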
@@ -106,17 +102,34 @@ module Disco
        end
      end

+     def optimize_similar_items
+       @item_index = create_index(@item_factors)
+     end
+     alias_method :optimize_item_recs, :optimize_similar_items
+
+     def optimize_similar_users
+       @user_index = create_index(@user_factors)
+     end
+
      def similar_items(item_id, count: 5)
-       similar(item_id, @item_map, @item_factors, item_norms, count)
+       similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
      end
      alias_method :item_recs, :similar_items

      def similar_users(user_id, count: 5)
-       similar(user_id, @user_map, @user_factors, user_norms, count)
+       similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
      end

      private

+     def create_index(factors)
+       require "ngt"
+
+       index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+       index.batch_insert(factors)
+       index
+     end
+
      def user_norms
        @user_norms ||= norms(@user_factors)
      end
@@ -126,25 +139,41 @@ module Disco
      end

      def norms(factors)
-       norms = Numo::DFloat::Math.sqrt((factors * factors).sum(axis: 1))
+       norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
        norms[norms.eq(0)] = 1e-10 # no zeros
        norms
      end

-     def similar(id, map, factors, norms, count)
+     def similar(id, map, factors, norms, count, index)
        i = map[id]
        if i
-         predictions = factors.dot(factors[i, true]) / norms
-
-         predictions =
-           map.keys.zip(predictions).map do |item_id, pred|
-             {item_id: item_id, score: pred}
+         if index && count
+           keys = map.keys
+           result = index.search(factors[i, true], size: count + 1)[1..-1]
+           result.map do |v|
+             {
+               # ids from batch_insert start at 1 instead of 0
+               item_id: keys[v[:id] - 1],
+               # convert cosine distance to cosine similarity
+               score: 1 - v[:distance]
+             }
            end
-
-         predictions.delete_at(i)
-         predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
-         predictions = predictions.first(count) if count
-         predictions
+         else
+           predictions = factors.dot(factors[i, true]) / norms
+
+           predictions =
+             map.keys.zip(predictions).map do |item_id, pred|
+               {item_id: item_id, score: pred}
+             end
+
+           max_score = predictions.delete_at(i)[:score]
+           predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
+           predictions = predictions.first(count) if count
+           # divide by max score to get cosine similarity
+           # only need to do for returned records
+           predictions.each { |pred| pred[:score] /= max_score }
+           predictions
+         end
        else
          []
        end
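The "divide by max score" comment works because `factors.dot(query) / norms` divides each dot product only by the candidate row's norm. The query's own entry is then its dot product with itself over its norm, which is just its norm, and by the Cauchy-Schwarz inequality no other entry can exceed it. Dividing every score by that value supplies the missing query norm and yields cosine similarity. The sketch below shows this in plain Ruby with made-up factors; it does not use Numo, and `dot` and `norm` are hypothetical helpers rather than Disco methods.

```ruby
# Hypothetical helpers for illustration; not part of the Disco API.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def norm(a)
  Math.sqrt(dot(a, a))
end

# Made-up latent factors, one row per item
factors = [
  [1.0, 0.0],
  [3.0, 4.0],
  [0.0, 2.0]
]
i = 1
query = factors[i]

# Step 1: dot product divided by each candidate row's norm only
partial = factors.map { |row| dot(row, query) / norm(row) }

# Step 2: the query's own entry equals its norm and is the largest score
self_score = partial[i]

# Step 3: dividing by it completes the cosine similarity formula
cosine = partial.map { |score| score / self_score }

# Direct computation of cosine similarity for comparison
direct = factors.map { |row| dot(row, query) / (norm(row) * norm(query)) }

puts cosine.inspect
puts direct.inspect
```

The NGT branch reaches the same scale by a different route: per its comment, it converts the returned cosine distance with `1 - v[:distance]`.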
@@ -154,6 +183,9 @@ module Disco
        user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
        item_ids = train_set.map { |v| v[:item_id] }.uniq.sort

+       raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
+       raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+
        @user_map = user_ids.zip(user_ids.size.times).to_h
        @item_map = item_ids.zip(item_ids.size.times).to_h
      end
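The `zip(...).to_h` lines give each distinct id a zero-based row index. Without the new checks a `nil` id could end up as a key in that map, which is presumably why the hunk raises before building it. The sketch below is plain Ruby; `build_index_map` is a hypothetical name, and unlike the hunk it checks for `nil` before sorting.

```ruby
# Hypothetical helper for illustration; not part of the Disco API.
# Maps each distinct id to a zero-based index, rejecting missing ids.
def build_index_map(ids, label)
  raise ArgumentError, "Missing #{label}" if ids.any?(&:nil?)

  unique = ids.uniq.sort
  unique.zip(unique.size.times).to_h
end

train_set = [
  {user_id: 3, item_id: "b", rating: 5},
  {user_id: 1, item_id: "a", rating: 3},
  {user_id: 3, item_id: "a", rating: 4}
]

user_map = build_index_map(train_set.map { |v| v[:user_id] }, "user_id")
item_map = build_index_map(train_set.map { |v| v[:item_id] }, "item_id")

# A row with a missing id is rejected instead of being indexed
error_message =
  begin
    build_index_map([1, nil], "user_id")
    nil
  rescue ArgumentError => e
    e.message
  end

puts user_map.inspect
puts item_map.inspect
puts error_message
```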
@@ -171,6 +203,25 @@ module Disco
        raise ArgumentError, "No training data" if train_set.empty?
      end

+     def to_dataset(dataset)
+       if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         dataset.keys.each do |k, v|
+           dataset[k.to_sym] ||= dataset.delete(k)
+         end
+         dataset.to_a
+       elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+         # convert keys to symbols
+         dataset = dataset.dup
+         new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+         dataset.rename_vectors!(new_names)
+         dataset.to_a[0]
+       else
+         dataset
+       end
+     end
+
      def marshal_dump
        obj = {
          implicit: @implicit,
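Both data frame branches convert column names to symbols because `fit` reads rows with symbol keys, for example `v[:rating]` when deciding whether feedback is implicit. With string keys that lookup returns `nil` for every row, so explicit ratings would be treated as implicit feedback. The sketch below shows the same idea on plain hashes, assuming rows arrive as string-keyed hashes; `symbolize_rows` is a hypothetical helper, not Disco's implementation.

```ruby
# Hypothetical helper for illustration; not part of the Disco API.
# Returns new rows whose keys are symbols, leaving the input untouched.
def symbolize_rows(rows)
  rows.map { |row| row.transform_keys(&:to_sym) }
end

rows = [
  {"user_id" => 1, "item_id" => 1, "rating" => 5},
  {"user_id" => 2, "item_id" => 1, "rating" => 3}
]

dataset = symbolize_rows(rows)

# Same check as in fit: feedback is implicit when no row has a rating
implicit_before = !rows.any? { |v| v[:rating] }
implicit_after = !dataset.any? { |v| v[:rating] }

puts implicit_before # string keys hide the ratings
puts implicit_after  # symbol keys expose them
```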
data/lib/disco/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Disco
-   VERSION = "0.1.0"
+   VERSION = "0.2.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: disco
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.1
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-11-14 00:00:00.000000000 Z
+ date: 2020-10-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: libmf
@@ -16,14 +16,14 @@ dependencies:
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
        - !ruby/object:Gem::Version
-         version: 0.1.3
+         version: 0.2.0
  - !ruby/object:Gem::Dependency
    name: numo-narray
    requirement: !ruby/object:Gem::Requirement
@@ -122,7 +122,35 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- description:
+ - !ruby/object:Gem::Dependency
+   name: rover-df
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: ngt
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.0
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.0
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -131,6 +159,7 @@ files:
  - CHANGELOG.md
  - LICENSE.txt
  - README.md
+ - app/models/disco/recommendation.rb
  - lib/disco.rb
  - lib/disco/data.rb
  - lib/disco/engine.rb
@@ -143,7 +172,7 @@ homepage: https://github.com/ankane/disco
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -158,8 +187,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.0.3
- signing_key:
+ rubygems_version: 3.1.4
+ signing_key:
  specification_version: 4
- summary: Collaborative filtering for Ruby
+ summary: Recommendations for Ruby and Rails using collaborative filtering
  test_files: []