disco 0.1.0 → 0.2.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -1
- data/LICENSE.txt +1 -1
- data/README.md +54 -9
- data/app/models/disco/recommendation.rb +8 -0
- data/lib/disco/data.rb +1 -2
- data/lib/disco/recommender.rb +77 -26
- data/lib/disco/version.rb +1 -1
- metadata +39 -10
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1ed4e5ff1d50d49068cfc9dc6b26df6be24aca1827b3f3724c3e351b06d768ac
+  data.tar.gz: 24756addf552956497889a9898eb9f93cdeec996c2a8a3cc1808c61425ea3d39
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50d5a9ba262c8c751f77fb2037235d9186e7616af55c653518752b2300d0c0b1a70d59dd935f0eaed9e1bd09881e547c66446dd0af8bfe641a679acae76fddd0
+  data.tar.gz: 00b260249dbc2831ad4948d531176c53df965b28f317b0e46987c5be7209da63126033cf75116b01f78ceefd8da977a9fb14bf014e67ae71e18bb634a9852335
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,26 @@
-## 0.1
+## 0.2.1 (2020-10-28)
+
+- Fixed issue with `user_recs` returning rated items
+
+## 0.2.0 (2020-07-31)
+
+- Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+## 0.1.3 (2020-06-28)
+
+- Added support for Rover
+- Raise error when missing user or item ids
+- Fixed string keys for Daru data frames
+- `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
+## 0.1.2 (2020-03-26)
+
+- Added experimental `optimize_item_recs` and `optimize_similar_users` methods
+
+## 0.1.1 (2019-11-14)
+
+- Fixed Rails integration
+
+## 0.1.0 (2019-11-14)
 
 - First release
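The 0.2.0 entry above says scores for `item_recs` and `similar_users` are now cosine similarities, which always fall between -1 and 1. A minimal plain-Ruby sketch of that quantity follows. It uses ordinary arrays rather than the Numo matrices the gem works with, and the `cosine_similarity` helper and sample vectors are illustrative assumptions, not part of Disco.

```ruby
# Hypothetical helper, not part of Disco: cosine similarity of two
# equal-length numeric vectors, computed with plain arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

same      = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # same direction, close to 1
opposite  = cosine_similarity([1.0, 2.0], [-1.0, -2.0]) # opposite direction, close to -1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # orthogonal, 0
```

Because the value depends only on the angle between the vectors, rescaling either input leaves the score unchanged.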
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,10 +1,10 @@
 # Disco
 
-:fire:
+:fire: Recommendations for Ruby and Rails using collaborative filtering
 
 - Supports user-based and item-based recommendations
 - Works with explicit and implicit feedback
-- Uses matrix factorization
+- Uses high-performance matrix factorization
 
 [](https://travis-ci.org/ankane/disco)
 
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
 ```ruby
 views = Ahoy::Event.
   where(name: "Viewed post").
-  group(:user_id
+  group(:user_id).
+  group("properties->>'post_id'"). # postgres syntax
   count
 
 data =
   views.map do |(user_id, post_id), count|
     {
       user_id: user_id,
-
+      item_id: post_id,
       value: count
     }
   end
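The README snippet in the hunk above relies on a two-column `group(...).count` returning a hash keyed by `[user_id, post_id]` pairs, which the block then destructures. A small sketch with a hand-written hash standing in for the query result (the sample ids and counts are made up for illustration):

```ruby
# Stand-in for the grouped query result: counts keyed by [user_id, post_id].
# The post ids are strings here because the Postgres ->> operator returns text.
views = {
  [1, "10"] => 3,
  [1, "11"] => 1,
  [2, "10"] => 2
}

# Same mapping as the README: one row per (user, item) pair.
data =
  views.map do |(user_id, post_id), count|
    {
      user_id: user_id,
      item_id: post_id,
      value: count
    }
  end
```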
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)
 
 ## Algorithms
 
-Disco uses matrix factorization.
+Disco uses high-performance matrix factorization.
 
 - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
 - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
 - For user-based recommendations, show new users the most popular items.
 - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
 
-##
+## Data
 
-
+Data can be an array of hashes
 
 ```ruby
-
-
+[{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
+```
+
+Or a Rover data frame
+
+```ruby
+Rover.read_csv("ratings.csv")
 ```
 
+Or a Daru data frame
+
+```ruby
+Daru::DataFrame.from_csv("ratings.csv")
+```
+
+## Faster Similarity
+
+If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+
+Add this line to your application’s Gemfile:
+
+```ruby
+gem 'ngt', '>= 0.3.0'
+```
+
+Speed up item-based recommendations with:
+
+```ruby
+model.optimize_item_recs
+```
+
+Speed up similar users with:
+
+```ruby
+model.optimize_similar_users
+```
+
+This should be called after fitting or loading the model.
+
 ## Reference
 
 Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
 - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
 - Write, clarify, or fix documentation
 - Suggest or add new features
+
+To get started with development:
+
+```sh
+git clone https://github.com/ankane/disco.git
+cd disco
+bundle install
+bundle exec rake test
+```
data/lib/disco/data.rb
CHANGED
@@ -36,8 +36,7 @@ module Disco
 
       return dest if File.exist?(dest)
 
-
-      temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+      temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name
 
       digest = Digest::SHA2.new
 
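The new `temp_path` line in the hunk above depends on `Dir.tmpdir`, which comes from Ruby's `tmpdir` standard library and is only available once that library has been required. A standalone sketch of the same expression (the surrounding download logic is omitted):

```ruby
require "tmpdir" # defines Dir.tmpdir

# Same expression as the added line: system temp directory plus a
# "disco-" prefix and the current time as a float.
temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}"
```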
data/lib/disco/recommender.rb
CHANGED
@@ -9,14 +9,8 @@ module Disco
     end
 
     def fit(train_set, validation_set: nil)
-
-
-          train_set = train_set.to_a[0]
-        end
-        if validation_set.is_a?(Daru::DataFrame)
-          validation_set = validation_set.to_a[0]
-        end
-      end
+      train_set = to_dataset(train_set)
+      validation_set = to_dataset(validation_set) if validation_set
 
       @implicit = !train_set.any? { |v| v[:rating] }
 
@@ -70,9 +64,11 @@ module Disco
 
       @global_mean = model.bias
 
-
-      @
-
+      @user_factors = model.p_factors(format: :numo)
+      @item_factors = model.q_factors(format: :numo)
+
+      @user_index = nil
+      @item_index = nil
     end
 
     def user_recs(user_id, count: 5, item_ids: nil)
@@ -91,7 +87,7 @@ module Disco
           idx = item_ids.map { |i| @item_map[i] }.compact
           predictions.values_at(*idx)
         else
-          @rated[u].keys.each do |i|
+          @rated[u].keys.sort_by { |v| -v }.each do |i|
             predictions.delete_at(i)
           end
         end
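The one-line change in the hunk above lines up with the 0.2.1 changelog entry about `user_recs` returning rated items. Each `delete_at` shifts the later elements left, so deleting by index in ascending order makes the remaining indexes point at the wrong entries. Deleting from the highest index down avoids this. A self-contained sketch with made-up data (the array and index list are illustrative, not taken from the gem):

```ruby
predictions = [:a, :b, :c, :d, :e]
rated = [1, 3] # indexes to remove; the intended result is [:a, :c, :e]

# Ascending order: removing index 1 shifts :d to index 2,
# so the later delete_at(3) removes :e instead of :d.
ascending = predictions.dup
rated.each { |i| ascending.delete_at(i) }

# Descending order, as in the changed line: earlier indexes stay valid.
descending = predictions.dup
rated.sort_by { |v| -v }.each { |i| descending.delete_at(i) }
```

In this sketch `ascending` ends up as `[:a, :c, :d]`, which still contains an item that should have been removed, while `descending` is `[:a, :c, :e]`.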
@@ -106,17 +102,34 @@ module Disco
       end
     end
 
+    def optimize_similar_items
+      @item_index = create_index(@item_factors)
+    end
+    alias_method :optimize_item_recs, :optimize_similar_items
+
+    def optimize_similar_users
+      @user_index = create_index(@user_factors)
+    end
+
     def similar_items(item_id, count: 5)
-      similar(item_id, @item_map, @item_factors, item_norms, count)
+      similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
     end
     alias_method :item_recs, :similar_items
 
     def similar_users(user_id, count: 5)
-      similar(user_id, @user_map, @user_factors, user_norms, count)
+      similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
     end
 
     private
 
+    def create_index(factors)
+      require "ngt"
+
+      index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+      index.batch_insert(factors)
+      index
+    end
+
     def user_norms
       @user_norms ||= norms(@user_factors)
     end
@@ -126,25 +139,41 @@ module Disco
     end
 
     def norms(factors)
-      norms = Numo::
+      norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
       norms
     end
 
-    def similar(id, map, factors, norms, count)
+    def similar(id, map, factors, norms, count, index)
       i = map[id]
       if i
-
-
-
-
-            {
+        if index && count
+          keys = map.keys
+          result = index.search(factors[i, true], size: count + 1)[1..-1]
+          result.map do |v|
+            {
+              # ids from batch_insert start at 1 instead of 0
+              item_id: keys[v[:id] - 1],
+              # convert cosine distance to cosine similarity
+              score: 1 - v[:distance]
+            }
           end
-
-
-
-
-
+        else
+          predictions = factors.dot(factors[i, true]) / norms
+
+          predictions =
+            map.keys.zip(predictions).map do |item_id, pred|
+              {item_id: item_id, score: pred}
+            end
+
+          max_score = predictions.delete_at(i)[:score]
+          predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
+          predictions = predictions.first(count) if count
+          # divide by max score to get cosine similarity
+          # only need to do for returned records
+          predictions.each { |pred| pred[:score] /= max_score }
+          predictions
+        end
       else
         []
       end
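The non-indexed branch of `similar` in the hunk above divides each dot product by the candidate row's norm, then divides the returned scores by the self-match score. That self-match score equals the query row's own norm, so the two divisions together give cosine similarity. A plain-Ruby sketch of the same arithmetic, using nested arrays in place of the Numo matrix (the sample factors are made up for illustration):

```ruby
# Made-up factor rows; row 0 is the query.
factors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-3.0, -4.0]]
i = 0

# Per-row Euclidean norms, like the norms helper in the diff.
norms = factors.map { |f| Math.sqrt(f.sum { |x| x * x }) }

# Dot product of every row with the query row, divided by that row's norm.
query = factors[i]
predictions = factors.each_with_index.map do |f, j|
  f.zip(query).sum { |a, b| a * b } / norms[j]
end

# The query's own entry is (q . q) / |q| = |q|, so it doubles as the
# remaining normalization factor.
max_score = predictions.delete_at(i)
scores = predictions.map { |pred| pred / max_score }
```

With these inputs `max_score` is 5.0, the norm of `[3.0, 4.0]`, and the last score is -1 because the final row points in exactly the opposite direction.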
@@ -154,6 +183,9 @@ module Disco
       user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
       item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
 
+      raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
+      raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+
       @user_map = user_ids.zip(user_ids.size.times).to_h
       @item_map = item_ids.zip(item_ids.size.times).to_h
     end
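The lines in the hunk above turn the training rows into id-to-index maps, with the new guards rejecting missing ids first. A standalone sketch of the same steps on a small hand-written training set (the sample rows are illustrative, and local variables stand in for the instance variables):

```ruby
train_set = [
  {user_id: "b", item_id: 20},
  {user_id: "a", item_id: 10},
  {user_id: "b", item_id: 10}
]

# Distinct ids in sorted order, as in the diff.
user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
item_ids = train_set.map { |v| v[:item_id] }.uniq.sort

# Guards added in this hunk.
raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)

# Each id maps to its position in the sorted list.
user_map = user_ids.zip(user_ids.size.times).to_h
item_map = item_ids.zip(item_ids.size.times).to_h
```

Those positions are the indexes later used to address the factor rows, which is why `similar` looks up `map[id]` before doing any arithmetic.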
@@ -171,6 +203,25 @@ module Disco
       raise ArgumentError, "No training data" if train_set.empty?
     end
 
+    def to_dataset(dataset)
+      if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+        # convert keys to symbols
+        dataset = dataset.dup
+        dataset.keys.each do |k, v|
+          dataset[k.to_sym] ||= dataset.delete(k)
+        end
+        dataset.to_a
+      elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+        # convert keys to symbols
+        dataset = dataset.dup
+        new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+        dataset.rename_vectors!(new_names)
+        dataset.to_a[0]
+      else
+        dataset
+      end
+    end
+
     def marshal_dump
       obj = {
         implicit: @implicit,
data/lib/disco/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.1
+  version: 0.2.1
 platform: ruby
 authors:
 - Andrew Kane
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-10-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.2.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.2.0
 - !ruby/object:Gem::Dependency
   name: numo-narray
   requirement: !ruby/object:Gem::Requirement
@@ -122,7 +122,35 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-
+- !ruby/object:Gem::Dependency
+  name: rover-df
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: ngt
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.3.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.3.0
+description:
 email: andrew@chartkick.com
 executables: []
 extensions: []
@@ -131,6 +159,7 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
+- app/models/disco/recommendation.rb
 - lib/disco.rb
 - lib/disco/data.rb
 - lib/disco/engine.rb
@@ -143,7 +172,7 @@ homepage: https://github.com/ankane/disco
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -158,8 +187,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       - !ruby/object:Gem::Version
         version: '0'
 requirements: []
-rubygems_version: 3.
-signing_key:
+rubygems_version: 3.1.4
+signing_key:
 specification_version: 4
-summary:
+summary: Recommendations for Ruby and Rails using collaborative filtering
 test_files: []