disco 0.1.0 → 0.2.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -1
- data/LICENSE.txt +1 -1
- data/README.md +54 -9
- data/app/models/disco/recommendation.rb +8 -0
- data/lib/disco/data.rb +1 -2
- data/lib/disco/recommender.rb +77 -26
- data/lib/disco/version.rb +1 -1
- metadata +39 -10
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1ed4e5ff1d50d49068cfc9dc6b26df6be24aca1827b3f3724c3e351b06d768ac
+  data.tar.gz: 24756addf552956497889a9898eb9f93cdeec996c2a8a3cc1808c61425ea3d39
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 50d5a9ba262c8c751f77fb2037235d9186e7616af55c653518752b2300d0c0b1a70d59dd935f0eaed9e1bd09881e547c66446dd0af8bfe641a679acae76fddd0
+  data.tar.gz: 00b260249dbc2831ad4948d531176c53df965b28f317b0e46987c5be7209da63126033cf75116b01f78ceefd8da977a9fb14bf014e67ae71e18bb634a9852335
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,26 @@
-## 0.1
+## 0.2.1 (2020-10-28)
+
+- Fixed issue with `user_recs` returning rated items
+
+## 0.2.0 (2020-07-31)
+
+- Changed score to always be between -1 and 1 for `item_recs` and `similar_users` (cosine similarity - this makes it easier to understand and consistent with `optimize_item_recs` and `optimize_similar_users`)
+
+## 0.1.3 (2020-06-28)
+
+- Added support for Rover
+- Raise error when missing user or item ids
+- Fixed string keys for Daru data frames
+- `optimize_item_recs` and `optimize_similar_users` methods are no longer experimental
+
+## 0.1.2 (2020-03-26)
+
+- Added experimental `optimize_item_recs` and `optimize_similar_users` methods
+
+## 0.1.1 (2019-11-14)
+
+- Fixed Rails integration
+
+## 0.1.0 (2019-11-14)
 
 - First release
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,10 +1,10 @@
 # Disco
 
-:fire:
+:fire: Recommendations for Ruby and Rails using collaborative filtering
 
 - Supports user-based and item-based recommendations
 - Works with explicit and implicit feedback
-- Uses matrix factorization
+- Uses high-performance matrix factorization
 
 [![Build Status](https://travis-ci.org/ankane/disco.svg?branch=master)](https://travis-ci.org/ankane/disco)
 
@@ -101,14 +101,15 @@ recommender.item_recs("Star Wars (1977)")
 ```ruby
 views = Ahoy::Event.
   where(name: "Viewed post").
-  group(:user_id
+  group(:user_id).
+  group("properties->>'post_id'"). # postgres syntax
   count
 
 data =
   views.map do |(user_id, post_id), count|
     {
       user_id: user_id,
-
+      item_id: post_id,
       value: count
     }
   end
@@ -202,7 +203,7 @@ recommender = Marshal.load(bin)
 
 ## Algorithms
 
-Disco uses matrix factorization.
+Disco uses high-performance matrix factorization.
 
 - For explicit feedback, it uses [stochastic gradient descent](https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)
 - For implicit feedback, it uses [coordinate descent](https://www.csie.ntu.edu.tw/~cjlin/papers/one-class-mf/biased-mf-sdm-with-supp.pdf)
@@ -236,15 +237,50 @@ There are a number of ways to deal with this, but here are some common ones:
 - For user-based recommendations, show new users the most popular items.
 - For item-based recommendations, make content-based recommendations with a gem like [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity).
 
-##
+## Data
 
-
+Data can be an array of hashes
 
 ```ruby
-
-
+[{user_id: 1, item_id: 1, rating: 5}, {user_id: 2, item_id: 1, rating: 3}]
+```
+
+Or a Rover data frame
+
+```ruby
+Rover.read_csv("ratings.csv")
 ```
 
+Or a Daru data frame
+
+```ruby
+Daru::DataFrame.from_csv("ratings.csv")
+```
+
+## Faster Similarity
+
+If you have a large number of users/items, you can use an approximate nearest neighbors library like [NGT](https://github.com/ankane/ngt) to speed up item-based recommendations and similar users.
+
+Add this line to your application’s Gemfile:
+
+```ruby
+gem 'ngt', '>= 0.3.0'
+```
+
+Speed up item-based recommendations with:
+
+```ruby
+model.optimize_item_recs
+```
+
+Speed up similar users with:
+
+```ruby
+model.optimize_similar_users
+```
+
+This should be called after fitting or loading the model.
+
 ## Reference
 
 Get the global mean
@@ -280,3 +316,12 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
 - Fix bugs and [submit pull requests](https://github.com/ankane/disco/pulls)
 - Write, clarify, or fix documentation
 - Suggest or add new features
+
+To get started with development:
+
+```sh
+git clone https://github.com/ankane/disco.git
+cd disco
+bundle install
+bundle exec rake test
+```
data/lib/disco/data.rb
CHANGED
@@ -36,8 +36,7 @@ module Disco
 
       return dest if File.exist?(dest)
 
-
-      temp_path = "#{temp_dir}/#{Time.now.to_f}" # TODO better name
+      temp_path = "#{Dir.tmpdir}/disco-#{Time.now.to_f}" # TODO better name
 
       digest = Digest::SHA2.new
 
data/lib/disco/recommender.rb
CHANGED
@@ -9,14 +9,8 @@ module Disco
     end
 
     def fit(train_set, validation_set: nil)
-
-
-          train_set = train_set.to_a[0]
-        end
-        if validation_set.is_a?(Daru::DataFrame)
-          validation_set = validation_set.to_a[0]
-        end
-      end
+      train_set = to_dataset(train_set)
+      validation_set = to_dataset(validation_set) if validation_set
 
       @implicit = !train_set.any? { |v| v[:rating] }
 
@@ -70,9 +64,11 @@ module Disco
 
       @global_mean = model.bias
 
-
-      @
-
+      @user_factors = model.p_factors(format: :numo)
+      @item_factors = model.q_factors(format: :numo)
+
+      @user_index = nil
+      @item_index = nil
     end
 
     def user_recs(user_id, count: 5, item_ids: nil)
@@ -91,7 +87,7 @@ module Disco
           idx = item_ids.map { |i| @item_map[i] }.compact
           predictions.values_at(*idx)
         else
-          @rated[u].keys.each do |i|
+          @rated[u].keys.sort_by { |v| -v }.each do |i|
             predictions.delete_at(i)
           end
         end
|
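The one-line change in this hunk appears to be the 0.2.1 fix for `user_recs` returning rated items. `Array#delete_at` shifts every later element one position left, so deleting indices in ascending order removes the wrong elements after the first deletion. Deleting from the highest index down avoids the shift. A minimal sketch with made-up data (the `predictions` and `rated` values are illustrative and not taken from the gem):

```ruby
# Hypothetical predictions, one per item index, and the indices already rated.
predictions = [:a, :b, :c, :d, :e]
rated = [1, 3]

# Ascending order: after deleting index 1, the element that was at index 3
# has moved to index 2, so delete_at(3) removes :e instead of :d.
ascending = predictions.dup
rated.each { |i| ascending.delete_at(i) }

# Descending order (the order the patched line uses): deleting a later index
# never moves the elements at earlier indices.
descending = predictions.dup
rated.sort_by { |v| -v }.each { |i| descending.delete_at(i) }

p ascending   # [:a, :c, :d]  (:d, a rated item, survives)
p descending  # [:a, :c, :e]  (both rated items removed)
```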
@@ -106,17 +102,34 @@ module Disco
       end
     end
 
+    def optimize_similar_items
+      @item_index = create_index(@item_factors)
+    end
+    alias_method :optimize_item_recs, :optimize_similar_items
+
+    def optimize_similar_users
+      @user_index = create_index(@user_factors)
+    end
+
     def similar_items(item_id, count: 5)
-      similar(item_id, @item_map, @item_factors, item_norms, count)
+      similar(item_id, @item_map, @item_factors, item_norms, count, @item_index)
     end
     alias_method :item_recs, :similar_items
 
     def similar_users(user_id, count: 5)
-      similar(user_id, @user_map, @user_factors, user_norms, count)
+      similar(user_id, @user_map, @user_factors, user_norms, count, @user_index)
     end
 
     private
 
+    def create_index(factors)
+      require "ngt"
+
+      index = Ngt::Index.new(factors.shape[1], distance_type: "Cosine")
+      index.batch_insert(factors)
+      index
+    end
+
     def user_norms
       @user_norms ||= norms(@user_factors)
     end
@@ -126,25 +139,41 @@ module Disco
     end
 
     def norms(factors)
-      norms = Numo::
+      norms = Numo::SFloat::Math.sqrt((factors * factors).sum(axis: 1))
       norms[norms.eq(0)] = 1e-10 # no zeros
       norms
     end
 
-    def similar(id, map, factors, norms, count)
+    def similar(id, map, factors, norms, count, index)
       i = map[id]
       if i
-
-
-
-
-            {
+        if index && count
+          keys = map.keys
+          result = index.search(factors[i, true], size: count + 1)[1..-1]
+          result.map do |v|
+            {
+              # ids from batch_insert start at 1 instead of 0
+              item_id: keys[v[:id] - 1],
+              # convert cosine distance to cosine similarity
+              score: 1 - v[:distance]
+            }
           end
-
-
-
-
-
+        else
+          predictions = factors.dot(factors[i, true]) / norms
+
+          predictions =
+            map.keys.zip(predictions).map do |item_id, pred|
+              {item_id: item_id, score: pred}
+            end
+
+          max_score = predictions.delete_at(i)[:score]
+          predictions.sort_by! { |pred| -pred[:score] } # already sorted by id
+          predictions = predictions.first(count) if count
+          # divide by max score to get cosine similarity
+          # only need to do for returned records
+          predictions.each { |pred| pred[:score] /= max_score }
+          predictions
+        end
       else
         []
       end
|
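The non-indexed branch of `similar` reaches cosine similarity in two divisions. Dividing each dot product by that row's norm gives (f_j · f_i) / ‖f_j‖, and the query row's own score is then ‖f_i‖. Dividing the remaining scores by that value (`max_score` in the diff) yields (f_j · f_i) / (‖f_j‖ ‖f_i‖). The sketch below retraces those steps with plain Ruby arrays instead of Numo. The factor values and the `dot` and `norm` helpers are hypothetical and exist only for illustration.

```ruby
# Hypothetical 2-D factor rows; row 0 is the query.
factors = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0], [1.0, 0.0]]
i = 0

# Illustrative helpers, not part of the gem.
dot  = ->(a, b) { a.zip(b).sum { |x, y| x * y } }
norm = ->(a) { Math.sqrt(dot.(a, a)) }

# Step 1: dot product with the query row, divided by each row's own norm.
scores = factors.map { |f| dot.(f, factors[i]) / norm.(f) }

# Step 2: the query row's own score equals its norm (25 / 5 = 5 here).
self_score = scores[i]

# Step 3: dividing by that value gives cosine similarity, bounded by -1 and 1.
cosine = scores.map { |s| s / self_score }

p cosine
```

Row 1 is parallel to the query and row 2 is orthogonal to it, so their similarities come out as 1 and 0 respectively.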
@@ -154,6 +183,9 @@ module Disco
       user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
       item_ids = train_set.map { |v| v[:item_id] }.uniq.sort
 
+      raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
+      raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)
+
       @user_map = user_ids.zip(user_ids.size.times).to_h
       @item_map = item_ids.zip(item_ids.size.times).to_h
     end
|
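This hunk builds the id-to-index maps that `similar` later uses to pick a row out of the factor matrix (`i = map[id]`). A small sketch of the same expressions on a made-up training set shows what the maps look like. The `train_set` rows are hypothetical, and local variables stand in for the instance variables in the diff.

```ruby
# Hypothetical training rows in the array-of-hashes format.
train_set = [
  {user_id: 7, item_id: "b"},
  {user_id: 3, item_id: "a"},
  {user_id: 7, item_id: "a"}
]

user_ids = train_set.map { |v| v[:user_id] }.uniq.sort
item_ids = train_set.map { |v| v[:item_id] }.uniq.sort

# Same guards as the added lines in the hunk.
raise ArgumentError, "Missing user_id" if user_ids.any?(&:nil?)
raise ArgumentError, "Missing item_id" if item_ids.any?(&:nil?)

# Each id maps to its position in the sorted, de-duplicated list.
user_map = user_ids.zip(user_ids.size.times).to_h
item_map = item_ids.zip(item_ids.size.times).to_h

p user_map
p item_map
```

Here user 3 maps to index 0 and user 7 to index 1, while items "a" and "b" map to 0 and 1.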
@@ -171,6 +203,25 @@ module Disco
       raise ArgumentError, "No training data" if train_set.empty?
     end
 
+    def to_dataset(dataset)
+      if defined?(Rover::DataFrame) && dataset.is_a?(Rover::DataFrame)
+        # convert keys to symbols
+        dataset = dataset.dup
+        dataset.keys.each do |k, v|
+          dataset[k.to_sym] ||= dataset.delete(k)
+        end
+        dataset.to_a
+      elsif defined?(Daru::DataFrame) && dataset.is_a?(Daru::DataFrame)
+        # convert keys to symbols
+        dataset = dataset.dup
+        new_names = dataset.vectors.to_a.map { |k| [k, k.to_sym] }.to_h
+        dataset.rename_vectors!(new_names)
+        dataset.to_a[0]
+      else
+        dataset
+      end
+    end
+
     def marshal_dump
       obj = {
         implicit: @implicit,
data/lib/disco/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: disco
 version: !ruby/object:Gem::Version
-  version: 0.1
+  version: 0.2.1
 platform: ruby
 authors:
 - Andrew Kane
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-10-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: libmf
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.2.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.
+        version: 0.2.0
 - !ruby/object:Gem::Dependency
   name: numo-narray
   requirement: !ruby/object:Gem::Requirement
@@ -122,7 +122,35 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-
+- !ruby/object:Gem::Dependency
+  name: rover-df
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: ngt
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.3.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.3.0
+description:
 email: andrew@chartkick.com
 executables: []
 extensions: []
@@ -131,6 +159,7 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
+- app/models/disco/recommendation.rb
 - lib/disco.rb
 - lib/disco/data.rb
 - lib/disco/engine.rb
@@ -143,7 +172,7 @@ homepage: https://github.com/ankane/disco
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -158,8 +187,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       - !ruby/object:Gem::Version
         version: '0'
 requirements: []
-rubygems_version: 3.
-signing_key:
+rubygems_version: 3.1.4
+signing_key:
 specification_version: 4
-summary:
+summary: Recommendations for Ruby and Rails using collaborative filtering
 test_files: []