eps 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.travis.yml +5 -2
- data/CHANGELOG.md +4 -0
- data/README.md +49 -8
- data/Rakefile +23 -0
- data/guides/Modeling.md +17 -8
- data/lib/eps/base_regressor.rb +52 -3
- data/lib/eps/regressor.rb +8 -37
- data/lib/eps/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
+  data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
+  data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -6,6 +6,9 @@ Linear regression for Ruby
 - Serve models built in Ruby, Python, R, and more
 - Automatically handles categorical variables
 - No external dependencies
+- Works great with the SciRuby ecosystem (Daru & IRuby)
+
+[![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
 
 ## Installation
 
@@ -58,6 +61,14 @@ split_date = Date.parse("2018-06-01")
 train_set, test_set = houses.partition { |h| h.sold_at < split_date }
 ```
 
+### Outliers and Missing Data
+
+Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+
+```ruby
+train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+```
+
 ### Feature Engineering
 
 Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
@@ -87,20 +98,31 @@ def features(house)
   {
     bedrooms: house.bedrooms,
     city_id: house.city_id.to_s,
-    month: house.sold_at.strftime("%b")
-    price: house.price
+    month: house.sold_at.strftime("%b")
   }
 end
 
-
+train_features = train_set.map { |h| features(h) }
+```
+
+> We use a method for features so it can be used across training, evaluation, and prediction
+
+We also need to prepare the target variable.
+
+```ruby
+def target(house)
+  house.price
+end
+
+train_target = train_set.map { |h| target(h) }
 ```
 
 ### Training
 
-
+Now, let’s train the model.
 
 ```ruby
-model = Eps::Regressor.new(
+model = Eps::Regressor.new(train_features, train_target)
 puts model.summary
 ```
 
@@ -111,18 +133,31 @@ The summary includes the coefficients and their significance. The lower the p-va
 When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
 
 ```ruby
-
-
+test_features = test_set.map { |h| features(h) }
+test_target = test_set.map { |h| target(h) }
+model.evaluate(test_features, test_target)
 ```
 
 This returns:
 
--
+- RMSE - Root mean square error
 - MAE - Mean absolute error
 - ME - Mean error
 
 We want to minimize the RMSE and MAE and keep the ME around 0.
 
+### Finalize
+
+Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+
+```ruby
+all_features = houses.map { |h| features(h) }
+all_target = houses.map { |h| target(h) }
+model = Eps::Regressor.new(all_features, all_target)
+```
+
+We now have a model that’s ready to serve.
+
 ## Serving Models
 
 Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
@@ -178,6 +213,8 @@ data = File.read("model.pmml")
 model = Eps::Regressor.load_pmml(data)
 ```
 
+> Loading PMML requires Nokogiri to be installed
+
 [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
 
 ```ruby
@@ -326,6 +363,10 @@ When importing data from CSV files, be sure to convert numeric fields. The `tabl
 CSV.table("data.csv").map { |row| row.to_h }
 ```
 
+## Jupyter & IRuby
+
+You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+
 ## Reference
 
 Get coefficients
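The `evaluate` step in the README diff above reports RMSE, MAE, and ME. As a rough illustration of what those three numbers mean, here is a plain-Ruby sketch (not Eps's internal implementation):

```ruby
# Illustrative definitions of the metrics evaluate returns; errors are
# actual minus predicted, so an ME near 0 means no systematic bias.
def metrics(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |e| e * e } / n), # root mean square error
    mae: errors.sum { |e| e.abs } / n,             # mean absolute error
    me: errors.sum / n                             # mean error
  }
end

metrics([100, 200, 300], [110, 190, 300])
# rmse ≈ 8.165, mae ≈ 6.667, me = 0.0
```

RMSE penalizes large misses more heavily than MAE, which is why the README suggests watching both while keeping ME near zero.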
data/Rakefile
CHANGED
@@ -9,3 +9,26 @@ Rake::TestTask.new(:test) do |t|
 end
 
 task default: :test
+
+task :benchmark do
+  require "benchmark"
+  require "eps"
+  require "gsl" if ENV["GSL"]
+
+  data = []
+  10000.times do
+    row = {}
+    30.times do |i|
+      row[:"x#{i}"] = rand(100)
+    end
+    row[:y] = rand(100)
+    data << row
+  end
+
+  puts "Starting benchmark..."
+
+  time = Benchmark.realtime do
+    Eps::Regressor.new(data, target: :y)
+  end
+  p time.round(1)
+end
data/guides/Modeling.md
CHANGED
@@ -21,7 +21,8 @@ And run:
 library(jsonlite)
 
 model <- lm(dist ~ speed, cars)
-toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+data <- toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+write(data, file="model.json")
 ```
 
 ## R PMML
@@ -38,7 +39,8 @@ And run:
 library(pmml)
 
 model <- lm(dist ~ speed, cars)
-toString(pmml(model))
+data <- toString(pmml(model))
+write(data, file="model.pmml")
 ```
 
 ## R PFA
@@ -55,7 +57,7 @@ And run:
 library(aurelius)
 
 model <- lm(dist ~ speed, cars)
-write_pfa(pfa(model))
+write_pfa(pfa(model), file="model.pfa")
 ```
 
 ## Python JSON
@@ -80,15 +82,19 @@ coefficients = {'_intercept': model.intercept_}
 for i, c in enumerate(model.coef_):
     coefficients[features[i]] = c
 
-
+
+data = json.dumps({'coefficients': coefficients})
+
+with open('model.json', 'w') as f:
+    f.write(data)
 ```
 
 ## Python PMML
 
-Install the [
+Install the [scikit2pmml](https://github.com/vaclavcadek/scikit2pmml) package
 
 ```sh
-pip install
+pip install scikit2pmml
 ```
 
 And run:
@@ -103,7 +109,7 @@ y = [5 * xi + 3 for xi in x]
 model = linear_model.LinearRegression()
 model.fit([[xi] for xi in x], y)
 
-scikit2pmml(estimator=model, file='
+scikit2pmml(estimator=model, file='model.pmml')
 ```
 
 ## Python PFA
@@ -139,5 +145,8 @@ action:
 
 return pfaDocument
 
-json.dumps(pfa(model))
+data = json.dumps(pfa(model))
+
+with open('model.pfa', 'w') as f:
+    f.write(data)
 ```
data/lib/eps/base_regressor.rb
CHANGED
@@ -24,6 +24,22 @@ module Eps
       singular ? pred[0] : pred
     end
 
+    def evaluate(data, y = nil, target: nil)
+      raise ArgumentError, "missing target" if !target && !y
+
+      actual = y
+      actual ||=
+        if daru?(data)
+          data[target].to_a
+        else
+          data.map { |v| v[target] }
+        end
+
+      actual = prep_y(actual)
+      estimated = predict(data)
+      Eps.metrics(actual, estimated)
+    end
+
     # ruby
 
     def self.load(data)
@@ -56,7 +72,11 @@ module Eps
     # pmml
 
     def self.load_pmml(data)
-
+      if data.is_a?(String)
+        require "nokogiri"
+        data = Nokogiri::XML(data)
+      end
+
       # TODO more validation
       node = data.css("RegressionTable")
       coefficients = {
@@ -94,7 +114,7 @@ module Eps
       BaseRegressor.new(coefficients: coefficients)
     end
 
-
+    protected
 
     def daru?(x)
       defined?(Daru) && x.is_a?(Daru::DataFrame)
@@ -132,6 +152,10 @@ module Eps
       xi.each do |k, v|
         key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
         v2 = v.is_a?(String) ? 1 : v
+
+        # TODO make more efficient
+        next if !train && !coefficients.key?(symbolize_coef(key))
+
         raise "Missing data" if v2.nil?
 
         unless cache[key]
@@ -167,7 +191,7 @@ module Eps
       end
 
       # flatten keys
-      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k
+      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| symbolize_coef(k) }
 
       if c.size != c.uniq.size
         raise "Overlapping coefficients"
@@ -176,8 +200,33 @@ module Eps
       [ret2, c]
     end
 
+    def symbolize_coef(k)
+      (k.is_a?(Array) ? k.join("") : k).to_sym
+    end
+
     def matrix_arr(matrix)
       matrix.to_a.map { |xi| xi[0].to_f }
     end
+
+    # determine if target is a string or symbol
+    def prep_target(target, data)
+      if daru?(data)
+        data.has_vector?(target) ? target : flip_target(target)
+      else
+        x = data[0] || {}
+        x[target] ? target : flip_target(target)
+      end
+    end
+
+    def flip_target(target)
+      target.is_a?(String) ? target.to_sym : target.to_s
+    end
+
+    def prep_y(y)
+      y.each do |yi|
+        raise "Target missing in data" if yi.nil?
+      end
+      y.map(&:to_f)
+    end
   end
 end
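The new `symbolize_coef` helper above flattens the cache keys used for categorical features (`[name, value]` arrays) and numeric features (plain symbols) into a single coefficient name. Reproduced standalone to show its effect:

```ruby
# Standalone copy of symbolize_coef from the diff above: categorical
# features are keyed by [feature, category] arrays, numeric features by
# a plain symbol; both collapse to one symbol in the coefficient list.
def symbolize_coef(k)
  (k.is_a?(Array) ? k.join("") : k).to_sym
end

symbolize_coef([:city_id, "nyc"]) # => :city_idnyc
symbolize_coef(:bedrooms)         # => :bedrooms
```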
data/lib/eps/regressor.rb
CHANGED
@@ -52,8 +52,9 @@ module Eps
       # haven't found an efficient way to do QR-factorization in Ruby
       # the extendmatrix gem has householder and givens (givens has bug)
       # but methods are too slow
+      xt = x.t
       begin
-        @xtxi = (
+        @xtxi = (xt * x).inverse
       rescue ExceptionForMatrix::ErrNotRegular
         constant = {}
         (1...x.column_count).each do |i|
@@ -77,15 +78,18 @@ module Eps
          vectors.delete_at(i)
        end
        x = Matrix.columns(vectors)
+        xt = x.t
 
        # try again
        begin
-          @xtxi = (
+          @xtxi = (xt * x).inverse
        rescue ExceptionForMatrix::ErrNotRegular
          raise "Multiple solutions - GSL is needed to select one"
        end
      end
-
+      # huge performance boost
+      # by multiplying xt * y first
+      v2 = matrix_arr(@xtxi * (xt * y))
 
      # add back removed
      removed.sort.each do |i|
@@ -101,19 +105,7 @@ module Eps
     end
 
     def evaluate(data, y = nil)
-
-
-      actual ||=
-        if daru?(data)
-          data[@target].to_a
-        else
-          data.map { |v| v[@target] }
-        end
-
-      actual = prep_y(actual)
-
-      estimated = predict(data)
-      Eps.metrics(actual, estimated)
+      super(data, y, target: @target)
     end
 
     # https://people.richland.edu/james/ictcm/2004/multiple.html
@@ -154,27 +146,6 @@ module Eps
       arr.all? { |x| x == arr[0] }
     end
 
-    # determine if target is a string or symbol
-    def prep_target(target, data)
-      if daru?(data)
-        data.has_vector?(target) ? target : flip_target(target)
-      else
-        x = data[0] || {}
-        x[target] ? target : flip_target(target)
-      end
-    end
-
-    def flip_target(target)
-      target.is_a?(String) ? target.to_sym : target.to_s
-    end
-
-    def prep_y(y)
-      y.each do |yi|
-        raise "Target missing in data" if yi.nil?
-      end
-      y.map(&:to_f)
-    end
-
     # add epsilon for perfect fits
     # consistent with GSL
     def t_value
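The "huge performance boost" comment in the diff above relies on associativity: `@xtxi * (xt * y)` equals `(@xtxi * xt) * y`, but grouping `xt * y` first keeps the intermediate matrix p×1 instead of p×n (n = number of rows). A small sketch with Ruby's stdlib `Matrix` on toy data (not from the gem):

```ruby
require "matrix"

# Toy design matrix (intercept column + one feature) with targets on the
# exact line y = 2x - 1, so the least-squares coefficients are exact.
x = Matrix[[1, 2], [1, 5], [1, 9]]
y = Matrix.column_vector([3, 9, 17])
xt = x.t
xtxi = (xt * x).inverse

a = xtxi * (xt * y) # cheap grouping: (2x2) * (2x1), as in the diff
b = (xtxi * xt) * y # expensive grouping: builds a 2x3 intermediate

a == b # => true; both equal Matrix[[-1], [2]]
```

With integer input, Ruby's `Matrix#inverse` works in exact Rational arithmetic, so the two groupings agree exactly; with floats they agree up to rounding.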
data/lib/eps/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: eps
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-07-
+date: 2018-07-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler