eps 0.1.0 → 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.travis.yml +5 -2
- data/CHANGELOG.md +4 -0
- data/README.md +49 -8
- data/Rakefile +23 -0
- data/guides/Modeling.md +17 -8
- data/lib/eps/base_regressor.rb +52 -3
- data/lib/eps/regressor.rb +8 -37
- data/lib/eps/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
+  data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
+  data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -6,6 +6,9 @@ Linear regression for Ruby
 - Serve models built in Ruby, Python, R, and more
 - Automatically handles categorical variables
 - No external dependencies
+- Works great with the SciRuby ecosystem (Daru & IRuby)
+
+[](https://travis-ci.org/ankane/eps)
 
 ## Installation
 
@@ -58,6 +61,14 @@ split_date = Date.parse("2018-06-01")
 train_set, test_set = houses.partition { |h| h.sold_at < split_date }
 ```
 
+### Outliers and Missing Data
+
+Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+
+```ruby
+train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+```
+
 ### Feature Engineering
 
 Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
@@ -87,20 +98,31 @@ def features(house)
   {
     bedrooms: house.bedrooms,
     city_id: house.city_id.to_s,
-    month: house.sold_at.strftime("%b")
-    price: house.price
+    month: house.sold_at.strftime("%b")
   }
 end
 
-
+train_features = train_set.map { |h| features(h) }
+```
+
+> We use a method for features so it can be used across training, evaluation, and prediction
+
+We also need to prepare the target variable.
+
+```ruby
+def target(house)
+  house.price
+end
+
+train_target = train_set.map { |h| target(h) }
 ```
 
 ### Training
 
-
+Now, let’s train the model.
 
 ```ruby
-model = Eps::Regressor.new(
+model = Eps::Regressor.new(train_features, train_target)
 puts model.summary
 ```
 
@@ -111,18 +133,31 @@ The summary includes the coefficients and their significance. The lower the p-va
 When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
 
 ```ruby
-
-
+test_features = test_set.map { |h| features(h) }
+test_target = test_set.map { |h| target(h) }
+model.evaluate(test_features, test_target)
 ```
 
 This returns:
 
--
+- RMSE - Root mean square error
 - MAE - Mean absolute error
 - ME - Mean error
 
 We want to minimize the RMSE and MAE and keep the ME around 0.
 
+### Finalize
+
+Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+
+```ruby
+all_features = houses.map { |h| features(h) }
+all_target = houses.map { |h| target(h) }
+model = Eps::Regressor.new(all_features, all_target)
+```
+
+We now have a model that’s ready to serve.
+
 ## Serving Models
 
 Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
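An aside on the RMSE/MAE/ME list added above: the README names the metrics but doesn't spell out their formulas. A minimal sketch of the usual definitions (illustrative only, not Eps's internal `Eps.metrics` code; the hash keys here are made up):

```ruby
# illustrative only - not Eps's implementation of Eps.metrics
def metrics_sketch(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |e| e**2 } / n), # root mean square error
    mae: errors.sum { |e| e.abs } / n,            # mean absolute error
    me: errors.sum / n                            # mean error (signed, should hover near 0)
  }
end

metrics_sketch([3.0, 5.0], [2.5, 6.0]) # => {:rmse=>0.79..., :mae=>0.75, :me=>-0.25}
```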
@@ -178,6 +213,8 @@ data = File.read("model.pmml")
 model = Eps::Regressor.load_pmml(data)
 ```
 
+> Loading PMML requires Nokogiri to be installed
+
 [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
 
 ```ruby
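The Nokogiri note above pairs with the `load_pmml` change in `base_regressor.rb` further down: the gem is only required when a PMML string is actually parsed. For a Bundler-managed project, a minimal sketch of what that implies:

```ruby
# Gemfile (sketch) - only needed if you plan to load PMML models
gem "nokogiri"
```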
@@ -326,6 +363,10 @@ When importing data from CSV files, be sure to convert numeric fields. The `tabl
 CSV.table("data.csv").map { |row| row.to_h }
 ```
 
+## Jupyter & IRuby
+
+You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+
 ## Reference
 
 Get coefficients
data/Rakefile
CHANGED
@@ -9,3 +9,26 @@ Rake::TestTask.new(:test) do |t|
 end
 
 task default: :test
+
+task :benchmark do
+  require "benchmark"
+  require "eps"
+  require "gsl" if ENV["GSL"]
+
+  data = []
+  10000.times do
+    row = {}
+    30.times do |i|
+      row[:"x#{i}"] = rand(100)
+    end
+    row[:y] = rand(100)
+    data << row
+  end
+
+  puts "Starting benchmark..."
+
+  time = Benchmark.realtime do
+    Eps::Regressor.new(data, target: :y)
+  end
+  p time.round(1)
+end
data/guides/Modeling.md
CHANGED
@@ -21,7 +21,8 @@ And run:
 library(jsonlite)
 
 model <- lm(dist ~ speed, cars)
-toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+data <- toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+write(data, file="model.json")
 ```
 
 ## R PMML
@@ -38,7 +39,8 @@ And run:
 library(pmml)
 
 model <- lm(dist ~ speed, cars)
-toString(pmml(model))
+data <- toString(pmml(model))
+write(data, file="model.pmml")
 ```
 
 ## R PFA
@@ -55,7 +57,7 @@ And run:
 library(aurelius)
 
 model <- lm(dist ~ speed, cars)
-write_pfa(pfa(model))
+write_pfa(pfa(model), file="model.pfa")
 ```
 
 ## Python JSON
@@ -80,15 +82,19 @@ coefficients = {'_intercept': model.intercept_}
 for i, c in enumerate(model.coef_):
     coefficients[features[i]] = c
 
-
+
+data = json.dumps({'coefficients': coefficients})
+
+with open('model.json', 'w') as f:
+    f.write(data)
 ```
 
 ## Python PMML
 
-Install the [
+Install the [scikit2pmml](https://github.com/vaclavcadek/scikit2pmml) package
 
 ```sh
-pip install
+pip install scikit2pmml
 ```
 
 And run:
@@ -103,7 +109,7 @@ y = [5 * xi + 3 for xi in x]
 model = linear_model.LinearRegression()
 model.fit([[xi] for xi in x], y)
 
-scikit2pmml(estimator=model, file='
+scikit2pmml(estimator=model, file='model.pmml')
 ```
 
 ## Python PFA
@@ -139,5 +145,8 @@ action:
 
     return pfaDocument
 
-json.dumps(pfa(model))
+data = json.dumps(pfa(model))
+
+with open('model.pfa', 'w') as f:
+    f.write(data)
 ```
data/lib/eps/base_regressor.rb
CHANGED
@@ -24,6 +24,22 @@ module Eps
       singular ? pred[0] : pred
     end
 
+    def evaluate(data, y = nil, target: nil)
+      raise ArgumentError, "missing target" if !target && !y
+
+      actual = y
+      actual ||=
+        if daru?(data)
+          data[target].to_a
+        else
+          data.map { |v| v[target] }
+        end
+
+      actual = prep_y(actual)
+      estimated = predict(data)
+      Eps.metrics(actual, estimated)
+    end
+
     # ruby
 
     def self.load(data)
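A quick note on the new `evaluate` signature above: it accepts either a separate array of target values or a `target:` keyword pointing at a column in the rows themselves. A hedged usage sketch (variable names like `test_rows` and `:price` are made up; the exact result keys come from `Eps.metrics`):

```ruby
# 1. trained model, features and targets prepared separately (as in the README)
model.evaluate(test_features, test_target)

# 2. a model loaded from PMML, evaluated against rows that still contain the
#    target column - target: tells evaluate which key holds the actual values
loaded = Eps::Regressor.load_pmml(File.read("model.pmml"))
loaded.evaluate(test_rows, target: :price)
```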
@@ -56,7 +72,11 @@ module Eps
     # pmml
 
     def self.load_pmml(data)
-
+      if data.is_a?(String)
+        require "nokogiri"
+        data = Nokogiri::XML(data)
+      end
+
       # TODO more validation
       node = data.css("RegressionTable")
       coefficients = {
@@ -94,7 +114,7 @@ module Eps
       BaseRegressor.new(coefficients: coefficients)
     end
 
-
+    protected
 
     def daru?(x)
       defined?(Daru) && x.is_a?(Daru::DataFrame)
@@ -132,6 +152,10 @@ module Eps
         xi.each do |k, v|
           key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
           v2 = v.is_a?(String) ? 1 : v
+
+          # TODO make more efficient
+          next if !train && !coefficients.key?(symbolize_coef(key))
+
           raise "Missing data" if v2.nil?
 
           unless cache[key]
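The guard added above (`next if !train && !coefficients.key?(symbolize_coef(key))`) means that, outside of training, keys with no matching coefficient are skipped rather than pushed into the prediction matrix. A rough sketch of the effect, with made-up feature names (behavior inferred from the diff, so treat it as an illustration rather than a guarantee):

```ruby
model = Eps::Regressor.new(
  [{x: 1, y: 3}, {x: 2, y: 5}, {x: 3, y: 7}, {x: 4, y: 9}],
  target: :y
)

# :unused was never seen during training, so the new guard skips it instead of
# looking for a coefficient that doesn't exist
model.predict(x: 5, unused: 1)
```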
@@ -167,7 +191,7 @@ module Eps
       end
 
       # flatten keys
-      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k
+      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| symbolize_coef(k) }
 
       if c.size != c.uniq.size
         raise "Overlapping coefficients"
@@ -176,8 +200,33 @@ module Eps
       [ret2, c]
     end
 
+    def symbolize_coef(k)
+      (k.is_a?(Array) ? k.join("") : k).to_sym
+    end
+
     def matrix_arr(matrix)
       matrix.to_a.map { |xi| xi[0].to_f }
     end
+
+    # determine if target is a string or symbol
+    def prep_target(target, data)
+      if daru?(data)
+        data.has_vector?(target) ? target : flip_target(target)
+      else
+        x = data[0] || {}
+        x[target] ? target : flip_target(target)
+      end
+    end
+
+    def flip_target(target)
+      target.is_a?(String) ? target.to_sym : target.to_s
+    end
+
+    def prep_y(y)
+      y.each do |yi|
+        raise "Target missing in data" if yi.nil?
+      end
+      y.map(&:to_f)
+    end
   end
 end
data/lib/eps/regressor.rb
CHANGED
@@ -52,8 +52,9 @@ module Eps
       # haven't found an efficient way to do QR-factorization in Ruby
       # the extendmatrix gem has householder and givens (givens has bug)
       # but methods are too slow
+      xt = x.t
       begin
-        @xtxi = (
+        @xtxi = (xt * x).inverse
       rescue ExceptionForMatrix::ErrNotRegular
         constant = {}
         (1...x.column_count).each do |i|
@@ -77,15 +78,18 @@ module Eps
           vectors.delete_at(i)
         end
         x = Matrix.columns(vectors)
+        xt = x.t
 
         # try again
         begin
-          @xtxi = (
+          @xtxi = (xt * x).inverse
         rescue ExceptionForMatrix::ErrNotRegular
           raise "Multiple solutions - GSL is needed to select one"
        end
       end
-
+      # huge performance boost
+      # by multiplying xt * y first
+      v2 = matrix_arr(@xtxi * (xt * y))
 
       # add back removed
       removed.sort.each do |i|
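The `# huge performance boost by multiplying xt * y first` comment above is just matrix-multiplication associativity: with `n` rows and `k` features, `@xtxi * (xt * y)` reduces `xt * y` to a `k x 1` vector before the final multiply, whereas `(@xtxi * xt) * y` would materialize a `k x n` intermediate. A standalone sketch (not Eps code) showing the difference with Ruby's `Matrix`:

```ruby
require "matrix"
require "benchmark"

n = 5_000  # rows
k = 30     # features
x = Matrix.build(n, k) { rand }
y = Matrix.column_vector(Array.new(n) { rand })

xt = x.t
xtxi = (xt * x).inverse

left_first  = Benchmark.realtime { (xtxi * xt) * y } # builds a k x n intermediate
right_first = Benchmark.realtime { xtxi * (xt * y) } # intermediate is only k x 1
puts "left-to-right: #{left_first.round(2)}s, xt * y first: #{right_first.round(2)}s"
```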
@@ -101,19 +105,7 @@ module Eps
     end
 
     def evaluate(data, y = nil)
-
-
-      actual ||=
-        if daru?(data)
-          data[@target].to_a
-        else
-          data.map { |v| v[@target] }
-        end
-
-      actual = prep_y(actual)
-
-      estimated = predict(data)
-      Eps.metrics(actual, estimated)
+      super(data, y, target: @target)
     end
 
     # https://people.richland.edu/james/ictcm/2004/multiple.html
@@ -154,27 +146,6 @@ module Eps
       arr.all? { |x| x == arr[0] }
     end
 
-    # determine if target is a string or symbol
-    def prep_target(target, data)
-      if daru?(data)
-        data.has_vector?(target) ? target : flip_target(target)
-      else
-        x = data[0] || {}
-        x[target] ? target : flip_target(target)
-      end
-    end
-
-    def flip_target(target)
-      target.is_a?(String) ? target.to_sym : target.to_s
-    end
-
-    def prep_y(y)
-      y.each do |yi|
-        raise "Target missing in data" if yi.nil?
-      end
-      y.map(&:to_f)
-    end
-
     # add epsilon for perfect fits
     # consistent with GSL
     def t_value
data/lib/eps/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: eps
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-07-
+date: 2018-07-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
|