eps 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: fdb6a5fb57b353ab114eb85f0d5c168ddba510bc309ffb29d07b1f3e16744acd
-   data.tar.gz: fc39e8b4c6c3c79ee7ba52cda09443db805f9724d5f413f77519e0d45554a1d8
+   metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
+   data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
  SHA512:
-   metadata.gz: 87fdd118890f41def2a69fefa840e313b97878d985f8d60a45b77b75bfec33b506d1027d983e27873499211c59c62273341f21d3c5c9b151b8d44bd34fbbfebb
-   data.tar.gz: fd3454f8539b48cbd8ae06fe10ab335aac557d5c8c5738d6531e436321721dc53664068d301da9602677022b8348bdf7c5f6b007b13403f602b21afd82df50b0
+   metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
+   data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
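The SHA256 and SHA512 entries in this hunk are hex digests of the two archives packed inside the gem. As a minimal sketch of how such a digest could be recomputed and compared with Ruby's standard `digest` library (the helper names `sha256_hex` and `checksum_matches?` are hypothetical, and the sample input is a plain string rather than a real `metadata.gz`):

```ruby
require "digest"

# Hypothetical helper: hex SHA256 digest of a string of bytes.
def sha256_hex(bytes)
  Digest::SHA256.hexdigest(bytes)
end

# Hypothetical helper: compare a computed digest with an expected one,
# for example a value copied from checksums.yaml.
def checksum_matches?(bytes, expected_hex)
  sha256_hex(bytes) == expected_hex.strip.downcase
end

# Digest of the empty string, a well-known SHA256 test vector.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

puts checksum_matches?("", EMPTY_SHA256)         # true
puts checksum_matches?("tampered", EMPTY_SHA256) # false
```

For a real archive, the bytes would presumably come from something like `File.binread("data.tar.gz")`.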
@@ -1,7 +1,10 @@
  language: ruby
  rvm: 2.5.1
- sudo: false
- before_install: gem install bundler
+ sudo: required
+ before_install:
+   - gem install bundler
+   - sudo apt-get update
+   - sudo apt-get install -y libgsl0-dev
  script: bundle exec rake test
  env:
    -
@@ -1,3 +1,7 @@
+ ## 0.1.1
+
+ - Huge performance boost
+
  ## 0.1.0

  - First release
data/README.md CHANGED
@@ -6,6 +6,9 @@ Linear regression for Ruby
  - Serve models built in Ruby, Python, R, and more
  - Automatically handles categorical variables
  - No external dependencies
+ - Works great with the SciRuby ecosystem (Daru & IRuby)
+
+ [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)

  ## Installation

@@ -58,6 +61,14 @@ split_date = Date.parse("2018-06-01")
  train_set, test_set = houses.partition { |h| h.sold_at < split_date }
  ```

+ ### Outliers and Missing Data
+
+ Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+
+ ```ruby
+ train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+ ```
+
  ### Feature Engineering

  Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
@@ -87,20 +98,31 @@ def features(house)
    {
      bedrooms: house.bedrooms,
      city_id: house.city_id.to_s,
-     month: house.sold_at.strftime("%b"),
-     price: house.price
+     month: house.sold_at.strftime("%b")
    }
  end

- train_data = train_set.map { |h| features(h) }
+ train_features = train_set.map { |h| features(h) }
+ ```
+
+ > We use a method for features so it can be used across training, evaluation, and prediction
+
+ We also need to prepare the target variable.
+
+ ```ruby
+ def target(house)
+   house.price
+ end
+
+ train_target = train_set.map { |h| target(h) }
  ```

  ### Training

- Once we have some features, let’s train the model.
+ Now, let’s train the model.

  ```ruby
- model = Eps::Regressor.new(train_data, target: :price)
+ model = Eps::Regressor.new(train_features, train_target)
  puts model.summary
  ```

@@ -111,18 +133,31 @@ The summary includes the coefficients and their significance. The lower the p-va
  When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.

  ```ruby
- test_data = test_set.map { |h| features(h) }
- model.evaluate(test_data)
+ test_features = test_set.map { |h| features(h) }
+ test_target = test_set.map { |h| target(h) }
+ model.evaluate(test_features, test_target)
  ```

  This returns:

- - RSME - Root mean square error
+ - RMSE - Root mean square error
  - MAE - Mean absolute error
  - ME - Mean error

  We want to minimize the RMSE and MAE and keep the ME around 0.
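
The three metrics listed in this README hunk can be illustrated with a short sketch. The helper `regression_metrics` is hypothetical and assumes the error is defined as actual minus estimated; the gem's own `Eps.metrics` may differ in sign convention or detail.

```ruby
# Hypothetical helper: error metrics for paired actual/estimated values.
# Assumes error = actual - estimated.
def regression_metrics(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |e| e**2 } / n), # root mean square error
    mae: errors.sum { |e| e.abs } / n,            # mean absolute error
    me: errors.sum / n                            # mean error (signed)
  }
end

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
puts metrics
```

Because ME keeps the sign of each error, over- and under-predictions cancel out, which is why a value near 0 indicates little systematic bias, while RMSE and MAE measure the size of the errors.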

+ ### Finalize
+
+ Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+
+ ```ruby
+ all_features = houses.map { |h| features(h) }
+ all_target = houses.map { |h| target(h) }
+ model = Eps::Regressor.new(all_features, all_target)
+ ```
+
+ We now have a model that’s ready to serve.
+
  ## Serving Models

  Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
  Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
@@ -178,6 +213,8 @@ data = File.read("model.pmml")
  model = Eps::Regressor.load_pmml(data)
  ```

+ > Loading PMML requires Nokogiri to be installed
+
  [PFA](http://dmg.org/pfa/) - Portable Format for Analytics

  ```ruby
@@ -326,6 +363,10 @@ When importing data from CSV files, be sure to convert numeric fields. The `tabl
  CSV.table("data.csv").map { |row| row.to_h }
  ```

+ ## Jupyter & IRuby
+
+ You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+
  ## Reference

  Get coefficients
data/Rakefile CHANGED
@@ -9,3 +9,26 @@ Rake::TestTask.new(:test) do |t|
  end

  task default: :test
+
+ task :benchmark do
+   require "benchmark"
+   require "eps"
+   require "gsl" if ENV["GSL"]
+
+   data = []
+   10000.times do
+     row = {}
+     30.times do |i|
+       row[:"x#{i}"] = rand(100)
+     end
+     row[:y] = rand(100)
+     data << row
+   end
+
+   puts "Starting benchmark..."
+
+   time = Benchmark.realtime do
+     Eps::Regressor.new(data, target: :y)
+   end
+   p time.round(1)
+ end
@@ -21,7 +21,8 @@ And run:
  library(jsonlite)

  model <- lm(dist ~ speed, cars)
- toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+ data <- toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+ write(data, file="model.json")
  ```

  ## R PMML
@@ -38,7 +39,8 @@ And run:
  library(pmml)

  model <- lm(dist ~ speed, cars)
- toString(pmml(model))
+ data <- toString(pmml(model))
+ write(data, file="model.pmml")
  ```

  ## R PFA
@@ -55,7 +57,7 @@ And run:
  library(aurelius)

  model <- lm(dist ~ speed, cars)
- write_pfa(pfa(model))
+ write_pfa(pfa(model), file="model.pfa")
  ```

  ## Python JSON
@@ -80,15 +82,19 @@ coefficients = {'_intercept': model.intercept_}
  for i, c in enumerate(model.coef_):
      coefficients[features[i]] = c

- print(json.dumps({'coefficients': coefficients}))
+
+ data = json.dumps({'coefficients': coefficients})
+
+ with open('model.json', 'w') as f:
+     f.write(data)
  ```

  ## Python PMML

- Install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package
+ Install the [scikit2pmml](https://github.com/vaclavcadek/scikit2pmml) package

  ```sh
- pip install sklearn2pmml
+ pip install scikit2pmml
  ```

  And run:
@@ -103,7 +109,7 @@ y = [5 * xi + 3 for xi in x]
  model = linear_model.LinearRegression()
  model.fit([[xi] for xi in x], y)

- scikit2pmml(estimator=model, file='pymodel.pmml')
+ scikit2pmml(estimator=model, file='model.pmml')
  ```

  ## Python PFA
@@ -139,5 +145,8 @@ action:

      return pfaDocument

- json.dumps(pfa(model))
+ data = json.dumps(pfa(model))
+
+ with open('model.pfa', 'w') as f:
+     f.write(data)
  ```
@@ -24,6 +24,22 @@ module Eps
        singular ? pred[0] : pred
      end

+     def evaluate(data, y = nil, target: nil)
+       raise ArgumentError, "missing target" if !target && !y
+
+       actual = y
+       actual ||=
+         if daru?(data)
+           data[target].to_a
+         else
+           data.map { |v| v[target] }
+         end
+
+       actual = prep_y(actual)
+       estimated = predict(data)
+       Eps.metrics(actual, estimated)
+     end
+
      # ruby

      def self.load(data)
@@ -56,7 +72,11 @@ module Eps
      # pmml

      def self.load_pmml(data)
-       data = Nokogiri::XML(data) if data.is_a?(String)
+       if data.is_a?(String)
+         require "nokogiri"
+         data = Nokogiri::XML(data)
+       end
+
        # TODO more validation
        node = data.css("RegressionTable")
        coefficients = {
@@ -94,7 +114,7 @@ module Eps
        BaseRegressor.new(coefficients: coefficients)
      end

-     private
+     protected

      def daru?(x)
        defined?(Daru) && x.is_a?(Daru::DataFrame)
@@ -132,6 +152,10 @@ module Eps
          xi.each do |k, v|
            key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
            v2 = v.is_a?(String) ? 1 : v
+
+           # TODO make more efficient
+           next if !train && !coefficients.key?(symbolize_coef(key))
+
            raise "Missing data" if v2.nil?

            unless cache[key]
@@ -167,7 +191,7 @@ module Eps
        end

        # flatten keys
-       c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k.is_a?(Array) ? k.join("") : k).to_sym }
+       c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| symbolize_coef(k) }

        if c.size != c.uniq.size
          raise "Overlapping coefficients"
@@ -176,8 +200,33 @@ module Eps
        [ret2, c]
      end

+     def symbolize_coef(k)
+       (k.is_a?(Array) ? k.join("") : k).to_sym
+     end
+
      def matrix_arr(matrix)
        matrix.to_a.map { |xi| xi[0].to_f }
      end
+
+     # determine if target is a string or symbol
+     def prep_target(target, data)
+       if daru?(data)
+         data.has_vector?(target) ? target : flip_target(target)
+       else
+         x = data[0] || {}
+         x[target] ? target : flip_target(target)
+       end
+     end
+
+     def flip_target(target)
+       target.is_a?(String) ? target.to_sym : target.to_s
+     end
+
+     def prep_y(y)
+       y.each do |yi|
+         raise "Target missing in data" if yi.nil?
+       end
+       y.map(&:to_f)
+     end
    end
  end
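To make the key handling in this hunk concrete, here is a small sketch that copies `symbolize_coef` and `flip_target` out of the class as standalone top-level methods (an arrangement chosen only for illustration; in the gem they are protected instance methods) and shows what they return for a numeric feature key, a categorical `[name, value]` key, and a target given as a string or a symbol.

```ruby
# Standalone copies of two helpers from the hunk above, for illustration only.
def symbolize_coef(k)
  (k.is_a?(Array) ? k.join("") : k).to_sym
end

def flip_target(target)
  target.is_a?(String) ? target.to_sym : target.to_s
end

p symbolize_coef(:bedrooms)       # :bedrooms
p symbolize_coef([:city_id, "1"]) # :city_id1
p flip_target("price")            # :price
p flip_target(:price)             # "price"
```

Because a categorical key is flattened by joining the feature name and its value, a numeric feature named `city_id1` and a categorical `city_id` with value `"1"` would map to the same symbol, which appears to be the situation the "Overlapping coefficients" check guards against.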
@@ -52,8 +52,9 @@ module Eps
          # haven't found an efficient way to do QR-factorization in Ruby
          # the extendmatrix gem has householder and givens (givens has bug)
          # but methods are too slow
+         xt = x.t
          begin
-           @xtxi = (x.t * x).inverse
+           @xtxi = (xt * x).inverse
          rescue ExceptionForMatrix::ErrNotRegular
            constant = {}
            (1...x.column_count).each do |i|
@@ -77,15 +78,18 @@ module Eps
              vectors.delete_at(i)
            end
            x = Matrix.columns(vectors)
+           xt = x.t

            # try again
            begin
-             @xtxi = (x.t * x).inverse
+             @xtxi = (xt * x).inverse
            rescue ExceptionForMatrix::ErrNotRegular
              raise "Multiple solutions - GSL is needed to select one"
            end
          end
-         v2 = matrix_arr(@xtxi * x.t * y)
+         # huge performance boost
+         # by multiplying xt * y first
+         v2 = matrix_arr(@xtxi * (xt * y))

          # add back removed
          removed.sort.each do |i|
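The regrouping in the hunk above relies on matrix multiplication being associative: `@xtxi * x.t * y` is evaluated left to right, so it first builds a p x n intermediate matrix, while `@xtxi * (xt * y)` only ever builds a p x 1 vector. A minimal sketch with the standard `matrix` library, using made-up random data and hypothetical sizes `n` and `p_cols`, suggests both groupings give the same coefficients:

```ruby
require "matrix"

# Small random design matrix: n rows, p_cols columns (first column is the intercept).
n = 200
p_cols = 4
srand(42)
x = Matrix.build(n, p_cols) { |_, j| j.zero? ? 1.0 : rand(100).to_f }
y = Matrix.column_vector(Array.new(n) { rand(100).to_f })

xt = x.t
xtxi = (xt * x).inverse

# Left to right: (xtxi * xt) is a p x n matrix, about p * p * n multiplications.
slow = (xtxi * xt) * y
# Regrouped: (xt * y) is only a p x 1 vector, about p * n multiplications.
fast = xtxi * (xt * y)

# Largest absolute difference between the two coefficient vectors.
max_diff = (slow - fast).to_a.flatten.map(&:abs).max
puts max_diff < 1e-6
```

With the benchmark task's shape (10000 rows, 30 features plus an intercept), the left-to-right grouping needs roughly 31 times more multiplications for this step, which is consistent with the "huge performance boost" comment, though the actual speedup depends on the Ruby version and data.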
@@ -101,19 +105,7 @@ module Eps
      end

      def evaluate(data, y = nil)
-       actual = y
-
-       actual ||=
-         if daru?(data)
-           data[@target].to_a
-         else
-           data.map { |v| v[@target] }
-         end
-
-       actual = prep_y(actual)
-
-       estimated = predict(data)
-       Eps.metrics(actual, estimated)
+       super(data, y, target: @target)
      end

      # https://people.richland.edu/james/ictcm/2004/multiple.html
@@ -154,27 +146,6 @@ module Eps
        arr.all? { |x| x == arr[0] }
      end

-     # determine if target is a string or symbol
-     def prep_target(target, data)
-       if daru?(data)
-         data.has_vector?(target) ? target : flip_target(target)
-       else
-         x = data[0] || {}
-         x[target] ? target : flip_target(target)
-       end
-     end
-
-     def flip_target(target)
-       target.is_a?(String) ? target.to_sym : target.to_s
-     end
-
-     def prep_y(y)
-       y.each do |yi|
-         raise "Target missing in data" if yi.nil?
-       end
-       y.map(&:to_f)
-     end
-
      # add epsilon for perfect fits
      # consistent with GSL
      def t_value
@@ -1,3 +1,3 @@
  module Eps
-   VERSION = "0.1.0"
+   VERSION = "0.1.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: eps
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  platform: ruby
  authors:
  - Andrew Kane
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2018-07-03 00:00:00.000000000 Z
+ date: 2018-07-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler