eps 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: fdb6a5fb57b353ab114eb85f0d5c168ddba510bc309ffb29d07b1f3e16744acd
4
+ data.tar.gz: fc39e8b4c6c3c79ee7ba52cda09443db805f9724d5f413f77519e0d45554a1d8
5
+ SHA512:
6
+ metadata.gz: 87fdd118890f41def2a69fefa840e313b97878d985f8d60a45b77b75bfec33b506d1027d983e27873499211c59c62273341f21d3c5c9b151b8d44bd34fbbfebb
7
+ data.tar.gz: fd3454f8539b48cbd8ae06fe10ab335aac557d5c8c5738d6531e436321721dc53664068d301da9602677022b8348bdf7c5f6b007b13403f602b21afd82df50b0
@@ -0,0 +1,9 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+ *.lock
@@ -0,0 +1,12 @@
1
+ language: ruby
2
+ rvm: 2.5.1
3
+ sudo: false
4
+ before_install: gem install bundler
5
+ script: bundle exec rake test
6
+ env:
7
+ -
8
+ - GSL=t
9
+ notifications:
10
+ email:
11
+ on_success: never
12
+ on_failure: change
@@ -0,0 +1,3 @@
1
+ ## 0.1.0
2
+
3
+ - First release
data/Gemfile ADDED
@@ -0,0 +1,11 @@
1
+ source "https://rubygems.org"
2
+
3
+ git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
4
+
5
+ # Specify your gem's dependencies in eps.gemspec
6
+ gemspec
7
+
8
+ # remove when 0.2.1 released
9
+ gem "daru", github: "sciruby/daru"
10
+
11
+ gem "gsl" if ENV["GSL"]
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2018 Andrew Kane
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,363 @@
1
+ # Eps
2
+
3
+ Linear regression for Ruby
4
+
5
+ - Build models quickly and easily
6
+ - Serve models built in Ruby, Python, R, and more
7
+ - Automatically handles categorical variables
8
+ - No external dependencies
9
+
10
+ ## Installation
11
+
12
+ Add this line to your application’s Gemfile:
13
+
14
+ ```ruby
15
+ gem 'eps'
16
+ ```
17
+
18
+ To speed up training on large datasets, you can also [add GSL](#training-performance).
19
+
20
+ ## Getting Started
21
+
22
+ Create a model
23
+
24
+ ```ruby
25
+ data = [
26
+ {bedrooms: 1, bathrooms: 1, price: 100000},
27
+ {bedrooms: 2, bathrooms: 1, price: 125000},
28
+ {bedrooms: 2, bathrooms: 2, price: 135000},
29
+ {bedrooms: 3, bathrooms: 2, price: 162000}
30
+ ]
31
+ model = Eps::Regressor.new(data, target: :price)
32
+ puts model.summary
33
+ ```
34
+
35
+ Make a prediction
36
+
37
+ ```ruby
38
+ model.predict(bedrooms: 2, bathrooms: 1)
39
+ ```
40
+
41
+ > Pass an array of hashes to make multiple predictions at once
42
+
43
+ ## Building Models
44
+
45
+ ### Training and Test Sets
46
+
47
+ When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.
48
+
49
+ ```ruby
50
+ rng = Random.new(1) # seed random number generator
51
+ train_set, test_set = houses.partition { rng.rand < 0.7 }
52
+ ```
53
+
54
+ If your data has a time associated with it, we recommend splitting on this.
55
+
56
+ ```ruby
57
+ split_date = Date.parse("2018-06-01")
58
+ train_set, test_set = houses.partition { |h| h.sold_at < split_date }
59
+ ```
60
+
61
+ ### Feature Engineering
62
+
63
+ Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
64
+
65
+ ```ruby
66
+ {state: "CA"}
67
+ ```
68
+
69
+ > Categorical features generate coefficients for each distinct value except for one
70
+
71
+ You should do this for any ids in your data.
72
+
73
+ ```ruby
74
+ {city_id: "123"}
75
+ ```
76
+
77
+ For times, create features like day of week and hour of day with:
78
+
79
+ ```ruby
80
+ {weekday: time.wday.to_s, hour: time.hour.to_s}
81
+ ```
82
+
83
+ In practice, your code may look like:
84
+
85
+ ```ruby
86
+ def features(house)
87
+ {
88
+ bedrooms: house.bedrooms,
89
+ city_id: house.city_id.to_s,
90
+ month: house.sold_at.strftime("%b"),
91
+ price: house.price
92
+ }
93
+ end
94
+
95
+ train_data = train_set.map { |h| features(h) }
96
+ ```
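To make the note about categorical coefficients concrete, here is a minimal sketch of dummy encoding with one level dropped as the baseline. `dummy_encode` is a hypothetical helper written for illustration, not part of the Eps API, and the way Eps chooses the dropped level internally may differ.

```ruby
# Hypothetical helper: expand one categorical feature into 0/1 columns,
# dropping the first distinct value so it acts as the baseline.
def dummy_encode(rows, key)
  levels = rows.map { |r| r[key] }.uniq
  kept = levels.drop(1) # the first level is absorbed by the intercept
  rows.map do |r|
    kept.map { |level| [:"#{key}#{level}", r[key] == level ? 1 : 0] }.to_h
  end
end

rows = [{state: "CA"}, {state: "NY"}, {state: "TX"}, {state: "NY"}]
encoded = dummy_encode(rows, :state)
# Three distinct states produce two columns: stateNY and stateTX.
# A "CA" row is all zeros, so its effect is carried by the intercept.
```

This is why a feature with n distinct values yields n - 1 coefficients in the summary.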
97
+
98
+ ### Training
99
+
100
+ Once we have some features, let’s train the model.
101
+
102
+ ```ruby
103
+ model = Eps::Regressor.new(train_data, target: :price)
104
+ puts model.summary
105
+ ```
106
+
107
+ The summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
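As a rough illustration of how r-squared and adjusted r-squared relate, the sketch below computes both from actual and fitted values. It follows the formulas in `lib/eps/regressor.rb` (`(sst - sse) / sst` and `(mst - mse) / mst`); the helper name `fit_stats` and the sample numbers are assumptions made for this example.

```ruby
# Hypothetical helper: r-squared and adjusted r-squared from actual values,
# fitted values, and the number of coefficients (including the intercept).
def fit_stats(actual, fitted, num_coefficients)
  n = actual.size
  y_bar = actual.sum / n.to_f
  sst = actual.sum { |y| (y - y_bar)**2 }              # total sum of squares
  sse = actual.zip(fitted).sum { |y, yh| (y - yh)**2 } # sum of squared errors
  mst = sst / (n - 1)                                  # mean square total
  mse = sse / (n - num_coefficients)                   # mean squared error
  {r2: (sst - sse) / sst, adjusted_r2: (mst - mse) / mst}
end

stats = fit_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], 2)
# r2 is about 0.98 and adjusted_r2 about 0.97 for these made-up numbers
```

Adjusted r-squared divides by degrees of freedom, so adding a feature that does not reduce the error enough lowers it, while plain r-squared never decreases.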
108
+
109
+ ### Evaluation
110
+
111
+ When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
112
+
113
+ ```ruby
114
+ test_data = test_set.map { |h| features(h) }
115
+ model.evaluate(test_data)
116
+ ```
117
+
118
+ This returns:
119
+
120
+ - RMSE - Root mean square error
121
+ - MAE - Mean absolute error
122
+ - ME - Mean error
123
+
124
+ We want to minimize the RMSE and MAE and keep the ME around 0.
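For reference, the three metrics can be sketched in a few lines of plain Ruby. This mirrors the definitions in `lib/eps/metrics.rb`, where each error is the actual value minus the estimate; `regression_metrics` is a hypothetical name used only here.

```ruby
# Hypothetical helper mirroring the metric definitions.
def regression_metrics(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |v| v**2 } / n), # penalizes large errors more
    mae: errors.sum(&:abs) / n,                   # average size of the error
    me: errors.sum / n                            # signed, so it reveals bias
  }
end

metrics = regression_metrics([100, 200, 300], [110, 190, 330])
# errors are [-10, 10, -30], so me is -10.0: the model overestimates on average
```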
125
+
126
+ ## Serving Models
127
+
128
+ Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
129
+
130
+ ```ruby
131
+ model.dump
132
+ ```
133
+
134
+ Then hardcode the result into your app.
135
+
136
+ ```ruby
137
+ data = {:coefficients=>{:_intercept=>63500.0, :bedrooms=>26000.0, :bathrooms=>10000.0}}
138
+ model = Eps::Regressor.load(data)
139
+ ```
140
+
141
+ Now we can use it to make predictions.
142
+
143
+ ```ruby
144
+ model.predict(bedrooms: 2, bathrooms: 1)
145
+ ```
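Conceptually, a prediction is the intercept plus each coefficient multiplied by its feature value. The sketch below shows that arithmetic for numeric features only, using the coefficients from the example above; `predict_from` is a hypothetical helper, not how Eps exposes prediction.

```ruby
coefficients = {_intercept: 63500.0, bedrooms: 26000.0, bathrooms: 10000.0}

# Hypothetical helper: intercept + sum of coefficient * value.
# Features without a stored coefficient contribute 0.
def predict_from(coefficients, features)
  features.reduce(coefficients.fetch(:_intercept, 0.0)) do |sum, (name, value)|
    sum + coefficients.fetch(name, 0.0) * value
  end
end

prediction = predict_from(coefficients, bedrooms: 2, bathrooms: 1)
# => 125500.0 (63500 + 2 * 26000 + 1 * 10000)
```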
146
+
147
+ Another option that works well is writing the model to a file in your app.
148
+
149
+ ```ruby
150
+ json = model.to_json
151
+ File.open("lib/models/housing_price.json", "w") { |f| f.write(json) }
152
+ ```
153
+
154
+ To load it, use:
155
+
156
+ ```ruby
157
+ json = File.read("lib/models/housing_price.json")
158
+ model = Eps::Regressor.load_json(json)
159
+ ```
160
+
161
+ To continuously train models, we recommend [storing them in your database](#database-storage).
162
+
163
+ ### Beyond Ruby
164
+
165
+ Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language. Eps can load models in:
166
+
167
+ JSON
168
+
169
+ ```ruby
170
+ data = File.read("model.json")
171
+ model = Eps::Regressor.load_json(data)
172
+ ```
173
+
174
+ [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - Predictive Model Markup Language
175
+
176
+ ```ruby
177
+ data = File.read("model.pmml")
178
+ model = Eps::Regressor.load_pmml(data)
179
+ ```
180
+
181
+ [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
182
+
183
+ ```ruby
184
+ data = File.read("model.pfa")
185
+ model = Eps::Regressor.load_pfa(data)
186
+ ```
187
+
188
+ Here are examples for how to dump models in each:
189
+
190
+ - [R JSON](guides/Modeling.md#r-json)
191
+ - [R PMML](guides/Modeling.md#r-pmml)
192
+ - [R PFA](guides/Modeling.md#r-pfa)
193
+ - [Python JSON](guides/Modeling.md#python-json)
194
+ - [Python PMML](guides/Modeling.md#python-pmml)
195
+ - [Python PFA](guides/Modeling.md#python-pfa)
196
+
197
+ ### Verifying
198
+
199
+ It’s important for features to be implemented consistently when serving models created in other languages. We highly recommend verifying this programmatically. Create a CSV file with ids and predictions from the original model.
200
+
201
+ house_id | prediction
202
+ --- | ---
203
+ 1 | 145000
204
+ 2 | 123000
205
+ 3 | 250000
206
+
207
+ Once the model is implemented in Ruby, confirm the predictions match.
208
+
209
+ ```ruby
210
+ model = Eps::Regressor.load_json(File.read("model.json"))
211
+
212
+ # preload houses to prevent n+1
213
+ houses = House.all.index_by(&:id)
214
+
215
+ CSV.foreach("predictions.csv", headers: true) do |row|
216
+ house = houses[row["house_id"].to_i]
217
+ expected = row["prediction"].to_f
218
+
219
+ actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)
220
+
221
+ unless (actual - expected).abs < 0.001
222
+ raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})"
223
+ end
224
+
225
+ putc "✓"
226
+ end
227
+ ```
228
+
229
+ ### Database Storage
230
+
231
+ The database is another place you can store models. It’s good if you retrain models automatically.
232
+
233
+ > We recommend adding monitoring and guardrails as well if you retrain automatically
234
+
235
+ Create an ActiveRecord model to store the predictive model.
236
+
237
+ ```sh
238
+ rails g model Model key:string:uniq data:text
239
+ ```
240
+
241
+ Store the model with:
242
+
243
+ ```ruby
244
+ store = Model.where(key: "housing_price").first_or_initialize
245
+ store.update(data: model.to_json)
246
+ ```
247
+
248
+ Load the model with:
249
+
250
+ ```ruby
251
+ data = Model.find_by!(key: "housing_price").data
252
+ model = Eps::Regressor.load_json(data)
253
+ ```
254
+
255
+ ## Monitoring
256
+
257
+ We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
258
+
259
+ ```ruby
260
+ actual = houses.map(&:price)
261
+ estimated = houses.map(&:estimated_price)
262
+ Eps.metrics(actual, estimated)
263
+ ```
264
+
265
+ This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0.
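One way to turn this advice into a check is sketched below. The threshold constants and the `metric_alerts` helper are hypothetical and should be tuned to your data; the metrics hash uses the same `:rmse`, `:mae`, and `:me` keys that `Eps.metrics` returns.

```ruby
# Hypothetical thresholds; choose values that suit your data.
RMSE_LIMIT = 20_000
MAE_LIMIT = 15_000
ME_LIMIT = 5_000

# Returns a list of human-readable alerts for a metrics hash.
def metric_alerts(metrics)
  alerts = []
  alerts << "RMSE above #{RMSE_LIMIT}" if metrics[:rmse] > RMSE_LIMIT
  alerts << "MAE above #{MAE_LIMIT}" if metrics[:mae] > MAE_LIMIT
  alerts << "ME more than #{ME_LIMIT} from 0" if metrics[:me].abs > ME_LIMIT
  alerts
end

alerts = metric_alerts(rmse: 25_000.0, mae: 12_000.0, me: -6_000.0)
# Two alerts here: RMSE is too high and ME has drifted below zero
```

Checking the absolute value of ME catches drift in either direction.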
266
+
267
+ ## Training Performance
268
+
269
+ Speed up training on large datasets with GSL.
270
+
271
+ First, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
272
+
273
+ ```sh
274
+ brew install gsl
275
+ ```
276
+
277
+ Then, add this line to your application’s Gemfile:
278
+
279
+ ```ruby
280
+ gem 'gsl', group: :development
281
+ ```
282
+
283
+ It only needs to be available in environments used to build the model.
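For context, when GSL is not used, `lib/eps/regressor.rb` fits the model with Ruby's `Matrix` class by solving the normal equations. The snippet below is a simplified sketch of that idea on the housing data from Getting Started; it leaves out the handling of constant and dependent columns that the library adds.

```ruby
require "matrix"

# Each row starts with 1 for the intercept, followed by bedrooms and bathrooms
x = Matrix.rows([[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 2]])
y = Matrix.column_vector([100000, 125000, 135000, 162000])

# Normal equations: coefficients = (X'X)^-1 X'y
coef = ((x.t * x).inverse * x.t * y).to_a.flatten.map(&:to_f)
# => [63500.0, 26000.0, 10000.0] (intercept, bedrooms, bathrooms)
```

Inverting X'X works for small, well-conditioned data, but it is slow and can be numerically unstable on large datasets, which is where GSL helps.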
284
+
285
+ ## Data
286
+
287
+ A number of data formats are supported. You can pass the target variable separately.
288
+
289
+ ```ruby
290
+ x = [{x: 1}, {x: 2}, {x: 3}]
291
+ y = [1, 2, 3]
292
+ Eps::Regressor.new(x, y)
293
+ ```
294
+
295
+ Or pass arrays of arrays
296
+
297
+ ```ruby
298
+ x = [[1, 2], [2, 0], [3, 1]]
299
+ y = [1, 2, 3]
300
+ Eps::Regressor.new(x, y)
301
+ ```
302
+
303
+ ## Daru
304
+
305
+ Eps works well with Daru data frames.
306
+
307
+ ```ruby
308
+ df = Daru::DataFrame.from_csv("houses.csv")
309
+ Eps::Regressor.new(df, target: "price")
310
+ ```
311
+
312
+ To split into training and test sets, use:
313
+
314
+ ```ruby
315
+ rng = Random.new(1) # seed random number generator
316
+ train_index = houses.map { rng.rand < 0.7 }
317
+ train_set = houses.where(train_index)
318
+ test_set = houses.where(train_index.map { |v| !v })
319
+ ```
320
+
321
+ ## CSVs
322
+
323
+ When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
324
+
325
+ ```ruby
326
+ CSV.table("data.csv").map { |row| row.to_h }
327
+ ```
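The conversion matters because Eps treats string values as categorical. As a small sketch of the difference, the example below parses the same CSV text with and without the numeric converter that `CSV.table` enables; the sample data is made up.

```ruby
require "csv"

csv_text = "bedrooms,price\n1,100000\n2,125000\n"

# Without converters, every field stays a String
raw = CSV.parse(csv_text, headers: true).map(&:to_h)

# CSV.table reads a file with these same options:
# numeric conversion and symbol header names
converted = CSV.parse(
  csv_text,
  headers: true,
  converters: :numeric,
  header_converters: :symbol
).map(&:to_h)

raw.first["price"].class     # String
converted.first[:price].class # Integer
```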
328
+
329
+ ## Reference
330
+
331
+ Get coefficients
332
+
333
+ ```ruby
334
+ model.coefficients
335
+ ```
336
+
337
+ Get an extended summary with standard error, t-values, and r-squared
338
+
339
+ ```ruby
340
+ model.summary(extended: true)
341
+ ```
342
+
343
+ ## History
344
+
345
+ View the [changelog](https://github.com/ankane/eps/blob/master/CHANGELOG.md)
346
+
347
+ ## Contributing
348
+
349
+ Everyone is encouraged to help improve this project. Here are a few ways you can help:
350
+
351
+ - [Report bugs](https://github.com/ankane/eps/issues)
352
+ - Fix bugs and [submit pull requests](https://github.com/ankane/eps/pulls)
353
+ - Write, clarify, or fix documentation
354
+ - Suggest or add new features
355
+
356
+ To get started with development and testing:
357
+
358
+ ```sh
359
+ git clone https://github.com/ankane/eps.git
360
+ cd eps
361
+ bundle install
362
+ rake test
363
+ ```
@@ -0,0 +1,11 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.libs << "lib"
7
+ t.test_files = FileList["test/**/*_test.rb"]
8
+ t.warning = false
9
+ end
10
+
11
+ task default: :test
@@ -0,0 +1,30 @@
1
+
2
+ lib = File.expand_path("../lib", __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require "eps/version"
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "eps"
8
+ spec.version = Eps::VERSION
9
+ spec.authors = ["Andrew Kane"]
10
+ spec.email = ["andrew@chartkick.com"]
11
+
12
+ spec.summary = "Linear regression for Ruby"
13
+ spec.homepage = "https://github.com/ankane/eps"
14
+ spec.license = "MIT"
15
+
16
+ # Specify which files should be added to the gem when it is released.
17
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
18
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
19
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
20
+ end
21
+ spec.bindir = "exe"
22
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
23
+ spec.require_paths = ["lib"]
24
+
25
+ spec.add_development_dependency "bundler"
26
+ spec.add_development_dependency "daru"
27
+ spec.add_development_dependency "minitest"
28
+ spec.add_development_dependency "nokogiri"
29
+ spec.add_development_dependency "rake"
30
+ end
@@ -0,0 +1,143 @@
1
+ # Modeling
2
+
3
+ - [R JSON](#r-json)
4
+ - [R PMML](#r-pmml)
5
+ - [R PFA](#r-pfa)
6
+ - [Python JSON](#python-json)
7
+ - [Python PMML](#python-pmml)
8
+ - [Python PFA](#python-pfa)
9
+
10
+ ## R JSON
11
+
12
+ Install the [jsonlite](https://cran.r-project.org/package=jsonlite) package
13
+
14
+ ```r
15
+ install.packages("jsonlite")
16
+ ```
17
+
18
+ And run:
19
+
20
+ ```r
21
+ library(jsonlite)
22
+
23
+ model <- lm(dist ~ speed, cars)
24
+ toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
25
+ ```
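On the Ruby side, R names the intercept `(Intercept)`, which `Eps::Regressor.load_json` renames to `_intercept`. The sketch below reproduces that renaming by hand so the mapping is visible; the JSON string and its values are illustrative and not output captured from R.

```ruby
require "json"

# Illustrative JSON in the shape produced by the R snippet above
r_json = '{"coefficients":{"(Intercept)":-17.5791,"speed":3.9324}}'

coefficients = JSON.parse(r_json)["coefficients"]

# Rename R's intercept key to the name used on the Ruby side
if coefficients.key?("(Intercept)")
  coefficients["_intercept"] = coefficients.delete("(Intercept)")
end

# Symbolize keys, matching how coefficients are stored
coefficients = coefficients.map { |k, v| [k.to_sym, v] }.to_h
```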
26
+
27
+ ## R PMML
28
+
29
+ Install the [pmml](https://cran.r-project.org/package=pmml) package
30
+
31
+ ```r
32
+ install.packages("pmml")
33
+ ```
34
+
35
+ And run:
36
+
37
+ ```r
38
+ library(pmml)
39
+
40
+ model <- lm(dist ~ speed, cars)
41
+ toString(pmml(model))
42
+ ```
43
+
44
+ ## R PFA
45
+
46
+ Install the [aurelius](https://cran.r-project.org/package=aurelius) package
47
+
48
+ ```r
49
+ install.packages("aurelius")
50
+ ```
51
+
52
+ And run:
53
+
54
+ ```r
55
+ library(aurelius)
56
+
57
+ model <- lm(dist ~ speed, cars)
58
+ write_pfa(pfa(model))
59
+ ```
60
+
61
+ ## Python JSON
62
+
63
+ Run:
64
+
65
+ ```python
66
+ from sklearn import linear_model
67
+ import pandas as pd
68
+ import json
69
+
70
+ x = [1, 2, 3, 4, 5]
71
+ y = [5 * xi + 3 for xi in x]
72
+
73
+ df = pd.DataFrame({'x': x, 'y': y})
74
+ features = ['x']
75
+
76
+ model = linear_model.LinearRegression()
77
+ model.fit(df[features], df['y'])
78
+
79
+ coefficients = {'_intercept': model.intercept_}
80
+ for i, c in enumerate(model.coef_):
81
+ coefficients[features[i]] = c
82
+
83
+ print(json.dumps({'coefficients': coefficients}))
84
+ ```
85
+
86
+ ## Python PMML
87
+
88
+ Install the [scikit2pmml](https://github.com/vaclavcadek/scikit2pmml) package
89
+
90
+ ```sh
91
+ pip install scikit2pmml
92
+ ```
93
+
94
+ And run:
95
+
96
+ ```python
97
+ from sklearn import linear_model
98
+ from scikit2pmml import scikit2pmml
99
+
100
+ x = [1, 2, 3, 5, 6]
101
+ y = [5 * xi + 3 for xi in x]
102
+
103
+ model = linear_model.LinearRegression()
104
+ model.fit([[xi] for xi in x], y)
105
+
106
+ scikit2pmml(estimator=model, file='pymodel.pmml')
107
+ ```
108
+
109
+ ## Python PFA
110
+
111
+ Install the [Titus](https://github.com/opendatagroup/hadrian) package and run:
112
+
113
+ ```python
114
+ from sklearn import linear_model
115
+ import titus.prettypfa
116
+ import json
117
+
118
+ x = [1, 2, 3, 5, 6]
119
+ y = [5 * xi + 3 for xi in x]
120
+
121
+ model = linear_model.LinearRegression()
122
+ model.fit([[xi] for xi in x], y)
123
+
124
+ def pfa(estimator):
125
+ pfaDocument = titus.prettypfa.jsonNode('''
126
+ types:
127
+ Regression = record(Regression,
128
+ const: double,
129
+ coeff: array(double))
130
+ input: array(double)
131
+ output: double
132
+ cells:
133
+ regression(Regression) = {const: 0.0, coeff: []}
134
+ action:
135
+ model.reg.linear(input, regression)
136
+ ''')
137
+
138
+ pfaDocument["cells"]["regression"]["init"] = {"const": estimator.intercept_, "coeff": list(estimator.coef_)}
139
+
140
+ return pfaDocument
141
+
142
+ print(json.dumps(pfa(model)))
143
+ ```
@@ -0,0 +1,15 @@
1
+ # dependencies
2
+ require "matrix"
3
+ require "json"
4
+
5
+ # modules
6
+ require "eps/base_regressor"
7
+ require "eps/metrics"
8
+ require "eps/regressor"
9
+ require "eps/version"
10
+
11
+ module Eps
12
+ def self.metrics(actual, estimated)
13
+ Eps::Metrics.new(actual, estimated).all
14
+ end
15
+ end
@@ -0,0 +1,183 @@
1
+ module Eps
2
+ class BaseRegressor
3
+ attr_reader :coefficients
4
+
5
+ def initialize(coefficients:)
6
+ @coefficients = Hash[coefficients.map { |k, v| [k.to_sym, v] }]
7
+ end
8
+
9
+ def predict(x)
10
+ singular = !(x.is_a?(Array) || daru?(x))
11
+ x = [x] if singular
12
+ x, c = prep_x(x, train: false)
13
+ coef = c.map do |v|
14
+ # use 0 if coefficient does not exist
15
+ # this can happen for categorical features
16
+ # since only n-1 coefficients are stored
17
+ coefficients[v] || 0
18
+ end
19
+
20
+ x = Matrix.rows(x)
21
+ c = Matrix.column_vector(coef)
22
+ pred = matrix_arr(x * c)
23
+
24
+ singular ? pred[0] : pred
25
+ end
26
+
27
+ # ruby
28
+
29
+ def self.load(data)
30
+ BaseRegressor.new(**Hash[data.map { |k, v| [k.to_sym, v] }])
31
+ end
32
+
33
+ def dump
34
+ {coefficients: coefficients}
35
+ end
36
+
37
+ # json
38
+
39
+ def self.load_json(data)
40
+ data = JSON.parse(data) if data.is_a?(String)
41
+ coefficients = data["coefficients"]
42
+
43
+ # for R models
44
+ if coefficients["(Intercept)"]
45
+ coefficients = coefficients.dup
46
+ coefficients["_intercept"] = coefficients.delete("(Intercept)")
47
+ end
48
+
49
+ BaseRegressor.new(coefficients: coefficients)
50
+ end
51
+
52
+ def to_json
53
+ JSON.generate(dump)
54
+ end
55
+
56
+ # pmml
57
+
58
+ def self.load_pmml(data)
59
+ data = Nokogiri::XML(data) if data.is_a?(String)
60
+ # TODO more validation
61
+ node = data.css("RegressionTable")
62
+ coefficients = {
63
+ _intercept: node.attribute("intercept").value.to_f
64
+ }
65
+ node.css("NumericPredictor").each do |n|
66
+ coefficients[n.attribute("name").value] = n.attribute("coefficient").value.to_f
67
+ end
68
+ node.css("CategoricalPredictor").each do |n|
69
+ coefficients["#{n.attribute("name").value}#{n.attribute("value").value}"] = n.attribute("coefficient").value.to_f
70
+ end
71
+ BaseRegressor.new(coefficients: coefficients)
72
+ end
73
+
74
+ # pfa
75
+
76
+ def self.load_pfa(data)
77
+ data = JSON.parse(data) if data.is_a?(String)
78
+ init = data["cells"].first[1]["init"]
79
+ names =
80
+ if data["input"]["fields"]
81
+ data["input"]["fields"].map { |f| f["name"] }
82
+ else
83
+ init["coeff"].map.with_index { |_, i| "x#{i}" }
84
+ end
85
+ coefficients = {
86
+ _intercept: init["const"]
87
+ }
88
+ init["coeff"].each_with_index do |c, i|
89
+ name = names[i]
90
+ # R can export coefficients with same name
91
+ raise "Coefficients with same name" if coefficients[name]
92
+ coefficients[name] = c
93
+ end
94
+ BaseRegressor.new(coefficients: coefficients)
95
+ end
96
+
97
+ private
98
+
99
+ def daru?(x)
100
+ defined?(Daru) && x.is_a?(Daru::DataFrame)
101
+ end
102
+
103
+ def prep_x(x, train: true)
104
+ if daru?(x)
105
+ x = x.to_a[0]
106
+ else
107
+ x = x.map do |xi|
108
+ case xi
109
+ when Hash
110
+ xi
111
+ when Array
112
+ Hash[xi.map.with_index { |v, i| [:"x#{i}", v] }]
113
+ else
114
+ {x0: xi}
115
+ end
116
+ end
117
+ end
118
+
119
+ # if !train && x.any?
120
+ # # check first row against coefficients
121
+ # ckeys = coefficients.keys.map(&:to_s)
122
+ # bad_keys = x[0].keys.map(&:to_s).reject { |k| ckeys.any? { |c| c.start_with?(k) } }
123
+ # raise "Unknown keys: #{bad_keys.join(", ")}" if bad_keys.any?
124
+ # end
125
+
126
+ cache = {}
127
+ first_key = {}
128
+ i = 0
129
+ rows = []
130
+ x.each do |xi|
131
+ row = {}
132
+ xi.each do |k, v|
133
+ key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
134
+ v2 = v.is_a?(String) ? 1 : v
135
+ raise "Missing data" if v2.nil?
136
+
137
+ unless cache[key]
138
+ cache[key] = i
139
+ first_key[k] ||= key if v.is_a?(String)
140
+ i += 1
141
+ end
142
+
143
+ row[key] = v2
144
+ end
145
+ rows << row
146
+ end
147
+
148
+ if train
149
+ # remove one degree of freedom
150
+ first_key.values.each do |v|
151
+ num = cache.delete(v)
152
+ cache.each do |k, v2|
153
+ cache[k] -= 1 if v2 > num
154
+ end
155
+ end
156
+ end
157
+
158
+ ret2 = []
159
+ rows.each do |row|
160
+ ret = [0] * cache.size
161
+ row.each do |k, v|
162
+ if cache[k]
163
+ ret[cache[k]] = v
164
+ end
165
+ end
166
+ ret2 << ([1] + ret)
167
+ end
168
+
169
+ # flatten keys
170
+ c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k.is_a?(Array) ? k.join("") : k).to_sym }
171
+
172
+ if c.size != c.uniq.size
173
+ raise "Overlapping coefficients"
174
+ end
175
+
176
+ [ret2, c]
177
+ end
178
+
179
+ def matrix_arr(matrix)
180
+ matrix.to_a.map { |xi| xi[0].to_f }
181
+ end
182
+ end
183
+ end
@@ -0,0 +1,35 @@
1
+ module Eps
2
+ class Metrics
3
+ attr_reader :errors
4
+
5
+ def initialize(actual, estimated)
6
+ @errors = actual.zip(estimated).map { |yi, yi2| yi - yi2 }
7
+ end
8
+
9
+ def all
10
+ {
11
+ rmse: rmse,
12
+ mae: mae,
13
+ me: me
14
+ }
15
+ end
16
+
17
+ private
18
+
19
+ def me
20
+ mean(errors)
21
+ end
22
+
23
+ def mae
24
+ mean(errors.map { |v| v.abs })
25
+ end
26
+
27
+ def rmse
28
+ Math.sqrt(mean(errors.map { |v| v**2 }))
29
+ end
30
+
31
+ def mean(arr)
32
+ arr.inject(0, &:+) / arr.size.to_f
33
+ end
34
+ end
35
+ end
@@ -0,0 +1,343 @@
1
+ module Eps
2
+ class Regressor < BaseRegressor
3
+ def initialize(data, y = nil, target: nil, gsl: nil)
4
+ raise ArgumentError, "missing target" if !target && !y
5
+
6
+ target = prep_target(target, data) if target
7
+
8
+ # TODO more performant conversion
9
+ if daru?(data)
10
+ y ||= data[target].to_a
11
+ x = data.dup.delete_vector(target)
12
+ else
13
+ x = data.map(&:dup)
14
+ y ||= x.map { |v| v.delete(target) }
15
+ end
16
+
17
+ @x = x
18
+ @y = prep_y(y.to_a)
19
+
20
+ if @x.size != @y.size
21
+ raise "Number of samples differs from target"
22
+ end
23
+
24
+ @target = target
25
+ @gsl = gsl.nil? ? defined?(GSL) : gsl
26
+
27
+ # fit immediately
28
+ coefficients
29
+ end
30
+
31
+ def coefficients
32
+ @coefficients ||= begin
33
+ x, @coefficient_names = prep_x(@x)
34
+
35
+ if x.size <= @coefficient_names.size
36
+ raise "Number of samples must be at least two more than number of features"
37
+ end
38
+
39
+ v =
40
+ if @gsl
41
+ x = GSL::Matrix.alloc(*x)
42
+ y = GSL::Vector.alloc(@y)
43
+ c, @covariance, _, _ = GSL::MultiFit::linear(x, y)
44
+ c.to_a
45
+ else
46
+ x = Matrix.rows(x)
47
+ y = Matrix.column_vector(@y)
48
+ removed = []
49
+
50
+ # https://statsmaths.github.io/stat612/lectures/lec13/lecture13.pdf
51
+ # unfortunately, this method is unstable
52
+ # haven't found an efficient way to do QR-factorization in Ruby
53
+ # the extendmatrix gem has householder and givens (givens has bug)
54
+ # but methods are too slow
55
+ begin
56
+ @xtxi = (x.t * x).inverse
57
+ rescue ExceptionForMatrix::ErrNotRegular
58
+ constant = {}
59
+ (1...x.column_count).each do |i|
60
+ constant[i] = constant?(x.column(i))
61
+ end
62
+
63
+ # remove constant columns
64
+ removed = constant.select { |_, v| v }.keys
65
+
66
+ # remove non-independent columns
67
+ constant.select { |_, v| !v }.keys.combination(2) do |c|
68
+ if !x.column(c[0]).independent?(x.column(c[1]))
69
+ removed << c[1]
70
+ end
71
+ end
72
+
73
+ vectors = x.column_vectors
74
+ # delete in reverse so indexes stay the same
75
+ removed.sort.reverse.each do |i|
76
+ # @coefficient_names.delete_at(i)
77
+ vectors.delete_at(i)
78
+ end
79
+ x = Matrix.columns(vectors)
80
+
81
+ # try again
82
+ begin
83
+ @xtxi = (x.t * x).inverse
84
+ rescue ExceptionForMatrix::ErrNotRegular
85
+ raise "Multiple solutions - GSL is needed to select one"
86
+ end
87
+ end
88
+ v2 = matrix_arr(@xtxi * x.t * y)
89
+
90
+ # add back removed
91
+ removed.sort.each do |i|
92
+ v2.insert(i, 0)
93
+ end
94
+ @removed = removed
95
+
96
+ v2
97
+ end
98
+
99
+ Hash[@coefficient_names.zip(v)]
100
+ end
101
+ end
102
+
103
+ def evaluate(data, y = nil)
104
+ actual = y
105
+
106
+ actual ||=
107
+ if daru?(data)
108
+ data[@target].to_a
109
+ else
110
+ data.map { |v| v[@target] }
111
+ end
112
+
113
+ actual = prep_y(actual)
114
+
115
+ estimated = predict(data)
116
+ Eps.metrics(actual, estimated)
117
+ end
118
+
119
+ # https://people.richland.edu/james/ictcm/2004/multiple.html
120
+ def summary(extended: false)
121
+ (@summary_str ||= {})[extended] ||= begin
122
+ str = String.new("")
123
+ len = [coefficients.keys.map(&:size).max, 15].max
124
+ if extended
125
+ str += "%-#{len}s %12s %12s %12s %12s\n" % ["", "coef", "stderr", "t", "p"]
126
+ else
127
+ str += "%-#{len}s %12s %12s\n" % ["", "coef", "p"]
128
+ end
129
+ coefficients.each do |k, v|
130
+ if extended
131
+ str += "%-#{len}s %12.2f %12.2f %12.2f %12.3f\n" % [k, v, std_err[k], t_value[k], p_value[k]]
132
+ else
133
+ str += "%-#{len}s %12.2f %12.3f\n" % [k, v, p_value[k]]
134
+ end
135
+ end
136
+ str += "\n"
137
+ str += "r2: %.3f\n" % [r2] if extended
138
+ str += "adjusted r2: %.3f\n" % [adjusted_r2]
139
+ str
140
+ end
141
+ end
142
+
143
+ def r2
144
+ @r2 ||= (sst - sse) / sst
145
+ end
146
+
147
+ def adjusted_r2
148
+ @adjusted_r2 ||= (mst - mse) / mst
149
+ end
150
+
151
+ private
152
+
153
+ def constant?(arr)
154
+ arr.all? { |x| x == arr[0] }
155
+ end
156
+
157
+ # determine if target is a string or symbol
158
+ def prep_target(target, data)
159
+ if daru?(data)
160
+ data.has_vector?(target) ? target : flip_target(target)
161
+ else
162
+ x = data[0] || {}
163
+ x[target] ? target : flip_target(target)
164
+ end
165
+ end
166
+
167
+ def flip_target(target)
168
+ target.is_a?(String) ? target.to_sym : target.to_s
169
+ end
170
+
171
+ def prep_y(y)
172
+ y.each do |yi|
173
+ raise "Target missing in data" if yi.nil?
174
+ end
175
+ y.map(&:to_f)
176
+ end
177
+
178
+ # add epsilon for perfect fits
179
+ # consistent with GSL
180
+ def t_value
181
+ @t_value ||= Hash[coefficients.map { |k, v| [k, v / (std_err[k] + Float::EPSILON)] }]
182
+ end
183
+
184
+ def p_value
185
+ @p_value ||= begin
186
+ Hash[coefficients.map do |k, _|
187
+ tp =
188
+ if @gsl
189
+ GSL::Cdf.tdist_P(t_value[k].abs, degrees_of_freedom)
190
+ else
191
+ tdist_p(t_value[k].abs, degrees_of_freedom)
192
+ end
193
+
194
+ [k, 2 * (1 - tp)]
195
+ end]
196
+ end
197
+ end
198
+
199
+ def std_err
200
+ @std_err ||= begin
201
+ Hash[@coefficient_names.zip(diagonal.map { |v| Math.sqrt(v) })]
202
+ end
203
+ end
204
+
205
+ def diagonal
206
+ @diagonal ||= begin
207
+ if covariance.respond_to?(:each)
208
+ d = covariance.each(:diagonal).to_a
209
+ @removed.each do |i|
210
+ d.insert(i, 0)
211
+ end
212
+ d
213
+ else
214
+ covariance.diagonal.to_a
215
+ end
216
+ end
217
+ end
218
+
219
+ def covariance
220
+ @covariance ||= mse * @xtxi
221
+ end
222
+
223
+ def y_bar
224
+ @y_bar ||= mean(@y)
225
+ end
226
+
227
+ def y_hat
228
+ @y_hat ||= predict(@x)
229
+ end
230
+
231
+ # total sum of squares
232
+ def sst
233
+ @sst ||= sum(@y.map { |y| (y - y_bar)**2 })
234
+ end
235
+
236
+ # sum of squared errors of prediction
237
+ # not to be confused with "explained sum of squares"
238
+ def sse
239
+ @sse ||= sum(@y.zip(y_hat).map { |y, yh| (y - yh)**2 })
240
+ end
241
+
242
+ def mst
243
+ @mst ||= sst / (@y.size - 1)
244
+ end
245
+
246
+ def mse
247
+ @mse ||= sse / degrees_of_freedom
248
+ end
249
+
250
+ def degrees_of_freedom
251
+ @y.size - coefficients.size
252
+ end
253
+
254
+ def sum(arr)
255
+ arr.inject(0, &:+)
256
+ end
257
+
258
+ def mean(arr)
259
+ sum(arr) / arr.size.to_f
260
+ end
261
+
262
+ ### Extracted from https://github.com/estebanz01/ruby-statistics
263
+ ### The Ruby author is Esteban Zapata Rojas
264
+ ###
265
+ ### Originally extracted from https://codeplea.com/incomplete-beta-function-c
266
+ ### This function is shared under zlib license and the author is Lewis Van Winkle
267
+ def tdist_p(value, degrees_of_freedom)
268
+ upper = (value + Math.sqrt(value * value + degrees_of_freedom))
269
+ lower = (2.0 * Math.sqrt(value * value + degrees_of_freedom))
270
+
271
+ x = upper/lower
272
+
273
+ alpha = degrees_of_freedom/2.0
274
+ beta = degrees_of_freedom/2.0
275
+
276
+ incomplete_beta_function(x, alpha, beta)
277
+ end
278
+
279
+ ### Extracted from https://github.com/estebanz01/ruby-statistics
280
+ ### The Ruby author is Esteban Zapata Rojas
281
+ ###
282
+ ### This implementation is an adaptation of the incomplete beta function made in C by
283
+ ### Lewis Van Winkle, which released the code under the zlib license.
284
+ ### The whole math behind this code is described in the following post: https://codeplea.com/incomplete-beta-function-c
285
+ def incomplete_beta_function(x, alp, bet)
286
+ return if x < 0.0
287
+ return 1.0 if x > 1.0
288
+
289
+ tiny = 1.0E-50
290
+
291
+ if x > ((alp + 1.0)/(alp + bet + 2.0))
292
+ return 1.0 - incomplete_beta_function(1.0 - x, bet, alp)
293
+ end
294
+
295
+ # To avoid overflow problems, the implementation applies the logarithm properties
296
+ # to calculate in a faster and safer way the values.
297
+ lbet_ab = (Math.lgamma(alp)[0] + Math.lgamma(bet)[0] - Math.lgamma(alp + bet)[0]).freeze
298
+ front = (Math.exp(Math.log(x) * alp + Math.log(1.0 - x) * bet - lbet_ab) / alp.to_f).freeze
299
+
300
+ # This is the non-log version of the left part of the formula (before the continuous fraction)
301
+ # down_left = alp * self.beta_function(alp, bet)
302
+ # upper_left = (x ** alp) * ((1.0 - x) ** bet)
303
+ # front = upper_left/down_left
304
+
305
+ f, c, d = 1.0, 1.0, 0.0
306
+
307
+ returned_value = nil
308
+
309
+ # Let's do more iterations than the proposed implementation (200 iters)
310
+ (0..500).each do |number|
311
+ m = number/2
312
+
313
+ numerator = if number == 0
314
+ 1.0
315
+ elsif number % 2 == 0
316
+ (m * (bet - m) * x)/((alp + 2.0 * m - 1.0)* (alp + 2.0 * m))
317
+ else
318
+ top = -((alp + m) * (alp + bet + m) * x)
319
+ down = ((alp + 2.0 * m) * (alp + 2.0 * m + 1.0))
320
+
321
+ top/down
322
+ end
323
+
324
+ d = 1.0 + numerator * d
325
+ d = tiny if d.abs < tiny
326
+ d = 1.0 / d
327
+
328
+ c = 1.0 + numerator / c
329
+ c = tiny if c.abs < tiny
330
+
331
+ cd = (c*d).freeze
332
+ f = f * cd
333
+
334
+ if (1.0 - cd).abs < 1.0E-10
335
+ returned_value = front * (f - 1.0)
336
+ break
337
+ end
338
+ end
339
+
340
+ returned_value
341
+ end
342
+ end
343
+ end
@@ -0,0 +1,3 @@
1
+ module Eps
2
+ VERSION = "0.1.0"
3
+ end
metadata ADDED
@@ -0,0 +1,128 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: eps
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Andrew Kane
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2018-07-03 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: daru
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: minitest
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - ">="
46
+ - !ruby/object:Gem::Version
47
+ version: '0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - ">="
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: nokogiri
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rake
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: '0'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ">="
81
+ - !ruby/object:Gem::Version
82
+ version: '0'
83
+ description:
84
+ email:
85
+ - andrew@chartkick.com
86
+ executables: []
87
+ extensions: []
88
+ extra_rdoc_files: []
89
+ files:
90
+ - ".gitignore"
91
+ - ".travis.yml"
92
+ - CHANGELOG.md
93
+ - Gemfile
94
+ - LICENSE.txt
95
+ - README.md
96
+ - Rakefile
97
+ - eps.gemspec
98
+ - guides/Modeling.md
99
+ - lib/eps.rb
100
+ - lib/eps/base_regressor.rb
101
+ - lib/eps/metrics.rb
102
+ - lib/eps/regressor.rb
103
+ - lib/eps/version.rb
104
+ homepage: https://github.com/ankane/eps
105
+ licenses:
106
+ - MIT
107
+ metadata: {}
108
+ post_install_message:
109
+ rdoc_options: []
110
+ require_paths:
111
+ - lib
112
+ required_ruby_version: !ruby/object:Gem::Requirement
113
+ requirements:
114
+ - - ">="
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ required_rubygems_version: !ruby/object:Gem::Requirement
118
+ requirements:
119
+ - - ">="
120
+ - !ruby/object:Gem::Version
121
+ version: '0'
122
+ requirements: []
123
+ rubyforge_project:
124
+ rubygems_version: 2.7.7
125
+ signing_key:
126
+ specification_version: 4
127
+ summary: Linear regression for Ruby
128
+ test_files: []