RubyGems - eps - Versions diffs - 0.1.0 - Mend

eps 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: fdb6a5fb57b353ab114eb85f0d5c168ddba510bc309ffb29d07b1f3e16744acd
+  data.tar.gz: fc39e8b4c6c3c79ee7ba52cda09443db805f9724d5f413f77519e0d45554a1d8
+SHA512:
+  metadata.gz: 87fdd118890f41def2a69fefa840e313b97878d985f8d60a45b77b75bfec33b506d1027d983e27873499211c59c62273341f21d3c5c9b151b8d44bd34fbbfebb
+  data.tar.gz: fd3454f8539b48cbd8ae06fe10ab335aac557d5c8c5738d6531e436321721dc53664068d301da9602677022b8348bdf7c5f6b007b13403f602b21afd82df50b0

data/.gitignore ADDED

@@ -0,0 +1,9 @@
+/.bundle/
+/.yardoc
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+*.lock

data/.travis.yml ADDED

@@ -0,0 +1,12 @@
+language: ruby
+rvm: 2.5.1
+sudo: false
+before_install: gem install bundler
+script: bundle exec rake test
+env:
+  -
+  - GSL=t
+notifications:
+  email:
+    on_success: never
+    on_failure: change

data/CHANGELOG.md ADDED

@@ -0,0 +1,3 @@
+## 0.1.0
+- First release

data/Gemfile ADDED

@@ -0,0 +1,11 @@
+source "https://rubygems.org"
+git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
+# Specify your gem's dependencies in eps.gemspec
+gemspec
+# remove when 0.2.1 released
+gem "daru", github: "sciruby/daru"
+gem "gsl" if ENV["GSL"]

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2018 Andrew Kane
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,363 @@
+# Eps
+Linear regression for Ruby
+- Build models quickly and easily
+- Serve models built in Ruby, Python, R, and more
+- Automatically handles categorical variables
+- No external dependencies
+## Installation
+Add this line to your application’s Gemfile:
+```ruby
+gem 'eps'
+```
+To speed up training on large datasets, you can also [add GSL](#training-performance).
+## Getting Started
+Create a model
+```ruby
+data = [
+  {bedrooms: 1, bathrooms: 1, price: 100000},
+  {bedrooms: 2, bathrooms: 1, price: 125000},
+  {bedrooms: 2, bathrooms: 2, price: 135000},
+  {bedrooms: 3, bathrooms: 2, price: 162000}
+]
+model = Eps::Regressor.new(data, target: :price)
+puts model.summary
+```
+Make a prediction
+```ruby
+model.predict(bedrooms: 2, bathrooms: 1)
+```
+> Pass an array of hashes to make multiple predictions at once
+## Building Models
+### Training and Test Sets
+When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.
+```ruby
+rng = Random.new(1) # seed random number generator
+train_set, test_set = houses.partition { rng.rand < 0.7 }
+```
+If your data has a time associated with it, we recommend splitting on this.
+```ruby
+split_date = Date.parse("2018-06-01")
+train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+```
+### Feature Engineering
+Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
+```ruby
+{state: "CA"}
+```
+> Categorical features generate coefficients for each distinct value except for one
+You should do this for any ids in your data.
+```ruby
+{city_id: "123"}
+```
+For times, create features like day of week and hour of day with:
+```ruby
+{weekday: time.wday.to_s, hour: time.hour.to_s}
+```
+In practice, your code may look like:
+```ruby
+def features(house)
+  {
+    bedrooms: house.bedrooms,
+    city_id: house.city_id.to_s,
+    month: house.sold_at.strftime("%b"),
+    price: house.price
+  }
+end
+train_data = train_set.map { |h| features(h) }
+```
+### Training
+Once we have some features, let’s train the model.
+```ruby
+model = Eps::Regressor.new(train_data, target: :price)
+puts model.summary
+```
+The summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
+### Evaluation
+When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
+```ruby
+test_data = test_set.map { |h| features(h) }
+model.evaluate(test_data)
+```
+This returns:
+- RMSE - Root mean square error
+- MAE - Mean absolute error
+- ME - Mean error
+We want to minimize the RMSE and MAE and keep the ME around 0.
+## Serving Models
+Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
+```ruby
+model.dump
+```
+Then hardcode the result into your app.
+```ruby
+data = {:coefficients=>{:_intercept=>63500.0, :bedrooms=>26000.0, :bathrooms=>10000.0}}
+model = Eps::Regressor.load(data)
+```
+Now we can use it to make predictions.
+```ruby
+model.predict(bedrooms: 2, bathrooms: 1)
+```
+Another option that works well is writing the model to file in your app.
+```ruby
+json = model.to_json
+File.open("lib/models/housing_price.json", "w") { |f| f.write(json) }
+```
+To load it, use:
+```ruby
+json = File.read("lib/models/housing_price.json")
+model = Eps::Regressor.load_json(json)
+```
+To continuously train models, we recommend [storing them in your database](#database-storage).
+### Beyond Ruby
+Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language. Eps can load models in:
+JSON
+```ruby
+data = File.read("model.json")
+model = Eps::Regressor.load_json(data)
+```
+[PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - Predictive Model Markup Language
+```ruby
+data = File.read("model.pmml")
+model = Eps::Regressor.load_pmml(data)
+```
+[PFA](http://dmg.org/pfa/) - Portable Format for Analytics
+```ruby
+data = File.read("model.pfa")
+model = Eps::Regressor.load_pfa(data)
+```
+Here are examples for how to dump models in each:
+- [R JSON](guides/Modeling.md#r-json)
+- [R PMML](guides/Modeling.md#r-pmml)
+- [R PFA](guides/Modeling.md#r-pfa)
+- [Python JSON](guides/Modeling.md#python-json)
+- [Python PMML](guides/Modeling.md#python-pmml)
+- [Python PFA](guides/Modeling.md#python-pfa)
+### Verifying
+It’s important for features to be implemented consistently when serving models created in other languages. We highly recommend verifying this programmatically. Create a CSV file with ids and predictions from the original model.
+house_id | prediction
+--- | ---
+1 | 145000
+2 | 123000
+3 | 250000
+Once the model is implemented in Ruby, confirm the predictions match.
+```ruby
+model = Eps::Regressor.load_json(File.read("model.json"))
+# preload houses to prevent n+1
+houses = House.all.index_by(&:id)
+CSV.foreach("predictions.csv", headers: true) do |row|
+  house = houses[row["house_id"].to_i]
+  expected = row["prediction"].to_f
+  actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)
+  unless (actual - expected).abs < 0.001
+    raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})"
+  end
+  putc "✓"
+end
+```
+### Database Storage
+The database is another place you can store models. It’s good if you retrain models automatically.
+> We recommend adding monitoring and guardrails as well if you retrain automatically
+Create an ActiveRecord model to store the predictive model.
+```sh
+rails g model Model key:string:uniq data:text
+```
+Store the model with:
+```ruby
+store = Model.where(key: "housing_price").first_or_initialize
+store.update(data: model.to_json)
+```
+Load the model with:
+```ruby
+data = Model.find_by!(key: "housing_price").data
+model = Eps::Regressor.load_json(data)
+```
+## Monitoring
+We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
+```ruby
+actual = houses.map(&:price)
+estimated = houses.map(&:estimated_price)
+Eps.metrics(actual, estimated)
+```
+This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0.
+## Training Performance
+Speed up training on large datasets with GSL.
+First, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
+```sh
+brew install gsl
+```
+Then, add this line to your application’s Gemfile:
+```ruby
+gem 'gsl', group: :development
+```
+It only needs to be available in environments used to build the model.
+## Data
+A number of data formats are supported. You can pass the target variable separately.
+```ruby
+x = [{x: 1}, {x: 2}, {x: 3}]
+y = [1, 2, 3]
+Eps::Regressor.new(x, y)
+```
+Or pass arrays of arrays
+```ruby
+x = [[1, 2], [2, 0], [3, 1]]
+y = [1, 2, 3]
+Eps::Regressor.new(x, y)
+```
+## Daru
+Eps works well with Daru data frames.
+```ruby
+df = Daru::DataFrame.from_csv("houses.csv")
+Eps::Regressor.new(df, target: "price")
+```
+To split into training and test sets, use:
+```ruby
+rng = Random.new(1) # seed random number generator
+train_index = houses.map { rng.rand < 0.7 }
+train_set = houses.where(train_index)
+test_set = houses.where(train_index.map { |v| !v })
+```
+## CSVs
+When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+```ruby
+CSV.table("data.csv").map { |row| row.to_h }
+```
+## Reference
+Get coefficients
+```ruby
+model.coefficients
+```
+Get an extended summary with standard error, t-values, and r-squared
+```ruby
+model.summary(extended: true)
+```
+## History
+View the [changelog](https://github.com/ankane/eps/blob/master/CHANGELOG.md)
+## Contributing
+Everyone is encouraged to help improve this project. Here are a few ways you can help:
+- [Report bugs](https://github.com/ankane/eps/issues)
+- Fix bugs and [submit pull requests](https://github.com/ankane/eps/pulls)
+- Write, clarify, or fix documentation
+- Suggest or add new features
+To get started with development and testing:
+```sh
+git clone https://github.com/ankane/eps.git
+cd eps
+bundle install
+rake test
+```
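
The README's Training section says the summary reports adjusted r-squared rather than plain r-squared. As a rough illustration of the difference, the sketch below recomputes both from the formulas that appear in lib/eps/regressor.rb in this diff (`(sst - sse) / sst` and `(mst - mse) / mst`). The helper name `r2_scores` is hypothetical and not part of the gem; the predicted values are what the coefficients hardcoded in the README's `Eps::Regressor.load` example give for the four Getting Started houses.

```ruby
# Hypothetical helper (not part of eps); the formulas follow regressor.rb in this diff.
def r2_scores(actual, predicted, num_coefficients)
  n = actual.size
  y_bar = actual.sum / n.to_f
  sst = actual.sum { |y| (y - y_bar)**2 }                  # total sum of squares
  sse = actual.zip(predicted).sum { |y, yh| (y - yh)**2 }  # sum of squared errors
  mst = sst / (n - 1)
  mse = sse / (n - num_coefficients)                       # coefficient count includes the intercept
  {r2: (sst - sse) / sst, adjusted_r2: (mst - mse) / mst}
end

actual = [100000.0, 125000.0, 135000.0, 162000.0]
# 63500 + 26000 * bedrooms + 10000 * bathrooms for each house
predicted = [99500.0, 125500.0, 135500.0, 161500.0]

scores = r2_scores(actual, predicted, 3)
puts scores
```

Because adjusted r-squared divides by the remaining degrees of freedom, it comes out lower than r-squared here: with four samples and three coefficients only one degree of freedom is left.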

data/Rakefile ADDED

@@ -0,0 +1,11 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.libs << "lib"
+  t.test_files = FileList["test/**/*_test.rb"]
+  t.warning = false
+end
+task default: :test

data/eps.gemspec ADDED

@@ -0,0 +1,30 @@
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "eps/version"
+Gem::Specification.new do |spec|
+  spec.name          = "eps"
+  spec.version       = Eps::VERSION
+  spec.authors       = ["Andrew Kane"]
+  spec.email         = ["andrew@chartkick.com"]
+  spec.summary       = "Linear regression for Ruby"
+  spec.homepage      = "https://github.com/ankane/eps"
+  spec.license       = "MIT"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files         = Dir.chdir(File.expand_path('..', __FILE__)) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler"
+  spec.add_development_dependency "daru"
+  spec.add_development_dependency "minitest"
+  spec.add_development_dependency "nokogiri"
+  spec.add_development_dependency "rake"
+end

data/guides/Modeling.md ADDED

@@ -0,0 +1,143 @@
+# Modeling
+- [R JSON](#r-json)
+- [R PMML](#r-pmml)
+- [R PFA](#r-pfa)
+- [Python JSON](#python-json)
+- [Python PMML](#python-pmml)
+- [Python PFA](#python-pfa)
+## R JSON
+Install the [jsonlite](https://cran.r-project.org/package=jsonlite) package
+```r
+install.packages("jsonlite")
+```
+And run:
+```r
+library(jsonlite)
+model <- lm(dist ~ speed, cars)
+toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+```
+## R PMML
+Install the [pmml](https://cran.r-project.org/package=pmml) package
+```r
+install.packages("pmml")
+```
+And run:
+```r
+library(pmml)
+model <- lm(dist ~ speed,  cars)
+toString(pmml(model))
+```
+## R PFA
+Install the [aurelius](https://cran.r-project.org/package=aurelius) package
+```r
+install.packages("aurelius")
+```
+And run:
+```r
+library(aurelius)
+model <- lm(dist ~ speed,  cars)
+write_pfa(pfa(model))
+```
+## Python JSON
+Run:
+```python
+from sklearn import linear_model
+import pandas as pd
+import json
+x = [1, 2, 3, 4, 5]
+y = [5 * xi + 3 for xi in x]
+df = pd.DataFrame({'x': x, 'y': y})
+features = ['x']
+model = linear_model.LinearRegression()
+model.fit(df[features], df['y'])
+coefficients = {'_intercept': model.intercept_}
+for i, c in enumerate(model.coef_):
+    coefficients[features[i]] = c
+print(json.dumps({'coefficients': coefficients}))
+```
+## Python PMML
+Install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package
+```sh
+pip install sklearn2pmml
+```
+And run:
+```python
+from sklearn import linear_model
+from scikit2pmml import scikit2pmml
+x = [1, 2, 3, 5, 6]
+y = [5 * xi + 3 for xi in x]
+model = linear_model.LinearRegression()
+model.fit([[xi] for xi in x], y)
+scikit2pmml(estimator=model, file='pymodel.pmml')
+```
+## Python PFA
+Install the [Titus](https://github.com/opendatagroup/hadrian) package and run:
+```python
+from sklearn import linear_model
+import titus.prettypfa
+import json
+x = [1, 2, 3, 5, 6]
+y = [5 * xi + 3 for xi in x]
+model = linear_model.LinearRegression()
+model.fit([[xi] for xi in x], y)
+def pfa(estimator):
+    pfaDocument = titus.prettypfa.jsonNode('''
+types:
+  Regression = record(Regression,
+                      const: double,
+                      coeff: array(double))
+input: array(double)
+output: double
+cells:
+  regression(Regression) = {const: 0.0, coeff: []}
+action:
+  model.reg.linear(input, regression)
+''')
+    pfaDocument["cells"]["regression"]["init"] = {"const": estimator.intercept_, "coeff": list(estimator.coef_)}
+    return pfaDocument
+json.dumps(pfa(model))
+```
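
On the Ruby side, the JSON produced by the R guide above names the intercept `(Intercept)`, while eps stores it as `_intercept`. The sketch below shows that renaming step in isolation, using only the standard library. The helper name `normalize_coefficients` is hypothetical, and the payload numbers are illustrative placeholders rather than real model output.

```ruby
require "json"

# Illustrative payload in the shape the R JSON guide emits; the numbers are placeholders.
payload = '{"coefficients":{"(Intercept)":-17.58,"speed":3.93}}'

# Hypothetical helper mirroring the "(Intercept)" handling in load_json.
def normalize_coefficients(json)
  coefficients = JSON.parse(json)["coefficients"].dup
  if coefficients.key?("(Intercept)")
    coefficients["_intercept"] = coefficients.delete("(Intercept)")
  end
  coefficients.map { |k, v| [k.to_sym, v] }.to_h
end

coefficients = normalize_coefficients(payload)
# Prediction for speed = 10 is the intercept plus coefficient times value
prediction = coefficients[:_intercept] + coefficients[:speed] * 10
puts prediction
```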

data/lib/eps.rb ADDED

@@ -0,0 +1,15 @@
+# dependencies
+require "matrix"
+require "json"
+# modules
+require "eps/base_regressor"
+require "eps/metrics"
+require "eps/regressor"
+require "eps/version"
+module Eps
+  def self.metrics(actual, estimated)
+    Eps::Metrics.new(actual, estimated).all
+  end
+end

data/lib/eps/base_regressor.rb ADDED

@@ -0,0 +1,183 @@
+module Eps
+  class BaseRegressor
+    attr_reader :coefficients
+    def initialize(coefficients:)
+      @coefficients = Hash[coefficients.map { |k, v| [k.to_sym, v] }]
+    end
+    def predict(x)
+      singular = !(x.is_a?(Array) || daru?(x))
+      x = [x] if singular
+      x, c = prep_x(x, train: false)
+      coef = c.map do |v|
+        # use 0 if coefficient does not exist
+        # this can happen for categorical features
+        # since only n-1 coefficients are stored
+        coefficients[v] || 0
+      end
+      x = Matrix.rows(x)
+      c = Matrix.column_vector(coef)
+      pred = matrix_arr(x * c)
+      singular ? pred[0] : pred
+    end
+    # ruby
+    def self.load(data)
+      BaseRegressor.new(**Hash[data.map { |k, v| [k.to_sym, v] }])
+    end
+    def dump
+      {coefficients: coefficients}
+    end
+    # json
+    def self.load_json(data)
+      data = JSON.parse(data) if data.is_a?(String)
+      coefficients = data["coefficients"]
+      # for R models
+      if coefficients["(Intercept)"]
+        coefficients = coefficients.dup
+        coefficients["_intercept"] = coefficients.delete("(Intercept)")
+      end
+      BaseRegressor.new(coefficients: coefficients)
+    end
+    def to_json
+      JSON.generate(dump)
+    end
+    # pmml
+    def self.load_pmml(data)
+      data = Nokogiri::XML(data) if data.is_a?(String)
+      # TODO more validation
+      node = data.css("RegressionTable")
+      coefficients = {
+        _intercept: node.attribute("intercept").value.to_f
+      }
+      node.css("NumericPredictor").each do |n|
+        coefficients[n.attribute("name").value] = n.attribute("coefficient").value.to_f
+      end
+      node.css("CategoricalPredictor").each do |n|
+        coefficients["#{n.attribute("name").value}#{n.attribute("value").value}"] = n.attribute("coefficient").value.to_f
+      end
+      BaseRegressor.new(coefficients: coefficients)
+    end
+    # pfa
+    def self.load_pfa(data)
+      data = JSON.parse(data) if data.is_a?(String)
+      init = data["cells"].first[1]["init"]
+      names =
+        if data["input"]["fields"]
+          data["input"]["fields"].map { |f| f["name"] }
+        else
+          init["coeff"].map.with_index { |_, i| "x#{i}" }
+        end
+      coefficients = {
+        _intercept: init["const"]
+      }
+      init["coeff"].each_with_index do |c, i|
+        name = names[i]
+        # R can export coefficients with same name
+        raise "Coefficients with same name" if coefficients[name]
+        coefficients[name] = c
+      end
+      BaseRegressor.new(coefficients: coefficients)
+    end
+    private
+    def daru?(x)
+      defined?(Daru) && x.is_a?(Daru::DataFrame)
+    end
+    def prep_x(x, train: true)
+      if daru?(x)
+        x = x.to_a[0]
+      else
+        x = x.map do |xi|
+          case xi
+          when Hash
+            xi
+          when Array
+            Hash[xi.map.with_index { |v, i| [:"x#{i}", v] }]
+          else
+            {x0: xi}
+          end
+        end
+      end
+      # if !train && x.any?
+      #   # check first row against coefficients
+      #   ckeys = coefficients.keys.map(&:to_s)
+      #   bad_keys = x[0].keys.map(&:to_s).reject { |k| ckeys.any? { |c| c.start_with?(k) } }
+      #   raise "Unknown keys: #{bad_keys.join(", ")}" if bad_keys.any?
+      # end
+      cache = {}
+      first_key = {}
+      i = 0
+      rows = []
+      x.each do |xi|
+        row = {}
+        xi.each do |k, v|
+          key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
+          v2 = v.is_a?(String) ? 1 : v
+          raise "Missing data" if v2.nil?
+          unless cache[key]
+            cache[key] = i
+            first_key[k] ||= key if v.is_a?(String)
+            i += 1
+          end
+          row[key] = v2
+        end
+        rows << row
+      end
+      if train
+        # remove one degree of freedom
+        first_key.values.each do |v|
+          num = cache.delete(v)
+          cache.each do |k, v2|
+            cache[k] -= 1 if v2 > num
+          end
+        end
+      end
+      ret2 = []
+      rows.each do |row|
+        ret = [0] * cache.size
+        row.each do |k, v|
+          if cache[k]
+            ret[cache[k]] = v
+          end
+        end
+        ret2 << ([1] + ret)
+      end
+      # flatten keys
+      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k.is_a?(Array) ? k.join("") : k).to_sym }
+      if c.size != c.uniq.size
+        raise "Overlapping coefficients"
+      end
+      [ret2, c]
+    end
+    def matrix_arr(matrix)
+      matrix.to_a.map { |xi| xi[0].to_f }
+    end
+  end
+end
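
To make the categorical handling above easier to follow, here is a minimal standalone sketch of the same idea: a string value becomes an indicator of 1 under the key formed by joining the feature name and the value, and a level with no stored coefficient contributes 0. The helper name `predict_row` and the coefficient values are hypothetical; the gem itself goes through `prep_x` and a matrix product instead.

```ruby
# Hypothetical helper (not part of eps) illustrating the lookup rules in predict/prep_x.
def predict_row(coefficients, row)
  row.reduce(coefficients.fetch(:_intercept, 0)) do |sum, (name, value)|
    if value.is_a?(String)
      # categorical feature: key is name + value, indicator is 1,
      # and a level without a coefficient (the dropped baseline) adds 0
      sum + coefficients.fetch(:"#{name}#{value}", 0)
    else
      # numeric feature: coefficient times value
      sum + coefficients.fetch(name.to_sym, 0) * value
    end
  end
end

# Made-up coefficients; "CA" plays the baseline level with no stored coefficient
coefficients = {_intercept: 50000.0, bedrooms: 25000.0, stateNY: 12000.0}

ny = predict_row(coefficients, {bedrooms: 2, state: "NY"})
ca = predict_row(coefficients, {bedrooms: 2, state: "CA"})
puts ny, ca
```

One consequence of this scheme is that an unseen category is silently treated like the baseline level, which is one reason the README recommends verifying predictions programmatically.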

data/lib/eps/metrics.rb ADDED

@@ -0,0 +1,35 @@
+module Eps
+  class Metrics
+    attr_reader :errors
+    def initialize(actual, estimated)
+      @errors = actual.zip(estimated).map { |yi, yi2| yi - yi2 }
+    end
+    def all
+      {
+        rmse: rmse,
+        mae: mae,
+        me: me
+      }
+    end
+    private
+    def me
+      mean(errors)
+    end
+    def mae
+      mean(errors.map { |v| v.abs })
+    end
+    def rmse
+      Math.sqrt(mean(errors.map { |v| v**2 }))
+    end
+    def mean(arr)
+      arr.inject(0, &:+) / arr.size.to_f
+    end
+  end
+end
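
For readers who want to check the three metrics by hand, the following sketch recomputes them without the gem. Errors are taken as actual minus estimated, as in `Eps::Metrics` above; the function name `regression_metrics` and the sample numbers are made up for illustration.

```ruby
# Hypothetical standalone version of the calculations in Eps::Metrics.
def regression_metrics(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |v| v**2 } / n), # root mean square error
    mae: errors.sum(&:abs) / n,                   # mean absolute error
    me: errors.sum / n                            # mean error (signed bias)
  }
end

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
puts metrics
```

A negative ME, as in this example, means the estimates run high on average, which RMSE and MAE alone cannot show.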

data/lib/eps/regressor.rb ADDED

@@ -0,0 +1,343 @@
+module Eps
+  class Regressor < BaseRegressor
+    def initialize(data, y = nil, target: nil, gsl: nil)
+      raise ArgumentError, "missing target" if !target && !y
+      target = prep_target(target, data) if target
+      # TODO more performant conversion
+      if daru?(data)
+        y ||= data[target].to_a
+        x = data.dup.delete_vector(target)
+      else
+        x = data.map(&:dup)
+        y ||= x.map { |v| v.delete(target) }
+      end
+      @x = x
+      @y = prep_y(y.to_a)
+      if @x.size != @y.size
+        raise "Number of samples differs from target"
+      end
+      @target = target
+      @gsl = gsl.nil? ? defined?(GSL) : gsl
+      # fit immediately
+      coefficients
+    end
+    def coefficients
+      @coefficients ||= begin
+        x, @coefficient_names = prep_x(@x)
+        if x.size <= @coefficient_names.size
+          raise "Number of samples must be at least two more than number of features"
+        end
+        v =
+          if @gsl
+            x = GSL::Matrix.alloc(*x)
+            y = GSL::Vector.alloc(@y)
+            c, @covariance, _, _ = GSL::MultiFit::linear(x, y)
+            c.to_a
+          else
+            x = Matrix.rows(x)
+            y = Matrix.column_vector(@y)
+            removed = []
+            # https://statsmaths.github.io/stat612/lectures/lec13/lecture13.pdf
+            # unfortunately, this method is unstable
+            # haven't found an efficient way to do QR-factorization in Ruby
+            # the extendmatrix gem has householder and givens (givens has bug)
+            # but methods are too slow
+            begin
+              @xtxi = (x.t * x).inverse
+            rescue ExceptionForMatrix::ErrNotRegular
+              constant = {}
+              (1...x.column_count).each do |i|
+                constant[i] = constant?(x.column(i))
+              end
+              # remove constant columns
+              removed = constant.select { |_, v| v }.keys
+              # remove non-independent columns
+              constant.select { |_, v| !v }.keys.combination(2) do |c|
+                if !x.column(c[0]).independent?(x.column(c[1]))
+                  removed << c[1]
+                end
+              end
+              vectors = x.column_vectors
+              # delete in reverse so indexes stay the same
+              removed.sort.reverse.each do |i|
+                # @coefficient_names.delete_at(i)
+                vectors.delete_at(i)
+              end
+              x = Matrix.columns(vectors)
+              # try again
+              begin
+                @xtxi = (x.t * x).inverse
+              rescue ExceptionForMatrix::ErrNotRegular
+                raise "Multiple solutions - GSL is needed to select one"
+              end
+            end
+            v2 = matrix_arr(@xtxi * x.t * y)
+            # add back removed
+            removed.sort.each do |i|
+              v2.insert(i, 0)
+            end
+            @removed = removed
+            v2
+          end
+        Hash[@coefficient_names.zip(v)]
+      end
+    end
+    def evaluate(data, y = nil)
+      actual = y
+      actual ||=
+        if daru?(data)
+          data[@target].to_a
+        else
+          data.map { |v| v[@target] }
+        end
+      actual = prep_y(actual)
+      estimated = predict(data)
+      Eps.metrics(actual, estimated)
+    end
+    # https://people.richland.edu/james/ictcm/2004/multiple.html
+    def summary(extended: false)
+      @summary_str ||= begin
+        str = String.new("")
+        len = [coefficients.keys.map(&:size).max, 15].max
+        if extended
+          str += "%-#{len}s %12s %12s %12s %12s\n" % ["", "coef", "stderr", "t", "p"]
+        else
+          str += "%-#{len}s %12s %12s\n" % ["", "coef", "p"]
+        end
+        coefficients.each do |k, v|
+          if extended
+            str += "%-#{len}s %12.2f %12.2f %12.2f %12.3f\n" % [k, v, std_err[k], t_value[k], p_value[k]]
+          else
+            str += "%-#{len}s %12.2f %12.3f\n" % [k, v, p_value[k]]
+          end
+        end
+        str += "\n"
+        str += "r2: %.3f\n" % [r2] if extended
+        str += "adjusted r2: %.3f\n" % [adjusted_r2]
+        str
+      end
+    end
+    def r2
+      @r2 ||= (sst - sse) / sst
+    end
+    def adjusted_r2
+      @adjusted_r2 ||= (mst - mse) / mst
+    end
+    private
+    def constant?(arr)
+      arr.all? { |x| x == arr[0] }
+    end
+    # determine if target is a string or symbol
+    def prep_target(target, data)
+      if daru?(data)
+        data.has_vector?(target) ? target : flip_target(target)
+      else
+        x = data[0] || {}
+        x[target] ? target : flip_target(target)
+      end
+    end
+    def flip_target(target)
+      target.is_a?(String) ? target.to_sym : target.to_s
+    end
+    def prep_y(y)
+      y.each do |yi|
+        raise "Target missing in data" if yi.nil?
+      end
+      y.map(&:to_f)
+    end
+    # add epsilon for perfect fits
+    # consistent with GSL
+    def t_value
+      @t_value ||= Hash[coefficients.map { |k, v| [k, v / (std_err[k] + Float::EPSILON)] }]
+    end
+    def p_value
+      @p_value ||= begin
+        Hash[coefficients.map do |k, _|
+          tp =
+            if @gsl
+              GSL::Cdf.tdist_P(t_value[k].abs, degrees_of_freedom)
+            else
+              tdist_p(t_value[k].abs, degrees_of_freedom)
+            end
+          [k, 2 * (1 - tp)]
+        end]
+      end
+    end
+    def std_err
+      @std_err ||= begin
+        Hash[@coefficient_names.zip(diagonal.map { |v| Math.sqrt(v) })]
+      end
+    end
+    def diagonal
+      @diagonal ||= begin
+        if covariance.respond_to?(:each)
+          d = covariance.each(:diagonal).to_a
+          @removed.each do |i|
+            d.insert(i, 0)
+          end
+          d
+        else
+          covariance.diagonal.to_a
+        end
+      end
+    end
+    def covariance
+      @covariance ||= mse * @xtxi
+    end
+    def y_bar
+      @y_bar ||= mean(@y)
+    end
+    def y_hat
+      @y_hat ||= predict(@x)
+    end
+    # total sum of squares
+    def sst
+      @sst ||= sum(@y.map { |y| (y - y_bar)**2 })
+    end
+    # sum of squared errors of prediction
+    # not to be confused with "explained sum of squares"
+    def sse
+      @sse ||= sum(@y.zip(y_hat).map { |y, yh| (y - yh)**2 })
+    end
+    def mst
+      @mst ||= sst / (@y.size - 1)
+    end
+    def mse
+      @mse ||= sse / degrees_of_freedom
+    end
+    def degrees_of_freedom
+      @y.size - coefficients.size
+    end
+    def sum(arr)
+      arr.inject(0, &:+)
+    end
+    def mean(arr)
+      sum(arr) / arr.size.to_f
+    end
+    ### Extracted from https://github.com/estebanz01/ruby-statistics
+    ### The Ruby author is Esteban Zapata Rojas
+    ###
+    ### Originally extracted from https://codeplea.com/incomplete-beta-function-c
+    ### This function is shared under zlib license and the author is Lewis Van Winkle
+    def tdist_p(value, degrees_of_freedom)
+      upper = (value + Math.sqrt(value * value + degrees_of_freedom))
+      lower = (2.0 * Math.sqrt(value * value + degrees_of_freedom))
+      x = upper/lower
+      alpha = degrees_of_freedom/2.0
+      beta = degrees_of_freedom/2.0
+      incomplete_beta_function(x, alpha, beta)
+    end
+    ### Extracted from https://github.com/estebanz01/ruby-statistics
+    ### The Ruby author is Esteban Zapata Rojas
+    ###
+    ### This implementation is an adaptation of the incomplete beta function made in C by
+    ### Lewis Van Winkle, which released the code under the zlib license.
+    ### The whole math behind this code is described in the following post: https://codeplea.com/incomplete-beta-function-c
+    def incomplete_beta_function(x, alp, bet)
+      return if x < 0.0
+      return 1.0 if x > 1.0
+      tiny = 1.0E-50
+      if x > ((alp + 1.0)/(alp + bet + 2.0))
+        return 1.0 - incomplete_beta_function(1.0 - x, bet, alp)
+      end
+      # To avoid overflow, the implementation works with logarithms,
+      # which makes the calculation both faster and safer.
+      lbet_ab = (Math.lgamma(alp)[0] + Math.lgamma(bet)[0] - Math.lgamma(alp + bet)[0]).freeze
+      front = (Math.exp(Math.log(x) * alp + Math.log(1.0 - x) * bet - lbet_ab) / alp.to_f).freeze
+      # This is the non-log version of the left part of the formula (before the continued fraction)
+      # down_left = alp * self.beta_function(alp, bet)
+      # upper_left = (x ** alp) * ((1.0 - x) ** bet)
+      # front = upper_left/down_left
+      f, c, d = 1.0, 1.0, 0.0
+      returned_value = nil
+      # Let's do more iterations than the proposed implementation (200 iters)
+      (0..500).each do |number|
+        m = number/2
+        numerator = if number == 0
+                      1.0
+                    elsif number % 2 == 0
+                      (m * (bet - m) * x)/((alp + 2.0 * m - 1.0)* (alp + 2.0 * m))
+                    else
+                      top = -((alp + m) * (alp + bet + m) * x)
+                      down = ((alp + 2.0 * m) * (alp + 2.0 * m + 1.0))
+                      top/down
+                    end
+        d = 1.0 + numerator * d
+        d = tiny if d.abs < tiny
+        d = 1.0 / d
+        c = 1.0 + numerator / c
+        c = tiny if c.abs < tiny
+        cd = (c*d).freeze
+        f = f * cd
+        if (1.0 - cd).abs < 1.0E-10
+          returned_value = front * (f - 1.0)
+          break
+        end
+      end
+      returned_value
+    end
+  end
+end
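
The non-GSL branch of `coefficients` above solves the normal equations, computing `(x.t * x).inverse * x.t * y` with Ruby's `Matrix` class. The sketch below applies the same formula to the four houses from the README's Getting Started example, outside the gem. It assumes the `matrix` library is available (it ships with Ruby and became a bundled gem in Ruby 3.1) and skips the constant-column and collinearity handling that the gem adds.

```ruby
require "matrix"

# Feature rows (bedrooms, bathrooms) and prices from the README example
rows = [[1, 1], [2, 1], [2, 2], [3, 2]]
prices = [100000, 125000, 135000, 162000]

# Prepend a column of ones so the first coefficient is the intercept
x = Matrix.rows(rows.map { |r| [1] + r })
y = Matrix.column_vector(prices)

# Normal equations: beta = (X^T X)^-1 X^T y
beta = ((x.t * x).inverse * x.t * y).to_a.flatten.map(&:to_f)
coefficients = [:_intercept, :bedrooms, :bathrooms].zip(beta).to_h
puts coefficients
```

The result matches the coefficients hardcoded in the README's `Eps::Regressor.load` example. As the source comments note, inverting `x.t * x` directly is numerically unstable on ill-conditioned data, which is part of why the gem prefers GSL when it is installed.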

data/lib/eps/version.rb ADDED

@@ -0,0 +1,3 @@
+module Eps
+  VERSION = "0.1.0"
+end

metadata ADDED

@@ -0,0 +1,128 @@
+--- !ruby/object:Gem::Specification
+name: eps
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Andrew Kane
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2018-07-03 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: daru
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description:
+email:
+- andrew@chartkick.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".travis.yml"
+- CHANGELOG.md
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- eps.gemspec
+- guides/Modeling.md
+- lib/eps.rb
+- lib/eps/base_regressor.rb
+- lib/eps/metrics.rb
+- lib/eps/regressor.rb
+- lib/eps/version.rb
+homepage: https://github.com/ankane/eps
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.7.7
+signing_key:
+specification_version: 4
+summary: Linear regression for Ruby
+test_files: []