eps 0.1.0 → 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.travis.yml +5 -2
- data/CHANGELOG.md +4 -0
- data/README.md +49 -8
- data/Rakefile +23 -0
- data/guides/Modeling.md +17 -8
- data/lib/eps/base_regressor.rb +52 -3
- data/lib/eps/regressor.rb +8 -37
- data/lib/eps/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
+  data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
+  data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -6,6 +6,9 @@ Linear regression for Ruby
 - Serve models built in Ruby, Python, R, and more
 - Automatically handles categorical variables
 - No external dependencies
+- Works great with the SciRuby ecosystem (Daru & IRuby)
+
+[](https://travis-ci.org/ankane/eps)
 
 ## Installation
 
@@ -58,6 +61,14 @@ split_date = Date.parse("2018-06-01")
 train_set, test_set = houses.partition { |h| h.sold_at < split_date }
 ```
 
+### Outliers and Missing Data
+
+Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+
+```ruby
+train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+```
+
 ### Feature Engineering
 
 Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
@@ -87,20 +98,31 @@ def features(house)
   {
     bedrooms: house.bedrooms,
     city_id: house.city_id.to_s,
-    month: house.sold_at.strftime("%b")
-    price: house.price
+    month: house.sold_at.strftime("%b")
   }
 end
 
-
+train_features = train_set.map { |h| features(h) }
+```
+
+> We use a method for features so it can be used across training, evaluation, and prediction
+
+We also need to prepare the target variable.
+
+```ruby
+def target(house)
+  house.price
+end
+
+train_target = train_set.map { |h| target(h) }
 ```
 
 ### Training
 
-
+Now, let’s train the model.
 
 ```ruby
-model = Eps::Regressor.new(
+model = Eps::Regressor.new(train_features, train_target)
 puts model.summary
 ```
 
@@ -111,18 +133,31 @@ The summary includes the coefficients and their significance. The lower the p-va
 When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
 
 ```ruby
-
-
+test_features = test_set.map { |h| features(h) }
+test_target = test_set.map { |h| target(h) }
+model.evaluate(test_features, test_target)
 ```
 
 This returns:
 
--
+- RMSE - Root mean square error
 - MAE - Mean absolute error
 - ME - Mean error
 
 We want to minimize the RMSE and MAE and keep the ME around 0.
 
+### Finalize
+
+Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+
+```ruby
+all_features = houses.map { |h| features(h) }
+all_target = houses.map { |h| target(h) }
+model = Eps::Regressor.new(all_features, all_target)
+```
+
+We now have a model that’s ready to serve.
+
 ## Serving Models
 
 Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
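An aside on the RMSE/MAE/ME list added above: the README names the metrics but doesn't spell out their formulas. A minimal sketch of the usual definitions (illustrative only, not Eps's internal `Eps.metrics` code; the hash keys here are made up):

```ruby
# illustrative only - not Eps's implementation of Eps.metrics
def metrics_sketch(actual, estimated)
  errors = actual.zip(estimated).map { |a, e| a - e }
  n = errors.size.to_f
  {
    rmse: Math.sqrt(errors.sum { |e| e**2 } / n), # root mean square error
    mae: errors.sum { |e| e.abs } / n,            # mean absolute error
    me: errors.sum / n                            # mean error (signed, should hover near 0)
  }
end

metrics_sketch([3.0, 5.0], [2.5, 6.0]) # => {:rmse=>0.79..., :mae=>0.75, :me=>-0.25}
```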
@@ -178,6 +213,8 @@ data = File.read("model.pmml")
 model = Eps::Regressor.load_pmml(data)
 ```
 
+> Loading PMML requires Nokogiri to be installed
+
 [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
 
 ```ruby
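The Nokogiri note above pairs with the `load_pmml` change in `base_regressor.rb` further down: the gem is only required when a PMML string is actually parsed. For a Bundler-managed project, a minimal sketch of what that implies:

```ruby
# Gemfile (sketch) - only needed if you plan to load PMML models
gem "nokogiri"
```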
@@ -326,6 +363,10 @@ When importing data from CSV files, be sure to convert numeric fields. The `tabl
 CSV.table("data.csv").map { |row| row.to_h }
 ```
 
+## Jupyter & IRuby
+
+You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+
 ## Reference
 
 Get coefficients
data/Rakefile
CHANGED
@@ -9,3 +9,26 @@ Rake::TestTask.new(:test) do |t|
 end
 
 task default: :test
+
+task :benchmark do
+  require "benchmark"
+  require "eps"
+  require "gsl" if ENV["GSL"]
+
+  data = []
+  10000.times do
+    row = {}
+    30.times do |i|
+      row[:"x#{i}"] = rand(100)
+    end
+    row[:y] = rand(100)
+    data << row
+  end
+
+  puts "Starting benchmark..."
+
+  time = Benchmark.realtime do
+    Eps::Regressor.new(data, target: :y)
+  end
+  p time.round(1)
+end
data/guides/Modeling.md
CHANGED
@@ -21,7 +21,8 @@ And run:
 library(jsonlite)
 
 model <- lm(dist ~ speed, cars)
-toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+data <- toJSON(list(coefficients=as.list(coef(model))), auto_unbox=TRUE)
+write(data, file="model.json")
 ```
 
 ## R PMML
@@ -38,7 +39,8 @@ And run:
 library(pmml)
 
 model <- lm(dist ~ speed, cars)
-toString(pmml(model))
+data <- toString(pmml(model))
+write(data, file="model.pmml")
 ```
 
 ## R PFA
@@ -55,7 +57,7 @@ And run:
 library(aurelius)
 
 model <- lm(dist ~ speed, cars)
-write_pfa(pfa(model))
+write_pfa(pfa(model), file="model.pfa")
 ```
 
 ## Python JSON
@@ -80,15 +82,19 @@ coefficients = {'_intercept': model.intercept_}
 for i, c in enumerate(model.coef_):
     coefficients[features[i]] = c
 
-
+
+data = json.dumps({'coefficients': coefficients})
+
+with open('model.json', 'w') as f:
+    f.write(data)
 ```
 
 ## Python PMML
 
-Install the [
+Install the [scikit2pmml](https://github.com/vaclavcadek/scikit2pmml) package
 
 ```sh
-pip install
+pip install scikit2pmml
 ```
 
 And run:
@@ -103,7 +109,7 @@ y = [5 * xi + 3 for xi in x]
 model = linear_model.LinearRegression()
 model.fit([[xi] for xi in x], y)
 
-scikit2pmml(estimator=model, file='
+scikit2pmml(estimator=model, file='model.pmml')
 ```
 
 ## Python PFA
@@ -139,5 +145,8 @@ action:
 
     return pfaDocument
 
-json.dumps(pfa(model))
+data = json.dumps(pfa(model))
+
+with open('model.pfa', 'w') as f:
+    f.write(data)
 ```
data/lib/eps/base_regressor.rb
CHANGED
@@ -24,6 +24,22 @@ module Eps
       singular ? pred[0] : pred
     end
 
+    def evaluate(data, y = nil, target: nil)
+      raise ArgumentError, "missing target" if !target && !y
+
+      actual = y
+      actual ||=
+        if daru?(data)
+          data[target].to_a
+        else
+          data.map { |v| v[target] }
+        end
+
+      actual = prep_y(actual)
+      estimated = predict(data)
+      Eps.metrics(actual, estimated)
+    end
+
     # ruby
 
     def self.load(data)
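A quick note on the new `evaluate` signature above: it accepts either a separate array of target values or a `target:` keyword pointing at a column in the rows themselves. A hedged usage sketch (variable names like `test_rows` and `:price` are made up; the exact result keys come from `Eps.metrics`):

```ruby
# 1. trained model, features and targets prepared separately (as in the README)
model.evaluate(test_features, test_target)

# 2. a model loaded from PMML, evaluated against rows that still contain the
#    target column - target: tells evaluate which key holds the actual values
loaded = Eps::Regressor.load_pmml(File.read("model.pmml"))
loaded.evaluate(test_rows, target: :price)
```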
@@ -56,7 +72,11 @@ module Eps
     # pmml
 
     def self.load_pmml(data)
-
+      if data.is_a?(String)
+        require "nokogiri"
+        data = Nokogiri::XML(data)
+      end
+
       # TODO more validation
       node = data.css("RegressionTable")
       coefficients = {
@@ -94,7 +114,7 @@ module Eps
       BaseRegressor.new(coefficients: coefficients)
     end
 
-
+    protected
 
     def daru?(x)
       defined?(Daru) && x.is_a?(Daru::DataFrame)
@@ -132,6 +152,10 @@ module Eps
         xi.each do |k, v|
           key = v.is_a?(String) ? [k.to_sym, v] : k.to_sym
           v2 = v.is_a?(String) ? 1 : v
+
+          # TODO make more efficient
+          next if !train && !coefficients.key?(symbolize_coef(key))
+
           raise "Missing data" if v2.nil?
 
           unless cache[key]
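The guard added above (`next if !train && !coefficients.key?(symbolize_coef(key))`) means that, outside of training, keys with no matching coefficient are skipped rather than pushed into the prediction matrix. A rough sketch of the effect, with made-up feature names (behavior inferred from the diff, so treat it as an illustration rather than a guarantee):

```ruby
model = Eps::Regressor.new(
  [{x: 1, y: 3}, {x: 2, y: 5}, {x: 3, y: 7}, {x: 4, y: 9}],
  target: :y
)

# :unused was never seen during training, so the new guard skips it instead of
# looking for a coefficient that doesn't exist
model.predict(x: 5, unused: 1)
```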
@@ -167,7 +191,7 @@ module Eps
       end
 
       # flatten keys
-      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| (k
+      c = [:_intercept] + cache.sort_by { |_, v| v }.map { |k, _| symbolize_coef(k) }
 
       if c.size != c.uniq.size
         raise "Overlapping coefficients"
@@ -176,8 +200,33 @@ module Eps
       [ret2, c]
     end
 
+    def symbolize_coef(k)
+      (k.is_a?(Array) ? k.join("") : k).to_sym
+    end
+
     def matrix_arr(matrix)
       matrix.to_a.map { |xi| xi[0].to_f }
     end
+
+    # determine if target is a string or symbol
+    def prep_target(target, data)
+      if daru?(data)
+        data.has_vector?(target) ? target : flip_target(target)
+      else
+        x = data[0] || {}
+        x[target] ? target : flip_target(target)
+      end
+    end
+
+    def flip_target(target)
+      target.is_a?(String) ? target.to_sym : target.to_s
+    end
+
+    def prep_y(y)
+      y.each do |yi|
+        raise "Target missing in data" if yi.nil?
+      end
+      y.map(&:to_f)
+    end
   end
 end
data/lib/eps/regressor.rb
CHANGED
@@ -52,8 +52,9 @@ module Eps
       # haven't found an efficient way to do QR-factorization in Ruby
       # the extendmatrix gem has householder and givens (givens has bug)
       # but methods are too slow
+      xt = x.t
       begin
-        @xtxi = (
+        @xtxi = (xt * x).inverse
       rescue ExceptionForMatrix::ErrNotRegular
         constant = {}
         (1...x.column_count).each do |i|
@@ -77,15 +78,18 @@ module Eps
           vectors.delete_at(i)
         end
         x = Matrix.columns(vectors)
+        xt = x.t
 
         # try again
         begin
-          @xtxi = (
+          @xtxi = (xt * x).inverse
         rescue ExceptionForMatrix::ErrNotRegular
           raise "Multiple solutions - GSL is needed to select one"
        end
       end
-
+      # huge performance boost
+      # by multiplying xt * y first
+      v2 = matrix_arr(@xtxi * (xt * y))
 
       # add back removed
       removed.sort.each do |i|
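The `# huge performance boost by multiplying xt * y first` comment above is just matrix-multiplication associativity: with `n` rows and `k` features, `@xtxi * (xt * y)` reduces `xt * y` to a `k x 1` vector before the final multiply, whereas `(@xtxi * xt) * y` would materialize a `k x n` intermediate. A standalone sketch (not Eps code) showing the difference with Ruby's `Matrix`:

```ruby
require "matrix"
require "benchmark"

n = 5_000  # rows
k = 30     # features
x = Matrix.build(n, k) { rand }
y = Matrix.column_vector(Array.new(n) { rand })

xt = x.t
xtxi = (xt * x).inverse

left_first  = Benchmark.realtime { (xtxi * xt) * y } # builds a k x n intermediate
right_first = Benchmark.realtime { xtxi * (xt * y) } # intermediate is only k x 1
puts "left-to-right: #{left_first.round(2)}s, xt * y first: #{right_first.round(2)}s"
```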
@@ -101,19 +105,7 @@ module Eps
     end
 
     def evaluate(data, y = nil)
-
-
-      actual ||=
-        if daru?(data)
-          data[@target].to_a
-        else
-          data.map { |v| v[@target] }
-        end
-
-      actual = prep_y(actual)
-
-      estimated = predict(data)
-      Eps.metrics(actual, estimated)
+      super(data, y, target: @target)
     end
 
     # https://people.richland.edu/james/ictcm/2004/multiple.html
@@ -154,27 +146,6 @@ module Eps
       arr.all? { |x| x == arr[0] }
     end
 
-    # determine if target is a string or symbol
-    def prep_target(target, data)
-      if daru?(data)
-        data.has_vector?(target) ? target : flip_target(target)
-      else
-        x = data[0] || {}
-        x[target] ? target : flip_target(target)
-      end
-    end
-
-    def flip_target(target)
-      target.is_a?(String) ? target.to_sym : target.to_s
-    end
-
-    def prep_y(y)
-      y.each do |yi|
-        raise "Target missing in data" if yi.nil?
-      end
-      y.map(&:to_f)
-    end
-
     # add epsilon for perfect fits
     # consistent with GSL
     def t_value
data/lib/eps/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: eps
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-07-
+date: 2018-07-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
|