eps 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
-   data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
+   metadata.gz: 5ce1ae30056e3aaa08465d63571685ed2d43a17b40dc95bb2020d8d1f6752d6d
+   data.tar.gz: f6f9652d164991b82fed1375b9039baa6cbe3dd3b5413cdee244ba5aac923012
  SHA512:
-   metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
-   data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
+   metadata.gz: bd9ddb1589b2866c42fda230389a27124285bd162ca9863528b41531d1c725fca3ac7327786b373e9cecda014d9cc2f34f791f203da77215f738cf92c261db61
+   data.tar.gz: 7d209acc346b223d00827c1df817a2017882b9446ff523f985e4c89b44f10f5a702d6c6dd957c8fb9ceccd83b98f7492aa15e029c1d0655787c1823348beacd0
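
Not part of the diff itself, but the digests above are what you would check a downloaded archive against. A minimal sketch using Ruby's stdlib `Digest` (the `verified?` helper and the file path are illustrative, not gem API):

```ruby
require "digest"

# Compare a downloaded file's SHA256 against a published checksum.
# Helper name and paths are illustrative only.
def verified?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256
end

# e.g. against the 0.2.0 data.tar.gz digest listed above:
# verified?("data.tar.gz",
#   "f6f9652d164991b82fed1375b9039baa6cbe3dd3b5413cdee244ba5aac923012")
```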
data/CHANGELOG.md CHANGED
@@ -1,3 +1,9 @@
+ ## 0.2.0
+
+ - Added support for classification
+ - Added `to_pmml` method
+ - Added `Eps::Base`
+
  ## 0.1.1

  - Huge performance boost
data/README.md CHANGED
@@ -1,13 +1,15 @@
  # Eps

- Linear regression for Ruby
+ Machine learning for Ruby

- - Build models quickly and easily
+ - Build predictive models quickly and easily
  - Serve models built in Ruby, Python, R, and more
- - Automatically handles categorical variables
- - No external dependencies
+ - Supports regression (linear regression) and classification (naive Bayes)
+ - Automatically handles categorical features
  - Works great with the SciRuby ecosystem (Daru & IRuby)

+ Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
+
  [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)

  ## Installation
@@ -31,7 +33,7 @@ data = [
    {bedrooms: 2, bathrooms: 2, price: 135000},
    {bedrooms: 3, bathrooms: 2, price: 162000}
  ]
- model = Eps::Regressor.new(data, target: :price)
+ model = Eps::Model.new(data, target: :price)
  puts model.summary
  ```

@@ -43,6 +45,8 @@ model.predict(bedrooms: 2, bathrooms: 1)

  > Pass an array of hashes to make multiple predictions at once

+ The target can be numeric (regression) or categorical (classification).
+
  ## Building Models

  ### Training and Test Sets
@@ -50,15 +54,15 @@ model.predict(bedrooms: 2, bathrooms: 1)
  When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.

  ```ruby
- rng = Random.new(1) # seed random number generator
- train_set, test_set = houses.partition { rng.rand < 0.7 }
+ split_date = Date.parse("2018-06-01")
+ train_set, test_set = houses.partition { |h| h.sold_at < split_date }
  ```

- If your data has a time associated with it, we recommend splitting on this.
+ If your data doesn’t have a time associated with it, you can split it randomly.

  ```ruby
- split_date = Date.parse("2018-06-01")
- train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+ rng = Random.new(1) # seed random number generator
+ train_set, test_set = houses.partition { rng.rand < 0.7 }
  ```

  ### Outliers and Missing Data
@@ -79,10 +83,10 @@ Selecting features for a model is extremely important for performance. Features

  > Categorical features generate coefficients for each distinct value except for one

- You should do this for any ids in your data.
+ Convert any ids to strings so they’re treated as categorical features.

  ```ruby
- {city_id: "123"}
+ {city_id: city_id.to_s}
  ```

  For times, create features like day of week and hour of day with:
@@ -122,11 +126,11 @@ train_target = train_set.map { |h| target(h) }
  Now, let’s train the model.

  ```ruby
- model = Eps::Regressor.new(train_features, train_target)
+ model = Eps::Model.new(train_features, train_target)
  puts model.summary
  ```

- The summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
+ For regression, the summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).

  ### Evaluation

@@ -138,7 +142,7 @@ test_target = test_set.map { |h| target(h) }
  model.evaluate(test_features, test_target)
  ```

- This returns:
+ For regression, this returns:

  - RMSE - Root mean square error
  - MAE - Mean absolute error
@@ -146,31 +150,43 @@ This returns:

  We want to minimize the RMSE and MAE and keep the ME around 0.

+ For classification, this returns:
+
+ - Accuracy
+
+ We want to maximize the accuracy.
+
  ### Finalize

- Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+ Now that we have an idea of how the model will perform, we want to retrain the model with all of our data. Treat outliers and missing data the same as you did with the training set.

  ```ruby
+ # outliers and missing data
+ houses.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+
+ # training
  all_features = houses.map { |h| features(h) }
  all_target = houses.map { |h| target(h) }
- model = Eps::Regressor.new(all_features, all_target)
+ model = Eps::Model.new(all_features, all_target)
  ```

  We now have a model that’s ready to serve.

  ## Serving Models

- Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
+ Once the model is trained, we need to store it. Eps uses PMML - [Predictive Model Markup Language](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - a standard for storing models. A great option is to write the model to a file with:

  ```ruby
- model.dump
+ File.write("model.pmml", model.to_pmml)
  ```

- Then hardcode the result into your app.
+ > You may need to add `nokogiri` to your Gemfile
+
+ To load a model, use:

  ```ruby
- data = {:coefficients=>{:_intercept=>63500.0, :bedrooms=>26000.0, :bathrooms=>10000.0}}
- model = Eps::Regressor.load(data)
+ pmml = File.read("model.pmml")
+ model = Eps::Model.load_pmml(pmml)
  ```

  Now we can use it to make predictions.
@@ -179,57 +195,191 @@ Now we can use it to make predictions.
  model.predict(bedrooms: 2, bathrooms: 1)
  ```

- Another option that works well is writing the model to file in your app.
+ To continuously train models, we recommend [storing them in your database](#database-storage).
+
+ ## Full Example

- ```ruby
- json = model.to_json
- File.open("lib/models/housing_price.json", "w") { |f| f.write(json) }
+ We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.
+
+ In Rails, we recommend creating an `app/ml_models` directory. Be sure to restart Spring after creating the directory so files are autoloaded.
+
+ ```sh
+ bin/spring stop
  ```

- To load it, use:
+ Here’s what a complete model in `app/ml_models/price_model.rb` may look like:

  ```ruby
- json = File.read("lib/models/housing_price.json")
- model = Eps::Regressor.load_json(json)
- ```
+ class PriceModel < Eps::Base
+   def build
+     houses = House.all.to_a

- To continuously train models, we recommend [storing them in your database](#database-storage).
+     # divide into training and test set
+     split_date = Date.parse("2018-06-01")
+     train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+
+     # handle outliers and missing values
+     train_set = preprocess(train_set)

- ### Beyond Ruby
+     # train
+     train_features = train_set.map { |v| features(v) }
+     train_target = train_set.map { |v| target(v) }
+     model = Eps::Model.new(train_features, train_target)
+     puts model.summary

- Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language. Eps can load models in:
+     # evaluate
+     test_features = test_set.map { |v| features(v) }
+     test_target = test_set.map { |v| target(v) }
+     metrics = model.evaluate(test_features, test_target)
+     puts "Test RMSE: #{metrics[:rmse]}"
+     # for classification, use:
+     # puts "Test accuracy: #{metrics[:accuracy]}"

- JSON
+     # finalize
+     houses = preprocess(houses)
+     all_features = houses.map { |h| features(h) }
+     all_target = houses.map { |h| target(h) }
+     @model = Eps::Model.new(all_features, all_target)
+
+     # save
+     File.write(model_file, @model.to_pmml)
+   end
+
+   def predict(house)
+     model.predict(features(house))
+   end
+
+   private
+
+   def preprocess(train_set)
+     train_set.reject { |h| h.bedrooms.nil? || h.price < 10000 }
+   end
+
+   def features(house)
+     {
+       bedrooms: house.bedrooms,
+       city_id: house.city_id.to_s,
+       month: house.sold_at.strftime("%b")
+     }
+   end
+
+   def target(house)
+     house.price
+   end
+
+   def model
+     @model ||= Eps::Model.load_pmml(File.read(model_file))
+   end
+
+   def model_file
+     File.join(__dir__, "price_model.pmml")
+   end
+ end
+ ```
+
+ Build the model with:

  ```ruby
- data = File.read("model.json")
- model = Eps::Regressor.load_json(data)
+ PriceModel.build
  ```

- [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - Predictive Model Markup Language
+ This saves the model to `price_model.pmml`. Be sure to check this into source control.
+
+ Predict with:

  ```ruby
- data = File.read("model.pmml")
- model = Eps::Regressor.load_pmml(data)
+ PriceModel.predict(house)
  ```

- > Loading PMML requires Nokogiri to be installed
+ ## Monitoring

- [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
+ We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:

  ```ruby
- data = File.read("model.pfa")
- model = Eps::Regressor.load_pfa(data)
+ actual = houses.map(&:price)
+ estimated = houses.map(&:estimated_price)
+ Eps.metrics(actual, estimated)
  ```

- Here are examples for how to dump models in each:
+ This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
+
+ ## Other Languages
+
+ Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language.

- - [R JSON](guides/Modeling.md#r-json)
- - [R PMML](guides/Modeling.md#r-pmml)
- - [R PFA](guides/Modeling.md#r-pfa)
- - [Python JSON](guides/Modeling.md#python-json)
- - [Python PMML](guides/Modeling.md#python-pmml)
- - [Python PFA](guides/Modeling.md#python-pfa)
+ Eps can serve linear regression and naive Bayes models. Check out [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
+
+ ### R
+
+ To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package:
+
+ ```r
+ install.packages("pmml")
+ ```
+
+ For regression, run:
+
+ ```r
+ library(pmml)
+
+ model <- lm(dist ~ speed, cars)
+
+ # save model
+ data <- toString(pmml(model))
+ write(data, file="model.pmml")
+ ```
+
+ For classification, run:
+
+ ```r
+ library(pmml)
+ library(e1071)
+
+ model <- naiveBayes(Species ~ ., iris)
+
+ # save model
+ data <- toString(pmml(model, predictedField="Species"))
+ write(data, file="model.pmml")
+ ```
+
+ ### Python
+
+ To create a model in Python, install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package:
+
+ ```sh
+ pip install sklearn2pmml
+ ```
+
+ For regression, run:
+
+ ```python
+ from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+ from sklearn.linear_model import LinearRegression
+
+ x = [1, 2, 3, 5, 6]
+ y = [5 * xi + 3 for xi in x]
+
+ model = LinearRegression()
+ model.fit([[xi] for xi in x], y)
+
+ # save model
+ sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+ ```
+
+ For classification, run:
+
+ ```python
+ from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+ from sklearn.naive_bayes import GaussianNB
+
+ x = [1, 2, 3, 5, 6]
+ y = ["ham", "ham", "ham", "spam", "spam"]
+
+ model = GaussianNB()
+ model.fit([[xi] for xi in x], y)
+
+ sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+ ```

  ### Verifying
 
@@ -244,26 +394,25 @@ house_id | prediction
  Once the model is implemented in Ruby, confirm the predictions match.

  ```ruby
- model = Eps::Regressor.load_json("model.json")
+ model = Eps::Model.load_pmml("model.pmml")

  # preload houses to prevent n+1
  houses = House.all.index_by(&:id)

- CSV.foreach("predictions.csv", headers: true) do |row|
-   house = houses[row["house_id"].to_i]
-   expected = row["prediction"].to_f
+ CSV.foreach("predictions.csv", headers: true, converters: :numeric) do |row|
+   house = houses[row["house_id"]]
+   expected = row["prediction"]

    actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)

-   unless (actual - expected).abs < 0.001
-     raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})"
-   end
+   success = actual.is_a?(String) ? actual == expected : (actual - expected).abs < 0.001
+   raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})" unless success

    putc "✓"
  end
  ```

- ### Database Storage
+ ## Database Storage

  The database is another place you can store models. It’s good if you retrain models automatically.

@@ -278,29 +427,17 @@ rails g model Model key:string:uniq data:text
  Store the model with:

  ```ruby
- store = Model.where(key: "housing_price").first_or_initialize
- store.update(data: model.to_json)
+ store = Model.where(key: "price").first_or_initialize
+ store.update(data: model.to_pmml)
  ```

  Load the model with:

  ```ruby
- data = Model.find_by!(key: "housing_price").data
- model = Eps::Regressor.load_json(data)
- ```
-
- ## Monitoring
-
- We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
-
- ```ruby
- actual = houses.map(&:price)
- estimated = houses.map(&:estimated_price)
- Eps.metrics(actual, estimated)
+ data = Model.find_by!(key: "price").data
+ model = Eps::Model.load_pmml(data)
  ```

- This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0.
-
  ## Training Performance

  Speed up training on large datasets with GSL.
@@ -319,6 +456,8 @@ gem 'gsl', group: :development

  It only needs to be available in environments used to build the model.

+ > This only speeds up regression, not classification
+
  ## Data

  A number of data formats are supported. You can pass the target variable separately.
@@ -326,7 +465,7 @@ A number of data formats are supported. You can pass the target variable separat
  ```ruby
  x = [{x: 1}, {x: 2}, {x: 3}]
  y = [1, 2, 3]
- Eps::Regressor.new(x, y)
+ Eps::Model.new(x, y)
  ```

  Or pass arrays of arrays
@@ -334,7 +473,7 @@ Or pass arrays of arrays
  ```ruby
  x = [[1, 2], [2, 0], [3, 1]]
  y = [1, 2, 3]
- Eps::Regressor.new(x, y)
+ Eps::Model.new(x, y)
  ```

  ## Daru
@@ -343,7 +482,7 @@ Eps works well with Daru data frames.

  ```ruby
  df = Daru::DataFrame.from_csv("houses.csv")
- Eps::Regressor.new(df, target: "price")
+ Eps::Model.new(df, target: "price")
  ```

  To split into training and test sets, use:
@@ -365,22 +504,34 @@ CSV.table("data.csv").map { |row| row.to_h }

  ## Jupyter & IRuby

- You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+ You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://ankane.org/jupyter-rails).

  ## Reference

- Get coefficients
-
- ```ruby
- model.coefficients
- ```
-
  Get an extended summary with standard error, t-values, and r-squared

  ```ruby
  model.summary(extended: true)
  ```

+ ## Upgrading
+
+ ### 0.2.0
+
+ Eps 0.2.0 brings a number of improvements, including support for classification.
+
+ We recommend:
+
+ 1. Changing `Eps::Regressor` to `Eps::Model`
+ 2. Converting models from JSON to PMML
+
+ ```ruby
+ model = Eps::Model.load_json("model.json")
+ File.write("model.pmml", model.to_pmml)
+ ```
+
+ 3. Renaming `app/stats_models` to `app/ml_models`
+
  ## History

  View the [changelog](https://github.com/ankane/eps/blob/master/CHANGELOG.md)
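
As an aside, the seeded random train/test split recommended in the README changes above is plain Ruby and easy to try on its own. A minimal sketch with stdlib only; the `houses` records here are made-up sample data:

```ruby
# Reproducible 70/30 split, per the README's advice for data without
# a natural time ordering. Sample records are illustrative only.
houses = (1..10).map { |i| { id: i, price: 100_000 + i * 5_000 } }

rng = Random.new(1) # fixed seed => identical split on every run
train_set, test_set = houses.partition { rng.rand < 0.7 }

puts "train: #{train_set.size}, test: #{test_set.size}"
```

Seeding the generator matters: it makes the split reproducible across runs, so evaluation numbers are comparable while you iterate on features.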