eps 0.2.1 → 0.3.0

Files changed (23)

checksums.yaml +4 -4
data/CHANGELOG.md +14 -0
data/LICENSE.txt +1 -1
data/README.md +183 -243
data/lib/eps.rb +27 -3
data/lib/eps/base_estimator.rb +316 -47
data/lib/eps/data_frame.rb +141 -0
data/lib/eps/evaluators/lightgbm.rb +116 -0
data/lib/eps/evaluators/linear_regression.rb +54 -0
data/lib/eps/evaluators/naive_bayes.rb +95 -0
data/lib/eps/evaluators/node.rb +26 -0
data/lib/eps/label_encoder.rb +41 -0
data/lib/eps/lightgbm.rb +237 -0
data/lib/eps/linear_regression.rb +132 -386
data/lib/eps/metrics.rb +46 -0
data/lib/eps/model.rb +16 -58
data/lib/eps/naive_bayes.rb +175 -164
data/lib/eps/pmml_generators/lightgbm.rb +187 -0
data/lib/eps/statistics.rb +79 -0
data/lib/eps/text_encoder.rb +81 -0
data/lib/eps/utils.rb +22 -0
data/lib/eps/version.rb +1 -1
metadata +33 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ae5bc00818b79dc5e07f4dcda7ca56aa825f1014d1f70203564a87cb49b375d4
-  data.tar.gz: 8ba22dddc8635da418c429c12066c63bc1aea15238c32c4a7c4185f66281b6a5
+  metadata.gz: 3ca27ba2379d1cbfb6f3407ace5ad9dd5fcb71b08e48b8805ddda6483c026194
+  data.tar.gz: 91bb0beb50664dda5c2a42684414b1972e2bff91c3a993926639939c91272ccd
 SHA512:
-  metadata.gz: e8a0f8cc325d26618691613a6213f6471b45c94a22bb2c9eb6ea729543dce4deabd9875d8e7055649fde90066d30d09c0b1b61949598c6859557e8270ff8e776
-  data.tar.gz: 0b8d0918e9571ce1587d4497b8de84b64222f19ba7c466d24bd490605dcfbed10ebfa10ad10ed6f03a13ab45a7bc1b77d5928f45d0a9f8ba7757e569591a36fe
+  metadata.gz: 648d8098928d0ed952ad4cf2195b3e2562db5a38249357b76eb39c0aa17d8f8f974936c4773b2395ae1b1197aedb6e47c8fd018675496f3f966ee2feebb1ed2d
+  data.tar.gz: aa48887027114d9b654f3564715586a1740b742fe7778602d8db770b4921cff8acfbf90baea3ae6092d7c3962f37763c630857d71fbcd573402dfb016159f0c2
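
Note (not part of the diff): the values above are the digests recorded for the `metadata.gz` and `data.tar.gz` archives packed inside the `.gem` file. A minimal sketch of how one of them could be checked with Ruby's standard `digest` library follows; the helper name `digest_matches?` is hypothetical, and it assumes the archive has already been extracted from the gem.

```ruby
require "digest"

# Hypothetical helper (not part of RubyGems or Eps): compare the digest of a
# file on disk with an expected hex string such as those in checksums.yaml.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end
```

Usage might look like `digest_matches?("data.tar.gz", "91bb0beb...")`, with the full SHA256 value from the listing above in place of the shortened string; pass `algorithm: Digest::SHA512` to check the longer digests.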

data/CHANGELOG.md CHANGED

@@ -1,3 +1,17 @@
+## 0.3.0
+- Added support for LightGBM
+- Added text features
+- Fixed naive Bayes PMML
+- Fixed error with classification and Daru
+Breaking
+- LightGBM is now the default for new models
+- Cross-validation happens automatically by default
+- Removed support for JSON and PFA formats
+- Added smoothing to naive Bayes
 ## 0.2.1
 - Fixed error with `summary`
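
Note (not part of the diff): the changelog does not say which kind of smoothing was added to naive Bayes. As an illustration only, additive (Laplace) smoothing is a common choice for categorical features, because it keeps a category that was never seen for a class from producing a zero probability. The method name and the `alpha` parameter below are assumptions, not Eps's API.

```ruby
# Hypothetical sketch of additive (Laplace) smoothing; not taken from Eps.
# counts:     occurrences of each category within one class
# categories: every category seen across all classes
def smoothed_probabilities(counts, categories, alpha: 1.0)
  total = counts.values.sum + alpha * categories.size
  categories.to_h { |c| [c, (counts.fetch(c, 0) + alpha) / total] }
end

# Made-up counts for illustration: "TX" was never observed for this class
# but still receives a non-zero probability.
probs = smoothed_probabilities({"CA" => 3, "NY" => 1}, ["CA", "NY", "TX"])
```

With `alpha: 1.0` the probabilities still sum to 1, since the same constant is added to every category before normalizing.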

data/LICENSE.txt CHANGED

@@ -1,6 +1,6 @@
 The MIT License (MIT)
-Copyright (c) 2018 Andrew Kane
+Copyright (c) 2018-2019 Andrew Kane
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

data/README.md CHANGED

@@ -4,9 +4,7 @@ Machine learning for Ruby
 - Build predictive models quickly and easily
 - Serve models built in Ruby, Python, R, and more
-- Supports regression (linear regression) and classification (naive Bayes)
-- Automatically handles categorical features
-- Works great with the SciRuby ecosystem (Daru & IRuby)
+- No prior knowledge of machine learning required :tada:
 Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
@@ -20,7 +18,11 @@ Add this line to your application’s Gemfile:
 gem 'eps'
 ```
-To speed up training on large datasets, you can also [add GSL](#training-performance).
+On Mac, also install OpenMP:
+```sh
+brew install libomp
+```
 ## Getting Started
@@ -43,160 +45,119 @@ Make a prediction
 model.predict(bedrooms: 2, bathrooms: 1)
 ```
-> Pass an array of hashes make multiple predictions at once
-The target can be numeric (regression) or categorical (classification).
-## Building Models
-### Training and Test Sets
-When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.
+Store the model
 ```ruby
-split_date = Date.parse("2018-06-01")
-train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+File.write("model.pmml", model.to_pmml)
 ```
-If your data doesn’t have a time associated with it, you can split it randomly.
+Load the model
 ```ruby
-rng = Random.new(1) # seed random number generator
-train_set, test_set = houses.partition { rng.rand < 0.7 }
+pmml = File.read("model.pmml")
+model = Eps::Model.load_pmml(pmml)
 ```
-### Outliers and Missing Data
+A few notes:
-Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+- The target can be numeric (regression) or categorical (classification)
+- Pass an array of hashes to `predict` to make multiple predictions at once
+- Models are stored in [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language), a standard for model storage
-```ruby
-train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
-```
+## Building Models
-### Feature Engineering
+### Goal
-Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
+Often, the goal of building a model is to make good predictions on future data. To help achieve this, Eps splits the data into training and validation sets if you have 30+ data points. It uses the training set to build the model and the validation set to evaluate the performance.
+If your data has a time associated with it, it’s highly recommended to use that field for the split.
 ```ruby
-{state: "CA"}
+Eps::Model.new(data, target: :price, split: :listed_at)
 ```
-> Categorical features generate coefficients for each distinct value except for one
+Otherwise, the split is random. There are a number of [other options](#validation-options) as well.
-Convert any ids to strings so they’re treated as categorical features.
+Performance is reported in the summary.
-```ruby
-{city_id: city_id.to_s}
-```
-For times, create features like day of week and hour of day with:
+- For regression, it reports validation RMSE (root mean squared error) - lower is better
+- For classification, it reports validation accuracy - higher is better
-```ruby
-{weekday: time.wday.to_s, hour: time.hour.to_s}
-```
+Typically, the best way to improve performance is feature engineering.
-In practice, your code may look like:
+### Feature Engineering
-```ruby
-def features(house)
-  {
-    bedrooms: house.bedrooms,
-    city_id: house.city_id.to_s,
-    month: house.sold_at.strftime("%b")
-  }
-end
+Features are extremely important for model performance. Features can be:
-train_features = train_set.map { |h| features(h) }
-```
+1. numeric
+2. categorical
+3. text
-> We use a method for features so it can be used across training, evaluation, and prediction
+#### Numeric
-We also need to prepare the target variable.
+For numeric features, use any numeric type.
 ```ruby
-def target(house)
-  house.price
-end
-train_target = train_set.map { |h| target(h) }
+{bedrooms: 4, bathrooms: 2.5}
 ```
-### Training
+#### Categorical
-Now, let’s train the model.
+For categorical features, use strings or booleans.
 ```ruby
-model = Eps::Model.new(train_features, train_target)
-puts model.summary
+{state: "CA", basement: true}
 ```
-For regression, the summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
-### Evaluation
-When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
+Convert any ids to strings so they’re treated as categorical features.
 ```ruby
-test_features = test_set.map { |h| features(h) }
-test_target = test_set.map { |h| target(h) }
-model.evaluate(test_features, test_target)
+{city_id: city_id.to_s}
 ```
-For regression, this returns:
-- RMSE - Root mean square error
-- MAE - Mean absolute error
-- ME - Mean error
-We want to minimize the RMSE and MAE and keep the ME around 0.
-For classification, this returns:
-- Accuracy
-We want to maximize the accuracy.
+For dates, create features like day of week and month.
-### Finalize
+```ruby
+{weekday: sold_on.strftime("%a"), month: sold_on.strftime("%b")}
+```
-Now that we have an idea of how the model will perform, we want to retrain the model with all of our data. Treat outliers and missing data the same as you did with the training set.
+For times, create features like day of week and hour of day.
 ```ruby
-# outliers and missing data
-houses.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
-# training
-all_features = houses.map { |h| features(h) }
-all_target = houses.map { |h| target(h) }
-model = Eps::Model.new(all_features, all_target)
+{weekday: listed_at.strftime("%a"), hour: listed_at.hour.to_s}
 ```
-We now have a model that’s ready to serve.
+#### Text
-## Serving Models
-Once the model is trained, we need to store it. Eps uses PMML - [Predictive Model Markup Language](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - a standard for storing models. A great option is to write the model to a file with:
+For text features, use strings with multiple words.
 ```ruby
-File.write("model.pmml", model.to_pmml)
+{description: "a beautiful house on top of a hill"}
 ```
-> You may need to add `nokogiri` to your Gemfile
+This creates features based on word count (term frequency).
-To load a model, use:
+You can specify text features explicitly with:
 ```ruby
-pmml = File.read("model.pmml")
-model = Eps::Model.load_pmml(pmml)
+Eps::Model.new(data, target: :price, text_features: [:description])
 ```
-Now we can use it to make predictions.
+You can set advanced options with:
 ```ruby
-model.predict(bedrooms: 2, bathrooms: 1)
+text_features: {
+  description: {
+    min_occurences: 5,
+    max_features: 1000,
+    min_length: 1,
+    case_sensitive: true,
+    tokenizer: /\s+/,
+    stop_words: ["and", "the"]
+  }
+}
 ```
-To continuously train models, we recommend [storing them in your database](#database-storage).
 ## Full Example
 We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.
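
Note (not part of the diff): the text-feature options introduced in the hunk above (`tokenizer`, `min_length`, `case_sensitive`, `stop_words`) can be pictured with a small term-frequency sketch. This is plain Ruby for illustration, not Eps's implementation; the method name `term_frequencies` and its default values are assumptions.

```ruby
# Hypothetical illustration of word-count (term frequency) features.
# Stop words are assumed to be given in lowercase when case_sensitive is false.
def term_frequencies(text, tokenizer: /\s+/, min_length: 1, case_sensitive: false, stop_words: [])
  text = text.downcase unless case_sensitive
  tokens = text.split(tokenizer)
  # Drop tokens that are too short (including empty strings) or are stop words
  tokens = tokens.reject { |t| t.length < min_length || stop_words.include?(t) }
  # Count how often each remaining token occurs
  tokens.each_with_object(Hash.new(0)) { |t, counts| counts[t] += 1 }
end

term_frequencies("a beautiful house on top of a hill", stop_words: ["on", "of"])
# => {"a"=>2, "beautiful"=>1, "house"=>1, "top"=>1, "hill"=>1}
```

Options such as `min_occurences` and `max_features` would need counts across the whole data set, so they are left out of this single-document sketch.
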
@@ -212,38 +173,18 @@ Here’s what a complete model in `app/ml_models/price_model.rb` may look like:
 ```ruby
 class PriceModel < Eps::Base
   def build
-    houses = House.all.to_a
-    # divide into training and test set
-    split_date = Date.parse("2018-06-01")
-    train_set, test_set = houses.partition { |h| h.sold_at < split_date }
-    # handle outliers and missing values
-    train_set = preprocess(train_set)
+    houses = House.all
     # train
-    train_features = train_set.map { |v| features(v) }
-    train_target = train_set.map { |v| target(v) }
-    model = Eps::Model.new(train_features, train_target)
+    data = houses.map { |v| features(v) }
+    model = Eps::Model.new(data, target: :price, split: :listed_at)
     puts model.summary
-    # evaluate
-    test_features = test_set.map { |v| features(v) }
-    test_target = test_set.map { |v| target(v) }
-    metrics = model.evaluate(test_features, test_target)
-    puts "Test RMSE: #{metrics[:rmse]}"
-    # for classification, use:
-    # puts "Test accuracy: #{(100 * metrics[:accuracy]).round}%"
-    # finalize
-    houses = preprocess(houses)
-    all_features = houses.map { |h| features(h) }
-    all_target = houses.map { |h| target(h) }
-    model = Eps::Model.new(all_features, all_target)
-    # save
+    # save to file
     File.write(model_file, model.to_pmml)
-    @model = nil # reset for future predictions
+    # ensure reloads from file
+    @model = nil
   end
   def predict(house)
@@ -252,22 +193,16 @@ class PriceModel < Eps::Base
   private
-  def preprocess(train_set)
-    train_set.reject { |h| h.bedrooms.nil? || h.price < 10000 }
-  end
   def features(house)
     {
       bedrooms: house.bedrooms,
       city_id: house.city_id.to_s,
-      month: house.sold_at.strftime("%b")
+      month: house.listed_at.strftime("%b"),
+      listed_at: house.listed_at,
+      price: house.price
     }
   end
-  def target(house)
-    house.price
-  end
   def model
     @model ||= Eps::Model.load_pmml(File.read(model_file))
   end
@@ -298,50 +233,17 @@ We recommend monitoring how well your models perform over time. To do this, save
 ```ruby
 actual = houses.map(&:price)
-estimated = houses.map(&:estimated_price)
-Eps.metrics(actual, estimated)
+predicted = houses.map(&:predicted_price)
+Eps.metrics(actual, predicted)
 ```
-This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
+For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
 ## Other Languages
-Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language.
-Eps can serve linear regression and Naive bayes models. Check out [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
-### R
-To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package
-```r
-install.packages("pmml")
-```
-For regression, run:
+Eps makes it easy to serve models from other languages. You can build models in Python, R, and others and serve them in Ruby without having to worry about how to deploy or run another language.
-```r
-library(pmml)
-model <- lm(dist ~ speed,  cars)
-# save model
-data <- toString(pmml(model))
-write(data, file="model.pmml")
-```
-For classification, run:
-```r
-library(pmml)
-library(e1071)
-model <- naiveBayes(Species ~ .,  iris)
-# save model
-data <- toString(pmml(model, predictedField="Species"))
-write(data, file="model.pmml")
-```
+Eps can serve LightGBM, linear regression, and naive Bayes models. Check out [ONNX Runtime](https://github.com/ankane/onnxruntime) and [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
 ### Python
@@ -351,36 +253,25 @@ To create a model in Python, install the [sklearn2pmml](https://github.com/jpmml
 pip install sklearn2pmml
 ```
-For regression, run:
+And check out the examples:
-```python
-from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
-from sklearn.linear_model import LinearRegression
+- [LightGBM Regression](test/support/python/lightgbm_regression.py)
+- [LightGBM Classification](test/support/python/lightgbm_classification.py)
+- [Linear Regression](test/support/python/linear_regression.py)
+- [Naive Bayes](test/support/python/naive_bayes.py)
-x = [1, 2, 3, 5, 6]
-y = [5 * xi + 3 for xi in x]
+### R
-model = LinearRegression()
-model.fit([[xi] for xi in x], y)
+To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package
-# save model
-sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+```r
+install.packages("pmml")
 ```
-For classification, run:
+And check out the examples:
-```python
-from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
-from sklearn.naive_bayes import GaussianNB
-x = [1, 2, 3, 5, 6]
-y = ["ham", "ham", "ham", "spam", "spam"]
-model = GaussianNB()
-model.fit([[xi] for xi in x], y)
-sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
-```
+- [Linear Regression](test/support/r/linear_regression.R)
+- [Naive Bayes](test/support/r/naive_bayes.R)
 ### Verifying
@@ -413,37 +304,58 @@ CSV.foreach("predictions.csv", headers: true, converters: :numeric) do |row|
 end
 ```
-## Database Storage
+## Data
-The database is another place you can store models. It’s good if you retrain models automatically.
+A number of data formats are supported. You can pass the target variable separately.
-> We recommend adding monitoring and guardrails as well if you retrain automatically
+```ruby
+x = [{x: 1}, {x: 2}, {x: 3}]
+y = [1, 2, 3]
+Eps::Model.new(x, y)
+```
-Create an ActiveRecord model to store the predictive model.
+Or pass arrays of arrays
-```sh
-rails g model Model key:string:uniq data:text
+```ruby
+x = [[1, 2], [2, 0], [3, 1]]
+y = [1, 2, 3]
+Eps::Model.new(x, y)
 ```
-Store the model with:
+### Daru
+Eps works well with Daru data frames.
 ```ruby
-store = Model.where(key: "price").first_or_initialize
-store.update(data: model.to_pmml)
+df = Daru::DataFrame.from_csv("houses.csv")
+Eps::Model.new(df, target: "price")
 ```
-Load the model with:
+### CSVs
+When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
 ```ruby
-data = Model.find_by!(key: "price").data
-model = Eps::Model.load_pmml(data)
+CSV.table("data.csv").map { |row| row.to_h }
 ```
-## Training Performance
+## Algorithms
-Speed up training on large datasets with GSL.
+Pass an algorithm with:
-First, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
+```ruby
+Eps::Model.new(data, algorithm: :linear_regression)
+```
+Eps supports:
+- LightGBM (default)
+- Linear Regression
+- Naive Bayes
+### Linear Regression
+To speed up training on large datasets with linear regression, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
 ```sh
 brew install gsl
@@ -457,65 +369,93 @@ gem 'gsl', group: :development
 It only needs to be available in environments used to build the model.
-> This only speeds up regression, not classification
+## Validation Options
-## Data
+Pass your own validation set with:
-A number of data formats are supported. You can pass the target variable separately.
+```ruby
+Eps::Model.new(data, validation_set: validation_set)
+```
+Split on a specific value
 ```ruby
-x = [{x: 1}, {x: 2}, {x: 3}]
-y = [1, 2, 3]
-Eps::Model.new(x, y)
+Eps::Model.new(data, split: {column: :listed_at, value: Date.parse("2019-01-01")})
 ```
-Or pass arrays of arrays
+Specify the validation set size (the default is `0.25`, which is 25%)
 ```ruby
-x = [[1, 2], [2, 0], [3, 1]]
-y = [1, 2, 3]
-Eps::Model.new(x, y)
+Eps::Model.new(data, split: {validation_size: 0.2})
 ```
-## Daru
+## Database Storage
-Eps works well with Daru data frames.
+The database is another place you can store models. It’s good if you retrain models automatically.
-```ruby
-df = Daru::DataFrame.from_csv("houses.csv")
-Eps::Model.new(df, target: "price")
+> We recommend adding monitoring and guardrails as well if you retrain automatically
+Create an ActiveRecord model to store the predictive model.
+```sh
+rails g model Model key:string:uniq data:text
 ```
-To split into training and test sets, use:
+Store the model with:
 ```ruby
-rng = Random.new(1) # seed random number generator
-train_index = houses.map { rng.rand < 0.7 }
-train_set = houses.where(train_index)
-test_set = houses.where(train_index.map { |v| !v })
+store = Model.where(key: "price").first_or_initialize
+store.update(data: model.to_pmml)
 ```
-## CSVs
-When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+Load the model with:
 ```ruby
-CSV.table("data.csv").map { |row| row.to_h }
+data = Model.find_by!(key: "price").data
+model = Eps::Model.load_pmml(data)
 ```
 ## Jupyter & IRuby
 You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://ankane.org/jupyter-rails).
-## Reference
+## Upgrading
+## 0.3.0
-Get an extended summary with standard error, t-values, and r-squared
+Eps 0.3.0 brings a number of improvements, including support for LightGBM and cross-validation. There are a number of breaking changes to be aware of:
-```ruby
-model.summary(extended: true)
-```
+- LightGBM is now the default for new models. On Mac, run:
-## Upgrading
+  ```sh
+  brew install libomp
+  ```
+  Pass the `algorithm` option to use linear regression or naive Bayes.
+  ```ruby
+  Eps::Model.new(data, algorithm: :linear_regression) # or :naive_bayes
+  ```
+- Cross-validation happens automatically by default. You no longer need to create training and test sets manually. If you were splitting on a time, use:
+  ```ruby
+  Eps::Model.new(data, split: {column: :listed_at, value: Date.parse("2019-01-01")})
+  ```
+  Or randomly, use:
+  ```ruby
+  Eps::Model.new(data, split: {validation_size: 0.3})
+  ```
+  To continue splitting manually, use:
+  ```ruby
+  Eps::Model.new(data, validation_set: test_set)
+  ```
+- It’s no longer possible to load models in JSON or PFA formats. Retrain models and save them as PMML.
 ## 0.2.0
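
Note (not part of the diff): the `split` options in the 0.3.0 upgrade notes above can be pictured in plain Ruby. The sketch only illustrates the idea of a value-based split versus a random hold-out of a given size; `split_by_value` and `split_randomly` are hypothetical helpers, the rows are made-up sample data, and Eps's internals may differ.

```ruby
require "date"

# Hypothetical: rows before `value` go to training, the rest to validation.
def split_by_value(rows, column:, value:)
  rows.partition { |row| row[column] < value }
end

# Hypothetical: hold out roughly `validation_size` of the rows at random.
# Returns [training_rows, validation_rows].
def split_randomly(rows, validation_size: 0.25, seed: 1)
  shuffled = rows.shuffle(random: Random.new(seed))
  n_validation = (rows.size * validation_size).round
  [shuffled.drop(n_validation), shuffled.take(n_validation)]
end

# Made-up sample rows for illustration only
rows = [
  {listed_at: Date.new(2018, 11, 5), price: 300_000},
  {listed_at: Date.new(2018, 12, 20), price: 320_000},
  {listed_at: Date.new(2019, 1, 15), price: 310_000},
  {listed_at: Date.new(2019, 2, 1), price: 330_000}
]
train, validation = split_by_value(rows, column: :listed_at, value: Date.parse("2019-01-01"))
```

Splitting on a time keeps the validation rows strictly later than the training rows, which is closer to how the model is used on future data.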