eps 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: af79060ff9164fdaa316b8d9543ef8093166be20009dc99509fdfd3c0e4e242d
-   data.tar.gz: 4e17cda8c33ba437652a00e85cd3630c9ce093a27d446f77b3b0711bf648c754
+   metadata.gz: 5ce1ae30056e3aaa08465d63571685ed2d43a17b40dc95bb2020d8d1f6752d6d
+   data.tar.gz: f6f9652d164991b82fed1375b9039baa6cbe3dd3b5413cdee244ba5aac923012
  SHA512:
-   metadata.gz: e762ad255036a669003845322ea399300c2d0d2d9987dede0545497ff83e9d7ede89c120eaebd7214f6df26eb8194019b85c889ac50dbe5607688ff4c07ddd04
-   data.tar.gz: a5a16216316c718f13366b2618b8d1a0b8d6ba47896877573d805817b35ce789b199342344a32dbdb4ac6706122344b265741d3e319ab76a55443ff69662f699
+   metadata.gz: bd9ddb1589b2866c42fda230389a27124285bd162ca9863528b41531d1c725fca3ac7327786b373e9cecda014d9cc2f34f791f203da77215f738cf92c261db61
+   data.tar.gz: 7d209acc346b223d00827c1df817a2017882b9446ff523f985e4c89b44f10f5a702d6c6dd957c8fb9ceccd83b98f7492aa15e029c1d0655787c1823348beacd0
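
Not part of the diff itself, but the digests above are what you would check a downloaded archive against. A minimal sketch using Ruby's stdlib `Digest` (the `verified?` helper and the file path are illustrative, not gem API):

```ruby
require "digest"

# Compare a downloaded file's SHA256 against a published checksum.
# Helper name and paths are illustrative only.
def verified?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256
end

# e.g. against the 0.2.0 data.tar.gz digest listed above:
# verified?("data.tar.gz",
#   "f6f9652d164991b82fed1375b9039baa6cbe3dd3b5413cdee244ba5aac923012")
```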
data/CHANGELOG.md CHANGED
@@ -1,3 +1,9 @@
+ ## 0.2.0
+
+ - Added support for classification
+ - Added `to_pmml` method
+ - Added `Eps::Base`
+
  ## 0.1.1

  - Huge performance boost
data/README.md CHANGED
@@ -1,13 +1,15 @@
  # Eps

- Linear regression for Ruby
+ Machine learning for Ruby

- - Build models quickly and easily
+ - Build predictive models quickly and easily
  - Serve models built in Ruby, Python, R, and more
- - Automatically handles categorical variables
- - No external dependencies
+ - Supports regression (linear regression) and classification (naive Bayes)
+ - Automatically handles categorical features
  - Works great with the SciRuby ecosystem (Daru & IRuby)

+ Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
+
  [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)

  ## Installation
@@ -31,7 +33,7 @@ data = [
    {bedrooms: 2, bathrooms: 2, price: 135000},
    {bedrooms: 3, bathrooms: 2, price: 162000}
  ]
- model = Eps::Regressor.new(data, target: :price)
+ model = Eps::Model.new(data, target: :price)
  puts model.summary
  ```

@@ -43,6 +45,8 @@ model.predict(bedrooms: 2, bathrooms: 1)

  > Pass an array of hashes to make multiple predictions at once

+ The target can be numeric (regression) or categorical (classification).
+
  ## Building Models

  ### Training and Test Sets
@@ -50,15 +54,15 @@ model.predict(bedrooms: 2, bathrooms: 1)
  When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.

  ```ruby
- rng = Random.new(1) # seed random number generator
- train_set, test_set = houses.partition { rng.rand < 0.7 }
+ split_date = Date.parse("2018-06-01")
+ train_set, test_set = houses.partition { |h| h.sold_at < split_date }
  ```

- If your data has a time associated with it, we recommend splitting on this.
+ If your data doesn’t have a time associated with it, you can split it randomly.

  ```ruby
- split_date = Date.parse("2018-06-01")
- train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+ rng = Random.new(1) # seed random number generator
+ train_set, test_set = houses.partition { rng.rand < 0.7 }
  ```

  ### Outliers and Missing Data
@@ -79,10 +83,10 @@ Selecting features for a model is extremely important for performance. Features

  > Categorical features generate coefficients for each distinct value except for one

- You should do this for any ids in your data.
+ Convert any ids to strings so they’re treated as categorical features.

  ```ruby
- {city_id: "123"}
+ {city_id: city_id.to_s}
  ```

  For times, create features like day of week and hour of day with:
@@ -122,11 +126,11 @@ train_target = train_set.map { |h| target(h) }
  Now, let’s train the model.

  ```ruby
- model = Eps::Regressor.new(train_features, train_target)
+ model = Eps::Model.new(train_features, train_target)
  puts model.summary
  ```

- The summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
+ For regression, the summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).

  ### Evaluation

@@ -138,7 +142,7 @@ test_target = test_set.map { |h| target(h) }
  model.evaluate(test_features, test_target)
  ```

- This returns:
+ For regression, this returns:

  - RMSE - Root mean square error
  - MAE - Mean absolute error
@@ -146,31 +150,43 @@ This returns:

  We want to minimize the RMSE and MAE and keep the ME around 0.

+ For classification, this returns:
+
+ - Accuracy
+
+ We want to maximize the accuracy.
+
  ### Finalize

- Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+ Now that we have an idea of how the model will perform, we want to retrain the model with all of our data. Treat outliers and missing data the same as you did with the training set.

  ```ruby
+ # outliers and missing data
+ houses.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+
+ # training
  all_features = houses.map { |h| features(h) }
  all_target = houses.map { |h| target(h) }
- model = Eps::Regressor.new(all_features, all_target)
+ model = Eps::Model.new(all_features, all_target)
  ```

  We now have a model that’s ready to serve.

  ## Serving Models

- Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:
+ Once the model is trained, we need to store it. Eps uses PMML - [Predictive Model Markup Language](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - a standard for storing models. A great option is to write the model to a file with:

  ```ruby
- model.dump
+ File.write("model.pmml", model.to_pmml)
  ```

- Then hardcode the result into your app.
+ > You may need to add `nokogiri` to your Gemfile
+
+ To load a model, use:

  ```ruby
- data = {:coefficients=>{:_intercept=>63500.0, :bedrooms=>26000.0, :bathrooms=>10000.0}}
- model = Eps::Regressor.load(data)
+ pmml = File.read("model.pmml")
+ model = Eps::Model.load_pmml(pmml)
  ```

  Now we can use it to make predictions.
@@ -179,57 +195,191 @@ Now we can use it to make predictions.
  model.predict(bedrooms: 2, bathrooms: 1)
  ```

- Another option that works well is writing the model to file in your app.
+ To continuously train models, we recommend [storing them in your database](#database-storage).
+
+ ## Full Example

- ```ruby
- json = model.to_json
- File.open("lib/models/housing_price.json", "w") { |f| f.write(json) }
+ We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.
+
+ In Rails, we recommend creating an `app/ml_models` directory. Be sure to restart Spring after creating the directory so files are autoloaded.
+
+ ```sh
+ bin/spring stop
  ```

- To load it, use:
+ Here’s what a complete model in `app/ml_models/price_model.rb` may look like:

  ```ruby
- json = File.read("lib/models/housing_price.json")
- model = Eps::Regressor.load_json(json)
- ```
+ class PriceModel < Eps::Base
+   def build
+     houses = House.all.to_a

- To continuously train models, we recommend [storing them in your database](#database-storage).
+     # divide into training and test set
+     split_date = Date.parse("2018-06-01")
+     train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+
+     # handle outliers and missing values
+     train_set = preprocess(train_set)

- ### Beyond Ruby
+     # train
+     train_features = train_set.map { |v| features(v) }
+     train_target = train_set.map { |v| target(v) }
+     model = Eps::Model.new(train_features, train_target)
+     puts model.summary

- Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language. Eps can load models in:
+     # evaluate
+     test_features = test_set.map { |v| features(v) }
+     test_target = test_set.map { |v| target(v) }
+     metrics = model.evaluate(test_features, test_target)
+     puts "Test RMSE: #{metrics[:rmse]}"
+     # for classification, use:
+     # puts "Test accuracy: #{metrics[:accuracy]}"

- JSON
+     # finalize
+     houses = preprocess(houses)
+     all_features = houses.map { |h| features(h) }
+     all_target = houses.map { |h| target(h) }
+     @model = Eps::Model.new(all_features, all_target)
+
+     # save
+     File.write(model_file, @model.to_pmml)
+   end
+
+   def predict(house)
+     model.predict(features(house))
+   end
+
+   private
+
+   def preprocess(train_set)
+     train_set.reject { |h| h.bedrooms.nil? || h.price < 10000 }
+   end
+
+   def features(house)
+     {
+       bedrooms: house.bedrooms,
+       city_id: house.city_id.to_s,
+       month: house.sold_at.strftime("%b")
+     }
+   end
+
+   def target(house)
+     house.price
+   end
+
+   def model
+     @model ||= Eps::Model.load_pmml(File.read(model_file))
+   end
+
+   def model_file
+     File.join(__dir__, "price_model.pmml")
+   end
+ end
+ ```
+
+ Build the model with:

  ```ruby
- data = File.read("model.json")
- model = Eps::Regressor.load_json(data)
+ PriceModel.build
  ```

- [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - Predictive Model Markup Language
+ This saves the model to `price_model.pmml`. Be sure to check this into source control.
+
+ Predict with:

  ```ruby
- data = File.read("model.pmml")
- model = Eps::Regressor.load_pmml(data)
+ PriceModel.predict(house)
  ```

- > Loading PMML requires Nokogiri to be installed
+ ## Monitoring

- [PFA](http://dmg.org/pfa/) - Portable Format for Analytics
+ We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:

  ```ruby
- data = File.read("model.pfa")
- model = Eps::Regressor.load_pfa(data)
+ actual = houses.map(&:price)
+ estimated = houses.map(&:estimated_price)
+ Eps.metrics(actual, estimated)
  ```

- Here are examples for how to dump models in each:
+ This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
+
+ ## Other Languages
+
+ Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language.

- - [R JSON](guides/Modeling.md#r-json)
- - [R PMML](guides/Modeling.md#r-pmml)
- - [R PFA](guides/Modeling.md#r-pfa)
- - [Python JSON](guides/Modeling.md#python-json)
- - [Python PMML](guides/Modeling.md#python-pmml)
- - [Python PFA](guides/Modeling.md#python-pfa)
+ Eps can serve linear regression and naive Bayes models. Check out [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
+
+ ### R
+
+ To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package:
+
+ ```r
+ install.packages("pmml")
+ ```
+
+ For regression, run:
+
+ ```r
+ library(pmml)
+
+ model <- lm(dist ~ speed, cars)
+
+ # save model
+ data <- toString(pmml(model))
+ write(data, file="model.pmml")
+ ```
+
+ For classification, run:
+
+ ```r
+ library(pmml)
+ library(e1071)
+
+ model <- naiveBayes(Species ~ ., iris)
+
+ # save model
+ data <- toString(pmml(model, predictedField="Species"))
+ write(data, file="model.pmml")
+ ```
+
+ ### Python
+
+ To create a model in Python, install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package:
+
+ ```sh
+ pip install sklearn2pmml
+ ```
+
+ For regression, run:
+
+ ```python
+ from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+ from sklearn.linear_model import LinearRegression
+
+ x = [1, 2, 3, 5, 6]
+ y = [5 * xi + 3 for xi in x]
+
+ model = LinearRegression()
+ model.fit([[xi] for xi in x], y)
+
+ # save model
+ sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+ ```
+
+ For classification, run:
+
+ ```python
+ from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+ from sklearn.naive_bayes import GaussianNB
+
+ x = [1, 2, 3, 5, 6]
+ y = ["ham", "ham", "ham", "spam", "spam"]
+
+ model = GaussianNB()
+ model.fit([[xi] for xi in x], y)
+
+ sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+ ```

  ### Verifying
 
@@ -244,26 +394,25 @@ house_id | prediction
  Once the model is implemented in Ruby, confirm the predictions match.

  ```ruby
- model = Eps::Regressor.load_json("model.json")
+ model = Eps::Model.load_pmml("model.pmml")

  # preload houses to prevent n+1
  houses = House.all.index_by(&:id)

- CSV.foreach("predictions.csv", headers: true) do |row|
-   house = houses[row["house_id"].to_i]
-   expected = row["prediction"].to_f
+ CSV.foreach("predictions.csv", headers: true, converters: :numeric) do |row|
+   house = houses[row["house_id"]]
+   expected = row["prediction"]

    actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)

-   unless (actual - expected).abs < 0.001
-     raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})"
-   end
+   success = actual.is_a?(String) ? actual == expected : (actual - expected).abs < 0.001
+   raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})" unless success

    putc "✓"
  end
  ```

- ### Database Storage
+ ## Database Storage

  The database is another place you can store models. It’s good if you retrain models automatically.

@@ -278,29 +427,17 @@ rails g model Model key:string:uniq data:text
  Store the model with:

  ```ruby
- store = Model.where(key: "housing_price").first_or_initialize
- store.update(data: model.to_json)
+ store = Model.where(key: "price").first_or_initialize
+ store.update(data: model.to_pmml)
  ```

  Load the model with:

  ```ruby
- data = Model.find_by!(key: "housing_price").data
- model = Eps::Regressor.load_json(data)
- ```
-
- ## Monitoring
-
- We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
-
- ```ruby
- actual = houses.map(&:price)
- estimated = houses.map(&:estimated_price)
- Eps.metrics(actual, estimated)
+ data = Model.find_by!(key: "price").data
+ model = Eps::Model.load_pmml(data)
  ```

- This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0.
-
  ## Training Performance

  Speed up training on large datasets with GSL.
@@ -319,6 +456,8 @@ gem 'gsl', group: :development

  It only needs to be available in environments used to build the model.

+ > This only speeds up regression, not classification
+
  ## Data

  A number of data formats are supported. You can pass the target variable separately.
@@ -326,7 +465,7 @@ A number of data formats are supported. You can pass the target variable separat
  ```ruby
  x = [{x: 1}, {x: 2}, {x: 3}]
  y = [1, 2, 3]
- Eps::Regressor.new(x, y)
+ Eps::Model.new(x, y)
  ```

  Or pass arrays of arrays
@@ -334,7 +473,7 @@ Or pass arrays of arrays
  ```ruby
  x = [[1, 2], [2, 0], [3, 1]]
  y = [1, 2, 3]
- Eps::Regressor.new(x, y)
+ Eps::Model.new(x, y)
  ```

  ## Daru
@@ -343,7 +482,7 @@ Eps works well with Daru data frames.

  ```ruby
  df = Daru::DataFrame.from_csv("houses.csv")
- Eps::Regressor.new(df, target: "price")
+ Eps::Model.new(df, target: "price")
  ```

  To split into training and test sets, use:
@@ -365,22 +504,34 @@ CSV.table("data.csv").map { |row| row.to_h }

  ## Jupyter & IRuby

- You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).
+ You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://ankane.org/jupyter-rails).

  ## Reference

- Get coefficients
-
- ```ruby
- model.coefficients
- ```
-
  Get an extended summary with standard error, t-values, and r-squared

  ```ruby
  model.summary(extended: true)
  ```

+ ## Upgrading
+
+ ### 0.2.0
+
+ Eps 0.2.0 brings a number of improvements, including support for classification.
+
+ We recommend:
+
+ 1. Changing `Eps::Regressor` to `Eps::Model`
+ 2. Converting models from JSON to PMML
+
+ ```ruby
+ model = Eps::Model.load_json("model.json")
+ File.write("model.pmml", model.to_pmml)
+ ```
+
+ 3. Renaming `app/stats_models` to `app/ml_models`
+
  ## History

  View the [changelog](https://github.com/ankane/eps/blob/master/CHANGELOG.md)
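
As an aside, the seeded random train/test split recommended in the README changes above is plain Ruby and easy to try on its own. A minimal sketch with stdlib only; the `houses` records here are made-up sample data:

```ruby
# Reproducible 70/30 split, per the README's advice for data without
# a natural time ordering. Sample records are illustrative only.
houses = (1..10).map { |i| { id: i, price: 100_000 + i * 5_000 } }

rng = Random.new(1) # fixed seed => identical split on every run
train_set, test_set = houses.partition { rng.rand < 0.7 }

puts "train: #{train_set.size}, test: #{test_set.size}"
```

Seeding the generator matters: it makes the split reproducible across runs, so evaluation numbers are comparable while you iterate on features.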