eps 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/README.md +235 -84
- data/lib/eps.rb +9 -4
- data/lib/eps/base.rb +19 -0
- data/lib/eps/base_estimator.rb +84 -0
- data/lib/eps/linear_regression.rb +558 -0
- data/lib/eps/model.rb +108 -0
- data/lib/eps/naive_bayes.rb +240 -0
- data/lib/eps/version.rb +1 -1
- metadata +13 -18
- data/.gitignore +0 -9
- data/.travis.yml +0 -15
- data/Gemfile +0 -11
- data/Rakefile +0 -34
- data/eps.gemspec +0 -30
- data/guides/Modeling.md +0 -152
- data/lib/eps/base_regressor.rb +0 -232
- data/lib/eps/metrics.rb +0 -35
- data/lib/eps/regressor.rb +0 -314
checksums.yaml CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5ce1ae30056e3aaa08465d63571685ed2d43a17b40dc95bb2020d8d1f6752d6d
+  data.tar.gz: f6f9652d164991b82fed1375b9039baa6cbe3dd3b5413cdee244ba5aac923012
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bd9ddb1589b2866c42fda230389a27124285bd162ca9863528b41531d1c725fca3ac7327786b373e9cecda014d9cc2f34f791f203da77215f738cf92c261db61
+  data.tar.gz: 7d209acc346b223d00827c1df817a2017882b9446ff523f985e4c89b44f10f5a702d6c6dd957c8fb9ceccd83b98f7492aa15e029c1d0655787c1823348beacd0
```
data/CHANGELOG.md CHANGED
data/README.md CHANGED
```diff
@@ -1,13 +1,15 @@
 # Eps
 
-
+Machine learning for Ruby
 
-- Build models quickly and easily
+- Build predictive models quickly and easily
 - Serve models built in Ruby, Python, R, and more
-
-
+- Supports regression (linear regression) and classification (naive Bayes)
+- Automatically handles categorical features
 - Works great with the SciRuby ecosystem (Daru & IRuby)
 
+Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
+
 [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
 
 ## Installation
```
````diff
@@ -31,7 +33,7 @@ data = [
   {bedrooms: 2, bathrooms: 2, price: 135000},
   {bedrooms: 3, bathrooms: 2, price: 162000}
 ]
-model = Eps::
+model = Eps::Model.new(data, target: :price)
 puts model.summary
 ```
 
````
```diff
@@ -43,6 +45,8 @@ model.predict(bedrooms: 2, bathrooms: 1)
 
 > Pass an array of hashes to make multiple predictions at once
 
+The target can be numeric (regression) or categorical (classification).
+
 ## Building Models
 
 ### Training and Test Sets
```
````diff
@@ -50,15 +54,15 @@ model.predict(bedrooms: 2, bathrooms: 1)
 When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.
 
 ```ruby
-
-train_set, test_set = houses.partition {
+split_date = Date.parse("2018-06-01")
+train_set, test_set = houses.partition { |h| h.sold_at < split_date }
 ```
 
-If your data
+If your data doesn’t have a time associated with it, you can split it randomly.
 
 ```ruby
-
-train_set, test_set = houses.partition {
+rng = Random.new(1) # seed random number generator
+train_set, test_set = houses.partition { rng.rand < 0.7 }
 ```
 
 ### Outliers and Missing Data
````
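As a plain-Ruby illustration of the random split added in the hunk above (the `houses` array here is hypothetical sample data, not part of the diff):

```ruby
# Hypothetical sample data standing in for ActiveRecord records
houses = (1..10).map { |i| { id: i, price: 100_000 + i * 1_000 } }

# A seeded generator makes the split reproducible across runs
rng = Random.new(1)
train_set, test_set = houses.partition { rng.rand < 0.7 }

# Every record ends up in exactly one of the two sets
puts train_set.size + test_set.size == houses.size # => true
```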
````diff
@@ -79,10 +83,10 @@ Selecting features for a model is extremely important for performance. Features
 
 > Categorical features generate coefficients for each distinct value except for one
 
-
+Convert any ids to strings so they’re treated as categorical features.
 
 ```ruby
-{city_id:
+{city_id: city_id.to_s}
 ```
 
 For times, create features like day of week and hour of day with:
````
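The feature advice above (string ids, day-of-week and hour-of-day features) can be sketched with the Ruby standard library alone; the timestamp and `city_id` below are made-up example values:

```ruby
require "time"

# Made-up example values
sold_at = Time.parse("2018-06-01 14:30:00")
city_id = 42

features = {
  day_of_week: sold_at.strftime("%a"), # "Fri"
  hour_of_day: sold_at.hour,           # 14
  month: sold_at.strftime("%b"),       # "Jun"
  city_id: city_id.to_s                # id as a string, so it's treated as categorical
}
```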
````diff
@@ -122,11 +126,11 @@ train_target = train_set.map { |h| target(h) }
 Now, let’s train the model.
 
 ```ruby
-model = Eps::
+model = Eps::Model.new(train_features, train_target)
 puts model.summary
 ```
 
-
+For regression, the summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
 
 ### Evaluation
 
````
````diff
@@ -138,7 +142,7 @@ test_target = test_set.map { |h| target(h) }
 model.evaluate(test_features, test_target)
 ```
 
-
+For regression, this returns:
 
 - RMSE - Root mean square error
 - MAE - Mean absolute error
````
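For reference, the metrics named above can be computed by hand in a few lines of plain Ruby; the `actual`/`estimated` arrays are invented sample values, and the sign convention for the mean error is an assumption here (the gem computes all of these via `model.evaluate`):

```ruby
# Invented sample values
actual    = [100.0, 200.0, 300.0]
estimated = [110.0, 190.0, 310.0]

# Error per prediction (estimated minus actual is an assumed convention)
errors = actual.zip(estimated).map { |a, e| e - a }

me   = errors.sum / errors.size                          # mean error
mae  = errors.sum { |e| e.abs } / errors.size            # mean absolute error
rmse = Math.sqrt(errors.sum { |e| e * e } / errors.size) # root mean square error

puts mae  # => 10.0
puts rmse # => 10.0
```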
````diff
@@ -146,31 +150,43 @@ This returns:
 
 We want to minimize the RMSE and MAE and keep the ME around 0.
 
+For classification, this returns:
+
+- Accuracy
+
+We want to maximize the accuracy.
+
 ### Finalize
 
-Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+Now that we have an idea of how the model will perform, we want to retrain the model with all of our data. Treat outliers and missing data the same as you did with the training set.
 
 ```ruby
+# outliers and missing data
+houses.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+
+# training
 all_features = houses.map { |h| features(h) }
 all_target = houses.map { |h| target(h) }
-model = Eps::
+model = Eps::Model.new(all_features, all_target)
 ```
 
 We now have a model that’s ready to serve.
 
 ## Serving Models
 
-Once the model is trained,
+Once the model is trained, we need to store it. Eps uses PMML - [Predictive Model Markup Language](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - a standard for storing models. A great option is to write the model to a file with:
 
 ```ruby
-model.
+File.write("model.pmml", model.to_pmml)
 ```
 
-
+> You may need to add `nokogiri` to your Gemfile
+
+To load a model, use:
 
 ```ruby
-
-model = Eps::
+pmml = File.read("model.pmml")
+model = Eps::Model.load_pmml(pmml)
 ```
 
 Now we can use it to make predictions.
````
````diff
@@ -179,57 +195,191 @@ Now we can use it to make predictions.
 model.predict(bedrooms: 2, bathrooms: 1)
 ```
 
-
+To continuously train models, we recommend [storing them in your database](#database-storage).
+
+## Full Example
 
-
-
-
+We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.
+
+In Rails, we recommend creating an `app/ml_models` directory. Be sure to restart Spring after creating the directory so files are autoloaded.
+
+```sh
+bin/spring stop
 ```
 
-
+Here’s what a complete model in `app/ml_models/price_model.rb` may look like:
 
 ```ruby
-
-
-
+class PriceModel < Eps::Base
+  def build
+    houses = House.all.to_a
 
-
+    # divide into training and test set
+    split_date = Date.parse("2018-06-01")
+    train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+
+    # handle outliers and missing values
+    train_set = preprocess(train_set)
 
-
+    # train
+    train_features = train_set.map { |v| features(v) }
+    train_target = train_set.map { |v| target(v) }
+    model = Eps::Model.new(train_features, train_target)
+    puts model.summary
 
-
+    # evaluate
+    test_features = test_set.map { |v| features(v) }
+    test_target = test_set.map { |v| target(v) }
+    metrics = model.evaluate(test_features, test_target)
+    puts "Test RMSE: #{metrics[:rmse]}"
+    # for classification, use:
+    # puts "Test accuracy: #{metrics[:accuracy]}"
 
-
+    # finalize
+    houses = preprocess(houses)
+    all_features = houses.map { |h| features(h) }
+    all_target = houses.map { |h| target(h) }
+    @model = Eps::Model.new(all_features, all_target)
+
+    # save
+    File.write(model_file, @model.to_pmml)
+  end
+
+  def predict(house)
+    model.predict(features(house))
+  end
+
+  private
+
+  def preprocess(train_set)
+    train_set.reject { |h| h.bedrooms.nil? || h.price < 10000 }
+  end
+
+  def features(house)
+    {
+      bedrooms: house.bedrooms,
+      city_id: house.city_id.to_s,
+      month: house.sold_at.strftime("%b")
+    }
+  end
+
+  def target(house)
+    house.price
+  end
+
+  def model
+    @model ||= Eps::Model.load_pmml(File.read(model_file))
+  end
+
+  def model_file
+    File.join(__dir__, "price_model.pmml")
+  end
+end
+```
+
+Build the model with:
 
 ```ruby
-
-model = Eps::Regressor.load_json(data)
+PriceModel.build
 ```
 
-
+This saves the model to `price_model.pmml`. Be sure to check this into source control.
+
+Predict with:
 
 ```ruby
-
-model = Eps::Regressor.load_pmml(data)
+PriceModel.predict(house)
 ```
 
-
+## Monitoring
 
-
+We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
 
 ```ruby
-
-
+actual = houses.map(&:price)
+estimated = houses.map(&:estimated_price)
+Eps.metrics(actual, estimated)
 ```
 
-
+This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
+
+## Other Languages
+
+Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language.
 
-
-
-
-
-
-
+Eps can serve linear regression and naive Bayes models. Check out [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
+
+### R
+
+To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package:
+
+```r
+install.packages("pmml")
+```
+
+For regression, run:
+
+```r
+library(pmml)
+
+model <- lm(dist ~ speed, cars)
+
+# save model
+data <- toString(pmml(model))
+write(data, file="model.pmml")
+```
+
+For classification, run:
+
+```r
+library(pmml)
+library(e1071)
+
+model <- naiveBayes(Species ~ ., iris)
+
+# save model
+data <- toString(pmml(model, predictedField="Species"))
+write(data, file="model.pmml")
+```
+
+### Python
+
+To create a model in Python, install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package:
+
+```sh
+pip install sklearn2pmml
+```
+
+For regression, run:
+
+```python
+from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+from sklearn.linear_model import LinearRegression
+
+x = [1, 2, 3, 5, 6]
+y = [5 * xi + 3 for xi in x]
+
+model = LinearRegression()
+model.fit([[xi] for xi in x], y)
+
+# save model
+sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+```
+
+For classification, run:
+
+```python
+from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
+from sklearn.naive_bayes import GaussianNB
+
+x = [1, 2, 3, 5, 6]
+y = ["ham", "ham", "ham", "spam", "spam"]
+
+model = GaussianNB()
+model.fit([[xi] for xi in x], y)
+
+sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+```
 
 ### Verifying
 
````
````diff
@@ -244,26 +394,25 @@ house_id | prediction
 Once the model is implemented in Ruby, confirm the predictions match.
 
 ```ruby
-model = Eps::
+model = Eps::Model.load_pmml("model.pmml")
 
 # preload houses to prevent n+1
 houses = House.all.index_by(&:id)
 
-CSV.foreach("predictions.csv", headers: true) do |row|
-house = houses[row["house_id"]
-expected = row["prediction"]
+CSV.foreach("predictions.csv", headers: true, converters: :numeric) do |row|
+  house = houses[row["house_id"]]
+  expected = row["prediction"]
 
   actual = model.predict(bedrooms: house.bedrooms, bathrooms: house.bathrooms)
 
-
-
-end
+  success = actual.is_a?(String) ? actual == expected : (actual - expected).abs < 0.001
+  raise "Bad prediction for house #{house.id} (exp: #{expected}, act: #{actual})" unless success
 
   putc "✓"
 end
 ```
 
-
+## Database Storage
 
 The database is another place you can store models. It’s good if you retrain models automatically.
 
````
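The verification loop in the hunk above can be exercised without the gem or a database; the `predict` lambda and in-memory `houses` hash below are hypothetical stand-ins for the real model and ActiveRecord objects:

```ruby
require "csv"
require "tmpdir"

# Hypothetical predictions file, as exported from another language
path = File.join(Dir.mktmpdir, "predictions.csv")
File.write(path, "house_id,prediction\n1,150000\n2,135000\n")

# Stand-ins for ActiveRecord records and the loaded model
houses  = { 1 => { id: 1, bedrooms: 3 }, 2 => { id: 2, bedrooms: 2 } }
predict = ->(house) { house[:bedrooms] == 3 ? 150_000 : 135_000 }

CSV.foreach(path, headers: true, converters: :numeric) do |row|
  house = houses[row["house_id"]] # :numeric converts the id to an Integer
  expected = row["prediction"]

  actual = predict.call(house)

  success = actual.is_a?(String) ? actual == expected : (actual - expected).abs < 0.001
  raise "Bad prediction for house #{house[:id]} (exp: #{expected}, act: #{actual})" unless success
  print "✓"
end
```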
````diff
@@ -278,29 +427,17 @@ rails g model Model key:string:uniq data:text
 Store the model with:
 
 ```ruby
-store = Model.where(key: "
-store.update(data: model.
+store = Model.where(key: "price").first_or_initialize
+store.update(data: model.to_pmml)
 ```
 
 Load the model with:
 
 ```ruby
-data = Model.find_by!(key: "
-model = Eps::
-```
-
-## Monitoring
-
-We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
-
-```ruby
-actual = houses.map(&:price)
-estimated = houses.map(&:estimated_price)
-Eps.metrics(actual, estimated)
+data = Model.find_by!(key: "price").data
+model = Eps::Model.load_pmml(data)
 ```
 
-This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0.
-
 ## Training Performance
 
 Speed up training on large datasets with GSL.
````
```diff
@@ -319,6 +456,8 @@ gem 'gsl', group: :development
 
 It only needs to be available in environments used to build the model.
 
+> This only speeds up regression, not classification
+
 ## Data
 
 A number of data formats are supported. You can pass the target variable separately.
```
````diff
@@ -326,7 +465,7 @@ A number of data formats are supported. You can pass the target variable separately.
 ```ruby
 x = [{x: 1}, {x: 2}, {x: 3}]
 y = [1, 2, 3]
-Eps::
+Eps::Model.new(x, y)
 ```
 
 Or pass arrays of arrays
````
````diff
@@ -334,7 +473,7 @@ Or pass arrays of arrays
 ```ruby
 x = [[1, 2], [2, 0], [3, 1]]
 y = [1, 2, 3]
-Eps::
+Eps::Model.new(x, y)
 ```
 
 ## Daru
````
````diff
@@ -343,7 +482,7 @@ Eps works well with Daru data frames.
 
 ```ruby
 df = Daru::DataFrame.from_csv("houses.csv")
-Eps::
+Eps::Model.new(df, target: "price")
 ```
 
 To split into training and test sets, use:
````
````diff
@@ -365,22 +504,34 @@ CSV.table("data.csv").map { |row| row.to_h }
 
 ## Jupyter & IRuby
 
-You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://
+You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://ankane.org/jupyter-rails).
 
 ## Reference
 
-Get coefficients
-
-```ruby
-model.coefficients
-```
-
 Get an extended summary with standard error, t-values, and r-squared
 
 ```ruby
 model.summary(extended: true)
 ```
 
+## Upgrading
+
+### 0.2.0
+
+Eps 0.2.0 brings a number of improvements, including support for classification.
+
+We recommend:
+
+1. Changing `Eps::Regressor` to `Eps::Model`
+2. Converting models from JSON to PMML
+
+```ruby
+model = Eps::Model.load_json(File.read("model.json"))
+File.write("model.pmml", model.to_pmml)
+```
+
+3. Renaming `app/stats_models` to `app/ml_models`
+
 ## History
 
 View the [changelog](https://github.com/ankane/eps/blob/master/CHANGELOG.md)
````