eps 0.3.1 → 0.3.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +23 -0
- data/README.md +42 -14
- data/lib/eps/base_estimator.rb +57 -35
- data/lib/eps/data_frame.rb +12 -2
- data/lib/eps/evaluators/lightgbm.rb +22 -15
- data/lib/eps/evaluators/linear_regression.rb +6 -3
- data/lib/eps/evaluators/naive_bayes.rb +9 -7
- data/lib/eps/lightgbm.rb +19 -11
- data/lib/eps/linear_regression.rb +29 -6
- data/lib/eps/naive_bayes.rb +1 -1
- data/lib/eps/pmml/generator.rb +4 -4
- data/lib/eps/version.rb +1 -1
- metadata +31 -3
checksums.yaml
CHANGED
````diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d56573908e892d8d1959d66c7b6f2940f8930a2d0f2dfd5d4da75e2ff7cfdb63
+  data.tar.gz: 9eaf1a06c8c51ba15d9b4468796fc869f2933945494d027b54789304080c5d5b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 971dbd2a95a280ed50925df68a29018ba7b3bccb7094b1374923a8ce7d100720202245843e003b26447832e9c1f8285bafcc7692020f5971a56c0a8e89a12afb
+  data.tar.gz: de06585dc75608b0f8c62188cce351987a0cd53f3b12889d4d63de28ed81ae1b143e31f47ac8c53083eeb250e18c5f8b721fff94a378e14203fd8fa90ba3e440
````
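The SHA256 and SHA512 entries above are hex digests of the gem's `metadata.gz` and `data.tar.gz` archives. A minimal sketch of producing such digests with Ruby's standard `digest` library (the string below is a stand-in for the real archive bytes):

```ruby
require "digest"

# Stand-in for the bytes of metadata.gz / data.tar.gz
bytes = "example archive contents"

sha256 = Digest::SHA256.hexdigest(bytes)
sha512 = Digest::SHA512.hexdigest(bytes)

# Hex digests are 64 chars for SHA256 and 128 for SHA512
puts sha256.length  # 64
puts sha512.length  # 128
```

Comparing digests like these against the checksums file is how the registry detects a corrupted or tampered package.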
data/CHANGELOG.md
CHANGED
````diff
@@ -1,7 +1,30 @@
+## 0.3.6 (2020-06-19)
+
+- Fixed error with text features for LightGBM
+
+## 0.3.5 (2020-06-10)
+
+- Added `learning_rate` option for LightGBM
+- Added support for Numo and Rover
+
+## 0.3.4 (2020-04-05)
+
+- Added `predict_probability` for classification
+
+## 0.3.3 (2020-02-24)
+
+- Fixed errors and incorrect predictions with boolean columns
+- Fixed deprecation warnings in Ruby 2.7
+
+## 0.3.2 (2019-12-08)
+
+- Added support for GSLR
+
 ## 0.3.1 (2019-12-06)
 
 - Added `weight` option for LightGBM and linear regression
 - Added `intercept` option for linear regression
+- Added LightGBM evaluator safety check
 - Fixed `Unknown label` error for LightGBM
 - Fixed error message for unstable solutions with linear regression
 
````
data/README.md
CHANGED
````diff
@@ -4,7 +4,6 @@ Machine learning for Ruby
 
 - Build predictive models quickly and easily
 - Serve models built in Ruby, Python, R, and more
-- No prior knowledge of machine learning required :tada:
 
 Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
 
@@ -314,7 +313,7 @@ y = [1, 2, 3]
 Eps::Model.new(x, y)
 ```
 
-
+Data can be an array of arrays
 
 ```ruby
 x = [[1, 2], [2, 0], [3, 1]]
@@ -322,18 +321,29 @@ y = [1, 2, 3]
 Eps::Model.new(x, y)
 ```
 
-
+Or Numo arrays
 
-
+```ruby
+x = Numo::NArray.cast([[1, 2], [2, 0], [3, 1]])
+y = Numo::NArray.cast([1, 2, 3])
+Eps::Model.new(x, y)
+```
+
+Or a Rover data frame
 
 ```ruby
-df =
+df = Rover.read_csv("houses.csv")
 Eps::Model.new(df, target: "price")
 ```
 
-
+Or a Daru data frame
 
-
+```ruby
+df = Daru::DataFrame.from_csv("houses.csv")
+Eps::Model.new(df, target: "price")
+```
+
+When reading CSV files directly, be sure to convert numeric fields. The `table` method does this automatically.
 
 ```ruby
 CSV.table("data.csv").map { |row| row.to_h }
@@ -353,9 +363,23 @@ Eps supports:
 - Linear Regression
 - Naive Bayes
 
+### LightGBM
+
+Pass the learning rate with:
+
+```ruby
+Eps::Model.new(data, learning_rate: 0.01)
+```
+
 ### Linear Regression
 
-
+By default, an intercept is included. Disable this with:
+
+```ruby
+Eps::Model.new(data, intercept: false)
+```
+
+To speed up training on large datasets with linear regression, [install GSL](https://github.com/ankane/gslr#gsl-installation). With Homebrew, you can use:
 
 ```sh
 brew install gsl
@@ -364,17 +388,21 @@ brew install gsl
 Then, add this line to your application’s Gemfile:
 
 ```ruby
-gem '
+gem 'gslr', group: :development
 ```
 
 It only needs to be available in environments used to build the model.
 
-
+## Probability
+
+To get the probability of each category for predictions with classification, use:
 
 ```ruby
-
+model.predict_probability(data)
 ```
 
+Naive Bayes is known to produce poor probability estimates, so stick with LightGBM if you need this.
+
 ## Validation Options
 
 Pass your own validation set with:
@@ -410,7 +438,7 @@ The database is another place you can store models. It’s good if you retrain m
 Create an ActiveRecord model to store the predictive model.
 
 ```sh
-rails
+rails generate model Model key:string:uniq data:text
 ```
 
 Store the model with:
@@ -520,11 +548,11 @@ Everyone is encouraged to help improve this project. Here are a few ways you can
 - Write, clarify, or fix documentation
 - Suggest or add new features
 
-To get started with development
+To get started with development:
 
 ```sh
 git clone https://github.com/ankane/eps.git
 cd eps
 bundle install
-rake test
+bundle exec rake test
 ```
````
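The README changes note that numeric fields should be converted when reading CSV files directly, and that `CSV.table` does this automatically. A small self-contained sketch of the same conversion using `CSV.parse` with the stdlib's `:numeric` converter (inline data stands in for `data.csv`):

```ruby
require "csv"

csv = "x,y\n1,2.5\n3,4.0\n"

# headers: true keeps column names; converters: :numeric turns "1" into 1
# and "2.5" into 2.5, mirroring what CSV.table does (CSV.table also
# symbolizes the headers)
rows = CSV.parse(csv, headers: true, converters: :numeric).map(&:to_h)

rows.first  # {"x" => 1, "y" => 2.5}
```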
data/lib/eps/base_estimator.rb
CHANGED
````diff
@@ -2,33 +2,18 @@ module Eps
   class BaseEstimator
     def initialize(data = nil, y = nil, **options)
       @options = options.dup
+      @trained = false
+      @text_encoders = {}
       # TODO better pattern - don't pass most options to train
-      options.delete(:intercept)
       train(data, y, **options) if data
     end
 
     def predict(data)
-
-
-
-      data = Eps::DataFrame.new(data)
-
-      @evaluator.features.each do |k, type|
-        values = data.columns[k]
-        raise ArgumentError, "Missing column: #{k}" if !values
-        column_type = Utils.column_type(values.compact, k) if values
-
-        if !column_type.nil?
-          if (type == "numeric" && column_type != "numeric") || (type != "numeric" && column_type != "categorical")
-            raise ArgumentError, "Bad type for column #{k}: Expected #{type} but got #{column_type}"
-          end
-        end
-        # TODO check for unknown values for categorical features
-      end
-
-      predictions = @evaluator.predict(data)
+      _predict(data, false)
+    end
 
-
+    def predict_probability(data)
+      _predict(data, true)
     end
 
     def evaluate(data, y = nil, target: nil, weight: nil)
@@ -48,6 +33,8 @@ module Eps
     end
 
     def summary(extended: false)
+      raise "Summary not available for loaded models" unless @trained
+
      str = String.new("")
 
       if @validation_set
@@ -72,7 +59,31 @@ module Eps
 
     private
 
-    def
+    def _predict(data, probabilities)
+      singular = data.is_a?(Hash)
+      data = [data] if singular
+
+      data = Eps::DataFrame.new(data)
+
+      @evaluator.features.each do |k, type|
+        values = data.columns[k]
+        raise ArgumentError, "Missing column: #{k}" if !values
+        column_type = Utils.column_type(values.compact, k) if values
+
+        if !column_type.nil?
+          if (type == "numeric" && column_type != "numeric") || (type != "numeric" && column_type != "categorical")
+            raise ArgumentError, "Bad type for column #{k}: Expected #{type} but got #{column_type}"
+          end
+        end
+        # TODO check for unknown values for categorical features
+      end
+
+      predictions = @evaluator.predict(data, probabilities: probabilities)
+
+      singular ? predictions.first : predictions
+    end
+
     def train(data, y = nil, target: nil, weight: nil, split: nil, validation_set: nil, text_features: nil, **options)
       data, @target = prep_data(data, y, target, weight)
       @target_type = Utils.column_type(data.label, @target)
@@ -164,11 +175,13 @@ module Eps
       raise "No data in validation set" if validation_set && validation_set.empty?
 
       @validation_set = validation_set
-      @evaluator = _train(
+      @evaluator = _train(**options)
 
       # reset pmml
       @pmml = nil
 
+      @trained = true
+
       nil
     end
 
@@ -197,29 +210,38 @@ module Eps
       [data, target]
     end
 
-    def prep_text_features(train_set)
-      @text_encoders = {}
+    def prep_text_features(train_set, fit: true)
       @text_features.each do |k, v|
-
-
+        if fit
+          # reset vocabulary
+          v.delete(:vocabulary)
+
+          # TODO determine max features automatically
+          # start based on number of rows
+          encoder = Eps::TextEncoder.new(**v)
+          counts = encoder.fit(train_set.columns.delete(k))
+        else
+          encoder = @text_encoders[k]
+          counts = encoder.transform(train_set.columns.delete(k))
+        end
 
-        # TODO determine max features automatically
-        # start based on number of rows
-        encoder = Eps::TextEncoder.new(v)
-        counts = encoder.fit(train_set.columns.delete(k))
         encoder.vocabulary.each do |word|
           train_set.columns[[k, word]] = [0] * counts.size
         end
+
         counts.each_with_index do |ci, i|
           ci.each do |word, count|
             word_key = [k, word]
             train_set.columns[word_key][i] = 1 if train_set.columns.key?(word_key)
           end
         end
-        @text_encoders[k] = encoder
 
-
-
+        if fit
+          @text_encoders[k] = encoder
+
+          # update vocabulary
+          v[:vocabulary] = encoder.vocabulary
+        end
       end
 
       raise "No features left" if train_set.columns.empty?
@@ -233,7 +255,7 @@ module Eps
 
     def check_missing(c, name)
       raise ArgumentError, "Missing column: #{name}" if !c
-      raise ArgumentError, "Missing values in column #{name}" if c.any?(&:nil?)
+      raise ArgumentError, "Missing values in column #{name}" if c.to_a.any?(&:nil?)
     end
 
     def check_missing_value(df)
````
data/lib/eps/data_frame.rb
CHANGED
````diff
@@ -10,7 +10,7 @@ module Eps
       data.columns.each do |k, v|
         @columns[k] = v
       end
-    elsif daru?(data)
+    elsif rover?(data) || daru?(data)
       data.to_h.each do |k, v|
         @columns[k.to_s] = v.to_a
       end
@@ -19,6 +19,8 @@ module Eps
         @columns[k.to_s] = v.to_a
       end
     else
+      data = data.to_a if numo?(data)
+
       if data.any?
         row = data[0]
 
@@ -140,8 +142,16 @@ module Eps
 
     private
 
+    def numo?(x)
+      defined?(Numo::NArray) && x.is_a?(Numo::NArray)
+    end
+
+    def rover?(x)
+      defined?(Rover::DataFrame) && x.is_a?(Rover::DataFrame)
+    end
+
     def daru?(x)
-      defined?(Daru) && x.is_a?(Daru::DataFrame)
+      defined?(Daru::DataFrame) && x.is_a?(Daru::DataFrame)
     end
   end
 end
````
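Note the guard change from `defined?(Daru)` to `defined?(Daru::DataFrame)`, and the same pattern in the new `numo?` and `rover?` helpers: checking the constant before calling `is_a?` makes the predicate safely falsy when the optional dependency isn't loaded, instead of raising `NameError`. A sketch of the pattern in isolation:

```ruby
# Safe type check for an optional dependency: if the constant isn't
# defined, defined? returns nil and the is_a? call never runs
def daru?(x)
  defined?(Daru::DataFrame) && x.is_a?(Daru::DataFrame)
end

daru?([1, 2, 3])  # falsy here (Daru isn't loaded), and no NameError
```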
data/lib/eps/evaluators/lightgbm.rb
CHANGED

````diff
@@ -11,19 +11,15 @@ module Eps
       @text_features = text_features
     end
 
-    def predict(data)
+    def predict(data, probabilities: false)
+      raise "Probabilities not supported" if probabilities && @objective == "regression"
+
       rows = data.map(&:to_h)
 
       # sparse matrix
       @text_features.each do |k, v|
-        encoder = TextEncoder.new(v)
-
-        values = data.columns.delete(k)
-        counts = encoder.transform(values)
-
-        encoder.vocabulary.each do |word|
-          data.columns[[k, word]] = [0] * values.size
-        end
+        encoder = TextEncoder.new(**v)
+        counts = encoder.transform(data.columns[k])
 
         counts.each_with_index do |xc, i|
           row = rows[i]
@@ -38,17 +34,28 @@ module Eps
       when "regression"
         sum_trees(rows, @trees)
       when "binary"
-        sum_trees(rows, @trees).map { |s|
+        prob = sum_trees(rows, @trees).map { |s| sigmoid(s) }
+        if probabilities
+          prob.map { |v| @labels.zip([1 - v, v]).to_h }
+        else
+          prob.map { |v| @labels[v > 0.5 ? 1 : 0] }
+        end
       else
         tree_scores = []
         num_trees = @trees.size / @labels.size
         @trees.each_slice(num_trees).each do |trees|
           tree_scores << sum_trees(rows, trees)
         end
-
+        rows.size.times.map do |i|
           v = tree_scores.map { |s| s[i] }
-
-
+          if probabilities
+            exp = v.map { |vi| Math.exp(vi) }
+            sum = exp.sum
+            @labels.zip(exp.map { |e| e / sum }).to_h
+          else
+            idx = v.map.with_index.max_by { |v2, _| v2 }.last
+            @labels[idx]
+          end
         end
       end
     end
@@ -81,7 +88,7 @@ module Eps
         else
           case node.operator
           when "equal"
-            v == node.value
+            v.to_s == node.value
           when "in"
             node.value.include?(v)
           when "greaterThan"
@@ -109,7 +116,7 @@ module Eps
     end
 
     def sigmoid(x)
-      1.0 / (1 + Math
+      1.0 / (1 + Math.exp(-x))
     end
   end
 end
````
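The new probability paths map summed tree scores through a sigmoid for binary objectives and a softmax for multiclass. A self-contained sketch of both transforms with made-up scores and labels:

```ruby
def sigmoid(x)
  1.0 / (1 + Math.exp(-x))
end

# Binary: one summed tree score per row -> probability of each label
labels = ["no", "yes"]
v = sigmoid(0.0)
binary_probs = labels.zip([1 - v, v]).to_h
# {"no" => 0.5, "yes" => 0.5}

# Multiclass: one score per class -> softmax (exp normalized by the sum)
scores = [1.0, 2.0, 3.0]
exp = scores.map { |s| Math.exp(s) }
sum = exp.sum
softmax = exp.map { |e| e / sum }
```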
data/lib/eps/evaluators/linear_regression.rb
CHANGED

````diff
@@ -9,7 +9,9 @@ module Eps
       @text_features = text_features || {}
     end
 
-    def predict(x)
+    def predict(x, probabilities: false)
+      raise "Probabilities not supported" if probabilities
+
       intercept = @coefficients["_intercept"] || 0.0
       scores = [intercept] * x.size
 
@@ -19,10 +21,11 @@ module Eps
         case type
         when "categorical"
           x.columns[k].each_with_index do |xv, i|
-
+            # TODO clean up
+            scores[i] += (@coefficients[[k, xv]] || @coefficients[[k, xv.to_s]]).to_f
           end
         when "text"
-          encoder = TextEncoder.new(
+          encoder = TextEncoder.new(**@text_features[k])
           counts = encoder.transform(x.columns[k])
           coef = {}
           @coefficients.each do |k2, v|
````
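The categorical branch now falls back to a string key (`@coefficients[[k, xv.to_s]]`), so boolean column values still match coefficients that were serialized with string keys, and unseen values score zero via `nil.to_f`. A sketch with a hypothetical `"drv"` column:

```ruby
# Coefficients keyed by [column, value]; a model loaded from PMML may
# store the value as a string (e.g. "true" instead of true)
coefficients = { ["drv", "true"] => 1.5 }

xv = true  # boolean value coming straight from the data
score = (coefficients[["drv", xv]] || coefficients[["drv", xv.to_s]]).to_f
# 1.5 — found via the string fallback

xv2 = "other"
unknown = (coefficients[["drv", xv2]] || coefficients[["drv", xv2.to_s]]).to_f
# 0.0 — nil.to_f for unseen values
```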
data/lib/eps/evaluators/naive_bayes.rb
CHANGED

````diff
@@ -10,14 +10,15 @@ module Eps
       @legacy = legacy
     end
 
-    def predict(x)
+    def predict(x, probabilities: false)
       probs = calculate_class_probabilities(x)
       probs.map do |xp|
-
-
-
-
-
+        if probabilities
+          sum = xp.values.map { |v| Math.exp(v) }.sum.to_f
+          xp.map { |k, v| [k, Math.exp(v) / sum] }.to_h
+        else
+          xp.sort_by { |k, v| [-v, k] }[0][0]
+        end
       end
     end
 
@@ -38,7 +39,8 @@ module Eps
         case type
         when "categorical"
           x.columns[k].each_with_index do |xi, i|
-
+            # TODO clean this up
+            vc = probabilities[:conditional][k][xi] || probabilities[:conditional][k][xi.to_s]
 
             # unknown value if not vc
             if vc
````
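`calculate_class_probabilities` yields log-scores per label; the new code exponentiates and normalizes them for `predict_probability`, or picks the top label (ties broken alphabetically via `sort_by { |k, v| [-v, k] }`) for plain `predict`. A sketch with made-up log-scores:

```ruby
# Log-scores per label, as a naive Bayes pass might produce (made up here)
xp = { "spam" => -1.0, "ham" => -2.0 }

# predict_probability: exponentiate and normalize to sum to 1
sum = xp.values.map { |v| Math.exp(v) }.sum.to_f
probs = xp.map { |k, v| [k, Math.exp(v) / sum] }.to_h

# predict: highest score wins; equal scores break alphabetically
best = xp.sort_by { |k, v| [-v, k] }[0][0]
# "spam"
```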
data/lib/eps/lightgbm.rb
CHANGED
````diff
@@ -17,7 +17,7 @@ module Eps
       str
     end
 
-    def _train(verbose: nil, early_stopping: nil)
+    def _train(verbose: nil, early_stopping: nil, learning_rate: 0.1)
       train_set = @train_set
       validation_set = @validation_set.dup
       summary_label = train_set.label
@@ -57,10 +57,13 @@ module Eps
 
       # text feature encoding
       prep_text_features(train_set)
-      prep_text_features(validation_set) if validation_set
+      prep_text_features(validation_set, fit: false) if validation_set
 
       # create params
-      params = {
+      params = {
+        objective: objective,
+        learning_rate: learning_rate
+      }
       params[:num_classes] = labels.size if objective == "multiclass"
       if train_set.size < 30
         params[:min_data_in_bin] = 1
@@ -121,25 +124,30 @@ module Eps
     def check_evaluator(objective, labels, booster, booster_set, evaluator, evaluator_set)
       expected = @booster.predict(booster_set.map_rows(&:to_a))
       if objective == "multiclass"
-
-
-
+        actual = evaluator.predict(evaluator_set, probabilities: true)
+        # just compare first for now
+        expected.map! { |v| v.first }
+        actual.map! { |v| v.values.first }
       elsif objective == "binary"
-
+        actual = evaluator.predict(evaluator_set, probabilities: true).map { |v| v.values.last }
+      else
+        actual = evaluator.predict(evaluator_set)
       end
-      actual = evaluator.predict(evaluator_set)
 
-      regression = objective == "regression"
+      regression = objective == "regression" || objective == "binary"
       bad_observations = []
       expected.zip(actual).each_with_index do |(exp, act), i|
-        success =
+        success = (act - exp).abs < 0.001
         unless success
           bad_observations << {expected: exp, actual: act, data_point: evaluator_set[i].map(&:itself).first}
         end
       end
 
       if bad_observations.any?
-
+        bad_observations.each do |obs|
+          p obs
+        end
+        raise "Bug detected in evaluator. Please report an issue."
       end
     end
 
````
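`check_evaluator` compares the booster's own predictions against the pure-Ruby evaluator and collects any pair differing by 0.001 or more. A sketch of that comparison loop with made-up prediction arrays (the `index` key stands in for the data point stored in the real code):

```ruby
expected = [1.00, 2.00, 3.00]   # e.g. predictions from the LightGBM booster
actual   = [1.0001, 2.0, 3.5]   # e.g. predictions from the Ruby evaluator

bad_observations = []
expected.zip(actual).each_with_index do |(exp, act), i|
  success = (act - exp).abs < 0.001
  bad_observations << {expected: exp, actual: act, index: i} unless success
end

bad_observations
# [{expected: 3.0, actual: 3.5, index: 2}]
```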
data/lib/eps/linear_regression.rb
CHANGED

````diff
@@ -37,6 +37,7 @@ module Eps
       str
     end
 
+    # TODO use keyword arguments for gsl and intercept in 0.4.0
     def _train(**options)
       raise "Target must be numeric" if @target_type != "numeric"
       check_missing_value(@train_set)
@@ -50,17 +51,35 @@ module Eps
 
       x = data.map_rows(&:to_a)
 
-
-
+      gsl =
+        if options.key?(:gsl)
+          options[:gsl]
+        elsif defined?(GSL)
+          true
+        elsif defined?(GSLR)
+          :gslr
+        else
+          false
+        end
+
+      intercept = options.key?(:intercept) ? options[:intercept] : true
+      if intercept && gsl != :gslr
         data.size.times do |i|
           x[i].unshift(1)
         end
       end
 
-      gsl = options.key?(:gsl) ? options[:gsl] : defined?(GSL)
-
       v3 =
-        if gsl
+        if gsl == :gslr
+          model = GSLR::OLS.new(intercept: intercept)
+          model.fit(x, data.label, weight: data.weight)
+
+          @covariance = model.covariance
+
+          coefficients = model.coefficients.dup
+          coefficients.unshift(model.intercept) if intercept
+          coefficients
+        elsif gsl
           x = GSL::Matrix.alloc(*x)
           y = GSL::Vector.alloc(data.label)
           w = GSL::Vector.alloc(data.weight) if data.weight
@@ -196,7 +215,11 @@ module Eps
 
     def diagonal
       @diagonal ||= begin
-        if covariance.
+        if covariance.is_a?(Array)
+          covariance.size.times.map do |i|
+            covariance[i][i]
+          end
+        elsif covariance.respond_to?(:each)
           d = covariance.each(:diagonal).to_a
           @removed.each do |i|
             d.insert(i, 0)
````
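`diagonal` gains a branch for a plain nested `Array` covariance matrix, as produced by the GSLR path (the GSL matrix path keeps using `each(:diagonal)`). A sketch of extracting the diagonal that way, with a made-up matrix:

```ruby
# Covariance as a plain nested Array, like GSLR returns
covariance = [
  [4.0, 0.1, 0.2],
  [0.1, 9.0, 0.3],
  [0.2, 0.3, 1.0]
]

# Same indexing as the new Array branch in #diagonal
diagonal = covariance.size.times.map { |i| covariance[i][i] }
# [4.0, 9.0, 1.0]
```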
data/lib/eps/naive_bayes.rb
CHANGED
data/lib/eps/pmml/generator.rb
CHANGED
````diff
@@ -210,10 +210,10 @@ module Eps
       probabilities[:conditional].each do |k, v|
         xml.BayesInput(fieldName: k) do
           if features[k] == "categorical"
-            v.sort_by { |k2, _| k2 }.each do |k2, v2|
+            v.sort_by { |k2, _| k2.to_s }.each do |k2, v2|
               xml.PairCounts(value: k2) do
                 xml.TargetValueCounts do
-                  v2.sort_by { |k2, _| k2 }.each do |k3, v3|
+                  v2.sort_by { |k2, _| k2.to_s }.each do |k3, v3|
                     xml.TargetValueCount(value: k3, count: v3)
                   end
                 end
@@ -221,7 +221,7 @@ module Eps
             end
           else
             xml.TargetValueStats do
-              v.sort_by { |k2, _| k2 }.each do |k2, v2|
+              v.sort_by { |k2, _| k2.to_s }.each do |k2, v2|
                 xml.TargetValueStat(value: k2) do
                   xml.GaussianDistribution(mean: v2[:mean], variance: v2[:stdev]**2)
                 end
@@ -233,7 +233,7 @@ module Eps
           end
           xml.BayesOutput(fieldName: "target") do
             xml.TargetValueCounts do
-              probabilities[:prior].sort_by { |k, _| k }.each do |k, v|
+              probabilities[:prior].sort_by { |k, _| k.to_s }.each do |k, v|
                 xml.TargetValueCount(value: k, count: v)
               end
             end
````
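The `.to_s` added to each `sort_by` matters because models over boolean columns can mix `true`/`false` with string keys, and Ruby raises when sorting values that aren't mutually comparable. A sketch of the failure and the fix:

```ruby
keys = [true, "a", false]

# Plain sort raises: TrueClass and String aren't comparable
begin
  keys.sort
  raised = false
rescue ArgumentError
  raised = true
end

# Sorting by the string form gives a stable, error-free order
sorted = keys.sort_by { |k| k.to_s }
# ["a", false, true]
```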
data/lib/eps/version.rb
CHANGED
metadata
CHANGED
````diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: eps
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.6
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-06-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: lightgbm
@@ -80,6 +80,20 @@ dependencies:
   - - ">="
   - !ruby/object:Gem::Version
     version: '0'
+- !ruby/object:Gem::Dependency
+  name: numo-narray
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -94,6 +108,20 @@ dependencies:
   - - ">="
   - !ruby/object:Gem::Version
     version: '0'
+- !ruby/object:Gem::Dependency
+  name: rover-df
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
 description:
 email: andrew@chartkick.com
 executables: []
@@ -143,7 +171,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.1.2
 signing_key:
 specification_version: 4
 summary: Machine learning for Ruby. Supports regression (linear regression) and classification
````