eps 0.3.5 → 0.3.9
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/LICENSE.txt +1 -1
- data/README.md +15 -21
- data/lib/eps/base_estimator.rb +21 -11
- data/lib/eps/evaluators/lightgbm.rb +2 -8
- data/lib/eps/lightgbm.rb +7 -4
- data/lib/eps/pmml/loader.rb +1 -1
- data/lib/eps/version.rb +1 -1
- metadata +8 -92
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 49da7dab8a04a1404446f56051ea9faaf5b2260888ef9e2eccf12ec0f5ece887
+  data.tar.gz: 56a85a477e48cb07b39253e182878a947371ec30375b4570987c885a0e6bb93a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4a611604c4172110f67e3b39147ddb77346fe98390ee9a31791f61e1da2b4a6f1f6e83f63c4ab722c30d2247797c66328b9acefb51a907f76808793014207dd2
+  data.tar.gz: 1aca1429a26a1eafee5717680507cf04794f639bb13c4ce416709bf0e64ca6017525413f74bcfa7728bd5272ff5b845dfb2da17cdee781eb03ea98a56c3441b3
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,19 @@
+## 0.3.9 (2021-10-14)
+
+- Fixed error with `lessOrEqual` operator
+
+## 0.3.8 (2021-02-08)
+
+- Fixed error with categorical and text features
+
+## 0.3.7 (2020-11-23)
+
+- Fixed error with LightGBM summary
+
+## 0.3.6 (2020-06-19)
+
+- Fixed error with text features for LightGBM
+
 ## 0.3.5 (2020-06-10)
 
 - Added `learning_rate` option for LightGBM
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -7,7 +7,7 @@ Machine learning for Ruby
 
 Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
 
-[](https://github.com/ankane/eps/actions)
 
 ## Installation
 
@@ -134,7 +134,7 @@ For text features, use strings with multiple words.
 {description: "a beautiful house on top of a hill"}
 ```
 
-This creates features based on word count
+This creates features based on [word count](https://en.wikipedia.org/wiki/Bag-of-words_model).
 
 You can specify text features explicitly with:
 
@@ -147,12 +147,12 @@ You can set advanced options with:
 ```ruby
 text_features: {
   description: {
-    min_occurences: 5,
-    max_features: 1000,
-    min_length: 1,
-    case_sensitive: true,
-    tokenizer: /\s+/,
-    stop_words: ["and", "the"]
+    min_occurences: 5,          # min times a word must appear to be included in the model
+    max_features: 1000,         # max number of words to include in the model
+    min_length: 1,              # min length of words to be included
+    case_sensitive: true,       # how to treat words with different case
+    tokenizer: /\s+/,           # how to tokenize the text, defaults to whitespace
+    stop_words: ["and", "the"]  # words to exclude from the model
   }
 }
 ```
@@ -218,7 +218,7 @@ Build the model with:
 PriceModel.build
 ```
 
-This saves the model to `price_model.pmml`.
+This saves the model to `price_model.pmml`. Check this into source control or use a tool like [Trove](https://github.com/ankane/trove) to store it.
 
 Predict with:
 
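The `text_features` options documented in the README hunks above can be pictured with a small, self-contained sketch. This is not Eps's own text encoder: `build_vocabulary` is a hypothetical helper, its defaults are made up, and the way it combines the options (drop short words and stop words, keep words seen at least `min_occurences` times, cap the list at the `max_features` most frequent) is an assumption based only on the option comments.

```ruby
# Hypothetical helper, not part of Eps. Option names mirror the README,
# including its spelling of `min_occurences`.
def build_vocabulary(texts, min_occurences: 1, max_features: 1000, min_length: 1,
                     case_sensitive: false, tokenizer: /\s+/, stop_words: [])
  counts = Hash.new(0)
  texts.each do |text|
    text = text.downcase unless case_sensitive
    text.split(tokenizer).each do |word|
      next if word.length < min_length || stop_words.include?(word)
      counts[word] += 1
    end
  end

  counts
    .select { |_, n| n >= min_occurences }
    .sort_by { |word, n| [-n, word] } # most frequent first, ties alphabetical
    .first(max_features)
    .map(&:first)
end

texts = [
  "a beautiful house on top of a hill",
  "a small house and the garden"
]
vocabulary = build_vocabulary(texts, min_occurences: 2, stop_words: ["and", "the"])
# vocabulary => ["a", "house"]
```

Only "a" and "house" appear at least twice, so every other word is left out of the vocabulary.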
@@ -343,9 +343,7 @@ df = Daru::DataFrame.from_csv("houses.csv")
 Eps::Model.new(df, target: "price")
 ```
 
-
-
-When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+When reading CSV files directly, be sure to convert numeric fields. The `table` method does this automatically.
 
 ```ruby
 CSV.table("data.csv").map { |row| row.to_h }
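The note about converting numeric fields can be checked with Ruby's standard `csv` library. The sketch below parses an inline string (the column names and values are made up) to show that fields stay strings unless a converter is supplied. The last call passes the same options that `CSV.table` applies when reading a file.

```ruby
require "csv"

csv_text = <<~CSV
  bedrooms,price
  3,250000
  4,310000.5
CSV

# Without converters, every field is read as a String.
raw = CSV.parse(csv_text, headers: true).map(&:to_h)

# `converters: :numeric` turns integer- and float-looking fields into numbers.
converted = CSV.parse(csv_text, headers: true, converters: :numeric).map(&:to_h)

# The options CSV.table uses: headers, numeric conversion, symbol header names.
table_like = CSV.parse(csv_text, headers: true, converters: :numeric,
                                 header_converters: :symbol).map(&:to_h)
```

Here `raw.first["price"]` is the string `"250000"`, while `converted.first["price"]` is the integer `250000`.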
@@ -375,7 +373,11 @@ Eps::Model.new(data, learning_rate: 0.01)
 
 ### Linear Regression
 
-
+By default, an intercept is included. Disable this with:
+
+```ruby
+Eps::Model.new(data, intercept: false)
+```
 
 To speed up training on large datasets with linear regression, [install GSL](https://github.com/ankane/gslr#gsl-installation). With Homebrew, you can use:
 
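To make the effect of `intercept: false` concrete, here is a minimal single-feature least-squares sketch. It is illustrative only: `fit_line` is a hypothetical helper and says nothing about how Eps fits its models internally.

```ruby
# Least-squares fit of y on one feature x, with or without an intercept.
# Hypothetical helper for illustration, not Eps's estimator.
def fit_line(x, y, intercept: true)
  n = x.size.to_f
  if intercept
    x_mean = x.sum / n
    y_mean = y.sum / n
    slope = x.zip(y).sum { |xi, yi| (xi - x_mean) * (yi - y_mean) } /
            x.sum { |xi| (xi - x_mean)**2 }
    { slope: slope, intercept: y_mean - slope * x_mean }
  else
    # The line is forced through the origin.
    slope = x.zip(y).sum { |xi, yi| xi * yi }.to_f / x.sum { |xi| xi**2 }
    { slope: slope, intercept: 0.0 }
  end
end

x = [1, 2, 3]
y = [3, 5, 7] # exactly y = 2x + 1

with_intercept = fit_line(x, y)
without_intercept = fit_line(x, y, intercept: false)
```

With an intercept the fit recovers slope 2 and intercept 1. Without one, the slope rises to 34/14 (about 2.43) to compensate for the missing constant term.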
@@ -391,14 +393,6 @@ gem 'gslr', group: :development
 
 It only needs to be available in environments used to build the model.
 
-#### Options
-
-By default, an intercept is included. Disable this with:
-
-```ruby
-Eps::Model.new(data, intercept: false)
-```
-
 ## Probability
 
 To get the probability of each category for predictions with classification, use:
data/lib/eps/base_estimator.rb
CHANGED
@@ -3,6 +3,7 @@ module Eps
     def initialize(data = nil, y = nil, **options)
       @options = options.dup
       @trained = false
+      @text_encoders = {}
       # TODO better pattern - don't pass most options to train
       train(data, y, **options) if data
     end
@@ -209,29 +210,38 @@ module Eps
       [data, target]
     end
 
-    def prep_text_features(train_set)
-      @text_encoders = {}
+    def prep_text_features(train_set, fit: true)
       @text_features.each do |k, v|
-
-
+        if fit
+          # reset vocabulary
+          v.delete(:vocabulary)
+
+          # TODO determine max features automatically
+          # start based on number of rows
+          encoder = Eps::TextEncoder.new(**v)
+          counts = encoder.fit(train_set.columns.delete(k))
+        else
+          encoder = @text_encoders[k]
+          counts = encoder.transform(train_set.columns.delete(k))
+        end
 
-        # TODO determine max features automatically
-        # start based on number of rows
-        encoder = Eps::TextEncoder.new(**v)
-        counts = encoder.fit(train_set.columns.delete(k))
         encoder.vocabulary.each do |word|
           train_set.columns[[k, word]] = [0] * counts.size
         end
+
         counts.each_with_index do |ci, i|
           ci.each do |word, count|
             word_key = [k, word]
             train_set.columns[word_key][i] = 1 if train_set.columns.key?(word_key)
           end
         end
-        @text_encoders[k] = encoder
 
-
-
+        if fit
+          @text_encoders[k] = encoder
+
+          # update vocabulary
+          v[:vocabulary] = encoder.vocabulary
+        end
       end
 
       raise "No features left" if train_set.columns.empty?
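The new `fit:` flag separates learning a vocabulary (training set) from reusing it (validation set). The sketch below shows that split with a deliberately tiny encoder. `TinyTextEncoder` is a hypothetical stand-in, not `Eps::TextEncoder`, and its whitespace tokenization and lowercasing are assumptions made for the example.

```ruby
# Minimal stand-in for a text encoder; NOT Eps::TextEncoder.
class TinyTextEncoder
  attr_reader :vocabulary

  def initialize
    @vocabulary = []
  end

  # Learn the vocabulary from training text, then return per-row word counts.
  def fit(texts)
    @vocabulary = texts.flat_map { |t| tokenize(t) }.uniq
    transform(texts)
  end

  # Count words using the already-learned vocabulary; unseen words are dropped.
  def transform(texts)
    texts.map do |t|
      tokenize(t).select { |w| @vocabulary.include?(w) }.tally
    end
  end

  private

  def tokenize(text)
    text.downcase.split(/\s+/)
  end
end

encoder = TinyTextEncoder.new
train_counts = encoder.fit(["house on hill", "house by lake"])
validation_counts = encoder.transform(["cabin by lake"])
```

Because the validation text is only transformed, "cabin" never enters the vocabulary, so both sets end up with the same feature columns.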
data/lib/eps/evaluators/lightgbm.rb
CHANGED
@@ -19,13 +19,7 @@ module Eps
         # sparse matrix
         @text_features.each do |k, v|
           encoder = TextEncoder.new(**v)
-
-          values = data.columns.delete(k)
-          counts = encoder.transform(values)
-
-          encoder.vocabulary.each do |word|
-            data.columns[[k, word]] = [0] * values.size
-          end
+          counts = encoder.transform(data.columns[k])
 
           counts.each_with_index do |xc, i|
             row = rows[i]
@@ -52,7 +46,7 @@ module Eps
         @trees.each_slice(num_trees).each do |trees|
           tree_scores << sum_trees(rows, trees)
         end
-
+        rows.size.times.map do |i|
           v = tree_scores.map { |s| s[i] }
           if probabilities
             exp = v.map { |vi| Math.exp(vi) }
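The last context lines show per-class tree scores being exponentiated when probabilities are requested. The lines that follow are not part of this diff, so the continuation below is an assumption: a standard softmax, which divides each exponential by their sum. `softmax` is a hypothetical helper, and the max subtraction is a common stability trick rather than something taken from the gem.

```ruby
# Softmax: exponentiate each raw score, then divide by the total so the
# results are positive and sum to 1. Hypothetical helper for illustration.
def softmax(scores)
  max = scores.max
  exp = scores.map { |s| Math.exp(s - max) } # subtract max for numerical stability
  sum = exp.sum
  exp.map { |e| e / sum }
end

probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the highest raw score keeps the highest probability, and equal scores produce equal probabilities.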
data/lib/eps/lightgbm.rb
CHANGED
@@ -10,7 +10,7 @@ module Eps
         str << "Model needs more data for better predictions\n"
       else
         str << "Most important features\n"
-        @importance_keys.zip(importance).sort_by { |k, v| [-v, k] }.first(10).each do |k, v|
+        @importance_keys.zip(importance).sort_by { |k, v| [-v, display_field(k)] }.first(10).each do |k, v|
           str << "#{display_field(k)}: #{(100 * v / total).round}\n"
         end
       end
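One plausible reason for switching the tie-breaker to `display_field(k)` is that feature keys are not all the same type: elsewhere in this diff, text features are stored under `[column, word]` array keys, while other features use plain names. The sketch below reproduces that situation with made-up data. The `display_field` defined here is a hypothetical stand-in whose output format is an assumption.

```ruby
# Feature keys can be plain strings or [column, word] pairs for text features.
importance = {
  "bedrooms" => 5,
  ["description", "hill"] => 5,
  "state" => 9
}

# Hypothetical stand-in for the gem's display_field helper.
def display_field(key)
  key.is_a?(Array) ? "#{key[0]}(#{key[1]})" : key
end

# Tie-breaking on the raw keys compares a String with an Array and fails.
begin
  importance.sort_by { |k, v| [-v, k] }
  raw_sort_failed = false
rescue ArgumentError
  raw_sort_failed = true
end

# Tie-breaking on the display string always compares String with String.
ranked = importance.sort_by { |k, v| [-v, display_field(k)] }.map(&:first)
# ranked => ["state", "bedrooms", ["description", "hill"]]
```

The failure only shows up when two features tie on importance, which is why a sort like the old one can work on some models and raise on others.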
@@ -57,7 +57,7 @@ module Eps
 
       # text feature encoding
       prep_text_features(train_set)
-      prep_text_features(validation_set) if validation_set
+      prep_text_features(validation_set, fit: false) if validation_set
 
       # create params
       params = {
@@ -71,7 +71,7 @@ module Eps
       end
 
       # create datasets
-      categorical_idx =
+      categorical_idx = train_set.columns.keys.map.with_index.select { |k, _| @features[k] == "categorical" }.map(&:last)
       train_ds = ::LightGBM::Dataset.new(train_set.map_rows(&:to_a), label: train_set.label, weight: train_set.weight, categorical_feature: categorical_idx, params: params)
       validation_ds = ::LightGBM::Dataset.new(validation_set.map_rows(&:to_a), label: validation_set.label, weight: validation_set.weight, categorical_feature: categorical_idx, params: params, reference: train_ds) if validation_set
 
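The new `categorical_idx` expression derives column positions from the column keys actually present in the training set. The sketch below runs the same method chain on plain local data. The column names and the `"numeric"` and `"text"` type labels are made up for illustration; only `"categorical"` appears in the diff.

```ruby
# Column keys in training order; text features expand into [column, word] keys.
column_keys = ["bedrooms", "state", ["description", "hill"], "style"]

# Hypothetical feature-type map, mirroring the role of @features in the diff.
features = {
  "bedrooms" => "numeric",
  "state" => "categorical",
  "description" => "text",
  "style" => "categorical"
}

# Pair each key with its position, keep the categorical ones, return positions.
categorical_idx = column_keys
  .map.with_index
  .select { |k, _| features[k] == "categorical" }
  .map(&:last)
# categorical_idx => [1, 3]
```

The array key for the expanded text column is not in the feature map, so its lookup returns `nil` and it is never treated as categorical.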
@@ -144,7 +144,10 @@ module Eps
       end
 
       if bad_observations.any?
-
+        bad_observations.each do |obs|
+          p obs
+        end
+        raise "Bug detected in evaluator. Please report an issue."
       end
     end
 
data/lib/eps/pmml/loader.rb
CHANGED
@@ -222,7 +222,7 @@ module Eps
           else
             operator = xml_predicate.attribute("operator").value
             value = xml_predicate.attribute("value").value
-            value = value.to_f if operator == "greaterThan"
+            value = value.to_f if operator == "greaterThan" || operator == "lessOrEqual"
             field = xml_predicate.attribute("field").value
             field = derived_fields[field] if derived_fields[field]
             {
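The loader change converts the predicate value to a float for `lessOrEqual` as well as `greaterThan`. A short sketch shows why that matters: XML attribute values arrive as strings, and comparing a number with a string raises. `parse_predicate` and `matches?` are hypothetical helpers written for this example, not the gem's loader or evaluator.

```ruby
# Hypothetical sketch of parsing and evaluating a simple PMML-style predicate.
def parse_predicate(field, operator, raw_value)
  numeric = %w[greaterThan lessOrEqual].include?(operator)
  { field: field, operator: operator, value: numeric ? raw_value.to_f : raw_value }
end

def matches?(predicate, row)
  actual = row[predicate[:field]]
  case predicate[:operator]
  when "greaterThan" then actual > predicate[:value]
  when "lessOrEqual" then actual <= predicate[:value]
  when "equal" then actual.to_s == predicate[:value].to_s
  else raise ArgumentError, "unsupported operator: #{predicate[:operator]}"
  end
end

predicate = parse_predicate("bedrooms", "lessOrEqual", "3.5")
small_house = matches?(predicate, { "bedrooms" => 3 })
large_house = matches?(predicate, { "bedrooms" => 4 })

# Without the conversion, the number would be compared with the String "3.5".
comparison_error = nil
begin
  3 <= "3.5"
rescue ArgumentError => e
  comparison_error = e.message
end
```

After conversion, a row with 3 bedrooms matches the predicate and a row with 4 does not. The unconverted comparison raises an `ArgumentError` instead.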
data/lib/eps/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: eps
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.9
 platform: ruby
 authors:
 - Andrew Kane
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2021-10-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: lightgbm
@@ -38,92 +38,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-
-
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: daru
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: minitest
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: numo-narray
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: rake
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: rover-df
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-description:
-email: andrew@chartkick.com
+description:
+email: andrew@ankane.org
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -156,7 +72,7 @@ homepage: https://github.com/ankane/eps
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -171,8 +87,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
-signing_key:
+rubygems_version: 3.2.22
+signing_key:
 specification_version: 4
 summary: Machine learning for Ruby. Supports regression (linear regression) and classification
   (naive Bayes)