eps 0.3.5 → 0.3.9

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 1369016c3cae228f169fe580b54fca3c0d240cda202fa7d03ecc7a4e156ee8c7
-   data.tar.gz: bf83ca424c509798d1a1436806b52cba0cfdbefecb8d827d5b17aec7b807b121
+   metadata.gz: 49da7dab8a04a1404446f56051ea9faaf5b2260888ef9e2eccf12ec0f5ece887
+   data.tar.gz: 56a85a477e48cb07b39253e182878a947371ec30375b4570987c885a0e6bb93a
  SHA512:
-   metadata.gz: 2bf47d80a301eb546c348aaa71f847fa22ace5bed63d97a1f19eb14bc15388b056cd3f545ccf251b2bbf2afc485ef81e5559849ff7459e9dd9f88a71c7cbf83a
-   data.tar.gz: 82d65d84e95a6518cd132c2a42cdec20afd05c0013192941b59ee0edb524874d12b2dd9082dd89be1422872c88e827e031469e43b80336c48c7eab7ff4fe611e
+   metadata.gz: 4a611604c4172110f67e3b39147ddb77346fe98390ee9a31791f61e1da2b4a6f1f6e83f63c4ab722c30d2247797c66328b9acefb51a907f76808793014207dd2
+   data.tar.gz: 1aca1429a26a1eafee5717680507cf04794f639bb13c4ce416709bf0e64ca6017525413f74bcfa7728bd5272ff5b845dfb2da17cdee781eb03ea98a56c3441b3
data/CHANGELOG.md CHANGED
@@ -1,3 +1,19 @@
+ ## 0.3.9 (2021-10-14)
+
+ - Fixed error with `lessOrEqual` operator
+
+ ## 0.3.8 (2021-02-08)
+
+ - Fixed error with categorical and text features
+
+ ## 0.3.7 (2020-11-23)
+
+ - Fixed error with LightGBM summary
+
+ ## 0.3.6 (2020-06-19)
+
+ - Fixed error with text features for LightGBM
+
  ## 0.3.5 (2020-06-10)

  - Added `learning_rate` option for LightGBM
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2018-2019 Andrew Kane
+ Copyright (c) 2018-2021 Andrew Kane

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -7,7 +7,7 @@ Machine learning for Ruby

  Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails

- [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
+ [![Build Status](https://github.com/ankane/eps/workflows/build/badge.svg?branch=master)](https://github.com/ankane/eps/actions)

  ## Installation

@@ -134,7 +134,7 @@ For text features, use strings with multiple words.
  {description: "a beautiful house on top of a hill"}
  ```

- This creates features based on word count (term frequency).
+ This creates features based on [word count](https://en.wikipedia.org/wiki/Bag-of-words_model).

  You can specify text features explicitly with:

@@ -147,12 +147,12 @@ You can set advanced options with:
  ```ruby
  text_features: {
    description: {
-     min_occurences: 5,
-     max_features: 1000,
-     min_length: 1,
-     case_sensitive: true,
-     tokenizer: /\s+/,
-     stop_words: ["and", "the"]
+     min_occurences: 5, # min times a word must appear to be included in the model
+     max_features: 1000, # max number of words to include in the model
+     min_length: 1, # min length of words to be included
+     case_sensitive: true, # how to treat words with different case
+     tokenizer: /\s+/, # how to tokenize the text, defaults to whitespace
+     stop_words: ["and", "the"] # words to exclude from the model
    }
  }
  ```
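
As a rough illustration of how the options documented in this hunk could interact, the sketch below builds a bag-of-words vocabulary in plain Ruby. It is not eps's `TextEncoder`: the helper name `build_vocabulary`, its defaults, and the choice to count occurrences across all documents are assumptions made for illustration only. The option names (including the `min_occurences` spelling) follow the README.

```ruby
# Hypothetical helper, not part of eps: builds a vocabulary from documents
# using options named after the README's text feature options.
def build_vocabulary(docs, min_occurences: 1, max_features: 1000, min_length: 1,
                     case_sensitive: false, tokenizer: /\s+/, stop_words: [])
  counts = Hash.new(0)
  docs.each do |doc|
    doc = doc.downcase unless case_sensitive
    doc.split(tokenizer).each do |token|
      next if token.length < min_length      # drop short tokens
      next if stop_words.include?(token)     # drop excluded words
      counts[token] += 1
    end
  end

  counts
    .select { |_, c| c >= min_occurences }   # keep words seen often enough
    .sort_by { |word, c| [-c, word] }        # most frequent first, ties by name
    .first(max_features)                     # cap the vocabulary size
    .map(&:first)
end

docs = ["a beautiful house on top of a hill", "a house and the hill"]
vocab = build_vocabulary(docs, min_occurences: 2, min_length: 2, stop_words: ["and", "the"])
```

With these two documents, only words that appear at least twice, are at least two characters long, and are not stop words survive the filters.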
@@ -218,7 +218,7 @@ Build the model with:
  PriceModel.build
  ```

- This saves the model to `price_model.pmml`. Be sure to check this into source control.
+ This saves the model to `price_model.pmml`. Check this into source control or use a tool like [Trove](https://github.com/ankane/trove) to store it.

  Predict with:

@@ -343,9 +343,7 @@ df = Daru::DataFrame.from_csv("houses.csv")
  Eps::Model.new(df, target: "price")
  ```

- ### CSVs
-
- When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+ When reading CSV files directly, be sure to convert numeric fields. The `table` method does this automatically.

  ```ruby
  CSV.table("data.csv").map { |row| row.to_h }
@@ -375,7 +373,11 @@ Eps::Model.new(data, learning_rate: 0.01)

  ### Linear Regression

- #### Performance
+ By default, an intercept is included. Disable this with:
+
+ ```ruby
+ Eps::Model.new(data, intercept: false)
+ ```

  To speed up training on large datasets with linear regression, [install GSL](https://github.com/ankane/gslr#gsl-installation). With Homebrew, you can use:

@@ -391,14 +393,6 @@ gem 'gslr', group: :development

  It only needs to be available in environments used to build the model.

- #### Options
-
- By default, an intercept is included. Disable this with:
-
- ```ruby
- Eps::Model.new(data, intercept: false)
- ```
-
  ## Probability

  To get the probability of each category for predictions with classification, use:
@@ -3,6 +3,7 @@ module Eps
      def initialize(data = nil, y = nil, **options)
        @options = options.dup
        @trained = false
+       @text_encoders = {}
        # TODO better pattern - don't pass most options to train
        train(data, y, **options) if data
      end
@@ -209,29 +210,38 @@ module Eps
        [data, target]
      end

-     def prep_text_features(train_set)
-       @text_encoders = {}
+     def prep_text_features(train_set, fit: true)
        @text_features.each do |k, v|
-         # reset vocabulary
-         v.delete(:vocabulary)
+         if fit
+           # reset vocabulary
+           v.delete(:vocabulary)
+
+           # TODO determine max features automatically
+           # start based on number of rows
+           encoder = Eps::TextEncoder.new(**v)
+           counts = encoder.fit(train_set.columns.delete(k))
+         else
+           encoder = @text_encoders[k]
+           counts = encoder.transform(train_set.columns.delete(k))
+         end

-         # TODO determine max features automatically
-         # start based on number of rows
-         encoder = Eps::TextEncoder.new(**v)
-         counts = encoder.fit(train_set.columns.delete(k))
          encoder.vocabulary.each do |word|
            train_set.columns[[k, word]] = [0] * counts.size
          end
+
          counts.each_with_index do |ci, i|
            ci.each do |word, count|
              word_key = [k, word]
              train_set.columns[word_key][i] = 1 if train_set.columns.key?(word_key)
            end
          end
-         @text_encoders[k] = encoder

-         # update vocabulary
-         v[:vocabulary] = encoder.vocabulary
+         if fit
+           @text_encoders[k] = encoder
+
+           # update vocabulary
+           v[:vocabulary] = encoder.vocabulary
+         end
        end

        raise "No features left" if train_set.columns.empty?
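
The new `fit:` flag separates learning a vocabulary (on the training set) from reusing it (on the validation set), so both sets end up with the same word columns. The sketch below shows that fit/transform split with a small stand-in class. `TinyEncoder` and `count_words` are hypothetical names, and eps's real `Eps::TextEncoder` may behave differently in its details.

```ruby
# Hypothetical stand-in for a text encoder; eps's real TextEncoder may differ.
class TinyEncoder
  attr_reader :vocabulary

  def initialize
    @vocabulary = []
  end

  # Learn the vocabulary from training text and return per-row word counts.
  def fit(texts)
    counts = texts.map { |t| count_words(t) }
    @vocabulary = counts.flat_map(&:keys).uniq.sort
    counts
  end

  # Count words using the already learned vocabulary; unseen words are dropped.
  def transform(texts)
    texts.map { |t| count_words(t).select { |w, _| @vocabulary.include?(w) } }
  end

  private

  def count_words(text)
    text.split.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  end
end

encoder = TinyEncoder.new
encoder.fit(["red house", "blue house"])   # training set: fit
train_columns = encoder.vocabulary

# validation set: transform only, so "green" does not create a new column
validation_counts = encoder.transform(["green house"])
```

Fitting a second time on the validation set, as the old code did, would instead replace the vocabulary with the validation set's words and leave the two sets with different columns.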
@@ -19,13 +19,7 @@ module Eps
          # sparse matrix
          @text_features.each do |k, v|
            encoder = TextEncoder.new(**v)
-
-           values = data.columns.delete(k)
-           counts = encoder.transform(values)
-
-           encoder.vocabulary.each do |word|
-             data.columns[[k, word]] = [0] * values.size
-           end
+           counts = encoder.transform(data.columns[k])

            counts.each_with_index do |xc, i|
              row = rows[i]
@@ -52,7 +46,7 @@ module Eps
          @trees.each_slice(num_trees).each do |trees|
            tree_scores << sum_trees(rows, trees)
          end
-         data.size.times.map do |i|
+         rows.size.times.map do |i|
            v = tree_scores.map { |s| s[i] }
            if probabilities
              exp = v.map { |vi| Math.exp(vi) }
data/lib/eps/lightgbm.rb CHANGED
@@ -10,7 +10,7 @@ module Eps
          str << "Model needs more data for better predictions\n"
        else
          str << "Most important features\n"
-         @importance_keys.zip(importance).sort_by { |k, v| [-v, k] }.first(10).each do |k, v|
+         @importance_keys.zip(importance).sort_by { |k, v| [-v, display_field(k)] }.first(10).each do |k, v|
            str << "#{display_field(k)}: #{(100 * v / total).round}\n"
          end
        end
@@ -57,7 +57,7 @@ module Eps

        # text feature encoding
        prep_text_features(train_set)
-       prep_text_features(validation_set) if validation_set
+       prep_text_features(validation_set, fit: false) if validation_set

        # create params
        params = {
@@ -71,7 +71,7 @@ module Eps
        end

        # create datasets
-       categorical_idx = @features.values.map.with_index.select { |type, _| type == "categorical" }.map(&:last)
+       categorical_idx = train_set.columns.keys.map.with_index.select { |k, _| @features[k] == "categorical" }.map(&:last)
        train_ds = ::LightGBM::Dataset.new(train_set.map_rows(&:to_a), label: train_set.label, weight: train_set.weight, categorical_feature: categorical_idx, params: params)
        validation_ds = ::LightGBM::Dataset.new(validation_set.map_rows(&:to_a), label: validation_set.label, weight: validation_set.weight, categorical_feature: categorical_idx, params: params, reference: train_ds) if validation_set

@@ -144,7 +144,10 @@ module Eps
        end

        if bad_observations.any?
-         raise "Bug detected in evaluator. Please report an issue. Bad data points: #{bad_observations.inspect}"
+         bad_observations.each do |obs|
+           p obs
+         end
+         raise "Bug detected in evaluator. Please report an issue."
        end
      end

@@ -222,7 +222,7 @@ module Eps
            else
              operator = xml_predicate.attribute("operator").value
              value = xml_predicate.attribute("value").value
-             value = value.to_f if operator == "greaterThan"
+             value = value.to_f if operator == "greaterThan" || operator == "lessOrEqual"
              field = xml_predicate.attribute("field").value
              field = derived_fields[field] if derived_fields[field]
              {
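
The `lessOrEqual` fix converts the predicate value to a float, just as was already done for `greaterThan`. XML attribute values are strings, and comparing a number with a string in Ruby either raises or, if both sides are strings, compares lexically. The sketch below demonstrates both failure modes in plain Ruby. The variable names and values are illustrative, and how eps's evaluator actually used the unconverted value is not shown in this diff.

```ruby
# Attribute values read from XML are strings.
threshold = "10"
feature_value = 9.0

# Float <= String is not defined, so Ruby raises ArgumentError.
begin
  feature_value <= threshold
  string_result = :no_error
rescue ArgumentError
  string_result = :argument_error
end

# String <= String compares character by character, so "9" sorts after "10".
lexical_result = "9" <= threshold

# After conversion with to_f, the comparison is numeric.
numeric_result = feature_value <= threshold.to_f
```

Only the converted comparison gives the answer a tree split expects: 9.0 is less than or equal to 10.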
data/lib/eps/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Eps
-   VERSION = "0.3.5"
+   VERSION = "0.3.9"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: eps
  version: !ruby/object:Gem::Version
-   version: 0.3.5
+   version: 0.3.9
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-06-11 00:00:00.000000000 Z
+ date: 2021-10-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: lightgbm
  name: lightgbm
@@ -38,92 +38,8 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: daru
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: minitest
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: numo-narray
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rake
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rover-df
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- description:
- email: andrew@chartkick.com
+ description:
+ email: andrew@ankane.org
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -156,7 +72,7 @@ homepage: https://github.com/ankane/eps
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -171,8 +87,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.22
+ signing_key:
  specification_version: 4
  summary: Machine learning for Ruby. Supports regression (linear regression) and classification
    (naive Bayes)