eps 0.3.5 → 0.3.9

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 1369016c3cae228f169fe580b54fca3c0d240cda202fa7d03ecc7a4e156ee8c7
- data.tar.gz: bf83ca424c509798d1a1436806b52cba0cfdbefecb8d827d5b17aec7b807b121
+ metadata.gz: 49da7dab8a04a1404446f56051ea9faaf5b2260888ef9e2eccf12ec0f5ece887
+ data.tar.gz: 56a85a477e48cb07b39253e182878a947371ec30375b4570987c885a0e6bb93a
  SHA512:
- metadata.gz: 2bf47d80a301eb546c348aaa71f847fa22ace5bed63d97a1f19eb14bc15388b056cd3f545ccf251b2bbf2afc485ef81e5559849ff7459e9dd9f88a71c7cbf83a
- data.tar.gz: 82d65d84e95a6518cd132c2a42cdec20afd05c0013192941b59ee0edb524874d12b2dd9082dd89be1422872c88e827e031469e43b80336c48c7eab7ff4fe611e
+ metadata.gz: 4a611604c4172110f67e3b39147ddb77346fe98390ee9a31791f61e1da2b4a6f1f6e83f63c4ab722c30d2247797c66328b9acefb51a907f76808793014207dd2
+ data.tar.gz: 1aca1429a26a1eafee5717680507cf04794f639bb13c4ce416709bf0e64ca6017525413f74bcfa7728bd5272ff5b845dfb2da17cdee781eb03ea98a56c3441b3
data/CHANGELOG.md CHANGED
@@ -1,3 +1,19 @@
+ ## 0.3.9 (2021-10-14)
+
+ - Fixed error with `lessOrEqual` operator
+
+ ## 0.3.8 (2021-02-08)
+
+ - Fixed error with categorical and text features
+
+ ## 0.3.7 (2020-11-23)
+
+ - Fixed error with LightGBM summary
+
+ ## 0.3.6 (2020-06-19)
+
+ - Fixed error with text features for LightGBM
+
  ## 0.3.5 (2020-06-10)
 
  - Added `learning_rate` option for LightGBM
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)
 
- Copyright (c) 2018-2019 Andrew Kane
+ Copyright (c) 2018-2021 Andrew Kane
 
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -7,7 +7,7 @@ Machine learning for Ruby
 
  Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
 
- [![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
+ [![Build Status](https://github.com/ankane/eps/workflows/build/badge.svg?branch=master)](https://github.com/ankane/eps/actions)
 
  ## Installation
 
@@ -134,7 +134,7 @@ For text features, use strings with multiple words.
  {description: "a beautiful house on top of a hill"}
  ```
 
- This creates features based on word count (term frequency).
+ This creates features based on [word count](https://en.wikipedia.org/wiki/Bag-of-words_model).
 
  You can specify text features explicitly with:
 
@@ -147,12 +147,12 @@ You can set advanced options with:
  ```ruby
  text_features: {
  description: {
- min_occurences: 5,
- max_features: 1000,
- min_length: 1,
- case_sensitive: true,
- tokenizer: /\s+/,
- stop_words: ["and", "the"]
+ min_occurences: 5, # min times a word must appear to be included in the model
+ max_features: 1000, # max number of words to include in the model
+ min_length: 1, # min length of words to be included
+ case_sensitive: true, # how to treat words with different case
+ tokenizer: /\s+/, # how to tokenize the text, defaults to whitespace
+ stop_words: ["and", "the"] # words to exclude from the model
  }
  }
  ```
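The option comments added in this hunk can be read as a small bag-of-words pipeline. The following is a minimal sketch of that reading, not the gem's implementation: the helper names `tokenize` and `build_vocabulary` are hypothetical, and treating `min_occurences` as a total count across all rows is an assumption.

```ruby
# Hypothetical helpers; option names follow the README, behavior is assumed.
def tokenize(text, tokenizer: /\s+/, case_sensitive: true, min_length: 1, stop_words: [])
  text = text.downcase unless case_sensitive
  # Drop tokens that are too short (including empty strings) or are stop words
  text.split(tokenizer).reject { |w| w.length < min_length || stop_words.include?(w) }
end

def build_vocabulary(docs, min_occurences: 1, max_features: 1000, **opts)
  counts = Hash.new(0)
  docs.each { |d| tokenize(d, **opts).each { |w| counts[w] += 1 } }
  # Keep frequent words, most frequent first, ties broken alphabetically
  counts.select { |_, c| c >= min_occurences }
        .sort_by { |w, c| [-c, w] }
        .first(max_features)
        .map(&:first)
end

docs = ["a beautiful house on top of a hill", "The house and the hill"]
vocab = build_vocabulary(docs, case_sensitive: false, stop_words: ["and", "the"], min_occurences: 2)
```

With these inputs only `a`, `hill`, and `house` appear at least twice after stop words are removed, so they form the vocabulary.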
@@ -218,7 +218,7 @@ Build the model with:
  PriceModel.build
  ```
 
- This saves the model to `price_model.pmml`. Be sure to check this into source control.
+ This saves the model to `price_model.pmml`. Check this into source control or use a tool like [Trove](https://github.com/ankane/trove) to store it.
 
  Predict with:
 
@@ -343,9 +343,7 @@ df = Daru::DataFrame.from_csv("houses.csv")
  Eps::Model.new(df, target: "price")
  ```
 
- ### CSVs
-
- When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+ When reading CSV files directly, be sure to convert numeric fields. The `table` method does this automatically.
 
  ```ruby
  CSV.table("data.csv").map { |row| row.to_h }
@@ -375,7 +373,11 @@ Eps::Model.new(data, learning_rate: 0.01)
 
  ### Linear Regression
 
- #### Performance
+ By default, an intercept is included. Disable this with:
+
+ ```ruby
+ Eps::Model.new(data, intercept: false)
+ ```
 
  To speed up training on large datasets with linear regression, [install GSL](https://github.com/ankane/gslr#gsl-installation). With Homebrew, you can use:
 
@@ -391,14 +393,6 @@ gem 'gslr', group: :development
 
  It only needs to be available in environments used to build the model.
 
- #### Options
-
- By default, an intercept is included. Disable this with:
-
- ```ruby
- Eps::Model.new(data, intercept: false)
- ```
-
  ## Probability
 
  To get the probability of each category for predictions with classification, use:
@@ -3,6 +3,7 @@ module Eps
  def initialize(data = nil, y = nil, **options)
  @options = options.dup
  @trained = false
+ @text_encoders = {}
  # TODO better pattern - don't pass most options to train
  train(data, y, **options) if data
  end
@@ -209,29 +210,38 @@ module Eps
  [data, target]
  end
 
- def prep_text_features(train_set)
- @text_encoders = {}
+ def prep_text_features(train_set, fit: true)
  @text_features.each do |k, v|
- # reset vocabulary
- v.delete(:vocabulary)
+ if fit
+ # reset vocabulary
+ v.delete(:vocabulary)
+
+ # TODO determine max features automatically
+ # start based on number of rows
+ encoder = Eps::TextEncoder.new(**v)
+ counts = encoder.fit(train_set.columns.delete(k))
+ else
+ encoder = @text_encoders[k]
+ counts = encoder.transform(train_set.columns.delete(k))
+ end
 
- # TODO determine max features automatically
- # start based on number of rows
- encoder = Eps::TextEncoder.new(**v)
- counts = encoder.fit(train_set.columns.delete(k))
  encoder.vocabulary.each do |word|
  train_set.columns[[k, word]] = [0] * counts.size
  end
+
  counts.each_with_index do |ci, i|
  ci.each do |word, count|
  word_key = [k, word]
  train_set.columns[word_key][i] = 1 if train_set.columns.key?(word_key)
  end
  end
- @text_encoders[k] = encoder
 
- # update vocabulary
- v[:vocabulary] = encoder.vocabulary
+ if fit
+ @text_encoders[k] = encoder
+
+ # update vocabulary
+ v[:vocabulary] = encoder.vocabulary
+ end
  end
 
  raise "No features left" if train_set.columns.empty?
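The `fit:` flag in this hunk separates learning a vocabulary from applying one, so a validation set is encoded with the words learned from the training set rather than its own. A minimal sketch of that split follows; the helpers `count_words`, `fit_vocabulary`, and `encode` are hypothetical stand-ins, not `Eps::TextEncoder`.

```ruby
# Count words per row; a stand-in for what an encoder's transform might return.
def count_words(rows)
  rows.map do |text|
    text.split(/\s+/).each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  end
end

# "Fit": learn the vocabulary from the training rows only.
def fit_vocabulary(rows)
  count_words(rows).flat_map(&:keys).uniq.sort
end

# "Transform": build one 0/1 presence column per already-known word.
# Words outside the vocabulary are ignored, as in the diff's key? check.
def encode(vocabulary, rows)
  counts = count_words(rows)
  vocabulary.map { |word| [word, counts.map { |c| c.key?(word) ? 1 : 0 }] }.to_h
end

vocabulary = fit_vocabulary(["big house", "small house"])
train_cols = encode(vocabulary, ["big house", "small house"])
valid_cols = encode(vocabulary, ["big barn"])
```

Both sets end up with the same columns in the same order, which is what a downstream model trained on `train_cols` needs; `barn` is simply dropped because it was never seen during fitting.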
@@ -19,13 +19,7 @@ module Eps
  # sparse matrix
  @text_features.each do |k, v|
  encoder = TextEncoder.new(**v)
-
- values = data.columns.delete(k)
- counts = encoder.transform(values)
-
- encoder.vocabulary.each do |word|
- data.columns[[k, word]] = [0] * values.size
- end
+ counts = encoder.transform(data.columns[k])
 
  counts.each_with_index do |xc, i|
  row = rows[i]
@@ -52,7 +46,7 @@ module Eps
  @trees.each_slice(num_trees).each do |trees|
  tree_scores << sum_trees(rows, trees)
  end
- data.size.times.map do |i|
+ rows.size.times.map do |i|
  v = tree_scores.map { |s| s[i] }
  if probabilities
  exp = v.map { |vi| Math.exp(vi) }
data/lib/eps/lightgbm.rb CHANGED
@@ -10,7 +10,7 @@ module Eps
  str << "Model needs more data for better predictions\n"
  else
  str << "Most important features\n"
- @importance_keys.zip(importance).sort_by { |k, v| [-v, k] }.first(10).each do |k, v|
+ @importance_keys.zip(importance).sort_by { |k, v| [-v, display_field(k)] }.first(10).each do |k, v|
  str << "#{display_field(k)}: #{(100 * v / total).round}\n"
  end
  end
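One plausible reason for sorting on `display_field(k)` instead of `k`: elsewhere in this diff, text feature columns are keyed by `[k, word]` arrays while other columns are keyed by plain strings, and Ruby cannot order a String against an Array when importances tie. The sketch below illustrates this under that assumption; `display_label` is a hypothetical stand-in for `display_field`, whose real output format is not shown in the diff.

```ruby
# Hypothetical stand-in for display_field: turn any key into a String.
def display_label(key)
  key.is_a?(Array) ? "#{key[0]}(#{key[1]})" : key.to_s
end

# Two features with equal importance, one String key and one Array key
importance = { "bedrooms" => 0.5, ["description", "hill"] => 0.5 }

# Sorting on the raw keys fails once the tie-breaker compares String with Array
raw_sort_ok =
  begin
    importance.sort_by { |k, v| [-v, k] }
    true
  rescue ArgumentError
    false
  end

# Sorting on string labels always has a well-defined order
sorted = importance.sort_by { |k, v| [-v, display_label(k)] }
```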
@@ -57,7 +57,7 @@ module Eps
 
  # text feature encoding
  prep_text_features(train_set)
- prep_text_features(validation_set) if validation_set
+ prep_text_features(validation_set, fit: false) if validation_set
 
  # create params
  params = {
@@ -71,7 +71,7 @@ module Eps
  end
 
  # create datasets
- categorical_idx = @features.values.map.with_index.select { |type, _| type == "categorical" }.map(&:last)
+ categorical_idx = train_set.columns.keys.map.with_index.select { |k, _| @features[k] == "categorical" }.map(&:last)
  train_ds = ::LightGBM::Dataset.new(train_set.map_rows(&:to_a), label: train_set.label, weight: train_set.weight, categorical_feature: categorical_idx, params: params)
  validation_ds = ::LightGBM::Dataset.new(validation_set.map_rows(&:to_a), label: validation_set.label, weight: validation_set.weight, categorical_feature: categorical_idx, params: params, reference: train_ds) if validation_set
 
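The `categorical_idx` change matters once text encoding has altered the column layout: positions taken from the declared features no longer line up with the columns actually handed to LightGBM. A small sketch with made-up feature names and an assumed column order after text encoding shows the mismatch:

```ruby
# Declared feature types (illustrative values, not from the gem)
features = { "description" => "text", "city" => "categorical", "bedrooms" => "numeric" }

# Assumed layout after text encoding: the text column is removed and
# per-word columns are appended under [feature, word] keys.
columns = ["city", "bedrooms", ["description", "hill"], ["description", "house"]]

# Old approach: positions within the declared features
old_idx = features.values.map.with_index.select { |type, _| type == "categorical" }.map(&:last)

# New approach: positions within the columns that are actually trained on
new_idx = columns.map.with_index.select { |k, _| features[k] == "categorical" }.map(&:last)
```

Here `old_idx` is `[1]`, which points at `bedrooms` in the real column order, while `new_idx` is `[0]`, the actual position of `city`.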
@@ -144,7 +144,10 @@ module Eps
  end
 
  if bad_observations.any?
- raise "Bug detected in evaluator. Please report an issue. Bad data points: #{bad_observations.inspect}"
+ bad_observations.each do |obs|
+ p obs
+ end
+ raise "Bug detected in evaluator. Please report an issue."
  end
  end
 
@@ -222,7 +222,7 @@ module Eps
  else
  operator = xml_predicate.attribute("operator").value
  value = xml_predicate.attribute("value").value
- value = value.to_f if operator == "greaterThan"
+ value = value.to_f if operator == "greaterThan" || operator == "lessOrEqual"
  field = xml_predicate.attribute("field").value
  field = derived_fields[field] if derived_fields[field]
  {
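The `lessOrEqual` fix comes down to type conversion: XML attribute values are read as strings, and comparing a Float with a String raises in Ruby. The sketch below illustrates this with hypothetical helpers; `parse_predicate`, `matches?`, and the hash shape are assumptions for illustration, not the gem's internals.

```ruby
# Hypothetical parser: convert the raw string for operators that compare numbers.
def parse_predicate(operator, raw_value)
  numeric = ["greaterThan", "lessOrEqual"].include?(operator)
  { operator: operator, value: numeric ? raw_value.to_f : raw_value }
end

# Hypothetical evaluator for a single predicate against a feature value x.
def matches?(predicate, x)
  case predicate[:operator]
  when "greaterThan" then x > predicate[:value]
  when "lessOrEqual" then x <= predicate[:value]
  else x == predicate[:value]
  end
end

parsed = parse_predicate("lessOrEqual", "10")
result = matches?(parsed, 5.0)

# Without the conversion, Float <= String raises ArgumentError
unconverted_error =
  begin
    matches?({ operator: "lessOrEqual", value: "10" }, 5.0)
    nil
  rescue ArgumentError => e
    e
  end
```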
data/lib/eps/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Eps
- VERSION = "0.3.5"
+ VERSION = "0.3.9"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: eps
  version: !ruby/object:Gem::Version
- version: 0.3.5
+ version: 0.3.9
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-06-11 00:00:00.000000000 Z
+ date: 2021-10-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: lightgbm
@@ -38,92 +38,8 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- - !ruby/object:Gem::Dependency
- name: bundler
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: daru
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: minitest
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: numo-narray
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: rake
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: rover-df
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '0'
- description:
- email: andrew@chartkick.com
+ description:
+ email: andrew@ankane.org
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -156,7 +72,7 @@ homepage: https://github.com/ankane/eps
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -171,8 +87,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.22
+ signing_key:
  specification_version: 4
  summary: Machine learning for Ruby. Supports regression (linear regression) and classification
  (naive Bayes)