fselector 0.2.0 → 0.3.0

data/README.md CHANGED
@@ -8,30 +8,41 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 0.2.0
- **Release Date**: April 1st 2012
+ **Latest Version**: 0.3.0
+ **Release Date**: April 3rd 2012
 
  Synopsis
  --------
 
- FSelector is a Ruby gem that aims to integrate various feature selection/ranking
- algorithms into one single package. Welcome to contact me (need47@gmail.com)
- if you want to contribute your own algorithms or report a bug. FSelector enables
- the user to perform feature selection by using either a single algorithm or an
- ensemble of algorithms. FSelector acts on a full-feature data set with CSV, LibSVM
- or WEKA file format and outputs a reduced data set with only selected subset of
- features, which can later be used as the input for various machine learning softwares
- including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
- learning algorithms such as support vector machines and random forest. Below is a
- summary of FSelector's features.
+ FSelector is a Ruby gem that aims to integrate various feature
+ selection/ranking algorithms and related functions into one single
+ package. Welcome to contact me (need47@gmail.com) if you'd like to
+ contribute your own algorithms or report a bug. FSelector allows user
+ to perform feature selection by using either a single algorithm or an
+ ensemble of multiple algorithms, and other common tasks including
+ normalization and discretization on continuous data, as well as replace
+ missing feature values with certain criterion. FSelector acts on a
+ full-feature data set in either CSV, LibSVM or WEKA file format and
+ outputs a reduced data set with only selected subset of features, which
+ can later be used as the input for various machine learning softwares
+ including LibSVM and WEKA. FSelector, itself, does not implement
+ any of the machine learning algorithms such as support vector machines
+ and random forest. See below for a list of FSelector's features.
 
  Feature List
  ------------
 
- **1. available feature selection/ranking algorithms**
+ **1. supported input/output file types**
+
+ - csv
+ - libsvm
+ - weka ARFF
+ - random data (for test purpose)
+
+ **2. available feature selection/ranking algorithms**
 
- algorithm alias feature type
- -------------------------------------------------------
+ algorithm alias feature type
+ --------------------------------------------------------
  Accuracy Acc discrete
  AccuracyBalanced Acc2 discrete
  BiNormalSeparation BNS discrete
@@ -67,29 +78,31 @@ Feature List
  ReliefF_c ReliefF_c continuous
  TScore TS continuous
 
- **2. feature selection approaches**
+ **3. feature selection approaches**
 
  - by a single algorithm
  - by multiple algorithms in a tandem manner
  - by multiple algorithms in a consensus manner
 
- **3. availabe normalization and discretization algorithms for continuous feature**
+ **4. availabe normalization and discretization algorithms for continuous feature**
 
  algorithm note
- --------------------------------------------------------------------
- log normalization by logarithmic transformation
- min_max normalization by scaling into [min, max]
- zscore normalization by converting into zscore
- equal_width discretization by equal width among intervals
- equal_frequency discretization by equal frequency among intervals
- ChiMerge discretization by ChiMerge method
-
- **4. supported input/output file types**
-
- - csv
- - libsvm
- - weka ARFF
- - random data (for test purpose)
+ -----------------------------------------------------------------
+ log normalize by logarithmic transformation
+ min_max normalize by scaling into [min, max]
+ zscore normalize by converting into zscore
+ equal_width discretize by equal width among intervals
+ equal_frequency discretize by equal frequency among intervals
+ ChiMerge discretize by ChiMerge method
+ MID discretize by Multi-Interval Discretization
+
+ **5. availabe algorithms for replacing missing feature values**
+
+ algorithm note feature type
+ --------------------------------------------------------------------------------------
+ fixed_value replace with a fixed value discrete, continuous
+ mean_value replace with the mean feature value continuous
+ most_seen_value replace with most seen feature value discrete
 
  Installing
  ----------
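The three missing-value strategies added in the table above can be sketched in plain Ruby. This is a standalone illustration on a flat array with `nil` marking a missing value, not the gem's API (the gem works on its internal class-to-samples Hash instead):

```ruby
# Sketch of the fixed_value / mean_value / most_seen_value strategies
# from the feature list above; nil marks a missing value.
def replace_missing(values, strategy, fixed = 0)
  present = values.compact
  fill =
    case strategy
    when :fixed_value     then fixed
    when :mean_value      then present.sum.to_f / present.size
    when :most_seen_value then present.max_by { |v| present.count(v) }
    end
  values.map { |v| v.nil? ? fill : v }
end
```

For example, `replace_missing([1.0, nil, 2.0], :mean_value)` fills the gap with the mean 1.5 of the observed values.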
@@ -187,11 +200,11 @@ Usage
  r1.data_from_csv('test/iris.csv')
 
  # normalization by log2 (optional)
- # r1.normalize_log!(2)
+ # r1.normalize_by_log!(2)
 
  # discretization by ChiMerge algorithm
  # chi-squared value = 4.60 for a three-class problem at alpha=0.10
- r1.discretize_by_chimerge!(4.60)
+ r1.discretize_by_ChiMerge!(4.60)
 
  # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
  # initialize with discretized data from r1
@@ -3,7 +3,7 @@
  #
  module FSelector
  # module version
- VERSION = '0.2.0'
+ VERSION = '0.3.0'
  end
 
  ROOT = File.expand_path(File.dirname(__FILE__))
@@ -11,9 +11,14 @@ ROOT = File.expand_path(File.dirname(__FILE__))
  #
  # include necessary files
  #
+ # read and write file, supported formats include CSV, LibSVM and WEKA files
  require "#{ROOT}/fselector/fileio.rb"
+ # extend Array and String class
  require "#{ROOT}/fselector/util.rb"
+ # entropy-related functions
  require "#{ROOT}/fselector/entropy.rb"
+ # replace missing values
+ require "#{ROOT}/fselector/replace_missing_values.rb"
 
  #
  # base class
@@ -8,6 +8,8 @@ module FSelector
  class Base
  # include FileIO
  include FileIO
+ # include ReplaceMissingValues
+ include ReplaceMissingValues
 
  # initialize from an existing data structure
  def initialize(data=nil)
@@ -167,13 +169,13 @@ module FSelector
  def set_data(data)
  if data and data.class == Hash
  @data = data
- # clear
- @classes, @features, @fvs = nil, nil, nil
- @scores, @ranks, @sz = nil, nil, nil
+ # clear variables
+ clear_vars
  else
  abort "[#{__FILE__}@#{__LINE__}]: "+
  "data must be a Hash object!"
  end
+
  data
  end
 
@@ -335,6 +337,14 @@ module FSelector
 
  private
 
+ # clear variables when data structure is altered
+ def clear_vars
+ @classes, @features, @fvs = nil, nil, nil
+ @scores, @ranks, @sz = nil, nil, nil
+ @cv, @fvs = nil, nil
+ end
+
+
  # set feature (f) score (s) for class (k)
  def set_feature_score(f, k, s)
  @scores ||= {}
@@ -342,6 +352,7 @@
  @scores[f][k] = s
  end
 
+
  # get subset of feature
  def get_feature_subset
  abort "[#{__FILE__}@#{__LINE__}]: "+
@@ -21,6 +21,9 @@ module FSelector
 
  # use sequential forward search
  def get_feature_subset
+ # handle missing values
+ handle_missing_value
+
  subset = []
  feats = get_features.dup
 
@@ -58,6 +61,15 @@
  end # get_feature_subset
 
 
+ # handle missing values
+ # CFS replaces missing values with the mean for continous features and
+ # the most seen value for discrete features
+ def handle_missing_values
+ abort "[#{__FILE__}@#{__LINE__}]: "+
+ "derived CFS algo must implement its own handle_missing_values()"
+ end
+
+
  # calc new merit of subset when adding feature (f)
  def calc_merit(subset, f)
  k = subset.size.to_f + 1
@@ -10,8 +10,8 @@ module FSelector
  class BaseContinuous < Base
  # include normalizer
  include Normalizer
- # include discretilizer
- include Discretilizer
+ # include discretizer
+ include Discretizer
 
  # initialize from an existing data structure
  def initialize(data=nil)
@@ -8,8 +8,18 @@ module FSelector
  # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf)
  #
  class CFS_c < BaseCFS
+ # include normalizer and discretizer
+ include Normalizer
+ include Discretizer
 
  private
+
+
+ # replace missing values with mean feature value
+ def handle_missing_values
+ replace_with_mean_value!
+ end
+
 
  # calc the feature-class correlation of two vectors
  def do_rcf(cv, fv)
@@ -10,6 +10,9 @@ module FSelector
  # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
  #
  class ReliefF_c < BaseReliefF
+ # include normalizer and discretizer
+ include Normalizer
+ include Discretizer
 
  private
 
@@ -10,6 +10,9 @@ module FSelector
  # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
  #
  class Relief_c < BaseRelief
+ # include normalizer and discretizer
+ include Normalizer
+ include Discretizer
 
  private
 
@@ -1,7 +1,10 @@
  #
- # discretilize continous feature
+ # discretize continous feature
  #
- module Discretilizer
+ module Discretizer
+ # include Entropy module
+ include Entropy
+
  # discretize by equal-width intervals
  #
  # @param [Integer] n_interval
@@ -84,7 +87,7 @@ module Discretilizer
  # 2 4.60 5.99 9.21 13.82
  # 3 6.35 7.82 11.34 16.27
  #
- def discretize_by_chimerge!(chisq)
+ def discretize_by_ChiMerge!(chisq)
  # chisq = 4.60 # for iris::Sepal.Length
  # for intialization
  hzero = {}
@@ -177,19 +180,71 @@ module Discretilizer
  end
  end
 
- end # discretize_chimerge!
+ end # discretize_ChiMerge!
+
+
+ #
+ # discretize by Multi-Interval Discretization (MID) algorithm
+ # @note no missing feature values allowed and data structure will be altered
+ #
+ # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
+ #
+ def discretize_by_MID!
+ # determine the final boundaries
+ f2cp = {} # cut points for each feature
+ each_feature do |f|
+ cv = get_class_labels
+ # we assume no missing feature values
+ fv = get_feature_values(f)
+
+ n = cv.size
+ # sort cv and fv according ascending order of fv
+ sis = (0...n).to_a.sort { |i,j| fv[i] <=> fv[j] }
+ cv = cv.values_at(*sis)
+ fv = fv.values_at(*sis)
+
+ # get initial boundaries
+ bs = []
+ fv.each_with_index do |v, i|
+ # cut point (Ta) for feature A must always be a value between
+ # two examples of different classes in the sequence of sorted examples
+ # see orginal reference
+ if i < n-1 and cv[i] != cv[i+1]
+ bs << (v+fv[i+1])/2.0
+ end
+ end
+ bs.uniq! # remove duplicates
+
+ # main algorithm, iteratively determine cut point
+ cp = []
+ partition(cv, fv, bs, cp)
+
+ # add the rightmost boundary for convenience
+ cp << fv.max+1.0
+ # record cut points for feature (f)
+ f2cp[f] = cp
+ end
+
+ # discretize based on cut points
+ each_sample do |k, s|
+ s.keys.each do |f|
+ s[f] = get_index(s[f], f2cp[f])
+ end
+ end
+
+ end # discretize_by_MID!
 
  private
 
  # get index from sorted boundaries
  #
  # min -- | -- | -- | ... max |
- # b0 b1 b2 bn(=max+1)
- # 0 1 2 ... n
+ # b1 b2 b3 bn(=max+1)
+ # 1 2 3 ... n
  #
  def get_index(v, boundaries)
  boundaries.each_with_index do |b, i|
- return i if v < b
+ return i+1 if v < b
  end
  end # get_index
 
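The boundary-indexing change above renumbers intervals from 1 instead of 0. A minimal standalone version of the patched `get_index` makes the new behavior easy to check (the trailing fallback for a value beyond the last boundary is an addition here; the gem avoids that case by appending `max+1.0` as the rightmost boundary):

```ruby
# Minimal standalone version of the patched get_index: returns the
# 1-based interval a value falls into, given sorted upper boundaries.
def get_index(v, boundaries)
  boundaries.each_with_index do |b, i|
    return i + 1 if v < b
  end
  boundaries.size # past the last boundary (gem appends max+1 to avoid this)
end
```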
@@ -215,4 +270,103 @@ module Discretilizer
  end # calc_chisq
 
 
+ #
+ # Multi-Interval Discretization main algorithm
+ # recursively always selecting the best cut point
+ #
+ # @param [Array] cv class labels
+ # @param [Array] fv feature values
+ # @param [Array] bs potential cut points
+ # @param [Array] cp resultant cut points
+ def partition(cv, fv, bs, cp)
+ # best cut point
+ cp_best = nil
+
+ # binary subset at the best cut point
+ cv1_best, cv2_best = nil, nil
+ fv1_best, fv2_best = nil, nil
+ bs1_best, bs2_best = nil, nil
+
+ # best information gain
+ gain_best = -100.0
+ ent_best = -100.0
+ ent1_best = -100.0
+ ent2_best = -100.0
+
+ # try each potential cut point
+ bs.each do |b|
+ # binary split
+ cv1_try, cv2_try, fv1_try, fv2_try, bs1_try, bs2_try =
+ binary_split(cv, fv, bs, b)
+
+ # gain for this cut point
+ ent_try = get_marginal_entropy(cv)
+ ent1_try = get_marginal_entropy(cv1_try)
+ ent2_try = get_marginal_entropy(cv2_try)
+ gain_try = ent_try -
+ (cv1_try.size.to_f/cv.size) * ent1_try -
+ (cv2_try.size.to_f/cv.size) * ent2_try
+
+ #pp gain_try
+ if gain_try > gain_best
+ cp_best = b
+ cv1_best, cv2_best = cv1_try, cv2_try
+ fv1_best, fv2_best = fv1_try, fv2_try
+ bs1_best, bs2_best = bs1_try, bs2_try
+
+ gain_best = gain_try
+ ent_best = ent_try
+ ent1_best, ent2_best = ent1_try, ent2_try
+ end
+ end
+
+ # to cut or not to cut?
+ #
+ # Gain(A,T;S) > 1/N * log2(N-1) + 1/N * delta(A,T;S)
+ if cp_best
+ n = cv.size.to_f
+ k = cv.uniq.size.to_f
+ k1 = cv1_best.uniq.size.to_f
+ k2 = cv2_best.uniq.size.to_f
+ delta = Math.log2(3**k-2)-(k*ent_best - k1*ent1_best - k2*ent2_best)
+
+ # accept cut point
+ if gain_best > (Math.log2(n-1)/n + delta/n)
+ # a: record cut point
+ cp << cp_best
+
+ # b: recursively call on subset
+ partition(cv1_best, fv1_best, bs1_best, cp)
+ partition(cv2_best, fv2_best, bs2_best, cp)
+ end
+ end
+ end
+
+
+ # binarily split based on a cut point
+ def binary_split(cv, fv, bs, cut_point)
+ cv1, cv2, fv1, fv2, bs1, bs2 = [], [], [], [], [], []
+ fv.each_with_index do |v, i|
+ if v < cut_point
+ cv1 << cv[i]
+ fv1 << v
+ else
+ cv2 << cv[i]
+ fv2 << v
+ end
+ end
+
+ bs.each do |b|
+ if b < cut_point
+ bs1 << b
+ else
+ bs2 << b
+ end
+ end
+
+ # return subset
+ [cv1, cv2, fv1, fv2, bs1, bs2]
+ end
+
+
  end # module
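The acceptance test in the new `partition` is Fayyad and Irani's MDL stopping rule: a cut is kept only when Gain(A,T;S) > log2(N-1)/N + delta(A,T;S)/N, with delta = log2(3^k - 2) - [k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)]. A condensed, self-contained check of that rule (entropy helper included; a sketch, not the gem's code):

```ruby
# Shannon entropy of a label array, H(X) = -sigma_i P(x_i) log2 P(x_i)
def entropy(labels)
  n = labels.size.to_f
  labels.uniq.sum { |l| p = labels.count(l) / n; -p * Math.log2(p) }
end

# MDL rule from partition above: keep the binary split of `labels`
# into `left`/`right` only when the gain beats the coding cost.
def accept_cut?(labels, left, right)
  n    = labels.size.to_f
  gain = entropy(labels) -
         (left.size / n) * entropy(left) -
         (right.size / n) * entropy(right)
  k, k1, k2 = [labels, left, right].map { |a| a.uniq.size.to_f }
  delta = Math.log2(3**k - 2) -
          (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
  gain > Math.log2(n - 1) / n + delta / n
end
```

A perfectly class-separating cut is accepted, while a split that leaves both sides as mixed as the whole is rejected.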
@@ -3,7 +3,7 @@
  #
  module Normalizer
  # log transformation, requires positive feature values
- def normalize_log!(base=10)
+ def normalize_by_log!(base=10)
  each_sample do |k, s|
  s.keys.each do |f|
  s[f] = Math.log(s[f], base) if s[f] > 0.0
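The renamed `normalize_by_log!` takes the base-`base` logarithm of each positive feature value in place. The same transformation, sketched on a simplified `{ feature => value }` Hash (a hypothetical layout for illustration; the gem iterates its samples via `each_sample`):

```ruby
# In-place log transform of positive values, mirroring normalize_by_log!
# on one simplified { feature => value } sample; non-positive values
# are left untouched, as in the gem.
def normalize_by_log!(sample, base = 10)
  sample.keys.each do |f|
    sample[f] = Math.log(sample[f], base) if sample[f] > 0.0
  end
  sample
end
```

For example, with base 2 a value of 8.0 becomes ~3.0 while a negative value passes through unchanged.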
@@ -13,7 +13,7 @@ module Normalizer
 
 
  # scale to [min,max], max > min
- def normalize_min_max!(min=0.0, max=1.0)
+ def normalize_by_min_max!(min=0.0, max=1.0)
  # first determine min and max for each feature
  f2min_max = {}
 
@@ -33,7 +33,7 @@
 
 
  # by z-score
- def normalize_zscore!
+ def normalize_by_zscore!
  # first determine mean and sd for each feature
  f2mean_sd = {}
 
@@ -12,6 +12,12 @@ module FSelector
  include Entropy
 
  private
+
+ # replace missing values with most seen feature value
+ def handle_missing_values
+ replace_with_most_seen_value!
+ end
+
 
  # calc the feature-class correlation of two vectors
  def do_rcf(cv, fv)
@@ -7,16 +7,16 @@ module Entropy
  #
  # H(X) = -1 * sigma_i (P(x_i) logP(x_i))
  #
- def get_marginal_entropy(arrX)
+ def get_marginal_entropy(arrX)
  h = 0.0
  n = arrX.size.to_f
-
- arrX.uniq.each do |x_i|
- p = arrX.count(x_i)/n
- h += -1.0 * (p * Math.log2(p))
- end
-
- h
+
+ arrX.uniq.each do |x_i|
+ p = arrX.count(x_i)/n
+ h += -1.0 * (p * Math.log2(p))
+ end
+
+ h
  end # get_marginal_entropy
 
 
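The hunk above is a whitespace-only reindent of `get_marginal_entropy`, which computes H(X) = -Σ P(x_i) log2 P(x_i). The same calculation as a compact standalone function (a sketch equivalent to the gem's method, written with `reduce` instead of an accumulator):

```ruby
# Standalone H(X) = -sigma_i P(x_i) * log2(P(x_i)), equivalent to the
# reindented get_marginal_entropy above.
def get_marginal_entropy(arr_x)
  n = arr_x.size.to_f
  arr_x.uniq.reduce(0.0) do |h, x_i|
    p = arr_x.count(x_i) / n
    h - p * Math.log2(p)
  end
end
```

A balanced two-class array has exactly one bit of entropy; a pure array has zero.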
@@ -27,28 +27,28 @@ module Entropy
  #
  # where H(X|y_j) = -1 * sigma_i (P(x_i|y_j) logP(x_i|y_j))
  #
- def get_conditional_entropy(arrX, arrY)
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "array must be of same length" if not arrX.size == arrY.size
-
+ def get_conditional_entropy(arrX, arrY)
+ abort "[#{__FILE__}@#{__LINE__}]: "+
+ "array must be of same length" if not arrX.size == arrY.size
+
  hxy = 0.0
- n = arrX.size.to_f
-
- arrY.uniq.each do |y_j|
- p1 = arrY.count(y_j)/n
-
- indices = (0...n).to_a.select { |k| arrY[k] == y_j }
- xvs = arrX.values_at(*indices)
- m = xvs.size.to_f
-
- xvs.uniq.each do |x_i|
- p2 = xvs.count(x_i)/m
-
- hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+ n = arrX.size.to_f
+
+ arrY.uniq.each do |y_j|
+ p1 = arrY.count(y_j)/n
+
+ indices = (0...n).to_a.select { |k| arrY[k] == y_j }
+ xvs = arrX.values_at(*indices)
+ m = xvs.size.to_f
+
+ xvs.uniq.each do |x_i|
+ p2 = xvs.count(x_i)/m
+
+ hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+ end
  end
- end
-
- hxy
+
+ hxy
  end # get_conditional_entropy
 
 
@@ -60,11 +60,11 @@ module Entropy
  #
  # i.e. H(X,Y) == H(Y,X)
  #
- def get_joint_entropy(arrX, arrY)
+ def get_joint_entropy(arrX, arrY)
  abort "[#{__FILE__}@#{__LINE__}]: "+
  "array must be of same length" if not arrX.size == arrY.size
-
- get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
+
+ get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
  end # get_joint_entropy
 
 
@@ -110,10 +110,22 @@ module FileIO
  ofs = File.open(fname, 'w')
  end
 
+ # convert class label to integer type
+ k2idx = {}
+ get_classes.each_with_index do |k, i|
+ k2idx[k] = i+1
+ end
+
+ # convert feature to integer type
+ f2idx = {}
+ get_features.each_with_index do |f, i|
+ f2idx[f] = i+1
+ end
+
  each_sample do |k, s|
- ofs.print "#{k} "
+ ofs.print "#{k2idx[k]} "
  s.keys.sort { |x, y| x.to_s.to_i <=> y.to_s.to_i }.each do |i|
- ofs.print " #{i}:#{s[i]}" if not s[i].zero?
+ ofs.print " #{f2idx[i]}:#{s[i]}" if not s[i].zero? # implicit mode
  end
  ofs.puts
  end
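The patch above maps arbitrary class labels and feature names to 1-based integers before writing LibSVM lines, since that format requires a numeric label followed by sparse `index:value` pairs. A standalone sketch of the conversion for a single sample (simplified layout and a hypothetical helper name; zero values are skipped, as in the gem's implicit mode):

```ruby
# Build 1-based label/feature index maps and format one LibSVM line,
# mirroring the patched data_to_libsvm logic on a simplified sample.
def to_libsvm_line(label, sample, classes, features)
  k2idx = {}
  classes.each_with_index { |k, i| k2idx[k] = i + 1 }
  f2idx = {}
  features.each_with_index { |f, i| f2idx[f] = i + 1 }

  pairs = features.select { |f| !sample[f].to_f.zero? } # implicit mode
                  .map { |f| "#{f2idx[f]}:#{sample[f]}" }
  "#{k2idx[label]} #{pairs.join(' ')}"
end
```

For example, class `'yes'` (second of two classes) with a nonzero first feature serializes as `2 1:1.5`.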
@@ -171,7 +183,7 @@ module FileIO
  end
  else
  abort "[#{__FILE__}@#{__LINE__}]: "+
- "1st and 2nd row must have same fields"
+ "the first two rows must have same number of fields"
  end
  else # data rows
  label, *fvs = ln.chomp.split(/,/)
@@ -0,0 +1,78 @@
+ #
+ # replace missing feature values
+ #
+ module ReplaceMissingValues
+ #
+ # replace missing feature value with a fixed value
+ # applicable for both discrete and continuous feature
+ # @note data structure will be altered
+ #
+ def replace_with_fixed_value!(val)
+ each_sample do |k, s|
+ each_feature do |f|
+ if not s.has_key? f
+ s[f] = my_value
+ end
+ end
+ end
+
+ # clear variables
+ clear_vars
+ end # replace_fixed_value
+
+
+ #
+ # replace missing feature value with mean feature value
+ # applicable only to continuous feature
+ # @note data structure will be altered
+ #
+ def replace_with_mean_value!
+ each_sample do |k, s|
+ each_feature do |f|
+ fv = get_feature_values(f)
+ next if fv.size == get_sample_size # no missing values
+
+ mean = fv.ave
+ if not s.has_key? f
+ s[f] = mean
+ end
+ end
+ end
+
+ # clear variables
+ clear_vars
+ end # replace_with_mean_value!
+
+
+ #
+ # replace missing feature value with most seen feature value
+ # applicable only to discrete feature
+ # @note data structure will be altered
+ #
+ def replace_with_most_seen_value!
+ each_sample do |k, s|
+ each_feature do |f|
+ fv = get_feature_values(f)
+ next if fv.size == get_sample_size # no missing values
+
+ seen_count, seen_value = 0, nil
+ fv.uniq.each do |v|
+ count = fv.count(v)
+ if count > seen_count
+ seen_count = count
+ seen_value = v
+ end
+ end
+
+ if not s.has_key? f
+ s[f] = seen_value
+ end
+ end
+ end
+
+ # clear variables
+ clear_vars
+ end # replace_with_mean_value!
+
+
+ end # ReplaceMissingValues
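Note that in the new `replace_with_fixed_value!` above, the body assigns `s[f] = my_value` while the method's parameter is named `val`; as released, `my_value` is undefined and that branch would raise a NameError. A corrected standalone sketch of the intended behavior, simplified to a `{ feature => value }` sample (a hypothetical layout for illustration):

```ruby
# Corrected sketch of replace_with_fixed_value!: the fixed value passed
# in (`val`) fills each missing feature slot of one sample.
def replace_with_fixed_value!(sample, features, val)
  features.each do |f|
    sample[f] = val unless sample.has_key?(f)
  end
  sample
end
```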
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: fselector
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.3.0
  prerelease:
  platform: ruby
  authors:
@@ -9,17 +9,19 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-04-02 00:00:00.000000000 Z
+ date: 2012-04-03 00:00:00.000000000 Z
  dependencies: []
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
- algorithms into one single package. Welcome to contact me (need47@gmail.com) if
- you want to contribute your own algorithms or report a bug. FSelector enables the
- user to perform feature selection by using either a single algorithm or an ensemble
- of algorithms. FSelector acts on a full-feature data set with CSV, LibSVM or WEKA
- file format and outputs a reduced data set with only selected subset of features,
- which can later be used as the input for various machine learning softwares including
- LibSVM and WEKA. FSelector, itself, does not implement any of the machine learning
- algorithms such as support vector machines and random forest.
+ algorithms and related functions into one single package. Welcome to contact me
+ (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
+ FSelector allows user to perform feature selection by using either a single algorithm
+ or an ensemble of multiple algorithms, and other common tasks including normalization
+ and discretization on continuous data, as well as replace missing feature values
+ with certain criterion. FSelector acts on a full-feature data set in either CSV,
+ LibSVM or WEKA file format and outputs a reduced data set with only selected subset
+ of features, which can later be used as the input for various machine learning softwares
+ including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
+ learning algorithms such as support vector machines and random forest.
  email: need47@gmail.com
  executables: []
  extensions: []
@@ -73,6 +75,7 @@ files:
  - lib/fselector/ensemble.rb
  - lib/fselector/entropy.rb
  - lib/fselector/fileio.rb
+ - lib/fselector/replace_missing_values.rb
  - lib/fselector/util.rb
  - lib/fselector.rb
  homepage: http://github.com/need47/fselector