fselector 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -8,30 +8,41 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 0.2.0
- **Release Date**: April 1st 2012
+ **Latest Version**: 0.3.0
+ **Release Date**: April 3rd 2012

  Synopsis
  --------

- FSelector is a Ruby gem that aims to integrate various feature selection/ranking
- algorithms into one single package. Welcome to contact me (need47@gmail.com)
- if you want to contribute your own algorithms or report a bug. FSelector enables
- the user to perform feature selection by using either a single algorithm or an
- ensemble of algorithms. FSelector acts on a full-feature data set with CSV, LibSVM
- or WEKA file format and outputs a reduced data set with only selected subset of
- features, which can later be used as the input for various machine learning softwares
- including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
- learning algorithms such as support vector machines and random forest. Below is a
- summary of FSelector's features.
+ FSelector is a Ruby gem that aims to integrate various feature
+ selection/ranking algorithms and related functions into one single
+ package. You are welcome to contact me (need47@gmail.com) if you'd
+ like to contribute your own algorithms or report a bug. FSelector
+ allows the user to perform feature selection with either a single
+ algorithm or an ensemble of multiple algorithms, plus common tasks
+ including normalization and discretization of continuous data and
+ the replacement of missing feature values by a chosen criterion.
+ FSelector acts on a full-feature data set in CSV, LibSVM or WEKA
+ file format and outputs a reduced data set containing only the
+ selected subset of features, which can later be used as input for
+ various machine learning software packages, including LibSVM and
+ WEKA. FSelector itself does not implement any machine learning
+ algorithms such as support vector machines or random forests. See below for a list of FSelector's features.

  Feature List
  ------------

- **1. available feature selection/ranking algorithms**
+ **1. supported input/output file types**
+
+ - csv
+ - libsvm
+ - weka ARFF
+ - random data (for test purposes)
+
+ **2. available feature selection/ranking algorithms**

-     algorithm            alias       feature type
-     -------------------------------------------------------
+     algorithm            alias       feature type
+     --------------------------------------------------------
      Accuracy             Acc         discrete
      AccuracyBalanced     Acc2        discrete
      BiNormalSeparation   BNS         discrete
@@ -67,29 +78,31 @@ Feature List
      ReliefF_c            ReliefF_c   continuous
      TScore               TS          continuous

- **2. feature selection approaches**
+ **3. feature selection approaches**

  - by a single algorithm
  - by multiple algorithms in a tandem manner
  - by multiple algorithms in a consensus manner

- **3. available normalization and discretization algorithms for continuous features**
+ **4. available normalization and discretization algorithms for continuous features**

      algorithm          note
-     --------------------------------------------------------------------
-     log                normalization by logarithmic transformation
-     min_max            normalization by scaling into [min, max]
-     zscore             normalization by converting into zscore
-     equal_width        discretization by equal width among intervals
-     equal_frequency    discretization by equal frequency among intervals
-     ChiMerge           discretization by ChiMerge method
-
- **4. supported input/output file types**
-
- - csv
- - libsvm
- - weka ARFF
- - random data (for test purposes)
+     -----------------------------------------------------------------
+     log                normalize by logarithmic transformation
+     min_max            normalize by scaling into [min, max]
+     zscore             normalize by converting into zscore
+     equal_width        discretize by equal width among intervals
+     equal_frequency    discretize by equal frequency among intervals
+     ChiMerge           discretize by ChiMerge method
+     MID                discretize by Multi-Interval Discretization
+
+ **5. available algorithms for replacing missing feature values**
+
+     algorithm          note                                       feature type
+     --------------------------------------------------------------------------------------
+     fixed_value        replace with a fixed value                 discrete, continuous
+     mean_value         replace with the mean feature value        continuous
+     most_seen_value    replace with most seen feature value       discrete

  Installing
  ----------
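The tables above double as an API map: each algorithm in table 2 is exposed as a class under the FSelector namespace, and the rows of tables 4 and 5 correspond to bang methods on the data object. A minimal sketch of how the pieces combine, using only class and method names that appear elsewhere in this diff (the CSV path follows the README's own usage example below):

    require 'fselector'

    # pick an algorithm class matching your feature type (table 2)
    r = FSelector::ReliefF_c.new
    r.data_from_csv('test/iris.csv')

    # preprocess continuous features (table 4)
    r.normalize_by_zscore!
    r.discretize_by_MID!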
@@ -187,11 +200,11 @@ Usage
      r1.data_from_csv('test/iris.csv')

      # normalization by log2 (optional)
-     # r1.normalize_log!(2)
+     # r1.normalize_by_log!(2)

      # discretization by ChiMerge algorithm
      # chi-squared value = 4.60 for a three-class problem at alpha=0.10
-     r1.discretize_by_chimerge!(4.60)
+     r1.discretize_by_ChiMerge!(4.60)

      # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
      # initialize with discretized data from r1
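The snippet above exercises normalization and discretization only; the missing-value API introduced in this release (see the new ReplaceMissingValues module near the end of this diff) plugs into the same spot. A hedged sketch — iris.csv has no missing values, so these calls are purely illustrative:

    # replace missing feature values (new in 0.3.0)
    # r1.replace_with_fixed_value!(0.0)   # any feature type
    # r1.replace_with_most_seen_value!    # discrete features
    r1.replace_with_mean_value!           # continuous features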
@@ -3,7 +3,7 @@
  #
  module FSelector
    # module version
-   VERSION = '0.2.0'
+   VERSION = '0.3.0'
  end

  ROOT = File.expand_path(File.dirname(__FILE__))
@@ -11,9 +11,14 @@ ROOT = File.expand_path(File.dirname(__FILE__))
  #
  # include necessary files
  #
+ # read and write files; supported formats include CSV, LibSVM and WEKA
  require "#{ROOT}/fselector/fileio.rb"
+ # extend the Array and String classes
  require "#{ROOT}/fselector/util.rb"
+ # entropy-related functions
  require "#{ROOT}/fselector/entropy.rb"
+ # replace missing values
+ require "#{ROOT}/fselector/replace_missing_values.rb"

  #
  # base class
@@ -8,6 +8,8 @@ module FSelector
    class Base
      # include FileIO
      include FileIO
+     # include ReplaceMissingValues
+     include ReplaceMissingValues

      # initialize from an existing data structure
      def initialize(data=nil)
@@ -167,13 +169,13 @@ module FSelector
      def set_data(data)
        if data and data.class == Hash
          @data = data
-         # clear
-         @classes, @features, @fvs = nil, nil, nil
-         @scores, @ranks, @sz = nil, nil, nil
+         # clear variables
+         clear_vars
        else
          abort "[#{__FILE__}@#{__LINE__}]: "+
                "data must be a Hash object!"
        end
+
        data
      end

@@ -335,6 +337,14 @@

      private

+     # clear variables when data structure is altered
+     def clear_vars
+       @classes, @features, @fvs = nil, nil, nil
+       @scores, @ranks, @sz = nil, nil, nil
+       @cv, @fvs = nil, nil
+     end
+
+
      # set feature (f) score (s) for class (k)
      def set_feature_score(f, k, s)
        @scores ||= {}
@@ -342,6 +352,7 @@
        @scores[f][k] = s
      end

+
      # get subset of feature
      def get_feature_subset
        abort "[#{__FILE__}@#{__LINE__}]: "+
@@ -21,6 +21,9 @@ module FSelector

      # use sequential forward search
      def get_feature_subset
+       # handle missing values
+       handle_missing_values
+
        subset = []
        feats = get_features.dup

@@ -58,6 +61,15 @@
      end # get_feature_subset


+     # handle missing values
+     # CFS replaces missing values with the mean for continuous features and
+     # the most seen value for discrete features
+     def handle_missing_values
+       abort "[#{__FILE__}@#{__LINE__}]: "+
+             "derived CFS algo must implement its own handle_missing_values()"
+     end
+
+
      # calc new merit of subset when adding feature (f)
      def calc_merit(subset, f)
        k = subset.size.to_f + 1
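The abort-by-default method above is a template-method hook: get_feature_subset calls handle_missing_values before its forward search, and each concrete CFS variant supplies the policy. A condensed view of the contract, pieced together from the CFS hunks later in this diff (the CFS_d class name and superclass are inferred, since that hunk omits the class line):

    class CFS_c < BaseCFS             # continuous features
      def handle_missing_values
        replace_with_mean_value!
      end
    end

    class CFS_d < BaseCFS             # discrete features (inferred)
      def handle_missing_values
        replace_with_most_seen_value!
      end
    end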
@@ -10,8 +10,8 @@ module FSelector
    class BaseContinuous < Base
      # include normalizer
      include Normalizer
-     # include discretilizer
-     include Discretilizer
+     # include discretizer
+     include Discretizer

      # initialize from an existing data structure
      def initialize(data=nil)
@@ -8,8 +8,18 @@ module FSelector
    # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf)
    #
    class CFS_c < BaseCFS
+     # include normalizer and discretizer
+     include Normalizer
+     include Discretizer

      private
+
+
+     # replace missing values with mean feature value
+     def handle_missing_values
+       replace_with_mean_value!
+     end
+

      # calc the feature-class correlation of two vectors
      def do_rcf(cv, fv)
@@ -10,6 +10,9 @@ module FSelector
    # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
    #
    class ReliefF_c < BaseReliefF
+     # include normalizer and discretizer
+     include Normalizer
+     include Discretizer

      private

@@ -10,6 +10,9 @@ module FSelector
    # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
    #
    class Relief_c < BaseRelief
+     # include normalizer and discretizer
+     include Normalizer
+     include Discretizer

      private

@@ -1,7 +1,10 @@
  #
- # discretilize continous feature
+ # discretize continuous features
  #
- module Discretilizer
+ module Discretizer
+   # include Entropy module
+   include Entropy
+
    # discretize by equal-width intervals
    #
    # @param [Integer] n_interval
@@ -84,7 +87,7 @@ module Discretilizer
    #   2          4.60     5.99     9.21     13.82
    #   3          6.35     7.82     11.34    16.27
    #
-   def discretize_by_chimerge!(chisq)
+   def discretize_by_ChiMerge!(chisq)
      # chisq = 4.60 # for iris::Sepal.Length
      # for initialization
      hzero = {}
@@ -177,19 +180,71 @@ module Discretilizer
        end
      end

-   end # discretize_by_chimerge!
+   end # discretize_by_ChiMerge!
+
+
+   #
+   # discretize by Multi-Interval Discretization (MID) algorithm
+   # @note no missing feature values allowed and data structure will be altered
+   #
+   # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
+   #
+   def discretize_by_MID!
+     # determine the final boundaries
+     f2cp = {} # cut points for each feature
+     each_feature do |f|
+       cv = get_class_labels
+       # we assume no missing feature values
+       fv = get_feature_values(f)
+
+       n = cv.size
+       # sort cv and fv according to the ascending order of fv
+       sis = (0...n).to_a.sort { |i,j| fv[i] <=> fv[j] }
+       cv = cv.values_at(*sis)
+       fv = fv.values_at(*sis)
+
+       # get initial boundaries
+       bs = []
+       fv.each_with_index do |v, i|
+         # cut point (Ta) for feature A must always be a value between
+         # two examples of different classes in the sequence of sorted examples
+         # see original reference
+         if i < n-1 and cv[i] != cv[i+1]
+           bs << (v+fv[i+1])/2.0
+         end
+       end
+       bs.uniq! # remove duplicates
+
+       # main algorithm, iteratively determine cut points
+       cp = []
+       partition(cv, fv, bs, cp)
+
+       # add the rightmost boundary for convenience
+       cp << fv.max+1.0
+       # record cut points for feature (f)
+       f2cp[f] = cp
+     end
+
+     # discretize based on cut points
+     each_sample do |k, s|
+       s.keys.each do |f|
+         s[f] = get_index(s[f], f2cp[f])
+       end
+     end
+
+   end # discretize_by_MID!

    private

    # get index from sorted boundaries
    #
    # min -- | -- | -- | ... max |
-   #        b0   b1   b2  ...   bn(=max+1)
-   #        0    1    2   ...   n
+   #        b1   b2   b3  ...   bn(=max+1)
+   #        1    2    3   ...   n
    #
    def get_index(v, boundaries)
      boundaries.each_with_index do |b, i|
-       return i if v < b
+       return i+1 if v < b
      end
    end # get_index

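The off-by-one change in get_index shifts bin labels from 0-based to 1-based, plausibly because a bin labeled 0 would be dropped by the LibSVM writer (the fileio.rb hunk below skips zero-valued features). A worked example of the new behavior with hypothetical boundaries, where each boundary is an upper bin edge and the last equals max+1:

    get_index(4.9, [5.4, 6.3, 8.9])   #=> 1  (was 0 before this change)
    get_index(6.0, [5.4, 6.3, 8.9])   #=> 2
    get_index(7.7, [5.4, 6.3, 8.9])   #=> 3

Note that get_index assumes sorted boundaries, while partition (next hunk) appends cut points in discovery order; whether f2cp[f] should be sorted before use is worth double-checking.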
@@ -215,4 +270,103 @@ module Discretilizer
    end # calc_chisq


+   #
+   # Multi-Interval Discretization main algorithm,
+   # recursively selecting the best cut point at each step
+   #
+   # @param [Array] cv class labels
+   # @param [Array] fv feature values
+   # @param [Array] bs potential cut points
+   # @param [Array] cp resultant cut points
+   def partition(cv, fv, bs, cp)
+     # best cut point
+     cp_best = nil
+
+     # binary subsets at the best cut point
+     cv1_best, cv2_best = nil, nil
+     fv1_best, fv2_best = nil, nil
+     bs1_best, bs2_best = nil, nil
+
+     # best information gain
+     gain_best = -100.0
+     ent_best = -100.0
+     ent1_best = -100.0
+     ent2_best = -100.0
+
+     # try each potential cut point
+     bs.each do |b|
+       # binary split
+       cv1_try, cv2_try, fv1_try, fv2_try, bs1_try, bs2_try =
+         binary_split(cv, fv, bs, b)
+
+       # gain for this cut point
+       ent_try = get_marginal_entropy(cv)
+       ent1_try = get_marginal_entropy(cv1_try)
+       ent2_try = get_marginal_entropy(cv2_try)
+       gain_try = ent_try -
+                  (cv1_try.size.to_f/cv.size) * ent1_try -
+                  (cv2_try.size.to_f/cv.size) * ent2_try
+
+       #pp gain_try
+       if gain_try > gain_best
+         cp_best = b
+         cv1_best, cv2_best = cv1_try, cv2_try
+         fv1_best, fv2_best = fv1_try, fv2_try
+         bs1_best, bs2_best = bs1_try, bs2_try
+
+         gain_best = gain_try
+         ent_best = ent_try
+         ent1_best, ent2_best = ent1_try, ent2_try
+       end
+     end
+
+     # to cut or not to cut?
+     #
+     # Gain(A,T;S) > 1/N * log2(N-1) + 1/N * delta(A,T;S)
+     if cp_best
+       n = cv.size.to_f
+       k = cv.uniq.size.to_f
+       k1 = cv1_best.uniq.size.to_f
+       k2 = cv2_best.uniq.size.to_f
+       delta = Math.log2(3**k-2) - (k*ent_best - k1*ent1_best - k2*ent2_best)
+
+       # accept cut point
+       if gain_best > (Math.log2(n-1)/n + delta/n)
+         # a: record cut point
+         cp << cp_best
+
+         # b: recursively call on subsets
+         partition(cv1_best, fv1_best, bs1_best, cp)
+         partition(cv2_best, fv2_best, bs2_best, cp)
+       end
+     end
+   end
+
+
+   # binarily split based on a cut point
+   def binary_split(cv, fv, bs, cut_point)
+     cv1, cv2, fv1, fv2, bs1, bs2 = [], [], [], [], [], []
+     fv.each_with_index do |v, i|
+       if v < cut_point
+         cv1 << cv[i]
+         fv1 << v
+       else
+         cv2 << cv[i]
+         fv2 << v
+       end
+     end
+
+     bs.each do |b|
+       if b < cut_point
+         bs1 << b
+       else
+         bs2 << b
+       end
+     end
+
+     # return subsets
+     [cv1, cv2, fv1, fv2, bs1, bs2]
+   end
+
+
  end # module
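The gate guarding the recursion in partition is the minimum description length criterion from the cited Fayyad & Irani IJCAI-93 paper. Restating the inline comment and the delta assignment in cleaner notation, with N = |S| and k, k_1, k_2 the numbers of distinct classes in S and its two halves S_1, S_2:

    Gain(A,T;S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N}

    \Delta(A,T;S) = \log_2(3^k - 2) - \left[ k\,Ent(S) - k_1\,Ent(S_1) - k_2\,Ent(S_2) \right]

A candidate cut point that fails this test is rejected, which is what terminates the recursion and bounds the number of intervals per feature.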
@@ -3,7 +3,7 @@
  #
  module Normalizer
    # log transformation, requires positive feature values
-   def normalize_log!(base=10)
+   def normalize_by_log!(base=10)
      each_sample do |k, s|
        s.keys.each do |f|
          s[f] = Math.log(s[f], base) if s[f] > 0.0
@@ -13,7 +13,7 @@ module Normalizer


    # scale to [min,max], max > min
-   def normalize_min_max!(min=0.0, max=1.0)
+   def normalize_by_min_max!(min=0.0, max=1.0)
      # first determine min and max for each feature
      f2min_max = {}

@@ -33,7 +33,7 @@


    # by z-score
-   def normalize_zscore!
+   def normalize_by_zscore!
      # first determine mean and sd for each feature
      f2mean_sd = {}

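All three public normalizer methods gain a by_ infix in this release, so 0.2.0 call sites break on upgrade. A migration sketch using the default arguments shown above:

    r.normalize_by_log!(10)             # was normalize_log!
    r.normalize_by_min_max!(0.0, 1.0)   # was normalize_min_max!
    r.normalize_by_zscore!              # was normalize_zscore!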
@@ -12,6 +12,12 @@
    include Entropy

    private
+
+   # replace missing values with most seen feature value
+   def handle_missing_values
+     replace_with_most_seen_value!
+   end
+

    # calc the feature-class correlation of two vectors
    def do_rcf(cv, fv)
@@ -7,16 +7,16 @@ module Entropy
    #
    # H(X) = -1 * sigma_i (P(x_i) logP(x_i))
    #
-   def get_marginal_entropy(arrX)
+   def get_marginal_entropy(arrX)
      h = 0.0
      n = arrX.size.to_f
-
-     arrX.uniq.each do |x_i|
-       p = arrX.count(x_i)/n
-       h += -1.0 * (p * Math.log2(p))
-     end
-
-     h
+
+     arrX.uniq.each do |x_i|
+       p = arrX.count(x_i)/n
+       h += -1.0 * (p * Math.log2(p))
+     end
+
+     h
    end # get_marginal_entropy


@@ -27,28 +27,28 @@ module Entropy
    #
    # where H(X|y_j) = -1 * sigma_i (P(x_i|y_j) logP(x_i|y_j))
    #
-   def get_conditional_entropy(arrX, arrY)
-     abort "[#{__FILE__}@#{__LINE__}]: "+
-           "array must be of same length" if not arrX.size == arrY.size
-
+   def get_conditional_entropy(arrX, arrY)
+     abort "[#{__FILE__}@#{__LINE__}]: "+
+           "array must be of same length" if not arrX.size == arrY.size
+
      hxy = 0.0
-     n = arrX.size.to_f
-
-     arrY.uniq.each do |y_j|
-       p1 = arrY.count(y_j)/n
-
-       indices = (0...n).to_a.select { |k| arrY[k] == y_j }
-       xvs = arrX.values_at(*indices)
-       m = xvs.size.to_f
-
-       xvs.uniq.each do |x_i|
-         p2 = xvs.count(x_i)/m
-
-         hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+     n = arrX.size.to_f
+
+     arrY.uniq.each do |y_j|
+       p1 = arrY.count(y_j)/n
+
+       indices = (0...n).to_a.select { |k| arrY[k] == y_j }
+       xvs = arrX.values_at(*indices)
+       m = xvs.size.to_f
+
+       xvs.uniq.each do |x_i|
+         p2 = xvs.count(x_i)/m
+
+         hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+       end
      end
-     end
-
-     hxy
+
+     hxy
    end # get_conditional_entropy


@@ -60,11 +60,11 @@ module Entropy
    #
    # i.e. H(X,Y) == H(Y,X)
    #
-   def get_joint_entropy(arrX, arrY)
+   def get_joint_entropy(arrX, arrY)
      abort "[#{__FILE__}@#{__LINE__}]: "+
            "array must be of same length" if not arrX.size == arrY.size
-
-     get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
+
+     get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
    end # get_joint_entropy


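The three entropy hunks above are whitespace-only, but these functions underpin both the CFS correlation measures and the MID stopping rule, so a worked check is useful. With the toy arrays X = ['a', 'a', 'b'] and Y = [0, 0, 1] (hypothetical data):

    get_marginal_entropy(['a', 'a', 'b'])
    #=> -(2/3)*log2(2/3) - (1/3)*log2(1/3), about 0.918

    get_conditional_entropy(['a', 'a', 'b'], [0, 0, 1])
    #=> 0.0, since X is fully determined by Y

    get_joint_entropy(['a', 'a', 'b'], [0, 0, 1])
    #=> H(Y) + H(X|Y), about 0.918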
@@ -110,10 +110,22 @@ module FileIO
      ofs = File.open(fname, 'w')
    end

+   # convert class label to integer type
+   k2idx = {}
+   get_classes.each_with_index do |k, i|
+     k2idx[k] = i+1
+   end
+
+   # convert feature to integer type
+   f2idx = {}
+   get_features.each_with_index do |f, i|
+     f2idx[f] = i+1
+   end
+
    each_sample do |k, s|
-     ofs.print "#{k} "
+     ofs.print "#{k2idx[k]} "
      s.keys.sort { |x, y| x.to_s.to_i <=> y.to_s.to_i }.each do |i|
-       ofs.print " #{i}:#{s[i]}" if not s[i].zero?
+       ofs.print " #{f2idx[i]}:#{s[i]}" if not s[i].zero? # implicit mode
      end
      ofs.puts
    end
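With the new k2idx and f2idx maps, the LibSVM writer emits purely numeric, one-based class labels and feature indices, as the LibSVM format expects; previously the raw label leaked into the output. Assuming hypothetical classes ['cancer', 'normal'] and features ['f1', 'f2', 'f3'], a 'normal' sample holding {'f1' => 0.5, 'f3' => 0.8} would serialize roughly as:

    2 1:0.5 3:0.8

f2 is omitted entirely: zero (and therefore missing) values are skipped, the "implicit mode" flagged in the code comment.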
@@ -171,7 +183,7 @@ module FileIO
        end
      else
        abort "[#{__FILE__}@#{__LINE__}]: "+
-             "1st and 2nd row must have same fields"
+             "the first two rows must have the same number of fields"
      end
    else # data rows
      label, *fvs = ln.chomp.split(/,/)
@@ -0,0 +1,78 @@
+ #
+ # replace missing feature values
+ #
+ module ReplaceMissingValues
+   #
+   # replace missing feature values with a fixed value,
+   # applicable to both discrete and continuous features
+   # @note data structure will be altered
+   #
+   def replace_with_fixed_value!(val)
+     each_sample do |k, s|
+       each_feature do |f|
+         if not s.has_key? f
+           s[f] = val
+         end
+       end
+     end
+
+     # clear variables
+     clear_vars
+   end # replace_with_fixed_value!
+
+
+   #
+   # replace missing feature values with the mean feature value,
+   # applicable only to continuous features
+   # @note data structure will be altered
+   #
+   def replace_with_mean_value!
+     each_sample do |k, s|
+       each_feature do |f|
+         fv = get_feature_values(f)
+         next if fv.size == get_sample_size # no missing values
+
+         mean = fv.ave
+         if not s.has_key? f
+           s[f] = mean
+         end
+       end
+     end
+
+     # clear variables
+     clear_vars
+   end # replace_with_mean_value!
+
+
+   #
+   # replace missing feature values with the most seen feature value,
+   # applicable only to discrete features
+   # @note data structure will be altered
+   #
+   def replace_with_most_seen_value!
+     each_sample do |k, s|
+       each_feature do |f|
+         fv = get_feature_values(f)
+         next if fv.size == get_sample_size # no missing values
+
+         seen_count, seen_value = 0, nil
+         fv.uniq.each do |v|
+           count = fv.count(v)
+           if count > seen_count
+             seen_count = count
+             seen_value = v
+           end
+         end
+
+         if not s.has_key? f
+           s[f] = seen_value
+         end
+       end
+     end
+
+     # clear variables
+     clear_vars
+   end # replace_with_most_seen_value!
+
+
+ end # ReplaceMissingValues
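Because Base now includes this module (see the Base class hunk earlier), every selector that derives from Base inherits the three bang methods. A usage sketch with a hypothetical file containing gaps:

    r = FSelector::Relief_c.new
    r.data_from_csv('data_with_gaps.csv')   # hypothetical input
    r.replace_with_mean_value!              # continuous features

Each method finishes with clear_vars, so cached classes, features and scores are recomputed once the underlying data has been mutated.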
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: fselector
  version: !ruby/object:Gem::Version
-   version: 0.2.0
+   version: 0.3.0
  prerelease:
  platform: ruby
  authors:
@@ -9,17 +9,19 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-04-02 00:00:00.000000000 Z
+ date: 2012-04-03 00:00:00.000000000 Z
  dependencies: []
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
-   algorithms into one single package. Welcome to contact me (need47@gmail.com) if
-   you want to contribute your own algorithms or report a bug. FSelector enables the
-   user to perform feature selection by using either a single algorithm or an ensemble
-   of algorithms. FSelector acts on a full-feature data set with CSV, LibSVM or WEKA
-   file format and outputs a reduced data set with only selected subset of features,
-   which can later be used as the input for various machine learning softwares including
-   LibSVM and WEKA. FSelector, itself, does not implement any of the machine learning
-   algorithms such as support vector machines and random forest.
+   algorithms and related functions into one single package. You are welcome to contact
+   me (need47@gmail.com) if you'd like to contribute your own algorithms or report a
+   bug. FSelector allows the user to perform feature selection with either a single
+   algorithm or an ensemble of multiple algorithms, plus common tasks including
+   normalization and discretization of continuous data and the replacement of missing
+   feature values according to a chosen criterion. FSelector acts on a full-feature
+   data set in CSV, LibSVM or WEKA file format and outputs a reduced data set containing
+   only the selected subset of features, which can later be used as the input for various
+   machine learning software packages, including LibSVM and WEKA. FSelector itself does
+   not implement any machine learning algorithms such as support vector machines or random forests.
  email: need47@gmail.com
  executables: []
  extensions: []
@@ -73,6 +75,7 @@ files:
  - lib/fselector/ensemble.rb
  - lib/fselector/entropy.rb
  - lib/fselector/fileio.rb
+ - lib/fselector/replace_missing_values.rb
  - lib/fselector/util.rb
  - lib/fselector.rb
  homepage: http://github.com/need47/fselector