fselector 1.0.1 → 1.1.0

Files changed (55)
  1. data/ChangeLog +9 -0
  2. data/README.md +62 -26
  3. data/lib/fselector.rb +1 -1
  4. data/lib/fselector/algo_base/base.rb +89 -34
  5. data/lib/fselector/algo_base/base_CFS.rb +20 -7
  6. data/lib/fselector/algo_base/base_Relief.rb +5 -5
  7. data/lib/fselector/algo_base/base_ReliefF.rb +11 -3
  8. data/lib/fselector/algo_base/base_discrete.rb +8 -0
  9. data/lib/fselector/algo_continuous/BSS_WSS.rb +3 -1
  10. data/lib/fselector/algo_continuous/CFS_c.rb +3 -1
  11. data/lib/fselector/algo_continuous/FTest.rb +2 -0
  12. data/lib/fselector/algo_continuous/PMetric.rb +4 -2
  13. data/lib/fselector/algo_continuous/ReliefF_c.rb +11 -0
  14. data/lib/fselector/algo_continuous/Relief_c.rb +14 -3
  15. data/lib/fselector/algo_continuous/TScore.rb +5 -3
  16. data/lib/fselector/algo_continuous/WilcoxonRankSum.rb +5 -3
  17. data/lib/fselector/algo_discrete/Accuracy.rb +2 -0
  18. data/lib/fselector/algo_discrete/AccuracyBalanced.rb +2 -0
  19. data/lib/fselector/algo_discrete/BiNormalSeparation.rb +3 -1
  20. data/lib/fselector/algo_discrete/CFS_d.rb +3 -0
  21. data/lib/fselector/algo_discrete/ChiSquaredTest.rb +3 -0
  22. data/lib/fselector/algo_discrete/CorrelationCoefficient.rb +2 -0
  23. data/lib/fselector/algo_discrete/DocumentFrequency.rb +2 -0
  24. data/lib/fselector/algo_discrete/F1Measure.rb +2 -0
  25. data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +12 -1
  26. data/lib/fselector/algo_discrete/FishersExactTest.rb +3 -1
  27. data/lib/fselector/algo_discrete/GMean.rb +2 -0
  28. data/lib/fselector/algo_discrete/GSSCoefficient.rb +2 -0
  29. data/lib/fselector/algo_discrete/GiniIndex.rb +3 -1
  30. data/lib/fselector/algo_discrete/INTERACT.rb +3 -0
  31. data/lib/fselector/algo_discrete/InformationGain.rb +12 -1
  32. data/lib/fselector/algo_discrete/LasVegasFilter.rb +3 -0
  33. data/lib/fselector/algo_discrete/LasVegasIncremental.rb +3 -0
  34. data/lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb +2 -0
  35. data/lib/fselector/algo_discrete/McNemarsTest.rb +3 -0
  36. data/lib/fselector/algo_discrete/MutualInformation.rb +3 -1
  37. data/lib/fselector/algo_discrete/OddsRatio.rb +2 -0
  38. data/lib/fselector/algo_discrete/OddsRatioNumerator.rb +2 -0
  39. data/lib/fselector/algo_discrete/Power.rb +4 -1
  40. data/lib/fselector/algo_discrete/Precision.rb +2 -0
  41. data/lib/fselector/algo_discrete/ProbabilityRatio.rb +2 -0
  42. data/lib/fselector/algo_discrete/Random.rb +3 -0
  43. data/lib/fselector/algo_discrete/ReliefF_d.rb +3 -1
  44. data/lib/fselector/algo_discrete/Relief_d.rb +4 -2
  45. data/lib/fselector/algo_discrete/Sensitivity.rb +2 -0
  46. data/lib/fselector/algo_discrete/Specificity.rb +2 -0
  47. data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb +4 -1
  48. data/lib/fselector/discretizer.rb +7 -7
  49. data/lib/fselector/ensemble.rb +375 -115
  50. data/lib/fselector/entropy.rb +2 -2
  51. data/lib/fselector/fileio.rb +83 -70
  52. data/lib/fselector/normalizer.rb +2 -2
  53. data/lib/fselector/replace_missing_values.rb +137 -3
  54. data/lib/fselector/util.rb +17 -5
  55. metadata +4 -4
data/ChangeLog CHANGED
@@ -1,3 +1,12 @@
+ 2012-05-15 version 1.1.0
+
+ * add replace\_by\_median\_value! for replacing missing value with feature median value
+ * add replace\_by\_knn\_value! for replacing missing value with weighted feature value from k-nearest neighbors
+ * replace\_by\_mean\_value! and replace\_by\_median\_value! now support both column and row mode
+ * add EnsembleSingle class for ensemble feature selection by creating an ensemble of feature selectors using a single feature selection algorithm
+ * rename Ensemble to EnsembleMultiple for ensemble feature selection by creating an ensemble of feature selectors using multiple feature selection algorithms of the same type
+ * bug fix in FileIO module
+
  2012-05-08 version 1.0.1
 
  * modify Ensemble module so that ensemble\_by\_score() and ensemble\_by\_rank() now take Symbol, instead of Method, as argument. This allows easier and clearer function call
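The two replacement strategies added in this release (replace\_by\_median\_value! and replace\_by\_knn\_value!) can be sketched in plain Ruby. This is an illustrative re-implementation of the idea only, not the gem's internal code; the helper names (`median_replace`, `knn_replace`) and the data layout are hypothetical:

```ruby
# Sketch of the two new missing-value strategies (illustrative only).

# Replace nils in a feature column by the median of the observed values.
def median_replace(column)
  observed = column.compact.sort
  mid = observed.size / 2
  median = observed.size.odd? ? observed[mid] : (observed[mid - 1] + observed[mid]) / 2.0
  column.map { |v| v.nil? ? median : v }
end

# Replace a missing value by the distance-weighted average of the same
# feature taken from the k nearest neighbor samples (hashes feature => value).
def knn_replace(sample, neighbors, feature, k = 3)
  return sample[feature] unless sample[feature].nil?
  candidates = neighbors.reject { |n| n[feature].nil? }.map do |n|
    # distance over features observed in both samples
    shared = sample.keys.select { |f| f != feature && sample[f] && n[f] }
    dist = Math.sqrt(shared.sum { |f| (sample[f] - n[f])**2 })
    [dist, n[feature]]
  end
  nearest = candidates.sort_by(&:first).first(k)
  weights = nearest.map { |d, _| 1.0 / (d + 1e-9) }  # closer neighbors weigh more
  nearest.zip(weights).sum { |(_, v), w| v * w } / weights.sum
end
```

The gem's actual methods additionally support column vs. row mode, as noted in the ChangeLog.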
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 1.0.1
- **Release Date**: 2012-05-08
+ **Latest Version**: 1.1.0
+ **Release Date**: 2012-05-15
 
  Synopsis
  --------
@@ -86,17 +86,16 @@ Feature List
  WilcoxonRankSum    WRS    weighting    continuous    two-class
 
  **note for feature selection interface:**
- there are two types of filter methods, i.e., weighting algorithms and subset selection algorithms
+ there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
 
  - for weighting type: use either **select\_feature\_by\_rank!** or **select\_feature\_by\_score!**
  - for subset type: use **select\_feature!**
-
 
  **3. feature selection approaches**
 
  - by a single algorithm
  - by multiple algorithms in a tandem manner
- - by multiple algorithms in an ensemble manner
+ - by multiple algorithms in an ensemble manner (share same feature selection interface as single algorithm)
 
  **4. availabe normalization and discretization algorithms for continuous feature**
@@ -114,11 +113,13 @@ Feature List
 
  **5. availabe algorithms for replacing missing feature values**
 
- algorithm                      note                                   feature_type
- -------------------------------------------------------------------------------------------
- replace_by_fixed_value!        replace by a fixed value               discrete, continuous
- replace_by_mean_value!         replace by mean feature value          continuous
- replace_by_most_seen_value!    replace by most seen feature value     discrete
+ algorithm                      note                                   feature_type
+ ---------------------------------------------------------------------------------------------------------
+ replace_by_fixed_value!        replace by a fixed value               discrete, continuous
+ replace_by_mean_value!         replace by mean feature value          continuous
+ replace_by_median_value!       replace by median feature value        continuous
+ replace_by_knn_value!          replace by weighted knn feature value  continuous
+ replace_by_most_seen_value!    replace by most seen feature value     discrete
 
  Installing
  ----------
@@ -140,7 +141,7 @@ Usage
 
  require 'fselector'
 
- # use InformationGain as a feature ranking algorithm
+ # use InformationGain as a feature selection algorithm
  r1 = FSelector::InformationGain.new
 
  # read from random data (or csv, libsvm, weka ARFF file)
@@ -152,13 +153,13 @@ Usage
  r1.data_from_random(100, 2, 15, 3, true)
 
  # number of features before feature selection
- puts "# features (before): "+ r1.get_features.size.to_s
+ puts " # features (before): "+ r1.get_features.size.to_s
 
  # select the top-ranked features with scores >0.01
  r1.select_feature_by_score!('>0.01')
 
  # number of features after feature selection
- puts "# features (after): "+ r1.get_features.size.to_s
+ puts " # features (after): "+ r1.get_features.size.to_s
 
  # you can also use multiple alogirithms in a tandem manner
  # e.g. use the ChiSquaredTest with Yates' continuity correction
@@ -166,29 +167,65 @@ Usage
  r2 = FSelector::ChiSquaredTest.new(:yates, r1.get_data)
 
  # number of features before feature selection
- puts "# features (before): "+ r2.get_features.size.to_s
+ puts " # features (before): "+ r2.get_features.size.to_s
 
  # select the top-ranked 3 features
  r2.select_feature_by_rank!('<=3')
 
  # number of features after feature selection
- puts "# features (after): "+ r2.get_features.size.to_s
+ puts " # features (after): "+ r2.get_features.size.to_s
 
  # save data to standard ouput as a weka ARFF file (sparse format)
  # with selected features only
  r2.data_to_weka(:stdout, :sparse)
 
 
- **2. feature selection by an ensemble of multiple algorithms**
+ **2. feature selection by an ensemble of multiple feature selectors**
 
  require 'fselector'
 
- # use both InformationGain and Relief_d
+ # example 1
+ #
+
+
+ # creating an ensemble of feature selectors by using
+ # a single feature selection algorithm (INTERACT)
+ # by instance perturbation (e.g. bootstrap sampling)
+
+ # test for the type of feature subset selection algorithms
+ r = FSelector::INTERACT.new(0.0001)
+
+ # an ensemble of 40 feature selectors with 90% data by random sampling
+ re = FSelector::EnsembleSingle.new(r, 40, 0.90, :random_sampling)
+
+ # read SPECT data set (under the test/ directory)
+ re.data_from_csv('test/SPECT_train.csv')
+
+ # number of features before feature selection
+ puts ' # features (before): ' + re.get_features.size.to_s
+
+ # only features with above average count among ensemble are selected
+ re.select_feature!
+
+ # number of features after feature selection
+ puts ' # features before (after): ' + re.get_features.size.to_s
+
+
+ # example 2
+ #
+
+
+ # creating an ensemble of feature selectors by using
+ # two feature selection algorithms (InformationGain and Relief_d).
+ # note: can be 2+ algorithms, as long as they are of the same type,
+ # either feature weighting or feature subset selection algorithms
+
+ # test for the type of feature weighting algorithms
  r1 = FSelector::InformationGain.new
- r2 = FSelector::Relief_d.new
+ r2 = FSelector::Relief_d.new(10)
 
- # ensemble ranker
- re = FSelector::Ensemble.new(r1, r2)
+ # an ensemble of two feature selectors
+ re = FSelector::EnsembleMultiple.new(r1, r2)
 
  # read random data
  re.data_from_random(100, 2, 15, 3, true)
@@ -198,18 +235,17 @@ Usage
  re.replace_by_most_seen_value!
 
  # number of features before feature selection
- puts '# features (before): ' + re.get_features.size.to_s
+ puts ' # features (before): ' + re.get_features.size.to_s
 
  # based on the max feature score (z-score standardized) among
- # an ensemble of feature selection algorithms
+ # an ensemble of feature selectors
  re.ensemble_by_score(:by_max, :by_zscore)
 
  # select the top-ranked 3 features
  re.select_feature_by_rank!('<=3')
 
  # number of features after feature selection
- puts '# features (after): ' + re.get_features.size.to_s
-
+ puts ' # features (after): ' + re.get_features.size.to_s
 
  **3. normalization and discretization before feature selection**
 
@@ -233,13 +269,13 @@ Usage
  r2 = FSelector::FCBF.new(0.0, r1.get_data)
 
  # number of features before feature selection
- puts '# features (before): ' + r2.get_features.size.to_s
+ puts ' # features (before): ' + r2.get_features.size.to_s
 
  # feature selection
  r2.select_feature!
 
  # number of features after feature selection
- puts '# features (after): ' + r2.get_features.size.to_s
+ puts ' # features (after): ' + r2.get_features.size.to_s
 
  **4. see more examples test_*.rb under the test/ directory**
 
data/lib/fselector.rb CHANGED
@@ -6,7 +6,7 @@ require 'rinruby'
  #
  module FSelector
  # module version
- VERSION = '1.0.1'
+ VERSION = '1.1.0'
  end
 
  # the root dir of FSelector
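The README's `re.ensemble_by_score(:by_max, :by_zscore)` call combines the per-selector scores in two steps: each selector's scores are z-score standardized so they are comparable across algorithms, then each feature keeps its maximum standardized score across the ensemble. A plain-Ruby sketch of those semantics (an illustrative approximation, not the gem's code; `zscore` and `ensemble_by_max` are hypothetical names):

```ruby
# Standardize a hash of feature => score to zero mean and unit variance.
def zscore(scores)
  mean = scores.values.sum / scores.size.to_f
  sd = Math.sqrt(scores.values.sum { |v| (v - mean)**2 } / scores.size)
  scores.transform_values { |v| (v - mean) / sd }
end

# Combine several feature => score maps: standardize each selector's
# scores, then take the max per feature across selectors.
def ensemble_by_max(score_maps)
  score_maps.map { |m| zscore(m) }
            .flat_map(&:to_a)
            .group_by(&:first)
            .transform_values { |pairs| pairs.map(&:last).max }
end
```

Standardizing first matters because raw scores from different algorithms (e.g. InformationGain vs. Relief_d) live on incompatible scales.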
data/lib/fselector/algo_base/base.rb CHANGED
@@ -11,25 +11,39 @@ module FSelector
  # include ReplaceMissingValues
  include ReplaceMissingValues
 
+ class << self
+ # class-level instance variable, type of feature selection algorithm.
+ #
+ # @note derived class (except for Base*** class) must set its own type with
+ # one of the following two:
+ # - :feature\_weighting # when algo outputs weight for each feature
+ # - :feature\_subset_selection # when algo outputs a subset of features
+ attr_accessor :algo_type
+ end
+
+ # get the type of feature selection algorithm at class-level
+ def algo_type
+ self.class.algo_type
+ end
+
+
  # initialize from an existing data structure
  def initialize(data=nil)
- @data = data
- @opts = {} # store non-data information
+ @data = data # store data
  end
 
 
  #
- # iterator for each class, a block must be given
+ # iterator for each class, a block must be given. e.g.
  #
- # e.g.
- # self.each_class do |k|
+ # each_class do |k|
  # puts k
  # end
  #
  def each_class
  if not block_given?
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "block must be given!"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " block must be given!"
  else
  get_classes.each { |k| yield k }
  end
@@ -37,17 +51,16 @@ module FSelector
 
 
  #
- # iterator for each feature, a block must be given
+ # iterator for each feature, a block must be given. e.g.
  #
- # e.g.
- # self.each_feature do |f|
+ # each_feature do |f|
  # puts f
  # end
  #
  def each_feature
  if not block_given?
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "block must be given!"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " block must be given!"
  else
  get_features.each { |f| yield f }
  end
@@ -55,10 +68,10 @@ module FSelector
 
 
  #
- # iterator for each sample with class label, a block must be given
+ # iterator for each sample with class label,
+ # a block must be given. e.g.
  #
- # e.g.
- # self.each_sample do |k, s|
+ # each_sample do |k, s|
  # print k
  # s.each { |f, v| print " #{v}" }
  # puts
@@ -66,7 +79,7 @@ module FSelector
  #
  def each_sample
  if not block_given?
- abort "[#{__FILE__}@#{__LINE__}]: "+
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
  " block must be given!"
  else
  get_data.each do |k, samples|
@@ -114,8 +127,8 @@ module FSelector
  if classes and classes.class == Array
  @classes = classes
  else
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "classes must be a Array object!"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " classes must be a Array object!"
  end
  end
 
@@ -125,7 +138,7 @@ module FSelector
  # @return [Array<Symbol>] unique features
  #
  def get_features
- @features ||= @data.map { |x| x[1].map { |y| y.keys } }.flatten.uniq
+ @features ||= @data.collect { |x| x[1].collect { |y| y.keys } }.flatten.uniq
  end
 
 
@@ -174,8 +187,8 @@ module FSelector
  if features and features.class == Array
  @features = features
  else
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "features must be a Array object!"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " features must be a Array object!"
  end
  end
 
@@ -204,27 +217,40 @@ module FSelector
  # set data and clean relevant variables in case of data change
  #
  # @param [Hash] data source data structure
+ # @return [nil] to suppress console echo of data in irb
  #
  def set_data(data)
  if data and data.class == Hash
- @data = data
  # clear variables
- clear_vars
+ clear_vars if @data
+ @data = data # set new data structure
  else
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "data must be a Hash object!"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " data must be a Hash object!"
  end
+
+ nil # suppress console echo of data in irb
  end
 
 
+ #
  # get non-data information for a given key
- def get_opt(key)
- @opts.has_key?(key) ? @opts[key] : nil
+ #
+ # @param [Symbol] key key of non-data
+ # @return [Any] value of non-data, can be any type
+ #
+ # @note return all non-data as a Hash if key == nil
+ #
+ def get_opt(key=nil)
+ key ? @opts[key] : @opts
  end
 
 
  # set non-data information as a key-value pair
+ # @param [Symbol] key key of non-data
+ # @param [Any] value value of non-data, can be any type
  def set_opt(key, value)
+ @opts ||= {} # store non-data information
  @opts[key] = value
  end
 
@@ -235,7 +261,7 @@ module FSelector
  # @return [Integer] sample size
  #
  def get_sample_size
- @sz ||= get_data.values.flatten.size
+ @sz ||= get_classes.inject(0) { |sz, k| sz+get_data[k].size }
  end
 
 
@@ -286,6 +312,13 @@ module FSelector
  # the subset selection type of algorithms, see {file:README.md}
  #
  def select_feature!
+ if not self.algo_type == :feature_subset_selection
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " select_feature! is the interface for the type of feature subset selection algorithms only. \n" +
+ " please consider select_featue_by_score! or select_feature_by_rank!, \n" +
+ " which is the interface for the type of feature weighting algorithms"
+ end
+
  # derived class must implement its own one
  subset = get_feature_subset
  return if subset.empty?
@@ -313,8 +346,16 @@ module FSelector
  # the weighting type of algorithms, see {file:README.md}
  #
  def select_feature_by_score!(criterion, my_scores=nil)
+ if not self.algo_type == :feature_weighting
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " select_feature_by_score! is the interface for the type of feature weighting algorithms only. \n" +
+ " please consider select_featue!, \n" +
+ " which is the interface for the type of feature subset selection algorithms"
+ end
+
  # user scores or internal scores
  scores = my_scores || get_feature_scores
+ return if scores.empty?
 
  my_data = {}
 
@@ -339,8 +380,16 @@ module FSelector
  # the weighting type of algorithms, see {file:README.md}
  #
  def select_feature_by_rank!(criterion, my_ranks=nil)
+ if not self.algo_type == :feature_weighting
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " select_feature_by_rank! is the interface for the type of feature weighting algorithms only. \n" +
+ " please consider select_featue!, \n" +
+ " which is the interface for the type of feature subset selection algorithms"
+ end
+
  # user ranks or internal ranks
  ranks = my_ranks || get_feature_ranks
+ return if ranks.empty?
 
  my_data = {}
 
@@ -355,12 +404,18 @@ module FSelector
  private
 
- # clear variables when data structure is altered,
- # except @opts (non-data information)
+ #
+ # clear variables when data structure is altered, this is
+ # useful when data structure has changed while
+ # you still want to use the same instance
+ #
+ # @note the variables of original data structure (@data) and
+ # algorithm type (@algo_type) are retained
+ #
  def clear_vars
  @classes, @features, @fvs = nil, nil, nil
  @scores, @ranks, @sz = nil, nil, nil
- @cv, @fvs = nil, nil
+ @cv, @fvs, @opts = nil, nil, {}
  end
 
 
@@ -399,10 +454,10 @@ module FSelector
  end
 
 
- # get subset of feature, for the type of subset selection algorithms
+ # get feature subset, for the type of subset selection algorithms
  def get_feature_subset
- abort "[#{__FILE__}@#{__LINE__}]: "+
- "derived class must implement its own get_feature_subset()"
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
+ " derived subclass must implement its own get_feature_subset()"
  end
 
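The `class << self; attr_accessor :algo_type; end` change above relies on Ruby's class-level instance variables: each subclass carries its own `@algo_type`, unlike a `@@class` variable, which would be shared across the whole hierarchy. A minimal standalone sketch of the idiom (class names here are illustrative, not the gem's classes):

```ruby
# Class-level instance variable idiom: each subclass owns its own
# :algo_type; Base itself stays unset (nil).
class Base
  class << self
    attr_accessor :algo_type   # reader/writer on the class object
  end

  # instance-level convenience reader delegating to the class
  def algo_type
    self.class.algo_type
  end
end

class Weighter < Base
  @algo_type = :feature_weighting          # set on the subclass itself
end

class SubsetSelector < Base
  @algo_type = :feature_subset_selection
end
```

This is what lets `select_feature!` and `select_feature_by_score!` in the diff above abort when called on the wrong kind of algorithm.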