fselector 0.9.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. data/ChangeLog +7 -0
  2. data/README.md +51 -47
  3. data/lib/fselector.rb +4 -1
  4. data/lib/fselector/algo_base/base.rb +56 -22
  5. data/lib/fselector/algo_base/base_CFS.rb +3 -3
  6. data/lib/fselector/algo_base/base_Relief.rb +5 -3
  7. data/lib/fselector/algo_base/base_ReliefF.rb +9 -10
  8. data/lib/fselector/algo_base/base_continuous.rb +1 -1
  9. data/lib/fselector/algo_base/base_discrete.rb +2 -2
  10. data/lib/fselector/algo_continuous/BSS_WSS.rb +4 -4
  11. data/lib/fselector/algo_continuous/FTest.rb +7 -7
  12. data/lib/fselector/algo_continuous/PMetric.rb +5 -5
  13. data/lib/fselector/algo_continuous/TScore.rb +8 -6
  14. data/lib/fselector/algo_continuous/WilcoxonRankSum.rb +4 -4
  15. data/lib/fselector/algo_discrete/AccuracyBalanced.rb +5 -3
  16. data/lib/fselector/algo_discrete/BiNormalSeparation.rb +5 -3
  17. data/lib/fselector/algo_discrete/ChiSquaredTest.rb +10 -11
  18. data/lib/fselector/algo_discrete/CorrelationCoefficient.rb +7 -6
  19. data/lib/fselector/algo_discrete/F1Measure.rb +3 -3
  20. data/lib/fselector/algo_discrete/FishersExactTest.rb +3 -3
  21. data/lib/fselector/algo_discrete/GMean.rb +4 -4
  22. data/lib/fselector/algo_discrete/GiniIndex.rb +3 -1
  23. data/lib/fselector/algo_discrete/INTERACT.rb +112 -0
  24. data/lib/fselector/algo_discrete/InformationGain.rb +5 -5
  25. data/lib/fselector/algo_discrete/LasVegasFilter.rb +17 -54
  26. data/lib/fselector/algo_discrete/LasVegasIncremental.rb +70 -78
  27. data/lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb +5 -5
  28. data/lib/fselector/algo_discrete/McNemarsTest.rb +13 -10
  29. data/lib/fselector/algo_discrete/MutualInformation.rb +4 -4
  30. data/lib/fselector/algo_discrete/OddsRatio.rb +3 -3
  31. data/lib/fselector/algo_discrete/OddsRatioNumerator.rb +4 -4
  32. data/lib/fselector/algo_discrete/Power.rb +8 -9
  33. data/lib/fselector/algo_discrete/Precision.rb +3 -3
  34. data/lib/fselector/algo_discrete/ProbabilityRatio.rb +3 -3
  35. data/lib/fselector/algo_discrete/Sensitivity.rb +3 -3
  36. data/lib/fselector/algo_discrete/Specificity.rb +3 -3
  37. data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb +7 -7
  38. data/lib/fselector/consistency.rb +118 -0
  39. data/lib/fselector/discretizer.rb +79 -114
  40. data/lib/fselector/ensemble.rb +4 -2
  41. data/lib/fselector/entropy.rb +62 -92
  42. data/lib/fselector/fileio.rb +2 -2
  43. data/lib/fselector/normalizer.rb +68 -59
  44. data/lib/fselector/replace_missing_values.rb +1 -1
  45. data/lib/fselector/util.rb +3 -3
  46. metadata +6 -4
data/ChangeLog CHANGED
@@ -1,3 +1,10 @@
+2012-05-04 version 1.0.0
+
+  * add new algorithm INTERACT for discrete feature
+  * add Consistency module to deal with data inconsistency calculation, which bases on a Hash table and is efficient in both storage and speed
+  * update the Chi2 algorithm to try to reproduce the results of the original Chi2 algorithm
+  * update documentation whenever necessary
+
 2012-04-25 version 0.9.0
 
   * add new discretization algorithm (Three-Interval Discretization, TID)
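The ChangeLog above credits the new Consistency module with a Hash-based inconsistency calculation that is "efficient in both storage and speed". A minimal sketch of that idea (`inconsistency_rate` is an illustrative name, not the gem's actual API): group samples by their feature-value pattern in a Hash, then count, within each pattern, every sample beyond its majority class as inconsistent.

```ruby
# Hash-based inconsistency rate, in the spirit of the Consistency module
# described above (illustrative names only, not the gem's API).
def inconsistency_rate(samples)
  # group samples by their feature-value pattern; count labels per pattern
  groups = Hash.new { |h, k| h[k] = Hash.new(0) }
  samples.each { |features, label| groups[features][label] += 1 }
  # within each pattern, samples beyond the majority class are inconsistent
  inconsistent = groups.values.sum { |counts| counts.values.sum - counts.values.max }
  inconsistent.to_f / samples.size
end

data = [
  [{ f1: 1, f2: 0 }, :yes],
  [{ f1: 1, f2: 0 }, :no],   # same pattern, different label => inconsistent
  [{ f1: 0, f2: 1 }, :yes],
  [{ f1: 1, f2: 0 }, :yes],
]
puts inconsistency_rate(data)  # => 0.25
```

Because patterns are plain Hash keys, storage and lookup grow only with the number of distinct patterns, which matches the ChangeLog's storage/speed claim.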
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 0.9.0
- **Release Date**: April 25 2012
+ **Latest Version**: 1.0.0
+ **Release Date**: 2012-05-04
 
 Synopsis
 --------
@@ -38,55 +38,59 @@ Feature List
 - csv
 - libsvm
 - weka ARFF
-- random data (for test purpose)
+- random data (read only, for test purpose)
 
 **2. available feature selection/ranking algorithms**
 
-    algorithm                          alias       feature_type  applicability
-    --------------------------------------------------------------------------------------
-    Accuracy                           Acc         discrete
-    AccuracyBalanced                   Acc2        discrete
-    BiNormalSeparation                 BNS         discrete
-    CFS_d                              CFS_d       discrete
-    ChiSquaredTest                     CHI         discrete
-    CorrelationCoefficient             CC          discrete
-    DocumentFrequency                  DF          discrete
-    F1Measure                          F1          discrete
-    FishersExactTest                   FET         discrete
-    FastCorrelationBasedFilter         FCBF        discrete
-    GiniIndex                          GI          discrete
-    GMean                              GM          discrete
-    GSSCoefficient                     GSS         discrete
-    InformationGain                    IG          discrete
-    LasVegasFilter                     LVF         discrete
-    LasVegasIncremental                LVI         discrete
-    MatthewsCorrelationCoefficient     MCC, PHI    discrete
-    McNemarsTest                       MNT         discrete
-    OddsRatio                          OR          discrete
-    OddsRatioNumerator                 ORN         discrete
-    PhiCoefficient                     Phi         discrete
-    Power                              Power       discrete
-    Precision                          Precision   discrete
-    ProbabilityRatio                   PR          discrete
-    Random                             Random      discrete
-    Recall                             Recall      discrete
-    Relief_d                           Relief_d    discrete      two-class, no missing data
-    ReliefF_d                          ReliefF_d   discrete
-    Sensitivity                        SN, Recall  discrete
-    Specificity                        SP          discrete
-    SymmetricalUncertainty             SU          discrete
-    BetweenWithinClassesSumOfSquare    BSS_WSS     continuous
-    CFS_c                              CFS_c       continuous
-    FTest                              FT          continuous
-    PMetric                            PM          continuous    two-class
-    Relief_c                           Relief_c    continuous    two-class, no missing data
-    ReliefF_c                          ReliefF_c   continuous
-    TScore                             TS          continuous    two-class
-    WilcoxonRankSum                    WRS         continuous    two-class
+    algorithm                          alias       algo_type   feature_type  applicability
+    --------------------------------------------------------------------------------------------------
+    Accuracy                           Acc         weighting   discrete
+    AccuracyBalanced                   Acc2        weighting   discrete
+    BiNormalSeparation                 BNS         weighting   discrete
+    CFS_d                              CFS_d       subset      discrete
+    ChiSquaredTest                     CHI         weighting   discrete
+    CorrelationCoefficient             CC          weighting   discrete
+    DocumentFrequency                  DF          weighting   discrete
+    F1Measure                          F1          weighting   discrete
+    FishersExactTest                   FET         weighting   discrete
+    FastCorrelationBasedFilter         FCBF        subset      discrete
+    GiniIndex                          GI          weighting   discrete
+    GMean                              GM          weighting   discrete
+    GSSCoefficient                     GSS         weighting   discrete
+    InformationGain                    IG          weighting   discrete
+    INTERACT                           INTERACT    subset      discrete
+    LasVegasFilter                     LVF         subset      discrete
+    LasVegasIncremental                LVI         subset      discrete
+    MatthewsCorrelationCoefficient     MCC, PHI    weighting   discrete
+    McNemarsTest                       MNT         weighting   discrete
+    OddsRatio                          OR          weighting   discrete
+    OddsRatioNumerator                 ORN         weighting   discrete
+    PhiCoefficient                     Phi         weighting   discrete
+    Power                              Power       weighting   discrete
+    Precision                          Precision   weighting   discrete
+    ProbabilityRatio                   PR          weighting   discrete
+    Random                             Random      weighting   discrete
+    Recall                             Recall      weighting   discrete
+    Relief_d                           Relief_d    weighting   discrete      two-class, no missing data
+    ReliefF_d                          ReliefF_d   weighting   discrete
+    Sensitivity                        SN, Recall  weighting   discrete
+    Specificity                        SP          weighting   discrete
+    SymmetricalUncertainty             SU          weighting   discrete
+    BetweenWithinClassesSumOfSquare    BSS_WSS     weighting   continuous
+    CFS_c                              CFS_c       subset      continuous
+    FTest                              FT          weighting   continuous
+    PMetric                            PM          weighting   continuous    two-class
+    Relief_c                           Relief_c    weighting   continuous    two-class, no missing data
+    ReliefF_c                          ReliefF_c   weighting   continuous
+    TScore                             TS          weighting   continuous    two-class
+    WilcoxonRankSum                    WRS         weighting   continuous    two-class
 
 **note for feature selection interace:**
-- for the algorithms of CFS\_d, FCBF and CFS\_c, use select\_feature!
-- for other algorithms, use either select\_feature\_by\_rank! or select\_feature\_by\_score!
+there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
+
+- for weighting type: use either **select\_feature\_by\_rank!** or **select\_feature\_by\_score!**
+- for subset type: use **select\_feature!**
+
 
 **3. feature selection approaches**
 
@@ -159,7 +163,7 @@ Usage
     # you can also use multiple alogirithms in a tandem manner
    # e.g. use the ChiSquaredTest with Yates' continuity correction
    # initialize from r1's data
-    r2 = FSelector::ChiSquaredTest.new(:yates_continuity_correction, r1.get_data)
+    r2 = FSelector::ChiSquaredTest.new(:yates, r1.get_data)
 
    # number of features before feature selection
    puts "# features (before): "+ r2.get_features.size.to_s
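The README note above splits the algorithms into weighting and subset types; for the weighting type, select\_feature\_by\_score! accepts a criterion string such as '>0.5'. A rough, self-contained sketch of applying such a criterion to a feature-score Hash (`features_satisfying` is a hypothetical helper for illustration, not part of the gem):

```ruby
# Filter a feature => score Hash by a criterion string such as '>0.5',
# mirroring the strings accepted by select_feature_by_score!.
# features_satisfying is a hypothetical helper, not gem API.
def features_satisfying(scores, criterion)
  op, threshold = criterion.match(/\A(>=|<=|==|>|<)(.+)\z/).captures
  scores.select { |_f, s| s.public_send(op, threshold.to_f) }.keys
end

scores = { f1: 0.9, f2: 0.3, f3: 0.6 }
p features_satisfying(scores, '>0.5')   # => [:f1, :f3]
p features_satisfying(scores, '<=0.3')  # => [:f2]
```

The comparison operator is dispatched with `public_send`, so the same helper handles all five operators the README's criterion examples use.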
data/lib/fselector.rb CHANGED
@@ -6,7 +6,7 @@ require 'rinruby'
 #
 module FSelector
   # module version
-  VERSION = '0.9.0'
+  VERSION = '1.0.0'
 end
 
 # the root dir of FSelector
@@ -19,6 +19,8 @@ ROOT = File.expand_path(File.dirname(__FILE__))
 require "#{ROOT}/fselector/fileio.rb"
 # extend Array and String class
 require "#{ROOT}/fselector/util.rb"
+# check data consistency
+require "#{ROOT}/fselector/consistency.rb"
 # entropy-related functions
 require "#{ROOT}/fselector/entropy.rb"
 # normalization for continuous data
@@ -30,6 +32,7 @@ require "#{ROOT}/fselector/replace_missing_values.rb"
 
 #
 # base class
+#
 Dir.glob("#{ROOT}/fselector/algo_base/*").each do |f|
   require f
 end
data/lib/fselector/algo_base/base.rb CHANGED
@@ -76,13 +76,22 @@ module FSelector
     end
 
 
-    # get (uniq) classes labels as an array
+    #
+    # get (unique) classes labels
+    #
+    # @return [Array<Symbol>] unique class labels
+    #
     def get_classes
       @classes ||= @data.keys
     end
 
 
-    # get class labels for all samples as an array
+    #
+    # get class labels for all samples
+    #
+    # @return [Array<Symbol>] class labels for all classes,
+    #   same size as the number of samples
+    #
     def get_class_labels
       if not @cv
         @cv = []
@@ -96,7 +105,11 @@ module FSelector
     end
 
 
+    #
     # set classes
+    #
+    # @param [Array<Symbol>] classes source unique class labels
+    #
     def set_classes(classes)
       if classes and classes.class == Array
         @classes = classes
@@ -106,8 +119,11 @@ module FSelector
       end
     end
 
-
-    # get (unique) features as an array
+    #
+    # get (unique) features
+    #
+    # @return [Array<Symbol>] unique features
+    #
     def get_features
       @features ||= @data.map { |x| x[1].map { |y| y.keys } }.flatten.uniq
     end
@@ -123,6 +139,7 @@ module FSelector
     # @param [Symbol] ck class of interest.
     #   return feature values for all classes, otherwise return feature
     #   values for the specific class (ck)
+    # @return [Hash] feature values
     #
     def get_feature_values(f, mv=nil, ck=nil)
       @fvs ||= {}
@@ -148,7 +165,11 @@ module FSelector
     end
 
 
+    #
     # set features
+    #
+    # @param [Array<Symbol>] features source unique features
+    #
     def set_features(features)
       if features and features.class == Array
         @features = features
@@ -159,20 +180,31 @@ module FSelector
     end
 
 
-    # get data
+    #
+    # get internal data
+    #
+    # @return [Hash] internal data
+    #
     def get_data
       @data
     end
 
 
-    # get a copy of data,
-    # by means of the standard Marshal library
+    #
+    # get a copy of internal data, by means of the standard Marshal library
+    #
+    # @return [Hash] a copy of internal data
+    #
     def get_data_copy
       Marshal.load(Marshal.dump(@data)) if @data
    end
 
 
-    # set data
+    #
+    # set data and clean relevant variables in case of data change
+    #
+    # @param [Hash] data source data structure
+    #
     def set_data(data)
       if data and data.class == Hash
         @data = data
@@ -182,8 +214,6 @@ module FSelector
         abort "[#{__FILE__}@#{__LINE__}]: "+
               "data must be a Hash object!"
       end
-
-      data
     end
 
 
@@ -199,11 +229,16 @@ module FSelector
     end
 
 
+    #
     # number of samples
+    #
+    # @return [Integer] sample size
+    #
     def get_sample_size
       @sz ||= get_data.values.flatten.size
     end
-
+
+
     #
     # get scores of all features for all classes
     #
@@ -257,10 +292,9 @@ module FSelector
     #
     # reconstruct data with selected features
     #
-    # @return [Hash] data after feature selection
-    # @note derived class must implement its own get\_subset(),
-    #   and data structure will be altered. For now, only the algorithms of
-    #   CFS\_c, CFS\_d and FCBF implemented such functions
+    # @note data structure will be altered. Dderived class must
+    #   implement its own get\_subset(). This is only available for
+    #   the feature subset selection type of algorithms
     #
     def select_feature!
       subset = get_feature_subset
@@ -279,14 +313,14 @@ module FSelector
 
 
     #
-    # reconstruct data with feature scores satisfying cutoff
+    # reconstruct data by feature score satisfying criterion
     #
     # @param [String] criterion
-    #   valid criterion can be '>0.5', '>=0.4', '==2', '<=1' or '<0.2'
+    #   valid criterion can be '>0.5', '>=0.4', '==2.0', '<=1.0' or '<0.2'
     # @param [Hash] my_scores
     #   user customized feature scores
-    # @return [Hash] data after feature selection
-    # @note data structure will be altered
+    # @note data structure will be altered. This is only available for
+    #   the feature weighting type of algorithms
     #
     def select_feature_by_score!(criterion, my_scores=nil)
       # user scores or internal scores
@@ -305,14 +339,14 @@ module FSelector
 
 
     #
-    # reconstruct data by rank
+    # reconstruct data by feature rank satisfying criterion
     #
     # @param [String] criterion
     #   valid criterion can be '>11', '>=10', '==1', '<=10' or '<20'
     # @param [Hash] my_ranks
    #   user customized feature ranks
-    # @return [Hash] data after feature selection
-    # @note data structure will be altered
+    # @note data structure will be altered. This is only available for
+    #   the feature weighting type of algorithms
     #
     def select_feature_by_rank!(criterion, my_ranks=nil)
       # user ranks or internal ranks
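One detail worth noting from the get\_data\_copy documentation above: the Marshal dump/load round-trip yields a deep copy, so mutating the copy leaves the internal data untouched. A quick self-contained illustration of the same round-trip:

```ruby
# Deep copy via the Marshal round-trip used by get_data_copy.
data = { c1: [{ f1: 1 }], c2: [{ f1: 2 }] }
copy = Marshal.load(Marshal.dump(data))

copy[:c1][0][:f1] = 99   # mutate the nested Hash inside the copy
p data[:c1][0][:f1]      # => 1 (a shallow dup would have printed 99)
```

A plain `data.dup` would share the nested Arrays and Hashes, which is why the gem reaches for Marshal here.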
data/lib/fselector/algo_base/base_CFS.rb CHANGED
@@ -59,7 +59,7 @@ module FSelector
 
 
     # handle missing values
-    # CFS replaces missing values with the mean for continous features and
+    # CFS replaces missing values with the mean for continuous features and
     # the most seen value for discrete features
     def handle_missing_values
       abort "[#{__FILE__}@#{__LINE__}]: "+
@@ -104,8 +104,8 @@ module FSelector
 
       if not @f2idx
         @f2idx = {}
-        fvs = get_features
-        fvs.each_with_index { |f, idx| @f2idx[f] = idx }
+        fs = get_features
+        fs.each_with_index { |_f, idx| @f2idx[_f] = idx }
       end
 
       if @f2idx[f] > @f2idx[s]
data/lib/fselector/algo_base/base_Relief.rb CHANGED
@@ -10,14 +10,16 @@ module FSelector
   #
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
-  class BaseRelief < Base
+  class BaseRelief < Base
+    # include ReplaceMissingValue module
+    include ReplaceMissingValues
+
     #
-    # new()
+    # intialize from an existing data structure
     #
     # @param [Integer] m number of samples to be used
     #   for estimating feature contribution. max can be
     #   the number of training samples
-    # @param [Hash] data existing data structure
     #
     def initialize(m=30, data=nil)
       super(data)
data/lib/fselector/algo_base/base_ReliefF.rb CHANGED
@@ -12,13 +12,12 @@ module FSelector
   #
   class BaseReliefF < Base
     #
-    # new()
+    # intialize from an existing data structure
     #
     # @param [Integer] m number of samples to be used
     #   for estimating feature contribution. max can be
     #   the number of training samples
     # @param [Integer] k number of k-nearest neighbors
-    # @param [Hash] data existing data structure
     #
     def initialize(m=30, k=10, data=nil)
       super(data)
@@ -106,21 +105,21 @@ module FSelector
       if not @f2mvp
         @f2mvp = {}
 
-        each_feature do |f|
-          @f2mvp[f] = {}
+        each_feature do |_f|
+          @f2mvp[_f] = {}
 
-          each_class do |k|
-            @f2mvp[f][k] = {}
+          each_class do |_k|
+            @f2mvp[_f][_k] = {}
 
-            fvs = get_feature_values(f).uniq
+            fvs = get_feature_values(_f).uniq
             fvs.each do |v|
               n = 0.0
 
-              get_data[k].each do |s|
-                n += 1 if s.has_key?(f) and s[f] == v
+              get_data[_k].each do |s|
+                n += 1 if s.has_key?(_f) and s[_f] == v
               end
 
-              @f2mvp[f][k][v] = n/get_data[k].size
+              @f2mvp[_f][_k][v] = n/get_data[_k].size
             end
           end
         end
data/lib/fselector/algo_base/base_continuous.rb CHANGED
@@ -3,7 +3,7 @@
 #
 module FSelector
   #
-  # base algorithm for handling continous feature
+  # base algorithm for continuous feature
   #
   class BaseContinuous < Base
     # include normalizer
data/lib/fselector/algo_base/base_discrete.rb CHANGED
@@ -3,9 +3,9 @@
 #
 module FSelector
   #
-  # base alogrithm for handling discrete feature
+  # base alogrithm for discrete feature
   #
-  # 2 x 2 contingency table
+  # many algos are based on the following 2 x 2 contingency table
  #
   #         c   c'
   #       ---------
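The base_discrete comment above introduces the 2 x 2 contingency table that many of the discrete algorithms in the README table (Precision, Sensitivity/Recall, Specificity, and others) are built on. A hedged sketch of three such metrics in terms of the table's four cells (cell and method names here are illustrative, not the gem's internals):

```ruby
# Metrics over the four cells of a 2 x 2 contingency table
# (feature present/absent vs. class c/c'); names are illustrative.
def precision(tp, fp)
  tp.to_f / (tp + fp)    # fraction of feature-positive samples in class c
end

def sensitivity(tp, fn)
  tp.to_f / (tp + fn)    # a.k.a. Recall
end

def specificity(tn, fp)
  tn.to_f / (tn + fp)
end

tp, fp, fn, tn = 40, 10, 20, 30
puts precision(tp, fp)    # => 0.8
puts specificity(tn, fp)  # => 0.75
```

Each weighting algorithm in the table scores a feature per class from these same four counts, which is why the base class centralizes the contingency table.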