fselector 1.1.0 → 1.2.0

Files changed (34)
  1. data/ChangeLog +8 -0
  2. data/README.md +74 -77
  3. data/lib/fselector.rb +2 -1
  4. data/lib/fselector/algo_base/base.rb +1 -2
  5. data/lib/fselector/algo_base/base_Relief.rb +1 -3
  6. data/lib/fselector/algo_base/base_ReliefF.rb +1 -0
  7. data/lib/fselector/algo_base/base_continuous.rb +1 -3
  8. data/lib/fselector/algo_base/base_discrete.rb +3 -0
  9. data/lib/fselector/algo_continuous/CFS_c.rb +1 -2
  10. data/lib/fselector/algo_continuous/{FTest.rb → F-Test.rb} +1 -31
  11. data/lib/fselector/algo_continuous/KS-CCBF.rb +125 -0
  12. data/lib/fselector/algo_continuous/KS-Test.rb +51 -0
  13. data/lib/fselector/algo_continuous/{PMetric.rb → P-Metric.rb} +0 -0
  14. data/lib/fselector/algo_continuous/ReliefF_c.rb +1 -2
  15. data/lib/fselector/algo_continuous/Relief_c.rb +1 -2
  16. data/lib/fselector/algo_continuous/{TScore.rb → T-Score.rb} +1 -1
  17. data/lib/fselector/algo_discrete/CFS_d.rb +2 -1
  18. data/lib/fselector/algo_discrete/ChiSquaredTest.rb +1 -0
  19. data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +6 -6
  20. data/lib/fselector/algo_discrete/{GMean.rb → G-Mean.rb} +1 -1
  21. data/lib/fselector/algo_discrete/INTERACT.rb +3 -3
  22. data/lib/fselector/algo_discrete/InformationGain.rb +1 -1
  23. data/lib/fselector/algo_discrete/J-Measure.rb +51 -0
  24. data/lib/fselector/algo_discrete/KL-Divergence.rb +65 -0
  25. data/lib/fselector/algo_discrete/LasVegasFilter.rb +3 -2
  26. data/lib/fselector/algo_discrete/LasVegasIncremental.rb +3 -2
  27. data/lib/fselector/algo_discrete/McNemarsTest.rb +1 -0
  28. data/lib/fselector/algo_discrete/Power.rb +1 -0
  29. data/lib/fselector/algo_discrete/Random.rb +1 -0
  30. data/lib/fselector/algo_discrete/ReliefF_d.rb +3 -0
  31. data/lib/fselector/algo_discrete/Relief_d.rb +3 -0
  32. data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb +1 -1
  33. data/lib/fselector/discretizer.rb +5 -6
  34. metadata +12 -8
data/ChangeLog CHANGED
@@ -1,3 +1,11 @@
+ 2012-05-20 version 1.2.0
+
+ * add KS-Test algorithm for continuous feature
+ * add KS-CCBF algorithm for continuous feature
+ * add J-Measure algorithm for discrete feature
+ * add KL-Divergence algorithm for discrete feature
+ * include the Discretizer module in algorithms requiring data with discrete feature, which allows them to handle continuous feature after discretization; algorithms requiring data with continuous feature no longer include the Discretizer module
+
  2012-05-15 version 1.1.0
 
  * add replace_by_median_value! for replacing missing value with feature median value
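The four additions are all reachable through shortcut constants defined later in this diff (KST, JM, KLD; KS_CCBF is already the class name). A minimal smoke-test sketch, assuming only what the diff itself defines:

    require 'fselector'

    kst  = FSelector::KST.new       # Kolmogorov-Smirnov Test, continuous, two-class
    ccbf = FSelector::KS_CCBF.new   # KS Class Correlation-Based Filter, continuous subset
    jm   = FSelector::JM.new        # J-Measure, discrete weighting
    kld  = FSelector::KLD.new       # Kullback-Leibler Divergence, discrete weighting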
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 1.1.0
- **Release Date**: 2012-05-15
+ **Latest Version**: 1.2.0
+ **Release Date**: 2012-05-20
 
  Synopsis
  --------
@@ -42,65 +42,69 @@ Feature List
 
  **2. available feature selection/ranking algorithms**
 
- algorithm                          alias        algo_type   feature_type   applicability
- --------------------------------------------------------------------------------------------------
- Accuracy                           Acc          weighting   discrete
- AccuracyBalanced                   Acc2         weighting   discrete
- BiNormalSeparation                 BNS          weighting   discrete
- CFS_d                              CFS_d        subset      discrete
- ChiSquaredTest                     CHI          weighting   discrete
- CorrelationCoefficient             CC           weighting   discrete
- DocumentFrequency                  DF           weighting   discrete
- F1Measure                          F1           weighting   discrete
- FishersExactTest                   FET          weighting   discrete
- FastCorrelationBasedFilter         FCBF         subset      discrete
- GiniIndex                          GI           weighting   discrete
- GMean                              GM           weighting   discrete
- GSSCoefficient                     GSS          weighting   discrete
- InformationGain                    IG           weighting   discrete
- INTERACT                           INTERACT     subset      discrete
- LasVegasFilter                     LVF          subset      discrete
- LasVegasIncremental                LVI          subset      discrete
- MatthewsCorrelationCoefficient     MCC, PHI     weighting   discrete
- McNemarsTest                       MNT          weighting   discrete
- OddsRatio                          OR           weighting   discrete
- OddsRatioNumerator                 ORN          weighting   discrete
- PhiCoefficient                     Phi          weighting   discrete
- Power                              Power        weighting   discrete
- Precision                          Precision    weighting   discrete
- ProbabilityRatio                   PR           weighting   discrete
- Random                             Random       weighting   discrete
- Recall                             Recall       weighting   discrete
- Relief_d                           Relief_d     weighting   discrete       two-class, no missing data
- ReliefF_d                          ReliefF_d    weighting   discrete
- Sensitivity                        SN, Recall   weighting   discrete
- Specificity                        SP           weighting   discrete
- SymmetricalUncertainty             SU           weighting   discrete
- BetweenWithinClassesSumOfSquare    BSS_WSS      weighting   continuous
- CFS_c                              CFS_c        subset      continuous
- FTest                              FT           weighting   continuous
- PMetric                            PM           weighting   continuous     two-class
- Relief_c                           Relief_c     weighting   continuous     two-class, no missing data
- ReliefF_c                          ReliefF_c    weighting   continuous
- TScore                             TS           weighting   continuous     two-class
- WilcoxonRankSum                    WRS          weighting   continuous     two-class
+ algorithm                          shortcut     algo_type   feature_type           applicability
+ --------------------------------------------------------------------------------------------------------
+ Accuracy                           Acc          weighting   discrete               multi-class
+ AccuracyBalanced                   Acc2         weighting   discrete               multi-class
+ BiNormalSeparation                 BNS          weighting   discrete               multi-class
+ CFS_d                              CFS_d        subset      discrete               multi-class
+ ChiSquaredTest                     CHI          weighting   discrete               multi-class
+ CorrelationCoefficient             CC           weighting   discrete               multi-class
+ DocumentFrequency                  DF           weighting   discrete               multi-class
+ F1Measure                          F1           weighting   discrete               multi-class
+ FishersExactTest                   FET          weighting   discrete               multi-class
+ FastCorrelationBasedFilter         FCBF         subset      discrete               multi-class
+ GiniIndex                          GI           weighting   discrete               multi-class
+ GMean                              GM           weighting   discrete               multi-class
+ GSSCoefficient                     GSS          weighting   discrete               multi-class
+ InformationGain                    IG           weighting   discrete               multi-class
+ INTERACT                           INTERACT     subset      discrete               multi-class
+ JMeasure                           JM           weighting   discrete               multi-class
+ KLDivergence                       KLD          weighting   discrete               multi-class
+ LasVegasFilter                     LVF          subset      discrete, continuous   multi-class
+ LasVegasIncremental                LVI          subset      discrete, continuous   multi-class
+ MatthewsCorrelationCoefficient     MCC, PHI     weighting   discrete               multi-class
+ McNemarsTest                       MNT          weighting   discrete               multi-class
+ OddsRatio                          OR           weighting   discrete               multi-class
+ OddsRatioNumerator                 ORN          weighting   discrete               multi-class
+ PhiCoefficient                     PHI          weighting   discrete               multi-class
+ Power                              Power        weighting   discrete               multi-class
+ Precision                          Precision    weighting   discrete               multi-class
+ ProbabilityRatio                   PR           weighting   discrete               multi-class
+ Random                             Random       weighting   discrete               multi-class
+ Recall                             Recall       weighting   discrete               multi-class
+ Relief_d                           Relief_d     weighting   discrete               two-class, no missing data
+ ReliefF_d                          ReliefF_d    weighting   discrete               multi-class
+ Sensitivity                        SN, Recall   weighting   discrete               multi-class
+ Specificity                        SP           weighting   discrete               multi-class
+ SymmetricalUncertainty             SU           weighting   discrete               multi-class
+ BetweenWithinClassesSumOfSquare    BSS_WSS      weighting   continuous             multi-class
+ CFS_c                              CFS_c        subset      continuous             multi-class
+ FTest                              FT           weighting   continuous             multi-class
+ KS_CCBF                            KS_CCBF      subset      continuous             multi-class
+ KSTest                             KST          weighting   continuous             two-class
+ PMetric                            PM           weighting   continuous             two-class
+ Relief_c                           Relief_c     weighting   continuous             two-class, no missing data
+ ReliefF_c                          ReliefF_c    weighting   continuous             multi-class
+ TScore                             TS           weighting   continuous             two-class
+ WilcoxonRankSum                    WRS          weighting   continuous             two-class
 
  **note for feature selection interface:**
  there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
 
- - for weighting type: use either **select\_feature\_by\_rank!** or **select\_feature\_by\_score!**
+ - for weighting type: use either **select\_feature\_by\_score!** or **select\_feature\_by\_rank!**
  - for subset type: use **select\_feature!**
 
  **3. feature selection approaches**
 
  - by a single algorithm
  - by multiple algorithms in a tandem manner
- - by multiple algorithms in an ensemble manner (share same feature selection interface as single algorithm)
+ - by multiple algorithms in an ensemble manner (shares the same feature selection interface as a single algorithm)
 
  **4. available normalization and discretization algorithms for continuous feature**
 
  algorithm                        note
- -------------------------------------------------------------------------------
+ ---------------------------------------------------------------------------------------
  normalize_by_log!                normalize by logarithmic transformation
  normalize_by_min_max!            normalize by scaling into [min, max]
  normalize_by_zscore!             normalize by converting into zscore
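To make the weighting-versus-subset distinction in the note above concrete, here is a minimal sketch combining calls that appear elsewhere in this README; the score-threshold string passed to select_feature_by_score! is an illustrative guess modeled on the rank-threshold format, not taken from the source:

    require 'fselector'

    # weighting type: each feature gets a score, then cut by rank or score
    w = FSelector::IG.new
    w.data_from_csv('test/iris.csv')
    w.discretize_by_ChiMerge!(0.10)        # IG needs discrete features
    w.select_feature_by_rank!('<=1')       # keep the top-ranked feature
    # w.select_feature_by_score!('>=0.5')  # hypothetical score threshold, same idea

    # subset type: the algorithm itself decides the final subset
    s = FSelector::FCBF.new(0.0)
    s.data_from_csv('test/iris.csv')
    s.discretize_by_ChiMerge!(0.10)        # FCBF needs discrete features too
    s.select_feature!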
@@ -108,13 +112,13 @@ Feature List
  discretize_by_equal_frequency!   discretize by equal frequency among intervals
  discretize_by_ChiMerge!          discretize by ChiMerge algorithm
  discretize_by_Chi2!              discretize by Chi2 algorithm
- discretize_by_MID!               discretize by Multi-Interval Discretization
- discretize_by_TID!               discretize by Three-Interval Discretization
+ discretize_by_MID!               discretize by Multi-Interval Discretization algorithm
+ discretize_by_TID!               discretize by Three-Interval Discretization algorithm
 
  **5. available algorithms for replacing missing feature values**
 
  algorithm                  note                              feature_type
- ---------------------------------------------------------------------------------------------------------
+ ---------------------------------------------------------------------------------------------
  replace_by_fixed_value!    replace by a fixed value          discrete, continuous
  replace_by_mean_value!     replace by mean feature value     continuous
  replace_by_median_value!   replace by median feature value   continuous
@@ -141,8 +145,8 @@ Usage
 
  require 'fselector'
 
- # use InformationGain as a feature selection algorithm
- r1 = FSelector::InformationGain.new
+ # use InformationGain (IG) as a feature selection algorithm
+ r1 = FSelector::IG.new
 
  # read from random data (or csv, libsvm, weka ARFF file)
  # no. of samples: 100
@@ -161,10 +165,10 @@ Usage
  # number of features after feature selection
  puts " # features (after): "+ r1.get_features.size.to_s
 
- # you can also use multiple alogirithms in a tandem manner
- # e.g. use the ChiSquaredTest with Yates' continuity correction
+ # you can also use a second algorithm for further feature selection
+ # e.g. use the ChiSquaredTest (CHI) with Yates' continuity correction
  # initialize from r1's data
- r2 = FSelector::ChiSquaredTest.new(:yates, r1.get_data)
+ r2 = FSelector::CHI.new(:yates, r1.get_data)
 
  # number of features before feature selection
  puts " # features (before): "+ r2.get_features.size.to_s
@@ -216,18 +220,18 @@ Usage
 
 
  # creating an ensemble of feature selectors by using
- # two feature selection algorithms (InformationGain and Relief_d).
+ # two feature selection algorithms: InformationGain (IG) and Relief_d.
  # note: can be 2+ algorithms, as long as they are of the same type,
  # either feature weighting or feature subset selection algorithms
 
  # test for the type of feature weighting algorithms
- r1 = FSelector::InformationGain.new
+ r1 = FSelector::IG.new
  r2 = FSelector::Relief_d.new(10)
 
  # an ensemble of two feature selectors
  re = FSelector::EnsembleMultiple.new(r1, r2)
 
- # read random data
+ # read random discrete data (containing missing values)
  re.data_from_random(100, 2, 15, 3, true)
 
  # replace missing value because Relief_d
@@ -247,35 +251,28 @@ Usage
  # number of features after feature selection
  puts ' # features (after): ' + re.get_features.size.to_s
 
- **3. normalization and discretization before feature selection**
-
- In addition to the algorithms designed for continuous feature, one
- can apply those deisgned for discrete feature after (optionally
- normalization and) discretization
+ **3. feature selection after discretization**
 
  require 'fselector'
 
- # for continuous feature
- r1 = FSelector::Relief_c.new
+ # the InformationGain (IG) algorithm requires data with discrete feature
+ r = FSelector::IG.new
 
- # read the Iris data set (under the test/ directory)
- r1.data_from_csv('test/iris.csv')
+ # but the Iris data set (under the test/ directory) contains continuous features
+ r.data_from_csv('test/iris.csv')
 
- # discretization by ChiMerge algorithm at alpha=0.10
- r1.discretize_by_ChiMerge!(0.10)
-
- # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
- # initialize with discretized data from r1
- r2 = FSelector::FCBF.new(0.0, r1.get_data)
+ # let's first discretize it by the ChiMerge algorithm at alpha=0.10,
+ # then perform feature selection as usual
+ r.discretize_by_ChiMerge!(0.10)
 
  # number of features before feature selection
- puts ' # features (before): ' + r2.get_features.size.to_s
+ puts ' # features (before): ' + r.get_features.size.to_s
 
- # feature selection
- r2.select_feature!
+ # select the top-ranked feature
+ r.select_feature_by_rank!('<=1')
 
  # number of features after feature selection
- puts ' # features (after): ' + r2.get_features.size.to_s
+ puts ' # features (after): ' + r.get_features.size.to_s
 
  **4. see more examples test_*.rb under the test/ directory**
 
data/lib/fselector.rb CHANGED
@@ -1,12 +1,13 @@
  # access to the statistical routines in R package
  require 'rinruby'
+ R.eval 'options(warn = -1)' # suppress R warnings
 
  #
  # FSelector: a Ruby gem for feature selection and ranking
  #
  module FSelector
    # module version
-   VERSION = '1.1.0'
+   VERSION = '1.2.0'
  end
 
  # the root dir of FSelector
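Setting options(warn = -1) at require time silences R warnings for the whole rinruby session. If a caller wants warnings back, the standard R idiom of saving and restoring options works through the same interface; a sketch (only rinruby's R.eval and variable pull are assumed, the option values are plain R):

    require 'rinruby'

    R.eval 'old <- options(warn = -1)'   # silence R warnings, remembering the old setting
    R.eval 'ks <- ks.test(rnorm(50), rnorm(50))$statistic'
    R.eval 'options(old)'                # restore the previous warning level
    puts R.ks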
data/lib/fselector/algo_base/base.rb CHANGED
@@ -6,9 +6,8 @@ module FSelector
  # base class for a single feature selection algorithm
  #
  class Base
-   # include FileIO
+   # include module
    include FileIO
-   # include ReplaceMissingValues
    include ReplaceMissingValues
 
    class << self
data/lib/fselector/algo_base/base_Relief.rb CHANGED
@@ -11,9 +11,6 @@ module FSelector
  # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
  #
  class BaseRelief < Base
-   # include ReplaceMissingValue module
-   include ReplaceMissingValues
-
    #
    # initialize from an existing data structure
    #
@@ -23,6 +20,7 @@ module FSelector
    #
    def initialize(m=30, data=nil)
      super(data)
+
      @m = m || 30 # default 30
    end
 
data/lib/fselector/algo_base/base_ReliefF.rb CHANGED
@@ -21,6 +21,7 @@ module FSelector
    #
    def initialize(m=30, k=10, data=nil)
      super(data)
+
      @m = m || 30 # default 30
      @k = k || 10 # default 10
    end
data/lib/fselector/algo_base/base_continuous.rb CHANGED
@@ -6,10 +6,8 @@ module FSelector
  # base algorithm for continuous feature
  #
  class BaseContinuous < Base
-   # include normalizer
+   # include module
    include Normalizer
-   # include discretizer
-   include Discretizer
 
    # initialize from an existing data structure
    def initialize(data=nil)
data/lib/fselector/algo_base/base_discrete.rb CHANGED
@@ -29,6 +29,9 @@ module FSelector
  # P(f'|c') = D/(B+D)
  #
  class BaseDiscrete < Base
+   # include module
+   include Discretizer
+
    # initialize from an existing data structure
    def initialize(data=nil)
      super(data)
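Because BaseDiscrete now mixes in Discretizer, every discrete-type algorithm inherits the discretize_by_* methods and can accept continuous input directly. A minimal sketch using SymmetricalUncertainty (SU) and the ChiMerge call shown in the README's examples:

    require 'fselector'

    r = FSelector::SU.new               # SymmetricalUncertainty, a BaseDiscrete subclass
    r.data_from_csv('test/iris.csv')    # continuous features
    r.discretize_by_ChiMerge!(0.10)     # available via the new Discretizer mix-in
    r.select_feature_by_rank!('<=2')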
data/lib/fselector/algo_continuous/CFS_c.rb CHANGED
@@ -9,9 +9,8 @@ module FSelector
  # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5673)
  #
  class CFS_c < BaseCFS
-   # include normalizer and discretizer
+   # include module
    include Normalizer
-   include Discretizer
 
    # this algo outputs a subset of feature
    @algo_type = :feature_subset_selection
data/lib/fselector/algo_continuous/{FTest.rb → F-Test.rb} RENAMED
@@ -3,7 +3,7 @@
  #
  module FSelector
    #
-   # F-test (FT) based on F-statistics for continuous feature
+   # F-Test (FT) based on F-statistics for continuous feature
    #
    #      between-group variability
    # FT = ---------------------------
@@ -29,36 +29,6 @@ module FSelector
    private
 
    # calculate contribution of each feature (f) across all classes
-   def calc_contribution2(f)
-     a, b, s = 0.0, 0.0, 0.0
-     ybar = get_feature_values(f).mean
-     kz = get_classes.size.to_f
-     sz = get_sample_size.to_f
-
-     k2ybar = {} # cache
-     each_class do |k|
-       k2ybar[k] = get_feature_values(f, nil, k).mean
-     end
-
-     # a
-     each_class do |k|
-       n_k = get_data[k].size.to_f
-       a += n_k * (k2ybar[k] - ybar)**2 / (kz-1)
-     end
-
-     # b
-     each_sample do |k, s|
-       if s.has_key? f
-         y_ik = s[f]
-         b += (y_ik - k2ybar[k])**2 / (sz-kz)
-       end
-     end
-
-     s = a/b if not b.zero?
-
-     set_feature_score(f, :BEST, s)
-   end # calc_contribution
-
    def calc_contribution(f)
      a, b, s = 0.0, 0.0, 0.0
      ybar = get_feature_values(f).mean
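The deleted calc_contribution2 and the kept calc_contribution compute the same one-way F statistic: between-group variability a = sigma_k n_k*(ybar_k - ybar)^2/(K-1) over within-group variability b = sigma_ik (y_ik - ybar_k)^2/(N-K). The arithmetic, standalone on toy numbers:

    # two classes of a single continuous feature
    groups = { c1: [1.0, 2.0, 3.0], c2: [6.0, 7.0, 8.0] }

    all  = groups.values.flatten
    ybar = all.sum / all.size
    kz   = groups.size.to_f   # number of classes (K)
    sz   = all.size.to_f      # number of samples (N)

    # between-group variability (a) and within-group variability (b)
    a = groups.values.sum { |ys| ys.size * ((ys.sum / ys.size) - ybar)**2 } / (kz - 1)
    b = groups.values.sum { |ys| m = ys.sum / ys.size; ys.sum { |y| (y - m)**2 } } / (sz - kz)

    puts a / b   # F = 37.5 here; larger means better class separation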
data/lib/fselector/algo_continuous/KS-CCBF.rb ADDED
@@ -0,0 +1,125 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # Kolmogorov-Smirnov Class Correlation-Based Filter (KS-CCBF) for continuous feature
+   #
+   # ref: [Feature Selection for Supervised Classification: A Kolmogorov-Smirnov Class Correlation-Based Filter](http://kzi.polsl.pl/~jbiesiada/Infosel/downolad/publikacje/09-Gliwice.pdf)
+   #
+   class KS_CCBF < BaseContinuous
+     # include module
+     include Entropy
+     include Discretizer
+
+     # this algo outputs a subset of feature
+     @algo_type = :feature_subset_selection
+
+     #
+     # initialize from an existing data structure
+     #
+     # @param [Float] lamda threshold value [0, 1] to determine feature redundancy
+     #
+     def initialize(lamda=0.2, data=nil)
+       super(data)
+
+       @lamda = lamda || 0.2
+     end
+
+     private
+
+     # KS-CCBF algorithm
+     def get_feature_subset
+       # make a copy of data, since the discretization method will alter internal data
+       data_bak = get_data_copy
+
+       # stage 1: calculate SUC coefficient
+       # but let's discretize features first
+       discretize_for_suc
+
+       # then SUC
+       f2suc = {}
+       cv = get_class_labels
+       each_feature do |f|
+         fv = get_feature_values(f, :include_missing_values)
+         f2suc[f] = get_symmetrical_uncertainty(fv, cv)
+       end
+
+       # sort features according to descending order of their SUC
+       subset = f2suc.keys.sort { |x, y| f2suc[y] <=> f2suc[x] }
+
+       # restore data; note set_data also clears old variables
+       set_data(data_bak)
+
+       # stage 2: remove redundancy
+       fp = subset.first
+       while fp
+         fq = get_next_element(subset, fp)
+
+         while fq
+           ks = calc_ks(fp, fq)
+
+           if ks < @lamda
+             fq_new = get_next_element(subset, fq)
+             subset.delete(fq) # remove fq
+             fq = fq_new
+           else
+             fq = get_next_element(subset, fq)
+           end
+         end
+
+         fp = get_next_element(subset, fp)
+       end
+
+       subset
+     end # get_feature_subset
+
+
+     # discretize continuous features for calculating the SUC,
+     # which requires discrete features. See the Discretizer module
+     # for available discretization methods. If you want to use an
+     # alternative one, simply override this function
+     def discretize_for_suc
+       discretize_by_ChiMerge!(0.10)
+     end
+
+
+     # get the next element of fp
+     def get_next_element(subset, fp)
+       fq = nil
+
+       idx = subset.index(fp)
+       if idx and idx < subset.size-1
+         fq = subset[idx+1]
+       end
+
+       fq
+     end # get_next_element
+
+
+     # calculate K-S statistic (relying on R package) among all classes
+     def calc_ks(fp, fq)
+       ks = 0.0
+
+       each_class do |k|
+         R.sp = get_feature_values(fp, nil, k)
+         R.sq = get_feature_values(fq, nil, k)
+
+         # K-S test
+         R.eval "ks <- ks.test(sp, sq)$statistic"
+
+         # pull K-S statistic
+         ks_try = R.ks
+
+         # record max ks among classes
+         ks = ks_try if ks_try > ks
+       end
+
+       ks
+     end # calc_ks
+
+
+   end # class
+
+
+ end # module
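A usage sketch for the new class, reusing the Iris CSV that the README's own examples read from test/; KS_CCBF is subset-type, so select_feature! is the right call, and 0.2 simply restates the default lamda:

    require 'fselector'

    r = FSelector::KS_CCBF.new(0.2)     # lamda: K-S redundancy threshold in [0, 1]
    r.data_from_csv('test/iris.csv')    # continuous, multi-class data
    r.select_feature!                   # subset-type interface
    puts r.get_features.size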
data/lib/fselector/algo_continuous/KS-Test.rb ADDED
@@ -0,0 +1,51 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # Kolmogorov-Smirnov Test (KST) for continuous feature
+   #
+   # @note KST is applicable only to two-class problems, and missing data are ignored
+   #
+   # for KST (p-value), the smaller, the better, but we intentionally negate it
+   # so that the larger is always the better (consistent with other algorithms).
+   # R equivalent: ks.test
+   #
+   # ref: [Wikipedia](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) and [Feature Extraction, Foundations and Applications](http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-540-35487-1)
+   #
+   class KSTest < BaseContinuous
+     # this algo outputs weight for each feature
+     @algo_type = :feature_weighting
+
+     private
+
+     # calculate contribution of each feature (f) across all classes
+     def calc_contribution(f)
+       if not get_classes.size == 2
+         abort "[#{__FILE__}@#{__LINE__}]: \n"+
+               "  suitable only for two-class problem with continuous feature"
+       end
+
+       # collect data for class 1 and 2, respectively
+       k1, k2 = get_classes
+       R.s1 = get_feature_values(f, nil, k1) # class 1
+       R.s2 = get_feature_values(f, nil, k2) # class 2
+
+       # K-S test
+       R.eval "rv <- ks.test(s1, s2)$p.value"
+
+       # intentionally negate it
+       s = -1.0 * R.rv # pull the p-value from R
+
+       set_feature_score(f, :BEST, s)
+     end # calc_contribution
+
+
+   end # class
+
+
+   # shortcut so that you can use FSelector::KST instead of FSelector::KSTest
+   KST = KSTest
+
+
+ end # module
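A usage sketch; KSTest aborts unless there are exactly two classes, so the input file here is a hypothetical two-class continuous data set, not one shipped with the gem:

    require 'fselector'

    r = FSelector::KST.new
    r.data_from_csv('test/two_class.csv')   # hypothetical two-class continuous data
    r.select_feature_by_rank!('<=5')        # weighting-type; scores are negated p-values
    puts r.get_features.size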
data/lib/fselector/algo_continuous/ReliefF_c.rb CHANGED
@@ -10,9 +10,8 @@ module FSelector
  # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
  #
  class ReliefF_c < BaseReliefF
-   # include normalizer and discretizer
+   # include module
    include Normalizer
-   include Discretizer
 
    # this algo outputs weight for each feature
    @algo_type = :feature_weighting
data/lib/fselector/algo_continuous/Relief_c.rb CHANGED
@@ -10,9 +10,8 @@ module FSelector
  # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
  #
  class Relief_c < BaseRelief
-   # include normalizer and discretizer
+   # include module
    include Normalizer
-   include Discretizer
 
    # this algo outputs weight for each feature
    @algo_type = :feature_weighting
data/lib/fselector/algo_continuous/{TScore.rb → T-Score.rb} RENAMED
@@ -3,7 +3,7 @@
  #
  module FSelector
    #
-   # t-score (TS) based on Student's t-test for continuous feature
+   # T-Score (TS) based on Student's t-test for continuous feature
    #
    #                |u1 - u2|
    # TS = -------------------------------------
data/lib/fselector/algo_discrete/CFS_d.rb CHANGED
@@ -9,7 +9,8 @@ module FSelector
  # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5673)
  #
  class CFS_d < BaseCFS
-   # include Entropy module
+   # include module
+   include Discretizer
    include Entropy
 
    # this algo outputs a subset of feature
data/lib/fselector/algo_discrete/ChiSquaredTest.rb CHANGED
@@ -30,6 +30,7 @@ module FSelector
    #
    def initialize(correction=:yates, data=nil)
      super(data)
+
      @correction = (correction==:yates) ? true : false
    end
 
data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb CHANGED
@@ -9,7 +9,7 @@ module FSelector
  # ref: [Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution](http://www.hpl.hp.com/conferences/icml2003/papers/144.pdf)
  #
  class FastCorrelationBasedFilter < BaseDiscrete
-   # include Entropy module
+   # include module
    include Entropy
 
    # this algo outputs a subset of feature
@@ -22,6 +22,7 @@ module FSelector
    #
    def initialize(delta=0.0, data=nil)
      super(data)
+
      @delta = delta || 0.0
    end
 
@@ -108,14 +109,13 @@ module FSelector
    end
 
 
+   # get the next element of fp in subset
    def get_next_element(subset, fp)
      fq = nil
 
-     subset.each_with_index do |v, i|
-       if v == fp and i+1 < subset.size
-         fq = subset[i+1]
-         break
-       end
+     idx = subset.index(fp)
+     if idx and idx < subset.size-1
+       fq = subset[idx+1]
      end
 
      fq
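The rewrite replaces a manual scan with Array#index; behavior is unchanged, including nil for a missing fp or for the last element. A quick standalone check of those edge cases:

    def get_next_element(subset, fp)
      idx = subset.index(fp)
      (idx && idx < subset.size - 1) ? subset[idx + 1] : nil
    end

    subset = [:f1, :f2, :f3]
    p get_next_element(subset, :f1)   #=> :f2
    p get_next_element(subset, :f3)   #=> nil (last element)
    p get_next_element(subset, :f9)   #=> nil (not present)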
data/lib/fselector/algo_discrete/{GMean.rb → G-Mean.rb} RENAMED
@@ -3,7 +3,7 @@
  #
  module FSelector
    #
-   # GMean (GM)
+   # G-Mean (GM)
    #
    # GM = sqrt(Sensitivity * Specificity)
    #
data/lib/fselector/algo_discrete/INTERACT.rb CHANGED
@@ -9,9 +9,8 @@ module FSelector
  # ref: [Searching for Interacting Features](http://www.public.asu.edu/~huanliu/papers/ijcai07.pdf)
  #
  class INTERACT < BaseDiscrete
-   # include Entropy module
+   # include module
    include Entropy
-   # include Consistency module
    include Consistency
 
    # this algo outputs a subset of feature
@@ -24,13 +23,14 @@ module FSelector
    #
    def initialize(delta=0.0001, data=nil)
      super(data)
+
      @delta = delta || 0.0001
    end
 
    private
 
    # INTERACT algorithm
-   def get_feature_subset
+   def get_feature_subset
      subset, f2su = get_features.dup, {}
 
      # part 1, get symmetrical uncertainty for each feature
data/lib/fselector/algo_discrete/InformationGain.rb CHANGED
@@ -14,7 +14,7 @@ module FSelector
  # ref: [Using Information Gain to Analyze and Fine Tune the Performance of Supply Chain Trading Agents](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.7895)
  #
  class InformationGain < BaseDiscrete
-   # include Entropy module
+   # include module
    include Entropy
 
    # this algo outputs weight for each feature
data/lib/fselector/algo_discrete/J-Measure.rb ADDED
@@ -0,0 +1,51 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # J-Measure (JM) for discrete feature
+   #
+   #                                               P(y_j|x_i)
+   # JM = sigma_i P(x_i) sigma_j P(y_j|x_i) log ------------
+   #                                                 P(y_j)
+   #
+   # ref: [Feature Extraction, Foundations and Applications](http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-540-35487-1)
+   #
+   class JMeasure < BaseDiscrete
+     # this algo outputs weight for each feature
+     @algo_type = :feature_weighting
+
+     private
+
+     # calculate contribution of each feature (f) across all classes
+     def calc_contribution(f)
+       cv = get_class_labels
+       fv = get_feature_values(f, :include_missing_values)
+       sz = cv.size.to_f # also equals fv.size
+
+       s = 0.0
+       fv.uniq.each do |x|
+         px = fv.count(x)/sz
+
+         cv.uniq.each do |y|
+           py = cv.count(y)/sz
+
+           indices = (0...fv.size).to_a.select { |i| fv[i] == x }
+           pyx = cv.values_at(*indices).count(y)/indices.size.to_f
+
+           s += px * ( pyx * Math.log2(pyx/py) ) if not pyx.zero?
+         end
+       end
+
+       set_feature_score(f, :BEST, s)
+     end # calc_contribution
+
+
+   end # class
+
+
+   # shortcut so that you can use FSelector::JM instead of FSelector::JMeasure
+   JM = JMeasure
+
+
+ end # module
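To make the double sum concrete, the same computation as calc_contribution above, stripped of the gem's accessors and run on a hand-made feature/class pair (values chosen arbitrarily):

    fv = %w[a a b b b]   # feature values x_i over five samples
    cv = %w[p q p p q]   # class labels y_j for the same samples
    sz = cv.size.to_f

    s = 0.0
    fv.uniq.each do |x|
      px = fv.count(x) / sz
      cv.uniq.each do |y|
        py = cv.count(y) / sz
        indices = (0...fv.size).select { |i| fv[i] == x }
        pyx = cv.values_at(*indices).count(y) / indices.size.to_f
        s += px * (pyx * Math.log2(pyx / py)) unless pyx.zero?
      end
    end

    puts s   # the J-Measure score of this toy feature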
data/lib/fselector/algo_discrete/KL-Divergence.rb ADDED
@@ -0,0 +1,65 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # Kullback-Leibler Divergence (KLD) for discrete feature
+   #
+   # w_i = wbar_i / ( -Z * sigma_j ( P(a_ij) logP(a_ij) ) )
+   #
+   # where wbar_i = sigma_j ( P(a_ij) KL(C|a_ij) )
+   #
+   #       KL(C|a_ij) = sigma_c ( P(c|a_ij) log(P(c|a_ij)/P(c)) )
+   #
+   #       Z is a normalization constant
+   #
+   # ref: [Calculating Feature Weights in Naive Bayes with Kullback-Leibler Measure](http://ix.cs.uoregon.edu/~dou/research/papers/icdm11_fw.)
+   #
+   class KLDivergence < BaseDiscrete
+     # this algo outputs weight for each feature
+     @algo_type = :feature_weighting
+
+     private
+
+     # calculate contribution of each feature (f) across all classes;
+     # note the normalization constant Z is ignored, since we need only
+     # the relative feature scores
+     def calc_contribution(f)
+       cv = get_class_labels
+       fv = get_feature_values(f, :include_missing_values)
+       sz = cv.size.to_f # also equals fv.size
+
+       s, w_avg, d = 0.0, 0.0, 0.0
+
+       fv.uniq.each do |x|
+         px = fv.count(x)/sz
+         d += -1.0 * px * Math.log2(px)
+
+         kl_x = 0.0
+
+         cv.uniq.each do |y|
+           py = cv.count(y)/sz
+
+           indices = (0...fv.size).to_a.select { |i| fv[i] == x }
+           pyx = cv.values_at(*indices).count(y)/indices.size.to_f
+
+           kl_x += pyx * Math.log2(pyx/py) if not pyx.zero?
+         end
+
+         w_avg += px * kl_x
+       end
+
+       s = w_avg / d if not d.zero?
+
+       set_feature_score(f, :BEST, s)
+     end # calc_contribution
+
+
+   end # class
+
+
+   # shortcut so that you can use FSelector::KLD instead of FSelector::KLDivergence
+   KLD = KLDivergence
+
+
+ end # module
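Run on the same toy data as the J-Measure sketch, the KLD weight differs only in the entropy denominator: the P(x)-weighted KL terms are divided by the feature's own entropy, so many-valued features are not automatically favored (Z dropped, as in the code above):

    fv = %w[a a b b b]
    cv = %w[p q p p q]
    sz = cv.size.to_f

    w_avg = 0.0   # P(x)-weighted KL(C|x)
    d     = 0.0   # feature entropy, the Z-free normalizer
    fv.uniq.each do |x|
      px = fv.count(x) / sz
      d += -px * Math.log2(px)
      indices = (0...fv.size).select { |i| fv[i] == x }
      kl_x = cv.uniq.sum do |y|
        pyx = cv.values_at(*indices).count(y) / indices.size.to_f
        py  = cv.count(y) / sz
        pyx.zero? ? 0.0 : pyx * Math.log2(pyx / py)
      end
      w_avg += px * kl_x
    end

    puts w_avg / d   # the KLD feature weight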
data/lib/fselector/algo_discrete/LasVegasFilter.rb CHANGED
@@ -11,7 +11,7 @@ module FSelector
  # ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
  #
  class LasVegasFilter < BaseDiscrete
-   # include Consistency module
+   # include module
    include Consistency
 
    # this algo outputs a subset of feature
@@ -24,13 +24,14 @@ module FSelector
    #
    def initialize(max_iter=100, data=nil)
      super(data)
+
      @max_iter = max_iter || 100
    end
 
    private
 
    # Las Vegas Filter (LVF) algorithm
-   def get_feature_subset
+   def get_feature_subset
      inst_cnt = get_instance_count
      j0 = get_IR_by_count(inst_cnt)
 
data/lib/fselector/algo_discrete/LasVegasIncremental.rb CHANGED
@@ -9,7 +9,7 @@ module FSelector
  # ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
  #
  class LasVegasIncremental < BaseDiscrete
-   # include Consistency module
+   # include module
    include Consistency
 
    # this algo outputs a subset of feature
@@ -23,6 +23,7 @@ module FSelector
    #
    def initialize(max_iter=100, portion=0.10, data=nil)
      super(data)
+
      @max_iter = max_iter || 100
      @portion = portion || 0.10
    end
@@ -30,7 +31,7 @@ module FSelector
    private
 
    # Las Vegas Incremental (LVI) algorithm
-   def get_feature_subset
+   def get_feature_subset
      data = get_data # working dataset
      s0, s1 = portion(data)
      feats = get_features
data/lib/fselector/algo_discrete/McNemarsTest.rb CHANGED
@@ -25,6 +25,7 @@ module FSelector
    #
    def initialize(correction=:yates, data=nil)
      super(data)
+
      @correction = (correction==:yates) ? true : false
    end
 
data/lib/fselector/algo_discrete/Power.rb CHANGED
@@ -24,6 +24,7 @@ module FSelector
    #
    def initialize(k=5, data=nil)
      super(data)
+
      @k = k || 5
    end
 
data/lib/fselector/algo_discrete/Random.rb CHANGED
@@ -22,6 +22,7 @@ module FSelector
    #
    def initialize(seed=nil, data=nil)
      super(data)
+
      srand(seed) if seed
    end
 
data/lib/fselector/algo_discrete/ReliefF_d.rb CHANGED
@@ -9,6 +9,9 @@ module FSelector
  # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
  #
  class ReliefF_d < BaseReliefF
+   # include module
+   include Discretizer
+
    # this algo outputs weight for each feature
    @algo_type = :feature_weighting
 
data/lib/fselector/algo_discrete/Relief_d.rb CHANGED
@@ -10,6 +10,9 @@ module FSelector
  # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
  #
  class Relief_d < BaseRelief
+   # include module
+   include Discretizer
+
    # this algo outputs weight for each feature
    @algo_type = :feature_weighting
 
data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb CHANGED
@@ -17,7 +17,7 @@ module FSelector
  # ref: [Wikipedia](http://en.wikipedia.org/wiki/Symmetric_uncertainty) and [Robust Feature Selection Using Ensemble Feature Selection Techniques](http://dl.acm.org/citation.cfm?id=1432021)
  #
  class SymmetricalUncertainty < BaseDiscrete
-   # include Entropy module
+   # include module
    include Entropy
 
    # this algo outputs weight for each feature
data/lib/fselector/discretizer.rb CHANGED
@@ -2,11 +2,10 @@
  # discretize continuous feature
  #
  module Discretizer
-   # include Entropy module
-   include Entropy
-   # include Consistency module
+   # include module
    include Consistency
-
+   include Entropy
+
    #
    # discretize by equal-width intervals
    #
@@ -157,7 +156,7 @@ module Discretizer
    #
    # ref: [Chi2: Feature Selection and Discretization of Numeric Attributes](http://sci2s.ugr.es/keel/pdf/specific/congreso/liu1995.pdf)
    #
-   def discretize_by_Chi2!(delta=0.02)
+   def discretize_by_Chi2!(delta=0.02)
      # degree of freedom equals one less than number of classes
      df = get_classes.size-1
 
@@ -270,7 +269,7 @@ module Discretizer
    #
    # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
    #
-   def discretize_by_MID!
+   def discretize_by_MID!
      # determine the final boundaries
      f2cp = {} # cut points for each feature
      each_feature do |f|
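With Discretizer now mixed into BaseDiscrete (and explicitly into CFS_d, Relief_d, and ReliefF_d elsewhere in this diff), those classes call the methods above directly. A sketch using the Chi2 default from the signature shown:

    require 'fselector'

    r = FSelector::CFS_d.new            # discrete-type; mixes in Discretizer
    r.data_from_csv('test/iris.csv')    # continuous features
    r.discretize_by_Chi2!(0.02)         # delta default from discretize_by_Chi2! above
    r.select_feature!                   # CFS_d is subset-type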
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: fselector
  version: !ruby/object:Gem::Version
-   version: 1.1.0
+   version: 1.2.0
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-05-15 00:00:00.000000000 Z
+ date: 2012-05-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rinruby
-   requirement: &28540080 !ruby/object:Gem::Requirement
+   requirement: &23863908 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
        version: 2.0.2
    type: :runtime
    prerelease: false
-   version_requirements: *28540080
+   version_requirements: *23863908
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
    algorithms and related functions into one single package. Welcome to contact me
    (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -50,11 +50,13 @@ files:
  - lib/fselector/algo_base/base_ReliefF.rb
  - lib/fselector/algo_continuous/BSS_WSS.rb
  - lib/fselector/algo_continuous/CFS_c.rb
- - lib/fselector/algo_continuous/FTest.rb
- - lib/fselector/algo_continuous/PMetric.rb
+ - lib/fselector/algo_continuous/F-Test.rb
+ - lib/fselector/algo_continuous/KS-CCBF.rb
+ - lib/fselector/algo_continuous/KS-Test.rb
+ - lib/fselector/algo_continuous/P-Metric.rb
  - lib/fselector/algo_continuous/ReliefF_c.rb
  - lib/fselector/algo_continuous/Relief_c.rb
- - lib/fselector/algo_continuous/TScore.rb
+ - lib/fselector/algo_continuous/T-Score.rb
  - lib/fselector/algo_continuous/WilcoxonRankSum.rb
  - lib/fselector/algo_discrete/Accuracy.rb
  - lib/fselector/algo_discrete/AccuracyBalanced.rb
@@ -66,11 +68,13 @@ files:
  - lib/fselector/algo_discrete/F1Measure.rb
  - lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb
  - lib/fselector/algo_discrete/FishersExactTest.rb
+ - lib/fselector/algo_discrete/G-Mean.rb
  - lib/fselector/algo_discrete/GiniIndex.rb
- - lib/fselector/algo_discrete/GMean.rb
  - lib/fselector/algo_discrete/GSSCoefficient.rb
  - lib/fselector/algo_discrete/InformationGain.rb
  - lib/fselector/algo_discrete/INTERACT.rb
+ - lib/fselector/algo_discrete/J-Measure.rb
+ - lib/fselector/algo_discrete/KL-Divergence.rb
  - lib/fselector/algo_discrete/LasVegasFilter.rb
  - lib/fselector/algo_discrete/LasVegasIncremental.rb
  - lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb