fselector 1.1.0 → 1.2.0
- data/ChangeLog +8 -0
- data/README.md +74 -77
- data/lib/fselector.rb +2 -1
- data/lib/fselector/algo_base/base.rb +1 -2
- data/lib/fselector/algo_base/base_Relief.rb +1 -3
- data/lib/fselector/algo_base/base_ReliefF.rb +1 -0
- data/lib/fselector/algo_base/base_continuous.rb +1 -3
- data/lib/fselector/algo_base/base_discrete.rb +3 -0
- data/lib/fselector/algo_continuous/CFS_c.rb +1 -2
- data/lib/fselector/algo_continuous/{FTest.rb → F-Test.rb} +1 -31
- data/lib/fselector/algo_continuous/KS-CCBF.rb +125 -0
- data/lib/fselector/algo_continuous/KS-Test.rb +51 -0
- data/lib/fselector/algo_continuous/{PMetric.rb → P-Metric.rb} +0 -0
- data/lib/fselector/algo_continuous/ReliefF_c.rb +1 -2
- data/lib/fselector/algo_continuous/Relief_c.rb +1 -2
- data/lib/fselector/algo_continuous/{TScore.rb → T-Score.rb} +1 -1
- data/lib/fselector/algo_discrete/CFS_d.rb +2 -1
- data/lib/fselector/algo_discrete/ChiSquaredTest.rb +1 -0
- data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +6 -6
- data/lib/fselector/algo_discrete/{GMean.rb → G-Mean.rb} +1 -1
- data/lib/fselector/algo_discrete/INTERACT.rb +3 -3
- data/lib/fselector/algo_discrete/InformationGain.rb +1 -1
- data/lib/fselector/algo_discrete/J-Measure.rb +51 -0
- data/lib/fselector/algo_discrete/KL-Divergence.rb +65 -0
- data/lib/fselector/algo_discrete/LasVegasFilter.rb +3 -2
- data/lib/fselector/algo_discrete/LasVegasIncremental.rb +3 -2
- data/lib/fselector/algo_discrete/McNemarsTest.rb +1 -0
- data/lib/fselector/algo_discrete/Power.rb +1 -0
- data/lib/fselector/algo_discrete/Random.rb +1 -0
- data/lib/fselector/algo_discrete/ReliefF_d.rb +3 -0
- data/lib/fselector/algo_discrete/Relief_d.rb +3 -0
- data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb +1 -1
- data/lib/fselector/discretizer.rb +5 -6
- metadata +12 -8
data/ChangeLog
CHANGED
@@ -1,3 +1,11 @@
+2012-05-20  version 1.2.0
+
+    * add KS-Test algorithm for continuous feature
+    * add KS-CCBF algorithm for continuous feature
+    * add J-Measure algorithm for discrete feature
+    * add KL-Divergence algorithm for discrete feature
+    * include the Discretizer module in algorithms requiring data with discrete feature, which allows them to handle continuous feature after discretization. Algorithms requiring data with continuous feature no longer include the Discretizer module
+
 2012-05-15  version 1.1.0
 
     * add replace_by_median_value! for replacing missing value with feature median value
data/README.md
CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 1.1.0
-**Release Date**: 2012-05-15
+**Latest Version**: 1.2.0
+**Release Date**: 2012-05-20
 
 Synopsis
 --------
@@ -42,65 +42,69 @@ Feature List
 
 **2. available feature selection/ranking algorithms**
 
-    algorithm
-
-    Accuracy
-    AccuracyBalanced
-    BiNormalSeparation
-    CFS_d
-    ChiSquaredTest
-    CorrelationCoefficient
-    DocumentFrequency
-    F1Measure
-    FishersExactTest
-    FastCorrelationBasedFilter
-    GiniIndex
-    GMean
-    GSSCoefficient
-    InformationGain
-    INTERACT
+    algorithm                          shortcut    algo_type    feature_type            applicability
+    --------------------------------------------------------------------------------------------------------
+    Accuracy                           Acc         weighting    discrete                multi-class
+    AccuracyBalanced                   Acc2        weighting    discrete                multi-class
+    BiNormalSeparation                 BNS         weighting    discrete                multi-class
+    CFS_d                              CFS_d       subset       discrete                multi-class
+    ChiSquaredTest                     CHI         weighting    discrete                multi-class
+    CorrelationCoefficient             CC          weighting    discrete                multi-class
+    DocumentFrequency                  DF          weighting    discrete                multi-class
+    F1Measure                          F1          weighting    discrete                multi-class
+    FishersExactTest                   FET         weighting    discrete                multi-class
+    FastCorrelationBasedFilter         FCBF        subset       discrete                multi-class
+    GiniIndex                          GI          weighting    discrete                multi-class
+    GMean                              GM          weighting    discrete                multi-class
+    GSSCoefficient                     GSS         weighting    discrete                multi-class
+    InformationGain                    IG          weighting    discrete                multi-class
+    INTERACT                           INTERACT    subset       discrete                multi-class
+    JMeasure                           JM          weighting    discrete                multi-class
+    KLDivergence                       KLD         weighting    discrete                multi-class
+    LasVegasFilter                     LVF         subset       discrete, continuous    multi-class
+    LasVegasIncremental                LVI         subset       discrete, continuous    multi-class
+    MatthewsCorrelationCoefficient     MCC, PHI    weighting    discrete                multi-class
+    McNemarsTest                       MNT         weighting    discrete                multi-class
+    OddsRatio                          OR          weighting    discrete                multi-class
+    OddsRatioNumerator                 ORN         weighting    discrete                multi-class
+    PhiCoefficient                     PHI         weighting    discrete                multi-class
+    Power                              Power       weighting    discrete                multi-class
+    Precision                          Precision   weighting    discrete                multi-class
+    ProbabilityRatio                   PR          weighting    discrete                multi-class
+    Random                             Random      weighting    discrete                multi-class
+    Recall                             Recall      weighting    discrete                multi-class
+    Relief_d                           Relief_d    weighting    discrete                two-class, no missing data
+    ReliefF_d                          ReliefF_d   weighting    discrete                multi-class
+    Sensitivity                        SN, Recall  weighting    discrete                multi-class
+    Specificity                        SP          weighting    discrete                multi-class
+    SymmetricalUncertainty             SU          weighting    discrete                multi-class
+    BetweenWithinClassesSumOfSquare    BSS_WSS     weighting    continuous              multi-class
+    CFS_c                              CFS_c       subset       continuous              multi-class
+    FTest                              FT          weighting    continuous              multi-class
+    KS_CCBF                            KS_CCBF     subset       continuous              multi-class
+    KSTest                             KST         weighting    continuous              two-class
+    PMetric                            PM          weighting    continuous              two-class
+    Relief_c                           Relief_c    weighting    continuous              two-class, no missing data
+    ReliefF_c                          ReliefF_c   weighting    continuous              multi-class
+    TScore                             TS          weighting    continuous              two-class
+    WilcoxonRankSum                    WRS         weighting    continuous              two-class
 
 **note for feature selection interface:**
 there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
 
-- for weighting type: use either **select\_feature\_by\
+- for weighting type: use either **select\_feature\_by\_score!** or **select\_feature\_by\_rank!**
 - for subset type: use **select\_feature!**
 
 **3. feature selection approaches**
 
 - by a single algorithm
 - by multiple algorithms in a tandem manner
-- by multiple algorithms in an ensemble manner (share same feature selection interface as single algorithm)
+- by multiple algorithms in an ensemble manner (share the same feature selection interface as a single algorithm)
 
 **4. available normalization and discretization algorithms for continuous feature**
 
     algorithm                           note
-
+    ---------------------------------------------------------------------------------------
     normalize_by_log!                   normalize by logarithmic transformation
    normalize_by_min_max!               normalize by scaling into [min, max]
    normalize_by_zscore!                normalize by converting into zscore
@@ -108,13 +112,13 @@ Feature List
     discretize_by_equal_frequency!      discretize by equal frequency among intervals
     discretize_by_ChiMerge!             discretize by ChiMerge algorithm
     discretize_by_Chi2!                 discretize by Chi2 algorithm
-    discretize_by_MID!                  discretize by Multi-Interval Discretization
-    discretize_by_TID!                  discretize by Three-Interval Discretization
+    discretize_by_MID!                  discretize by Multi-Interval Discretization algorithm
+    discretize_by_TID!                  discretize by Three-Interval Discretization algorithm
 
 **5. available algorithms for replacing missing feature values**
 
     algorithm                  note                              feature_type
-
+    ---------------------------------------------------------------------------------------------
     replace_by_fixed_value!    replace by a fixed value          discrete, continuous
     replace_by_mean_value!     replace by mean feature value     continuous
     replace_by_median_value!   replace by median feature value   continuous
@@ -141,8 +145,8 @@ Usage
 
     require 'fselector'
 
-    # use InformationGain as a feature selection algorithm
-    r1 = FSelector::
+    # use InformationGain (IG) as a feature selection algorithm
+    r1 = FSelector::IG.new
 
     # read from random data (or csv, libsvm, weka ARFF file)
     # no. of samples: 100
@@ -161,10 +165,10 @@ Usage
     # number of features after feature selection
     puts " # features (after): "+ r1.get_features.size.to_s
 
-    # you can also use
-    # e.g. use the ChiSquaredTest with Yates' continuity correction
+    # you can also use a second algorithm for further feature selection
+    # e.g. use the ChiSquaredTest (CHI) with Yates' continuity correction
     # initialize from r1's data
-    r2 = FSelector::
+    r2 = FSelector::CHI.new(:yates, r1.get_data)
 
     # number of features before feature selection
     puts " # features (before): "+ r2.get_features.size.to_s
@@ -216,18 +220,18 @@ Usage
 
 
     # creating an ensemble of feature selectors by using
-    # two feature selection algorithms
+    # two feature selection algorithms: InformationGain (IG) and Relief_d.
     # note: can be 2+ algorithms, as long as they are of the same type,
     # either feature weighting or feature subset selection algorithms
 
     # test for the type of feature weighting algorithms
-    r1 = FSelector::
+    r1 = FSelector::IG.new
     r2 = FSelector::Relief_d.new(10)
 
     # an ensemble of two feature selectors
     re = FSelector::EnsembleMultiple.new(r1, r2)
 
-    # read random data
+    # read random discrete data (containing missing value)
     re.data_from_random(100, 2, 15, 3, true)
 
     # replace missing value because Relief_d
@@ -247,35 +251,28 @@ Usage
     # number of features after feature selection
     puts ' # features (after): ' + re.get_features.size.to_s
 
-**3.
-
-In addition to the algorithms designed for continuous feature, one
-can apply those designed for discrete feature after (optionally
-normalization and) discretization
+**3. feature selection after discretization**
 
     require 'fselector'
 
-    #
-
+    # the Information Gain (IG) algorithm requires data with discrete feature
+    r = FSelector::IG.new
 
-    #
-
+    # but the Iris data set contains continuous features (under the test/ directory)
+    r.data_from_csv('test/iris.csv')
 
-    #
-
-
-    # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
-    # initialize with discretized data from r1
-    r2 = FSelector::FCBF.new(0.0, r1.get_data)
+    # let's first discretize it by ChiMerge algorithm at alpha=0.10
+    # then perform feature selection as usual
+    r.discretize_by_ChiMerge!(0.10)
 
     # number of features before feature selection
-    puts ' # features (before): ' +
+    puts ' # features (before): ' + r.get_features.size.to_s
 
-    # feature
-
+    # select the top-ranked feature
+    r.select_feature_by_rank!('<=1')
 
     # number of features after feature selection
-    puts ' # features (after): ' +
+    puts ' # features (after): ' + r.get_features.size.to_s
 
 **4. see more examples test_*.rb under the test/ directory**
 
data/lib/fselector.rb
CHANGED
@@ -1,12 +1,13 @@
 # access to the statistical routines in R package
 require 'rinruby'
+R.eval 'options(warn = -1)' # suppress R warnings
 
 #
 # FSelector: a Ruby gem for feature selection and ranking
 #
 module FSelector
   # module version
-  VERSION = '1.1.0'
+  VERSION = '1.2.0'
 end
 
 # the root dir of FSelector
data/lib/fselector/algo_base/base_Relief.rb
CHANGED
@@ -11,9 +11,6 @@ module FSelector
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
   class BaseRelief < Base
-    # include ReplaceMissingValue module
-    include ReplaceMissingValues
-
     #
     # initialize from an existing data structure
     #
@@ -23,6 +20,7 @@ module FSelector
     #
     def initialize(m=30, data=nil)
       super(data)
+
       @m = m || 30 # default 30
     end
 
data/lib/fselector/algo_base/base_continuous.rb
CHANGED
@@ -6,10 +6,8 @@ module FSelector
   # base algorithm for continuous feature
   #
   class BaseContinuous < Base
-    # include
+    # include module
     include Normalizer
-    # include discretizer
-    include Discretizer
 
     # initialize from an existing data structure
     def initialize(data=nil)
data/lib/fselector/algo_continuous/CFS_c.rb
CHANGED
@@ -9,9 +9,8 @@ module FSelector
   # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5673)
   #
   class CFS_c < BaseCFS
-    # include
+    # include module
     include Normalizer
-    include Discretizer
 
     # this algo outputs a subset of feature
     @algo_type = :feature_subset_selection
data/lib/fselector/algo_continuous/F-Test.rb
CHANGED
@@ -3,7 +3,7 @@
 #
 module FSelector
   #
-  # F-
+  # F-Test (FT) based on F-statistics for continuous feature
   #
   #        between-group variability
   #   FT = ---------------------------
@@ -29,36 +29,6 @@ module FSelector
   private
 
   # calculate contribution of each feature (f) across all classes
-  def calc_contribution2(f)
-    a, b, s = 0.0, 0.0, 0.0
-    ybar = get_feature_values(f).mean
-    kz = get_classes.size.to_f
-    sz = get_sample_size.to_f
-
-    k2ybar = {} # cache
-    each_class do |k|
-      k2ybar[k] = get_feature_values(f, nil, k).mean
-    end
-
-    # a
-    each_class do |k|
-      n_k = get_data[k].size.to_f
-      a += n_k * (k2ybar[k] - ybar)**2 / (kz-1)
-    end
-
-    # b
-    each_sample do |k, s|
-      if s.has_key? f
-        y_ik = s[f]
-        b += (y_ik - k2ybar[k])**2 / (sz-kz)
-      end
-    end
-
-    s = a/b if not b.zero?
-
-    set_feature_score(f, :BEST, s)
-  end # calc_contribution
-
   def calc_contribution(f)
     a, b, s = 0.0, 0.0, 0.0
     ybar = get_feature_values(f).mean
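As an illustrative aside (not part of the diff): the FT ratio above is the classic one-way ANOVA F statistic. A minimal plain-Ruby sketch on made-up toy data, outside the gem's class machinery:

```ruby
# One-way ANOVA F statistic, mirroring the FT formula above.
# 'groups' maps a (hypothetical) class label to its feature values.
groups = { 'c1' => [1.0, 2.0, 3.0], 'c2' => [6.0, 7.0, 8.0] }

all  = groups.values.flatten
ybar = all.sum / all.size   # grand mean
k    = groups.size.to_f     # number of classes
n    = all.size.to_f        # number of samples

# between-group variability (numerator), weighted by group size
between = groups.values.sum { |g| g.size * ((g.sum / g.size) - ybar)**2 } / (k - 1)

# within-group variability (denominator)
within = groups.values.sum { |g|
  gm = g.sum / g.size
  g.sum { |y| (y - gm)**2 }
} / (n - k)

ft = between / within
puts ft  # => 37.5 for this toy data
```

The larger the FT score, the better the feature separates the classes, matching the weighting convention above.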
data/lib/fselector/algo_continuous/KS-CCBF.rb
ADDED
@@ -0,0 +1,125 @@
+#
+# FSelector: a Ruby gem for feature selection and ranking
+#
+module FSelector
+  #
+  # Kolmogorov-Smirnov Class Correlation-Based Filter (KS-CCBF) for continuous feature
+  #
+  # ref: [Feature Selection for Supervised Classification: A Kolmogorov-Smirnov Class Correlation-Based Filter](http://kzi.polsl.pl/~jbiesiada/Infosel/downolad/publikacje/09-Gliwice.pdf)
+  #
+  class KS_CCBF < BaseContinuous
+    # include module
+    include Entropy
+    include Discretizer
+
+    # this algo outputs a subset of feature
+    @algo_type = :feature_subset_selection
+
+    #
+    # initialize from an existing data structure
+    #
+    # @param [Float] lamda threshold value [0, 1] to determine feature redundancy
+    #
+    def initialize(lamda=0.2, data=nil)
+      super(data)
+
+      @lamda = lamda || 0.2
+    end
+
+    private
+
+    # KS-CCBF algorithm
+    def get_feature_subset
+      # make a copy of data, since the discretization method will alter internal data
+      data_bak = get_data_copy
+
+      # stage 1: calculate SUC coefficient
+      # but let's discretize features first
+      discretize_for_suc
+
+      # then SUC
+      f2suc = {}
+      cv = get_class_labels
+      each_feature do |f|
+        fv = get_feature_values(f, :include_missing_values)
+        f2suc[f] = get_symmetrical_uncertainty(fv, cv)
+      end
+
+      # sort feature according to descending order of its SUC
+      subset = f2suc.keys.sort { |x, y| f2suc[y] <=> f2suc[x] }
+
+      # restore data, note set_data also clears old variables
+      set_data(data_bak)
+
+      # stage 2: remove redundancy
+      fp = subset.first
+      while fp
+        fq = get_next_element(subset, fp)
+
+        while fq
+          ks = calc_ks(fp, fq)
+
+          if ks < @lamda
+            fq_new = get_next_element(subset, fq)
+            subset.delete(fq) # remove fq
+            fq = fq_new
+          else
+            fq = get_next_element(subset, fq)
+          end
+        end
+
+        fp = get_next_element(subset, fp)
+      end
+
+      subset
+    end # get_feature_subset
+
+
+    # discretize continuous feature for calculating the SUC,
+    # which requires discrete features. See Discretizer module
+    # for available discretization methods. If you want to use
+    # an alternative one, simply override this function
+    def discretize_for_suc
+      discretize_by_ChiMerge!(0.10)
+    end
+
+
+    # get the next element of fp
+    def get_next_element(subset, fp)
+      fq = nil
+
+      idx = subset.index(fp)
+      if idx and idx < subset.size-1
+        fq = subset[idx+1]
+      end
+
+      fq
+    end # get_next_element
+
+
+    # calculate K-S statistic (relying on R package) among all classes
+    def calc_ks(fp, fq)
+      ks = 0.0
+
+      each_class do |k|
+        R.sp = get_feature_values(fp, nil, k)
+        R.sq = get_feature_values(fq, nil, k)
+
+        # K-S test
+        R.eval "ks <- ks.test(sp, sq)$statistic"
+
+        # pull K-S statistic
+        ks_try = R.ks
+
+        # record max ks among classes
+        ks = ks_try if ks_try > ks
+      end
+
+      ks
+    end # calc_ks
+
+
+  end # class
+
+
+end # module
data/lib/fselector/algo_continuous/KS-Test.rb
ADDED
@@ -0,0 +1,51 @@
+#
+# FSelector: a Ruby gem for feature selection and ranking
+#
+module FSelector
+  #
+  # Kolmogorov-Smirnov Test (KST) for continuous feature
+  #
+  # @note KST is applicable only to two-class problems, and missing data are ignored
+  #
+  # for KST (p-value), the smaller, the better, but we intentionally negate it
+  # so that the larger is always the better (consistent with other algorithms).
+  # R equivalent: ks.test
+  #
+  # ref: [Wikipedia](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) and [Feature Extraction, Foundations and Applications](http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-540-35487-1)
+  #
+  class KSTest < BaseContinuous
+    # this algo outputs weight for each feature
+    @algo_type = :feature_weighting
+
+    private
+
+    # calculate contribution of each feature (f) across all classes
+    def calc_contribution(f)
+      if not get_classes.size == 2
+        abort "[#{__FILE__}@#{__LINE__}]: \n"+
+              "  suitable only for two-class problem with continuous feature"
+      end
+
+      # collect data for class 1 and 2, respectively
+      k1, k2 = get_classes
+      R.s1 = get_feature_values(f, nil, k1) # class 1
+      R.s2 = get_feature_values(f, nil, k2) # class 2
+
+      # K-S test
+      R.eval "rv <- ks.test(s1, s2)$p.value"
+
+      # intentionally negate it
+      s = -1.0 * R.rv # pull the p-value from R
+
+      set_feature_score(f, :BEST, s)
+    end # calc_contribution
+
+
+  end # class
+
+
+  # shortcut so that you can use FSelector::KST instead of FSelector::KSTest
+  KST = KSTest
+
+
+end # module
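For intuition (an aside, not from the gem): the D statistic that R's `ks.test` computes for two samples is the maximum gap between their empirical CDFs. A dependency-free Ruby sketch on toy data:

```ruby
# Two-sample Kolmogorov-Smirnov statistic: max |ECDF1(t) - ECDF2(t)|
# over all observed values t. Illustrative only; the gem delegates to R.
def ks_statistic(s1, s2)
  (s1 + s2).uniq.map do |t|
    f1 = s1.count { |v| v <= t } / s1.size.to_f   # empirical CDF of sample 1 at t
    f2 = s2.count { |v| v <= t } / s2.size.to_f   # empirical CDF of sample 2 at t
    (f1 - f2).abs
  end.max
end

puts ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # => 0.5
```

KS-CCBF uses this statistic to flag redundant feature pairs (small D means similar per-class distributions), while KSTest scores a single feature by the test's p-value across the two classes.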
data/lib/fselector/algo_continuous/P-Metric.rb
RENAMED
File without changes
data/lib/fselector/algo_continuous/ReliefF_c.rb
CHANGED
@@ -10,9 +10,8 @@ module FSelector
   # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
   #
   class ReliefF_c < BaseReliefF
-    # include
+    # include module
     include Normalizer
-    include Discretizer
 
     # this algo outputs weight for each feature
     @algo_type = :feature_weighting
data/lib/fselector/algo_continuous/Relief_c.rb
CHANGED
@@ -10,9 +10,8 @@ module FSelector
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
   class Relief_c < BaseRelief
-    # include
+    # include module
     include Normalizer
-    include Discretizer
 
     # this algo outputs weight for each feature
     @algo_type = :feature_weighting
data/lib/fselector/algo_discrete/CFS_d.rb
CHANGED
@@ -9,7 +9,8 @@ module FSelector
   # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5673)
   #
   class CFS_d < BaseCFS
-    # include
+    # include module
+    include Discretizer
     include Entropy
 
     # this algo outputs a subset of feature
data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb
CHANGED
@@ -9,7 +9,7 @@ module FSelector
   # ref: [Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution](http://www.hpl.hp.com/conferences/icml2003/papers/144.pdf)
   #
   class FastCorrelationBasedFilter < BaseDiscrete
-    # include
+    # include module
     include Entropy
 
     # this algo outputs a subset of feature
@@ -22,6 +22,7 @@ module FSelector
     #
     def initialize(delta=0.0, data=nil)
       super(data)
+
       @delta = delta || 0.0
     end
 
@@ -108,14 +109,13 @@ module FSelector
     end
 
 
+    # get the next element of fp in subset
     def get_next_element(subset, fp)
       fq = nil
 
-      subset.
-
-
-        break
-      end
+      idx = subset.index(fp)
+      if idx and idx < subset.size-1
+        fq = subset[idx+1]
       end
 
       fq
data/lib/fselector/algo_discrete/INTERACT.rb
CHANGED
@@ -9,9 +9,8 @@ module FSelector
   # ref: [Searching for Interacting Features](http://www.public.asu.edu/~huanliu/papers/ijcai07.pdf)
   #
   class INTERACT < BaseDiscrete
-    # include
+    # include module
     include Entropy
-    # include Consistency module
     include Consistency
 
     # this algo outputs a subset of feature
@@ -24,13 +23,14 @@ module FSelector
     #
     def initialize(delta=0.0001, data=nil)
      super(data)
+
      @delta = delta || 0.0001
    end
 
    private
 
    # INTERACT algorithm
-    def get_feature_subset
+    def get_feature_subset
      subset, f2su = get_features.dup, {}
 
      # part 1, get symmetrical uncertainty for each feature
data/lib/fselector/algo_discrete/InformationGain.rb
CHANGED
@@ -14,7 +14,7 @@ module FSelector
   # ref: [Using Information Gain to Analyze and Fine Tune the Performance of Supply Chain Trading Agents](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.7895)
   #
   class InformationGain < BaseDiscrete
-    # include
+    # include module
     include Entropy
 
     # this algo outputs weight for each feature
data/lib/fselector/algo_discrete/J-Measure.rb
ADDED
@@ -0,0 +1,51 @@
+#
+# FSelector: a Ruby gem for feature selection and ranking
+#
+module FSelector
+  #
+  # J-Measure (JM) for discrete feature
+  #
+  #                                                 P(y_j|x_i)
+  #     JM = sigma_i P(x_i) sigma_j P(y_j|x_i) log ------------
+  #                                                   P(y_j)
+  #
+  # ref: [Feature Extraction, Foundations and Applications](http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-540-35487-1)
+  #
+  class JMeasure < BaseDiscrete
+    # this algo outputs weight for each feature
+    @algo_type = :feature_weighting
+
+    private
+
+    # calculate contribution of each feature (f) across all classes
+    def calc_contribution(f)
+      cv = get_class_labels
+      fv = get_feature_values(f, :include_missing_values)
+      sz = cv.size.to_f # also equals fv.size
+
+      s = 0.0
+      fv.uniq.each do |x|
+        px = fv.count(x)/sz
+
+        cv.uniq.each do |y|
+          py = cv.count(y)/sz
+
+          indices = (0...fv.size).to_a.select { |i| fv[i] == x }
+          pyx = cv.values_at(*indices).count(y)/indices.size.to_f
+
+          s += px * ( pyx * Math.log2(pyx/py) ) if not pyx.zero?
+        end
+      end
+
+      set_feature_score(f, :BEST, s)
+    end # calc_contribution
+
+
+  end # class
+
+
+  # shortcut so that you can use FSelector::JM instead of FSelector::JMeasure
+  JM = JMeasure
+
+
+end # module
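As a quick sanity check on the formula (an aside with made-up toy data): stripped of the gem's accessors, the J-Measure loop reduces to a few lines of plain Ruby over parallel arrays of feature values and class labels:

```ruby
# J-Measure on toy parallel arrays: fv = feature values, cv = class labels.
fv = %w[a a b b a b]
cv = %w[x y x x y y]
sz = cv.size.to_f

jm = 0.0
fv.uniq.each do |x|
  px  = fv.count(x) / sz                            # P(x_i)
  idx = (0...fv.size).select { |i| fv[i] == x }     # samples where feature == x

  cv.uniq.each do |y|
    py  = cv.count(y) / sz                          # P(y_j)
    pyx = cv.values_at(*idx).count(y) / idx.size.to_f  # P(y_j|x_i)
    jm += px * pyx * Math.log2(pyx / py) unless pyx.zero?
  end
end

puts jm  # a small positive score; 0 when feature and class are independent
```

A feature whose values say nothing about the class gives P(y|x) = P(y) everywhere, so every log term vanishes and JM is 0.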
data/lib/fselector/algo_discrete/KL-Divergence.rb
ADDED
@@ -0,0 +1,65 @@
+#
+# FSelector: a Ruby gem for feature selection and ranking
+#
+module FSelector
+  #
+  # Kullback-Leibler Divergence (KLD) for discrete feature
+  #
+  #     w_i = wbar_i / ( -Z * sigma_j ( P(a_ij) logP(a_ij) ) )
+  #
+  #     where wbar(i) = sigma_j ( P(a_ij) KL(C|a_ij) )
+  #
+  #           KL(C|a_ij) = sigma_c ( P(c|a_ij) log(P(c|a_ij)/P(c)) )
+  #
+  #           Z is a normalization constant
+  #
+  # ref: [Calculating Feature Weights in Naive Bayes with Kullback-Leibler Measure](http://ix.cs.uoregon.edu/~dou/research/papers/icdm11_fw.)
+  #
+  class KLDivergence < BaseDiscrete
+    # this algo outputs weight for each feature
+    @algo_type = :feature_weighting
+
+    private
+
+    # calculate contribution of each feature (f) across all classes.
+    # note the normalization constant Z is ignored, since we need only
+    # the relative feature scores
+    def calc_contribution(f)
+      cv = get_class_labels
+      fv = get_feature_values(f, :include_missing_values)
+      sz = cv.size.to_f # also equals fv.size
+
+      s, w_avg, d = 0.0, 0.0, 0.0
+
+      fv.uniq.each do |x|
+        px = fv.count(x)/sz
+        d += -1.0 * px * Math.log2(px)
+
+        kl_x = 0.0
+
+        cv.uniq.each do |y|
+          py = cv.count(y)/sz
+
+          indices = (0...fv.size).to_a.select { |i| fv[i] == x }
+          pyx = cv.values_at(*indices).count(y)/indices.size.to_f
+
+          kl_x += pyx * Math.log2(pyx/py) if not pyx.zero?
+        end
+
+        w_avg += px * kl_x
+      end
+
+      s = w_avg / d if not d.zero?
+
+      set_feature_score(f, :BEST, s)
+    end # calc_contribution
+
+
+  end # class
+
+
+  # shortcut so that you can use FSelector::KLD instead of FSelector::KLDivergence
+  KLD = KLDivergence
+
+
+end # module
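The same toy-data exercise works for the KLD weight above (an illustrative aside, data made up): average the per-value KL divergence of the class distribution, then divide by the feature's entropy, with Z dropped as in calc_contribution:

```ruby
# KLD weight for one toy feature: wbar / feature entropy (Z ignored).
fv = %w[a a b b a b]   # feature values a_ij
cv = %w[x y x x y y]   # class labels
sz = cv.size.to_f

w_avg, d = 0.0, 0.0
fv.uniq.each do |x|
  px = fv.count(x) / sz
  d += -px * Math.log2(px)                        # feature entropy (denominator)
  idx = (0...fv.size).select { |i| fv[i] == x }

  kl = 0.0                                        # KL(C|a_ij)
  cv.uniq.each do |y|
    py  = cv.count(y) / sz
    pyx = cv.values_at(*idx).count(y) / idx.size.to_f
    kl += pyx * Math.log2(pyx / py) unless pyx.zero?
  end
  w_avg += px * kl                                # wbar accumulation
end

kld = d.zero? ? 0.0 : w_avg / d
puts kld
```

Dividing by the entropy term penalizes many-valued features, which would otherwise accumulate spuriously large wbar scores.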
data/lib/fselector/algo_discrete/LasVegasFilter.rb
CHANGED
@@ -11,7 +11,7 @@ module FSelector
   # ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
   #
   class LasVegasFilter < BaseDiscrete
-    # include
+    # include module
     include Consistency
 
     # this algo outputs a subset of feature
@@ -24,13 +24,14 @@ module FSelector
     #
     def initialize(max_iter=100, data=nil)
       super(data)
+
       @max_iter = max_iter || 100
     end
 
     private
 
     # Las Vegas Filter (LVF) algorithm
-    def get_feature_subset
+    def get_feature_subset
       inst_cnt = get_instance_count
       j0 = get_IR_by_count(inst_cnt)
 
data/lib/fselector/algo_discrete/LasVegasIncremental.rb
CHANGED
@@ -9,7 +9,7 @@ module FSelector
   # ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
   #
   class LasVegasIncremental < BaseDiscrete
-    # include
+    # include module
     include Consistency
 
     # this algo outputs a subset of feature
@@ -23,6 +23,7 @@ module FSelector
     #
     def initialize(max_iter=100, portion=0.10, data=nil)
       super(data)
+
       @max_iter = max_iter || 100
       @portion = portion || 0.10
     end
@@ -30,7 +31,7 @@ module FSelector
     private
 
     # Las Vegas Incremental (LVI) algorithm
-    def get_feature_subset
+    def get_feature_subset
       data = get_data # working dataset
       s0, s1 = portion(data)
       feats = get_features
data/lib/fselector/algo_discrete/ReliefF_d.rb
CHANGED
@@ -9,6 +9,9 @@ module FSelector
   # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
   #
   class ReliefF_d < BaseReliefF
+    # include module
+    include Discretizer
+
     # this algo outputs weight for each feature
     @algo_type = :feature_weighting
 
data/lib/fselector/algo_discrete/Relief_d.rb
CHANGED
@@ -10,6 +10,9 @@ module FSelector
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
   class Relief_d < BaseRelief
+    # include module
+    include Discretizer
+
     # this algo outputs weight for each feature
     @algo_type = :feature_weighting
 
data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb
CHANGED
@@ -17,7 +17,7 @@ module FSelector
   # ref: [Wikipedia](http://en.wikipedia.org/wiki/Symmetric_uncertainty) and [Robust Feature Selection Using Ensemble Feature Selection Techniques](http://dl.acm.org/citation.cfm?id=1432021)
   #
   class SymmetricalUncertainty < BaseDiscrete
-    # include
+    # include module
     include Entropy
 
     # this algo outputs weight for each feature
data/lib/fselector/discretizer.rb
CHANGED
@@ -2,11 +2,10 @@
 # discretize continuous feature
 #
 module Discretizer
-  # include
-  include Entropy
-  # include Consistency module
+  # include module
   include Consistency
-
+  include Entropy
+
   #
   # discretize by equal-width intervals
   #
@@ -157,7 +156,7 @@ module Discretizer
   #
   # ref: [Chi2: Feature Selection and Discretization of Numeric Attributes](http://sci2s.ugr.es/keel/pdf/specific/congreso/liu1995.pdf)
   #
-  def discretize_by_Chi2!(delta=0.02)
+  def discretize_by_Chi2!(delta=0.02)
     # degree of freedom equals one less than number of classes
     df = get_classes.size-1
 
@@ -270,7 +269,7 @@ module Discretizer
   #
   # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
   #
-  def discretize_by_MID!
+  def discretize_by_MID!
     # determine the final boundaries
     f2cp = {} # cut points for each feature
     each_feature do |f|
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 1.1.0
+  version: 1.2.0
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-05-
+date: 2012-05-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rinruby
-  requirement: &
+  requirement: &23863908 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
         version: 2.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *23863908
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
   algorithms and related functions into one single package. Welcome to contact me
   (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -50,11 +50,13 @@ files:
 - lib/fselector/algo_base/base_ReliefF.rb
 - lib/fselector/algo_continuous/BSS_WSS.rb
 - lib/fselector/algo_continuous/CFS_c.rb
-- lib/fselector/algo_continuous/FTest.rb
-- lib/fselector/algo_continuous/PMetric.rb
+- lib/fselector/algo_continuous/F-Test.rb
+- lib/fselector/algo_continuous/KS-CCBF.rb
+- lib/fselector/algo_continuous/KS-Test.rb
+- lib/fselector/algo_continuous/P-Metric.rb
 - lib/fselector/algo_continuous/ReliefF_c.rb
 - lib/fselector/algo_continuous/Relief_c.rb
-- lib/fselector/algo_continuous/TScore.rb
+- lib/fselector/algo_continuous/T-Score.rb
 - lib/fselector/algo_continuous/WilcoxonRankSum.rb
 - lib/fselector/algo_discrete/Accuracy.rb
 - lib/fselector/algo_discrete/AccuracyBalanced.rb
@@ -66,11 +68,13 @@ files:
 - lib/fselector/algo_discrete/F1Measure.rb
 - lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb
 - lib/fselector/algo_discrete/FishersExactTest.rb
+- lib/fselector/algo_discrete/G-Mean.rb
 - lib/fselector/algo_discrete/GiniIndex.rb
-- lib/fselector/algo_discrete/GMean.rb
 - lib/fselector/algo_discrete/GSSCoefficient.rb
 - lib/fselector/algo_discrete/InformationGain.rb
 - lib/fselector/algo_discrete/INTERACT.rb
+- lib/fselector/algo_discrete/J-Measure.rb
+- lib/fselector/algo_discrete/KL-Divergence.rb
 - lib/fselector/algo_discrete/LasVegasFilter.rb
 - lib/fselector/algo_discrete/LasVegasIncremental.rb
 - lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb