fselector 0.2.0 → 0.3.0
- data/README.md +46 -33
- data/lib/fselector.rb +6 -1
- data/lib/fselector/algo_base/base.rb +14 -3
- data/lib/fselector/algo_base/base_CFS.rb +12 -0
- data/lib/fselector/algo_base/base_continuous.rb +2 -2
- data/lib/fselector/algo_continuous/CFS_c.rb +10 -0
- data/lib/fselector/algo_continuous/ReliefF_c.rb +3 -0
- data/lib/fselector/algo_continuous/Relief_c.rb +3 -0
- data/lib/fselector/algo_continuous/discretizer.rb +161 -7
- data/lib/fselector/algo_continuous/normalizer.rb +3 -3
- data/lib/fselector/algo_discrete/CFS_d.rb +6 -0
- data/lib/fselector/entropy.rb +31 -31
- data/lib/fselector/fileio.rb +15 -3
- data/lib/fselector/replace_missing_values.rb +78 -0
- metadata +13 -10
data/README.md
CHANGED
@@ -8,30 +8,41 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 0.2.0
-**Release Date**: April
+**Latest Version**: 0.3.0
+**Release Date**: April 3rd 2012
 
 Synopsis
 --------
 
-FSelector is a Ruby gem that aims to integrate various feature
-algorithms into one single
-
-
-
-
-
-
-
+FSelector is a Ruby gem that aims to integrate various feature
+selection/ranking algorithms and related functions into one single
+package. Welcome to contact me (need47@gmail.com) if you'd like to
+contribute your own algorithms or report a bug. FSelector allows users
+to perform feature selection by using either a single algorithm or an
+ensemble of multiple algorithms, and supports other common tasks including
+normalization and discretization on continuous data, as well as replacing
+missing feature values with a certain criterion. FSelector acts on a
+full-feature data set in either CSV, LibSVM or WEKA file format and
+outputs a reduced data set with only the selected subset of features, which
+can later be used as the input for various machine learning software
+including LibSVM and WEKA. FSelector, itself, does not implement
+any of the machine learning algorithms such as support vector machines
+and random forest. See below for a list of FSelector's features.
 
 Feature List
 ------------
 
-**1.
+**1. supported input/output file types**
+
+ - csv
+ - libsvm
+ - weka ARFF
+ - random data (for test purposes)
+
+**2. available feature selection/ranking algorithms**
 
-algorithm alias
-
+    algorithm                  alias        feature type
+    --------------------------------------------------------
     Accuracy                   Acc          discrete
     AccuracyBalanced           Acc2         discrete
     BiNormalSeparation         BNS          discrete
@@ -67,29 +78,31 @@ Feature List
     ReliefF_c                  ReliefF_c    continuous
     TScore                     TS           continuous
 
-**
+**3. feature selection approaches**
 
 - by a single algorithm
 - by multiple algorithms in a tandem manner
 - by multiple algorithms in a consensus manner
 
-**
+**4. available normalization and discretization algorithms for continuous features**
 
     algorithm          note
-
-    log
-    min_max
-    zscore
-    equal_width
-    equal_frequency
-    ChiMerge
-
-
-
-
-
-
-
+    -----------------------------------------------------------------
+    log                normalize by logarithmic transformation
+    min_max            normalize by scaling into [min, max]
+    zscore             normalize by converting into zscore
+    equal_width        discretize by equal width among intervals
+    equal_frequency    discretize by equal frequency among intervals
+    ChiMerge           discretize by ChiMerge method
+    MID                discretize by Multi-Interval Discretization
+
+**5. available algorithms for replacing missing feature values**
+
+    algorithm          note                                    feature type
+    --------------------------------------------------------------------------------------
+    fixed_value        replace with a fixed value              discrete, continuous
+    mean_value         replace with the mean feature value     continuous
+    most_seen_value    replace with most seen feature value    discrete
 
 Installing
 ----------
@@ -187,11 +200,11 @@ Usage
 r1.data_from_csv('test/iris.csv')
 
 # normalization by log2 (optional)
-# r1.
+# r1.normalize_by_log!(2)
 
 # discretization by ChiMerge algorithm
 # chi-squared value = 4.60 for a three-class problem at alpha=0.10
-r1.
+r1.discretize_by_ChiMerge!(4.60)
 
 # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
 # initialize with discretized data from r1
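Taken together, the README changes above describe the new 0.3.0 workflow. As a quick orientation, here is a minimal end-to-end sketch assembled from the snippets in this diff; the `BaseContinuous` entry point is an assumption, since the excerpt does not show how `r1` was constructed:

    require 'fselector'

    # assumption: r1 holds continuous data; the README excerpt does not
    # show the class used, BaseContinuous is a plausible choice
    r1 = FSelector::BaseContinuous.new
    r1.data_from_csv('test/iris.csv')

    # optional log2 normalization, per the README comment
    # r1.normalize_by_log!(2)

    # ChiMerge discretization; chi-squared = 4.60 corresponds to a
    # three-class problem at alpha = 0.10
    r1.discretize_by_ChiMerge!(4.60)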
data/lib/fselector.rb
CHANGED
@@ -3,7 +3,7 @@
 #
 module FSelector
   # module version
-  VERSION = '0.2.0'
+  VERSION = '0.3.0'
 end
 
 ROOT = File.expand_path(File.dirname(__FILE__))
@@ -11,9 +11,14 @@ ROOT = File.expand_path(File.dirname(__FILE__))
 #
 # include necessary files
 #
+# read and write file, supported formats include CSV, LibSVM and WEKA files
 require "#{ROOT}/fselector/fileio.rb"
+# extend Array and String class
 require "#{ROOT}/fselector/util.rb"
+# entropy-related functions
 require "#{ROOT}/fselector/entropy.rb"
+# replace missing values
+require "#{ROOT}/fselector/replace_missing_values.rb"
 
 #
 # base class
data/lib/fselector/algo_base/base.rb
CHANGED
@@ -8,6 +8,8 @@ module FSelector
   class Base
     # include FileIO
     include FileIO
+    # include ReplaceMissingValues
+    include ReplaceMissingValues
 
     # initialize from an existing data structure
     def initialize(data=nil)
@@ -167,13 +169,13 @@ module FSelector
     def set_data(data)
       if data and data.class == Hash
         @data = data
-        # clear
-
-        @scores, @ranks, @sz = nil, nil, nil
+        # clear variables
+        clear_vars
       else
         abort "[#{__FILE__}@#{__LINE__}]: "+
               "data must be a Hash object!"
       end
+
       data
     end
 
@@ -335,6 +337,14 @@ module FSelector
 
     private
 
+    # clear variables when data structure is altered
+    def clear_vars
+      @classes, @features, @fvs = nil, nil, nil
+      @scores, @ranks, @sz = nil, nil, nil
+      @cv, @fvs = nil, nil
+    end
+
+
     # set feature (f) score (s) for class (k)
     def set_feature_score(f, k, s)
       @scores ||= {}
@@ -342,6 +352,7 @@ module FSelector
       @scores[f][k] = s
     end
 
+
     # get subset of feature
     def get_feature_subset
       abort "[#{__FILE__}@#{__LINE__}]: "+
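The new `clear_vars` hook exists because `Base` memoizes derived values (`@classes`, `@features`, `@scores`, and so on), and those caches go stale whenever `@data` is replaced or altered in place. A standalone illustration of the pitfall, using a hypothetical class rather than the gem's code:

    # hypothetical Dataset class, illustrating the stale-cache problem
    # that clear_vars guards against in FSelector's Base
    class Dataset
      def initialize(data)
        @data = data
      end

      # memoized derived value, like @classes/@features/@scores in Base
      def classes
        @classes ||= @data.keys
      end

      def set_data(data)
        @data = data
        @classes = nil # the moral equivalent of clear_vars
      end
    end

    d = Dataset.new('a' => [1], 'b' => [2])
    p d.classes             # => ["a", "b"]
    d.set_data('c' => [3])
    p d.classes             # => ["c"]; without the reset it would still be ["a", "b"]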
data/lib/fselector/algo_base/base_CFS.rb
CHANGED
@@ -21,6 +21,9 @@ module FSelector
 
     # use sequential forward search
     def get_feature_subset
+      # handle missing values
+      handle_missing_values
+
       subset = []
       feats = get_features.dup
 
@@ -58,6 +61,15 @@ module FSelector
     end # get_feature_subset
 
 
+    # handle missing values
+    # CFS replaces missing values with the mean for continuous features and
+    # the most seen value for discrete features
+    def handle_missing_values
+      abort "[#{__FILE__}@#{__LINE__}]: "+
+            "derived CFS algo must implement its own handle_missing_values()"
+    end
+
+
     # calc new merit of subset when adding feature (f)
     def calc_merit(subset, f)
       k = subset.size.to_f + 1
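The aborting `handle_missing_values` above is a template-method hook: `get_feature_subset` calls it, and each concrete CFS variant supplies the behavior (see `CFS_c` and `CFS_d` later in this diff). A generic sketch of the pattern, with illustrative names only:

    # illustrative names only; this mirrors the structure of BaseCFS
    class BaseAlgo
      def run
        handle_missing_values # hook, supplied by subclasses
        puts 'selecting features...'
      end

      private

      def handle_missing_values
        abort 'derived algo must implement handle_missing_values()'
      end
    end

    class ContinuousAlgo < BaseAlgo
      private

      # continuous features: fill with the mean, as CFS_c does
      def handle_missing_values
        puts 'replacing missing values with feature means'
      end
    end

    ContinuousAlgo.new.run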
data/lib/fselector/algo_base/base_continuous.rb
CHANGED
@@ -10,8 +10,8 @@ module FSelector
   class BaseContinuous < Base
     # include normalizer
     include Normalizer
-    # include discretilizer
-    include Discretilizer
+    # include discretizer
+    include Discretizer
 
     # initialize from an existing data structure
     def initialize(data=nil)
data/lib/fselector/algo_continuous/CFS_c.rb
CHANGED
@@ -8,8 +8,18 @@ module FSelector
   # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf)
   #
   class CFS_c < BaseCFS
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
+
+
+    # replace missing values with mean feature value
+    def handle_missing_values
+      replace_with_mean_value!
+    end
+
 
     # calc the feature-class correlation of two vectors
     def do_rcf(cv, fv)
data/lib/fselector/algo_continuous/ReliefF_c.rb
CHANGED
@@ -10,6 +10,9 @@ module FSelector
   # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
   #
   class ReliefF_c < BaseReliefF
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
 
data/lib/fselector/algo_continuous/Relief_c.rb
CHANGED
@@ -10,6 +10,9 @@ module FSelector
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
   class Relief_c < BaseRelief
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
 
data/lib/fselector/algo_continuous/discretizer.rb
CHANGED
@@ -1,7 +1,10 @@
 #
-#
+# discretize continuous features
 #
-module Discretilizer
+module Discretizer
+  # include Entropy module
+  include Entropy
+
   # discretize by equal-width intervals
   #
   # @param [Integer] n_interval
@@ -84,7 +87,7 @@ module Discretilizer
   # 2     4.60   5.99   9.21  13.82
   # 3     6.35   7.82  11.34  16.27
   #
-  def
+  def discretize_by_ChiMerge!(chisq)
     # chisq = 4.60 # for iris::Sepal.Length
     # for initialization
     hzero = {}
@@ -177,19 +180,71 @@ module Discretilizer
       end
     end
 
-  end #
+  end # discretize_by_ChiMerge!
+
+
+  #
+  # discretize by Multi-Interval Discretization (MID) algorithm
+  # @note no missing feature values allowed and data structure will be altered
+  #
+  # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
+  #
+  def discretize_by_MID!
+    # determine the final boundaries
+    f2cp = {} # cut points for each feature
+    each_feature do |f|
+      cv = get_class_labels
+      # we assume no missing feature values
+      fv = get_feature_values(f)
+
+      n = cv.size
+      # sort cv and fv according to the ascending order of fv
+      sis = (0...n).to_a.sort { |i,j| fv[i] <=> fv[j] }
+      cv = cv.values_at(*sis)
+      fv = fv.values_at(*sis)
+
+      # get initial boundaries
+      bs = []
+      fv.each_with_index do |v, i|
+        # cut point (Ta) for feature A must always be a value between
+        # two examples of different classes in the sequence of sorted examples
+        # see original reference
+        if i < n-1 and cv[i] != cv[i+1]
+          bs << (v+fv[i+1])/2.0
+        end
+      end
+      bs.uniq! # remove duplicates
+
+      # main algorithm, iteratively determine cut point
+      cp = []
+      partition(cv, fv, bs, cp)
+
+      # add the rightmost boundary for convenience
+      cp << fv.max+1.0
+      # record cut points for feature (f)
+      f2cp[f] = cp
+    end
+
+    # discretize based on cut points
+    each_sample do |k, s|
+      s.keys.each do |f|
+        s[f] = get_index(s[f], f2cp[f])
+      end
+    end
+
+  end # discretize_by_MID!
 
   private
 
   # get index from sorted boundaries
   #
   # min -- | -- | -- | ... max |
-  #
-  #
+  #        b1   b2   b3  ...   bn(=max+1)
+  #      1    2    3    ...  n
   #
   def get_index(v, boundaries)
     boundaries.each_with_index do |b, i|
-      return i if v < b
+      return i+1 if v < b
     end
   end # get_index
 
@@ -215,4 +270,103 @@ module Discretilizer
   end # calc_chisq
 
 
+  #
+  # Multi-Interval Discretization main algorithm,
+  # recursively selecting the best cut point
+  #
+  # @param [Array] cv class labels
+  # @param [Array] fv feature values
+  # @param [Array] bs potential cut points
+  # @param [Array] cp resultant cut points
+  def partition(cv, fv, bs, cp)
+    # best cut point
+    cp_best = nil
+
+    # binary subset at the best cut point
+    cv1_best, cv2_best = nil, nil
+    fv1_best, fv2_best = nil, nil
+    bs1_best, bs2_best = nil, nil
+
+    # best information gain
+    gain_best = -100.0
+    ent_best = -100.0
+    ent1_best = -100.0
+    ent2_best = -100.0
+
+    # try each potential cut point
+    bs.each do |b|
+      # binary split
+      cv1_try, cv2_try, fv1_try, fv2_try, bs1_try, bs2_try =
+        binary_split(cv, fv, bs, b)
+
+      # gain for this cut point
+      ent_try = get_marginal_entropy(cv)
+      ent1_try = get_marginal_entropy(cv1_try)
+      ent2_try = get_marginal_entropy(cv2_try)
+      gain_try = ent_try -
+                 (cv1_try.size.to_f/cv.size) * ent1_try -
+                 (cv2_try.size.to_f/cv.size) * ent2_try
+
+      #pp gain_try
+      if gain_try > gain_best
+        cp_best = b
+        cv1_best, cv2_best = cv1_try, cv2_try
+        fv1_best, fv2_best = fv1_try, fv2_try
+        bs1_best, bs2_best = bs1_try, bs2_try
+
+        gain_best = gain_try
+        ent_best = ent_try
+        ent1_best, ent2_best = ent1_try, ent2_try
+      end
+    end
+
+    # to cut or not to cut?
+    #
+    # Gain(A,T;S) > 1/N * log2(N-1) + 1/N * delta(A,T;S)
+    if cp_best
+      n = cv.size.to_f
+      k = cv.uniq.size.to_f
+      k1 = cv1_best.uniq.size.to_f
+      k2 = cv2_best.uniq.size.to_f
+      delta = Math.log2(3**k-2)-(k*ent_best - k1*ent1_best - k2*ent2_best)
+
+      # accept cut point
+      if gain_best > (Math.log2(n-1)/n + delta/n)
+        # a: record cut point
+        cp << cp_best
+
+        # b: recursively call on subset
+        partition(cv1_best, fv1_best, bs1_best, cp)
+        partition(cv2_best, fv2_best, bs2_best, cp)
+      end
+    end
+  end
+
+
+  # binarily split based on a cut point
+  def binary_split(cv, fv, bs, cut_point)
+    cv1, cv2, fv1, fv2, bs1, bs2 = [], [], [], [], [], []
+    fv.each_with_index do |v, i|
+      if v < cut_point
+        cv1 << cv[i]
+        fv1 << v
+      else
+        cv2 << cv[i]
+        fv2 << v
+      end
+    end
+
+    bs.each do |b|
+      if b < cut_point
+        bs1 << b
+      else
+        bs2 << b
+      end
+    end
+
+    # return subset
+    [cv1, cv2, fv1, fv2, bs1, bs2]
+  end
+
+
 end # module
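The acceptance test in `partition` is the MDL criterion from the referenced paper: a cut point T on feature A over sample set S is kept only if Gain(A,T;S) > log2(N-1)/N + delta(A,T;S)/N, where delta(A,T;S) = log2(3^k - 2) - [k*Ent(S) - k1*Ent(S1) - k2*Ent(S2)]. A self-contained numeric check with made-up labels, mirroring the arithmetic above:

    # toy numbers, not from the gem: verify the MDL cut-point test by hand
    def entropy(labels)
      n = labels.size.to_f
      labels.uniq.sum { |l| p = labels.count(l) / n; -p * Math.log2(p) }
    end

    s  = %w[a a a b b b] # parent set, N = 6, k = 2 classes
    s1 = %w[a a a]       # left of the candidate cut point
    s2 = %w[b b b]       # right of the candidate cut point

    n = s.size.to_f
    k, k1, k2 = s.uniq.size.to_f, s1.uniq.size.to_f, s2.uniq.size.to_f

    gain  = entropy(s) - (s1.size / n) * entropy(s1) - (s2.size / n) * entropy(s2)
    delta = Math.log2(3**k - 2) -
            (k * entropy(s) - k1 * entropy(s1) - k2 * entropy(s2))

    threshold = Math.log2(n - 1) / n + delta / n
    puts "gain=#{gain.round(3)}, threshold=#{threshold.round(3)}"
    # => gain=1.0, threshold=0.522 -- a perfectly class-separating cut is accepted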
data/lib/fselector/algo_continuous/normalizer.rb
CHANGED
@@ -3,7 +3,7 @@
 #
 module Normalizer
   # log transformation, requires positive feature values
-  def
+  def normalize_by_log!(base=10)
     each_sample do |k, s|
       s.keys.each do |f|
         s[f] = Math.log(s[f], base) if s[f] > 0.0
@@ -13,7 +13,7 @@ module Normalizer
 
 
   # scale to [min,max], max > min
-  def
+  def normalize_by_min_max!(min=0.0, max=1.0)
     # first determine min and max for each feature
     f2min_max = {}
 
@@ -33,7 +33,7 @@ module Normalizer
 
 
   # by z-score
-  def
+  def normalize_by_zscore!
     # first determine mean and sd for each feature
     f2mean_sd = {}
 
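For reference, the per-feature scaling that `normalize_by_min_max!` performs can be sketched standalone (illustrative code, not the gem's):

    # standalone sketch of per-feature min-max scaling
    def min_max_scale(values, new_min = 0.0, new_max = 1.0)
      lo, hi = values.minmax
      values.map { |v| new_min + (v - lo) * (new_max - new_min) / (hi - lo) }
    end

    p min_max_scale([4.3, 5.8, 7.9]) # => [0.0, 0.41666..., 1.0]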
data/lib/fselector/algo_discrete/CFS_d.rb
CHANGED
@@ -12,6 +12,12 @@ module FSelector
     include Entropy
 
     private
+
+    # replace missing values with most seen feature value
+    def handle_missing_values
+      replace_with_most_seen_value!
+    end
+
 
     # calc the feature-class correlation of two vectors
     def do_rcf(cv, fv)
data/lib/fselector/entropy.rb
CHANGED
@@ -7,16 +7,16 @@ module Entropy
   #
   # H(X) = -1 * sigma_i (P(x_i) logP(x_i))
   #
-
+  def get_marginal_entropy(arrX)
     h = 0.0
     n = arrX.size.to_f
-
-
-
-
-
-
-
+
+    arrX.uniq.each do |x_i|
+      p = arrX.count(x_i)/n
+      h += -1.0 * (p * Math.log2(p))
+    end
+
+    h
   end # get_marginal_entropy
 
 
@@ -27,28 +27,28 @@ module Entropy
   #
   # where H(X|y_j) = -1 * sigma_i (P(x_i|y_j) logP(x_i|y_j))
   #
-
-
-
-
+  def get_conditional_entropy(arrX, arrY)
+    abort "[#{__FILE__}@#{__LINE__}]: "+
+          "array must be of same length" if not arrX.size == arrY.size
+
     hxy = 0.0
-
-
-
-
-
-
-
-
-
-
-
-
-
+    n = arrX.size.to_f
+
+    arrY.uniq.each do |y_j|
+      p1 = arrY.count(y_j)/n
+
+      indices = (0...n).to_a.select { |k| arrY[k] == y_j }
+      xvs = arrX.values_at(*indices)
+      m = xvs.size.to_f
+
+      xvs.uniq.each do |x_i|
+        p2 = xvs.count(x_i)/m
+
+        hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+      end
     end
-
-
-    hxy
+
+    hxy
   end # get_conditional_entropy
 
 
@@ -60,11 +60,11 @@ module Entropy
   #
   # i.e. H(X,Y) == H(Y,X)
   #
-
+  def get_joint_entropy(arrX, arrY)
     abort "[#{__FILE__}@#{__LINE__}]: "+
           "array must be of same length" if not arrX.size == arrY.size
-
-    get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
+
+    get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
   end # get_joint_entropy
 
 
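The three methods satisfy the chain rule H(X,Y) = H(Y) + H(X|Y), which is exactly how `get_joint_entropy` is implemented. A toy numeric check using standalone re-implementations of the same formulas:

    # standalone re-implementations for a quick sanity check of the chain rule
    def h(xs)
      n = xs.size.to_f
      xs.uniq.sum { |x| p = xs.count(x) / n; -p * Math.log2(p) }
    end

    def h_cond(xs, ys) # H(X|Y)
      n = ys.size.to_f
      ys.uniq.sum do |y|
        idx = (0...ys.size).select { |i| ys[i] == y }
        (idx.size / n) * h(xs.values_at(*idx))
      end
    end

    x = [0, 0, 1, 1, 1, 0]
    y = [0, 1, 0, 1, 1, 1]
    lhs = h(y) + h_cond(x, y)  # H(Y) + H(X|Y)
    rhs = h(x.zip(y))          # H(X,Y) computed directly over joint outcomes
    p lhs.round(6) == rhs.round(6) # => true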
data/lib/fselector/fileio.rb
CHANGED
@@ -110,10 +110,22 @@ module FileIO
       ofs = File.open(fname, 'w')
     end
 
+    # convert class label to integer type
+    k2idx = {}
+    get_classes.each_with_index do |k, i|
+      k2idx[k] = i+1
+    end
+
+    # convert feature to integer type
+    f2idx = {}
+    get_features.each_with_index do |f, i|
+      f2idx[f] = i+1
+    end
+
     each_sample do |k, s|
-      ofs.print "#{k} "
+      ofs.print "#{k2idx[k]} "
       s.keys.sort { |x, y| x.to_s.to_i <=> y.to_s.to_i }.each do |i|
-        ofs.print " #{i}:#{s[i]}" if not s[i].zero?
+        ofs.print " #{f2idx[i]}:#{s[i]}" if not s[i].zero? # implicit mode
       end
       ofs.puts
     end
@@ -171,7 +183,7 @@ module FileIO
         end
       else
         abort "[#{__FILE__}@#{__LINE__}]: "+
-              "
+              "the first two rows must have same number of fields"
       end
     else # data rows
       label, *fvs = ln.chomp.split(/,/)
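The `k2idx`/`f2idx` maps above exist because the LibSVM format requires integer class labels and 1-based integer feature indices, while FSelector may hold string labels and arbitrary feature names. A sketch of what one output line looks like, with made-up class and feature names:

    # made-up stand-ins for get_classes/get_features and one sample
    classes  = %w[iris-setosa iris-versicolor]
    features = %w[sepal_len sepal_wid]

    k2idx = {}; classes.each_with_index  { |k, i| k2idx[k] = i + 1 }
    f2idx = {}; features.each_with_index { |f, i| f2idx[f] = i + 1 }

    sample = { 'sepal_len' => 5.1, 'sepal_wid' => 0.0 }
    line = "#{k2idx['iris-setosa']}"
    sample.each { |f, v| line << " #{f2idx[f]}:#{v}" unless v.zero? }
    puts line # => "1 1:5.1" (zeros omitted: LibSVM's implicit, sparse mode)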
data/lib/fselector/replace_missing_values.rb
ADDED
@@ -0,0 +1,78 @@
+#
+# replace missing feature values
+#
+module ReplaceMissingValues
+  #
+  # replace missing feature value with a fixed value,
+  # applicable to both discrete and continuous features
+  # @note data structure will be altered
+  #
+  def replace_with_fixed_value!(val)
+    each_sample do |k, s|
+      each_feature do |f|
+        if not s.has_key? f
+          s[f] = val
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_fixed_value!
+
+
+  #
+  # replace missing feature value with mean feature value,
+  # applicable only to continuous features
+  # @note data structure will be altered
+  #
+  def replace_with_mean_value!
+    each_sample do |k, s|
+      each_feature do |f|
+        fv = get_feature_values(f)
+        next if fv.size == get_sample_size # no missing values
+
+        mean = fv.ave
+        if not s.has_key? f
+          s[f] = mean
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_mean_value!
+
+
+  #
+  # replace missing feature value with most seen feature value,
+  # applicable only to discrete features
+  # @note data structure will be altered
+  #
+  def replace_with_most_seen_value!
+    each_sample do |k, s|
+      each_feature do |f|
+        fv = get_feature_values(f)
+        next if fv.size == get_sample_size # no missing values
+
+        seen_count, seen_value = 0, nil
+        fv.uniq.each do |v|
+          count = fv.count(v)
+          if count > seen_count
+            seen_count = count
+            seen_value = v
+          end
+        end
+
+        if not s.has_key? f
+          s[f] = seen_value
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_most_seen_value!
+
+
+end # ReplaceMissingValues
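Since `ReplaceMissingValues` is mixed into `Base` (see the base.rb change above), every algorithm object exposes these methods once data is loaded. A hedged usage sketch; the `BaseDiscrete` container and the input file name are assumptions:

    # assumptions: BaseDiscrete as the container, test/data.csv as input
    r = FSelector::BaseDiscrete.new
    r.data_from_csv('test/data.csv')

    r.replace_with_most_seen_value!  # discrete features
    # for continuous features, use instead:
    # r.replace_with_mean_value!
    # or a constant fill for either type:
    # r.replace_with_fixed_value!(0)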
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 prerelease:
 platform: ruby
 authors:
@@ -9,17 +9,19 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-
+date: 2012-04-03 00:00:00.000000000 Z
 dependencies: []
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
-  algorithms into one single package. Welcome to contact me
-  you
-  user to perform feature selection by using either a single algorithm
-
-
-
-  LibSVM
-
+  algorithms and related functions into one single package. Welcome to contact me
+  (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
+  FSelector allows users to perform feature selection by using either a single algorithm
+  or an ensemble of multiple algorithms, and supports other common tasks including normalization
+  and discretization on continuous data, as well as replacing missing feature values
+  with a certain criterion. FSelector acts on a full-feature data set in either CSV,
+  LibSVM or WEKA file format and outputs a reduced data set with only the selected subset
+  of features, which can later be used as the input for various machine learning software
+  including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
+  learning algorithms such as support vector machines and random forest.
 email: need47@gmail.com
 executables: []
 extensions: []
@@ -73,6 +75,7 @@ files:
 - lib/fselector/ensemble.rb
 - lib/fselector/entropy.rb
 - lib/fselector/fileio.rb
+- lib/fselector/replace_missing_values.rb
 - lib/fselector/util.rb
 - lib/fselector.rb
 homepage: http://github.com/need47/fselector