fselector 0.2.0 → 0.3.0
This diff shows the changes between two publicly released versions of this package, as they appear in their respective public registries, and is provided for informational purposes only.
- data/README.md +46 -33
- data/lib/fselector.rb +6 -1
- data/lib/fselector/algo_base/base.rb +14 -3
- data/lib/fselector/algo_base/base_CFS.rb +12 -0
- data/lib/fselector/algo_base/base_continuous.rb +2 -2
- data/lib/fselector/algo_continuous/CFS_c.rb +10 -0
- data/lib/fselector/algo_continuous/ReliefF_c.rb +3 -0
- data/lib/fselector/algo_continuous/Relief_c.rb +3 -0
- data/lib/fselector/algo_continuous/discretizer.rb +161 -7
- data/lib/fselector/algo_continuous/normalizer.rb +3 -3
- data/lib/fselector/algo_discrete/CFS_d.rb +6 -0
- data/lib/fselector/entropy.rb +31 -31
- data/lib/fselector/fileio.rb +15 -3
- data/lib/fselector/replace_missing_values.rb +78 -0
- metadata +13 -10
data/README.md
CHANGED
@@ -8,30 +8,41 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 0.2.0
-**Release Date**: April
+**Latest Version**: 0.3.0
+**Release Date**: April 3rd 2012
 
 Synopsis
 --------
 
-FSelector is a Ruby gem that aims to integrate various feature
-algorithms into one single
-
-
-
-
-
-
-
+FSelector is a Ruby gem that aims to integrate various feature
+selection/ranking algorithms and related functions into one single
+package. Welcome to contact me (need47@gmail.com) if you'd like to
+contribute your own algorithms or report a bug. FSelector allows user
+to perform feature selection by using either a single algorithm or an
+ensemble of multiple algorithms, and other common tasks including
+normalization and discretization on continuous data, as well as replace
+missing feature values with certain criterion. FSelector acts on a
+full-feature data set in either CSV, LibSVM or WEKA file format and
+outputs a reduced data set with only selected subset of features, which
+can later be used as the input for various machine learning softwares
+including LibSVM and WEKA. FSelector, itself, does not implement
+any of the machine learning algorithms such as support vector machines
+and random forest. See below for a list of FSelector's features.
 
 Feature List
 ------------
 
-**1.
+**1. supported input/output file types**
+
+ - csv
+ - libsvm
+ - weka ARFF
+ - random data (for test purpose)
+
+**2. available feature selection/ranking algorithms**
 
-algorithm alias
-
+algorithm              alias      feature type
+--------------------------------------------------------
 Accuracy               Acc        discrete
 AccuracyBalanced       Acc2       discrete
 BiNormalSeparation     BNS        discrete
@@ -67,29 +78,31 @@ Feature List
 ReliefF_c              ReliefF_c  continuous
 TScore                 TS         continuous
 
-**
+**3. feature selection approaches**
 
 - by a single algorithm
 - by multiple algorithms in a tandem manner
 - by multiple algorithms in a consensus manner
 
-**
+**4. availabe normalization and discretization algorithms for continuous feature**
 
 algorithm          note
-
-log
-min_max
-zscore
-equal_width
-equal_frequency
-ChiMerge
-
-
-
-
-
-
-
+-----------------------------------------------------------------
+log                normalize by logarithmic transformation
+min_max            normalize by scaling into [min, max]
+zscore             normalize by converting into zscore
+equal_width        discretize by equal width among intervals
+equal_frequency    discretize by equal frequency among intervals
+ChiMerge           discretize by ChiMerge method
+MID                discretize by Multi-Interval Discretization
+
+**5. availabe algorithms for replacing missing feature values**
+
+algorithm          note                                   feature type
+--------------------------------------------------------------------------------------
+fixed_value        replace with a fixed value             discrete, continuous
+mean_value         replace with the mean feature value    continuous
+most_seen_value    replace with most seen feature value   discrete
 
 Installing
 ----------
@@ -187,11 +200,11 @@ Usage
 r1.data_from_csv('test/iris.csv')
 
 # normalization by log2 (optional)
-# r1.
+# r1.normalize_by_log!(2)
 
 # discretization by ChiMerge algorithm
 # chi-squared value = 4.60 for a three-class problem at alpha=0.10
-r1.
+r1.discretize_by_ChiMerge!(4.60)
 
 # apply Fast Correlation-Based Filter (FCBF) algorithm for discrete feature
 # initialize with discretized data from r1
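The Usage hunk above renames the normalization and discretization calls. A minimal sketch of the updated flow follows; it is illustrative only, and the class used to build `r1` is not shown in this hunk, so `FSelector::BaseContinuous` and the `test/iris.csv` path are assumptions taken from the surrounding README example.

```ruby
require 'fselector'

# assumed setup: a continuous-feature selector loaded from the iris CSV
r1 = FSelector::BaseContinuous.new
r1.data_from_csv('test/iris.csv')

# method names as renamed in 0.3.0 (the old names are truncated in this diff)
r1.normalize_by_log!(2)           # optional log2 normalization
r1.discretize_by_ChiMerge!(4.60)  # chi-squared = 4.60, three classes, alpha = 0.10
```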
data/lib/fselector.rb
CHANGED
@@ -3,7 +3,7 @@
 #
 module FSelector
   # module version
-  VERSION = '0.2.0'
+  VERSION = '0.3.0'
 end
 
 ROOT = File.expand_path(File.dirname(__FILE__))
@@ -11,9 +11,14 @@ ROOT = File.expand_path(File.dirname(__FILE__))
 #
 # include necessary files
 #
+# read and write file, supported formats include CSV, LibSVM and WEKA files
 require "#{ROOT}/fselector/fileio.rb"
+# extend Array and String class
 require "#{ROOT}/fselector/util.rb"
+# entropy-related functions
 require "#{ROOT}/fselector/entropy.rb"
+# replace missing values
+require "#{ROOT}/fselector/replace_missing_values.rb"
 
 #
 # base class
data/lib/fselector/algo_base/base.rb
CHANGED
@@ -8,6 +8,8 @@ module FSelector
   class Base
     # include FileIO
     include FileIO
+    # include ReplaceMissingValues
+    include ReplaceMissingValues
 
     # initialize from an existing data structure
     def initialize(data=nil)
@@ -167,13 +169,13 @@ module FSelector
     def set_data(data)
       if data and data.class == Hash
         @data = data
-        # clear
-
-        @scores, @ranks, @sz = nil, nil, nil
+        # clear variables
+        clear_vars
       else
         abort "[#{__FILE__}@#{__LINE__}]: "+
               "data must be a Hash object!"
       end
+
       data
     end
 
@@ -335,6 +337,14 @@ module FSelector
 
     private
 
+    # clear variables when data structure is altered
+    def clear_vars
+      @classes, @features, @fvs = nil, nil, nil
+      @scores, @ranks, @sz = nil, nil, nil
+      @cv, @fvs = nil, nil
+    end
+
+
     # set feature (f) score (s) for class (k)
     def set_feature_score(f, k, s)
       @scores ||= {}
@@ -342,6 +352,7 @@ module FSelector
       @scores[f][k] = s
     end
 
+
     # get subset of feature
     def get_feature_subset
       abort "[#{__FILE__}@#{__LINE__}]: "+
data/lib/fselector/algo_base/base_CFS.rb
CHANGED
@@ -21,6 +21,9 @@ module FSelector
 
     # use sequential forward search
     def get_feature_subset
+      # handle missing values
+      handle_missing_values
+
       subset = []
       feats = get_features.dup
 
@@ -58,6 +61,15 @@ module FSelector
     end # get_feature_subset
 
 
+    # handle missing values
+    # CFS replaces missing values with the mean for continous features and
+    # the most seen value for discrete features
+    def handle_missing_values
+      abort "[#{__FILE__}@#{__LINE__}]: "+
+            "derived CFS algo must implement its own handle_missing_values()"
+    end
+
+
     # calc new merit of subset when adding feature (f)
     def calc_merit(subset, f)
       k = subset.size.to_f + 1
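The new `handle_missing_values` hook in `BaseCFS` is a template method: the base class calls it before the forward search and aborts unless a subclass supplies an implementation (as `CFS_c` and `CFS_d` do in later hunks). A standalone sketch of that pattern with illustrative class names, not gem code:

```ruby
# illustrative only: mirrors the hook-and-override structure used by BaseCFS
class BaseSearch
  def search
    handle_missing_values   # hook; subclasses decide how to fill the gaps
    puts 'running forward search...'
  end

  private

  def handle_missing_values
    abort 'derived class must implement handle_missing_values()'
  end
end

class ContinuousSearch < BaseSearch
  private

  # continuous data: fill with the mean, as CFS_c does via replace_with_mean_value!
  def handle_missing_values
    puts 'replacing missing values with feature means'
  end
end

ContinuousSearch.new.search
```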
data/lib/fselector/algo_base/base_continuous.rb
CHANGED
@@ -10,8 +10,8 @@ module FSelector
   class BaseContinuous < Base
     # include normalizer
     include Normalizer
-    # include
-    include
+    # include discretizer
+    include Discretizer
 
     # initialize from an existing data structure
     def initialize(data=nil)
data/lib/fselector/algo_continuous/CFS_c.rb
CHANGED
@@ -8,8 +8,18 @@ module FSelector
   # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf)
   #
   class CFS_c < BaseCFS
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
+
+
+    # replace missing values with mean feature value
+    def handle_missing_values
+      replace_with_mean_value!
+    end
+
 
     # calc the feature-class correlation of two vectors
     def do_rcf(cv, fv)
data/lib/fselector/algo_continuous/ReliefF_c.rb
CHANGED
@@ -10,6 +10,9 @@ module FSelector
   # ref: [Estimating Attributes: Analysis and Extensions of RELIEF](http://www.springerlink.com/content/fp23jh2h0426ww45/)
   #
   class ReliefF_c < BaseReliefF
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
 
data/lib/fselector/algo_continuous/Relief_c.rb
CHANGED
@@ -10,6 +10,9 @@ module FSelector
   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
   #
   class Relief_c < BaseRelief
+    # include normalizer and discretizer
+    include Normalizer
+    include Discretizer
 
     private
 
data/lib/fselector/algo_continuous/discretizer.rb
CHANGED
@@ -1,7 +1,10 @@
 #
-#
+# discretize continous feature
 #
-module Discretilizer
+module Discretizer
+  # include Entropy module
+  include Entropy
+
   # discretize by equal-width intervals
   #
   # @param [Integer] n_interval
@@ -84,7 +87,7 @@ module Discretilizer
   # 2     4.60    5.99    9.21   13.82
   # 3     6.35    7.82   11.34   16.27
   #
-  def
+  def discretize_by_ChiMerge!(chisq)
     # chisq = 4.60 # for iris::Sepal.Length
     # for intialization
     hzero = {}
@@ -177,19 +180,71 @@ module Discretilizer
       end
     end
 
-  end #
+  end # discretize_ChiMerge!
+
+
+  #
+  # discretize by Multi-Interval Discretization (MID) algorithm
+  # @note no missing feature values allowed and data structure will be altered
+  #
+  # ref: [Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning](http://www.ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
+  #
+  def discretize_by_MID!
+    # determine the final boundaries
+    f2cp = {} # cut points for each feature
+    each_feature do |f|
+      cv = get_class_labels
+      # we assume no missing feature values
+      fv = get_feature_values(f)
+
+      n = cv.size
+      # sort cv and fv according ascending order of fv
+      sis = (0...n).to_a.sort { |i,j| fv[i] <=> fv[j] }
+      cv = cv.values_at(*sis)
+      fv = fv.values_at(*sis)
+
+      # get initial boundaries
+      bs = []
+      fv.each_with_index do |v, i|
+        # cut point (Ta) for feature A must always be a value between
+        # two examples of different classes in the sequence of sorted examples
+        # see orginal reference
+        if i < n-1 and cv[i] != cv[i+1]
+          bs << (v+fv[i+1])/2.0
+        end
+      end
+      bs.uniq! # remove duplicates
+
+      # main algorithm, iteratively determine cut point
+      cp = []
+      partition(cv, fv, bs, cp)
+
+      # add the rightmost boundary for convenience
+      cp << fv.max+1.0
+      # record cut points for feature (f)
+      f2cp[f] = cp
+    end
+
+    # discretize based on cut points
+    each_sample do |k, s|
+      s.keys.each do |f|
+        s[f] = get_index(s[f], f2cp[f])
+      end
+    end
+
+  end # discretize_by_MID!
 
   private
 
   # get index from sorted boundaries
   #
   # min -- | -- | -- | ... max |
-  #
-  #
+  #        b1   b2   b3  ...   bn(=max+1)
+  #    1      2    3     ...      n
   #
   def get_index(v, boundaries)
     boundaries.each_with_index do |b, i|
-      return i if v < b
+      return i+1 if v < b
     end
   end # get_index
 
@@ -215,4 +270,103 @@ module Discretilizer
   end # calc_chisq
 
 
+  #
+  # Multi-Interval Discretization main algorithm
+  # recursively always selecting the best cut point
+  #
+  # @param [Array] cv class labels
+  # @param [Array] fv feature values
+  # @param [Array] bs potential cut points
+  # @param [Array] cp resultant cut points
+  def partition(cv, fv, bs, cp)
+    # best cut point
+    cp_best = nil
+
+    # binary subset at the best cut point
+    cv1_best, cv2_best = nil, nil
+    fv1_best, fv2_best = nil, nil
+    bs1_best, bs2_best = nil, nil
+
+    # best information gain
+    gain_best = -100.0
+    ent_best = -100.0
+    ent1_best = -100.0
+    ent2_best = -100.0
+
+    # try each potential cut point
+    bs.each do |b|
+      # binary split
+      cv1_try, cv2_try, fv1_try, fv2_try, bs1_try, bs2_try =
+        binary_split(cv, fv, bs, b)
+
+      # gain for this cut point
+      ent_try = get_marginal_entropy(cv)
+      ent1_try = get_marginal_entropy(cv1_try)
+      ent2_try = get_marginal_entropy(cv2_try)
+      gain_try = ent_try -
+                 (cv1_try.size.to_f/cv.size) * ent1_try -
+                 (cv2_try.size.to_f/cv.size) * ent2_try
+
+      #pp gain_try
+      if gain_try > gain_best
+        cp_best = b
+        cv1_best, cv2_best = cv1_try, cv2_try
+        fv1_best, fv2_best = fv1_try, fv2_try
+        bs1_best, bs2_best = bs1_try, bs2_try
+
+        gain_best = gain_try
+        ent_best = ent_try
+        ent1_best, ent2_best = ent1_try, ent2_try
+      end
+    end
+
+    # to cut or not to cut?
+    #
+    # Gain(A,T;S) > 1/N * log2(N-1) + 1/N * delta(A,T;S)
+    if cp_best
+      n = cv.size.to_f
+      k = cv.uniq.size.to_f
+      k1 = cv1_best.uniq.size.to_f
+      k2 = cv2_best.uniq.size.to_f
+      delta = Math.log2(3**k-2)-(k*ent_best - k1*ent1_best - k2*ent2_best)
+
+      # accept cut point
+      if gain_best > (Math.log2(n-1)/n + delta/n)
+        # a: record cut point
+        cp << cp_best
+
+        # b: recursively call on subset
+        partition(cv1_best, fv1_best, bs1_best, cp)
+        partition(cv2_best, fv2_best, bs2_best, cp)
+      end
+    end
+  end
+
+
+  # binarily split based on a cut point
+  def binary_split(cv, fv, bs, cut_point)
+    cv1, cv2, fv1, fv2, bs1, bs2 = [], [], [], [], [], []
+    fv.each_with_index do |v, i|
+      if v < cut_point
+        cv1 << cv[i]
+        fv1 << v
+      else
+        cv2 << cv[i]
+        fv2 << v
+      end
+    end
+
+    bs.each do |b|
+      if b < cut_point
+        bs1 << b
+      else
+        bs2 << b
+      end
+    end
+
+    # return subset
+    [cv1, cv2, fv1, fv2, bs1, bs2]
+  end
+
+
 end # module
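The new `discretize_by_MID!` needs no tuning parameter: candidate cut points are midpoints between adjacent samples of different classes, and a cut is accepted only when its information gain passes the MDL test Gain(A,T;S) > log2(N-1)/N + delta(A,T;S)/N, with delta(A,T;S) = log2(3^k - 2) - [k*Ent(S) - k1*Ent(S1) - k2*Ent(S2)], exactly as coded in `partition` above. A hedged usage sketch; the selector class and CSV path are assumptions for illustration, not taken from this diff:

```ruby
require 'fselector'

# assumed: any continuous-feature selector mixes in Discretizer
r1 = FSelector::BaseContinuous.new
r1.data_from_csv('test/iris.csv')

# no threshold to choose, unlike discretize_by_ChiMerge!(chisq);
# the MDL criterion decides when to stop splitting each feature
r1.discretize_by_MID!
```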
data/lib/fselector/algo_continuous/normalizer.rb
CHANGED
@@ -3,7 +3,7 @@
 #
 module Normalizer
   # log transformation, requires positive feature values
-  def
+  def normalize_by_log!(base=10)
    each_sample do |k, s|
      s.keys.each do |f|
        s[f] = Math.log(s[f], base) if s[f] > 0.0
@@ -13,7 +13,7 @@ module Normalizer
 
 
  # scale to [min,max], max > min
-  def
+  def normalize_by_min_max!(min=0.0, max=1.0)
    # first determine min and max for each feature
    f2min_max = {}
 
@@ -33,7 +33,7 @@ module Normalizer
 
 
  # by z-score
-  def
+  def normalize_by_zscore!
    # first determine mean and sd for each feature
    f2mean_sd = {}
 
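For completeness, the renamed normalizers side by side; a hedged sketch assuming `r1` is a continuous-feature selector with data already loaded (pick one strategy per run):

```ruby
# illustrative calls only; r1 is assumed to hold continuous feature data
r1.normalize_by_log!(10)              # log transform, positive values required
# r1.normalize_by_min_max!(0.0, 1.0)  # rescale each feature into [min, max]
# r1.normalize_by_zscore!             # center by mean, scale by standard deviation
```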
data/lib/fselector/algo_discrete/CFS_d.rb
CHANGED
@@ -12,6 +12,12 @@ module FSelector
     include Entropy
 
     private
+
+    # replace missing values with most seen feature value
+    def handle_missing_values
+      replace_with_most_seen_value!
+    end
+
 
     # calc the feature-class correlation of two vectors
     def do_rcf(cv, fv)
data/lib/fselector/entropy.rb
CHANGED
@@ -7,16 +7,16 @@ module Entropy
   #
   # H(X) = -1 * sigma_i (P(x_i) logP(x_i))
   #
-
+  def get_marginal_entropy(arrX)
     h = 0.0
     n = arrX.size.to_f
-
-
-
-
-
-
-
+
+    arrX.uniq.each do |x_i|
+      p = arrX.count(x_i)/n
+      h += -1.0 * (p * Math.log2(p))
+    end
+
+    h
   end # get_marginal_entropy
 
 
@@ -27,28 +27,28 @@ module Entropy
   #
   # where H(X|y_j) = -1 * sigma_i (P(x_i|y_j) logP(x_i|y_j))
   #
-
-
-
-
+  def get_conditional_entropy(arrX, arrY)
+    abort "[#{__FILE__}@#{__LINE__}]: "+
+          "array must be of same length" if not arrX.size == arrY.size
+
     hxy = 0.0
-
-
-
-
-
-
-
-
-
-
-
-
-
+    n = arrX.size.to_f
+
+    arrY.uniq.each do |y_j|
+      p1 = arrY.count(y_j)/n
+
+      indices = (0...n).to_a.select { |k| arrY[k] == y_j }
+      xvs = arrX.values_at(*indices)
+      m = xvs.size.to_f
+
+      xvs.uniq.each do |x_i|
+        p2 = xvs.count(x_i)/m
+
+        hxy += -1.0 * p1 * (p2 * Math.log2(p2))
+      end
     end
-
-    hxy
+
+    hxy
   end # get_conditional_entropy
 
 
@@ -60,11 +60,11 @@ module Entropy
   #
   # i.e. H(X,Y) == H(Y,X)
   #
-
+  def get_joint_entropy(arrX, arrY)
     abort "[#{__FILE__}@#{__LINE__}]: "+
           "array must be of same length" if not arrX.size == arrY.size
-
-    get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
+
+    get_marginal_entropy(arrY) + get_conditional_entropy(arrX, arrY)
   end # get_joint_entropy
 
 
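The rewritten helpers act on plain arrays, so the formulas are easy to verify by hand. A small self-contained check (not gem code) that mirrors H(X) = -sigma_i P(x_i) log2 P(x_i) and the decomposition H(X,Y) = H(Y) + H(X|Y) used by `get_joint_entropy`:

```ruby
# standalone hand-check of the entropy formulas above, independent of the gem
def marginal_entropy(arr)
  n = arr.size.to_f
  arr.uniq.reduce(0.0) { |h, v| p = arr.count(v) / n; h - p * Math.log2(p) }
end

x = [1, 1, 2, 2]   # two classes, evenly split
y = [1, 1, 1, 2]

puts marginal_entropy(x)   # => 1.0   (one bit)
puts marginal_entropy(y)   # => ~0.811
# H(X|Y) works out to ~0.689 for these arrays, so
# H(X,Y) = H(Y) + H(X|Y) ~ 1.5, matching get_joint_entropy's decomposition
```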
data/lib/fselector/fileio.rb
CHANGED
@@ -110,10 +110,22 @@ module FileIO
     ofs = File.open(fname, 'w')
   end
 
+  # convert class label to integer type
+  k2idx = {}
+  get_classes.each_with_index do |k, i|
+    k2idx[k] = i+1
+  end
+
+  # convert feature to integer type
+  f2idx = {}
+  get_features.each_with_index do |f, i|
+    f2idx[f] = i+1
+  end
+
   each_sample do |k, s|
-    ofs.print "#{k} "
+    ofs.print "#{k2idx[k]} "
     s.keys.sort { |x, y| x.to_s.to_i <=> y.to_s.to_i }.each do |i|
-      ofs.print " #{i}:#{s[i]}" if not s[i].zero?
+      ofs.print " #{f2idx[i]}:#{s[i]}" if not s[i].zero? # implicit mode
     end
     ofs.puts
   end
@@ -171,7 +183,7 @@ module FileIO
       end
     else
       abort "[#{__FILE__}@#{__LINE__}]: "+
-            "
+            "the first two rows must have same number of fields"
     end
   else # data rows
     label, *fvs = ln.chomp.split(/,/)
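The hunk above makes the LibSVM writer map string class labels and feature names to 1-based integer indices before writing. A standalone illustration of the resulting line format (invented sample data, not gem code):

```ruby
# illustrative reimplementation of the k2idx/f2idx mapping shown above
classes  = ['setosa', 'versicolor']
features = ['sepal_len', 'sepal_wid']

k2idx = {}
classes.each_with_index  { |k, i| k2idx[k] = i + 1 }
f2idx = {}
features.each_with_index { |f, i| f2idx[f] = i + 1 }

sample = { 'sepal_len' => 5.1, 'sepal_wid' => 3.5 }
line = "#{k2idx['setosa']} " +
       features.map { |f| " #{f2idx[f]}:#{sample[f]}" unless sample[f].zero? }.compact.join
puts line   # => "1  1:5.1 2:3.5"  (zero-valued features are skipped, "implicit mode")
```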
data/lib/fselector/replace_missing_values.rb
ADDED
@@ -0,0 +1,78 @@
+#
+# replace missing feature values
+#
+module ReplaceMissingValues
+  #
+  # replace missing feature value with a fixed value
+  # applicable for both discrete and continuous feature
+  # @note data structure will be altered
+  #
+  def replace_with_fixed_value!(val)
+    each_sample do |k, s|
+      each_feature do |f|
+        if not s.has_key? f
+          s[f] = val
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_fixed_value!
+
+
+  #
+  # replace missing feature value with mean feature value
+  # applicable only to continuous feature
+  # @note data structure will be altered
+  #
+  def replace_with_mean_value!
+    each_sample do |k, s|
+      each_feature do |f|
+        fv = get_feature_values(f)
+        next if fv.size == get_sample_size # no missing values
+
+        mean = fv.ave
+        if not s.has_key? f
+          s[f] = mean
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_mean_value!
+
+
+  #
+  # replace missing feature value with most seen feature value
+  # applicable only to discrete feature
+  # @note data structure will be altered
+  #
+  def replace_with_most_seen_value!
+    each_sample do |k, s|
+      each_feature do |f|
+        fv = get_feature_values(f)
+        next if fv.size == get_sample_size # no missing values
+
+        seen_count, seen_value = 0, nil
+        fv.uniq.each do |v|
+          count = fv.count(v)
+          if count > seen_count
+            seen_count = count
+            seen_value = v
+          end
+        end
+
+        if not s.has_key? f
+          s[f] = seen_value
+        end
+      end
+    end
+
+    # clear variables
+    clear_vars
+  end # replace_with_most_seen_value!
+
+
+end # ReplaceMissingValues
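A hedged usage sketch of the new module; every selector inherits these methods through `Base`. The class name and input file below are assumptions for illustration only:

```ruby
require 'fselector'

# assumed: a discrete-feature selector whose samples have missing features
r = FSelector::CFS_d.new
r.data_from_csv('my_training_data.csv')   # hypothetical input file

# choose the strategy that matches the feature type
r.replace_with_fixed_value!(0)       # discrete or continuous
# r.replace_with_mean_value!         # continuous features only
# r.replace_with_most_seen_value!    # discrete features only
```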
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 prerelease:
 platform: ruby
 authors:
@@ -9,17 +9,19 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-
+date: 2012-04-03 00:00:00.000000000 Z
 dependencies: []
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
-  algorithms into one single package. Welcome to contact me
-  you
-  user to perform feature selection by using either a single algorithm
-
-
-
-  LibSVM
-
+  algorithms and related functions into one single package. Welcome to contact me
+  (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
+  FSelector allows user to perform feature selection by using either a single algorithm
+  or an ensemble of multiple algorithms, and other common tasks including normalization
+  and discretization on continuous data, as well as replace missing feature values
+  with certain criterion. FSelector acts on a full-feature data set in either CSV,
+  LibSVM or WEKA file format and outputs a reduced data set with only selected subset
+  of features, which can later be used as the input for various machine learning softwares
+  including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
+  learning algorithms such as support vector machines and random forest.
 email: need47@gmail.com
 executables: []
 extensions: []
@@ -73,6 +75,7 @@ files:
 - lib/fselector/ensemble.rb
 - lib/fselector/entropy.rb
 - lib/fselector/fileio.rb
+- lib/fselector/replace_missing_values.rb
 - lib/fselector/util.rb
 - lib/fselector.rb
 homepage: http://github.com/need47/fselector