fselector 1.2.0 → 1.3.0
- data/ChangeLog +7 -0
- data/README.md +53 -52
- data/lib/fselector.rb +10 -1
- data/lib/fselector/algo_base/base.rb +62 -9
- data/lib/fselector/algo_base/base_CFS.rb +0 -9
- data/lib/fselector/algo_base/base_ReliefF.rb +0 -8
- data/lib/fselector/algo_base/base_discrete.rb +0 -8
- data/lib/fselector/{algo_discrete → algo_both}/LasVegasFilter.rb +2 -2
- data/lib/fselector/{algo_discrete → algo_both}/LasVegasIncremental.rb +2 -2
- data/lib/fselector/{algo_discrete → algo_both}/Random.rb +3 -2
- data/lib/fselector/algo_continuous/ReliefF_c.rb +0 -8
- data/lib/fselector/algo_continuous/Relief_c.rb +0 -8
- data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +0 -8
- data/lib/fselector/algo_discrete/InformationGain.rb +1 -9
- data/lib/fselector/discretizer.rb +6 -1
- data/lib/fselector/ensemble.rb +3 -3
- data/lib/fselector/fileio.rb +181 -102
- metadata +7 -7
data/ChangeLog
CHANGED
@@ -1,3 +1,10 @@
|
|
1
|
+
2012-05-24 version 1.3.0
|
2
|
+
|
3
|
+
* update clear\_vars() in Base by use of Ruby metaprogramming, this trick avoids repetitive overriding it in each derived subclass
|
4
|
+
* re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to dataset with either discrete or continuous features, even with mixed type
|
5
|
+
* update data\_from\_csv() so that it can read CSV file more flexibly. note by default, the last column is class label
|
6
|
+
* add data\_from\_url() to read on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a url
|
7
|
+
|
1
8
|
2012-05-20 version 1.2.0
|
2
9
|
|
3
10
|
* add KS-Test algorithm for continuous feature
|
data/README.md
CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
|
|
8
8
|
**Email**: [need47@gmail.com](mailto:need47@gmail.com)
|
9
9
|
**Copyright**: 2012
|
10
10
|
**License**: MIT License
|
11
|
-
**Latest Version**: 1.
|
12
|
-
**Release Date**: 2012-05-
|
11
|
+
**Latest Version**: 1.3.0
|
12
|
+
**Release Date**: 2012-05-24
|
13
13
|
|
14
14
|
Synopsis
|
15
15
|
--------
|
@@ -38,56 +38,57 @@ Feature List
|
|
38
38
|
- csv
|
39
39
|
- libsvm
|
40
40
|
- weka ARFF
|
41
|
+
- on-line dataset in one of the above three formats (read only)
|
41
42
|
- random data (read only, for test purpose)
|
42
43
|
|
43
44
|
**2. available feature selection/ranking algorithms**
|
44
45
|
|
45
|
-
algorithm shortcut algo_type feature_type
|
46
|
-
|
47
|
-
Accuracy Acc weighting
|
48
|
-
AccuracyBalanced Acc2 weighting
|
49
|
-
BiNormalSeparation BNS weighting
|
50
|
-
CFS_d CFS_d subset
|
51
|
-
ChiSquaredTest CHI weighting
|
52
|
-
CorrelationCoefficient CC weighting
|
53
|
-
DocumentFrequency DF weighting
|
54
|
-
F1Measure F1 weighting
|
55
|
-
FishersExactTest FET weighting
|
56
|
-
FastCorrelationBasedFilter FCBF subset
|
57
|
-
GiniIndex GI weighting
|
58
|
-
GMean GM weighting
|
59
|
-
GSSCoefficient GSS weighting
|
60
|
-
InformationGain IG weighting
|
61
|
-
INTERACT INTERACT subset
|
62
|
-
JMeasure JM weighting
|
63
|
-
KLDivergence KLD weighting
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
Recall
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
46
|
+
algorithm shortcut algo_type applicability feature_type
|
47
|
+
--------------------------------------------------------------------------------------------------
|
48
|
+
Accuracy Acc weighting multi-class discrete
|
49
|
+
AccuracyBalanced Acc2 weighting multi-class discrete
|
50
|
+
BiNormalSeparation BNS weighting multi-class discrete
|
51
|
+
CFS_d CFS_d subset multi-class discrete
|
52
|
+
ChiSquaredTest CHI weighting multi-class discrete
|
53
|
+
CorrelationCoefficient CC weighting multi-class discrete
|
54
|
+
DocumentFrequency DF weighting multi-class discrete
|
55
|
+
F1Measure F1 weighting multi-class discrete
|
56
|
+
FishersExactTest FET weighting multi-class discrete
|
57
|
+
FastCorrelationBasedFilter FCBF subset multi-class discrete
|
58
|
+
GiniIndex GI weighting multi-class discrete
|
59
|
+
GMean GM weighting multi-class discrete
|
60
|
+
GSSCoefficient GSS weighting multi-class discrete
|
61
|
+
InformationGain IG weighting multi-class discrete
|
62
|
+
INTERACT INTERACT subset multi-class discrete
|
63
|
+
JMeasure JM weighting multi-class discrete
|
64
|
+
KLDivergence KLD weighting multi-class discrete
|
65
|
+
MatthewsCorrelationCoefficient MCC, PHI weighting multi-class discrete
|
66
|
+
McNemarsTest MNT weighting multi-class discrete
|
67
|
+
OddsRatio OR weighting multi-class discrete
|
68
|
+
OddsRatioNumerator ORN weighting multi-class discrete
|
69
|
+
PhiCoefficient PHI weighting multi-class discrete
|
70
|
+
Power Power weighting multi-class discrete
|
71
|
+
Precision Precision weighting multi-class discrete
|
72
|
+
ProbabilityRatio PR weighting multi-class discrete
|
73
|
+
Recall Recall weighting multi-class discrete
|
74
|
+
Relief_d Relief_d weighting two-class discrete
|
75
|
+
ReliefF_d ReliefF_d weighting multi-class discrete
|
76
|
+
Sensitivity SN, Recall weighting multi-class discrete
|
77
|
+
Specificity SP weighting multi-class discrete
|
78
|
+
SymmetricalUncertainty SU weighting multi-class discrete
|
79
|
+
BetweenWithinClassesSumOfSquare BSS_WSS weighting multi-class continuous
|
80
|
+
CFS_c CFS_c subset multi-class continuous
|
81
|
+
FTest FT weighting multi-class continuous
|
82
|
+
KS_CCBF KS_CCBF subset multi-class continuous
|
83
|
+
KSTest KST weighting two-class continuous
|
84
|
+
PMetric PM weighting two-class continuous
|
85
|
+
Relief_c Relief_c weighting two-class continuous
|
86
|
+
ReliefF_c ReliefF_c weighting multi-class continuous
|
87
|
+
TScore TS weighting two-class continuous
|
88
|
+
WilcoxonRankSum WRS weighting two-class continuous
|
89
|
+
LasVegasFilter LVF subset multi-class discrete, continuous, mixed
|
90
|
+
LasVegasIncremental LVI subset multi-class discrete, continuous, mixed
|
91
|
+
Random Random weighting multi-class discrete, continuous, mixed
|
91
92
|
|
92
93
|
**note for feature selection interface:**
|
93
94
|
there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
|
@@ -132,7 +133,7 @@ To install FSelector, use the following command:
|
|
132
133
|
|
133
134
|
$ gem install fselector
|
134
135
|
|
135
|
-
**note:**
|
136
|
+
**note:** From version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
|
136
137
|
as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
|
137
138
|
which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
|
138
139
|
on statistical test. To this end, please pre-install the R package. RinRuby should have been
|
@@ -194,7 +195,7 @@ Usage
|
|
194
195
|
|
195
196
|
# creating an ensemble of feature selectors by using
|
196
197
|
# a single feature selection algorithm (INTERACT)
|
197
|
-
# by instance perturbation (e.g.
|
198
|
+
# by instance perturbation (e.g. random sampling)
|
198
199
|
|
199
200
|
# test for the type of feature subset selection algorithms
|
200
201
|
r = FSelector::INTERACT.new(0.0001)
|
@@ -258,8 +259,8 @@ Usage
|
|
258
259
|
# the Information Gain (IG) algorithm requires data with discrete feature
|
259
260
|
r = FSelector::IG.new
|
260
261
|
|
261
|
-
# but the Iris data set contains continuous features
|
262
|
-
r.
|
262
|
+
# but the Iris data set contains continuous features
|
263
|
+
r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
|
263
264
|
|
264
265
|
# let's first discretize it by ChiMerge algorithm at alpha=0.10
|
265
266
|
# then perform feature selection as usual
|
data/lib/fselector.rb
CHANGED
@@ -7,7 +7,7 @@ R.eval 'options(warn = -1)' # suppress R warnings
|
|
7
7
|
#
|
8
8
|
module FSelector
|
9
9
|
# module version
|
10
|
-
VERSION = '1.
|
10
|
+
VERSION = '1.3.0'
|
11
11
|
end
|
12
12
|
|
13
13
|
# the root dir of FSelector
|
@@ -52,6 +52,15 @@ Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
|
|
52
52
|
require f
|
53
53
|
end
|
54
54
|
|
55
|
+
|
56
|
+
#
|
57
|
+
# algorithms for handling both discrete and continuous feature
|
58
|
+
#
|
59
|
+
Dir.glob("#{ROOT}/fselector/algo_both/*").each do |f|
|
60
|
+
require f
|
61
|
+
end
|
62
|
+
|
63
|
+
|
55
64
|
#
|
56
65
|
# feature selection use an ensemble of algorithms
|
57
66
|
#
|
@@ -192,6 +192,34 @@ module FSelector
|
|
192
192
|
end
|
193
193
|
|
194
194
|
|
195
|
+
#
|
196
|
+
# get the feature type stored in @types
|
197
|
+
#
|
198
|
+
# @param [Symbol] feature feature of interest
|
199
|
+
# return all feature name-type pairs if nil,
|
200
|
+
# otherwise return the type for the feature of interest
|
201
|
+
#
|
202
|
+
def get_feature_type(feature=nil)
|
203
|
+
if @types
|
204
|
+
feature ? @types[feature] : @types
|
205
|
+
else
|
206
|
+
nil
|
207
|
+
end
|
208
|
+
end
|
209
|
+
|
210
|
+
|
211
|
+
#
|
212
|
+
# set feature name-type pair
|
213
|
+
#
|
214
|
+
# @param [Symbol] feature feature name
|
215
|
+
# @param [Symbol] type feature type
|
216
|
+
#
|
217
|
+
def set_feature_type(feature, type)
|
218
|
+
@types ||= {}
|
219
|
+
@types[feature] = type
|
220
|
+
end
|
221
|
+
|
222
|
+
|
195
223
|
#
|
196
224
|
# get internal data
|
197
225
|
#
|
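The accessor pair added in this hunk can be tried in isolation. The following is a minimal sketch, assuming a stand-in class named `TypeRegistry` (hypothetical, not part of FSelector) that holds the same lazily created `@types` hash:

```ruby
# Hypothetical stand-in class; FSelector defines these methods on its Base class.
class TypeRegistry
  # Store one feature name-type pair, creating the hash on first use.
  def set_feature_type(feature, type)
    @types ||= {}
    @types[feature] = type
  end

  # With no argument, return the whole name-type hash;
  # with a feature name, return only that feature's type.
  # Returns nil when no type has been stored yet.
  def get_feature_type(feature = nil)
    return nil unless @types
    feature ? @types[feature] : @types
  end
end

reg = TypeRegistry.new
reg.get_feature_type(:petal_width)            # nil, nothing stored yet
reg.set_feature_type(:petal_width, :real)
reg.get_feature_type(:petal_width)            # :real
reg.get_feature_type                          # {:petal_width => :real}
```

The early `nil` return mirrors the guard in the diff, so callers can ask for a type before any data has been loaded.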
@@ -241,7 +269,11 @@ module FSelector
|
|
241
269
|
# @note return all non-data as a Hash if key == nil
|
242
270
|
#
|
243
271
|
def get_opt(key=nil)
|
244
|
-
|
272
|
+
if @opts
|
273
|
+
key ? @opts[key] : @opts
|
274
|
+
else
|
275
|
+
nil
|
276
|
+
end
|
245
277
|
end
|
246
278
|
|
247
279
|
|
@@ -404,17 +436,38 @@ module FSelector
|
|
404
436
|
private
|
405
437
|
|
406
438
|
#
|
407
|
-
# clear variables when data structure is altered,
|
408
|
-
# useful when data structure has changed while
|
409
|
-
#
|
439
|
+
# clear instance variables when data structure is altered,
|
440
|
+
# this is useful when data structure has changed while you
|
441
|
+
# still want to use the same instance
|
410
442
|
#
|
411
|
-
# @note the variables
|
412
|
-
#
|
443
|
+
# @note only the instance variables used in the initialize()
|
444
|
+
# such as @data will be retained. class instance variable
|
445
|
+
# @algo_type will also be retained. This trick by use of
|
446
|
+
# Ruby metaprogramming avoids the repetitive overriding of
|
447
|
+
# clear_vars() in each derived subclass
|
413
448
|
#
|
414
449
|
def clear_vars
|
415
|
-
|
416
|
-
|
417
|
-
|
450
|
+
# instance vars appeared as arguments (with the same name) in initialize()
|
451
|
+
instance_var_in_new = []
|
452
|
+
|
453
|
+
constructor = method(:initialize)
|
454
|
+
if constructor.respond_to? :parameters
|
455
|
+
constructor.parameters.each do |p|
|
456
|
+
instance_var_in_new << "@#{p[1]}".to_sym
|
457
|
+
end
|
458
|
+
end
|
459
|
+
|
460
|
+
instance_variables.each do |var|
|
461
|
+
# retain instance vars appeared as arguments in initialize()
|
462
|
+
# such as @data
|
463
|
+
next if instance_var_in_new.include? var
|
464
|
+
# retain feature types, which may be needed
|
465
|
+
# by CSV and Weka ARFF file
|
466
|
+
next if var == :@types
|
467
|
+
|
468
|
+
# clear all other instance variable
|
469
|
+
instance_variable_set(var, nil)
|
470
|
+
end
|
418
471
|
end
|
419
472
|
|
420
473
|
|
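The reflection trick used by the new `clear_vars` can be sketched on its own. The example below assumes a hypothetical class `DemoSelector` with made-up instance variables. It is not FSelector code, and `clear_vars` is left public here only so the demo can call it directly:

```ruby
# Hypothetical stand-in for a selector class; not part of FSelector.
class DemoSelector
  def initialize(data = nil, k = 10)
    @data = data
    @k = k
  end

  # Populate some cached state, as an algorithm would while running.
  def run
    @scores = { :f1 => 0.5 }
    @types  = { :f1 => :nominal }
  end

  # Reset every instance variable except those named after an
  # initialize() parameter, and except @types.
  def clear_vars
    keep = [:@types]
    ctor = method(:initialize)
    if ctor.respond_to?(:parameters)
      # parameters yields pairs such as [:opt, :data]
      ctor.parameters.each { |_kind, name| keep << "@#{name}".to_sym }
    end

    instance_variables.each do |var|
      next if keep.include?(var)
      instance_variable_set(var, nil)
    end
  end
end

s = DemoSelector.new([1, 2], 3)
s.run
s.clear_vars
s.instance_variable_get(:@data)    # [1, 2], kept: named after a constructor argument
s.instance_variable_get(:@scores)  # nil, cleared
```

Because the list of retained variables is derived from the constructor signature, a subclass that caches extra state does not need its own override. That is what allows the per-class `clear_vars` overrides to be deleted in the hunks that follow.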
@@ -142,15 +142,6 @@ module FSelector
|
|
142
142
|
end # do_rff
|
143
143
|
|
144
144
|
|
145
|
-
# override clear\_vars for BaseCFS
|
146
|
-
def clear_vars
|
147
|
-
super
|
148
|
-
|
149
|
-
@rcf_best, @rff_best = nil, nil
|
150
|
-
@f2rcf, @fs2rff, @f2idx = nil, nil, nil
|
151
|
-
end # clear_vars
|
152
|
-
|
153
|
-
|
154
145
|
end # class
|
155
146
|
|
156
147
|
|
@@ -3,14 +3,14 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Las Vegas Filter (LVF) for discrete feature,
|
6
|
+
# Las Vegas Filter (LVF) for discrete, continuous or mixed feature,
|
7
7
|
# use **select\_feature!** for feature selection
|
8
8
|
#
|
9
9
|
# @note we only keep one of the equivalently good solutions
|
10
10
|
#
|
11
11
|
# ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
|
12
12
|
#
|
13
|
-
class LasVegasFilter <
|
13
|
+
class LasVegasFilter < Base
|
14
14
|
# include module
|
15
15
|
include Consistency
|
16
16
|
|
@@ -3,12 +3,12 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Las Vegas Incremental (LVI) for discrete feature,
|
6
|
+
# Las Vegas Incremental (LVI) for discrete, continuous or mixed feature,
|
7
7
|
# use **select\_feature!** for feature selection
|
8
8
|
#
|
9
9
|
# ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
|
10
10
|
#
|
11
|
-
class LasVegasIncremental <
|
11
|
+
class LasVegasIncremental < Base
|
12
12
|
# include module
|
13
13
|
include Consistency
|
14
14
|
|
@@ -3,13 +3,14 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Random (Rand)
|
6
|
+
# Random (Rand) for discrete, continuous or mixed feature,
|
7
|
+
# no practical use but can be used as a baseline
|
7
8
|
#
|
8
9
|
# Rand = rand numbers within [0..1)
|
9
10
|
#
|
10
11
|
# ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
|
11
12
|
#
|
12
|
-
class Random <
|
13
|
+
class Random < Base
|
13
14
|
# this algo outputs weight for each feature
|
14
15
|
@algo_type = :feature_weighting
|
15
16
|
|
@@ -41,15 +41,7 @@ module FSelector
|
|
41
41
|
s = @hc - hcf
|
42
42
|
|
43
43
|
set_feature_score(f, :BEST, s)
|
44
|
-
end # calc_contribution
|
45
|
-
|
46
|
-
|
47
|
-
# override clear\_vars for InformationGain
|
48
|
-
def clear_vars
|
49
|
-
super
|
50
|
-
|
51
|
-
@hc = nil
|
52
|
-
end # clear_vars
|
44
|
+
end # calc_contribution
|
53
45
|
|
54
46
|
|
55
47
|
end # class
|
@@ -474,8 +474,13 @@ module Discretizer
|
|
474
474
|
end
|
475
475
|
end
|
476
476
|
|
477
|
-
# clear vars
|
477
|
+
# clear vars because of data change
|
478
478
|
clear_vars
|
479
|
+
|
480
|
+
# set all feature type as CATEGORICAL
|
481
|
+
each_feature do |f|
|
482
|
+
set_feature_type(f, :categorical)
|
483
|
+
end
|
479
484
|
end # discretize_at_cutpoints!
|
480
485
|
|
481
486
|
|
data/lib/fselector/ensemble.rb
CHANGED
@@ -258,6 +258,7 @@ module FSelector
|
|
258
258
|
@algo_type = algo.algo_type
|
259
259
|
end
|
260
260
|
|
261
|
+
private
|
261
262
|
|
262
263
|
#
|
263
264
|
# get ensemble feature scores
|
@@ -304,7 +305,6 @@ module FSelector
|
|
304
305
|
#pp ensem_ranks
|
305
306
|
end # get_ensemble_ranks
|
306
307
|
|
307
|
-
private
|
308
308
|
|
309
309
|
#
|
310
310
|
# override get\_feature\_subset() for EnsembleSingle,
|
@@ -417,6 +417,8 @@ module FSelector
|
|
417
417
|
end
|
418
418
|
end
|
419
419
|
|
420
|
+
private
|
421
|
+
|
420
422
|
#
|
421
423
|
# get ensemble feature scores
|
422
424
|
#
|
@@ -455,8 +457,6 @@ module FSelector
|
|
455
457
|
end # get_ensemble_ranks
|
456
458
|
|
457
459
|
|
458
|
-
private
|
459
|
-
|
460
460
|
#
|
461
461
|
# override get\_feature\_subset() for EnsembleMultiple,
|
462
462
|
# select a subset of features based on frequency count
|
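The ensemble.rb hunks only move the `private` keyword so that the ensemble score and rank helpers fall under it. A minimal sketch of that Ruby behavior, using a hypothetical class `DemoEnsemble` that is not taken from the gem:

```ruby
# Hypothetical class illustrating where a bare `private` takes effect.
class DemoEnsemble
  # Defined before `private`, so callable from outside.
  def public_entry
    helper_score * 2
  end

  private

  # Every method defined after the bare `private` keyword is private,
  # so it can only be called from inside the object.
  def helper_score
    21
  end
end

e = DemoEnsemble.new
e.public_entry            # 42
e.respond_to?(:helper_score)  # false, the helper is hidden from callers
```

Moving the keyword upward, as the diff does, therefore hides the helpers that sit between the old and new positions without touching their bodies.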
data/lib/fselector/fileio.rb
CHANGED
@@ -20,6 +20,11 @@
|
|
20
20
|
# @note class labels and features are treated as symbols
|
21
21
|
#
|
22
22
|
module FileIO
|
23
|
+
# require the open-uri lib for http/https/ftp request
|
24
|
+
require 'open-uri'
|
25
|
+
# require the stringio lib for converting a string into stringio object
|
26
|
+
require 'stringio'
|
27
|
+
|
23
28
|
#
|
24
29
|
# read from random data (read only, for test purpose)
|
25
30
|
#
|
@@ -76,20 +81,13 @@ module FileIO
|
|
76
81
|
# -1 3:1 4:1 ...
|
77
82
|
# ....
|
78
83
|
#
|
79
|
-
# @param [String] fname
|
84
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
80
85
|
# :stdin # read from standard input instead of file
|
81
86
|
#
|
82
|
-
def data_from_libsvm(fname=:stdin)
|
83
|
-
|
87
|
+
def data_from_libsvm(fname=:stdin)
|
88
|
+
ifs = get_ifs(fname)
|
84
89
|
|
85
|
-
|
86
|
-
ifs = $stdin
|
87
|
-
elsif not File.exists? fname
|
88
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
89
|
-
" File '#{fname}' does not exist!"
|
90
|
-
else
|
91
|
-
ifs = File.open(fname)
|
92
|
-
end
|
90
|
+
data = {}
|
93
91
|
|
94
92
|
ifs.each_line do |ln|
|
95
93
|
label, *features = ln.chomp.split(/\s+/)
|
@@ -116,14 +114,10 @@ module FileIO
|
|
116
114
|
# write to libsvm
|
117
115
|
#
|
118
116
|
# @param [String] fname file to write
|
119
|
-
# :stdout # write to standard ouput
|
117
|
+
# :stdout # write to standard ouput
|
120
118
|
#
|
121
119
|
def data_to_libsvm(fname=:stdout)
|
122
|
-
|
123
|
-
ofs = $stdout
|
124
|
-
else
|
125
|
-
ofs = File.open(fname, 'w')
|
126
|
-
end
|
120
|
+
ofs = get_ofs(fname)
|
127
121
|
|
128
122
|
# convert class label to integer type
|
129
123
|
k2idx = {}
|
@@ -153,75 +147,108 @@ module FileIO
|
|
153
147
|
#
|
154
148
|
# read from csv
|
155
149
|
#
|
156
|
-
#
|
157
|
-
#
|
158
|
-
#
|
159
|
-
#
|
150
|
+
# if no csv_opts supplied, we assume the CSV file in the following format:
|
151
|
+
# first row contains feature names with last column being class name
|
152
|
+
#
|
153
|
+
# feat_name1,feat_name2,...,class_name
|
154
|
+
#
|
155
|
+
# second row contains feature types with last column being class type
|
156
|
+
#
|
157
|
+
# feat_type1,feat_type2,...,class_type
|
160
158
|
#
|
161
|
-
# and the remaing rows
|
162
|
-
#
|
163
|
-
#
|
159
|
+
# and the remaining rows containing feature values with last column being class label
|
160
|
+
#
|
161
|
+
# feat_value1,feat_value2,...,class_label
|
162
|
+
# ...
|
164
163
|
#
|
165
164
|
# allowed feature types (case-insensitive) are:
|
166
|
-
# INTEGER, REAL, NUMERIC, CONTINUOUS, STRING, NOMINAL, CATEGORICAL
|
165
|
+
# INTEGER, REAL, NUMERIC, CONTINUOUS, DOUBLE, FLOAT, STRING, NOMINAL, CATEGORICAL
|
167
166
|
#
|
168
|
-
# @param [String] fname
|
169
|
-
# :stdin # read from standard input
|
167
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
168
|
+
# :stdin # read from standard input
|
169
|
+
# @param [Hash] csv_opts named arguments for csv options
|
170
|
+
# :feature\_name\_row => 1, # (first) row that contains feature names
|
171
|
+
# :feature\_type\_row => 2, # (second) row that contains feature types
|
172
|
+
# :class\_label\_column => 0 # (last) column that contains class labels
|
173
|
+
# :feature\_name2type => {}, # user-supplied hash containing feature name-type pairs
|
174
|
+
# if no rows specify them. feature must be in the same order
|
175
|
+
# as it appears in the dataset
|
170
176
|
#
|
171
|
-
# @note missing values are allowed
|
177
|
+
# @note missing values are allowed
|
172
178
|
#
|
173
|
-
def data_from_csv(fname=:stdin)
|
174
|
-
|
179
|
+
def data_from_csv(fname=:stdin, csv_opts = {})
|
180
|
+
ifs = get_ifs(fname)
|
175
181
|
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
182
|
+
csv_opts = { # default options, new opts will override old ones
|
183
|
+
:feature_name_row => 1, # first row contains feature names
|
184
|
+
:feature_type_row => 2, # second row contains feature types
|
185
|
+
:class_label_column => 0, # last column contains class labels
|
186
|
+
:feature_name2type => {} # user-supplied feature name-type pairs
|
187
|
+
}.merge(csv_opts) if csv_opts and csv_opts.class == Hash
|
188
|
+
|
189
|
+
feature_name_row = csv_opts[:feature_name_row]
|
190
|
+
feature_type_row = csv_opts[:feature_type_row]
|
191
|
+
class_label_column = csv_opts[:class_label_column]
|
192
|
+
feature_name2type = csv_opts[:feature_name2type]
|
193
|
+
|
194
|
+
# user-supplied feature name-type pairs, this is useful
|
195
|
+
# when file contains no specific rows for feature names and types
|
196
|
+
if feature_name2type and not feature_name2type.empty?
|
197
|
+
features = feature_name2type.keys.collect { |n| n.to_sym }
|
198
|
+
types = feature_name2type.values.collect { |t| t.downcase.to_sym }
|
199
|
+
# disable name and type rows
|
200
|
+
feature_name_row, feature_type_row = nil, nil
|
183
201
|
end
|
184
202
|
|
185
|
-
|
186
|
-
features, types = [], []
|
203
|
+
data = {}
|
187
204
|
|
188
205
|
ifs.each_line do |ln|
|
189
|
-
if
|
190
|
-
|
191
|
-
|
192
|
-
|
193
|
-
|
206
|
+
next if ln.blank?
|
207
|
+
|
208
|
+
cells = ln.chomp.split(/,/)
|
209
|
+
|
210
|
+
# feature names
|
211
|
+
if ifs.lineno == feature_name_row
|
212
|
+
class_name = cells.delete_at(class_label_column-1) # simply useless
|
213
|
+
features = cells.to_sym
|
214
|
+
# feature types
|
215
|
+
elsif ifs.lineno == feature_type_row
|
216
|
+
class_type = cells.delete_at(class_label_column-1) # simply useless
|
194
217
|
# store feature type as lower-case symbol
|
195
|
-
types =
|
196
|
-
if not types.size == features.size
|
197
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
198
|
-
" the first two rows must have same number of fields"
|
199
|
-
end
|
218
|
+
types = cells.collect { |t| t.downcase.to_sym }
|
200
219
|
else # data rows
|
201
|
-
|
202
|
-
|
203
|
-
|
220
|
+
if class_label_column <= cells.size and
|
221
|
+
features.size+1 == cells.size and
|
222
|
+
types.size+1 == cells.size
|
223
|
+
class_label = cells.delete_at(class_label_column-1).to_sym
|
224
|
+
data[class_label] ||= []
|
225
|
+
# feature values
|
226
|
+
fvs = cells
|
227
|
+
else
|
228
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
229
|
+
" invalid csv format!"
|
230
|
+
end
|
204
231
|
|
205
232
|
fs = {}
|
206
233
|
fvs.each_with_index do |v, i|
|
207
234
|
next if v.empty? # missing value
|
208
|
-
|
209
|
-
if
|
235
|
+
type = types[i]
|
236
|
+
if type == :integer
|
210
237
|
v = v.to_i
|
211
|
-
elsif [:real, :numeric, :continuous].include?
|
238
|
+
elsif [:real, :numeric, :continuous, :float, :double].include? type
|
212
239
|
v = v.to_f
|
213
|
-
elsif [:string, :nominal, :categorical].include?
|
240
|
+
elsif [:string, :nominal, :categorical].include? type
|
214
241
|
#
|
215
242
|
else
|
216
243
|
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
217
|
-
"
|
218
|
-
"
|
244
|
+
" invalid feature type '#{type}', must be one of the following: \n"+
|
245
|
+
" integer, real, numeric, float, double, continuous, categorical, string, nominal"
|
219
246
|
end
|
220
247
|
|
221
248
|
fs[features[i]] = v
|
222
249
|
end
|
223
250
|
|
224
|
-
data[
|
251
|
+
data[class_label] << fs
|
225
252
|
end
|
226
253
|
end
|
227
254
|
|
@@ -230,9 +257,10 @@ module FileIO
|
|
230
257
|
|
231
258
|
set_data(data)
|
232
259
|
set_features(features)
|
233
|
-
|
260
|
+
|
261
|
+
# feature name-type pairs
|
234
262
|
features.each_with_index do |f, i|
|
235
|
-
|
263
|
+
set_feature_type(f, types[i])
|
236
264
|
end
|
237
265
|
end # data_from_csv
|
238
266
|
|
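The default layout described above (names row, types row, then data rows with the class label in the last column) can be illustrated with a small standalone parser. This is a sketch under stated assumptions: `parse_labeled_csv` is a hypothetical helper, not the gem's `data_from_csv`, and it splits with a `-1` limit so trailing empty cells survive, which the code in the diff does not do. It does reuse the same indexing idea: a 1-based column number minus one, so the default `0` becomes index `-1`, i.e. the last column.

```ruby
require 'stringio'

# Hypothetical helper, not FSelector's API: parse CSV text laid out as
# a names row, a types row, then data rows.
# class_col is 1-based; 0 means "last column" because 0 - 1 == -1.
def parse_labeled_csv(io, class_col = 0)
  features, types = nil, nil
  data = {}

  io.each_line do |ln|
    next if ln.strip.empty?
    cells = ln.chomp.split(',', -1)   # -1 keeps trailing empty cells

    if features.nil?
      cells.delete_at(class_col - 1)  # drop the class name
      features = cells.map(&:to_sym)
    elsif types.nil?
      cells.delete_at(class_col - 1)  # drop the class type
      types = cells.map { |t| t.downcase.to_sym }
    else
      label = cells.delete_at(class_col - 1).to_sym
      sample = {}
      cells.each_with_index do |v, i|
        next if v.empty?              # missing value
        sample[features[i]] =
          case types[i]
          when :integer then v.to_i
          when :real, :numeric, :continuous, :float, :double then v.to_f
          else v
          end
      end
      (data[label] ||= []) << sample
    end
  end

  [features, types, data]
end

csv = StringIO.new(<<~CSV)
  sepal_len,petal_cnt,color,class
  REAL,INTEGER,NOMINAL,NOMINAL
  5.1,3,red,setosa
  6.0,,blue,virginica
CSV

features, types, data = parse_labeled_csv(csv)
# features is [:sepal_len, :petal_cnt, :color]
# data[:virginica] is [{:sepal_len => 6.0, :color => "blue"}]
```

Samples are grouped by class label, and a missing cell simply leaves that feature out of the sample hash, matching the "missing values are allowed" note in the diff.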
@@ -240,35 +268,30 @@ module FileIO
|
|
240
268
|
#
|
241
269
|
# write to csv
|
242
270
|
#
|
243
|
-
# file has the format
|
244
|
-
# specifying features and their data types
|
245
|
-
# and the remaing rows showing data
|
271
|
+
# output CSV file has the same format as the default input for data\_from\_csv()
|
246
272
|
#
|
247
273
|
# @param [String] fname file to write
|
248
|
-
# :stdout # write to standard ouput
|
274
|
+
# :stdout # write to standard ouput
|
249
275
|
#
|
250
276
|
def data_to_csv(fname=:stdout)
|
251
|
-
|
252
|
-
|
253
|
-
|
254
|
-
ofs = File.open(fname, 'w')
|
255
|
-
end
|
256
|
-
|
277
|
+
ofs = get_ofs(fname)
|
278
|
+
|
279
|
+
# feature names
|
257
280
|
ofs.puts get_features.join(',')
|
281
|
+
# feature types
|
258
282
|
ofs.puts get_features.collect { |f|
|
259
|
-
|
283
|
+
get_feature_type(f) || :string
|
260
284
|
}.join(',')
|
261
285
|
|
262
286
|
each_sample do |k, s|
|
263
|
-
ofs.print "#{k}"
|
264
287
|
each_feature do |f|
|
265
288
|
if s.has_key? f
|
266
|
-
ofs.print "
|
289
|
+
ofs.print "#{s[f]},"
|
267
290
|
else
|
268
291
|
ofs.print ","
|
269
292
|
end
|
270
293
|
end
|
271
|
-
ofs.puts
|
294
|
+
ofs.puts "#{k}"
|
272
295
|
end
|
273
296
|
|
274
297
|
# close file
|
@@ -279,25 +302,18 @@ module FileIO
|
|
279
302
|
#
|
280
303
|
# read from WEKA ARFF file
|
281
304
|
#
|
282
|
-
# @param [String] fname
|
283
|
-
# :stdin # read from standard input
|
305
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
306
|
+
# :stdin # read from standard input
|
284
307
|
# @note it's ok if string containes spaces quoted by quote_char
|
285
308
|
#
|
286
309
|
def data_from_weka(fname=:stdin, quote_char='"')
|
287
|
-
|
288
|
-
|
289
|
-
if fname == :stdin
|
290
|
-
ifs = $stdin
|
291
|
-
elsif not File.exists? fname
|
292
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
293
|
-
" File '#{fname}' does not exist!"
|
294
|
-
else
|
295
|
-
ifs = File.open(fname)
|
296
|
-
end
|
310
|
+
ifs = get_ifs(fname)
|
297
311
|
|
298
312
|
relation, features, classes, types, comments = '', [], [], [], []
|
299
313
|
has_class, has_data = false, false
|
300
314
|
|
315
|
+
data = {}
|
316
|
+
|
301
317
|
ifs.each_line do |ln|
|
302
318
|
next if ln.blank? # blank lines
|
303
319
|
|
@@ -312,7 +328,7 @@ module FileIO
|
|
312
328
|
# class attribute
|
313
329
|
elsif ln =~ /^@ATTRIBUTE\s+class\s+{(.+)}/i
|
314
330
|
has_class = true
|
315
|
-
classes = $1.split_me(/,\s*/, quote_char).to_sym
|
331
|
+
classes = $1.strip.split_me(/,\s*/, quote_char).to_sym
|
316
332
|
classes.each { |k| data[k] = [] }
|
317
333
|
# feature attribute (nominal)
|
318
334
|
elsif ln =~ /^@ATTRIBUTE\s+(\S+)\s+{(.+)}/i
|
@@ -320,7 +336,7 @@ module FileIO
|
|
320
336
|
features << f
|
321
337
|
#$2.split_me(/,\s*/, quote_char) # feature nominal values
|
322
338
|
types << :nominal
|
323
|
-
# feature attribute (integer, real, numeric, string, date)
|
339
|
+
# feature attribute (categorical, integer, real, continuous, numeric, float, double, string, date)
|
324
340
|
elsif ln =~ /^@ATTRIBUTE/i
|
325
341
|
tmp, v1, v2 = ln.split_me(/\s+/, quote_char)
|
326
342
|
f = v1.to_sym
|
@@ -376,10 +392,12 @@ module FileIO
|
|
376
392
|
set_data(data)
|
377
393
|
set_classes(classes)
|
378
394
|
set_features(features)
|
379
|
-
|
395
|
+
# feature name-type pairs
|
380
396
|
features.each_with_index do |f, i|
|
381
|
-
|
397
|
+
set_feature_type(f, types[i])
|
382
398
|
end
|
399
|
+
|
400
|
+
set_opt(:relation, relation)
|
383
401
|
set_opt(:comments, comments) if not comments.empty?
|
384
402
|
end # data_from_weka
|
385
403
|
|
@@ -388,16 +406,12 @@ module FileIO
|
|
388
406
|
# write to WEKA ARFF file
|
389
407
|
#
|
390
408
|
# @param [String] fname file to write
|
391
|
-
# :stdout # write to standard ouput
|
409
|
+
# :stdout # write to standard ouput
|
392
410
|
# @param [Symbol] format sparse or regular ARFF
|
393
411
|
# :sparse # sparse ARFF, otherwise regular ARFF
|
394
412
|
#
|
395
413
|
def data_to_weka(fname=:stdout, format=nil)
|
396
|
-
|
397
|
-
ofs = $stdout
|
398
|
-
else
|
399
|
-
ofs = File.open(fname, 'w')
|
400
|
-
end
|
414
|
+
ofs = get_ofs(fname)
|
401
415
|
|
402
416
|
# comments
|
403
417
|
comments = get_opt(:comments)
|
@@ -419,7 +433,7 @@ module FileIO
|
|
419
433
|
# feature attribute
|
420
434
|
each_feature do |f|
|
421
435
|
ofs.print "@ATTRIBUTE #{f} "
|
422
|
-
type =
|
436
|
+
type = get_feature_type(f)
|
423
437
|
if type
|
424
438
|
if type == :nominal
|
425
439
|
ofs.puts "{#{get_feature_values(f).uniq.sort.join(',')}}"
|
@@ -464,10 +478,73 @@ module FileIO
|
|
464
478
|
|
465
479
|
# close file
|
466
480
|
ofs.close if not ofs == $stdout
|
467
|
-
end
|
481
|
+
end # data_to_weka
|
482
|
+
|
483
|
+
|
484
|
+
# read data from url
|
485
|
+
#
|
486
|
+
# @param [String] url url of on-line dataset
|
487
|
+
# @param [Symbol] format allowed formats are:
|
488
|
+
# :libsvm # LibSVM file
|
489
|
+
# :csv # csv file
|
490
|
+
# :weka # Weka ARFF file
|
491
|
+
# @param [Any] args arguments associated with format
|
492
|
+
#
|
493
|
+
def data_from_url(url, format, *args)
|
494
|
+
format = format.downcase.to_sym
|
495
|
+
|
496
|
+
if not [:libsvm, :csv, :weka].include? format
|
497
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
498
|
+
" only CSV, LibSVM and Weka file formats are supported!"
|
499
|
+
end
|
500
|
+
|
501
|
+
uri = URI.parse(URI.encode(url))
|
502
|
+
|
503
|
+
data_src = StringIO.new(uri.read)
|
504
|
+
|
505
|
+
if format == :csv
|
506
|
+
data_from_csv(data_src, *args)
|
507
|
+
elsif format == :libsvm
|
508
|
+
data_from_libsvm(data_src)
|
509
|
+
else # weka
|
510
|
+
data_from_weka(data_src, *args)
|
511
|
+
end
|
512
|
+
end # data_from_url
|
468
513
|
|
469
514
|
private
|
470
515
|
|
516
|
+
# get the input file handler
|
517
|
+
def get_ifs(fname)
|
518
|
+
# read from standard input by default
|
519
|
+
if fname == :stdin
|
520
|
+
ifs = $stdin
|
521
|
+
# read from string if it is a StringIO
|
522
|
+
elsif fname.class == StringIO
|
523
|
+
ifs = fname
|
524
|
+
# read from file if file exists
|
525
|
+
elsif File.exists? fname
|
526
|
+
ifs = File.open(fname)
|
527
|
+
else
|
528
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
529
|
+
" invalid data source!"
|
530
|
+
end
|
531
|
+
|
532
|
+
ifs
|
533
|
+
end
|
534
|
+
|
535
|
+
|
536
|
+
# get the ouput file handler
|
537
|
+
def get_ofs(fname)
|
538
|
+
if fname == :stdout
|
539
|
+
ofs = $stdout
|
540
|
+
else
|
541
|
+
ofs = File.open(fname, 'w')
|
542
|
+
end
|
543
|
+
|
544
|
+
ofs
|
545
|
+
end
|
546
|
+
|
547
|
+
|
471
548
|
# handle and add each feature for WEKA format
|
472
549
|
#
|
473
550
|
# @param [Hash] fs sample that stores feature and its value
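The new `get_ifs` helper lets every reader accept standard input, an in-memory `StringIO`, or a file name. The sketch below shows the same dispatch with a hypothetical helper `open_input`. Two deliberate differences from the diff are assumptions of this sketch: it raises `ArgumentError` instead of calling `abort`, and it uses `File.exist?`, since `File.exists?` is no longer available in recent Ruby versions.

```ruby
require 'stringio'

# Hypothetical helper modeled on the idea of get_ifs; not FSelector's code.
# Returns an IO-like object that responds to each_line.
def open_input(src)
  if src == :stdin
    $stdin                       # read from standard input
  elsif src.is_a?(StringIO)
    src                          # already an in-memory stream
  elsif src.is_a?(String) && File.exist?(src)
    File.open(src)               # read from an existing file
  else
    raise ArgumentError, "invalid data source: #{src.inspect}"
  end
end

# LibSVM-style text held in memory, e.g. after downloading it.
io = open_input(StringIO.new("+1 1:0.5 2:1\n-1 2:0.3\n"))
labels = io.each_line.map { |ln| ln.split(/\s+/).first }
# labels is ["+1", "-1"]
```

Accepting a `StringIO` is what makes `data_from_url` cheap to add: the downloaded body is wrapped once and handed to the existing format readers unchanged.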
|
@@ -480,14 +557,16 @@ module FileIO
|
|
480
557
|
return
|
481
558
|
elsif type == :integer
|
482
559
|
fs[f] = v.to_i
|
483
|
-
elsif
|
560
|
+
elsif [:real, :numeric, :float, :double, :continuous].include? type
|
484
561
|
fs[f] = v.to_f
|
485
|
-
elsif
|
562
|
+
elsif [:categorical, :string, :nominal].include? type
|
486
563
|
fs[f] = v
|
487
564
|
elsif type == :date # convert into integer
|
488
565
|
fs[f] = (DateTime.parse(v)-DateTime.new(1970,1,1)).to_i
|
489
566
|
else
|
490
|
-
|
567
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
568
|
+
" invalid feature type '#{type}', must be one of the following: \n"+
|
569
|
+
" integer, real, numeric, float, double, continuous, categorical, string, nominal, date"
|
491
570
|
end
|
492
571
|
end # add_feature_weka
|
493
572
|
|
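The type handling in `add_feature_weka` amounts to a small coercion table. A minimal sketch follows, assuming a hypothetical function `coerce_value` that raises instead of aborting. Note that subtracting two `DateTime` objects yields a number of days, so the `:date` branch produces whole days since 1970-01-01, not seconds.

```ruby
require 'date'

# Hypothetical helper mirroring the type branches shown in the diff.
def coerce_value(v, type)
  case type
  when :integer
    v.to_i
  when :real, :numeric, :float, :double, :continuous
    v.to_f
  when :categorical, :string, :nominal
    v                                   # keep the raw string
  when :date
    # DateTime difference is a Rational number of days
    (DateTime.parse(v) - DateTime.new(1970, 1, 1)).to_i
  else
    raise ArgumentError, "invalid feature type '#{type}'"
  end
end

coerce_value('42', :integer)        # 42
coerce_value('1.5', :double)        # 1.5
coerce_value('red', :nominal)       # "red"
coerce_value('1970-01-11', :date)   # 10
```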
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: fselector
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.3.0
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-05-
|
12
|
+
date: 2012-05-24 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: rinruby
|
16
|
-
requirement: &
|
16
|
+
requirement: &24934116 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,7 +21,7 @@ dependencies:
|
|
21
21
|
version: 2.0.2
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *24934116
|
25
25
|
description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
|
26
26
|
algorithms and related functions into one single package. Welcome to contact me
|
27
27
|
(need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
|
@@ -48,6 +48,9 @@ files:
|
|
48
48
|
- lib/fselector/algo_base/base_discrete.rb
|
49
49
|
- lib/fselector/algo_base/base_Relief.rb
|
50
50
|
- lib/fselector/algo_base/base_ReliefF.rb
|
51
|
+
- lib/fselector/algo_both/LasVegasFilter.rb
|
52
|
+
- lib/fselector/algo_both/LasVegasIncremental.rb
|
53
|
+
- lib/fselector/algo_both/Random.rb
|
51
54
|
- lib/fselector/algo_continuous/BSS_WSS.rb
|
52
55
|
- lib/fselector/algo_continuous/CFS_c.rb
|
53
56
|
- lib/fselector/algo_continuous/F-Test.rb
|
@@ -75,8 +78,6 @@ files:
|
|
75
78
|
- lib/fselector/algo_discrete/INTERACT.rb
|
76
79
|
- lib/fselector/algo_discrete/J-Measure.rb
|
77
80
|
- lib/fselector/algo_discrete/KL-Divergence.rb
|
78
|
-
- lib/fselector/algo_discrete/LasVegasFilter.rb
|
79
|
-
- lib/fselector/algo_discrete/LasVegasIncremental.rb
|
80
81
|
- lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb
|
81
82
|
- lib/fselector/algo_discrete/McNemarsTest.rb
|
82
83
|
- lib/fselector/algo_discrete/MutualInformation.rb
|
@@ -85,7 +86,6 @@ files:
|
|
85
86
|
- lib/fselector/algo_discrete/Power.rb
|
86
87
|
- lib/fselector/algo_discrete/Precision.rb
|
87
88
|
- lib/fselector/algo_discrete/ProbabilityRatio.rb
|
88
|
-
- lib/fselector/algo_discrete/Random.rb
|
89
89
|
- lib/fselector/algo_discrete/ReliefF_d.rb
|
90
90
|
- lib/fselector/algo_discrete/Relief_d.rb
|
91
91
|
- lib/fselector/algo_discrete/Sensitivity.rb
|