fselector 1.2.0 → 1.3.0
- data/ChangeLog +7 -0
- data/README.md +53 -52
- data/lib/fselector.rb +10 -1
- data/lib/fselector/algo_base/base.rb +62 -9
- data/lib/fselector/algo_base/base_CFS.rb +0 -9
- data/lib/fselector/algo_base/base_ReliefF.rb +0 -8
- data/lib/fselector/algo_base/base_discrete.rb +0 -8
- data/lib/fselector/{algo_discrete → algo_both}/LasVegasFilter.rb +2 -2
- data/lib/fselector/{algo_discrete → algo_both}/LasVegasIncremental.rb +2 -2
- data/lib/fselector/{algo_discrete → algo_both}/Random.rb +3 -2
- data/lib/fselector/algo_continuous/ReliefF_c.rb +0 -8
- data/lib/fselector/algo_continuous/Relief_c.rb +0 -8
- data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +0 -8
- data/lib/fselector/algo_discrete/InformationGain.rb +1 -9
- data/lib/fselector/discretizer.rb +6 -1
- data/lib/fselector/ensemble.rb +3 -3
- data/lib/fselector/fileio.rb +181 -102
- metadata +7 -7
data/ChangeLog
CHANGED
@@ -1,3 +1,10 @@
|
|
1
|
+
2012-05-24 version 1.3.0
|
2
|
+
|
3
|
+
* update clear\_vars() in Base by use of Ruby metaprogramming, this trick avoids repetitive overriding it in each derived subclass
|
4
|
+
* re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to dataset with either discrete or continuous features, even with mixed type
|
5
|
+
* update data\_from\_csv() so that it can read CSV file more flexibly. note by default, the last column is class label
|
6
|
+
* add data\_from\_url() to read on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a url
|
7
|
+
|
1
8
|
2012-05-20 version 1.2.0
|
2
9
|
|
3
10
|
* add KS-Test algorithm for continuous feature
|
data/README.md
CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
|
|
8
8
|
**Email**: [need47@gmail.com](mailto:need47@gmail.com)
|
9
9
|
**Copyright**: 2012
|
10
10
|
**License**: MIT License
|
11
|
-
**Latest Version**: 1.
|
12
|
-
**Release Date**: 2012-05-
|
11
|
+
**Latest Version**: 1.3.0
|
12
|
+
**Release Date**: 2012-05-24
|
13
13
|
|
14
14
|
Synopsis
|
15
15
|
--------
|
@@ -38,56 +38,57 @@ Feature List
|
|
38
38
|
- csv
|
39
39
|
- libsvm
|
40
40
|
- weka ARFF
|
41
|
+
- on-line dataset in one of the above three formats (read only)
|
41
42
|
- random data (read only, for test purpose)
|
42
43
|
|
43
44
|
**2. available feature selection/ranking algorithms**
|
44
45
|
|
45
|
-
algorithm shortcut algo_type feature_type
|
46
|
-
|
47
|
-
Accuracy Acc weighting
|
48
|
-
AccuracyBalanced Acc2 weighting
|
49
|
-
BiNormalSeparation BNS weighting
|
50
|
-
CFS_d CFS_d subset
|
51
|
-
ChiSquaredTest CHI weighting
|
52
|
-
CorrelationCoefficient CC weighting
|
53
|
-
DocumentFrequency DF weighting
|
54
|
-
F1Measure F1 weighting
|
55
|
-
FishersExactTest FET weighting
|
56
|
-
FastCorrelationBasedFilter FCBF subset
|
57
|
-
GiniIndex GI weighting
|
58
|
-
GMean GM weighting
|
59
|
-
GSSCoefficient GSS weighting
|
60
|
-
InformationGain IG weighting
|
61
|
-
INTERACT INTERACT subset
|
62
|
-
JMeasure JM weighting
|
63
|
-
KLDivergence KLD weighting
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
Recall
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
+ algorithm                        shortcut    algo_type  applicability  feature_type
+ --------------------------------------------------------------------------------------------------
+ Accuracy                         Acc         weighting  multi-class    discrete
+ AccuracyBalanced                 Acc2        weighting  multi-class    discrete
+ BiNormalSeparation               BNS         weighting  multi-class    discrete
+ CFS_d                            CFS_d       subset     multi-class    discrete
+ ChiSquaredTest                   CHI         weighting  multi-class    discrete
+ CorrelationCoefficient           CC          weighting  multi-class    discrete
+ DocumentFrequency                DF          weighting  multi-class    discrete
+ F1Measure                        F1          weighting  multi-class    discrete
+ FishersExactTest                 FET         weighting  multi-class    discrete
+ FastCorrelationBasedFilter       FCBF        subset     multi-class    discrete
+ GiniIndex                        GI          weighting  multi-class    discrete
+ GMean                            GM          weighting  multi-class    discrete
+ GSSCoefficient                   GSS         weighting  multi-class    discrete
+ InformationGain                  IG          weighting  multi-class    discrete
+ INTERACT                         INTERACT    subset     multi-class    discrete
+ JMeasure                         JM          weighting  multi-class    discrete
+ KLDivergence                     KLD         weighting  multi-class    discrete
+ MatthewsCorrelationCoefficient   MCC, PHI    weighting  multi-class    discrete
+ McNemarsTest                     MNT         weighting  multi-class    discrete
+ OddsRatio                        OR          weighting  multi-class    discrete
+ OddsRatioNumerator               ORN         weighting  multi-class    discrete
+ PhiCoefficient                   PHI         weighting  multi-class    discrete
+ Power                            Power       weighting  multi-class    discrete
+ Precision                        Precision   weighting  multi-class    discrete
+ ProbabilityRatio                 PR          weighting  multi-class    discrete
+ Recall                           Recall      weighting  multi-class    discrete
+ Relief_d                         Relief_d    weighting  two-class      discrete
+ ReliefF_d                        ReliefF_d   weighting  multi-class    discrete
+ Sensitivity                      SN, Recall  weighting  multi-class    discrete
+ Specificity                      SP          weighting  multi-class    discrete
+ SymmetricalUncertainty           SU          weighting  multi-class    discrete
+ BetweenWithinClassesSumOfSquare  BSS_WSS     weighting  multi-class    continuous
+ CFS_c                            CFS_c       subset     multi-class    continuous
+ FTest                            FT          weighting  multi-class    continuous
+ KS_CCBF                          KS_CCBF     subset     multi-class    continuous
+ KSTest                           KST         weighting  two-class      continuous
+ PMetric                          PM          weighting  two-class      continuous
+ Relief_c                         Relief_c    weighting  two-class      continuous
+ ReliefF_c                        ReliefF_c   weighting  multi-class    continuous
+ TScore                           TS          weighting  two-class      continuous
+ WilcoxonRankSum                  WRS         weighting  two-class      continuous
+ LasVegasFilter                   LVF         subset     multi-class    discrete, continuous, mixed
+ LasVegasIncremental              LVI         subset     multi-class    discrete, continuous, mixed
+ Random                           Random      weighting  multi-class    discrete, continuous, mixed
|
91
92
|
|
92
93
|
**note for feature selection interface:**
|
93
94
|
there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
|
@@ -132,7 +133,7 @@ To install FSelector, use the following command:
|
|
132
133
|
|
133
134
|
$ gem install fselector
|
134
135
|
|
135
|
-
**note:**
|
136
|
+
**note:** From version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
|
136
137
|
as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
|
137
138
|
which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
|
138
139
|
on statistical test. To this end, please pre-install the R package. RinRuby should have been
|
@@ -194,7 +195,7 @@ Usage
|
|
194
195
|
|
195
196
|
# creating an ensemble of feature selectors by using
|
196
197
|
# a single feature selection algorithm (INTERACT)
|
197
|
-
# by instance perturbation (e.g.
|
198
|
+
# by instance perturbation (e.g. random sampling)
|
198
199
|
|
199
200
|
# test for the type of feature subset selection algorithms
|
200
201
|
r = FSelector::INTERACT.new(0.0001)
|
@@ -258,8 +259,8 @@ Usage
|
|
258
259
|
# the Information Gain (IG) algorithm requires data with discrete feature
|
259
260
|
r = FSelector::IG.new
|
260
261
|
|
261
|
-
# but the Iris data set contains continuous features
|
262
|
-
r.
|
262
|
+
# but the Iris data set contains continuous features
|
263
|
+
r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
|
263
264
|
|
264
265
|
# let's first discretize it by ChiMerge algorithm at alpha=0.10
|
265
266
|
# then perform feature selection as usual
|
data/lib/fselector.rb
CHANGED
@@ -7,7 +7,7 @@ R.eval 'options(warn = -1)' # suppress R warnings
|
|
7
7
|
#
|
8
8
|
module FSelector
|
9
9
|
# module version
|
10
|
-
VERSION = '1.
|
10
|
+
VERSION = '1.3.0'
|
11
11
|
end
|
12
12
|
|
13
13
|
# the root dir of FSelector
|
@@ -52,6 +52,15 @@ Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
|
|
52
52
|
require f
|
53
53
|
end
|
54
54
|
|
55
|
+
|
56
|
+
#
|
57
|
+
# algorithms for handling both discrete and continuous feature
|
58
|
+
#
|
59
|
+
Dir.glob("#{ROOT}/fselector/algo_both/*").each do |f|
|
60
|
+
require f
|
61
|
+
end
|
62
|
+
|
63
|
+
|
55
64
|
#
|
56
65
|
# feature selection use an ensemble of algorithms
|
57
66
|
#
|
data/lib/fselector/algo_base/base.rb
CHANGED
@@ -192,6 +192,34 @@ module FSelector
|
|
192
192
|
end
|
193
193
|
|
194
194
|
|
195
|
+
#
|
196
|
+
# get the feature type stored in @types
|
197
|
+
#
|
198
|
+
# @param [Symbol] feature feature of interest
|
199
|
+
# return all feature name-type pairs if nil,
|
200
|
+
# otherwise return the type for the feature of interest
|
201
|
+
#
|
202
|
+
def get_feature_type(feature=nil)
|
203
|
+
if @types
|
204
|
+
feature ? @types[feature] : @types
|
205
|
+
else
|
206
|
+
nil
|
207
|
+
end
|
208
|
+
end
|
209
|
+
|
210
|
+
|
211
|
+
#
|
212
|
+
# set feature name-type pair
|
213
|
+
#
|
214
|
+
# @param [Symbol] feature feature name
|
215
|
+
# @param [Symbol] type feature type
|
216
|
+
#
|
217
|
+
def set_feature_type(feature, type)
|
218
|
+
@types ||= {}
|
219
|
+
@types[feature] = type
|
220
|
+
end
|
221
|
+
|
222
|
+
|
195
223
|
#
|
196
224
|
# get internal data
|
197
225
|
#
|
@@ -241,7 +269,11 @@ module FSelector
|
|
241
269
|
# @note return all non-data as a Hash if key == nil
|
242
270
|
#
|
243
271
|
def get_opt(key=nil)
|
244
|
-
|
272
|
+
if @opts
|
273
|
+
key ? @opts[key] : @opts
|
274
|
+
else
|
275
|
+
nil
|
276
|
+
end
|
245
277
|
end
|
246
278
|
|
247
279
|
|
@@ -404,17 +436,38 @@ module FSelector
|
|
404
436
|
private
|
405
437
|
|
406
438
|
#
|
407
|
-
# clear variables when data structure is altered,
|
408
|
-
# useful when data structure has changed while
|
409
|
-
#
|
439
|
+
# clear instance variables when data structure is altered,
|
440
|
+
# this is useful when data structure has changed while you
|
441
|
+
# still want to use the same instance
|
410
442
|
#
|
411
|
-
# @note the variables
|
412
|
-
#
|
443
|
+
# @note only the instance variables used in the initialize()
|
444
|
+
# such as @data will be retained. class instance variable
|
445
|
+
# @algo_type will also be retained. This trick by use of
|
446
|
+
# Ruby metaprogramming avoids the repetitive overriding of
|
447
|
+
# clear_vars() in each derived subclass
|
413
448
|
#
|
414
449
|
def clear_vars
|
415
|
-
|
416
|
-
|
417
|
-
|
450
|
+
# instance vars appeared as arguments (with the same name) in initialize()
|
451
|
+
instance_var_in_new = []
|
452
|
+
|
453
|
+
constructor = method(:initialize)
|
454
|
+
if constructor.respond_to? :parameters
|
455
|
+
constructor.parameters.each do |p|
|
456
|
+
instance_var_in_new << "@#{p[1]}".to_sym
|
457
|
+
end
|
458
|
+
end
|
459
|
+
|
460
|
+
instance_variables.each do |var|
|
461
|
+
# retain instance vars appeared as arguments in initialize()
|
462
|
+
# such as @data
|
463
|
+
next if instance_var_in_new.include? var
|
464
|
+
# retain feature types, which may be needed
|
465
|
+
# by CSV and Weka ARFF file
|
466
|
+
next if var == :@types
|
467
|
+
|
468
|
+
# clear all other instance variable
|
469
|
+
instance_variable_set(var, nil)
|
470
|
+
end
|
418
471
|
end
|
419
472
|
|
420
473
|
|
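The new clear\_vars() in the hunk above uses `Method#parameters` to find out which instance variables mirror the arguments of initialize(), and resets everything else. The following is a minimal stand-alone sketch of that idea, not the gem's code: `SketchBase` and `SketchIG` are hypothetical classes, and the extra `@types` exception from the hunk is left out.

```ruby
# Hypothetical classes for illustration only; they are not part of FSelector.
class SketchBase
  def initialize(data = nil)
    @data = data
  end

  # Reset every instance variable except those that share a name
  # with a parameter of initialize().
  def clear_vars
    # parameters returns pairs such as [:opt, :data]; keep "@data"
    keep = method(:initialize).parameters.map { |_kind, name| :"@#{name}" }

    instance_variables.each do |var|
      next if keep.include?(var)
      instance_variable_set(var, nil)
    end
  end
end

class SketchIG < SketchBase
  def initialize(data = nil)
    super
    @hc = 0.97 # stands in for a cached intermediate result
  end
end

algo = SketchIG.new({ :a => [] })
algo.clear_vars # @data is retained, @hc is reset to nil
```

Under this scheme a subclass no longer needs its own clear\_vars(), provided its cached state lives in instance variables whose names differ from the initialize() parameters; a variable named after a parameter would be retained.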
@@ -142,15 +142,6 @@ module FSelector
|
|
142
142
|
end # do_rff
|
143
143
|
|
144
144
|
|
145
|
-
# override clear\_vars for BaseCFS
|
146
|
-
def clear_vars
|
147
|
-
super
|
148
|
-
|
149
|
-
@rcf_best, @rff_best = nil, nil
|
150
|
-
@f2rcf, @fs2rff, @f2idx = nil, nil, nil
|
151
|
-
end # clear_vars
|
152
|
-
|
153
|
-
|
154
145
|
end # class
|
155
146
|
|
156
147
|
|
@@ -3,14 +3,14 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Las Vegas Filter (LVF) for discrete feature,
|
6
|
+
# Las Vegas Filter (LVF) for discrete, continuous or mixed feature,
|
7
7
|
# use **select\_feature!** for feature selection
|
8
8
|
#
|
9
9
|
# @note we only keep one of the equivalently good solutions
|
10
10
|
#
|
11
11
|
# ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
|
12
12
|
#
|
13
|
-
class LasVegasFilter <
|
13
|
+
class LasVegasFilter < Base
|
14
14
|
# include module
|
15
15
|
include Consistency
|
16
16
|
|
@@ -3,12 +3,12 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Las Vegas Incremental (LVI) for discrete feature,
|
6
|
+
# Las Vegas Incremental (LVI) for discrete, continuous or mixed feature,
|
7
7
|
# use **select\_feature!** for feature selection
|
8
8
|
#
|
9
9
|
# ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
|
10
10
|
#
|
11
|
-
class LasVegasIncremental <
|
11
|
+
class LasVegasIncremental < Base
|
12
12
|
# include module
|
13
13
|
include Consistency
|
14
14
|
|
@@ -3,13 +3,14 @@
|
|
3
3
|
#
|
4
4
|
module FSelector
|
5
5
|
#
|
6
|
-
# Random (Rand)
|
6
|
+
# Random (Rand) for discrete, continuous or mixed feature,
|
7
|
+
# no practical use but can be used as a baseline
|
7
8
|
#
|
8
9
|
# Rand = rand numbers within [0..1)
|
9
10
|
#
|
10
11
|
# ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
|
11
12
|
#
|
12
|
-
class Random <
|
13
|
+
class Random < Base
|
13
14
|
# this algo outputs weight for each feature
|
14
15
|
@algo_type = :feature_weighting
|
15
16
|
|
data/lib/fselector/algo_discrete/InformationGain.rb
CHANGED
@@ -41,15 +41,7 @@ module FSelector
|
|
41
41
|
s = @hc - hcf
|
42
42
|
|
43
43
|
set_feature_score(f, :BEST, s)
|
44
|
-
end # calc_contribution
|
45
|
-
|
46
|
-
|
47
|
-
# override clear\_vars for InformationGain
|
48
|
-
def clear_vars
|
49
|
-
super
|
50
|
-
|
51
|
-
@hc = nil
|
52
|
-
end # clear_vars
|
44
|
+
end # calc_contribution
|
53
45
|
|
54
46
|
|
55
47
|
end # class
|
data/lib/fselector/discretizer.rb
CHANGED
@@ -474,8 +474,13 @@ module Discretizer
|
|
474
474
|
end
|
475
475
|
end
|
476
476
|
|
477
|
-
# clear vars
|
477
|
+
# clear vars because of data change
|
478
478
|
clear_vars
|
479
|
+
|
480
|
+
# set all feature type as CATEGORICAL
|
481
|
+
each_feature do |f|
|
482
|
+
set_feature_type(f, :categorical)
|
483
|
+
end
|
479
484
|
end # discretize_at_cutpoints!
|
480
485
|
|
481
486
|
|
data/lib/fselector/ensemble.rb
CHANGED
@@ -258,6 +258,7 @@ module FSelector
|
|
258
258
|
@algo_type = algo.algo_type
|
259
259
|
end
|
260
260
|
|
261
|
+
private
|
261
262
|
|
262
263
|
#
|
263
264
|
# get ensemble feature scores
|
@@ -304,7 +305,6 @@ module FSelector
|
|
304
305
|
#pp ensem_ranks
|
305
306
|
end # get_ensemble_ranks
|
306
307
|
|
307
|
-
private
|
308
308
|
|
309
309
|
#
|
310
310
|
# override get\_feature\_subset() for EnsembleSingle,
|
@@ -417,6 +417,8 @@ module FSelector
|
|
417
417
|
end
|
418
418
|
end
|
419
419
|
|
420
|
+
private
|
421
|
+
|
420
422
|
#
|
421
423
|
# get ensemble feature scores
|
422
424
|
#
|
@@ -455,8 +457,6 @@ module FSelector
|
|
455
457
|
end # get_ensemble_ranks
|
456
458
|
|
457
459
|
|
458
|
-
private
|
459
|
-
|
460
460
|
#
|
461
461
|
# override get\_feature\_subset() for EnsembleMultiple,
|
462
462
|
# select a subset of features based on frequency count
|
data/lib/fselector/fileio.rb
CHANGED
@@ -20,6 +20,11 @@
|
|
20
20
|
# @note class labels and features are treated as symbols
|
21
21
|
#
|
22
22
|
module FileIO
|
23
|
+
# require the open-uri lib for http/https/ftp request
|
24
|
+
require 'open-uri'
|
25
|
+
# require the stringio lib for converting a string into stringio object
|
26
|
+
require 'stringio'
|
27
|
+
|
23
28
|
#
|
24
29
|
# read from random data (read only, for test purpose)
|
25
30
|
#
|
@@ -76,20 +81,13 @@ module FileIO
|
|
76
81
|
# -1 3:1 4:1 ...
|
77
82
|
# ....
|
78
83
|
#
|
79
|
-
# @param [String] fname
|
84
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
80
85
|
# :stdin # read from standard input instead of file
|
81
86
|
#
|
82
|
-
def data_from_libsvm(fname=:stdin)
|
83
|
-
|
87
|
+
def data_from_libsvm(fname=:stdin)
|
88
|
+
ifs = get_ifs(fname)
|
84
89
|
|
85
|
-
|
86
|
-
ifs = $stdin
|
87
|
-
elsif not File.exists? fname
|
88
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
89
|
-
" File '#{fname}' does not exist!"
|
90
|
-
else
|
91
|
-
ifs = File.open(fname)
|
92
|
-
end
|
90
|
+
data = {}
|
93
91
|
|
94
92
|
ifs.each_line do |ln|
|
95
93
|
label, *features = ln.chomp.split(/\s+/)
|
@@ -116,14 +114,10 @@ module FileIO
|
|
116
114
|
# write to libsvm
|
117
115
|
#
|
118
116
|
# @param [String] fname file to write
|
119
|
-
# :stdout # write to standard output
|
117
|
+
# :stdout # write to standard output
|
120
118
|
#
|
121
119
|
def data_to_libsvm(fname=:stdout)
|
122
|
-
|
123
|
-
ofs = $stdout
|
124
|
-
else
|
125
|
-
ofs = File.open(fname, 'w')
|
126
|
-
end
|
120
|
+
ofs = get_ofs(fname)
|
127
121
|
|
128
122
|
# convert class label to integer type
|
129
123
|
k2idx = {}
|
@@ -153,75 +147,108 @@ module FileIO
|
|
153
147
|
#
|
154
148
|
# read from csv
|
155
149
|
#
|
156
|
-
#
|
157
|
-
#
|
158
|
-
#
|
159
|
-
#
|
150
|
+
# if no csv_opts supplied, we assume the CSV file in the following format:
|
151
|
+
# first row contains feature names with last column being class name
|
152
|
+
#
|
153
|
+
# feat_name1,feat_name2,...,class_name
|
154
|
+
#
|
155
|
+
# second row contains feature types with last column being class type
|
156
|
+
#
|
157
|
+
# feat_type1,feat_type2,...,class_type
|
160
158
|
#
|
161
|
-
# and the remaing rows
|
162
|
-
#
|
163
|
-
#
|
159
|
+
# and the remaining rows containing feature values with last column being class label
|
160
|
+
#
|
161
|
+
# feat_value1,feat_value2,...,class_label
|
162
|
+
# ...
|
164
163
|
#
|
165
164
|
# allowed feature types (case-insensitive) are:
|
166
|
-
# INTEGER, REAL, NUMERIC, CONTINUOUS, STRING, NOMINAL, CATEGORICAL
|
165
|
+
# INTEGER, REAL, NUMERIC, CONTINUOUS, DOUBLE, FLOAT, STRING, NOMINAL, CATEGORICAL
|
167
166
|
#
|
168
|
-
# @param [String] fname
|
169
|
-
# :stdin # read from standard input
|
167
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
168
|
+
# :stdin # read from standard input
|
169
|
+
# @param [Hash] csv_opts named arguments for csv options
|
170
|
+
# :feature\_name\_row => 1, # (first) row that contains feature names
|
171
|
+
# :feature\_type\_row => 2, # (second) row that contains feature types
|
172
|
+
# :class\_label\_column => 0 # (last) column that contains class labels
|
173
|
+
# :feature\_name2type => {}, # user-supplied hash containing feature name-type pairs
|
174
|
+
# if no rows specify them. feature must be in the same order
|
175
|
+
# as it appears in the dataset
|
170
176
|
#
|
171
|
-
# @note missing values are allowed
|
177
|
+
# @note missing values are allowed
|
172
178
|
#
|
173
|
-
def data_from_csv(fname=:stdin)
|
174
|
-
|
179
|
+
def data_from_csv(fname=:stdin, csv_opts = {})
|
180
|
+
ifs = get_ifs(fname)
|
175
181
|
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
182
|
+
csv_opts = { # default options, new opts will override old ones
|
183
|
+
:feature_name_row => 1, # first row contains feature names
|
184
|
+
:feature_type_row => 2, # second row contains feature types
|
185
|
+
:class_label_column => 0, # last column contains class labels
|
186
|
+
:feature_name2type => {} # user-supplied feature name-type pairs
|
187
|
+
}.merge(csv_opts) if csv_opts and csv_opts.class == Hash
|
188
|
+
|
189
|
+
feature_name_row = csv_opts[:feature_name_row]
|
190
|
+
feature_type_row = csv_opts[:feature_type_row]
|
191
|
+
class_label_column = csv_opts[:class_label_column]
|
192
|
+
feature_name2type = csv_opts[:feature_name2type]
|
193
|
+
|
194
|
+
# user-supplied feature name-type pairs, this is useful
|
195
|
+
# when file contains no specific rows for feature names and types
|
196
|
+
if feature_name2type and not feature_name2type.empty?
|
197
|
+
features = feature_name2type.keys.collect { |n| n.to_sym }
|
198
|
+
types = feature_name2type.values.collect { |t| t.downcase.to_sym }
|
199
|
+
# disable name and type rows
|
200
|
+
feature_name_row, feature_type_row = nil, nil
|
183
201
|
end
|
184
202
|
|
185
|
-
|
186
|
-
features, types = [], []
|
203
|
+
data = {}
|
187
204
|
|
188
205
|
ifs.each_line do |ln|
|
189
|
-
if
|
190
|
-
|
191
|
-
|
192
|
-
|
193
|
-
|
206
|
+
next if ln.blank?
|
207
|
+
|
208
|
+
cells = ln.chomp.split(/,/)
|
209
|
+
|
210
|
+
# feature names
|
211
|
+
if ifs.lineno == feature_name_row
|
212
|
+
class_name = cells.delete_at(class_label_column-1) # simply useless
|
213
|
+
features = cells.to_sym
|
214
|
+
# feature types
|
215
|
+
elsif ifs.lineno == feature_type_row
|
216
|
+
class_type = cells.delete_at(class_label_column-1) # simply useless
|
194
217
|
# store feature type as lower-case symbol
|
195
|
-
types =
|
196
|
-
if not types.size == features.size
|
197
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
198
|
-
" the first two rows must have same number of fields"
|
199
|
-
end
|
218
|
+
types = cells.collect { |t| t.downcase.to_sym }
|
200
219
|
else # data rows
|
201
|
-
|
202
|
-
|
203
|
-
|
220
|
+
if class_label_column <= cells.size and
|
221
|
+
features.size+1 == cells.size and
|
222
|
+
types.size+1 == cells.size
|
223
|
+
class_label = cells.delete_at(class_label_column-1).to_sym
|
224
|
+
data[class_label] ||= []
|
225
|
+
# feature values
|
226
|
+
fvs = cells
|
227
|
+
else
|
228
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
229
|
+
" invalid csv format!"
|
230
|
+
end
|
204
231
|
|
205
232
|
fs = {}
|
206
233
|
fvs.each_with_index do |v, i|
|
207
234
|
next if v.empty? # missing value
|
208
|
-
|
209
|
-
if
|
235
|
+
type = types[i]
|
236
|
+
if type == :integer
|
210
237
|
v = v.to_i
|
211
|
-
elsif [:real, :numeric, :continuous].include?
|
238
|
+
elsif [:real, :numeric, :continuous, :float, :double].include? type
|
212
239
|
v = v.to_f
|
213
|
-
elsif [:string, :nominal, :categorical].include?
|
240
|
+
elsif [:string, :nominal, :categorical].include? type
|
214
241
|
#
|
215
242
|
else
|
216
243
|
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
217
|
-
"
|
218
|
-
"
|
244
|
+
" invalid feature type '#{type}', must be one of the following: \n"+
|
245
|
+
" integer, real, numeric, float, double, continuous, categorical, string, nominal"
|
219
246
|
end
|
220
247
|
|
221
248
|
fs[features[i]] = v
|
222
249
|
end
|
223
250
|
|
224
|
-
data[
|
251
|
+
data[class_label] << fs
|
225
252
|
end
|
226
253
|
end
|
227
254
|
|
@@ -230,9 +257,10 @@ module FileIO
|
|
230
257
|
|
231
258
|
set_data(data)
|
232
259
|
set_features(features)
|
233
|
-
|
260
|
+
|
261
|
+
# feature name-type pairs
|
234
262
|
features.each_with_index do |f, i|
|
235
|
-
|
263
|
+
set_feature_type(f, types[i])
|
236
264
|
end
|
237
265
|
end # data_from_csv
|
238
266
|
|
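In the data\_from\_csv() hunk above, rows are identified by `ifs.lineno`, and the class label is removed with `cells.delete_at(class_label_column-1)`. With the default `:class_label_column => 0` the index becomes `-1`, so the last cell is taken. A minimal sketch of that row and column handling, assuming a hypothetical stand-alone function `read_csv_sketch` and omitting type conversion and error handling:

```ruby
require 'stringio'

# Hypothetical reader for illustration; names and return value are assumptions.
def read_csv_sketch(io, opts = {})
  opts = { :feature_name_row => 1, :feature_type_row => 2,
           :class_label_column => 0 }.merge(opts)
  col = opts[:class_label_column] - 1 # 0 - 1 == -1, i.e. the last cell

  features, types, data = [], [], {}
  io.each_line do |ln|
    next if ln.strip.empty?
    cells = ln.chomp.split(',', -1) # -1 keeps trailing empty cells

    if io.lineno == opts[:feature_name_row]
      cells.delete_at(col) # class name, unused
      features = cells.map { |c| c.to_sym }
    elsif io.lineno == opts[:feature_type_row]
      cells.delete_at(col) # class type, unused
      types = cells.map { |c| c.downcase.to_sym }
    else
      label = cells.delete_at(col).to_sym
      sample = {}
      cells.each_with_index do |v, i|
        next if v.empty? # missing value
        sample[features[i]] = v # values stay strings in this sketch
      end
      (data[label] ||= []) << sample
    end
  end

  [features, types, data]
end

src = StringIO.new("len,color,class\nINTEGER,nominal,string\n3,red,a\n,blue,b\n")
features, types, data = read_csv_sketch(src)
```

Passing `:class_label_column => 1` to the sketch would instead treat the first column as the class label, which is the flexibility the ChangeLog entry describes.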
@@ -240,35 +268,30 @@ module FileIO
|
|
240
268
|
#
|
241
269
|
# write to csv
|
242
270
|
#
|
243
|
-
# file has the format
|
244
|
-
# specifying features and their data types
|
245
|
-
# and the remaing rows showing data
|
271
|
+
# output CSV file has the same format as the default input for data\_from\_csv()
|
246
272
|
#
|
247
273
|
# @param [String] fname file to write
|
248
|
-
# :stdout # write to standard output
|
274
|
+
# :stdout # write to standard output
|
249
275
|
#
|
250
276
|
def data_to_csv(fname=:stdout)
|
251
|
-
|
252
|
-
|
253
|
-
|
254
|
-
ofs = File.open(fname, 'w')
|
255
|
-
end
|
256
|
-
|
277
|
+
ofs = get_ofs(fname)
|
278
|
+
|
279
|
+
# feature names
|
257
280
|
ofs.puts get_features.join(',')
|
281
|
+
# feature types
|
258
282
|
ofs.puts get_features.collect { |f|
|
259
|
-
|
283
|
+
get_feature_type(f) || :string
|
260
284
|
}.join(',')
|
261
285
|
|
262
286
|
each_sample do |k, s|
|
263
|
-
ofs.print "#{k}"
|
264
287
|
each_feature do |f|
|
265
288
|
if s.has_key? f
|
266
|
-
ofs.print "
|
289
|
+
ofs.print "#{s[f]},"
|
267
290
|
else
|
268
291
|
ofs.print ","
|
269
292
|
end
|
270
293
|
end
|
271
|
-
ofs.puts
|
294
|
+
ofs.puts "#{k}"
|
272
295
|
end
|
273
296
|
|
274
297
|
# close file
|
@@ -279,25 +302,18 @@ module FileIO
|
|
279
302
|
#
|
280
303
|
# read from WEKA ARFF file
|
281
304
|
#
|
282
|
-
# @param [String] fname
|
283
|
-
# :stdin # read from standard input
|
305
|
+
# @param [Symbol|String|StringIO] fname data source to read from
|
306
|
+
# :stdin # read from standard input
|
284
307
|
# @note it's ok if string contains spaces quoted by quote_char
|
285
308
|
#
|
286
309
|
def data_from_weka(fname=:stdin, quote_char='"')
|
287
|
-
|
288
|
-
|
289
|
-
if fname == :stdin
|
290
|
-
ifs = $stdin
|
291
|
-
elsif not File.exists? fname
|
292
|
-
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
293
|
-
" File '#{fname}' does not exist!"
|
294
|
-
else
|
295
|
-
ifs = File.open(fname)
|
296
|
-
end
|
310
|
+
ifs = get_ifs(fname)
|
297
311
|
|
298
312
|
relation, features, classes, types, comments = '', [], [], [], []
|
299
313
|
has_class, has_data = false, false
|
300
314
|
|
315
|
+
data = {}
|
316
|
+
|
301
317
|
ifs.each_line do |ln|
|
302
318
|
next if ln.blank? # blank lines
|
303
319
|
|
@@ -312,7 +328,7 @@ module FileIO
|
|
312
328
|
# class attribute
|
313
329
|
elsif ln =~ /^@ATTRIBUTE\s+class\s+{(.+)}/i
|
314
330
|
has_class = true
|
315
|
-
classes = $1.split_me(/,\s*/, quote_char).to_sym
|
331
|
+
classes = $1.strip.split_me(/,\s*/, quote_char).to_sym
|
316
332
|
classes.each { |k| data[k] = [] }
|
317
333
|
# feature attribute (nominal)
|
318
334
|
elsif ln =~ /^@ATTRIBUTE\s+(\S+)\s+{(.+)}/i
|
@@ -320,7 +336,7 @@ module FileIO
|
|
320
336
|
features << f
|
321
337
|
#$2.split_me(/,\s*/, quote_char) # feature nominal values
|
322
338
|
types << :nominal
|
323
|
-
# feature attribute (integer, real, numeric, string, date)
|
339
|
+
# feature attribute (categorical, integer, real, continuous, numeric, float, double, string, date)
|
324
340
|
elsif ln =~ /^@ATTRIBUTE/i
|
325
341
|
tmp, v1, v2 = ln.split_me(/\s+/, quote_char)
|
326
342
|
f = v1.to_sym
|
@@ -376,10 +392,12 @@ module FileIO
|
|
376
392
|
set_data(data)
|
377
393
|
set_classes(classes)
|
378
394
|
set_features(features)
|
379
|
-
|
395
|
+
# feature name-type pairs
|
380
396
|
features.each_with_index do |f, i|
|
381
|
-
|
397
|
+
set_feature_type(f, types[i])
|
382
398
|
end
|
399
|
+
|
400
|
+
set_opt(:relation, relation)
|
383
401
|
set_opt(:comments, comments) if not comments.empty?
|
384
402
|
end # data_from_weka
|
385
403
|
|
@@ -388,16 +406,12 @@ module FileIO
|
|
388
406
|
# write to WEKA ARFF file
|
389
407
|
#
|
390
408
|
# @param [String] fname file to write
|
391
|
-
# :stdout # write to standard output
|
409
|
+
# :stdout # write to standard output
|
392
410
|
# @param [Symbol] format sparse or regular ARFF
|
393
411
|
# :sparse # sparse ARFF, otherwise regular ARFF
|
394
412
|
#
|
395
413
|
def data_to_weka(fname=:stdout, format=nil)
|
396
|
-
|
397
|
-
ofs = $stdout
|
398
|
-
else
|
399
|
-
ofs = File.open(fname, 'w')
|
400
|
-
end
|
414
|
+
ofs = get_ofs(fname)
|
401
415
|
|
402
416
|
# comments
|
403
417
|
comments = get_opt(:comments)
|
@@ -419,7 +433,7 @@ module FileIO
|
|
419
433
|
# feature attribute
|
420
434
|
each_feature do |f|
|
421
435
|
ofs.print "@ATTRIBUTE #{f} "
|
422
|
-
type =
|
436
|
+
type = get_feature_type(f)
|
423
437
|
if type
|
424
438
|
if type == :nominal
|
425
439
|
ofs.puts "{#{get_feature_values(f).uniq.sort.join(',')}}"
|
@@ -464,10 +478,73 @@ module FileIO
|
|
464
478
|
|
465
479
|
# close file
|
466
480
|
ofs.close if not ofs == $stdout
|
467
|
-
end
|
481
|
+
end # data_to_weka
|
482
|
+
|
483
|
+
|
484
|
+
# read data from url
|
485
|
+
#
|
486
|
+
# @param [String] url url of on-line dataset
|
487
|
+
# @param [Symbol] format allowed formats are:
|
488
|
+
# :libsvm # LibSVM file
|
489
|
+
# :csv # csv file
|
490
|
+
# :weka # Weka ARFF file
|
491
|
+
# @param [Any] args arguments associated with format
|
492
|
+
#
|
493
|
+
def data_from_url(url, format, *args)
|
494
|
+
format = format.downcase.to_sym
|
495
|
+
|
496
|
+
if not [:libsvm, :csv, :weka].include? format
|
497
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
498
|
+
" only CSV, LibSVM and Weka file formats are supported!"
|
499
|
+
end
|
500
|
+
|
501
|
+
uri = URI.parse(URI.encode(url))
|
502
|
+
|
503
|
+
data_src = StringIO.new(uri.read)
|
504
|
+
|
505
|
+
if format == :csv
|
506
|
+
data_from_csv(data_src, *args)
|
507
|
+
elsif format == :libsvm
|
508
|
+
data_from_libsvm(data_src)
|
509
|
+
else # weka
|
510
|
+
data_from_weka(data_src, *args)
|
511
|
+
end
|
512
|
+
end # data_from_url
|
468
513
|
|
469
514
|
private
|
470
515
|
|
516
|
+
# get the input file handler
|
517
|
+
def get_ifs(fname)
|
518
|
+
# read from standard input by default
|
519
|
+
if fname == :stdin
|
520
|
+
ifs = $stdin
|
521
|
+
# read from string if it is a StringIO
|
522
|
+
elsif fname.class == StringIO
|
523
|
+
ifs = fname
|
524
|
+
# read from file if file exists
|
525
|
+
elsif File.exists? fname
|
526
|
+
ifs = File.open(fname)
|
527
|
+
else
|
528
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
529
|
+
" invalid data source!"
|
530
|
+
end
|
531
|
+
|
532
|
+
ifs
|
533
|
+
end
|
534
|
+
|
535
|
+
|
536
|
+
# get the output file handler
|
537
|
+
def get_ofs(fname)
|
538
|
+
if fname == :stdout
|
539
|
+
ofs = $stdout
|
540
|
+
else
|
541
|
+
ofs = File.open(fname, 'w')
|
542
|
+
end
|
543
|
+
|
544
|
+
ofs
|
545
|
+
end
|
546
|
+
|
547
|
+
|
471
548
|
# handle and add each feature for WEKA format
|
472
549
|
#
|
473
550
|
# @param [Hash] fs sample that stores feature and its value
|
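The get\_ifs() helper added in the hunk above lets every reader accept `:stdin`, a StringIO or a file path, which is what allows data\_from\_url() to hand downloaded text to the existing parsers. Below is a minimal sketch of that dispatch, assuming a hypothetical function `open_source_sketch`; it raises `ArgumentError` where the gem calls `abort`. On current Ruby versions `File.exists?` (removed in Ruby 3.2) and `URI.encode` (removed in Ruby 3.0) are no longer available, so the sketch uses `File.exist?` and leaves the URL download out.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper mirroring the three cases handled by get_ifs().
def open_source_sketch(src)
  if src == :stdin
    $stdin
  elsif src.is_a?(StringIO)
    src # already an IO-like object, e.g. text fetched from a URL
  elsif src.is_a?(String) && File.exist?(src)
    File.open(src)
  else
    raise ArgumentError, "invalid data source: #{src.inspect}"
  end
end

# an in-memory source, as data_from_url() would pass it on
mem_lines = open_source_sketch(StringIO.new("+1 1:1 2:1\n-1 3:1\n")).each_line.to_a

# a file on disk
file_text = nil
Tempfile.create('sketch') do |f|
  f.write("a,b\n")
  f.flush
  ifs = open_source_sketch(f.path)
  file_text = ifs.read
  ifs.close
end
```

Because the parsers only call `each_line` (and `lineno` for CSV) on the returned object, any source that behaves like an IO can be plugged in this way.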
@@ -480,14 +557,16 @@ module FileIO
|
|
480
557
|
return
|
481
558
|
elsif type == :integer
|
482
559
|
fs[f] = v.to_i
|
483
|
-
elsif
|
560
|
+
elsif [:real, :numeric, :float, :double, :continuous].include? type
|
484
561
|
fs[f] = v.to_f
|
485
|
-
elsif
|
562
|
+
elsif [:categorical, :string, :nominal].include? type
|
486
563
|
fs[f] = v
|
487
564
|
elsif type == :date # convert into integer
|
488
565
|
fs[f] = (DateTime.parse(v)-DateTime.new(1970,1,1)).to_i
|
489
566
|
else
|
490
|
-
|
567
|
+
abort "[#{__FILE__}@#{__LINE__}]: \n"+
|
568
|
+
" invalid feature type '#{type}', must be one of the following: \n"+
|
569
|
+
" integer, real, numeric, float, double, continuous, categorical, string, nominal, date"
|
491
570
|
end
|
492
571
|
end # add_feature_weka
|
493
572
|
|
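The add\_feature\_weka() hunk above groups the accepted type names into integer, floating-point, textual and date conversions. As a minimal sketch of that mapping, assuming a hypothetical function `coerce_sketch` that raises `ArgumentError` instead of calling `abort`: note that subtracting two `DateTime` objects yields a Rational number of days, so the `:date` branch produces whole days since 1970-01-01.

```ruby
require 'date'

# Hypothetical converter; the type groups follow add_feature_weka().
def coerce_sketch(value, type)
  case type
  when :integer
    value.to_i
  when :real, :numeric, :float, :double, :continuous
    value.to_f
  when :categorical, :string, :nominal
    value
  when :date
    # DateTime - DateTime is a Rational count of days
    (DateTime.parse(value) - DateTime.new(1970, 1, 1)).to_i
  else
    raise ArgumentError, "invalid feature type '#{type}'"
  end
end

coerce_sketch('42', :integer)      # => 42
coerce_sketch('0.5', :double)      # => 0.5
coerce_sketch('1970-01-03', :date) # => 2
```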
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: fselector
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.3.0
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2012-05-
|
12
|
+
date: 2012-05-24 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: rinruby
|
16
|
-
requirement: &
|
16
|
+
requirement: &24934116 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,7 +21,7 @@ dependencies:
|
|
21
21
|
version: 2.0.2
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *24934116
|
25
25
|
description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
|
26
26
|
algorithms and related functions into one single package. Welcome to contact me
|
27
27
|
(need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
|
@@ -48,6 +48,9 @@ files:
|
|
48
48
|
- lib/fselector/algo_base/base_discrete.rb
|
49
49
|
- lib/fselector/algo_base/base_Relief.rb
|
50
50
|
- lib/fselector/algo_base/base_ReliefF.rb
|
51
|
+
- lib/fselector/algo_both/LasVegasFilter.rb
|
52
|
+
- lib/fselector/algo_both/LasVegasIncremental.rb
|
53
|
+
- lib/fselector/algo_both/Random.rb
|
51
54
|
- lib/fselector/algo_continuous/BSS_WSS.rb
|
52
55
|
- lib/fselector/algo_continuous/CFS_c.rb
|
53
56
|
- lib/fselector/algo_continuous/F-Test.rb
|
@@ -75,8 +78,6 @@ files:
|
|
75
78
|
- lib/fselector/algo_discrete/INTERACT.rb
|
76
79
|
- lib/fselector/algo_discrete/J-Measure.rb
|
77
80
|
- lib/fselector/algo_discrete/KL-Divergence.rb
|
78
|
-
- lib/fselector/algo_discrete/LasVegasFilter.rb
|
79
|
-
- lib/fselector/algo_discrete/LasVegasIncremental.rb
|
80
81
|
- lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb
|
81
82
|
- lib/fselector/algo_discrete/McNemarsTest.rb
|
82
83
|
- lib/fselector/algo_discrete/MutualInformation.rb
|
@@ -85,7 +86,6 @@ files:
|
|
85
86
|
- lib/fselector/algo_discrete/Power.rb
|
86
87
|
- lib/fselector/algo_discrete/Precision.rb
|
87
88
|
- lib/fselector/algo_discrete/ProbabilityRatio.rb
|
88
|
-
- lib/fselector/algo_discrete/Random.rb
|
89
89
|
- lib/fselector/algo_discrete/ReliefF_d.rb
|
90
90
|
- lib/fselector/algo_discrete/Relief_d.rb
|
91
91
|
- lib/fselector/algo_discrete/Sensitivity.rb
|