fselector 1.2.0 → 1.3.0

data/ChangeLog CHANGED
@@ -1,3 +1,10 @@
1
+ 2012-05-24 version 1.3.0
2
+
3
+ * update clear\_vars() in Base using Ruby metaprogramming; this trick avoids repetitively overriding it in each derived subclass
4
+ * re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to datasets with either discrete or continuous features, or even mixed types
5
+ * update data\_from\_csv() so that it can read CSV files more flexibly; note that by default the last column holds the class label
6
+ * add data\_from\_url() to read an on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a URL (see the sketch below)
7
+
1
8
  2012-05-20 version 1.2.0
2
9
 
3
10
  * add KS-Test algorithm for continuous feature
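For illustration, here is a minimal sketch of the new data\_from\_url() described above (its definition appears in the FileIO changes later in this diff); the URL is a placeholder and InformationGain is used only as an example algorithm:

    require 'fselector'

    # any algorithm will do; InformationGain is just an example
    r = FSelector::InformationGain.new

    # read an on-line dataset; the second argument names its format (:csv, :libsvm or :weka)
    r.data_from_url('http://example.org/path/to/data.csv', :csv)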
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
8
8
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
9
9
  **Copyright**: 2012
10
10
  **License**: MIT License
11
- **Latest Version**: 1.2.0
12
- **Release Date**: 2012-05-20
11
+ **Latest Version**: 1.3.0
12
+ **Release Date**: 2012-05-24
13
13
 
14
14
  Synopsis
15
15
  --------
@@ -38,56 +38,57 @@ Feature List
38
38
  - csv
39
39
  - libsvm
40
40
  - weka ARFF
41
+ - on-line dataset in one of the above three formats (read only)
41
42
  - random data (read only, for test purpose)
42
43
 
43
44
  **2. available feature selection/ranking algorithms**
44
45
 
45
- algorithm shortcut algo_type feature_type applicability
46
- --------------------------------------------------------------------------------------------------------
47
- Accuracy Acc weighting discrete multi-class
48
- AccuracyBalanced Acc2 weighting discrete multi-class
49
- BiNormalSeparation BNS weighting discrete multi-class
50
- CFS_d CFS_d subset discrete multi-class
51
- ChiSquaredTest CHI weighting discrete multi-class
52
- CorrelationCoefficient CC weighting discrete multi-class
53
- DocumentFrequency DF weighting discrete multi-class
54
- F1Measure F1 weighting discrete multi-class
55
- FishersExactTest FET weighting discrete multi-class
56
- FastCorrelationBasedFilter FCBF subset discrete multi-class
57
- GiniIndex GI weighting discrete multi-class
58
- GMean GM weighting discrete multi-class
59
- GSSCoefficient GSS weighting discrete multi-class
60
- InformationGain IG weighting discrete multi-class
61
- INTERACT INTERACT subset discrete multi-class
62
- JMeasure JM weighting discrete multi-class
63
- KLDivergence KLD weighting discrete multi-class
64
- LasVegasFilter LVF subset discrete, continuous multi-class
65
- LasVegasIncremental LVI subset discrete, continuous multi-class
66
- MatthewsCorrelationCoefficient MCC, PHI weighting discrete multi-class
67
- McNemarsTest MNT weighting discrete multi-class
68
- OddsRatio OR weighting discrete multi-class
69
- OddsRatioNumerator ORN weighting discrete multi-class
70
- PhiCoefficient PHI weighting discrete multi-class
71
- Power Power weighting discrete multi-class
72
- Precision Precision weighting discrete multi-class
73
- ProbabilityRatio PR weighting discrete multi-class
74
- Random Random weighting discrete multi-class
75
- Recall Recall weighting discrete multi-class
76
- Relief_d Relief_d weighting discrete two-class, no missing data
77
- ReliefF_d ReliefF_d weighting discrete multi-class
78
- Sensitivity SN, Recall weighting discrete multi-class
79
- Specificity SP weighting discrete multi-class
80
- SymmetricalUncertainty SU weighting discrete multi-class
81
- BetweenWithinClassesSumOfSquare BSS_WSS weighting continuous multi-class
82
- CFS_c CFS_c subset continuous multi-class
83
- FTest FT weighting continuous multi-class
84
- KS_CCBF KS_CCBF subset continuous multi-class
85
- KSTest KST weighting continuous two-class
86
- PMetric PM weighting continuous two-class
87
- Relief_c Relief_c weighting continuous two-class, no missing data
88
- ReliefF_c ReliefF_c weighting continuous multi-class
89
- TScore TS weighting continuous two-class
90
- WilcoxonRankSum WRS weighting continuous two-class
46
+ algorithm shortcut algo_type applicability feature_type
47
+ --------------------------------------------------------------------------------------------------
48
+ Accuracy Acc weighting multi-class discrete
49
+ AccuracyBalanced Acc2 weighting multi-class discrete
50
+ BiNormalSeparation BNS weighting multi-class discrete
51
+ CFS_d CFS_d subset multi-class discrete
52
+ ChiSquaredTest CHI weighting multi-class discrete
53
+ CorrelationCoefficient CC weighting multi-class discrete
54
+ DocumentFrequency DF weighting multi-class discrete
55
+ F1Measure F1 weighting multi-class discrete
56
+ FishersExactTest FET weighting multi-class discrete
57
+ FastCorrelationBasedFilter FCBF subset multi-class discrete
58
+ GiniIndex GI weighting multi-class discrete
59
+ GMean GM weighting multi-class discrete
60
+ GSSCoefficient GSS weighting multi-class discrete
61
+ InformationGain IG weighting multi-class discrete
62
+ INTERACT INTERACT subset multi-class discrete
63
+ JMeasure JM weighting multi-class discrete
64
+ KLDivergence KLD weighting multi-class discrete
65
+ MatthewsCorrelationCoefficient MCC, PHI weighting multi-class discrete
66
+ McNemarsTest MNT weighting multi-class discrete
67
+ OddsRatio OR weighting multi-class discrete
68
+ OddsRatioNumerator ORN weighting multi-class discrete
69
+ PhiCoefficient PHI weighting multi-class discrete
70
+ Power Power weighting multi-class discrete
71
+ Precision Precision weighting multi-class discrete
72
+ ProbabilityRatio PR weighting multi-class discrete
73
+ Recall Recall weighting multi-class discrete
74
+ Relief_d Relief_d weighting two-class discrete
75
+ ReliefF_d ReliefF_d weighting multi-class discrete
76
+ Sensitivity SN, Recall weighting multi-class discrete
77
+ Specificity SP weighting multi-class discrete
78
+ SymmetricalUncertainty SU weighting multi-class discrete
79
+ BetweenWithinClassesSumOfSquare BSS_WSS weighting multi-class continuous
80
+ CFS_c CFS_c subset multi-class continuous
81
+ FTest FT weighting multi-class continuous
82
+ KS_CCBF KS_CCBF subset multi-class continuous
83
+ KSTest KST weighting two-class continuous
84
+ PMetric PM weighting two-class continuous
85
+ Relief_c Relief_c weighting two-class continuous
86
+ ReliefF_c ReliefF_c weighting multi-class continuous
87
+ TScore TS weighting two-class continuous
88
+ WilcoxonRankSum WRS weighting two-class continuous
89
+ LasVegasFilter LVF subset multi-class discrete, continuous, mixed
90
+ LasVegasIncremental LVI subset multi-class discrete, continuous, mixed
91
+ Random Random weighting multi-class discrete, continuous, mixed
91
92
 
92
93
  **note for feature selection interface:**
93
94
  there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
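As a sketch of what this distinction means in practice (the data file below is hypothetical, and the class and selection-method names follow the table and usage examples in this README; treat them as illustrative):

    # feature weighting algorithm: every feature gets a score,
    # and you then keep features by score or by rank
    r1 = FSelector::InformationGain.new
    r1.data_from_csv('data.csv')          # hypothetical input file
    r1.select_feature_by_rank!('<=3')     # e.g. keep the 3 top-ranked features

    # feature subset selection algorithm: the algorithm itself
    # decides which subset of features survives
    r2 = FSelector::CFS_d.new
    r2.data_from_csv('data.csv')
    r2.select_feature!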
@@ -132,7 +133,7 @@ To install FSelector, use the following command:
132
133
 
133
134
  $ gem install fselector
134
135
 
135
- **note:** Start from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
136
+ **note:** As of version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
136
137
  as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
137
138
  which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
138
139
  on statistical tests. To this end, please pre-install the R package. RinRuby should have been
@@ -194,7 +195,7 @@ Usage
194
195
 
195
196
  # creating an ensemble of feature selectors by using
196
197
  # a single feature selection algorithm (INTERACT)
197
- # by instance perturbation (e.g. bootstrap sampling)
198
+ # by instance perturbation (e.g. random sampling)
198
199
 
199
200
  # test for the type of feature subset selection algorithms
200
201
  r = FSelector::INTERACT.new(0.0001)
@@ -258,8 +259,8 @@ Usage
258
259
  # the Information Gain (IG) algorithm requires data with discrete feature
259
260
  r = FSelector::IG.new
260
261
 
261
- # but the Iris data set contains continuous features (under the test/ directory)
262
- r.data_from_csv('test/iris.csv')
262
+ # but the Iris data set contains continuous features
263
+ r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
263
264
 
264
265
  # let's first discretize it by ChiMerge algorithm at alpha=0.10
265
266
  # then perform feature selection as usual
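The elided steps presumably look something like the following sketch (assuming the ChiMerge discretization method from FSelector's Discretizer module is available on the instance; the rank threshold is only an example):

    # continuing the snippet above
    r.discretize_by_ChiMerge!(0.10)       # alpha = 0.10
    r.select_feature_by_rank!('<=3')      # then select features as usual, e.g. top 3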
@@ -7,7 +7,7 @@ R.eval 'options(warn = -1)' # suppress R warnings
7
7
  #
8
8
  module FSelector
9
9
  # module version
10
- VERSION = '1.2.0'
10
+ VERSION = '1.3.0'
11
11
  end
12
12
 
13
13
  # the root dir of FSelector
@@ -52,6 +52,15 @@ Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
52
52
  require f
53
53
  end
54
54
 
55
+
56
+ #
57
+ # algorithms for handling both discrete and continuous feature
58
+ #
59
+ Dir.glob("#{ROOT}/fselector/algo_both/*").each do |f|
60
+ require f
61
+ end
62
+
63
+
55
64
  #
56
65
  # feature selection use an ensemble of algorithms
57
66
  #
@@ -192,6 +192,34 @@ module FSelector
192
192
  end
193
193
 
194
194
 
195
+ #
196
+ # get the feature type stored in @types
197
+ #
198
+ # @param [Symbol] feature feature of interest
199
+ # return all feature name-type pairs if nil,
200
+ # otherwise reture the type for the feature of interest
201
+ #
202
+ def get_feature_type(feature=nil)
203
+ if @types
204
+ feature ? @types[feature] : @types
205
+ else
206
+ nil
207
+ end
208
+ end
209
+
210
+
211
+ #
212
+ # set feature name-type pair
213
+ #
214
+ # @param [Symbol] feature feature name
215
+ # @param [Symbol] type feature type
216
+ #
217
+ def set_feature_type(feature, type)
218
+ @types ||= {}
219
+ @types[feature] = type
220
+ end
221
+
222
+
195
223
  #
196
224
  # get internal data
197
225
  #
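A small usage sketch for the accessors added above (the file and feature names are made up):

    r = FSelector::Random.new
    r.data_from_csv('iris.csv')                       # hypothetical CSV with a feature-type row

    r.get_feature_type                                # => all feature name-type pairs
    r.get_feature_type(:sepal_length)                 # => e.g. :numeric (hypothetical feature)
    r.set_feature_type(:sepal_length, :categorical)   # override a single feature's type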
@@ -241,7 +269,11 @@ module FSelector
241
269
  # @note return all non-data as a Hash if key == nil
242
270
  #
243
271
  def get_opt(key=nil)
244
- key ? @opts[key] : @opts
272
+ if @opts
273
+ key ? @opts[key] : @opts
274
+ else
275
+ nil
276
+ end
245
277
  end
246
278
 
247
279
 
@@ -404,17 +436,38 @@ module FSelector
404
436
  private
405
437
 
406
438
  #
407
- # clear variables when data structure is altered, this is
408
- # useful when data structure has changed while
409
- # you still want to use the same instance
439
+ # clear instance variables when the data structure is altered;
440
+ # this is useful when the data structure has changed while you
441
+ # still want to use the same instance
410
442
  #
411
- # @note the variables of original data structure (@data) and
412
- # algorithm type (@algo_type) are retained
443
+ # @note only the instance variables used in initialize(),
444
+ # such as @data, will be retained. The class instance variable
445
+ # @algo_type will also be retained. This use of
446
+ # Ruby metaprogramming avoids the repetitive overriding of
447
+ # clear_vars() in each derived subclass
413
448
  #
414
449
  def clear_vars
415
- @classes, @features, @fvs = nil, nil, nil
416
- @scores, @ranks, @sz = nil, nil, nil
417
- @cv, @fvs, @opts = nil, nil, {}
450
+ # instance vars that appear as arguments (with the same name) in initialize()
451
+ instance_var_in_new = []
452
+
453
+ constructor = method(:initialize)
454
+ if constructor.respond_to? :parameters
455
+ constructor.parameters.each do |p|
456
+ instance_var_in_new << "@#{p[1]}".to_sym
457
+ end
458
+ end
459
+
460
+ instance_variables.each do |var|
461
+ # retain instance vars that appear as arguments in initialize()
462
+ # such as @data
463
+ next if instance_var_in_new.include? var
464
+ # retain feature types, which may be needed
465
+ # by CSV and Weka ARFF file
466
+ next if var == :@types
467
+
468
+ # clear all other instance variables
469
+ instance_variable_set(var, nil)
470
+ end
418
471
  end
419
472
 
420
473
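The trick above relies on Method#parameters to discover which instance variables were set from constructor arguments. A self-contained sketch of the same idea, outside FSelector:

    class Example
      def initialize(data, alpha)
        @data, @alpha = data, alpha   # set from constructor arguments, so retained
        @cache = {}                   # derived state, safe to clear
      end

      def clear_vars
        # "@data", "@alpha" recovered from the initialize() signature
        keep = method(:initialize).parameters.map { |_, name| "@#{name}".to_sym }
        instance_variables.each do |var|
          instance_variable_set(var, nil) unless keep.include?(var)
        end
      end
    end

    e = Example.new([1, 2, 3], 0.05)
    e.clear_vars
    e.instance_variable_get(:@data)    # => [1, 2, 3]  (retained)
    e.instance_variable_get(:@cache)   # => nil        (cleared)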
 
@@ -142,15 +142,6 @@ module FSelector
142
142
  end # do_rff
143
143
 
144
144
 
145
- # override clear\_vars for BaseCFS
146
- def clear_vars
147
- super
148
-
149
- @rcf_best, @rff_best = nil, nil
150
- @f2rcf, @fs2rff, @f2idx = nil, nil, nil
151
- end # clear_vars
152
-
153
-
154
145
  end # class
155
146
 
156
147
 
@@ -151,14 +151,6 @@ module FSelector
151
151
  end
152
152
 
153
153
 
154
- # override clear\_vars for BaseReliefF
155
- def clear_vars
156
- super
157
-
158
- @f2mvp = nil
159
- end # clear_vars
160
-
161
-
162
154
  end # class
163
155
 
164
156
 
@@ -175,14 +175,6 @@ module FSelector
175
175
  end # calc_D
176
176
 
177
177
 
178
- # override clear\_vars for BaseDiscrete
179
- def clear_vars
180
- super
181
-
182
- @A, @B, @C, @D = nil, nil, nil, nil
183
- end # clear_vars
184
-
185
-
186
178
  end # class
187
179
 
188
180
 
@@ -3,14 +3,14 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Las Vegas Filter (LVF) for discrete feature,
6
+ # Las Vegas Filter (LVF) for discrete, continuous or mixed features,
7
7
  # use **select\_feature!** for feature selection
8
8
  #
9
9
  # @note we only keep one of the equivalently good solutions
10
10
  #
11
11
  # ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
12
12
  #
13
- class LasVegasFilter < BaseDiscrete
13
+ class LasVegasFilter < Base
14
14
  # include module
15
15
  include Consistency
16
16
 
@@ -3,12 +3,12 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Las Vegas Incremental (LVI) for discrete feature,
6
+ # Las Vegas Incremental (LVI) for discrete, continuous or mixed feature,
7
7
  # use **select\_feature!** for feature selection
8
8
  #
9
9
  # ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
10
10
  #
11
- class LasVegasIncremental < BaseDiscrete
11
+ class LasVegasIncremental < Base
12
12
  # include module
13
13
  include Consistency
14
14
 
@@ -3,13 +3,14 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Random (Rand), no pratical use but can be used as a baseline
6
+ # Random (Rand) for discrete, continuous or mixed features,
7
+ # no practical use but can be used as a baseline
7
8
  #
8
9
  # Rand = rand numbers within [0..1)
9
10
  #
10
11
  # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
11
12
  #
12
- class Random < BaseDiscrete
13
+ class Random < Base
13
14
  # this algo outputs weight for each feature
14
15
  @algo_type = :feature_weighting
15
16
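Since LasVegasFilter, LasVegasIncremental and Random now derive from Base, they can run on datasets with discrete, continuous or mixed features. A sketch (the file name is a placeholder and constructor defaults are assumed; LVF is the shortcut listed in the README table):

    r = FSelector::LasVegasFilter.new         # or the LVF shortcut, defaults assumed
    r.data_from_csv('mixed_features.csv')     # hypothetical CSV with mixed feature types
    r.select_feature!                         # LVF is a subset selection algorithm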
 
@@ -59,14 +59,6 @@ module FSelector
59
59
  end # get_normalization_unit
60
60
 
61
61
 
62
- # override clear\_vars for ReliefF_c
63
- def clear_vars
64
- super
65
-
66
- @f2nu = nil
67
- end # clear_vars
68
-
69
-
70
62
  end # class
71
63
 
72
64
 
@@ -47,14 +47,6 @@ module FSelector
47
47
  end # get_normalization_unit
48
48
 
49
49
 
50
- # override clear\_vars for Relief_c
51
- def clear_vars
52
- super
53
-
54
- @f2nu = nil
55
- end # clear_vars
56
-
57
-
58
50
  end # class
59
51
 
60
52
 
@@ -121,14 +121,6 @@ module FSelector
121
121
  fq
122
122
  end # get_next_element
123
123
 
124
-
125
- # override clear\_vars for FastCorrelationBasedFilter
126
- def clear_vars
127
- super
128
-
129
- @f2hf = nil
130
- end # clear_vars
131
-
132
124
 
133
125
  end # class
134
126
 
@@ -41,15 +41,7 @@ module FSelector
41
41
  s = @hc - hcf
42
42
 
43
43
  set_feature_score(f, :BEST, s)
44
- end # calc_contribution
45
-
46
-
47
- # override clear\_vars for InformationGain
48
- def clear_vars
49
- super
50
-
51
- @hc = nil
52
- end # clear_vars
44
+ end # calc_contribution
53
45
 
54
46
 
55
47
  end # class
@@ -474,8 +474,13 @@ module Discretizer
474
474
  end
475
475
  end
476
476
 
477
- # clear vars
477
+ # clear vars because of data change
478
478
  clear_vars
479
+
480
+ # set all feature type as CATEGORICAL
481
+ each_feature do |f|
482
+ set_feature_type(f, :categorical)
483
+ end
479
484
  end # discretize_at_cutpoints!
480
485
 
481
486
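With the change above, every feature reports a categorical type once the data have been discretized at cutpoints; roughly (assuming, as seems likely, that ChiMerge discretization goes through discretize\_at\_cutpoints! internally):

    r = FSelector::InformationGain.new
    r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
    r.discretize_by_ChiMerge!(0.10)    # a cutpoint-based discretization
    r.get_feature_type.values.uniq     # => [:categorical]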
 
@@ -258,6 +258,7 @@ module FSelector
258
258
  @algo_type = algo.algo_type
259
259
  end
260
260
 
261
+ private
261
262
 
262
263
  #
263
264
  # get ensemble feature scores
@@ -304,7 +305,6 @@ module FSelector
304
305
  #pp ensem_ranks
305
306
  end # get_ensemble_ranks
306
307
 
307
- private
308
308
 
309
309
  #
310
310
  # override get\_feature\_subset() for EnsembleSingle,
@@ -417,6 +417,8 @@ module FSelector
417
417
  end
418
418
  end
419
419
 
420
+ private
421
+
420
422
  #
421
423
  # get ensemble feature scores
422
424
  #
@@ -455,8 +457,6 @@ module FSelector
455
457
  end # get_ensemble_ranks
456
458
 
457
459
 
458
- private
459
-
460
460
  #
461
461
  # override get\_feature\_subset() for EnsembleMultiple,
462
462
  # select a subset of features based on frequency count
@@ -20,6 +20,11 @@
20
20
  # @note class labels and features are treated as symbols
21
21
  #
22
22
  module FileIO
23
+ # require the open-uri lib for http/https/ftp requests
24
+ require 'open-uri'
25
+ # require the stringio lib for converting a string into a StringIO object
26
+ require 'stringio'
27
+
23
28
  #
24
29
  # read from random data (read only, for test purpose)
25
30
  #
@@ -76,20 +81,13 @@ module FileIO
76
81
  # -1 3:1 4:1 ...
77
82
  # ....
78
83
  #
79
- # @param [String] fname file to read from
84
+ # @param [Symbol|String|StringIO] fname data source to read from
80
85
  # :stdin # read from standard input instead of file
81
86
  #
82
- def data_from_libsvm(fname=:stdin)
83
- data = {}
87
+ def data_from_libsvm(fname=:stdin)
88
+ ifs = get_ifs(fname)
84
89
 
85
- if fname == :stdin
86
- ifs = $stdin
87
- elsif not File.exists? fname
88
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
89
- " File '#{fname}' does not exist!"
90
- else
91
- ifs = File.open(fname)
92
- end
90
+ data = {}
93
91
 
94
92
  ifs.each_line do |ln|
95
93
  label, *features = ln.chomp.split(/\s+/)
@@ -116,14 +114,10 @@ module FileIO
116
114
  # write to libsvm
117
115
  #
118
116
  # @param [String] fname file to write
119
- # :stdout # write to standard ouput instead of file
117
+ # :stdout # write to standard ouput
120
118
  #
121
119
  def data_to_libsvm(fname=:stdout)
122
- if fname == :stdout
123
- ofs = $stdout
124
- else
125
- ofs = File.open(fname, 'w')
126
- end
120
+ ofs = get_ofs(fname)
127
121
 
128
122
  # convert class label to integer type
129
123
  k2idx = {}
@@ -153,75 +147,108 @@ module FileIO
153
147
  #
154
148
  # read from csv
155
149
  #
156
- # file should have the format with the first two rows
157
- # specifying features and their data types e.g.
158
- # feat\_name1,feat\_name2,...,feat\_namen
159
- # feat\_type1,feat\_type2,...,feat\_typen
150
+ # if no csv_opts are supplied, we assume the CSV file has the following format:
151
+ # the first row contains feature names, with the last column being the class name
152
+ #
153
+ # feat_name1,feat_name2,...,class_name
154
+ #
155
+ # the second row contains feature types, with the last column being the class type
156
+ #
157
+ # feat_type1,feat_type2,...,class_type
160
158
  #
161
- # and the remaing rows showing data e.g.
162
- # class\_label,feat\_value1,feat\_value2,...,feat\_value3
163
- # ...
159
+ # and the remaining rows contain feature values, with the last column being the class label
160
+ #
161
+ # feat_value1,feat_value2,...,class_label
162
+ # ...
164
163
  #
165
164
  # allowed feature types (case-insensitive) are:
166
- # INTEGER, REAL, NUMERIC, CONTINUOUS, STRING, NOMINAL, CATEGORICAL
165
+ # INTEGER, REAL, NUMERIC, CONTINUOUS, DOUBLE, FLOAT, STRING, NOMINAL, CATEGORICAL
167
166
  #
168
- # @param [String] fname file to read from
169
- # :stdin # read from standard input instead of file
167
+ # @param [Symbol|String|StringIO] fname data source to read from
168
+ # :stdin # read from standard input
169
+ # @param [Hash] csv_opts named arguments for csv options
170
+ # :feature\_name\_row => 1, # (first) row that contains feature names
171
+ # :feature\_type\_row => 2, # (second) row that contains feature types
172
+ # :class\_label\_column => 0 # (last) column that contains class labels
173
+ # :feature\_name2type => {}, # user-supplied hash containing feature name-type pairs
174
+ # if no rows specify them. features must be in the same order
175
+ # as they appear in the dataset
170
176
  #
171
- # @note missing values are allowed, and feature types are stored as lower-case symbols
177
+ # @note missing values are allowed
172
178
  #
173
- def data_from_csv(fname=:stdin)
174
- data = {}
179
+ def data_from_csv(fname=:stdin, csv_opts = {})
180
+ ifs = get_ifs(fname)
175
181
 
176
- if fname == :stdin
177
- ifs = $stdin
178
- elsif not File.exists? fname
179
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
180
- " File '#{fname}' does not exist!"
181
- else
182
- ifs = File.open(fname)
182
+ csv_opts = { # default options, new opts will override old ones
183
+ :feature_name_row => 1, # first row contains feature names
184
+ :feature_type_row => 2, # second row contains feature types
185
+ :class_label_column => 0, # last column contains class labels
186
+ :feature_name2type => {} # user-supplied feature name-type pairs
187
+ }.merge(csv_opts) if csv_opts and csv_opts.class == Hash
188
+
189
+ feature_name_row = csv_opts[:feature_name_row]
190
+ feature_type_row = csv_opts[:feature_type_row]
191
+ class_label_column = csv_opts[:class_label_column]
192
+ feature_name2type = csv_opts[:feature_name2type]
193
+
194
+ # user-supplied feature name-type pairs; this is useful
195
+ # when the file contains no specific rows for feature names and types
196
+ if feature_name2type and not feature_name2type.empty?
197
+ features = feature_name2type.keys.collect { |n| n.to_sym }
198
+ types = feature_name2type.values.collect { |t| t.downcase.to_sym }
199
+ # disable name and type rows
200
+ feature_name_row, feature_type_row = nil, nil
183
201
  end
184
202
 
185
- first_row, second_row = true, true
186
- features, types = [], []
203
+ data = {}
187
204
 
188
205
  ifs.each_line do |ln|
189
- if first_row # first row
190
- first_row = false
191
- features = ln.chomp.split(/,/).to_sym
192
- elsif second_row # second row
193
- second_row = false
206
+ next if ln.blank?
207
+
208
+ cells = ln.chomp.split(/,/)
209
+
210
+ # feature names
211
+ if ifs.lineno == feature_name_row
212
+ class_name = cells.delete_at(class_label_column-1) # simply useless
213
+ features = cells.to_sym
214
+ # feature types
215
+ elsif ifs.lineno == feature_type_row
216
+ class_type = cells.delete_at(class_label_column-1) # simply useless
194
217
  # store feature type as lower-case symbol
195
- types = ln.chomp.split(/,/).collect { |t| t.downcase.to_sym }
196
- if not types.size == features.size
197
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
198
- " the first two rows must have same number of fields"
199
- end
218
+ types = cells.collect { |t| t.downcase.to_sym }
200
219
  else # data rows
201
- label, *fvs = ln.chomp.split(/,/)
202
- label = label.to_sym
203
- data[label] ||= []
220
+ if class_label_column <= cells.size and
221
+ features.size+1 == cells.size and
222
+ types.size+1 == cells.size
223
+ class_label = cells.delete_at(class_label_column-1).to_sym
224
+ data[class_label] ||= []
225
+ # feature values
226
+ fvs = cells
227
+ else
228
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
229
+ " invalid csv format!"
230
+ end
204
231
 
205
232
  fs = {}
206
233
  fvs.each_with_index do |v, i|
207
234
  next if v.empty? # missing value
208
- feat_type = types[i]
209
- if feat_type == :integer
235
+ type = types[i]
236
+ if type == :integer
210
237
  v = v.to_i
211
- elsif [:real, :numeric, :continuous].include? feat_type
238
+ elsif [:real, :numeric, :continuous, :float, :double].include? type
212
239
  v = v.to_f
213
- elsif [:string, :nominal, :categorical].include? feat_type
240
+ elsif [:string, :nominal, :categorical].include? type
214
241
  #
215
242
  else
216
243
  abort "[#{__FILE__}@#{__LINE__}]: \n"+
217
- " please specify correct type "+
218
- "for each feature in the 2nd row"
244
+ " invalid feature type '#{type}', must be one of the following: \n"+
245
+ " integer, real, numeric, float, double, continuous, categorical, string, nominal"
219
246
  end
220
247
 
221
248
  fs[features[i]] = v
222
249
  end
223
250
 
224
- data[label] << fs
251
+ data[class_label] << fs
225
252
  end
226
253
  end
227
254
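A sketch of the new csv_opts in use, for a hypothetical header-less CSV whose class label sits in the first column (file and feature names are made up):

    r = FSelector::Random.new

    # the file has no feature-name/type rows, so supply the name-type pairs directly;
    # class labels live in the first column rather than the (default) last one
    r.data_from_csv('headerless.csv',
      :class_label_column => 1,
      :feature_name2type  => {
        'height' => 'numeric',
        'weight' => 'numeric',
        'gender' => 'nominal'
      }
    )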
 
@@ -230,9 +257,10 @@ module FileIO
230
257
 
231
258
  set_data(data)
232
259
  set_features(features)
233
- # set feature type
260
+
261
+ # feature name-type pairs
234
262
  features.each_with_index do |f, i|
235
- set_opt(f, types[i])
263
+ set_feature_type(f, types[i])
236
264
  end
237
265
  end # data_from_csv
238
266
 
@@ -240,35 +268,30 @@ module FileIO
240
268
  #
241
269
  # write to csv
242
270
  #
243
- # file has the format with the first two rows
244
- # specifying features and their data types
245
- # and the remaing rows showing data
271
+ # the output CSV file has the same format as the default input for data\_from\_csv()
246
272
  #
247
273
  # @param [String] fname file to write
248
- # :stdout # write to standard ouput instead of file
274
+ # :stdout # write to standard output
249
275
  #
250
276
  def data_to_csv(fname=:stdout)
251
- if fname == :stdout
252
- ofs = $stdout
253
- else
254
- ofs = File.open(fname, 'w')
255
- end
256
-
277
+ ofs = get_ofs(fname)
278
+
279
+ # feature names
257
280
  ofs.puts get_features.join(',')
281
+ # feature types
258
282
  ofs.puts get_features.collect { |f|
259
- get_opt(f) || :string
283
+ get_feature_type(f) || :string
260
284
  }.join(',')
261
285
 
262
286
  each_sample do |k, s|
263
- ofs.print "#{k}"
264
287
  each_feature do |f|
265
288
  if s.has_key? f
266
- ofs.print ",#{s[f]}"
289
+ ofs.print "#{s[f]},"
267
290
  else
268
291
  ofs.print ","
269
292
  end
270
293
  end
271
- ofs.puts
294
+ ofs.puts "#{k}"
272
295
  end
273
296
 
274
297
  # close file
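After this change the class label is written last in each row, so a file produced by data\_to\_csv() can be read straight back with the default data\_from\_csv() options (file names are placeholders):

    r = FSelector::Random.new
    r.data_from_csv('original.csv')    # hypothetical input
    r.data_to_csv('exported.csv')      # feature values first, class label last

    r2 = FSelector::Random.new
    r2.data_from_csv('exported.csv')   # defaults: last column = class label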
@@ -279,25 +302,18 @@ module FileIO
279
302
  #
280
303
  # read from WEKA ARFF file
281
304
  #
282
- # @param [String] fname file to read from
283
- # :stdin # read from standard input instead of file
305
+ # @param [Symbol|String|StringIO] fname data source to read from
306
+ # :stdin # read from standard input
284
307
  # @note it's ok if string containes spaces quoted by quote_char
285
308
  #
286
309
  def data_from_weka(fname=:stdin, quote_char='"')
287
- data = {}
288
-
289
- if fname == :stdin
290
- ifs = $stdin
291
- elsif not File.exists? fname
292
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
293
- " File '#{fname}' does not exist!"
294
- else
295
- ifs = File.open(fname)
296
- end
310
+ ifs = get_ifs(fname)
297
311
 
298
312
  relation, features, classes, types, comments = '', [], [], [], []
299
313
  has_class, has_data = false, false
300
314
 
315
+ data = {}
316
+
301
317
  ifs.each_line do |ln|
302
318
  next if ln.blank? # blank lines
303
319
 
@@ -312,7 +328,7 @@ module FileIO
312
328
  # class attribute
313
329
  elsif ln =~ /^@ATTRIBUTE\s+class\s+{(.+)}/i
314
330
  has_class = true
315
- classes = $1.split_me(/,\s*/, quote_char).to_sym
331
+ classes = $1.strip.split_me(/,\s*/, quote_char).to_sym
316
332
  classes.each { |k| data[k] = [] }
317
333
  # feature attribute (nominal)
318
334
  elsif ln =~ /^@ATTRIBUTE\s+(\S+)\s+{(.+)}/i
@@ -320,7 +336,7 @@ module FileIO
320
336
  features << f
321
337
  #$2.split_me(/,\s*/, quote_char) # feature nominal values
322
338
  types << :nominal
323
- # feature attribute (integer, real, numeric, string, date)
339
+ # feature attribute (categorical, integer, real, continuous, numeric, float, double, string, date)
324
340
  elsif ln =~ /^@ATTRIBUTE/i
325
341
  tmp, v1, v2 = ln.split_me(/\s+/, quote_char)
326
342
  f = v1.to_sym
@@ -376,10 +392,12 @@ module FileIO
376
392
  set_data(data)
377
393
  set_classes(classes)
378
394
  set_features(features)
379
- set_opt(:relation, relation)
395
+ # feature name-type pairs
380
396
  features.each_with_index do |f, i|
381
- set_opt(f, types[i])
397
+ set_feature_type(f, types[i])
382
398
  end
399
+
400
+ set_opt(:relation, relation)
383
401
  set_opt(:comments, comments) if not comments.empty?
384
402
  end # data_from_weak
385
403
 
@@ -388,16 +406,12 @@ module FileIO
388
406
  # write to WEKA ARFF file
389
407
  #
390
408
  # @param [String] fname file to write
391
- # :stdout # write to standard ouput instead of file
409
+ # :stdout # write to standard ouput
392
410
  # @param [Symbol] format sparse or regular ARFF
393
411
  # :sparse # sparse ARFF, otherwise regular ARFF
394
412
  #
395
413
  def data_to_weka(fname=:stdout, format=nil)
396
- if fname == :stdout
397
- ofs = $stdout
398
- else
399
- ofs = File.open(fname, 'w')
400
- end
414
+ ofs = get_ofs(fname)
401
415
 
402
416
  # comments
403
417
  comments = get_opt(:comments)
@@ -419,7 +433,7 @@ module FileIO
419
433
  # feature attribute
420
434
  each_feature do |f|
421
435
  ofs.print "@ATTRIBUTE #{f} "
422
- type = get_opt(f) # feature type
436
+ type = get_feature_type(f)
423
437
  if type
424
438
  if type == :nominal
425
439
  ofs.puts "{#{get_feature_values(f).uniq.sort.join(',')}}"
@@ -464,10 +478,73 @@ module FileIO
464
478
 
465
479
  # close file
466
480
  ofs.close if not ofs == $stdout
467
- end
481
+ end # data_to_weka
482
+
483
+
484
+ # read data from a URL
485
+ #
486
+ # @param [String] url URL of the on-line dataset
487
+ # @param [Symbol] format allowed formats are:
488
+ # :libsvm # LibSVM file
489
+ # :csv # csv file
490
+ # :weka # Weka ARFF file
491
+ # @param [Any] args arguments associated with format
492
+ #
493
+ def data_from_url(url, format, *args)
494
+ format = format.downcase.to_sym
495
+
496
+ if not [:libsvm, :csv, :weka].include? format
497
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
498
+ " only CSV, LibSVM and Weka file formats are supported!"
499
+ end
500
+
501
+ uri = URI.parse(URI.encode(url))
502
+
503
+ data_src = StringIO.new(uri.read)
504
+
505
+ if format == :csv
506
+ data_from_csv(data_src, *args)
507
+ elsif format == :libsvm
508
+ data_from_libsvm(data_src)
509
+ else # weka
510
+ data_from_weka(data_src, *args)
511
+ end
512
+ end # data_from_url
468
513
 
469
514
  private
470
515
 
516
+ # get the input file handler
517
+ def get_ifs(fname)
518
+ # read from standard input by default
519
+ if fname == :stdin
520
+ ifs = $stdin
521
+ # read from string if it is a StringIO
522
+ elsif fname.class == StringIO
523
+ ifs = fname
524
+ # read from file if file exists
525
+ elsif File.exists? fname
526
+ ifs = File.open(fname)
527
+ else
528
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
529
+ " invalid data source!"
530
+ end
531
+
532
+ ifs
533
+ end
534
+
535
+
536
+ # get the output file handler
537
+ def get_ofs(fname)
538
+ if fname == :stdout
539
+ ofs = $stdout
540
+ else
541
+ ofs = File.open(fname, 'w')
542
+ end
543
+
544
+ ofs
545
+ end
546
+
547
+
471
548
  # handle and add each feature for WEKA format
472
549
  #
473
550
  # @param [Hash] fs sample that stores feature and its value
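Because get\_ifs() accepts a StringIO as well as a file name or :stdin, every reader can be fed an in-memory string, which is exactly what data\_from\_url() builds on; for example, with inline LibSVM data:

    require 'stringio'

    libsvm_data = StringIO.new("+1 1:0.5 2:1.2\n-1 1:0.1 3:0.7\n")

    r = FSelector::Random.new
    r.data_from_libsvm(libsvm_data)    # read directly from the StringIO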
@@ -480,14 +557,16 @@ module FileIO
480
557
  return
481
558
  elsif type == :integer
482
559
  fs[f] = v.to_i
483
- elsif type == :real or type == :numeric
560
+ elsif [:real, :numeric, :float, :double, :continuous].include? type
484
561
  fs[f] = v.to_f
485
- elsif type == :string or type == :nominal
562
+ elsif [:categorical, :string, :nominal].include? type
486
563
  fs[f] = v
487
564
  elsif type == :date # convert into integer
488
565
  fs[f] = (DateTime.parse(v)-DateTime.new(1970,1,1)).to_i
489
566
  else
490
- return
567
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
568
+ " invalid feature type '#{type}', must be one of the following: \n"+
569
+ " integer, real, numeric, float, double, continuous, categorical, string, nominal, date"
491
570
  end
492
571
  end # add_feature_weka
493
572
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: fselector
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.3.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-05-21 00:00:00.000000000 Z
12
+ date: 2012-05-24 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rinruby
16
- requirement: &23863908 !ruby/object:Gem::Requirement
16
+ requirement: &24934116 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
21
21
  version: 2.0.2
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *23863908
24
+ version_requirements: *24934116
25
25
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
26
26
  algorithms and related functions into one single package. Welcome to contact me
27
27
  (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -48,6 +48,9 @@ files:
48
48
  - lib/fselector/algo_base/base_discrete.rb
49
49
  - lib/fselector/algo_base/base_Relief.rb
50
50
  - lib/fselector/algo_base/base_ReliefF.rb
51
+ - lib/fselector/algo_both/LasVegasFilter.rb
52
+ - lib/fselector/algo_both/LasVegasIncremental.rb
53
+ - lib/fselector/algo_both/Random.rb
51
54
  - lib/fselector/algo_continuous/BSS_WSS.rb
52
55
  - lib/fselector/algo_continuous/CFS_c.rb
53
56
  - lib/fselector/algo_continuous/F-Test.rb
@@ -75,8 +78,6 @@ files:
75
78
  - lib/fselector/algo_discrete/INTERACT.rb
76
79
  - lib/fselector/algo_discrete/J-Measure.rb
77
80
  - lib/fselector/algo_discrete/KL-Divergence.rb
78
- - lib/fselector/algo_discrete/LasVegasFilter.rb
79
- - lib/fselector/algo_discrete/LasVegasIncremental.rb
80
81
  - lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb
81
82
  - lib/fselector/algo_discrete/McNemarsTest.rb
82
83
  - lib/fselector/algo_discrete/MutualInformation.rb
@@ -85,7 +86,6 @@ files:
85
86
  - lib/fselector/algo_discrete/Power.rb
86
87
  - lib/fselector/algo_discrete/Precision.rb
87
88
  - lib/fselector/algo_discrete/ProbabilityRatio.rb
88
- - lib/fselector/algo_discrete/Random.rb
89
89
  - lib/fselector/algo_discrete/ReliefF_d.rb
90
90
  - lib/fselector/algo_discrete/Relief_d.rb
91
91
  - lib/fselector/algo_discrete/Sensitivity.rb