fselector 1.2.0 → 1.3.0

data/ChangeLog CHANGED
@@ -1,3 +1,10 @@
1
+ 2012-05-24 version 1.3.0
2
+
3
+ * update clear\_vars() in Base by use of Ruby metaprogramming, this trick avoids repetitive overriding it in each derived subclass
4
+ * re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to dataset with either discrete or continuous features, even with mixed type
5
+ * update data\_from\_csv() so that it can read CSV file more flexibly. note by default, the last column is class label
6
+ * add data\_from\_url() to read on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a url
7
+
1
8
  2012-05-20 version 1.2.0
2
9
 
3
10
  * add KS-Test algorithm for continuous feature
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
8
8
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
9
9
  **Copyright**: 2012
10
10
  **License**: MIT License
11
- **Latest Version**: 1.2.0
12
- **Release Date**: 2012-05-20
11
+ **Latest Version**: 1.3.0
12
+ **Release Date**: 2012-05-24
13
13
 
14
14
  Synopsis
15
15
  --------
@@ -38,56 +38,57 @@ Feature List
38
38
  - csv
39
39
  - libsvm
40
40
  - weka ARFF
41
+ - on-line dataset in one of the above three formats (read only)
41
42
  - random data (read only, for test purpose)
42
43
 
43
44
  **2. available feature selection/ranking algorithms**
44
45
 
45
- algorithm shortcut algo_type feature_type applicability
46
- --------------------------------------------------------------------------------------------------------
47
- Accuracy Acc weighting discrete multi-class
48
- AccuracyBalanced Acc2 weighting discrete multi-class
49
- BiNormalSeparation BNS weighting discrete multi-class
50
- CFS_d CFS_d subset discrete multi-class
51
- ChiSquaredTest CHI weighting discrete multi-class
52
- CorrelationCoefficient CC weighting discrete multi-class
53
- DocumentFrequency DF weighting discrete multi-class
54
- F1Measure F1 weighting discrete multi-class
55
- FishersExactTest FET weighting discrete multi-class
56
- FastCorrelationBasedFilter FCBF subset discrete multi-class
57
- GiniIndex GI weighting discrete multi-class
58
- GMean GM weighting discrete multi-class
59
- GSSCoefficient GSS weighting discrete multi-class
60
- InformationGain IG weighting discrete multi-class
61
- INTERACT INTERACT subset discrete multi-class
62
- JMeasure JM weighting discrete multi-class
63
- KLDivergence KLD weighting discrete multi-class
64
- LasVegasFilter LVF subset discrete, continuous multi-class
65
- LasVegasIncremental LVI subset discrete, continuous multi-class
66
- MatthewsCorrelationCoefficient MCC, PHI weighting discrete multi-class
67
- McNemarsTest MNT weighting discrete multi-class
68
- OddsRatio OR weighting discrete multi-class
69
- OddsRatioNumerator ORN weighting discrete multi-class
70
- PhiCoefficient PHI weighting discrete multi-class
71
- Power Power weighting discrete multi-class
72
- Precision Precision weighting discrete multi-class
73
- ProbabilityRatio PR weighting discrete multi-class
74
- Random Random weighting discrete multi-class
75
- Recall Recall weighting discrete multi-class
76
- Relief_d Relief_d weighting discrete two-class, no missing data
77
- ReliefF_d ReliefF_d weighting discrete multi-class
78
- Sensitivity SN, Recall weighting discrete multi-class
79
- Specificity SP weighting discrete multi-class
80
- SymmetricalUncertainty SU weighting discrete multi-class
81
- BetweenWithinClassesSumOfSquare BSS_WSS weighting continuous multi-class
82
- CFS_c CFS_c subset continuous multi-class
83
- FTest FT weighting continuous multi-class
84
- KS_CCBF KS_CCBF subset continuous multi-class
85
- KSTest KST weighting continuous two-class
86
- PMetric PM weighting continuous two-class
87
- Relief_c Relief_c weighting continuous two-class, no missing data
88
- ReliefF_c ReliefF_c weighting continuous multi-class
89
- TScore TS weighting continuous two-class
90
- WilcoxonRankSum WRS weighting continuous two-class
46
+ algorithm shortcut algo_type applicability feature_type
47
+ --------------------------------------------------------------------------------------------------
48
+ Accuracy Acc weighting multi-class discrete
49
+ AccuracyBalanced Acc2 weighting multi-class discrete
50
+ BiNormalSeparation BNS weighting multi-class discrete
51
+ CFS_d CFS_d subset multi-class discrete
52
+ ChiSquaredTest CHI weighting multi-class discrete
53
+ CorrelationCoefficient CC weighting multi-class discrete
54
+ DocumentFrequency DF weighting multi-class discrete
55
+ F1Measure F1 weighting multi-class discrete
56
+ FishersExactTest FET weighting multi-class discrete
57
+ FastCorrelationBasedFilter FCBF subset multi-class discrete
58
+ GiniIndex GI weighting multi-class discrete
59
+ GMean GM weighting multi-class discrete
60
+ GSSCoefficient GSS weighting multi-class discrete
61
+ InformationGain IG weighting multi-class discrete
62
+ INTERACT INTERACT subset multi-class discrete
63
+ JMeasure JM weighting multi-class discrete
64
+ KLDivergence KLD weighting multi-class discrete
65
+ MatthewsCorrelationCoefficient MCC, PHI weighting multi-class discrete
66
+ McNemarsTest MNT weighting multi-class discrete
67
+ OddsRatio OR weighting multi-class discrete
68
+ OddsRatioNumerator ORN weighting multi-class discrete
69
+ PhiCoefficient PHI weighting multi-class discrete
70
+ Power Power weighting multi-class discrete
71
+ Precision Precision weighting multi-class discrete
72
+ ProbabilityRatio PR weighting multi-class discrete
73
+ Recall Recall weighting multi-class discrete
74
+ Relief_d Relief_d weighting two-class discrete
75
+ ReliefF_d ReliefF_d weighting multi-class discrete
76
+ Sensitivity SN, Recall weighting multi-class discrete
77
+ Specificity SP weighting multi-class discrete
78
+ SymmetricalUncertainty SU weighting multi-class discrete
79
+ BetweenWithinClassesSumOfSquare BSS_WSS weighting multi-class continuous
80
+ CFS_c CFS_c subset multi-class continuous
81
+ FTest FT weighting multi-class continuous
82
+ KS_CCBF KS_CCBF subset multi-class continuous
83
+ KSTest KST weighting two-class continuous
84
+ PMetric PM weighting two-class continuous
85
+ Relief_c Relief_c weighting two-class continuous
86
+ ReliefF_c ReliefF_c weighting multi-class continuous
87
+ TScore TS weighting two-class continuous
88
+ WilcoxonRankSum WRS weighting two-class continuous
89
+ LasVegasFilter LVF subset multi-class discrete, continuous, mixed
90
+ LasVegasIncremental LVI subset multi-class discrete, continuous, mixed
91
+ Random Random weighting multi-class discrete, continuous, mixed
91
92
 
92
93
  **note for feature selection interface:**
93
94
  there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
@@ -132,7 +133,7 @@ To install FSelector, use the following command:
132
133
 
133
134
  $ gem install fselector
134
135
 
135
- **note:** Start from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
136
+ **note:** From version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
136
137
  as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
137
138
  which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
138
139
  on statistical test. To this end, please pre-install the R package. RinRuby should have been
@@ -194,7 +195,7 @@ Usage
194
195
 
195
196
  # creating an ensemble of feature selectors by using
196
197
  # a single feature selection algorithm (INTERACT)
197
- # by instance perturbation (e.g. bootstrap sampling)
198
+ # by instance perturbation (e.g. random sampling)
198
199
 
199
200
  # test for the type of feature subset selection algorithms
200
201
  r = FSelector::INTERACT.new(0.0001)
@@ -258,8 +259,8 @@ Usage
258
259
  # the Information Gain (IG) algorithm requires data with discrete feature
259
260
  r = FSelector::IG.new
260
261
 
261
- # but the Iris data set contains continuous features (under the test/ directory)
262
- r.data_from_csv('test/iris.csv')
262
+ # but the Iris data set contains continuous features
263
+ r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
263
264
 
264
265
  # let's first discretize it by ChiMerge algorithm at alpha=0.10
265
266
  # then perform feature selection as usual
@@ -7,7 +7,7 @@ R.eval 'options(warn = -1)' # suppress R warnings
7
7
  #
8
8
  module FSelector
9
9
  # module version
10
- VERSION = '1.2.0'
10
+ VERSION = '1.3.0'
11
11
  end
12
12
 
13
13
  # the root dir of FSelector
@@ -52,6 +52,15 @@ Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
52
52
  require f
53
53
  end
54
54
 
55
+
56
+ #
57
+ # algorithms for handling both discrete and continuous feature
58
+ #
59
+ Dir.glob("#{ROOT}/fselector/algo_both/*").each do |f|
60
+ require f
61
+ end
62
+
63
+
55
64
  #
56
65
  # feature selection use an ensemble of algorithms
57
66
  #
@@ -192,6 +192,34 @@ module FSelector
192
192
  end
193
193
 
194
194
 
195
+ #
196
+ # get the feature type stored in @types
197
+ #
198
+ # @param [Symbol] feature feature of interest
199
+ # return all feature name-type pairs if nil,
200
+ # otherwise return the type for the feature of interest
201
+ #
202
+ def get_feature_type(feature=nil)
203
+ if @types
204
+ feature ? @types[feature] : @types
205
+ else
206
+ nil
207
+ end
208
+ end
209
+
210
+
211
+ #
212
+ # set feature name-type pair
213
+ #
214
+ # @param [Symbol] feature feature name
215
+ # @param [Symbol] type feature type
216
+ #
217
+ def set_feature_type(feature, type)
218
+ @types ||= {}
219
+ @types[feature] = type
220
+ end
221
+
222
+
195
223
  #
196
224
  # get internal data
197
225
  #
@@ -241,7 +269,11 @@ module FSelector
241
269
  # @note return all non-data as a Hash if key == nil
242
270
  #
243
271
  def get_opt(key=nil)
244
- key ? @opts[key] : @opts
272
+ if @opts
273
+ key ? @opts[key] : @opts
274
+ else
275
+ nil
276
+ end
245
277
  end
246
278
 
247
279
 
@@ -404,17 +436,38 @@ module FSelector
404
436
  private
405
437
 
406
438
  #
407
- # clear variables when data structure is altered, this is
408
- # useful when data structure has changed while
409
- # you still want to use the same instance
439
+ # clear instance variables when data structure is altered,
440
+ # this is useful when data structure has changed while you
441
+ # still want to use the same instance
410
442
  #
411
- # @note the variables of original data structure (@data) and
412
- # algorithm type (@algo_type) are retained
443
+ # @note only the instance variables used in the initialize()
444
+ # such as @data will be retained. class instance variable
445
+ # @algo_type will also be retained. This trick by use of
446
+ # Ruby metaprogramming avoids the repetitive overriding of
447
+ # clear_vars() in each derived subclass
413
448
  #
414
449
  def clear_vars
415
- @classes, @features, @fvs = nil, nil, nil
416
- @scores, @ranks, @sz = nil, nil, nil
417
- @cv, @fvs, @opts = nil, nil, {}
450
+ # instance vars appeared as arguments (with the same name) in initialize()
451
+ instance_var_in_new = []
452
+
453
+ constructor = method(:initialize)
454
+ if constructor.respond_to? :parameters
455
+ constructor.parameters.each do |p|
456
+ instance_var_in_new << "@#{p[1]}".to_sym
457
+ end
458
+ end
459
+
460
+ instance_variables.each do |var|
461
+ # retain instance vars appeared as arguments in initialize()
462
+ # such as @data
463
+ next if instance_var_in_new.include? var
464
+ # retain feature types, which may be needed
465
+ # by CSV and Weka ARFF file
466
+ next if var == :@types
467
+
468
+ # clear all other instance variable
469
+ instance_variable_set(var, nil)
470
+ end
418
471
  end
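Note on the hunk above: the new `clear_vars` relies on `Method#parameters` to learn which instance variables mirror constructor arguments, then resets everything else. A minimal standalone sketch of that idea follows. `SelectorSketch` and `reset!` are hypothetical names that do not exist in FSelector, and the real `Base` class carries far more state than this.

```ruby
# Hypothetical stand-in for a selector class, used only to exercise
# the reset logic; it is not FSelector::Base.
class SelectorSketch
  def initialize(data = nil)
    @data   = data                  # mirrors a constructor argument, so it is kept
    @types  = { :f1 => :integer }   # feature types are kept explicitly
    @scores = { :f1 => 0.5 }        # derived state, cleared on reset
    @hc     = 1.0                   # derived state, cleared on reset
  end

  # public entry point for the sketch; the hunk keeps clear_vars private
  def reset!
    clear_vars
  end

  private

  def clear_vars
    keep = [:@types]

    # every constructor parameter name maps to an instance variable to retain
    ctor = method(:initialize)
    if ctor.respond_to?(:parameters)
      ctor.parameters.each do |_kind, name|
        keep << "@#{name}".to_sym if name
      end
    end

    # reset all remaining instance variables to nil
    instance_variables.each do |var|
      next if keep.include?(var)
      instance_variable_set(var, nil)
    end
  end
end
```

Because the lookup goes through `method(:initialize)`, a subclass that defines its own constructor gets its own retained set. That is presumably why the per-subclass `clear_vars` overrides removed elsewhere in this diff are no longer needed.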
419
472
 
420
473
 
@@ -142,15 +142,6 @@ module FSelector
142
142
  end # do_rff
143
143
 
144
144
 
145
- # override clear\_vars for BaseCFS
146
- def clear_vars
147
- super
148
-
149
- @rcf_best, @rff_best = nil, nil
150
- @f2rcf, @fs2rff, @f2idx = nil, nil, nil
151
- end # clear_vars
152
-
153
-
154
145
  end # class
155
146
 
156
147
 
@@ -151,14 +151,6 @@ module FSelector
151
151
  end
152
152
 
153
153
 
154
- # override clear\_vars for BaseReliefF
155
- def clear_vars
156
- super
157
-
158
- @f2mvp = nil
159
- end # clear_vars
160
-
161
-
162
154
  end # class
163
155
 
164
156
 
@@ -175,14 +175,6 @@ module FSelector
175
175
  end # calc_D
176
176
 
177
177
 
178
- # override clear\_vars for BaseDiscrete
179
- def clear_vars
180
- super
181
-
182
- @A, @B, @C, @D = nil, nil, nil, nil
183
- end # clear_vars
184
-
185
-
186
178
  end # class
187
179
 
188
180
 
@@ -3,14 +3,14 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Las Vegas Filter (LVF) for discrete feature,
6
+ # Las Vegas Filter (LVF) for discrete, continuous or mixed feature,
7
7
  # use **select\_feature!** for feature selection
8
8
  #
9
9
  # @note we only keep one of the equivalently good solutions
10
10
  #
11
11
  # ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
12
12
  #
13
- class LasVegasFilter < BaseDiscrete
13
+ class LasVegasFilter < Base
14
14
  # include module
15
15
  include Consistency
16
16
 
@@ -3,12 +3,12 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Las Vegas Incremental (LVI) for discrete feature,
6
+ # Las Vegas Incremental (LVI) for discrete, continuous or mixed feature,
7
7
  # use **select\_feature!** for feature selection
8
8
  #
9
9
  # ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
10
10
  #
11
- class LasVegasIncremental < BaseDiscrete
11
+ class LasVegasIncremental < Base
12
12
  # include module
13
13
  include Consistency
14
14
 
@@ -3,13 +3,14 @@
3
3
  #
4
4
  module FSelector
5
5
  #
6
- # Random (Rand), no pratical use but can be used as a baseline
6
+ # Random (Rand) for discrete, continuous or mixed feature,
7
+ # no practical use but can be used as a baseline
7
8
  #
8
9
  # Rand = rand numbers within [0..1)
9
10
  #
10
11
  # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
11
12
  #
12
- class Random < BaseDiscrete
13
+ class Random < Base
13
14
  # this algo outputs weight for each feature
14
15
  @algo_type = :feature_weighting
15
16
 
@@ -59,14 +59,6 @@ module FSelector
59
59
  end # get_normalization_unit
60
60
 
61
61
 
62
- # override clear\_vars for ReliefF_c
63
- def clear_vars
64
- super
65
-
66
- @f2nu = nil
67
- end # clear_vars
68
-
69
-
70
62
  end # class
71
63
 
72
64
 
@@ -47,14 +47,6 @@ module FSelector
47
47
  end # get_normalization_unit
48
48
 
49
49
 
50
- # override clear\_vars for Relief_c
51
- def clear_vars
52
- super
53
-
54
- @f2nu = nil
55
- end # clear_vars
56
-
57
-
58
50
  end # class
59
51
 
60
52
 
@@ -121,14 +121,6 @@ module FSelector
121
121
  fq
122
122
  end # get_next_element
123
123
 
124
-
125
- # override clear\_vars for FastCorrelationBasedFilter
126
- def clear_vars
127
- super
128
-
129
- @f2hf = nil
130
- end # clear_vars
131
-
132
124
 
133
125
  end # class
134
126
 
@@ -41,15 +41,7 @@ module FSelector
41
41
  s = @hc - hcf
42
42
 
43
43
  set_feature_score(f, :BEST, s)
44
- end # calc_contribution
45
-
46
-
47
- # override clear\_vars for InformationGain
48
- def clear_vars
49
- super
50
-
51
- @hc = nil
52
- end # clear_vars
44
+ end # calc_contribution
53
45
 
54
46
 
55
47
  end # class
@@ -474,8 +474,13 @@ module Discretizer
474
474
  end
475
475
  end
476
476
 
477
- # clear vars
477
+ # clear vars because of data change
478
478
  clear_vars
479
+
480
+ # set all feature type as CATEGORICAL
481
+ each_feature do |f|
482
+ set_feature_type(f, :categorical)
483
+ end
479
484
  end # discretize_at_cutpoints!
480
485
 
481
486
 
@@ -258,6 +258,7 @@ module FSelector
258
258
  @algo_type = algo.algo_type
259
259
  end
260
260
 
261
+ private
261
262
 
262
263
  #
263
264
  # get ensemble feature scores
@@ -304,7 +305,6 @@ module FSelector
304
305
  #pp ensem_ranks
305
306
  end # get_ensemble_ranks
306
307
 
307
- private
308
308
 
309
309
  #
310
310
  # override get\_feature\_subset() for EnsembleSingle,
@@ -417,6 +417,8 @@ module FSelector
417
417
  end
418
418
  end
419
419
 
420
+ private
421
+
420
422
  #
421
423
  # get ensemble feature scores
422
424
  #
@@ -455,8 +457,6 @@ module FSelector
455
457
  end # get_ensemble_ranks
456
458
 
457
459
 
458
- private
459
-
460
460
  #
461
461
  # override get\_feature\_subset() for EnsembleMultiple,
462
462
  # select a subset of features based on frequency count
@@ -20,6 +20,11 @@
20
20
  # @note class labels and features are treated as symbols
21
21
  #
22
22
  module FileIO
23
+ # require the open-uri lib for http/https/ftp request
24
+ require 'open-uri'
25
+ # require the stringio lib for converting a string into stringio object
26
+ require 'stringio'
27
+
23
28
  #
24
29
  # read from random data (read only, for test purpose)
25
30
  #
@@ -76,20 +81,13 @@ module FileIO
76
81
  # -1 3:1 4:1 ...
77
82
  # ....
78
83
  #
79
- # @param [String] fname file to read from
84
+ # @param [Symbol|String|StringIO] fname data source to read from
80
85
  # :stdin # read from standard input instead of file
81
86
  #
82
- def data_from_libsvm(fname=:stdin)
83
- data = {}
87
+ def data_from_libsvm(fname=:stdin)
88
+ ifs = get_ifs(fname)
84
89
 
85
- if fname == :stdin
86
- ifs = $stdin
87
- elsif not File.exists? fname
88
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
89
- " File '#{fname}' does not exist!"
90
- else
91
- ifs = File.open(fname)
92
- end
90
+ data = {}
93
91
 
94
92
  ifs.each_line do |ln|
95
93
  label, *features = ln.chomp.split(/\s+/)
@@ -116,14 +114,10 @@ module FileIO
116
114
  # write to libsvm
117
115
  #
118
116
  # @param [String] fname file to write
119
- # :stdout # write to standard ouput instead of file
117
+ # :stdout # write to standard output
120
118
  #
121
119
  def data_to_libsvm(fname=:stdout)
122
- if fname == :stdout
123
- ofs = $stdout
124
- else
125
- ofs = File.open(fname, 'w')
126
- end
120
+ ofs = get_ofs(fname)
127
121
 
128
122
  # convert class label to integer type
129
123
  k2idx = {}
@@ -153,75 +147,108 @@ module FileIO
153
147
  #
154
148
  # read from csv
155
149
  #
156
- # file should have the format with the first two rows
157
- # specifying features and their data types e.g.
158
- # feat\_name1,feat\_name2,...,feat\_namen
159
- # feat\_type1,feat\_type2,...,feat\_typen
150
+ # if no csv_opts supplied, we assume the CSV file in the following format:
151
+ # first row contains feature names with last column being class name
152
+ #
153
+ # feat_name1,feat_name2,...,class_name
154
+ #
155
+ # second row contains feature types with last column being class type
156
+ #
157
+ # feat_type1,feat_type2,...,class_type
160
158
  #
161
- # and the remaing rows showing data e.g.
162
- # class\_label,feat\_value1,feat\_value2,...,feat\_value3
163
- # ...
159
+ # and the remaining rows containing feature values with last column being class label
160
+ #
161
+ # feat_value1,feat_value2,...,class_label
162
+ # ...
164
163
  #
165
164
  # allowed feature types (case-insensitive) are:
166
- # INTEGER, REAL, NUMERIC, CONTINUOUS, STRING, NOMINAL, CATEGORICAL
165
+ # INTEGER, REAL, NUMERIC, CONTINUOUS, DOUBLE, FLOAT, STRING, NOMINAL, CATEGORICAL
167
166
  #
168
- # @param [String] fname file to read from
169
- # :stdin # read from standard input instead of file
167
+ # @param [Symbol|String|StringIO] fname data source to read from
168
+ # :stdin # read from standard input
169
+ # @param [Hash] csv_opts named arguments for csv options
170
+ # :feature\_name\_row => 1, # (first) row that contains feature names
171
+ # :feature\_type\_row => 2, # (second) row that contains feature types
172
+ # :class\_label\_column => 0 # (last) column that contains class labels
173
+ # :feature\_name2type => {}, # user-supplied hash containing feature name-type pairs
174
+ # if no rows specify them. feature must be in the same order
175
+ # as it appears in the dataset
170
176
  #
171
- # @note missing values are allowed, and feature types are stored as lower-case symbols
177
+ # @note missing values are allowed
172
178
  #
173
- def data_from_csv(fname=:stdin)
174
- data = {}
179
+ def data_from_csv(fname=:stdin, csv_opts = {})
180
+ ifs = get_ifs(fname)
175
181
 
176
- if fname == :stdin
177
- ifs = $stdin
178
- elsif not File.exists? fname
179
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
180
- " File '#{fname}' does not exist!"
181
- else
182
- ifs = File.open(fname)
182
+ csv_opts = { # default options, new opts will override old ones
183
+ :feature_name_row => 1, # first row contains feature names
184
+ :feature_type_row => 2, # second row contains feature types
185
+ :class_label_column => 0, # last column contains class labels
186
+ :feature_name2type => {} # user-supplied feature name-type pairs
187
+ }.merge(csv_opts) if csv_opts and csv_opts.class == Hash
188
+
189
+ feature_name_row = csv_opts[:feature_name_row]
190
+ feature_type_row = csv_opts[:feature_type_row]
191
+ class_label_column = csv_opts[:class_label_column]
192
+ feature_name2type = csv_opts[:feature_name2type]
193
+
194
+ # user-supplied feature name-type pairs, this is useful
195
+ # when file contains no specific rows for feature names and types
196
+ if feature_name2type and not feature_name2type.empty?
197
+ features = feature_name2type.keys.collect { |n| n.to_sym }
198
+ types = feature_name2type.values.collect { |t| t.downcase.to_sym }
199
+ # disable name and type rows
200
+ feature_name_row, feature_type_row = nil, nil
183
201
  end
184
202
 
185
- first_row, second_row = true, true
186
- features, types = [], []
203
+ data = {}
187
204
 
188
205
  ifs.each_line do |ln|
189
- if first_row # first row
190
- first_row = false
191
- features = ln.chomp.split(/,/).to_sym
192
- elsif second_row # second row
193
- second_row = false
206
+ next if ln.blank?
207
+
208
+ cells = ln.chomp.split(/,/)
209
+
210
+ # feature names
211
+ if ifs.lineno == feature_name_row
212
+ class_name = cells.delete_at(class_label_column-1) # simply useless
213
+ features = cells.to_sym
214
+ # feature types
215
+ elsif ifs.lineno == feature_type_row
216
+ class_type = cells.delete_at(class_label_column-1) # simply useless
194
217
  # store feature type as lower-case symbol
195
- types = ln.chomp.split(/,/).collect { |t| t.downcase.to_sym }
196
- if not types.size == features.size
197
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
198
- " the first two rows must have same number of fields"
199
- end
218
+ types = cells.collect { |t| t.downcase.to_sym }
200
219
  else # data rows
201
- label, *fvs = ln.chomp.split(/,/)
202
- label = label.to_sym
203
- data[label] ||= []
220
+ if class_label_column <= cells.size and
221
+ features.size+1 == cells.size and
222
+ types.size+1 == cells.size
223
+ class_label = cells.delete_at(class_label_column-1).to_sym
224
+ data[class_label] ||= []
225
+ # feature values
226
+ fvs = cells
227
+ else
228
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
229
+ " invalid csv format!"
230
+ end
204
231
 
205
232
  fs = {}
206
233
  fvs.each_with_index do |v, i|
207
234
  next if v.empty? # missing value
208
- feat_type = types[i]
209
- if feat_type == :integer
235
+ type = types[i]
236
+ if type == :integer
210
237
  v = v.to_i
211
- elsif [:real, :numeric, :continuous].include? feat_type
238
+ elsif [:real, :numeric, :continuous, :float, :double].include? type
212
239
  v = v.to_f
213
- elsif [:string, :nominal, :categorical].include? feat_type
240
+ elsif [:string, :nominal, :categorical].include? type
214
241
  #
215
242
  else
216
243
  abort "[#{__FILE__}@#{__LINE__}]: \n"+
217
- " please specify correct type "+
218
- "for each feature in the 2nd row"
244
+ " invalid feature type '#{type}', must be one of the following: \n"+
245
+ " integer, real, numeric, float, double, continuous, categorical, string, nominal"
219
246
  end
220
247
 
221
248
  fs[features[i]] = v
222
249
  end
223
250
 
224
- data[label] << fs
251
+ data[class_label] << fs
225
252
  end
226
253
  end
227
254
 
@@ -230,9 +257,10 @@ module FileIO
230
257
 
231
258
  set_data(data)
232
259
  set_features(features)
233
- # set feature type
260
+
261
+ # feature name-type pairs
234
262
  features.each_with_index do |f, i|
235
- set_opt(f, types[i])
263
+ set_feature_type(f, types[i])
236
264
  end
237
265
  end # data_from_csv
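Note on the hunk above: `class_label_column` is 1-based, and the default of `0` selects the last column because the code calls `delete_at(class_label_column - 1)`, which becomes `delete_at(-1)`. The sketch below reproduces only that default layout (row 1 names, row 2 types, class label last) on a `StringIO`. `parse_csv_sketch` and `convert_cell` are hypothetical helpers, not FSelector methods. One deliberate difference: the sketch splits with `split(',', -1)` so trailing empty cells survive, whereas the hunk uses `split(/,/)`, which drops them.

```ruby
require 'stringio'

# Hypothetical helper: convert one raw cell according to the
# lower-case type symbol declared for its column.
def convert_cell(value, type)
  case type
  when :integer then value.to_i
  when :real, :numeric, :continuous, :float, :double then value.to_f
  else value # string, nominal, categorical stay as strings
  end
end

# Hypothetical reader for the default layout described in the hunk.
# class_label_column is 1-based; 0 means "last column".
def parse_csv_sketch(io, class_label_column = 0)
  features, types, data = [], [], {}

  io.each_line do |ln|
    next if ln.strip.empty?
    cells = ln.chomp.split(',', -1) # -1 keeps trailing empty cells

    if io.lineno == 1
      cells.delete_at(class_label_column - 1) # drop the class name
      features = cells.map { |n| n.to_sym }
    elsif io.lineno == 2
      cells.delete_at(class_label_column - 1) # drop the class type
      types = cells.map { |t| t.downcase.to_sym }
    else
      label = cells.delete_at(class_label_column - 1).to_sym
      sample = {}
      cells.each_with_index do |v, i|
        next if v.empty? # missing value
        sample[features[i]] = convert_cell(v, types[i])
      end
      (data[label] ||= []) << sample
    end
  end

  [features, types, data]
end
```

Passing `1` as the second argument treats the first column as the class label instead, which matches the 1.2.0 layout shown on the removed lines.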
238
266
 
@@ -240,35 +268,30 @@ module FileIO
240
268
  #
241
269
  # write to csv
242
270
  #
243
- # file has the format with the first two rows
244
- # specifying features and their data types
245
- # and the remaing rows showing data
271
+ # output CSV file has the same format as the default input for data\_from\_csv()
246
272
  #
247
273
  # @param [String] fname file to write
248
- # :stdout # write to standard ouput instead of file
274
+ # :stdout # write to standard output
249
275
  #
250
276
  def data_to_csv(fname=:stdout)
251
- if fname == :stdout
252
- ofs = $stdout
253
- else
254
- ofs = File.open(fname, 'w')
255
- end
256
-
277
+ ofs = get_ofs(fname)
278
+
279
+ # feature names
257
280
  ofs.puts get_features.join(',')
281
+ # feature types
258
282
  ofs.puts get_features.collect { |f|
259
- get_opt(f) || :string
283
+ get_feature_type(f) || :string
260
284
  }.join(',')
261
285
 
262
286
  each_sample do |k, s|
263
- ofs.print "#{k}"
264
287
  each_feature do |f|
265
288
  if s.has_key? f
266
- ofs.print ",#{s[f]}"
289
+ ofs.print "#{s[f]},"
267
290
  else
268
291
  ofs.print ","
269
292
  end
270
293
  end
271
- ofs.puts
294
+ ofs.puts "#{k}"
272
295
  end
273
296
 
274
297
  # close file
@@ -279,25 +302,18 @@ module FileIO
279
302
  #
280
303
  # read from WEKA ARFF file
281
304
  #
282
- # @param [String] fname file to read from
283
- # :stdin # read from standard input instead of file
305
+ # @param [Symbol|String|StringIO] fname data source to read from
306
+ # :stdin # read from standard input
284
307
  # @note it's ok if string contains spaces quoted by quote_char
285
308
  #
286
309
  def data_from_weka(fname=:stdin, quote_char='"')
287
- data = {}
288
-
289
- if fname == :stdin
290
- ifs = $stdin
291
- elsif not File.exists? fname
292
- abort "[#{__FILE__}@#{__LINE__}]: \n"+
293
- " File '#{fname}' does not exist!"
294
- else
295
- ifs = File.open(fname)
296
- end
310
+ ifs = get_ifs(fname)
297
311
 
298
312
  relation, features, classes, types, comments = '', [], [], [], []
299
313
  has_class, has_data = false, false
300
314
 
315
+ data = {}
316
+
301
317
  ifs.each_line do |ln|
302
318
  next if ln.blank? # blank lines
303
319
 
@@ -312,7 +328,7 @@ module FileIO
312
328
  # class attribute
313
329
  elsif ln =~ /^@ATTRIBUTE\s+class\s+{(.+)}/i
314
330
  has_class = true
315
- classes = $1.split_me(/,\s*/, quote_char).to_sym
331
+ classes = $1.strip.split_me(/,\s*/, quote_char).to_sym
316
332
  classes.each { |k| data[k] = [] }
317
333
  # feature attribute (nominal)
318
334
  elsif ln =~ /^@ATTRIBUTE\s+(\S+)\s+{(.+)}/i
@@ -320,7 +336,7 @@ module FileIO
320
336
  features << f
321
337
  #$2.split_me(/,\s*/, quote_char) # feature nominal values
322
338
  types << :nominal
323
- # feature attribute (integer, real, numeric, string, date)
339
+ # feature attribute (categorical, integer, real, continuous, numeric, float, double, string, date)
324
340
  elsif ln =~ /^@ATTRIBUTE/i
325
341
  tmp, v1, v2 = ln.split_me(/\s+/, quote_char)
326
342
  f = v1.to_sym
@@ -376,10 +392,12 @@ module FileIO
376
392
  set_data(data)
377
393
  set_classes(classes)
378
394
  set_features(features)
379
- set_opt(:relation, relation)
395
+ # feature name-type pairs
380
396
  features.each_with_index do |f, i|
381
- set_opt(f, types[i])
397
+ set_feature_type(f, types[i])
382
398
  end
399
+
400
+ set_opt(:relation, relation)
383
401
  set_opt(:comments, comments) if not comments.empty?
384
402
  end # data_from_weka
385
403
 
@@ -388,16 +406,12 @@ module FileIO
388
406
  # write to WEKA ARFF file
389
407
  #
390
408
  # @param [String] fname file to write
391
- # :stdout # write to standard ouput instead of file
409
+ # :stdout # write to standard output
392
410
  # @param [Symbol] format sparse or regular ARFF
393
411
  # :sparse # sparse ARFF, otherwise regular ARFF
394
412
  #
395
413
  def data_to_weka(fname=:stdout, format=nil)
396
- if fname == :stdout
397
- ofs = $stdout
398
- else
399
- ofs = File.open(fname, 'w')
400
- end
414
+ ofs = get_ofs(fname)
401
415
 
402
416
  # comments
403
417
  comments = get_opt(:comments)
@@ -419,7 +433,7 @@ module FileIO
419
433
  # feature attribute
420
434
  each_feature do |f|
421
435
  ofs.print "@ATTRIBUTE #{f} "
422
- type = get_opt(f) # feature type
436
+ type = get_feature_type(f)
423
437
  if type
424
438
  if type == :nominal
425
439
  ofs.puts "{#{get_feature_values(f).uniq.sort.join(',')}}"
@@ -464,10 +478,73 @@ module FileIO
464
478
 
465
479
  # close file
466
480
  ofs.close if not ofs == $stdout
467
- end
481
+ end # data_to_weka
482
+
483
+
484
+ # read data from url
485
+ #
486
+ # @param [String] url url of on-line dataset
487
+ # @param [Symbol] format allowed formats are:
488
+ # :libsvm # LibSVM file
489
+ # :csv # csv file
490
+ # :weka # Weka ARFF file
491
+ # @param [Any] args arguments associated with format
492
+ #
493
+ def data_from_url(url, format, *args)
494
+ format = format.downcase.to_sym
495
+
496
+ if not [:libsvm, :csv, :weka].include? format
497
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
498
+ " only CSV, LibSVM and Weka file formats are supported!"
499
+ end
500
+
501
+ uri = URI.parse(URI.encode(url))
502
+
503
+ data_src = StringIO.new(uri.read)
504
+
505
+ if format == :csv
506
+ data_from_csv(data_src, *args)
507
+ elsif format == :libsvm
508
+ data_from_libsvm(data_src)
509
+ else # weka
510
+ data_from_weka(data_src, *args)
511
+ end
512
+ end # data_from_url
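Note on the hunk above: `data_from_url` downloads the whole body, wraps it in a `StringIO`, and hands it to the reader for the requested format. The sketch below shows only the validate-and-wrap step. `remote_dataset_io` and its injectable `fetch` argument are assumptions made so the example can run offline; they are not part of FSelector. The sketch raises `ArgumentError` where the hunk calls `abort`, and it skips `URI.encode`, which was removed in Ruby 3.0.

```ruby
require 'open-uri'
require 'stringio'

SUPPORTED_FORMATS = [:libsvm, :csv, :weka]

# Hypothetical helper: validate the format, fetch the body, and wrap it
# in a StringIO so a line-oriented reader can consume it like a file.
# `fetch` defaults to open-uri's URI#read and can be swapped for a stub.
def remote_dataset_io(url, format, fetch = lambda { |u| URI.parse(u).read })
  format = format.to_s.downcase.to_sym

  unless SUPPORTED_FORMATS.include?(format)
    raise ArgumentError, "unsupported format '#{format}', " \
                         "expected one of #{SUPPORTED_FORMATS.inspect}"
  end

  [format, StringIO.new(fetch.call(url))]
end
```

A caller would then dispatch on the returned format symbol, as the `if format == :csv ... elsif format == :libsvm ... else` chain in the hunk does.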
468
513
 
469
514
  private
470
515
 
516
+ # get the input file handler
517
+ def get_ifs(fname)
518
+ # read from standard input by default
519
+ if fname == :stdin
520
+ ifs = $stdin
521
+ # read from string if it is a StringIO
522
+ elsif fname.class == StringIO
523
+ ifs = fname
524
+ # read from file if file exists
525
+ elsif File.exists? fname
526
+ ifs = File.open(fname)
527
+ else
528
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
529
+ " invalid data source!"
530
+ end
531
+
532
+ ifs
533
+ end
534
+
535
+
536
+ # get the output file handler
537
+ def get_ofs(fname)
538
+ if fname == :stdout
539
+ ofs = $stdout
540
+ else
541
+ ofs = File.open(fname, 'w')
542
+ end
543
+
544
+ ofs
545
+ end
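Note on the two helpers above: `get_ifs` centralizes the choice between standard input, an in-memory `StringIO`, and a file on disk, which is what lets `data_from_url` reuse the existing readers. A sketch of the same dispatch on current Ruby follows. `input_stream_for` is a hypothetical name, and the sketch raises instead of calling `abort`. It uses `File.exist?` because `File.exists?`, used in the hunk, was removed in Ruby 3.2.

```ruby
require 'stringio'

# Hypothetical equivalent of get_ifs: map a data source to a readable stream.
def input_stream_for(source)
  case source
  when :stdin
    $stdin            # read from standard input
  when StringIO
    source            # already an in-memory stream, pass it through
  when String
    unless File.exist?(source)
      raise ArgumentError, "invalid data source: '#{source}'"
    end
    File.open(source) # caller is responsible for closing the file
  else
    raise ArgumentError, "invalid data source: #{source.inspect}"
  end
end
```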
546
+
547
+
471
548
  # handle and add each feature for WEKA format
472
549
  #
473
550
  # @param [Hash] fs sample that stores feature and its value
@@ -480,14 +557,16 @@ module FileIO
480
557
  return
481
558
  elsif type == :integer
482
559
  fs[f] = v.to_i
483
- elsif type == :real or type == :numeric
560
+ elsif [:real, :numeric, :float, :double, :continuous].include? type
484
561
  fs[f] = v.to_f
485
- elsif type == :string or type == :nominal
562
+ elsif [:categorical, :string, :nominal].include? type
486
563
  fs[f] = v
487
564
  elsif type == :date # convert into integer
488
565
  fs[f] = (DateTime.parse(v)-DateTime.new(1970,1,1)).to_i
489
566
  else
490
- return
567
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
568
+ " invalid feature type '#{type}', must be one of the following: \n"+
569
+ " integer, real, numeric, float, double, continuous, categorical, string, nominal, date"
491
570
  end
492
571
  end # add_feature_weka
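Note on the hunk above: unknown feature types now abort instead of being silently skipped, and the numeric and categorical branches accept the wider set of type names. The sketch below mirrors that coercion table in isolation. `coerce_value` and the two constants are hypothetical names, and the sketch raises `ArgumentError` where the hunk calls `abort`. The `:date` branch shows that subtracting two `DateTime` values yields a count of days, so the stored integer is whole days since 1970-01-01, not seconds.

```ruby
require 'date'

NUMERIC_TYPES     = [:real, :numeric, :float, :double, :continuous]
CATEGORICAL_TYPES = [:categorical, :string, :nominal]

# Hypothetical helper: convert a raw string according to its declared type.
def coerce_value(value, type)
  if type == :integer
    value.to_i
  elsif NUMERIC_TYPES.include?(type)
    value.to_f
  elsif CATEGORICAL_TYPES.include?(type)
    value
  elsif type == :date
    # DateTime subtraction returns a Rational number of days
    (DateTime.parse(value) - DateTime.new(1970, 1, 1)).to_i
  else
    raise ArgumentError, "invalid feature type '#{type}'"
  end
end
```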
493
572
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: fselector
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.3.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-05-21 00:00:00.000000000 Z
12
+ date: 2012-05-24 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rinruby
16
- requirement: &23863908 !ruby/object:Gem::Requirement
16
+ requirement: &24934116 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
21
21
  version: 2.0.2
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *23863908
24
+ version_requirements: *24934116
25
25
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
26
26
  algorithms and related functions into one single package. Welcome to contact me
27
27
  (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -48,6 +48,9 @@ files:
48
48
  - lib/fselector/algo_base/base_discrete.rb
49
49
  - lib/fselector/algo_base/base_Relief.rb
50
50
  - lib/fselector/algo_base/base_ReliefF.rb
51
+ - lib/fselector/algo_both/LasVegasFilter.rb
52
+ - lib/fselector/algo_both/LasVegasIncremental.rb
53
+ - lib/fselector/algo_both/Random.rb
51
54
  - lib/fselector/algo_continuous/BSS_WSS.rb
52
55
  - lib/fselector/algo_continuous/CFS_c.rb
53
56
  - lib/fselector/algo_continuous/F-Test.rb
@@ -75,8 +78,6 @@ files:
75
78
  - lib/fselector/algo_discrete/INTERACT.rb
76
79
  - lib/fselector/algo_discrete/J-Measure.rb
77
80
  - lib/fselector/algo_discrete/KL-Divergence.rb
78
- - lib/fselector/algo_discrete/LasVegasFilter.rb
79
- - lib/fselector/algo_discrete/LasVegasIncremental.rb
80
81
  - lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb
81
82
  - lib/fselector/algo_discrete/McNemarsTest.rb
82
83
  - lib/fselector/algo_discrete/MutualInformation.rb
@@ -85,7 +86,6 @@ files:
85
86
  - lib/fselector/algo_discrete/Power.rb
86
87
  - lib/fselector/algo_discrete/Precision.rb
87
88
  - lib/fselector/algo_discrete/ProbabilityRatio.rb
88
- - lib/fselector/algo_discrete/Random.rb
89
89
  - lib/fselector/algo_discrete/ReliefF_d.rb
90
90
  - lib/fselector/algo_discrete/Relief_d.rb
91
91
  - lib/fselector/algo_discrete/Sensitivity.rb