RubyGems - fselector - Versions diffs - 1.2.0 → 1.3.0 - Mend

fselector 1.2.0 → 1.3.0


Files changed (18)

data/ChangeLog +7 -0
data/README.md +53 -52
data/lib/fselector.rb +10 -1
data/lib/fselector/algo_base/base.rb +62 -9
data/lib/fselector/algo_base/base_CFS.rb +0 -9
data/lib/fselector/algo_base/base_ReliefF.rb +0 -8
data/lib/fselector/algo_base/base_discrete.rb +0 -8
data/lib/fselector/{algo_discrete → algo_both}/LasVegasFilter.rb +2 -2
data/lib/fselector/{algo_discrete → algo_both}/LasVegasIncremental.rb +2 -2
data/lib/fselector/{algo_discrete → algo_both}/Random.rb +3 -2
data/lib/fselector/algo_continuous/ReliefF_c.rb +0 -8
data/lib/fselector/algo_continuous/Relief_c.rb +0 -8
data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +0 -8
data/lib/fselector/algo_discrete/InformationGain.rb +1 -9
data/lib/fselector/discretizer.rb +6 -1
data/lib/fselector/ensemble.rb +3 -3
data/lib/fselector/fileio.rb +181 -102
metadata +7 -7

data/ChangeLog CHANGED

@@ -1,3 +1,10 @@
+2012-05-24	version 1.3.0
+  * update clear\_vars() in Base by use of Ruby metaprogramming, this trick avoids repetitive overriding it in each derived subclass
+  * re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to dataset with either discrete or continuous features, even with mixed type
+  * update data\_from\_csv() so that it can read CSV file more flexibly. note by default, the last column is class label
+  * add data\_from\_url() to read on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a url
 2012-05-20	version 1.2.0
   * add KS-Test algorithm for continuous feature

data/README.md CHANGED

@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 1.2.0
-**Release Date**: 2012-05-20
+**Latest Version**: 1.3.0
+**Release Date**: 2012-05-24
 Synopsis
 --------
@@ -38,56 +38,57 @@ Feature List
  - csv
  - libsvm
  - weka ARFF
+ - on-line dataset in one of the above three formats (read only)
  - random data (read only, for test purpose)
 **2. available feature selection/ranking algorithms**
-    algorithm                        shortcut    algo_type  feature_type          applicability
-    --------------------------------------------------------------------------------------------------------
-    Accuracy                         Acc         weighting  discrete              multi-class
-    AccuracyBalanced                 Acc2        weighting  discrete              multi-class
-    BiNormalSeparation               BNS         weighting  discrete              multi-class
-    CFS_d                            CFS_d       subset     discrete              multi-class
-    ChiSquaredTest                   CHI         weighting  discrete              multi-class
-    CorrelationCoefficient           CC          weighting  discrete              multi-class
-    DocumentFrequency                DF          weighting  discrete              multi-class
-    F1Measure                        F1          weighting  discrete              multi-class
-    FishersExactTest                 FET         weighting  discrete              multi-class
-    FastCorrelationBasedFilter       FCBF        subset     discrete              multi-class
-    GiniIndex                        GI          weighting  discrete              multi-class
-    GMean                            GM          weighting  discrete              multi-class
-    GSSCoefficient                   GSS         weighting  discrete              multi-class
-    InformationGain                  IG          weighting  discrete              multi-class
-    INTERACT                         INTERACT    subset     discrete              multi-class
-    JMeasure                         JM          weighting  discrete              multi-class
-    KLDivergence                     KLD         weighting  discrete              multi-class
-    LasVegasFilter                   LVF         subset     discrete, continuous  multi-class
-    LasVegasIncremental              LVI         subset     discrete, continuous  multi-class
-    MatthewsCorrelationCoefficient   MCC, PHI    weighting  discrete              multi-class
-    McNemarsTest                     MNT         weighting  discrete              multi-class
-    OddsRatio                        OR          weighting  discrete              multi-class
-    OddsRatioNumerator               ORN         weighting  discrete              multi-class
-    PhiCoefficient                   PHI         weighting  discrete              multi-class
-    Power                            Power       weighting  discrete              multi-class
-    Precision                        Precision   weighting  discrete              multi-class
-    ProbabilityRatio                 PR          weighting  discrete              multi-class
-    Random                           Random      weighting  discrete              multi-class
-    Recall                           Recall      weighting  discrete              multi-class
-    Relief_d                         Relief_d    weighting  discrete              two-class, no missing data
-    ReliefF_d                        ReliefF_d   weighting  discrete              multi-class
-    Sensitivity                      SN, Recall  weighting  discrete              multi-class
-    Specificity                      SP          weighting  discrete              multi-class
-    SymmetricalUncertainty           SU          weighting  discrete              multi-class
-    BetweenWithinClassesSumOfSquare  BSS_WSS     weighting  continuous            multi-class
-    CFS_c                            CFS_c       subset     continuous            multi-class
-    FTest                            FT          weighting  continuous            multi-class
-    KS_CCBF                          KS_CCBF     subset     continuous            multi-class
-    KSTest                           KST         weighting  continuous            two-class
-    PMetric                          PM          weighting  continuous            two-class
-    Relief_c                         Relief_c    weighting  continuous            two-class, no missing data
-    ReliefF_c                        ReliefF_c   weighting  continuous            multi-class
-    TScore                           TS          weighting  continuous            two-class
-    WilcoxonRankSum                  WRS         weighting  continuous            two-class
+    algorithm                        shortcut    algo_type  applicability  feature_type
+    --------------------------------------------------------------------------------------------------
+    Accuracy                         Acc         weighting  multi-class    discrete
+    AccuracyBalanced                 Acc2        weighting  multi-class    discrete
+    BiNormalSeparation               BNS         weighting  multi-class    discrete
+    CFS_d                            CFS_d       subset     multi-class    discrete
+    ChiSquaredTest                   CHI         weighting  multi-class    discrete
+    CorrelationCoefficient           CC          weighting  multi-class    discrete
+    DocumentFrequency                DF          weighting  multi-class    discrete
+    F1Measure                        F1          weighting  multi-class    discrete
+    FishersExactTest                 FET         weighting  multi-class    discrete
+    FastCorrelationBasedFilter       FCBF        subset     multi-class    discrete
+    GiniIndex                        GI          weighting  multi-class    discrete
+    GMean                            GM          weighting  multi-class    discrete
+    GSSCoefficient                   GSS         weighting  multi-class    discrete
+    InformationGain                  IG          weighting  multi-class    discrete
+    INTERACT                         INTERACT    subset     multi-class    discrete
+    JMeasure                         JM          weighting  multi-class    discrete
+    KLDivergence                     KLD         weighting  multi-class    discrete
+    MatthewsCorrelationCoefficient   MCC, PHI    weighting  multi-class    discrete
+    McNemarsTest                     MNT         weighting  multi-class    discrete
+    OddsRatio                        OR          weighting  multi-class    discrete
+    OddsRatioNumerator               ORN         weighting  multi-class    discrete
+    PhiCoefficient                   PHI         weighting  multi-class    discrete
+    Power                            Power       weighting  multi-class    discrete
+    Precision                        Precision   weighting  multi-class    discrete
+    ProbabilityRatio                 PR          weighting  multi-class    discrete
+    Recall                           Recall      weighting  multi-class    discrete
+    Relief_d                         Relief_d    weighting  two-class      discrete
+    ReliefF_d                        ReliefF_d   weighting  multi-class    discrete
+    Sensitivity                      SN, Recall  weighting  multi-class    discrete
+    Specificity                      SP          weighting  multi-class    discrete
+    SymmetricalUncertainty           SU          weighting  multi-class    discrete
+    BetweenWithinClassesSumOfSquare  BSS_WSS     weighting  multi-class    continuous
+    CFS_c                            CFS_c       subset     multi-class    continuous
+    FTest                            FT          weighting  multi-class    continuous
+    KS_CCBF                          KS_CCBF     subset     multi-class    continuous
+    KSTest                           KST         weighting  two-class      continuous
+    PMetric                          PM          weighting  two-class      continuous
+    Relief_c                         Relief_c    weighting  two-class      continuous
+    ReliefF_c                        ReliefF_c   weighting  multi-class    continuous
+    TScore                           TS          weighting  two-class      continuous
+    WilcoxonRankSum                  WRS         weighting  two-class      continuous
+    LasVegasFilter                   LVF         subset     multi-class    discrete, continuous, mixed
+    LasVegasIncremental              LVI         subset     multi-class    discrete, continuous, mixed
+    Random                           Random      weighting  multi-class    discrete, continuous, mixed
   **note for feature selection interface:**
   there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
@@ -132,7 +133,7 @@ To install FSelector, use the following command:
     $ gem install fselector
-  **note:** Start from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
+  **note:** From version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
   as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
   which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
   on statistical test. To this end, please pre-install the R package. RinRuby should have been
@@ -194,7 +195,7 @@ Usage
 	# creating an ensemble of feature selectors by using
 	# a single feature selection algorithm (INTERACT)
-	# by instance perturbation (e.g. bootstrap sampling)
+	# by instance perturbation (e.g. random sampling)
 	# test for the type of feature subset selection algorithms
     r = FSelector::INTERACT.new(0.0001)
@@ -258,8 +259,8 @@ Usage
     # the Information Gain (IG) algorithm requires data with discrete feature
     r = FSelector::IG.new
-    # but the Iris data set contains continuous features (under the test/ directory)
-    r.data_from_csv('test/iris.csv')
+    # but the Iris data set contains continuous features
+    r.data_from_url('http://repository.seasr.org/Datasets/UCI/arff/iris.arff', :weka)
     # let's first discretize it by ChiMerge algorithm at alpha=0.10
     # then perform feature selection as usual

data/lib/fselector.rb CHANGED

@@ -7,7 +7,7 @@ R.eval 'options(warn = -1)' # suppress R warnings
 #
 module FSelector
   # module version
-  VERSION = '1.2.0'
+  VERSION = '1.3.0'
 end
 # the root dir of FSelector
@@ -52,6 +52,15 @@ Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
   require f
 end
+#
+# algorithms for handling both discrete and continuous feature
+#
+Dir.glob("#{ROOT}/fselector/algo_both/*").each do |f|
+  require f
+end
 #
 # feature selection use an ensemble of algorithms
 #

data/lib/fselector/algo_base/base.rb CHANGED

@@ -192,6 +192,34 @@ module FSelector
     end
+    #
+    # get the feature type stored in @types
+    #
+    # @param [Symbol] feature feature of interest
+    #   return all feature name-type pairs if nil,
+    #   otherwise return the type for the feature of interest
+    #
+    def get_feature_type(feature=nil)
+      if @types
+        feature ? @types[feature] : @types
+      else
+        nil
+      end
+    end
+    #
+    # set feature name-type pair
+    #
+    # @param [Symbol] feature feature name
+    # @param [Symbol] type feature type
+    #
+    def set_feature_type(feature, type)
+      @types ||= {}
+      @types[feature] = type
+    end
     #
     # get internal data
     #
@@ -241,7 +269,11 @@ module FSelector
     # @note return all non-data as a Hash if key == nil
     #
     def get_opt(key=nil)
-      key ? @opts[key] : @opts
+      if @opts
+        key ? @opts[key] : @opts
+      else
+        nil
+      end
     end
@@ -404,17 +436,38 @@ module FSelector
     private
     #
-    # clear variables when data structure is altered, this is
-    # useful when data structure has changed while
-    # you still want to use the same instance
+    # clear instance variables when data structure is altered,
+    # this is useful when data structure has changed while you
+    # still want to use the same instance
     #
-    # @note the variables of original data structure (@data) and
-    #   algorithm type (@algo_type) are retained
+    # @note only the instance variables used in the initialize()
+    #   such as @data will be retained. class instance variable
+    #   @algo_type will also be retained. This trick by use of
+    #   Ruby metaprogramming avoids the repetitive overriding of
+    #   clear_vars() in each derived subclass
     #
     def clear_vars
-      @classes, @features, @fvs = nil, nil, nil
-      @scores, @ranks, @sz = nil, nil, nil
-      @cv, @fvs, @opts = nil, nil, {}
+      # instance vars appeared as arguments (with the same name) in initialize()
+      instance_var_in_new = []
+      constructor = method(:initialize)
+      if constructor.respond_to? :parameters
+        constructor.parameters.each do |p|
+          instance_var_in_new << "@#{p[1]}".to_sym
+        end
+      end
+      instance_variables.each do |var|
+        # retain instance vars appeared as arguments in initialize()
+        # such as @data
+        next if instance_var_in_new.include? var
+        # retain feature types, which may be needed
+        # by CSV and Weka ARFF file
+        next if var == :@types
+        # clear all other instance variable
+        instance_variable_set(var, nil)
+      end
     end
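
The new clear\_vars() derives the list of variables to keep from the parameter names of initialize(), so subclasses no longer need their own overrides. A minimal sketch of that idea follows, assuming Ruby 1.9.2 or later for `Method#parameters`. `SketchBase`, `SketchScorer`, `compute!` and `reset!` are hypothetical names used only for this illustration.

```ruby
# Hypothetical classes; only the clear_vars idea mirrors the diff above.
class SketchBase
  def initialize(data = nil)
    @data = data
  end

  # public wrapper so the private method can be exercised
  def reset!
    clear_vars
  end

  private

  def clear_vars
    # instance variables named after initialize() parameters, e.g. :@data
    keep = method(:initialize).parameters.map { |_kind, name| :"@#{name}" }
    keep << :@types # feature types are retained as well

    instance_variables.each do |var|
      next if keep.include?(var)
      instance_variable_set(var, nil) # clear everything else
    end
  end
end

# A subclass adds its own state but does not override clear_vars.
class SketchScorer < SketchBase
  attr_reader :data, :scores, :types

  def compute!
    @scores = { :f1 => 0.5 }
    @types  = { :f1 => :integer }
  end
end

s = SketchScorer.new([1, 2, 3])
s.compute!
s.reset!
```

After `reset!`, `@data` and `@types` are unchanged while `@scores` is nil, even though `SketchScorer` never mentions clear_vars.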

data/lib/fselector/algo_base/base_CFS.rb CHANGED

@@ -142,15 +142,6 @@ module FSelector
     end # do_rff
-    # override clear\_vars for BaseCFS
-    def clear_vars
-      super
-      @rcf_best, @rff_best = nil, nil
-      @f2rcf, @fs2rff, @f2idx = nil, nil, nil
-    end # clear_vars
   end # class

data/lib/fselector/algo_base/base_ReliefF.rb CHANGED

@@ -151,14 +151,6 @@ module FSelector
     end
-    # override clear\_vars for BaseReliefF
-    def clear_vars
-      super
-      @f2mvp = nil
-    end # clear_vars
   end # class

data/lib/fselector/algo_base/base_discrete.rb CHANGED

@@ -175,14 +175,6 @@ module FSelector
     end # calc_D
-    # override clear\_vars for BaseDiscrete
-    def clear_vars
-      super
-      @A, @B, @C, @D = nil, nil, nil, nil
-    end # clear_vars
   end # class

data/lib/fselector/{algo_discrete → algo_both}/LasVegasFilter.rb RENAMED

@@ -3,14 +3,14 @@
 #
 module FSelector
 #
-# Las Vegas Filter (LVF) for discrete feature,
+# Las Vegas Filter (LVF) for discrete, continuous or mixed feature,
 # use **select\_feature!** for feature selection
 #
 # @note we only keep one of the equivalently good solutions
 #
 # ref: [Review and Evaluation of Feature Selection Algorithms in Synthetic Problems](http://arxiv.org/abs/1101.2320)
 #
-  class LasVegasFilter < BaseDiscrete
+  class LasVegasFilter < Base
     # include module
     include Consistency

data/lib/fselector/{algo_discrete → algo_both}/LasVegasIncremental.rb RENAMED

@@ -3,12 +3,12 @@
 #
 module FSelector
 #
-# Las Vegas Incremental (LVI) for discrete feature,
+# Las Vegas Incremental (LVI) for discrete, continuous or mixed feature,
 # use **select\_feature!** for feature selection
 #
 # ref: [Incremental Feature Selection](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.8218)
 #
-  class LasVegasIncremental < BaseDiscrete
+  class LasVegasIncremental < Base
     # include module
     include Consistency

data/lib/fselector/{algo_discrete → algo_both}/Random.rb RENAMED

@@ -3,13 +3,14 @@
 #
 module FSelector
 #
-# Random (Rand), no pratical use but can be used as a baseline
+# Random (Rand) for discrete, continuous or mixed feature,
+# no practical use but can be used as a baseline
 #
 #  Rand = rand numbers within [0..1)
 #
 # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
 #
-  class Random < BaseDiscrete
+  class Random < Base
     # this algo outputs weight for each feature
     @algo_type = :feature_weighting

data/lib/fselector/algo_continuous/ReliefF_c.rb CHANGED

@@ -59,14 +59,6 @@ module FSelector
     end # get_normalization_unit
-    # override clear\_vars for ReliefF_c
-    def clear_vars
-      super
-      @f2nu = nil
-    end # clear_vars
   end # class

data/lib/fselector/algo_continuous/Relief_c.rb CHANGED

@@ -47,14 +47,6 @@ module FSelector
     end # get_normalization_unit
-    # override clear\_vars for Relief_c
-    def clear_vars
-      super
-      @f2nu = nil
-    end # clear_vars
   end # class

data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb CHANGED

@@ -121,14 +121,6 @@ module FSelector
       fq
     end # get_next_element
-    # override clear\_vars for FastCorrelationBasedFilter
-    def clear_vars
-      super
-      @f2hf = nil
-    end # clear_vars
   end # class

data/lib/fselector/algo_discrete/InformationGain.rb CHANGED

@@ -41,15 +41,7 @@ module FSelector
       s =  @hc - hcf
       set_feature_score(f, :BEST, s)
-    end # calc_contribution
-    # override clear\_vars for InformationGain
-    def clear_vars
-      super
-      @hc = nil
-    end # clear_vars
+    end # calc_contribution
   end # class

data/lib/fselector/discretizer.rb CHANGED

@@ -474,8 +474,13 @@ module Discretizer
       end
     end
-    # clear vars
+    # clear vars because of data change
     clear_vars
+    # set all feature type as CATEGORICAL
+    each_feature do |f|
+      set_feature_type(f, :categorical)
+    end
   end # discretize_at_cutpoints!

data/lib/fselector/ensemble.rb CHANGED

@@ -258,6 +258,7 @@ module FSelector
       @algo_type = algo.algo_type
     end
+    private
     #
     # get ensemble feature scores
@@ -304,7 +305,6 @@ module FSelector
       #pp ensem_ranks
     end # get_ensemble_ranks
-    private
     #
     # override get\_feature\_subset() for EnsembleSingle,
@@ -417,6 +417,8 @@ module FSelector
       end
     end
+    private
     #
     # get ensemble feature scores
     #
@@ -455,8 +457,6 @@ module FSelector
     end # get_ensemble_ranks
-    private
     #
     # override get\_feature\_subset() for EnsembleMultiple,
     # select a subset of features based on frequency count
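
Moving the bare `private` keyword above get\_ensemble\_scores and get\_ensemble\_ranks changes their visibility, because `private` without arguments applies to every method defined after it in the class body. A small sketch of that rule, using a hypothetical `VisibilitySketch` class:

```ruby
# Hypothetical class showing how a bare `private` affects later definitions.
class VisibilitySketch
  # public entry point
  def select
    scores.keys
  end

  private # every method defined below this line is private

  def scores
    { :f1 => 0.9 }
  end
end

v = VisibilitySketch.new
selected = v.select              # internal call to the private method works
visible  = v.respond_to?(:scores) # the helper is hidden from callers
```

Calling `v.scores` from outside the class raises NoMethodError, which is the effect the relocated keyword has on the ensemble helpers.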

data/lib/fselector/fileio.rb CHANGED

@@ -20,6 +20,11 @@
 # @note class labels and features are treated as symbols
 #
 module FileIO
+  # require the open-uri lib for http/https/ftp request
+  require 'open-uri'
+  # require the stringio lib for converting a string into stringio object
+  require 'stringio'
   #
   # read from random data (read only, for test purpose)
   #
@@ -76,20 +81,13 @@ module FileIO
   # -1  3:1 4:1 ...
   # ....
   #
-  # @param [String] fname file to read from
+  # @param [Symbol|String|StringIO] fname data source to read from
   #   :stdin  # read from standard input instead of file
   #
-  def data_from_libsvm(fname=:stdin)
-    data = {}
+  def data_from_libsvm(fname=:stdin)
+    ifs = get_ifs(fname)
-    if fname == :stdin
-      ifs = $stdin
-    elsif not File.exists? fname
-      abort "[#{__FILE__}@#{__LINE__}]: \n"+
-            "  File '#{fname}' does not exist!"
-    else
-      ifs = File.open(fname)
-    end
+    data = {}
     ifs.each_line do |ln|
       label, *features = ln.chomp.split(/\s+/)
@@ -116,14 +114,10 @@ module FileIO
   # write to libsvm
   #
   # @param [String] fname file to write
-  #   :stdout  # write to standard ouput instead of file
+  #   :stdout  # write to standard output
   #
   def data_to_libsvm(fname=:stdout)
-    if fname == :stdout
-      ofs = $stdout
-    else
-      ofs = File.open(fname, 'w')
-    end
+    ofs = get_ofs(fname)
     # convert class label to integer type
     k2idx = {}
@@ -153,75 +147,108 @@ module FileIO
   #
   # read from csv
   #
-  # file should have the format with the first two rows
-  # specifying features and their data types e.g.
-  # feat\_name1,feat\_name2,...,feat\_namen
-  # feat\_type1,feat\_type2,...,feat\_typen
+  # if no csv_opts supplied, we assume the CSV file in the following format:
+  # first row contains feature names with last column being class name
+  #
+  #     feat_name1,feat_name2,...,class_name
+  #
+  # second row contains feature types with last column being class type
+  #
+  #     feat_type1,feat_type2,...,class_type
   #
-  # and the remaing rows showing data e.g.
-  # class\_label,feat\_value1,feat\_value2,...,feat\_value3
-  # ...
+  # and the remaining rows containing feature values with last column being class label
+  #
+  #     feat_value1,feat_value2,...,class_label
+  #     ...
   #
   # allowed feature types (case-insensitive) are:
-  # INTEGER, REAL, NUMERIC, CONTINUOUS, STRING, NOMINAL, CATEGORICAL
+  # INTEGER, REAL, NUMERIC, CONTINUOUS, DOUBLE, FLOAT, STRING, NOMINAL, CATEGORICAL
   #
-  # @param [String] fname file to read from
-  #   :stdin  # read from standard input instead of file
+  # @param [Symbol|String|StringIO] fname data source to read from
+  #   :stdin  # read from standard input
+  # @param [Hash] csv_opts named arguments for csv options
+  #   :feature\_name\_row => 1,    # (first) row that contains feature names
+  #   :feature\_type\_row => 2,    # (second) row that contains feature types
+  #   :class\_label\_column => 0   # (last) column that contains class labels
+  #   :feature\_name2type => {},   # user-supplied hash containing feature name-type pairs
+  #                                  if no rows specify them. feature must be in the same order
+  #                                  as it appears in the dataset
   #
-  # @note missing values are allowed, and feature types are stored as lower-case symbols
+  # @note missing values are allowed
   #
-  def data_from_csv(fname=:stdin)
-    data = {}
+  def data_from_csv(fname=:stdin, csv_opts = {})
+    ifs = get_ifs(fname)
-    if fname == :stdin
-      ifs = $stdin
-    elsif not File.exists? fname
-      abort "[#{__FILE__}@#{__LINE__}]: \n"+
-            "  File '#{fname}' does not exist!"
-    else
-      ifs = File.open(fname)
+    csv_opts = { # default options, new opts will override old ones
+      :feature_name_row => 1,    # first row contains feature names
+      :feature_type_row => 2,    # second row contains feature types
+      :class_label_column => 0,  # last column contains class labels
+      :feature_name2type => {}   # user-supplied feature name-type pairs
+    }.merge(csv_opts) if csv_opts and csv_opts.class == Hash
+    feature_name_row = csv_opts[:feature_name_row]
+    feature_type_row = csv_opts[:feature_type_row]
+    class_label_column = csv_opts[:class_label_column]
+    feature_name2type = csv_opts[:feature_name2type]
+    # user-supplied feature name-type pairs, this is useful
+    # when file contains no specific rows for feature names and types
+    if feature_name2type and not feature_name2type.empty?
+      features = feature_name2type.keys.collect { |n| n.to_sym }
+      types = feature_name2type.values.collect { |t| t.downcase.to_sym }
+      # disable name and type rows
+      feature_name_row, feature_type_row = nil, nil
     end
-    first_row, second_row = true, true
-    features, types = [], []
+    data = {}
     ifs.each_line do |ln|
-      if first_row # first row
-        first_row = false
-        features = ln.chomp.split(/,/).to_sym
-      elsif second_row # second row
-        second_row = false
+      next if ln.blank?
+      cells = ln.chomp.split(/,/)
+      # feature names
+      if ifs.lineno == feature_name_row
+        class_name = cells.delete_at(class_label_column-1) # simply useless
+        features = cells.to_sym
+      # feature types
+      elsif ifs.lineno == feature_type_row
+        class_type = cells.delete_at(class_label_column-1) # simply useless
         # store feature type as lower-case symbol
-        types = ln.chomp.split(/,/).collect { |t| t.downcase.to_sym }
-        if not types.size == features.size
-          abort "[#{__FILE__}@#{__LINE__}]: \n"+
-                "  the first two rows must have same number of fields"
-        end
+        types = cells.collect { |t| t.downcase.to_sym }
       else # data rows
-        label, *fvs = ln.chomp.split(/,/)
-        label = label.to_sym
-        data[label] ||= []
+        if class_label_column <= cells.size and
+           features.size+1 == cells.size and
+           types.size+1 == cells.size
+          class_label = cells.delete_at(class_label_column-1).to_sym
+          data[class_label] ||= []
+          # feature values
+          fvs = cells
+        else
+          abort "[#{__FILE__}@#{__LINE__}]: \n"+
+                "  invalid csv format!"
+        end
         fs = {}
         fvs.each_with_index do |v, i|
           next if v.empty? # missing value
-          feat_type = types[i]
-          if feat_type == :integer
+          type = types[i]
+          if type == :integer
             v = v.to_i
-          elsif [:real, :numeric, :continuous].include? feat_type
+          elsif [:real, :numeric, :continuous, :float, :double].include? type
             v = v.to_f
-          elsif [:string, :nominal, :categorical].include? feat_type
+          elsif [:string, :nominal, :categorical].include? type
             #
           else
             abort "[#{__FILE__}@#{__LINE__}]: \n"+
-                  "  please specify correct type "+
-                  "for each feature in the 2nd row"
+                  "  invalid feature type '#{type}', must be one of the following: \n"+
+                  "  integer, real, numeric, float, double, continuous, categorical, string, nominal"
           end
           fs[features[i]] = v
         end
-        data[label] << fs
+        data[class_label] << fs
       end
     end
@@ -230,9 +257,10 @@ module FileIO
     set_data(data)
     set_features(features)
-    # set feature type
+    # feature name-type pairs
     features.each_with_index do |f, i|
-      set_opt(f, types[i])
+      set_feature_type(f, types[i])
     end
   end # data_from_csv
@@ -240,35 +268,30 @@ module FileIO
   #
   # write to csv
   #
-  # file has the format with the first two rows
-  # specifying features and their data types
-  # and the remaing rows showing data
+  # output CSV file has the same format as the default input for data\_from\_csv()
   #
   # @param [String] fname file to write
-  #   :stdout  # write to standard ouput instead of file
+  #   :stdout  # write to standard output
   #
   def data_to_csv(fname=:stdout)
-    if fname == :stdout
-      ofs = $stdout
-    else
-      ofs = File.open(fname, 'w')
-    end
+    ofs = get_ofs(fname)
+    # feature names
     ofs.puts get_features.join(',')
+    # feature types
     ofs.puts get_features.collect { |f|
-      get_opt(f) || :string
+      get_feature_type(f) || :string
     }.join(',')
     each_sample do |k, s|
-      ofs.print "#{k}"
       each_feature do |f|
         if s.has_key? f
-          ofs.print ",#{s[f]}"
+          ofs.print "#{s[f]},"
         else
           ofs.print ","
         end
       end
-      ofs.puts
+      ofs.puts "#{k}"
     end
     # close file
@@ -279,25 +302,18 @@ module FileIO
   #
   # read from WEKA ARFF file
   #
-  # @param [String] fname file to read from
-  #   :stdin  # read from standard input instead of file
+  # @param [Symbol|String|StringIO] fname data source to read from
+  #   :stdin  # read from standard input
   # @note it's ok if string contains spaces quoted by quote_char
   #
   def data_from_weka(fname=:stdin, quote_char='"')
-    data = {}
-    if fname == :stdin
-      ifs = $stdin
-    elsif not File.exists? fname
-      abort "[#{__FILE__}@#{__LINE__}]: \n"+
-            "  File '#{fname}' does not exist!"
-    else
-      ifs = File.open(fname)
-    end
+    ifs = get_ifs(fname)
     relation, features, classes, types, comments = '', [], [], [], []
     has_class, has_data = false, false
+    data = {}
     ifs.each_line do |ln|
       next if ln.blank? # blank lines
@@ -312,7 +328,7 @@ module FileIO
       # class attribute
       elsif ln =~ /^@ATTRIBUTE\s+class\s+{(.+)}/i
         has_class = true
-        classes = $1.split_me(/,\s*/, quote_char).to_sym
+        classes = $1.strip.split_me(/,\s*/, quote_char).to_sym
         classes.each { |k| data[k] = [] }
       # feature attribute (nominal)
       elsif ln =~ /^@ATTRIBUTE\s+(\S+)\s+{(.+)}/i
@@ -320,7 +336,7 @@ module FileIO
         features << f
         #$2.split_me(/,\s*/, quote_char) # feature nominal values
         types << :nominal
-      # feature attribute (integer, real, numeric, string, date)
+      # feature attribute (categorical, integer, real, continuous, numeric, float, double, string, date)
       elsif ln =~ /^@ATTRIBUTE/i
         tmp, v1, v2 = ln.split_me(/\s+/, quote_char)
         f = v1.to_sym
@@ -376,10 +392,12 @@ module FileIO
     set_data(data)
     set_classes(classes)
     set_features(features)
-    set_opt(:relation, relation)
+    # feature name-type pairs
     features.each_with_index do |f, i|
-      set_opt(f, types[i])
+      set_feature_type(f, types[i])
     end
+    set_opt(:relation, relation)
     set_opt(:comments, comments) if not comments.empty?
   end # data_from_weka
@@ -388,16 +406,12 @@ module FileIO
   # write to WEKA ARFF file
   #
   # @param [String] fname file to write
-  #   :stdout  # write to standard ouput instead of file
+  #   :stdout  # write to standard output
   # @param [Symbol] format sparse or regular ARFF
   #   :sparse  # sparse ARFF, otherwise regular ARFF
   #
   def data_to_weka(fname=:stdout, format=nil)
-    if fname == :stdout
-      ofs = $stdout
-    else
-      ofs = File.open(fname, 'w')
-    end
+    ofs = get_ofs(fname)
     # comments
     comments = get_opt(:comments)
@@ -419,7 +433,7 @@ module FileIO
     # feature attribute
     each_feature do |f|
       ofs.print "@ATTRIBUTE #{f} "
-      type = get_opt(f) # feature type
+      type = get_feature_type(f)
       if type
         if type == :nominal
           ofs.puts "{#{get_feature_values(f).uniq.sort.join(',')}}"
@@ -464,10 +478,73 @@ module FileIO
     # close file
     ofs.close if not ofs == $stdout
-  end
+  end # data_to_weka
+  # read data from url
+  #
+  # @param [String] url url of on-line dataset
+  # @param [Symbol] format allowed formats are:
+  #   :libsvm  # LibSVM file
+  #   :csv     # csv file
+  #   :weka    # Weka ARFF file
+  # @param [Any] args arguments associated with format
+  #
+  def data_from_url(url, format, *args)
+    format = format.downcase.to_sym
+    if not [:libsvm, :csv, :weka].include? format
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  only CSV, LibSVM and Weka file formats are supported!"
+    end
+    uri = URI.parse(URI.encode(url))
+    data_src = StringIO.new(uri.read)
+    if format == :csv
+      data_from_csv(data_src, *args)
+    elsif format == :libsvm
+      data_from_libsvm(data_src)
+    else # weka
+      data_from_weka(data_src, *args)
+    end
+  end # data_from_url
   private
+  # get the input file handler
+  def get_ifs(fname)
+    # read from standard input by default
+    if fname == :stdin
+      ifs = $stdin
+    # read from string if it is a StringIO
+    elsif fname.class == StringIO
+      ifs = fname
+    # read from file if file exists
+    elsif File.exists? fname
+      ifs = File.open(fname)
+    else
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  invalid data source!"
+    end
+    ifs
+  end
+  # get the output file handler
+  def get_ofs(fname)
+    if fname == :stdout
+      ofs = $stdout
+    else
+      ofs = File.open(fname, 'w')
+    end
+    ofs
+  end
   # handle and add each feature for WEKA format
   #
   # @param [Hash] fs sample that stores feature and its value
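The hunk above routes every reader through `get_ifs`, which accepts `:stdin`, a `StringIO`, or a path to an existing file; `data_from_url` relies on the `StringIO` branch to hand downloaded content to the existing parsers. The following is a minimal standalone sketch of that dispatch, not the gem's code: the name `open_input` is hypothetical, it raises `ArgumentError` instead of calling `abort`, and it uses `File.exist?` because `File.exists?` was removed in Ruby 3.2.

```ruby
require 'stringio'

# Hypothetical standalone version of the dispatch performed by get_ifs.
# Raises instead of aborting so that callers can rescue the error.
def open_input(src)
  case src
  when :stdin   then $stdin            # read from standard input
  when StringIO then src               # in-memory data, e.g. a downloaded file
  when String
    raise ArgumentError, "invalid data source: #{src}" unless File.exist?(src)
    File.open(src)                     # read from an existing file
  else
    raise ArgumentError, "invalid data source: #{src.inspect}"
  end
end

# A StringIO stands in for content fetched from a URL.
io = open_input(StringIO.new("1,2,a\n3,4,b\n"))
rows = io.each_line.map { |ln| ln.chomp.split(',') }
```

Because each reader only calls `each_line` on the returned object, any IO-like source works without the parsers knowing where the data came from.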
@@ -480,14 +557,16 @@ module FileIO
       return
     elsif type == :integer
       fs[f] = v.to_i
-    elsif type == :real or type == :numeric
+    elsif [:real, :numeric, :float, :double, :continuous].include? type
       fs[f] = v.to_f
-    elsif type == :string or type == :nominal
+    elsif [:categorical, :string, :nominal].include? type
       fs[f] = v
     elsif type == :date # convert into integer
       fs[f] = (DateTime.parse(v)-DateTime.new(1970,1,1)).to_i
     else
-       return
+       abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  invalid feature type '#{type}', must be one of the following: \n"+
+            "  integer, real, numeric, float, double, continuous, categorical, string, nominal, date"
     end
   end # add_feature_weka
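The widened type handling in `add_feature_weka` can be sketched as a standalone function. This is an illustration under assumptions rather than the gem's implementation: `coerce_value` and the two constants are hypothetical names, and the sketch raises `ArgumentError` where the gem calls `abort`. The `:date` branch subtracts two `DateTime` objects, which yields a number of days, so the stored integer counts whole days since 1970-01-01 rather than seconds.

```ruby
require 'date'

# Type groups taken from the 1.3.0 hunk above (constant names are hypothetical).
NUMERIC_TYPES = [:real, :numeric, :float, :double, :continuous].freeze
TEXT_TYPES    = [:categorical, :string, :nominal].freeze

# Convert a raw ARFF field (a String) according to its declared feature type.
def coerce_value(v, type)
  if type == :integer
    v.to_i
  elsif NUMERIC_TYPES.include?(type)
    v.to_f
  elsif TEXT_TYPES.include?(type)
    v
  elsif type == :date
    # DateTime subtraction returns a Rational number of days
    (DateTime.parse(v) - DateTime.new(1970, 1, 1)).to_i
  else
    raise ArgumentError, "invalid feature type '#{type}'"
  end
end

int_val  = coerce_value("42", :integer)
real_val = coerce_value("3.5", :double)
text_val = coerce_value("red", :categorical)
date_val = coerce_value("1970-01-02", :date)
```

Unlike version 1.2.0, which silently returned on an unknown type, an unrecognized type is now reported as an error.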

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 1.2.0
+  version: 1.3.0
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-05-21 00:00:00.000000000 Z
+date: 2012-05-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rinruby
-  requirement: &23863908 !ruby/object:Gem::Requirement
+  requirement: &24934116 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
         version: 2.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *23863908
+  version_requirements: *24934116
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
   algorithms and related functions into one single package. Welcome to contact me
   (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -48,6 +48,9 @@ files:
 - lib/fselector/algo_base/base_discrete.rb
 - lib/fselector/algo_base/base_Relief.rb
 - lib/fselector/algo_base/base_ReliefF.rb
+- lib/fselector/algo_both/LasVegasFilter.rb
+- lib/fselector/algo_both/LasVegasIncremental.rb
+- lib/fselector/algo_both/Random.rb
 - lib/fselector/algo_continuous/BSS_WSS.rb
 - lib/fselector/algo_continuous/CFS_c.rb
 - lib/fselector/algo_continuous/F-Test.rb
@@ -75,8 +78,6 @@ files:
 - lib/fselector/algo_discrete/INTERACT.rb
 - lib/fselector/algo_discrete/J-Measure.rb
 - lib/fselector/algo_discrete/KL-Divergence.rb
-- lib/fselector/algo_discrete/LasVegasFilter.rb
-- lib/fselector/algo_discrete/LasVegasIncremental.rb
 - lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb
 - lib/fselector/algo_discrete/McNemarsTest.rb
 - lib/fselector/algo_discrete/MutualInformation.rb
@@ -85,7 +86,6 @@ files:
 - lib/fselector/algo_discrete/Power.rb
 - lib/fselector/algo_discrete/Precision.rb
 - lib/fselector/algo_discrete/ProbabilityRatio.rb
-- lib/fselector/algo_discrete/Random.rb
 - lib/fselector/algo_discrete/ReliefF_d.rb
 - lib/fselector/algo_discrete/Relief_d.rb
 - lib/fselector/algo_discrete/Sensitivity.rb