fselector 1.0.1 → 1.1.0

Files changed (55)
  1. data/ChangeLog +9 -0
  2. data/README.md +62 -26
  3. data/lib/fselector.rb +1 -1
  4. data/lib/fselector/algo_base/base.rb +89 -34
  5. data/lib/fselector/algo_base/base_CFS.rb +20 -7
  6. data/lib/fselector/algo_base/base_Relief.rb +5 -5
  7. data/lib/fselector/algo_base/base_ReliefF.rb +11 -3
  8. data/lib/fselector/algo_base/base_discrete.rb +8 -0
  9. data/lib/fselector/algo_continuous/BSS_WSS.rb +3 -1
  10. data/lib/fselector/algo_continuous/CFS_c.rb +3 -1
  11. data/lib/fselector/algo_continuous/FTest.rb +2 -0
  12. data/lib/fselector/algo_continuous/PMetric.rb +4 -2
  13. data/lib/fselector/algo_continuous/ReliefF_c.rb +11 -0
  14. data/lib/fselector/algo_continuous/Relief_c.rb +14 -3
  15. data/lib/fselector/algo_continuous/TScore.rb +5 -3
  16. data/lib/fselector/algo_continuous/WilcoxonRankSum.rb +5 -3
  17. data/lib/fselector/algo_discrete/Accuracy.rb +2 -0
  18. data/lib/fselector/algo_discrete/AccuracyBalanced.rb +2 -0
  19. data/lib/fselector/algo_discrete/BiNormalSeparation.rb +3 -1
  20. data/lib/fselector/algo_discrete/CFS_d.rb +3 -0
  21. data/lib/fselector/algo_discrete/ChiSquaredTest.rb +3 -0
  22. data/lib/fselector/algo_discrete/CorrelationCoefficient.rb +2 -0
  23. data/lib/fselector/algo_discrete/DocumentFrequency.rb +2 -0
  24. data/lib/fselector/algo_discrete/F1Measure.rb +2 -0
  25. data/lib/fselector/algo_discrete/FastCorrelationBasedFilter.rb +12 -1
  26. data/lib/fselector/algo_discrete/FishersExactTest.rb +3 -1
  27. data/lib/fselector/algo_discrete/GMean.rb +2 -0
  28. data/lib/fselector/algo_discrete/GSSCoefficient.rb +2 -0
  29. data/lib/fselector/algo_discrete/GiniIndex.rb +3 -1
  30. data/lib/fselector/algo_discrete/INTERACT.rb +3 -0
  31. data/lib/fselector/algo_discrete/InformationGain.rb +12 -1
  32. data/lib/fselector/algo_discrete/LasVegasFilter.rb +3 -0
  33. data/lib/fselector/algo_discrete/LasVegasIncremental.rb +3 -0
  34. data/lib/fselector/algo_discrete/MatthewsCorrelationCoefficient.rb +2 -0
  35. data/lib/fselector/algo_discrete/McNemarsTest.rb +3 -0
  36. data/lib/fselector/algo_discrete/MutualInformation.rb +3 -1
  37. data/lib/fselector/algo_discrete/OddsRatio.rb +2 -0
  38. data/lib/fselector/algo_discrete/OddsRatioNumerator.rb +2 -0
  39. data/lib/fselector/algo_discrete/Power.rb +4 -1
  40. data/lib/fselector/algo_discrete/Precision.rb +2 -0
  41. data/lib/fselector/algo_discrete/ProbabilityRatio.rb +2 -0
  42. data/lib/fselector/algo_discrete/Random.rb +3 -0
  43. data/lib/fselector/algo_discrete/ReliefF_d.rb +3 -1
  44. data/lib/fselector/algo_discrete/Relief_d.rb +4 -2
  45. data/lib/fselector/algo_discrete/Sensitivity.rb +2 -0
  46. data/lib/fselector/algo_discrete/Specificity.rb +2 -0
  47. data/lib/fselector/algo_discrete/SymmetricalUncertainty.rb +4 -1
  48. data/lib/fselector/discretizer.rb +7 -7
  49. data/lib/fselector/ensemble.rb +375 -115
  50. data/lib/fselector/entropy.rb +2 -2
  51. data/lib/fselector/fileio.rb +83 -70
  52. data/lib/fselector/normalizer.rb +2 -2
  53. data/lib/fselector/replace_missing_values.rb +137 -3
  54. data/lib/fselector/util.rb +17 -5
  55. metadata +4 -4
data/ChangeLog CHANGED
@@ -1,3 +1,12 @@
+2012-05-15 version 1.1.0
+
+ * add replace_by_median_value! for replacing missing value with feature median value
+ * add replace_by_knn_value! for replacing missing value with weighted feature value from k-nearest neighbors
+ * replace_by_mean_value! and replace_by_median_value! now support both column and row mode
+ * add EnsembleSingle class for ensemble feature selection by creating an ensemble of feature selectors using a single feature selection algorithm
+ * rename Ensemble to EnsembleMultiple for ensemble feature selection by creating an ensemble of feature selectors using multiple feature selection algorithms of the same type
+ * bug fix in FileIO module
+
 2012-05-08 version 1.0.1
 
 * modify Ensemble module so that ensemble_by_score() and ensemble_by_rank() now take Symbol, instead of Method, as argument. This allows easier and clearer function call
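The new column-mode median replacement can be sketched in plain Ruby over fselector's `{ class_label => [samples] }` data layout. This is an illustration of the idea only, not the gem's implementation; the data hash and feature names are made up:

```ruby
# Column-mode median replacement, sketched over fselector's assumed
# data layout { class_label => [ {feature => value}, ... ] }.
# Illustration only — not the gem's actual implementation.
data = {
  c1: [ {f1: 1.0, f2: 4.0}, {f1: 3.0} ],           # f2 missing in 2nd sample
  c2: [ {f1: 5.0, f2: 6.0}, {f1: 7.0, f2: 8.0} ]
}

# collect every observed value per feature, across all classes (column mode)
values = Hash.new { |h, k| h[k] = [] }
data.each_value { |samples| samples.each { |s| s.each { |f, v| values[f] << v } } }

# median of an array
median = lambda do |a|
  s = a.sort
  n = s.size
  n.odd? ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0
end

medians = values.transform_values { |a| median.call(a) }

# fill in any missing feature with that feature's median
data.each_value do |samples|
  samples.each { |s| medians.each { |f, m| s[f] ||= m } }
end

p data[:c1][1]  # → {:f1=>3.0, :f2=>6.0}
```

The row-mode variant mentioned in the ChangeLog would compute medians per sample instead of per feature.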
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 1.0.1
- **Release Date**: 2012-05-08
+ **Latest Version**: 1.1.0
+ **Release Date**: 2012-05-15
 
  Synopsis
  --------
@@ -86,17 +86,16 @@ Feature List
  WilcoxonRankSum    WRS    weighting    continuous    two-class
 
  **note for feature selection interface:**
- there are two types of filter methods, i.e., weighting algorithms and subset selection algorithms
+ there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
 
  - for weighting type: use either **select_feature_by_rank!** or **select_feature_by_score!**
  - for subset type: use **select_feature!**
-
 
  **3. feature selection approaches**
 
  - by a single algorithm
  - by multiple algorithms in a tandem manner
- - by multiple algorithms in an ensemble manner
+ - by multiple algorithms in an ensemble manner (shares the same feature selection interface as a single algorithm)
 
  **4. available normalization and discretization algorithms for continuous features**
 
@@ -114,11 +113,13 @@ Feature List
 
  **5. available algorithms for replacing missing feature values**
 
-  algorithm                      note                                    feature_type
-  -----------------------------------------------------------------------------------------
-  replace_by_fixed_value!        replace by a fixed value                discrete, continuous
-  replace_by_mean_value!         replace by mean feature value           continuous
-  replace_by_most_seen_value!    replace by most seen feature value      discrete
+  algorithm                      note                                    feature_type
+  -----------------------------------------------------------------------------------------
+  replace_by_fixed_value!        replace by a fixed value                discrete, continuous
+  replace_by_mean_value!         replace by mean feature value           continuous
+  replace_by_median_value!       replace by median feature value         continuous
+  replace_by_knn_value!          replace by weighted knn feature value   continuous
+  replace_by_most_seen_value!    replace by most seen feature value      discrete
 
  Installing
  ----------
@@ -140,7 +141,7 @@ Usage
 
      require 'fselector'
 
-     # use InformationGain as a feature ranking algorithm
+     # use InformationGain as a feature selection algorithm
      r1 = FSelector::InformationGain.new
 
      # read from random data (or csv, libsvm, weka ARFF file)
@@ -152,13 +153,13 @@ Usage
      r1.data_from_random(100, 2, 15, 3, true)
 
      # number of features before feature selection
-     puts "# features (before): "+ r1.get_features.size.to_s
+     puts " # features (before): "+ r1.get_features.size.to_s
 
      # select the top-ranked features with scores >0.01
      r1.select_feature_by_score!('>0.01')
 
      # number of features after feature selection
-     puts "# features (after): "+ r1.get_features.size.to_s
+     puts " # features (after): "+ r1.get_features.size.to_s
 
      # you can also use multiple algorithms in a tandem manner
      # e.g. use the ChiSquaredTest with Yates' continuity correction
@@ -166,29 +167,65 @@ Usage
      r2 = FSelector::ChiSquaredTest.new(:yates, r1.get_data)
 
      # number of features before feature selection
-     puts "# features (before): "+ r2.get_features.size.to_s
+     puts " # features (before): "+ r2.get_features.size.to_s
 
      # select the top-ranked 3 features
      r2.select_feature_by_rank!('<=3')
 
      # number of features after feature selection
-     puts "# features (after): "+ r2.get_features.size.to_s
+     puts " # features (after): "+ r2.get_features.size.to_s
 
      # save data to standard output as a weka ARFF file (sparse format)
      # with selected features only
      r2.data_to_weka(:stdout, :sparse)
 
 
- **2. feature selection by an ensemble of multiple algorithms**
+ **2. feature selection by an ensemble of multiple feature selectors**
 
      require 'fselector'
 
-     # use both InformationGain and Relief_d
+     # example 1
+     #
+
+     # creating an ensemble of feature selectors by using
+     # a single feature selection algorithm (INTERACT)
+     # by instance perturbation (e.g. bootstrap sampling)
+
+     # test for the type of feature subset selection algorithms
+     r = FSelector::INTERACT.new(0.0001)
+
+     # an ensemble of 40 feature selectors with 90% data by random sampling
+     re = FSelector::EnsembleSingle.new(r, 40, 0.90, :random_sampling)
+
+     # read SPECT data set (under the test/ directory)
+     re.data_from_csv('test/SPECT_train.csv')
+
+     # number of features before feature selection
+     puts ' # features (before): ' + re.get_features.size.to_s
+
+     # only features with above-average count among the ensemble are selected
+     re.select_feature!
+
+     # number of features after feature selection
+     puts ' # features (after): ' + re.get_features.size.to_s
+
+
+     # example 2
+     #
+
+     # creating an ensemble of feature selectors by using
+     # two feature selection algorithms (InformationGain and Relief_d).
+     # note: can be 2+ algorithms, as long as they are of the same type,
+     # either feature weighting or feature subset selection algorithms
+
+     # test for the type of feature weighting algorithms
      r1 = FSelector::InformationGain.new
-     r2 = FSelector::Relief_d.new
+     r2 = FSelector::Relief_d.new(10)
 
-     # ensemble ranker
-     re = FSelector::Ensemble.new(r1, r2)
+     # an ensemble of two feature selectors
+     re = FSelector::EnsembleMultiple.new(r1, r2)
 
      # read random data
      re.data_from_random(100, 2, 15, 3, true)
@@ -198,18 +235,17 @@ Usage
      re.replace_by_most_seen_value!
 
      # number of features before feature selection
-     puts '# features (before): ' + re.get_features.size.to_s
+     puts ' # features (before): ' + re.get_features.size.to_s
 
      # based on the max feature score (z-score standardized) among
-     # an ensemble of feature selection algorithms
+     # an ensemble of feature selectors
      re.ensemble_by_score(:by_max, :by_zscore)
 
      # select the top-ranked 3 features
      re.select_feature_by_rank!('<=3')
 
      # number of features after feature selection
-     puts '# features (after): ' + re.get_features.size.to_s
-
+     puts ' # features (after): ' + re.get_features.size.to_s
 
  **3. normalization and discretization before feature selection**
 
@@ -233,13 +269,13 @@ Usage
      r2 = FSelector::FCBF.new(0.0, r1.get_data)
 
      # number of features before feature selection
-     puts '# features (before): ' + r2.get_features.size.to_s
+     puts ' # features (before): ' + r2.get_features.size.to_s
 
      # feature selection
      r2.select_feature!
 
      # number of features after feature selection
-     puts '# features (after): ' + r2.get_features.size.to_s
+     puts ' # features (after): ' + r2.get_features.size.to_s
 
  **4. see more examples test_*.rb under the test/ directory**
data/lib/fselector.rb CHANGED
@@ -6,7 +6,7 @@ require 'rinruby'
 #
 module FSelector
   # module version
-  VERSION = '1.0.1'
+  VERSION = '1.1.0'
 end
 
 # the root dir of FSelector
data/lib/fselector/algo_base/base.rb CHANGED
@@ -11,25 +11,39 @@ module FSelector
   # include ReplaceMissingValues
   include ReplaceMissingValues
 
+  class << self
+    # class-level instance variable, type of feature selection algorithm.
+    #
+    # @note derived class (except for Base*** class) must set its own type with
+    #   one of the following two:
+    #   - :feature_weighting        # when algo outputs a weight for each feature
+    #   - :feature_subset_selection # when algo outputs a subset of features
+    attr_accessor :algo_type
+  end
+
+  # get the type of feature selection algorithm at class-level
+  def algo_type
+    self.class.algo_type
+  end
+
+
   # initialize from an existing data structure
   def initialize(data=nil)
-    @data = data
-    @opts = {} # store non-data information
+    @data = data # store data
   end
 
 
   #
-  # iterator for each class, a block must be given
+  # iterator for each class, a block must be given. e.g.
   #
-  # e.g.
-  #     self.each_class do |k|
+  #     each_class do |k|
   #       puts k
   #     end
   #
   def each_class
     if not block_given?
-      abort "[#{__FILE__}@#{__LINE__}]: "+
-            "block must be given!"
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  block must be given!"
     else
       get_classes.each { |k| yield k }
     end
@@ -37,17 +51,16 @@ module FSelector
 
 
   #
-  # iterator for each feature, a block must be given
+  # iterator for each feature, a block must be given. e.g.
   #
-  # e.g.
-  #     self.each_feature do |f|
+  #     each_feature do |f|
   #       puts f
   #     end
   #
   def each_feature
     if not block_given?
-      abort "[#{__FILE__}@#{__LINE__}]: "+
-            "block must be given!"
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  block must be given!"
     else
       get_features.each { |f| yield f }
     end
@@ -55,10 +68,10 @@ module FSelector
 
 
   #
-  # iterator for each sample with class label, a block must be given
+  # iterator for each sample with class label,
+  # a block must be given. e.g.
   #
-  # e.g.
-  #     self.each_sample do |k, s|
+  #     each_sample do |k, s|
   #       print k
   #       s.each { |f, v| print " #{v}" }
   #       puts
@@ -66,7 +79,7 @@ module FSelector
   #
   def each_sample
     if not block_given?
-      abort "[#{__FILE__}@#{__LINE__}]: "+
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
             "  block must be given!"
     else
       get_data.each do |k, samples|
@@ -114,8 +127,8 @@ module FSelector
     if classes and classes.class == Array
       @classes = classes
     else
-      abort "[#{__FILE__}@#{__LINE__}]: "+
-            "classes must be a Array object!"
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  classes must be an Array object!"
     end
   end
 
@@ -125,7 +138,7 @@ module FSelector
   # @return [Array<Symbol>] unique features
   #
   def get_features
-    @features ||= @data.map { |x| x[1].map { |y| y.keys } }.flatten.uniq
+    @features ||= @data.collect { |x| x[1].collect { |y| y.keys } }.flatten.uniq
   end
 
 
@@ -174,8 +187,8 @@ module FSelector
     if features and features.class == Array
       @features = features
     else
-      abort "[#{__FILE__}@#{__LINE__}]: "+
-            "features must be a Array object!"
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  features must be an Array object!"
     end
   end
@@ -204,27 +217,40 @@ module FSelector
   # set data and clean relevant variables in case of data change
   #
   # @param [Hash] data source data structure
+  # @return [nil] to suppress console echo of data in irb
   #
   def set_data(data)
     if data and data.class == Hash
-      @data = data
       # clear variables
-      clear_vars
+      clear_vars if @data
+      @data = data # set new data structure
     else
-      abort "[#{__FILE__}@#{__LINE__}]: "+
-            "data must be a Hash object!"
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  data must be a Hash object!"
     end
+
+    nil # suppress console echo of data in irb
   end
 
 
+  #
   # get non-data information for a given key
-  def get_opt(key)
-    @opts.has_key?(key) ? @opts[key] : nil
+  #
+  # @param [Symbol] key key of non-data
+  # @return [Any] value of non-data, can be any type
+  #
+  # @note return all non-data as a Hash if key == nil
+  #
+  def get_opt(key=nil)
+    key ? @opts[key] : @opts
   end
 
 
   # set non-data information as a key-value pair
+  # @param [Symbol] key key of non-data
+  # @param [Any] value value of non-data, can be any type
   def set_opt(key, value)
+    @opts ||= {} # store non-data information
     @opts[key] = value
   end
 
@@ -235,7 +261,7 @@ module FSelector
   # @return [Integer] sample size
   #
   def get_sample_size
-    @sz ||= get_data.values.flatten.size
+    @sz ||= get_classes.inject(0) { |sz, k| sz + get_data[k].size }
   end
@@ -286,6 +312,13 @@ module FSelector
   # the subset selection type of algorithms, see {file:README.md}
   #
   def select_feature!
+    if not self.algo_type == :feature_subset_selection
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  select_feature! is the interface for the type of feature subset selection algorithms only. \n"+
+            "  please consider select_feature_by_score! or select_feature_by_rank!, \n"+
+            "  which are the interface for the type of feature weighting algorithms"
+    end
+
     # derived class must implement its own one
     subset = get_feature_subset
     return if subset.empty?
@@ -313,8 +346,16 @@ module FSelector
   # the weighting type of algorithms, see {file:README.md}
   #
   def select_feature_by_score!(criterion, my_scores=nil)
+    if not self.algo_type == :feature_weighting
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  select_feature_by_score! is the interface for the type of feature weighting algorithms only. \n"+
+            "  please consider select_feature!, \n"+
+            "  which is the interface for the type of feature subset selection algorithms"
+    end
+
     # user scores or internal scores
     scores = my_scores || get_feature_scores
+    return if scores.empty?
 
     my_data = {}
 
@@ -339,8 +380,16 @@ module FSelector
   # the weighting type of algorithms, see {file:README.md}
   #
   def select_feature_by_rank!(criterion, my_ranks=nil)
+    if not self.algo_type == :feature_weighting
+      abort "[#{__FILE__}@#{__LINE__}]: \n"+
+            "  select_feature_by_rank! is the interface for the type of feature weighting algorithms only. \n"+
+            "  please consider select_feature!, \n"+
+            "  which is the interface for the type of feature subset selection algorithms"
+    end
+
     # user ranks or internal ranks
     ranks = my_ranks || get_feature_ranks
+    return if ranks.empty?
 
     my_data = {}
 
355
404
 
356
405
  private
357
406
 
358
- # clear variables when data structure is altered,
359
- # except @opts (non-data information)
407
+ #
408
+ # clear variables when data structure is altered, this is
409
+ # useful when data structure has changed while
410
+ # you still want to use the same instance
411
+ #
412
+ # @note the variables of original data structure (@data) and
413
+ # algorithm type (@algo_type) are retained
414
+ #
360
415
  def clear_vars
361
416
  @classes, @features, @fvs = nil, nil, nil
362
417
  @scores, @ranks, @sz = nil, nil, nil
363
- @cv, @fvs = nil, nil
418
+ @cv, @fvs, @opts = nil, nil, {}
364
419
  end
365
420
 
366
421
 
@@ -399,10 +454,10 @@ module FSelector
399
454
  end
400
455
 
401
456
 
402
- # get subset of feature, for the type of subset selection algorithms
457
+ # get feature subset, for the type of subset selection algorithms
403
458
  def get_feature_subset
404
- abort "[#{__FILE__}@#{__LINE__}]: "+
405
- "derived class must implement its own get_feature_subset()"
459
+ abort "[#{__FILE__}@#{__LINE__}]: \n"+
460
+ " derived subclass must implement its own get_feature_subset()"
406
461
  end
407
462
 
408
463
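The class-level `algo_type` mechanism and the interface guards added in base.rb can be sketched in isolation. This is a simplified model with hypothetical subclass names, using `raise` where the gem uses `abort`:

```ruby
# Simplified model of the class-level algo_type pattern from base.rb.
# Class names below are stand-ins, not the gem's real classes.
class Base
  class << self
    attr_accessor :algo_type   # class-level instance variable
  end

  # instance-level reader delegating to the class
  def algo_type
    self.class.algo_type
  end

  # interface for feature subset selection algorithms only
  def select_feature!
    unless algo_type == :feature_subset_selection
      raise 'select_feature! is for feature subset selection algorithms only'
    end
    :subset_selected   # stand-in for the real selection work
  end
end

# each derived class declares its own type once; class-level instance
# variables are per-class, so subclasses do not inherit Base's value
class MyWeighting < Base
  @algo_type = :feature_weighting
end

class MySubsetSelection < Base
  @algo_type = :feature_subset_selection
end

puts MyWeighting.new.algo_type              # → feature_weighting
puts MySubsetSelection.new.select_feature!  # → subset_selected
```

Calling `select_feature!` on a weighting-type selector trips the guard, which is exactly the misuse the new abort messages in base.rb report.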