fselector 0.1.2 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2011-2012 Tiejun Cheng
+ Copyright (c) 2012 Tiejun Cheng
 
  Permission is hereby granted, free of charge, to any person
  obtaining a copy of this software and associated documentation
data/README.md CHANGED
@@ -8,22 +8,22 @@ FSelector: a Ruby gem for feature selection and ranking
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
  **Copyright**: 2012
  **License**: MIT License
- **Latest Version**: 0.1.2
- **Release Date**: March 29th 2012
+ **Latest Version**: 0.2.0
+ **Release Date**: April 1st 2012
 
  Synopsis
  --------
 
- FSelector is an open-access Ruby package that aims to integrate as many
- feature selection/ranking algorithms as possible. You're highly welcomed
- and encouraged to contact me if you want to contribute and/or add your own
- feature selection algorithms. FSelector enables the user to perform feature
- selection by using either a single algorithm or an ensemble of algorithms.
- FSelector acts on a full-feature data set and outputs a reduced data set with
- only selected features, which can later be used as the input for various
- machine learning softwares including LibSVM and WEKA. FSelector, itself, does
- not implement any of the machine learning algorithms such as support vector
- machines and random forest. Below is a summary of FSelector's features.
+ FSelector is a Ruby gem that aims to integrate various feature selection/ranking
+ algorithms into one single package. You are welcome to contact me (need47@gmail.com)
+ if you want to contribute your own algorithms or report a bug. FSelector enables
+ the user to perform feature selection by using either a single algorithm or an
+ ensemble of algorithms. FSelector acts on a full-feature data set in CSV, LibSVM
+ or WEKA file format and outputs a reduced data set with only the selected subset
+ of features, which can later be used as the input for various machine learning
+ software, including LibSVM and WEKA. FSelector itself does not implement any
+ machine learning algorithms such as support vector machines and random forest.
+ Below is a summary of FSelector's features.
 
  Feature List
  ------------
@@ -35,6 +35,7 @@ Feature List
  Accuracy                  Acc            discrete
  AccuracyBalanced          Acc2           discrete
  BiNormalSeparation        BNS            discrete
+ CFS_d                     CFS_d          discrete
  ChiSquaredTest            CHI            discrete
  CorrelationCoefficient    CC             discrete
  DocumentFrequency         DF             discrete
@@ -60,6 +61,7 @@ Feature List
  Sensitivity               SN, Recall     discrete
  Specificity               SP             discrete
  SymmetricalUncertainty    SU             discrete
+ CFS_c                     CFS_c          continuous
  PMetric                   PM             continuous
  Relief_c                  Relief_c       continuous
  ReliefF_c                 ReliefF_c      continuous
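
The synopsis above walks through FSelector's basic workflow: load a full-feature data set, run one algorithm (or an ensemble), then write out the reduced set. A minimal sketch of that workflow using the newly added CFS_d algorithm follows; the I/O and driver method names (data_from_csv, select_feature!, data_to_csv) are assumptions for illustration and do not appear anywhere in this diff.

    require 'fselector'

    # hypothetical end-to-end run of the new CFS_d algorithm
    r = FSelector::CFS_d.new
    r.data_from_csv('train.csv')    # full-feature discrete data set (assumed loader name)
    r.select_feature!               # subset selection via sequential forward search
    r.data_to_csv('train_cfs.csv')  # reduced data set with only the selected features (assumed writer name)
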
data/lib/fselector.rb CHANGED
@@ -3,7 +3,7 @@
  #
  module FSelector
    # module version
-   VERSION = '0.1.2'
+   VERSION = '0.2.0'
  end
 
  ROOT = File.expand_path(File.dirname(__FILE__))
@@ -13,18 +13,13 @@ ROOT = File.expand_path(File.dirname(__FILE__))
  #
  require "#{ROOT}/fselector/fileio.rb"
  require "#{ROOT}/fselector/util.rb"
+ require "#{ROOT}/fselector/entropy.rb"
 
  #
  # base class
- #
- require "#{ROOT}/fselector/base.rb"
- require "#{ROOT}/fselector/base_discrete.rb"
- require "#{ROOT}/fselector/base_continuous.rb"
-
- #
- # feature selection use an ensemble of algorithms
- #
- require "#{ROOT}/fselector/ensemble.rb"
+ Dir.glob("#{ROOT}/fselector/algo_base/*").each do |f|
+   require f
+ end
 
  #
  # algorithms for handling discrete feature
@@ -39,3 +34,9 @@ end
  Dir.glob("#{ROOT}/fselector/algo_continuous/*").each do |f|
    require f
  end
+
+ #
+ # feature selection using an ensemble of algorithms
+ #
+ require "#{ROOT}/fselector/ensemble.rb"
+
@@ -80,6 +80,20 @@ module FSelector
  end
 
 
+ # get class labels
+ def get_class_labels
+   if not @cv
+     @cv = []
+
+     each_sample do |k, s|
+       @cv << k
+     end
+   end
+
+   @cv
+ end
+
+
  # set classes
  def set_classes(classes)
    if classes and classes.class == Array
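
A quick illustration of the new accessor, under the assumption that each_sample walks classes in insertion order and that set_data accepts the class => samples Hash used throughout this base class (the literal values are hypothetical):

    # two samples of class :c1, one of class :c2
    r.set_data({ :c1 => [{:f1 => 1}, {:f1 => 0}], :c2 => [{:f1 => 1}] })
    r.get_class_labels  # => [:c1, :c1, :c2], one label per sample, cached in @cv
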
@@ -101,22 +115,34 @@ module FSelector
  # get feature values
  #
  # @param [Symbol] f feature of interest
+ # @param [Symbol] mv include missing feature values?
+ #   missing feature values (recorded as nils) are excluded if
+ #   mv==nil, and included otherwise
  # @param [Symbol] ck class of interest.
- #   if not nil return feature values for the
- #   specific class, otherwise return all feature values
+ #   if ck==nil, return feature values for all classes, otherwise
+ #   return feature values for the specific class (ck)
  #
- def get_feature_values(f, ck=nil)
+ def get_feature_values(f, mv=nil, ck=nil)
    @fvs ||= {}
 
    if not @fvs.has_key? f
      @fvs[f] = {}
+
      each_sample do |k, s|
        @fvs[f][k] = [] if not @fvs[f].has_key? k
-       @fvs[f][k] << s[f] if s.has_key? f
+       if s.has_key? f
+         @fvs[f][k] << s[f]
+       else
+         @fvs[f][k] << nil # for missing feature values
+       end
      end
    end
 
-   ck ? @fvs[f][ck] : @fvs[f].values.flatten
+   if mv # include missing feature values
+     return ck ? @fvs[f][ck] : @fvs[f].values.flatten
+   else # don't include them
+     return ck ? @fvs[f][ck].compact : @fvs[f].values.flatten.compact
+   end
  end
 
 
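
The reworked signature distinguishes three cases. A short sketch, assuming a data set in which feature :f2 appears in two of three samples ({:c1 => [{:f2 => 3}, {}], :c2 => [{:f2 => 5}]}):

    r.get_feature_values(:f2)                           # => [3, 5]       missing values (nils) compacted away
    r.get_feature_values(:f2, :include_missing_values)  # => [3, nil, 5]  any truthy mv keeps the nils
    r.get_feature_values(:f2, nil, :c1)                 # => [3]          class :c1 only, nils dropped
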
@@ -136,6 +162,7 @@ module FSelector
    @data
  end
 
+
  # set data
  def set_data(data)
    if data and data.class == Hash
@@ -167,42 +194,7 @@ module FSelector
  def get_sample_size
    @sz ||= get_data.values.flatten.size
  end
-
-
- #
- # print feature scores
- #
- # @param [String] kclass class of interest
- #
- def print_feature_scores(feat=nil, kclass=nil)
-   scores = get_feature_scores
-
-   scores.each do |f, ks|
-     next if feat and feat != f
-
-     print "#{f} =>"
-     ks.each do |k, s|
-       if kclass
-         print " #{k}->#{s}" if k == kclass
-       else
-         print " #{k}->#{s}"
-       end
-     end
-     puts
-   end
- end
-
-
- # print feature ranks
- def print_feature_ranks
-   ranks = get_feature_ranks
-
-   ranks.each do |f, r|
-     puts "#{f} => #{r}"
-   end
- end
-
-
+
  #
  # get scores of all features for all classes
  #
@@ -0,0 +1,135 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # base class for the Correlation-based Feature Selection (CFS) algorithm; see the
+   # specialized versions for discrete feature (CFS_d) and continuous feature (CFS_c)
+   #
+   # @note for simplicity, we use *sequential forward search* for the optimal feature
+   #   subset; the original CFS, which uses *best first search*, produces only slightly
+   #   better results but demands much more computational resources
+   #
+   # ref: [Feature Selection for Discrete and Numeric Class Machine Learning](http://www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf)
+   #
+   class BaseCFS < Base
+     # undefine superclass methods
+     undef :select_feature_by_score!
+     undef :select_feature_by_rank!
+
+     private
+
+     # use sequential forward search
+     def get_feature_subset
+       subset = []
+       feats = get_features.dup
+
+       s_best = -100.0
+       # use cache
+       @rcf_best, @rff_best = 0.0, 0.0
+
+       improvement = true
+
+       while improvement
+         improvement = false
+         f_max, s_max = nil, -100.0
+         rcf_max, rff_max = -100.0, -100.0
+
+         feats.each do |f|
+           s_try, rcf_try, rff_try = calc_merit(subset, f)
+
+           if s_try > s_best and s_try > s_max
+             f_max, s_max = f, s_try
+             rcf_max, rff_max = rcf_try, rff_try
+           end
+         end
+
+         # add f_max to subset and remove it from feats
+         if f_max
+           subset << f_max
+           feats.delete(f_max)
+           improvement = true
+           # update info
+           s_best, @rcf_best, @rff_best = s_max, rcf_max, rff_max
+         end
+       end
+
+       subset
+     end # get_feature_subset
+
+
+     # calc new merit of subset when adding feature (f)
+     def calc_merit(subset, f)
+       k = subset.size.to_f + 1
+
+       # use cache
+       rcf = @rcf_best + calc_rcf(f)
+       rff = @rff_best
+       subset.each do |s|
+         rff += 2*calc_rff(f, s)
+       end
+
+       m = rcf/Math.sqrt(k+rff)
+
+       [m, rcf, rff]
+     end # calc_merit
+
+
+     # calc feature-class correlation
+     def calc_rcf(f)
+       @f2rcf ||= {} # use cache
+
+       if not @f2rcf.has_key? f
+         cv = get_class_labels
+         fv = get_feature_values(f, :include_missing_values)
+         @f2rcf[f] = do_rcf(cv, fv)
+       end
+
+       @f2rcf[f]
+     end # calc_rcf
+
+
+     # calc feature-feature intercorrelation
+     def calc_rff(f, s)
+       @fs2rff ||= {} # use cache
+
+       if not @f2idx
+         @f2idx = {}
+         fvs = get_features
+         fvs.each_with_index { |ft, idx| @f2idx[ft] = idx } # avoid shadowing f
+       end
+
+       if @f2idx[f] > @f2idx[s]
+         k = [f, s].join('_')
+       else
+         k = [s, f].join('_')
+       end
+
+       if not @fs2rff.has_key? k
+         fv = get_feature_values(f, :include_missing_values)
+         sv = get_feature_values(s, :include_missing_values)
+         @fs2rff[k] = do_rff(fv, sv)
+       end
+
+       @fs2rff[k]
+     end # calc_rff
+
+
+     # calc the feature-class correlation of two vectors
+     def do_rcf(cv, fv)
+       abort "[#{__FILE__}@#{__LINE__}]: "+
+             "derived CFS algo must implement its own do_rcf()"
+     end # do_rcf
+
+
+     # calc the feature-feature intercorrelation of two vectors
+     def do_rff(fv, sv)
+       abort "[#{__FILE__}@#{__LINE__}]: "+
+             "derived CFS algo must implement its own do_rff()"
+     end # do_rff
+
+
+   end # class
+
+
+ end # module
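
The running sums cached in @rcf_best and @rff_best let calc_merit evaluate the CFS merit from the referenced Hall paper incrementally. In the paper's notation, for a candidate subset S of k features:

    % CFS merit of a k-feature subset S (Hall, 1999)
    \mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}

    % calc_merit tracks sums rather than means:
    %   rcf = \sum_{f \in S} r_{cf}(f)                 = k \cdot \overline{r_{cf}}
    %   rff = 2 \sum_{\{f,g\} \subseteq S} r_{ff}(f,g) = k(k-1) \cdot \overline{r_{ff}}
    % so m = rcf / Math.sqrt(k + rff) reproduces Merit_S for each candidate subset.
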
@@ -0,0 +1,130 @@
+ #
+ # FSelector: a Ruby gem for feature selection and ranking
+ #
+ module FSelector
+   #
+   # base class for the Relief algorithm; see the specialized versions for discrete
+   # feature (Relief_d) and continuous feature (Relief_c), respectively
+   #
+   # @note Relief is applicable only to two-class problems without missing data
+   #
+   # ref: [The Feature Selection Problem: Traditional Methods and a New Algorithm](http://www.aaai.org/Papers/AAAI/1992/AAAI92-020.pdf)
+   #
+   class BaseRelief < Base
+     #
+     # new()
+     #
+     # @param [Integer] m number of samples to be used
+     #   for estimating feature contribution; at most
+     #   the number of training samples
+     # @param [Hash] data existing data structure
+     #
+     def initialize(m=nil, data=nil)
+       super(data)
+       @m = (m || 30) # default 30
+     end
+
+     private
+
+     # calculate contribution of each feature (f) across all classes
+     def calc_contribution(f)
+       if not get_classes.size == 2
+         abort "[#{__FILE__}@#{__LINE__}]: "+
+               "Relief is applicable only to two-class problems without missing data"
+       end
+
+       ## use all samples if @m not provided
+       #@m = get_sample_size if not @m
+
+       k1, k2 = get_classes
+       score = 0.0
+
+       @m.times do
+         # pick a sample at random
+         rs, rk = pick_a_sample_at_random
+
+         # find the nearest neighbor within each class
+         nbrs = find_nearest_nb(rs, rk)
+
+         # calc contribution from neighbors
+         score += calc_score(f, rs, rk, nbrs)
+       end
+
+       s = score / @m
+
+       set_feature_score(f, :BEST, s)
+     end # calc_contribution
+
+
+     # pick a sample at random
+     def pick_a_sample_at_random
+       rk = get_classes[rand(get_classes.size)]
+       rks = get_data[rk]
+
+       [ rks[rand(rks.size)], rk ]
+     end # pick_a_sample_at_random
+
+
+     # find the nearest neighbor of a given sample (rs) within each class
+     def find_nearest_nb(rs, rk)
+       nbrs = {}
+
+       each_class do |k|
+         nb, dmin = nil, 999
+         get_data[k].each do |s|
+           next if s.object_id == rs.object_id # exclude self
+
+           d = diff_sample(rs, s)
+
+           if d < dmin
+             dmin = d
+             nb = s
+           end
+         end
+
+         nbrs[k] = nb
+       end
+
+       nbrs
+     end # find_nearest_nb
+
+
+     # difference between two samples
+     def diff_sample(s1, s2)
+       d = 0.0
+
+       each_feature do |f|
+         d += diff_feature(f, s1, s2)**2
+       end
+
+       d
+     end # diff_sample
+
+
+     # difference between the feature (f) of two samples
+     def diff_feature(f, s1, s2)
+       abort "[#{__FILE__}@#{__LINE__}]: "+
+             "derived Relief algo must implement its own diff_feature()"
+     end # diff_feature
+
+
+     # calc feature (f) contribution from neighbors
+     def calc_score(f, rs, rk, nbrs)
+       score = 0.0
+
+       nbrs.each do |k, s|
+         if k == rk # near hit
+           score -= diff_feature(f, rs, s)**2
+         else # near miss
+           score += diff_feature(f, rs, s)**2
+         end
+       end
+
+       score
+     end # calc_score
+
+
+   end # class
+
+
+ end # module
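
calc_contribution and calc_score together implement the sampled relevance estimate of Kira and Rendell's Relief: for each of the m randomly picked samples R_i, the near-hit H_i (nearest neighbor of the same class) lowers a feature's weight and the near-miss M_i (nearest neighbor of the other class) raises it:

    % Relief weight of feature f, estimated from m random samples
    W_f = \frac{1}{m} \sum_{i=1}^{m} \Bigl( \mathrm{diff}(f, R_i, M_i)^2 - \mathrm{diff}(f, R_i, H_i)^2 \Bigr)

    % diff() is the abstract diff_feature(); derived classes supply the
    % discrete (Relief_d) or continuous (Relief_c) definition.
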