fselector 0.5.0 → 0.6.0

Files changed (11)

data/ChangeLog +7 -0
data/README.md +18 -7
data/lib/fselector.rb +4 -3
data/lib/fselector/algo_base/base.rb +7 -0
data/lib/fselector/algo_discrete/BiNormalSeparation.rb +3 -4
data/lib/fselector/algo_discrete/FishersExactTest.rb +5 -7
data/lib/fselector/discretizer.rb +15 -2
data/lib/fselector/fileio.rb +19 -4
data/lib/fselector/util.rb +0 -585
metadata +17 -6
data/lib/fselector/chisq_calc.rb +0 -189

data/ChangeLog ADDED

@@ -0,0 +1,7 @@
+2012-04-18	Tiejun Cheng	<need47@gmail.com>
+  * require the RinRuby gem (http://rinruby.ddahl.org) to access the
+    statistical routines in the R package (http://www.r-project.org/)
+  * because of RinRuby (and thus R), removed the following modules or implementations:
+    RubyStats (FishersExactTest.calculate, get_icdf) and ChiSquareCalculator

data/README.md CHANGED

@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 0.5.0
-**Release Date**: April 13 2012
+**Latest Version**: 0.6.0
+**Release Date**: April 19 2012
 Synopsis
 --------
@@ -25,9 +25,9 @@ missing feature values with certain criterion. FSelector acts on a
 full-feature data set in either CSV, LibSVM or WEKA file format and
 outputs a reduced data set with only selected subset of features, which
 can later be used as the input for various machine learning softwares
-including LibSVM and WEKA. FSelector, itself, does not implement
-any of the machine learning algorithms such as support vector machines
-and random forest. See below for a list of FSelector's features.
+such as LibSVM and WEKA. FSelector, as a collection of filter methods,
+does not implement any classifier like support vector machines or
+random forest. See below for a list of FSelector's features.
 Feature List
 ------------
@@ -78,7 +78,7 @@ Feature List
     ReliefF_c                         ReliefF_c   continuous
     TScore                            TS          continuous
-  **feature selection interface:**
+  **note for feature selection interface:**
     - for the algorithms of CFS\_d, FCBF and CFS\_c, use select\_feature!
     - for other algorithms, use either select\_feature\_by\_rank! or select\_feature\_by\_score!
@@ -115,7 +115,13 @@ Installing
 To install FSelector, use the following command:
     $ gem install fselector
+  **note:** Starting from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
+  as a seamless bridge to access the statistical routines in the R package (http://www.r-project.org),
+  which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
+  on statistical tests. To this end, please pre-install the R package. RinRuby should have been
+  auto-installed with FSelector.
 Usage
 -----
@@ -223,6 +229,11 @@ Usage
 **4. see more examples test_*.rb under the test/ directory**
+Change Log
+----------
+A {file:ChangeLog} is available from version 0.5.0 and upward to reflect
+what's new and what's changed
 Copyright
 ---------
 FSelector &copy; 2012 by [Tiejun Cheng](mailto:need47@gmail.com).

data/lib/fselector.rb CHANGED

@@ -1,9 +1,12 @@
+# access to the statistical routines in R package
+require 'rinruby'
 #
 # FSelector: a Ruby gem for feature selection and ranking
 #
 module FSelector
   # module version
-  VERSION = '0.5.0'
+  VERSION = '0.6.0'
 end
 ROOT = File.expand_path(File.dirname(__FILE__))
@@ -17,8 +20,6 @@ require "#{ROOT}/fselector/fileio.rb"
 require "#{ROOT}/fselector/util.rb"
 # entropy-related functions
 require "#{ROOT}/fselector/entropy.rb"
-# chi-square calculator
-require "#{ROOT}/fselector/chisq_calc.rb"
 # normalization for continuous data
 require "#{ROOT}/fselector/normalizer.rb"
 # discretization for continuous data

data/lib/fselector/algo_base/base.rb CHANGED

@@ -165,6 +165,13 @@ module FSelector
     end
+    # get a copy of data,
+    # by use of the standard Marshal library
+    def get_data_copy
+      Marshal.load(Marshal.dump(@data)) if @data
+    end
     # set data
     def set_data(data)
       if data and data.class == Hash
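
Note on `get_data_copy`: the Marshal round trip yields a deep copy, so nested sample hashes are not shared with the original. The following is a minimal sketch of the difference from `dup`, assuming a data hash in the class => samples layout; the sample values are hypothetical.

```ruby
# Hypothetical data in the class => [sample hashes] layout used by FSelector
data = { :c1 => [{ :f1 => 1, :f2 => 2 }], :c2 => [{ :f1 => 1, :f3 => 3 }] }

shallow = data.dup                          # copies only the outer Hash
deep    = Marshal.load(Marshal.dump(data))  # rebuilds every nested object

# Mutate a nested sample in the original
data[:c1][0][:f1] = 99

shallow_value = shallow[:c1][0][:f1]  # shares the nested Hash with data
deep_value    = deep[:c1][0][:f1]     # unaffected by the mutation
puts "shallow: #{shallow_value}, deep: #{deep_value}"
```

Only the deep copy keeps the original value, which is presumably why the method uses Marshal rather than `dup` or `clone`.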

data/lib/fselector/algo_discrete/BiNormalSeparation.rb CHANGED

@@ -13,14 +13,11 @@ module FSelector
 # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974) and [Rubystats](http://rubystats.rubyforge.org)
 #
   class BiNormalSeparation < BaseDiscrete
-    # include Ruby statistics libraries
-    include Rubystats
     private
     # calculate contribution of each feature (f) for each class (k)
     def calc_contribution(f)
-      @nd ||= Rubystats::NormalDistribution.new
       each_class do |k|
         a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
@@ -28,7 +25,9 @@ module FSelector
         s = 0.0
         if not (a+c).zero? and not (b+d).zero?
           tpr, fpr = a/(a+c), b/(b+d)
-          s = (@nd.get_icdf(tpr) - @nd.get_icdf(fpr)).abs
+          R.eval "rv <- qnorm(#{tpr}) - qnorm(#{fpr})"
+          s = R.rv.abs
         end
         set_feature_score(f, k, s)
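
Note on the `qnorm` call: the score is the absolute difference of the inverse standard normal CDF at the true-positive and false-positive rates. To check values without an R installation, a pure-Ruby sketch along these lines may help. The helpers `norm_cdf`, `norm_icdf` and `bns_score` are hypothetical names, not part of the gem, and the bisection is a simple stand-in for R's `qnorm`, not its algorithm.

```ruby
# Standard normal CDF via the error function in Ruby's Math module
def norm_cdf(x)
  0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# Inverse CDF by bisection (hypothetical helper; a stand-in for R's qnorm)
def norm_icdf(prob, lo = -10.0, hi = 10.0)
  unless prob > 0.0 && prob < 1.0
    raise ArgumentError, "prob must lie strictly between 0 and 1"
  end
  100.times do
    mid = (lo + hi) / 2.0
    if norm_cdf(mid) < prob
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Bi-Normal Separation score for one feature/class pair, from the 2x2 counts
def bns_score(a, b, c, d)
  return 0.0 if (a + c).zero? || (b + d).zero?
  tpr = a.to_f / (a + c)
  fpr = b.to_f / (b + d)
  (norm_icdf(tpr) - norm_icdf(fpr)).abs
end

score = bns_score(30, 10, 20, 40)  # tpr = 0.6, fpr = 0.2
puts score.round(4)
```

Unlike R's `qnorm`, which returns infinite values at 0 and 1, this sketch raises on those inputs. That is a choice of the sketch, not behavior taken from the gem.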

data/lib/fselector/algo_discrete/FishersExactTest.rb CHANGED

@@ -11,24 +11,22 @@ module FSelector
 #
 #     for FET, the smaller, the better, but we intentionally negate it
 #     so that the larger is always the better (consistent with other algorithms)
+#     R equivalent: fisher.test
 #
 # ref: [Wikipedia](http://en.wikipedia.org/wiki/Fisher's_exact_test) and [Rubystats](http://rubystats.rubyforge.org)
 #
   class FishersExactTest < BaseDiscrete
-    # include Ruby statistics libraries
-    include Rubystats
     private
     # calculate contribution of each feature (f) for each class (k)
-    def calc_contribution(f)
-      @fet ||= Rubystats::FishersExactTest.new
+    def calc_contribution(f)
       each_class do |k|
         a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
-        # note: we've intentionally negated it
-        s = -1 * @fet.calculate(a, b, c, d)[:twotail]
+        # note: intentionally negated it
+        R.eval "rv <- fisher.test(matrix(c(#{a}, #{b}, #{c}, #{d}), nrow=2))$p.value"
+        s = -1.0 * R.rv
         set_feature_score(f, k, s)
       end
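
Note on the `fisher.test` call: the two-sided p-value sums the hypergeometric probabilities of all tables that are no more likely than the observed one. A pure-Ruby sketch of that definition, usable as a cross-check without R, might look as follows. The helper names are hypothetical, and the relative tolerance of 1e-7 for ties is an assumption of this sketch.

```ruby
# log of the binomial coefficient C(n, k), via the log-gamma function
def log_choose(n, k)
  Math.lgamma(n + 1)[0] - Math.lgamma(k + 1)[0] - Math.lgamma(n - k + 1)[0]
end

# Hypergeometric probability of seeing x in the top-left cell,
# given row sum r1, column sum c1 and grand total n
def hyper_prob(x, r1, c1, n)
  Math.exp(log_choose(r1, x) + log_choose(n - r1, c1 - x) - log_choose(n, c1))
end

# Two-sided p-value: sum the probabilities of all tables that are
# no more likely than the observed one
def fisher_two_sided(a, b, c, d)
  r1, c1, n = a + b, a + c, a + b + c + d
  observed = hyper_prob(a, r1, c1, n)
  lo = [0, r1 + c1 - n].max
  hi = [r1, c1].min
  total = (lo..hi).sum do |x|
    px = hyper_prob(x, r1, c1, n)
    px <= observed * (1.0 + 1e-7) ? px : 0.0
  end
  [total, 1.0].min
end

p_value = fisher_two_sided(3, 1, 1, 3)
score = -1.0 * p_value   # negated, as in the diff, so larger is better
puts p_value.round(4)
```

Transposing the 2x2 table does not change the p-value, so the column-wise fill order of R's `matrix(..., nrow=2)` does not affect the score.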

data/lib/fselector/discretizer.rb CHANGED

@@ -4,8 +4,6 @@
 module Discretizer
   # include Entropy module
   include Entropy
-  # include ChiSquareCalculator module
-  include ChiSquareCalculator
   # discretize by equal-width intervals
   #
@@ -334,6 +332,19 @@ module Discretizer
   private
+  #
+  # get the Chi-square value from p-value
+  #
+  # @param [Float] pval p-value
+  # @param [Integer] df degree of freedom
+  # @return [Float] Chi-square value
+  #
+  def pval2chisq(pval, df)
+    R.eval "chisq <- qchisq(#{1-pval}, #{df})"
+    R.chisq
+  end
   #
   # get index from sorted cut points
   #
@@ -388,6 +399,7 @@ module Discretizer
     clear_vars
   end
   #
   # Chi2: initialization
   #
@@ -423,6 +435,7 @@ module Discretizer
     [bs, cs, qs]
   end
   #
   # Chi2: merge two adjacent intervals
   #
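
Note on `pval2chisq`: the call `qchisq(1 - pval, df)` returns the upper-tail critical value. For df = 2 the chi-square survival function is exp(-x/2), so the result has the closed form -2 ln(pval). The sketch below uses that special case as a sanity check without R; `pval2chisq_df2` is a hypothetical helper and covers df = 2 only.

```ruby
# Upper-tail critical value for df = 2 only, where the chi-square
# survival function is exp(-x / 2) and so inverts in closed form.
# (Hypothetical helper for checking; the gem calls R's qchisq instead.)
def pval2chisq_df2(pval)
  unless pval > 0.0 && pval <= 1.0
    raise ArgumentError, "pval must be in (0, 1]"
  end
  -2.0 * Math.log(pval)
end

critical = pval2chisq_df2(0.05)
puts critical.round(3)
```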

data/lib/fselector/fileio.rb CHANGED

@@ -1,8 +1,23 @@
 #
-# read and write various file formats
+# read and write various file formats,
+# the internal data structure looks like:
+#
+#     data = {
+#
+#       :c1 => [                             # class c1
+#         {:f1=>1, :f2=>2}                   # sample 2
+#       ],
+#
+#       :c2 => [                             # class c2
+#         {:f1=>1, :f3=>3},                  # sample 1
+#         {:f2=>2}                           # sample 3
+#       ]
+#
+#     }
+#
+#     where :c1 and :c2 are class labels; :f1, :f2, and :f3 are features
 #
-# @note class labels and features are treated as symbols,
-#       e.g. length => :length
+# @note class labels and features are treated as symbols
 #
 module FileIO
   #
@@ -40,7 +55,7 @@ module FileIO
         if ncategory == 1
           feats[f] = 1
         elsif ncategory > 1
-          feats[f] = rand(ncategory)
+          feats[f] = rand(ncategory)+1
         else
           feats[f] = rand
         end
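
Note on this file's two changes: the new header comment documents the internal layout, and `rand(ncategory)+1` shifts random category values from 0...n to 1..n. A short sketch of both follows, assuming sample data like that in the comment; the variable names are illustrative.

```ruby
# The documented layout: class label => array of samples,
# each sample a Hash of feature => value (missing features are simply absent)
data = {
  :c1 => [{ :f1 => 1, :f2 => 2 }],
  :c2 => [{ :f1 => 1, :f3 => 3 }, { :f2 => 2 }]
}

sample_count = data.values.map(&:size).sum
features = data.values.flatten.flat_map(&:keys).uniq.sort

# rand(n) draws from 0...n, so adding 1 shifts the range to 1..n
ncategory = 3
draws = Array.new(1000) { rand(ncategory) + 1 }

puts "samples: #{sample_count}, features: #{features.inspect}"
puts "category range: #{draws.min}..#{draws.max}"
```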

data/lib/fselector/util.rb CHANGED

@@ -149,588 +149,3 @@ end # String
 #=>a
 #=>_'b,c, d'_
 #=>'e'
-#
-# adapted from the Ruby statistics libraries --
-# [Rubystats](http://rubystats.rubyforge.org)
-#
-# - for Fisher's exact test (Rubystats::FishersExactTest.calculate())
-#   used by algo\_binary/FishersExactText.rb
-# - for inverse cumulative normal distribution function (Rubystats::NormalDistribution.get\_icdf())
-#   used by algo\_binary/BiNormalSeparation.rb. note the original get\_icdf() function is a private
-#   one, so we have to open it up and that's why the codes here.
-#
-#
- module Rubystats
-  MAX_VALUE = 1.2e290
-  SQRT2PI = 2.5066282746310005024157652848110452530069867406099
-  SQRT2 = 1.4142135623730950488016887242096980785696718753769
-  TWO_PI = 6.2831853071795864769252867665590057683943387987502
-  #
-  # Fisher's exact test calculator
-  #
-  class FishersExactTest
-    # new()
-    def initialize
-      @sn11    = 0.0
-      @sn1_    = 0.0
-      @sn_1    = 0.0
-      @sn      = 0.0
-      @sprob   = 0.0
-      @sleft   = 0.0
-      @sright  = 0.0
-      @sless   = 0.0
-      @slarg   = 0.0
-      @left    = 0.0
-      @right   = 0.0
-      @twotail = 0.0
-    end
-    # Fisher's exact test
-    def calculate(n11_,n12_,n21_,n22_)
-      n11_ *= -1 if n11_ < 0
-      n12_ *= -1 if n12_ < 0
-      n21_ *= -1 if n21_ < 0
-      n22_ *= -1 if n22_ < 0
-      n1_     = n11_ + n12_
-      n_1     = n11_ + n21_
-      n       = n11_ + n12_ + n21_ + n22_
-      prob    = exact(n11_,n1_,n_1,n)
-      left    = @sless
-      right   = @slarg
-      twotail = @sleft + @sright
-      twotail = 1 if twotail > 1
-      values_hash = { :left =>left, :right =>right, :twotail =>twotail }
-      return values_hash
-    end
-    private
-    # Reference: "Lanczos, C. 'A precision approximation
-    # of the gamma function', J. SIAM Numer. Anal., B, 1, 86-96, 1964."
-    # Translation of  Alan Miller's FORTRAN-implementation
-    # See http://lib.stat.cmu.edu/apstat/245
-    def lngamm(z)
-      x = 0
-      x += 0.0000001659470187408462/(z+7)
-      x += 0.000009934937113930748 /(z+6)
-      x -= 0.1385710331296526      /(z+5)
-      x += 12.50734324009056       /(z+4)
-      x -= 176.6150291498386       /(z+3)
-      x += 771.3234287757674       /(z+2)
-      x -= 1259.139216722289       /(z+1)
-      x += 676.5203681218835       /(z)
-      x += 0.9999999999995183
-      return(Math.log(x)-5.58106146679532777-z+(z-0.5) * Math.log(z+6.5))
-    end
-    def lnfact(n)
-      if n <= 1
-        return 0
-      else
-        return lngamm(n+1)
-      end
-    end
-    def lnbico(n,k)
-      return lnfact(n) - lnfact(k) - lnfact(n-k)
-    end
-    def hyper_323(n11, n1_, n_1, n)
-      return Math.exp(lnbico(n1_, n11) + lnbico(n-n1_, n_1-n11) - lnbico(n, n_1))
-    end
-    def hyper(n11)
-      return hyper0(n11, 0, 0, 0)
-    end
-    def hyper0(n11i,n1_i,n_1i,ni)
-      if n1_i == 0 and n_1i ==0 and ni == 0
-        unless n11i % 10 == 0
-          if n11i == @sn11+1
-            @sprob *= ((@sn1_ - @sn11)/(n11i.to_f))*((@sn_1 - @sn11)/(n11i.to_f + @sn - @sn1_ - @sn_1))
-            @sn11 = n11i
-            return @sprob
-          end
-          if n11i == @sn11-1
-            @sprob *= ((@sn11)/(@sn1_-n11i.to_f))*((@sn11+@sn-@sn1_-@sn_1)/(@sn_1-n11i.to_f))
-            @sn11 = n11i
-            return @sprob
-          end
-        end
-        @sn11 = n11i
-      else
-        @sn11 = n11i
-        @sn1_ = n1_i
-        @sn_1 = n_1i
-        @sn   = ni
-      end
-      @sprob = hyper_323(@sn11,@sn1_,@sn_1,@sn)
-      return @sprob
-    end
-    def exact(n11,n1_,n_1,n)
-      p = i = j = prob = 0.0
-      max = n1_
-      max = n_1 if n_1 < max
-      min = n1_ + n_1 - n
-      min = 0 if min < 0
-      if min == max
-        @sless  = 1
-        @sright = 1
-        @sleft  = 1
-        @slarg  = 1
-        return 1
-      end
-      prob = hyper0(n11,n1_,n_1,n)
-      @sleft = 0
-      p = hyper(min)
-      i = min + 1
-      while p < (0.99999999 * prob)
-        @sleft += p
-        p = hyper(i)
-        i += 1
-      end
-      i -= 1
-      if p < (1.00000001*prob)
-        @sleft += p
-      else
-        i -= 1
-      end
-      @sright = 0
-      p = hyper(max)
-      j = max - 1
-      while p < (0.99999999 * prob)
-        @sright += p
-        p = hyper(j)
-        j -= 1
-      end
-      j += 1
-      if p < (1.00000001*prob)
-        @sright += p
-      else
-        j += 1
-      end
-      if (i - n11).abs < (j - n11).abs
-        @sless = @sleft
-        @slarg = 1 - @sleft + prob
-      else
-        @sless = 1 - @sright + prob
-        @slarg = @sright
-      end
-      return prob
-    end
-  end # class
-  #
-  # Normal distribution
-  #
-  class NormalDistribution
-    # Constructs a normal distribution (defaults to zero mean and
-    # unity variance)
-    def initialize(mu=0.0, sigma=1.0)
-      @mean = mu
-      if sigma <= 0.0
-        return "error"
-      end
-      @stdev = sigma
-      @variance = sigma**2
-      @pdf_denominator = SQRT2PI * Math.sqrt(@variance)
-      @cdf_denominator = SQRT2   * Math.sqrt(@variance)
-    end
-    # Obtain single PDF value
-    # Returns the probability that a stochastic variable x has the value X,
-    # i.e. P(x=X)
-    def get_pdf(x)
-      Math.exp( -((x-@mean)**2) / (2 * @variance)) / @pdf_denominator
-    end
-    # Obtain single CDF value
-    # Returns the probability that a stochastic variable x is less than X,
-    # i.e. P(x<X)
-    def get_cdf(x)
-      complementary_error( -(x - @mean) / @cdf_denominator) / 2
-    end
-    # Obtain single inverse CDF value.
-    # returns the value X for which P(x<X).
-    def get_icdf(p)
-      check_range(p)
-      if p == 0.0
-        return -MAX_VALUE
-      end
-      if p == 1.0
-        return MAX_VALUE
-      end
-      if p == 0.5
-      return @mean
-      end
-      mean_save = @mean
-      var_save = @variance
-      pdf_D_save = @pdf_denominator
-      cdf_D_save = @cdf_denominator
-      @mean = 0.0
-      @variance = 1.0
-      @pdf_denominator = Math.sqrt(TWO_PI)
-      @cdf_denominator = SQRT2
-      x = find_root(p, 0.0, -100.0, 100.0)
-      #scale back
-      @mean = mean_save
-      @variance = var_save
-      @pdf_denominator = pdf_D_save
-      @cdf_denominator = cdf_D_save
-      return x * Math.sqrt(@variance) + @mean
-    end
-    private
-    #check that variable is between lo and hi limits.
-    #lo default is 0.0 and hi default is 1.0
-    def check_range(x, lo=0.0, hi=1.0)
-      raise ArgumentError.new("x cannot be nil") if x.nil?
-      if x < lo or x > hi
-        raise ArgumentError.new("x must be less than lo (#{lo}) and greater than hi (#{hi})")
-      end
-    end
-    def find_root(prob, guess, x_lo, x_hi)
-      accuracy = 1.0e-10
-      max_iteration = 150
-      x     = guess
-      x_new = guess
-      error = 0.0
-      _pdf  = 0.0
-      dx    = 1000.0
-      i     = 0
-      while ( dx.abs > accuracy && (i += 1) < max_iteration )
-        #Apply Newton-Raphson step
-        error = cdf(x) - prob
-        if error < 0.0
-        x_lo = x
-        else
-        x_hi = x
-        end
-        _pdf = pdf(x)
-        if _pdf != 0.0
-        dx = error / _pdf
-        x_new = x -dx
-        end
-        # If the NR fails to converge (which for example may be the
-        # case if the initial guess is too rough) we apply a bisection
-        # step to determine a more narrow interval around the root.
-        if  x_new < x_lo || x_new > x_hi || _pdf == 0.0
-        x_new = (x_lo + x_hi) / 2.0
-        dx = x_new - x
-        end
-        x = x_new
-      end
-      return x
-    end
-    #Probability density function
-    def pdf(x)
-      if x.class == Array
-        pdf_vals = []
-        for i in (0 ... x.length)
-          pdf_vals[i] = get_pdf(x[i])
-        end
-      return pdf_vals
-      else
-        return get_pdf(x)
-      end
-    end
-    #Cumulative distribution function
-    def cdf(x)
-      if x.class == Array
-        cdf_vals = []
-        for i in (0...x.size)
-          cdf_vals[i] = get_cdf(x[i])
-        end
-      return cdf_vals
-      else
-        return get_cdf(x)
-      end
-    end
-    # Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
-    #
-    # Developed at SunSoft, a Sun Microsystems, Inc. business.
-    # Permission to use, copy, modify, and distribute this
-    # software is freely granted, provided that this notice
-    # is preserved.
-    #
-    #                 x
-    #              2      |\
-    #     erf(x)  =  ---------  | exp(-t*t)dt
-    #            sqrt(pi) \|
-    #                 0
-    #
-    #     erfc(x) =  1-erf(x)
-    #  Note that
-    #        erf(-x) = -erf(x)
-    #        erfc(-x) = 2 - erfc(x)
-    #
-    # Method:
-    #    1. For |x| in [0, 0.84375]
-    #        erf(x)  = x + x*R(x^2)
-    #          erfc(x) = 1 - erf(x)           if x in [-.84375,0.25]
-    #                  = 0.5 + ((0.5-x)-x*R)  if x in [0.25,0.84375]
-    #       where R = P/Q where P is an odd poly of degree 8 and
-    #       Q is an odd poly of degree 10.
-    #                         -57.90
-    #            | R - (erf(x)-x)/x | <= 2
-    #
-    #
-    #       Remark. The formula is derived by noting
-    #          erf(x) = (2/sqrt(pi))*(x - x^3/3 + x^5/10 - x^7/42 + ....)
-    #       and that
-    #          2/sqrt(pi) = 1.128379167095512573896158903121545171688
-    #       is close to one. The interval is chosen because the fix
-    #       point of erf(x) is near 0.6174 (i.e., erf(x)=x when x is
-    #       near 0.6174), and by some experiment, 0.84375 is chosen to
-    #        guarantee the error is less than one ulp for erf.
-    #
-    #      2. For |x| in [0.84375,1.25], let s = |x| - 1, and
-    #         c = 0.84506291151 rounded to single (24 bits)
-    #             erf(x)  = sign(x) * (c  + P1(s)/Q1(s))
-    #             erfc(x) = (1-c)  - P1(s)/Q1(s) if x > 0
-    #              1+(c+P1(s)/Q1(s))    if x < 0
-    #             |P1/Q1 - (erf(|x|)-c)| <= 2**-59.06
-    #       Remark: here we use the taylor series expansion at x=1.
-    #        erf(1+s) = erf(1) + s*Poly(s)
-    #             = 0.845.. + P1(s)/Q1(s)
-    #              That is, we use rational approximation to approximate
-    #                 erf(1+s) - (c = (single)0.84506291151)
-    #            Note that |P1/Q1|< 0.078 for x in [0.84375,1.25]
-    #            where
-    #             P1(s) = degree 6 poly in s
-    #             Q1(s) = degree 6 poly in s
-    #
-    #           3. For x in [1.25,1/0.35(~2.857143)],
-    #                  erfc(x) = (1/x)*exp(-x*x-0.5625+R1/S1)
-    #                  erf(x)  = 1 - erfc(x)
-    #            where
-    #             R1(z) = degree 7 poly in z, (z=1/x^2)
-    #             S1(z) = degree 8 poly in z
-    #
-    #           4. For x in [1/0.35,28]
-    #                  erfc(x) = (1/x)*exp(-x*x-0.5625+R2/S2) if x > 0
-    #                 = 2.0 - (1/x)*exp(-x*x-0.5625+R2/S2) if -6<x<0
-    #                 = 2.0 - tiny        (if x <= -6)
-    #                  erf(x)  = sign(x)*(1.0 - erfc(x)) if x < 6, else
-    #                  erf(x)  = sign(x)*(1.0 - tiny)
-    #            where
-    #             R2(z) = degree 6 poly in z, (z=1/x^2)
-    #             S2(z) = degree 7 poly in z
-    #
-    #           Note1:
-    #            To compute exp(-x*x-0.5625+R/S), let s be a single
-    #            PRECISION number and s := x then
-    #             -x*x = -s*s + (s-x)*(s+x)
-    #                 exp(-x*x-0.5626+R/S) =
-    #                 exp(-s*s-0.5625)*exp((s-x)*(s+x)+R/S)
-    #           Note2:
-    #            Here 4 and 5 make use of the asymptotic series
-    #                   exp(-x*x)
-    #             erfc(x) ~ ---------- * ( 1 + Poly(1/x^2) )
-    #                   x*sqrt(pi)
-    #            We use rational approximation to approximate
-    #               g(s)=f(1/x^2) = log(erfc(x)*x) - x*x + 0.5625
-    #            Here is the error bound for R1/S1 and R2/S2
-    #               |R1/S1 - f(x)|  < 2**(-62.57)
-    #               |R2/S2 - f(x)|  < 2**(-61.52)
-    #
-    #            5. For inf > x >= 28
-    #                             erf(x)  = sign(x) *(1 - tiny)  (raise inexact)
-    #                             erfc(x) = tiny*tiny (raise underflow) if x > 0
-    #                           = 2 - tiny if x<0
-    #
-    #            7. Special case:
-    #                             erf(0)  = 0, erf(inf)  = 1, erf(-inf) = -1,
-    #                             erfc(0) = 1, erfc(inf) = 0, erfc(-inf) = 2,
-    #                           erfc/erf(NaN) is NaN
-    #
-    #               $efx8 = 1.02703333676410069053e00
-    #
-    #                 Coefficients for approximation to erf on [0,0.84375]
-    #
-    # Error function.
-    # Based on C-code for the error function developed at Sun Microsystems.
-    # Author:: Jaco van Kooten
-    def error(x)
-      e_efx = 1.28379167095512586316e-01
-      ePp = [ 1.28379167095512558561e-01,
-        -3.25042107247001499370e-01,
-        -2.84817495755985104766e-02,
-        -5.77027029648944159157e-03,
-        -2.37630166566501626084e-05 ]
-      eQq = [ 3.97917223959155352819e-01,
-        6.50222499887672944485e-02,
-        5.08130628187576562776e-03,
-        1.32494738004321644526e-04,
-        -3.96022827877536812320e-06 ]
-      # Coefficients for approximation to erf in [0.84375,1.25]
-      ePa = [-2.36211856075265944077e-03,
-        4.14856118683748331666e-01,
-        -3.72207876035701323847e-01,
-        3.18346619901161753674e-01,
-        -1.10894694282396677476e-01,
-        3.54783043256182359371e-02,
-        -2.16637559486879084300e-03 ]
-      eQa = [ 1.06420880400844228286e-01,
-        5.40397917702171048937e-01,
-        7.18286544141962662868e-02,
-        1.26171219808761642112e-01,
-        1.36370839120290507362e-02,
-        1.19844998467991074170e-02 ]
-      e_erx = 8.45062911510467529297e-01
-      abs_x = (if x >= 0.0 then x else -x end)
-      # 0 < |x| < 0.84375
-      if abs_x < 0.84375
-        #|x| < 2**-28
-        if abs_x < 3.7252902984619141e-9
-        retval = abs_x + abs_x * e_efx
-        else
-        s = x * x
-        p = ePp[0] + s * (ePp[1] + s * (ePp[2] + s * (ePp[3] + s * ePp[4])))
-        q = 1.0 + s * (eQq[0] + s * (eQq[1] + s *
-        ( eQq[2] + s * (eQq[3] + s * eQq[4]))))
-        retval = abs_x + abs_x * (p / q)
-        end
-      elsif abs_x < 1.25
-      s = abs_x - 1.0
-      p = ePa[0] + s * (ePa[1] + s *
-      (ePa[2] + s * (ePa[3] + s *
-      (ePa[4] + s * (ePa[5] + s * ePa[6])))))
-      q = 1.0 + s * (eQa[0] + s *
-      (eQa[1] + s * (eQa[2] + s *
-      (eQa[3] + s * (eQa[4] + s * eQa[5])))))
-      retval = e_erx + p / q
-      elsif abs_x >= 6.0
-      retval = 1.0
-      else
-        retval = 1.0 - complementary_error(abs_x)
-      end
-      return (if x >= 0.0 then retval else -retval end)
-    end
-    # Complementary error function.
-    # Based on C-code for the error function developed at Sun Microsystems.
-    # author Jaco van Kooten
-    def complementary_error(x)
-      # Coefficients for approximation of erfc in [1.25,1/.35]
-      eRa = [-9.86494403484714822705e-03,
-        -6.93858572707181764372e-01,
-        -1.05586262253232909814e01,
-        -6.23753324503260060396e01,
-        -1.62396669462573470355e02,
-        -1.84605092906711035994e02,
-        -8.12874355063065934246e01,
-        -9.81432934416914548592e00 ]
-      eSa = [ 1.96512716674392571292e01,
-        1.37657754143519042600e02,
-        4.34565877475229228821e02,
-        6.45387271733267880336e02,
-        4.29008140027567833386e02,
-        1.08635005541779435134e02,
-        6.57024977031928170135e00,
-        -6.04244152148580987438e-02 ]
-      # Coefficients for approximation to erfc in [1/.35,28]
-      eRb = [-9.86494292470009928597e-03,
-        -7.99283237680523006574e-01,
-        -1.77579549177547519889e01,
-        -1.60636384855821916062e02,
-        -6.37566443368389627722e02,
-        -1.02509513161107724954e03,
-        -4.83519191608651397019e02 ]
-      eSb = [ 3.03380607434824582924e01,
-        3.25792512996573918826e02,
-        1.53672958608443695994e03,
-        3.19985821950859553908e03,
-        2.55305040643316442583e03,
-        4.74528541206955367215e02,
-        -2.24409524465858183362e01 ]
-      abs_x = (if x >= 0.0 then x else -x end)
-      if abs_x < 1.25
-        retval = 1.0 - error(abs_x)
-      elsif abs_x > 28.0
-      retval = 0.0
-      # 1.25 < |x| < 28
-      else
-        s = 1.0/(abs_x * abs_x)
-        if abs_x < 2.8571428
-        r = eRa[0] + s * (eRa[1] + s *
-        (eRa[2] + s * (eRa[3] + s * (eRa[4] + s *
-        (eRa[5] + s *(eRa[6] + s * eRa[7])
-        )))))
-        s = 1.0 + s * (eSa[0] + s * (eSa[1] + s *
-        (eSa[2] + s * (eSa[3] + s * (eSa[4] + s *
-        (eSa[5] + s * (eSa[6] + s * eSa[7])))))))
-        else
-        r = eRb[0] + s * (eRb[1] + s *
-        (eRb[2] + s * (eRb[3] + s * (eRb[4] + s *
-        (eRb[5] + s * eRb[6])))))
-        s = 1.0 + s * (eSb[0] + s *
-        (eSb[1] + s * (eSb[2] + s * (eSb[3] + s *
-        (eSb[4] + s * (eSb[5] + s * eSb[6]))))))
-        end
-        retval =  Math.exp(-x * x - 0.5625 + r/s) / abs_x
-      end
-      return ( if x >= 0.0 then retval else 2.0 - retval end )
-    end
-  end # class
-end # module

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
   prerelease:
 platform: ruby
 authors:
@@ -9,8 +9,19 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-13 00:00:00.000000000 Z
-dependencies: []
+date: 2012-04-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rinruby
+  requirement: &22515480 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 2.0.2
+  type: :runtime
+  prerelease: false
+  version_requirements: *22515480
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
   algorithms and related functions into one single package. Welcome to contact me
   (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -20,8 +31,8 @@ description: FSelector is a Ruby gem that aims to integrate various feature sele
   with certain criterion. FSelector acts on a full-feature data set in either CSV,
   LibSVM or WEKA file format and outputs a reduced data set with only selected subset
   of features, which can later be used as the input for various machine learning softwares
-  including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
-  learning algorithms such as support vector machines and random forest.
+  such as LibSVM and WEKA. FSelector, as a collection of filter methods, does not
+  implement any classifier like support vector machines or random forest.
 email: need47@gmail.com
 executables: []
 extensions: []
@@ -30,6 +41,7 @@ extra_rdoc_files:
 - LICENSE
 files:
 - README.md
+- ChangeLog
 - LICENSE
 - lib/fselector/algo_base/base.rb
 - lib/fselector/algo_base/base_CFS.rb
@@ -70,7 +82,6 @@ files:
 - lib/fselector/algo_discrete/Sensitivity.rb
 - lib/fselector/algo_discrete/Specificity.rb
 - lib/fselector/algo_discrete/SymmetricalUncertainty.rb
-- lib/fselector/chisq_calc.rb
 - lib/fselector/discretizer.rb
 - lib/fselector/ensemble.rb
 - lib/fselector/entropy.rb
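
Note on the new dependency: the metadata now records a runtime requirement of rinruby >= 2.0.2. In a gemspec this would typically be declared as sketched below. The spec is a hypothetical minimal one, and only the dependency line mirrors the diff.

```ruby
require 'rubygems'

# Hypothetical minimal spec; only the rinruby dependency line mirrors the diff
spec = Gem::Specification.new do |s|
  s.name    = 'fselector'
  s.version = '0.6.0'
  s.summary = 'illustrative spec'
  s.authors = ['example']
  s.add_runtime_dependency 'rinruby', '>= 2.0.2'
end

dep = spec.dependencies.first
ok_new = dep.requirement.satisfied_by?(Gem::Version.new('2.0.3'))
ok_old = dep.requirement.satisfied_by?(Gem::Version.new('2.0.1'))
puts "#{dep.name} #{dep.requirement} (#{dep.type})"
```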

data/lib/fselector/chisq_calc.rb DELETED

@@ -1,189 +0,0 @@
-#
-# Chi-Square Calculator
-#
-# This module is adapted from the on-line [Chi-square Calculator](http://www.swogstat.org/stat/public/chisq_calculator.htm)
-#
-# The functions for calculating normal and chi-square probabilities
-# and critical values were adapted by John Walker from C implementations
-# written by Gary Perlman of Wang Institute, Tyngsboro, MA 01879. The
-# original C code is in the public domain.
-#
-# chisq2pval(chisq, df) -- calculate p-value from given
-#                   chi-square value (chisq) and degree of freedom (df)
-# pval2chisq(pval, df) -- chi-square value from given
-#                   p-value (pvalue) and degree of freedom (df)
-#
-module ChiSquareCalculator
-  BIGX = 20.0 # max value to represent exp(x)
-  LOG_SQRT_PI = 0.5723649429247000870717135 # log(sqrt(pi))
-  I_SQRT_PI = 0.5641895835477562869480795 # 1 / sqrt(pi)
-  Z_MAX = 6.0 # Maximum meaningful z value
-  CHI_EPSILON = 0.000001 # Accuracy of critchi approximation
-  CHI_MAX = 99999.0 # Maximum chi-square value
-  #
-  # POCHISQ  --  probability of chi-square value
-  #
-  # Adapted from:
-  #
-  #   Hill, I. D. and Pike, M. C.  Algorithm 299
-  #
-  #   Collected Algorithms for the CACM 1967 p. 243
-  #
-  # Updated for rounding errors based on remark in
-  #
-  #   ACM TOMS June 1985, page 185
-  #
-  # @param [Float] x chi-square value
-  # @param [Integer] df degree of freedom
-  # @return [Float] p-value
-  def pochisq(x, df)
-    a, y, s = nil, nil, nil
-    e, c, z = nil, nil, nil
-    even = nil # True if df is an even number
-    if x <= 0.0 or df < 1
-      return 1.0
-    end
-    a = 0.5 * x
-    even = ((df & 1) == 0)
-    if df > 1
-      y = ex(-a)
-    end
-    s = even ? y : (2.0 * poz(-Math.sqrt(x)))
-    if df > 2
-      x = 0.5 * (df - 1.0)
-      z = even ? 1.0 : 0.5
-      if a > BIGX
-        e = even ? 0.0 : LOG_SQRT_PI
-        c = Math.log(a)
-        while z <= x
-          e = Math.log(z) + e
-          s += ex(c * z - a - e)
-          z += 1.0
-        end
-        return s
-      else
-        e = even ? 1.0 : (I_SQRT_PI / Math.sqrt(a))
-        c = 0.0
-        while (z <= x)
-          e = e * (a / z)
-          c = c + e
-          z += 1.0
-        end
-        return c * y + s
-      end
-    else
-      return s
-    end
-  end # pochisq
-  # function alias
-  alias :chisq2pval :pochisq
-  #
-  # CRITCHI  --  Compute critical chi-square value to
-  # produce given p.  We just do a bisection
-  # search for a value within CHI_EPSILON,
-  # relying on the monotonicity of pochisq()
-  #
-  # @param [Float] p p-value
-  # @param [Integer] df degree of freedom
-  # @return [Float] chi-square value
-  def critchi(p, df)
-    minchisq = 0.0
-    maxchisq = CHI_MAX
-    chisqval = nil
-    if p <= 0.0
-      return maxchisq
-    else
-      if p >= 1.0
-        return 0.0
-      end
-    end
-    chisqval = df / Math.sqrt(p);    # fair first value
-    while (maxchisq - minchisq) > CHI_EPSILON
-      if pochisq(chisqval, df) < p
-        maxchisq = chisqval
-      else
-        minchisq = chisqval
-      end
-      chisqval = (maxchisq + minchisq) * 0.5
-     end
-     return chisqval
-  end # critchi
-  # function alias
-  alias :pval2chisq :critchi
-  private
-  def ex(x)
-    return (x < -BIGX) ? 0.0 : Math.exp(x)
-  end # ex
-  #
-  # POZ  --  probability of normal z value
-  #
-  # Adapted from a polynomial approximation in:
-  #  Ibbetson D, Algorithm 209
-  #  Collected Algorithms of the CACM 1963 p. 616
-  #
-  # Note:
-  #   This routine has six digit accuracy, so it is only useful for absolute
-  #   z values < 6.  For z values >= to 6.0, poz() returns 0.0
-  #
-   def poz(z)
-    y, x, w = nil, nil, nil
-    if (z == 0.0)
-      x = 0.0
-    else
-      y = 0.5 * z.abs # Math.abs(z)
-      if (y >= (Z_MAX * 0.5))
-        x = 1.0
-      elsif (y < 1.0)
-        w = y * y
-        x = ((((((((0.000124818987 * w - 0.001075204047) * w +
-            0.005198775019) * w - 0.019198292004) * w +
-            0.059054035642) * w - 0.151968751364) * w +
-            0.319152932694) * w - 0.531923007300) * w +
-            0.797884560593) * y * 2.0
-      else
-        y -= 2.0
-        x = (((((((((((((-0.000045255659 * y +
-            0.000152529290) * y - 0.000019538132) * y -
-            0.000676904986) * y + 0.001390604284) * y -
-            0.000794620820) * y - 0.002034254874) * y +
-            0.006549791214) * y - 0.010557625006) * y +
-            0.011630447319) * y - 0.009279453341) * y +
-            0.005353579108) * y - 0.002141268741) * y +
-            0.000535310849) * y + 0.999936657524
-      end
-    end
-    return z > 0.0 ? ((x + 1.0) * 0.5) : ((1.0 - x) * 0.5)
-  end # poz
-end # module