RubyGems - fselector - Versions diffs - 0.5.0 → 0.6.0 - Mend

fselector 0.5.0 → 0.6.0

Files changed (11)

data/ChangeLog +7 -0
data/README.md +18 -7
data/lib/fselector.rb +4 -3
data/lib/fselector/algo_base/base.rb +7 -0
data/lib/fselector/algo_discrete/BiNormalSeparation.rb +3 -4
data/lib/fselector/algo_discrete/FishersExactTest.rb +5 -7
data/lib/fselector/discretizer.rb +15 -2
data/lib/fselector/fileio.rb +19 -4
data/lib/fselector/util.rb +0 -585
metadata +17 -6
data/lib/fselector/chisq_calc.rb +0 -189

data/ChangeLog ADDED

@@ -0,0 +1,7 @@
+2012-04-18	Tiejun Cheng	<need47@gmail.com>
+  * require the RinRuby gem (http://rinruby.ddahl.org) to access the
+    statistical routines in the R package (http://www.r-project.org/)
+  * because of RinRuby (and thus R), removed the following modules or implementations:
+    RubyStats (FishersExactTest.calculate, get_icdf) and ChiSquareCalculator

data/README.md CHANGED

@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 0.5.0
-**Release Date**: April 13 2012
+**Latest Version**: 0.6.0
+**Release Date**: April 19 2012
 Synopsis
 --------
@@ -25,9 +25,9 @@ missing feature values with certain criterion. FSelector acts on a
 full-feature data set in either CSV, LibSVM or WEKA file format and
 outputs a reduced data set with only selected subset of features, which
 can later be used as the input for various machine learning softwares
-including LibSVM and WEKA. FSelector, itself, does not implement
-any of the machine learning algorithms such as support vector machines
-and random forest. See below for a list of FSelector's features.
+such as LibSVM and WEKA. FSelector, as a collection of filter methods,
+does not implement any classifier like support vector machines or
+random forest. See below for a list of FSelector's features.
 Feature List
 ------------
@@ -78,7 +78,7 @@ Feature List
     ReliefF_c                         ReliefF_c   continuous
     TScore                            TS          continuous
-  **feature selection interace:**
+  **note for feature selection interace:**
     - for the algorithms of CFS\_d, FCBF and CFS\_c, use select\_feature!
     - for other algorithms, use either select\_feature\_by\_rank! or select\_feature\_by\_score!
@@ -115,7 +115,13 @@ Installing
 To install FSelector, use the following command:
     $ gem install fselector
+  **note:** Start from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
+  as a seemless bridge to access the statistical routines in the R package (http://www.r-project.org),
+  which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
+  on statistical test. To this end, please pre-install the R package. RinRuby should have been
+  auto-installed with FSelector.
 Usage
 -----
@@ -223,6 +229,11 @@ Usage
 **4. see more examples test_*.rb under the test/ directory**
+Change Log
+----------
+A {file:ChangeLog} is available from version 0.5.0 and upward to refelect
+what's new and what's changed
 Copyright
 ---------
 FSelector &copy; 2012 by [Tiejun Cheng](mailto:need47@gmail.com).

data/lib/fselector.rb CHANGED

@@ -1,9 +1,12 @@
+# access to the statistical routines in R package
+require 'rinruby'
 #
 # FSelector: a Ruby gem for feature selection and ranking
 #
 module FSelector
   # module version
-  VERSION = '0.5.0'
+  VERSION = '0.6.0'
 end
 ROOT = File.expand_path(File.dirname(__FILE__))
@@ -17,8 +20,6 @@ require "#{ROOT}/fselector/fileio.rb"
 require "#{ROOT}/fselector/util.rb"
 # entropy-related functions
 require "#{ROOT}/fselector/entropy.rb"
-# chi-square calculator
-require "#{ROOT}/fselector/chisq_calc.rb"
 # normalization for continuous data
 require "#{ROOT}/fselector/normalizer.rb"
 # discretization for continuous data

data/lib/fselector/algo_base/base.rb CHANGED

@@ -165,6 +165,13 @@ module FSelector
     end
+    # get a copy of data,
+    # by use of the standard Marshal library
+    def get_data_copy
+      Marshal.load(Marshal.dump(@data)) if @data
+    end
     # set data
     def set_data(data)
       if data and data.class == Hash
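
The new `get_data_copy` relies on a `Marshal` round trip to obtain a deep copy. The following is a minimal sketch of why that matters for a nested structure; the sample data and variable names are hypothetical and only mirror the class-to-samples layout documented in `fileio.rb`.

```ruby
# Hypothetical data in the class => samples layout used by FSelector.
data = {
  :c1 => [{ :f1 => 1, :f2 => 2 }],
  :c2 => [{ :f1 => 1, :f3 => 3 }, { :f2 => 2 }]
}

# Shallow copy: the outer Hash is new, but the inner Arrays and Hashes are shared.
shallow = data.dup

# Deep copy: serialize and deserialize the whole object graph.
deep = Marshal.load(Marshal.dump(data))

# Mutate a nested sample in the original.
data[:c1][0][:f1] = 99

shallow_value = shallow[:c1][0][:f1] # follows the mutation
deep_value    = deep[:c1][0][:f1]    # keeps the old value
```

A deep copy lets a caller modify samples without touching the data held by the selector, which a plain `dup` or `clone` would not guarantee.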

data/lib/fselector/algo_discrete/BiNormalSeparation.rb CHANGED

@@ -13,14 +13,11 @@ module FSelector
 # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974) and [Rubystats](http://rubystats.rubyforge.org)
 #
   class BiNormalSeparation < BaseDiscrete
-    # include Ruby statistics libraries
-    include Rubystats
     private
     # calculate contribution of each feature (f) for each class (k)
     def calc_contribution(f)
-      @nd ||= Rubystats::NormalDistribution.new
       each_class do |k|
         a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
@@ -28,7 +25,9 @@ module FSelector
         s = 0.0
         if not (a+c).zero? and not (b+d).zero?
           tpr, fpr = a/(a+c), b/(b+d)
-          s = (@nd.get_icdf(tpr) - @nd.get_icdf(fpr)).abs
+          R.eval "rv <- qnorm(#{tpr}) - qnorm(#{fpr})"
+          s = R.rv.abs
         end
         set_feature_score(f, k, s)
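
For checking the `qnorm`-based score without an R installation, the same quantity can be approximated in plain Ruby. This is only a sketch: `std_normal_cdf`, `inv_std_normal_cdf`, and `bns_score` are hypothetical helper names, the inverse CDF is found by bisection rather than by R's algorithm, and the `eps` clamp is an assumption added here (the hunk above passes the raw rates to `qnorm`, which yields infinite values at 0 or 1).

```ruby
# Standard normal CDF via the error function from Ruby's Math module.
def std_normal_cdf(x)
  0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# Inverse standard normal CDF by bisection; a stand-in for R's qnorm.
# Assumes 0 < p < 1.
def inv_std_normal_cdf(p)
  lo, hi = -10.0, 10.0
  100.times do
    mid = (lo + hi) / 2.0
    if std_normal_cdf(mid) < p
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Bi-normal separation for one feature/class pair.
# a, b, c, d follow the usage in the diff: tpr = a/(a+c), fpr = b/(b+d).
# eps is a hypothetical clamp that keeps both rates away from 0 and 1.
def bns_score(a, b, c, d, eps = 0.0005)
  return 0.0 if (a + c).zero? || (b + d).zero?
  tpr = (a.to_f / (a + c)).clamp(eps, 1.0 - eps)
  fpr = (b.to_f / (b + d)).clamp(eps, 1.0 - eps)
  (inv_std_normal_cdf(tpr) - inv_std_normal_cdf(fpr)).abs
end
```

Bisection is slower than a rational approximation, but it is short and easy to verify, which suits a cross-check.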

data/lib/fselector/algo_discrete/FishersExactTest.rb CHANGED

@@ -11,24 +11,22 @@ module FSelector
 #
 #     for FET, the smaller, the better, but we intentionally negate it
 #     so that the larger is always the better (consistent with other algorithms)
+#     R equivalent: fisher.test
 #
 # ref: [Wikipedia](http://en.wikipedia.org/wiki/Fisher's_exact_test) and [Rubystats](http://rubystats.rubyforge.org)
 #
   class FishersExactTest < BaseDiscrete
-    # include Ruby statistics libraries
-    include Rubystats
     private
     # calculate contribution of each feature (f) for each class (k)
-    def calc_contribution(f)
-      @fet ||= Rubystats::FishersExactTest.new
+    def calc_contribution(f)
       each_class do |k|
         a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
-        # note: we've intentionally negated it
-        s = -1 * @fet.calculate(a, b, c, d)[:twotail]
+        # note: intentionally negated it
+        R.eval "rv <- fisher.test(matrix(c(#{a}, #{b}, #{c}, #{d}), nrow=2))$p.value"
+        s = -1.0 * R.rv
         set_feature_score(f, k, s)
       end
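
The two-sided p-value that the hunk above takes from `fisher.test` can be reproduced for small tables in plain Ruby, which is useful as a sanity check. The sketch below is not the removed Rubystats code and not R's implementation; `log_choose` and `fisher_two_sided` are hypothetical names, and the tie tolerance is an assumption chosen for this example.

```ruby
# Log of the binomial coefficient C(n, k), via the log-gamma function.
def log_choose(n, k)
  Math.lgamma(n + 1)[0] - Math.lgamma(k + 1)[0] - Math.lgamma(n - k + 1)[0]
end

# Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].
# Sums the hypergeometric probabilities of all tables with the same margins
# that are no more likely than the observed table.
def fisher_two_sided(a, b, c, d)
  row1 = a + b
  col1 = a + c
  n = a + b + c + d
  lo = [0, row1 + col1 - n].max
  hi = [row1, col1].min
  log_denom = log_choose(n, col1)
  prob = lambda do |x|
    Math.exp(log_choose(row1, x) + log_choose(n - row1, col1 - x) - log_denom)
  end
  observed = prob.call(a)
  # Small relative tolerance so that ties are not lost to rounding
  # (an assumption for this sketch).
  cutoff = observed * (1.0 + 1e-7)
  total = (lo..hi).map { |x| prob.call(x) }.select { |p| p <= cutoff }.sum
  [total, 1.0].min
end

p_value = fisher_two_sided(3, 1, 1, 3)
# Negated, as in the diff, so that a larger score is better.
score = -1.0 * p_value
```

The p-value does not change when the table is transposed, so the column-wise fill order of R's `matrix(c(a, b, c, d), nrow=2)` does not affect the result.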

data/lib/fselector/discretizer.rb CHANGED

@@ -4,8 +4,6 @@
 module Discretizer
   # include Entropy module
   include Entropy
-  # include ChiSquareCalculator module
-  include ChiSquareCalculator
   # discretize by equal-width intervals
   #
@@ -334,6 +332,19 @@ module Discretizer
   private
+  #
+  # get the Chi-square value from p-value
+  #
+  # @param [Float] pval p-value
+  # @param [Integer] df degree of freedom
+  # @return [Float] Chi-square vlaue
+  #
+  def pval2chisq(pval, df)
+    R.eval "chisq <- qchisq(#{1-pval}, #{df})"
+    R.chisq
+  end
   #
   # get index from sorted cut points
   #
@@ -388,6 +399,7 @@ module Discretizer
     clear_vars
   end
   #
   # Chi2: initialization
   #
@@ -423,6 +435,7 @@ module Discretizer
     [bs, cs, qs]
   end
   #
   # Chi2: merge two adjacent intervals
   #
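
The new `pval2chisq` delegates to R's `qchisq(1 - pval, df)`. A rough pure-Ruby counterpart, in the spirit of the bisection search in the deleted `chisq_calc.rb`, can serve as a cross-check. This is a sketch under stated assumptions: `chisq_upper_tail` and `pval2chisq_sketch` are hypothetical names, `df` must be a positive integer, and `pval` is assumed to lie strictly between 0 and 1.

```ruby
# Upper-tail probability P(X > x) for a chi-square variable with integer
# df >= 1, using the finite-sum closed forms for integer degrees of freedom.
def chisq_upper_tail(x, df)
  return 1.0 if x <= 0.0
  a = x / 2.0
  if df.even?
    # exp(-a) * sum_{k=0}^{df/2 - 1} a^k / k!
    term = 1.0
    sum = 1.0
    (1...(df / 2)).each do |k|
      term *= a / k
      sum += term
    end
    Math.exp(-a) * sum
  else
    # erfc(sqrt(a)) + exp(-a) * sum_{k=1}^{(df-1)/2} a^(k - 1/2) / Gamma(k + 1/2)
    term = 2.0 * Math.sqrt(a / Math::PI) # the k = 1 term
    sum = 0.0
    (1..((df - 1) / 2)).each do |k|
      sum += term
      term *= a / (k + 0.5)
    end
    Math.erfc(Math.sqrt(a)) + Math.exp(-a) * sum
  end
end

# Chi-square value whose upper-tail probability equals pval.
def pval2chisq_sketch(pval, df)
  lo = 0.0
  hi = 1.0
  # Grow the upper bound until it brackets the answer.
  hi *= 2.0 while chisq_upper_tail(hi, df) > pval
  100.times do
    mid = (lo + hi) / 2.0
    if chisq_upper_tail(mid, df) > pval
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end
```

The search relies on the upper-tail probability decreasing monotonically in `x`, the same property the removed `critchi` depended on.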

data/lib/fselector/fileio.rb CHANGED

@@ -1,8 +1,23 @@
 #
-# read and write various file formats
+# read and write various file formats,
+# the internal data structure looks like:
+#
+#     data = {
+#
+#       :c1 => [                             # class c1
+#         {:f1=>1, :f2=>2}                   # sample 2
+#       ],
+#
+#       :c2 => [                             # class c2
+#         {:f1=>1, :f3=>3},                  # sample 1
+#         {:f2=>2}                           # sample 3
+#       ]
+#
+#     }
+#
+#     where :c1 and :c2 are class labels; :f1, :f2, and :f3 are features
 #
-# @note class labels and features are treated as symbols,
-#       e.g. length => :length
+# @note class labels and features are treated as symbols
 #
 module FileIO
   #
@@ -40,7 +55,7 @@ module FileIO
         if ncategory == 1
           feats[f] = 1
         elsif ncategory > 1
-          feats[f] = rand(ncategory)
+          feats[f] = rand(ncategory)+1
         else
           feats[f] = rand
         end
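
The data layout documented in the new header comment, and the effect of the `rand(ncategory)+1` change, can be illustrated with a few lines of Ruby. The variable names below are hypothetical; only the hash layout and the `rand` expression come from the diff.

```ruby
# The structure from the header comment: class label => array of samples,
# each sample a Hash of feature => value. Absent features are simply omitted.
data = {
  :c1 => [{ :f1 => 1, :f2 => 2 }],
  :c2 => [{ :f1 => 1, :f3 => 3 }, { :f2 => 2 }]
}

classes      = data.keys
features     = data.values.flatten.flat_map(&:keys).uniq.sort
sample_count = data.values.map(&:size).sum

# rand(n) returns an Integer in 0...n, so rand(n) + 1 falls in 1..n.
ncategory = 3
draws = Array.new(1000) { rand(ncategory) + 1 }
```

With the `+1`, a randomly generated categorical feature takes values from 1 through `ncategory` rather than from 0 through `ncategory - 1`.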

data/lib/fselector/util.rb CHANGED

@@ -149,588 +149,3 @@ end # String
 #=>a
 #=>_'b,c, d'_
 #=>'e'
-#
-# adapted from the Ruby statistics libraries --
-# [Rubystats](http://rubystats.rubyforge.org)
-#
-# - for Fisher's exact test (Rubystats::FishersExactTest.calculate())
-#   used by algo\_binary/FishersExactText.rb
-# - for inverse cumulative normal distribution function (Rubystats::NormalDistribution.get\_icdf())
-#   used by algo\_binary/BiNormalSeparation.rb. note the original get\_icdf() function is a private
-#   one, so we have to open it up and that's why the codes here.
-#
-#
- module Rubystats
-  MAX_VALUE = 1.2e290
-  SQRT2PI = 2.5066282746310005024157652848110452530069867406099
-  SQRT2 = 1.4142135623730950488016887242096980785696718753769
-  TWO_PI = 6.2831853071795864769252867665590057683943387987502
-  #
-  # Fisher's exact test calculator
-  #
-  class FishersExactTest
-    # new()
-    def initialize
-      @sn11    = 0.0
-      @sn1_    = 0.0
-      @sn_1    = 0.0
-      @sn      = 0.0
-      @sprob   = 0.0
-      @sleft   = 0.0
-      @sright  = 0.0
-      @sless   = 0.0
-      @slarg   = 0.0
-      @left    = 0.0
-      @right   = 0.0
-      @twotail = 0.0
-    end
-    # Fisher's exact test
-    def calculate(n11_,n12_,n21_,n22_)
-      n11_ *= -1 if n11_ < 0
-      n12_ *= -1 if n12_ < 0
-      n21_ *= -1 if n21_ < 0
-      n22_ *= -1 if n22_ < 0
-      n1_     = n11_ + n12_
-      n_1     = n11_ + n21_
-      n       = n11_ + n12_ + n21_ + n22_
-      prob    = exact(n11_,n1_,n_1,n)
-      left    = @sless
-      right   = @slarg
-      twotail = @sleft + @sright
-      twotail = 1 if twotail > 1
-      values_hash = { :left =>left, :right =>right, :twotail =>twotail }
-      return values_hash
-    end
-    private
-    # Reference: "Lanczos, C. 'A precision approximation
-    # of the gamma function', J. SIAM Numer. Anal., B, 1, 86-96, 1964."
-    # Translation of  Alan Miller's FORTRAN-implementation
-    # See http://lib.stat.cmu.edu/apstat/245
-    def lngamm(z)
-      x = 0
-      x += 0.0000001659470187408462/(z+7)
-      x += 0.000009934937113930748 /(z+6)
-      x -= 0.1385710331296526      /(z+5)
-      x += 12.50734324009056       /(z+4)
-      x -= 176.6150291498386       /(z+3)
-      x += 771.3234287757674       /(z+2)
-      x -= 1259.139216722289       /(z+1)
-      x += 676.5203681218835       /(z)
-      x += 0.9999999999995183
-      return(Math.log(x)-5.58106146679532777-z+(z-0.5) * Math.log(z+6.5))
-    end
-    def lnfact(n)
-      if n <= 1
-        return 0
-      else
-        return lngamm(n+1)
-      end
-    end
-    def lnbico(n,k)
-      return lnfact(n) - lnfact(k) - lnfact(n-k)
-    end
-    def hyper_323(n11, n1_, n_1, n)
-      return Math.exp(lnbico(n1_, n11) + lnbico(n-n1_, n_1-n11) - lnbico(n, n_1))
-    end
-    def hyper(n11)
-      return hyper0(n11, 0, 0, 0)
-    end
-    def hyper0(n11i,n1_i,n_1i,ni)
-      if n1_i == 0 and n_1i ==0 and ni == 0
-        unless n11i % 10 == 0
-          if n11i == @sn11+1
-            @sprob *= ((@sn1_ - @sn11)/(n11i.to_f))*((@sn_1 - @sn11)/(n11i.to_f + @sn - @sn1_ - @sn_1))
-            @sn11 = n11i
-            return @sprob
-          end
-          if n11i == @sn11-1
-            @sprob *= ((@sn11)/(@sn1_-n11i.to_f))*((@sn11+@sn-@sn1_-@sn_1)/(@sn_1-n11i.to_f))
-            @sn11 = n11i
-            return @sprob
-          end
-        end
-        @sn11 = n11i
-      else
-        @sn11 = n11i
-        @sn1_ = n1_i
-        @sn_1 = n_1i
-        @sn   = ni
-      end
-      @sprob = hyper_323(@sn11,@sn1_,@sn_1,@sn)
-      return @sprob
-    end
-    def exact(n11,n1_,n_1,n)
-      p = i = j = prob = 0.0
-      max = n1_
-      max = n_1 if n_1 < max
-      min = n1_ + n_1 - n
-      min = 0 if min < 0
-      if min == max
-        @sless  = 1
-        @sright = 1
-        @sleft  = 1
-        @slarg  = 1
-        return 1
-      end
-      prob = hyper0(n11,n1_,n_1,n)
-      @sleft = 0
-      p = hyper(min)
-      i = min + 1
-      while p < (0.99999999 * prob)
-        @sleft += p
-        p = hyper(i)
-        i += 1
-      end
-      i -= 1
-      if p < (1.00000001*prob)
-        @sleft += p
-      else
-        i -= 1
-      end
-      @sright = 0
-      p = hyper(max)
-      j = max - 1
-      while p < (0.99999999 * prob)
-        @sright += p
-        p = hyper(j)
-        j -= 1
-      end
-      j += 1
-      if p < (1.00000001*prob)
-        @sright += p
-      else
-        j += 1
-      end
-      if (i - n11).abs < (j - n11).abs
-        @sless = @sleft
-        @slarg = 1 - @sleft + prob
-      else
-        @sless = 1 - @sright + prob
-        @slarg = @sright
-      end
-      return prob
-    end
-  end # class
-  #
-  # Normal distribution
-  #
-  class NormalDistribution
-    # Constructs a normal distribution (defaults to zero mean and
-    # unity variance)
-    def initialize(mu=0.0, sigma=1.0)
-      @mean = mu
-      if sigma <= 0.0
-        return "error"
-      end
-      @stdev = sigma
-      @variance = sigma**2
-      @pdf_denominator = SQRT2PI * Math.sqrt(@variance)
-      @cdf_denominator = SQRT2   * Math.sqrt(@variance)
-    end
-    # Obtain single PDF value
-    # Returns the probability that a stochastic variable x has the value X,
-    # i.e. P(x=X)
-    def get_pdf(x)
-      Math.exp( -((x-@mean)**2) / (2 * @variance)) / @pdf_denominator
-    end
-    # Obtain single CDF value
-    # Returns the probability that a stochastic variable x is less than X,
-    # i.e. P(x<X)
-    def get_cdf(x)
-      complementary_error( -(x - @mean) / @cdf_denominator) / 2
-    end
-    # Obtain single inverse CDF value.
-    # returns the value X for which P(x<X).
-    def get_icdf(p)
-      check_range(p)
-      if p == 0.0
-        return -MAX_VALUE
-      end
-      if p == 1.0
-        return MAX_VALUE
-      end
-      if p == 0.5
-      return @mean
-      end
-      mean_save = @mean
-      var_save = @variance
-      pdf_D_save = @pdf_denominator
-      cdf_D_save = @cdf_denominator
-      @mean = 0.0
-      @variance = 1.0
-      @pdf_denominator = Math.sqrt(TWO_PI)
-      @cdf_denominator = SQRT2
-      x = find_root(p, 0.0, -100.0, 100.0)
-      #scale back
-      @mean = mean_save
-      @variance = var_save
-      @pdf_denominator = pdf_D_save
-      @cdf_denominator = cdf_D_save
-      return x * Math.sqrt(@variance) + @mean
-    end
-    private
-    #check that variable is between lo and hi limits.
-    #lo default is 0.0 and hi default is 1.0
-    def check_range(x, lo=0.0, hi=1.0)
-      raise ArgumentError.new("x cannot be nil") if x.nil?
-      if x < lo or x > hi
-        raise ArgumentError.new("x must be less than lo (#{lo}) and greater than hi (#{hi})")
-      end
-    end
-    def find_root(prob, guess, x_lo, x_hi)
-      accuracy = 1.0e-10
-      max_iteration = 150
-      x     = guess
-      x_new = guess
-      error = 0.0
-      _pdf  = 0.0
-      dx    = 1000.0
-      i     = 0
-      while ( dx.abs > accuracy && (i += 1) < max_iteration )
-        #Apply Newton-Raphson step
-        error = cdf(x) - prob
-        if error < 0.0
-        x_lo = x
-        else
-        x_hi = x
-        end
-        _pdf = pdf(x)
-        if _pdf != 0.0
-        dx = error / _pdf
-        x_new = x -dx
-        end
-        # If the NR fails to converge (which for example may be the
-        # case if the initial guess is too rough) we apply a bisection
-        # step to determine a more narrow interval around the root.
-        if  x_new < x_lo || x_new > x_hi || _pdf == 0.0
-        x_new = (x_lo + x_hi) / 2.0
-        dx = x_new - x
-        end
-        x = x_new
-      end
-      return x
-    end
-    #Probability density function
-    def pdf(x)
-      if x.class == Array
-        pdf_vals = []
-        for i in (0 ... x.length)
-          pdf_vals[i] = get_pdf(x[i])
-        end
-      return pdf_vals
-      else
-        return get_pdf(x)
-      end
-    end
-    #Cummulative distribution function
-    def cdf(x)
-      if x.class == Array
-        cdf_vals = []
-        for i in (0...x.size)
-          cdf_vals[i] = get_cdf(x[i])
-        end
-      return cdf_vals
-      else
-        return get_cdf(x)
-      end
-    end
-    # Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
-    #
-    # Developed at SunSoft, a Sun Microsystems, Inc. business.
-    # Permission to use, copy, modify, and distribute this
-    # software is freely granted, provided that this notice
-    # is preserved.
-    #
-    #                 x
-    #              2      |\
-    #     erf(x)  =  ---------  | exp(-t*t)dt
-    #            sqrt(pi) \|
-    #                 0
-    #
-    #     erfc(x) =  1-erf(x)
-    #  Note that
-    #        erf(-x) = -erf(x)
-    #        erfc(-x) = 2 - erfc(x)
-    #
-    # Method:
-    #    1. For |x| in [0, 0.84375]
-    #        erf(x)  = x + x*R(x^2)
-    #          erfc(x) = 1 - erf(x)           if x in [-.84375,0.25]
-    #                  = 0.5 + ((0.5-x)-x*R)  if x in [0.25,0.84375]
-    #       where R = P/Q where P is an odd poly of degree 8 and
-    #       Q is an odd poly of degree 10.
-    #                         -57.90
-    #            | R - (erf(x)-x)/x | <= 2
-    #
-    #
-    #       Remark. The formula is derived by noting
-    #          erf(x) = (2/sqrt(pi))*(x - x^3/3 + x^5/10 - x^7/42 + ....)
-    #       and that
-    #          2/sqrt(pi) = 1.128379167095512573896158903121545171688
-    #       is close to one. The interval is chosen because the fix
-    #       point of erf(x) is near 0.6174 (i.e., erf(x)=x when x is
-    #       near 0.6174), and by some experiment, 0.84375 is chosen to
-    #        guarantee the error is less than one ulp for erf.
-    #
-    #      2. For |x| in [0.84375,1.25], let s = |x| - 1, and
-    #         c = 0.84506291151 rounded to single (24 bits)
-    #             erf(x)  = sign(x) * (c  + P1(s)/Q1(s))
-    #             erfc(x) = (1-c)  - P1(s)/Q1(s) if x > 0
-    #              1+(c+P1(s)/Q1(s))    if x < 0
-    #             |P1/Q1 - (erf(|x|)-c)| <= 2**-59.06
-    #       Remark: here we use the taylor series expansion at x=1.
-    #        erf(1+s) = erf(1) + s*Poly(s)
-    #             = 0.845.. + P1(s)/Q1(s)
-    #              That is, we use rational approximation to approximate
-    #                 erf(1+s) - (c = (single)0.84506291151)
-    #            Note that |P1/Q1|< 0.078 for x in [0.84375,1.25]
-    #            where
-    #             P1(s) = degree 6 poly in s
-    #             Q1(s) = degree 6 poly in s
-    #
-    #           3. For x in [1.25,1/0.35(~2.857143)],
-    #                  erfc(x) = (1/x)*exp(-x*x-0.5625+R1/S1)
-    #                  erf(x)  = 1 - erfc(x)
-    #            where
-    #             R1(z) = degree 7 poly in z, (z=1/x^2)
-    #             S1(z) = degree 8 poly in z
-    #
-    #           4. For x in [1/0.35,28]
-    #                  erfc(x) = (1/x)*exp(-x*x-0.5625+R2/S2) if x > 0
-    #                 = 2.0 - (1/x)*exp(-x*x-0.5625+R2/S2) if -6<x<0
-    #                 = 2.0 - tiny        (if x <= -6)
-    #                  erf(x)  = sign(x)*(1.0 - erfc(x)) if x < 6, else
-    #                  erf(x)  = sign(x)*(1.0 - tiny)
-    #            where
-    #             R2(z) = degree 6 poly in z, (z=1/x^2)
-    #             S2(z) = degree 7 poly in z
-    #
-    #           Note1:
-    #            To compute exp(-x*x-0.5625+R/S), let s be a single
-    #            PRECISION number and s := x then
-    #             -x*x = -s*s + (s-x)*(s+x)
-    #                 exp(-x*x-0.5626+R/S) =
-    #                 exp(-s*s-0.5625)*exp((s-x)*(s+x)+R/S)
-    #           Note2:
-    #            Here 4 and 5 make use of the asymptotic series
-    #                   exp(-x*x)
-    #             erfc(x) ~ ---------- * ( 1 + Poly(1/x^2) )
-    #                   x*sqrt(pi)
-    #            We use rational approximation to approximate
-    #               g(s)=f(1/x^2) = log(erfc(x)*x) - x*x + 0.5625
-    #            Here is the error bound for R1/S1 and R2/S2
-    #               |R1/S1 - f(x)|  < 2**(-62.57)
-    #               |R2/S2 - f(x)|  < 2**(-61.52)
-    #
-    #            5. For inf > x >= 28
-    #                             erf(x)  = sign(x) *(1 - tiny)  (raise inexact)
-    #                             erfc(x) = tiny*tiny (raise underflow) if x > 0
-    #                           = 2 - tiny if x<0
-    #
-    #            7. Special case:
-    #                             erf(0)  = 0, erf(inf)  = 1, erf(-inf) = -1,
-    #                             erfc(0) = 1, erfc(inf) = 0, erfc(-inf) = 2,
-    #                           erfc/erf(NaN) is NaN
-    #
-    #               $efx8 = 1.02703333676410069053e00
-    #
-    #                 Coefficients for approximation to erf on [0,0.84375]
-    #
-    # Error function.
-    # Based on C-code for the error function developed at Sun Microsystems.
-    # Author:: Jaco van Kooten
-    def error(x)
-      e_efx = 1.28379167095512586316e-01
-      ePp = [ 1.28379167095512558561e-01,
-        -3.25042107247001499370e-01,
-        -2.84817495755985104766e-02,
-        -5.77027029648944159157e-03,
-        -2.37630166566501626084e-05 ]
-      eQq = [ 3.97917223959155352819e-01,
-        6.50222499887672944485e-02,
-        5.08130628187576562776e-03,
-        1.32494738004321644526e-04,
-        -3.96022827877536812320e-06 ]
-      # Coefficients for approximation to erf in [0.84375,1.25]
-      ePa = [-2.36211856075265944077e-03,
-        4.14856118683748331666e-01,
-        -3.72207876035701323847e-01,
-        3.18346619901161753674e-01,
-        -1.10894694282396677476e-01,
-        3.54783043256182359371e-02,
-        -2.16637559486879084300e-03 ]
-      eQa = [ 1.06420880400844228286e-01,
-        5.40397917702171048937e-01,
-        7.18286544141962662868e-02,
-        1.26171219808761642112e-01,
-        1.36370839120290507362e-02,
-        1.19844998467991074170e-02 ]
-      e_erx = 8.45062911510467529297e-01
-      abs_x = (if x >= 0.0 then x else -x end)
-      # 0 < |x| < 0.84375
-      if abs_x < 0.84375
-        #|x| < 2**-28
-        if abs_x < 3.7252902984619141e-9
-        retval = abs_x + abs_x * e_efx
-        else
-        s = x * x
-        p = ePp[0] + s * (ePp[1] + s * (ePp[2] + s * (ePp[3] + s * ePp[4])))
-        q = 1.0 + s * (eQq[0] + s * (eQq[1] + s *
-        ( eQq[2] + s * (eQq[3] + s * eQq[4]))))
-        retval = abs_x + abs_x * (p / q)
-        end
-      elsif abs_x < 1.25
-      s = abs_x - 1.0
-      p = ePa[0] + s * (ePa[1] + s *
-      (ePa[2] + s * (ePa[3] + s *
-      (ePa[4] + s * (ePa[5] + s * ePa[6])))))
-      q = 1.0 + s * (eQa[0] + s *
-      (eQa[1] + s * (eQa[2] + s *
-      (eQa[3] + s * (eQa[4] + s * eQa[5])))))
-      retval = e_erx + p / q
-      elsif abs_x >= 6.0
-      retval = 1.0
-      else
-        retval = 1.0 - complementary_error(abs_x)
-      end
-      return (if x >= 0.0 then retval else -retval end)
-    end
-    # Complementary error function.
-    # Based on C-code for the error function developed at Sun Microsystems.
-    # author Jaco van Kooten
-    def complementary_error(x)
-      # Coefficients for approximation of erfc in [1.25,1/.35]
-      eRa = [-9.86494403484714822705e-03,
-        -6.93858572707181764372e-01,
-        -1.05586262253232909814e01,
-        -6.23753324503260060396e01,
-        -1.62396669462573470355e02,
-        -1.84605092906711035994e02,
-        -8.12874355063065934246e01,
-        -9.81432934416914548592e00 ]
-      eSa = [ 1.96512716674392571292e01,
-        1.37657754143519042600e02,
-        4.34565877475229228821e02,
-        6.45387271733267880336e02,
-        4.29008140027567833386e02,
-        1.08635005541779435134e02,
-        6.57024977031928170135e00,
-        -6.04244152148580987438e-02 ]
-      # Coefficients for approximation to erfc in [1/.35,28]
-      eRb = [-9.86494292470009928597e-03,
-        -7.99283237680523006574e-01,
-        -1.77579549177547519889e01,
-        -1.60636384855821916062e02,
-        -6.37566443368389627722e02,
-        -1.02509513161107724954e03,
-        -4.83519191608651397019e02 ]
-      eSb = [ 3.03380607434824582924e01,
-        3.25792512996573918826e02,
-        1.53672958608443695994e03,
-        3.19985821950859553908e03,
-        2.55305040643316442583e03,
-        4.74528541206955367215e02,
-        -2.24409524465858183362e01 ]
-      abs_x = (if x >= 0.0 then x else -x end)
-      if abs_x < 1.25
-        retval = 1.0 - error(abs_x)
-      elsif abs_x > 28.0
-      retval = 0.0
-      # 1.25 < |x| < 28
-      else
-        s = 1.0/(abs_x * abs_x)
-        if abs_x < 2.8571428
-        r = eRa[0] + s * (eRa[1] + s *
-        (eRa[2] + s * (eRa[3] + s * (eRa[4] + s *
-        (eRa[5] + s *(eRa[6] + s * eRa[7])
-        )))))
-        s = 1.0 + s * (eSa[0] + s * (eSa[1] + s *
-        (eSa[2] + s * (eSa[3] + s * (eSa[4] + s *
-        (eSa[5] + s * (eSa[6] + s * eSa[7])))))))
-        else
-        r = eRb[0] + s * (eRb[1] + s *
-        (eRb[2] + s * (eRb[3] + s * (eRb[4] + s *
-        (eRb[5] + s * eRb[6])))))
-        s = 1.0 + s * (eSb[0] + s *
-        (eSb[1] + s * (eSb[2] + s * (eSb[3] + s *
-        (eSb[4] + s * (eSb[5] + s * eSb[6]))))))
-        end
-        retval =  Math.exp(-x * x - 0.5625 + r/s) / abs_x
-      end
-      return ( if x >= 0.0 then retval else 2.0 - retval end )
-    end
-  end # class
-end # module

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
   prerelease:
 platform: ruby
 authors:
@@ -9,8 +9,19 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-04-13 00:00:00.000000000 Z
-dependencies: []
+date: 2012-04-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rinruby
+  requirement: &22515480 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 2.0.2
+  type: :runtime
+  prerelease: false
+  version_requirements: *22515480
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
   algorithms and related functions into one single package. Welcome to contact me
   (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -20,8 +31,8 @@ description: FSelector is a Ruby gem that aims to integrate various feature sele
   with certain criterion. FSelector acts on a full-feature data set in either CSV,
   LibSVM or WEKA file format and outputs a reduced data set with only selected subset
   of features, which can later be used as the input for various machine learning softwares
-  including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
-  learning algorithms such as support vector machines and random forest.
+  such as LibSVM and WEKA. FSelector, as a collection of filter methods, does not
+  implement any classifier like support vector machines or random forest.
 email: need47@gmail.com
 executables: []
 extensions: []
@@ -30,6 +41,7 @@ extra_rdoc_files:
 - LICENSE
 files:
 - README.md
+- ChangeLog
 - LICENSE
 - lib/fselector/algo_base/base.rb
 - lib/fselector/algo_base/base_CFS.rb
@@ -70,7 +82,6 @@ files:
 - lib/fselector/algo_discrete/Sensitivity.rb
 - lib/fselector/algo_discrete/Specificity.rb
 - lib/fselector/algo_discrete/SymmetricalUncertainty.rb
-- lib/fselector/chisq_calc.rb
 - lib/fselector/discretizer.rb
 - lib/fselector/ensemble.rb
 - lib/fselector/entropy.rb

data/lib/fselector/chisq_calc.rb DELETED

@@ -1,189 +0,0 @@
-#
-# Chi-Square Calculator
-#
-# This module is adpated from the on-line [Chi-square Calculator](http://www.swogstat.org/stat/public/chisq_calculator.htm)
-#
-# The functions for calculating normal and chi-square probabilities
-# and critical values were adapted by John Walker from C implementations
-# written by Gary Perlman of Wang Institute, Tyngsboro, MA 01879. The
-# original C code is in the public domain.
-#
-# chisq2pval(chisq, df) -- calculate p-value from given
-#                   chi-square value (chisq) and degree of freedom (df)
-# pval2chisq(pval, df) -- chi-square value from given
-#                   p-value (pvalue) and degree of freedom (df)
-#
-module ChiSquareCalculator
-  BIGX = 20.0 # max value to represent exp(x)
-  LOG_SQRT_PI = 0.5723649429247000870717135 # log(sqrt(pi))
-  I_SQRT_PI = 0.5641895835477562869480795 # 1 / sqrt(pi)
-  Z_MAX = 6.0 # Maximum meaningful z value
-  CHI_EPSILON = 0.000001 # Accuracy of critchi approximation
-  CHI_MAX = 99999.0 # Maximum chi-square value
-  #
-  # POCHISQ  --  probability of chi-square value
-  #
-  # Adapted from:
-  #
-  #   Hill, I. D. and Pike, M. C.  Algorithm 299
-  #
-  #   Collected Algorithms for the CACM 1967 p. 243
-  #
-  # Updated for rounding errors based on remark in
-  #
-  #   ACM TOMS June 1985, page 185
-  #
-  # @param [Float] x chi-square value
-  # @param [Integer] df degree of freedom
-  # @return [Float] p-value
-  def pochisq(x, df)
-    a, y, s = nil, nil, nil
-    e, c, z = nil, nil, nil
-    even = nil # True if df is an even number
-    if x <= 0.0 or df < 1
-      return 1.0
-    end
-    a = 0.5 * x
-    even = ((df & 1) == 0)
-    if df > 1
-      y = ex(-a)
-    end
-    s = even ? y : (2.0 * poz(-Math.sqrt(x)))
-    if df > 2
-      x = 0.5 * (df - 1.0)
-      z = even ? 1.0 : 0.5
-      if a > BIGX
-        e = even ? 0.0 : LOG_SQRT_PI
-        c = Math.log(a)
-        while z <= x
-          e = Math.log(z) + e
-          s += ex(c * z - a - e)
-          z += 1.0
-        end
-        return s
-      else
-        e = even ? 1.0 : (I_SQRT_PI / Math.sqrt(a))
-        c = 0.0
-        while (z <= x)
-          e = e * (a / z)
-          c = c + e
-          z += 1.0
-        end
-        return c * y + s
-      end
-    else
-      return s
-    end
-  end # pochisq
-  # function alias
-  alias :chisq2pval :pochisq
-  #
-  # CRITCHI  --  Compute critical chi-square value to
-  # produce given p.  We just do a bisection
-  # search for a value within CHI_EPSILON,
-  # relying on the monotonicity of pochisq()
-  #
-  # @param [Float] p p-value
-  # @param [Integer] df degree of freedom
-  # @return [Float] chi-square value
-  def critchi(p, df)
-    minchisq = 0.0
-    maxchisq = CHI_MAX
-    chisqval = nil
-    if p <= 0.0
-      return maxchisq
-    else
-      if p >= 1.0
-        return 0.0
-      end
-    end
-    chisqval = df / Math.sqrt(p);    # fair first value
-    while (maxchisq - minchisq) > CHI_EPSILON
-      if pochisq(chisqval, df) < p
-        maxchisq = chisqval
-      else
-        minchisq = chisqval
-      end
-      chisqval = (maxchisq + minchisq) * 0.5
-     end
-     return chisqval
-  end # critchi
-  # function alias
-  alias :pval2chisq :critchi
-  private
-  def ex(x)
-    return (x < -BIGX) ? 0.0 : Math.exp(x)
-  end # ex
-  #
-  # POZ  --  probability of normal z value
-  #
-  # Adapted from a polynomial approximation in:
-  #  Ibbetson D, Algorithm 209
-  #  Collected Algorithms of the CACM 1963 p. 616
-  #
-  # Note:
-  #   This routine has six digit accuracy, so it is only useful for absolute
-  #   z values < 6.  For z values >= to 6.0, poz() returns 0.0
-  #
-   def poz(z)
-    y, x, w = nil, nil, nil
-    if (z == 0.0)
-      x = 0.0
-    else
-      y = 0.5 * z.abs # Math.abs(z)
-      if (y >= (Z_MAX * 0.5))
-        x = 1.0
-      elsif (y < 1.0)
-        w = y * y
-        x = ((((((((0.000124818987 * w - 0.001075204047) * w +
-            0.005198775019) * w - 0.019198292004) * w +
-            0.059054035642) * w - 0.151968751364) * w +
-            0.319152932694) * w - 0.531923007300) * w +
-            0.797884560593) * y * 2.0
-      else
-        y -= 2.0
-        x = (((((((((((((-0.000045255659 * y +
-            0.000152529290) * y - 0.000019538132) * y -
-            0.000676904986) * y + 0.001390604284) * y -
-            0.000794620820) * y - 0.002034254874) * y +
-            0.006549791214) * y - 0.010557625006) * y +
-            0.011630447319) * y - 0.009279453341) * y +
-            0.005353579108) * y - 0.002141268741) * y +
-            0.000535310849) * y + 0.999936657524
-      end
-    end
-    return z > 0.0 ? ((x + 1.0) * 0.5) : ((1.0 - x) * 0.5)
-  end # poz
-end # module