fselector 0.5.0 → 0.6.0

Sign up to get free protection for your applications and to get access to all the features.
data/ChangeLog ADDED
@@ -0,0 +1,7 @@
1
+ 2012-04-18 Tiejun Cheng <need47@gmail.com>
2
+
3
+ * require the RinRuby gem (http://rinruby.ddahl.org) to access the
4
+ statistical routines in the R package (http://www.r-project.org/)
5
+
6
+ * because of RinRuby (and thus R), removed the following modules or implementations:
7
+ RubyStats (FishersExactTest.calculate, get_icdf) and ChiSquareCalculator
data/README.md CHANGED
@@ -8,8 +8,8 @@ FSelector: a Ruby gem for feature selection and ranking
8
8
  **Email**: [need47@gmail.com](mailto:need47@gmail.com)
9
9
  **Copyright**: 2012
10
10
  **License**: MIT License
11
- **Latest Version**: 0.5.0
12
- **Release Date**: April 13 2012
11
+ **Latest Version**: 0.6.0
12
+ **Release Date**: April 19 2012
13
13
 
14
14
  Synopsis
15
15
  --------
@@ -25,9 +25,9 @@ missing feature values with certain criterion. FSelector acts on a
25
25
  full-feature data set in either CSV, LibSVM or WEKA file format and
26
26
  outputs a reduced data set with only selected subset of features, which
27
27
  can later be used as the input for various machine learning softwares
28
- including LibSVM and WEKA. FSelector, itself, does not implement
29
- any of the machine learning algorithms such as support vector machines
30
- and random forest. See below for a list of FSelector's features.
28
+ such as LibSVM and WEKA. FSelector, as a collection of filter methods,
29
+ does not implement any classifier like support vector machines or
30
+ random forest. See below for a list of FSelector's features.
31
31
 
32
32
  Feature List
33
33
  ------------
@@ -78,7 +78,7 @@ Feature List
78
78
  ReliefF_c ReliefF_c continuous
79
79
  TScore TS continuous
80
80
 
81
- **feature selection interace:**
81
+ **note for feature selection interace:**
82
82
  - for the algorithms of CFS\_d, FCBF and CFS\_c, use select\_feature!
83
83
  - for other algorithms, use either select\_feature\_by\_rank! or select\_feature\_by\_score!
84
84
 
@@ -115,7 +115,13 @@ Installing
115
115
  To install FSelector, use the following command:
116
116
 
117
117
  $ gem install fselector
118
-
118
+
119
+ **note:** Start from version 0.5.0, FSelector uses the RinRuby gem (http://rinruby.ddahl.org)
120
+ as a seemless bridge to access the statistical routines in the R package (http://www.r-project.org),
121
+ which will greatly expand the inclusion of algorithms to FSelector, especially for those relying
122
+ on statistical test. To this end, please pre-install the R package. RinRuby should have been
123
+ auto-installed with FSelector.
124
+
119
125
  Usage
120
126
  -----
121
127
 
@@ -223,6 +229,11 @@ Usage
223
229
 
224
230
  **4. see more examples test_*.rb under the test/ directory**
225
231
 
232
+ Change Log
233
+ ----------
234
+ A {file:ChangeLog} is available from version 0.5.0 and upward to refelect
235
+ what's new and what's changed
236
+
226
237
  Copyright
227
238
  ---------
228
239
  FSelector &copy; 2012 by [Tiejun Cheng](mailto:need47@gmail.com).
data/lib/fselector.rb CHANGED
@@ -1,9 +1,12 @@
1
+ # access to the statistical routines in R package
2
+ require 'rinruby'
3
+
1
4
  #
2
5
  # FSelector: a Ruby gem for feature selection and ranking
3
6
  #
4
7
  module FSelector
5
8
  # module version
6
- VERSION = '0.5.0'
9
+ VERSION = '0.6.0'
7
10
  end
8
11
 
9
12
  ROOT = File.expand_path(File.dirname(__FILE__))
@@ -17,8 +20,6 @@ require "#{ROOT}/fselector/fileio.rb"
17
20
  require "#{ROOT}/fselector/util.rb"
18
21
  # entropy-related functions
19
22
  require "#{ROOT}/fselector/entropy.rb"
20
- # chi-square calculator
21
- require "#{ROOT}/fselector/chisq_calc.rb"
22
23
  # normalization for continuous data
23
24
  require "#{ROOT}/fselector/normalizer.rb"
24
25
  # discretization for continuous data
@@ -165,6 +165,13 @@ module FSelector
165
165
  end
166
166
 
167
167
 
168
+ # get a copy of data,
169
+ # by use of the standard Marshal library
170
+ def get_data_copy
171
+ Marshal.load(Marshal.dump(@data)) if @data
172
+ end
173
+
174
+
168
175
  # set data
169
176
  def set_data(data)
170
177
  if data and data.class == Hash
@@ -13,14 +13,11 @@ module FSelector
13
13
  # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974) and [Rubystats](http://rubystats.rubyforge.org)
14
14
  #
15
15
  class BiNormalSeparation < BaseDiscrete
16
- # include Ruby statistics libraries
17
- include Rubystats
18
16
 
19
17
  private
20
18
 
21
19
  # calculate contribution of each feature (f) for each class (k)
22
20
  def calc_contribution(f)
23
- @nd ||= Rubystats::NormalDistribution.new
24
21
 
25
22
  each_class do |k|
26
23
  a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
@@ -28,7 +25,9 @@ module FSelector
28
25
  s = 0.0
29
26
  if not (a+c).zero? and not (b+d).zero?
30
27
  tpr, fpr = a/(a+c), b/(b+d)
31
- s = (@nd.get_icdf(tpr) - @nd.get_icdf(fpr)).abs
28
+
29
+ R.eval "rv <- qnorm(#{tpr}) - qnorm(#{fpr})"
30
+ s = R.rv.abs
32
31
  end
33
32
 
34
33
  set_feature_score(f, k, s)
@@ -11,24 +11,22 @@ module FSelector
11
11
  #
12
12
  # for FET, the smaller, the better, but we intentionally negate it
13
13
  # so that the larger is always the better (consistent with other algorithms)
14
+ # R equivalent: fisher.test
14
15
  #
15
16
  # ref: [Wikipedia](http://en.wikipedia.org/wiki/Fisher's_exact_test) and [Rubystats](http://rubystats.rubyforge.org)
16
17
  #
17
18
  class FishersExactTest < BaseDiscrete
18
- # include Ruby statistics libraries
19
- include Rubystats
20
19
 
21
20
  private
22
21
 
23
22
  # calculate contribution of each feature (f) for each class (k)
24
- def calc_contribution(f)
25
- @fet ||= Rubystats::FishersExactTest.new
26
-
23
+ def calc_contribution(f)
27
24
  each_class do |k|
28
25
  a, b, c, d = get_A(f, k), get_B(f, k), get_C(f, k), get_D(f, k)
29
26
 
30
- # note: we've intentionally negated it
31
- s = -1 * @fet.calculate(a, b, c, d)[:twotail]
27
+ # note: intentionally negated it
28
+ R.eval "rv <- fisher.test(matrix(c(#{a}, #{b}, #{c}, #{d}), nrow=2))$p.value"
29
+ s = -1.0 * R.rv
32
30
 
33
31
  set_feature_score(f, k, s)
34
32
  end
@@ -4,8 +4,6 @@
4
4
  module Discretizer
5
5
  # include Entropy module
6
6
  include Entropy
7
- # include ChiSquareCalculator module
8
- include ChiSquareCalculator
9
7
 
10
8
  # discretize by equal-width intervals
11
9
  #
@@ -334,6 +332,19 @@ module Discretizer
334
332
 
335
333
  private
336
334
 
335
+ #
336
+ # get the Chi-square value from p-value
337
+ #
338
+ # @param [Float] pval p-value
339
+ # @param [Integer] df degree of freedom
340
+ # @return [Float] Chi-square vlaue
341
+ #
342
+ def pval2chisq(pval, df)
343
+ R.eval "chisq <- qchisq(#{1-pval}, #{df})"
344
+ R.chisq
345
+ end
346
+
347
+
337
348
  #
338
349
  # get index from sorted cut points
339
350
  #
@@ -388,6 +399,7 @@ module Discretizer
388
399
  clear_vars
389
400
  end
390
401
 
402
+
391
403
  #
392
404
  # Chi2: initialization
393
405
  #
@@ -423,6 +435,7 @@ module Discretizer
423
435
  [bs, cs, qs]
424
436
  end
425
437
 
438
+
426
439
  #
427
440
  # Chi2: merge two adjacent intervals
428
441
  #
@@ -1,8 +1,23 @@
1
1
  #
2
- # read and write various file formats
2
+ # read and write various file formats,
3
+ # the internal data structure looks like:
4
+ #
5
+ # data = {
6
+ #
7
+ # :c1 => [ # class c1
8
+ # {:f1=>1, :f2=>2} # sample 2
9
+ # ],
10
+ #
11
+ # :c2 => [ # class c2
12
+ # {:f1=>1, :f3=>3}, # sample 1
13
+ # {:f2=>2} # sample 3
14
+ # ]
15
+ #
16
+ # }
17
+ #
18
+ # where :c1 and :c2 are class labels; :f1, :f2, and :f3 are features
3
19
  #
4
- # @note class labels and features are treated as symbols,
5
- # e.g. length => :length
20
+ # @note class labels and features are treated as symbols
6
21
  #
7
22
  module FileIO
8
23
  #
@@ -40,7 +55,7 @@ module FileIO
40
55
  if ncategory == 1
41
56
  feats[f] = 1
42
57
  elsif ncategory > 1
43
- feats[f] = rand(ncategory)
58
+ feats[f] = rand(ncategory)+1
44
59
  else
45
60
  feats[f] = rand
46
61
  end
@@ -149,588 +149,3 @@ end # String
149
149
  #=>a
150
150
  #=>_'b,c, d'_
151
151
  #=>'e'
152
-
153
-
154
- #
155
- # adapted from the Ruby statistics libraries --
156
- # [Rubystats](http://rubystats.rubyforge.org)
157
- #
158
- # - for Fisher's exact test (Rubystats::FishersExactTest.calculate())
159
- # used by algo\_binary/FishersExactText.rb
160
- # - for inverse cumulative normal distribution function (Rubystats::NormalDistribution.get\_icdf())
161
- # used by algo\_binary/BiNormalSeparation.rb. note the original get\_icdf() function is a private
162
- # one, so we have to open it up and that's why the codes here.
163
- #
164
- #
165
- module Rubystats
166
- MAX_VALUE = 1.2e290
167
- SQRT2PI = 2.5066282746310005024157652848110452530069867406099
168
- SQRT2 = 1.4142135623730950488016887242096980785696718753769
169
- TWO_PI = 6.2831853071795864769252867665590057683943387987502
170
-
171
- #
172
- # Fisher's exact test calculator
173
- #
174
- class FishersExactTest
175
- # new()
176
- def initialize
177
- @sn11 = 0.0
178
- @sn1_ = 0.0
179
- @sn_1 = 0.0
180
- @sn = 0.0
181
- @sprob = 0.0
182
-
183
- @sleft = 0.0
184
- @sright = 0.0
185
- @sless = 0.0
186
- @slarg = 0.0
187
-
188
- @left = 0.0
189
- @right = 0.0
190
- @twotail = 0.0
191
- end
192
-
193
-
194
- # Fisher's exact test
195
- def calculate(n11_,n12_,n21_,n22_)
196
- n11_ *= -1 if n11_ < 0
197
- n12_ *= -1 if n12_ < 0
198
- n21_ *= -1 if n21_ < 0
199
- n22_ *= -1 if n22_ < 0
200
- n1_ = n11_ + n12_
201
- n_1 = n11_ + n21_
202
- n = n11_ + n12_ + n21_ + n22_
203
- prob = exact(n11_,n1_,n_1,n)
204
- left = @sless
205
- right = @slarg
206
- twotail = @sleft + @sright
207
- twotail = 1 if twotail > 1
208
- values_hash = { :left =>left, :right =>right, :twotail =>twotail }
209
- return values_hash
210
- end
211
-
212
- private
213
-
214
- # Reference: "Lanczos, C. 'A precision approximation
215
- # of the gamma function', J. SIAM Numer. Anal., B, 1, 86-96, 1964."
216
- # Translation of Alan Miller's FORTRAN-implementation
217
- # See http://lib.stat.cmu.edu/apstat/245
218
- def lngamm(z)
219
- x = 0
220
- x += 0.0000001659470187408462/(z+7)
221
- x += 0.000009934937113930748 /(z+6)
222
- x -= 0.1385710331296526 /(z+5)
223
- x += 12.50734324009056 /(z+4)
224
- x -= 176.6150291498386 /(z+3)
225
- x += 771.3234287757674 /(z+2)
226
- x -= 1259.139216722289 /(z+1)
227
- x += 676.5203681218835 /(z)
228
- x += 0.9999999999995183
229
-
230
- return(Math.log(x)-5.58106146679532777-z+(z-0.5) * Math.log(z+6.5))
231
- end
232
-
233
- def lnfact(n)
234
- if n <= 1
235
- return 0
236
- else
237
- return lngamm(n+1)
238
- end
239
- end
240
-
241
- def lnbico(n,k)
242
- return lnfact(n) - lnfact(k) - lnfact(n-k)
243
- end
244
-
245
- def hyper_323(n11, n1_, n_1, n)
246
- return Math.exp(lnbico(n1_, n11) + lnbico(n-n1_, n_1-n11) - lnbico(n, n_1))
247
- end
248
-
249
- def hyper(n11)
250
- return hyper0(n11, 0, 0, 0)
251
- end
252
-
253
- def hyper0(n11i,n1_i,n_1i,ni)
254
- if n1_i == 0 and n_1i ==0 and ni == 0
255
- unless n11i % 10 == 0
256
- if n11i == @sn11+1
257
- @sprob *= ((@sn1_ - @sn11)/(n11i.to_f))*((@sn_1 - @sn11)/(n11i.to_f + @sn - @sn1_ - @sn_1))
258
- @sn11 = n11i
259
- return @sprob
260
- end
261
- if n11i == @sn11-1
262
- @sprob *= ((@sn11)/(@sn1_-n11i.to_f))*((@sn11+@sn-@sn1_-@sn_1)/(@sn_1-n11i.to_f))
263
- @sn11 = n11i
264
- return @sprob
265
- end
266
- end
267
- @sn11 = n11i
268
- else
269
- @sn11 = n11i
270
- @sn1_ = n1_i
271
- @sn_1 = n_1i
272
- @sn = ni
273
- end
274
- @sprob = hyper_323(@sn11,@sn1_,@sn_1,@sn)
275
- return @sprob
276
- end
277
-
278
- def exact(n11,n1_,n_1,n)
279
-
280
- p = i = j = prob = 0.0
281
-
282
- max = n1_
283
- max = n_1 if n_1 < max
284
- min = n1_ + n_1 - n
285
- min = 0 if min < 0
286
-
287
- if min == max
288
- @sless = 1
289
- @sright = 1
290
- @sleft = 1
291
- @slarg = 1
292
- return 1
293
- end
294
-
295
- prob = hyper0(n11,n1_,n_1,n)
296
- @sleft = 0
297
-
298
- p = hyper(min)
299
- i = min + 1
300
- while p < (0.99999999 * prob)
301
- @sleft += p
302
- p = hyper(i)
303
- i += 1
304
- end
305
-
306
- i -= 1
307
-
308
- if p < (1.00000001*prob)
309
- @sleft += p
310
- else
311
- i -= 1
312
- end
313
-
314
- @sright = 0
315
-
316
- p = hyper(max)
317
- j = max - 1
318
- while p < (0.99999999 * prob)
319
- @sright += p
320
- p = hyper(j)
321
- j -= 1
322
- end
323
- j += 1
324
-
325
- if p < (1.00000001*prob)
326
- @sright += p
327
- else
328
- j += 1
329
- end
330
-
331
- if (i - n11).abs < (j - n11).abs
332
- @sless = @sleft
333
- @slarg = 1 - @sleft + prob
334
- else
335
- @sless = 1 - @sright + prob
336
- @slarg = @sright
337
- end
338
- return prob
339
- end
340
-
341
-
342
- end # class
343
-
344
- #
345
- # Normal distribution
346
- #
347
- class NormalDistribution
348
- # Constructs a normal distribution (defaults to zero mean and
349
- # unity variance)
350
- def initialize(mu=0.0, sigma=1.0)
351
- @mean = mu
352
- if sigma <= 0.0
353
- return "error"
354
- end
355
- @stdev = sigma
356
- @variance = sigma**2
357
- @pdf_denominator = SQRT2PI * Math.sqrt(@variance)
358
- @cdf_denominator = SQRT2 * Math.sqrt(@variance)
359
- end
360
-
361
-
362
- # Obtain single PDF value
363
- # Returns the probability that a stochastic variable x has the value X,
364
- # i.e. P(x=X)
365
- def get_pdf(x)
366
- Math.exp( -((x-@mean)**2) / (2 * @variance)) / @pdf_denominator
367
- end
368
-
369
-
370
- # Obtain single CDF value
371
- # Returns the probability that a stochastic variable x is less than X,
372
- # i.e. P(x<X)
373
- def get_cdf(x)
374
- complementary_error( -(x - @mean) / @cdf_denominator) / 2
375
- end
376
-
377
-
378
- # Obtain single inverse CDF value.
379
- # returns the value X for which P(x<X).
380
- def get_icdf(p)
381
- check_range(p)
382
- if p == 0.0
383
- return -MAX_VALUE
384
- end
385
- if p == 1.0
386
- return MAX_VALUE
387
- end
388
- if p == 0.5
389
- return @mean
390
- end
391
-
392
- mean_save = @mean
393
- var_save = @variance
394
- pdf_D_save = @pdf_denominator
395
- cdf_D_save = @cdf_denominator
396
- @mean = 0.0
397
- @variance = 1.0
398
- @pdf_denominator = Math.sqrt(TWO_PI)
399
- @cdf_denominator = SQRT2
400
- x = find_root(p, 0.0, -100.0, 100.0)
401
- #scale back
402
- @mean = mean_save
403
- @variance = var_save
404
- @pdf_denominator = pdf_D_save
405
- @cdf_denominator = cdf_D_save
406
- return x * Math.sqrt(@variance) + @mean
407
- end
408
-
409
- private
410
-
411
- #check that variable is between lo and hi limits.
412
- #lo default is 0.0 and hi default is 1.0
413
- def check_range(x, lo=0.0, hi=1.0)
414
- raise ArgumentError.new("x cannot be nil") if x.nil?
415
- if x < lo or x > hi
416
- raise ArgumentError.new("x must be less than lo (#{lo}) and greater than hi (#{hi})")
417
- end
418
- end
419
-
420
-
421
- def find_root(prob, guess, x_lo, x_hi)
422
- accuracy = 1.0e-10
423
- max_iteration = 150
424
- x = guess
425
- x_new = guess
426
- error = 0.0
427
- _pdf = 0.0
428
- dx = 1000.0
429
- i = 0
430
- while ( dx.abs > accuracy && (i += 1) < max_iteration )
431
- #Apply Newton-Raphson step
432
- error = cdf(x) - prob
433
- if error < 0.0
434
- x_lo = x
435
- else
436
- x_hi = x
437
- end
438
- _pdf = pdf(x)
439
- if _pdf != 0.0
440
- dx = error / _pdf
441
- x_new = x -dx
442
- end
443
- # If the NR fails to converge (which for example may be the
444
- # case if the initial guess is too rough) we apply a bisection
445
- # step to determine a more narrow interval around the root.
446
- if x_new < x_lo || x_new > x_hi || _pdf == 0.0
447
- x_new = (x_lo + x_hi) / 2.0
448
- dx = x_new - x
449
- end
450
- x = x_new
451
- end
452
- return x
453
- end
454
-
455
-
456
- #Probability density function
457
- def pdf(x)
458
- if x.class == Array
459
- pdf_vals = []
460
- for i in (0 ... x.length)
461
- pdf_vals[i] = get_pdf(x[i])
462
- end
463
- return pdf_vals
464
- else
465
- return get_pdf(x)
466
- end
467
- end
468
-
469
-
470
- #Cummulative distribution function
471
- def cdf(x)
472
- if x.class == Array
473
- cdf_vals = []
474
- for i in (0...x.size)
475
- cdf_vals[i] = get_cdf(x[i])
476
- end
477
- return cdf_vals
478
- else
479
- return get_cdf(x)
480
- end
481
- end
482
-
483
-
484
-
485
- # Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
486
- #
487
- # Developed at SunSoft, a Sun Microsystems, Inc. business.
488
- # Permission to use, copy, modify, and distribute this
489
- # software is freely granted, provided that this notice
490
- # is preserved.
491
- #
492
- # x
493
- # 2 |\
494
- # erf(x) = --------- | exp(-t*t)dt
495
- # sqrt(pi) \|
496
- # 0
497
- #
498
- # erfc(x) = 1-erf(x)
499
- # Note that
500
- # erf(-x) = -erf(x)
501
- # erfc(-x) = 2 - erfc(x)
502
- #
503
- # Method:
504
- # 1. For |x| in [0, 0.84375]
505
- # erf(x) = x + x*R(x^2)
506
- # erfc(x) = 1 - erf(x) if x in [-.84375,0.25]
507
- # = 0.5 + ((0.5-x)-x*R) if x in [0.25,0.84375]
508
- # where R = P/Q where P is an odd poly of degree 8 and
509
- # Q is an odd poly of degree 10.
510
- # -57.90
511
- # | R - (erf(x)-x)/x | <= 2
512
- #
513
- #
514
- # Remark. The formula is derived by noting
515
- # erf(x) = (2/sqrt(pi))*(x - x^3/3 + x^5/10 - x^7/42 + ....)
516
- # and that
517
- # 2/sqrt(pi) = 1.128379167095512573896158903121545171688
518
- # is close to one. The interval is chosen because the fix
519
- # point of erf(x) is near 0.6174 (i.e., erf(x)=x when x is
520
- # near 0.6174), and by some experiment, 0.84375 is chosen to
521
- # guarantee the error is less than one ulp for erf.
522
- #
523
- # 2. For |x| in [0.84375,1.25], let s = |x| - 1, and
524
- # c = 0.84506291151 rounded to single (24 bits)
525
- # erf(x) = sign(x) * (c + P1(s)/Q1(s))
526
- # erfc(x) = (1-c) - P1(s)/Q1(s) if x > 0
527
- # 1+(c+P1(s)/Q1(s)) if x < 0
528
- # |P1/Q1 - (erf(|x|)-c)| <= 2**-59.06
529
- # Remark: here we use the taylor series expansion at x=1.
530
- # erf(1+s) = erf(1) + s*Poly(s)
531
- # = 0.845.. + P1(s)/Q1(s)
532
- # That is, we use rational approximation to approximate
533
- # erf(1+s) - (c = (single)0.84506291151)
534
- # Note that |P1/Q1|< 0.078 for x in [0.84375,1.25]
535
- # where
536
- # P1(s) = degree 6 poly in s
537
- # Q1(s) = degree 6 poly in s
538
- #
539
- # 3. For x in [1.25,1/0.35(~2.857143)],
540
- # erfc(x) = (1/x)*exp(-x*x-0.5625+R1/S1)
541
- # erf(x) = 1 - erfc(x)
542
- # where
543
- # R1(z) = degree 7 poly in z, (z=1/x^2)
544
- # S1(z) = degree 8 poly in z
545
- #
546
- # 4. For x in [1/0.35,28]
547
- # erfc(x) = (1/x)*exp(-x*x-0.5625+R2/S2) if x > 0
548
- # = 2.0 - (1/x)*exp(-x*x-0.5625+R2/S2) if -6<x<0
549
- # = 2.0 - tiny (if x <= -6)
550
- # erf(x) = sign(x)*(1.0 - erfc(x)) if x < 6, else
551
- # erf(x) = sign(x)*(1.0 - tiny)
552
- # where
553
- # R2(z) = degree 6 poly in z, (z=1/x^2)
554
- # S2(z) = degree 7 poly in z
555
- #
556
- # Note1:
557
- # To compute exp(-x*x-0.5625+R/S), let s be a single
558
- # PRECISION number and s := x then
559
- # -x*x = -s*s + (s-x)*(s+x)
560
- # exp(-x*x-0.5626+R/S) =
561
- # exp(-s*s-0.5625)*exp((s-x)*(s+x)+R/S)
562
- # Note2:
563
- # Here 4 and 5 make use of the asymptotic series
564
- # exp(-x*x)
565
- # erfc(x) ~ ---------- * ( 1 + Poly(1/x^2) )
566
- # x*sqrt(pi)
567
- # We use rational approximation to approximate
568
- # g(s)=f(1/x^2) = log(erfc(x)*x) - x*x + 0.5625
569
- # Here is the error bound for R1/S1 and R2/S2
570
- # |R1/S1 - f(x)| < 2**(-62.57)
571
- # |R2/S2 - f(x)| < 2**(-61.52)
572
- #
573
- # 5. For inf > x >= 28
574
- # erf(x) = sign(x) *(1 - tiny) (raise inexact)
575
- # erfc(x) = tiny*tiny (raise underflow) if x > 0
576
- # = 2 - tiny if x<0
577
- #
578
- # 7. Special case:
579
- # erf(0) = 0, erf(inf) = 1, erf(-inf) = -1,
580
- # erfc(0) = 1, erfc(inf) = 0, erfc(-inf) = 2,
581
- # erfc/erf(NaN) is NaN
582
- #
583
- # $efx8 = 1.02703333676410069053e00
584
- #
585
- # Coefficients for approximation to erf on [0,0.84375]
586
- #
587
-
588
- # Error function.
589
- # Based on C-code for the error function developed at Sun Microsystems.
590
- # Author:: Jaco van Kooten
591
-
592
- def error(x)
593
- e_efx = 1.28379167095512586316e-01
594
-
595
- ePp = [ 1.28379167095512558561e-01,
596
- -3.25042107247001499370e-01,
597
- -2.84817495755985104766e-02,
598
- -5.77027029648944159157e-03,
599
- -2.37630166566501626084e-05 ]
600
-
601
- eQq = [ 3.97917223959155352819e-01,
602
- 6.50222499887672944485e-02,
603
- 5.08130628187576562776e-03,
604
- 1.32494738004321644526e-04,
605
- -3.96022827877536812320e-06 ]
606
-
607
- # Coefficients for approximation to erf in [0.84375,1.25]
608
- ePa = [-2.36211856075265944077e-03,
609
- 4.14856118683748331666e-01,
610
- -3.72207876035701323847e-01,
611
- 3.18346619901161753674e-01,
612
- -1.10894694282396677476e-01,
613
- 3.54783043256182359371e-02,
614
- -2.16637559486879084300e-03 ]
615
-
616
- eQa = [ 1.06420880400844228286e-01,
617
- 5.40397917702171048937e-01,
618
- 7.18286544141962662868e-02,
619
- 1.26171219808761642112e-01,
620
- 1.36370839120290507362e-02,
621
- 1.19844998467991074170e-02 ]
622
-
623
- e_erx = 8.45062911510467529297e-01
624
-
625
- abs_x = (if x >= 0.0 then x else -x end)
626
- # 0 < |x| < 0.84375
627
- if abs_x < 0.84375
628
- #|x| < 2**-28
629
- if abs_x < 3.7252902984619141e-9
630
- retval = abs_x + abs_x * e_efx
631
- else
632
- s = x * x
633
- p = ePp[0] + s * (ePp[1] + s * (ePp[2] + s * (ePp[3] + s * ePp[4])))
634
-
635
- q = 1.0 + s * (eQq[0] + s * (eQq[1] + s *
636
- ( eQq[2] + s * (eQq[3] + s * eQq[4]))))
637
- retval = abs_x + abs_x * (p / q)
638
- end
639
- elsif abs_x < 1.25
640
- s = abs_x - 1.0
641
- p = ePa[0] + s * (ePa[1] + s *
642
- (ePa[2] + s * (ePa[3] + s *
643
- (ePa[4] + s * (ePa[5] + s * ePa[6])))))
644
-
645
- q = 1.0 + s * (eQa[0] + s *
646
- (eQa[1] + s * (eQa[2] + s *
647
- (eQa[3] + s * (eQa[4] + s * eQa[5])))))
648
- retval = e_erx + p / q
649
-
650
- elsif abs_x >= 6.0
651
- retval = 1.0
652
- else
653
- retval = 1.0 - complementary_error(abs_x)
654
- end
655
- return (if x >= 0.0 then retval else -retval end)
656
- end
657
-
658
- # Complementary error function.
659
- # Based on C-code for the error function developed at Sun Microsystems.
660
- # author Jaco van Kooten
661
-
662
- def complementary_error(x)
663
- # Coefficients for approximation of erfc in [1.25,1/.35]
664
-
665
- eRa = [-9.86494403484714822705e-03,
666
- -6.93858572707181764372e-01,
667
- -1.05586262253232909814e01,
668
- -6.23753324503260060396e01,
669
- -1.62396669462573470355e02,
670
- -1.84605092906711035994e02,
671
- -8.12874355063065934246e01,
672
- -9.81432934416914548592e00 ]
673
-
674
- eSa = [ 1.96512716674392571292e01,
675
- 1.37657754143519042600e02,
676
- 4.34565877475229228821e02,
677
- 6.45387271733267880336e02,
678
- 4.29008140027567833386e02,
679
- 1.08635005541779435134e02,
680
- 6.57024977031928170135e00,
681
- -6.04244152148580987438e-02 ]
682
-
683
- # Coefficients for approximation to erfc in [1/.35,28]
684
-
685
- eRb = [-9.86494292470009928597e-03,
686
- -7.99283237680523006574e-01,
687
- -1.77579549177547519889e01,
688
- -1.60636384855821916062e02,
689
- -6.37566443368389627722e02,
690
- -1.02509513161107724954e03,
691
- -4.83519191608651397019e02 ]
692
-
693
- eSb = [ 3.03380607434824582924e01,
694
- 3.25792512996573918826e02,
695
- 1.53672958608443695994e03,
696
- 3.19985821950859553908e03,
697
- 2.55305040643316442583e03,
698
- 4.74528541206955367215e02,
699
- -2.24409524465858183362e01 ]
700
-
701
- abs_x = (if x >= 0.0 then x else -x end)
702
- if abs_x < 1.25
703
- retval = 1.0 - error(abs_x)
704
- elsif abs_x > 28.0
705
- retval = 0.0
706
-
707
- # 1.25 < |x| < 28
708
- else
709
- s = 1.0/(abs_x * abs_x)
710
- if abs_x < 2.8571428
711
- r = eRa[0] + s * (eRa[1] + s *
712
- (eRa[2] + s * (eRa[3] + s * (eRa[4] + s *
713
- (eRa[5] + s *(eRa[6] + s * eRa[7])
714
- )))))
715
-
716
- s = 1.0 + s * (eSa[0] + s * (eSa[1] + s *
717
- (eSa[2] + s * (eSa[3] + s * (eSa[4] + s *
718
- (eSa[5] + s * (eSa[6] + s * eSa[7])))))))
719
-
720
- else
721
- r = eRb[0] + s * (eRb[1] + s *
722
- (eRb[2] + s * (eRb[3] + s * (eRb[4] + s *
723
- (eRb[5] + s * eRb[6])))))
724
-
725
- s = 1.0 + s * (eSb[0] + s *
726
- (eSb[1] + s * (eSb[2] + s * (eSb[3] + s *
727
- (eSb[4] + s * (eSb[5] + s * eSb[6]))))))
728
- end
729
- retval = Math.exp(-x * x - 0.5625 + r/s) / abs_x
730
- end
731
- return ( if x >= 0.0 then retval else 2.0 - retval end )
732
- end
733
-
734
- end # class
735
-
736
- end # module
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: fselector
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.5.0
4
+ version: 0.6.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,8 +9,19 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-04-13 00:00:00.000000000 Z
13
- dependencies: []
12
+ date: 2012-04-19 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rinruby
16
+ requirement: &22515480 !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: 2.0.2
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: *22515480
14
25
  description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
15
26
  algorithms and related functions into one single package. Welcome to contact me
16
27
  (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -20,8 +31,8 @@ description: FSelector is a Ruby gem that aims to integrate various feature sele
20
31
  with certain criterion. FSelector acts on a full-feature data set in either CSV,
21
32
  LibSVM or WEKA file format and outputs a reduced data set with only selected subset
22
33
  of features, which can later be used as the input for various machine learning softwares
23
- including LibSVM and WEKA. FSelector, itself, does not implement any of the machine
24
- learning algorithms such as support vector machines and random forest.
34
+ such as LibSVM and WEKA. FSelector, as a collection of filter methods, does not
35
+ implement any classifier like support vector machines or random forest.
25
36
  email: need47@gmail.com
26
37
  executables: []
27
38
  extensions: []
@@ -30,6 +41,7 @@ extra_rdoc_files:
30
41
  - LICENSE
31
42
  files:
32
43
  - README.md
44
+ - ChangeLog
33
45
  - LICENSE
34
46
  - lib/fselector/algo_base/base.rb
35
47
  - lib/fselector/algo_base/base_CFS.rb
@@ -70,7 +82,6 @@ files:
70
82
  - lib/fselector/algo_discrete/Sensitivity.rb
71
83
  - lib/fselector/algo_discrete/Specificity.rb
72
84
  - lib/fselector/algo_discrete/SymmetricalUncertainty.rb
73
- - lib/fselector/chisq_calc.rb
74
85
  - lib/fselector/discretizer.rb
75
86
  - lib/fselector/ensemble.rb
76
87
  - lib/fselector/entropy.rb
@@ -1,189 +0,0 @@
1
- #
2
- # Chi-Square Calculator
3
- #
4
- # This module is adpated from the on-line [Chi-square Calculator](http://www.swogstat.org/stat/public/chisq_calculator.htm)
5
- #
6
- # The functions for calculating normal and chi-square probabilities
7
- # and critical values were adapted by John Walker from C implementations
8
- # written by Gary Perlman of Wang Institute, Tyngsboro, MA 01879. The
9
- # original C code is in the public domain.
10
- #
11
- # chisq2pval(chisq, df) -- calculate p-value from given
12
- # chi-square value (chisq) and degree of freedom (df)
13
- # pval2chisq(pval, df) -- chi-square value from given
14
- # p-value (pvalue) and degree of freedom (df)
15
- #
16
- module ChiSquareCalculator
17
- BIGX = 20.0 # max value to represent exp(x)
18
- LOG_SQRT_PI = 0.5723649429247000870717135 # log(sqrt(pi))
19
- I_SQRT_PI = 0.5641895835477562869480795 # 1 / sqrt(pi)
20
- Z_MAX = 6.0 # Maximum meaningful z value
21
- CHI_EPSILON = 0.000001 # Accuracy of critchi approximation
22
- CHI_MAX = 99999.0 # Maximum chi-square value
23
-
24
- #
25
- # POCHISQ -- probability of chi-square value
26
- #
27
- # Adapted from:
28
- #
29
- # Hill, I. D. and Pike, M. C. Algorithm 299
30
- #
31
- # Collected Algorithms for the CACM 1967 p. 243
32
- #
33
- # Updated for rounding errors based on remark in
34
- #
35
- # ACM TOMS June 1985, page 185
36
- #
37
- # @param [Float] x chi-square value
38
- # @param [Integer] df degree of freedom
39
- # @return [Float] p-value
40
- def pochisq(x, df)
41
- a, y, s = nil, nil, nil
42
- e, c, z = nil, nil, nil
43
-
44
- even = nil # True if df is an even number
45
-
46
- if x <= 0.0 or df < 1
47
- return 1.0
48
- end
49
-
50
- a = 0.5 * x
51
- even = ((df & 1) == 0)
52
-
53
- if df > 1
54
- y = ex(-a)
55
- end
56
-
57
- s = even ? y : (2.0 * poz(-Math.sqrt(x)))
58
-
59
- if df > 2
60
- x = 0.5 * (df - 1.0)
61
- z = even ? 1.0 : 0.5
62
-
63
- if a > BIGX
64
- e = even ? 0.0 : LOG_SQRT_PI
65
- c = Math.log(a)
66
-
67
- while z <= x
68
- e = Math.log(z) + e
69
- s += ex(c * z - a - e)
70
- z += 1.0
71
- end
72
-
73
- return s
74
- else
75
- e = even ? 1.0 : (I_SQRT_PI / Math.sqrt(a))
76
- c = 0.0
77
-
78
- while (z <= x)
79
- e = e * (a / z)
80
- c = c + e
81
- z += 1.0
82
- end
83
-
84
- return c * y + s
85
- end
86
- else
87
- return s
88
- end
89
-
90
- end # pochisq
91
-
92
- # function alias
93
- alias :chisq2pval :pochisq
94
-
95
-
96
- #
97
- # CRITCHI -- Compute critical chi-square value to
98
- # produce given p. We just do a bisection
99
- # search for a value within CHI_EPSILON,
100
- # relying on the monotonicity of pochisq()
101
- #
102
- # @param [Float] p p-value
103
- # @param [Integer] df degree of freedom
104
- # @return [Float] chi-square value
105
- def critchi(p, df)
106
- minchisq = 0.0
107
- maxchisq = CHI_MAX
108
-
109
- chisqval = nil
110
-
111
- if p <= 0.0
112
- return maxchisq
113
- else
114
- if p >= 1.0
115
- return 0.0
116
- end
117
- end
118
-
119
- chisqval = df / Math.sqrt(p); # fair first value
120
-
121
- while (maxchisq - minchisq) > CHI_EPSILON
122
- if pochisq(chisqval, df) < p
123
- maxchisq = chisqval
124
- else
125
- minchisq = chisqval
126
- end
127
-
128
- chisqval = (maxchisq + minchisq) * 0.5
129
- end
130
-
131
- return chisqval
132
- end # critchi
133
-
134
- # function alias
135
- alias :pval2chisq :critchi
136
-
137
- private
138
-
139
- def ex(x)
140
- return (x < -BIGX) ? 0.0 : Math.exp(x)
141
- end # ex
142
-
143
-
144
- #
145
- # POZ -- probability of normal z value
146
- #
147
- # Adapted from a polynomial approximation in:
148
- # Ibbetson D, Algorithm 209
149
- # Collected Algorithms of the CACM 1963 p. 616
150
- #
151
- # Note:
152
- # This routine has six digit accuracy, so it is only useful for absolute
153
- # z values < 6. For z values >= to 6.0, poz() returns 0.0
154
- #
155
- def poz(z)
156
- y, x, w = nil, nil, nil
157
-
158
- if (z == 0.0)
159
- x = 0.0
160
- else
161
- y = 0.5 * z.abs # Math.abs(z)
162
-
163
- if (y >= (Z_MAX * 0.5))
164
- x = 1.0
165
- elsif (y < 1.0)
166
- w = y * y
167
- x = ((((((((0.000124818987 * w - 0.001075204047) * w +
168
- 0.005198775019) * w - 0.019198292004) * w +
169
- 0.059054035642) * w - 0.151968751364) * w +
170
- 0.319152932694) * w - 0.531923007300) * w +
171
- 0.797884560593) * y * 2.0
172
- else
173
- y -= 2.0
174
- x = (((((((((((((-0.000045255659 * y +
175
- 0.000152529290) * y - 0.000019538132) * y -
176
- 0.000676904986) * y + 0.001390604284) * y -
177
- 0.000794620820) * y - 0.002034254874) * y +
178
- 0.006549791214) * y - 0.010557625006) * y +
179
- 0.011630447319) * y - 0.009279453341) * y +
180
- 0.005353579108) * y - 0.002141268741) * y +
181
- 0.000535310849) * y + 0.999936657524
182
- end
183
- end
184
-
185
- return z > 0.0 ? ((x + 1.0) * 0.5) : ((1.0 - x) * 0.5)
186
- end # poz
187
-
188
-
189
- end # module