fselector 1.3.0 → 1.3.1
- data/.yardopts +1 -1
- data/ChangeLog +30 -12
- data/HowToContribute +116 -0
- data/README.md +18 -10
- data/lib/fselector.rb +1 -1
- data/lib/fselector/algo_base/base.rb +10 -1
- data/lib/fselector/algo_both/Random.rb +6 -2
- data/lib/fselector/algo_both/RandomSubset.rb +52 -0
- metadata +6 -4
data/.yardopts
CHANGED
data/ChangeLog
CHANGED
@@ -1,11 +1,20 @@
-2012-05-
+2012-05-31 v1.3.1
+------------------
+
+* add HowToContribute section with test examples in README to show how to write one's own feature selection algorithms and/or contribute them to FSelector on GitHub.com
+* all discretization algorithms now set feature type as CATEGORICAL, which allows correct file format conversion after discretization
+* add RandomSubset algorithm (primarily for test purpose)
+
+2012-05-24 v1.3.0
+------------------
 
 * update clear_vars() in Base by use of Ruby metaprogramming, a trick that avoids repetitively overriding it in each derived subclass
 * re-organize LasVegasFilter, LasVegasIncremental and Random into algo_both/, since they are applicable to datasets with either discrete or continuous features, even with mixed type
 * update data_from_csv() so that it can read CSV files more flexibly. note by default, the last column is the class label
 * add data_from_url() to read an on-line dataset (in CSV, LibSVM or Weka ARFF file format) specified by a url
 
-2012-05-20
+2012-05-20 v1.2.0
+------------------
 
 * add KS-Test algorithm for continuous feature
 * add KS-CCBF algorithm for continuous feature
@@ -13,7 +22,8 @@
 * add KL-Divergence algorithm for discrete feature
 * include the Discretizer module for algorithms requiring data with discrete features, which allows dealing with continuous features after discretization. Algorithms requiring data with continuous features no longer include the Discretizer module
 
-2012-05-15
+2012-05-15 v1.1.0
+------------------
 
 * add replace_by_median_value! for replacing missing values with the feature median value
 * add replace_by_knn_value! for replacing missing values with weighted feature values from k-nearest neighbors
@@ -22,43 +32,51 @@
 * rename Ensemble to EnsembleMultiple for ensemble feature selection by creating an ensemble of feature selectors using multiple feature selection algorithms of the same type
 * bug fix in FileIO module
 
-2012-05-08
+2012-05-08 v1.0.1
+------------------
 
 * modify Ensemble module so that ensemble_by_score() and ensemble_by_rank() now take a Symbol, instead of a Method, as argument. This allows easier and clearer function calls
 * enable the select_feature! interface in the Ensemble module for the type of subset selection algorithms
 
-2012-05-04
+2012-05-04 v1.0.0
+------------------
 
 * add new algorithm INTERACT for discrete feature
 * add Consistency module to deal with data inconsistency calculation, which is based on a Hash table and is efficient in both storage and speed
 * update the Chi2 algorithm to try to reproduce the results of the original Chi2 algorithm
 * update documentation wherever necessary
 
-2012-04-25
+2012-04-25 v0.9.0
+------------------
 
 * add new discretization algorithm (Three-Interval Discretization, TID)
 * add new algorithm Las Vegas Filter (LVF) for discrete feature
 * add new algorithm Las Vegas Incremental (LVI) for discrete feature
 
-2012-04-23
+2012-04-23 v0.8.1
+------------------
 
 * correct a bug in the example in the README file because discretize_by_ChiMerge!() now takes a confidence alpha value as argument instead of a chi-square value
 
-2012-04-23
+2012-04-23 v0.8.0
+------------------
 
 * add new algorithm FTest (FT) for continuous feature
 * add .yardoc_opts to the gem to use the MarkDown documentation syntax
 
-2012-04-20
+2012-04-20 v0.7.0
+------------------
 
-* update to
+* update to v0.7.0
 
-2012-04-19
+2012-04-19 v0.6.0
+------------------
 
 * add new algorithm BetweenWithinClassesSumOfSquare (BSS_WSS) for continuous feature
 * add new algorithm WilcoxonRankSum (WRS) for continuous feature
 
-2012-04-18
+2012-04-18 v0.5.0
+------------------
 
 * require the RinRuby gem (http://rinruby.ddahl.org) to access the
   statistical routines in the R package (http://www.r-project.org/)
data/HowToContribute
ADDED
@@ -0,0 +1,116 @@
How to add your own feature selection algorithms
------------------------------------------------

**Basic steps**

1. Require the FSelector gem

2. Derive a subclass from a base class (Base, BaseDiscrete, BaseContinuous, etc.)

3. Set your own feature selection algorithm with one of the following two types:

        :feature_weighting         # if it outputs a weight for each feature
        :feature_subset_selection  # if it outputs a subset of the original feature set

4. Depending on your algorithm type, override one of the following two interfaces:

        calc_contribution()   # if it belongs to the type of feature weighting algorithms
        get_feature_subset()  # if it belongs to the type of feature subset selection algorithms

**Example**

    require 'fselector' # step 1

    module FSelector
      # create a new algorithm belonging to the type of feature weighting
      # to this end, simply override calc_contribution() in the Base class
      class NewAlgo_Weight < Base # step 2
        # set the algorithm type
        @algo_type = :feature_weighting # step 3

        # add your own initialize() here if necessary

        private

        # the algorithm assigns a feature weight randomly
        def calc_contribution(f) # step 4
          s = rand

          # set the score (s) of feature (f) for a class (:BEST is the best score among all classes)
          set_feature_score(f, :BEST, s)
        end
      end # NewAlgo_Weight


      # create a new algorithm belonging to the type of feature subset selection
      # to this end, simply override get_feature_subset() in the Base class
      class NewAlgo_Subset < Base # step 2
        # set the algorithm type
        @algo_type = :feature_subset_selection # step 3

        # add your own initialize() here if necessary

        private

        # the algorithm returns a random half-size subset of the original one
        def get_feature_subset # step 4
          org_features = get_features
          subset = org_features.sample(org_features.size/2)

          subset
        end
      end # NewAlgo_Subset
    end # module

**Test your algorithms**

    require 'fselector'

    # example 1: use NewAlgo_Weight
    r1 = FSelector::NewAlgo_Weight.new
    r1.data_from_csv('test/iris.csv')
    r1.select_feature_by_rank!('<=1')
    puts r1.get_features

    # example 2: use NewAlgo_Subset
    r2 = FSelector::NewAlgo_Subset.new
    r2.data_from_csv('test/iris.csv')
    r2.select_feature!
    puts r2.get_features

How to become a contributor of FSelector on GitHub
--------------------------------------------------

**Set up your repository**

1. Go to https://github.com/need47/fselector and click the "Fork" button

2. Clone your fork to your local machine:

        git clone git@github.com:yourGitUserName/fselector.git

3. Assign the original repository to a remote called "upstream":

        cd fselector
        git remote add upstream git://github.com/need47/fselector.git

4. Get updates from the "upstream" and merge them into your local repository:

        git fetch upstream
        git merge upstream/master

**Develop features**

1. Create and check out a feature branch to house your edits:

        git checkout -b branchName

2. Add your own feature selection algorithm:

        git add yourAlgorithm.rb
        git commit -m "your commit message"

3. Push your branch to GitHub:

        git push origin branchName

4. Visit your forked project on GitHub and switch to your branchName branch

5. Click the "Pull Request" button to request that your features be merged into the "upstream" master
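The weighting-vs-subset dispatch described in steps 3 and 4 above can be sketched without the gem itself. The following is a minimal, hypothetical stand-in (MiniBase and its helpers are illustrative names, not FSelector's real Base, which also handles data loading, ranking, and file I/O):

```ruby
# Minimal stand-in for FSelector's Base dispatch (illustration only).
class MiniBase
  class << self
    # expose the algorithm type declared by each subclass (step 3)
    attr_reader :algo_type
  end

  def initialize(features)
    @features = features  # e.g. [:f1, :f2, :f3, :f4]
    @scores   = {}
  end

  def get_features
    @features
  end

  def set_feature_score(f, klass, s)
    (@scores[f] ||= {})[klass] = s
  end

  # dispatch on the declared type, calling the interface from step 4
  def select_features
    if self.class.algo_type == :feature_weighting
      @features.each { |f| calc_contribution(f) }
      @scores
    else
      get_feature_subset
    end
  end
end

# a feature weighting algorithm overrides only calc_contribution()
class MyWeight < MiniBase
  @algo_type = :feature_weighting

  def calc_contribution(f)
    set_feature_score(f, :BEST, rand)  # toy weight in [0..1)
  end
end

# a feature subset selection algorithm overrides only get_feature_subset()
class MySubset < MiniBase
  @algo_type = :feature_subset_selection

  def get_feature_subset
    org = get_features
    org.sample(org.size / 2)  # random half-size subset
  end
end
```

With this sketch, `MyWeight.new([:f1, :f2]).select_features` returns a score hash, while `MySubset.new([:f1, :f2, :f3, :f4]).select_features` returns a two-feature array, mirroring the two interface types.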
data/README.md
CHANGED
@@ -1,15 +1,15 @@
 FSelector: a Ruby gem for feature selection and ranking
 ===========================================================
 
-**Home
+**Home**: [https://rubygems.org/gems/fselector](https://rubygems.org/gems/fselector)
 **Source Code**: [https://github.com/need47/fselector](https://github.com/need47/fselector)
 **Documentation**: [http://rubydoc.info/gems/fselector/frames](http://rubydoc.info/gems/fselector/frames)
 **Author**: Tiejun Cheng
 **Email**: [need47@gmail.com](mailto:need47@gmail.com)
 **Copyright**: 2012
 **License**: MIT License
-**Latest Version**: 1.3.
-**Release Date**: 2012-05-
+**Latest Version**: 1.3.1
+**Release Date**: 2012-05-31
 
 Synopsis
 --------
@@ -27,8 +27,9 @@ outputs a reduced data set with only selected subset of features, which
 can later be used as the input for various machine learning softwares
 such as LibSVM and WEKA. FSelector, as a collection of filter methods,
 does not implement any classifier like support vector machines or
-random forest.
-{file:ChangeLog} for updates
+random forest. Check below for a list of FSelector's features,
+{file:ChangeLog} for updates, and {file:HowToContribute} if you want
+to contribute.
 
 Feature List
 ------------
@@ -88,7 +89,8 @@ Feature List
 WilcoxonRankSum        WRS    weighting  two-class    continuous
 LasVegasFilter         LVF    subset     multi-class  discrete, continuous, mixed
 LasVegasIncremental    LVI    subset     multi-class  discrete, continuous, mixed
-Random
+Random                 Rand   weighting  multi-class  discrete, continuous, mixed
+RandomSubset           RandS  subset     multi-class  discrete, continuous, mixed
 
 **note for feature selection interface:**
 there are two types of filter methods, i.e., feature weighting algorithms and feature subset selection algorithms
@@ -143,7 +145,7 @@ Usage
 -----
 
 **1. feature selection by a single algorithm**
-
+
 require 'fselector'
 
 # use InformationGain (IG) as a feature selection algorithm
@@ -186,7 +188,7 @@ Usage
 
 
 **2. feature selection by an ensemble of multiple feature selectors**
-
+
 require 'fselector'
 
 # example 1
@@ -253,7 +255,7 @@ Usage
 puts ' # features (after): ' + re.get_features.size.to_s
 
 **3. feature selection after discretization**
-
+
 require 'fselector'
 
 # the Information Gain (IG) algorithm requires data with discrete feature
@@ -277,13 +279,19 @@ Usage
 
 **4. see more examples test_*.rb under the test/ directory**
 
+How to contribute
+-----------------
+check {file:HowToContribute} to see how to write your own feature selection algorithms and/or contribute to FSelector.
+
 Change Log
 ----------
+
 A {file:ChangeLog} is available from version 0.5.0 upward to reflect
-what's new and what's changed
+what's new and what's changed.
 
 Copyright
 ---------
+
 FSelector © 2012 by [Tiejun Cheng](mailto:need47@gmail.com).
 FSelector is licensed under the MIT license. Please see the {file:LICENSE} for
 more information.
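The examples above select features with `select_feature_by_rank!('<=1')`: a weighting algorithm's scores are turned into ranks (rank 1 = best) and only features whose rank passes the criterion are kept. A conceptual sketch follows; the helper names and signatures are hypothetical, not the gem's API (the real method parses the criterion string and mutates its internal data in place):

```ruby
# turn a feature => weight map into a feature => rank map (rank 1 = best)
def rank_features(weights)
  sorted = weights.sort_by { |_f, w| -w }
  sorted.each_with_index.map { |(f, _w), i| [f, i + 1] }.to_h
end

# keep only the features whose rank is <= max_rank,
# i.e. the idea behind a '<=1' criterion
def select_by_rank(weights, max_rank)
  rank_features(weights).select { |_f, r| r <= max_rank }.keys
end
```

For example, with weights `{ a: 0.9, b: 0.1, c: 0.5 }`, a `max_rank` of 1 keeps only `:a`, the top-weighted feature.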
data/lib/fselector.rb
CHANGED

data/lib/fselector/algo_base/base.rb
CHANGED
@@ -506,7 +506,16 @@ module FSelector
   end
 
 
-  #
+  # calculate each feature's contribution
+  # override it in a derived subclass for the type of feature weighting algorithms
+  def calc_contribution(f)
+    abort "[#{__FILE__}@#{__LINE__}]: \n"+
+          "  derived subclass must implement its own calc_contribution()"
+  end
+
+
+  # get feature subset
+  # override it in a derived subclass for the type of feature subset selection algorithms
   def get_feature_subset
     abort "[#{__FILE__}@#{__LINE__}]: \n"+
           "  derived subclass must implement its own get_feature_subset()"
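The hunk above gives Base a failing placeholder for each of the two interfaces, so a subclass only overrides the one matching its @algo_type. A sketch of the same guard pattern follows; note the gem calls abort, which exits the whole process, so this illustration raises NotImplementedError instead to keep the failure observable (SketchBase and SketchWeighting are hypothetical names):

```ruby
# Sketch of the template-method guard added to Base: the base class defines
# both interfaces as failing placeholders; a subclass overrides only the
# one its algorithm type needs.
class SketchBase
  def calc_contribution(f)
    raise NotImplementedError,
          "derived subclass must implement its own calc_contribution()"
  end

  def get_feature_subset
    raise NotImplementedError,
          "derived subclass must implement its own get_feature_subset()"
  end
end

# a weighting-type subclass overrides only calc_contribution();
# calling the other interface still hits the base-class guard
class SketchWeighting < SketchBase
  def calc_contribution(f)
    f.to_s.length  # toy weight
  end
end
```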
data/lib/fselector/algo_both/Random.rb
CHANGED
@@ -4,9 +4,9 @@
 module FSelector
   #
   # Random (Rand) for discrete, continuous or mixed feature,
-  # no pratical use but can be used
+  # no practical use but can be used for test purpose
   #
-  # Rand
+  # Rand assigns a random score of [0..1) to each feature
   #
   # ref: [An extensive empirical study of feature selection metrics for text classification](http://dl.acm.org/citation.cfm?id=944974)
   #
@@ -40,4 +40,8 @@ module FSelector
 end # class
 
 
+# shortcut so that you can use FSelector::Rand instead of FSelector::Random
+Rand = Random
+
+
 end # module
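The Rand = Random shortcut added above is plain Ruby constant assignment: both constants end up pointing at the very same Class object, so the alias adds no copy and no indirection. A toy illustration (the module and class names here are hypothetical, chosen to avoid clashing with Ruby's built-in Random):

```ruby
module ToyFSelector
  class RandomAlgo
    def label
      "random"
    end
  end

  # shortcut, mirroring Rand = Random inside module FSelector --
  # RandAlgo and RandomAlgo now name the same Class object
  RandAlgo = RandomAlgo
end
```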
data/lib/fselector/algo_both/RandomSubset.rb
ADDED
@@ -0,0 +1,52 @@
#
# FSelector: a Ruby gem for feature selection and ranking
#
module FSelector
  #
  # RandomSubset (RandS) for discrete, continuous or mixed feature,
  # no practical use but can be used for test purpose
  #
  # RandS generates a random subset of the original feature set
  #
  class RandomSubset < Base
    # this algo outputs a subset of features
    @algo_type = :feature_subset_selection

    #
    # initialize from an existing data structure
    #
    # @param [Integer] nfeature number of features required,
    #   use a random number if nil
    #
    def initialize(nfeature=nil, data=nil)
      super(data)

      @nfeature = nfeature
    end

    private

    # RandomSubset algorithm
    def get_feature_subset
      subset = []

      if @nfeature and @nfeature > 0
        subset = get_features.sample(@nfeature)
      else
        n = rand(get_features.size)
        n += 1 if n == 0
        subset = get_features.sample(n)
      end

      subset
    end # get_feature_subset

  end # class


  # shortcut so that you can use FSelector::RandS instead of FSelector::RandomSubset
  RandS = RandomSubset

end # module
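The sampling step of get_feature_subset() above can be exercised outside the gem. Here is a standalone sketch of the same logic (the function name is illustrative, not part of FSelector's API): when no size is given, it draws a random size in 1..(size-1), bumping 0 to 1 so the subset is never empty.

```ruby
# Standalone sketch of RandomSubset#get_feature_subset's sampling logic.
def random_feature_subset(features, nfeature = nil)
  if nfeature and nfeature > 0
    features.sample(nfeature)    # caller fixed the subset size
  else
    n = rand(features.size)      # 0 .. size-1
    n += 1 if n == 0             # guarantee a non-empty subset
    features.sample(n)           # n distinct features, randomly chosen
  end
end
```

Array#sample(n) returns n distinct elements, so the result is always a duplicate-free subset of the input.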
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fselector
 version: !ruby/object:Gem::Version
-  version: 1.3.
+  version: 1.3.1
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-05-
+date: 2012-05-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rinruby
-  requirement: &
+  requirement: &25065000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,7 +21,7 @@ dependencies:
       version: 2.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *25065000
 description: FSelector is a Ruby gem that aims to integrate various feature selection/ranking
   algorithms and related functions into one single package. Welcome to contact me
   (need47@gmail.com) if you'd like to contribute your own algorithms or report a bug.
@@ -41,6 +41,7 @@ files:
 - README.md
 - ChangeLog
 - LICENSE
+- HowToContribute
 - .yardopts
 - lib/fselector/algo_base/base.rb
 - lib/fselector/algo_base/base_CFS.rb
@@ -51,6 +52,7 @@ files:
 - lib/fselector/algo_both/LasVegasFilter.rb
 - lib/fselector/algo_both/LasVegasIncremental.rb
 - lib/fselector/algo_both/Random.rb
+- lib/fselector/algo_both/RandomSubset.rb
 - lib/fselector/algo_continuous/BSS_WSS.rb
 - lib/fselector/algo_continuous/CFS_c.rb
 - lib/fselector/algo_continuous/F-Test.rb