svmlab 1.0.0

--------------------------------------------------------------------------------
Contents:
1) Install
2) Configuration
3) Feature methods


--------------------------------------------------------------------------------
1) HOW TO INSTALL
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
2) CONFIGURATION
--------------------------------------------------------------------------------

------------------------------------------------------------
CONFIGURATION OF SUPPORT VECTOR MACHINE
------------------------------------------------------------
SVM:
  C: 1
  e: 0.3
  g: 0.05
  Optimization:
    Method: patternsearch
    Nhalf: 2
  Scale:
    Default: std
---------------------------------------------------------------------------------
The configuration under the 'SVM' entry has three parts, as seen above.
1) Parameters to the SVM algorithm.
   Possible entries are:
   C : The penalty factor for misclassification.
   g : Gamma for the RBF kernel exp(-gamma * |x-y|^2)
   e : Epsilon for epsilon-tube regression
2) Optimization configuration.
   Any parameter given in the configuration may be optimized. The
   optimization method is configured here.
   Method : Name of the optimization method to be used. Currently only the
            patternsearch method is implemented.
   Nhalf  : How many times to halve the patternsearch step size before
            giving up the optimization. It is a trade-off between
            execution time and granularity of the optimization. A
            reasonable value for Nhalf is 2.
   ----------------------------------------------------------------------
   | For a parameter to be optimized, it needs to have optimization
   | instructions added to its setting. Pattern search instructions may
   | be added to any numerical parameter by appending text with this
   | syntax:
   |   step exp<stepsize>
   | OR
   |   step <stepsize>
   | which should appear AFTER the value of the parameter to be
   | optimized. If 'step exp1.0' is given, the pattern search uses an
   | initial step size of 1 on the base-10 logarithmic scale, i.e. it
   | tries values at 0.1 and 10 times the original value. Most
   | parameters (including the SVM parameters) are best optimized in the
   | log scale. For example, to optimize SVM parameters we may write
   |   C: 1 step exp1.0
   |   g: 0.05 step exp1.0
   -------------------------------------------------------------------
3) Normalization factors for features.
   The SVM does best when features are normalized in one way or another.
   The standard way of normalizing is to calculate the so-called z-score
   for each feature (subtracting the mean value and dividing by the
   standard deviation), but depending on the feature, other scalings may
   be better. All scalings described below are applied after each feature
   has been centered around zero (by subtracting the feature's mean value
   on the given dataset). For the scaling after centering, there are
   several possibilities:
   A) No Scale entry at all in the configuration.
      All features are given to the SVM without rescaling.
   B) Only Default given.
      All features are scaled according to the scale factor given
      for Default.
   C) Scaling for specific features is also given.
      The specific scaling factor is applied to the features given.
      All other features get the Default scaling factor (or 1 if no
      Default is given). A specific feature scaling is given in one of
      the following syntaxes:
        featurename: scaling factor
      OR
        featurename:
          index1: scaling factor
          ...
          indexN: scaling factor
      Note that 'index' refers to an index into a multidimensional
      feature, counting from 0. Also note that if the scaling is given by
      the first syntax, it is applied to all dimensions of the feature.
   A scaling factor can be one of the following:
   max  : The scaling factor is set so that the maximum absolute value of
          the (centered) feature is 1 on the given dataset.
   std  : The scaling factor is set so that the standard deviation is 1.
          This is the most common normalization of features.
   <num>: A specific numeric value.
   Note that optimization instructions may also be appended to scaling
   factors (an extreme example would be to add them to the Default
   scaling factor, which gives a lengthy optimization calculation and
   probably an "overtraining" of how to scale features). An example
   follows below.
---------------------------------------------------------------------------------
------------------------------------------------------------
CONFIGURATION OF FEATURES
------------------------------------------------------------
Feature:
  Features:
    - <targetfeature>
    - <feature1>
    - <feature2>
    ...
    - <featureN>
  BaseDir: <base directory>
  DataSet: <file>
           OR
           <file prefix>
  Groups: <range (n1..n2)>
          OR
          <file prefix>
  Methods: <path of Ruby code file>
  <featureX>:
    HomeDir: <home directory>
    Method: <method name>
    Dimensions: <number of dimensions in this feature>
    <Further specific configuration for this feature>
  ...

-----------------------------------------------------------------------------------
| More detailed descriptions of the entries in the feature
| configuration.
| ---------------------------------------------------------------------------------
| Features  : An array giving all features to be used. The first feature
|             in the array is the feature to be predicted from the other
|             features.
| BaseDir   : A directory from which all other directory names are derived.
| DataSet   : A file or a file prefix (e.g. 'dataset' refers to dataset1,
|             dataset2, etc). The file(s) contain the names of all samples in
|             the dataset used to build an SVM model in SVMLab. The file(s)
|             should be simple lists of sample names, each name on its own
|             line. Sample names should not contain blank spaces.
|             Example : ?
| Groups    : Groups made for cross-validation.
|             If Groups is not set, leave-one-out cross-validation is used.
|             Two possibilities:
|             A) The range (n1..n2) groups together all samples whose names'
|                substrings indexed by (n1..n2) correspond. E.g. the range
|                (0..2) would group together 'tommy' and 'tomas' while putting
|                'tony' in a separate group.
|             B) A file prefix; a group is made from the sample names found
|                in each single file. See DataSet above.
| Methods   : The path to a file containing Ruby code that implements the
|             methods used for feature calculation (see the specific feature
|             configurations below).
| <feature> : Each feature given in the array Features (see above) needs its
|             own configuration. The following entries may be set for each
|             feature (a hypothetical example follows after this box):
|             HomeDir    : May be set to an absolute path or relative to
|                          BaseDir (see above). If HomeDir is not given, it
|                          is set to BaseDir/featurename.
|             Method     : The function that is called when the feature is
|                          requested. If Method is not given, an attempt is
|                          made to acquire the feature from the database. If
|                          that fails, ERROR is reported.
|             Dimensions : The number of dimensions for this feature. If
|                          Dimensions is not given, it is assumed to be 1
|                          and only the first value for each example is
|                          used.
|             <other>    : You may wish to include more configuration for
|                          the feature, and you may add it here. The entire
|                          feature configuration is sent to the feature
|                          method (see Method above), so this is a good
|                          place to put parameters for feature calculations.
|                          It is also possible to add pattern search
|                          instructions here, for parameters that are up
|                          for optimization.
-----------------------------------------------------------------------------------
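
For illustration, here is a hypothetical feature configuration following
the template above (all names and values are invented):

Feature:
  Features:
    - ddG
    - hydrophobicity
    - contacts
  BaseDir: /home/fred/DeltaDeltaG
  DataSet: dataset
  Groups: (0..3)
  Methods: mymethods.rb
  contacts:
    Method: calc_contacts
    Dimensions: 2
    Cutoff: 8.0 step exp1.0

Here 'ddG' is the target feature to be predicted, samples whose names
share their first four characters are grouped together for
cross-validation, and 'contacts' is computed by the Ruby method
calc_contacts (defined in mymethods.rb) with a feature-specific
parameter Cutoff that is up for pattern search optimization.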

--------------------------------------------------------------------------------
3) FEATURE METHODS
--------------------------------------------------------------------------------
As a user of SVMLab you should not have to worry about preparing data for the
SVM experiment. What you should care about instead is implementing methods for
calculating or retrieving features. Unfortunately, all methods have to be
implemented in Ruby, since SVMLab is written in Ruby and executed in the Ruby
environment. Ruby is, however, not a very complicated language, and you may
also use it merely to call a program implemented in any other language by
letting the Ruby method make a call to the terminal shell (see below for an
example).
As you can see in the configuration section, you have to configure a method for
each feature that you use. This section is about how to implement the method
itself. There are only two rules for this: the format of the input and the
format of the output.
1) Input
SVMLab calls each method with the syntax
  method(Hash cfg, String samplename)
It is up to the method that you implement to know how to interpret the
samplename and use the configuration in order to answer the request for a
feature value.
2) Output
This should be given in the format
  {samplename => String answer,
   othersample => String otheranswer,
   ...}
Most important here is that the answer is given in the form of a Hash in which
one key points to the answer for the sample that the feature was requested
for. Your method may, however, include other keys in the Hash, giving answers
for other samples, if this benefits you. For example, if the method is a
lengthy calculation that would have to be repeated for other samples, it is
convenient to hand all answers produced by the calculation over to SVMLab.
This way, SVMLab only needs to consult its database the next time a feature
value requiring the same calculation is needed.
Also note that each answer should be given as a String. While the SVM expects
numeric features, this lets the method give a non-numeric answer, which
SVMLab interprets as an error message. If the feature is multidimensional,
all numeric values in the String should be separated by spaces.

Example of a method

This is an example of a case where you already have an executable that you
call in the terminal by 'myexecutable -a myargument':
  def mymethod(cfg, sample)
    ans = `myexecutable -a #{cfg['myargument']}`
    {sample => ans}
  end
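
Similarly, here is a sketch of a method that computes answers for all samples
in one pass and returns them together (the executable, its flags, and the
'Cutoff' entry are hypothetical):
  def batchmethod(cfg, sample)
    # Assumed: the program prints one 'samplename value' pair per line
    output = `myexecutable --all --cutoff #{cfg['Cutoff']}`
    ans = {}
    output.each_line do |line|
      name, value = line.split
      ans[name] = value   # answers stay Strings, as SVMLab expects
    end
    ans
  end
SVMLab stores every key of the returned Hash in its database, so the lengthy
calculation does not have to be repeated for the other samples.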
--------------------------------------------------------------------------------
#require 'narray'

class Array
  def sum
    self.inject(0) { |s,i| s + i }
  end

  def mean
    self.sum.to_f / self.size
  end

  #def median
  #  NArray.to_na(self).median
  #end

  # Returns the values at the 'percent' and '100-percent' percentiles of
  # the sorted array; percent == 0 gives [min, max].
  def percentile(percent)
    if percent == 0
      [self.min, self.max]
    else
      sorted = self.sort
      [ sorted[(percent/100.0*self.size).round],
        sorted[((100-percent)/100.0*self.size).round] ]
    end
  end

  def stddev
    Math.sqrt( self.variance )
  end

  # Mean square deviation from mean
  def variance
    m = self.mean
    sum = self.inject(0) do |s,i|
      s + (i-m)**2
    end
    sum / self.size.to_f
  end

  # Scalar product. NOTE: this redefines Array#* (normally repetition/join).
  def *(arg)
    if self.size != arg.size
      raise "Not equal sizes in scalar product"
    end
    self.zip(arg).inject(0.0) do |s,(i,j)|
      s + i * j
    end
  end

  def to_i
    self.map{|i| i.to_i}
  end

  def to_f
    self.map{|i| i.to_f}
  end
end

# Pearson correlation coefficient: (E[xy] - E[x]E[y]) / (sx * sy)
def correlation(x, y)
  xymean = (x * y) / x.size.to_f
  sx = x.stddev
  sy = y.stddev
  (xymean - x.mean * y.mean) / (sx * sy)
end

class Hash
  # Collects the values for all keys present in both hashes into two
  # parallel arrays of Floats.
  def to_xy(other)
    raise "Please give Hash as argument" if !other.is_a? Hash
    arr = self.keys.inject([]) do |a,k|
      if other[k] and self[k]
        a << [self[k].to_f, other[k].to_f]
      end
      a
    end
    x = arr.map{|i| i[0]}
    y = arr.map{|i| i[1]}
    [ x, y ]
  end

  # Correlation coefficient between the common entries of two hashes,
  # or nil if they share no keys.
  def cc(other)
    x, y = to_xy(other)
    if x.size > 0
      correlation(x,y)
    else
      nil
    end
  end
end
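
# Example usage (values invented): correlation between two result sets
# keyed by sample name.
#   pred = { 'tommy' => '1.2', 'tomas' => '0.3', 'tony' => '-0.5' }
#   meas = { 'tommy' => '1.0', 'tomas' => '0.5', 'tony' => '-0.2' }
#   pred.cc(meas)   # => Pearson correlation over the shared keys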
--------------------------------------------------------------------------------
p.f1
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
p = lab.crossvalidate
p.cc
p.rmsd
p.f1 2
p.f1
p.f1 1
p.f1 1:3
p.f1 3
p.f1 4
p.f1 5
p.f1 0
p.plot
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
p = lab.crossvalidate
p.plot
help f1
help p.f1
pm p
class Fhash < Hash
  def to_s
    "I'm a Fhash!"
  end
end
f = Fhash.new
puts f
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
a |= 1
a
a = 1
a |= 1
a
a =| 1
a &= 1
a
a &= 2
a
b &= 2
b
a ||= 2
a = nil
a ||= 2
a
a ||= 3
a
a = nil
a ||= []
a
p a
a.size
exit
lab = SVMLab.new('testcfg')
lab['SVM']['Center'].to_yaml
lab.cfg['SVM']['Center'].to_yaml
puts lab.cfg['SVM']['Center'].to_yaml
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
p = lab.crossvalidate
p.cc
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab.crossvalidate.cc
lab.crossvalidate.rmsd
lab.C
lab.epsilon
exit
exit
exit
--------------------------------------------------------------------------------
require 'tempfile'

# This is a class that IS the path of a file with LIBSVM data.
# It can do some more tricks beyond just being a path.
# However, all those tricks are DESTRUCTIVE so be careful.
# It is also not a good class if you want to keep LIBSVM's sparse
# data format. This class will fill everything missing with zeros
# upon creation.
class LIBSVMdata < String

  attr_reader :data, :path

  def initialize(path, readonly = true, classborder = nil)
    @readonly = readonly
    @path = path
    @data = File.read(path)
    @classborder = classborder
    if @readonly
      super(path)
    else
      # Work on a temp copy so the original file is never modified
      @file = Tempfile.new('libsvmdata')
      super(@file.path)
      self.fillSparseData!
    end
  end

  # Number of columns per line: the prediction plus one per feature
  def dim
    @dim ||= @data.split("\n").first.split.size
  end

  def nexamples
    @nexamples ||= @data.split("\n").size
  end

  # Write @data back to the file and invalidate the cached sizes
  def update
    return if @readonly
    @dim = nil
    @nexamples = nil
    open(self,'w') do |f|
      f.puts @data
    end
    self
  end

  # Add vector[0] to every prediction (unless classborder is set) and
  # vector[i] to every value of feature i.
  def translate!(vector)
    return if @readonly
    @data =
      @data.split("\n").map { |line|
        pred,*feats = line.split
        (@classborder ? "#{pred} " : "#{(pred.to_f+vector[0])} ") +
          feats.map{|feat|
            key,val = feat.split(':')
            "#{key}:#{val.to_f+vector[key.to_i]}"}.join(' ') + "\n"
      }.join
    self.update
  end

  # Multiply every prediction by vector[0] (unless classborder is set)
  # and every value of feature i by vector[i].
  def scale!(vector)
    return if @readonly
    @data =
      @data.split("\n").map { |line|
        pred,*feats = line.split
        (@classborder ? "#{pred} " : "#{(pred.to_f*vector[0])} ") +
          feats.map{|feat|
            key,val = feat.split(':')
            "#{key}:#{val.to_f*vector[key.to_i]}"}.join(' ') + "\n"
      }.join
    self.update
  end

  # Column means: index 0 is the prediction, the rest are the features.
  # Assumes dense data (see fillSparseData!).
  def mean
    sum = Array.new(dim){0}
    sum =
      @data.split("\n").inject(sum) { |s,line|
        pred,*feats = line.split
        if !@classborder then s[0] += pred.to_f end
        fs = s[1..-1].zip(feats).map{|si,fi|
          si + fi.split(':').last.to_f}
        [s[0]] + fs }
    sum.map{|i| i / nexamples.to_f}
  end

  # Column-wise maximum absolute values, in the same layout as mean
  def absmax
    max = Array.new(dim){0}
    @data.split("\n").inject(max) { |s,line|
      pred,*feats = line.split
      s[0] = [s[0],pred.to_f.abs].max
      fs = s[1..-1].zip(feats).map{|si,fi|
        [si, fi.split(':').last.to_f.abs].max}
      [s[0]] + fs }
  end

  # Expand LIBSVM's sparse format so that every line carries every
  # feature index, with missing values filled in as zeros. If
  # classborder is set, predictions are binarized to 1/-1 around it.
  def fillSparseData!
    return if @readonly
    ndim = @data.split("\n").map{|line|
      if (arr=line.split).size>1
        arr.last.split(':').first.to_i
      else 0 end
    }.max
    @data =
      @data.split("\n").map{|line|
        pred,*feats = line.split
        if @classborder
          pred = (pred.to_f>=@classborder) ? '1' : '-1'
        end
        counter = 0
        pred + ' ' +
          (1..ndim).map { |index|
            if (element = feats[counter]) and
               (element.split(':').first.to_i == index)
              counter+=1
              "#{index}:#{element.split(':').last}"
            else
              "#{index}:0"
            end
          }.join(' ') + "\n"
      }.join
    self.update
  end

end
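
# Example usage (file name invented): load a writable copy of a LIBSVM
# file, center every column on zero, rescale to a maximum absolute value
# of 1, then hand the temp-file path (the object itself) to an SVM tool.
#
#   data = LIBSVMdata.new('train.libsvm', false)
#   data.translate!(data.mean.map { |m| -m })
#   data.scale!(data.absmax.map { |v| v.zero? ? 1.0 : 1.0 / v })
#   system("svm-train #{data}")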