svmlab 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,219 @@
+ Contents:
+ 1) Install
+ 2) Configuration
+ 3) Feature methods
+
+
+ --------------------------------------------------------------------------------
+ 1) HOW TO INSTALL
+ --------------------------------------------------------------------------------
+
+ --------------------------------------------------------------------------------
+ 2) CONFIGURATION
+ --------------------------------------------------------------------------------
+
+ ------------------------------------------------------------
+ CONFIGURATION OF SUPPORT VECTOR MACHINE
+ ------------------------------------------------------------
+ SVM:
+   C: 1
+   e: 0.3
+   g: 0.05
+   Optimization:
+     Method: patternsearch
+     Nhalf: 2
+   Scale:
+     Default: std
+ ---------------------------------------------------------------------------------
+ The configuration under the 'SVM' entry has three parts, as seen above.
+ 1) Parameters to the SVM algorithm.
+    Possible entries are:
+    C : The penalty factor for misclassification.
+    g : Gamma for the RBF kernel exp(- gamma * |x-y|^2)
+    e : Epsilon for epsilon-tube regression
+ 2) Optimization configuration.
+    Any parameter given in the configuration may be optimized. The
+    optimization method is configured here.
+    Method : Name of the optimization method to be used. Currently only the
+             patternsearch method is implemented.
+    Nhalf : How many times to halve the patternsearch step size before
+            giving up the optimization. It is a trade-off between
+            execution time and detail of optimization. A reasonable value
+            for Nhalf is 2.
+    ----------------------------------------------------------------------
+    | For a parameter to be optimized, it needs to have optimization
+    | instructions added to its setting. Pattern search instructions may
+    | be added to any numerical parameter by appending text following
+    | this syntax:
+    |   step exp<stepsize>
+    | OR
+    |   step <stepsize>
+    | which should appear AFTER any parameter in the configuration to be
+    | optimized. If 'step exp1.0' is given, it means that the pattern
+    | search should use an initial step size of 1 in the 10-logarithmic
+    | scale, i.e. it tries values at 0.1 and 10 times the original
+    | value. Most parameters (including the SVM parameters) are best
+    | optimized in the log scale. For example, to optimize SVM parameters
+    | we may write
+    |   C: 1 step exp1.0
+    |   g: 0.05 step exp1.0
+    -------------------------------------------------------------------
+ 3) Normalization factors for features.
+    The SVM does best when features are normalized in one way or another.
+    The standard way of normalizing is to calculate the so-called z-score
+    for each feature (subtracting the mean value and dividing by the
+    standard deviation), but depending on the feature, other scalings may
+    be better. All scalings described below are applied after each feature
+    has been centered around zero (by subtracting the mean value of the
+    feature, calculated on the dataset given). For scaling after the
+    centering, there are several possibilities:
+    A) No Scale entry at all in the configuration.
+       All features will be given to the SVM without rescaling.
+    B) Only Default given.
+       All features will be scaled according to the scale factor given
+       for Default.
+    C) Specific features' scaling is also given.
+       A specific scaling factor will be applied to the features given.
+       All other features will have the Default scaling factor (or 1 if no
+       Default is given). A specific feature scaling is given in one of the
+       following syntaxes:
+         featurename: scaling factor
+       OR
+         featurename:
+           index1: scaling factor
+           ...
+           indexN: scaling factor
+       Note that 'index' refers to the index in a multidimensional feature,
+       counting from 0. Also note that if scaling is given by the first
+       syntax, the scaling will be applied to all dimensions of the
+       feature.
+    A scaling factor can be one of the following:
+    max : Scaling factor is set so that the max (absolute) value of the
+          (centered) feature is 1 on the dataset given.
+    std : Scaling factor is set so that the standard deviation is 1.
+          This is the most common normalization of features.
+    <num>: A specific numeric value.
+    Note that optimization instructions may also be appended to scaling
+    factors (an extreme example would be to add them to the Default scaling
+    factor, which gives a lengthy optimization calculation and probably
+    "overtraining" on how to scale features).
+ ---------------------------------------------------------------------------------
+ ------------------------------------------------------------
+ CONFIGURATION OF FEATURES
+ ------------------------------------------------------------
+ Feature:
+   Features:
+     - <targetfeature>
+     - <feature1>
+     - <feature2>
+     ...
+     - <featureN>
+   BaseDir: <base directory>
+   DataSet: <file>
+            OR
+            <file prefix>
+   Groups: <range (n1..n2)>
+           OR
+           <file prefix>
+   Methods: <path of Ruby code file>
+   <featureX>:
+     HomeDir: <home directory>
+     Method: <method name>
+     Dimensions: <number of dimensions in this feature>
+     <Further specific configuration for this feature>
+   ...
+
+ -----------------------------------------------------------------------------------
+ | A more detailed description of the configuration entries for the
+ | feature configuration follows.
+ | ---------------------------------------------------------------------------------
+ | Features : This is an array giving all features to be used. The first feature
+ |            in the array is the feature to be predicted from the other features.
+ | BaseDir : A directory from which all other directory names are derived.
+ | DataSet : A file or a file prefix (such as 'dataset', which points to dataset1,
+ |           dataset2, etc.). The file(s) contain the names of all samples in
+ |           the dataset used to build an SVM model in SVMLab. The file(s)
+ |           should be simple lists of sample names, each name on its own
+ |           line. Sample names should be without blank spaces.
+ |           Example : ?
+ | Groups : These are groups made for cross-validation.
+ |          If Groups is not set, leave-one-out cross-validation will be used.
+ |          Two possibilities:
+ |          A) The range (n1..n2) groups together all samples whose names'
+ |             substrings indexed by (n1..n2) correspond. E.g. the range
+ |             (0..2) would group together 'tommy' and 'tomas' while putting
+ |             'tony' in a separate group.
+ |          B) A file prefix, where a group is made from the sample names found
+ |             in each single file. See DataSet above.
+ | Methods : The path to a file containing Ruby code implementing methods used
+ |           for feature calculation (see specific feature configurations
+ |           below).
+ | <feature> : Each feature given in the array Features (see above) needs its
+ |             own configuration. The following entries should be set for all
+ |             features:
+ |             HomeDir : This may be set to an absolute path or relative to
+ |                       BaseDir (see above). If HomeDir is not given, it will
+ |                       be set to BaseDir/featurename.
+ |             Method : The function that is called when the feature is
+ |                      requested. If Method is not given, an attempt is made
+ |                      to acquire the feature from the database. If that
+ |                      fails, ERROR is reported.
+ |             Dimensions : The number of dimensions for this feature. If
+ |                          Dimensions is not given, it will be assumed to be 1
+ |                          and only the first value for each example will be
+ |                          used.
+ |             <other> : You may wish to include more configuration entries
+ |                       for the feature, and you may add them here. The
+ |                       entire feature configuration is sent to the feature
+ |                       method (see Method above), so this is a good place
+ |                       to put parameters for feature calculations. It is
+ |                       also possible to add pattern search instructions
+ |                       here, for parameters that are up for optimization.
+ -----------------------------------------------------------------------------------
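+ As an illustrative sketch (all feature names and paths below are
+ hypothetical), a complete Feature section could read:
+   Feature:
+     Features:
+       - ddG
+       - hydrophobicity
+     BaseDir: /home/user/experiment
+     DataSet: dataset
+     Groups: (0..3)
+     Methods: /home/user/experiment/methods.rb
+     ddG:
+       Dimensions: 1
+     hydrophobicity:
+       Method: calc_hydrophobicity
+       Dimensions: 2
+       Window: 5 step exp1.0
+ Here 'ddG' is the target feature (acquired from the database, since no
+ Method is given) and 'Window' is an example of a method-specific parameter
+ that is also opened up for pattern search optimization.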
+
+ --------------------------------------------------------------------------------
+ 3) FEATURE METHODS
+ --------------------------------------------------------------------------------
+ As a user of SVMLab you should not have to worry about preparing data for the
+ SVM experiment. What you should care about instead is implementing methods for
+ calculating or retrieving features. Unfortunately, all methods have to be
+ implemented in Ruby, since SVMLab is written in Ruby and executed in the Ruby
+ environment. Ruby is, however, not a very complicated language, and you may
+ also use it merely to call a program implemented in any language by letting
+ the Ruby method make a call to the terminal shell (see below for an example).
+ As you can see in the configuration section, you have to configure a method for
+ each feature that you use. This section is about how to implement the method
+ itself. There are only two rules for this - the format of the input and the
+ format of the output.
+ 1) Input
+ SVMLab calls each method with the syntax
+   method(Hash cfg, String samplename)
+ It is up to the method that you implement to know how to interpret the
+ samplename and use the configuration in order to answer the request for a
+ feature value.
+ 2) Output
+ This should be given in the format
+   {samplename => String answer,
+    othersample => String otheranswer,
+    ...}
+ Most important here is that the answer should be given in the form of a Hash
+ where one key points to the answer for the sample that the feature was
+ requested for. Your method may, however, include other keys in the Hash giving
+ answers for other samples if this benefits you. For example, if the method is
+ a lengthy calculation that would have to be repeated for other samples, it is
+ convenient to hand all answers produced by the calculation over to SVMLab at
+ once. This way, SVMLab only needs to consult its database the next time a
+ feature value requiring the same calculation is needed.
+ Also note that the answers should be given as Strings. While the SVM expects
+ numeric features, this lets the method give a non-numeric answer, which SVMLab
+ interprets as an error message. If the feature is multidimensional, all
+ numeric values should be separated by spaces in the String.
+
+ Example of a method
+
+ This is an example of a case where you already have an executable that you call
+ in the terminal by 'myexecutable -a myargument' :
+   def mymethod(cfg, sample)
+     ans = `myexecutable -a #{cfg['myargument']}`
+     {sample => ans}
+   end
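+ If one run of your calculation produces values for many samples at once, a
+ method following this sketch (the configuration entry 'ResultFile' and the
+ file format are hypothetical) can return them all in one go; note how a
+ two-dimensional feature is given as space-separated numbers:
+   def batchmethod(cfg, sample)
+     ans = {}
+     # Each line of the result file: <samplename> <value1> <value2>
+     open(cfg['ResultFile']).each_line do |line|
+       name, v1, v2 = line.split
+       ans[name] = "#{v1} #{v2}"
+     end
+     ans
+   end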
@@ -0,0 +1,87 @@
+ #require 'narray'
+
+ class Array
+   def sum
+     self.inject(0) { |s,i| s + i }
+   end
+
+   def mean
+     self.sum.to_f / self.size
+   end
+
+   #def median
+   #  NArray.to_na(self).median
+   #end
+
+   # Returns [lower, upper] values cutting off 'percent' % at each end
+   # (percent == 0 gives the plain [min, max]).
+   def percentile(percent)
+     if percent == 0
+       [self.min, self.max]
+     else
+       sorted = self.sort
+       [ sorted[(percent/100.0*self.size).round],
+         sorted[((100-percent)/100.0*self.size).round] ]
+     end
+   end
+
+   def stddev
+     Math.sqrt( self.variance )
+   end
+
+   # Mean square deviation from the mean
+   def variance
+     m = self.mean
+     sum = self.inject(0) do |s,i|
+       s + (i-m)**2
+     end
+     sum / self.size.to_f
+   end
+
+   # Scalar product
+   def *(arg)
+     if self.size != arg.size
+       raise "Not equal sizes in scalar product"
+     end
+     self.zip(arg).inject(0.0) do |s,(i,j)|
+       s + i * j
+     end
+   end
+
+   # Element-wise conversion to integers
+   def to_i
+     self.map{|i| i.to_i}
+   end
+
+   # Element-wise conversion to floats
+   def to_f
+     self.map{|i| i.to_f}
+   end
+ end
+
+ # Pearson correlation coefficient of two equally sized arrays
+ def correlation(x, y)
+   xymean = (x * y) / x.size.to_f
+   sx = x.stddev
+   sy = y.stddev
+   (xymean - x.mean * y.mean) / (sx * sy)
+ end
+
+ class Hash
+   # Pairs up self[k] and other[k] for all keys present in both hashes,
+   # returning the two value arrays [x, y] as floats.
+   def to_xy(other)
+     raise "Please give Hash as argument" if !other.is_a? Hash
+     arr = self.keys.inject([]) do |a,k|
+       if other[k] and self[k]
+         a << [self[k].to_f, other[k].to_f]
+       end
+       a
+     end
+     x = arr.map{|i| i[0]}
+     y = arr.map{|i| i[1]}
+     [ x, y ]
+   end
+
+   # Correlation coefficient over the keys shared with 'other',
+   # or nil if there is no overlap.
+   def cc(other)
+     x, y = to_xy(other)
+     #puts "CC : N = #{x.size}"
+     if x.size > 0
+       correlation(x,y)
+     else
+       nil
+     end
+   end
+ end
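+
+ # A minimal usage sketch of the helpers above (illustrative values):
+ #   x = [1.0, 2.0, 3.0, 4.0]
+ #   y = [2.1, 3.9, 6.2, 8.1]
+ #   x.mean              # => 2.5
+ #   correlation(x, y)   # close to 1.0 for this nearly linear data
+ #   a = {'s1' => '1.0', 's2' => '2.0', 's3' => '3.0'}
+ #   b = {'s1' => '2.0', 's3' => '6.1'}
+ #   a.cc(b)             # correlation over the shared keys s1 and s3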
@@ -0,0 +1,100 @@
+ p.f1
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ p = lab.crossvalidate
+ p.cc
+ p.rmsd
+ p.f1 2
+ p.f1
+ p.f1 1
+ p.f1 1:3
+ p.f1 3
+ p.f1 4
+ p.f1 5
+ p.f1 0
+ p.plot
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ p = lab.crossvalidate
+ p.plot
+ help f1
+ help p.f1
+ pm p
+ class Fhash < Hash
+ def to_s
+ "I'm a Fhash!"
+ end
+ end
+ f = Fhash.new
+ puts f
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ a |= 1
+ a
+ a = 1
+ a |= 1
+ a
+ a =| 1
+ a &= 1
+ a
+ a &= 2
+ a
+ b &= 2
+ b
+ a ||= 2
+ a = nil
+ a ||= 2
+ a
+ a ||= 3
+ a
+ a = nil
+ a ||= []
+ a
+ p a
+ a.size
+ exit
+ lab = SVMLab.new('testcfg')
+ lab['SVM']['Center'].to_yaml
+ lab.cfg['SVM']['Center'].to_yaml
+ puts lab.cfg['SVM']['Center'].to_yaml
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ p = lab.crossvalidate
+ p.cc
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab.crossvalidate.cc
+ lab.crossvalidate.rmsd
+ lab.C
+ lab.epsilon
+ exit
+ exit
+ exit
@@ -0,0 +1,122 @@
+ require 'tempfile'
+ # This is a class that IS the path of a file with LIBSVM data.
+ # It can do some more tricks beyond just being a path.
+ # However, all those tricks are DESTRUCTIVE so be careful.
+ # It is also not a good class if you want to keep LIBSVM's sparse
+ # data format. This class will fill everything missing with zeros
+ # upon creation.
+ class LIBSVMdata < String
+
+   attr_reader :data, :path
+
+   def initialize(path, readonly = true, classborder = nil)
+     @readonly = readonly
+     @data = open(path).read
+     @classborder = classborder
+     if @readonly
+       super(path)
+     else
+       # Work on a temporary copy so the original file is never touched
+       @file = Tempfile.new('libsvmdata')
+       super(@file.path)
+       self.fillSparseData!
+     end
+   end
+
+   # Number of columns (prediction + features) on the first line
+   def dim
+     @dim ||= @data.split("\n").first.split.size
+   end
+
+   def nexamples
+     @nexamples ||= @data.split("\n").size
+   end
+
+   # Write the current data back to the file this object points to
+   def update
+     return if @readonly
+     @dim = nil
+     @nexamples = nil
+     open(self,'w') do |f|
+       f.puts @data
+     end
+     self
+   end
+
+   # Add vector[0] to the prediction column (unless classification)
+   # and vector[i] to feature i of every example
+   def translate!(vector)
+     return if @readonly
+     @data =
+       @data.split("\n").map { |line|
+         pred,*feats = line.split
+         (@classborder ? "#{pred} " : "#{(pred.to_f+vector[0])} ") +
+           feats.map{|feat|
+             key,val = feat.split(':')
+             "#{key}:#{val.to_f+vector[key.to_i]}"}.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+   # Multiply the prediction column (unless classification) by vector[0]
+   # and feature i of every example by vector[i]
+   def scale!(vector)
+     return if @readonly
+     @data =
+       @data.split("\n").map { |line|
+         pred,*feats = line.split
+         (@classborder ? "#{pred} " : "#{(pred.to_f*vector[0])} ") +
+           feats.map{|feat|
+             key,val = feat.split(':')
+             "#{key}:#{val.to_f*vector[key.to_i]}"}.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+   # Column means: index 0 is the prediction column, index i (i >= 1)
+   # is feature i
+   def mean
+     sum = Array.new(dim){0}
+     sum =
+       @data.split("\n").inject(sum) { |s,line|
+         pred,*feats = line.split
+         if !@classborder then s[0] += pred.to_f end
+         fs = s[1..-1].zip(feats).map{|si,fi|
+           si += fi.split(':').last.to_f}
+         #puts s; gets
+         [s[0]] + fs }
+     sum.map{|i| i / nexamples.to_f}
+   end
+
+   # Column-wise maximum absolute values, indexed as in 'mean'
+   def absmax
+     max = Array.new(dim){0}
+     max =
+       @data.split("\n").inject(max) { |s,line|
+         pred,*feats = line.split
+         s[0] = [s[0],pred.to_f.abs].max
+         fs = s[1..-1].zip(feats).map{|si,fi|
+           [si, fi.split(':').last.to_f.abs].max}
+         #puts s; gets
+         [s[0]] + fs }
+   end
+
+   # Expand LIBSVM's sparse format: every feature index from 1 up to the
+   # highest index found gets an explicit entry, missing ones as zeros.
+   # If a classborder is given, predictions are turned into -1/1 labels.
+   def fillSparseData!
+     return if @readonly
+     ndim = @data.split("\n").map{|line|
+       if (arr=line.split).size>1
+         arr.last.split(':').first.to_i
+       else 0 end
+     }.max
+     @data =
+       @data.split("\n").map{|line|
+         pred,*feats = line.split
+         if @classborder
+           pred = (pred.to_f>=@classborder) ? '1' : '-1'
+         end
+         counter = 0
+         pred + ' ' +
+           (1..ndim).map { |index|
+             if (element = feats[counter]) and
+                (element.split(':').first.to_i == index)
+               counter+=1
+               "#{index}:#{element.split(':').last}"
+             else
+               "#{index}:0"
+             end
+           }.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+ end
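+
+ # A minimal usage sketch (illustrative; 'train.data' is a hypothetical
+ # LIBSVM-format file). The destructive methods only touch the temporary
+ # copy, never the original file:
+ #
+ #   data = LIBSVMdata.new('train.data', false)  # writable temp copy
+ #   data.translate!(data.mean.map { |m| -m })   # center every column on zero
+ #   puts data                                   # path of the rescaled temp file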