svmlab 1.0.0

--------------------------------------------------------------------------------
Contents:
1) Install
2) Configuration
3) Feature methods


--------------------------------------------------------------------------------
1) HOW TO INSTALL
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
2) CONFIGURATION
--------------------------------------------------------------------------------

------------------------------------------------------------
CONFIGURATION OF SUPPORT VECTOR MACHINE
------------------------------------------------------------
SVM:
  C: 1
  e: 0.3
  g: 0.05
  Optimization:
    Method: patternsearch
    Nhalf: 2
  Scale:
    Default: std
---------------------------------------------------------------------------------
The configuration under the 'SVM' entry has three parts, as seen above.
1) Parameters to the SVM algorithm.
   Possible entries are:
   C : The penalty factor for misclassification.
   g : Gamma for the RBF kernel exp(-gamma * |x-y|^2)
   e : Epsilon for epsilon-tube regression
2) Optimization configuration.
   Any parameter given in the configuration may be optimized. The
   optimization method is configured here.
   Method : Name of the optimization method to be used. Currently only the
            patternsearch method is implemented.
   Nhalf  : How many times to halve the patternsearch step size before
            giving up the optimization. It is a trade-off between
            execution time and granularity of the optimization. A
            reasonable value for Nhalf is 2.
   ----------------------------------------------------------------------
   | For a parameter to be optimized, it needs to have optimization
   | instructions added to its setting. Pattern search instructions may
   | be added to any numerical parameter by appending text with this
   | syntax:
   |   step exp<stepsize>
   | OR
   |   step <stepsize>
   | which should appear AFTER the value of the parameter to be
   | optimized. If 'step exp1.0' is given, the pattern search uses an
   | initial step size of 1 on the base-10 logarithmic scale, i.e. it
   | tries values at 0.1 and 10 times the original value. Most
   | parameters (including the SVM parameters) are best optimized in the
   | log scale. For example, to optimize SVM parameters we may write
   |   C: 1 step exp1.0
   |   g: 0.05 step exp1.0
   -------------------------------------------------------------------
3) Normalization factors for features.
   The SVM does best when features are normalized in one way or another.
   The standard way of normalizing is to calculate the so-called z-score
   for each feature (subtracting the mean value and dividing by the
   standard deviation), but depending on the feature, other scalings may
   be better. All scalings described below are applied after each feature
   has been centered around zero (by subtracting the feature's mean value
   on the given dataset). For the scaling after centering, there are
   several possibilities:
   A) No Scale entry at all in the configuration.
      All features are given to the SVM without rescaling.
   B) Only Default given.
      All features are scaled according to the scale factor given
      for Default.
   C) Scaling for specific features is also given.
      The specific scaling factor is applied to the features given.
      All other features get the Default scaling factor (or 1 if no
      Default is given). A specific feature scaling is given in one of
      the following syntaxes:
        featurename: scaling factor
      OR
        featurename:
          index1: scaling factor
          ...
          indexN: scaling factor
      Note that 'index' refers to an index into a multidimensional
      feature, counting from 0. Also note that if the scaling is given by
      the first syntax, it is applied to all dimensions of the feature.
   A scaling factor can be one of the following:
   max  : The scaling factor is set so that the maximum absolute value of
          the (centered) feature is 1 on the given dataset.
   std  : The scaling factor is set so that the standard deviation is 1.
          This is the most common normalization of features.
   <num>: A specific numeric value.
   Note that optimization instructions may also be appended to scaling
   factors (an extreme example would be to add them to the Default
   scaling factor, which gives a lengthy optimization calculation and
   probably an "overtraining" of how to scale features). An example
   follows below.
---------------------------------------------------------------------------------
------------------------------------------------------------
CONFIGURATION OF FEATURES
------------------------------------------------------------
Feature:
  Features:
    - <targetfeature>
    - <feature1>
    - <feature2>
    ...
    - <featureN>
  BaseDir: <base directory>
  DataSet: <file>
           OR
           <file prefix>
  Groups: <range (n1..n2)>
          OR
          <file prefix>
  Methods: <path of Ruby code file>
  <featureX>:
    HomeDir: <home directory>
    Method: <method name>
    Dimensions: <number of dimensions in this feature>
    <Further specific configuration for this feature>
  ...

-----------------------------------------------------------------------------------
| More detailed descriptions of the entries in the feature
| configuration.
| ---------------------------------------------------------------------------------
| Features  : An array giving all features to be used. The first feature
|             in the array is the feature to be predicted from the other
|             features.
| BaseDir   : A directory from which all other directory names are derived.
| DataSet   : A file or a file prefix (e.g. 'dataset' refers to dataset1,
|             dataset2, etc). The file(s) contain the names of all samples in
|             the dataset used to build an SVM model in SVMLab. The file(s)
|             should be simple lists of sample names, each name on its own
|             line. Sample names should not contain blank spaces.
|             Example : ?
| Groups    : Groups made for cross-validation.
|             If Groups is not set, leave-one-out cross-validation is used.
|             Two possibilities:
|             A) The range (n1..n2) groups together all samples whose names'
|                substrings indexed by (n1..n2) correspond. E.g. the range
|                (0..2) would group together 'tommy' and 'tomas' while putting
|                'tony' in a separate group.
|             B) A file prefix; a group is made from the sample names found
|                in each single file. See DataSet above.
| Methods   : The path to a file containing Ruby code that implements the
|             methods used for feature calculation (see the specific feature
|             configurations below).
| <feature> : Each feature given in the array Features (see above) needs its
|             own configuration. The following entries may be set for each
|             feature (a hypothetical example follows after this box):
|             HomeDir    : May be set to an absolute path or relative to
|                          BaseDir (see above). If HomeDir is not given, it
|                          is set to BaseDir/featurename.
|             Method     : The function that is called when the feature is
|                          requested. If Method is not given, an attempt is
|                          made to acquire the feature from the database. If
|                          that fails, ERROR is reported.
|             Dimensions : The number of dimensions for this feature. If
|                          Dimensions is not given, it is assumed to be 1
|                          and only the first value for each example is
|                          used.
|             <other>    : You may wish to include more configuration for
|                          the feature, and you may add it here. The entire
|                          feature configuration is sent to the feature
|                          method (see Method above), so this is a good
|                          place to put parameters for feature calculations.
|                          It is also possible to add pattern search
|                          instructions here, for parameters that are up
|                          for optimization.
-----------------------------------------------------------------------------------
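
For illustration, here is a hypothetical feature configuration following
the template above (all names and values are invented):

Feature:
  Features:
    - ddG
    - hydrophobicity
    - contacts
  BaseDir: /home/fred/DeltaDeltaG
  DataSet: dataset
  Groups: (0..3)
  Methods: mymethods.rb
  contacts:
    Method: calc_contacts
    Dimensions: 2
    Cutoff: 8.0 step exp1.0

Here 'ddG' is the target feature to be predicted, samples whose names
share their first four characters are grouped together for
cross-validation, and 'contacts' is computed by the Ruby method
calc_contacts (defined in mymethods.rb) with a feature-specific
parameter Cutoff that is up for pattern search optimization.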

--------------------------------------------------------------------------------
3) FEATURE METHODS
--------------------------------------------------------------------------------
As a user of SVMLab you should not have to worry about preparing data for the
SVM experiment. What you should care about instead is implementing methods for
calculating or retrieving features. Unfortunately, all methods have to be
implemented in Ruby, since SVMLab is written in Ruby and executed in the Ruby
environment. Ruby is, however, not a very complicated language, and you may
also use it merely to call a program implemented in any other language by
letting the Ruby method make a call to the terminal shell (see below for an
example).
As you can see in the configuration section, you have to configure a method for
each feature that you use. This section is about how to implement the method
itself. There are only two rules for this: the format of the input and the
format of the output.
1) Input
SVMLab calls each method with the syntax
  method(Hash cfg, String samplename)
It is up to the method that you implement to know how to interpret the
samplename and use the configuration in order to answer the request for a
feature value.
2) Output
This should be given in the format
  {samplename => String answer,
   othersample => String otheranswer,
   ...}
Most important here is that the answer is given in the form of a Hash in which
one key points to the answer for the sample that the feature was requested
for. Your method may, however, include other keys in the Hash, giving answers
for other samples, if this benefits you. For example, if the method is a
lengthy calculation that would have to be repeated for other samples, it is
convenient to hand all answers produced by the calculation over to SVMLab.
This way, SVMLab only needs to consult its database the next time a feature
value requiring the same calculation is needed.
Also note that each answer should be given as a String. While the SVM expects
numeric features, this lets the method give a non-numeric answer, which
SVMLab interprets as an error message. If the feature is multidimensional,
all numeric values in the String should be separated by spaces.

Example of a method

This is an example of a case where you already have an executable that you
call in the terminal by 'myexecutable -a myargument':
  def mymethod(cfg, sample)
    ans = `myexecutable -a #{cfg['myargument']}`
    {sample => ans}
  end
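
Similarly, here is a sketch of a method that computes answers for all samples
in one pass and returns them together (the executable, its flags, and the
'Cutoff' entry are hypothetical):
  def batchmethod(cfg, sample)
    # Assumed: the program prints one 'samplename value' pair per line
    output = `myexecutable --all --cutoff #{cfg['Cutoff']}`
    ans = {}
    output.each_line do |line|
      name, value = line.split
      ans[name] = value   # answers stay Strings, as SVMLab expects
    end
    ans
  end
SVMLab stores every key of the returned Hash in its database, so the lengthy
calculation does not have to be repeated for the other samples.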
--------------------------------------------------------------------------------
#require 'narray'

class Array
  def sum
    self.inject(0) { |s,i| s + i }
  end

  def mean
    self.sum.to_f / self.size
  end

  #def median
  #  NArray.to_na(self).median
  #end

  # Returns the values at the 'percent' and '100-percent' percentiles of
  # the sorted array; percent == 0 gives [min, max].
  def percentile(percent)
    if percent == 0
      [self.min, self.max]
    else
      sorted = self.sort
      [ sorted[(percent/100.0*self.size).round],
        sorted[((100-percent)/100.0*self.size).round] ]
    end
  end

  def stddev
    Math.sqrt( self.variance )
  end

  # Mean square deviation from mean
  def variance
    m = self.mean
    sum = self.inject(0) do |s,i|
      s + (i-m)**2
    end
    sum / self.size.to_f
  end

  # Scalar product. NOTE: this redefines Array#* (normally repetition/join).
  def *(arg)
    if self.size != arg.size
      raise "Not equal sizes in scalar product"
    end
    self.zip(arg).inject(0.0) do |s,(i,j)|
      s + i * j
    end
  end

  def to_i
    self.map{|i| i.to_i}
  end

  def to_f
    self.map{|i| i.to_f}
  end
end

# Pearson correlation coefficient: (E[xy] - E[x]E[y]) / (sx * sy)
def correlation(x, y)
  xymean = (x * y) / x.size.to_f
  sx = x.stddev
  sy = y.stddev
  (xymean - x.mean * y.mean) / (sx * sy)
end

class Hash
  # Collects the values for all keys present in both hashes into two
  # parallel arrays of Floats.
  def to_xy(other)
    raise "Please give Hash as argument" if !other.is_a? Hash
    arr = self.keys.inject([]) do |a,k|
      if other[k] and self[k]
        a << [self[k].to_f, other[k].to_f]
      end
      a
    end
    x = arr.map{|i| i[0]}
    y = arr.map{|i| i[1]}
    [ x, y ]
  end

  # Correlation coefficient between the common entries of two hashes,
  # or nil if they share no keys.
  def cc(other)
    x, y = to_xy(other)
    if x.size > 0
      correlation(x,y)
    else
      nil
    end
  end
end
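
# Example usage (values invented): correlation between two result sets
# keyed by sample name.
#   pred = { 'tommy' => '1.2', 'tomas' => '0.3', 'tony' => '-0.5' }
#   meas = { 'tommy' => '1.0', 'tomas' => '0.5', 'tony' => '-0.2' }
#   pred.cc(meas)   # => Pearson correlation over the shared keys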
--------------------------------------------------------------------------------
p.f1
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
p = lab.crossvalidate
p.cc
p.rmsd
p.f1 2
p.f1
p.f1 1
p.f1 1:3
p.f1 3
p.f1 4
p.f1 5
p.f1 0
p.plot
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
p = lab.crossvalidate
p.plot
help f1
help p.f1
pm p
class Fhash < Hash
  def to_s
    "I'm a Fhash!"
  end
end
f = Fhash.new
puts f
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
exit
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
a |= 1
a
a = 1
a |= 1
a
a =| 1
a &= 1
a
a &= 2
a
b &= 2
b
a ||= 2
a = nil
a ||= 2
a
a ||= 3
a
a = nil
a ||= []
a
p a
a.size
exit
lab = SVMLab.new('testcfg')
lab['SVM']['Center'].to_yaml
lab.cfg['SVM']['Center'].to_yaml
puts lab.cfg['SVM']['Center'].to_yaml
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
exit
lab = SVMLab.new('testcfg')
p = lab.crossvalidate
p.cc
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
lab.crossvalidate.cc
lab.crossvalidate.rmsd
lab.C
lab.epsilon
exit
exit
exit
--------------------------------------------------------------------------------
require 'tempfile'

# This is a class that IS the path of a file with LIBSVM data.
# It can do some more tricks beyond just being a path.
# However, all those tricks are DESTRUCTIVE so be careful.
# It is also not a good class if you want to keep LIBSVM's sparse
# data format. This class will fill everything missing with zeros
# upon creation.
class LIBSVMdata < String

  attr_reader :data, :path

  def initialize(path, readonly = true, classborder = nil)
    @readonly = readonly
    @path = path
    @data = File.read(path)
    @classborder = classborder
    if @readonly
      super(path)
    else
      # Work on a temp copy so the original file is never modified
      @file = Tempfile.new('libsvmdata')
      super(@file.path)
      self.fillSparseData!
    end
  end

  # Number of columns per line: the prediction plus one per feature
  def dim
    @dim ||= @data.split("\n").first.split.size
  end

  def nexamples
    @nexamples ||= @data.split("\n").size
  end

  # Write @data back to the file and invalidate the cached sizes
  def update
    return if @readonly
    @dim = nil
    @nexamples = nil
    open(self,'w') do |f|
      f.puts @data
    end
    self
  end

  # Add vector[0] to every prediction (unless classborder is set) and
  # vector[i] to every value of feature i.
  def translate!(vector)
    return if @readonly
    @data =
      @data.split("\n").map { |line|
        pred,*feats = line.split
        (@classborder ? "#{pred} " : "#{(pred.to_f+vector[0])} ") +
          feats.map{|feat|
            key,val = feat.split(':')
            "#{key}:#{val.to_f+vector[key.to_i]}"}.join(' ') + "\n"
      }.join
    self.update
  end

  # Multiply every prediction by vector[0] (unless classborder is set)
  # and every value of feature i by vector[i].
  def scale!(vector)
    return if @readonly
    @data =
      @data.split("\n").map { |line|
        pred,*feats = line.split
        (@classborder ? "#{pred} " : "#{(pred.to_f*vector[0])} ") +
          feats.map{|feat|
            key,val = feat.split(':')
            "#{key}:#{val.to_f*vector[key.to_i]}"}.join(' ') + "\n"
      }.join
    self.update
  end

  # Column means: index 0 is the prediction, the rest are the features.
  # Assumes dense data (see fillSparseData!).
  def mean
    sum = Array.new(dim){0}
    sum =
      @data.split("\n").inject(sum) { |s,line|
        pred,*feats = line.split
        if !@classborder then s[0] += pred.to_f end
        fs = s[1..-1].zip(feats).map{|si,fi|
          si + fi.split(':').last.to_f}
        [s[0]] + fs }
    sum.map{|i| i / nexamples.to_f}
  end

  # Column-wise maximum absolute values, in the same layout as mean
  def absmax
    max = Array.new(dim){0}
    @data.split("\n").inject(max) { |s,line|
      pred,*feats = line.split
      s[0] = [s[0],pred.to_f.abs].max
      fs = s[1..-1].zip(feats).map{|si,fi|
        [si, fi.split(':').last.to_f.abs].max}
      [s[0]] + fs }
  end

  # Expand LIBSVM's sparse format so that every line carries every
  # feature index, with missing values filled in as zeros. If
  # classborder is set, predictions are binarized to 1/-1 around it.
  def fillSparseData!
    return if @readonly
    ndim = @data.split("\n").map{|line|
      if (arr=line.split).size>1
        arr.last.split(':').first.to_i
      else 0 end
    }.max
    @data =
      @data.split("\n").map{|line|
        pred,*feats = line.split
        if @classborder
          pred = (pred.to_f>=@classborder) ? '1' : '-1'
        end
        counter = 0
        pred + ' ' +
          (1..ndim).map { |index|
            if (element = feats[counter]) and
               (element.split(':').first.to_i == index)
              counter+=1
              "#{index}:#{element.split(':').last}"
            else
              "#{index}:0"
            end
          }.join(' ') + "\n"
      }.join
    self.update
  end

end
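
# Example usage (file name invented): load a writable copy of a LIBSVM
# file, center every column on zero, rescale to a maximum absolute value
# of 1, then hand the temp-file path (the object itself) to an SVM tool.
#
#   data = LIBSVMdata.new('train.libsvm', false)
#   data.translate!(data.mean.map { |m| -m })
#   data.scale!(data.absmax.map { |v| v.zero? ? 1.0 : 1.0 / v })
#   system("svm-train #{data}")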