svmlab 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,219 @@
+ Contents:
+ 1) Install
+ 2) Configuration
+ 3) Feature methods
+
+
+ --------------------------------------------------------------------------------
+ 1) HOW TO INSTALL
+ --------------------------------------------------------------------------------
+
+ --------------------------------------------------------------------------------
+ 2) CONFIGURATION
+ --------------------------------------------------------------------------------
+
+ ------------------------------------------------------------
+ CONFIGURATION OF SUPPORT VECTOR MACHINE
+ ------------------------------------------------------------
+ SVM:
+   C: 1
+   e: 0.3
+   g: 0.05
+   Optimization:
+     Method: patternsearch
+     Nhalf: 2
+   Scale:
+     Default: std
+ ---------------------------------------------------------------------------------
+ The configuration under the 'SVM' entry has three parts, as seen above.
+ 1) Parameters to the SVM algorithm.
+    Possible entries are:
+    C : The penalty factor for misclassification.
+    g : Gamma for the RBF kernel exp(- gamma * |x-y|^2)
+    e : Epsilon for epsilon-tube regression
+ 2) Optimization configuration.
+    Any parameter given in the configuration may be optimized. The
+    optimization method is configured here.
+    Method : Name of the optimization method to be used. Currently only the
+             patternsearch method is implemented.
+    Nhalf : How many times to halve the patternsearch step size before
+            giving up the optimization. It is a trade-off between
+            execution time and detail of optimization. A reasonable value
+            for Nhalf is 2.
+    ----------------------------------------------------------------------
+    | For a parameter to be optimized, it needs to have optimization
+    | instructions added to its setting. Pattern search instructions may
+    | be added to any numerical parameter by appending text following
+    | this syntax:
+    |   step exp<stepsize>
+    | OR
+    |   step <stepsize>
+    | which should appear AFTER any parameter in the configuration to be
+    | optimized. If 'step exp1.0' is given, it means that the pattern
+    | search should use an initial step size of 1 in the 10-logarithmic
+    | scale, i.e. it tries values at 0.1 and 10 times the original
+    | value. Most parameters (including the SVM parameters) are best
+    | optimized in the log scale. For example, to optimize SVM parameters
+    | we may write
+    |   C: 1 step exp1.0
+    |   g: 0.05 step exp1.0
+    -------------------------------------------------------------------
+ 3) Normalization factors for features.
+    The SVM does best when features are normalized in one way or another.
+    The standard way of normalizing is to calculate the so-called z-score
+    for each feature (subtracting the mean value and dividing by the
+    standard deviation), but depending on the feature, other scalings may
+    be better. All scalings described below are applied after each feature
+    has been centered around zero (by subtracting the mean value of the
+    feature, calculated on the dataset given). For scaling after the
+    centering, there are several possibilities:
+    A) No Scale entry at all in the configuration.
+       All features will be given to the SVM without rescaling.
+    B) Only Default given.
+       All features will be scaled according to the scale factor given
+       for Default.
+    C) Specific features' scaling is also given.
+       A specific scaling factor will be applied to the features given.
+       All other features will have the Default scaling factor (or 1 if no
+       Default is given). A specific feature scaling is given in one of the
+       following syntaxes:
+         featurename: scaling factor
+       OR
+         featurename:
+           index1: scaling factor
+           ...
+           indexN: scaling factor
+       Note that 'index' refers to the index in a multidimensional feature,
+       counting from 0. Also note that if scaling is given by the first
+       syntax, the scaling will be applied to all dimensions of the
+       feature.
+    A scaling factor can be one of the following:
+    max : Scaling factor is set so that the max (absolute) value of the
+          (centered) feature is 1 on the dataset given.
+    std : Scaling factor is set so that the standard deviation is 1.
+          This is the most common normalization of features.
+    <num>: A specific numeric value.
+    Note that optimization instructions may also be appended to scaling
+    factors (an extreme example would be to add them to the Default scaling
+    factor, which gives a lengthy optimization calculation and probably
+    "overtraining" on how to scale features).
+ ---------------------------------------------------------------------------------
+ ------------------------------------------------------------
+ CONFIGURATION OF FEATURES
+ ------------------------------------------------------------
+ Feature:
+   Features:
+     - <targetfeature>
+     - <feature1>
+     - <feature2>
+     ...
+     - <featureN>
+   BaseDir: <base directory>
+   DataSet: <file>
+            OR
+            <file prefix>
+   Groups: <range (n1..n2)>
+           OR
+           <file prefix>
+   Methods: <path of Ruby code file>
+   <featureX>:
+     HomeDir: <home directory>
+     Method: <method name>
+     Dimensions: <number of dimensions in this feature>
+     <Further specific configuration for this feature>
+   ...
+
+ -----------------------------------------------------------------------------------
+ | A more detailed description of the configuration entries for the
+ | feature configuration follows.
+ | ---------------------------------------------------------------------------------
+ | Features : This is an array giving all features to be used. The first feature
+ |            in the array is the feature to be predicted from the other features.
+ | BaseDir : A directory from which all other directory names are derived.
+ | DataSet : A file or a file prefix (such as 'dataset', which points to dataset1,
+ |           dataset2, etc.). The file(s) contain the names of all samples in
+ |           the dataset used to build an SVM model in SVMLab. The file(s)
+ |           should be simple lists of sample names, each name on its own
+ |           line. Sample names should be without blank spaces.
+ |           Example : ?
+ | Groups : These are groups made for cross-validation.
+ |          If Groups is not set, leave-one-out cross-validation will be used.
+ |          Two possibilities:
+ |          A) The range (n1..n2) groups together all samples whose names'
+ |             substrings indexed by (n1..n2) correspond. E.g. the range
+ |             (0..2) would group together 'tommy' and 'tomas' while putting
+ |             'tony' in a separate group.
+ |          B) A file prefix, where a group is made from the sample names found
+ |             in each single file. See DataSet above.
+ | Methods : The path to a file containing Ruby code implementing methods used
+ |           for feature calculation (see specific feature configurations
+ |           below).
+ | <feature> : Each feature given in the array Features (see above) needs its
+ |             own configuration. The following entries should be set for all
+ |             features:
+ |             HomeDir : This may be set to an absolute path or relative to
+ |                       BaseDir (see above). If HomeDir is not given, it will
+ |                       be set to BaseDir/featurename.
+ |             Method : The function that is called when the feature is
+ |                      requested. If Method is not given, an attempt is made
+ |                      to acquire the feature from the database. If that
+ |                      fails, ERROR is reported.
+ |             Dimensions : The number of dimensions for this feature. If
+ |                          Dimensions is not given, it will be assumed to be 1
+ |                          and only the first value for each example will be
+ |                          used.
+ |             <other> : You may wish to include more configuration entries
+ |                       for the feature, and you may add them here. The
+ |                       entire feature configuration is sent to the feature
+ |                       method (see Method above), so this is a good place
+ |                       to put parameters for feature calculations. It is
+ |                       also possible to add pattern search instructions
+ |                       here, for parameters that are up for optimization.
+ -----------------------------------------------------------------------------------
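+ As an illustrative sketch (all feature names and paths below are
+ hypothetical), a complete Feature section could read:
+   Feature:
+     Features:
+       - ddG
+       - hydrophobicity
+     BaseDir: /home/user/experiment
+     DataSet: dataset
+     Groups: (0..3)
+     Methods: /home/user/experiment/methods.rb
+     ddG:
+       Dimensions: 1
+     hydrophobicity:
+       Method: calc_hydrophobicity
+       Dimensions: 2
+       Window: 5 step exp1.0
+ Here 'ddG' is the target feature (acquired from the database, since no
+ Method is given) and 'Window' is an example of a method-specific parameter
+ that is also opened up for pattern search optimization.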
+
+ --------------------------------------------------------------------------------
+ 3) FEATURE METHODS
+ --------------------------------------------------------------------------------
+ As a user of SVMLab you should not have to worry about preparing data for the
+ SVM experiment. What you should care about instead is implementing methods for
+ calculating or retrieving features. Unfortunately, all methods have to be
+ implemented in Ruby, since SVMLab is written in Ruby and executed in the Ruby
+ environment. Ruby is, however, not a very complicated language, and you may
+ also use it merely to call a program implemented in any language by letting
+ the Ruby method make a call to the terminal shell (see below for an example).
+ As you can see in the configuration section, you have to configure a method for
+ each feature that you use. This section is about how to implement the method
+ itself. There are only two rules for this - the format of the input and the
+ format of the output.
+ 1) Input
+ SVMLab calls each method with the syntax
+   method(Hash cfg, String samplename)
+ It is up to the method that you implement to know how to interpret the
+ samplename and use the configuration in order to answer the request for a
+ feature value.
+ 2) Output
+ This should be given in the format
+   {samplename => String answer,
+    othersample => String otheranswer,
+    ...}
+ Most important here is that the answer should be given in the form of a Hash
+ where one key points to the answer for the sample that the feature was
+ requested for. Your method may, however, include other keys in the Hash giving
+ answers for other samples if this benefits you. For example, if the method is
+ a lengthy calculation that would have to be repeated for other samples, it is
+ convenient to hand all answers produced by the calculation over to SVMLab at
+ once. This way, SVMLab only needs to consult its database the next time a
+ feature value requiring the same calculation is needed.
+ Also note that the answers should be given as Strings. While the SVM expects
+ numeric features, this lets the method give a non-numeric answer, which SVMLab
+ interprets as an error message. If the feature is multidimensional, all
+ numeric values should be separated by spaces in the String.
+
+ Example of a method
+
+ This is an example of a case where you already have an executable that you call
+ in the terminal by 'myexecutable -a myargument' :
+   def mymethod(cfg, sample)
+     ans = `myexecutable -a #{cfg['myargument']}`
+     {sample => ans}
+   end
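+ If one run of your calculation produces values for many samples at once, a
+ method following this sketch (the configuration entry 'ResultFile' and the
+ file format are hypothetical) can return them all in one go; note how a
+ two-dimensional feature is given as space-separated numbers:
+   def batchmethod(cfg, sample)
+     ans = {}
+     # Each line of the result file: <samplename> <value1> <value2>
+     open(cfg['ResultFile']).each_line do |line|
+       name, v1, v2 = line.split
+       ans[name] = "#{v1} #{v2}"
+     end
+     ans
+   end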
@@ -0,0 +1,87 @@
+ #require 'narray'
+
+ class Array
+   def sum
+     self.inject(0) { |s,i| s + i }
+   end
+
+   def mean
+     self.sum.to_f / self.size
+   end
+
+   #def median
+   #  NArray.to_na(self).median
+   #end
+
+   # Returns [lower, upper] values cutting off 'percent' % at each end
+   # (percent == 0 gives the plain [min, max]).
+   def percentile(percent)
+     if percent == 0
+       [self.min, self.max]
+     else
+       sorted = self.sort
+       [ sorted[(percent/100.0*self.size).round],
+         sorted[((100-percent)/100.0*self.size).round] ]
+     end
+   end
+
+   def stddev
+     Math.sqrt( self.variance )
+   end
+
+   # Mean square deviation from the mean
+   def variance
+     m = self.mean
+     sum = self.inject(0) do |s,i|
+       s + (i-m)**2
+     end
+     sum / self.size.to_f
+   end
+
+   # Scalar product
+   def *(arg)
+     if self.size != arg.size
+       raise "Not equal sizes in scalar product"
+     end
+     self.zip(arg).inject(0.0) do |s,(i,j)|
+       s + i * j
+     end
+   end
+
+   # Element-wise conversion to integers
+   def to_i
+     self.map{|i| i.to_i}
+   end
+
+   # Element-wise conversion to floats
+   def to_f
+     self.map{|i| i.to_f}
+   end
+ end
+
+ # Pearson correlation coefficient of two equally sized arrays
+ def correlation(x, y)
+   xymean = (x * y) / x.size.to_f
+   sx = x.stddev
+   sy = y.stddev
+   (xymean - x.mean * y.mean) / (sx * sy)
+ end
+
+ class Hash
+   # Pairs up self[k] and other[k] for all keys present in both hashes,
+   # returning the two value arrays [x, y] as floats.
+   def to_xy(other)
+     raise "Please give Hash as argument" if !other.is_a? Hash
+     arr = self.keys.inject([]) do |a,k|
+       if other[k] and self[k]
+         a << [self[k].to_f, other[k].to_f]
+       end
+       a
+     end
+     x = arr.map{|i| i[0]}
+     y = arr.map{|i| i[1]}
+     [ x, y ]
+   end
+
+   # Correlation coefficient over the keys shared with 'other',
+   # or nil if there is no overlap.
+   def cc(other)
+     x, y = to_xy(other)
+     #puts "CC : N = #{x.size}"
+     if x.size > 0
+       correlation(x,y)
+     else
+       nil
+     end
+   end
+ end
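+
+ # A minimal usage sketch of the helpers above (illustrative values):
+ #   x = [1.0, 2.0, 3.0, 4.0]
+ #   y = [2.1, 3.9, 6.2, 8.1]
+ #   x.mean              # => 2.5
+ #   correlation(x, y)   # close to 1.0 for this nearly linear data
+ #   a = {'s1' => '1.0', 's2' => '2.0', 's3' => '3.0'}
+ #   b = {'s1' => '2.0', 's3' => '6.1'}
+ #   a.cc(b)             # correlation over the shared keys s1 and s3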
@@ -0,0 +1,100 @@
+ p.f1
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ p = lab.crossvalidate
+ p.cc
+ p.rmsd
+ p.f1 2
+ p.f1
+ p.f1 1
+ p.f1 1:3
+ p.f1 3
+ p.f1 4
+ p.f1 5
+ p.f1 0
+ p.plot
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ p = lab.crossvalidate
+ p.plot
+ help f1
+ help p.f1
+ pm p
+ class Fhash < Hash
+ def to_s
+ "I'm a Fhash!"
+ end
+ end
+ f = Fhash.new
+ puts f
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ exit
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ a |= 1
+ a
+ a = 1
+ a |= 1
+ a
+ a =| 1
+ a &= 1
+ a
+ a &= 2
+ a
+ b &= 2
+ b
+ a ||= 2
+ a = nil
+ a ||= 2
+ a
+ a ||= 3
+ a
+ a = nil
+ a ||= []
+ a
+ p a
+ a.size
+ exit
+ lab = SVMLab.new('testcfg')
+ lab['SVM']['Center'].to_yaml
+ lab.cfg['SVM']['Center'].to_yaml
+ puts lab.cfg['SVM']['Center'].to_yaml
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ exit
+ lab = SVMLab.new('testcfg')
+ p = lab.crossvalidate
+ p.cc
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab = SVMLab.new('/home/fred/DeltaDeltaG/cfg/final.cfg')
+ lab.crossvalidate.cc
+ lab.crossvalidate.rmsd
+ lab.C
+ lab.epsilon
+ exit
+ exit
+ exit
@@ -0,0 +1,122 @@
+ require 'tempfile'
+ # This is a class that IS the path of a file with LIBSVM data.
+ # It can do some more tricks beyond just being a path.
+ # However, all those tricks are DESTRUCTIVE so be careful.
+ # It is also not a good class if you want to keep LIBSVM's sparse
+ # data format. This class will fill everything missing with zeros
+ # upon creation.
+ class LIBSVMdata < String
+
+   attr_reader :data, :path
+
+   def initialize(path, readonly = true, classborder = nil)
+     @readonly = readonly
+     @data = open(path).read
+     @classborder = classborder
+     if @readonly
+       super(path)
+     else
+       # Work on a temporary copy so the original file is never touched
+       @file = Tempfile.new('libsvmdata')
+       super(@file.path)
+       self.fillSparseData!
+     end
+   end
+
+   # Number of columns (prediction + features) on the first line
+   def dim
+     @dim ||= @data.split("\n").first.split.size
+   end
+
+   def nexamples
+     @nexamples ||= @data.split("\n").size
+   end
+
+   # Write the current data back to the file this object points to
+   def update
+     return if @readonly
+     @dim = nil
+     @nexamples = nil
+     open(self,'w') do |f|
+       f.puts @data
+     end
+     self
+   end
+
+   # Add vector[0] to the prediction column (unless classification)
+   # and vector[i] to feature i of every example
+   def translate!(vector)
+     return if @readonly
+     @data =
+       @data.split("\n").map { |line|
+         pred,*feats = line.split
+         (@classborder ? "#{pred} " : "#{(pred.to_f+vector[0])} ") +
+           feats.map{|feat|
+             key,val = feat.split(':')
+             "#{key}:#{val.to_f+vector[key.to_i]}"}.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+   # Multiply the prediction column (unless classification) by vector[0]
+   # and feature i of every example by vector[i]
+   def scale!(vector)
+     return if @readonly
+     @data =
+       @data.split("\n").map { |line|
+         pred,*feats = line.split
+         (@classborder ? "#{pred} " : "#{(pred.to_f*vector[0])} ") +
+           feats.map{|feat|
+             key,val = feat.split(':')
+             "#{key}:#{val.to_f*vector[key.to_i]}"}.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+   # Column means: index 0 is the prediction column, index i (i >= 1)
+   # is feature i
+   def mean
+     sum = Array.new(dim){0}
+     sum =
+       @data.split("\n").inject(sum) { |s,line|
+         pred,*feats = line.split
+         if !@classborder then s[0] += pred.to_f end
+         fs = s[1..-1].zip(feats).map{|si,fi|
+           si += fi.split(':').last.to_f}
+         #puts s; gets
+         [s[0]] + fs }
+     sum.map{|i| i / nexamples.to_f}
+   end
+
+   # Column-wise maximum absolute values, indexed as in 'mean'
+   def absmax
+     max = Array.new(dim){0}
+     max =
+       @data.split("\n").inject(max) { |s,line|
+         pred,*feats = line.split
+         s[0] = [s[0],pred.to_f.abs].max
+         fs = s[1..-1].zip(feats).map{|si,fi|
+           [si, fi.split(':').last.to_f.abs].max}
+         #puts s; gets
+         [s[0]] + fs }
+   end
+
+   # Expand LIBSVM's sparse format: every feature index from 1 up to the
+   # highest index found gets an explicit entry, missing ones as zeros.
+   # If a classborder is given, predictions are turned into -1/1 labels.
+   def fillSparseData!
+     return if @readonly
+     ndim = @data.split("\n").map{|line|
+       if (arr=line.split).size>1
+         arr.last.split(':').first.to_i
+       else 0 end
+     }.max
+     @data =
+       @data.split("\n").map{|line|
+         pred,*feats = line.split
+         if @classborder
+           pred = (pred.to_f>=@classborder) ? '1' : '-1'
+         end
+         counter = 0
+         pred + ' ' +
+           (1..ndim).map { |index|
+             if (element = feats[counter]) and
+                (element.split(':').first.to_i == index)
+               counter+=1
+               "#{index}:#{element.split(':').last}"
+             else
+               "#{index}:0"
+             end
+           }.join(' ') + "\n"
+       }.join
+     self.update
+   end
+
+ end
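+
+ # A minimal usage sketch (illustrative; 'train.data' is a hypothetical
+ # LIBSVM-format file). The destructive methods only touch the temporary
+ # copy, never the original file:
+ #
+ #   data = LIBSVMdata.new('train.data', false)  # writable temp copy
+ #   data.translate!(data.mean.map { |m| -m })   # center every column on zero
+ #   puts data                                   # path of the rescaled temp file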