RubyGems - biopsy - Versions diffs - 0.1.0.alpha - Mend

biopsy 0.1.0.alpha

Files changed (27)

data/LICENSE.txt +7 -0
data/README.md +49 -0
data/Rakefile +8 -0
data/lib/biopsy/base_extensions.rb +64 -0
data/lib/biopsy/domain.rb +156 -0
data/lib/biopsy/experiment.rb +103 -0
data/lib/biopsy/objective_function.rb +38 -0
data/lib/biopsy/objective_handler.rb +170 -0
data/lib/biopsy/objectives/fastest_optimum.rb +26 -0
data/lib/biopsy/opt_algorithm.rb +0 -0
data/lib/biopsy/optimisers/genetic_algorithm.rb +244 -0
data/lib/biopsy/optimisers/parameter_sweeper.rb +66 -0
data/lib/biopsy/optimisers/tabu_search.rb +437 -0
data/lib/biopsy/settings.rb +110 -0
data/lib/biopsy/target.rb +113 -0
data/lib/biopsy/version.rb +12 -0
data/lib/biopsy.rb +13 -0
data/test/helper.rb +187 -0
data/test/test_domain.rb +61 -0
data/test/test_experiment.rb +84 -0
data/test/test_file.rb +20 -0
data/test/test_hash.rb +55 -0
data/test/test_objective_handler.rb +99 -0
data/test/test_settings.rb +74 -0
data/test/test_string.rb +14 -0
data/test/test_target.rb +89 -0
metadata +198 -0

data/LICENSE.txt ADDED

@@ -0,0 +1,7 @@
+Copyright (c) 2013 Richard Smith
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,49 @@
+biopsy
+==========
+An automatic optimisation framework for programs and pipelines.
+Biopsy is a framework for optimising any program or pipeline which produces a measurable output. By reducing the settings of one or more programs to a parameter space, and by carefully choosing objective functions with which to measure the output of the program(s), biopsy can use a range of optimisation strategies to rapidly find the settings that perform the best. Combined with a strategy for subsampling the input data, this can lead to vast time and performance improvements.
+A simple example of the power of this approach is *de-novo* transcriptome assembly. Typically, the assembly process takes many GB of data as input, uses many GB of RAM and takes many hours to complete. This prevents researchers from performing full parameter sweeps, so they are forced to rely on word of mouth and very basic optimisation to choose assembler settings. Assemblotron, which uses the Biopsy framework, can fully optimise any *de-novo* assembler to produce the best assembly possible for a particular input. This typically takes little more time than running a single assembly.
+## Development status
+[![Gem Version](https://badge.fury.io/rb/biopsy.png)][gem]
+[![Build Status](https://secure.travis-ci.org/Blahah/biopsy.png?branch=master)][travis]
+[![Dependency Status](https://gemnasium.com/Blahah/biopsy.png?travis)][gemnasium]
+[![Code Climate](https://codeclimate.com/github/Blahah/biopsy.png)][codeclimate]
+[![Coverage Status](https://coveralls.io/repos/Blahah/biopsy/badge.png?branch=master)][coveralls]
+[gem]: https://badge.fury.io/rb/biopsy
+[travis]: https://travis-ci.org/Blahah/biopsy
+[gemnasium]: https://gemnasium.com/Blahah/biopsy
+[codeclimate]: https://codeclimate.com/github/Blahah/biopsy
+[coveralls]: https://coveralls.io/r/Blahah/biopsy
+This project is in alpha development and is not yet ready for deployment.
+Please don't report issues or request documentation until we are ready for beta release (see below for estimated timeframe).
+### Roadmap
+| Class            | Code   | Tests   | Docs   |
+| ------------     | :----: | ------: | -----: |
+| Settings         | DONE   | DONE    | DONE   |
+| Target           | DONE   | DONE    | DONE   |
+| Domain           | DONE   | DONE    | DONE   |
+| Experiment       | DONE   | DONE    | DONE   |
+| TabuSearch       | DONE   | -       | -      |
+| ParameterSweeper | DONE   | -       | -      |
+| ObjectiveHandler | DONE   | DONE    | DONE   |
+* ~ 20/24 tasks completed, ~83% done overall
+* alpha released: 6th September 2013
+* planned beta release date: 17th September 2013
+### Documentation
+Documentation is in development and will be released with the beta.
+### Citation
+This is *pre-release*, *pre-publication* academic software. In lieu of a paper to cite, please cite this GitHub repo if your use of the software leads to a publication.
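
The approach the README describes (reduce settings to a parameter space, score each output with an objective, keep the best settings) can be illustrated with a minimal sketch. This is not biopsy's API: `PARAMETER_SPACE`, `run_target`, `objective` and `combinations` are hypothetical names, the scoring formula is invented, and an exhaustive sweep stands in for whichever optimiser would actually be chosen.

```ruby
# Hypothetical sketch, not biopsy's API: an exhaustive sweep over a
# small parameter space, scored by a single objective function.

# Each setting of the (imaginary) program is reduced to a list of values.
PARAMETER_SPACE = {
  :k => [21, 31, 41],
  :min_coverage => [1, 2, 5]
}

# Stand-in for running the program; returns a measurable output.
# The formula is made up purely so the example has an optimum.
def run_target(params)
  { :contigs => 1000 - (params[:k] - 31).abs * 10 - params[:min_coverage] }
end

# Stand-in objective: higher is better in this sketch.
def objective(output)
  output[:contigs]
end

# Build every combination of parameter values as a Hash.
def combinations(space)
  keys = space.keys
  first, *rest = space.values
  first.product(*rest).map { |values| Hash[keys.zip(values)] }
end

best = combinations(PARAMETER_SPACE).max_by { |params| objective(run_target(params)) }
puts "best parameters: #{best.inspect}"
```

A full sweep like this is only practical for small spaces, which is the motivation the README gives for using smarter optimisation strategies on larger ones.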

data/Rakefile ADDED

@@ -0,0 +1,8 @@
+require 'rake/testtask'
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+end
+desc "Run tests"
+task :default => :test

data/lib/biopsy/base_extensions.rb ADDED

@@ -0,0 +1,64 @@
+class String
+  # return a CamelCase version of self
+  def camelize
+    return self if self !~ /_/ && self =~ /[A-Z]+.*/
+    split('_').map{|e| e.capitalize}.join
+  end
+end # String
+class File
+  # return the full path to the supplied cmd executable,
+  # if it exists in any location in PATH
+  def self.which(cmd)
+    exts = ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : ['']
+    ENV['PATH'].split(File::PATH_SEPARATOR).each do |path|
+      exts.each do |ext|
+        exe = File.join(path, "#{cmd}#{ext}")
+        return exe if File.executable? exe
+      end
+    end
+    return nil
+  end
+end # File
+class Hash
+  # recursively convert all keys to symbols
+  def deep_symbolize
+    target = dup
+    target.inject({}) do |memo, (key, value)|
+      value = value.deep_symbolize if value.is_a?(Hash)
+      memo[(key.to_sym rescue key) || key] = value
+      memo
+    end
+  end
+  # recursively merge two hashes
+  def deep_merge(other_hash)
+    self.merge(other_hash) do |key, oldval, newval|
+      oldval = oldval.to_hash if oldval.respond_to?(:to_hash)
+      newval = newval.to_hash if newval.respond_to?(:to_hash)
+      oldval.class.to_s == 'Hash' && newval.class.to_s == 'Hash' ? oldval.deep_merge(newval) : newval
+    end
+  end
+end # Hash
+class Array
+  # return the arithmetic mean of the elements in the array.
+  # Requires the array to contain only numeric objects.
+  # If a non-numeric element is encountered, an error will be raised.
+  def mean
+    self.sum / self.size.to_f
+  end
+  def sum
+    self.inject(0, :+)
+  end
+end # Array
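
To show what these helpers return, here is a small sketch that restates them as plain methods, so it runs without patching core classes. The method bodies follow the diff above; the standalone method names and the sample inputs are assumptions made for illustration.

```ruby
# Standalone restatements of the helpers above, written as plain
# methods so the example runs without reopening String, Hash or Array.

# CamelCase version of a snake_case string.
def camelize(str)
  return str if str !~ /_/ && str =~ /[A-Z]+.*/
  str.split('_').map { |e| e.capitalize }.join
end

# Recursively convert all keys to symbols where possible.
def deep_symbolize(hash)
  hash.each_with_object({}) do |(key, value), memo|
    value = deep_symbolize(value) if value.is_a?(Hash)
    memo[(key.to_sym rescue key)] = value
  end
end

# Recursively merge two hashes; on conflict the second hash wins.
def deep_merge(a, b)
  a.merge(b) do |_key, oldval, newval|
    oldval.is_a?(Hash) && newval.is_a?(Hash) ? deep_merge(oldval, newval) : newval
  end
end

# Arithmetic mean of a list of numbers.
def mean(values)
  values.inject(0, :+) / values.size.to_f
end

puts camelize('tabu_search')
p deep_symbolize('a' => { 'b' => 1 })
p deep_merge({ :a => { :x => 1 } }, { :a => { :y => 2 } })
p mean([1, 2, 3, 4])
```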

data/lib/biopsy/domain.rb ADDED

@@ -0,0 +1,156 @@
+# todo: ensure testing accounts for situation where there are multiple
+# input or output files defined in the spec
+module Biopsy
+  class DomainLoadError < StandardError
+  end
+  class Domain
+    attr_reader :name
+    attr_reader :input_filetypes
+    attr_reader :output_filetypes
+    attr_reader :objectives
+    attr_reader :keep_intermediates
+    attr_reader :gzip_intermediates
+    require 'yaml'
+    require 'pp'
+    # Return a new Domain object containing the specification of the
+    # currently active domain.
+    def initialize domain=nil
+      @name = domain.nil? ? self.get_current_domain : domain
+      @keep_intermediates = false
+      @gzip_intermediates = false
+      self.load_by_name @name
+    end
+    # Return the name of the currently active domain.
+    def get_current_domain
+      Settings.instance.domain
+    rescue
+      raise "You must specify the domain to use in the biopsy settings file or at the command line."
+    end
+    # Return the path to the YAML definition file for domain with +:name+.
+    # All +:domain_dirs+ in Settings are searched and the first matching
+    # file is returned.
+    def locate_definition name
+      Settings.instance.locate_config :domain_dir, name
+    end
+    # Check and apply the settings in +:config+ (Hash).
+    def apply_config config
+      [:input_filetypes, :output_filetypes, :objectives].each do |key|
+        raise DomainLoadError.new("Domain definition is missing the required key #{key}") unless config.has_key? key
+        self.instance_variable_set('@' + key.to_s, config[key])
+      end
+    end
+    # Load and apply the domain definition with +:name+
+    def load_by_name name
+      path = self.locate_definition name
+      raise DomainLoadError.new("Domain definition file does not exist for #{name}") if path.nil?
+      config = YAML::load_file(path)
+      raise DomainLoadError.new("Domain definition file #{path} is not valid YAML") if config.nil?
+      self.apply_config config.deep_symbolize
+    end
+    # Validate a Target, returning an empty array if the target meets
+    # the specification of this Domain, and an array of error messages
+    # otherwise.
+    # +:target+, the Target object to validate.
+    def target_valid? target
+      l = []
+      @input_filetypes.each do |input|
+        l << [target[:input_files], input]
+      end
+      @output_filetypes.each do |output|
+        l << [target[:output_files], output]
+      end
+      errors = []
+      l.each do |pair|
+        testcase, definition = pair
+        errors += self.validate_target_filetypes(testcase, definition)
+      end
+      errors
+    end
+    # Returns an empty array if +:testcase+ conforms to definition,
+    # otherwise returns an array of strings describing the
+    # errors found.
+    def validate_target_filetypes testcase, definition
+      errors = []
+      # check extensions
+      testcase.each_pair do |key, f|
+        ext = File.extname(f)
+        unless definition[:allowed_extensions].include? ext
+          errors << %Q{input file #{f} doesn't match any of the filetypes
+                       allowed for this domain}
+        end
+      end
+      # check number of files
+      in_count = testcase.size
+      if definition.has_key? :n
+        unless in_count == definition[:n]
+          errors << %Q{the number of input files (#{in_count}) doesn't
+                      match the domain specification (#{definition[:n]})}
+        end
+      end
+      if definition.has_key? :min
+        unless in_count >= definition[:min]
+          errors << %Q{the number of input files (#{in_count}) is lower
+                      than the minimum for this domain (#{definition[:min]})}
+        end
+      end
+      if definition.has_key? :max
+        unless in_count <= definition[:max]
+          errors << %Q{the number of input files (#{in_count}) is greater
+                      than the maximum for this domain (#{definition[:max]})}
+        end
+      end
+      errors
+    end
+    # Write out a template Domain definition to +:filename+
+    def write_template filename
+      data = {
+        :input_filetypes => [
+          {
+            :min => 1,
+            :max => 2,
+            :allowed_extensions => [
+              'txt',
+              'csv',
+              'tsv'
+            ]
+          },
+          {
+            :n => 2,
+            :allowed_extensions => [
+              'png'
+            ]
+          }
+        ],
+        :output_filetypes => [
+          {
+            :n => 1,
+            :allowed_extensions => [
+              'pdf',
+              'xls'
+            ]
+          }
+        ],
+        :objectives => [
+          'objective1', 'objective2'
+        ]
+      }
+      ::File.open(filename, 'w') do |f|
+        f.puts data.to_yaml
+      end
+    end
+  end # end of class Domain
+end # end of module Biopsy
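
The file-type checks in `validate_target_filetypes` can be sketched as a standalone method. The name `validate_filetypes`, the sample definition and the error wording are illustrative assumptions. The sketch lists extensions with a leading dot because `File.extname` returns values such as `".txt"`; whether real domain definitions are written that way is not shown in this diff.

```ruby
# Sketch of the extension and file-count checks. Assumption: allowed
# extensions include the leading dot, matching what File.extname returns.

def validate_filetypes(files, definition)
  errors = []

  # check extensions
  files.each_value do |f|
    ext = File.extname(f)
    unless definition[:allowed_extensions].include?(ext)
      errors << "file #{f} doesn't match any of the allowed filetypes"
    end
  end

  # check number of files against :n, :min and :max when present
  count = files.size
  if definition.key?(:n) && count != definition[:n]
    errors << "expected exactly #{definition[:n]} files, got #{count}"
  end
  if definition.key?(:min) && count < definition[:min]
    errors << "expected at least #{definition[:min]} files, got #{count}"
  end
  if definition.key?(:max) && count > definition[:max]
    errors << "expected at most #{definition[:max]} files, got #{count}"
  end

  errors
end

definition = { :min => 1, :max => 2, :allowed_extensions => ['.txt', '.csv'] }

ok = validate_filetypes({ :reads => 'reads.txt' }, definition)
bad = validate_filetypes({ :a => 'a.txt', :b => 'b.csv', :c => 'c.png' }, definition)

puts ok.inspect
puts bad.inspect
```

The first call yields no errors; the second reports both the disallowed `.png` extension and the file count exceeding `:max`.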

data/lib/biopsy/experiment.rb ADDED

@@ -0,0 +1,103 @@
+# Optimisation Framework: Experiment
+#
+# == Description
+#
+# The Experiment object encapsulates the data and methods that represent
+# the optimisation experiment being carried out.
+#
+# The metadata necessary to conduct the experiment, as well as the
+# settings for the experiment, are stored here.
+#
+# It is also the main process controller for the entire optimisation
+# cycle. It takes user input, runs the target program, the objective function(s)
+# and the optimisation algorithm, looping through the optimisation cycle until
+# completion and then returning the output.
+module Biopsy
+  class Experiment
+    attr_reader :inputs, :outputs, :retain_intermediates, :target, :start, :algorithm
+    # Returns a new Experiment
+    def initialize(target_name, domain_name, start=nil, algorithm=nil)
+      @domain = Domain.new domain_name
+      @start = start
+      @algorithm = algorithm
+      self.load_target target_name
+      @objective = ObjectiveHandler.new(@domain, @target)
+      self.select_algorithm
+      self.select_starting_point
+      @scores = {}
+      @iteration_count = 0
+    end
+    # return the set of parameters to evaluate first
+    def select_starting_point
+      return unless @start.nil?
+      if @algorithm && @algorithm.knows_starting_point?
+        @start = @algorithm.select_starting_point
+      else
+        @start = self.random_start_point
+      end
+    end
+    # Return a random set of parameters from the parameter space.
+    def random_start_point
+      Hash[@target.parameter_ranges.map { |p, r| [p, r.sample] }]
+    end
+    # select the optimisation algorithm to use
+    def select_algorithm
+      @algorithm = ParameterSweeper.new(@target.parameter_ranges)
+      return if @algorithm.combinations.size < Settings.instance.sweep_cutoff
+      @algorithm = TabuSearch.new(@target.parameter_ranges)
+    end
+    # load the target named +:target_name+
+    def load_target target_name
+      @target = Target.new @domain
+      @target.load_by_name target_name
+    end
+    # Runs the experiment until the completion criteria
+    # are met. On completion, returns the best parameter
+    # set.
+    def run
+      in_progress = true
+      @algorithm.setup @start
+      @current_params = @start
+      while in_progress do
+        run_iteration
+        # update the best result
+        @best = @algorithm.best
+        # have we finished?
+        in_progress = !@algorithm.finished?
+      end
+      puts "found optimum score: #{@best[:score]} for parameters #{@best[:parameters]} in #{@iteration_count} iterations."
+      return @best
+    end
+    # Runs a single iteration of the optimisation,
+    # encompassing the program, objective(s) and optimiser.
+    # Returns the output of the optimiser.
+    def run_iteration
+      # run the target
+      run_data = @target.run @current_params
+      # evaluate with objectives
+      param_key = @current_params.to_s
+      result = nil
+      if @scores.has_key? param_key
+        result = @scores[param_key]
+      else
+        result = @objective.run_for_output run_data
+        @iteration_count += 1
+      end
+      @scores[@current_params.to_s] = result
+      # get next steps from optimiser
+      @current_params = @algorithm.run_one_iteration(@current_params, result)
+    end
+  end # end of class Experiment
+end # end of module Biopsy
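
The control flow of `run` and `run_iteration` (evaluate the current parameters, cache the score, ask the optimiser for the next parameters, stop when it reports it has finished) can be sketched in isolation. `ToySweeper` and `evaluate` are hypothetical stand-ins. Only the method names `setup`, `run_one_iteration`, `best` and `finished?` come from the code above, and "lower score is better" is an assumption of this sketch.

```ruby
# Toy optimiser exposing the methods Experiment calls on @algorithm.
# It walks a fixed list of candidates; it is not biopsy's ParameterSweeper.
class ToySweeper
  attr_reader :best

  def initialize(candidates)
    @candidates = candidates.dup
    @best = { :parameters => nil, :score => nil }
    @done = false
  end

  # The starting point is evaluated first, so drop it from the queue.
  def setup(start)
    @candidates.delete(start)
  end

  # Record the score for params, then return the next candidate to try.
  def run_one_iteration(params, score)
    if @best[:score].nil? || score < @best[:score]
      @best = { :parameters => params, :score => score }
    end
    @done = @candidates.empty?
    @candidates.shift
  end

  def finished?
    @done
  end
end

# Stand-in for target plus objective: lower is better, optimum at x = 3.
def evaluate(params)
  (params[:x] - 3).abs
end

candidates = (0..5).map { |x| { :x => x } }
algorithm = ToySweeper.new(candidates)
current = candidates.first
algorithm.setup(current)

scores = {}
until algorithm.finished?
  key = current.to_s
  scores[key] ||= evaluate(current) # reuse a cached score if present
  current = algorithm.run_one_iteration(current, scores[key])
end

puts "best: #{algorithm.best.inspect} after #{scores.size} evaluations"
```

Keying the cache on the parameter set means an optimiser that revisits a point does not trigger a second evaluation of it.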

data/lib/biopsy/objective_function.rb ADDED

@@ -0,0 +1,38 @@
+# Assembly Optimisation Framework: Objective Function
+#
+# == Description
+#
+# ObjectiveFunction is a skeleton parent class to ensure
+# objective functions provide the essential methods.
+# Because abstract classes don't really make sense in
+# Ruby's runtime compilation, we can only check if methods
+# are implemented at runtime (but at least we can raise
+# a sensible error)
+module Biopsy
+  class ObjectiveFunction
+    # Runs the objective function for the assembly supplied,
+    # returning a real number value
+    #
+    # === Options
+    #
+    # * +:assemblydata+ - Hash containing data about the assembly to analyse
+    # * +:threads+ - maximum number of threads to use
+    #
+    # === Example
+    #
+    #   objective = ObjectiveFunction.new
+    #   result = objective.run('example.fasta')
+    def run(assemblydata, threads=6)
+      raise NotImplementedError.new("You must implement a run method for each objective function")
+    end
+    def essential_files
+      return []
+    end
+  end
+end
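
A minimal sketch of how a concrete objective might subclass this skeleton follows. `LineCount` and its input key `:lines` are hypothetical; the result keys (`:result`, `:optimum`, `:max`, `:weighting`) are the ones read by `ObjectiveHandler#dimension_reduce` elsewhere in this diff. The base class is restated with a `threads` argument so the block is self-contained.

```ruby
module Biopsy
  # Minimal copy of the base class so the example is self-contained.
  class ObjectiveFunction
    def run(data, threads = 1)
      raise NotImplementedError, "You must implement a run method for each objective function"
    end

    def essential_files
      []
    end
  end
end

# Hypothetical objective: scores an output by its number of lines.
class LineCount < Biopsy::ObjectiveFunction
  def run(data, threads = 1)
    { :result => data[:lines].size, :optimum => 10, :max => 10.0, :weighting => 1.0 }
  end
end

puts LineCount.new.run({ :lines => %w[a b c] }).inspect

# Calling run on the base class raises, which is how a missing
# implementation is detected at runtime.
begin
  Biopsy::ObjectiveFunction.new.run({})
rescue NotImplementedError => e
  puts "base class: #{e.message}"
end
```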

data/lib/biopsy/objective_handler.rb ADDED

@@ -0,0 +1,170 @@
+require 'securerandom'
+require 'fileutils'
+# Assembly Optimisation Framework: Objective Function Handler
+#
+# == Description
+#
+# The Handler manages the objective functions for the optimisation experiment.
+# Specifically, it finds all the objective functions and runs them when requested,
+# outputting the results to the main Optimiser.
+#
+# == Explanation
+#
+# === Loading objective functions
+#
+# The Handler expects a directory containing objectives (by default it looks in *currentdir/objectives*).
+# The *objectives* directory should contain the following:
+#
+# * a *.rb* file for each objective function. The file should define a subclass of ObjectiveFunction
+# * (optionally) a file *objectives.txt* which lists the objective function files to use
+#
+# If the objectives.txt file is absent, the subset of objectives to use can be set directly in the Optimiser,
+# or if no such restriction is set, the whole set of objectives will be run.
+#
+# Each file listed in *objectives.txt* is loaded if it exists.
+#
+# === Running objective functions
+#
+# The Handler iterates through the objectives, calling the *run()* method
+# of each by passing the assembly. After collecting results, it returns
+# a Hash of the results to the parent Optimiser.
+module Biopsy
+  class ObjectiveHandlerError < StandardError
+  end
+  class ObjectiveHandler
+    attr_reader :last_tempdir
+    attr_accessor :objectives
+    def initialize domain, target
+      @domain = domain
+      @target = target
+      base_dir = Settings.instance.base_dir
+      @objectives_dir = Settings.instance.objectives_dir.first
+      @objectives = {}
+      $LOAD_PATH.unshift(@objectives_dir)
+      @subset = Settings.instance.respond_to?(:objectives_subset) ? Settings.instance.objectives_subset : nil
+      self.load_objectives
+      # pass objective list back to caller
+      return @objectives.keys
+    end
+    def load_objectives
+      # load objectives
+      # load subset list if available
+      subset_file = @objectives_dir + '/objectives.txt'
+      subset = File.exist?(subset_file) ? File.readlines(subset_file).map{ |l| l.strip } : nil
+      subset = @subset if subset.nil?
+      # parse in objectives
+      Dir.chdir @objectives_dir do
+        Dir['*.rb'].each do |f|
+          file_name = File.basename(f, '.rb')
+          require file_name
+          objective_name = file_name.camelize
+          objective =  Module.const_get(objective_name).new
+          if subset.nil? or subset.include?(file_name)
+            # this objective is included
+            @objectives[objective_name] = objective
+          end
+        end
+        # puts "loaded #{@objectives.length} objectives."
+      end
+    end
+    # Run a specific +:objective+ on the +:output+ of a target
+    # with max +:threads+.
+    def run_objective(objective, name, output, threads)
+      begin
+        # output is a hash containing the file(s) output
+        # by the target in the format expected by the
+        # objective function(s).
+        return objective.run(output, threads)
+      rescue NotImplementedError => e
+        puts "Error: objective function #{objective.class} does not implement the run() method"
+        puts "Please refer to the documentation for instructions on adding objective functions"
+        raise e
+      end
+    end
+    # Perform a euclidean distance dimension reduction of multiple objectives
+    # using weighting specified in the domain definition.
+    def dimension_reduce(results)
+      # calculate the weighted Euclidean distance from optimal
+      # d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2+...+(p_i - q_i)^2+...+(p_n - q_n)^2}
+      # here the max value is sqrt(n) where n is no. of results, min value (optimum) is 0
+      total = 0
+      results.each_pair do |key, value|
+        o = value[:optimum]
+        w = value[:weighting]
+        a = value[:result]
+        m = value[:max]
+        total += w * (((o - a)/m) ** 2)
+      end
+      return Math.sqrt(total) / results.length
+    end
+    # Run all objectives functions for +:output+.
+    def run_for_output(output, threads=6, cleanup=true, allresults=false)
+      # check output files exist
+      @target.output_files.each_pair do |key, name|
+        unless File.exist?(output[key]) && File.size(output[key]) > 0
+          info("file #{output[key]} does not exist or is empty")
+          return nil
+        end
+      end
+      # run all objectives for output
+      results = {}
+      # create temp dir
+      Dir.chdir(self.create_tempdir) do
+        @objectives.each_pair do |name, objective|
+          results[name] = self.run_objective(objective, name, output, threads)
+        end
+        if cleanup
+          # move the essential files out of the temp dir before it is removed
+          essential_files = []
+          if @domain.keep_intermediates
+            @objectives.values.each{ |objective| essential_files += objective.essential_files }
+          end
+          Dir["*"].each do |file|
+            next if File.directory? file
+            next unless essential_files.include?(file)
+            if @domain.gzip_intermediates
+              `gzip #{file}`
+              file = "#{file}.gz"
+            end
+            FileUtils.mv(file, '..')
+          end
+        end
+      end
+      if cleanup
+        # clean up temp dir
+        FileUtils.rm_rf @last_tempdir
+      end
+      if allresults
+        return {:results => results,
+                :reduced => self.dimension_reduce(results)}
+      else
+        results.each do |key, value|
+          return value[:result]
+        end
+      end
+    end
+    # create a guaranteed random temporary directory for storing outputs
+    # return the dirname
+    def create_tempdir
+      token = loop do
+        # generate random dirnames until we find one that
+        # doesn't exist
+        test_token = SecureRandom.hex
+        break test_token unless File.exist? test_token
+      end
+      Dir.mkdir(token)
+      @last_tempdir = token
+      return token
+    end
+  end
+end
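
As a worked example of the weighted distance computed by `dimension_reduce`, the sketch below restates the method as a standalone function and applies it to two made-up objectives. The objective names and values are assumptions chosen so the arithmetic is easy to follow: the normalised deviations are 0.75 and -1.0, giving sqrt(0.5625 + 1.0) / 2 = 1.25 / 2 = 0.625.

```ruby
# Standalone restatement of dimension_reduce for a worked example.
def dimension_reduce(results)
  total = 0
  results.each_value do |value|
    o = value[:optimum]
    w = value[:weighting]
    a = value[:result]
    m = value[:max]
    # weighted squared deviation from the optimum, normalised by max
    total += w * (((o - a) / m)**2)
  end
  Math.sqrt(total) / results.length
end

# Made-up results for two hypothetical objectives.
results = {
  'Objective1' => { :optimum => 100.0, :result => 25.0, :max => 100.0, :weighting => 1.0 },
  'Objective2' => { :optimum => 0.0, :result => 10.0, :max => 10.0, :weighting => 1.0 }
}

puts dimension_reduce(results)
```

The values are Floats on purpose: in Ruby, `/` between two Integers performs integer division, so an objective returning Integer `:optimum`, `:result` and `:max` would have its deviation truncated before squaring.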

data/lib/biopsy/objectives/fastest_optimum.rb ADDED

@@ -0,0 +1,26 @@
+# objective function to count number of conditioned
+# reciprocal best usearch annotations
+class FastestOptimum < Biopsy::ObjectiveFunction
+  def run(assemblydata, threads=6)
+    info "running objective: FastestOptimum"
+    t0 = Time.now
+    @threads = threads
+    # extract input data
+    @assembly = assemblydata[:assembly]
+    @assembly_name = assemblydata[:assemblyname]
+    @reference = assemblydata[:reference]
+    # results
+    res = self.rbusearch
+    return { :weighting => 1.0,
+             :optimum => 26000,
+             :max => 26000.0,
+             :time => Time.now - t0}.merge res
+  end
+  def essential_files
+    return ['bestmatches.rbu']
+  end
+end

data/lib/biopsy/opt_algorithm.rb ADDED

File without changes