biopsy 0.1.0.alpha

data/LICENSE.txt ADDED
@@ -0,0 +1,7 @@
1
+ Copyright (c) 2013 Richard Smith
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4
+
5
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6
+
7
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,49 @@
1
+ biopsy
2
+ ==========
3
+
4
+ An automatic optimisation framework for programs and pipelines.
5
+
6
+ Biopsy is a framework for optimising any program or pipeline which produces a measurable output. By reducing the settings of one or more programs to a parameter space, and by carefully choosing objective functions with which to measure the output of the program(s), biopsy can use a range of optimisation strategies to rapidly find the settings that perform the best. Combined with a strategy for subsampling the input data, this can lead to vast time and performance improvements.
7
+
8
+ A simple example of the power of this approach is *de-novo* transcriptome assembly. Typically, the assembly process takes many GB of data as input, uses many GB of RAM and takes many hours to complete. This prevents researchers from performing full parameter sweeps, so they are forced to rely on word of mouth and very basic optimisation to choose assembler settings. Assemblotron, which uses the Biopsy framework, can fully optimise any *de-novo* assembler to produce the best assembly possible for a particular input. This typically takes little more time than running a single assembly.
9
+
10
+ ## Development status
11
+
12
+ [![Gem Version](https://badge.fury.io/rb/biopsy.png)][gem]
13
+ [![Build Status](https://secure.travis-ci.org/Blahah/biopsy.png?branch=master)][travis]
14
+ [![Dependency Status](https://gemnasium.com/Blahah/biopsy.png?travis)][gemnasium]
15
+ [![Code Climate](https://codeclimate.com/github/Blahah/biopsy.png)][codeclimate]
16
+ [![Coverage Status](https://coveralls.io/repos/Blahah/biopsy/badge.png?branch=master)][coveralls]
17
+
18
+ [gem]: https://badge.fury.io/rb/biopsy
19
+ [travis]: https://travis-ci.org/Blahah/biopsy
20
+ [gemnasium]: https://gemnasium.com/Blahah/biopsy
21
+ [codeclimate]: https://codeclimate.com/github/Blahah/biopsy
22
+ [coveralls]: https://coveralls.io/r/Blahah/biopsy
23
+
24
+ This project is in alpha development and is not yet ready for deployment.
25
+ Please don't report issues or request documentation until we are ready for beta release (see below for estimated timeframe).
26
+
27
+ ### Roadmap
28
+
29
+ | Class | Code | Tests | Docs |
30
+ | ------------ | :----: | ------: | -----: |
31
+ | Settings | DONE | DONE | DONE |
32
+ | Target | DONE | DONE | DONE |
33
+ | Domain | DONE | DONE | DONE |
34
+ | Experiment | DONE | DONE | DONE |
35
+ | TabuSearch | DONE | - | - |
36
+ | ParameterSweeper | DONE | - | - |
37
+ | ObjectiveHandler | DONE | DONE | DONE |
38
+
39
+ * ~ 20/24 tasks completed, ~83% done overall
40
+ * alpha released: 6th September 2013
41
+ * planned beta release date: 17th September 2013
42
+
43
+ ### Documentation
44
+
45
+ Documentation is in development and will be released with the beta.
46
+
47
+ ### Citation
48
+
49
+ This is *pre-release*, *pre-publication* academic software. In lieu of a paper to cite, please cite this GitHub repo if your use of the software leads to a publication.
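The README's idea of reducing program settings to a parameter space and scoring each point with an objective function can be illustrated with a small sketch. Everything named here (`PARAMETER_RANGES`, `combinations`, `score`, and the settings `kmer_size` and `min_coverage`) is hypothetical and is not part of the biopsy API; the toy `score` stands in for running a real program and measuring its output.

```ruby
# Hypothetical parameter space: each setting maps to its candidate values.
PARAMETER_RANGES = {
  kmer_size: [21, 25, 31],
  min_coverage: [1, 2]
}.freeze

# Enumerate every combination of settings (a full parameter sweep).
# Assumes the space has at least one parameter.
def combinations(ranges)
  names = ranges.keys
  first, *rest = ranges.values
  first.product(*rest).map { |values| names.zip(values).to_h }
end

# Toy objective (higher is better), standing in for
# "run the program and measure its output".
def score(params)
  -(params[:kmer_size] - 25).abs - (params[:min_coverage] - 2).abs
end

all_settings = combinations(PARAMETER_RANGES)
best = all_settings.max_by { |params| score(params) }
puts "#{all_settings.size} combinations evaluated, best: #{best}"
```

A full sweep like this is only practical when the number of combinations is small, which is presumably why larger spaces call for a search strategy instead.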
data/Rakefile ADDED
@@ -0,0 +1,8 @@
1
+ require 'rake/testtask'
2
+
3
+ Rake::TestTask.new do |t|
4
+ t.libs << 'test'
5
+ end
6
+
7
+ desc "Run tests"
8
+ task :default => :test
@@ -0,0 +1,64 @@
1
+ class String
2
+
3
+ # return a CamelCase version of self
4
+ def camelize
5
+ return self if self !~ /_/ && self =~ /[A-Z]+.*/
6
+ split('_').map{|e| e.capitalize}.join
7
+ end
8
+
9
+ end # String
10
+
11
+ class File
12
+
13
+ # return the full path to the supplied cmd executable,
14
+ # if it exists in any location in PATH
15
+ def self.which(cmd)
16
+ exts = ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : ['']
17
+ ENV['PATH'].split(File::PATH_SEPARATOR).each do |path|
18
+ exts.each do |ext|
19
+ exe = File.join(path, "#{cmd}#{ext}")
20
+ return exe if File.executable? exe
21
+ end
22
+ end
23
+ return nil
24
+ end
25
+
26
+ end # File
27
+
28
+ class Hash
29
+
30
+ # recursively convert all keys to symbols
31
+ def deep_symbolize
32
+ target = dup
33
+ target.inject({}) do |memo, (key, value)|
34
+ value = value.deep_symbolize if value.is_a?(Hash)
35
+ memo[(key.to_sym rescue key) || key] = value
36
+ memo
37
+ end
38
+ end
39
+
40
+ # recursively merge two hashes
41
+ def deep_merge(other_hash)
42
+ self.merge(other_hash) do |key, oldval, newval|
43
+ oldval = oldval.to_hash if oldval.respond_to?(:to_hash)
44
+ newval = newval.to_hash if newval.respond_to?(:to_hash)
45
+ oldval.class.to_s == 'Hash' && newval.class.to_s == 'Hash' ? oldval.deep_merge(newval) : newval
46
+ end
47
+ end
48
+
49
+ end # Hash
50
+
51
+ class Array
52
+
53
+ # return the arithmetic mean of the elements in the array.
54
+ # Requires the array to contain only numeric objects (e.g. Fixnum or Float).
55
+ # If any other class is encountered, an error will be raised.
56
+ def mean
57
+ self.sum / self.size.to_f
58
+ end
59
+
60
+ def sum
61
+ self.inject(0, :+)
62
+ end
63
+
64
+ end # Array
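The behaviour of the `camelize` and `deep_symbolize` helpers above can be checked with a short sketch. To keep it self-contained and avoid reopening core classes, the logic is restated as module functions; the module name `CoreExtSketch` is hypothetical and does not exist in the gem.

```ruby
# Standalone restatement of the String#camelize and Hash#deep_symbolize
# logic shown above (CoreExtSketch is a hypothetical name).
module CoreExtSketch
  module_function

  # Return a CamelCase version of +str+; strings that already look
  # CamelCase (no underscore, contain a capital) are returned unchanged.
  def camelize(str)
    return str if str !~ /_/ && str =~ /[A-Z]+.*/
    str.split('_').map(&:capitalize).join
  end

  # Recursively convert all keys to symbols where possible;
  # keys without a to_sym method (e.g. integers) are kept as they are.
  def deep_symbolize(hash)
    hash.each_with_object({}) do |(key, value), memo|
      value = deep_symbolize(value) if value.is_a?(Hash)
      memo[(key.to_sym rescue key)] = value
    end
  end
end

puts CoreExtSketch.camelize('fastest_optimum')
p CoreExtSketch.deep_symbolize('input' => { 'n' => 2 }, 3 => 'kept')
```

The file-name-to-class-name mapping is what lets an objective stored as `fastest_optimum.rb` be looked up as the constant `FastestOptimum`.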
@@ -0,0 +1,156 @@
1
+ # todo: ensure testing accounts for situation where there are multiple
2
+ # input or output files defined in the spec
3
+ module Biopsy
4
+
5
+ class DomainLoadError < StandardError
6
+ end
7
+
8
+ class Domain
9
+
10
+ attr_reader :name
11
+ attr_reader :input_filetypes
12
+ attr_reader :output_filetypes
13
+ attr_reader :objectives
14
+ attr_reader :keep_intermediates
15
+ attr_reader :gzip_intermediates
16
+
17
+ require 'yaml'
18
+ require 'pp'
19
+
20
+ # Return a new Domain object containing the specification of the
21
+ # currently active domain.
22
+ def initialize domain=nil
23
+ @name = domain.nil? ? self.get_current_domain : domain
24
+
25
+ @keep_intermediates = false
26
+ @gzip_intermediates = false
27
+ self.load_by_name @name
28
+ end
29
+
30
+ # Return the name of the currently active domain.
31
+ def get_current_domain
32
+ Settings.instance.domain
33
+ rescue
34
+ raise "You must specify the domain to use in the biopsy settings file or at the command line."
35
+ end
36
+
37
+ # Return the path to the YAML definition file for domain with +:name+.
38
+ # All +:domain_dirs+ in Settings are searched and the first matching
39
+ # file is returned.
40
+ def locate_definition name
41
+ Settings.instance.locate_config :domain_dir, name
42
+ end
43
+
44
+ # Check and apply the settings in +:config+ (Hash).
45
+ def apply_config config
46
+ [:input_filetypes, :output_filetypes, :objectives].each do |key|
47
+ raise DomainLoadError.new("Domain definition is missing the required key #{key}") unless config.has_key? key
48
+ self.instance_variable_set('@' + key.to_s, config[key])
49
+ end
50
+ end
51
+
52
+ # Load and apply the domain definition with +:name+
53
+ def load_by_name name
54
+ path = self.locate_definition name
55
+ raise DomainLoadError.new("Domain definition file does not exist for #{name}") if path.nil?
56
+ config = YAML::load_file(path)
57
+ raise DomainLoadError.new("Domain definition file #{path} is not valid YAML") if config.nil?
58
+ self.apply_config config.deep_symbolize
59
+ end
60
+
61
+ # Validate a Target, returning true if the target meets
62
+ # the specification of this Domain, and false otherwise.
63
+ # +:target+, the Target object to validate.
64
+ def target_valid? target
65
+ l = []
66
+ @input_filetypes.each do |input|
67
+ l << [target[:input_files], input]
68
+ end
69
+ @output_filetypes.each do |output|
70
+ l << [target[:output_files], output]
71
+ end
72
+ errors = []
73
+ l.each do |pair|
74
+ testcase, definition = pair
75
+ errors += self.validate_target_filetypes(testcase, definition)
76
+ end
77
+ errors.empty?
78
+ end
79
+
80
+ # Returns an empty array if +:testcase+ conforms to definition,
81
+ # otherwise returns an array of strings describing the
82
+ # errors found.
83
+ def validate_target_filetypes testcase, definition
84
+ errors = []
85
+ # check extensions
86
+ testcase.each_pair do |key, f|
87
+ ext = File.extname(f)
88
+ unless definition[:allowed_extensions].include? ext
89
+ errors << %Q{input file #{f} doesn't match any of the filetypes
90
+ allowed for this domain}
91
+ end
92
+ end
93
+ # check number of files
94
+ in_count = testcase.size
95
+ if definition.has_key? :n
96
+ unless in_count == definition[:n]
97
+ errors << %Q{the number of input files (#{in_count}) doesn't
98
+ match the domain specification (#{definition[:n]})}
99
+ end
100
+ end
101
+ if definition.has_key? :min
102
+ unless in_count >= definition[:min]
103
+ errors << %Q{the number of input files (#{in_count}) is lower
104
+ than the minimum for this domain (#{definition[:min]})}
105
+ end
106
+ end
107
+ if definition.has_key? :max
108
+ unless in_count <= definition[:max]
109
+ errors << %Q{the number of input files (#{in_count}) is greater
110
+ than the maximum for this domain (#{definition[:max]})}
111
+ end
112
+ end
113
+ errors
114
+ end
115
+
116
+ # Write out a template Domain definition to +:filename+
117
+ def write_template filename
118
+ data = {
119
+ :input_filetypes => [
120
+ {
121
+ :min => 1,
122
+ :max => 2,
123
+ :allowed_extensions => [
124
+ 'txt',
125
+ 'csv',
126
+ 'tsv'
127
+ ]
128
+ },
129
+ {
130
+ :n => 2,
131
+ :allowed_extensions => [
132
+ 'png'
133
+ ]
134
+ }
135
+ ],
136
+ :output_filetypes => [
137
+ {
138
+ :n => 1,
139
+ :allowed_extensions => [
140
+ 'pdf',
141
+ 'xls'
142
+ ]
143
+ }
144
+ ],
145
+ :objectives => [
146
+ 'objective1', 'objective2'
147
+ ]
148
+ }
149
+ ::File.open(filename, 'w') do |f|
150
+ f.puts data.to_yaml
151
+ end
152
+ end
153
+
154
+ end # end of class Domain
155
+
156
+ end # end of module Biopsy
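The file-count rules in a domain definition (`:n` for an exact count, `:min` and `:max` for bounds) can be sketched in isolation as below. This is a minimal illustration of the intended checks, not the gem's implementation; the function name `count_errors` and the wording of its messages are hypothetical.

```ruby
# Return an array of error strings for a file count checked against a
# definition hash that may carry :n (exact), :min and :max keys.
# An empty array means the count is acceptable.
def count_errors(count, definition)
  errors = []
  if definition.key?(:n) && count != definition[:n]
    errors << "expected exactly #{definition[:n]} files, got #{count}"
  end
  if definition.key?(:min) && count < definition[:min]
    errors << "expected at least #{definition[:min]} files, got #{count}"
  end
  if definition.key?(:max) && count > definition[:max]
    errors << "expected at most #{definition[:max]} files, got #{count}"
  end
  errors
end

# Mirrors the first input filetype of the template definition above.
definition = { min: 1, max: 2, allowed_extensions: %w[txt csv tsv] }
p count_errors(1, definition) # within bounds
p count_errors(3, definition) # above the maximum
```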
@@ -0,0 +1,103 @@
1
+ # Optimisation Framework: Experiment
2
+ #
3
+ # == Description
4
+ #
5
+ # The Experiment object encapsulates the data and methods that represent
6
+ # the optimisation experiment being carried out.
7
+ #
8
+ # The metadata necessary to conduct the experiment, as well as the
9
+ # settings for the experiment, are stored here.
10
+ #
11
+ # It is also the main process controller for the entire optimisation
12
+ # cycle. It takes user input, runs the target program, the objective function(s)
13
+ # and the optimisation algorithm, looping through the optimisation cycle until
14
+ # completion and then returning the output.
15
+ module Biopsy
16
+
17
+ class Experiment
18
+
19
+ attr_reader :inputs, :outputs, :retain_intermediates, :target, :start, :algorithm
20
+
21
+ # Returns a new Experiment
22
+ def initialize(target_name, domain_name, start=nil, algorithm=nil)
23
+ @domain = Domain.new domain_name
24
+ @start = start
25
+ @algorithm = algorithm
26
+
27
+ self.load_target target_name
28
+ @objective = ObjectiveHandler.new(@domain, @target)
29
+ self.select_algorithm
30
+ self.select_starting_point
31
+ @scores = {}
32
+ @iteration_count = 0
33
+ end
34
+
35
+ # return the set of parameters to evaluate first
36
+ def select_starting_point
37
+ return unless @start.nil?
38
+ if @algorithm && @algorithm.knows_starting_point?
39
+ @start = @algorithm.select_starting_point
40
+ else
41
+ @start = self.random_start_point
42
+ end
43
+ end
44
+
45
+ # Return a random set of parameters from the parameter space.
46
+ def random_start_point
47
+ Hash[@target.parameter_ranges.map { |p, r| [p, r.sample] }]
48
+ end
49
+
50
+ # select the optimisation algorithm to use
51
+ def select_algorithm
52
+ @algorithm = ParameterSweeper.new(@target.parameter_ranges)
53
+ return if @algorithm.combinations.size < Settings.instance.sweep_cutoff
54
+ @algorithm = TabuSearch.new(@target.parameter_ranges)
55
+ end
56
+
57
+ # load the target named +:target_name+
58
+ def load_target target_name
59
+ @target = Target.new @domain
60
+ @target.load_by_name target_name
61
+ end
62
+
63
+ # Runs the experiment until the completion criteria
64
+ # are met. On completion, returns the best parameter
65
+ # set.
66
+ def run
67
+ in_progress = true
68
+ @algorithm.setup @start
69
+ @current_params = @start
70
+ while in_progress do
71
+ run_iteration
72
+ # update the best result
73
+ @best = @algorithm.best
74
+ # have we finished?
75
+ in_progress = !@algorithm.finished?
76
+ end
77
+ puts "found optimum score: #{@best[:score]} for parameters #{@best[:parameters]} in #{@iteration_count} iterations."
78
+ return @best
79
+ end
80
+
81
+ # Runs a single iteration of the optimisation,
82
+ # encompassing the program, objective(s) and optimiser.
83
+ # Returns the output of the optimiser.
84
+ def run_iteration
85
+ # run the target
86
+ run_data = @target.run @current_params
87
+ # evaluate with objectives
88
+ param_key = @current_params.to_s
89
+ result = nil
90
+ if @scores.has_key? param_key
91
+ result = @scores[param_key]
92
+ else
93
+ result = @objective.run_for_output run_data
94
+ @iteration_count += 1
95
+ end
96
+ @scores[@current_params.to_s] = result
97
+ # get next steps from optimiser
98
+ @current_params = @algorithm.run_one_iteration(@current_params, result)
99
+ end
100
+
101
+ end # end of class Experiment
102
+
103
+ end # end of module Biopsy
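The optimisation cycle in `run` and `run_iteration` (evaluate the current parameters, cache the score, ask the optimiser for the next parameters, stop when it reports completion) can be sketched with a toy optimiser. `ToySweeper` and `optimise` are hypothetical names, higher scores are treated as better here, and unlike the code above this sketch skips evaluation entirely for parameters it has already scored.

```ruby
# Hypothetical stand-in for an optimiser: walks a fixed list of candidates
# and remembers the best score seen (higher is better in this sketch).
class ToySweeper
  attr_reader :best

  def initialize(candidates)
    @remaining = candidates.dup
    @best = { parameters: nil, score: -Float::INFINITY }
    @finished = false
  end

  # Record the score for +params+ and return the next candidate,
  # or nil once every candidate has been handed out.
  def run_one_iteration(params, score)
    @best = { parameters: params, score: score } if score > @best[:score]
    next_params = @remaining.shift
    @finished = next_params.nil?
    next_params
  end

  def finished?
    @finished
  end
end

# Drive the optimiser until it is finished, scoring each distinct
# parameter set at most once. The block plays the role of the objective.
def optimise(algorithm, start)
  scores = {}
  evaluations = 0
  current = start
  until algorithm.finished?
    key = current.to_s
    unless scores.key?(key)
      scores[key] = yield(current)
      evaluations += 1
    end
    current = algorithm.run_one_iteration(current, scores[key])
  end
  [algorithm.best, evaluations]
end

candidates = [{ x: 1 }, { x: 3 }, { x: 3 }, { x: 5 }]
sweeper = ToySweeper.new(candidates.drop(1))
best, evaluations = optimise(sweeper, candidates.first) do |params|
  -((params[:x] - 3)**2)
end
puts "best #{best[:parameters]} found with #{evaluations} evaluations"
```

Keying the cache on the stringified parameter hash, as the original does, means the duplicated candidate `{ x: 3 }` is scored only once.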
@@ -0,0 +1,38 @@
1
+
2
+
3
+ # Assembly Optimisation Framework: Objective Function
4
+ #
5
+ # == Description
6
+ #
7
+ # ObjectiveFunction is a skeleton parent class to ensure
8
+ # objective functions provide the essential methods.
9
+ # Because abstract classes don't really make sense in
10
+ # Ruby's runtime compilation, we can only check if methods
11
+ # are implemented at runtime (but at least we can raise
12
+ # a sensible error)
13
+ module Biopsy
14
+
15
+ class ObjectiveFunction
16
+
17
+ # Runs the objective function for the assembly supplied,
18
+ # returning a real number value
19
+ #
20
+ # === Options
21
+ #
22
+ # * +:assemblydata+ - Hash containing data about the assembly to analyse
23
+ #
24
+ # === Example
25
+ #
26
+ # objective = ObjectiveFunction.new
27
+ # result = objective.run('example.fasta')
28
+ def run(assemblydata, threads=nil)
29
+ raise NotImplementedError.new("You must implement a run method for each objective function")
30
+ end
31
+
32
+ def essential_files
33
+ return []
34
+ end
35
+
36
+ end
37
+
38
+ end
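A concrete objective is expected to subclass `ObjectiveFunction` and override `run`. The sketch below restates the skeleton parent class so that it runs on its own, then adds a hypothetical `LineCount` objective. The result keys (`:result`, `:optimum`, `:max`, `:weighting`) follow the ones read by the handler's dimension reduction, but the `:output` input key and the scoring rule are assumptions made for illustration.

```ruby
require 'tempfile'

module Biopsy
  # Minimal restatement of the skeleton parent class so the sketch runs alone.
  class ObjectiveFunction
    def run(data, threads = 1)
      raise NotImplementedError, 'You must implement a run method for each objective function'
    end

    def essential_files
      []
    end
  end
end

# Hypothetical objective: score an output file by its number of lines.
class LineCount < Biopsy::ObjectiveFunction
  def run(data, threads = 1)
    lines = File.foreach(data[:output]).count
    { result: lines, optimum: 100, max: 100.0, weighting: 1.0 }
  end
end

# Usage: write a small temporary file and score it.
Tempfile.create('output') do |f|
  f.write("a\nb\nc\n")
  f.flush
  p LineCount.new.run({ output: f.path })
end
```

Because Ruby has no abstract classes, a subclass that forgets to override `run` is only detected when the parent's `run` raises at call time.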
@@ -0,0 +1,170 @@
1
+ require 'securerandom'
2
+ require 'fileutils'
3
+
4
+ # Assembly Optimisation Framework: Objective Function Handler
5
+ #
6
+ # == Description
7
+ #
8
+ # The Handler manages the objective functions for the optimisation experiment.
9
+ # Specifically, it finds all the objective functions and runs them when requested,
10
+ # outputting the results to the main Optimiser.
11
+ #
12
+ # == Explanation
13
+ #
14
+ # === Loading objective functions
15
+ #
16
+ # The Handler expects a directory containing objectives (by default it looks in *currentdir/objectives*).
17
+ # The *objectives* directory should contain the following:
18
+ #
19
+ # * a *.rb* file for each objective function. The file should define a subclass of ObjectiveFunction
20
+ # * (optionally) a file *objectives.txt* which lists the objective function files to use
21
+ #
22
+ # If the objectives.txt file is absent, the subset of objectives to use can be set directly in the Optimiser,
23
+ # or, if no such restriction is set, the whole set of objectives will be run.
24
+ #
25
+ # Each file listed in *objectives.txt* is loaded if it exists.
26
+ #
27
+ # === Running objective functions
28
+ #
29
+ # The Handler iterates through the objectives, calling the *run()* method
30
+ # of each by passing the assembly. After collecting results, it returns
31
+ # a Hash of the results to the parent Optimiser.
32
+ module Biopsy
33
+
34
+ class ObjectiveHandlerError < StandardError
35
+ end
36
+
37
+ class ObjectiveHandler
38
+
39
+ attr_reader :last_tempdir
40
+ attr_accessor :objectives
41
+
42
+ def initialize domain, target
43
+ @domain = domain
44
+ @target = target
45
+ base_dir = Settings.instance.base_dir
46
+ @objectives_dir = Settings.instance.objectives_dir.first
47
+ @objectives = {}
48
+ $LOAD_PATH.unshift(@objectives_dir)
49
+ @subset = Settings.instance.respond_to?(:objectives_subset) ? Settings.instance.objectives_subset : nil
50
+ self.load_objectives
51
+ # pass objective list back to caller
52
+ return @objectives.keys
53
+ end
54
+
55
+ def load_objectives
56
+ # load objectives
57
+ # load subset list if available
58
+ subset_file = @objectives_dir + '/objectives.txt'
59
+ subset = File.exist?(subset_file) ? File.readlines(subset_file).map{ |l| l.strip } : nil
60
+ subset = @subset if subset.nil?
61
+ # parse in objectives
62
+ Dir.chdir @objectives_dir do
63
+ Dir['*.rb'].each do |f|
64
+ file_name = File.basename(f, '.rb')
65
+ require file_name
66
+ objective_name = file_name.camelize
67
+ objective = Module.const_get(objective_name).new
68
+ if subset.nil? or subset.include?(file_name)
69
+ # this objective is included
70
+ @objectives[objective_name] = objective
71
+ end
72
+ end
73
+ # puts "loaded #{@objectives.length} objectives."
74
+ end
75
+ end
76
+
77
+ # Run a specific +:objective+ on the +:output+ of a target
78
+ # with max +:threads+.
79
+ def run_objective(objective, name, output, threads)
80
+ begin
81
+ # output is a hash containing the file(s) output
82
+ # by the target in the format expected by the
83
+ # objective function(s).
84
+ return objective.run(output, threads)
85
+ rescue NotImplementedError => e
86
+ puts "Error: objective function #{objective.class} does not implement the run() method"
87
+ puts "Please refer to the documentation for instructions on adding objective functions"
88
+ raise e
89
+ end
90
+ end
91
+
92
+ # Perform a euclidean distance dimension reduction of multiple objectives
93
+ # using weighting specified in the domain definition.
94
+ def dimension_reduce(results)
95
+ # calculate the weighted Euclidean distance from optimal
96
+ # d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2+...+(p_i - q_i)^2+...+(p_n - q_n)^2}
97
+ # here the max value is sqrt(n) where n is no. of results, min value (optimum) is 0
98
+ total = 0
99
+ results.each_pair do |key, value|
100
+ o = value[:optimum]
101
+ w = value[:weighting]
102
+ a = value[:result]
103
+ m = value[:max]
104
+ total += w * (((o - a)/m) ** 2)
105
+ end
106
+ return Math.sqrt(total) / results.length
107
+ end
108
+
109
+ # Run all objective functions for +:output+.
110
+ def run_for_output(output, threads=6, cleanup=true, allresults=false)
111
+ # check output files exist
112
+ @target.output_files.each_pair do |key, name|
113
+ unless File.exist?(output[key]) && File.size(output[key]) > 0
114
+ info("file #{output[key]} does not exist or is empty")
115
+ return nil
116
+ end
117
+ end
118
+ # run all objectives for output
119
+ results = {}
120
+ # create temp dir
121
+ Dir.chdir(self.create_tempdir) do
122
+ @objectives.each_pair do |name, objective|
123
+ results[name] = self.run_objective(objective, name, output, threads)
124
+ end
125
+ if cleanup
126
+ # remove all but essential files
127
+ essential_files = @domain.keep_intermediates
128
+ if essential_files
129
+ @objectives.values.each{ |objective| essential_files += objective.essential_files }
130
+ end
131
+ Dir["*"].each do |file|
132
+ next if File.directory? file
133
+ if essential_files && essential_files.include?(file)
134
+ `gzip #{file}` if @domain.gzip_intermediates
135
+ FileUtils.mv(@domain.gzip_intermediates ? "#{file}.gz" : file, '..')
136
+ end
137
+ end
138
+ end
139
+ end
140
+ if cleanup
141
+ # clean up temp dir
142
+ FileUtils.rm_rf @last_tempdir
143
+ end
144
+ if allresults
145
+ return {:results => results,
146
+ :reduced => self.dimension_reduce(results)}
147
+ else
148
+ results.each do |key, value|
149
+ return value[:result]
150
+ end
151
+ end
152
+ end
153
+
154
+ # create a guaranteed random temporary directory for storing outputs
155
+ # return the dirname
156
+ def create_tempdir
157
+ token = loop do
158
+ # generate random dirnames until we find one that
159
+ # doesn't exist
160
+ test_token = SecureRandom.hex
161
+ break test_token unless File.exist? test_token
162
+ end
163
+ Dir.mkdir(token)
164
+ @last_tempdir = token
165
+ return token
166
+ end
167
+
168
+ end
169
+
170
+ end
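The arithmetic of `dimension_reduce` can be followed with a small worked example. The sketch below restates the calculation as a standalone method; the objective names and values are made up, and the `.to_f` is an addition of this sketch to guard against integer division when every value is an integer.

```ruby
# Weighted, normalised Euclidean distance from the optimum, following the
# arithmetic of dimension_reduce above. Lower is better; 0 means every
# objective hit its optimum.
def dimension_reduce(results)
  total = results.values.sum do |r|
    r[:weighting] * (((r[:optimum] - r[:result]) / r[:max].to_f)**2)
  end
  Math.sqrt(total) / results.length
end

# Hypothetical results: ObjectiveA is 20% short of its optimum,
# ObjectiveB hits its optimum exactly.
results = {
  'ObjectiveA' => { optimum: 100, result: 80, max: 100, weighting: 1.0 },
  'ObjectiveB' => { optimum: 1.0, result: 1.0, max: 1.0, weighting: 1.0 }
}
puts dimension_reduce(results)
```

Here the normalised shortfalls are 0.2 and 0, so the sum of squares is 0.04, its square root is 0.2, and dividing by the two objectives gives a reduced score of about 0.1.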
@@ -0,0 +1,26 @@
1
+ # objective function to count number of conditioned
2
+ # reciprocal best usearch annotations
3
+
4
+ class FastestOptimum < Biopsy::ObjectiveFunction
5
+
6
+ def run(assemblydata, threads=6)
7
+ info "running objective: FastestOptimum"
8
+ t0 = Time.now
9
+ @threads = threads
10
+ # extract input data
11
+ @assembly = assemblydata[:assembly]
12
+ @assembly_name = assemblydata[:assemblyname]
13
+ @reference = assemblydata[:reference]
14
+ # results
15
+ res = self.rbusearch
16
+ return { :weighting => 1.0,
17
+ :optimum => 26000,
18
+ :max => 26000.0,
19
+ :time => Time.now - t0}.merge res
20
+ end
21
+
22
+ def essential_files
23
+ return ['bestmatches.rbu']
24
+ end
25
+
26
+ end
File without changes