RubyGems - transrate - Versions diffs - 0.0.10 → 0.0.12 - Mend

transrate 0.0.10 → 0.0.12

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

checksums.yaml +4 -4
data/.gitignore +3 -1
data/LICENSE +18 -1
data/README.md +70 -47
data/Rakefile +8 -0
data/bin/transrate +54 -48
data/lib/transrate.rb +4 -0
data/lib/transrate/assembly.rb +165 -37
data/lib/transrate/bowtie2.rb +2 -2
data/lib/transrate/comparative_metrics.rb +7 -0
data/lib/transrate/dimension_reduce.rb +1 -0
data/lib/transrate/express.rb +2 -2
data/lib/transrate/metric.rb +1 -1
data/lib/transrate/read_metrics.rb +10 -4
data/lib/transrate/reciprocal_annotation.rb +1 -0
data/lib/transrate/transrater.rb +34 -9
data/lib/transrate/usearch.rb +7 -2
data/lib/transrate/version.rb +1 -1
data/lib/transrate/writer.rb +18 -0
data/test/helper.rb +16 -0
data/transrate.gemspec +5 -5
metadata +35 -33
data/lib/transrate/#assembly.rb# +0 -130

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: e4a7687e1bc2071fe2f043245e1eff90742f4f1e
-  data.tar.gz: 61ef6386c15fe8d56485c8c5737d3283aefaede3
+  metadata.gz: f5f7d2d65376b69682c5e29c318ad35f43a5ea9a
+  data.tar.gz: 794238eafb17705f68d82296e53ffa6128bf7141
 SHA512:
-  metadata.gz: 42ab6bc0454bd683798c5e9a1d93a7687fd3282bac182275901b89435dc3c90203dc3ffab12ad128fdf4ebab99702df816f3534917a061da6efdbd120e85cf9b
-  data.tar.gz: fed477db5ad8a33560bdf25a9bff9185638e49e418ccd5fcf9808235034abbc03ea1c7dded07520f4e4d0bff3f784146417a7e11c58cf938a4007bc71e7c15fc
+  metadata.gz: 101280a09d847f28165d0a4394bb849af5e339bf782a25b7e09ad45e1fbdd694f441809b09f078848c69ff0607bedc1aff91e87c50839cd0be3a997038f381a8
+  data.tar.gz: 1cf8a710b6e7d83139eabd4b8d820a056de19715307b822c3096458cefdec89f195d0727a8b49ccc5ac648bba9e1e8ec007092abcc94796e5a3f6b3ba4c6df99
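
The digests above cover the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a minimal sketch, assuming those archives have already been extracted from the gem, they could be recomputed with Ruby's standard Digest library. The helper name `archive_digests` is hypothetical and not part of transrate or RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: return the SHA1 and SHA512 hex digests of a file,
# the two algorithms listed in checksums.yaml.
def archive_digests(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end
```

Comparing the returned values with the `+` lines above would only be meaningful for archives taken from the 0.0.12 gem itself.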

data/.gitignore CHANGED

@@ -9,7 +9,6 @@ lib/bundler/man
 pkg
 rdoc
 spec/reports
-test
 test/tmp
 test/version_tmp
 tmp
@@ -19,3 +18,6 @@ tmp
 _yardoc
 doc/
 .ruby-version
+# large test files not for repo
+dryrun

data/LICENSE CHANGED

@@ -1,4 +1,11 @@
-The MIT License (MIT)
+## Summary
+The Ruby code for Transrate is released under the MIT license.
+SNAP and CD-HIT-2D are bundled as binaries under their respective licenses
+as described below.
+## The MIT License (MIT)
 Copyright (c) 2013 Richard Smith
@@ -18,3 +25,13 @@ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
 COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
 IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+## SNAP
+SNAP is distributed as a binary in accordance with its Apache license.
+The source code for SNAP is available at https://github.com/amplab/snap
+## CD-HIT-2D
+CD-HIT-2D is distributed as a binary in accordance with ith GPLv2 license.
+The source code for CD-HIT-2D is available at https://code.google.com/p/cdhit/

data/README.md CHANGED

@@ -3,55 +3,57 @@ Transrate
 Quality analysis and comparison of transcriptome assemblies.
-## Transcriptome assembly quality metrics
-**transrate** implements a variety of established and new metrics.
-note: this list will be expanded soon with detailed explanations and a guide to interpreting the results.
-### Contig metrics
-* **n_seqs** - the number of contigs in the assembly
-* **smallest** - the size of the smallest contig
-* **largest** - the size of the largest contig
-* **n_bases** - the number of bases included in the assembly
-* **mean_len** - the mean length of the contigs
-* **n > 1k** - the number of contigs greater than 1,000 bases long
-* **n > 10k** - the number of contigs greater than 10,000 bases long
-* **nX** - the largest contig size at which at least X% of bases are contained in contigs *longer* than this length
-### Read mapping metrics
+## Contents
+1. [Development status](https://github.com/Blahah/transrate#development-status)
+2. [Transcriptome assembly quality metrics](https://github.com/Blahah/transrate#transcriptome-assembly-quality-metrics)
+3. [Installation](https://github.com/Blahah/transrate#installation)
+4. [Usage](https://github.com/Blahah/transrate#usage)
+    - [Command line](https://github.com/Blahah/transrate#command-line)
+        - [example](https://github.com/Blahah/transrate#example)
+    - [As a library](https://github.com/Blahah/transrate#as-a-library)
+5. [Requirements](https://github.com/Blahah/transrate#requirements)
+    - [Ruby](https://github.com/Blahah/transrate#ruby)
+    - [RubyGems](https://github.com/Blahah/transrate#rubygems)
+    - [USEARCH, Bowtie 2, and eXpress](https://github.com/Blahah/transrate#usearch-bowtie2-and-express)
+6. [Getting help](https://github.com/Blahah/transrate#getting-help)
+## Development status
+This software is in early development. Users should be aware that until the first release is made, features may change faster than the documentation is updated. Nevertheless, we welcome bug reports.
+[![Gem Version](https://badge.fury.io/rb/transrate.png)][gem]
+[![Build Status](https://secure.travis-ci.org/Blahah/transrate.png?branch=master)][travis]
+[![Dependency Status](https://gemnasium.com/Blahah/transrate.png?travis)][gemnasium]
+[![Code Climate](https://codeclimate.com/github/Blahah/transrate.png)][codeclimate]
+[![Coverage Status](https://coveralls.io/repos/Blahah/transrate/badge.png?branch=master)][coveralls]
+[gem]: https://badge.fury.io/rb/transrate
+[travis]: https://travis-ci.org/Blahah/transrate
+[gemnasium]: https://gemnasium.com/Blahah/transrate
+[codeclimate]: https://codeclimate.com/github/Blahah/transrate
+[coveralls]: https://coveralls.io/r/Blahah/transrate
-* **total** - the total number of reads pairs mapping
-* **good** - the number of read pairs mapping in a way indicative of good assembly
-* **bad** - the number of reads pairs mapping in a way indicative of bad assembly
-'Good' pairs are those where both members are aligned, in the correct orientation, either on the same contig or within a plausible distance of the ends of two separate contigs.
-Conversely, 'bad' pairs are those where one of the conditions for being 'good' are not met.
-Additionally, the software calculates whether there is any evidence in the read mappings that different contigs originate from the same transcript. These theoretical links are called bridges, and the number of bridges is shown in the **supported bridges** metric. The list of supported bridges is output to a file, `supported_bridges.csv`, in case you want to make use of the information. At a later date, transrate will include the ability to improve the assembly using this and other information.
-### Comparative metrics
+## Transcriptome assembly quality metrics
-* **reciprocal hits** - the number of reciprocal best hits against the reference using ublast. A high score indicates that a large number of real transcripts have been assembled.
-* **ortholog hit ratio** - the mean ratio of alignment length to reference sequence length. A low score on this metric indicates the assembly contains full-length transcripts.
-* **collapse factor** - the mean number of reference proteins mapping to each contig. A high score on this metric indicates the assembly contains chimeras.
+**transrate** implements a variety of established and new metrics. They are explained in detail [on the wiki](https://github.com/Blahah/transrate/wiki/Transcriptome-assembly-quality-metrics).
 ## Installation
-You can install transrate very easily. Just run at the terminal:
+Assuming all the requirements are met (see below), you can install transrate very easily. Just run at the terminal:
 `gem install transrate`
-If that doesn't work, check the requirements below...
+If you're new to linux/unix, there's a detailed tutorial for installing transrate with all the dependencies [on my blog](http://blahah.net/bioinformatics/2013/10/19/installing-transrate/).
 ## Usage
+### Command line
 `transrate --help` will give you...
 ```
-Transrate v0.0.1a by Richard Smith <rds45@cam.ac.uk>
+Transrate v0.0.10 by Richard Smith <rds45@cam.ac.uk>
 DESCRIPTION:
 Analyse a de-novo transcriptome
@@ -61,7 +63,7 @@ assembly using three kinds of metrics:
 2. read-mapping
 3. reference-based
-Please make sure USEARCH and bowtie2 are both installed
+Please make sure USEARCH, bowtie 2 and eXpress are installed
 and in the PATH.
 Bug reports and feature requests at:
@@ -84,18 +86,37 @@ OPTIONS:
 If you don't include --left and --right read files, the read-mapping based analysis will be skipped. I recommend that you don't align all your reads - just a subset of 500,000 will give you a very good idea of the quality. You can get a subset by running (on a linux system):
-`head -2000000 readfile.fastq`
+`head -2000000 left.fastq > left_500k.fastq`
+`head -2000000 right.fastq > right_500k.fastq`
 FASTQ records are 4 lines long, so make sure you multiply the number of reads you want by 4, and be sure to run the same command on both the left and right read files.
-### Example
+#### Example
 ```
 transrate --assembly assembly.fasta \
-		  --reference reference.fasta \
-		  --left l.fq \
-		  --right r.fq \
-		  --threads 4
+	  --reference reference.fasta \
+	  --left l.fq \
+	  --right r.fq \
+	  --threads 4
+```
+### As a library
+```ruby
+require 'transrate'
+assembly = Transrate::Assembly.new(File.expand_path('assembly.fasta'))
+reference = Transrate::Assembly.new(File.expand_path('reference.fasta'))
+t = Transrate::Transrater.new(assembly, reference)
+left = File.expand_path('left.fq')
+right = File.expand_path('right.fq')
+puts t.all_metrics(left, right)
+puts t.assembly_score
 ```
 ## Requirements
@@ -116,12 +137,14 @@ Your Ruby installation *should* come with RubyGems, the package manager for Ruby
 `gem --version`
-If you don't have it installed, I recommend installing the latest version of Ruby and RubyGems using the RVM instructions above (in the Requirements:Ruby section.
+If you don't have it installed, I recommend installing the latest version of Ruby and RubyGems using the RVM instructions above (in the [Requirements:Ruby](https://github.com/Blahah/transrate#ruby) section).
+### Usearch, Bowtie2 and eXpress
-### Usearch and Bowtie2
+Usearch (http://drive5.com/usearch), Bowtie2 (https://sourceforge.net/projects/bowtie-bio/files/bowtie2) and eXpress (http://bio.math.berkeley.edu/eXpress/) must be installed and in your PATH. Additionally, the Usearch binary executable should be named `usearch`.
-Usearch (http://drive5.com/usearch) and Bowtie2 (https://sourceforge.net/projects/bowtie-bio/files/bowtie2) must be installed and in your PATH. Additionally, the Usearch binary executable should be named `usearch`.
+## Getting help
-## Development status
+If you need help using transrate, please post to the [forum here](https://groups.google.com/forum/#!forum/transrate-users).
-This software is in very early development. Nevertheless, we welcome bug reports.
+If you think you've found a bug, please post it to the [issues list](https://github.com/Blahah/transrate/issues).
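
The README's `head -2000000` advice relies on each FASTQ record occupying exactly four lines, so 500,000 reads correspond to 2,000,000 lines. A rough Ruby equivalent is sketched below, assuming uncompressed FASTQ with unwrapped sequence and quality lines. `take_reads` is a hypothetical helper, not something transrate provides.

```ruby
# Hypothetical helper: copy the first n_reads records of a FASTQ file.
# Assumes exactly four lines per record (no wrapped sequence/quality lines).
def take_reads(input_path, output_path, n_reads)
  lines_wanted = n_reads * 4
  File.open(output_path, 'w') do |out|
    File.foreach(input_path).each_with_index do |line, i|
      break if i >= lines_wanted  # stop reading once enough records are copied
      out.write(line)
    end
  end
end
```

As with the shell version, the same read count would need to be applied to both the left and right files so the pairs stay in step.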

data/Rakefile ADDED

@@ -0,0 +1,8 @@
+require 'rake/testtask'
+Rake::TestTask.new do |t|
+  t.libs << 'test'
+end
+desc "Run tests"
+task :default => :test

data/bin/transrate CHANGED

@@ -4,21 +4,18 @@ require 'trollop'
 require 'transrate'
 opts = Trollop::options do
-  version "v0.0.1a"
+  version Transrate::VERSION::STRING.dup
   banner <<-EOS
-Transrate v0.0.1a by Richard Smith <rds45@cam.ac.uk>
+Transrate v#{Transrate::VERSION::STRING.dup} by Richard Smith <rds45@cam.ac.uk>
 DESCRIPTION:
 Analyse a de-novo transcriptome
 assembly using three kinds of metrics:
 1. contig-based
-2. read-mapping
-3. reference-based
-Please make sure USEARCH and bowtie2 are both installed
-and in the PATH.
+2. read-mapping (if --left and --right are provided)
+3. reference-based (if --reference is provided)
 Bug reports and feature requests at:
 http://github.com/blahah/transrate
@@ -30,7 +27,7 @@ OPTIONS:
 EOS
   opt :assembly, "assembly file in FASTA format", :required => true, :type => String
-  opt :reference, "reference proteome file in FASTA format", :required => true, :type => String
+  opt :reference, "reference proteome file in FASTA format", :type => String
   opt :left, "left reads file in FASTQ format", :type => String
   opt :right, "right reads file in FASTQ format", :type => String
   opt :insertsize, "mean insert size",  :default => 200, :type => Integer
@@ -45,59 +42,68 @@ end
 include Transrate
 a = Assembly.new opts.assembly
-r = Assembly.new opts.reference
+r = opts.reference ? Assembly.new(opts.reference) : nil
-puts "\n\nAnalysing assembly: #{opts.assembly}\n\n"
+transrater = Transrater.new(a, r,
+                            opts.left,
+                            opts.right,
+                            opts.insertsize,
+                            opts.insertsd)
-puts "calculating contig stats..."
-t0 = Time.now
-contig_results = a.basic_stats
-puts "...done in #{Time.now - t0} seconds"
+puts "\nAnalysing assembly: #{opts.assembly}\n\n"
-read_results = nil
-if (opts.left && opts.right)
-  puts "\ncalculating read diagnostics..."
-  t0 = Time.now
-  read_metrics = ReadMetrics.new a
-  read_metrics.run(opts.left, opts.right)
-  read_results = read_metrics.read_stats
-  puts "...done in #{Time.now - t0} seconds"
-else
-  puts "\nno reads provided, skipping read diagnostics"
-end
+report_width = 30
-puts "\ncalculating comparative metrics..."
+puts "Calculating contig metrics..."
 t0 = Time.now
-comparative_metrics = ComparativeMetrics.new(a, r)
-comparative_metrics.run
-comparative_results = comparative_metrics.comp_stats
-puts "...done in #{Time.now - t0} seconds"
-report_width = 30
+contig_results = transrater.assembly_metrics.basic_stats
 if contig_results
-  puts "\n\n"
+  puts "\n"
   puts "Contig metrics:"
   puts "-" *  report_width
   puts pretty_print_hash(contig_results, report_width)
 end
-if read_results
-  puts "\n\n"
-  puts "Read mapping metrics:"
-  puts "-" *  report_width
-  puts pretty_print_hash(read_results, report_width)
+puts "Contig metrics done in #{Time.now - t0} seconds"
+read_results = nil
+if (opts.left && opts.right)
+  puts "\ncalculating read diagnostics..."
+  t0 = Time.now
+  read_results = transrater.read_metrics(opts.left, opts.right).read_stats
+  if read_results
+    puts "\n"
+    puts "Read mapping metrics:"
+    puts "-" *  report_width
+    puts pretty_print_hash(read_results, report_width)
+  end
+  puts "Read metrics done in #{Time.now - t0} seconds"
+else
+  puts "\nNo reads provided, skipping read diagnostics"
 end
-if comparative_results
-  puts "\n\n"
-  puts "Comparative metrics:"
-  puts "-" *  report_width
-  puts pretty_print_hash(comparative_results, report_width)
+if opts.reference
+  puts "\nCalculating comparative metrics..."
+  t0 = Time.now
+  comparative_results = transrater.comparative_metrics.comp_stats
+  if comparative_results
+    puts "\n"
+    puts "Comparative metrics:"
+    puts "-" *  report_width
+    puts pretty_print_hash(comparative_results, report_width)
+  end
+  puts "Comparative metrics done in #{Time.now - t0} seconds"
 end
-transrater = Transrater.new(a, r, opts.left, opts.right)
-transrater.run(opts.left, opts.right)
-puts "\n\n"
-puts "Overall score #{transrater.assembly_score.to_f.round(2)}"
-puts "\n" + "-" * report_width
+puts "\n"
+puts "-" * report_width
+score = transrater.assembly_score
+unless score.nil?
+  puts "OVERALL SCORE: #{score.to_f.round(2) * 100}%"
+  puts "-" * report_width
+end
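
The new report prints `score.to_f.round(2) * 100`, so the multiplication happens in floating point after the rounding and the printed percentage may carry representation noise for some scores. One possible alternative, offered only as a sketch, scales first and rounds last. `format_score` is a hypothetical name and does not appear in the gem.

```ruby
# Hypothetical helper: render a 0..1 score as a percentage string.
# Scaling before rounding keeps the rounding as the final operation.
def format_score(score)
  return nil if score.nil?  # mirrors the `unless score.nil?` guard in the diff
  "#{(score.to_f * 100).round(2)}%"
end
```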

data/lib/transrate.rb CHANGED

@@ -10,3 +10,7 @@ require 'transrate/comparative_metrics'
 require 'transrate/metric'
 require 'transrate/dimension_reduce'
 require 'transrate/express'
+module Transrate
+end # Transrate

data/lib/transrate/assembly.rb CHANGED

@@ -9,12 +9,13 @@ module Transrate
     include Enumerable
     extend Forwardable
-    def_delegators :@assembly, :each, :<<
+    def_delegators :@assembly, :each, :<<, :size, :length
     attr_accessor :ublast_db
     attr_accessor :orfs_ublast_db
     attr_accessor :protein
     attr_reader :assembly
+    attr_reader :has_run
     # number of bases in the assembly
     attr_writer :n_bases
@@ -25,7 +26,7 @@ module Transrate
     # assembly n50
     attr_reader :n50
-    # Reuturn a new Assembly.
+    # Return a new Assembly.
     #
     # - +:file+ - path to the assembly FASTA file
     def initialize file
@@ -36,71 +37,198 @@ module Transrate
         @n_bases += entry.length
         @assembly << entry
       end
-      @assembly.sort_by! { |x| x.length }
     end
     # Return a new Assembly object by loading sequences
     # from the FASTA-format +:file+
-    def self.stats_from_fasta file
+    def self.stats_from20_fasta file
       a = Assembly.new file
       a.basic_stats
     end
-    def run
-      stats = self.basic_stats
+    def run threads=8
+      stats = self.basic_stats threads
       stats.each_pair do |key, value|
-        ivar = "@#{key.gsub(/ /, '_')}".to_sym
+        ivar = "@#{key.gsub(/\ /, '_')}".to_sym
+        attr_ivar = "#{key.gsub(/\ /, '_')}".to_sym
+        # creates accessors for the variables in stats
+        singleton_class.class_eval { attr_accessor attr_ivar }
         self.instance_variable_set(ivar, value)
       end
+      @has_run = true
     end
-    # Return a hash of statistics about this assembly
-    def basic_stats
+    # Return a hash of statistics about this assembly. Stats are
+    # calculated in parallel by splitting the assembly into
+    # equal-sized bins and calling Assembly#basic_bin_stat on each
+    # bin in a separate thread.
+    def basic_stats threads=8
+      # create a work queue to process contigs in parallel
+      queue = Queue.new
+      # split the contigs into equal sized bins, one bin per thread
+      binsize = (@assembly.size / threads.to_f).ceil
+      @assembly.each_slice(binsize) do |bin|
+        queue << bin
+      end
+      # a classic threadpool - an Array of threads that allows
+      # us to assign work to each thread and then aggregate their
+      # results when they are all finished
+      threadpool = []
+      # assign one bin of contigs to each thread from the queue.
+      # each thread will process its bin of contigs and then wait
+      # for the others to finish.
+      semaphore = Mutex.new
+      stats = []
+      threads.times do
+        threadpool << Thread.new do |thread|
+          # keep looping until we run out of bins
+          until queue.empty?
+            # use non-blocking pop, so an exception is raised
+            # when the queue runs dry
+            bin = queue.pop(true) rescue nil
+            if bin
+              # calculate basic stats for the bin, storing them
+              # in the current thread so they can be collected
+              # in the main thread.
+              bin_stats = basic_bin_stats bin
+              semaphore.synchronize { stats << bin_stats }
+            end
+          end
+        end
+      end
+      # collect the stats calculated in each thread and join
+      # the threads to terminate them
+      threadpool.each(&:join)
+      # merge the collected stats and return then
+      merge_basic_stats stats
+    end # basic_stats
+    # Calculate basic statistics in an single thread for a bin
+    # of contigs.
+    #
+    # Basic statistics are:
+    #
+    # - N10, N30, N50, N70, N90
+    # - number of contigs >= 1,000 base pairs long
+    # - number of contigs >= 10,000 base pairs long
+    # - length of the shortest contig
+    # - length of the longest contig
+    # - number of contigs in the bin
+    # - mean contig length
+    # - total number of nucleotides in the bin
+    # - mean % of contig length covered by the longest ORF
+    #
+    # @param [Array] bin An array of Bio::Sequence objects
+    # representing contigs in the assembly
+    def basic_bin_stats bin
+      # cumulative length is a float so we can divide it
+      # accurately later to get the mean length
       cumulative_length = 0.0
-      # we'll calculate Nx for all these x
-      x = [90, 70, 50, 30, 10]
-      x2 = x.clone
-      cutoff = x2.pop / 100.0
-      res = []
+      # we'll calculate Nx for x in [10, 30, 50, 70, 90]
+      # to do this we create a stack of the x values and
+      # pop the first one to set the first cutoff. when
+      # the cutoff is reached we store the nucleotide length and pop
+      # the next value to set the next cutoff. we take a copy
+      # of the Array so we can use the intact original to collect
+      # the results later
+      # x = [90, 70, 50, 30, 10]
+      # x2 = x.clone
+      # cutoff = x2.pop / 100.0
+      # res = []
       n1k = 0
       n10k = 0
       orf_length_sum = 0
-      @assembly.each do |s|
-        n1k += 1 if s.length > 1_000
-        n10k += 1 if s.length > 10_000
-        orf_length_sum += orf_length(s.seq)
-        cumulative_length += s.length
-        if cumulative_length >= @n_bases * cutoff
-          res << s.length
-          if x2.empty?
-            cutoff=1
-          else
-            cutoff = x2.pop / 100.0
-          end
-        end
+      # sort the contigs in ascending length order
+      # and iterate over them
+      bin.sort_by! { |c| c.seq.size }
+      bin.each do |contig|
+        # increment our long contig counters if this
+        # contig is above the thresholds
+        n1k += 1 if contig.length > 1_000
+        n10k += 1 if contig.length > 10_000
+        # add the length of the longest orf to the
+        # running total
+        orf_length_sum += orf_length(contig.seq)
+        # increment the cumulative length and check whether the Nx
+        # cutoff has been reached. if it has, store the Nx value and
+        # get the next cutoff
+        cumulative_length += contig.length
+#        if cumulative_length >= @n_bases * cutoff
+#          res << contig.length
+#          if x2.empty?
+#            cutoff=1
+#          else
+#            cutoff = x2.pop / 100.0
+#          end
+#        end
       end
+      # calculate and return the statistics as a hash
       mean = cumulative_length / @assembly.size
-      ns = Hash[x.map { |n| "N#{n}" }.zip(res)]
+ #     ns = Hash[x.map { |n| "N#{n}" }.zip(res)]
       {
-        "n_seqs" => @assembly.size,
-        "smallest" => @assembly.first.length,
-        "largest" => @assembly.last.length,
-        "n_bases" => @n_bases,
+        "n_seqs" => bin.size,
+        "smallest" => bin.first.length,
+        "largest" => bin.last.length,
+        "n_bases" => n_bases,
         "mean_len" => mean,
         "n_1k" => n1k,
         "n_10k" => n10k,
-        "orf percent" => 300*orf_length_sum/(@assembly.size*mean)
-      }.merge ns
-    end
+        "orf_percent" => 300 * orf_length_sum / (@assembly.size * mean)
+      }
+#      }.merge ns
+    end # basic_bin_stats
+    def merge_basic_stats stats
+      # convert the array of hashes into a hash of arrays
+      collect = Hash.new{|h,k| h[k]=[]}
+      stats.each_with_object(collect) do |collect, result|
+        collect.each{ |k, v| result[k] << v }
+      end
+      merged = {}
+      collect.each_pair do |stat, values|
+        if stat == 'orf_percent'  || /N[0-9]{2}/ =~ stat
+          # store the mean
+          merged[stat] = values.inject(:+) / values.size
+        elsif stat == 'smallest'
+          merged[stat] = values.min
+        elsif stat == 'largest'
+          merged[stat] = values.max
+        else
+          # store the sum
+          merged[stat] = values.inject(:+)
+        end
+      end
+      merged
+    end # merge_basic_stats
     # finds longest orf in a sequence
     def orf_length sequence
       longest=0
       (1..6).each do |frame|
         translated = Bio::Sequence::NA.new(sequence).translate(frame)
-        translated.split(/\*/).each do |orf|
+        translated.split('*').each do |orf|
           if orf.length > longest
             longest=orf.length
           end
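
The `basic_stats` rewrite above splits the contigs into bins, processes each bin on a worker thread, and merges the per-bin hashes. The pattern can be sketched in isolation as follows. This is not the gem's code: it works on plain contig lengths rather than Bio::Sequence objects, and `bin_stats`, `parallel_stats` and `merge_stats` are hypothetical names. Only statistics that combine cleanly across bins (counts, sums, min, max) are included. The mean of per-bin Nx values is not in general the Nx of the whole assembly, so Nx is left out here.

```ruby
# Hypothetical per-bin statistic: summarise an Array of contig lengths.
def bin_stats(lengths)
  {
    'n_seqs'   => lengths.size,
    'n_bases'  => lengths.inject(0, :+),
    'smallest' => lengths.min,
    'largest'  => lengths.max
  }
end

# Combine per-bin hashes: min/max for the extremes, sums for the rest.
def merge_stats(results)
  results.each_with_object({}) do |stats, merged|
    stats.each do |key, value|
      merged[key] =
        case key
        when 'smallest' then [merged[key], value].compact.min
        when 'largest'  then [merged[key], value].compact.max
        else merged.fetch(key, 0) + value
        end
    end
  end
end

# Split the lengths into one bin per thread and process bins in parallel.
def parallel_stats(lengths, threads = 4)
  return {} if lengths.empty?

  queue = Queue.new
  binsize = (lengths.size / threads.to_f).ceil
  lengths.each_slice(binsize) { |bin| queue << bin }

  results = []
  lock = Mutex.new
  workers = Array.new(threads) do
    Thread.new do
      loop do
        bin = begin
          queue.pop(true)      # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                # no work left for this thread
        end
        stats = bin_stats(bin)
        lock.synchronize { results << stats }
      end
    end
  end
  workers.each(&:join)
  merge_stats(results)
end
```

For example, `parallel_stats([100, 2000, 50, 12000, 700], 2)` reports five sequences totalling 14,850 bases, with extremes of 50 and 12,000.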

data/lib/transrate/bowtie2.rb CHANGED

@@ -21,8 +21,8 @@ module Transrate
       realistic_dist = insertsize + (3 * insertsd)
       unless File.exists? outputname
         # construct bowtie command
-        bowtiecmd = "#{@bowtie2} --very-sensitive-local -p 8 -X #{realistic_dist}" # TODO number of cores should be variable '-p 8'
-        bowtiecmd += " --no-unal"
+        bowtiecmd = "#{@bowtie2} --very-sensitive-local -k 10 -p 8 -X #{realistic_dist}" # TODO number of cores should be variable '-p 8'
+        bowtiecmd += " --no-unal --quiet"
         bowtiecmd += " #{File.basename(file)} -1 #{left}"
         # paired end?
         bowtiecmd += " -2 #{right}" if right
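
The TODO in the hunk above notes that the thread count (`-p 8`) should be variable. A sketch of how the command could be assembled with the thread count as a parameter follows. It uses only bowtie2 options visible in the diff plus `-x` (index) and `-U` (unpaired reads), which are standard bowtie2 options. The method name `bowtie2_command` and its keyword arguments are assumptions for illustration, not the gem's API.

```ruby
require 'shellwords'

# Hypothetical builder: assemble a bowtie2 command line with a
# configurable thread count instead of the hard-coded '-p 8'.
def bowtie2_command(index, left, right = nil,
                    threads: 8, max_fragment: 500, bowtie2: 'bowtie2')
  cmd = [bowtie2, '--very-sensitive-local', '-k', '10',
         '-p', threads.to_s, '-X', max_fragment.to_s,
         '--no-unal', '--quiet', '-x', index]
  # paired-end when a right-hand file is given, otherwise unpaired
  cmd += right ? ['-1', left, '-2', right] : ['-U', left]
  cmd.shelljoin  # quote each argument for safe use in a shell
end
```

Building the command as an array and joining it with `shelljoin` also keeps file names containing spaces intact, which string interpolation does not.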

data/lib/transrate/comparative_metrics.rb CHANGED

@@ -5,7 +5,10 @@ module Transrate
   class ComparativeMetrics
     attr_reader :rbh_per_contig
+    attr_reader :rbh_per_reference
     attr_reader :reciprocal_hits
+    attr_reader :reference_coverage
+    attr_reader :has_run
     def initialize assembly, reference
       @assembly = assembly
@@ -18,13 +21,17 @@ module Transrate
       @ortholog_hit_ratio = self.ortholog_hit_ratio rbu
       @collapse_factor = self.collapse_factor @ra.r2l_hits
       @reciprocal_hits = rbu.size
+      @rbh_per_reference = @reciprocal_hits.to_f / @reference.size.to_f
+      @reference_coverage = @rbh_per_reference * @collapse_factor
       @rbh_per_contig = @reciprocal_hits.to_f / @assembly.assembly.size.to_f
+      @has_run = true
     end
     def comp_stats
       {
         :reciprocal_hits => @reciprocal_hits,
         :rbh_per_contig => @rbh_per_contig,
+        :rbh_per_reference => @rbh_per_reference,
         :ortholog_hit_ratio => @ortholog_hit_ratio,
         :collapse_factor => @collapse_factor
       }
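
The two ratios added in this hunk are simple functions of the reciprocal-hit count: `rbh_per_reference` divides by the number of reference sequences, and `reference_coverage` multiplies that ratio by the collapse factor. The arithmetic is sketched below with plain numbers in place of the gem's objects. `reference_metrics` is a hypothetical name. Note that `reference_coverage` is computed in the hunk but is not among the keys shown in `comp_stats`.

```ruby
# Hypothetical helper reproducing the arithmetic from the hunk above.
def reference_metrics(reciprocal_hits, n_reference, n_contigs, collapse_factor)
  rbh_per_reference = reciprocal_hits.to_f / n_reference
  {
    :rbh_per_reference  => rbh_per_reference,
    :rbh_per_contig     => reciprocal_hits.to_f / n_contigs,
    :reference_coverage => rbh_per_reference * collapse_factor
  }
end
```

With 50 reciprocal hits, 200 reference sequences, 100 contigs and a collapse factor of 1.2, this gives 0.25 hits per reference, 0.5 hits per contig and a reference coverage of about 0.3.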

data/lib/transrate/dimension_reduce.rb CHANGED

@@ -4,6 +4,7 @@ module Transrate
     def self.dimension_reduce(metrics)
       total = 0
+      p metrics
       metrics.each do |metric|
         o = metric.origin
         w = metric.weighting

data/lib/transrate/express.rb CHANGED

@@ -15,11 +15,11 @@ module Transrate
     # in the assembly fastafile
     def quantify_expression assembly, samfile
       assembly = assembly.file if assembly.is_a? Assembly
-      cmd = "#{@express} --no-bias-correct #{assembly} #{samfile}"
+      cmd = "#{@express} --no-bias-correct #{File.expand_path assembly} #{File.expand_path samfile}"
       ex_output = 'results.xprs'
       fin_output = "#{assembly}_#{ex_output}"
       unless File.exists? fin_output
-        `#{cmd}`
+        `#{cmd} 2>&1`.split(/\n/)[1..30].join("\n")
         File.rename(ex_output, fin_output)
       end
       expression = {}
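
The new command line in this hunk captures the eXpress output, slices lines 1 to 30 and joins them, but the joined string is not assigned or printed. In Ruby, slicing an empty array with `[1..30]` returns nil, so the `.join` call would raise NoMethodError if the command produced no output at all. A nil-safe way to take the same excerpt is sketched here. `output_excerpt` is a hypothetical helper.

```ruby
# Hypothetical helper: drop the first `skip` lines of captured output
# and keep at most `limit` of the remainder. Safe on empty or nil input.
def output_excerpt(output, skip = 1, limit = 30)
  output.to_s.lines.map(&:chomp).drop(skip).first(limit).join("\n")
end
```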

data/lib/transrate/metric.rb CHANGED

@@ -6,7 +6,7 @@ module Transrate
     def initialize(name, score, origin)
       @origin = origin
-      @score = score
+      @score = score ? score : (1 - origin)
       @name = name
       @weighting = 1
     end

data/lib/transrate/read_metrics.rb CHANGED

@@ -5,9 +5,10 @@ module Transrate
     attr_reader :total
     attr_reader :bad
     attr_reader :supported_bridges
-    attr_reader :pc_good_mapping
+    attr_reader :pr_good_mapping
     attr_reader :percent_mapping
-    attr_reader :expressed_contigs
+    attr_reader :prop_expressed
+    attr_reader :has_run
     def initialize assembly
       @assembly = assembly
@@ -20,8 +21,10 @@ module Transrate
       samfile = @mapper.map_reads(@assembly.file, left, right,  insertsize, insertsd)
       self.analyse_read_mappings(samfile, insertsize, insertsd)
       self.analyse_expression(samfile)
+      @pr_good_mapping = @good.to_f / @num_pairs.to_f
       @percent_mapping = @total.to_f / @num_pairs.to_f * 100.0
-      @pc_good_mapping = @good.to_f / @num_pairs.to_f * 100.0
+      @pc_good_mapping = @pr_good_mapping * 100.0
+      @has_run = true
     end
     def read_stats
@@ -44,7 +47,8 @@ module Transrate
         :unrealistic_fragment => @unrealistic_fragment,
         :potential_bridges => @supported_bridges,
         :expressed_contigs => @expressed_contigs,
-        :unexpressed_contigs => @unexpressed_contigs
+        :unexpressed_contigs => @unexpressed_contigs,
+        :percent_expressed => @percent_expressed
       }
     end
@@ -183,6 +187,8 @@ module Transrate
           @expressed_contigs += 1
         end
       end
+      @prop_expressed = @expressed_contigs.to_f / @assembly.size
+      @percent_expressed = @prop_expressed * 100.0
     end
   end # ReadMetrics

data/lib/transrate/reciprocal_annotation.rb CHANGED

@@ -39,6 +39,7 @@ module Transrate
         reference_db = File.join(reference_dir, reference_base + ".udb")
         @usearch.makeudb_ublast @reference.file, reference_db
         @reference.ublast_db = reference_db
+        return reference_db
       end
     end

data/lib/transrate/transrater.rb CHANGED

@@ -6,24 +6,49 @@ module Transrate
     attr_reader :read_metrics
     attr_reader :comparative_metrics
-    def initialize assembly, reference, left, right, insertsize=nil, insertsd=nil
+    def initialize assembly, reference, left=nil, right=nil, insertsize=nil, insertsd=nil
       @assembly  = assembly.is_a?(Assembly)  ? assembly  : Assembly.new(assembly)
       @reference = reference.is_a?(Assembly) ? reference : Assembly.new(reference)
       @read_metrics = ReadMetrics.new @assembly
       @comparative_metrics = ComparativeMetrics.new(@assembly, @reference)
     end
-    def run left, right, insertsize=nil, insertsd=nil
-      @assembly.run
-      @read_metrics.run(left, right)
-      @comparative_metrics.run
+    def run left=nil, right=nil, insertsize=nil, insertsd=nil
+      assembly_metrics
+      if left && right
+        read_metrics left, right
+      end
+      comparative_metrics
     end
     def assembly_score
-      pg = Metric.new('pg', @read_metrics.pc_good_mapping, 0.0)
-      rbhpc = Metric.new('rbhpc', @comparative_metrics.rbh_per_contig, 0.0)
-      ec = Metric.new('ec', @read_metrics.expressed_contigs, 0.0)
-      @score = DimensionReduce.dimension_reduce([pg, rbhpc, ec])
+      @score, pg, rc = nil
+      if @read_metrics.has_run
+        pg = Metric.new('pg', @read_metrics.pr_good_mapping, 0.0)
+      end
+      if @comparative_metrics.has_run
+        rc = Metric.new('rc', @comparative_metrics.reference_coverage,
+                    0.0)
+      end
+      if (pg && rc)
+        @score = DimensionReduce.dimension_reduce([pg, rc])
+      end
+      return @score
+    end
+    def assembly_metrics
+      @assembly.run unless @assembly.has_run
+      @assembly
+    end
+    def read_metrics left=nil, right=nil
+      @read_metrics.run(left, right) unless @read_metrics.has_run
+      @read_metrics
+    end
+    def comparative_metrics
+      @comparative_metrics.run unless @comparative_metrics.has_run
+      @comparative_metrics
     end
     def all_metrics left, right, insertsize=nil, insertsd=nil
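
The new `assembly_metrics`, `read_metrics` and `comparative_metrics` accessors all follow the same run-once pattern: do the work unless `has_run` is already set, then return the metrics object. That pattern is sketched in isolation below. `LazyStage` is a hypothetical class used purely for illustration. Separately, the constructor in this hunk still calls `Assembly.new(reference)` whenever `reference` is not an Assembly, so the nil reference that `bin/transrate` now passes when `--reference` is omitted may need an explicit guard.

```ruby
# Hypothetical illustration of the run-once guard used by the accessors.
class LazyStage
  attr_reader :has_run, :result

  def initialize(&work)
    @work = work
    @has_run = false
  end

  # Run the wrapped work at most once and return self so calls can be
  # chained, e.g. stage.run.result
  def run
    unless @has_run
      @result = @work.call
      @has_run = true
    end
    self
  end
end
```

Repeated calls to `run` return the cached result, which is what lets the command-line script ask for each metric set independently without recomputing it.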

data/lib/transrate/usearch.rb CHANGED

@@ -42,7 +42,9 @@ module Transrate
     end
     def findorfs filepath, output
-      unless File.exists? output
+      if File.exists? output
+        puts "skipping ORF finding: ORF file already exists at #{output}"
+      else
         subcmd = " -findorfs #{filepath}"
         subcmd += " -output #{output}"
         subcmd += " -xlat"
@@ -53,7 +55,10 @@ module Transrate
     def run subcmd
       subcmd += " -quiet"
-      `#{@cmd}#{subcmd}`
+      ret = `#{@cmd}#{subcmd} 2>&1`
+      unless $?.exitstatus == 0
+        puts "usearch command failed: #{subcmd}\noutput:\n#{ret}"
+      end
     end
   end # Usearch
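
After this change `run` reports failures, but its return value becomes the result of the `unless` expression (nil on success) rather than the command output, which matters only if a caller used that value. An alternative sketch using the standard library's `Open3.capture2e` returns both the merged stdout/stderr output and the success flag. `run_command` is a hypothetical name, and passing the command as separate arguments is a design choice made here, not the gem's approach.

```ruby
require 'open3'

# Hypothetical runner: execute a command, merge stdout and stderr,
# warn on a non-zero exit status, and return [output, success?].
def run_command(*argv)
  output, status = Open3.capture2e(*argv)
  unless status.success?
    warn "command failed (exit #{status.exitstatus}): #{argv.join(' ')}\n" \
         "output:\n#{output}"
  end
  [output, status.success?]
end
```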

data/lib/transrate/version.rb CHANGED

@@ -4,7 +4,7 @@ module Transrate
   module VERSION
     MAJOR = 0
     MINOR = 0
-    PATCH = 10
+    PATCH = 12
     BUILD = nil
     STRING = [MAJOR, MINOR, PATCH, BUILD].compact.join('.')

data/lib/transrate/writer.rb ADDED

@@ -0,0 +1,18 @@
+module Transrate
+	class Writer
+		require 'csv'
+		def self.write name, data
+			CSV.open(name, 'wb') do |csv|
+				csv << ["metric", "value"]
+				data.each_pair do |k, v|
+					csv << [k, v]
+				end
+			end
+		end
+	end # Writer
+end # Transrate
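
Note on the new `Writer` class above: it writes a hash as a two-column CSV with a `metric,value` header row. A minimal round-trip sketch of that write step, using hypothetical helpers (`write_metrics`, `round_trip`) and made-up metric values:

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical stand-alone version of the write step shown in the diff.
def write_metrics(path, data)
  CSV.open(path, 'wb') do |csv|
    csv << ["metric", "value"]
    data.each_pair { |k, v| csv << [k, v] }
  end
end

# Writes a small hash to a temporary file and reads it back as rows.
def round_trip(data)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'metrics.csv')
    write_metrics(path, data)
    CSV.read(path)
  end
end

# Illustrative values only; these are not real transrate results.
rows = round_trip({ "n_seqs" => 3, "mean_len" => 120.5 })
```

Every field comes back from `CSV.read` as a String, so numeric metrics need converting again after reading.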

data/test/helper.rb ADDED

@@ -0,0 +1,16 @@
+require 'simplecov'
+require 'coveralls'
+SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter[
+  SimpleCov::Formatter::HTMLFormatter,
+  Coveralls::SimpleCov::Formatter
+]
+SimpleCov.start
+require 'test/unit'
+begin; require 'turn/autorun'; rescue LoadError; end
+require 'shoulda-context'
+require 'transrate'
+Turn.config.format = :pretty
+Turn.config.trace = 5
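
Note on the helper above: the `require 'turn/autorun'` is wrapped in a `rescue LoadError`, but the later `Turn.config` lines are unconditional. They would raise `NameError` if the gem is missing and nothing else defines `Turn`. A minimal sketch of guarding an optional dependency follows. The gem name and constant are made up for illustration.

```ruby
# 'hypothetical_pretty_reporter' and HypotheticalReporter are made-up names
# standing in for an optional dependency such as a test-output formatter.
begin
  require 'hypothetical_pretty_reporter'
  reporter_loaded = true
rescue LoadError
  reporter_loaded = false
end

# Only touch the library's constants when the require succeeded.
if reporter_loaded && defined?(HypotheticalReporter)
  HypotheticalReporter.config.format = :pretty
end
```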

data/transrate.gemspec CHANGED

@@ -7,7 +7,7 @@ Gem::Specification.new do |gem|
   gem.authors       = [ "Richard Smith" ]
   gem.email         = "rds45@cam.ac.uk"
   gem.licenses      = ["MIT"]
-  gem.homepage      = 'https://github.com/blahah/transrate'
+  gem.homepage      = 'https://github.com/Blahah/transrate'
   gem.summary       = %q{ quality assessment of de-novo transcriptome assemblies }
   gem.description   = %q{ a library and command-line tool for quality assessment of de-novo transcriptome assemblies }
   gem.version       = Transrate::VERSION::STRING.dup
@@ -16,14 +16,14 @@ Gem::Specification.new do |gem|
   gem.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
   gem.require_paths = %w( lib )
-  gem.add_dependency 'rake', '~> 10.1.0'
-  gem.add_dependency 'trollop', '~> 2.0'
+  gem.add_dependency 'rake'
+  gem.add_dependency 'trollop'
   gem.add_dependency 'which'
   gem.add_dependency 'bio'
-  gem.add_dependency 'bettersam', '~> 0.0.1.alpha'
+  gem.add_dependency 'bettersam'
   gem.add_development_dependency 'turn'
   gem.add_development_dependency 'simplecov'
   gem.add_development_dependency 'shoulda-context'
-  gem.add_development_dependency 'coveralls', '~> 0.6.7'
+  gem.add_development_dependency 'coveralls', '>= 0.6.7'
 end
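
Note on the dependency changes above: the pessimistic `~>` constraints are dropped or replaced with `>=`, which widens the set of versions that can be resolved. A small sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` shows the difference. `allows?` is a hypothetical helper, and the candidate version numbers are chosen purely for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: true when the version string satisfies the constraint.
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

pessimistic = '~> 0.6.7'   # old coveralls constraint in the diff
open_ended  = '>= 0.6.7'   # new coveralls constraint in the diff

puts allows?(pessimistic, '0.6.9')  # a later patch release
puts allows?(pessimistic, '0.7.0')  # the next minor release
puts allows?(open_ended,  '0.7.0')
puts allows?('>= 0', '10.3.2')      # what an unconstrained dependency records
```

The pessimistic constraint accepts 0.6.9 but rejects 0.7.0, while the open-ended one accepts both. A dependency declared with no constraint is recorded as `>= 0`, which is what the metadata hunk shows.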

metadata CHANGED

@@ -1,156 +1,156 @@
 --- !ruby/object:Gem::Specification
 name: transrate
 version: !ruby/object:Gem::Version
-  version: 0.0.10
+  version: 0.0.12
 platform: ruby
 authors:
 - Richard Smith
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-09-29 00:00:00.000000000 Z
+date: 2014-04-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 10.1.0
+        version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 10.1.0
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: trollop
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: '2.0'
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: which
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: bio
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: bettersam
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 0.0.1.alpha
+        version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
-        version: 0.0.1.alpha
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: turn
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: simplecov
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: shoulda-context
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: coveralls
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
         version: 0.6.7
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - ">="
       - !ruby/object:Gem::Version
         version: 0.6.7
-description: ' a library and command-line tool for quality assessment of de-novo transcriptome
-  assemblies '
+description: " a library and command-line tool for quality assessment of de-novo transcriptome
+  assemblies "
 email: rds45@cam.ac.uk
 executables:
 - transrate
 extensions: []
 extra_rdoc_files: []
 files:
-- .gitignore
+- ".gitignore"
 - Gemfile
 - LICENSE
 - README.md
+- Rakefile
 - bin/transrate
 - lib/transrate.rb
-- lib/transrate/#assembly.rb#
 - lib/transrate/assembly.rb
 - lib/transrate/bowtie2.rb
 - lib/transrate/comparative_metrics.rb
@@ -163,8 +163,10 @@ files:
 - lib/transrate/transrater.rb
 - lib/transrate/usearch.rb
 - lib/transrate/version.rb
+- lib/transrate/writer.rb
+- test/helper.rb
 - transrate.gemspec
-homepage: https://github.com/blahah/transrate
+homepage: https://github.com/Blahah/transrate
 licenses:
 - MIT
 metadata: {}
@@ -174,12 +176,12 @@ require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []

data/lib/transrate/#assembly.rb# DELETED

@@ -1,130 +0,0 @@
-require 'bio'
-require 'bettersam'
-require 'csv'
-require 'forwardable'
-module Transrate
-  class Assembly
-    include Enumerable
-    extend Forwardable
-    def_delegators :@assembly, :each, :<<
-    attr_accessor :ublast_db
-    attr_accessor :orfs_ublast_db
-    attr_accessor :protein
-    attr_reader :assembly
-    # number of bases in the assembly
-    attr_writer :n_bases
-    # assembly filename
-    attr_accessor :file
-    # assembly n50
-    attr_reader :n50
-    # Reuturn a new Assembly.
-    #
-    # - +:file+ - path to the assembly FASTA file
-    def initialize file
-      @file = file
-      @assembly = []
-      @n_bases = 0
-      Bio::FastaFormat.open(file).each do |entry|
-        @n_bases += entry.length
-        @assembly << entry
-      end
-      @assembly.sort_by! { |x| x.length }
-    end
-    # Return a new Assembly object by loading sequences
-    # from the FASTA-format +:file+
-    def self.stats_from_fasta file
-      a = Assembly.new file
-      a.basic_stats
-    end
-    def run
-      stats = self.basic_stats
-      stats.each_pair do |key, value|
-        ivar = "@#{key.gsub(/ /, '_')}".to_sym
-        self.instance_variable_set(ivar, value)
-      end
-    end
-    # Return a hash of statistics about this assembly
-    def basic_stats
-      cumulative_length = 0.0
-      # we'll calculate Nx for all these x
-      x = [90, 70, 50, 30, 10]
-      x2 = x.clone
-      cutoff = x2.pop / 100.0
-      res = []
-      n1k = 0
-      n10k = 0
-      orf_length_sum = 0
-      @assembly.each do |s|
-        n1k += 1 if s.length > 1_000
-        n10k += 1 if s.length > 10_000
-        orf_length_sum += orf_length(s.seq)
-        cumulative_length += s.length
-        if cumulative_length >= @n_bases * cutoff
-          res << s.length
-          if x2.empty?
-            cutoff=1
-          else
-            cutoff = x2.pop / 100.0
-          end
-        end
-      end
-      mean = cumulative_length / @assembly.size
-      ns = Hash[x.map { |n| "N#{n}" }.zip(res)]
-      {
-        "n_seqs" => @assembly.size,
-        "smallest" => @assembly.first.length,
-        "largest" => @assembly.last.length,
-        "n_bases" => @n_bases,
-        "mean_len" => mean,
-        "n_1k" => n1k,
-        "n_10k" => n10k,
-        "orf percent" => 300*orf_length_sum/(@assembly.size*mean)
-      }.merge ns
-    end
-    # finds longest orf in a sequence
-    def orf_length sequence
-      longest=0
-      (1..6).each do |frame|
-        translated = Bio::Sequence::NA.new(sequence).translate(frame)
-        translated.split(/\*/).each do |orf|
-          if orf.length > longest
-            longest=orf.length
-          end
-        end
-      end
-      return longest
-    end
-    # return the number of bases in the assembly, calculating
-    # from the assembly if it hasn't already been done.
-    def n_bases
-      unless @n_bases
-        @n_bases = 0
-        @assembly.each { |s| @n_bases += s.length }
-      end
-      @n_bases
-    end
-    def print_stats
-      self.basic_stats.map do |k, v|
-        "#{k}#{" " * (20 - (k.length + v.to_i.to_s.length))}#{v.to_i}"
-      end.join("\n")
-    end
-  end # Assembly
-end # Transrate