RubyGems - ngs-ci - Versions diffs - 0.0.1.a → 0.0.2.b - Mend

ngs-ci 0.0.1.a → 0.0.2.b

Files changed (14)

checksums.yaml +4 -4
data/.gitignore +3 -1
data/README.md +20 -7
data/TODO.org +13 -5
data/lib/NGSCI/calculator.rb +95 -66
data/lib/NGSCI/read.rb +3 -2
data/lib/NGSCI/version.rb +1 -1
data/lib/NGSCI.rb +6 -4
data/ngs-ci.gemspec +5 -4
data/spec/lib/calculator_spec.rb +112 -36
data/spec/lib/read_spec.rb +49 -27
data/spec/test_files/saturated.bam +0 -0
data/spec/test_files/saturated.bam.bai +0 -0
metadata +33 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 85d1cc53730fb57307136cd2313e81c60e55aad2
-  data.tar.gz: c6724259d4f7709b728bb7df07a3cbdcaa352173
+  metadata.gz: 1f537bf2115279943c5673a5dde124ae358972f2
+  data.tar.gz: fcb22914dfe57a1f9bdfddb71e8cfd453c9431dd
 SHA512:
-  metadata.gz: 334662df0f16af26954bc29d99bfe6ecb59289f892a34de669d57be2125482a557ce4657086f167cf50afd31f00cbdc8ae8e41e2905c7269a25b3a41adce33a9
-  data.tar.gz: 354bd86f5d1274c0cf26dc266b41f348d8ff26d1572101bd3196c72dc8aaad9f994d5703730777a91c8efa30d392bf8217d14b5a5c8a40f7998b2af2ff456ad3
+  metadata.gz: 123ab12bb812db0c6f2fafe6a6785a792b0e3596b154e45252be188ad7ab192755f4c715268b3b96f1ada8a7a6610afe67b4608352574bd033706868f410c202
+  data.tar.gz: 7f77bcb7e67204cb71f69869f8383233dcbc68d9182b7b62deb6cd186b404342d6aa65365f10f49d0eb7473f83fe562ac12348f74b63936f78d68ee3b9d1cf26
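
The `checksums.yaml` entries above are SHA1 and SHA512 digests of the `metadata.gz` and `data.tar.gz` members of the packaged gem. A minimal sketch of how such digests could be recomputed with Ruby's standard `digest` library follows; the helper names `checksums_for` and `verified?` are hypothetical and not part of RubyGems or this gem.

```ruby
require 'digest'

# Hypothetical helper: compute both digest types recorded in checksums.yaml
# for a string of raw bytes.
def checksums_for(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: compare the digests of a file on disk against the
# expected values (a Hash with 'SHA1' and 'SHA512' keys).
def verified?(path, expected)
  checksums_for(File.binread(path)) == expected
end

# Example with in-memory bytes; a real check would read the extracted
# metadata.gz or data.tar.gz instead.
puts checksums_for('example bytes')['SHA1']
```

To use it against a real package, extract the `.gem` archive first and pass the path of each member together with the values listed in `checksums.yaml`.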

data/.gitignore CHANGED

@@ -12,4 +12,6 @@
 *.o
 *.a
 mkmf.log
-*~
+*~
+.ruby-gemset
+.ruby-version

data/README.md CHANGED

@@ -1,12 +1,18 @@
-[![Build Status](https://travis-ci.org/MatthewRalston/SCI.png?branch=master)](https://travis-ci.org/MatthewRalston/SCI)
+[![Build Status](https://travis-ci.org/MatthewRalston/ngs-ci.png?branch=master)](https://travis-ci.org/MatthewRalston/ngs-ci)
-[![Gem Version](https://badge.fury.io/rb/SCI.png)](http://badge.fury.io/rb/SCI)
+[![Gem Version](https://badge.fury.io/rb/ngs-ci.png)](http://badge.fury.io/rb/ngs-ci)
-[![Coverage Status](https://coveralls.io/repos/MatthewRalston/SCI/badge.png)](https://coveralls.io/r/MatthewRalston/SCI)
+[![Coverage Status](https://coveralls.io/repos/MatthewRalston/ngs-ci/badge.svg?branch=master&service=github)](https://coveralls.io/github/MatthewRalston/ngs-ci?branch=master)
+# Todo
-# SCI
+The inconsistency between the max summed dissimilarity and the denominator calculation is likely:
+1. Issue in the complexity index (when present is max,
+    a. ( present - missing ) / max = present/max_similarity - missing/max_dissim
+# NGS Complexity Index
 NOTE: This is a project in progress.
 This gem will calculate a sequencing complexity index for BAM files.
@@ -16,7 +22,7 @@ This gem will calculate a sequencing complexity index for BAM files.
 Add this line to your application's Gemfile:
 ```ruby
-gem 'sci'
+gem 'ngs-ci'
 ```
 And then execute:
@@ -25,11 +31,18 @@ And then execute:
 Or install it yourself as:
-    $ gem install sci --pre
+    $ gem install ngs-ci --pre
+Or install manually:
+    $ git clone https://github.com/MatthewRalston/ngs-ci.git
+    $ cd ngs-ci
+    $ gem build ngs-ci.gemspec
+    $ gem install ngs-ci-[Version].gem
 ## Usage
-TODO: Write usage instructions here
+* See ```--help``` for details. More to come.
 ## Contributing

data/TODO.org CHANGED

@@ -30,10 +30,18 @@
 **** 2850 (triangular number T(L-1) L=76 J=1
 **** f(76) = 2850
 * Notes
-** U*O/L vs. 200*U*O/(L^2)
+** U = u/L
 ** U/L is the number of unique reads at that base, length normalized
-** When U/L is 1 (maximum saturation)
-** O = L/2
-** Although, average overlap can be greater than L/2 with less reads
+** U*O/L vs. 200*U*O/(L^2)
+** the average summed overlap is O/(L)
+** average because different reads in the formation have different summed overlaps
+** the average average overlap is 2L/3 or O/L/(L-1) or O/(L^2-L)
+**
+** D = d /
+** the average summed dissimilarity is D/(L)
+** average because different reads in the formation have different summed dissimilarities
+** the average average dissimilarity is L/3 or D/L/(L-1) or D/(L^2-L)
+** This matches the average similarity nicely...
+**
+** u*d / L*L*(L-1)
 * Bugs
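
The notes above state that the average pairwise dissimilarity under maximum saturation is about L/3, written as D/(L^2-L). A minimal numerical check is sketched below. It assumes that saturation means L reads of length L starting at L consecutive positions, so the dissimilarity between two reads is the distance between their start sites. The helper name `saturated_summed_dissimilarity` is hypothetical and not part of the gem.

```ruby
# Hypothetical helper: sum of |i - j| over all ordered pairs of start
# offsets 0...L (pairs with i == j contribute zero).
def saturated_summed_dissimilarity(read_length)
  offsets = (0...read_length).to_a
  offsets.sum { |i| offsets.sum { |j| (i - j).abs } }
end

read_length = 76
d = saturated_summed_dissimilarity(read_length)

# Average over the L * (L - 1) ordered pairs of distinct reads.
average = d.to_f / (read_length**2 - read_length)

puts "D = #{d}"
puts "D/(L^2-L) = #{average.round(3)}"
puts "L/3       = #{(read_length / 3.0).round(3)}"
```

Under these assumptions the ratio works out to (L+1)/3, which is close to the L/3 quoted in the notes for typical read lengths.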

data/lib/NGSCI/calculator.rb CHANGED

@@ -7,16 +7,21 @@ module NGSCI
   # A calculator calculates the sequencing complexity index.
   #
-  # @!attribute [r] sci
+  # @author Matthew Ralston
+  # @abstract A class for calculating the complexity index on next generation sequencing reads
+  # @attr_reader [Integer] block_size The block size for parallelizing disk access
+  # @attr_reader [Hash<Symbol,Integer>] chroms A hash of chromosomes and their sizes
+  # @attr_reader [Integer] read_length The read length obtained from a bam file
+  # @attr_reader [Integer] denominator The denominator and normalization factors calculated from the read length
   class Calculator
-    attr_reader :sci, :block_size, :buffer, :chroms
+    attr_reader :block_size, :chroms, :read_length, :denominator
     # A new calculator to compute the sequencing complexity index given
     # a loaded Bio::DB::Sam object and optional thread argument.
     #
-    # @param bam [Bio::DB::Sam] Opened bam file with loaded reference.
-    # @param threads [Int] The number of threads used to compute NGSCI.
-    # @param strand [String] One of [FR RF F] or nil for strandedness.
+    # @param [Bio::DB::Sam] bam Opened bam file with loaded reference.
+    # @param [Int] threads The number of threads used to compute NGSCI.
+    # @param [String] strand One of [FR RF F] or nil for strandedness.
     def initialize(bam, reference, strand: nil, threads: 1)
       @block_size = 1600
       @results = nil
@@ -28,7 +33,8 @@ module NGSCI
       @bam.open
       @threads = threads
       @chroms = reference_sequences(reference)
-      read_length
+      @read_length = NGSCI::Calculator.read_length_calc(@bam,@block_size)
+      @denominator = denominator_calc(@read_length)
       if strand
         unless %w(FR RF F).include?(strand)
           raise NGSCI::NGSCIError.new "Strand specific option #{strand} is invalid." +
@@ -42,6 +48,7 @@ module NGSCI
     # Calculation of the sequencing complexity index
     #
+    # @param runtime [false] Print profiling information?
     def run(runtime: false)
       RubyProf.start if runtime
       # Convert each aligned read to Read class
@@ -83,16 +90,16 @@ module NGSCI
     #
     # @param chrom [String] The chromosome from the bam file
     # @param i [Integer] The number of blocks that have been read
-    # @return localNGSCI [Hash<Symbol,Array>]
-    #   * :+ (Array[Integer]) The NGSCI for the + strand
-    #   * :- (Array[Integer]) The NGSCI for the - strand
+    # @return [Hash<Symbol,Array>]
+    #   * :+ (Array[Array]) The NGSCI for the + strand along the
+    #   * :- (Array[Array]) The NGSCI for the - strand
     def readblock(chrom,i)
       reads=[]
       results = @strand ? {"+" => [],"-" => []}: {nil => []}
-      start = [0,(i * @block_size) - @buffer].max
+      start = [0,(i * @block_size) - @read_length].max
       stop = [(i + 1) * @block_size, self.chroms[chrom]].min
       @bam.fetch(chrom,start,stop) {|read| reads << convert(read)}
-      start += @buffer unless start == 0
+      start += @read_length unless start == 0
       reads.compact!
       reads.sort_by!(&:start) unless reads.empty?
       x=0
@@ -109,93 +116,114 @@ module NGSCI
       return results
     end
     # Calculates sequencing complexity index for a single base
     #
     # @param reads [Array<NGSCI::Read>] A group of reads aligned to a single base.
-    # @return sci [Float]
+    # @return [Array<Integer,Integer,Float,Float>]
     def sci(reads)
       numreads=reads.size
       # Groups reads by start site
       # selects the largest read length from the groups
-      reads = reads.group_by(&:start).map{|k,v| v.max{|x,y| (x.stop-x.start).abs <=> (y.stop-y.start).abs}}
-      o = summed_overlaps(reads)
+      reads = reads.group_by(&:start).map{|k,v| v.max{|x,y| x.length <=> y.length}}
+      d = summed_dissimilarity(reads)
       uniquereads = reads.size
-      return [numreads,uniquereads,(@buffer*o.to_f/@denom).round(4),(300*uniquereads*o/(2*@denom)).round(4)]
+      return [numreads,uniquereads,(d.to_f/@read_length).round(4),(100*uniquereads*d/@denominator).round(4)]
     end
-    # Calculates summed overlap between a group of reads
+    # Calculation of the dissimilarity between two reads
+    #
+    # @param read1 [NGSCI::Read]  First read to be compared
+    # @param read2 [NGSCI::Read]  Second read to be compared
+    # @return [Integer] Length of non-overlapping/unique bases
+    def dissimilarity(read1,read2)
+      if read1.start > read2.start
+        if read1.stop < read2.stop # Read 1 is inside read 2
+          (read1.start - read2.start) + (read2.stop - read1.stop)
+        else # Normal overlap
+          read1.start - read2.start
+        end
+      else
+        if read1.stop > read2.stop # Read 2 is inside read 1
+          (read2.start - read1.start) + (read1.stop - read2.stop)
+        else # Normal overlap
+          read2.start - read1.start
+        end
+      end
+    end
+    # Calculates summed dissimilarity between a group of reads
     #
     # @param reads [Array<NGSCI::Read>] Array of reads
-    # @return avg_overlap [Integer] Summed overlap between reads
-    def summed_overlaps(reads)
+    # @return [Integer] Sum of all dissimilarities between the group of reads
+    def summed_dissimilarity(reads)
       numreads = reads.size
       sum=0
-      unless numreads == 1
+      unless numreads <= 1
         i = 0
         while i < numreads
           r1 = reads[i] # for each of n reads
           sum+=reads.
                 reject{|r| r == r1}. # select the n-1 other reads
-                map{|r| overlap(r,r1)}. # calculate their overlap to r1
+                map{|r| dissimilarity(r,r1)}. # calculate their dissimilarity to r1
                 reduce(:+)
           i+=1
         end
       end
       return sum
+    end
+    # Calculates the maximum summed dissimilarity: the sum, over every read, of that read's dissimilarity to all other reads under maximum saturation
+    #
+    # @param [Integer] read_length The read length
+    # @return [Integer] max_summed_dissimilarity
+    def max_summed_dissimilarity(read_length)
+      # For each unique read under maximum saturation, calculate the sum of dissimilarities for that read to all other reads
+      summed_dissimilarities = (1..read_length).to_a.map { |r|
+        (read_length ** 2) / 2 - read_length*r + read_length/2 + r**2 - r }.reduce(:+)
     end
-    # Calculation of the overlap between two reads
-    #
-    # @param read1 [NGSCI::Read] First read to be compared
-    # @param read2 [NGSCI::Read] First read to be compared
-    # @return overlap_length [Integer] Length of overlap
-    def overlap(read1,read2)
-      if read1.start > read2.start
-        if read1.stop < read2.stop # Read 1 is inside read 2
-          read1.stop - read1.start
-        else # Normal overlap
-          read2.stop - read1.start
-        end
-      else
-        if read1.stop > read2.stop # Read 2 is inside read 1
-          read2.stop - read2.start
-        else # Normal overlap
-          read1.stop - read2.start
-        end
-      end
-    end
+    # Calculates the denominator for the complexity index from the read length, assuming maximum saturation (i.e. number of unique reads == read_length)
+    # unique reads /read length * summed_dissimilarity / (max_summed_dissimilarity/(read length * read length)
+    # Denominator = read length * max_summed_dissimilarity / (read_length * read_length)
+    #
+    # @param [Integer] read_length The read length
+    # @return [Integer] denominator The denominator including normalization factors for the complexity index 349184
+    def denominator_calc(read_length)
+      read_length*max_summed_dissimilarity(read_length)
+    end
-    # Loads the read length from a bam file into the @buffer variable
+    # Calculates the read length of a bam file by sampling at least one full block of reads
     #
-    def read_length
-      buffer=0
-      stats=@bam.index_stats.select {|k,v| k != "*" && v[:mapped_reads] > 0}
+    # @param [Bio::DB::Sam] bam A bam reader object
+    # @param [Integer] block_size The number of reads to read from a bam file
+    # @return [Integer] read_length The read length acquired from reading a block at a time until at least 100 reads are acquired
+    def self.read_length_calc(bam,block_size)
+      stats=bam.index_stats.select {|k,v| k != "*" && v[:mapped_reads] > 0}
       if stats.empty?
         raise NGSCIIOError.new "BAM file is empty! Check samtools idxstats."
-      else
-        i=0
-        lengths=[]
-        test = @block_size
-        while i <= test
-          @bam.view do |read|
-            lengths << read.seq.size
-            i +=1
-          end
-          if i == test && lengths.size < 100
-            test += @block_size
-          end
+      end
+      i=0
+      lengths=[]
+      test = block_size
+      while i <= test
+        bam.view do |read|
+          lengths << read.seq.size
+          i +=1
+        end
+        if i == test && lengths.size < 100
+          test += block_size
         end
-        @buffer = lengths.max
-        @denom = @buffer**2 * (@buffer - 1)**2
       end
+      lengths.max
     end
     # Converts strand specific BAM read into a sequence object format
     # Uses the @strand instance variable to determine the strand of conversion
     #
-    # @param read [Bio::DB::Alignment] Read to be converted.
-    # @return read [NGSCI::Read] Converted Read object
+    # @param [Bio::DB::Alignment] read Read to be converted.
+    # @return [NGSCI::Read] read Converted Read object
     def convert(read)
       unless read.query_unmapped
         if @strand
@@ -212,7 +240,7 @@ module NGSCI
     # Assumes paired-end strand-specific sequencing with "fr" chemistry
     #
     # @param read [Bio::DB::Alignment] Read to be converted.
-    # @return read [NGSCI::Read] Converted Read object
+    # @return [NGSCI::Read] Converted Read object
     def fr(read)
       if read.first_in_pair
         read.query_strand ? newread(read,strand:"+") : newread(read,strand:"-")
@@ -226,7 +254,7 @@ module NGSCI
     # Assumes paired-end strand-specific sequencing with "rf" chemistry
     #
     # @param read [Bio::DB::Alignment] Read to be converted.
-    # @return read [NGSCI::Read] Converted Read object
+    # @return [NGSCI::Read] Converted Read object
     def rf(read)
       if read.first_in_pair
         read.query_strand ? newread(read,strand:"-") : newread(read,strand:"+")
@@ -240,7 +268,7 @@ module NGSCI
     # Assumes single-end strand-specific sequencing with "f" chemistry
     #
     # @param read [Bio::DB::Alignment] Read to be converted.
-    # @return read [NGSCI::Read] Converted Read object
+    # @return [NGSCI::Read] Converted Read object
     def f(read)
       read.query_strand ? newread(read,strand:"+") : newread(read,strand:"-")
     end
@@ -249,7 +277,7 @@ module NGSCI
     #
     # @param read [Bio::DB::Alignment] Aligned read to be converted
     # @param strand [String] Strand of read
-    # @return read [NGSCI::Read] Converted Read object
+    # @return [NGSCI::Read] Converted Read object
     def newread(read,strand: nil)
       Read.new(read.pos,read.pos+read.seq.size,strand: strand)
     end
@@ -257,7 +285,7 @@ module NGSCI
     # Acquires names and sizes of reference sequences included in the bam file
     #
     # @param reference [String] Path to reference fasta file.
-    # @return chromosomes [Hash<Symbol,Object>] A dictionary of chromosome sizes
+    # @return [Hash<Symbol,Integer>] A dictionary of chromosome sizes
     def reference_sequences(reference)
       chromosomes={}
       Bio::FastaFormat.open(@reference).each_entry do |f|
@@ -265,6 +293,7 @@ module NGSCI
       end
       chromosomes.select {|chrom| @bam.index_stats.keys.include?(chrom)}
     end
     # Exports the results to outfile
     #
     # @param outfile [String] Path to outfile
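
The switch from overlap to dissimilarity in this file can be tried without a BAM file. The sketch below mirrors the branch logic of `dissimilarity` and `summed_dissimilarity` from the diff, using a stand-in `SketchRead` struct (a hypothetical name) in place of `NGSCI::Read`. Summing over ordered pairs with `permutation(2)` is a simplification chosen here, not the loop used in the gem.

```ruby
# Stand-in for NGSCI::Read; only start and stop are needed here.
SketchRead = Struct.new(:start, :stop)

# Bases not shared between two reads, following the branches in the diff.
def dissimilarity(read1, read2)
  if read1.start > read2.start
    if read1.stop < read2.stop # read 1 is inside read 2
      (read1.start - read2.start) + (read2.stop - read1.stop)
    else # staggered overlap
      read1.start - read2.start
    end
  elsif read1.stop > read2.stop # read 2 is inside read 1
    (read2.start - read1.start) + (read1.stop - read2.stop)
  else # staggered overlap
    read2.start - read1.start
  end
end

# Sum of dissimilarities over every ordered pair of distinct reads.
# Returns 0 for an empty array or a single read.
def summed_dissimilarity(reads)
  reads.permutation(2).sum { |r1, r2| dissimilarity(r1, r2) }
end

staggered = [SketchRead.new(0, 4), SketchRead.new(1, 5), SketchRead.new(2, 6)]
nested    = [SketchRead.new(0, 10), SketchRead.new(2, 5)]

puts summed_dissimilarity(staggered)
puts dissimilarity(nested[0], nested[1])
```

Because each pair is visited in both orders, the summed value counts every pairwise dissimilarity twice, which matches the two-read spec in `calculator_spec.rb`.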

data/lib/NGSCI/read.rb CHANGED

@@ -4,9 +4,10 @@ module NGSCI
   #
   # @!attribute [r] start
   # @!attribute [r] stop
+  # @!attribute [r] length
   # @!attribute [r] strand
   class Read
-    attr_reader :start, :stop, :strand
+    attr_reader :start, :stop, :length, :strand
     def initialize(start,stop,strand: nil)
 =begin DEPRECATED chromosome variable
       unless chr.is_a?(String)
@@ -24,8 +25,8 @@ module NGSCI
       end
       @start=start
       @stop=stop
+      @length=stop-start
       @strand=strand
     end
   end
 end

data/lib/NGSCI/version.rb CHANGED

@@ -1,3 +1,3 @@
 module NGSCI
-  VERSION = "0.0.1.a"
+  VERSION = "0.0.2.b"
 end

data/lib/NGSCI.rb CHANGED

@@ -1,20 +1,21 @@
-require 'yell'
 # NGSCI stands for NGS Complexity Index
 # This program calculates a sequencing complexity index for each base and/or strand in a genome.
 # This program calculates this by averaging average overlaps of reads aligned to that base.
 module NGSCI
+  require 'yell'
   # For custom error handling in the future, unimplemented
   class NGSCIError < StandardError; end
   class NGSCIIOError < NGSCIError; end
-  class NGSCIArgError < NGSCIError; end
+  class NGSCIArgError < NGSCIError; end
   # Create the universal logger and include it in Object
   # making the logger object available everywhere
   format = Yell::Formatter.new("[%5L] %d : %m", "%Y-%m-%d %H:%M:%S")
   # http://xkcd.com/1179/
-  Yell.new(:format => format) do |l|
+  logger = Yell.new(:format => format) do |l|
     l.level = :info
     l.name = Object
     l.adapter STDOUT, level: [:debug, :info, :warn]
@@ -29,3 +30,4 @@ require 'NGSCI/cmd'
 require 'NGSCI/version'
 require 'NGSCI/calculator'
 require 'NGSCI/read'
+#require 'yell'

data/ngs-ci.gemspec CHANGED

@@ -22,14 +22,15 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'trollop','~> 2.1.2'
   spec.add_dependency 'bio-samtools', '= 2.3.2'
   spec.add_dependency 'parallel', '~> 1.4'
-  spec.add_dependency 'yell'
+  spec.add_dependency 'yell', '~> 2'
   spec.add_dependency "ruby-prof", "~> 0.15"
-  spec.has_rdoc = 'yard'
+  spec.has_rdoc = 'yard', '~> 0'
-  spec.add_development_dependency "bundler"
+  spec.add_development_dependency "bundler", "~> 1"
   spec.add_development_dependency "rake", "~> 10.0"
   spec.add_development_dependency "rspec", "~> 3.1"
+  spec.add_development_dependency "pry", "~> 0"
   #spec.add_development_dependency "guard", "~> 2.12"
-  spec.add_development_dependency "coveralls"
+  spec.add_development_dependency "coveralls", "~> 0"
   #spec.add_development_dependency "cucumber",  "~> 1.3"
 end
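
The gemspec changes above replace open-ended dependencies with pessimistic (`~>`) constraints. The sketch below uses `Gem::Requirement` and `Gem::Version` from RubyGems to show which versions such constraints admit. It also shows that assigning two comma-separated values to an attribute stores an Array, which is consistent with the `has_rdoc` entry in the `metadata` diff. `allowed?` and `SketchSpec` are hypothetical names used only for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: does a version satisfy a constraint string?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 2' admits any 2.x release but not 3.0.
puts allowed?('~> 2', '2.2.0')
puts allowed?('~> 2', '3.0.0')

# '~> 0' admits 0.x releases only.
puts allowed?('~> 0', '0.10.3')
puts allowed?('~> 0', '1.0.0')

# Stand-in object with a single writable attribute.
SketchSpec = Struct.new(:has_rdoc)
spec = SketchSpec.new

# Two comma-separated values on the right-hand side become one Array.
spec.has_rdoc = 'yard', '~> 0'
p spec.has_rdoc
```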

data/spec/lib/calculator_spec.rb CHANGED

@@ -3,11 +3,11 @@ require 'bio-samtools'
 testbam="spec/test_files/test.bam"
 emptybam="spec/test_files/empty.bam"
+saturatedbam="spec/test_files/saturated.bam"
 testfasta="spec/test_files/test.fa"
 testout="spec/test_files/testfile.txt"
 describe "#run" do
   context "during a strand specific run" do
     before(:each) do
@@ -68,21 +68,21 @@ describe "#sci" do
   context "when passed an array of read objects" do
     before(:each) do
       @calc = NGSCI::Calculator.new(testbam,testfasta)
-      @bam = Bio::DB::Sam.new(:bam=>testbam,:fasta=>testfasta)
+      @bam = Bio::DB::Sam.new(:bam=>saturatedbam,:fasta=>testfasta)
       @bam.open
       @reads = []
-      @bam.fetch("NC_001988.2",75,75){|x| read = @calc.convert(x); @reads << read unless read.nil?}
-      @reads = @reads.uniq{|r|r.start}
+      @bam.fetch("NC_001988.2",76,76){|x| read = @calc.convert(x); @reads << read unless read.nil?}
+      @reads = @reads.uniq{|x| x.start}
     end
     it "returns an array" do
       expect(@calc.sci(@reads)).to be_kind_of(Array)
     end
     it "returns the sequencing complexity index" do
-      expect(@calc.sci(@reads)[-1]).to eq(0.0)
+      expect(@calc.sci(@reads)[-1]).to eq(100.0)
     end
   end
   context "when passed an empty array" do
-    it "returns nil" do
+    it "returns zero" do
       @calc = NGSCI::Calculator.new(testbam,testfasta)
       empty_sci = @calc.sci([])[-1]
       expect(empty_sci).to be_zero
@@ -90,26 +90,36 @@ describe "#sci" do
   end
 end
-describe "#read_length" do
-  it "calculates the read length" do
-    @calc=NGSCI::Calculator.new(testbam,testfasta)
-    expect(@calc.buffer).to eq(76)
+describe "#dissimilarity" do
+  before(:each) do
+    @bam=Bio::DB::Sam.new(:bam => testbam, :fasta => testfasta)
+    @bam.open
+    @reads = []
+    @bam.fetch("NC_001988.2",0,200) {|x| @reads << x }
+    @calc = NGSCI::Calculator.new(testbam,testfasta)
+    @read1 = @calc.convert(@reads[2])
+    @read2 = @calc.convert(@reads[3])
   end
-  it "fails on an empty bam file" do
-    expect{NGSCI::Calculator.new(emptybam,testfasta)}.to raise_error(NGSCI::NGSCIIOError)
-    `rm #{emptybam}.bai`
+  it "calculates the unique bases of the first read from the second" do
+    expect(@calc.dissimilarity(@read1,@read2)).to eq(62)
+  end
+  it "calculates the unique bases, regardless of the order" do
+    expect(@calc.dissimilarity(@read2,@read1)).to eq(62)
   end
 end
-describe "#summed_overlaps" do
+describe "#summed_dissimilarity" do
   it "returns an int" do
     @bam=Bio::DB::Sam.new(:bam=>testbam,:fasta=>testfasta)
     @bam.open
     @reads = []
     @calc=NGSCI::Calculator.new(testbam,testfasta)
-    @bam.fetch("NC_001988.2",8,75) {|x| read=@calc.convert(x); @reads << read if read}
+    @bam.fetch("NC_001988.2",8,75) {|x| @reads << @calc.convert(x) }
     @reads = @reads.uniq{|r| r.start}
-    expect(@calc.summed_overlaps(@reads)).to be_an(Integer)
+    expect(@calc.summed_dissimilarity(@reads)).to be_an(Integer)
   end
   context "when passed an array of read objects" do
     before(:each) do
@@ -120,13 +130,15 @@ describe "#summed_overlaps" do
       @bam.fetch("NC_001988.2",8,75) {|x| read=@calc.convert(x); @reads << read if read}
       @reads = @reads.uniq{|r| r.start}
     end
-    it "returns the #overlap of two reads" do
-      summed_overlap = 2*@calc.overlap(@reads[0],@reads[1])
-      expect(@calc.summed_overlaps(@reads[0..1])).to eq(summed_overlap)
+    context "when passed two reads" do
+      it "returns the sum of their dissimilarities" do
+        summed_dissimilarity = @calc.dissimilarity(@reads[0],@reads[1]) + @calc.dissimilarity(@reads[1],@reads[0])
+        expect(@calc.summed_dissimilarity(@reads[0..1])).to eq(summed_dissimilarity)
+      end
     end
-    it "calculates the average overlap between a group of reads" do
-      expect(@calc.summed_overlaps(@reads[0..7]).round(4)).to eq(380.0)
+    it "calculates the summed dissimilarity of a group of reads" do
+      expect(@calc.summed_dissimilarity(@reads[0..7]).round(4)).to eq(532.0)
     end
   end
   context "when passed an array with a single read object" do
@@ -136,33 +148,97 @@ describe "#summed_overlaps" do
       @reads=[]
       @calc=NGSCI::Calculator.new(testbam,testfasta)
       @bam.fetch("NC_001988.2",8,75) {|x| read=@calc.convert(x); @reads << read if read}
-      expect(@calc.summed_overlaps([@reads[0]])).to be_zero
+      expect(@calc.summed_dissimilarity([@reads[0]])).to be_zero
     end
   end
   context "when passed an empty array" do
     it "returns zero" do
       @calc=NGSCI::Calculator.new(testbam,testfasta)
-      expect(@calc.summed_overlaps([])).to be_zero
+      expect(@calc.summed_dissimilarity([])).to be_zero
     end
   end
 end
-describe "#overlap" do
-  before(:each) do
-    @bam=Bio::DB::Sam.new(:bam=>testbam,:fasta=>testfasta)
-    @bam.open
-    @reads=[]
-    @bam.fetch("NC_001988.2",0,200) {|x| @reads << x}
-    @calc=NGSCI::Calculator.new(testbam,testfasta)
-    @read1=@calc.convert(@reads[2])
-    @read2=@calc.convert(@reads[3])
+describe "#max_summed_dissimilarity" do
+  context "when passed an integer read length" do
+    before(:each) do
+      @read_length = 76
+      @calc = NGSCI::Calculator.new(testbam,testfasta)
+    end
+    it "returns an integer" do
+      expect(@calc.max_summed_dissimilarity(@read_length)).to be_kind_of Integer
+    end
   end
-  it "calculates the overlap between two reads" do
-    expect(@calc.overlap(@read1,@read2)).to eq(14)
+  context "when calculating the maximum summed dissimilarity" do
+    before(:each) do
+      @read_length = 76
+      @calc = NGSCI::Calculator.new(saturatedbam,testfasta)
+      @bam = Bio::DB::Sam.new(:bam=>saturatedbam,:fasta=>testfasta)
+      @bam.open
+      @reads = []
+      @bam.fetch("NC_001988.2",76,76){|x| read = @calc.convert(x); @reads << read unless read.nil?}
+      @reads = @reads.uniq{|x| x.start}
+    end
+    it "yields the triangular sum dissimilarity" do
+      # This test demonstrates that the simplified (more efficient) formula for maximum summed dissimilarity
+      # is equivalent to the triangular sum formula for the maximum summed dissimilarity within a group of reads
+      def tri(x,n=0)
+        return x == 0 ? n : tri(x-1,n+x)
+      end
+      triangular_sum = (1..@read_length).to_a.map{|x|
+        tri(@read_length - x) + tri(x - 1)
+      }.reduce(:+)
+      calculated_max_summed_dissimilarity = @calc.max_summed_dissimilarity(@read_length)
+      expect(calculated_max_summed_dissimilarity).to eq(triangular_sum)
+    end
+    it "is equal to the #summed_dissimilarity of saturated reads" do
+      # This test demonstrates that the formula for the theoretical maximum summed dissimilarity among reads
+      # is equivalent to the summed dissimilarity under maximum saturation (the saturated.bam test file)
+      theoretical_max_summed_dissimilarity = @calc.max_summed_dissimilarity(@read_length)
+      expect(theoretical_max_summed_dissimilarity).to eq(@calc.summed_dissimilarity(@reads))
+    end
   end
+  context "when averaging per read" do
+    it "is equal to 1/3 times (read_length - 1)" do
+      @calc = NGSCI::Calculator.new(testbam,testfasta)
+      (32..200).each do |read_length|
+        calculated_max_summed_dissimilarity = @calc.max_summed_dissimilarity(read_length)/(read_length*read_length)
+        expect(calculated_max_summed_dissimilarity).to eq((read_length-1)/3)
+      end
+    end
+  end
+end
-  it "calculates the overlap regardless of order" do
-    expect(@calc.overlap(@read2,@read1)).to eq(14)
+describe "#denominator_calc" do
+  context "when passed an integer read length" do
+    before(:each) do
+      @calc = NGSCI::Calculator.new(testbam,testfasta)
+    end
+    it "returns an integer denominator" do
+      read_length = 76
+      expect(@calc.denominator_calc(read_length)).to be_kind_of Integer
+    end
+  end
+  it "is the max_summed_dissimilarity * read length" do
+    @calc = NGSCI::Calculator.new(testbam,testfasta)
+    (32..200).each do |read_length|
+      max_sum_dissim = @calc.max_summed_dissimilarity(read_length)
+      expect(@calc.denominator_calc(read_length)).to eq(read_length*max_sum_dissim)
+    end
+  end
+end
+describe "#read_length_calc" do
+  it "calculates the read length" do
+    @bam=Bio::DB::Sam.new(:bam => testbam,:fasta => testfasta)
+    test_block_size = 100
+    expect(NGSCI::Calculator.read_length_calc(@bam,test_block_size)).to eq(76)
+  end
+  it "fails on an empty bam file" do
+    @emptybam = Bio::DB::Sam.new(:bam => emptybam, :fasta => testfasta)
+    expect{NGSCI::Calculator.read_length_calc(@emptybam,100)}.to raise_error(NGSCI::NGSCIIOError)
   end
 end
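
The `#max_summed_dissimilarity` specs above compare the per-term formula with a sum of triangular numbers for a read length of 76. The sketch below repeats that comparison over a range of read lengths and adds a closed form, (L-1)·L·(L+1)/3, which is a simplification assumed here and not taken from the gem. The per-term expression is evaluated with `Rational` so that the halves are not truncated by integer division for odd read lengths; whether that truncation affects the gem's results is not established here.

```ruby
# Triangular number T(n) = 0 + 1 + ... + n
def tri(n)
  n * (n + 1) / 2
end

# For each of the L saturated reads, add its dissimilarity to every other read.
def triangular_max(read_length)
  (1..read_length).sum { |x| tri(read_length - x) + tri(x - 1) }
end

# Per-term expression from the diff, evaluated exactly with Rational.
def per_term_max(read_length)
  l = Rational(read_length)
  (1..read_length).sum { |r| l**2 / 2 - l * r + l / 2 + r**2 - r }
end

# Assumed closed form (not in the gem): (L - 1) * L * (L + 1) / 3
def closed_form_max(read_length)
  (read_length - 1) * read_length * (read_length + 1) / 3
end

[75, 76].each do |l|
  puts [l, triangular_max(l), per_term_max(l).to_i, closed_form_max(l)].inspect
end
```

The three functions are meant to agree for every read length, so any of them could serve as a reference value when testing the gem's own implementation.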

data/spec/lib/read_spec.rb CHANGED

@@ -1,35 +1,57 @@
 require 'spec_helper'
-describe "reads" do
-  it "fails to instantiate on a string start site" do
-    expect{NGSCI::Read.new("foo",3)}.to raise_error(NGSCI::NGSCIError)
-  end
-  it "fails to instantiate on a string stop site" do
-    expect{NGSCI::Read.new(1,"foo")}.to raise_error(NGSCI::NGSCIError)
-  end
-  it "fails to instantiate when the stop site is greater than the start site" do
-    expect{NGSCI::Read.new(3,1)}.to raise_error(NGSCI::NGSCIError)
-  end
-  it "fails to instantiate on an improper strand argument" do
-    expect{NGSCI::Read.new(1,3,strand:"foo")}.to raise_error(NGSCI::NGSCIError)
-  end
-  it "fails to instantiate without the three necessary arguments" do
-    expect{NGSCI::Read.new(1)}.to raise_error(ArgumentError)
+describe NGSCI::Read do
+  context "before created" do
+    it "fails to instantiate on a string start site" do
+      expect{NGSCI::Read.new("foo",3)}.to raise_error(NGSCI::NGSCIError)
+    end
+    it "fails to instantiate on a string stop site" do
+      expect{NGSCI::Read.new(1,"foo")}.to raise_error(NGSCI::NGSCIError)
+    end
+    it "fails to instantiate when the start site is greater than the stop site" do
+      expect{NGSCI::Read.new(3,1)}.to raise_error(NGSCI::NGSCIError)
+    end
+    it "fails to instantiate on an improper strand argument" do
+      expect{NGSCI::Read.new(1,3,strand:"foo")}.to raise_error(NGSCI::NGSCIError)
+    end
+    it "fails to instantiate without the three necessary arguments" do
+      expect{NGSCI::Read.new(1)}.to raise_error(ArgumentError)
+    end
+    it "instantiates a new read with proper unstranded arguments" do
+      expect{NGSCI::Read.new(1,3)}.to_not raise_error
+    end
+    it "instantiates a new read with proper stranded arguments" do
+      expect{NGSCI::Read.new(1,3,strand:"+")}.to_not raise_error
+    end
   end
-  it "instantiates a new read with proper unstranded arguments" do
-    expect{NGSCI::Read.new(1,3)}.to_not raise_error
-  end
+  context "after created" do
+    before(:each) do
+      @read = NGSCI::Read.new(1,3,strand:"+")
+    end
+    it "has a start attribute" do
+      expect(@read.methods).to include(:start)
+    end
+    it "has a stop attribute" do
+      expect(@read.methods).to include(:stop)
+    end
+    it "has a length attribute" do
+      expect(@read.methods).to include(:length)
+    end
+    it "has a strand attribute" do
+      expect(@read.methods).to include(:strand)
+    end
-  it "instantiates a new read with proper stranded arguments" do
-    expect{NGSCI::Read.new(1,3,strand:"+")}.to_not raise_error
   end
 end

data/spec/test_files/saturated.bam ADDED

Binary file

data/spec/test_files/saturated.bam.bai ADDED

Binary file

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: ngs-ci
 version: !ruby/object:Gem::Version
-  version: 0.0.1.a
+  version: 0.0.2.b
 platform: ruby
 authors:
 - Matthew Ralston
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-10-28 00:00:00.000000000 Z
+date: 2015-12-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: trollop
@@ -56,16 +56,16 @@ dependencies:
   name: yell
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '2'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '2'
 - !ruby/object:Gem::Dependency
   name: ruby-prof
   requirement: !ruby/object:Gem::Requirement
@@ -84,16 +84,16 @@ dependencies:
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '1'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '1'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -122,18 +122,32 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.1'
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: coveralls
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '0'
 description: Calculated a metric that estimates read complexity at each base for RNA-seq
@@ -169,6 +183,8 @@ files:
 - spec/lib/read_spec.rb
 - spec/spec_helper.rb
 - spec/test_files/empty.bam
+- spec/test_files/saturated.bam
+- spec/test_files/saturated.bam.bai
 - spec/test_files/test.bam
 - spec/test_files/test.bam.bai
 - spec/test_files/test.fa
@@ -204,6 +220,11 @@ test_files:
 - spec/lib/read_spec.rb
 - spec/spec_helper.rb
 - spec/test_files/empty.bam
+- spec/test_files/saturated.bam
+- spec/test_files/saturated.bam.bai
 - spec/test_files/test.bam
 - spec/test_files/test.bam.bai
 - spec/test_files/test.fa
+has_rdoc:
+- yard
+- "~> 0"