RubyGems - ms-core - Versions diffs - 0.0.1 - Mend

ms-core 0.0.1

Files changed (20)

data/LICENSE +13 -0
data/README +27 -0
data/changelog.txt +196 -0
data/lib/ms/calc.rb +32 -0
data/lib/ms/data/interleaved.rb +60 -0
data/lib/ms/data/lazy_io.rb +73 -0
data/lib/ms/data/lazy_string.rb +15 -0
data/lib/ms/data/simple.rb +59 -0
data/lib/ms/data/transposed.rb +41 -0
data/lib/ms/data.rb +57 -0
data/lib/ms/format/format_error.rb +12 -0
data/lib/ms/id/peptide.rb +76 -0
data/lib/ms/id/protein.rb +17 -0
data/lib/ms/id/search.rb +110 -0
data/lib/ms/mass/aa.rb +136 -0
data/lib/ms/mass.rb +9 -0
data/lib/ms/spectrum.rb +157 -0
data/lib/ms/support/binary_search.rb +126 -0
data/lib/ms.rb +10 -0
metadata +95 -0

data/LICENSE ADDED

@@ -0,0 +1,13 @@
+Copyright (c) 2006, The University of Texas at Austin("U.T. Austin"). All rights reserved.
+Software by John T. Prince under the direction of Edward M. Marcotte.
+By using this software the USER indicates that he or she has read, understood and will comply with the following:
+U. T. Austin hereby grants USER permission to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of this software and its documentation for any purpose and without fee, provided that a full copy of this notice is included with the software and its documentation.
+Title to copyright this software and its associated documentation shall at all times remain with U. T. Austin. No right is granted to use in advertising, publicity or otherwise any trademark, service mark, or the name of U. T. Austin.
+This software and any associated documentation are provided "as is," and U. T. AUSTIN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING THOSE OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT USE OF THE SOFTWARE, MODIFICATIONS, OR ASSOCIATED DOCUMENTATION WILL NOT INFRINGE ANY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER INTELLECTUAL PROPERTY RIGHTS OF A THIRD PARTY. U. T. Austin, The University of Texas System, its Regents, officers, and employees shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to any claim by USER or any third party on account of or arising from the use, or inability to use, this software or its associated documentation, even if U. T. Austin has been advised of the possibility of those damages.
+Submit software operation questions to: Edward M. Marcotte, Department of Chemistry and Biochemistry, U. T. Austin, Austin, Texas 78712.

data/README ADDED

@@ -0,0 +1,27 @@
+= {ms-core}[http://mspire.rubyforge.org/projects/ms-core]
+The core mspire[http://mspire.rubyforge.org] library for working with mass spectrometry proteomics data.
+Generally, this will be used as a dependency and won't be all that useful on
+its own.
+== Description
+* Github[http://github.com/jtprince/ms-core/tree/master]
+* Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/16692-mspire/tickets]
+* {Google Group}[http://groups.google.com/group/mspire-forum]
+== Installation
+Available as a gem on RubyForge[http://rubyforge.org/projects/mspire].  Use:
+  % gem install ms-core
+Typically, it will be included as a dependency so it will be installed with
+another gem.
+== Info
+Copyright (c) 2006-2008, Regents of the University of Colorado and HHMI
+Developers:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/], John Prince, {Edward Marcotte Lab}[http://polaris.icmb.utexas.edu/home.html], {Natalie Ahn Lab}[http://www.colorado.edu/chem/people/ahnn.html], {Howard Hughes Medical Institute}[http://www.hhmi.org/], {BYU Dept. of Chemistry and Biochemistry}[http://www.chem.byu.edu/]
+Support:: CU Denver School of Medicine Deans Academic Enrichment Fund, HHMI
+License:: {MIT-Style}[link:files/MIT-LICENSE.html]

data/changelog.txt ADDED

@@ -0,0 +1,196 @@
+== version 0.1.7
+1. A couple of scripts and subroutines were hashing peptides but not on the file
+basename.  This would result in slightly incorrect results (any time there
+were overlapping scan numbers in multiple datasets, only the top one would be
+chosen).  The results would be correct for single runs.
+Output files that could be affected:
+*.top_per_scan.txt
+*.all_peps_per_scan.txt
+Scripts that could be affected:
+script/top_hit_per_scan.rb
+bin/filter_spec_id.rb
+script/filter-peps.rb
+bin/id_precision.rb
+Subroutines that were affected:
+spec_id.rb (pep_probs_by_* )
+spec_id.rb (top_peps_prefilter!)
+proph.rb uniq_by_seqcharge
+align.rb called uniq_by_seqcharge
+2. false_positive_rate.rb and protein_summary.rb (by extension) were using
+number of true positives on the x axis while in reality I was plotting the
+number of hits.  I've updated x axis labels to reflect this change.  In
+addition, since the term 'false positive rate' has such a distinct definition
+in classical ROC plots and binary statistics, I've decided to work primarily
+in terms of precision (TP/(TP+FP)).  I've purged the terms 'False Positive
+Rate' and 'FPR' from the package. It's been suggested that FP/(TP+FP) be
+called the False Positive Predictive Rate (FPPR).  I will probably implement
+this in a future release.
+== version 0.2.0
+Revamped the way SpecID works (it is now mixed-in).
+Added support for modifications to bioworks_to_pepxml.rb
+Can read .srf files (nearly interchangeable with bioworks files)
+Redid filter.rb
+== version 0.2.1
+minor bugfix
+== version 0.2.2
+made compatible with Bioworks fasta file reverser and updated tutorial.
+Killed classify_by_prefix routine in favor of classify_by_false_flag which has
+a prefix option
+== version 0.2.3
+in protein_summary.rb added handling for proteins with no annotation (either
+display NA or use gi2annnot to grab them from NCBI)
+== version 0.2.5
+renamed prep_list in roc (potential breaks in code)
+== version 0.2.6
+1. Massive refactorization of filtering and validation.  Validation objects are
+created and then can be used to validate just about anything.
+2. Massive redo of the parsing of MS runs.  Can parse mzXML v1, v2.X
+(including readw broken output), and mzData (even Thermo's broken output).
+4. Moved all tests to specs (rspec).
+5. Can read gradient programs off of .meth or .RAW files (both Xcal 1.X and
+2.X)
+Bugfixes:
+1. The search_summary 'base_name' in pepxml output was incorrect (this did not
+appear to influence our analyses, however). Fixed.
+2. Enzymes with no exceptions (e.g., cuts at KR) would report one too many
+missed cleavages if the last amino acid was a cut point. Fixed.
+== version 0.2.7
+1. In conversion from bioworks to pepxml, the default was trypsin (KR/P).
+Now, the sample enzyme is set explicitly from the params file and the option
+is not available.  This can give more accurate pepxml files than
+before, depending on your enzyme.
+== version 0.2.9
+1. Added support for phobius transmembrane predictions
+2. have filter_and_validate.rb working well (multiple validators allowed).
+3. Can read bioworks 3.3.1 .srf files (.srf version 3.5 files)
+4. Added a bias validator
+== version 0.2.10
+1. Fixed --hits_separate flag in spec_id/filter
+== version 0.2.11
+1. Added prob precision support and reorganized filter_and_validate libs
+== version 0.2.12
+1. Fixed bug in transmem for prob and others.
+2. Can use axml (XMLParser based) or libxml depending on availability
+== version 0.2.13
+1. Fixed issue with --hits_separate
+2. filter_and_validate.rb requires decoy validator if decoy proteins
+(refactored code)
+== version 0.2.14
+1. Can read PeptideProphet files (should be able to read pepxml files, too)
+2. API change: Some slight modifications to the Sequest::PepXML object
+interfaces and implementations (using ArrayClass)
+== version 0.2.15
+1. can convert srf files to sqt files
+== version 0.3.0
+1. IMPORTANT BUG FIX: protein reporting in srf files is correct now (proteins after the first protein were being assigned to the last hit in an out file).
+2. SQT export is correct and works at least on 3.2 and 3.3.1.
+== version 0.3.1
+1. Bug fix in srf filtering (num_hits adjusted)
+== version 0.3.2
+1. Uses sequest peptide_mass_tolerance filter on srf group files by default
+now.
+== version 0.3.3
+1. Worked out minor kinks in prob_precision.rb
+== version 0.3.4
+1. filters >= +3 charged ions now.
+== version 0.3.5
+1. fixed creation of background distribution in validators (hash_by base_name,
+first_scan, charge now)
+== version 0.3.6
+1. split off bad_aa_est from bad_aa
+== version 0.3.7
+1. can deal with No_Enzyme searches now (while still capable of setting
+sample_enzyme)
+== version 0.3.8
+1. can set a decoy to target ratio for decoy validation
+2. added mass calculator in Mass::Calculator
+== version 0.3.9
+1. doesn't clobber mzdata filename in ms_to_lmat.rb conversion
+== version 0.3.10
+1. added run_percolator.rb script which makes running multiple files easy
+== version 0.3.11
+1. faster sensing of bad scan tags in mzXML v. 2.0 files
+2. implemented lazy evaluation of spectrum in 2 different ways allowing much
+larger files to be parsed
+== version 0.4.0
+1. ** INTERFACE CHANGE: each scan can only have one precursor (used to be an array)
+2. ** INTERFACE CHANGE: spectrum mz and intensity data accessed with mzs and intensities
+3. lazy eval working on mzData
+4. mzData not necessarily guaranteed to have precursor intensities on lazy
+eval methods (however, the method intensity_at_mz will still work, causing
+evaluation)
+== version 0.4.1
+1. added support for reading mzXML version 3.0 (may fail in some cases)
+== version 0.4.2
+1. added MS::MSRun.open method
+2. added method to write dta files from SRF
+== version 0.4.3
+1. added to_mfg_file from SRF

data/lib/ms/calc.rb ADDED

@@ -0,0 +1,32 @@
+module Ms
+  module Calc
+      module_function
+      #
+      # ppm calculations... maybe use RUnit
+      #
+      def ppm_tol_at(mz, ppm)
+        1.0 * mz * ppm / 10**6
+      end
+      def ppm_span_at(mz, ppm)
+        tol = ppm_tol_at(mz, ppm)
+        [mz-tol, mz+tol]
+      end
+      def ppm_range_at(mz, ppm)
+        mz = mz.to_f
+        tol = ppm_tol_at(mz, ppm)
+        mz-tol...mz+tol
+      end
+      # Rounds n to the specified precision (ie number of decimal places)
+      # def round(n, precision)
+      #   factor = 10**precision.to_i
+      #   (n * factor).round.to_f / factor
+      # end
+  end
+end
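
The ppm helpers above convert a relative tolerance into an absolute m/z window. A minimal standalone sketch of the same arithmetic follows; the module name `PpmSketch` is hypothetical and the gem itself is not required.

```ruby
# Standalone sketch mirroring the arithmetic of Ms::Calc (PpmSketch is a made-up name).
module PpmSketch
  module_function

  # Absolute tolerance, in m/z units, for a ppm tolerance at a given m/z.
  def ppm_tol_at(mz, ppm)
    mz.to_f * ppm / 1_000_000
  end

  # Exclusive range (three dots): the upper bound mz + tol is not a member.
  def ppm_range_at(mz, ppm)
    tol = ppm_tol_at(mz, ppm)
    (mz - tol)...(mz + tol)
  end
end

tol = PpmSketch.ppm_tol_at(1000.0, 10)      # 10 ppm at m/z 1000
window = PpmSketch.ppm_range_at(1000.0, 10)
puts tol
puts window.cover?(1000.005)
puts window.cover?(1000.02)
```

Because `ppm_range_at` builds an exclusive range, a value exactly equal to `mz + tol` is treated as outside the window, while `ppm_span_at` returns both bounds as a plain two-element array.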

data/lib/ms/data/interleaved.rb ADDED

@@ -0,0 +1,60 @@
+require 'ms/data/simple'
+module Ms
+  module Data
+    module_function
+    # Initializes a new interleaved data array.
+    def new_interleaved(unresolved_data, n=2)
+      Interleaved.new(unresolved_data, n)
+    end
+    # An Interleaved data array lazily evaluates its unresolved data as
+    # an interleaved array of n members.  The unresolved data is evaluated
+    # into an array using to_a.
+    #
+    #   i = Ms::Data::Interleaved.new([1,4,2,5,3,6])
+    #   i.unresolved_data    # => [1,4,2,5,3,6]
+    #   i.data               # => []
+    #   i[0]                 # => [1,2,3]
+    #   i[1]                 # => [4,5,6]
+    #   i.data               # => [[1,2,3], [4,5,6]]
+    #
+    class Interleaved < Simple
+      attr_reader :n
+      def initialize(unresolved_data, n=2)
+        @n = n
+        super(unresolved_data)
+      end
+      def [](index)
+        resolve.data[index]
+      end
+      def resolved?
+        !@data.empty?
+      end
+      def resolve
+        return(self) if resolved?
+        unresolved_data = @unresolved_data.to_a
+        unless unresolved_data.length % n == 0
+          raise ArgumentError, "interleaved data must have a number of elements evenly divisible by n (#{n})"
+        end
+        n.times { @data << [] }
+        map = @data * (unresolved_data.length/n)
+        unresolved_data.each_with_index do |item, i|
+          map[i] << item
+        end
+        self
+      end
+    end
+  end
+end
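
De-interleaving a flat array into n parallel arrays can also be expressed with `each_slice` and `transpose`. The helper below is a hypothetical sketch, not the gem's implementation, and it resolves eagerly rather than lazily.

```ruby
# Hypothetical helper, not part of ms-core.
def deinterleave(flat, n = 2)
  unless (flat.length % n).zero?
    raise ArgumentError, "length #{flat.length} is not evenly divisible by n (#{n})"
  end
  # each_slice groups one record per slice; transpose turns records into columns.
  flat.each_slice(n).to_a.transpose
end

pairs = deinterleave([1, 4, 2, 5, 3, 6])         # mz, intensity, mz, intensity, ...
triples = deinterleave([1, 4, 7, 2, 5, 8], 3)    # three interleaved members
p pairs
p triples
```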

data/lib/ms/data/lazy_io.rb ADDED

@@ -0,0 +1,73 @@
+module Ms
+  module Data
+    # LazyIO represents data to be lazily read from an IO.  To read the data
+    # from the IO, either string or to_a may be called (to_a unpacks the
+    # string into an array using the decode_format and unpack_format).
+    #
+    # LazyIO is a suitable unresolved_data source for Ms::Data formats.
+    class LazyIO
+      NETWORK_FLOAT = 'g*'
+      NETWORK_DOUBLE = 'G*'
+      LITTLE_ENDIAN_FLOAT = 'e*'
+      LITTLE_ENDIAN_DOUBLE = 'E*'
+      BASE_64 = 'm'
+      class << self
+        # Returns the unpacking code for the given precision (32 or 64-bit)
+        # and network order (true for big-endian).
+        def unpack_code(precision, network_order)
+          case precision
+          when 32 then network_order ? NETWORK_FLOAT : LITTLE_ENDIAN_FLOAT
+          when 64 then network_order ? NETWORK_DOUBLE : LITTLE_ENDIAN_DOUBLE
+          else raise ArgumentError, "unknown precision (should be 32 or 64): #{precision}"
+          end
+        end
+      end
+      # The IO from which string is read
+      attr_reader :io
+      # The start index for reading string
+      attr_reader :start_index
+      # The number of bytes to be read from io when evaluating string
+      attr_reader :num_bytes
+      # Indicates the unpacking format
+      attr_reader :unpack_format
+      # Indicates a decoding format, may be false to unpack string
+      # without decoding.
+      attr_reader :decode_format
+      def initialize(io, start_index=io.pos, num_bytes=nil, unpack_format=NETWORK_FLOAT, decode_format=BASE_64)
+        @io = io
+        @start_index = start_index
+        @num_bytes = num_bytes
+        @unpack_format = unpack_format
+        @decode_format = decode_format
+      end
+      # Positions io at start_index and reads a string of num_bytes length.
+      # The string is newly read from io each time string is called.
+      def string
+        io.pos = start_index unless io.pos == start_index
+        io.read(num_bytes)
+      end
+      # Resets the cached array (returned by to_a) so that the array will
+      # be re-read from io.
+      def reset
+        @array = nil
+      end
+      # Reads string and unpacks using decode_format and unpack_format.  The
+      # array is cached internally; to re-read the array, use reset.
+      def to_a
+        @array ||= (decode_format ? string.unpack(decode_format)[0] : string).unpack(unpack_format)
+      end
+    end
+  end
+end
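
The two-step decoding in `to_a` (base64 first, then binary floats) can be checked with a round trip through `Array#pack` and `String#unpack`. This sketch uses only core Ruby; the sample values are arbitrary and chosen to be exactly representable as 32-bit floats.

```ruby
values = [1.0, 4.0, 2.5, 5.0]

# Encode: big-endian 32-bit floats ('g*'), then base64 ('m').
encoded = [values.pack('g*')].pack('m')

# Decode in the reverse order: base64 first, then the binary floats.
binary = encoded.unpack('m')[0]
decoded = binary.unpack('g*')

# Unpacking with the wrong byte order ('e*' is little-endian) gives different numbers.
little = binary.unpack('e*')

p decoded
p binary.bytesize
```

Choosing the unpack code from the file's declared precision and byte order, as `unpack_code` does, matters because a mismatch fails silently with wrong values rather than raising.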

data/lib/ms/data/lazy_string.rb ADDED

@@ -0,0 +1,15 @@
+require 'ms/data/lazy_io'
+require 'stringio'
+module Ms
+  module Data
+    # LazyString is a LazyIO initialized from a string, which is converted into
+    # a StringIO.
+    class LazyString < LazyIO
+      def initialize(string, unpack_format=NETWORK_FLOAT, decode_format=BASE_64)
+        super(StringIO.new(string), 0, string.length, unpack_format, decode_format)
+      end
+    end
+  end
+end
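
LazyString hands `string.length` to LazyIO as the number of bytes to read. For base64 text, which is ASCII only, the character count and the byte count agree. A small sketch with core Ruby and `stringio`, using arbitrary sample data:

```ruby
require 'stringio'

encoded = [[1.0, 2.0].pack('g*')].pack('m')   # base64 text, ASCII only
io = StringIO.new(encoded)

# Read the whole window, then reposition and read it again.
io.pos = 0
first = io.read(encoded.bytesize)
io.pos = 0
second = io.read(encoded.bytesize)

# For non-ASCII text the two counts differ, so length would under-read.
accented = "\u00e9"
p [encoded.length, encoded.bytesize]
p [accented.length, accented.bytesize]
```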

data/lib/ms/data/simple.rb ADDED

@@ -0,0 +1,59 @@
+module Ms
+  module Data
+    module_function
+    # Initializes a new simple data array.
+    def new_simple(unresolved_data)
+      Simple.new(unresolved_data)
+    end
+    # A Simple data array that lazily evaluates unresolved_data, and
+    # each member of unresolved_data using to_a:
+    #
+    #   class LazyObject
+    #     attr_reader :to_a
+    #     def initialize(array)
+    #       @to_a = array
+    #     end
+    #   end
+    #
+    #   a = LazyObject.new([1,2,3])
+    #   b = LazyObject.new([4,5,6])
+    #   s = Ms::Data::Simple.new([a, b])
+    #
+    #   s.unresolved_data     # => [a, b]
+    #   s.data                # => []
+    #   s[0]                  # => [1,2,3]
+    #   s[1]                  # => [4,5,6]
+    #   s.data                # => [[1,2,3], [4,5,6]]
+    #
+    class Simple
+      # The underlying resolved data store.
+      attr_reader :data
+      # The underlying unresolved data store.
+      attr_reader :unresolved_data
+      def initialize(unresolved_data)
+        @data = []
+        @unresolved_data = unresolved_data
+      end
+      def [](index)
+        @data[index] ||= @unresolved_data.to_a[index].to_a
+      end
+      def resolve
+        0.upto(@unresolved_data.length - 1) do |index|
+          self[index]
+        end unless resolved?
+        self
+      end
+      def resolved?
+        @data.compact.length == @unresolved_data.length
+      end
+    end
+  end
+end
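
The per-index caching in `Simple#[]` relies on `||=`, so each member is expanded at most once. The sketch below makes that visible with a counter; `CountingSource` and `LazyRows` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for an expensive source; counts how often it is expanded.
class CountingSource
  attr_reader :calls

  def initialize(array)
    @array = array
    @calls = 0
  end

  def to_a
    @calls += 1
    @array
  end
end

# Hypothetical class sketching the idea behind Simple#[].
class LazyRows
  def initialize(sources)
    @sources = sources
    @rows = []
  end

  # Resolve a row on first access only; later calls reuse the cached array.
  def [](index)
    @rows[index] ||= @sources[index].to_a
  end
end

a = CountingSource.new([1, 2, 3])
b = CountingSource.new([4, 5, 6])
rows = LazyRows.new([a, b])
rows[0]
rows[0]
p [a.calls, b.calls]
```

Only the row that is actually requested pays the resolution cost; the second source is never touched in this run.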

data/lib/ms/data/transposed.rb ADDED

@@ -0,0 +1,41 @@
+require 'ms/data/simple'
+module Ms
+  module Data
+    module_function
+    # Initializes a new transposed data array.
+    def new_transposed(unresolved_data)
+      Transposed.new(unresolved_data)
+    end
+    # A Transposed data array lazily evaluates its unresolved data as
+    # a transposed array.  The unresolved data is evaluated
+    # into an array using to_a.
+    #
+    #   t = Ms::Data::Transposed.new([[1,4],[2,5],[3,6]])
+    #
+    #   t.unresolved_data  # => [[1,4],[2,5],[3,6]]
+    #   t.data             # => []
+    #   t[0]               # => [1,2,3]
+    #   t[1]               # => [4,5,6]
+    #   t.data             # => [[1,2,3], [4,5,6]]
+    #
+    class Transposed < Simple
+      def [](index)
+        resolve.data[index]
+      end
+      def resolved?
+        !@data.empty?
+      end
+      def resolve
+        @data = @unresolved_data.to_a.transpose unless resolved?
+        self
+      end
+    end
+  end
+end
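
`Array#transpose` does the actual work in `Transposed#resolve`. A short sketch with made-up peak values shows the conversion in both directions, plus the failure mode for ragged input.

```ruby
peaks = [[100.1, 10.0], [200.2, 20.0], [300.3, 30.0]]   # made-up (mz, intensity) pairs

# Peak list -> parallel arrays.
mzs, intensities = peaks.transpose

# Parallel arrays -> peak list again.
round_trip = mzs.zip(intensities)

# transpose needs rows of equal length; a ragged list raises IndexError.
begin
  [[1, 2], [3]].transpose
  ragged_error = nil
rescue IndexError => e
  ragged_error = e.class
end

p mzs
p intensities
```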

data/lib/ms/data.rb ADDED

@@ -0,0 +1,57 @@
+require 'ms/data/interleaved'
+require 'ms/data/transposed'
+module Ms
+  # The Data module contains a number of classes providing a standard way to
+  # resolve various data storage formats into a 'simple' data array.
+  #
+  #   type               format
+  #   simple             [[mzs,...], [intensities...]]
+  #   transposed         [[mz,intensity], [mz,intensity], ...]
+  #   interleaved        [mz,intensity,mz,intensity,...]
+  #
+  # For instance:
+  #
+  #   s = Data.new([[1,2,3], [4,5,6]], :simple)
+  #   s.resolve.data        # => [[1,2,3], [4,5,6]]
+  #
+  #   t = Data.new([[1,4],[2,5],[3,6]], :transposed)
+  #   t.resolve.data        # => [[1,2,3], [4,5,6]]
+  #
+  #   i = Data.new([1,4,2,5,3,6], :interleaved)
+  #   i.resolve.data        # => [[1,2,3], [4,5,6]]
+  #
+  # Data is always resolved by calling to_a on the unresolved data object
+  # and then rearranging as needed (in the case of simple data, to_a is
+  # also called on each member of the unresolved data array).  This lazy
+  # resolution allows the use of non-array unresolved_data objects such
+  # as Data::LazyString:
+  #
+  #   str = [[1,4,2,5,3,6].pack("g*")].pack("m")
+  #   unresolved_data = Data::LazyString.new(str)
+  #
+  #   i = Data.new(unresolved_data, :interleaved)
+  #   i.resolve.data        # => [[1,2,3], [4,5,6]]
+  #
+  # Obviously the big advantage of lazy data resolution is that Data objects
+  # may be instantiated cheaply while expensive operations like unpacking and
+  # rearrangement may be put off or not executed at all.
+  #
+  module Data
+    module_function
+    # Initializes a new data array of the specified type by forwarding
+    # data to the "new_<type>" method.
+    #
+    #   simple = Ms::Data.new([[1,2,3], [4,5,6]], :simple)
+    #   simple.class           # => Ms::Data::Simple
+    #
+    #   interleaved = Ms::Data.new([1,4,2,5,3,6], :interleaved)
+    #   interleaved.class      # => Ms::Data::Interleaved
+    #
+    def new(data, type=:simple)
+      send("new_#{type}", data)
+    end
+  end
+end
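
`Data.new` dispatches on the type name by building a method name and calling `send`. The sketch below shows the same pattern in a hypothetical module `Layouts`, with one addition that is an assumption of the sketch rather than the gem's behavior: unknown types raise an ArgumentError instead of a NoMethodError.

```ruby
# Hypothetical module illustrating name-based dispatch; not the gem's Ms::Data.
module Layouts
  module_function

  def new_simple(data)
    data
  end

  def new_transposed(data)
    data.transpose
  end

  # Forward to "new_<type>", rejecting unknown types with a clear error.
  def build(data, type = :simple)
    builder = "new_#{type}"
    unless respond_to?(builder)
      raise ArgumentError, "unknown data type: #{type.inspect}"
    end
    public_send(builder, data)
  end
end

p Layouts.build([[1, 4], [2, 5]], :transposed)
```

Using `public_send` with a `respond_to?` guard keeps the dispatch limited to public builder methods.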

data/lib/ms/format/format_error.rb ADDED

@@ -0,0 +1,12 @@
+module Ms
+  module Format
+    class FormatError < StandardError
+      attr_accessor :str
+      def initialize(msg, str)
+        super(msg)
+        @str = str
+      end
+    end
+  end
+end
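
An error class that carries the offending string lets callers report what failed to parse. The following sketch uses hypothetical names (`ParseError`, `parse_charge`) and assumes inheritance from StandardError so that ordinary `rescue` clauses can catch it.

```ruby
# Hypothetical error class carrying the text that failed to parse.
class ParseError < StandardError
  attr_reader :str

  def initialize(msg, str)
    super(msg)
    @str = str
  end
end

# Hypothetical parser used only for this example.
def parse_charge(text)
  Integer(text)
rescue ArgumentError
  raise ParseError.new("charge is not an integer", text)
end

begin
  parse_charge("2+")
rescue ParseError => e
  failure = [e.message, e.str]
end

p failure
```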

data/lib/ms/id/peptide.rb ADDED

@@ -0,0 +1,76 @@
+module Ms ; end
+module Ms::Id ; end
+# A 'sequence' is a notation of a peptide that includes the leading and
+# trailing amino acid after cleavage (e.g., K.PEPTIDER.E or -.STARTK.L )
+# and may contain post-translational modification information.
+#
+# 'aaseq' is the amino acid sequence of just the peptide with no leading or
+# trailing notation (e.g., PEPTIDER or LAKKLY)
+module Ms::Id::Peptide
+  Nonstandard_AA_re = /[^A-Z\.\-]/
+  class << self
+    def sequence_to_aaseq(sequence)
+      after_removed = remove_non_amino_acids(sequence)
+      pieces = after_removed.split('.')
+      case pieces.size
+      when 3
+        pieces[1]
+      when 2
+        if pieces[0].size > 1  ## N termini
+          pieces[0]
+        else  ## C termini
+          pieces[1]
+        end
+      when 1  ## this must be a parse error!
+        pieces[0] ## which is the peptide itself
+      else
+        abort "bad peptide sequence: #{sequence}"
+      end
+    end
+    # removes non standard amino acids specified by Nonstandard_AA_re
+    def remove_non_amino_acids(sequence)
+      sequence.gsub(Nonstandard_AA_re, '')
+    end
+    # remove non amino acids and split the sequence
+    def prepare_sequence(sequence)
+      nv = remove_non_amino_acids(sequence)
+      split_sequence(nv)
+    end
+    # Returns prev, peptide, next from sequence.  Parse errors return
+    # nil,nil,nil
+    #   R.PEPTIDE.A  # -> R, PEPTIDE, A
+    #   R.PEPTIDE.-  # -> R, PEPTIDE, -
+    #   PEPTIDE.A    # -> -, PEPTIDE, A
+    #   A.PEPTIDE    # -> A, PEPTIDE, -
+    #   PEPTIDE      # -> nil,nil,nil
+    def split_sequence(sequence)
+      peptide_prev_aa = ""; peptide = ""; peptide_next_aa = ""
+      pieces = sequence.split('.')
+      case pieces.size
+      when 3
+        peptide_prev_aa, peptide, peptide_next_aa = *pieces
+      when 2
+        if pieces[0].size > 1  ## N termini
+          peptide_prev_aa, peptide, peptide_next_aa = '-', pieces[0], pieces[1]
+        else  ## C termini
+          peptide_prev_aa, peptide, peptide_next_aa = pieces[0], pieces[1], '-'
+        end
+      when 1  ## this must be a parse error!
+        peptide_prev_aa, peptide, peptide_next_aa = nil,nil,nil
+      when 0
+        peptide_prev_aa, peptide, peptide_next_aa = nil,nil,nil
+      end
+      return peptide_prev_aa, peptide, peptide_next_aa
+    end
+  end
+end
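
The splitting rules above can be exercised in isolation. This is a hypothetical standalone version (`split_flanked` is not a method of the gem) that strips non-amino-acid characters with the same character class and then applies the same three-way case.

```ruby
NON_AA = /[^A-Z.\-]/  # same character class as Nonstandard_AA_re

# Hypothetical standalone version of the splitting rules.
def split_flanked(sequence)
  pieces = sequence.gsub(NON_AA, '').split('.')
  case pieces.size
  when 3 then pieces
  when 2
    # A multi-residue first piece means the leading flank is missing.
    pieces[0].size > 1 ? ['-', pieces[0], pieces[1]] : [pieces[0], pieces[1], '-']
  else [nil, nil, nil]
  end
end

p split_flanked('K.PEPT*IDER.E')   # modification symbol is stripped first
p split_flanked('PEPTIDE.A')
p split_flanked('A.PEPTIDE')
```

The two-piece case is a heuristic: it decides which flank is missing from the length of the first piece, so a single-residue peptide with only a trailing flank would be read the other way round.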

data/lib/ms/id/protein.rb ADDED

@@ -0,0 +1,17 @@
+module Ms ; end
+module Ms::Id ; end
+module Ms::Id::Protein
+  class << self
+  end
+  # gives the information up until the first space or carriage return.
+  # Assumes the protein can respond_to? :reference
+  def first_entry
+    reference.split(/[\s\r]/)[0]
+  end
+end
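
`first_entry` is meant to be mixed into any object that responds to `reference`. A minimal sketch of that usage, with a hypothetical mixin, a hypothetical host Struct, and a made-up reference line:

```ruby
# Hypothetical mixin mirroring the behavior of Ms::Id::Protein#first_entry.
module FirstEntry
  def first_entry
    reference.split(/\s/)[0]   # \s already matches carriage returns
  end
end

# Hypothetical host class; anything responding to :reference would do.
ProteinHit = Struct.new(:reference) do
  include FirstEntry
end

hit = ProteinHit.new("prot_0001 made-up description of a protein")
p hit.first_entry
```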

data/lib/ms/id/search.rb ADDED

@@ -0,0 +1,110 @@
+module Ms
+  module Id
+    module Search
+      attr_accessor :prots
+      attr_accessor :peps
+      def protein_class
+        self.const_get("Prot")
+      end
+      # returns an array of peptide_hits and protein_hits that are linked to
+      # one another.  NOTE: this will update the :prots and :peps attributes of
+      # the peptide and protein hits, respectively.  Assumes that each search
+      # responds to :peps, each peptide responds to :prots and each protein to
+      # :peps.  Can be done on a single file to restore protein/peptide
+      # linkages to their original single-file state.
+      # Assumes the protein is initialized with (reference, peptide_ar)
+      #
+      # yields the protein that will become the template for a new protein
+      # and expects a new protein hit
+      def merge!(ar_of_peptide_hit_arrays)
+        all_peptide_hits = []
+        reference_hash = {}
+        ar_of_peptide_hit_arrays.each do |peptide_hits|
+          all_peptide_hits.push(*peptide_hits)
+          peptide_hits.each do |pep|
+            pep.prots.each do |prot|
+              ref = prot.reference
+              if reference_hash.key? ref
+                reference_hash[ref].peps << pep
+                reference_hash[ref]
+              else
+                reference_hash[ref] = yield(prot, [pep])
+              end
+            end
+          end
+        end
+        [all_peptide_hits, reference_hash.values]
+      end
+    end
+    module SearchGroup
+      include Search
+      # an array of search objects
+      attr_accessor :searches
+      # the group's file extension (with no leading period)
+      def extension
+        'grp'
+      end
+      def search_class
+        Search
+      end
+      # a simple formatted file with paths to the search files
+      def to_paths(file)
+        IO.readlines(file).grep(/\w/).reject {|v| v =~ /^#/}.map {|v| v.chomp }
+      end
+      def from_file(file)
+        from_filenames(to_paths(file))
+      end
+      def from_filenames(filenames)
+        filenames.each do |file|
+          if !File.exist? file
+            message = "File: #{file} does not exist!\n"
+            message << "perhaps you need to modify the file with file paths"
+            abort message
+          end
+          @searches << search_class.new(file)
+        end
+      end
+      # takes a single search filename (with the extension defined by
+      # 'extension'), an array of filenames, or an array of objects.  Passes
+      # any arguments to the initializer for each search.
+      # the optional block yields the object for further processing
+      def initialize(arg=nil, opts={})
+        @peps = []
+        @reference_hash = {}
+        @searches = []
+        if arg
+          if arg.is_a?(String) && arg =~ /\.#{Regexp.escape(extension)}$/
+            from_file(arg)
+          elsif arg.is_a?(Array) && arg.first.is_a?(String)
+            from_filenames(arg)
+          elsif arg.is_a?(Array)
+            @searches = arg
+          else
+            raise ArgumentError, "must be file, array of filenames, or array of objs"
+          end
+        end
+        yield(self) if block_given?
+      end
+    end
+  end
+end
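
The core of `merge!` is grouping peptide hits from several searches under one protein object per reference string. The sketch below shows that grouping with hypothetical Structs (`Pep`, `Prot`) and a hypothetical function `merge_hits`; it builds the merged proteins directly instead of yielding to a block.

```ruby
Pep = Struct.new(:aaseq, :prots)       # hypothetical peptide hit
Prot = Struct.new(:reference, :peps)   # hypothetical protein hit

# Group peptide hits from several searches by protein reference.
def merge_hits(peptide_hit_arrays)
  by_reference = {}
  all_peps = peptide_hit_arrays.flatten(1)
  all_peps.each do |pep|
    pep.prots.each do |prot|
      merged = (by_reference[prot.reference] ||= Prot.new(prot.reference, []))
      merged.peps << pep
    end
  end
  [all_peps, by_reference.values]
end

run1 = [Pep.new('PEPTIDER', [Prot.new('protA', [])])]
run2 = [Pep.new('LAKKLY', [Prot.new('protA', [])]),
        Pep.new('STARTK', [Prot.new('protB', [])])]
peps, prots = merge_hits([run1, run2])
p prots.map { |prot| [prot.reference, prot.peps.map(&:aaseq)] }
```

As in the original method, the peptides still point at their per-search protein objects afterwards; relinking `pep.prots` to the merged proteins would be a separate step.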

data/lib/ms/mass/aa.rb ADDED

@@ -0,0 +1,136 @@
+require 'ms/mass'
+module Ms
+  module Mass
+    # A module for working with commonly used residue masses in proteomics.
+    #
+    #     require 'ms/mass/aa'
+    #     include Ms::Mass::AA
+    #
+    #     MONO['A'] # => 71.0371137878
+    #     AVG['A']  # => 71.0779
+    #
+    #     # or use symbols
+    #     MONO[:A]  # => 71.0371137878
+    #
+    # This module is built on masses generated from the excellent {'molecules'
+    # library}[http://github.com/bahuvrihi/molecules/tree/master].  See that
+    # library for more serious work with masses:
+    #
+    #     gem install molecules
+    module AA
+      Ms::Mass.constants.reject {|v| v.to_s == 'AA' }.each do |const|
+        const_set(const, Ms::Mass.const_get(const))
+      end
+      # These are included here to offer maximum functionality
+      MOLECULES_MONO_UNSUPPORTED = {
+        :B => (115.026943 + 114.042927) / 2, # average of aspartic acid and asparagine
+        :X => 118.805716,  # the average of the mono masses of the 20 amino acids
+        :* => 118.805716, # same as X
+        :Z => (129.04259 + 128.05858) / 2,  # average glutamic acid and glutamine
+        #:J => nil,
+      }
+      MOLECULES_AVG_UNSUPPORTED = {
+        :B => (115.0886 + 114.1038) / 2, # average of aspartic acid and asparagine
+        :X => 118.88603, # the average of the masses of the 20 amino acids
+        :* => 118.88603, # same as X
+        :Z => (129.1155+ 128.1307) / 2,  # average glutamic acid and glutamine
+        #:J => nil,
+      }
+      # generated from molecules version 0.1.3:
+      MOLECULES_MONO = {
+        :A => 71.0371137878,
+        :C => 103.0091844778,
+        :D => 115.026943032,
+        :E => 129.0425930962,
+        :F => 147.0684139162,
+        :G => 57.0214637236,
+        :H => 137.0589118624,
+        :I => 113.0840639804,
+        :K => 128.0949630177,
+        :L => 113.0840639804,
+        :M => 131.0404846062,
+        :N => 114.0429274472,
+        :O => 211.1446528645,
+        :P => 97.052763852,
+        :Q => 128.0585775114,
+        :R => 156.1011110281,
+        :S => 87.0320284099,
+        :T => 101.0476784741,
+        :U => 150.9536355878,
+        :V => 99.0684139162,
+        :W => 186.0793129535,
+        :Y => 163.0633285383,
+      }
+      MONO = MOLECULES_MONO_UNSUPPORTED.merge MOLECULES_MONO
+      # generated from molecules version 0.1.3:
+      MOLECULES_AVG = {
+        :A => 71.0779,
+        :C => 103.1429,
+        :D => 115.0874,
+        :E => 129.11398,
+        :F => 147.17386,
+        :G => 57.05132,
+        :H => 137.13928,
+        :I => 113.15764,
+        :K => 128.17228,
+        :L => 113.15764,
+        :M => 131.19606,
+        :N => 114.10264,
+        :O => 211.28076,
+        :P => 97.11518,
+        :Q => 128.12922,
+        :R => 156.18568,
+        :S => 87.0773,
+        :T => 101.10388,
+        :U => 150.0379,
+        :V => 99.13106,
+        :W => 186.2099,
+        :Y => 163.17326,
+      }
+      AVG = MOLECULES_AVG_UNSUPPORTED.merge MOLECULES_AVG
+      [AVG, MONO].each do |hash|
+        # iterate over a copy of the keys; adding keys during Hash#each raises on Ruby 1.9+
+        hash.keys.each {|k| hash[k.to_s] = hash[k] }
+      end
+      # returns a hash based on the molecules library of amino acid residues.
+      # type is :mono or :avg
+      def self.mass_index(type=:mono)
+        require 'molecules'
+        hash = {}
+        ('A'..'Z').each do |letter|
+          if res = Molecules::Libraries::Residue[letter]
+            hash[letter] =
+              if type == :mono
+                res.mass
+              elsif type == :avg
+                res.mass {|v| v.std_atomic_weight.value }
+              else
+                raise ArgumentError, "type must be :mono or :avg"
+              end
+          end
+        end
+        hash
+      end
+      # prints a MONO or AVG hash for inclusion in ruby code
+      # type can be :mono or :avg
+      def self.print_mass_index(type=:mono)
+        puts "#{type.to_s.upcase} = {"
+        mass_index(type).sort.each do |k,v|
+          puts ":#{k} => #{v},"
+        end
+        puts "}"
+      end
+    end
+  end
+end
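
A typical use of a residue mass table is summing residues and adding one water for the termini. The sketch below copies three monoisotopic values from MOLECULES_MONO; the helper `peptide_mono_mass` and the water mass constant are assumptions of this example, not part of the gem.

```ruby
# Monoisotopic residue masses copied from MOLECULES_MONO above (subset).
MONO_SUBSET = {
  'G' => 57.0214637236,
  'A' => 71.0371137878,
  'K' => 128.0949630177,
}.freeze

# Assumption: monoisotopic mass of water, added once per peptide for the termini.
WATER_MONO = 18.010565

# Hypothetical helper: neutral monoisotopic mass of an unmodified peptide.
def peptide_mono_mass(aaseq, table = MONO_SUBSET)
  residues = aaseq.each_char.inject(0.0) { |sum, aa| sum + table.fetch(aa) }
  residues + WATER_MONO
end

mass = peptide_mono_mass('GAK')
puts mass
```

Using `fetch` rather than `[]` makes an unknown residue letter raise a KeyError instead of silently contributing `nil` to the sum.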

data/lib/ms/mass.rb ADDED

@@ -0,0 +1,9 @@
+module Ms
+  module Mass
+    MASCOT_H_PLUS = 1.007276
+    H_PLUS = 1.00727646677  # need to verify this against HYDROGEN - ELECTRON
+    PROTON = H_PLUS
+  end
+end
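
The proton mass is what relates a neutral mass M to an observed m/z at charge z, via m/z = (M + z * H_PLUS) / z. A small sketch using the H_PLUS value above; both helper names are hypothetical.

```ruby
H_PLUS = 1.00727646677  # value of Ms::Mass::H_PLUS above

# Hypothetical helpers, not part of ms-core.
def mz_from_mass(neutral_mass, charge)
  (neutral_mass + charge * H_PLUS) / charge
end

def mass_from_mz(mz, charge)
  mz * charge - charge * H_PLUS
end

mz = mz_from_mass(1000.0, 2)   # doubly protonated ion of a 1000 Da species
back = mass_from_mz(mz, 2)
puts mz
puts back
```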

data/lib/ms/spectrum.rb ADDED

@@ -0,0 +1,157 @@
+module Ms
+  class Spectrum
+    # The underlying data store.
+    attr_reader :data
+    # Associated headers
+    attr_reader :headers
+    def initialize(data, headers={})
+      @data = data
+      @headers = headers
+    end
+    def self.from_peaks(ar_of_doublets)
+      _mzs = []
+      _ints = []
+      ar_of_doublets.each do |mz, int|
+        _mzs << mz
+        _ints << int
+      end
+      self.new([_mzs, _ints])
+    end
+    # An array of the mz data.
+    def mzs
+      @data[0]
+    end
+    # An array of the intensities data, corresponding to mzs.
+    def intensities
+      @data[1]
+    end
+    def mzs_and_intensities
+      [@data[0], @data[1]]
+    end
+    def [](array_index)
+      [mzs[array_index], intensities[array_index]]
+    end
+    # yields(mz, inten) across the spectrum, or array of doublets if no block
+    def peaks(&block)
+      (m, i) = mzs_and_intensities
+      m.zip(i, &block)
+    end
+    alias_method :each, :peaks
+    alias_method :each_peak, :peaks
+    # if the mzs and intensities are the same then the spectra are considered
+    # equal
+    def ==(other)
+      mzs == other.mzs && intensities == other.intensities
+    end
+    # returns a new spectrum whose intensities have been normalized by the tic
+    def normalize
+      tic = self.intensities.inject(0.0) {|sum,int| sum += int }
+      Ms::Spectrum.new([self.mzs, self.intensities.map {|v| v / tic }])
+    end
+    # uses index function and returns the intensity at that value
+    def intensity_at_mz(mz)
+      if x = index(mz)
+        intensities[x]
+      else
+        nil
+      end
+    end
+    # returns the index of the first value matching that m/z.  the argument m/z
+    # may be less precise than the actual m/z (rounding to the same precision
+    # given) but must be at least integer precision (after rounding)
+    # implemented as a binary search via Ms::Support::BinarySearch (must be loaded)
+    def index(mz)
+      mz_ar = mzs
+      return_val = nil
+      ind = Ms::Support::BinarySearch.search_lower_boundary(mz_ar) {|x| x <=> mz }
+      if mz_ar[ind] == mz
+        return_val = ind
+      else
+        # do a rounding game to see which one it is, or nil:
+        # find all the values rounding to the same integer in the vicinity
+        # and test each one fully in turn
+        mz = mz.to_f
+        mz_size = mz_ar.size
+        if ((ind < mz_size) and equal_after_rounding?(mz_ar[ind], mz))
+          return_val = ind
+        else # run the loop
+          up = ind
+          loop do
+            up += 1
+            if up >= mz_size
+              break
+            end
+            mz_up = mz_ar[up]
+            if (mz_up.ceil - mz.ceil >= 2)
+              break
+            else
+              if equal_after_rounding?(mz_up, mz)
+                return_val = up
+                return return_val
+              end
+            end
+          end
+          dn = ind
+          loop do
+            dn -= 1
+            if dn < 0
+              break
+            end
+            mz_dn = mz_ar[dn]
+            if (mz.floor - mz_dn.floor >= 2)
+              break
+            else
+              if equal_after_rounding?(mz_dn, mz)
+                return_val = dn
+                return return_val
+              end
+            end
+          end
+        end
+      end
+      return_val
+    end
+    # less_precise should be a float
+    # precise should be a float
+    def equal_after_rounding?(precise, less_precise) # :nodoc:
+      # determine the precision of less_precise
+      exp10 = precision_as_neg_int(less_precise)
+      # scale both values by that precision, round to integers, and compare;
+      # e.g. 100.12 matches 100.1, but 100.16 does not
+      scaled_precise = (precise*exp10).round
+      scaled_less_precise = (less_precise*exp10).round
+      scaled_precise == scaled_less_precise
+    end
+    # returns 1 for ones place, 10 for tenths, 100 for hundredths, etc.;
+    # a scaled value within 1e-6 of an integer counts as exact
+    def precision_as_neg_int(float) # :nodoc:
+      neg_exp10 = 1
+      loop do
+        over = float * neg_exp10
+        rounded = over.round
+        if (over - rounded).abs <= 1e-6
+          break
+        end
+        neg_exp10 *= 10
+      end
+      neg_exp10
+    end
+  end
+end
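
The precision-matching rule that `index` falls back on can be tried in isolation. The sketch below restates the two helpers as top-level methods so that it runs without loading ms-core; the logic is meant to mirror `precision_as_neg_int` and `equal_after_rounding?` above, written more compactly, and should be read as an illustration rather than the library's code.

```ruby
# Standalone restatement of the two helpers from Ms::Spectrum, so the
# sketch runs without loading ms-core.

# Smallest power of ten that turns `float` into (nearly) an integer:
# 1 for 5.0, 10 for 100.1, 100 for 0.25.
def precision_as_neg_int(float)
  neg_exp10 = 1
  neg_exp10 *= 10 until (float * neg_exp10 - (float * neg_exp10).round).abs <= 1e-6
  neg_exp10
end

# True when `precise`, rounded to the precision of `less_precise`,
# equals `less_precise`.
def equal_after_rounding?(precise, less_precise)
  exp10 = precision_as_neg_int(less_precise)
  (precise * exp10).round == (less_precise * exp10).round
end

p precision_as_neg_int(100.1)           # => 10
p equal_after_rounding?(100.12, 100.1)  # => true
p equal_after_rounding?(100.16, 100.1)  # => false
```

So a query of `100.1` matches a stored m/z of `100.12`, but not `100.16`, which rounds to `100.2` at one decimal place.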

data/lib/ms/support/binary_search.rb ADDED

@@ -0,0 +1,126 @@
+module Ms
+  module Support
+    # A binary search library adapted from: http://0xcc.net/ruby-bsearch/
+    # ---
+    #
+    # Ruby/Bsearch - a binary search library for Ruby.
+    #
+    # Copyright (C) 2001 Satoru Takabayashi <satoru@namazu.org>
+    #     All rights reserved.
+    #     This is free software with ABSOLUTELY NO WARRANTY.
+    #
+    # You can redistribute it and/or modify it under the terms of
+    # the Ruby's licence.
+    #
+    # Example (original Array API; here, call BinarySearch.search_first(array) {...} etc.):
+    #
+    #  % irb -r ./bsearch.rb
+    #  >> %w(a b c c c d e f).bsearch_first {|x| x <=> "c"}
+    #  => 2
+    #  >> %w(a b c c c d e f).bsearch_last {|x| x <=> "c"}
+    #  => 4
+    #  >> %w(a b c e f).bsearch_first {|x| x <=> "c"}
+    #  => 2
+    #  >> %w(a b e f).bsearch_first {|x| x <=> "c"}
+    #  => nil
+    #  >> %w(a b e f).bsearch_last {|x| x <=> "c"}
+    #  => nil
+    #  >> %w(a b e f).bsearch_lower_boundary {|x| x <=> "c"}
+    #  => 2
+    #  >> %w(a b e f).bsearch_upper_boundary {|x| x <=> "c"}
+    #  => 2
+    #  >> %w(a b c c c d e f).bsearch_range {|x| x <=> "c"}
+    #  => 2...5
+    #  >> %w(a b c d e f).bsearch_range {|x| x <=> "c"}
+    #  => 2...3
+    #  >> %w(a b d e f).bsearch_range {|x| x <=> "c"}
+    #  => 2...2
+    #
+    # The binary search algorithm is extracted from Jon Bentley's
+    # Programming Pearls 2nd ed. p.93
+    #
+    module BinarySearch
+      VERSION = '1.5'
+      module_function
+      #
+      # Return the lower boundary. (inside)
+      #
+      def search_lower_boundary(array, range=nil, &block)
+        range = 0 ... array.length if range == nil
+        lower = range.first - 1
+        upper = if range.exclude_end? then range.last else range.last + 1 end
+        while lower + 1 != upper
+          mid = ((lower + upper) / 2).to_i # for working with mathn.rb (Rational)
+          if yield(array[mid]) < 0
+            lower = mid
+          else
+            upper = mid
+          end
+        end
+        return upper
+      end
+      #
+      # This method performs a binary search for the FIRST occurrence that
+      # satisfies the condition given by the block and returns its index.
+      # Returns nil if not found.
+      #
+      def search_first(array, range=nil, &block)
+        boundary = search_lower_boundary(array, range, &block)
+        if boundary >= array.length || yield(array[boundary]) != 0
+          return nil
+        else
+          return boundary
+        end
+      end
+      #
+      # Return the upper boundary. (outside)
+      #
+      def search_upper_boundary(array, range=nil, &block)
+        range = 0 ... array.length if range == nil
+        lower = range.first - 1
+        upper = if range.exclude_end? then range.last else range.last + 1 end
+        while lower + 1 != upper
+          mid = ((lower + upper) / 2).to_i # for working with mathn.rb (Rational)
+          if yield(array[mid]) <= 0
+            lower = mid
+          else
+            upper = mid
+          end
+        end
+        return lower + 1 # outside of the matching range.
+      end
+      #
+      # This method performs a binary search for the LAST occurrence that
+      # satisfies the condition given by the block and returns its index.
+      # Returns nil if not found.
+      #
+      def search_last(array, range=nil, &block)
+        # `- 1' cancels the `lower + 1' in search_upper_boundary.
+        boundary = search_upper_boundary(array, range, &block) - 1
+        if (boundary <= -1 || yield(array[boundary]) != 0)
+          return nil
+        else
+          return boundary
+        end
+      end
+      #
+      # Return the search result as a Range object.
+      #
+      def search_range(array, range=nil, &block)
+        lower = search_lower_boundary(array, range, &block)
+        upper = search_upper_boundary(array, range, &block)
+        return lower ... upper
+      end
+    end
+  end
+end
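
To make the boundary semantics concrete, here is a minimal sketch of the lower-boundary search over a whole array (no `range` argument), compared with Ruby's built-in `Array#bsearch_index` (Ruby 2.3 and later). The method name `lower_boundary` is hypothetical; the loop follows the same invariant as `search_lower_boundary` above.

```ruby
# Compact restatement of search_lower_boundary for a whole array; the name
# `lower_boundary` is hypothetical and not part of ms-core.
def lower_boundary(array)
  lower = -1
  upper = array.length
  while lower + 1 != upper
    mid = (lower + upper) / 2
    if yield(array[mid]) < 0
      lower = mid   # array[mid] sorts before the target
    else
      upper = mid   # array[mid] is at or after the target
    end
  end
  upper
end

words = %w(a b c c c d e f)
p lower_boundary(words) { |x| x <=> "c" }        # => 2
p lower_boundary(%w(a b e f)) { |x| x <=> "c" }  # => 2 (insertion point)
p lower_boundary(%w(a b)) { |x| x <=> "c" }      # => 2 (one past the end)

# Ruby's built-in find-minimum search returns the same index when a
# qualifying element exists, but nil when none does.
p words.bsearch_index { |x| x >= "c" }           # => 2
p %w(a b).bsearch_index { |x| x >= "c" }         # => nil
```

The difference in the last case is why callers of the lower boundary, such as `search_first`, check `boundary >= array.length` before indexing into the array.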

data/lib/ms.rb ADDED

@@ -0,0 +1,10 @@
+module Ms
+  module_function
+  # def parse(format, path)
+  #   const = Tap::Env.instance.search(:formats, format)
+  #   raise ArgumentError, "unknown format: #{format}" unless const
+  #   const.constantize.parse(path)
+  # end
+end

metadata ADDED

@@ -0,0 +1,95 @@
+--- !ruby/object:Gem::Specification
+name: ms-core
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- John Prince
+- Simon Chiang
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2009-05-22 00:00:00 -06:00
+default_executable:
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: tap
+  type: :development
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.11.2
+    version:
+- !ruby/object:Gem::Dependency
+  name: minitest
+  type: :development
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "="
+      - !ruby/object:Gem::Version
+        version: 1.3.0
+    version:
+description:
+email: jtprince@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files:
+- changelog.txt
+- LICENSE
+- README
+files:
+- lib/ms/format/format_error.rb
+- lib/ms/id/search.rb
+- lib/ms/id/peptide.rb
+- lib/ms/id/protein.rb
+- lib/ms/mass/aa.rb
+- lib/ms/data.rb
+- lib/ms/spectrum.rb
+- lib/ms/support/binary_search.rb
+- lib/ms/mass.rb
+- lib/ms/calc.rb
+- lib/ms/data/interleaved.rb
+- lib/ms/data/simple.rb
+- lib/ms/data/lazy_string.rb
+- lib/ms/data/transposed.rb
+- lib/ms/data/lazy_io.rb
+- lib/ms.rb
+- changelog.txt
+- LICENSE
+- README
+has_rdoc: true
+homepage: http://mspire.rubyforge.org/projects/ms-core/
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: "0"
+  version:
+requirements: []
+rubyforge_project: mspire
+rubygems_version: 1.3.2
+signing_key:
+specification_version: 3
+summary: the core, shared library for mspire
+test_files: []