ms-core 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,13 @@
1
+ Copyright (c) 2006, The University of Texas at Austin("U.T. Austin"). All rights reserved.
2
+
3
+ Software by John T. Prince under the direction of Edward M. Marcotte.
4
+
5
+ By using this software the USER indicates that he or she has read, understood and will comply with the following:
6
+
7
+ U. T. Austin hereby grants USER permission to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of this software and its documentation for any purpose and without fee, provided that a full copy of this notice is included with the software and its documentation.
8
+
9
+ Title to copyright this software and its associated documentation shall at all times remain with U. T. Austin. No right is granted to use in advertising, publicity or otherwise any trademark, service mark, or the name of U. T. Austin.
10
+
11
+ This software and any associated documentation are provided "as is," and U. T. AUSTIN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING THOSE OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT USE OF THE SOFTWARE, MODIFICATIONS, OR ASSOCIATED DOCUMENTATION WILL NOT INFRINGE ANY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER INTELLECTUAL PROPERTY RIGHTS OF A THIRD PARTY. U. T. Austin, The University of Texas System, its Regents, officers, and employees shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to any claim by USER or any third party on account of or arising from the use, or inability to use, this software or its associated documentation, even if U. T. Austin has been advised of the possibility of those damages.
12
+
13
+ Submit software operation questions to: Edward M. Marcotte, Department of Chemistry and Biochemistry, U. T. Austin, Austin, Texas 78712.
data/README ADDED
@@ -0,0 +1,27 @@
1
+ = {ms-core}[http://mspire.rubyforge.org/projects/ms-core]
2
+
3
+ The core mspire[http://mspire.rubyforge.org] library for working with mass spectrometry proteomics data.
4
+ Generally, this will be used as a dependency and won't be all that useful on
5
+ its own.
6
+
7
+ == Description
8
+
9
+ * Github[http://github.com/jtprince/ms-core/tree/master]
10
+ * Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/16692-mspire/tickets]
11
+ * {Google Group}[http://groups.google.com/group/mspire-forum]
12
+
13
+ == Installation
14
+
15
+ Available as a gem on RubyForge[http://rubyforge.org/projects/mspire]. Use:
16
+
17
+ % gem install ms-core
18
+
19
+ Typically, it will be included as a dependency so it will be installed with
20
+ another gem
21
+
22
+ == Info
23
+
24
+ Copyright (c) 2006-2008, Regents of the University of Colorado and HHMI
25
+ Developers:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/], John Prince, {Edward Marcotte Lab}[http://polaris.icmb.utexas.edu/home.html], {Natalie Ahn Lab}[http://www.colorado.edu/chem/people/ahnn.html], {Howard Hughes Medical Institute}[http://www.hhmi.org/], {BYU Dept. of Chemistry and Biochemistry}[http://www.chem.byu.edu/]
26
+ Support:: CU Denver School of Medicine Deans Academic Enrichment Fund, HHMI
27
+ License:: {MIT-Style}[link:files/MIT-LICENSE.html]
data/changelog.txt ADDED
@@ -0,0 +1,196 @@
1
+
2
+ == version 0.1.7
3
+
4
+ 1. A couple of scripts and subroutines were hashing peptides but not on the file
5
+ basename. This would result in slightly incorrect results (any time there
6
+ were overlapping scan numbers in multiple datasets, only the top one would be
7
+ chosen). The results would be correct for single runs.
8
+
9
+ Output files that could be affected:
10
+ *.top_per_scan.txt
11
+ *.all_peps_per_scan.txt
12
+
13
+ Scripts that could be affected:
14
+ script/top_hit_per_scan.rb
15
+ bin/filter_spec_id.rb
16
+ script/filter-peps.rb
17
+ bin/id_precision.rb
18
+
19
+ Subroutines that were affected:
20
+ spec_id.rb (pep_probs_by_* )
21
+ spec_id.rb (top_peps_prefilter!)
22
+ proph.rb uniq_by_seqcharge
23
+ align.rb called uniq_by_seqcharge
24
+
25
+
26
+ 2. false_positive_rate.rb and protein_summary.rb (by extension) were using
27
+ number of true positives on the x axis while in reality I was plotting the
28
+ number of hits. I've updated x axis labels to reflect this change. In
29
+ addition, since the term 'false positive rate' has such a distinct definition
30
+ in classical ROC plots and binary statistics, I've decided to work primarily
31
+ in terms of precision (TP/(TP+FP)). I've purged the terms 'False Positive
32
+ Rate' and 'FPR' from the package. It's been suggested that FP/(TP+FP) be
33
+ called the False Positive Predictive Rate (FPPR). I will probably implement
34
+ this in a future release.
35
+
36
+ == version 0.2.0
37
+
38
+ Revamped the way SpecID works (it is now mixed-in).
39
+ Added support for modifications to bioworks_to_pepxml.rb
40
+ Can read .srf files (nearly interchangeable with bioworks files)
41
+ Redid filter.rb
42
+
43
+ == version 0.2.1
44
+
45
+ minor bugfix
46
+
47
+ == version 0.2.2
48
+
49
+ made compatible with Bioworks fasta file reverser and updated tutorial.
50
+ Killed classify_by_prefix routine in favor of classify_by_false_flag which has
51
+ a prefix option
52
+
53
+ == version 0.2.3
54
+
55
+ in protein_summary.rb added handling for proteins with no annotation. (either
56
+ display NA or use gi2annnot to grab them from NCBI)
57
+
58
+ == version 0.2.5
59
+
60
+ renamed prep_list in roc (potential breaks in code)
61
+
62
+ == version 0.2.6
63
+
64
+ 1. Massive refactorization of filtering and validation. Validation objects are
65
+ created and then can be used to validate just about anything.
66
+ 2. Massive redo of the parsing of MS runs. Can parse mzXML v1, v2.X
67
+ (including readw broken output), and mzData (even Thermo's broken output).
68
+ 4. Moved all tests to specs (rspec).
69
+ 5. Can read gradient programs off of .meth or .RAW files (both Xcal 1.X and
70
+ 2.X)
71
+
72
+ Bugfixes:
73
+ 1. The search_summary 'base_name' in pepxml output was incorrect (this did not
74
+ appear to influence our analyses, however). Fixed.
75
+ 2. Enzymes with no exceptions (e.g., cuts at KR) would report one too many
76
+ missed cleavages if the last amino acid was a cut point. Fixed.
77
+
78
+ == version 0.2.7
79
+
80
+ 1. In conversion from bioworks to pepxml, the default was trypsin (KR/P).
81
+ Now, the sample enzyme is set explicitly from the params file and the option
82
+ is not available. This can give more accurate pepxml files than from
83
+ previous depending on your enzyme.
84
+
85
+ == version 0.2.9
86
+
87
+ 1. Added support for phobius transmembrane predictions
88
+ 2. have filter_and_validate.rb working well (multiple validators allowed).
89
+ 3. Can read bioworks 3.3.1 .srf files (.srf version 3.5 files)
90
+ 4. Added a bias validator
91
+
92
+ == version 0.2.10
93
+
94
+ 1. Fixed --hits_separate flag in spec_id/filter
95
+
96
+ == version 0.2.11
97
+
98
+ 1. Added prob precision support and reorganized filter_and_validate libs
99
+
100
+ == version 0.2.12
101
+
102
+ 1. Fixed bug in transmem for prob and others.
103
+ 2. Can use axml (XMLParser based) or libxml depending on availability
104
+
105
+ == version 0.2.13
106
+
107
+ 1. Fixed issue with --hits_separate
108
+ 2. filter_and_validate.rb requires decoy validator if decoy proteins
109
+ (refactored code)
110
+
111
+ == version 0.2.14
112
+
113
+ 1. Can read PeptideProphet files (should be able to read pepxml files, too)
114
+ 2. API change: Some slight modifications to the Sequest::PepXML object
115
+ interfaces and implementations (using ArrayClass)
116
+
117
+ == version 0.2.15
118
+
119
+ 1. can convert srf files to sqt files
120
+
121
+ == version 0.3.0
122
+
123
+ 1. IMPORTANT BUG FIX: protein reporting in srf files is correct now (proteins after the first protein were being assigned to the last hit in an out file).
124
+ 2. SQT export is correct and works at least on 3.2 and 3.3.1.
125
+
126
+ == version 0.3.1
127
+
128
+ 1. Bug fix in srf filtering (num_hits adjusted)
129
+
130
+ == version 0.3.2
131
+
132
+ 1. Uses sequest peptide_mass_tolerance filter on srf group files by default
133
+ now.
134
+
135
+ == version 0.3.3
136
+
137
+ 1. Worked out minor kinks in prob_precision.rb
138
+
139
+ == version 0.3.4
140
+
141
+ 1. filters >= +3 charged ions now.
142
+
143
+ == version 0.3.5
144
+
145
+ 1. fixed creation of background distribution in validators (hash_by base_name,
146
+ first_scan, charge now)
147
+
148
+ == version 0.3.6
149
+
150
+ 1. split off bad_aa_est from bad_aa
151
+
152
+ == version 0.3.7
153
+
154
+ 1. can deal with No_Enzyme searches now (while still capable of setting
155
+ sample_enzyme)
156
+
157
+ == version 0.3.8
158
+
159
+ 1. can set a decoy to target ratio for decoy validation
160
+ 2. added mass calculator in Mass::Calculator
161
+
162
+ == version 0.3.9
163
+
164
+ 1. doesn't clobber mzdata filename in ms_to_lmat.rb conversion
165
+
166
+ == version 0.3.10
167
+
168
+ 1. added run_percolator.rb script which makes running multiple files easy
169
+
170
+ == version 0.3.11
171
+
172
+ 1. faster sensing of bad scan tags in mzXML v. 2.0 files
173
+ 2. implemented lazy evaluation of spectrum in 2 different ways allowing much
174
+ larger files to be parsed
175
+
176
+ == version 0.4.0
177
+
178
+ 1. ** INTERFACE CHANGE: each scan can only have one precursor (used to be an array)
179
+ 2. ** INTERFACE CHANGE: spectrum mz and intensity data accessed with mzs and intensities
180
+ 3. lazy eval working on mzData
181
+ 4. mzData not necessarily guaranteed to have precursor intensities on lazy
182
+ eval methods (however, the method intensity_at_mz will still work (causing
183
+ evaluation))
184
+
185
+ == version 0.4.1
186
+
187
+ 1. added support for reading mzXML version 3.0 (may fail in some cases)
188
+
189
+ == version 0.4.2
190
+
191
+ 1. added MS::MSRun.open method
192
+ 2. added method to write dta files from SRF
193
+
194
+ == version 0.4.3
195
+
196
+ 1. added to_mfg_file from SRF
data/lib/ms/calc.rb ADDED
@@ -0,0 +1,32 @@
1
+ module Ms
2
+ module Calc
3
+ module_function
4
+
5
+ #
6
+ # ppm calculations... maybe use RUnit
7
+ #
8
+
9
+ def ppm_tol_at(mz, ppm)
10
+ 1.0 * mz * ppm / 10**6
11
+ end
12
+
13
+ def ppm_span_at(mz, ppm)
14
+ tol = ppm_tol_at(mz, ppm)
15
+ [mz-tol, mz+tol]
16
+ end
17
+
18
+ def ppm_range_at(mz, ppm)
19
+ mz = mz.to_f
20
+ tol = ppm_tol_at(mz, ppm)
21
+ mz-tol...mz+tol
22
+ end
23
+
24
+
25
+ # Rounds n to the specified precision (ie number of decimal places)
26
+ # def round(n, precision)
27
+ # factor = 10**precision.to_i
28
+ # (n * factor).round.to_f / factor
29
+ # end
30
+
31
+ end
32
+ end
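The ppm helpers above reduce to one formula: tolerance = mz * ppm / 10^6. The following is a minimal standalone sketch of that arithmetic. The module name `PpmSketch` is hypothetical and is not part of ms-core; it only mirrors the method names shown in the diff.

```ruby
# Hypothetical stand-in for the ppm helpers shown above (not the gem itself).
module PpmSketch
  module_function

  # Absolute tolerance, in m/z units, of a ppm window at the given m/z.
  def ppm_tol_at(mz, ppm)
    mz.to_f * ppm / 1_000_000
  end

  # [low, high] bounds of the ppm window centred on mz.
  def ppm_span_at(mz, ppm)
    tol = ppm_tol_at(mz, ppm)
    [mz - tol, mz + tol]
  end
end

# A 10 ppm window at m/z 1000 is 0.01 m/z units wide on each side.
tol = PpmSketch.ppm_tol_at(1000.0, 10)
low, high = PpmSketch.ppm_span_at(1000.0, 10)
puts "tolerance: #{tol}, window: #{low}..#{high}"
```

Converting `mz` to a float first avoids integer division when both arguments are integers.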
@@ -0,0 +1,60 @@
1
+ require 'ms/data/simple'
2
+
3
+ module Ms
4
+ module Data
5
+ module_function
6
+
7
+ # Initializes a new interleaved data array.
8
+ def new_interleaved(unresolved_data, n=2)
9
+ Interleaved.new(unresolved_data, n)
10
+ end
11
+
12
+ # An Interleaved data array lazily evaluates its unresolved data as
13
+ # an interleaved array of n members. The unresolved data is evaluated
14
+ # into an array using to_a.
15
+ #
16
+ # i = Ms::Data::Interleaved.new([1,4,2,5,3,6])
17
+ # i.unresolved_data # => [1,4,2,5,3,6]
18
+ # i.data # => []
19
+ # i[0] # => [1,2,3]
20
+ # i[1] # => [4,5,6]
21
+ # i.data # => [[1,2,3], [4,5,6]]
22
+ #
23
+ class Interleaved < Simple
24
+ attr_reader :n
25
+
26
+ def initialize(unresolved_data, n=2)
27
+ @n = n
28
+ super(unresolved_data)
29
+ end
30
+
31
+ def [](index)
32
+ resolve.data[index]
33
+ end
34
+
35
+ def resolved?
36
+ !@data.empty?
37
+ end
38
+
39
+ def resolve
40
+ return(self) if resolved?
41
+
42
+ unresolved_data = @unresolved_data.to_a
43
+
44
+ unless unresolved_data.length % n == 0
45
+ raise ArgumentError, "interleaved data must have a number of elements evenly divisible by n (#{n})"
46
+ end
47
+
48
+ n.times { @data << [] }
49
+ map = @data * (unresolved_data.length/n)
50
+
51
+ unresolved_data.each_with_index do |item, i|
52
+ map[i] << item
53
+ end
54
+
55
+ self
56
+ end
57
+
58
+ end
59
+ end
60
+ end
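The comments above describe turning a flat interleaved array into n parallel arrays. One compact way to express the same rearrangement, assuming plain Ruby arrays, is to slice and transpose. The helper name `deinterleave` is hypothetical; it is a sketch of the idea, not the gem's implementation.

```ruby
# Hypothetical helper: splits [a0, b0, a1, b1, ...] into [[a0, a1, ...], [b0, b1, ...]].
def deinterleave(flat, n = 2)
  unless (flat.length % n).zero?
    raise ArgumentError, "interleaved data must have a length divisible by n (#{n})"
  end
  # Group into records of n members, then swap rows and columns.
  flat.each_slice(n).to_a.transpose
end

pairs   = deinterleave([1, 4, 2, 5, 3, 6])        # => [[1, 2, 3], [4, 5, 6]]
triples = deinterleave([1, 2, 3, 4, 5, 6], 3)     # => [[1, 4], [2, 5], [3, 6]]
p pairs, triples
```

Unlike the lazy class in the diff, this sketch resolves everything eagerly; it is only meant to show the shape of the result.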
@@ -0,0 +1,73 @@
1
+ module Ms
2
+ module Data
3
+
4
+ # LazyIO represents data to be lazily read from an IO. To read the data
5
+ # from the IO, either string or to_a may be called (to_a unpacks the
6
+ # string into an array using the decode_format and unpack_format).
7
+ #
8
+ # LazyIO is a suitable unresolved_data source for Ms::Data formats.
9
+ class LazyIO
10
+ NETWORK_FLOAT = 'g*'
11
+ NETWORK_DOUBLE = 'G*'
12
+ LITTLE_ENDIAN_FLOAT = 'e*'
13
+ LITTLE_ENDIAN_DOUBLE = 'E*'
14
+ BASE_64 = 'm'
15
+
16
+ class << self
17
+ # Returns the unpacking code for the given precision (32 or 64-bit)
18
+ # and network order (true for big-endian).
19
+ def unpack_code(precision, network_order)
20
+ case precision
21
+ when 32 then network_order ? NETWORK_FLOAT : LITTLE_ENDIAN_FLOAT
22
+ when 64 then network_order ? NETWORK_DOUBLE : LITTLE_ENDIAN_DOUBLE
23
+ else raise ArgumentError, "unknown precision (should be 32 or 64): #{precision}"
24
+ end
25
+ end
26
+ end
27
+
28
+ # The IO from which string is read
29
+ attr_reader :io
30
+
31
+ # The start index for reading string
32
+ attr_reader :start_index
33
+
34
+ # The number of bytes to be read from io when evaluating string
35
+ attr_reader :num_bytes
36
+
37
+ # Indicates the unpacking format
38
+ attr_reader :unpack_format
39
+
40
+ # Indicates a decoding format, may be false to unpack string
41
+ # without decoding.
42
+ attr_reader :decode_format
43
+
44
+ def initialize(io, start_index=io.pos, num_bytes=nil, unpack_format=NETWORK_FLOAT, decode_format=BASE_64)
45
+ @io = io
46
+ @start_index = start_index
47
+ @num_bytes = num_bytes
48
+ @unpack_format = unpack_format
49
+ @decode_format = decode_format
50
+ end
51
+
52
+ # Positions io at start_index and reads a string of num_bytes length.
53
+ # The string is newly read from io each time string is called.
54
+ def string
55
+ io.pos = start_index unless io.pos == start_index
56
+ io.read(num_bytes)
57
+ end
58
+
59
+ # Resets the cached array (returned by to_a) so that the array will
60
+ # be re-read from io.
61
+ def reset
62
+ @array = nil
63
+ end
64
+
65
+ # Reads string and unpacks using decode_format and unpack_code. The
66
+ # array is cached internally; to re-read the array, use reset.
67
+ def to_a
68
+ @array ||= (decode_format ? string.unpack(decode_format)[0] : string).unpack(unpack_format)
69
+ end
70
+
71
+ end
72
+ end
73
+ end
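The decode and unpack steps described for LazyIO can be tried with core Ruby alone. This sketch assumes the defaults shown above ('m' for Base64, 'g*' for network-order 32-bit floats) and uses only `Array#pack` and `String#unpack`; the sample values are made up for illustration.

```ruby
# Made-up sample data: interleaved mz, intensity pairs.
values = [1.0, 4.0, 2.0, 5.0]

# Encode: pack as big-endian 32-bit floats, then Base64-encode the bytes.
encoded = [values.pack('g*')].pack('m')

# Decode: reverse the two steps, Base64 first, then the float unpacking.
decoded = encoded.unpack('m')[0].unpack('g*')

puts encoded
p decoded
```

The sample values are exactly representable as 32-bit floats, so the round trip returns them unchanged; arbitrary doubles would lose precision with 'g*'.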
@@ -0,0 +1,15 @@
1
+ require 'ms/data/lazy_io'
2
+ require 'stringio'
3
+
4
+ module Ms
5
+ module Data
6
+
7
+ # LazyString is a LazyIO initialized from a string, which is converted into
8
+ # a StringIO.
9
+ class LazyString < LazyIO
10
+ def initialize(string, unpack_format=NETWORK_FLOAT, decode_format=BASE_64)
11
+ super(StringIO.new(string), 0, string.length, unpack_format, decode_format)
12
+ end
13
+ end
14
+ end
15
+ end
@@ -0,0 +1,59 @@
1
+ module Ms
2
+ module Data
3
+ module_function
4
+
5
+ # Initializes a new simple data array.
6
+ def new_simple(unresolved_data)
7
+ Simple.new(unresolved_data)
8
+ end
9
+
10
+ # A Simple data array that lazily evaluates unresolved_data, and
11
+ # each member of unresolved_data using to_a:
12
+ #
13
+ # class LazyObject
14
+ # attr_reader :to_a
15
+ # def initialize(array)
16
+ # @to_a = array
17
+ # end
18
+ # end
19
+ #
20
+ # a = LazyObject.new([1,2,3])
21
+ # b = LazyObject.new([4,5,6])
22
+ # s = Ms::Data::Simple.new([a, b])
23
+ #
24
+ # s.unresolved_data # => [a, b]
25
+ # s.data # => []
26
+ # s[0] # => [1,2,3]
27
+ # s[1] # => [4,5,6]
28
+ # s.data # => [[1,2,3], [4,5,6]]
29
+ #
30
+ class Simple
31
+ # The underlying resolved data store.
32
+ attr_reader :data
33
+
34
+ # The underlying unresolved data store.
35
+ attr_reader :unresolved_data
36
+
37
+ def initialize(unresolved_data)
38
+ @data = []
39
+ @unresolved_data = unresolved_data
40
+ end
41
+
42
+ def [](index)
43
+ @data[index] ||= @unresolved_data.to_a[index].to_a
44
+ end
45
+
46
+ def resolve
47
+ 0.upto(@unresolved_data.length - 1) do |index|
48
+ self[index]
49
+ end unless resolved?
50
+
51
+ self
52
+ end
53
+
54
+ def resolved?
55
+ @data.compact.length == @unresolved_data.length
56
+ end
57
+ end
58
+ end
59
+ end
@@ -0,0 +1,41 @@
1
+ require 'ms/data/simple'
2
+
3
+ module Ms
4
+ module Data
5
+ module_function
6
+
7
+ # Initializes a new transposed data array.
8
+ def new_transposed(unresolved_data)
9
+ Transposed.new(unresolved_data)
10
+ end
11
+
12
+ # A Transposed data array lazily evaluates its unresolved data as
13
+ # a transposed array. The unresolved data is evaluated
14
+ # into an array using to_a.
15
+ #
16
+ # t = Ms::Data::Transposed.new([[1,4],[2,5],[3,6]])
17
+ #
18
+ # t.unresolved_data # => [[1,4],[2,5],[3,6]]
19
+ # t.data # => []
20
+ # t[0] # => [1,2,3]
21
+ # t[1] # => [4,5,6]
22
+ # t.data # => [[1,2,3], [4,5,6]]
23
+ #
24
+ class Transposed < Simple
25
+
26
+ def [](index)
27
+ resolve.data[index]
28
+ end
29
+
30
+ def resolved?
31
+ !@data.empty?
32
+ end
33
+
34
+ def resolve
35
+ @data = @unresolved_data.to_a.transpose unless resolved?
36
+ self
37
+ end
38
+
39
+ end
40
+ end
41
+ end
data/lib/ms/data.rb ADDED
@@ -0,0 +1,57 @@
1
+ require 'ms/data/interleaved'
2
+ require 'ms/data/transposed'
3
+
4
+ module Ms
5
+
6
+ # The Data module contains a number of classes providing a standard way to
7
+ # resolve various data storage formats into a 'simple' data array.
8
+ #
9
+ # type format
10
+ # simple [[mzs,...], [intensities...]]
11
+ # transposed [[mz,intensity], [mz,intensity], ...]
12
+ # interleaved [mz,intensity,mz,intensity,...]
13
+ #
14
+ # For instance:
15
+ #
16
+ # s = Data.new([[1,2,3], [4,5,6]], :simple)
17
+ # s.resolve.data # => [[1,2,3], [4,5,6]]
18
+ #
19
+ # t = Data.new([[1,4],[2,5],[3,6]], :transposed)
20
+ # t.resolve.data # => [[1,2,3], [4,5,6]]
21
+ #
22
+ # i = Data.new([1,4,2,5,3,6], :interleaved)
23
+ # i.resolve.data # => [[1,2,3], [4,5,6]]
24
+ #
25
+ # Data is always resolved by calling to_a on the unresolved data object
26
+ # and then rearranging as needed (in the case of simple data, to_a is
27
+ # also called on each member of the unresolved data array). This lazy
28
+ # resolution allows the use of non-array unresolved_data objects such
29
+ # as Data::LazyString:
30
+ #
31
+ # str = [[1,4,2,5,3,6].pack("g*")].pack("m")
32
+ # unresolved_data = Data::LazyString.new(str)
33
+ #
34
+ # i = Data.new(unresolved_data, :interleaved)
35
+ # i.resolve.data # => [[1,2,3], [4,5,6]]
36
+ #
37
+ # Obviously the big advantage of lazy data resolution is that Data objects
38
+ # may be instantiated cheaply while expensive operations like unpacking and
39
+ # rearrangement may be put off or not executed at all.
40
+ #
41
+ module Data
42
+ module_function
43
+
44
+ # Initializes a new data array of the specified type by forwarding
45
+ # data to the "new_<type>" method.
46
+ #
47
+ # simple = Ms::Data.new([[1,2,3], [4,5,6]], :simple)
48
+ # simple.class # => Ms::Data::Simple
49
+ #
50
+ # interleaved = Ms::Data.new([1,4,2,5,3,6], :interleaved)
51
+ # interleaved.class # => Ms::Data::Interleaved
52
+ #
53
+ def new(data, type=:simple)
54
+ send("new_#{type}", data)
55
+ end
56
+ end
57
+ end
@@ -0,0 +1,12 @@
1
+ module Ms
2
+ module Format
3
+ class FormatError < StandardError
4
+ attr_accessor :str
5
+
6
+ def initialize(msg, str)
7
+ super(msg)
8
+ @str = str
9
+ end
10
+ end
11
+ end
12
+ end
@@ -0,0 +1,76 @@
1
+
2
+ module Ms ; end
3
+ module Ms::Id ; end
4
+
5
+ # A 'sequence' is a notation of a peptide that includes the leading and
6
+ # trailing amino acid after cleavage (e.g., K.PEPTIDER.E or -.STARTK.L )
7
+ # and may contain post-translational modification information.
8
+ #
9
+ # 'aaseq' is the amino acid sequence of just the peptide with no leading or
10
+ # trailing notation (e.g., PEPTIDER or LAKKLY)
11
+ module Ms::Id::Peptide
12
+ Nonstandard_AA_re = /[^A-Z\.\-]/
13
+
14
+ class << self
15
+
16
+ def sequence_to_aaseq(sequence)
17
+ after_removed = remove_non_amino_acids(sequence)
18
+ pieces = after_removed.split('.')
19
+ case pieces.size
20
+ when 3
21
+ pieces[1]
22
+ when 2
23
+ if pieces[0].size > 1 ## N termini
24
+ pieces[0]
25
+ else ## C termini
26
+ pieces[1]
27
+ end
28
+ when 1 ## this must be a parse error!
29
+ pieces[0] ## which is the peptide itself
30
+ else
31
+ abort "bad peptide sequence: #{sequence}"
32
+ end
33
+ end
34
+
35
+ # removes non standard amino acids specified by Nonstandard_AA_re
36
+ def remove_non_amino_acids(sequence)
37
+ sequence.gsub(Nonstandard_AA_re, '')
38
+ end
39
+
40
+ # remove non amino acids and split the sequence
41
+ def prepare_sequence(sequence)
42
+ nv = remove_non_amino_acids(sequence)
43
+ split_sequence(nv)
44
+ end
45
+
46
+ # Returns prev, peptide, next from sequence. Parse errors return
47
+ # nil,nil,nil
48
+ # R.PEPTIDE.A # -> R, PEPTIDE, A
49
+ # R.PEPTIDE.- # -> R, PEPTIDE, -
50
+ # PEPTIDE.A # -> -, PEPTIDE, A
51
+ # A.PEPTIDE # -> A, PEPTIDE, -
52
+ # PEPTIDE # -> nil,nil,nil
53
+ def split_sequence(sequence)
54
+ peptide_prev_aa = ""; peptide = ""; peptide_next_aa = ""
55
+ pieces = sequence.split('.')
56
+ case pieces.size
57
+ when 3
58
+ peptide_prev_aa, peptide, peptide_next_aa = *pieces
59
+ when 2
60
+ if pieces[0].size > 1 ## N termini
61
+ peptide_prev_aa, peptide, peptide_next_aa = '-', pieces[0], pieces[1]
62
+ else ## C termini
63
+ peptide_prev_aa, peptide, peptide_next_aa = pieces[0], pieces[1], '-'
64
+ end
65
+ when 1 ## this must be a parse error!
66
+ peptide_prev_aa, peptide, peptide_next_aa = nil,nil,nil
67
+ when 0
68
+ peptide_prev_aa, peptide, peptide_next_aa = nil,nil,nil
69
+ end
70
+ return peptide_prev_aa, peptide, peptide_next_aa
71
+ end
72
+
73
+ end
74
+
75
+
76
+ end
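The splitting rules documented for split_sequence can be checked in isolation. This is a standalone sketch under the same conventions ('.' separates flanking residues, '-' marks a terminus); the function name `split_flanked` is hypothetical and not part of ms-core.

```ruby
# Hypothetical helper returning [prev_aa, peptide, next_aa] for a flanked sequence.
def split_flanked(sequence)
  pieces = sequence.split('.')
  case pieces.size
  when 3
    pieces
  when 2
    if pieces[0].size > 1
      ['-', pieces[0], pieces[1]]   # no leading residue given
    else
      [pieces[0], pieces[1], '-']   # no trailing residue given
    end
  else
    [nil, nil, nil]                 # treated as a parse error
  end
end

full     = split_flanked('K.PEPTIDER.E')
no_prev  = split_flanked('PEPTIDE.A')
no_next  = split_flanked('A.PEPTIDE')
unparsed = split_flanked('PEPTIDE')
p full, no_prev, no_next, unparsed
```

Note that the two-piece case is decided purely by the length of the first piece, so a one-residue peptide with a trailing residue would be read the other way round.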
@@ -0,0 +1,17 @@
1
+
2
+ module Ms ; end
3
+ module Ms::Id ; end
4
+
5
+ module Ms::Id::Protein
6
+
7
+ class << self
8
+ end
9
+
10
+ # gives the information up until the first space or carriage return.
11
+ # Assumes the protein can respond_to? :reference
12
+ def first_entry
13
+ reference.split(/[\s\r]/)[0]
14
+ end
15
+
16
+ end
17
+
@@ -0,0 +1,110 @@
1
+
2
+ module Ms
3
+ module Id
4
+
5
+ module Search
6
+ attr_accessor :prots
7
+ attr_accessor :peps
8
+
9
+ def protein_class
10
+ self.const_get("Prot")
11
+ end
12
+
13
+
14
+ # returns an array of peptide_hits and protein_hits that are linked to
15
+ # one another. NOTE: this will update peptide and protein
16
+ # hits :prots and :peps attributes respectively). Assumes that each search
17
+ # responds to :peps, each peptide responds to :prots and each protein to
18
+ # :peps. Can be done on a single file to restore protein/peptide
19
+ # linkages to their original single-file state.
20
+ # Assumes the protein is initialized with (reference, peptide_ar)
21
+ #
22
+ # yields the protein that will become the template for a new protein
23
+ # and expects a new protein hit
24
+ def merge!(ar_of_peptide_hit_arrays)
25
+ all_peptide_hits = []
26
+ reference_hash = {}
27
+ ar_of_peptide_hit_arrays.each do |peptide_hits|
28
+ all_peptide_hits.push(*peptide_hits)
29
+ peptide_hits.each do |pep|
30
+ pep.prots.each do |prot|
31
+ ref = prot.reference
32
+ if reference_hash.key? ref
33
+ reference_hash[ref].peps << pep
34
+ reference_hash[ref]
35
+ else
36
+ reference_hash[ref] = yield(prot, [pep])
37
+ end
38
+ end
39
+ end
40
+ end
41
+ [all_peptide_hits, reference_hash.values]
42
+ end
43
+
44
+ end
45
+
46
+
47
+ module SearchGroup
48
+ include Search
49
+
50
+ # an array of search objects
51
+ attr_accessor :searches
52
+
53
+ # the group's file extension (with no leading period)
54
+ def extension
55
+ 'grp'
56
+ end
57
+
58
+ def search_class
59
+ Search
60
+ end
61
+
62
+ # a simple formatted file with paths to the search files
63
+ def to_paths(file)
64
+ IO.readlines(file).grep(/\w/).reject {|v| v =~ /^#/}.map {|v| v.chomp }
65
+ end
66
+
67
+ def from_file(file)
68
+ from_filenames(to_paths(file))
69
+ end
70
+
71
+
72
+ def from_filenames(filenames)
73
+ filenames.each do |file|
74
+ if !File.exist? file
75
+ message = "File: #{file} does not exist!\n"
76
+ message << "perhaps you need to modify the file with file paths"
77
+ abort message
78
+ end
79
+ @searches << search_class.new(file)
80
+ end
81
+ end
82
+
83
+
84
+ # takes an array of filenames or a single search filename (with
85
+ # extension defined by 'extension') or an array of objects passes any
86
+ # arguments to the initializer for each search
87
+ # the optional block yields the object for further processing
88
+ def initialize(arg=nil, opts={})
89
+ @peps = []
90
+ @reference_hash = {}
91
+ @searches = []
92
+
93
+ if arg
94
+ if arg.is_a?(String) && arg =~ /\.#{Regexp.escape(extension)}$/
95
+ from_file(arg)
96
+ elsif arg.is_a?(Array) && arg.first.is_a?(String)
97
+ from_filenames(arg)
98
+ elsif arg.is_a?(Array)
99
+ @searches = arg
100
+ else
101
+ raise ArgumentError, "must be file, array of filenames, or array of objs"
102
+ end
103
+ @searches << search_class.new(file, opts)
104
+ end
105
+ yield(self) if block_given?
106
+ end
107
+
108
+ end
109
+ end
110
+ end
data/lib/ms/mass/aa.rb ADDED
@@ -0,0 +1,136 @@
1
+ require 'ms/mass'
2
+
3
+ module Ms
4
+ module Mass
5
+
6
+ # A module for working with commonly used residue masses in proteomics.
7
+ #
8
+ # require 'ms/mass/aa'
9
+ # include Ms::Mass::AA
10
+ #
11
+ # MONO['A'] # => 71.0371137878
12
+ # AVG['A'] # => 71.0779
13
+ #
14
+ # # or use symbols
15
+ # MONO[:A] # => 71.0371137878
16
+ #
17
+ # This module is built on masses generated from the excellent {'molecules'
18
+ # library}[http://github.com/bahuvrihi/molecules/tree/master]. See that
19
+ # library for more serious work with masses:
20
+ #
21
+ # gem install molecules
22
+ module AA
23
+ Ms::Mass.constants.reject {|v| v.to_s == 'AA' }.each do |const|
24
+ const_set(const, Ms::Mass.const_get(const))
25
+ end
26
+
27
+ # These are included here to offer maximum functionality
28
+ MOLECULES_MONO_UNSUPPORTED = {
29
+ :B => 172.048405, # average of aspartic acid and asparagine
30
+ :X => 118.805716, # the average of the mono masses of the 20 amino acids
31
+ :* => 118.805716, # same as X
32
+ :Z => (129.04259 + 128.05858) / 2, # average glutamic acid and glutamine
33
+ #:J => nil,
34
+ }
35
+ MOLECULES_AVG_UNSUPPORTED = {
36
+ :B => 172.1405, # average of aspartic acid and asparagine
37
+ :X => 118.88603, # the average of the masses of the 20 amino acids
38
+ :* => 118.88603, # same as X
39
+ :Z => (129.1155+ 128.1307) / 2, # average glutamic acid and glutamine
40
+ #:J => nil,
41
+ }
42
+
43
+ # generated from molecules version 0.1.3:
44
+ MOLECULES_MONO = {
45
+ :A => 71.0371137878,
46
+ :C => 103.0091844778,
47
+ :D => 115.026943032,
48
+ :E => 129.0425930962,
49
+ :F => 147.0684139162,
50
+ :G => 57.0214637236,
51
+ :H => 137.0589118624,
52
+ :I => 113.0840639804,
53
+ :K => 128.0949630177,
54
+ :L => 113.0840639804,
55
+ :M => 131.0404846062,
56
+ :N => 114.0429274472,
57
+ :O => 211.1446528645,
58
+ :P => 97.052763852,
59
+ :Q => 128.0585775114,
60
+ :R => 156.1011110281,
61
+ :S => 87.0320284099,
62
+ :T => 101.0476784741,
63
+ :U => 150.9536355878,
64
+ :V => 99.0684139162,
65
+ :W => 186.0793129535,
66
+ :Y => 163.0633285383,
67
+ }
68
+
69
+ MONO = MOLECULES_MONO_UNSUPPORTED.merge MOLECULES_MONO
70
+
71
+ # generated from molecules version 0.1.3:
72
+ MOLECULES_AVG = {
73
+ :A => 71.0779,
74
+ :C => 103.1429,
75
+ :D => 115.0874,
76
+ :E => 129.11398,
77
+ :F => 147.17386,
78
+ :G => 57.05132,
79
+ :H => 137.13928,
80
+ :I => 113.15764,
81
+ :K => 128.17228,
82
+ :L => 113.15764,
83
+ :M => 131.19606,
84
+ :N => 114.10264,
85
+ :O => 211.28076,
86
+ :P => 97.11518,
87
+ :Q => 128.12922,
88
+ :R => 156.18568,
89
+ :S => 87.0773,
90
+ :T => 101.10388,
91
+ :U => 150.0379,
92
+ :V => 99.13106,
93
+ :W => 186.2099,
94
+ :Y => 163.17326,
95
+ }
96
+
97
+ AVG = MOLECULES_AVG_UNSUPPORTED.merge MOLECULES_AVG
98
+
99
+ [AVG, MONO].each do |hash|
100
+ hash.each {|k,v| hash[k.to_s] = v }
101
+ end
102
+
103
+ # returns a hash based on the molecules library of amino acid residues.
104
+ # type is :mono or :avg
105
+ def self.mass_index(type=:mono)
106
+ require 'molecules'
107
+ hash = {}
108
+ ('A'..'Z').each do |letter|
109
+ if res = Molecules::Libraries::Residue[letter]
110
+ hash[letter] =
111
+ if type == :mono
112
+ res.mass
113
+ elsif type == :avg
114
+ res.mass {|v| v.std_atomic_weight.value }
115
+ else
116
+ raise ArgumentError, "type must be :mono or :avg"
117
+ end
118
+ end
119
+ end
120
+ hash
121
+ end
122
+
123
+ # prints a MONO or AVG hash for inclusion in ruby code
124
+ # type can be :mono or :avg
125
+ def self.print_mass_index(type=:mono)
126
+ puts "#{type.to_s.upcase} = {"
127
+ mass_index(type).sort.each do |k,v|
128
+ puts ":#{k} => #{v},"
129
+ end
130
+ puts "}"
131
+ end
132
+
133
+ end
134
+ end
135
+ end
136
+
data/lib/ms/mass.rb ADDED
@@ -0,0 +1,9 @@
1
+
2
+ module Ms
3
+ module Mass
4
+ MASCOT_H_PLUS = 1.007276
5
+ H_PLUS = 1.00727646677 # need to verify this against HYDROGEN - ELECTRON
6
+ PROTON = H_PLUS
7
+ end
8
+ end
9
+
@@ -0,0 +1,157 @@
1
+ module Ms
2
+ class Spectrum
3
+ # The underlying data store.
4
+ attr_reader :data
5
+
6
+ # Associated headers
7
+ attr_reader :headers
8
+
9
+ def initialize(data, headers={})
10
+ @data = data
11
+ @headers = headers
12
+ end
13
+
14
+ def self.from_peaks(ar_of_doublets)
15
+ _mzs = []
16
+ _ints = []
17
+ ar_of_doublets.each do |mz, int|
18
+ _mzs << mz
19
+ _ints << int
20
+ end
21
+ self.new([_mzs, _ints])
22
+ end
23
+
24
+ # An array of the mz data.
25
+ def mzs
26
+ @data[0]
27
+ end
28
+
29
+ # An array of the intensities data, corresponding to mzs.
30
+ def intensities
31
+ @data[1]
32
+ end
33
+
34
+ def mzs_and_intensities
35
+ [@data[0], @data[1]]
36
+ end
37
+
38
+ def [](array_index)
39
+ [mzs[array_index], intensities[array_index]]
40
+ end
41
+
42
+ # yields(mz, inten) across the spectrum, or array of doublets if no block
43
+ def peaks(&block)
44
+ (m, i) = mzs_and_intensities
45
+ m.zip(i, &block)
46
+ end
47
+
48
+ alias_method :each, :peaks
49
+ alias_method :each_peak, :peaks
50
+
51
+ # if the mzs and intensities are the same then the spectra are considered
52
+ # equal
53
+ def ==(other)
54
+ mzs == other.mzs && intensities == other.intensities
55
+ end
56
+
57
+ # returns a new spectrum whose intensities have been normalized by the tic
58
+ def normalize
59
+ tic = self.intensities.inject(0.0) {|sum,int| sum + int }
60
+ Ms::Spectrum.new([self.mzs, self.intensities.map {|v| v / tic }])
61
+ end
62
+
63
+ # uses index function and returns the intensity at that value
64
+ def intensity_at_mz(mz)
65
+ if x = index(mz)
66
+ intensities[x]
67
+ else
68
+ nil
69
+ end
70
+ end
71
+
72
+ # returns the index of the first value matching that m/z. the argument m/z
73
+ # may be less precise than the actual m/z (rounding to the same precision
74
+ # given) but must be at least integer precision (after rounding)
75
+ # implemented as binary search (bsearch from the web)
76
+ def index(mz)
77
+ mz_ar = mzs
78
+ return_val = nil
79
+ ind = mz_ar.bsearch_lower_boundary{|x| x <=> mz }
80
+ if mz_ar[ind] == mz
81
+ return_val = ind
82
+ else
83
+ # do a rounding game to see which one is it, or nil
84
+ # find all the values rounding to the same integer in the locale
85
+ # test each one fully in turn
86
+ mz = mz.to_f
87
+ mz_size = mz_ar.size
88
+ if ((ind < mz_size) and equal_after_rounding?(mz_ar[ind], mz))
89
+ return_val = ind
90
+ else # run the loop
91
+ up = ind
92
+ loop do
93
+ up += 1
94
+ if up >= mz_size
95
+ break
96
+ end
97
+ mz_up = mz_ar[up]
98
+ if (mz_up.ceil - mz.ceil >= 2)
99
+ break
100
+ else
101
+ if equal_after_rounding?(mz_up, mz)
102
+ return_val = up
103
+ return return_val
104
+ end
105
+ end
106
+ end
107
+ dn = ind
108
+ loop do
109
+ dn -= 1
110
+ if dn < 0
111
+ break
112
+ end
113
+ mz_dn = mz_ar[dn]
114
+ if (mz.floor - mz_dn.floor >= 2)
115
+ break
116
+ else
117
+ if equal_after_rounding?(mz_dn, mz)
118
+ return_val = dn
119
+ return return_val
120
+ end
121
+ end
122
+ end
123
+ end
124
+ end
125
+ return_val
126
+ end
127
+
128
+ # less_precise should be a float
129
+ # precise should be a float
130
+ def equal_after_rounding?(precise, less_precise) # :nodoc:
131
+ # determine the precision of less_precise
132
+ exp10 = precision_as_neg_int(less_precise)
133
+ #puts "EXP10: #{exp10}"
134
+ answ = ((precise*exp10).round == (less_precise*exp10).round)
135
+ #puts "TESTING FOR EQUAL: #{precise} #{less_precise}"
136
+ #puts answ
137
+ answ
138
+ end
139
+
140
+ # returns the power of ten that makes the float integral to within 1e-6:
141
+ # 1 for ones place, 10 for tenths, 100 for hundredths, and so on
142
+ def precision_as_neg_int(float) # :nodoc:
143
+ neg_exp10 = 1
144
+ loop do
145
+ over = float * neg_exp10
146
+ rounded = over.round
147
+ if (over - rounded).abs <= 1e-6
148
+ break
149
+ end
150
+ neg_exp10 *= 10
151
+ end
152
+ neg_exp10
153
+ end
154
+
155
+
156
+ end
157
+ end
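The rounding comparison behind Spectrum#index can be tried on its own. The following is a minimal standalone sketch, not the gem's code: the name `precision_multiplier` is hypothetical (it plays the role of `precision_as_neg_int` above), and both helpers are defined at top level purely for illustration.

```ruby
# Hypothetical standalone helpers mirroring the rounding comparison above.

# Smallest power of ten that makes +float+ integral to within 1e-6
# (1 for ones place, 10 for tenths, 100 for hundredths).
def precision_multiplier(float)
  multiplier = 1
  # The condition is checked before each multiplication.
  multiplier *= 10 until (float * multiplier - (float * multiplier).round).abs <= 1e-6
  multiplier
end

# True if +precise+ equals +less_precise+ once both are rounded to the
# precision of +less_precise+.
def equal_after_rounding?(precise, less_precise)
  m = precision_multiplier(less_precise)
  (precise * m).round == (less_precise * m).round
end

equal_after_rounding?(100.12, 100.1)  # => true  (both round to 100.1)
equal_after_rounding?(100.16, 100.1)  # => false (100.16 rounds to 100.2)
```

A query given to tenths therefore matches any stored m/z that rounds to the same tenth, which is why the search in #index only has to inspect neighbours close to the binary-search boundary.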
@@ -0,0 +1,126 @@
1
+ module Ms
2
+ module Support
3
+
4
+ # A binary search library adapted from: http://0xcc.net/ruby-bsearch/
5
+ # ---
6
+ #
7
+ # Ruby/Bsearch - a binary search library for Ruby.
8
+ #
9
+ # Copyright (C) 2001 Satoru Takabayashi <satoru@namazu.org>
10
+ # All rights reserved.
11
+ # This is free software with ABSOLUTELY NO WARRANTY.
12
+ #
13
+ # You can redistribute it and/or modify it under the terms of
14
+ # the Ruby's licence.
15
+ #
16
+ # Example:
17
+ #
18
+ # % irb -r ./bsearch.rb
19
+ # >> %w(a b c c c d e f).bsearch_first {|x| x <=> "c"}
20
+ # => 2
21
+ # >> %w(a b c c c d e f).bsearch_last {|x| x <=> "c"}
22
+ # => 4
23
+ # >> %w(a b c e f).bsearch_first {|x| x <=> "c"}
24
+ # => 2
25
+ # >> %w(a b e f).bsearch_first {|x| x <=> "c"}
26
+ # => nil
27
+ # >> %w(a b e f).bsearch_last {|x| x <=> "c"}
28
+ # => nil
29
+ # >> %w(a b e f).bsearch_lower_boundary {|x| x <=> "c"}
30
+ # => 2
31
+ # >> %w(a b e f).bsearch_upper_boundary {|x| x <=> "c"}
32
+ # => 2
33
+ # >> %w(a b c c c d e f).bsearch_range {|x| x <=> "c"}
34
+ # => 2...5
35
+ # >> %w(a b c d e f).bsearch_range {|x| x <=> "c"}
36
+ # => 2...3
37
+ # >> %w(a b d e f).bsearch_range {|x| x <=> "c"}
38
+ # => 2...2
39
+ #
40
+ # The binary search algorithm is extracted from Jon Bentley's
41
+ # Programming Pearls 2nd ed. p.93
42
+ #
43
+ module BinarySearch
44
+ VERSION = '1.5'
45
+
46
+ module_function
47
+
48
+ #
49
+ # Return the lower boundary. (inside)
50
+ #
51
+ def search_lower_boundary(array, range=nil, &block)
52
+ range = 0 ... array.length if range == nil
53
+
54
+ lower = range.first() -1
55
+ upper = if range.exclude_end? then range.last else range.last + 1 end
56
+ while lower + 1 != upper
57
+ mid = ((lower + upper) / 2).to_i # for working with mathn.rb (Rational)
58
+ if yield(array[mid]) < 0
59
+ lower = mid
60
+ else
61
+ upper = mid
62
+ end
63
+ end
64
+ return upper
65
+ end
66
+
67
+ #
68
+ # Searches in binary fashion for the FIRST element for which the
69
+ # block returns 0 and returns the index of that occurrence.
70
+ # Returns nil if not found.
71
+ #
72
+ def search_first(array, range=nil, &block)
73
+ boundary = search_lower_boundary(array, range, &block)
74
+ if boundary >= array.length || yield(array[boundary]) != 0
75
+ return nil
76
+ else
77
+ return boundary
78
+ end
79
+ end
80
+
81
+ #
82
+ # Return the upper boundary. (outside)
83
+ #
84
+ def search_upper_boundary(array, range=nil, &block)
85
+ range = 0 ... array.length if range == nil
86
+
87
+ lower = range.first() -1
88
+ upper = if range.exclude_end? then range.last else range.last + 1 end
89
+ while lower + 1 != upper
90
+ mid = ((lower + upper) / 2).to_i # for working with mathn.rb (Rational)
91
+ if yield(array[mid]) <= 0
92
+ lower = mid
93
+ else
94
+ upper = mid
95
+ end
96
+ end
97
+ return lower + 1 # outside of the matching range.
98
+ end
99
+
100
+ #
101
+ # Searches in binary fashion for the LAST element for which the
102
+ # block returns 0 and returns the index of that occurrence.
103
+ # Returns nil if not found.
104
+ #
105
+ def search_last(array, range=nil, &block)
106
+ # `- 1' for canceling `lower + 1' in search_upper_boundary.
107
+ boundary = search_upper_boundary(array, range, &block) - 1
108
+
109
+ if (boundary <= -1 || yield(array[boundary]) != 0)
110
+ return nil
111
+ else
112
+ return boundary
113
+ end
114
+ end
115
+
116
+ #
117
+ # Return the search result as a Range object.
118
+ #
119
+ def search_range(array, range=nil, &block)
120
+ lower = search_lower_boundary(array, range, &block)
121
+ upper = search_upper_boundary(array, range, &block)
122
+ return lower ... upper
123
+ end
124
+ end
125
+ end
126
+ end
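The two boundary searches differ only in how they treat elements that compare equal to the target. Below is a minimal standalone sketch of that idea, assuming a sorted array and a block that behaves like `<=>`; the top-level method names `lower_boundary` and `upper_boundary` are hypothetical and omit the optional range argument of the module above.

```ruby
# Hypothetical standalone versions of the two boundary searches.

# Index of the first element that does not sort before the target.
def lower_boundary(array)
  lower = -1
  upper = array.length
  while lower + 1 != upper
    mid = (lower + upper) / 2
    if yield(array[mid]) < 0
      lower = mid   # array[mid] sorts before the target
    else
      upper = mid   # array[mid] is equal to or after the target
    end
  end
  upper
end

# Index just past the last element that does not sort after the target.
def upper_boundary(array)
  lower = -1
  upper = array.length
  while lower + 1 != upper
    mid = (lower + upper) / 2
    if yield(array[mid]) <= 0
      lower = mid   # array[mid] is before or equal to the target
    else
      upper = mid   # array[mid] sorts after the target
    end
  end
  upper
end

letters = %w(a b c c c d e f)
first = lower_boundary(letters) { |x| x <=> "c" }  # => 2
past  = upper_boundary(letters) { |x| x <=> "c" }  # => 5
first...past                                       # => 2...5
```

When the target is absent, both searches return the same insertion point, so the resulting range is empty; this matches the `2...2` case in the header examples.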
data/lib/ms.rb ADDED
@@ -0,0 +1,10 @@
1
+ module Ms
2
+ module_function
3
+
4
+ # def parse(format, path)
5
+ # const = Tap::Env.instance.search(:formats, format)
6
+ # raise ArgumentError, "unknown format: #{format}" unless const
7
+ # const.constantize.parse(path)
8
+ # end
9
+
10
+ end
metadata ADDED
@@ -0,0 +1,95 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: ms-core
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.1
5
+ platform: ruby
6
+ authors:
7
+ - John Prince
8
+ - Simon Chiang
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+
13
+ date: 2009-05-22 00:00:00 -06:00
14
+ default_executable:
15
+ dependencies:
16
+ - !ruby/object:Gem::Dependency
17
+ name: tap
18
+ type: :development
19
+ version_requirement:
20
+ version_requirements: !ruby/object:Gem::Requirement
21
+ requirements:
22
+ - - ">="
23
+ - !ruby/object:Gem::Version
24
+ version: 0.11.2
25
+ version:
26
+ - !ruby/object:Gem::Dependency
27
+ name: minitest
28
+ type: :development
29
+ version_requirement:
30
+ version_requirements: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - "="
33
+ - !ruby/object:Gem::Version
34
+ version: 1.3.0
35
+ version:
36
+ description:
37
+ email: jtprince@gmail.com
38
+ executables: []
39
+
40
+ extensions: []
41
+
42
+ extra_rdoc_files:
43
+ - changelog.txt
44
+ - LICENSE
45
+ - README
46
+ files:
47
+ - lib/ms/format/format_error.rb
48
+ - lib/ms/id/search.rb
49
+ - lib/ms/id/peptide.rb
50
+ - lib/ms/id/protein.rb
51
+ - lib/ms/mass/aa.rb
52
+ - lib/ms/data.rb
53
+ - lib/ms/spectrum.rb
54
+ - lib/ms/support/binary_search.rb
55
+ - lib/ms/mass.rb
56
+ - lib/ms/calc.rb
57
+ - lib/ms/data/interleaved.rb
58
+ - lib/ms/data/simple.rb
59
+ - lib/ms/data/lazy_string.rb
60
+ - lib/ms/data/transposed.rb
61
+ - lib/ms/data/lazy_io.rb
62
+ - lib/ms.rb
63
+ - changelog.txt
64
+ - LICENSE
65
+ - README
66
+ has_rdoc: true
67
+ homepage: http://mspire.rubyforge.org/projects/ms-core/
68
+ licenses: []
69
+
70
+ post_install_message:
71
+ rdoc_options: []
72
+
73
+ require_paths:
74
+ - lib
75
+ required_ruby_version: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - ">="
78
+ - !ruby/object:Gem::Version
79
+ version: "0"
80
+ version:
81
+ required_rubygems_version: !ruby/object:Gem::Requirement
82
+ requirements:
83
+ - - ">="
84
+ - !ruby/object:Gem::Version
85
+ version: "0"
86
+ version:
87
+ requirements: []
88
+
89
+ rubyforge_project: mspire
90
+ rubygems_version: 1.3.2
91
+ signing_key:
92
+ specification_version: 3
93
+ summary: the core, shared library for mspire
94
+ test_files: []
95
+