bio-gag 0.0.1

@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
@@ -0,0 +1,12 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.2
4
+ - 1.9.3
5
+ - jruby-19mode # JRuby in 1.9 mode
6
+ - rbx-19mode
7
+ # - 1.8.7
8
+ # - jruby-18mode # JRuby in 1.8 mode
9
+ # - rbx-18mode
10
+
11
+ # uncomment this line if your project needs to run something other than `rake`:
12
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,17 @@
1
+ source "http://rubygems.org"
2
+ # Add dependencies required to use your gem here.
3
+ # Example:
4
+ # gem "activesupport", ">= 2.3.5"
5
+ gem 'bio-pileup_iterator', '>=0.0.1'
6
+ gem 'bio-logger', '>=1.0.0'
7
+
8
+ # Add dependencies to develop your gem here.
9
+ # Include everything needed to run rake, tests, features, etc.
10
+ group :development do
11
+ gem "shoulda", ">= 0"
12
+ gem "rdoc", "~> 3.12"
13
+ gem "bundler", ">= 1.0.0"
14
+ gem "jeweler", "~> 1.8.3"
15
+ gem "bio", ">= 1.4.2"
16
+ gem "rdoc", "~> 3.12"
17
+ end
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2012 Ben J Woodcroft
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,69 @@
1
+ = bio-gag
2
+
3
+ bio-gag is a biogem for detecting and correcting a particular type of error that occurs/occurred in particular versions of the IonTorrent sequencing kit:
4
+
5
+ * Ion Xpress Template 100 Kit
6
+ * Ion Xpress Template 200 Kit
7
+ * Ion Sequencing 100 Kit
8
+ * Ion Sequencing 200 Kit
9
+
10
+ Newer versions of these kits do not appear to be affected by this error, starting with the "Ion PGM 200 Sequencing Kit". There are discussions about this on the (closed access) Ion Torrent forum:
11
+
12
+ * http://lifetech-it.hosted.jivesoftware.com/message/7893
13
+ * http://lifetech-it.hosted.jivesoftware.com/message/7792
14
+ * http://lifetech-it.hosted.jivesoftware.com/message/6233
15
+
16
+ To search for these errors, a pileup format file of aligned sequences is required. Such a file can be generated either from an assembly or by aligning reads to a reference, although bio-gag has only been tested on de novo assemblies produced by newbler. Note that the code is probably not optimised, owing to time constraints and to the fact that the errors appear to have been fixed in newer kits.
17
+
18
+ == Installation
19
+
20
+ gem install bio-gag
21
+
22
+ == Usage
23
+
24
+ The script is invoked as follows:
25
+
26
+ gag [options] <pileup_output>
27
+
28
+ At first, you probably want to just run it without any options. The output is a list of predicted sites at which the error occurs.
29
+
30
+ --lookahead Work out if gag predictions are supported by orf predictions being extended [default is just to print out found gag errors]. There's modified usage too - probably best for you to look at the code if you are using this operation
31
+ --fix CONSENSUS_FASTA_FILE Find gag errors in the pileup file, correct them in CONSENSUS_FASTA_FILE, and print to STDOUT the fixed up consensus
32
+ -g, --gags GAG_FILE Specify a list of GAG errors to be fixed in tab-separated form (use with --fix, the tab-separated output is from regular output or --lookahead)
33
+
34
+ And some options for logging:
35
+
36
+ --logger filename Log to file (default STDERR)
37
+ --trace options Set log level (default INFO, see bio-logger documentation at https://github.com/pjotrp/bioruby-logger-plugin)
38
+ -q, --quiet Run quietly
39
+ -v, --verbose Run verbosely
40
+
41
+
42
+
43
+ == Developers
44
+
45
+ To use the library
46
+
47
+ require 'bio-gag'
48
+
49
+ The API doc is online. For more code examples see also the test files in
50
+ the source tree.
51
+
52
+ == Project home page
53
+
54
+ For information on the source tree, documentation, issues and how to contribute, see
55
+
56
+ http://github.com/wwood/bioruby-gag
57
+
58
+ == Cite
59
+
60
+ If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
61
+
62
+ == Biogems.info
63
+
64
+ This Biogem is published at http://biogems.info/index.html#bio-gag
65
+
66
+ == Copyright
67
+
68
+ Copyright (c) 2012 Ben J Woodcroft. See LICENSE.txt for further details.
69
+
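The README above says that the default output is a tab-separated list of predicted sites, which can later be passed back with --gags. A minimal sketch of reading such a table is shown here, assuming the four header columns printed by bin/gag (ref_name, position, inserted_base, context). The sample row is the one quoted in a comment in bin/gag; the constant `GAG_TSV` and the helper `parse_gag_table` are hypothetical names used only for illustration and are not part of the gem.

```ruby
require 'csv'

# Hypothetical sample output: the header columns printed by the gag script,
# followed by the example row quoted in a comment in bin/gag.
GAG_TSV = [
  %w(ref_name position inserted_base context).join("\t"),
  %w(contig00125 11130 A GAG).join("\t")
].join("\n")

# Group predicted sites by contig name.
# Returns { ref_name => [ { position:, inserted_base:, context: }, ... ] }
def parse_gag_table(tsv_text)
  sites = Hash.new { |hash, key| hash[key] = [] }
  CSV.parse(tsv_text, headers: true, col_sep: "\t") do |row|
    sites[row['ref_name']] << {
      position: row['position'].to_i,
      inserted_base: row['inserted_base'],
      context: row['context']
    }
  end
  sites
end

sites = parse_gag_table(GAG_TSV)
sites.each do |contig, contig_sites|
  puts "#{contig}: #{contig_sites.length} predicted site(s)"
end
```

Grouping by contig mirrors how bin/gag builds its own hash of contig names to gag lists before fixing.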
@@ -0,0 +1,45 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+ require 'bundler'
5
+ begin
6
+ Bundler.setup(:default, :development)
7
+ rescue Bundler::BundlerError => e
8
+ $stderr.puts e.message
9
+ $stderr.puts "Run `bundle install` to install missing gems"
10
+ exit e.status_code
11
+ end
12
+ require 'rake'
13
+
14
+ require 'jeweler'
15
+ Jeweler::Tasks.new do |gem|
16
+ # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
17
+ gem.name = "bio-gag"
18
+ gem.homepage = "http://github.com/wwood/bioruby-gag"
19
+ gem.license = "MIT"
20
+ gem.summary = %Q{bio-gag is a biogem for detecting and correcting a particular type of error that occurs/occurred in particular versions of the IonTorrent DNA sequencing kit}
21
+ gem.description = %Q{bio-gag is a biogem for detecting and correcting a particular type of error that occurs/occurred in particular versions of the IonTorrent DNA sequencing kit. Recent versions of the system don't appear to suffer the same problem}
22
+ gem.email = "gmail.com after donttrustben"
23
+ gem.authors = ["Ben J Woodcroft"]
24
+ # dependencies defined in Gemfile
25
+ end
26
+ Jeweler::RubygemsDotOrgTasks.new
27
+
28
+ require 'rake/testtask'
29
+ Rake::TestTask.new(:test) do |test|
30
+ test.libs << 'lib' << 'test'
31
+ test.pattern = 'test/**/test_*.rb'
32
+ test.verbose = true
33
+ end
34
+
35
+ task :default => :test
36
+
37
+ require 'rdoc/task'
38
+ Rake::RDocTask.new do |rdoc|
39
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
40
+
41
+ rdoc.rdoc_dir = 'rdoc'
42
+ rdoc.title = "bio-gag #{version}"
43
+ rdoc.rdoc_files.include('README*')
44
+ rdoc.rdoc_files.include('lib/**/*.rb')
45
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.0.1
data/bin/gag ADDED
@@ -0,0 +1,288 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'bio'
4
+
5
+ $:.unshift File.join(File.dirname(__FILE__),'..','lib')
6
+ require 'bio-gag'
7
+
8
+
9
+ require 'optparse'
10
+ require 'csv'
11
+ require 'pp'
12
+
13
+
14
+ # Possible operations
15
+ FIND = 'find'
16
+ FIX = 'fix'
17
+ LOOKAHEAD = 'lookahead'
18
+ options = {
19
+ :operation => FIND,
20
+ :logger => 'stderr',
21
+ :trace => 'info',
22
+ }
23
+ o = OptionParser.new do |opts|
24
+ opts.banner = "\ngag [options] <pileup_output>\n\n"
25
+
26
+
27
+ opts.on('--lookahead', 'Work out if gag predictions are supported by orf predictions being extended [default is just to print out found gag errors]. There\'s modified usage too - probably best for you to look at the code if you are using this operation') do |v|
28
+ options[:operation] = LOOKAHEAD
29
+ end
30
+
31
+ opts.on('--fix CONSENSUS_FASTA_FILE', 'Find gag errors in the pileup file, correct them in CONSENSUS_FASTA_FILE, and print to STDOUT the fixed up consensus') do |v|
32
+ options[:operation] = FIX
33
+ options[:fix_file] = v
34
+ end
35
+
36
+ opts.on('-g','--gags GAG_FILE', 'Specify a list of GAG errors to be fixed in tab-separated form (use with --fix, the tab-separated output is from regular output or --lookahead)') do |v|
37
+ options[:gags_file] = v
38
+ end
39
+
40
+
41
+ opts.on("--logger filename",String,"Log to file (default STDERR)") do | name |
42
+ options[:logger] = name
43
+ end
44
+
45
+ opts.on("--trace options",String,"Set log level (default INFO, see bio-logger documentation at https://github.com/pjotrp/bioruby-logger-plugin)") do | s |
46
+ options[:trace] = s
47
+ end
48
+
49
+ opts.on("-q", "--quiet", "Run quietly") do |q|
50
+ options[:trace] = 'error'
51
+ end
52
+
53
+ opts.on("-v", "--verbose", "Run verbosely") do |v|
54
+ options[:trace] = 'debug'
55
+ end
56
+ end.parse!
57
+
58
+ # Realize settings
59
+ Bio::Log::CLI.trace(options[:trace])
60
+ Bio::Log::CLI.logger(options[:logger]) #defaults to STDERR not STDOUT
61
+ Bio::Log::CLI.configure('bio-gag')
62
+ log = Bio::Log::LoggerPlus.new 'gag'
63
+ Bio::Log::CLI.configure('gag')
64
+
65
+ piles = Bio::DB::PileupIterator.new(ARGF)
66
+
67
+ if options[:operation] == FIX
68
+ # Cache the fasta sequences
69
+ sequences = {} # Hash of sequence_id to sequences
70
+
71
+ # Read in the gags if they have already been specified
72
+ # e.g. contig00125 11130 A GAG
73
+ gags = {}
74
+ if options[:gags_file]
75
+ log.info "Using pre-specified GAG errors from #{options[:gags_file]}"
76
+ CSV.foreach(options[:gags_file], :headers => true, :col_sep => "\t") do |row|
77
+ contig = row[0]
78
+ gag = Bio::Gag.new(row[1].to_i, nil, contig)
79
+ gags[contig] ||= []
80
+ gags[contig].push gag
81
+ end
82
+ end
83
+
84
+ Bio::FlatFile.foreach(options[:fix_file]) do |s|
85
+ if sequences[s.entry_id]
86
+ raise Exception, "Unexpectedly found 2 sequences with the same sequence identifier '#{s.entry_id}', giving up"
87
+ end
88
+ sequences[s.entry_id] = s.seq
89
+ end
90
+ log.info "Cached #{sequences.length} sequences from the consensus fasta file"
91
+ log.debug "Sequences being fixed hash: #{sequences.inspect}"
92
+
93
+ #$stderr.puts gags
94
+ piles.fix_gags(sequences, gags).sort{|a,b| a[0]<=>b[0]}.each do |name, fixed_seq|
95
+ puts ">#{name}"
96
+ puts fixed_seq
97
+ end
98
+
99
+ elsif options[:operation] == LOOKAHEAD
100
+ # Given a list of gag errors and gene predictions before and after, second-guess whether they are really true gag errors
101
+ # * Where there is only 1 gene predicted, go with that
102
+ # * Where both sets predict the same thing, go with either
103
+ # * Where the sets disagree and there are more than 2 in total, give up and go manual
104
+ # * Where the sets disagree and there is one from each, starting from the gag error and working in the direction of the gene in the 2 frames
105
+ # ** Where there are two gag errors predicted in the same gene, give up and go manual.
106
+
107
+ genes1_file = ARGV[0]
108
+ genes2_file = ARGV[1]
109
+ gag_predictions_file = ARGV[2]
110
+
111
+ class GenePrediction
112
+ attr_accessor :start, :stop, :direction, :name
113
+ end
114
+
115
+ class Gag
116
+ attr_accessor :ref_name, :position, :inserted_base, :context, :adjusted_position
117
+ end
118
+
119
+ # Read in all the gene predictions
120
+ add_genes = lambda do |file|
121
+ hash = {} #hash of contig to array of GenePrediction objects
122
+ Bio::FlatFile.foreach(file) do |s|
123
+ # ["contig00001_1_1", "#", "412", "#", "624", "#", "1", "#", "ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None"]
124
+ splits = s.definition.split(' ')
125
+ gene = GenePrediction.new
126
+ contig = splits[0].match(/(.+)_\d+_\d+$/)[1]
127
+ gene.start = splits[2].to_i
128
+ gene.stop = splits[4].to_i
129
+ gene.direction = splits[6]
130
+ gene.name = splits[0]
131
+
132
+ raise Exception, "Unexpected format for gene start (#{splits[3]}) or stop (#{splits[5]}) in fasta header #{s.definition}" if gene.start == 0 or gene.stop == 0
133
+ raise unless %w(1 -1).include? gene.direction
134
+
135
+ hash[contig] ||= []
136
+ hash[contig].push gene
137
+ end
138
+ hash
139
+ end
140
+ genes_before_unchanged = add_genes.call(genes1_file)
141
+ genes_after = add_genes.call(genes2_file)
142
+
143
+ # Read in the gag output file
144
+ gags = {} #hash of contigs to gag predictions (positions along the genome)
145
+ CSV.foreach(gag_predictions_file, :col_sep => "\t", :headers => true) do |row|
146
+ contig = row[0]
147
+
148
+ gag = Gag.new
149
+ gag.ref_name = row[0]
150
+ gag.position = row[1].to_i
151
+ gag.inserted_base = row[2]
152
+ gag.context = row[3]
153
+
154
+ gags[contig] ||= []
155
+ gags[contig].push gag
156
+ end
157
+
158
+ # Shift the base coordinates of the 'before' gene predictions so that both sets of gene predictions line up
159
+ genes_before = {}
160
+ genes_before_unchanged.each do |contig, preds|
161
+ preds.each do |gene|
162
+ unless gags[contig].nil?
163
+ gags_before_start = gags[contig].count do |pos|
164
+ pos.position < gene.start
165
+ end
166
+ gene.start = gene.start+gags_before_start
167
+
168
+ gags_before_stop = gags[contig].count do |pos|
169
+ pos.position < gene.stop
170
+ end
171
+ gene.stop = gene.stop+gags_before_stop
172
+ end
173
+
174
+ genes_before[contig] ||= []
175
+ genes_before[contig].push gene
176
+ end
177
+ end
178
+
179
+ # Change the base numbers of the gag errors
180
+ gags.each do |contig, pregagged|
181
+ count = 0
182
+ pregagged.each do |g|
183
+ g.adjusted_position = g.position+count
184
+ count += 1
185
+ end
186
+ end
187
+
188
+ print_gag = lambda do |gag_object|
189
+ puts [
190
+ gag_object.ref_name,
191
+ gag_object.position,
192
+ gag_object.inserted_base,
193
+ gag_object.context
194
+ ].join("\t")
195
+ end
196
+
197
+ # print headers
198
+ puts %w(ref_name position inserted_base context).join("\t")
199
+
200
+ # Iterate through the gag errors
201
+ gags.each do |contig, contig_gags|
202
+ contig_gags.each do |gag_object|
203
+ gag = gag_object.adjusted_position
204
+ # Find overlapping genes from both sets of predictions at this site
205
+ genes1 = []
206
+ unless genes_before[contig].nil?
207
+ genes1 = genes_before[contig].select{|gene| gene.start < gag and gene.stop > gag}
208
+ end
209
+ genes2 = []
210
+ unless genes_after[contig].nil?
211
+ genes2 = genes_after[contig].select{|gene| gene.start < gag and gene.stop > gag}
212
+ end
213
+
214
+ # If there are no predictions, then do nothing
215
+ if genes1.empty? and genes2.empty?
216
+ log.debug "Gag doesn't fall within any ORFs called on contig #{contig} position #{gag}, ignoring"
217
+ next
218
+ end
219
+
220
+ all_genes = [genes1,genes2].flatten
221
+ manual_message = lambda do
222
+ log.info "before: #{genes1.inspect}"
223
+ log.info "after #{genes2.inspect}"
224
+ end
225
+
226
+ if all_genes.length == 3 and all_genes.collect{|g| g.direction}.uniq.length == 1
227
+ if genes1.length == 2
228
+ # 2 genes from before, 1 from after
229
+ if genes1[0].start == genes2[0].start and genes1[1].stop == genes2[0].stop
230
+ log.debug "Gag correctly called at #{gag}, I reckon, because there was 1 gene afterwards, 2 from before"
231
+ print_gag.call gag_object
232
+ else
233
+ log.info "2 genes from before, 1 from after, but they don't line up, giving up at #{contig}/#{gag}"
234
+ manual_message.call
235
+ end
236
+ elsif genes2.length == 2
237
+ # 2 genes from after, 1 from before
238
+ if genes1[0].start == genes2[0].start and genes1[0].stop == genes2[1].stop
239
+ log.debug "Gag incorrectly called at #{contig}/#{gag}, I reckon, because there were 2 genes from afterwards, 1 from before"
240
+ else
241
+ log.info "1 gene from before, 2 from after, but they don't line up, giving up at #{contig}/#{gag}"
242
+ manual_message.call
243
+ end
244
+ else
245
+ # 3 genes all from the same set of predictions
246
+ log.info "3 genes all in the same direction.. whacko.. giving up - gag was at #{contig}/#{gag}"
247
+ manual_message.call
248
+ end
249
+ elsif genes1.length == 1 and genes2.length == 1
250
+ if genes1[0].stop - genes1[0].start > genes2[0].stop - genes2[0].start
251
+ log.debug "Gag incorrectly called at contig #{contig}, gag #{gag}"
252
+ else
253
+ log.debug "Gag correctly called at contig #{contig}, gag #{gag}"
254
+ print_gag.call gag_object
255
+ end
256
+ elsif all_genes.length == 1
257
+ if genes1.length == 1
258
+ log.debug "Gag incorrectly called at contig #{contig}, gag #{gag} because only 1 gene was found"
259
+ else
260
+ log.debug "Gag correctly called at contig #{contig}, gag #{gag} because only 1 gene was found"
261
+ print_gag.call gag_object
262
+ end
263
+ else
264
+ log.info "Not 3 genes or something is strange with the direction with gag, at #{contig}/#{gag}"
265
+ manual_message.call
266
+ end
267
+ end
268
+ end
269
+
270
+ else
271
+ # Don't do anything, just predict them
272
+
273
+ puts %w(
274
+ ref_name
275
+ position
276
+ inserted_base
277
+ context
278
+ ).join("\t")
279
+
280
+ piles.gags do |gag|
281
+ puts [
282
+ gag.ref_name,
283
+ gag.position,
284
+ gag.inserted_base,
285
+ gag.gagging_pileups.collect{|g| g.ref_base}.join('')
286
+ ].join("\t")
287
+ end
288
+ end
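The --lookahead branch above shifts the coordinates of the "before" gene predictions so that they can be compared with predictions made on the corrected contigs. A minimal sketch of that shift follows, assuming (as the script does) that every fixed gag error inserts exactly one base. The helper name `shift_coordinate` and the example positions are hypothetical and chosen only for illustration.

```ruby
# Map a coordinate on the uncorrected contig onto the corrected contig.
# Each gag site strictly upstream of the coordinate adds one inserted base.
def shift_coordinate(coordinate, gag_positions)
  coordinate + gag_positions.count { |gag_position| gag_position < coordinate }
end

gag_positions = [100, 250]       # hypothetical gag sites on one contig
gene_start, gene_stop = 90, 300  # hypothetical ORF predicted before fixing

shifted_start = shift_coordinate(gene_start, gag_positions) # no gags upstream
shifted_stop  = shift_coordinate(gene_stop, gag_positions)  # two gags upstream

puts "ORF #{gene_start}..#{gene_stop} becomes #{shifted_start}..#{shifted_stop}"
```

The script adjusts the gag positions themselves in a similar way, adding to each position the number of gags listed before it on the same contig.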
@@ -0,0 +1,8 @@
1
+
2
+ require 'bio-logger'
3
+ Bio::Log::LoggerPlus.new('bio-gag')
4
+
5
+ $:.unshift File.join(File.dirname(__FILE__),'../../bioruby-pileup_iterator/lib/')
6
+ require 'bio-pileup_iterator'
7
+ require 'bio/db/gag'
8
+
@@ -0,0 +1,215 @@
1
+
2
+
3
+ class Bio::DB::PileupIterator
4
+ # Find places in this pileup that correspond to GAG errors
5
+ # * Only certain sequences are considered to be possible errors. Can change this with options[:acceptable_gag_errors]
6
+ # ** GAAG/CTTC (namesake of GAG errors. So GAG is looked for, to see if it is probably GAAG instead)
7
+ # ** AGGC/GCCT
8
+ # ** GCCG/CGGC
9
+ # ** GCCA/TGGC
10
+ # * There are at least 3 reads that have an insertion of base Y next to Y, and are all in the one direction. Can change this with options[:min_disagreeing_absolute]
11
+ # * The 3 or more reads form at least a proportion of 0.1 (i.e. 10%) of all the reads at that position. Can change this with options[:min_disagreeing_proportion]
12
+ #
13
+ # Returns an array of Bio::Gag objects
14
+ #
15
+ # When a block is given, each gag is yielded
16
+ def gags(options={})
17
+ min_disagreeing_proportion = options[:min_disagreeing_proportion]
18
+ min_disagreeing_proportion ||= 0.1
19
+ min_disagreeing_absolute = options[:min_disagreeing_absolute]
20
+ min_disagreeing_absolute ||= 3
21
+
22
+ options[:acceptable_gag_errors] ||= %w(GAG CTC AGC GCT GCG CGC GCA TGC)
23
+
24
+ log = Bio::Log::LoggerPlus['bio-gag']
25
+
26
+ piles = []
27
+ gags = []
28
+
29
+ each do |pile|
30
+ if piles.length < 2
31
+ #log.debug "Piles cache for this reference sequence less than length 2"
32
+ piles = [piles, pile].flatten
33
+ next
34
+ elsif piles.length < 3
35
+ #log.debug "Piles cache for this reference sequence becoming full"
36
+ piles = [piles, pile].flatten
37
+ elsif piles[1].ref_name != pile.ref_name
38
+ #log.debug "Piles cache removed - moving to new contig"
39
+ piles = [pile]
40
+ next
41
+ else
42
+ #log.debug "Piles cache regular push through"
43
+ piles = [piles[1], piles[2], pile].flatten
44
+ end
45
+ #log.debug "Current piles now at #{piles[0].ref_name}, #{piles.collect{|pile| "#{pile.pos}/#{pile.ref_base}"}.join(', ')}"
46
+
47
+ # if not at the start/end of the contig
48
+ first = piles[0]
49
+ second = piles[1]
50
+ third = piles[2]
51
+
52
+ # Require particular sequences in the reference sequence
53
+ ref_bases = "#{first.ref_base}#{second.ref_base}#{third.ref_base}"
54
+ index = options[:acceptable_gag_errors].index(ref_bases)
55
+ if index.nil?
56
+ #log.debug "Sequence #{ref_bases} does not match whitelist, so not calling a gag"
57
+ next
58
+ end
59
+ gag_sequence = options[:acceptable_gag_errors][index]
60
+
61
+ # all reads that have a single insertion after the first or second position, but not both
62
+ inserting_reads = [first.reads, second.reads].flatten.uniq.select do |read|
63
+ !(read.insertions[first.pos] and read.insertions[second.pos]) and
64
+ (read.insertions[first.pos] or read.insertions[second.pos])
65
+ end
66
+ #log.debug "Inserting reads after filtering: #{inserting_reads.inspect}"
67
+
68
+ # ignore regions that aren't ever going to make it past the next filter
69
+ if inserting_reads.length < min_disagreeing_absolute or inserting_reads.length.to_f/first.coverage < min_disagreeing_proportion
70
+ #log.debug "Insufficient disagreement at step 1, so not calling a gag"
71
+ next
72
+ end
73
+
74
+ # what is the maximal base that is inserted and maximal number of directions
75
+ direction_counts = {'+' => 0, '-' => 0}
76
+ base_counts = {}
77
+ inserting_reads.each do |read|
78
+ insert = read.insertions[first.pos]
79
+ insert ||= read.insertions[second.pos]
80
+ insert.upcase!
81
+ direction_counts[read.direction] += 1
82
+ base_counts[insert] ||= 0
83
+ base_counts[insert] += 1
84
+ end
85
+ #log.debug "Direction counts of insertions: #{direction_counts.inspect}"
86
+ #log.debug "Base counts of insertions: #{base_counts.inspect}"
87
+ max_direction = direction_counts['+']>direction_counts['-'] ? '+' : '-'
88
+ max_base = base_counts.max do |a,b|
89
+ a[1] <=> b[1]
90
+ end[0]
91
+ #log.debug "Picking max direction #{max_direction} and max base #{max_base}"
92
+
93
+ # Only accept positions that are inserting a single base
94
+ if max_base.length > 1
95
+ #log.debug "Maximal insertion is too long, so not calling a gag"
96
+ next
97
+ end
98
+
99
+ counted_inserts = inserting_reads.select do |read|
100
+ insert = read.insertions[first.pos]
101
+ insert ||= read.insertions[second.pos]
102
+ insert.upcase!
103
+ if read.direction == max_direction and insert == max_base
104
+ # # Remove reads that don't match the first and third bases like the consensus sequence
105
+ read.sequence[read.sequence.length-1] == third.ref_base and
106
+ read.sequence[read.sequence.length-3] == first.ref_base
107
+ else
108
+ false
109
+ end
110
+ end
111
+ #log.debug "Reads counting after final filtering: #{counted_inserts.inspect}"
112
+
113
+ coverage = (first.coverage+second.coverage+third.coverage).to_f / 3.0
114
+ coverage_percent = counted_inserts.length.to_f / coverage
115
+ #log.debug "Final abundance calculations: max base #{max_base} (comparison base #{second.ref_base.upcase}) occurs #{counted_inserts.length} times compared to coverage #{coverage} (#{coverage_percent*100}%)"
116
+ if max_base != second.ref_base.upcase or # the inserted base must match the second reference base
117
+ counted_inserts.length < min_disagreeing_absolute or # require at least min_disagreeing_absolute (default 3) reads in the maximal direction
118
+ coverage_percent < min_disagreeing_proportion # require at least min_disagreeing_proportion (default 10%) of reads to disagree with the consensus and agree with the gag
119
+ #log.debug "Failed final abundance cutoffs, so not calling a gag"
120
+ next
121
+ end
122
+
123
+ # alright, gamut navigated. We have a match, record it
124
+ gag = Bio::Gag.new(second.pos, piles, first.ref_name)
125
+ gags.push gag
126
+ log.debug "Yielding gag #{gag.inspect}"
127
+ yield gag if block_given?
128
+ end
129
+
130
+ return gags
131
+ end
132
+
133
+ # Given a hash containing sequence identifier => sequences, where both key and value are plain old Ruby strings, return the hash with any GAG errors in the sequences fixed.
134
+ # If the sequence_id_to_gags argument is specified, gags are not predicted from the pileups. If specified, it should be a hash of reference sequence IDs to an array of Bio::Gag objects
135
+ def fix_gags(hash_of_sequence_ids_to_sequence_strings, sequence_id_to_gags={})
136
+ log = Bio::Log::LoggerPlus['bio-gag']
137
+
138
+ # Get the gags
139
+ if sequence_id_to_gags == {}
140
+ log.info "Predicting gags from the pileup"
141
+ gags do |gag|
142
+ sequence_id_to_gags[gag.ref_name] ||= []
143
+ sequence_id_to_gags[gag.ref_name].push gag
144
+ end
145
+ else
146
+ log.info "Using pre-specified GAG errors"
147
+ end
148
+ log.info "Found #{sequence_id_to_gags.values.flatten.length} gag errors to fix"
149
+
150
+ # Make sure all gag errors in the pileup map to a sequence input fasta file by keeping tally
151
+ accounted_for_seq_ids = []
152
+ fixed_sequences = {} #Hash of sequence ids to sequences without gag errors
153
+ hash_of_sequence_ids_to_sequence_strings.each do |seq_id, seq|
154
+ log.debug "Now attempting to fix sequence #{seq_id}, sequence #{seq}"
155
+ toilet = sequence_id_to_gags[seq_id]
156
+ if toilet.nil?
157
+ # No gag errors found in this sequence (or pessimistically the sequence wasn't in the pileup -leaving that issue to the user though)
158
+ fixed_sequences[seq_id] = seq
159
+ else
160
+ # Gag error found at least once somewhere in this sequence
161
+ # Record that this was touched in the pileup
162
+ accounted_for_seq_ids.push seq_id
163
+
164
+ # Output the fixed-up sequence
165
+ last_gag = 0
166
+ fixed = ''
167
+ toilet.sort{|a,b| a.position<=>b.position}.each do |gag|
168
+ #log.debug "Attempting to fix gag at position #{gag.position} in sequence #{seq_id}, which is #{seq.length} bases long"
169
+ fixed = fixed+seq[last_gag..(gag.position-1)]
170
+ fixed = fixed+seq[(gag.position-1)..(gag.position-1)]
171
+ last_gag = gag.position
172
+ #log.debug "After fixing gag at position #{gag.position}, fixed sequence is now #{fixed}"
173
+ end
174
+ fixed = fixed+seq[last_gag..(seq.length-1)]
175
+ fixed_sequences[seq_id] = fixed
176
+ end
177
+ end
178
+
179
+ unless accounted_for_seq_ids.length == sequence_id_to_gags.length
180
+ log.warn "Unexpectedly found GAG errors in sequences that were not among the sequences to be fixed: found gags in #{sequence_id_to_gags.length} sequences, but only fixed #{accounted_for_seq_ids.length}"
181
+ end
182
+ return fixed_sequences
183
+ end
184
+ end
185
+
186
+ class Bio::Gag
187
+ # The name of the reference sequence where the error was called
188
+ attr_accessor :ref_name
189
+
190
+ # Position in the reference sequence where the error was called
191
+ attr_accessor :position
192
+
193
+ # Bio::DB::Pileup objects around the GAG error
194
+ attr_accessor :gagging_pileups
195
+
196
+ # The base to be inserted. May be derived from @gagging_pileups if they have been specified
197
+ attr_writer :inserted_base
198
+
199
+ def initialize(position, gagging_pileups, ref_name)
200
+ @position = position
201
+ @gagging_pileups = gagging_pileups
202
+ @ref_name = ref_name
203
+ end
204
+
205
+ # The base to be inserted. May be manually specified in @inserted_base, otherwise it is the ref_base derived from @gagging_pileups at the inserted position
206
+ def inserted_base
207
+ if @inserted_base.nil?
208
+ @gagging_pileups[1].ref_base
209
+ else
210
+ @inserted_base
211
+ end
212
+ end
213
+ end
214
+
215
+
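The string slicing in fix_gags above can be read as "duplicate the base at each 1-based gag position". A minimal standalone sketch of that step is given here, without the pileup or logging machinery. The helper name `duplicate_bases` is hypothetical; the sequences and positions are taken from the gem's own tests.

```ruby
# Return a copy of sequence in which the base at each 1-based position
# has been duplicated, mirroring the slicing used in fix_gags.
def duplicate_bases(sequence, positions)
  fixed = +''
  last = 0
  positions.sort.each do |position|
    fixed << sequence[last..(position - 1)] # up to and including the gag base
    fixed << sequence[position - 1]         # the duplicated base
    last = position
  end
  fixed << sequence[last..-1]               # remainder after the last gag
end

puts duplicate_bases('GTTCGAGGC', [6]) # GAG at positions 5-7 becomes GAAG
puts duplicate_bases('GTTCGAGGC', [2])
```

Sorting the positions first means that `last` only moves forward, so the original, uncorrected coordinates can be used for every insertion.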
@@ -0,0 +1,18 @@
1
+ require 'rubygems'
2
+ require 'bundler'
3
+ begin
4
+ Bundler.setup(:default, :development)
5
+ rescue Bundler::BundlerError => e
6
+ $stderr.puts e.message
7
+ $stderr.puts "Run `bundle install` to install missing gems"
8
+ exit e.status_code
9
+ end
10
+ require 'test/unit'
11
+ require 'shoulda'
12
+
13
+ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
14
+ $LOAD_PATH.unshift(File.dirname(__FILE__))
15
+ require 'bio-gag'
16
+
17
+ class Test::Unit::TestCase
18
+ end
@@ -0,0 +1,345 @@
1
+ require 'helper'
2
+ require 'tempfile'
3
+ require 'open3'
4
+
5
+ class TestBioGag < Test::Unit::TestCase
6
+ should "find_gag" do
7
+ test = "contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
8
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
9
+ contig00091 6 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
10
+ contig00091 7 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
11
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
12
+ gags = Bio::DB::PileupIterator.new(test).gags
13
+ assert_equal [6], gags.collect{|g| g.position}
14
+ end
15
+
16
+ should "find_gag with first and third bases different, but whitelisted" do
17
+ test = "contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
18
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
19
+ contig00091 6 C 33 ,,.$.+1C,,.+1C.+1C.+1C.+1C.+1C.+1C,,,.+1C.+1C.+1C.+1C.+1C,,.+1C,,,,,,,.+1C,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
20
+ contig00091 7 A 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
21
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
22
+ gags = Bio::DB::PileupIterator.new(test).gags
23
+ assert_equal [6], gags.collect{|g| g.position}
24
+ end
25
+
26
+ should "find no gag when XXX" do
27
+ test = "contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
28
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
29
+ contig00091 6 G 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
30
+ contig00091 7 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
31
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
32
+ gags = Bio::DB::PileupIterator.new(test).gags
33
+ assert_equal [], gags.collect{|g| g.position}
34
+ end
35
+
36
+ should "find no gag with first and third bases are the same but aren't in the whitelist" do
37
+ test = "contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
38
+ contig00091 5 C 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
39
+ contig00091 6 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
40
+ contig00091 7 C 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
41
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
42
+ gags = Bio::DB::PileupIterator.new(test).gags
43
+ assert_equal [], gags.collect{|g| g.position}
44
+ end
45
+
46
+ should "fix gag" do
47
+ test = "contig00091 1 G 32 ,,..,,......,,,.....,,.,,,,,,,., {;c{{{l{l{l{{{{{{{{{{{{{{{{{{{{U
48
+ contig00091 2 T 32 ,,.-1T.,,.-1T..-1T..-1T.,,,.....,,.,,,,,,,., a`$aaa!a!a!aaaaaaaaaaaaaaaaaaaaa
49
+ contig00091 3 T 32 ,,*.,,*.*.*.,,,.....,,.,,,,,,,., a`Iaaauauataaaaaaaaaaaaaaaaaaaaa
50
+ contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
51
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
52
+ contig00091 6 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
53
+ contig00091 7 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
54
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
55
+ contig00091 9 C 32 ,,.,,......,,,.....,,.,,,,,,,.,. ~~i~~~~~~Z~~~~~~~~~~~~~~~~~~~~~r
56
+ contig00091 10 A 33 ,,.,,......,,,.....,,.,,,,,,,.,.^]. aaPaa^aaaYaaaaaaaaaaaaaaaaaaaaaaB".gsub(/ +/,"\t")
57
+ hash = {'contig00091' => 'GTTCGAGGC'}
58
+ expe = {'contig00091' => 'GTTCGAAGGC'}
59
+ assert_equal expe, gags = Bio::DB::PileupIterator.new(test).fix_gags(hash)
60
+ end
61
+
+   should "fix gag prespecified" do
+     test = "contig00091 1 G 32 ,,..,,......,,,.....,,.,,,,,,,., {;c{{{l{l{l{{{{{{{{{{{{{{{{{{{{U
+ contig00091 2 T 32 ,,.-1T.,,.-1T..-1T..-1T.,,,.....,,.,,,,,,,., a`$aaa!a!a!aaaaaaaaaaaaaaaaaaaaa
+ contig00091 3 T 32 ,,*.,,*.*.*.,,,.....,,.,,,,,,,., a`Iaaauauataaaaaaaaaaaaaaaaaaaaa
+ contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 6 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 7 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00091 9 C 32 ,,.,,......,,,.....,,.,,,,,,,.,. ~~i~~~~~~Z~~~~~~~~~~~~~~~~~~~~~r
+ contig00091 10 A 33 ,,.,,......,,,.....,,.,,,,,,,.,.^]. aaPaa^aaaYaaaaaaaaaaaaaaaaaaaaaaB".gsub(/ +/,"\t")
+     hash = {'contig00091' => 'GTTCGAGGC'}
+     expe = {'contig00091' => 'GTTTCGAGGC'}
+     # Only the pre-specified gag (position 2) is applied: the T there is duplicated and position 6 is left alone.
+     gag1 = Bio::Gag.new(2,nil,'contig00091')
+     gags = {'contig00091' => [gag1]}
+     assert_equal expe, gags = Bio::DB::PileupIterator.new(test).fix_gags(hash, gags)
+   end
+
+   should "fix gag prespecified in 2 seqs" do
+     hash = {'contig00091' => 'GTTCGAGGC',
+       'contig00092' => 'GAGTTCGAGGC'}
+     expe = {'contig00091' => 'GTTTCGAGGC',
+       'contig00092' => 'GAGTTCGAGGC'}
+
+     gag1 = Bio::Gag.new(2,nil,'contig00091')
+     gags = {'contig00091' => [gag1]}
+     assert_equal expe, gags = Bio::DB::PileupIterator.new('').fix_gags(hash, gags)
+
+     gag2 = Bio::Gag.new(8,nil,'contig00092')
+     gags = {'contig00091' => [gag1], 'contig00092' => [gag2]}
+     expe = {'contig00091' => 'GTTTCGAGGC',
+       'contig00092' => 'GAGTTCGAAGGC'}
+     assert_equal expe, gags = Bio::DB::PileupIterator.new('').fix_gags(hash, gags)
+   end
+
+   should "fix 2 gags" do
+     test = "contig00091 1 G 32 ,,..,,......,,,.....,,.,,,,,,,., {;c{{{l{l{l{{{{{{{{{{{{{{{{{{{{U
+ contig00091 2 T 32 ,,.-1T.,,.-1T..-1T..-1T.,,,.....,,.,,,,,,,., a`$aaa!a!a!aaaaaaaaaaaaaaaaaaaaa
+ contig00091 3 T 32 ,,*.,,*.*.*.,,,.....,,.,,,,,,,., a`Iaaauauataaaaaaaaaaaaaaaaaaaaa
+ contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 6 A 33 ,,..+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 7 G 32 ,,..,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 8 G 32 ,,..,,......,,,.....,,.,,,,,,,.,. {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 9 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 10 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 11 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00091 12 C 32 ,,.,,......,,,.....,,.,,,,,,,.,. ~~i~~~~~~Z~~~~~~~~~~~~~~~~~~~~~r
+ contig00091 13 A 33 ,,.,,......,,,.....,,.,,,,,,,.,.^]. aaPaa^aaaYaaaaaaaaaaaaaaaaaaaaaaB".gsub(/ +/,"\t")
+
+     hash = {'contig00091' => 'GTTCGAGGAGGCA'}
+     expe = {'contig00091' => 'GTTCGAAGGAAGGCA'}
+     assert_equal expe, gags = Bio::DB::PileupIterator.new(test).fix_gags(hash)
+   end
+
+   should "run gagger predict ok" do
+     test = "contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 6 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 7 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 8 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
+     command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+ ' --trace info'
+     out = nil
+     err = nil
+     Open3.popen3(command) do |stdin, stdout, stderr|
+       stdin.puts test
+       stdin.close
+       out = stdout.readlines
+       err = stderr.readlines
+     end
+     assert_equal [], err
+     assert_equal [
+       "ref_name\tposition\tinserted_base\tcontext\n",
+       "contig00091\t6\tA\tGAG\n"
+     ], out
+   end
+
+   should "run gagger fix ok without gags pre-specified" do
+     test = "contig00091 1 G 32 ,,..,,......,,,.....,,.,,,,,,,., {;c{{{l{l{l{{{{{{{{{{{{{{{{{{{{U
+ contig00091 2 T 32 ,,.-1T.,,.-1T..-1T..-1T.,,,.....,,.,,,,,,,., a`$aaa!a!a!aaaaaaaaaaaaaaaaaaaaa
+ contig00091 3 T 32 ,,*.,,*.*.*.,,,.....,,.,,,,,,,., a`Iaaauauataaaaaaaaaaaaaaaaaaaaa
+ contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 6 A 33 ,,..+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 7 G 32 ,,..,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 8 G 32 ,,..,,......,,,.....,,.,,,,,,,.,. {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 9 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 10 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 11 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00091 12 C 32 ,,.,,......,,,.....,,.,,,,,,,.,. ~~i~~~~~~Z~~~~~~~~~~~~~~~~~~~~~r
+ contig00091 13 A 33 ,,.,,......,,,.....,,.,,,,,,,.,.^]. aaPaa^aaaYaaaaaaaaaaaaaaaaaaaaaaB".gsub(/ +/,"\t")
+     Tempfile.open('test_gag_fix') do |tempfile|
+       tempfile.puts '>contig00091'
+       tempfile.puts 'GTTCGAGGAGGCA'
+       tempfile.close
+
+       command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+' --trace error --fix '+tempfile.path
+       out = nil
+       err = nil
+       Open3.popen3(command) do |stdin, stdout, stderr|
+         stdin.puts test
+         stdin.close
+         out = stdout.readlines
+         err = stderr.readlines
+       end
+       assert_equal [], err
+       assert_equal [
+         ">contig00091\n",
+         "GTTCGAAGGAAGGCA\n"
+       ], out
+     end
+   end
+
+   should "run gagger fix ok with fasta comments" do
+     test = "contig00091 1 G 32 ,,..,,......,,,.....,,.,,,,,,,., {;c{{{l{l{l{{{{{{{{{{{{{{{{{{{{U
+ contig00091 2 T 32 ,,.-1T.,,.-1T..-1T..-1T.,,,.....,,.,,,,,,,., a`$aaa!a!a!aaaaaaaaaaaaaaaaaaaaa
+ contig00091 3 T 32 ,,*.,,*.*.*.,,,.....,,.,,,,,,,., a`Iaaauauataaaaaaaaaaaaaaaaaaaaa
+ contig00091 4 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 5 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 6 A 33 ,,..+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 7 G 32 ,,..,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 8 G 32 ,,..,,......,,,.....,,.,,,,,,,.,. {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 9 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 10 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 11 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00091 12 C 32 ,,.,,......,,,.....,,.,,,,,,,.,. ~~i~~~~~~Z~~~~~~~~~~~~~~~~~~~~~r
+ contig00091 13 A 33 ,,.,,......,,,.....,,.,,,,,,,.,.^]. aaPaa^aaaYaaaaaaaaaaaaaaaaaaaaaaB".gsub(/ +/,"\t")
+     Tempfile.open('test_gag_fix') do |tempfile|
+       # The text after the sequence ID ('with comment') is expected to be dropped from the output header.
+       tempfile.puts '>contig00091 with comment'
+       tempfile.puts 'GTTCGAGGAGGCA'
+       tempfile.close
+
+       command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+' --trace error --fix '+tempfile.path
+       out = nil
+       err = nil
+       Open3.popen3(command) do |stdin, stdout, stderr|
+         stdin.puts test
+         stdin.close
+         out = stdout.readlines
+         err = stderr.readlines
+       end
+       assert_equal [], err
+       assert_equal [
+         ">contig00091\n",
+         "GTTCGAAGGAAGGCA\n"
+       ], out
+     end
+   end
+
+   should "run gagger fix when some sequences don't have gag errors" do
+     test = "contig00091 1 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 2 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 3 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 4 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 5 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
+
+     Tempfile.open('test_gag_fix') do |tempfile|
+       tempfile.puts '>contig00091 with comment'
+       tempfile.puts 'CGAGG'
+       tempfile.puts '>contig00092'
+       tempfile.puts 'ATGC'
+       tempfile.close
+
+       command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+ ' --trace error --fix '+tempfile.path
+       out = nil
+       err = nil
+       Open3.popen3(command) do |stdin, stdout, stderr|
+         stdin.puts test
+         stdin.close
+         out = stdout.readlines
+         err = stderr.readlines
+       end
+       assert_equal [], err
+       assert_equal [
+         ">contig00091\n",
+         "CGAAGG\n",
+         ">contig00092\n",
+         "ATGC\n"
+       ], out
+     end
+   end
+
+
+   should "run gagger fix ok, but warn, when there's less sequences than gag errors" do
+     test = "contig00091 1 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 2 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 3 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 4 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 5 G 32 ,$,$.$,$,$.$.$.$.$*$.$,$,$,$.$.$.$.$.$,$,$.$,$,$,$,$,$,$,$.$,$.$ aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00090 1 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00090 2 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00090 3 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00090 4 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00090 5 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
+
+     # The pileup covers two contigs but the FASTA holds only contig00091, so a warning is expected on stderr.
+     Tempfile.open('test_gag_fix') do |tempfile|
+       tempfile.puts '>contig00091 with comment'
+       tempfile.puts 'CGAGG'
+       tempfile.close
+
+       command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+ ' --trace warn --fix '+tempfile.path
+       out = nil
+       err = nil
+       Open3.popen3(command) do |stdin, stdout, stderr|
+         stdin.puts test
+         stdin.close
+         out = stdout.readlines
+         err = stderr.readlines
+       end
+       assert_equal [" WARN bio-gag: Unexpectedly found GAG errors in sequences that weren't in the sequence that are to be fixed: Found gags in 2, but only fixed 1\n"], err
+       assert_equal [
+         ">contig00091\n",
+         "CGAAGG\n",
+       ], out
+     end
+   end
+
+   should "run gagger with --debug without any big problems" do
+     test = "contig00091 1 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00091 2 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00091 3 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00091 4 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00091 5 G 32 ,$,$.$,$,$.$.$.$.$*$.$,$,$,$.$.$.$.$.$,$,$.$,$,$,$,$,$,$,$.$,$.$ aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa
+ contig00090 1 C 32 ,,..,,......,,,.....,,.,,,,,,,., ~~I~~~u~u~t~~~~~~~~~~~~~~~~~~~~~
+ contig00090 2 G 32 ,,..,,......,,,.....,,.,,,,,,,., {{Ii{{iiii@i{{{iiiii{{i{{{{{{{i{
+ contig00090 3 A 33 ,,.$.+1A,,.+1A.+1A.+1A.+1A.+1A.+1A,,,.+1A.+1A.+1A.+1A.+1A,,.+1A,,,,,,,.+1A,^]. z{D${{$$$$!${{{$$$$${{${{{{{{{${E
+ contig00090 4 G 32 ,,.,,.....-1G.,,,.....,,.,,,,,,,.,. aaRaaRRRR&RaaaRRRRRaaRaaaaaaaRaU
+ contig00090 5 G 32 ,,.,,....*.,,,.....,,.,,,,,,,.,. aaRaaRRRRZRaaaRRRRRaaRaaaaaaaRaa".gsub(/ +/,"\t")
+
+     Tempfile.open('test_gag_fix') do |tempfile|
+       tempfile.puts '>contig00091 with comment'
+       tempfile.puts 'CGAGG'
+       tempfile.close
+
+       command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+ ' --trace debug --fix '+tempfile.path
+       out = nil
+       err = nil
+       Open3.popen3(command) do |stdin, stdout, stderr|
+         stdin.puts test
+         stdin.close
+         out = stdout.readlines
+         err = stderr.readlines
+       end
+       # With --trace debug, stderr should carry more than one line of log output.
+       assert err.length > 1, "expected more than one line of debug logging on stderr"
+       assert_equal [
+         ">contig00091\n",
+         "CGAAGG\n",
+       ], out
+     end
+   end
+
+   should "run gagger fix ok with prespecified gags" do
+     test = ""
+     Tempfile.open('test_gag_fix') do |tempfile|
+       tempfile.puts '>contig00091'
+       tempfile.puts 'GTTCGAGGAGGCA'
+       tempfile.close
+
+       Tempfile.open('gags_prespecified') do |gags_file|
+         gags_file.puts %w(ref_name position inserted_base context).join("\t")
+         gags_file.puts %w(contig00091 2 G CTC).join("\t")
+         gags_file.puts %w(contig00091 4 G CTC).join("\t")
+         gags_file.close
+
+         command = File.join([File.dirname(__FILE__),%w(.. bin gag)].flatten)+" --trace error --fix #{tempfile.path} --gags #{gags_file.path}"
+         out = nil
+         err = nil
+         Open3.popen3(command) do |stdin, stdout, stderr|
+           stdin.puts test
+           stdin.close
+           out = stdout.readlines
+           err = stderr.readlines
+         end
+         assert_equal [], err
+         # Expect the reference base at each listed position to be duplicated (T at 2, C at 4).
+         assert_equal [
+           ">contig00091\n",
+           "GTTTCCGAGGAGGCA\n"
+         ], out
+
+       end
+     end
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,154 @@
+ --- !ruby/object:Gem::Specification
+ name: bio-gag
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+   prerelease:
+ platform: ruby
+ authors:
+ - Ben J Woodcroft
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2012-05-17 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bio-pileup_iterator
+   requirement: &86055840 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: 0.0.1
+   type: :runtime
+   prerelease: false
+   version_requirements: *86055840
+ - !ruby/object:Gem::Dependency
+   name: bio-logger
+   requirement: &86055510 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: 1.0.0
+   type: :runtime
+   prerelease: false
+   version_requirements: *86055510
+ - !ruby/object:Gem::Dependency
+   name: shoulda
+   requirement: &86055220 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: *86055220
+ - !ruby/object:Gem::Dependency
+   name: rdoc
+   requirement: &86054930 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '3.12'
+   type: :development
+   prerelease: false
+   version_requirements: *86054930
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: &86054570 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: 1.0.0
+   type: :development
+   prerelease: false
+   version_requirements: *86054570
+ - !ruby/object:Gem::Dependency
+   name: jeweler
+   requirement: &86054260 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.8.3
+   type: :development
+   prerelease: false
+   version_requirements: *86054260
+ - !ruby/object:Gem::Dependency
+   name: bio
+   requirement: &86053790 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: 1.4.2
+   type: :development
+   prerelease: false
+   version_requirements: *86053790
+ - !ruby/object:Gem::Dependency
+   name: rdoc
+   requirement: &86053100 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '3.12'
+   type: :development
+   prerelease: false
+   version_requirements: *86053100
+ description: bio-gag is a biogem for detecting and correcting a particular type of
+   error that occurs/occurred in particular versions of the IonTorrent DNA sequencing
+   kit. Recent versions of the system don't appear to suffer the same problem
+ email: gmail.com after donttrustben
+ executables:
+ - gag
+ extensions: []
+ extra_rdoc_files:
+ - LICENSE.txt
+ - README.rdoc
+ files:
+ - .document
+ - .travis.yml
+ - Gemfile
+ - LICENSE.txt
+ - README.rdoc
+ - Rakefile
+ - VERSION
+ - bin/gag
+ - lib/bio-gag.rb
+ - lib/bio/db/gag.rb
+ - test/helper.rb
+ - test/test_bio-gag.rb
+ homepage: http://github.com/wwood/bioruby-gag
+ licenses:
+ - MIT
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+       segments:
+       - 0
+       hash: 567820925
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 1.8.17
+ signing_key:
+ specification_version: 3
+ summary: bio-gag is a biogem for detecting and correcting a particular type of error
+   that occurs/occurred in particular versions of the IonTorrent DNA sequencing kit
+ test_files: []