bio-samtools 2.0.5 → 2.1.0

data/tutorial/tutorial.md CHANGED
@@ -5,7 +5,7 @@ bio-samtools Basic Tutorial
5
5
  Introduction
6
6
  ------------
7
7
 
8
- bio-samtools is a Ruby binding to the popular [SAMtools](http://samtools.sourceforge.net/) library, and provides access to individual read alignments as well as BAM files, reference sequence and pileup information.
8
+ bio-samtools is a Ruby binding to the popular [SAMtools](http://samtools.sourceforge.net/) library, and provides access to individual read alignments as well as BAM files, reference sequence and pileup information. Users should refer to the [bio-samtools documentation](http://rubydoc.info/gems/bio-samtools/index) and the [SAMtools manual](http://samtools.sourceforge.net/samtools.shtml) for further details of the methods.
9
9
 
10
10
  Installation
11
11
  ------------
@@ -23,17 +23,51 @@ bio-samtools relies on the following other rubygems:
23
23
 
24
24
  Once these are installed, bio-samtools can be installed with
25
25
 
26
- sudo gem install bio-samtools
27
-
26
+ ```sh
27
+ gem install bio-samtools
28
+ ```
28
29
 
29
30
  It should then be easy to test whether installation went well. Start
30
31
  interactive Ruby (IRB) in the terminal, and type
31
- `require 'bio-samtools'` if the terminal returns `true` then all is
32
+
33
+ ```ruby
34
+ require 'bio-samtools'
35
+ ```
36
+
37
+ if the terminal returns `true` then all is
32
38
  well.
39
+ ```ruby
40
+ $ irb
41
+ >> require 'bio-samtools'
42
+ => true
43
+ ```
44
+
45
+ ##Creating a BAM file
46
+ Often, the output from a next-generation sequence alignment tool will be a file in the [SAM format](http://samtools.github.io/hts-specs/SAMv1.pdf).
47
+
48
+ Typically, we'd create a compressed, indexed binary version of the SAM file, which would allow us to operate on it in a quicker and more efficient manner, being able to randomly access various parts of the alignment. We'd use the `view` method to do this. This step would involve taking our SAM file, converting it to BAM, sorting it and indexing it.
49
+
50
+ ```ruby
51
+ #create the sam object
52
+ sam = Bio::DB::Sam.new(:bam => 'my.sam', :fasta => 'ref.fasta')
53
+
54
+ #create a bam file from the sam file
55
+ sam.view(:b=>true, :S=>true, :o=>'bam.bam')
56
+
57
+ #create a new sam object from the bam file
58
+ unsortedBam = Bio::DB::Sam.new(:bam => 'bam.bam', :fasta => 'ref.fasta')
59
+
60
+ #the bam file might not be sorted (necessary for samtools), so sort it
61
+ unsortedBam.sort(:prefix=>'sortedBam')
62
+
63
+ #create a new sam object
64
+ bam = Bio::DB::Sam.new(:bam => 'sortedBam.bam', :fasta => 'ref.fasta')
65
+ #create a new index
66
+ bam.index()
67
+
68
+ #creates index file sortedBam.bam.bai
69
+ ```
33
70
 
34
- $ irb
35
- >> require 'bio-samtools'
36
- => true
37
71
 
38
72
  Working with BAM files
39
73
  ----------------------
@@ -41,34 +75,65 @@ Working with BAM files
41
75
 
42
76
  ### Creating a new SAM object
43
77
 
44
- A SAM object represents the alignments in the BAM file. BAM files (and hence SAM objects here) are what most of SAMtools methods operate on and are very straightforward to create. You will need a sorted BAM file, to access the alignments and a reference sequence in FASTA format to use the reference sequence. The object can be created and opened as follows:
78
+ A SAM object represents the alignments in the BAM file. BAM files (and hence SAM objects here) are what most SAMtools methods operate on, and they are straightforward to create. You will need a sorted and indexed BAM file to access the alignments, and a reference sequence in FASTA format to use the reference sequence. Let's revisit the last few lines of the code above.
45
79
 
46
- bam = Bio::DB::Sam.new(:bam=>"my_sorted.bam", :fasta=>'ref.fasta')
80
+ ```ruby
81
+ bam = Bio::DB::Sam.new(:bam => 'sortedBam.bam', :fasta => 'ref.fasta')
82
+ bam.index()
83
+ ```
47
84
 
48
- Opening the file needs only to be done once for multiple operations on
49
- it, access to the alignments is random so you don't need to loop over
50
- the entries in the file.
85
+ Creating the new Bio::DB::Sam object (named `bam` in this case) only needs to be done once for multiple operations on it. Access to the alignments is random, so you don't need to loop over the entries in the file.
51
86
 
52
87
  ### Getting Reference Sequence
53
88
 
54
89
  The reference is accessed using reference
55
90
  name, start, end in 1-based co-ordinates. A standard Ruby String object is returned.
91
+ ```ruby
92
+ sequence_fragment = bam.fetch_reference("Chr1", 1, 100)
93
+ puts sequence_fragment
94
+ => cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
95
+ ```
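The 1-based, inclusive coordinate convention can be illustrated without a BAM file. The helper below is a hypothetical stand-in written for this note (`slice_one_based` is not a bio-samtools method); it only shows how 1-based start and end positions map onto Ruby's 0-based string indices.

```ruby
# Hypothetical helper, not part of bio-samtools: slice a sequence using
# 1-based, inclusive start and stop coordinates.
def slice_one_based(sequence, start, stop)
  raise ArgumentError, "start must be >= 1" if start < 1
  raise ArgumentError, "stop must be >= start" if stop < start

  # Ruby strings are 0-based, so shift both coordinates down by one.
  sequence[(start - 1)..(stop - 1)]
end

reference = "cctaaccctaaccctaaccc" # made-up toy reference
puts slice_one_based(reference, 1, 5)  # prints "cctaa"
puts slice_one_based(reference, 6, 10) # prints "cccta"
```

Note that position 1 is the first base, so a request for 1 to 100 returns exactly 100 bases.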
56
96
 
57
- sequence_fragment = bam.fetch_reference("Chr1", 1, 100)
58
-
59
- The output from this would be the raw sequence as a string, e.g.
97
+ A reference sequence can be returned as a Bio::Sequence::NA object by the use of `:as_bio => true`
98
+ ```ruby
99
+ sequence_fragment = bam.fetch_reference("Chr1", 1, 100, :as_bio => true)
100
+ ```
60
101
 
61
- cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
102
+ The printed output from this would be a fasta-formatted string
103
+ ```ruby
104
+ puts sequence_fragment
62
105
 
63
- A reference sequence can be returned as a Bio::Sequence::NA object buy the use of :as_bio => true
106
+ => >Chr1:1-100
107
+ => cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
108
+ ```
109
+
110
+ ### Concatenating BAM files
111
+ BAM files may be concatenated using the `cat` command. The sequence dictionary of each input BAM must be identical, although the `cat` method does not check this.
112
+
113
+ ```ruby
114
+ #create an array of BAM files to cat
115
+ bam_files = [bam1, bam2]
116
+ cat_file = "maps_cated.bam" #the outfile
117
+ #cat the files
118
+ bam.cat(:out=>cat_file, :bams=>bam_files)
119
+ #create a new Bio::DB::Sam object from the new cat file
120
+ cat_bam = Bio::DB::Sam.new(:fasta => "ref.fasta", :bam => cat_file)
64
121
 
65
- sequence_fragment = bam.fetch_reference("Chr1", 1, 100, :as_bio => true)
122
+ ```
66
123
 
67
- The output from this would be a Bio::Sequence::NA object, which provides a fasta-formatted string when printed
124
+ ### Removing duplicate reads
125
+ The `remove_duplicates` method removes potential PCR duplicates: if multiple read pairs have identical external coordinates, only the pair with the highest mapping quality is retained. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
126
+ ```ruby
68
127
 
69
- >chr_1:1-100 cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
128
+ unduped = "dupes_rmdup.bam" #an outfile for the removed duplicates bam
129
+ #remove single-end duplicates
130
+ bam.remove_duplicates(:s=>true, :out=>unduped)
131
+ #create new Bio::DB::Sam object
132
+ unduped_bam = Bio::DB::Sam.new(:fasta => "ref.fasta", :bam => unduped)
70
133
 
71
- #### Alignment Objects
134
+ ```
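The rule described for duplicate removal (keep the best-quality pair among those sharing external coordinates) can be sketched in plain Ruby. This is only an illustration of the rule as stated in the text, not the SAMtools implementation, which may consider further details such as orientation. The `ReadPair` struct and `keep_best_pairs` method are hypothetical names.

```ruby
# Hypothetical stand-in for a read pair; bio-samtools does not define this Struct.
ReadPair = Struct.new(:name, :chrom, :left, :right, :mapq)

# Group pairs by their external coordinates and keep the pair with the
# highest mapping quality from each group.
def keep_best_pairs(pairs)
  pairs.group_by { |pair| [pair.chrom, pair.left, pair.right] }
       .map { |_coords, group| group.max_by(&:mapq) }
end

pairs = [
  ReadPair.new("pair_a", "Chr1", 100, 350, 60),
  ReadPair.new("pair_b", "Chr1", 100, 350, 37), # same coordinates as pair_a
  ReadPair.new("pair_c", "Chr1", 420, 690, 29)
]

keep_best_pairs(pairs).each { |pair| puts pair.name }
# prints "pair_a" then "pair_c"
```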
135
+
136
+ ### Alignment Objects
72
137
 
73
138
  The individual alignments represent a single read and are returned as
74
139
  Bio::DB::Alignment objects. These have numerous methods of their own,
@@ -78,99 +143,106 @@ represents a Ruby instance variable and can be accessed as any other
78
143
  method. Thus the `@is_mapped` attribute of an object `a` is accessed
79
144
  `a.is_mapped`
80
145
 
81
- require 'pp'
82
- pp an_alignment_object ##some Bio::DB::Alignment object
83
- #<Bio::DB::Alignment:0x101113f80
84
- @al=#<Bio::DB::SAM::Tools::Bam1T:0x101116a50>,
85
- @calend=4067,
86
- @cigar="76M",
87
- @failed_quality=false,
88
- @first_in_pair=false,
89
- @flag=163,
90
- @is_duplicate=false,
91
- @is_mapped=true,
92
- @is_paired=true,
93
- @isize=180,
94
- @mapq=60,
95
- @mate_strand=false,
96
- @mate_unmapped=false,
97
- @mpos=4096,
98
- @mrnm="=",
99
- @pos=3992,
100
- @primary=true,
101
- @qlen=76,
102
- @qname="HWI-EAS396_0001:7:115:17904:15958#0",
103
- @qual="IIIIIIIIIIIIHHIHGIHIDGGGG...",
104
- @query_strand=true,
105
- @query_unmapped=false,
106
- @rname="1",
107
- @second_in_pair=true,
108
- @seq="ACAGTCCAGTCAAAGTACAAATCGAG...",
109
- @tags=
110
- {"MD"=>#<Bio::DB::Tag:0x101114ed0 @tag="MD", @type="Z", @value="76">,
111
- "XO"=>#<Bio::DB::Tag:0x1011155d8 @tag="XO", @type="i", @value="0">,
112
- "AM"=>#<Bio::DB::Tag:0x101116280 @tag="AM", @type="i", @value="37">,
113
- "X0"=>#<Bio::DB::Tag:0x101115fb0 @tag="X0", @type="i", @value="1">,
114
- "X1"=>#<Bio::DB::Tag:0x101115c68 @tag="X1", @type="i", @value="0">,
115
- "XG"=>#<Bio::DB::Tag:0x101115240 @tag="XG", @type="i", @value="0">,
116
- "SM"=>#<Bio::DB::Tag:0x1011162f8 @tag="SM", @type="i", @value="37">,
117
- "XT"=>#<Bio::DB::Tag:0x1011162a8 @tag="XT", @type="A", @value="U">,
118
- "NM"=>#<Bio::DB::Tag:0x101116348 @tag="NM", @type="i", @value="0">,
119
- "XM"=>#<Bio::DB::Tag:0x101115948 @tag="XM", @type="i", @value="0">}>
120
-
146
+ ```ruby
147
+ require 'pp'
148
+ pp an_alignment_object ##some Bio::DB::Alignment object
149
+ #<Bio::DB::Alignment:0x101113f80
150
+ @al=#<Bio::DB::SAM::Tools::Bam1T:0x101116a50>,
151
+ @calend=4067,
152
+ @cigar="76M",
153
+ @failed_quality=false,
154
+ @first_in_pair=false,
155
+ @flag=163,
156
+ @is_duplicate=false,
157
+ @is_mapped=true,
158
+ @is_paired=true,
159
+ @isize=180,
160
+ @mapq=60,
161
+ @mate_strand=false,
162
+ @mate_unmapped=false,
163
+ @mpos=4096,
164
+ @mrnm="=",
165
+ @pos=3992,
166
+ @primary=true,
167
+ @qlen=76,
168
+ @qname="HWI-EAS396_0001:7:115:17904:15958#0",
169
+ @qual="IIIIIIIIIIIIHHIHGIHIDGGGG...",
170
+ @query_strand=true,
171
+ @query_unmapped=false,
172
+ @rname="1",
173
+ @second_in_pair=true,
174
+ @seq="ACAGTCCAGTCAAAGTACAAATCGAG...",
175
+ @tags=
176
+ {"MD"=>#<Bio::DB::Tag:0x101114ed0 @tag="MD", @type="Z", @value="76">,
177
+ "XO"=>#<Bio::DB::Tag:0x1011155d8 @tag="XO", @type="i", @value="0">,
178
+ "AM"=>#<Bio::DB::Tag:0x101116280 @tag="AM", @type="i", @value="37">,
179
+ "X0"=>#<Bio::DB::Tag:0x101115fb0 @tag="X0", @type="i", @value="1">,
180
+ "X1"=>#<Bio::DB::Tag:0x101115c68 @tag="X1", @type="i", @value="0">,
181
+ "XG"=>#<Bio::DB::Tag:0x101115240 @tag="XG", @type="i", @value="0">,
182
+ "SM"=>#<Bio::DB::Tag:0x1011162f8 @tag="SM", @type="i", @value="37">,
183
+ "XT"=>#<Bio::DB::Tag:0x1011162a8 @tag="XT", @type="A", @value="U">,
184
+ "NM"=>#<Bio::DB::Tag:0x101116348 @tag="NM", @type="i", @value="0">,
185
+ "XM"=>#<Bio::DB::Tag:0x101115948 @tag="XM", @type="i", @value="0">}>
186
+ ```
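Several of the boolean attributes shown above are derived from the numeric SAM flag. As a rough sketch, the flag can be decoded bit by bit using the bit values defined in the SAM specification; the symbol names and the `decode_flag` helper are chosen here for illustration and are not the library's API.

```ruby
# Bit values come from the SAM specification; the symbol names are
# illustrative labels, not bio-samtools identifiers.
SAM_FLAGS = {
  0x1   => :paired,
  0x2   => :proper_pair,
  0x4   => :query_unmapped,
  0x8   => :mate_unmapped,
  0x10  => :query_reverse,
  0x20  => :mate_reverse,
  0x40  => :first_in_pair,
  0x80  => :second_in_pair,
  0x100 => :secondary,
  0x200 => :failed_quality,
  0x400 => :duplicate
}.freeze

# Return the labels of every bit that is set in the flag.
def decode_flag(flag)
  SAM_FLAGS.select { |bit, _name| (flag & bit) != 0 }.values
end

p decode_flag(163)
# prints [:paired, :proper_pair, :mate_reverse, :second_in_pair]
```

The flag of 163 in the example alignment therefore agrees with `@is_paired=true`, `@second_in_pair=true` and `@first_in_pair=false`.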
121
187
 
122
188
  ### Getting Alignments
123
189
 
124
190
  Alignments can be obtained one at a time by looping over a specified region using the `fetch()` function.
125
191
 
126
- bam.fetch("chr_1",3000,4000).each do |alignment|
127
- #do something with the alignment...
128
- end
192
+ ```ruby
193
+ bam.fetch("Chr1",3000,4000).each do |alignment|
194
+ #do something with the alignment...
195
+ end
196
+ ```
129
197
 
130
198
  A separate method `fetch_with_function()` allows you to pass a block (or
131
- a Proc object) to the function for efficient calculation. This example
199
+ a Proc object) to the function for efficient calculation. This example takes
132
200
  an alignment object and collects the sequences whose CIGAR string is a single full-length match (no indels or clipping) against the reference.
133
201
 
134
- #an array to hold the matching sequences
135
- exact_matches = []
136
-
137
- fetchAlignments = Proc.new do |a|
138
- #get the length of each read
139
- len = a.seq.length
140
- #get the cigar string
141
- cigar = a.cigar
142
- #create a cigar string which represents a full-length match
143
- cstr = len.to_s << "M"
144
- if cigar == cstr
145
- #add the current sequence to the array if it qualifies
146
- exact_matches << a.seq
147
- end
202
+ ```ruby
203
+ #an array to hold the matching sequences
204
+ exact_matches = []
205
+
206
+ matches = Proc.new do |a|
207
+ #get the length of each read
208
+ len = a.seq.length
209
+ #get the cigar string
210
+ cigar = a.cigar
211
+ #create a cigar string which represents a full-length match
212
+ cstr = len.to_s << "M"
213
+ if cigar == cstr
214
+ #add the current sequence to the array if it qualifies
215
+ exact_matches << a.seq
148
216
  end
217
+ end
149
218
 
150
- bam.fetch_with_function("chr_1", 100, 500, &fetchAlignments) #now run the fetch
219
+ bam.fetch_with_function("Chr1", 100, 500, &matches)
151
220
 
152
- puts exact_matches
221
+ puts exact_matches
222
+ ```
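The CIGAR test used in the Proc can be tried out without a BAM file. The sketch below uses a hypothetical `Read` struct in place of Bio::DB::Alignment, carrying only `seq` and `cigar`. Bear in mind that in the SAM specification the `M` operation covers both matching and mismatching bases, so a full-length `M` CIGAR rules out indels and clipping but not substitutions.

```ruby
# Hypothetical stand-in for Bio::DB::Alignment with only the fields used here.
Read = Struct.new(:seq, :cigar)

# True when the CIGAR string is a single M operation spanning the whole read.
def full_length_match?(read)
  read.cigar == "#{read.seq.length}M"
end

reads = [
  Read.new("ACGTACGT", "8M"),     # full-length match
  Read.new("ACGTACGT", "5M1I2M"), # contains an insertion
  Read.new("ACGT", "4M")          # full-length match
]

exact_matches = reads.select { |read| full_length_match?(read) }.map(&:seq)
p exact_matches # prints ["ACGTACGT", "ACGT"]
```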
153
223
 
154
224
  ###Alignment stats
155
225
 
156
226
  The SAMtools flagstat method is implemented in bio-samtools to quickly examine the number of reads mapped to the reference. This includes the number of paired and singleton reads mapped and also the number of paired-reads that map to different chromosomes/contigs.
157
227
 
158
- bam.flag_stats()
159
-
160
- An example output would be
161
-
162
- 34672 + 0 in total (QC-passed reads + QC-failed reads)
163
- 0 + 0 duplicates
164
- 33196 + 0 mapped (95.74%:nan%)
165
- 34672 + 0 paired in sequencing
166
- 17335 + 0 read1
167
- 17337 + 0 read2
168
- 31392 + 0 properly paired (90.54%:nan%)
169
- 31728 + 0 with itself and mate mapped
170
- 1468 + 0 singletons (4.23%:nan%)
171
- 0 + 0 with mate mapped to a different chr
172
- 0 + 0 with mate mapped to a different chr (mapQ>=5)
228
+ ```ruby
229
+ bam.flag_stats()
230
+ ```
173
231
 
232
+ An example output would be
233
+ ```ruby
234
+ 34672 + 0 in total (QC-passed reads + QC-failed reads)
235
+ 0 + 0 duplicates
236
+ 33196 + 0 mapped (95.74%:nan%)
237
+ 34672 + 0 paired in sequencing
238
+ 17335 + 0 read1
239
+ 17337 + 0 read2
240
+ 31392 + 0 properly paired (90.54%:nan%)
241
+ 31728 + 0 with itself and mate mapped
242
+ 1468 + 0 singletons (4.23%:nan%)
243
+ 0 + 0 with mate mapped to a different chr
244
+ 0 + 0 with mate mapped to a different chr (mapQ>=5)
245
+ ```
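The percentages in the example output can be reproduced from the counts. This is a small sketch of the arithmetic only (the `percent` helper is hypothetical); the `nan%` entries correspond to the QC-failed column, where the total is zero.

```ruby
# Percentage of `part` in `total`, rounded to two decimal places.
# Returns NaN when the total is zero, mirroring the "nan%" entries.
def percent(part, total)
  return Float::NAN if total.zero?

  (100.0 * part / total).round(2)
end

total = 34672 # QC-passed reads from the example output
counts = { "mapped" => 33196, "properly paired" => 31392, "singletons" => 1468 }

counts.each do |label, count|
  puts "#{label}: #{percent(count, total)}%"
end
# prints:
# mapped: 95.74%
# properly paired: 90.54%
# singletons: 4.23%
```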
174
246
 
175
247
  Getting Coverage Information
176
248
  ----------------------------
@@ -186,259 +258,167 @@ position in the array gives the depth of coverage at the given start
186
258
  position in the genome, the last position in the array gives the depth
187
259
  of coverage at the given start position plus the length given
188
260
 
189
- coverages = bam.chromosome_coverage("Chr1", 3000, 1000) #=> [16,16,25,25...]
261
+ ```ruby
262
+ coverages = bam.chromosome_coverage("Chr1", 3000, 1000) #=> [16,16,25,25...]
263
+ ```
190
264
 
191
265
  ### Average Coverage In A Region
192
266
 
193
- Similarly, average (arithmetic mean) of coverage can be retrieved, also
194
- with start and length parameters
195
-
196
- coverages = bam.average_coverage("Chr1", 3000, 1000) #=> 20.287
197
-
198
- ### Getting Pileup Information
267
+ Similarly, the average (arithmetic mean) coverage can be retrieved with the `average_coverage` method.
268
+
269
+ ```ruby
270
+ coverages = bam.average_coverage("Chr1", 3000, 1000) #=> 20.287
271
+ ```
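The relationship between the two coverage methods is simply the arithmetic mean of the per-position depths. A minimal sketch, using a made-up depth array rather than real `chromosome_coverage` output (the `mean_depth` helper is hypothetical):

```ruby
# Arithmetic mean of an array of per-position depths.
def mean_depth(depths)
  return 0.0 if depths.empty?

  depths.sum.to_f / depths.size
end

coverages = [16, 16, 25, 25] # made-up depths for four positions
puts mean_depth(coverages)   # prints 20.5
```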
272
+
273
+ ### Coverage from a BED file
274
+ It is possible to count the number of nucleotides mapped to a given region of a BAM file by providing a [BED formatted](http://genome.ucsc.edu/FAQ/FAQformat.html#format1) file and using the `bedcov` method. The output is the BED file with an extra column providing the number of nucleotides mapped to that region.
275
+
276
+ ```ruby
277
+ bed_file = "test.bed"
278
+ bam.bedcov(:bed=>bed_file)
279
+
280
+ => chr_1 1 30 6
281
+ => chr_1 40 45 8
282
+
283
+ ```
284
+ Alternatively, the `depth` method can be used to get per-position depth information (any unmapped positions will be ignored).
285
+ ```ruby
286
+ bed_file = "test.bed"
287
+ bam.depth(:b=>bed_file)
288
+
289
+ => chr_1 25 1
290
+ => chr_1 26 1
291
+ => chr_1 27 1
292
+ => chr_1 28 1
293
+ => chr_1 29 1
294
+ => chr_1 30 1
295
+ => chr_1 41 1
296
+ => chr_1 42 1
297
+ => chr_1 43 2
298
+ => chr_1 44 2
299
+ => chr_1 45 2
300
+ ```
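The two outputs above are consistent with each other: summing the per-position depths inside each BED interval gives the extra column reported by `bedcov`. The sketch below checks this, assuming the usual conventions that BED intervals are 0-based and half-open while depth positions are 1-based. The `bases_in_region` helper is hypothetical.

```ruby
# Per-position depths copied from the `depth` output (1-based positions).
depth = { 25 => 1, 26 => 1, 27 => 1, 28 => 1, 29 => 1, 30 => 1,
          41 => 1, 42 => 1, 43 => 2, 44 => 2, 45 => 2 }

# BED intervals copied from the `bedcov` output: 0-based start, exclusive end.
regions = [[1, 30], [40, 45]]

# Sum the depth over the 1-based positions covered by a BED interval.
# Positions missing from the hash have no mapped reads and count as zero.
def bases_in_region(depth, bed_start, bed_end)
  ((bed_start + 1)..bed_end).sum { |pos| depth.fetch(pos, 0) }
end

regions.each do |bed_start, bed_end|
  puts "chr_1\t#{bed_start}\t#{bed_end}\t#{bases_in_region(depth, bed_start, bed_end)}"
end
# the last column is 6 for the first region and 8 for the second
```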
301
+ ##Getting Pileup Information
199
302
 
200
303
  Pileup format represents the coverage of reads over a single base in the
201
304
  reference. Getting a Pileup over a region is very easy. Note that this
202
- is done with `mpileup` and NOT the now deprecated SAMTools `pileup`
305
+ is done with `mpileup` and NOT the now deprecated SAMtools `pileup`
203
306
  function. Calling the `mpileup` method creates an iterator that yields a
204
307
  Pileup object for each base.
205
308
 
206
- bam.mpileup do |pileup|
207
- puts pileup.consensus #gives the consensus base from the reads for that postion
208
- end
309
+ ```ruby
310
+ bam.mpileup do |pileup|
311
+ puts pileup.consensus #gives the consensus base from the reads for that position
312
+ end
313
+ ```
209
314
 
210
315
  ###Caching pileups
211
- A pileup can be cached, so if you want to execute several operations on the same set of regions, mpilup won't be executed several times. Whenever you finish using a region, call mpileup_clear_cache to free the cache. The argument 'Region' is required, as it will be the key for the underlying hash. We asume that the options (other than the region) are constant. If they are not, the cache mechanism may not be consistent.
316
+ A pileup can be cached, so if you want to execute several operations on the same set of regions, `mpileup` won't be executed several times. Whenever you finish using a region, call `mpileup_clear_cache` to free the cache. The `Region` argument is required, as it will be the key for the underlying hash. We assume that the options (other than the region) are constant. If they are not, the cache mechanism may not be consistent.
317
+
318
+ ```ruby
319
+ #create an mpileup
320
+ reg = Bio::DB::Fasta::Region.new
321
+ reg.entry = "Chr1"
322
+ reg.start = 1
323
+ reg.end = 334
212
324
 
213
- #create an mpileup
214
- reg = Bio::DB::Fasta::Region.new
215
- reg.entry = "chr_1"
216
- reg.start = 1
217
- reg.end = 334
218
-
219
- bam.mpileup_cached(:r=>reg,:g => false, :min_cov => 1, :min_per =>0.2) do |pileup|
220
-
325
+ bam.mpileup_cached(:r=>reg,:g => false, :min_cov => 1, :min_per =>0.2) do |pileup|
326
+ puts pileup.consensus
327
+ end
328
+ bam.mpileup_clear_cache(reg)
329
+ ```
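The caching behaviour described here, a hash keyed by region that is filled on first use and cleared explicitly, can be sketched independently of the library. `RegionCache` is a hypothetical class written for illustration and is not how bio-samtools necessarily implements `mpileup_cached`. It also shows why the other options must stay constant: only the region is part of the key.

```ruby
# Hypothetical region-keyed cache; not the bio-samtools implementation.
class RegionCache
  attr_reader :computations

  def initialize
    @store = {}
    @computations = 0
  end

  # Return the cached value for the region, computing it with the given
  # block only the first time the region is requested.
  def fetch(region_key)
    @store.fetch(region_key) do
      @computations += 1
      @store[region_key] = yield
    end
  end

  # Free the cached value for one region.
  def clear(region_key)
    @store.delete(region_key)
  end
end

cache = RegionCache.new
2.times { cache.fetch("Chr1:1-334") { %w[c c t a a] } } # stand-in for an expensive pileup
puts cache.computations # prints 1, the second call was served from the cache
cache.clear("Chr1:1-334")
```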
221
330
 
222
331
 
223
332
  #### Pileup options
224
333
 
225
- The `mpileup` function takes a range of parameters to allow SAMTools
334
+ The `mpileup` function takes a range of parameters to allow SAMtools
226
335
  level filtering of reads and alignments. They are specified as key =\>
227
336
  value pairs eg
228
337
 
229
- bam.mpileup(:r => "Chr1:1000-2000", :Q => 50) do |pileup|
230
- ##only pileups on Chr1 between positions 1000-2000 are considered,
231
- ##bases with Quality Score < 50 are excluded
232
- ...
233
- end
338
+ ```ruby
339
+ bam.mpileup(:r => "Chr1:1000-2000", :Q => 50) do |pileup|
340
+ ##only pileups on Chr1 between positions 1000-2000 are considered,
341
+ ##bases with Quality Score < 50 are excluded
342
+ ...
343
+ end
344
+ ```
234
345
 
235
- Not all the options SAMTools allows you to pass to mpileup will return a
236
- Pileup object, those that cause mpileup to return BCF/VCF will be
237
- ignored. Specifically these are g,u,e,h,I,L,o,p. The table below lists
238
- the SAMTools flags supported and the symbols you can use to call them in
346
+ Not all the options SAMtools allows you to pass to mpileup will return a
347
+ Pileup object. The table below lists the SAMtools flags supported and the symbols you can use to call them in
239
348
  the mpileup command.
240
349
 
241
- SAMTools option
242
-
243
- description
244
-
245
- short symbol
246
-
247
- long symbol
248
-
249
- default
250
-
251
- example
252
-
253
- `r`
254
-
255
- limit retrieval to a region
256
-
257
- `:r`
258
-
259
- `:region`
260
-
261
- all positions
262
-
263
- `:r => "Chr1:1000-2000"`
264
-
265
- `6`
266
-
267
- assume Illumina scaled quality scores
268
-
269
- `:six`
270
-
271
- `:illumina_quals`
272
-
273
- false
274
-
275
- `:six => true`
276
-
277
- `A`
278
-
279
- count anomalous read pairs scores
280
-
281
- `:A`
282
-
283
- `:count_anomalous`
284
-
285
- false
286
-
287
- `:A => true`
288
-
289
- `B`
290
-
291
- disable BAQ computation
292
-
293
- `:B`
294
-
295
- `:no_baq`
296
-
297
- false
298
-
299
- `:no_baq => true`
300
-
301
- `C`
350
+ <table><tr><th>SAMtools options</th><th>description</th><th>short symbol</th><th>long symbol</th><th>default</th><th>example</th></tr>
351
+ <tr><td>r</td><td>limit retrieval to a region</td><td>:r</td><td>:region</td><td>all positions</td><td>:r => "Chr1:1000-2000"</td></tr>
352
+ <tr><td>6</td><td>assume Illumina scaled quality scores</td><td>:six</td><td>:illumina_quals</td><td>false</td><td>:six => true</td></tr>
353
+ <tr><td>A</td><td>count anomalous read pairs</td><td>:A</td><td>:count_anomalous</td><td>false</td><td>:A => true</td></tr>
354
+ <tr><td>B</td><td>disable BAQ computation</td><td>:B</td><td>:no_baq</td><td>false</td><td>:no_baq => true</td></tr>
355
+ <tr><td>C</td><td>parameter for adjusting mapQ</td><td>:C</td><td>:adjust_mapq</td><td>0</td><td>:C => 25</td></tr>
356
+ <tr><td>d</td><td>max per-BAM depth to avoid excessive memory usage</td><td>:d</td><td>:max_per_bam_depth</td><td>250</td><td>:d => 123</td></tr>
357
+ <tr><td>E</td><td>extended BAQ for higher sensitivity but lower specificity</td><td>:E</td><td>:extended_baq</td><td>false</td><td>:E => true</td></tr>
358
+ <tr><td>G</td><td>exclude read groups listed in FILE</td><td>:G</td><td>:exclude_reads_file</td><td>false</td><td>:G => 'my_file.txt'</td></tr>
359
+ <tr><td>l</td><td>list of positions (chr pos) or regions (BED)</td><td>:l</td><td>:list_of_positions</td><td>false</td><td>:l => 'my_posns.bed'</td></tr>
360
+ <tr><td>M</td><td>cap mapping quality at value</td><td>:M</td><td>:mapping_quality_cap</td><td>60</td><td>:M => 40 </td></tr>
361
+ <tr><td>R</td><td>ignore RG tags</td><td>:R</td><td>:ignore_rg</td><td>false</td><td>:R => true </td></tr>
362
+ <tr><td>q</td><td>skip alignments with mapping quality smaller than value</td><td>:q</td><td>:min_mapping_quality</td><td>0</td><td>:q => 30 </td></tr>
363
+ <tr><td>Q</td><td>skip bases with base quality smaller than value</td><td>:Q</td><td>:imin_base_quality</td><td>13</td><td>:Q => 30</td></tr>
364
+ </table>
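To make the short and long symbol columns concrete, here is a rough sketch of how long symbols could be mapped onto their short forms and turned into command-line style flags. The mapping uses four rows from the table; the `normalise_options` and `to_flags` helpers are hypothetical and do not claim to reflect how bio-samtools builds its command internally.

```ruby
# Four long-to-short pairs taken from the table above.
LONG_TO_SHORT = {
  :region              => :r,
  :min_mapping_quality => :q,
  :no_baq              => :B,
  :max_per_bam_depth   => :d
}.freeze

# Replace long symbols with their short form; short symbols pass through.
def normalise_options(opts)
  opts.each_with_object({}) do |(key, value), out|
    out[LONG_TO_SHORT.fetch(key, key)] = value
  end
end

# Render options as flags: boolean true becomes a bare flag, false or nil
# is dropped, and any other value follows the flag.
def to_flags(opts)
  normalise_options(opts)
    .reject { |_key, value| value.nil? || value == false }
    .map { |key, value| value == true ? "-#{key}" : "-#{key} #{value}" }
    .join(" ")
end

puts to_flags(:region => "Chr1:1000-2000", :min_mapping_quality => 30, :no_baq => true)
# prints "-r Chr1:1000-2000 -q 30 -B"
```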
302
365
 
303
- parameter for adjusting mapQ
304
366
 
305
- `:C`
306
-
307
- `:adjust_mapq`
308
-
309
- 0
310
-
311
- `:C => 25`
312
-
313
- `d`
314
-
315
- max per-BAM depth to avoid excessive memory usage
316
-
317
- `:d`
318
-
319
- `:max_per_bam_depth`
320
-
321
- 250
322
-
323
- `:d => 123`
324
-
325
- `E`
326
-
327
- extended BAQ for higher sensitivity but lower specificity
328
-
329
- `:E`
330
-
331
- `:extended_baq`
332
-
333
- false
334
-
335
- `:E => true`
336
-
337
- `G`
338
-
339
- exclude read groups listed in FILE
340
-
341
- `:G`
342
-
343
- `:exclude_reads_file`
344
-
345
- false
346
-
347
- `:G => 'my_file.txt'`
348
-
349
- `l`
350
-
351
- list of positions (chr pos) or regions (BED)
352
-
353
- `:l`
354
-
355
- `:list_of_positions`
356
-
357
- false
358
-
359
- `:l => 'my_posns.bed'`
360
-
361
- `M`
362
-
363
- cap mapping quality at value
364
-
365
- `:M`
366
-
367
- `:mapping_quality_cap`
368
-
369
- 60
370
-
371
- `:M => 40 `
372
-
373
- `R`
374
-
375
- ignore RG tags
376
-
377
- `:R`
378
-
379
- `:ignore_rg`
380
-
381
- false
382
-
383
- `:R => true `
384
-
385
- `q`
386
-
387
- skip alignments with mapping quality smaller than value
388
-
389
- `:q`
390
-
391
- `:min_mapping_quality`
392
-
393
- 0
367
+ ##Coverage Plots
368
+ You can create images that represent read coverage over binned regions of the reference sequence. The output format is svg. A number of parameters can be changed to alter the style of the plot. In the examples below the bin size and fill_color have been used to create plots with different colours and bar widths.
394
369
 
395
- `:q => 30 `
370
+ The following lines of code...
396
371
 
397
- `Q`
372
+ ```ruby
373
+ bam.plot_coverage("Chr1", 201, 2000, :bin=>20, :svg => "out2.svg", :fill_color => '#F1A1B1')
374
+ bam.plot_coverage("Chr1", 201, 2000, :bin=>50, :svg => "out.svg", :fill_color => '#99CCFF')
375
+ bam.plot_coverage("Chr1", 201, 1000, :bin=>250, :svg => "out3.svg", :fill_color => '#33AD5C', :stroke => '#33AD5C')
376
+ ```
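The effect of the `:bin` parameter can be pictured as grouping per-position depths into fixed-width bins and drawing one bar per bin. The sketch below averages each bin; whether `plot_coverage` uses the mean or another summary per bin is an assumption here, and `bin_means` is a hypothetical helper.

```ruby
# Split per-position depths into bins of `bin_size` positions and return
# the mean depth of each bin. The final bin may be shorter than the rest.
def bin_means(depths, bin_size)
  depths.each_slice(bin_size).map { |bin| bin.sum.to_f / bin.size }
end

depths = [2, 4, 6, 8, 10, 12, 7] # made-up depths for seven positions
p bin_means(depths, 3)           # prints [4.0, 10.0, 7.0]
```

A larger bin size gives fewer, wider bars and a smoother plot.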
398
377
 
399
- skip bases with base quality smaller than value
378
+ ![Coverage plot 1](http://ethering.github.io/bio-samtools/images/out2.svg)
379
+ ![Coverage plot 2](http://ethering.github.io/bio-samtools/images/out.svg)
380
+ ![Coverage plot 3](http://ethering.github.io/bio-samtools/images/out3.svg)
400
381
 
401
- `:Q`
382
+ The `plot_coverage` method will also return the raw svg code, for further use. Simply leave out a file name and assign the method to a variable.
402
383
 
403
- `:imin_base_quality`
384
+ ```ruby
385
+ svg = bam.plot_coverage("Chr1", 201, 2000, :bin=>50, :fill_color => '#99CCFF')
404
386
 
405
- 13
387
+ ```
406
388
 
407
- `:Q => 30 `
408
389
 
390
+ #VCF methods
391
+ For enhanced SNP calling, we've included a VCF class which reflects each non-metadata line of a VCF file.
392
+ The VCF class returns the eight fixed fields present in VCF files, namely chromosome, position, ID, reference base, alt bases, alt quality score, filter and info along with the genotype fields, format and samples. This information allows the comparison of variants and their genotypes across any number of samples.
393
+ The following code takes a number of VCF objects and examines them for homozygous alt (1/1) SNPs
409
394
 
410
- There is an 'experimental' function, `mpileup_plus`, that can return a
411
- Bio::DB::Vcf object when g,u,e,h,I,L,o,p options are passed. The list
412
- below shows the symbols you can use to invoke this behaviour:
395
+ ```ruby
396
+ vcfs = []
397
+ vcfs << vcf1 = Bio::DB::Vcf.new("20 14370 rs6054257 G A 29 0 NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:-1,-1") #from a 3.3 vcf file
398
+ vcfs << vcf2 = Bio::DB::Vcf.new("19 111 . A C 9.6 . . GT:HQ 0|0:10,10 0/0:10,10 0/1:3,3") #from a 4.0 vcf file
399
+ vcfs << vcf3 = Bio::DB::Vcf.new("20 14380 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,") #from a 4.0 vcf file
413
400
 
414
- - `:genotype_calling, :g`
415
- - `:uncompressed_bcf , :u`
416
- - `:extension_sequencing_probability, :e`
417
- - `:homopolymer_error_coefficient, :h`
418
- - `:no_indels, :I`
419
- - `:skip_indel_over_average_depth, :L`
420
- - `:gap_open_sequencing_error_probability,:o`
421
- - `:platforms, :P`
401
+ vcfs.each do |vcf|
402
+ vcf.samples.each do |sample|
403
+ genotype = sample[1]['GT']
404
+ if genotype == '1/1' or genotype == '1|1'
405
+ print vcf.chrom, " "
406
+ puts vcf.pos
407
+ end
408
+ end
409
+ end
422
410
 
423
- ##Coverage Plots
424
- You can create images that represent read coverage over binned regions of the reference sequence. The output format is svg. A number of parameters can be changed to alter the style of the plot. In the examples below the bin size and fill_color have been used to create plots with different colours and bar widths.
411
+ => 20 14370
412
+ => 20 14380
413
+ ```
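The genotype lookup relies on pairing the FORMAT keys with the colon-separated values of each sample column. A plain-Ruby sketch of that pairing, using the sample columns from the first VCF line above, is given below. `parse_sample` and `homozygous_alt?` are hypothetical helpers and not methods of Bio::DB::Vcf.

```ruby
# Pair FORMAT keys with one sample's colon-separated values.
def parse_sample(format, sample)
  format.split(":").zip(sample.split(":")).to_h
end

# True for a diploid genotype with two copies of the first alt allele,
# whether phased (1|1) or unphased (1/1).
def homozygous_alt?(genotype)
  alleles = genotype.split(%r{[/|]})
  alleles.size == 2 && alleles.all? { |allele| allele == "1" }
end

format  = "GT:GQ:DP:HQ"
samples = ["0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:-1,-1"]

samples.each do |sample|
  fields = parse_sample(format, sample)
  puts "#{fields['GT']} homozygous alt? #{homozygous_alt?(fields['GT'])}"
end
# prints:
# 0|0 homozygous alt? false
# 1|0 homozygous alt? false
# 1/1 homozygous alt? true
```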
425
414
 
426
- The following lines of code...
427
-
428
- bam.plot_coverage("chr01", 201, 2000, :bin=>20, :svg => "out2.svg", :fill_color => '#F1A1B1')
429
- bam.plot_coverage("chr_1", 201, 2000, :bin=>50, :svg => "out.svg", :fill_color => '#99CCFF')
430
- bam.plot_coverage("chr01", 201, 1000, :bin=>250, :svg => "out3.svg", :fill_color => '#33AD5C')
431
-
432
-
433
- ..create these plots:
434
- ![coverage plot](images/out2.svg =700x "coverage plot")
435
- ![coverage plot](images/out.svg =700x "coverage plot")
436
- ![coverage plot](images/out3.svg =700x "coverage plot")
415
+ ##Other methods not covered
416
+ The SAMtools methods faidx, fixmate, tview, reheader, calmd, targetcut and phase are all included in the current bio-samtools release.
437
417
 
438
418
  Tests
439
419
  -----
440
420
 
441
421
  The easiest way to run the built-in unit tests is to change to the
442
422
  bio-samtools source directory and run `rake test`.
443
-
423
+
444
424
  Each test file tests different aspects of the code.