bio-samtools 2.0.4 → 2.0.5

@@ -0,0 +1,444 @@
1
+
2
+ bio-samtools Basic Tutorial
3
+ ===========================
4
+
5
+ Introduction
6
+ ------------
7
+
8
+ bio-samtools is a Ruby binding to the popular [SAMtools](http://samtools.sourceforge.net/) library, and provides access to individual read alignments as well as BAM files, reference sequence and pileup information.
9
+
10
+ Installation
11
+ ------------
12
+
13
+ Installation of bio-samtools is very straightforward, and is
14
+ accomplished with the Ruby gems command. All you need is an internet
15
+ connection.
16
+
17
+ ### Prerequisites
18
+
19
+ bio-samtools relies on the following other rubygems:
20
+
21
+ - [bio \>= 1.4.2](http://rubygems.org/gems/bio)
22
+ - [bio-svgenes >= 0.4.1](https://rubygems.org/gems/bio-svgenes)
23
+
24
+ Once these are installed, bio-samtools can be installed with
25
+
26
+ sudo gem install bio-samtools
27
+
28
+
29
+ To test whether the installation went well, start interactive Ruby (IRB) in
+ the terminal and type `require 'bio-samtools'`. If it returns `true`, then
+ all is well.
33
+
34
+ $ irb
35
+ >> require 'bio-samtools'
36
+ => true
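The same check can be made from a script rather than from IRB. This is a minimal sketch that assumes only the gem name shown above; the variable `samtools_available` is a name chosen for this illustration and is not part of bio-samtools.

```ruby
# Try to load the gem and record whether it worked.
# `samtools_available` is an illustrative name, not part of bio-samtools.
samtools_available =
  begin
    require 'bio-samtools'
    true
  rescue LoadError => e
    warn "bio-samtools could not be loaded: #{e.message}"
    false
  end

puts "bio-samtools available: #{samtools_available}"
```

The block returns `true` explicitly because `require` itself returns `false` when the library has already been loaded in the current process.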
37
+
38
+ Working with BAM files
39
+ ----------------------
40
+
41
+
42
+ ### Creating a new SAM object
43
+
44
+ A SAM object represents the alignments in the BAM file. BAM files (and hence SAM objects here) are what most SAMtools methods operate on, and they are straightforward to create. You will need a sorted BAM file to access the alignments, and a reference sequence in FASTA format to access the reference. The object can be created and opened as follows:
45
+
46
+ bam = Bio::DB::Sam.new(:bam=>"my_sorted.bam", :fasta=>'ref.fasta')
47
+
48
+ The file needs to be opened only once for multiple operations on it. Access
+ to the alignments is random, so you don't need to loop over the entries in
+ the file.
51
+
52
+ ### Getting Reference Sequence
53
+
54
+ The reference is accessed using reference
55
+ name, start, end in 1-based co-ordinates. A standard Ruby String object is returned.
56
+
57
+ sequence_fragment = bam.fetch_reference("Chr1", 1, 100)
58
+
59
+ The output from this would be the raw sequence as a string, e.g.
60
+
61
+ cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
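Because the coordinates are 1-based and the end position is included, they differ by one from Ruby's own 0-based string indexing. The sketch below illustrates the conversion on a plain String; `slice_one_based` is a hypothetical helper written for this example and is not part of bio-samtools.

```ruby
# Hypothetical helper: extract a 1-based, inclusive range from a plain String.
def slice_one_based(sequence, start, stop)
  raise ArgumentError, "start must be >= 1" if start < 1
  raise ArgumentError, "stop must be >= start" if stop < start

  # Ruby strings are 0-based, so shift both ends down by one.
  sequence[(start - 1)..(stop - 1)]
end

reference = "cctaaccctaaccctaaccctaaccc"
fragment = slice_one_based(reference, 1, 10)
puts fragment         # cctaacccta
puts fragment.length  # 10
```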
62
+
63
+ A reference sequence can be returned as a Bio::Sequence::NA object by the use of `:as_bio => true`
64
+
65
+ sequence_fragment = bam.fetch_reference("Chr1", 1, 100, :as_bio => true)
66
+
67
+ The output from this would be a Bio::Sequence::NA object, which provides a fasta-formatted string when printed
68
+
69
+ >chr_1:1-100
+ cctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
70
+
71
+ #### Alignment Objects
72
+
73
+ Each alignment represents a single read and is returned as a
+ Bio::DB::Alignment object. These objects have numerous methods of their own;
+ `require 'pp'` lets you pretty-print one to check the attributes it contains.
+ Here is an example alignment object. Each name prefixed with `@` is a Ruby
+ instance variable, and here each one can be read through a method of the same
+ name. Thus the `@is_mapped` attribute of an object `a` is accessed as
+ `a.is_mapped`.
80
+
81
+ require 'pp'
82
+ pp an_alignment_object ##some Bio::DB::Alignment object
83
+ #<Bio::DB::Alignment:0x101113f80
84
+ @al=#<Bio::DB::SAM::Tools::Bam1T:0x101116a50>,
85
+ @calend=4067,
86
+ @cigar="76M",
87
+ @failed_quality=false,
88
+ @first_in_pair=false,
89
+ @flag=163,
90
+ @is_duplicate=false,
91
+ @is_mapped=true,
92
+ @is_paired=true,
93
+ @isize=180,
94
+ @mapq=60,
95
+ @mate_strand=false,
96
+ @mate_unmapped=false,
97
+ @mpos=4096,
98
+ @mrnm="=",
99
+ @pos=3992,
100
+ @primary=true,
101
+ @qlen=76,
102
+ @qname="HWI-EAS396_0001:7:115:17904:15958#0",
103
+ @qual="IIIIIIIIIIIIHHIHGIHIDGGGG...",
104
+ @query_strand=true,
105
+ @query_unmapped=false,
106
+ @rname="1",
107
+ @second_in_pair=true,
108
+ @seq="ACAGTCCAGTCAAAGTACAAATCGAG...",
109
+ @tags=
110
+ {"MD"=>#<Bio::DB::Tag:0x101114ed0 @tag="MD", @type="Z", @value="76">,
111
+ "XO"=>#<Bio::DB::Tag:0x1011155d8 @tag="XO", @type="i", @value="0">,
112
+ "AM"=>#<Bio::DB::Tag:0x101116280 @tag="AM", @type="i", @value="37">,
113
+ "X0"=>#<Bio::DB::Tag:0x101115fb0 @tag="X0", @type="i", @value="1">,
114
+ "X1"=>#<Bio::DB::Tag:0x101115c68 @tag="X1", @type="i", @value="0">,
115
+ "XG"=>#<Bio::DB::Tag:0x101115240 @tag="XG", @type="i", @value="0">,
116
+ "SM"=>#<Bio::DB::Tag:0x1011162f8 @tag="SM", @type="i", @value="37">,
117
+ "XT"=>#<Bio::DB::Tag:0x1011162a8 @tag="XT", @type="A", @value="U">,
118
+ "NM"=>#<Bio::DB::Tag:0x101116348 @tag="NM", @type="i", @value="0">,
119
+ "XM"=>#<Bio::DB::Tag:0x101115948 @tag="XM", @type="i", @value="0">}>
120
+
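Several of the boolean attributes in the dump are encoded in the single integer `@flag`. As an illustration, the sketch below decodes a flag using the bit values defined in the SAM format specification. The names `SAM_FLAG_BITS` and `decode_flag` are made up for this example, and the symbols are descriptive labels rather than bio-samtools method names.

```ruby
# Bit values from the SAM format specification; the labels are illustrative.
SAM_FLAG_BITS = {
  0x1   => :paired,
  0x2   => :proper_pair,
  0x4   => :unmapped,
  0x8   => :mate_unmapped,
  0x10  => :reverse_strand,
  0x20  => :mate_reverse_strand,
  0x40  => :first_in_pair,
  0x80  => :second_in_pair,
  0x100 => :secondary,
  0x200 => :failed_quality,
  0x400 => :duplicate
}.freeze

# Return the labels of every bit that is set in the flag.
def decode_flag(flag)
  SAM_FLAG_BITS.select { |bit, _name| (flag & bit) != 0 }.values
end

p decode_flag(163)
# => [:paired, :proper_pair, :mate_reverse_strand, :second_in_pair]
```

The decoded labels for 163 agree with `@is_paired=true` and `@second_in_pair=true` in the example object.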
121
+
122
+ ### Getting Alignments
123
+
124
+ Alignments can be obtained one at a time by looping over a specified region using the `fetch()` function.
125
+
126
+ bam.fetch("chr_1",3000,4000).each do |alignment|
127
+ #do something with the alignment...
128
+ end
129
+
130
+ A separate method, `fetch_with_function()`, allows you to pass a block (or
+ a Proc object) to the function for efficient calculation. This example takes
+ each alignment object in turn and builds an array of the sequences that match the reference over their full length.
133
+
134
+ #an array to hold the matching sequences
135
+ exact_matches = []
136
+
137
+ fetchAlignments = Proc.new do |a|
+   #get the length of each read
+   len = a.seq.length
+   #get the cigar string
+   cigar = a.cigar
+   #create a cigar string which represents a full-length match
+   cstr = len.to_s << "M"
+   if cigar == cstr
+     #add the current sequence to the array if it qualifies
+     exact_matches << a.seq
+   end
+ end
149
+
150
+ bam.fetch_with_function("chr_1", 100, 500, &fetchAlignments) #now run the fetch
151
+
152
+ puts exact_matches
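The filtering logic can be tried without a BAM file. The sketch below applies the same CIGAR test to a few made-up reads; the `Read` struct and `full_length_match?` are stand-ins invented for this example, with only the `seq` and `cigar` attributes used above.

```ruby
# Stand-in for an alignment object; only the two attributes used here.
Read = Struct.new(:seq, :cigar)

# True when the CIGAR string is a single M operation spanning the whole read.
def full_length_match?(read)
  read.cigar == "#{read.seq.length}M"
end

reads = [
  Read.new("ACGTACGT", "8M"),      # full-length match
  Read.new("ACGTACGT", "4M1I3M"),  # contains an insertion
  Read.new("ACGT", "2S2M")         # soft-clipped at the start
]

full_length = reads.select { |r| full_length_match?(r) }.map(&:seq)
p full_length
# => ["ACGTACGT"]
```

In the SAM specification the `M` operation covers both matching and mismatching bases, so a single full-length `M` shows that the read aligned without gaps or clipping. The `NM` tag (0 in the example alignment object) is what reports the edit distance to the reference.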
153
+
154
+ ### Alignment stats
155
+
156
+ The SAMtools flagstat method is implemented in bio-samtools to quickly examine the number of reads mapped to the reference. This includes the number of paired and singleton reads mapped and also the number of paired-reads that map to different chromosomes/contigs.
157
+
158
+ bam.flag_stats()
159
+
160
+ An example output would be
161
+
162
+ 34672 + 0 in total (QC-passed reads + QC-failed reads)
163
+ 0 + 0 duplicates
164
+ 33196 + 0 mapped (95.74%:nan%)
165
+ 34672 + 0 paired in sequencing
166
+ 17335 + 0 read1
167
+ 17337 + 0 read2
168
+ 31392 + 0 properly paired (90.54%:nan%)
169
+ 31728 + 0 with itself and mate mapped
170
+ 1468 + 0 singletons (4.23%:nan%)
171
+ 0 + 0 with mate mapped to a different chr
172
+ 0 + 0 with mate mapped to a different chr (mapQ>=5)
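The percentages in this report are each count divided by the total number of QC-passed reads. As a check, the sketch below parses a few lines of the example output as plain text and recomputes them. It works on a copied string rather than on the return value of `flag_stats`, whose exact type is not described here.

```ruby
# A few lines copied from the example output above.
flagstat_text = <<~TEXT
  34672 + 0 in total (QC-passed reads + QC-failed reads)
  33196 + 0 mapped (95.74%:nan%)
  31392 + 0 properly paired (90.54%:nan%)
  1468 + 0 singletons (4.23%:nan%)
TEXT

# Collect the QC-passed count for each category, keyed by its label.
counts = {}
flagstat_text.each_line do |line|
  match = line.match(/\A(\d+) \+ \d+ (.+?)(?: \(.*\))?\s*\z/)
  next unless match
  counts[match[2]] = match[1].to_i
end

# Express each category as a percentage of the total.
total = counts.fetch("in total")
percentages = {}
["mapped", "properly paired", "singletons"].each do |label|
  percentages[label] = (100.0 * counts.fetch(label) / total).round(2)
end

percentages.each { |label, percent| puts "#{label}: #{percent}%" }
# mapped: 95.74%
# properly paired: 90.54%
# singletons: 4.23%
```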
173
+
174
+
175
+ Getting Coverage Information
176
+ ----------------------------
177
+
178
+
179
+ ### Per Base Coverage
180
+
181
+ To get the total depth of reads at each position in a region, use the
+ `chromosome_coverage` function. This differs from the previous functions in
+ that a start position and length (rather than end position) are passed to the
+ function. An array of coverages is returned: the first element gives the
+ depth of coverage at the given start position in the genome, and the last
+ element gives the depth of coverage at the given start position plus the
+ length given.
188
+
189
+ coverages = bam.chromosome_coverage("Chr1", 3000, 1000) #=> [16,16,25,25...]
190
+
191
+ ### Average Coverage In A Region
192
+
193
+ Similarly, average (arithmetic mean) of coverage can be retrieved, also
194
+ with start and length parameters
195
+
196
+ coverages = bam.average_coverage("Chr1", 3000, 1000) #=> 20.287
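The relationship between the two results is a plain arithmetic mean over the per-base array. A minimal sketch with made-up depths (the numbers below are not taken from any real BAM file):

```ruby
# Made-up per-base depths standing in for a chromosome_coverage result.
depths = [16, 16, 25, 25, 18]

# Arithmetic mean: sum of depths divided by the number of positions.
mean_depth = depths.sum.to_f / depths.size

puts mean_depth   # 20.0
puts depths.max   # 25
```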
197
+
198
+ ### Getting Pileup Information
199
+
200
+ Pileup format represents the coverage of reads over a single base in the
201
+ reference. Getting a Pileup over a region is very easy. Note that this
202
+ is done with `mpileup` and NOT the now deprecated SAMTools `pileup`
203
+ function. Calling the `mpileup` method creates an iterator that yields a
204
+ Pileup object for each base.
205
+
206
+ bam.mpileup do |pileup|
207
+   puts pileup.consensus #gives the consensus base from the reads for that position
208
+ end
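To illustrate the idea of a consensus base, the sketch below takes a simple majority vote over the bases observed at one position. This is an assumption made for illustration only: `majority_base` is a hypothetical helper, and the rule used by the Pileup object's `consensus` method may differ.

```ruby
# Hypothetical helper: the most frequent base in a string of read bases.
# Ties are resolved in favour of the base seen first.
def majority_base(bases)
  counts = Hash.new(0)
  bases.upcase.each_char { |base| counts[base] += 1 }
  counts.max_by { |_base, count| count }.first
end

puts majority_base("AAGAaTA")   # A
```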
209
+
210
+ ### Caching pileups
211
+ A pileup can be cached, so if you want to execute several operations on the same set of regions, mpileup won't be executed several times. Whenever you finish using a region, call `mpileup_clear_cache` to free the cache. The argument 'Region' is required, as it will be the key for the underlying hash. We assume that the options (other than the region) are constant. If they are not, the cache mechanism may not be consistent.
212
+
213
+ #create an mpileup
214
+ reg = Bio::DB::Fasta::Region.new
215
+ reg.entry = "chr_1"
216
+ reg.start = 1
217
+ reg.end = 334
218
+
219
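+ bam.mpileup_cached(:r=>reg,:g => false, :min_cov => 1, :min_per =>0.2) do |pileup|
+   puts pileup.consensus #do something with each cached pileup
+ end
+
+ #free the cache for this region once you are finished with it
+ bam.mpileup_clear_cache(reg)
+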
222
+
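The caching behaviour described above can be pictured as a hash keyed by region. The sketch below is a generic illustration of that idea in plain Ruby; the `RegionCache` class is hypothetical and is not the bio-samtools implementation.

```ruby
# Hypothetical cache keyed by region. The block passed to `fetch` stands in
# for an expensive mpileup run.
class RegionCache
  def initialize
    @store = {}
  end

  # Return the cached value for the region, computing it on first use only.
  def fetch(region)
    @store.fetch(region) { @store[region] = yield(region) }
  end

  # Forget the cached value for one region.
  def clear(region)
    @store.delete(region)
  end
end

calls = 0
cache = RegionCache.new
2.times { cache.fetch("chr_1:1-334") { |r| calls += 1; "pileups for #{r}" } }
puts calls   # 1

cache.clear("chr_1:1-334")
cache.fetch("chr_1:1-334") { |r| calls += 1; "pileups for #{r}" }
puts calls   # 2
```

Because only the region is used as the key, a second request for the same region with different options would return the earlier result, which is why the options are assumed to stay constant.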
223
+ #### Pileup options
224
+
225
+ The `mpileup` function takes a range of parameters to allow SAMtools-level
+ filtering of reads and alignments. They are specified as key =\> value pairs,
+ e.g.
228
+
229
+ bam.mpileup(:r => "Chr1:1000-2000", :Q => 50) do |pileup|
230
+ ##only pileups on Chr1 between positions 1000-2000 are considered,
231
+ ##bases with Quality Score < 50 are excluded
232
+ ...
233
+ end
234
+
235
+ Not all the options SAMtools allows you to pass to mpileup will return a
+ Pileup object; those that cause mpileup to return BCF/VCF will be ignored.
+ Specifically, these are g,u,e,h,I,L,o,p. The table below lists the SAMtools
+ flags supported and the symbols you can use to call them in the mpileup
+ command.
240
+
241
+ | SAMTools option | description | short symbol | long symbol | default | example |
+ |---|---|---|---|---|---|
+ | `r` | limit retrieval to a region | `:r` | `:region` | all positions | `:r => "Chr1:1000-2000"` |
+ | `6` | assume Illumina scaled quality scores | `:six` | `:illumina_quals` | false | `:six => true` |
+ | `A` | count anomalous read pairs scores | `:A` | `:count_anomalous` | false | `:A => true` |
+ | `B` | disable BAQ computation | `:B` | `:no_baq` | false | `:no_baq => true` |
+ | `C` | parameter for adjusting mapQ | `:C` | `:adjust_mapq` | 0 | `:C => 25` |
+ | `d` | max per-BAM depth to avoid excessive memory usage | `:d` | `:max_per_bam_depth` | 250 | `:d => 123` |
+ | `E` | extended BAQ for higher sensitivity but lower specificity | `:E` | `:extended_baq` | false | `:E => true` |
+ | `G` | exclude read groups listed in FILE | `:G` | `:exclude_reads_file` | false | `:G => 'my_file.txt'` |
+ | `l` | list of positions (chr pos) or regions (BED) | `:l` | `:list_of_positions` | false | `:l => 'my_posns.bed'` |
+ | `M` | cap mapping quality at value | `:M` | `:mapping_quality_cap` | 60 | `:M => 40` |
+ | `R` | ignore RG tags | `:R` | `:ignore_rg` | false | `:R => true` |
+ | `q` | skip alignments with mapping quality smaller than value | `:q` | `:min_mapping_quality` | 0 | `:q => 30` |
+ | `Q` | skip bases with base quality smaller than value | `:Q` | `:imin_base_quality` | 13 | `:Q => 30` |
408
+
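One way to picture how the short and long symbols in the table relate to SAMtools flags is as a lookup followed by flag construction. The sketch below is purely illustrative: `LONG_TO_SHORT` copies only a few rows of the table, and `to_flags` is a hypothetical helper, not the way bio-samtools builds its command internally.

```ruby
# A few long/short symbol pairs copied from the table; the rest are omitted.
LONG_TO_SHORT = {
  region: :r,
  illumina_quals: :six,
  no_baq: :B,
  min_mapping_quality: :q
}.freeze

# Hypothetical helper: turn an options hash into command-line style flags.
def to_flags(options)
  options.flat_map do |key, value|
    short = LONG_TO_SHORT.fetch(key, key)        # accept either spelling
    flag = short == :six ? "-6" : "-#{short}"    # :six stands for the -6 flag
    case value
    when true       then [flag]                  # boolean switch, no argument
    when false, nil then []                      # switched off, emit nothing
    else [flag, value.to_s]                      # flag followed by its value
    end
  end
end

p to_flags(region: "Chr1:1000-2000", no_baq: true, Q: 30, six: false)
# => ["-r", "Chr1:1000-2000", "-B", "-Q", "30"]
```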
409
+
410
+ There is an 'experimental' function, `mpileup_plus`, that can return a
411
+ Bio::DB::Vcf object when g,u,e,h,I,L,o,p options are passed. The list
412
+ below shows the symbols you can use to invoke this behaviour:
413
+
414
+ - `:genotype_calling, :g`
415
+ - `:uncompressed_bcf , :u`
416
+ - `:extension_sequencing_probability, :e`
417
+ - `:homopolymer_error_coefficient, :h`
418
+ - `:no_indels, :I`
419
+ - `:skip_indel_over_average_depth, :L`
420
+ - `:gap_open_sequencing_error_probability,:o`
421
+ - `:platforms, :P`
422
+
423
+ Coverage Plots
+ --------------
424
+ You can create images that represent read coverage over binned regions of the reference sequence. The output format is svg. A number of parameters can be changed to alter the style of the plot. In the examples below the bin size and fill_color have been used to create plots with different colours and bar widths.
425
+
426
+ The following lines of code...
427
+
428
+ bam.plot_coverage("chr01", 201, 2000, :bin=>20, :svg => "out2.svg", :fill_color => '#F1A1B1')
429
+ bam.plot_coverage("chr_1", 201, 2000, :bin=>50, :svg => "out.svg", :fill_color => '#99CCFF')
430
+ bam.plot_coverage("chr01", 201, 1000, :bin=>250, :svg => "out3.svg", :fill_color => '#33AD5C')
431
+
432
+
433
+ ..create these plots:
434
+ ![coverage plot](images/out2.svg =700x "coverage plot")
435
+ ![coverage plot](images/out.svg =700x "coverage plot")
436
+ ![coverage plot](images/out3.svg =700x "coverage plot")
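The effect of the bin size can be seen on a small array: each bar of such a plot summarises one bin of consecutive positions. The sketch below averages made-up depths within each bin. Using the mean per bin is an assumption for illustration; `bin_means` is a hypothetical helper and `plot_coverage` may summarise bins differently.

```ruby
# Hypothetical helper: mean depth within each consecutive bin of positions.
# The final bin may be shorter than the others.
def bin_means(depths, bin)
  depths.each_slice(bin).map { |chunk| chunk.sum.to_f / chunk.size }
end

p bin_means([2, 4, 6, 8, 10, 12, 5], 3)
# => [4.0, 10.0, 5.0]
```

A larger bin gives fewer, wider bars and a smoother profile; a smaller bin keeps more local detail.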
437
+
438
+ Tests
439
+ -----
440
+
441
+ The easiest way to run the built-in unit tests is to change to the
+ bio-samtools source directory and run `rake test`.
443
+
444
+ Each test file tests different aspects of the code.
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-samtools
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.4
4
+ version: 2.0.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ricardo Ramirez-Gonzalez
@@ -10,7 +10,7 @@ authors:
10
10
  autorequire:
11
11
  bindir: bin
12
12
  cert_chain: []
13
- date: 2014-04-28 00:00:00.000000000 Z
13
+ date: 2014-05-31 00:00:00.000000000 Z
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: bio-svgenes
@@ -138,6 +138,20 @@ dependencies:
138
138
  - - ">="
139
139
  - !ruby/object:Gem::Version
140
140
  version: '0'
141
+ - !ruby/object:Gem::Dependency
142
+ name: ruby-prof
143
+ requirement: !ruby/object:Gem::Requirement
144
+ requirements:
145
+ - - ">="
146
+ - !ruby/object:Gem::Version
147
+ version: '0'
148
+ type: :development
149
+ prerelease: false
150
+ version_requirements: !ruby/object:Gem::Requirement
151
+ requirements:
152
+ - - ">="
153
+ - !ruby/object:Gem::Version
154
+ version: '0'
141
155
  - !ruby/object:Gem::Dependency
142
156
  name: rdoc
143
157
  requirement: !ruby/object:Gem::Requirement
@@ -250,12 +264,11 @@ files:
250
264
  - lib/bio/db/sam.rb
251
265
  - lib/bio/db/sam/external/COPYING
252
266
  - lib/bio/db/sam/external/VERSION
253
- - lib/bio/db/sam/faidx_old.rb
254
267
  - lib/bio/db/sam/library.rb
255
268
  - lib/bio/db/vcf.rb
256
269
  - test/.gitignore
257
270
  - test/helper.rb
258
- - test/old_test_basic.rb
271
+ - test/samples/.gitignore
259
272
  - test/samples/small/dupes.bam
260
273
  - test/samples/small/dupes.sam
261
274
  - test/samples/small/ids2.txt
@@ -289,12 +302,15 @@ files:
289
302
  - test/samples/small/test_cov.svg
290
303
  - test/samples/small/testu.bam
291
304
  - test/samples/small/testu.bam.bai
292
- - test/svg
293
305
  - test/test_bio-samtools.rb
294
306
  - test/test_pileup.rb
295
307
  - test/test_sam.rb
296
308
  - test/test_vcf.rb
309
+ - tutorial/images/out.svg
310
+ - tutorial/images/out2.svg
311
+ - tutorial/images/out3.svg
297
312
  - tutorial/tutorial.html
313
+ - tutorial/tutorial.md
298
314
  - tutorial/tutorial.pdf
299
315
  homepage: http://github.com/helios/bioruby-samtools
300
316
  licenses:
@@ -1,21 +0,0 @@
1
- #require 'rubygems'
2
- #require'ffi'
3
- #require 'bio/db/sam/bam'
4
- module Bio
5
- class DB
6
- module SAM
7
- module Tools
8
- extend FFI::Library
9
- #ffi_lib "#{File.join(File.expand_path(File.dirname(__FILE__)),'external','libbam.dylib')}"
10
- ffi_lib Bio::DB::SAM::Library.filename
11
-
12
- attach_function :fai_build, [ :string ], :int
13
- attach_function :fai_destroy, [ :pointer ], :void
14
- attach_function :fai_load, [ :string ], :pointer
15
- attach_function :fai_fetch, [ :pointer, :string, :pointer ], :string
16
- attach_function :faidx_fetch_nseq, [ :pointer ], :int
17
- attach_function :faidx_fetch_seq, [ :pointer, :string, :int, :int, :pointer ], :string
18
- end
19
- end
20
- end
21
- end