bio-maf 0.1.0 → 0.2.0
- data/.gitignore +53 -0
- data/DEVELOPMENT.md +29 -0
- data/Gemfile +1 -0
- data/README.md +69 -1
- data/Rakefile +4 -3
- data/bin/find_overlaps +21 -0
- data/bin/maf_tile +103 -0
- data/bio-maf.gemspec +43 -0
- data/features/gap-filling.feature +158 -0
- data/features/gap-removal.feature +50 -0
- data/features/step_definitions/gap-filling_steps.rb +32 -0
- data/features/step_definitions/gap_removal_steps.rb +19 -0
- data/features/step_definitions/parse_steps.rb +2 -1
- data/lib/bio/maf/index.rb +15 -8
- data/lib/bio/maf/maf.rb +267 -0
- data/lib/bio/maf/parser.rb +115 -175
- data/lib/bio/maf/tiler.rb +167 -0
- data/lib/bio/maf.rb +2 -0
- data/man/maf_tile.1 +108 -0
- data/man/maf_tile.1.ronn +104 -0
- data/spec/bio/maf/index_spec.rb +1 -0
- data/spec/bio/maf/parser_spec.rb +103 -0
- data/spec/bio/maf/tiler_spec.rb +69 -0
- data/test/data/gap-sp1.fa +6 -0
- data/test/data/mm8_chr7_tiny.kct +0 -0
- metadata +58 -3
data/.gitignore
ADDED
@@ -0,0 +1,53 @@
+# rcov generated
+coverage
+coverage.data
+
+# rdoc generated
+rdoc
+
+# yard generated
+doc
+.yardoc
+
+# bundler
+.bundle
+
+# jeweler generated
+pkg
+
+# Have editor/IDE/OS specific files you need to ignore? Consider using a global gitignore:
+#
+# * Create a file at ~/.gitignore
+# * Include files you want ignored
+# * Run: git config --global core.excludesfile ~/.gitignore
+#
+# After doing this, these files will be ignored in all your git projects,
+# saving you from having to 'pollute' every project you touch with them
+#
+# Not sure what to needs to be ignored for particular editors/OSes? Here's some ideas to get you started. (Remember, remove the leading # of the line)
+#
+# For MacOS:
+#
+#.DS_Store
+
+# For TextMate
+#*.tmproj
+#tmtags
+
+# For emacs:
+#*~
+#\#*
+#.\#*
+
+# For vim:
+#*.swp
+
+# For redcar:
+#.redcar
+
+# For rubinius:
+*.rbc
+.rbx
+# Ignore Gemfile.lock for gems. See http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
+Gemfile.lock
+
data/DEVELOPMENT.md
CHANGED
@@ -3,6 +3,35 @@
 Here are notes on less obvious aspects of the development process for
 this library.
 
+## Gem build / tagging / release
+
+This now uses [rubygems-tasks][] for building and releasing gems.
+
+[rubygems-tasks]: https://github.com/postmodern/rubygems-tasks
+
+We build two gem platform variants: a 'default' one for MRI with no
+platform set, and a JRuby one with `platform = 'java'`. These get
+built as `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`. At least
+for now, this is done by running `gem release` separately under JRuby
+and MRI. SCM tagging and pushing is done under MRI only, but the gems
+will be built and pushed to rubygems.org separately under each
+platform.
+
+The version is simply set by hand in `bio-maf.gemspec`. Don't forget
+to increment it!
+
+Testing the build:
+
+    $ rake build
+    $ rake install
+
+Release:
+
+    $ rvm use 1.9.3@bioruby-maf
+    $ rake release
+    $ rvm use jruby-1.6.7.2@bioruby-maf
+    $ rake release
+
 ## kyotocabinet-java
 
 Running `bio-maf` on JRuby requires the [kyotocabinet-java][] gem, a
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -47,8 +47,29 @@ problems building or using this gem, which is still fairly new.
 
 ## Installation
 
+`bio-maf` is now published as a Ruby [gem](https://rubygems.org/gems/bio-maf).
+
     $ gem install bio-maf
 
+## Performance
+
+This parser performs best under [JRuby][], particularly with Java
+7. See the [Performance][] wiki page for more information. For best
+results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
+
+[JRuby]: http://jruby.org/
+[Performance]: https://github.com/csw/bioruby-maf/wiki/Performance
+
+    $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
+
+Many parsing modes are multithreaded. Under JRuby, it will default to
+using one parser thread per available core, but if desired this can be
+configured with the `:threads` parser option.
+
+Ruby 1.9.3 is fully supported, but does not perform as well,
+especially since its concurrency features are not useful for this
+workload.
+
 ## Usage
 
 ### Create an index on a MAF file
@@ -162,6 +183,47 @@ Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/d
     # @size=1601, @strand=:+, @src_size=50103, @text=nil,
     # @status="I">
 
+### Remove gaps from parsed blocks
+
+After filtering out species with
+[`Parser#sequence_filter`](#filter-species-returned-in-alignment-blocks),
+gaps may be left where there was an insertion present only in
+sequences that were filtered out. Such gaps can be removed by setting
+the `:remove_gaps` parser option:
+
+    require 'bio-maf'
+    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                             :remove_gaps => true)
+
+### Tile blocks together over an interval
+
+Extracts alignment blocks overlapping the given genomic interval and
+constructs a single alignment block covering the entire interval for
+the specified species. Optionally, any gaps in coverage of the MAF
+file's reference sequence can be filled in from a FASTA sequence
+file. See the Cucumber [feature][] for examples of output, and also
+the
+[`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+man page.
+
+[feature]: https://github.com/csw/bioruby-maf/blob/master/features/gap-filling.feature
+
+    require 'bio-maf'
+    tiler = Bio::MAF::Tiler.new
+    tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+    tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+    # optional
+    tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
+    tiler.species = %w(mm8 rn4 hg18)
+    tiler.species_map = {
+      'mm8' => 'mouse',
+      'rn4' => 'rat',
+      'hg18' => 'human'
+    }
+    tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                                     80082334,
+                                                     80082468)
+    tiler.write_fasta($stdout)
 
 ### Command line tools
 
@@ -169,6 +231,12 @@ Man pages for command line tools:
 
 * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
 * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
+* [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+
+With [gem-man](https://github.com/defunkt/gem-man) installed, these
+can be read with:
+
+    $ gem man bio-maf
 
 ### Other documentation
 
@@ -201,7 +269,7 @@ If you use this software, please cite one of
 
 ## Biogems.info
 
-This Biogem
+This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-maf).
 
 ## Copyright
 
data/Rakefile
CHANGED
@@ -10,10 +10,11 @@ rescue Bundler::BundlerError => e
   exit e.status_code
 end
 require 'rake'
-require 'rubygems/package_task'
 
-
-
+require 'rubygems/tasks'
+# we only want to do the SCM tag/push stuff once, on MRI
+use_scm = (RUBY_PLATFORM != 'java')
+Gem::Tasks.new(:scm => {:tag => use_scm, :push => use_scm})
 
 require 'rspec/core'
 require 'rspec/core/rake_task'
data/bin/find_overlaps
ADDED
@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+
+require 'bio-maf'
+
+parser = Bio::MAF::Parser.new(ARGV.shift, :threads => 4)
+
+def desc(seq)
+  "#{seq.source}:#{seq.start}-#{seq.end}"
+end
+
+open = []
+parser.parse_blocks.each do |block|
+  start_pos = block.ref_seq.start
+  open.delete_if { |open_b| open_b.ref_seq.end <= start_pos }
+  open.each do |ovl|
+    ref_a = ovl.ref_seq
+    ref_b = block.ref_seq
+    puts "#{desc(ref_a)} overlaps #{desc(ref_b)}"
+  end
+  open << block
+end
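Review note: the scan in `bin/find_overlaps` appears to rely on blocks arriving in order of reference start position. It keeps a list of still-open blocks, drops those ending at or before the current start, and reports whatever remains as overlapping. The following is a minimal standalone sketch of that sweep, not the gem's code: `Span` and `overlapping_pairs` are hypothetical names standing in for `block.ref_seq` and the script's loop, and the end coordinate is assumed to be exclusive.

```ruby
# Hypothetical stand-in for a block's reference sequence; `stop` is exclusive.
Span = Struct.new(:source, :start, :stop)

# Returns every pair of overlapping spans.
# Assumes `spans` is already sorted by start position.
def overlapping_pairs(spans)
  open = []   # spans that may still overlap later ones
  pairs = []
  spans.each do |span|
    # Anything ending at or before this start can no longer overlap.
    open.delete_if { |o| o.stop <= span.start }
    open.each { |o| pairs << [o, span] }
    open << span
  end
  pairs
end

spans = [
  Span.new('sp1.chr1', 10, 23),
  Span.new('sp1.chr1', 20, 30),
  Span.new('sp1.chr1', 30, 40)
]
overlapping_pairs(spans).each do |a, b|
  puts "#{a.source}:#{a.start}-#{a.stop} overlaps #{b.source}:#{b.start}-#{b.stop}"
end
```

With unsorted input the sweep would miss overlaps, so a caller working from arbitrary data would probably want to sort by start first.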
data/bin/maf_tile
ADDED
@@ -0,0 +1,103 @@
+#!/usr/bin/env ruby
+
+require 'optparse'
+require 'ostruct'
+
+require 'bio-maf'
+require 'bio-genomic-interval'
+
+options = OpenStruct.new
+options.p = { :threads => 1 }
+options.species = []
+options.species_map = {}
+options.usage = false
+
+o_parser = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_tile [options] <maf> <index>"
+  opts.separator ""
+  opts.separator "Options:"
+  opts.on("-r", "--reference SEQ", "FASTA reference sequence") do |ref|
+    options.ref = ref
+  end
+  opts.on("-i", "--interval BEGIN:END", "Genomic interval, zero-based") do |int|
+    if int =~ /(\d+):(\d+)/
+      options.interval = ($1.to_i)...($2.to_i)
+    else
+      options.usage = true
+    end
+  end
+  opts.on("-s", "--species SPECIES[:NAME]", "Species to use (with mapped name)") do |sp|
+    if sp =~ /:/
+      species, mapped = sp.split(/:/)
+      options.species << species
+      options.species_map[species] = mapped
+    else
+      options.species << sp
+    end
+  end
+  opts.on("-o", "--output-base BASE", "Base name for output files",
+          "Use stdout for a single interval if not given") do |base|
+    options.output_base = base
+  end
+  opts.on("--bed BED", "BED file specifying intervals",
+          "(requires --output-base)") do |bed|
+    options.bed = bed
+  end
+end
+
+o_parser.parse!(ARGV)
+
+maf_p = ARGV.shift
+index_p = ARGV.shift
+
+unless (! options.usage) \
+  && maf_p && index_p && (! options.species.empty?) \
+  && (options.output_base ? options.bed : options.interval)
+  $stderr.puts o_parser
+  exit 2
+end
+
+tiler = Bio::MAF::Tiler.new
+tiler.index = Bio::MAF::KyotoIndex.open(index_p)
+tiler.parser = Bio::MAF::Parser.new(maf_p, options.p)
+tiler.reference = Bio::MAF::FASTARangeReader.new(options.ref) if options.ref
+tiler.species = options.species
+tiler.species_map = options.species_map
+
+def parse_interval(line)
+  src, r_start_s, r_end_s, _ = line.split(nil, 4)
+  r_start = r_start_s.to_i
+  r_end = r_end_s.to_i
+  return Bio::GenomicInterval.zero_based(src, r_start, r_end)
+end
+
+def target_for(base, interval)
+  path = "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
+  File.open(path, 'w')
+end
+
+if options.bed
+  intervals = []
+  File.open(options.bed) do |bed_f|
+    bed_f.each_line { |line| intervals << parse_interval(line) }
+  end
+  intervals.sort_by! { |int| int.zero_start }
+  intervals.each do |int|
+    tiler.interval = int
+    target = target_for(options.output_base, int)
+    tiler.write_fasta(target)
+    target.close
+  end
+else
+  # single interval
+  tiler.interval = Bio::GenomicInterval.zero_based(tiler.index.ref_seq,
+                                                   options.interval.begin,
+                                                   options.interval.end)
+  if options.output_base
+    target = target_for(options.output_base, tiler.interval)
+  else
+    target = $stdout
+  end
+  tiler.write_fasta(target)
+  target.close
+end
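Review note: the `-s SPECIES[:NAME]` and `-i BEGIN:END` handlers in `bin/maf_tile` can be exercised without the gem. The sketch below mirrors their parsing under stated assumptions and is not the script's code: `parse_species` and `parse_interval_option` are hypothetical helper names. Unlike the script's unanchored pattern, the interval regex here is anchored, so trailing junk is rejected.

```ruby
# Splits SPECIES[:NAME] specs into an ordered species list and a name map.
# Illustrative helper, not part of bio-maf.
def parse_species(specs)
  species = []
  species_map = {}
  specs.each do |spec|
    name, mapped = spec.split(':', 2)
    species << name
    species_map[name] = mapped if mapped
  end
  [species, species_map]
end

# Parses "BEGIN:END" into an end-exclusive Range of zero-based positions.
# Returns nil when the text is malformed.
def parse_interval_option(text)
  m = /\A(\d+):(\d+)\z/.match(text)
  m && (m[1].to_i...m[2].to_i)
end

species, species_map = parse_species(%w(mm8:mouse rn4 hg18:human))
p species
p species_map
p parse_interval_option('80082334:80082468')
```

Returning `nil` for a malformed interval lets the caller decide whether to print usage, which is roughly what the script does by setting `options.usage`.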
data/bio-maf.gemspec
ADDED
@@ -0,0 +1,43 @@
+# -*- encoding: utf-8 -*-
+
+Gem::Specification.new do |s|
+  s.name = "bio-maf"
+  s.version = "0.2.0"
+
+  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+  s.authors = ["Clayton Wheeler"]
+  s.date = "2012-06-29"
+  s.description = "Multiple Alignment Format parser for BioRuby."
+  s.email = "cswh@umich.edu"
+  s.executables = ["maf_count", "maf_dump_blocks", "maf_extract_ranges_count", "maf_index", "maf_parse_bench", "maf_to_fasta", "maf_write", "random_ranges"]
+  s.extra_rdoc_files = [
+    "LICENSE.txt",
+    "README.md"
+  ]
+  s.files = `git ls-files`.split("\n")
+  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables = `git ls-files -- bin/*`.split("\n").map {
+    |f| File.basename(f)
+  }
+
+  s.homepage = "http://github.com/csw/bioruby-maf"
+  s.licenses = ["MIT"]
+  s.require_paths = ["lib"]
+  s.rubygems_version = "1.8.24"
+  s.summary = "MAF parser for BioRuby"
+
+  s.specification_version = 3
+
+  if RUBY_PLATFORM == 'java'
+    s.platform = 'java'
+  end
+
+  s.add_runtime_dependency('bio-bigbio', [">= 0"])
+  s.add_runtime_dependency('bio-genomic-interval', ["~> 0.1.2"])
+  if RUBY_PLATFORM == 'java'
+    s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
+  else
+    s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
+  end
+
+end
data/features/gap-filling.feature
ADDED
@@ -0,0 +1,158 @@
+Feature: Join alignment blocks with reference data
+  In order to produce FASTA output with one sequence per species
+  For use in downstream tools
+  We need to join adjacent MAF blocks together
+  And fill gaps in the reference sequence from reference data
+
+  Scenario: Non-overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >sp3
+    **********AGGTTTAGGGCAGAG***************************
+    """
+
+  Scenario: Non-overlapping MAF blocks with species map
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And map species sp1 as mouse
+    And map species sp2 as hippo
+    And map species sp3 as squid
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >mouse
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >hippo
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >squid
+    **********AGGTTTAGGGCAGAG***************************
+    """
+
+  Scenario: Subset of non-overlapping MAF blocks in region
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:12-36 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    GCTGAGGGC--AGTTGTGTCAGGGCG
+    >sp2
+    GCTGACGGC--AG*******AGGGCG
+    >sp3
+    GTTTAGGGCAGAG*************
+    """
+  Scenario: Overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 20 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53020 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGAGGGCGGTGC********************
+    >sp3
+    **********AGGTTTAGGG******************************
+    """
+
+
+
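Review note: the first scenario above can be reproduced with a small standalone sketch, which may help when reading the expected FASTA output. This is not the `Bio::MAF::Tiler` implementation. `Block` and `tile_blocks` are hypothetical names, species are keyed by bare name rather than by `species.chromosome`, and the sketch only handles blocks that lie fully inside the interval and do not overlap each other. Stretches with no block are filled from the reference for the reference species and with `*` for the others.

```ruby
# Hypothetical block record: reference start and size, plus aligned text per species.
Block = Struct.new(:ref_start, :ref_size, :texts)

# Tiles non-overlapping blocks over `range` (end-exclusive) and returns a
# hash of species => sequence text.
# Simplification: blocks must lie inside `range` and must not overlap.
def tile_blocks(range, blocks, species, ref_species, reference)
  out = {}
  species.each { |sp| out[sp] = ''.dup }
  pos = range.begin
  # Fill [pos, upto) from the reference for ref_species, '*' for the others.
  fill = lambda do |upto|
    len = upto - pos
    next if len <= 0
    species.each do |sp|
      out[sp] << (sp == ref_species ? reference[pos, len] : '*' * len)
    end
  end
  blocks.sort_by(&:ref_start).each do |b|
    fill.call(b.ref_start)
    width = b.texts.values.first.size
    # Species absent from a block get '*' across the block's full width.
    species.each { |sp| out[sp] << (b.texts[sp] || '*' * width) }
    pos = b.ref_start + b.ref_size
  end
  fill.call(range.end)
  out
end

reference = %w(CCAGGATGCT GGGCTGAGGG CAGTTGTGTC AGGGCGGTCC GGTGCAGGCA).join
blocks = [
  Block.new(10, 13, { 'sp1' => 'GGGCTGAGGGC--AG',
                      'sp2' => 'GGGCTGACGGC--AG',
                      'sp3' => 'AGGTTTAGGGCAGAG' }),
  Block.new(30, 10, { 'sp1' => 'AGGGCGGTCC', 'sp2' => 'AGGGCGGTGC' })
]
tile_blocks(0...50, blocks, %w(sp1 sp2 sp3), 'sp1', reference).each do |sp, text|
  puts ">#{sp}", text
end
```

The overlapping-blocks scenario and partial-block trimming (the `12-36` scenario) are deliberately left out of this sketch, since they need per-column bookkeeping of the reference position.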
data/features/gap-removal.feature
ADDED
@@ -0,0 +1,50 @@
+Feature: Remove gaps from MAF files
+  In order to work with only the alignment data involving sequences
+  Which can be used by downstream software
+  We may want to filter out certain species
+  Which can leave gap regions where sequence data was only present
+  For removed species
+  So it is useful to be able to remove those gaps
+
+  Background:
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=10542.0
+    s mm8.chr7 80082334 34 + 145134094 GGGCTGAGGGC--AGGGATGG---AGGGCGGTCC--------------CAGCA-
+    s rn4.chr1 136011785 34 + 267910886 GGGCTGAGGGC--AGGGACGG---AGGGCGGTCC--------------CAGCA-
+    s oryCun1.scaffold_199771 14021 43 - 75077 -----ATGGGC--AAGCGTGG---AGGGGAACCTCTCCTCCCCTCCGACAAAG-
+    s hg18.chr15 88557580 27 + 100338915 --------GGC--AAGTGTGGA--AGGGAAGCCC--------------CAGAA-
+    s panTro2.chr15 87959837 27 + 100063422 --------GGC--AAGTGTGGA--AGGGAAGCCC--------------CAGAA-
+    s rheMac2.chr7 69864714 28 + 169801366 -------GGGC--AAGTATGGA--AGGGAAGCCC--------------CAGAA-
+    s canFam2.chr3 56030570 39 + 94715083 AGGTTTAGGGCAGAGGGATGAAGGAGGAGAATCC--------------CTATG-
+    s dasNov1.scaffold_106893 7435 34 + 9831 GGAACGAGGGC--ATGTGTGG---AGGGGGCTGC--------------CCACA-
+    s loxAfr1.scaffold_8298 30264 38 + 78952 ATGATGAGGGG--AAGCGTGGAGGAGGGGAACCC--------------CTAGGA
+    s echTel1.scaffold_304651 594 37 - 10007 -TGCTATGGCT--TTGTGTCTAGGAGGGGAATCC--------------CCAGGA
+    """
+    When I open it with a MAF reader
+    And filter for only the species
+      | mm8 |
+      | rn4 |
+      | hg18 |
+      | canFam2 |
+      | loxAfr1 |
+
+  Scenario: Detect filtered blocks
+    When an alignment block can be obtained
+    Then the alignment block is marked as filtered
+    And the alignment block has 5 sequences
+
+  Scenario: Detect gaps
+    When an alignment block can be obtained
+    Then 1 gap is found with length [14]
+
+  Scenario: Remove gaps
+    When an alignment block can be obtained
+    And gaps are removed
+    Then the text size of the block is 40
+
+  Scenario: Remove gaps in the parser
+    When I enable the :remove_gaps parser option
+    And an alignment block can be obtained
+    Then the text size of the block is 40
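Review note: the numbers in this feature (one gap of length 14, text size 40 after removal) follow from treating a "gap" as a run of alignment columns in which every remaining sequence has `-`. The sketch below illustrates that idea on plain strings. It is an assumption about the approach, not the gem's `find_gaps` or `remove_gaps!` code, and the `[start, length]` pair layout is only inferred from the step definition reading `gap[1]` as the size.

```ruby
# Returns [start, length] pairs for each run of columns where every text
# has '-'. Assumes all texts have the same length. Illustrative helper only.
def find_gaps(texts)
  width = texts.first.size
  gaps = []
  start = nil
  (0...width).each do |col|
    if texts.all? { |t| t[col] == '-' }
      start ||= col
    elsif start
      gaps << [start, col - start]
      start = nil
    end
  end
  gaps << [start, width - start] if start
  gaps
end

# Returns copies of the texts with all-gap columns removed.
def remove_gaps(texts)
  gaps = find_gaps(texts)
  texts.map do |t|
    out = t.dup
    # Delete from the right so earlier offsets stay valid.
    gaps.reverse_each { |g_start, g_len| out.slice!(g_start, g_len) }
    out
  end
end

texts = ['AC--GT', 'A---GT', 'AG--G-']
p find_gaps(texts)    # prints [[2, 2]]
p remove_gaps(texts)  # prints ["ACGT", "A-GT", "AGG-"]
```

Columns where only some sequences have `-` are ordinary alignment gaps and are left alone. Only columns that became empty through species filtering are dropped.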
data/features/step_definitions/gap-filling_steps.rb
ADDED
@@ -0,0 +1,32 @@
+Given /^chromosome reference sequence:$/ do |string|
+  sio = StringIO.new(string)
+  @refseq = Bio::MAF::FASTARangeReader.new(sio)
+end
+
+When /^tile ([^:\s]+):(\d+)-(\d+)( with the chromosome reference)?$/ do |seq, i_start, i_end, ref_p|
+  @tiler = Bio::MAF::Tiler.new
+  @tiler.index = @idx
+  @tiler.parser = @parser
+  @tiler.reference = @refseq if ref_p
+  @tiler.interval = Bio::GenomicInterval.zero_based(seq,
+                                                    i_start.to_i,
+                                                    i_end.to_i)
+end
+
+When /^tile with species \[(.+?)\]$/ do |species_text|
+  @tiler.species = species_text.split(/,\s*/)
+end
+
+When /^map species (\S+) as (\S+)$/ do |sp1, sp2|
+  @tiler.species_map[sp1] = sp2
+end
+
+When /^write the tiled data as FASTA$/ do
+  @dst = Tempfile.new(["cuke", ".fa"])
+  @tiler.write_fasta(@dst)
+end
+
+Then /^the FASTA data obtained should be:$/ do |string|
+  @dst.seek(0)
+  @dst.read.rstrip.should == string.rstrip
+end
data/features/step_definitions/gap_removal_steps.rb
ADDED
@@ -0,0 +1,19 @@
+Then /^the alignment block is marked as filtered$/ do
+  @block.filtered?.should be_true
+end
+
+Then /^(\d+) gaps? (?:is|are) found with length \[(\d+)\]$/ do |n_gaps, gap_sizes_s|
+  gaps = @block.find_gaps
+  gaps.size.should == n_gaps.to_i
+  e_gap_sizes = gap_sizes_s.split(/,\s*/).collect { |n| n.to_i }
+  gap_sizes = gaps.collect { |gap| gap[1] }
+  gap_sizes.should == e_gap_sizes
+end
+
+When /^gaps are removed$/ do
+  @block.remove_gaps!
+end
+
+Then /^the text size of the block is (\d+)$/ do |e_text_size|
+  @block.text_size.should == e_text_size.to_i
+end