bio-maf 0.1.0 → 0.2.0
- data/.gitignore +53 -0
- data/DEVELOPMENT.md +29 -0
- data/Gemfile +1 -0
- data/README.md +69 -1
- data/Rakefile +4 -3
- data/bin/find_overlaps +21 -0
- data/bin/maf_tile +103 -0
- data/bio-maf.gemspec +43 -0
- data/features/gap-filling.feature +158 -0
- data/features/gap-removal.feature +50 -0
- data/features/step_definitions/gap-filling_steps.rb +32 -0
- data/features/step_definitions/gap_removal_steps.rb +19 -0
- data/features/step_definitions/parse_steps.rb +2 -1
- data/lib/bio/maf/index.rb +15 -8
- data/lib/bio/maf/maf.rb +267 -0
- data/lib/bio/maf/parser.rb +115 -175
- data/lib/bio/maf/tiler.rb +167 -0
- data/lib/bio/maf.rb +2 -0
- data/man/maf_tile.1 +108 -0
- data/man/maf_tile.1.ronn +104 -0
- data/spec/bio/maf/index_spec.rb +1 -0
- data/spec/bio/maf/parser_spec.rb +103 -0
- data/spec/bio/maf/tiler_spec.rb +69 -0
- data/test/data/gap-sp1.fa +6 -0
- data/test/data/mm8_chr7_tiny.kct +0 -0
- metadata +58 -3
data/.gitignore
ADDED
@@ -0,0 +1,53 @@
|
|
1
|
+
# rcov generated
|
2
|
+
coverage
|
3
|
+
coverage.data
|
4
|
+
|
5
|
+
# rdoc generated
|
6
|
+
rdoc
|
7
|
+
|
8
|
+
# yard generated
|
9
|
+
doc
|
10
|
+
.yardoc
|
11
|
+
|
12
|
+
# bundler
|
13
|
+
.bundle
|
14
|
+
|
15
|
+
# jeweler generated
|
16
|
+
pkg
|
17
|
+
|
18
|
+
# Have editor/IDE/OS specific files you need to ignore? Consider using a global gitignore:
|
19
|
+
#
|
20
|
+
# * Create a file at ~/.gitignore
|
21
|
+
# * Include files you want ignored
|
22
|
+
# * Run: git config --global core.excludesfile ~/.gitignore
|
23
|
+
#
|
24
|
+
# After doing this, these files will be ignored in all your git projects,
|
25
|
+
# saving you from having to 'pollute' every project you touch with them
|
26
|
+
#
|
27
|
+
# Not sure what to needs to be ignored for particular editors/OSes? Here's some ideas to get you started. (Remember, remove the leading # of the line)
|
28
|
+
#
|
29
|
+
# For MacOS:
|
30
|
+
#
|
31
|
+
#.DS_Store
|
32
|
+
|
33
|
+
# For TextMate
|
34
|
+
#*.tmproj
|
35
|
+
#tmtags
|
36
|
+
|
37
|
+
# For emacs:
|
38
|
+
#*~
|
39
|
+
#\#*
|
40
|
+
#.\#*
|
41
|
+
|
42
|
+
# For vim:
|
43
|
+
#*.swp
|
44
|
+
|
45
|
+
# For redcar:
|
46
|
+
#.redcar
|
47
|
+
|
48
|
+
# For rubinius:
|
49
|
+
*.rbc
|
50
|
+
.rbx
|
51
|
+
# Ignore Gemfile.lock for gems. See http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
|
52
|
+
Gemfile.lock
|
53
|
+
|
data/DEVELOPMENT.md
CHANGED
@@ -3,6 +3,35 @@
|
|
3
3
|
Here are notes on less obvious aspects of the development process for
|
4
4
|
this library.
|
5
5
|
|
6
|
+
## Gem build / tagging / release
|
7
|
+
|
8
|
+
This now uses [rubygems-tasks][] for building and releasing gems.
|
9
|
+
|
10
|
+
[rubygems-tasks]: https://github.com/postmodern/rubygems-tasks
|
11
|
+
|
12
|
+
We build two gem platform variants: a 'default' one for MRI with no
|
13
|
+
platform set, and a JRuby one with `platform = 'java'`. These get
|
14
|
+
built as `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`. At least
|
15
|
+
for now, this is done by running `gem release` separately under JRuby
|
16
|
+
and MRI. SCM tagging and pushing is done under MRI only, but the gems
|
17
|
+
will be built and pushed to rubygems.org separately under each
|
18
|
+
platform.
|
19
|
+
|
20
|
+
The version is simply set by hand in `bio-maf.gemspec`. Don't forget
|
21
|
+
to increment it!
|
22
|
+
|
23
|
+
Testing the build:
|
24
|
+
|
25
|
+
$ rake build
|
26
|
+
$ rake install
|
27
|
+
|
28
|
+
Release:
|
29
|
+
|
30
|
+
$ rvm use 1.9.3@bioruby-maf
|
31
|
+
$ rake release
|
32
|
+
$ rvm use jruby-1.6.7.2@bioruby-maf
|
33
|
+
$ rake release
|
34
|
+
|
6
35
|
## kyotocabinet-java
|
7
36
|
|
8
37
|
Running `bio-maf` on JRuby requires the [kyotocabinet-java][] gem, a
|
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -47,8 +47,29 @@ problems building or using this gem, which is still fairly new.
|
|
47
47
|
|
48
48
|
## Installation
|
49
49
|
|
50
|
+
`bio-maf` is now published as a Ruby [gem](https://rubygems.org/gems/bio-maf).
|
51
|
+
|
50
52
|
$ gem install bio-maf
|
51
53
|
|
54
|
+
## Performance
|
55
|
+
|
56
|
+
This parser performs best under [JRuby][], particularly with Java
|
57
|
+
7. See the [Performance][] wiki page for more information. For best
|
58
|
+
results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
|
59
|
+
|
60
|
+
[JRuby]: http://jruby.org/
|
61
|
+
[Performance]: https://github.com/csw/bioruby-maf/wiki/Performance
|
62
|
+
|
63
|
+
$ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
|
64
|
+
|
65
|
+
Many parsing modes are multithreaded. Under JRuby, it will default to
|
66
|
+
using one parser thread per available core, but if desired this can be
|
67
|
+
configured with the `:threads` parser option.
|
68
|
+
|
69
|
+
Ruby 1.9.3 is fully supported, but does not perform as well,
|
70
|
+
especially since its concurrency features are not useful for this
|
71
|
+
workload.
|
72
|
+
|
52
73
|
## Usage
|
53
74
|
|
54
75
|
### Create an index on a MAF file
|
@@ -162,6 +183,47 @@ Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/d
|
|
162
183
|
# @size=1601, @strand=:+, @src_size=50103, @text=nil,
|
163
184
|
# @status="I">
|
164
185
|
|
186
|
+
### Remove gaps from parsed blocks
|
187
|
+
|
188
|
+
After filtering out species with
|
189
|
+
[`Parser#sequence_filter`](#filter-species-returned-in-alignment-blocks),
|
190
|
+
gaps may be left where there was an insertion present only in
|
191
|
+
sequences that were filtered out. Such gaps can be removed by setting
|
192
|
+
the `:remove_gaps` parser option:
|
193
|
+
|
194
|
+
require 'bio-maf'
|
195
|
+
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
|
196
|
+
:remove_gaps => true)
|
197
|
+
|
198
|
+
### Tile blocks together over an interval
|
199
|
+
|
200
|
+
Extracts alignment blocks overlapping the given genomic interval and
|
201
|
+
constructs a single alignment block covering the entire interval for
|
202
|
+
the specified species. Optionally, any gaps in coverage of the MAF
|
203
|
+
file's reference sequence can be filled in from a FASTA sequence
|
204
|
+
file. See the Cucumber [feature][] for examples of output, and also
|
205
|
+
the
|
206
|
+
[`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
|
207
|
+
man page.
|
208
|
+
|
209
|
+
[feature]: https://github.com/csw/bioruby-maf/blob/master/features/gap-filling.feature
|
210
|
+
|
211
|
+
require 'bio-maf'
|
212
|
+
tiler = Bio::MAF::Tiler.new
|
213
|
+
tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
|
214
|
+
tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
|
215
|
+
# optional
|
216
|
+
tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
|
217
|
+
tiler.species = %w(mm8 rn4 hg18)
|
218
|
+
tiler.species_map = {
|
219
|
+
'mm8' => 'mouse',
|
220
|
+
'rn4' => 'rat',
|
221
|
+
'hg18' => 'human'
|
222
|
+
}
|
223
|
+
tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
|
224
|
+
80082334,
|
225
|
+
80082468)
|
226
|
+
tiler.write_fasta($stdout)
|
165
227
|
|
166
228
|
### Command line tools
|
167
229
|
|
@@ -169,6 +231,12 @@ Man pages for command line tools:
|
|
169
231
|
|
170
232
|
* [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
|
171
233
|
* [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
|
234
|
+
* [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
|
235
|
+
|
236
|
+
With [gem-man](https://github.com/defunkt/gem-man) installed, these
|
237
|
+
can be read with:
|
238
|
+
|
239
|
+
$ gem man bio-maf
|
172
240
|
|
173
241
|
### Other documentation
|
174
242
|
|
@@ -201,7 +269,7 @@ If you use this software, please cite one of
|
|
201
269
|
|
202
270
|
## Biogems.info
|
203
271
|
|
204
|
-
This Biogem
|
272
|
+
This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-maf).
|
205
273
|
|
206
274
|
## Copyright
|
207
275
|
|
data/Rakefile
CHANGED
@@ -10,10 +10,11 @@ rescue Bundler::BundlerError => e
|
|
10
10
|
exit e.status_code
|
11
11
|
end
|
12
12
|
require 'rake'
|
13
|
-
require 'rubygems/package_task'
|
14
13
|
|
15
|
-
|
16
|
-
|
14
|
+
require 'rubygems/tasks'
|
15
|
+
# we only want to do the SCM tag/push stuff once, on MRI
|
16
|
+
use_scm = (RUBY_PLATFORM != 'java')
|
17
|
+
Gem::Tasks.new(:scm => {:tag => use_scm, :push => use_scm})
|
17
18
|
|
18
19
|
require 'rspec/core'
|
19
20
|
require 'rspec/core/rake_task'
|
data/bin/find_overlaps
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
require 'bio-maf'
|
4
|
+
|
5
|
+
parser = Bio::MAF::Parser.new(ARGV.shift, :threads => 4)
|
6
|
+
|
7
|
+
def desc(seq)
|
8
|
+
"#{seq.source}:#{seq.start}-#{seq.end}"
|
9
|
+
end
|
10
|
+
|
11
|
+
open = []
|
12
|
+
parser.parse_blocks.each do |block|
|
13
|
+
start_pos = block.ref_seq.start
|
14
|
+
open.delete_if { |open_b| open_b.ref_seq.end <= start_pos }
|
15
|
+
open.each do |ovl|
|
16
|
+
ref_a = ovl.ref_seq
|
17
|
+
ref_b = block.ref_seq
|
18
|
+
puts "#{desc(ref_a)} overlaps #{desc(ref_b)}"
|
19
|
+
end
|
20
|
+
open << block
|
21
|
+
end
|
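Note on `bin/find_overlaps`: the script appears to rely on blocks arriving sorted by reference start position. It keeps a list of still-open blocks, drops those that end at or before the next block's start, and reports whatever remains as overlapping. A minimal standalone sketch of that sweep follows. The `Span` struct, its `stop` field, and `find_overlapping_pairs` are hypothetical stand-ins for illustration, not part of bio-maf.

```ruby
# Hypothetical stand-in for a reference sequence entry; `stop` is the
# exclusive end coordinate (the script itself calls `seq.end`).
Span = Struct.new(:source, :start, :stop)

# Sweep over spans sorted by start, returning every overlapping pair.
# Assumes the input is already sorted by start, as the script seems to.
def find_overlapping_pairs(spans)
  open = []   # spans that may still overlap later ones
  pairs = []
  spans.each do |span|
    # Anything ending at or before this start can no longer overlap.
    open.delete_if { |o| o.stop <= span.start }
    open.each { |o| pairs << [o, span] }
    open << span
  end
  pairs
end

SPANS = [
  Span.new('sp1.chr1', 0, 10),
  Span.new('sp1.chr1', 5, 15),
  Span.new('sp1.chr1', 20, 30)
]

find_overlapping_pairs(SPANS).each do |a, b|
  puts "#{a.source}:#{a.start}-#{a.stop} overlaps #{b.source}:#{b.start}-#{b.stop}"
end
```

Spans that merely touch (one ends where the next starts) are not reported, which matches the `<=` comparison in the script.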
data/bin/maf_tile
ADDED
@@ -0,0 +1,103 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
require 'optparse'
|
4
|
+
require 'ostruct'
|
5
|
+
|
6
|
+
require 'bio-maf'
|
7
|
+
require 'bio-genomic-interval'
|
8
|
+
|
9
|
+
options = OpenStruct.new
|
10
|
+
options.p = { :threads => 1 }
|
11
|
+
options.species = []
|
12
|
+
options.species_map = {}
|
13
|
+
options.usage = false
|
14
|
+
|
15
|
+
o_parser = OptionParser.new do |opts|
|
16
|
+
opts.banner = "Usage: maf_tile [options] <maf> <index>"
|
17
|
+
opts.separator ""
|
18
|
+
opts.separator "Options:"
|
19
|
+
opts.on("-r", "--reference SEQ", "FASTA reference sequence") do |ref|
|
20
|
+
options.ref = ref
|
21
|
+
end
|
22
|
+
opts.on("-i", "--interval BEGIN:END", "Genomic interval, zero-based") do |int|
|
23
|
+
if int =~ /(\d+):(\d+)/
|
24
|
+
options.interval = ($1.to_i)...($2.to_i)
|
25
|
+
else
|
26
|
+
options.usage = true
|
27
|
+
end
|
28
|
+
end
|
29
|
+
opts.on("-s", "--species SPECIES[:NAME]", "Species to use (with mapped name)") do |sp|
|
30
|
+
if sp =~ /:/
|
31
|
+
species, mapped = sp.split(/:/)
|
32
|
+
options.species << species
|
33
|
+
options.species_map[species] = mapped
|
34
|
+
else
|
35
|
+
options.species << sp
|
36
|
+
end
|
37
|
+
end
|
38
|
+
opts.on("-o", "--output-base BASE", "Base name for output files",
|
39
|
+
"Use stdout for a single interval if not given") do |base|
|
40
|
+
options.output_base = base
|
41
|
+
end
|
42
|
+
opts.on("--bed BED", "BED file specifying intervals",
|
43
|
+
"(requires --output-base)") do |bed|
|
44
|
+
options.bed = bed
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
o_parser.parse!(ARGV)
|
49
|
+
|
50
|
+
maf_p = ARGV.shift
|
51
|
+
index_p = ARGV.shift
|
52
|
+
|
53
|
+
unless (! options.usage) \
|
54
|
+
&& maf_p && index_p && (! options.species.empty?) \
|
55
|
+
&& (options.output_base ? options.bed : options.interval)
|
56
|
+
$stderr.puts o_parser
|
57
|
+
exit 2
|
58
|
+
end
|
59
|
+
|
60
|
+
tiler = Bio::MAF::Tiler.new
|
61
|
+
tiler.index = Bio::MAF::KyotoIndex.open(index_p)
|
62
|
+
tiler.parser = Bio::MAF::Parser.new(maf_p, options.p)
|
63
|
+
tiler.reference = Bio::MAF::FASTARangeReader.new(options.ref) if options.ref
|
64
|
+
tiler.species = options.species
|
65
|
+
tiler.species_map = options.species_map
|
66
|
+
|
67
|
+
def parse_interval(line)
|
68
|
+
src, r_start_s, r_end_s, _ = line.split(nil, 4)
|
69
|
+
r_start = r_start_s.to_i
|
70
|
+
r_end = r_end_s.to_i
|
71
|
+
return Bio::GenomicInterval.zero_based(src, r_start, r_end)
|
72
|
+
end
|
73
|
+
|
74
|
+
def target_for(base, interval)
|
75
|
+
path = "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
|
76
|
+
File.open(path, 'w')
|
77
|
+
end
|
78
|
+
|
79
|
+
if options.bed
|
80
|
+
intervals = []
|
81
|
+
File.open(options.bed) do |bed_f|
|
82
|
+
bed_f.each_line { |line| intervals << parse_interval(line) }
|
83
|
+
end
|
84
|
+
intervals.sort_by! { |int| int.zero_start }
|
85
|
+
intervals.each do |int|
|
86
|
+
tiler.interval = int
|
87
|
+
target = target_for(options.output_base, int)
|
88
|
+
tiler.write_fasta(target)
|
89
|
+
target.close
|
90
|
+
end
|
91
|
+
else
|
92
|
+
# single interval
|
93
|
+
tiler.interval = Bio::GenomicInterval.zero_based(tiler.index.ref_seq,
|
94
|
+
options.interval.begin,
|
95
|
+
options.interval.end)
|
96
|
+
if options.output_base
|
97
|
+
target = target_for(options.output_base, tiler.interval)
|
98
|
+
else
|
99
|
+
target = $stdout
|
100
|
+
end
|
101
|
+
tiler.write_fasta(target)
|
102
|
+
target.close
|
103
|
+
end
|
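Note on `bin/maf_tile` argument handling: the script turns `--interval BEGIN:END` into a half-open range, splits `--species SPECIES[:NAME]` into a species and an optional mapped name, and reads the first three whitespace-separated fields of each BED line. The sketch below reproduces that parsing without the gem so it can be tried in isolation. The helper names are hypothetical, and the interval pattern is anchored here, which is stricter than the unanchored match in the script.

```ruby
# Parse "BEGIN:END" into a half-open Range, or nil if malformed.
# Anchored with \A and \z, unlike the script's unanchored pattern.
def parse_interval_arg(arg)
  m = /\A(\d+):(\d+)\z/.match(arg)
  m ? (m[1].to_i...m[2].to_i) : nil
end

# Parse "SPECIES" or "SPECIES:NAME" into [species, mapped_name_or_nil].
def parse_species_arg(arg)
  species, mapped = arg.split(':', 2)
  [species, mapped]
end

# Take source, start and end from a BED line; later fields are ignored.
def parse_bed_line(line)
  src, start_s, end_s, _rest = line.split(nil, 4)
  [src, start_s.to_i, end_s.to_i]
end

p parse_interval_arg('80082334:80082468')
p parse_species_arg('mm8:mouse')
p parse_species_arg('rn4')
p parse_bed_line("mm8.chr7\t80082334\t80082468\tregion1\n")
```

In the script itself, the three BED values are passed to `Bio::GenomicInterval.zero_based`; this sketch stops at plain arrays to stay dependency-free.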
data/bio-maf.gemspec
ADDED
@@ -0,0 +1,43 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
2
|
+
|
3
|
+
Gem::Specification.new do |s|
|
4
|
+
s.name = "bio-maf"
|
5
|
+
s.version = "0.2.0"
|
6
|
+
|
7
|
+
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
8
|
+
s.authors = ["Clayton Wheeler"]
|
9
|
+
s.date = "2012-06-29"
|
10
|
+
s.description = "Multiple Alignment Format parser for BioRuby."
|
11
|
+
s.email = "cswh@umich.edu"
|
12
|
+
s.executables = ["maf_count", "maf_dump_blocks", "maf_extract_ranges_count", "maf_index", "maf_parse_bench", "maf_to_fasta", "maf_write", "random_ranges"]
|
13
|
+
s.extra_rdoc_files = [
|
14
|
+
"LICENSE.txt",
|
15
|
+
"README.md"
|
16
|
+
]
|
17
|
+
s.files = `git ls-files`.split("\n")
|
18
|
+
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
|
19
|
+
s.executables = `git ls-files -- bin/*`.split("\n").map {
|
20
|
+
|f| File.basename(f)
|
21
|
+
}
|
22
|
+
|
23
|
+
s.homepage = "http://github.com/csw/bioruby-maf"
|
24
|
+
s.licenses = ["MIT"]
|
25
|
+
s.require_paths = ["lib"]
|
26
|
+
s.rubygems_version = "1.8.24"
|
27
|
+
s.summary = "MAF parser for BioRuby"
|
28
|
+
|
29
|
+
s.specification_version = 3
|
30
|
+
|
31
|
+
if RUBY_PLATFORM == 'java'
|
32
|
+
s.platform = 'java'
|
33
|
+
end
|
34
|
+
|
35
|
+
s.add_runtime_dependency('bio-bigbio', [">= 0"])
|
36
|
+
s.add_runtime_dependency('bio-genomic-interval', ["~> 0.1.2"])
|
37
|
+
if RUBY_PLATFORM == 'java'
|
38
|
+
s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
|
39
|
+
else
|
40
|
+
s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
|
41
|
+
end
|
42
|
+
|
43
|
+
end
|
data/features/gap-filling.feature
ADDED
@@ -0,0 +1,158 @@
|
|
1
|
+
Feature: Join alignment blocks with reference data
|
2
|
+
In order to produce FASTA output with one sequence per species
|
3
|
+
For use in downstream tools
|
4
|
+
We need to join adjacent MAF blocks together
|
5
|
+
And fill gaps in the reference sequence from reference data
|
6
|
+
|
7
|
+
Scenario: Non-overlapping MAF blocks in region of interest
|
8
|
+
Given MAF data:
|
9
|
+
"""
|
10
|
+
##maf version=1
|
11
|
+
a score=20.0
|
12
|
+
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
|
13
|
+
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
|
14
|
+
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
|
15
|
+
|
16
|
+
a score=21.0
|
17
|
+
s sp1.chr1 30 10 + 50 AGGGCGGTCC
|
18
|
+
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
|
19
|
+
"""
|
20
|
+
And chromosome reference sequence:
|
21
|
+
"""
|
22
|
+
>sp1.chr1
|
23
|
+
CCAGGATGCT
|
24
|
+
GGGCTGAGGG
|
25
|
+
CAGTTGTGTC
|
26
|
+
AGGGCGGTCC
|
27
|
+
GGTGCAGGCA
|
28
|
+
"""
|
29
|
+
When I open it with a MAF reader
|
30
|
+
And build an index on the reference sequence
|
31
|
+
And tile sp1.chr1:0-50 with the chromosome reference
|
32
|
+
And tile with species [sp1, sp2, sp3]
|
33
|
+
And write the tiled data as FASTA
|
34
|
+
Then the FASTA data obtained should be:
|
35
|
+
"""
|
36
|
+
>sp1
|
37
|
+
CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
|
38
|
+
>sp2
|
39
|
+
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
|
40
|
+
>sp3
|
41
|
+
**********AGGTTTAGGGCAGAG***************************
|
42
|
+
"""
|
43
|
+
|
44
|
+
Scenario: Non-overlapping MAF blocks with species map
|
45
|
+
Given MAF data:
|
46
|
+
"""
|
47
|
+
##maf version=1
|
48
|
+
a score=20.0
|
49
|
+
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
|
50
|
+
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
|
51
|
+
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
|
52
|
+
|
53
|
+
a score=21.0
|
54
|
+
s sp1.chr1 30 10 + 50 AGGGCGGTCC
|
55
|
+
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
|
56
|
+
"""
|
57
|
+
And chromosome reference sequence:
|
58
|
+
"""
|
59
|
+
>sp1.chr1
|
60
|
+
CCAGGATGCT
|
61
|
+
GGGCTGAGGG
|
62
|
+
CAGTTGTGTC
|
63
|
+
AGGGCGGTCC
|
64
|
+
GGTGCAGGCA
|
65
|
+
"""
|
66
|
+
When I open it with a MAF reader
|
67
|
+
And build an index on the reference sequence
|
68
|
+
And tile sp1.chr1:0-50 with the chromosome reference
|
69
|
+
And tile with species [sp1, sp2, sp3]
|
70
|
+
And map species sp1 as mouse
|
71
|
+
And map species sp2 as hippo
|
72
|
+
And map species sp3 as squid
|
73
|
+
And write the tiled data as FASTA
|
74
|
+
Then the FASTA data obtained should be:
|
75
|
+
"""
|
76
|
+
>mouse
|
77
|
+
CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
|
78
|
+
>hippo
|
79
|
+
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
|
80
|
+
>squid
|
81
|
+
**********AGGTTTAGGGCAGAG***************************
|
82
|
+
"""
|
83
|
+
|
84
|
+
Scenario: Subset of non-overlapping MAF blocks in region
|
85
|
+
Given MAF data:
|
86
|
+
"""
|
87
|
+
##maf version=1
|
88
|
+
a score=20.0
|
89
|
+
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
|
90
|
+
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
|
91
|
+
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
|
92
|
+
|
93
|
+
a score=21.0
|
94
|
+
s sp1.chr1 30 10 + 50 AGGGCGGTCC
|
95
|
+
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
|
96
|
+
"""
|
97
|
+
And chromosome reference sequence:
|
98
|
+
"""
|
99
|
+
>sp1.chr1
|
100
|
+
CCAGGATGCT
|
101
|
+
GGGCTGAGGG
|
102
|
+
CAGTTGTGTC
|
103
|
+
AGGGCGGTCC
|
104
|
+
GGTGCAGGCA
|
105
|
+
"""
|
106
|
+
When I open it with a MAF reader
|
107
|
+
And build an index on the reference sequence
|
108
|
+
And tile sp1.chr1:12-36 with the chromosome reference
|
109
|
+
And tile with species [sp1, sp2, sp3]
|
110
|
+
And write the tiled data as FASTA
|
111
|
+
Then the FASTA data obtained should be:
|
112
|
+
"""
|
113
|
+
>sp1
|
114
|
+
GCTGAGGGC--AGTTGTGTCAGGGCG
|
115
|
+
>sp2
|
116
|
+
GCTGACGGC--AG*******AGGGCG
|
117
|
+
>sp3
|
118
|
+
GTTTAGGGCAGAG*************
|
119
|
+
"""
|
120
|
+
Scenario: Overlapping MAF blocks in region of interest
|
121
|
+
Given MAF data:
|
122
|
+
"""
|
123
|
+
##maf version=1
|
124
|
+
a score=20.0
|
125
|
+
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
|
126
|
+
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
|
127
|
+
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
|
128
|
+
|
129
|
+
a score=21.0
|
130
|
+
s sp1.chr1 20 10 + 50 AGGGCGGTCC
|
131
|
+
s sp2.chr5 53020 10 + 65536 AGGGCGGTGC
|
132
|
+
"""
|
133
|
+
And chromosome reference sequence:
|
134
|
+
"""
|
135
|
+
>sp1.chr1
|
136
|
+
CCAGGATGCT
|
137
|
+
GGGCTGAGGG
|
138
|
+
CAGTTGTGTC
|
139
|
+
AGGGCGGTCC
|
140
|
+
GGTGCAGGCA
|
141
|
+
"""
|
142
|
+
When I open it with a MAF reader
|
143
|
+
And build an index on the reference sequence
|
144
|
+
And tile sp1.chr1:0-50 with the chromosome reference
|
145
|
+
And tile with species [sp1, sp2, sp3]
|
146
|
+
And write the tiled data as FASTA
|
147
|
+
Then the FASTA data obtained should be:
|
148
|
+
"""
|
149
|
+
>sp1
|
150
|
+
CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
|
151
|
+
>sp2
|
152
|
+
**********GGGCTGACGGAGGGCGGTGC********************
|
153
|
+
>sp3
|
154
|
+
**********AGGTTTAGGG******************************
|
155
|
+
"""
|
156
|
+
|
157
|
+
|
158
|
+
|
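Note on the tiling behaviour described by this feature: for non-overlapping blocks, the expected output is consistent with walking the interval left to right, copying each block's aligned text, filling uncovered stretches of the reference species from the FASTA reference, and padding every other species with `*`. The sketch below is a toy reconstruction of the first scenario under those assumptions. It is not the code in `lib/bio/maf/tiler.rb`; `Block`, `fill_uncovered`, and `tile` are hypothetical names, blocks are assumed to lie wholly inside the interval, and overlapping blocks are not handled.

```ruby
# Hypothetical stand-in for a parsed block: reference start, ungapped
# reference size, and aligned text keyed by species.
Block = Struct.new(:ref_start, :ref_size, :texts)

# Append filler for the uncovered reference range [from, to):
# real sequence for the reference species, '*' for everything else.
def fill_uncovered(out, species, ref_species, reference, from, to)
  return if to <= from
  species.each do |sp|
    out[sp] << (sp == ref_species ? reference[from...to] : '*' * (to - from))
  end
end

# Tile sorted, non-overlapping blocks across a half-open interval.
def tile(reference, interval, species, ref_species, blocks)
  out = {}
  species.each { |sp| out[sp] = ''.dup }
  pos = interval.begin
  blocks.sort_by(&:ref_start).each do |block|
    fill_uncovered(out, species, ref_species, reference, pos, block.ref_start)
    # Pad absent species to the block's aligned width, gaps included.
    width = block.texts.fetch(ref_species).size
    species.each { |sp| out[sp] << (block.texts[sp] || '*' * width) }
    pos = block.ref_start + block.ref_size
  end
  fill_uncovered(out, species, ref_species, reference, pos, interval.end)
  out
end

REFERENCE = %w[CCAGGATGCT GGGCTGAGGG CAGTTGTGTC AGGGCGGTCC GGTGCAGGCA].join
BLOCKS = [
  Block.new(10, 13, { 'sp1' => 'GGGCTGAGGGC--AG',
                      'sp2' => 'GGGCTGACGGC--AG',
                      'sp3' => 'AGGTTTAGGGCAGAG' }),
  Block.new(30, 10, { 'sp1' => 'AGGGCGGTCC', 'sp2' => 'AGGGCGGTGC' })
]

TILED = tile(REFERENCE, 0...50, %w[sp1 sp2 sp3], 'sp1', BLOCKS)
TILED.each { |sp, text| puts ">#{sp}", text }
```

Each output row is 52 columns wide rather than 50 because the first block carries a two-column alignment gap in `sp1` and `sp2`.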
data/features/gap-removal.feature
ADDED
@@ -0,0 +1,50 @@
|
|
1
|
+
Feature: Remove gaps from MAF files
|
2
|
+
In order to work with only the alignment data involving sequences
|
3
|
+
Which can be used by downstream software
|
4
|
+
We may want to filter out certain species
|
5
|
+
Which can leave gap regions where sequence data was only present
|
6
|
+
For removed species
|
7
|
+
So it is useful to be able to remove those gaps
|
8
|
+
|
9
|
+
Background:
|
10
|
+
Given MAF data:
|
11
|
+
"""
|
12
|
+
##maf version=1
|
13
|
+
a score=10542.0
|
14
|
+
s mm8.chr7 80082334 34 + 145134094 GGGCTGAGGGC--AGGGATGG---AGGGCGGTCC--------------CAGCA-
|
15
|
+
s rn4.chr1 136011785 34 + 267910886 GGGCTGAGGGC--AGGGACGG---AGGGCGGTCC--------------CAGCA-
|
16
|
+
s oryCun1.scaffold_199771 14021 43 - 75077 -----ATGGGC--AAGCGTGG---AGGGGAACCTCTCCTCCCCTCCGACAAAG-
|
17
|
+
s hg18.chr15 88557580 27 + 100338915 --------GGC--AAGTGTGGA--AGGGAAGCCC--------------CAGAA-
|
18
|
+
s panTro2.chr15 87959837 27 + 100063422 --------GGC--AAGTGTGGA--AGGGAAGCCC--------------CAGAA-
|
19
|
+
s rheMac2.chr7 69864714 28 + 169801366 -------GGGC--AAGTATGGA--AGGGAAGCCC--------------CAGAA-
|
20
|
+
s canFam2.chr3 56030570 39 + 94715083 AGGTTTAGGGCAGAGGGATGAAGGAGGAGAATCC--------------CTATG-
|
21
|
+
s dasNov1.scaffold_106893 7435 34 + 9831 GGAACGAGGGC--ATGTGTGG---AGGGGGCTGC--------------CCACA-
|
22
|
+
s loxAfr1.scaffold_8298 30264 38 + 78952 ATGATGAGGGG--AAGCGTGGAGGAGGGGAACCC--------------CTAGGA
|
23
|
+
s echTel1.scaffold_304651 594 37 - 10007 -TGCTATGGCT--TTGTGTCTAGGAGGGGAATCC--------------CCAGGA
|
24
|
+
"""
|
25
|
+
When I open it with a MAF reader
|
26
|
+
And filter for only the species
|
27
|
+
| mm8 |
|
28
|
+
| rn4 |
|
29
|
+
| hg18 |
|
30
|
+
| canFam2 |
|
31
|
+
| loxAfr1 |
|
32
|
+
|
33
|
+
Scenario: Detect filtered blocks
|
34
|
+
When an alignment block can be obtained
|
35
|
+
Then the alignment block is marked as filtered
|
36
|
+
And the alignment block has 5 sequences
|
37
|
+
|
38
|
+
Scenario: Detect gaps
|
39
|
+
When an alignment block can be obtained
|
40
|
+
Then 1 gap is found with length [14]
|
41
|
+
|
42
|
+
Scenario: Remove gaps
|
43
|
+
When an alignment block can be obtained
|
44
|
+
And gaps are removed
|
45
|
+
Then the text size of the block is 40
|
46
|
+
|
47
|
+
Scenario: Remove gaps in the parser
|
48
|
+
When I enable the :remove_gaps parser option
|
49
|
+
And an alignment block can be obtained
|
50
|
+
Then the text size of the block is 40
|
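Note on the gap-removal scenarios: once the block is filtered to five species, the only columns where every remaining row holds `-` are the 14 columns that carried an insertion in a removed species, and dropping them shrinks the text from 54 to 40 columns. The sketch below reproduces that arithmetic on plain strings. It is an illustration only: `find_gap_columns` and `strip_gap_columns` are hypothetical helpers, not the gem's `Block#find_gaps` or `Block#remove_gaps!`, though the `[start, length]` pair shape follows what the step definitions read via `gap[1]`.

```ruby
# Return [start, length] for each run of columns where every row is '-'.
def find_gap_columns(texts)
  width = texts.first.size
  gaps = []
  run_start = nil
  (0...width).each do |col|
    if texts.all? { |t| t[col] == '-' }
      run_start ||= col
    elsif run_start
      gaps << [run_start, col - run_start]
      run_start = nil
    end
  end
  gaps << [run_start, width - run_start] if run_start
  gaps
end

# Return copies of the rows with all-gap columns removed.
def strip_gap_columns(texts)
  gaps = find_gap_columns(texts)
  texts.map do |text|
    out = text.dup
    # Delete right to left so earlier offsets stay valid.
    gaps.reverse_each { |start, len| out.slice!(start, len) }
    out
  end
end

INSERTION = '-' * 14 # columns present only in the filtered-out species
ROWS = [
  'GGGCTGAGGGC--AGGGATGG---AGGGCGGTCC' + INSERTION + 'CAGCA-', # mm8
  'GGGCTGAGGGC--AGGGACGG---AGGGCGGTCC' + INSERTION + 'CAGCA-', # rn4
  '--------GGC--AAGTGTGGA--AGGGAAGCCC' + INSERTION + 'CAGAA-', # hg18
  'AGGTTTAGGGCAGAGGGATGAAGGAGGAGAATCC' + INSERTION + 'CTATG-', # canFam2
  'ATGATGAGGGG--AAGCGTGGAGGAGGGGAACCC' + INSERTION + 'CTAGGA'  # loxAfr1
]

p find_gap_columns(ROWS)
puts strip_gap_columns(ROWS)
```

The shorter dash runs earlier in the rows are kept because at least one remaining species (here `canFam2` or `loxAfr1`) has a base in those columns.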
data/features/step_definitions/gap-filling_steps.rb
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
Given /^chromosome reference sequence:$/ do |string|
|
2
|
+
sio = StringIO.new(string)
|
3
|
+
@refseq = Bio::MAF::FASTARangeReader.new(sio)
|
4
|
+
end
|
5
|
+
|
6
|
+
When /^tile ([^:\s]+):(\d+)-(\d+)( with the chromosome reference)?$/ do |seq, i_start, i_end, ref_p|
|
7
|
+
@tiler = Bio::MAF::Tiler.new
|
8
|
+
@tiler.index = @idx
|
9
|
+
@tiler.parser = @parser
|
10
|
+
@tiler.reference = @refseq if ref_p
|
11
|
+
@tiler.interval = Bio::GenomicInterval.zero_based(seq,
|
12
|
+
i_start.to_i,
|
13
|
+
i_end.to_i)
|
14
|
+
end
|
15
|
+
|
16
|
+
When /^tile with species \[(.+?)\]$/ do |species_text|
|
17
|
+
@tiler.species = species_text.split(/,\s*/)
|
18
|
+
end
|
19
|
+
|
20
|
+
When /^map species (\S+) as (\S+)$/ do |sp1, sp2|
|
21
|
+
@tiler.species_map[sp1] = sp2
|
22
|
+
end
|
23
|
+
|
24
|
+
When /^write the tiled data as FASTA$/ do
|
25
|
+
@dst = Tempfile.new(["cuke", ".fa"])
|
26
|
+
@tiler.write_fasta(@dst)
|
27
|
+
end
|
28
|
+
|
29
|
+
Then /^the FASTA data obtained should be:$/ do |string|
|
30
|
+
@dst.seek(0)
|
31
|
+
@dst.read.rstrip.should == string.rstrip
|
32
|
+
end
|
data/features/step_definitions/gap_removal_steps.rb
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
Then /^the alignment block is marked as filtered$/ do
|
2
|
+
@block.filtered?.should be_true
|
3
|
+
end
|
4
|
+
|
5
|
+
Then /^(\d+) gaps? (?:is|are) found with length \[(\d+)\]$/ do |n_gaps, gap_sizes_s|
|
6
|
+
gaps = @block.find_gaps
|
7
|
+
gaps.size.should == n_gaps.to_i
|
8
|
+
e_gap_sizes = gap_sizes_s.split(/,\s*/).collect { |n| n.to_i }
|
9
|
+
gap_sizes = gaps.collect { |gap| gap[1] }
|
10
|
+
gap_sizes.should == e_gap_sizes
|
11
|
+
end
|
12
|
+
|
13
|
+
When /^gaps are removed$/ do
|
14
|
+
@block.remove_gaps!
|
15
|
+
end
|
16
|
+
|
17
|
+
Then /^the text size of the block is (\d+)$/ do |e_text_size|
|
18
|
+
@block.text_size.should == e_text_size.to_i
|
19
|
+
end
|