bio-maf 0.1.0-java → 0.2.0-java
- data/.gitignore +53 -0
- data/DEVELOPMENT.md +29 -0
- data/Gemfile +1 -0
- data/README.md +69 -1
- data/Rakefile +4 -3
- data/bin/find_overlaps +21 -0
- data/bin/maf_tile +103 -0
- data/bio-maf.gemspec +43 -0
- data/features/gap-filling.feature +158 -0
- data/features/gap-removal.feature +50 -0
- data/features/step_definitions/gap-filling_steps.rb +32 -0
- data/features/step_definitions/gap_removal_steps.rb +19 -0
- data/features/step_definitions/parse_steps.rb +2 -1
- data/lib/bio/maf.rb +2 -0
- data/lib/bio/maf/index.rb +15 -8
- data/lib/bio/maf/maf.rb +267 -0
- data/lib/bio/maf/parser.rb +115 -175
- data/lib/bio/maf/tiler.rb +167 -0
- data/man/maf_tile.1 +108 -0
- data/man/maf_tile.1.ronn +104 -0
- data/spec/bio/maf/index_spec.rb +1 -0
- data/spec/bio/maf/parser_spec.rb +103 -0
- data/spec/bio/maf/tiler_spec.rb +69 -0
- data/test/data/gap-sp1.fa +6 -0
- data/test/data/mm8_chr7_tiny.kct +0 -0
- metadata +65 -7
data/.gitignore
ADDED
@@ -0,0 +1,53 @@
+# rcov generated
+coverage
+coverage.data
+
+# rdoc generated
+rdoc
+
+# yard generated
+doc
+.yardoc
+
+# bundler
+.bundle
+
+# jeweler generated
+pkg
+
+# Have editor/IDE/OS specific files you need to ignore? Consider using a global gitignore:
+#
+# * Create a file at ~/.gitignore
+# * Include files you want ignored
+# * Run: git config --global core.excludesfile ~/.gitignore
+#
+# After doing this, these files will be ignored in all your git projects,
+# saving you from having to 'pollute' every project you touch with them
+#
+# Not sure what to needs to be ignored for particular editors/OSes? Here's some ideas to get you started. (Remember, remove the leading # of the line)
+#
+# For MacOS:
+#
+#.DS_Store
+
+# For TextMate
+#*.tmproj
+#tmtags
+
+# For emacs:
+#*~
+#\#*
+#.\#*
+
+# For vim:
+#*.swp
+
+# For redcar:
+#.redcar
+
+# For rubinius:
+*.rbc
+.rbx
+# Ignore Gemfile.lock for gems. See http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
+Gemfile.lock
+
data/DEVELOPMENT.md
CHANGED
@@ -3,6 +3,35 @@
 Here are notes on less obvious aspects of the development process for
 this library.
 
+## Gem build / tagging / release
+
+This now uses [rubygems-tasks][] for building and releasing gems.
+
+[rubygems-tasks]: https://github.com/postmodern/rubygems-tasks
+
+We build two gem platform variants: a 'default' one for MRI with no
+platform set, and a JRuby one with `platform = 'java'`. These get
+built as `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`. At least
+for now, this is done by running `gem release` separately under JRuby
+and MRI. SCM tagging and pushing is done under MRI only, but the gems
+will be built and pushed to rubygems.org separately under each
+platform.
+
+The version is simply set by hand in `bio-maf.gemspec`. Don't forget
+to increment it!
+
+Testing the build:
+
+    $ rake build
+    $ rake install
+
+Release:
+
+    $ rvm use 1.9.3@bioruby-maf
+    $ rake release
+    $ rvm use jruby-1.6.7.2@bioruby-maf
+    $ rake release
+
 ## kyotocabinet-java
 
 Running `bio-maf` on JRuby requires the [kyotocabinet-java][] gem, a
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -47,8 +47,29 @@ problems building or using this gem, which is still fairly new.
 
 ## Installation
 
+`bio-maf` is now published as a Ruby [gem](https://rubygems.org/gems/bio-maf).
+
     $ gem install bio-maf
 
+## Performance
+
+This parser performs best under [JRuby][], particularly with Java
+7. See the [Performance][] wiki page for more information. For best
+results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
+
+[JRuby]: http://jruby.org/
+[Performance]: https://github.com/csw/bioruby-maf/wiki/Performance
+
+    $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
+
+Many parsing modes are multithreaded. Under JRuby, it will default to
+using one parser thread per available core, but if desired this can be
+configured with the `:threads` parser option.
+
+Ruby 1.9.3 is fully supported, but does not perform as well,
+especially since its concurrency features are not useful for this
+workload.
+
 ## Usage
 
 ### Create an index on a MAF file
@@ -162,6 +183,47 @@ Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/d
     # @size=1601, @strand=:+, @src_size=50103, @text=nil,
     # @status="I">
 
+### Remove gaps from parsed blocks
+
+After filtering out species with
+[`Parser#sequence_filter`](#filter-species-returned-in-alignment-blocks),
+gaps may be left where there was an insertion present only in
+sequences that were filtered out. Such gaps can be removed by setting
+the `:remove_gaps` parser option:
+
+    require 'bio-maf'
+    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                             :remove_gaps => true)
+
+### Tile blocks together over an interval
+
+Extracts alignment blocks overlapping the given genomic interval and
+constructs a single alignment block covering the entire interval for
+the specified species. Optionally, any gaps in coverage of the MAF
+file's reference sequence can be filled in from a FASTA sequence
+file. See the Cucumber [feature][] for examples of output, and also
+the
+[`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+man page.
+
+[feature]: https://github.com/csw/bioruby-maf/blob/master/features/gap-filling.feature
+
+    require 'bio-maf'
+    tiler = Bio::MAF::Tiler.new
+    tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+    tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+    # optional
+    tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
+    tiler.species = %w(mm8 rn4 hg18)
+    tiler.species_map = {
+      'mm8' => 'mouse',
+      'rn4' => 'rat',
+      'hg18' => 'human'
+    }
+    tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                                     80082334,
+                                                     80082468)
+    tiler.write_fasta($stdout)
 
 ### Command line tools
 
@@ -169,6 +231,12 @@ Man pages for command line tools:
 
 * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
 * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
+* [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+
+With [gem-man](https://github.com/defunkt/gem-man) installed, these
+can be read with:
+
+    $ gem man bio-maf
 
 ### Other documentation
 
@@ -201,7 +269,7 @@ If you use this software, please cite one of
 
 ## Biogems.info
 
-This Biogem
+This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-maf).
 
 ## Copyright
 
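Note on the `:remove_gaps` option documented in the README hunk above: the README describes the effect only in prose. As a rough illustration of the idea, and not the gem's actual implementation, the sketch below drops every alignment column in which all remaining sequences contain a gap. The helper name `remove_gap_columns` and the sample rows are assumptions made for this example; the rows are the `sp1` and `sp2` texts from the gap-filling feature, as they would look after `sp3` is filtered out.

```ruby
# Hypothetical helper: rows are equal-length gapped alignment strings.
# A column is dropped only when every row has '-' at that position.
def remove_gap_columns(rows)
  width = rows.first.size
  keep = (0...width).reject { |i| rows.all? { |r| r[i] == '-' } }
  rows.map { |r| keep.map { |i| r[i] }.join }
end

rows = ['GGGCTGAGGGC--AG',
        'GGGCTGACGGC--AG']
puts remove_gap_columns(rows)
# prints:
# GGGCTGAGGGCAG
# GGGCTGACGGCAG
```

A column that holds a gap in only some of the rows is kept, since removing it would shift the remaining sequences out of alignment.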
data/Rakefile
CHANGED
@@ -10,10 +10,11 @@ rescue Bundler::BundlerError => e
   exit e.status_code
 end
 require 'rake'
-require 'rubygems/package_task'
 
-
-
+require 'rubygems/tasks'
+# we only want to do the SCM tag/push stuff once, on MRI
+use_scm = (RUBY_PLATFORM != 'java')
+Gem::Tasks.new(:scm => {:tag => use_scm, :push => use_scm})
 
 require 'rspec/core'
 require 'rspec/core/rake_task'
data/bin/find_overlaps
ADDED
@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+
+require 'bio-maf'
+
+parser = Bio::MAF::Parser.new(ARGV.shift, :threads => 4)
+
+def desc(seq)
+  "#{seq.source}:#{seq.start}-#{seq.end}"
+end
+
+open = []
+parser.parse_blocks.each do |block|
+  start_pos = block.ref_seq.start
+  open.delete_if { |open_b| open_b.ref_seq.end <= start_pos }
+  open.each do |ovl|
+    ref_a = ovl.ref_seq
+    ref_b = block.ref_seq
+    puts "#{desc(ref_a)} overlaps #{desc(ref_b)}"
+  end
+  open << block
+end
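Note on `bin/find_overlaps`: the script keeps a list of "open" blocks, discards those that end at or before the current block's start, and reports whatever remains as overlapping. The same sweep can be tried without a MAF file. This is a minimal sketch, assuming the input is already sorted by start position, as the script appears to expect of the blocks it reads. `Span` and `find_overlaps` are hypothetical names for this example, and `finish` stands in for the `end` attribute of a sequence.

```ruby
# Hypothetical stand-in for a block's reference sequence coordinates.
Span = Struct.new(:source, :start, :finish)

def desc(span)
  "#{span.source}:#{span.start}-#{span.finish}"
end

# Returns [earlier, later] pairs of overlapping spans.
# Spans must be sorted by start; finish is exclusive.
def find_overlaps(spans)
  open = []
  pairs = []
  spans.each do |span|
    # Anything ending at or before this start can no longer overlap.
    open.delete_if { |o| o.finish <= span.start }
    open.each { |o| pairs << [o, span] }
    open << span
  end
  pairs
end

spans = [
  Span.new('sp1.chr1', 10, 23),
  Span.new('sp1.chr1', 20, 30),
  Span.new('sp1.chr1', 30, 40)
]
find_overlaps(spans).each do |a, b|
  puts "#{desc(a)} overlaps #{desc(b)}"
end
# prints: sp1.chr1:10-23 overlaps sp1.chr1:20-30
```

Because each span is compared only against the spans still open, the sweep avoids comparing every pair when the input is mostly non-overlapping.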
data/bin/maf_tile
ADDED
@@ -0,0 +1,103 @@
+#!/usr/bin/env ruby
+
+require 'optparse'
+require 'ostruct'
+
+require 'bio-maf'
+require 'bio-genomic-interval'
+
+options = OpenStruct.new
+options.p = { :threads => 1 }
+options.species = []
+options.species_map = {}
+options.usage = false
+
+o_parser = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_tile [options] <maf> <index>"
+  opts.separator ""
+  opts.separator "Options:"
+  opts.on("-r", "--reference SEQ", "FASTA reference sequence") do |ref|
+    options.ref = ref
+  end
+  opts.on("-i", "--interval BEGIN:END", "Genomic interval, zero-based") do |int|
+    if int =~ /(\d+):(\d+)/
+      options.interval = ($1.to_i)...($2.to_i)
+    else
+      options.usage = true
+    end
+  end
+  opts.on("-s", "--species SPECIES[:NAME]", "Species to use (with mapped name)") do |sp|
+    if sp =~ /:/
+      species, mapped = sp.split(/:/)
+      options.species << species
+      options.species_map[species] = mapped
+    else
+      options.species << sp
+    end
+  end
+  opts.on("-o", "--output-base BASE", "Base name for output files",
+          "Use stdout for a single interval if not given") do |base|
+    options.output_base = base
+  end
+  opts.on("--bed BED", "BED file specifying intervals",
+          "(requires --output-base)") do |bed|
+    options.bed = bed
+  end
+end
+
+o_parser.parse!(ARGV)
+
+maf_p = ARGV.shift
+index_p = ARGV.shift
+
+unless (! options.usage) \
+  && maf_p && index_p && (! options.species.empty?) \
+  && (options.output_base ? options.bed : options.interval)
+  $stderr.puts o_parser
+  exit 2
+end
+
+tiler = Bio::MAF::Tiler.new
+tiler.index = Bio::MAF::KyotoIndex.open(index_p)
+tiler.parser = Bio::MAF::Parser.new(maf_p, options.p)
+tiler.reference = Bio::MAF::FASTARangeReader.new(options.ref) if options.ref
+tiler.species = options.species
+tiler.species_map = options.species_map
+
+def parse_interval(line)
+  src, r_start_s, r_end_s, _ = line.split(nil, 4)
+  r_start = r_start_s.to_i
+  r_end = r_end_s.to_i
+  return Bio::GenomicInterval.zero_based(src, r_start, r_end)
+end
+
+def target_for(base, interval)
+  path = "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
+  File.open(path, 'w')
+end
+
+if options.bed
+  intervals = []
+  File.open(options.bed) do |bed_f|
+    bed_f.each_line { |line| intervals << parse_interval(line) }
+  end
+  intervals.sort_by! { |int| int.zero_start }
+  intervals.each do |int|
+    tiler.interval = int
+    target = target_for(options.output_base, int)
+    tiler.write_fasta(target)
+    target.close
+  end
+else
+  # single interval
+  tiler.interval = Bio::GenomicInterval.zero_based(tiler.index.ref_seq,
+                                                   options.interval.begin,
+                                                   options.interval.end)
+  if options.output_base
+    target = target_for(options.output_base, tiler.interval)
+  else
+    target = $stdout
+  end
+  tiler.write_fasta(target)
+  target.close
+end
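Note on `bin/maf_tile` option handling: the `--interval BEGIN:END` and `--species SPECIES[:NAME]` options can be exercised in isolation with only the standard library. The sketch below is an illustration and not part of the gem. `parse_tile_args` is a hypothetical helper that returns a plain Hash instead of an `OpenStruct`, and unlike the script it anchors the interval regex and raises on a malformed value instead of setting a usage flag.

```ruby
require 'optparse'

# Hypothetical helper mirroring two of maf_tile's options.
# Returns [options_hash, remaining_positional_arguments].
def parse_tile_args(argv)
  opts = { :species => [], :species_map => {}, :interval => nil }
  parser = OptionParser.new do |o|
    o.on('-i', '--interval BEGIN:END', 'Genomic interval, zero-based') do |int|
      # Anchored (an assumption of this sketch) so that "1:2x" is rejected.
      m = /\A(\d+):(\d+)\z/.match(int)
      raise OptionParser::InvalidArgument, int unless m
      # Half-open range, as in the script: BEGIN...END.
      opts[:interval] = (m[1].to_i)...(m[2].to_i)
    end
    o.on('-s', '--species SPECIES[:NAME]', 'Species to use') do |sp|
      species, mapped = sp.split(':', 2)
      opts[:species] << species
      opts[:species_map][species] = mapped if mapped
    end
  end
  rest = parser.parse(argv) # non-destructive; returns positional arguments
  [opts, rest]
end

opts, rest = parse_tile_args(
  %w(-i 80082334:80082468 -s mm8:mouse -s rn4 in.maf in.kct))
p opts[:interval]
p opts[:species]
p opts[:species_map]
p rest
```

Each repeated `-s` appends to the species list, so the order given on the command line is preserved for the output.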
data/bio-maf.gemspec
ADDED
@@ -0,0 +1,43 @@
+# -*- encoding: utf-8 -*-
+
+Gem::Specification.new do |s|
+  s.name = "bio-maf"
+  s.version = "0.2.0"
+
+  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+  s.authors = ["Clayton Wheeler"]
+  s.date = "2012-06-29"
+  s.description = "Multiple Alignment Format parser for BioRuby."
+  s.email = "cswh@umich.edu"
+  s.executables = ["maf_count", "maf_dump_blocks", "maf_extract_ranges_count", "maf_index", "maf_parse_bench", "maf_to_fasta", "maf_write", "random_ranges"]
+  s.extra_rdoc_files = [
+    "LICENSE.txt",
+    "README.md"
+  ]
+  s.files = `git ls-files`.split("\n")
+  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables = `git ls-files -- bin/*`.split("\n").map {
+    |f| File.basename(f)
+  }
+
+  s.homepage = "http://github.com/csw/bioruby-maf"
+  s.licenses = ["MIT"]
+  s.require_paths = ["lib"]
+  s.rubygems_version = "1.8.24"
+  s.summary = "MAF parser for BioRuby"
+
+  s.specification_version = 3
+
+  if RUBY_PLATFORM == 'java'
+    s.platform = 'java'
+  end
+
+  s.add_runtime_dependency('bio-bigbio', [">= 0"])
+  s.add_runtime_dependency('bio-genomic-interval', ["~> 0.1.2"])
+  if RUBY_PLATFORM == 'java'
+    s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
+  else
+    s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
+  end
+
+end
data/features/gap-filling.feature
ADDED
@@ -0,0 +1,158 @@
+Feature: Join alignment blocks with reference data
+  In order to produce FASTA output with one sequence per species
+  For use in downstream tools
+  We need to join adjacent MAF blocks together
+  And fill gaps in the reference sequence from reference data
+
+  Scenario: Non-overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >sp3
+    **********AGGTTTAGGGCAGAG***************************
+    """
+
+  Scenario: Non-overlapping MAF blocks with species map
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And map species sp1 as mouse
+    And map species sp2 as hippo
+    And map species sp3 as squid
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >mouse
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >hippo
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >squid
+    **********AGGTTTAGGGCAGAG***************************
+    """
+
+  Scenario: Subset of non-overlapping MAF blocks in region
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 30 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:12-36 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    GCTGAGGGC--AGTTGTGTCAGGGCG
+    >sp2
+    GCTGACGGC--AG*******AGGGCG
+    >sp3
+    GTTTAGGGCAGAG*************
+    """
+  Scenario: Overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
+    s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
+    s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
+
+    a score=21.0
+    s sp1.chr1 20 10 + 50 AGGGCGGTCC
+    s sp2.chr5 53020 10 + 65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGAGGGCGGTGC********************
+    >sp3
+    **********AGGTTTAGGG******************************
+    """
+
+
+
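Note on the first scenario above: the expected output can be reproduced by hand with a small sketch. It assumes sorted, non-overlapping blocks and an interval that covers the whole reference, so it does not attempt the overlapping or subset scenarios and is not the gem's `Tiler`. `Block` and `tile` are hypothetical names for this example. Uncovered stretches are taken from the reference for the reference species and filled with `*` for the others, as in the feature's expected FASTA.

```ruby
# Hypothetical, simplified block: reference start, ungapped reference
# size, and gapped alignment text keyed by species.
Block = Struct.new(:start, :size, :text)

# Tile sorted, non-overlapping blocks across the full reference string.
def tile(reference, blocks, species, ref_species)
  out = Hash.new { |h, k| h[k] = '' }
  pos = 0
  # Fill the uncovered stretch from pos up to (not including) upto.
  fill = lambda do |upto|
    len = upto - pos
    species.each do |sp|
      out[sp] << (sp == ref_species ? reference[pos, len] : '*' * len)
    end
    pos = upto
  end
  blocks.each do |b|
    fill.call(b.start)
    width = b.text[ref_species].size # gapped width of this block
    species.each { |sp| out[sp] << (b.text[sp] || '*' * width) }
    pos = b.start + b.size
  end
  fill.call(reference.size)
  out
end

reference = %w(CCAGGATGCT GGGCTGAGGG CAGTTGTGTC AGGGCGGTCC GGTGCAGGCA).join
blocks = [
  Block.new(10, 13, 'sp1' => 'GGGCTGAGGGC--AG',
                    'sp2' => 'GGGCTGACGGC--AG',
                    'sp3' => 'AGGTTTAGGGCAGAG'),
  Block.new(30, 10, 'sp1' => 'AGGGCGGTCC',
                    'sp2' => 'AGGGCGGTGC')
]
species = %w(sp1 sp2 sp3)
tiled = tile(reference, blocks, species, 'sp1')
species.each { |sp| puts ">#{sp}", tiled[sp] }
```

The gapped width of each block is taken from the reference species' text, so a species missing from a block is padded to the same number of columns and all output rows stay the same length.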