bio-maf 0.1.0-java → 0.2.0-java

@@ -0,0 +1,53 @@
1
+ # rcov generated
2
+ coverage
3
+ coverage.data
4
+
5
+ # rdoc generated
6
+ rdoc
7
+
8
+ # yard generated
9
+ doc
10
+ .yardoc
11
+
12
+ # bundler
13
+ .bundle
14
+
15
+ # jeweler generated
16
+ pkg
17
+
18
+ # Have editor/IDE/OS specific files you need to ignore? Consider using a global gitignore:
19
+ #
20
+ # * Create a file at ~/.gitignore
21
+ # * Include files you want ignored
22
+ # * Run: git config --global core.excludesfile ~/.gitignore
23
+ #
24
+ # After doing this, these files will be ignored in all your git projects,
25
+ # saving you from having to 'pollute' every project you touch with them
26
+ #
27
+ # Not sure what needs to be ignored for particular editors/OSes? Here are some ideas to get you started. (Remember to remove the leading # from the line.)
28
+ #
29
+ # For MacOS:
30
+ #
31
+ #.DS_Store
32
+
33
+ # For TextMate
34
+ #*.tmproj
35
+ #tmtags
36
+
37
+ # For emacs:
38
+ #*~
39
+ #\#*
40
+ #.\#*
41
+
42
+ # For vim:
43
+ #*.swp
44
+
45
+ # For redcar:
46
+ #.redcar
47
+
48
+ # For rubinius:
49
+ *.rbc
50
+ .rbx
51
+ # Ignore Gemfile.lock for gems. See http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
52
+ Gemfile.lock
53
+
@@ -3,6 +3,35 @@
3
3
  Here are notes on less obvious aspects of the development process for
4
4
  this library.
5
5
 
6
+ ## Gem build / tagging / release
7
+
8
+ Building and releasing gems is now handled by [rubygems-tasks][].
9
+
10
+ [rubygems-tasks]: https://github.com/postmodern/rubygems-tasks
11
+
12
+ We build two gem platform variants: a 'default' one for MRI with no
13
+ platform set, and a JRuby one with `platform = 'java'`. These get
14
+ built as `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`. At least
15
+ for now, this is done by running `gem release` separately under JRuby
16
+ and MRI. SCM tagging and pushing is done under MRI only, but the gems
17
+ will be built and pushed to rubygems.org separately under each
18
+ platform.
19
+
20
+ The version is simply set by hand in `bio-maf.gemspec`. Don't forget
21
+ to increment it!
22
+
23
+ Testing the build:
24
+
25
+ $ rake build
26
+ $ rake install
27
+
28
+ Release:
29
+
30
+ $ rvm use 1.9.3@bioruby-maf
31
+ $ rake release
32
+ $ rvm use jruby-1.6.7.2@bioruby-maf
33
+ $ rake release
34
+
6
35
  ## kyotocabinet-java
7
36
 
8
37
  Running `bio-maf` on JRuby requires the [kyotocabinet-java][] gem, a
data/Gemfile CHANGED
@@ -13,6 +13,7 @@ group :development do
13
13
  gem "redcarpet", "~> 2.1.1", :platforms => :mri
14
14
  gem "ronn", "~> 0.7.3", :platforms => :mri
15
15
  gem "sinatra", "~> 1.3.2" # for ronn --server
16
+ gem "rubygems-tasks", "~> 0.2.3"
16
17
  end
17
18
 
18
19
  group :test do
data/README.md CHANGED
@@ -47,8 +47,29 @@ problems building or using this gem, which is still fairly new.
47
47
 
48
48
  ## Installation
49
49
 
50
+ `bio-maf` is now published as a Ruby [gem](https://rubygems.org/gems/bio-maf).
51
+
50
52
  $ gem install bio-maf
51
53
 
54
+ ## Performance
55
+
56
+ This parser performs best under [JRuby][], particularly with Java
57
+ 7. See the [Performance][] wiki page for more information. For best
58
+ results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
59
+
60
+ [JRuby]: http://jruby.org/
61
+ [Performance]: https://github.com/csw/bioruby-maf/wiki/Performance
62
+
63
+ $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
64
+
65
+ Many parsing modes are multithreaded. Under JRuby, the parser defaults
66
+ to one parser thread per available core; this can be overridden with
67
+ the `:threads` parser option.
68
+
69
+ Ruby 1.9.3 is fully supported, but does not perform as well,
70
+ especially since its concurrency features are not useful for this
71
+ workload.
72
+
52
73
  ## Usage
53
74
 
54
75
  ### Create an index on a MAF file
@@ -162,6 +183,47 @@ Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/d
162
183
  # @size=1601, @strand=:+, @src_size=50103, @text=nil,
163
184
  # @status="I">
164
185
 
186
+ ### Remove gaps from parsed blocks
187
+
188
+ After filtering out species with
189
+ [`Parser#sequence_filter`](#filter-species-returned-in-alignment-blocks),
190
+ gaps may be left where there was an insertion present only in
191
+ sequences that were filtered out. Such gaps can be removed by setting
192
+ the `:remove_gaps` parser option:
193
+
194
+ require 'bio-maf'
195
+ p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
196
+ :remove_gaps => true)
197
+
198
+ ### Tile blocks together over an interval
199
+
200
+ Extracts alignment blocks overlapping the given genomic interval and
201
+ constructs a single alignment block covering the entire interval for
202
+ the specified species. Optionally, any gaps in coverage of the MAF
203
+ file's reference sequence can be filled in from a FASTA sequence
204
+ file. See the Cucumber [feature][] for examples of output, and also
205
+ the
206
+ [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
207
+ man page.
208
+
209
+ [feature]: https://github.com/csw/bioruby-maf/blob/master/features/gap-filling.feature
210
+
211
+ require 'bio-maf'
212
+ tiler = Bio::MAF::Tiler.new
213
+ tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
214
+ tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
215
+ # optional
216
+ tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
217
+ tiler.species = %w(mm8 rn4 hg18)
218
+ tiler.species_map = {
219
+ 'mm8' => 'mouse',
220
+ 'rn4' => 'rat',
221
+ 'hg18' => 'human'
222
+ }
223
+ tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
224
+ 80082334,
225
+ 80082468)
226
+ tiler.write_fasta($stdout)
165
227
 
166
228
  ### Command line tools
167
229
 
@@ -169,6 +231,12 @@ Man pages for command line tools:
169
231
 
170
232
  * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
171
233
  * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
234
+ * [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
235
+
236
+ With [gem-man](https://github.com/defunkt/gem-man) installed, these
237
+ can be read with:
238
+
239
+ $ gem man bio-maf
172
240
 
173
241
  ### Other documentation
174
242
 
@@ -201,7 +269,7 @@ If you use this software, please cite one of
201
269
 
202
270
  ## Biogems.info
203
271
 
204
- This Biogem will be published at [#bio-maf](http://biogems.info/index.html)
272
+ This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-maf).
205
273
 
206
274
  ## Copyright
207
275
 
data/Rakefile CHANGED
@@ -10,10 +10,11 @@ rescue Bundler::BundlerError => e
10
10
  exit e.status_code
11
11
  end
12
12
  require 'rake'
13
- require 'rubygems/package_task'
14
13
 
15
- $gemspec = Gem::Specification.load("bio-maf.gemspec")
16
- Gem::PackageTask.new($gemspec) { |pkg| }
14
+ require 'rubygems/tasks'
15
+ # we only want to do the SCM tag/push stuff once, on MRI
16
+ use_scm = (RUBY_PLATFORM != 'java')
17
+ Gem::Tasks.new(:scm => {:tag => use_scm, :push => use_scm})
17
18
 
18
19
  require 'rspec/core'
19
20
  require 'rspec/core/rake_task'
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'bio-maf'
4
+
5
+ parser = Bio::MAF::Parser.new(ARGV.shift, :threads => 4)
6
+
7
+ def desc(seq)
8
+ "#{seq.source}:#{seq.start}-#{seq.end}"
9
+ end
10
+
11
+ open = []
12
+ parser.parse_blocks.each do |block|
13
+ start_pos = block.ref_seq.start
14
+ open.delete_if { |open_b| open_b.ref_seq.end <= start_pos }
15
+ open.each do |ovl|
16
+ ref_a = ovl.ref_seq
17
+ ref_b = block.ref_seq
18
+ puts "#{desc(ref_a)} overlaps #{desc(ref_b)}"
19
+ end
20
+ open << block
21
+ end
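The overlap scan in the script above can be tried without a MAF file. The following is a minimal sketch, assuming blocks arrive sorted by reference start position, as the script itself appears to assume. The `Ref` struct and the `overlaps` helper are hypothetical stand-ins for `block.ref_seq` and the script's main loop; they are not part of bio-maf.

```ruby
# Hypothetical stand-in for a block's reference sequence (not the bio-maf class).
Ref = Struct.new(:source, :start, :size) do
  # Exclusive end coordinate, zero-based.
  def end
    start + size
  end
end

def desc(seq)
  "#{seq.source}:#{seq.start}-#{seq.end}"
end

# Returns [earlier, later] pairs of overlapping refs.
# The input must already be sorted by start position.
def overlaps(refs)
  open = []   # refs that may still overlap upcoming ones
  found = []
  refs.each do |ref|
    # Drop everything that ends at or before the current start.
    open.delete_if { |o| o.end <= ref.start }
    # Whatever is still open overlaps the current ref.
    open.each { |o| found << [o, ref] }
    open << ref
  end
  found
end

refs = [
  Ref.new('sp1.chr1', 10, 13),
  Ref.new('sp1.chr1', 20, 10),
  Ref.new('sp1.chr1', 30, 10)
]
overlaps(refs).each { |a, b| puts "#{desc(a)} overlaps #{desc(b)}" }
```

Because finished blocks are pruned before each comparison, the `open` list stays short when overlaps are rare, so the scan is close to linear in the number of blocks.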
@@ -0,0 +1,103 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'optparse'
4
+ require 'ostruct'
5
+
6
+ require 'bio-maf'
7
+ require 'bio-genomic-interval'
8
+
9
+ options = OpenStruct.new
10
+ options.p = { :threads => 1 }
11
+ options.species = []
12
+ options.species_map = {}
13
+ options.usage = false
14
+
15
+ o_parser = OptionParser.new do |opts|
16
+ opts.banner = "Usage: maf_tile [options] <maf> <index>"
17
+ opts.separator ""
18
+ opts.separator "Options:"
19
+ opts.on("-r", "--reference SEQ", "FASTA reference sequence") do |ref|
20
+ options.ref = ref
21
+ end
22
+ opts.on("-i", "--interval BEGIN:END", "Genomic interval, zero-based") do |int|
23
+ if int =~ /(\d+):(\d+)/
24
+ options.interval = ($1.to_i)...($2.to_i)
25
+ else
26
+ options.usage = true
27
+ end
28
+ end
29
+ opts.on("-s", "--species SPECIES[:NAME]", "Species to use (with mapped name)") do |sp|
30
+ if sp =~ /:/
31
+ species, mapped = sp.split(/:/)
32
+ options.species << species
33
+ options.species_map[species] = mapped
34
+ else
35
+ options.species << sp
36
+ end
37
+ end
38
+ opts.on("-o", "--output-base BASE", "Base name for output files",
39
+ "Use stdout for a single interval if not given") do |base|
40
+ options.output_base = base
41
+ end
42
+ opts.on("--bed BED", "BED file specifying intervals",
43
+ "(requires --output-base)") do |bed|
44
+ options.bed = bed
45
+ end
46
+ end
47
+
48
+ o_parser.parse!(ARGV)
49
+
50
+ maf_p = ARGV.shift
51
+ index_p = ARGV.shift
52
+
53
+ unless (! options.usage) \
54
+ && maf_p && index_p && (! options.species.empty?) \
55
+ && (options.output_base ? options.bed : options.interval)
56
+ $stderr.puts o_parser
57
+ exit 2
58
+ end
59
+
60
+ tiler = Bio::MAF::Tiler.new
61
+ tiler.index = Bio::MAF::KyotoIndex.open(index_p)
62
+ tiler.parser = Bio::MAF::Parser.new(maf_p, options.p)
63
+ tiler.reference = Bio::MAF::FASTARangeReader.new(options.ref) if options.ref
64
+ tiler.species = options.species
65
+ tiler.species_map = options.species_map
66
+
67
+ def parse_interval(line)
68
+ src, r_start_s, r_end_s, _ = line.split(nil, 4)
69
+ r_start = r_start_s.to_i
70
+ r_end = r_end_s.to_i
71
+ return Bio::GenomicInterval.zero_based(src, r_start, r_end)
72
+ end
73
+
74
+ def target_for(base, interval)
75
+ path = "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
76
+ File.open(path, 'w')
77
+ end
78
+
79
+ if options.bed
80
+ intervals = []
81
+ File.open(options.bed) do |bed_f|
82
+ bed_f.each_line { |line| intervals << parse_interval(line) }
83
+ end
84
+ intervals.sort_by! { |int| int.zero_start }
85
+ intervals.each do |int|
86
+ tiler.interval = int
87
+ target = target_for(options.output_base, int)
88
+ tiler.write_fasta(target)
89
+ target.close
90
+ end
91
+ else
92
+ # single interval
93
+ tiler.interval = Bio::GenomicInterval.zero_based(tiler.index.ref_seq,
94
+ options.interval.begin,
95
+ options.interval.end)
96
+ if options.output_base
97
+ target = target_for(options.output_base, tiler.interval)
98
+ else
99
+ target = $stdout
100
+ end
101
+ tiler.write_fasta(target)
102
+ target.close
103
+ end
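The BED handling in `maf_tile` above can be illustrated in isolation. This is a sketch under stated assumptions: the `Interval` struct is a hypothetical stand-in for `Bio::GenomicInterval`, and `parse_bed_line` and `output_path` are illustrative names mirroring `parse_interval` and `target_for`. Unlike the script, it uses `Integer()` rather than `to_i`, so malformed coordinates raise instead of silently becoming 0.

```ruby
# Hypothetical stand-in for Bio::GenomicInterval (zero-based, half-open).
Interval = Struct.new(:chrom, :zero_start, :zero_end)

# Parse the first three whitespace-separated BED columns;
# any further columns are ignored.
def parse_bed_line(line)
  src, r_start, r_end, _rest = line.split(nil, 4)
  Interval.new(src, Integer(r_start), Integer(r_end))
end

# Build the per-interval output file name from a base name.
def output_path(base, interval)
  "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
end

bed = "mm8.chr7\t80082468\t80082500\tb\nmm8.chr7\t80082334\t80082468\ta\n"

# Sort by start position, as the script does before tiling.
intervals = bed.each_line.map { |l| parse_bed_line(l) }.sort_by(&:zero_start)
intervals.each { |int| puts output_path('tiled', int) }
```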
@@ -0,0 +1,43 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ Gem::Specification.new do |s|
4
+ s.name = "bio-maf"
5
+ s.version = "0.2.0"
6
+
7
+ s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
8
+ s.authors = ["Clayton Wheeler"]
9
+ s.date = "2012-06-29"
10
+ s.description = "Multiple Alignment Format parser for BioRuby."
11
+ s.email = "cswh@umich.edu"
12
+ s.executables = ["maf_count", "maf_dump_blocks", "maf_extract_ranges_count", "maf_index", "maf_parse_bench", "maf_to_fasta", "maf_write", "random_ranges"]
13
+ s.extra_rdoc_files = [
14
+ "LICENSE.txt",
15
+ "README.md"
16
+ ]
17
+ s.files = `git ls-files`.split("\n")
18
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
19
+ s.executables = `git ls-files -- bin/*`.split("\n").map {
20
+ |f| File.basename(f)
21
+ }
22
+
23
+ s.homepage = "http://github.com/csw/bioruby-maf"
24
+ s.licenses = ["MIT"]
25
+ s.require_paths = ["lib"]
26
+ s.rubygems_version = "1.8.24"
27
+ s.summary = "MAF parser for BioRuby"
28
+
29
+ s.specification_version = 3
30
+
31
+ if RUBY_PLATFORM == 'java'
32
+ s.platform = 'java'
33
+ end
34
+
35
+ s.add_runtime_dependency('bio-bigbio', [">= 0"])
36
+ s.add_runtime_dependency('bio-genomic-interval', ["~> 0.1.2"])
37
+ if RUBY_PLATFORM == 'java'
38
+ s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
39
+ else
40
+ s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
41
+ end
42
+
43
+ end
@@ -0,0 +1,158 @@
1
+ Feature: Join alignment blocks with reference data
2
+ In order to produce FASTA output with one sequence per species
3
+ For use in downstream tools
4
+ We need to join adjacent MAF blocks together
5
+ And fill gaps in the reference sequence from reference data
6
+
7
+ Scenario: Non-overlapping MAF blocks in region of interest
8
+ Given MAF data:
9
+ """
10
+ ##maf version=1
11
+ a score=20.0
12
+ s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
13
+ s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
14
+ s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
15
+
16
+ a score=21.0
17
+ s sp1.chr1 30 10 + 50 AGGGCGGTCC
18
+ s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
19
+ """
20
+ And chromosome reference sequence:
21
+ """
22
+ >sp1.chr1
23
+ CCAGGATGCT
24
+ GGGCTGAGGG
25
+ CAGTTGTGTC
26
+ AGGGCGGTCC
27
+ GGTGCAGGCA
28
+ """
29
+ When I open it with a MAF reader
30
+ And build an index on the reference sequence
31
+ And tile sp1.chr1:0-50 with the chromosome reference
32
+ And tile with species [sp1, sp2, sp3]
33
+ And write the tiled data as FASTA
34
+ Then the FASTA data obtained should be:
35
+ """
36
+ >sp1
37
+ CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
38
+ >sp2
39
+ **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
40
+ >sp3
41
+ **********AGGTTTAGGGCAGAG***************************
42
+ """
43
+
44
+ Scenario: Non-overlapping MAF blocks with species map
45
+ Given MAF data:
46
+ """
47
+ ##maf version=1
48
+ a score=20.0
49
+ s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
50
+ s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
51
+ s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
52
+
53
+ a score=21.0
54
+ s sp1.chr1 30 10 + 50 AGGGCGGTCC
55
+ s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
56
+ """
57
+ And chromosome reference sequence:
58
+ """
59
+ >sp1.chr1
60
+ CCAGGATGCT
61
+ GGGCTGAGGG
62
+ CAGTTGTGTC
63
+ AGGGCGGTCC
64
+ GGTGCAGGCA
65
+ """
66
+ When I open it with a MAF reader
67
+ And build an index on the reference sequence
68
+ And tile sp1.chr1:0-50 with the chromosome reference
69
+ And tile with species [sp1, sp2, sp3]
70
+ And map species sp1 as mouse
71
+ And map species sp2 as hippo
72
+ And map species sp3 as squid
73
+ And write the tiled data as FASTA
74
+ Then the FASTA data obtained should be:
75
+ """
76
+ >mouse
77
+ CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
78
+ >hippo
79
+ **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
80
+ >squid
81
+ **********AGGTTTAGGGCAGAG***************************
82
+ """
83
+
84
+ Scenario: Subset of non-overlapping MAF blocks in region
85
+ Given MAF data:
86
+ """
87
+ ##maf version=1
88
+ a score=20.0
89
+ s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
90
+ s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
91
+ s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
92
+
93
+ a score=21.0
94
+ s sp1.chr1 30 10 + 50 AGGGCGGTCC
95
+ s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
96
+ """
97
+ And chromosome reference sequence:
98
+ """
99
+ >sp1.chr1
100
+ CCAGGATGCT
101
+ GGGCTGAGGG
102
+ CAGTTGTGTC
103
+ AGGGCGGTCC
104
+ GGTGCAGGCA
105
+ """
106
+ When I open it with a MAF reader
107
+ And build an index on the reference sequence
108
+ And tile sp1.chr1:12-36 with the chromosome reference
109
+ And tile with species [sp1, sp2, sp3]
110
+ And write the tiled data as FASTA
111
+ Then the FASTA data obtained should be:
112
+ """
113
+ >sp1
114
+ GCTGAGGGC--AGTTGTGTCAGGGCG
115
+ >sp2
116
+ GCTGACGGC--AG*******AGGGCG
117
+ >sp3
118
+ GTTTAGGGCAGAG*************
119
+ """
120
+ Scenario: Overlapping MAF blocks in region of interest
121
+ Given MAF data:
122
+ """
123
+ ##maf version=1
124
+ a score=20.0
125
+ s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
126
+ s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
127
+ s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG
128
+
129
+ a score=21.0
130
+ s sp1.chr1 20 10 + 50 AGGGCGGTCC
131
+ s sp2.chr5 53020 10 + 65536 AGGGCGGTGC
132
+ """
133
+ And chromosome reference sequence:
134
+ """
135
+ >sp1.chr1
136
+ CCAGGATGCT
137
+ GGGCTGAGGG
138
+ CAGTTGTGTC
139
+ AGGGCGGTCC
140
+ GGTGCAGGCA
141
+ """
142
+ When I open it with a MAF reader
143
+ And build an index on the reference sequence
144
+ And tile sp1.chr1:0-50 with the chromosome reference
145
+ And tile with species [sp1, sp2, sp3]
146
+ And write the tiled data as FASTA
147
+ Then the FASTA data obtained should be:
148
+ """
149
+ >sp1
150
+ CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
151
+ >sp2
152
+ **********GGGCTGACGGAGGGCGGTGC********************
153
+ >sp3
154
+ **********AGGTTTAGGG******************************
155
+ """
156
+
157
+
158
+
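The first scenario above can be reproduced with a small standalone sketch. It is a simplification, not the `Bio::MAF::Tiler` implementation: it assumes the blocks do not overlap and lie entirely inside the requested interval, and the `tile` method and the hash layout for blocks are hypothetical. Stretches not covered by any block are filled from the reference for the reference species and with `*` for the others.

```ruby
# Simplified tiling: non-overlapping blocks fully inside the interval only.
# Each block is a hash with the reference start, the ungapped reference
# size, and the aligned text per species (hypothetical layout).
def tile(blocks, reference, range, species, ref_species)
  out = Hash.new { |h, k| h[k] = String.new }
  pos = range.begin

  # Fill the uncovered stretch from pos up to (not including) upto.
  fill = lambda do |upto|
    len = upto - pos
    species.each do |sp|
      out[sp] << (sp == ref_species ? reference[pos, len] : '*' * len)
    end
    pos = upto
  end

  blocks.sort_by { |b| b[:start] }.each do |b|
    fill.call(b[:start])
    # Alignment width includes gap columns, so it can exceed b[:size].
    width = b[:text][ref_species].size
    species.each { |sp| out[sp] << (b[:text][sp] || '*' * width) }
    pos = b[:start] + b[:size]
  end
  fill.call(range.end)
  out
end

reference = %w[CCAGGATGCT GGGCTGAGGG CAGTTGTGTC AGGGCGGTCC GGTGCAGGCA].join
blocks = [
  { start: 10, size: 13,
    text: { 'sp1' => 'GGGCTGAGGGC--AG',
            'sp2' => 'GGGCTGACGGC--AG',
            'sp3' => 'AGGTTTAGGGCAGAG' } },
  { start: 30, size: 10,
    text: { 'sp1' => 'AGGGCGGTCC', 'sp2' => 'AGGGCGGTGC' } }
]

tiled = tile(blocks, reference, 0...50, %w[sp1 sp2 sp3], 'sp1')
tiled.each { |sp, seq| puts ">#{sp}", seq }
```

Handling the overlapping-blocks scenario and intervals that cut through a block would need extra logic that this sketch deliberately leaves out.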