RubyGems - bio-maf - Versions diffs - 0.3.0-java → 0.3.2-java - Mend

bio-maf 0.3.0-java → 0.3.2-java

Files changed (26)

data/DEVELOPMENT.md +4 -0
data/README.md +172 -114
data/bin/maf_count +0 -1
data/bin/maf_dump_blocks +0 -1
data/bin/maf_extract +180 -0
data/bin/maf_index +15 -8
data/bin/maf_tile +2 -0
data/bin/maf_to_fasta +4 -7
data/bio-maf.gemspec +3 -4
data/features/maf-indexing.feature +21 -1
data/features/step_definitions/convert_steps.rb +2 -7
data/features/step_definitions/index_steps.rb +4 -0
data/lib/bio-maf.rb +5 -0
data/lib/bio/maf/index.rb +33 -23
data/lib/bio/maf/maf.rb +10 -7
data/lib/bio/maf/parser.rb +37 -15
data/lib/bio/maf/tiler.rb +60 -8
data/lib/bio/maf/writer.rb +26 -0
data/man/maf_extract.1 +268 -0
data/man/maf_extract.1.ronn +213 -0
data/man/maf_index.1 +21 -10
data/man/maf_index.1.ronn +14 -7
data/man/maf_tile.1 +12 -0
data/man/maf_tile.1.ronn +9 -0
data/spec/bio/maf/index_spec.rb +23 -0
metadata +14 -10

data/DEVELOPMENT.md CHANGED

@@ -20,6 +20,10 @@ platform.
 The version is simply set by hand in `bio-maf.gemspec`. Don't forget
 to increment it!
+First, verify that you are on the `master` branch:
+    $ git branch
 Testing the build:
     $ rake build

data/README.md CHANGED

@@ -81,43 +81,57 @@ create one with [maf_index(1)][], like so:
     $ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
-Or programmatically:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
-    idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct")
+To index all sequences for searching, not just the reference sequence:
+    $ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
+To build an index programmatically:
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
+idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)
+```
 ### Extract blocks from an indexed MAF file, by genomic interval
 Refer to [`mm8_chr7_tiny.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/mm8_chr7_tiny.maf).
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
 Or, equivalently, one can work with a specific MAF file and index directly:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+idx.find(q, parser).each do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    idx.find(q, parser).each do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+This can be done with [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html) as well:
+    $ maf_extract -d test/data --interval mm8.chr7:80082592-80082766
 ### Extract alignment blocks truncated to a given interval
@@ -125,25 +139,37 @@ Given a genomic interval of interest, one can also extract only the
 subsets of blocks that intersect with that interval, using the
 `#slice` method like so:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
-    blocks = access.slice(int).to_a
-    puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
-    # => Got 2 blocks, first 18 base pairs.
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
+blocks = access.slice(int).to_a
+puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
+# => Got 2 blocks, first 18 base pairs.
+```
+Or, with [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766
 ### Filter species returned in alignment blocks
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+blocks = access.find(q)
+block = blocks.first
+puts "Block has #{block.sequences.size} sequences."
-    access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    blocks = access.find(q)
-    block = blocks.first
-    puts "Block has #{block.sequences.size} sequences."
+# => Block has 3 sequences.
+```
-    # => Block has 3 sequences.
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2
 ### Extract blocks matching certain conditions
@@ -154,68 +180,92 @@ See also the [Cucumber feature][] and [step definitions][] for this.
 #### Match only blocks with all specified species
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
-    access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
+access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
+n_blocks = access.find(q).count
+# => 1
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 --with-all-species panTro2,loxAfr1
 #### Match only blocks with a certain number of sequences
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
-    access.block_filter = { :at_least_n_sequences => 6 }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
+access.block_filter = { :at_least_n_sequences => 6 }
+n_blocks = access.find(q).count
+# => 1
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 --min-sequences 6
 #### Match only blocks within a text size range
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
-    access.block_filter = { :min_size => 72, :max_size => 160 }
-    n_blocks = access.find(q).count
-    # => 3
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
+access.block_filter = { :min_size => 72, :max_size => 160 }
+n_blocks = access.find(q).count
+# => 3
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:0-80100000 --min-text-size 72 --max-text-size 160
 ### Process each block in a MAF file
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    puts "MAF version: #{p.header.version}"
-    # => MAF version: 1
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+puts "MAF version: #{p.header.version}"
+# => MAF version: 1
-    p.each_block do |block|
-      block.sequences.each do |seq|
-        do_something(seq)
-      end
-    end
+p.each_block do |block|
+  block.sequences.each do |seq|
+    do_something(seq)
+  end
+end
+```
 ### Parse empty ('e') lines
 Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/chr22_ieq.maf).
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => false)
-    block = p.parse_block
-    block.sequences.size
-    # => 3
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => true)
-    block = p.parse_block
-    block.sequences.size
-    # => 4
-    block.sequences.find { |s| s.empty? }
-    # => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
-    #      @source="turTru1.scaffold_109008", @start=25049,
-    #      @size=1601, @strand=:+, @src_size=50103, @text=nil,
-    #      @status="I">
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => false)
+block = p.parse_block
+block.sequences.size
+# => 3
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => true)
+block = p.parse_block
+block.sequences.size
+# => 4
+block.sequences.find { |s| s.empty? }
+# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
+#      @source="turTru1.scaffold_109008", @start=25049,
+#      @size=1601, @strand=:+, @src_size=50103, @text=nil,
+#      @status="I">
+```
 Such options can also be set on a Bio::MAF::Access object:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:parse_empty] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:parse_empty] = true
+```
 ### Remove gaps from parsed blocks
@@ -225,9 +275,11 @@ gaps may be left where there was an insertion present only in
 sequences that were filtered out. Such gaps can be removed by setting
 the `:remove_gaps` parser option:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:remove_gaps] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:remove_gaps] = true
+```
 ### Join blocks after filtering together
@@ -235,9 +287,11 @@ Similarly, filtering out species may remove a species which had caused
 two adjacent alignment blocks to be split. By enabling the
 `:join_blocks` parser option, such blocks can be joined together:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:join_blocks] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:join_blocks] = true
+```
 See the [Cucumber feature][] for more details.
@@ -254,14 +308,16 @@ more.
 [Bio::BioAlignment::Alignment]: http://rdoc.info/gems/bio-alignment/Bio/BioAlignment/Alignment
 [bio-alignment]: https://github.com/pjotrp/bioruby-alignment
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:as_bio_alignment] = true
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |aln|
-      col = aln.columns[3]
-      puts "bases in column 3: #{col}"
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:as_bio_alignment] = true
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |aln|
+  col = aln.columns[3]
+  puts "bases in column 3: #{col}"
+end
+```
 ### Tile blocks together over an interval
@@ -276,29 +332,32 @@ man page.
 [feature]: https://github.com/csw/bioruby-maf/blob/master/features/tiling.feature
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    interval = Bio::GenomicInterval.zero_based('mm8.chr7',
-                                               80082334,
-                                               80082468)
-    access.tile(interval) do |tiler|
-      # reference is optional
-      tiler.reference = 'reference.fa.gz'
-      tiler.species = %w(mm8 rn4 hg18)
-      # species_map is optional
-      tiler.species_map = {
-        'mm8' => 'mouse',
-        'rn4' => 'rat',
-        'hg18' => 'human'
-      }
-      tiler.write_fasta($stdout)
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                           80082334,
+                                           80082468)
+access.tile(interval) do |tiler|
+  # reference is optional
+  tiler.reference = 'reference.fa.gz'
+  tiler.species = %w(mm8 rn4 hg18)
+  # species_map is optional
+  tiler.species_map = {
+    'mm8' => 'mouse',
+    'rn4' => 'rat',
+    'hg18' => 'human'
+  }
+  tiler.write_fasta($stdout)
+end
+```
 ### Command line tools
 Man pages for command line tools:
 * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
+* [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html)
 * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
 * [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
@@ -343,4 +402,3 @@ This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-ma
 ## Copyright
 Copyright (c) 2012 Clayton Wheeler. See LICENSE.txt for further details.
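
Reviewer note: the README hunks above pair each Ruby filter with a `maf_extract` flag. As a rough sketch of that correspondence, the snippet below uses only Ruby's standard `optparse` to turn the flags into the same hashes the README assigns to `sequence_filter` and `block_filter`. The `parse_filters` helper is hypothetical and bio-maf is not loaded. It follows the option definitions in `bin/maf_extract` but leaves out the `@FILE` form.

```ruby
require 'optparse'

# Hypothetical helper, not part of bio-maf: maps maf_extract-style
# filter flags to the filter hashes used in the README examples.
def parse_filters(argv)
  seq_filter = {}
  block_filter = {}
  parser = OptionParser.new do |opts|
    opts.on("--only-species SPECIES") do |spec|
      seq_filter[:only_species] = spec.split(',')
    end
    opts.on("--with-all-species SPECIES") do |spec|
      block_filter[:with_all_species] = spec.split(',')
    end
    opts.on("--min-sequences N", Integer) do |n|
      block_filter[:at_least_n_sequences] = n
    end
    opts.on("--min-text-size N", Integer) do |n|
      block_filter[:min_size] = n
    end
    opts.on("--max-text-size N", Integer) do |n|
      block_filter[:max_size] = n
    end
  end
  # parse (without the bang) leaves the caller's argv untouched
  parser.parse(argv)
  [seq_filter, block_filter]
end

seq, blk = parse_filters(%w(--only-species hg18,mm8,rheMac2
                            --min-text-size 72 --max-text-size 160))
p seq  # the hash the README assigns to access.sequence_filter
p blk  # the hash the README assigns to access.block_filter
```

With bio-maf loaded, the two hashes would be assigned to `access.sequence_filter` and `access.block_filter` only when non-empty, as the script itself does.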

data/bin/maf_count CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_dump_blocks CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_extract ADDED

@@ -0,0 +1,180 @@
+#!/usr/bin/env ruby
+require 'bio-maf'
+require 'optparse'
+require 'ostruct'
+include Bio::MAF
+options = OpenStruct.new
+options.mode = :intersect
+options.format = :maf
+options.seq_filter = {}
+options.block_filter = {}
+options.parse_options = {}
+def handle_list_spec(spec)
+  if spec =~ /^@(.+)/
+    File.read($1).split
+  else
+    spec.split(',')
+  end
+end
+def handle_interval_spec(int)
+  if int =~ /(.+):(\d+)-(\d+)/
+    Bio::GenomicInterval.zero_based($1, $2.to_i, $3.to_i)
+  else
+    raise "Invalid interval specification: #{int}"
+  end
+end
+$op = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_extract (-m MAF [-i INDEX] | -d MAFDIR) [options]"
+  opts.separator ""
+  opts.separator "MAF source options (either --maf or --maf-dir must be given):"
+  opts.on("-m", "--maf MAF", "MAF file") do |maf|
+    options.maf = maf
+  end
+  opts.on("-i", "--index INDEX", "MAF index") do |idx|
+    options.idx = idx
+  end
+  opts.on("-d", "--maf-dir DIR", "MAF directory") do |dir|
+    options.maf_dir = dir
+  end
+  opts.separator ""
+  opts.separator "Extraction options:"
+  opts.on("--mode MODE", [:intersect, :slice],
+          "Extraction mode; 'intersect' to match ",
+          "blocks intersecting the given region,",
+          "or 'slice' to extract subsets covering ",
+          "given regions") do |mode|
+    options.mode = mode
+  end
+  opts.on("--bed BED", "Use intervals from the given BED file") do |bed|
+    options.bed = bed
+  end
+  opts.on("--interval SEQ:START-END", "Zero-based genomic interval to match") do |int|
+    options.interval = handle_interval_spec(int)
+  end
+  opts.separator ""
+  opts.separator "Output options:"
+  opts.on("-f", "--format FMT", [:maf, :fasta], "Output format") do |fmt|
+    options.format = fmt
+  end
+  opts.on("-o", "--output OUT", "Write output to file OUT") do |out|
+    options.out_path = out
+  end
+  opts.separator ""
+  opts.separator "Filtering options:"
+  opts.on("--only-species SPECIES",
+          "Filter out all but the species in the",
+          "given comma-separated list",
+          "(or @FILE to read from a file)") do |spec|
+    options.seq_filter[:only_species] = handle_list_spec(spec)
+  end
+  opts.on("--with-all-species SPECIES",
+          "Only match blocks with all the given",
+          "species, comma-separated",
+          "(or @FILE to read from a file)") do |spec|
+    options.block_filter[:with_all_species] = handle_list_spec(spec)
+  end
+  opts.on("--min-sequences N", Integer,
+          "Match only blocks with at least N sequences") do |n|
+    options.block_filter[:at_least_n_sequences] = n
+  end
+  opts.on("--min-text-size N", Integer,
+          "Match only blocks with minimum text size N") do |n|
+    options.block_filter[:min_size] = n
+  end
+  opts.on("--max-text-size N", Integer,
+          "Match only blocks with maximum text size N") do |n|
+    options.block_filter[:max_size] = n
+  end
+  opts.separator ""
+  opts.separator "Block processing options:"
+  opts.on("--join-blocks",
+          "Join blocks if appropriate after filtering",
+          "out sequences") do
+    options.parse_options[:join_blocks] = true
+  end
+  opts.on("--remove-gaps", "Remove gaps after filtering out sequences") do
+    options.parse_options[:remove_gaps] = true
+  end
+  opts.on("--parse-extended", "Parse 'extended' MAF data (i, q lines)") do
+    options.parse_options[:parse_extended] = true
+  end
+  opts.on("--parse-empty", "Parse empty (e) lines of MAF data") do
+    options.parse_options[:parse_empty] = true
+  end
+  opts.separator ""
+  opts.separator "Logging options:"
+  Bio::MAF::handle_logging_options(opts)
+end
+$op.parse!(ARGV)
+Bio::Log::CLI.configure('bio-maf')
+def usage(msg)
+  $stderr.puts msg
+  $stderr.puts $op
+  exit 2
+end
+if options.maf
+  access = Access.file(options.maf, options.idx, options.parse_options)
+elsif options.maf_dir
+  access = Access.maf_dir(options.maf_dir, options.parse_options)
+else
+  usage "Must supply --maf or --maf-dir!"
+end
+begin
+  access.sequence_filter = options.seq_filter unless options.seq_filter.empty?
+  access.block_filter = options.block_filter unless options.block_filter.empty?
+  if options.out_path
+    outf = File.open(options.out_path, 'w')
+  else
+    outf = $stdout
+  end
+  case options.format
+  when :maf
+    writer = Writer.new(outf)
+  when :fasta
+    writer = FASTAWriter.new(outf)
+  else
+    raise "unsupported output format #{options.format}!"
+  end
+  if options.bed
+    intervals = read_bed_intervals(options.bed)
+  elsif options.interval
+    intervals = [options.interval]
+  else
+    usage "Must supply --interval or --bed!"
+  end
+  # TODO: provide access to original MAF header?
+  if options.format == :maf
+    writer.write_header(Header.default)
+  end
+  case options.mode
+  when :intersect
+    access.find(intervals) do |block|
+      writer.write_block(block)
+    end
+  when :slice
+    # TODO: multiple files if intervals.size > 1?
+    intervals.each do |interval|
+      access.slice(interval) do |block|
+        writer.write_block(block)
+      end
+    end
+  else
+    raise "Unsupported mode #{options.mode}!"
+  end
+ensure
+  access.close
+end
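
Reviewer note: the two helpers at the top of `bin/maf_extract` can be exercised without the gem. The following is a minimal sketch under stated assumptions. `Interval` is a hypothetical stand-in for `Bio::GenomicInterval`, with made-up member names. The `\A` and `\z` anchors are an addition of this sketch (the script's regex is unanchored) so that trailing junk is rejected. `parse_interval_spec` and `parse_list_spec` are illustrative names, not bio-maf API.

```ruby
# Hypothetical stand-in for Bio::GenomicInterval so this runs without
# bio-maf; member names are illustrative only.
Interval = Struct.new(:seq, :from, :to)

# Variant of handle_interval_spec. The \A and \z anchors are an
# assumption of this sketch, added to reject input such as
# "chr1:10-20xyz".
def parse_interval_spec(spec)
  m = /\A(.+):(\d+)-(\d+)\z/.match(spec)
  raise ArgumentError, "Invalid interval specification: #{spec}" unless m
  Interval.new(m[1], m[2].to_i, m[3].to_i)
end

# Variant of handle_list_spec: "@path" reads whitespace-separated
# names from a file, anything else is treated as a comma-separated list.
def parse_list_spec(spec)
  if spec.start_with?('@')
    File.read(spec[1..-1]).split
  else
    spec.split(',')
  end
end

int = parse_interval_spec('mm8.chr7:80082592-80082766')
puts "#{int.seq} #{int.from} #{int.to}"
# prints: mm8.chr7 80082592 80082766

puts parse_list_spec('hg18,mm8,rheMac2').inspect
# prints: ["hg18", "mm8", "rheMac2"]
```

Because the first capture is greedy, a sequence name that itself contains a colon still parses: the split happens at the last colon that is followed by `START-END`.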