bio-maf 0.3.0-java → 0.3.2-java (RubyGems)

Files changed (26)

data/DEVELOPMENT.md +4 -0
data/README.md +172 -114
data/bin/maf_count +0 -1
data/bin/maf_dump_blocks +0 -1
data/bin/maf_extract +180 -0
data/bin/maf_index +15 -8
data/bin/maf_tile +2 -0
data/bin/maf_to_fasta +4 -7
data/bio-maf.gemspec +3 -4
data/features/maf-indexing.feature +21 -1
data/features/step_definitions/convert_steps.rb +2 -7
data/features/step_definitions/index_steps.rb +4 -0
data/lib/bio-maf.rb +5 -0
data/lib/bio/maf/index.rb +33 -23
data/lib/bio/maf/maf.rb +10 -7
data/lib/bio/maf/parser.rb +37 -15
data/lib/bio/maf/tiler.rb +60 -8
data/lib/bio/maf/writer.rb +26 -0
data/man/maf_extract.1 +268 -0
data/man/maf_extract.1.ronn +213 -0
data/man/maf_index.1 +21 -10
data/man/maf_index.1.ronn +14 -7
data/man/maf_tile.1 +12 -0
data/man/maf_tile.1.ronn +9 -0
data/spec/bio/maf/index_spec.rb +23 -0
metadata +14 -10

data/DEVELOPMENT.md CHANGED

@@ -20,6 +20,10 @@ platform.
 The version is simply set by hand in `bio-maf.gemspec`. Don't forget
 to increment it!
+First, verify that you are on the `master` branch:
+    $ git branch
 Testing the build:
     $ rake build

data/README.md CHANGED

@@ -81,43 +81,57 @@ create one with [maf_index(1)][], like so:
     $ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
-Or programmatically:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
-    idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct")
+To index all sequences for searching, not just the reference sequence:
+    $ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
+To build an index programmatically:
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
+idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)
+```
 ### Extract blocks from an indexed MAF file, by genomic interval
 Refer to [`mm8_chr7_tiny.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/mm8_chr7_tiny.maf).
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
 Or, equivalently, one can work with a specific MAF file and index directly:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+idx.find(q, parser).each do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    idx.find(q, parser).each do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+This can be done with [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html) as well:
+    $ maf_extract -d test/data --interval mm8.chr7:80082592-80082766
 ### Extract alignment blocks truncated to a given interval
@@ -125,25 +139,37 @@ Given a genomic interval of interest, one can also extract only the
 subsets of blocks that intersect with that interval, using the
 `#slice` method like so:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
-    blocks = access.slice(int).to_a
-    puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
-    # => Got 2 blocks, first 18 base pairs.
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
+blocks = access.slice(int).to_a
+puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
+# => Got 2 blocks, first 18 base pairs.
+```
+Or, with [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766
 ### Filter species returned in alignment blocks
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+blocks = access.find(q)
+block = blocks.first
+puts "Block has #{block.sequences.size} sequences."
-    access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    blocks = access.find(q)
-    block = blocks.first
-    puts "Block has #{block.sequences.size} sequences."
+# => Block has 3 sequences.
+```
-    # => Block has 3 sequences.
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2
 ### Extract blocks matching certain conditions
@@ -154,68 +180,92 @@ See also the [Cucumber feature][] and [step definitions][] for this.
 #### Match only blocks with all specified species
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
-    access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
+access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
+n_blocks = access.find(q).count
+# => 1
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 --with-all-species panTro2,loxAfr1
 #### Match only blocks with a certain number of sequences
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
-    access.block_filter = { :at_least_n_sequences => 6 }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
+access.block_filter = { :at_least_n_sequences => 6 }
+n_blocks = access.find(q).count
+# => 1
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 --min-sequences 6
 #### Match only blocks within a text size range
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
-    access.block_filter = { :min_size => 72, :max_size => 160 }
-    n_blocks = access.find(q).count
-    # => 3
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
+access.block_filter = { :min_size => 72, :max_size => 160 }
+n_blocks = access.find(q).count
+# => 3
+```
+With [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html):
+    $ maf_extract -d test/data --interval mm8.chr7:0-80100000 --min-text-size 72 --max-text-size 160
 ### Process each block in a MAF file
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    puts "MAF version: #{p.header.version}"
-    # => MAF version: 1
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+puts "MAF version: #{p.header.version}"
+# => MAF version: 1
-    p.each_block do |block|
-      block.sequences.each do |seq|
-        do_something(seq)
-      end
-    end
+p.each_block do |block|
+  block.sequences.each do |seq|
+    do_something(seq)
+  end
+end
+```
 ### Parse empty ('e') lines
 Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/chr22_ieq.maf).
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => false)
-    block = p.parse_block
-    block.sequences.size
-    # => 3
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => true)
-    block = p.parse_block
-    block.sequences.size
-    # => 4
-    block.sequences.find { |s| s.empty? }
-    # => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
-    #      @source="turTru1.scaffold_109008", @start=25049,
-    #      @size=1601, @strand=:+, @src_size=50103, @text=nil,
-    #      @status="I">
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => false)
+block = p.parse_block
+block.sequences.size
+# => 3
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => true)
+block = p.parse_block
+block.sequences.size
+# => 4
+block.sequences.find { |s| s.empty? }
+# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
+#      @source="turTru1.scaffold_109008", @start=25049,
+#      @size=1601, @strand=:+, @src_size=50103, @text=nil,
+#      @status="I">
+```
 Such options can also be set on a Bio::MAF::Access object:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:parse_empty] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:parse_empty] = true
+```
 ### Remove gaps from parsed blocks
@@ -225,9 +275,11 @@ gaps may be left where there was an insertion present only in
 sequences that were filtered out. Such gaps can be removed by setting
 the `:remove_gaps` parser option:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:remove_gaps] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:remove_gaps] = true
+```
 ### Join blocks after filtering together
@@ -235,9 +287,11 @@ Similarly, filtering out species may remove a species which had caused
 two adjacent alignment blocks to be split. By enabling the
 `:join_blocks` parser option, such blocks can be joined together:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:join_blocks] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:join_blocks] = true
+```
 See the [Cucumber feature][] for more details.
@@ -254,14 +308,16 @@ more.
 [Bio::BioAlignment::Alignment]: http://rdoc.info/gems/bio-alignment/Bio/BioAlignment/Alignment
 [bio-alignment]: https://github.com/pjotrp/bioruby-alignment
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:as_bio_alignment] = true
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |aln|
-      col = aln.columns[3]
-      puts "bases in column 3: #{col}"
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:as_bio_alignment] = true
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |aln|
+  col = aln.columns[3]
+  puts "bases in column 3: #{col}"
+end
+```
 ### Tile blocks together over an interval
@@ -276,29 +332,32 @@ man page.
 [feature]: https://github.com/csw/bioruby-maf/blob/master/features/tiling.feature
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    interval = Bio::GenomicInterval.zero_based('mm8.chr7',
-                                               80082334,
-                                               80082468)
-    access.tile(interval) do |tiler|
-      # reference is optional
-      tiler.reference = 'reference.fa.gz'
-      tiler.species = %w(mm8 rn4 hg18)
-      # species_map is optional
-      tiler.species_map = {
-        'mm8' => 'mouse',
-        'rn4' => 'rat',
-        'hg18' => 'human'
-      }
-      tiler.write_fasta($stdout)
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                           80082334,
+                                           80082468)
+access.tile(interval) do |tiler|
+  # reference is optional
+  tiler.reference = 'reference.fa.gz'
+  tiler.species = %w(mm8 rn4 hg18)
+  # species_map is optional
+  tiler.species_map = {
+    'mm8' => 'mouse',
+    'rn4' => 'rat',
+    'hg18' => 'human'
+  }
+  tiler.write_fasta($stdout)
+end
+```
 ### Command line tools
 Man pages for command line tools:
 * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
+* [`maf_extract(1)`](http://csw.github.com/bioruby-maf/man/maf_extract.1.html)
 * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
 * [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
@@ -343,4 +402,3 @@ This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-ma
 ## Copyright
 Copyright (c) 2012 Clayton Wheeler. See LICENSE.txt for further details.

data/bin/maf_count CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_dump_blocks CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_extract ADDED

@@ -0,0 +1,180 @@
+#!/usr/bin/env ruby
+require 'bio-maf'
+require 'optparse'
+require 'ostruct'
+include Bio::MAF
+options = OpenStruct.new
+options.mode = :intersect
+options.format = :maf
+options.seq_filter = {}
+options.block_filter = {}
+options.parse_options = {}
+def handle_list_spec(spec)
+  if spec =~ /^@(.+)/
+    File.read($1).split
+  else
+    spec.split(',')
+  end
+end
+def handle_interval_spec(int)
+  if int =~ /(.+):(\d+)-(\d+)/
+    Bio::GenomicInterval.zero_based($1, $2.to_i, $3.to_i)
+  else
+    raise "Invalid interval specification: #{int}"
+  end
+end
+$op = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_extract (-m MAF [-i INDEX] | -d MAFDIR) [options]"
+  opts.separator ""
+  opts.separator "MAF source options (either --maf or --maf-dir must be given):"
+  opts.on("-m", "--maf MAF", "MAF file") do |maf|
+    options.maf = maf
+  end
+  opts.on("-i", "--index INDEX", "MAF index") do |idx|
+    options.idx = idx
+  end
+  opts.on("-d", "--maf-dir DIR", "MAF directory") do |dir|
+    options.maf_dir = dir
+  end
+  opts.separator ""
+  opts.separator "Extraction options:"
+  opts.on("--mode MODE", [:intersect, :slice],
+          "Extraction mode; 'intersect' to match ",
+          "blocks intersecting the given region,",
+          "or 'slice' to extract subsets covering ",
+          "given regions") do |mode|
+    options.mode = mode
+  end
+  opts.on("--bed BED", "Use intervals from the given BED file") do |bed|
+    options.bed = bed
+  end
+  opts.on("--interval SEQ:START-END", "Zero-based genomic interval to match") do |int|
+    options.interval = handle_interval_spec(int)
+  end
+  opts.separator ""
+  opts.separator "Output options:"
+  opts.on("-f", "--format FMT", [:maf, :fasta], "Output format") do |fmt|
+    options.format = fmt
+  end
+  opts.on("-o", "--output OUT", "Write output to file OUT") do |out|
+    options.out_path = out
+  end
+  opts.separator ""
+  opts.separator "Filtering options:"
+  opts.on("--only-species SPECIES",
+          "Filter out all but the species in the",
+          "given comma-separated list",
+          "(or @FILE to read from a file)") do |spec|
+    options.seq_filter[:only_species] = handle_list_spec(spec)
+  end
+  opts.on("--with-all-species SPECIES",
+          "Only match blocks with all the given",
+          "species, comma-separated",
+          "(or @FILE to read from a file)") do |spec|
+    options.block_filter[:with_all_species] = handle_list_spec(spec)
+  end
+  opts.on("--min-sequences N", Integer,
+          "Match only blocks with at least N sequences") do |n|
+    options.block_filter[:at_least_n_sequences] = n
+  end
+  opts.on("--min-text-size N", Integer,
+          "Match only blocks with minimum text size N") do |n|
+    options.block_filter[:min_size] = n
+  end
+  opts.on("--max-text-size N", Integer,
+          "Match only blocks with maximum text size N") do |n|
+    options.block_filter[:max_size] = n
+  end
+  opts.separator ""
+  opts.separator "Block processing options:"
+  opts.on("--join-blocks",
+          "Join blocks if appropriate after filtering",
+          "out sequences") do
+    options.parse_options[:join_blocks] = true
+  end
+  opts.on("--remove-gaps", "Remove gaps after filtering out sequences") do
+    options.parse_options[:remove_gaps] = true
+  end
+  opts.on("--parse-extended", "Parse 'extended' MAF data (i, q lines)") do
+    options.parse_options[:parse_extended] = true
+  end
+  opts.on("--parse-empty", "Parse empty (e) lines of MAF data") do
+    options.parse_options[:parse_empty] = true
+  end
+  opts.separator ""
+  opts.separator "Logging options:"
+  Bio::MAF::handle_logging_options(opts)
+end
+$op.parse!(ARGV)
+Bio::Log::CLI.configure('bio-maf')
+def usage(msg)
+  $stderr.puts msg
+  $stderr.puts $op
+  exit 2
+end
+if options.maf
+  access = Access.file(options.maf, options.idx, options.parse_options)
+elsif options.maf_dir
+  access = Access.maf_dir(options.maf_dir, options.parse_options)
+else
+  usage "Must supply --maf or --maf-dir!"
+end
+begin
+  access.sequence_filter = options.seq_filter unless options.seq_filter.empty?
+  access.block_filter = options.block_filter unless options.block_filter.empty?
+  if options.out_path
+    outf = File.open(options.out_path, 'w')
+  else
+    outf = $stdout
+  end
+  case options.format
+  when :maf
+    writer = Writer.new(outf)
+  when :fasta
+    writer = FASTAWriter.new(outf)
+  else
+    raise "unsupported output format #{options.format}!"
+  end
+  if options.bed
+    intervals = read_bed_intervals(options.bed)
+  elsif options.interval
+    intervals = [options.interval]
+  else
+    usage "Must supply --interval or --bed!"
+  end
+  # TODO: provide access to original MAF header?
+  if options.format == :maf
+    writer.write_header(Header.default)
+  end
+  case options.mode
+  when :intersect
+    access.find(intervals) do |block|
+      writer.write_block(block)
+    end
+  when :slice
+    # TODO: multiple files if intervals.size > 1?
+    intervals.each do |interval|
+      access.slice(interval) do |block|
+        writer.write_block(block)
+      end
+    end
+  else
+    raise "Unsupported mode #{options.mode}!"
+  end
+ensure
+  access.close
+end
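
The two argument helpers in `maf_extract` above (`handle_interval_spec` and `handle_list_spec`) can be tried without the gem installed. The following is a minimal sketch, not the gem's code: `Interval` is a hypothetical `Struct` standing in for `Bio::GenomicInterval`, the method names `parse_interval_spec` and `parse_list_spec` are made up for illustration, and the regular expressions are anchored to the whole string, which the original does not do.

```ruby
# Hypothetical stand-in for Bio::GenomicInterval; only the three fields
# used by the command-line parsing are modelled here.
Interval = Struct.new(:chrom, :chrom_start, :chrom_end)

# Parse "SEQ:START-END" (e.g. "mm8.chr7:80082592-80082766") into an Interval.
# Unlike the original helper, the pattern must match the entire string.
def parse_interval_spec(spec)
  m = /\A(.+):(\d+)-(\d+)\z/.match(spec)
  raise ArgumentError, "Invalid interval specification: #{spec}" unless m
  Interval.new(m[1], m[2].to_i, m[3].to_i)
end

# Parse a list argument: either "a,b,c" or "@FILE", where FILE holds
# whitespace-separated names.
def parse_list_spec(spec)
  m = /\A@(.+)\z/.match(spec)
  if m
    File.read(m[1]).split
  else
    spec.split(',')
  end
end

int = parse_interval_spec('mm8.chr7:80082592-80082766')
puts "#{int.chrom} #{int.chrom_start} #{int.chrom_end}"
puts parse_list_spec('hg18,mm8,rheMac2').inspect
```

Anchoring the patterns means that a malformed argument such as `mm8.chr7:10-20junk` is rejected instead of being accepted on a partial match.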