RubyGems - bio-maf - Versions diffs - 0.3.0 → 0.3.1 - Mend

bio-maf 0.3.0 → 0.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

data/Gemfile +1 -0
data/README.md +147 -113
data/bin/maf_count +0 -1
data/bin/maf_dump_blocks +0 -1
data/bin/maf_extract +177 -0
data/bin/maf_index +15 -8
data/bin/maf_tile +2 -0
data/bin/maf_to_fasta +4 -7
data/bio-maf.gemspec +3 -4
data/features/maf-indexing.feature +21 -1
data/features/step_definitions/convert_steps.rb +2 -7
data/features/step_definitions/index_steps.rb +4 -0
data/lib/bio-maf.rb +5 -0
data/lib/bio/maf/index.rb +33 -23
data/lib/bio/maf/maf.rb +10 -7
data/lib/bio/maf/parser.rb +37 -15
data/lib/bio/maf/tiler.rb +60 -8
data/lib/bio/maf/writer.rb +26 -0
data/man/maf_extract.1 +159 -0
data/man/maf_extract.1.ronn +175 -0
data/man/maf_index.1 +21 -10
data/man/maf_index.1.ronn +14 -7
data/man/maf_tile.1 +12 -0
data/man/maf_tile.1.ronn +9 -0
data/spec/bio/maf/index_spec.rb +23 -0
metadata +15 -11

data/Gemfile CHANGED

@@ -13,6 +13,7 @@ group :development do
   gem "redcarpet", "~> 2.1.1", :platforms => :mri
   gem "ronn", "~> 0.7.3", :platforms => :mri
   gem "sinatra", "~> 1.3.2" # for ronn --server
+  gem "jruby-openssl", ">= 0.7", :platforms => :jruby
 end
 group :test do

data/README.md CHANGED

@@ -81,43 +81,53 @@ create one with [maf_index(1)][], like so:
     $ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
-Or programmatically:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
-    idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct")
+To index all sequences for searching, not just the reference sequence:
+    $ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
+To build an index programmatically:
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
+idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)
+```
 ### Extract blocks from an indexed MAF file, by genomic interval
 Refer to [`mm8_chr7_tiny.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/mm8_chr7_tiny.maf).
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
 Or, equivalently, one can work with a specific MAF file and index directly:
-    require 'bio-maf'
-    parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+```ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    idx.find(q, parser).each do |block|
-      ref_seq = block.sequences[0]
-      puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
-    end
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+idx.find(q, parser).each do |block|
+  ref_seq = block.sequences[0]
+  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
+end
-    # => Matched block at 80082592, 121 bases
-    # => Matched block at 80082713, 54 bases
+# => Matched block at 80082592, 121 bases
+# => Matched block at 80082713, 54 bases
+```
 ### Extract alignment blocks truncated to a given interval
@@ -125,25 +135,29 @@ Given a genomic interval of interest, one can also extract only the
 subsets of blocks that intersect with that interval, using the
 `#slice` method like so:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
-    blocks = access.slice(int).to_a
-    puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
-    # => Got 2 blocks, first 18 base pairs.
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
+blocks = access.slice(int).to_a
+puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
+# => Got 2 blocks, first 18 base pairs.
+```
 ### Filter species returned in alignment blocks
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
-    access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    blocks = access.find(q)
-    block = blocks.first
-    puts "Block has #{block.sequences.size} sequences."
+access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+blocks = access.find(q)
+block = blocks.first
+puts "Block has #{block.sequences.size} sequences."
-    # => Block has 3 sequences.
+# => Block has 3 sequences.
+```
 ### Extract blocks matching certain conditions
@@ -154,68 +168,80 @@ See also the [Cucumber feature][] and [step definitions][] for this.
 #### Match only blocks with all specified species
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
-    access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
+access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
+n_blocks = access.find(q).count
+# => 1
+```
 #### Match only blocks with a certain number of sequences
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
-    access.block_filter = { :at_least_n_sequences => 6 }
-    n_blocks = access.find(q).count
-    # => 1
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
+access.block_filter = { :at_least_n_sequences => 6 }
+n_blocks = access.find(q).count
+# => 1
+```
 #### Match only blocks within a text size range
-    access = Bio::MAF::Access.maf_dir('test/data')
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
-    access.block_filter = { :min_size => 72, :max_size => 160 }
-    n_blocks = access.find(q).count
-    # => 3
+```ruby
+access = Bio::MAF::Access.maf_dir('test/data')
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
+access.block_filter = { :min_size => 72, :max_size => 160 }
+n_blocks = access.find(q).count
+# => 3
+```
 ### Process each block in a MAF file
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
-    puts "MAF version: #{p.header.version}"
-    # => MAF version: 1
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+puts "MAF version: #{p.header.version}"
+# => MAF version: 1
-    p.each_block do |block|
-      block.sequences.each do |seq|
-        do_something(seq)
-      end
-    end
+p.each_block do |block|
+  block.sequences.each do |seq|
+    do_something(seq)
+  end
+end
+```
 ### Parse empty ('e') lines
 Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/data/chr22_ieq.maf).
-    require 'bio-maf'
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => false)
-    block = p.parse_block
-    block.sequences.size
-    # => 3
-    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
-                             :parse_empty => true)
-    block = p.parse_block
-    block.sequences.size
-    # => 4
-    block.sequences.find { |s| s.empty? }
-    # => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
-    #      @source="turTru1.scaffold_109008", @start=25049,
-    #      @size=1601, @strand=:+, @src_size=50103, @text=nil,
-    #      @status="I">
+```ruby
+require 'bio-maf'
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => false)
+block = p.parse_block
+block.sequences.size
+# => 3
+p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                         :parse_empty => true)
+block = p.parse_block
+block.sequences.size
+# => 4
+block.sequences.find { |s| s.empty? }
+# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0
+#      @source="turTru1.scaffold_109008", @start=25049,
+#      @size=1601, @strand=:+, @src_size=50103, @text=nil,
+#      @status="I">
+```
 Such options can also be set on a Bio::MAF::Access object:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:parse_empty] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:parse_empty] = true
+```
 ### Remove gaps from parsed blocks
@@ -225,9 +251,11 @@ gaps may be left where there was an insertion present only in
 sequences that were filtered out. Such gaps can be removed by setting
 the `:remove_gaps` parser option:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:remove_gaps] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:remove_gaps] = true
+```
 ### Join blocks after filtering together
@@ -235,9 +263,11 @@ Similarly, filtering out species may remove a species which had caused
 two adjacent alignment blocks to be split. By enabling the
 `:join_blocks` parser option, such blocks can be joined together:
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:join_blocks] = true
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:join_blocks] = true
+```
 See the [Cucumber feature][] for more details.
@@ -254,14 +284,16 @@ more.
 [Bio::BioAlignment::Alignment]: http://rdoc.info/gems/bio-alignment/Bio/BioAlignment/Alignment
 [bio-alignment]: https://github.com/pjotrp/bioruby-alignment
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    access.parse_options[:as_bio_alignment] = true
-    q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
-    access.find(q) do |aln|
-      col = aln.columns[3]
-      puts "bases in column 3: #{col}"
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+access.parse_options[:as_bio_alignment] = true
+q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
+access.find(q) do |aln|
+  col = aln.columns[3]
+  puts "bases in column 3: #{col}"
+end
+```
 ### Tile blocks together over an interval
@@ -276,23 +308,25 @@ man page.
 [feature]: https://github.com/csw/bioruby-maf/blob/master/features/tiling.feature
-    require 'bio-maf'
-    access = Bio::MAF::Access.maf_dir('test/data')
-    interval = Bio::GenomicInterval.zero_based('mm8.chr7',
-                                               80082334,
-                                               80082468)
-    access.tile(interval) do |tiler|
-      # reference is optional
-      tiler.reference = 'reference.fa.gz'
-      tiler.species = %w(mm8 rn4 hg18)
-      # species_map is optional
-      tiler.species_map = {
-        'mm8' => 'mouse',
-        'rn4' => 'rat',
-        'hg18' => 'human'
-      }
-      tiler.write_fasta($stdout)
-    end
+```ruby
+require 'bio-maf'
+access = Bio::MAF::Access.maf_dir('test/data')
+interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                           80082334,
+                                           80082468)
+access.tile(interval) do |tiler|
+  # reference is optional
+  tiler.reference = 'reference.fa.gz'
+  tiler.species = %w(mm8 rn4 hg18)
+  # species_map is optional
+  tiler.species_map = {
+    'mm8' => 'mouse',
+    'rn4' => 'rat',
+    'hg18' => 'human'
+  }
+  tiler.write_fasta($stdout)
+end
+```
 ### Command line tools
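
The block filters that appear in the README hunks above (`:with_all_species`, `:at_least_n_sequences`, `:min_size`, `:max_size`) can be read as predicates applied to each alignment block. The following is a minimal sketch in plain Ruby, not the gem's implementation. `SketchSeq`, `SketchBlock`, `species_of`, and `block_matches?` are hypothetical names introduced here, and the sketch assumes that a species is the part of a source name before the first dot and that "text size" means the length of the aligned text.

```ruby
# Hypothetical stand-ins for parsed MAF data; these are not bio-maf classes.
SketchSeq = Struct.new(:source, :text)
SketchBlock = Struct.new(:sequences) do
  # Assumption: every row in a block has the same aligned text length.
  def text_size
    sequences.first.text.size
  end
end

# Assumption: "mm8.chr7" belongs to species "mm8".
def species_of(source)
  source.split('.', 2).first
end

# Returns true when the block passes every condition present in the filter hash.
def block_matches?(block, filter)
  species = block.sequences.map { |s| species_of(s.source) }
  wanted = filter[:with_all_species]
  return false if wanted && !(wanted - species).empty?
  min_seqs = filter[:at_least_n_sequences]
  return false if min_seqs && block.sequences.size < min_seqs
  return false if filter[:min_size] && block.text_size < filter[:min_size]
  return false if filter[:max_size] && block.text_size > filter[:max_size]
  true
end

block = SketchBlock.new([
  SketchSeq.new('mm8.chr7', 'ACGT-ACGT'),
  SketchSeq.new('rn4.chr1', 'ACGTTACGT'),
  SketchSeq.new('hg18.chr15', 'AC-TTACGT')
])

puts block_matches?(block, :with_all_species => %w(mm8 hg18))
puts block_matches?(block, :with_all_species => %w(mm8 panTro2))
puts block_matches?(block, :at_least_n_sequences => 4)
puts block_matches?(block, :min_size => 5, :max_size => 9)
```

Conditions combine with logical AND in this sketch, which matches how the README passes `:min_size` and `:max_size` together in one hash; whether the gem evaluates them in the same order is not shown in this diff.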

data/bin/maf_count CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_dump_blocks CHANGED

@@ -1,7 +1,6 @@
 #!/usr/bin/env ruby
 require 'bio-maf'
-require 'bigbio'
 require 'optparse'
 require 'ostruct'

data/bin/maf_extract ADDED

@@ -0,0 +1,177 @@
+#!/usr/bin/env ruby
+require 'bio-maf'
+require 'optparse'
+require 'ostruct'
+include Bio::MAF
+options = OpenStruct.new
+options.mode = :intersect
+options.format = :maf
+options.seq_filter = {}
+options.block_filter = {}
+options.parse_options = {}
+def handle_list_spec(spec)
+  if spec =~ /^@(.+)/
+    File.read($1).split
+  else
+    spec.split(',')
+  end
+end
+def handle_interval_spec(int)
+  parts = int.split(':')
+  Bio::GenomicInterval.zero_based(parts[0], parts[1].to_i, parts[2].to_i)
+end
+$op = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_extract (-m MAF [-i INDEX] | -d MAFDIR) [options]"
+  opts.separator ""
+  opts.separator "MAF source options (either --maf or --maf-dir must be given):"
+  opts.on("-m", "--maf MAF", "MAF file") do |maf|
+    options.maf = maf
+  end
+  opts.on("-i", "--index INDEX", "MAF index") do |idx|
+    options.idx = idx
+  end
+  opts.on("-d", "--maf-dir DIR", "MAF directory") do |dir|
+    options.maf_dir = dir
+  end
+  opts.separator ""
+  opts.separator "Extraction options:"
+  opts.on("--mode MODE", [:intersect, :slice],
+          "Extraction mode; 'intersect' to match ",
+          "blocks intersecting the given region,",
+          "or 'slice' to extract subsets covering ",
+          "given regions") do |mode|
+    options.mode = mode
+  end
+  opts.on("--bed BED", "Use intervals from the given BED file") do |bed|
+    options.bed = bed
+  end
+  opts.on("--interval SEQ:START:END", "Zero-based genomic interval to match") do |int|
+    options.interval = handle_interval_spec(int)
+  end
+  opts.separator ""
+  opts.separator "Output options:"
+  opts.on("-f", "--format FMT", [:maf, :fasta], "Output format") do |fmt|
+    options.format = fmt
+  end
+  opts.on("-o", "--output OUT", "Write output to file OUT") do |out|
+    options.out_path = out
+  end
+  opts.separator ""
+  opts.separator "Filtering options:"
+  opts.on("--only-species SPECIES",
+          "Filter out all but the species in the",
+          "given comma-separated list",
+          "(or @FILE to read from a file)") do |spec|
+    options.seq_filter[:only_species] = handle_list_spec(spec)
+  end
+  opts.on("--with-all-species SPECIES",
+          "Only match blocks with all the given",
+          "species, comma-separated",
+          "(or @FILE to read from a file)") do |spec|
+    options.block_filter[:with_all_species] = handle_list_spec(spec)
+  end
+  opts.on("--min-sequences N", Integer,
+          "Match only blocks with at least N sequences") do |n|
+    options.block_filter[:at_least_n_sequences] = n
+  end
+  opts.on("--min-text-size N", Integer,
+          "Match only blocks with minimum text size N") do |n|
+    options.block_filter[:min_size] = n
+  end
+  opts.on("--max-text-size N", Integer,
+          "Match only blocks with maximum text size N") do |n|
+    options.block_filter[:max_size] = n
+  end
+  opts.separator ""
+  opts.separator "Block processing options:"
+  opts.on("--join-blocks",
+          "Join blocks if appropriate after filtering",
+          "out sequences") do
+    options.parse_options[:join_blocks] = true
+  end
+  opts.on("--remove-gaps", "Remove gaps after filtering out sequences") do
+    options.parse_options[:remove_gaps] = true
+  end
+  opts.on("--parse-extended", "Parse 'extended' MAF data (i, q lines)") do
+    options.parse_options[:parse_extended] = true
+  end
+  opts.on("--parse-empty", "Parse empty (e) lines of MAF data") do
+    options.parse_options[:parse_empty] = true
+  end
+  opts.separator ""
+  opts.separator "Logging options:"
+  Bio::MAF::handle_logging_options(opts)
+end
+$op.parse!(ARGV)
+Bio::Log::CLI.configure('bio-maf')
+def usage(msg)
+  $stderr.puts msg
+  $stderr.puts $op
+  exit 2
+end
+if options.maf
+  access = Access.file(options.maf, options.idx, options.parse_options)
+elsif options.maf_dir
+  access = Access.maf_dir(options.maf_dir, options.parse_options)
+else
+  usage "Must supply --maf or --maf-dir!"
+end
+begin
+  access.sequence_filter = options.seq_filter unless options.seq_filter.empty?
+  access.block_filter = options.block_filter unless options.block_filter.empty?
+  if options.out_path
+    outf = File.open(options.out_path, 'w')
+  else
+    outf = $stdout
+  end
+  case options.format
+  when :maf
+    writer = Writer.new(outf)
+  when :fasta
+    writer = FASTAWriter.new(outf)
+  else
+    raise "unsupported output format #{options.format}!"
+  end
+  if options.bed
+    intervals = read_bed_intervals(options.bed)
+  elsif options.interval
+    intervals = [options.interval]
+  else
+    usage "Must supply --interval or --bed!"
+  end
+  # TODO: provide access to original MAF header?
+  if options.format == :maf
+    writer.write_header(Header.default)
+  end
+  case options.mode
+  when :intersect
+    access.find(intervals) do |block|
+      writer.write_block(block)
+    end
+  when :slice
+    # TODO: multiple files if intervals.size > 1?
+    intervals.each do |interval|
+      access.slice(interval) do |block|
+        writer.write_block(block)
+      end
+    end
+  else
+    raise "Unsupported mode #{options.mode}!"
+  end
+ensure
+  access.close
+end
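
The two argument helpers in `maf_extract` (`handle_list_spec` and `handle_interval_spec`) can be tried out without the gem installed. The sketch below mirrors their shape under stated assumptions: `SketchInterval` is a hypothetical stand-in for `Bio::GenomicInterval`, and `parse_list_spec` and `parse_interval_spec` are illustrative names. Unlike the script, which uses `to_i` and so silently turns a malformed coordinate into 0, this version rejects malformed input; that stricter behavior is a design choice of the sketch, not something taken from the gem.

```ruby
require 'tempfile'

# Hypothetical stand-in for Bio::GenomicInterval, so the sketch runs without bio-maf.
SketchInterval = Struct.new(:chrom, :chrom_start, :chrom_end)

# "a,b,c" yields a list of names; "@path" reads whitespace-separated names from a file.
def parse_list_spec(spec)
  if spec =~ /\A@(.+)/
    File.read($1).split
  else
    spec.split(',')
  end
end

# Parses SEQ:START:END into a zero-based interval, raising on malformed input.
def parse_interval_spec(spec)
  parts = spec.split(':')
  unless parts.size == 3
    raise ArgumentError, "expected SEQ:START:END, got #{spec.inspect}"
  end
  # Integer(str, 10) raises ArgumentError on non-numeric text instead of returning 0.
  SketchInterval.new(parts[0], Integer(parts[1], 10), Integer(parts[2], 10))
end

p parse_list_spec('hg18,mm8,rheMac2')

Tempfile.create('species') do |f|
  f.write("hg18\nmm8\n")
  f.flush
  p parse_list_spec("@#{f.path}")
end

int = parse_interval_spec('mm8.chr7:80082592:80082766')
puts "#{int.chrom} #{int.chrom_start} #{int.chrom_end}"
```

A sequence name that itself contains a colon would not fit the `SEQ:START:END` form in either the script or this sketch.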