RubyGems - bio-maf - Versions diffs - 0.1.0-java → 0.2.0-java - Mend

bio-maf 0.1.0-java → 0.2.0-java

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

data/.gitignore +53 -0
data/DEVELOPMENT.md +29 -0
data/Gemfile +1 -0
data/README.md +69 -1
data/Rakefile +4 -3
data/bin/find_overlaps +21 -0
data/bin/maf_tile +103 -0
data/bio-maf.gemspec +43 -0
data/features/gap-filling.feature +158 -0
data/features/gap-removal.feature +50 -0
data/features/step_definitions/gap-filling_steps.rb +32 -0
data/features/step_definitions/gap_removal_steps.rb +19 -0
data/features/step_definitions/parse_steps.rb +2 -1
data/lib/bio/maf.rb +2 -0
data/lib/bio/maf/index.rb +15 -8
data/lib/bio/maf/maf.rb +267 -0
data/lib/bio/maf/parser.rb +115 -175
data/lib/bio/maf/tiler.rb +167 -0
data/man/maf_tile.1 +108 -0
data/man/maf_tile.1.ronn +104 -0
data/spec/bio/maf/index_spec.rb +1 -0
data/spec/bio/maf/parser_spec.rb +103 -0
data/spec/bio/maf/tiler_spec.rb +69 -0
data/test/data/gap-sp1.fa +6 -0
data/test/data/mm8_chr7_tiny.kct +0 -0
metadata +65 -7

data/.gitignore ADDED

@@ -0,0 +1,53 @@
+# rcov generated
+coverage
+coverage.data
+# rdoc generated
+rdoc
+# yard generated
+doc
+.yardoc
+# bundler
+.bundle
+# jeweler generated
+pkg
+# Have editor/IDE/OS specific files you need to ignore? Consider using a global gitignore:
+#
+# * Create a file at ~/.gitignore
+# * Include files you want ignored
+# * Run: git config --global core.excludesfile ~/.gitignore
+#
+# After doing this, these files will be ignored in all your git projects,
+# saving you from having to 'pollute' every project you touch with them
+#
+# Not sure what needs to be ignored for particular editors/OSes? Here are some ideas to get you started. (Remember to remove the leading # from the line.)
+#
+# For MacOS:
+#
+#.DS_Store
+# For TextMate
+#*.tmproj
+#tmtags
+# For emacs:
+#*~
+#\#*
+#.\#*
+# For vim:
+#*.swp
+# For redcar:
+#.redcar
+# For rubinius:
+*.rbc
+.rbx
+# Ignore Gemfile.lock for gems. See http://yehudakatz.com/2010/12/16/clarifying-the-roles-of-the-gemspec-and-gemfile/
+Gemfile.lock

data/DEVELOPMENT.md CHANGED

@@ -3,6 +3,35 @@
 Here are notes on less obvious aspects of the development process for
 this library.
+## Gem build / tagging / release
+This now uses [rubygems-tasks][] for building and releasing gems.
+[rubygems-tasks]: https://github.com/postmodern/rubygems-tasks
+We build two gem platform variants: a 'default' one for MRI with no
+platform set, and a JRuby one with `platform = 'java'`. These get
+built as `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`. At least
+for now, this is done by running `gem release` separately under JRuby
+and MRI. SCM tagging and pushing is done under MRI only, but the gems
+will be built and pushed to rubygems.org separately under each
+platform.
+The version is simply set by hand in `bio-maf.gemspec`. Don't forget
+to increment it!
+Testing the build:
+    $ rake build
+    $ rake install
+Release:
+    $ rvm use 1.9.3@bioruby-maf
+    $ rake release
+    $ rvm use jruby-1.6.7.2@bioruby-maf
+    $ rake release
 ## kyotocabinet-java
 Running `bio-maf` on JRuby requires the [kyotocabinet-java][] gem, a

data/Gemfile CHANGED

@@ -13,6 +13,7 @@ group :development do
   gem "redcarpet", "~> 2.1.1", :platforms => :mri
   gem "ronn", "~> 0.7.3", :platforms => :mri
   gem "sinatra", "~> 1.3.2" # for ronn --server
+  gem "rubygems-tasks", "~> 0.2.3"
 end
 group :test do

data/README.md CHANGED

@@ -47,8 +47,29 @@ problems building or using this gem, which is still fairly new.
 ## Installation
+`bio-maf` is now published as a Ruby [gem](https://rubygems.org/gems/bio-maf).
     $ gem install bio-maf
+## Performance
+This parser performs best under [JRuby][], particularly with Java
+7. See the [Performance][] wiki page for more information. For best
+results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
+[JRuby]: http://jruby.org/
+[Performance]: https://github.com/csw/bioruby-maf/wiki/Performance
+    $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
+Many parsing modes are multithreaded. Under JRuby, it will default to
+using one parser thread per available core, but if desired this can be
+configured with the `:threads` parser option.
+Ruby 1.9.3 is fully supported, but does not perform as well,
+especially since its concurrency features are not useful for this
+workload.
 ## Usage
 ### Create an index on a MAF file
@@ -162,6 +183,47 @@ Refer to [`chr22_ieq.maf`](https://github.com/csw/bioruby-maf/blob/master/test/d
     #      @size=1601, @strand=:+, @src_size=50103, @text=nil,
     #      @status="I">
+### Remove gaps from parsed blocks
+After filtering out species with
+[`Parser#sequence_filter`](#filter-species-returned-in-alignment-blocks),
+gaps may be left where there was an insertion present only in
+sequences that were filtered out. Such gaps can be removed by setting
+the `:remove_gaps` parser option:
+    require 'bio-maf'
+    p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
+                             :remove_gaps => true)
+### Tile blocks together over an interval
+Extracts alignment blocks overlapping the given genomic interval and
+constructs a single alignment block covering the entire interval for
+the specified species. Optionally, any gaps in coverage of the MAF
+file's reference sequence can be filled in from a FASTA sequence
+file. See the Cucumber [feature][] for examples of output, and also
+the
+[`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+man page.
+[feature]: https://github.com/csw/bioruby-maf/blob/master/features/gap-filling.feature
+    require 'bio-maf'
+    tiler = Bio::MAF::Tiler.new
+    tiler.index = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')
+    tiler.parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
+    # optional
+    tiler.reference = Bio::MAF::FASTARangeReader.new('reference.fa.gz')
+    tiler.species = %w(mm8 rn4 hg18)
+    tiler.species_map = {
+      'mm8' => 'mouse',
+      'rn4' => 'rat',
+      'hg18' => 'human'
+    }
+    tiler.interval = Bio::GenomicInterval.zero_based('mm8.chr7',
+                                                     80082334,
+                                                     80082468)
+    tiler.write_fasta($stdout)
 ### Command line tools
@@ -169,6 +231,12 @@ Man pages for command line tools:
 * [`maf_index(1)`](http://csw.github.com/bioruby-maf/man/maf_index.1.html)
 * [`maf_to_fasta(1)`](http://csw.github.com/bioruby-maf/man/maf_to_fasta.1.html)
+* [`maf_tile(1)`](http://csw.github.com/bioruby-maf/man/maf_tile.1.html)
+With [gem-man](https://github.com/defunkt/gem-man) installed, these
+can be read with:
+    $ gem man bio-maf
 ### Other documentation
@@ -201,7 +269,7 @@ If you use this software, please cite one of
 ## Biogems.info
-This Biogem will be published at [#bio-maf](http://biogems.info/index.html)
+This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-maf).
 ## Copyright
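
The tiling behaviour described in the README's "Tile blocks together over an interval" section can be illustrated with a small, self-contained sketch. It does not use `Bio::MAF::Tiler`; the `Block` struct and the `tile` method are hypothetical stand-ins. It assumes the blocks do not overlap and lie entirely inside the requested interval. Stretches not covered by any block are filled from the reference for the reference species and with `*` for every other species, and a species missing from a block gets `*` for each of that block's columns.

```ruby
# Hypothetical stand-in for a parsed MAF block: zero-based reference start,
# number of reference bases covered, and aligned text keyed by species.
Block = Struct.new(:ref_start, :ref_size, :texts)

# Tile non-overlapping blocks lying within first...last (zero-based, half-open).
# Returns a hash mapping each species to its tiled sequence.
def tile(blocks, reference, ref_species, species, first, last)
  out = {}
  species.each { |sp| out[sp] = ''.dup }

  # Fill an uncovered stretch: reference bases for the reference species,
  # '*' placeholders for everyone else.
  fill = lambda do |from, to|
    next if to <= from
    species.each do |sp|
      out[sp] << (sp == ref_species ? reference[from...to] : '*' * (to - from))
    end
  end

  pos = first
  blocks.sort_by(&:ref_start).each do |block|
    fill.call(pos, block.ref_start)
    columns = block.texts.fetch(ref_species).length
    species.each { |sp| out[sp] << (block.texts[sp] || '*' * columns) }
    pos = block.ref_start + block.ref_size
  end
  fill.call(pos, last)
  out
end

# Data taken from the first scenario of features/gap-filling.feature.
reference = %w(CCAGGATGCT GGGCTGAGGG CAGTTGTGTC AGGGCGGTCC GGTGCAGGCA).join
blocks = [
  Block.new(10, 13, { 'sp1' => 'GGGCTGAGGGC--AG',
                      'sp2' => 'GGGCTGACGGC--AG',
                      'sp3' => 'AGGTTTAGGGCAGAG' }),
  Block.new(30, 10, { 'sp1' => 'AGGGCGGTCC', 'sp2' => 'AGGGCGGTGC' })
]
tiled = tile(blocks, reference, 'sp1', %w(sp1 sp2 sp3), 0, 50)
tiled.each { |sp, seq| puts ">#{sp}", seq }
```

Overlapping blocks and blocks that extend past the interval need slicing logic that this sketch leaves out.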

data/Rakefile CHANGED

@@ -10,10 +10,11 @@ rescue Bundler::BundlerError => e
   exit e.status_code
 end
 require 'rake'
-require 'rubygems/package_task'
-$gemspec = Gem::Specification.load("bio-maf.gemspec")
-Gem::PackageTask.new($gemspec) { |pkg| }
+require 'rubygems/tasks'
+# we only want to do the SCM tag/push stuff once, on MRI
+use_scm = (RUBY_PLATFORM != 'java')
+Gem::Tasks.new(:scm => {:tag => use_scm, :push => use_scm})
 require 'rspec/core'
 require 'rspec/core/rake_task'

data/bin/find_overlaps ADDED

@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+require 'bio-maf'
+parser = Bio::MAF::Parser.new(ARGV.shift, :threads => 4)
+def desc(seq)
+  "#{seq.source}:#{seq.start}-#{seq.end}"
+end
+open = []
+parser.parse_blocks.each do |block|
+  start_pos = block.ref_seq.start
+  open.delete_if { |open_b| open_b.ref_seq.end <= start_pos }
+  open.each do |ovl|
+    ref_a = ovl.ref_seq
+    ref_b = block.ref_seq
+    puts "#{desc(ref_a)} overlaps #{desc(ref_b)}"
+  end
+  open << block
+end
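
The sweep used by `find_overlaps` can be tried without a MAF file. The following is a minimal sketch, assuming regions arrive sorted by start position and use zero-based, half-open coordinates; `Region`, `end_pos`, and `overlapping_pairs` are hypothetical names, not part of `bio-maf`.

```ruby
# Hypothetical stand-in for a block's reference sequence.
Region = Struct.new(:source, :start, :size) do
  # Exclusive end coordinate.
  def end_pos
    start + size
  end

  def to_s
    "#{source}:#{start}-#{end_pos}"
  end
end

# Returns [earlier, later] pairs of overlapping regions.
# Expects the input sorted by start position.
def overlapping_pairs(regions)
  active = []
  pairs = []
  regions.each do |region|
    # Regions that end at or before the current start can no longer overlap.
    active.delete_if { |seen| seen.end_pos <= region.start }
    # Everything still active overlaps the current region.
    active.each { |seen| pairs << [seen, region] }
    active << region
  end
  pairs
end

regions = [
  Region.new('sp1.chr1', 10, 13),
  Region.new('sp1.chr1', 20, 10),
  Region.new('sp1.chr1', 30, 10)
]
pairs = overlapping_pairs(regions)
pairs.each { |a, b| puts "#{a} overlaps #{b}" }
```

Because the active list only ever holds regions that have not yet ended, each new region is compared with a handful of candidates rather than with every region seen so far.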

data/bin/maf_tile ADDED

@@ -0,0 +1,103 @@
+#!/usr/bin/env ruby
+require 'optparse'
+require 'ostruct'
+require 'bio-maf'
+require 'bio-genomic-interval'
+options = OpenStruct.new
+options.p = { :threads => 1 }
+options.species = []
+options.species_map = {}
+options.usage = false
+o_parser = OptionParser.new do |opts|
+  opts.banner = "Usage: maf_tile [options] <maf> <index>"
+  opts.separator ""
+  opts.separator "Options:"
+  opts.on("-r", "--reference SEQ", "FASTA reference sequence") do |ref|
+    options.ref = ref
+  end
+  opts.on("-i", "--interval BEGIN:END", "Genomic interval, zero-based") do |int|
+    if int =~ /(\d+):(\d+)/
+      options.interval = ($1.to_i)...($2.to_i)
+    else
+      options.usage = true
+    end
+  end
+  opts.on("-s", "--species SPECIES[:NAME]", "Species to use (with mapped name)") do |sp|
+    if sp =~ /:/
+      species, mapped = sp.split(/:/)
+      options.species << species
+      options.species_map[species] = mapped
+    else
+      options.species << sp
+    end
+  end
+  opts.on("-o", "--output-base BASE", "Base name for output files",
+          "Use stdout for a single interval if not given") do |base|
+    options.output_base = base
+  end
+  opts.on("--bed BED", "BED file specifying intervals",
+          "(requires --output-base)") do |bed|
+    options.bed = bed
+  end
+end
+o_parser.parse!(ARGV)
+maf_p = ARGV.shift
+index_p = ARGV.shift
+unless (! options.usage) \
+  && maf_p && index_p && (! options.species.empty?) \
+  && (options.output_base ? options.bed : options.interval)
+  $stderr.puts o_parser
+  exit 2
+end
+tiler = Bio::MAF::Tiler.new
+tiler.index = Bio::MAF::KyotoIndex.open(index_p)
+tiler.parser = Bio::MAF::Parser.new(maf_p, options.p)
+tiler.reference = Bio::MAF::FASTARangeReader.new(options.ref) if options.ref
+tiler.species = options.species
+tiler.species_map = options.species_map
+def parse_interval(line)
+  src, r_start_s, r_end_s, _ = line.split(nil, 4)
+  r_start = r_start_s.to_i
+  r_end = r_end_s.to_i
+  return Bio::GenomicInterval.zero_based(src, r_start, r_end)
+end
+def target_for(base, interval)
+  path = "#{base}_#{interval.zero_start}-#{interval.zero_end}.fa"
+  File.open(path, 'w')
+end
+if options.bed
+  intervals = []
+  File.open(options.bed) do |bed_f|
+    bed_f.each_line { |line| intervals << parse_interval(line) }
+  end
+  intervals.sort_by! { |int| int.zero_start }
+  intervals.each do |int|
+    tiler.interval = int
+    target = target_for(options.output_base, int)
+    tiler.write_fasta(target)
+    target.close
+  end
+else
+  # single interval
+  tiler.interval = Bio::GenomicInterval.zero_based(tiler.index.ref_seq,
+                                                   options.interval.begin,
+                                                   options.interval.end)
+  if options.output_base
+    target = target_for(options.output_base, tiler.interval)
+  else
+    target = $stdout
+  end
+  tiler.write_fasta(target)
+  target.close
+end
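
The two input formats that `maf_tile` accepts, a `BEGIN:END` option value and BED lines, can be parsed as in the sketch below. It is illustrative only: `Interval` is a hypothetical stand-in for `Bio::GenomicInterval`, and unlike the script it anchors the regular expression and uses `Integer()` so that malformed input is rejected rather than silently read as 0.

```ruby
# Hypothetical stand-in for Bio::GenomicInterval (zero-based, half-open).
Interval = Struct.new(:chrom, :zero_start, :zero_end)

# Parse the first three whitespace-separated BED fields; any further
# columns stay together in the fourth element and are ignored.
def parse_bed_line(line)
  chrom, start_s, end_s, _rest = line.split(nil, 4)
  Interval.new(chrom, Integer(start_s), Integer(end_s))
end

# Parse "BEGIN:END" into an exclusive Range, or return nil if malformed.
def parse_interval_option(text)
  m = /\A(\d+):(\d+)\z/.match(text)
  m && (m[1].to_i...m[2].to_i)
end

bed = parse_bed_line("mm8.chr7\t80082334\t80082468\tfeature1\t0\t+\n")
puts "#{bed.chrom}:#{bed.zero_start}-#{bed.zero_end}"

range = parse_interval_option('80082334:80082468')
puts range.inspect
puts parse_interval_option('12:34x').inspect
```

Returning `nil` for bad input lets the caller decide whether to print the usage banner, much as the script does with its `options.usage` flag.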

data/bio-maf.gemspec ADDED

@@ -0,0 +1,43 @@
+# -*- encoding: utf-8 -*-
+Gem::Specification.new do |s|
+  s.name = "bio-maf"
+  s.version = "0.2.0"
+  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+  s.authors = ["Clayton Wheeler"]
+  s.date = "2012-06-29"
+  s.description = "Multiple Alignment Format parser for BioRuby."
+  s.email = "cswh@umich.edu"
+  s.executables = ["maf_count", "maf_dump_blocks", "maf_extract_ranges_count", "maf_index", "maf_parse_bench", "maf_to_fasta", "maf_write", "random_ranges"]
+  s.extra_rdoc_files = [
+    "LICENSE.txt",
+    "README.md"
+                       ]
+  s.files         = `git ls-files`.split("\n")
+  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables   = `git ls-files -- bin/*`.split("\n").map {
+    |f| File.basename(f)
+  }
+  s.homepage = "http://github.com/csw/bioruby-maf"
+  s.licenses = ["MIT"]
+  s.require_paths = ["lib"]
+  s.rubygems_version = "1.8.24"
+  s.summary = "MAF parser for BioRuby"
+  s.specification_version = 3
+  if RUBY_PLATFORM == 'java'
+    s.platform = 'java'
+  end
+  s.add_runtime_dependency('bio-bigbio', [">= 0"])
+  s.add_runtime_dependency('bio-genomic-interval', ["~> 0.1.2"])
+  if RUBY_PLATFORM == 'java'
+    s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
+  else
+    s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
+  end
+end
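
Because the gemspec branches on `RUBY_PLATFORM`, the same file yields a different specification depending on which interpreter evaluates it. The sketch below shows that effect using only the standard RubyGems API. The `sketch_spec` helper, the `example-gem` name, and the `platform_name` parameter (a stand-in for `RUBY_PLATFORM`) are hypothetical; the two dependency names and version constraints are copied from the gemspec above.

```ruby
require 'rubygems'

# Build a throwaway specification the way the gemspec above does, but with
# the platform string passed in instead of read from RUBY_PLATFORM.
def sketch_spec(platform_name)
  Gem::Specification.new do |s|
    s.name    = 'example-gem' # hypothetical gem name
    s.version = '0.0.1'
    s.summary = 'Platform-dependent dependency sketch'
    s.authors = ['example']
    if platform_name == 'java'
      s.platform = 'java'
      s.add_runtime_dependency('kyotocabinet-java', '~> 0.2.0')
    else
      s.add_runtime_dependency('kyotocabinet-ruby', '~> 1.27.1')
    end
  end
end

java_spec = sketch_spec('java')
mri_spec  = sketch_spec('x86_64-linux')

[java_spec, mri_spec].each do |spec|
  puts "#{spec.full_name}: #{spec.dependencies.map(&:name).join(', ')}"
end
```

The differing `full_name` values correspond to the two built files named in DEVELOPMENT.md, `bio-maf-X.Y.Z.gem` and `bio-maf-X.Y.Z-java.gem`.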

data/features/gap-filling.feature ADDED

@@ -0,0 +1,158 @@
+Feature: Join alignment blocks with reference data
+  In order to produce FASTA output with one sequence per species
+  For use in downstream tools
+  We need to join adjacent MAF blocks together
+  And fill gaps in the reference sequence from reference data
+  Scenario: Non-overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1        10 13 +      50 GGGCTGAGGGC--AG
+    s sp2.chr5     53010 13 +   65536 GGGCTGACGGC--AG
+    s sp3.chr2     33010 15 +   65536 AGGTTTAGGGCAGAG
+    a score=21.0
+    s sp1.chr1        30 10 +      50 AGGGCGGTCC
+    s sp2.chr5     53030 10 +   65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >sp3
+    **********AGGTTTAGGGCAGAG***************************
+    """
+  Scenario: Non-overlapping MAF blocks with species map
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1        10 13 +      50 GGGCTGAGGGC--AG
+    s sp2.chr5     53010 13 +   65536 GGGCTGACGGC--AG
+    s sp3.chr2     33010 15 +   65536 AGGTTTAGGGCAGAG
+    a score=21.0
+    s sp1.chr1        30 10 +      50 AGGGCGGTCC
+    s sp2.chr5     53030 10 +   65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And map species sp1 as mouse
+    And map species sp2 as hippo
+    And map species sp3 as squid
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >mouse
+    CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
+    >hippo
+    **********GGGCTGACGGC--AG*******AGGGCGGTGC**********
+    >squid
+    **********AGGTTTAGGGCAGAG***************************
+    """
+  Scenario: Subset of non-overlapping MAF blocks in region
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1        10 13 +      50 GGGCTGAGGGC--AG
+    s sp2.chr5     53010 13 +   65536 GGGCTGACGGC--AG
+    s sp3.chr2     33010 15 +   65536 AGGTTTAGGGCAGAG
+    a score=21.0
+    s sp1.chr1        30 10 +      50 AGGGCGGTCC
+    s sp2.chr5     53030 10 +   65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:12-36 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    GCTGAGGGC--AGTTGTGTCAGGGCG
+    >sp2
+    GCTGACGGC--AG*******AGGGCG
+    >sp3
+    GTTTAGGGCAGAG*************
+    """
+  Scenario: Overlapping MAF blocks in region of interest
+    Given MAF data:
+    """
+    ##maf version=1
+    a score=20.0
+    s sp1.chr1        10 13 +      50 GGGCTGAGGGC--AG
+    s sp2.chr5     53010 13 +   65536 GGGCTGACGGC--AG
+    s sp3.chr2     33010 15 +   65536 AGGTTTAGGGCAGAG
+    a score=21.0
+    s sp1.chr1        20 10 +      50 AGGGCGGTCC
+    s sp2.chr5     53020 10 +   65536 AGGGCGGTGC
+    """
+    And chromosome reference sequence:
+    """
+    >sp1.chr1
+    CCAGGATGCT
+    GGGCTGAGGG
+    CAGTTGTGTC
+    AGGGCGGTCC
+    GGTGCAGGCA
+    """
+    When I open it with a MAF reader
+    And build an index on the reference sequence
+    And tile sp1.chr1:0-50 with the chromosome reference
+    And tile with species [sp1, sp2, sp3]
+    And write the tiled data as FASTA
+    Then the FASTA data obtained should be:
+    """
+    >sp1
+    CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
+    >sp2
+    **********GGGCTGACGGAGGGCGGTGC********************
+    >sp3
+    **********AGGTTTAGGG******************************
+    """