bio-blastxmlparser 0.6.0 → 0.6.1

Files changed (7)

data/README.rdoc +69 -28
data/VERSION +1 -1
data/bin/blastxmlparser +44 -10
data/bio-blastxmlparser.gemspec +2 -2
data/lib/bio-blastxmlparser.rb +2 -0
data/lib/bio/db/blast/xmlsplitter.rb +4 -0
metadata +4 -4

data/README.rdoc CHANGED

@@ -2,50 +2,57 @@
 blastxmlparser is a fast big-data BLAST XML file parser. Rather than
 loading everything in memory, XML is parsed by BLAST query
-(Iteration). Not only has this the advantage of low memory use, it may
-also be faster when IO continues in parallel (disks read ahead).
+(Iteration). Not only has this the advantage of low memory use, it
+also shows results early, and it may be faster when IO continues in
+parallel (disk read-ahead).
 Next to the API, blastxmlparser comes as a command line utility, which
 can be used to filter results and requires no understanding of Ruby.
 == Performance
-XML parsing is expensive. blastxmlparser uses the Nokogiri C, or Java, XML
-parser, based on libxml2. Basically a DOM parser is used for subsections of a
-document, tests show this is faster than a SAX parser with Ruby callbacks.  To
+XML parsing is expensive. blastxmlparser uses the fast Nokogiri C, or Java, XML
+parsers, based on libxml2. Basically, a DOM parser is used for subsections of a
+document. Tests show this is faster than a SAX parser with Ruby callbacks.  To
 see why libxml2 based Nokogiri is fast, see
 http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
 http://www.xml.com/lpt/a/1703.
 The parser is also designed with other optimizations, such as lazy evaluation,
-only creating objects when required, and (future) parallelization. When parsing
+only creating objects when required, and (in a future version) parallelization. When parsing
 a full BLAST result usually only a few fields are used. By using XPath queries
 only the relevant fields are queried.
 Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
-Nokogiri DOM (default)
+  bio-blastxmlparser + Nokogiri DOM (default)
-real    0m1.259s
-user    0m1.052s
-sys     0m0.144s
+  real    0m1.259s
+  user    0m1.052s
+  sys     0m0.144s
-Nokogiri split DOM
+  bio-blastxmlparser + Nokogiri split DOM
-real    0m1.713s
-user    0m1.444s
-sys     0m0.160s
+  real    0m1.713s
+  user    0m1.444s
+  sys     0m0.160s
-BioRuby ReXML DOM parser
+  BioRuby ReXML DOM parser
-real    1m14.548s
-user    1m13.065s
-sys     0m0.472s
+  real    1m14.548s
+  user    1m13.065s
+  sys     0m0.472s
 == Install
+Quick install:
   gem install bio-blastxmlparser
+Important: the parser is written for Ruby >= 1.9. You can check with
+  gem env
 Nokogiri XML parser is required. To install it,
 the libxml2 libraries and headers need to be installed first, for
 example on Debian:
@@ -56,7 +63,7 @@ example on Debian:
 for more installation on other platforms see
 http://nokogiri.org/tutorials/installing_nokogiri.html.
-== API
+== API (Ruby library)
 To loop through a BLAST result:
@@ -72,12 +79,13 @@ To loop through a BLAST result:
     >>     end
     >>   end
-The next example parses XML using less memory
+The next example parses XML using less memory by using a Ruby
+Iterator
-    >> blast = XmlSplitterIterator.new(fn).to_enum
+    >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
     >> iter = blast.next
     >> iter.iter_num
-    >> 1
+    => 1
     >> iter.query_id
     => "lcl|1_0"
@@ -132,14 +140,19 @@ Get the first Hsp
     >> hsp.midline
     => "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"
-It is possible to use the XML element names, over methods. E.g.
+Unlike BioRuby, this module uses the actual element names in the XML
+definition, to avoid confusion (if anyone wants a translation,
+feel free to contribute an adaptor).
+It is also possible to use the XML element names as Strings, rather
+than methods. E.g.
     >> hsp.field("Hsp_bit-score")
     => "145.205"
     >> hsp["Hsp_bit-score"]
     => "145.205"
-Note that these are always String values.
+Note that, when using the element names, the results are always String values.
 Fetch the next result (Iteration)
@@ -153,11 +166,14 @@ etc. etc.
 For more examples see the files in ./spec
-== Usage
+== Command line usage
+== Usage
   blastxmlparser [options] file(s)
     -p, --parser name                Use full|split parser (default full)
+        --output-fasta               Output FASTA
     -n, --named fields               Set named fields
     -e, --exec filter                Execute filter
@@ -182,11 +198,23 @@ Print fields where bit_score > 145
   blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-It is also possible to use the XML element names directly
+prints a tab delimited
+  1       1       lcl|1_0 lcl|I_74685     1       5.82208e-34
+  2       1       lcl|1_0 lcl|I_1 1       5.82208e-34
+  3       2       lcl|2_0 lcl|I_2 1       6.05436e-59
+  4       3       lcl|3_0 lcl|I_3 1       2.03876e-56
+The second and third column show the BLAST iteration, and the others
+relate to the hits.
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
   blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
-Print named fields where E-value < 0.001 and hit length > 100
+And it is possible to print (non default) named fields where E-value < 0.001
+and hit length > 100. E.g.
   blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
@@ -197,7 +225,20 @@ Print named fields where E-value < 0.001 and hit length > 100
   5       2.76378e-11     GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
   etc. etc.
-To use the low-mem version use
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+  blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+  >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+  >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+  etc. etc.
+To use the low-mem (iterated slower) version of the parser use
   blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
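
The README text above describes the `-e` filter as "evaluated Ruby". A minimal sketch of that idea follows, assuming the filter string is evaluated with `hit` and `hsp` in scope. `ToyHit`, `ToyHsp` and `keep_record?` are hypothetical names used only for illustration and are not part of the gem; the real objects come from the parser.

```ruby
# Hypothetical stand-ins for the hit and HSP objects the parser yields.
ToyHit = Struct.new(:len)
ToyHsp = Struct.new(:evalue, :bit_score)

# Returns true when the record should be printed. With no filter every
# record passes (an assumption about the script's default behaviour).
def keep_record?(filter, hit, hsp)
  return true if filter.nil?
  # `hit` and `hsp` are visible to the evaluated expression through
  # this method's binding.
  binding.eval(filter) ? true : false
end

hit = ToyHit.new(408)
hsp = ToyHsp.new(2.0e-56, 145.205)
puts keep_record?('hsp.evalue<0.01 and hit.len>100', hit, hsp)
puts keep_record?('hsp.bit_score>146', hit, hsp)
```

Because the filter is run through `eval`, it can execute arbitrary Ruby, so this approach only suits expressions typed by the person running the command.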

data/VERSION CHANGED

@@ -1 +1 @@
-0.6.0
+0.6.1

data/bin/blastxmlparser CHANGED

@@ -30,11 +30,23 @@ Print fields where bit_score > 145
   blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-It is also possible to use the XML element names directly
+prints a tab delimited
+  1       1       lcl|1_0 lcl|I_74685     1       5.82208e-34
+  2       1       lcl|1_0 lcl|I_1 1       5.82208e-34
+  3       2       lcl|2_0 lcl|I_2 1       6.05436e-59
+  4       3       lcl|3_0 lcl|I_3 1       2.03876e-56
+The second and third column show the BLAST iteration, and the others
+relate to the hits.
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
   blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
-Print named fields where E-value < 0.001 and hit length > 100
+And it is possible to print (non default) named fields where E-value < 0.001
+and hit length > 100. E.g.
   blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
@@ -45,7 +57,20 @@ Print named fields where E-value < 0.001 and hit length > 100
   5       2.76378e-11     GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
   etc. etc.
-To use the low-mem version use
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+  blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+  >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+  >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+  etc. etc.
+To use the low-mem (iterated slower) version of the parser use
   blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
@@ -90,6 +115,10 @@ opts = OptionParser.new do |o|
     options.parser = p.to_sym
   end
+  o.on("--output-fasta","Output FASTA") do |b|
+    options.output_fasta = true
+  end
   o.on("-n fields","--named fields",String, "Set named fields") do |s|
     options.fields = s.split(/,/)
   end
@@ -145,14 +174,19 @@ begin
                        true
                      end
           if do_print
-            if options.fields
-              print i,"\t"
-              options.fields.each do | f |
-                print eval(f),"\t"
-              end
-              print "\n"
+            if options.output_fasta
+              print ">"+hit.accession+' '+iter.iter_num.to_s+'|'+iter.query_id+' '+hit.hit_id+' '+hit.hit_def+"\n"
+              print hsp.qseq+"\n"
             else
-              print [i,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t"),"\n"
+              if options.fields
+                print i,"\t"
+                options.fields.each do | f |
+                  print eval(f),"\t"
+                end
+                print "\n"
+                else
+                  print [i,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t"),"\n"
+              end
             end
             i += 1
           end
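
The last hunk above turns the print step into a three-way choice: FASTA output, named fields, or the default tab-delimited row. A minimal sketch of that branching follows. `DemoIter`, `DemoHit`, `DemoHsp` and `format_record` are hypothetical names introduced here; the script itself prints directly rather than returning strings, and keyword arguments need Ruby 2.0 or later.

```ruby
# Hypothetical stand-ins for the objects the parser yields; only the
# attributes used by the hunk above are modelled.
DemoIter = Struct.new(:iter_num, :query_id)
DemoHit  = Struct.new(:accession, :hit_id, :hit_def)
DemoHsp  = Struct.new(:hsp_num, :evalue, :qseq)

# Returns the text for one matching HSP, following the order of the
# checks in the diff: FASTA first, then named fields, then the default.
def format_record(i, iter, hit, hsp, output_fasta: false, fields: nil)
  if output_fasta
    header = ">#{hit.accession} #{iter.iter_num}|#{iter.query_id} #{hit.hit_id} #{hit.hit_def}"
    "#{header}\n#{hsp.qseq}\n"
  elsif fields
    # Each field string is evaluated as Ruby; the expressions see
    # iter, hit and hsp through this method's binding.
    b = binding
    values = fields.map { |f| b.eval(f) }
    "#{i}\t" + values.map { |v| "#{v}\t" }.join + "\n"
  else
    [i, iter.iter_num, iter.query_id, hit.hit_id, hsp.hsp_num, hsp.evalue].join("\t") + "\n"
  end
end

iter = DemoIter.new(1, "lcl|1_0")
hit  = DemoHit.new("I_1", "lcl|I_1", "[477 - 884]")
hsp  = DemoHsp.new(1, 0.001, "AGTGAAGC")
print format_record(2, iter, hit, hsp, output_fasta: true)
print format_record(2, iter, hit, hsp, fields: ["hsp.evalue", "hsp.qseq"])
print format_record(2, iter, hit, hsp)
```

As in the script, the named-fields row keeps a trailing tab after the last value, and `--output-fasta` takes precedence over `-n` when both are given.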

data/bio-blastxmlparser.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{bio-blastxmlparser}
-  s.version = "0.6.0"
+  s.version = "0.6.1"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Pjotr Prins"]
-  s.date = %q{2011-02-14}
+  s.date = %q{2011-04-26}
   s.default_executable = %q{blastxmlparser}
   s.description = %q{Fast big data XML parser and library, written in Ruby}
   s.email = %q{pjotr.public01@thebird.nl}

data/lib/bio-blastxmlparser.rb CHANGED

@@ -10,6 +10,8 @@ else
 end
 require 'bio-logger'
+require 'enumerator'
 Bio::Log::LoggerPlus.new('bio-blastxmlparser')
 require 'bio/db/blast/parser/nokogiri'

data/lib/bio/db/blast/xmlsplitter.rb CHANGED

@@ -1,8 +1,12 @@
+require 'enumerator'
 module Bio
   module Blast
     # Reads a full XML result and splits it out into a buffer for each
     # Iteration (query result).
     class XmlSplitterIterator
+      # include Enumerable
       def initialize fn
         @fn = fn
       end
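
The README change in this release calls `to_enum` on `XmlSplitterIterator` and then pulls results with `next`. A minimal sketch of that pattern follows, assuming a class that yields one buffer per `<Iteration>` from its `each` method. `ToyIterationSplitter` is a hypothetical stand-in that reads from any IO object; the gem's class takes a file name, and its internals are not shown in this diff.

```ruby
require 'stringio'

# Hypothetical splitter: collects the lines between <Iteration> and
# </Iteration> and yields them as one string per query result.
class ToyIterationSplitter
  def initialize(io)
    @io = io
  end

  def each
    buffer = nil
    @io.each_line do |line|
      buffer = [] if line.include?("<Iteration>")
      buffer << line if buffer
      if buffer && line.include?("</Iteration>")
        yield buffer.join
        buffer = nil
      end
    end
  end
end

xml = <<XML
<BlastOutput>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
</Iteration>
<Iteration>
  <Iteration_iter-num>2</Iteration_iter-num>
</Iteration>
</BlastOutput>
XML

# to_enum wraps `each` in an external iterator, so results can be
# pulled one at a time instead of being handed to a block.
blast = ToyIterationSplitter.new(StringIO.new(xml)).to_enum
first = blast.next
puts first
```

Only one iteration's text is held in the buffer at a time, which is the property the README's "less memory" example relies on. Calling `next` after the last iteration raises `StopIteration`.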

metadata CHANGED

@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
   segments:
   - 0
   - 6
-  - 0
-  version: 0.6.0
+  - 1
+  version: 0.6.1
 platform: ruby
 authors:
 - Pjotr Prins
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-02-14 00:00:00 +01:00
+date: 2011-04-26 00:00:00 +02:00
 default_executable: blastxmlparser
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -156,7 +156,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      hash: 4630273
+      hash: 169663261
       segments:
       - 0
       version: "0"