RubyGems - bio-blastxmlparser - Versions diffs - 2.0.0 → 2.0.1 - Mend

bio-blastxmlparser 2.0.0 → 2.0.1

Files changed (8)

checksums.yaml +4 -4
data/README.md +40 -41
data/VERSION +1 -1
data/bin/blastxmlparser +30 -7
data/bio-blastxmlparser.gemspec +2 -2
data/lib/bio/db/blast/xmliterator.rb +1 -1
data/lib/bio/db/blast/xmlsplitter.rb +29 -18
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: b96fa7141abe77c13e1e34eb1d920a624c22f83d
-  data.tar.gz: 1b038243195b478d18b2a2c72b3fb0b4538a6701
+  metadata.gz: 7222df89b2f60ef4b027ea7ca766a30c04de567b
+  data.tar.gz: b8d7c84c85dd58e7794a62b83a73b17a04b60ce1
 SHA512:
-  metadata.gz: 87ea99f1ac87b528e8a08c5490cb500038fdd06ea18def8cda1276cf5e7ba7976ca4f0a49934023be16ea122ae1d91e514ad5a5fbdf282245a105ce22131dc6f
-  data.tar.gz: 2237ce97c067f42123c9aeba9ae8cb77cf97d11ef0ad7211ceb8a7e6110f6cc6c71d2a7a2e1e1dcf31b5f6c67d55af8a992b82eb43e377caf858fff6cac3e4d9
+  metadata.gz: e9feee95e3063b0c6c9e9ac28c0f7389e4036130e51c107314216fe8e30f98342d2fbc5f1af0ef16f9c5a11be95aa97d86d16f5d9e2169eda2a54d2594c0dc84
+  data.tar.gz: 63971bd220b178e7ff0dbd7c50a4df6277b9dc8035f610f173603e2e304655610e65894fb3bde7e67c17963d13557b22c3c1d4a3e6dbadbf65c7f170ddbd12f5

data/README.md CHANGED

@@ -8,76 +8,73 @@ to:
 * Parse BLAST XML
 * Filter output
-* Generate FASTA, JSON, YAML, RDF, HTML, tabular output etc.
+* Generate FASTA, JSON, YAML, RDF, JSON-LD, HTML, csv, tabular output etc.
 Rather than loading everything in memory, XML is parsed by BLAST query
 (Iteration). Not only has this the advantage of low memory use, it also shows
-results early, and it may be faster when IO continues in parallel (disk
+results early, and it is faster when IO continues in parallel (disk
 read-ahead).
-Next to the API, blastxmlparser comes as a command line utility, which
+blastxmlparser comes as a command line utility, which
 can be used to filter results and requires no understanding of Ruby.
 # Quick start
 ```sh
   gem install bio-blastxmlparser
+  gem install parallel # if you want multi-core support
   blastxmlparser --help
 ```
 ## Performance
-XML parsing is expensive. blastxmlparser can use the fast Nokogiri C, or
-Java XML parsers, based on libxml2 in parallel. A DOM parser is used
-after splitting the BLAST XML document into subsections.
-Tests show this is faster than a SAX
-parser with Ruby callbacks.  To see why libxml2 based Nokogiri is
-fast, see this
-[benchmark](http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html)
-and [xml.com](http://www.xml.com/lpt/a/1703).
+XML parsing and transformation is expensive. blastxmlparser can use
+the fast Nokogiri C, or Java XML parsers, based on libxml2 in
+parallel. A DOM parser is used after splitting the BLAST XML document
+into subsections.  Tests show this is faster than a SAX parser with
+Ruby callbacks.  To see why libxml2 based Nokogiri is fast, see
+[xml.com](http://www.xml.com/lpt/a/1703). And blastxmlparser uses
+Nokogiri in parallel.
 Blastxmlparser is designed with other optimizations, such as lazy
-evaluation, i.e., only creating objects when required, and
-parallelism. When parsing a full BLAST result usually only a few
-fields are used. By using XPath queries the parser makes sure only the
-relevant fields are queried.
+evaluation, i.e., only creating objects when required. When parsing a
+full BLAST result usually only a few fields are used. By using XPath
+queries the parser makes sure only the relevant fields are queried.
-Timings for parsing a 128 Mb BLAST XML file on 4x1.2GHz laptop
+Timings for parsing a 1 Gb BLAST XML file on 4-core 1.2GHz laptop
 ```
-  real    0m13.985s
-  user    0m44.951s
-  sys     0m3.676s
+  real    2m40.248s
+  user    8m11.075s
+  sys     0m37.198s
 ```
-which makes for pretty good core utilisation.
+which makes for pretty good core utilisation and limited RAM use. If
+you have enough RAM it may make sense to try the `--parser nosplit'
+option which starts by reading the full DOM into RAM. It may be faster
+and show different IO characteristics.
 ## Install
 ```sh
-  gem install bio-blastxmlparser
+  gem install parallel bio-blastxmlparser
 ```
-Important: the parser is written for Ruby >= 1.9. Check with
+Important: the parser is written for Ruby 1.9 or later. Check with
 ```sh
   ruby -v
   gem env
 ```
-Nokogiri XML parser is required. To install it,
-the libxml2 libraries and headers need to be installed first, for
-example on Debian:
+Nokogiri XML parser is required. To install it, the libxml2 libraries and
+headers may need to be installed first, for example on Debian:
 ```sh
   apt-get install libxslt-dev libxml2-dev
   gem install bio-blastxmlparser
 ```
-Nokogiri balks when libxml2 or libxslt is missing on your system (or
-may install something automatically). In the worst case you'll have to
-provide build paths, as described [here](http://nokogiri.org/tutorials/installing_nokogiri.html).
 ## Command line usage
 ### Usage
@@ -85,8 +82,10 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
 ```
   blastxmlparser [options] file(s)
-    -p, --parser name                Use full|split parser (default full)
-    -e, --exec filter                Evaluate filter
+    -p, --parser name                Use split|nosplit parser (default split)
+        --filter filter              Filtering expression
+        --threads num                Use parallel threads
+    -e, --exec filter                Evaluate filter (deprecated)
     -n, --named fields               Print named fields
         --output-fasta               Output FASTA
@@ -105,7 +104,7 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
 Print result fields of iterations containing 'lcl', using a regex
 ```sh
-  blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
+  blastxmlparser --filter 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
 ```
 prints a tab delimited
@@ -124,20 +123,20 @@ As this is evaluated Ruby, it is also possible to use the XML element
 names directly
 ```sh
-  blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
+  blastxmlparser --filter 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
 ```
 Or the shorter
 ```sh
-  blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
+  blastxmlparser --filter 'hsp.bit_score>145' test/data/nt_example_blastn.m7
 ```
 And it is possible to print (non default) named fields where E-value < 0.001
 and hit length > 100. E.g.
 ```sh
-  blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+  blastxmlparser -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
   1       5.82208e-34     AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
   2       5.82208e-34     AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
@@ -150,7 +149,7 @@ and hit length > 100. E.g.
 prints the evalue and qseq columns. To output FASTA use --output-fasta
 ```sh
-  blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+  blastxmlparser --output-fasta --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 which prints matching sequences, where the first field is the accession, followed
@@ -170,7 +169,7 @@ To have more output options blastxmlparser can use an [ERB
 template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
 very flexible option that can output textual formats such as JSON, YAML, HTML
 and RDF. Examples are provided in
-[./templates](https://github.com/pjotrp/bioruby-vcf/templates/). A JSON
+[./templates](https://github.com/pjotrp/blastxmlparser/templates/). A JSON
 template could be
 ```Javascript
@@ -189,7 +188,7 @@ template could be
 To get JSON, run it with
 ```sh
-  blastxmlparser --template template/blast2json.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+  blastxmlparser --template template/blast2json.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 ```Javascript
@@ -208,7 +207,7 @@ To get JSON, run it with
 Likewise, using the RDF template
 ```sh
-  blastxmlparser --template template/blast2rdf.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+  blastxmlparser --template template/blast2rdf.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 ```ruby
@@ -231,10 +230,10 @@ Likewise, using the RDF template
 ## Additional options
-To use the low-mem (iterated slower) version of the parser use
+To use the high-mem version of the parser (slightly faster on single core) use
 ```sh
-  blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+  blastxmlparser --parser nosplit --threads 1 -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 ## API (Ruby library)

data/VERSION CHANGED

@@ -1 +1 @@
-2.0.0
+2.0.1

data/bin/blastxmlparser CHANGED

@@ -48,7 +48,7 @@ opts = OptionParser.new do |o|
   o.separator ""
-  o.on("-p name", "--parser name", "Use full|split parser (default full)") do |p|
+  o.on("-p name", "--parser name", "Use split|nosplit parser (default split)") do |p|
     options.parser = p.to_sym
   end
@@ -127,16 +127,32 @@ begin
   ARGV.each do | fn |
     logger.info("XML parsing #{fn}")
-    n = if options.parser == :split
-      Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
+    parser_type = options.parser
+    if !parser_type
+      # If a file is smaller than 0.5 Gb the nosplit parser is used by default for performance
+      if File.size(fn) > 512_000_000
+        parser_type = :split
+      else
+        parser_type = :nosplit
+      end
+    end
+    n = if parser_type == :nosplit
+      Bio::BlastXMLParser::NokogiriBlastXml.new(File.new(fn)).to_enum
     else
-      Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
+      # default
+      Bio::BlastXMLParser::BlastXmlSplitter.new(fn)
     end
     chunks = []
     chunks_count = 0
     NUM_CHUNKS=10_000
-    process = lambda { |iter,i|  # Process one BLAST iter block
+    process = lambda { |iter2,i|  # Process one BLAST iter block
+      if parser_type == :nosplit
+        iter = iter2
+      else
+        xml = Nokogiri::XML.parse(iter2.join) { | cfg | cfg.noblanks }
+        iter = Bio::BlastXMLParser::NokogiriBlastIterator.new(xml,self,:prefix=>nil)
+      end
       res = []
       line_count = 0
       hit_count = 0
@@ -164,7 +180,7 @@ begin
                 end
                 res << out.join("\t")+"\n"
               else
-                res << [hit_count,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
+                res << [iter.iter_num,iter.query_id,hit_count,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
               end
             end
           end
@@ -188,9 +204,16 @@ begin
         chunks << iter
         chunks_count += 1
         if chunks.size > NUM_CHUNKS
-          output.call Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
+          out = Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
             process.call(iter,i)
           }
+          # Output is forked to a separate process too
+          fork do
+            output.call out
+            STDOUT.flush
+            STDOUT.close
+            exit 0
+          end
           chunks = []
         end
       end
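
Reviewer note: the help string in this hunk advertises "(default split)", but when no parser is requested the new code picks `:nosplit` for any file of at most 512,000,000 bytes. A minimal sketch of that selection rule follows, assuming `options.parser` is nil when `-p` is omitted (the option defaults are not visible in this diff). The helper name `choose_parser` and the constant are hypothetical and not part of the gem.

```ruby
# Hypothetical helper, not part of bio-blastxmlparser: it mirrors the
# size check added to bin/blastxmlparser in the hunk above.
SPLIT_THRESHOLD_BYTES = 512_000_000

# requested: the Symbol given with --parser, or nil when none was given
#            (assumption: the script leaves options.parser unset by default).
# file_size: size of the input file in bytes, e.g. File.size(fn).
def choose_parser(requested, file_size)
  return requested if requested
  # Large inputs are split per Iteration to limit RAM use;
  # smaller inputs are read as one DOM.
  file_size > SPLIT_THRESHOLD_BYTES ? :split : :nosplit
end

puts choose_parser(nil, 1_000)          # small file, no explicit choice
puts choose_parser(nil, 2_000_000_000)  # large file, no explicit choice
puts choose_parser(:split, 1_000)       # an explicit choice is kept
```

An explicit `--parser` value always wins in this sketch, which matches the `if !parser_type` guard in the diff.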

data/bio-blastxmlparser.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "bio-blastxmlparser"
-  s.version = "2.0.0"
+  s.version = "2.0.1"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Pjotr Prins"]
-  s.date = "2014-09-06"
+  s.date = "2014-09-07"
   s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby and comes with a nice CLI"
   s.email = "pjotr.public01@thebird.nl"
   s.executables = ["blastxmlparser"]

data/lib/bio/db/blast/xmliterator.rb CHANGED

@@ -11,7 +11,7 @@ module Bio
       def to_enum
         logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
-        logger.info("parsing (full) #{@fn}")
+        logger.info("parsing (:nosplit) #{@fn}")
         NokogiriBlastXml.new(File.new(@fn)).to_enum
       end
     end

data/lib/bio/db/blast/xmlsplitter.rb CHANGED

@@ -4,27 +4,21 @@ module Bio
   module BlastXMLParser
     # Reads a full XML result and splits it out into a buffer for each
     # Iteration (query result).
-    class XmlSplitterIterator
-      # include Enumerable
+    class BlastXmlSplitter
       def initialize fn
         @fn = fn
       end
-      def to_enum
-        Enumerator.new do | yielder |
-          logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
-          logger.info("split file parsing #{@fn}")
-          f = File.open(@fn)
-          # Skip BLAST header
-          f.each_line do | line |
-            break if line.strip == "<Iteration>"
-          end
-          # Return each Iteration as an XML DOM
-          each_iteration(f) do | buf |
-            iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
-            yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
-          end
+      def each
+        logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
+        logger.info("split file parsing #{@fn}")
+        f = File.open(@fn)
+        # Skip BLAST header
+        f.each_line do | line |
+          break if line.strip == "<Iteration>"
+        end
+        # Return each Iteration as an XML DOM
+        each_iteration(f) do | buf |
+          yield buf
         end
       end
@@ -43,5 +37,22 @@ module Bio
         end
       end
     end
+    class XmlSplitterIterator
+      # include Enumerable
+      def initialize fn
+        @splitter = BlastXmlSplitter.new(fn)
+      end
+      def to_enum
+        Enumerator.new do | yielder |
+          @splitter.each do | buf |
+            iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
+            yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
+          end
+        end
+      end
+    end
   end
 end
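
Reviewer note: the refactor above makes `BlastXmlSplitter#each` yield raw line buffers, so the DOM parse can happen later (per the bin/blastxmlparser hunk, inside the `process` lambda). The gem's `each_iteration` body is not shown in this diff, so the following is only a sketch of the same pattern under stated assumptions: `each_iteration_buffer` is a hypothetical stand-in, and it assumes `<Iteration>` and `</Iteration>` each sit on their own line.

```ruby
require 'stringio'

# Hypothetical stand-in for the splitter: yields one Array of lines per
# <Iteration> block without parsing any XML.
def each_iteration_buffer(io)
  # Skip the BLAST header up to and including the first <Iteration> line.
  io.each_line { |line| break if line.strip == "<Iteration>" }
  buf = ["<Iteration>\n"]
  io.each_line do |line|
    buf << line
    if line.strip == "</Iteration>"
      yield buf
      buf = [] # the next <Iteration> line starts a fresh buffer
    end
  end
  # Trailing closing tags after the last block are never yielded.
end

sample = <<~XML
  <?xml version="1.0"?>
  <BlastOutput>
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
  </BlastOutput>
XML

each_iteration_buffer(StringIO.new(sample)) do |buf|
  # In the gem, a worker would call buf.join and hand it to Nokogiri here.
  puts buf.join
end
```

Yielding plain arrays of strings keeps the items passed to worker processes cheap to serialize, which is one plausible motivation for moving the Nokogiri parse out of the splitter.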

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bio-blastxmlparser
 version: !ruby/object:Gem::Version
-  version: 2.0.0
+  version: 2.0.1
 platform: ruby
 authors:
 - Pjotr Prins
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-09-06 00:00:00.000000000 Z
+date: 2014-09-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bio-logger