bio-blastxmlparser 2.0.0 → 2.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: b96fa7141abe77c13e1e34eb1d920a624c22f83d
4
- data.tar.gz: 1b038243195b478d18b2a2c72b3fb0b4538a6701
3
+ metadata.gz: 7222df89b2f60ef4b027ea7ca766a30c04de567b
4
+ data.tar.gz: b8d7c84c85dd58e7794a62b83a73b17a04b60ce1
5
5
  SHA512:
6
- metadata.gz: 87ea99f1ac87b528e8a08c5490cb500038fdd06ea18def8cda1276cf5e7ba7976ca4f0a49934023be16ea122ae1d91e514ad5a5fbdf282245a105ce22131dc6f
7
- data.tar.gz: 2237ce97c067f42123c9aeba9ae8cb77cf97d11ef0ad7211ceb8a7e6110f6cc6c71d2a7a2e1e1dcf31b5f6c67d55af8a992b82eb43e377caf858fff6cac3e4d9
6
+ metadata.gz: e9feee95e3063b0c6c9e9ac28c0f7389e4036130e51c107314216fe8e30f98342d2fbc5f1af0ef16f9c5a11be95aa97d86d16f5d9e2169eda2a54d2594c0dc84
7
+ data.tar.gz: 63971bd220b178e7ff0dbd7c50a4df6277b9dc8035f610f173603e2e304655610e65894fb3bde7e67c17963d13557b22c3c1d4a3e6dbadbf65c7f170ddbd12f5
data/README.md CHANGED
@@ -8,76 +8,73 @@ to:
8
8
 
9
9
  * Parse BLAST XML
10
10
  * Filter output
11
- * Generate FASTA, JSON, YAML, RDF, HTML, tabular output etc.
11
+ * Generate FASTA, JSON, YAML, RDF, JSON-LD, HTML, csv, tabular output etc.
12
12
 
13
13
  Rather than loading everything in memory, XML is parsed by BLAST query
14
14
  (Iteration). Not only has this the advantage of low memory use, it also shows
15
- results early, and it may be faster when IO continues in parallel (disk
15
+ results early, and it is faster when IO continues in parallel (disk
16
16
  read-ahead).
17
17
 
18
- Next to the API, blastxmlparser comes as a command line utility, which
18
+ blastxmlparser comes as a command line utility, which
19
19
  can be used to filter results and requires no understanding of Ruby.
20
20
 
21
21
  # Quick start
22
22
 
23
23
  ```sh
24
24
  gem install bio-blastxmlparser
25
+ gem install parallel # if you want multi-core support
25
26
  blastxmlparser --help
26
27
  ```
27
28
 
28
29
  ## Performance
29
30
 
30
- XML parsing is expensive. blastxmlparser can use the fast Nokogiri C, or
31
- Java XML parsers, based on libxml2 in parallel. A DOM parser is used
32
- after splitting the BLAST XML document into subsections.
33
- Tests show this is faster than a SAX
34
- parser with Ruby callbacks. To see why libxml2 based Nokogiri is
35
- fast, see this
36
- [benchmark](http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html)
37
- and [xml.com](http://www.xml.com/lpt/a/1703).
31
+ XML parsing and transformation is expensive. blastxmlparser can use
32
+ the fast Nokogiri C, or Java XML parsers, based on libxml2 in
33
+ parallel. A DOM parser is used after splitting the BLAST XML document
34
+ into subsections. Tests show this is faster than a SAX parser with
35
+ Ruby callbacks. To see why libxml2 based Nokogiri is fast, see
36
+ [xml.com](http://www.xml.com/lpt/a/1703). And blastxmlparser uses
37
+ Nokogiri in parallel.
38
38
 
39
39
  Blastxmlparser is designed with other optimizations, such as lazy
40
- evaluation, i.e., only creating objects when required, and
41
- parallelism. When parsing a full BLAST result usually only a few
42
- fields are used. By using XPath queries the parser makes sure only the
43
- relevant fields are queried.
40
+ evaluation, i.e., only creating objects when required. When parsing a
41
+ full BLAST result usually only a few fields are used. By using XPath
42
+ queries the parser makes sure only the relevant fields are queried.
44
43
 
45
- Timings for parsing a 128 Mb BLAST XML file on 4x1.2GHz laptop
44
+ Timings for parsing a 1 Gb BLAST XML file on 4-core 1.2GHz laptop
46
45
 
47
46
  ```
48
- real 0m13.985s
49
- user 0m44.951s
50
- sys 0m3.676s
47
+ real 2m40.248s
48
+ user 8m11.075s
49
+ sys 0m37.198s
51
50
  ```
52
51
 
53
- which makes for pretty good core utilisation.
52
+ which makes for pretty good core utilisation and limited RAM use. If
53
+ you have enough RAM it may make sense to try the `--parser nosplit`
54
+ option which starts by reading the full DOM into RAM. It may be faster
55
+ and show different IO characteristics.
54
56
 
55
57
  ## Install
56
58
 
57
59
  ```sh
58
- gem install bio-blastxmlparser
60
+ gem install parallel bio-blastxmlparser
59
61
  ```
60
62
 
61
- Important: the parser is written for Ruby >= 1.9. Check with
63
+ Important: the parser is written for Ruby 1.9 or later. Check with
62
64
 
63
65
  ```sh
64
66
  ruby -v
65
67
  gem env
66
68
  ```
67
69
 
68
- Nokogiri XML parser is required. To install it,
69
- the libxml2 libraries and headers need to be installed first, for
70
- example on Debian:
70
+ Nokogiri XML parser is required. To install it, the libxml2 libraries and
71
+ headers may need to be installed first, for example on Debian:
71
72
 
72
73
  ```sh
73
74
  apt-get install libxslt-dev libxml2-dev
74
75
  gem install bio-blastxmlparser
75
76
  ```
76
77
 
77
- Nokogiri balks when libxml2 or libxslt is missing on your system (or
78
- may install something automatically). In the worst case you'll have to
79
- provide build paths, as described [here](http://nokogiri.org/tutorials/installing_nokogiri.html).
80
-
81
78
  ## Command line usage
82
79
 
83
80
  ### Usage
@@ -85,8 +82,10 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
85
82
  ```
86
83
  blastxmlparser [options] file(s)
87
84
 
88
- -p, --parser name Use full|split parser (default full)
89
- -e, --exec filter Evaluate filter
85
+ -p, --parser name Use split|nosplit parser (default split)
86
+ --filter filter Filtering expression
87
+ --threads num Use parallel threads
88
+ -e, --exec filter Evaluate filter (deprecated)
90
89
 
91
90
  -n, --named fields Print named fields
92
91
  --output-fasta Output FASTA
@@ -105,7 +104,7 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
105
104
  Print result fields of iterations containing 'lcl', using a regex
106
105
 
107
106
  ```sh
108
- blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
107
+ blastxmlparser --filter 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
109
108
  ```
110
109
 
111
110
  prints a tab delimited
@@ -124,20 +123,20 @@ As this is evaluated Ruby, it is also possible to use the XML element
124
123
  names directly
125
124
 
126
125
  ```sh
127
- blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
126
+ blastxmlparser --filter 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
128
127
  ```
129
128
 
130
129
  Or the shorter
131
130
 
132
131
  ```sh
133
- blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
132
+ blastxmlparser --filter 'hsp.bit_score>145' test/data/nt_example_blastn.m7
134
133
  ```
135
134
 
136
135
  And it is possible to print (non default) named fields where E-value < 0.01
137
136
  and hit length > 100. E.g.
138
137
 
139
138
  ```sh
140
- blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
139
+ blastxmlparser -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
141
140
 
142
141
  1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
143
142
  2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
@@ -150,7 +149,7 @@ and hit length > 100. E.g.
150
149
  prints the evalue and qseq columns. To output FASTA use --output-fasta
151
150
 
152
151
  ```sh
153
- blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
152
+ blastxmlparser --output-fasta --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
154
153
  ```
155
154
 
156
155
  which prints matching sequences, where the first field is the accession, followed
@@ -170,7 +169,7 @@ To have more output options blastxmlparser can use an [ERB
170
169
  template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
171
170
  very flexible option that can output textual formats such as JSON, YAML, HTML
172
171
  and RDF. Examples are provided in
173
- [./templates](https://github.com/pjotrp/bioruby-vcf/templates/). A JSON
172
+ [./templates](https://github.com/pjotrp/blastxmlparser/templates/). A JSON
174
173
  template could be
175
174
 
176
175
  ```Javascript
@@ -189,7 +188,7 @@ template could be
189
188
  To get JSON, run it with
190
189
 
191
190
  ```sh
192
- blastxmlparser --template template/blast2json.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
191
+ blastxmlparser --template template/blast2json.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
193
192
  ```
194
193
 
195
194
  ```Javascript
@@ -208,7 +207,7 @@ To get JSON, run it with
208
207
  Likewise, using the RDF template
209
208
 
210
209
  ```sh
211
- blastxmlparser --template template/blast2rdf.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
210
+ blastxmlparser --template template/blast2rdf.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
212
211
  ```
213
212
 
214
213
  ```ruby
@@ -231,10 +230,10 @@ Likewise, using the RDF template
231
230
 
232
231
  ## Additional options
233
232
 
234
- To use the low-mem (iterated slower) version of the parser use
233
+ To use the high-mem version of the parser (slightly faster on single core) use
235
234
 
236
235
  ```sh
237
- blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
236
+ blastxmlparser --parser nosplit --threads 1 -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
238
237
  ```
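The README describes the `--filter` expressions as evaluated Ruby. A minimal sketch of how such an expression could be evaluated against a hit and an HSP is given here. The `Hit` and `Hsp` structs and the `passes_filter?` helper are hypothetical stand-ins for illustration, not the gem's own classes or its actual implementation.

```ruby
# Hypothetical stand-ins for the parser's hit and HSP objects.
Hit = Struct.new(:len)
Hsp = Struct.new(:evalue, :bit_score)

# Evaluate a filter string with `hit` and `hsp` visible as local variables.
# Assumption: the real CLI does something comparable; this is only a sketch.
def passes_filter?(filter, hit, hsp)
  eval(filter, binding) ? true : false
end

hit = Hit.new(150)
hsp = Hsp.new(5.82208e-34, 145.2)
puts passes_filter?('hsp.evalue<0.01 and hit.len>100', hit, hsp)
puts passes_filter?('hsp.bit_score>146', hit, hsp)
```

Because the string goes through `eval`, a filter written this way runs arbitrary Ruby, so it should only come from a trusted user.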
239
238
 
240
239
  ## API (Ruby library)
data/VERSION CHANGED
@@ -1 +1 @@
1
- 2.0.0
1
+ 2.0.1
@@ -48,7 +48,7 @@ opts = OptionParser.new do |o|
48
48
 
49
49
  o.separator ""
50
50
 
51
- o.on("-p name", "--parser name", "Use full|split parser (default full)") do |p|
51
+ o.on("-p name", "--parser name", "Use split|nosplit parser (default split)") do |p|
52
52
  options.parser = p.to_sym
53
53
  end
54
54
 
@@ -127,16 +127,32 @@ begin
127
127
 
128
128
  ARGV.each do | fn |
129
129
  logger.info("XML parsing #{fn}")
130
- n = if options.parser == :split
131
- Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
130
+ parser_type = options.parser
131
+ if !parser_type
132
+ # If a file is smaller than 0.5 Gb the nosplit parser is used by default for performance
133
+ if File.size(fn) > 512_000_000
134
+ parser_type = :split
135
+ else
136
+ parser_type = :nosplit
137
+ end
138
+ end
139
+ n = if parser_type == :nosplit
140
+ Bio::BlastXMLParser::NokogiriBlastXml.new(File.new(fn)).to_enum
132
141
  else
133
- Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
142
+ # default
143
+ Bio::BlastXMLParser::BlastXmlSplitter.new(fn)
134
144
  end
135
145
  chunks = []
136
146
  chunks_count = 0
137
147
  NUM_CHUNKS=10_000
138
148
 
139
- process = lambda { |iter,i| # Process one BLAST iter block
149
+ process = lambda { |iter2,i| # Process one BLAST iter block
150
+ if parser_type == :nosplit
151
+ iter = iter2
152
+ else
153
+ xml = Nokogiri::XML.parse(iter2.join) { | cfg | cfg.noblanks }
154
+ iter = Bio::BlastXMLParser::NokogiriBlastIterator.new(xml,self,:prefix=>nil)
155
+ end
140
156
  res = []
141
157
  line_count = 0
142
158
  hit_count = 0
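The hunk above picks a parser from the file size when none is requested. The same rule can be sketched as a small pure function, which makes the 512,000,000-byte cut-off easy to check in isolation. The names `SPLIT_THRESHOLD_BYTES` and `choose_parser` are hypothetical, and the sketch assumes `options.parser` is `nil` when `--parser` is not given.

```ruby
# Cut-off taken from the diff: files larger than this use the split parser.
SPLIT_THRESHOLD_BYTES = 512_000_000

# Hypothetical helper mirroring the selection logic in the hunk above.
# requested: :split, :nosplit, or nil when no --parser option was passed.
def choose_parser(requested, file_size)
  return requested if requested
  file_size > SPLIT_THRESHOLD_BYTES ? :split : :nosplit
end

puts choose_parser(nil, 128 * 1024 * 1024)   # small file, no explicit choice
puts choose_parser(nil, 1_000_000_000)       # large file, no explicit choice
puts choose_parser(:nosplit, 1_000_000_000)  # explicit choice always wins
```

With this rule the effective default depends on the input, so a small file is read with `:nosplit` even though the option help text reads "default split".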
@@ -164,7 +180,7 @@ begin
164
180
  end
165
181
  res << out.join("\t")+"\n"
166
182
  else
167
- res << [hit_count,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
183
+ res << [iter.iter_num,iter.query_id,hit_count,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
168
184
  end
169
185
  end
170
186
  end
@@ -188,9 +204,16 @@ begin
188
204
  chunks << iter
189
205
  chunks_count += 1
190
206
  if chunks.size > NUM_CHUNKS
191
- output.call Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
207
+ out = Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
192
208
  process.call(iter,i)
193
209
  }
210
+ # Output is forked to a separate process too
211
+ fork do
212
+ output.call out
213
+ STDOUT.flush
214
+ STDOUT.close
215
+ exit 0
216
+ end
194
217
  chunks = []
195
218
  end
196
219
  end
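The hunk above moves the writing of each processed batch into a forked child. A minimal sketch of that pattern follows, assuming a POSIX system where `fork` is available. The `write_in_child` helper is hypothetical. Unlike the diff, it waits for the child, which is one way to keep successive batches from interleaving, at the cost of blocking the parent while the batch is written.

```ruby
# Hypothetical helper: write one batch of result lines from a forked child.
def write_in_child(lines, out = $stdout)
  out.flush                      # avoid duplicating buffered parent output in the child
  pid = fork do
    lines.each { |line| out.write(line) }
    out.flush
    exit!(0)                     # leave immediately, skipping inherited at_exit handlers
  end
  Process.wait(pid)              # assumption: wait so batches cannot interleave
  $?.success?
end

write_in_child(["1\tlcl|1\t5.8e-34\n", "2\tlcl|2\t1.2e-20\n"])
```

Without the `Process.wait` call the children run concurrently with the parent, so the order in which batches reach the output is not guaranteed.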
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "bio-blastxmlparser"
8
- s.version = "2.0.0"
8
+ s.version = "2.0.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Pjotr Prins"]
12
- s.date = "2014-09-06"
12
+ s.date = "2014-09-07"
13
13
  s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby and comes with a nice CLI"
14
14
  s.email = "pjotr.public01@thebird.nl"
15
15
  s.executables = ["blastxmlparser"]
@@ -11,7 +11,7 @@ module Bio
11
11
 
12
12
  def to_enum
13
13
  logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
14
- logger.info("parsing (full) #{@fn}")
14
+ logger.info("parsing (:nosplit) #{@fn}")
15
15
  NokogiriBlastXml.new(File.new(@fn)).to_enum
16
16
  end
17
17
  end
@@ -4,27 +4,21 @@ module Bio
4
4
  module BlastXMLParser
5
5
  # Reads a full XML result and splits it out into a buffer for each
6
6
  # Iteration (query result).
7
- class XmlSplitterIterator
8
- # include Enumerable
9
-
7
+ class BlastXmlSplitter
10
8
  def initialize fn
11
9
  @fn = fn
12
10
  end
13
-
14
- def to_enum
15
- Enumerator.new do | yielder |
16
- logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
17
- logger.info("split file parsing #{@fn}")
18
- f = File.open(@fn)
19
- # Skip BLAST header
20
- f.each_line do | line |
21
- break if line.strip == "<Iteration>"
22
- end
23
- # Return each Iteration as an XML DOM
24
- each_iteration(f) do | buf |
25
- iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
26
- yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
27
- end
11
+ def each
12
+ logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
13
+ logger.info("split file parsing #{@fn}")
14
+ f = File.open(@fn)
15
+ # Skip BLAST header
16
+ f.each_line do | line |
17
+ break if line.strip == "<Iteration>"
18
+ end
19
+ # Return each Iteration as an XML DOM
20
+ each_iteration(f) do | buf |
21
+ yield buf
28
22
  end
29
23
  end
30
24
 
@@ -43,5 +37,22 @@ module Bio
43
37
  end
44
38
  end
45
39
  end
40
+
41
+ class XmlSplitterIterator
42
+ # include Enumerable
43
+
44
+ def initialize fn
45
+ @splitter = BlastXmlSplitter.new(fn)
46
+ end
47
+
48
+ def to_enum
49
+ Enumerator.new do | yielder |
50
+ @splitter.each do | buf |
51
+ iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
52
+ yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
53
+ end
54
+ end
55
+ end
56
+ end
46
57
  end
47
58
  end
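After this change `BlastXmlSplitter#each` yields raw line buffers and leaves DOM parsing to the caller. The `each_iteration` helper it relies on is not shown in the diff, so the sketch below is a hypothetical stand-in that shows the general idea: skip the BLAST header, then collect the lines of each `<Iteration>` element into a buffer. It assumes the `<Iteration>` and `</Iteration>` tags sit on lines of their own, as the header-skipping code in the diff does.

```ruby
require 'stringio'

# Hypothetical stand-in for the splitter: yields one array of lines per Iteration.
def each_iteration_buffer(io)
  # Skip the BLAST header; this consumes the first <Iteration> line.
  io.each_line { |line| break if line.strip == "<Iteration>" }
  buf = ["<Iteration>\n"]
  io.each_line do |line|
    case line.strip
    when "<Iteration>"
      buf = [line]               # start of the next query result
    when "</Iteration>"
      buf << line
      yield buf                  # hand the complete block to the caller
      buf = []
    else
      buf << line unless buf.empty?  # ignore lines outside any Iteration
    end
  end
end

xml = <<~XML
  <BlastOutput>
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
  </BlastOutput>
XML

each_iteration_buffer(StringIO.new(xml)) { |buf| puts buf.join }
```

Yielding plain strings keeps the reader cheap and lets the expensive XML parse run wherever the buffer is consumed, for example inside a worker process.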
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-blastxmlparser
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.0
4
+ version: 2.0.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Pjotr Prins
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-09-06 00:00:00.000000000 Z
11
+ date: 2014-09-07 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bio-logger