bio-blastxmlparser 2.0.0 → 2.0.1
- checksums.yaml +4 -4
- data/README.md +40 -41
- data/VERSION +1 -1
- data/bin/blastxmlparser +30 -7
- data/bio-blastxmlparser.gemspec +2 -2
- data/lib/bio/db/blast/xmliterator.rb +1 -1
- data/lib/bio/db/blast/xmlsplitter.rb +29 -18
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7222df89b2f60ef4b027ea7ca766a30c04de567b
+  data.tar.gz: b8d7c84c85dd58e7794a62b83a73b17a04b60ce1
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e9feee95e3063b0c6c9e9ac28c0f7389e4036130e51c107314216fe8e30f98342d2fbc5f1af0ef16f9c5a11be95aa97d86d16f5d9e2169eda2a54d2594c0dc84
+  data.tar.gz: 63971bd220b178e7ff0dbd7c50a4df6277b9dc8035f610f173603e2e304655610e65894fb3bde7e67c17963d13557b22c3c1d4a3e6dbadbf65c7f170ddbd12f5
data/README.md
CHANGED
@@ -8,76 +8,73 @@ to:
 
 * Parse BLAST XML
 * Filter output
-* Generate FASTA, JSON, YAML, RDF, HTML, tabular output etc.
+* Generate FASTA, JSON, YAML, RDF, JSON-LD, HTML, csv, tabular output etc.
 
 Rather than loading everything in memory, XML is parsed by BLAST query
 (Iteration). Not only has this the advantage of low memory use, it also shows
-results early, and it
+results early, and it is faster when IO continues in parallel (disk
 read-ahead).
 
-
+blastxmlparser comes as a command line utility, which
 can be used to filter results and requires no understanding of Ruby.
 
 # Quick start
 
 ```sh
 gem install bio-blastxmlparser
+gem install parallel # if you want multi-core support
 blastxmlparser --help
 ```
 
 ## Performance
 
-XML parsing is expensive. blastxmlparser can use
-Java XML parsers, based on libxml2 in
-after splitting the BLAST XML document
-Tests show this is faster than a SAX
-
-
-
-and [xml.com](http://www.xml.com/lpt/a/1703).
+XML parsing and transformation is expensive. blastxmlparser can use
+the fast Nokogiri C, or Java XML parsers, based on libxml2 in
+parallel. A DOM parser is used after splitting the BLAST XML document
+into subsections. Tests show this is faster than a SAX parser with
+Ruby callbacks. To see why libxml2 based Nokogiri is fast, see
+[xml.com](http://www.xml.com/lpt/a/1703). And blastxmlparser uses
+Nokogiri in parallel.
 
 Blastxmlparser is designed with other optimizations, such as lazy
-evaluation, i.e., only creating objects when required
-
-
-relevant fields are queried.
+evaluation, i.e., only creating objects when required. When parsing a
+full BLAST result usually only a few fields are used. By using XPath
+queries the parser makes sure only the relevant fields are queried.
 
-Timings for parsing a
+Timings for parsing a 1 Gb BLAST XML file on 4-core 1.2GHz laptop
 
 ```
-real
-user
-sys
+real 2m40.248s
+user 8m11.075s
+sys 0m37.198s
 ```
 
-which makes for pretty good core utilisation.
+which makes for pretty good core utilisation and limited RAM use. If
+you have enough RAM it may make sense to try the `--parser nosplit'
+option which starts by reading the full DOM into RAM. It may be faster
+and show different IO characteristics.
 
 ## Install
 
 ```sh
-gem install bio-blastxmlparser
+gem install parallel bio-blastxmlparser
 ```
 
-Important: the parser is written for Ruby
+Important: the parser is written for Ruby 1.9 or later. Check with
 
 ```sh
 ruby -v
 gem env
 ```
 
-Nokogiri XML parser is required. To install it,
-
-example on Debian:
+Nokogiri XML parser is required. To install it, the libxml2 libraries and
+headers may need to be installed first, for example on Debian:
 
 ```sh
 apt-get install libxslt-dev libxml2-dev
 gem install bio-blastxmlparser
 ```
 
-Nokogiri balks when libxml2 or libxslt is missing on your system (or
-may install something automatically). In the worst case you'll have to
-provide build paths, as described [here](http://nokogiri.org/tutorials/installing_nokogiri.html).
-
 ## Command line usage
 
 ### Usage
@@ -85,8 +82,10 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
 ```
 blastxmlparser [options] file(s)
 
-    -p, --parser name    Use
-
+    -p, --parser name    Use split|nosplit parser (default split)
+    --filter filter      Filtering expression
+    --threads num        Use parallel threads
+    -e, --exec filter    Evaluate filter (deprecated)
 
     -n, --named fields   Print named fields
     --output-fasta       Output FASTA
@@ -105,7 +104,7 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
 Print result fields of iterations containing 'lcl', using a regex
 
 ```sh
-blastxmlparser
+blastxmlparser --filter 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
 ```
 
 prints a tab delimited
@@ -124,20 +123,20 @@ As this is evaluated Ruby, it is also possible to use the XML element
 names directly
 
 ```sh
-blastxmlparser
+blastxmlparser --filter 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
 ```
 
 Or the shorter
 
 ```sh
-blastxmlparser
+blastxmlparser --filter 'hsp.bit_score>145' test/data/nt_example_blastn.m7
 ```
 
 And it is possible to print (non default) named fields where E-value < 0.001
 and hit length > 100. E.g.
 
 ```sh
-blastxmlparser -n 'hsp.evalue,hsp.qseq'
+blastxmlparser -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 
 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
@@ -150,7 +149,7 @@ and hit length > 100. E.g.
 prints the evalue and qseq columns. To output FASTA use --output-fasta
 
 ```sh
-blastxmlparser --output-fasta
+blastxmlparser --output-fasta --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 
 which prints matching sequences, where the first field is the accession, followed
@@ -170,7 +169,7 @@ To have more output options blastxmlparser can use an [ERB
 template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
 very flexible option that can output textual formats such as JSON, YAML, HTML
 and RDF. Examples are provided in
-[./templates](https://github.com/pjotrp/
+[./templates](https://github.com/pjotrp/blastxmlparser/templates/). A JSON
 template could be
 
 ```Javascript
@@ -189,7 +188,7 @@ template could be
 To get JSON, run it with
 
 ```sh
-blastxmlparser --template template/blast2json.erb
+blastxmlparser --template template/blast2json.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 
 ```Javascript
@@ -208,7 +207,7 @@ To get JSON, run it with
 Likewise, using the RDF template
 
 ```sh
-blastxmlparser --template template/blast2rdf.erb
+blastxmlparser --template template/blast2rdf.erb --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 
 ```ruby
@@ -231,10 +230,10 @@ Likewise, using the RDF template
 
 ## Additional options
 
-To use the
+To use the high-mem version of the parser (slightly faster on single core) use
 
 ```sh
-blastxmlparser --parser
+blastxmlparser --parser nosplit --threads 1 -n 'hsp.evalue,hsp.qseq' --filter 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 ```
 
 ## API (Ruby library)
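The README hunks above state that a `--filter` expression is evaluated as Ruby. A minimal sketch of that idea follows. It is not the gem's implementation: `Hit`, `Hsp` and `match?` are hypothetical stand-ins, and the sketch only assumes that the expression sees objects named `hit` and `hsp`.

```ruby
# Hypothetical stand-ins for the parser's hit and HSP objects.
Hit = Struct.new(:hit_id, :len)
Hsp = Struct.new(:evalue, :bit_score)

# Evaluate a filter string with `hit` and `hsp` visible as local variables.
# eval runs arbitrary Ruby, so this is only suitable for trusted input.
def match?(filter, hit, hsp)
  eval(filter, binding)
end

hit = Hit.new("example_hit", 150)
hsp = Hsp.new(5.8e-34, 146.0)
puts match?('hsp.evalue<0.01 and hit.len>100', hit, hsp)
```

Because the filter is plain Ruby, any method on the stand-in objects can be used in the expression, which is presumably why the README can offer both `hsp["Hsp_bit-score"]` and the shorter `hsp.bit_score`.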
data/VERSION
CHANGED
@@ -1 +1 @@
-2.0.
+2.0.1
data/bin/blastxmlparser
CHANGED
@@ -48,7 +48,7 @@ opts = OptionParser.new do |o|
 
   o.separator ""
 
-  o.on("-p name", "--parser name", "Use
+  o.on("-p name", "--parser name", "Use split|nosplit parser (default split)") do |p|
     options.parser = p.to_sym
   end
 
@@ -127,16 +127,32 @@ begin
 
   ARGV.each do | fn |
     logger.info("XML parsing #{fn}")
-
-
+    parser_type = options.parser
+    if !parser_type
+      # If a file is smaller than 0.5 Gb the nosplit parser is used by default for performance
+      if File.size(fn) > 512_000_000
+        parser_type = :split
+      else
+        parser_type = :nosplit
+      end
+    end
+    n = if parser_type == :nosplit
+      Bio::BlastXMLParser::NokogiriBlastXml.new(File.new(fn)).to_enum
     else
-
+      # default
+      Bio::BlastXMLParser::BlastXmlSplitter.new(fn)
     end
     chunks = []
     chunks_count = 0
     NUM_CHUNKS=10_000
 
-    process = lambda { |
+    process = lambda { |iter2,i| # Process one BLAST iter block
+      if parser_type == :nosplit
+        iter = iter2
+      else
+        xml = Nokogiri::XML.parse(iter2.join) { | cfg | cfg.noblanks }
+        iter = Bio::BlastXMLParser::NokogiriBlastIterator.new(xml,self,:prefix=>nil)
+      end
       res = []
       line_count = 0
       hit_count = 0
@@ -164,7 +180,7 @@ begin
             end
             res << out.join("\t")+"\n"
           else
-            res << [
+            res << [iter.iter_num,iter.query_id,hit_count,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
           end
         end
       end
@@ -188,9 +204,16 @@ begin
       chunks << iter
       chunks_count += 1
       if chunks.size > NUM_CHUNKS
-
+        out = Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
           process.call(iter,i)
         }
+        # Output is forked to a separate process too
+        fork do
+          output.call out
+          STDOUT.flush
+          STDOUT.close
+          exit 0
+        end
         chunks = []
       end
     end
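The parser selection added in the second hunk can be read in isolation as the sketch below. `choose_parser` and `SPLIT_THRESHOLD_BYTES` are hypothetical names introduced here for illustration, and the file size is passed in as an argument instead of being read with `File.size`.

```ruby
# Threshold taken from the hunk above (512_000_000 bytes, roughly 0.5 Gb).
SPLIT_THRESHOLD_BYTES = 512_000_000

# Hypothetical helper: an explicit --parser value wins; otherwise large
# files get the low-memory :split parser and smaller files get :nosplit.
def choose_parser(requested, file_size)
  return requested if requested
  file_size > SPLIT_THRESHOLD_BYTES ? :split : :nosplit
end

puts choose_parser(nil, 1_000_000_000)
puts choose_parser(nil, 10_000_000)
puts choose_parser(:split, 10_000_000)
```

Under this logic the effective default depends on file size, while the option help string in the first hunk still reads "default split".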
data/bio-blastxmlparser.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "bio-blastxmlparser"
-  s.version = "2.0.
+  s.version = "2.0.1"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Pjotr Prins"]
-  s.date = "2014-09-
+  s.date = "2014-09-07"
   s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby and comes with a nice CLI"
   s.email = "pjotr.public01@thebird.nl"
   s.executables = ["blastxmlparser"]
data/lib/bio/db/blast/xmlsplitter.rb
CHANGED
@@ -4,27 +4,21 @@ module Bio
   module BlastXMLParser
     # Reads a full XML result and splits it out into a buffer for each
     # Iteration (query result).
-    class
-      # include Enumerable
-
+    class BlastXmlSplitter
       def initialize fn
         @fn = fn
       end
-
-
-
-
-
-
-
-
-
-
-
-          each_iteration(f) do | buf |
-            iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
-            yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
-          end
+      def each
+        logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
+        logger.info("split file parsing #{@fn}")
+        f = File.open(@fn)
+        # Skip BLAST header
+        f.each_line do | line |
+          break if line.strip == "<Iteration>"
+        end
+        # Return each Iteration as an XML DOM
+        each_iteration(f) do | buf |
+          yield buf
         end
       end
 
@@ -43,5 +37,22 @@ module Bio
         end
       end
     end
+
+    class XmlSplitterIterator
+      # include Enumerable
+
+      def initialize fn
+        @splitter = BlastXmlSplitter.new(fn)
+      end
+
+      def to_enum
+        Enumerator.new do | yielder |
+          @splitter.each do | buf |
+            iteration = Nokogiri::XML.parse(buf.join) { | cfg | cfg.noblanks }
+            yielder.yield NokogiriBlastIterator.new(iteration,self,:prefix=>nil)
+          end
+        end
+      end
+    end
   end
 end
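After this change `BlastXmlSplitter#each` yields raw line buffers, and Nokogiri parsing moves to the caller. The `each_iteration` method it relies on is not part of this diff, so the sketch below uses a hypothetical `each_iteration_buffer` helper to illustrate the splitting idea. It assumes, as the header-skipping code above does, that `<Iteration>` and `</Iteration>` each sit alone on their own line.

```ruby
require 'stringio'

# Hypothetical helper: yields one Array of lines per <Iteration> element.
# Lines outside an Iteration (the BLAST header and footer) are skipped.
def each_iteration_buffer(io)
  buf = nil
  io.each_line do |line|
    tag = line.strip
    if tag == "<Iteration>"
      buf = [line]          # start collecting a new Iteration
    elsif buf
      buf << line
      if tag == "</Iteration>"
        yield buf           # hand over one complete Iteration
        buf = nil
      end
    end
  end
end

xml = <<~XML
  <BlastOutput>
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
  </BlastOutput>
XML

buffers = []
each_iteration_buffer(StringIO.new(xml)) { |buf| buffers << buf }
puts buffers.size
```

Each buffer can then be joined and passed to an XML parser independently, which is what allows the per-Iteration work to be distributed over worker processes.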
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bio-blastxmlparser
 version: !ruby/object:Gem::Version
-  version: 2.0.
+  version: 2.0.1
 platform: ruby
 authors:
 - Pjotr Prins
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-09-
+date: 2014-09-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bio-logger