bio-blastxmlparser 1.1.0 → 1.1.1
- data/README.rdoc +84 -84
- data/Rakefile +1 -1
- data/VERSION +1 -1
- data/bin/blastxmlparser +3 -4
- data/bio-blastxmlparser.gemspec +3 -3
- data/lib/bio/db/blast/parser/nokogiri.rb +1 -1
- data/lib/bio/db/blast/xmliterator.rb +1 -1
- data/lib/bio/db/blast/xmlsplitter.rb +1 -1
- data/sample/blastxmlparserdemo.rb +1 -1
- data/spec/bio-blastxmlparser_spec.rb +6 -6
- metadata +19 -18
data/README.rdoc
CHANGED
@@ -28,10 +28,11 @@ see why libxml2 based Nokogiri is fast, see
|
|
28
28
|
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
|
29
29
|
http://www.xml.com/lpt/a/1703.
|
30
30
|
|
31
|
-
The parser is also designed with other optimizations, such as lazy
|
32
|
-
only creating objects when required, and (in a future
|
33
|
-
a full BLAST result usually
|
34
|
-
only
|
31
|
+
The parser is also designed with other optimizations, such as lazy
|
32
|
+
evaluation, i.e. only creating objects when required, and (in a future
|
33
|
+
version) parallelization. When parsing a full BLAST result usually
|
34
|
+
only a few fields are used. By using XPath queries only the relevant
|
35
|
+
fields are queried.
|
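The lazy evaluation mentioned in the README text above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the class LazyHsp, its field names, and the conversion counter are hypothetical and are not taken from the gem, and a plain Hash of strings stands in for the XPath queries the real parser runs against Nokogiri.

```ruby
# Hypothetical sketch of lazy field evaluation; LazyHsp and its field
# names are illustrative and are not the gem's classes.
class LazyHsp
  attr_reader :conversions

  def initialize(raw)
    @raw = raw          # unparsed strings, standing in for XML text nodes
    @cache = {}         # converted values, filled on first access only
    @conversions = 0    # counts how often a conversion actually ran
  end

  def bit_score
    fetch(:bit_score) { |s| s.to_f }
  end

  def evalue
    fetch(:evalue) { |s| s.to_f }
  end

  private

  # Convert a raw field on first use and remember the result.
  def fetch(name)
    return @cache[name] if @cache.key?(name)
    @conversions += 1
    @cache[name] = yield(@raw.fetch(name))
  end
end

hsp = LazyHsp.new({ bit_score: "145.2", evalue: "5.82208e-34" })
puts hsp.conversions   # no field has been converted yet
puts hsp.bit_score     # converted on first access
puts hsp.bit_score     # served from the cache
puts hsp.conversions
```

Fields that are never read (here evalue) are never converted, which is the saving the README describes for results where only a few fields are used.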
35
36
|
|
36
37
|
Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
|
37
38
|
|
@@ -47,7 +48,7 @@ Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
|
|
47
48
|
user 0m1.444s
|
48
49
|
sys 0m0.160s
|
49
50
|
|
50
|
-
BioRuby ReXML DOM parser
|
51
|
+
BioRuby ReXML DOM parser (old style)
|
51
52
|
|
52
53
|
real 1m14.548s
|
53
54
|
user 1m13.065s
|
@@ -72,13 +73,88 @@ example on Debian:
|
|
72
73
|
for more installation on other platforms see
|
73
74
|
http://nokogiri.org/tutorials/installing_nokogiri.html.
|
74
75
|
|
76
|
+
== Command line usage
|
77
|
+
|
78
|
+
=== Usage
|
79
|
+
blastxmlparser [options] file(s)
|
80
|
+
|
81
|
+
-p, --parser name Use full|split parser (default full)
|
82
|
+
--output-fasta Output FASTA
|
83
|
+
-n, --named fields Set named fields
|
84
|
+
-e, --exec filter Execute filter
|
85
|
+
|
86
|
+
--logger filename Log to file (default stderr)
|
87
|
+
--trace options Set log level (default INFO, see bio-logger)
|
88
|
+
-q, --quiet Run quietly
|
89
|
+
-v, --verbose Run verbosely
|
90
|
+
--debug Show debug messages
|
91
|
+
-h, --help Show help and examples
|
92
|
+
|
93
|
+
bioblastxmlparser filename(s)
|
94
|
+
|
95
|
+
Use --help switch for more information
|
96
|
+
|
97
|
+
=== Examples
|
98
|
+
|
99
|
+
Print result fields of iterations containing 'lcl', using a regex
|
100
|
+
|
101
|
+
blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
|
102
|
+
|
103
|
+
Print fields where bit_score > 145
|
104
|
+
|
105
|
+
blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
|
106
|
+
|
107
|
+
prints a tab delimited
|
108
|
+
|
109
|
+
1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
|
110
|
+
2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
|
111
|
+
3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
|
112
|
+
4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
|
113
|
+
|
114
|
+
The second and third column show the BLAST iteration, and the others
|
115
|
+
relate to the hits.
|
116
|
+
|
117
|
+
As this is evaluated Ruby, it is also possible to use the XML element
|
118
|
+
names directly
|
119
|
+
|
120
|
+
blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
|
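Because the filter is described as evaluated Ruby, the mechanism can be sketched roughly as follows. The Structs, the matches? helper, and the sample values are assumptions made for illustration; the gem's own evaluation code is not shown in this diff and may differ.

```ruby
# Hypothetical sketch: evaluating a filter string such as
# 'hsp.bit_score>145' per record. The Structs are illustrative stand-ins.
Hit = Struct.new(:hit_id, :len)
Hsp = Struct.new(:bit_score, :evalue)

# Evaluate +filter+ with the local variables +hit+ and +hsp+ in scope.
def matches?(filter, hit, hsp)
  b = binding                     # captures filter, hit and hsp as locals
  b.eval(filter) ? true : false
end

records = [
  [Hit.new("lcl|I_1", 408), Hsp.new(150.0, 5.82208e-34)],
  [Hit.new("lcl|I_2", 80),  Hsp.new(120.0, 2.76378e-11)],
]

filter = 'hsp.bit_score>145 and hit.len>100'
selected = records.select { |hit, hsp| matches?(filter, hit, hsp) }

# Print the selected records as tab-delimited fields.
selected.each { |hit, hsp| puts [hit.hit_id, hsp.bit_score, hsp.evalue].join("\t") }
```

Evaluating a string this way runs arbitrary Ruby, so a design like this is only appropriate when the filter comes from a trusted user at the command line.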
121
|
+
|
122
|
+
And it is possible to print (non default) named fields where E-value < 0.001
|
123
|
+
and hit length > 100. E.g.
|
124
|
+
|
125
|
+
blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
126
|
+
|
127
|
+
1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
128
|
+
2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
129
|
+
3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
|
130
|
+
4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
|
131
|
+
5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
|
132
|
+
etc. etc.
|
133
|
+
|
134
|
+
prints the evalue and qseq columns. To output FASTA use --output-fasta
|
135
|
+
|
136
|
+
blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
137
|
+
|
138
|
+
which prints matching sequences, where the first field is the accession, followed
|
139
|
+
by query iteration id, and hit_id. E.g.
|
140
|
+
|
141
|
+
>I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
|
142
|
+
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
143
|
+
>I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
|
144
|
+
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
145
|
+
etc. etc.
|
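The header layout described above (accession, then iteration number and query id, then hit_id) can be sketched as a small formatter. The helper name fasta_record and its keyword names are hypothetical, the trailing description field is an assumption about where the bracketed range comes from, and the sequence is shortened for the example.

```ruby
# Hypothetical helper; the keyword names are assumptions for illustration.
def fasta_record(accession:, iter_num:, query_id:, hit_id:, description:, sequence:)
  header = ">#{accession} #{iter_num}|#{query_id} #{hit_id} #{description}"
  "#{header}\n#{sequence}\n"
end

# Sequence shortened here; the README example shows the full query sequence.
print fasta_record(
  accession: "I_1", iter_num: 1, query_id: "lcl|1_0",
  hit_id: "lcl|I_1", description: "[477 - 884]",
  sequence: "AGTGAAGCTTCTAGATATTTGG"
)
```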
146
|
+
|
147
|
+
To use the low-mem (iterated slower) version of the parser use
|
148
|
+
|
149
|
+
blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
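One way a low-memory split parser could work is to read the file line by line and hand over one Iteration block at a time, so the whole document never sits in memory. The sketch below assumes that the opening and closing Iteration tags each appear on a line of their own; each_iteration is a hypothetical helper, not the gem's XmlSplitter.

```ruby
require 'stringio'

# Hypothetical sketch of a low-memory splitter: read line by line and
# yield one <Iteration>...</Iteration> chunk at a time.
def each_iteration(io)
  chunk = nil
  io.each_line do |line|
    chunk = String.new if line.include?("<Iteration>")   # start a new chunk
    chunk << line if chunk                                # collect while inside
    if chunk && line.include?("</Iteration>")
      yield chunk                                         # hand over a complete chunk
      chunk = nil
    end
  end
end

xml = StringIO.new(<<~XML)
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
XML

chunks = []
each_iteration(xml) { |c| chunks << c }
puts chunks.size
puts chunks.first
```

Each chunk could then be parsed on its own, which trades some speed for a memory footprint bounded by the largest single iteration.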
150
|
+
|
75
151
|
== API (Ruby library)
|
76
152
|
|
77
153
|
To loop through a BLAST result:
|
78
154
|
|
79
155
|
>> require 'bio-blastxmlparser'
|
80
156
|
>> fn = 'test/data/nt_example_blastn.m7'
|
81
|
-
>> n = Bio::
|
157
|
+
>> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
82
158
|
>> n.each do | iter |
|
83
159
|
>> puts "Hits for " + iter.query_id
|
84
160
|
>> iter.each do | hit |
|
@@ -91,7 +167,7 @@ To loop through a BLAST result:
|
|
91
167
|
The next example parses XML using less memory by using a Ruby
|
92
168
|
Iterator
|
93
169
|
|
94
|
-
>> blast = Bio::
|
170
|
+
>> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
|
95
171
|
>> iter = blast.next
|
96
172
|
>> iter.iter_num
|
97
173
|
=> 1
|
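The to_enum and next calls in the example above rely on Ruby's external enumerators: any object with an each method can be turned into an Enumerator that produces one element per call to next. A minimal sketch, with the hypothetical class CountingIterator standing in for the gem's iterator:

```ruby
# Hypothetical stand-in for an iterator class that only implements #each.
class CountingIterator
  def initialize(n)
    @n = n
  end

  def each
    1.upto(@n) { |i| yield i }
  end
end

blast = CountingIterator.new(3).to_enum   # external enumerator over #each
puts blast.next   # first element, produced on demand
puts blast.next   # second element
```

Calling next after the last element raises StopIteration, which loop do ... end rescues automatically.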
@@ -175,87 +251,11 @@ etc. etc.
|
|
175
251
|
|
176
252
|
For more examples see the files in ./spec
|
177
253
|
|
178
|
-
== Command line usage
|
179
|
-
|
180
|
-
|
181
|
-
== Usage
|
182
|
-
blastxmlparser [options] file(s)
|
183
|
-
|
184
|
-
-p, --parser name Use full|split parser (default full)
|
185
|
-
--output-fasta Output FASTA
|
186
|
-
-n, --named fields Set named fields
|
187
|
-
-e, --exec filter Execute filter
|
188
|
-
|
189
|
-
--logger filename Log to file (default stderr)
|
190
|
-
--trace options Set log level (default INFO, see bio-logger)
|
191
|
-
-q, --quiet Run quietly
|
192
|
-
-v, --verbose Run verbosely
|
193
|
-
--debug Show debug messages
|
194
|
-
-h, --help Show help and examples
|
195
|
-
|
196
|
-
bioblastxmlparser filename(s)
|
197
|
-
|
198
|
-
Use --help switch for more information
|
199
|
-
|
200
|
-
== Examples
|
201
|
-
|
202
|
-
Print result fields of iterations containing 'lcl', using a regex
|
203
|
-
|
204
|
-
blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
|
205
|
-
|
206
|
-
Print fields where bit_score > 145
|
207
|
-
|
208
|
-
blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
|
209
|
-
|
210
|
-
prints a tab delimited
|
211
|
-
|
212
|
-
1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
|
213
|
-
2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
|
214
|
-
3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
|
215
|
-
4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
|
216
|
-
|
217
|
-
The second and third column show the BLAST iteration, and the others
|
218
|
-
relate to the hits.
|
219
|
-
|
220
|
-
As this is evaluated Ruby, it is also possible to use the XML element
|
221
|
-
names directly
|
222
|
-
|
223
|
-
blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
|
224
|
-
|
225
|
-
And it is possible to print (non default) named fields where E-value < 0.001
|
226
|
-
and hit length > 100. E.g.
|
227
|
-
|
228
|
-
blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
229
|
-
|
230
|
-
1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
231
|
-
2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
232
|
-
3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
|
233
|
-
4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
|
234
|
-
5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
|
235
|
-
etc. etc.
|
236
|
-
|
237
|
-
prints the evalue and qseq columns. To output FASTA use --output-fasta
|
238
|
-
|
239
|
-
blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
240
|
-
|
241
|
-
which prints matching sequences, where the first field is the accession, followed
|
242
|
-
by query iteration id, and hit_id. E.g.
|
243
|
-
|
244
|
-
>I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
|
245
|
-
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
246
|
-
>I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
|
247
|
-
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
248
|
-
etc. etc.
|
249
|
-
|
250
|
-
To use the low-mem (iterated slower) version of the parser use
|
251
|
-
|
252
|
-
blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
253
|
-
|
254
254
|
== URL
|
255
255
|
|
256
256
|
The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
257
257
|
|
258
258
|
== Copyright
|
259
259
|
|
260
|
-
Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
|
260
|
+
Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
|
261
261
|
|
data/Rakefile
CHANGED
@@ -16,7 +16,7 @@ Jeweler::Tasks.new do |gem|
|
|
16
16
|
gem.homepage = "http://github.com/pjotrp/blastxmlparser"
|
17
17
|
gem.license = "MIT"
|
18
18
|
gem.summary = %Q{Very fast BLAST XML parser and library for big data}
|
19
|
-
gem.description = %Q{Fast big data XML parser and library
|
19
|
+
gem.description = %Q{Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby}
|
20
20
|
gem.email = "pjotr.public01@thebird.nl"
|
21
21
|
gem.authors = ["Pjotr Prins"]
|
22
22
|
# Include your dependencies below. Runtime dependencies are required when using your gem,
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
1.1.
|
1
|
+
1.1.1
|
data/bin/blastxmlparser
CHANGED
@@ -2,10 +2,9 @@
|
|
2
2
|
#
|
3
3
|
# BioRuby bio-blastxmlparser Plugin
|
4
4
|
# Author:: Pjotr Prins
|
5
|
-
# Copyright:: 2011
|
6
5
|
# License:: MIT License
|
7
6
|
#
|
8
|
-
# Copyright (C) 2010
|
7
|
+
# Copyright (C) 2010-2013 Pjotr Prins <pjotr.prins@thebird.nl>
|
9
8
|
|
10
9
|
rootpath = File.dirname(File.dirname(__FILE__))
|
11
10
|
$: << File.join(rootpath,'lib')
|
@@ -160,9 +159,9 @@ begin
|
|
160
159
|
ARGV.each do | fn |
|
161
160
|
logger.info("XML parsing #{fn}")
|
162
161
|
n = if options.parser == :split
|
163
|
-
Bio::
|
162
|
+
Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
|
164
163
|
else
|
165
|
-
Bio::
|
164
|
+
Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
166
165
|
end
|
167
166
|
i = 1
|
168
167
|
n.each do | iter |
|
data/bio-blastxmlparser.gemspec
CHANGED
@@ -5,12 +5,12 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "bio-blastxmlparser"
|
8
|
-
s.version = "1.1.
|
8
|
+
s.version = "1.1.1"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = "
|
13
|
-
s.description = "Fast big data XML parser and library
|
12
|
+
s.date = "2013-02-07"
|
13
|
+
s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby"
|
14
14
|
s.email = "pjotr.public01@thebird.nl"
|
15
15
|
s.executables = ["blastxmlparser"]
|
16
16
|
s.extra_rdoc_files = [
|
data/sample/blastxmlparserdemo.rb
CHANGED
@@ -5,7 +5,7 @@ $: << File.join(rootpath,'lib')
|
|
5
5
|
|
6
6
|
require 'bio-blastxmlparser'
|
7
7
|
fn = 'test/data/nt_example_blastn.m7'
|
8
|
-
n = Bio::
|
8
|
+
n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
9
9
|
n.each do | iter |
|
10
10
|
puts "Hits for " + iter.query_id
|
11
11
|
iter.each do | hit |
|
data/spec/bio-blastxmlparser_spec.rb
CHANGED
@@ -1,9 +1,9 @@
|
|
1
1
|
require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
2
2
|
|
3
3
|
TESTFILE = "./test/data/nt_example_blastn.m7"
|
4
|
-
include Bio::
|
4
|
+
include Bio::BlastXMLParser
|
5
5
|
|
6
|
-
describe "Bio::
|
6
|
+
describe "Bio::BlastXMLParser::NokogiriBlastXml" do
|
7
7
|
before(:all) do
|
8
8
|
n = NokogiriBlastXml.new(File.new(TESTFILE)).to_enum
|
9
9
|
@iter1 = n.next
|
@@ -75,8 +75,8 @@ describe "Bio::Blast::NokogiriBlastXml" do
|
|
75
75
|
end
|
76
76
|
end
|
77
77
|
|
78
|
-
describe Bio::
|
79
|
-
include Bio::
|
78
|
+
describe Bio::BlastXMLParser::XmlIterator do
|
79
|
+
include Bio::BlastXMLParser
|
80
80
|
it "should parse with Nokogiri" do
|
81
81
|
blast = XmlIterator.new(TESTFILE).to_enum
|
82
82
|
iter1 = blast.next
|
@@ -86,8 +86,8 @@ describe Bio::Blast::XmlIterator do
|
|
86
86
|
end
|
87
87
|
end
|
88
88
|
|
89
|
-
describe Bio::
|
90
|
-
include Bio::
|
89
|
+
describe Bio::BlastXMLParser::XmlSplitterIterator do
|
90
|
+
include Bio::BlastXMLParser
|
91
91
|
# it "should read a large file and yield Iterations" do
|
92
92
|
# s = XmlSplitter.new("./test/data/nt_example_blastn.m7")
|
93
93
|
# s.each do | result |
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-blastxmlparser
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.1.
|
4
|
+
version: 1.1.1
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2013-02-07 00:00:00.000000000Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bio-logger
|
16
|
-
requirement: &
|
16
|
+
requirement: &24214160 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,10 +21,10 @@ dependencies:
|
|
21
21
|
version: 1.0.0
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *24214160
|
25
25
|
- !ruby/object:Gem::Dependency
|
26
26
|
name: nokogiri
|
27
|
-
requirement: &
|
27
|
+
requirement: &24213120 !ruby/object:Gem::Requirement
|
28
28
|
none: false
|
29
29
|
requirements:
|
30
30
|
- - ! '>='
|
@@ -32,10 +32,10 @@ dependencies:
|
|
32
32
|
version: 1.5.0
|
33
33
|
type: :runtime
|
34
34
|
prerelease: false
|
35
|
-
version_requirements: *
|
35
|
+
version_requirements: *24213120
|
36
36
|
- !ruby/object:Gem::Dependency
|
37
37
|
name: rake
|
38
|
-
requirement: &
|
38
|
+
requirement: &24212220 !ruby/object:Gem::Requirement
|
39
39
|
none: false
|
40
40
|
requirements:
|
41
41
|
- - ! '>='
|
@@ -43,10 +43,10 @@ dependencies:
|
|
43
43
|
version: 0.9.2.2
|
44
44
|
type: :development
|
45
45
|
prerelease: false
|
46
|
-
version_requirements: *
|
46
|
+
version_requirements: *24212220
|
47
47
|
- !ruby/object:Gem::Dependency
|
48
48
|
name: bundler
|
49
|
-
requirement: &
|
49
|
+
requirement: &24211440 !ruby/object:Gem::Requirement
|
50
50
|
none: false
|
51
51
|
requirements:
|
52
52
|
- - ! '>='
|
@@ -54,10 +54,10 @@ dependencies:
|
|
54
54
|
version: '0'
|
55
55
|
type: :development
|
56
56
|
prerelease: false
|
57
|
-
version_requirements: *
|
57
|
+
version_requirements: *24211440
|
58
58
|
- !ruby/object:Gem::Dependency
|
59
59
|
name: jeweler
|
60
|
-
requirement: &
|
60
|
+
requirement: &24174660 !ruby/object:Gem::Requirement
|
61
61
|
none: false
|
62
62
|
requirements:
|
63
63
|
- - ~>
|
@@ -65,10 +65,10 @@ dependencies:
|
|
65
65
|
version: 1.8.4
|
66
66
|
type: :development
|
67
67
|
prerelease: false
|
68
|
-
version_requirements: *
|
68
|
+
version_requirements: *24174660
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
70
|
name: rspec
|
71
|
-
requirement: &
|
71
|
+
requirement: &24173840 !ruby/object:Gem::Requirement
|
72
72
|
none: false
|
73
73
|
requirements:
|
74
74
|
- - ! '>='
|
@@ -76,10 +76,10 @@ dependencies:
|
|
76
76
|
version: 2.3.0
|
77
77
|
type: :development
|
78
78
|
prerelease: false
|
79
|
-
version_requirements: *
|
79
|
+
version_requirements: *24173840
|
80
80
|
- !ruby/object:Gem::Dependency
|
81
81
|
name: rdoc
|
82
|
-
requirement: &
|
82
|
+
requirement: &24173100 !ruby/object:Gem::Requirement
|
83
83
|
none: false
|
84
84
|
requirements:
|
85
85
|
- - ! '>='
|
@@ -87,8 +87,9 @@ dependencies:
|
|
87
87
|
version: 2.4.2
|
88
88
|
type: :development
|
89
89
|
prerelease: false
|
90
|
-
version_requirements: *
|
91
|
-
description: Fast big data XML parser and library
|
90
|
+
version_requirements: *24173100
|
91
|
+
description: Fast big data BLAST XML parser and library; this libxml2 based version
|
92
|
+
is 50x faster than BioRuby
|
92
93
|
email: pjotr.public01@thebird.nl
|
93
94
|
executables:
|
94
95
|
- blastxmlparser
|
@@ -140,7 +141,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
140
141
|
version: '0'
|
141
142
|
segments:
|
142
143
|
- 0
|
143
|
-
hash: -
|
144
|
+
hash: -3287387609254152406
|
144
145
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
145
146
|
none: false
|
146
147
|
requirements:
|