bio-blastxmlparser 1.1.0 → 1.1.1
- data/README.rdoc +84 -84
- data/Rakefile +1 -1
- data/VERSION +1 -1
- data/bin/blastxmlparser +3 -4
- data/bio-blastxmlparser.gemspec +3 -3
- data/lib/bio/db/blast/parser/nokogiri.rb +1 -1
- data/lib/bio/db/blast/xmliterator.rb +1 -1
- data/lib/bio/db/blast/xmlsplitter.rb +1 -1
- data/sample/blastxmlparserdemo.rb +1 -1
- data/spec/bio-blastxmlparser_spec.rb +6 -6
- metadata +19 -18
data/README.rdoc
CHANGED
@@ -28,10 +28,11 @@ see why libxml2 based Nokogiri is fast, see
|
|
28
28
|
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
|
29
29
|
http://www.xml.com/lpt/a/1703.
|
30
30
|
|
31
|
-
The parser is also designed with other optimizations, such as lazy
|
32
|
-
only creating objects when required, and (in a future
|
33
|
-
a full BLAST result usually
|
34
|
-
only
|
31
|
+
The parser is also designed with other optimizations, such as lazy
|
32
|
+
evaluation, i.e. only creating objects when required, and (in a future
|
33
|
+
version) parallelization. When parsing a full BLAST result usually
|
34
|
+
only a few fields are used. By using XPath queries only the relevant
|
35
|
+
fields are queried.
|
35
36
|
|
36
37
|
Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
|
37
38
|
|
@@ -47,7 +48,7 @@ Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
|
|
47
48
|
user 0m1.444s
|
48
49
|
sys 0m0.160s
|
49
50
|
|
50
|
-
BioRuby ReXML DOM parser
|
51
|
+
BioRuby ReXML DOM parser (old style)
|
51
52
|
|
52
53
|
real 1m14.548s
|
53
54
|
user 1m13.065s
|
@@ -72,13 +73,88 @@ example on Debian:
|
|
72
73
|
for more installation on other platforms see
|
73
74
|
http://nokogiri.org/tutorials/installing_nokogiri.html.
|
74
75
|
|
76
|
+
== Command line usage
|
77
|
+
|
78
|
+
=== Usage
|
79
|
+
blastxmlparser [options] file(s)
|
80
|
+
|
81
|
+
-p, --parser name Use full|split parser (default full)
|
82
|
+
--output-fasta Output FASTA
|
83
|
+
-n, --named fields Set named fields
|
84
|
+
-e, --exec filter Execute filter
|
85
|
+
|
86
|
+
--logger filename Log to file (default stderr)
|
87
|
+
--trace options Set log level (default INFO, see bio-logger)
|
88
|
+
-q, --quiet Run quietly
|
89
|
+
-v, --verbose Run verbosely
|
90
|
+
--debug Show debug messages
|
91
|
+
-h, --help Show help and examples
|
92
|
+
|
93
|
+
bioblastxmlparser filename(s)
|
94
|
+
|
95
|
+
Use --help switch for more information
|
96
|
+
|
97
|
+
=== Examples
|
98
|
+
|
99
|
+
Print result fields of iterations containing 'lcl', using a regex
|
100
|
+
|
101
|
+
blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
|
102
|
+
|
103
|
+
Print fields where bit_score > 145
|
104
|
+
|
105
|
+
blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
|
106
|
+
|
107
|
+
prints a tab delimited
|
108
|
+
|
109
|
+
1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
|
110
|
+
2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
|
111
|
+
3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
|
112
|
+
4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
|
113
|
+
|
114
|
+
The second and third column show the BLAST iteration, and the others
|
115
|
+
relate to the hits.
|
116
|
+
|
117
|
+
As this is evaluated Ruby, it is also possible to use the XML element
|
118
|
+
names directly
|
119
|
+
|
120
|
+
blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
|
121
|
+
|
122
|
+
And it is possible to print (non default) named fields where E-value < 0.001
|
123
|
+
and hit length > 100. E.g.
|
124
|
+
|
125
|
+
blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
126
|
+
|
127
|
+
1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
128
|
+
2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
129
|
+
3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
|
130
|
+
4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
|
131
|
+
5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
|
132
|
+
etc. etc.
|
133
|
+
|
134
|
+
prints the evalue and qseq columns. To output FASTA use --output-fasta
|
135
|
+
|
136
|
+
blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
137
|
+
|
138
|
+
which prints matching sequences, where the first field is the accession, followed
|
139
|
+
by query iteration id, and hit_id. E.g.
|
140
|
+
|
141
|
+
>I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
|
142
|
+
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
143
|
+
>I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
|
144
|
+
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
145
|
+
etc. etc.
|
146
|
+
|
147
|
+
To use the low-mem (iterated slower) version of the parser use
|
148
|
+
|
149
|
+
blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
150
|
+
|
75
151
|
== API (Ruby library)
|
76
152
|
|
77
153
|
To loop through a BLAST result:
|
78
154
|
|
79
155
|
>> require 'bio-blastxmlparser'
|
80
156
|
>> fn = 'test/data/nt_example_blastn.m7'
|
81
|
-
>> n = Bio::
|
157
|
+
>> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
82
158
|
>> n.each do | iter |
|
83
159
|
>> puts "Hits for " + iter.query_id
|
84
160
|
>> iter.each do | hit |
|
@@ -91,7 +167,7 @@ To loop through a BLAST result:
|
|
91
167
|
The next example parses XML using less memory by using a Ruby
|
92
168
|
Iterator
|
93
169
|
|
94
|
-
>> blast = Bio::
|
170
|
+
>> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
|
95
171
|
>> iter = blast.next
|
96
172
|
>> iter.iter_num
|
97
173
|
=> 1
|
@@ -175,87 +251,11 @@ etc. etc.
|
|
175
251
|
|
176
252
|
For more examples see the files in ./spec
|
177
253
|
|
178
|
-
== Command line usage
|
179
|
-
|
180
|
-
|
181
|
-
== Usage
|
182
|
-
blastxmlparser [options] file(s)
|
183
|
-
|
184
|
-
-p, --parser name Use full|split parser (default full)
|
185
|
-
--output-fasta Output FASTA
|
186
|
-
-n, --named fields Set named fields
|
187
|
-
-e, --exec filter Execute filter
|
188
|
-
|
189
|
-
--logger filename Log to file (default stderr)
|
190
|
-
--trace options Set log level (default INFO, see bio-logger)
|
191
|
-
-q, --quiet Run quietly
|
192
|
-
-v, --verbose Run verbosely
|
193
|
-
--debug Show debug messages
|
194
|
-
-h, --help Show help and examples
|
195
|
-
|
196
|
-
bioblastxmlparser filename(s)
|
197
|
-
|
198
|
-
Use --help switch for more information
|
199
|
-
|
200
|
-
== Examples
|
201
|
-
|
202
|
-
Print result fields of iterations containing 'lcl', using a regex
|
203
|
-
|
204
|
-
blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
|
205
|
-
|
206
|
-
Print fields where bit_score > 145
|
207
|
-
|
208
|
-
blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
|
209
|
-
|
210
|
-
prints a tab delimited
|
211
|
-
|
212
|
-
1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
|
213
|
-
2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
|
214
|
-
3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
|
215
|
-
4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
|
216
|
-
|
217
|
-
The second and third column show the BLAST iteration, and the others
|
218
|
-
relate to the hits.
|
219
|
-
|
220
|
-
As this is evaluated Ruby, it is also possible to use the XML element
|
221
|
-
names directly
|
222
|
-
|
223
|
-
blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
|
224
|
-
|
225
|
-
And it is possible to print (non default) named fields where E-value < 0.001
|
226
|
-
and hit length > 100. E.g.
|
227
|
-
|
228
|
-
blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
229
|
-
|
230
|
-
1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
231
|
-
2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
|
232
|
-
3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
|
233
|
-
4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
|
234
|
-
5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
|
235
|
-
etc. etc.
|
236
|
-
|
237
|
-
prints the evalue and qseq columns. To output FASTA use --output-fasta
|
238
|
-
|
239
|
-
blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
240
|
-
|
241
|
-
which prints matching sequences, where the first field is the accession, followed
|
242
|
-
by query iteration id, and hit_id. E.g.
|
243
|
-
|
244
|
-
>I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
|
245
|
-
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
246
|
-
>I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
|
247
|
-
AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
|
248
|
-
etc. etc.
|
249
|
-
|
250
|
-
To use the low-mem (iterated slower) version of the parser use
|
251
|
-
|
252
|
-
blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
|
253
|
-
|
254
254
|
== URL
|
255
255
|
|
256
256
|
The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
257
257
|
|
258
258
|
== Copyright
|
259
259
|
|
260
|
-
Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
|
260
|
+
Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
|
261
261
|
|
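Note on the -e examples in the README above: the README states that the filter is evaluated as Ruby, so the mechanism can be sketched with plain eval and a binding. This is a minimal sketch under assumptions, not the gem's implementation. The Struct records Iter, Hit and Hsp, the helpers matches? and select_hsps, and the SAMPLE_ITERS values are hypothetical stand-ins; only the filter strings are taken from the README.

```ruby
# Hypothetical records standing in for the parser's iteration, hit and HSP objects.
Hsp  = Struct.new(:bit_score, :evalue, :qseq)
Hit  = Struct.new(:hit_id, :len, :hsps)
Iter = Struct.new(:query_id, :hits)

# Evaluate a filter string with iter, hit and hsp in scope.
# The method parameters are visible to the filter through this binding.
def matches?(filter, iter, hit, hsp)
  eval(filter, binding) ? true : false
end

# Walk iterations -> hits -> HSPs and collect one row per HSP that passes the filter.
def select_hsps(iters, filter)
  rows = []
  iters.each do |iter|
    iter.hits.each do |hit|
      hit.hsps.each do |hsp|
        next unless matches?(filter, iter, hit, hsp)
        rows << [iter.query_id, hit.hit_id, hsp.evalue]
      end
    end
  end
  rows
end

# Made-up sample values, for illustration only.
SAMPLE_ITERS = [
  Iter.new('lcl|1_0', [
    Hit.new('lcl|I_1', 408, [Hsp.new(150.0, 5.82208e-34, 'AGTGAAGC')]),
    Hit.new('lcl|I_2', 35,  [Hsp.new(60.0,  2.76378e-11, 'AATATGGT')])
  ])
]

# Same filter string as in the README example.
select_hsps(SAMPLE_ITERS, 'hsp.evalue<0.01 and hit.len>100').each do |row|
  puts row.join("\t")
end
```

Evaluating a user-supplied string this way runs arbitrary Ruby, so the approach only suits a trusted command line.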
data/Rakefile
CHANGED
@@ -16,7 +16,7 @@ Jeweler::Tasks.new do |gem|
|
|
16
16
|
gem.homepage = "http://github.com/pjotrp/blastxmlparser"
|
17
17
|
gem.license = "MIT"
|
18
18
|
gem.summary = %Q{Very fast BLAST XML parser and library for big data}
|
19
|
-
gem.description = %Q{Fast big data XML parser and library
|
19
|
+
gem.description = %Q{Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby}
|
20
20
|
gem.email = "pjotr.public01@thebird.nl"
|
21
21
|
gem.authors = ["Pjotr Prins"]
|
22
22
|
# Include your dependencies below. Runtime dependencies are required when using your gem,
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
1.1.
|
1
|
+
1.1.1
|
data/bin/blastxmlparser
CHANGED
@@ -2,10 +2,9 @@
|
|
2
2
|
#
|
3
3
|
# BioRuby bio-blastxmlparser Plugin
|
4
4
|
# Author:: Pjotr Prins
|
5
|
-
# Copyright:: 2011
|
6
5
|
# License:: MIT License
|
7
6
|
#
|
8
|
-
# Copyright (C) 2010
|
7
|
+
# Copyright (C) 2010-2013 Pjotr Prins <pjotr.prins@thebird.nl>
|
9
8
|
|
10
9
|
rootpath = File.dirname(File.dirname(__FILE__))
|
11
10
|
$: << File.join(rootpath,'lib')
|
@@ -160,9 +159,9 @@ begin
|
|
160
159
|
ARGV.each do | fn |
|
161
160
|
logger.info("XML parsing #{fn}")
|
162
161
|
n = if options.parser == :split
|
163
|
-
Bio::
|
162
|
+
Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
|
164
163
|
else
|
165
|
-
Bio::
|
164
|
+
Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
166
165
|
end
|
167
166
|
i = 1
|
168
167
|
n.each do | iter |
|
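Note on the hunk above: both branches end in .to_enum, so the caller pulls one iteration at a time regardless of which parser is selected. The idea behind a low-memory split parser can be sketched as follows. This is an illustration under assumptions, not the code in xmlsplitter.rb: the class IterationSplitter and the SAMPLE_XML fragment are hypothetical, and the sketch assumes the opening and closing Iteration tags each appear on a line of the input.

```ruby
require 'stringio'

# Hypothetical stand-in for a splitting iterator (not the gem's XmlSplitter).
class IterationSplitter
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Yield one <Iteration>...</Iteration> block at a time, so only the
  # current block is held in memory rather than the whole document.
  def each
    return to_enum(:each) unless block_given?
    buf = nil
    @io.each_line do |line|
      buf = String.new if line.include?('<Iteration>')
      next if buf.nil?            # skip lines outside an Iteration block
      buf << line
      if line.include?('</Iteration>')
        yield buf
        buf = nil
      end
    end
  end
end

# Made-up XML fragment, for illustration only.
SAMPLE_XML = <<~XML
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
XML

blocks = IterationSplitter.new(StringIO.new(SAMPLE_XML)).to_enum
puts blocks.next   # reads the input only as far as the first closing tag
```

Each yielded block could then be handed to a DOM parser on its own, which is one way to trade some speed for a bounded memory footprint.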
data/bio-blastxmlparser.gemspec
CHANGED
@@ -5,12 +5,12 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "bio-blastxmlparser"
|
8
|
-
s.version = "1.1.
|
8
|
+
s.version = "1.1.1"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = "
|
13
|
-
s.description = "Fast big data XML parser and library
|
12
|
+
s.date = "2013-02-07"
|
13
|
+
s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby"
|
14
14
|
s.email = "pjotr.public01@thebird.nl"
|
15
15
|
s.executables = ["blastxmlparser"]
|
16
16
|
s.extra_rdoc_files = [
|
data/sample/blastxmlparserdemo.rb
CHANGED
@@ -5,7 +5,7 @@ $: << File.join(rootpath,'lib')
|
|
5
5
|
|
6
6
|
require 'bio-blastxmlparser'
|
7
7
|
fn = 'test/data/nt_example_blastn.m7'
|
8
|
-
n = Bio::
|
8
|
+
n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
|
9
9
|
n.each do | iter |
|
10
10
|
puts "Hits for " + iter.query_id
|
11
11
|
iter.each do | hit |
|
data/spec/bio-blastxmlparser_spec.rb
CHANGED
@@ -1,9 +1,9 @@
|
|
1
1
|
require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
|
2
2
|
|
3
3
|
TESTFILE = "./test/data/nt_example_blastn.m7"
|
4
|
-
include Bio::
|
4
|
+
include Bio::BlastXMLParser
|
5
5
|
|
6
|
-
describe "Bio::
|
6
|
+
describe "Bio::BlastXMLParser::NokogiriBlastXml" do
|
7
7
|
before(:all) do
|
8
8
|
n = NokogiriBlastXml.new(File.new(TESTFILE)).to_enum
|
9
9
|
@iter1 = n.next
|
@@ -75,8 +75,8 @@ describe "Bio::Blast::NokogiriBlastXml" do
|
|
75
75
|
end
|
76
76
|
end
|
77
77
|
|
78
|
-
describe Bio::
|
79
|
-
include Bio::
|
78
|
+
describe Bio::BlastXMLParser::XmlIterator do
|
79
|
+
include Bio::BlastXMLParser
|
80
80
|
it "should parse with Nokogiri" do
|
81
81
|
blast = XmlIterator.new(TESTFILE).to_enum
|
82
82
|
iter1 = blast.next
|
@@ -86,8 +86,8 @@ describe Bio::Blast::XmlIterator do
|
|
86
86
|
end
|
87
87
|
end
|
88
88
|
|
89
|
-
describe Bio::
|
90
|
-
include Bio::
|
89
|
+
describe Bio::BlastXMLParser::XmlSplitterIterator do
|
90
|
+
include Bio::BlastXMLParser
|
91
91
|
# it "should read a large file and yield Iterations" do
|
92
92
|
# s = XmlSplitter.new("./test/data/nt_example_blastn.m7")
|
93
93
|
# s.each do | result |
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-blastxmlparser
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.1.
|
4
|
+
version: 1.1.1
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2013-02-07 00:00:00.000000000Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bio-logger
|
16
|
-
requirement: &
|
16
|
+
requirement: &24214160 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,10 +21,10 @@ dependencies:
|
|
21
21
|
version: 1.0.0
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *24214160
|
25
25
|
- !ruby/object:Gem::Dependency
|
26
26
|
name: nokogiri
|
27
|
-
requirement: &
|
27
|
+
requirement: &24213120 !ruby/object:Gem::Requirement
|
28
28
|
none: false
|
29
29
|
requirements:
|
30
30
|
- - ! '>='
|
@@ -32,10 +32,10 @@ dependencies:
|
|
32
32
|
version: 1.5.0
|
33
33
|
type: :runtime
|
34
34
|
prerelease: false
|
35
|
-
version_requirements: *
|
35
|
+
version_requirements: *24213120
|
36
36
|
- !ruby/object:Gem::Dependency
|
37
37
|
name: rake
|
38
|
-
requirement: &
|
38
|
+
requirement: &24212220 !ruby/object:Gem::Requirement
|
39
39
|
none: false
|
40
40
|
requirements:
|
41
41
|
- - ! '>='
|
@@ -43,10 +43,10 @@ dependencies:
|
|
43
43
|
version: 0.9.2.2
|
44
44
|
type: :development
|
45
45
|
prerelease: false
|
46
|
-
version_requirements: *
|
46
|
+
version_requirements: *24212220
|
47
47
|
- !ruby/object:Gem::Dependency
|
48
48
|
name: bundler
|
49
|
-
requirement: &
|
49
|
+
requirement: &24211440 !ruby/object:Gem::Requirement
|
50
50
|
none: false
|
51
51
|
requirements:
|
52
52
|
- - ! '>='
|
@@ -54,10 +54,10 @@ dependencies:
|
|
54
54
|
version: '0'
|
55
55
|
type: :development
|
56
56
|
prerelease: false
|
57
|
-
version_requirements: *
|
57
|
+
version_requirements: *24211440
|
58
58
|
- !ruby/object:Gem::Dependency
|
59
59
|
name: jeweler
|
60
|
-
requirement: &
|
60
|
+
requirement: &24174660 !ruby/object:Gem::Requirement
|
61
61
|
none: false
|
62
62
|
requirements:
|
63
63
|
- - ~>
|
@@ -65,10 +65,10 @@ dependencies:
|
|
65
65
|
version: 1.8.4
|
66
66
|
type: :development
|
67
67
|
prerelease: false
|
68
|
-
version_requirements: *
|
68
|
+
version_requirements: *24174660
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
70
|
name: rspec
|
71
|
-
requirement: &
|
71
|
+
requirement: &24173840 !ruby/object:Gem::Requirement
|
72
72
|
none: false
|
73
73
|
requirements:
|
74
74
|
- - ! '>='
|
@@ -76,10 +76,10 @@ dependencies:
|
|
76
76
|
version: 2.3.0
|
77
77
|
type: :development
|
78
78
|
prerelease: false
|
79
|
-
version_requirements: *
|
79
|
+
version_requirements: *24173840
|
80
80
|
- !ruby/object:Gem::Dependency
|
81
81
|
name: rdoc
|
82
|
-
requirement: &
|
82
|
+
requirement: &24173100 !ruby/object:Gem::Requirement
|
83
83
|
none: false
|
84
84
|
requirements:
|
85
85
|
- - ! '>='
|
@@ -87,8 +87,9 @@ dependencies:
|
|
87
87
|
version: 2.4.2
|
88
88
|
type: :development
|
89
89
|
prerelease: false
|
90
|
-
version_requirements: *
|
91
|
-
description: Fast big data XML parser and library
|
90
|
+
version_requirements: *24173100
|
91
|
+
description: Fast big data BLAST XML parser and library; this libxml2 based version
|
92
|
+
is 50x faster than BioRuby
|
92
93
|
email: pjotr.public01@thebird.nl
|
93
94
|
executables:
|
94
95
|
- blastxmlparser
|
@@ -140,7 +141,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
140
141
|
version: '0'
|
141
142
|
segments:
|
142
143
|
- 0
|
143
|
-
hash: -
|
144
|
+
hash: -3287387609254152406
|
144
145
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
145
146
|
none: false
|
146
147
|
requirements:
|