bio-blastxmlparser 0.6.0 → 0.6.1

Sign up to get free protection for your applications and to get access to all the features.
data/README.rdoc CHANGED
@@ -2,50 +2,57 @@
2
2
 
3
3
  blastxmlparser is a fast big-data BLAST XML file parser. Rather than
4
4
  loading everything in memory, XML is parsed by BLAST query
5
- (Iteration). Not only has this the advantage of low memory use, it may
6
- also be faster when IO continues in parallel (disks read ahead).
5
+ (Iteration). Not only has this the advantage of low memory use, it
6
+ also shows results early, and it may be faster when IO continues in
7
+ parallel (disk read-ahead).
7
8
 
8
9
  Next to the API, blastxmlparser comes as a command line utility, which
9
10
  can be used to filter results and requires no understanding of Ruby.
10
11
 
11
12
  == Performance
12
13
 
13
- XML parsing is expensive. blastxmlparser uses the Nokogiri C, or Java, XML
14
- parser, based on libxml2. Basically a DOM parser is used for subsections of a
15
- document, tests show this is faster than a SAX parser with Ruby callbacks. To
14
+ XML parsing is expensive. blastxmlparser uses the fast Nokogiri C, or Java, XML
15
+ parsers, based on libxml2. Basically, a DOM parser is used for subsections of a
16
+ document. Tests show this is faster than a SAX parser with Ruby callbacks. To
16
17
  see why libxml2 based Nokogiri is fast, see
17
18
  http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
18
19
  http://www.xml.com/lpt/a/1703.
19
20
 
20
21
  The parser is also designed with other optimizations, such as lazy evaluation,
21
- only creating objects when required, and (future) parallelization. When parsing
22
+ only creating objects when required, and (in a future version) parallelization. When parsing
22
23
  a full BLAST result usually only a few fields are used. By using XPath queries
23
24
  only the relevant fields are queried.
24
25
 
25
26
  Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
26
27
 
27
- Nokogiri DOM (default)
28
+ bio-blastxmlparser + Nokogiri DOM (default)
28
29
 
29
- real 0m1.259s
30
- user 0m1.052s
31
- sys 0m0.144s
30
+ real 0m1.259s
31
+ user 0m1.052s
32
+ sys 0m0.144s
32
33
 
33
- Nokogiri split DOM
34
+ bio-blastxmlparser + Nokogiri split DOM
34
35
 
35
- real 0m1.713s
36
- user 0m1.444s
37
- sys 0m0.160s
36
+ real 0m1.713s
37
+ user 0m1.444s
38
+ sys 0m0.160s
38
39
 
39
- BioRuby ReXML DOM parser
40
+ BioRuby ReXML DOM parser
40
41
 
41
- real 1m14.548s
42
- user 1m13.065s
43
- sys 0m0.472s
42
+ real 1m14.548s
43
+ user 1m13.065s
44
+ sys 0m0.472s
44
45
 
45
46
  == Install
46
47
 
48
+ Quick install:
49
+
47
50
  gem install bio-blastxmlparser
48
51
 
52
+ Important: the parser is written for Ruby >= 1.9. You can check with
53
+
54
+ gem env
55
+
49
56
  Nokogiri XML parser is required. To install it,
50
57
  the libxml2 libraries and headers need to be installed first, for
51
58
  example on Debian:
@@ -56,7 +63,7 @@ example on Debian:
56
63
  for more installation on other platforms see
57
64
  http://nokogiri.org/tutorials/installing_nokogiri.html.
58
65
 
59
- == API
66
+ == API (Ruby library)
60
67
 
61
68
  To loop through a BLAST result:
62
69
 
@@ -72,12 +79,13 @@ To loop through a BLAST result:
72
79
  >> end
73
80
  >> end
74
81
 
75
- The next example parses XML using less memory
82
+ The next example parses XML using less memory by using a Ruby
83
+ Iterator
76
84
 
77
- >> blast = XmlSplitterIterator.new(fn).to_enum
85
+ >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
78
86
  >> iter = blast.next
79
87
  >> iter.iter_num
80
- >> 1
88
+ => 1
81
89
  >> iter.query_id
82
90
  => "lcl|1_0"
83
91
 
@@ -132,14 +140,19 @@ Get the first Hsp
132
140
  >> hsp.midline
133
141
  => "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"
134
142
 
135
- It is possible to use the XML element names, over methods. E.g.
143
+ Unlike BioRuby, this module uses the actual element names in the XML
144
+ definition, to avoid confusion (if anyone wants a translation,
145
+ feel free to contribute an adaptor).
146
+
147
+ It is also possible to use the XML element names as Strings, rather
148
+ than methods. E.g.
136
149
 
137
150
  >> hsp.field("Hsp_bit-score")
138
151
  => "145.205"
139
152
  >> hsp["Hsp_bit-score"]
140
153
  => "145.205"
141
154
 
142
- Note that these are always String values.
155
+ Note that, when using the element names, the results are always String values.
143
156
 
144
157
  Fetch the next result (Iteration)
145
158
 
@@ -153,11 +166,14 @@ etc. etc.
153
166
 
154
167
  For more examples see the files in ./spec
155
168
 
156
- == Usage
169
+ == Command line usage
157
170
 
171
+
172
+ == Usage
158
173
  blastxmlparser [options] file(s)
159
174
 
160
175
  -p, --parser name Use full|split parser (default full)
176
+ --output-fasta Output FASTA
161
177
  -n, --named fields Set named fields
162
178
  -e, --exec filter Execute filter
163
179
 
@@ -182,11 +198,23 @@ Print fields where bit_score > 145
182
198
 
183
199
  blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
184
200
 
185
- It is also possible to use the XML element names directly
201
+ prints a tab delimited
202
+
203
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
204
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
205
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
206
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
207
+
208
+ The second and third column show the BLAST iteration, and the others
209
+ relate to the hits.
210
+
211
+ As this is evaluated Ruby, it is also possible to use the XML element
212
+ names directly
186
213
 
187
214
  blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
188
215
 
189
- Print named fields where E-value < 0.001 and hit length > 100
216
+ And it is possible to print (non default) named fields where E-value < 0.001
217
+ and hit length > 100. E.g.
190
218
 
191
219
  blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
192
220
 
@@ -197,7 +225,20 @@ Print named fields where E-value < 0.001 and hit length > 100
197
225
  5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
198
226
  etc. etc.
199
227
 
200
- To use the low-mem version use
228
+ prints the evalue and qseq columns. To output FASTA use --output-fasta
229
+
230
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
231
+
232
+ which prints matching sequences, where the first field is the accession, followed
233
+ by query iteration id, and hit_id. E.g.
234
+
235
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
236
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
237
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
238
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
239
+ etc. etc.
240
+
241
+ To use the low-mem (iterated slower) version of the parser use
201
242
 
202
243
  blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
203
244
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.6.0
1
+ 0.6.1
data/bin/blastxmlparser CHANGED
@@ -30,11 +30,23 @@ Print fields where bit_score > 145
30
30
 
31
31
  blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
32
32
 
33
- It is also possible to use the XML element names directly
33
+ prints a tab delimited
34
+
35
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
36
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
37
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
38
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
39
+
40
+ The second and third column show the BLAST iteration, and the others
41
+ relate to the hits.
42
+
43
+ As this is evaluated Ruby, it is also possible to use the XML element
44
+ names directly
34
45
 
35
46
  blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
36
47
 
37
- Print named fields where E-value < 0.001 and hit length > 100
48
+ And it is possible to print (non default) named fields where E-value < 0.001
49
+ and hit length > 100. E.g.
38
50
 
39
51
  blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
40
52
 
@@ -45,7 +57,20 @@ Print named fields where E-value < 0.001 and hit length > 100
45
57
  5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
46
58
  etc. etc.
47
59
 
48
- To use the low-mem version use
60
+ prints the evalue and qseq columns. To output FASTA use --output-fasta
61
+
62
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
63
+
64
+ which prints matching sequences, where the first field is the accession, followed
65
+ by query iteration id, and hit_id. E.g.
66
+
67
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
68
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
69
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
70
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
71
+ etc. etc.
72
+
73
+ To use the low-mem (iterated slower) version of the parser use
49
74
 
50
75
  blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
51
76
 
@@ -90,6 +115,10 @@ opts = OptionParser.new do |o|
90
115
  options.parser = p.to_sym
91
116
  end
92
117
 
118
+ o.on("--output-fasta","Output FASTA") do |b|
119
+ options.output_fasta = true
120
+ end
121
+
93
122
  o.on("-n fields","--named fields",String, "Set named fields") do |s|
94
123
  options.fields = s.split(/,/)
95
124
  end
@@ -145,14 +174,19 @@ begin
145
174
  true
146
175
  end
147
176
  if do_print
148
- if options.fields
149
- print i,"\t"
150
- options.fields.each do | f |
151
- print eval(f),"\t"
152
- end
153
- print "\n"
177
+ if options.output_fasta
178
+ print ">"+hit.accession+' '+iter.iter_num.to_s+'|'+iter.query_id+' '+hit.hit_id+' '+hit.hit_def+"\n"
179
+ print hsp.qseq+"\n"
154
180
  else
155
- print [i,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t"),"\n"
181
+ if options.fields
182
+ print i,"\t"
183
+ options.fields.each do | f |
184
+ print eval(f),"\t"
185
+ end
186
+ print "\n"
187
+ else
188
+ print [i,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t"),"\n"
189
+ end
156
190
  end
157
191
  i += 1
158
192
  end
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{bio-blastxmlparser}
8
- s.version = "0.6.0"
8
+ s.version = "0.6.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Pjotr Prins"]
12
- s.date = %q{2011-02-14}
12
+ s.date = %q{2011-04-26}
13
13
  s.default_executable = %q{blastxmlparser}
14
14
  s.description = %q{Fast big data XML parser and library, written in Ruby}
15
15
  s.email = %q{pjotr.public01@thebird.nl}
@@ -10,6 +10,8 @@ else
10
10
  end
11
11
  require 'bio-logger'
12
12
 
13
+ require 'enumerator'
14
+
13
15
  Bio::Log::LoggerPlus.new('bio-blastxmlparser')
14
16
 
15
17
  require 'bio/db/blast/parser/nokogiri'
@@ -1,8 +1,12 @@
1
+ require 'enumerator'
2
+
1
3
  module Bio
2
4
  module Blast
3
5
  # Reads a full XML result and splits it out into a buffer for each
4
6
  # Iteration (query result).
5
7
  class XmlSplitterIterator
8
+ # include Enumerable
9
+
6
10
  def initialize fn
7
11
  @fn = fn
8
12
  end
metadata CHANGED
@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
5
5
  segments:
6
6
  - 0
7
7
  - 6
8
- - 0
9
- version: 0.6.0
8
+ - 1
9
+ version: 0.6.1
10
10
  platform: ruby
11
11
  authors:
12
12
  - Pjotr Prins
@@ -14,7 +14,7 @@ autorequire:
14
14
  bindir: bin
15
15
  cert_chain: []
16
16
 
17
- date: 2011-02-14 00:00:00 +01:00
17
+ date: 2011-04-26 00:00:00 +02:00
18
18
  default_executable: blastxmlparser
19
19
  dependencies:
20
20
  - !ruby/object:Gem::Dependency
@@ -156,7 +156,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
156
156
  requirements:
157
157
  - - ">="
158
158
  - !ruby/object:Gem::Version
159
- hash: 4630273
159
+ hash: 169663261
160
160
  segments:
161
161
  - 0
162
162
  version: "0"