bio-blastxmlparser 1.1.0 → 1.1.1

data/README.rdoc CHANGED
@@ -28,10 +28,11 @@ see why libxml2 based Nokogiri is fast, see
28
28
  http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
29
29
  http://www.xml.com/lpt/a/1703.
30
30
 
31
- The parser is also designed with other optimizations, such as lazy evaluation,
32
- only creating objects when required, and (in a future version) parallelization. When parsing
33
- a full BLAST result usually only a few fields are used. By using XPath queries
34
- only the relevant fields are queried.
31
+ The parser is also designed with other optimizations, such as lazy
32
+ evaluation, i.e. only creating objects when required, and (in a future
33
+ version) parallelization. When parsing a full BLAST result usually
34
+ only a few fields are used. By using XPath queries only the relevant
35
+ fields are queried.
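The lazy evaluation described in this paragraph can be sketched roughly as follows. This is only an illustration, not the gem's code: `LazyRecord`, its `lookups` counter, and the Hash standing in for an XML node are all hypothetical. Each field is looked up and converted on first access only, then served from a cache.

```ruby
# Hypothetical illustration of lazy field access; not the gem's actual code.
class LazyRecord
  attr_reader :lookups

  # raw is a Hash standing in for an XML node (an assumption for this sketch).
  # Each key plays the role of an element an XPath query would locate on demand.
  def initialize(raw)
    @raw = raw
    @cache = {}
    @lookups = 0
  end

  # Look a field up only when asked, convert it with the caller's block,
  # and remember the converted value for later calls.
  def field(name)
    @cache.fetch(name) do
      @lookups += 1
      @cache[name] = yield(@raw.fetch(name))
    end
  end

  def bit_score
    field("Hsp_bit-score") { |text| text.to_f }
  end

  def evalue
    field("Hsp_evalue") { |text| text.to_f }
  end
end

hsp = LazyRecord.new("Hsp_bit-score" => "145.7",
                     "Hsp_evalue"    => "5.82208e-34",
                     "Hsp_qseq"      => "AGTGAAGC")
hsp.bit_score
hsp.bit_score
puts hsp.lookups   # only one lookup was performed; Hsp_qseq was never touched
```

Fields that are never requested cost nothing beyond holding the raw node, which is the point when only a few fields of a large result are used.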
35
36
 
36
37
  Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
37
38
 
@@ -47,7 +48,7 @@ Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
47
48
  user 0m1.444s
48
49
  sys 0m0.160s
49
50
 
50
- BioRuby ReXML DOM parser
51
+ BioRuby ReXML DOM parser (old style)
51
52
 
52
53
  real 1m14.548s
53
54
  user 1m13.065s
@@ -72,13 +73,88 @@ example on Debian:
72
73
  for more installation on other platforms see
73
74
  http://nokogiri.org/tutorials/installing_nokogiri.html.
74
75
 
76
+ == Command line usage
77
+
78
+ === Usage
79
+ blastxmlparser [options] file(s)
80
+
81
+ -p, --parser name Use full|split parser (default full)
82
+ --output-fasta Output FASTA
83
+ -n, --named fields Set named fields
84
+ -e, --exec filter Execute filter
85
+
86
+ --logger filename Log to file (default stderr)
87
+ --trace options Set log level (default INFO, see bio-logger)
88
+ -q, --quiet Run quietly
89
+ -v, --verbose Run verbosely
90
+ --debug Show debug messages
91
+ -h, --help Show help and examples
92
+
93
+ bioblastxmlparser filename(s)
94
+
95
+ Use --help switch for more information
96
+
97
+ === Examples
98
+
99
+ Print result fields of iterations containing 'lcl', using a regex
100
+
101
+ blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
102
+
103
+ Print fields where bit_score > 145
104
+
105
+ blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
106
+
107
+ prints a tab delimited
108
+
109
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
110
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
111
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
112
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
113
+
114
+ The second and third column show the BLAST iteration, and the others
115
+ relate to the hits.
116
+
117
+ As this is evaluated Ruby, it is also possible to use the XML element
118
+ names directly
119
+
120
+ blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
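How a filter string of this kind might be turned into a predicate is sketched below. The `Hsp` and `Hit` classes and the `compile_filter` helper are hypothetical stand-ins, and the gem's own implementation may differ. The only point taken from the README is that the expression is evaluated as Ruby with `hsp` and `hit` in scope.

```ruby
# Hypothetical stand-ins for the parser's objects (assumptions for this sketch).
class Hsp
  attr_reader :bit_score, :evalue

  def initialize(fields)
    @fields    = fields
    @bit_score = fields["Hsp_bit-score"].to_f
    @evalue    = fields["Hsp_evalue"].to_f
  end

  # Raw access by XML element name, returning the element text as a String.
  def [](name)
    @fields[name]
  end
end

Hit = Struct.new(:len)

# Wrap the filter expression in a lambda so that `hit` and `hsp` are visible
# to it. The string runs as ordinary Ruby, so it must come from a trusted user.
def compile_filter(expr)
  eval("lambda { |hit, hsp| #{expr} }")
end

keep = compile_filter('hsp.evalue<0.01 and hit.len>100')
hsp  = Hsp.new("Hsp_bit-score" => "145.7", "Hsp_evalue" => "5.82208e-34")
puts keep.call(Hit.new(144), hsp)
```

Because raw element access returns text, a conversion such as `.to_i` or `.to_f` is needed before a numeric comparison. `.to_i` truncates, so a bit score of 145.7 would not pass `.to_i>145`.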
121
+
122
+ And it is possible to print (non default) named fields where E-value < 0.001
123
+ and hit length > 100. E.g.
124
+
125
+ blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
126
+
127
+ 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
128
+ 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
129
+ 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
130
+ 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
131
+ 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
132
+ etc. etc.
133
+
134
+ prints the evalue and qseq columns. To output FASTA use --output-fasta
135
+
136
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
137
+
138
+ which prints matching sequences, where the first field is the accession, followed
139
+ by query iteration id, and hit_id. E.g.
140
+
141
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
142
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
143
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
144
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
145
+ etc. etc.
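For downstream scripts, a header line in the layout shown above could be split back into its parts along these lines. `parse_fasta_header` is a hypothetical helper. It assumes the fields are separated by whitespace and that everything after the hit id is a free-text description, which the README does not state explicitly.

```ruby
# Hypothetical helper: split a header like
#   >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
# into accession, iteration number, query id, hit id and trailing description.
def parse_fasta_header(line)
  accession, query, hit_id, description =
    line.chomp.sub(/\A>/, "").split(" ", 4)
  # The second field is assumed to be "<iteration number>|<query id>".
  iter_num, query_id = query.split("|", 2)
  {
    accession:   accession,
    iter_num:    iter_num.to_i,
    query_id:    query_id,
    hit_id:      hit_id,
    description: description
  }
end

header = ">I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)"
record = parse_fasta_header(header)
puts record[:accession]
puts record[:query_id]
```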
146
+
147
+ To use the low-mem (iterated slower) version of the parser use
148
+
149
+ blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
150
+
75
151
  == API (Ruby library)
76
152
 
77
153
  To loop through a BLAST result:
78
154
 
79
155
  >> require 'bio-blastxmlparser'
80
156
  >> fn = 'test/data/nt_example_blastn.m7'
81
- >> n = Bio::Blast::XmlIterator.new(fn).to_enum
157
+ >> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
82
158
  >> n.each do | iter |
83
159
  >> puts "Hits for " + iter.query_id
84
160
  >> iter.each do | hit |
@@ -91,7 +167,7 @@ To loop through a BLAST result:
91
167
  The next example parses XML using less memory by using a Ruby
92
168
  Iterator
93
169
 
94
- >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
170
+ >> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
95
171
  >> iter = blast.next
96
172
  >> iter.iter_num
97
173
  => 1
@@ -175,87 +251,11 @@ etc. etc.
175
251
 
176
252
  For more examples see the files in ./spec
177
253
 
178
- == Command line usage
179
-
180
-
181
- == Usage
182
- blastxmlparser [options] file(s)
183
-
184
- -p, --parser name Use full|split parser (default full)
185
- --output-fasta Output FASTA
186
- -n, --named fields Set named fields
187
- -e, --exec filter Execute filter
188
-
189
- --logger filename Log to file (default stderr)
190
- --trace options Set log level (default INFO, see bio-logger)
191
- -q, --quiet Run quietly
192
- -v, --verbose Run verbosely
193
- --debug Show debug messages
194
- -h, --help Show help and examples
195
-
196
- bioblastxmlparser filename(s)
197
-
198
- Use --help switch for more information
199
-
200
- == Examples
201
-
202
- Print result fields of iterations containing 'lcl', using a regex
203
-
204
- blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
205
-
206
- Print fields where bit_score > 145
207
-
208
- blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
209
-
210
- prints a tab delimited
211
-
212
- 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
213
- 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
214
- 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
215
- 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
216
-
217
- The second and third column show the BLAST iteration, and the others
218
- relate to the hits.
219
-
220
- As this is evaluated Ruby, it is also possible to use the XML element
221
- names directly
222
-
223
- blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
224
-
225
- And it is possible to print (non default) named fields where E-value < 0.001
226
- and hit length > 100. E.g.
227
-
228
- blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
229
-
230
- 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
231
- 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
232
- 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
233
- 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
234
- 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
235
- etc. etc.
236
-
237
- prints the evalue and qseq columns. To output FASTA use --output-fasta
238
-
239
- blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
240
-
241
- which prints matching sequences, where the first field is the accession, followed
242
- by query iteration id, and hit_id. E.g.
243
-
244
- >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
245
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
246
- >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
247
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
248
- etc. etc.
249
-
250
- To use the low-mem (iterated slower) version of the parser use
251
-
252
- blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
253
-
254
254
  == URL
255
255
 
256
256
  The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
257
257
 
258
258
  == Copyright
259
259
 
260
- Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
260
+ Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
261
261
 
data/Rakefile CHANGED
@@ -16,7 +16,7 @@ Jeweler::Tasks.new do |gem|
16
16
  gem.homepage = "http://github.com/pjotrp/blastxmlparser"
17
17
  gem.license = "MIT"
18
18
  gem.summary = %Q{Very fast BLAST XML parser and library for big data}
19
- gem.description = %Q{Fast big data XML parser and library, libxml2 based 50x faster than BioRuby}
19
+ gem.description = %Q{Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby}
20
20
  gem.email = "pjotr.public01@thebird.nl"
21
21
  gem.authors = ["Pjotr Prins"]
22
22
  # Include your dependencies below. Runtime dependencies are required when using your gem,
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.1.0
1
+ 1.1.1
data/bin/blastxmlparser CHANGED
@@ -2,10 +2,9 @@
2
2
  #
3
3
  # BioRuby bio-blastxmlparser Plugin
4
4
  # Author:: Pjotr Prins
5
- # Copyright:: 2011
6
5
  # License:: MIT License
7
6
  #
8
- # Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
7
+ # Copyright (C) 2010-2013 Pjotr Prins <pjotr.prins@thebird.nl>
9
8
 
10
9
  rootpath = File.dirname(File.dirname(__FILE__))
11
10
  $: << File.join(rootpath,'lib')
@@ -160,9 +159,9 @@ begin
160
159
  ARGV.each do | fn |
161
160
  logger.info("XML parsing #{fn}")
162
161
  n = if options.parser == :split
163
- Bio::Blast::XmlSplitterIterator.new(fn).to_enum
162
+ Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
164
163
  else
165
- Bio::Blast::XmlIterator.new(fn).to_enum
164
+ Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
166
165
  end
167
166
  i = 1
168
167
  n.each do | iter |
@@ -5,12 +5,12 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "bio-blastxmlparser"
8
- s.version = "1.1.0"
8
+ s.version = "1.1.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Pjotr Prins"]
12
- s.date = "2012-08-08"
13
- s.description = "Fast big data XML parser and library, libxml2 based 50x faster than BioRuby"
12
+ s.date = "2013-02-07"
13
+ s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby"
14
14
  s.email = "pjotr.public01@thebird.nl"
15
15
  s.executables = ["blastxmlparser"]
16
16
  s.extra_rdoc_files = [
@@ -3,7 +3,7 @@ require 'nokogiri'
3
3
  require 'enumerator'
4
4
 
5
5
  module Bio
6
- module Blast
6
+ module BlastXMLParser
7
7
 
8
8
  module XPath
9
9
  def field name
@@ -1,7 +1,7 @@
1
1
 
2
2
 
3
3
  module Bio
4
- module Blast
4
+ module BlastXMLParser
5
5
 
6
6
  # Iterate a BLAST file yielding (lazy) results
7
7
  class XmlIterator
@@ -1,7 +1,7 @@
1
1
  require 'enumerator'
2
2
 
3
3
  module Bio
4
- module Blast
4
+ module BlastXMLParser
5
5
  # Reads a full XML result and splits it out into a buffer for each
6
6
  # Iteration (query result).
7
7
  class XmlSplitterIterator
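The comment above describes splitting a full result into one buffer per Iteration. A minimal sketch of that idea follows, assuming each `<Iteration>` and `</Iteration>` tag sits on its own line. `IterationSplitter` is a hypothetical class written for illustration and is not the gem's `XmlSplitterIterator`.

```ruby
require 'stringio'

# Hypothetical splitter: reads a stream line by line and yields the text of
# one <Iteration> element at a time, so only one buffer is held in memory.
class IterationSplitter
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    return to_enum(:each) unless block_given?
    buf = nil
    @io.each_line do |line|
      buf = +"" if line.include?("<Iteration>")   # start a fresh buffer
      buf << line if buf
      if buf && line.include?("</Iteration>")     # element complete
        yield buf
        buf = nil
      end
    end
  end
end

xml = <<~XML
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
XML

blast = IterationSplitter.new(StringIO.new(xml)).to_enum
first = blast.next
puts first[/<Iteration_iter-num>(\d+)</, 1]
```

Calling `next` on the enumerator pulls one buffer at a time, which mirrors the `blast.next` usage in the README example.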
@@ -5,7 +5,7 @@ $: << File.join(rootpath,'lib')
5
5
 
6
6
  require 'bio-blastxmlparser'
7
7
  fn = 'test/data/nt_example_blastn.m7'
8
- n = Bio::Blast::XmlIterator.new(fn).to_enum
8
+ n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
9
9
  n.each do | iter |
10
10
  puts "Hits for " + iter.query_id
11
11
  iter.each do | hit |
@@ -1,9 +1,9 @@
1
1
  require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
2
2
 
3
3
  TESTFILE = "./test/data/nt_example_blastn.m7"
4
- include Bio::Blast
4
+ include Bio::BlastXMLParser
5
5
 
6
- describe "Bio::Blast::NokogiriBlastXml" do
6
+ describe "Bio::BlastXMLParser::NokogiriBlastXml" do
7
7
  before(:all) do
8
8
  n = NokogiriBlastXml.new(File.new(TESTFILE)).to_enum
9
9
  @iter1 = n.next
@@ -75,8 +75,8 @@ describe "Bio::Blast::NokogiriBlastXml" do
75
75
  end
76
76
  end
77
77
 
78
- describe Bio::Blast::XmlIterator do
79
- include Bio::Blast
78
+ describe Bio::BlastXMLParser::XmlIterator do
79
+ include Bio::BlastXMLParser
80
80
  it "should parse with Nokogiri" do
81
81
  blast = XmlIterator.new(TESTFILE).to_enum
82
82
  iter1 = blast.next
@@ -86,8 +86,8 @@ describe Bio::Blast::XmlIterator do
86
86
  end
87
87
  end
88
88
 
89
- describe Bio::Blast::XmlSplitterIterator do
90
- include Bio::Blast
89
+ describe Bio::BlastXMLParser::XmlSplitterIterator do
90
+ include Bio::BlastXMLParser
91
91
  # it "should read a large file and yield Iterations" do
92
92
  # s = XmlSplitter.new("./test/data/nt_example_blastn.m7")
93
93
  # s.each do | result |
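Callers written against 1.1.0 refer to `Bio::Blast::XmlIterator`, which this release moves to `Bio::BlastXMLParser::XmlIterator`. Code that has to run against either version could resolve the class at run time, as in the sketch below. The `xml_iterator_class` helper is hypothetical, and the stub module only stands in for the library so that the example runs without the gem; in real use it would be replaced by `require 'bio-blastxmlparser'`.

```ruby
# Stub standing in for the 1.1.1 library (an assumption for this sketch).
module Bio
  module BlastXMLParser
    class XmlIterator
      attr_reader :fn

      def initialize(fn)
        @fn = fn
      end
    end
  end
end

# Hypothetical helper: prefer the new namespace, fall back to the old one.
# `defined?` returns nil rather than raising when a constant is missing.
def xml_iterator_class
  if defined?(Bio::BlastXMLParser::XmlIterator)
    Bio::BlastXMLParser::XmlIterator
  elsif defined?(Bio::Blast::XmlIterator)
    Bio::Blast::XmlIterator
  else
    raise NameError, "bio-blastxmlparser is not loaded"
  end
end

puts xml_iterator_class.name
```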
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-blastxmlparser
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.0
4
+ version: 1.1.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-08-08 00:00:00.000000000Z
12
+ date: 2013-02-07 00:00:00.000000000Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bio-logger
16
- requirement: &14068420 !ruby/object:Gem::Requirement
16
+ requirement: &24214160 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: 1.0.0
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *14068420
24
+ version_requirements: *24214160
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: nokogiri
27
- requirement: &14067240 !ruby/object:Gem::Requirement
27
+ requirement: &24213120 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: 1.5.0
33
33
  type: :runtime
34
34
  prerelease: false
35
- version_requirements: *14067240
35
+ version_requirements: *24213120
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: rake
38
- requirement: &14066000 !ruby/object:Gem::Requirement
38
+ requirement: &24212220 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: 0.9.2.2
44
44
  type: :development
45
45
  prerelease: false
46
- version_requirements: *14066000
46
+ version_requirements: *24212220
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: bundler
49
- requirement: &14064760 !ruby/object:Gem::Requirement
49
+ requirement: &24211440 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: '0'
55
55
  type: :development
56
56
  prerelease: false
57
- version_requirements: *14064760
57
+ version_requirements: *24211440
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: jeweler
60
- requirement: &14063640 !ruby/object:Gem::Requirement
60
+ requirement: &24174660 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
65
65
  version: 1.8.4
66
66
  type: :development
67
67
  prerelease: false
68
- version_requirements: *14063640
68
+ version_requirements: *24174660
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: rspec
71
- requirement: &14062500 !ruby/object:Gem::Requirement
71
+ requirement: &24173840 !ruby/object:Gem::Requirement
72
72
  none: false
73
73
  requirements:
74
74
  - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
76
76
  version: 2.3.0
77
77
  type: :development
78
78
  prerelease: false
79
- version_requirements: *14062500
79
+ version_requirements: *24173840
80
80
  - !ruby/object:Gem::Dependency
81
81
  name: rdoc
82
- requirement: &14055400 !ruby/object:Gem::Requirement
82
+ requirement: &24173100 !ruby/object:Gem::Requirement
83
83
  none: false
84
84
  requirements:
85
85
  - - ! '>='
@@ -87,8 +87,9 @@ dependencies:
87
87
  version: 2.4.2
88
88
  type: :development
89
89
  prerelease: false
90
- version_requirements: *14055400
91
- description: Fast big data XML parser and library, libxml2 based 50x faster than BioRuby
90
+ version_requirements: *24173100
91
+ description: Fast big data BLAST XML parser and library; this libxml2 based version
92
+ is 50x faster than BioRuby
92
93
  email: pjotr.public01@thebird.nl
93
94
  executables:
94
95
  - blastxmlparser
@@ -140,7 +141,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
140
141
  version: '0'
141
142
  segments:
142
143
  - 0
143
- hash: -1696395694674995706
144
+ hash: -3287387609254152406
144
145
  required_rubygems_version: !ruby/object:Gem::Requirement
145
146
  none: false
146
147
  requirements: