bio-blastxmlparser 1.1.0 → 1.1.1

data/README.rdoc CHANGED
@@ -28,10 +28,11 @@ see why libxml2 based Nokogiri is fast, see
28
28
  http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
29
29
  http://www.xml.com/lpt/a/1703.
30
30
 
31
- The parser is also designed with other optimizations, such as lazy evaluation,
32
- only creating objects when required, and (in a future version) parallelization. When parsing
33
- a full BLAST result usually only a few fields are used. By using XPath queries
34
- only the relevant fields are queried.
31
+ The parser is also designed with other optimizations, such as lazy
32
+ evaluation, i.e. only creating objects when required, and (in a future
33
+ version) parallelization. When parsing a full BLAST result usually
34
+ only a few fields are used. By using XPath queries only the relevant
35
+ fields are queried.
35
36
 
36
37
  Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
37
38
 
@@ -47,7 +48,7 @@ Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
47
48
  user 0m1.444s
48
49
  sys 0m0.160s
49
50
 
50
- BioRuby ReXML DOM parser
51
+ BioRuby ReXML DOM parser (old style)
51
52
 
52
53
  real 1m14.548s
53
54
  user 1m13.065s
@@ -72,13 +73,88 @@ example on Debian:
72
73
  for more installation on other platforms see
73
74
  http://nokogiri.org/tutorials/installing_nokogiri.html.
74
75
 
76
+ == Command line usage
77
+
78
+ === Usage
79
+ blastxmlparser [options] file(s)
80
+
81
+ -p, --parser name Use full|split parser (default full)
82
+ --output-fasta Output FASTA
83
+ -n, --named fields Set named fields
84
+ -e, --exec filter Execute filter
85
+
86
+ --logger filename Log to file (default stderr)
87
+ --trace options Set log level (default INFO, see bio-logger)
88
+ -q, --quiet Run quietly
89
+ -v, --verbose Run verbosely
90
+ --debug Show debug messages
91
+ -h, --help Show help and examples
92
+
93
+ bioblastxmlparser filename(s)
94
+
95
+ Use --help switch for more information
96
+
97
+ === Examples
98
+
99
+ Print result fields of iterations containing 'lcl', using a regex
100
+
101
+ blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
102
+
103
+ Print fields where bit_score > 145
104
+
105
+ blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
106
+
107
+ prints a tab delimited
108
+
109
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
110
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
111
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
112
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
113
+
114
+ The second and third column show the BLAST iteration, and the others
115
+ relate to the hits.
116
+
117
+ As this is evaluated Ruby, it is also possible to use the XML element
118
+ names directly
119
+
120
+ blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
121
+
122
+ And it is possible to print (non default) named fields where E-value < 0.001
123
+ and hit length > 100. E.g.
124
+
125
+ blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
126
+
127
+ 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
128
+ 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
129
+ 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
130
+ 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
131
+ 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
132
+ etc. etc.
133
+
134
+ prints the evalue and qseq columns. To output FASTA use --output-fasta
135
+
136
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
137
+
138
+ which prints matching sequences, where the first field is the accession, followed
139
+ by query iteration id, and hit_id. E.g.
140
+
141
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
142
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
143
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
144
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
145
+ etc. etc.
146
+
147
+ To use the low-mem (iterated slower) version of the parser use
148
+
149
+ blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
150
+
75
151
  == API (Ruby library)
76
152
 
77
153
  To loop through a BLAST result:
78
154
 
79
155
  >> require 'bio-blastxmlparser'
80
156
  >> fn = 'test/data/nt_example_blastn.m7'
81
- >> n = Bio::Blast::XmlIterator.new(fn).to_enum
157
+ >> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
82
158
  >> n.each do | iter |
83
159
  >> puts "Hits for " + iter.query_id
84
160
  >> iter.each do | hit |
@@ -91,7 +167,7 @@ To loop through a BLAST result:
91
167
  The next example parses XML using less memory by using a Ruby
92
168
  Iterator
93
169
 
94
- >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
170
+ >> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
95
171
  >> iter = blast.next
96
172
  >> iter.iter_num
97
173
  => 1
@@ -175,87 +251,11 @@ etc. etc.
175
251
 
176
252
  For more examples see the files in ./spec
177
253
 
178
- == Command line usage
179
-
180
-
181
- == Usage
182
- blastxmlparser [options] file(s)
183
-
184
- -p, --parser name Use full|split parser (default full)
185
- --output-fasta Output FASTA
186
- -n, --named fields Set named fields
187
- -e, --exec filter Execute filter
188
-
189
- --logger filename Log to file (default stderr)
190
- --trace options Set log level (default INFO, see bio-logger)
191
- -q, --quiet Run quietly
192
- -v, --verbose Run verbosely
193
- --debug Show debug messages
194
- -h, --help Show help and examples
195
-
196
- bioblastxmlparser filename(s)
197
-
198
- Use --help switch for more information
199
-
200
- == Examples
201
-
202
- Print result fields of iterations containing 'lcl', using a regex
203
-
204
- blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
205
-
206
- Print fields where bit_score > 145
207
-
208
- blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
209
-
210
- prints a tab delimited
211
-
212
- 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
213
- 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
214
- 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
215
- 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
216
-
217
- The second and third column show the BLAST iteration, and the others
218
- relate to the hits.
219
-
220
- As this is evaluated Ruby, it is also possible to use the XML element
221
- names directly
222
-
223
- blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
224
-
225
- And it is possible to print (non default) named fields where E-value < 0.001
226
- and hit length > 100. E.g.
227
-
228
- blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
229
-
230
- 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
231
- 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
232
- 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
233
- 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
234
- 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
235
- etc. etc.
236
-
237
- prints the evalue and qseq columns. To output FASTA use --output-fasta
238
-
239
- blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
240
-
241
- which prints matching sequences, where the first field is the accession, followed
242
- by query iteration id, and hit_id. E.g.
243
-
244
- >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
245
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
246
- >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
247
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
248
- etc. etc.
249
-
250
- To use the low-mem (iterated slower) version of the parser use
251
-
252
- blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
253
-
254
254
  == URL
255
255
 
256
256
  The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
257
257
 
258
258
  == Copyright
259
259
 
260
- Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
260
+ Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
261
261
 
data/Rakefile CHANGED
@@ -16,7 +16,7 @@ Jeweler::Tasks.new do |gem|
16
16
  gem.homepage = "http://github.com/pjotrp/blastxmlparser"
17
17
  gem.license = "MIT"
18
18
  gem.summary = %Q{Very fast BLAST XML parser and library for big data}
19
- gem.description = %Q{Fast big data XML parser and library, libxml2 based 50x faster than BioRuby}
19
+ gem.description = %Q{Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby}
20
20
  gem.email = "pjotr.public01@thebird.nl"
21
21
  gem.authors = ["Pjotr Prins"]
22
22
  # Include your dependencies below. Runtime dependencies are required when using your gem,
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.1.0
1
+ 1.1.1
data/bin/blastxmlparser CHANGED
@@ -2,10 +2,9 @@
2
2
  #
3
3
  # BioRuby bio-blastxmlparser Plugin
4
4
  # Author:: Pjotr Prins
5
- # Copyright:: 2011
6
5
  # License:: MIT License
7
6
  #
8
- # Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
7
+ # Copyright (C) 2010-2013 Pjotr Prins <pjotr.prins@thebird.nl>
9
8
 
10
9
  rootpath = File.dirname(File.dirname(__FILE__))
11
10
  $: << File.join(rootpath,'lib')
@@ -160,9 +159,9 @@ begin
160
159
  ARGV.each do | fn |
161
160
  logger.info("XML parsing #{fn}")
162
161
  n = if options.parser == :split
163
- Bio::Blast::XmlSplitterIterator.new(fn).to_enum
162
+ Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
164
163
  else
165
- Bio::Blast::XmlIterator.new(fn).to_enum
164
+ Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
166
165
  end
167
166
  i = 1
168
167
  n.each do | iter |
@@ -5,12 +5,12 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "bio-blastxmlparser"
8
- s.version = "1.1.0"
8
+ s.version = "1.1.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Pjotr Prins"]
12
- s.date = "2012-08-08"
13
- s.description = "Fast big data XML parser and library, libxml2 based 50x faster than BioRuby"
12
+ s.date = "2013-02-07"
13
+ s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby"
14
14
  s.email = "pjotr.public01@thebird.nl"
15
15
  s.executables = ["blastxmlparser"]
16
16
  s.extra_rdoc_files = [
@@ -3,7 +3,7 @@ require 'nokogiri'
3
3
  require 'enumerator'
4
4
 
5
5
  module Bio
6
- module Blast
6
+ module BlastXMLParser
7
7
 
8
8
  module XPath
9
9
  def field name
@@ -1,7 +1,7 @@
1
1
 
2
2
 
3
3
  module Bio
4
- module Blast
4
+ module BlastXMLParser
5
5
 
6
6
  # Iterate a BLAST file yielding (lazy) results
7
7
  class XmlIterator
@@ -1,7 +1,7 @@
1
1
  require 'enumerator'
2
2
 
3
3
  module Bio
4
- module Blast
4
+ module BlastXMLParser
5
5
  # Reads a full XML result and splits it out into a buffer for each
6
6
  # Iteration (query result).
7
7
  class XmlSplitterIterator
@@ -5,7 +5,7 @@ $: << File.join(rootpath,'lib')
5
5
 
6
6
  require 'bio-blastxmlparser'
7
7
  fn = 'test/data/nt_example_blastn.m7'
8
- n = Bio::Blast::XmlIterator.new(fn).to_enum
8
+ n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
9
9
  n.each do | iter |
10
10
  puts "Hits for " + iter.query_id
11
11
  iter.each do | hit |
@@ -1,9 +1,9 @@
1
1
  require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
2
2
 
3
3
  TESTFILE = "./test/data/nt_example_blastn.m7"
4
- include Bio::Blast
4
+ include Bio::BlastXMLParser
5
5
 
6
- describe "Bio::Blast::NokogiriBlastXml" do
6
+ describe "Bio::BlastXMLParser::NokogiriBlastXml" do
7
7
  before(:all) do
8
8
  n = NokogiriBlastXml.new(File.new(TESTFILE)).to_enum
9
9
  @iter1 = n.next
@@ -75,8 +75,8 @@ describe "Bio::Blast::NokogiriBlastXml" do
75
75
  end
76
76
  end
77
77
 
78
- describe Bio::Blast::XmlIterator do
79
- include Bio::Blast
78
+ describe Bio::BlastXMLParser::XmlIterator do
79
+ include Bio::BlastXMLParser
80
80
  it "should parse with Nokogiri" do
81
81
  blast = XmlIterator.new(TESTFILE).to_enum
82
82
  iter1 = blast.next
@@ -86,8 +86,8 @@ describe Bio::Blast::XmlIterator do
86
86
  end
87
87
  end
88
88
 
89
- describe Bio::Blast::XmlSplitterIterator do
90
- include Bio::Blast
89
+ describe Bio::BlastXMLParser::XmlSplitterIterator do
90
+ include Bio::BlastXMLParser
91
91
  # it "should read a large file and yield Iterations" do
92
92
  # s = XmlSplitter.new("./test/data/nt_example_blastn.m7")
93
93
  # s.each do | result |
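
The hunks above rename the library namespace from Bio::Blast to Bio::BlastXMLParser, so callers written against 1.1.0 need to change their constant references. For code that has to run against either version, one option is to resolve the namespace at runtime. The following is only a minimal sketch: the helper name blast_parser_namespace is hypothetical and is not part of the gem, and it assumes that only the two module names shown in this diff need to be handled.

```ruby
# Hypothetical helper (not part of bio-blastxmlparser): return whichever
# parser namespace is defined directly under the given root module.
def blast_parser_namespace(root)
  if root.const_defined?(:BlastXMLParser, false)
    # Module name introduced in 1.1.1, per the diff above
    root.const_get(:BlastXMLParser, false)
  elsif root.const_defined?(:Blast, false)
    # Module name used in 1.1.0
    root.const_get(:Blast, false)
  else
    raise NameError, "no BLAST XML parser namespace found under #{root}"
  end
end

# Usage, assuming the gem is installed and the test file is present:
#   require 'bio-blastxmlparser'
#   ns = blast_parser_namespace(Bio)
#   ns::XmlIterator.new('test/data/nt_example_blastn.m7').to_enum.each do |iter|
#     puts "Hits for " + iter.query_id
#   end
```

The new name is checked first, so if both constants are defined under the root the 1.1.1 namespace is returned.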
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-blastxmlparser
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.0
4
+ version: 1.1.1
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-08-08 00:00:00.000000000Z
12
+ date: 2013-02-07 00:00:00.000000000Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bio-logger
16
- requirement: &14068420 !ruby/object:Gem::Requirement
16
+ requirement: &24214160 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: 1.0.0
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *14068420
24
+ version_requirements: *24214160
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: nokogiri
27
- requirement: &14067240 !ruby/object:Gem::Requirement
27
+ requirement: &24213120 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: 1.5.0
33
33
  type: :runtime
34
34
  prerelease: false
35
- version_requirements: *14067240
35
+ version_requirements: *24213120
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: rake
38
- requirement: &14066000 !ruby/object:Gem::Requirement
38
+ requirement: &24212220 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: 0.9.2.2
44
44
  type: :development
45
45
  prerelease: false
46
- version_requirements: *14066000
46
+ version_requirements: *24212220
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: bundler
49
- requirement: &14064760 !ruby/object:Gem::Requirement
49
+ requirement: &24211440 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: '0'
55
55
  type: :development
56
56
  prerelease: false
57
- version_requirements: *14064760
57
+ version_requirements: *24211440
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: jeweler
60
- requirement: &14063640 !ruby/object:Gem::Requirement
60
+ requirement: &24174660 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
65
65
  version: 1.8.4
66
66
  type: :development
67
67
  prerelease: false
68
- version_requirements: *14063640
68
+ version_requirements: *24174660
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: rspec
71
- requirement: &14062500 !ruby/object:Gem::Requirement
71
+ requirement: &24173840 !ruby/object:Gem::Requirement
72
72
  none: false
73
73
  requirements:
74
74
  - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
76
76
  version: 2.3.0
77
77
  type: :development
78
78
  prerelease: false
79
- version_requirements: *14062500
79
+ version_requirements: *24173840
80
80
  - !ruby/object:Gem::Dependency
81
81
  name: rdoc
82
- requirement: &14055400 !ruby/object:Gem::Requirement
82
+ requirement: &24173100 !ruby/object:Gem::Requirement
83
83
  none: false
84
84
  requirements:
85
85
  - - ! '>='
@@ -87,8 +87,9 @@ dependencies:
87
87
  version: 2.4.2
88
88
  type: :development
89
89
  prerelease: false
90
- version_requirements: *14055400
91
- description: Fast big data XML parser and library, libxml2 based 50x faster than BioRuby
90
+ version_requirements: *24173100
91
+ description: Fast big data BLAST XML parser and library; this libxml2 based version
92
+ is 50x faster than BioRuby
92
93
  email: pjotr.public01@thebird.nl
93
94
  executables:
94
95
  - blastxmlparser
@@ -140,7 +141,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
140
141
  version: '0'
141
142
  segments:
142
143
  - 0
143
- hash: -1696395694674995706
144
+ hash: -3287387609254152406
144
145
  required_rubygems_version: !ruby/object:Gem::Requirement
145
146
  none: false
146
147
  requirements: