bio-blastxmlparser 1.1.2 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 31b42217bb809cde8d5ef3c06d11c6c9123c6413
4
- data.tar.gz: 5d23e19fb8c774f7edaffd03bbcb156800679f7f
3
+ metadata.gz: b96fa7141abe77c13e1e34eb1d920a624c22f83d
4
+ data.tar.gz: 1b038243195b478d18b2a2c72b3fb0b4538a6701
5
5
  SHA512:
6
- metadata.gz: de99019d564d5ea759f6e3ef330b8e9e68f6a7bbdb0578c34699ad7f716da16562d702da935bcfc5e3baa9a9e673b2ad99a62ae07210c4feea144134ad822e94
7
- data.tar.gz: 24bb61197ff82129b404dcaf928b38ac9f7b9d5c10b90a630032253e468adc558c1675639b11c4bc60f7da8082626b3b288c2f0b9e6328e0ef2525b3a79453a8
6
+ metadata.gz: 87ea99f1ac87b528e8a08c5490cb500038fdd06ea18def8cda1276cf5e7ba7976ca4f0a49934023be16ea122ae1d91e514ad5a5fbdf282245a105ce22131dc6f
7
+ data.tar.gz: 2237ce97c067f42123c9aeba9ae8cb77cf97d11ef0ad7211ceb8a7e6110f6cc6c71d2a7a2e1e1dcf31b5f6c67d55af8a992b82eb43e377caf858fff6cac3e4d9
data/Gemfile CHANGED
@@ -1,7 +1,7 @@
1
1
  source "http://rubygems.org"
2
2
  # Runtime dependencies
3
3
  gem "bio-logger"
4
- gem "nokogiri", "~>1.6.0"
4
+ gem "nokogiri", "~>1.6.3"
5
5
 
6
6
  # Add dependencies to develop your gem here.
7
7
  # Include everything needed to run rake, tests, features, etc.
data/Gemfile.lock CHANGED
@@ -66,7 +66,7 @@ DEPENDENCIES
66
66
  bio-logger
67
67
  bundler
68
68
  jeweler (~> 2.0.1)
69
- nokogiri (~> 1.6.0)
69
+ nokogiri (~> 1.6.3)
70
70
  rake
71
71
  rdoc
72
72
  rspec
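The Gemfile and Gemfile.lock hunks above tighten the nokogiri constraint from `~> 1.6.0` to `~> 1.6.3`. As a rough illustration of what that pessimistic constraint admits, the sketch below checks a few version strings with RubyGems' own `Gem::Requirement`. The helper name `satisfies?` and the version numbers are illustrative assumptions, not a statement about which nokogiri releases exist.

```ruby
require "rubygems"

# Hypothetical helper: does `version` satisfy the given constraint string?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Compare the old and new constraints from the diff on a few sample versions.
%w[1.6.0 1.6.2 1.6.3 1.6.8 1.7.0].each do |v|
  old_ok = satisfies?("~> 1.6.0", v)
  new_ok = satisfies?("~> 1.6.3", v)
  puts "#{v}\told constraint: #{old_ok}\tnew constraint: #{new_ok}"
end
```

Both constraints stay below 1.7; the new one additionally rejects 1.6.0 through 1.6.2.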
data/README.md CHANGED
@@ -2,8 +2,9 @@
2
2
 
3
3
  # bio-blastxmlparser
4
4
 
5
- blastxmlparser is a very fast big-data BLAST XML file parser, which can be used
6
- as command line utility. Use blastxmlparser to:
5
+ blastxmlparser is a very fast parallelised big-data BLAST XML file
6
+ parser, which can be used as command line utility. Use blastxmlparser
7
+ to:
7
8
 
8
9
  * Parse BLAST XML
9
10
  * Filter output
@@ -24,12 +25,10 @@ can be used to filter results and requires no understanding of Ruby.
24
25
  blastxmlparser --help
25
26
  ```
26
27
 
27
- (see Installation, below, if it does not work)
28
-
29
28
  ## Performance
30
29
 
31
30
  XML parsing is expensive. blastxmlparser can use the fast Nokogiri C, or
32
- Java XML parsers, based on libxml2. Basically, a DOM parser is used
31
+ Java XML parsers, based on libxml2 in parallel. A DOM parser is used
33
32
  after splitting the BLAST XML document into subsections.
34
33
  Tests show this is faster than a SAX
35
34
  parser with Ruby callbacks. To see why libxml2 based Nokogiri is
@@ -38,33 +37,21 @@ fast, see this
38
37
  and [xml.com](http://www.xml.com/lpt/a/1703).
39
38
 
40
39
  Blastxmlparser is designed with other optimizations, such as lazy
41
- evaluation, i.e., only creating objects when required, and (in a
42
- future version) parallelization. When parsing a full BLAST result
43
- usually only a few fields are used. By using XPath queries the parser
44
- makes sure only the relevant fields are queried.
40
+ evaluation, i.e., only creating objects when required, and
41
+ parallelism. When parsing a full BLAST result usually only a few
42
+ fields are used. By using XPath queries the parser makes sure only the
43
+ relevant fields are queried.
45
44
 
46
- Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
45
+ Timings for parsing a 128 Mb BLAST XML file on 4x1.2GHz laptop
47
46
 
48
47
  ```
49
- bio-blastxmlparser + Nokogiri DOM (default)
50
-
51
- real 0m1.259s
52
- user 0m1.052s
53
- sys 0m0.144s
54
-
55
- bio-blastxmlparser + Nokogiri split DOM
56
-
57
- real 0m1.713s
58
- user 0m1.444s
59
- sys 0m0.160s
60
-
61
- BioRuby ReXML DOM parser (old style)
62
-
63
- real 1m14.548s
64
- user 1m13.065s
65
- sys 0m0.472s
48
+ real 0m13.985s
49
+ user 0m44.951s
50
+ sys 0m3.676s
66
51
  ```
67
52
 
53
+ which makes for pretty good core utilisation.
54
+
68
55
  ## Install
69
56
 
70
57
  ```sh
@@ -99,9 +86,11 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
99
86
  blastxmlparser [options] file(s)
100
87
 
101
88
  -p, --parser name Use full|split parser (default full)
89
+ -e, --exec filter Evaluate filter
90
+
91
+ -n, --named fields Print named fields
102
92
  --output-fasta Output FASTA
103
- -n, --named fields Set named fields
104
- -e, --exec filter Execute filter
93
+ -t, --template erb Use ERB template for output
105
94
 
106
95
  --logger filename Log to file (default stderr)
107
96
  --trace options Set log level (default INFO, see bio-logger)
@@ -109,10 +98,6 @@ provide build paths, as described [here](http://nokogiri.org/tutorials/installin
109
98
  -v, --verbose Run verbosely
110
99
  --debug Show debug messages
111
100
  -h, --help Show help and examples
112
-
113
- bioblastxmlparser filename(s)
114
-
115
- Use --help switch for more information
116
101
  ```
117
102
 
118
103
  ### Examples
@@ -204,7 +189,7 @@ template could be
204
189
  To get JSON, run it with
205
190
 
206
191
  ```sh
207
- blastxmlparser --template template/json.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
192
+ blastxmlparser --template template/blast2json.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
208
193
  ```
209
194
 
210
195
  ```Javascript
@@ -223,7 +208,7 @@ To get JSON, run it with
223
208
  Likewise, using the RDF template
224
209
 
225
210
  ```sh
226
- blastxmlparser --template template/rdf.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
211
+ blastxmlparser --template template/blast2rdf.erb -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
227
212
  ```
228
213
 
229
214
  ```ruby
@@ -235,14 +220,13 @@ Likewise, using the RDF template
235
220
  :accession "Minc02032",
236
221
  :id "lcl|Minc02032",
237
222
  :len 147,
238
- :E-value 8.1089e-12,
239
223
  :identity 60,
240
224
  :align_len 69,
241
225
  :bitscore 69.8753,
242
226
  :qseq "ATGGGAGATGGAATTGAACCGTCATGGAAAGGGCCCAAACCGAAGCACAACCGACTGTGCCACCATCCA",
243
227
  :midline "|||||||||||||||||||| |||||||| | |||||||||||||||||||||||||||||||",
244
228
  :hseq "ATGGGAGATGGAATTGAACCATCATGGAATG-------ACCGAAGCACAACCGACTGTGCCACCATCCA",
245
- :evalue 8.1089e-12 .
229
+ :evalue 8.1089e-12 .
246
230
  ```
247
231
 
248
232
  ## Additional options
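The new README timings earlier in this file's diff (real 0m13.985s, user 0m44.951s, sys 0m3.676s on a 4-core laptop) are described as "pretty good core utilisation". One way to put a number on that claim is to divide CPU time (user + sys) by wall-clock time. The sketch below does that arithmetic; the helper names `time_to_seconds` and `busy_cores` are made up for illustration and are not part of the gem.

```ruby
# Convert a `time`-style duration such as "0m13.985s" into seconds.
def time_to_seconds(text)
  match = text.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognised duration: #{text}" unless match
  match[1].to_i * 60 + match[2].to_f
end

# Average number of busy cores: CPU time (user + sys) over wall-clock time.
def busy_cores(real:, user:, sys:)
  (time_to_seconds(user) + time_to_seconds(sys)) / time_to_seconds(real)
end

cores = busy_cores(real: "0m13.985s", user: "0m44.951s", sys: "0m3.676s")
puts format("average busy cores: %.2f of 4", cores)
```

For the figures quoted in the README this works out to roughly 3.5 of the 4 cores being busy on average.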
data/Rakefile CHANGED
@@ -15,7 +15,7 @@ Jeweler::Tasks.new do |gem|
15
15
  gem.name = "bio-blastxmlparser"
16
16
  gem.homepage = "http://github.com/pjotrp/blastxmlparser"
17
17
  gem.license = "MIT"
18
- gem.summary = %Q{Very fast BLAST XML to RDF/HTML/JSON/YAML/csv transformer}
18
+ gem.summary = %Q{Very fast parallel BLAST XML to RDF/HTML/JSON/YAML/csv transformer}
19
19
  gem.description = %Q{Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby and comes with a nice CLI}
20
20
  gem.email = "pjotr.public01@thebird.nl"
21
21
  gem.authors = ["Pjotr Prins"]
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.1.2
1
+ 2.0.0
data/bin/blastxmlparser CHANGED
@@ -52,8 +52,17 @@ opts = OptionParser.new do |o|
52
52
  options.parser = p.to_sym
53
53
  end
54
54
 
55
- o.on("-e filter","--exec filter",String, "Evaluate filter") do |s|
56
- options.exec = s
55
+ o.on("--filter filter",String, "Filtering expression") do |s|
56
+ options.filter = s
57
+ end
58
+
59
+ o.on("-t num", "--threads num",String, "Use parallel threads") do |num|
60
+ options.threads = num.to_i
61
+ end
62
+
63
+ o.on("-e filter","--exec filter",String, "Evaluate filter (deprecated)") do |s|
64
+ $stderr.print "WARNING: -e,--exec switch is deprecated, use --filter instead!\n"
65
+ options.filter = s
57
66
  end
58
67
 
59
68
  o.separator ""
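The hunk above adds `--filter` and `-t/--threads`, and keeps `-e/--exec` as a deprecated alias that warns and writes to the same setting. A minimal standalone sketch of that pattern with the standard `optparse` library follows. It is not the script itself: the method name `parse_filter_options`, the options hash, the `warn_io` parameter, and the `Integer` coercion (the script uses `String` plus `to_i`) are assumptions made for illustration.

```ruby
require "optparse"

# Parse an argument list, treating -e/--exec as a deprecated alias of --filter.
# Returns the collected options and the remaining non-option arguments.
def parse_filter_options(argv, warn_io: $stderr)
  options = { filter: nil, threads: nil }
  parser = OptionParser.new do |o|
    o.on("--filter filter", String, "Filtering expression") do |s|
      options[:filter] = s
    end
    o.on("-t num", "--threads num", Integer, "Use parallel threads") do |n|
      options[:threads] = n
    end
    o.on("-e filter", "--exec filter", String, "Evaluate filter (deprecated)") do |s|
      # Same destination as --filter, plus a warning on the given stream.
      warn_io.puts "WARNING: -e,--exec switch is deprecated, use --filter instead!"
      options[:filter] = s
    end
  end
  rest = parser.parse(argv) # non-destructive: argv is left untouched
  [options, rest]
end

options, files = parse_filter_options(["--filter", "hsp.evalue<0.01", "-t", "4", "blast.xml"])
puts options.inspect
puts files.inspect
```

Routing both switches into one setting means the rest of the program only has to consult `filter`, which is how the `eval(options.filter)` call later in this diff can serve old and new invocations alike.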
@@ -61,7 +70,7 @@ opts = OptionParser.new do |o|
61
70
  o.on("-n fields","--named fields",String, "Print named fields") do |s|
62
71
  options.fields = s.split(/,/)
63
72
  end
64
- o.on("--output-fasta","Output FASTA") do |b|
73
+ o.on("--output-fasta","Output FASTA") do |b|
65
74
  options.output_fasta = true
66
75
  end
67
76
 
@@ -100,7 +109,16 @@ begin
100
109
  Bio::Log::CLI.configure('bio-blastxmlparser')
101
110
  logger = Bio::Log::LoggerPlus['bio-blastxmlparser']
102
111
 
103
- if options[:template]
112
+ if options.threads != 1
113
+ begin
114
+ require 'parallel'
115
+ rescue LoadError
116
+ $stderr.print "Error: Missing 'parallel' module. Install with command 'gem install parallel' if you want multiple threads\n"
117
+ options.threads = 1
118
+ end
119
+ end
120
+
121
+ if options.template
104
122
  include BioRdf
105
123
  fn = options.template
106
124
  raise "No template #{fn}!" if not File.exist?(fn)
@@ -114,39 +132,73 @@ begin
114
132
  else
115
133
  Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
116
134
  end
117
- i = 1
118
- n.each do | iter |
135
+ chunks = []
136
+ chunks_count = 0
137
+ NUM_CHUNKS=10_000
138
+
139
+ process = lambda { |iter,i| # Process one BLAST iter block
140
+ res = []
141
+ line_count = 0
142
+ hit_count = 0
119
143
  iter.each do | hit |
144
+ hit_count += 1
120
145
  hit.each do | hsp |
121
- do_print = if options.exec
122
- eval(options.exec)
146
+ do_print = if options.filter
147
+ eval(options.filter)
123
148
  else
124
149
  true
125
150
  end
126
151
  if do_print
152
+ line_count += 1
127
153
  if template
128
- print template.result(binding)
154
+ res << template.result(binding)
129
155
  elsif options.output_fasta
130
- print ">"+hit.accession+' '+iter.iter_num.to_s+'|'+iter.query_id+' '+hit.hit_id+' '+hit.hit_def+"\n"
131
- print hsp.qseq+"\n"
156
+ res << ">"+hit.accession+' '+iter.iter_num.to_s+'|'+iter.query_id+' '+hit.hit_id+' '+hit.hit_def+"\n"
157
+ res << hsp.qseq+"\n"
132
158
  else
133
159
  # Default output
134
160
  if options.fields
135
- print i,"\t"
161
+ out = [iter.iter_num,hit_count,hsp.hsp_num]
136
162
  options.fields.each do | f |
137
- print eval(f),"\t"
163
+ out << eval(f)
138
164
  end
139
- print "\n"
140
- else
141
- print [i,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t"),"\n"
165
+ res << out.join("\t")+"\n"
166
+ else
167
+ res << [hit_count,iter.iter_num,iter.query_id,hit.hit_id,hsp.hsp_num,hsp.evalue].join("\t")+"\n"
142
168
  end
143
169
  end
144
- i += 1
145
170
  end
146
171
  end
147
172
  end
173
+ res
174
+ } # end process
175
+
176
+ output = lambda { |collection|
177
+ collection.each do | result |
178
+ result.each { |line| print line }
179
+ end
180
+ } # end output
181
+
182
+ if options.threads == 1
183
+ n.each do | iter |
184
+ process.call(iter,0).each { | line | print line }
185
+ end
186
+ else
187
+ n.each do | iter |
188
+ chunks << iter
189
+ chunks_count += 1
190
+ if chunks.size > NUM_CHUNKS
191
+ output.call Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
192
+ process.call(iter,i)
193
+ }
194
+ chunks = []
195
+ end
196
+ end
197
+ output.call Parallel.map_with_index(chunks, :in_processes => options.threads) { | iter,i |
198
+ process.call(iter,i)
199
+ }
148
200
  end
149
201
  end
150
202
  rescue OptionParser::InvalidOption => e
151
- opts[:invalid_argument] = e.message
203
+ $stderr.print e.message
152
204
  end
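The rewritten main loop above collects BLAST iterations into a buffer, maps each full buffer in parallel, and prints the per-item result arrays in input order before refilling. The sketch below shows that batch-map-flush shape using only the standard library. Threads stand in for the forked worker processes the script obtains from the `parallel` gem, fixed-size slices stand in for the script's `chunks.size > NUM_CHUNKS` test, and the method name `process_in_batches` and its parameters are hypothetical.

```ruby
# Map items batch by batch with a small worker pool and return all output
# lines in input order. The block receives (item, index_within_batch) and
# must return an array of lines, like the `process` lambda in the diff.
def process_in_batches(items, batch_size:, workers:)
  output = []
  items.each_slice(batch_size) do |batch|
    results = Array.new(batch.size)
    queue = Queue.new
    batch.each_with_index { |item, i| queue << [item, i] }

    threads = Array.new(workers) do
      Thread.new do
        loop do
          begin
            item, i = queue.pop(true) # non-blocking; raises ThreadError when empty
          rescue ThreadError
            break
          end
          results[i] = yield(item, i) # each slot is written by exactly one thread
        end
      end
    end
    threads.each(&:join)

    # Flush this batch in input order before starting the next one.
    results.each { |lines| output.concat(lines) }
  end
  output
end

lines = process_in_batches((1..10).to_a, batch_size: 4, workers: 3) do |n, _i|
  ["#{n}\t#{n * n}\n"]
end
print lines.join
```

Storing each result at its input index is what keeps the output deterministic even though workers finish in arbitrary order. Batching bounds how many results are held in memory at once, at the cost of a synchronisation point at the end of every batch.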
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "bio-blastxmlparser"
8
- s.version = "1.1.2"
8
+ s.version = "2.0.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Pjotr Prins"]
12
- s.date = "2014-09-02"
12
+ s.date = "2014-09-06"
13
13
  s.description = "Fast big data BLAST XML parser and library; this libxml2 based version is 50x faster than BioRuby and comes with a nice CLI"
14
14
  s.email = "pjotr.public01@thebird.nl"
15
15
  s.executables = ["blastxmlparser"]
@@ -42,8 +42,9 @@ Gem::Specification.new do |s|
42
42
  "sample/nokogiri_split_dom.rb",
43
43
  "spec/bio-blastxmlparser_spec.rb",
44
44
  "spec/spec_helper.rb",
45
- "template/json.erb",
46
- "template/rdf.erb",
45
+ "template/blast2json.erb",
46
+ "template/blast2rdf-minimal.erb",
47
+ "template/blast2rdf.erb",
47
48
  "test/data/aa_example.fasta",
48
49
  "test/data/aa_example_blastp.m7",
49
50
  "test/data/nt_example.fasta",
@@ -54,14 +55,14 @@ Gem::Specification.new do |s|
54
55
  s.licenses = ["MIT"]
55
56
  s.require_paths = ["lib"]
56
57
  s.rubygems_version = "2.0.3"
57
- s.summary = "Very fast BLAST XML to RDF/HTML/JSON/YAML/csv transformer"
58
+ s.summary = "Very fast parallel BLAST XML to RDF/HTML/JSON/YAML/csv transformer"
58
59
 
59
60
  if s.respond_to? :specification_version then
60
61
  s.specification_version = 4
61
62
 
62
63
  if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
63
64
  s.add_runtime_dependency(%q<bio-logger>, [">= 0"])
64
- s.add_runtime_dependency(%q<nokogiri>, ["~> 1.6.0"])
65
+ s.add_runtime_dependency(%q<nokogiri>, ["~> 1.6.3"])
65
66
  s.add_development_dependency(%q<rake>, [">= 0"])
66
67
  s.add_development_dependency(%q<bundler>, [">= 0"])
67
68
  s.add_development_dependency(%q<jeweler>, ["~> 2.0.1"])
@@ -69,7 +70,7 @@ Gem::Specification.new do |s|
69
70
  s.add_development_dependency(%q<rdoc>, [">= 0"])
70
71
  else
71
72
  s.add_dependency(%q<bio-logger>, [">= 0"])
72
- s.add_dependency(%q<nokogiri>, ["~> 1.6.0"])
73
+ s.add_dependency(%q<nokogiri>, ["~> 1.6.3"])
73
74
  s.add_dependency(%q<rake>, [">= 0"])
74
75
  s.add_dependency(%q<bundler>, [">= 0"])
75
76
  s.add_dependency(%q<jeweler>, ["~> 2.0.1"])
@@ -78,7 +79,7 @@ Gem::Specification.new do |s|
78
79
  end
79
80
  else
80
81
  s.add_dependency(%q<bio-logger>, [">= 0"])
81
- s.add_dependency(%q<nokogiri>, ["~> 1.6.0"])
82
+ s.add_dependency(%q<nokogiri>, ["~> 1.6.3"])
82
83
  s.add_dependency(%q<rake>, [">= 0"])
83
84
  s.add_dependency(%q<bundler>, [">= 0"])
84
85
  s.add_dependency(%q<jeweler>, ["~> 2.0.1"])
@@ -63,7 +63,7 @@ module BioRdf
63
63
  # Don't want Bio depency in templates!
64
64
  # logger = Bio::Log::LoggerPlus.new 'bio-rdf'
65
65
  # logger.warn "\nWARNING: Changed identifier <#{s}> to <#{id}>"
66
- $stderr.print "\nWARNING: Changed identifier <#{s}> to <#{id}>"
66
+ # $stderr.print "\nWARNING: Changed identifier <#{s}> to <#{id}>"
67
67
  end
68
68
  if not RDF::valid_uri?(id)
69
69
  raise "Invalid URI after mangling <#{s}> to <#{id}>!"
File without changes
@@ -0,0 +1,14 @@
1
+ <%
2
+ blastid = Turtle::mangle_identifier(hit.parent.query_def)
3
+ id = blastid+'_'+hit.hit_num.to_s
4
+ %>
5
+ :<%= blastid %> :query :<%= id %>
6
+ :<%= id %>
7
+ :query_def "<%= hit.parent.query_def %>",
8
+ :num <%= hit.hit_num %>,
9
+ :accession "<%= hit.accession %>",
10
+ :len <%= hit.len %>,
11
+ :identity <%= hsp.identity %>,
12
+ :align_len <%= hsp.align_len %>,
13
+ :bitscore <%= hsp.bit_score %>,
14
+ :evalue <%= hsp.evalue %> .
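The new minimal RDF template above is plain ERB that reads `hit` and `hsp` from the binding the script passes to `template.result(binding)`. The sketch below renders a cut-down template of the same layout against stand-in objects to show how those names get resolved. The method name `render_hit`, the `Struct` stand-ins, the `:hit_<n>` subject, and the hit number 1 are assumptions for illustration; the other field values are taken from the README example output earlier in this diff.

```ruby
require "erb"

# Render one hit/HSP pair. The local variables `hit` and `hsp` are visible to
# the template because `binding` captures this method's scope.
def render_hit(hit, hsp)
  template = ERB.new(<<~TURTLE)
    :hit_<%= hit.hit_num %>
      :accession "<%= hit.accession %>",
      :len <%= hit.len %>,
      :identity <%= hsp.identity %>,
      :align_len <%= hsp.align_len %>,
      :bitscore <%= hsp.bit_score %>,
      :evalue <%= hsp.evalue %> .
  TURTLE
  template.result(binding)
end

# Hypothetical stand-ins for the parser's hit and HSP objects.
hit = Struct.new(:hit_num, :accession, :len).new(1, "Minc02032", 147)
hsp = Struct.new(:identity, :align_len, :bit_score, :evalue).new(60, 69, 69.8753, 8.1089e-12)

print render_hit(hit, hsp)
```

Because the template only sees whatever the binding exposes, any object that responds to the same method names can be substituted, which is convenient when trying out a template without a full BLAST XML file.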
@@ -10,12 +10,11 @@
10
10
  :accession "<%= hit.accession %>",
11
11
  :id "<%= hit.hit_id %>",
12
12
  :len <%= hit.len %>,
13
- :E-value <%= hsp.evalue %>,
14
13
  :identity <%= hsp.identity %>,
15
14
  :align_len <%= hsp.align_len %>,
16
15
  :bitscore <%= hsp.bit_score %>,
17
16
  :qseq "<%= hsp.qseq %>",
18
17
  :midline "<%= hsp.midline %>",
19
18
  :hseq "<%= hsp.hseq %>",
20
- :evalue <%= hsp.evalue %> .
19
+ :evalue <%= hsp.evalue %> .
21
20
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-blastxmlparser
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.2
4
+ version: 2.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Pjotr Prins
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-09-02 00:00:00.000000000 Z
11
+ date: 2014-09-06 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bio-logger
@@ -30,14 +30,14 @@ dependencies:
30
30
  requirements:
31
31
  - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: 1.6.0
33
+ version: 1.6.3
34
34
  type: :runtime
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
38
  - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: 1.6.0
40
+ version: 1.6.3
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rake
43
43
  requirement: !ruby/object:Gem::Requirement
@@ -142,8 +142,9 @@ files:
142
142
  - sample/nokogiri_split_dom.rb
143
143
  - spec/bio-blastxmlparser_spec.rb
144
144
  - spec/spec_helper.rb
145
- - template/json.erb
146
- - template/rdf.erb
145
+ - template/blast2json.erb
146
+ - template/blast2rdf-minimal.erb
147
+ - template/blast2rdf.erb
147
148
  - test/data/aa_example.fasta
148
149
  - test/data/aa_example_blastp.m7
149
150
  - test/data/nt_example.fasta
@@ -172,5 +173,5 @@ rubyforge_project:
172
173
  rubygems_version: 2.0.3
173
174
  signing_key:
174
175
  specification_version: 4
175
- summary: Very fast BLAST XML to RDF/HTML/JSON/YAML/csv transformer
176
+ summary: Very fast parallel BLAST XML to RDF/HTML/JSON/YAML/csv transformer
176
177
  test_files: []