bio-vcf 0.7.0 → 0.7.3

Files changed (24)

checksums.yaml +4 -4
data/.travis.yml +3 -2
data/Gemfile +2 -5
data/Gemfile.lock +3 -3
data/README.md +101 -23
data/Rakefile +4 -2
data/VERSION +1 -1
data/bin/bio-vcf +133 -73
data/bio-vcf.gemspec +13 -10
data/features/cli.feature +9 -1
data/features/multisample.feature +4 -4
data/features/sfilter.feature +1 -1
data/features/step_definitions/cli-feature.rb +4 -0
data/features/step_definitions/multisample.rb +24 -12
data/features/step_definitions/sfilter.rb +80 -31
data/lib/bio-vcf.rb +1 -0
data/lib/bio-vcf/vcfgenotypefield.rb +45 -9
data/lib/bio-vcf/vcfheader.rb +1 -1
data/lib/bio-vcf/vcfrecord.rb +14 -8
data/lib/bio-vcf/vcfsample.rb +101 -152
data/lib/bio-vcf/vcfstatistics.rb +28 -0
data/test/data/regression/ifilter_s.dp.ref +31 -0
data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -0
metadata +16 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c4325d76baee5956ed3f58277ad622cf9a0a6ce7
-  data.tar.gz: e971b0fb0f760aafb32af51a647f1ba39f59f26f
+  metadata.gz: 72f63aae77e382c88e04cb07344ab3fce2a57232
+  data.tar.gz: 48f36f4d75d18edf3619124f0b679706a002b646
 SHA512:
-  metadata.gz: a99e0be8ce0fd84d8afc557e5e30418da2fd98d2ad7458b242e9402e4605d5b7f816758da39b9248dad3ebf53e8a5a3c17862927c284832772cbf68c3b9d2fbc
-  data.tar.gz: 49f0e38cf66781a2d35bb45849d83bc137e299d42d3539928e70079b4ffbcc70bca20c4149d7cae355696e18670aab50dfbe0f8d87ffc03b6b7445b3d676eb95
+  metadata.gz: a1d1513454924e1d84bb9aecf1bd83cf0d8784e3ab8669bb7777129f2a8b33df53a2118d4fea15a825ce1c399de05cf729e0f7b5e7acd35367856a6a1821f328
+  data.tar.gz: 7b40ffdad49cbb690cfaa4d02e7a060095dc13aa4533f091f1f20b2f2b1903c71d2f6c6c5c5f672dc3be4da751305c2583265dbe7d86308501ab0219cca9e414

data/.travis.yml CHANGED

@@ -1,8 +1,9 @@
 language: ruby
 rvm:
-  - 1.9.3
+#  - 1.9.3 <- No longer working
+  - 2.0.0
   - 2.1.0
-  - jruby-head
+#  - jruby-head
 #  - jruby-19mode # JRuby in 1.9 mode
 #  - 1.8.7
 #  - jruby-18mode # JRuby in 1.8 mode

data/Gemfile CHANGED

@@ -9,9 +9,6 @@ group :development do
   # gem "minitest"
   gem "rspec"
   gem "cucumber"
-  gem "jeweler" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
-  # gem "bundler", ">= 1.0.21"
-  # gem "bio", ">= 1.4.2"
-  # gem "rdoc", "~> 3.12"
-  gem "regressiontest"
+  gem "jeweler", "~> 2.0.1" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
+  gem "regressiontest", "~> 0.0.3"
 end
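
Reviewer note: the Gemfile now pins `jeweler` and `regressiontest` with the pessimistic operator `~>`. A minimal sketch of what those two constraints accept, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems; the candidate version numbers below are illustrative only and are not taken from the diff.

```ruby
require "rubygems"

# "~> 2.0.1" accepts 2.0.1 and later patch releases, but not 2.1.0.
jeweler = Gem::Requirement.new("~> 2.0.1")
puts jeweler.satisfied_by?(Gem::Version.new("2.0.1"))  # true
puts jeweler.satisfied_by?(Gem::Version.new("2.0.9"))  # true
puts jeweler.satisfied_by?(Gem::Version.new("2.1.0"))  # false

# "~> 0.0.3" accepts 0.0.3 and later 0.0.x releases, so 0.0.2 is rejected.
regressiontest = Gem::Requirement.new("~> 0.0.3")
puts regressiontest.satisfied_by?(Gem::Version.new("0.0.2"))  # false
puts regressiontest.satisfied_by?(Gem::Version.new("0.0.3"))  # true
```

This matches the Gemfile.lock change in the next file, where `regressiontest` moves from 0.0.2 to 0.0.3.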

data/Gemfile.lock CHANGED

@@ -60,7 +60,7 @@ GEM
     rake (10.1.1)
     rdoc (4.1.1)
       json (~> 1.4)
-    regressiontest (0.0.2)
+    regressiontest (0.0.3)
     rspec (2.14.1)
       rspec-core (~> 2.14.0)
       rspec-expectations (~> 2.14.0)
@@ -76,6 +76,6 @@ PLATFORMS
 DEPENDENCIES
   cucumber
-  jeweler
-  regressiontest
+  jeweler (~> 2.0.1)
+  regressiontest (~> 0.0.3)
   rspec

data/README.md CHANGED

@@ -2,14 +2,27 @@
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
-Yet another VCF parser. Bio-vcf is not only fast for genome-wide data,
-it also comes with a really nice filtering, evaluation and rewrite
-language. Bio-vcf has better performance than other tools
+A new generation VCF parser. Bio-vcf is not only fast for genome-wide
+(WGS) data, it also comes with a really nice filtering, evaluation and
+rewrite language. Why would you use bio-vcf over other parsers?
+1. Bio-vcf is fast and scales on multi-core computers
+2. Bio-vcf has an expressive filtering and evaluation language
+3. Bio-vcf has great multi-sample support
+4. Bio-vcf has multiple global filters and sample filters
+5. Bio-vcf can access any VCF format
+6. Bio-vcf can do calculations on fields
+7. Bio-vcf allows for genotype processing
+8. Bio-vcf has support for set analysis
+9. Bio-vcf has sane error handling
+10. Bio-vcf can output tabular data, HTML, LaTeX, RDF and (soon) JSON
+Bio-vcf has better performance than other tools
 because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering. For example on an 2 core machine
-bio-vcf is 50% faster than SnpSift. On an 8 core machine bio-vcf is
-3x faster than SnpSift. Parsing a 1 Gb ESP VCF with 8 cores with
-bio-vcf takes
+(fancy) command line filtering. For example on an 2 core machine
+bio-vcf is typically 50% faster than JVM based SnpSift. On an 8 core machine
+bio-vcf is at least 3x faster than SnpSift. Parsing a 1 Gb ESP
+VCF with 8 cores with bio-vcf takes
 ```sh
   time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
@@ -18,7 +31,7 @@ bio-vcf takes
   sys     0m7.852s
 ```
-and parsing with SnpSift takes
+while parsing with SnpSift takes
 ```sh
   time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
@@ -32,22 +45,22 @@ Illumina Hiseq VCF file and evaluating the results into a BED format on
 a 16 core machine takes
 ```sh
-  time bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
+  time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
   real    0m47.612s
   user    8m18.234s
   sys     0m5.039s
 ```
-which shows some pretty decent core utilisation (10x).
+which shows pretty decent core utilisation (10x). We are running
+gzip compressed VCF files of 30+ Gb with similar performance gains.
 Use zcat to
-pipe gzipped (vcf.gz) files into bio-vcf, e.g.
+pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
 ```sh
   zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
     --sfilter '!s.empty? and s.dp>20'
     --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
 ```
 bio-vcf comes with a sensible parser definition language (it is 100%
@@ -184,6 +197,13 @@ commands exit for filtering and eval. When a set is defined, based on
 the sample name, you can apply filters on the samples inside the set,
 outside the set and over all samples. E.g.
+So, why would you use bio-vcf instead of rolling out your own
+Perl/Python/other ad-hoc script? I think the reason should be that
+there is less chance of mistakes because of Bio-vcf's clear filtering
+language and sensible built-in validation. The second reason would be
+speed. Bio-vcf's multi-threading capability gives it great and hard to
+replicate performance.
 Also note you can use
 [bio-table](https://github.com/pjotrp/bioruby-table) to
 filter/transform data further and convert to other formats, such as
@@ -202,7 +222,7 @@ example of a VCF statement you need to work on.
 ## Installation
-Note that you need Ruby 1.9.3 or later. The 2.x Ruby series also give
+Note that you need Ruby 2.x or later. The 2.x Ruby series also give
 a performance improvement. Bio-vcf will show the Ruby version when
 typing the command 'bio-vcf -h'.
@@ -371,7 +391,7 @@ And even better because of Ruby magic
 Note that only valid method names in lower case get picked up this
 way. Also by convention normal is sample 1 and tumor is sample 2.
-Even shorter r is an alias for rec (nyi)
+Even shorter r is an alias for rec
 ```sh
   bio-vcf --eval "r.original.gt" < file.vcf
@@ -380,7 +400,8 @@ Even shorter r is an alias for rec (nyi)
 ## Special functions
-Note: special functions are not yet implemented!
+Note: special functions are not yet implemented! Look below
+for genotype processing which has indexing in 'gti'.
 Sometime you want to use a special function in a filter. For
 example percentage variant reads can be defined as [a,c,g,t]
@@ -440,7 +461,8 @@ example, samples are selected that evaluate to true, all others should
 evaluate to false. For this we create three filters, one for all
 samples that are included (the --ifilter or -if), for all samples that
 are excluded (the --efilter or -ef) and for any sample (the --sfilter
-or -sf). So i=include, e=exclude and s=any sample.
+or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
+filter).
 The equivalent of the union filter is by using the --sfilter, so
@@ -448,15 +470,19 @@ The equivalent of the union filter is by using the --sfilter, so
   bio-vcf --sfilter 's.dp>20'
 ```
-Filters DP on all samples. To filter on a subset you can add a
+Filters DP on all samples and is true if all samples match the
+criterium (AND). To filter on a subset you can add a
 selector
 ```sh
   bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
 ```
-For set analysis there are the additional ifilter (include) and efilter (exclude). To filter
-on samples 0,1,4 and output the gq values
+For set analysis there are the additional ifilter (include) and
+efilter (exclude).  Where sfilter represents an ALL match, the ifilter
+represents an ANY match, i.e., it is true if one of the samples
+matches the criterium (OR). To filter on samples 0,1,4 and output the gq
+values
 ```sh
   bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gq<10 or s.gq==99' --seval s.gq
@@ -494,8 +520,10 @@ To set an additional filter on the excluded samples:
 ```
 Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
+Currently the efilter is an ALL filter (AND), i.e. all excluded
+samples need to match the criterium.
-The following are not yet implemented:
+The following regular expression matches are not yet implemented:
 In the near future it is also possible to select samples on a regex (here
 select all samples where the name starts with s3)
@@ -560,6 +588,8 @@ and 'gts' as a nucleotide string array
     1       15274   G       G       G       G       G       G       G
 ```
+where gts represents the indexed genotype on [ref] + [alt].
 These values can also be used in filters and output allele depth, for
 example
@@ -570,12 +600,18 @@ example
     1       13757   47      47      4       47      47      4       47
 ```
-The following does not yet work (using the gti in a sample directly)
+You can use the genotype index gti to fetch values from, for example,
+allele depth:
 ```ruby
 bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gti[0]==0' --seval 'rec.original.ad[s.gti[1]]'
+1       10257   151     151     151     151     151     8       151
+1       13302   26      10      10      10      10      10      10
+1       13757   47      47      4       47      47      4       47
 ```
 ## Modify VCF files
 Add or modify the sample file name in the INFO fields:
@@ -584,7 +620,7 @@ Add or modify the sample file name in the INFO fields:
   bio-vcf --rewrite 'rec.info["sample"]="mytest"' < mytest.vcf
 ```
-To remove/select 3 samples and create a new file:
+To remove/select 3 samples:
 ```sh
   bio-vcf --samples 0,1,3 < mytest.vcf
@@ -614,11 +650,50 @@ bio-vcf --id evs --filter 'r.info.maf[0]<5.0' --rdf --tags '{"db:evs" => true, "
 Similarly for GoNL
 ```ruby
-bio-vcf --id gonl --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
 ```
+or without AF
+```ruby
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => (rec.info.ac.to_f/rec.info.an).round(2) }' < gonl_germline_overlap_r4.vcf
+```
 Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
+## Statistics
+Simple statistics are available for REF>ALT changes:
+```sh
+./bin/bio-vcf -v --statistics < test/data/input/dbsnp.vcf
+```
+    ## ==== Statistics ==================================
+      G>A             59      45%
+      C>T             30      23%
+      A>G              5       4%
+      C>G              5       4%
+      C>A              5       4%
+      G>T              4       3%
+      T>C              4       3%
+      G>C              4       3%
+      T>A              3       2%
+      A>C              3       2%
+      A>T              2       2%
+      GTCCGACCGCTCC>G  1       1%
+      CGACCGCTCC>C     1       1%
+      T>TGGAGC         1       1%
+      C>CGTCTTCA       1       1%
+      TG>T             1       1%
+      AC>A             1       1%
+      Total          130
+    ## ==================================================
 ## Other examples
 For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
@@ -654,6 +729,9 @@ what the command line interface uses (see ./bin/bio-vcf)
 ## Trouble shooting
+Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only
+in single threaded mode (for now).
 The multi-threading creates temporary files using the system TMPDIR.
 This behaviour can be overridden by setting the environment variable.
 Also, for genome-wide sequencing it may be useful to increase
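
Reviewer note: the README changes above describe `--sfilter` (and currently `--efilter`) as an ALL match and `--ifilter` as an ANY match. The following is a minimal sketch of that distinction in plain Ruby. The `Sample` struct and the helper names `all_match?` and `any_match?` are hypothetical stand-ins for illustration and are not part of bio-vcf's API.

```ruby
# Hypothetical stand-in for per-sample data; bio-vcf's real sample objects differ.
Sample = Struct.new(:name, :dp)

# sfilter/efilter style: every selected sample must meet the criterion (AND).
def all_match?(samples, &criterion)
  samples.all?(&criterion)
end

# ifilter style: a single matching sample is enough (OR).
def any_match?(samples, &criterion)
  samples.any?(&criterion)
end

samples = [Sample.new("s1", 25), Sample.new("s2", 8), Sample.new("s3", 40)]
deep = ->(s) { s.dp > 20 }

puts all_match?(samples, &deep)  # false, because s2 has dp 8
puts any_match?(samples, &deep)  # true, because s1 and s3 pass
```

Under these assumptions, a record like this one would be dropped by an ALL filter on `dp > 20` and kept by an ANY filter on the same expression.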

data/Rakefile CHANGED

@@ -17,8 +17,8 @@ Jeweler::Tasks.new do |gem|
   gem.name = "bio-vcf"
   gem.homepage = "http://github.com/pjotrp/bioruby-vcf"
   gem.license = "MIT"
-  gem.summary = %Q{VCF parser}
-  gem.description = %Q{Smart parser for VCF format}
+  gem.summary = %Q{Fast multi-threaded VCF parser}
+  gem.description = %Q{Smart lazy multi-threaded parser for VCF format with useful filtering and output rewriting}
   gem.email = "pjotr.public01@thebird.nl"
   gem.authors = ["Pjotr Prins"]
   # dependencies defined in Gemfile
@@ -47,6 +47,8 @@ Cucumber::Rake::Task.new(:features)
 task :default => :features
+task :test => [ :features ]
 require 'rdoc/task'
 Rake::RDocTask.new do |rdoc|
   version = File.exist?('VERSION') ? File.read('VERSION') : ""

data/VERSION CHANGED

@@ -1 +1 @@
-0.7.0
+0.7.3

data/bin/bio-vcf CHANGED

@@ -26,7 +26,7 @@ require 'tempfile'
 # Bio::Log::CLI.logger('stderr')
 # Bio::Log::CLI.trace('info')
-options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000 }
+options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000, num_threads: 4 }
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g.  #{File.basename($0)} < test/data/input/somaticsniper.vcf"
@@ -40,7 +40,7 @@ opts = OptionParser.new do |o|
   o.on('--sfilter cmd',String, 'Evaluate filter on each sample') do |cmd|
     options[:sfilter] = cmd
   end
-  o.on("--sfilter-samples list", Array, "Filter on selected samples") do |l|
+  o.on("--sfilter-samples list", Array, "Filter on selected samples (e.g., 0,1") do |l|
     options[:sfilter_samples] = l
   end
@@ -80,10 +80,10 @@ opts = OptionParser.new do |o|
     options[:rdf] = true
     options[:skip_header] = true
   end
-  o.on("--num-threads [num]", Integer, "Multi-core version") do |i|
+  o.on("--num-threads [num]", Integer, "Multi-core version (default #{options[:num_threads]})") do |i|
     options[:num_threads] = i
   end
-  o.on("--thread-lines num", Integer, "Fork thread on num lines (default 100_000)") do |i|
+  o.on("--thread-lines num", Integer, "Fork thread on num lines (default #{options[:thread_lines]})") do |i|
     options[:thread_lines] = i
   end
   o.on_tail("--id name", String, "Identifier") do |s|
@@ -112,6 +112,10 @@ opts = OptionParser.new do |o|
   #   Bio::Log::CLI.trace(s)
   # end
   #
+  o.on("--statistics", "Output statistics") do |q|
+    options[:statistics] = true
+    options[:num_threads] = nil
+  end
   o.on("-q", "--quiet", "Run quietly") do |q|
     # Bio::Log::CLI.trace('error')
     options[:quiet] = true
@@ -168,7 +172,7 @@ def parse_header line, samples, options
 end
 # Parse a VCF line
-def parse_line line,header,options,samples
+def parse_line line,header,options,samples,stats=nil
   fields = VcfLine.parse(line)
   rec = VcfRecord.new(fields,header)
   r = rec # alias
@@ -189,26 +193,34 @@ def parse_line line,header,options,samples
   # --------------------------
   # Filtering and set analysis
-  return if filter and not rec.filter(filter,ignore_missing,quiet)
+  return if filter and not rec.filter(filter,ignore_missing_data: ignore_missing,quiet: quiet)
   if sfilter
     rec.each_sample(options[:sfilter_samples]) do | sample |
-      return if not sample.sfilter(sfilter,ignore_missing,quiet)
+      return if not sample.sfilter(sfilter,ignore_missing_data: ignore_missing,quiet: quiet)
     end
   end
   if ifilter
+    found = false
     rec.each_sample(options[:ifilter_samples]) do | sample |
-      return if not sample.ifilter(ifilter,ignore_missing,quiet)
+      if sample.ifilter(ifilter,ignore_missing_data: ignore_missing,quiet: quiet)
+        found = true
+        break
+      end
     end
+    # Skip if there are no matches
+    return if not found
   end
   if efilter
     rec.each_sample(options[:efilter_samples]) do | sample |
-      return if not sample.efilter(efilter,ignore_missing,quiet)
+      return if not sample.efilter(efilter,ignore_missing_data: ignore_missing,quiet: quiet)
     end
   end
+  stats.add(rec) if stats
   # -----------------------------
   # From here on decide on output
   if samples
@@ -223,13 +235,13 @@ def parse_line line,header,options,samples
     begin
       results = nil # result string
       if options[:eval]
-        res = rec.eval(options[:eval],ignore_missing,quiet)
+        res = rec.eval(options[:eval],ignore_missing_data: ignore_missing,quiet: quiet)
         results = res if res
       end
       if seval
         list = (results ? [] : [rec.chr,rec.pos])
         rec.each_sample(options[:sfilter_samples]) { | sample |
-          list << sample.eval(seval,ignore_missing,quiet)
+          list << sample.eval(seval,ignore_missing_data: ignore_missing,quiet: quiet)
         }
         results = (results ? results.to_s + "\t" : "" ) + list.join("\t")
       end
@@ -249,6 +261,8 @@ def parse_line line,header,options,samples
       # Default behaviour prints VCF line, but rewrite info
       eval(options[:rewrite])
       print (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t")+"\n"
+    elsif stats
+      # do nothing
     else
       # Default behaviour prints VCF line
       $stdout.print fields.join("\t")+"\n"
@@ -261,18 +275,17 @@ end
 # Collect a buffer of lines and feed them to a thread
 # Returns the created pid, tempfilen and count_threads
 # (Note: this function should be turned into a closure)
-def parse_lines lines,header,options,samples,tempdir,count_threads
+def parse_lines lines,header,options,samples,tempdir,count_threads,stats
   pid = nil
   threadfilen = nil
   if options[:num_threads]
-    lines2 = lines.map { |l| l.clone }
     count_threads += 1
     threadfilen = tempdir+sprintf("/%0.6d-pid",count_threads)+'.bio-vcf'
     pid = fork do
       count_lines = 0
       tempfn = threadfilen+'.running'
       STDOUT.reopen(File.open(tempfn, 'w+'))
-      lines2.each do | line |
+      lines.each do | line |
         count_lines +=1 if parse_line(line,header,options,samples)
       end
       STDOUT.flush
@@ -280,10 +293,9 @@ def parse_lines lines,header,options,samples,tempdir,count_threads
       FileUtils::mv(tempfn,threadfilen)
       exit 0
     end
-    Process::detach(pid)
   else
     lines.each do | line |
-      parse_line line,header,options,samples
+      parse_line line,header,options,samples,stats
     end
   end
   return pid,threadfilen,count_threads
@@ -293,12 +305,30 @@ end
 def manage_thread_pool(workers, thread_list, num_threads)
   while true
     # ---- count running pids
-    running = thread_list.reduce(0) { | sum, thread_info | ( File.exist?(thread_info[1]+'.running') ? sum+1 : sum ) }
+    running = thread_list.reduce(0) do | sum, thread_info |
+      if thread_info[0] && pid_running?(thread_info[0])
+        sum+1
+      elsif  nil == thread_info[0] && File.exist?(thread_info[1]+'.running')
+        sum+1
+      else
+        sum
+      end
+    end
     break if running < num_threads
     sleep 0.1
   end
 end
+def pid_running?(pid)
+  begin
+    fpid,status=Process.waitpid2(pid,Process::WNOHANG)
+  rescue Errno::ECHILD, Errno::ESRCH
+    return false
+  end
+  return true if nil == fpid && nil == status
+  return ! (status.exited? || status.signaled?)
+end
 opts.parse!(ARGV)
 $stderr.print "vcf #{version} (biogem Ruby #{RUBY_VERSION}) by Pjotr Prins 2014\n" if !options[:quiet]
@@ -309,8 +339,23 @@ if options[:show_help]
   exit 1
 end
+if RUBY_VERSION =~ /^1/
+  $stderr.print "WARNING: bio-vcf runs on Ruby 2.x only\n"
+end
 $stderr.print "Options: ",options,"\n" if !options[:quiet]
+stats = nil
+if options[:statistics]
+  options[:num_threads] = nil
+  stats = BioVcf::VcfStatistics.new
+end
+# Check for option combinations
+raise "Missing option --ifilter" if options[:ifilter_samples] and not options[:ifilter]
+raise "Missing option --efilter" if options[:efilter_samples] and not options[:efilter]
+raise "Missing option --sfilter" if options[:sfilter_samples] and not options[:sfilter]
 if options[:samples]
   samples = options[:samples].map { |s| s.to_i }
 end
@@ -329,14 +374,15 @@ count_threads=0
 orig_std_out = STDOUT.clone
-Dir::mktmpdir("bio-vcf_") do |tempdir|
-  $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
-  # ---- Main loop
-  STDIN.each_line do | line |
-    line_number += 1
-    $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
-    begin
+begin
+  Dir::mktmpdir("bio-vcf_") do |tempdir|
+    $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
+    # ---- Main loop
+    STDIN.each_line do | line |
+      line_number += 1
+      $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
       # ---- In this section header information is handled
       next if header_output_completed and line =~ /^#/
       if line =~ /^##fileformat=/ or line =~ /^#CHR/
@@ -353,60 +399,74 @@ Dir::mktmpdir("bio-vcf_") do |tempdir|
       lines << line
       if lines.size > thread_lines
         manage_thread_pool(workers,thread_list,num_threads) if options[:num_threads]
-        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
+        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
         count_threads = thread_list.last[2]
         lines = []
       end
-    rescue Exception => e
-      # $stderr.print line
-      $stderr.print e.message,"\n"
-      raise if options[:verbose]
-      exit 1
     end
-  end
-  thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
-  count_threads = thread_list.last[2]
-  # ---- In this section the output gets collected and printed on STDOUT
-  if options[:num_threads]
-    STDOUT.reopen(orig_std_out)
-    $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
-    lines = []
-    fault = false
-    # Wait for the running threads to complete
-    thread_list.each do |info|
-      (pid,threadfn) = info
-      tempfn = threadfn + '.running'
-      $stderr.print "Waiting up to 3 minutes for pid=#{pid} to complete\n"
-      begin
-        Timeout.timeout(180) do
-          while not File.exist?(threadfn)  # wait for the result to appear
-            sleep 0.2
+    thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
+    count_threads = thread_list.last[2]
+    # ---- In this section the output gets collected and printed on STDOUT
+    if options[:num_threads]
+      STDOUT.reopen(orig_std_out)
+      $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
+      lines = []
+      fault = false
+      # Wait for the running threads to complete
+      thread_list.each do |info|
+        (pid,threadfn) = info
+        tempfn = threadfn + '.running'
+        timeout = 180
+        if (pid && !pid_running?(pid)) || fault
+          # no point to wait for a long time if we've failed one already or the proc is dead
+          timeout = 1
+        end
+        $stderr.print "Waiting up to #{timeout/60} minutes for pid=#{pid} to complete\n"
+        begin
+          Timeout.timeout(timeout) do
+            while not File.exist?(threadfn)  # wait for the result to appear
+              sleep 0.2
+            end
           end
+          # Thread file should have gone:
+          raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
+          $stderr.print "OK pid=#{pid}\n"
+        rescue Timeout::Error
+          if pid_running?(pid)
+            Process.kill 9, pid
+            Process.wait pid
+          end
+          $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
+          fault = true
         end
-        # Thread file should have gone:
-        raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
-        $stderr.print "OK pid=#{pid}\n"
-      rescue Timeout::Error
-        Process.kill 9, pid
-        Process.wait pid
-        $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
-        fault = true
       end
+      # Collate the output
+      thread_list.each do | info |
+        (pid,fn) = info
+        if !fault
+          # This should never happen
+          raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
+          $stderr.print "Reading #{fn}\n"
+          File.new(fn).each_line { |buf|
+            print buf
+          }
+          File.unlink(fn)
+        end
+        Process.wait(pid) if pid && pid_running?(pid)
+      end
+      return 1 if fault
     end
-    # Collate the output
-    thread_list.each do | info |
-      (pid,fn) = info
-      # This should never happen
-      raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
-      $stderr.print "Reading #{fn}\n"
-      File.new(fn).each_line { |buf|
-        print buf
-      }
-      File.unlink(fn)
-    end
-    return 1 if fault
-  end
-end  # cleans up tempdir
+  end  # cleans up tempdir
+  stats.print if stats
+rescue Exception => e
+  # $stderr.print line
+  $stderr.print e.message,"\n"
+  raise if options[:verbose]
+  exit 1
+end
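
Reviewer note: the new `pid_running?` helper polls a forked child with `Process.waitpid2` and `Process::WNOHANG` instead of relying only on the `.running` temp file. Below is a small standalone sketch of that polling pattern. The helper body follows the diff closely, while the forked child, the sleep durations and the variable names are assumptions made for the demonstration. It needs a platform where `fork` is available, so not Windows or JRuby.

```ruby
# Non-blocking liveness check for a child process, modelled on the diff above.
def pid_running?(pid)
  # With WNOHANG, waitpid2 returns nil while the child has not changed state.
  fpid, status = Process.waitpid2(pid, Process::WNOHANG)
  return true if fpid.nil? && status.nil?
  !(status.exited? || status.signaled?)
rescue Errno::ECHILD, Errno::ESRCH
  false # child was already reaped, or there is no such child
end

# Illustrative child that stays alive briefly and then exits cleanly.
pid = fork do
  sleep 0.3
  exit 0
end

running_at_start = pid_running?(pid)  # child is still sleeping
sleep 0.1 while pid_running?(pid)     # poll until the child has exited
running_at_end = pid_running?(pid)    # child is reaped, so ECHILD is rescued

puts "running at start: #{running_at_start}"
puts "running at end:   #{running_at_end}"
```

Because a successful `waitpid2` call reaps the child, a later call for the same pid raises `Errno::ECHILD`, which is why the rescue branch reports the process as not running.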