RubyGems - bio-vcf - Versions diffs - 0.7.0 → 0.7.3 - Mend

bio-vcf 0.7.0 → 0.7.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

checksums.yaml +4 -4
data/.travis.yml +3 -2
data/Gemfile +2 -5
data/Gemfile.lock +3 -3
data/README.md +101 -23
data/Rakefile +4 -2
data/VERSION +1 -1
data/bin/bio-vcf +133 -73
data/bio-vcf.gemspec +13 -10
data/features/cli.feature +9 -1
data/features/multisample.feature +4 -4
data/features/sfilter.feature +1 -1
data/features/step_definitions/cli-feature.rb +4 -0
data/features/step_definitions/multisample.rb +24 -12
data/features/step_definitions/sfilter.rb +80 -31
data/lib/bio-vcf.rb +1 -0
data/lib/bio-vcf/vcfgenotypefield.rb +45 -9
data/lib/bio-vcf/vcfheader.rb +1 -1
data/lib/bio-vcf/vcfrecord.rb +14 -8
data/lib/bio-vcf/vcfsample.rb +101 -152
data/lib/bio-vcf/vcfstatistics.rb +28 -0
data/test/data/regression/ifilter_s.dp.ref +31 -0
data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -0
metadata +16 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c4325d76baee5956ed3f58277ad622cf9a0a6ce7
-  data.tar.gz: e971b0fb0f760aafb32af51a647f1ba39f59f26f
+  metadata.gz: 72f63aae77e382c88e04cb07344ab3fce2a57232
+  data.tar.gz: 48f36f4d75d18edf3619124f0b679706a002b646
 SHA512:
-  metadata.gz: a99e0be8ce0fd84d8afc557e5e30418da2fd98d2ad7458b242e9402e4605d5b7f816758da39b9248dad3ebf53e8a5a3c17862927c284832772cbf68c3b9d2fbc
-  data.tar.gz: 49f0e38cf66781a2d35bb45849d83bc137e299d42d3539928e70079b4ffbcc70bca20c4149d7cae355696e18670aab50dfbe0f8d87ffc03b6b7445b3d676eb95
+  metadata.gz: a1d1513454924e1d84bb9aecf1bd83cf0d8784e3ab8669bb7777129f2a8b33df53a2118d4fea15a825ce1c399de05cf729e0f7b5e7acd35367856a6a1821f328
+  data.tar.gz: 7b40ffdad49cbb690cfaa4d02e7a060095dc13aa4533f091f1f20b2f2b1903c71d2f6c6c5c5f672dc3be4da751305c2583265dbe7d86308501ab0219cca9e414

data/.travis.yml CHANGED

@@ -1,8 +1,9 @@
 language: ruby
 rvm:
-  - 1.9.3
+#  - 1.9.3 <- No longer working
+  - 2.0.0
   - 2.1.0
-  - jruby-head
+#  - jruby-head
 #  - jruby-19mode # JRuby in 1.9 mode
 #  - 1.8.7
 #  - jruby-18mode # JRuby in 1.8 mode

data/Gemfile CHANGED

@@ -9,9 +9,6 @@ group :development do
   # gem "minitest"
   gem "rspec"
   gem "cucumber"
-  gem "jeweler" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
-  # gem "bundler", ">= 1.0.21"
-  # gem "bio", ">= 1.4.2"
-  # gem "rdoc", "~> 3.12"
-  gem "regressiontest"
+  gem "jeweler", "~> 2.0.1" # , "~> 1.8.4", :git => "https://github.com/technicalpickles/jeweler.git"
+  gem "regressiontest", "~> 0.0.3"
 end

data/Gemfile.lock CHANGED

@@ -60,7 +60,7 @@ GEM
     rake (10.1.1)
     rdoc (4.1.1)
       json (~> 1.4)
-    regressiontest (0.0.2)
+    regressiontest (0.0.3)
     rspec (2.14.1)
       rspec-core (~> 2.14.0)
       rspec-expectations (~> 2.14.0)
@@ -76,6 +76,6 @@ PLATFORMS
 DEPENDENCIES
   cucumber
-  jeweler
-  regressiontest
+  jeweler (~> 2.0.1)
+  regressiontest (~> 0.0.3)
   rspec

data/README.md CHANGED

@@ -2,14 +2,27 @@
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
-Yet another VCF parser. Bio-vcf is not only fast for genome-wide data,
-it also comes with a really nice filtering, evaluation and rewrite
-language. Bio-vcf has better performance than other tools
+A new generation VCF parser. Bio-vcf is not only fast for genome-wide
+(WGS) data, it also comes with a really nice filtering, evaluation and
+rewrite language. Why would you use bio-vcf over other parsers?
+1. Bio-vcf is fast and scales on multi-core computers
+2. Bio-vcf has an expressive filtering and evaluation language
+3. Bio-vcf has great multi-sample support
+4. Bio-vcf has multiple global filters and sample filters
+5. Bio-vcf can access any VCF format
+6. Bio-vcf can do calculations on fields
+7. Bio-vcf allows for genotype processing
+8. Bio-vcf has support for set analysis
+9. Bio-vcf has sane error handling
+10. Bio-vcf can output tabular data, HTML, LaTeX, RDF and (soon) JSON
+Bio-vcf has better performance than other tools
 because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering. For example on an 2 core machine
-bio-vcf is 50% faster than SnpSift. On an 8 core machine bio-vcf is
-3x faster than SnpSift. Parsing a 1 Gb ESP VCF with 8 cores with
-bio-vcf takes
+(fancy) command line filtering. For example on an 2 core machine
+bio-vcf is typically 50% faster than JVM based SnpSift. On an 8 core machine
+bio-vcf is at least 3x faster than SnpSift. Parsing a 1 Gb ESP
+VCF with 8 cores with bio-vcf takes
 ```sh
   time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
@@ -18,7 +31,7 @@ bio-vcf takes
   sys     0m7.852s
 ```
-and parsing with SnpSift takes
+while parsing with SnpSift takes
 ```sh
   time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
@@ -32,22 +45,22 @@ Illumina Hiseq VCF file and evaluating the results into a BED format on
 a 16 core machine takes
 ```sh
-  time bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
+  time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
   real    0m47.612s
   user    8m18.234s
   sys     0m5.039s
 ```
-which shows some pretty decent core utilisation (10x).
+which shows pretty decent core utilisation (10x). We are running
+gzip compressed VCF files of 30+ Gb with similar performance gains.
 Use zcat to
-pipe gzipped (vcf.gz) files into bio-vcf, e.g.
+pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
 ```sh
   zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
     --sfilter '!s.empty? and s.dp>20'
     --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
 ```
 bio-vcf comes with a sensible parser definition language (it is 100%
@@ -184,6 +197,13 @@ commands exit for filtering and eval. When a set is defined, based on
 the sample name, you can apply filters on the samples inside the set,
 outside the set and over all samples. E.g.
+So, why would you use bio-vcf instead of rolling out your own
+Perl/Python/other ad-hoc script? I think the reason should be that
+there is less chance of mistakes because of Bio-vcf's clear filtering
+language and sensible built-in validation. The second reason would be
+speed. Bio-vcf's multi-threading capability gives it great and hard to
+replicate performance.
 Also note you can use
 [bio-table](https://github.com/pjotrp/bioruby-table) to
 filter/transform data further and convert to other formats, such as
@@ -202,7 +222,7 @@ example of a VCF statement you need to work on.
 ## Installation
-Note that you need Ruby 1.9.3 or later. The 2.x Ruby series also give
+Note that you need Ruby 2.x or later. The 2.x Ruby series also give
 a performance improvement. Bio-vcf will show the Ruby version when
 typing the command 'bio-vcf -h'.
@@ -371,7 +391,7 @@ And even better because of Ruby magic
 Note that only valid method names in lower case get picked up this
 way. Also by convention normal is sample 1 and tumor is sample 2.
-Even shorter r is an alias for rec (nyi)
+Even shorter r is an alias for rec
 ```sh
   bio-vcf --eval "r.original.gt" < file.vcf
@@ -380,7 +400,8 @@ Even shorter r is an alias for rec (nyi)
 ## Special functions
-Note: special functions are not yet implemented!
+Note: special functions are not yet implemented! Look below
+for genotype processing which has indexing in 'gti'.
 Sometime you want to use a special function in a filter. For
 example percentage variant reads can be defined as [a,c,g,t]
@@ -440,7 +461,8 @@ example, samples are selected that evaluate to true, all others should
 evaluate to false. For this we create three filters, one for all
 samples that are included (the --ifilter or -if), for all samples that
 are excluded (the --efilter or -ef) and for any sample (the --sfilter
-or -sf). So i=include, e=exclude and s=any sample.
+or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
+filter).
 The equivalent of the union filter is by using the --sfilter, so
@@ -448,15 +470,19 @@ The equivalent of the union filter is by using the --sfilter, so
   bio-vcf --sfilter 's.dp>20'
 ```
-Filters DP on all samples. To filter on a subset you can add a
+Filters DP on all samples and is true if all samples match the
+criterium (AND). To filter on a subset you can add a
 selector
 ```sh
   bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20'
 ```
-For set analysis there are the additional ifilter (include) and efilter (exclude). To filter
-on samples 0,1,4 and output the gq values
+For set analysis there are the additional ifilter (include) and
+efilter (exclude).  Where sfilter represents an ALL match, the ifilter
+represents an ANY match, i.e., it is true if one of the samples
+matches the criterium (OR). To filter on samples 0,1,4 and output the gq
+values
 ```sh
   bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gq<10 or s.gq==99' --seval s.gq
@@ -494,8 +520,10 @@ To set an additional filter on the excluded samples:
 ```
 Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
+Currently the efilter is an ALL filter (AND), i.e. all excluded
+samples need to match the criterium.
-The following are not yet implemented:
+The following regular expression matches are not yet implemented:
 In the near future it is also possible to select samples on a regex (here
 select all samples where the name starts with s3)
@@ -560,6 +588,8 @@ and 'gts' as a nucleotide string array
     1       15274   G       G       G       G       G       G       G
 ```
+where gts represents the indexed genotype on [ref] + [alt].
 These values can also be used in filters and output allele depth, for
 example
@@ -570,12 +600,18 @@ example
     1       13757   47      47      4       47      47      4       47
 ```
-The following does not yet work (using the gti in a sample directly)
+You can use the genotype index gti to fetch values from, for example,
+allele depth:
 ```ruby
 bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gti[0]==0' --seval 'rec.original.ad[s.gti[1]]'
+1       10257   151     151     151     151     151     8       151
+1       13302   26      10      10      10      10      10      10
+1       13757   47      47      4       47      47      4       47
 ```
 ## Modify VCF files
 Add or modify the sample file name in the INFO fields:
@@ -584,7 +620,7 @@ Add or modify the sample file name in the INFO fields:
   bio-vcf --rewrite 'rec.info["sample"]="mytest"' < mytest.vcf
 ```
-To remove/select 3 samples and create a new file:
+To remove/select 3 samples:
 ```sh
   bio-vcf --samples 0,1,3 < mytest.vcf
@@ -614,11 +650,50 @@ bio-vcf --id evs --filter 'r.info.maf[0]<5.0' --rdf --tags '{"db:evs" => true, "
 Similarly for GoNL
 ```ruby
-bio-vcf --id gonl --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
 ```
+or without AF
+```ruby
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => (rec.info.ac.to_f/rec.info.an).round(2) }' < gonl_germline_overlap_r4.vcf
+```
 Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
+## Statistics
+Simple statistics are available for REF>ALT changes:
+```sh
+./bin/bio-vcf -v --statistics < test/data/input/dbsnp.vcf
+```
+    ## ==== Statistics ==================================
+      G>A             59      45%
+      C>T             30      23%
+      A>G              5       4%
+      C>G              5       4%
+      C>A              5       4%
+      G>T              4       3%
+      T>C              4       3%
+      G>C              4       3%
+      T>A              3       2%
+      A>C              3       2%
+      A>T              2       2%
+      GTCCGACCGCTCC>G  1       1%
+      CGACCGCTCC>C     1       1%
+      T>TGGAGC         1       1%
+      C>CGTCTTCA       1       1%
+      TG>T             1       1%
+      AC>A             1       1%
+      Total          130
+    ## ==================================================
 ## Other examples
 For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
@@ -654,6 +729,9 @@ what the command line interface uses (see ./bin/bio-vcf)
 ## Trouble shooting
+Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only
+in single threaded mode (for now).
 The multi-threading creates temporary files using the system TMPDIR.
 This behaviour can be overridden by setting the environment variable.
 Also, for genome-wide sequencing it may be useful to increase
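
Note on the filter semantics described in the README changes above: sfilter is an ALL match (AND) and ifilter is an ANY match (OR). The sketch below only illustrates that distinction in plain Ruby. The hashes and the `dp_filter` lambda are hypothetical stand-ins and are not part of the bio-vcf API.

```ruby
# Hypothetical per-sample depth values; plain hashes stand in for
# bio-vcf's sample objects, which are not used here.
samples = [{ dp: 25 }, { dp: 8 }, { dp: 40 }]

# The filter expression 's.dp>20' written as an ordinary lambda.
dp_filter = ->(s) { s[:dp] > 20 }

# sfilter semantics: ALL selected samples must match (AND).
sfilter_pass = samples.all?(&dp_filter)

# ifilter semantics: ANY selected sample may match (OR).
ifilter_pass = samples.any?(&dp_filter)

# Restricting to a subset by index, in the spirit of --sfilter-samples 0,2.
subset_pass = samples.values_at(0, 2).all?(&dp_filter)

puts "sfilter (ALL) over every sample: #{sfilter_pass}"
puts "ifilter (ANY) over every sample: #{ifilter_pass}"
puts "sfilter (ALL) over samples 0,2:  #{subset_pass}"
```

With these made-up values the ALL match fails because one sample has a depth of 8, while the ANY match and the restricted subset both pass.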

data/Rakefile CHANGED

@@ -17,8 +17,8 @@ Jeweler::Tasks.new do |gem|
   gem.name = "bio-vcf"
   gem.homepage = "http://github.com/pjotrp/bioruby-vcf"
   gem.license = "MIT"
-  gem.summary = %Q{VCF parser}
-  gem.description = %Q{Smart parser for VCF format}
+  gem.summary = %Q{Fast multi-threaded VCF parser}
+  gem.description = %Q{Smart lazy multi-threaded parser for VCF format with useful filtering and output rewriting}
   gem.email = "pjotr.public01@thebird.nl"
   gem.authors = ["Pjotr Prins"]
   # dependencies defined in Gemfile
@@ -47,6 +47,8 @@ Cucumber::Rake::Task.new(:features)
 task :default => :features
+task :test => [ :features ]
 require 'rdoc/task'
 Rake::RDocTask.new do |rdoc|
   version = File.exist?('VERSION') ? File.read('VERSION') : ""

data/VERSION CHANGED

@@ -1 +1 @@
-0.7.0
+0.7.3

data/bin/bio-vcf CHANGED

@@ -26,7 +26,7 @@ require 'tempfile'
 # Bio::Log::CLI.logger('stderr')
 # Bio::Log::CLI.trace('info')
-options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000 }
+options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000, num_threads: 4 }
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g.  #{File.basename($0)} < test/data/input/somaticsniper.vcf"
@@ -40,7 +40,7 @@ opts = OptionParser.new do |o|
   o.on('--sfilter cmd',String, 'Evaluate filter on each sample') do |cmd|
     options[:sfilter] = cmd
   end
-  o.on("--sfilter-samples list", Array, "Filter on selected samples") do |l|
+  o.on("--sfilter-samples list", Array, "Filter on selected samples (e.g., 0,1") do |l|
     options[:sfilter_samples] = l
   end
@@ -80,10 +80,10 @@ opts = OptionParser.new do |o|
     options[:rdf] = true
     options[:skip_header] = true
   end
-  o.on("--num-threads [num]", Integer, "Multi-core version") do |i|
+  o.on("--num-threads [num]", Integer, "Multi-core version (default #{options[:num_threads]})") do |i|
     options[:num_threads] = i
   end
-  o.on("--thread-lines num", Integer, "Fork thread on num lines (default 100_000)") do |i|
+  o.on("--thread-lines num", Integer, "Fork thread on num lines (default #{options[:thread_lines]})") do |i|
     options[:thread_lines] = i
   end
   o.on_tail("--id name", String, "Identifier") do |s|
@@ -112,6 +112,10 @@ opts = OptionParser.new do |o|
   #   Bio::Log::CLI.trace(s)
   # end
   #
+  o.on("--statistics", "Output statistics") do |q|
+    options[:statistics] = true
+    options[:num_threads] = nil
+  end
   o.on("-q", "--quiet", "Run quietly") do |q|
     # Bio::Log::CLI.trace('error')
     options[:quiet] = true
@@ -168,7 +172,7 @@ def parse_header line, samples, options
 end
 # Parse a VCF line
-def parse_line line,header,options,samples
+def parse_line line,header,options,samples,stats=nil
   fields = VcfLine.parse(line)
   rec = VcfRecord.new(fields,header)
   r = rec # alias
@@ -189,26 +193,34 @@ def parse_line line,header,options,samples
   # --------------------------
   # Filtering and set analysis
-  return if filter and not rec.filter(filter,ignore_missing,quiet)
+  return if filter and not rec.filter(filter,ignore_missing_data: ignore_missing,quiet: quiet)
   if sfilter
     rec.each_sample(options[:sfilter_samples]) do | sample |
-      return if not sample.sfilter(sfilter,ignore_missing,quiet)
+      return if not sample.sfilter(sfilter,ignore_missing_data: ignore_missing,quiet: quiet)
     end
   end
   if ifilter
+    found = false
     rec.each_sample(options[:ifilter_samples]) do | sample |
-      return if not sample.ifilter(ifilter,ignore_missing,quiet)
+      if sample.ifilter(ifilter,ignore_missing_data: ignore_missing,quiet: quiet)
+        found = true
+        break
+      end
     end
+    # Skip if there are no matches
+    return if not found
   end
   if efilter
     rec.each_sample(options[:efilter_samples]) do | sample |
-      return if not sample.efilter(efilter,ignore_missing,quiet)
+      return if not sample.efilter(efilter,ignore_missing_data: ignore_missing,quiet: quiet)
     end
   end
+  stats.add(rec) if stats
   # -----------------------------
   # From here on decide on output
   if samples
@@ -223,13 +235,13 @@ def parse_line line,header,options,samples
     begin
       results = nil # result string
       if options[:eval]
-        res = rec.eval(options[:eval],ignore_missing,quiet)
+        res = rec.eval(options[:eval],ignore_missing_data: ignore_missing,quiet: quiet)
         results = res if res
       end
       if seval
         list = (results ? [] : [rec.chr,rec.pos])
         rec.each_sample(options[:sfilter_samples]) { | sample |
-          list << sample.eval(seval,ignore_missing,quiet)
+          list << sample.eval(seval,ignore_missing_data: ignore_missing,quiet: quiet)
         }
         results = (results ? results.to_s + "\t" : "" ) + list.join("\t")
       end
@@ -249,6 +261,8 @@ def parse_line line,header,options,samples
       # Default behaviour prints VCF line, but rewrite info
       eval(options[:rewrite])
       print (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t")+"\n"
+    elsif stats
+      # do nothing
     else
       # Default behaviour prints VCF line
       $stdout.print fields.join("\t")+"\n"
@@ -261,18 +275,17 @@ end
 # Collect a buffer of lines and feed them to a thread
 # Returns the created pid, tempfilen and count_threads
 # (Note: this function should be turned into a closure)
-def parse_lines lines,header,options,samples,tempdir,count_threads
+def parse_lines lines,header,options,samples,tempdir,count_threads,stats
   pid = nil
   threadfilen = nil
   if options[:num_threads]
-    lines2 = lines.map { |l| l.clone }
     count_threads += 1
     threadfilen = tempdir+sprintf("/%0.6d-pid",count_threads)+'.bio-vcf'
     pid = fork do
       count_lines = 0
       tempfn = threadfilen+'.running'
       STDOUT.reopen(File.open(tempfn, 'w+'))
-      lines2.each do | line |
+      lines.each do | line |
         count_lines +=1 if parse_line(line,header,options,samples)
       end
       STDOUT.flush
@@ -280,10 +293,9 @@ def parse_lines lines,header,options,samples,tempdir,count_threads
       FileUtils::mv(tempfn,threadfilen)
       exit 0
     end
-    Process::detach(pid)
   else
     lines.each do | line |
-      parse_line line,header,options,samples
+      parse_line line,header,options,samples,stats
     end
   end
   return pid,threadfilen,count_threads
@@ -293,12 +305,30 @@ end
 def manage_thread_pool(workers, thread_list, num_threads)
   while true
     # ---- count running pids
-    running = thread_list.reduce(0) { | sum, thread_info | ( File.exist?(thread_info[1]+'.running') ? sum+1 : sum ) }
+    running = thread_list.reduce(0) do | sum, thread_info |
+      if thread_info[0] && pid_running?(thread_info[0])
+        sum+1
+      elsif  nil == thread_info[0] && File.exist?(thread_info[1]+'.running')
+        sum+1
+      else
+        sum
+      end
+    end
     break if running < num_threads
     sleep 0.1
   end
 end
+def pid_running?(pid)
+  begin
+    fpid,status=Process.waitpid2(pid,Process::WNOHANG)
+  rescue Errno::ECHILD, Errno::ESRCH
+    return false
+  end
+  return true if nil == fpid && nil == status
+  return ! (status.exited? || status.signaled?)
+end
 opts.parse!(ARGV)
 $stderr.print "vcf #{version} (biogem Ruby #{RUBY_VERSION}) by Pjotr Prins 2014\n" if !options[:quiet]
@@ -309,8 +339,23 @@ if options[:show_help]
   exit 1
 end
+if RUBY_VERSION =~ /^1/
+  $stderr.print "WARNING: bio-vcf runs on Ruby 2.x only\n"
+end
 $stderr.print "Options: ",options,"\n" if !options[:quiet]
+stats = nil
+if options[:statistics]
+  options[:num_threads] = nil
+  stats = BioVcf::VcfStatistics.new
+end
+# Check for option combinations
+raise "Missing option --ifilter" if options[:ifilter_samples] and not options[:ifilter]
+raise "Missing option --efilter" if options[:efilter_samples] and not options[:efilter]
+raise "Missing option --sfilter" if options[:sfilter_samples] and not options[:sfilter]
 if options[:samples]
   samples = options[:samples].map { |s| s.to_i }
 end
@@ -329,14 +374,15 @@ count_threads=0
 orig_std_out = STDOUT.clone
-Dir::mktmpdir("bio-vcf_") do |tempdir|
-  $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
-  # ---- Main loop
-  STDIN.each_line do | line |
-    line_number += 1
-    $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
-    begin
+begin
+  Dir::mktmpdir("bio-vcf_") do |tempdir|
+    $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
+    # ---- Main loop
+    STDIN.each_line do | line |
+      line_number += 1
+      $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
       # ---- In this section header information is handled
       next if header_output_completed and line =~ /^#/
       if line =~ /^##fileformat=/ or line =~ /^#CHR/
@@ -353,60 +399,74 @@ Dir::mktmpdir("bio-vcf_") do |tempdir|
       lines << line
       if lines.size > thread_lines
         manage_thread_pool(workers,thread_list,num_threads) if options[:num_threads]
-        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
+        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
         count_threads = thread_list.last[2]
         lines = []
       end
-    rescue Exception => e
-      # $stderr.print line
-      $stderr.print e.message,"\n"
-      raise if options[:verbose]
-      exit 1
     end
-  end
-  thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
-  count_threads = thread_list.last[2]
-  # ---- In this section the output gets collected and printed on STDOUT
-  if options[:num_threads]
-    STDOUT.reopen(orig_std_out)
-    $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
-    lines = []
-    fault = false
-    # Wait for the running threads to complete
-    thread_list.each do |info|
-      (pid,threadfn) = info
-      tempfn = threadfn + '.running'
-      $stderr.print "Waiting up to 3 minutes for pid=#{pid} to complete\n"
-      begin
-        Timeout.timeout(180) do
-          while not File.exist?(threadfn)  # wait for the result to appear
-            sleep 0.2
+    thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
+    count_threads = thread_list.last[2]
+    # ---- In this section the output gets collected and printed on STDOUT
+    if options[:num_threads]
+      STDOUT.reopen(orig_std_out)
+      $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
+      lines = []
+      fault = false
+      # Wait for the running threads to complete
+      thread_list.each do |info|
+        (pid,threadfn) = info
+        tempfn = threadfn + '.running'
+        timeout = 180
+        if (pid && !pid_running?(pid)) || fault
+          # no point to wait for a long time if we've failed one already or the proc is dead
+          timeout = 1
+        end
+        $stderr.print "Waiting up to #{timeout/60} minutes for pid=#{pid} to complete\n"
+        begin
+          Timeout.timeout(timeout) do
+            while not File.exist?(threadfn)  # wait for the result to appear
+              sleep 0.2
+            end
           end
+          # Thread file should have gone:
+          raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
+          $stderr.print "OK pid=#{pid}\n"
+        rescue Timeout::Error
+          if pid_running?(pid)
+            Process.kill 9, pid
+            Process.wait pid
+          end
+          $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
+          fault = true
         end
-        # Thread file should have gone:
-        raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
-        $stderr.print "OK pid=#{pid}\n"
-      rescue Timeout::Error
-        Process.kill 9, pid
-        Process.wait pid
-        $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
-        fault = true
       end
+      # Collate the output
+      thread_list.each do | info |
+        (pid,fn) = info
+        if !fault
+          # This should never happen
+          raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
+          $stderr.print "Reading #{fn}\n"
+          File.new(fn).each_line { |buf|
+            print buf
+          }
+          File.unlink(fn)
+        end
+        Process.wait(pid) if pid && pid_running?(pid)
+      end
+      return 1 if fault
     end
-    # Collate the output
-    thread_list.each do | info |
-      (pid,fn) = info
-      # This should never happen
-      raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
-      $stderr.print "Reading #{fn}\n"
-      File.new(fn).each_line { |buf|
-        print buf
-      }
-      File.unlink(fn)
-    end
-    return 1 if fault
-  end
-end  # cleans up tempdir
+  end  # cleans up tempdir
+  stats.print if stats
+rescue Exception => e
+  # $stderr.print line
+  $stderr.print e.message,"\n"
+  raise if options[:verbose]
+  exit 1
+end
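
Note on the new `pid_running?` helper in the hunk above: it polls a forked child with `Process.waitpid2` and `Process::WNOHANG` instead of relying only on the `.running` marker file. The following is a minimal standalone sketch of that non-blocking check, assuming a Unix-like platform where `fork` is available. The name `child_running?` and the sleep durations are illustrative and do not come from the gem.

```ruby
# Non-blocking liveness check for a child process (illustrative name).
# With WNOHANG, waitpid2 returns nil while the child is still running
# and [pid, status] once it has exited, which also reaps it.
def child_running?(pid)
  reaped_pid, _status = Process.waitpid2(pid, Process::WNOHANG)
  reaped_pid.nil?
rescue Errno::ECHILD, Errno::ESRCH
  # No such child, or it was already reaped by an earlier call.
  false
end

pid = fork { sleep 0.2 }      # child does a short piece of "work"
before = child_running?(pid)  # checked while the child is still sleeping
sleep 0.5                     # give the child time to exit
after = child_running?(pid)   # the child has exited and is reaped here

puts "running right after fork: #{before}"
puts "running after child exit: #{after}"
```

Because the first successful `waitpid2` call reaps the child, a later check on the same pid raises `Errno::ECHILD`, which is why the rescue clause treats that case as not running.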