bio-vcf 0.0.3 → 0.7.0

Files changed (22)

checksums.yaml +4 -4
data/README.md +145 -20
data/VERSION +1 -1
data/bin/bio-vcf +204 -62
data/bio-vcf.gemspec +7 -3
data/features/cli.feature +16 -0
data/features/multisample.feature +10 -0
data/features/sfilter.feature +60 -0
data/features/step_definitions/cli-feature.rb +1 -1
data/features/step_definitions/multisample.rb +32 -0
data/features/step_definitions/sfilter.rb +90 -0
data/lib/bio-vcf/utils.rb +12 -6
data/lib/bio-vcf/vcfgenotypefield.rb +4 -1
data/lib/bio-vcf/vcfheader.rb +24 -0
data/lib/bio-vcf/vcfrdf.rb +15 -8
data/lib/bio-vcf/vcfrecord.rb +45 -9
data/lib/bio-vcf/vcfsample.rb +94 -5
data/test/data/regression/sfilter_seval_s.dp.ref +31 -0
data/test/data/regression/{sfilter001.ref → thread4.ref} +5 -0
data/test/data/regression/thread4_4.ref +150 -0
data/test/performance/metrics.md +53 -19
metadata +7 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 1f08be0a8d7ad751ad4758156e5ce6ccbc518cc0
-  data.tar.gz: 741386a278d7c38340abf35cc08d1f1923636131
+  metadata.gz: c4325d76baee5956ed3f58277ad622cf9a0a6ce7
+  data.tar.gz: e971b0fb0f760aafb32af51a647f1ba39f59f26f
 SHA512:
-  metadata.gz: 04d20d248629cccebbd3d639c2a25bb5d33efd2163999f87912343f44442c1c3f19429d6008900973116354a522679e75d853e3e6f9428d54a6647a38ef5e7fe
-  data.tar.gz: 625b39c9172569d3e893721a6f943721b30032b9946cea03923524af75345793edb3654f1489ebb21e084d81970a288e0552e442c27857647d0982999c497487
+  metadata.gz: a99e0be8ce0fd84d8afc557e5e30418da2fd98d2ad7458b242e9402e4605d5b7f816758da39b9248dad3ebf53e8a5a3c17862927c284832772cbf68c3b9d2fbc
+  data.tar.gz: 49f0e38cf66781a2d35bb45849d83bc137e299d42d3539928e70079b4ffbcc70bca20c4149d7cae355696e18670aab50dfbe0f8d87ffc03b6b7445b3d676eb95

data/README.md CHANGED

@@ -2,14 +2,62 @@
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
-Yet another VCF parser. This one may give better performance because
-of lazy parsing and useful combinations of (fancy) command line
-filtering. bio-vcf comes with a sensible parser definition language,
-as well as primitives for set analysis. Also few assumptions are made
-about the actual contents of the VCF file (field names are resolved on
-the fly).
+Yet another VCF parser. Bio-vcf is not only fast for genome-wide data,
+it also comes with a really nice filtering, evaluation and rewrite
+language. Bio-vcf has better performance than other tools
+because of lazy parsing, multi-threading, and useful combinations of
+(fancy) command line filtering. For example on an 2 core machine
+bio-vcf is 50% faster than SnpSift. On an 8 core machine bio-vcf is
+3x faster than SnpSift. Parsing a 1 Gb ESP VCF with 8 cores with
+bio-vcf takes
-To fetch all entries where all samples have depth larger than 20 use an sfilter
+```sh
+  time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
+  real    0m21.095s
+  user    1m41.101s
+  sys     0m7.852s
+```
+and parsing with SnpSift takes
+```sh
+  time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
+  real    1m4.913s
+  user    0m58.071s
+  sys     0m7.982s
+```
+Bio-vcf is perfect for parsing large data files. Parsing a 650 Mb GATK
+Illumina Hiseq VCF file and evaluating the results into a BED format on
+a 16 core machine takes
+```sh
+  time bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
+  real    0m47.612s
+  user    8m18.234s
+  sys     0m5.039s
+```
+which shows some pretty decent core utilisation (10x).
+Use zcat to
+pipe gzipped (vcf.gz) files into bio-vcf, e.g.
+```sh
+  zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
+    --sfilter '!s.empty? and s.dp>20'
+    --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
+```
+bio-vcf comes with a sensible parser definition language (it is 100%
+Ruby), as well as primitives for set analysis. Few
+assumptions are made about the actual contents of the VCF file (field
+names are resolved on the fly), so bio-vcf should practically work with
+all VCF files.
+To fetch all entries where all samples have depth larger than 20 use
+a sample filter
 ```ruby
   bio-vcf --sfilter 'sample.dp>20' < file.vcf
@@ -38,7 +86,7 @@ use the --eval switch, e.g.,
   bio-vcf --eval 'rec.alt+"\t"+rec.info.dp+"\t"+rec.tumor.gq.to_s' < file.vcf
 ```
-In fact, if the result is an Array the output gets tab dilimited so
+In fact, if the result is an Array the output gets tab dilimited, so
 the nicer version is
 ```ruby
@@ -61,13 +109,42 @@ bio-vcf -i --sfilter 's.dp>100' --seval 's.dp' < file.vcf
 Where -i ignores missing samples. Pick up sample allele depth
 ```ruby
-bio-vcf -i --seval 's.ad'
-  1       10257   151,8   219,22  227,22  226,22  166,18  185,27  201,15
-  1       10291   145,16  218,26  214,30  213,32  122,36  131,27  156,31
-  1       10297   155,18  218,23  219,26  207,30  137,20  124,27  151,27
+bio-vcf -i --seval 's.ad.to_s'
+1       10257   [151, 8]        [219, 22]       [227, 22]       [226, 22]       [166, 18]       [185, 27]  [201, 15]
+1       10291   [145, 16]       [218, 26]       [214, 30]       [213, 32]       [122, 36]       [131, 27]  [156, 31]
+1       10297   [155, 18]       [218, 23]       [219, 26]       [207, 30]       [137, 20]       [124, 27]  [151, 27]
+1       10303   [169, 25]       [211, 31]       [214, 28]       [214, 32]       [146, 17]       [123, 23]  [156, 22]
+```
+To get the alt depth per sample
+```ruby
+bio-vcf -i --seval 's.ad[1]'
+1       10257   8       22      22      22      18      27      15
+1       10291   16      26      30      32      36      27      31
+1       10297   18      23      26      30      20      27      27
+1       10303   25      31      28      32      17      23      22
+```
+To calculate alt frequencies from s.ad which is sample (alt dp)/(ref dp + alt dp)
+```ruby
+bio-vcf -i --seval 's.ad[1].to_f/(s.ad[0]+s.ad[1])'
+1       10257   0.050314465408805034    0.0912863070539419      0.08835341365461848     0.08870967741935484     0.09782608695652174      0.12735849056603774     0.06944444444444445
+1       10291   0.09937888198757763     0.10655737704918032     0.12295081967213115     0.1306122448979592 0.22784810126582278      0.17088607594936708     0.1657754010695187
+```
+note the floating point conversion .to_f is needed, otherwise you get
+an integer division. To account for multiple alleles
+```ruby
+bio-vcf -i --eval 'r.ref+">"+r.alt[0]' --seval 'tot=s.ad.reduce(:+) ; (tot-s.ad[0].to_f)/tot' --set-header "mutation,#samples"
+mutation        Original        s1t1    s2t1    s3t1    s1t2    s2t2    s3t2
+A>C     0.050314465408805034    0.0912863070539419      0.08835341365461848     0.08870967741935484     0.09782608695652174 0.12735849056603774     0.06944444444444445
+C>T     0.09937888198757763     0.10655737704918032     0.12295081967213115     0.1306122448979592      0.22784810126582278 0.17088607594936708     0.1657754010695187
 ```
-And to output DP ang GQ values for tumor normal:
+To output DP ang GQ values for tumor normal:
 ```ruby
 bio-vcf --filter 'r.normal.dp>=7 and r.tumor.dp>=5' --seval '[s.dp,s.gq]' < freebayes.vcf
@@ -83,13 +160,25 @@ bio-vcf --filter 'r.normal.dp>=7 and r.tumor.dp>=5' --seval '[s.dp,s.gq]' < free
 To parse and output genotype
 ```ruby
-bio-vcf -iq --sfilter 's.dp>=20 and s.gq>=20' --ifilter-sampler 's.gt!="0/0"' --seval s.gt < test/data/input/multisample.vcf
+bio-vcf -iq --sfilter 's.dp>=20 and s.gq>=20' --ifilter-samples 's.gt!="0/0"' --seval s.gt < test/data/input/multisample.vcf
 1       10257   0/0     0/0     0/0     0/0     0/0     0/1     0/0
 1       10291   0/1     0/1     0/1     0/1     0/1     0/1     0/1
 1       10297   0/1     0/1     0/1     0/0     0/0     0/1     0/1
 1       12783   0/1     0/1     0/1     0/1     0/1     0/1     0/1
 ```
+And use --set-header if you want to add a header
+```ruby
+bio-vcf -iq --set-header 'chr,pos,#samples' --sfilter 's.dp>=20 and s.gq>=20' --ifilter-samples 's.gt!="0/0"' --seval s.gt < test/data/input/multisample.vcf
+chr     pos     orig   s1t1    s2t1    s3t1    s1t2    s2t2    s3t2
+1       10257   0/0     0/0     0/0     0/0     0/0     0/1     0/0
+1       10291   0/1     0/1     0/1     0/1     0/1     0/1     0/1
+(etc)
+```
+where #samples gets expanded.
 Most filter and eval commands can be used at the same time. Special set
 commands exit for filtering and eval. When a set is defined, based on
 the sample name, you can apply filters on the samples inside the set,
@@ -111,13 +200,17 @@ If something is not working, check out the feature descriptions and
 the source code. It is not hard to add features. Otherwise, send a short
 example of a VCF statement you need to work on.
-bio-vcf is fast. Parsing a 55K line DbSNP file (22Mb) takes 1.5 seconds on a
-Macbook PRO running 64-bits Linux (Ruby 2.1.0).
 ## Installation
+Note that you need Ruby 1.9.3 or later. The 2.x Ruby series also give
+a performance improvement. Bio-vcf will show the Ruby version when
+typing the command 'bio-vcf -h'.
+To intall bio-vcf with gem:
 ```sh
 gem install bio-vcf
+bio-vcf -h
 ```
 ## Command line interface (CLI)
@@ -192,7 +285,7 @@ Output
 ```ruby
   bio-vcf --filter 'rec.tumor.gq>30'
-    --eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq].join("\t")'
+    --eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq]'
     < file.vcf
 ```
@@ -322,7 +415,7 @@ ref should always be identical across samples.
 One clinical variant DbSNP example
 ```sh
-    bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN].join("\t")' < clinvar_20140303.vcf
+    bio-vcf --eval '[rec.id,rec.chr,rec.pos,rec.alt,rec.info.sao,rec.info.CLNDBN]' < clinvar_20140303.vcf
 ```
 renders
@@ -499,7 +592,32 @@ To remove/select 3 samples and create a new file:
 ## RDF output
-Use [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
+You can use --rdf for turtle RDF output, note the use of --id and
+--tags which includes the MAF record:
+```ruby
+bio-vcf --id evs --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.maf[0]/100 }' < EVS.vcf
+  :evs_ch9_139266496_T seq:chr "9" .
+  :evs_ch9_139266496_T seq:pos 139266496 .
+  :evs_ch9_139266496_T seq:alt T .
+  :evs_ch9_139266496_T db:vcf true .
+  :evs_ch9_139266496_T db:evs true .
+  :evs_ch9_139266496_T seq:freq 0.419801 .
+```
+It is possible to filter too! Pick out the rare variants with
+```ruby
+bio-vcf --id evs --filter 'r.info.maf[0]<5.0' --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.maf[0]/100 }' < EVS.vcf
+```
+Similarly for GoNL
+```ruby
+bio-vcf --id gonl --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
+```
+Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
 ## Other examples
@@ -534,6 +652,13 @@ what the command line interface uses (see ./bin/bio-vcf)
   end
 ```
+## Trouble shooting
+The multi-threading creates temporary files using the system TMPDIR.
+This behaviour can be overridden by setting the environment variable.
+Also, for genome-wide sequencing it may be useful to increase
+--thread-lines to a value larger than 1_000_000.
 ## Project home page
 Information on the source tree, documentation, examples, issues and
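
Note on the README hunk: the new text says that `.to_f` is needed to avoid integer division, and that the multi-allele form sums `s.ad` first. The sketch below reproduces that arithmetic in plain Ruby, outside bio-vcf. The `ad` array stands in for `s.ad` and uses the first sample at position 10257 from the README output; `non_ref_fraction` is a hypothetical helper name, not part of the gem.

```ruby
# Allele depths for one sample as [ref_depth, alt_depth]; values copied
# from the README output (position 10257, first sample). This array is a
# stand-in for what `s.ad` returns inside --seval.
ad = [151, 8]

# Integer / Integer truncates in Ruby, so the frequency collapses to 0.
int_freq = ad[1] / (ad[0] + ad[1])

# Converting one operand with to_f forces floating point division.
alt_freq = ad[1].to_f / (ad[0] + ad[1])

# Multi-allele form from the README: the fraction of reads that are not
# the reference allele. `non_ref_fraction` is a hypothetical name.
def non_ref_fraction(ad)
  tot = ad.reduce(:+)
  (tot - ad[0].to_f) / tot
end

puts int_freq
puts alt_freq
puts non_ref_fraction([151, 5, 3])
```

With two alleles both forms give the same value; with more alternate alleles only the second one counts all of them.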

data/VERSION CHANGED

@@ -1 +1 @@
-0.0.3
+0.7.0

data/bin/bio-vcf CHANGED

@@ -15,6 +15,9 @@ version = File.new(VERSION_FILENAME).read.chomp
 require 'bio-vcf'
 require 'optparse'
+require 'timeout'
+require 'fileutils'
+require 'tempfile'
 # Uncomment when using the bio-logger
 # require 'bio-logger'
@@ -23,7 +26,7 @@ require 'optparse'
 # Bio::Log::CLI.logger('stderr')
 # Bio::Log::CLI.trace('info')
-options = { show_help: false}
+options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000 }
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g.  #{File.basename($0)} < test/data/input/somaticsniper.vcf"
@@ -77,13 +80,28 @@ opts = OptionParser.new do |o|
     options[:rdf] = true
     options[:skip_header] = true
   end
+  o.on("--num-threads [num]", Integer, "Multi-core version") do |i|
+    options[:num_threads] = i
+  end
+  o.on("--thread-lines num", Integer, "Fork thread on num lines (default 100_000)") do |i|
+    options[:thread_lines] = i
+  end
   o.on_tail("--id name", String, "Identifier") do |s|
     options[:id] = s
   end
   o.on_tail("--tags list", String, "Add tags") do |s|
-    options[:tags] = eval(s)
+    options[:tags] = s
   end
+  o.on("--skip-header", "Do not output VCF header info") do
+    options[:skip_header] = true
+  end
+  o.on("--set-header list", Array, "Set a special tab delimited output header (#samples expands to sample names)") do |list|
+    options[:set_header] = list
+    options[:skip_header] = true
+  end
   # Uncomment the following when using the bio-logger
   # o.separator ""
   # o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -113,9 +131,44 @@ opts = OptionParser.new do |o|
   end
 end
+include BioVcf
+# Parse the header section of a VCF file
+def parse_header line, samples, options
+  header = VcfHeader.new
+  header.add(line)
+  print line if not options[:skip_header]
+  STDIN.each_line do | headerline |
+    if headerline !~ /^#/
+      line = headerline
+      break # end of header
+    end
+    header.add(headerline)
+    if not options[:skip_header]
+      if headerline =~ /^#CHR/
+        # The header before actual data contains the sample names, first inject the BioVcf meta information
+        print header.tag(options),"\n" if not options[:skip_header]
+        selected = header.column_names
+        if samples
+          newfields = selected[0..8]
+          samples.each do |s|
+            newfields << selected[s+9]
+          end
+          selected = newfields
+        end
+        print "#",selected.join("\t"),"\n"
+      else
+        print headerline
+      end
+    end
+  end
+  print header.printable_header_line(options[:set_header]),"\n" if options[:set_header]
+  VcfRdf::header if options[:rdf]
+  return header,line
+end
+# Parse a VCF line
 def parse_line line,header,options,samples
-  # fields = VcfLine.parse(line,header.columns)
   fields = VcfLine.parse(line)
   rec = VcfRecord.new(fields,header)
   r = rec # alias
@@ -124,27 +177,35 @@ def parse_line line,header,options,samples
   sfilter = options[:sfilter]
   efilter = options[:efilter]
   ifilter = options[:ifilter]
+  seval = options[:seval]
   ignore_missing = options[:ignore_missing]
   quiet = options[:quiet]
+  if sfilter or efilter or ifilter or seval
+    # check for samples
+    header_samples = header.column_names[9..-1]
+    raise "Empty sample list, can not execute query!" if not header_samples
+  end
   # --------------------------
   # Filtering and set analysis
-  return if filter and not rec.eval(filter,ignore_missing,quiet)
+  return if filter and not rec.filter(filter,ignore_missing,quiet)
   if sfilter
     rec.each_sample(options[:sfilter_samples]) do | sample |
-      return if not sample.eval(sfilter,ignore_missing,quiet)
+      return if not sample.sfilter(sfilter,ignore_missing,quiet)
     end
   end
   if ifilter
     rec.each_sample(options[:ifilter_samples]) do | sample |
-      return if not sample.eval(ifilter,ignore_missing,quiet)
+      return if not sample.ifilter(ifilter,ignore_missing,quiet)
     end
   end
   if efilter
     rec.each_sample(options[:efilter_samples]) do | sample |
-      return if not sample.eval(efilter,ignore_missing,quiet)
+      return if not sample.efilter(efilter,ignore_missing,quiet)
     end
   end
@@ -158,19 +219,19 @@ def parse_line line,header,options,samples
     end
     fields = newfields
   end
-  if options[:eval] or options[:seval]
+  if options[:eval] or seval
     begin
       results = nil # result string
       if options[:eval]
         res = rec.eval(options[:eval],ignore_missing,quiet)
         results = res if res
       end
-      if options[:seval]
+      if seval
         list = (results ? [] : [rec.chr,rec.pos])
         rec.each_sample(options[:sfilter_samples]) { | sample |
-          list << sample.eval(options[:seval],ignore_missing,quiet)
+          list << sample.eval(seval,ignore_missing,quiet)
         }
-        results = (results ? results + "\t" : "" ) + list.join("\t")
+        results = (results ? results.to_s + "\t" : "" ) + list.join("\t")
       end
     rescue => e
       $stderr.print "\nLine: ",line
@@ -183,23 +244,60 @@ def parse_line line,header,options,samples
   else
     if options[:rdf]
       # Output Turtle RDF
-      if not header_out
-        VcfRdf::header
-        header_out = true
-      end
       VcfRdf::record(options[:id],rec,options[:tags])
     elsif options[:rewrite]
       # Default behaviour prints VCF line, but rewrite info
       eval(options[:rewrite])
-      print (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t"),"\n"
+      print (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t")+"\n"
     else
       # Default behaviour prints VCF line
-      print fields.join("\t"),"\n"
+      $stdout.print fields.join("\t")+"\n"
+      $stdout.flush
+      return true
     end
   end
 end
-include BioVcf
+# Collect a buffer of lines and feed them to a thread
+# Returns the created pid, tempfilen and count_threads
+# (Note: this function should be turned into a closure)
+def parse_lines lines,header,options,samples,tempdir,count_threads
+  pid = nil
+  threadfilen = nil
+  if options[:num_threads]
+    lines2 = lines.map { |l| l.clone }
+    count_threads += 1
+    threadfilen = tempdir+sprintf("/%0.6d-pid",count_threads)+'.bio-vcf'
+    pid = fork do
+      count_lines = 0
+      tempfn = threadfilen+'.running'
+      STDOUT.reopen(File.open(tempfn, 'w+'))
+      lines2.each do | line |
+        count_lines +=1 if parse_line(line,header,options,samples)
+      end
+      STDOUT.flush
+      STDOUT.close
+      FileUtils::mv(tempfn,threadfilen)
+      exit 0
+    end
+    Process::detach(pid)
+  else
+    lines.each do | line |
+      parse_line line,header,options,samples
+    end
+  end
+  return pid,threadfilen,count_threads
+end
+# Make sure no more than num_threads are running at the same time
+def manage_thread_pool(workers, thread_list, num_threads)
+  while true
+    # ---- count running pids
+    running = thread_list.reduce(0) { | sum, thread_info | ( File.exist?(thread_info[1]+'.running') ? sum+1 : sum ) }
+    break if running < num_threads
+    sleep 0.1
+  end
+end
 opts.parse!(ARGV)
@@ -216,55 +314,99 @@ $stderr.print "Options: ",options,"\n" if !options[:quiet]
 if options[:samples]
   samples = options[:samples].map { |s| s.to_i }
 end
-header = VcfHeader.new
-header_out = false
+num_threads = options[:num_threads]
+num_threads = 8 if num_threads != nil and num_threads < 2
+header = nil
+header_output_completed = false
 line_number=0
+lines = []
+thread_list = []
+workers = []
+thread_lines = options[:thread_lines]
+count_threads=0
-STDIN.each_line do | line |
-  line_number += 1
-  $stderr.print '.' if line_number%100_000 == 0 and not options[:quiet]
-  begin
-    if line =~ /^##fileformat=/
-      # ---- We have a new file header
-      header = VcfHeader.new
-      header.add(line)
-      print line if not options[:skip_header]
-      STDIN.each_line do | headerline |
-        if headerline !~ /^#/
-          line = headerline
-          break # end of header
-        end
-        header.add(headerline)
-        if not options[:skip_header]
-          if headerline =~ /^#CHR/
-            selected = header.column_names
-            if samples
-              newfields = selected[0..8]
-              samples.each do |s|
-                newfields << selected[s+9]
-              end
-              selected = newfields
-            end
-            print "#",selected.join("\t"),"\n"
-          else
-            print headerline
+orig_std_out = STDOUT.clone
+Dir::mktmpdir("bio-vcf_") do |tempdir|
+  $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
+  # ---- Main loop
+  STDIN.each_line do | line |
+    line_number += 1
+    $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
+    begin
+      # ---- In this section header information is handled
+      next if header_output_completed and line =~ /^#/
+      if line =~ /^##fileformat=/ or line =~ /^#CHR/
+        header,line = parse_header(line,samples,options)
+      end
+      next if line =~ /^##/ # empty file
+      header_output_completed = true
+      if not options[:efilter_samples] and options[:ifilter_samples]
+        # Create exclude set as a complement of include set
+        options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
+      end
+      # ---- In this section the VCF variant lines are parsed
+      lines << line
+      if lines.size > thread_lines
+        manage_thread_pool(workers,thread_list,num_threads) if options[:num_threads]
+        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
+        count_threads = thread_list.last[2]
+        lines = []
+      end
+    rescue Exception => e
+      # $stderr.print line
+      $stderr.print e.message,"\n"
+      raise if options[:verbose]
+      exit 1
+    end
+  end
+  thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads)
+  count_threads = thread_list.last[2]
+  # ---- In this section the output gets collected and printed on STDOUT
+  if options[:num_threads]
+    STDOUT.reopen(orig_std_out)
+    $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
+    lines = []
+    fault = false
+    # Wait for the running threads to complete
+    thread_list.each do |info|
+      (pid,threadfn) = info
+      tempfn = threadfn + '.running'
+      $stderr.print "Waiting up to 3 minutes for pid=#{pid} to complete\n"
+      begin
+        Timeout.timeout(180) do
+          while not File.exist?(threadfn)  # wait for the result to appear
+            sleep 0.2
           end
         end
+        # Thread file should have gone:
+        raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
+        $stderr.print "OK pid=#{pid}\n"
+      rescue Timeout::Error
+        Process.kill 9, pid
+        Process.wait pid
+        $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
+        fault = true
       end
     end
-    next if line =~ /^##/ # empty file
-    if not options[:efilter_samples] and options[:ifilter_samples]
-      # Create exclude set as a complement of include set
-      options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
+    # Collate the output
+    thread_list.each do | info |
+      (pid,fn) = info
+      # This should never happen
+      raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
+      $stderr.print "Reading #{fn}\n"
+      File.new(fn).each_line { |buf|
+        print buf
+      }
+      File.unlink(fn)
     end
-    # ---- Parse VCF record line
-    parse_line line,header,options,samples
-  rescue Exception => e
-    # $stderr.print line
-    $stderr.print e.message,"\n"
-    raise if options[:verbose]
-    exit 1
+    return 1 if fault
   end
-end
+end  # cleans up tempdir
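
Note on the multi-threading hunks: despite the option name `--num-threads`, the new code forks child processes. Each child writes its chunk to a `.running` file inside a temporary directory and renames the file when done, and the parent then concatenates the finished files in order. The sketch below is a minimal stand-alone illustration of that write-then-rename pattern, not the gem's implementation. `run_chunk` and `process_in_chunks` are hypothetical names, upcasing a line stands in for `parse_line`, and the parent uses `Process.wait` where the diff uses `Process::detach` plus polling with a timeout. It needs a platform that supports `fork`.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: handle one chunk of lines in a child process.
# Returns the child's pid and the path where finished output appears.
def run_chunk(lines, tempdir, index)
  done_fn = File.join(tempdir, format('%06d.out', index))
  running_fn = done_fn + '.running'
  pid = fork do
    File.open(running_fn, 'w') do |f|
      lines.each { |l| f.print l.upcase } # stand-in for parse_line
    end
    # Rename only after the file is complete, so the existence of
    # done_fn means "this chunk finished".
    FileUtils.mv(running_fn, done_fn)
    # exit! skips at_exit handlers and ensure blocks in the child, so the
    # child does not run the parent's temporary directory cleanup.
    exit! 0
  end
  [pid, done_fn]
end

# Hypothetical driver: split the input, fork one child per chunk, then
# collate the per-chunk files in their original order.
def process_in_chunks(lines, chunk_size)
  output = []
  Dir.mktmpdir('chunks_') do |tempdir|
    jobs = lines.each_slice(chunk_size).each_with_index.map do |chunk, i|
      run_chunk(chunk, tempdir, i)
    end
    jobs.each do |pid, fn|
      Process.wait(pid)
      raise "child #{pid} left no output" unless File.exist?(fn)
      output.concat(File.readlines(fn))
    end
  end # the temporary directory is removed here
  output
end

result = process_in_chunks(%w[a b c d e].map { |s| s + "\n" }, 2)
print result.join
```

Because output is collated by chunk index and not by completion time, line order is preserved however the children are scheduled. This sketch starts every child at once; the diff additionally caps the number of concurrent children in `manage_thread_pool`.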