bio-vcf 0.7.3 → 0.8.0


Files changed (14)

checksums.yaml +4 -4
data/README.md +89 -4
data/VERSION +1 -1
data/bin/bio-vcf +138 -189
data/bio-vcf.gemspec +5 -2
data/features/cli.feature +1 -1
data/features/step_definitions/sfilter.rb +1 -1
data/lib/bio-vcf/vcfrdf.rb +82 -0
data/lib/bio-vcf/vcfsample.rb +2 -2
data/template/gatk_vcf2rdf.erb +35 -0
data/template/vcf2json.erb +8 -0
data/template/vcf2rdf.erb +12 -0
data/test/data/regression/thread4_4_failed_filter-stderr.ref +1 -1
metadata +5 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 72f63aae77e382c88e04cb07344ab3fce2a57232
-  data.tar.gz: 48f36f4d75d18edf3619124f0b679706a002b646
+  metadata.gz: da9c14380c66a497089fef836a22d44c7651f264
+  data.tar.gz: 3291080afde13a1b7cd392d4f1f06ff275ec1f1b
 SHA512:
-  metadata.gz: a1d1513454924e1d84bb9aecf1bd83cf0d8784e3ab8669bb7777129f2a8b33df53a2118d4fea15a825ce1c399de05cf729e0f7b5e7acd35367856a6a1821f328
-  data.tar.gz: 7b40ffdad49cbb690cfaa4d02e7a060095dc13aa4533f091f1f20b2f2b1903c71d2f6c6c5c5f672dc3be4da751305c2583265dbe7d86308501ab0219cca9e414
+  metadata.gz: f1584a14d8de45dc04115ae6458f3d1b6e9c69c9d19092dfefc18e411fc347d660ed1ca586e3e3b4327d4774e323a78eabe3434a4cd8eab7d87b1f4e37915c0e
+  data.tar.gz: d0a56668f17a272807167bde9caa066ba960ac597b9ce57a205fb2564b9944421e5ef4035f43ab855f72ab2c9ece6b963ef01043372db5601520f5593a1fe31b

data/README.md CHANGED

@@ -4,7 +4,8 @@
 A new generation VCF parser. Bio-vcf is not only fast for genome-wide
 (WGS) data, it also comes with a really nice filtering, evaluation and
-rewrite language. Why would you use bio-vcf over other parsers?
+rewrite language and it can output any type of textual data, including
+RDF and JSON. Why would you use bio-vcf over other parsers?
 1. Bio-vcf is fast and scales on multi-core computers
 2. Bio-vcf has an expressive filtering and evaluation language
@@ -15,7 +16,7 @@ rewrite language. Why would you use bio-vcf over other parsers?
 7. Bio-vcf allows for genotype processing
 8. Bio-vcf has support for set analysis
 9. Bio-vcf has sane error handling
-10. Bio-vcf can output tabular data, HTML, LaTeX, RDF and (soon) JSON
+10. Bio-vcf can output tabular data, HTML, LaTeX, RDF, JSON and JSON-LD using templates
 Bio-vcf has better performance than other tools
 because of lazy parsing, multi-threading, and useful combinations of
@@ -233,6 +234,12 @@ gem install bio-vcf
 bio-vcf -h
 ```
+For multi-core also install the parallel gem
+```sh
+gem install parallel
+```
 ## Command line interface (CLI)
 Get the version of the VCF file
@@ -628,7 +635,7 @@ To remove/select 3 samples:
 ## RDF output
-You can use --rdf for turtle RDF output, note the use of --id and
+You can use --rdf for turtle RDF output from simple one-liners, note the use of --id and
 --tags which includes the MAF record:
 ```ruby
@@ -641,6 +648,8 @@ bio-vcf --id evs --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.maf[0]/
   :evs_ch9_139266496_T seq:freq 0.419801 .
 ```
+Also check out the more powerful templating system below.
 It is possible to filter too! Pick out the rare variants with
 ```ruby
@@ -660,9 +669,85 @@ or without AF
 bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => (rec.info.ac.to_f/rec.info.an).round(2) }' < gonl_germline_overlap_r4.vcf
 ```
+Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
+## Templates
+To have more output options blastxmlparser can use an [ERB
+template](http://www.stuartellis.eu/articles/erb/) for every match. This is a
+very flexible option that can output textual formats such as JSON, YAML, HTML
+and RDF. Examples are provided in
+[./templates](https://github.com/pjotrp/bioruby-vcf/templates/). A JSON
+template could be
-Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
+```Javascript
+{
+  "seq:chr": "<%= rec.chrom %>" ,
+  "seq:pos": <%= rec.pos %> ,
+  "seq:ref": "<%= rec.ref %>" ,
+  "seq:alt": "<%= rec.alt[0] %>" ,
+  "seq:maf": <%= rec.info.maf[0] %> ,
+  "dp":      <%= rec.info.dp %> ,
+};
+```
+To get JSON, run with something like
+```sh
+  bio-vcf --template template/vcf2json.erb --filter 'r.info.maf[0]<0.01' < dbsnp.vcf
+```
+which renders
+```Javascript
+{
+  "seq:chr": "13" ,
+  "seq:pos": 35745475 ,
+  "seq:ref": "C" ,
+  "seq:alt": "T" ,
+  "seq:maf": 0.0151 ,
+  "dp":      86 ,
+};
+```
+Likewise for RDF output:
+```sh
+  bio-vcf --template template/vcf2rdf.erb --filter 'r.info.maf[0]<0.01' < dbsnp.vcf
+```
+renders the ERB template
+```ruby
+<%
+  id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
+%>
+:<%= id %>
+  :query_id    "<%= id %>",
+  seq:chr      "<%= rec.chrom %>" ,
+  seq:pos      <%= rec.pos %> ,
+  seq:ref      "<%= rec.ref %>" ,
+  seq:alt      "<%= rec.alt[0] %>" ,
+  seq:maf      <%= rec.info.maf[0] %> ,
+  seq:dp       <%= rec.info.dp %> ,
+  db:vcf       true .
+```
+into
+```
+:ch13_33703698_A
+  :query_id    "ch13_33703698_A",
+  seq:chr      "13" ,
+  seq:pos      33703698 ,
+  seq:ref      "C" ,
+  seq:alt      "A" ,
+  seq:maf      0.1567 ,
+  seq:dp       92 ,
+  db:vcf       true .
+```
+Be creative! You can write templates for csv, HTML, XML, LaTeX, RDF, JSON, YAML, JSON-LD, etc. etc.!
 ## Statistics
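
Note on the README hunk above: the JSON template ends each record with a trailing comma and `};`, which strict JSON parsers reject. The following is a minimal sketch, not the shipped template, of an ERB template that builds a Hash and lets Ruby's `json` library handle quoting and separators. The `rec_struct` and `info_struct` objects are hypothetical stand-ins for bio-vcf's record objects, and the sample values are copied from the README output.

```ruby
require 'erb'
require 'json'

# Hypothetical stand-ins for a bio-vcf record and its INFO fields.
info_struct = Struct.new(:maf, :dp)
rec_struct  = Struct.new(:chrom, :pos, :ref, :alt, :info)

# The template serializes a Hash, so commas and string quoting are
# produced by JSON.generate rather than written by hand.
template = ERB.new(<<~'TPL')
  <%= JSON.generate({
        "seq:chr" => rec.chrom,
        "seq:pos" => rec.pos,
        "seq:ref" => rec.ref,
        "seq:alt" => rec.alt[0],
        "seq:maf" => rec.info.maf[0],
        "dp"      => rec.info.dp
      }) %>
TPL

# The template is evaluated against a binding in which `rec` is visible,
# similar in spirit to template.result(binding) in bin/bio-vcf.
render = ->(rec) { template.result(binding) }

rec = rec_struct.new("13", 35745475, "C", ["T"], info_struct.new([0.0151], 86))
json_line = render.call(rec).strip
puts json_line
```

Each rendered record is then a standalone JSON object that can be parsed line by line.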

data/VERSION CHANGED

@@ -1 +1 @@
-0.7.3
+0.8.0

data/bin/bio-vcf CHANGED

@@ -1,7 +1,8 @@
 #!/usr/bin/env ruby
 #
-# BioRuby vcf plugin
+# bio-vcf parser and transformer
 # Author:: Pjotr Prins
+# License:: MIT
 #
 # Copyright (C) 2014 Pjotr Prins <pjotr.prins@thebird.nl>
@@ -17,7 +18,6 @@ require 'bio-vcf'
 require 'optparse'
 require 'timeout'
 require 'fileutils'
-require 'tempfile'
 # Uncomment when using the bio-logger
 # require 'bio-logger'
@@ -26,7 +26,7 @@ require 'tempfile'
 # Bio::Log::CLI.logger('stderr')
 # Bio::Log::CLI.trace('info')
-options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 100_000, num_threads: 4 }
+options = { show_help: false, source: 'https://github.com/CuppenResearch/bioruby-vcf', version: version+' (Pjotr Prins)', date: Time.now.to_s, thread_lines: 40_000 }
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\ne.g.  #{File.basename($0)} < test/data/input/somaticsniper.vcf"
@@ -75,7 +75,7 @@ opts = OptionParser.new do |o|
   o.on("--samples list", Array, "Output selected samples") do |l|
     options[:samples] = l
   end
-  o.on("--rdf", "Generate Turtle RDF") do |b|
+  o.on("--rdf", "Generate Turtle RDF (also check out --template!)") do |b|
     require 'bio-vcf/vcfrdf'
     options[:rdf] = true
     options[:skip_header] = true
@@ -101,6 +101,14 @@ opts = OptionParser.new do |o|
     options[:set_header] = list
     options[:skip_header] = true
   end
+  o.on("-t erb","--template erb",String, "Use ERB template for output") do |s|
+    require 'bio-vcf/vcfrdf'
+    require 'erb'
+    options[:template] = s
+    options[:skip_header] = true
+  end
   # Uncomment the following when using the bio-logger
   # o.separator ""
@@ -135,6 +143,53 @@ opts = OptionParser.new do |o|
   end
 end
+opts.parse!(ARGV)
+$stderr.print "vcf #{version} (biogem Ruby #{RUBY_VERSION}) by Pjotr Prins 2014\n" if !options[:quiet]
+if options[:show_help]
+  print opts
+  print USAGE
+  exit 1
+end
+if RUBY_VERSION =~ /^1/
+  $stderr.print "WARNING: bio-vcf runs on Ruby 2.x only\n"
+end
+$stderr.print "Options: ",options,"\n" if !options[:quiet]
+if options[:template]
+  include BioVcf::RDF
+  fn = options[:template]
+  raise "No template #{fn}!" if not File.exist?(fn)
+  template = ERB.new(File.read(fn))
+end
+if options[:num_threads] != 1
+  begin
+    require 'parallel'
+  rescue LoadError
+    $stderr.print "Error: Missing 'parallel' module. Install with command 'gem install parallel' if you want multiple threads\n"
+    options[:num_threads] = 1
+  end
+end
+stats = nil
+if options[:statistics]
+  options[:num_threads] = nil
+  stats = BioVcf::VcfStatistics.new
+end
+# Check for option combinations
+raise "Missing option --ifilter" if options[:ifilter_samples] and not options[:ifilter]
+raise "Missing option --efilter" if options[:efilter_samples] and not options[:efilter]
+raise "Missing option --sfilter" if options[:sfilter_samples] and not options[:sfilter]
+if options[:samples]
+  samples = options[:samples].map { |s| s.to_i }
+end
 include BioVcf
 # Parse the header section of a VCF file
@@ -171,8 +226,8 @@ def parse_header line, samples, options
   return header,line
 end
-# Parse a VCF line
-def parse_line line,header,options,samples,stats=nil
+# Parse a VCF line and return the result as a string
+def parse_line line,header,options,samples,template,stats=nil
   fields = VcfLine.parse(line)
   rec = VcfRecord.new(fields,header)
   r = rec # alias
@@ -251,216 +306,110 @@ def parse_line line,header,options,samples,stats=nil
       raise if options[:verbose]
       exit 1
     end
-    print results,"\n" if results
-    exit(1) if options[:eval_once]
+    return results.to_s+"\n" if results
+    exit(1) if options[:eval_once]  # <--- can this be reached?
   else
     if options[:rdf]
       # Output Turtle RDF
       VcfRdf::record(options[:id],rec,options[:tags])
+    elsif options[:template]
+      # Ruby ERB template
+      begin
+        template.result(binding)
+      rescue Exception => e
+        $stderr.print e,": ",fields,"\n"
+        $stderr.print e.backtrace.inspect if options[:verbose]
+        raise
+      end
     elsif options[:rewrite]
       # Default behaviour prints VCF line, but rewrite info
       eval(options[:rewrite])
-      print (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t")+"\n"
+      (fields[0..6]+[rec.info.to_s]+fields[8..-1]).join("\t")+"\n"
     elsif stats
       # do nothing
     else
       # Default behaviour prints VCF line
-      $stdout.print fields.join("\t")+"\n"
-      $stdout.flush
-      return true
+      fields.join("\t")+"\n"
     end
   end
 end
-# Collect a buffer of lines and feed them to a thread
-# Returns the created pid, tempfilen and count_threads
-# (Note: this function should be turned into a closure)
-def parse_lines lines,header,options,samples,tempdir,count_threads,stats
-  pid = nil
-  threadfilen = nil
-  if options[:num_threads]
-    count_threads += 1
-    threadfilen = tempdir+sprintf("/%0.6d-pid",count_threads)+'.bio-vcf'
-    pid = fork do
-      count_lines = 0
-      tempfn = threadfilen+'.running'
-      STDOUT.reopen(File.open(tempfn, 'w+'))
-      lines.each do | line |
-        count_lines +=1 if parse_line(line,header,options,samples)
-      end
-      STDOUT.flush
-      STDOUT.close
-      FileUtils::mv(tempfn,threadfilen)
-      exit 0
-    end
-  else
-    lines.each do | line |
-      parse_line line,header,options,samples,stats
-    end
-  end
-  return pid,threadfilen,count_threads
-end
-# Make sure no more than num_threads are running at the same time
-def manage_thread_pool(workers, thread_list, num_threads)
-  while true
-    # ---- count running pids
-    running = thread_list.reduce(0) do | sum, thread_info |
-      if thread_info[0] && pid_running?(thread_info[0])
-        sum+1
-      elsif  nil == thread_info[0] && File.exist?(thread_info[1]+'.running')
-        sum+1
-      else
-        sum
-      end
-    end
-    break if running < num_threads
-    sleep 0.1
-  end
-end
-def pid_running?(pid)
-  begin
-    fpid,status=Process.waitpid2(pid,Process::WNOHANG)
-  rescue Errno::ECHILD, Errno::ESRCH
-    return false
-  end
-  return true if nil == fpid && nil == status
-  return ! (status.exited? || status.signaled?)
-end
-opts.parse!(ARGV)
-$stderr.print "vcf #{version} (biogem Ruby #{RUBY_VERSION}) by Pjotr Prins 2014\n" if !options[:quiet]
-if options[:show_help]
-  print opts
-  print USAGE
-  exit 1
-end
-if RUBY_VERSION =~ /^1/
-  $stderr.print "WARNING: bio-vcf runs on Ruby 2.x only\n"
-end
-$stderr.print "Options: ",options,"\n" if !options[:quiet]
-stats = nil
-if options[:statistics]
-  options[:num_threads] = nil
-  stats = BioVcf::VcfStatistics.new
-end
-# Check for option combinations
-raise "Missing option --ifilter" if options[:ifilter_samples] and not options[:ifilter]
-raise "Missing option --efilter" if options[:efilter_samples] and not options[:efilter]
-raise "Missing option --sfilter" if options[:sfilter_samples] and not options[:sfilter]
-if options[:samples]
-  samples = options[:samples].map { |s| s.to_i }
-end
-num_threads = options[:num_threads]
-num_threads = 8 if num_threads != nil and num_threads < 2
 header = nil
 header_output_completed = false
-line_number=0
+NUM_THREADS = options[:num_threads]
+CHUNK_SIZE = options[:thread_lines]
+CHUNK_NUM = (NUM_THREADS && NUM_THREADS>6 ? NUM_THREADS*4 : 24)
+chunks = []
 lines = []
-thread_list = []
-workers = []
-thread_lines = options[:thread_lines]
-count_threads=0
-orig_std_out = STDOUT.clone
+line_number=0
 begin
-  Dir::mktmpdir("bio-vcf_") do |tempdir|
-    $stderr.print "Using #{tempdir} for temporary files\n" if num_threads
-    # ---- Main loop
-    STDIN.each_line do | line |
-      line_number += 1
-      $stderr.print '.' if line_number % thread_lines == 0 and not options[:quiet]
-      # ---- In this section header information is handled
-      next if header_output_completed and line =~ /^#/
-      if line =~ /^##fileformat=/ or line =~ /^#CHR/
-        header,line = parse_header(line,samples,options)
-      end
-      next if line =~ /^##/ # empty file
-      header_output_completed = true
-      if not options[:efilter_samples] and options[:ifilter_samples]
-        # Create exclude set as a complement of include set
-        options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
-      end
-      # ---- In this section the VCF variant lines are parsed
-      lines << line
-      if lines.size > thread_lines
-        manage_thread_pool(workers,thread_list,num_threads) if options[:num_threads]
-        thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
-        count_threads = thread_list.last[2]
-        lines = []
-      end
+  process = lambda { | lines |
+    res = []
+    lines.each do | line |
+      res << parse_line(line,header,options,samples,template,stats)
+    end
+    res
+  }
+  output = lambda { |collection|
+    collection.each do | result |
+      result.each { |line| print line }
+    end
+  } # end output
+  # ---- Main loop
+  STDIN.each_line do | line |
+    line_number += 1
+    # ---- In this section header information is handled
+    next if header_output_completed and line =~ /^#/
+    if line =~ /^##fileformat=/ or line =~ /^#CHR/
+      header,line = parse_header(line,samples,options)
+    end
+    next if line =~ /^##/ # empty file
+    header_output_completed = true
+    if not options[:efilter_samples] and options[:ifilter_samples]
+      # Create exclude set as a complement of include set
+      options[:efilter_samples] = header.column_names[9..-1].fill{|i|i.to_s}-options[:ifilter_samples]
     end
-    thread_list << parse_lines(lines,header,options,samples,tempdir,count_threads,stats)
-    count_threads = thread_list.last[2]
-    # ---- In this section the output gets collected and printed on STDOUT
-    if options[:num_threads]
-      STDOUT.reopen(orig_std_out)
-      $stderr.print "Final pid=#{thread_list.last[0]}, size=#{lines.size}\n"
-      lines = []
-      fault = false
-      # Wait for the running threads to complete
-      thread_list.each do |info|
-        (pid,threadfn) = info
-        tempfn = threadfn + '.running'
-        timeout = 180
-        if (pid && !pid_running?(pid)) || fault
-          # no point to wait for a long time if we've failed one already or the proc is dead
-          timeout = 1
-        end
-        $stderr.print "Waiting up to #{timeout/60} minutes for pid=#{pid} to complete\n"
-        begin
-          Timeout.timeout(timeout) do
-            while not File.exist?(threadfn)  # wait for the result to appear
-              sleep 0.2
-            end
-          end
-          # Thread file should have gone:
-          raise "FATAL: child process appears to have crashed #{tempfn}" if File.exist?(tempfn)
-          $stderr.print "OK pid=#{pid}\n"
-        rescue Timeout::Error
-          if pid_running?(pid)
-            Process.kill 9, pid
-            Process.wait pid
-          end
-          $stderr.print "FATAL: child process killed because it stopped responding, pid = #{pid}\n"
-          fault = true
-        end
+    # ---- In this section the VCF variant lines are parsed
+    lines << line
+    if NUM_THREADS == 1
+      $stderr.print '.' if line_number % CHUNK_SIZE == 0 and not options[:quiet]
+      if lines.size > CHUNK_SIZE
+        process.call(lines).each { | l | print l }
+        lines = []
       end
-      # Collate the output
-      thread_list.each do | info |
-        (pid,fn) = info
-        if !fault
-          # This should never happen
-          raise "FATAL: child process output #{fn} is missing" if not File.exist?(fn)
-          $stderr.print "Reading #{fn}\n"
-          File.new(fn).each_line { |buf|
-            print buf
+    else
+      if lines.size > CHUNK_SIZE
+        chunks << lines
+        if chunks.size > CHUNK_NUM
+          $stderr.print '.' if not options[:quiet]
+          out = Parallel.map(chunks, :in_processes => NUM_THREADS) { | chunk |
+            process.call(chunk)
           }
-          File.unlink(fn)
+          chunks = []
+          # Output is forked to a separate process too
+          fork do
+            output.call out
+            STDOUT.flush
+            STDOUT.close
+            exit 0
+          end
         end
-        Process.wait(pid) if pid && pid_running?(pid)
+        lines = []
       end
-      return 1 if fault
     end
-  end  # cleans up tempdir
+  end
+  $stderr.print '.' if not options[:quiet]
+  if NUM_THREADS == 1
+    process.call(lines).each { |l| print l}
+  else
+    chunks << lines
+    output.call Parallel.map(chunks, :in_processes => NUM_THREADS) { | chunk |
+      process.call(chunk)
+    }
+  end
   stats.print if stats
 rescue Exception => e
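
Note on the rewritten main loop above: it groups input lines into chunks, maps a processing lambda over the chunks, and prints the returned strings in input order. The sketch below shows only that shape. It uses a plain `map` so it runs without extra gems; with the parallel gem installed, `Parallel.map(chunks, in_processes: n)` can take its place and also returns results in input order. The `process_line` lambda and the sample data are hypothetical stand-ins for `parse_line` and VCF input.

```ruby
# Stand-in for parse_line: returns an output string for a line, or nil
# when the line is filtered out (hypothetical, for illustration only).
process_line = ->(line) { line.start_with?('skip') ? nil : line.upcase }

chunk_size = 2 # plays the role of --thread-lines
lines = %w[a b skip c d e].map { |s| s + "\n" }

# Group the lines into chunks; each chunk is an independent unit of work.
chunks = lines.each_slice(chunk_size).to_a

# Sequential map as a dependency-free stand-in for Parallel.map.
results = chunks.map do |chunk|
  chunk.map { |line| process_line.call(line) }
end

# Flatten in order and drop filtered (nil) lines before printing.
output = results.flatten.compact.join
print output
```

Because each worker returns strings instead of printing, output order depends only on chunk order, not on which worker finishes first.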

data/bio-vcf.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "bio-vcf"
-  s.version = "0.7.3"
+  s.version = "0.8.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Pjotr Prins"]
-  s.date = "2014-09-01"
+  s.date = "2014-09-19"
   s.description = "Smart lazy multi-threaded parser for VCF format with useful filtering and output rewriting"
   s.email = "pjotr.public01@thebird.nl"
   s.executables = ["bio-vcf"]
@@ -50,6 +50,9 @@ Gem::Specification.new do |s|
     "lib/bio-vcf/vcfrecord.rb",
     "lib/bio-vcf/vcfsample.rb",
     "lib/bio-vcf/vcfstatistics.rb",
+    "template/gatk_vcf2rdf.erb",
+    "template/vcf2json.erb",
+    "template/vcf2rdf.erb",
     "test/data/input/dbsnp.vcf",
     "test/data/input/multisample.vcf",
     "test/data/input/somaticsniper.vcf",

data/features/cli.feature CHANGED

@@ -51,6 +51,6 @@ Feature: Command-line interface (CLI)
   Scenario: Test deadlock on failed filter with threads
     Given I have input file(s) named "test/data/input/multisample.vcf"
-    When I execute "./bin/bio-vcf -i --num-threads 4 --thread-lines 4 --filter 't.info.dp>2'"
+    When I execute "./bin/bio-vcf --num-threads 4 --thread-lines 4 --filter 't.info.dp>2'"
     Then I expect an error and the named output to match the named output "thread4_4_failed_filter" in under 30 seconds

data/features/step_definitions/sfilter.rb CHANGED

@@ -117,7 +117,7 @@ When(/^I evaluate empty '\.\/\.' with ignore missing$/) do
 end
 Then(/^I expect s\.what\? to throw an error$/) do
-  expect { @s.eval("s.what?",do_cache: false) }.to raise_error RuntimeError
+  expect { @s.eval("s.what?",do_cache: false) }.to raise_error NoMethodError
 end
 Then(/^I expect s\.what to throw an error$/) do

data/lib/bio-vcf/vcfrdf.rb CHANGED

@@ -1,6 +1,9 @@
 module BioVcf
   # This is some primarily RDF support - which may be moved to another gem
+  #
+  # Note that this functionality is superceded by the --template command! Though
+  # this can be useful for one-liners.
   module VcfRdf
@@ -34,4 +37,83 @@ OUT
       print "\n"
     end
   end
+# RDF support module. Original is part of bioruby-rdf by Pjotr Prins
+#
+  module RDF
+    def RDF::valid_uri? uri
+      uri =~ /^([!#$&-;=?_a-z~]|%[0-9a-f]{2})+$/i
+    end
+    def RDF::escape_string_literal(literal)
+      s = literal.to_s
+      # Put a slash before every double quote if there is no such slash already
+      s = s.gsub(/(?<!\\)"/,'\"')
+      # Put a slash before a single slash if it is not \["utnr>\]
+      if s =~ /[^\\]\\[^\\]/
+        s2 = []
+        s.each_char.with_index { |c,i|
+          res = c
+          if i>0 and c == '\\' and s[i-1] != '\\' and s[i+1] !~ /^[uUtnr\\"]/
+            res = '\\' + c
+          end
+          # p [i,c,s[i+1],res]
+          s2 << res
+        }
+        s = s2.join('')
+      end
+      s
+    end
+    def RDF::stringify_literal(literal)
+      RDF::escape_string_literal(literal.to_s)
+    end
+    def RDF::quoted_stringify_literal(literal)
+      '"' + stringify_literal(literal) + '"'
+    end
+  end
+  module Turtle
+    def Turtle::stringify_literal(literal)
+      RDF::stringify_literal(literal)
+    end
+    def Turtle::identifier(id)
+      raise "Illegal identifier #{id}" if id != Turtle::mangle_identifier(id)
+    end
+    # Replace letters/symbols that are not allowed in a Turtle identifier
+    # (short hand URI). This should be the definite mangler and replace the
+    # ones in bioruby-table and bio-exominer. Manglers are useful when using
+    # data from other sources and trying to transform them into simple RDF
+    # identifiers.
+    def Turtle::mangle_identifier(s)
+      id = s.strip.gsub(/[^[:print:]]/, '').gsub(/[#)(,]/,"").gsub(/[%]/,"perc").gsub(/(\s|\.|\$|\/|\\|\>)+/,"_")
+      id = id.gsub(/\[|\]/,'')
+      # id = URI::escape(id)
+      id = id.gsub(/\|/,'_')
+      id = id.gsub(/\-|:/,'_')
+      if id != s
+        # Don't want Bio depency in templates!
+        # logger = Bio::Log::LoggerPlus.new 'bio-rdf'
+        # logger.warn "\nWARNING: Changed identifier <#{s}> to <#{id}>"
+        # $stderr.print "\nWARNING: Changed identifier <#{s}> to <#{id}>"
+      end
+      if not RDF::valid_uri?(id)
+        raise "Invalid URI after mangling <#{s}> to <#{id}>!"
+      end
+      valid_id = if id =~ /^\d/
+                   'r' + id
+                 else
+                   id
+                 end
+      valid_id  # we certainly hope so!
+    end
+  end
 end
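
Note on `RDF::escape_string_literal` above: its first step uses a negative lookbehind so that only double quotes not already preceded by a backslash get escaped. The sketch below isolates that one step (same regex and replacement as in the hunk) to show that applying it twice does not double-escape. The lambda name is an assumption made for illustration.

```ruby
# Escape double quotes that are not already preceded by a backslash.
escape_quotes = ->(s) { s.gsub(/(?<!\\)"/, '\"') }

plain    = 'say "hi"'
expected = 'say \"hi\"' # single-quoted: backslash and quote are kept literally

once  = escape_quotes.call(plain)
twice = escape_quotes.call(once) # already-escaped quotes are left alone

puts once
```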

data/lib/bio-vcf/vcfsample.rb CHANGED

@@ -3,7 +3,7 @@ module BioVcf
     # Check whether a sample is empty (on the raw string value)
     def VcfSample::empty? s
-      s == './.' or s == '' or s == nil
+      s==nil or s == './.' or s == '' or s[0..2]=='./.'
     end
     class Sample
@@ -77,7 +77,7 @@ module BioVcf
       def fetch_values name
         n = @format[name]
-        raise "Unknown sample field <#{name}>" if not n
+        raise NoMethodError.new("Unknown sample field <#{name}>") if not n
         @values[n]  # <-- save names with upcase!
       end
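
Note on the `VcfSample::empty?` change above: the new predicate also treats a sample whose genotype is `./.` followed by further fields as empty, and it checks `nil` first so that the slice is never called on `nil`. A small sketch comparing the two predicates follows; the lambdas mirror the old and new expressions, and the sample strings are made up for illustration.

```ruby
# Old and new predicates, written as lambdas mirroring the diff.
old_empty = ->(s) { s == './.' || s == '' || s.nil? }
new_empty = ->(s) { s.nil? || s == './.' || s == '' || s[0..2] == './.' }

samples = [nil, '', './.', './.:0,0:0', '0/1:10,5:15']

old_flags = samples.map { |s| old_empty.call(s) }
new_flags = samples.map { |s| new_empty.call(s) }

samples.each_with_index do |s, i|
  puts "#{s.inspect}: old=#{old_flags[i]} new=#{new_flags[i]}"
end
```

Only the `./.:0,0:0` case differs between the two versions.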

data/template/gatk_vcf2rdf.erb ADDED

@@ -0,0 +1,35 @@
+<%
+  id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
+  sample_num = 0
+%>
+:<%= id %>
+  :query_id    "<%= id %>";
+  seq:chr      "<%= rec.chrom %>" ;
+  seq:pos      <%= rec.pos %> ;
+  seq:ref      "<%= rec.ref %>" ;
+  seq:alt      "<%= rec.alt[0] %>" ;
+  db:gatk      true .
+<% rec.each_sample do | s | %>
+  <% if not s.empty?
+    sample_name = header.samples[sample_num]
+    sample_id = id + '_' + Turtle::mangle_identifier(sample_name)
+    sample_num += 1
+    if s.ad[0]+s.ad[1] != 0
+      alt_bias = (s.ad[1].to_f/(s.ad[0]+s.ad[1])).round(2)
+    end
+  %>
+:<%= sample_id %>
+  :call_id     :<%= id %> ;
+  sample:name  "<%= sample_name  %>" ;
+  sample:gt    "<%= s.gt %>" ;
+  <% s.gti.each do | index | %>
+  sample:ad<%= index %>    <%= s.ad[index] %> ;
+  sample:gts<%= index %>   "<%= s.gts[index] %>" ;
+  <% end %>
+  sample:dp        <%= s.dp %> ;
+  sample:alt_bias  <%= alt_bias %> .
+  <% end %>
+<% end %>
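
Note on `alt_bias` in the template above: it is only assigned when the two allele depths sum to a non-zero value, so for a zero-depth sample the `sample:alt_bias` line would presumably render with an empty value. The sketch below shows the same calculation as a small helper that makes the zero-depth case explicit by returning `nil`. The helper name and the sample depths are assumptions, not part of the gem.

```ruby
# Fraction of reads supporting the first ALT allele, rounded to 2 places.
# Returns nil when there is no coverage, so the caller can skip the field.
alt_bias = lambda do |ad|
  total = ad[0] + ad[1]
  total.zero? ? nil : (ad[1].to_f / total).round(2)
end

covered   = alt_bias.call([10, 5]) # 5 of 15 reads support ALT
uncovered = alt_bias.call([0, 0])  # no reads at all

puts covered.inspect
puts uncovered.inspect
```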

data/template/vcf2json.erb ADDED

@@ -0,0 +1,8 @@
+{
+  "seq:chr": "<%= rec.chrom %>" ,
+  "seq:pos": <%= rec.pos %> ,
+  "seq:ref": "<%= rec.ref %>" ,
+  "seq:alt": "<%= rec.alt[0] %>" ,
+  "seq:maf": <%= rec.info.maf[0] %> ,
+  "dp":      <%= rec.info.dp %> ,
+};

data/template/vcf2rdf.erb ADDED

@@ -0,0 +1,12 @@
+<%
+  id = Turtle::mangle_identifier(['ch'+rec.chrom,rec.pos,rec.alt.join('')].join('_'))
+%>
+:<%= id %>
+  :query_id    "<%= id %>";
+  seq:chr      "<%= rec.chrom %>" ;
+  seq:pos      <%= rec.pos %> ;
+  seq:ref      "<%= rec.ref %>" ;
+  seq:alt      "<%= rec.alt[0] %>" ;
+  seq:dp       <%= rec.info.dp %> ;
+  db:vcf       true .

data/test/data/regression/thread4_4_failed_filter-stderr.ref CHANGED

@@ -1 +1 @@
-unexpected return
+Error: Missing 'parallel' module. Install with command 'gem install parallel' if you want multiple threads

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: bio-vcf
 version: !ruby/object:Gem::Version
-  version: 0.7.3
+  version: 0.8.0
 platform: ruby
 authors:
 - Pjotr Prins
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-09-01 00:00:00.000000000 Z
+date: 2014-09-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -108,6 +108,9 @@ files:
 - lib/bio-vcf/vcfrecord.rb
 - lib/bio-vcf/vcfsample.rb
 - lib/bio-vcf/vcfstatistics.rb
+- template/gatk_vcf2rdf.erb
+- template/vcf2json.erb
+- template/vcf2rdf.erb
 - test/data/input/dbsnp.vcf
 - test/data/input/multisample.vcf
 - test/data/input/somaticsniper.vcf