RubyGems - viral_seq - Versions diffs - 1.0.9 → 1.0.14 - Mend

viral_seq 1.0.9 → 1.0.14

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

checksums.yaml +4 -4
data/Gemfile.lock +1 -1
data/README.md +67 -32
data/bin/tcs +78 -143
data/lib/viral_seq.rb +3 -0
data/lib/viral_seq/constant.rb +5 -1
data/lib/viral_seq/enumerable.rb +0 -10
data/lib/viral_seq/hivdr.rb +1 -1
data/lib/viral_seq/math.rb +3 -3
data/lib/viral_seq/sdrm.rb +43 -0
data/lib/viral_seq/seq_hash.rb +15 -8
data/lib/viral_seq/seq_hash_pair.rb +6 -0
data/lib/viral_seq/tcs_core.rb +332 -0
data/lib/viral_seq/tcs_json.rb +178 -0
data/lib/viral_seq/version.rb +2 -2
metadata +6 -5
data/bin/tcs_json_generator +0 -166

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4921d3609d6ffc7fd6fbafd7a4a86e5818d47ed855393addd68b20f28b9d214f
-  data.tar.gz: a9e18c01b287885f8f6238343d9633a52d4ae5ea061347e73bd4f3e86788b2a4
+  metadata.gz: '048e85ab67fbb667919d02d4509a15111798b116b3f927c921d203dc8565a1a2'
+  data.tar.gz: 6951e410bd4f9b727a44fab1aa88f9cc263151cf9aed2a9c25ae9d866ed72450
 SHA512:
-  metadata.gz: dd21b57e17751f6c3e475f05b7a565d295ac7592b7c02f8d89ed49192834bee444f08ee9ebf48e41922c8caaf37a03651d5d0c9aa89d97ccc2edb9aad8224d5f
-  data.tar.gz: d1162424ea877d9839c179cacc330c81cd3508fcff07b64a1e753c7c706485d1dcb9a6b60aec9ce02ed33b91bbd4386ed58329c17e247ba086e7d81ed107bfd4
+  metadata.gz: 02cc87e245918a5c8f1b16b0db978da66e3bf7e83c6c6140c394c560d31c86ab1845b337a4f53b4b7883ff8e452e8caef0b036ea113c4416d6a29d16f419eb81
+  data.tar.gz: 9f53bd6c46f4a49b5c14b8b8019ffa3e1abcf442bf0a6cc09a7dbc768a474f2afd81d5cf168f38eb8ab5abbd601c60f1f824f753bc77dbcc0d3c0d93568b9ae3

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    viral_seq (1.0.9)
+    viral_seq (1.0.13)
       colorize (~> 0.1)
       muscle_bio (~> 0.4)

data/README.md CHANGED

@@ -4,86 +4,121 @@ A Ruby Gem containing bioinformatics tools for processing viral NGS data.
 Specifically for Primer-ID sequencing and HIV drug resistance analysis.
-## Installation
+## Install
+```bash
     $ gem install viral_seq
+```
 ## Usage
-#### Load all ViralSeq classes by requiring 'viral_seq.rb'
+### Excutables
-```ruby
-#!/usr/bin/env ruby
-require 'viral_seq'
-```
-#### Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+```bash
     $ locator -i sequence.fasta -o sequence.fasta.csv
+```
+Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data.
-#### Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data. Parameter json file can be generated using `tcs_json_generator` or at https://tcs-dr-dept-tcs.cloudapps.unc.edu/generator.php
-    $ tcs params.json
-#### Use executable `tcs_json_generator` to generate params .json file for the `tcs` pipeline.
+```bash
+    $ tcs -p params.json # run TCS pipeline with params.json
+    $ tcs -j # CLI to generate params.json
+    $ tcs -h # print out the help
+```
-    $ tcs_json_generator
+## Some Examples
+Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
-## Some Examples
+```ruby
+#!/usr/bin/env ruby
+require 'viral_seq'
+```
-#### Load nucleotide sequences from a FASTA format sequence file
+Load nucleotide sequences from a FASTA format sequence file
 ```ruby
 my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
 ```
-#### Make an alignment (using MUSCLE)
+Make an alignment (using MUSCLE)
 ```ruby
 aligned_seqhash = my_seqhash.align
 ```
-#### Filter nucleotide sequences with the reference coordinates (HIV Protease)
+Filter nucleotide sequences with the reference coordinates (HIV Protease)
 ```ruby
 qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
 ```
-#### Further filter out sequences with Apobec3g/f hypermutations
+Further filter out sequences with Apobec3g/f hypermutations
 ```ruby
 qc_seqhash = qc_seqhash.a3g
 ```
-#### Calculate nucleotide diveristy π
+Calculate nucleotide diveristy π
 ```ruby
 qc_seqhash.pi
 ```
-#### Calculate cut-off for minority variants based on Poisson model
+Calculate cut-off for minority variants based on Poisson model
 ```ruby
 cut_off = qc_seqhash.pm
 ```
-#### Examine for drug resistance mutations for HIV PR region
+Examine for drug resistance mutations for HIV PR region
 ```ruby
 qc_seqhash.sdrm_hiv_pr(cut_off)
 ```
+## Known issues
+  1. ~~have a conflict with rails.~~
+  2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
+  3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
 ## Updates
-Version 1.0.9-07182020:
+### Version 1.0.14-03052021
+  1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
+### Version 1.0.13-03032021
+  1. Fixed the conflict with rails.
+### Version 1.0.12-03032021
+  1. Fixed an issue that may cause conflicts with ActiveRecord.
+### Version 1.0.11-03022021
+  1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
+  2. fixed an issue loading class 'OptionParser'in some ruby environments.
+### Version 1.0.10-11112020:
+  1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
+  2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
+  3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
+  4. a few optimizations.
+  5. TCS 2.1.0 delivered.
+  6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
+### Version 1.0.9-07182020:
   1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
   2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
-Version 1.0.8-02282020:
+### Version 1.0.8-02282020:
   1. TCS pipeline (version 2.0.0) added as executable.
       tcs  -  main TCS pipeline script.
@@ -94,14 +129,14 @@ Version 1.0.8-02282020:
   3. Bug fix for several methods.
-Version 1.0.7-01282020:
+### Version 1.0.7-01282020:
   1. Several methods added, including
       ViralSeq::SeqHash#error_table
       ViralSeq::SeqHash#random_select
   2. Improved performance for several functions.
-Version 1.0.6-07232019:
+### Version 1.0.6-07232019:
   1. Several methods added to ViralSeq::SeqHash, including
       ViralSeq::SeqHash#size
@@ -110,33 +145,33 @@ Version 1.0.6-07232019:
       ViralSeq::SeqHash#mutation
   2. Update documentations and rspec samples.
-Version 1.0.5-07112019:
+### Version 1.0.5-07112019:
   1. Update ViralSeq::SeqHash#sequence_locator.
      Program will try to determine the direction (`+` or `-` of the query sequence)
   2. update executable `locator` to have a column of `direction` in output .csv file
-Version 1.0.4-07102019:
+### Version 1.0.4-07102019:
   1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
   2. Fix bugs in bin `locator`
-Version 1.0.3-07102019:
+### Version 1.0.3-07102019:
   1. Bug fix.
-Version 1.0.2-07102019:
+### Version 1.0.2-07102019:
   1. Fixed a gem loading issue.
-Version 1.0.1-07102019:
+### Version 1.0.1-07102019:
   1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
   2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
   3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
   4. update documentations
-Version 1.0.0-07092019:
+### Version 1.0.0-07092019:
   1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
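
Note on the 1.0.10 README entry: it describes a "true simple majority" consensus model in which no nucleotide needs to exceed 50% to be called, and the `bin/tcs` diff changes the default `majority_cut_off` from 0.5 to 0. The sketch below illustrates that idea only. `simple_majority_consensus` is a hypothetical helper, not the gem's `ViralSeq::SeqHash#consensus`, and its tie handling (emit `N`) is an assumption rather than documented behavior.

```ruby
# Hypothetical helper, not part of viral_seq: call one consensus base per
# position across equal-length sequences.
def simple_majority_consensus(seqs, cutoff = 0)
  return "" if seqs.empty?

  (0...seqs.first.size).map do |i|
    # Count each base observed at position i.
    counts = Hash.new(0)
    seqs.each { |s| counts[s[i]] += 1 }

    top_count = counts.values.max
    winners = counts.select { |_base, c| c == top_count }.keys

    # Assumption: ties, and bases whose share does not exceed the cutoff
    # fraction, are reported as the ambiguity code "N".
    if winners.size == 1 && top_count > cutoff * seqs.size
      winners.first
    else
      "N"
    end
  end.join
end

reads = %w[ATCG ATCG TTCA GTCC]
puts simple_majority_consensus(reads)      # cutoff 0: the most common base wins
puts simple_majority_consensus(reads, 0.5) # a base must exceed 50% to be called
```

With a cutoff of 0, the most frequent base is called even when it falls short of an absolute majority, which is the behavior the README entry describes. With 0.5, positions lacking a base above 50% become ambiguous.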

data/bin/tcs CHANGED

@@ -28,114 +28,79 @@
 require 'viral_seq'
 require 'json'
 require 'colorize'
+require 'optparse'
+options = {}
-# calculate consensus cutoff
+banner = '-'*50 + "\n" +
+        '| The TCS Pipeline ' + "Version #{ViralSeq::TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |' + "\n" +
+        '-'*50 + "\n"
-def calculate_cut_off(m, error_rate = 0.02)
-  n = 0
-  case error_rate
-  when 0.005...0.015
-    if m <= 10
-      n = 2
-    else
-      n = 1.09*10**-26*m**6 + 7.82*10**-22*m**5 - 1.93*10**-16*m**4 + 1.01*10**-11*m**3 - 2.31*10**-7*m**2 + 0.00645*m + 2.872
-    end
+OptionParser.new do |opts|
+  opts.banner = banner + "Usage: tcs -j"
+  opts.on "-j", "--json_generator", "Command line interfac to generate new params json file" do |j|
+    options[:json_generator] = true
+  end
-  when 0...0.005
-    if m <= 10
-      n = 2
-    else
-      n = -9.59*10**-27*m**6 + 3.27*10**-21*m**5 - 3.05*10**-16*m**4 + 1.2*10**-11*m**3 - 2.19*10**-7*m**2 + 0.004044*m + 2.273
-    end
+  opts.on("-p", "--params PARAMS_JSON", "Execute the pipeline with input params json file") do |p|
+    options[:params_json] = p
+  end
-  else
-    if m <= 10
-      n = 2
-    elsif m <= 8500
-      n = -1.24*10**-21*m**6 + 3.53*10**-17*m**5 - 3.90*10**-13*m**4 + 2.12*10**-9*m**3 - 6.06*10**-6*m**2 + 1.80*10**-2*m + 3.15
-    else
-      n = 0.0079 * m + 9.4869
-    end
+  opts.on("-h", "--help", "Prints this help") do
+    puts opts
+    exit
   end
-  n = n.round
-  n = 2 if n < 3
-  return n
-end
+  opts.on("-v", "--version", "Version info") do
+    puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
+    puts "viral_seq version: " + ViralSeq::VERSION.red.bold
+    exit
+  end
-puts "\n" + '-'*50
-puts '| The TCS Pipeline ' + "Version #{ViralSeq::TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |'
-puts '-'*50 + "\n"
+  # opts.on("--no-parallel", "toggle off parallel processing") do
+  #   options[:no_parallel] = true
+  # end
+end.parse!
-unless ARGV[0]
-  raise "No JSON param file found. Script terminated."
+if options[:json_generator]
+  params = ViralSeq::TcsJson.generate
+elsif (options[:params_json] && File.exist?(options[:params_json]))
+  params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
+else
+  abort "No params JSON file found. Script terminated.".red
 end
-params = JSON.parse(File.read(ARGV[0]), symbolize_names: true)
 indir = params[:raw_sequence_dir]
 unless File.exist?(indir)
-  raise "No input sequence directory found. Script terminated."
-end
-libname = File.basename(indir)
-# obtain R1 and R2 file path
-files = []
-Dir.chdir(indir) do
-  files = Dir.glob("*")
-end
-if files.empty?
-  raise "Input dir does not contain files. Script terminated."
+  abort "No input sequence directory found. Script terminated.".red.bold
 end
-r1_f = ""
-r2_f = ""
+# log file
-# unzip .fasta.gz
-def unzip_r(indir, f)
-  r_file = indir + "/" + f
-  if f =~ /.gz/
-    `gzip -d #{r_file}`
-    new_f = f.sub ".gz", ""
-    r_file = File.join(indir, new_f)
-  end
-  return r_file
-end
 runtime_log_file = File.join(indir,"runtime.log")
 log = File.open(runtime_log_file, "w")
 log.puts "TSC pipeline Version " + ViralSeq::TCS_VERSION.to_s
 log.puts "viral_seq Version " + ViralSeq::VERSION.to_s
 log.puts Time.now.to_s + "\t" + "Start TCS pipeline..."
+libname = File.basename indir
-files.each do |f|
-  t = f.split("_")
-  if t.size == 1
-    tag = f
-  else
-    tag = f.split("_")[1..-1].join("_")
-  end
-  if tag =~ /r1/i
-    r1_f = unzip_r(indir, f)
-  elsif tag =~ /r2/i
-    r2_f = unzip_r(indir, f)
-  end
-end
+seq_files = ViralSeq::TcsCore.r1r2 indir
-unless File.exist?(r1_f)
-  log.puts "R1 file not found. Script terminated."
-  raise "R1 file not found. Script terminated."
+if seq_files[:r1_file].size > 0 and seq_files[:r2_file].size > 0
+  r1_f = seq_files[:r1_file]
+  r2_f = seq_files[:r2_file]
+elsif seq_files[:r1_file].size > 0 and seq_files[:r2_file].empty?
+  exit_sig = "Missing R2 file. Aborted."
+elsif seq_files[:r2_file].size > 0 and seq_files[:r1_file].empty?
+  exit_sig = "Missing R1 file. Aborted."
+else
+  exit_sig = "Cannot determine R1 R2 file in #{indir}. Aborted."
 end
-unless File.exist?(r2_f)
-  log.puts "R2 file not found. Script terminated."
-  raise "R2 file not found. Script terminated."
+if exit_sig
+  ViralSeq::TcsCore.log_and_abort log, exit_sig
 end
 r1_fastq_sh = ViralSeq::SeqHash.fq(r1_f)
@@ -152,10 +117,10 @@ end
 primers = params[:primer_pairs]
 if primers.empty?
-  log.puts "No primer information. Script terminated."
-  raise "No primer information. Script terminated."
+  ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
 end
 primers.each do |primer|
   summary_json = {}
   summary_json[:tcs_version] = ViralSeq::TCS_VERSION
@@ -179,66 +144,25 @@ primers.each do |primer|
   summary_json[:cdan_primer] = cdna_primer
   summary_json[:forward_primer] = forward_primer
-  primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0.5
+  primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0
   summary_json[:majority_cut_off] = majority_cut_off
   summary_json[:total_raw_sequence] = raw_sequence_number
   log.puts Time.now.to_s + "\t" +  "Porcessing #{region}..."
-  r1_raw = r1_fastq_sh.dna_hash
-  r2_raw = r2_fastq_sh.dna_hash
+  # filter R1
   log.puts Time.now.to_s + "\t" +  "filtering R1..."
-  # obtain biological forward primer sequence
-  if forward_primer.match(/(N+)(\w+)$/)
-    forward_n = $1.size
-    forward_bio_primer = $2
-  else
-    forward_n = 0
-    forward_bio_primer = forward_primer
-  end
-  forward_bio_primer_size = forward_bio_primer.size
-  forward_starting_number = forward_n + forward_bio_primer_size
-  # filter R1 sequences with forward primers.
-  forward_primer_ref = forward_bio_primer.nt_parser
-  r1_passed_seq = {}
-  r1_raw.each do |name,seq|
-    next if seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
-    next if seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
-    next if seq =~ /T{11}/ # a string of poly-T indicates adaptor sequence
-    primer_region_seq = seq[forward_n, forward_bio_primer_size]
-    if primer_region_seq =~ forward_primer_ref
-      r1_passed_seq[name.split("\s")[0]] = seq
-    end
-  end
+  filter_r1 = ViralSeq::TcsCore.filter_r1(r1_fastq_sh, forward_primer)
+  r1_passed_seq = filter_r1[:r1_passed_seq]
   log.puts Time.now.to_s + "\t" +  "R1 filtered: #{r1_passed_seq.size.to_s}"
   summary_json[:r1_filtered_raw] = r1_passed_seq.size
+  # filter R2
   log.puts Time.now.to_s + "\t" +  "filtering R2..."
-  # obtain biological reverse primer sequence
-  cdna_primer.match(/(N+)(\w+)$/)
-  pid_length = $1.size
-  cdna_bio_primer = $2
-  cdna_bio_primer_size = cdna_bio_primer.size
-  reverse_starting_number = pid_length + cdna_bio_primer_size
-  # filter R2 sequences with cDNA primers.
-  cdna_primer_ref = cdna_bio_primer.nt_parser
-  r2_passed_seq = {}
-  r2_raw.each do |name, seq|
-    next if seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
-    next if seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
-    next if seq =~ /T{11}/ # a string of poly-T indicates adaptor sequence
-    primer_region_seq = seq[pid_length, cdna_bio_primer_size]
-    if primer_region_seq =~ cdna_primer_ref
-      r2_passed_seq[name.split("\s")[0]] = seq
-    end
-  end
+  filter_r2 = ViralSeq::TcsCore.filter_r2(r2_fastq_sh, cdna_primer)
+  r2_passed_seq = filter_r2[:r2_passed_seq]
+  pid_length = filter_r2[:pid_length]
   log.puts Time.now.to_s + "\t" +  "R2 filtered: #{r2_passed_seq.size.to_s}"
   summary_json[:r2_filtered_raw] = r2_passed_seq.size
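
Note on the hunk above: the removed lines split each primer into a run of `N` bases and the biological primer that follows, using the regex `/(N+)(\w+)$/`. The diff does not show how `ViralSeq::TcsCore.filter_r1` and `filter_r2` do this internally, so the following is only a sketch of the removed logic. `split_primer` and its return keys are hypothetical names.

```ruby
# Hypothetical helper mirroring the removed inline logic: separate the run of
# N bases from the biological primer sequence that follows it.
def split_primer(primer)
  if (m = primer.match(/(N+)(\w+)$/))
    { n_length: m[1].size, bio_primer: m[2] }
  else
    # No N run found: the whole primer is treated as biological sequence,
    # as in the removed forward-primer branch.
    { n_length: 0, bio_primer: primer }
  end
end

cdna = split_primer("ACGT" + "N" * 9 + "CAGTCACT")
fwd  = split_primer("TTATGGGATCAAAGCC")

# Offset at which the biological read starts, as the removed code computed
# forward_starting_number and reverse_starting_number.
puts cdna[:n_length] + cdna[:bio_primer].size
puts fwd[:n_length] + fwd[:bio_primer].size
```

For the cDNA primer the `N` run is the Primer ID, so `n_length` corresponds to `pid_length` in the script.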
@@ -257,8 +181,8 @@ primers.each do |primer|
     r2_seq = r2_passed_seq[seqtag]
     pid = r2_seq[0, pid_length]
     id[seqtag] = pid
-    bio_r2[seqtag] = r2_seq[reverse_starting_number..-2]
-    bio_r1[seqtag] = r1_seq[forward_starting_number..-2]
+    bio_r2[seqtag] = r2_seq[filter_r2[:reverse_starting_number]..-2]
+    bio_r1[seqtag] = r1_seq[filter_r1[:forward_starting_number]..-2]
   end
   # TCS cut-off
@@ -278,11 +202,10 @@ primers.each do |primer|
   end
   max_id = primer_id_dis.keys.sort[-5..-1].mean
-  consensus_cutoff = calculate_cut_off(max_id,error_rate)
+  consensus_cutoff = ViralSeq::TcsCore.calculate_cut_off(max_id,error_rate)
   log.puts Time.now.to_s + "\t" +  "Consensus cut-off is #{consensus_cutoff.to_s}"
   summary_json[:consensus_cutoff] = consensus_cutoff
   summary_json[:length_of_pid] = pid_length
   log.puts Time.now.to_s + "\t" +  "Creating consensus..."
   # Primer ID over the cut-off
@@ -355,6 +278,8 @@ primers.each do |primer|
     consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
     r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
     r2_consensus = ViralSeq::SeqHash.array(r2_sub_seq).consensus(majority_cut_off)
+    # hide the following two lines if allowing sequence to have ambiguities.
     next if r1_consensus =~ /[^ATCG]/
     next if r2_consensus =~ /[^ATCG]/
@@ -392,8 +317,12 @@ primers.each do |primer|
   f1 = File.open(outfile_r1, 'w')
   f2 = File.open(outfile_r2, 'w')
   primer_id_in_use = {}
-  r1_seq_length = consensus_filtered.values[0][0].size
-  r2_seq_length = consensus_filtered.values[0][1].size
+  if n_con > 0
+    r1_seq_length = consensus_filtered.values[0][0].size
+    r2_seq_length = consensus_filtered.values[0][1].size
+  else
+    next
+  end
   log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
   log.puts Time.now.to_s + "\t" + "R1 sequence #{r2_seq_length} bp"
   consensus_filtered.each do |seq_name,seq|
@@ -404,6 +333,7 @@ primers.each do |primer|
   f1.close
   f2.close
+  # Primer ID distribution in .json file
   out_pid_json = File.join(out_dir_set, 'primer_id.json')
   pid_json = {}
   pid_json[:primer_id_in_use] = Hash[*(primer_id_in_use.sort_by {|k, v| [-v,k]}.flatten)]
@@ -413,11 +343,14 @@ primers.each do |primer|
     f.puts JSON.pretty_generate(pid_json)
   end
+  # start end-join
   def end_join(dir, option, overlap)
     shp = ViralSeq::SeqHashPair.fa(dir)
     case option
     when 1
       joined_sh = shp.join1()
+    when 2
+      joined_sh = shp.join1(overlap)
     when 3
       joined_sh = shp.join2
     when 4
@@ -489,9 +422,10 @@ primers.each do |primer|
       joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
       if export_raw
-        joined_sh_raw = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
+        joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
       end
     end
     log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
     summary_json[:combined_tcs_after_qc] = joined_sh.size
     if primer[:trim]
@@ -499,10 +433,11 @@ primers.each do |primer|
       trim_end = primer[:trim_ref_end]
       trim_ref = primer[:trim_ref].to_sym
       joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
-    end
-    joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
-    if export_raw
-      joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.fasta"))
+      joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
+      if export_raw
+        joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
+        joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
+      end
     end
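
Note on `calculate_cut_off`: the function removed from `bin/tcs` in this diff is now called as `ViralSeq::TcsCore.calculate_cut_off`. Its new body lives in `tcs_core.rb`, which is not shown in this section, so the standalone sketch below reproduces only the removed version. The coefficients are copied from the removed lines and written in `e` notation. Whether the relocated method is identical is an assumption.

```ruby
# Standalone copy of the removed consensus cut-off calculation.
# m          - the largest Primer ID bin size (the script passes a mean of the top bins)
# error_rate - platform error rate; selects one of three fitted polynomials
def calculate_cut_off(m, error_rate = 0.02)
  n =
    if m <= 10
      2 # small bins always get the minimum cut-off
    else
      case error_rate
      when 0.005...0.015
        1.09e-26 * m**6 + 7.82e-22 * m**5 - 1.93e-16 * m**4 +
          1.01e-11 * m**3 - 2.31e-7 * m**2 + 0.00645 * m + 2.872
      when 0...0.005
        -9.59e-27 * m**6 + 3.27e-21 * m**5 - 3.05e-16 * m**4 +
          1.2e-11 * m**3 - 2.19e-7 * m**2 + 0.004044 * m + 2.273
      else
        if m <= 8500
          -1.24e-21 * m**6 + 3.53e-17 * m**5 - 3.90e-13 * m**4 +
            2.12e-9 * m**3 - 6.06e-6 * m**2 + 1.80e-2 * m + 3.15
        else
          0.0079 * m + 9.4869 # linear extrapolation for very large bins
        end
      end
    end

  n = n.round
  n = 2 if n < 3 # as in the removed code: values below 3 are set to 2
  n
end

puts calculate_cut_off(100)        # default error rate 0.02
puts calculate_cut_off(1000, 0.01) # mid error-rate polynomial
```

The `m <= 10` check is hoisted out of the three branches here because every branch of the removed code returned 2 in that case. The results are unchanged.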
   end