RubyGems - viral_seq - Versions diffs - 1.0.14 → 1.2.1 - Mend

viral_seq 1.0.14 → 1.2.1


Files changed (26)

checksums.yaml +4 -4
data/.gitignore +0 -1
data/Gemfile.lock +16 -3
data/README.md +135 -8
data/bin/tcs +51 -10
data/bin/tcs_log +102 -0
data/bin/tcs_sdrm +409 -0
data/docs/assets/img/cover.jpg +0 -0
data/docs/dr.json +67 -0
data/docs/sample_miseq_data/hivdr_control/r1.fastq.gz +0 -0
data/docs/sample_miseq_data/hivdr_control/r2.fastq.gz +0 -0
data/lib/viral_seq.rb +5 -1
data/lib/viral_seq/constant.rb +41 -4
data/lib/viral_seq/hivdr.rb +1 -1
data/lib/viral_seq/muscle.rb +3 -2
data/lib/viral_seq/recency.rb +52 -0
data/lib/viral_seq/sdrm.rb +101 -35
data/lib/viral_seq/seq_hash.rb +25 -5
data/lib/viral_seq/seq_hash_pair.rb +6 -4
data/lib/viral_seq/sequence.rb +1 -84
data/lib/viral_seq/tcs_core.rb +3 -1
data/lib/viral_seq/tcs_dr.rb +71 -0
data/lib/viral_seq/tcs_json.rb +41 -10
data/lib/viral_seq/version.rb +2 -2
data/viral_seq.gemspec +11 -0
metadata +74 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '048e85ab67fbb667919d02d4509a15111798b116b3f927c921d203dc8565a1a2'
-  data.tar.gz: 6951e410bd4f9b727a44fab1aa88f9cc263151cf9aed2a9c25ae9d866ed72450
+  metadata.gz: f3316ff7e72ca84c6eb2fa861a9fdad14fbcb3ab3c0053ade843ee13cc9ce82e
+  data.tar.gz: df1035ea5934b794ef8c64a04085f407bcd4dffc0888bf81a569a7ccfba3560a
 SHA512:
-  metadata.gz: 02cc87e245918a5c8f1b16b0db978da66e3bf7e83c6c6140c394c560d31c86ab1845b337a4f53b4b7883ff8e452e8caef0b036ea113c4416d6a29d16f419eb81
-  data.tar.gz: 9f53bd6c46f4a49b5c14b8b8019ffa3e1abcf442bf0a6cc09a7dbc768a474f2afd81d5cf168f38eb8ab5abbd601c60f1f824f753bc77dbcc0d3c0d93568b9ae3
+  metadata.gz: a3ec35b3a40ee9cf66131416a1c20eda38bf8bde818aa41af285c099ddd2b49e4f31fe1d011c95def77fd5c6653d96f4295142fd543444f249242154bb2b671b
+  data.tar.gz: daa6e694a841cc615cfde850bf2d98ca7467cb1d27502daf398ab0204e55c4d477f34aafce0149c765a931755fc3a3f7dbdf425964904d0199efb0651b9a09a6
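
The digests above belong to the `metadata.gz` and `data.tar.gz` members of the `.gem` archive, so they change with every release. A minimal sketch of checking a payload against one of these hex strings follows; `checksum_matches?` is a hypothetical helper written for this illustration and is not part of RubyGems or viral_seq.

```ruby
require 'digest'

# Hypothetical helper: compare a payload's digest with the hex string
# recorded in checksums.yaml. `algorithm` is :sha256 or :sha512.
def checksum_matches?(payload, expected_hex, algorithm: :sha256)
  digest_class = { sha256: Digest::SHA256, sha512: Digest::SHA512 }.fetch(algorithm)
  digest_class.hexdigest(payload) == expected_hex.downcase
end

# Assumed usage: data.tar.gz has already been extracted from the .gem
# archive (a .gem file is a plain tar archive).
#   payload = File.binread("data.tar.gz")
#   checksum_matches?(
#     payload,
#     "df1035ea5934b794ef8c64a04085f407bcd4dffc0888bf81a569a7ccfba3560a"
#   )
```

Reading the file with `File.binread` matters here, because a text-mode read could alter the bytes on some platforms and change the digest.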

data/.gitignore CHANGED

@@ -2,7 +2,6 @@
 /.yardoc
 /_yardoc/
 /coverage/
-/doc/
 /pkg/
 /spec/reports/
 /tmp/

data/Gemfile.lock CHANGED

@@ -1,16 +1,27 @@
 PATH
   remote: .
   specs:
-    viral_seq (1.0.13)
-      colorize (~> 0.1)
-      muscle_bio (~> 0.4)
+    viral_seq (1.1.1)
+      colorize (>= 0.1)
+      combine_pdf (>= 1.0.0)
+      muscle_bio (>= 0.4)
+      prawn (>= 2.3.0)
+      prawn-table (>= 0.2.0)
 GEM
   remote: https://rubygems.org/
   specs:
     colorize (0.8.1)
+    combine_pdf (1.0.21)
+      ruby-rc4 (>= 0.1.5)
     diff-lcs (1.3)
     muscle_bio (0.4.0)
+    pdf-core (0.9.0)
+    prawn (2.4.0)
+      pdf-core (~> 0.9.0)
+      ttfunk (~> 1.7)
+    prawn-table (0.2.2)
+      prawn (>= 1.3.0, < 3.0.0)
     rake (13.0.1)
     rspec (3.8.0)
       rspec-core (~> 3.8.0)
@@ -25,6 +36,8 @@ GEM
       diff-lcs (>= 1.2.0, < 2.0)
       rspec-support (~> 3.8.0)
     rspec-support (3.8.0)
+    ruby-rc4 (0.1.5)
+    ttfunk (1.7.0)
 PLATFORMS
   ruby
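
The lock file shows the runtime dependencies moving from pessimistic constraints (`~> 0.1`) to open-ended ones (`>= 0.1`). The practical difference can be sketched with RubyGems' own requirement classes; `allows?` is a hypothetical wrapper added only for this illustration.

```ruby
require 'rubygems'

# Hypothetical wrapper: does `version` satisfy the requirement string?
def allows?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# "~> 0.1" means ">= 0.1 and < 1.0"; ">= 0.1" has no upper bound.
puts allows?("~> 0.1", "0.8.1")  # true  (the locked colorize version)
puts allows?("~> 0.1", "1.0.0")  # false (next major is excluded)
puts allows?(">= 0.1", "1.0.0")  # true  (accepted after this change)
```

The looser form lets the gem install alongside newer major versions of its dependencies, at the cost of no longer guarding against breaking changes in them.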

data/README.md CHANGED

@@ -1,8 +1,24 @@
 # ViralSeq
+[![Gem Version](https://img.shields.io/gem/v/viral_seq?color=%2300e673&style=flat-square)](https://rubygems.org/gems/viral_seq)
+![GitHub](https://img.shields.io/github/license/viralseq/viral_seq)
+![Gem](https://img.shields.io/gem/dt/viral_seq?color=%23E9967A)
+![GitHub last commit](https://img.shields.io/github/last-commit/viralseq/viral_seq?color=%2300BFFF)
+[![Join the chat at https://gitter.im/viral_seq/community](https://badges.gitter.im/viral_seq/community.svg)](https://gitter.im/viral_seq/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 A Ruby Gem containing bioinformatics tools for processing viral NGS data.
-Specifically for Primer-ID sequencing and HIV drug resistance analysis.
+Specifically for Primer ID sequencing and HIV drug resistance analysis.
+## Illustration for the Primer ID Sequencing
+![Primer ID Sequencing](./docs/assets/img/cover.jpg)
+### Reference readings on the Primer ID sequencing
+[Explanation of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
+[Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
+[Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
 ## Install
@@ -14,20 +30,93 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
 ### Executables
-Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+### `tcs`
+Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
+Example commands:
 ```bash
-    $ locator -i sequence.fasta -o sequence.fasta.csv
+    $ tcs -p params.json # run TCS pipeline with params.json
+    $ tcs -p params.json -i DIRECTORY
+    # run TCS pipeline with params.json and DIRECTORY
+    # if DIRECTORY is not defined in params.json
+    $ tcs -dr -i DIRECTORY
+    # run tcs-dr (MPID HIV drug resistance sequencing) pipeline
+    # DIRECTORY needs to be given.
+    $ tcs -j # CLI to generate params.json
+    $ tcs -h # print out the help
 ```
-Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data.
+[sample params.json for the tcs-dr pipeline](./docs/dr.json)
+---
+### `tcs_log`
+Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
+Example file structure:
+```
+batch_tcs_jobs/
+      ├── lib1
+      ├── lib2
+      ├── lib3
+      ├── lib4
+      ├── ...
+```
+Example command:
 ```bash
-    $ tcs -p params.json # run TCS pipeline with params.json
-    $ tcs -j # CLI to generate params.json
-    $ tcs -h # print out the help
+    $ tcs_log batch_tcs_jobs
+```
+---
+### `tcs_sdrm`
+Use `tcs_sdrm` pipeline for HIV-1 drug resistance mutation and recency.
+Example command:
+```bash
+    $ tcs_sdrm libs_dir
+```
+libs_dir file structure:
+```
+libs_dir/
+├── lib1
+  ├── lib1_RT
+  ├── lib1_PR
+  ├── lib1_IN
+  ├── lib1_V1V3
+├── lib2
+  ├── lib2_RT
+  ├── lib2_PR
+  ├── lib2_IN
+  ├── lib2_V1V3
+├── ...
 ```
+Output data in a new dir as 'libs_dir_SDRM'
+**Note: [R](https://www.r-project.org/) and the following R libraries are required:**
+- phangorn
+- ape
+- scales
+- ggforce
+- cowplot
+- magrittr
+- gridExtra
+---
+### `locator`
+Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+```bash
+    $ locator -i sequence.fasta -o sequence.fasta.csv
+```
+---
 ## Some Examples
 Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
@@ -58,7 +147,7 @@ qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
 Further filter out sequences with Apobec3g/f hypermutations
 ```ruby
-qc_seqhash = qc_seqhash.a3g
+qc_seqhash = qc_seqhash.a3g[:filtered_seq]
 ```
 Calculate nucleotide diversity π
@@ -86,6 +175,44 @@ qc_seqhash.sdrm_hiv_pr(cut_off)
 ## Updates
+### Version 1.2.1-05172021
+  1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
+### Version 1.2.0-05102021
+  1. Added `tcs_sdrm` pipeline as an executable.
+  `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogenetic analysis.
+  2. Added function ViralSeq::SeqHash#sample.
+  3. Added recency determining function `ViralSeq::Recency::define`
+  4. Fixed a few bugs related to `tcs_sdrm`.
+### Version 1.1.2-04262021
+  1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
+  2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
+  3. Added `--keep-original` flag to the `tcs` pipeline.
+### Version 1.1.1-04012021
+  1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
+  2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
+  If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
+  3. Added option `-dr` to the `tcs` script.
+### Version 1.1.0-03252021
+  1. Optimized the algorithm of end-join.
+  2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
+  3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
+  4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
+  5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
+  Users can choose from 3 MiSeq platforms for processing their sequencing data.
+  MiSeq 300x7x300 is the default option.
 ### Version 1.0.14-03052021
   1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.

data/bin/tcs CHANGED

@@ -23,7 +23,7 @@
 # THE SOFTWARE.
 # Use JSON file as the run param
-# run tcs_json_generator.rb to generate param json file.
+# run `tcs -j` to generate param json file.
 require 'viral_seq'
 require 'json'
@@ -46,11 +46,23 @@ OptionParser.new do |opts|
     options[:params_json] = p
   end
+  opts.on("-i", "--input PATH_TO_WORKING_DIRECTORY", "Path to the working directory") do |p|
+    options[:input] = p
+  end
+  opts.on("-dr", "--dr_pipeline", "HIV drug resistance MPID pipeline") do |p|
+    options[:dr] = true
+  end
   opts.on("-h", "--help", "Prints this help") do
     puts opts
     exit
   end
+  opts.on("--keep-original", "keep raw sequence files") do
+    options[:keep] = true
+  end
   opts.on("-v", "--version", "Version info") do
     puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
     puts "viral_seq version: " + ViralSeq::VERSION.red.bold
@@ -64,15 +76,21 @@ end.parse!
 if options[:json_generator]
   params = ViralSeq::TcsJson.generate
+elsif options[:dr]
+  params = ViralSeq::TcsDr::PARAMS
 elsif (options[:params_json] && File.exist?(options[:params_json]))
   params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
 else
   abort "No params JSON file found. Script terminated.".red
 end
-indir = params[:raw_sequence_dir]
+if options[:input]
+  indir = options[:input]
+else
+  indir = params[:raw_sequence_dir]
+end
-unless File.exist?(indir)
+unless indir and File.exist?(indir)
   abort "No input sequence directory found. Script terminated.".red.bold
 end
@@ -115,6 +133,12 @@ else
   error_rate = 0.02
 end
+if params[:platform_format]
+  $platform_sequencing_length = params[:platform_format]
+else
+  $platform_sequencing_length = 300
+end
 primers = params[:primer_pairs]
 if primers.empty?
   ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
@@ -123,6 +147,7 @@ end
 primers.each do |primer|
   summary_json = {}
+  summary_json[:warnings] = []
   summary_json[:tcs_version] = ViralSeq::TCS_VERSION
   summary_json[:viralseq_version] = ViralSeq::VERSION
   summary_json[:runtime] = Time.now.to_s
@@ -134,6 +159,7 @@ primers.each do |primer|
   forward_primer = primer[:forward]
   export_raw = primer[:export_raw]
+  limit_raw = primer[:limit_raw]
   unless cdna_primer
     log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
@@ -175,6 +201,10 @@ primers.each do |primer|
   paired_seq_number = common_keys.size
   log.puts Time.now.to_s + "\t" +  "Paired raw sequences are : #{paired_seq_number.to_s}"
   summary_json[:paired_raw_sequence] = paired_seq_number
+  if paired_seq_number < raw_sequence_number * 0.001
+    summary_json[:warnings] <<
+      "WARNING: Filtered raw sequneces less than 0.1% of the total raw sequences. Possible contamination."
+  end
   common_keys.each do |seqtag|
     r1_seq = r1_passed_seq[seqtag]
@@ -236,7 +266,13 @@ primers.each do |primer|
     raw_r1_f = File.open(outfile_raw_r1, 'w')
     raw_r2_f = File.open(outfile_raw_r2, 'w')
-    bio_r1.keys.each do |k|
+    if limit_raw
+      raw_keys = bio_r1.keys.sample(limit_raw.to_i)
+    else
+      raw_keys = bio_r1.keys
+    end
+    raw_keys.each do |k|
       raw_r1_f.puts k + "_r1"
       raw_r2_f.puts k + "_r2"
       raw_r1_f.puts bio_r1[k]
@@ -273,7 +309,6 @@ primers.each do |primer|
       r1_sub_seq << bio_r1[seq_name]
       r2_sub_seq << bio_r2[seq_name]
     end
     #consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
     consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
     r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
@@ -364,6 +399,7 @@ primers.each do |primer|
     shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
     joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
     log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
     summary_json[:combined_tcs] = joined_sh.size
     if export_raw
@@ -433,12 +469,15 @@ primers.each do |primer|
       trim_end = primer[:trim_ref_end]
       trim_ref = primer[:trim_ref].to_sym
       joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
-      joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
       if export_raw
         joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
-        joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
       end
     end
+    joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
+    if export_raw
+      joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
+    end
   end
   File.open(outfile_log, "w") do |f|
@@ -446,9 +485,11 @@ primers.each do |primer|
   end
 end
-log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
-File.unlink(r1_f)
-File.unlink(r2_f)
+unless options[:keep]
+  log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
+  File.unlink(r1_f)
+  File.unlink(r2_f)
+end
 log.puts Time.now.to_s + "\t" + "TCS pipeline successfuly exercuted."
 log.close
 puts "DONE!"
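
Three of the behavioural changes in this file are small decision rules: the `-i` option takes precedence over `raw_sequence_dir` from the params, a warning is recorded when paired reads fall below 0.1% of the raw reads, and `limit_raw` caps how many raw reads are exported. The sketch below restates them as standalone methods. The method names are hypothetical and do not exist in `bin/tcs`, which implements the same logic inline.

```ruby
# Hypothetical helper: the -i command-line value wins over params.json.
def resolve_input_dir(options, params)
  options[:input] || params[:raw_sequence_dir]
end

# Hypothetical helper: true when paired reads are under 0.1% of raw reads,
# the condition under which the script appends to summary_json[:warnings].
def low_pairing?(paired_count, raw_count, threshold = 0.001)
  paired_count < raw_count * threshold
end

# Hypothetical helper: take a random subset of read names when limit_raw
# is set, otherwise keep all of them.
def pick_raw_keys(keys, limit_raw)
  limit_raw ? keys.sample(limit_raw.to_i) : keys
end

options = { input: "run_42" }                     # illustrative values only
params  = { raw_sequence_dir: "from_params_json" }
puts resolve_input_dir(options, params)           # prints "run_42"
puts low_pairing?(5, 10_000)                      # prints "true" (5 < 10.0)
puts pick_raw_keys(%w[a b c d], 2).size           # prints "2"
```

Because `Array#sample` picks elements at random without replacement, the exported raw subset differs between runs unless a seeded random generator is supplied.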

data/bin/tcs_log ADDED

@@ -0,0 +1,102 @@
+#!/usr/bin/env ruby
+# pool run logs from one batch of tcs jobs
+# file structure:
+#   batch_tcs_jobs/
+#   ├── lib1
+#   ├── lib2
+#   ├── lib3
+#   ├── lib4
+#   ├── ...
+#
+# command example:
+#   $ tcs_log batch_tcs_jobs
+require 'viral_seq'
+require 'pathname'
+require 'json'
+require 'fileutils'
+indir = ARGV[0].chomp
+indir_basename = File.basename(indir)
+indir_dirname = File.dirname(indir)
+tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
+Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
+libs = []
+Dir.chdir(indir) {libs = Dir.glob("*")}
+outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
+outdir3 = File.join(tcs_dir, "TCS_per_region")
+outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
+Dir.mkdir(outdir2) unless File.directory?(outdir2)
+Dir.mkdir(outdir3) unless File.directory?(outdir3)
+Dir.mkdir(outdir4) unless File.directory?(outdir4)
+log_file = File.join(tcs_dir,"log.csv")
+log = File.open(log_file,'w')
+header = %w{
+  lib_name
+  Region
+  Raw_Sequences_per_barcode
+  R1_Raw
+  R2_Raw
+  Paired_Raw
+  Cutoff
+  PID_Length
+  Consensus1
+  Consensus2
+  Distinct_to_Raw
+  Resampling_index
+  Combined_TCS
+  Combined_TCS_after_QC
+  WARNINGS
+}
+log.puts header.join(',')
+libs.each do |lib|
+  Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
+  fasta_files = []
+  json_files = []
+  Dir.chdir(File.join(indir, lib)) do
+     fasta_files = Dir.glob("**/*.fasta")
+     json_files = Dir.glob("**/log.json")
+  end
+  fasta_files.each do |f|
+    path_array = Pathname(f).each_filename.to_a
+    region = path_array[0]
+    if path_array[-1] == "combined.fasta"
+      FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
+      Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
+      FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
+    else
+      Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
+      Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
+      FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
+    end
+  end
+  json_files.each do |f|
+    json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
+    log.print [lib,
+               json_log[:primer_set_name],
+               json_log[:total_raw_sequence],
+               json_log[:r1_filtered_raw],
+               json_log[:r2_filtered_raw],
+               json_log[:paired_raw_sequence],
+               json_log[:consensus_cutoff],
+               json_log[:length_of_pid],
+               json_log[:total_tcs_with_ambiguities],
+               json_log[:total_tcs],
+               json_log[:distinct_to_raw],
+               json_log[:resampling_param],
+               json_log[:combined_tcs],
+               json_log[:combined_tcs_after_qc],
+               json_log[:warnings],
+             ].join(',') + "\n"
+  end
+end
+log.close
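
The copy rules in `tcs_log` depend on two things: the first path component of each FASTA file, which is taken as the region, and whether the file is named `combined.fasta`. A minimal sketch of that mapping follows, returning destination paths relative to the `*_tcs` output directory. `tcs_log_destinations` is a hypothetical name, and the `RT/consensus/...` paths in the usage lines are assumed examples rather than paths confirmed by this diff.

```ruby
require 'pathname'

# Hypothetical helper: where would tcs_log copy a FASTA file found at
# `relative_fasta_path` inside library directory `lib`?
def tcs_log_destinations(lib, relative_fasta_path)
  parts  = Pathname(relative_fasta_path).each_filename.to_a
  region = parts[0]   # first directory level is treated as the region
  if parts[-1] == "combined.fasta"
    # Combined TCS files are copied twice: grouped by library and by region.
    [
      File.join("combined_TCS_per_lib", lib, "#{lib}_#{region}"),
      File.join("combined_TCS_per_region", region, "#{lib}_#{region}")
    ]
  else
    # Any other FASTA keeps its file name as a suffix.
    [File.join("TCS_per_region", region, lib, "#{lib}_#{region}_#{parts[-1]}")]
  end
end

# Assumed example paths, for illustration only.
puts tcs_log_destinations("lib1", "RT/consensus/combined.fasta")
puts tcs_log_destinations("lib1", "RT/consensus/r1.fasta")
```

As in the script itself, the destination names for combined files carry no `.fasta` extension.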