viral_seq 1.9.0 → 1.10.0

Files changed (18)

checksums.yaml +4 -4
data/Gemfile.lock +6 -1
data/README.md +133 -119
data/bin/locator +2 -2
data/bin/tcs +38 -38
data/bin/tcs_sdrm +2 -2
data/lib/viral_seq/R.rb +3 -1
data/lib/viral_seq/pid.rb +1 -4
data/lib/viral_seq/seq_hash.rb +48 -12
data/lib/viral_seq/sequence.rb +22 -171
data/lib/viral_seq/string.rb +3 -6
data/lib/viral_seq/tcs_core.rb +4 -0
data/lib/viral_seq/tcs_dr.rb +82 -1
data/lib/viral_seq/util/drm_versions_config.json +52 -0
data/lib/viral_seq/version.rb +2 -2
data/lib/viral_seq.rb +2 -0
data/viral_seq.gemspec +5 -0
metadata +31 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5817b0c1bb2887e02c101dd1032ad1a5523d7390caa00c785050ad05e4bb77e7
-  data.tar.gz: 0e9e8c40625122a932f7e06062b4e129712d433dbf11dcd0b4b51d51ce80b514
+  metadata.gz: d940e5f465cba40def34166fe50e0a21b1c62a1fff8e0be8abdabb7b4c4aab77
+  data.tar.gz: 7e4be6ec82d9081a1ea3130eed49dcaac080608e481c7520b43c2e58a50e379d
 SHA512:
-  metadata.gz: 6abfc477dea09519649614f8300d17be470a2511f79d7ed33b794ba96ab721584a488b71c01040e6c55e92e9e22c5282afedbdf02b6889acfdb7fb6fcddecb0d
-  data.tar.gz: 861e6ff9b55be29357b677c270ea7c1cc20ff49960cd8f3429ceb323751c3ddd6a7c9a773a9c383da5abb5e8f99a5172cd74cb2a70d54a674eba2d1406d904ea
+  metadata.gz: 15805b09c96b6d1bff023a82948f23ceb584c60ffb21b85e59d6f4ddc2e2394045a29788a7c5811c714afedfae6405c36b88e0bcadce0d1408068418c497e596
+  data.tar.gz: '0871676e5ee49fa14f84ec3c109172d964efac18f3f104ec38ad52daa69b9ef85a935c35ca2377d261b38edc5d8d438469b2360ec791a902116f60c8daeef5c2'
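
Note on the checksum hunk: a `.gem` file is a tar archive whose `checksums.yaml.gz` member records digests for `metadata.gz` and `data.tar.gz`. A minimal sketch of recomputing those digests locally is below. The helper names `file_digests` and `digests_match?` are hypothetical, and it assumes the archive members have already been extracted to disk.

```ruby
require "digest"

# Compute the two digests that checksums.yaml records for one archive member.
def file_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: true when every recorded digest matches the file on disk.
# `expected` maps an algorithm name ("SHA256", "SHA512") to a hex string.
def digests_match?(path, expected)
  actual = file_digests(path)
  expected.all? { |algo, hex| actual[algo] == hex }
end
```

For example, `digests_match?("data.tar.gz", "SHA256" => "7e4be6ec...")` would compare an extracted `data.tar.gz` against the full value listed in the hunk above (abbreviated here).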

data/Gemfile.lock CHANGED

@@ -1,12 +1,14 @@
 PATH
   remote: .
   specs:
-    viral_seq (1.9.0)
+    viral_seq (1.10.1)
       colorize (~> 0.1)
       combine_pdf (~> 1.0, >= 1.0.0)
       muscle_bio (= 0.4)
       prawn (~> 2.3, >= 2.3.0)
       prawn-table (~> 0.2, >= 0.2.0)
+      shellwords (~> 0.2)
+      virust-locator-ruby (~> 0.3)
 GEM
   remote: https://rubygems.org/
@@ -41,8 +43,11 @@ GEM
       rspec-support (~> 3.13.0)
     rspec-support (3.13.1)
     ruby-rc4 (0.1.5)
+    shellwords (0.2.0)
     ttfunk (1.8.0)
       bigdecimal (~> 3.1)
+    virust-locator-ruby (0.3.0)
+      shellwords (~> 0.2)
 PLATFORMS
   ruby

data/README.md CHANGED

@@ -16,10 +16,10 @@ CLI tools `tcs`, `tcs_sdrm`, `tcs_log` and `locator` included in the gem.
 ## Illustration for the Primer ID Sequencing
 ![Primer ID Sequencing](./docs/assets/img/cover.jpg)
 ### Reference readings on the Primer ID sequencing
 [Explantion of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
 [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
 [Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
@@ -41,11 +41,13 @@ Required RubyGems version: >= 1.3.6
 ### Excutables
 ### `tcs`
 Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
 Web-based `tcs` analysis can be accessed at https://primer-id.org/
 Example commands:
 ```bash
     $ tcs -p params.json # run TCS pipeline with params.json
     $ tcs -p params.json -i DIRECTORY
@@ -61,12 +63,13 @@ Example commands:
 [sample params.json for the tcs-dr pipeline](./docs/dr.json)
 ---
 ### `tcs_log`
 Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs. This command generates log.html to visualize the sequencing runs.
 Example file structure:
 ```
 batch_tcs_jobs/
       ├── lib1
@@ -77,21 +80,25 @@ batch_tcs_jobs/
 ```
 Example command:
 ```bash
     $ tcs_log batch_tcs_jobs
 ```
 ---
 ### `tcs_sdrm`
 Use `tcs_sdrm` pipeline for HIV-1 drug resistance mutation and recency.
 Example command:
 ```bash
     $ tcs_sdrm libs_dir
 ```
 lib_dir file structure:
 ```
 libs_dir/
 ├── lib1
@@ -109,8 +116,8 @@ libs_dir/
 Output data in a new dir as 'libs_dir_SDRM'
 **Note: [R](https://www.r-project.org/) and the following R libraries are required:**
 - phangorn
 - ape
 - scales
@@ -122,11 +129,13 @@ Output data in a new dir as 'libs_dir_SDRM'
 ---
 ### `locator`
 Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
 ```bash
     $ locator -i sequence.fasta -o sequence.fasta.csv
 ```
 ---
 ## Some Examples
@@ -179,244 +188,249 @@ Examine for drug resistance mutations for HIV PR region
 ```ruby
 qc_seqhash.sdrm_hiv_pr(cut_off)
 ```
-## Known issues
-  1. ~~have a conflict with rails.~~
-  2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
-  3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
 ## Updates
+### Version-1.10.1
+1. Added quality filter for Illumina 2-color sequencing platforms (filter poly-G and poly-C)
+2. Replaced `MuscleBio` with [`VirustLocator`](https://github.com/ViralSeq/virust-locator-ruby) for faster and more accurate pairwise alignment.
+3. Added DR primer version 4.
+4. Added a helper function to properly treat input params for #hiv_seq_qc.
+5. Solved the slow-performance issue when spawning a subprocess to call `VirustLocator` while holding a large amount of data in memory. When Ruby runs a shell command, a child process is spawned that shares the parent's memory pages. To set this up, the OS has to walk the parent's entire memory table, causing an incremental delay in each subsequent spawn. To solve this, I redid the `VirustLocator` API so that all the arguments are processed with one shell command instead of spawning an individual child process for each call.
+### Version-1.9.1-12022024
+1. Fixed a bug in the `tcs_sdrm` pipeline.
 ### Version-1.9.0-11132024
-  1. `ViralSeq::TcsCore::validate_file_name` will not report errors when non-sequence data in the folder, instead these files will be ignored.
-  2. Rewrote the APIs for DRM analysis for HIV. Now uses version config files for the sequencing information and DRM list configure files for DRM interpretation. Two configure files are at located in `/lib/viral_seq/util/`
-  3. `tcs_sdrm` will take a second argument for DRM config versions. Currently supports `["v1", "v2", "v3"]`. Refer to the documentations of the APIs for the details.
-  4. Next update will use secondary command `tcs sdrm` to replace `tcs_sdrm`, and `tcs log` to replace `tcs_log`.
+1. `ViralSeq::TcsCore::validate_file_name` will not report errors when non-sequence data in the folder, instead these files will be ignored.
+2. Rewrote the APIs for DRM analysis for HIV. Now uses version config files for the sequencing information and DRM list configure files for DRM interpretation. Two configure files are at located in `/lib/viral_seq/util/`
+3. `tcs_sdrm` will take a second argument for DRM config versions. Currently supports `["v1", "v2", "v3"]`. Refer to the documentations of the APIs for the details.
+4. Next update will use secondary command `tcs sdrm` to replace `tcs_sdrm`, and `tcs log` to replace `tcs_log`.
 ### Version-1.8.1-06042024
-  1. Fixed a bug that causes `tcs_sdrm` pipeline to crash.
+1. Fixed a bug that causes `tcs_sdrm` pipeline to crash.
 ### Version-1.8.0-04052024
-  1. Use `muscle-v3.8.1` as default aligner because of the compatibility issues with `muscle-v5` on some platforms.
-  2. Adjust the end-join model for short insert (insert size less than read length substracted by adaptor size)
-  3. Add an option in the DR pipeline for different versions of the pipeline, default version as "v1".
-  4. Add Days Post Infection (DPI) prediction model in the SDRM pipeline.
-  5. Re-organize the R scripts as stand-alone R files.
-  6. Bug fix.
-  7. **NOT SOLVED**: to include versions of DR in reports
+1. Use `muscle-v3.8.1` as default aligner because of the compatibility issues with `muscle-v5` on some platforms.
+2. Adjust the end-join model for short insert (insert size less than read length substracted by adaptor size)
+3. Add an option in the DR pipeline for different versions of the pipeline, default version as "v1".
+4. Add Days Post Infection (DPI) prediction model in the SDRM pipeline.
+5. Re-organize the R scripts as stand-alone R files.
+6. Bug fix.
+7. **NOT SOLVED**: to include versions of DR in reports
 ### Version-1.7.1-05122023
-  1. Add a size check for the raw sequences. If the size smaller than the input params, error messages will be sent to users. IF the actual size is greater than the input params, extra bases will be truncated.
-  2. Now allows mismatch for the primer region sequences. Forward primer region allows 2 nt differences and cDNA primer region allows 3 nt differences.
-  3. Bug fix.
-  4. TCS version to 2.5.2
+1. Add a size check for the raw sequences. If the size smaller than the input params, error messages will be sent to users. IF the actual size is greater than the input params, extra bases will be truncated.
+2. Now allows mismatch for the primer region sequences. Forward primer region allows 2 nt differences and cDNA primer region allows 3 nt differences.
+3. Bug fix.
+4. TCS version to 2.5.2
 ### Version-1.7.0-08242022
-  1. Add warnings if `tcs` pipeline is excecuting through source instead of installing from `gem`.
-  2. Optimized `ViralSeq:SeqHash#a3g` hypermut algorithm. Allowing a external reference other than the sample reference.
+1. Add warnings if `tcs` pipeline is excecuting through source instead of installing from `gem`.
+2. Optimized `ViralSeq:SeqHash#a3g` hypermut algorithm. Allowing a external reference other than the sample reference.
 ### Version-1.6.4-07182022
-  1. Included region "P17" in the default `tcs -d` pipeline setting. `tcs` pipeline updated to version 2.5.1.
-  2. Loosen the locator params for the "V1V3" end region for rare alignment issues. Now the default "V1V3" region end with position 7205 to 7210 instead of 7208.
-  3. `tcs_sdrm` now analyse "P17" region for pairwise diversity.
+1. Included region "P17" in the default `tcs -d` pipeline setting. `tcs` pipeline updated to version 2.5.1.
+2. Loosen the locator params for the "V1V3" end region for rare alignment issues. Now the default "V1V3" region end with position 7205 to 7210 instead of 7208.
+3. `tcs_sdrm` now analyse "P17" region for pairwise diversity.
 ### Version-1.6.3-02052022
-  1. Updated on `ViralSeq::Muscle` module along with the update of `muscle` from version 3.8.1 to 5.1.
-  2. Optimized the `locator` algorithm based on `muscle` v5.1.
-  3. Optimized the `tcs_sdrm` pipeline based on `muscle` v5.1.
+1. Updated on `ViralSeq::Muscle` module along with the update of `muscle` from version 3.8.1 to 5.1.
+2. Optimized the `locator` algorithm based on `muscle` v5.1.
+3. Optimized the `tcs_sdrm` pipeline based on `muscle` v5.1.
 ### Version-1.6.1-02022022
-  1. Fixed the `nav bar` in tcs_log html file.
-  2. Fixed a typo in `tcs`.
+1. Fixed the `nav bar` in tcs_log html file.
+2. Fixed a typo in `tcs`.
 ### Version 1.6.0-01042022
-  1. Update the `ViralSeq::TcsCore::detection_limit` with pre-calculated values to save processing time.
-  2. Update `tcs` pipeline to v2.5.0. HTML report will generated after running `tcs_log` script after `tcs` pipeline.
+1. Update the `ViralSeq::TcsCore::detection_limit` with pre-calculated values to save processing time.
+2. Update `tcs` pipeline to v2.5.0. HTML report will generated after running `tcs_log` script after `tcs` pipeline.
 ### Version 1.5.0-01042022
-  1. Added a function to calcute detection limit/sensitivity for minority variants (R required). `ViralSeq::TcsCore::detection_limit`
-  2. Added a function to get a sub SeqHash object given a range of nt positions. `ViralSeq::SeqHash#nt_range`
-  3. Added a function to quality check dna sequences comparing with sample consensus for indels. `ViralSeq::SeqHash#qc_indel`
-  4. Added a function for DNA variant analysis. Return a Hash object that can output as a JSON file. `ViralSeq::SeqHash#nt_variants`
-  5. Added a function to check the size of sequences of a SeqHash object. `ViralSeq::SeqHash#check_nt_size`
+1. Added a function to calcute detection limit/sensitivity for minority variants (R required). `ViralSeq::TcsCore::detection_limit`
+2. Added a function to get a sub SeqHash object given a range of nt positions. `ViralSeq::SeqHash#nt_range`
+3. Added a function to quality check dna sequences comparing with sample consensus for indels. `ViralSeq::SeqHash#qc_indel`
+4. Added a function for DNA variant analysis. Return a Hash object that can output as a JSON file. `ViralSeq::SeqHash#nt_variants`
+5. Added a function to check the size of sequences of a SeqHash object. `ViralSeq::SeqHash#check_nt_size`
 ### Version 1.4.0-10132021
-  1. Added a function to calculate false detectionr rate (FDR, aka, Benjamini-Hochberg correction) for minority mutations detected in the sequences. `ViralSeq::SeqHash#fdr`
-  2. Updated `bin\tcs_sdrm` script to add FDR value to each DRMs detected.
+1. Added a function to calculate false detectionr rate (FDR, aka, Benjamini-Hochberg correction) for minority mutations detected in the sequences. `ViralSeq::SeqHash#fdr`
+2. Updated `bin\tcs_sdrm` script to add FDR value to each DRMs detected.
 ### Version 1.3.0-08302021
-  1. Fixed a bug in the `tcs` pipeline.
+1. Fixed a bug in the `tcs` pipeline.
 ### Version 1.2.9-08022021
-  1. Fixed a bug when reading the input primer sequences in lowercases.
-  2. Fixed a bug in the method ViralSeq::Math::RandomGaussian
+1. Fixed a bug when reading the input primer sequences in lowercases.
+2. Fixed a bug in the method ViralSeq::Math::RandomGaussian
 ### Version 1.2.8-07292021
-  1. Fixed an issue when reading .fastq files containing blank_lines.
+1. Fixed an issue when reading .fastq files containing blank_lines.
 ### Version 1.2.7-07152021
-  1. Optimzed the workflow of the `tcs` pipeline on raw data with uneven lengths.
-  `tcs` version to v2.3.6.
+1. Optimzed the workflow of the `tcs` pipeline on raw data with uneven lengths.
+   `tcs` version to v2.3.6.
 ### Version 1.2.6-07122021
-  1. Optimized the workflow of the `tcs` pipeline in the "end-join/QC/Trimming" section.
-  `tcs` version to v2.3.5.
+1. Optimized the workflow of the `tcs` pipeline in the "end-join/QC/Trimming" section.
+   `tcs` version to v2.3.5.
 ### Version 1.2.5-06232021
-  1. Add error rescue and report in the `tcs` pipeline.
-    error messages are stored in the .tcs_error file. `tcs` pipeline updated to v2.3.4.
-  2. Use simple majority for the consensus cut-off in the default setting of the `tcs -dr` pipeline.
+1. Add error rescue and report in the `tcs` pipeline.
+   error messages are stored in the .tcs_error file. `tcs` pipeline updated to v2.3.4.
+2. Use simple majority for the consensus cut-off in the default setting of the `tcs -dr` pipeline.
 ### Version 1.2.2-05272021
-  1. Fixed a bug in the `tcs` pipeline that sometimes causes `SystemStackError`.
-  `tcs` pipeline upgraded to v2.3.2
+1. Fixed a bug in the `tcs` pipeline that sometimes causes `SystemStackError`.
+   `tcs` pipeline upgraded to v2.3.2
 ### Version 1.2.1-05172021
-  1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
+1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
 ### Version 1.2.0-05102021
-  1. Added `tcs_sdrm` pipeline as an excutable.
-  `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
+1. Added `tcs_sdrm` pipeline as an excutable.
+   `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
-  2. Added function ViralSeq::SeqHash#sample.
+2. Added function ViralSeq::SeqHash#sample.
-  3. Added recency determining function `ViralSeq::Recency::define`
+3. Added recency determining function `ViralSeq::Recency::define`
-  4. Fixed a few bugs related to `tcs_sdrm`.
+4. Fixed a few bugs related to `tcs_sdrm`.
 ### Version 1.1.2-04262021
-  1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
-  2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
-  3. Added `--keep-original` flag to the `tcs` pipeline.
+1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
+2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
+3. Added `--keep-original` flag to the `tcs` pipeline.
 ### Version 1.1.1-04012021
-  1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
-  2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
-  If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
-  3. Added option `-dr` to the `tcs` script.
+1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
+2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
+   If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
+3. Added option `-dr` to the `tcs` script.
 ### Version 1.1.0-03252021
-  1. Optimized the algorithm of end-join.
-  2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
-  3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
-  4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
-  5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
-  Users can choose from 3 MiSeq platforms for processing their sequencing data.
-  MiSeq 300x7x300 is the default option.
+1. Optimized the algorithm of end-join.
+2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
+3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
+4. Added the preset of MPID-HIVDR params file [**_dr.json_**](./docs/dr.json) in /docs.
+5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
+   Users can choose from 3 MiSeq platforms for processing their sequencing data.
+   MiSeq 300x7x300 is the default option.
 ### Version 1.0.14-03052021
-  1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
+1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
 ### Version 1.0.13-03032021
-  1. Fixed the conflict with rails.
+1. Fixed the conflict with rails.
 ### Version 1.0.12-03032021
-  1. Fixed an issue that may cause conflicts with ActiveRecord.
+1. Fixed an issue that may cause conflicts with ActiveRecord.
 ### Version 1.0.11-03022021
-  1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
-  2. fixed an issue loading class 'OptionParser'in some ruby environments.
+1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
+2. fixed an issue loading class 'OptionParser'in some ruby environments.
 ### Version 1.0.10-11112020:
-  1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
-  2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
-  3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
-  4. a few optimizations.
-  5. TCS 2.1.0 delivered.
-  6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
+1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
+2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
+3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
+4. a few optimizations.
+5. TCS 2.1.0 delivered.
+6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
 ### Version 1.0.9-07182020:
-  1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
+1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
-  2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
+2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
 ### Version 1.0.8-02282020:
-  1. TCS pipeline (version 2.0.0) added as executable.
-      tcs  -  main TCS pipeline script.
-      tcs_json_generator  -  step-by-step script to generate json file for tcs pipeline.
+1. TCS pipeline (version 2.0.0) added as executable.
+   tcs - main TCS pipeline script.
+   tcs_json_generator - step-by-step script to generate json file for tcs pipeline.
-  2. Methods added:
-      ViralSeq::SeqHash#trim
+2. Methods added:
+   ViralSeq::SeqHash#trim
-  3. Bug fix for several methods.
+3. Bug fix for several methods.
 ### Version 1.0.7-01282020:
-  1. Several methods added, including
-      ViralSeq::SeqHash#error_table
-      ViralSeq::SeqHash#random_select
-  2. Improved performance for several functions.
+1. Several methods added, including
+   ViralSeq::SeqHash#error_table
+   ViralSeq::SeqHash#random_select
+2. Improved performance for several functions.
 ### Version 1.0.6-07232019:
-  1. Several methods added to ViralSeq::SeqHash, including
-      ViralSeq::SeqHash#size
-      ViralSeq::SeqHash#+
-      ViralSeq::SeqHash#write_nt_fa
-      ViralSeq::SeqHash#mutation
-  2. Update documentations and rspec samples.
+1. Several methods added to ViralSeq::SeqHash, including
+   ViralSeq::SeqHash#size
+   ViralSeq::SeqHash#+
+   ViralSeq::SeqHash#write_nt_fa
+   ViralSeq::SeqHash#mutation
+2. Update documentations and rspec samples.
 ### Version 1.0.5-07112019:
-  1. Update ViralSeq::SeqHash#sequence_locator.
-     Program will try to determine the direction (`+` or `-` of the query sequence)
-  2. update executable `locator` to have a column of `direction` in output .csv file
+1. Update ViralSeq::SeqHash#sequence_locator.
+   Program will try to determine the direction (`+` or `-` of the query sequence)
+2. update executable `locator` to have a column of `direction` in output .csv file
 ### Version 1.0.4-07102019:
-  1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
-  2. Fix bugs in bin `locator`
+1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
+2. Fix bugs in bin `locator`
 ### Version 1.0.3-07102019:
-  1. Bug fix.
+1. Bug fix.
 ### Version 1.0.2-07102019:
-  1. Fixed a gem loading issue.
+1. Fixed a gem loading issue.
 ### Version 1.0.1-07102019:
-  1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
-  2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
-  3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
-  4. update documentations
+1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
+2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
+3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
+4. update documentations
 ### Version 1.0.0-07092019:
-  1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
+1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
 ## Development
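
Note on item 5 of the Version-1.10.1 changelog: the fix it describes replaces many child processes with a single one. The sketch below illustrates that general pattern only. It uses `echo` as a hypothetical stand-in for the external tool and does not reproduce the actual `VirustLocator` API, which is not shown in this diff.

```ruby
require "open3"

# Hypothetical stand-in for an external command-line tool;
# `echo` simply prints its arguments back.
TOOL = "echo"

# One child process per input: the process-creation cost is paid N times.
def run_each(inputs)
  inputs.map do |input|
    out, _status = Open3.capture2(TOOL, input)
    out.strip
  end
end

# One child process for the whole batch: the cost is paid once.
def run_batch(inputs)
  out, _status = Open3.capture2(TOOL, *inputs)
  out.split
end

seqs = %w[ACGT GGCC TTAA]
per_call = run_each(seqs)
batched  = run_batch(seqs)
```

Both calls return the same results here; the difference is only in how many times the parent process has to be duplicated, which is where the changelog entry locates the slowdown.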

data/bin/locator CHANGED

@@ -38,7 +38,7 @@ def myparser
       options[:outfile] = o
     end
-    opts.on('-r', '--ref_option OPTION', "reference genome option, choose from #{"`HXB2` (default), `NL43`, `MAC239`".blue.bold}") do |o|
+    opts.on('-r', '--ref_option OPTION', "reference genome option, choose from #{"`HXB2` (default), `SIVmm239`".blue.bold}") do |o|
       options[:ref_option] = o.to_sym
     end
@@ -84,7 +84,7 @@ begin
   seqs = ViralSeq::SeqHash.fa(seq_file)
   opt =  options[:ref_option] ? options[:ref_option] : :HXB2
-  unless [:HXB2, :NL43, :MAC239].include? opt
+  unless [:HXB2, :SIVmm239].include? opt
     puts "Reference option `#{opt}` not recognized, using `HXB2` as the reference genome.".red.bold
     opt = :HXB2
   end
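
Note on the reference-option change: the accepted options shrink from `HXB2`, `NL43`, `MAC239` to `HXB2`, `SIVmm239`, with anything else falling back to `HXB2`. A minimal sketch of that fallback logic as a standalone function is below. `VALID_REFS` and `resolve_ref` are hypothetical names; the script itself does this inline and also prints a warning.

```ruby
# Reference options accepted after this change (taken from the hunk above).
VALID_REFS = %i[HXB2 SIVmm239].freeze

# Hypothetical helper: returns the requested reference as a Symbol, or :HXB2
# when the option is missing or not recognized.
def resolve_ref(option)
  opt = option ? option.to_sym : :HXB2
  VALID_REFS.include?(opt) ? opt : :HXB2
end
```

Under this sketch, callers that still pass `NL43` or `MAC239` are silently mapped to `:HXB2`, which matches the script's fallback branch apart from the warning it prints.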

data/bin/tcs CHANGED

@@ -27,9 +27,8 @@
 # run `tcs -j` to generate param json file.
 def gem_installed?(gem_name)
-  found_gem = false
   begin
-    found_gem = Gem::Specification.find_by_name(gem_name)
+    Gem::Specification.find_by_name(gem_name)
   rescue Gem::LoadError
     return false
   else
@@ -217,8 +216,8 @@ begin
     ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
   end
   primers.each do |primer|
     summary_json = {}
     summary_json[:warnings] = []
     summary_json[:tcs_version] = ViralSeq::TCS_VERSION
@@ -470,6 +469,34 @@ begin
       f.puts JSON.pretty_generate(pid_json)
     end
+    filter_r1 = nil
+    filter_r2 = nil
+    r1_passed_seq = nil
+    r2_passed_seq = nil
+    r1_temp = nil
+    r2_temp = nil
+    r1_temp_sh = nil
+    r2_temp_sh = nil
+    r1_consensus_filtered = nil
+    r2_consensus_filtered = nil
+    consensus_filtered = nil
+    pid_json = nil
+    consensus = nil
+    r1_seq = nil
+    r2_seq = nil
+    bio_r1 = nil
+    bio_r2 = nil
+    id = nil
+    primer_id_count = nil
+    primer_id_dis = nil
+    primer_id_list = nil
+    primer_id_count_over_n = nil
+    r1_sub_seq = nil
+    r2_sub_seq = nil
+    common_keys = nil
+    GC.start
     # start end-join
     def end_join(dir, option, overlap)
       shp = ViralSeq::SeqHashPair.fa(dir)
@@ -492,7 +519,6 @@ begin
     if primer[:end_join]
       log.puts Time.now.to_s + "\t" +  "Start end-pairing for TCS..."
-      shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
       joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
       log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
@@ -502,6 +528,11 @@ begin
         joined_sh_raw = end_join(out_dir_raw, primer[:end_join_option], primer[:overlap])
       end
+      joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
+      if export_raw
+        joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
+      end
       if primer[:TCS_QC]
         ref_start = primer[:ref_start]
         ref_end = primer[:ref_end]
@@ -513,42 +544,11 @@ begin
         if ref_end == 0
           ref_end = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
         end
-        if primer[:end_join_option] == 1
-          r1_sh = ViralSeq::SeqHash.fa(outfile_r1)
-          r2_sh = ViralSeq::SeqHash.fa(outfile_r2)
-          r1_sh = r1_sh.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
-          r2_sh = r2_sh.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
-          new_r1_seq = r1_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-          new_r2_seq = r2_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-          joined_seq = {}
-          new_r1_seq.each do |seq_name, seq|
-            next unless seq
-            next unless new_r2_seq[seq_name]
-            joined_seq[seq_name] = seq + new_r2_seq[seq_name]
-          end
-          joined_sh = ViralSeq::SeqHash.new(joined_seq)
-          if export_raw
-            r1_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r1)
-            r2_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r2)
-            r1_sh_raw = r1_sh_raw.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
-            r2_sh_raw = r2_sh_raw.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
-            new_r1_seq_raw = r1_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-            new_r2_seq_raw = r2_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-            joined_seq_raw = {}
-            new_r1_seq_raw.each do |seq_name, seq|
-              next unless seq
-              next unless new_r2_seq_raw[seq_name]
-              joined_seq_raw[seq_name] = seq + new_r2_seq_raw[seq_name]
-            end
-            joined_sh_raw = ViralSeq::SeqHash.new(joined_seq_raw)
-          end
-        else
-          joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
+        joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
-          if export_raw
-            joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
-          end
+        if export_raw
+          joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
         end
         log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
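
Note on the `gem_installed?` hunk at the top of this file's diff: the change drops an unused local variable and keeps the `begin`/`rescue`/`else` shape. An equivalent, more compact form is sketched below as one possible reading. The body of the original `else` branch is not shown in the hunk, so returning `true` there is an assumption.

```ruby
# Returns true when a gem with the given name is installed, false otherwise.
# Gem::Specification.find_by_name raises a Gem::LoadError subclass when the
# gem cannot be found, so the rescue turns that into a boolean.
def gem_installed?(gem_name)
  Gem::Specification.find_by_name(gem_name)
  true # assumption: the original `else` branch returns true
rescue Gem::LoadError
  false
end
```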

data/bin/tcs_sdrm CHANGED

@@ -215,8 +215,8 @@ libs.each do |lib|
         tag = data[0].split("_")[-1].gsub(/\W/,"")
         summary_hash[tag] += "," + data[1].to_f.round(4).to_s + "," + data[2].to_f.round(4).to_s
       end
-      regions << "V1V3"
-      regions.each do |region|
+      regions_for_summary = regions.dup.push("V1V3")
+      regions_for_summary.each do |region|
         next unless summary_hash[region]
         seq_summary_out.puts region.to_s + "," + summary_hash[region]
       end
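
Note on the `regions_for_summary` change: `regions << "V1V3"` mutates the array in place, so if `regions` is shared across iterations of `libs.each` (an assumption, since its definition is outside this hunk), it gains one extra `"V1V3"` per library. The sketch below shows the difference using hypothetical region and library names.

```ruby
regions = %w[PR RT IN] # hypothetical region list
libs    = %w[lib1 lib2] # hypothetical library names

# Before: `<<` mutates the shared array, so it grows on every iteration.
mutated = regions.dup
libs.each { mutated << "V1V3" }

# After: each iteration works on its own copy; the original stays untouched.
stable  = regions.dup
per_lib = libs.map { stable.dup.push("V1V3") }
```

After the loop, `mutated` holds two copies of `"V1V3"`, while `stable` is unchanged and every entry of `per_lib` ends with exactly one.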

data/lib/viral_seq/R.rb CHANGED

@@ -14,7 +14,9 @@ module ViralSeq
     # check if required R packages is installed.
     def self.check_R_packages
-      if system "Rscript #{File.join( ViralSeq.root, "viral_seq", "util", "check_env.r")}"
+      file = File.join(ViralSeq.root, "viral_seq", "util", "check_env.r")
+      safe_file = Shellwords.escape(file)
+      if system "Rscript #{safe_file}"
         return 0
       else
         raise "Non-zero exit code. Error happens when checking required R packages."
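
Note on the `Shellwords.escape` change: when `system` receives a single string, it may be passed through a shell, so an install path containing spaces or shell metacharacters would be split or interpreted. The sketch below shows what escaping does, using a hypothetical path with a space in it.

```ruby
require "shellwords"

path    = "/tmp/my dir/check_env.r" # hypothetical path containing a space
escaped = Shellwords.escape(path)
command = "Rscript #{escaped}"

# Splitting the command the way a POSIX shell would yields exactly two words:
# the program name and the original, unmodified path.
words = Shellwords.split(command)
```

An alternative with the same effect is the multi-argument form `system("Rscript", file)`, which passes the path as a single argument without involving a shell.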