RubyGems - viral_seq - Versions diffs - 1.9.1 → 1.10.0 - Mend

viral_seq 1.9.1 → 1.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

checksums.yaml +4 -4
data/Gemfile.lock +6 -1
data/README.md +130 -120
data/bin/locator +2 -2
data/bin/tcs +38 -38
data/lib/viral_seq/R.rb +3 -1
data/lib/viral_seq/seq_hash.rb +48 -12
data/lib/viral_seq/sequence.rb +22 -171
data/lib/viral_seq/string.rb +3 -6
data/lib/viral_seq/tcs_core.rb +4 -0
data/lib/viral_seq/tcs_dr.rb +82 -1
data/lib/viral_seq/util/drm_versions_config.json +52 -0
data/lib/viral_seq/version.rb +2 -2
data/lib/viral_seq.rb +2 -0
data/viral_seq.gemspec +5 -0
metadata +31 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 23960c6236e1230176c5517c4063bf1aa279597582915fca88b0e6ba9d649c75
-  data.tar.gz: 33965e6d03949d97be5fc5df871c355135c5cbe7124d5ec179f0e17ca4afe30b
+  metadata.gz: d940e5f465cba40def34166fe50e0a21b1c62a1fff8e0be8abdabb7b4c4aab77
+  data.tar.gz: 7e4be6ec82d9081a1ea3130eed49dcaac080608e481c7520b43c2e58a50e379d
 SHA512:
-  metadata.gz: 7cb163ef4f6fa0ccc486c39fdb34241228611a5ed7c4195ce982936cefbafb041713abcad3688bf8bf88fa326ba679e1f8eed7b8682d1e76ad527f755f71227a
-  data.tar.gz: de5f313ba3e76e987e1093dcb3d00bcb212fd5d0dd5118b2022c49807cc1a7ed8f4c92bbf9c2903dbcf7a6df5f654df52a19c21293eaa83709f72b7d30383e18
+  metadata.gz: 15805b09c96b6d1bff023a82948f23ceb584c60ffb21b85e59d6f4ddc2e2394045a29788a7c5811c714afedfae6405c36b88e0bcadce0d1408068418c497e596
+  data.tar.gz: '0871676e5ee49fa14f84ec3c109172d964efac18f3f104ec38ad52daa69b9ef85a935c35ca2377d261b38edc5d8d438469b2360ec791a902116f60c8daeef5c2'
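
Note: the SHA256 values above can be checked against a locally downloaded copy of the gem. The following is a minimal sketch, not part of viral_seq; `sha256_matches?` is a hypothetical helper name, and it assumes the `.gem` archive has already been unpacked so that `metadata.gz` and `data.tar.gz` exist as plain files.

```ruby
require "digest"

# Hypothetical helper (not part of viral_seq): returns true when the file's
# SHA256 digest equals the expected hex string, ignoring letter case.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

For example, `sha256_matches?("data.tar.gz", "7e4be6ec82d9081a1ea3130eed49dcaac080608e481c7520b43c2e58a50e379d")` should return `true` for an intact copy of the 1.10.0 archive, assuming the registry file matches this diff.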

data/Gemfile.lock CHANGED

@@ -1,12 +1,14 @@
 PATH
   remote: .
   specs:
-    viral_seq (1.9.0)
+    viral_seq (1.10.1)
       colorize (~> 0.1)
       combine_pdf (~> 1.0, >= 1.0.0)
       muscle_bio (= 0.4)
       prawn (~> 2.3, >= 2.3.0)
       prawn-table (~> 0.2, >= 0.2.0)
+      shellwords (~> 0.2)
+      virust-locator-ruby (~> 0.3)
 GEM
   remote: https://rubygems.org/
@@ -41,8 +43,11 @@ GEM
       rspec-support (~> 3.13.0)
     rspec-support (3.13.1)
     ruby-rc4 (0.1.5)
+    shellwords (0.2.0)
     ttfunk (1.8.0)
       bigdecimal (~> 3.1)
+    virust-locator-ruby (0.3.0)
+      shellwords (~> 0.2)
 PLATFORMS
   ruby
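
Note: both new dependencies use the pessimistic operator (`~> 0.2`, `~> 0.3`). As a small illustration of what `~> 0.2` accepts, here is a sketch using the RubyGems requirement classes; the version numbers other than `0.2.0` are made up for the example.

```ruby
require "rubygems"

# "~> 0.2" means ">= 0.2 and < 1.0" under RubyGems version rules.
requirement = Gem::Requirement.new("~> 0.2")

# Map a few candidate versions to whether the requirement accepts them.
# Only 0.2.0 appears in the lockfile; the others are illustrative.
accepted = %w[0.1.9 0.2.0 0.9.3 1.0.0].to_h do |v|
  [v, requirement.satisfied_by?(Gem::Version.new(v))]
end
```

Here `accepted` maps `"0.2.0"` and `"0.9.3"` to `true`, and `"0.1.9"` and `"1.0.0"` to `false`.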

data/README.md CHANGED

@@ -16,10 +16,10 @@ CLI tools `tcs`, `tcs_sdrm`, `tcs_log` and `locator` included in the gem.
 ## Illustration for the Primer ID Sequencing
 ![Primer ID Sequencing](./docs/assets/img/cover.jpg)
 ### Reference readings on the Primer ID sequencing
 [Explantion of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
 [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
 [Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
@@ -41,11 +41,13 @@ Required RubyGems version: >= 1.3.6
 ### Excutables
 ### `tcs`
 Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
 Web-based `tcs` analysis can be accessed at https://primer-id.org/
 Example commands:
 ```bash
     $ tcs -p params.json # run TCS pipeline with params.json
     $ tcs -p params.json -i DIRECTORY
@@ -61,12 +63,13 @@ Example commands:
 [sample params.json for the tcs-dr pipeline](./docs/dr.json)
 ---
 ### `tcs_log`
 Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs. This command generates log.html to visualize the sequencing runs.
 Example file structure:
 ```
 batch_tcs_jobs/
       ├── lib1
@@ -77,21 +80,25 @@ batch_tcs_jobs/
 ```
 Example command:
 ```bash
     $ tcs_log batch_tcs_jobs
 ```
 ---
 ### `tcs_sdrm`
 Use `tcs_sdrm` pipeline for HIV-1 drug resistance mutation and recency.
 Example command:
 ```bash
     $ tcs_sdrm libs_dir
 ```
 lib_dir file structure:
 ```
 libs_dir/
 ├── lib1
@@ -109,8 +116,8 @@ libs_dir/
 Output data in a new dir as 'libs_dir_SDRM'
 **Note: [R](https://www.r-project.org/) and the following R libraries are required:**
 - phangorn
 - ape
 - scales
@@ -122,11 +129,13 @@ Output data in a new dir as 'libs_dir_SDRM'
 ---
 ### `locator`
 Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
 ```bash
     $ locator -i sequence.fasta -o sequence.fasta.csv
 ```
 ---
 ## Some Examples
@@ -179,248 +188,249 @@ Examine for drug resistance mutations for HIV PR region
 ```ruby
 qc_seqhash.sdrm_hiv_pr(cut_off)
 ```
-## Known issues
-  1. ~~have a conflict with rails.~~
-  2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
-  3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
 ## Updates
+### Version-1.10.1
+1. Added quality filter for Illumina 2-color sequencing platforms (filter poly-G and poly-C)
+2. Replaced `MuscleBio` with [`VirustLocator`]("https://github.com/ViralSeq/virust-locator-ruby") for faster and more accurate pairwise alignment.
+3. Added DR primer version 4.
+4. Added a helper function to properly treat input params for #hiv_seq_qc.
+5. Solved the slow-performance issue when spawning a subprocess to call `VirustLocator` when holding a large amount of data in the momery. When Ruby run shell commands, a child process is spawned and share the parent's memory pages. To set it up, the OS has to walk the parent's entire memory table, causing an incremental delay in each subsequent process spawning. To solve this, I redid the `VirustLocator` API to allow all the arguments to be processed with one shell command instead of spawning individual child process.
 ### Version-1.9.1-12022024
-  1. Fixed a bug in the `tcs_sdrm` pipeline.
+1. Fixed a bug in the `tcs_sdrm` pipeline.
 ### Version-1.9.0-11132024
-  1. `ViralSeq::TcsCore::validate_file_name` will not report errors when non-sequence data in the folder, instead these files will be ignored.
-  2. Rewrote the APIs for DRM analysis for HIV. Now uses version config files for the sequencing information and DRM list configure files for DRM interpretation. Two configure files are at located in `/lib/viral_seq/util/`
-  3. `tcs_sdrm` will take a second argument for DRM config versions. Currently supports `["v1", "v2", "v3"]`. Refer to the documentations of the APIs for the details.
-  4. Next update will use secondary command `tcs sdrm` to replace `tcs_sdrm`, and `tcs log` to replace `tcs_log`.
+1. `ViralSeq::TcsCore::validate_file_name` will not report errors when non-sequence data in the folder, instead these files will be ignored.
+2. Rewrote the APIs for DRM analysis for HIV. Now uses version config files for the sequencing information and DRM list configure files for DRM interpretation. Two configure files are at located in `/lib/viral_seq/util/`
+3. `tcs_sdrm` will take a second argument for DRM config versions. Currently supports `["v1", "v2", "v3"]`. Refer to the documentations of the APIs for the details.
+4. Next update will use secondary command `tcs sdrm` to replace `tcs_sdrm`, and `tcs log` to replace `tcs_log`.
 ### Version-1.8.1-06042024
-  1. Fixed a bug that causes `tcs_sdrm` pipeline to crash.
+1. Fixed a bug that causes `tcs_sdrm` pipeline to crash.
 ### Version-1.8.0-04052024
-  1. Use `muscle-v3.8.1` as default aligner because of the compatibility issues with `muscle-v5` on some platforms.
-  2. Adjust the end-join model for short insert (insert size less than read length substracted by adaptor size)
-  3. Add an option in the DR pipeline for different versions of the pipeline, default version as "v1".
-  4. Add Days Post Infection (DPI) prediction model in the SDRM pipeline.
-  5. Re-organize the R scripts as stand-alone R files.
-  6. Bug fix.
-  7. **NOT SOLVED**: to include versions of DR in reports
+1. Use `muscle-v3.8.1` as default aligner because of the compatibility issues with `muscle-v5` on some platforms.
+2. Adjust the end-join model for short insert (insert size less than read length substracted by adaptor size)
+3. Add an option in the DR pipeline for different versions of the pipeline, default version as "v1".
+4. Add Days Post Infection (DPI) prediction model in the SDRM pipeline.
+5. Re-organize the R scripts as stand-alone R files.
+6. Bug fix.
+7. **NOT SOLVED**: to include versions of DR in reports
 ### Version-1.7.1-05122023
-  1. Add a size check for the raw sequences. If the size smaller than the input params, error messages will be sent to users. IF the actual size is greater than the input params, extra bases will be truncated.
-  2. Now allows mismatch for the primer region sequences. Forward primer region allows 2 nt differences and cDNA primer region allows 3 nt differences.
-  3. Bug fix.
-  4. TCS version to 2.5.2
+1. Add a size check for the raw sequences. If the size smaller than the input params, error messages will be sent to users. IF the actual size is greater than the input params, extra bases will be truncated.
+2. Now allows mismatch for the primer region sequences. Forward primer region allows 2 nt differences and cDNA primer region allows 3 nt differences.
+3. Bug fix.
+4. TCS version to 2.5.2
 ### Version-1.7.0-08242022
-  1. Add warnings if `tcs` pipeline is excecuting through source instead of installing from `gem`.
-  2. Optimized `ViralSeq:SeqHash#a3g` hypermut algorithm. Allowing a external reference other than the sample reference.
+1. Add warnings if `tcs` pipeline is excecuting through source instead of installing from `gem`.
+2. Optimized `ViralSeq:SeqHash#a3g` hypermut algorithm. Allowing a external reference other than the sample reference.
 ### Version-1.6.4-07182022
-  1. Included region "P17" in the default `tcs -d` pipeline setting. `tcs` pipeline updated to version 2.5.1.
-  2. Loosen the locator params for the "V1V3" end region for rare alignment issues. Now the default "V1V3" region end with position 7205 to 7210 instead of 7208.
-  3. `tcs_sdrm` now analyse "P17" region for pairwise diversity.
+1. Included region "P17" in the default `tcs -d` pipeline setting. `tcs` pipeline updated to version 2.5.1.
+2. Loosen the locator params for the "V1V3" end region for rare alignment issues. Now the default "V1V3" region end with position 7205 to 7210 instead of 7208.
+3. `tcs_sdrm` now analyse "P17" region for pairwise diversity.
 ### Version-1.6.3-02052022
-  1. Updated on `ViralSeq::Muscle` module along with the update of `muscle` from version 3.8.1 to 5.1.
-  2. Optimized the `locator` algorithm based on `muscle` v5.1.
-  3. Optimized the `tcs_sdrm` pipeline based on `muscle` v5.1.
+1. Updated on `ViralSeq::Muscle` module along with the update of `muscle` from version 3.8.1 to 5.1.
+2. Optimized the `locator` algorithm based on `muscle` v5.1.
+3. Optimized the `tcs_sdrm` pipeline based on `muscle` v5.1.
 ### Version-1.6.1-02022022
-  1. Fixed the `nav bar` in tcs_log html file.
-  2. Fixed a typo in `tcs`.
+1. Fixed the `nav bar` in tcs_log html file.
+2. Fixed a typo in `tcs`.
 ### Version 1.6.0-01042022
-  1. Update the `ViralSeq::TcsCore::detection_limit` with pre-calculated values to save processing time.
-  2. Update `tcs` pipeline to v2.5.0. HTML report will generated after running `tcs_log` script after `tcs` pipeline.
+1. Update the `ViralSeq::TcsCore::detection_limit` with pre-calculated values to save processing time.
+2. Update `tcs` pipeline to v2.5.0. HTML report will generated after running `tcs_log` script after `tcs` pipeline.
 ### Version 1.5.0-01042022
-  1. Added a function to calcute detection limit/sensitivity for minority variants (R required). `ViralSeq::TcsCore::detection_limit`
-  2. Added a function to get a sub SeqHash object given a range of nt positions. `ViralSeq::SeqHash#nt_range`
-  3. Added a function to quality check dna sequences comparing with sample consensus for indels. `ViralSeq::SeqHash#qc_indel`
-  4. Added a function for DNA variant analysis. Return a Hash object that can output as a JSON file. `ViralSeq::SeqHash#nt_variants`
-  5. Added a function to check the size of sequences of a SeqHash object. `ViralSeq::SeqHash#check_nt_size`
+1. Added a function to calcute detection limit/sensitivity for minority variants (R required). `ViralSeq::TcsCore::detection_limit`
+2. Added a function to get a sub SeqHash object given a range of nt positions. `ViralSeq::SeqHash#nt_range`
+3. Added a function to quality check dna sequences comparing with sample consensus for indels. `ViralSeq::SeqHash#qc_indel`
+4. Added a function for DNA variant analysis. Return a Hash object that can output as a JSON file. `ViralSeq::SeqHash#nt_variants`
+5. Added a function to check the size of sequences of a SeqHash object. `ViralSeq::SeqHash#check_nt_size`
 ### Version 1.4.0-10132021
-  1. Added a function to calculate false detectionr rate (FDR, aka, Benjamini-Hochberg correction) for minority mutations detected in the sequences. `ViralSeq::SeqHash#fdr`
-  2. Updated `bin\tcs_sdrm` script to add FDR value to each DRMs detected.
+1. Added a function to calculate false detectionr rate (FDR, aka, Benjamini-Hochberg correction) for minority mutations detected in the sequences. `ViralSeq::SeqHash#fdr`
+2. Updated `bin\tcs_sdrm` script to add FDR value to each DRMs detected.
 ### Version 1.3.0-08302021
-  1. Fixed a bug in the `tcs` pipeline.
+1. Fixed a bug in the `tcs` pipeline.
 ### Version 1.2.9-08022021
-  1. Fixed a bug when reading the input primer sequences in lowercases.
-  2. Fixed a bug in the method ViralSeq::Math::RandomGaussian
+1. Fixed a bug when reading the input primer sequences in lowercases.
+2. Fixed a bug in the method ViralSeq::Math::RandomGaussian
 ### Version 1.2.8-07292021
-  1. Fixed an issue when reading .fastq files containing blank_lines.
+1. Fixed an issue when reading .fastq files containing blank_lines.
 ### Version 1.2.7-07152021
-  1. Optimzed the workflow of the `tcs` pipeline on raw data with uneven lengths.
-  `tcs` version to v2.3.6.
+1. Optimzed the workflow of the `tcs` pipeline on raw data with uneven lengths.
+   `tcs` version to v2.3.6.
 ### Version 1.2.6-07122021
-  1. Optimized the workflow of the `tcs` pipeline in the "end-join/QC/Trimming" section.
-  `tcs` version to v2.3.5.
+1. Optimized the workflow of the `tcs` pipeline in the "end-join/QC/Trimming" section.
+   `tcs` version to v2.3.5.
 ### Version 1.2.5-06232021
-  1. Add error rescue and report in the `tcs` pipeline.
-    error messages are stored in the .tcs_error file. `tcs` pipeline updated to v2.3.4.
-  2. Use simple majority for the consensus cut-off in the default setting of the `tcs -dr` pipeline.
+1. Add error rescue and report in the `tcs` pipeline.
+   error messages are stored in the .tcs_error file. `tcs` pipeline updated to v2.3.4.
+2. Use simple majority for the consensus cut-off in the default setting of the `tcs -dr` pipeline.
 ### Version 1.2.2-05272021
-  1. Fixed a bug in the `tcs` pipeline that sometimes causes `SystemStackError`.
-  `tcs` pipeline upgraded to v2.3.2
+1. Fixed a bug in the `tcs` pipeline that sometimes causes `SystemStackError`.
+   `tcs` pipeline upgraded to v2.3.2
 ### Version 1.2.1-05172021
-  1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
+1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
 ### Version 1.2.0-05102021
-  1. Added `tcs_sdrm` pipeline as an excutable.
-  `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
+1. Added `tcs_sdrm` pipeline as an excutable.
+   `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
-  2. Added function ViralSeq::SeqHash#sample.
+2. Added function ViralSeq::SeqHash#sample.
-  3. Added recency determining function `ViralSeq::Recency::define`
+3. Added recency determining function `ViralSeq::Recency::define`
-  4. Fixed a few bugs related to `tcs_sdrm`.
+4. Fixed a few bugs related to `tcs_sdrm`.
 ### Version 1.1.2-04262021
-  1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
-  2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
-  3. Added `--keep-original` flag to the `tcs` pipeline.
+1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
+2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
+3. Added `--keep-original` flag to the `tcs` pipeline.
 ### Version 1.1.1-04012021
-  1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
-  2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
-  If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
-  3. Added option `-dr` to the `tcs` script.
+1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
+2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
+   If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
+3. Added option `-dr` to the `tcs` script.
 ### Version 1.1.0-03252021
-  1. Optimized the algorithm of end-join.
-  2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
-  3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
-  4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
-  5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
-  Users can choose from 3 MiSeq platforms for processing their sequencing data.
-  MiSeq 300x7x300 is the default option.
+1. Optimized the algorithm of end-join.
+2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
+3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
+4. Added the preset of MPID-HIVDR params file [**_dr.json_**](./docs/dr.json) in /docs.
+5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
+   Users can choose from 3 MiSeq platforms for processing their sequencing data.
+   MiSeq 300x7x300 is the default option.
 ### Version 1.0.14-03052021
-  1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
+1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
 ### Version 1.0.13-03032021
-  1. Fixed the conflict with rails.
+1. Fixed the conflict with rails.
 ### Version 1.0.12-03032021
-  1. Fixed an issue that may cause conflicts with ActiveRecord.
+1. Fixed an issue that may cause conflicts with ActiveRecord.
 ### Version 1.0.11-03022021
-  1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
-  2. fixed an issue loading class 'OptionParser'in some ruby environments.
+1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
+2. fixed an issue loading class 'OptionParser'in some ruby environments.
 ### Version 1.0.10-11112020:
-  1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
-  2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
-  3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
-  4. a few optimizations.
-  5. TCS 2.1.0 delivered.
-  6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
+1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
+2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
+3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
+4. a few optimizations.
+5. TCS 2.1.0 delivered.
+6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
 ### Version 1.0.9-07182020:
-  1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
+1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
-  2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
+2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
 ### Version 1.0.8-02282020:
-  1. TCS pipeline (version 2.0.0) added as executable.
-      tcs  -  main TCS pipeline script.
-      tcs_json_generator  -  step-by-step script to generate json file for tcs pipeline.
+1. TCS pipeline (version 2.0.0) added as executable.
+   tcs - main TCS pipeline script.
+   tcs_json_generator - step-by-step script to generate json file for tcs pipeline.
-  2. Methods added:
-      ViralSeq::SeqHash#trim
+2. Methods added:
+   ViralSeq::SeqHash#trim
-  3. Bug fix for several methods.
+3. Bug fix for several methods.
 ### Version 1.0.7-01282020:
-  1. Several methods added, including
-      ViralSeq::SeqHash#error_table
-      ViralSeq::SeqHash#random_select
-  2. Improved performance for several functions.
+1. Several methods added, including
+   ViralSeq::SeqHash#error_table
+   ViralSeq::SeqHash#random_select
+2. Improved performance for several functions.
 ### Version 1.0.6-07232019:
-  1. Several methods added to ViralSeq::SeqHash, including
-      ViralSeq::SeqHash#size
-      ViralSeq::SeqHash#+
-      ViralSeq::SeqHash#write_nt_fa
-      ViralSeq::SeqHash#mutation
-  2. Update documentations and rspec samples.
+1. Several methods added to ViralSeq::SeqHash, including
+   ViralSeq::SeqHash#size
+   ViralSeq::SeqHash#+
+   ViralSeq::SeqHash#write_nt_fa
+   ViralSeq::SeqHash#mutation
+2. Update documentations and rspec samples.
 ### Version 1.0.5-07112019:
-  1. Update ViralSeq::SeqHash#sequence_locator.
-     Program will try to determine the direction (`+` or `-` of the query sequence)
-  2. update executable `locator` to have a column of `direction` in output .csv file
+1. Update ViralSeq::SeqHash#sequence_locator.
+   Program will try to determine the direction (`+` or `-` of the query sequence)
+2. update executable `locator` to have a column of `direction` in output .csv file
 ### Version 1.0.4-07102019:
-  1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
-  2. Fix bugs in bin `locator`
+1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
+2. Fix bugs in bin `locator`
 ### Version 1.0.3-07102019:
-  1. Bug fix.
+1. Bug fix.
 ### Version 1.0.2-07102019:
-  1. Fixed a gem loading issue.
+1. Fixed a gem loading issue.
 ### Version 1.0.1-07102019:
-  1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
-  2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
-  3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
-  4. update documentations
+1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
+2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
+3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
+4. update documentations
 ### Version 1.0.0-07092019:
-  1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
+1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
 ## Development
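
Note: item 5 of the Version-1.10.1 entry in the README diff describes replacing many subprocess spawns with a single call. The sketch below illustrates only that batching idea. It does not use `VirustLocator`: a child Ruby process that echoes its arguments stands in for the external aligner, and `run_batched` and `run_one_by_one` are hypothetical names.

```ruby
require "open3"
require "rbconfig"

# Stand-in for an external tool: a child Ruby process that prints each
# argument on its own line. This is NOT the VirustLocator API.
ECHO_ARGS = [RbConfig.ruby, "-e", "puts ARGV"].freeze

# One child process handles every sequence in a single spawn.
def run_batched(sequences)
  out, status = Open3.capture2(*ECHO_ARGS, *sequences)
  raise "child process failed" unless status.success?
  out.split("\n")
end

# One child process per sequence: the spawn cost is paid on every iteration.
def run_one_by_one(sequences)
  sequences.map do |seq|
    out, status = Open3.capture2(*ECHO_ARGS, seq)
    raise "child process failed" unless status.success?
    out.chomp
  end
end
```

Both methods return the same results; the batched form pays the process-creation cost once rather than once per sequence. A very large batch can run into the operating system's limit on argument length, so a real implementation may need to chunk the input.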

data/bin/locator CHANGED

@@ -38,7 +38,7 @@ def myparser
       options[:outfile] = o
     end
-    opts.on('-r', '--ref_option OPTION', "reference genome option, choose from #{"`HXB2` (default), `NL43`, `MAC239`".blue.bold}") do |o|
+    opts.on('-r', '--ref_option OPTION', "reference genome option, choose from #{"`HXB2` (default), `SIVmm239`".blue.bold}") do |o|
       options[:ref_option] = o.to_sym
     end
@@ -84,7 +84,7 @@ begin
   seqs = ViralSeq::SeqHash.fa(seq_file)
   opt =  options[:ref_option] ? options[:ref_option] : :HXB2
-  unless [:HXB2, :NL43, :MAC239].include? opt
+  unless [:HXB2, :SIVmm239].include? opt
     puts "Reference option `#{opt}` not recognized, using `HXB2` as the reference genome.".red.bold
     opt = :HXB2
   end
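
Note: the fallback in the second hunk can be read in isolation as follows. This is a minimal sketch, assuming the accepted options are exactly the two symbols in the new list; `resolve_ref_option` and `ALLOWED_REFS` are hypothetical names that do not appear in the gem.

```ruby
# Hypothetical names (not in viral_seq); mirrors the fallback in bin/locator.
ALLOWED_REFS = [:HXB2, :SIVmm239].freeze

# Returns the requested reference as a Symbol, or :HXB2 when the option
# is missing or not in the allowed list.
def resolve_ref_option(raw)
  opt = raw ? raw.to_sym : :HXB2
  return opt if ALLOWED_REFS.include?(opt)

  warn "Reference option `#{opt}` not recognized, using `HXB2` as the reference genome."
  :HXB2
end
```

With the new list, callers that still pass `NL43` or `MAC239` get the warning and fall back to `HXB2` rather than failing.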

data/bin/tcs CHANGED

@@ -27,9 +27,8 @@
 # run `tcs -j` to generate param json file.
 def gem_installed?(gem_name)
-  found_gem = false
   begin
-    found_gem = Gem::Specification.find_by_name(gem_name)
+    Gem::Specification.find_by_name(gem_name)
   rescue Gem::LoadError
     return false
   else
@@ -217,8 +216,8 @@ begin
     ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
   end
   primers.each do |primer|
     summary_json = {}
     summary_json[:warnings] = []
     summary_json[:tcs_version] = ViralSeq::TCS_VERSION
@@ -470,6 +469,34 @@ begin
       f.puts JSON.pretty_generate(pid_json)
     end
+    filter_r1 = nil
+    filter_r2 = nil
+    r1_passed_seq = nil
+    r2_passed_seq = nil
+    r1_temp = nil
+    r2_temp = nil
+    r1_temp_sh = nil
+    r2_temp_sh = nil
+    r1_consensus_filtered = nil
+    r2_consensus_filtered = nil
+    consensus_filtered = nil
+    pid_json = nil
+    consensus = nil
+    r1_seq = nil
+    r2_seq = nil
+    bio_r1 = nil
+    bio_r2 = nil
+    id = nil
+    primer_id_count = nil
+    primer_id_dis = nil
+    primer_id_list = nil
+    primer_id_count_over_n = nil
+    r1_sub_seq = nil
+    r2_sub_seq = nil
+    common_keys = nil
+    GC.start
     # start end-join
     def end_join(dir, option, overlap)
       shp = ViralSeq::SeqHashPair.fa(dir)
@@ -492,7 +519,6 @@ begin
     if primer[:end_join]
       log.puts Time.now.to_s + "\t" +  "Start end-pairing for TCS..."
-      shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
       joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
       log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
@@ -502,6 +528,11 @@ begin
         joined_sh_raw = end_join(out_dir_raw, primer[:end_join_option], primer[:overlap])
       end
+      joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
+      if export_raw
+        joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
+      end
       if primer[:TCS_QC]
         ref_start = primer[:ref_start]
         ref_end = primer[:ref_end]
@@ -513,42 +544,11 @@ begin
         if ref_end == 0
           ref_end = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
         end
-        if primer[:end_join_option] == 1
-          r1_sh = ViralSeq::SeqHash.fa(outfile_r1)
-          r2_sh = ViralSeq::SeqHash.fa(outfile_r2)
-          r1_sh = r1_sh.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
-          r2_sh = r2_sh.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
-          new_r1_seq = r1_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-          new_r2_seq = r2_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-          joined_seq = {}
-          new_r1_seq.each do |seq_name, seq|
-            next unless seq
-            next unless new_r2_seq[seq_name]
-            joined_seq[seq_name] = seq + new_r2_seq[seq_name]
-          end
-          joined_sh = ViralSeq::SeqHash.new(joined_seq)
-          if export_raw
-            r1_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r1)
-            r2_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r2)
-            r1_sh_raw = r1_sh_raw.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
-            r2_sh_raw = r2_sh_raw.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
-            new_r1_seq_raw = r1_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-            new_r2_seq_raw = r2_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
-            joined_seq_raw = {}
-            new_r1_seq_raw.each do |seq_name, seq|
-              next unless seq
-              next unless new_r2_seq_raw[seq_name]
-              joined_seq_raw[seq_name] = seq + new_r2_seq_raw[seq_name]
-            end
-            joined_sh_raw = ViralSeq::SeqHash.new(joined_seq_raw)
-          end
-        else
-          joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
+        joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
-          if export_raw
-            joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
-          end
+        if export_raw
+          joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
         end
         log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
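
Note: the first hunk of this file drops the unused `found_gem` variable from `gem_installed?`. The `else` branch is cut off in the diff, so the sketch below is a standalone equivalent, assuming the method is meant to return `true` when the lookup succeeds.

```ruby
require "rubygems"

# Standalone sketch, not the gem's exact code. Gem::Specification.find_by_name
# raises Gem::LoadError (or a subclass) when no installed gem matches the name.
def gem_installed?(gem_name)
  Gem::Specification.find_by_name(gem_name)
  true # assumption: the elided `else` branch returns true
rescue Gem::LoadError
  false
end
```

The return value of `find_by_name` is never needed here; only whether it raises matters, which is why the assignment could be removed.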

data/lib/viral_seq/R.rb CHANGED

@@ -14,7 +14,9 @@ module ViralSeq
     # check if required R packages is installed.
     def self.check_R_packages
-      if system "Rscript #{File.join( ViralSeq.root, "viral_seq", "util", "check_env.r")}"
+      file = File.join(ViralSeq.root, "viral_seq", "util", "check_env.r")
+      safe_file = Shellwords.escape(file)
+      if system "Rscript #{safe_file}"
         return 0
       else
         raise "Non-zero exit code. Error happens when checking required R packages."

data/lib/viral_seq/seq_hash.rb CHANGED

@@ -656,7 +656,7 @@ module ViralSeq
     def nt_variants
       return_obj = {}
-      nt_hash = self.dna_hash
       tcs_number = self.size
       dl = ViralSeq::TcsCore.detection_limit(tcs_number)
       fdr_hash = self.fdr
@@ -869,7 +869,7 @@ module ViralSeq
     # @param start_nt [Integer,Range,Array] start nt position(s) on the refernce genome, can be single number (Integer) or a range of Integers (Range), or an Array
     # @param end_nt [Integer,Range,Array] end nt position(s) on the refernce genome,can be single number (Integer) or a range of Integers (Range), or an Array
     # @param indel [Boolean] allow indels or not, `ture` or `false`
-    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:NL43`, `:MAC239`
+    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:SIVmm239`
     # @param path_to_muscle [String], path to the muscle executable, if not provided, use MuscleBio to run Muscle
     # @return [ViralSeq::SeqHash] a new ViralSeq::SeqHash object with only the sequences that meet the QC criterias
     # @example QC for sequences in a FASTA files
@@ -880,17 +880,19 @@ module ViralSeq
     #   filtered_seqhash.dna_hash.size
     #   => 4
-    def hiv_seq_qc(start_nt, end_nt, indel=true, ref_option = :HXB2, path_to_muscle = false)
-      start_nt = start_nt..start_nt if start_nt.is_a?(Integer)
-      end_nt = end_nt..end_nt if end_nt.is_a?(Integer)
+    def hiv_seq_qc(start_nt, end_nt, indel=true, ref_option = :HXB2)
+      start_nt = position_helper(start_nt)
+      end_nt = position_helper(end_nt)
       seq_hash = self.dna_hash.dup
       seq_hash_unique = seq_hash.values.uniq
       seq_hash_unique_pass = []
-      seq_hash_unique.each do |seq|
-        next if seq.nil?
-        loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
-        next unless loc # if locator tool fails, skip this seq.
+      batch_locator = VirustLocator::Locator.exec(seq_hash_unique.join("\s"), "nt", 1, ref_option).split("\n")
+      seq_hash_unique.each_with_index do |seq, i|
+        loc = batch_locator[i]
+        loc = locator_helper(loc)
+        next unless loc
         if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
           if indel
             seq_hash_unique_pass << seq
@@ -898,8 +900,11 @@ module ViralSeq
             seq_hash_unique_pass << seq
           end
         end
       end
       seq_pass = []
       seq_hash_unique_pass.each do |seq|
         seq_hash.each do |seq_name, orginal_seq|
           if orginal_seq == seq
@@ -909,10 +914,10 @@ module ViralSeq
         end
       end
       self.sub(seq_pass)
-    end # end of #hiv_seq_qc
+    end # end of #hiv_seq_qc # end of #hiv_seq_qc
     # sequence locator for SeqHash object, resembling HIV Sequence Locator from LANL
-    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:NL43`, `:MAC239`
+    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:SIVmm239`
     # @return [Array] two dimensional array `[[],[],[],...]` for each sequence, including the following information:
     #
     #     title of the SeqHash object (String)
@@ -1341,7 +1346,7 @@ module ViralSeq
       seq_hash_unique = seq_hash.uniq_hash
       trimmed_seq_hash = {}
       seq_hash_unique.each do |seq, names|
-        trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option, path_to_muscle).dna
+        trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option).dna
         names.each do |name|
           trimmed_seq_hash[name] = trimmed_seq
         end
@@ -1431,6 +1436,37 @@ module ViralSeq
       var_count.sort_by{|key,_value|key}.to_h
     end # end of #varaint_for_poisson
+    # helper for start/end position for #hiv_seq_qc
+    def position_helper(position)
+      if position.is_a?(Range)
+        return position
+      elsif position.is_a?(Integer)
+        return position..position
+      elsif position.is_a?(String)
+        return position.to_i..position.to_i
+      elsif position.is_a?(Array)
+        return position[0].to_i..position[1].to_i
+      else
+        raise "Position #{position} not recognized"
+      end
+    end # position_helper
+    # helper for batch locator
+    # @param loc [String] the output of batch locator
+    # @return [Array] the locator information in an array
+    def locator_helper(loc)
+      loc = loc.split("\t")
+      loc[0] = loc[0].to_i
+      loc[1] = loc[1].to_i
+      loc[2] = loc[2].to_f.round(1)
+      if loc[3].to_s.downcase == "true"
+        loc[3] = true
+      else
+        loc[3] = false
+      end
+      return loc
+    end
   end # end of SeqHash
 end # end of ViralSeq
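
Note on the two helpers added above: `position_helper` normalizes the start/end arguments to a `Range`, and `locator_helper` converts one tab-separated locator line into typed fields. The sketch below restates that logic as standalone methods so it can be run without the gem. The method names `normalize_position` and `parse_locator_line` and the sample line are hypothetical; the real output format of the batch locator is assumed from the `split("\t")` call in the diff.

```ruby
# Standalone restatement of the position_helper logic (illustrative only).
def normalize_position(position)
  case position
  when Range   then position
  when Integer then position..position
  when String  then position.to_i..position.to_i
  when Array   then position[0].to_i..position[1].to_i
  else raise "Position #{position} not recognized"
  end
end

# Standalone restatement of the locator_helper logic (illustrative only).
def parse_locator_line(line)
  fields = line.split("\t")
  fields[0] = fields[0].to_i            # start position on the reference
  fields[1] = fields[1].to_i            # end position on the reference
  fields[2] = fields[2].to_f.round(1)   # percent similarity
  fields[3] = fields[3].to_s.downcase == "true" # indel flag
  fields
end

# Hypothetical locator line: start, end, similarity, indel, two alignment strings.
sample = "2333\t2433\t97.03\tfalse\tAGC\tAGC"
loc = parse_locator_line(sample)

start_range = normalize_position(2333)
end_range   = normalize_position([2430, 2435])
passes = start_range.include?(loc[0]) && end_range.include?(loc[1])
puts passes
```

Note that the `Array` branch uses only the first two elements as range bounds, so an array is treated as `[first, last]` rather than as a list of allowed positions.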

data/lib/viral_seq/sequence.rb CHANGED

@@ -165,7 +165,7 @@ module ViralSeq
     # HIV sequence locator function, resembling HIV Sequence Locator from LANL
     #   # current version only supports nucleotide sequence, not for amino acid sequence.
-    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:NL43`, `:MAC239`
+    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:SIVmm239`
     # @param path_to_muscle [String], path to the muscle executable, if not provided, use MuscleBio to run Muscle
     # @return [Array] an array of the following info:
     #
@@ -181,182 +181,32 @@ module ViralSeq
     #
     #   aligned_reference_sequence (String)
     #
-    # @example identify the location of the input sequence on the NL43 genome
+    # @example identify the location of the input sequence on the HXB2 genome
     #   sequence = 'AGCAGATGATACAGTATTAGAAGAAATAAATTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAATATGATC'
     #   s = ViralSeq::Sequence.new('my_sequence', sequence)
-    #   loc = s.locator(:NL43)
-    #   h = ViralSeq::SeqHash.new; h.dna_hash['NL43'] = loc[5]; h.dna_hash[s.name] = loc[4]
+    #   loc = s.locator(:HXB2)
+    #   h = ViralSeq::SeqHash.new; h.dna_hash['HXB2'] = loc[5]; h.dna_hash[s.name] = loc[4]
     #   rs_string = h.to_rsphylip.split("\n")[1..-1].join("\n") # get a relaxed phylip format string for display of alignment.
-    #   puts "The input sequence \"#{s.name}\" is located on the NL43 nt sequence from #{loc[0].to_s} to #{loc[1].to_s}.\nIt is #{loc[2].to_s}% similar to the reference.\nIt #{loc[3]? "does" : "does not"} have indels.\nThe alignment is\n#{rs_string}"
-    #   => The input sequence "my_sequence" is located on the NL43 nt sequence from 2333 to 2433.
-    #   => It is 98.0% similar to the reference.
+    #   puts "The input sequence \"#{s.name}\" is located on the HXB2 nt sequence from #{loc[0].to_s} to #{loc[1].to_s}.\nIt is #{loc[2].round(1).to_s}% similar to the reference.\nIt #{loc[3]? "does" : "does not"} have indels.\nThe alignment is\n#{rs_string}"
+    #   => The input sequence "my_sequence" is located on the HXB2 nt sequence from 2333 to 2433.
+    #   => It is 97.0% similar to the reference.
     #   => It does not have indels.
     #   => The alignment is
-    #   => NL43         AGCAGATGAT ACAGTATTAG AAGAAATGAA TTTGCCAGGA AGATGGAAAC CAAAAATGAT AGGGGGAATT GGAGGTTTTA TCAAAGTAAG ACAGTATGAT C
+    #   => HXB2         AGCAGATGAT ACAGTATTAG AAGAAATGAA TTTGCCAGGA AGATGGAAAC CAAAAATGAT AGGGGGAATT GGAGGTTTTA TCAAAGTAAG ACAGTATGAT C
     #   => my_sequence  AGCAGATGAT ACAGTATTAG AAGAAATAAA TTTGCCAGGA AGATGGAAAC CAAAAATGAT AGGGGGAATT GGAGGTTTTA TCAAAGTAAG ACAATATGAT C
     # @see https://www.hiv.lanl.gov/content/sequence/LOCATE/locate.html LANL Sequence Locator
-    def locator(ref_option = :HXB2, path_to_muscle = false)
+    def locator(ref_option = :HXB2, algorithm = 1)
       seq = self.dna
-      ori_ref = ViralSeq::RefSeq.get(ref_option)
+      ref = ref_option.to_s
       begin
-        ori_ref_l = ori_ref.size
-        l1 = 0
-        l2 = 0
-        aln_seq = ViralSeq::Muscle.align(ori_ref, seq, :Super5, path_to_muscle)
-        aln_test = aln_seq[1]
-        aln_test =~ /^(\-*)(\w.*\w)(\-*)$/
-        gap_begin = $1.size
-        gap_end = $3.size
-        aln_test2 = $2
-        ref = aln_seq[0]
-        ref = ref[gap_begin..(-gap_end-1)]
-        ref_size = ref.size
-        if ref_size > 1.3*(seq.size)
-          l1 = l1 + gap_begin
-          l2 = l2 + gap_end
-          max_seq = aln_test2.scan(/[ACGT]+/).max_by(&:length)
-          aln_test2 =~ /#{max_seq}/
-          before_aln_seq = $`
-          before_aln = $`.size
-          post_aln_seq = $'
-          post_aln = $'.size
-          before_aln_seq_size = before_aln_seq.scan(/[ACGT]+/).join("").size
-          b1 = (1.3 * before_aln_seq_size).to_i
-          post_aln_seq_size = post_aln_seq.scan(/[ACGT]+/).join("").size
-          b2 = (1.3 * post_aln_seq_size).to_i
-          if (before_aln > seq.size) and (post_aln <= seq.size)
-            ref = ref[(before_aln - b1)..(ref_size - post_aln - 1)]
-            l1 = l1 + (before_aln - b1)
-          elsif (post_aln > seq.size) and (before_aln <= seq.size)
-            ref = ref[before_aln..(ref_size - post_aln - 1 + b2)]
-            l2 = l2 + post_aln - b2
-          elsif (post_aln > seq.size) and (before_aln > seq.size)
-            ref = ref[(before_aln - b1)..(ref_size - post_aln - 1 + b2)]
-            l1 = l1 + (before_aln - b1)
-            l2 = l2 + (post_aln - b2)
-          end
-          aln_seq = ViralSeq::Muscle.align(ref, seq, :Super5, path_to_muscle)
-          aln_test = aln_seq[1]
-          aln_test =~ /^(\-*)(\w.*\w)(\-*)$/
-          gap_begin = $1.size
-          gap_end = $3.size
-          ref = aln_seq[0]
-          ref = ref[gap_begin..(-gap_end-1)]
-        end
-        aln_test = aln_seq[1]
-        aln_test =~ /^(\-*)(\w.*\w)(\-*)$/
-        gap_begin = $1.size
-        gap_end = $3.size
-        aln_test = $2
-        aln_test =~ /^(\w+)(\-*)\w/
-        s1 = $1.size
-        g1 = $2.size
-        aln_test =~ /\w(\-*)(\w+)$/
-        s2 = $2.size
-        g2 = $1.size
-        l1 = l1 + gap_begin
-        l2 = l2 + gap_end
-        repeat = 0
-        if g1 == g2 and (s1 + g1 + s2) == ref.size
-          if s1 > s2 and g2 >= s2
-            ref = ref[0..(-g2-1)]
-            repeat = 1
-            l2 = l2 + g2
-          elsif s1 < s2 and g1 >= s1
-            ref = ref[g1..-1]
-            repeat = 1
-            l1 = l1 + g1
-          end
-        else
-          if g1 >= s1
-            ref = ref[g1..-1]
-            repeat = 1
-            l1 = l1 + g1
-          end
-          if g2 >= s2
-            ref = ref[0..(-g2 - 1)]
-            repeat = 1
-            l2 = l2 + g2
-          end
-        end
-        while repeat == 1
-          aln_seq = ViralSeq::Muscle.align(ref, seq, :Super5, path_to_muscle)
-          aln_test = aln_seq[1]
-          aln_test =~ /^(\-*)(\w.*\w)(\-*)$/
-          gap_begin = $1.size
-          gap_end = $3.size
-          aln_test = $2
-          aln_test =~ /^(\w+)(\-*)\w/
-          s1 = $1.size
-          g1 = $2.size
-          aln_test =~ /\w(\-*)(\w+)$/
-          s2 = $2.size
-          g2 = $1.size
-          ref = aln_seq[0]
-          ref = ref[gap_begin..(-gap_end-1)]
-          l1 = l1 + gap_begin
-          l2 = l2 + gap_end
-          repeat = 0
-          if g1 >= s1
-            ref = ref[g1..-1]
-            repeat = 1
-            l1 = l1 + g1
-          end
-          if g2 >= s2
-            ref = ref[0..(-g2 - 1)]
-            repeat = 1
-            l2 = l2 + g2
-          end
-        end
-        ref = ori_ref[l1..(ori_ref_l - l2 - 1)]
-        aln_seq = ViralSeq::Muscle.align(ref, seq, :Super5, path_to_muscle)
-        aln_test = aln_seq[1]
-        ref = aln_seq[0]
-        #refine alignment
-        if ref =~ /^(\-+)/
-          l1 = l1 - $1.size
-        elsif ref =~ /(\-+)$/
-          l2 = l2 - $1.size
-        end
-        if (ori_ref_l - l2 - 1) >= l1
-          ref = ori_ref[l1..(ori_ref_l - l2 - 1)]
-          aln_seq = ViralSeq::Muscle.align(ref, seq, :Super5, path_to_muscle)
-          aln_test = aln_seq[1]
-          ref = aln_seq[0]
-          ref_size = ref.size
-          sim_count = 0
-          (0..(ref_size-1)).each do |n|
-            ref_base = ref[n]
-            test_base = aln_test[n]
-            sim_count += 1 if ref_base == test_base
-          end
-          similarity = (sim_count/ref_size.to_f*100).round(1)
-          loc_p1 = l1 + 1
-          loc_p2 = ori_ref_l - l2
-          if seq.size != (loc_p2 - loc_p1 + 1)
-              indel = true
-          elsif aln_test.include?("-")
-              indel = true
-          else
-              indel = false
-          end
-          return [loc_p1,loc_p2,similarity,indel,aln_test,ref]
+        loc = VirustLocator::Locator.exec(seq, "nt", algorithm, ref).split("\t")
+        loc[0] = loc[0].to_i
+        loc[1] = loc[1].to_i
+        loc[2] = loc[2].to_f.round(1)
+        if loc[3].to_s.downcase == "true"
+          loc[3] = true
         else
-          return [0,0,0,0,0,0,0]
+          loc[3] = false
         end
       rescue => e
         puts "Unexpected error occured."
@@ -366,12 +216,13 @@ module ViralSeq
         puts "ViralSeq.sequence_locator returns nil"
         return nil
       end
-    end # end of locator
+      return loc
+    end #end of locator
     # Given start and end positions on the reference genome, return a sub-sequence of the target sequence in that range
     # @param p1 [Integer] start position number on the reference genome
     # @param p2 [Integer] end position number on the reference genome
-    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:NL43`, `:MAC239`
+    # @param ref_option [Symbol], name of reference genomes, options are `:HXB2`, `:SIVmm239`
     # @param path_to_muscle [String], path to the muscle executable, if not provided, use MuscleBio to run Muscle
     # @return [ViralSeq::Sequence, nil] a new ViralSeq::Sequence object that of input range on the reference genome or nil
     #   if either the start or end position is beyond the range of the target sequence.
@@ -381,8 +232,8 @@ module ViralSeq
     #   s.sequence_clip(2333, 2433, :HXB2).dna
     #   => "AGCAGATGATACAGTATTAGAAGAAATAAATTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAATTGGAGGTTTTATCAAAGTAAGACAATATGATC"
-    def sequence_clip(p1 = 0, p2 = 0, ref_option = :HXB2, path_to_muscle = false)
-      loc = self.locator(ref_option, path_to_muscle)
+    def sequence_clip(p1 = 0, p2 = 0, ref_option = :HXB2)
+      loc = self.locator(ref_option)
       l1 = loc[0]
       l2 = loc[1]
       if (p1 >= l1) & (p2 <= l2)

data/lib/viral_seq/string.rb CHANGED

@@ -56,13 +56,13 @@ class String
     Regexp.new match
   end
-  # parse the nucleotide sequences as an Array of Array
+  # parse the nucleotide sequences as an Array of Array
   # @return [Array] Array of Array at each position
   # @example parse a sequence with ambiguities to Array of Array
   #   "ATRWCG".nt_to_array
   #   => [["A"], ["T"], ["A", "G"], ["A", "T"], ["C"], ["G"]]
-  def nt_to_array
+  def nt_to_array
     return_array = []
     self.each_char.each do |base|
       base_array = base.to_list
@@ -75,9 +75,6 @@ class String
   # compare the given nt sequence string with the ref sequence string
   # @param ref [String] the ref sequence string to compare with
   # @return [Interger] Number of differences
-  # @example parse a sequence with ambiguities to Array of Array
-  #   "ATRWCG".nt_to_array
-  #   => [["A"], ["T"], ["A", "G"], ["A", "T"], ["C"], ["G"]]
   def nt_diff(ref)
     count_diff = 0

data/lib/viral_seq/tcs_core.rb CHANGED

@@ -331,6 +331,10 @@ module ViralSeq
           return false
         elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
           return false
+        elsif seq =~ /G{11}/ # a string of poly-G indicates poor quanlity in 2-color chemistry
+          return false
+        elsif seq =~ /C{11}/ # a string of poly-C indicates poor quanlity in 2-color chemistry
+          return false
         elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
           return false
         elsif seq =~ /T{11}/ # a string of poly-T indicates adaptor sequence
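
Note on the hunk above: with the two new branches, a run of 11 or more identical bases of any of the four nucleotides now rejects the sequence. The sketch below shows the same check folded into a single pattern. The constant `HOMOPOLYMER` and the method `homopolymer_run?` are hypothetical names, and this consolidation is not how the gem itself is written.

```ruby
# One pattern covering the four homopolymer checks from the diff.
HOMOPOLYMER = /A{11}|C{11}|G{11}|T{11}/

# Returns true when the sequence contains a run of 11+ identical bases.
def homopolymer_run?(seq)
  seq.match?(HOMOPOLYMER)
end

poly_g  = "ACGT" + ("G" * 11) + "ACGT" # 11 consecutive G
short_g = "ACGT" + ("G" * 10) + "ACGT" # only 10 consecutive G
mixed   = "ACGT" * 5                   # no long run at all

puts homopolymer_run?(poly_g)
puts homopolymer_run?(short_g)
puts homopolymer_run?(mixed)
```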

data/lib/viral_seq/tcs_dr.rb CHANGED

@@ -186,7 +186,7 @@ module ViralSeq
         :trim=>false},
         {:region=>"PR",
         :cdna=>
-          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNTTAACCTTTGGGCCATCCATTCC",
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
         :forward=>
           "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
         :majority=>0,
@@ -247,6 +247,87 @@ module ViralSeq
         ]
       },
+      "v4" => {:platform_error_rate=>0.01,
+      :primer_pairs=>
+      [{:region=>"RT",
+        :cdna=>
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTAAGGAATGGAGGTTCTTTCTGATG",
+        :forward=>
+          "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
+        :majority=>0,
+        :end_join=>true,
+        :end_join_option=>1,
+        :overlap=>0,
+        :TCS_QC=>true,
+        :ref_genome=>"HXB2",
+        :ref_start=>2648,
+        :ref_end=>3209,
+        :indel=>true,
+        :trim=>false},
+        {:region=>"PR",
+        :cdna=>
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
+        :forward=>
+          "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
+        :majority=>0,
+        :end_join=>true,
+        :end_join_option=>3,
+        :TCS_QC=>true,
+        :ref_genome=>"HXB2",
+        :ref_start=>0,
+        :ref_end=>2591,
+        :indel=>true,
+        :trim=>true,
+        :trim_ref=>"HXB2",
+        :trim_ref_start=>2253,
+        :trim_ref_end=>2549},
+        {:region=>"IN",
+        :cdna=>
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCATCACCTGCCATCTGTTTTCCAT",
+        :forward=>"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGCAGAAGTTATYCCAGCAGAAACA",
+        :majority=>0,
+        :end_join=>true,
+        :end_join_option=>2,
+        :overlap=>3,
+        :TCS_QC=>true,
+        :ref_genome=>"HXB2",
+        :ref_start=>4509,
+        :ref_end=>5040,
+        :indel=>true,
+        :trim=>false},
+        {:region=>"V1V3",
+        :cdna=>
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
+        :forward=>
+          "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
+        :majority=>0,
+        :end_join=>true,
+        :end_join_option=>1,
+        :overlap=>0,
+        :TCS_QC=>true,
+        :ref_genome=>"HXB2",
+        :ref_start=>6585,
+        :ref_end=>7205..7210,
+        :indel=>true,
+        :trim=>false},
+        {:region=>"CA",
+          :cdna=>
+          "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCAACAAGGTTTCTGTCATCCAATTTTTTAC",
+          :forward=>
+          "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGTCAGCCAAAATTACCCTATAGTGC",
+          :majority=>0,
+          :end_join=>true,
+          :end_join_option=>1,
+          :overlap=>0,
+          :TCS_QC=>true,
+          :ref_genome=>"HXB2",
+          :ref_start=>1196,
+          :ref_end=>1725,
+          :indel=>true,
+          :trim=>false}
+        ]
+      },
     }

data/lib/viral_seq/util/drm_versions_config.json CHANGED

@@ -54,6 +54,58 @@
             }
         }
+    },
+    {
+        "version": "v4",
+        "DRM_range": {
+            "CAI": [56,57, 66, 67, 70, 74, 105, 107],
+            "PI": [23, 24, 30, 32, 46, 47, 48, 50, 53, 54, 73, 76, 82, 83, 84, 88, 90],
+            "NRTI": [41, 65, 67, 69, 70, 74, 75, 77, 115, 116, 151, 184, 210, 215, 219],
+            "NNRTI": [98, 100, 101, 103, 106, 138, 179, 181, 188, 190],
+            "INSTI": [95, 97, 121, 140, 143, 147, 148, 155, 263]
+        },
+        "seq_coord": {
+            "CA": {
+                "minimum": 1196,
+                "maximum": 1725,
+                "gap": {
+                    "minimum": 1466,
+                    "maximum": 1471
+                }
+            },
+            "PR": {
+                "minimum": 2253,
+                "maximum": 2549
+            },
+            "RT": {
+                "minimum": 2648,
+                "maximum": 3209,
+                "gap": {
+                    "minimum": 2915,
+                    "maximum": 2949
+                }
+            },
+            "IN": {
+                "minimum": 4509,
+                "maximum": 5040
+            }
+        },
+        "seq_drm_correlation": {
+            "CA": ["CAI"],
+            "RT": ["NRTI", "NNRTI"],
+            "PR": ["PI"],
+            "IN": ["INSTI"]
+        },
+        "ref_info": {
+            "ref_type": "HXB2",
+            "ref_coord": {
+                "CA": [1186,1878],
+                "PR": [2253,2549],
+                "RT": [2550,3869],
+                "IN": [4230,5096]
+            }
+        }
     },
     {
         "version": "v1",

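Note on the hunk above: the new `v4` entry links each sequenced region to its drug-resistance classes through `seq_drm_correlation`, and each class to codon positions through `DRM_range`. A minimal sketch of that lookup follows. It uses a trimmed copy of the `v4` values from the diff, assumes the file's top level is a JSON array of version objects (the hunk does not show this), and `drm_positions_for` is a hypothetical helper name.

```ruby
require "json"

# Trimmed copy of the v4 entry; the array wrapper is an assumption.
config_json = <<~JSON
  [
    {
      "version": "v4",
      "DRM_range": {
        "CAI": [56, 57, 66, 67, 70, 74, 105, 107],
        "INSTI": [95, 97, 121, 140, 143, 147, 148, 155, 263]
      },
      "seq_drm_correlation": {
        "CA": ["CAI"],
        "IN": ["INSTI"]
      }
    }
  ]
JSON

versions = JSON.parse(config_json)
v4 = versions.find { |entry| entry["version"] == "v4" }

# Map each DRM class of a region to its list of positions.
def drm_positions_for(entry, region)
  entry["seq_drm_correlation"].fetch(region, []).to_h do |drm_class|
    [drm_class, entry["DRM_range"][drm_class]]
  end
end

puts drm_positions_for(v4, "CA").inspect
```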
data/lib/viral_seq/version.rb CHANGED

@@ -2,6 +2,6 @@
 # version info and histroy
 module ViralSeq
-  VERSION = "1.9.1"
-  TCS_VERSION = "2.7.1"
+  VERSION = "1.10.0"
+  TCS_VERSION = "2.7.2"
 end

data/lib/viral_seq.rb CHANGED

@@ -53,3 +53,5 @@ require "json"
 require "securerandom"
 require "prawn"
 require "colorize"
+require "virust_locator"
+require "shellwords"

data/viral_seq.gemspec CHANGED

@@ -37,6 +37,9 @@ Gem::Specification.new do |spec|
   # muscle_bio gem required
   spec.add_runtime_dependency "muscle_bio", "= 0.4"
+  # virust-locator-ruby required
+  spec.add_runtime_dependency "virust-locator-ruby", "~> 0.3"
   # colorize gem required
   spec.add_runtime_dependency "colorize", "~> 0.1"
@@ -47,4 +50,6 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency "combine_pdf", "~> 1.0", '>= 1.0.0'
   spec.requirements << 'R required for some functions'
+  spec.add_dependency "shellwords", "~> 0.2"
 end
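
Note on the new dependencies above: both use the pessimistic operator, and `~> 0.3` accepts any version from 0.3 up to, but not including, 1.0. A small sketch that checks this with RubyGems' own requirement class (the candidate version numbers are arbitrary examples, not actual releases of `virust-locator-ruby`):

```ruby
require "rubygems"

# The constraint declared for virust-locator-ruby in the gemspec.
requirement = Gem::Requirement.new("~> 0.3")

# Arbitrary candidate versions used only for illustration.
candidates = %w[0.2.9 0.3.0 0.9.1 1.0.0]
results = candidates.to_h do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

results.each { |version, ok| puts "#{version}: #{ok}" }
```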

metadata CHANGED

@@ -1,15 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: viral_seq
 version: !ruby/object:Gem::Version
-  version: 1.9.1
+  version: 1.10.0
 platform: ruby
 authors:
 - Shuntai Zhou
 - Michael Clark
-autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-12-02 00:00:00.000000000 Z
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -67,6 +66,20 @@ dependencies:
     - - '='
       - !ruby/object:Gem::Version
         version: '0.4'
+- !ruby/object:Gem::Dependency
+  name: virust-locator-ruby
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
 - !ruby/object:Gem::Dependency
   name: colorize
   requirement: !ruby/object:Gem::Requirement
@@ -141,6 +154,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: 1.0.0
+- !ruby/object:Gem::Dependency
+  name: shellwords
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
 description: |-
   A Ruby Gem with bioinformatics tools for processing viral NGS data.
                             Specifically for Primer-ID sequencing and HIV drug resistance analysis.
@@ -226,8 +253,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: 1.3.6
 requirements:
 - R required for some functions
-rubygems_version: 3.5.11
-signing_key:
+rubygems_version: 3.6.7
 specification_version: 4
 summary: A Ruby Gem containing bioinformatics tools for processing viral NGS data.
 test_files: []