viral_seq 1.0.14 → 1.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: '048e85ab67fbb667919d02d4509a15111798b116b3f927c921d203dc8565a1a2'
4
- data.tar.gz: 6951e410bd4f9b727a44fab1aa88f9cc263151cf9aed2a9c25ae9d866ed72450
3
+ metadata.gz: f3316ff7e72ca84c6eb2fa861a9fdad14fbcb3ab3c0053ade843ee13cc9ce82e
4
+ data.tar.gz: df1035ea5934b794ef8c64a04085f407bcd4dffc0888bf81a569a7ccfba3560a
5
5
  SHA512:
6
- metadata.gz: 02cc87e245918a5c8f1b16b0db978da66e3bf7e83c6c6140c394c560d31c86ab1845b337a4f53b4b7883ff8e452e8caef0b036ea113c4416d6a29d16f419eb81
7
- data.tar.gz: 9f53bd6c46f4a49b5c14b8b8019ffa3e1abcf442bf0a6cc09a7dbc768a474f2afd81d5cf168f38eb8ab5abbd601c60f1f824f753bc77dbcc0d3c0d93568b9ae3
6
+ metadata.gz: a3ec35b3a40ee9cf66131416a1c20eda38bf8bde818aa41af285c099ddd2b49e4f31fe1d011c95def77fd5c6653d96f4295142fd543444f249242154bb2b671b
7
+ data.tar.gz: daa6e694a841cc615cfde850bf2d98ca7467cb1d27502daf398ab0204e55c4d477f34aafce0149c765a931755fc3a3f7dbdf425964904d0199efb0651b9a09a6
data/.gitignore CHANGED
@@ -2,7 +2,6 @@
2
2
  /.yardoc
3
3
  /_yardoc/
4
4
  /coverage/
5
- /doc/
6
5
  /pkg/
7
6
  /spec/reports/
8
7
  /tmp/
data/Gemfile.lock CHANGED
@@ -1,16 +1,27 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- viral_seq (1.0.13)
5
- colorize (~> 0.1)
6
- muscle_bio (~> 0.4)
4
+ viral_seq (1.1.1)
5
+ colorize (>= 0.1)
6
+ combine_pdf (>= 1.0.0)
7
+ muscle_bio (>= 0.4)
8
+ prawn (>= 2.3.0)
9
+ prawn-table (>= 0.2.0)
7
10
 
8
11
  GEM
9
12
  remote: https://rubygems.org/
10
13
  specs:
11
14
  colorize (0.8.1)
15
+ combine_pdf (1.0.21)
16
+ ruby-rc4 (>= 0.1.5)
12
17
  diff-lcs (1.3)
13
18
  muscle_bio (0.4.0)
19
+ pdf-core (0.9.0)
20
+ prawn (2.4.0)
21
+ pdf-core (~> 0.9.0)
22
+ ttfunk (~> 1.7)
23
+ prawn-table (0.2.2)
24
+ prawn (>= 1.3.0, < 3.0.0)
14
25
  rake (13.0.1)
15
26
  rspec (3.8.0)
16
27
  rspec-core (~> 3.8.0)
@@ -25,6 +36,8 @@ GEM
25
36
  diff-lcs (>= 1.2.0, < 2.0)
26
37
  rspec-support (~> 3.8.0)
27
38
  rspec-support (3.8.0)
39
+ ruby-rc4 (0.1.5)
40
+ ttfunk (1.7.0)
28
41
 
29
42
  PLATFORMS
30
43
  ruby
data/README.md CHANGED
@@ -1,8 +1,24 @@
1
1
  # ViralSeq
2
2
 
3
+ [![Gem Version](https://img.shields.io/gem/v/viral_seq?color=%2300e673&style=flat-square)](https://rubygems.org/gems/viral_seq)
4
+ ![GitHub](https://img.shields.io/github/license/viralseq/viral_seq)
5
+ ![Gem](https://img.shields.io/gem/dt/viral_seq?color=%23E9967A)
6
+ ![GitHub last commit](https://img.shields.io/github/last-commit/viralseq/viral_seq?color=%2300BFFF)
7
+ [![Join the chat at https://gitter.im/viral_seq/community](https://badges.gitter.im/viral_seq/community.svg)](https://gitter.im/viral_seq/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
8
+
3
9
  A Ruby Gem containing bioinformatics tools for processing viral NGS data.
4
10
 
5
- Specifically for Primer-ID sequencing and HIV drug resistance analysis.
11
+ Specifically for Primer ID sequencing and HIV drug resistance analysis.
12
+
13
+ ## Illustration for the Primer ID Sequencing
14
+
15
+
16
+ ![Primer ID Sequencing](./docs/assets/img/cover.jpg)
17
+
18
+ ### Reference readings on the Primer ID sequencing
19
+ [Explantion of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
20
+ [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
21
+ [Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
6
22
 
7
23
  ## Install
8
24
 
@@ -14,20 +30,93 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
14
30
 
15
31
  ### Excutables
16
32
 
17
- Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
33
+ ### `tcs`
34
+ Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
18
35
 
36
+ Example commands:
19
37
  ```bash
20
- $ locator -i sequence.fasta -o sequence.fasta.csv
38
+ $ tcs -p params.json # run TCS pipeline with params.json
39
+ $ tcs -p params.json -i DIRECTORY
40
+ # run TCS pipeline with params.json and DIRECTORY
41
+ # if DIRECTORY is not defined in params.json
42
+ $ tcs -dr -i DIRECTORY
43
+ # run tcs-dr (MPID HIV drug resistance sequencing) pipeline
44
+ # DIRECTORY needs to be given.
45
+ $ tcs -j # CLI to generate params.json
46
+ $ tcs -h # print out the help
21
47
  ```
22
48
 
23
- Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data.
49
+ [sample params.json for the tcs-dr pipeline](./docs/dr.json)
50
+
51
+ ---
52
+ ### `tcs_log`
53
+
54
+ Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
55
+
24
56
 
57
+ Example file structure:
58
+ ```
59
+ batch_tcs_jobs/
60
+ ├── lib1
61
+ ├── lib2
62
+ ├── lib3
63
+ ├── lib4
64
+ ├── ...
65
+ ```
66
+
67
+ Example command:
25
68
  ```bash
26
- $ tcs -p params.json # run TCS pipeline with params.json
27
- $ tcs -j # CLI to generate params.json
28
- $ tcs -h # print out the help
69
+ $ tcs_log batch_tcs_jobs
70
+ ```
71
+
72
+ ---
73
+ ### `tcs_sdrm`
74
+
75
+ Use `tcs_sdrm` pipeline for HIV-1 drug resistance mutation and recency.
76
+
77
+ Example command:
78
+ ```bash
79
+ $ tcs_sdrm libs_dir
80
+ ```
81
+
82
+ lib_dir file structure:
83
+ ```
84
+ libs_dir/
85
+ ├── lib1
86
+ ├── lib1_RT
87
+ ├── lib1_PR
88
+ ├── lib1_IN
89
+ ├── lib1_V1V3
90
+ ├── lib2
91
+ ├── lib1_RT
92
+ ├── lib1_PR
93
+ ├── lib1_IN
94
+ ├── lib1_V1V3
95
+ ├── ...
29
96
  ```
30
97
 
98
+ Output data in a new dir as 'libs_dir_SDRM'
99
+
100
+
101
+ **Note: [R](https://www.r-project.org/) and the following R libraries are required:**
102
+ - phangorn
103
+ - ape
104
+ - scales
105
+ - ggforce
106
+ - cowplot
107
+ - magrittr
108
+ - gridExtra
109
+
110
+ ---
111
+
112
+ ### `locator`
113
+ Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
114
+
115
+ ```bash
116
+ $ locator -i sequence.fasta -o sequence.fasta.csv
117
+ ```
118
+ ---
119
+
31
120
  ## Some Examples
32
121
 
33
122
  Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
@@ -58,7 +147,7 @@ qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
58
147
  Further filter out sequences with Apobec3g/f hypermutations
59
148
 
60
149
  ```ruby
61
- qc_seqhash = qc_seqhash.a3g
150
+ qc_seqhash = qc_seqhash.a3g[:filtered_seq]
62
151
  ```
63
152
 
64
153
  Calculate nucleotide diveristy π
@@ -86,6 +175,44 @@ qc_seqhash.sdrm_hiv_pr(cut_off)
86
175
 
87
176
  ## Updates
88
177
 
178
+ ### Version 1.2.1-05172021
179
+
180
+ 1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
181
+
182
+ ### Version 1.2.0-05102021
183
+
184
+ 1. Added `tcs_sdrm` pipeline as an excutable.
185
+ `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
186
+
187
+ 2. Added function ViralSeq::SeqHash#sample.
188
+
189
+ 3. Added recency determining function `ViralSeq::Recency::define`
190
+
191
+ 4. Fixed a few bugs related to `tcs_sdrm`.
192
+
193
+ ### Version 1.1.2-04262021
194
+
195
+ 1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
196
+ 2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
197
+ 3. Added `--keep-original` flag to the `tcs` pipeline.
198
+
199
+ ### Version 1.1.1-04012021
200
+
201
+ 1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
202
+ 2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
203
+ If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
204
+ 3. Added option `-dr` to the `tcs` script.
205
+
206
+ ### Version 1.1.0-03252021
207
+
208
+ 1. Optimized the algorithm of end-join.
209
+ 2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
210
+ 3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
211
+ 4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
212
+ 5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
213
+ Users can choose from 3 MiSeq platforms for processing their sequencing data.
214
+ MiSeq 300x7x300 is the default option.
215
+
89
216
  ### Version 1.0.14-03052021
90
217
 
91
218
  1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
data/bin/tcs CHANGED
@@ -23,7 +23,7 @@
23
23
  # THE SOFTWARE.
24
24
 
25
25
  # Use JSON file as the run param
26
- # run tcs_json_generator.rb to generate param json file.
26
+ # run `tcs -j` to generate param json file.
27
27
 
28
28
  require 'viral_seq'
29
29
  require 'json'
@@ -46,11 +46,23 @@ OptionParser.new do |opts|
46
46
  options[:params_json] = p
47
47
  end
48
48
 
49
+ opts.on("-i", "--input PATH_TO_WORKING_DIRECTORY", "Path to the working directory") do |p|
50
+ options[:input] = p
51
+ end
52
+
53
+ opts.on("-dr", "--dr_pipeline", "HIV drug resistance MPID pipeline") do |p|
54
+ options[:dr] = true
55
+ end
56
+
49
57
  opts.on("-h", "--help", "Prints this help") do
50
58
  puts opts
51
59
  exit
52
60
  end
53
61
 
62
+ opts.on("--keep-original", "keep raw sequence files") do
63
+ options[:keep] = true
64
+ end
65
+
54
66
  opts.on("-v", "--version", "Version info") do
55
67
  puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
56
68
  puts "viral_seq version: " + ViralSeq::VERSION.red.bold
@@ -64,15 +76,21 @@ end.parse!
64
76
 
65
77
  if options[:json_generator]
66
78
  params = ViralSeq::TcsJson.generate
79
+ elsif options[:dr]
80
+ params = ViralSeq::TcsDr::PARAMS
67
81
  elsif (options[:params_json] && File.exist?(options[:params_json]))
68
82
  params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
69
83
  else
70
84
  abort "No params JSON file found. Script terminated.".red
71
85
  end
72
86
 
73
- indir = params[:raw_sequence_dir]
87
+ if options[:input]
88
+ indir = options[:input]
89
+ else
90
+ indir = params[:raw_sequence_dir]
91
+ end
74
92
 
75
- unless File.exist?(indir)
93
+ unless indir and File.exist?(indir)
76
94
  abort "No input sequence directory found. Script terminated.".red.bold
77
95
  end
78
96
 
@@ -115,6 +133,12 @@ else
115
133
  error_rate = 0.02
116
134
  end
117
135
 
136
+ if params[:platform_format]
137
+ $platform_sequencing_length = params[:platform_format]
138
+ else
139
+ $platform_sequencing_length = 300
140
+ end
141
+
118
142
  primers = params[:primer_pairs]
119
143
  if primers.empty?
120
144
  ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
@@ -123,6 +147,7 @@ end
123
147
 
124
148
  primers.each do |primer|
125
149
  summary_json = {}
150
+ summary_json[:warnings] = []
126
151
  summary_json[:tcs_version] = ViralSeq::TCS_VERSION
127
152
  summary_json[:viralseq_version] = ViralSeq::VERSION
128
153
  summary_json[:runtime] = Time.now.to_s
@@ -134,6 +159,7 @@ primers.each do |primer|
134
159
  forward_primer = primer[:forward]
135
160
 
136
161
  export_raw = primer[:export_raw]
162
+ limit_raw = primer[:limit_raw]
137
163
 
138
164
  unless cdna_primer
139
165
  log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
@@ -175,6 +201,10 @@ primers.each do |primer|
175
201
  paired_seq_number = common_keys.size
176
202
  log.puts Time.now.to_s + "\t" + "Paired raw sequences are : #{paired_seq_number.to_s}"
177
203
  summary_json[:paired_raw_sequence] = paired_seq_number
204
+ if paired_seq_number < raw_sequence_number * 0.001
205
+ summary_json[:warnings] <<
206
+ "WARNING: Filtered raw sequneces less than 0.1% of the total raw sequences. Possible contamination."
207
+ end
178
208
 
179
209
  common_keys.each do |seqtag|
180
210
  r1_seq = r1_passed_seq[seqtag]
@@ -236,7 +266,13 @@ primers.each do |primer|
236
266
  raw_r1_f = File.open(outfile_raw_r1, 'w')
237
267
  raw_r2_f = File.open(outfile_raw_r2, 'w')
238
268
 
239
- bio_r1.keys.each do |k|
269
+ if limit_raw
270
+ raw_keys = bio_r1.keys.sample(limit_raw.to_i)
271
+ else
272
+ raw_keys = bio_r1.keys
273
+ end
274
+
275
+ raw_keys.each do |k|
240
276
  raw_r1_f.puts k + "_r1"
241
277
  raw_r2_f.puts k + "_r2"
242
278
  raw_r1_f.puts bio_r1[k]
@@ -273,7 +309,6 @@ primers.each do |primer|
273
309
  r1_sub_seq << bio_r1[seq_name]
274
310
  r2_sub_seq << bio_r2[seq_name]
275
311
  end
276
-
277
312
  #consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
278
313
  consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
279
314
  r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
@@ -364,6 +399,7 @@ primers.each do |primer|
364
399
  shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
365
400
  joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
366
401
  log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
402
+
367
403
  summary_json[:combined_tcs] = joined_sh.size
368
404
 
369
405
  if export_raw
@@ -433,12 +469,15 @@ primers.each do |primer|
433
469
  trim_end = primer[:trim_ref_end]
434
470
  trim_ref = primer[:trim_ref].to_sym
435
471
  joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
436
- joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
437
472
  if export_raw
438
473
  joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
439
- joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
440
474
  end
441
475
  end
476
+
477
+ joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
478
+ if export_raw
479
+ joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
480
+ end
442
481
  end
443
482
 
444
483
  File.open(outfile_log, "w") do |f|
@@ -446,9 +485,11 @@ primers.each do |primer|
446
485
  end
447
486
  end
448
487
 
449
- log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
450
- File.unlink(r1_f)
451
- File.unlink(r2_f)
488
+ unless options[:keep]
489
+ log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
490
+ File.unlink(r1_f)
491
+ File.unlink(r2_f)
492
+ end
452
493
  log.puts Time.now.to_s + "\t" + "TCS pipeline successfuly exercuted."
453
494
  log.close
454
495
  puts "DONE!"
data/bin/tcs_log ADDED
@@ -0,0 +1,102 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # pool run logs from one batch of tcs jobs
4
+ # file structure:
5
+ # batch_tcs_jobs/
6
+ # ├── lib1
7
+ # ├── lib2
8
+ # ├── lib3
9
+ # ├── lib4
10
+ # ├── ...
11
+ #
12
+ # command example:
13
+ # $ tcs_log batch_tcs_jobs
14
+
15
+ require 'viral_seq'
16
+ require 'pathname'
17
+ require 'json'
18
+ require 'fileutils'
19
+
20
+ indir = ARGV[0].chomp
21
+ indir_basename = File.basename(indir)
22
+ indir_dirname = File.dirname(indir)
23
+
24
+ tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
25
+ Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
26
+
27
+ libs = []
28
+ Dir.chdir(indir) {libs = Dir.glob("*")}
29
+
30
+ outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
31
+ outdir3 = File.join(tcs_dir, "TCS_per_region")
32
+ outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
33
+
34
+ Dir.mkdir(outdir2) unless File.directory?(outdir2)
35
+ Dir.mkdir(outdir3) unless File.directory?(outdir3)
36
+ Dir.mkdir(outdir4) unless File.directory?(outdir4)
37
+
38
+ log_file = File.join(tcs_dir,"log.csv")
39
+ log = File.open(log_file,'w')
40
+
41
+ header = %w{
42
+ lib_name
43
+ Region
44
+ Raw_Sequences_per_barcode
45
+ R1_Raw
46
+ R2_Raw
47
+ Paired_Raw
48
+ Cutoff
49
+ PID_Length
50
+ Consensus1
51
+ Consensus2
52
+ Distinct_to_Raw
53
+ Resampling_index
54
+ Combined_TCS
55
+ Combined_TCS_after_QC
56
+ WARNINGS
57
+ }
58
+
59
+ log.puts header.join(',')
60
+ libs.each do |lib|
61
+ Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
62
+ fasta_files = []
63
+ json_files = []
64
+ Dir.chdir(File.join(indir, lib)) do
65
+ fasta_files = Dir.glob("**/*.fasta")
66
+ json_files = Dir.glob("**/log.json")
67
+ end
68
+ fasta_files.each do |f|
69
+ path_array = Pathname(f).each_filename.to_a
70
+ region = path_array[0]
71
+ if path_array[-1] == "combined.fasta"
72
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
73
+ Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
74
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
75
+ else
76
+ Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
77
+ Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
78
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
79
+ end
80
+ end
81
+
82
+ json_files.each do |f|
83
+ json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
84
+ log.print [lib,
85
+ json_log[:primer_set_name],
86
+ json_log[:total_raw_sequence],
87
+ json_log[:r1_filtered_raw],
88
+ json_log[:r2_filtered_raw],
89
+ json_log[:paired_raw_sequence],
90
+ json_log[:consensus_cutoff],
91
+ json_log[:length_of_pid],
92
+ json_log[:total_tcs_with_ambiguities],
93
+ json_log[:total_tcs],
94
+ json_log[:distinct_to_raw],
95
+ json_log[:resampling_param],
96
+ json_log[:combined_tcs],
97
+ json_log[:combined_tcs_after_qc],
98
+ json_log[:warnings],
99
+ ].join(',') + "\n"
100
+ end
101
+ end
102
+ log.close