viral_seq 1.0.7 → 1.0.12

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: bb326c97b25326286a51ec63583983a20dfebee2513fd8811bc855ec21ac0b5d
4
- data.tar.gz: e9870bbaa8c17ba51d53e790ca8189e2dd362911e1b5cfcd4806a3bc68ccf369
3
+ metadata.gz: 97ecf609d8927a59e1174ea5f01794e3e9c1f598cf547f72cc53a66465817c0e
4
+ data.tar.gz: d9c8deeb6452075aa49a63f02e8e6a6f58ef10d8969065987ffa05651e2ad2b8
5
5
  SHA512:
6
- metadata.gz: ff6e5727484687db04180a1ef9d3204e9ed02d9b1a98862bdb8796255680aca1e830667429a57db116702793dc55eeb7cc84800c39b27f8e2773186e1a638988
7
- data.tar.gz: 86d0b03af6335cc91e38bc54a8c1fa7e2c84d430dc0adb02e4dc3819ebb188a0e8ae1e4c76c71e5066cac51675e0a45f9ee5a9b0bbd2de8b26da4fa04fe95d85
6
+ metadata.gz: 2f8dc3fd2c8e5f8cb0bdd130451a186790d1b9d102b66851f24d843368926632e375515e81df3e62b60dee8400effb17e9e15fcc74eab4b97c6f33dbc5cba358
7
+ data.tar.gz: a53032e574e844411c8b11ed07f238cba1a795f819fb3cae3d5cca129f7bcd38312efb30a87520b35540915ff461ce03e4448a6cdade50bda1c77497b5a03aef
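The `checksums.yaml` entries above are hex digests of the gem's `metadata.gz` and `data.tar.gz` members. As a minimal sketch, assuming you have unpacked the `.gem` archive yourself, a digest like these could be checked locally with Ruby's standard `digest` library. The helper name `sha256_matches?` and the throwaway file are illustrative only and are not part of viral_seq or RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file's SHA256 hex digest equals the expected value.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Demonstration on a throwaway file containing the three bytes "abc".
Tempfile.create('checksum_demo') do |f|
  f.write('abc')
  f.flush
  expected = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
  puts sha256_matches?(f.path, expected)
end
```

For a real check, the path would point at the extracted `data.tar.gz` and the expected value would be the SHA256 entry listed for the version in question.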
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- viral_seq (1.0.7)
4
+ viral_seq (1.0.10)
5
5
  colorize (~> 0.1)
6
6
  muscle_bio (~> 0.4)
7
7
 
@@ -11,7 +11,7 @@ GEM
11
11
  colorize (0.8.1)
12
12
  diff-lcs (1.3)
13
13
  muscle_bio (0.4.0)
14
- rake (10.5.0)
14
+ rake (13.0.1)
15
15
  rspec (3.8.0)
16
16
  rspec-core (~> 3.8.0)
17
17
  rspec-expectations (~> 3.8.0)
@@ -31,7 +31,7 @@ PLATFORMS
31
31
 
32
32
  DEPENDENCIES
33
33
  bundler (~> 2.0)
34
- rake (~> 10.0)
34
+ rake (~> 13.0)
35
35
  rspec (~> 3.0)
36
36
  viral_seq!
37
37
 
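The rake constraint above moves from `~> 10.0` to `~> 13.0`. A short sketch of what the pessimistic operator allows, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The variable names are illustrative.

```ruby
require 'rubygems'

old_requirement = Gem::Requirement.new('~> 10.0') # >= 10.0 and < 11.0
new_requirement = Gem::Requirement.new('~> 13.0') # >= 13.0 and < 14.0
locked_rake     = Gem::Version.new('13.0.1')      # version pinned in the new Gemfile.lock

puts old_requirement.satisfied_by?(locked_rake)                 # false
puts new_requirement.satisfied_by?(locked_rake)                 # true
puts new_requirement.satisfied_by?(Gem::Version.new('14.0.0'))  # false
```

This is why the lockfile entry `rake (13.0.1)` could only appear once the dependency declaration itself was raised.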
data/README.md CHANGED
@@ -4,98 +4,167 @@ A Ruby Gem containing bioinformatics tools for processing viral NGS data.
4
4
 
5
5
  Specifically for Primer-ID sequencing and HIV drug resistance analysis.
6
6
 
7
- ## Installation
7
+ ## Install
8
8
 
9
+ ```bash
9
10
  $ gem install viral_seq
11
+ ```
10
12
 
11
13
  ## Usage
12
14
 
13
- #### Load all ViralSeq classes by requiring 'viral_seq.rb'
15
+ ### Executables
14
16
 
15
- #!/usr/bin/env ruby
16
- require 'viral_seq'
17
-
18
- #### Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
17
+ Use the `locator` executable in a terminal to get the coordinates of the sequences in a FASTA file on the HIV/SIV reference genome.
19
18
 
19
+ ```bash
20
20
  $ locator -i sequence.fasta -o sequence.fasta.csv
21
+ ```
22
+
23
+ Use the `tcs` executable to run the TCS pipeline on Primer ID MiSeq sequencing data.
24
+
25
+ ```bash
26
+ $ tcs -p params.json # run TCS pipeline with params.json
27
+ $ tcs -j # CLI to generate params.json
28
+ $ tcs -h # print out the help
29
+ ```
21
30
 
22
31
  ## Some Examples
23
32
 
24
- #### Load nucleotide sequences from a FASTA format sequence file
33
+ Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
34
+
35
+ ```ruby
36
+ #!/usr/bin/env ruby
37
+ require 'viral_seq'
38
+ ```
39
+
40
+ Load nucleotide sequences from a FASTA format sequence file
25
41
 
26
- my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
42
+ ```ruby
43
+ my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
44
+ ```
27
45
 
28
- #### Make an alignment (using MUSCLE)
46
+ Make an alignment (using MUSCLE)
29
47
 
30
- aligned_seqhash = my_seqhash.align
48
+ ```ruby
49
+ aligned_seqhash = my_seqhash.align
50
+ ```
31
51
 
32
- #### Filter nucleotide sequences with the reference coordinates (HIV Protease)
52
+ Filter nucleotide sequences with the reference coordinates (HIV Protease)
33
53
 
34
- qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
54
+ ```ruby
55
+ qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
56
+ ```
35
57
 
36
- #### Further filter out sequences with Apobec3g/f hypermutations
58
+ Further filter out sequences with Apobec3g/f hypermutations
37
59
 
38
- qc_seqhash = qc_seqhash.a3g
60
+ ```ruby
61
+ qc_seqhash = qc_seqhash.a3g
62
+ ```
39
63
 
40
- #### Calculate nucleotide diveristy π
64
+ Calculate nucleotide diversity π
41
65
 
42
- qc_seqhash.pi
66
+ ```ruby
67
+ qc_seqhash.pi
68
+ ```
43
69
 
44
- #### Calculate cut-off for minority variants based on Poisson model
70
+ Calculate cut-off for minority variants based on Poisson model
45
71
 
46
- cut_off = qc_seqhash.pm
72
+ ```ruby
73
+ cut_off = qc_seqhash.pm
74
+ ```
47
75
 
48
- #### Examine for drug resistance mutations for HIV PR region
76
+ Examine for drug resistance mutations for HIV PR region
49
77
 
50
- qc_seqhash.sdrm_hiv_pr(cut_off)
78
+ ```ruby
79
+ qc_seqhash.sdrm_hiv_pr(cut_off)
80
+ ```
81
+ ## Known issues
82
+
83
+ 1. Conflicts with Rails.
84
+ 2. Update on 03032021: the conflict remains. As a workaround, set `require: false` for the gem in the Rails Gemfile and call `require "viral_seq"` only where the module is needed, for example in a controller.
51
85
 
52
86
  ## Updates
53
87
 
54
- Version 1.0.7-01282020:
88
+ ### Version 1.1.2-03032021
89
+
90
+ 1. Fixed an issue that may cause conflicts with ActiveRecord.
91
+
92
+ ### Version 1.1.1-03022021
93
+
94
+ 1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
95
+ 2. Fixed an issue loading class 'OptionParser' in some Ruby environments.
96
+
97
+ ### Version 1.1.0-11112020:
98
+
99
+ 1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
100
+ 2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
101
+ 3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
102
+ 4. a few optimizations.
103
+ 5. TCS 2.1.0 delivered.
104
+ 6. Tried parallel processing, without success: by default the `parallel` gem cannot pool data from the memory of child processes, and `in_threads` does not improve speed.
105
+
106
+ ### Version 1.0.9-07182020:
107
+
108
+ 1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
109
+
110
+ 2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads that pass the quality filters will be exported along with the TCS reads.
111
+
112
+ ### Version 1.0.8-02282020:
113
+
114
+ 1. TCS pipeline (version 2.0.0) added as executable.
115
+ tcs - main TCS pipeline script.
116
+ tcs_json_generator - step-by-step script to generate json file for tcs pipeline.
117
+
118
+ 2. Methods added:
119
+ ViralSeq::SeqHash#trim
120
+
121
+ 3. Bug fix for several methods.
122
+
123
+ ### Version 1.0.7-01282020:
55
124
 
56
- 1. Several methods added, including
57
- ViralSeq::SeqHash#error_table
58
- ViralSeq::SeqHash#random_select
59
- 2. Improved performance for several functions.
125
+ 1. Several methods added, including
126
+ ViralSeq::SeqHash#error_table
127
+ ViralSeq::SeqHash#random_select
128
+ 2. Improved performance for several functions.
60
129
 
61
- Version 1.0.6-07232019:
130
+ ### Version 1.0.6-07232019:
62
131
 
63
- 1. Several methods added to ViralSeq::SeqHash, including
64
- ViralSeq::SeqHash#size
65
- ViralSeq::SeqHash#+
66
- ViralSeq::SeqHash#write_nt_fa
67
- ViralSeq::SeqHash#mutation
68
- 2. Update documentations and rspec samples.
132
+ 1. Several methods added to ViralSeq::SeqHash, including
133
+ ViralSeq::SeqHash#size
134
+ ViralSeq::SeqHash#+
135
+ ViralSeq::SeqHash#write_nt_fa
136
+ ViralSeq::SeqHash#mutation
137
+ 2. Update documentations and rspec samples.
69
138
 
70
- Version 1.0.5-07112019:
139
+ ### Version 1.0.5-07112019:
71
140
 
72
- 1. Update ViralSeq::SeqHash#sequence_locator.
73
- Program will try to determine the direction (`+` or `-` of the query sequence)
74
- 2. update executable `locator` to have a column of `direction` in output .csv file
141
+ 1. Update ViralSeq::SeqHash#sequence_locator.
142
+ Program will try to determine the direction (`+` or `-`) of the query sequence
143
+ 2. update executable `locator` to have a column of `direction` in output .csv file
75
144
 
76
- Version 1.0.4-07102019:
145
+ ### Version 1.0.4-07102019:
77
146
 
78
- 1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
79
- 2. Fix bugs in bin `locator`
147
+ 1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
148
+ 2. Fix bugs in bin `locator`
80
149
 
81
- Version 1.0.3-07102019:
150
+ ### Version 1.0.3-07102019:
82
151
 
83
- 1. Bug fix.
152
+ 1. Bug fix.
84
153
 
85
- Version 1.0.2-07102019:
154
+ ### Version 1.0.2-07102019:
86
155
 
87
- 1. Fixed a gem loading issue.
156
+ 1. Fixed a gem loading issue.
88
157
 
89
- Version 1.0.1-07102019:
158
+ ### Version 1.0.1-07102019:
90
159
 
91
- 1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
92
- 2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
93
- 3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
94
- 4. update documentations
160
+ 1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
161
+ 2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
162
+ 3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
163
+ 4. update documentations
95
164
 
96
- Version 1.0.0-07092019:
165
+ ### Version 1.0.0-07092019:
97
166
 
98
- 1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
167
+ 1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
99
168
 
100
169
  ## Development
101
170
 
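The README above runs the pipeline with `tcs -p params.json`. As a rough sketch of what such a file might contain, the example below builds a params hash using only keys that the `bin/tcs` script in this diff reads (`raw_sequence_dir`, `platform_error_rate`, `primer_pairs`, and the per-primer keys). Every value is a hypothetical placeholder, including the directory and the primer strings, and the authoritative layout is whatever `tcs -j` generates.

```ruby
require 'json'

# All values below are placeholders for illustration, not real primers or paths.
params = {
  raw_sequence_dir: '/path/to/library_dir',
  platform_error_rate: 0.02,            # the script falls back to 0.02 when this key is absent
  primer_pairs: [
    {
      region: 'PR',
      cdna: 'PLACEHOLDER_CDNA_PRIMER',
      forward: 'PLACEHOLDER_FORWARD_PRIMER',
      majority: 0.5,
      export_raw: false,
      end_join: true,
      end_join_option: 1,
      overlap: 0,
      TCS_QC: true,
      ref_genome: 'HXB2',
      ref_start: 2253,
      ref_end: 2549,
      indel: false,
      trim: false
    }
  ]
}

json_text = JSON.pretty_generate(params)

# The script reads the file back with symbolized keys, as in this round trip.
parsed = JSON.parse(json_text, symbolize_names: true)
puts parsed[:primer_pairs].first[:region]
```

The round trip mirrors the `JSON.parse(File.read(...), symbolize_names: true)` call in the script, which is why the pipeline can address keys such as `primer[:cdna]` with symbols.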
data/bin/locator CHANGED
@@ -1,5 +1,25 @@
1
1
  #!/usr/bin/env ruby
2
2
 
3
+ # Copyright (c) 2020 Shuntai Zhou (shuntai.zhou@gmail.com)
4
+ #
5
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ # of this software and associated documentation files (the "Software"), to deal
7
+ # in the Software without restriction, including without limitation the rights
8
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ # copies of the Software, and to permit persons to whom the Software is
10
+ # furnished to do so, subject to the following conditions:
11
+ #
12
+ # The above copyright notice and this permission notice shall be included in
13
+ # all copies or substantial portions of the Software.
14
+ #
15
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ # THE SOFTWARE.
22
+
3
23
  require 'viral_seq'
4
24
  require 'csv'
5
25
  require 'optparse'
data/bin/tcs ADDED
@@ -0,0 +1,454 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # TCS pipeline for Primer ID sequencing data analysis.
4
+
5
+ # Copyright (c) 2020 Shuntai Zhou (shuntai.zhou@gmail.com)
6
+ #
7
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
8
+ # of this software and associated documentation files (the "Software"), to deal
9
+ # in the Software without restriction, including without limitation the rights
10
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
11
+ # copies of the Software, and to permit persons to whom the Software is
12
+ # furnished to do so, subject to the following conditions:
13
+ #
14
+ # The above copyright notice and this permission notice shall be included in
15
+ # all copies or substantial portions of the Software.
16
+ #
17
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
23
+ # THE SOFTWARE.
24
+
25
+ # Use JSON file as the run param
26
+ # run `tcs -j` to generate the params JSON file.
27
+
28
+ require 'viral_seq'
29
+ require 'json'
30
+ require 'colorize'
31
+ require 'optparse'
32
+
33
+ options = {}
34
+
35
+ banner = '-'*50 + "\n" +
36
+ '| The TCS Pipeline ' + "Version #{ViralSeq::TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |' + "\n" +
37
+ '-'*50 + "\n"
38
+
39
+ OptionParser.new do |opts|
40
+ opts.banner = banner + "Usage: tcs -j"
41
+ opts.on "-j", "--json_generator", "Command line interface to generate new params json file" do |j|
42
+ options[:json_generator] = true
43
+ end
44
+
45
+ opts.on("-p", "--params PARAMS_JSON", "Execute the pipeline with input params json file") do |p|
46
+ options[:params_json] = p
47
+ end
48
+
49
+ opts.on("-h", "--help", "Prints this help") do
50
+ puts opts
51
+ exit
52
+ end
53
+
54
+ opts.on("-v", "--version", "Version info") do
55
+ puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
56
+ puts "viral_seq version: " + ViralSeq::VERSION.red.bold
57
+ exit
58
+ end
59
+
60
+ # opts.on("--no-parallel", "toggle off parallel processing") do
61
+ # options[:no_parallel] = true
62
+ # end
63
+ end.parse!
64
+
65
+ if options[:json_generator]
66
+ params = ViralSeq::TcsJson.generate
67
+ elsif (options[:params_json] && File.exist?(options[:params_json]))
68
+ params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
69
+ else
70
+ abort "No params JSON file found. Script terminated.".red
71
+ end
72
+
73
+ indir = params[:raw_sequence_dir]
74
+
75
+ unless File.exist?(indir)
76
+ abort "No input sequence directory found. Script terminated.".red.bold
77
+ end
78
+
79
+ # log file
80
+
81
+ runtime_log_file = File.join(indir,"runtime.log")
82
+ log = File.open(runtime_log_file, "w")
83
+ log.puts "TCS pipeline Version " + ViralSeq::TCS_VERSION.to_s
84
+ log.puts "viral_seq Version " + ViralSeq::VERSION.to_s
85
+ log.puts Time.now.to_s + "\t" + "Start TCS pipeline..."
86
+
87
+ libname = File.basename indir
88
+
89
+ seq_files = ViralSeq::TcsCore.r1r2 indir
90
+
91
+ if seq_files[:r1_file].size > 0 and seq_files[:r2_file].size > 0
92
+ r1_f = seq_files[:r1_file]
93
+ r2_f = seq_files[:r2_file]
94
+ elsif seq_files[:r1_file].size > 0 and seq_files[:r2_file].empty?
95
+ exit_sig = "Missing R2 file. Aborted."
96
+ elsif seq_files[:r2_file].size > 0 and seq_files[:r1_file].empty?
97
+ exit_sig = "Missing R1 file. Aborted."
98
+ else
99
+ exit_sig = "Cannot determine R1 R2 file in #{indir}. Aborted."
100
+ end
101
+
102
+ if exit_sig
103
+ ViralSeq::TcsCore.log_and_abort log, exit_sig
104
+ end
105
+
106
+ r1_fastq_sh = ViralSeq::SeqHash.fq(r1_f)
107
+ r2_fastq_sh = ViralSeq::SeqHash.fq(r2_f)
108
+
109
+ raw_sequence_number = r1_fastq_sh.size
110
+ log.puts Time.now.to_s + "\tRaw sequence number: #{raw_sequence_number.to_s}"
111
+
112
+ if params[:platform_error_rate]
113
+ error_rate = params[:platform_error_rate]
114
+ else
115
+ error_rate = 0.02
116
+ end
117
+
118
+ primers = params[:primer_pairs]
119
+ if primers.empty?
120
+ ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
121
+ end
122
+
123
+
124
+ primers.each do |primer|
125
+ summary_json = {}
126
+ summary_json[:tcs_version] = ViralSeq::TCS_VERSION
127
+ summary_json[:viralseq_version] = ViralSeq::VERSION
128
+ summary_json[:runtime] = Time.now.to_s
129
+
130
+ primer[:region] ? region = primer[:region] : region = "region"
131
+ summary_json[:primer_set_name] = region
132
+
133
+ cdna_primer = primer[:cdna]
134
+ forward_primer = primer[:forward]
135
+
136
+ export_raw = primer[:export_raw]
137
+
138
+ unless cdna_primer
139
+ log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
140
+ end
141
+ unless forward_primer
142
+ log.puts Time.now.to_s + "\t" + region + " does not have forward primer sequence. #{region} skipped."
143
+ end
144
+ summary_json[:cdan_primer] = cdna_primer
145
+ summary_json[:forward_primer] = forward_primer
146
+
147
+ primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0
148
+ summary_json[:majority_cut_off] = majority_cut_off
149
+
150
+ summary_json[:total_raw_sequence] = raw_sequence_number
151
+
152
+ log.puts Time.now.to_s + "\t" + "Processing #{region}..."
153
+
154
+ # filter R1
155
+ log.puts Time.now.to_s + "\t" + "filtering R1..."
156
+ filter_r1 = ViralSeq::TcsCore.filter_r1(r1_fastq_sh, forward_primer)
157
+ r1_passed_seq = filter_r1[:r1_passed_seq]
158
+ log.puts Time.now.to_s + "\t" + "R1 filtered: #{r1_passed_seq.size.to_s}"
159
+ summary_json[:r1_filtered_raw] = r1_passed_seq.size
160
+
161
+ # filter R2
162
+ log.puts Time.now.to_s + "\t" + "filtering R2..."
163
+ filter_r2 = ViralSeq::TcsCore.filter_r2(r2_fastq_sh, cdna_primer)
164
+ r2_passed_seq = filter_r2[:r2_passed_seq]
165
+ pid_length = filter_r2[:pid_length]
166
+ log.puts Time.now.to_s + "\t" + "R2 filtered: #{r2_passed_seq.size.to_s}"
167
+ summary_json[:r2_filtered_raw] = r2_passed_seq.size
168
+
169
+ # pair-end
170
+ log.puts Time.now.to_s + "\t" + "Pairing R1 and R2 seqs..."
171
+ id = {} # hash for :sequence_tag => primer_id
172
+ bio_r2 = {} # hash for :sequence_tag => primer_trimmed_r2_sequence
173
+ bio_r1 = {} # hash for :sequence_tag => primer_trimmed_r1_sequence
174
+ common_keys = r1_passed_seq.keys & r2_passed_seq.keys
175
+ paired_seq_number = common_keys.size
176
+ log.puts Time.now.to_s + "\t" + "Paired raw sequences are : #{paired_seq_number.to_s}"
177
+ summary_json[:paired_raw_sequence] = paired_seq_number
178
+
179
+ common_keys.each do |seqtag|
180
+ r1_seq = r1_passed_seq[seqtag]
181
+ r2_seq = r2_passed_seq[seqtag]
182
+ pid = r2_seq[0, pid_length]
183
+ id[seqtag] = pid
184
+ bio_r2[seqtag] = r2_seq[filter_r2[:reverse_starting_number]..-2]
185
+ bio_r1[seqtag] = r1_seq[filter_r1[:forward_starting_number]..-2]
186
+ end
187
+
188
+ # TCS cut-off
189
+ log.puts Time.now.to_s + "\t" + "Calculate consensus cutoff...."
190
+
191
+ primer_id_list = id.values
192
+ primer_id_count = primer_id_list.count_freq
193
+ primer_id_dis = primer_id_count.values.count_freq
194
+
195
+ # calculate distinct_to_raw
196
+ distinct_to_raw = (primer_id_count.size/primer_id_list.size.to_f).round(3)
197
+ summary_json[:distinct_to_raw] = distinct_to_raw
198
+
199
+ if primer_id_dis.keys.size < 5
200
+ log.puts Time.now.to_s + "\t" + "Less than 5 Primer IDs detected. Region #{region} aborted."
201
+ next
202
+ end
203
+
204
+ max_id = primer_id_dis.keys.sort[-5..-1].mean
205
+ consensus_cutoff = ViralSeq::TcsCore.calculate_cut_off(max_id,error_rate)
206
+ log.puts Time.now.to_s + "\t" + "Consensus cut-off is #{consensus_cutoff.to_s}"
207
+ summary_json[:consensus_cutoff] = consensus_cutoff
208
+ summary_json[:length_of_pid] = pid_length
209
+ log.puts Time.now.to_s + "\t" + "Creating consensus..."
210
+
211
+ # Primer ID over the cut-off
212
+ primer_id_count_over_n = []
213
+ primer_id_count.each do |primer_id,count|
214
+ primer_id_count_over_n << primer_id if count > consensus_cutoff
215
+ end
216
+ pid_to_process = primer_id_count_over_n.size
217
+ log.puts Time.now.to_s + "\t" + "Number of consensus to process: #{pid_to_process.to_s}"
218
+ summary_json[:total_tcs_with_ambiguities] = pid_to_process
219
+
220
+ # setup output path
221
+ out_dir_set = File.join(indir, region)
222
+ Dir.mkdir(out_dir_set) unless File.directory?(out_dir_set)
223
+ out_dir_consensus = File.join(out_dir_set, "consensus")
224
+ Dir.mkdir(out_dir_consensus) unless File.directory?(out_dir_consensus)
225
+
226
+ outfile_r1 = File.join(out_dir_consensus, 'r1.fasta')
227
+ outfile_r2 = File.join(out_dir_consensus, 'r2.fasta')
228
+ outfile_log = File.join(out_dir_set, 'log.json')
229
+
230
+ # if export_raw is true, create dir for raw sequence
231
+ if export_raw
232
+ out_dir_raw = File.join(out_dir_set, "raw")
233
+ Dir.mkdir(out_dir_raw) unless File.directory?(out_dir_raw)
234
+ outfile_raw_r1 = File.join(out_dir_raw, 'r1.raw.fasta')
235
+ outfile_raw_r2 = File.join(out_dir_raw, 'r2.raw.fasta')
236
+ raw_r1_f = File.open(outfile_raw_r1, 'w')
237
+ raw_r2_f = File.open(outfile_raw_r2, 'w')
238
+
239
+ bio_r1.keys.each do |k|
240
+ raw_r1_f.puts k + "_r1"
241
+ raw_r2_f.puts k + "_r2"
242
+ raw_r1_f.puts bio_r1[k]
243
+ raw_r2_f.puts bio_r2[k].rc
244
+ end
245
+
246
+ raw_r1_f.close
247
+ raw_r2_f.close
248
+ end
249
+
250
+ # create TCS
251
+
252
+ pid_seqtag_hash = {}
253
+ id.each do |name, pid|
254
+ if pid_seqtag_hash[pid]
255
+ pid_seqtag_hash[pid] << name
256
+ else
257
+ pid_seqtag_hash[pid] = []
258
+ pid_seqtag_hash[pid] << name
259
+ end
260
+ end
261
+
262
+ consensus = {}
263
+ r1_temp = {}
264
+ r2_temp = {}
265
+ m = 0
266
+ primer_id_count_over_n.each do |primer_id|
267
+ m += 1
268
+ log.puts Time.now.to_s + "\t" + "Now processing number #{m}" if m%100 == 0
269
+ seq_with_same_primer_id = pid_seqtag_hash[primer_id]
270
+ r1_sub_seq = []
271
+ r2_sub_seq = []
272
+ seq_with_same_primer_id.each do |seq_name|
273
+ r1_sub_seq << bio_r1[seq_name]
274
+ r2_sub_seq << bio_r2[seq_name]
275
+ end
276
+
277
+ #consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
278
+ consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
279
+ r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
280
+ r2_consensus = ViralSeq::SeqHash.array(r2_sub_seq).consensus(majority_cut_off)
281
+
282
+ # hide the following two lines if allowing sequence to have ambiguities.
283
+ next if r1_consensus =~ /[^ATCG]/
284
+ next if r2_consensus =~ /[^ATCG]/
285
+
286
+ # reverse complement sequence of the R2 region
287
+ r2_consensus = r2_consensus.rc
288
+ consensus[consensus_name] = [r1_consensus, r2_consensus]
289
+ r1_temp[consensus_name] = r1_consensus
290
+ r2_temp[consensus_name] = r2_consensus
291
+ end
292
+ r1_temp_sh = ViralSeq::SeqHash.new(r1_temp)
293
+ r2_temp_sh = ViralSeq::SeqHash.new(r2_temp)
294
+
295
+ # filter consensus sequences for residual offspring PIDs
296
+ consensus_filtered = {}
297
+ consensus_number_temp = consensus.size
298
+ max_pid_comb = 4**pid_length
299
+ if consensus_number_temp < 0.003*max_pid_comb
300
+ log.puts Time.now.to_s + "\t" + "Applying PID post TCS filter..."
301
+ r1_consensus_filtered = r1_temp_sh.filter_similar_pid.dna_hash
302
+ r2_consensus_filtered = r2_temp_sh.filter_similar_pid.dna_hash
303
+ common_pid = r1_consensus_filtered.keys & r2_consensus_filtered.keys
304
+ common_pid.each do |pid|
305
+ consensus_filtered[pid] = [r1_consensus_filtered[pid], r2_consensus_filtered[pid]]
306
+ end
307
+ else
308
+ consensus_filtered = consensus
309
+ end
310
+ n_con = consensus_filtered.size
311
+ log.puts Time.now.to_s + "\t" + "Number of consensus sequences: " + n_con.to_s
312
+ summary_json[:total_tcs] = n_con
313
+ summary_json[:resampling_param] = (n_con/pid_to_process.to_f).round(3)
314
+
315
+ log.puts Time.now.to_s + "\t" + "Writing R1 and R2 files..."
316
+ # r1_file output
317
+ f1 = File.open(outfile_r1, 'w')
318
+ f2 = File.open(outfile_r2, 'w')
319
+ primer_id_in_use = {}
320
+ if n_con > 0
321
+ r1_seq_length = consensus_filtered.values[0][0].size
322
+ r2_seq_length = consensus_filtered.values[0][1].size
323
+ else
324
+ next
325
+ end
326
+ log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
327
+ log.puts Time.now.to_s + "\t" + "R2 sequence #{r2_seq_length} bp"
328
+ consensus_filtered.each do |seq_name,seq|
329
+ f1.print seq_name + "_r1\n" + seq[0] + "\n"
330
+ f2.print seq_name + "_r2\n" + seq[1] + "\n"
331
+ primer_id_in_use[seq_name.split("_")[0][1..-1]] = seq_name.split("_")[1].to_i
332
+ end
333
+ f1.close
334
+ f2.close
335
+
336
+ # Primer ID distribution in .json file
337
+ out_pid_json = File.join(out_dir_set, 'primer_id.json')
338
+ pid_json = {}
339
+ pid_json[:primer_id_in_use] = Hash[*(primer_id_in_use.sort_by {|k, v| [-v,k]}.flatten)]
340
+ pid_json[:primer_id_distribution] = Hash[*(primer_id_dis.sort_by{|k,v| k}.flatten)]
341
+ pid_json[:primer_id_frequency] = Hash[*(primer_id_count.sort_by {|k, v| [-v,k]}.flatten)]
342
+ File.open(out_pid_json, 'w') do |f|
343
+ f.puts JSON.pretty_generate(pid_json)
344
+ end
345
+
346
+ # start end-join
347
+ def end_join(dir, option, overlap)
348
+ shp = ViralSeq::SeqHashPair.fa(dir)
349
+ case option
350
+ when 1
351
+ joined_sh = shp.join1()
352
+ when 2
353
+ joined_sh = shp.join1(overlap)
354
+ when 3
355
+ joined_sh = shp.join2
356
+ when 4
357
+ joined_sh = shp.join2(model: :indiv)
358
+ end
359
+ return joined_sh
360
+ end
361
+
362
+ if primer[:end_join]
363
+ log.puts Time.now.to_s + "\t" + "Start end-pairing for TCS..."
364
+ shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
365
+ joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
366
+ log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
367
+ summary_json[:combined_tcs] = joined_sh.size
368
+
369
+ if export_raw
370
+ joined_sh_raw = end_join(out_dir_raw, primer[:end_join_option], primer[:overlap])
371
+ end
372
+
373
+ else
374
+ File.open(outfile_log, "w") do |f|
375
+ f.puts JSON.pretty_generate(summary_json)
376
+ end
377
+ next
378
+ end
379
+
380
+ if primer[:TCS_QC]
381
+ ref_start = primer[:ref_start]
382
+ ref_end = primer[:ref_end]
383
+ ref_genome = primer[:ref_genome].to_sym
384
+ indel = primer[:indel]
385
+ if ref_start == 0
386
+ ref_start = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
387
+ end
388
+ if ref_end == 0
389
+ ref_end = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
390
+ end
391
+ if primer[:end_join_option] == 1 and primer[:overlap] == 0
392
+ r1_sh = ViralSeq::SeqHash.fa(outfile_r1)
393
+ r2_sh = ViralSeq::SeqHash.fa(outfile_r2)
394
+ r1_sh = r1_sh.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
395
+ r2_sh = r2_sh.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
396
+ new_r1_seq = r1_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
397
+ new_r2_seq = r2_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
398
+ joined_seq = {}
399
+ new_r1_seq.each do |seq_name, seq|
400
+ next unless seq
401
+ next unless new_r2_seq[seq_name]
402
+ joined_seq[seq_name] = seq + new_r2_seq[seq_name]
403
+ end
404
+ joined_sh = ViralSeq::SeqHash.new(joined_seq)
405
+
406
+ if export_raw
407
+ r1_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r1)
408
+ r2_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r2)
409
+ r1_sh_raw = r1_sh_raw.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
410
+ r2_sh_raw = r2_sh_raw.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
411
+ new_r1_seq_raw = r1_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
412
+ new_r2_seq_raw = r2_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
413
+ joined_seq_raw = {}
414
+ new_r1_seq_raw.each do |seq_name, seq|
415
+ next unless seq
416
+ next unless new_r2_seq_raw[seq_name]
417
+ joined_seq_raw[seq_name] = seq + new_r2_seq_raw[seq_name]
418
+ end
419
+ joined_sh_raw = ViralSeq::SeqHash.new(joined_seq_raw)
420
+ end
421
+ else
422
+ joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
423
+
424
+ if export_raw
425
+ joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
426
+ end
427
+ end
428
+
429
+ log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
430
+ summary_json[:combined_tcs_after_qc] = joined_sh.size
431
+ if primer[:trim]
432
+ trim_start = primer[:trim_ref_start]
433
+ trim_end = primer[:trim_ref_end]
434
+ trim_ref = primer[:trim_ref].to_sym
435
+ joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
436
+ joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
437
+ if export_raw
438
+ joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
439
+ joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
440
+ end
441
+ end
442
+ end
443
+
444
+ File.open(outfile_log, "w") do |f|
445
+ f.puts JSON.pretty_generate(summary_json)
446
+ end
447
+ end
448
+
449
+ log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
450
+ File.unlink(r1_f)
451
+ File.unlink(r2_f)
452
+ log.puts Time.now.to_s + "\t" + "TCS pipeline successfully executed."
453
+ log.close
454
+ puts "DONE!"
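The script above groups paired reads by Primer ID, counts how often each Primer ID occurs, and keeps only those seen more often than the consensus cut-off. The sketch below replays that bookkeeping on toy data in plain Ruby. The local `count_freq` helper is a stand-in written for this example (the script calls a `count_freq` method on arrays that viral_seq provides), the read tags and Primer IDs are invented, and the cut-off is hard-coded rather than computed by `ViralSeq::TcsCore.calculate_cut_off`.

```ruby
# Toy mapping of sequence tag => Primer ID (invented data).
id = {
  'read1' => 'AAAA', 'read2' => 'AAAA', 'read3' => 'AAAA',
  'read4' => 'CCCC', 'read5' => 'CCCC',
  'read6' => 'GGGG'
}

# Stand-in for the array frequency count used by the script.
def count_freq(list)
  list.each_with_object(Hash.new(0)) { |item, counts| counts[item] += 1 }
end

primer_id_count = count_freq(id.values)              # reads per Primer ID
primer_id_dis   = count_freq(primer_id_count.values) # how many Primer IDs share each read count
distinct_to_raw = (primer_id_count.size / id.size.to_f).round(3)

consensus_cutoff = 1 # assumed value for the demo only
over_cutoff = primer_id_count.select { |_pid, count| count > consensus_cutoff }.keys

# Primer ID => list of sequence tags, as in pid_seqtag_hash.
pid_seqtag_hash = id.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(tag, pid), h|
  h[pid] << tag
end

puts primer_id_count.inspect
puts distinct_to_raw
puts over_cutoff.inspect
puts pid_seqtag_hash['AAAA'].inspect
```

With a cut-off of 1, the Primer ID seen only once is dropped, and each remaining Primer ID carries the list of reads from which a consensus would be built.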