viral_seq 1.0.5 → 1.0.10

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: a13ba7912ee87511c2ecf19f07256d3a129661c6d7e180d57ecd1e34978386e6
4
- data.tar.gz: 61e5ed6b423f0b64c53a6bb8e8ec3801bf7e093e4d0741bd71bf9fbfa24f1b55
3
+ metadata.gz: 14d880e9f39b2b87892bec9d4377b358643c880cf32c81872cff51e1007bc23b
4
+ data.tar.gz: 6ee1c3293e2b0403a2eac033335f7575625b2d35f32127b5b57be53e94b4ec7d
5
5
  SHA512:
6
- metadata.gz: f18d03220190bf1479ed29bd4d4b83777ffe5216951d38a91dd2afdc6c07b516883a8694291106b6fee2693a246b8a3c6a824786527cd03730f28f6777fa3231
7
- data.tar.gz: 7fe146b081a7b633de963ed632bdcb548c71d1f401e227109d8745d23ad770d2099a2aa50bc4553a9450b260b7206892ed2a898d9154764aebe4094f38faeb44
6
+ metadata.gz: 951b75ced84aa21cf5650baa6970f60a617d3f29d20c14acadacefabea23d6b584f25990453c2008f30197aaef055a94edbdbb45494bb12b6343d90bc6bd45fb
7
+ data.tar.gz: 68ac69b4ebd5438a8f73780db823c94aa5a78c7c26d02cfd6bec979244dd1d6452c3698ade0606ddbbaccc480ad85e603171c11648dbb0110c2f5dbb3355bb35
@@ -1,15 +1,17 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- viral_seq (1.0.5)
4
+ viral_seq (1.0.10)
5
+ colorize (~> 0.1)
5
6
  muscle_bio (~> 0.4)
6
7
 
7
8
  GEM
8
9
  remote: https://rubygems.org/
9
10
  specs:
11
+ colorize (0.8.1)
10
12
  diff-lcs (1.3)
11
13
  muscle_bio (0.4.0)
12
- rake (10.5.0)
14
+ rake (13.0.1)
13
15
  rspec (3.8.0)
14
16
  rspec-core (~> 3.8.0)
15
17
  rspec-expectations (~> 3.8.0)
@@ -29,9 +31,9 @@ PLATFORMS
29
31
 
30
32
  DEPENDENCIES
31
33
  bundler (~> 2.0)
32
- rake (~> 10.0)
34
+ rake (~> 13.0)
33
35
  rspec (~> 3.0)
34
36
  viral_seq!
35
37
 
36
38
  BUNDLED WITH
37
- 2.0.2
39
+ 2.1.4
data/README.md CHANGED
@@ -4,82 +4,154 @@ A Ruby Gem containing bioinformatics tools for processing viral NGS data.
4
4
 
5
5
  Specifically for Primer-ID sequencing and HIV drug resistance analysis.
6
6
 
7
- ## Installation
7
+ ## Install
8
8
 
9
+ ```bash
9
10
  $ gem install viral_seq
11
+ ```
10
12
 
11
13
  ## Usage
12
14
 
13
- Load all ViralSeq classes by requiring 'viral_seq.rb'
15
+ ### Executables
14
16
 
15
- #!/usr/bin/env ruby
16
- require 'viral_seq'
17
+ Use the `locator` executable from a terminal to get the coordinates of the sequences in a FASTA file on the HIV/SIV reference genome.
18
+
19
+ ```bash
20
+ $ locator -i sequence.fasta -o sequence.fasta.csv
21
+ ```
22
+
23
+ Use the `tcs` executable to process Primer ID MiSeq sequencing data with the TCS pipeline.
24
+
25
+ ```bash
26
+ $ tcs -p params.json # run TCS pipeline with params.json
27
+ $ tcs -j # CLI to generate params.json
28
+ $ tcs -h # print out the help
29
+ ```
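The shape of the `params.json` file passed to `tcs -p` can be sketched from the keys that `bin/tcs` reads. The following is a minimal sketch, not the authoritative schema: `example_tcs_params` is a hypothetical helper, every value (directory path, region name, primer placeholders, reference coordinates) is an illustrative assumption, and the file produced by `tcs -j` should be treated as the reference.

```ruby
require 'json'

# Hypothetical helper: builds a params hash using key names that bin/tcs reads.
# All values are placeholders, not recommended settings.
def example_tcs_params
  {
    raw_sequence_dir: "/path/to/library_dir", # directory holding the R1 and R2 files
    platform_error_rate: 0.02,                 # optional; bin/tcs falls back to 0.02
    primer_pairs: [
      {
        region: "PR",                          # also used as the output sub-directory name
        cdna: "CDNA_PRIMER_PLACEHOLDER",       # placeholder, not a real primer
        forward: "FORWARD_PRIMER_PLACEHOLDER", # placeholder, not a real primer
        majority: 0,                           # consensus cut-off; bin/tcs defaults to 0
        end_join: true,
        end_join_option: 1,
        overlap: 0,
        TCS_QC: true,
        ref_genome: "HXB2",
        ref_start: 2253,
        ref_end: 2549,
        indel: false,
        trim: false,
        export_raw: false                      # read per primer pair in bin/tcs
      }
    ]
  }
end

json_text = JSON.pretty_generate(example_tcs_params)
puts json_text

# bin/tcs parses the file with symbolized keys, which the round trip below mimics.
params = JSON.parse(json_text, symbolize_names: true)
puts params[:primer_pairs].map { |pair| pair[:region] }.join(", ")
```

Writing `json_text` to a file and passing that file to `tcs -p` would be the natural next step, assuming the primer placeholders are replaced with real sequences.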
17
30
 
18
31
  ## Some Examples
19
32
 
20
- ### Load nucleotide sequences from a FASTA format sequence file
33
+ Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
21
34
 
22
- my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
35
+ ```ruby
36
+ #!/usr/bin/env ruby
37
+ require 'viral_seq'
38
+ ```
23
39
 
24
- ### Make an alignment (using MUSCLE)
40
+ Load nucleotide sequences from a FASTA format sequence file
25
41
 
26
- aligned_seqhash = my_seqhash.align
42
+ ```ruby
43
+ my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
44
+ ```
27
45
 
28
- ### Filter nucleotide sequences with the reference coordinates (HIV Protease)
46
+ Make an alignment (using MUSCLE)
29
47
 
30
- qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
48
+ ```ruby
49
+ aligned_seqhash = my_seqhash.align
50
+ ```
31
51
 
32
- ### Further filter out sequences with Apobec3g/f hypermutations
52
+ Filter nucleotide sequences with the reference coordinates (HIV Protease)
33
53
 
34
- qc_seqhash = qc_seqhash.a3g
54
+ ```ruby
55
+ qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
56
+ ```
35
57
 
36
- ### Calculate nucleotide diversity π
58
+ Further filter out sequences with Apobec3g/f hypermutations
37
59
 
38
- qc_seqhash.pi
60
+ ```ruby
61
+ qc_seqhash = qc_seqhash.a3g
62
+ ```
39
63
 
40
- ### Calculate cut-off for minority variants based on Poisson model
64
+ Calculate nucleotide diversity π
41
65
 
42
- cut_off = qc_seqhash.pm
66
+ ```ruby
67
+ qc_seqhash.pi
68
+ ```
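For readers unfamiliar with π, the quantity is commonly defined as the average number of pairwise differences per site. The sketch below computes that textbook definition on plain strings; `nucleotide_diversity` is a hypothetical stand-alone function, it assumes aligned sequences of equal length, and the gem's own `pi` method may treat gaps or ambiguous bases differently.

```ruby
# Average pairwise difference per site for aligned, equal-length sequences.
# This is the textbook definition, not necessarily the gem's implementation.
def nucleotide_diversity(seqs)
  return 0.0 if seqs.size < 2

  length = seqs.first.size
  total_diff = 0
  pairs = 0
  seqs.combination(2) do |a, b|
    # count positions where the two sequences disagree
    total_diff += (0...length).count { |i| a[i] != b[i] }
    pairs += 1
  end
  total_diff.to_f / (pairs * length)
end

puts nucleotide_diversity(%w[AAAA AAAT AATT]).round(4) # prints 0.3333
```

In the example, the three pairs differ at 1, 2, and 1 positions, giving 4 differences over 3 pairs × 4 sites.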
43
69
 
44
- ### Examine for drug resistance mutations for HIV PR region
70
+ Calculate cut-off for minority variants based on Poisson model
45
71
 
46
- qc_seqhash.sdrm_hiv_pr(cut_off)
72
+ ```ruby
73
+ cut_off = qc_seqhash.pm
74
+ ```
47
75
 
48
- ### Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
76
+ Examine drug resistance mutations in the HIV PR region
49
77
 
50
- $ locator -i sequence.fasta -o sequence.fasta.csv
78
+ ```ruby
79
+ qc_seqhash.sdrm_hiv_pr(cut_off)
80
+ ```
51
81
 
52
82
  ## Updates
53
83
 
54
- Version 1.0.5-07112019:
84
+ ### Version 1.1.0-11112020:
85
+
86
+ 1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
87
+ 2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
88
+ 3. The consensus model now includes a true simple-majority model, where no nucleotide needs to exceed 50% to be called.
89
+ 4. a few optimizations.
90
+ 5. TCS 2.1.0 delivered.
91
+ 6. Tried parallel processing without success: the `parallel` gem by default cannot pool data from the memory of child processes, and `in_threads` does not improve speed.
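The difference between a simple-majority call and a thresholded call (item 3 above) can be illustrated with a short sketch. `majority_consensus` is a hypothetical function written for this note, not the gem's `consensus` method; calling ties and below-cut-off positions as `N` is an assumption of the sketch.

```ruby
# Call a consensus base at each position of a set of reads.
# cutoff = 0 : simple majority, the most frequent base wins even below 50%.
# cutoff > 0 : the most frequent base must exceed this fraction (e.g. 0.5).
# Ties and bases that miss the cut-off become "N" (assumption of this sketch).
def majority_consensus(seqs, cutoff = 0)
  return "" if seqs.empty?

  length = seqs.map(&:size).min
  (0...length).map do |i|
    counts = Hash.new(0)
    seqs.each { |s| counts[s[i]] += 1 }
    ranked = counts.sort_by { |_base, n| -n }
    top_base, top_n = ranked[0]
    tied = ranked.size > 1 && ranked[1][1] == top_n
    passes = top_n.to_f / seqs.size > cutoff
    (tied || !passes) ? "N" : top_base
  end.join
end

reads = %w[ACGT ACGA ACCC ATTG AGAT]
puts majority_consensus(reads)      # prints ACGT
puts majority_consensus(reads, 0.5) # prints ACNN
```

At the last two positions the winning base is present in only 2 of 5 reads (40%), so it is called under the simple-majority model but not under a 50% cut-off.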
92
+
93
+ ### Version 1.0.9-07182020:
94
+
95
+ 1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
96
+
97
+ 2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads that pass the quality filters will be exported along with the TCS reads.
98
+
99
+ ### Version 1.0.8-02282020:
100
+
101
+ 1. TCS pipeline (version 2.0.0) added as executable.
102
+ tcs - main TCS pipeline script.
103
+ tcs_json_generator - step-by-step script to generate json file for tcs pipeline.
104
+
105
+ 2. Methods added:
106
+ ViralSeq::SeqHash#trim
107
+
108
+ 3. Bug fix for several methods.
109
+
110
+ ### Version 1.0.7-01282020:
111
+
112
+ 1. Several methods added, including
113
+ ViralSeq::SeqHash#error_table
114
+ ViralSeq::SeqHash#random_select
115
+ 2. Improved performance for several functions.
116
+
117
+ ### Version 1.0.6-07232019:
118
+
119
+ 1. Several methods added to ViralSeq::SeqHash, including
120
+ ViralSeq::SeqHash#size
121
+ ViralSeq::SeqHash#+
122
+ ViralSeq::SeqHash#write_nt_fa
123
+ ViralSeq::SeqHash#mutation
124
+ 2. Update documentation and rspec samples.
125
+
126
+ ### Version 1.0.5-07112019:
55
127
 
56
- 1. Update ViralSeq::SeqHash#sequence_locator.
57
- Program will try to determine the direction (`+` or `-` of the query sequence)
58
- 2. update executable `locator` to have a column of `direction` in output .csv file
128
+ 1. Update ViralSeq::SeqHash#sequence_locator.
129
+ Program will try to determine the direction (`+` or `-`) of the query sequence
130
+ 2. update executable `locator` to have a column of `direction` in output .csv file
59
131
 
60
- Version 1.0.4-07102019:
132
+ ### Version 1.0.4-07102019:
61
133
 
62
- 1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
63
- 2. Fix bugs in bin `locator`
134
+ 1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
135
+ 2. Fix bugs in bin `locator`
64
136
 
65
- Version 1.0.3-07102019:
137
+ ### Version 1.0.3-07102019:
66
138
 
67
- 1. Bug fix.
139
+ 1. Bug fix.
68
140
 
69
- Version 1.0.2-07102019:
141
+ ### Version 1.0.2-07102019:
70
142
 
71
- 1. Fixed a gem loading issue.
143
+ 1. Fixed a gem loading issue.
72
144
 
73
- Version 1.0.1-07102019:
145
+ ### Version 1.0.1-07102019:
74
146
 
75
- 1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
76
- 2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
77
- 3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
78
- 4. update documentations
147
+ 1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
148
+ 2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
149
+ 3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
150
+ 4. Update documentation
79
151
 
80
- Version 1.0.0-07092019:
152
+ ### Version 1.0.0-07092019:
81
153
 
82
- 1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
154
+ 1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
83
155
 
84
156
  ## Development
85
157
 
@@ -1,15 +1,36 @@
1
1
  #!/usr/bin/env ruby
2
2
 
3
+ # Copyright (c) 2020 Shuntai Zhou (shuntai.zhou@gmail.com)
4
+ #
5
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ # of this software and associated documentation files (the "Software"), to deal
7
+ # in the Software without restriction, including without limitation the rights
8
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ # copies of the Software, and to permit persons to whom the Software is
10
+ # furnished to do so, subject to the following conditions:
11
+ #
12
+ # The above copyright notice and this permission notice shall be included in
13
+ # all copies or substantial portions of the Software.
14
+ #
15
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ # THE SOFTWARE.
22
+
3
23
  require 'viral_seq'
4
24
  require 'csv'
5
25
  require 'optparse'
26
+ require 'colorize'
6
27
 
7
28
  def myparser
8
29
  options = {}
9
30
  OptionParser.new do |opts|
10
- opts.banner = "Usage: locator -i [nt_sequence_fasta_file] -o [locator_info_csv_file] -r [reference_genome_option]"
31
+ opts.banner = "#{"Usage:".red.bold} locator #{"-i".blue.bold} [nt_sequence_fasta_file] #{"-o".blue.bold} [locator_info_csv_file] #{"-r".blue.bold} [reference_genome_option]"
11
32
 
12
- opts.on('-i', '--infile FASTA_FILE', 'nt sequence file in FASTA format') do |i|
33
+ opts.on('-i', '--infile FASTA_FILE', "#{"nt sequence".blue.bold} file in FASTA format") do |i|
13
34
  options[:infile] = i
14
35
  end
15
36
 
@@ -17,7 +38,7 @@ def myparser
17
38
  options[:outfile] = o
18
39
  end
19
40
 
20
- opts.on('-r', '--ref_option OPTION', 'reference genome option, choose from `HXB2` (default), `NL43`, `MAC239`') do |o|
41
+ opts.on('-r', '--ref_option OPTION', "reference genome option, choose from #{"`HXB2` (default), `NL43`, `MAC239`".blue.bold}") do |o|
21
42
  options[:ref_option] = o.to_sym
22
43
  end
23
44
 
@@ -35,9 +56,9 @@ def myparser
35
56
  return options
36
57
  end
37
58
 
38
- puts "\nSequence Locator (RubyGem::ViralSeq Version #{ViralSeq::VERSION}) by Shuntai Zhou"
39
- puts "See details at https://github.com/ViralSeq/viral_seq\n"
40
- puts "Resembling Sequence Locator from LANL (https://www.hiv.lanl.gov/content/sequence/LOCATE/locate.html)\n\n"
59
+ puts "\n" + "Sequence Locator (RubyGem::ViralSeq Version #{ViralSeq::VERSION})".red.bold + " by " + "Shuntai Zhou".blue.bold
60
+ puts "See details at " + "https://github.com/ViralSeq/viral_seq\n".blue
61
+ puts "Resembling" + " Sequence Locator ".magenta.bold + "from LANL" + " (https://www.hiv.lanl.gov/content/sequence/LOCATE/locate.html)\n".blue
41
62
 
42
63
  ARGV << '-h' if ARGV.size == 0
43
64
 
@@ -47,7 +68,7 @@ begin
47
68
  if options[:infile]
48
69
  seq_file = options[:infile]
49
70
  else
50
- raise StandardError.new("Input file sequence file not found")
71
+ raise StandardError.new("Input sequence file not found".red.bold)
51
72
  end
52
73
 
53
74
  if options[:outfile]
@@ -57,14 +78,14 @@ begin
57
78
  end
58
79
 
59
80
  unless File.exist?(seq_file)
60
- raise StandardError.new("Input file sequence file not found")
81
+ raise StandardError.new("Input sequence file not found".red.bold)
61
82
  end
62
83
 
63
84
  seqs = ViralSeq::SeqHash.fa(seq_file)
64
85
  opt = options[:ref_option] ? options[:ref_option] : :HXB2
65
86
 
66
87
  unless [:HXB2, :NL43, :MAC239].include? opt
67
- puts "Reference option #{opt} not recognized, using `:HXB2` as the reference genome."
88
+ puts "Reference option `#{opt}` not recognized, using `HXB2` as the reference genome.".red.bold
68
89
  opt = :HXB2
69
90
  end
70
91
 
@@ -76,6 +97,7 @@ begin
76
97
  end
77
98
 
78
99
  File.write(csv_file, data)
100
+ puts "Output file found at #{csv_file.green.bold}"
79
101
  rescue StandardError => e
80
102
  puts e.message
81
103
  puts "\n"
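The option handling in the `locator` script can be exercised in isolation with the standard library alone. The sketch below is a simplified stand-in, not the script itself: `parse_locator_args` is a hypothetical function, the `--outfile` long option name is an assumption (that part of the file is not shown in this diff), colorized strings are omitted, and an unrecognized reference option falls back to `:HXB2` silently rather than printing a warning.

```ruby
require 'optparse'

# Parse locator-style arguments into a hash without touching the global ARGV.
def parse_locator_args(argv)
  options = {}
  OptionParser.new do |opts|
    opts.banner = "Usage: locator -i [nt_sequence_fasta_file] -o [locator_info_csv_file] -r [reference_genome_option]"
    opts.on('-i', '--infile FASTA_FILE', 'nt sequence file in FASTA format') { |i| options[:infile] = i }
    # '--outfile' is an assumed long option name
    opts.on('-o', '--outfile CSV_FILE', 'output .csv file') { |o| options[:outfile] = o }
    opts.on('-r', '--ref_option OPTION', 'reference genome option') { |r| options[:ref_option] = r.to_sym }
  end.parse!(argv.dup)

  # Fall back to :HXB2 when the option is missing or not one of the known genomes.
  ref = options[:ref_option] || :HXB2
  options[:ref_option] = [:HXB2, :NL43, :MAC239].include?(ref) ? ref : :HXB2
  options
end

p parse_locator_args(%w[-i seqs.fasta -r NL43])
p parse_locator_args(%w[-i seqs.fasta -r UNKNOWN])
```

Keeping the parser in a function that takes an argument array makes the fallback behavior easy to check without running the executable.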
data/bin/tcs ADDED
@@ -0,0 +1,450 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # TCS pipeline for Primer ID sequencing data analysis.
4
+
5
+ # Copyright (c) 2020 Shuntai Zhou (shuntai.zhou@gmail.com)
6
+ #
7
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
8
+ # of this software and associated documentation files (the "Software"), to deal
9
+ # in the Software without restriction, including without limitation the rights
10
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
11
+ # copies of the Software, and to permit persons to whom the Software is
12
+ # furnished to do so, subject to the following conditions:
13
+ #
14
+ # The above copyright notice and this permission notice shall be included in
15
+ # all copies or substantial portions of the Software.
16
+ #
17
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
18
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
19
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
20
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
21
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
22
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
23
+ # THE SOFTWARE.
24
+
25
+ # Use JSON file as the run param
26
+ # run `tcs -j` to generate the param JSON file.
27
+
28
+ require 'viral_seq'
29
+ require 'json'
30
+ require 'colorize'
31
+ require 'optparse'
32
+
33
+ options = {}
34
+
35
+ banner = '-'*50 + "\n" +
36
+ '| The TCS Pipeline ' + "Version #{ViralSeq::TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |' + "\n" +
37
+ '-'*50 + "\n"
38
+
39
+ OptionParser.new do |opts|
40
+ opts.banner = banner + "Usage: tcs -j"
41
+ opts.on "-j", "--json_generator", "Command line interface to generate new params json file" do |j|
42
+ options[:json_generator] = true
43
+ end
44
+
45
+ opts.on("-p", "--params PARAMS_JSON", "Execute the pipeline with input params json file") do |p|
46
+ options[:params_json] = p
47
+ end
48
+
49
+ opts.on("-h", "--help", "Prints this help") do
50
+ puts opts
51
+ exit
52
+ end
53
+
54
+ opts.on("-v", "--version", "Version info") do
55
+ puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
56
+ puts "viral_seq version: " + ViralSeq::VERSION.red.bold
57
+ exit
58
+ end
59
+
60
+ # opts.on("--no-parallel", "toggle off parallel processing") do
61
+ # options[:no_parallel] = true
62
+ # end
63
+ end.parse!
64
+
65
+ if options[:json_generator]
66
+ params = ViralSeq::TcsJson.generate
67
+ elsif (options[:params_json] && File.exist?(options[:params_json]))
68
+ params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
69
+ else
70
+ abort "No params JSON file found. Script terminated.".red
71
+ end
72
+
73
+ indir = params[:raw_sequence_dir]
74
+
75
+ unless File.exist?(indir)
76
+ abort "No input sequence directory found. Script terminated.".red.bold
77
+ end
78
+
79
+ # log file
80
+
81
+ runtime_log_file = File.join(indir,"runtime.log")
82
+ log = File.open(runtime_log_file, "w")
83
+ log.puts "TCS pipeline Version " + ViralSeq::TCS_VERSION.to_s
84
+ log.puts "viral_seq Version " + ViralSeq::VERSION.to_s
85
+ log.puts Time.now.to_s + "\t" + "Start TCS pipeline..."
86
+
87
+ libname = File.basename indir
88
+
89
+ seq_files = ViralSeq::TcsCore.r1r2 indir
90
+
91
+ if seq_files[:r1_file].size > 0 and seq_files[:r2_file].size > 0
92
+ r1_f = seq_files[:r1_file]
93
+ r2_f = seq_files[:r2_file]
94
+ elsif seq_files[:r1_file].size > 0 and seq_files[:r2_file].empty?
95
+ exit_sig = "Missing R2 file. Aborted."
96
+ elsif seq_files[:r2_file].size > 0 and seq_files[:r1_file].empty?
97
+ exit_sig = "Missing R1 file. Aborted."
98
+ else
99
+ exit_sig = "Cannot determine R1 R2 file in #{indir}. Aborted."
100
+ end
101
+
102
+ if exit_sig
103
+ ViralSeq::TcsCore.log_and_abort log, exit_sig
104
+ end
105
+
106
+ r1_fastq_sh = ViralSeq::SeqHash.fq(r1_f)
107
+ r2_fastq_sh = ViralSeq::SeqHash.fq(r2_f)
108
+
109
+ raw_sequence_number = r1_fastq_sh.size
110
+ log.puts Time.now.to_s + "\tRaw sequence number: #{raw_sequence_number.to_s}"
111
+
112
+ if params[:platform_error_rate]
113
+ error_rate = params[:platform_error_rate]
114
+ else
115
+ error_rate = 0.02
116
+ end
117
+
118
+ primers = params[:primer_pairs]
119
+ if primers.empty?
120
+ ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
121
+ end
122
+
123
+
124
+ primers.each do |primer|
125
+ summary_json = {}
126
+ summary_json[:tcs_version] = ViralSeq::TCS_VERSION
127
+ summary_json[:viralseq_version] = ViralSeq::VERSION
128
+ summary_json[:runtime] = Time.now.to_s
129
+
130
+ primer[:region] ? region = primer[:region] : region = "region"
131
+ summary_json[:primer_set_name] = region
132
+
133
+ cdna_primer = primer[:cdna]
134
+ forward_primer = primer[:forward]
135
+
136
+ export_raw = primer[:export_raw]
137
+
138
+ unless cdna_primer
139
+ log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
+ next
140
+ end
141
+ unless forward_primer
142
+ log.puts Time.now.to_s + "\t" + region + " does not have forward primer sequence. #{region} skipped."
+ next
143
+ end
144
+ summary_json[:cdan_primer] = cdna_primer
145
+ summary_json[:forward_primer] = forward_primer
146
+
147
+ primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0
148
+ summary_json[:majority_cut_off] = majority_cut_off
149
+
150
+ summary_json[:total_raw_sequence] = raw_sequence_number
151
+
152
+ log.puts Time.now.to_s + "\t" + "Processing #{region}..."
153
+
154
+ # filter R1
155
+ log.puts Time.now.to_s + "\t" + "filtering R1..."
156
+ filter_r1 = ViralSeq::TcsCore.filter_r1(r1_fastq_sh, forward_primer)
157
+ r1_passed_seq = filter_r1[:r1_passed_seq]
158
+ log.puts Time.now.to_s + "\t" + "R1 filtered: #{r1_passed_seq.size.to_s}"
159
+ summary_json[:r1_filtered_raw] = r1_passed_seq.size
160
+
161
+ # filter R2
162
+ log.puts Time.now.to_s + "\t" + "filtering R2..."
163
+ filter_r2 = ViralSeq::TcsCore.filter_r2(r2_fastq_sh, cdna_primer)
164
+ r2_passed_seq = filter_r2[:r2_passed_seq]
165
+ pid_length = filter_r2[:pid_length]
166
+ log.puts Time.now.to_s + "\t" + "R2 filtered: #{r2_passed_seq.size.to_s}"
167
+ summary_json[:r2_filtered_raw] = r2_passed_seq.size
168
+
169
+ # pair-end
170
+ log.puts Time.now.to_s + "\t" + "Pairing R1 and R2 seqs..."
171
+ id = {} # hash for :sequence_tag => primer_id
172
+ bio_r2 = {} # hash for :sequence_tag => primer_trimmed_r2_sequence
173
+ bio_r1 = {} # hash for :sequence_tag => primer_trimmed_r1_sequence
174
+ common_keys = r1_passed_seq.keys & r2_passed_seq.keys
175
+ paired_seq_number = common_keys.size
176
+ log.puts Time.now.to_s + "\t" + "Paired raw sequences: #{paired_seq_number.to_s}"
177
+ summary_json[:paired_raw_sequence] = paired_seq_number
178
+
179
+ common_keys.each do |seqtag|
180
+ r1_seq = r1_passed_seq[seqtag]
181
+ r2_seq = r2_passed_seq[seqtag]
182
+ pid = r2_seq[0, pid_length]
183
+ id[seqtag] = pid
184
+ bio_r2[seqtag] = r2_seq[filter_r2[:reverse_starting_number]..-2]
185
+ bio_r1[seqtag] = r1_seq[filter_r1[:forward_starting_number]..-2]
186
+ end
187
+
188
+ # TCS cut-off
189
+ log.puts Time.now.to_s + "\t" + "Calculate consensus cutoff...."
190
+
191
+ primer_id_list = id.values
192
+ primer_id_count = primer_id_list.count_freq
193
+ primer_id_dis = primer_id_count.values.count_freq
194
+
195
+ # calculate distinct_to_raw
196
+ distinct_to_raw = (primer_id_count.size/primer_id_list.size.to_f).round(3)
197
+ summary_json[:distinct_to_raw] = distinct_to_raw
198
+
199
+ if primer_id_dis.keys.size < 5
200
+ log.puts Time.now.to_s + "\t" + "Less than 5 Primer IDs detected. Region #{region} aborted."
201
+ next
202
+ end
203
+
204
+ max_id = primer_id_dis.keys.sort[-5..-1].mean
205
+ consensus_cutoff = ViralSeq::TcsCore.calculate_cut_off(max_id,error_rate)
206
+ log.puts Time.now.to_s + "\t" + "Consensus cut-off is #{consensus_cutoff.to_s}"
207
+ summary_json[:consensus_cutoff] = consensus_cutoff
208
+ summary_json[:length_of_pid] = pid_length
209
+ log.puts Time.now.to_s + "\t" + "Creating consensus..."
210
+
211
+ # Primer ID over the cut-off
212
+ primer_id_count_over_n = []
213
+ primer_id_count.each do |primer_id,count|
214
+ primer_id_count_over_n << primer_id if count > consensus_cutoff
215
+ end
216
+ pid_to_process = primer_id_count_over_n.size
217
+ log.puts Time.now.to_s + "\t" + "Number of consensus to process: #{pid_to_process.to_s}"
218
+ summary_json[:total_tcs_with_ambiguities] = pid_to_process
219
+
220
+ # setup output path
221
+ out_dir_set = File.join(indir, region)
222
+ Dir.mkdir(out_dir_set) unless File.directory?(out_dir_set)
223
+ out_dir_consensus = File.join(out_dir_set, "consensus")
224
+ Dir.mkdir(out_dir_consensus) unless File.directory?(out_dir_consensus)
225
+
226
+ outfile_r1 = File.join(out_dir_consensus, 'r1.fasta')
227
+ outfile_r2 = File.join(out_dir_consensus, 'r2.fasta')
228
+ outfile_log = File.join(out_dir_set, 'log.json')
229
+
230
+ # if export_raw is true, create dir for raw sequence
231
+ if export_raw
232
+ out_dir_raw = File.join(out_dir_set, "raw")
233
+ Dir.mkdir(out_dir_raw) unless File.directory?(out_dir_raw)
234
+ outfile_raw_r1 = File.join(out_dir_raw, 'r1.raw.fasta')
235
+ outfile_raw_r2 = File.join(out_dir_raw, 'r2.raw.fasta')
236
+ raw_r1_f = File.open(outfile_raw_r1, 'w')
237
+ raw_r2_f = File.open(outfile_raw_r2, 'w')
238
+
239
+ bio_r1.keys.each do |k|
240
+ raw_r1_f.puts k + "_r1"
241
+ raw_r2_f.puts k + "_r2"
242
+ raw_r1_f.puts bio_r1[k]
243
+ raw_r2_f.puts bio_r2[k].rc
244
+ end
245
+
246
+ raw_r1_f.close
247
+ raw_r2_f.close
248
+ end
249
+
250
+ # create TCS
251
+
252
+ pid_seqtag_hash = {}
253
+ id.each do |name, pid|
254
+ if pid_seqtag_hash[pid]
255
+ pid_seqtag_hash[pid] << name
256
+ else
257
+ pid_seqtag_hash[pid] = []
258
+ pid_seqtag_hash[pid] << name
259
+ end
260
+ end
261
+
262
+ consensus = {}
263
+ r1_temp = {}
264
+ r2_temp = {}
265
+ m = 0
266
+ primer_id_count_over_n.each do |primer_id|
267
+ m += 1
268
+ log.puts Time.now.to_s + "\t" + "Now processing number #{m}" if m%100 == 0
269
+ seq_with_same_primer_id = pid_seqtag_hash[primer_id]
270
+ r1_sub_seq = []
271
+ r2_sub_seq = []
272
+ seq_with_same_primer_id.each do |seq_name|
273
+ r1_sub_seq << bio_r1[seq_name]
274
+ r2_sub_seq << bio_r2[seq_name]
275
+ end
276
+
277
+ # consensus name: Primer ID, number of raw sequences with that Primer ID, library name, and region name.
278
+ consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
279
+ r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
280
+ r2_consensus = ViralSeq::SeqHash.array(r2_sub_seq).consensus(majority_cut_off)
281
+
282
+ # comment out the following two lines to allow sequences with ambiguities.
283
+ next if r1_consensus =~ /[^ATCG]/
284
+ next if r2_consensus =~ /[^ATCG]/
285
+
286
+ # reverse complement sequence of the R2 region
287
+ r2_consensus = r2_consensus.rc
288
+ consensus[consensus_name] = [r1_consensus, r2_consensus]
289
+ r1_temp[consensus_name] = r1_consensus
290
+ r2_temp[consensus_name] = r2_consensus
291
+ end
292
+ r1_temp_sh = ViralSeq::SeqHash.new(r1_temp)
293
+ r2_temp_sh = ViralSeq::SeqHash.new(r2_temp)
294
+
295
+ # filter consensus sequences for residual offspring PIDs
296
+ consensus_filtered = {}
297
+ consensus_number_temp = consensus.size
298
+ max_pid_comb = 4**pid_length
299
+ if consensus_number_temp < 0.003*max_pid_comb
300
+ log.puts Time.now.to_s + "\t" + "Applying PID post TCS filter..."
301
+ r1_consensus_filtered = r1_temp_sh.filter_similar_pid.dna_hash
302
+ r2_consensus_filtered = r2_temp_sh.filter_similar_pid.dna_hash
303
+ common_pid = r1_consensus_filtered.keys & r2_consensus_filtered.keys
304
+ common_pid.each do |pid|
305
+ consensus_filtered[pid] = [r1_consensus_filtered[pid], r2_consensus_filtered[pid]]
306
+ end
307
+ else
308
+ consensus_filtered = consensus
309
+ end
310
+ n_con = consensus_filtered.size
311
+ log.puts Time.now.to_s + "\t" + "Number of consensus sequences: " + n_con.to_s
312
+ summary_json[:total_tcs] = n_con
313
+ summary_json[:resampling_param] = (n_con/pid_to_process.to_f).round(3)
314
+
315
+ log.puts Time.now.to_s + "\t" + "Writing R1 and R2 files..."
316
+ # r1_file output
317
+ f1 = File.open(outfile_r1, 'w')
318
+ f2 = File.open(outfile_r2, 'w')
319
+ primer_id_in_use = {}
320
+ r1_seq_length = consensus_filtered.values[0][0].size
321
+ r2_seq_length = consensus_filtered.values[0][1].size
322
+ log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
323
+ log.puts Time.now.to_s + "\t" + "R2 sequence #{r2_seq_length} bp"
324
+ consensus_filtered.each do |seq_name,seq|
325
+ f1.print seq_name + "_r1\n" + seq[0] + "\n"
326
+ f2.print seq_name + "_r2\n" + seq[1] + "\n"
327
+ primer_id_in_use[seq_name.split("_")[0][1..-1]] = seq_name.split("_")[1].to_i
328
+ end
329
+ f1.close
330
+ f2.close
331
+
332
+ # Primer ID distribution in .json file
333
+ out_pid_json = File.join(out_dir_set, 'primer_id.json')
334
+ pid_json = {}
335
+ pid_json[:primer_id_in_use] = Hash[*(primer_id_in_use.sort_by {|k, v| [-v,k]}.flatten)]
336
+ pid_json[:primer_id_distribution] = Hash[*(primer_id_dis.sort_by{|k,v| k}.flatten)]
337
+ pid_json[:primer_id_frequency] = Hash[*(primer_id_count.sort_by {|k, v| [-v,k]}.flatten)]
338
+ File.open(out_pid_json, 'w') do |f|
339
+ f.puts JSON.pretty_generate(pid_json)
340
+ end
341
+
342
+ # start end-join
343
+ def end_join(dir, option, overlap)
344
+ shp = ViralSeq::SeqHashPair.fa(dir)
345
+ case option
346
+ when 1
347
+ joined_sh = shp.join1()
348
+ when 2
349
+ joined_sh = shp.join1(overlap)
350
+ when 3
351
+ joined_sh = shp.join2
352
+ when 4
353
+ joined_sh = shp.join2(model: :indiv)
354
+ end
355
+ return joined_sh
356
+ end
357
+
358
+ if primer[:end_join]
359
+ log.puts Time.now.to_s + "\t" + "Start end-pairing for TCS..."
360
+ shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
361
+ joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
362
+ log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
363
+ summary_json[:combined_tcs] = joined_sh.size
364
+
365
+ if export_raw
366
+ joined_sh_raw = end_join(out_dir_raw, primer[:end_join_option], primer[:overlap])
367
+ end
368
+
369
+ else
370
+ File.open(outfile_log, "w") do |f|
371
+ f.puts JSON.pretty_generate(summary_json)
372
+ end
373
+ next
374
+ end
375
+
376
+ if primer[:TCS_QC]
377
+ ref_start = primer[:ref_start]
378
+ ref_end = primer[:ref_end]
379
+ ref_genome = primer[:ref_genome].to_sym
380
+ indel = primer[:indel]
381
+ if ref_start == 0
382
+ ref_start = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
383
+ end
384
+ if ref_end == 0
385
+ ref_end = 0..(ViralSeq::RefSeq.get(ref_genome).size - 1)
386
+ end
387
+ if primer[:end_join_option] == 1 and primer[:overlap] == 0
388
+ r1_sh = ViralSeq::SeqHash.fa(outfile_r1)
389
+ r2_sh = ViralSeq::SeqHash.fa(outfile_r2)
390
+ r1_sh = r1_sh.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
391
+ r2_sh = r2_sh.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
392
+ new_r1_seq = r1_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
393
+ new_r2_seq = r2_sh.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
394
+ joined_seq = {}
395
+ new_r1_seq.each do |seq_name, seq|
396
+ next unless seq
397
+ next unless new_r2_seq[seq_name]
398
+ joined_seq[seq_name] = seq + new_r2_seq[seq_name]
399
+ end
400
+ joined_sh = ViralSeq::SeqHash.new(joined_seq)
401
+
402
+ if export_raw
403
+ r1_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r1)
404
+ r2_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r2)
405
+ r1_sh_raw = r1_sh_raw.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
406
+ r2_sh_raw = r2_sh_raw.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
407
+ new_r1_seq_raw = r1_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
408
+ new_r2_seq_raw = r2_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
409
+ joined_seq_raw = {}
410
+ new_r1_seq_raw.each do |seq_name, seq|
411
+ next unless seq
412
+ next unless new_r2_seq_raw[seq_name]
413
+ joined_seq_raw[seq_name] = seq + new_r2_seq_raw[seq_name]
414
+ end
415
+ joined_sh_raw = ViralSeq::SeqHash.new(joined_seq_raw)
416
+ end
417
+ else
418
+ joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
419
+
420
+ if export_raw
421
+ joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
422
+ end
423
+ end
424
+
425
+ log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
426
+ summary_json[:combined_tcs_after_qc] = joined_sh.size
427
+ if primer[:trim]
428
+ trim_start = primer[:trim_ref_start]
429
+ trim_end = primer[:trim_ref_end]
430
+ trim_ref = primer[:trim_ref].to_sym
431
+ joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
432
+ joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
433
+ if export_raw
434
+ joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
435
+ joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
436
+ end
437
+ end
438
+ end
439
+
440
+ File.open(outfile_log, "w") do |f|
441
+ f.puts JSON.pretty_generate(summary_json)
442
+ end
443
+ end
444
+
445
+ log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
446
+ File.unlink(r1_f)
447
+ File.unlink(r2_f)
448
+ log.puts Time.now.to_s + "\t" + "TCS pipeline successfully executed."
449
+ log.close
450
+ puts "DONE!"
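The Primer ID bookkeeping in the script above (`count_freq`, `distinct_to_raw`, and the grouping of sequence tags by Primer ID) can be reproduced in plain Ruby. This is a minimal sketch under stated assumptions: `count_freq` here is a stand-alone function that mimics what the script's `Array#count_freq` calls appear to do, `primer_id_stats` is a hypothetical helper, and the read tags and Primer IDs are invented.

```ruby
# Frequency table: element => number of occurrences.
def count_freq(list)
  list.each_with_object(Hash.new(0)) { |item, h| h[item] += 1 }
end

# Summarize a { sequence_tag => primer_id } hash the way the pipeline does.
def primer_id_stats(tag_to_pid)
  pid_list = tag_to_pid.values
  pid_count = count_freq(pid_list)         # Primer ID => number of raw reads
  pid_dis = count_freq(pid_count.values)   # read count => number of Primer IDs with that count
  groups = Hash.new { |h, k| h[k] = [] }   # Primer ID => sequence tags
  tag_to_pid.each { |tag, pid| groups[pid] << tag }
  {
    count: pid_count,
    distribution: pid_dis,
    distinct_to_raw: (pid_count.size / pid_list.size.to_f).round(3),
    groups: groups
  }
end

ids = { "read1" => "AAA", "read2" => "AAA", "read3" => "CCC", "read4" => "AAA", "read5" => "GGG" }
stats = primer_id_stats(ids)
puts stats[:distinct_to_raw] # prints 0.6
puts stats[:groups]["AAA"].join(",") # prints read1,read2,read4
```

Only Primer IDs whose read count exceeds the consensus cut-off would then be passed on to consensus building; the cut-off calculation itself lives in `ViralSeq::TcsCore` and is not reproduced here.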