viral_seq 1.1.0 → 1.2.2

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: ea453e452e6832e942512cdb94462c33af89ffd8295017806c9aa6ff7ec77ad4
4
- data.tar.gz: 2bb89d193e0e84ebe0791882c53e226a0a934ea3b9d1e61f87b8ffff6c22af1b
3
+ metadata.gz: a235cae95121a8522a47620eb9f8c05a3e2e416084743cd23df43aff7870a2c4
4
+ data.tar.gz: f0ce3a9412774eed703b0b0b663e7bb2dccf340f3f558cffdca85e920291794d
5
5
  SHA512:
6
- metadata.gz: 9dc0403ecaea119d3aa3e832305a0bd4f038fdb71789dcd036080fa89b0e454ee79001b6042df171364e4207a93b2d4d5747336b2fb7f8fb7d83103f5d641134
7
- data.tar.gz: 510ccfce7d717b56d55e2477ae01124009d1f53f010635759cf2f69afe0132313e08db9abaae1ec6d8d894961beba1c2d70a637eafa9b57b05f0aac3372cd0ca
6
+ metadata.gz: b97f98e40b8257281bd29cee40942d16084cf175933fc8357838ebb2a9eede1ab93ba323dbf315afb300f0a7852b2c6d939235831124710fc6f16f109e3eafc5
7
+ data.tar.gz: 4d660da22c69ce1ff929ed7f67d2b03aad662bb0237e9a93d9a8ea6bd1866d8544ad108db9ab8a11eee2df992395e41b68ffc43a8d1dbb132cc1f83a897676ef
data/Gemfile.lock CHANGED
@@ -1,16 +1,27 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- viral_seq (1.0.13)
5
- colorize (~> 0.1)
6
- muscle_bio (~> 0.4)
4
+ viral_seq (1.1.1)
5
+ colorize (>= 0.1)
6
+ combine_pdf (>= 1.0.0)
7
+ muscle_bio (>= 0.4)
8
+ prawn (>= 2.3.0)
9
+ prawn-table (>= 0.2.0)
7
10
 
8
11
  GEM
9
12
  remote: https://rubygems.org/
10
13
  specs:
11
14
  colorize (0.8.1)
15
+ combine_pdf (1.0.21)
16
+ ruby-rc4 (>= 0.1.5)
12
17
  diff-lcs (1.3)
13
18
  muscle_bio (0.4.0)
19
+ pdf-core (0.9.0)
20
+ prawn (2.4.0)
21
+ pdf-core (~> 0.9.0)
22
+ ttfunk (~> 1.7)
23
+ prawn-table (0.2.2)
24
+ prawn (>= 1.3.0, < 3.0.0)
14
25
  rake (13.0.1)
15
26
  rspec (3.8.0)
16
27
  rspec-core (~> 3.8.0)
@@ -25,6 +36,8 @@ GEM
25
36
  diff-lcs (>= 1.2.0, < 2.0)
26
37
  rspec-support (~> 3.8.0)
27
38
  rspec-support (3.8.0)
39
+ ruby-rc4 (0.1.5)
40
+ ttfunk (1.7.0)
28
41
 
29
42
  PLATFORMS
30
43
  ruby
data/README.md CHANGED
@@ -1,5 +1,11 @@
1
1
  # ViralSeq
2
2
 
3
+ [![Gem Version](https://img.shields.io/gem/v/viral_seq?color=%2300e673&style=flat-square)](https://rubygems.org/gems/viral_seq)
4
+ ![GitHub](https://img.shields.io/github/license/viralseq/viral_seq)
5
+ ![Gem](https://img.shields.io/gem/dt/viral_seq?color=%23E9967A)
6
+ ![GitHub last commit](https://img.shields.io/github/last-commit/viralseq/viral_seq?color=%2300BFFF)
7
+ [![Join the chat at https://gitter.im/viral_seq/community](https://badges.gitter.im/viral_seq/community.svg)](https://gitter.im/viral_seq/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
8
+
3
9
  A Ruby Gem containing bioinformatics tools for processing viral NGS data.
4
10
 
5
11
  Specifically for Primer ID sequencing and HIV drug resistance analysis.
@@ -7,11 +13,12 @@ Specifically for Primer ID sequencing and HIV drug resistance analysis.
7
13
  ## Illustration for the Primer ID Sequencing
8
14
 
9
15
 
10
- ![Primer ID Sequencing](https://storage.googleapis.com/tcs-dr-public/pid.png)
16
+ ![Primer ID Sequencing](./docs/assets/img/cover.jpg)
11
17
 
12
18
  ### Reference readings on the Primer ID sequencing
13
- [Primer ID JID paper](https://doi.org/10.21769/BioProtoc.3938)
14
- [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
19
+ [Explantion of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
20
+ [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
21
+ [Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
15
22
 
16
23
  ## Install
17
24
 
@@ -24,14 +31,23 @@ Specifically for Primer ID sequencing and HIV drug resistance analysis.
24
31
  ### Excutables
25
32
 
26
33
  ### `tcs`
27
- Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
34
+ Use executable `tcs` pipeline (v2.3.2) to process **Primer ID MiSeq sequencing** data.
28
35
 
29
36
  Example commands:
30
37
  ```bash
31
38
  $ tcs -p params.json # run TCS pipeline with params.json
39
+ $ tcs -p params.json -i DIRECTORY
40
+ # run TCS pipeline with params.json and DIRECTORY
41
+ # if DIRECTORY is not defined in params.json
42
+ $ tcs -dr -i DIRECTORY
43
+ # run tcs-dr (MPID HIV drug resistance sequencing) pipeline
44
+ # DIRECTORY needs to be given.
32
45
  $ tcs -j # CLI to generate params.json
33
46
  $ tcs -h # print out the help
34
47
  ```
48
+
49
+ [sample params.json for the tcs-dr pipeline](./docs/dr.json)
50
+
35
51
  ---
36
52
  ### `tcs_log`
37
53
 
@@ -53,6 +69,44 @@ Example command:
53
69
  $ tcs_log batch_tcs_jobs
54
70
  ```
55
71
 
72
+ ---
73
+ ### `tcs_sdrm`
74
+
75
+ Use `tcs_sdrm` pipeline for HIV-1 drug resistance mutation and recency.
76
+
77
+ Example command:
78
+ ```bash
79
+ $ tcs_sdrm libs_dir
80
+ ```
81
+
82
+ lib_dir file structure:
83
+ ```
84
+ libs_dir/
85
+ ├── lib1
86
+ ├── lib1_RT
87
+ ├── lib1_PR
88
+ ├── lib1_IN
89
+ ├── lib1_V1V3
90
+ ├── lib2
91
+ ├── lib1_RT
92
+ ├── lib1_PR
93
+ ├── lib1_IN
94
+ ├── lib1_V1V3
95
+ ├── ...
96
+ ```
97
+
98
+ Output data in a new dir as 'libs_dir_SDRM'
99
+
100
+
101
+ **Note: [R](https://www.r-project.org/) and the following R libraries are required:**
102
+ - phangorn
103
+ - ape
104
+ - scales
105
+ - ggforce
106
+ - cowplot
107
+ - magrittr
108
+ - gridExtra
109
+
56
110
  ---
57
111
 
58
112
  ### `locator`
@@ -93,7 +147,7 @@ qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
93
147
  Further filter out sequences with Apobec3g/f hypermutations
94
148
 
95
149
  ```ruby
96
- qc_seqhash = qc_seqhash.a3g
150
+ qc_seqhash = qc_seqhash.a3g[:filtered_seq]
97
151
  ```
98
152
 
99
153
  Calculate nucleotide diveristy π
@@ -121,15 +175,48 @@ qc_seqhash.sdrm_hiv_pr(cut_off)
121
175
 
122
176
  ## Updates
123
177
 
178
+ ### Version 1.2.2-05272021
179
+
180
+ 1. Fixed a bug in the `tcs` pipeline that sometimes causes `SystemStackError`.
181
+ `tcs` pipeline upgraded to v2.3.2
182
+
183
+ ### Version 1.2.1-05172021
184
+
185
+ 1. Added a function in R to check and install missing R packages for `tcs_sdrm` pipeline.
186
+
187
+ ### Version 1.2.0-05102021
188
+
189
+ 1. Added `tcs_sdrm` pipeline as an excutable.
190
+ `tcs_sdrm` processes `tcs`-processed HIV MPID-NGS data for drug resistance mutations, recency and phylogentic analysis.
191
+
192
+ 2. Added function ViralSeq::SeqHash#sample.
193
+
194
+ 3. Added recency determining function `ViralSeq::Recency::define`
195
+
196
+ 4. Fixed a few bugs related to `tcs_sdrm`.
197
+
198
+ ### Version 1.1.2-04262021
199
+
200
+ 1. Added function `ViralSeq::DRMs.sdrm_json` to export SDRM as json object.
201
+ 2. Added a random string to the temp file names for `muscle_bio` to avoid issues when running scripts in parallel.
202
+ 3. Added `--keep-original` flag to the `tcs` pipeline.
203
+
204
+ ### Version 1.1.1-04012021
205
+
206
+ 1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
207
+ 2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
208
+ If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
209
+ 3. Added option `-dr` to the `tcs` script.
210
+
124
211
  ### Version 1.1.0-03252021
125
212
 
126
- 1. Optimized the algorithm of end-join.
127
- 2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
128
- 3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
129
- 4. Added the preset of MPID-HIVDR params file ***dr.json*** in /doc.
130
- 5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
131
- Users can choose from 3 MiSeq platforms for processing their sequencing data.
132
- MiSeq 300x7x300 is the default option.
213
+ 1. Optimized the algorithm of end-join.
214
+ 2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
215
+ 3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
216
+ 4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
217
+ 5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
218
+ Users can choose from 3 MiSeq platforms for processing their sequencing data.
219
+ MiSeq 300x7x300 is the default option.
133
220
 
134
221
  ### Version 1.0.14-03052021
135
222
 
data/bin/tcs CHANGED
@@ -46,11 +46,23 @@ OptionParser.new do |opts|
46
46
  options[:params_json] = p
47
47
  end
48
48
 
49
+ opts.on("-i", "--input PATH_TO_WORKING_DIRECTORY", "Path to the working directory") do |p|
50
+ options[:input] = p
51
+ end
52
+
53
+ opts.on("-dr", "--dr_pipeline", "HIV drug resistance MPID pipeline") do |p|
54
+ options[:dr] = true
55
+ end
56
+
49
57
  opts.on("-h", "--help", "Prints this help") do
50
58
  puts opts
51
59
  exit
52
60
  end
53
61
 
62
+ opts.on("--keep-original", "keep raw sequence files") do
63
+ options[:keep] = true
64
+ end
65
+
54
66
  opts.on("-v", "--version", "Version info") do
55
67
  puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
56
68
  puts "viral_seq version: " + ViralSeq::VERSION.red.bold
@@ -64,15 +76,21 @@ end.parse!
64
76
 
65
77
  if options[:json_generator]
66
78
  params = ViralSeq::TcsJson.generate
79
+ elsif options[:dr]
80
+ params = ViralSeq::TcsDr::PARAMS
67
81
  elsif (options[:params_json] && File.exist?(options[:params_json]))
68
82
  params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
69
83
  else
70
84
  abort "No params JSON file found. Script terminated.".red
71
85
  end
72
86
 
73
- indir = params[:raw_sequence_dir]
87
+ if options[:input]
88
+ indir = options[:input]
89
+ else
90
+ indir = params[:raw_sequence_dir]
91
+ end
74
92
 
75
- unless File.exist?(indir)
93
+ unless indir and File.exist?(indir)
76
94
  abort "No input sequence directory found. Script terminated.".red.bold
77
95
  end
78
96
 
@@ -129,6 +147,7 @@ end
129
147
 
130
148
  primers.each do |primer|
131
149
  summary_json = {}
150
+ summary_json[:warnings] = []
132
151
  summary_json[:tcs_version] = ViralSeq::TCS_VERSION
133
152
  summary_json[:viralseq_version] = ViralSeq::VERSION
134
153
  summary_json[:runtime] = Time.now.to_s
@@ -140,6 +159,7 @@ primers.each do |primer|
140
159
  forward_primer = primer[:forward]
141
160
 
142
161
  export_raw = primer[:export_raw]
162
+ limit_raw = primer[:limit_raw]
143
163
 
144
164
  unless cdna_primer
145
165
  log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
@@ -181,6 +201,10 @@ primers.each do |primer|
181
201
  paired_seq_number = common_keys.size
182
202
  log.puts Time.now.to_s + "\t" + "Paired raw sequences are : #{paired_seq_number.to_s}"
183
203
  summary_json[:paired_raw_sequence] = paired_seq_number
204
+ if paired_seq_number < raw_sequence_number * 0.001
205
+ summary_json[:warnings] <<
206
+ "WARNING: Filtered raw sequneces less than 0.1% of the total raw sequences. Possible contamination."
207
+ end
184
208
 
185
209
  common_keys.each do |seqtag|
186
210
  r1_seq = r1_passed_seq[seqtag]
@@ -242,7 +266,13 @@ primers.each do |primer|
242
266
  raw_r1_f = File.open(outfile_raw_r1, 'w')
243
267
  raw_r2_f = File.open(outfile_raw_r2, 'w')
244
268
 
245
- bio_r1.keys.each do |k|
269
+ if limit_raw
270
+ raw_keys = bio_r1.keys.sample(limit_raw.to_i)
271
+ else
272
+ raw_keys = bio_r1.keys
273
+ end
274
+
275
+ raw_keys.each do |k|
246
276
  raw_r1_f.puts k + "_r1"
247
277
  raw_r2_f.puts k + "_r2"
248
278
  raw_r1_f.puts bio_r1[k]
@@ -341,9 +371,21 @@ primers.each do |primer|
341
371
  # Primer ID distribution in .json file
342
372
  out_pid_json = File.join(out_dir_set, 'primer_id.json')
343
373
  pid_json = {}
344
- pid_json[:primer_id_in_use] = Hash[*(primer_id_in_use.sort_by {|k, v| [-v,k]}.flatten)]
345
- pid_json[:primer_id_distribution] = Hash[*(primer_id_dis.sort_by{|k,v| k}.flatten)]
346
- pid_json[:primer_id_frequency] = Hash[*(primer_id_count.sort_by {|k, v| [-v,k]}.flatten)]
374
+ pid_json[:primer_id_in_use] = {}
375
+ primer_id_in_use.sort_by {|k, v| [-v,k]}.each do |k,v|
376
+ pid_json[:primer_id_in_use][k] = v
377
+ end
378
+
379
+ pid_json[:primer_id_distribution] = {}
380
+ primer_id_dis.sort_by{|k,v| k}.each do |k,v|
381
+ pid_json[:primer_id_distribution][k] = v
382
+ end
383
+
384
+ pid_json[:primer_id_frequency] = {}
385
+ primer_id_count.sort_by {|k,v| [-v,k]}.each do |k,v|
386
+ pid_json[:primer_id_frequency][k] = v
387
+ end
388
+
347
389
  File.open(out_pid_json, 'w') do |f|
348
390
  f.puts JSON.pretty_generate(pid_json)
349
391
  end
@@ -455,9 +497,11 @@ primers.each do |primer|
455
497
  end
456
498
  end
457
499
 
458
- log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
459
- File.unlink(r1_f)
460
- File.unlink(r2_f)
461
- log.puts Time.now.to_s + "\t" + "TCS pipeline successfuly exercuted."
500
+ unless options[:keep]
501
+ log.puts Time.now.to_s + "\t" + "Removing raw sequence files..."
502
+ File.unlink(r1_f)
503
+ File.unlink(r2_f)
504
+ end
505
+ log.puts Time.now.to_s + "\t" + "TCS pipeline successfuly executed."
462
506
  log.close
463
507
  puts "DONE!"
data/bin/tcs_log CHANGED
@@ -37,8 +37,26 @@ Dir.mkdir(outdir4) unless File.directory?(outdir4)
37
37
 
38
38
  log_file = File.join(tcs_dir,"log.csv")
39
39
  log = File.open(log_file,'w')
40
- log.puts "lib name,Region,Raw Sequences per barcode,R1 Raw,R2 Raw,Paired Raw,Cutoff,PID Length,Consensus1,Consensus2,Distinct to Raw,Resampling index,Combined TCS,Combined TCS after QC"
41
40
 
41
+ header = %w{
42
+ lib_name
43
+ Region
44
+ Raw_Sequences_per_barcode
45
+ R1_Raw
46
+ R2_Raw
47
+ Paired_Raw
48
+ Cutoff
49
+ PID_Length
50
+ Consensus1
51
+ Consensus2
52
+ Distinct_to_Raw
53
+ Resampling_index
54
+ Combined_TCS
55
+ Combined_TCS_after_QC
56
+ WARNINGS
57
+ }
58
+
59
+ log.puts header.join(',')
42
60
  libs.each do |lib|
43
61
  Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
44
62
  fasta_files = []
@@ -77,6 +95,7 @@ libs.each do |lib|
77
95
  json_log[:resampling_param],
78
96
  json_log[:combined_tcs],
79
97
  json_log[:combined_tcs_after_qc],
98
+ json_log[:warnings],
80
99
  ].join(',') + "\n"
81
100
  end
82
101
  end
data/bin/tcs_sdrm ADDED
@@ -0,0 +1,409 @@
1
+ #!/usr/bin/env ruby
2
+ # tcs/sdrm pipeline for HIV-1 drug resistance mutation and recency
3
+ #
4
+ # command example:
5
+ # $ tcs_sdrm libs_dir
6
+ #
7
+ # lib_dir file structure:
8
+ # libs_dir
9
+ # ├── lib1
10
+ # ├── lib1_RT
11
+ # ├── lib1_PR
12
+ # ├── lib1_IN
13
+ # ├── lib1_V1V3
14
+ # ├── lib2
15
+ # ├── lib1_RT
16
+ # ├── lib1_PR
17
+ # ├── lib1_IN
18
+ # ├── lib1_V1V3
19
+ # ├── ...
20
+ #
21
+ # output data in a new dir as 'libs_dir_SDRM'
22
+
23
+ require 'viral_seq'
24
+ require 'json'
25
+ require 'csv'
26
+ require 'fileutils'
27
+ require 'prawn'
28
+ require 'prawn/table'
29
+ require 'combine_pdf'
30
+
31
+ unless ARGV[0] && File.directory?(ARGV[0])
32
+ abort "No sequence data provided. `tcs_sdrm` pipeline aborted. "
33
+ end
34
+
35
+ begin
36
+ r_version = `R --version`.split("\n")[0]
37
+ r_check = `R -e '#{ViralSeq::R_SCRIPT_CHECK_PACKAGES}' > /dev/null 2>&1`
38
+ rescue Errno::ENOENT
39
+ abort '"R" is not installed. Install R at https://www.r-project.org/' +
40
+ "\n`tcs_sdrm` pipeline aborted."
41
+ end
42
+
43
+ def abstract_line(data)
44
+ return_data = data[3] + data[2] + data[4] + ":" +
45
+ (data[6].to_f * 100).round(2).to_s + "(" +
46
+ (data[7].to_f * 100).round(2).to_s + "-" +
47
+ (data[8].to_f * 100).round(2).to_s + "); "
48
+ end
49
+
50
+ # run params
51
+ log = []
52
+
53
+ log << { time: Time.now }
54
+ log << { viral_seq_version: ViralSeq::VERSION }
55
+ log << { tcs_version: ViralSeq::TCS_VERSION }
56
+ log << { R_version: r_version}
57
+ sdrm_list = {}
58
+ sdrm_list[:nrti] = ViralSeq::DRMs.sdrm_json(:nrti)
59
+ sdrm_list[:nnrti] = ViralSeq::DRMs.sdrm_json(:nnrti)
60
+ sdrm_list[:hiv_pr] = ViralSeq::DRMs.sdrm_json(:hiv_pr)
61
+ sdrm_list[:hiv_in] = ViralSeq::DRMs.sdrm_json(:hiv_in)
62
+ log << { sdrm_list: sdrm_list }
63
+
64
+ # input dir
65
+ indir = ARGV[0]
66
+ libs = Dir[indir + "/*"]
67
+ log << { processed_libs: libs }
68
+
69
+ #output dir
70
+ outdir = indir + "_SDRM"
71
+ Dir.mkdir(outdir) unless File.directory?(outdir)
72
+
73
+ libs.each do |lib|
74
+
75
+ r_script = ViralSeq::R_SCRIPT.dup
76
+
77
+ next unless File.directory?(lib)
78
+
79
+ lib_name = File.basename(lib)
80
+ out_lib_dir = File.join(outdir, lib_name)
81
+ Dir.mkdir(out_lib_dir) unless File.directory?(out_lib_dir)
82
+
83
+ sub_seq_files = Dir[lib + "/*"]
84
+
85
+ seq_summary_file = File.join(out_lib_dir, (lib_name + "_summary.csv"))
86
+ seq_summary_out = File.open(seq_summary_file, "w")
87
+ seq_summary_out.puts 'Region,TCS,TCS with A3G/F hypermutation,TCS with stop codon,' +
88
+ 'TCS w/o hypermutation and stop codon,' +
89
+ 'Poisson cutoff for minority mutation (>=),Pi,Dist20'
90
+
91
+ point_mutation_file = File.join(out_lib_dir, (lib_name + "_substitution.csv"))
92
+ point_mutation_out = File.open(point_mutation_file, "w")
93
+ point_mutation_out.puts "region,TCS,AA position,wild type,mutation," +
94
+ "number,percentage,95% CI low, 95% CI high, notes"
95
+
96
+ linkage_file = File.join(out_lib_dir, (lib_name + "_linkage.csv"))
97
+ linkage_out = File.open(linkage_file, "w")
98
+ linkage_out.puts "region,TCS,mutation linkage,number," +
99
+ "percentage,95% CI low, 95% CI high, notes"
100
+
101
+ aa_report_file = File.join(out_lib_dir, (lib_name + "_aa.csv"))
102
+ aa_report_out = File.open(aa_report_file, "w")
103
+ aa_report_out.puts "region,ref.aa.positions,TCS.number," +
104
+ ViralSeq::AMINO_ACID_LIST.join(",")
105
+
106
+ summary_json_file = File.join(out_lib_dir, (lib_name + "_summary.json"))
107
+ summary_json_out = File.open(summary_json_file,"w")
108
+
109
+ filtered_seq_dir = File.join(out_lib_dir, (lib_name + "_filtered_seq"))
110
+ Dir.mkdir(filtered_seq_dir) unless File.directory?(filtered_seq_dir)
111
+
112
+ aln_seq_dir = File.join(out_lib_dir, (lib_name + "_aln_seq"))
113
+ Dir.mkdir(aln_seq_dir) unless File.directory?(aln_seq_dir)
114
+
115
+ point_mutation_list = []
116
+ linkage_list = []
117
+ aa_report_list = []
118
+ summary_hash = {}
119
+
120
+ sub_seq_files.each do |sub_seq|
121
+ seq_basename = File.basename(sub_seq)
122
+ seqs = ViralSeq::SeqHash.fa(sub_seq)
123
+ next if seqs.size < 3
124
+ if seq_basename =~ /V1V3/i
125
+ summary_hash[:V1V3] = "#{seqs.size.to_s},NA,NA,NA,NA"
126
+ FileUtils.cp(sub_seq, filtered_seq_dir)
127
+ elsif seq_basename =~ /PR/i
128
+ a3g_check = seqs.a3g
129
+ a3g_seqs = a3g_check[:a3g_seq]
130
+ a3g_filtered_seqs = a3g_check[:filtered_seq]
131
+ stop_codon_check = a3g_filtered_seqs.stop_codon
132
+ stop_codon_seqs = stop_codon_check[:with_stop_codon]
133
+ filtered_seqs = stop_codon_check[:without_stop_codon]
134
+ poisson_minority_cutoff = filtered_seqs.pm
135
+ summary_hash[:PR] = [
136
+ seqs.size.to_s,
137
+ a3g_seqs.size.to_s,
138
+ stop_codon_seqs.size.to_s,
139
+ filtered_seqs.size.to_s,
140
+ poisson_minority_cutoff.to_s
141
+ ].join(',')
142
+ next if filtered_seqs.size < 3
143
+ filtered_seqs.write_nt_fa(File.join(filtered_seq_dir,seq_basename))
144
+
145
+ sdrm = filtered_seqs.sdrm_hiv_pr(poisson_minority_cutoff)
146
+ point_mutation_list += sdrm[0]
147
+ linkage_list += sdrm[1]
148
+ aa_report_list += sdrm[2]
149
+
150
+ elsif seq_basename =~/IN/i
151
+ a3g_check = seqs.a3g
152
+ a3g_seqs = a3g_check[:a3g_seq]
153
+ a3g_filtered_seqs = a3g_check[:filtered_seq]
154
+ stop_codon_check = a3g_filtered_seqs.stop_codon(2)
155
+ stop_codon_seqs = stop_codon_check[:with_stop_codon]
156
+ filtered_seqs = stop_codon_check[:without_stop_codon]
157
+ poisson_minority_cutoff = filtered_seqs.pm
158
+ summary_hash[:IN] = [
159
+ seqs.size.to_s,
160
+ a3g_seqs.size.to_s,
161
+ stop_codon_seqs.size.to_s,
162
+ filtered_seqs.size.to_s,
163
+ poisson_minority_cutoff.to_s
164
+ ].join(',')
165
+ next if filtered_seqs.size < 3
166
+ filtered_seqs.write_nt_fa(File.join(filtered_seq_dir,seq_basename))
167
+
168
+ sdrm = filtered_seqs.sdrm_hiv_in(poisson_minority_cutoff)
169
+ point_mutation_list += sdrm[0]
170
+ linkage_list += sdrm[1]
171
+ aa_report_list += sdrm[2]
172
+
173
+ elsif seq_basename =~/RT/i
174
+ rt_seq1 = {}
175
+ rt_seq2 = {}
176
+ seqs.dna_hash.each do |k,v|
177
+ rt_seq1[k] = v[0,267]
178
+ rt_seq2[k] = v[267..-1]
179
+ end
180
+ rt1 = ViralSeq::SeqHash.new(rt_seq1)
181
+ rt2 = ViralSeq::SeqHash.new(rt_seq2)
182
+ rt1_a3g = rt1.a3g
183
+ rt2_a3g = rt2.a3g
184
+ hypermut_seq_rt1 = rt1_a3g[:a3g_seq]
185
+ hypermut_seq_rt2 = rt2_a3g[:a3g_seq]
186
+ rt1_stop_codon = rt1.stop_codon(1)[:with_stop_codon]
187
+ rt2_stop_codon = rt2.stop_codon(2)[:with_stop_codon]
188
+ hypermut_seq_keys = (hypermut_seq_rt1.dna_hash.keys | hypermut_seq_rt2.dna_hash.keys)
189
+ stop_codon_seq_keys = (rt1_stop_codon.dna_hash.keys | rt2_stop_codon.dna_hash.keys)
190
+ reject_keys = (hypermut_seq_keys | stop_codon_seq_keys)
191
+ filtered_seqs = ViralSeq::SeqHash.new(seqs.dna_hash.reject {|k,v| reject_keys.include?(k) })
192
+ poisson_minority_cutoff = filtered_seqs.pm
193
+ summary_hash[:RT] = [
194
+ seqs.size.to_s,
195
+ hypermut_seq_keys.size.to_s,
196
+ stop_codon_seq_keys.size.to_s,
197
+ filtered_seqs.size.to_s,
198
+ poisson_minority_cutoff.to_s
199
+ ].join(',')
200
+ next if filtered_seqs.size < 3
201
+ filtered_seqs.write_nt_fa(File.join(filtered_seq_dir,seq_basename))
202
+
203
+ sdrm = filtered_seqs.sdrm_hiv_rt(poisson_minority_cutoff)
204
+ point_mutation_list += sdrm[0]
205
+ linkage_list += sdrm[1]
206
+ aa_report_list += sdrm[2]
207
+ end
208
+ end
209
+
210
+ point_mutation_list.each do |record|
211
+ point_mutation_out.puts record.join(",")
212
+ end
213
+ linkage_list.each do |record|
214
+ linkage_out.puts record.join(",")
215
+ end
216
+ aa_report_list.each do |record|
217
+ aa_report_out.puts record.join(",")
218
+ end
219
+
220
+ filtered_seq_files = Dir[filtered_seq_dir + "/*"]
221
+
222
+ out_r_csv = File.join(out_lib_dir, (lib_name + "_pi.csv"))
223
+ out_r_pdf = File.join(out_lib_dir, (lib_name + "_pi.pdf"))
224
+
225
+ if filtered_seq_files.size > 0
226
+ filtered_seq_files.each do |seq_file|
227
+ filtered_sh = ViralSeq::SeqHash.fa(seq_file)
228
+ next if filtered_sh.size < 3
229
+ aligned_sh = filtered_sh.random_select(1000).align
230
+ aligned_sh.write_nt_fa(File.join(aln_seq_dir, File.basename(seq_file)))
231
+ end
232
+
233
+ r_script.gsub!(/PATH_TO_FASTA/,aln_seq_dir)
234
+ File.unlink(out_r_csv) if File.exist?(out_r_csv)
235
+ File.unlink(out_r_pdf) if File.exist?(out_r_pdf)
236
+ r_script.gsub!(/OUTPUT_CSV/,out_r_csv)
237
+ r_script.gsub!(/OUTPUT_PDF/,out_r_pdf)
238
+ r_script_file = File.join(out_lib_dir, "/pi.R")
239
+ File.open(r_script_file,"w") {|line| line.puts r_script}
240
+ print `Rscript #{r_script_file} 1> /dev/null 2> /dev/null`
241
+ if File.exist?(out_r_csv)
242
+ pi_csv = File.readlines(out_r_csv)
243
+ pi_csv.each do |line|
244
+ line.chomp!
245
+ data = line.split(",")
246
+ tag = data[0].split("_")[-1].gsub(/\W/,"").to_sym
247
+ summary_hash[tag] += "," + data[1].to_f.round(4).to_s + "," + data[2].to_f.round(4).to_s
248
+ end
249
+ [:PR, :RT, :IN, :V1V3].each do |regions|
250
+ next unless summary_hash[regions]
251
+ seq_summary_out.puts regions.to_s + "," + summary_hash[regions]
252
+ end
253
+ File.unlink(out_r_csv)
254
+ end
255
+ File.unlink(r_script_file)
256
+ end
257
+
258
+ seq_summary_out.close
259
+ point_mutation_out.close
260
+ linkage_out.close
261
+ aa_report_out.close
262
+
263
+ summary_lines = File.readlines(seq_summary_file)
264
+ summary_lines.shift
265
+
266
+ tcs_PR = 0
267
+ tcs_RT = 0
268
+ tcs_IN = 0
269
+ tcs_V1V3 = 0
270
+ pi_RT = 0.0
271
+ pi_V1V3 = 0.0
272
+ dist20_RT = 0.0
273
+ dist20_V1V3 = 0.0
274
+ summary_lines.each do |line|
275
+ data = line.chomp.split(",")
276
+ if data[0] == "PR"
277
+ tcs_PR = data[4].to_i
278
+ elsif data[0] == "RT"
279
+ tcs_RT = data[4].to_i
280
+ pi_RT = data[6].to_f
281
+ dist20_RT = data[7].to_f
282
+ elsif data[0] == "IN"
283
+ tcs_IN = data[4].to_i
284
+ elsif data[0] == "V1V3"
285
+ tcs_V1V3 = data[1].to_i
286
+ pi_V1V3 = data[6].to_f
287
+ dist20_V1V3 = data[7].to_f
288
+ end
289
+ end
290
+
291
+ recency = ViralSeq::Recency.define(
292
+ tcs_RT: tcs_RT,
293
+ tcs_V1V3: tcs_V1V3,
294
+ pi_RT: pi_RT,
295
+ dist20_RT: dist20_RT,
296
+ pi_V1V3: pi_V1V3,
297
+ dist20_V1V3: dist20_V1V3
298
+ )
299
+
300
+ sdrm_lines = File.readlines(point_mutation_file)
301
+ sdrm_lines.shift
302
+ sdrm_PR = ""
303
+ sdrm_RT = ""
304
+ sdrm_IN = ""
305
+ sdrm_lines.each do |line|
306
+ data = line.chomp.split(",")
307
+ next if data[-1] == "*"
308
+ if data[0] == "PR"
309
+ sdrm_PR += abstract_line(data)
310
+ elsif data[0] =~ /NRTI/
311
+ sdrm_RT += abstract_line(data)
312
+ elsif data[0] == "IN"
313
+ sdrm_IN += abstract_line(data)
314
+ end
315
+ end
316
+
317
+ summary_json = [
318
+ sample_id: lib_name,
319
+ tcs_PR: tcs_PR,
320
+ tcs_RT: tcs_RT,
321
+ tcs_IN: tcs_IN,
322
+ tcs_V1V3: tcs_V1V3,
323
+ pi_RT: pi_RT,
324
+ dist20_RT: dist20_RT,
325
+ dist20_V1V3: dist20_V1V3,
326
+ recency: recency,
327
+ sdrm_PR: sdrm_PR,
328
+ sdrm_RT: sdrm_RT,
329
+ sdrm_IN: sdrm_IN
330
+ ]
331
+
332
+ summary_json_out.puts JSON.pretty_generate(summary_json)
333
+ summary_json_out.close
334
+
335
+ csvs = [
336
+ {
337
+ name: "summary",
338
+ title: "Summary",
339
+ file: seq_summary_file,
340
+ newPDF: "",
341
+ table_width: [65,55,110,110,110,110,60,60],
342
+ extra_text: ""
343
+ },
344
+ {
345
+ name: "substitution",
346
+ title: "Surveillance Drug Resistance Mutations",
347
+ file: point_mutation_file,
348
+ newPDF: "",
349
+ table_width: [65,55,85,80,60,65,85,85,85,45],
350
+ extra_text: "* Mutation below Poisson cut-off for minority mutations"
351
+ },
352
+ {
353
+ name: "linkage",
354
+ title: "Mutation Linkage",
355
+ file: linkage_file,
356
+ newPDF: "",
357
+ table_width: [55,50,250,60,80,80,80,45],
358
+ extra_text: "* Mutation below Poisson cut-off for minority mutations"
359
+ }
360
+ ]
361
+
362
+ csvs.each do |csv|
363
+ file_name = File.join(out_lib_dir, (csv[:name] + ".pdf"))
364
+ next unless File.exist? csv[:file]
365
+ Prawn::Document.generate(file_name, :page_layout => :landscape) do |pdf|
366
+ pdf.text((File.basename(lib, ".*") + ': ' + csv[:title]),
367
+ :size => 20,
368
+ :align => :center,
369
+ :style => :bold)
370
+ pdf.move_down 20
371
+ table_data = CSV.open(csv[:file]).to_a
372
+ header = table_data.first
373
+ pdf.table(table_data,
374
+ :header => header,
375
+ :position => :center,
376
+ :column_widths => csv[:table_width],
377
+ :row_colors => ["B6B6B6", "FFFFFF"],
378
+ :cell_style => {:align => :center, :size => 10}) do |table|
379
+ table.row(0).style :font_style => :bold, :size => 12 #, :background_color => 'ff00ff'
380
+ end
381
+ pdf.move_down 5
382
+ pdf.text(csv[:extra_text], :size => 8, :align => :justify,)
383
+ end
384
+ csv[:newPDF] = file_name
385
+ end
386
+
387
+ pdf = CombinePDF.new
388
+ csvs.each do |csv|
389
+ pdf << CombinePDF.load(csv[:newPDF]) if File.exist?(csv[:newPDF])
390
+ end
391
+ pdf << CombinePDF.load(out_r_pdf) if File.exist?(out_r_pdf)
392
+
393
+ pdf.number_pages location: [:bottom_right],
394
+ number_format: "Swanstrom\'s lab HIV SDRM Pipeline, version #{$sdrm_version_number} by S.Z. and M.U.C. Page %s",
395
+ font_size: 6,
396
+ opacity: 0.5
397
+
398
+ pdf.save File.join(out_lib_dir, (lib_name + ".pdf"))
399
+
400
+ csvs.each do |csv|
401
+ File.unlink csv[:newPDF]
402
+ end
403
+ end
404
+
405
+ log_file = File.join(File.dirname(indir), "sdrm_log.json")
406
+
407
+ File.open(log_file, 'w') { |f| f.puts JSON.pretty_generate(log) }
408
+
409
+ FileUtils.touch(File.join(outdir, ".done"))