viral_seq 1.0.11 → 1.1.1
- checksums.yaml +4 -4
- data/.gitignore +0 -1
- data/Gemfile.lock +1 -1
- data/README.md +93 -11
- data/bin/tcs +34 -6
- data/bin/tcs_log +102 -0
- data/docs/assets/img/cover.jpg +0 -0
- data/docs/dr.json +67 -0
- data/docs/sample_miseq_data/hivdr_control/r1.fastq.gz +0 -0
- data/docs/sample_miseq_data/hivdr_control/r2.fastq.gz +0 -0
- data/lib/viral_seq.rb +1 -1
- data/lib/viral_seq/enumerable.rb +0 -10
- data/lib/viral_seq/math.rb +3 -3
- data/lib/viral_seq/seq_hash.rb +1 -1
- data/lib/viral_seq/seq_hash_pair.rb +6 -4
- data/lib/viral_seq/tcs_core.rb +34 -5
- data/lib/viral_seq/tcs_dr.rb +71 -0
- data/lib/viral_seq/tcs_json.rb +41 -10
- data/lib/viral_seq/version.rb +2 -2
- metadata +9 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7a283f3a09cc5d9807e7622cd1ddf27197919955e85d6472b34fc14b66749c03
|
4
|
+
data.tar.gz: 4f90c5a9c7ea0ec148ba7d45ee88dc441f79da67a97654734194a773499ebb8e
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 385a94eb93c3d8d9116c16a0d8af56ba714ba6191a454076acf881a036de80d1d598f3fcd1a4de841745ca08a1ad3e8bc028a30db9f96c19f3b217ef4583d652
|
7
|
+
data.tar.gz: 714d035b6f65863746cafb120c9cf6eccb8261f3eac69985bad96e5275351eec71aa3b744ee9b462e2dc3e0e199c2d4112386f6a2d7eef89b5b7824c1ab769be
|
data/.gitignore
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,8 +1,24 @@
|
|
1
1
|
# ViralSeq
|
2
2
|
|
3
|
+
[![Gem Version](https://badge.fury.io/rb/viral_seq.svg)](https://rubygems.org/gems/viral_seq)
|
4
|
+
![GitHub](https://img.shields.io/github/license/viralseq/viral_seq)
|
5
|
+
![Gem](https://img.shields.io/gem/dt/viral_seq?color=%23E9967A)
|
6
|
+
![GitHub last commit](https://img.shields.io/github/last-commit/viralseq/viral_seq?color=%2300BFFF)
|
7
|
+
[![Join the chat at https://gitter.im/viral_seq/community](https://badges.gitter.im/viral_seq/community.svg)](https://gitter.im/viral_seq/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
|
8
|
+
|
3
9
|
A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|
4
10
|
|
5
|
-
Specifically for Primer
|
11
|
+
Specifically for Primer ID sequencing and HIV drug resistance analysis.
|
12
|
+
|
13
|
+
## Illustration for the Primer ID Sequencing
|
14
|
+
|
15
|
+
|
16
|
+
![Primer ID Sequencing](./docs/assets/img/cover.jpg)
|
17
|
+
|
18
|
+
### Reference readings on the Primer ID sequencing
|
19
|
+
[Explanation of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
|
20
|
+
[Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
|
21
|
+
[Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
|
6
22
|
|
7
23
|
## Install
|
8
24
|
|
@@ -14,20 +30,55 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
|
|
14
30
|
|
15
31
|
### Executables
|
16
32
|
|
17
|
-
|
33
|
+
### `tcs`
|
34
|
+
Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
|
18
35
|
|
36
|
+
Example commands:
|
19
37
|
```bash
|
20
|
-
$
|
38
|
+
$ tcs -p params.json # run TCS pipeline with params.json
|
39
|
+
$ tcs -p params.json -i DIRECTORY
|
40
|
+
# run TCS pipeline with params.json and DIRECTORY
|
41
|
+
# if DIRECTORY is not defined in params.json
|
42
|
+
$ tcs -dr -i DIRECTORY
|
43
|
+
# run tcs-dr (MPID HIV drug resistance sequencing) pipeline
|
44
|
+
# DIRECTORY needs to be given.
|
45
|
+
$ tcs -j # CLI to generate params.json
|
46
|
+
$ tcs -h # print out the help
|
21
47
|
```
|
22
48
|
|
23
|
-
|
49
|
+
[sample params.json for the tcs-dr pipeline](./docs/dr.json)
|
50
|
+
|
51
|
+
---
|
52
|
+
### `tcs_log`
|
53
|
+
|
54
|
+
Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
|
24
55
|
|
56
|
+
|
57
|
+
Example file structure:
|
58
|
+
```
|
59
|
+
batch_tcs_jobs/
|
60
|
+
├── lib1
|
61
|
+
├── lib2
|
62
|
+
├── lib3
|
63
|
+
├── lib4
|
64
|
+
├── ...
|
65
|
+
```
|
66
|
+
|
67
|
+
Example command:
|
25
68
|
```bash
|
26
|
-
$
|
27
|
-
$ tcs -j # CLI to generate params.json
|
28
|
-
$ tcs -h # print out the help
|
69
|
+
$ tcs_log batch_tcs_jobs
|
29
70
|
```
|
30
71
|
|
72
|
+
---
|
73
|
+
|
74
|
+
### `locator`
|
75
|
+
Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
|
76
|
+
|
77
|
+
```bash
|
78
|
+
$ locator -i sequence.fasta -o sequence.fasta.csv
|
79
|
+
```
|
80
|
+
---
|
81
|
+
|
31
82
|
## Some Examples
|
32
83
|
|
33
84
|
Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
|
@@ -80,16 +131,47 @@ qc_seqhash.sdrm_hiv_pr(cut_off)
|
|
80
131
|
```
|
81
132
|
## Known issues
|
82
133
|
|
83
|
-
1. have a conflict with rails
|
134
|
+
1. ~~have a conflict with rails.~~
|
135
|
+
2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
|
136
|
+
3. The conflict seems to be resolved. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
|
84
137
|
|
85
138
|
## Updates
|
86
139
|
|
87
|
-
### Version 1.1.1-
|
140
|
+
### Version 1.1.1-04012021
|
141
|
+
|
142
|
+
1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
|
143
|
+
2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
|
144
|
+
If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
|
145
|
+
3. Added option `-dr` to the `tcs` script.
|
146
|
+
|
147
|
+
### Version 1.1.0-03252021
|
148
|
+
|
149
|
+
1. Optimized the algorithm of end-join.
|
150
|
+
2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
|
151
|
+
3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
|
152
|
+
4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
|
153
|
+
5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
|
154
|
+
Users can choose from 3 MiSeq platforms for processing their sequencing data.
|
155
|
+
MiSeq 300x7x300 is the default option.
|
156
|
+
|
157
|
+
### Version 1.0.14-03052021
|
158
|
+
|
159
|
+
1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
|
160
|
+
|
161
|
+
### Version 1.0.13-03032021
|
162
|
+
|
163
|
+
1. Fixed the conflict with rails.
|
164
|
+
|
165
|
+
### Version 1.0.12-03032021
|
166
|
+
|
167
|
+
1. Fixed an issue that may cause conflicts with ActiveRecord.
|
168
|
+
|
169
|
+
### Version 1.0.11-03022021
|
88
170
|
|
89
|
-
1. Fixed
|
171
|
+
1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
|
90
172
|
2. Fixed an issue loading class 'OptionParser' in some Ruby environments.
|
91
173
|
|
92
|
-
### Version 1.
|
174
|
+
### Version 1.0.10-11112020:
|
93
175
|
|
94
176
|
1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
|
95
177
|
2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
|
data/bin/tcs
CHANGED
@@ -23,7 +23,7 @@
|
|
23
23
|
# THE SOFTWARE.
|
24
24
|
|
25
25
|
# Use JSON file as the run param
|
26
|
-
# run
|
26
|
+
# run `tcs -j` to generate param json file.
|
27
27
|
|
28
28
|
require 'viral_seq'
|
29
29
|
require 'json'
|
@@ -46,6 +46,14 @@ OptionParser.new do |opts|
|
|
46
46
|
options[:params_json] = p
|
47
47
|
end
|
48
48
|
|
49
|
+
opts.on("-i", "--input PATH_TO_WORKING_DIRECTORY", "Path to the working directory") do |p|
|
50
|
+
options[:input] = p
|
51
|
+
end
|
52
|
+
|
53
|
+
opts.on("-dr", "--dr_pipeline", "HIV drug resistance MPID pipeline") do |p|
|
54
|
+
options[:dr] = true
|
55
|
+
end
|
56
|
+
|
49
57
|
opts.on("-h", "--help", "Prints this help") do
|
50
58
|
puts opts
|
51
59
|
exit
|
@@ -64,15 +72,21 @@ end.parse!
|
|
64
72
|
|
65
73
|
if options[:json_generator]
|
66
74
|
params = ViralSeq::TcsJson.generate
|
75
|
+
elsif options[:dr]
|
76
|
+
params = ViralSeq::TcsDr::PARAMS
|
67
77
|
elsif (options[:params_json] && File.exist?(options[:params_json]))
|
68
78
|
params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
|
69
79
|
else
|
70
80
|
abort "No params JSON file found. Script terminated.".red
|
71
81
|
end
|
72
82
|
|
73
|
-
|
83
|
+
if options[:input]
|
84
|
+
indir = options[:input]
|
85
|
+
else
|
86
|
+
indir = params[:raw_sequence_dir]
|
87
|
+
end
|
74
88
|
|
75
|
-
unless File.exist?(indir)
|
89
|
+
unless indir and File.exist?(indir)
|
76
90
|
abort "No input sequence directory found. Script terminated.".red.bold
|
77
91
|
end
|
78
92
|
|
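The hunk above lets `-i` take precedence over `raw_sequence_dir` from the params file and guards against a missing value. A minimal sketch of that resolution order, using a hypothetical helper `resolve_input_dir` that is not part of the gem:

```ruby
# Hypothetical helper (not in viral_seq): pick the working directory the way
# the updated bin/tcs appears to, preferring the -i option over params.json.
def resolve_input_dir(cli_input, params)
  indir = cli_input || params[:raw_sequence_dir]
  # Return nil when nothing was given or the path does not exist,
  # so the caller can abort with its own message.
  indir if indir && File.exist?(indir)
end
```

A `nil` result corresponds to the `abort "No input sequence directory found. ..."` branch in the script.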
@@ -115,6 +129,12 @@ else
|
|
115
129
|
error_rate = 0.02
|
116
130
|
end
|
117
131
|
|
132
|
+
if params[:platform_format]
|
133
|
+
$platform_sequencing_length = params[:platform_format]
|
134
|
+
else
|
135
|
+
$platform_sequencing_length = 300
|
136
|
+
end
|
137
|
+
|
118
138
|
primers = params[:primer_pairs]
|
119
139
|
if primers.empty?
|
120
140
|
ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
|
@@ -123,6 +143,7 @@ end
|
|
123
143
|
|
124
144
|
primers.each do |primer|
|
125
145
|
summary_json = {}
|
146
|
+
summary_json[:warnings] = []
|
126
147
|
summary_json[:tcs_version] = ViralSeq::TCS_VERSION
|
127
148
|
summary_json[:viralseq_version] = ViralSeq::VERSION
|
128
149
|
summary_json[:runtime] = Time.now.to_s
|
@@ -175,6 +196,10 @@ primers.each do |primer|
|
|
175
196
|
paired_seq_number = common_keys.size
|
176
197
|
log.puts Time.now.to_s + "\t" + "Paired raw sequences are : #{paired_seq_number.to_s}"
|
177
198
|
summary_json[:paired_raw_sequence] = paired_seq_number
|
199
|
+
if paired_seq_number < raw_sequence_number * 0.001
|
200
|
+
summary_json[:warnings] <<
|
201
|
+
"WARNING: Filtered raw sequences less than 0.1% of the total raw sequences. Possible contamination."
|
202
|
+
end
|
178
203
|
|
179
204
|
common_keys.each do |seqtag|
|
180
205
|
r1_seq = r1_passed_seq[seqtag]
|
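The hunk above appends a warning when the paired raw sequences fall below 0.1% of the total raw sequences. A minimal sketch of the same threshold check as a standalone function; the helper name, the constant, and the message wording are assumptions for illustration, not the gem's API:

```ruby
# Hypothetical helper (not in viral_seq): mirror the 0.1% check in bin/tcs.
LOW_PAIRING_FRACTION = 0.001

# Return a warning string when too few reads survive pairing, else nil.
def low_pairing_warning(paired_count, raw_count, fraction = LOW_PAIRING_FRACTION)
  return nil unless paired_count < raw_count * fraction
  "WARNING: paired raw sequences are fewer than 0.1% of total raw sequences."
end

# Collect warnings in an array, as the summary JSON does with :warnings.
warnings = []
message = low_pairing_warning(5, 10_000)
warnings << message if message
```

Keeping the warnings in an array lets later checks append further messages without changing the summary structure.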
@@ -273,7 +298,6 @@ primers.each do |primer|
|
|
273
298
|
r1_sub_seq << bio_r1[seq_name]
|
274
299
|
r2_sub_seq << bio_r2[seq_name]
|
275
300
|
end
|
276
|
-
|
277
301
|
#consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
|
278
302
|
consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
|
279
303
|
r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
|
@@ -364,6 +388,7 @@ primers.each do |primer|
|
|
364
388
|
shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
|
365
389
|
joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
|
366
390
|
log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
|
391
|
+
|
367
392
|
summary_json[:combined_tcs] = joined_sh.size
|
368
393
|
|
369
394
|
if export_raw
|
@@ -433,12 +458,15 @@ primers.each do |primer|
|
|
433
458
|
trim_end = primer[:trim_ref_end]
|
434
459
|
trim_ref = primer[:trim_ref].to_sym
|
435
460
|
joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
|
436
|
-
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
437
461
|
if export_raw
|
438
462
|
joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
|
439
|
-
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
440
463
|
end
|
441
464
|
end
|
465
|
+
|
466
|
+
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
467
|
+
if export_raw
|
468
|
+
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
469
|
+
end
|
442
470
|
end
|
443
471
|
|
444
472
|
File.open(outfile_log, "w") do |f|
|
data/bin/tcs_log
ADDED
@@ -0,0 +1,102 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
# pool run logs from one batch of tcs jobs
|
4
|
+
# file structure:
|
5
|
+
# batch_tcs_jobs/
|
6
|
+
# ├── lib1
|
7
|
+
# ├── lib2
|
8
|
+
# ├── lib3
|
9
|
+
# ├── lib4
|
10
|
+
# ├── ...
|
11
|
+
#
|
12
|
+
# command example:
|
13
|
+
# $ tcs_log batch_tcs_jobs
|
14
|
+
|
15
|
+
require 'viral_seq'
|
16
|
+
require 'pathname'
|
17
|
+
require 'json'
|
18
|
+
require 'fileutils'
|
19
|
+
|
20
|
+
indir = ARGV[0].chomp
|
21
|
+
indir_basename = File.basename(indir)
|
22
|
+
indir_dirname = File.dirname(indir)
|
23
|
+
|
24
|
+
tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
|
25
|
+
Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
|
26
|
+
|
27
|
+
libs = []
|
28
|
+
Dir.chdir(indir) {libs = Dir.glob("*")}
|
29
|
+
|
30
|
+
outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
|
31
|
+
outdir3 = File.join(tcs_dir, "TCS_per_region")
|
32
|
+
outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
|
33
|
+
|
34
|
+
Dir.mkdir(outdir2) unless File.directory?(outdir2)
|
35
|
+
Dir.mkdir(outdir3) unless File.directory?(outdir3)
|
36
|
+
Dir.mkdir(outdir4) unless File.directory?(outdir4)
|
37
|
+
|
38
|
+
log_file = File.join(tcs_dir,"log.csv")
|
39
|
+
log = File.open(log_file,'w')
|
40
|
+
|
41
|
+
header = %w{
|
42
|
+
lib_name
|
43
|
+
Region
|
44
|
+
Raw_Sequences_per_barcode
|
45
|
+
R1_Raw
|
46
|
+
R2_Raw
|
47
|
+
Paired_Raw
|
48
|
+
Cutoff
|
49
|
+
PID_Length
|
50
|
+
Consensus1
|
51
|
+
Consensus2
|
52
|
+
Distinct_to_Raw
|
53
|
+
Resampling_index
|
54
|
+
Combined_TCS
|
55
|
+
Combined_TCS_after_QC
|
56
|
+
WARNINGS
|
57
|
+
}
|
58
|
+
|
59
|
+
log.puts header.join(',')
|
60
|
+
libs.each do |lib|
|
61
|
+
Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
|
62
|
+
fasta_files = []
|
63
|
+
json_files = []
|
64
|
+
Dir.chdir(File.join(indir, lib)) do
|
65
|
+
fasta_files = Dir.glob("**/*.fasta")
|
66
|
+
json_files = Dir.glob("**/log.json")
|
67
|
+
end
|
68
|
+
fasta_files.each do |f|
|
69
|
+
path_array = Pathname(f).each_filename.to_a
|
70
|
+
region = path_array[0]
|
71
|
+
if path_array[-1] == "combined.fasta"
|
72
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
|
73
|
+
Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
|
74
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
|
75
|
+
else
|
76
|
+
Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
|
77
|
+
Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
|
78
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
|
79
|
+
end
|
80
|
+
end
|
81
|
+
|
82
|
+
json_files.each do |f|
|
83
|
+
json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
|
84
|
+
log.print [lib,
|
85
|
+
json_log[:primer_set_name],
|
86
|
+
json_log[:total_raw_sequence],
|
87
|
+
json_log[:r1_filtered_raw],
|
88
|
+
json_log[:r2_filtered_raw],
|
89
|
+
json_log[:paired_raw_sequence],
|
90
|
+
json_log[:consensus_cutoff],
|
91
|
+
json_log[:length_of_pid],
|
92
|
+
json_log[:total_tcs_with_ambiguities],
|
93
|
+
json_log[:total_tcs],
|
94
|
+
json_log[:distinct_to_raw],
|
95
|
+
json_log[:resampling_param],
|
96
|
+
json_log[:combined_tcs],
|
97
|
+
json_log[:combined_tcs_after_qc],
|
98
|
+
json_log[:warnings],
|
99
|
+
].join(',') + "\n"
|
100
|
+
end
|
101
|
+
end
|
102
|
+
log.close
|
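The script above derives the region from the first component of each FASTA file's relative path and treats `combined.fasta` differently from the other files. A minimal sketch of that classification step, with a hypothetical helper `classify_fasta` and example directory names (`RT`, `consensus`) that are assumptions rather than guaranteed pipeline output:

```ruby
require 'pathname'

# Hypothetical helper (not in viral_seq): decide where a FASTA file from one
# library would be pooled, based on its path relative to the library folder.
def classify_fasta(lib, relative_path)
  parts  = Pathname(relative_path).each_filename.to_a
  region = parts.first # first path component is taken as the region name
  if parts.last == "combined.fasta"
    { kind: :combined, region: region, target: "#{lib}_#{region}" }
  else
    { kind: :per_region, region: region, target: "#{lib}_#{region}_#{parts.last}" }
  end
end
```

`Pathname#each_filename` splits on the path separator, so the sketch does not depend on string manipulation of slashes.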
Binary file
|
data/docs/dr.json
ADDED
@@ -0,0 +1,67 @@
|
|
1
|
+
{
|
2
|
+
"platform_error_rate": 0.02,
|
3
|
+
"primer_pairs": [
|
4
|
+
{
|
5
|
+
"region": "RT",
|
6
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
7
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
8
|
+
"majority": 0.5,
|
9
|
+
"end_join": true,
|
10
|
+
"end_join_option": 1,
|
11
|
+
"overlap": 0,
|
12
|
+
"TCS_QC": true,
|
13
|
+
"ref_genome": "HXB2",
|
14
|
+
"ref_start": 2648,
|
15
|
+
"ref_end": 3257,
|
16
|
+
"indel": true,
|
17
|
+
"trim": false
|
18
|
+
},
|
19
|
+
{
|
20
|
+
"region": "PR",
|
21
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
22
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
23
|
+
"majority": 0.5,
|
24
|
+
"end_join": true,
|
25
|
+
"end_join_option": 3,
|
26
|
+
"TCS_QC": true,
|
27
|
+
"ref_genome": "HXB2",
|
28
|
+
"ref_start": 0,
|
29
|
+
"ref_end": 2591,
|
30
|
+
"indel": true,
|
31
|
+
"trim": true,
|
32
|
+
"trim_ref": "HXB2",
|
33
|
+
"trim_ref_start": 2253,
|
34
|
+
"trim_ref_end": 2549
|
35
|
+
},
|
36
|
+
{
|
37
|
+
"region": "IN",
|
38
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
39
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
40
|
+
"majority": 0.5,
|
41
|
+
"end_join": true,
|
42
|
+
"end_join_option": 3,
|
43
|
+
"overlap": 171,
|
44
|
+
"TCS_QC": true,
|
45
|
+
"ref_genome": "HXB2",
|
46
|
+
"ref_start": 4384,
|
47
|
+
"ref_end": 4751,
|
48
|
+
"indel": false,
|
49
|
+
"trim": false
|
50
|
+
},
|
51
|
+
{
|
52
|
+
"region": "V1V3",
|
53
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
54
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
55
|
+
"majority": 0.5,
|
56
|
+
"end_join": true,
|
57
|
+
"end_join_option": 1,
|
58
|
+
"overlap": 0,
|
59
|
+
"TCS_QC": true,
|
60
|
+
"ref_genome": "HXB2",
|
61
|
+
"ref_start": 6585,
|
62
|
+
"ref_end": 7208,
|
63
|
+
"indel": true,
|
64
|
+
"trim": false
|
65
|
+
}
|
66
|
+
]
|
67
|
+
}
|
Binary file
|
Binary file
|
data/lib/viral_seq.rb
CHANGED
data/lib/viral_seq/enumerable.rb
CHANGED
@@ -3,10 +3,6 @@
|
|
3
3
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
4
4
|
# array.median
|
5
5
|
# => 5.5
|
6
|
-
# @example sum
|
7
|
-
# array = [1,2,3,4,5,6,7,8,9,10]
|
8
|
-
# array.sum
|
9
|
-
# => 55
|
10
6
|
# @example average number (mean)
|
11
7
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
12
8
|
# array.mean
|
@@ -45,12 +41,6 @@ module Enumerable
|
|
45
41
|
len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
|
46
42
|
end
|
47
43
|
|
48
|
-
# generate summed value
|
49
|
-
# @return [Numeric] summed value
|
50
|
-
def sum
|
51
|
-
self.inject(0){|accum, i| accum + i }
|
52
|
-
end
|
53
|
-
|
54
44
|
# generate mean number
|
55
45
|
# @return [Float] mean value
|
56
46
|
def mean
|
data/lib/viral_seq/math.rb
CHANGED
@@ -67,7 +67,7 @@ module ViralSeq
|
|
67
67
|
@k = k
|
68
68
|
@poisson_hash = {}
|
69
69
|
(0..k).each do |n|
|
70
|
-
p = (rate**n * ::Math::E**(-rate))
|
70
|
+
p = (rate**n * ::Math::E**(-rate))/n.factorial
|
71
71
|
@poisson_hash[n] = p
|
72
72
|
end
|
73
73
|
end
|
@@ -155,9 +155,9 @@ class Integer
|
|
155
155
|
# factorial method for an Integer
|
156
156
|
# @return [Integer] factorial for given Integer
|
157
157
|
# @example factorial for 5
|
158
|
-
#
|
158
|
+
# 5.factorial
|
159
159
|
# => 120
|
160
|
-
def
|
160
|
+
def factorial
|
161
161
|
if self == 0
|
162
162
|
return 1
|
163
163
|
else
|
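The first hunk above restores the missing division by `n!`, so each entry follows the Poisson probability P(n) = rate^n · e^(−rate) / n!. A minimal standalone sketch of that calculation; the helper names are hypothetical and avoid patching `Integer`, unlike the gem:

```ruby
# Hypothetical helpers (not in viral_seq) illustrating the corrected formula.

# n! computed iteratively; factorial(0) is 1.
def factorial(n)
  (1..n).reduce(1, :*)
end

# Poisson probability of observing exactly n events at the given rate.
def poisson_pmf(rate, n)
  (rate**n * Math::E**(-rate)) / factorial(n)
end

# Table of probabilities for 0..k events, keyed by n.
def poisson_table(rate, k)
  (0..k).map { |n| [n, poisson_pmf(rate, n)] }.to_h
end
```

Without the factorial term the values do not sum to 1, which is one quick way to spot the earlier bug.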
data/lib/viral_seq/seq_hash.rb
CHANGED
@@ -394,7 +394,6 @@ module ViralSeq
|
|
394
394
|
end
|
395
395
|
end
|
396
396
|
end
|
397
|
-
|
398
397
|
consensus_seq += call_consensus_base(max_base_list)
|
399
398
|
end
|
400
399
|
return consensus_seq
|
@@ -742,6 +741,7 @@ module ViralSeq
|
|
742
741
|
seq_hash_unique_pass = []
|
743
742
|
|
744
743
|
seq_hash_unique.each do |seq|
|
744
|
+
next if seq.nil?
|
745
745
|
loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
|
746
746
|
next unless loc # if locator tool fails, skip this seq.
|
747
747
|
if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
|
data/lib/viral_seq/seq_hash_pair.rb
CHANGED
@@ -110,19 +110,21 @@ module ViralSeq
|
|
110
110
|
raise ArgumentError.new(":overlap has to be Integer, input #{overlap} invalid.") unless overlap.is_a? Integer
|
111
111
|
raise ArgumentError.new(":diff has to be float or integer, input #{diff} invalid.") unless (diff.is_a? Integer or diff.is_a? Float)
|
112
112
|
joined_seq = {}
|
113
|
-
seq_pair_hash.
|
113
|
+
seq_pair_hash.each do |seq_name,seq_pair|
|
114
114
|
r1_seq = seq_pair[0]
|
115
115
|
r2_seq = seq_pair[1]
|
116
116
|
if overlap.zero?
|
117
117
|
joined_sequence = r1_seq + r2_seq
|
118
|
+
elsif diff.zero?
|
119
|
+
if r1_seq[-overlap..-1] == r2_seq[0,overlap]
|
120
|
+
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
121
|
+
end
|
118
122
|
elsif r1_seq[-overlap..-1].compare_with(r2_seq[0,overlap]) <= (overlap * diff)
|
119
123
|
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
120
124
|
else
|
121
125
|
next
|
122
126
|
end
|
123
|
-
|
124
|
-
joined_seq[seq_name] = joined_sequence
|
125
|
-
end
|
127
|
+
joined_seq[seq_name] = joined_sequence if joined_sequence
|
126
128
|
end
|
127
129
|
|
128
130
|
joined_seq_hash = ViralSeq::SeqHash.new
|
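The hunk above adds an exact-match fast path when `diff` is zero and only stores a pair when a joined sequence was produced. A minimal sketch of that decision for a single read pair; `join_pair` and `mismatch_count` are hypothetical helpers, and `mismatch_count` stands in for the gem's `compare_with`, whose exact behavior is assumed here:

```ruby
# Hypothetical helpers (not in viral_seq) sketching the end-join decision.

# Count positions where two equal-length strings differ.
def mismatch_count(a, b)
  a.each_char.zip(b.each_char).count { |x, y| x != y }
end

# Join r1 and r2 over a known overlap; return nil when the pair is rejected.
def join_pair(r1, r2, overlap, diff = 0.0)
  return r1 + r2 if overlap.zero?
  return nil if overlap > r1.size || overlap > r2.size

  r1_tail = r1[-overlap..-1]
  r2_head = r2[0, overlap]
  matched =
    if diff.zero?
      r1_tail == r2_head # fast path: exact match only
    else
      mismatch_count(r1_tail, r2_head) <= overlap * diff
    end
  matched ? r1 + r2[overlap..-1] : nil
end
```

Returning `nil` for a rejected pair matches the `if joined_sequence` guard, which keeps rejected pairs out of the result hash.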
data/lib/viral_seq/tcs_core.rb
CHANGED
@@ -102,9 +102,9 @@ module ViralSeq
|
|
102
102
|
end
|
103
103
|
|
104
104
|
# sort array of file names to determine if there is potential errors
|
105
|
-
#
|
106
|
-
#
|
107
|
-
|
105
|
+
# @param name_array [Array] array of file names
|
106
|
+
# @return [hash] name check results
|
107
|
+
|
108
108
|
def validate_file_name(name_array)
|
109
109
|
errors = {
|
110
110
|
file_type_error: [] ,
|
@@ -165,6 +165,13 @@ module ViralSeq
|
|
165
165
|
end
|
166
166
|
end
|
167
167
|
|
168
|
+
file_name_with_lib_name = {}
|
169
|
+
passed_libs.each do |lib_name, files|
|
170
|
+
files.each do |f|
|
171
|
+
file_name_with_lib_name[f] = lib_name
|
172
|
+
end
|
173
|
+
end
|
174
|
+
|
168
175
|
passed_names = []
|
169
176
|
|
170
177
|
passed_libs.values.each { |names| passed_names += names}
|
@@ -175,7 +182,27 @@ module ViralSeq
|
|
175
182
|
pass = true
|
176
183
|
end
|
177
184
|
|
178
|
-
|
185
|
+
file_name_with_error_type = {}
|
186
|
+
|
187
|
+
errors.each do |type, files|
|
188
|
+
files.each do |f|
|
189
|
+
file_name_with_error_type[f] ||= []
|
190
|
+
file_name_with_error_type[f] << type.to_s.tr("_", "\s")
|
191
|
+
end
|
192
|
+
end
|
193
|
+
|
194
|
+
file_check = []
|
195
|
+
|
196
|
+
name_array.each do |name|
|
197
|
+
file_check_hash = {}
|
198
|
+
file_check_hash[:fileName] = name
|
199
|
+
file_check_hash[:errors] = file_name_with_error_type[name]
|
200
|
+
file_check_hash[:libName] = file_name_with_lib_name[name]
|
201
|
+
|
202
|
+
file_check << file_check_hash
|
203
|
+
end
|
204
|
+
|
205
|
+
return { allPass: pass, files: file_check }
|
179
206
|
end
|
180
207
|
|
181
208
|
# filter r1 raw sequences for non-specific primers.
|
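With this change `validate_file_name` returns a hash of the form `{ allPass: ..., files: [...] }`, where each entry carries `:fileName`, `:errors`, and `:libName`. A minimal sketch of consuming such a result; the helper `failing_files` and the sample file names are hypothetical, and the hash is built by hand rather than by calling the gem:

```ruby
# Hypothetical helper (not in viral_seq): list files that failed the name check.
def failing_files(check)
  check[:files]
    .reject { |f| f[:errors].nil? || f[:errors].empty? }
    .map { |f| "#{f[:fileName]}: #{f[:errors].join(', ')}" }
end

# Hand-built example in the shape returned by validate_file_name.
check = {
  allPass: false,
  files: [
    { fileName: "lib1_R1.fastq.gz", errors: nil, libName: "lib1" },
    { fileName: "lib1_R2.fastq.gz", errors: nil, libName: "lib1" },
    { fileName: "notes.txt", errors: ["file type error"], libName: nil }
  ]
}

puts failing_files(check)
```

Files without errors have `:errors` set to `nil` in the diff above, so the sketch treats `nil` and an empty array the same way.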
@@ -278,7 +305,9 @@ module ViralSeq
|
|
278
305
|
end
|
279
306
|
|
280
307
|
def general_filter(seq)
|
281
|
-
if seq
|
308
|
+
if seq.size < $platform_sequencing_length
|
309
|
+
return false
|
310
|
+
elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
|
282
311
|
return false
|
283
312
|
elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
|
284
313
|
return false
|
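The updated `general_filter` first rejects reads shorter than the platform read length, then reads with `N` anywhere except the first and last positions, then reads containing a poly-A run. A minimal sketch of only the conditions visible in this hunk, passing the length as a parameter instead of the `$platform_sequencing_length` global; the method name is hypothetical and the real filter may apply further checks:

```ruby
# Hypothetical helper (not in viral_seq): the checks shown in the hunk above.
def passes_general_filter?(seq, min_length = 300)
  return false if seq.size < min_length # shorter than the platform read length
  return false if seq[1..-2] =~ /N/     # ambiguity away from the two end positions
  return false if seq =~ /A{11}/        # poly-A run suggests adaptor sequence
  true
end
```

Taking `min_length` as an argument makes the check easy to exercise for the 150, 250, and 300 cycle formats.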
data/lib/viral_seq/tcs_dr.rb
ADDED
@@ -0,0 +1,71 @@
|
|
1
|
+
module ViralSeq
|
2
|
+
|
3
|
+
class TcsDr
|
4
|
+
PARAMS = {:platform_error_rate=>0.02,
|
5
|
+
:primer_pairs=>
|
6
|
+
[{:region=>"RT",
|
7
|
+
:cdna=>
|
8
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
9
|
+
:forward=>
|
10
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
11
|
+
:majority=>0.5,
|
12
|
+
:end_join=>true,
|
13
|
+
:end_join_option=>1,
|
14
|
+
:overlap=>0,
|
15
|
+
:TCS_QC=>true,
|
16
|
+
:ref_genome=>"HXB2",
|
17
|
+
:ref_start=>2648,
|
18
|
+
:ref_end=>3257,
|
19
|
+
:indel=>true,
|
20
|
+
:trim=>false},
|
21
|
+
{:region=>"PR",
|
22
|
+
:cdna=>
|
23
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
24
|
+
:forward=>
|
25
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
26
|
+
:majority=>0.5,
|
27
|
+
:end_join=>true,
|
28
|
+
:end_join_option=>3,
|
29
|
+
:TCS_QC=>true,
|
30
|
+
:ref_genome=>"HXB2",
|
31
|
+
:ref_start=>0,
|
32
|
+
:ref_end=>2591,
|
33
|
+
:indel=>true,
|
34
|
+
:trim=>true,
|
35
|
+
:trim_ref=>"HXB2",
|
36
|
+
:trim_ref_start=>2253,
|
37
|
+
:trim_ref_end=>2549},
|
38
|
+
{:region=>"IN",
|
39
|
+
:cdna=>
|
40
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
41
|
+
:forward=>"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
42
|
+
:majority=>0.5,
|
43
|
+
:end_join=>true,
|
44
|
+
:end_join_option=>3,
|
45
|
+
:overlap=>171,
|
46
|
+
:TCS_QC=>true,
|
47
|
+
:ref_genome=>"HXB2",
|
48
|
+
:ref_start=>4384,
|
49
|
+
:ref_end=>4751,
|
50
|
+
:indel=>false,
|
51
|
+
:trim=>false},
|
52
|
+
{:region=>"V1V3",
|
53
|
+
:cdna=>
|
54
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
55
|
+
:forward=>
|
56
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
57
|
+
:majority=>0.5,
|
58
|
+
:end_join=>true,
|
59
|
+
:end_join_option=>1,
|
60
|
+
:overlap=>0,
|
61
|
+
:TCS_QC=>true,
|
62
|
+
:ref_genome=>"HXB2",
|
63
|
+
:ref_start=>6585,
|
64
|
+
:ref_end=>7208,
|
65
|
+
:indel=>true,
|
66
|
+
:trim=>false}
|
67
|
+
]
|
68
|
+
}
|
69
|
+
end
|
70
|
+
|
71
|
+
end
|
data/lib/viral_seq/tcs_json.rb
CHANGED
@@ -13,6 +13,22 @@ module ViralSeq
|
|
13
13
|
print '> '
|
14
14
|
param[:raw_sequence_dir] = gets.chomp.rstrip
|
15
15
|
|
16
|
+
puts "Choose MiSeq Platform (1-3):\n1. 150x7x150\n2. 250x7x250\n3. 300x7x300 (default)"
|
17
|
+
print "> "
|
18
|
+
pf_option = gets.chomp.rstrip
|
19
|
+
# while ![1,2,3].include?(pf_option.to_i)
|
20
|
+
# print "Entered MiSeq Platform #{pf_option.red.bold} not valid (choose 1-3), try again\n> "
|
21
|
+
# pf_option = gets.chomp.rstrip
|
22
|
+
# end
|
23
|
+
case pf_option.to_i
|
24
|
+
when 1
|
25
|
+
param[:platform_format] = 150
|
26
|
+
when 2
|
27
|
+
param[:platform_format] = 250
|
28
|
+
else
|
29
|
+
param[:platform_format] = 300
|
30
|
+
end
|
31
|
+
|
16
32
|
puts 'Enter the estimated platform error rate (for TCS cut-off calculation), default as ' + '0.02'.red.bold
|
17
33
|
print '> '
|
18
34
|
input_error = gets.chomp.rstrip.to_f
|
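The prompt above maps options 1 to 3 onto read lengths 150, 250, and 300, and falls back to 300 for any other input. A minimal sketch of the same mapping as a lookup table; the constant and method names are hypothetical:

```ruby
# Hypothetical lookup (not in viral_seq): menu option => MiSeq read length.
PLATFORM_READ_LENGTH = { 1 => 150, 2 => 250, 3 => 300 }.freeze

# Convert the raw answer typed by the user into a read length,
# defaulting to 300 for blank or unrecognized input.
def platform_format(choice)
  PLATFORM_READ_LENGTH.fetch(choice.to_s.strip.to_i, 300)
end
```

A blank answer converts to `0` via `to_i`, which is not a key in the table, so it takes the default like any other unrecognized value.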
@@ -52,12 +68,12 @@ module ViralSeq
|
|
52
68
|
if ej =~ /y|yes/i
|
53
69
|
data[:end_join] = true
|
54
70
|
|
55
|
-
|
56
|
-
1: simple join, no overlap
|
57
|
-
2: known overlap
|
58
|
-
3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap
|
59
|
-
4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap
|
60
|
-
> "
|
71
|
+
puts "End-join option? Choose from (1-4):"
|
72
|
+
puts "1: simple join, no overlap"
|
73
|
+
puts "2: known overlap"
|
74
|
+
puts "3: unknown overlap, use sample consensus to determine overlap, all sequence pairs have same overlap"
|
75
|
+
puts "4: unknown overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap"
|
76
|
+
print "> "
|
61
77
|
ej_option = gets.chomp.rstrip
|
62
78
|
while ![1,2,3,4].include?(ej_option.to_i)
|
63
79
|
puts "Entered end-join option #{ej_option.red.bold} not valid (choose 1-4), try again"
|
@@ -138,7 +154,12 @@ module ViralSeq
|
|
138
154
|
if save_option =~ /y|yes/i
|
139
155
|
print "Path to save JSON file:\n> "
|
140
156
|
path = gets.chomp.rstrip
|
141
|
-
|
157
|
+
while !validate_path_name(path)
|
158
|
+
print "Entered path not valid, try again.\n".red.bold
|
159
|
+
print "Path to save JSON file:\n> "
|
160
|
+
path = gets.chomp.rstrip
|
161
|
+
end
|
162
|
+
File.open(validate_path_name(path), 'w') {|f| f.puts JSON.pretty_generate(param)}
|
142
163
|
end
|
143
164
|
|
144
165
|
print "\nDo you wish to execute tcs pipeline with the input params now? Y/N \n> "
|
@@ -147,7 +168,7 @@ module ViralSeq
|
|
147
168
|
if rsp =~ /y/i
|
148
169
|
return param
|
149
170
|
else
|
150
|
-
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`"
|
171
|
+
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`".blue
|
151
172
|
end
|
152
173
|
|
153
174
|
end
|
@@ -172,7 +193,17 @@ module ViralSeq
|
|
172
193
|
when 3
|
173
194
|
:MAC239
|
174
195
|
end
|
175
|
-
end
|
176
|
-
|
196
|
+
end # end of get_ref
|
197
|
+
|
198
|
+
def validate_path_name(path)
|
199
|
+
if path.empty?
|
200
|
+
return false
|
201
|
+
elsif File.directory? path
|
202
|
+
return File.join(path, 'params.json')
|
203
|
+
elsif File.directory?(File.dirname(path))
|
204
|
+
return path
|
205
|
+
end
|
206
|
+
end # end of validate_path_name
|
207
|
+
end # end of class << self
|
177
208
|
end # end TcsJson
|
178
209
|
end # end main module
|
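The new `validate_path_name` accepts either an existing directory, in which case `params.json` is appended, or a file path whose parent directory exists. A minimal standalone sketch of that rule; `resolve_params_path` is a hypothetical name, and it returns `nil` rather than `false` for an empty path:

```ruby
# Hypothetical helper (not in viral_seq): resolve where params.json is saved.
def resolve_params_path(path)
  return nil if path.nil? || path.empty?
  # An existing directory gets the default file name appended.
  return File.join(path, 'params.json') if File.directory?(path)
  # Otherwise accept the path only if its parent directory exists.
  return path if File.directory?(File.dirname(path))
  nil
end
```

A bare file name such as `my_params.json` is accepted because `File.dirname` returns `"."`, the current directory.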
data/lib/viral_seq/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: viral_seq
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.1.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Shuntai Zhou
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2021-
|
12
|
+
date: 2021-04-01 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bundler
|
@@ -90,6 +90,7 @@ email:
|
|
90
90
|
executables:
|
91
91
|
- locator
|
92
92
|
- tcs
|
93
|
+
- tcs_log
|
93
94
|
extensions: []
|
94
95
|
extra_rdoc_files: []
|
95
96
|
files:
|
@@ -104,6 +105,11 @@ files:
|
|
104
105
|
- Rakefile
|
105
106
|
- bin/locator
|
106
107
|
- bin/tcs
|
108
|
+
- bin/tcs_log
|
109
|
+
- docs/assets/img/cover.jpg
|
110
|
+
- docs/dr.json
|
111
|
+
- docs/sample_miseq_data/hivdr_control/r1.fastq.gz
|
112
|
+
- docs/sample_miseq_data/hivdr_control/r2.fastq.gz
|
107
113
|
- lib/viral_seq.rb
|
108
114
|
- lib/viral_seq/constant.rb
|
109
115
|
- lib/viral_seq/enumerable.rb
|
@@ -120,6 +126,7 @@ files:
|
|
120
126
|
- lib/viral_seq/sequence.rb
|
121
127
|
- lib/viral_seq/string.rb
|
122
128
|
- lib/viral_seq/tcs_core.rb
|
129
|
+
- lib/viral_seq/tcs_dr.rb
|
123
130
|
- lib/viral_seq/tcs_json.rb
|
124
131
|
- lib/viral_seq/version.rb
|
125
132
|
- viral_seq.gemspec
|