viral_seq 1.0.11 → 1.1.1
- checksums.yaml +4 -4
- data/.gitignore +0 -1
- data/Gemfile.lock +1 -1
- data/README.md +93 -11
- data/bin/tcs +34 -6
- data/bin/tcs_log +102 -0
- data/docs/assets/img/cover.jpg +0 -0
- data/docs/dr.json +67 -0
- data/docs/sample_miseq_data/hivdr_control/r1.fastq.gz +0 -0
- data/docs/sample_miseq_data/hivdr_control/r2.fastq.gz +0 -0
- data/lib/viral_seq.rb +1 -1
- data/lib/viral_seq/enumerable.rb +0 -10
- data/lib/viral_seq/math.rb +3 -3
- data/lib/viral_seq/seq_hash.rb +1 -1
- data/lib/viral_seq/seq_hash_pair.rb +6 -4
- data/lib/viral_seq/tcs_core.rb +34 -5
- data/lib/viral_seq/tcs_dr.rb +71 -0
- data/lib/viral_seq/tcs_json.rb +41 -10
- data/lib/viral_seq/version.rb +2 -2
- metadata +9 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 7a283f3a09cc5d9807e7622cd1ddf27197919955e85d6472b34fc14b66749c03
|
4
|
+
data.tar.gz: 4f90c5a9c7ea0ec148ba7d45ee88dc441f79da67a97654734194a773499ebb8e
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 385a94eb93c3d8d9116c16a0d8af56ba714ba6191a454076acf881a036de80d1d598f3fcd1a4de841745ca08a1ad3e8bc028a30db9f96c19f3b217ef4583d652
|
7
|
+
data.tar.gz: 714d035b6f65863746cafb120c9cf6eccb8261f3eac69985bad96e5275351eec71aa3b744ee9b462e2dc3e0e199c2d4112386f6a2d7eef89b5b7824c1ab769be
|
data/.gitignore
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -1,8 +1,24 @@
|
|
1
1
|
# ViralSeq
|
2
2
|
|
3
|
+
[](https://rubygems.org/gems/viral_seq)
|
4
|
+

|
5
|
+

|
6
|
+

|
7
|
+
[](https://gitter.im/viral_seq/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
|
8
|
+
|
3
9
|
A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|
4
10
|
|
5
|
-
Specifically for Primer
|
11
|
+
Specifically for Primer ID sequencing and HIV drug resistance analysis.
|
12
|
+
|
13
|
+
## Illustration for the Primer ID Sequencing
|
14
|
+
|
15
|
+
|
16
|
+

|
17
|
+
|
18
|
+
### Reference readings on the Primer ID sequencing
|
19
|
+
[Explanation of Primer ID sequencing](https://doi.org/10.21769/BioProtoc.3938)
|
20
|
+
[Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
|
21
|
+
[Application of Primer ID sequencing in COVID-19 research](https://doi.org/10.1126/scitranslmed.abb5883)
|
6
22
|
|
7
23
|
## Install
|
8
24
|
|
@@ -14,20 +30,55 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
|
|
14
30
|
|
15
31
|
### Executables
|
16
32
|
|
17
|
-
|
33
|
+
### `tcs`
|
34
|
+
Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
|
18
35
|
|
36
|
+
Example commands:
|
19
37
|
```bash
|
20
|
-
$
|
38
|
+
$ tcs -p params.json # run TCS pipeline with params.json
|
39
|
+
$ tcs -p params.json -i DIRECTORY
|
40
|
+
# run TCS pipeline with params.json and DIRECTORY
|
41
|
+
# if DIRECTORY is not defined in params.json
|
42
|
+
$ tcs -dr -i DIRECTORY
|
43
|
+
# run tcs-dr (MPID HIV drug resistance sequencing) pipeline
|
44
|
+
# DIRECTORY needs to be given.
|
45
|
+
$ tcs -j # CLI to generate params.json
|
46
|
+
$ tcs -h # print out the help
|
21
47
|
```
|
22
48
|
|
23
|
-
|
49
|
+
[sample params.json for the tcs-dr pipeline](./docs/dr.json)
|
50
|
+
|
51
|
+
---
|
52
|
+
### `tcs_log`
|
53
|
+
|
54
|
+
Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
|
24
55
|
|
56
|
+
|
57
|
+
Example file structure:
|
58
|
+
```
|
59
|
+
batch_tcs_jobs/
|
60
|
+
├── lib1
|
61
|
+
├── lib2
|
62
|
+
├── lib3
|
63
|
+
├── lib4
|
64
|
+
├── ...
|
65
|
+
```
|
66
|
+
|
67
|
+
Example command:
|
25
68
|
```bash
|
26
|
-
$
|
27
|
-
$ tcs -j # CLI to generate params.json
|
28
|
-
$ tcs -h # print out the help
|
69
|
+
$ tcs_log batch_tcs_jobs
|
29
70
|
```
|
30
71
|
|
72
|
+
---
|
73
|
+
|
74
|
+
### `locator`
|
75
|
+
Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
|
76
|
+
|
77
|
+
```bash
|
78
|
+
$ locator -i sequence.fasta -o sequence.fasta.csv
|
79
|
+
```
|
80
|
+
---
|
81
|
+
|
31
82
|
## Some Examples
|
32
83
|
|
33
84
|
Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
|
@@ -80,16 +131,47 @@ qc_seqhash.sdrm_hiv_pr(cut_off)
|
|
80
131
|
```
|
81
132
|
## Known issues
|
82
133
|
|
83
|
-
1. have a conflict with rails
|
134
|
+
1. ~~have a conflict with rails.~~
|
135
|
+
2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
|
136
|
+
3. The conflict seems to be resolved. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
|
84
137
|
|
85
138
|
## Updates
|
86
139
|
|
87
|
-
### Version 1.1.1-
|
140
|
+
### Version 1.1.1-04012021
|
141
|
+
|
142
|
+
1. Added warning when paired_raw_sequence less than 0.1% of total_raw_sequence.
|
143
|
+
2. Added option `-i WORKING_DIRECTORY` to the `tcs` script.
|
144
|
+
If the `params.json` file does not contain the path to the working directory, it will append path to the run params.
|
145
|
+
3. Added option `-dr` to the `tcs` script.
|
146
|
+
|
147
|
+
### Version 1.1.0-03252021
|
148
|
+
|
149
|
+
1. Optimized the algorithm of end-join.
|
150
|
+
2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
|
151
|
+
3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
|
152
|
+
4. Added the preset of MPID-HIVDR params file [***dr.json***](./docs/dr.json) in /docs.
|
153
|
+
5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
|
154
|
+
Users can choose from 3 MiSeq platforms for processing their sequencing data.
|
155
|
+
MiSeq 300x7x300 is the default option.
|
156
|
+
|
157
|
+
### Version 1.0.14-03052021
|
158
|
+
|
159
|
+
1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
|
160
|
+
|
161
|
+
### Version 1.0.13-03032021
|
162
|
+
|
163
|
+
1. Fixed the conflict with rails.
|
164
|
+
|
165
|
+
### Version 1.0.12-03032021
|
166
|
+
|
167
|
+
1. Fixed an issue that may cause conflicts with ActiveRecord.
|
168
|
+
|
169
|
+
### Version 1.0.11-03022021
|
88
170
|
|
89
|
-
1. Fixed
|
171
|
+
1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
|
90
172
|
2. fixed an issue loading class 'OptionParser'in some ruby environments.
|
91
173
|
|
92
|
-
### Version 1.
|
174
|
+
### Version 1.0.10-11112020:
|
93
175
|
|
94
176
|
1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
|
95
177
|
2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
|
data/bin/tcs
CHANGED
@@ -23,7 +23,7 @@
|
|
23
23
|
# THE SOFTWARE.
|
24
24
|
|
25
25
|
# Use JSON file as the run param
|
26
|
-
# run
|
26
|
+
# run `tcs -j` to generate param json file.
|
27
27
|
|
28
28
|
require 'viral_seq'
|
29
29
|
require 'json'
|
@@ -46,6 +46,14 @@ OptionParser.new do |opts|
|
|
46
46
|
options[:params_json] = p
|
47
47
|
end
|
48
48
|
|
49
|
+
opts.on("-i", "--input PATH_TO_WORKING_DIRECTORY", "Path to the working directory") do |p|
|
50
|
+
options[:input] = p
|
51
|
+
end
|
52
|
+
|
53
|
+
opts.on("-dr", "--dr_pipeline", "HIV drug resistance MPID pipeline") do |p|
|
54
|
+
options[:dr] = true
|
55
|
+
end
|
56
|
+
|
49
57
|
opts.on("-h", "--help", "Prints this help") do
|
50
58
|
puts opts
|
51
59
|
exit
|
@@ -64,15 +72,21 @@ end.parse!
|
|
64
72
|
|
65
73
|
if options[:json_generator]
|
66
74
|
params = ViralSeq::TcsJson.generate
|
75
|
+
elsif options[:dr]
|
76
|
+
params = ViralSeq::TcsDr::PARAMS
|
67
77
|
elsif (options[:params_json] && File.exist?(options[:params_json]))
|
68
78
|
params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
|
69
79
|
else
|
70
80
|
abort "No params JSON file found. Script terminated.".red
|
71
81
|
end
|
72
82
|
|
73
|
-
|
83
|
+
if options[:input]
|
84
|
+
indir = options[:input]
|
85
|
+
else
|
86
|
+
indir = params[:raw_sequence_dir]
|
87
|
+
end
|
74
88
|
|
75
|
-
unless File.exist?(indir)
|
89
|
+
unless indir and File.exist?(indir)
|
76
90
|
abort "No input sequence directory found. Script terminated.".red.bold
|
77
91
|
end
|
78
92
|
|
@@ -115,6 +129,12 @@ else
|
|
115
129
|
error_rate = 0.02
|
116
130
|
end
|
117
131
|
|
132
|
+
if params[:platform_format]
|
133
|
+
$platform_sequencing_length = params[:platform_format]
|
134
|
+
else
|
135
|
+
$platform_sequencing_length = 300
|
136
|
+
end
|
137
|
+
|
118
138
|
primers = params[:primer_pairs]
|
119
139
|
if primers.empty?
|
120
140
|
ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
|
@@ -123,6 +143,7 @@ end
|
|
123
143
|
|
124
144
|
primers.each do |primer|
|
125
145
|
summary_json = {}
|
146
|
+
summary_json[:warnings] = []
|
126
147
|
summary_json[:tcs_version] = ViralSeq::TCS_VERSION
|
127
148
|
summary_json[:viralseq_version] = ViralSeq::VERSION
|
128
149
|
summary_json[:runtime] = Time.now.to_s
|
@@ -175,6 +196,10 @@ primers.each do |primer|
|
|
175
196
|
paired_seq_number = common_keys.size
|
176
197
|
log.puts Time.now.to_s + "\t" + "Paired raw sequences are : #{paired_seq_number.to_s}"
|
177
198
|
summary_json[:paired_raw_sequence] = paired_seq_number
|
199
|
+
if paired_seq_number < raw_sequence_number * 0.001
|
200
|
+
summary_json[:warnings] <<
|
201
|
+
"WARNING: Filtered raw sequneces less than 0.1% of the total raw sequences. Possible contamination."
|
202
|
+
end
|
178
203
|
|
179
204
|
common_keys.each do |seqtag|
|
180
205
|
r1_seq = r1_passed_seq[seqtag]
|
@@ -273,7 +298,6 @@ primers.each do |primer|
|
|
273
298
|
r1_sub_seq << bio_r1[seq_name]
|
274
299
|
r2_sub_seq << bio_r2[seq_name]
|
275
300
|
end
|
276
|
-
|
277
301
|
#consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
|
278
302
|
consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
|
279
303
|
r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
|
@@ -364,6 +388,7 @@ primers.each do |primer|
|
|
364
388
|
shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
|
365
389
|
joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
|
366
390
|
log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
|
391
|
+
|
367
392
|
summary_json[:combined_tcs] = joined_sh.size
|
368
393
|
|
369
394
|
if export_raw
|
@@ -433,12 +458,15 @@ primers.each do |primer|
|
|
433
458
|
trim_end = primer[:trim_ref_end]
|
434
459
|
trim_ref = primer[:trim_ref].to_sym
|
435
460
|
joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
|
436
|
-
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
437
461
|
if export_raw
|
438
462
|
joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
|
439
|
-
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
440
463
|
end
|
441
464
|
end
|
465
|
+
|
466
|
+
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
467
|
+
if export_raw
|
468
|
+
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
469
|
+
end
|
442
470
|
end
|
443
471
|
|
444
472
|
File.open(outfile_log, "w") do |f|
|
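Reviewer note on `bin/tcs`: the two behavioural additions in this file are the input-directory fallback (`-i` on the command line, otherwise `raw_sequence_dir` from the params) and the low-pairing warning. The sketch below restates both in isolation. The helper names `resolve_input_dir` and `low_pairing_warning?` are hypothetical; the script itself inlines this logic.

```ruby
# Hypothetical helpers restating logic that bin/tcs inlines.

# A directory given with -i takes precedence; otherwise fall back
# to :raw_sequence_dir from the params hash (may be nil).
def resolve_input_dir(options, params)
  options[:input] || params[:raw_sequence_dir]
end

# True when paired reads are fewer than 0.1% of all raw reads,
# the condition under which the script appends a warning.
def low_pairing_warning?(paired_seq_number, raw_sequence_number)
  paired_seq_number < raw_sequence_number * 0.001
end

params = { raw_sequence_dir: "/data/run1" }
puts resolve_input_dir({ input: nil }, params)           # falls back to params
puts resolve_input_dir({ input: "/data/run2" }, params)  # -i wins
puts low_pairing_warning?(5, 10_000)
puts low_pairing_warning?(50, 10_000)
```

Note that, as written in the diff, `-i` overrides the directory in `params.json` whenever it is supplied, not only when the params file lacks one.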
data/bin/tcs_log
ADDED
@@ -0,0 +1,102 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
# pool run logs from one batch of tcs jobs
|
4
|
+
# file structure:
|
5
|
+
# batch_tcs_jobs/
|
6
|
+
# ├── lib1
|
7
|
+
# ├── lib2
|
8
|
+
# ├── lib3
|
9
|
+
# ├── lib4
|
10
|
+
# ├── ...
|
11
|
+
#
|
12
|
+
# command example:
|
13
|
+
# $ tcs_log batch_tcs_jobs
|
14
|
+
|
15
|
+
require 'viral_seq'
|
16
|
+
require 'pathname'
|
17
|
+
require 'json'
|
18
|
+
require 'fileutils'
|
19
|
+
|
20
|
+
indir = ARGV[0].chomp
|
21
|
+
indir_basename = File.basename(indir)
|
22
|
+
indir_dirname = File.dirname(indir)
|
23
|
+
|
24
|
+
tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
|
25
|
+
Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
|
26
|
+
|
27
|
+
libs = []
|
28
|
+
Dir.chdir(indir) {libs = Dir.glob("*")}
|
29
|
+
|
30
|
+
outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
|
31
|
+
outdir3 = File.join(tcs_dir, "TCS_per_region")
|
32
|
+
outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
|
33
|
+
|
34
|
+
Dir.mkdir(outdir2) unless File.directory?(outdir2)
|
35
|
+
Dir.mkdir(outdir3) unless File.directory?(outdir3)
|
36
|
+
Dir.mkdir(outdir4) unless File.directory?(outdir4)
|
37
|
+
|
38
|
+
log_file = File.join(tcs_dir,"log.csv")
|
39
|
+
log = File.open(log_file,'w')
|
40
|
+
|
41
|
+
header = %w{
|
42
|
+
lib_name
|
43
|
+
Region
|
44
|
+
Raw_Sequences_per_barcode
|
45
|
+
R1_Raw
|
46
|
+
R2_Raw
|
47
|
+
Paired_Raw
|
48
|
+
Cutoff
|
49
|
+
PID_Length
|
50
|
+
Consensus1
|
51
|
+
Consensus2
|
52
|
+
Distinct_to_Raw
|
53
|
+
Resampling_index
|
54
|
+
Combined_TCS
|
55
|
+
Combined_TCS_after_QC
|
56
|
+
WARNINGS
|
57
|
+
}
|
58
|
+
|
59
|
+
log.puts header.join(',')
|
60
|
+
libs.each do |lib|
|
61
|
+
Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
|
62
|
+
fasta_files = []
|
63
|
+
json_files = []
|
64
|
+
Dir.chdir(File.join(indir, lib)) do
|
65
|
+
fasta_files = Dir.glob("**/*.fasta")
|
66
|
+
json_files = Dir.glob("**/log.json")
|
67
|
+
end
|
68
|
+
fasta_files.each do |f|
|
69
|
+
path_array = Pathname(f).each_filename.to_a
|
70
|
+
region = path_array[0]
|
71
|
+
if path_array[-1] == "combined.fasta"
|
72
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
|
73
|
+
Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
|
74
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
|
75
|
+
else
|
76
|
+
Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
|
77
|
+
Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
|
78
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
|
79
|
+
end
|
80
|
+
end
|
81
|
+
|
82
|
+
json_files.each do |f|
|
83
|
+
json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
|
84
|
+
log.print [lib,
|
85
|
+
json_log[:primer_set_name],
|
86
|
+
json_log[:total_raw_sequence],
|
87
|
+
json_log[:r1_filtered_raw],
|
88
|
+
json_log[:r2_filtered_raw],
|
89
|
+
json_log[:paired_raw_sequence],
|
90
|
+
json_log[:consensus_cutoff],
|
91
|
+
json_log[:length_of_pid],
|
92
|
+
json_log[:total_tcs_with_ambiguities],
|
93
|
+
json_log[:total_tcs],
|
94
|
+
json_log[:distinct_to_raw],
|
95
|
+
json_log[:resampling_param],
|
96
|
+
json_log[:combined_tcs],
|
97
|
+
json_log[:combined_tcs_after_qc],
|
98
|
+
json_log[:warnings],
|
99
|
+
].join(',') + "\n"
|
100
|
+
end
|
101
|
+
end
|
102
|
+
log.close
|
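Reviewer note on `bin/tcs_log`: the copy rules for fasta files can be hard to follow inline. The sketch below restates them as a pure function that only computes destination paths (relative to the `*_tcs` output directory) and copies nothing. The function name `destinations` and the intermediate folder name `consensus` in the example are assumptions for illustration.

```ruby
require 'pathname'

# Hypothetical helper: given a library name and a fasta path relative to
# that library's folder, return the destination paths bin/tcs_log would
# copy to. The first path component is treated as the region name.
def destinations(lib, relative_path)
  parts  = Pathname(relative_path).each_filename.to_a
  region = parts.first
  file   = parts.last
  if file == "combined.fasta"
    # combined TCS files are pooled both per library and per region
    [File.join("combined_TCS_per_lib", lib, "#{lib}_#{region}"),
     File.join("combined_TCS_per_region", region, "#{lib}_#{region}")]
  else
    # every other fasta file is pooled per region, then per library
    [File.join("TCS_per_region", region, lib, "#{lib}_#{region}_#{file}")]
  end
end

puts destinations("lib1", "RT/consensus/combined.fasta")
puts destinations("lib1", "RT/consensus/r1.fasta")
```

As in the script, the pooled `combined.fasta` copies are named `lib_region` with no `.fasta` extension.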
Binary file
|
data/docs/dr.json
ADDED
@@ -0,0 +1,67 @@
|
|
1
|
+
{
|
2
|
+
"platform_error_rate": 0.02,
|
3
|
+
"primer_pairs": [
|
4
|
+
{
|
5
|
+
"region": "RT",
|
6
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
7
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
8
|
+
"majority": 0.5,
|
9
|
+
"end_join": true,
|
10
|
+
"end_join_option": 1,
|
11
|
+
"overlap": 0,
|
12
|
+
"TCS_QC": true,
|
13
|
+
"ref_genome": "HXB2",
|
14
|
+
"ref_start": 2648,
|
15
|
+
"ref_end": 3257,
|
16
|
+
"indel": true,
|
17
|
+
"trim": false
|
18
|
+
},
|
19
|
+
{
|
20
|
+
"region": "PR",
|
21
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
22
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
23
|
+
"majority": 0.5,
|
24
|
+
"end_join": true,
|
25
|
+
"end_join_option": 3,
|
26
|
+
"TCS_QC": true,
|
27
|
+
"ref_genome": "HXB2",
|
28
|
+
"ref_start": 0,
|
29
|
+
"ref_end": 2591,
|
30
|
+
"indel": true,
|
31
|
+
"trim": true,
|
32
|
+
"trim_ref": "HXB2",
|
33
|
+
"trim_ref_start": 2253,
|
34
|
+
"trim_ref_end": 2549
|
35
|
+
},
|
36
|
+
{
|
37
|
+
"region": "IN",
|
38
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
39
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
40
|
+
"majority": 0.5,
|
41
|
+
"end_join": true,
|
42
|
+
"end_join_option": 3,
|
43
|
+
"overlap": 171,
|
44
|
+
"TCS_QC": true,
|
45
|
+
"ref_genome": "HXB2",
|
46
|
+
"ref_start": 4384,
|
47
|
+
"ref_end": 4751,
|
48
|
+
"indel": false,
|
49
|
+
"trim": false
|
50
|
+
},
|
51
|
+
{
|
52
|
+
"region": "V1V3",
|
53
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
54
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
55
|
+
"majority": 0.5,
|
56
|
+
"end_join": true,
|
57
|
+
"end_join_option": 1,
|
58
|
+
"overlap": 0,
|
59
|
+
"TCS_QC": true,
|
60
|
+
"ref_genome": "HXB2",
|
61
|
+
"ref_start": 6585,
|
62
|
+
"ref_end": 7208,
|
63
|
+
"indel": true,
|
64
|
+
"trim": false
|
65
|
+
}
|
66
|
+
]
|
67
|
+
}
|
Binary file
|
Binary file
|
data/lib/viral_seq.rb
CHANGED
data/lib/viral_seq/enumerable.rb
CHANGED
@@ -3,10 +3,6 @@
|
|
3
3
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
4
4
|
# array.median
|
5
5
|
# => 5.5
|
6
|
-
# @example sum
|
7
|
-
# array = [1,2,3,4,5,6,7,8,9,10]
|
8
|
-
# array.sum
|
9
|
-
# => 55
|
10
6
|
# @example average number (mean)
|
11
7
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
12
8
|
# array.mean
|
@@ -45,12 +41,6 @@ module Enumerable
|
|
45
41
|
len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
|
46
42
|
end
|
47
43
|
|
48
|
-
# generate summed value
|
49
|
-
# @return [Numeric] summed value
|
50
|
-
def sum
|
51
|
-
self.inject(0){|accum, i| accum + i }
|
52
|
-
end
|
53
|
-
|
54
44
|
# generate mean number
|
55
45
|
# @return [Float] mean value
|
56
46
|
def mean
|
data/lib/viral_seq/math.rb
CHANGED
@@ -67,7 +67,7 @@ module ViralSeq
|
|
67
67
|
@k = k
|
68
68
|
@poisson_hash = {}
|
69
69
|
(0..k).each do |n|
|
70
|
-
p = (rate**n * ::Math::E**(-rate))
|
70
|
+
p = (rate**n * ::Math::E**(-rate))/n.factorial
|
71
71
|
@poisson_hash[n] = p
|
72
72
|
end
|
73
73
|
end
|
@@ -155,9 +155,9 @@ class Integer
|
|
155
155
|
# factorial method for an Integer
|
156
156
|
# @return [Integer] factorial for given Integer
|
157
157
|
# @example factorial for 5
|
158
|
-
#
|
158
|
+
# 5.factorial
|
159
159
|
# => 120
|
160
|
-
def
|
160
|
+
def factorial
|
161
161
|
if self == 0
|
162
162
|
return 1
|
163
163
|
else
|
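Reviewer note on `math.rb`: the fix at line 70 restores the `n!` denominator, so each entry becomes the Poisson probability rate^n * e^(-rate) / n!. A minimal standalone sketch of the corrected table follows. It uses a local `factorial` helper rather than the gem's `Integer#factorial`, and the name `poisson_table` is hypothetical.

```ruby
# Local helper so the sketch does not depend on the gem's Integer#factorial.
def factorial(n)
  (1..n).reduce(1, :*)   # returns 1 for n == 0
end

# Build {n => P(X = n)} for n in 0..k under a Poisson distribution.
def poisson_table(rate, k)
  (0..k).each_with_object({}) do |n, table|
    table[n] = (rate**n * Math::E**(-rate)) / factorial(n)
  end
end

table = poisson_table(2.0, 20)
puts table[0]            # equals e^-2
puts table.values.sum    # close to 1 once k is large relative to rate
```

Without the division, the entries are not probabilities and their sum grows with `k`, which is consistent with the changelog entry about the Poisson cutoff for minority mutations.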
data/lib/viral_seq/seq_hash.rb
CHANGED
@@ -394,7 +394,6 @@ module ViralSeq
|
|
394
394
|
end
|
395
395
|
end
|
396
396
|
end
|
397
|
-
|
398
397
|
consensus_seq += call_consensus_base(max_base_list)
|
399
398
|
end
|
400
399
|
return consensus_seq
|
@@ -742,6 +741,7 @@ module ViralSeq
|
|
742
741
|
seq_hash_unique_pass = []
|
743
742
|
|
744
743
|
seq_hash_unique.each do |seq|
|
744
|
+
next if seq.nil?
|
745
745
|
loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
|
746
746
|
next unless loc # if locator tool fails, skip this seq.
|
747
747
|
if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
|
data/lib/viral_seq/seq_hash_pair.rb
CHANGED
@@ -110,19 +110,21 @@ module ViralSeq
|
|
110
110
|
raise ArgumentError.new(":overlap has to be Integer, input #{overlap} invalid.") unless overlap.is_a? Integer
|
111
111
|
raise ArgumentError.new(":diff has to be float or integer, input #{diff} invalid.") unless (diff.is_a? Integer or diff.is_a? Float)
|
112
112
|
joined_seq = {}
|
113
|
-
seq_pair_hash.
|
113
|
+
seq_pair_hash.each do |seq_name,seq_pair|
|
114
114
|
r1_seq = seq_pair[0]
|
115
115
|
r2_seq = seq_pair[1]
|
116
116
|
if overlap.zero?
|
117
117
|
joined_sequence = r1_seq + r2_seq
|
118
|
+
elsif diff.zero?
|
119
|
+
if r1_seq[-overlap..-1] == r2_seq[0,overlap]
|
120
|
+
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
121
|
+
end
|
118
122
|
elsif r1_seq[-overlap..-1].compare_with(r2_seq[0,overlap]) <= (overlap * diff)
|
119
123
|
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
120
124
|
else
|
121
125
|
next
|
122
126
|
end
|
123
|
-
|
124
|
-
joined_seq[seq_name] = joined_sequence
|
125
|
-
end
|
127
|
+
joined_seq[seq_name] = joined_sequence if joined_sequence
|
126
128
|
end
|
127
129
|
|
128
130
|
joined_seq_hash = ViralSeq::SeqHash.new
|
data/lib/viral_seq/tcs_core.rb
CHANGED
@@ -102,9 +102,9 @@ module ViralSeq
|
|
102
102
|
end
|
103
103
|
|
104
104
|
# sort array of file names to determine if there is potential errors
|
105
|
-
#
|
106
|
-
#
|
107
|
-
|
105
|
+
# @param name_array [Array] array of file names
|
106
|
+
# @return [hash] name check results
|
107
|
+
|
108
108
|
def validate_file_name(name_array)
|
109
109
|
errors = {
|
110
110
|
file_type_error: [] ,
|
@@ -165,6 +165,13 @@ module ViralSeq
|
|
165
165
|
end
|
166
166
|
end
|
167
167
|
|
168
|
+
file_name_with_lib_name = {}
|
169
|
+
passed_libs.each do |lib_name, files|
|
170
|
+
files.each do |f|
|
171
|
+
file_name_with_lib_name[f] = lib_name
|
172
|
+
end
|
173
|
+
end
|
174
|
+
|
168
175
|
passed_names = []
|
169
176
|
|
170
177
|
passed_libs.values.each { |names| passed_names += names}
|
@@ -175,7 +182,27 @@ module ViralSeq
|
|
175
182
|
pass = true
|
176
183
|
end
|
177
184
|
|
178
|
-
|
185
|
+
file_name_with_error_type = {}
|
186
|
+
|
187
|
+
errors.each do |type, files|
|
188
|
+
files.each do |f|
|
189
|
+
file_name_with_error_type[f] ||= []
|
190
|
+
file_name_with_error_type[f] << type.to_s.tr("_", "\s")
|
191
|
+
end
|
192
|
+
end
|
193
|
+
|
194
|
+
file_check = []
|
195
|
+
|
196
|
+
name_array.each do |name|
|
197
|
+
file_check_hash = {}
|
198
|
+
file_check_hash[:fileName] = name
|
199
|
+
file_check_hash[:errors] = file_name_with_error_type[name]
|
200
|
+
file_check_hash[:libName] = file_name_with_lib_name[name]
|
201
|
+
|
202
|
+
file_check << file_check_hash
|
203
|
+
end
|
204
|
+
|
205
|
+
return { allPass: pass, files: file_check }
|
179
206
|
end
|
180
207
|
|
181
208
|
# filter r1 raw sequences for non-specific primers.
|
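Reviewer note on `validate_file_name`: the method now returns one record per input file under `files`, alongside `allPass`. The sketch below reproduces only the record-building step from this hunk, with made-up inputs. The function name `build_file_check` and the sample file names are hypothetical, and how `errors` and `passed_libs` are populated is outside this hunk.

```ruby
# Hypothetical extraction of the record-building step added in this hunk.
def build_file_check(name_array, errors, passed_libs)
  # map each passing file name to its library name
  lib_of = {}
  passed_libs.each { |lib_name, files| files.each { |f| lib_of[f] = lib_name } }

  # map each failing file name to human-readable error types
  errors_of = {}
  errors.each do |type, files|
    files.each { |f| (errors_of[f] ||= []) << type.to_s.tr("_", " ") }
  end

  name_array.map do |name|
    { fileName: name, errors: errors_of[name], libName: lib_of[name] }
  end
end

names  = ["lib1_R1.fastq.gz", "lib1_R2.fastq.gz", "notes.txt"]
errors = { file_type_error: ["notes.txt"] }
passed = { "lib1" => ["lib1_R1.fastq.gz", "lib1_R2.fastq.gz"] }
build_file_check(names, errors, passed).each { |rec| p rec }
```

Files with no errors carry `errors: nil`, and files that failed carry `libName: nil`, so callers need to handle both.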
@@ -278,7 +305,9 @@ module ViralSeq
|
|
278
305
|
end
|
279
306
|
|
280
307
|
def general_filter(seq)
|
281
|
-
if seq
|
308
|
+
if seq.size < $platform_sequencing_length
|
309
|
+
return false
|
310
|
+
elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
|
282
311
|
return false
|
283
312
|
elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
|
284
313
|
return false
|
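Reviewer note on `general_filter`: reads shorter than the platform read length are now rejected before the ambiguity check. The sketch below covers only the three conditions visible in this hunk; the remainder of the method is not shown in the diff and is omitted. It takes the length as a parameter named `platform_length` (an assumption) instead of reading the `$platform_sequencing_length` global.

```ruby
# Partial sketch: only the checks visible in this hunk are reproduced.
def general_filter(seq, platform_length = 300)
  return false if seq.size < platform_length   # truncated read
  return false if seq[1..-2] =~ /N/            # internal ambiguity
  return false if seq =~ /A{11}/               # poly-A, likely adaptor
  true
end

full_read = "ACGT" * 75                         # 300 nt, no ambiguities
puts general_filter(full_read)
puts general_filter(full_read[0, 299])          # one base short
puts general_filter("N" + full_read)            # N at first position only
puts general_filter(full_read, 150)             # shorter platform setting
```

Because the threshold comes from `platform_format` in the params, runs on 150 or 250 cycle kits need that value set, or every read would fall below the default of 300.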
data/lib/viral_seq/tcs_dr.rb
ADDED
@@ -0,0 +1,71 @@
|
|
1
|
+
module ViralSeq
|
2
|
+
|
3
|
+
class TcsDr
|
4
|
+
PARAMS = {:platform_error_rate=>0.02,
|
5
|
+
:primer_pairs=>
|
6
|
+
[{:region=>"RT",
|
7
|
+
:cdna=>
|
8
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
9
|
+
:forward=>
|
10
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
11
|
+
:majority=>0.5,
|
12
|
+
:end_join=>true,
|
13
|
+
:end_join_option=>1,
|
14
|
+
:overlap=>0,
|
15
|
+
:TCS_QC=>true,
|
16
|
+
:ref_genome=>"HXB2",
|
17
|
+
:ref_start=>2648,
|
18
|
+
:ref_end=>3257,
|
19
|
+
:indel=>true,
|
20
|
+
:trim=>false},
|
21
|
+
{:region=>"PR",
|
22
|
+
:cdna=>
|
23
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
24
|
+
:forward=>
|
25
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
26
|
+
:majority=>0.5,
|
27
|
+
:end_join=>true,
|
28
|
+
:end_join_option=>3,
|
29
|
+
:TCS_QC=>true,
|
30
|
+
:ref_genome=>"HXB2",
|
31
|
+
:ref_start=>0,
|
32
|
+
:ref_end=>2591,
|
33
|
+
:indel=>true,
|
34
|
+
:trim=>true,
|
35
|
+
:trim_ref=>"HXB2",
|
36
|
+
:trim_ref_start=>2253,
|
37
|
+
:trim_ref_end=>2549},
|
38
|
+
{:region=>"IN",
|
39
|
+
:cdna=>
|
40
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
41
|
+
:forward=>"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
42
|
+
:majority=>0.5,
|
43
|
+
:end_join=>true,
|
44
|
+
:end_join_option=>3,
|
45
|
+
:overlap=>171,
|
46
|
+
:TCS_QC=>true,
|
47
|
+
:ref_genome=>"HXB2",
|
48
|
+
:ref_start=>4384,
|
49
|
+
:ref_end=>4751,
|
50
|
+
:indel=>false,
|
51
|
+
:trim=>false},
|
52
|
+
{:region=>"V1V3",
|
53
|
+
:cdna=>
|
54
|
+
"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
55
|
+
:forward=>
|
56
|
+
"GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
57
|
+
:majority=>0.5,
|
58
|
+
:end_join=>true,
|
59
|
+
:end_join_option=>1,
|
60
|
+
:overlap=>0,
|
61
|
+
:TCS_QC=>true,
|
62
|
+
:ref_genome=>"HXB2",
|
63
|
+
:ref_start=>6585,
|
64
|
+
:ref_end=>7208,
|
65
|
+
:indel=>true,
|
66
|
+
:trim=>false}
|
67
|
+
]
|
68
|
+
}
|
69
|
+
end
|
70
|
+
|
71
|
+
end
|
data/lib/viral_seq/tcs_json.rb
CHANGED
@@ -13,6 +13,22 @@ module ViralSeq
|
|
13
13
|
print '> '
|
14
14
|
param[:raw_sequence_dir] = gets.chomp.rstrip
|
15
15
|
|
16
|
+
puts "Choose MiSeq Platform (1-3):\n1. 150x7x150\n2. 250x7x250\n3. 300x7x300 (default)"
|
17
|
+
print "> "
|
18
|
+
pf_option = gets.chomp.rstrip
|
19
|
+
# while ![1,2,3].include?(pf_option.to_i)
|
20
|
+
# print "Entered MiSeq Platform #{pf_option.red.bold} not valid (choose 1-3), try again\n> "
|
21
|
+
# pf_option = gets.chomp.rstrip
|
22
|
+
# end
|
23
|
+
case pf_option.to_i
|
24
|
+
when 1
|
25
|
+
param[:platform_format] = 150
|
26
|
+
when 2
|
27
|
+
param[:platform_format] = 250
|
28
|
+
else
|
29
|
+
param[:platform_format] = 300
|
30
|
+
end
|
31
|
+
|
16
32
|
puts 'Enter the estimated platform error rate (for TCS cut-off calculation), default as ' + '0.02'.red.bold
|
17
33
|
print '> '
|
18
34
|
input_error = gets.chomp.rstrip.to_f
|
@@ -52,12 +68,12 @@ module ViralSeq
|
|
52
68
|
if ej =~ /y|yes/i
|
53
69
|
data[:end_join] = true
|
54
70
|
|
55
|
-
|
56
|
-
1: simple join, no overlap
|
57
|
-
2: known overlap
|
58
|
-
3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap
|
59
|
-
4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap
|
60
|
-
> "
|
71
|
+
puts "End-join option? Choose from (1-4):"
|
72
|
+
puts "1: simple join, no overlap"
|
73
|
+
puts "2: known overlap"
|
74
|
+
puts "3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap"
|
75
|
+
puts "4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap"
|
76
|
+
print "> "
|
61
77
|
ej_option = gets.chomp.rstrip
|
62
78
|
while ![1,2,3,4].include?(ej_option.to_i)
|
63
79
|
puts "Entered end-join option #{ej_option.red.bold} not valid (choose 1-4), try again"
|
@@ -138,7 +154,12 @@ module ViralSeq
|
|
138
154
|
if save_option =~ /y|yes/i
|
139
155
|
print "Path to save JSON file:\n> "
|
140
156
|
path = gets.chomp.rstrip
|
141
|
-
|
157
|
+
while !validate_path_name(path)
|
158
|
+
print "Entered path not valid, try again.\n".red.bold
|
159
|
+
print "Path to save JSON file:\n> "
|
160
|
+
path = gets.chomp.rstrip
|
161
|
+
end
|
162
|
+
File.open(validate_path_name(path), 'w') {|f| f.puts JSON.pretty_generate(param)}
|
142
163
|
end
|
143
164
|
|
144
165
|
print "\nDo you wish to execute tcs pipeline with the input params now? Y/N \n> "
|
@@ -147,7 +168,7 @@ module ViralSeq
|
|
147
168
|
if rsp =~ /y/i
|
148
169
|
return param
|
149
170
|
else
|
150
|
-
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`"
|
171
|
+
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`".blue
|
151
172
|
end
|
152
173
|
|
153
174
|
end
|
@@ -172,7 +193,17 @@ module ViralSeq
|
|
172
193
|
when 3
|
173
194
|
:MAC239
|
174
195
|
end
|
175
|
-
end
|
176
|
-
|
196
|
+
end # end of get_ref
|
197
|
+
|
198
|
+
def validate_path_name(path)
|
199
|
+
if path.empty?
|
200
|
+
return false
|
201
|
+
elsif File.directory? path
|
202
|
+
return File.join(path, 'params.json')
|
203
|
+
elsif File.directory?(File.dirname(path))
|
204
|
+
return path
|
205
|
+
end
|
206
|
+
end # end of validate_path_name
|
207
|
+
end # end of class << self
|
177
208
|
end # end TcsJson
|
178
209
|
end # end main module
|
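Reviewer note on `validate_path_name`: the method returns `false` for an empty string, a resolved file path when the target is usable, and (implicitly) `nil` when the parent directory does not exist. Both `false` and `nil` keep the `while !validate_path_name(path)` loop prompting. The standalone copy below exercises each branch against a temporary directory; the file name `my.json` and the `missing` folder are arbitrary examples.

```ruby
require 'tmpdir'

# Standalone copy of the method as it appears in the diff.
def validate_path_name(path)
  if path.empty?
    return false
  elsif File.directory? path
    return File.join(path, 'params.json')   # directory given: use default name
  elsif File.directory?(File.dirname(path))
    return path                             # parent exists: accept as a file path
  end
  # falls through to nil when the parent directory is missing
end

Dir.mktmpdir do |dir|
  p validate_path_name("")
  p validate_path_name(dir)
  p validate_path_name(File.join(dir, "my.json"))
  p validate_path_name(File.join(dir, "missing", "my.json"))
end
```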
data/lib/viral_seq/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: viral_seq
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.1.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Shuntai Zhou
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2021-
|
12
|
+
date: 2021-04-01 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bundler
|
@@ -90,6 +90,7 @@ email:
|
|
90
90
|
executables:
|
91
91
|
- locator
|
92
92
|
- tcs
|
93
|
+
- tcs_log
|
93
94
|
extensions: []
|
94
95
|
extra_rdoc_files: []
|
95
96
|
files:
|
@@ -104,6 +105,11 @@ files:
|
|
104
105
|
- Rakefile
|
105
106
|
- bin/locator
|
106
107
|
- bin/tcs
|
108
|
+
- bin/tcs_log
|
109
|
+
- docs/assets/img/cover.jpg
|
110
|
+
- docs/dr.json
|
111
|
+
- docs/sample_miseq_data/hivdr_control/r1.fastq.gz
|
112
|
+
- docs/sample_miseq_data/hivdr_control/r2.fastq.gz
|
107
113
|
- lib/viral_seq.rb
|
108
114
|
- lib/viral_seq/constant.rb
|
109
115
|
- lib/viral_seq/enumerable.rb
|
@@ -120,6 +126,7 @@ files:
|
|
120
126
|
- lib/viral_seq/sequence.rb
|
121
127
|
- lib/viral_seq/string.rb
|
122
128
|
- lib/viral_seq/tcs_core.rb
|
129
|
+
- lib/viral_seq/tcs_dr.rb
|
123
130
|
- lib/viral_seq/tcs_json.rb
|
124
131
|
- lib/viral_seq/version.rb
|
125
132
|
- viral_seq.gemspec
|