viral_seq 1.0.10 → 1.1.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +4 -4
- data/.gitignore +0 -1
- data/Gemfile.lock +1 -1
- data/README.md +76 -9
- data/bin/tcs +20 -7
- data/bin/tcs_log +83 -0
- data/doc/dr.json +68 -0
- data/lib/viral_seq/constant.rb +5 -1
- data/lib/viral_seq/enumerable.rb +0 -10
- data/lib/viral_seq/hivdr.rb +1 -1
- data/lib/viral_seq/math.rb +3 -3
- data/lib/viral_seq/sdrm.rb +43 -0
- data/lib/viral_seq/seq_hash.rb +3 -3
- data/lib/viral_seq/seq_hash_pair.rb +6 -4
- data/lib/viral_seq/tcs_core.rb +37 -6
- data/lib/viral_seq/tcs_json.rb +41 -10
- data/lib/viral_seq/version.rb +2 -2
- metadata +7 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: ea453e452e6832e942512cdb94462c33af89ffd8295017806c9aa6ff7ec77ad4
|
4
|
+
data.tar.gz: 2bb89d193e0e84ebe0791882c53e226a0a934ea3b9d1e61f87b8ffff6c22af1b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9dc0403ecaea119d3aa3e832305a0bd4f038fdb71789dcd036080fa89b0e454ee79001b6042df171364e4207a93b2d4d5747336b2fb7f8fb7d83103f5d641134
|
7
|
+
data.tar.gz: 510ccfce7d717b56d55e2477ae01124009d1f53f010635759cf2f69afe0132313e08db9abaae1ec6d8d894961beba1c2d70a637eafa9b57b05f0aac3372cd0ca
|
data/.gitignore
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -2,7 +2,16 @@
|
|
2
2
|
|
3
3
|
A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|
4
4
|
|
5
|
-
Specifically for Primer
|
5
|
+
Specifically for Primer ID sequencing and HIV drug resistance analysis.
|
6
|
+
|
7
|
+
## Illustration for the Primer ID Sequencing
|
8
|
+
|
9
|
+
|
10
|
+
![Primer ID Sequencing](https://storage.googleapis.com/tcs-dr-public/pid.png)
|
11
|
+
|
12
|
+
### Reference readings on the Primer ID sequencing
|
13
|
+
[Primer ID JID paper](https://doi.org/10.21769/BioProtoc.3938)
|
14
|
+
[Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
|
6
15
|
|
7
16
|
## Install
|
8
17
|
|
@@ -14,19 +23,45 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
|
|
14
23
|
|
15
24
|
### Excutables
|
16
25
|
|
17
|
-
|
26
|
+
### `tcs`
|
27
|
+
Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
|
18
28
|
|
29
|
+
Example commands:
|
19
30
|
```bash
|
20
|
-
$
|
31
|
+
$ tcs -p params.json # run TCS pipeline with params.json
|
32
|
+
$ tcs -j # CLI to generate params.json
|
33
|
+
$ tcs -h # print out the help
|
21
34
|
```
|
35
|
+
---
|
36
|
+
### `tcs_log`
|
22
37
|
|
23
|
-
Use
|
38
|
+
Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
|
24
39
|
|
40
|
+
|
41
|
+
Example file structure:
|
42
|
+
```
|
43
|
+
batch_tcs_jobs/
|
44
|
+
├── lib1
|
45
|
+
├── lib2
|
46
|
+
├── lib3
|
47
|
+
├── lib4
|
48
|
+
├── ...
|
49
|
+
```
|
50
|
+
|
51
|
+
Example command:
|
25
52
|
```bash
|
26
|
-
$
|
27
|
-
|
28
|
-
|
53
|
+
$ tcs_log batch_tcs_jobs
|
54
|
+
```
|
55
|
+
|
56
|
+
---
|
57
|
+
|
58
|
+
### `locator`
|
59
|
+
Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
|
60
|
+
|
61
|
+
```bash
|
62
|
+
$ locator -i sequence.fasta -o sequence.fasta.csv
|
29
63
|
```
|
64
|
+
---
|
30
65
|
|
31
66
|
## Some Examples
|
32
67
|
|
@@ -78,17 +113,49 @@ Examine for drug resistance mutations for HIV PR region
|
|
78
113
|
```ruby
|
79
114
|
qc_seqhash.sdrm_hiv_pr(cut_off)
|
80
115
|
```
|
116
|
+
## Known issues
|
117
|
+
|
118
|
+
1. ~~have a conflict with rails.~~
|
119
|
+
2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
|
120
|
+
3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
|
81
121
|
|
82
122
|
## Updates
|
83
123
|
|
84
|
-
### Version 1.1.0-
|
124
|
+
### Version 1.1.0-03252021
|
125
|
+
|
126
|
+
1. Optimized the algorithm of end-join.
|
127
|
+
2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
|
128
|
+
3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
|
129
|
+
4. Added the preset of MPID-HIVDR params file ***dr.json*** in /doc.
|
130
|
+
5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
|
131
|
+
Users can choose from 3 MiSeq platforms for processing their sequencing data.
|
132
|
+
MiSeq 300x7x300 is the default option.
|
133
|
+
|
134
|
+
### Version 1.0.14-03052021
|
135
|
+
|
136
|
+
1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
|
137
|
+
|
138
|
+
### Version 1.0.13-03032021
|
139
|
+
|
140
|
+
1. Fixed the conflict with rails.
|
141
|
+
|
142
|
+
### Version 1.0.12-03032021
|
143
|
+
|
144
|
+
1. Fixed an issue that may cause conflicts with ActiveRecord.
|
145
|
+
|
146
|
+
### Version 1.0.11-03022021
|
147
|
+
|
148
|
+
1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
|
149
|
+
2. fixed an issue loading class 'OptionParser'in some ruby environments.
|
150
|
+
|
151
|
+
### Version 1.0.10-11112020:
|
85
152
|
|
86
153
|
1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
|
87
154
|
2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
|
88
155
|
3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
|
89
156
|
4. a few optimizations.
|
90
157
|
5. TCS 2.1.0 delivered.
|
91
|
-
6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
|
158
|
+
6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
|
92
159
|
|
93
160
|
### Version 1.0.9-07182020:
|
94
161
|
|
data/bin/tcs
CHANGED
@@ -23,12 +23,12 @@
|
|
23
23
|
# THE SOFTWARE.
|
24
24
|
|
25
25
|
# Use JSON file as the run param
|
26
|
-
# run
|
26
|
+
# run `tcs -j` to generate param json file.
|
27
27
|
|
28
28
|
require 'viral_seq'
|
29
29
|
require 'json'
|
30
30
|
require 'colorize'
|
31
|
-
require '
|
31
|
+
require 'optparse'
|
32
32
|
|
33
33
|
options = {}
|
34
34
|
|
@@ -115,6 +115,12 @@ else
|
|
115
115
|
error_rate = 0.02
|
116
116
|
end
|
117
117
|
|
118
|
+
if params[:platform_format]
|
119
|
+
$platform_sequencing_length = params[:platform_format]
|
120
|
+
else
|
121
|
+
$platform_sequencing_length = 300
|
122
|
+
end
|
123
|
+
|
118
124
|
primers = params[:primer_pairs]
|
119
125
|
if primers.empty?
|
120
126
|
ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
|
@@ -273,7 +279,6 @@ primers.each do |primer|
|
|
273
279
|
r1_sub_seq << bio_r1[seq_name]
|
274
280
|
r2_sub_seq << bio_r2[seq_name]
|
275
281
|
end
|
276
|
-
|
277
282
|
#consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
|
278
283
|
consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
|
279
284
|
r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
|
@@ -317,8 +322,12 @@ primers.each do |primer|
|
|
317
322
|
f1 = File.open(outfile_r1, 'w')
|
318
323
|
f2 = File.open(outfile_r2, 'w')
|
319
324
|
primer_id_in_use = {}
|
320
|
-
|
321
|
-
|
325
|
+
if n_con > 0
|
326
|
+
r1_seq_length = consensus_filtered.values[0][0].size
|
327
|
+
r2_seq_length = consensus_filtered.values[0][1].size
|
328
|
+
else
|
329
|
+
next
|
330
|
+
end
|
322
331
|
log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
|
323
332
|
log.puts Time.now.to_s + "\t" + "R1 sequence #{r2_seq_length} bp"
|
324
333
|
consensus_filtered.each do |seq_name,seq|
|
@@ -360,6 +369,7 @@ primers.each do |primer|
|
|
360
369
|
shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
|
361
370
|
joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
|
362
371
|
log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
|
372
|
+
|
363
373
|
summary_json[:combined_tcs] = joined_sh.size
|
364
374
|
|
365
375
|
if export_raw
|
@@ -429,12 +439,15 @@ primers.each do |primer|
|
|
429
439
|
trim_end = primer[:trim_ref_end]
|
430
440
|
trim_ref = primer[:trim_ref].to_sym
|
431
441
|
joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
|
432
|
-
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
433
442
|
if export_raw
|
434
443
|
joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
|
435
|
-
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
436
444
|
end
|
437
445
|
end
|
446
|
+
|
447
|
+
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
448
|
+
if export_raw
|
449
|
+
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
450
|
+
end
|
438
451
|
end
|
439
452
|
|
440
453
|
File.open(outfile_log, "w") do |f|
|
data/bin/tcs_log
ADDED
@@ -0,0 +1,83 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
# pool run logs from one batch of tcs jobs
|
4
|
+
# file structure:
|
5
|
+
# batch_tcs_jobs/
|
6
|
+
# ├── lib1
|
7
|
+
# ├── lib2
|
8
|
+
# ├── lib3
|
9
|
+
# ├── lib4
|
10
|
+
# ├── ...
|
11
|
+
#
|
12
|
+
# command example:
|
13
|
+
# $ tcs_log batch_tcs_jobs
|
14
|
+
|
15
|
+
require 'viral_seq'
|
16
|
+
require 'pathname'
|
17
|
+
require 'json'
|
18
|
+
require 'fileutils'
|
19
|
+
|
20
|
+
indir = ARGV[0].chomp
|
21
|
+
indir_basename = File.basename(indir)
|
22
|
+
indir_dirname = File.dirname(indir)
|
23
|
+
|
24
|
+
tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
|
25
|
+
Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
|
26
|
+
|
27
|
+
libs = []
|
28
|
+
Dir.chdir(indir) {libs = Dir.glob("*")}
|
29
|
+
|
30
|
+
outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
|
31
|
+
outdir3 = File.join(tcs_dir, "TCS_per_region")
|
32
|
+
outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
|
33
|
+
|
34
|
+
Dir.mkdir(outdir2) unless File.directory?(outdir2)
|
35
|
+
Dir.mkdir(outdir3) unless File.directory?(outdir3)
|
36
|
+
Dir.mkdir(outdir4) unless File.directory?(outdir4)
|
37
|
+
|
38
|
+
log_file = File.join(tcs_dir,"log.csv")
|
39
|
+
log = File.open(log_file,'w')
|
40
|
+
log.puts "lib name,Region,Raw Sequences per barcode,R1 Raw,R2 Raw,Paired Raw,Cutoff,PID Length,Consensus1,Consensus2,Distinct to Raw,Resampling index,Combined TCS,Combined TCS after QC"
|
41
|
+
|
42
|
+
libs.each do |lib|
|
43
|
+
Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
|
44
|
+
fasta_files = []
|
45
|
+
json_files = []
|
46
|
+
Dir.chdir(File.join(indir, lib)) do
|
47
|
+
fasta_files = Dir.glob("**/*.fasta")
|
48
|
+
json_files = Dir.glob("**/log.json")
|
49
|
+
end
|
50
|
+
fasta_files.each do |f|
|
51
|
+
path_array = Pathname(f).each_filename.to_a
|
52
|
+
region = path_array[0]
|
53
|
+
if path_array[-1] == "combined.fasta"
|
54
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
|
55
|
+
Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
|
56
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
|
57
|
+
else
|
58
|
+
Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
|
59
|
+
Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
|
60
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
json_files.each do |f|
|
65
|
+
json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
|
66
|
+
log.print [lib,
|
67
|
+
json_log[:primer_set_name],
|
68
|
+
json_log[:total_raw_sequence],
|
69
|
+
json_log[:r1_filtered_raw],
|
70
|
+
json_log[:r2_filtered_raw],
|
71
|
+
json_log[:paired_raw_sequence],
|
72
|
+
json_log[:consensus_cutoff],
|
73
|
+
json_log[:length_of_pid],
|
74
|
+
json_log[:total_tcs_with_ambiguities],
|
75
|
+
json_log[:total_tcs],
|
76
|
+
json_log[:distinct_to_raw],
|
77
|
+
json_log[:resampling_param],
|
78
|
+
json_log[:combined_tcs],
|
79
|
+
json_log[:combined_tcs_after_qc],
|
80
|
+
].join(',') + "\n"
|
81
|
+
end
|
82
|
+
end
|
83
|
+
log.close
|
data/doc/dr.json
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
{
|
2
|
+
"raw_sequence_dir": "MyExampleDir",
|
3
|
+
"platform_error_rate": 0.02,
|
4
|
+
"primer_pairs": [
|
5
|
+
{
|
6
|
+
"region": "RT",
|
7
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
8
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
9
|
+
"majority": 0.5,
|
10
|
+
"end_join": true,
|
11
|
+
"end_join_option": 1,
|
12
|
+
"overlap": 0,
|
13
|
+
"TCS_QC": true,
|
14
|
+
"ref_genome": "HXB2",
|
15
|
+
"ref_start": 2648,
|
16
|
+
"ref_end": 3257,
|
17
|
+
"indel": true,
|
18
|
+
"trim": false
|
19
|
+
},
|
20
|
+
{
|
21
|
+
"region": "PR",
|
22
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
23
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
24
|
+
"majority": 0.5,
|
25
|
+
"end_join": true,
|
26
|
+
"end_join_option": 3,
|
27
|
+
"TCS_QC": true,
|
28
|
+
"ref_genome": "HXB2",
|
29
|
+
"ref_start": 0,
|
30
|
+
"ref_end": 2591,
|
31
|
+
"indel": true,
|
32
|
+
"trim": true,
|
33
|
+
"trim_ref": "HXB2",
|
34
|
+
"trim_ref_start": 2253,
|
35
|
+
"trim_ref_end": 2549
|
36
|
+
},
|
37
|
+
{
|
38
|
+
"region": "IN",
|
39
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
40
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
41
|
+
"majority": 0.5,
|
42
|
+
"end_join": true,
|
43
|
+
"end_join_option": 3,
|
44
|
+
"overlap": 171,
|
45
|
+
"TCS_QC": true,
|
46
|
+
"ref_genome": "HXB2",
|
47
|
+
"ref_start": 4384,
|
48
|
+
"ref_end": 4751,
|
49
|
+
"indel": false,
|
50
|
+
"trim": false
|
51
|
+
},
|
52
|
+
{
|
53
|
+
"region": "V1V3",
|
54
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
55
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
56
|
+
"majority": 0.5,
|
57
|
+
"end_join": true,
|
58
|
+
"end_join_option": 1,
|
59
|
+
"overlap": 0,
|
60
|
+
"TCS_QC": true,
|
61
|
+
"ref_genome": "HXB2",
|
62
|
+
"ref_start": 6585,
|
63
|
+
"ref_end": 7208,
|
64
|
+
"indel": true,
|
65
|
+
"trim": false
|
66
|
+
}
|
67
|
+
]
|
68
|
+
}
|
data/lib/viral_seq/constant.rb
CHANGED
@@ -1,7 +1,11 @@
|
|
1
1
|
module ViralSeq
|
2
|
-
|
2
|
+
|
3
3
|
# array for all amino acid one letter abbreviations
|
4
4
|
|
5
5
|
AMINO_ACID_LIST = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y", "*"]
|
6
6
|
|
7
|
+
SDRM_HIV_PR_LIST = {}
|
8
|
+
SDRM_HIV_RT_LIST = {}
|
9
|
+
SDRM_HIV_IN_LIST = {}
|
10
|
+
|
7
11
|
end
|
data/lib/viral_seq/enumerable.rb
CHANGED
@@ -3,10 +3,6 @@
|
|
3
3
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
4
4
|
# array.median
|
5
5
|
# => 5.5
|
6
|
-
# @example sum
|
7
|
-
# array = [1,2,3,4,5,6,7,8,9,10]
|
8
|
-
# array.sum
|
9
|
-
# => 55
|
10
6
|
# @example average number (mean)
|
11
7
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
12
8
|
# array.mean
|
@@ -45,12 +41,6 @@ module Enumerable
|
|
45
41
|
len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
|
46
42
|
end
|
47
43
|
|
48
|
-
# generate summed value
|
49
|
-
# @return [Numeric] summed value
|
50
|
-
def sum
|
51
|
-
self.inject(0){|accum, i| accum + i }
|
52
|
-
end
|
53
|
-
|
54
44
|
# generate mean number
|
55
45
|
# @return [Float] mean value
|
56
46
|
def mean
|
data/lib/viral_seq/hivdr.rb
CHANGED
data/lib/viral_seq/math.rb
CHANGED
@@ -67,7 +67,7 @@ module ViralSeq
|
|
67
67
|
@k = k
|
68
68
|
@poisson_hash = {}
|
69
69
|
(0..k).each do |n|
|
70
|
-
p = (rate**n * ::Math::E**(-rate))
|
70
|
+
p = (rate**n * ::Math::E**(-rate))/n.factorial
|
71
71
|
@poisson_hash[n] = p
|
72
72
|
end
|
73
73
|
end
|
@@ -155,9 +155,9 @@ class Integer
|
|
155
155
|
# factorial method for an Integer
|
156
156
|
# @return [Integer] factorial for given Integer
|
157
157
|
# @example factorial for 5
|
158
|
-
#
|
158
|
+
# 5.factorial
|
159
159
|
# => 120
|
160
|
-
def
|
160
|
+
def factorial
|
161
161
|
if self == 0
|
162
162
|
return 1
|
163
163
|
else
|
@@ -0,0 +1,43 @@
|
|
1
|
+
module ViralSeq
|
2
|
+
class DRMs
|
3
|
+
def initialize (mutation_list = {})
|
4
|
+
@mutation_list = mutation_list
|
5
|
+
end
|
6
|
+
|
7
|
+
attr_accessor :mutation_list
|
8
|
+
end
|
9
|
+
|
10
|
+
def self.sdrm_hiv_pr(seq_hash)
|
11
|
+
end
|
12
|
+
|
13
|
+
def self.sdrm_hiv_rt(seq_hash)
|
14
|
+
end
|
15
|
+
|
16
|
+
def self.sdrm_hiv_in(seq_hash)
|
17
|
+
end
|
18
|
+
|
19
|
+
def self.list_from_json(file)
|
20
|
+
end
|
21
|
+
|
22
|
+
def self.list_from_csv(file)
|
23
|
+
end
|
24
|
+
|
25
|
+
def self.export_list_hiv_pr(file, format = :json)
|
26
|
+
if foramt == :json
|
27
|
+
|
28
|
+
end
|
29
|
+
end
|
30
|
+
|
31
|
+
def self.export_list_hiv_rt(file, format = :json)
|
32
|
+
|
33
|
+
end
|
34
|
+
|
35
|
+
def self.export_list_hiv_in(file, format = :json)
|
36
|
+
|
37
|
+
end
|
38
|
+
|
39
|
+
def drm_analysis(seq_hash)
|
40
|
+
mutation_list = self.mutation_list
|
41
|
+
|
42
|
+
end
|
43
|
+
end
|
data/lib/viral_seq/seq_hash.rb
CHANGED
@@ -394,7 +394,6 @@ module ViralSeq
|
|
394
394
|
end
|
395
395
|
end
|
396
396
|
end
|
397
|
-
|
398
397
|
consensus_seq += call_consensus_base(max_base_list)
|
399
398
|
end
|
400
399
|
return consensus_seq
|
@@ -549,7 +548,7 @@ module ViralSeq
|
|
549
548
|
if sequences.size == 0
|
550
549
|
return 0
|
551
550
|
else
|
552
|
-
cut_off =
|
551
|
+
cut_off = Float::INFINITY
|
553
552
|
l = sequences[0].size
|
554
553
|
rate = sequences.size * error_rate
|
555
554
|
count_mut = variant_for_poisson(sequences)
|
@@ -558,7 +557,7 @@ module ViralSeq
|
|
558
557
|
|
559
558
|
poisson_hash.each do |k,v|
|
560
559
|
cal = l * v
|
561
|
-
obs = count_mut[k] ? count_mut[k] :
|
560
|
+
obs = count_mut[k] ? count_mut[k] : 1
|
562
561
|
if obs >= fold_cutoff * cal
|
563
562
|
cut_off = k
|
564
563
|
break
|
@@ -742,6 +741,7 @@ module ViralSeq
|
|
742
741
|
seq_hash_unique_pass = []
|
743
742
|
|
744
743
|
seq_hash_unique.each do |seq|
|
744
|
+
next if seq.nil?
|
745
745
|
loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
|
746
746
|
next unless loc # if locator tool fails, skip this seq.
|
747
747
|
if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
|
@@ -110,19 +110,21 @@ module ViralSeq
|
|
110
110
|
raise ArgumentError.new(":overlap has to be Integer, input #{overlap} invalid.") unless overlap.is_a? Integer
|
111
111
|
raise ArgumentError.new(":diff has to be float or integer, input #{diff} invalid.") unless (diff.is_a? Integer or diff.is_a? Float)
|
112
112
|
joined_seq = {}
|
113
|
-
seq_pair_hash.
|
113
|
+
seq_pair_hash.each do |seq_name,seq_pair|
|
114
114
|
r1_seq = seq_pair[0]
|
115
115
|
r2_seq = seq_pair[1]
|
116
116
|
if overlap.zero?
|
117
117
|
joined_sequence = r1_seq + r2_seq
|
118
|
+
elsif diff.zero?
|
119
|
+
if r1_seq[-overlap..-1] == r2_seq[0,overlap]
|
120
|
+
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
121
|
+
end
|
118
122
|
elsif r1_seq[-overlap..-1].compare_with(r2_seq[0,overlap]) <= (overlap * diff)
|
119
123
|
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
120
124
|
else
|
121
125
|
next
|
122
126
|
end
|
123
|
-
|
124
|
-
joined_seq[seq_name] = joined_sequence
|
125
|
-
end
|
127
|
+
joined_seq[seq_name] = joined_sequence if joined_sequence
|
126
128
|
end
|
127
129
|
|
128
130
|
joined_seq_hash = ViralSeq::SeqHash.new
|
data/lib/viral_seq/tcs_core.rb
CHANGED
@@ -102,16 +102,18 @@ module ViralSeq
|
|
102
102
|
end
|
103
103
|
|
104
104
|
# sort array of file names to determine if there is potential errors
|
105
|
-
#
|
106
|
-
#
|
105
|
+
# @param name_array [Array] array of file names
|
106
|
+
# @return [hash] name check results
|
107
107
|
|
108
108
|
def validate_file_name(name_array)
|
109
|
-
errors = {
|
109
|
+
errors = {
|
110
|
+
file_type_error: [] ,
|
110
111
|
missing_r1_file: [] ,
|
111
112
|
missing_r2_file: [] ,
|
112
113
|
extra_r1_r2_file: [],
|
113
114
|
no_region_tag: [] ,
|
114
|
-
multiple_region_tag: []
|
115
|
+
multiple_region_tag: []
|
116
|
+
}
|
115
117
|
|
116
118
|
passed_libs = {}
|
117
119
|
|
@@ -163,6 +165,13 @@ module ViralSeq
|
|
163
165
|
end
|
164
166
|
end
|
165
167
|
|
168
|
+
file_name_with_lib_name = {}
|
169
|
+
passed_libs.each do |lib_name, files|
|
170
|
+
files.each do |f|
|
171
|
+
file_name_with_lib_name[f] = lib_name
|
172
|
+
end
|
173
|
+
end
|
174
|
+
|
166
175
|
passed_names = []
|
167
176
|
|
168
177
|
passed_libs.values.each { |names| passed_names += names}
|
@@ -173,7 +182,27 @@ module ViralSeq
|
|
173
182
|
pass = true
|
174
183
|
end
|
175
184
|
|
176
|
-
|
185
|
+
file_name_with_error_type = {}
|
186
|
+
|
187
|
+
errors.each do |type, files|
|
188
|
+
files.each do |f|
|
189
|
+
file_name_with_error_type[f] ||= []
|
190
|
+
file_name_with_error_type[f] << type.to_s.tr("_", "\s")
|
191
|
+
end
|
192
|
+
end
|
193
|
+
|
194
|
+
file_check = []
|
195
|
+
|
196
|
+
name_array.each do |name|
|
197
|
+
file_check_hash = {}
|
198
|
+
file_check_hash[:fileName] = name
|
199
|
+
file_check_hash[:errors] = file_name_with_error_type[name]
|
200
|
+
file_check_hash[:libName] = file_name_with_lib_name[name]
|
201
|
+
|
202
|
+
file_check << file_check_hash
|
203
|
+
end
|
204
|
+
|
205
|
+
return { allPass: pass, files: file_check }
|
177
206
|
end
|
178
207
|
|
179
208
|
# filter r1 raw sequences for non-specific primers.
|
@@ -276,7 +305,9 @@ module ViralSeq
|
|
276
305
|
end
|
277
306
|
|
278
307
|
def general_filter(seq)
|
279
|
-
if seq
|
308
|
+
if seq.size < $platform_sequencing_length
|
309
|
+
return false
|
310
|
+
elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
|
280
311
|
return false
|
281
312
|
elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
|
282
313
|
return false
|
data/lib/viral_seq/tcs_json.rb
CHANGED
@@ -13,6 +13,22 @@ module ViralSeq
|
|
13
13
|
print '> '
|
14
14
|
param[:raw_sequence_dir] = gets.chomp.rstrip
|
15
15
|
|
16
|
+
puts "Choose MiSeq Platform (1-3):\n1. 150x7x150\n2. 250x7x250\n3. 300x7x300 (default)"
|
17
|
+
print "> "
|
18
|
+
pf_option = gets.chomp.rstrip
|
19
|
+
# while ![1,2,3].include?(pf_option.to_i)
|
20
|
+
# print "Entered MiSeq Platform #{pf_option.red.bold} not valid (choose 1-3), try again\n> "
|
21
|
+
# pf_option = gets.chomp.rstrip
|
22
|
+
# end
|
23
|
+
case pf_option.to_i
|
24
|
+
when 1
|
25
|
+
param[:platform_format] = 150
|
26
|
+
when 2
|
27
|
+
param[:platform_format] = 250
|
28
|
+
else
|
29
|
+
param[:platform_format] = 300
|
30
|
+
end
|
31
|
+
|
16
32
|
puts 'Enter the estimated platform error rate (for TCS cut-off calculation), default as ' + '0.02'.red.bold
|
17
33
|
print '> '
|
18
34
|
input_error = gets.chomp.rstrip.to_f
|
@@ -52,12 +68,12 @@ module ViralSeq
|
|
52
68
|
if ej =~ /y|yes/i
|
53
69
|
data[:end_join] = true
|
54
70
|
|
55
|
-
|
56
|
-
1: simple join, no overlap
|
57
|
-
2: known overlap
|
58
|
-
3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap
|
59
|
-
4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap
|
60
|
-
> "
|
71
|
+
puts "End-join option? Choose from (1-4):"
|
72
|
+
puts "1: simple join, no overlap"
|
73
|
+
puts "2: known overlap"
|
74
|
+
puts "3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap"
|
75
|
+
puts "4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap"
|
76
|
+
print "> "
|
61
77
|
ej_option = gets.chomp.rstrip
|
62
78
|
while ![1,2,3,4].include?(ej_option.to_i)
|
63
79
|
puts "Entered end-join option #{ej_option.red.bold} not valid (choose 1-4), try again"
|
@@ -138,7 +154,12 @@ module ViralSeq
|
|
138
154
|
if save_option =~ /y|yes/i
|
139
155
|
print "Path to save JSON file:\n> "
|
140
156
|
path = gets.chomp.rstrip
|
141
|
-
|
157
|
+
while !validate_path_name(path)
|
158
|
+
print "Entered path no valid, try again.\n".red.bold
|
159
|
+
print "Path to save JSON file:\n> "
|
160
|
+
path = gets.chomp.rstrip
|
161
|
+
end
|
162
|
+
File.open(validate_path_name(path), 'w') {|f| f.puts JSON.pretty_generate(param)}
|
142
163
|
end
|
143
164
|
|
144
165
|
print "\nDo you wish to execute tcs pipeline with the input params now? Y/N \n> "
|
@@ -147,7 +168,7 @@ module ViralSeq
|
|
147
168
|
if rsp =~ /y/i
|
148
169
|
return param
|
149
170
|
else
|
150
|
-
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`"
|
171
|
+
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`".blue
|
151
172
|
end
|
152
173
|
|
153
174
|
end
|
@@ -172,7 +193,17 @@ module ViralSeq
|
|
172
193
|
when 3
|
173
194
|
:MAC239
|
174
195
|
end
|
175
|
-
end
|
176
|
-
|
196
|
+
end # end of get_ref
|
197
|
+
|
198
|
+
def validate_path_name(path)
|
199
|
+
if path.empty?
|
200
|
+
return false
|
201
|
+
elsif File.directory? path
|
202
|
+
return File.join(path, 'params.json')
|
203
|
+
elsif File.directory?(File.dirname(path))
|
204
|
+
return path
|
205
|
+
end
|
206
|
+
end # end of validate_path_name
|
207
|
+
end # end of class << self
|
177
208
|
end # end TcsJson
|
178
209
|
end # end main module
|
data/lib/viral_seq/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: viral_seq
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Shuntai Zhou
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2021-03-26 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bundler
|
@@ -90,6 +90,7 @@ email:
|
|
90
90
|
executables:
|
91
91
|
- locator
|
92
92
|
- tcs
|
93
|
+
- tcs_log
|
93
94
|
extensions: []
|
94
95
|
extra_rdoc_files: []
|
95
96
|
files:
|
@@ -104,6 +105,8 @@ files:
|
|
104
105
|
- Rakefile
|
105
106
|
- bin/locator
|
106
107
|
- bin/tcs
|
108
|
+
- bin/tcs_log
|
109
|
+
- doc/dr.json
|
107
110
|
- lib/viral_seq.rb
|
108
111
|
- lib/viral_seq/constant.rb
|
109
112
|
- lib/viral_seq/enumerable.rb
|
@@ -114,6 +117,7 @@ files:
|
|
114
117
|
- lib/viral_seq/pid.rb
|
115
118
|
- lib/viral_seq/ref_seq.rb
|
116
119
|
- lib/viral_seq/rubystats.rb
|
120
|
+
- lib/viral_seq/sdrm.rb
|
117
121
|
- lib/viral_seq/seq_hash.rb
|
118
122
|
- lib/viral_seq/seq_hash_pair.rb
|
119
123
|
- lib/viral_seq/sequence.rb
|
@@ -142,7 +146,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
142
146
|
version: '0'
|
143
147
|
requirements:
|
144
148
|
- R required for some functions
|
145
|
-
rubygems_version: 3.
|
149
|
+
rubygems_version: 3.2.2
|
146
150
|
signing_key:
|
147
151
|
specification_version: 4
|
148
152
|
summary: A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|