viral_seq 1.0.10 → 1.1.0
- checksums.yaml +4 -4
- data/.gitignore +0 -1
- data/Gemfile.lock +1 -1
- data/README.md +76 -9
- data/bin/tcs +20 -7
- data/bin/tcs_log +83 -0
- data/doc/dr.json +68 -0
- data/lib/viral_seq/constant.rb +5 -1
- data/lib/viral_seq/enumerable.rb +0 -10
- data/lib/viral_seq/hivdr.rb +1 -1
- data/lib/viral_seq/math.rb +3 -3
- data/lib/viral_seq/sdrm.rb +43 -0
- data/lib/viral_seq/seq_hash.rb +3 -3
- data/lib/viral_seq/seq_hash_pair.rb +6 -4
- data/lib/viral_seq/tcs_core.rb +37 -6
- data/lib/viral_seq/tcs_json.rb +41 -10
- data/lib/viral_seq/version.rb +2 -2
- metadata +7 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: ea453e452e6832e942512cdb94462c33af89ffd8295017806c9aa6ff7ec77ad4
|
4
|
+
data.tar.gz: 2bb89d193e0e84ebe0791882c53e226a0a934ea3b9d1e61f87b8ffff6c22af1b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9dc0403ecaea119d3aa3e832305a0bd4f038fdb71789dcd036080fa89b0e454ee79001b6042df171364e4207a93b2d4d5747336b2fb7f8fb7d83103f5d641134
|
7
|
+
data.tar.gz: 510ccfce7d717b56d55e2477ae01124009d1f53f010635759cf2f69afe0132313e08db9abaae1ec6d8d894961beba1c2d70a637eafa9b57b05f0aac3372cd0ca
|
data/.gitignore
CHANGED
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -2,7 +2,16 @@
|
|
2
2
|
|
3
3
|
A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|
4
4
|
|
5
|
-
Specifically for Primer
|
5
|
+
Specifically for Primer ID sequencing and HIV drug resistance analysis.
|
6
|
+
|
7
|
+
## Illustration for the Primer ID Sequencing
|
8
|
+
|
9
|
+
|
10
|
+

|
11
|
+
|
12
|
+
### Reference readings on the Primer ID sequencing
|
13
|
+
[Primer ID JID paper](https://doi.org/10.21769/BioProtoc.3938)
|
14
|
+
[Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
|
6
15
|
|
7
16
|
## Install
|
8
17
|
|
@@ -14,19 +23,45 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
|
|
14
23
|
|
15
24
|
### Excutables
|
16
25
|
|
17
|
-
|
26
|
+
### `tcs`
|
27
|
+
Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
|
18
28
|
|
29
|
+
Example commands:
|
19
30
|
```bash
|
20
|
-
$
|
31
|
+
$ tcs -p params.json # run TCS pipeline with params.json
|
32
|
+
$ tcs -j # CLI to generate params.json
|
33
|
+
$ tcs -h # print out the help
|
21
34
|
```
|
35
|
+
---
|
36
|
+
### `tcs_log`
|
22
37
|
|
23
|
-
Use
|
38
|
+
Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
|
24
39
|
|
40
|
+
|
41
|
+
Example file structure:
|
42
|
+
```
|
43
|
+
batch_tcs_jobs/
|
44
|
+
├── lib1
|
45
|
+
├── lib2
|
46
|
+
├── lib3
|
47
|
+
├── lib4
|
48
|
+
├── ...
|
49
|
+
```
|
50
|
+
|
51
|
+
Example command:
|
25
52
|
```bash
|
26
|
-
$
|
27
|
-
|
28
|
-
|
53
|
+
$ tcs_log batch_tcs_jobs
|
54
|
+
```
|
55
|
+
|
56
|
+
---
|
57
|
+
|
58
|
+
### `locator`
|
59
|
+
Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
|
60
|
+
|
61
|
+
```bash
|
62
|
+
$ locator -i sequence.fasta -o sequence.fasta.csv
|
29
63
|
```
|
64
|
+
---
|
30
65
|
|
31
66
|
## Some Examples
|
32
67
|
|
@@ -78,17 +113,49 @@ Examine for drug resistance mutations for HIV PR region
|
|
78
113
|
```ruby
|
79
114
|
qc_seqhash.sdrm_hiv_pr(cut_off)
|
80
115
|
```
|
116
|
+
## Known issues
|
117
|
+
|
118
|
+
1. ~~have a conflict with rails.~~
|
119
|
+
2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
|
120
|
+
3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
|
81
121
|
|
82
122
|
## Updates
|
83
123
|
|
84
|
-
### Version 1.1.0-
|
124
|
+
### Version 1.1.0-03252021
|
125
|
+
|
126
|
+
1. Optimized the algorithm of end-join.
|
127
|
+
2. Fixed a bug in the `tcs` pipeline that sometimes combined tcs files are not saved.
|
128
|
+
3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
|
129
|
+
4. Added the preset of MPID-HIVDR params file ***dr.json*** in /doc.
|
130
|
+
5. Add `platform_format` option in the json generator of the `tcs` Pipeline.
|
131
|
+
Users can choose from 3 MiSeq platforms for processing their sequencing data.
|
132
|
+
MiSeq 300x7x300 is the default option.
|
133
|
+
|
134
|
+
### Version 1.0.14-03052021
|
135
|
+
|
136
|
+
1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
|
137
|
+
|
138
|
+
### Version 1.0.13-03032021
|
139
|
+
|
140
|
+
1. Fixed the conflict with rails.
|
141
|
+
|
142
|
+
### Version 1.0.12-03032021
|
143
|
+
|
144
|
+
1. Fixed an issue that may cause conflicts with ActiveRecord.
|
145
|
+
|
146
|
+
### Version 1.0.11-03022021
|
147
|
+
|
148
|
+
1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
|
149
|
+
2. fixed an issue loading class 'OptionParser'in some ruby environments.
|
150
|
+
|
151
|
+
### Version 1.0.10-11112020:
|
85
152
|
|
86
153
|
1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
|
87
154
|
2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
|
88
155
|
3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
|
89
156
|
4. a few optimizations.
|
90
157
|
5. TCS 2.1.0 delivered.
|
91
|
-
6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
|
158
|
+
6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
|
92
159
|
|
93
160
|
### Version 1.0.9-07182020:
|
94
161
|
|
data/bin/tcs
CHANGED
@@ -23,12 +23,12 @@
|
|
23
23
|
# THE SOFTWARE.
|
24
24
|
|
25
25
|
# Use JSON file as the run param
|
26
|
-
# run
|
26
|
+
# run `tcs -j` to generate param json file.
|
27
27
|
|
28
28
|
require 'viral_seq'
|
29
29
|
require 'json'
|
30
30
|
require 'colorize'
|
31
|
-
require '
|
31
|
+
require 'optparse'
|
32
32
|
|
33
33
|
options = {}
|
34
34
|
|
@@ -115,6 +115,12 @@ else
|
|
115
115
|
error_rate = 0.02
|
116
116
|
end
|
117
117
|
|
118
|
+
if params[:platform_format]
|
119
|
+
$platform_sequencing_length = params[:platform_format]
|
120
|
+
else
|
121
|
+
$platform_sequencing_length = 300
|
122
|
+
end
|
123
|
+
|
118
124
|
primers = params[:primer_pairs]
|
119
125
|
if primers.empty?
|
120
126
|
ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
|
@@ -273,7 +279,6 @@ primers.each do |primer|
|
|
273
279
|
r1_sub_seq << bio_r1[seq_name]
|
274
280
|
r2_sub_seq << bio_r2[seq_name]
|
275
281
|
end
|
276
|
-
|
277
282
|
#consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
|
278
283
|
consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
|
279
284
|
r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
|
@@ -317,8 +322,12 @@ primers.each do |primer|
|
|
317
322
|
f1 = File.open(outfile_r1, 'w')
|
318
323
|
f2 = File.open(outfile_r2, 'w')
|
319
324
|
primer_id_in_use = {}
|
320
|
-
|
321
|
-
|
325
|
+
if n_con > 0
|
326
|
+
r1_seq_length = consensus_filtered.values[0][0].size
|
327
|
+
r2_seq_length = consensus_filtered.values[0][1].size
|
328
|
+
else
|
329
|
+
next
|
330
|
+
end
|
322
331
|
log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
|
323
332
|
log.puts Time.now.to_s + "\t" + "R1 sequence #{r2_seq_length} bp"
|
324
333
|
consensus_filtered.each do |seq_name,seq|
|
@@ -360,6 +369,7 @@ primers.each do |primer|
|
|
360
369
|
shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
|
361
370
|
joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
|
362
371
|
log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
|
372
|
+
|
363
373
|
summary_json[:combined_tcs] = joined_sh.size
|
364
374
|
|
365
375
|
if export_raw
|
@@ -429,12 +439,15 @@ primers.each do |primer|
|
|
429
439
|
trim_end = primer[:trim_ref_end]
|
430
440
|
trim_ref = primer[:trim_ref].to_sym
|
431
441
|
joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
|
432
|
-
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
433
442
|
if export_raw
|
434
443
|
joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
|
435
|
-
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
436
444
|
end
|
437
445
|
end
|
446
|
+
|
447
|
+
joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
|
448
|
+
if export_raw
|
449
|
+
joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
|
450
|
+
end
|
438
451
|
end
|
439
452
|
|
440
453
|
File.open(outfile_log, "w") do |f|
|
data/bin/tcs_log
ADDED
@@ -0,0 +1,83 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
# pool run logs from one batch of tcs jobs
|
4
|
+
# file structure:
|
5
|
+
# batch_tcs_jobs/
|
6
|
+
# ├── lib1
|
7
|
+
# ├── lib2
|
8
|
+
# ├── lib3
|
9
|
+
# ├── lib4
|
10
|
+
# ├── ...
|
11
|
+
#
|
12
|
+
# command example:
|
13
|
+
# $ tcs_log batch_tcs_jobs
|
14
|
+
|
15
|
+
require 'viral_seq'
|
16
|
+
require 'pathname'
|
17
|
+
require 'json'
|
18
|
+
require 'fileutils'
|
19
|
+
|
20
|
+
indir = ARGV[0].chomp
|
21
|
+
indir_basename = File.basename(indir)
|
22
|
+
indir_dirname = File.dirname(indir)
|
23
|
+
|
24
|
+
tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
|
25
|
+
Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
|
26
|
+
|
27
|
+
libs = []
|
28
|
+
Dir.chdir(indir) {libs = Dir.glob("*")}
|
29
|
+
|
30
|
+
outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
|
31
|
+
outdir3 = File.join(tcs_dir, "TCS_per_region")
|
32
|
+
outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
|
33
|
+
|
34
|
+
Dir.mkdir(outdir2) unless File.directory?(outdir2)
|
35
|
+
Dir.mkdir(outdir3) unless File.directory?(outdir3)
|
36
|
+
Dir.mkdir(outdir4) unless File.directory?(outdir4)
|
37
|
+
|
38
|
+
log_file = File.join(tcs_dir,"log.csv")
|
39
|
+
log = File.open(log_file,'w')
|
40
|
+
log.puts "lib name,Region,Raw Sequences per barcode,R1 Raw,R2 Raw,Paired Raw,Cutoff,PID Length,Consensus1,Consensus2,Distinct to Raw,Resampling index,Combined TCS,Combined TCS after QC"
|
41
|
+
|
42
|
+
libs.each do |lib|
|
43
|
+
Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
|
44
|
+
fasta_files = []
|
45
|
+
json_files = []
|
46
|
+
Dir.chdir(File.join(indir, lib)) do
|
47
|
+
fasta_files = Dir.glob("**/*.fasta")
|
48
|
+
json_files = Dir.glob("**/log.json")
|
49
|
+
end
|
50
|
+
fasta_files.each do |f|
|
51
|
+
path_array = Pathname(f).each_filename.to_a
|
52
|
+
region = path_array[0]
|
53
|
+
if path_array[-1] == "combined.fasta"
|
54
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
|
55
|
+
Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
|
56
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
|
57
|
+
else
|
58
|
+
Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
|
59
|
+
Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
|
60
|
+
FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
json_files.each do |f|
|
65
|
+
json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
|
66
|
+
log.print [lib,
|
67
|
+
json_log[:primer_set_name],
|
68
|
+
json_log[:total_raw_sequence],
|
69
|
+
json_log[:r1_filtered_raw],
|
70
|
+
json_log[:r2_filtered_raw],
|
71
|
+
json_log[:paired_raw_sequence],
|
72
|
+
json_log[:consensus_cutoff],
|
73
|
+
json_log[:length_of_pid],
|
74
|
+
json_log[:total_tcs_with_ambiguities],
|
75
|
+
json_log[:total_tcs],
|
76
|
+
json_log[:distinct_to_raw],
|
77
|
+
json_log[:resampling_param],
|
78
|
+
json_log[:combined_tcs],
|
79
|
+
json_log[:combined_tcs_after_qc],
|
80
|
+
].join(',') + "\n"
|
81
|
+
end
|
82
|
+
end
|
83
|
+
log.close
|
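Note on `bin/tcs_log` above: the script derives the region from the first component of each FASTA path (relative to the library folder) and builds the pooled file name from the library name and that region. The sketch below isolates only that naming rule. The helper name `pooled_name` and the `RT/consensus/...` layout are assumptions made for illustration; neither is taken from the gem.

```ruby
require 'pathname'

# Hypothetical helper (not part of viral_seq): reproduces the naming rule
# that tcs_log applies when it copies FASTA files into the pooled folders.
def pooled_name(lib, relative_path)
  parts = Pathname(relative_path).each_filename.to_a
  region = parts[0] # first path component is treated as the region
  if parts[-1] == "combined.fasta"
    "#{lib}_#{region}"              # combined TCS file: lib + region only
  else
    "#{lib}_#{region}_#{parts[-1]}" # other FASTA files keep their file name
  end
end

puts pooled_name("lib1", "RT/consensus/combined.fasta")
puts pooled_name("lib1", "RT/consensus/r1.fasta")
```

As written in the diff, the pooled copy of `combined.fasta` is named without a `.fasta` extension, which the sketch mirrors.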
data/doc/dr.json
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
{
|
2
|
+
"raw_sequence_dir": "MyExampleDir",
|
3
|
+
"platform_error_rate": 0.02,
|
4
|
+
"primer_pairs": [
|
5
|
+
{
|
6
|
+
"region": "RT",
|
7
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
|
8
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
|
9
|
+
"majority": 0.5,
|
10
|
+
"end_join": true,
|
11
|
+
"end_join_option": 1,
|
12
|
+
"overlap": 0,
|
13
|
+
"TCS_QC": true,
|
14
|
+
"ref_genome": "HXB2",
|
15
|
+
"ref_start": 2648,
|
16
|
+
"ref_end": 3257,
|
17
|
+
"indel": true,
|
18
|
+
"trim": false
|
19
|
+
},
|
20
|
+
{
|
21
|
+
"region": "PR",
|
22
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
|
23
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
|
24
|
+
"majority": 0.5,
|
25
|
+
"end_join": true,
|
26
|
+
"end_join_option": 3,
|
27
|
+
"TCS_QC": true,
|
28
|
+
"ref_genome": "HXB2",
|
29
|
+
"ref_start": 0,
|
30
|
+
"ref_end": 2591,
|
31
|
+
"indel": true,
|
32
|
+
"trim": true,
|
33
|
+
"trim_ref": "HXB2",
|
34
|
+
"trim_ref_start": 2253,
|
35
|
+
"trim_ref_end": 2549
|
36
|
+
},
|
37
|
+
{
|
38
|
+
"region": "IN",
|
39
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
|
40
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
|
41
|
+
"majority": 0.5,
|
42
|
+
"end_join": true,
|
43
|
+
"end_join_option": 3,
|
44
|
+
"overlap": 171,
|
45
|
+
"TCS_QC": true,
|
46
|
+
"ref_genome": "HXB2",
|
47
|
+
"ref_start": 4384,
|
48
|
+
"ref_end": 4751,
|
49
|
+
"indel": false,
|
50
|
+
"trim": false
|
51
|
+
},
|
52
|
+
{
|
53
|
+
"region": "V1V3",
|
54
|
+
"cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
|
55
|
+
"forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
|
56
|
+
"majority": 0.5,
|
57
|
+
"end_join": true,
|
58
|
+
"end_join_option": 1,
|
59
|
+
"overlap": 0,
|
60
|
+
"TCS_QC": true,
|
61
|
+
"ref_genome": "HXB2",
|
62
|
+
"ref_start": 6585,
|
63
|
+
"ref_end": 7208,
|
64
|
+
"indel": true,
|
65
|
+
"trim": false
|
66
|
+
}
|
67
|
+
]
|
68
|
+
}
|
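Note on `doc/dr.json` above: the preset does not set `platform_format`, so `bin/tcs` in this release would fall back to a length of 300. The sketch below shows one way to read such a params file and apply the same fallback. The JSON text is a shortened stand-in for the preset, and parsing with `symbolize_names: true` is an assumption based on the symbol keys used in `bin/tcs`.

```ruby
require 'json'

# Shortened stand-in for doc/dr.json (primer sequences omitted).
json_text = <<~JSON
  {
    "raw_sequence_dir": "MyExampleDir",
    "platform_error_rate": 0.02,
    "primer_pairs": [
      { "region": "RT", "end_join": true, "end_join_option": 1, "overlap": 0 },
      { "region": "PR", "end_join": true, "end_join_option": 3 }
    ]
  }
JSON

params = JSON.parse(json_text, symbolize_names: true)

# List the regions defined in the params file.
regions = params[:primer_pairs].map { |pair| pair[:region] }

# Same fallback as the new block in bin/tcs: default to 300 when the key is absent.
platform_length = params[:platform_format] || 300

puts "regions: #{regions.join(', ')}; platform length: #{platform_length}"
```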
data/lib/viral_seq/constant.rb
CHANGED
@@ -1,7 +1,11 @@
|
|
1
1
|
module ViralSeq
|
2
|
-
|
2
|
+
|
3
3
|
# array for all amino acid one letter abbreviations
|
4
4
|
|
5
5
|
AMINO_ACID_LIST = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y", "*"]
|
6
6
|
|
7
|
+
SDRM_HIV_PR_LIST = {}
|
8
|
+
SDRM_HIV_RT_LIST = {}
|
9
|
+
SDRM_HIV_IN_LIST = {}
|
10
|
+
|
7
11
|
end
|
data/lib/viral_seq/enumerable.rb
CHANGED
@@ -3,10 +3,6 @@
|
|
3
3
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
4
4
|
# array.median
|
5
5
|
# => 5.5
|
6
|
-
# @example sum
|
7
|
-
# array = [1,2,3,4,5,6,7,8,9,10]
|
8
|
-
# array.sum
|
9
|
-
# => 55
|
10
6
|
# @example average number (mean)
|
11
7
|
# array = [1,2,3,4,5,6,7,8,9,10]
|
12
8
|
# array.mean
|
@@ -45,12 +41,6 @@ module Enumerable
|
|
45
41
|
len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
|
46
42
|
end
|
47
43
|
|
48
|
-
# generate summed value
|
49
|
-
# @return [Numeric] summed value
|
50
|
-
def sum
|
51
|
-
self.inject(0){|accum, i| accum + i }
|
52
|
-
end
|
53
|
-
|
54
44
|
# generate mean number
|
55
45
|
# @return [Float] mean value
|
56
46
|
def mean
|
data/lib/viral_seq/hivdr.rb
CHANGED
data/lib/viral_seq/math.rb
CHANGED
@@ -67,7 +67,7 @@ module ViralSeq
|
|
67
67
|
@k = k
|
68
68
|
@poisson_hash = {}
|
69
69
|
(0..k).each do |n|
|
70
|
-
p = (rate**n * ::Math::E**(-rate))
|
70
|
+
p = (rate**n * ::Math::E**(-rate))/n.factorial
|
71
71
|
@poisson_hash[n] = p
|
72
72
|
end
|
73
73
|
end
|
@@ -155,9 +155,9 @@ class Integer
|
|
155
155
|
# factorial method for an Integer
|
156
156
|
# @return [Integer] factorial for given Integer
|
157
157
|
# @example factorial for 5
|
158
|
-
#
|
158
|
+
# 5.factorial
|
159
159
|
# => 120
|
160
|
-
def
|
160
|
+
def factorial
|
161
161
|
if self == 0
|
162
162
|
return 1
|
163
163
|
else
|
data/lib/viral_seq/sdrm.rb
ADDED
@@ -0,0 +1,43 @@
|
|
1
|
+
module ViralSeq
|
2
|
+
class DRMs
|
3
|
+
def initialize (mutation_list = {})
|
4
|
+
@mutation_list = mutation_list
|
5
|
+
end
|
6
|
+
|
7
|
+
attr_accessor :mutation_list
|
8
|
+
end
|
9
|
+
|
10
|
+
def self.sdrm_hiv_pr(seq_hash)
|
11
|
+
end
|
12
|
+
|
13
|
+
def self.sdrm_hiv_rt(seq_hash)
|
14
|
+
end
|
15
|
+
|
16
|
+
def self.sdrm_hiv_in(seq_hash)
|
17
|
+
end
|
18
|
+
|
19
|
+
def self.list_from_json(file)
|
20
|
+
end
|
21
|
+
|
22
|
+
def self.list_from_csv(file)
|
23
|
+
end
|
24
|
+
|
25
|
+
def self.export_list_hiv_pr(file, format = :json)
|
26
|
+
if foramt == :json
|
27
|
+
|
28
|
+
end
|
29
|
+
end
|
30
|
+
|
31
|
+
def self.export_list_hiv_rt(file, format = :json)
|
32
|
+
|
33
|
+
end
|
34
|
+
|
35
|
+
def self.export_list_hiv_in(file, format = :json)
|
36
|
+
|
37
|
+
end
|
38
|
+
|
39
|
+
def drm_analysis(seq_hash)
|
40
|
+
mutation_list = self.mutation_list
|
41
|
+
|
42
|
+
end
|
43
|
+
end
|
data/lib/viral_seq/seq_hash.rb
CHANGED
@@ -394,7 +394,6 @@ module ViralSeq
|
|
394
394
|
end
|
395
395
|
end
|
396
396
|
end
|
397
|
-
|
398
397
|
consensus_seq += call_consensus_base(max_base_list)
|
399
398
|
end
|
400
399
|
return consensus_seq
|
@@ -549,7 +548,7 @@ module ViralSeq
|
|
549
548
|
if sequences.size == 0
|
550
549
|
return 0
|
551
550
|
else
|
552
|
-
cut_off =
|
551
|
+
cut_off = Float::INFINITY
|
553
552
|
l = sequences[0].size
|
554
553
|
rate = sequences.size * error_rate
|
555
554
|
count_mut = variant_for_poisson(sequences)
|
@@ -558,7 +557,7 @@ module ViralSeq
|
|
558
557
|
|
559
558
|
poisson_hash.each do |k,v|
|
560
559
|
cal = l * v
|
561
|
-
obs = count_mut[k] ? count_mut[k] :
|
560
|
+
obs = count_mut[k] ? count_mut[k] : 1
|
562
561
|
if obs >= fold_cutoff * cal
|
563
562
|
cut_off = k
|
564
563
|
break
|
@@ -742,6 +741,7 @@ module ViralSeq
|
|
742
741
|
seq_hash_unique_pass = []
|
743
742
|
|
744
743
|
seq_hash_unique.each do |seq|
|
744
|
+
next if seq.nil?
|
745
745
|
loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
|
746
746
|
next unless loc # if locator tool fails, skip this seq.
|
747
747
|
if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
|
data/lib/viral_seq/seq_hash_pair.rb
CHANGED
@@ -110,19 +110,21 @@ module ViralSeq
|
|
110
110
|
raise ArgumentError.new(":overlap has to be Integer, input #{overlap} invalid.") unless overlap.is_a? Integer
|
111
111
|
raise ArgumentError.new(":diff has to be float or integer, input #{diff} invalid.") unless (diff.is_a? Integer or diff.is_a? Float)
|
112
112
|
joined_seq = {}
|
113
|
-
seq_pair_hash.
|
113
|
+
seq_pair_hash.each do |seq_name,seq_pair|
|
114
114
|
r1_seq = seq_pair[0]
|
115
115
|
r2_seq = seq_pair[1]
|
116
116
|
if overlap.zero?
|
117
117
|
joined_sequence = r1_seq + r2_seq
|
118
|
+
elsif diff.zero?
|
119
|
+
if r1_seq[-overlap..-1] == r2_seq[0,overlap]
|
120
|
+
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
121
|
+
end
|
118
122
|
elsif r1_seq[-overlap..-1].compare_with(r2_seq[0,overlap]) <= (overlap * diff)
|
119
123
|
joined_sequence= r1_seq + r2_seq[overlap..-1]
|
120
124
|
else
|
121
125
|
next
|
122
126
|
end
|
123
|
-
|
124
|
-
joined_seq[seq_name] = joined_sequence
|
125
|
-
end
|
127
|
+
joined_seq[seq_name] = joined_sequence if joined_sequence
|
126
128
|
end
|
127
129
|
|
128
130
|
joined_seq_hash = ViralSeq::SeqHash.new
|
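Note on the end-join hunk above: the new `diff.zero?` branch joins a read pair only when the last `overlap` bases of R1 equal the first `overlap` bases of R2, and the pair is skipped otherwise. A minimal sketch of that exact-match case follows. `join_pair` is a hypothetical helper written for illustration and leaves out the mismatch-tolerant branch that relies on the gem's `compare_with`.

```ruby
# Hypothetical helper (not the gem's API): joins one R1/R2 pair.
# Returns the joined sequence, or nil when the overlap does not match exactly.
def join_pair(r1_seq, r2_seq, overlap)
  return r1_seq + r2_seq if overlap.zero?            # simple join, no overlap
  return nil unless r1_seq[-overlap..-1] == r2_seq[0, overlap]

  r1_seq + r2_seq[overlap..-1]                       # keep the overlap only once
end

pairs = {
  ">seq1" => ["AAACCC", "CCCGGG"], # overlap of 3 matches
  ">seq2" => ["AAACCC", "CCTGGG"]  # overlap of 3 does not match
}

joined = {}
pairs.each do |name, (r1, r2)|
  sequence = join_pair(r1, r2, 3)
  joined[name] = sequence if sequence # unmatched pairs are dropped
end

puts joined.inspect
```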
data/lib/viral_seq/tcs_core.rb
CHANGED
@@ -102,16 +102,18 @@ module ViralSeq
|
|
102
102
|
end
|
103
103
|
|
104
104
|
# sort array of file names to determine if there is potential errors
|
105
|
-
#
|
106
|
-
#
|
105
|
+
# @param name_array [Array] array of file names
|
106
|
+
# @return [hash] name check results
|
107
107
|
|
108
108
|
def validate_file_name(name_array)
|
109
|
-
errors = {
|
109
|
+
errors = {
|
110
|
+
file_type_error: [] ,
|
110
111
|
missing_r1_file: [] ,
|
111
112
|
missing_r2_file: [] ,
|
112
113
|
extra_r1_r2_file: [],
|
113
114
|
no_region_tag: [] ,
|
114
|
-
multiple_region_tag: []
|
115
|
+
multiple_region_tag: []
|
116
|
+
}
|
115
117
|
|
116
118
|
passed_libs = {}
|
117
119
|
|
@@ -163,6 +165,13 @@ module ViralSeq
|
|
163
165
|
end
|
164
166
|
end
|
165
167
|
|
168
|
+
file_name_with_lib_name = {}
|
169
|
+
passed_libs.each do |lib_name, files|
|
170
|
+
files.each do |f|
|
171
|
+
file_name_with_lib_name[f] = lib_name
|
172
|
+
end
|
173
|
+
end
|
174
|
+
|
166
175
|
passed_names = []
|
167
176
|
|
168
177
|
passed_libs.values.each { |names| passed_names += names}
|
@@ -173,7 +182,27 @@ module ViralSeq
|
|
173
182
|
pass = true
|
174
183
|
end
|
175
184
|
|
176
|
-
|
185
|
+
file_name_with_error_type = {}
|
186
|
+
|
187
|
+
errors.each do |type, files|
|
188
|
+
files.each do |f|
|
189
|
+
file_name_with_error_type[f] ||= []
|
190
|
+
file_name_with_error_type[f] << type.to_s.tr("_", "\s")
|
191
|
+
end
|
192
|
+
end
|
193
|
+
|
194
|
+
file_check = []
|
195
|
+
|
196
|
+
name_array.each do |name|
|
197
|
+
file_check_hash = {}
|
198
|
+
file_check_hash[:fileName] = name
|
199
|
+
file_check_hash[:errors] = file_name_with_error_type[name]
|
200
|
+
file_check_hash[:libName] = file_name_with_lib_name[name]
|
201
|
+
|
202
|
+
file_check << file_check_hash
|
203
|
+
end
|
204
|
+
|
205
|
+
return { allPass: pass, files: file_check }
|
177
206
|
end
|
178
207
|
|
179
208
|
# filter r1 raw sequences for non-specific primers.
|
@@ -276,7 +305,9 @@ module ViralSeq
|
|
276
305
|
end
|
277
306
|
|
278
307
|
def general_filter(seq)
|
279
|
-
if seq
|
308
|
+
if seq.size < $platform_sequencing_length
|
309
|
+
return false
|
310
|
+
elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
|
280
311
|
return false
|
281
312
|
elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
|
282
313
|
return false
|
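Note on the `general_filter` hunk above: the visible checks reject reads shorter than the platform length, reads with an `N` anywhere except the first or last position, and reads containing a run of 11 `A`s. The sketch below covers only those three checks, since the rest of the method is not shown in this diff. `passes_basic_filter?` is a hypothetical name, and passing the length as an argument instead of reading `$platform_sequencing_length` is a choice made for this example.

```ruby
# Hypothetical helper covering only the three checks visible in the hunk.
def passes_basic_filter?(seq, min_length = 300)
  return false if seq.size < min_length # shorter than the platform read length
  return false if seq[1..-2] =~ /N/     # ambiguity away from the two ends
  return false if seq =~ /A{11}/        # poly-A run suggests adaptor sequence
  true
end

read = "ACGT" * 75 # 300 bases, no N, no poly-A run
puts passes_basic_filter?(read)
puts passes_basic_filter?("ACGT" * 10)
```

Taking the minimum length as a parameter keeps the helper testable without setting a global variable.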
data/lib/viral_seq/tcs_json.rb
CHANGED
@@ -13,6 +13,22 @@ module ViralSeq
|
|
13
13
|
print '> '
|
14
14
|
param[:raw_sequence_dir] = gets.chomp.rstrip
|
15
15
|
|
16
|
+
puts "Choose MiSeq Platform (1-3):\n1. 150x7x150\n2. 250x7x250\n3. 300x7x300 (default)"
|
17
|
+
print "> "
|
18
|
+
pf_option = gets.chomp.rstrip
|
19
|
+
# while ![1,2,3].include?(pf_option.to_i)
|
20
|
+
# print "Entered MiSeq Platform #{pf_option.red.bold} not valid (choose 1-3), try again\n> "
|
21
|
+
# pf_option = gets.chomp.rstrip
|
22
|
+
# end
|
23
|
+
case pf_option.to_i
|
24
|
+
when 1
|
25
|
+
param[:platform_format] = 150
|
26
|
+
when 2
|
27
|
+
param[:platform_format] = 250
|
28
|
+
else
|
29
|
+
param[:platform_format] = 300
|
30
|
+
end
|
31
|
+
|
16
32
|
puts 'Enter the estimated platform error rate (for TCS cut-off calculation), default as ' + '0.02'.red.bold
|
17
33
|
print '> '
|
18
34
|
input_error = gets.chomp.rstrip.to_f
|
@@ -52,12 +68,12 @@ module ViralSeq
|
|
52
68
|
if ej =~ /y|yes/i
|
53
69
|
data[:end_join] = true
|
54
70
|
|
55
|
-
|
56
|
-
1: simple join, no overlap
|
57
|
-
2: known overlap
|
58
|
-
3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap
|
59
|
-
4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap
|
60
|
-
> "
|
71
|
+
puts "End-join option? Choose from (1-4):"
|
72
|
+
puts "1: simple join, no overlap"
|
73
|
+
puts "2: known overlap"
|
74
|
+
puts "3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap"
|
75
|
+
puts "4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap"
|
76
|
+
print "> "
|
61
77
|
ej_option = gets.chomp.rstrip
|
62
78
|
while ![1,2,3,4].include?(ej_option.to_i)
|
63
79
|
puts "Entered end-join option #{ej_option.red.bold} not valid (choose 1-4), try again"
|
@@ -138,7 +154,12 @@ module ViralSeq
|
|
138
154
|
if save_option =~ /y|yes/i
|
139
155
|
print "Path to save JSON file:\n> "
|
140
156
|
path = gets.chomp.rstrip
|
141
|
-
|
157
|
+
while !validate_path_name(path)
|
158
|
+
print "Entered path no valid, try again.\n".red.bold
|
159
|
+
print "Path to save JSON file:\n> "
|
160
|
+
path = gets.chomp.rstrip
|
161
|
+
end
|
162
|
+
File.open(validate_path_name(path), 'w') {|f| f.puts JSON.pretty_generate(param)}
|
142
163
|
end
|
143
164
|
|
144
165
|
print "\nDo you wish to execute tcs pipeline with the input params now? Y/N \n> "
|
@@ -147,7 +168,7 @@ module ViralSeq
|
|
147
168
|
if rsp =~ /y/i
|
148
169
|
return param
|
149
170
|
else
|
150
|
-
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`"
|
171
|
+
abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`".blue
|
151
172
|
end
|
152
173
|
|
153
174
|
end
|
@@ -172,7 +193,17 @@ module ViralSeq
|
|
172
193
|
when 3
|
173
194
|
:MAC239
|
174
195
|
end
|
175
|
-
end
|
176
|
-
|
196
|
+
end # end of get_ref
|
197
|
+
|
198
|
+
def validate_path_name(path)
|
199
|
+
if path.empty?
|
200
|
+
return false
|
201
|
+
elsif File.directory? path
|
202
|
+
return File.join(path, 'params.json')
|
203
|
+
elsif File.directory?(File.dirname(path))
|
204
|
+
return path
|
205
|
+
end
|
206
|
+
end # end of validate_path_name
|
207
|
+
end # end of class << self
|
177
208
|
end # end TcsJson
|
178
209
|
end # end main module
|
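Note on `validate_path_name` above: an empty path is rejected, an existing directory gets `params.json` appended, and any other path is accepted only if its parent directory exists. The sketch below exercises those three cases against a temporary directory. `resolve_json_path` is a hypothetical stand-alone copy of the logic; it returns `false` explicitly in the last case, whereas the method in the diff falls through and returns `nil`.

```ruby
require 'tmpdir'

# Hypothetical stand-alone copy of the path check used by the JSON generator.
def resolve_json_path(path)
  return false if path.empty?
  return File.join(path, 'params.json') if File.directory?(path)
  return path if File.directory?(File.dirname(path))
  false # parent directory does not exist
end

Dir.mktmpdir do |dir|
  puts resolve_json_path(dir)                                # directory given
  puts resolve_json_path(File.join(dir, 'my.json'))          # file in existing dir
  puts resolve_json_path(File.join(dir, 'none', 'my.json'))  # missing parent dir
end
```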
data/lib/viral_seq/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: viral_seq
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Shuntai Zhou
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2021-03-26 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bundler
|
@@ -90,6 +90,7 @@ email:
|
|
90
90
|
executables:
|
91
91
|
- locator
|
92
92
|
- tcs
|
93
|
+
- tcs_log
|
93
94
|
extensions: []
|
94
95
|
extra_rdoc_files: []
|
95
96
|
files:
|
@@ -104,6 +105,8 @@ files:
|
|
104
105
|
- Rakefile
|
105
106
|
- bin/locator
|
106
107
|
- bin/tcs
|
108
|
+
- bin/tcs_log
|
109
|
+
- doc/dr.json
|
107
110
|
- lib/viral_seq.rb
|
108
111
|
- lib/viral_seq/constant.rb
|
109
112
|
- lib/viral_seq/enumerable.rb
|
@@ -114,6 +117,7 @@ files:
|
|
114
117
|
- lib/viral_seq/pid.rb
|
115
118
|
- lib/viral_seq/ref_seq.rb
|
116
119
|
- lib/viral_seq/rubystats.rb
|
120
|
+
- lib/viral_seq/sdrm.rb
|
117
121
|
- lib/viral_seq/seq_hash.rb
|
118
122
|
- lib/viral_seq/seq_hash_pair.rb
|
119
123
|
- lib/viral_seq/sequence.rb
|
@@ -142,7 +146,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
142
146
|
version: '0'
|
143
147
|
requirements:
|
144
148
|
- R required for some functions
|
145
|
-
rubygems_version: 3.
|
149
|
+
rubygems_version: 3.2.2
|
146
150
|
signing_key:
|
147
151
|
specification_version: 4
|
148
152
|
summary: A Ruby Gem containing bioinformatics tools for processing viral NGS data.
|