viral_seq 1.0.10 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 14d880e9f39b2b87892bec9d4377b358643c880cf32c81872cff51e1007bc23b
4
- data.tar.gz: 6ee1c3293e2b0403a2eac033335f7575625b2d35f32127b5b57be53e94b4ec7d
3
+ metadata.gz: ea453e452e6832e942512cdb94462c33af89ffd8295017806c9aa6ff7ec77ad4
4
+ data.tar.gz: 2bb89d193e0e84ebe0791882c53e226a0a934ea3b9d1e61f87b8ffff6c22af1b
5
5
  SHA512:
6
- metadata.gz: 951b75ced84aa21cf5650baa6970f60a617d3f29d20c14acadacefabea23d6b584f25990453c2008f30197aaef055a94edbdbb45494bb12b6343d90bc6bd45fb
7
- data.tar.gz: 68ac69b4ebd5438a8f73780db823c94aa5a78c7c26d02cfd6bec979244dd1d6452c3698ade0606ddbbaccc480ad85e603171c11648dbb0110c2f5dbb3355bb35
6
+ metadata.gz: 9dc0403ecaea119d3aa3e832305a0bd4f038fdb71789dcd036080fa89b0e454ee79001b6042df171364e4207a93b2d4d5747336b2fb7f8fb7d83103f5d641134
7
+ data.tar.gz: 510ccfce7d717b56d55e2477ae01124009d1f53f010635759cf2f69afe0132313e08db9abaae1ec6d8d894961beba1c2d70a637eafa9b57b05f0aac3372cd0ca
data/.gitignore CHANGED
@@ -2,7 +2,6 @@
2
2
  /.yardoc
3
3
  /_yardoc/
4
4
  /coverage/
5
- /doc/
6
5
  /pkg/
7
6
  /spec/reports/
8
7
  /tmp/
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- viral_seq (1.0.10)
4
+ viral_seq (1.0.13)
5
5
  colorize (~> 0.1)
6
6
  muscle_bio (~> 0.4)
7
7
 
data/README.md CHANGED
@@ -2,7 +2,16 @@
2
2
 
3
3
  A Ruby Gem containing bioinformatics tools for processing viral NGS data.
4
4
 
5
- Specifically for Primer-ID sequencing and HIV drug resistance analysis.
5
+ Specifically for Primer ID sequencing and HIV drug resistance analysis.
6
+
7
+ ## Illustration for the Primer ID Sequencing
8
+
9
+
10
+ ![Primer ID Sequencing](https://storage.googleapis.com/tcs-dr-public/pid.png)
11
+
12
+ ### Reference readings on the Primer ID sequencing
13
+ [Primer ID JID paper](https://doi.org/10.21769/BioProtoc.3938)
14
+ [Primer ID MiSeq protocol](https://doi.org/10.1128/JVI.00522-15)
6
15
 
7
16
  ## Install
8
17
 
@@ -14,19 +23,45 @@ Specifically for Primer-ID sequencing and HIV drug resistance analysis.
14
23
 
15
24
  ### Executables
16
25
 
17
- Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
26
+ ### `tcs`
27
+ Use executable `tcs` pipeline to process **Primer ID MiSeq sequencing** data.
18
28
 
29
+ Example commands:
19
30
  ```bash
20
- $ locator -i sequence.fasta -o sequence.fasta.csv
31
+ $ tcs -p params.json # run TCS pipeline with params.json
32
+ $ tcs -j # CLI to generate params.json
33
+ $ tcs -h # print out the help
21
34
  ```
35
+ ---
36
+ ### `tcs_log`
22
37
 
23
- Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data.
38
+ Use `tcs_log` script to pool run logs and TCS fasta files after one batch of `tcs` jobs.
24
39
 
40
+
41
+ Example file structure:
42
+ ```
43
+ batch_tcs_jobs/
44
+ ├── lib1
45
+ ├── lib2
46
+ ├── lib3
47
+ ├── lib4
48
+ ├── ...
49
+ ```
50
+
51
+ Example command:
25
52
  ```bash
26
- $ tcs -p params.json # run TCS pipeline with params.json
27
- $ tcs -j # CLI to generate params.json
28
- $ tcs -h # print out the help
53
+ $ tcs_log batch_tcs_jobs
54
+ ```
55
+
56
+ ---
57
+
58
+ ### `locator`
59
+ Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
60
+
61
+ ```bash
62
+ $ locator -i sequence.fasta -o sequence.fasta.csv
29
63
  ```
64
+ ---
30
65
 
31
66
  ## Some Examples
32
67
 
@@ -78,17 +113,49 @@ Examine for drug resistance mutations for HIV PR region
78
113
  ```ruby
79
114
  qc_seqhash.sdrm_hiv_pr(cut_off)
80
115
  ```
116
+ ## Known issues
117
+
118
+ 1. ~~have a conflict with rails.~~
119
+ 2. ~~Update on 03032021. Still have conflict. But in the Rails Gemfile, you can use `require: false` globally and only require "viral_seq" when the module is needed in a controller.~~
120
+ 3. The conflict seems to be resolved. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
81
121
 
82
122
  ## Updates
83
123
 
84
- ### Version 1.1.0-11112020:
124
+ ### Version 1.1.0-03252021
125
+
126
+ 1. Optimized the algorithm of end-join.
127
+ 2. Fixed a bug in the `tcs` pipeline that sometimes prevented combined TCS files from being saved.
128
+ 3. Added `tcs_log` command to pool run logs and tcs files from one batch of tcs jobs.
129
+ 4. Added the preset of MPID-HIVDR params file ***dr.json*** in /doc.
130
+ 5. Added `platform_format` option to the JSON generator of the `tcs` pipeline.
131
+ Users can choose from 3 MiSeq platforms for processing their sequencing data.
132
+ MiSeq 300x7x300 is the default option.
133
+
134
+ ### Version 1.0.14-03052021
135
+
136
+ 1. Add a function `ViralSeq::TcsCore.validate_file_name` to check MiSeq paired-end file names.
137
+
138
+ ### Version 1.0.13-03032021
139
+
140
+ 1. Fixed the conflict with rails.
141
+
142
+ ### Version 1.0.12-03032021
143
+
144
+ 1. Fixed an issue that may cause conflicts with ActiveRecord.
145
+
146
+ ### Version 1.0.11-03022021
147
+
148
+ 1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
149
+ 2. Fixed an issue loading class 'OptionParser' in some Ruby environments.
150
+
151
+ ### Version 1.0.10-11112020:
85
152
 
86
153
  1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
87
154
  2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
88
155
  3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
89
156
  4. a few optimizations.
90
157
  5. TCS 2.1.0 delivered.
91
- 6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
158
+ 6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
92
159
 
93
160
  ### Version 1.0.9-07182020:
94
161
 
data/bin/tcs CHANGED
@@ -23,12 +23,12 @@
23
23
  # THE SOFTWARE.
24
24
 
25
25
  # Use JSON file as the run param
26
- # run tcs_json_generator.rb to generate param json file.
26
+ # run `tcs -j` to generate param json file.
27
27
 
28
28
  require 'viral_seq'
29
29
  require 'json'
30
30
  require 'colorize'
31
- require 'OptionParser'
31
+ require 'optparse'
32
32
 
33
33
  options = {}
34
34
 
@@ -115,6 +115,12 @@ else
115
115
  error_rate = 0.02
116
116
  end
117
117
 
118
+ if params[:platform_format]
119
+ $platform_sequencing_length = params[:platform_format]
120
+ else
121
+ $platform_sequencing_length = 300
122
+ end
123
+
118
124
  primers = params[:primer_pairs]
119
125
  if primers.empty?
120
126
  ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
@@ -273,7 +279,6 @@ primers.each do |primer|
273
279
  r1_sub_seq << bio_r1[seq_name]
274
280
  r2_sub_seq << bio_r2[seq_name]
275
281
  end
276
-
277
282
  #consensus name including the Primer ID and number of raw sequences of that Primer ID, library name and setname.
278
283
  consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
279
284
  r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
@@ -317,8 +322,12 @@ primers.each do |primer|
317
322
  f1 = File.open(outfile_r1, 'w')
318
323
  f2 = File.open(outfile_r2, 'w')
319
324
  primer_id_in_use = {}
320
- r1_seq_length = consensus_filtered.values[0][0].size
321
- r2_seq_length = consensus_filtered.values[0][1].size
325
+ if n_con > 0
326
+ r1_seq_length = consensus_filtered.values[0][0].size
327
+ r2_seq_length = consensus_filtered.values[0][1].size
328
+ else
329
+ next
330
+ end
322
331
  log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
323
332
  log.puts Time.now.to_s + "\t" + "R2 sequence #{r2_seq_length} bp"
324
333
  consensus_filtered.each do |seq_name,seq|
@@ -360,6 +369,7 @@ primers.each do |primer|
360
369
  shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
361
370
  joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
362
371
  log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
372
+
363
373
  summary_json[:combined_tcs] = joined_sh.size
364
374
 
365
375
  if export_raw
@@ -429,12 +439,15 @@ primers.each do |primer|
429
439
  trim_end = primer[:trim_ref_end]
430
440
  trim_ref = primer[:trim_ref].to_sym
431
441
  joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
432
- joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
433
442
  if export_raw
434
443
  joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
435
- joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
436
444
  end
437
445
  end
446
+
447
+ joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
448
+ if export_raw
449
+ joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
450
+ end
438
451
  end
439
452
 
440
453
  File.open(outfile_log, "w") do |f|
data/bin/tcs_log ADDED
@@ -0,0 +1,83 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # pool run logs from one batch of tcs jobs
4
+ # file structure:
5
+ # batch_tcs_jobs/
6
+ # ├── lib1
7
+ # ├── lib2
8
+ # ├── lib3
9
+ # ├── lib4
10
+ # ├── ...
11
+ #
12
+ # command example:
13
+ # $ tcs_log batch_tcs_jobs
14
+
15
+ require 'viral_seq'
16
+ require 'pathname'
17
+ require 'json'
18
+ require 'fileutils'
19
+
20
+ indir = ARGV[0].chomp
21
+ indir_basename = File.basename(indir)
22
+ indir_dirname = File.dirname(indir)
23
+
24
+ tcs_dir = File.join(indir_dirname, (indir_basename + "_tcs"))
25
+ Dir.mkdir(tcs_dir) unless File.directory?(tcs_dir)
26
+
27
+ libs = []
28
+ Dir.chdir(indir) {libs = Dir.glob("*")}
29
+
30
+ outdir2 = File.join(tcs_dir, "combined_TCS_per_lib")
31
+ outdir3 = File.join(tcs_dir, "TCS_per_region")
32
+ outdir4 = File.join(tcs_dir, "combined_TCS_per_region")
33
+
34
+ Dir.mkdir(outdir2) unless File.directory?(outdir2)
35
+ Dir.mkdir(outdir3) unless File.directory?(outdir3)
36
+ Dir.mkdir(outdir4) unless File.directory?(outdir4)
37
+
38
+ log_file = File.join(tcs_dir,"log.csv")
39
+ log = File.open(log_file,'w')
40
+ log.puts "lib name,Region,Raw Sequences per barcode,R1 Raw,R2 Raw,Paired Raw,Cutoff,PID Length,Consensus1,Consensus2,Distinct to Raw,Resampling index,Combined TCS,Combined TCS after QC"
41
+
42
+ libs.each do |lib|
43
+ Dir.mkdir(File.join(outdir2, lib)) unless File.directory?(File.join(outdir2, lib))
44
+ fasta_files = []
45
+ json_files = []
46
+ Dir.chdir(File.join(indir, lib)) do
47
+ fasta_files = Dir.glob("**/*.fasta")
48
+ json_files = Dir.glob("**/log.json")
49
+ end
50
+ fasta_files.each do |f|
51
+ path_array = Pathname(f).each_filename.to_a
52
+ region = path_array[0]
53
+ if path_array[-1] == "combined.fasta"
54
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir2, lib, (lib + "_" + region)))
55
+ Dir.mkdir(File.join(outdir4,region)) unless File.directory?(File.join(outdir4,region))
56
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir4, region, (lib + "_" + region)))
57
+ else
58
+ Dir.mkdir(File.join(outdir3,region)) unless File.directory?(File.join(outdir3,region))
59
+ Dir.mkdir(File.join(outdir3,region, lib)) unless File.directory?(File.join(outdir3,region, lib))
60
+ FileUtils.cp(File.join(indir, lib, f), File.join(outdir3, region, lib, (lib + "_" + region + "_" + path_array[-1])))
61
+ end
62
+ end
63
+
64
+ json_files.each do |f|
65
+ json_log = JSON.parse(File.read(File.join(indir, lib, f)), symbolize_names: true)
66
+ log.print [lib,
67
+ json_log[:primer_set_name],
68
+ json_log[:total_raw_sequence],
69
+ json_log[:r1_filtered_raw],
70
+ json_log[:r2_filtered_raw],
71
+ json_log[:paired_raw_sequence],
72
+ json_log[:consensus_cutoff],
73
+ json_log[:length_of_pid],
74
+ json_log[:total_tcs_with_ambiguities],
75
+ json_log[:total_tcs],
76
+ json_log[:distinct_to_raw],
77
+ json_log[:resampling_param],
78
+ json_log[:combined_tcs],
79
+ json_log[:combined_tcs_after_qc],
80
+ ].join(',') + "\n"
81
+ end
82
+ end
83
+ log.close
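The log-pooling half of `tcs_log` can be sketched in isolation as below. This is a minimal illustration, not the script itself: `pool_logs` and `FIELDS` are hypothetical names, only three of the JSON keys used by the script are included, and the temporary directory tree stands in for a real `batch_tcs_jobs/` folder. Unlike the script, the sketch skips non-directory entries found at the top level of the batch folder.

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# Subset of the log.json keys written out by tcs_log (chosen for brevity).
FIELDS = %i[primer_set_name total_raw_sequence combined_tcs].freeze

# Collect one comma-separated row per log.json found under each library folder.
def pool_logs(batch_dir)
  rows = []
  Dir.glob(File.join(batch_dir, "*")).sort.each do |lib_path|
    next unless File.directory?(lib_path) # ignore stray files in the batch folder
    lib = File.basename(lib_path)
    Dir.glob(File.join(lib_path, "**", "log.json")).sort.each do |f|
      json_log = JSON.parse(File.read(f), symbolize_names: true)
      rows << [lib, *json_log.values_at(*FIELDS)].join(',')
    end
  end
  rows
end

# Build a throwaway batch folder with two libraries, one region each.
Dir.mktmpdir do |batch|
  { "lib1" => 120, "lib2" => 45 }.each do |lib, n|
    dir = File.join(batch, lib, "RT")
    FileUtils.mkdir_p(dir)
    log = { primer_set_name: "RT", total_raw_sequence: 1000, combined_tcs: n }
    File.write(File.join(dir, "log.json"), JSON.generate(log))
  end
  puts pool_logs(batch)
end
```

Joining with a plain comma mirrors the script; values that themselves contain commas would need quoting.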
data/doc/dr.json ADDED
@@ -0,0 +1,68 @@
1
+ {
2
+ "raw_sequence_dir": "MyExampleDir",
3
+ "platform_error_rate": 0.02,
4
+ "primer_pairs": [
5
+ {
6
+ "region": "RT",
7
+ "cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCACTATAGGCTGTACTGTCCATTTATC",
8
+ "forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNGGCCATTGACAGAAGAAAAAATAAAAGC",
9
+ "majority": 0.5,
10
+ "end_join": true,
11
+ "end_join_option": 1,
12
+ "overlap": 0,
13
+ "TCS_QC": true,
14
+ "ref_genome": "HXB2",
15
+ "ref_start": 2648,
16
+ "ref_end": 3257,
17
+ "indel": true,
18
+ "trim": false
19
+ },
20
+ {
21
+ "region": "PR",
22
+ "cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNCAGTTTAACTTTTGGGCCATCCATTCC",
23
+ "forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTCAGAGCAGACCAGAGCCAACAGCCCCA",
24
+ "majority": 0.5,
25
+ "end_join": true,
26
+ "end_join_option": 3,
27
+ "TCS_QC": true,
28
+ "ref_genome": "HXB2",
29
+ "ref_start": 0,
30
+ "ref_end": 2591,
31
+ "indel": true,
32
+ "trim": true,
33
+ "trim_ref": "HXB2",
34
+ "trim_ref_start": 2253,
35
+ "trim_ref_end": 2549
36
+ },
37
+ {
38
+ "region": "IN",
39
+ "cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNATCGAATACTGCCATTTGTACTGC",
40
+ "forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNAAAAGGAGAAGCCATGCATG",
41
+ "majority": 0.5,
42
+ "end_join": true,
43
+ "end_join_option": 3,
44
+ "overlap": 171,
45
+ "TCS_QC": true,
46
+ "ref_genome": "HXB2",
47
+ "ref_start": 4384,
48
+ "ref_end": 4751,
49
+ "indel": false,
50
+ "trim": false
51
+ },
52
+ {
53
+ "region": "V1V3",
54
+ "cdna": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNNNNCAGTCCATTTTGCTYTAYTRABVTTACAATRTGC",
55
+ "forward": "GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAGNNNNTTATGGGATCAAAGCCTAAAGCCATGTGTA",
56
+ "majority": 0.5,
57
+ "end_join": true,
58
+ "end_join_option": 1,
59
+ "overlap": 0,
60
+ "TCS_QC": true,
61
+ "ref_genome": "HXB2",
62
+ "ref_start": 6585,
63
+ "ref_end": 7208,
64
+ "indel": true,
65
+ "trim": false
66
+ }
67
+ ]
68
+ }
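A quick way to inspect a preset such as `dr.json` before running `tcs -p` is to parse it and print the per-region join and trim settings. The sketch below is only an illustration: `summarize` is a hypothetical helper, and the embedded JSON is a trimmed copy of two entries from the file above with the primer sequences left out.

```ruby
require 'json'

# Trimmed copy of the RT and PR entries from dr.json (primer sequences omitted).
preset = JSON.parse(<<~PRESET, symbolize_names: true)
  {
    "platform_error_rate": 0.02,
    "primer_pairs": [
      { "region": "RT", "end_join": true, "end_join_option": 1, "overlap": 0, "trim": false },
      { "region": "PR", "end_join": true, "end_join_option": 3, "trim": true,
        "trim_ref": "HXB2", "trim_ref_start": 2253, "trim_ref_end": 2549 }
    ]
  }
PRESET

# One summary line per region; keys that are absent are reported rather than guessed.
def summarize(preset)
  preset[:primer_pairs].map do |pp|
    overlap = pp.key?(:overlap) ? pp[:overlap].to_s : "not set"
    trim = pp[:trim] ? "trim to #{pp[:trim_ref]} #{pp[:trim_ref_start]}-#{pp[:trim_ref_end]}" : "no trimming"
    "#{pp[:region]}: end-join option #{pp[:end_join_option]}, overlap #{overlap}, #{trim}"
  end
end

puts summarize(preset)
```

Note that the PR entry carries no `overlap` key, so code reading `primer[:overlap]` for that region receives `nil`.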
@@ -1,7 +1,11 @@
1
1
  module ViralSeq
2
-
2
+
3
3
  # array for all amino acid one letter abbreviations
4
4
 
5
5
  AMINO_ACID_LIST = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y", "*"]
6
6
 
7
+ SDRM_HIV_PR_LIST = {}
8
+ SDRM_HIV_RT_LIST = {}
9
+ SDRM_HIV_IN_LIST = {}
10
+
7
11
  end
@@ -3,10 +3,6 @@
3
3
  # array = [1,2,3,4,5,6,7,8,9,10]
4
4
  # array.median
5
5
  # => 5.5
6
- # @example sum
7
- # array = [1,2,3,4,5,6,7,8,9,10]
8
- # array.sum
9
- # => 55
10
6
  # @example average number (mean)
11
7
  # array = [1,2,3,4,5,6,7,8,9,10]
12
8
  # array.mean
@@ -45,12 +41,6 @@ module Enumerable
45
41
  len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
46
42
  end
47
43
 
48
- # generate summed value
49
- # @return [Numeric] summed value
50
- def sum
51
- self.inject(0){|accum, i| accum + i }
52
- end
53
-
54
44
  # generate mean number
55
45
  # @return [Float] mean value
56
46
  def mean
@@ -1,6 +1,6 @@
1
1
 
2
2
  module ViralSeq
3
- class SeqHash
3
+ class SDRM
4
4
 
5
5
  # functions to identify SDRMs from a ViralSeq::SeqHash object at HIV PR region.
6
6
  # works for MPID-DR protocol (dx.doi.org/10.17504/protocols.io.useewbe)
@@ -67,7 +67,7 @@ module ViralSeq
67
67
  @k = k
68
68
  @poisson_hash = {}
69
69
  (0..k).each do |n|
70
- p = (rate**n * ::Math::E**(-rate))/!n
70
+ p = (rate**n * ::Math::E**(-rate))/n.factorial
71
71
  @poisson_hash[n] = p
72
72
  end
73
73
  end
@@ -155,9 +155,9 @@ class Integer
155
155
  # factorial method for an Integer
156
156
  # @return [Integer] factorial for given Integer
157
157
  # @example factorial for 5
158
- # !5
158
+ # 5.factorial
159
159
  # => 120
160
- def !
160
+ def factorial
161
161
  if self == 0
162
162
  return 1
163
163
  else
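The two hunks above replace an overridden `Integer#!` with a named `factorial` method; the README notes in this release attribute the Rails conflict to the `!` override. The Poisson table itself can be sketched without touching `Integer` at all. This is a minimal stand-alone version: `factorial`, `poisson_pmf` and `poisson_table` are hypothetical top-level helpers, not part of the gem.

```ruby
# Hypothetical stand-alone helper; the gem defines Integer#factorial instead.
def factorial(n)
  (1..n).reduce(1, :*) # empty range for n == 0 returns the initial value 1
end

# Poisson probability P(X = n) for a given rate, same formula as in the hunk above.
def poisson_pmf(rate, n)
  (rate**n * Math::E**(-rate)) / factorial(n)
end

# Probabilities for 0..k events, analogous to @poisson_hash.
def poisson_table(rate, k)
  (0..k).map { |n| [n, poisson_pmf(rate, n)] }.to_h
end

table = poisson_table(2.0, 5)
table.each { |n, prob| puts format("%d\t%.4f", n, prob) }
```

Keeping `!` untouched means expressions such as `!n` keep their usual logical-negation meaning for every other library loaded in the same process.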
@@ -0,0 +1,43 @@
1
+ module ViralSeq
2
+ class DRMs
3
+ def initialize (mutation_list = {})
4
+ @mutation_list = mutation_list
5
+ end
6
+
7
+ attr_accessor :mutation_list
8
+ end
9
+
10
+ def self.sdrm_hiv_pr(seq_hash)
11
+ end
12
+
13
+ def self.sdrm_hiv_rt(seq_hash)
14
+ end
15
+
16
+ def self.sdrm_hiv_in(seq_hash)
17
+ end
18
+
19
+ def self.list_from_json(file)
20
+ end
21
+
22
+ def self.list_from_csv(file)
23
+ end
24
+
25
+ def self.export_list_hiv_pr(file, format = :json)
26
+ if format == :json
27
+
28
+ end
29
+ end
30
+
31
+ def self.export_list_hiv_rt(file, format = :json)
32
+
33
+ end
34
+
35
+ def self.export_list_hiv_in(file, format = :json)
36
+
37
+ end
38
+
39
+ def drm_analysis(seq_hash)
40
+ mutation_list = self.mutation_list
41
+
42
+ end
43
+ end
@@ -394,7 +394,6 @@ module ViralSeq
394
394
  end
395
395
  end
396
396
  end
397
-
398
397
  consensus_seq += call_consensus_base(max_base_list)
399
398
  end
400
399
  return consensus_seq
@@ -549,7 +548,7 @@ module ViralSeq
549
548
  if sequences.size == 0
550
549
  return 0
551
550
  else
552
- cut_off = 1
551
+ cut_off = Float::INFINITY
553
552
  l = sequences[0].size
554
553
  rate = sequences.size * error_rate
555
554
  count_mut = variant_for_poisson(sequences)
@@ -558,7 +557,7 @@ module ViralSeq
558
557
 
559
558
  poisson_hash.each do |k,v|
560
559
  cal = l * v
561
- obs = count_mut[k] ? count_mut[k] : 0
560
+ obs = count_mut[k] ? count_mut[k] : 1
562
561
  if obs >= fold_cutoff * cal
563
562
  cut_off = k
564
563
  break
@@ -742,6 +741,7 @@ module ViralSeq
742
741
  seq_hash_unique_pass = []
743
742
 
744
743
  seq_hash_unique.each do |seq|
744
+ next if seq.nil?
745
745
  loc = ViralSeq::Sequence.new('', seq).locator(ref_option, path_to_muscle)
746
746
  next unless loc # if locator tool fails, skip this seq.
747
747
  if start_nt.include?(loc[0]) && end_nt.include?(loc[1])
@@ -110,19 +110,21 @@ module ViralSeq
110
110
  raise ArgumentError.new(":overlap has to be Integer, input #{overlap} invalid.") unless overlap.is_a? Integer
111
111
  raise ArgumentError.new(":diff has to be float or integer, input #{diff} invalid.") unless (diff.is_a? Integer or diff.is_a? Float)
112
112
  joined_seq = {}
113
- seq_pair_hash.uniq_hash.each do |seq_pair, seq_names|
113
+ seq_pair_hash.each do |seq_name,seq_pair|
114
114
  r1_seq = seq_pair[0]
115
115
  r2_seq = seq_pair[1]
116
116
  if overlap.zero?
117
117
  joined_sequence = r1_seq + r2_seq
118
+ elsif diff.zero?
119
+ if r1_seq[-overlap..-1] == r2_seq[0,overlap]
120
+ joined_sequence= r1_seq + r2_seq[overlap..-1]
121
+ end
118
122
  elsif r1_seq[-overlap..-1].compare_with(r2_seq[0,overlap]) <= (overlap * diff)
119
123
  joined_sequence= r1_seq + r2_seq[overlap..-1]
120
124
  else
121
125
  next
122
126
  end
123
- seq_names.each do |seq_name|
124
- joined_seq[seq_name] = joined_sequence
125
- end
127
+ joined_seq[seq_name] = joined_sequence if joined_sequence
126
128
  end
127
129
 
128
130
  joined_seq_hash = ViralSeq::SeqHash.new
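The reworked loop above joins every pair by name and adds a shortcut for `diff == 0`, where plain string equality on the overlapping ends replaces a per-base comparison. A self-contained sketch of that logic follows. `join_pair` and `mismatch_count` are hypothetical names; `mismatch_count` is a stand-in for the gem's `String#compare_with`, whose implementation is not shown in this diff, and the sketch assumes `overlap` never exceeds the read length.

```ruby
# Count mismatched positions between two equal-length strings
# (stand-in for String#compare_with, which is not shown in this diff).
def mismatch_count(a, b)
  a.each_char.zip(b.each_char).count { |x, y| x != y }
end

# Join R1 and R2 for a known overlap length.
# diff is the tolerated mismatch fraction within the overlap.
# Returns nil when the overlapping ends disagree too much.
def join_pair(r1, r2, overlap, diff = 0.0)
  return r1 + r2 if overlap.zero?
  r1_tail = r1[-overlap..-1]
  r2_head = r2[0, overlap]
  if diff.zero?
    # exact match required, so string equality is enough
    r1_tail == r2_head ? r1 + r2[overlap..-1] : nil
  elsif mismatch_count(r1_tail, r2_head) <= overlap * diff
    r1 + r2[overlap..-1]
  end
end

pairs = {
  ">seq1" => ["AAAGGGCCC", "CCCTTTAAA"], # overlap matches exactly
  ">seq2" => ["AAAGGGCCC", "CGCTTTAAA"]  # one mismatch in the overlap
}

joined = {}
pairs.each do |name, (r1, r2)|
  seq = join_pair(r1, r2, 3)
  joined[name] = seq if seq # pairs that fail to join are skipped
end
puts joined.inspect
```

With a 3-base overlap and `diff` left at zero, only `>seq1` is joined; allowing one mismatch in three (for example `diff = 0.34`) would also join `>seq2`.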
@@ -102,16 +102,18 @@ module ViralSeq
102
102
  end
103
103
 
104
104
  # sort array of file names to determine if there are potential errors
105
- # input name_array array of file names
106
- # output hash { }
105
+ # @param name_array [Array] array of file names
106
+ # @return [Hash] name check results
107
107
 
108
108
  def validate_file_name(name_array)
109
- errors = { file_type_error: [] ,
109
+ errors = {
110
+ file_type_error: [] ,
110
111
  missing_r1_file: [] ,
111
112
  missing_r2_file: [] ,
112
113
  extra_r1_r2_file: [],
113
114
  no_region_tag: [] ,
114
- multiple_region_tag: []}
115
+ multiple_region_tag: []
116
+ }
115
117
 
116
118
  passed_libs = {}
117
119
 
@@ -163,6 +165,13 @@ module ViralSeq
163
165
  end
164
166
  end
165
167
 
168
+ file_name_with_lib_name = {}
169
+ passed_libs.each do |lib_name, files|
170
+ files.each do |f|
171
+ file_name_with_lib_name[f] = lib_name
172
+ end
173
+ end
174
+
166
175
  passed_names = []
167
176
 
168
177
  passed_libs.values.each { |names| passed_names += names}
@@ -173,7 +182,27 @@ module ViralSeq
173
182
  pass = true
174
183
  end
175
184
 
176
- return { errors: errors, all_pass: pass, passed_names: passed_names, passed_libs: passed_libs }
185
+ file_name_with_error_type = {}
186
+
187
+ errors.each do |type, files|
188
+ files.each do |f|
189
+ file_name_with_error_type[f] ||= []
190
+ file_name_with_error_type[f] << type.to_s.tr("_", "\s")
191
+ end
192
+ end
193
+
194
+ file_check = []
195
+
196
+ name_array.each do |name|
197
+ file_check_hash = {}
198
+ file_check_hash[:fileName] = name
199
+ file_check_hash[:errors] = file_name_with_error_type[name]
200
+ file_check_hash[:libName] = file_name_with_lib_name[name]
201
+
202
+ file_check << file_check_hash
203
+ end
204
+
205
+ return { allPass: pass, files: file_check }
177
206
  end
178
207
 
179
208
  # filter r1 raw sequences for non-specific primers.
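The return value of `validate_file_name` changes shape in this hunk: the old keys (`errors`, `all_pass`, `passed_names`, `passed_libs`) give way to `allPass` plus a per-file list, so existing callers need updating. The sketch below shows one way a caller might consume the new shape. The `result` hash is hand-written for illustration, the file names are made up and say nothing about the naming rules the method enforces, and `failed_files` and `report_lines` are hypothetical helpers.

```ruby
# Error keys become readable labels the same way as in the hunk above.
label = :missing_r2_file.to_s.tr("_", " ")

# Hand-written example in the new return shape (file names are invented).
result = {
  allPass: false,
  files: [
    { fileName: "sampleA_R1.fastq.gz", errors: nil, libName: "sampleA" },
    { fileName: "sampleA_R2.fastq.gz", errors: nil, libName: "sampleA" },
    { fileName: "sampleB_R1.fastq.gz", errors: [label], libName: nil }
  ]
}

# Entries whose :errors value is a non-empty array.
def failed_files(result)
  result[:files].select { |f| f[:errors] && !f[:errors].empty? }
end

# One human-readable line per failing file.
def report_lines(result)
  failed_files(result).map { |f| "#{f[:fileName]}: #{f[:errors].join('; ')}" }
end

puts report_lines(result)
```

Files without errors carry `errors: nil` rather than an empty array, so callers should guard against `nil` before iterating.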
@@ -276,7 +305,9 @@ module ViralSeq
276
305
  end
277
306
 
278
307
  def general_filter(seq)
279
- if seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
308
+ if seq.size < $platform_sequencing_length
309
+ return false
310
+ elsif seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
280
311
  return false
281
312
  elsif seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
282
313
  return false
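The new length check reads the global `$platform_sequencing_length`, which `bin/tcs` sets from `params[:platform_format]` with 300 as the fallback. A variant that passes the minimum length explicitly is sketched below for comparison. It is not the gem's API: `platform_length` and `passes_general_filter?` are hypothetical names, and only the three conditions visible in this hunk are reproduced.

```ruby
DEFAULT_PLATFORM_LENGTH = 300 # same fallback as in bin/tcs

# Read length taken from the params hash, falling back to the default.
def platform_length(params)
  params[:platform_format] || DEFAULT_PLATFORM_LENGTH
end

# True when the read passes the checks shown in the hunk above.
def passes_general_filter?(seq, min_length)
  return false if seq.size < min_length # shorter than the platform read length
  return false if seq[1..-2] =~ /N/     # ambiguity outside the first/last position
  return false if seq =~ /A{11}/        # poly-A run indicates adaptor sequence
  true
end

min_len = platform_length({ platform_format: 150 })
read = "ACGT" * 40 # 160 bases, no ambiguities
puts passes_general_filter?(read, min_len)
```

Passing the length as an argument keeps the filter testable without setting a global first; the trade-off is one more parameter at every call site.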
@@ -13,6 +13,22 @@ module ViralSeq
13
13
  print '> '
14
14
  param[:raw_sequence_dir] = gets.chomp.rstrip
15
15
 
16
+ puts "Choose MiSeq Platform (1-3):\n1. 150x7x150\n2. 250x7x250\n3. 300x7x300 (default)"
17
+ print "> "
18
+ pf_option = gets.chomp.rstrip
19
+ # while ![1,2,3].include?(pf_option.to_i)
20
+ # print "Entered MiSeq Platform #{pf_option.red.bold} not valid (choose 1-3), try again\n> "
21
+ # pf_option = gets.chomp.rstrip
22
+ # end
23
+ case pf_option.to_i
24
+ when 1
25
+ param[:platform_format] = 150
26
+ when 2
27
+ param[:platform_format] = 250
28
+ else
29
+ param[:platform_format] = 300
30
+ end
31
+
16
32
  puts 'Enter the estimated platform error rate (for TCS cut-off calculation), default as ' + '0.02'.red.bold
17
33
  print '> '
18
34
  input_error = gets.chomp.rstrip.to_f
@@ -52,12 +68,12 @@ module ViralSeq
52
68
  if ej =~ /y|yes/i
53
69
  data[:end_join] = true
54
70
 
55
- print "End-join option? Choose from (1-4):\n
56
- 1: simple join, no overlap
57
- 2: known overlap \n
58
- 3: unknow overlap, use sample consensus to determine overlap, all sequence pairs have same overlap\n
59
- 4: unknow overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap\n
60
- > "
71
+ puts "End-join option? Choose from (1-4):"
72
+ puts "1: simple join, no overlap"
73
+ puts "2: known overlap"
74
+ puts "3: unknown overlap, use sample consensus to determine overlap, all sequence pairs have same overlap"
75
+ puts "4: unknown overlap, determine overlap by individual sequence pairs, sequence pairs can have different overlap"
76
+ print "> "
61
77
  ej_option = gets.chomp.rstrip
62
78
  while ![1,2,3,4].include?(ej_option.to_i)
63
79
  puts "Entered end-join option #{ej_option.red.bold} not valid (choose 1-4), try again"
@@ -138,7 +154,12 @@ module ViralSeq
138
154
  if save_option =~ /y|yes/i
139
155
  print "Path to save JSON file:\n> "
140
156
  path = gets.chomp.rstrip
141
- File.open(path, 'w') {|f| f.puts JSON.pretty_generate(param)}
157
+ while !validate_path_name(path)
158
+ print "Entered path no valid, try again.\n".red.bold
159
+ print "Path to save JSON file:\n> "
160
+ path = gets.chomp.rstrip
161
+ end
162
+ File.open(validate_path_name(path), 'w') {|f| f.puts JSON.pretty_generate(param)}
142
163
  end
143
164
 
144
165
  print "\nDo you wish to execute tcs pipeline with the input params now? Y/N \n> "
@@ -147,7 +168,7 @@ module ViralSeq
147
168
  if rsp =~ /y/i
148
169
  return param
149
170
  else
150
- abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`"
171
+ abort "Params json file generated. You can execute tcs pipeline using `tcs -p [params.json]`".blue
151
172
  end
152
173
 
153
174
  end
@@ -172,7 +193,17 @@ module ViralSeq
172
193
  when 3
173
194
  :MAC239
174
195
  end
175
- end
176
- end
196
+ end # end of get_ref
197
+
198
+ def validate_path_name(path)
199
+ if path.empty?
200
+ return false
201
+ elsif File.directory? path
202
+ return File.join(path, 'params.json')
203
+ elsif File.directory?(File.dirname(path))
204
+ return path
205
+ end
206
+ end # end of validate_path_name
207
+ end # end of class << self
177
208
  end # end TcsJson
178
209
  end # end main module
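`validate_path_name` has three outcomes worth spelling out: `false` for an empty string, a `params.json` path inside an existing directory, or the path itself when its parent directory exists. When the parent directory is missing the method falls through and returns `nil`, which the `while` loop above treats as invalid. The stand-alone copy below, under the hypothetical name `resolve_params_path`, makes that last case explicit.

```ruby
require 'tmpdir'

# Stand-alone copy of the path check, for illustration only.
def resolve_params_path(path)
  return false if path.empty?
  return File.join(path, 'params.json') if File.directory?(path)
  return path if File.directory?(File.dirname(path))
  nil # parent directory missing: the caller should prompt again
end

Dir.mktmpdir do |dir|
  p resolve_params_path("")                                  # empty input
  p resolve_params_path(dir)                                 # existing directory
  p resolve_params_path(File.join(dir, "run1.json"))         # new file in existing directory
  p resolve_params_path(File.join(dir, "nope", "run1.json")) # parent does not exist
end
```

A bare file name such as `params.json` resolves against the current directory, because `File.dirname` returns `"."` for it.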
@@ -2,6 +2,6 @@
2
2
  # version info and history
3
3
 
4
4
  module ViralSeq
5
- VERSION = "1.0.10"
6
- TCS_VERSION = "2.1.0"
5
+ VERSION = "1.1.0"
6
+ TCS_VERSION = "2.2.0"
7
7
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: viral_seq
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.10
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shuntai Zhou
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2020-11-12 00:00:00.000000000 Z
12
+ date: 2021-03-26 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bundler
@@ -90,6 +90,7 @@ email:
90
90
  executables:
91
91
  - locator
92
92
  - tcs
93
+ - tcs_log
93
94
  extensions: []
94
95
  extra_rdoc_files: []
95
96
  files:
@@ -104,6 +105,8 @@ files:
104
105
  - Rakefile
105
106
  - bin/locator
106
107
  - bin/tcs
108
+ - bin/tcs_log
109
+ - doc/dr.json
107
110
  - lib/viral_seq.rb
108
111
  - lib/viral_seq/constant.rb
109
112
  - lib/viral_seq/enumerable.rb
@@ -114,6 +117,7 @@ files:
114
117
  - lib/viral_seq/pid.rb
115
118
  - lib/viral_seq/ref_seq.rb
116
119
  - lib/viral_seq/rubystats.rb
120
+ - lib/viral_seq/sdrm.rb
117
121
  - lib/viral_seq/seq_hash.rb
118
122
  - lib/viral_seq/seq_hash_pair.rb
119
123
  - lib/viral_seq/sequence.rb
@@ -142,7 +146,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
142
146
  version: '0'
143
147
  requirements:
144
148
  - R required for some functions
145
- rubygems_version: 3.1.2
149
+ rubygems_version: 3.2.2
146
150
  signing_key:
147
151
  specification_version: 4
148
152
  summary: A Ruby Gem containing bioinformatics tools for processing viral NGS data.