RubyGems - viral_seq - Versions diffs - 1.0.8 → 1.0.13 - Mend

viral_seq 1.0.8 → 1.0.13


Files changed (18)

checksums.yaml +4 -4
data/Gemfile.lock +3 -3
data/README.md +120 -57
data/bin/tcs +140 -214
data/lib/viral_seq.rb +3 -0
data/lib/viral_seq/constant.rb +5 -1
data/lib/viral_seq/enumerable.rb +0 -10
data/lib/viral_seq/hivdr.rb +1 -1
data/lib/viral_seq/math.rb +3 -3
data/lib/viral_seq/sdrm.rb +43 -0
data/lib/viral_seq/seq_hash.rb +38 -24
data/lib/viral_seq/seq_hash_pair.rb +6 -0
data/lib/viral_seq/tcs_core.rb +305 -0
data/lib/viral_seq/tcs_json.rb +178 -0
data/lib/viral_seq/version.rb +2 -2
data/viral_seq.gemspec +1 -1
metadata +8 -7
data/bin/tcs_json_generator +0 -170

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 8d79f0676fb23cdc25fb3b0161b5665ecfe082e2401f40a1de3a782d9fb3d52a
-  data.tar.gz: 01a09f4cfca1274bfb1b870cdad62614def01fdaded727ce9100eec377962401
+  metadata.gz: 7816fd2b8da8109a24a33b8663e5f4fa5f098ed590c7403f909b593ebdd78c2f
+  data.tar.gz: adaffa3e35268eaed0bb2d0c5a6ba387f8b09bc04561a3714f9e155a55466cd5
 SHA512:
-  metadata.gz: 042f11da57209003bc84b0f7c764a9953f0ca6c1fcd00a5e943be531162bc06c9d54e3c4ceb1305c91fe5795894e3da394a196899a4f1df83d97b826c5582411
-  data.tar.gz: b2b2bfb9a8e6d023f610b19311a1a1ea331fbaa804cf20aebc3a34f6b049240ec43fe10e92b9f00feef3fd78e922fe0ed39281146693358998020036b9553504
+  metadata.gz: 5bfbb3c2e78ae8ef01b1750b5135a76b7fdf65ecc00ccfe141e488154adfc9b0ddff42a58ee9f682f46576060632d12cd6e540bea26e81d8bd9e346f5e7bca84
+  data.tar.gz: 3b491d3070f2e7aacc73c1c9f4942fe770f2c0e4eebe5c021e5558ce7fa0e4299c44ddde51b07b4c3118d7832236be9703707209ea0ebfb6591a576871ad0804
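
Note on the checksums above: the new digests can be checked locally against an unpacked copy of the gem. The following is a minimal sketch, assuming the `.gem` archive has already been extracted so that `metadata.gz` and `data.tar.gz` exist on disk. The helper name `sha256_matches?` and the example path are hypothetical and are not part of viral_seq or RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not part of viral_seq): returns true when the file at
# `path` hashes to `expected_hex` under SHA-256. The expected value is
# normalized so that stray whitespace or upper-case hex does not cause a
# false mismatch.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call (the path is an assumption about where the archive was unpacked):
# sha256_matches?('viral_seq-1.0.13/data.tar.gz',
#                 'adaffa3e35268eaed0bb2d0c5a6ba387f8b09bc04561a3714f9e155a55466cd5')
```

The same pattern works for the SHA512 entries by swapping in `Digest::SHA512`.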

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    viral_seq (1.0.8)
+    viral_seq (1.0.13)
       colorize (~> 0.1)
       muscle_bio (~> 0.4)
@@ -11,7 +11,7 @@ GEM
     colorize (0.8.1)
     diff-lcs (1.3)
     muscle_bio (0.4.0)
-    rake (10.5.0)
+    rake (13.0.1)
     rspec (3.8.0)
       rspec-core (~> 3.8.0)
       rspec-expectations (~> 3.8.0)
@@ -31,7 +31,7 @@ PLATFORMS
 DEPENDENCIES
   bundler (~> 2.0)
-  rake (~> 10.0)
+  rake (~> 13.0)
   rspec (~> 3.0)
   viral_seq!
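
Note on the `rake` change above: moving the constraint from `~> 10.0` to `~> 13.0` changes which releases Bundler may resolve. A short sketch of how the pessimistic operator behaves, using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `allows?` is introduced here for illustration only.

```ruby
require 'rubygems'

# Illustrative helper: does `requirement` (for example "~> 13.0") accept `version`?
def allows?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

puts allows?('~> 13.0', '13.0.1')  # the locked version in 1.0.13
puts allows?('~> 13.0', '10.5.0')  # the version locked in 1.0.8
puts allows?('~> 10.0', '10.5.0')
puts allows?('~> 13.0', '14.0.0')  # "~> 13.0" stops short of the next major release
```

In other words, `~> 13.0` accepts any 13.x release but neither the old 10.5.0 lock nor a future 14.0.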

data/README.md CHANGED

@@ -4,109 +4,172 @@ A Ruby Gem containing bioinformatics tools for processing viral NGS data.
 Specifically for Primer-ID sequencing and HIV drug resistance analysis.
-## Installation
+## Install
+```bash
     $ gem install viral_seq
+```
 ## Usage
-#### Load all ViralSeq classes by requiring 'viral_seq.rb'
+### Excutables
-    #!/usr/bin/env ruby
-    require 'viral_seq'
-#### Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+Use executable `locator` to get the coordinates of the sequences on HIV/SIV reference genome from a FASTA file through a terminal
+```bash
     $ locator -i sequence.fasta -o sequence.fasta.csv
+```
+Use executable `tcs` pipeline to process Primer ID MiSeq sequencing data.
+```bash
+    $ tcs -p params.json # run TCS pipeline with params.json
+    $ tcs -j # CLI to generate params.json
+    $ tcs -h # print out the help
+```
 ## Some Examples
-#### Load nucleotide sequences from a FASTA format sequence file
+Load all ViralSeq classes by requiring 'viral_seq.rb' in your Ruby scripts.
+```ruby
+#!/usr/bin/env ruby
+require 'viral_seq'
+```
+Load nucleotide sequences from a FASTA format sequence file
-    my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
+```ruby
+my_seqhash = ViralSeq::SeqHash.fa('my_seq_file.fasta')
+```
-#### Make an alignment (using MUSCLE)
+Make an alignment (using MUSCLE)
-    aligned_seqhash = my_seqhash.align
+```ruby
+aligned_seqhash = my_seqhash.align
+```
-#### Filter nucleotide sequences with the reference coordinates (HIV Protease)
+Filter nucleotide sequences with the reference coordinates (HIV Protease)
-    qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
+```ruby
+qc_seqhash = aligned_seqhash.hiv_seq_qc(2253, 2549, false, :HXB2)
+```
-#### Further filter out sequences with Apobec3g/f hypermutations
+Further filter out sequences with Apobec3g/f hypermutations
-    qc_seqhash = qc_seqhash.a3g
+```ruby
+qc_seqhash = qc_seqhash.a3g
+```
-#### Calculate nucleotide diveristy π
+Calculate nucleotide diveristy π
-    qc_seqhash.pi
+```ruby
+qc_seqhash.pi
+```
-#### Calculate cut-off for minority variants based on Poisson model
+Calculate cut-off for minority variants based on Poisson model
-    cut_off = qc_seqhash.pm
+```ruby
+cut_off = qc_seqhash.pm
+```
-#### Examine for drug resistance mutations for HIV PR region
+Examine for drug resistance mutations for HIV PR region
-    qc_seqhash.sdrm_hiv_pr(cut_off)
+```ruby
+qc_seqhash.sdrm_hiv_pr(cut_off)
+```
+## Known issues
+  1. ~~have a conflict with rails.~~
+  2. ~~Update on 03032021. Still have conflict. But in rails gem file, can just use `requires: false` globally and only require "viral_seq" when the module is needed in controller.~~
+  3. The conflict seems to be resovled. It was from a combination of using `!` as a function for factorial and the gem name `viral_seq`. @_@
 ## Updates
-Version 1.0.8-02282020:
+### Version 1.1.3-03032021
+  1. Fixed the conflict with rails.
+### Version 1.1.2-03032021
+  1. Fixed an issue that may cause conflicts with ActiveRecord.
+### Version 1.1.1-03022021
+  1. Fixed an issue when calculating Poisson cutoff for minority mutations `ViralSeq::SeqHash.pm`.
+  2. fixed an issue loading class 'OptionParser'in some ruby environments.
+### Version 1.1.0-11112020:
+  1. Modularize TCS pipeline. Move key functions into /viral_seq/tcs_core.rb
+  2. `tcs_json_generator` is removed. This CLI is delivered within the `tcs` pipeline, by running `tcs -j`. The scripts are included in the /viral_seq/tcs_json.rb
+  3. consensus model now includes a true simple majority model, where no nt needs to be over 50% to be called.
+  4. a few optimizations.
+  5. TCS 2.1.0 delivered.
+  6. Tried parallel processing. Cannot achieve the goal because `parallel` gem by default can't pool data from memory of child processors and `in_threads` does not help with the speed.
+### Version 1.0.9-07182020:
+  1. Change ViralSeq::SeqHash#stop_codon and ViralSeq::SeqHash#a3g_hypermut return value to hash object.
+  2. TCS pipeline updated to version 2.0.1. Add optional `export_raw: TRUE/FALSE` in json params. If `export_raw` is `TRUE`, raw sequence reads (have to pass quality filters) will be exported, along with TCS reads.
+### Version 1.0.8-02282020:
-    1. TCS pipeline added as executable.
-        tcs  -  main TCS pipeline script.
-        tcs_json_generator  -  step-by-step script to generate json file for tcs pipeline.
+  1. TCS pipeline (version 2.0.0) added as executable.
+      tcs  -  main TCS pipeline script.
+      tcs_json_generator  -  step-by-step script to generate json file for tcs pipeline.
-    2. Methods added:
-        ViralSeq::SeqHash#trim
+  2. Methods added:
+      ViralSeq::SeqHash#trim
-    3. Bug fix for several methods.
+  3. Bug fix for several methods.
-Version 1.0.7-01282020:
+### Version 1.0.7-01282020:
-    1. Several methods added, including
-        ViralSeq::SeqHash#error_table
-        ViralSeq::SeqHash#random_select
-    2. Improved performance for several functions.
+  1. Several methods added, including
+      ViralSeq::SeqHash#error_table
+      ViralSeq::SeqHash#random_select
+  2. Improved performance for several functions.
-Version 1.0.6-07232019:
+### Version 1.0.6-07232019:
-    1. Several methods added to ViralSeq::SeqHash, including
-        ViralSeq::SeqHash#size
-        ViralSeq::SeqHash#+
-        ViralSeq::SeqHash#write_nt_fa
-        ViralSeq::SeqHash#mutation
-    2. Update documentations and rspec samples.
+  1. Several methods added to ViralSeq::SeqHash, including
+      ViralSeq::SeqHash#size
+      ViralSeq::SeqHash#+
+      ViralSeq::SeqHash#write_nt_fa
+      ViralSeq::SeqHash#mutation
+  2. Update documentations and rspec samples.
-Version 1.0.5-07112019:
+### Version 1.0.5-07112019:
-    1. Update ViralSeq::SeqHash#sequence_locator.
-       Program will try to determine the direction (`+` or `-` of the query sequence)
-    2. update executable `locator` to have a column of `direction` in output .csv file
+  1. Update ViralSeq::SeqHash#sequence_locator.
+     Program will try to determine the direction (`+` or `-` of the query sequence)
+  2. update executable `locator` to have a column of `direction` in output .csv file
-Version 1.0.4-07102019:
+### Version 1.0.4-07102019:
-    1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
-    2. Fix bugs in bin `locator`
+  1. Use home directory (Dir.home) instead of the directory of the script file for temp MUSCLE file.
+  2. Fix bugs in bin `locator`
-Version 1.0.3-07102019:
+### Version 1.0.3-07102019:
-    1. Bug fix.
+  1. Bug fix.
-Version 1.0.2-07102019:
+### Version 1.0.2-07102019:
-    1. Fixed a gem loading issue.
+  1. Fixed a gem loading issue.
-Version 1.0.1-07102019:
+### Version 1.0.1-07102019:
-    1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
-    2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
-    3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
-    4. update documentations
+  1. Add keyword argument :model to ViralSeq::SeqHashPair#join2.
+  2. Add method ViralSeq::SeqHash#sequence_locator (also: #loc), a function to locate sequences on HIV/SIV reference genomes, as HIV Sequence Locator from LANL.
+  3. Add executable 'locator'. An HIV/SIV sequence locator tool similar to LANL Sequence Locator.
+  4. update documentations
-Version 1.0.0-07092019:
+### Version 1.0.0-07092019:
-    1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
+  1. Rewrote the whole ViralSeq gem, grouping methods into modules and classes under main Module::ViralSeq
 ## Development
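
Note on the 1.1.0 changelog entry above ("a true simple majority model, where no nt needs to be over 50% to be called"): in `data/bin/tcs` the default `majority_cut_off` changes from `0.5` to `0` in the same release. The sketch below illustrates the difference between the two settings. It is not the gem's `ViralSeq::SeqHash#consensus` implementation, which this diff does not show. The function name `sketch_consensus`, the choice of `N` for unresolved positions, and the handling of ties are all assumptions made for illustration.

```ruby
# Hypothetical sketch (not viral_seq code): per-position consensus of
# equal-length reads.
#   cutoff = 0   -> simple majority: the most frequent base is called even if
#                   it makes up less than half of the reads.
#   cutoff = 0.5 -> the most frequent base must exceed 50% of the reads.
# Assumption of this sketch: ties and sub-threshold positions become "N".
def sketch_consensus(seqs, cutoff = 0)
  (0...seqs.first.size).map do |i|
    # Count each base at position i.
    counts = seqs.each_with_object(Hash.new(0)) { |s, h| h[s[i]] += 1 }
    ranked = counts.sort_by { |_base, n| -n }
    top_base, top_n = ranked.first
    tie = ranked.size > 1 && ranked[1][1] == top_n
    fraction = top_n.fdiv(seqs.size)
    tie || fraction <= cutoff ? 'N' : top_base
  end.join
end

reads = %w[ACGT ACGA ATGC ACTG AGGG]
puts sketch_consensus(reads)       # last position is G in only 2 of 5 reads, still called
puts sketch_consensus(reads, 0.5)  # last position falls below the 50% threshold
```

Because the pipeline skips any consensus matching `/[^ATCG]/`, a stricter cut-off would discard such a Primer ID family in this sketch's terms, while the simple majority setting would keep it.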

data/bin/tcs CHANGED

@@ -28,180 +28,79 @@
 require 'viral_seq'
 require 'json'
 require 'colorize'
+require 'optparse'
-# updated the ViralSeq module. Push with the new version.
-module ViralSeq
-  class SeqHash
-    def self.new_from_fastq(fastq_file)
-      count = 0
-      sequence_a = []
-      quality_a = []
-      count_seq = 0
-      File.open(fastq_file,'r') do |file|
-        file.readlines.collect do |line|
-          count +=1
-          count_m = count % 4
-          if count_m == 1
-            line.tr!('@','>')
-            sequence_a << line.chomp
-            quality_a << line.chomp
-            count_seq += 1
-          elsif count_m == 2
-            sequence_a << line.chomp
-          elsif count_m == 0
-            quality_a << line.chomp
-          end
-        end
-      end
-      sequence_hash = Hash[sequence_a.each_slice(2).to_a]
-      quality_hash = Hash[quality_a.each_slice(2).to_a]
-      seq_hash = ViralSeq::SeqHash.new
-      seq_hash.dna_hash = sequence_hash
-      seq_hash.qc_hash = quality_hash
-      seq_hash.title = File.basename(fastq_file,".*")
-      seq_hash.file = fastq_file
-      return seq_hash
-    end # end of ::new_from_fastq
-    class << self
-      alias_method :fq, :new_from_fastq
-    end
-  end
-end
-module ViralSeq
-  class SeqHash
-    def trim(start_nt, end_nt, ref_option = :HXB2, path_to_muscle = false)
-      seq_hash = self.dna_hash.dup
-      seq_hash_unique = seq_hash.uniq_hash
-      trimmed_seq_hash = {}
-      seq_hash_unique.each do |seq, names|
-        trimmed_seq = ViralSeq::Sequence.new('', seq).sequence_clip(start_nt, end_nt, ref_option, path_to_muscle).dna
-        names.each do |name|
-          trimmed_seq_hash[name] = trimmed_seq
-        end
-      end
-      return_seq_hash = self.dup
-      return_seq_hash.dna_hash = trimmed_seq_hash
-      return return_seq_hash
-    end
-  end
-end
-# end of additonal methods. Delete before publish
-# calculate consensus cutoff
-def calculate_cut_off(m, error_rate = 0.02)
-  n = 0
-  case error_rate
-  when 0.005...0.015
-    if m <= 10
-      n = 2
-    else
-      n = 1.09*10**-26*m**6 + 7.82*10**-22*m**5 - 1.93*10**-16*m**4 + 1.01*10**-11*m**3 - 2.31*10**-7*m**2 + 0.00645*m + 2.872
-    end
+options = {}
-  when 0...0.005
-    if m <= 10
-      n = 2
-    else
-      n = -9.59*10**-27*m**6 + 3.27*10**-21*m**5 - 3.05*10**-16*m**4 + 1.2*10**-11*m**3 - 2.19*10**-7*m**2 + 0.004044*m + 2.273
-    end
+banner = '-'*50 + "\n" +
+        '| The TCS Pipeline ' + "Version #{ViralSeq::TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |' + "\n" +
+        '-'*50 + "\n"
-  else
-    if m <= 10
-      n = 2
-    elsif m <= 8500
-      n = -1.24*10**-21*m**6 + 3.53*10**-17*m**5 - 3.90*10**-13*m**4 + 2.12*10**-9*m**3 - 6.06*10**-6*m**2 + 1.80*10**-2*m + 3.15
-    else
-      n = 0.0079 * m + 9.4869
-    end
+OptionParser.new do |opts|
+  opts.banner = banner + "Usage: tcs -j"
+  opts.on "-j", "--json_generator", "Command line interfac to generate new params json file" do |j|
+    options[:json_generator] = true
   end
-  n = n.round
-  n = 2 if n < 3
-  return n
-end
+  opts.on("-p", "--params PARAMS_JSON", "Execute the pipeline with input params json file") do |p|
+    options[:params_json] = p
+  end
+  opts.on("-h", "--help", "Prints this help") do
+    puts opts
+    exit
+  end
-TCS_VERSION = "2.0.0"
+  opts.on("-v", "--version", "Version info") do
+    puts "tcs version: " + ViralSeq::TCS_VERSION.red.bold
+    puts "viral_seq version: " + ViralSeq::VERSION.red.bold
+    exit
+  end
-puts "\n" + '-'*58
-puts '| JSON Parameter Generator for ' + "TCS #{TCS_VERSION}".red.bold + " by " + "Shuntai Zhou".blue.bold + ' |'
-puts '-'*58 + "\n"
+  # opts.on("--no-parallel", "toggle off parallel processing") do
+  #   options[:no_parallel] = true
+  # end
+end.parse!
-unless ARGV[0]
-  raise "No JSON param file found. Script terminated."
+if options[:json_generator]
+  params = ViralSeq::TcsJson.generate
+elsif (options[:params_json] && File.exist?(options[:params_json]))
+  params = JSON.parse(File.read(options[:params_json]), symbolize_names: true)
+else
+  abort "No params JSON file found. Script terminated.".red
 end
-params = JSON.parse(File.read(ARGV[0]), symbolize_names: true)
 indir = params[:raw_sequence_dir]
 unless File.exist?(indir)
-  raise "No input sequence directory found. Script terminated."
-end
-libname = File.basename(indir)
-# obtain R1 and R2 file path
-files = []
-Dir.chdir(indir) do
-  files = Dir.glob("*")
+  abort "No input sequence directory found. Script terminated.".red.bold
 end
-if files.empty?
-  raise "Input dir does not contain files. Script terminated."
-end
-r1_f = ""
-r2_f = ""
+# log file
-# unzip .fasta.gz
-def unzip_r(indir, f)
-  r_file = indir + "/" + f
-  if f =~ /.gz/
-    `gzip -d #{r_file}`
-    new_f = f.sub ".gz", ""
-    r_file = File.join(indir, new_f)
-  end
-  return r_file
-end
 runtime_log_file = File.join(indir,"runtime.log")
 log = File.open(runtime_log_file, "w")
-log.puts "TSC pipeline Version " + TCS_VERSION.to_s
+log.puts "TSC pipeline Version " + ViralSeq::TCS_VERSION.to_s
 log.puts "viral_seq Version " + ViralSeq::VERSION.to_s
 log.puts Time.now.to_s + "\t" + "Start TCS pipeline..."
+libname = File.basename indir
-files.each do |f|
-  t = f.split("_")
-  if t.size == 1
-    tag = f
-  else
-    tag = f.split("_")[1..-1].join("_")
-  end
-  if tag =~ /r1/i
-    r1_f = unzip_r(indir, f)
-  elsif tag =~ /r2/i
-    r2_f = unzip_r(indir, f)
-  end
-end
+seq_files = ViralSeq::TcsCore.r1r2 indir
-unless File.exist?(r1_f)
-  log.puts "R1 file not found. Script terminated."
-  raise "R1 file not found. Script terminated."
+if seq_files[:r1_file].size > 0 and seq_files[:r2_file].size > 0
+  r1_f = seq_files[:r1_file]
+  r2_f = seq_files[:r2_file]
+elsif seq_files[:r1_file].size > 0 and seq_files[:r2_file].empty?
+  exit_sig = "Missing R2 file. Aborted."
+elsif seq_files[:r2_file].size > 0 and seq_files[:r1_file].empty?
+  exit_sig = "Missing R1 file. Aborted."
+else
+  exit_sig = "Cannot determine R1 R2 file in #{indir}. Aborted."
 end
-unless File.exist?(r2_f)
-  log.puts "R2 file not found. Script terminated."
-  raise "R2 file not found. Script terminated."
+if exit_sig
+  ViralSeq::TcsCore.log_and_abort log, exit_sig
 end
 r1_fastq_sh = ViralSeq::SeqHash.fq(r1_f)
@@ -218,13 +117,13 @@ end
 primers = params[:primer_pairs]
 if primers.empty?
-  log.puts "No primer information. Script terminated."
-  raise "No primer information. Script terminated."
+  ViralSeq::TcsCore.log_and_abort log, "No primer information. Script terminated."
 end
 primers.each do |primer|
   summary_json = {}
-  summary_json[:tcs_version] = TCS_VERSION
+  summary_json[:tcs_version] = ViralSeq::TCS_VERSION
   summary_json[:viralseq_version] = ViralSeq::VERSION
   summary_json[:runtime] = Time.now.to_s
@@ -233,6 +132,9 @@ primers.each do |primer|
   cdna_primer = primer[:cdna]
   forward_primer = primer[:forward]
+  export_raw = primer[:export_raw]
   unless cdna_primer
     log.puts Time.now.to_s + "\t" + region + " does not have cDNA primer sequence. #{region} skipped."
   end
@@ -242,66 +144,25 @@ primers.each do |primer|
   summary_json[:cdan_primer] = cdna_primer
   summary_json[:forward_primer] = forward_primer
-  primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0.5
+  primer[:majority] ? majority_cut_off = primer[:majority] : majority_cut_off = 0
   summary_json[:majority_cut_off] = majority_cut_off
   summary_json[:total_raw_sequence] = raw_sequence_number
   log.puts Time.now.to_s + "\t" +  "Porcessing #{region}..."
-  r1_raw = r1_fastq_sh.dna_hash
-  r2_raw = r2_fastq_sh.dna_hash
+  # filter R1
   log.puts Time.now.to_s + "\t" +  "filtering R1..."
-  # obtain biological forward primer sequence
-  if forward_primer.match(/(N+)(\w+)$/)
-    forward_n = $1.size
-    forward_bio_primer = $2
-  else
-    forward_n = 0
-    forward_bio_primer = forward_primer
-  end
-  forward_bio_primer_size = forward_bio_primer.size
-  forward_starting_number = forward_n + forward_bio_primer_size
-  # filter R1 sequences with forward primers.
-  forward_primer_ref = forward_bio_primer.nt_parser
-  r1_passed_seq = {}
-  r1_raw.each do |name,seq|
-    next if seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
-    next if seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
-    next if seq =~ /T{11}/ # a string of poly-T indicates adaptor sequence
-    primer_region_seq = seq[forward_n, forward_bio_primer_size]
-    if primer_region_seq =~ forward_primer_ref
-      r1_passed_seq[name.split("\s")[0]] = seq
-    end
-  end
+  filter_r1 = ViralSeq::TcsCore.filter_r1(r1_fastq_sh, forward_primer)
+  r1_passed_seq = filter_r1[:r1_passed_seq]
   log.puts Time.now.to_s + "\t" +  "R1 filtered: #{r1_passed_seq.size.to_s}"
   summary_json[:r1_filtered_raw] = r1_passed_seq.size
+  # filter R2
   log.puts Time.now.to_s + "\t" +  "filtering R2..."
-  # obtain biological reverse primer sequence
-  cdna_primer.match(/(N+)(\w+)$/)
-  pid_length = $1.size
-  cdna_bio_primer = $2
-  cdna_bio_primer_size = cdna_bio_primer.size
-  reverse_starting_number = pid_length + cdna_bio_primer_size
-  # filter R2 sequences with cDNA primers.
-  cdna_primer_ref = cdna_bio_primer.nt_parser
-  r2_passed_seq = {}
-  r2_raw.each do |name, seq|
-    next if seq[1..-2] =~ /N/ # sequences with ambiguities except the 1st and last position removed
-    next if seq =~ /A{11}/ # a string of poly-A indicates adaptor sequence
-    next if seq =~ /T{11}/ # a string of poly-T indicates adaptor sequence
-    primer_region_seq = seq[pid_length, cdna_bio_primer_size]
-    if primer_region_seq =~ cdna_primer_ref
-      r2_passed_seq[name.split("\s")[0]] = seq
-    end
-  end
+  filter_r2 = ViralSeq::TcsCore.filter_r2(r2_fastq_sh, cdna_primer)
+  r2_passed_seq = filter_r2[:r2_passed_seq]
+  pid_length = filter_r2[:pid_length]
   log.puts Time.now.to_s + "\t" +  "R2 filtered: #{r2_passed_seq.size.to_s}"
   summary_json[:r2_filtered_raw] = r2_passed_seq.size
@@ -320,8 +181,8 @@ primers.each do |primer|
     r2_seq = r2_passed_seq[seqtag]
     pid = r2_seq[0, pid_length]
     id[seqtag] = pid
-    bio_r2[seqtag] = r2_seq[reverse_starting_number..-2]
-    bio_r1[seqtag] = r1_seq[forward_starting_number..-2]
+    bio_r2[seqtag] = r2_seq[filter_r2[:reverse_starting_number]..-2]
+    bio_r1[seqtag] = r1_seq[filter_r1[:forward_starting_number]..-2]
   end
   # TCS cut-off
@@ -341,11 +202,10 @@ primers.each do |primer|
   end
   max_id = primer_id_dis.keys.sort[-5..-1].mean
-  consensus_cutoff = calculate_cut_off(max_id,error_rate)
+  consensus_cutoff = ViralSeq::TcsCore.calculate_cut_off(max_id,error_rate)
   log.puts Time.now.to_s + "\t" +  "Consensus cut-off is #{consensus_cutoff.to_s}"
   summary_json[:consensus_cutoff] = consensus_cutoff
   summary_json[:length_of_pid] = pid_length
   log.puts Time.now.to_s + "\t" +  "Creating consensus..."
   # Primer ID over the cut-off
@@ -363,10 +223,30 @@ primers.each do |primer|
   out_dir_consensus = File.join(out_dir_set, "consensus")
   Dir.mkdir(out_dir_consensus) unless File.directory?(out_dir_consensus)
-  outfile_r1 = File.join(out_dir_consensus, 'r1.txt')
-  outfile_r2 = File.join(out_dir_consensus, 'r2.txt')
+  outfile_r1 = File.join(out_dir_consensus, 'r1.fasta')
+  outfile_r2 = File.join(out_dir_consensus, 'r2.fasta')
   outfile_log = File.join(out_dir_set, 'log.json')
+  # if export_raw is true, create dir for raw sequence
+  if export_raw
+    out_dir_raw = File.join(out_dir_set, "raw")
+    Dir.mkdir(out_dir_raw) unless File.directory?(out_dir_raw)
+    outfile_raw_r1 = File.join(out_dir_raw, 'r1.raw.fasta')
+    outfile_raw_r2 = File.join(out_dir_raw, 'r2.raw.fasta')
+    raw_r1_f = File.open(outfile_raw_r1, 'w')
+    raw_r2_f = File.open(outfile_raw_r2, 'w')
+    bio_r1.keys.each do |k|
+      raw_r1_f.puts k + "_r1"
+      raw_r2_f.puts k + "_r2"
+      raw_r1_f.puts bio_r1[k]
+      raw_r2_f.puts bio_r2[k].rc
+    end
+    raw_r1_f.close
+    raw_r2_f.close
+  end
   # create TCS
   pid_seqtag_hash = {}
@@ -398,6 +278,8 @@ primers.each do |primer|
     consensus_name = ">" + primer_id + "_" + seq_with_same_primer_id.size.to_s + "_" + libname + "_" + region
     r1_consensus = ViralSeq::SeqHash.array(r1_sub_seq).consensus(majority_cut_off)
     r2_consensus = ViralSeq::SeqHash.array(r2_sub_seq).consensus(majority_cut_off)
+    # hide the following two lines if allowing sequence to have ambiguities.
     next if r1_consensus =~ /[^ATCG]/
     next if r2_consensus =~ /[^ATCG]/
@@ -435,8 +317,12 @@ primers.each do |primer|
   f1 = File.open(outfile_r1, 'w')
   f2 = File.open(outfile_r2, 'w')
   primer_id_in_use = {}
-  r1_seq_length = consensus_filtered.values[0][0].size
-  r2_seq_length = consensus_filtered.values[0][1].size
+  if n_con > 0
+    r1_seq_length = consensus_filtered.values[0][0].size
+    r2_seq_length = consensus_filtered.values[0][1].size
+  else
+    next
+  end
   log.puts Time.now.to_s + "\t" + "R1 sequence #{r1_seq_length} bp"
   log.puts Time.now.to_s + "\t" + "R1 sequence #{r2_seq_length} bp"
   consensus_filtered.each do |seq_name,seq|
@@ -447,6 +333,7 @@ primers.each do |primer|
   f1.close
   f2.close
+  # Primer ID distribution in .json file
   out_pid_json = File.join(out_dir_set, 'primer_id.json')
   pid_json = {}
   pid_json[:primer_id_in_use] = Hash[*(primer_id_in_use.sort_by {|k, v| [-v,k]}.flatten)]
@@ -456,19 +343,33 @@ primers.each do |primer|
     f.puts JSON.pretty_generate(pid_json)
   end
-  if primer[:end_join]
-    log.puts Time.now.to_s + "\t" +  "Start end-pairing for TCS..."
-    shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
-    case primer[:end_join_option]
+  # start end-join
+  def end_join(dir, option, overlap)
+    shp = ViralSeq::SeqHashPair.fa(dir)
+    case option
     when 1
-      joined_sh = shp.join1(primer[:overlap])
+      joined_sh = shp.join1()
+    when 2
+      joined_sh = shp.join1(overlap)
     when 3
       joined_sh = shp.join2
     when 4
       joined_sh = shp.join2(model: :indiv)
     end
+    return joined_sh
+  end
+  if primer[:end_join]
+    log.puts Time.now.to_s + "\t" +  "Start end-pairing for TCS..."
+    shp = ViralSeq::SeqHashPair.fa(out_dir_consensus)
+    joined_sh = end_join(out_dir_consensus, primer[:end_join_option], primer[:overlap])
     log.puts Time.now.to_s + "\t" + "Paired TCS number: " + joined_sh.size.to_s
     summary_json[:combined_tcs] = joined_sh.size
+    if export_raw
+      joined_sh_raw = end_join(out_dir_raw, primer[:end_join_option], primer[:overlap])
+    end
   else
     File.open(outfile_log, "w") do |f|
       f.puts JSON.pretty_generate(summary_json)
@@ -501,9 +402,30 @@ primers.each do |primer|
         joined_seq[seq_name] = seq + new_r2_seq[seq_name]
       end
       joined_sh = ViralSeq::SeqHash.new(joined_seq)
+      if export_raw
+        r1_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r1)
+        r2_sh_raw = ViralSeq::SeqHash.fa(outfile_raw_r2)
+        r1_sh_raw = r1_sh_raw.hiv_seq_qc(ref_start, (0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), indel, ref_genome)
+        r2_sh_raw = r2_sh_raw.hiv_seq_qc((0..(ViralSeq::RefSeq.get(ref_genome).size - 1)), ref_end, indel, ref_genome)
+        new_r1_seq_raw = r1_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
+        new_r2_seq_raw = r2_sh_raw.dna_hash.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
+        joined_seq_raw = {}
+        new_r1_seq_raw.each do |seq_name, seq|
+          next unless seq
+          next unless new_r2_seq_raw[seq_name]
+          joined_seq_raw[seq_name] = seq + new_r2_seq_raw[seq_name]
+        end
+        joined_sh_raw = ViralSeq::SeqHash.new(joined_seq_raw)
+      end
     else
       joined_sh = joined_sh.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
+      if export_raw
+        joined_sh_raw = joined_sh_raw.hiv_seq_qc(ref_start, ref_end, indel, ref_genome)
+      end
     end
     log.puts Time.now.to_s + "\t" + "Paired TCS number after QC based on reference genome: " + joined_sh.size.to_s
     summary_json[:combined_tcs_after_qc] = joined_sh.size
     if primer[:trim]
@@ -511,8 +433,12 @@ primers.each do |primer|
       trim_end = primer[:trim_ref_end]
       trim_ref = primer[:trim_ref].to_sym
       joined_sh = joined_sh.trim(trim_start, trim_end, trim_ref)
+      joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.fasta"))
+      if export_raw
+        joined_sh_raw = joined_sh_raw.trim(trim_start, trim_end, trim_ref)
+        joined_sh_raw.write_nt_fa(File.join(out_dir_raw, "combined.raw.fasta"))
+      end
     end
-    joined_sh.write_nt_fa(File.join(out_dir_consensus, "combined.txt"))
   end
   File.open(outfile_log, "w") do |f|