viral_seq 0.3.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 5429096d449465668f4b515257602d1451007b84ad20841d5035a66b45569cc5
4
+ data.tar.gz: c910521dcee98650f0db6a3c9d83a4435fae535b4a087a87117820da5c8459c1
5
+ SHA512:
6
+ metadata.gz: c2603d523affd6eca26bf39935f7137406a8f44a0451b1458ae528b631e0b916032aedacadb630cd00cdb9aeb745c0e276286e9480345c5add77e3ba47afe8eb
7
+ data.tar.gz: 9bfe02dee55538b08e268e96a5d0b2b1e27ae14db034e89fef47f0f00a0ac30e5533a6f69e43de21076335d34ab417bba7405748f3d4d8cca300732e8cab4c1b
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
12
+
13
+ # gem files
14
+ *.gem
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.travis.yml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ sudo: false
3
+ language: ruby
4
+ cache: bundler
5
+ rvm:
6
+ - 2.6.0
7
+ before_install: gem install bundler -v 2.0.1
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at swanstromlab@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in viral_seq.gemspec
4
+ gemspec
data/Gemfile.lock ADDED
@@ -0,0 +1,37 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ viral_seq (0.3.0)
5
+ muscle_bio (~> 0.4)
6
+
7
+ GEM
8
+ remote: https://rubygems.org/
9
+ specs:
10
+ diff-lcs (1.3)
11
+ muscle_bio (0.4.0)
12
+ rake (10.5.0)
13
+ rspec (3.8.0)
14
+ rspec-core (~> 3.8.0)
15
+ rspec-expectations (~> 3.8.0)
16
+ rspec-mocks (~> 3.8.0)
17
+ rspec-core (3.8.0)
18
+ rspec-support (~> 3.8.0)
19
+ rspec-expectations (3.8.3)
20
+ diff-lcs (>= 1.2.0, < 2.0)
21
+ rspec-support (~> 3.8.0)
22
+ rspec-mocks (3.8.0)
23
+ diff-lcs (>= 1.2.0, < 2.0)
24
+ rspec-support (~> 3.8.0)
25
+ rspec-support (3.8.0)
26
+
27
+ PLATFORMS
28
+ ruby
29
+
30
+ DEPENDENCIES
31
+ bundler (~> 2.0)
32
+ rake (~> 10.0)
33
+ rspec (~> 3.0)
34
+ viral_seq!
35
+
36
+ BUNDLED WITH
37
+ 2.0.1
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2019 Shuntai Zhou (shuntai.zhou@gmail.com)
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,39 @@
1
+ # viral_seq
2
+
3
+ ## Installation
4
+
5
+ Add this line to your application's Gemfile:
6
+
7
+ ```ruby
8
+ gem 'viral_seq'
9
+ ```
10
+
11
+ And then execute:
12
+
13
+ $ bundle
14
+
15
+ Or install it yourself as:
16
+
17
+ $ gem install viral_seq
18
+
19
+ ## Usage
20
+
21
+ TODO: Write usage instructions here
22
+
23
+ ## Development
24
+
25
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
26
+
27
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
28
+
29
+ ## Contributing
30
+
31
+ Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/viral_seq. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
32
+
33
+ ## License
34
+
35
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
36
+
37
+ ## Code of Conduct
38
+
39
+ Everyone interacting in the viral_seq project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/viral_seq/blob/master/CODE_OF_CONDUCT.md).
data/Rakefile ADDED
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "viral_seq"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,172 @@
1
+ # viral_seq/a3g
2
+ # APOBEC3g/f hypermutation function including
3
+ # ViralSeq::a3g_hypermut_seq_hash
4
+ # ViralSeq::apobec3gf
5
+
6
+ # APOBEC3g/f G to A hypermutation
7
+ # APOBEC3G/F pattern: GRD -> ARD
8
+ # control pattern: G[YN|RC] -> A[YN|RC]
9
+ # use the sample consensus to determine potential a3g sites
10
+
11
+ # Two criteria to identify hypermutation
12
+ # 1. Fisher's exact test on the frequencies of G to A mutations at A3G positions vs. non-A3G positions
13
+ # 2. Poisson distribution of G to A mutations at A3G positions; outlier sequences are called hypermutated
14
+ # note: criterion 2 applies only to a sequence file containing more than 20 sequences,
15
+ # because the Poisson model performs poorly on small sample sizes.
16
+
17
+ # ViralSeq.a3g_hypermut_seq_hash(sequence_hash)
18
+ # sequence_hash is a Hash object for sequences. {:name => :sequence, ...}
19
+ # return array [hypermutation_hash, statistic_info]
20
+ # hypermutation_hash is a Hash object of the hypermutated sequences
21
+ # statistic_info is a Hash object of {sequence_name => stats},
22
+ # in which stats is a String object in csv format (separated by ',') containing
23
+ # sequence tag
24
+ # G to A mutation numbers at potential a3g positions
25
+ # total potential a3g G positions
26
+ # G to A mutation numbers at non a3g positions
27
+ # total non a3g G positions
28
+ # a3g G to A mutation rate / non-a3g G to A mutation rate
29
+ # Fishers Exact P-value
30
+ #
31
+ # =USAGE
32
+ # # example 1
33
+ # sequences = ViralSeq.fasta_to_hash('spec/sample_files/sample_a3g_sequence1.fasta')
34
+ # hypermut = ViralSeq.a3g_hypermut_seq_hash(sequences)
35
+ # hypermut[0].keys
36
+ # => [">Seq7", ">Seq14"]
37
+ # stats = hypermut[1]
38
+ # stats.values
39
+ # => [">Seq7,23,68,1,54,18.26,4.308329383112348e-06", ">Seq14,45,68,9,54,3.97,5.2143571971582974e-08"]
40
+ #
41
+ # # example 2
42
+ # sequences = ViralSeq.fasta_to_hash('spec/sample_files/sample_a3g_sequence2.fasta')
43
+ # hypermut = ViralSeq.a3g_hypermut_seq_hash(sequences)
44
+ # stats = hypermut[1]
45
+ # stats.values
46
+ # => [">CTAACACTCA_134_a3g-sample2,4,35,0,51,Infinity,0.02465676660128911", ">ATAGTGCCCA_60_a3g-sample2,4,35,1,51,5.83,0.1534487353839561"]
47
+ # # note that sequence ">ATAGTGCCCA_60_a3g-sample2" has a p-value of 0.15, greater than 0.05, but it is still called a hypermutated sequence because it is a Poisson outlier.
48
+
49
+
50
+ # ViralSeq.apobec3gf(sequence)
51
+ # APOBEC3G/F pattern: GRD -> ARD
52
+ # control pattern: G[YN|RC] -> A[YN|RC]
53
+ # input a sequence String object
54
+ # return two arrays of position numbers of
55
+ # a3g G positions (a3g)
56
+ # non-a3g G positions (control)
57
+
58
+
59
+ module ViralSeq
60
+ def ViralSeq.a3g_hypermut_seq_hash(seq_hash)
61
+ # mut_hash number of apobec3g/f mutations per sequence
62
+ mut_hash = {}
63
+ hm_hash = {}
64
+ out_hash = {}
65
+
66
+ # total G->A mutations at apobec3g/f positions.
67
+ total = 0
68
+
69
+ # make consensus sequence for the input sequence hash
70
+ ref = ViralSeq.consensus(seq_hash.values)
71
+
72
+ # obtain apobec3g positions and control positions
73
+ apobec = ViralSeq.apobec3gf(ref)
74
+ mut = apobec[0]
75
+ control = apobec[1]
76
+
77
+ seq_hash.each do |k,v|
78
+ a = 0 # muts
79
+ b = 0 # potential mut sites
80
+ c = 0 # control muts
81
+ d = 0 # potential control sites
82
+ mut.each do |n|
83
+ next if v[n] == "-"
84
+ if v[n] == "A"
85
+ a += 1
86
+ b += 1
87
+ else
88
+ b += 1
89
+ end
90
+ end
91
+ mut_hash[k] = a
92
+ total += a
93
+
94
+ control.each do |n|
95
+ next if v[n] == "-"
96
+ if v[n] == "A"
97
+ c += 1
98
+ d += 1
99
+ else
100
+ d += 1
101
+ end
102
+ end
103
+ rr = (a/b.to_f)/(c/d.to_f)
104
+
105
+ t1 = b - a
106
+ t2 = d - c
107
+
108
+ fet = Rubystats::FishersExactTest.new
109
+ fisher = fet.calculate(t1,t2,a,c)
110
+ perc = fisher[:twotail]
111
+ info = k + "," + a.to_s + "," + b.to_s + "," + c.to_s + "," + d.to_s + "," + rr.round(2).to_s + "," + perc.to_s
112
+ out_hash[k] = info
113
+ if perc < 0.05
114
+ hm_hash[k] = info
115
+ end
116
+ end
117
+
118
+ if seq_hash.size > 20
119
+ rate = total.to_f/(seq_hash.size)
120
+
121
+ count_mut = ViralSeq.count(mut_hash.values)
122
+ maxi_count = count_mut.values.max
123
+
124
+ poisson_hash = ViralSeq.poisson_distribution(rate,maxi_count)
125
+
126
+ cut_off = 0
127
+ poisson_hash.each do |k,v|
128
+ cal = seq_hash.size * v
129
+ obs = count_mut[k]
130
+ if obs >= 20 * cal
131
+ cut_off = k
132
+ break
133
+ elsif k == maxi_count
134
+ cut_off = maxi_count
135
+ end
136
+ end
137
+
138
+ mut_hash.each do |k,v|
139
+ if v > cut_off
140
+ hm_hash[k] = out_hash[k]
141
+ end
142
+ end
143
+ end
144
+
145
+ hm_seq_hash = {}
146
+ hm_hash.keys.each do |k|
147
+ hm_seq_hash[k] = seq_hash[k]
148
+ end
149
+ return [hm_seq_hash,hm_hash]
150
+ end
151
+
152
+ # APOBEC3G/F mutation position identification
153
+ # APOBEC3G/F pattern: GRD -> ARD
154
+ # control pattern: G[YN|RC] -> A[YN|RC]
155
+
156
+ def self.apobec3gf(seq = "")
157
+ seq.tr!("-", "")
158
+ seq_length = seq.size
159
+ apobec_position = []
160
+ control_position = []
161
+ (0..(seq_length - 3)).each do |n|
162
+ tri_base = seq[n,3]
163
+ if tri_base =~ /G[AG][AGT]/
164
+ apobec_position << n
165
+ elsif seq[n] == "G"
166
+ control_position << n
167
+ end
168
+ end
169
+ return [apobec_position,control_position]
170
+ end
171
+
172
+ end
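The site classification described in the comments above (GRD as the APOBEC3G/F target pattern, every other G as a control site) can be sketched on its own. This is a minimal illustration and not the gem's code: the method name `classify_g_positions` is hypothetical, and it assumes an ungapped, upper-case DNA string and leaves its input unmodified.

```ruby
# Hypothetical helper, not part of viral_seq: classify each G in an
# ungapped DNA string as an APOBEC3G/F target (GRD) or a control site.
def classify_g_positions(seq)
  a3g = []
  control = []
  # Stop at the start of the last complete trinucleotide.
  (0..(seq.size - 3)).each do |n|
    tri = seq[n, 3]
    if tri.match?(/\AG[AG][AGT]\z/)
      a3g << n       # G followed by R (A or G), then D (A, G or T)
    elsif seq[n] == "G"
      control << n   # any other G
    end
  end
  [a3g, control]
end

a3g, control = classify_g_positions("GGATGCAGAC")
# a3g     => [0, 1]  ("GGA" and "GAT" match GRD)
# control => [4, 7]  ("GCA" and "GAC" do not)
```

Inside a character class `|` is a literal character rather than alternation, so `[AG]` and `[AGT]` are enough to express R and D.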
@@ -0,0 +1,154 @@
1
+ # fasta.rb
2
+ # methods for converting sequence formats, including
3
+ # ViralSeq::fasta_to_hash
4
+ # ViralSeq::fastq_to_fasta
5
+ # ViralSeq::fastq_to_hash
6
+ # ViralSeq::fasta_hash_to_rsphylip
7
+ # ViralSeq::pair_fasta_to_hash
8
+
9
+ # =USAGE
10
+ # sequence_fasta_hash = ViralSeq.fasta_to_hash(input_fasta_file)
11
+ # # input a sequence file in fasta format, read as a sequence hash
12
+ # # {:sequence_name1 => sequence1, ...}
13
+
14
+ # sequence_fasta_hash = ViralSeq.fastq_to_fasta(input_fastq_file)
15
+ # # input a sequence file in fastq format, read as a sequence hash
16
+ # # discard sequence quality score
17
+
18
+ # sequence_fastq_hash = ViralSeq.fastq_to_hash(input_fastq_file)
19
+ # # input a sequence file in fastq format, read as a sequence hash
20
+ # # keep sequence quality score
21
+ # # {:sequence_name1 => [sequence1, quality1], ...}
22
+
23
+ # phylip_hash = ViralSeq.fasta_hash_to_rsphylip(sequence_fasta_hash)
24
+ # # convert an aligned fasta sequence hash into relaxed sequential phylip format
25
+
26
+ # paired_sequence_hash = ViralSeq.pair_fasta_to_hash(directory_of_paired_fasta)
27
+ # # input a directory containing paired sequence files in the fasta format
28
+ # # ├───lib1
29
+ # # │ lib1_r1.txt
30
+ # # │ lib1_r2.txt
31
+ # # paired sequence files need to have "r1" and "r2" in their file names
32
+ # # the sequence taxa should only differ by last 3 characters to distinguish r1 and r2 sequence.
33
+ # # return a paired sequence hash :seq_name => [r1_seq, r2_seq]
34
+
35
+ module ViralSeq
36
+
37
+ def self.fasta_to_hash(infile)
38
+ f=File.open(infile,"r")
39
+ return_hash = {}
40
+ name = ""
41
+ while line = f.gets do
42
+ line.tr!("\u0000","")
43
+ next if line == "\n"
44
+ next if line =~ /^\=/
45
+ if line =~ /^\>/
46
+ name = line.chomp
47
+ return_hash[name] = ""
48
+ else
49
+ return_hash[name] += line.chomp.upcase
50
+ end
51
+ end
52
+ f.close
53
+ return return_hash
54
+ end
55
+
56
+
57
+ # fastq file to fasta, discard quality, return a sequence hash
58
+
59
+ def self.fastq_to_fasta(fastq_file)
60
+ count = 0
61
+ sequence_a = []
62
+ count_seq = 0
63
+
64
+ File.open(fastq_file,'r') do |file|
65
+ file.readlines.collect do |line|
66
+ count +=1
67
+ count_m = count % 4
68
+ if count_m == 1
69
+ line.tr!('@','>')
70
+ sequence_a << line.chomp
71
+ count_seq += 1
72
+ elsif count_m == 2
73
+ sequence_a << line.chomp
74
+ end
75
+ end
76
+ end
77
+ Hash[*sequence_a]
78
+ end
79
+
80
+ # fastq file to hash, including quality. {:seq_name => [seq,quality]}
81
+
82
+ def self.fastq_to_hash(fastq_file)
83
+ count = 0
84
+ sequence_a = []
85
+ quality_a = []
86
+ count_seq = 0
87
+
88
+ File.open(fastq_file,'r') do |file|
89
+ file.readlines.collect do |line|
90
+ count +=1
91
+ count_m = count % 4
92
+ if count_m == 1
93
+ line.tr!('@','>')
94
+ sequence_a << line.chomp
95
+ quality_a << line.chomp
96
+ count_seq += 1
97
+ elsif count_m == 2
98
+ sequence_a << line.chomp
99
+ elsif count_m == 0
100
+ quality_a << line.chomp
101
+ end
102
+ end
103
+ end
104
+ sequence_hash = Hash[*sequence_a]
105
+ quality_hash = Hash[*quality_a]
106
+ return_hash = {}
107
+ sequence_hash.each do |k,v|
108
+ return_hash[k] = [v, quality_hash[k]]
109
+ end
110
+ return return_hash
111
+ end
112
+
113
+ # fasta sequence hash to relaxed sequential phylip format
114
+
115
+ def self.fasta_hash_to_rsphylip(seqs)
116
+ outline = "\s" + seqs.size.to_s + "\s" + seqs.values[0].size.to_s + "\n"
117
+ names = seqs.keys
118
+ max_name_l = names.map(&:size).max - 1 # longest name, excluding the leading ">"
118
+ name_block_l = [max_name_l, 10].max
120
+ seqs.each do |k,v|
121
+ outline += k[1..-1] + "\s" * (name_block_l - k.size + 2) + v.scan(/.{1,10}/).join("\s") + "\n"
122
+ end
123
+ return outline
124
+ end
125
+
126
+ # input a directory with r1 and r2 sequences, return a hash :seq_name => [r1_seq, r2_seq]
127
+ # r1 and r2 file names should contain "r1" and "r2" respectively
128
+ # the sequence taxa should only differ by last 3 characters to distinguish r1 and r2 sequence.
129
+ def self.pair_fasta_to_hash(indir)
130
+ files = Dir[indir + "/*"]
131
+ r1_file = ""
132
+ r2_file = ""
133
+ files.each do |f|
134
+ if File.basename(f) =~ /r1/i
135
+ r1_file = f
136
+ elsif File.basename(f) =~ /r2/i
137
+ r2_file = f
138
+ end
139
+ end
140
+
141
+ seq1 = ViralSeq.fasta_to_hash(r1_file)
142
+ seq2 = ViralSeq.fasta_to_hash(r2_file)
143
+
144
+ new_seq1 = seq1.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
145
+ new_seq2 = seq2.each_with_object({}) {|(k, v), h| h[k[0..-4]] = v}
146
+
147
+ seq_pair_hash = {}
148
+
149
+ new_seq1.each do |seq_name,seq|
150
+ seq_pair_hash[seq_name] = [seq, new_seq2[seq_name]]
151
+ end
152
+ return seq_pair_hash
153
+ end
154
+ end
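The FASTA-to-hash conversion above can be tried without a file on disk. The following is a minimal sketch, not the gem's implementation: `parse_fasta` is a hypothetical name, and it assumes any IO-like object that responds to `each_line`. Like `fasta_to_hash`, it keeps the leading ">" in each key, upcases the sequence, and skips blank lines and lines starting with "=".

```ruby
require "stringio"

# Hypothetical helper, not part of viral_seq: read FASTA records from an
# IO-like object into a Hash of {">name" => "SEQUENCE"}.
def parse_fasta(io)
  records = {}
  name = nil
  io.each_line do |line|
    line = line.delete("\u0000").chomp
    next if line.empty? || line.start_with?("=")
    if line.start_with?(">")
      name = line
      records[name] = +""            # unfrozen empty string to append to
    elsif name
      records[name] << line.upcase   # multi-line sequences are concatenated
    end
  end
  records
end

sequences = parse_fasta(StringIO.new(">seq1\nacgt\nACGT\n\n>seq2\nggcc\n"))
puts sequences[">seq1"]
```

Taking an IO object rather than a path leaves opening and closing the file to the caller, for example `File.open(path) { |f| parse_fasta(f) }`, which closes the file even if parsing raises.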