snp-search 1.0.0 → 2.0.0
- data/README +105 -0
- data/README.rdoc +2 -2
- data/VERSION +1 -1
- data/bin/snp-search +137 -73
- data/lib/snp-search.rb +140 -18
- data/lib/snp_db_schema.rb +10 -2
- data/snp-search.gemspec +2 -2
- metadata +21 -21
data/README
CHANGED
@@ -0,0 +1,105 @@
+= snp-search
+
+SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data. It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data. Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes. Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
+
+== Obtaining and installing the code
+SNPsearch is written in Ruby and operates in a Unix environment. It is made available as a gem. See the github site for more information (https://github.com/hpa-bioinformatics/snp-search).
+
+To install snp-search, do
+gem install snp-search
+
+== Requirements
+
+Not much, you just need:
+
+* Unix. Once snp-search is installed, all the necessary gems to run snp-search will also be installed from Rubygems (note that Rubygems requires admin privileges. If you do not have admin privileges then we suggest you install RVM: (http://beginrescueend.com/rvm/install/) and then gem install snp-search).
+* ruby version 1.8.7 and above.
+
+* Optional: FastTree. If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install. You must specify the path of the executable in your .bashrc or .profile file as snp-search will run the command as just 'FastTree' and will not know where FastTree is if it is not specified in your .bashrc or .profile file.
+
+Thats it!
+
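The FastTree requirement above depends on the executable being reachable through PATH. A minimal sketch of how that could be checked from Ruby before asking for tree output; `executable_on_path?` is a hypothetical helper written for illustration and is not part of snp-search:

```ruby
# Hypothetical helper (not part of snp-search): reports whether an
# executable with the given name exists in any directory listed in PATH.
def executable_on_path?(name, path = ENV["PATH"].to_s)
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

if executable_on_path?("FastTree")
  puts "FastTree found; Newick tree output should be available."
else
  puts "FastTree not found on PATH; add its directory to PATH in .bashrc or .profile."
end
```

The second argument exists only so the search path can be overridden, for example when testing.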
+== Running snp-search
+
+1- Creating the database (snp-search -create)
+
+Two files are needed to create the SQLite3 database:
+
+1- Variant Call Format (.vcf) file (which contains the SNP information)
+
+2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
+
+You need the following parameters:
+
+-n Name of your database
+-v .vcf file
+-d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
+
+Other options:
+-c SNP quality score cutoff. A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
+-g Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true. Optional, default = 30
+-h help message
+
+Usage:
+snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
+
+Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
+
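The -c and -g cutoffs above are Phred-scaled. As a rough guide to what a given cutoff means, the standard Phred relation P = 10^(-Q/10) converts a score Q into the probability that the call is wrong. A small sketch, with `phred_to_error_probability` being an illustrative name rather than anything defined by snp-search:

```ruby
# Convert a Phred-scaled quality score Q into the probability that the
# call is wrong, using the standard relation P = 10^(-Q/10).
def phred_to_error_probability(q)
  10.0**(-q / 10.0)
end

# The two default cutoffs mentioned in the README.
[30, 90].each do |q|
  printf("Q%d -> error probability %.1e\n", q, phred_to_error_probability(q))
end
```

Under this relation the default genotype cutoff of 30 corresponds to roughly a 1 in 1,000 chance of error.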
+2- Querying the Database (snp-search -query)
+
+Two queries are currently scripted in SNPsearch:
+
+1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
+
+You need the following parameters:
+
+-n Name of your database
+-s The strains/samples you like to query
+
+Usage:
+snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
+
+2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
+
+You need the following parameters:
+
+-n Name of your database
+-a The gene you like to remove from analysis
+-o Output file, in fasta format
+
+options:
+-t Generate SNP phylogeny
+-w Output tree in Newick format
+
+Usage (phage is used as the example gene):
+snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
+
+The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
+
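The concatenated SNP alignment described above amounts to one FASTA record per strain, whose sequence is that strain's base at every retained SNP position. A minimal sketch of that layout; the strain names and bases are made-up illustration data and `snp_alignment_fasta` is a hypothetical helper, not the gem's own code:

```ruby
# Hypothetical helper: builds a FASTA string from a hash that maps each
# strain name to the bases observed at the retained SNP positions
# (same position order for every strain).
def snp_alignment_fasta(bases_by_strain)
  bases_by_strain.map do |strain_name, bases|
    ">#{strain_name}\n#{bases.join}\n"
  end.join
end

# Made-up example data: three SNP positions, two strains.
alignment = snp_alignment_fasta(
  "strain_A" => %w[A C G],
  "strain_B" => %w[A T G]
)
puts alignment
```

Because every strain contributes exactly one base per SNP, all records have the same length, which is what an alignment-based tree builder such as FastTree expects.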
+3- Output database (snp-search -out_file)
+
+You need the following parameters:
+
+-n Name of your database
+-o Output file containing the database in fasta format
+
+== View database in Unix or in a GUI
+Your database will be in sqlite3 format. If you like to view your table(s) and perform direct queries you can type
+sqlite3 snp_db.sqlite3
+
+Alternatively, you may download a SQL tool to view your database (e.g. SQLite sorcerer).
+
+== Contact
+
+If you have any comments, questions or suggestions, please email
+ali.al-shahib@hpa.org.uk
+or
+anthony.underwood@hpa.org.uk
+
+Have fun snp-searching!
+
+== Copyright
+
+Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
+further details.
+
data/README.rdoc
CHANGED
@@ -49,7 +49,7 @@ You need the following parameters:

 Two queries are currently scripted in SNPsearch:

-1-
+1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.

 You need the following parameters:

@@ -59,7 +59,7 @@ You need the following parameters:
 Usage:
 snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt

-2-
+2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.

 You need the following parameters:

data/VERSION
CHANGED
@@ -1 +1 @@
-
+2.0.0
data/bin/snp-search
CHANGED
@@ -3,60 +3,74 @@ require 'snp_db_connection'
 require 'snp_db_models'
 require 'snp_db_schema'
 require 'activerecord-import'
-
-gem "slop", "~> 2.4.0"
+gem "slop", "~> 3.3.1"
+# gem "slop", "~> 2.4.0"
 require 'slop'

-opts = Slop.
+opts = Slop.parse do

-
+  banner "\nruby snp-search [-create] [-query] [-output] [-n <sqlite3>] [options]*"
+  separator ''

-  banner "\nruby snp-search [OPTIONS]"
   on :C, :create, 'Create database'
   on :Q, :query, 'Query database'
-  on :O, :
-
+  on :O, :output, 'Output options'
+  separator ''
   # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
   # separator 'The following command must be used when using -create, or -query or -out_file'
   on :n, :name=, 'Name of database, Required'
-
-
-
-  on :
-  on :
-  on :
-
-
-
-
+  separator ''
+
+  separator '-create options'
+  on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
+  on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
+  on :c, :cuttoff_snp=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+  on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
+  separator ''
+
+  separator '-query options'
+  on :u, :unique_snps, 'Query for unique snps in the database'
+  on :r, :not_include_snps_from_gene, 'Remove SNPs from specified gene from database'
   on :s, :strain=, 'The strains/samples you like to query, Required'
   on :a, :annotation=, 'The gene you like to remove from analysis'
-
+  separator ''
+
+  separator '-output [-fasta] [-syn] options'
+  on :f, :fasta, 'output fasta file'
+  on :S, :syn, 'output tab-delimited file with synonymous and non-synonymous info'
+  on :o, :out=, 'Name of output file'
   on :t, :tree, 'Generate SNP phylogeny'
-  on :w, :
-
+  on :w, :nwk_out=, 'Name of output tree in Newick format'
+
 end
-opts.
+# opts.end

 ###########################################################

 # CREATING A DATABASE
 if opts[:create]

-
+  # puts opts[:cuttoff_snp].to_i
+
   error_msg = ""

-  error_msg += "-n
-  error_msg += "-d
-  error_msg += "-v
+  error_msg += "-n: \t Name of your database\n" unless opts[:name]
+  error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
+  error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]

-
-
-
-
-
-
-
+  error_msg_optional = ""
+
+  error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
+  error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
+
+  unless error_msg == ""
+    puts "Please provide the following required fields:"
+    puts error_msg
+    puts "Optional fields:"
+    puts error_msg_optional
+    puts opts.help unless opts.empty?
+    exit
+  end

   abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])

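The hunk above repeats one `error_msg += ... unless opts[...]` line per required option. The same check can be expressed as data plus one small function; the sketch below is only an illustration of that idea. `REQUIRED_FOR_CREATE` and `missing_option_messages` are hypothetical names, and a plain Hash stands in for the parsed Slop options:

```ruby
# Hypothetical table of required -create options:
# [option key, short flag, description], taken from the messages in the diff.
REQUIRED_FOR_CREATE = [
  [:name, "-n", "Name of your database"],
  [:database_reference_file, "-d", "Reference genome file, in gbk or embl file format"],
  [:vcf_file, "-v", ".vcf file"]
].freeze

# Returns one message line for every required option that is absent.
def missing_option_messages(opts, required)
  required.reject { |key, _flag, _text| opts[key] }
          .map { |_key, flag, text| "#{flag}: \t #{text}\n" }
end

# A plain Hash stands in for the parsed command-line options here.
opts = { :name => "my_snp_db.sqlite3" }
missing = missing_option_messages(opts, REQUIRED_FOR_CREATE)
unless missing.empty?
  puts "Please provide the following required fields:"
  puts missing.join
end
```

Keeping the required options in one table means the -query and -output branches could reuse the function with their own lists.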
@@ -67,7 +81,7 @@ if opts[:create]
   establish_connection(opts[:name])

   # Schema will run here
-
+  db_schema

   ref = opts[:database_reference_file]

@@ -87,22 +101,25 @@ if opts[:create]
   vcf_mpileup_file = opts[:vcf_file]

   # The populate_features_and_annotations method populates the features and annotations. It uses the embl/gbk file.
-
+  populate_features_and_annotations(sequence_flatfile)

   #The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes. It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
-
+
+  populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp], opts[:cuttoff_genotype])
+
+  # puts "populate_snps_alleles_genotypes(#{vcf_mpileup_file}, #{opts[:cuttoff_snp]}, #{opts[:cuttoff_genotype]}.to_i)"

 ###########################################################

 # QUERYING THE DATABASE
 elsif opts [:query]
   #FIND UNIQUE SNPS
-  if opts[:
+  if opts[:unique_snps]

     error_msg = ""

-    error_msg += "-n
-    error_msg += "-s
+    error_msg += "-n: \t Name of your database\n" unless opts[:name]
+    error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]

     unless error_msg == ""
       puts "Please provide the following required fields:"
@@ -126,32 +143,39 @@ elsif opts [:query]
     gas_snps = find_shared_snps(strains)

     gas_snps.each do |snp|
-      puts "The number of unique snps are #{snp.id}
+      puts "The number of unique snps are #{snp.id}"
     end

   ################################################################
   # REMOVE SNPS ASSOCIATED WITH SPECIFIC GENES
-  elsif opts[:
+  elsif opts[:not_include_snps_from_gene]

     error_msg = ""

-    error_msg += "-n
-    error_msg += "-o
-    error_msg += "-a
+    error_msg += "-n: \t Name of your database\n" unless opts[:name]
+    error_msg += "-o: \t name of your output file\n" unless opts[:out]
+    error_msg += "-a: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
+
+    error_msg_optional = ""
+
+    error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
+    error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]

     unless error_msg == ""
       puts "Please provide the following required fields:"
       puts error_msg
+      puts "Optional fields:"
+      puts error_msg_optional
       puts opts.help unless opts.empty?
       exit
     end
-
+
+
     abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])

     # annotation = opts[:annotation]
     establish_connection(opts[:name])

-
     # Getting list of strains from database
     strains = Strain.all

@@ -164,7 +188,7 @@ elsif opts [:query]
     end

     # output opened for data input
-    output = File.open("#{opts[:
+    output = File.open("#{opts[:out]}", "w")

     # Perform query
     snps = Snp.includes(:alleles => :genotypes).find_by_sql("SELECT snps.* FROM snps INNER JOIN features ON features.id = snps.feature_id WHERE features.id NOT IN (select distinct features.id FROM features INNER JOIN annotations ON annotations.feature_id = features.id WHERE annotations.value LIKE '%#{opts[:annotation]}%')")
@@ -175,16 +199,16 @@ elsif opts [:query]
       # puts snp.inspect
       i += 1
       puts "Total number of SNPs generated so far: #{i}" if i % 100 == 0
-
-
-
-
-
-
+      ActiveRecord::Base.transaction do
+        snp.alleles.each do |allele|
+          # puts allele.inspect
+          allele.genotypes.each do |genotype|
+            #push bases to hash
+            sequence_hash[genotype.strain_id] << allele.base
+          end
+        end
       end
-      end
     end
-    end

     #generate FASTA file
     strains.each do |strain|
@@ -192,30 +216,42 @@ elsif opts [:query]
       output.puts
     end

+    # GENERATE TREE FROM FASTA FILE
     if opts[:tree]
-      `FastTree -fastest -nt #{opts[:
+      `FastTree -fastest -nt #{opts[:out]} > #{opts[:nwk_out]}`
     end
+
+  else
+    puts "use -unique_snps or -not_include_snps_from_gene query options"
   end

-##############################################################
+# ##############################################################

 # OUTPUT DATABASE IN FASTA FORMAT
-elsif opts[:
+elsif opts[:output]
+  if opts[:fasta]
     error_msg = ""

-    error_msg += "-n
-    error_msg += "-o
+    error_msg += "-n: \t Name of your database\n" unless opts[:name]
+    error_msg += "-o: \t name of your output file (in FASTA format)\n" unless opts[:out]
+
+    error_msg_optional = ""
+
+    error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
+    error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]

     unless error_msg == ""
       puts "Please provide the following required fields:"
       puts error_msg
+      puts "Optional fields:"
+      puts error_msg_optional
       puts opts.help unless opts.empty?
       exit
     end

     abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])

-
+    establish_connection(opts[:name])

     # Getting list of strains from database
     strains = Strain.all
@@ -229,28 +265,30 @@ elsif opts[:out_file]
     end


-    output = File.open("#{opts[:
+    output = File.open("#{opts[:out]}", "w")

     # Select all snps
     snps = Snp.all
-
+
     i = 0
     puts "Your out file is being prepared......."
     snps.each do |snp|
       i += 1
       puts "Total number of SNPs outputted so far: #{i}" if i % 100 == 0

-
-
-
-
-
-
-
+      ActiveRecord::Base.transaction do
+        snp.alleles.each do |allele|
+          # puts allele.inspect
+          allele.genotypes.each do |genotype|
+            #push bases to hash
+            sequence_hash[genotype.strain_id] << allele.base
+          end
+        end
       end
     end

-
+    puts sequence_hash
+    exit
     #generate FASTA file
     strains.each do |strain|
       output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
@@ -258,10 +296,36 @@ elsif opts[:out_file]
     end

     if opts[:tree]
-
+      # puts "FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}"
+      `FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}`
     end
   end
-
+
+  #########################################
+
+  if opts[:syn]
+    error_msg = ""
+
+    error_msg += "-n option: \t the name of your database\n" unless opts[:name]
+    error_msg += "-d option: \t the reference file in gbk format\n" unless opts[:database_reference_file]
+
+    unless error_msg == ""
+      puts "Please provide the following required fields:"
+      puts error_msg
+      puts opts.help unless opts.empty?
+      exit
+    end
+
+    abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+    abort "#{opts[:database_reference_file]} vcf file does not exist!" unless File.exist?(opts[:database_reference_file])
+
+    establish_connection(opts[:name])
+
+    ref = opts[:database_reference_file]
+
+    synonymous(ref)
+  end
+
 else
-
+  puts opts.help
 end
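Both tree-building branches above interpolate the output file names straight into a backtick shell command, so a file name containing spaces or shell metacharacters would be split or interpreted by the shell. One way this could be guarded against is Ruby's standard `Shellwords.escape`; the sketch below only builds and prints the command string, and `fasttree_command` is a hypothetical helper rather than code from the gem:

```ruby
require "shellwords"

# Hypothetical helper: builds the FastTree command line used in the script,
# with both file names escaped for the shell.
def fasttree_command(fasta_path, newick_path)
  "FastTree -fastest -nt #{Shellwords.escape(fasta_path)} > #{Shellwords.escape(newick_path)}"
end

# A file name with spaces, to show the effect of escaping.
puts fasttree_command("snps without phage.fasta", "snps_without_phage.nwk")
```

The returned string could then be passed to backticks or `system` in place of the raw interpolation.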
data/lib/snp-search.rb
CHANGED
@@ -3,6 +3,7 @@ gem "bio", "~> 1.4.2"
 require 'bio'
 require 'snp_db_models'
 require 'activerecord-import'
+require 'diff/lcs'

 #This method guesses the reference sequence file format
 def guess_sequence_format(reference_genome)
@@ -50,6 +51,7 @@ end

 #This method populates the rest of the information, i.e. SNP information, Alleles and Genotypes.
 def populate_snps_alleles_genotypes(vcf_file, cuttoff_snp, cuttoff_genotype)
+
 puts "Adding SNPs........"
   # open vcf file and parse each line
   File.open(vcf_file) do |f|
@@ -86,37 +88,56 @@ puts "Adding SNPs........"
 format = details[8].split(":")
 gt = format.index("GT")
 gq = format.index("GQ")
+# dp = format.index("DP")
 samples = details[9..-1]
-
-next if ref_base.size != 1 || snp_base.size != 1 # exclude indels
+
+next if ref_base.size != 1 || snp_base.size != 1 # exclude indels (e.g. G,A in REF)
 genotypes = samples.map do |s|
-  format_values = s.chomp.split(":")
-  format_values[gt]
+  format_values = s.chomp.split(":") # output (e.g.): 0/0 \n 0,255,209 \n 99
+  format_values[gt] # e.g. 0/0
 end

 genotypes_qualities = samples.map do |s|
   format_values = s.chomp.split(":")
-  format_values[gq]
+  format_values[gq] # e.g. 99
 end

-
+geno_quality_array = Array.new
+
+high_quality_variant_genotypes = Array.new # this will be filled with the indicies of genotypes that are "1/1" and have a quality >= 30. Reminder: 0/0 is no SNP, 1/1 is SNP.
 variant_genotypes = Array.new
-genotypes.each_with_index do |gt, index|
+genotypes.each_with_index do |gt, index| # indexes each 'genotypes'.
   if gt == "1/1"
-    variant_genotypes << index
-    if genotypes_qualities[index].to_i >= cuttoff_genotype
-
-
-
+    variant_genotypes << index # variant_genotypes is the position of genome positions that have a correct SNP with 1/1. if you want the total number of strains thats have 1/1 for that row (genome position) then puts variant_genotypes.size
+    if genotypes_qualities[index].to_i >= cuttoff_genotype.to_i
+      high_quality_variant_genotypes << index
+    end
+  end
 end
-
-
-
+
+genotypes_qualities.each do |gq|
+  if gq.to_i >= cuttoff_genotype.to_i
+    geno_quality_array << gq
+  end
+end
+
+# # high_quality_variant_genotypes is the position of 1/1 and genotype quality above cuttoff_genotype. high_quality_variant_genotypes.size will give you the number of 1/1 in a row (genome position) that is above the genotype quality cuttoff.
+# puts "yay" if geno_quality_array.keep_if {|z| z <= cuttoff_genotype.to_i}
+
+# next if geno_quality_array.each {|z| z.to_i < cuttoff_genotype.to_i}
+next if samples.include?("./.")
+next if geno_quality_array.size != strains.size
+if snp_qual.to_i >= cuttoff_snp.to_i && genotypes.include?("1/1") && ! high_quality_variant_genotypes.empty? && high_quality_variant_genotypes.size == variant_genotypes.size
+  # first condition checks the overall quality of the SNP is >=90, second checks that at least one genome has the 'homozygous' 1/1 variant type with quality >= 30 and informative SNP
+
+  if genotypes.include?("0/0") && !genotypes.include?("0/1") # exclude SNPs which are all 1/1 i.e something strange about ref and those which have confusing heterozygote 0/1s
+
     good_snps +=1
     # puts good_snps
     #create snp
     s = Snp.new
     s.ref_pos = ref_pos
+    s.qual = snp_qual
     s.save

     # create ref allele
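The filtering added in this hunk is spread over several arrays and `next` statements. Read together, a row appears to be kept only when no call is missing, every genotype quality reaches the genotype cutoff, the SNP quality reaches the SNP cutoff, and the genotypes are a mix of 0/0 and 1/1 with no 0/1. The sketch below restates that reading as one predicate. It is a simplification under those assumptions, not the gem's code: `sample_field` and `keep_snp?` are hypothetical names, and missing calls are detected on the GT field rather than on the whole sample column.

```ruby
# Hypothetical helper: pick one field (e.g. "GT" or "GQ") out of a VCF sample
# column, using the FORMAT column to find its position.
def sample_field(sample, format, key)
  index = format.split(":").index(key)
  index && sample.chomp.split(":")[index]
end

# Hypothetical predicate restating the filter in the hunk above.
def keep_snp?(snp_qual, genotypes, qualities, snp_cutoff = 90, genotype_cutoff = 30)
  return false if genotypes.include?("./.")                             # missing call
  return false unless qualities.all? { |q| q.to_i >= genotype_cutoff }  # every genotype confident
  return false unless snp_qual.to_i >= snp_cutoff                       # SNP itself confident
  # informative: some strains carry the SNP, some do not, none heterozygous
  genotypes.include?("1/1") && genotypes.include?("0/0") && !genotypes.include?("0/1")
end

# Made-up FORMAT and sample columns, in the style of the comments in the diff.
format    = "GT:PL:GQ"
samples   = ["0/0:0,255,209:99", "1/1:255,255,0:99"]
genotypes = samples.map { |s| sample_field(s, format, "GT") }
qualities = samples.map { |s| sample_field(s, format, "GQ") }
puts keep_snp?("222", genotypes, qualities)
```

Collapsing the checks into a pure function like this makes each rejection rule testable without a database or a VCF file.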
@@ -139,13 +160,14 @@ puts "Adding SNPs........"
 genotypes.each_with_index do |gt, index|
   genotype = Genotype.new
   genotype.strain = strains[index]
+  genotype.geno_qual = genotypes_qualities[index].to_i
   puts index if strains[index].nil?
   if gt == "0/0" # wild type
     genotype.allele = ref_allele
   elsif gt == "1/1" # snp type
     genotype.allele = snp_allele
-
-
+  else
+    puts "Strange SNP #{gt}"
   end
   genos << genotype
 end
@@ -154,6 +176,7 @@ puts "Adding SNPs........"
           puts "Total SNPs added so far: #{good_snps}" if good_snps % 100 == 0
         end
       end
+
     end
   end
 end
@@ -172,8 +195,107 @@ def find_shared_snps(strain_names)

   where_statement = strain_names.collect{|strain_name| "strains.name = '#{strain_name}' OR "}.join("").sub(/ OR $/, "")

-
+  Snp.find_by_sql("SELECT * from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}")
 end

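`find_shared_snps` builds its WHERE clause by interpolating each strain name into the SQL text, so a name containing a quote character would break the statement. A possible alternative, sketched here as an assumption rather than as the gem's approach, is to generate one `?` placeholder per name and hand the names over separately; `strain_name_condition` is a hypothetical helper:

```ruby
# Hypothetical helper: returns a WHERE fragment with one "?" placeholder per
# strain name, plus the names themselves as bind values.
def strain_name_condition(strain_names)
  clause = strain_names.map { "strains.name = ?" }.join(" OR ")
  [clause, strain_names]
end

# Made-up strain names; the second contains a quote on purpose.
clause, binds = strain_name_condition(["strain_A", "O'Brien_1"])
puts clause
# With ActiveRecord, the pair could then be passed in array form, e.g.
#   Snp.find_by_sql(["SELECT ... WHERE (#{clause})", *binds])
# so that quoting of the names is left to the database adapter.
```

Joining with " OR " also removes the need to strip a trailing " OR " afterwards, as the original `sub` call does.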
+def synonymous(sequence_file)
+
+  #Reference Sequence
+  genome_sequence = Bio::FlatFile.open(Bio::GenBank, sequence_file).next_entry
+
+  #Extract all nucleotide sequence from ORIGIN
+  all_seqs_original = genome_sequence.seq
+  ref_bases =[]
+
+  strains = Strain.all
+
+  strains_hash = Hash.new
+  # create a sequence hash
+  # hash key is strain_id, loop through strain_id
+  # create an empty array
+  strains.each do |strain|
+    strains_hash[strain.id] = Array.new
+  end
+
+  variants = Feature.find_by_sql("select distinct features.* from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id inner join genotypes on alleles.id = genotypes.allele_id inner join strains on strains.id = genotypes.strain_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'")
+
+  puts "start_cds_in_ref\tend_cds_in_ref\tpos_of_SNP_in_ref\tref_base\tSNP_base\tsynonymous or non-synonymous\tamino_acid_original\tamino_acid_change\tpossible_pseudogene?\tchange_in_hydrophobicity_of_AA?\tchange_in_polarisation_of_AA?\tchange_in_size_of_AA?"
+
+
+  variants.each do |variant|
+    variant.snps.each do |snp|
+      snp.alleles.each do |allele|
+        if allele.id != snp.reference_allele_id
+          all_seqs_mutated = genome_sequence.seq
+          mutated_seq_translated = []
+          original_seq_translated = []
+          all_seqs_mutated[snp.ref_pos.to_i-1] = allele.base
+
+          mutated_seq = Bio::Sequence.auto(all_seqs_mutated[variant.start-1..variant.end-1])
+          original_seq = Bio::Sequence.auto(all_seqs_original[variant.start-1..variant.end-1])
+
+          if variant.strand == -1
+            mutated_seq_translated << mutated_seq.reverse_complement.translate
+            original_seq_translated << original_seq.reverse_complement.translate
+
+          else
+            mutated_seq_translated << mutated_seq.translate
+            original_seq_translated << original_seq.translate
+
+          end
+
+          mutated_seq_translated.zip(original_seq_translated).each do |mut, org|
+            mutated_seq_translated_clean = mut.gsub(/\*$/,"")
+            original_seq_translated_clean = org.gsub(/\*$/,"")
+
+            hydrophobic = ["I", "L", "V", "C", "A", "G", "M", "F", "Y", "W", "H", "T"]
+            non_hydrophobic = ["K", "E", "Q", "D", "N", "S", "P", "B"]
+
+            polar = ["Y", "W", "H", "K", "R", "E", "Q", "D", "N", "S", "P", "B"]
+            non_polar = ["I", "L", "V", "C", "A", "G", "M", "F", "T"]
+
+            small = ["V","C","A","G","D","N","S","T","P"]
+            non_small = ["I","L","M","F","Y","W","H","K","R","E","Q"]
+
+            if original_seq_translated_clean == mutated_seq_translated_clean
+              # if original_seq_translated == mutated_seq_translated
+              if mutated_seq_translated_clean =~ /\*/
+                puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous\t\t\tYes"
+              else
+                puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous"
+              end
+            else
+
+              diffs = Diff::LCS.diff(original_seq_translated_clean, mutated_seq_translated_clean)
+
+              if mutated_seq_translated_clean =~ /\*/
+                puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\tYes\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+              else
+                puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+              end
+            end
+          end
+
+        end
+      end
+    end
+  end
+
+  #Take all SNP positions in ref genome
+  # snp_positions = Feature.find_by_sql("select snps.ref_pos from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|snp| snp.ref_pos}
+
+  # # Take all SNP nucleotide
+  # snps = Feature.find_by_sql("select alleles.base from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|allele| allele.base}
+
+  # # Mutate (substitute) the original sequence with the SNPs
+
+  # # Here all_seqs_original are all the nucelotide sequences but with the snps subsituted in them
+
+  # #Get start position of CDS with SNP
+  # coordinates_start = Feature.find_by_sql("select start from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.start}

+  # #Get end position of CDS with SNP
+  # coordinates_end = Feature.find_by_sql("select end from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.end}

+
+end
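The last three output columns of `synonymous` report whether an amino acid substitution changes hydrophobicity, polarity or size. The diff does this by comparing membership in paired lists, for example `hydrophobic` against `non_hydrophobic`. The sketch below shows a simpler formulation that tests a single group per property; this is an alternative assumed for illustration, not the released logic, and it can give different answers for residues that appear in neither list of a pair (R is absent from both hydrophobicity lists in the diff). The group contents are copied from the diff, and `property_changed?` is a hypothetical name.

```ruby
# Residue groups copied from the lists in the diff above.
HYDROPHOBIC = %w[I L V C A G M F Y W H T].freeze
POLAR       = %w[Y W H K R E Q D N S P B].freeze
SMALL       = %w[V C A G D N S T P].freeze

# Hypothetical predicate: true when exactly one of the two residues
# belongs to the group, i.e. the substitution crosses the group boundary.
def property_changed?(original_aa, mutated_aa, group)
  group.include?(original_aa) != group.include?(mutated_aa)
end

puts property_changed?("L", "K", HYDROPHOBIC) # leucine -> lysine
puts property_changed?("L", "I", HYDROPHOBIC) # leucine -> isoleucine
```

With a single group per property, a residue is either inside or outside it, so no substitution falls between two lists.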
data/lib/snp_db_schema.rb
CHANGED
@@ -21,12 +21,13 @@ ActiveRecord::Schema.define do
     create_table :snps do |t|
       t.column :feature_id, :integer
       t.column :ref_pos, :integer
+      t.column :qual, :float
       t.column :reference_allele_id, :integer
     end
   end

   unless table_exists? :alleles
-    create_table :alleles do |t|
+    create_table :alleles do |t|
       t.column :snp_id, :integer
       t.column :base, :string
     end
@@ -36,6 +37,7 @@ ActiveRecord::Schema.define do
     create_table :genotypes do |t|
       t.column :allele_id, :integer
       t.column :strain_id, :integer
+      t.column :geno_qual, :float
     end
   end
 
@@ -66,18 +68,24 @@ ActiveRecord::Schema.define do
   unless index_exists? :snps, :feature_id
     add_index :snps, :feature_id
   end
+  unless index_exists? :snps, :qual
+    add_index :snps, :qual
+  end
   unless index_exists? :alleles, :snp_id
     add_index :alleles, :snp_id
   end
   unless index_exists? :alleles, :base
     add_index :alleles, :base
   end
-
+  unless index_exists? :genotypes, :allele_id
     add_index :genotypes, :allele_id
   end
   unless index_exists? :genotypes, :strain_id
     add_index :genotypes, :strain_id
   end
+  unless index_exists? :genotypes, :geno_qual
+    add_index :genotypes, :geno_qual
+  end
   unless index_exists? :annotations, :feature_id
     add_index :annotations, :feature_id
   end
data/snp-search.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "snp-search"
-  s.version = "
+  s.version = "2.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Ali Al-Shahib", "Anthony Underwood"]
-  s.date = "2012-
+  s.date = "2012-08-02"
   s.description = "Use the snp-search tool to create, import, manipulate and query your SNP database"
   s.email = "ali.al-shahib@hpa.org.uk"
   s.executables = ["snp-search"]
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: snp-search
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.0
 prerelease:
 platform: ruby
 authors:
@@ -10,11 +10,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-
+date: 2012-08-02 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
-  requirement: &
+  requirement: &2165264520 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -22,10 +22,10 @@ dependencies:
         version: 3.1.3
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165264520
 - !ruby/object:Gem::Dependency
   name: bio
-  requirement: &
+  requirement: &2165263760 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -33,10 +33,10 @@ dependencies:
         version: 1.4.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165263760
 - !ruby/object:Gem::Dependency
   name: slop
-  requirement: &
+  requirement: &2165262760 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -44,10 +44,10 @@ dependencies:
         version: 2.4.0
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165262760
 - !ruby/object:Gem::Dependency
   name: sqlite3
-  requirement: &
+  requirement: &2165261900 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -55,10 +55,10 @@ dependencies:
         version: 1.3.4
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165261900
 - !ruby/object:Gem::Dependency
   name: activerecord-import
-  requirement: &
+  requirement: &2165260720 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -66,10 +66,10 @@ dependencies:
         version: 0.2.8
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165260720
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &2165259280 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -77,10 +77,10 @@ dependencies:
         version: 2.3.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165259280
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &
+  requirement: &2165258160 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -88,10 +88,10 @@ dependencies:
         version: 1.0.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165258160
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &
+  requirement: &2165257000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -99,10 +99,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165257000
 - !ruby/object:Gem::Dependency
   name: rcov
-  requirement: &
+  requirement: &2165255880 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -110,7 +110,7 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2165255880
 description: Use the snp-search tool to create, import, manipulate and query your
   SNP database
 email: ali.al-shahib@hpa.org.uk
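Most of the dependencies listed in the metadata above use the pessimistic `~>` operator (for example activerecord `~> 3.1.3`). As a small illustration of what such a constraint accepts, the sketch below checks a few candidate versions with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The candidate version numbers are made up for the example and are not taken from the gem.

```ruby
require "rubygems"

# "~> 3.1.3" accepts 3.1.x releases at or above 3.1.3, but not 3.2.0.
requirement = Gem::Requirement.new("~> 3.1.3")

# Illustrative candidates only; they are not versions named in the diff.
%w[3.1.2 3.1.3 3.1.9 3.2.0].each do |candidate|
  ok = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{ok ? 'allowed' : 'rejected'}"
end
```

By contrast, the `>= 0` requirement used for rcov accepts any released version.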
@@ -153,7 +153,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash:
+      hash: 1607617824420065040
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: