snp-search 1.0.0 → 2.0.0

data/README CHANGED
@@ -0,0 +1,105 @@
1
+ = snp-search
2
+
3
+ SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data. It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data. Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes. Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
4
+
5
+ == Obtaining and installing the code
6
+ SNPsearch is written in Ruby and runs in a Unix environment. It is distributed as a gem. See the GitHub site for more information (https://github.com/hpa-bioinformatics/snp-search).
7
+
8
+ To install snp-search, do
9
+ gem install snp-search
10
+
11
+ == Requirements
12
+
13
+ Not much, you just need:
14
+
15
+ * Unix. When snp-search is installed, the gems it depends on are installed from Rubygems as well. Note that installing gems system-wide usually requires admin privileges; if you do not have them, we suggest you install RVM (http://beginrescueend.com/rvm/install/) and then run gem install snp-search.
16
+ * ruby version 1.8.7 and above.
17
+
18
+ * Optional: FastTree. If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install. The directory containing the FastTree executable must be on your PATH (set it in your .bashrc or .profile file), because snp-search runs the command as just 'FastTree' and cannot find the executable otherwise.
19
+
20
+ That's it!
21
+
22
+ == Running snp-search
23
+
24
+ 1- Creating the database (snp-search -create)
25
+
26
+ Two files are needed to create the SQLite3 database:
27
+
28
+ 1- Variant Call Format (.vcf) file (which contains the SNP information)
29
+
30
+ 2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
31
+
32
+ You need the following parameters:
33
+
34
+ -n Name of your database
35
+ -v .vcf file
36
+ -d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
37
+
38
+ Other options:
39
+ -c SNP quality score cutoff. A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
40
+ -g Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true. Optional, default = 30
41
+ -h help message
42
+
43
+ Usage:
44
+ snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
45
+
46
+ Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
47
+
48
+ 2- Querying the Database (snp-search -query)
49
+
50
+ Two queries are currently scripted in SNPsearch:
51
+
52
+ 1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
53
+
54
+ You need the following parameters:
55
+
56
+ -n Name of your database
57
+ -s The strains/samples you like to query
58
+
59
+ Usage:
60
+ snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
61
+
62
+ 2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
63
+
64
+ You need the following parameters:
65
+
66
+ -n Name of your database
67
+ -a The gene you like to remove from analysis
68
+ -o Output file, in fasta format
69
+
70
+ options:
71
+ -t Generate SNP phylogeny
72
+ -w Output tree in Newick format
73
+
74
+ Usage (phage is used as the example gene):
75
+ snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
76
+
77
+ The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
78
+
79
+ 3- Output database (snp-search -out_file)
80
+
81
+ You need the following parameters:
82
+
83
+ -n Name of your database
84
+ -o Output file containing the database in fasta format
85
+
86
+ == View database in Unix or in a GUI
87
+ Your database will be in sqlite3 format. If you would like to view your table(s) and run queries directly, you can type
88
+ sqlite3 snp_db.sqlite3
89
+
90
+ Alternatively, you may download a SQL tool to view your database (e.g. SQLite sorcerer).
91
+
92
+ == Contact
93
+
94
+ If you have any comments, questions or suggestions, please email
95
+ ali.al-shahib@hpa.org.uk
96
+ or
97
+ anthony.underwood@hpa.org.uk
98
+
99
+ Have fun snp-searching!
100
+
101
+ == Copyright
102
+
103
+ Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
104
+ further details.
105
+
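The README notes that strain names are taken from the .vcf file. As a rough illustration of where those names live, the sketch below reads the `#CHROM` header line of a VCF stream and returns the sample columns. It assumes the standard VCF layout of nine fixed columns followed by one column per sample; `sample_names_from_vcf` is a hypothetical helper and is not part of snp-search.

```ruby
require 'stringio'

# Hypothetical helper (not part of snp-search): returns the sample names
# found on the "#CHROM" header line of a VCF stream.
# Assumes the standard layout: 9 fixed columns, then one column per sample.
def sample_names_from_vcf(io)
  io.each_line do |line|
    next unless line.start_with?("#CHROM")
    return line.chomp.split("\t")[9..-1] || []
  end
  [] # no header line found
end

# Small in-memory VCF used only for illustration.
vcf = StringIO.new(<<VCF)
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrain_A\tstrain_B
chr1\t100\t.\tG\tA\t99\t.\t.\tGT:PL:GQ\t0/0:0,255,209:99\t1/1:255,30,0:99
VCF

puts sample_names_from_vcf(vcf).inspect # prints ["strain_A", "strain_B"]
```

Checking these names before running -create is a cheap way to confirm that the strains will be labelled as intended in the database.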
data/README.rdoc CHANGED
@@ -49,7 +49,7 @@ You need the following parameters:
49
49
 
50
50
  Two queries are currently scripted in SNPsearch:
51
51
 
52
- 1- genes_query: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
52
+ 1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
53
53
 
54
54
  You need the following parameters:
55
55
 
@@ -59,7 +59,7 @@ You need the following parameters:
59
59
  Usage:
60
60
  snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
61
61
 
62
- 2- remove_genes: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
62
+ 2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
63
63
 
64
64
  You need the following parameters:
65
65
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.0.0
1
+ 2.0.0
data/bin/snp-search CHANGED
@@ -3,60 +3,74 @@ require 'snp_db_connection'
3
3
  require 'snp_db_models'
4
4
  require 'snp_db_schema'
5
5
  require 'activerecord-import'
6
- # gem "slop", "~> 3.1.0"
7
- gem "slop", "~> 2.4.0"
6
+ gem "slop", "~> 3.3.1"
7
+ # gem "slop", "~> 2.4.0"
8
8
  require 'slop'
9
9
 
10
- opts = Slop.new do
10
+ opts = Slop.parse do
11
11
 
12
- # separator 'test'
12
+ banner "\nruby snp-search [-create] [-query] [-output] [-n <sqlite3>] [options]*"
13
+ separator ''
13
14
 
14
- banner "\nruby snp-search [OPTIONS]"
15
15
  on :C, :create, 'Create database'
16
16
  on :Q, :query, 'Query database'
17
- on :O, :out_file, 'Output the database to a file'
18
- # separator ''
17
+ on :O, :output, 'Output options'
18
+ separator ''
19
19
  # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
20
20
  # separator 'The following command must be used when using -create, or -query or -out_file'
21
21
  on :n, :name=, 'Name of database, Required'
22
- # separator ''
23
- # separator '-create options'
24
- on :d, :database_reference_file, 'Reference genome file, in gbk or embl file format, Required', true
25
- on :v, :vcf_file, '.vcf file, Required', true
26
- on :c, :cuttoff_snp, 'SNP quality cutoff, (default = 90)', :default => 90
27
- on :g, :cuttoff_genotype, 'Genotype quality cutoff (default = 30)', :default => 30
28
- # separator ''
29
- # separator '-query options'
30
- on :G, :genes_query, 'Query for unique genes in the database'
31
- on :R, :remove_genes, 'Remove set of genes from database and create FASTA file'
22
+ separator ''
23
+
24
+ separator '-create options'
25
+ on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
26
+ on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
27
+ on :c, :cuttoff_snp=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
28
+ on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
29
+ separator ''
30
+
31
+ separator '-query options'
32
+ on :u, :unique_snps, 'Query for unique snps in the database'
33
+ on :r, :not_include_snps_from_gene, 'Remove SNPs from specified gene from database'
32
34
  on :s, :strain=, 'The strains/samples you like to query, Required'
33
35
  on :a, :annotation=, 'The gene you like to remove from analysis'
34
- on :o, :output=, 'output file, in fasta format'
36
+ separator ''
37
+
38
+ separator '-output [-fasta] [-syn] options'
39
+ on :f, :fasta, 'output fasta file'
40
+ on :S, :syn, 'output tab-delimited file with synonymous and non-synonymous info'
41
+ on :o, :out=, 'Name of output file'
35
42
  on :t, :tree, 'Generate SNP phylogeny'
36
- on :w, :tree_nwk_output=, 'output tree in Newick format'
37
- on :S, :syn, 'syn'
43
+ on :w, :nwk_out=, 'Name of output tree in Newick format'
44
+
38
45
  end
39
- opts.parse
46
+ # opts.end
40
47
 
41
48
  ###########################################################
42
49
 
43
50
  # CREATING A DATABASE
44
51
  if opts[:create]
45
52
 
46
-
53
+ # puts opts[:cuttoff_snp].to_i
54
+
47
55
  error_msg = ""
48
56
 
49
- error_msg += "-n option: \t the name of your database\n" unless opts[:name]
50
- error_msg += "-d option: \t reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
51
- error_msg += "-v option: \t .vcf file\n" unless opts[:vcf_file]
57
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
58
+ error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
59
+ error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
52
60
 
53
- unless error_msg == ""
54
- puts "Please provide the following required fields:"
55
- puts error_msg
56
- puts opts.help unless opts.empty?
57
- exit
58
- end
59
-
61
+ error_msg_optional = ""
62
+
63
+ error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
64
+ error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
65
+
66
+ unless error_msg == ""
67
+ puts "Please provide the following required fields:"
68
+ puts error_msg
69
+ puts "Optional fields:"
70
+ puts error_msg_optional
71
+ puts opts.help unless opts.empty?
72
+ exit
73
+ end
60
74
 
61
75
  abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
62
76
 
@@ -67,7 +81,7 @@ if opts[:create]
67
81
  establish_connection(opts[:name])
68
82
 
69
83
  # Schema will run here
70
- db_schema
84
+ db_schema
71
85
 
72
86
  ref = opts[:database_reference_file]
73
87
 
@@ -87,22 +101,25 @@ if opts[:create]
87
101
  vcf_mpileup_file = opts[:vcf_file]
88
102
 
89
103
  # The populate_features_and_annotations method populates the features and annotations. It uses the embl/gbk file.
90
- populate_features_and_annotations(sequence_flatfile)
104
+ populate_features_and_annotations(sequence_flatfile)
91
105
 
92
106
  #The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes. It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
93
- populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp].to_i, opts[:cuttoff_genotype].to_i)
107
+
108
+ populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp], opts[:cuttoff_genotype])
109
+
110
+ # puts "populate_snps_alleles_genotypes(#{vcf_mpileup_file}, #{opts[:cuttoff_snp]}, #{opts[:cuttoff_genotype]}.to_i)"
94
111
 
95
112
  ###########################################################
96
113
 
97
114
  # QUERYING THE DATABASE
98
115
  elsif opts [:query]
99
116
  #FIND UNIQUE SNPS
100
- if opts[:genes_query]
117
+ if opts[:unique_snps]
101
118
 
102
119
  error_msg = ""
103
120
 
104
- error_msg += "-n option, \t the name of your database\n" unless opts[:name]
105
- error_msg += "-s option, \t list of strains you like to query\n" unless opts[:strain]
121
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
122
+ error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
106
123
 
107
124
  unless error_msg == ""
108
125
  puts "Please provide the following required fields:"
@@ -126,32 +143,39 @@ elsif opts [:query]
126
143
  gas_snps = find_shared_snps(strains)
127
144
 
128
145
  gas_snps.each do |snp|
129
- puts "The number of unique snps are #{snp.id}.size"
146
+ puts "The number of unique snps are #{snp.id}"
130
147
  end
131
148
 
132
149
  ################################################################
133
150
  # REMOVE SNPS ASSOCIATED WITH SPECIFIC GENES
134
- elsif opts[:remove_genes]
151
+ elsif opts[:not_include_snps_from_gene]
135
152
 
136
153
  error_msg = ""
137
154
 
138
- error_msg += "-n option: \t the name of your database\n" unless opts[:name]
139
- error_msg += "-o option: \t name of your output file\n" unless opts[:output]
140
- error_msg += "-a option: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
155
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
156
+ error_msg += "-o: \t name of your output file\n" unless opts[:out]
157
+ error_msg += "-a: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
158
+
159
+ error_msg_optional = ""
160
+
161
+ error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
162
+ error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
141
163
 
142
164
  unless error_msg == ""
143
165
  puts "Please provide the following required fields:"
144
166
  puts error_msg
167
+ puts "Optional fields:"
168
+ puts error_msg_optional
145
169
  puts opts.help unless opts.empty?
146
170
  exit
147
171
  end
148
-
172
+
173
+
149
174
  abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
150
175
 
151
176
  # annotation = opts[:annotation]
152
177
  establish_connection(opts[:name])
153
178
 
154
-
155
179
  # Getting list of strains from database
156
180
  strains = Strain.all
157
181
 
@@ -164,7 +188,7 @@ elsif opts [:query]
164
188
  end
165
189
 
166
190
  # output opened for data input
167
- output = File.open("#{opts[:output]}", "w")
191
+ output = File.open("#{opts[:out]}", "w")
168
192
 
169
193
  # Perform query
170
194
  snps = Snp.includes(:alleles => :genotypes).find_by_sql("SELECT snps.* FROM snps INNER JOIN features ON features.id = snps.feature_id WHERE features.id NOT IN (select distinct features.id FROM features INNER JOIN annotations ON annotations.feature_id = features.id WHERE annotations.value LIKE '%#{opts[:annotation]}%')")
@@ -175,16 +199,16 @@ elsif opts [:query]
175
199
  # puts snp.inspect
176
200
  i += 1
177
201
  puts "Total number of SNPs generated so far: #{i}" if i % 100 == 0
178
- ActiveRecord::Base.transaction do
179
- snp.alleles.each do |allele|
180
- # puts allele.inspect
181
- allele.genotypes.each do |genotype|
182
- #push bases to hash
183
- sequence_hash[genotype.strain_id] << allele.base
202
+ ActiveRecord::Base.transaction do
203
+ snp.alleles.each do |allele|
204
+ # puts allele.inspect
205
+ allele.genotypes.each do |genotype|
206
+ #push bases to hash
207
+ sequence_hash[genotype.strain_id] << allele.base
208
+ end
209
+ end
184
210
  end
185
- end
186
211
  end
187
- end
188
212
 
189
213
  #generate FASTA file
190
214
  strains.each do |strain|
@@ -192,30 +216,42 @@ elsif opts [:query]
192
216
  output.puts
193
217
  end
194
218
 
219
+ # GENERATE TREE FROM FASTA FILE
195
220
  if opts[:tree]
196
- `FastTree -fastest -nt #{opts[:output]} > #{opts[:w]}`
221
+ `FastTree -fastest -nt #{opts[:out]} > #{opts[:nwk_out]}`
197
222
  end
223
+
224
+ else
225
+ puts "use -unique_snps or -not_include_snps_from_gene query options"
198
226
  end
199
227
 
200
- ##############################################################
228
+ # ##############################################################
201
229
 
202
230
  # OUTPUT DATABASE IN FASTA FORMAT
203
- elsif opts[:out_file]
231
+ elsif opts[:output]
232
+ if opts[:fasta]
204
233
  error_msg = ""
205
234
 
206
- error_msg += "-n option: \t the name of your database\n" unless opts[:name]
207
- error_msg += "-o option: \t name of your output file\n" unless opts[:output]
235
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
236
+ error_msg += "-o: \t name of your output file (in FASTA format)\n" unless opts[:out]
237
+
238
+ error_msg_optional = ""
239
+
240
+ error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
241
+ error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
208
242
 
209
243
  unless error_msg == ""
210
244
  puts "Please provide the following required fields:"
211
245
  puts error_msg
246
+ puts "Optional fields:"
247
+ puts error_msg_optional
212
248
  puts opts.help unless opts.empty?
213
249
  exit
214
250
  end
215
251
 
216
252
  abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
217
253
 
218
- establish_connection(opts[:name])
254
+ establish_connection(opts[:name])
219
255
 
220
256
  # Getting list of strains from database
221
257
  strains = Strain.all
@@ -229,28 +265,30 @@ elsif opts[:out_file]
229
265
  end
230
266
 
231
267
 
232
- output = File.open("#{opts[:output]}", "w")
268
+ output = File.open("#{opts[:out]}", "w")
233
269
 
234
270
  # Select all snps
235
271
  snps = Snp.all
236
-
272
+
237
273
  i = 0
238
274
  puts "Your out file is being prepared......."
239
275
  snps.each do |snp|
240
276
  i += 1
241
277
  puts "Total number of SNPs outputted so far: #{i}" if i % 100 == 0
242
278
 
243
- ActiveRecord::Base.transaction do
244
- snp.alleles.each do |allele|
245
- # puts allele.inspect
246
- allele.genotypes.each do |genotype|
247
- #push bases to hash
248
- sequence_hash[genotype.strain_id] << allele.base
249
- end
279
+ ActiveRecord::Base.transaction do
280
+ snp.alleles.each do |allele|
281
+ # puts allele.inspect
282
+ allele.genotypes.each do |genotype|
283
+ #push bases to hash
284
+ sequence_hash[genotype.strain_id] << allele.base
285
+ end
286
+ end
250
287
  end
251
288
  end
252
289
 
253
-
290
+ puts sequence_hash
291
+ exit
254
292
  #generate FASTA file
255
293
  strains.each do |strain|
256
294
  output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
@@ -258,10 +296,36 @@ elsif opts[:out_file]
258
296
  end
259
297
 
260
298
  if opts[:tree]
261
- `FastTree -fastest -nt #{opts[:output]} > #{opts[:w]}`
299
+ # puts "FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}"
300
+ `FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}`
262
301
  end
263
302
  end
264
-
303
+
304
+ #########################################
305
+
306
+ if opts[:syn]
307
+ error_msg = ""
308
+
309
+ error_msg += "-n option: \t the name of your database\n" unless opts[:name]
310
+ error_msg += "-d option: \t the reference file in gbk format\n" unless opts[:database_reference_file]
311
+
312
+ unless error_msg == ""
313
+ puts "Please provide the following required fields:"
314
+ puts error_msg
315
+ puts opts.help unless opts.empty?
316
+ exit
317
+ end
318
+
319
+ abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
320
+ abort "#{opts[:database_reference_file]} vcf file does not exist!" unless File.exist?(opts[:database_reference_file])
321
+
322
+ establish_connection(opts[:name])
323
+
324
+ ref = opts[:database_reference_file]
325
+
326
+ synonymous(ref)
327
+ end
328
+
265
329
  else
266
- puts opts.help
330
+ puts opts.help
267
331
  end
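Each branch of the script above repeats the same pattern: build `error_msg` from the required options that are missing, then print it and exit. A minimal sketch of that check factored into one helper follows. A plain Hash stands in for the parsed Slop options object, and `missing_required` is a hypothetical name that does not appear in the gem.

```ruby
# Hypothetical helper (not in snp-search): returns one message line per
# required option that is absent (nil or false) in opts.
# `required` maps an option key to [flag, description].
def missing_required(opts, required)
  required.reject { |key, _| opts[key] }
          .map { |_, (flag, description)| "#{flag}: \t #{description}\n" }
end

# A plain Hash stands in for the parsed command-line options here.
opts = { :name => "my_snp_db.sqlite3", :vcf_file => nil }

required_for_create = {
  :name                    => ["-n", "Name of your database"],
  :database_reference_file => ["-d", "Reference genome file, in gbk or embl file format"],
  :vcf_file                => ["-v", ".vcf file"]
}

missing = missing_required(opts, required_for_create)
unless missing.empty?
  puts "Please provide the following required fields:"
  puts missing.join
end
```

Keeping the flag and description in one table per mode would let the -create, -query and -output branches share the same reporting code.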
data/lib/snp-search.rb CHANGED
@@ -3,6 +3,7 @@ gem "bio", "~> 1.4.2"
3
3
  require 'bio'
4
4
  require 'snp_db_models'
5
5
  require 'activerecord-import'
6
+ require 'diff/lcs'
6
7
 
7
8
  #This method guesses the reference sequence file format
8
9
  def guess_sequence_format(reference_genome)
@@ -50,6 +51,7 @@ end
50
51
 
51
52
  #This method populates the rest of the information, i.e. SNP information, Alleles and Genotypes.
52
53
  def populate_snps_alleles_genotypes(vcf_file, cuttoff_snp, cuttoff_genotype)
54
+
53
55
  puts "Adding SNPs........"
54
56
  # open vcf file and parse each line
55
57
  File.open(vcf_file) do |f|
@@ -86,37 +88,56 @@ puts "Adding SNPs........"
86
88
  format = details[8].split(":")
87
89
  gt = format.index("GT")
88
90
  gq = format.index("GQ")
91
+ # dp = format.index("DP")
89
92
  samples = details[9..-1]
90
-
91
- next if ref_base.size != 1 || snp_base.size != 1 # exclude indels
93
+
94
+ next if ref_base.size != 1 || snp_base.size != 1 # exclude indels (e.g. G,A in REF)
92
95
  genotypes = samples.map do |s|
93
- format_values = s.chomp.split(":")
94
- format_values[gt]
96
+ format_values = s.chomp.split(":") # output (e.g.): 0/0 \n 0,255,209 \n 99
97
+ format_values[gt] # e.g. 0/0
95
98
  end
96
99
 
97
100
  genotypes_qualities = samples.map do |s|
98
101
  format_values = s.chomp.split(":")
99
- format_values[gq]
102
+ format_values[gq] # e.g. 99
100
103
  end
101
104
 
102
- high_quality_variant_genotypes = Array.new # this will be filled with the indicies of genotypes that are "1/1" and have a quality >= 30
105
+ geno_quality_array = Array.new
106
+
107
+ high_quality_variant_genotypes = Array.new # this will be filled with the indicies of genotypes that are "1/1" and have a quality >= 30. Reminder: 0/0 is no SNP, 1/1 is SNP.
103
108
  variant_genotypes = Array.new
104
- genotypes.each_with_index do |gt, index|
109
+ genotypes.each_with_index do |gt, index| # indexes each 'genotypes'.
105
110
  if gt == "1/1"
106
- variant_genotypes << index
107
- if genotypes_qualities[index].to_i >= cuttoff_genotype
108
- high_quality_variant_genotypes << index
109
- end
110
- end
111
+ variant_genotypes << index # variant_genotypes is the position of genome positions that have a correct SNP with 1/1. if you want the total number of strains thats have 1/1 for that row (genome position) then puts variant_genotypes.size
112
+ if genotypes_qualities[index].to_i >= cuttoff_genotype.to_i
113
+ high_quality_variant_genotypes << index
114
+ end
115
+ end
111
116
  end
112
-
113
- if snp_qual.to_i >= cuttoff_snp && genotypes.include?("1/1") && ! high_quality_variant_genotypes.empty? && high_quality_variant_genotypes.size == variant_genotypes.size # first condition checks the overall quality of the SNP is >=90, second checks that at least one genome has the 'homozygous' 1/1 variant type with quality >= 30 and informative SNP
114
- if genotypes.include?("0/0") && !genotypes.include?("0/1") # exclude SNPs which are all 1/1 i.e something strange about ref and those which have confusing heterozygote 0/1s
117
+
118
+ genotypes_qualities.each do |gq|
119
+ if gq.to_i >= cuttoff_genotype.to_i
120
+ geno_quality_array << gq
121
+ end
122
+ end
123
+
124
+ # # high_quality_variant_genotypes is the position of 1/1 and genotype quality above cuttoff_genotype. high_quality_variant_genotypes.size will give you the number of 1/1 in a row (genome position) that is above the genotype quality cuttoff.
125
+ # puts "yay" if geno_quality_array.keep_if {|z| z <= cuttoff_genotype.to_i}
126
+
127
+ # next if geno_quality_array.each {|z| z.to_i < cuttoff_genotype.to_i}
128
+ next if samples.include?("./.")
129
+ next if geno_quality_array.size != strains.size
130
+ if snp_qual.to_i >= cuttoff_snp.to_i && genotypes.include?("1/1") && ! high_quality_variant_genotypes.empty? && high_quality_variant_genotypes.size == variant_genotypes.size
131
+ # first condition checks the overall quality of the SNP is >=90, second checks that at least one genome has the 'homozygous' 1/1 variant type with quality >= 30 and informative SNP
132
+
133
+ if genotypes.include?("0/0") && !genotypes.include?("0/1") # exclude SNPs which are all 1/1 i.e something strange about ref and those which have confusing heterozygote 0/1s
134
+
115
135
  good_snps +=1
116
136
  # puts good_snps
117
137
  #create snp
118
138
  s = Snp.new
119
139
  s.ref_pos = ref_pos
140
+ s.qual = snp_qual
120
141
  s.save
121
142
 
122
143
  # create ref allele
@@ -139,13 +160,14 @@ puts "Adding SNPs........"
139
160
  genotypes.each_with_index do |gt, index|
140
161
  genotype = Genotype.new
141
162
  genotype.strain = strains[index]
163
+ genotype.geno_qual = genotypes_qualities[index].to_i
142
164
  puts index if strains[index].nil?
143
165
  if gt == "0/0" # wild type
144
166
  genotype.allele = ref_allele
145
167
  elsif gt == "1/1" # snp type
146
168
  genotype.allele = snp_allele
147
- else
148
- puts "Strange SNP #{gt}"
169
+ else
170
+ puts "Strange SNP #{gt}"
149
171
  end
150
172
  genos << genotype
151
173
  end
@@ -154,6 +176,7 @@ puts "Adding SNPs........"
154
176
  puts "Total SNPs added so far: #{good_snps}" if good_snps % 100 == 0
155
177
  end
156
178
  end
179
+
157
180
  end
158
181
  end
159
182
  end
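The filtering in populate_snps_alleles_genotypes is spread over several nested conditions. The sketch below is a simplified reading of those checks for a single VCF row, written as one predicate. It leaves out the indel and missing-call (`./.`) checks, and `keep_snp?` is a hypothetical name, not a method of the gem. Because every sample must reach the genotype cutoff, the diff's comparison of high-quality 1/1 calls against all 1/1 calls is implied and not repeated here.

```ruby
# Hypothetical predicate (not part of snp-search) summarising the row-level
# checks visible in the diff. Indel and "./." handling are omitted.
def keep_snp?(snp_qual, genotypes, genotype_quals, cutoff_snp = 90, cutoff_genotype = 30)
  # Every sample must reach the genotype quality cutoff.
  return false unless genotype_quals.all? { |gq| gq.to_i >= cutoff_genotype }
  # The SNP call itself must reach the SNP quality cutoff.
  return false unless snp_qual.to_i >= cutoff_snp
  # Informative site: at least one variant (1/1) and one reference (0/0)
  # call, and no heterozygous (0/1) calls.
  genotypes.include?("1/1") && genotypes.include?("0/0") && !genotypes.include?("0/1")
end

puts keep_snp?("99", ["0/0", "1/1"], ["99", "45"]) # true
puts keep_snp?("99", ["1/1", "1/1"], ["99", "99"]) # false: no 0/0 call
puts keep_snp?("99", ["0/0", "1/1"], ["99", "12"]) # false: one GQ below cutoff
```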
@@ -172,8 +195,107 @@ def find_shared_snps(strain_names)
172
195
 
173
196
  where_statement = strain_names.collect{|strain_name| "strains.name = '#{strain_name}' OR "}.join("").sub(/ OR $/, "")
174
197
 
175
- puts "Snp.find_by_sql(\"SELECT * from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}\")"
198
+ Snp.find_by_sql("SELECT * from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}")
176
199
  end
177
200
 
201
+ def synonymous(sequence_file)
202
+
203
+ #Reference Sequence
204
+ genome_sequence = Bio::FlatFile.open(Bio::GenBank, sequence_file).next_entry
205
+
206
+ #Extract all nucleotide sequence from ORIGIN
207
+ all_seqs_original = genome_sequence.seq
208
+ ref_bases =[]
209
+
210
+ strains = Strain.all
211
+
212
+ strains_hash = Hash.new
213
+ # create a sequence hash
214
+ # hash key is strain_id, loop through strain_id
215
+ # create an empty array
216
+ strains.each do |strain|
217
+ strains_hash[strain.id] = Array.new
218
+ end
219
+
220
+ variants = Feature.find_by_sql("select distinct features.* from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id inner join genotypes on alleles.id = genotypes.allele_id inner join strains on strains.id = genotypes.strain_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'")
221
+
222
+ puts "start_cds_in_ref\tend_cds_in_ref\tpos_of_SNP_in_ref\tref_base\tSNP_base\tsynonymous or non-synonymous\tamino_acid_original\tamino_acid_change\tpossible_pseudogene?\tchange_in_hydrophobicity_of_AA?\tchange_in_polarisation_of_AA?\tchange_in_size_of_AA?"
223
+
224
+
225
+ variants.each do |variant|
226
+ variant.snps.each do |snp|
227
+ snp.alleles.each do |allele|
228
+ if allele.id != snp.reference_allele_id
229
+ all_seqs_mutated = genome_sequence.seq
230
+ mutated_seq_translated = []
231
+ original_seq_translated = []
232
+ all_seqs_mutated[snp.ref_pos.to_i-1] = allele.base
233
+
234
+ mutated_seq = Bio::Sequence.auto(all_seqs_mutated[variant.start-1..variant.end-1])
235
+ original_seq = Bio::Sequence.auto(all_seqs_original[variant.start-1..variant.end-1])
236
+
237
+ if variant.strand == -1
238
+ mutated_seq_translated << mutated_seq.reverse_complement.translate
239
+ original_seq_translated << original_seq.reverse_complement.translate
240
+
241
+ else
242
+ mutated_seq_translated << mutated_seq.translate
243
+ original_seq_translated << original_seq.translate
244
+
245
+ end
246
+
247
+ mutated_seq_translated.zip(original_seq_translated).each do |mut, org|
248
+ mutated_seq_translated_clean = mut.gsub(/\*$/,"")
249
+ original_seq_translated_clean = org.gsub(/\*$/,"")
250
+
251
+ hydrophobic = ["I", "L", "V", "C", "A", "G", "M", "F", "Y", "W", "H", "T"]
252
+ non_hydrophobic = ["K", "E", "Q", "D", "N", "S", "P", "B"]
253
+
254
+ polar = ["Y", "W", "H", "K", "R", "E", "Q", "D", "N", "S", "P", "B"]
255
+ non_polar = ["I", "L", "V", "C", "A", "G", "M", "F", "T"]
256
+
257
+ small = ["V","C","A","G","D","N","S","T","P"]
258
+ non_small = ["I","L","M","F","Y","W","H","K","R","E","Q"]
259
+
260
+             if original_seq_translated_clean == mutated_seq_translated_clean
+               # if original_seq_translated == mutated_seq_translated
+               if mutated_seq_translated_clean =~ /\*/
+                 puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous\t\t\tYes"
+               else
+                 puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous"
+               end
+             else
+
+               diffs = Diff::LCS.diff(original_seq_translated_clean, mutated_seq_translated_clean)
+
+               if mutated_seq_translated_clean =~ /\*/
+                 puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\tYes\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+               else
+                 puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+               end
+             end
+           end
+
+         end
+       end
+     end
+   end
+
+   #Take all SNP positions in ref genome
+   # snp_positions = Feature.find_by_sql("select snps.ref_pos from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|snp| snp.ref_pos}
+
+   # # Take all SNP nucleotide
+   # snps = Feature.find_by_sql("select alleles.base from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|allele| allele.base}
+
+   # # Mutate (substitute) the original sequence with the SNPs
+
+   # # Here all_seqs_original are all the nucelotide sequences but with the snps subsituted in them
+
+   # #Get start position of CDS with SNP
+   # coordinates_start = Feature.find_by_sql("select start from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.start}

+   # #Get end position of CDS with SNP
+   # coordinates_end = Feature.find_by_sql("select end from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.end}

+
+ end
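The added lines above classify each SNP by comparing the translated coding sequence before and after the substitution, and flag a stop codon when the mutated translation contains `*`. The following is a minimal, standalone sketch of that idea, not the gem's own code. It assumes a tiny hypothetical codon table (`CODON_TABLE`), a 0-based offset within the CDS, and a direct position-by-position comparison in place of `Diff::LCS.diff`. The names `translate`, `classify_snp` and `property_changed?` are made up for illustration, and the input CDS is assumed to have its terminal stop codon already removed.

```ruby
# Hypothetical codon table holding only the codons used in this sketch.
CODON_TABLE = {
  "GAA" => "E", "GAG" => "E", "AAA" => "K", "GAT" => "D", "TAA" => "*"
}.freeze

# Translate a DNA string codon by codon; codons outside the table become "X".
def translate(dna)
  dna.upcase.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon, "X") }.join
end

# Substitute alt_base at the 0-based offset within the CDS, then compare
# the original and mutated protein translations.
def classify_snp(cds, offset, alt_base)
  mutated = cds.dup
  mutated[offset] = alt_base
  original_protein = translate(cds)
  mutated_protein = translate(mutated)
  result = { stop_codon: mutated_protein.include?("*") }
  return result.merge(type: "synonymous") if original_protein == mutated_protein

  # First residue that differs between the two translations.
  index = (0...original_protein.length).find { |i| original_protein[i] != mutated_protein[i] }
  result.merge(type: "non-synonymous", from: original_protein[index], to: mutated_protein[index])
end

# True when exactly one of the two residues belongs to the given property set
# (for example a list of hydrophobic amino acids supplied by the caller).
def property_changed?(property_set, from, to)
  property_set.include?(from) != property_set.include?(to)
end

p classify_snp("GAAGAT", 2, "G") # GAA -> GAG, both code for E
p classify_snp("GAAGAT", 0, "A") # GAA -> AAA, E becomes K
p classify_snp("GAAGAT", 0, "T") # GAA -> TAA, introduces a stop codon
```

`property_changed?` takes a single property list, whereas the diff compares membership in two complementary lists (`hydrophobic` and `non_hydrophobic`, and so on). The two approaches agree only if those lists really are complements of each other.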
data/lib/snp_db_schema.rb CHANGED
@@ -21,12 +21,13 @@ ActiveRecord::Schema.define do
      create_table :snps do |t|
        t.column :feature_id, :integer
        t.column :ref_pos, :integer
+       t.column :qual, :float
        t.column :reference_allele_id, :integer
      end
    end

    unless table_exists? :alleles
-     create_table :alleles do |t|name
+     create_table :alleles do |t|
        t.column :snp_id, :integer
        t.column :base, :string
      end
@@ -36,6 +37,7 @@ ActiveRecord::Schema.define do
      create_table :genotypes do |t|
        t.column :allele_id, :integer
        t.column :strain_id, :integer
+       t.column :geno_qual, :float
      end
    end

@@ -66,18 +68,24 @@ ActiveRecord::Schema.define do
    unless index_exists? :snps, :feature_id
      add_index :snps, :feature_id
    end
+   unless index_exists? :snps, :qual
+     add_index :snps, :qual
+   end
    unless index_exists? :alleles, :snp_id
      add_index :alleles, :snp_id
    end
    unless index_exists? :alleles, :base
      add_index :alleles, :base
    end
-   unless index_exists? :genotypes, :allele_id
+   unless index_exists? :genotypes, :allele_id
      add_index :genotypes, :allele_id
    end
    unless index_exists? :genotypes, :strain_id
      add_index :genotypes, :strain_id
    end
+   unless index_exists? :genotypes, :geno_qual
+     add_index :genotypes, :geno_qual
+   end
    unless index_exists? :annotations, :feature_id
      add_index :annotations, :feature_id
    end
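The schema change above adds a float `qual` column to `snps` and a float `geno_qual` column to `genotypes`, each with an index. One plausible use for such columns is filtering records against a quality cutoff. The sketch below illustrates that with plain Ruby structs and no database. The `Snp` and `Genotype` structs, the `filter_by_quality` helper and the cutoff values are all assumptions made for illustration and do not come from the gem.

```ruby
# Stand-ins for database rows; only the columns relevant here are modelled.
Snp = Struct.new(:id, :ref_pos, :qual)
Genotype = Struct.new(:allele_id, :strain_id, :geno_qual)

# Keep records whose quality field is present and at or above the cutoff.
def filter_by_quality(records, field, cutoff)
  records.select { |record| !record[field].nil? && record[field] >= cutoff }
end

snps = [
  Snp.new(1, 100, 90.0),
  Snp.new(2, 250, 12.5),
  Snp.new(3, 400, nil) # no quality recorded
]
genotypes = [
  Genotype.new(10, 1, 45.0),
  Genotype.new(11, 2, 8.0)
]

p filter_by_quality(snps, :qual, 30.0).map(&:id)
p filter_by_quality(genotypes, :geno_qual, 20.0).map(&:allele_id)
```

In a real database the same filter would be a `WHERE qual >= ?` condition, which is the kind of query the new indexes on `qual` and `geno_qual` would support.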
data/snp-search.gemspec CHANGED
@@ -5,11 +5,11 @@

  Gem::Specification.new do |s|
    s.name = "snp-search"
-   s.version = "1.0.0"
+   s.version = "2.0.0"

    s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
    s.authors = ["Ali Al-Shahib", "Anthony Underwood"]
-   s.date = "2012-05-10"
+   s.date = "2012-08-02"
    s.description = "Use the snp-search tool to create, import, manipulate and query your SNP database"
    s.email = "ali.al-shahib@hpa.org.uk"
    s.executables = ["snp-search"]
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: snp-search
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 2.0.0
  prerelease:
  platform: ruby
  authors:
@@ -10,11 +10,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-05-10 00:00:00.000000000Z
+ date: 2012-08-02 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
-   requirement: &2165230340 !ruby/object:Gem::Requirement
+   requirement: &2165264520 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -22,10 +22,10 @@ dependencies:
          version: 3.1.3
    type: :runtime
    prerelease: false
-   version_requirements: *2165230340
+   version_requirements: *2165264520
  - !ruby/object:Gem::Dependency
    name: bio
-   requirement: &2165229420 !ruby/object:Gem::Requirement
+   requirement: &2165263760 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -33,10 +33,10 @@ dependencies:
          version: 1.4.2
    type: :runtime
    prerelease: false
-   version_requirements: *2165229420
+   version_requirements: *2165263760
  - !ruby/object:Gem::Dependency
    name: slop
-   requirement: &2165228320 !ruby/object:Gem::Requirement
+   requirement: &2165262760 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -44,10 +44,10 @@ dependencies:
          version: 2.4.0
    type: :runtime
    prerelease: false
-   version_requirements: *2165228320
+   version_requirements: *2165262760
  - !ruby/object:Gem::Dependency
    name: sqlite3
-   requirement: &2165227400 !ruby/object:Gem::Requirement
+   requirement: &2165261900 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -55,10 +55,10 @@ dependencies:
          version: 1.3.4
    type: :runtime
    prerelease: false
-   version_requirements: *2165227400
+   version_requirements: *2165261900
  - !ruby/object:Gem::Dependency
    name: activerecord-import
-   requirement: &2165226380 !ruby/object:Gem::Requirement
+   requirement: &2165260720 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -66,10 +66,10 @@ dependencies:
          version: 0.2.8
    type: :runtime
    prerelease: false
-   version_requirements: *2165226380
+   version_requirements: *2165260720
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &2165225400 !ruby/object:Gem::Requirement
+   requirement: &2165259280 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -77,10 +77,10 @@ dependencies:
          version: 2.3.0
    type: :development
    prerelease: false
-   version_requirements: *2165225400
+   version_requirements: *2165259280
  - !ruby/object:Gem::Dependency
    name: bundler
-   requirement: &2165224600 !ruby/object:Gem::Requirement
+   requirement: &2165258160 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -88,10 +88,10 @@ dependencies:
          version: 1.0.0
    type: :development
    prerelease: false
-   version_requirements: *2165224600
+   version_requirements: *2165258160
  - !ruby/object:Gem::Dependency
    name: jeweler
-   requirement: &2165223220 !ruby/object:Gem::Requirement
+   requirement: &2165257000 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -99,10 +99,10 @@ dependencies:
          version: 1.6.4
    type: :development
    prerelease: false
-   version_requirements: *2165223220
+   version_requirements: *2165257000
  - !ruby/object:Gem::Dependency
    name: rcov
-   requirement: &2165222000 !ruby/object:Gem::Requirement
+   requirement: &2165255880 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -110,7 +110,7 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2165222000
+   version_requirements: *2165255880
  description: Use the snp-search tool to create, import, manipulate and query your
    SNP database
  email: ali.al-shahib@hpa.org.uk
@@ -153,7 +153,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '0'
        segments:
        - 0
-       hash: 1630410471760364863
+       hash: 1607617824420065040
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements: