RubyGems - snp-search - Versions diffs - 1.0.0 → 2.0.0 - Mend

snp-search 1.0.0 → 2.0.0

Files changed (8)

data/README CHANGED

@@ -0,0 +1,105 @@
+= snp-search
+SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data.  It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data.  Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes.  Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
+== Obtaining and installing the code
+SNPsearch is written in Ruby and operates in a Unix environment.  It is made available as a gem. See the github site for more information (https://github.com/hpa-bioinformatics/snp-search).
+To install snp-search, do
+  gem install snp-search
+== Requirements
+Not much, you just need:
+* Unix. Once snp-search is installed, all the necessary gems to run snp-search will also be installed from Rubygems (note that Rubygems requires admin privileges.  If you do not have admin privileges then we suggest you install RVM: (http://beginrescueend.com/rvm/install/) and then gem install snp-search).
+* ruby version 1.8.7 and above.
+* Optional: FastTree.  If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install.  You must specify the path of the executable in your .bashrc or .profile file as snp-search will run the command as just 'FastTree' and will not know where FastTree is if it is not specified in your .bashrc or .profile file.
+Thats it!
+== Running snp-search
+1- Creating the database (snp-search -create)
+  Two files are needed to create the SQLite3 database:
+  1- Variant Call Format (.vcf) file (which contains the SNP information)
+  2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
+You need the following parameters:
+  -n	Name of your database
+  -v	.vcf file
+  -d	Database Reference genome (The same file that was used in generating the .vcf file).  This should be in genbank or embl format.
+  Other options:
+  -c	SNP quality score cutoff.  A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
+  -g	Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true.	Optional, default = 30
+  -h	help message
+  Usage:
+    snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
+  Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
+2- Querying the Database (snp-search -query)
+  Two queries are currently scripted in SNPsearch:
+  1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided.  The output is the number of unique SNPs.
+  You need the following parameters:
+  -n  Name of your database
+  -s  The strains/samples you like to query
+  Usage:
+    snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
+  2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format).  This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis.  The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
+  You need the following parameters:
+  -n  Name of your database
+  -a  The gene you like to remove from analysis
+  -o  Output file, in fasta format
+  options:
+  -t  Generate SNP phylogeny
+  -w  Output tree in Newick format
+  Usage (phage is used as the example gene):
+  snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
+  The algorithm FastTree is used to generate the nwk file.  FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
+  3- Output database (snp-search -out_file)
+  You need the following parameters:
+  -n  Name of your database
+  -o  Output file containing the database in fasta format
+== View database in Unix or in a GUI
+Your database will be in sqlite3 format.  If you like to view your table(s) and perform direct queries you can type
+  sqlite3 snp_db.sqlite3
+Alternatively, you may download a SQL tool to view your database (e.g. SQLite sorcerer).
+== Contact
+If you have any comments, questions or suggestions, please email
+  ali.al-shahib@hpa.org.uk
+or
+  anthony.underwood@hpa.org.uk
+Have fun snp-searching!
+== Copyright
+Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
+further details.
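
The README above describes the FASTA output as a concatenated SNP alignment with one record per strain. A minimal sketch of that layout, assuming the bases for each strain have already been collected into a Hash; `snp_alignment_fasta` and the strain names are hypothetical and not part of the gem:

```ruby
# Hypothetical helper, not part of snp-search.
# bases_by_strain maps a strain name to the bases observed at each retained
# SNP position, in genome order.
def snp_alignment_fasta(bases_by_strain)
  bases_by_strain.map do |name, bases|
    # One FASTA record per strain: header line, then the concatenated bases.
    ">#{name}\n#{bases.join}\n"
  end.join
end

# Illustrative data only.
example = {
  "strainA" => %w[A G T],
  "strainB" => %w[A C T]
}
print snp_alignment_fasta(example)
```

Every record has the same length because each strain contributes exactly one base per SNP position, which is the property a tree builder such as FastTree relies on for a nucleotide alignment.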

data/README.rdoc CHANGED

@@ -49,7 +49,7 @@ You need the following parameters:
   Two queries are currently scripted in SNPsearch:
-  1- genes_query: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided.  The output is the number of unique SNPs.
+  1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided.  The output is the number of unique SNPs.
   You need the following parameters:
@@ -59,7 +59,7 @@ You need the following parameters:
   Usage:
     snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
-  2- remove_genes: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format).  This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis.  The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
+  2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format).  This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis.  The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
   You need the following parameters:

data/VERSION CHANGED

@@ -1 +1 @@
-1.0.0
+2.0.0

data/bin/snp-search CHANGED

@@ -3,60 +3,74 @@ require 'snp_db_connection'
 require 'snp_db_models'
 require 'snp_db_schema'
 require 'activerecord-import'
-# gem "slop", "~> 3.1.0"
-gem "slop", "~> 2.4.0"
+gem "slop", "~> 3.3.1"
+# gem "slop", "~> 2.4.0"
 require 'slop'
-opts = Slop.new do
+opts = Slop.parse do
-  # separator 'test'
+  banner "\nruby snp-search [-create] [-query] [-output] [-n <sqlite3>] [options]*"
+  separator ''
-  banner "\nruby snp-search [OPTIONS]"
   on :C, :create, 'Create database'
   on :Q, :query, 'Query database'
-  on :O, :out_file, 'Output the database to a file'
-  # separator ''
+  on :O, :output, 'Output options'
+  separator ''
   # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
   # separator 'The following command must be used when using -create, or -query or -out_file'
   on :n, :name=, 'Name of database, Required'
-  # separator ''
-  # separator '-create options'
-  on :d, :database_reference_file, 'Reference genome file, in gbk or embl file format, Required', true
-  on :v, :vcf_file, '.vcf file, Required', true
-  on :c, :cuttoff_snp, 'SNP quality cutoff, (default = 90)', :default => 90
-  on :g, :cuttoff_genotype, 'Genotype quality cutoff (default = 30)', :default => 30
-  # separator ''
-  # separator '-query options'
-  on :G, :genes_query, 'Query for unique genes in the database'
-  on :R, :remove_genes, 'Remove set of genes from database and create FASTA file'
+  separator ''
+  separator '-create options'
+  on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
+  on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
+  on :c, :cuttoff_snp=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+  on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int,  :default => 30
+  separator ''
+  separator '-query options'
+  on :u, :unique_snps, 'Query for unique snps in the database'
+  on :r, :not_include_snps_from_gene, 'Remove SNPs from specified gene from database'
   on :s, :strain=, 'The strains/samples you like to query, Required'
   on :a, :annotation=, 'The gene you like to remove from analysis'
-  on :o, :output=, 'output file, in fasta format'
+  separator ''
+  separator '-output [-fasta] [-syn] options'
+  on :f, :fasta, 'output fasta file'
+  on :S, :syn, 'output tab-delimited file with synonymous and non-synonymous info'
+  on :o, :out=, 'Name of output file'
   on :t, :tree, 'Generate SNP phylogeny'
-  on :w, :tree_nwk_output=, 'output tree in Newick format'
-  on :S, :syn, 'syn'
+  on :w, :nwk_out=, 'Name of output tree in Newick format'
 end
-opts.parse
+# opts.end
 ###########################################################
 # CREATING A DATABASE
 if opts[:create]
+    # puts opts[:cuttoff_snp].to_i
       error_msg = ""
-      error_msg += "-n option: \t the name of your database\n" unless opts[:name]
-      error_msg += "-d option: \t reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
-      error_msg += "-v option: \t .vcf file\n" unless opts[:vcf_file]
+      error_msg += "-n: \t Name of your database\n" unless opts[:name]
+      error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
+      error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
-      unless error_msg == ""
-        puts "Please provide the following required fields:"
-        puts error_msg
-        puts opts.help unless opts.empty?
-        exit
-      end
+      error_msg_optional = ""
+      error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
+      error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
+        unless error_msg == ""
+          puts "Please provide the following required fields:"
+          puts error_msg
+          puts "Optional fields:"
+          puts error_msg_optional
+          puts opts.help unless opts.empty?
+          exit
+        end
       abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
@@ -67,7 +81,7 @@ if opts[:create]
     establish_connection(opts[:name])
     # Schema will run here
-    db_schema
+   db_schema
     ref = opts[:database_reference_file]
@@ -87,22 +101,25 @@ if opts[:create]
       vcf_mpileup_file = opts[:vcf_file]
       # The populate_features_and_annotations method populates the features and annotations.  It uses the embl/gbk file.
-      populate_features_and_annotations(sequence_flatfile)
+     populate_features_and_annotations(sequence_flatfile)
       #The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes.  It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
-      populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp].to_i, opts[:cuttoff_genotype].to_i)
+      populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp], opts[:cuttoff_genotype])
+      # puts "populate_snps_alleles_genotypes(#{vcf_mpileup_file}, #{opts[:cuttoff_snp]}, #{opts[:cuttoff_genotype]}.to_i)"
 ###########################################################
 # QUERYING THE DATABASE
 elsif opts [:query]
   #FIND UNIQUE SNPS
-  if opts[:genes_query]
+  if opts[:unique_snps]
         error_msg = ""
-        error_msg += "-n option, \t the name of your database\n" unless opts[:name]
-        error_msg += "-s option, \t list of strains you like to query\n" unless opts[:strain]
+        error_msg += "-n: \t Name of your database\n" unless opts[:name]
+        error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
         unless error_msg == ""
           puts "Please provide the following required fields:"
@@ -126,32 +143,39 @@ elsif opts [:query]
       gas_snps = find_shared_snps(strains)
       gas_snps.each do |snp|
-        puts "The number of unique snps are #{snp.id}.size"
+        puts "The number of unique snps are #{snp.id}"
       end
 ################################################################
   # REMOVE SNPS ASSOCIATED WITH SPECIFIC GENES
-  elsif opts[:remove_genes]
+  elsif opts[:not_include_snps_from_gene]
       error_msg = ""
-        error_msg += "-n option: \t the name of your database\n" unless opts[:name]
-        error_msg += "-o option: \t name of your output file\n" unless opts[:output]
-        error_msg += "-a option: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
+        error_msg += "-n: \t Name of your database\n" unless opts[:name]
+        error_msg += "-o: \t name of your output file\n" unless opts[:out]
+        error_msg += "-a: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
+        error_msg_optional = ""
+        error_msg_optional += "-tree: \t Construct tree from output\n" unless  opts[:tree]
+        error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n"  unless opts[:nwk_out]
         unless error_msg == ""
           puts "Please provide the following required fields:"
           puts error_msg
+          puts "Optional fields:"
+          puts error_msg_optional
           puts opts.help unless opts.empty?
           exit
         end
         abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
       # annotation = opts[:annotation]
      establish_connection(opts[:name])
       # Getting list of strains from database
       strains = Strain.all
@@ -164,7 +188,7 @@ elsif opts [:query]
       end
       # output opened for data input
-      output = File.open("#{opts[:output]}", "w")
+      output = File.open("#{opts[:out]}", "w")
       # Perform query
       snps = Snp.includes(:alleles => :genotypes).find_by_sql("SELECT snps.* FROM snps INNER JOIN features ON features.id = snps.feature_id WHERE features.id NOT IN (select distinct features.id FROM features INNER JOIN annotations ON annotations.feature_id = features.id WHERE annotations.value LIKE '%#{opts[:annotation]}%')")
@@ -175,16 +199,16 @@ elsif opts [:query]
           # puts snp.inspect
           i += 1
           puts "Total number of SNPs generated so far: #{i}" if i % 100 == 0
-     ActiveRecord::Base.transaction do
-          snp.alleles.each do |allele|
-            # puts allele.inspect
-            allele.genotypes.each do |genotype|
-              #push bases to hash
-              sequence_hash[genotype.strain_id] << allele.base
+           ActiveRecord::Base.transaction do
+                snp.alleles.each do |allele|
+                  # puts allele.inspect
+                  allele.genotypes.each do |genotype|
+                    #push bases to hash
+                    sequence_hash[genotype.strain_id] << allele.base
+                  end
+                end
             end
-          end
         end
-    end
     #generate FASTA file
     strains.each do |strain|
@@ -192,30 +216,42 @@ elsif opts [:query]
       output.puts
     end
+    # GENERATE TREE FROM FASTA FILE
     if opts[:tree]
-      `FastTree -fastest -nt #{opts[:output]} > #{opts[:w]}`
+      `FastTree -fastest -nt #{opts[:out]} > #{opts[:nwk_out]}`
     end
+  else
+    puts "use -unique_snps or -not_include_snps_from_gene query options"
   end
-##############################################################
+# ##############################################################
 # OUTPUT DATABASE IN FASTA FORMAT
-elsif opts[:out_file]
+elsif opts[:output]
+  if opts[:fasta]
     error_msg = ""
-        error_msg += "-n option: \t the name of your database\n" unless opts[:name]
-        error_msg += "-o option: \t name of your output file\n" unless opts[:output]
+        error_msg += "-n: \t Name of your database\n" unless opts[:name]
+        error_msg += "-o: \t name of your output file (in FASTA format)\n" unless opts[:out]
+    error_msg_optional = ""
+        error_msg_optional += "-tree: \t Construct tree from output\n" unless  opts[:tree]
+        error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n"  unless opts[:nwk_out]
         unless error_msg == ""
           puts "Please provide the following required fields:"
           puts error_msg
+          puts "Optional fields:"
+          puts error_msg_optional
           puts opts.help unless opts.empty?
           exit
         end
         abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
-  establish_connection(opts[:name])
+    establish_connection(opts[:name])
       # Getting list of strains from database
       strains = Strain.all
@@ -229,28 +265,30 @@ elsif opts[:out_file]
       end
-      output = File.open("#{opts[:output]}", "w")
+      output = File.open("#{opts[:out]}", "w")
       # Select all snps
       snps = Snp.all
         i = 0
         puts "Your out file is being prepared......."
         snps.each do |snp|
           i += 1
           puts "Total number of SNPs outputted so far: #{i}" if i % 100 == 0
-     ActiveRecord::Base.transaction do
-          snp.alleles.each do |allele|
-            # puts allele.inspect
-            allele.genotypes.each do |genotype|
-              #push bases to hash
-              sequence_hash[genotype.strain_id] << allele.base
-            end
+         ActiveRecord::Base.transaction do
+              snp.alleles.each do |allele|
+                # puts allele.inspect
+                allele.genotypes.each do |genotype|
+                  #push bases to hash
+                  sequence_hash[genotype.strain_id] << allele.base
+                end
+              end
           end
         end
+    puts sequence_hash
+    exit
     #generate FASTA file
     strains.each do |strain|
       output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
@@ -258,10 +296,36 @@ elsif opts[:out_file]
     end
     if opts[:tree]
-      `FastTree -fastest -nt #{opts[:output]} > #{opts[:w]}`
+      # puts "FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}"
+       `FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}`
     end
   end
+    #########################################
+  if opts[:syn]
+    error_msg = ""
+        error_msg += "-n option: \t the name of your database\n" unless opts[:name]
+        error_msg += "-d option: \t the reference file in gbk format\n" unless opts[:database_reference_file]
+        unless error_msg == ""
+          puts "Please provide the following required fields:"
+          puts error_msg
+          puts opts.help unless opts.empty?
+          exit
+        end
+        abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+        abort "#{opts[:database_reference_file]} vcf file does not exist!" unless File.exist?(opts[:database_reference_file])
+    establish_connection(opts[:name])
+    ref = opts[:database_reference_file]
+    synonymous(ref)
+  end
 else
-   puts opts.help
+  puts opts.help
 end
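
Each command branch in bin/snp-search repeats the same check: append a message for every required option that is missing, then print the list and exit. A minimal sketch of that pattern, assuming a plain Hash stands in for the parsed Slop options; `REQUIRED_FOR_CREATE` and `missing_option_messages` are hypothetical names and not part of the gem:

```ruby
# Hypothetical table of required options for the -create branch.
# Keys mirror the option names in the diff; values are the messages it prints.
REQUIRED_FOR_CREATE = {
  :name                    => "-n: \t Name of your database",
  :database_reference_file => "-d: \t Reference genome file, in gbk or embl file format",
  :vcf_file                => "-v: \t .vcf file"
}

# Returns the message for every required option that has no value in opts.
# opts is assumed to respond to [] the way a Hash (or parsed options object) does.
def missing_option_messages(opts, required)
  required.select { |key, _message| opts[key].nil? }.values
end

# Illustrative call: only the database name was supplied.
missing = missing_option_messages({ :name => "my_snp_db.sqlite3" }, REQUIRED_FOR_CREATE)
unless missing.empty?
  puts "Please provide the following required fields:"
  puts missing
end
```

Keeping the required options in one table per branch would let the create, query and output branches share a single check instead of three copies.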

data/lib/snp-search.rb CHANGED

@@ -3,6 +3,7 @@ gem "bio", "~> 1.4.2"
 require 'bio'
 require  'snp_db_models'
 require 'activerecord-import'
+require 'diff/lcs'
 #This method guesses the reference sequence file format
 def guess_sequence_format(reference_genome)
@@ -50,6 +51,7 @@ end
 #This method populates the rest of the information, i.e. SNP information, Alleles and Genotypes.
 def populate_snps_alleles_genotypes(vcf_file, cuttoff_snp, cuttoff_genotype)
 puts "Adding SNPs........"
 # open vcf file and parse each line
 	File.open(vcf_file) do |f|
@@ -86,37 +88,56 @@ puts "Adding SNPs........"
 				    format = details[8].split(":")
 				    gt = format.index("GT")
 				    gq = format.index("GQ")
+				    # dp = format.index("DP")
 				    samples = details[9..-1]
-			     	next if ref_base.size != 1 || snp_base.size != 1 # exclude indels
+			     	next if ref_base.size != 1 || snp_base.size != 1 # exclude indels (e.g. G,A in REF)
 				    genotypes = samples.map do |s|
-				      format_values = s.chomp.split(":")
-				      format_values[gt]
+				      format_values = s.chomp.split(":") # output (e.g.): 0/0 \n 0,255,209 \n 99
+				      format_values[gt] # e.g. 0/0
 				    end
 				    genotypes_qualities = samples.map do |s|
 				      format_values = s.chomp.split(":")
-				      format_values[gq]
+				      format_values[gq] # e.g. 99
 				    end
-				    high_quality_variant_genotypes = Array.new # this will be filled with the indicies of genotypes that are "1/1" and have a quality >= 30
+				    geno_quality_array = Array.new
+				    high_quality_variant_genotypes = Array.new # this will be filled with the indicies of genotypes that are "1/1" and have a quality >= 30. Reminder: 0/0 is no SNP, 1/1 is SNP.
 				    variant_genotypes = Array.new
-				    genotypes.each_with_index do |gt, index|
+				    genotypes.each_with_index do |gt, index| # indexes each 'genotypes'.
 				    	if gt == "1/1"
-					        variant_genotypes << index
-					        if genotypes_qualities[index].to_i >= cuttoff_genotype
-					          high_quality_variant_genotypes << index
-					        end
-				    	end
+					        variant_genotypes << index # variant_genotypes is the position of genome positions that have a correct SNP with 1/1.  if you want the total number of strains thats have 1/1 for that row (genome position) then puts variant_genotypes.size
+					        if genotypes_qualities[index].to_i >= cuttoff_genotype.to_i
+					           high_quality_variant_genotypes << index
+					    	end
+						end
 					end
-					if snp_qual.to_i >= cuttoff_snp && genotypes.include?("1/1") &&  ! high_quality_variant_genotypes.empty? && high_quality_variant_genotypes.size == variant_genotypes.size # first condition checks the overall quality of the SNP is >=90, second checks that at least one genome has the 'homozygous' 1/1 variant type with quality >= 30 and informative SNP
-				    	if  genotypes.include?("0/0") && !genotypes.include?("0/1") # exclude SNPs which are all 1/1 i.e something strange about ref and those which have confusing heterozygote 0/1s
+					genotypes_qualities.each do |gq|
+						if gq.to_i >= cuttoff_genotype.to_i
+							geno_quality_array << gq
+						end
+					end
+					# 	# high_quality_variant_genotypes is the position of 1/1 and genotype quality above cuttoff_genotype. high_quality_variant_genotypes.size will give you the number of 1/1 in a row (genome position) that is above the genotype quality cuttoff.
+					# puts "yay" if geno_quality_array.keep_if {|z| z <= cuttoff_genotype.to_i}
+					# next if geno_quality_array.each {|z| z.to_i < cuttoff_genotype.to_i}
+					next if samples.include?("./.")
+					 next if geno_quality_array.size != strains.size
+						if snp_qual.to_i >= cuttoff_snp.to_i && genotypes.include?("1/1") &&  ! high_quality_variant_genotypes.empty? && high_quality_variant_genotypes.size == variant_genotypes.size
+					   # first condition checks the overall quality of the SNP 	is >=90, second checks that at least one genome has the 'homozygous' 1/1 variant 	type with quality >= 30 and informative SNP
+				    	 if  genotypes.include?("0/0") && !genotypes.include?("0/1") # exclude SNPs which are all 1/1 i.e something strange about ref and those which have confusing heterozygote 0/1s
 					        good_snps +=1
 					        # puts good_snps
 					        #create snp
 					        s = Snp.new
 					        s.ref_pos = ref_pos
+					        s.qual = snp_qual
 					        s.save
 					   #  create ref allele
@@ -139,13 +160,14 @@ puts "Adding SNPs........"
 							    genotypes.each_with_index do |gt, index|
 							         genotype = Genotype.new
 							         genotype.strain = strains[index]
+							         genotype.geno_qual = genotypes_qualities[index].to_i
 							    	 puts index if strains[index].nil?
 							          if gt == "0/0" # wild type
 							             genotype.allele = ref_allele
 							          elsif gt == "1/1" # snp type
 							             genotype.allele = snp_allele
-							          else
-							            puts "Strange SNP #{gt}"
+							           else
+							             puts "Strange SNP #{gt}"
 							          end
 							          genos << genotype
 								end
@@ -154,6 +176,7 @@ puts "Adding SNPs........"
 								 puts "Total SNPs added so far: #{good_snps}" if good_snps % 100 == 0
 							end
 				      	end
 			    	end
 			    end
 			end
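
The filtering rules in the populate_snps_alleles_genotypes hunks above can be read as a single yes/no decision per VCF line. A simplified standalone sketch, assuming tab-separated VCF data lines with GT and GQ in the FORMAT column; `keep_snp?` is a hypothetical helper, and checking the GT field for "./." is an assumption about the intent of the missing-call test:

```ruby
# Hypothetical helper, not part of snp-search: returns true when a VCF data
# line would be stored as a SNP under the rules visible in the diff.
def keep_snp?(vcf_line, cutoff_snp = 90, cutoff_genotype = 30)
  details  = vcf_line.chomp.split("\t")
  ref_base = details[3]
  snp_base = details[4]
  snp_qual = details[5].to_f

  # Exclude indels and multi-allelic sites: both alleles must be one base.
  return false if ref_base.size != 1 || snp_base.size != 1

  format = details[8].split(":")
  gt = format.index("GT")
  gq = format.index("GQ")
  return false if gt.nil? || gq.nil?

  samples   = details[9..-1].map { |sample| sample.split(":") }
  genotypes = samples.map { |values| values[gt] }
  qualities = samples.map { |values| values[gq].to_i }

  return false if genotypes.include?("./.")                        # missing call (assumed intent)
  return false unless qualities.all? { |q| q >= cutoff_genotype }  # every strain passes GQ
  return false unless snp_qual >= cutoff_snp                       # site-level QUAL

  # Informative site: at least one variant, at least one wild type, no heterozygotes.
  genotypes.include?("1/1") && genotypes.include?("0/0") && !genotypes.include?("0/1")
end

# Illustrative line with two strains.
line = ["chr1", "100", ".", "A", "G", "95", ".", ".", "GT:PL:GQ",
        "0/0:0,255,209:99", "1/1:255,255,0:99"].join("\t")
puts keep_snp?(line)
```

Requiring every genotype quality to pass corresponds to the `geno_quality_array.size != strains.size` test added in 2.0.0, which is stricter than the 1.0.0 rule that only looked at the 1/1 calls.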
@@ -172,8 +195,107 @@ def find_shared_snps(strain_names)
    where_statement = strain_names.collect{|strain_name| "strains.name = '#{strain_name}' OR "}.join("").sub(/ OR $/, "")
-   puts "Snp.find_by_sql(\"SELECT * from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}\")"
+   Snp.find_by_sql("SELECT * from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}")
 end
+def synonymous(sequence_file)
+	#Reference Sequence
+	genome_sequence = Bio::FlatFile.open(Bio::GenBank, sequence_file).next_entry
+	#Extract all nucleotide sequence from ORIGIN
+	all_seqs_original = genome_sequence.seq
+	ref_bases =[]
+	 strains = Strain.all
+	  strains_hash = Hash.new
+      # create a sequence hash
+      # hash key is strain_id, loop through strain_id
+      # create an empty array
+      strains.each do |strain|
+        strains_hash[strain.id] = Array.new
+      end
+	variants = Feature.find_by_sql("select distinct features.* from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id inner join genotypes on alleles.id = genotypes.allele_id inner join strains on strains.id = genotypes.strain_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'")
+	puts "start_cds_in_ref\tend_cds_in_ref\tpos_of_SNP_in_ref\tref_base\tSNP_base\tsynonymous or non-synonymous\tamino_acid_original\tamino_acid_change\tpossible_pseudogene?\tchange_in_hydrophobicity_of_AA?\tchange_in_polarisation_of_AA?\tchange_in_size_of_AA?"
+	variants.each do |variant|
+		variant.snps.each do |snp|
+			snp.alleles.each do |allele|
+				if allele.id != snp.reference_allele_id
+					all_seqs_mutated = genome_sequence.seq
+					mutated_seq_translated = []
+					original_seq_translated = []
+					all_seqs_mutated[snp.ref_pos.to_i-1] = allele.base
+					mutated_seq = Bio::Sequence.auto(all_seqs_mutated[variant.start-1..variant.end-1])
+					original_seq =  Bio::Sequence.auto(all_seqs_original[variant.start-1..variant.end-1])
+					if variant.strand == -1
+						mutated_seq_translated << mutated_seq.reverse_complement.translate
+						original_seq_translated << original_seq.reverse_complement.translate
+					else
+						mutated_seq_translated << mutated_seq.translate
+						original_seq_translated << original_seq.translate
+				 	end
+					 	mutated_seq_translated.zip(original_seq_translated).each do |mut, org|
+					 		mutated_seq_translated_clean = mut.gsub(/\*$/,"")
+					 		original_seq_translated_clean = org.gsub(/\*$/,"")
+							hydrophobic = ["I", "L", "V", "C", "A", "G", "M", "F", "Y", "W", "H", "T"]
+							non_hydrophobic = ["K", "E", "Q", "D", "N", "S", "P", "B"]
+							polar = ["Y", "W", "H", "K", "R", "E", "Q", "D", "N", "S", "P", "B"]
+							non_polar = ["I", "L", "V", "C", "A", "G", "M", "F", "T"]
+							small = ["V","C","A","G","D","N","S","T","P"]
+							non_small = ["I","L","M","F","Y","W","H","K","R","E","Q"]
+							if original_seq_translated_clean == mutated_seq_translated_clean
+							# if original_seq_translated == mutated_seq_translated
+								if mutated_seq_translated_clean =~ /\*/
+									puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous\t\t\tYes"
+								else
+									puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tsynonymous"
+								end
+							else
+								diffs = Diff::LCS.diff(original_seq_translated_clean, mutated_seq_translated_clean)
+								if mutated_seq_translated_clean =~ /\*/
+									puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\tYes\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+								else
+									puts "#{variant.start}\t#{variant.end}\t#{snp.ref_pos}\t#{all_seqs_original[snp.ref_pos.to_i-1].upcase}\t#{(allele.base).upcase}\tnon-synonymous\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}"
+								end
+							end
+						end
+				end
+			end
+		end
+	end
+	#Take all SNP positions in ref genome
+	# snp_positions = Feature.find_by_sql("select snps.ref_pos from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|snp| snp.ref_pos}
+	# # Take all SNP nucleotide
+	# snps = Feature.find_by_sql("select alleles.base from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where alleles.id <> snps.reference_allele_id and features.name = 'CDS'").map{|allele| allele.base}
+	# # Mutate (substitute) the original sequence with the SNPs
+	# # Here all_seqs_original are all the nucelotide sequences but with the snps subsituted in them
+	# #Get start position of CDS with SNP
+	# coordinates_start = Feature.find_by_sql("select start from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.start}
+	# #Get end position of CDS with SNP
+	# coordinates_end = Feature.find_by_sql("select end from features inner join snps on features.id = snps.feature_id inner join alleles on snps.id = alleles.snp_id where features.name = 'CDS' and alleles.id <> snps.reference_allele_id").map{|feature| feature.end}
+end

data/lib/snp_db_schema.rb CHANGED

@@ -21,12 +21,13 @@ ActiveRecord::Schema.define do
     create_table :snps do |t|
       t.column :feature_id, :integer
       t.column :ref_pos, :integer
+      t.column :qual, :float
       t.column :reference_allele_id, :integer
     end
   end
   unless table_exists? :alleles
-    create_table :alleles do |t|name
+    create_table :alleles do |t|
       t.column :snp_id, :integer
       t.column :base, :string
     end
@@ -36,6 +37,7 @@ ActiveRecord::Schema.define do
     create_table :genotypes do |t|
       t.column :allele_id, :integer
       t.column :strain_id, :integer
+      t.column :geno_qual, :float
     end
   end
@@ -66,18 +68,24 @@ ActiveRecord::Schema.define do
   unless index_exists? :snps, :feature_id
     add_index :snps, :feature_id
   end
+  unless index_exists? :snps, :qual
+    add_index :snps, :qual
+  end
   unless index_exists? :alleles, :snp_id
     add_index :alleles, :snp_id
   end
   unless index_exists? :alleles, :base
     add_index :alleles, :base
   end
- unless index_exists? :genotypes, :allele_id
+  unless index_exists? :genotypes, :allele_id
   add_index :genotypes, :allele_id
   end
   unless index_exists? :genotypes, :strain_id
   add_index :genotypes, :strain_id
   end
+  unless index_exists? :genotypes, :geno_qual
+  add_index :genotypes, :geno_qual
+  end
   unless index_exists? :annotations, :feature_id
     add_index :annotations, :feature_id
   end

data/snp-search.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "snp-search"
-  s.version = "1.0.0"
+  s.version = "2.0.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Ali Al-Shahib", "Anthony Underwood"]
-  s.date = "2012-05-10"
+  s.date = "2012-08-02"
   s.description = "Use the snp-search tool to create, import, manipulate and query your SNP database"
   s.email = "ali.al-shahib@hpa.org.uk"
   s.executables = ["snp-search"]

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: snp-search
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 2.0.0
   prerelease:
 platform: ruby
 authors:
@@ -10,11 +10,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-05-10 00:00:00.000000000Z
+date: 2012-08-02 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
-  requirement: &2165230340 !ruby/object:Gem::Requirement
+  requirement: &2165264520 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -22,10 +22,10 @@ dependencies:
         version: 3.1.3
   type: :runtime
   prerelease: false
-  version_requirements: *2165230340
+  version_requirements: *2165264520
 - !ruby/object:Gem::Dependency
   name: bio
-  requirement: &2165229420 !ruby/object:Gem::Requirement
+  requirement: &2165263760 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -33,10 +33,10 @@ dependencies:
         version: 1.4.2
   type: :runtime
   prerelease: false
-  version_requirements: *2165229420
+  version_requirements: *2165263760
 - !ruby/object:Gem::Dependency
   name: slop
-  requirement: &2165228320 !ruby/object:Gem::Requirement
+  requirement: &2165262760 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -44,10 +44,10 @@ dependencies:
         version: 2.4.0
   type: :runtime
   prerelease: false
-  version_requirements: *2165228320
+  version_requirements: *2165262760
 - !ruby/object:Gem::Dependency
   name: sqlite3
-  requirement: &2165227400 !ruby/object:Gem::Requirement
+  requirement: &2165261900 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -55,10 +55,10 @@ dependencies:
         version: 1.3.4
   type: :runtime
   prerelease: false
-  version_requirements: *2165227400
+  version_requirements: *2165261900
 - !ruby/object:Gem::Dependency
   name: activerecord-import
-  requirement: &2165226380 !ruby/object:Gem::Requirement
+  requirement: &2165260720 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -66,10 +66,10 @@ dependencies:
         version: 0.2.8
   type: :runtime
   prerelease: false
-  version_requirements: *2165226380
+  version_requirements: *2165260720
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &2165225400 !ruby/object:Gem::Requirement
+  requirement: &2165259280 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -77,10 +77,10 @@ dependencies:
         version: 2.3.0
   type: :development
   prerelease: false
-  version_requirements: *2165225400
+  version_requirements: *2165259280
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &2165224600 !ruby/object:Gem::Requirement
+  requirement: &2165258160 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -88,10 +88,10 @@ dependencies:
         version: 1.0.0
   type: :development
   prerelease: false
-  version_requirements: *2165224600
+  version_requirements: *2165258160
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &2165223220 !ruby/object:Gem::Requirement
+  requirement: &2165257000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -99,10 +99,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *2165223220
+  version_requirements: *2165257000
 - !ruby/object:Gem::Dependency
   name: rcov
-  requirement: &2165222000 !ruby/object:Gem::Requirement
+  requirement: &2165255880 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -110,7 +110,7 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *2165222000
+  version_requirements: *2165255880
 description: Use the snp-search tool to create, import, manipulate and query your
   SNP database
 email: ali.al-shahib@hpa.org.uk
@@ -153,7 +153,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 1630410471760364863
+      hash: 1607617824420065040
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: