snp-search 2.5.2 → 2.7.0

Files changed (10)

data/README.rdoc +46 -39
data/VERSION +1 -1
data/bin/snp-search +53 -53
data/lib/information_methods.rb +2 -2
data/lib/output_information_methods.rb +23 -15
data/lib/snp-search.rb +1 -2
data/pkg/snp-search-2.5.2.gem +0 -0
data/pkg/snp-search-2.6.0.gem +0 -0
data/snp-search.gemspec +4 -2
metadata +5 -3

data/README.rdoc CHANGED

@@ -1,6 +1,6 @@
 = snp-search
-SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data.  It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data.  Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes.  Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
+an easy to use tool for management of SNPs generated from haploid next generation sequencing data. Given a vcf file, snp-search stores the SNPs generated by the variant calling algorithm into a sqlite database. snp-search can then be used to extract useful information from the database.
 == Obtaining and installing the code
 SNPsearch is written in Ruby and operates in a Unix environment.  It is made available as a gem. See the github site for more information (https://github.com/hpa-bioinformatics/snp-search).
@@ -13,9 +13,10 @@ To install snp-search, do
 Not much, you just need:
 * Unix. Once snp-search is installed, all the necessary gems to run snp-search will also be installed from Rubygems (note that Rubygems requires admin privileges.  If you do not have admin privileges then we suggest you install RVM: (http://beginrescueend.com/rvm/install/) and then gem install snp-search).
 * ruby version 1.8.7 and above.
-* Optional: FastTree.  If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install.  You must specify the path of the executable in your .bashrc or .profile file as snp-search will run the command as just 'FastTree' and will not know where FastTree is if it is not specified in your .bashrc or .profile file.
+* Optional: FastTree 2.  If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install.
 Thats it!
@@ -29,65 +30,72 @@ Thats it!
   1B- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
-You need the following parameters:
+  You need the following parameters:
-  -n	Name of your database (note that this is a required field in all commands).
+  -d	Name of your database (note that this is a required field in all commands).
   -v	.vcf file
-  -d	Database Reference genome (The same file that was used in generating the .vcf file).  This should be in genbank or embl format.
+  -r	Database Reference genome (The same file that was used in generating the .vcf file).  This should be in genbank or embl format.
-  Other options:
-  -c	SNP quality score cutoff.  A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
-  -g	Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true.	Optional, default = 30
-  -h	help message
+  Optional: -A  AD ratio cutoff (default 0.9)
   Usage:
-    snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
+    snp-search -create -d my_snp_db.sqlite3 -r my_ref.gbk -v my_vcf_file.vcf
   Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
 2- Now that you have created the database (my_snp_db.sqlite3) you can use snp-search to output several queried data.
-  2A- First, you should choose which output format you like:
-    -f, --fasta: output fasta file format (not available with -unique_snps option)
-    -T, --tabular: output tabular file format
-  2B- Next, you need to tell snp-search what you want out.  You have several options:
-    - Querying the Database to select the number of unique SNPs within the list of the strains/samples provided (list_of_my_strains.txt). The output is a text file with a list of the unique SNPs and information about each SNP (e.g. if its synonymous or non-synonymous SNP).
-  -u, --unique_snps                      Query for unique snps in the database (only used with -tabular option)
-  -s, --strain                           The strains/samples you like to query (only used with -unique_snps flag)
-  Usage:
-  snp-search -n my_snp_db.sqlite3 -O -T -u -n my_snp_db.sqlite3 -s list_of_my_strains.txt -o unique_snps.out
-  - Querying the database to output all SNPs without specified features in the database (e.g. phages).  This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis.  The user has the option of generating a core SNP tree Newick file for SNP phylogeny (if -F option was used to ouput fasta file).
-  -e, --ignore_snps_from_feature         Ignore SNPs from specified features in the database
-  -r, --remove_non_informative_snps      Only output informative SNPs
-  -I, --ignore_snps_in_range             A list of position ranges to ignore e.g 10..500,2000..2500
-  -R, --ignore_strains                   A list of strains to ignore (seperate by comma e.g. S1,S4,S8 )
-  -a, --annotation                       The name of the gene to ignore (only used with the -ignore_snps_from_feature flag)
-  -o, --out                              Name of output file
+  First, you need to tell snp-search what you want out.  You have several options:
+  - Querying the Database to select the number of unique SNPs within the list of the strains/samples provided (list_of_my_strains.txt). The output is a text file with a list of the unique SNPs and information about each SNP (e.g. if its synonymous or non-synonymous SNP).
+    -output -unique_snps -d db.sqlite3 [options]
+      -u, --unique_snps                      Query for unique snps in the database
+      -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
+      -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
+      -s, --strain                           The strains/samples you like to query (only used with -unique_snps flag)
+      -o, --out                              Name of output file, Required
+    Usage:
+    snp-search -O -u -d my_snp_db.sqlite3 -s list_of_my_strains.txt -o unique_snps.out
+  - Querying the database to output all SNPs without SNPs in a specified features in the database (e.g. phages).  This is a way of ignoring SNPs in genes (likely to be mobile element genes) that are not needed for SNP analysis.  The user has the option of generating a core SNP tree Newick file for SNP phylogeny (if -F option was used to ouput fasta file).
+  -output -all_or_filtered_snps -d db.sqlite3 [options]
+    -f, --all_or_filtered_snps             SNPs from specified features in the database (if you do not want to ignore any SNPs, just use this option with -n -F/T -o)
+    -F, --fasta                            output fasta file format (default)
+    -T, --tabular                          output tabular file format
+    -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
+    -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
+    -R, --remove_non_informative_snps      Only output informative SNPs. Only used with -e option
+    -e, --ignore_snps_in_range             A list of position ranges to ignore e.g 10..500,2000..2500. Only used with -e option
+    -a, --ignore_strains                   A list of strains to ignore (seperate by comma e.g. S1,S4,S8 ). Only used with -f option
+    -I, --ignore_snps_on_annotation        The name of the feature(s) to ignore.  Features should be seperated by comma (e.g. phages,inserstion,transposons)
+    -o, --out                              Name of output file, Required
+    -t, --tree                             Generate SNP phylogeny (only used with -fasta option)
+    -p, --fasttree_path                    Full path to the FastTree tool (e.g. /usr/local/bin/FastTree. only used with -tree option)
   Usage:
-  snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -o snps_without_phages.fasta
+  snp-search -O -F -f -n my_snp_db.sqlite3 -a phage,insertion,transposon -R -o snps_without_phages.fasta
   - Optionally, you can add the following options to generate a phylogenetic tree from the resulting fasta file:
   -t  Generate SNP phylogeny
-  -w  Output tree in Newick format
+  -p  Full path to the FastTree tool (e.g. /usr/local/bin/FastTree. only used with -tree option)
   Usage:
-  snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -t -w -o snps_without_phages.fasta
+  snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -t -p /usr/local/bin/FastTree -o snps_without_phages.fasta
   The algorithm FastTree is used to generate the nwk file.  FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
   - Output all SNPs with information.  Information for each SNP includes whether the SNP is synonymous or non-synonymous, gene function, whether it is a pseudogene and other useful information.  These information will be tab-seperated.
-  -E, --info                             Output various information about SNPs
-  -o, --out                              Name of output file
+  -output -info -d db.sqlite3 [options]
+    -i, --info                             Output various information about SNPs
+    -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
+    -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
+    -o, --out                              Name of output file, Required
   Usage:
-  snp-search -O -T -E -n my_snp_db.sqlite3 o snps_all_with_info.txt
+  snp-search -O -info -d my_snp_db.sqlite3 -o snps_all_with_info.txt
 == View database in Unix or in a GUI
 Your database will be in sqlite3 format.  If you like to view your table(s) and perform direct queries you can type
@@ -107,5 +115,4 @@ Have fun snp-searching!
 == Copyright
 Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
-further details.
+further details.
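
The -e/--ignore_snps_in_range option described in the updated README takes a comma-separated list such as 10..500,2000..2500. How the gem itself parses this value is not shown in this diff; the following is only a minimal sketch of one way such a string could be turned into Ruby ranges. The helper name parse_ranges is hypothetical and not part of snp-search.

```ruby
# Hypothetical helper (not part of snp-search): turn "10..500,2000..2500"
# into an array of inclusive Ranges.
def parse_ranges(spec)
  spec.split(",").map do |chunk|
    first, last = chunk.strip.split("..", 2)
    raise ArgumentError, "bad range: #{chunk.inspect}" if first.nil? || last.nil?
    Integer(first)..Integer(last)
  end
end

ranges = parse_ranges("10..500,2000..2500")

# A position would be ignored when it falls inside any of the ranges.
ignored = ranges.any? { |r| r.include?(250) }
puts "position 250 ignored? #{ignored}"
```

Using Integer() rather than to_i makes a malformed entry fail loudly instead of silently becoming 0.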

data/VERSION CHANGED

@@ -1 +1 @@
-2.5.2
+2.7.0

data/bin/snp-search CHANGED

@@ -1,14 +1,15 @@
-require 'snp-search'
+require '/Volumes/NGS2_DataRAID/projects/ali/GAS/snp-search/lib/snp-search'
 require 'snp_db_connection.rb'
 require 'snp_db_models.rb'
 require 'snp_db_schema.rb'
-require 'output_information_methods.rb'
+require '/Volumes/NGS2_DataRAID/projects/ali/GAS/snp-search/lib/output_information_methods.rb'
 require 'activerecord-import'
 require 'slop'
 opts = Slop.parse do
   banner "\nruby snp-search [-create] [-output] [-n <sqlite3>] [options]*"
   separator ''
   on :C, :create, 'Create database'
@@ -17,49 +18,46 @@ opts = Slop.parse do
   # separator ''
   # # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
   # # separator 'The following command must be used when using -create, or -query or -out_file'
-  # on :n, :name=, 'Name of database, Required'
+  # on :n, :name_of_database=, 'Name of database, Required'
   separator ''
-  separator '-create [options]'
-  on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
+  separator '-create -r reference_file.fasta -v vcf_file.vcf -d db.sqlite3'
+  on :r, :reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
   on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
-  on :n, :name=, 'Name of database, Required'
+  on :d, :name_of_database=, 'Name of database, Required'
   on :A, :cuttoff_ad=, 'AD ratio cutoff (default 0.9)', :as => :int, :default => 0.9
   separator ''
-  separator '-output -snps_from_feature -n db_name [options] [-fasta] [-tabular]'
-  on :F, :fasta, 'output fasta file format'
+  separator '-output -all_or_filtered_snps -d db.sqlite3 [options]'
+  on :f, :all_or_filtered_snps, 'SNPs from specified features in the database (if you do not want to ignore any SNPs, just use this option with -n -F/T -o)'
+  on :F, :fasta, 'output fasta file format (default)'
   on :T, :tabular, 'output tabular file format'
   on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
   on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int,  :default => 30
-  on :S, :snps_from_feature, 'SNPs from specified features in the database (if you do not want to ignore any SNPs, just use this option with -n -F/T -o)'
-  on :r, :remove_non_informative_snps, 'Only output informative SNPs. Only used with -e option'
-  on :e, :ignore_snps_in_range=, 'A list of position ranges to ignore e.g 10..500,2000..2500. Only used with -e option'
-  on :R, :ignore_strains=, 'A list of strains to ignore (seperate by comma e.g. S1,S4,S8 ). Only used with -e option'
-  on :I, :ignore_snps_on_annotation=, 'The name of the feature to ignore.'
+  on :R, :remove_non_informative_snps, 'Only output informative SNPs.'
+  on :e, :ignore_snps_in_range=, 'A list of position ranges to ignore e.g 10..500,2000..2500.'
+  on :a, :ignore_strains=, 'A list of strains to ignore (seperate by comma e.g. S1,S4,S8 ).'
+  on :I, :ignore_snps_on_annotation=, 'The name of the feature(s) to ignore.  Features should be seperated by comma (e.g. phages,inserstion,transposons)'
   on :o, :out=, 'Name of output file, Required'
   on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
   on :p, :fasttree_path=, 'Full path to the FastTree tool (e.g. /usr/local/bin/FastTree. only used with -tree option)'
   separator ''
-  separator '-output -unique_snps -n db_name [-fasta] [-tabular] [options]'
+  separator '-output -unique_snps -d db.sqlite3 [options]'
+  on :u, :unique_snps, 'Query for unique snps in the database'
   on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
   on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int,  :default => 30
-  on :u, :unique_snps, 'Query for unique snps in the database'
   on :s, :strain=, 'The strains/samples you like to query (only used with -unique_snps flag)'
   on :o, :out=, 'Name of output file, Required'
   separator ''
-  separator '-output -info -n db_name [-fasta] [-tabular] [options]'
+  separator '-output -info -d db.sqlite3 [options]'
   on :i, :info, 'Output various information about SNPs'
   on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
   on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int,  :default => 30
-  on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
-  on :w, :nwk_out=, 'Name of output tree in Newick format (only used with -tree option)'
   on :o, :out=, 'Name of output file, Required'
 end
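
Note that the -A option above is declared with :as => :int while its default is 0.9. How Slop applies the :as conversion to a value given on the command line depends on the Slop version, which this diff does not pin down. The sketch below only illustrates, in plain Ruby and without Slop, why an integer conversion is a poor fit for a ratio between 0 and 1.

```ruby
# Plain-Ruby illustration (no Slop involved): converting a ratio string
# to an integer discards the fractional part.
raw = "0.85"            # example value a user might pass to -A

as_int   = raw.to_i     # integer conversion truncates the fraction
as_float = Float(raw)   # float conversion keeps the ratio

puts "to_i:  #{as_int}"
puts "Float: #{as_float}"
```

If the option really is converted this way, declaring it as a float would be the more natural choice.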
@@ -67,49 +65,52 @@ end
 # CREATING A DATABASE
 if opts[:create]
+  # raise "Please provide a database file name" if opts[:reference_file].empty?
   # puts opts[:cuttoff_snp_qual].to_i
     error_msg = ""
-    error_msg += "-n: \t Name of your database\n" unless opts[:name]
-    error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
+    error_msg += "-d: \t Name of your database\n" unless opts[:name_of_database]
+    error_msg += "-r: \t Reference genome file, in gbk or embl file format\n" unless opts[:reference_file]
     error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
-    error_msg_optional = ""
+    # error_msg_optional = ""
-    error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
-    error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
+    # error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
+    # error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
       unless error_msg == ""
         puts "Please provide the following required fields:"
         puts error_msg
-        puts "Optional fields:"
-        puts error_msg_optional
-        puts opts.help unless opts.empty?
+        # puts "Optional fields:"
+        # puts error_msg_optional
+        # puts "Please provide a database file name" if opts[:reference_file].empty?
+        # puts opts.help unless opts.empty?
         exit
       end
-    abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
+    abort "#{opts[:reference_file]} file does not exist!" unless File.exist?(opts[:reference_file])
     abort "#{opts[:vcf_file]} file does not exist!" unless File.exist?(opts[:vcf_file])
   # Name of your database
-  establish_connection(opts[:name])
+  establish_connection(opts[:name_of_database])
   # Schema will run here
   db_schema
-  ref = opts[:database_reference_file]
+  ref = opts[:reference_file]
   sequence_format = guess_sequence_format(ref)
         case sequence_format
         when :genbank
-          sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:database_reference_file]).next_entry
+          sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:reference_file]).next_entry
         when :embl
-          sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:database_reference_file]).next_entry
+          sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:reference_file]).next_entry
         else
           puts "All sequence files should be in genbank or embl format"
           exit
@@ -128,33 +129,32 @@ if opts[:create]
 elsif opts[:output]
   error_msg = ""
-  error_msg += "-S: \t SNPs from specified features in the database OR\n-u: \t Query for unique snps in the database OR\n-i: \t Information on all SNPs\n" unless opts[:snps_from_feature] || opts[:unique_snps] || opts[:info]
+  error_msg += "-f: \t SNPs from specified features in the database OR\n-u: \t Query for unique snps in the database OR\n-i: \t Information on all SNPs\n" unless opts[:all_or_filtered_snps] || opts[:unique_snps] || opts[:info]
   unless error_msg == ""
     puts "Please provide the following required fields:"
     puts error_msg
-    puts opts.help unless opts.empty?
+    # puts opts.help unless opts.empty?
     exit
   end
-  if opts[:snps_from_feature]
+  if opts[:all_or_filtered_snps]
     error_msg = ""
-    error_msg += "-n: \t Name of your database\n" unless opts[:name]
+    error_msg += "-d: \t Name of your database\n" unless opts[:name_of_database]
     error_msg += "-o: \t name of your output file\n" unless opts[:out]
     error_msg += "-F: \t Fasta output OR\n-T: \t Tabular output" unless opts[:fasta] || opts[:tabular]
     error_msg_optional = ""
-    error_msg_optional += "-I,\t --ignore_snps_on_annotation: ignore SNPs from specified features in the database\n" unless opts[:ignore_snps_on_annotation]
-    error_msg_optional += "-R,\t --ignore_strains: A list of strains to ignore\n" unless  opts[:ignore_strains]
-    error_msg_optional += "-i,\t --ignore_snps_in_range: A list of position ranges to ignore e.g 10..500,2000..2500\n" unless  opts[:ignore_snps_in_range]
+    error_msg_optional += "-I,\t --ignore_snps_on_annotation: The name of the feature(s) to ignore.  Features should be seperated by comma (e.g. phages,inserstion,transposons)\n" unless opts[:ignore_snps_on_annotation]
+    error_msg_optional += "-a,\t --ignore_strains: A list of strains to ignore\n" unless  opts[:ignore_strains]
+    error_msg_optional += "-e,\t --ignore_snps_in_range: A list of position ranges to ignore e.g 10..500,2000..2500\n" unless  opts[:ignore_snps_in_range]
     error_msg_optional += "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality\n"  unless  opts[:cuttoff_snp_qual]
     error_msg_optional += "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality\n"  unless  opts[:cuttoff_genotype]
-    error_msg_optional += "-r,\t --remove_non_informative_snps: Only output informative SNPs\n"  unless  opts[:remove_non_informative_snps]
+    error_msg_optional += "-R,\t --remove_non_informative_snps: Only output informative SNPs\n"  unless  opts[:remove_non_informative_snps]
     error_msg_optional += "-t,\t --tree: Construct tree from output\n" unless  opts[:tree]
-    error_msg_optional += "-w,\t --nwk_out: Name of Newick output file(use only when-tree option used)\n"  unless opts[:nwk_out]
     unless error_msg == ""
       puts "Please provide the following required fields:"
@@ -164,13 +164,13 @@ elsif opts[:output]
       # Added this here as it wont appear here in error_msg_optional as its set as default.
       puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
       puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
-      puts opts.help unless opts.empty?
+      # puts opts.help unless opts.empty?
       exit
     end
-    abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+    abort "#{opts[:name_of_database]} database does not exist!" unless File.exist?(opts[:name_of_database])
-    establish_connection(opts[:name])
+    establish_connection(opts[:name_of_database])
     get_snps(opts[:out], opts[:ignore_snps_on_annotation], opts[:ignore_snps_in_range], opts[:ignore_strains], opts[:remove_non_informative_snps], opts[:fasta], opts[:tabular], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual], opts[:tree], opts[:fasttree_path])
   end
@@ -181,7 +181,7 @@ elsif opts[:output]
     error_msg = ""
-    error_msg += "-n: \t Name of your database\n" unless opts[:name]
+    error_msg += "-d: \t Name of your database\n" unless opts[:name_of_database]
     error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
     error_msg += "-o: \t Name of the output file\n" unless opts[:out]
@@ -192,14 +192,14 @@ elsif opts[:output]
       # Added this here as it wont appear here in error_msg_optional as its set as default.
       puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
       puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
-      puts opts.help unless opts.empty?
+      # puts opts.help unless opts.empty?
       exit
     end
-    abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+    abort "#{opts[:name_of_database]} database does not exist!" unless File.exist?(opts[:name_of_database])
     abort "#{opts[:strain]} file does not exist!" unless File.exist?(opts[:strain])
-    establish_connection(opts[:name])
+    establish_connection(opts[:name_of_database])
     strains = []
       File.read(opts[:strain]).each_line do |line|
@@ -214,7 +214,7 @@ elsif opts[:output]
     error_msg = ""
-    error_msg += "-n: \t the name of your database\n" unless opts[:name]
+    error_msg += "-d: \t the name of your database\n" unless opts[:name_of_database]
     error_msg += "-o: \t name of your output file (in tab-delimited format)\n" unless opts[:out]
     unless error_msg == ""
@@ -224,13 +224,13 @@ elsif opts[:output]
       # Added this here as it wont appear here in error_msg_optional as its set as default.
       puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
       puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
-      puts opts.help unless opts.empty?
+      # puts opts.help unless opts.empty?
       exit
     end
-    abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+    abort "#{opts[:name_of_database]} database does not exist!" unless File.exist?(opts[:name_of_database])
-    establish_connection(opts[:name])
+    establish_connection(opts[:name_of_database])
     #information defined in bin/snp-search.rb
     information(opts[:out], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual])
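
The two require lines changed at the top of bin/snp-search now point at an absolute path under /Volumes/..., which will only resolve on a machine that has that directory. A minimal sketch of a location-independent alternative follows, assuming the usual gem layout with bin/ and lib/ as sibling directories. lib_dir_for is a hypothetical helper, used here only to make the path arithmetic visible.

```ruby
# Hypothetical helper: given the path of a script in <gem>/bin,
# return the sibling <gem>/lib directory.
def lib_dir_for(script_path)
  File.expand_path("../lib", File.dirname(script_path))
end

lib_dir = lib_dir_for("/opt/gems/snp-search/bin/snp-search")
puts lib_dir

# In the real script one would pass __FILE__ and add the directory to the
# load path before requiring by bare name:
#   $LOAD_PATH.unshift(lib_dir_for(__FILE__))
#   require 'snp-search'
```

File.dirname(__FILE__) is used rather than __dir__ because the README states support for ruby 1.8.7 and above.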

data/lib/information_methods.rb CHANGED

@@ -63,8 +63,8 @@ def information()
       hydrophobic = ["I", "L", "V", "C", "A", "G", "M", "F", "Y", "W", "H", "T"]
       non_hydrophobic = ["K", "E", "Q", "D", "N", "S", "P", "B"]
-      polar = ["Y", "W", "H", "K", "R", "E", "Q", "D", "N", "S", "P", "B"]
-      non_polar = ["I", "L", "V", "C", "A", "G", "M", "F", "T"]
+      polar = ["R", "N", "D", "E", "Q", "H", "K", "S", "T", "Y"]
+      non_polar = ["A", "C", "G", "I", "L", "M", "F", "P", "W", "V"]
       small = ["V","C","A","G","D","N","S","T","P"]
       non_small = ["I","L","M","F","Y","W","H","K","R","E","Q"]
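
The revised polar and non_polar lists now split the 20 standard amino acids into two non-overlapping groups of ten. A minimal sketch of how two such lists can answer whether a substitution changes polarity is given below. polarity_changed? is a hypothetical helper; the gem's own comparison in output_information_methods.rb is written differently.

```ruby
POLAR     = ["R", "N", "D", "E", "Q", "H", "K", "S", "T", "Y"]
NON_POLAR = ["A", "C", "G", "I", "L", "M", "F", "P", "W", "V"]

# Hypothetical helper: true when the original and mutated residues fall in
# different classes, false when they share a class, and nil when either
# residue is in neither list (for example "*" or "X").
def polarity_changed?(original, mutated)
  known = POLAR + NON_POLAR
  return nil unless known.include?(original) && known.include?(mutated)
  POLAR.include?(original) != POLAR.include?(mutated)
end

puts polarity_changed?("R", "A").inspect
puts polarity_changed?("A", "V").inspect
puts polarity_changed?("*", "A").inspect
```

Returning nil for unknown residues keeps stop codons and ambiguity codes from being reported as either a change or no change.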

data/lib/output_information_methods.rb CHANGED

@@ -9,17 +9,17 @@ def output_information_methods(snps, outfile, cuttoff_genotype, cuttoff_snp, inf
   outfile.puts "pos_of_SNP_in_ref\tref_base\tSNP_base\tsynonymous or non-synonymous\tGene_annotation\tpossible_pseudogene?\tamino_acid_original\tamino_acid_change\tchange_in_hydrophobicity_of_AA?\tchange_in_polarisation_of_AA?\tchange_in_size_of_AA?\t#{strains.map{|strain| strain.name}.join("\t") if info}"
   snps_counter = 0
+  cds_snps_counter = 0
   total_number_of_syn_snps = 0
   total_number_of_non_syn_snps = 0
   total_number_of_pseudo = 0
   snps.each do |snp|
     ActiveRecord::Base.transaction do
-      snps_counter +=1
       snp.alleles.each do |allele|
         next if snp.alleles.any?{|allele| allele.base.length > 1} # indel
         if allele.id != snp.reference_allele_id
           # get annotation (if there is any) for each SNP
           features = Feature.joins(:snps).where("snps.id = ?", snp.id)
@@ -36,20 +36,25 @@ def output_information_methods(snps, outfile, cuttoff_genotype, cuttoff_snp, inf
           ref_base = Bio::Sequence.auto(Allele.find(snp.reference_allele_id).base)
           snp_base = Bio::Sequence.auto(allele.base)
+          # count snps now: after you have selected the snps with gqs and snp_qual greater than the threshold.
+          snps_counter += 1
           # If the feature is empty then just output basic information about the snp.
           if features.empty?
             outfile.puts "#{snp.ref_pos}\t#{features.map{|feature| feature.strand == 1} ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{features.map{|feature| feature.strand == 1} ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}"
-          else
+          else
             features.each do |feature|
               if feature.name == "CDS"
+                cds_snps_counter +=1
                 annotation = Annotation.where("annotations.qualifier = 'product' and annotations.feature_id = ?", feature.id).first
                 #if annotation is nil, or empty
                 if annotation.nil?
                   outfile.puts "#{snp.ref_pos}\t#{feature.strand == 1 ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{feature.strand == 1 ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}"
                 else
                   feature_sequence = feature.sequence
                   feature_sequence_bio = Bio::Sequence::NA.new(feature_sequence)
@@ -91,24 +96,25 @@ def output_information_methods(snps, outfile, cuttoff_genotype, cuttoff_snp, inf
                       allele_for_strains = Allele.joins(:genotypes => :strain).where("strains.id = ? AND alleles.snp_id = ?", strain.id, snp.id).first
                       alleles_array << allele_for_strains.base
                     end
                   # If no difference between the amino acids then its synonymous SNP, if different then its non-synonymous.
                   if original_seq_translated_clean == mutated_seq_translated_clean
-                    total_number_of_non_syn_snps +=1
+                    total_number_of_syn_snps +=1
                     if mutated_seq_translated_clean =~ /\*/
                       total_number_of_pseudo +=1
-                      outfile.puts "#{snp.ref_pos}\t#{feature.strand == 1 ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{feature.strand == 1 ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tsynonymous\t#{annotation.value}\tYes\tN/A\tN/A\tN/A\tN/A\tN/A\t#{alleles_array.join("\t") if info}"
+                      outfile.puts "#{snp.ref_pos}\t#{features.map{|feature| feature.strand == 1} ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{features.map{|feature| feature.strand == 1} ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tsynonymous\t#{annotation.value}\tYes\tN/A\tN/A\tN/A\tN/A\tN/A\t#{alleles_array.join("\t") if info}"
                     else
-                      outfile.puts "#{snp.ref_pos}\t#{feature.strand == 1 ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{feature.strand == 1 ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tsynonymous\t#{annotation.value}\tNo\tN/A\tN/A\tN/A\tN/A\tN/A\t#{alleles_array.join("\t") if info}"
+                      outfile.puts "#{snp.ref_pos}\t#{features.map{|feature| feature.strand == 1} ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{features.map{|feature| feature.strand == 1} ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tsynonymous\t#{annotation.value}\tNo\tN/A\tN/A\tN/A\tN/A\tN/A\t#{alleles_array.join("\t") if info}"
                     end
                   else
-                    total_number_of_syn_snps +=1
+                    total_number_of_non_syn_snps +=1
                     diffs = Diff::LCS.diff(original_seq_translated_clean, mutated_seq_translated_clean)
                     if mutated_seq_translated_clean =~ /\*/
                       total_number_of_pseudo +=1
-                      outfile.puts "#{snp.ref_pos}\t#{feature.strand == 1 ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{feature.strand == 1 ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tnon-synonymous\t#{annotation.value}\tYes\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}#{'No' if (hydrophobic.include? diffs[0][0].element) != (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}#{'No' if (polar.include? diffs[0][0].element) != (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}#{'No' if (small.include? diffs[0][0].element) != (non_small.include? diffs[0][1].element)}\t#{alleles_array.join("\t") if info}"
+                      outfile.puts "#{snp.ref_pos}\t#{features.map{|feature| feature.strand == 1} ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{features.map{|feature| feature.strand == 1} ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tnon-synonymous\t#{annotation.value}\tYes\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}#{'No' if (hydrophobic.include? diffs[0][0].element) != (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}#{'No' if (polar.include? diffs[0][0].element) != (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}#{'No' if (small.include? diffs[0][0].element) != (non_small.include? diffs[0][1].element)}\t#{alleles_array.join("\t") if info}"
                     else
-                      outfile.puts "#{snp.ref_pos}\t#{feature.strand == 1 ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{feature.strand == 1 ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tnon-synonymous\t#{annotation.value}\tNo\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}#{'No' if (hydrophobic.include? diffs[0][0].element) != (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}#{'No' if (polar.include? diffs[0][0].element) != (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}#{'No' if (small.include? diffs[0][0].element) != (non_small.include? diffs[0][1].element)}\t#{alleles_array.join("\t") if info}"
+                      outfile.puts "#{snp.ref_pos}\t#{features.map{|feature| feature.strand == 1} ? "#{ref_base.upcase}" : "#{ref_base.reverse_complement.upcase}"}\t#{features.map{|feature| feature.strand == 1} ? "#{snp_base.upcase}" : "#{snp_base.reverse_complement.upcase}"}\tnon-synonymous\t#{annotation.value}\tNo\t#{diffs[0][0].element}\t#{diffs[0][1].element}\t#{'Yes' if (hydrophobic.include? diffs[0][0].element) == (non_hydrophobic.include? diffs[0][1].element)}#{'No' if (hydrophobic.include? diffs[0][0].element) != (non_hydrophobic.include? diffs[0][1].element)}\t#{'Yes' if (polar.include? diffs[0][0].element) == (non_polar.include? diffs[0][1].element)}#{'No' if (polar.include? diffs[0][0].element) != (non_polar.include? diffs[0][1].element)}\t#{'Yes' if (small.include? diffs[0][0].element) == (non_small.include? diffs[0][1].element)}#{'No' if (small.include? diffs[0][0].element) != (non_small.include? diffs[0][1].element)}\t#{alleles_array.join("\t") if info}"
                     end
                   end
                 end
@@ -117,14 +123,16 @@ def output_information_methods(snps, outfile, cuttoff_genotype, cuttoff_snp, inf
           end
         end
       end
-      puts "Total SNPs added so far: #{snps_counter}" if snps_counter % 100 == 0
+      puts "Total SNPs added so far: #{cds_snps_counter}" if snps_counter % 100 == 0
     end
   end
   puts "Total number of snps: #{snps_counter}"
-  puts "Total number of synonymous SNPs #{total_number_of_syn_snps}"
-  puts "Total number of non-synonymous SNPs #{total_number_of_non_syn_snps}"
-  puts "Total number of pseudogenes #{total_number_of_pseudo}"
+  puts "Total number of snps in CDS region: #{cds_snps_counter}"
+  puts "Total number of synonymous SNPs: #{total_number_of_syn_snps}"
+  puts "Total number of non-synonymous SNPs: #{total_number_of_non_syn_snps}"
+  puts "Total number of pseudogenes: #{total_number_of_pseudo}"
   outfile.puts "Total number of snps: #{snps_counter}"
+  outfile.puts "Total number of snps in CDS region: #{cds_snps_counter}"
   outfile.puts "Total number of synonymous SNPs: #{total_number_of_syn_snps}"
   outfile.puts "Total number of non-synonymous SNPs: #{total_number_of_non_syn_snps}"
   outfile.puts "Total number of possible pseudogenes: #{total_number_of_pseudo}"
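
Note on the strand check in the hunk above: in 2.7.0 the ternary condition is `features.map{|feature| feature.strand == 1}`, which returns an Array. Any Array is truthy in Ruby, even an empty one, so the reverse-complement branch appears to be unreachable. The sketch below reproduces the behaviour in isolation. `Feature`, `revcomp`, `base_with_map` and `base_with_all` are hypothetical stand-ins and not part of snp-search, and `all?` is only one possible reading of the intended logic.

```ruby
# Hypothetical stand-in for the feature records in the diff; only `strand` is modelled.
Feature = Struct.new(:strand)

# Hypothetical helper standing in for the `reverse_complement` call used in the gem.
def revcomp(base)
  base.reverse.tr("acgtACGT", "tgcaTGCA")
end

# Mirrors the 2.7.0 expression: the condition is the Array returned by map,
# which is always truthy, so the first branch is always taken.
def base_with_map(features, base)
  features.map { |feature| feature.strand == 1 } ? base.upcase : revcomp(base).upcase
end

# Assumed intent: use the base as-is only when every feature is on the forward strand.
# Note that `all?` returns true for an empty list.
def base_with_all(features, base)
  features.all? { |feature| feature.strand == 1 } ? base.upcase : revcomp(base).upcase
end

reverse_only = [Feature.new(-1)]
puts base_with_map(reverse_only, "a") # prints "A": strand is ignored
puts base_with_all(reverse_only, "a") # prints "T": reverse complement applied
```

If a SNP can overlap features on both strands, a per-feature decision (as in 2.5.2, which tested a single `feature.strand`) may be closer to the original behaviour than either version above.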

data/lib/snp-search.rb CHANGED

@@ -16,8 +16,7 @@ def find_unqiue_snps(strain_names, out, cuttoff_genotype, cuttoff_snp)
   outfile = File.open(out, "w")
    snps = Snp.find_by_sql("SELECT snps.* from snps INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id INNER JOIN strains ON strains.id = genotypes.strain_id WHERE (#{where_statement}) AND alleles.id <> snps.reference_allele_id AND genotypes.geno_qual >= #{cuttoff_genotype} AND snps.qual >= #{cuttoff_snp} AND (SELECT COUNT(*) from snps AS s INNER JOIN alleles ON alleles.snp_id = snps.id INNER JOIN genotypes ON alleles.id = genotypes.allele_id WHERE alleles.id <> snps.reference_allele_id and s.id = snps.id) = #{strain_names.size} GROUP BY snps.id HAVING COUNT(*) = #{strain_names.size}")
-   puts "The number of unique snps are #{snps.size}"
+   # puts "The number of unique snps are #{snps.size}"
    output_information_methods(snps, outfile, cuttoff_genotype, cuttoff_snp, false)
 end

data/pkg/snp-search-2.5.2.gem ADDED

Binary file

data/pkg/snp-search-2.6.0.gem ADDED

Binary file

data/snp-search.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "snp-search"
-  s.version = "2.5.2"
+  s.version = "2.7.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Ali Al-Shahib", "Anthony Underwood"]
-  s.date = "2013-04-19"
+  s.date = "2013-08-02"
   s.description = "Use the snp-search tool to create, import, manipulate and query your SNP database"
   s.email = "ali.al-shahib@phe.gov.uk"
   s.executables = ["snp-search"]
@@ -40,6 +40,8 @@ Gem::Specification.new do |s|
     "pkg/snp-search-2.3.0.gem",
     "pkg/snp-search-2.4.0.gem",
     "pkg/snp-search-2.5.0.gem",
+    "pkg/snp-search-2.5.2.gem",
+    "pkg/snp-search-2.6.0.gem",
     "snp-search.gemspec",
     "spec/snp-search_spec.rb",
     "spec/spec_helper.rb"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: snp-search
 version: !ruby/object:Gem::Version
-  version: 2.5.2
+  version: 2.7.0
   prerelease:
 platform: ruby
 authors:
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-04-19 00:00:00.000000000 Z
+date: 2013-08-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -188,6 +188,8 @@ files:
 - pkg/snp-search-2.3.0.gem
 - pkg/snp-search-2.4.0.gem
 - pkg/snp-search-2.5.0.gem
+- pkg/snp-search-2.5.2.gem
+- pkg/snp-search-2.6.0.gem
 - snp-search.gemspec
 - spec/snp-search_spec.rb
 - spec/spec_helper.rb
@@ -206,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -1740466535190078901
+      hash: -258043406808362242
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: