snp-search 2.2.0 → 2.3.0

data/Gemfile CHANGED
@@ -5,10 +5,9 @@ source "http://rubygems.org"
 
  gem "activerecord", "~> 3.1.3"
  gem "bio", "~> 1.4.2"
- gem "slop", "~> 3.3.1"
+ gem "slop", "~> 2.4.0"
  gem 'sqlite3', "~> 1.3.4"
  gem 'activerecord-import', "~> 0.2.8"
- gem "diff-lcs", "~> 1.1.3"
 
  # Add dependencies to develop your gem here.
  # Include everything needed to run rake, tests, features, etc.
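The slop requirement above moves from "~> 3.3.1" to "~> 2.4.0", which is a downgrade across a major version rather than a bump. As a rough illustration of what each pessimistic constraint admits, the sketch below uses RubyGems' own Gem::Requirement and Gem::Version classes. The helper name `admitted` and the candidate version strings are made up for the example; they are not a list of actual slop releases.

```ruby
require 'rubygems'

# Hypothetical helper: keep only the version strings a requirement accepts.
def admitted(requirement, versions)
  versions.select { |v| requirement.satisfied_by?(Gem::Version.new(v)) }
end

old_req = Gem::Requirement.new('~> 3.3.1') # constraint in 2.2.0's Gemfile
new_req = Gem::Requirement.new('~> 2.4.0') # constraint in 2.3.0's Gemfile

# Illustrative version strings only, chosen to sit on both sides of each bound.
candidates = %w[2.3.9 2.4.0 2.4.4 2.5.0 3.3.1 3.3.3 3.4.0]

puts "old constraint admits: #{admitted(old_req, candidates).inspect}"
puts "new constraint admits: #{admitted(new_req, candidates).inspect}"
```

Because "~> 2.4.0" only allows the last digit to move, the two constraints admit disjoint sets of versions, so code written against one slop series is not guaranteed to run against the other.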
data/Gemfile.lock CHANGED
@@ -36,7 +36,7 @@ GEM
  rspec-expectations (2.3.0)
  diff-lcs (~> 1.1.2)
  rspec-mocks (2.3.0)
- slop (3.3.1)
+ slop (2.4.0)
  sqlite3 (1.3.4)
  tzinfo (0.3.31)
 
@@ -48,9 +48,8 @@ DEPENDENCIES
  activerecord-import (~> 0.2.8)
  bio (~> 1.4.2)
  bundler (~> 1.0.0)
- diff-lcs (~> 1.1.3)
  jeweler (~> 1.6.4)
  rcov
  rspec (~> 2.3.0)
- slop (~> 3.3.1)
+ slop (~> 2.4.0)
  sqlite3 (~> 1.3.4)
data/README CHANGED
@@ -1,105 +0,0 @@
- = snp-search
-
- SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data. It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data. Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes. Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
-
- == Obtaining and installing the code
- SNPsearch is written in Ruby and operates in a Unix environment. It is made available as a gem. See the github site for more information (https://github.com/hpa-bioinformatics/snp-search).
-
- To install snp-search, do
- gem install snp-search
-
- == Requirements
-
- Not much, you just need:
-
- * Unix. Once snp-search is installed, all the necessary gems to run snp-search will also be installed from Rubygems (note that Rubygems requires admin privileges. If you do not have admin privileges then we suggest you install RVM: (http://beginrescueend.com/rvm/install/) and then gem install snp-search).
- * ruby version 1.8.7 and above.
-
- * Optional: FastTree. If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install. You must specify the path of the executable in your .bashrc or .profile file as snp-search will run the command as just 'FastTree' and will not know where FastTree is if it is not specified in your .bashrc or .profile file.
-
- Thats it!
-
- == Running snp-search
-
- 1- Creating the database (snp-search -create)
-
- Two files are needed to create the SQLite3 database:
-
- 1- Variant Call Format (.vcf) file (which contains the SNP information)
-
- 2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
-
- You need the following parameters:
-
- -n Name of your database
- -v .vcf file
- -d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
-
- Other options:
- -c SNP quality score cutoff. A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
- -g Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true. Optional, default = 30
- -h help message
-
- Usage:
- snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
-
- Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
-
- 2- Querying the Database (snp-search -query)
-
- Two queries are currently scripted in SNPsearch:
-
- 1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
-
- You need the following parameters:
-
- -n Name of your database
- -s The strains/samples you like to query
-
- Usage:
- snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
-
- 2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
-
- You need the following parameters:
-
- -n Name of your database
- -a The gene you like to remove from analysis
- -o Output file, in fasta format
-
- options:
- -t Generate SNP phylogeny
- -w Output tree in Newick format
-
- Usage (phage is used as the example gene):
- snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
-
- The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
-
- 3- Output database (snp-search -out_file)
-
- You need the following parameters:
-
- -n Name of your database
- -o Output file containing the database in fasta format
-
- == View database in Unix or in a GUI
- Your database will be in sqlite3 format. If you like to view your table(s) and perform direct queries you can type
- sqlite3 snp_db.sqlite3
-
- Alternatively, you may download a SQL tool to view your database (e.g. SQLite sorcerer).
-
- == Contact
-
- If you have any comments, questions or suggestions, please email
- ali.al-shahib@hpa.org.uk
- or
- anthony.underwood@hpa.org.uk
-
- Have fun snp-searching!
-
- == Copyright
-
- Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
- further details.
-
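The removed README describes the -c and -g cutoffs as Phred-scaled quality scores, with defaults of 90 and 30. For readers unfamiliar with the scale, a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below only illustrates that standard conversion; the method name is hypothetical and nothing here is taken from snp-search's own code.

```ruby
# Hypothetical helper: convert a Phred-scaled quality score into the
# probability that the call is wrong, using P = 10 ** (-Q / 10).
def phred_to_error_probability(quality)
  10.0**(-quality / 10.0)
end

# The two default cutoffs quoted in the README, shown as error probabilities.
[30, 90].each do |cutoff|
  puts format('Q%d -> error probability %.1e', cutoff, phred_to_error_probability(cutoff))
end
```

Seen this way, the default genotype cutoff of 30 tolerates roughly one wrong call in a thousand, while the SNP quality cutoff of 90 is far stricter.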
data/README.rdoc CHANGED
@@ -21,17 +21,17 @@ Thats it!
 
  == Running snp-search
 
- 1- Creating the database (snp-search -create)
+ 1- The first thing you need to do is to create the database (snp-search -create)
 
  Two files are needed to create the SQLite3 database:
 
- 1- Variant Call Format (.vcf) file (which contains the SNP information)
+ 1A- Variant Call Format (.vcf) file (which contains the SNP information)
 
- 2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
+ 1B- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
 
  You need the following parameters:
 
- -n Name of your database
+ -n Name of your database (note that this is a required field in all commands).
  -v .vcf file
  -d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
 
@@ -45,43 +45,49 @@ You need the following parameters:
 
  Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
 
- 2- Querying the Database (snp-search -query)
+ 2- Now that you have created the database (my_snp_db.sqlite3) you can use snp-search to output several queried data.
 
- Two queries are currently scripted in SNPsearch:
+ 2A- First, you should choose which output format you like:
+ -f, --fasta: output fasta file format (not available with -unique_snps option)
+ -T, --tabular: output tabular file format
 
- 1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
+ 2B- Next, you need to tell snp-search what you want out. You have several options:
+ - Querying the Database to select the number of unique SNPs within the list of the strains/samples provided (list_of_my_strains.txt). The output is a text file with a list of the unique SNPs and information about each SNP (e.g. if its synonymous or non-synonymous SNP).
 
- You need the following parameters:
+ -u, --unique_snps Query for unique snps in the database (only used with -tabular option)
+ -s, --strain The strains/samples you like to query (only used with -unique_snps flag)
+
+ Usage:
+ snp-search -n my_snp_db.sqlite3 -O -T -u -n my_snp_db.sqlite3 -s list_of_my_strains.txt -o unique_snps.out
 
- -n Name of your database
- -s The strains/samples you like to query
+ - Querying the database to output all SNPs without specified features in the database (e.g. phages). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny (if -F option was used to ouput fasta file).
 
- Usage:
- snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
-
- 2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
-
- You need the following parameters:
+ -e, --ignore_snps_from_feature Ignore SNPs from specified features in the database
+ -r, --remove_non_informative_snps Only output informative SNPs
+ -I, --ignore_snps_in_range A list of position ranges to ignore e.g 10..500,2000..2500
+ -R, --ignore_strains A list of strains to ignore (seperate by comma e.g. S1,S4,S8 )
+ -a, --annotation The name of the gene to ignore (only used with the -ignore_snps_from_feature flag)
+ -o, --out Name of output file
 
- -n Name of your database
- -a The gene you like to remove from analysis
- -o Output file, in fasta format
+ Usage:
+ snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -o snps_without_phages.fasta
 
- options:
+ - Optionally, you can add the following options to generate a phylogenetic tree from the resulting fasta file:
+
  -t Generate SNP phylogeny
  -w Output tree in Newick format
-
- Usage (phage is used as the example gene):
- snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
-
+ Usage:
+ snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -t -w -o snps_without_phages.fasta
+
  The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
 
- 3- Output database (snp-search -out_file)
+ - Output all SNPs with information. Information for each SNP includes whether the SNP is synonymous or non-synonymous, gene function, whether it is a pseudogene and other useful information. These information will be tab-seperated.
 
- You need the following parameters:
-
- -n Name of your database
- -o Output file containing the database in fasta format
+ -E, --info Output various information about SNPs
+ -o, --out Name of output file
+
+ Usage:
+ snp-search -O -T -E -n my_snp_db.sqlite3 o snps_all_with_info.txt
 
  == View database in Unix or in a GUI
  Your database will be in sqlite3 format. If you like to view your table(s) and perform direct queries you can type
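The updated README introduces two list-valued options: position ranges written as "10..500,2000..2500" and strain names written as "S1,S4,S8". The diff does not show how snp-search parses these strings, so the following is only a minimal sketch of one plausible way to turn them into Ruby objects. Both helper names are hypothetical and are not part of the gem.

```ruby
# Hypothetical helper: parse "10..500,2000..2500" into an array of Ranges.
# Integer() raises on malformed input instead of silently returning 0.
def parse_position_ranges(text)
  text.split(',').map do |chunk|
    first, last = chunk.strip.split('..', 2)
    Range.new(Integer(first), Integer(last))
  end
end

# Hypothetical helper: true when a SNP position falls in any ignored range.
def ignored_position?(position, ranges)
  ranges.any? { |range| range.include?(position) }
end

ranges = parse_position_ranges('10..500,2000..2500')
strains_to_ignore = 'S1,S4,S8'.split(',').map { |name| name.strip }

puts ranges.inspect
puts strains_to_ignore.inspect
puts ignored_position?(250, ranges)
puts ignored_position?(1000, ranges)
```

Stripping each item before use means a value such as "S1, S4" with a stray space after the comma still yields clean strain names.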
data/Rakefile CHANGED
@@ -15,11 +15,11 @@ require 'jeweler'
  Jeweler::Tasks.new do |gem|
  # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
  gem.name = "snp-search"
- gem.homepage = "http://github.com/hpa-bioinformatics/snp-search"
+ gem.homepage = "http://github.com/phe-bioinformatics/snp-search"
  gem.license = "MIT"
  gem.summary = %Q{Tool for generating SNP database}
  gem.description = %Q{Use the snp-search tool to create, import, manipulate and query your SNP database}
- gem.email = "ali.al-shahib@hpa.org.uk"
+ gem.email = "ali.al-shahib@phe.gov.uk"
  gem.authors = ["Ali Al-Shahib", "Anthony Underwood"]
  gem.executables = ["snp-search"]
  # dependencies defined in Gemfile
data/VERSION CHANGED
@@ -1 +1 @@
- 2.2.0
+ 2.3.0
data/bin/snp-search CHANGED
@@ -1,329 +1,242 @@
  require 'snp-search'
- require 'snp_db_connection'
- require 'snp_db_models'
- require 'snp_db_schema'
+ require '../lib/snp_db_connection.rb'
+ require '../lib/snp_db_models.rb'
+ require '../lib/snp_db_schema.rb'
+ require '../lib/output_information_methods.rb'
  require 'activerecord-import'
  require 'slop'
 
  opts = Slop.parse do
 
- banner "\nruby snp-search [-create] [-query] [-output] [-n <sqlite3>] [options]*"
+ banner "\nruby snp-search [-create] [-output] [-n <sqlite3>] [options]*"
  separator ''
 
  on :C, :create, 'Create database'
- on :Q, :query, 'Query database'
- on :O, :output, 'Output options'
- separator ''
- # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
- # separator 'The following command must be used when using -create, or -query or -out_file'
- on :n, :name=, 'Name of database, Required'
+ on :O, :output, 'Output a process'
+
+ # separator ''
+ # # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
+ # # separator 'The following command must be used when using -create, or -query or -out_file'
+ # on :n, :name=, 'Name of database, Required'
+
  separator ''
 
- separator '-create options'
+ separator '-create [options]'
  on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
  on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
- on :c, :cuttoff_snp=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+ on :n, :name=, 'Name of database, Required'
+ on :A, :cuttoff_ad=, 'AD ratio cutoff (default 0.9)', :as => :int, :default => 0.9
+
+ separator ''
+
+ separator '-output -snps_from_feature -n db_name [options] [-fasta] [-tabular]'
+ on :F, :fasta, 'output fasta file format'
+ on :T, :tabular, 'output tabular file format'
+ on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
  on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
+ on :S, :snps_from_feature, 'SNPs from specified features in the database (if you do not want to ignore any SNPs, just use this option with -n -F/T -o)'
+ on :r, :remove_non_informative_snps, 'Only output informative SNPs. Only used with -e option'
+ on :e, :ignore_snps_in_range=, 'A list of position ranges to ignore e.g 10..500,2000..2500. Only used with -e option'
+ on :R, :ignore_strains=, 'A list of strains to ignore (seperate by comma e.g. S1,S4,S8 ). Only used with -e option'
+ on :I, :ignore_snps_on_annotation=, 'The name of the feature to ignore.'
+ on :o, :out=, 'Name of output file, Required'
+ on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
+ on :p, :fasttree_path=, 'Full path to the FastTree tool (e.g. /usr/local/bin/FastTree. only used with -tree option)'
+
  separator ''
-
- separator '-query options'
+
+ separator '-output -unique_snps -n db_name [-fasta] [-tabular] [options]'
+ on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+ on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
  on :u, :unique_snps, 'Query for unique snps in the database'
- on :r, :not_include_snps_from_gene, 'Remove SNPs from specified gene from database'
- on :s, :strain=, 'The strains/samples you like to query, Required'
- on :a, :annotation=, 'The gene you like to remove from analysis'
+ on :s, :strain=, 'The strains/samples you like to query (only used with -unique_snps flag)'
+ on :o, :out=, 'Name of output file, Required'
+
  separator ''
 
- separator '-output [-fasta] [-syn] options'
- on :f, :fasta, 'output fasta file'
- on :S, :syn, 'output tab-delimited file with synonymous and non-synonymous info'
- on :o, :out=, 'Name of output file'
- on :t, :tree, 'Generate SNP phylogeny'
- on :w, :nwk_out=, 'Name of output tree in Newick format'
-
+ separator '-output -info -n db_name [-fasta] [-tabular] [options]'
+ on :i, :info, 'Output various information about SNPs'
+ on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+ on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
+ on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
+ on :w, :nwk_out=, 'Name of output tree in Newick format (only used with -tree option)'
+ on :o, :out=, 'Name of output file, Required'
  end
- # opts.end
 
  ###########################################################
 
  # CREATING A DATABASE
  if opts[:create]
 
- # puts opts[:cuttoff_snp].to_i
-
- error_msg = ""
-
- error_msg += "-n: \t Name of your database\n" unless opts[:name]
- error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
- error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
-
- error_msg_optional = ""
-
- error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
- error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
-
- unless error_msg == ""
- puts "Please provide the following required fields:"
- puts error_msg
- puts "Optional fields:"
- puts error_msg_optional
- puts opts.help unless opts.empty?
- exit
- end
-
- abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
-
- abort "#{opts[:vcf_file]} file does not exist!" unless File.exist?(opts[:vcf_file])
-
-
- # Name of your database
- establish_connection(opts[:name])
-
- # Schema will run here
- db_schema
-
- ref = opts[:database_reference_file]
-
- sequence_format = guess_sequence_format(ref)
-
- case sequence_format
- when :genbank
- sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:database_reference_file]).next_entry
- when :embl
- sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:database_reference_file]).next_entry
- else
- puts "All sequence files should be in genbank or embl format"
- exit
- end
+ # puts opts[:cuttoff_snp_qual].to_i
+
+ error_msg = ""
 
- # path for vcf file here
- vcf_mpileup_file = opts[:vcf_file]
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
+ error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
+ error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
 
- # The populate_features_and_annotations method populates the features and annotations. It uses the embl/gbk file.
- populate_features_and_annotations(sequence_flatfile)
+ error_msg_optional = ""
 
- #The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes. It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
+ error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
+ error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
+
+ unless error_msg == ""
+ puts "Please provide the following required fields:"
+ puts error_msg
+ puts "Optional fields:"
+ puts error_msg_optional
+ puts opts.help unless opts.empty?
+ exit
+ end
+
+ abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
+
+ abort "#{opts[:vcf_file]} file does not exist!" unless File.exist?(opts[:vcf_file])
 
- populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp], opts[:cuttoff_genotype])
 
- # puts "populate_snps_alleles_genotypes(#{vcf_mpileup_file}, #{opts[:cuttoff_snp]}, #{opts[:cuttoff_genotype]}.to_i)"
+ # Name of your database
+ establish_connection(opts[:name])
 
- ###########################################################
+ # Schema will run here
+ db_schema
 
- # QUERYING THE DATABASE
- elsif opts [:query]
- #FIND UNIQUE SNPS
- if opts[:unique_snps]
+ ref = opts[:database_reference_file]
 
- error_msg = ""
+ sequence_format = guess_sequence_format(ref)
 
- error_msg += "-n: \t Name of your database\n" unless opts[:name]
- error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
-
- unless error_msg == ""
- puts "Please provide the following required fields:"
- puts error_msg
- puts opts.help unless opts.empty?
+ case sequence_format
+ when :genbank
+ sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:database_reference_file]).next_entry
+ when :embl
+ sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:database_reference_file]).next_entry
+ else
+ puts "All sequence files should be in genbank or embl format"
  exit
  end
-
- abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
- abort "#{opts[:strain]} file does not exist!" unless File.exist?(opts[:strain])
-
- establish_connection(opts[:name])
 
- strains = []
- File.read(opts[:strain]).each_line do |line|
- strains << line.chop
- end
+ # The populate_features_and_annotations method populates the features and annotations. It uses the embl/gbk file.
+ populate_features_and_annotations(sequence_flatfile)
 
- # puts find_shared_snps(strains)
- # exit
- gas_snps = find_shared_snps(strains)
+ #The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes. It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
 
- gas_snps.each do |snp|
- puts "The number of unique snps are #{snp.id}"
- end
+ populate_snps_alleles_genotypes(opts[:vcf_file], opts[:cuttoff_ad])
 
- ################################################################
- # REMOVE SNPS ASSOCIATED WITH SPECIFIC GENES
- elsif opts[:not_include_snps_from_gene]
+ ###########################################################
 
- error_msg = ""
+ # QUERYING THE DATABASE
+ elsif opts[:output]
 
- error_msg += "-n: \t Name of your database\n" unless opts[:name]
- error_msg += "-o: \t name of your output file\n" unless opts[:out]
- error_msg += "-a: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
-
- error_msg_optional = ""
+ error_msg = ""
+ error_msg += "-S: \t SNPs from specified features in the database OR\n-u: \t Query for unique snps in the database OR\n-i: \t Information on all SNPs\n" unless opts[:snps_from_feature] || opts[:unique_snps] || opts[:info]
 
- error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
- error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
+ unless error_msg == ""
+ puts "Please provide the following required fields:"
+ puts error_msg
+ puts opts.help unless opts.empty?
+ exit
+ end
 
- unless error_msg == ""
- puts "Please provide the following required fields:"
- puts error_msg
- puts "Optional fields:"
- puts error_msg_optional
- puts opts.help unless opts.empty?
- exit
- end
+ if opts[:snps_from_feature]
 
-
- abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
-
- # annotation = opts[:annotation]
- establish_connection(opts[:name])
-
- # Getting list of strains from database
- strains = Strain.all
-
- sequence_hash = Hash.new
- # create a sequence hash
- # hash key is strain_id, loop through strain_id
- # create an empty array
- strains.each do |strain|
- sequence_hash[strain.id] = Array.new
- end
+ error_msg = ""
 
- # output opened for data input
- output = File.open("#{opts[:out]}", "w")
-
- # Perform query
- snps = Snp.includes(:alleles => :genotypes).find_by_sql("SELECT snps.* FROM snps INNER JOIN features ON features.id = snps.feature_id WHERE features.id NOT IN (select distinct features.id FROM features INNER JOIN annotations ON annotations.feature_id = features.id WHERE annotations.value LIKE '%#{opts[:annotation]}%')")
-
- i = 0
- puts "Your Query is submitted and is being processed......."
- snps.each do |snp|
- # puts snp.inspect
- i += 1
- puts "Total number of SNPs generated so far: #{i}" if i % 100 == 0
- ActiveRecord::Base.transaction do
- snp.alleles.each do |allele|
- # puts allele.inspect
- allele.genotypes.each do |genotype|
- #push bases to hash
- sequence_hash[genotype.strain_id] << allele.base
- end
- end
- end
- end
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
+ error_msg += "-o: \t name of your output file\n" unless opts[:out]
+ error_msg += "-F: \t Fasta output OR\n-T: \t Tabular output" unless opts[:fasta] || opts[:tabular]
+
+ error_msg_optional = ""
 
- #generate FASTA file
- strains.each do |strain|
- output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
- output.puts
+ error_msg_optional += "-I,\t --ignore_snps_on_annotation: ignore SNPs from specified features in the database\n" unless opts[:ignore_snps_on_annotation]
+ error_msg_optional += "-R,\t --ignore_strains: A list of strains to ignore\n" unless opts[:ignore_strains]
+ error_msg_optional += "-i,\t --ignore_snps_in_range: A list of position ranges to ignore e.g 10..500,2000..2500\n" unless opts[:ignore_snps_in_range]
+ error_msg_optional += "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality\n" unless opts[:cuttoff_snp_qual]
+ error_msg_optional += "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality\n" unless opts[:cuttoff_genotype]
+ error_msg_optional += "-r,\t --remove_non_informative_snps: Only output informative SNPs\n" unless opts[:remove_non_informative_snps]
+ error_msg_optional += "-t,\t --tree: Construct tree from output\n" unless opts[:tree]
+ error_msg_optional += "-w,\t --nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
+
+ unless error_msg == ""
+ puts "Please provide the following required fields:"
+ puts error_msg
+ puts "Optional fields:"
+ puts error_msg_optional
+ # Added this here as it wont appear here in error_msg_optional as its set as default.
+ puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+ puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+ puts opts.help unless opts.empty?
+ exit
  end
 
- # GENERATE TREE FROM FASTA FILE
- if opts[:tree]
- `FastTree -fastest -nt #{opts[:out]} > #{opts[:nwk_out]}`
- end
+ abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+
+ establish_connection(opts[:name])
 
- else
- puts "use -unique_snps or -not_include_snps_from_gene query options"
+ get_snps(opts[:out], opts[:ignore_snps_on_annotation], opts[:ignore_snps_in_range], opts[:ignore_strains], opts[:remove_non_informative_snps], opts[:fasta], opts[:tabular], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual], opts[:tree], opts[:fasttree_path])
  end
 
- # ##############################################################
+ ####################################################################################################
+ #FIND UNIQUE SNPS
+ if opts[:unique_snps]
 
- # OUTPUT DATABASE IN FASTA FORMAT
- elsif opts[:output]
- if opts[:fasta]
  error_msg = ""
 
- error_msg += "-n: \t Name of your database\n" unless opts[:name]
- error_msg += "-o: \t name of your output file (in FASTA format)\n" unless opts[:out]
-
- error_msg_optional = ""
-
- error_msg_optional += "-tree: \t Construct tree from output\n" unless opts[:tree]
- error_msg_optional += "-nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
-
- unless error_msg == ""
- puts "Please provide the following required fields:"
- puts error_msg
- puts "Optional fields:"
- puts error_msg_optional
- puts opts.help unless opts.empty?
- exit
- end
-
- abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+ error_msg += "-n: \t Name of your database\n" unless opts[:name]
+ error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
+ error_msg += "-o: \t Name of the output file\n" unless opts[:out]
+
+ unless error_msg == ""
+ puts "Please provide the following required fields:"
+ puts error_msg
+ puts "Optional fields:"
+ # Added this here as it wont appear here in error_msg_optional as its set as default.
+ puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+ puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+ puts opts.help unless opts.empty?
+ exit
+ end
+
+ abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+ abort "#{opts[:strain]} file does not exist!" unless File.exist?(opts[:strain])
 
  establish_connection(opts[:name])
-
- # Getting list of strains from database
- strains = Strain.all
-
- sequence_hash = Hash.new
- # create a sequence hash
- # hash key is strain_id, loop through strain_id
- # create an empty array
- strains.each do |strain|
- sequence_hash[strain.id] = Array.new
- end
-
-
- output = File.open("#{opts[:out]}", "w")
-
- # Select all snps
- snps = Snp.all
-
- i = 0
- puts "Your out file is being prepared......."
- snps.each do |snp|
- i += 1
- puts "Total number of SNPs outputted so far: #{i}" if i % 100 == 0
-
- ActiveRecord::Base.transaction do
- snp.alleles.each do |allele|
- # puts allele.inspect
- allele.genotypes.each do |genotype|
- #push bases to hash
- sequence_hash[genotype.strain_id] << allele.base
- end
- end
- end
- end
-
- puts sequence_hash
- exit
- #generate FASTA file
- strains.each do |strain|
- output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
- output.puts
- end
 
- if opts[:tree]
- # puts "FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}"
- `FastTree -fastest -nt #{opts[:out]} > #{opts[:w]}`
- end
+ strains = []
+ File.read(opts[:strain]).each_line do |line|
+ strains << line.chop
+ end
+ # find_unique_snps defined in bin/snp-search.rb
+ find_unqiue_snps(strains, opts[:out], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual])
  end
+
+ ##############################################################
+ if opts[:info]
 
- #########################################
-
- if opts[:syn]
  error_msg = ""
 
- error_msg += "-n option: \t the name of your database\n" unless opts[:name]
- error_msg += "-d option: \t the reference file in gbk format\n" unless opts[:database_reference_file]
-
- unless error_msg == ""
- puts "Please provide the following required fields:"
- puts error_msg
- puts opts.help unless opts.empty?
- exit
- end
+ error_msg += "-n: \t the name of your database\n" unless opts[:name]
+ error_msg += "-o: \t name of your output file (in tab-delimited format)\n" unless opts[:out]
+
+ unless error_msg == ""
+ puts "Please provide the following required fields:"
+ puts error_msg
+ puts "Optional fields:"
+ # Added this here as it wont appear here in error_msg_optional as its set as default.
+ puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+ puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+ puts opts.help unless opts.empty?
+ exit
+ end
 
- abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
- abort "#{opts[:database_reference_file]} vcf file does not exist!" unless File.exist?(opts[:database_reference_file])
+ abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
 
  establish_connection(opts[:name])
-
- ref = opts[:database_reference_file]
-
- synonymous(ref)
- end
 
+ #information defined in bin/snp-search.rb
+ information(opts[:out], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual])
+
+ end
+
  else
- puts opts.help
+ puts opts.help
  end
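One detail in the script above is worth a note for reviewers: the strain list is read with `line.chop`, which removes the last character of every line whether or not it is a newline. If the strains file has no trailing newline, the final strain name loses its last letter. The sketch below shows an alternative based on `chomp`. The helper name `read_strain_names` and the sample file contents are hypothetical and are not part of snp-search.

```ruby
require 'tempfile'

# Hypothetical helper: read one strain name per line, dropping only the
# line terminator (chomp), surrounding whitespace, and blank lines.
def read_strain_names(path)
  File.readlines(path).map { |line| line.chomp.strip }.reject { |name| name.empty? }
end

# Demonstration with a made-up strains file that lacks a trailing newline.
Tempfile.open('strains') do |file|
  file.write("S1\nS4\nS8")
  file.flush

  with_chop  = File.read(file.path).each_line.map { |line| line.chop }
  with_chomp = read_strain_names(file.path)

  puts "chop:  #{with_chop.inspect}"
  puts "chomp: #{with_chomp.inspect}"
end
```

With `chomp`, a file ending without a newline and a file ending with one produce the same list, so lookups against strain names taken from the vcf file would not silently miss the last entry.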