snp-search 2.2.0 → 2.3.0
- data/Gemfile +1 -2
- data/Gemfile.lock +2 -3
- data/README +0 -105
- data/README.rdoc +35 -29
- data/Rakefile +2 -2
- data/VERSION +1 -1
- data/bin/snp-search +174 -261
- data/lib/create_methods.rb +196 -0
- data/lib/filter_ignore_snps_methods.rb +130 -0
- data/lib/information_methods.rb +117 -0
- data/lib/output_information_methods.rb +131 -0
- data/lib/snp-search.rb +18 -280
- data/lib/snp_db_connection.rb +1 -2
- data/lib/snp_db_models.rb +3 -3
- data/lib/snp_db_schema.rb +119 -80
- data/pkg/snp-search-1.1.0.gem +0 -0
- data/pkg/snp-search-1.2.0.gem +0 -0
- data/pkg/snp-search-2.3.0.gem +0 -0
- data/snp-search.gemspec +15 -12
- metadata +73 -33
- data/.rspec +0 -1
data/Gemfile
CHANGED
@@ -5,10 +5,9 @@ source "http://rubygems.org"
 
 gem "activerecord", "~> 3.1.3"
 gem "bio", "~> 1.4.2"
-gem "slop", "~>
+gem "slop", "~> 2.4.0"
 gem 'sqlite3', "~> 1.3.4"
 gem 'activerecord-import', "~> 0.2.8"
-gem "diff-lcs", "~> 1.1.3"
 
 # Add dependencies to develop your gem here.
 # Include everything needed to run rake, tests, features, etc.
data/Gemfile.lock
CHANGED
@@ -36,7 +36,7 @@ GEM
 rspec-expectations (2.3.0)
 diff-lcs (~> 1.1.2)
 rspec-mocks (2.3.0)
-slop (
+slop (2.4.0)
 sqlite3 (1.3.4)
 tzinfo (0.3.31)
 
@@ -48,9 +48,8 @@ DEPENDENCIES
 activerecord-import (~> 0.2.8)
 bio (~> 1.4.2)
 bundler (~> 1.0.0)
-diff-lcs (~> 1.1.3)
 jeweler (~> 1.6.4)
 rcov
 rspec (~> 2.3.0)
-slop (~>
+slop (~> 2.4.0)
 sqlite3 (~> 1.3.4)
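The `~> 2.4.0` constraint that this release pins for slop is RubyGems' pessimistic version operator: it accepts 2.4.x patch releases but rejects 2.5.0 and later. The following is a minimal sketch of that behaviour using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The candidate version numbers other than 2.4.0 are illustrative assumptions and do not come from this diff.

```ruby
require "rubygems"

# "~> 2.4.0" is shorthand for ">= 2.4.0" and "< 2.5.0".
slop_requirement = Gem::Requirement.new("~> 2.4.0")

# Illustrative candidate versions (assumptions, not taken from the lockfile).
candidates = %w[2.3.9 2.4.0 2.4.5 2.5.0]

# Keep only the versions the constraint would allow Bundler to resolve to.
accepted = candidates.select do |version|
  slop_requirement.satisfied_by?(Gem::Version.new(version))
end

puts accepted.inspect
```

Only the 2.4.x entries pass the check, which is why the lockfile can resolve to exactly `slop (2.4.0)` while still allowing later patch releases.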
data/README
CHANGED
@@ -1,105 +0,0 @@
-= snp-search
-
-SNPsearch is a tool that manages SNP data and allows for data importing, manipulating, editing and complex querying of SNP data. It can be used to evaluate the utility of SNPs for the assessment of genetic diversity between haploid strains and the management of genotype and phenotype data. Once the database is created, the user is provided with several query and output options. SNPsearch is particularly useful in the analysis of phylogenetic trees that are based on SNP differences across whole core genomes. Queries can be made to answer critical genomic questions such as the association of SNPs with particular phenotypes.
-
-== Obtaining and installing the code
-SNPsearch is written in Ruby and operates in a Unix environment. It is made available as a gem. See the github site for more information (https://github.com/hpa-bioinformatics/snp-search).
-
-To install snp-search, do
-gem install snp-search
-
-== Requirements
-
-Not much, you just need:
-
-* Unix. Once snp-search is installed, all the necessary gems to run snp-search will also be installed from Rubygems (note that Rubygems requires admin privileges. If you do not have admin privileges then we suggest you install RVM: (http://beginrescueend.com/rvm/install/) and then gem install snp-search).
-* ruby version 1.8.7 and above.
-
-* Optional: FastTree. If you require a tree output in Newick format, you must install FastTree from http://www.microbesonline.org/fasttree/#Install. You must specify the path of the executable in your .bashrc or .profile file as snp-search will run the command as just 'FastTree' and will not know where FastTree is if it is not specified in your .bashrc or .profile file.
-
-Thats it!
-
-== Running snp-search
-
-1- Creating the database (snp-search -create)
-
-Two files are needed to create the SQLite3 database:
-
-1- Variant Call Format (.vcf) file (which contains the SNP information)
-
-2- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
-
-You need the following parameters:
-
--n Name of your database
--v .vcf file
--d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
-
-Other options:
--c SNP quality score cutoff. A Phred-scaled quality score. High quality scores indicate high confidence calls. Optional, default = 90 (out of 100)
--g Genotype Quality score cutoff. Phred-scaled quality score that the genotype is true. Optional, default = 30
--h help message
-
-Usage:
-snp-search -create -n my_snp_db.sqlite3 -d my_ref.gbk -v my_vcf_file.vcf
-
-Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
-
-2- Querying the Database (snp-search -query)
-
-Two queries are currently scripted in SNPsearch:
-
-1- unique_snps: This option queries the database and selects the number of unique SNPs within the list of the strains/samples provided. The output is the number of unique SNPs.
-
-You need the following parameters:
-
--n Name of your database
--s The strains/samples you like to query
-
-Usage:
-snp-search -n my_snp_db.sqlite3 -s list_of_my_strains.txt
-
-2- not_include_snps_from_gene: This option queries the database to select only those SNPs not found in a specified gene. These SNPs are used to make a concatenated SNP multiple alignment file (FASTA format). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny.
-
-You need the following parameters:
-
--n Name of your database
--a The gene you like to remove from analysis
--o Output file, in fasta format
-
-options:
--t Generate SNP phylogeny
--w Output tree in Newick format
-
-Usage (phage is used as the example gene):
-snp-search -n my_snp_db.sqlite3 -a phage -o snps_sequences_without_phage.fasta -t -w snps_sequences_without_phage.nwk
-
-The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
-
-3- Output database (snp-search -out_file)
-
-You need the following parameters:
-
--n Name of your database
--o Output file containing the database in fasta format
-
-== View database in Unix or in a GUI
-Your database will be in sqlite3 format. If you like to view your table(s) and perform direct queries you can type
-sqlite3 snp_db.sqlite3
-
-Alternatively, you may download a SQL tool to view your database (e.g. SQLite sorcerer).
-
-== Contact
-
-If you have any comments, questions or suggestions, please email
-ali.al-shahib@hpa.org.uk
-or
-anthony.underwood@hpa.org.uk
-
-Have fun snp-searching!
-
-== Copyright
-
-Copyright (c) 2012 Ali Al-Shahib. See LICENSE.txt for
-further details.
-
data/README.rdoc
CHANGED
@@ -21,17 +21,17 @@ Thats it!
 
 == Running snp-search
 
-1-
+1- The first thing you need to do is to create the database (snp-search -create)
 
 Two files are needed to create the SQLite3 database:
 
-
+1A- Variant Call Format (.vcf) file (which contains the SNP information)
 
-
+1B- Your database reference genome that you used to generate your .vcf file (in genbank or embl format, the script will automatically detect the format).
 
 You need the following parameters:
 
--n Name of your database
+-n Name of your database (note that this is a required field in all commands).
 -v .vcf file
 -d Database Reference genome (The same file that was used in generating the .vcf file). This should be in genbank or embl format.
 
@@ -45,43 +45,49 @@ You need the following parameters:
 
 Note: The strain names in your database will be taken from your vcf file so make sure they are named appropriately in your vcf file.
 
-2-
+2- Now that you have created the database (my_snp_db.sqlite3) you can use snp-search to output several queried data.
 
-
+2A- First, you should choose which output format you like:
+-f, --fasta: output fasta file format (not available with -unique_snps option)
+-T, --tabular: output tabular file format
 
-
+2B- Next, you need to tell snp-search what you want out. You have several options:
+- Querying the Database to select the number of unique SNPs within the list of the strains/samples provided (list_of_my_strains.txt). The output is a text file with a list of the unique SNPs and information about each SNP (e.g. if its synonymous or non-synonymous SNP).
 
-
+-u, --unique_snps Query for unique snps in the database (only used with -tabular option)
+-s, --strain The strains/samples you like to query (only used with -unique_snps flag)
+
+Usage:
+snp-search -n my_snp_db.sqlite3 -O -T -u -n my_snp_db.sqlite3 -s list_of_my_strains.txt -o unique_snps.out
 
--
--s The strains/samples you like to query
+- Querying the database to output all SNPs without specified features in the database (e.g. phages). This is a way of removing a set of genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree Newick file for SNP phylogeny (if -F option was used to ouput fasta file).
 
-
-
-
-
-
-
+-e, --ignore_snps_from_feature Ignore SNPs from specified features in the database
+-r, --remove_non_informative_snps Only output informative SNPs
+-I, --ignore_snps_in_range A list of position ranges to ignore e.g 10..500,2000..2500
+-R, --ignore_strains A list of strains to ignore (seperate by comma e.g. S1,S4,S8 )
+-a, --annotation The name of the gene to ignore (only used with the -ignore_snps_from_feature flag)
+-o, --out Name of output file
 
-
--
--o Output file, in fasta format
+Usage:
+snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -o snps_without_phages.fasta
 
-options:
+- Optionally, you can add the following options to generate a phylogenetic tree from the resulting fasta file:
+
 -t Generate SNP phylogeny
 -w Output tree in Newick format
-
-
-
-
+Usage:
+snp-search -O -F -e -n my_snp_db.sqlite3 -a phage,insertion,transposon -r -t -w -o snps_without_phages.fasta
+
 The algorithm FastTree is used to generate the nwk file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install (see above)
 
-
+- Output all SNPs with information. Information for each SNP includes whether the SNP is synonymous or non-synonymous, gene function, whether it is a pseudogene and other useful information. These information will be tab-seperated.
 
-
-
-
-
+-E, --info Output various information about SNPs
+-o, --out Name of output file
+
+Usage:
+snp-search -O -T -E -n my_snp_db.sqlite3 o snps_all_with_info.txt
 
 == View database in Unix or in a GUI
 Your database will be in sqlite3 format. If you like to view your table(s) and perform direct queries you can type
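The updated README describes `--ignore_snps_in_range` as a comma-separated list of position ranges (e.g. `10..500,2000..2500`) and `--ignore_strains` as a comma-separated list of strain names (e.g. `S1,S4,S8`). The code that actually consumes these values is in the new lib files, which this page does not show, so the following is only a sketch of how such strings could be parsed. The helper names `parse_position_ranges`, `position_ignored?` and `parse_strain_list` are hypothetical and are not part of snp-search.

```ruby
# Hypothetical helper: turn "10..500,2000..2500" into an array of Ruby ranges.
def parse_position_ranges(spec)
  spec.split(",").map do |chunk|
    first, last = chunk.strip.split("..", 2).map { |n| Integer(n, 10) }
    # Reject entries such as "10" (no upper bound) or "500..10" (reversed).
    raise ArgumentError, "bad range: #{chunk.inspect}" if last.nil? || first > last
    (first..last)
  end
end

# Hypothetical helper: true when a SNP position falls inside any ignored range.
def position_ignored?(position, ranges)
  ranges.any? { |range| range.cover?(position) }
end

# Hypothetical helper: turn "S1,S4,S8 " into ["S1", "S4", "S8"].
def parse_strain_list(spec)
  spec.split(",").map(&:strip).reject(&:empty?)
end

ranges = parse_position_ranges("10..500,2000..2500")
puts position_ignored?(250, ranges)   # position inside 10..500
puts position_ignored?(1000, ranges)  # position between the two ranges
puts parse_strain_list("S1,S4,S8 ").inspect
```

Stripping whitespace around each entry matters here because the README's own example (`S1,S4,S8 `) ends with a trailing space.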
data/Rakefile
CHANGED
@@ -15,11 +15,11 @@ require 'jeweler'
 Jeweler::Tasks.new do |gem|
 # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
 gem.name = "snp-search"
-gem.homepage = "http://github.com/
+gem.homepage = "http://github.com/phe-bioinformatics/snp-search"
 gem.license = "MIT"
 gem.summary = %Q{Tool for generating SNP database}
 gem.description = %Q{Use the snp-search tool to create, import, manipulate and query your SNP database}
-gem.email = "ali.al-shahib@
+gem.email = "ali.al-shahib@phe.gov.uk"
 gem.authors = ["Ali Al-Shahib", "Anthony Underwood"]
 gem.executables = ["snp-search"]
 # dependencies defined in Gemfile
data/VERSION
CHANGED
@@ -1 +1 @@
-2.
+2.3.0
data/bin/snp-search
CHANGED
@@ -1,329 +1,242 @@
 require 'snp-search'
-require 'snp_db_connection'
-require 'snp_db_models'
-require 'snp_db_schema'
+require '../lib/snp_db_connection.rb'
+require '../lib/snp_db_models.rb'
+require '../lib/snp_db_schema.rb'
+require '../lib/output_information_methods.rb'
 require 'activerecord-import'
 require 'slop'
 
 opts = Slop.parse do
 
-banner "\nruby snp-search [-create] [-
+banner "\nruby snp-search [-create] [-output] [-n <sqlite3>] [options]*"
 separator ''
 
 on :C, :create, 'Create database'
-on :
-
-separator ''
-# separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
-# separator 'The following command must be used when using -create, or -query or -out_file'
-on :n, :name=, 'Name of database, Required'
+on :O, :output, 'Output a process'
+
+# separator ''
+# # separator 'README file: https://github.com/hpa-bioinformatics/snp-search/blob/master/README.rdoc'
+# # separator 'The following command must be used when using -create, or -query or -out_file'
+# on :n, :name=, 'Name of database, Required'
+
 separator ''
 
-separator '-create options'
+separator '-create [options]'
 on :d, :database_reference_file=, 'Reference genome file, in gbk or embl file format, Required', true
 on :v, :vcf_file=, 'variant call format (vcf) file, Required', true
-on :
+on :n, :name=, 'Name of database, Required'
+on :A, :cuttoff_ad=, 'AD ratio cutoff (default 0.9)', :as => :int, :default => 0.9
+
+separator ''
+
+separator '-output -snps_from_feature -n db_name [options] [-fasta] [-tabular]'
+on :F, :fasta, 'output fasta file format'
+on :T, :tabular, 'output tabular file format'
+on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
 on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
+on :S, :snps_from_feature, 'SNPs from specified features in the database (if you do not want to ignore any SNPs, just use this option with -n -F/T -o)'
+on :r, :remove_non_informative_snps, 'Only output informative SNPs. Only used with -e option'
+on :e, :ignore_snps_in_range=, 'A list of position ranges to ignore e.g 10..500,2000..2500. Only used with -e option'
+on :R, :ignore_strains=, 'A list of strains to ignore (seperate by comma e.g. S1,S4,S8 ). Only used with -e option'
+on :I, :ignore_snps_on_annotation=, 'The name of the feature to ignore.'
+on :o, :out=, 'Name of output file, Required'
+on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
+on :p, :fasttree_path=, 'Full path to the FastTree tool (e.g. /usr/local/bin/FastTree. only used with -tree option)'
+
 separator ''
-
-separator '-
+
+separator '-output -unique_snps -n db_name [-fasta] [-tabular] [options]'
+on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
 on :u, :unique_snps, 'Query for unique snps in the database'
-on :
-on :
-
+on :s, :strain=, 'The strains/samples you like to query (only used with -unique_snps flag)'
+on :o, :out=, 'Name of output file, Required'
+
 separator ''
 
-separator '-output [-fasta] [-
-on :
-on :
-on :
-on :t, :tree, 'Generate SNP phylogeny'
-on :w, :nwk_out=, 'Name of output tree in Newick format'
-
+separator '-output -info -n db_name [-fasta] [-tabular] [options]'
+on :i, :info, 'Output various information about SNPs'
+on :c, :cuttoff_snp_qual=, 'SNP quality cutoff, (default = 90)', :as => :int, :default => 90
+on :g, :cuttoff_genotype=, 'Genotype quality cutoff (default = 30)', :as => :int, :default => 30
+on :t, :tree, 'Generate SNP phylogeny (only used with -fasta option)'
+on :w, :nwk_out=, 'Name of output tree in Newick format (only used with -tree option)'
+on :o, :out=, 'Name of output file, Required'
 end
-# opts.end
 
 ###########################################################
 
 # CREATING A DATABASE
 if opts[:create]
 
-
-
-
-
-error_msg += "-n: \t Name of your database\n" unless opts[:name]
-error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
-error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
-
-error_msg_optional = ""
-
-error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
-error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
-
-unless error_msg == ""
-puts "Please provide the following required fields:"
-puts error_msg
-puts "Optional fields:"
-puts error_msg_optional
-puts opts.help unless opts.empty?
-exit
-end
-
-abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
-
-abort "#{opts[:vcf_file]} file does not exist!" unless File.exist?(opts[:vcf_file])
-
-
-# Name of your database
-establish_connection(opts[:name])
-
-# Schema will run here
-db_schema
-
-ref = opts[:database_reference_file]
-
-sequence_format = guess_sequence_format(ref)
-
-case sequence_format
-when :genbank
-sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:database_reference_file]).next_entry
-when :embl
-sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:database_reference_file]).next_entry
-else
-puts "All sequence files should be in genbank or embl format"
-exit
-end
+# puts opts[:cuttoff_snp_qual].to_i
+
+error_msg = ""
 
-
-
+error_msg += "-n: \t Name of your database\n" unless opts[:name]
+error_msg += "-d: \t Reference genome file, in gbk or embl file format\n" unless opts[:database_reference_file]
+error_msg += "-v: \t .vcf file\n" unless opts[:vcf_file]
 
-
-populate_features_and_annotations(sequence_flatfile)
+error_msg_optional = ""
 
-
+error_msg_optional += "-c: \tSNP quality cutoff, (default = 90)\n"
+error_msg_optional += "-g: \tGenotype quality cutoff (default = 30)\n"
+
+unless error_msg == ""
+puts "Please provide the following required fields:"
+puts error_msg
+puts "Optional fields:"
+puts error_msg_optional
+puts opts.help unless opts.empty?
+exit
+end
+
+abort "#{opts[:database_reference_file]} file does not exist!" unless File.exist?(opts[:database_reference_file])
+
+abort "#{opts[:vcf_file]} file does not exist!" unless File.exist?(opts[:vcf_file])
 
-populate_snps_alleles_genotypes(vcf_mpileup_file, opts[:cuttoff_snp], opts[:cuttoff_genotype])
 
-
+# Name of your database
+establish_connection(opts[:name])
 
-
+# Schema will run here
+db_schema
 
-
-elsif opts [:query]
-#FIND UNIQUE SNPS
-if opts[:unique_snps]
+ref = opts[:database_reference_file]
 
-
+sequence_format = guess_sequence_format(ref)
 
-
-
-
-
-
-
-puts
+case sequence_format
+when :genbank
+sequence_flatfile = Bio::FlatFile.open(Bio::GenBank,opts[:database_reference_file]).next_entry
+when :embl
+sequence_flatfile = Bio::FlatFile.open(Bio::EMBL,opts[:database_reference_file]).next_entry
+else
+puts "All sequence files should be in genbank or embl format"
 exit
 end
-
-abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
-abort "#{opts[:strain]} file does not exist!" unless File.exist?(opts[:strain])
-
-establish_connection(opts[:name])
 
-
-
-strains << line.chop
-end
+# The populate_features_and_annotations method populates the features and annotations. It uses the embl/gbk file.
+populate_features_and_annotations(sequence_flatfile)
 
-
-# exit
-gas_snps = find_shared_snps(strains)
+#The populate_snps_alleles_genotypes method populates the snps, alleles and genotypes. It uses the vcf file, and if specified, the SNP quality cutoff and genotype quality cutoff
 
-
-puts "The number of unique snps are #{snp.id}"
-end
+populate_snps_alleles_genotypes(opts[:vcf_file], opts[:cuttoff_ad])
 
-
-# REMOVE SNPS ASSOCIATED WITH SPECIFIC GENES
-elsif opts[:not_include_snps_from_gene]
+###########################################################
 
-
+# QUERYING THE DATABASE
+elsif opts[:output]
 
-
-
-error_msg += "-a: \t name of the gene that you like to remove from the database\n" unless opts[:annotation]
-
-error_msg_optional = ""
+error_msg = ""
+error_msg += "-S: \t SNPs from specified features in the database OR\n-u: \t Query for unique snps in the database OR\n-i: \t Information on all SNPs\n" unless opts[:snps_from_feature] || opts[:unique_snps] || opts[:info]
 
-
-
+unless error_msg == ""
+puts "Please provide the following required fields:"
+puts error_msg
+puts opts.help unless opts.empty?
+exit
+end
 
-
-puts "Please provide the following required fields:"
-puts error_msg
-puts "Optional fields:"
-puts error_msg_optional
-puts opts.help unless opts.empty?
-exit
-end
+if opts[:snps_from_feature]
 
-
-abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
-
-# annotation = opts[:annotation]
-establish_connection(opts[:name])
-
-# Getting list of strains from database
-strains = Strain.all
-
-sequence_hash = Hash.new
-# create a sequence hash
-# hash key is strain_id, loop through strain_id
-# create an empty array
-strains.each do |strain|
-sequence_hash[strain.id] = Array.new
-end
+error_msg = ""
 
-
-
-
-
-
-
-i = 0
-puts "Your Query is submitted and is being processed......."
-snps.each do |snp|
-# puts snp.inspect
-i += 1
-puts "Total number of SNPs generated so far: #{i}" if i % 100 == 0
-ActiveRecord::Base.transaction do
-snp.alleles.each do |allele|
-# puts allele.inspect
-allele.genotypes.each do |genotype|
-#push bases to hash
-sequence_hash[genotype.strain_id] << allele.base
-end
-end
-end
-end
+error_msg += "-n: \t Name of your database\n" unless opts[:name]
+error_msg += "-o: \t name of your output file\n" unless opts[:out]
+error_msg += "-F: \t Fasta output OR\n-T: \t Tabular output" unless opts[:fasta] || opts[:tabular]
+
+error_msg_optional = ""
 
-
-strains
-
-
+error_msg_optional += "-I,\t --ignore_snps_on_annotation: ignore SNPs from specified features in the database\n" unless opts[:ignore_snps_on_annotation]
+error_msg_optional += "-R,\t --ignore_strains: A list of strains to ignore\n" unless opts[:ignore_strains]
+error_msg_optional += "-i,\t --ignore_snps_in_range: A list of position ranges to ignore e.g 10..500,2000..2500\n" unless opts[:ignore_snps_in_range]
+error_msg_optional += "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality\n" unless opts[:cuttoff_snp_qual]
+error_msg_optional += "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality\n" unless opts[:cuttoff_genotype]
+error_msg_optional += "-r,\t --remove_non_informative_snps: Only output informative SNPs\n" unless opts[:remove_non_informative_snps]
+error_msg_optional += "-t,\t --tree: Construct tree from output\n" unless opts[:tree]
+error_msg_optional += "-w,\t --nwk_out: Name of Newick output file(use only when-tree option used)\n" unless opts[:nwk_out]
+
+unless error_msg == ""
+puts "Please provide the following required fields:"
+puts error_msg
+puts "Optional fields:"
+puts error_msg_optional
+# Added this here as it wont appear here in error_msg_optional as its set as default.
+puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+puts opts.help unless opts.empty?
+exit
 end
 
-#
-
-
-end
+abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+
+establish_connection(opts[:name])
 
-
-puts "use -unique_snps or -not_include_snps_from_gene query options"
+get_snps(opts[:out], opts[:ignore_snps_on_annotation], opts[:ignore_snps_in_range], opts[:ignore_strains], opts[:remove_non_informative_snps], opts[:fasta], opts[:tabular], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual], opts[:tree], opts[:fasttree_path])
 end
 
-
+####################################################################################################
+#FIND UNIQUE SNPS
+if opts[:unique_snps]
 
-# OUTPUT DATABASE IN FASTA FORMAT
-elsif opts[:output]
-if opts[:fasta]
 error_msg = ""
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+error_msg += "-n: \t Name of your database\n" unless opts[:name]
+error_msg += "-s: \t List of strains you like to query\n" unless opts[:strain]
+error_msg += "-o: \t Name of the output file\n" unless opts[:out]
+
+unless error_msg == ""
+puts "Please provide the following required fields:"
+puts error_msg
+puts "Optional fields:"
+# Added this here as it wont appear here in error_msg_optional as its set as default.
+puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+puts opts.help unless opts.empty?
+exit
+end
+
+abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
+abort "#{opts[:strain]} file does not exist!" unless File.exist?(opts[:strain])
 
 establish_connection(opts[:name])
-
-# Getting list of strains from database
-strains = Strain.all
-
-sequence_hash = Hash.new
-# create a sequence hash
-# hash key is strain_id, loop through strain_id
-# create an empty array
-strains.each do |strain|
-sequence_hash[strain.id] = Array.new
-end
-
-
-output = File.open("#{opts[:out]}", "w")
-
-# Select all snps
-snps = Snp.all
-
-i = 0
-puts "Your out file is being prepared......."
-snps.each do |snp|
-i += 1
-puts "Total number of SNPs outputted so far: #{i}" if i % 100 == 0
-
-ActiveRecord::Base.transaction do
-snp.alleles.each do |allele|
-# puts allele.inspect
-allele.genotypes.each do |genotype|
-#push bases to hash
-sequence_hash[genotype.strain_id] << allele.base
-end
-end
-end
-end
-
-puts sequence_hash
-exit
-#generate FASTA file
-strains.each do |strain|
-output.print ">#{strain.name}\n" , sequence_hash[strain.id].join("")
-output.puts
-end
 
-
-
-
-
+strains = []
+File.read(opts[:strain]).each_line do |line|
+strains << line.chop
+end
+# find_unique_snps defined in bin/snp-search.rb
+find_unqiue_snps(strains, opts[:out], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual])
 end
+
+##############################################################
+if opts[:info]
 
-#########################################
-
-if opts[:syn]
 error_msg = ""
 
-
-
-
-
-
-
-
-
-
+error_msg += "-n: \t the name of your database\n" unless opts[:name]
+error_msg += "-o: \t name of your output file (in tab-delimited format)\n" unless opts[:out]
+
+unless error_msg == ""
+puts "Please provide the following required fields:"
+puts error_msg
+puts "Optional fields:"
+# Added this here as it wont appear here in error_msg_optional as its set as default.
+puts "-c,\t --cuttoff_snp_qual: cuttoff for SNP Quality (default 90)\n"
+puts "-g,\t --cuttoff_genotype: cuttoff for Genotype Quality (default 30)\n"
+puts opts.help unless opts.empty?
+exit
+end
 
-
-abort "#{opts[:database_reference_file]} vcf file does not exist!" unless File.exist?(opts[:database_reference_file])
+abort "#{opts[:name]} database does not exist!" unless File.exist?(opts[:name])
 
 establish_connection(opts[:name])
-
-ref = opts[:database_reference_file]
-
-synonymous(ref)
-end
 
+#information defined in bin/snp-search.rb
+information(opts[:out], opts[:cuttoff_genotype], opts[:cuttoff_snp_qual])
+
+end
+
 else
-
+puts opts.help
 end
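The new `bin/snp-search` repeats the same pattern in each branch: build an `error_msg` string from the required options that are missing, print it, and exit. It also reads the strain list with `line.chop`, which removes the last character of a line whether or not it is a newline. The sketch below is not the gem's code. It shows one way these two steps could be factored out, using hypothetical helpers `missing_fields` and `parse_strain_lines` and a plain Hash standing in for the Slop options object.

```ruby
# Hypothetical helper: return the message for every required option that is
# not set. `opts` only needs to respond to [] (a Hash here, Slop in the script).
def missing_fields(opts, required)
  required.reject { |key, _message| opts[key] }.map { |_key, message| message }
end

# Hypothetical helper: one strain name per line. chomp strips only the line
# ending, so a final line without a trailing newline keeps its last character.
def parse_strain_lines(text)
  text.each_line.map(&:chomp).map(&:strip).reject(&:empty?)
end

# Messages copied from the -unique_snps branch of the script.
required = {
  :name   => "-n: \t Name of your database",
  :strain => "-s: \t List of strains you like to query",
  :out    => "-o: \t Name of the output file"
}

# Stand-in for parsed command-line options (assumption: -s was not given).
opts = { :name => "my_snp_db.sqlite3", :out => "unique_snps.out" }

missing = missing_fields(opts, required)
unless missing.empty?
  puts "Please provide the following required fields:"
  puts missing
end

# A strain file whose last line has no trailing newline.
strains = parse_strain_lines("S1\nS2")
puts strains.inspect
puts "S2".chop.inspect # what chop would have stored for that last line
```

With `chop`, the last strain name in a file that lacks a final newline loses its last character, so a later lookup by strain name could fail to match. `chomp` avoids that case.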