anncrsnp 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/README.md +33 -16
- data/bin/grdbmanager.rb +33 -13
- data/bin/statistics.rb +20 -17
- data/lib/anncrsnp/dataset.rb +3 -3
- data/lib/anncrsnp/parsers/ucscparser.rb +4 -4
- data/lib/anncrsnp/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b49a4c171dac690e9c33215c3ad7ce265583feba
+  data.tar.gz: 864cc1ee837cc96db41c7940ba2c048f7e4f6c34
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: edf8ff3d825c2b12d4b232eb8a206860d94280b0a3e84fde97bd36481f9af5de210ce10d334cacf1f5baa3ccb24517fb51833c95527dbf57f0a14a21fae0879d
+  data.tar.gz: c4cc14d9a5ed2af88bb618404348fc8c8a559c665b06e2320f1011e14dc947a862fad8187171d52a69b2e9512cf0f9954d839a7835cb62009725fea27e636d6d
data/README.md
CHANGED
@@ -1,38 +1,55 @@
 # Anncrsnp
 
-
+AnNCR-SNP integrates data from various sources, allowing the user to obtain annotation to investigate the potential effects of variation in non-coding regions of the human genome. AnNCR-SNP consists of a database containing data on all non-coding elements and two main programs: manager and finder. The manager program creates a local database on the user's computer, and the finder program queries the local database, returning a table of results. A prebuilt local database is downloaded the first time the finder program is used. To build the local database with custom information, use the manager program.
 
-
+The user can mine the local database, searching for information about SNPs that overlap with various genomic features suggestive of regulatory activity, such as TFBSs, open chromatin, histone modifications, methylation sites and enhancers. These genomic features were obtained from a number of different projects and data sources (ENCODE, FANTOM5, DENdb, amongst others). SNP information comes from dbSNP, gene information from RefSeq and conserved regions from 46WayCons.
+
+If you use this tool, please cite us: Rojano E, Ranea JA, Perkins JR. Characterisation of non-coding genetic variation in histamine receptors using AnNCR-SNP. Amino Acids. 2016 Jun 6. DOI: 10.1007/s00726-016-2265-5.
 
 ## Installation
 
-
+Install the package directly as:
 
-
-gem 'anncrsnp'
-```
+    $ gem install anncrsnp
 
-
+## Usage
 
-
+### Finder
 
-
+The user can query the local database using a list of SNPs or genomic coordinates. When the user runs the first query using the finder program (grdbfinder.rb), AnNCR-SNP will download the database (by default, into the directory where the ruby gem is installed). It then performs the search.
 
-
+An example of use:
 
-
+    $ grdbfinder.rb -n rs2470893,rs12049351 -g snpDbSnp -F 200 -o output_file -f txt
+
+Where:
+
+```
+-n: SNP identifier(s) to be queried. The user can also give gene identifiers (RefSeq gene symbols), or use the -c option instead of -n to search by coordinates in the following format: chr:start:stop (example: chr3:11128779:11178779).
+-g: when set to 'snpDbSnp', the script generates a tabular file listing each SNP found that overlaps with some feature of interest (regulatory element, gene, etc).
+-F: flanking region length (in nucleotides) added upstream and downstream of each SNP, gene or coordinate queried. Used to widen the search range.
+-o: output file name.
+-f: output file format. Supported formats: .txt, .html.
+```
+
+Optional flags:
+
+```
+-r: output for graphical representation, in .gff3 format.
+-p: path to a custom database. Use this when the database was built by the user with the manager program and the default database should not be used.
+```
 
-
+Note: if this is the first query you perform, the program will download the database. This can take a while depending on your Internet connection. Database size: 1.5GB.
 
-
+The user can also give the AnNCR-SNP finder a file with coordinates (use flag -c) or a list of SNPs or genes to search for (use flag -n). The file must contain one element per line.
 
-
+### Manager
 
-
+Only used if the user wants to build a local database with custom information. Under construction.
 
 ## Contributing
 
-Bug reports and pull requests are welcome
+Bug reports and pull requests are welcome. Please contact the anncrsnp gem developer (elenarojano at uma.es).
 
 
 ## License
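The `chr:start:stop` coordinate format accepted by `-c` and the flanking length set with `-F` can be illustrated with a short Ruby sketch. The helper `expand_region` is hypothetical and is not part of anncrsnp; it only shows how a flank of N nucleotides widens the searched interval on both sides. Clamping the start at 1 is an assumption, since the README does not say how the finder treats regions near the start of a chromosome.

```ruby
# Hypothetical helper, not part of the anncrsnp gem: parses a region written
# as chr:start:stop and widens it by a flanking length on both sides.
def expand_region(region, flank = 0)
  chr, start, stop = region.split(':')
  raise ArgumentError, "expected chr:start:stop, got #{region.inspect}" if stop.nil?

  start = Integer(start)
  stop = Integer(stop)
  raise ArgumentError, 'start must not exceed stop' if start > stop

  # Assumption: coordinates never go below 1 after subtracting the flank.
  [chr, [start - flank, 1].max, stop + flank]
end

chr, start, stop = expand_region('chr3:11128779:11178779', 200)
puts "#{chr}\t#{start}\t#{stop}"
```

With the README's example region and `-F 200`, the sketch searches from 11128579 to 11178979 on chr3.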
data/bin/grdbmanager.rb
CHANGED
@@ -43,50 +43,51 @@ if File.exist?(options[:data])
     current_file = File.basename(file)
     ### Definitive sources
     #If bin field from UCSC doesn't exist, put FALSE as input data to parseUCSCformat method
+    #OMIT IN HEADER THE 4 FIRST COLUMNS
     if current_file == "wgEncodeAwgDnaseMasterSites.bed"
       header = [:score, :floatScore, :sourceCount, :sourceIds]
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
       current_dataset.numeric_filter(:sourceCount, 2)
       current_dataset.drop_columns(header)
       current_dataset.add_metadata(:classification, 'DNAseHS')
       all_data['dnaseData'] = current_dataset
     elsif current_file == "wgEncodeHaibMethyl450Ag04449SitesRep1.bed"
       header = [:score, :strand, :thickStart, :thickEnd, :itemRgb]
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
       current_dataset.drop_columns(header)
-      current_dataset.add_metadata(:classification, '
-      all_data['
+      current_dataset.add_metadata(:classification, 'Methylation_sites')
+      all_data['methylationData'] = current_dataset
     elsif current_file == "snp144Common.txt" # current_file == "test.txt"
       header = [:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :class, :valid, :avHet, :avHetSE, :func, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields]
-      current_dataset = parseUCSCformat(file, header)
+      current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
       current_dataset.drop_columns([:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :valid, :avHet, :avHetSE, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields])
       current_dataset.add_metadata(:classification, 'SNP')
       all_data['snpDbSnp'] = current_dataset
     elsif current_file == "refGene.txt"
       header = [:name, :strand, :cdsStart, :cdsEnd, :exonCount, :exonStarts, :exonEnds, :score, :cdsStartStat, :cdsEndStat, :exonFrames]
-      current_dataset = parseUCSCrefseqformat(file, header)
+      current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
       current_dataset.drop_columns(header)
       current_dataset.add_metadata(:classification, 'gene')
       all_data['gene'] = current_dataset
     elsif current_file == "TFBSMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
       header = []
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
       current_dataset.add_metadata(:classification, 'TFBS')
       all_data['tfbs'] = current_dataset
     elsif current_file == "HistoneModMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
       header = []
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
       current_dataset.add_metadata(:classification, 'HistoneModification')
       all_data['HistoneModification'] = current_dataset
     elsif current_file == "46waycons.txt"
       header = [:span, :count, :offset, :file, :lowerLimit, :dataRange, :validCount, :sumData, :sumSquares]
-      current_dataset = parseUCSCformat(file, header)
+      current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
       current_dataset.drop_columns(header)
       current_dataset.add_metadata(:classification, 'ConservedRegions')
       all_data['ConservedRegions'] = current_dataset
     elsif current_file == "enhancer_tss_associations.bed"
       header = [:score, :strand, :enh_start, :enh_stop, :array, :index, :val1, :val2]
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
       current_dataset.drop_columns(header)
       current_dataset.add_metadata(:classification, 'Enhancers')
       all_data['Enhancers'] = current_dataset
@@ -98,10 +99,29 @@ if File.exist?(options[:data])
       all_data['DENdbEnhancers'] = current_dataset
     elsif current_file == "all_hg19_bed.bed"
       header = [:counter]
-      current_dataset = parseUCSCformat(file, header, FALSE)
+      current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
       current_dataset.drop_columns(header)
       current_dataset.add_metadata(:classification, 'SuperEnhancers')
-      all_data['SuperEnhancers'] = current_dataset
+      all_data['SuperEnhancers'] = current_dataset
+    elsif current_file == "oreganno_tfbs.txt"
+      header = []
+      current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
+      current_dataset.drop_columns(header)
+      current_dataset.add_metadata(:classification, 'ORegAnnoTFBS')
+      all_data['ORegAnnoTFBS'] = current_dataset
+    elsif current_file == "oreganno_regulatory.txt"
+      header = []
+      current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
+      current_dataset.drop_columns(header)
+      current_dataset.add_metadata(:classification, 'ORegAnnoRegulatoryElements')
+      all_data['ORegAnnoRegulatoryElements'] = current_dataset
+    #UNCOMMENT FOR INCLUDE GTEX DATA FROM UCSC. THIS FILE DOESN'T HAVE BIN FIELD!!
+    # elsif current_file == "gtexGene.txt"
+    # header = [:name, :score, :strand, :geneId, :geneType, :expCount, :expScores]
+    # current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
+    # current_dataset.drop_columns(header)
+    # current_dataset.add_metadata(:classification, 'GTEx')
+    # all_data['GTEx'] = current_dataset
     end
   end
 end
@@ -223,4 +243,4 @@ if options[:create_sql]
   end
   DB.execute("CREATE INDEX name_index ON GenomicRange (name)")
   DB.execute("CREATE INDEX bin_index ON GenomicRange (bin)")
-  DB.close
+  DB.close
data/bin/statistics.rb
CHANGED
@@ -19,11 +19,13 @@ def load_snp_data(input_file, fields_length)
     "HistoneModification" => [],
     "tfbs" => [],
     "dnaseData" => [],
-    "
+    "methylationData" => [],
     "ConservedRegions" => [],
     "Enhancers" => [],
     "DENdbEnhancers" => [],
-    "SuperEnhancers" => []
+    "SuperEnhancers" => [],
+    "ORegAnnoTFBS" => [],
+    "ORegAnnoRegulatoryElements" => []
   }
   categories.each do |category_name, category_value|
     column_position = index[category_name]
@@ -46,11 +48,13 @@ def snp_calculate_stats(snp_storage)
     "HistoneModification" => 0,
     "tfbs" => 0,
     "dnaseData" => 0,
-    "
+    "methylationData" => 0,
     "ConservedRegions" => 0,
     "Enhancers" => 0,
     "DENdbEnhancers" => 0,
-    "SuperEnhancers" => 0
+    "SuperEnhancers" => 0,
+    "ORegAnnoTFBS" => 0,
+    "ORegAnnoRegulatoryElements" => 0
   }
   snp_storage.each do |snp_name, annotations|
     annotations.each do |annotation_category, annotation_value|
@@ -70,9 +74,7 @@ end
 def create_histogram(snp_percentage, name)
   # create Histogram
   p=ScbiPlot::Histogram.new(name,'SNPs genomic region annotations')
-
   # add x axis data
-
   p.add_x(snp_percentage.keys)
   puts snp_percentage.keys.inspect
   # add y axis data
@@ -88,11 +90,13 @@ def snp_calculate_stats_with_reference(snp_storage, snp_storage_reference)
     "HistoneModification" => 0,
     "tfbs" => 0,
     "dnaseData" => 0,
-    "
+    "methylationData" => 0,
     "ConservedRegions" => 0,
     "Enhancers" => 0,
     "DENdbEnhancers" => 0,
-    "SuperEnhancers" => 0
+    "SuperEnhancers" => 0,
+    "ORegAnnoTFBS" => 0,
+    "ORegAnnoRegulatoryElements" => 0
   }
 
   snp_storage_reference.each do |snp_name_ref, annotations_ref|
@@ -129,7 +133,7 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
     if !(annotation_value_ref & annotation_value).empty? || annotation_value.length >= 5
       result= true
     end
-  elsif annotation_category_ref == '
+  elsif annotation_category_ref == 'methylationData' &&
     !annotation_value.empty?
     result = true
   elsif annotation_category_ref == 'HistoneModification'
@@ -162,17 +166,18 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
   elsif annotation_category_ref == 'SuperEnhancers' &&
     !annotation_value.empty?
     result = true
+  elsif annotation_category_ref == 'ORegAnnoTFBS' &&
+    !annotation_value.empty?
+    result = true
+  elsif annotation_category_ref == 'ORegAnnoRegulatoryElements' &&
+    !annotation_value.empty?
+    result = true
   end
   return result
 end
 
 #MAIN
 #----------
-
-#REMEMBER: this program performs statistical analyses and compares the results for two given files.
-#In our case, we compare the data produced by our program with the experimentally obtained data.
-#our data = ARGV[0], experimental data = ARGV[1]
-#if no second input argument is given, the analysis is run on the program's own output
 fields_length = 5
 fields_length = ARGV[2].to_i if !ARGV[2].nil?
 
@@ -184,10 +189,8 @@ else
   snp_percentage = snp_calculate_stats(snp_storage)
 end
 snp_percentage.each do |category_name, percentage|
-  puts "#{category_name}\t#{percentage}\t#{ARGV[3]}"
+  puts "#{category_name.capitalize}\t#{percentage}\t#{ARGV[3]}"
 end
-
-#The plot file will be written to the directory where the script is run
 # file_name = File.basename(ARGV[0], ".txt")
 # graph_name = file_name + ".png"
 # create_histogram(snp_percentage, graph_name)
data/lib/anncrsnp/dataset.rb
CHANGED
@@ -10,9 +10,9 @@ class Dataset
     add_metadata(:header, [:chr, :start, :ending, :id].concat(header))
   end
 
-  def add_record(fields_array) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
-    fields_array[START] = fields_array[START].to_i
-    fields_array[ENDING] = fields_array[ENDING].to_i
+  def add_record(fields_array, add_start = 0, add_stop = 0) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
+    fields_array[START] = fields_array[START].to_i + add_start
+    fields_array[ENDING] = fields_array[ENDING].to_i + add_stop
     @all_record << fields_array
   end
 
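The new `add_start` and `add_stop` arguments shift a record's coordinates as it is stored. The diff does not say why, but a plausible reason is coordinate-system conversion: UCSC tables and BED files use 0-based starts with exclusive ends, so adding 1 to the start (the `1, 0` passed for most UCSC files in grdbmanager.rb) yields 1-based inclusive coordinates. The sketch below uses a stand-in class, `MiniDataset`, which is not the gem's `Dataset`; its `START = 1` and `ENDING = 2` values are assumed from the column comment on `add_record`.

```ruby
# Stand-in for illustration only; the gem's real Dataset class does more.
class MiniDataset
  START = 1   # assumed column index, per the "1 -> start" comment
  ENDING = 2  # assumed column index, per the "2 -> end" comment

  attr_reader :all_record

  def initialize
    @all_record = []
  end

  # Converts start/end to integers and applies the optional offsets.
  def add_record(fields_array, add_start = 0, add_stop = 0)
    fields_array[START] = fields_array[START].to_i + add_start
    fields_array[ENDING] = fields_array[ENDING].to_i + add_stop
    @all_record << fields_array
  end
end

dataset = MiniDataset.new
# A 0-based, half-open interval [99, 200) becomes the 1-based interval 100..200.
dataset.add_record(%w[chr1 99 200 featureA], 1, 0)
p dataset.all_record
```

Because both offsets default to 0, callers that omit them keep the pre-0.1.2 behavior.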
data/lib/anncrsnp/parsers/ucscparser.rb
CHANGED
@@ -1,24 +1,24 @@
 require 'dataset'
 
-def parseUCSCformat(file, header, skip_first_col = TRUE)
+def parseUCSCformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
   dataset = Dataset.new(header)
   File.open(file).each do |line|
     line.chomp!
     fields = line.split("\t")
     bin_signal = fields.shift if skip_first_col
-    dataset.add_record(fields)
+    dataset.add_record(fields, add_start, add_stop)
   end
   return dataset
 end
 
-def parseUCSCrefseqformat(file, header, skip_first_col = TRUE)
+def parseUCSCrefseqformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
   dataset = Dataset.new(header)
   File.open(file).each do |line|
     line.chomp!
     fields = line.split("\t")
     bin_signal = fields.shift if skip_first_col
     fields = [fields[1], fields[3], fields[4], fields[11], fields[0], fields[2], fields[5], fields[6], fields[7], fields[8], fields[9], fields[10], fields[12], fields[13], fields[14]]
-    dataset.add_record(fields)
+    dataset.add_record(fields, add_start, add_stop)
   end
   return dataset
 end
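The role of `skip_first_col` can be shown with in-memory input. In the sketch below, `parse_ucsc_lines` is a hypothetical stand-in for `parseUCSCformat`: it returns plain arrays instead of a `Dataset` and reads from any IO object, so it runs without the gem or data files. It uses lowercase `true`/`false`, since the `TRUE`/`FALSE` constants in the diff have been deprecated since Ruby 2.4 and are not defined in current Ruby releases.

```ruby
require 'stringio'

# Hypothetical stand-in for the gem's parser: reads tab-separated lines,
# optionally drops the leading UCSC "bin" column, and shifts coordinates.
def parse_ucsc_lines(io, skip_first_col = true, add_start = 0, add_stop = 0)
  io.each_line.map do |line|
    fields = line.chomp.split("\t")
    fields.shift if skip_first_col # discard the bin column when present
    fields[1] = fields[1].to_i + add_start
    fields[2] = fields[2].to_i + add_stop
    fields
  end
end

with_bin = StringIO.new("585\tchr1\t10\t20\trs1\n")
without_bin = StringIO.new("chr1\t10\t20\tpeak1\n")

p parse_ucsc_lines(with_bin, true, 1, 0)
p parse_ucsc_lines(without_bin, false, 1, 0)
```

Passing the wrong flag for a file misaligns every column: with the bin column left in place, `chr` would hold the bin number and the start and end would be read from the wrong fields.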
data/lib/anncrsnp/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: anncrsnp
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 platform: ruby
 authors:
 - Elena Rojano
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-07-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler