anncrsnp 0.1.1 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 7423aa26bc862aad466da47e83250e5611c62d48
4
- data.tar.gz: 5cbd0c681a2cabc6ab31f01219beccbd403bccee
3
+ metadata.gz: b49a4c171dac690e9c33215c3ad7ce265583feba
4
+ data.tar.gz: 864cc1ee837cc96db41c7940ba2c048f7e4f6c34
5
5
  SHA512:
6
- metadata.gz: 3c9e3d060064ad8bfe9904f3e9469c30451ab1dad5f8d9127f4d37cb84c3ec2d8fe7fc857b76e77e638de55d81a442272ccae69ddddb6131fc5826a708a53f10
7
- data.tar.gz: d2c4a7a86c15ec915ddbbc5e53c1da23e632e6afe34eabc495a37bf9294d29ddc792b1dffbbef65c9cab37927193318f6decc75fb030dcb329df0641ed2fb4d7
6
+ metadata.gz: edf8ff3d825c2b12d4b232eb8a206860d94280b0a3e84fde97bd36481f9af5de210ce10d334cacf1f5baa3ccb24517fb51833c95527dbf57f0a14a21fae0879d
7
+ data.tar.gz: c4cc14d9a5ed2af88bb618404348fc8c8a559c665b06e2320f1011e14dc947a862fad8187171d52a69b2e9512cf0f9954d839a7835cb62009725fea27e636d6d
data/README.md CHANGED
@@ -1,38 +1,55 @@
1
1
  # Anncrsnp
2
2
 
3
- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/anncrsnp`. To experiment with that code, run `bin/console` for an interactive prompt.
3
+ AnNCR-SNP integrates data from various sources so that the user can obtain annotation for investigating the potential effects of variation in non-coding regions of the human genome. AnNCR-SNP consists of a database containing data on all non-coding elements and two main programs: manager and finder. The manager program creates a local database on the user's computer, and the finder program queries the local database and returns a table of results. The local database is prebuilt and is downloaded the first time the finder program is used. To build the local database with custom information, the user has to use the manager program.
4
4
 
5
- TODO: Delete this and the text above, and describe your gem
5
+ The user can mine the local database, searching for information about SNPs that overlap with various genomic features suggestive of regulatory activity, such as TFBS, open chromatin, histone modifications, methylation sites and enhancers. These genomic features were obtained from a number of different projects and data sources (ENCODE, FANTOM5, DENdb, amongst others). SNP information comes from dbSNP, gene information from RefSeq and conserved regions from 46WayCons.
6
+
7
+ If you use this tool, please cite us: Rojano E, Ranea JA, Perkins JR. Characterisation of non-coding genetic variation in histamine receptors using AnNCR-SNP. Amino Acids. 2016 Jun 6. DOI: 10.1007/s00726-016-2265-5.
6
8
 
7
9
  ## Installation
8
10
 
9
- Add this line to your application's Gemfile:
11
+ Install the package directly as:
10
12
 
11
- ```ruby
12
- gem 'anncrsnp'
13
- ```
13
+ $ gem install anncrsnp
14
14
 
15
- And then execute:
15
+ ## Usage
16
16
 
17
- $ bundle
17
+ ### Finder
18
18
 
19
- Or install it yourself as:
19
+ The user can query the local database using a list of SNPs or genomic coordinates. When the user runs the first query with the finder program (grdbfinder.rb), AnNCR-SNP downloads the database (by default, into the directory where the Ruby gem is installed). It then runs the search.
20
20
 
21
- $ gem install anncrsnp
21
+ For example:
22
22
 
23
- ## Usage
23
+ $ grdbfinder.rb -n rs2470893,rs12049351 -g snpDbSnp -F 200 -o output_file -f txt
24
+
25
+ Where:
26
+
27
+ ```
28
+ -n: SNP identifier(s) to be queried. The user can also give gene identifiers (RefSeq gene symbols), or use the -c option instead of -n to search by coordinates in the format chr:start:stop (example: chr3:11128779:11178779).
29
+ -g: when set to 'snpDbSnp', the script generates a tabular file listing each SNP found that overlaps some feature of interest (regulatory element, gene, etc).
30
+ -F: flanking region length (in nucleotides) added upstream and downstream of each SNP, gene or coordinate queried. Used to widen the search range.
31
+ -o: output file name.
32
+ -f: output file format. Supported formats: .txt, .html.
33
+ ```
34
+
35
+ Optional flags:
36
+
37
+ ```
38
+ -r: for a graphical representation. Format .gff3
39
+ -p: path to a custom database. Use it when the database was created by the user with the manager program and the default database should not be used.
40
+ ```
24
41
 
25
- TODO: Write usage instructions here
42
+ Note: if this is the first query you perform, the program will download the database. This can take some time, depending on your Internet connection. Database size: 1.5GB.
26
43
 
27
- ## Development
44
+ The user can also give the AnNCR-SNP finder a file with coordinates (use flag -c) or a list of SNPs or genes to search for (use flag -n). The file must contain one element per line.
28
45
 
29
- After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
46
+ ### Manager
30
47
 
31
- To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
48
+ Only used if the user wants to build a local database with custom information. Under construction.
32
49
 
33
50
  ## Contributing
34
51
 
35
- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/anncrsnp.
52
+ Bug reports and pull requests are welcome. Please contact the anncrsnp gem developer (elenarojano at uma.es).
36
53
 
37
54
 
38
55
  ## License
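The README hunk above describes coordinate queries in the form chr:start:stop and a flanking length (-F) that widens the search range. A minimal sketch of that idea follows. `QueryRegion` and `parse_region` are hypothetical names introduced for illustration, and the clamping at 1 assumes 1-based coordinates; none of this is taken from grdbfinder.rb.

```ruby
# Hypothetical helper, not part of anncrsnp: parses a "chr:start:stop" query
# and widens it by a flanking length, following the README's description of
# the -c and -F options.
QueryRegion = Struct.new(:chr, :start, :stop)

def parse_region(text, flank = 0)
  chr, start, stop = text.strip.split(':')
  raise ArgumentError, "expected chr:start:stop, got #{text.inspect}" if stop.nil?

  start = Integer(start)
  stop = Integer(stop)
  raise ArgumentError, 'start must not exceed stop' if start > stop

  # Assumption: coordinates are 1-based, so the widened start is clamped at 1.
  QueryRegion.new(chr, [start - flank, 1].max, stop + flank)
end

region = parse_region('chr3:11128779:11178779', 200)
puts "#{region.chr}\t#{region.start}\t#{region.stop}"
```

With a flank of 200, the README's example coordinate becomes chr3, 11128579 to 11178979.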
data/bin/grdbmanager.rb CHANGED
@@ -43,50 +43,51 @@ if File.exist?(options[:data])
43
43
  current_file = File.basename(file)
44
44
  ### Definitive sources
45
45
  #If bin field from UCSC doesn't exist, put FALSE as input data to parseUCSCformat method
46
+ #OMIT IN HEADER THE 4 FIRST COLUMNS
46
47
  if current_file == "wgEncodeAwgDnaseMasterSites.bed"
47
48
  header = [:score, :floatScore, :sourceCount, :sourceIds]
48
- current_dataset = parseUCSCformat(file, header, FALSE)
49
+ current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
49
50
  current_dataset.numeric_filter(:sourceCount, 2)
50
51
  current_dataset.drop_columns(header)
51
52
  current_dataset.add_metadata(:classification, 'DNAseHS')
52
53
  all_data['dnaseData'] = current_dataset
53
54
  elsif current_file == "wgEncodeHaibMethyl450Ag04449SitesRep1.bed"
54
55
  header = [:score, :strand, :thickStart, :thickEnd, :itemRgb]
55
- current_dataset = parseUCSCformat(file, header, FALSE)
56
+ current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
56
57
  current_dataset.drop_columns(header)
57
- current_dataset.add_metadata(:classification, 'Metilation_sites')
58
- all_data['metilationData'] = current_dataset
58
+ current_dataset.add_metadata(:classification, 'Methylation_sites')
59
+ all_data['methylationData'] = current_dataset
59
60
  elsif current_file == "snp144Common.txt" # current_file == "test.txt"
60
61
  header = [:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :class, :valid, :avHet, :avHetSE, :func, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields]
61
- current_dataset = parseUCSCformat(file, header)
62
+ current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
62
63
  current_dataset.drop_columns([:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :valid, :avHet, :avHetSE, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields])
63
64
  current_dataset.add_metadata(:classification, 'SNP')
64
65
  all_data['snpDbSnp'] = current_dataset
65
66
  elsif current_file == "refGene.txt"
66
67
  header = [:name, :strand, :cdsStart, :cdsEnd, :exonCount, :exonStarts, :exonEnds, :score, :cdsStartStat, :cdsEndStat, :exonFrames]
67
- current_dataset = parseUCSCrefseqformat(file, header)
68
+ current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
68
69
  current_dataset.drop_columns(header)
69
70
  current_dataset.add_metadata(:classification, 'gene')
70
71
  all_data['gene'] = current_dataset
71
72
  elsif current_file == "TFBSMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
72
73
  header = []
73
- current_dataset = parseUCSCformat(file, header, FALSE)
74
+ current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
74
75
  current_dataset.add_metadata(:classification, 'TFBS')
75
76
  all_data['tfbs'] = current_dataset
76
77
  elsif current_file == "HistoneModMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
77
78
  header = []
78
- current_dataset = parseUCSCformat(file, header, FALSE)
79
+ current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
79
80
  current_dataset.add_metadata(:classification, 'HistoneModification')
80
81
  all_data['HistoneModification'] = current_dataset
81
82
  elsif current_file == "46waycons.txt"
82
83
  header = [:span, :count, :offset, :file, :lowerLimit, :dataRange, :validCount, :sumData, :sumSquares]
83
- current_dataset = parseUCSCformat(file, header)
84
+ current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
84
85
  current_dataset.drop_columns(header)
85
86
  current_dataset.add_metadata(:classification, 'ConservedRegions')
86
87
  all_data['ConservedRegions'] = current_dataset
87
88
  elsif current_file == "enhancer_tss_associations.bed"
88
89
  header = [:score, :strand, :enh_start, :enh_stop, :array, :index, :val1, :val2]
89
- current_dataset = parseUCSCformat(file, header, FALSE)
90
+ current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
90
91
  current_dataset.drop_columns(header)
91
92
  current_dataset.add_metadata(:classification, 'Enhancers')
92
93
  all_data['Enhancers'] = current_dataset
@@ -98,10 +99,29 @@ if File.exist?(options[:data])
98
99
  all_data['DENdbEnhancers'] = current_dataset
99
100
  elsif current_file == "all_hg19_bed.bed"
100
101
  header = [:counter]
101
- current_dataset = parseUCSCformat(file, header, FALSE)
102
+ current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
102
103
  current_dataset.drop_columns(header)
103
104
  current_dataset.add_metadata(:classification, 'SuperEnhancers')
104
- all_data['SuperEnhancers'] = current_dataset
105
+ all_data['SuperEnhancers'] = current_dataset
106
+ elsif current_file == "oreganno_tfbs.txt"
107
+ header = []
108
+ current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
109
+ current_dataset.drop_columns(header)
110
+ current_dataset.add_metadata(:classification, 'ORegAnnoTFBS')
111
+ all_data['ORegAnnoTFBS'] = current_dataset
112
+ elsif current_file == "oreganno_regulatory.txt"
113
+ header = []
114
+ current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
115
+ current_dataset.drop_columns(header)
116
+ current_dataset.add_metadata(:classification, 'ORegAnnoRegulatoryElements')
117
+ all_data['ORegAnnoRegulatoryElements'] = current_dataset
118
+ #UNCOMMENT FOR INCLUDE GTEX DATA FROM UCSC. THIS FILE DOESN'T HAVE BIN FIELD!!
119
+ # elsif current_file == "gtexGene.txt"
120
+ # header = [:name, :score, :strand, :geneId, :geneType, :expCount, :expScores]
121
+ # current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
122
+ # current_dataset.drop_columns(header)
123
+ # current_dataset.add_metadata(:classification, 'GTEx')
124
+ # all_data['GTEx'] = current_dataset
105
125
  end
106
126
  end
107
127
  end
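The hunk above passes two new trailing arguments to every parser call. Collected in one place, the visible settings look like the table below. The values are copied from the calls shown in the diff; the `PARSE_SETTINGS` constant itself is a hypothetical summary, not code from grdbmanager.rb, and the DENdb source is left out because its call falls between hunks.

```ruby
# Hypothetical summary table; values copied from the calls visible in the diff.
# Each entry: [skip_first_col, add_start, add_stop]
PARSE_SETTINGS = {
  'wgEncodeAwgDnaseMasterSites.bed' => [false, 1, 0],
  'wgEncodeHaibMethyl450Ag04449SitesRep1.bed' => [false, 1, 0],
  'snp144Common.txt' => [true, 1, 0],
  'refGene.txt' => [true, 1, 0],
  'TFBSMasterSites.txt' => [false, 1, 0],
  'HistoneModMasterSites.txt' => [false, 1, 0],
  '46waycons.txt' => [true, 1, 0],
  'enhancer_tss_associations.bed' => [false, 0, 0],
  'all_hg19_bed.bed' => [false, 0, 0],
  'oreganno_tfbs.txt' => [true, 0, 0],
  'oreganno_regulatory.txt' => [true, 0, 0]
}.freeze

# List the sources whose start coordinate is shifted on import.
shifted = PARSE_SETTINGS.select { |_file, (_skip, add_start, _add_stop)| add_start != 0 }.keys
puts shifted
```

Read this way, the UCSC and ENCODE derived tables get a start shift of 1, while the enhancer, super-enhancer and ORegAnno files are imported unchanged.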
@@ -223,4 +243,4 @@ if options[:create_sql]
223
243
  end
224
244
  DB.execute("CREATE INDEX name_index ON GenomicRange (name)")
225
245
  DB.execute("CREATE INDEX bin_index ON GenomicRange (bin)")
226
- DB.close
246
+ DB.close
data/bin/statistics.rb CHANGED
@@ -19,11 +19,13 @@ def load_snp_data(input_file, fields_length)
19
19
  "HistoneModification" => [],
20
20
  "tfbs" => [],
21
21
  "dnaseData" => [],
22
- "metilationData" => [],
22
+ "methylationData" => [],
23
23
  "ConservedRegions" => [],
24
24
  "Enhancers" => [],
25
25
  "DENdbEnhancers" => [],
26
- "SuperEnhancers" => []
26
+ "SuperEnhancers" => [],
27
+ "ORegAnnoTFBS" => [],
28
+ "ORegAnnoRegulatoryElements" => []
27
29
  }
28
30
  categories.each do |category_name, category_value|
29
31
  column_position = index[category_name]
@@ -46,11 +48,13 @@ def snp_calculate_stats(snp_storage)
46
48
  "HistoneModification" => 0,
47
49
  "tfbs" => 0,
48
50
  "dnaseData" => 0,
49
- "metilationData" => 0,
51
+ "methylationData" => 0,
50
52
  "ConservedRegions" => 0,
51
53
  "Enhancers" => 0,
52
54
  "DENdbEnhancers" => 0,
53
- "SuperEnhancers" => 0
55
+ "SuperEnhancers" => 0,
56
+ "ORegAnnoTFBS" => 0,
57
+ "ORegAnnoRegulatoryElements" => 0
54
58
  }
55
59
  snp_storage.each do |snp_name, annotations|
56
60
  annotations.each do |annotation_category, annotation_value|
@@ -70,9 +74,7 @@ end
70
74
  def create_histogram(snp_percentage, name)
71
75
  # create Histogram
72
76
  p=ScbiPlot::Histogram.new(name,'SNPs genomic region annotations')
73
-
74
77
  # add x axis data
75
-
76
78
  p.add_x(snp_percentage.keys)
77
79
  puts snp_percentage.keys.inspect
78
80
  # add y axis data
@@ -88,11 +90,13 @@ def snp_calculate_stats_with_reference(snp_storage, snp_storage_reference)
88
90
  "HistoneModification" => 0,
89
91
  "tfbs" => 0,
90
92
  "dnaseData" => 0,
91
- "metilationData" => 0,
93
+ "methylationData" => 0,
92
94
  "ConservedRegions" => 0,
93
95
  "Enhancers" => 0,
94
96
  "DENdbEnhancers" => 0,
95
- "SuperEnhancers" => 0
97
+ "SuperEnhancers" => 0,
98
+ "ORegAnnoTFBS" => 0,
99
+ "ORegAnnoRegulatoryElements" => 0
96
100
  }
97
101
 
98
102
  snp_storage_reference.each do |snp_name_ref, annotations_ref|
@@ -129,7 +133,7 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
129
133
  if !(annotation_value_ref & annotation_value).empty? || annotation_value.length >= 5
130
134
  result= true
131
135
  end
132
- elsif annotation_category_ref == 'metilationData' &&
136
+ elsif annotation_category_ref == 'methylationData' &&
133
137
  !annotation_value.empty?
134
138
  result = true
135
139
  elsif annotation_category_ref == 'HistoneModification'
@@ -162,17 +166,18 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
162
166
  elsif annotation_category_ref == 'SuperEnhancers' &&
163
167
  !annotation_value.empty?
164
168
  result = true
169
+ elsif annotation_category_ref == 'ORegAnnoTFBS' &&
170
+ !annotation_value.empty?
171
+ result = true
172
+ elsif annotation_category_ref == 'ORegAnnoRegulatoryElements' &&
173
+ !annotation_value.empty?
174
+ result = true
165
175
  end
166
176
  return result
167
177
  end
168
178
 
169
179
  #MAIN
170
180
  #----------
171
-
172
- #REMEMBER: this program performs statistical analyses and compares results for two given files.
173
- #In our case, we compare the data produced by our program with the data obtained experimentally.
174
- #our data = ARGV[0], experimental data = ARGV[1]
175
- #if no second input argument is given, the analysis is run on the program's own result
176
181
  fields_length = 5
177
182
  fields_length = ARGV[2].to_i if !ARGV[2].nil?
178
183
 
@@ -184,10 +189,8 @@ else
184
189
  snp_percentage = snp_calculate_stats(snp_storage)
185
190
  end
186
191
  snp_percentage.each do |category_name, percentage|
187
- puts "#{category_name}\t#{percentage}\t#{ARGV[3]}"
192
+ puts "#{category_name.capitalize}\t#{percentage}\t#{ARGV[3]}"
188
193
  end
189
-
190
- #The plot file will appear in the directory where the script is run
191
194
  # file_name = File.basename(ARGV[0], ".txt")
192
195
  # graph_name = file_name + ".png"
193
196
  # create_histogram(snp_percentage, graph_name)
@@ -10,9 +10,9 @@ class Dataset
10
10
  add_metadata(:header, [:chr, :start, :ending, :id].concat(header))
11
11
  end
12
12
 
13
- def add_record(fields_array) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
14
- fields_array[START] = fields_array[START].to_i
15
- fields_array[ENDING] = fields_array[ENDING].to_i
13
+ def add_record(fields_array, add_start = 0, add_stop = 0) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
14
+ fields_array[START] = fields_array[START].to_i + add_start
15
+ fields_array[ENDING] = fields_array[ENDING].to_i + add_stop
16
16
  @all_record << fields_array
17
17
  end
18
18
 
@@ -1,24 +1,24 @@
1
1
  require 'dataset'
2
2
 
3
- def parseUCSCformat(file, header, skip_first_col = TRUE)
3
+ def parseUCSCformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
4
4
  dataset = Dataset.new(header)
5
5
  File.open(file).each do |line|
6
6
  line.chomp!
7
7
  fields = line.split("\t")
8
8
  bin_signal = fields.shift if skip_first_col
9
- dataset.add_record(fields)
9
+ dataset.add_record(fields, add_start, add_stop)
10
10
  end
11
11
  return dataset
12
12
  end
13
13
 
14
- def parseUCSCrefseqformat(file, header, skip_first_col = TRUE)
14
+ def parseUCSCrefseqformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
15
15
  dataset = Dataset.new(header)
16
16
  File.open(file).each do |line|
17
17
  line.chomp!
18
18
  fields = line.split("\t")
19
19
  bin_signal = fields.shift if skip_first_col
20
20
  fields = [fields[1], fields[3], fields[4], fields[11], fields[0], fields[2], fields[5], fields[6], fields[7], fields[8], fields[9], fields[10], fields[12], fields[13], fields[14]]
21
- dataset.add_record(fields)
21
+ dataset.add_record(fields, add_start, add_stop)
22
22
  end
23
23
  return dataset
24
24
  end
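The two hunks above thread `add_start` and `add_stop` from the parsers into `Dataset#add_record`, where they are added to the start and end columns. UCSC tables and BED files store 0-based starts with exclusive ends, so a start offset of 1 presumably converts records to 1-based inclusive coordinates; that reading is an inference, not something the diff states. The sketch below reproduces the behaviour with a stand-in class. `SketchDataset` and `parse_ucsc_sketch` are hypothetical names, and the `START`/`ENDING` index values are assumed from the "Fixed col" comment.

```ruby
require 'stringio'

# Minimal stand-in for the gem's Dataset class, reduced to what the hunk shows.
# Assumption: START and ENDING are columns 1 and 2, per the "Fixed col" comment.
class SketchDataset
  START = 1
  ENDING = 2
  attr_reader :all_record

  def initialize
    @all_record = []
  end

  def add_record(fields_array, add_start = 0, add_stop = 0)
    fields_array[START] = fields_array[START].to_i + add_start
    fields_array[ENDING] = fields_array[ENDING].to_i + add_stop
    @all_record << fields_array
  end
end

# Same shape as parseUCSCformat, but reading from any IO so it runs without files.
def parse_ucsc_sketch(io, skip_first_col = true, add_start = 0, add_stop = 0)
  dataset = SketchDataset.new
  io.each_line do |line|
    fields = line.chomp.split("\t")
    fields.shift if skip_first_col # drop the leading UCSC bin column
    dataset.add_record(fields, add_start, add_stop)
  end
  dataset
end

bed = StringIO.new("chr1\t99\t200\tsite_a\n")
puts parse_ucsc_sketch(bed, false, 1, 0).all_record.first.inspect
```

Here a BED record starting at 99 is stored with start 100 and an unchanged end of 200.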
@@ -1,3 +1,3 @@
1
1
  module Anncrsnp
2
- VERSION = "0.1.1"
2
+ VERSION = "0.1.2"
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: anncrsnp
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.1.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Elena Rojano
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2016-01-25 00:00:00.000000000 Z
12
+ date: 2016-07-20 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bundler