RubyGems - anncrsnp - Versions diffs - 0.1.1 → 0.1.2 - Mend

anncrsnp 0.1.1 → 0.1.2

Files changed (8)

checksums.yaml +4 -4
data/README.md +33 -16
data/bin/grdbmanager.rb +33 -13
data/bin/statistics.rb +20 -17
data/lib/anncrsnp/dataset.rb +3 -3
data/lib/anncrsnp/parsers/ucscparser.rb +4 -4
data/lib/anncrsnp/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 7423aa26bc862aad466da47e83250e5611c62d48
-  data.tar.gz: 5cbd0c681a2cabc6ab31f01219beccbd403bccee
+  metadata.gz: b49a4c171dac690e9c33215c3ad7ce265583feba
+  data.tar.gz: 864cc1ee837cc96db41c7940ba2c048f7e4f6c34
 SHA512:
-  metadata.gz: 3c9e3d060064ad8bfe9904f3e9469c30451ab1dad5f8d9127f4d37cb84c3ec2d8fe7fc857b76e77e638de55d81a442272ccae69ddddb6131fc5826a708a53f10
-  data.tar.gz: d2c4a7a86c15ec915ddbbc5e53c1da23e632e6afe34eabc495a37bf9294d29ddc792b1dffbbef65c9cab37927193318f6decc75fb030dcb329df0641ed2fb4d7
+  metadata.gz: edf8ff3d825c2b12d4b232eb8a206860d94280b0a3e84fde97bd36481f9af5de210ce10d334cacf1f5baa3ccb24517fb51833c95527dbf57f0a14a21fae0879d
+  data.tar.gz: c4cc14d9a5ed2af88bb618404348fc8c8a559c665b06e2320f1011e14dc947a862fad8187171d52a69b2e9512cf0f9954d839a7835cb62009725fea27e636d6d

data/README.md CHANGED

@@ -1,38 +1,55 @@
 # Anncrsnp
-Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/anncrsnp`. To experiment with that code, run `bin/console` for an interactive prompt.
+AnNCR-SNP integrates data from various sources so that the user can obtain annotation for investigating the potential effects of variation in non-coding regions of the human genome. AnNCR-SNP consists of a database containing data on all non-coding elements and two main programs: manager and finder. The manager program creates a local database on the user's computer, and the finder program queries the local database and returns a table of results. A prebuilt local database is downloaded the first time the finder program is used. To build the local database with custom information, use the manager program.
-TODO: Delete this and the text above, and describe your gem
+The user can mine the local database, searching for information about SNPs that overlap with various genomic features suggestive of regulatory activity, such as TFBSs, open chromatin, histone modifications, methylation sites and enhancers. These genomic features were obtained from a number of different projects and data sources (ENCODE, FANTOM5 and DENdb, amongst others). SNP information comes from dbSNP, gene information from RefSeq and conserved regions from 46WayCons.
+If you use this tool, please cite us: Rojano E, Ranea JA, Perkins JR. Characterisation of non-coding genetic variation in histamine receptors using AnNCR-SNP. Amino Acids. 2016 Jun 6. DOI: 10.1007/s00726-016-2265-5.
 ## Installation
-Add this line to your application's Gemfile:
+Install the package directly as:
-```ruby
-gem 'anncrsnp'
-```
+    $ gem install anncrsnp
-And then execute:
+## Usage
-    $ bundle
+### Finder
-Or install it yourself as:
+The user can query the local database using a list of SNPs or genomic coordinates. When the user runs the first query with the finder program (grdbfinder.rb), AnNCR-SNP downloads the database (by default, into the directory where the Ruby gem is installed) and then runs the search.
-    $ gem install anncrsnp
+For example:
-## Usage
+    $ grdbfinder.rb -n rs2470893,rs12049351 -g snpDbSnp -F 200 -o output_file -f txt
+Where:
+```
+-n: SNP identifier(s) to be queried. The user can also give gene identifiers (RefSeq gene symbols), or use the -c flag instead of -n to search by coordinates in the following format: chr:start:stop (example: chr3:11128779:11178779).
+-g: when set to 'snpDbSnp', the script generates a tabular file listing each SNP found that overlaps with some feature of interest (regulatory element, gene, etc).
+-F: length (in nucleotides) of the flanking region added upstream and downstream of each SNP, gene or coordinate queried. Use it to widen the search range.
+-o: output file name.
+-f: output file format. Supported formats: .txt, .html.
+```
+Optional flags:
+```
+-r: for a graphical representation. Format .gff3
+-p: path to a custom database. Use it when the database was built with the manager program and you do not want to use the default database.
+```
-TODO: Write usage instructions here
+Note: if this is the first query you perform, the program will download the database. This can take a while, depending on your Internet connection. Database size: 1.5GB.
-## Development
+The AnNCR-SNP finder also accepts a file with coordinates (use flag -c) or a file with a list of SNPs or genes to search for (use flag -n). The file must contain one element per line.
-After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+### Manager
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+Only used if the user wants to build a local database with custom information. (Under construction.)
 ## Contributing
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/anncrsnp.
+Bug reports and pull requests are welcome. Please contact the developer of the anncrsnp Ruby gem (elenarojano at uma.es).
 ## License
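
Reviewer note: the new README describes the `chr:start:stop` coordinate format and the `-F` flanking length only in prose. The sketch below shows one way such a string could be parsed and widened. `Region` and `parse_region` are hypothetical names and are not part of anncrsnp; clamping the start at 1 is also an assumption, since the README does not say how the finder handles flanks that run past the chromosome start.

```ruby
# Hypothetical helper, not taken from the anncrsnp sources.
Region = Struct.new(:chr, :start, :stop)

# Parse "chr:start:stop" and widen the interval by `flank` nucleotides
# on each side, mirroring what the README describes for -c and -F.
def parse_region(text, flank = 0)
  parts = text.strip.split(':')
  unless parts.length == 3 && parts.none?(&:empty?)
    raise ArgumentError, "expected chr:start:stop, got #{text.inspect}"
  end

  chr = parts[0]
  start = Integer(parts[1], 10) # base 10, so leading zeros are not read as octal
  stop = Integer(parts[2], 10)
  raise ArgumentError, 'start must not exceed stop' if start > stop

  # Assumption: never return a coordinate below 1.
  Region.new(chr, [start - flank, 1].max, stop + flank)
end

region = parse_region('chr3:11128779:11178779', 200)
puts "#{region.chr}:#{region.start}:#{region.stop}"
```

With the README's example coordinates and `-F 200`, the widened interval is `chr3:11128579:11178979`.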

data/bin/grdbmanager.rb CHANGED

@@ -43,50 +43,51 @@ if File.exist?(options[:data])
 		current_file = File.basename(file)
 ### Definitive sources
 #If bin field from UCSC doesn't exist, put FALSE as input data to parseUCSCformat method
+#OMIT THE FIRST 4 COLUMNS FROM THE HEADER
 		if current_file == "wgEncodeAwgDnaseMasterSites.bed"
 			header = [:score, :floatScore, :sourceCount, :sourceIds]
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
 			current_dataset.numeric_filter(:sourceCount, 2)
 			current_dataset.drop_columns(header)
 			current_dataset.add_metadata(:classification, 'DNAseHS')
 			all_data['dnaseData'] = current_dataset
 		elsif current_file == "wgEncodeHaibMethyl450Ag04449SitesRep1.bed"
 			header = [:score, :strand, :thickStart, :thickEnd, :itemRgb]
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
 			current_dataset.drop_columns(header)
-			current_dataset.add_metadata(:classification, 'Metilation_sites')
-			all_data['metilationData'] = current_dataset
+			current_dataset.add_metadata(:classification, 'Methylation_sites')
+			all_data['methylationData'] = current_dataset
 		elsif current_file == "snp144Common.txt" # current_file == "test.txt"
 			header = [:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :class, :valid, :avHet, :avHetSE, :func, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields]
-			current_dataset = parseUCSCformat(file, header)
+			current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
 			current_dataset.drop_columns([:score, :strand, :refNCBI, :refUCSC, :observed, :molType, :valid, :avHet, :avHetSE, :locType, :weight, :exceptions, :submitterCount, :submitters, :alleleFreqCount, :alleles, :alleleNs, :alleleFreqs, :bitfields])
 			current_dataset.add_metadata(:classification, 'SNP')
 			all_data['snpDbSnp'] = current_dataset
 		elsif current_file == "refGene.txt"
 			header = [:name, :strand, :cdsStart, :cdsEnd, :exonCount, :exonStarts, :exonEnds, :score, :cdsStartStat, :cdsEndStat, :exonFrames]
-			current_dataset = parseUCSCrefseqformat(file, header)
+			current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
 			current_dataset.drop_columns(header)
 			current_dataset.add_metadata(:classification, 'gene')
 			all_data['gene'] = current_dataset
 		elsif current_file == "TFBSMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
 			header = []
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
 			current_dataset.add_metadata(:classification, 'TFBS')
 			all_data['tfbs'] = current_dataset
 		elsif current_file == "HistoneModMasterSites.txt" #Must be generated with "masterfeatures.rb tfbs/files.txt antibody import_data/TFBSMasterSites.txt tfbs/"
 			header = []
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 1, 0)
 			current_dataset.add_metadata(:classification, 'HistoneModification')
 			all_data['HistoneModification'] = current_dataset
 		elsif current_file == "46waycons.txt"
 			header = [:span, :count, :offset, :file, :lowerLimit, :dataRange, :validCount, :sumData, :sumSquares]
-			current_dataset = parseUCSCformat(file, header)
+			current_dataset = parseUCSCformat(file, header, TRUE, 1, 0)
 			current_dataset.drop_columns(header)
 			current_dataset.add_metadata(:classification, 'ConservedRegions')
 			all_data['ConservedRegions'] = current_dataset
 		elsif current_file == "enhancer_tss_associations.bed"
 			header = [:score, :strand, :enh_start, :enh_stop, :array, :index, :val1, :val2]
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
 			current_dataset.drop_columns(header)
 			current_dataset.add_metadata(:classification, 'Enhancers')
 			all_data['Enhancers'] = current_dataset
@@ -98,10 +99,29 @@ if File.exist?(options[:data])
 			all_data['DENdbEnhancers'] = current_dataset
 		elsif current_file == "all_hg19_bed.bed"
 			header = [:counter]
-			current_dataset = parseUCSCformat(file, header, FALSE)
+			current_dataset = parseUCSCformat(file, header, FALSE, 0, 0)
 			current_dataset.drop_columns(header)
 			current_dataset.add_metadata(:classification, 'SuperEnhancers')
-			all_data['SuperEnhancers'] = current_dataset
+			all_data['SuperEnhancers'] = current_dataset
+		elsif current_file == "oreganno_tfbs.txt"
+			header = []
+			current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
+			current_dataset.drop_columns(header)
+			current_dataset.add_metadata(:classification, 'ORegAnnoTFBS')
+			all_data['ORegAnnoTFBS'] = current_dataset
+		elsif current_file == "oreganno_regulatory.txt"
+			header = []
+			current_dataset = parseUCSCformat(file, header, TRUE, 0, 0)
+			current_dataset.drop_columns(header)
+			current_dataset.add_metadata(:classification, 'ORegAnnoRegulatoryElements')
+			all_data['ORegAnnoRegulatoryElements'] = current_dataset
+		#UNCOMMENT TO INCLUDE GTEX DATA FROM UCSC. THIS FILE DOESN'T HAVE A BIN FIELD!!
+		# elsif current_file == "gtexGene.txt"
+		# 	header = [:name, :score, :strand, :geneId, :geneType, :expCount, :expScores]
+		# 	current_dataset = parseUCSCrefseqformat(file, header, TRUE, 1, 0)
+		# 	current_dataset.drop_columns(header)
+		# 	current_dataset.add_metadata(:classification, 'GTEx')
+		# 	all_data['GTEx'] = current_dataset
 		end
 	end
 end
@@ -223,4 +243,4 @@ if options[:create_sql]
 end
 DB.execute("CREATE INDEX name_index ON GenomicRange (name)")
 DB.execute("CREATE INDEX bin_index ON GenomicRange (bin)")
-DB.close
+DB.close
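
Reviewer note: every branch in the chain above now passes the same five pieces of information (file name, whether to skip the bin column, start offset, stop offset, classification). A table-driven lookup is one possible way to keep those settings in one place. This is only a sketch and not the author's code: `SOURCES` and `source_settings` are hypothetical names, and only three of the files from the diff are listed, with values copied from the changed lines.

```ruby
# Hypothetical settings table; values mirror the arguments visible in the diff.
SOURCES = {
  'wgEncodeAwgDnaseMasterSites.bed' => {
    key: 'dnaseData', classification: 'DNAseHS',
    skip_bin: false, add_start: 1, add_stop: 0
  },
  'snp144Common.txt' => {
    key: 'snpDbSnp', classification: 'SNP',
    skip_bin: true, add_start: 1, add_stop: 0
  },
  'enhancer_tss_associations.bed' => {
    key: 'Enhancers', classification: 'Enhancers',
    skip_bin: false, add_start: 0, add_stop: 0
  }
}.freeze

# Return the settings for a path, or nil when the file is not a known source.
def source_settings(path)
  SOURCES[File.basename(path)]
end

settings = source_settings('/import_data/snp144Common.txt')
puts settings[:classification] if settings
```

A caller could then pass `settings[:skip_bin]`, `settings[:add_start]` and `settings[:add_stop]` to the parser instead of repeating literal arguments in each branch.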

data/bin/statistics.rb CHANGED

@@ -19,11 +19,13 @@ def load_snp_data(input_file, fields_length)
 							"HistoneModification" => [],
 							"tfbs" => [],
 							"dnaseData" => [],
-							"metilationData" => [],
+							"methylationData" => [],
 							"ConservedRegions" => [],
 							"Enhancers" => [],
 							"DENdbEnhancers" => [],
-							"SuperEnhancers" => []
+							"SuperEnhancers" => [],
+							"ORegAnnoTFBS" => [],
+							"ORegAnnoRegulatoryElements" => []
 						}
 			categories.each do |category_name, category_value|
 				column_position = index[category_name]
@@ -46,11 +48,13 @@ def snp_calculate_stats(snp_storage)
 							"HistoneModification" => 0,
 							"tfbs" => 0,
 							"dnaseData" => 0,
-							"metilationData" => 0,
+							"methylationData" => 0,
 							"ConservedRegions" => 0,
 							"Enhancers" => 0,
 							"DENdbEnhancers" => 0,
-							"SuperEnhancers" => 0
+							"SuperEnhancers" => 0,
+							"ORegAnnoTFBS" => 0,
+							"ORegAnnoRegulatoryElements" => 0
 						}
 	snp_storage.each do |snp_name, annotations|
 		annotations.each do |annotation_category, annotation_value|
@@ -70,9 +74,7 @@ end
 def create_histogram(snp_percentage, name)
 	# create Histogram
 	p=ScbiPlot::Histogram.new(name,'SNPs genomic region annotations')
 	# add x axis data
 	p.add_x(snp_percentage.keys)
 	puts snp_percentage.keys.inspect
 	# add y axis data
@@ -88,11 +90,13 @@ def snp_calculate_stats_with_reference(snp_storage, snp_storage_reference)
 							"HistoneModification" => 0,
 							"tfbs" => 0,
 							"dnaseData" => 0,
-							"metilationData" => 0,
+							"methylationData" => 0,
 							"ConservedRegions" => 0,
 							"Enhancers" => 0,
 							"DENdbEnhancers" => 0,
-							"SuperEnhancers" => 0
+							"SuperEnhancers" => 0,
+							"ORegAnnoTFBS" => 0,
+							"ORegAnnoRegulatoryElements" => 0
 						}
 	snp_storage_reference.each do |snp_name_ref, annotations_ref|
@@ -129,7 +133,7 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
 		if !(annotation_value_ref & annotation_value).empty? || annotation_value.length >= 5
 			result= true
 		end
-	elsif annotation_category_ref == 'metilationData' &&
+	elsif annotation_category_ref == 'methylationData' &&
 		!annotation_value.empty?
 		result = true
 	elsif annotation_category_ref == 'HistoneModification'
@@ -162,17 +166,18 @@ def annotation_comparison(annotation_value_ref, annotation_value, annotation_cat
 	elsif annotation_category_ref == 'SuperEnhancers' &&
 		!annotation_value.empty?
 		result = true
+	elsif annotation_category_ref == 'ORegAnnoTFBS' &&
+		!annotation_value.empty?
+		result = true
+	elsif annotation_category_ref == 'ORegAnnoRegulatoryElements' &&
+		!annotation_value.empty?
+		result = true
 	end
 	return result
 end
 #MAIN
 #----------
-#REMEMBER: this program performs statistical analyses and compares results for two given files.
-#In our case, we compare the data produced by our program with the data obtained experimentally.
-#our data = ARGV[0], experimental data = ARGV[1]
-#if no second input argument is given, the analysis is run on the program's own output
 fields_length = 5
 fields_length = ARGV[2].to_i if !ARGV[2].nil?
@@ -184,10 +189,8 @@ else
 	snp_percentage = snp_calculate_stats(snp_storage)
 end
 snp_percentage.each do |category_name, percentage|
-	puts "#{category_name}\t#{percentage}\t#{ARGV[3]}"
+	puts "#{category_name.capitalize}\t#{percentage}\t#{ARGV[3]}"
 end
-#The plot file will appear in the directory where the script is run
 # file_name = File.basename(ARGV[0], ".txt")
 # graph_name = file_name + ".png"
 # create_histogram(snp_percentage, graph_name)
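
Reviewer note: the rename to `methylationData` and the two new ORegAnno keys had to be repeated in three separate hash literals. A shared list of category names is one way to avoid that. This is a sketch only: `ANNOTATION_CATEGORIES`, `empty_counts` and `empty_lists` are hypothetical names, and the list holds only the keys visible in the hunks above, so the real hashes may contain more.

```ruby
# Category names copied from the visible hunks; the full hashes may hold more keys.
ANNOTATION_CATEGORIES = %w[
  HistoneModification tfbs dnaseData methylationData ConservedRegions
  Enhancers DENdbEnhancers SuperEnhancers ORegAnnoTFBS ORegAnnoRegulatoryElements
].freeze

# Fresh counter hash, one zero per category.
def empty_counts
  ANNOTATION_CATEGORIES.each_with_object({}) { |name, counts| counts[name] = 0 }
end

# Fresh storage hash; each category gets its own array so entries are not shared.
def empty_lists
  ANNOTATION_CATEGORIES.each_with_object({}) { |name, lists| lists[name] = [] }
end

counts = empty_counts
counts['methylationData'] += 1
puts counts['methylationData']
```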

data/lib/anncrsnp/dataset.rb CHANGED

@@ -10,9 +10,9 @@ class Dataset
 		add_metadata(:header, [:chr, :start, :ending, :id].concat(header))
 	end
-	def add_record(fields_array) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
-		fields_array[START] = fields_array[START].to_i
-		fields_array[ENDING] = fields_array[ENDING].to_i
+	def add_record(fields_array, add_start = 0, add_stop = 0) # Fixed col => 0 -> chr, 1 -> start, 2 -> end, 3 -> id
+		fields_array[START] = fields_array[START].to_i + add_start
+		fields_array[ENDING] = fields_array[ENDING].to_i + add_stop
 		@all_record << fields_array
 	end
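
Reviewer note: the new `add_start` and `add_stop` arguments shift the coordinates as each record is stored. UCSC tables and BED files use 0-based starts, so passing `add_start = 1` looks like a conversion to 1-based coordinates, although the diff does not state that intent. The stand-in class below reproduces the changed method so the effect can be tried in isolation. `MiniDataset` is a hypothetical name, and the `START` and `ENDING` values are assumed from the "Fixed col" comment.

```ruby
# Minimal stand-in for the gem's Dataset class (hypothetical, for illustration).
class MiniDataset
  START = 1  # assumed index of the start column
  ENDING = 2 # assumed index of the end column

  attr_reader :records

  def initialize
    @records = []
  end

  # Same logic as the changed add_record: convert to integers, then shift.
  def add_record(fields, add_start = 0, add_stop = 0)
    fields[START] = fields[START].to_i + add_start
    fields[ENDING] = fields[ENDING].to_i + add_stop
    @records << fields
  end
end

dataset = MiniDataset.new
dataset.add_record(%w[chr1 99 200 siteA], 1, 0)
p dataset.records.first
```

Here the stored record is `["chr1", 100, 200, "siteA"]`: the start moves from 99 to 100 and the end is unchanged.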

data/lib/anncrsnp/parsers/ucscparser.rb CHANGED

@@ -1,24 +1,24 @@
 require 'dataset'
-def parseUCSCformat(file, header, skip_first_col = TRUE)
+def parseUCSCformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
 	dataset = Dataset.new(header)
 	File.open(file).each do |line|
 		line.chomp!
 		fields = line.split("\t")
 		bin_signal = fields.shift if skip_first_col
-		dataset.add_record(fields)
+		dataset.add_record(fields, add_start, add_stop)
 	end
 	return dataset
 end
-def parseUCSCrefseqformat(file, header, skip_first_col = TRUE)
+def parseUCSCrefseqformat(file, header, skip_first_col = TRUE, add_start = 0, add_stop = 0)
 	dataset = Dataset.new(header)
 	File.open(file).each do |line|
 		line.chomp!
 		fields = line.split("\t")
 		bin_signal = fields.shift if skip_first_col
 		fields = [fields[1], fields[3], fields[4], fields[11], fields[0], fields[2], fields[5], fields[6], fields[7], fields[8], fields[9], fields[10], fields[12], fields[13], fields[14]]
-		dataset.add_record(fields)
+		dataset.add_record(fields, add_start, add_stop)
 	end
 	return dataset
 end
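
Reviewer note: both parsers now forward the offsets to `add_record` after optionally dropping the leading bin column. The sketch below follows the same steps on an in-memory sample so the behaviour can be checked without a UCSC download. `parse_ucsc_lines` is a hypothetical name, the sample row is made up, and the function returns plain arrays instead of a `Dataset`.

```ruby
require 'stringio'

# Hypothetical re-creation of the parsing steps shown in the diff.
# Accepts any IO-like object and returns an array of field arrays.
def parse_ucsc_lines(io, skip_first_col = true, add_start = 0, add_stop = 0)
  io.each_line.map do |line|
    fields = line.chomp.split("\t")
    fields.shift if skip_first_col       # drop the UCSC bin column
    fields[1] = fields[1].to_i + add_start
    fields[2] = fields[2].to_i + add_stop
    fields
  end
end

# Made-up sample row: bin, chr, start, end, id.
sample = StringIO.new("585\tchr1\t10000\t10500\trs0001\n")
rows = parse_ucsc_lines(sample, true, 1, 0)
p rows.first
```

For real files, passing a block to `File.foreach(path)` reads line by line and closes the file when iteration finishes, whereas the `File.open(file).each` form in the diff leaves closing to the garbage collector.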

data/lib/anncrsnp/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Anncrsnp
-  VERSION = "0.1.1"
+  VERSION = "0.1.2"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: anncrsnp
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.2
 platform: ruby
 authors:
 - Elena Rojano
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-01-25 00:00:00.000000000 Z
+date: 2016-07-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler