rbbt 1.1.8 → 1.2.1

Files changed (47)
  1. data/README.rdoc +12 -12
  2. data/bin/rbbt_config +2 -3
  3. data/install_scripts/norm/Rakefile +4 -4
  4. data/install_scripts/organisms/{tair.Rakefile → Ath.Rakefile} +4 -3
  5. data/install_scripts/organisms/{cgd.Rakefile → Cal.Rakefile} +0 -0
  6. data/install_scripts/organisms/{worm.Rakefile → Cel.Rakefile} +0 -0
  7. data/install_scripts/organisms/{human.Rakefile → Hsa.Rakefile} +4 -8
  8. data/install_scripts/organisms/{mgi.Rakefile → Mmu.Rakefile} +0 -0
  9. data/install_scripts/organisms/{rgd.Rakefile → Rno.Rakefile} +0 -0
  10. data/install_scripts/organisms/{sgd.Rakefile → Sce.Rakefile} +0 -0
  11. data/install_scripts/organisms/{pombe.Rakefile → Spo.Rakefile} +0 -0
  12. data/install_scripts/organisms/rake-include.rb +15 -19
  13. data/lib/rbbt.rb +0 -3
  14. data/lib/rbbt/ner/rnorm.rb +2 -2
  15. data/lib/rbbt/sources/go.rb +48 -3
  16. data/lib/rbbt/sources/organism.rb +12 -17
  17. data/lib/rbbt/util/open.rb +27 -27
  18. data/lib/rbbt/util/tmpfile.rb +16 -0
  19. data/tasks/install.rake +1 -1
  20. data/test/rbbt/bow/test_bow.rb +33 -0
  21. data/test/rbbt/bow/test_classifier.rb +72 -0
  22. data/test/rbbt/bow/test_dictionary.rb +91 -0
  23. data/test/rbbt/ner/rnorm/test_cue_index.rb +57 -0
  24. data/test/rbbt/ner/rnorm/test_tokens.rb +70 -0
  25. data/test/rbbt/ner/test_abner.rb +17 -0
  26. data/test/rbbt/ner/test_banner.rb +17 -0
  27. data/test/rbbt/ner/test_dictionaryNER.rb +122 -0
  28. data/test/rbbt/ner/test_regexpNER.rb +33 -0
  29. data/test/rbbt/ner/test_rner.rb +126 -0
  30. data/test/rbbt/ner/test_rnorm.rb +47 -0
  31. data/test/rbbt/sources/test_biocreative.rb +38 -0
  32. data/test/rbbt/sources/test_biomart.rb +31 -0
  33. data/test/rbbt/sources/test_entrez.rb +49 -0
  34. data/test/rbbt/sources/test_go.rb +24 -0
  35. data/test/rbbt/sources/test_organism.rb +59 -0
  36. data/test/rbbt/sources/test_polysearch.rb +27 -0
  37. data/test/rbbt/sources/test_pubmed.rb +29 -0
  38. data/test/rbbt/util/test_arrayHash.rb +257 -0
  39. data/test/rbbt/util/test_filecache.rb +37 -0
  40. data/test/rbbt/util/test_index.rb +31 -0
  41. data/test/rbbt/util/test_misc.rb +20 -0
  42. data/test/rbbt/util/test_open.rb +97 -0
  43. data/test/rbbt/util/test_simpleDSL.rb +57 -0
  44. data/test/rbbt/util/test_tmpfile.rb +21 -0
  45. data/test/test_helper.rb +4 -0
  46. data/test/test_rbbt.rb +11 -0
  47. metadata +39 -12
data/README.rdoc CHANGED
@@ -57,14 +57,14 @@ Identifiers translation:: Translates gene identifiers between formats.
57
57
 
58
58
  Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:
59
59
 
60
- Candida albicans:: cgd
61
- Mus musculus:: mgi
62
- Rattus norvegicus:: rgd
63
- Saccharomyces cerevisiae:: sgd
64
- Arabidopsis thaliana:: tair
65
- Caenorhabditis elegans:: worm
66
- Homo sapiens:: human
67
- Schizosaccharomyces pombe:: pombe
60
+ Candida albicans:: Cal
61
+ Mus musculus:: Mmu
62
+ Rattus norvegicus:: Rno
63
+ Saccharomyces cerevisiae:: Sce
64
+ Arabidopsis thaliana:: Ata
65
+ Caenorhabditis elegans:: Cel
66
+ Homo sapiens:: Hsa
67
+ Schizosaccharomyces pombe:: Spo
68
68
 
69
69
 
70
70
  === Other
@@ -80,11 +80,11 @@ Install the gem normally <tt>gem install rbbt</tt>. The gem includes a configura
80
80
  === Using rbbt to translate identifiers
81
81
 
82
82
  1. Do <tt>rbbt_config prepare identifiers</tt> to do deploy the configuration files and download entrez data, this needs to be done just once.
83
- 3. Now you may do <tt>rbbt_config install organisms</tt> toprocess all the organisms, or <tt>rbbt_config install organisms -o sgd</tt> to process only yeast (sgd).
83
+ 3. Now you may do <tt>rbbt_config install organisms</tt> toprocess all the organisms, or <tt>rbbt_config install organisms -o Sce</tt> to process only yeast (Sce).
84
84
  4. You may now use a script like this to translate gene identifiers from yeast feed from the standard input
85
85
  require 'rbbt/sources/organism'
86
86
 
87
- index = Organism.id_index('sgd', :native => 'Entrez Gene Id')
87
+ index = Organism.id_index('Sce', :native => 'Entrez Gene Id')
88
88
 
89
89
  STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}
90
90
 
@@ -93,7 +93,7 @@ Install the gem normally <tt>gem install rbbt</tt>. The gem includes a configura
93
93
  First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:
94
94
 
95
95
  1. Install the Biocreative data used to train the model and compile the CRF++ plugin, <tt>rbbt_config prepare rner</tt>. You may need at this point to install ParseTree and ruby2ruby
96
- 2. Build the module for a particular organism <tt>rbbt_config install ner -o sgd</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
96
+ 2. Build the module for a particular organism <tt>rbbt_config install ner -o Sce</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
97
97
 
98
98
  Or, if you wan to use Abner or Banner:
99
99
 
@@ -108,7 +108,7 @@ You may now, for example, find mentions to genes in articles from a PubMed query
108
108
  # type = :banner
109
109
  type = :rner
110
110
 
111
- ner = Organism.ner('sgd', type )
111
+ ner = Organism.ner('Sce', type )
112
112
  pmids = PubMed.query(ARGV[0], 500)
113
113
 
114
114
  PubMed.get_article(pmids).each{|pmid,article|
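The keyword change in the README hunks above can be summarised as a lookup table. The following is a minimal sketch, not part of rbbt: `LEGACY_KEYWORDS` and `migrate_keyword` are hypothetical names introduced only for illustration. The README lists `Ata` for Arabidopsis thaliana, while the install script in the file list is named `Ath.Rakefile`; the sketch assumes the script name is the keyword actually in use.

```ruby
# Old-to-new organism keywords, read off the README hunk and the Rakefile
# renames in this diff. Both names below are hypothetical, not rbbt API.
LEGACY_KEYWORDS = {
  'cgd'   => 'Cal',  # Candida albicans
  'mgi'   => 'Mmu',  # Mus musculus
  'rgd'   => 'Rno',  # Rattus norvegicus
  'sgd'   => 'Sce',  # Saccharomyces cerevisiae
  'tair'  => 'Ath',  # Arabidopsis thaliana (assumed from Ath.Rakefile)
  'worm'  => 'Cel',  # Caenorhabditis elegans
  'human' => 'Hsa',  # Homo sapiens
  'pombe' => 'Spo'   # Schizosaccharomyces pombe
}.freeze

# Returns the new keyword for a legacy one; unknown keywords pass through.
def migrate_keyword(keyword)
  LEGACY_KEYWORDS.fetch(keyword, keyword)
end

puts migrate_keyword('sgd')
```

A script written against 1.1.8 could pass its organism keyword through such a helper before calling `Organism.id_index` or `Organism.ner`.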
data/bin/rbbt_config CHANGED
@@ -1,5 +1,7 @@
1
1
  #!/usr/bin/ruby
2
2
 
3
+ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
4
+
3
5
  require 'rubygems'
4
6
  require 'rake'
5
7
 
@@ -67,9 +69,6 @@ $USAGE =<<EOT
67
69
  descriptions, is not cleaned, as these are not likely to change
68
70
 
69
71
  * organisms: Show a list of all organisms along with their identifier in the system
70
-
71
-
72
-
73
72
  EOT
74
73
 
75
74
  class Controller < SimpleConsole::Controller
data/install_scripts/norm/Rakefile CHANGED
@@ -14,10 +14,10 @@ $docs = ENV['docs']
14
14
 
15
15
 
16
16
  $org2rbbt = {
17
- 'yeast' => 'sgd',
18
- 'mouse' => 'mgi',
19
- 'fly' => 'sgd',
20
- 'bc2gn' => 'human',
17
+ 'yeast' => 'Sce',
18
+ 'mouse' => 'Mmu',
19
+ 'fly' => 'Sce',
20
+ 'bc2gn' => 'Hsa',
21
21
  }
22
22
 
23
23
  def match(org, filedir, goldstandard,outfile)
data/install_scripts/organisms/{tair.Rakefile → Ath.Rakefile} RENAMED
@@ -21,9 +21,10 @@ $lexicon = {
21
21
 
22
22
  $identifiers = {
23
23
  :file => {
24
- :url => "ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20090313",
25
- :native => 0,
26
- :extra => [],
24
+ :url => "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2009-7-29.txt",
25
+ :native => 4,
26
+ :extra => [0],
27
+ :fields => ["Affymetrix"],
27
28
  },
28
29
  :biomart => {
29
30
  :database => 'athaliana_eg_gene',
data/install_scripts/organisms/{human.Rakefile → Hsa.Rakefile} RENAMED
@@ -86,7 +86,7 @@ Rake::Task['gene.go'].clear
86
86
  file 'gene.go' => ['identifiers'] do
87
87
  if File.exists? 'identifiers'
88
88
  require 'rbbt/sources/organism'
89
- index = Organism.id_index('human', :other => ['Associated Gene Name'])
89
+ index = Organism.id_index('Hsa', :other => ['Associated Gene Name'])
90
90
  data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude])
91
91
 
92
92
  data = data.collect{|code, value_lists|
@@ -96,9 +96,7 @@ file 'gene.go' => ['identifiers'] do
96
96
 
97
97
  Open.write('gene.go',
98
98
  data.collect{|p|
99
- p[1].uniq.collect{|go|
100
- "#{p[0]}\t#{go}"
101
- }.join("\n")
99
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
102
100
  }.join("\n")
103
101
  )
104
102
  end
@@ -117,9 +115,7 @@ file 'gene_go.pmid' => ['identifiers'] do
117
115
 
118
116
  Open.write('gene_go.pmid',
119
117
  data.collect{|p|
120
- p[1].uniq.collect{|pmid|
121
- "#{p[0]}\t#{pmid}"
122
- }.join("\n")
118
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
123
119
  }.join("\n")
124
120
  )
125
121
  end
@@ -132,7 +128,7 @@ file 'lexicon' => ['identifiers'] do
132
128
  require 'rbbt/sources/organism'
133
129
  HGNC_URL = 'http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?title=HGNC+output+data&hgnc_dbtag=on&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_name_aliases&col=gd_pub_acc_ids&status=Approved&status_opt=2&level=pri&=on&where=&order_by=gd_app_sym_sort&limit=&format=text&submit=submit&.cgifields=&.cgifields=level&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag'
134
130
  names = Open.to_hash(HGNC_URL, :exclude => proc{|l| l.match(/^HGNC ID/)}, :flatten => true)
135
- translations = Organism.id_index('human', :native => 'Entrez Gene ID', :other => ['HGNC ID'])
131
+ translations = Organism.id_index('Hsa', :native => 'Entrez Gene ID', :other => ['HGNC ID'])
136
132
 
137
133
  Open.write('lexicon',
138
134
  names.collect{|code, names|
data/install_scripts/organisms/rake-include.rb CHANGED
@@ -192,23 +192,18 @@ end
192
192
 
193
193
 
194
194
  file 'gene.go' do
195
- data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude], :fix => $go[:fix])
195
+ data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude], :fix => $go[:fix], :flatten => true)
196
196
 
197
- data = data.collect{|code, value_lists|
198
- [code, value_lists.flatten.select{|ref| ref =~ /GO:\d+/}.collect{|ref| ref.match(/(GO:\d+)/)[1]}]
199
- }.select{|p| p[1].any?}
197
+ Open.write('gene.go', data.collect { |gene, values|
198
+ goterms = values.select{|v| v =~ /GO:/}.collect{|v| v.match(/(GO:\d+)/)[1]}
199
+ goterms.empty? ? nil : "%s\t%s" % [gene, values.uniq.join("|")]
200
+ }.compact.join("\n"))
200
201
 
201
- Open.write('gene.go',
202
- data.collect{|p|
203
- p[1].uniq.collect{|go|
204
- "#{p[0]}\t#{go}"
205
- }.join("\n")
206
- }.join("\n")
207
- )
208
202
  end
209
203
 
204
+
210
205
  file 'gene_go.pmid' do
211
- data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:pmid], :exclude => $go[:exclude], :fix => $go[:fix])
206
+ data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:pmid], :exclude => $go[:exclude], :fix => $go[:fix], :flatten => true)
212
207
 
213
208
  data = data.collect{|code, value_lists|
214
209
  [code, value_lists.flatten.select{|ref| ref =~ /PMID:\d+/}.collect{|ref| ref.match(/PMID:(\d+)/)[1]}]
@@ -216,8 +211,9 @@ file 'gene_go.pmid' do
216
211
 
217
212
  Open.write('gene_go.pmid',
218
213
  data.collect{|p|
219
- p[1].uniq.collect{|pmid| "#{p[0]}\t#{pmid}" }.join("\n")
220
- }.join("\n")
214
+ next if p[1].empty?
215
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
216
+ }.compact.join("\n")
221
217
  )
222
218
  end
223
219
 
@@ -230,11 +226,9 @@ file 'gene.pmid' do
230
226
 
231
227
  Open.write('gene.pmid',
232
228
  data.collect{|code,pmids|
233
- next if translations && ! translations[code]
234
- code = translations[code].first if translations
235
- pmids.collect{|pmid|
236
- "#{ code }\t#{pmid}"
237
- }.compact.join("\n")
229
+ next if translations && ! translations[code]
230
+ code = translations[code].first if translations
231
+ "#{code}\t#{pmids.uniq.join("|")}"
238
232
  }.compact.join("\n")
239
233
  )
240
234
  rescue Entrez::NoFileError
@@ -256,3 +250,5 @@ task 'update' do
256
250
  Rake::Task['all'].invoke
257
251
  end
258
252
 
253
+ task 'default' => 'all'
254
+
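The Rakefile hunks above change the layout of `gene.go`, `gene_go.pmid` and `gene.pmid` from one line per gene/value pair to one line per gene with the values joined by `|`. A minimal sketch of the two layouts follows; `old_layout`, `new_layout` and the sample identifiers are made up for illustration and are not part of rbbt.

```ruby
# Sample data with invented identifiers: gene => list of associated values.
data = {
  'gene1' => ['GO:0000001', 'GO:0000002', 'GO:0000001'],
  'gene2' => []
}

# Layout written by 1.1.8: one "gene<TAB>value" line per unique value.
def old_layout(data)
  data.collect { |gene, values|
    values.uniq.collect { |v| "#{gene}\t#{v}" }
  }.flatten.join("\n")
end

# Layout written by 1.2.1: one line per gene, unique values joined by "|";
# genes without values are skipped.
def new_layout(data)
  data.collect { |gene, values|
    next if values.empty?
    "#{gene}\t#{values.uniq.join('|')}"
  }.compact.join("\n")
end

puts old_layout(data)
puts new_layout(data)
```

The new layout pairs with the `:sep2` option added to `Open.to_hash` in this release, which splits each field on `|` when the file is read back.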
data/lib/rbbt.rb CHANGED
@@ -1,6 +1,3 @@
1
- $:.unshift(File.dirname(__FILE__)) unless
2
- $:.include?(File.dirname(__FILE__)) || $:.include?(File.expand_path(File.dirname(__FILE__)))
3
-
4
1
  require 'fileutils'
5
2
  require 'yaml'
6
3
 
data/lib/rbbt/ner/rnorm.rb CHANGED
@@ -60,9 +60,9 @@ class Normalizer
60
60
  }
61
61
 
62
62
  # Get all at once, better performance
63
-
64
63
  genes = Entrez.get_gene(code2entrez.values)
65
- code2entrez_genes = code2entrez.collect{|p| [p[0], genes[p[1]]]}
64
+
65
+ code2entrez_genes = code2entrez.collect{|key, value| [key, genes[value]]}
66
66
 
67
67
  code2entrez_genes.collect{|p|
68
68
  [p[0], Entrez.gene_text_similarity(p[1], text)]
data/lib/rbbt/sources/go.rb CHANGED
@@ -4,7 +4,9 @@ require 'rbbt'
4
4
  # This module holds helper methods to deal with the Gene Ontology files. Right
5
5
  # now all it does is provide a translation form id to the actual names.
6
6
  module GO
7
+
7
8
  @@info = nil
9
+ MULTIPLE_VALUE_FIELDS = %w(is_a)
8
10
 
9
11
  # This method needs to be called before any translations can be made, it is
10
12
  # called automatically the first time the id2name method is called. It loads
@@ -20,10 +22,25 @@ module GO
20
22
  select{|l| l =~ /:/}.
21
23
  each{|l|
22
24
  key, value = l.chomp.match(/(.*?):(.*)/).values_at(1,2)
23
- term_info[key.strip] = value.strip
25
+ if MULTIPLE_VALUE_FIELDS.include? key.strip
26
+ term_info[key.strip] ||= []
27
+ term_info[key.strip] << value.strip
28
+ else
29
+ term_info[key.strip] = value.strip
30
+ end
24
31
  }
25
32
  @@info[term_info["id"]] = term_info
26
- }
33
+ }
34
+ end
35
+
36
+ def self.info
37
+ self.init unless @@info
38
+ @@info
39
+ end
40
+
41
+ def self.goterms
42
+ self.init unless @@info
43
+ @@info.keys
27
44
  end
28
45
 
29
46
  def self.id2name(id)
@@ -31,10 +48,38 @@ module GO
31
48
  if id.kind_of? Array
32
49
  @@info.values_at(*id).collect{|i| i['name'] if i}
33
50
  else
34
- return "Name not found" unless @@info[id]
51
+ return nil if @@info[id].nil?
35
52
  @@info[id]['name']
36
53
  end
37
54
  end
38
55
 
56
+ def self.id2ancestors(id)
57
+ self.init unless @@info
58
+ if id.kind_of? Array
59
+ @@info.values_at(*id).
60
+ select{|i| ! i['is_a'].nil?}.
61
+ collect{|i| i['is_a'].collect{|id|
62
+ id.match(/(GO:\d+)/)[1] if id.match(/(GO:\d+)/)
63
+ }.compact
64
+ }
65
+ else
66
+ return [] if @@info[id].nil? || @@info[id]['is_a'].nil?
67
+ @@info[id]['is_a'].
68
+ collect{|id|
69
+ id.match(/(GO:\d+)/)[1] if id.match(/(GO:\d+)/)
70
+ }.compact
71
+ end
72
+ end
73
+
74
+ def self.id2namespace(id)
75
+ self.init unless @@info
76
+ if id.kind_of? Array
77
+ @@info.values_at(*id).collect{|i| i['namespace'] if i}
78
+ else
79
+ return nil if @@info[id].nil?
80
+ @@info[id]['namespace']
81
+ end
82
+ end
83
+
39
84
 
40
85
  end
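The `go.rb` hunks above keep every `is_a` line of a term as an array and derive parent term ids from it. The following standalone sketch mirrors that logic outside the `GO` module; `parse_term`, `parents_of` and the sample stanza are assumptions made for illustration, and the stanza is not read from the real ontology file.

```ruby
# Fields that may appear several times in one term, as in the diffed module.
MULTIPLE_VALUE_FIELDS = %w(is_a)

# Parses the "key: value" lines of one term into a hash (hypothetical helper).
def parse_term(stanza)
  term_info = {}
  stanza.each_line do |l|
    next unless l =~ /:/
    key, value = l.chomp.match(/(.*?):(.*)/).values_at(1, 2)
    key = key.strip
    if MULTIPLE_VALUE_FIELDS.include?(key)
      (term_info[key] ||= []) << value.strip   # accumulate repeated fields
    else
      term_info[key] = value.strip             # last value wins otherwise
    end
  end
  term_info
end

# Extracts the GO ids named in the is_a lines (hypothetical helper).
def parents_of(term_info)
  (term_info['is_a'] || []).collect { |v| v[/GO:\d+/] }.compact
end

# Illustrative stanza in the "key: value" layout the parser expects.
stanza = <<~EOT
  id: GO:0000001
  name: mitochondrion inheritance
  namespace: biological_process
  is_a: GO:0048308 ! organelle inheritance
  is_a: GO:0048311 ! mitochondrion distribution
EOT

term = parse_term(stanza)
puts parents_of(term).inspect
```

As in `GO.id2ancestors`, only the direct `is_a` parents are returned; walking further up the ontology would need repeated lookups.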
data/lib/rbbt/sources/organism.rb CHANGED
@@ -93,13 +93,7 @@ module Organism
93
93
  # Returns a hash with the list of go terms for each gene id. Gene ids are in
94
94
  # Rbbt native format for that organism.
95
95
  def self.goterms(org)
96
- goterms = {}
97
- Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each_line{|l|
98
- gene, go = l.chomp.split(/\t/)
99
- goterms[gene.strip] ||= []
100
- goterms[gene.strip] << go.strip
101
- }
102
- goterms
96
+ Open.to_hash(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go"), :flatten => true)
103
97
  end
104
98
 
105
99
  # Return list of PubMed ids associated to the organism. Determined using a
@@ -209,33 +203,34 @@ module Organism
209
203
  pos
210
204
  end
211
205
 
212
- def self.id_index(org, option = {})
213
- native = option[:native]
214
- other = option[:other]
215
- option[:case_sensitive] = false if option[:case_sensitive].nil?
206
+ def self.id_index(org, options = {})
207
+ native = options[:native]
208
+ other = options[:other]
209
+ options[:case_sensitive] = false if options[:case_sensitive].nil?
216
210
 
217
211
  if native.nil? and other.nil?
218
- Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), option)
212
+ Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), options)
219
213
  else
220
214
  supported = Organism.supported_ids(org)
221
215
 
222
216
  first = nil
223
217
  if native
224
- first = id_position(supported,native,option)
218
+ first = id_position(supported,native,options)
225
219
  else
226
220
  first = 0
227
221
  end
228
222
 
229
223
  rest = nil
230
224
  if other
231
- rest = other.collect{|name| id_position(supported,name, option)}
225
+ rest = other.collect{|name| id_position(supported,name, options)}
232
226
  else
233
227
  rest = (0..supported.length - 1).to_a - [first]
234
228
  end
235
229
 
236
- option[:native] = first
237
- option[:extra] = rest
238
- index = Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), option)
230
+ options[:native] = first
231
+ options[:extra] = rest
232
+ options[:sep] = "\t"
233
+ index = Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), options)
239
234
 
240
235
  index
241
236
  end
data/lib/rbbt/util/open.rb CHANGED
@@ -171,16 +171,18 @@ module Open
171
171
  # * :native => position of the elements that will constitute the keys. By default 0.
172
172
  # * :extra => positions of the rest of elements. By default all but :native. It can be an array of positions or a single position.
173
173
  # * :sep => pattern to use in splitting the lines into elements, by default "\t"
174
+ # * :sep2 => pattern to use in splitting the elements into subelements, by default "|"
174
175
  # * :flatten => flatten the array of arrays that hold the values for each key into a simple array.
175
176
  # * :single => for each key select only the first of the values, instead of the complete array.
176
177
  # * :fix => A Proc that is called to pre-process the line
177
178
  # * :exclude => A Proc that is called to check if the line must be excluded from the process.
178
- def self.to_hash(filename, options = {})
179
+ def self.to_hash(input, options = {})
179
180
  native = options[:native] || 0
180
181
  extra = options[:extra]
181
182
  exclude = options[:exclude]
182
183
  fix = options[:fix]
183
184
  sep = options[:sep] || "\t"
185
+ sep2 = options[:sep2] || "|"
184
186
  single = options[:single]
185
187
  single = false if single.nil?
186
188
  flatten = options[:flatten] || single
@@ -188,8 +190,14 @@ module Open
188
190
 
189
191
  extra = [extra] if extra && ! extra.is_a?( Array)
190
192
 
193
+ if StringIO === input
194
+ content = input
195
+ else
196
+ content = Open.read(input)
197
+ end
198
+
191
199
  data = {}
192
- Open.read(filename).each_line{|l|
200
+ content.each_line{|l|
193
201
  l = fix.call(l) if fix
194
202
  next if exclude and exclude.call(l)
195
203
 
@@ -198,37 +206,29 @@ module Open
198
206
  next if id.nil? || id == ""
199
207
 
200
208
  data[id] ||= []
209
+
201
210
  if extra
202
- fields = extra
211
+ row_fields = row_fields.values_at(*extra)
203
212
  else
204
- fields = (0..(row_fields.length - 1)).to_a - [native]
213
+ row_fields.delete_at(native)
205
214
  end
206
- fields.each_with_index{|pos,i|
207
- data[id][i] ||= []
208
- data[id][i] << row_fields[pos]
209
- }
210
- }
211
215
 
212
- if flatten
213
- data.each{|key, values|
214
- if values
215
- values.flatten!
216
- values.collect!{|v|
217
- if v != ""
218
- v
219
- else
220
- nil
221
- end
222
- }
223
- values.compact!
224
- else
225
- nil
226
- end
227
- }
228
- end
216
+
217
+ if flatten
218
+ data[id] += row_fields.compact.collect{|v|
219
+ v.split(sep2)
220
+ }.flatten
221
+ else
222
+ row_fields.each_with_index{|value, i|
223
+ next if value.nil?
224
+ data[id][i] ||= []
225
+ data[id][i] += value.split(sep2)
226
+ }
227
+ end
228
+ }
229
229
 
230
230
  data = Hash[*(data.collect{|key,values| [key, values.first]}).flatten] if single
231
-
231
+
232
232
  data
233
233
  end
234
234
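The reworked `Open.to_hash` above splits every field on `:sep2` and either flattens the values per key or keeps one array per column. The following is a simplified standalone sketch of that behaviour, not the library method: `to_hash_sketch` is a hypothetical name, it only accepts a `StringIO` or a plain string of content (the real method hands other inputs to `Open.read`), it omits `:fix`, `:exclude` and `:single`, and the way `row_fields` and `id` are obtained is an assumption because those lines fall outside the hunks shown.

```ruby
require 'stringio'

# Simplified re-creation of the new to_hash logic (hypothetical helper).
def to_hash_sketch(input, options = {})
  native  = options[:native] || 0
  extra   = options[:extra]
  sep     = options[:sep]  || "\t"
  sep2    = options[:sep2] || "|"
  flatten = options[:flatten]
  extra   = [extra] if extra && !extra.is_a?(Array)

  content = input.is_a?(StringIO) ? input : StringIO.new(input.to_s)

  data = {}
  content.each_line do |l|
    row_fields = l.chomp.split(sep)   # assumption: not visible in the diff
    id = row_fields[native]           # assumption: not visible in the diff
    next if id.nil? || id == ""

    data[id] ||= []

    if extra
      row_fields = row_fields.values_at(*extra)
    else
      row_fields.delete_at(native)
    end

    if flatten
      # One flat list of sub-elements per key.
      data[id] += row_fields.compact.collect { |v| v.split(sep2) }.flatten
    else
      # One list per column, each holding that column's sub-elements.
      row_fields.each_with_index do |value, i|
        next if value.nil?
        data[id][i] ||= []
        data[id][i] += value.split(sep2)
      end
    end
  end
  data
end

# Invented sample rows: key, a "|"-separated field, and a second field.
text   = "gene1\tGO:0000001|GO:0000002\tPMID:1\ngene1\tGO:0000003\tPMID:2\n"
flat   = to_hash_sketch(StringIO.new(text), :flatten => true)
nested = to_hash_sketch(StringIO.new(text))

puts flat.inspect
puts nested.inspect
```

With `:flatten => true` all sub-elements for a key end up in one array, which is the shape `Organism.goterms` now returns; without it the columns stay separate.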