rbbt 1.1.8 → 1.2.1

Files changed (47)
  1. data/README.rdoc +12 -12
  2. data/bin/rbbt_config +2 -3
  3. data/install_scripts/norm/Rakefile +4 -4
  4. data/install_scripts/organisms/{tair.Rakefile → Ath.Rakefile} +4 -3
  5. data/install_scripts/organisms/{cgd.Rakefile → Cal.Rakefile} +0 -0
  6. data/install_scripts/organisms/{worm.Rakefile → Cel.Rakefile} +0 -0
  7. data/install_scripts/organisms/{human.Rakefile → Hsa.Rakefile} +4 -8
  8. data/install_scripts/organisms/{mgi.Rakefile → Mmu.Rakefile} +0 -0
  9. data/install_scripts/organisms/{rgd.Rakefile → Rno.Rakefile} +0 -0
  10. data/install_scripts/organisms/{sgd.Rakefile → Sce.Rakefile} +0 -0
  11. data/install_scripts/organisms/{pombe.Rakefile → Spo.Rakefile} +0 -0
  12. data/install_scripts/organisms/rake-include.rb +15 -19
  13. data/lib/rbbt.rb +0 -3
  14. data/lib/rbbt/ner/rnorm.rb +2 -2
  15. data/lib/rbbt/sources/go.rb +48 -3
  16. data/lib/rbbt/sources/organism.rb +12 -17
  17. data/lib/rbbt/util/open.rb +27 -27
  18. data/lib/rbbt/util/tmpfile.rb +16 -0
  19. data/tasks/install.rake +1 -1
  20. data/test/rbbt/bow/test_bow.rb +33 -0
  21. data/test/rbbt/bow/test_classifier.rb +72 -0
  22. data/test/rbbt/bow/test_dictionary.rb +91 -0
  23. data/test/rbbt/ner/rnorm/test_cue_index.rb +57 -0
  24. data/test/rbbt/ner/rnorm/test_tokens.rb +70 -0
  25. data/test/rbbt/ner/test_abner.rb +17 -0
  26. data/test/rbbt/ner/test_banner.rb +17 -0
  27. data/test/rbbt/ner/test_dictionaryNER.rb +122 -0
  28. data/test/rbbt/ner/test_regexpNER.rb +33 -0
  29. data/test/rbbt/ner/test_rner.rb +126 -0
  30. data/test/rbbt/ner/test_rnorm.rb +47 -0
  31. data/test/rbbt/sources/test_biocreative.rb +38 -0
  32. data/test/rbbt/sources/test_biomart.rb +31 -0
  33. data/test/rbbt/sources/test_entrez.rb +49 -0
  34. data/test/rbbt/sources/test_go.rb +24 -0
  35. data/test/rbbt/sources/test_organism.rb +59 -0
  36. data/test/rbbt/sources/test_polysearch.rb +27 -0
  37. data/test/rbbt/sources/test_pubmed.rb +29 -0
  38. data/test/rbbt/util/test_arrayHash.rb +257 -0
  39. data/test/rbbt/util/test_filecache.rb +37 -0
  40. data/test/rbbt/util/test_index.rb +31 -0
  41. data/test/rbbt/util/test_misc.rb +20 -0
  42. data/test/rbbt/util/test_open.rb +97 -0
  43. data/test/rbbt/util/test_simpleDSL.rb +57 -0
  44. data/test/rbbt/util/test_tmpfile.rb +21 -0
  45. data/test/test_helper.rb +4 -0
  46. data/test/test_rbbt.rb +11 -0
  47. metadata +39 -12
data/README.rdoc CHANGED
@@ -57,14 +57,14 @@ Identifiers translation:: Translates gene identifiers between formats.

  Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:

- Candida albicans:: cgd
- Mus musculus:: mgi
- Rattus norvegicus:: rgd
- Saccharomyces cerevisiae:: sgd
- Arabidopsis thaliana:: tair
- Caenorhabditis elegans:: worm
- Homo sapiens:: human
- Schizosaccharomyces pombe:: pombe
+ Candida albicans:: Cal
+ Mus musculus:: Mmu
+ Rattus norvegicus:: Rno
+ Saccharomyces cerevisiae:: Sce
+ Arabidopsis thaliana:: Ata
+ Caenorhabditis elegans:: Cel
+ Homo sapiens:: Hsa
+ Schizosaccharomyces pombe:: Spo


  === Other
@@ -80,11 +80,11 @@ Install the gem normally <tt>gem install rbbt</tt>. The gem includes a configura
  === Using rbbt to translate identifiers

  1. Do <tt>rbbt_config prepare identifiers</tt> to do deploy the configuration files and download entrez data, this needs to be done just once.
- 3. Now you may do <tt>rbbt_config install organisms</tt> toprocess all the organisms, or <tt>rbbt_config install organisms -o sgd</tt> to process only yeast (sgd).
+ 3. Now you may do <tt>rbbt_config install organisms</tt> toprocess all the organisms, or <tt>rbbt_config install organisms -o Sce</tt> to process only yeast (Sce).
  4. You may now use a script like this to translate gene identifiers from yeast feed from the standard input
  require 'rbbt/sources/organism'

- index = Organism.id_index('sgd', :native => 'Entrez Gene Id')
+ index = Organism.id_index('Sce', :native => 'Entrez Gene Id')

  STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}

@@ -93,7 +93,7 @@ Install the gem normally <tt>gem install rbbt</tt>. The gem includes a configura
  First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:

  1. Install the Biocreative data used to train the model and compile the CRF++ plugin, <tt>rbbt_config prepare rner</tt>. You may need at this point to install ParseTree and ruby2ruby
- 2. Build the module for a particular organism <tt>rbbt_config install ner -o sgd</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
+ 2. Build the module for a particular organism <tt>rbbt_config install ner -o Sce</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.

  Or, if you wan to use Abner or Banner:

@@ -108,7 +108,7 @@ You may now, for example, find mentions to genes in articles from a PubMed query
  # type = :banner
  type = :rner

- ner = Organism.ner('sgd', type )
+ ner = Organism.ner('Sce', type )
  pmids = PubMed.query(ARGV[0], 500)

  PubMed.get_article(pmids).each{|pmid,article|
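
The keyword changes in the README hunks above amount to a fixed renaming. As a rough sketch, a caller still holding 1.1.8 keywords could translate them with a lookup table like the one below. `OLD_TO_NEW` and `migrate_keyword` are hypothetical names, not part of rbbt. The table is read off the removed and added README lines, except that `tair` is mapped to `Ath` to match the renamed `Ath.Rakefile` (the README line itself reads `Ata`), which is an assumption.

```ruby
# Hypothetical helper, not part of rbbt: maps 1.1.8 organism keywords
# to their 1.2.1 replacements.
OLD_TO_NEW = {
  'cgd'   => 'Cal',
  'mgi'   => 'Mmu',
  'rgd'   => 'Rno',
  'sgd'   => 'Sce',
  'tair'  => 'Ath', # assumption: follows Ath.Rakefile; the README lists "Ata"
  'worm'  => 'Cel',
  'human' => 'Hsa',
  'pombe' => 'Spo',
}.freeze

# Returns the new keyword, or the input unchanged when it is not a
# known old keyword (so already-migrated values pass through).
def migrate_keyword(keyword)
  OLD_TO_NEW.fetch(keyword, keyword)
end
```

With this table, `migrate_keyword('sgd')` yields `'Sce'`, and a value that is already a new keyword is returned as-is.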
data/bin/rbbt_config CHANGED
@@ -1,5 +1,7 @@
  #!/usr/bin/ruby

+ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
+
  require 'rubygems'
  require 'rake'

@@ -67,9 +69,6 @@ $USAGE =<<EOT
  descriptions, is not cleaned, as these are not likely to change

  * organisms: Show a list of all organisms along with their identifier in the system
-
-
-
  EOT

  class Controller < SimpleConsole::Controller
data/install_scripts/norm/Rakefile CHANGED
@@ -14,10 +14,10 @@ $docs = ENV['docs']


  $org2rbbt = {
- 'yeast' => 'sgd',
- 'mouse' => 'mgi',
- 'fly' => 'sgd',
- 'bc2gn' => 'human',
+ 'yeast' => 'Sce',
+ 'mouse' => 'Mmu',
+ 'fly' => 'Sce',
+ 'bc2gn' => 'Hsa',
  }

  def match(org, filedir, goldstandard,outfile)
data/install_scripts/organisms/{tair.Rakefile → Ath.Rakefile} CHANGED
@@ -21,9 +21,10 @@ $lexicon = {

  $identifiers = {
  :file => {
- :url => "ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20090313",
- :native => 0,
- :extra => [],
+ :url => "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2009-7-29.txt",
+ :native => 4,
+ :extra => [0],
+ :fields => ["Affymetrix"],
  },
  :biomart => {
  :database => 'athaliana_eg_gene',
data/install_scripts/organisms/{human.Rakefile → Hsa.Rakefile} CHANGED
@@ -86,7 +86,7 @@ Rake::Task['gene.go'].clear
  file 'gene.go' => ['identifiers'] do
  if File.exists? 'identifiers'
  require 'rbbt/sources/organism'
- index = Organism.id_index('human', :other => ['Associated Gene Name'])
+ index = Organism.id_index('Hsa', :other => ['Associated Gene Name'])
  data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude])

  data = data.collect{|code, value_lists|
@@ -96,9 +96,7 @@ file 'gene.go' => ['identifiers'] do

  Open.write('gene.go',
  data.collect{|p|
- p[1].uniq.collect{|go|
- "#{p[0]}\t#{go}"
- }.join("\n")
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
  }.join("\n")
  )
  end
@@ -117,9 +115,7 @@ file 'gene_go.pmid' => ['identifiers'] do

  Open.write('gene_go.pmid',
  data.collect{|p|
- p[1].uniq.collect{|pmid|
- "#{p[0]}\t#{pmid}"
- }.join("\n")
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
  }.join("\n")
  )
  end
@@ -132,7 +128,7 @@ file 'lexicon' => ['identifiers'] do
  require 'rbbt/sources/organism'
  HGNC_URL = 'http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?title=HGNC+output+data&hgnc_dbtag=on&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_name_aliases&col=gd_pub_acc_ids&status=Approved&status_opt=2&level=pri&=on&where=&order_by=gd_app_sym_sort&limit=&format=text&submit=submit&.cgifields=&.cgifields=level&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag'
  names = Open.to_hash(HGNC_URL, :exclude => proc{|l| l.match(/^HGNC ID/)}, :flatten => true)
- translations = Organism.id_index('human', :native => 'Entrez Gene ID', :other => ['HGNC ID'])
+ translations = Organism.id_index('Hsa', :native => 'Entrez Gene ID', :other => ['HGNC ID'])

  Open.write('lexicon',
  names.collect{|code, names|
data/install_scripts/organisms/rake-include.rb CHANGED
@@ -192,23 +192,18 @@ end


  file 'gene.go' do
- data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude], :fix => $go[:fix])
+ data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:go], :exclude => $go[:exclude], :fix => $go[:fix], :flatten => true)

- data = data.collect{|code, value_lists|
- [code, value_lists.flatten.select{|ref| ref =~ /GO:\d+/}.collect{|ref| ref.match(/(GO:\d+)/)[1]}]
- }.select{|p| p[1].any?}
+ Open.write('gene.go', data.collect { |gene, values|
+ goterms = values.select{|v| v =~ /GO:/}.collect{|v| v.match(/(GO:\d+)/)[1]}
+ goterms.empty? ? nil : "%s\t%s" % [gene, values.uniq.join("|")]
+ }.compact.join("\n"))

- Open.write('gene.go',
- data.collect{|p|
- p[1].uniq.collect{|go|
- "#{p[0]}\t#{go}"
- }.join("\n")
- }.join("\n")
- )
  end

+
  file 'gene_go.pmid' do
- data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:pmid], :exclude => $go[:exclude], :fix => $go[:fix])
+ data = Open.to_hash($go[:url], :native => $go[:code], :extra => $go[:pmid], :exclude => $go[:exclude], :fix => $go[:fix], :flatten => true)

  data = data.collect{|code, value_lists|
  [code, value_lists.flatten.select{|ref| ref =~ /PMID:\d+/}.collect{|ref| ref.match(/PMID:(\d+)/)[1]}]
@@ -216,8 +211,9 @@ file 'gene_go.pmid' do

  Open.write('gene_go.pmid',
  data.collect{|p|
- p[1].uniq.collect{|pmid| "#{p[0]}\t#{pmid}" }.join("\n")
- }.join("\n")
+ next if p[1].empty?
+ "#{p[0]}\t#{p[1].uniq.join("|")}"
+ }.compact.join("\n")
  )
  end

@@ -230,11 +226,9 @@ file 'gene.pmid' do

  Open.write('gene.pmid',
  data.collect{|code,pmids|
- next if translations && ! translations[code]
- code = translations[code].first if translations
- pmids.collect{|pmid|
- "#{ code }\t#{pmid}"
- }.compact.join("\n")
+ next if translations && ! translations[code]
+ code = translations[code].first if translations
+ "#{code}\t#{pmids.uniq.join("|")}"
  }.compact.join("\n")
  )
  rescue Entrez::NoFileError
@@ -256,3 +250,5 @@ task 'update' do
  Rake::Task['all'].invoke
  end

+ task 'default' => 'all'
+
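
Taken together, the Rakefile hunks above switch the generated files from one line per (gene, value) pair to one line per gene, with the values joined by `|`. A minimal sketch of that output format follows. `format_gene_lines` is a hypothetical name and the gene ids and GO ids in `sample` are made up for illustration. Joining the filtered GO terms, rather than the raw `values` as the added `gene.go` line does, is an assumption about the intent.

```ruby
# Hypothetical sketch: one output line per gene, values joined with "|".
# Genes without any GO term are dropped via the nil/compact pattern
# that the added lines in the diff also use.
def format_gene_lines(data)
  data.collect { |gene, values|
    goterms = values.select { |v| v =~ /GO:\d+/ }.collect { |v| v.match(/(GO:\d+)/)[1] }
    goterms.empty? ? nil : "%s\t%s" % [gene, goterms.uniq.join("|")]
  }.compact.join("\n")
end

# Made-up sample data for illustration only.
sample = {
  'YAL001C' => ['GO:0000127', 'GO:0000127', 'GO:0006384'],
  'YAL002W' => ['not a term'],
}
puts format_gene_lines(sample)
```

Duplicated terms collapse through `uniq`, and the second gene produces no line because none of its values contains a GO id.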
data/lib/rbbt.rb CHANGED
@@ -1,6 +1,3 @@
- $:.unshift(File.dirname(__FILE__)) unless
- $:.include?(File.dirname(__FILE__)) || $:.include?(File.expand_path(File.dirname(__FILE__)))
-
  require 'fileutils'
  require 'yaml'

data/lib/rbbt/ner/rnorm.rb CHANGED
@@ -60,9 +60,9 @@ class Normalizer
  }

  # Get all at once, better performance
-
  genes = Entrez.get_gene(code2entrez.values)
- code2entrez_genes = code2entrez.collect{|p| [p[0], genes[p[1]]]}
+
+ code2entrez_genes = code2entrez.collect{|key, value| [key, genes[value]]}

  code2entrez_genes.collect{|p|
  [p[0], Entrez.gene_text_similarity(p[1], text)]
data/lib/rbbt/sources/go.rb CHANGED
@@ -4,7 +4,9 @@ require 'rbbt'
  # This module holds helper methods to deal with the Gene Ontology files. Right
  # now all it does is provide a translation form id to the actual names.
  module GO
+
  @@info = nil
+ MULTIPLE_VALUE_FIELDS = %w(is_a)

  # This method needs to be called before any translations can be made, it is
  # called automatically the first time the id2name method is called. It loads
@@ -20,10 +22,25 @@ module GO
  select{|l| l =~ /:/}.
  each{|l|
  key, value = l.chomp.match(/(.*?):(.*)/).values_at(1,2)
- term_info[key.strip] = value.strip
+ if MULTIPLE_VALUE_FIELDS.include? key.strip
+ term_info[key.strip] ||= []
+ term_info[key.strip] << value.strip
+ else
+ term_info[key.strip] = value.strip
+ end
  }
  @@info[term_info["id"]] = term_info
- }
+ }
+ end
+
+ def self.info
+ self.init unless @@info
+ @@info
+ end
+
+ def self.goterms
+ self.init unless @@info
+ @@info.keys
  end

  def self.id2name(id)
@@ -31,10 +48,38 @@ module GO
  if id.kind_of? Array
  @@info.values_at(*id).collect{|i| i['name'] if i}
  else
- return "Name not found" unless @@info[id]
+ return nil if @@info[id].nil?
  @@info[id]['name']
  end
  end

+ def self.id2ancestors(id)
+ self.init unless @@info
+ if id.kind_of? Array
+ @@info.values_at(*id).
+ select{|i| ! i['is_a'].nil?}.
+ collect{|i| i['is_a'].collect{|id|
+ id.match(/(GO:\d+)/)[1] if id.match(/(GO:\d+)/)
+ }.compact
+ }
+ else
+ return [] if @@info[id].nil? || @@info[id]['is_a'].nil?
+ @@info[id]['is_a'].
+ collect{|id|
+ id.match(/(GO:\d+)/)[1] if id.match(/(GO:\d+)/)
+ }.compact
+ end
+ end
+
+ def self.id2namespace(id)
+ self.init unless @@info
+ if id.kind_of? Array
+ @@info.values_at(*id).collect{|i| i['namespace'] if i}
+ else
+ return nil if @@info[id].nil?
+ @@info[id]['namespace']
+ end
+ end
+

  end
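
The go.rb changes above store `is_a` as a list and derive parent ids from it. The sketch below reproduces that idea on an inline OBO-style stanza so it can run without the Gene Ontology file. `parse_term` and `direct_parents` are hypothetical stand-ins for the parsing loop and for `id2ancestors`, and the stanza text is made up for illustration. As in the diff, only the terms named directly in `is_a` are returned; nothing is resolved recursively.

```ruby
MULTIPLE_VALUE_FIELDS = %w(is_a)

# Parse one "key: value" stanza. Keys listed in MULTIPLE_VALUE_FIELDS
# accumulate into arrays; all other keys keep the last value seen.
def parse_term(stanza)
  term_info = {}
  stanza.each_line do |l|
    next unless l =~ /:/
    key, value = l.chomp.match(/(.*?):(.*)/).values_at(1, 2)
    key = key.strip
    if MULTIPLE_VALUE_FIELDS.include?(key)
      (term_info[key] ||= []) << value.strip
    else
      term_info[key] = value.strip
    end
  end
  term_info
end

# Extract the GO ids from the is_a entries (direct parents only).
def direct_parents(term_info)
  (term_info['is_a'] || []).collect { |v|
    m = v.match(/(GO:\d+)/)
    m && m[1]
  }.compact
end

# Made-up stanza for illustration only.
stanza = <<~EOT
  id: GO:0000003
  name: example term
  namespace: biological_process
  is_a: GO:0000001 ! first parent
  is_a: GO:0000002 ! second parent
EOT

term = parse_term(stanza)
p direct_parents(term)
```

Before this change a second `is_a` line would have overwritten the first, so a term with several parents kept only the last one.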
data/lib/rbbt/sources/organism.rb CHANGED
@@ -93,13 +93,7 @@ module Organism
  # Returns a hash with the list of go terms for each gene id. Gene ids are in
  # Rbbt native format for that organism.
  def self.goterms(org)
- goterms = {}
- Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each_line{|l|
- gene, go = l.chomp.split(/\t/)
- goterms[gene.strip] ||= []
- goterms[gene.strip] << go.strip
- }
- goterms
+ Open.to_hash(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go"), :flatten => true)
  end

  # Return list of PubMed ids associated to the organism. Determined using a
@@ -209,33 +203,34 @@ module Organism
  pos
  end

- def self.id_index(org, option = {})
- native = option[:native]
- other = option[:other]
- option[:case_sensitive] = false if option[:case_sensitive].nil?
+ def self.id_index(org, options = {})
+ native = options[:native]
+ other = options[:other]
+ options[:case_sensitive] = false if options[:case_sensitive].nil?

  if native.nil? and other.nil?
- Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), option)
+ Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), options)
  else
  supported = Organism.supported_ids(org)

  first = nil
  if native
- first = id_position(supported,native,option)
+ first = id_position(supported,native,options)
  else
  first = 0
  end

  rest = nil
  if other
- rest = other.collect{|name| id_position(supported,name, option)}
+ rest = other.collect{|name| id_position(supported,name, options)}
  else
  rest = (0..supported.length - 1).to_a - [first]
  end

- option[:native] = first
- option[:extra] = rest
- index = Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), option)
+ options[:native] = first
+ options[:extra] = rest
+ options[:sep] = "\t"
+ index = Index.index(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"), options)

  index
  end
data/lib/rbbt/util/open.rb CHANGED
@@ -171,16 +171,18 @@ module Open
  # * :native => position of the elements that will constitute the keys. By default 0.
  # * :extra => positions of the rest of elements. By default all but :native. It can be an array of positions or a single position.
  # * :sep => pattern to use in splitting the lines into elements, by default "\t"
+ # * :sep2 => pattern to use in splitting the elements into subelements, by default "|"
  # * :flatten => flatten the array of arrays that hold the values for each key into a simple array.
  # * :single => for each key select only the first of the values, instead of the complete array.
  # * :fix => A Proc that is called to pre-process the line
  # * :exclude => A Proc that is called to check if the line must be excluded from the process.
- def self.to_hash(filename, options = {})
+ def self.to_hash(input, options = {})
  native = options[:native] || 0
  extra = options[:extra]
  exclude = options[:exclude]
  fix = options[:fix]
  sep = options[:sep] || "\t"
+ sep2 = options[:sep2] || "|"
  single = options[:single]
  single = false if single.nil?
  flatten = options[:flatten] || single
@@ -188,8 +190,14 @@ module Open

  extra = [extra] if extra && ! extra.is_a?( Array)

+ if StringIO === input
+ content = input
+ else
+ content = Open.read(input)
+ end
+
  data = {}
- Open.read(filename).each_line{|l|
+ content.each_line{|l|
  l = fix.call(l) if fix
  next if exclude and exclude.call(l)

@@ -198,37 +206,29 @@ module Open
198
206
  next if id.nil? || id == ""
199
207
 
200
208
  data[id] ||= []
209
+
201
210
  if extra
202
- fields = extra
211
+ row_fields = row_fields.values_at(*extra)
203
212
  else
204
- fields = (0..(row_fields.length - 1)).to_a - [native]
213
+ row_fields.delete_at(native)
205
214
  end
206
- fields.each_with_index{|pos,i|
207
- data[id][i] ||= []
208
- data[id][i] << row_fields[pos]
209
- }
210
- }
211
215
 
212
- if flatten
213
- data.each{|key, values|
214
- if values
215
- values.flatten!
216
- values.collect!{|v|
217
- if v != ""
218
- v
219
- else
220
- nil
221
- end
222
- }
223
- values.compact!
224
- else
225
- nil
226
- end
227
- }
228
- end
216
+
217
+ if flatten
218
+ data[id] += row_fields.compact.collect{|v|
219
+ v.split(sep2)
220
+ }.flatten
221
+ else
222
+ row_fields.each_with_index{|value, i|
223
+ next if value.nil?
224
+ data[id][i] ||= []
225
+ data[id][i] += value.split(sep2)
226
+ }
227
+ end
228
+ }
229
229
 
230
230
  data = Hash[*(data.collect{|key,values| [key, values.first]}).flatten] if single
231
-
231
+
232
232
  data
233
233
  end
234
234