rbbt 1.0.0 → 1.0.2

@@ -1,17 +1,140 @@
1
1
  = rbbt
2
2
 
3
- Description goes here.
3
+ Rbbt stands for Ruby Bio-Text, it started as an API for text mining developed
4
+ for SENT[http://sent.dacya.ucm.es], but its functionality has been used
5
+ for other applications as well, such as MARQ[http://marq.dacya.ucm.es].
4
6
 
5
- == Note on Patches/Pull Requests
7
+ == Important Note
8
+
9
+ Some unexpected gem dependencies may appear.
10
+
11
+ Rbbt covers several functionalities: some work right away, while others require you to
12
+ install dependencies or download and process data from the internet. Since not
13
+ all users are likely to need all the functionalities, this gem's dependencies
14
+ include only the very basic requirements. Dependencies may appear unexpectedly
15
+ when using new parts of the API.
16
+
17
+ == Functionality
18
+
19
+ === Data sources interface
20
+
21
+ PubMed:: Making queries and retrieving articles.
22
+
23
+ BioMart:: Making queries to BioMart programmatically. It can divide a large query into smaller ones and merge the results.
24
+
25
+ Entrez:: Retrieving gene entries, associated articles, and gene synonyms and aliases.
26
+
27
+ Biocreative:: Using the competition test and training data to train and evaluate Named Entity Extraction models and Gene Mention Normalization.
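
The BioMart entry above mentions dividing a large query into smaller ones and merging the results. The following is a minimal sketch of that idea only, not rbbt's implementation: `chunked_query` and `fake_service` are hypothetical names, and the lambda stands in for a real BioMart request.

```ruby
# Hypothetical helper: split a large list of ids into chunks, query each
# chunk separately, and merge the partial results into a single Hash.
def chunked_query(ids, chunk_size, fetch)
  results = {}
  ids.each_slice(chunk_size) do |chunk|
    # fetch is expected to return a Hash of id => value for the chunk
    results.merge!(fetch.call(chunk))
  end
  results
end

# Stand-in for a remote service; it also counts how many requests were made.
calls = 0
fake_service = lambda do |chunk|
  calls += 1
  Hash[chunk.map { |id| [id, id.upcase] }]
end

merged = chunked_query(%w(a b c d e), 2, fake_service)
puts merged.inspect
puts "requests: #{calls}"
```

Keeping the chunk size as a parameter lets the caller trade the number of requests against the size of each one.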
28
+
29
+
30
+ === Text mining tasks
31
+
32
+ BagOfWords:: Bag-of-words representation of text. Chunk text into terms, which can be unigrams or bigrams, remove stopwords, build a term thesaurus using a TF_IDF (term frequency inverse document frequency) or a KL (Kullback-Leibler divergence) Dictionary, and extract a bag-of-words representation suitable for the Classifier.
33
+
34
+ Classifier:: Using R to build classification models and to use them to classify new entries. Currently the models are Support Vector Machines.
35
+
36
+ NER:: Named Entity Extraction. Currently there are four alternatives: Abner, Banner, RegExpNER, and NER. The first two are third-party Java systems that require the rjb[rjb.rubyforge.org/] (Ruby Java Bridge) gem to be installed. The third one, RegExpNER, is a simple regular-expression based system which can be used when there is not enough data to train a CRF-based system, for example, to find Polysearch terms. The last one, the default, is a reimplementation of a CRF-based system, such as Abner and Banner, completely configurable using a simple DSL (domain specific language).
37
+
38
+ Normalizer:: Resolve gene mentions to the actual genes they refer to. It compares the gene mention to all possible gene names and synonyms to find the best match. It is configurable using a DSL.
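
To make the BagOfWords entry above more concrete, here is a minimal sketch of a bag-of-words count with an inverse document frequency weight. It is an illustration under stated assumptions, not rbbt's code: the function names, the tiny default stopword list, and the `log(N / df)` weighting are all chosen for the example, and no stemming is done.

```ruby
# Turn text into a Hash of term => count, skipping stopwords.
# The default stopword list is a tiny illustrative one.
def bag_of_words(text, stopwords = %w(the of a in is and))
  counts = Hash.new(0)
  text.downcase.scan(/\w+/).each do |word|
    counts[word] += 1 unless stopwords.include?(word)
  end
  counts
end

# Inverse document frequency over a collection of bags: log(N / df),
# where df is the number of documents that contain the term.
def idf(bags)
  df = Hash.new(0)
  bags.each { |bag| bag.keys.each { |term| df[term] += 1 } }
  n = bags.length.to_f
  Hash[df.map { |term, count| [term, Math.log(n / count)] }]
end

docs = ["The gene BRCA1 is expressed in the cell",
        "Expression of the gene TP53 in yeast"]
bags = docs.map { |doc| bag_of_words(doc) }
weights = idf(bags)
puts bags.inspect
puts weights.inspect
```

A term that appears in every document, such as "gene" here, gets a weight of zero, which is why such weights are useful for picking discriminative terms.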
39
+
40
+ === Organisms support
41
+
42
+ Using configuration files, rbbt can support different organisms. The system is prepared to parse organism-specific database files and merge them with Entrez and BioMart, producing the following information:
43
+
44
+ Lexicon:: Listing the synonyms for each gene
45
+
46
+ Identifiers:: Listing different identifiers for each gene like Entrez Gene Ids, Unigene, Affymetrix probe ids, etc. This is not the same as the lexicon which holds names, not identifiers.
47
+
48
+ GO:: Listing associations of genes to GO terms.
49
+
50
+ PubMed articles:: Listing articles associated with each gene, as listed in Entrez or listed in support of GO associations.
51
+
52
+ With this information rbbt offers the following functionality via the Organism class
53
+
54
+ NER and Normalization:: Loads custom models for Named Entity Extraction and Gene Mention Normalization
55
+
56
+ Identifiers translation:: Translates gene identifiers between formats.
57
+
58
+ Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:
59
+
60
+ Candida albicans:: cgd
61
+ Mus musculus:: mgi
62
+ Rattus norvegicus:: rgd
63
+ Saccharomyces cerevisiae:: sgd
64
+ Arabidopsis thaliana:: tair
65
+ Caenorhabditis elegans:: worm
66
+ Homo sapiens:: human
67
+ Schizosaccharomyces pombe:: pombe
68
+
69
+
70
+ === Other
71
+
72
+ Cache:: The system caches PubMed articles and Entrez gene entries in a persistent cache, since these items are unlikely to change. It also caches any other data downloaded from the internet, such as BioMart queries, in a non-persistent cache that can be purged to update the system.
73
+
74
+ Tab separated file helpers:: The data in rbbt is saved into tab separated files and loaded into Hash objects. Modules like Open or ArrayHash help deal with these files and data structures.
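
The split between a persistent and a non-persistent cache described above can be sketched as follows. This is only an illustration of the idea: the function names and the directory layout are hypothetical and do not reflect how rbbt stores its cache.

```ruby
require 'tmpdir'
require 'fileutils'

# Directory for one of the two tiers, created on demand.
def cache_dir(root, persistent)
  dir = File.join(root, persistent ? 'persistent' : 'non_persistent')
  FileUtils.mkdir_p(dir)
  dir
end

# Return the cached content for key, computing and storing it on a miss.
def cache_fetch(root, key, persistent)
  path = File.join(cache_dir(root, persistent), key)
  return File.read(path) if File.exist?(path)
  content = yield
  File.open(path, 'w') { |f| f.write(content) }
  content
end

# Remove only the non-persistent tier.
def purge_cache(root)
  FileUtils.rm_rf(cache_dir(root, false))
end

root = Dir.mktmpdir
cache_fetch(root, 'pmid_1', true)     { 'article text' }
cache_fetch(root, 'biomart_q', false) { 'query result' }
purge_cache(root)
article_after_purge = cache_fetch(root, 'pmid_1', true)     { 'refetched' }
query_after_purge   = cache_fetch(root, 'biomart_q', false) { 'refetched' }
FileUtils.rm_rf(root)
puts article_after_purge
puts query_after_purge
```

After the purge, the persistent entry is still served from disk while the non-persistent one is recomputed.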
75
+
76
+ = Installation
77
+
78
+ Install the gem normally with <tt>gem install rbbt</tt>. The gem includes a configuration tool, rbbt_config. The first time you run it, it will ask you to configure some paths. After that you may use it to process data for different tasks. Let's see some scenarios:
79
+
80
+ === Using rbbt to translate identifiers
81
+
82
+ 1. Do <tt>rbbt_config install identifiers</tt> to deploy the configuration files and download Entrez data; this needs to be done just once.
83
+ 3. Now you may do <tt>rbbt_config update organisms</tt> to process all the organisms, or <tt>rbbt_config update organisms -o sgd</tt> to process only yeast (sgd).
84
+ 4. You may now use a script like this to translate yeast gene identifiers fed from the standard input:
85
+ require 'rbbt/sources/organism'
86
+
87
+ index = Organism.id_index('sgd', :native => 'Entrez Gene Id')
88
+
89
+ STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}
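
To illustrate what such an identifier index amounts to, here is a minimal sketch that builds a Hash from every alternative identifier to a native one. The sample data is made up, and the layout (a header line starting with '#', tab-separated columns, native identifier in the first column) is an assumption for the example; `build_index` is a hypothetical helper, not part of rbbt.

```ruby
# Made-up sample in an assumed layout: a '#' header line, then one gene per
# line with tab-separated identifiers; the first column is the native id.
text = "#SGD ID\tEntrez Gene Id\tSystematic Name\n" \
       "S0001\t850001\tYAL001C\n" \
       "S0002\t850002\tYAL002W\n"

# Map every non-native identifier to the native identifier of its line.
def build_index(text)
  index = {}
  text.each_line do |line|
    next if line =~ /^\s*#/            # skip the header
    native, *others = line.chomp.split(/\t/)
    others.each { |id| index[id] = native unless id.empty? }
  end
  index
end

index = build_index(text)
%w(850001 YAL002W unknown).each do |id|
  puts "#{id} => #{index[id].inspect}"
end
```

Identifiers that are not in the index simply return nil, which mirrors how the script above prints an empty translation for unknown input.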
90
+
91
+ === Using rbbt to find gene mentions in text
92
+
93
+ First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:
94
+
95
+ 1. Install the Biocreative data used to train the model and compile the CRF++ plugin, <tt>rbbt_config install rner</tt>. You may need to install ParseTree and ruby2ruby at this point.
96
+ 2. Build the module for a particular organism <tt>rbbt_config update ner -o sgd</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
97
+
98
+ Or, if you want to use Abner or Banner:
99
+
100
+ 1. Download and install the packages <tt>rbbt_config install java_ner</tt>
101
+
102
+ You may now, for example, find mentions of genes in articles from a PubMed query using this script:
103
+
104
+ require 'rbbt/sources/organism'
105
+ require 'rbbt/sources/pubmed'
106
+
107
+ # type = :abner
108
+ # type = :banner
109
+ type = :rner
110
+
111
+ ner = Organism.ner('sgd', type )
112
+ pmids = PubMed.query(ARGV[0], 500)
113
+
114
+ PubMed.get_article(pmids).each{|pmid,article|
115
+ mentions = ner.extract(article.text)
116
+ puts pmid
117
+ puts article.text
118
+ puts "Mentions: " << mentions.uniq.join(", ")
119
+ puts
120
+ }
121
+
122
+ == More Installation Guidelines
123
+
124
+ This is the complete list of gem requirements: <tt>ParseTree ruby2ruby simpleconsole rjb rsruby stemmer rand rake progress-monitor</tt>. Some of these gems do not work with Ruby 1.9 at the time of writing, or may be a bit more complicated to install; for that reason *they are not reported as dependencies and are only required when they are about to be used*. Note that some of these gems are in the gemcutter repository, so you may need to install the <tt>gemcutter</tt> gem and run <tt>gem tumble</tt>.
125
+
126
+ Some parts of the API require data to be processed first using rbbt_config. This command is used to install third-party software, download data from the internet, or build models. The command <tt>rbbt_config install all</tt> will install and process everything; this will take a long time, especially building the NER models. So you might want to start with the basic install and include more things as they are needed.
127
+
128
+
129
+ = Note on Patches/Pull Requests
6
130
 
7
131
  * Fork the project.
8
132
  * Make your feature addition or bug fix.
9
- * Add tests for it. This is important so I don't break it in a
10
- future version unintentionally.
133
+ * Add tests for it. This is important so I don't break it in a future version unintentionally.
11
134
  * Commit, do not mess with rakefile, version, or history.
12
- (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
135
+ (if you want to have your own version, that is fine, but bump version in a commit by itself that I can ignore when I pull)
13
136
  * Send me a pull request. Bonus points for topic branches.
14
137
 
15
- == Copyright
138
+ = Copyright
16
139
 
17
140
  Copyright (c) 2009 Miguel Vazquez. See LICENSE for details.
@@ -12,29 +12,62 @@ rescue Rbbt::NoConfig
12
12
  $noconfig = true
13
13
  end
14
14
 
15
+ TASKS= %w(organisms ner norm classifier biocreative entrez go wordlists polysearch abner banner crf++)
15
16
 
16
17
  $USAGE =<<EOT
17
18
  #{__FILE__} <action> [<subaction>] [--force] [--organism <org>]
19
+
18
20
  actions:
21
+
19
22
  * configure: Set paths for data, cache, and tmp directories
20
23
 
21
24
  * install:
22
- * basic: Third party software
23
- * databases: Entrez and Biocreative
24
- * models: Gene Mention and Classification
25
- * organisms: Rules to gather data for organisms
26
- * all: 3party wordlists entrez biocreative go ner norm classifier organisms polysearch
27
-
28
- * update:
29
- * organisms: Gather data for organisms
30
- * ner: Build Named Entity Recognition Models for Gene Mention
31
- * classification:
32
- Build Function/Process Classifiers
25
+
26
+ Basic subactions:
27
+
28
+ * organisms: Install processing scripts to process organisms
29
+ * ner: Install processing scripts for Named Entity Recognition
30
+ * norm: Install processing scripts for Gene Mention Normalization
31
+ * classifier: Install processing scripts for Classification
32
+
33
+ * biocreative: Download training and test data from BioCreative
34
+ * entrez: Download and install data from Entrez
35
+ * go: Download and install data from The Gene Ontology
36
+ * wordlists: Install word lists
37
+ * polysearch: Download and install Polysearch dictionaries
38
+
39
+ * abner: Download and install Abner NER system: http://pages.cs.wisc.edu/~bsettles/abner/
40
+ * banner: Download and install Banner NER system: http://sourceforge.net/projects/banner/
41
+ * crf++: Download and install CRF++ a CRF framework: http://crfpp.sourceforge.net/
42
+
43
+ Subactions grouped by task:
44
+
45
+ * identifiers: entrez, organisms
46
+ * rner: entrez, organisms, biocreative, ner, crf++
47
+ * java_ner: entrez, organisms, abner, banner
48
+ * norm: entrez, organisms, biocreative, crf++, norm, polysearch
49
+ * bow: organisms, wordlists
50
+ * classifier: organisms, wordlists, classifier, go
51
+ * all: #{TASKS.join(", ")}
52
+
53
+ * update:
54
+ * organisms: Gather organisms data
55
+ * ner: Build Named Entity Recognition Models. Mention Normalization needs no training.
56
+ * classification: Build Function/Process Classifiers
57
+
58
+ --force: Rebuild models or reprocess organism data even if present. You may want to purge the cache
59
+ to be up to date with the data in the internet.
60
+
61
+ --organism: Gather data only for that particular organism. The organism must be specified by the
62
+ keyword. Use '#{__FILE__} organisms' to see the keywords.
33
63
 
34
64
  * purge_cache: Clean the non-persistent cache, which holds general things
35
65
  downloaded using Open.read, like organism identifiers downloaded from
36
66
  BioMart. The persistent cache, which hold pubmed articles or entrez gene
37
67
  descriptions, is not cleaned, as these are not likely to change
68
+
69
+ * organisms: Show a list of all organisms along with their identifier in the system
70
+
38
71
 
39
72
 
40
73
  EOT
@@ -44,6 +77,10 @@ class Controller < SimpleConsole::Controller
44
77
  params :bool => {:f => :force},
45
78
  :string => {:o => :organism}
46
79
 
80
+ def organisms
81
+ end
82
+
83
+
47
84
  def default
48
85
  render :action => :usage
49
86
  end
@@ -73,21 +110,39 @@ class Controller < SimpleConsole::Controller
73
110
 
74
111
  def install
75
112
  raise "Run #{__FILE__} configure first to configure rbbt" if $noconfig
76
-
77
113
  case params[:id]
78
- when "basic"
79
- @tasks = %w(3party wordlists polysearch)
80
- when "databases"
81
- @tasks = %w(entrez biocreative go)
82
- when "models"
83
- @tasks = %w(ner norm classifier)
84
- when "organisms"
85
- @tasks = %w(organisms)
114
+ when "identifiers"
115
+ require 'rbbt/sources/organism'
116
+ require 'rbbt/sources/entrez'
117
+ @tasks = %w(entrez organisms)
118
+ when "rner"
119
+ require 'rbbt/ner/rner'
120
+ require 'rbbt/sources/entrez'
121
+ @tasks = %w(entrez organisms biocreative ner crf++)
122
+ when "java_ner"
123
+ require 'rjb'
124
+ @tasks = %w(entrez organisms abner banner)
125
+ when "norm"
126
+ require 'rbbt/ner/rner'
127
+ require 'rbbt/ner/rnorm'
128
+ require 'rbbt/ner/regexpNER'
129
+ require 'rbbt/sources/entrez'
130
+ @tasks = %w(entrez organisms biocreative crf++ norm polysearch)
131
+ when "bow"
132
+ require 'rbbt/bow/bow'
133
+ require 'rbbt/bow/dictionary'
134
+ @tasks = %w(organisms wordlists)
135
+ when "classifier"
136
+ require 'rbbt/bow/bow'
137
+ require 'rbbt/bow/dictionary'
138
+ require 'rbbt/bow/classifier'
139
+ @tasks = %w(organisms wordlists classifier go)
86
140
  when "all"
87
- @tasks = %w(3party wordlists entrez biocreative go ner norm classifier organisms polysearch)
141
+ @tasks = TASKS
88
142
  when nil
89
143
  redirect_to :action => :help, :id => :install
90
144
  else
145
+ redirect_to :action => :help, :id => :install if ! TASKS.include? params[:id]
91
146
  @tasks = [params[:id]]
92
147
  end
93
148
 
@@ -109,6 +164,17 @@ class View < SimpleConsole::View
109
164
  puts $USAGE
110
165
  end
111
166
 
167
+ def organisms
168
+ require 'rbbt/sources/organism'
169
+ all = Organism.all(false)
170
+ installed = Organism.all
171
+
172
+ all.each{|org|
173
+ puts "#{Organism.name(org)}: #{org} #{installed.include?(org) ? "(installed)" : ""}"
174
+ }
175
+ end
176
+
177
+
112
178
  def install
113
179
  load File.join(Rbbt.rootdir, 'tasks/install.rake')
114
180
 
@@ -85,7 +85,12 @@ rule (/results\/(.*)/) => lambda{|n| n.sub(/results/,'model')} do |t|
85
85
 
86
86
  ndocs = 100
87
87
 
88
- used = Open.read(features).collect{|l| l.chomp.split(/\t/).first}[1..-1]
88
+ used = []
89
+ if "".respond_to? :collect
90
+ used = Open.read(features).collect{|l| l.chomp.split(/\t/).first}[1..-1]
91
+ else
92
+ used = Open.read(features).lines.collect{|l| l.chomp.split(/\t/).first}[1..-1]
93
+ end
89
94
 
90
95
  classifier = Classifier.new(model)
91
96
  go = Organism.gene_literature_go(org).collect{|gene, pmids| pmids}.flatten.uniq - used
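
The `respond_to? :collect` branch in this hunk exists because String is Enumerable in Ruby 1.8 but not in Ruby 1.9. A minimal alternative sketch relies only on `String#each_line`, which is available in both versions; `lines_of` is a hypothetical helper and the sample data is made up.

```ruby
# Return the lines of text as an Array using each_line only.
def lines_of(text)
  lines = []
  text.each_line { |l| lines << l.chomp }
  lines
end

features = "Name\tScore\ngeneA\t0.9\ngeneB\t0.4\n"
# First column of every line, dropping the header row as the hunk does.
used = lines_of(features).collect { |l| l.split(/\t/).first }[1..-1]
puts used.inspect
```

The other hunks in this diff take the same route by replacing `each` with `each_line` on the result of `Open.read`.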
@@ -37,7 +37,7 @@ def BC2GN_features(dataset, outfile)
37
37
  data[code] = {}
38
38
  data[code][:text] = Open.read(f)
39
39
  }
40
- Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).each{|l|
40
+ Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).each_line{|l|
41
41
  code, gene, mention = l.chomp.split(/\t/)
42
42
  data[code][:mentions] ||= []
43
43
  data[code][:mentions] << mention
@@ -2,9 +2,10 @@ require 'rbbt'
2
2
  require 'rbbt/sources/organism'
3
3
  require 'rbbt/util/open'
4
4
  require 'rbbt/ner/rner'
5
+ require 'rbbt/ner/rnorm'
5
6
 
6
7
 
7
- require 'progress-meter'
8
+ require 'progress-monitor'
8
9
 
9
10
  $type = ENV['ner'] || :rner
10
11
  $debug = !ENV['debug'].nil?
@@ -1,21 +1,23 @@
1
1
  #!/bin/bash
2
2
  function norm(){
3
- o=$1
3
+ organism=$1
4
4
  shift
5
- s=$1
5
+ dataset=$1
6
6
  shift
7
- n=$1
7
+ ner=$1
8
8
  shift
9
9
 
10
- echo "rm results/${o}_$s; rake results/${o}_$s.eval ner=$n $@ > ${o}_$s.log_$n; tail results/${o}_$s.eval"
11
- rm results/${o}_$s; rake results/${o}_$s.eval ner=$n $@ > ${o}_$s.log_$n; tail results/${o}_$s.eval
10
+ CMD="rm results/${organism}_$dataset; rake results/${organism}_$dataset.eval ner=$ner $@ > ${organism}_$dataset.log_$ner; tail results/${organism}_$dataset.eval"
11
+ echo "$CMD"
12
+ eval "$CMD"
12
13
  }
13
14
 
14
15
 
15
16
  function norm_2(){
16
- n=$1
17
+ ner=$1
17
18
  shift
18
19
 
19
- echo "rm results/bc2gn; rake results/bc2gn.eval ner=$n $@ > bc2gn.log_$n; tail results/bc2gn.eval"
20
- rm results/bc2gn; rake results/bc2gn.eval ner=$n $@ > bc2gn.log_$n; tail results/bc2gn.eval
20
+ CMD="rm results/bc2gn; rake results/bc2gn.eval ner=$ner $@ > bc2gn.log_$ner; tail results/bc2gn.eval"
21
+ echo "$CMD"
22
+ eval "$CMD"
21
23
  }
@@ -1,5 +1,23 @@
1
1
  $org = [$org, ENV['organism'],nil].reject{|e| e.nil? }.first
2
2
 
3
+ task 'names' do
4
+ orgs = Dir.glob('*').
5
+ select{|t|
6
+ File.directory?(t ) &&
7
+ File.exist?(t + '/Rakefile')
8
+ }
9
+
10
+ orgs.each{|org|
11
+ pid = Process.fork{
12
+ Dir.chdir(org)
13
+ load 'Rakefile'
14
+ Rake::Task['name'].invoke
15
+ }
16
+ Process.waitpid pid
17
+ }
18
+
19
+ end
20
+
3
21
  task 'default' do
4
22
  if $org
5
23
  orgs = [$org]
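
The `names` task in this hunk forks a child process per organism so that `Dir.chdir` and `load 'Rakefile'` do not affect the parent. The pattern can be sketched in isolation as below; `in_child_dir` is a hypothetical helper, and the sketch assumes a platform where `Process.fork` is available.

```ruby
require 'tmpdir'

# Run the block in a child process after changing into dir, so the
# parent's working directory is left untouched. Returns the exit status.
def in_child_dir(dir)
  pid = Process.fork do
    Dir.chdir(dir)
    yield
  end
  Process.waitpid(pid)
  $?.exitstatus
end

start = Dir.pwd
status = Dir.mktmpdir do |tmp|
  # The child exits with 0 only if it really is inside the temp directory.
  in_child_dir(tmp) { exit!(Dir.pwd == File.realpath(tmp) ? 0 : 1) }
end
puts "child status: #{status}"
puts "parent still in start directory: #{Dir.pwd == start}"
```

Waiting on each child before starting the next keeps the work sequential, as in the task above, while still isolating each Rakefile's side effects.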
@@ -88,7 +88,7 @@ file 'lexicon' do
88
88
  "#{ code }\t" + name_lists.flatten.select{|n| n.to_s != ""}.uniq.join("\t")
89
89
  }.join("\n"))
90
90
 
91
- rescue Entrez::NoFile
91
+ rescue Entrez::NoFileError
92
92
  puts "Lexicon not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
93
93
  end
94
94
  end
@@ -185,7 +185,7 @@ file 'identifiers' do
185
185
  }
186
186
  fout.close
187
187
 
188
- rescue Entrez::NoFile
188
+ rescue Entrez::NoFileError
189
189
  puts "Identifiers not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
190
190
  end
191
191
  end
@@ -237,7 +237,7 @@ file 'gene.pmid' do
237
237
  }.compact.join("\n")
238
238
  }.compact.join("\n")
239
239
  )
240
- rescue Entrez::NoFile
240
+ rescue Entrez::NoFileError
241
241
  puts "Gene article associations from entrez not produced, install the gene2pubmed file (rbbt_config install entrez)."
242
242
  end
243
243
 
@@ -1,6 +1,6 @@
1
1
  require __FILE__.sub(/[^\/]*$/,'') + '../rake-include'
2
2
 
3
- $name = "Caenorhabditis elegans "
3
+ $name = "Caenorhabditis elegans"
4
4
 
5
5
 
6
6
  $native_id = "WormBase ID"
@@ -59,7 +59,7 @@ module Rbbt
59
59
 
60
60
  # For some reason banner.jar must be loaded before abner.jar
61
61
  ENV['CLASSPATH'] ||= ""
62
- ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party/#{pkg}/#{ pkg }.jar")}.join(":")
62
+ ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party", pkg, "#{ pkg }.jar")}.join(":")
63
63
  end
64
64
 
65
65
  def self.rootdir
@@ -17,6 +17,7 @@ module BagOfWords
17
17
  # 'rbbt/util/misc'.
18
18
  def self.words(text)
19
19
  return [] if text.nil?
20
+ raise "Stopword list not loaded. Have you installed the wordlists? (rbbt_config install wordlists)" if $stopwords.nil?
20
21
  text.scan(/\w+/).
21
22
  collect{|word| word.downcase.stem}.
22
23
  select{|word|
@@ -113,6 +113,4 @@ class Classifier
113
113
 
114
114
  end
115
115
 
116
-
117
-
118
116
  end
@@ -185,34 +185,3 @@ class Dictionary::KL
185
185
 
186
186
 
187
187
  end
188
-
189
- if __FILE__ == $0
190
-
191
- require 'benchmark'
192
- require 'rbbt/sources/pubmed'
193
- require 'rbbt/bow/bow'
194
- require 'progress-meter'
195
-
196
- max = 10000
197
-
198
- pmids = PubMed.query("Homo Sapiens", max)
199
- Progress.monitor "Get pimds"
200
- docs = PubMed.get_article(pmids).values.collect{|article| BagOfWords.terms(article.text)}
201
-
202
- dict = Dictionary::TF_IDF.new()
203
-
204
- puts "Starting Benchmark"
205
- puts Benchmark.measure{
206
- docs.each{|doc|
207
- dict.add doc
208
- }
209
- }
210
- puts Benchmark.measure{
211
- dict.weights
212
- }
213
-
214
- puts dict.terms.length
215
-
216
-
217
- end
218
-
@@ -4,6 +4,7 @@ require 'rbbt/ner/rnorm/tokens'
4
4
  require 'rbbt/util/index'
5
5
  require 'rbbt/util/open'
6
6
  require 'rbbt/sources/entrez'
7
+ require 'rbbt/bow/bow.rb'
7
8
 
8
9
  class Normalizer
9
10
 
@@ -12,12 +12,12 @@ module Biocreative
12
12
 
13
13
  data = {}
14
14
 
15
- Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).each{|l|
15
+ Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).each_line{|l|
16
16
  code, text = l.chomp.match(/(.*?) (.*)/).values_at(1,2)
17
17
  data[code] ={ :text => text }
18
18
  }
19
19
 
20
- Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).each{|l|
20
+ Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).each_line{|l|
21
21
  code, pos, mention = l.chomp.split(/\|/)
22
22
  data[code] ||= {}
23
23
  data[code][:mentions] ||= []
@@ -1,4 +1,3 @@
1
-
2
1
  require 'rbbt/util/open'
3
2
  require 'rbbt'
4
3
 
@@ -1,4 +1,3 @@
1
-
2
1
  require 'rbbt'
3
2
  require 'rbbt/util/open'
4
3
  require 'rbbt/util/tmpfile'
@@ -190,6 +189,7 @@ module Entrez
190
189
  # found in Entrez Gene for that particular gene. The +gene+ may be a
191
190
  # gene identifier or a Gene class instance.
192
191
  def self.gene_text_similarity(gene, text)
192
+
193
193
  case
194
194
  when Entrez::Gene === gene
195
195
  gene_text = gene.text
@@ -1,18 +1,23 @@
1
-
2
1
  require 'rbbt'
3
- require 'rbbt/ner/rnorm'
4
2
  require 'rbbt/util/open'
3
+ require 'rbbt/util/index'
4
+
5
5
 
6
6
  module Organism
7
7
 
8
8
  class OrganismNotProcessedError < StandardError; end
9
9
 
10
- def self.all
11
- Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*/name').collect{|f| File.basename(File.dirname(f))}
10
+ def self.all(installed = true)
11
+ if installed
12
+ Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*/identifiers').collect{|f| File.basename(File.dirname(f))}
13
+ else
14
+ Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*').select{|f| File.directory? f}.collect{|f| File.basename(f)}
15
+ end
12
16
  end
13
17
 
14
18
 
15
19
  def self.name(org)
20
+ raise OrganismNotProcessedError, "Missing 'name' file" if ! File.exists? File.join(Rbbt.datadir,"organisms/#{ org }/name")
16
21
  Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/name"))
17
22
  end
18
23
 
@@ -30,7 +35,13 @@ module Organism
30
35
  id_types = {}
31
36
  formats = supported_ids(org)
32
37
 
33
- lines = Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).collect
38
+ text = Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"))
39
+
40
+ if text.respond_to? :collect
41
+ lines = text.collect
42
+ else
43
+ lines = text.lines
44
+ end
34
45
 
35
46
  lines.each{|l|
36
47
  ids_per_type = l.split(/\t/)
@@ -68,7 +79,10 @@ module Organism
68
79
  format_count.select{|k,v| v > (query.length / 10)}.sort{|a,b| b[1] <=> a[1]}.first
69
80
  end
70
81
 
71
- def self.ner(org, type=:abner, options = {})
82
+ # FIXME: The NER related stuff is harder to install, that's why we hide the
83
+ # requires next to where they are needed, next to options
84
+
85
+ def self.ner(org, type=:rner, options = {})
72
86
 
73
87
  case type.to_sym
74
88
  when :abner
@@ -90,6 +104,7 @@ module Organism
90
104
  end
91
105
 
92
106
  def self.norm(org, to_entrez = nil)
107
+ require 'rbbt/ner/rnorm'
93
108
  if to_entrez.nil?
94
109
  to_entrez = id_index(org, :native => 'Entrez Gene ID', :other => [supported_ids(org).first])
95
110
  end
@@ -109,7 +124,7 @@ module Organism
109
124
 
110
125
  def self.goterms(org)
111
126
  goterms = {}
112
- Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each{|l|
127
+ Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each_line{|l|
113
128
  gene, go = l.chomp.split(/\t/)
114
129
  goterms[gene.strip] ||= []
115
130
  goterms[gene.strip] << go.strip
@@ -118,7 +133,7 @@ module Organism
118
133
  end
119
134
 
120
135
  def self.literature(org)
121
- Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).collect{|l| l.chomp.scan(/\d+/)}.flatten
136
+ Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).scan(/\d+/)
122
137
  end
123
138
 
124
139
  def self.gene_literature(org)
@@ -133,7 +148,7 @@ module Organism
133
148
  formats = []
134
149
  examples = [] if options[:examples]
135
150
  i= 0
136
- Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).each{|l|
151
+ Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).each_line{|l|
137
152
  if i == 0
138
153
  i += 1
139
154
  next unless l=~/^\s*#/
@@ -42,28 +42,3 @@ module Index
42
42
  index
43
43
  end
44
44
  end
45
-
46
- if __FILE__ == $0
47
-
48
- require 'benchmark'
49
-
50
- normal = nil
51
- puts "Normal " + Benchmark.measure{
52
- normal = Index.index('/home/miki/rbbt/data/organisms/human/identifiers',:trie => false, :case_sensitive => false)
53
- }.to_s
54
-
55
-
56
- ids = Open.read('/home/miki/git/MARQ/test/GDS1375_malignant_vs_normal_up.genes').collect{|l| l.chomp.strip.upcase}
57
-
58
- new = nil
59
-
60
- puts ids.inspect
61
- puts "normal " + Benchmark.measure{
62
- 100.times{
63
- new = ids.collect{|id| normal[id]}
64
- }
65
- }.to_s
66
-
67
- puts new.inspect
68
-
69
- end
@@ -1,8 +1,12 @@
1
1
  require 'rbbt'
2
2
  require 'rbbt/util/open'
3
3
 
4
- $consonants = Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).collect{|l| l.chomp}.uniq
5
4
  class String
5
+ CONSONANTS = []
6
+ if File.exists? File.join(Rbbt.datadir, 'wordlists/consonants')
7
+ Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).each_line{|l| CONSONANTS << l.chomp}
8
+ end
9
+
6
10
  # Uses heuristics to check if a string seems like a special word, like a gene name.
7
11
  def is_special?
8
12
  # Only consonants
@@ -22,7 +26,7 @@ class String
22
26
  # Dashed word
23
27
  return true if self =~ /(^\w-|-\w$)/
24
28
  # Too many consonants (very heuristic)
25
- if self =~ /([^aeiouy]{3,})/i && !$consonants.include?($1.downcase)
29
+ if self =~ /([^aeiouy]{3,})/i && !CONSONANTS.include?($1.downcase)
26
30
  return true
27
31
  end
28
32
 
@@ -83,7 +87,8 @@ $greek = {
83
87
 
84
88
  $inverse_greek = Hash.new
85
89
  $greek.each{|l,s| $inverse_greek[s] = l }
86
- $stopwords = Open.read(File.join(Rbbt.datadir, 'wordlists/stopwords')).scan(/\w+/)
90
+
91
+ $stopwords = Open.read(File.join(Rbbt.datadir, 'wordlists/stopwords')).scan(/\w+/) if File.exists? File.join(Rbbt.datadir, 'wordlists/stopwords')
87
92
 
88
93
  class Array
89
94
 
@@ -161,7 +161,7 @@ module Open
161
161
  extra = [extra] if extra && ! extra.is_a?( Array)
162
162
 
163
163
  data = {}
164
- Open.read(filename).each{|l|
164
+ Open.read(filename).each_line{|l|
165
165
  l = fix.call(l) if fix
166
166
  next if exclude and exclude.call(l)
167
167
 
@@ -64,7 +64,7 @@ class SimpleDSL
64
64
  def initialize(method = nil, file = nil, &block)
65
65
  @config = {}
66
66
  if file
67
- raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (rbbt_config install norm)." unless File.exists? file
67
+ raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (use rbbt_config)." unless File.exists? file
68
68
  parse(method, file)
69
69
  end
70
70
 
@@ -85,6 +85,7 @@ task 'organisms' do
85
85
  end
86
86
  FileUtils.cp f , File.join(directory, "#{ org }/Rakefile")
87
87
  }
88
+ `cd #{directory}; rake names`
88
89
  end
89
90
 
90
91
  task 'ner' do
@@ -102,10 +103,10 @@ end
102
103
  task 'norm' do
103
104
  directory = "#{$datadir}/norm"
104
105
  FileUtils.mkdir_p directory
105
- %w(Rakefile config).each{|f|
106
+ %w(Rakefile config functions.sh).each{|f|
106
107
  FileUtils.cp_r File.join($scriptdir, "norm/#{ f }"), directory
107
108
  }
108
- %w(results).each{|d|
109
+ %w(results models).each{|d|
109
110
  FileUtils.mkdir_p File.join(directory, d)
110
111
  }
111
112
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rbbt
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Miguel Vazquez
@@ -9,10 +9,59 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-10-29 00:00:00 +01:00
12
+ date: 2009-11-02 00:00:00 +01:00
13
13
  default_executable: rbbt_config
14
- dependencies: []
15
-
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ name: rake
17
+ type: :runtime
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ">="
22
+ - !ruby/object:Gem::Version
23
+ version: 0.8.4
24
+ version:
25
+ - !ruby/object:Gem::Dependency
26
+ name: simpleconsole
27
+ type: :runtime
28
+ version_requirement:
29
+ version_requirements: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: "0"
34
+ version:
35
+ - !ruby/object:Gem::Dependency
36
+ name: stemmer
37
+ type: :runtime
38
+ version_requirement:
39
+ version_requirements: !ruby/object:Gem::Requirement
40
+ requirements:
41
+ - - ">="
42
+ - !ruby/object:Gem::Version
43
+ version: "0"
44
+ version:
45
+ - !ruby/object:Gem::Dependency
46
+ name: progress-monitor
47
+ type: :runtime
48
+ version_requirement:
49
+ version_requirements: !ruby/object:Gem::Requirement
50
+ requirements:
51
+ - - ">="
52
+ - !ruby/object:Gem::Version
53
+ version: "0"
54
+ version:
55
+ - !ruby/object:Gem::Dependency
56
+ name: simpleconsole
57
+ type: :runtime
58
+ version_requirement:
59
+ version_requirements: !ruby/object:Gem::Requirement
60
+ requirements:
61
+ - - ">="
62
+ - !ruby/object:Gem::Version
63
+ version: "0"
64
+ version:
16
65
  description: |-
17
66
  This toolbox includes modules for text-mining, like Named Entity Recognition and Normalization and document
18
67
  classification, as well as data integration modules that interface with PubMed, Entrez Gene, BioMart.
@@ -78,7 +127,6 @@ files:
78
127
  - lib/rbbt/util/open.rb
79
128
  - lib/rbbt/util/simpleDSL.rb
80
129
  - lib/rbbt/util/tmpfile.rb
81
- - lib/rbbt/version.rb
82
130
  - tasks/install.rake
83
131
  - LICENSE
84
132
  - README.rdoc
@@ -1,10 +0,0 @@
1
- module Rbbt
2
- module VERSION #:nodoc:
3
- MAJOR = 1
4
- MINOR = 0
5
- TINY = 0
6
-
7
- STRING = [MAJOR, MINOR, TINY].join('.')
8
- self
9
- end
10
- end