rbbt 1.0.0 → 1.0.2
- data/README.rdoc +129 -6
- data/bin/rbbt_config +87 -21
- data/install_scripts/classifier/Rakefile +6 -1
- data/install_scripts/ner/Rakefile +1 -1
- data/install_scripts/norm/Rakefile +2 -1
- data/install_scripts/norm/functions.sh +10 -8
- data/install_scripts/organisms/Rakefile +18 -0
- data/install_scripts/organisms/rake-include.rb +3 -3
- data/install_scripts/organisms/worm.Rakefile +1 -1
- data/lib/rbbt.rb +1 -1
- data/lib/rbbt/bow/bow.rb +1 -0
- data/lib/rbbt/bow/classifier.rb +0 -2
- data/lib/rbbt/bow/dictionary.rb +0 -31
- data/lib/rbbt/ner/rnorm.rb +1 -0
- data/lib/rbbt/sources/biocreative.rb +2 -2
- data/lib/rbbt/sources/biomart.rb +0 -1
- data/lib/rbbt/sources/entrez.rb +1 -1
- data/lib/rbbt/sources/organism.rb +24 -9
- data/lib/rbbt/util/index.rb +0 -25
- data/lib/rbbt/util/misc.rb +8 -3
- data/lib/rbbt/util/open.rb +1 -1
- data/lib/rbbt/util/simpleDSL.rb +1 -1
- data/tasks/install.rake +3 -2
- metadata +53 -5
- data/lib/rbbt/version.rb +0 -10
data/README.rdoc
CHANGED
@@ -1,17 +1,140 @@
|
|
1
1
|
= rbbt
|
2
2
|
|
3
|
-
|
3
|
+
Rbbt stands for Ruby Bio-Text. It started as an API for text mining developed
|
4
|
+
for SENT[http://sent.dacya.ucm.es], but its functionality has been used
|
5
|
+
for other applications as well, such as MARQ[http://marq.dacya.ucm.es].
|
4
6
|
|
5
|
-
== Note
|
7
|
+
== Important Note
|
8
|
+
|
9
|
+
Some unexpected gem dependencies may appear.
|
10
|
+
|
11
|
+
Rbbt covers several functionalities; some will work right away, others require you to
|
12
|
+
install dependencies or download and process data from the internet. Since not
|
13
|
+
all users are likely to need all the functionalities, this gem's dependencies
|
14
|
+
include only the very basic requirements. Dependencies may appear unexpectedly
|
15
|
+
when using new parts of the API.
|
16
|
+
|
17
|
+
== Functionality
|
18
|
+
|
19
|
+
=== Data sources interface
|
20
|
+
|
21
|
+
PubMed:: Making queries and retrieving articles.
|
22
|
+
|
23
|
+
BioMart:: Making queries to BioMart programmatically. It can divide a large query into smaller ones and merge the results.
|
24
|
+
|
25
|
+
Entrez:: Retrieving gene entries, associated articles, and gene synonyms and aliases.
|
26
|
+
|
27
|
+
Biocreative:: Using the competition test and training data to train and evaluate Named Entity Extraction models and Gene Mention Normalization.
|
28
|
+
|
29
|
+
|
30
|
+
=== Text mining tasks
|
31
|
+
|
32
|
+
BagOfWords:: Bag-of-words representation of text. Chunk text into terms, which can be unigrams or bi-grams, remove stopwords, build a term thesaurus using a TF_IDF (term frequency, inverse document frequency) or a KL (Kullback-Leibler divergence) Dictionary, and extract a bag-of-words representation suitable for the Classifier.
|
33
|
+
|
34
|
+
Classifier:: Using R to build classification models and to use them to classify new entries. Currently the models are Support Vector Machines.
|
35
|
+
|
36
|
+
NER:: Named Entity Extraction. Currently there are 4 alternatives to do this: Abner, Banner, RegExpNER, and NER. The first two are third-party Java systems that require the rjb[rjb.rubyforge.org/] (Ruby Java Bridge) gem to be installed. The third one, RegExpNER, is a simple regular-expression-based system that can be used when there is not enough data to train a CRF-based system, for example, to find Polysearch terms. The last one, the default, is a reimplementation of a CRF-based system, such as Abner and Banner, completely configurable using a simple DSL (domain-specific language).
|
37
|
+
|
38
|
+
Normalizer:: Resolve gene mentions to the actual genes they refer to. It compares the gene mention to all possible gene names and synonyms to find the best match. It is configurable using a DSL.
|
39
|
+
|
40
|
+
=== Organisms support
|
41
|
+
|
42
|
+
Using configuration files, rbbt can support different organisms. The system is prepared to parse organism-specific database files and merge them with Entrez and BioMart, producing the following information:
|
43
|
+
|
44
|
+
Lexicon:: Listing the synonyms for each gene
|
45
|
+
|
46
|
+
Identifiers:: Listing different identifiers for each gene like Entrez Gene Ids, Unigene, Affymetrix probe ids, etc. This is not the same as the lexicon which holds names, not identifiers.
|
47
|
+
|
48
|
+
GO:: Listing associations of genes to GO terms.
|
49
|
+
|
50
|
+
PubMed articles:: Listing articles associated with each gene, as listed in Entrez or cited in support of GO associations.
|
51
|
+
|
52
|
+
With this information rbbt offers the following functionality via the Organism class
|
53
|
+
|
54
|
+
NER and Normalization:: Loads custom models for Named Entity Extraction and Gene Mention Normalization
|
55
|
+
|
56
|
+
Identifiers translation:: Translates gene identifiers between formats.
|
57
|
+
|
58
|
+
Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:
|
59
|
+
|
60
|
+
Candida albicans:: cgd
|
61
|
+
Mus musculus:: mgi
|
62
|
+
Rattus norvegicus:: rgd
|
63
|
+
Saccharomyces cerevisiae:: sgd
|
64
|
+
Arabidopsis thaliana:: tair
|
65
|
+
Caenorhabditis elegans:: worm
|
66
|
+
Homo sapiens:: human
|
67
|
+
Schizosaccharomyces pombe:: pombe
|
68
|
+
|
69
|
+
|
70
|
+
=== Other
|
71
|
+
|
72
|
+
Cache:: The system caches PubMed articles and Entrez gene entries in a persistent cache, since these items are unlikely to change. It also caches any data downloaded from the internet, such as BioMart queries, in a non-persistent cache that can be purged to update the system.
|
73
|
+
|
74
|
+
Tab separated file helpers:: The data in rbbt is saved into tab-separated files and is loaded into Hash objects. Modules like Open or ArrayHash help deal with these files and data structures.
|
75
|
+
|
76
|
+
= Installation
|
77
|
+
|
78
|
+
Install the gem normally with <tt>gem install rbbt</tt>. The gem includes a configuration tool, rbbt_config. The first time you run it, it will ask you to configure some paths. After that you may use it to process data for different tasks. Let's see some scenarios:
|
79
|
+
|
80
|
+
=== Using rbbt to translate identifiers
|
81
|
+
|
82
|
+
1. Do <tt>rbbt_config install identifiers</tt> to deploy the configuration files and download Entrez data; this needs to be done just once.
|
83
|
+
3. Now you may do <tt>rbbt_config update organisms</tt> to process all the organisms, or <tt>rbbt_config update organisms -o sgd</tt> to process only yeast (sgd).
|
84
|
+
4. You may now use a script like this to translate yeast gene identifiers fed from the standard input:
|
85
|
+
require 'rbbt/sources/organism'
|
86
|
+
|
87
|
+
index = Organism.id_index('sgd', :native => 'Entrez Gene Id')
|
88
|
+
|
89
|
+
STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}
|
90
|
+
|
91
|
+
=== Using rbbt to find gene mentions in text
|
92
|
+
|
93
|
+
First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:
|
94
|
+
|
95
|
+
1. Install the Biocreative data used to train the model and compile the CRF++ plugin with <tt>rbbt_config install rner</tt>. At this point you may need to install ParseTree and ruby2ruby.
|
96
|
+
2. Build the module for a particular organism <tt>rbbt_config update ner -o sgd</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
|
97
|
+
|
98
|
+
Or, if you want to use Abner or Banner:
|
99
|
+
|
100
|
+
1. Download and install the packages <tt>rbbt_config install java_ner</tt>
|
101
|
+
|
102
|
+
You may now, for example, find mentions of genes in articles from a PubMed query using this script:
|
103
|
+
|
104
|
+
require 'rbbt/sources/organism'
|
105
|
+
require 'rbbt/sources/pubmed'
|
106
|
+
|
107
|
+
# type = :abner
|
108
|
+
# type = :banner
|
109
|
+
type = :rner
|
110
|
+
|
111
|
+
ner = Organism.ner('sgd', type )
|
112
|
+
pmids = PubMed.query(ARGV[0], 500)
|
113
|
+
|
114
|
+
PubMed.get_article(pmids).each{|pmid,article|
|
115
|
+
mentions = ner.extract(article.text)
|
116
|
+
puts pmid
|
117
|
+
puts article.text
|
118
|
+
puts "Mentions: " << mentions.uniq.join(", ")
|
119
|
+
puts
|
120
|
+
}
|
121
|
+
|
122
|
+
== More Installation Guidelines
|
123
|
+
|
124
|
+
This is the complete list of gem requirements: <tt>ParseTree ruby2ruby simpleconsole rjb rsruby stemmer rand rake progress-monitor</tt>. Some of these gems do not work with ruby 1.9 at this time, or may be a bit more complicated to install; for that reason *they are not reported as dependencies and are only required when they are about to be used*. Note that some of these gems are in the gemcutter repository; you may need to install the <tt>gemcutter</tt> gem and do <tt>gem tumble</tt>.
|
125
|
+
|
126
|
+
Some of the API requires data that has been processed using rbbt_config. This command is used to install third-party software, download data from the internet, or build models. The command <tt>rbbt_config install all</tt> will install and process everything; this will take a long time, especially building the NER models. So you might want to start with the basic install and include more things as they are needed.
|
127
|
+
|
128
|
+
|
129
|
+
= Note on Patches/Pull Requests
|
6
130
|
|
7
131
|
* Fork the project.
|
8
132
|
* Make your feature addition or bug fix.
|
9
|
-
* Add tests for it. This is important so I don't break it in a
|
10
|
-
future version unintentionally.
|
133
|
+
* Add tests for it. This is important so I don't break it in a future version unintentionally.
|
11
134
|
* Commit, do not mess with rakefile, version, or history.
|
12
|
-
(if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
|
135
|
+
(if you want to have your own version, that is fine, but bump version in a commit by itself that I can ignore when I pull)
|
13
136
|
* Send me a pull request. Bonus points for topic branches.
|
14
137
|
|
15
|
-
|
138
|
+
= Copyright
|
16
139
|
|
17
140
|
Copyright (c) 2009 Miguel Vazquez. See LICENSE for details.
|
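Note on the README hunk above: the identifier translation it describes amounts to an index from any known identifier to a native one. The following is a minimal sketch only, assuming a tab-separated layout with a header line starting with '#' and the native identifier in the first column. `build_id_index` is a hypothetical helper written for illustration; it is not rbbt's `Organism.id_index`, whose options and file format may differ.

```ruby
# Hypothetical helper (not part of rbbt): builds a lookup table from
# tab-separated identifier text. Assumed layout: a header line starting
# with '#', then one gene per line, native identifier in the first column
# and alternative identifiers in the remaining columns.
def build_id_index(text)
  index = {}
  text.each_line do |line|
    next if line =~ /^\s*#/                 # skip the header line
    native, *others = line.chomp.split(/\t/)
    next if native.nil? || native.empty?
    index[native] = native                  # a native id maps to itself
    others.each { |id| index[id] ||= native unless id.empty? }
  end
  index
end

# Made-up sample data, for illustration only.
sample = "#Native\tEntrez Gene Id\tSymbol\n" \
         "S000001\t850001\tABC1\n" \
         "S000002\t850002\tXYZ2\n"

index = build_id_index(sample)
%w(850001 XYZ2 unknown).each { |id| puts "#{id} => #{index[id].inspect}" }
```

Identifiers that are not in the table simply map to `nil`, which mirrors how a plain Hash lookup behaves in the README script.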
data/bin/rbbt_config
CHANGED
@@ -12,29 +12,62 @@ rescue Rbbt::NoConfig
|
|
12
12
|
$noconfig = true
|
13
13
|
end
|
14
14
|
|
15
|
+
TASKS= %w(organisms ner norm classifier biocreative entrez go wordlists polysearch abner banner crf++)
|
15
16
|
|
16
17
|
$USAGE =<<EOT
|
17
18
|
#{__FILE__} <action> [<subaction>] [--force] [--organism <org>]
|
19
|
+
|
18
20
|
actions:
|
21
|
+
|
19
22
|
* configure: Set paths for data, cache, and tmp directories
|
20
23
|
|
21
24
|
* install:
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
* organisms:
|
26
|
-
*
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
*
|
31
|
-
*
|
32
|
-
|
25
|
+
|
26
|
+
Basic subactions:
|
27
|
+
|
28
|
+
* organisms: Install processing scripts to process organisms
|
29
|
+
* ner: Install processing scripts for Named Entity Recognition
|
30
|
+
* norm: Install processing scripts for Gene Mention Normalization
|
31
|
+
* classifier: Install processing scripts for Classification
|
32
|
+
|
33
|
+
* biocreative: Download and train and test data from BioCreative
|
34
|
+
* entrez: Download and install data from Entrez
|
35
|
+
* go: Download and install data from The Gene Ontology
|
36
|
+
* wordlists: Install word lists
|
37
|
+
* polysearch: Download and install Polysearch dictionaries
|
38
|
+
|
39
|
+
* abner: Download and install Abner NER system: http://pages.cs.wisc.edu/~bsettles/abner/
|
40
|
+
* banner: Download and install Banner NER system: http://sourceforge.net/projects/banner/
|
41
|
+
* crf++: Download and install CRF++ a CRF framework: http://crfpp.sourceforge.net/
|
42
|
+
|
43
|
+
Subactions grouped by task:
|
44
|
+
|
45
|
+
* identifiers: entrez, organisms
|
46
|
+
* rner: entrez, organisms, biocreative, ner, crf++
|
47
|
+
* java_ner: entrez, organisms, abner, banner
|
48
|
+
* norm: entrez organisms, biocreative, crf++, norm, polysearch
|
49
|
+
* bow: organisms, wordlists
|
50
|
+
* classifier: organisms, wordlists, classifier, go
|
51
|
+
* all: #{TASKS.join(", ")}
|
52
|
+
|
53
|
+
* update:
|
54
|
+
* organisms: Gather organisms data
|
55
|
+
* ner: Build Named Entity Recognition Models. Mention Normalization needs no training.
|
56
|
+
* classification: Build Function/Process Classifiers
|
57
|
+
|
58
|
+
--force: Rebuild models or reprocess organism data even if present. You may want to purge the cache
|
59
|
+
to be up to date with the data in the internet.
|
60
|
+
|
61
|
+
--organism: Gather data only for that particular organism. The organism must be specified by the
|
62
|
+
keyword. Use '#{__FILE__} organisms' to find the keywords.
|
33
63
|
|
34
64
|
* purge_cache: Clean the non-persistent cache, which holds general things
|
35
65
|
downloaded using Open.read, like organism identifiers downloaded from
|
36
66
|
BioMart. The persistent cache, which hold pubmed articles or entrez gene
|
37
67
|
descriptions, is not cleaned, as these are not likely to change
|
68
|
+
|
69
|
+
* organisms: Show a list of all organisms along with their identifier in the system
|
70
|
+
|
38
71
|
|
39
72
|
|
40
73
|
EOT
|
@@ -44,6 +77,10 @@ class Controller < SimpleConsole::Controller
|
|
44
77
|
params :bool => {:f => :force},
|
45
78
|
:string => {:o => :organism}
|
46
79
|
|
80
|
+
def organisms
|
81
|
+
end
|
82
|
+
|
83
|
+
|
47
84
|
def default
|
48
85
|
render :action => :usage
|
49
86
|
end
|
@@ -73,21 +110,39 @@ class Controller < SimpleConsole::Controller
|
|
73
110
|
|
74
111
|
def install
|
75
112
|
raise "Run #{__FILE__} configure first to configure rbbt" if $noconfig
|
76
|
-
|
77
113
|
case params[:id]
|
78
|
-
when "
|
79
|
-
|
80
|
-
|
81
|
-
@tasks = %w(entrez
|
82
|
-
when "
|
83
|
-
|
84
|
-
|
85
|
-
@tasks = %w(organisms)
|
114
|
+
when "identifiers"
|
115
|
+
require 'rbbt/sources/organism'
|
116
|
+
require 'rbbt/sources/entrez'
|
117
|
+
@tasks = %w(entrez organisms)
|
118
|
+
when "rner"
|
119
|
+
require 'rbbt/ner/rner'
|
120
|
+
require 'rbbt/sources/entrez'
|
121
|
+
@tasks = %w(entrez organisms biocreative ner crf++)
|
122
|
+
when "java_ner"
|
123
|
+
require 'rjb'
|
124
|
+
@tasks = %w(entrez organisms abner banner)
|
125
|
+
when "norm"
|
126
|
+
require 'rbbt/ner/rner'
|
127
|
+
require 'rbbt/ner/rnorm'
|
128
|
+
require 'rbbt/ner/regexpNER'
|
129
|
+
require 'rbbt/sources/entrez'
|
130
|
+
@tasks = %w(entrez organisms biocreative crf++ norm polysearch)
|
131
|
+
when "bow"
|
132
|
+
require 'rbbt/bow/bow'
|
133
|
+
require 'rbbt/bow/dictionary'
|
134
|
+
@tasks = %w(organisms wordlists)
|
135
|
+
when "classifier"
|
136
|
+
require 'rbbt/bow/bow'
|
137
|
+
require 'rbbt/bow/dictionary'
|
138
|
+
require 'rbbt/bow/classifier'
|
139
|
+
@tasks = %w(organisms wordlists classifier go)
|
86
140
|
when "all"
|
87
|
-
@tasks =
|
141
|
+
@tasks = TASKS
|
88
142
|
when nil
|
89
143
|
redirect_to :action => :help, :id => :install
|
90
144
|
else
|
145
|
+
redirect_to :action => :help, :id => :install if ! TASKS.include? params[:id]
|
91
146
|
@tasks = [params[:id]]
|
92
147
|
end
|
93
148
|
|
@@ -109,6 +164,17 @@ class View < SimpleConsole::View
|
|
109
164
|
puts $USAGE
|
110
165
|
end
|
111
166
|
|
167
|
+
def organisms
|
168
|
+
require 'rbbt/sources/organism'
|
169
|
+
all = Organism.all(false)
|
170
|
+
installed = Organism.all
|
171
|
+
|
172
|
+
all.each{|org|
|
173
|
+
puts "#{Organism.name(org)}: #{org} #{installed.include?(org) ? "(installed)" : ""}"
|
174
|
+
}
|
175
|
+
end
|
176
|
+
|
177
|
+
|
112
178
|
def install
|
113
179
|
load File.join(Rbbt.rootdir, 'tasks/install.rake')
|
114
180
|
|
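Note on the `install` action above: the `case` statement maps a group name to a list of subactions and falls back to a single subaction when the name is a known task. A table-driven sketch of the same resolution follows. The task and group lists are copied from the usage text in this diff; `GROUPS` and `resolve_tasks` are hypothetical names, and returning `nil` stands in for the redirect to the help action.

```ruby
# Subactions, as listed in the usage text of rbbt_config.
TASKS = %w(organisms ner norm classifier biocreative entrez go wordlists polysearch abner banner crf++)

# Group name => subactions. GROUPS is a hypothetical name for this sketch.
GROUPS = {
  'identifiers' => %w(entrez organisms),
  'rner'        => %w(entrez organisms biocreative ner crf++),
  'java_ner'    => %w(entrez organisms abner banner),
  'norm'        => %w(entrez organisms biocreative crf++ norm polysearch),
  'bow'         => %w(organisms wordlists),
  'classifier'  => %w(organisms wordlists classifier go),
  'all'         => TASKS
}

# Returns the list of subactions for +id+, or nil when +id+ is unknown.
# Group names win over task names, as in the case statement.
def resolve_tasks(id)
  return GROUPS[id] if GROUPS.key?(id)
  return [id] if TASKS.include?(id)
  nil
end

p resolve_tasks('identifiers')
p resolve_tasks('go')
p resolve_tasks('bogus')
```

Keeping the groups in a Hash means the usage text and the dispatch could be generated from one place, which is a design choice of this sketch rather than something the gem does.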
@@ -85,7 +85,12 @@ rule (/results\/(.*)/) => lambda{|n| n.sub(/results/,'model')} do |t|
|
|
85
85
|
|
86
86
|
ndocs = 100
|
87
87
|
|
88
|
-
used =
|
88
|
+
used = []
|
89
|
+
if "".respond_to? :collect
|
90
|
+
used = Open.read(features).collect{|l| l.chomp.split(/\t/).first}[1..-1]
|
91
|
+
else
|
92
|
+
used = Open.read(features).lines.collect{|l| l.chomp.split(/\t/).first}[1..-1]
|
93
|
+
end
|
89
94
|
|
90
95
|
classifier = Classifier.new(model)
|
91
96
|
go = Organism.gene_literature_go(org).collect{|gene, pmids| pmids}.flatten.uniq - used
|
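Note on the hunk above: the `respond_to? :collect` test distinguishes Ruby 1.8, where String is Enumerable and iterates over lines, from Ruby 1.9 and later, where it is not and `String#lines` must be used. A small sketch of that check wrapped in one place; `text_lines` is a hypothetical helper name and the sample text is made up.

```ruby
# Returns the lines of +text+ as an Array on both old and new Rubies.
# text_lines is a hypothetical helper, not part of rbbt.
def text_lines(text)
  if text.respond_to?(:collect)
    text.collect { |l| l }   # Ruby 1.8: String is Enumerable and yields lines
  else
    text.lines.to_a          # Ruby 1.9+: ask for the lines explicitly
  end
end

features = "Name\tScore\nA1\t0.5\nB2\t0.7\n"

# Same shape as the Rakefile line: first column of every row but the header.
used = text_lines(features).collect { |l| l.chomp.split(/\t/).first }[1..-1]
p used
```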
@@ -37,7 +37,7 @@ def BC2GN_features(dataset, outfile)
|
|
37
37
|
data[code] = {}
|
38
38
|
data[code][:text] = Open.read(f)
|
39
39
|
}
|
40
|
-
Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).
|
40
|
+
Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).each_line{|l|
|
41
41
|
code, gene, mention = l.chomp.split(/\t/)
|
42
42
|
data[code][:mentions] ||= []
|
43
43
|
data[code][:mentions] << mention
|
@@ -1,21 +1,23 @@
|
|
1
1
|
#!/bin/bash
|
2
2
|
function norm(){
|
3
|
-
|
3
|
+
organism=$1
|
4
4
|
shift
|
5
|
-
|
5
|
+
dataset=$1
|
6
6
|
shift
|
7
|
-
|
7
|
+
ner=$1
|
8
8
|
shift
|
9
9
|
|
10
|
-
|
11
|
-
|
10
|
+
CMD="rm results/${organism}_$dataset; rake results/${organism}_$dataset.eval ner=$ner $@ > ${organism}_$dataset.log_$ner; tail results/${organism}_$dataset.eval"
|
11
|
+
echo $CMD
|
12
|
+
eval "$CMD"
|
12
13
|
}
|
13
14
|
|
14
15
|
|
15
16
|
function norm_2(){
|
16
|
-
|
17
|
+
ner=$1
|
17
18
|
shift
|
18
19
|
|
19
|
-
|
20
|
-
|
20
|
+
CMD="rm results/bc2gn; rake results/bc2gn.eval ner=$ner $@ > bc2gn.log_$ner; tail results/bc2gn.eval"
|
21
|
+
echo $CMD
|
22
|
+
eval "$CMD"
|
21
23
|
}
|
@@ -1,5 +1,23 @@
|
|
1
1
|
$org = [$org, ENV['organism'],nil].reject{|e| e.nil? }.first
|
2
2
|
|
3
|
+
task 'names' do
|
4
|
+
orgs = Dir.glob('*').
|
5
|
+
select{|t|
|
6
|
+
File.directory?(t ) &&
|
7
|
+
File.exist?(t + '/Rakefile')
|
8
|
+
}
|
9
|
+
|
10
|
+
orgs.each{|org|
|
11
|
+
pid = Process.fork{
|
12
|
+
Dir.chdir(org)
|
13
|
+
load 'Rakefile'
|
14
|
+
Rake::Task['name'].invoke
|
15
|
+
}
|
16
|
+
Process.waitpid pid
|
17
|
+
}
|
18
|
+
|
19
|
+
end
|
20
|
+
|
3
21
|
task 'default' do
|
4
22
|
if $org
|
5
23
|
orgs = [$org]
|
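Note on the `names` task above: each organism Rakefile is loaded inside a forked child, so the `Dir.chdir` and anything the loaded file defines cannot leak into the parent process. The sketch below shows only that isolation pattern; `in_isolated_dir` is a hypothetical helper, and `Process.fork` is assumed to be available (it is not on Windows).

```ruby
require 'tmpdir'
require 'fileutils'

# Runs the given block in a child process whose working directory is +dir+,
# waits for it, and reports whether the child exited successfully.
# Hypothetical helper for illustration.
def in_isolated_dir(dir)
  pid = Process.fork do
    Dir.chdir(dir)   # only affects the child
    yield
  end
  Process.waitpid(pid)
  $?.success?
end

start_dir = Dir.pwd
tmp = Dir.mktmpdir

ok = in_isolated_dir(tmp) do
  File.open('name', 'w') { |f| f.puts 'Example organism' }
end

written = File.read(File.join(tmp, 'name'))
FileUtils.remove_entry(tmp)

puts "child ok: #{ok}"
puts "parent directory unchanged: #{Dir.pwd == start_dir}"
```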
@@ -88,7 +88,7 @@ file 'lexicon' do
|
|
88
88
|
"#{ code }\t" + name_lists.flatten.select{|n| n.to_s != ""}.uniq.join("\t")
|
89
89
|
}.join("\n"))
|
90
90
|
|
91
|
-
rescue Entrez::
|
91
|
+
rescue Entrez::NoFileError
|
92
92
|
puts "Lexicon not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
|
93
93
|
end
|
94
94
|
end
|
@@ -185,7 +185,7 @@ file 'identifiers' do
|
|
185
185
|
}
|
186
186
|
fout.close
|
187
187
|
|
188
|
-
rescue Entrez::
|
188
|
+
rescue Entrez::NoFileError
|
189
189
|
puts "Identifiers not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
|
190
190
|
end
|
191
191
|
end
|
@@ -237,7 +237,7 @@ file 'gene.pmid' do
|
|
237
237
|
}.compact.join("\n")
|
238
238
|
}.compact.join("\n")
|
239
239
|
)
|
240
|
-
rescue Entrez::
|
240
|
+
rescue Entrez::NoFileError
|
241
241
|
puts "Gene article associations from entrez not produced, install the gene2pumbed file (rbbt_config install entrez)."
|
242
242
|
end
|
243
243
|
|
data/lib/rbbt.rb
CHANGED
@@ -59,7 +59,7 @@ module Rbbt
|
|
59
59
|
|
60
60
|
# For some reason banner.jar must be loaded before abner.jar
|
61
61
|
ENV['CLASSPATH'] ||= ""
|
62
|
-
ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party
|
62
|
+
ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party", pkg, "#{ pkg }.jar")}.join(":")
|
63
63
|
end
|
64
64
|
|
65
65
|
def self.rootdir
|
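Note on the CLASSPATH line above: it appends one jar per package, banner first, under `third_party` in the data directory. A sketch of the same path construction as a pure function follows. `java_classpath` is a hypothetical name, `/data` is a made-up directory, and unlike the original line this variant drops the leading separator when the existing value is empty.

```ruby
# Builds a ':'-separated classpath with banner.jar before abner.jar.
# Hypothetical helper; the ':' separator assumes a Unix-like system.
def java_classpath(datadir, existing = "")
  jars = %w(banner abner).collect do |pkg|
    File.join(datadir, "third_party", pkg, "#{pkg}.jar")
  end
  ([existing] + jars).reject { |path| path.empty? }.join(":")
end

puts java_classpath("/data")
puts java_classpath("/data", "/opt/lib/other.jar")
```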
data/lib/rbbt/bow/bow.rb
CHANGED
@@ -17,6 +17,7 @@ module BagOfWords
|
|
17
17
|
# 'rbbt/util/misc'.
|
18
18
|
def self.words(text)
|
19
19
|
return [] if text.nil?
|
20
|
+
raise "Stopword list not loaded. Have you installed the wordlists? (rbbt_config install wordlists)" if $stopwords.nil?
|
20
21
|
text.scan(/\w+/).
|
21
22
|
collect{|word| word.downcase.stem}.
|
22
23
|
select{|word|
|
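Note on the guard added above: `BagOfWords.words` now fails early with a helpful message when the stopword list has not been installed. A simplified sketch of the same shape follows, under stated assumptions: the stopword list is passed in as an argument instead of read from `$stopwords`, stemming is left out because it needs the stemmer gem, and the filter is assumed to drop stopwords (the body of the original `select` block is not visible in this diff).

```ruby
# Simplified sketch, not the gem's implementation.
def words(text, stopwords)
  return [] if text.nil?
  raise ArgumentError, "Stopword list not loaded" if stopwords.nil?

  text.scan(/\w+/).
    collect { |word| word.downcase }.
    reject  { |word| stopwords.include?(word) }   # assumed filter
end

p words("The gene binds the Protein", %w(the))
p words(nil, nil)
```

Checking for `nil` text before checking the stopwords keeps the original order of the two guards.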
data/lib/rbbt/bow/classifier.rb
CHANGED
data/lib/rbbt/bow/dictionary.rb
CHANGED
@@ -185,34 +185,3 @@ class Dictionary::KL
|
|
185
185
|
|
186
186
|
|
187
187
|
end
|
188
|
-
|
189
|
-
if __FILE__ == $0
|
190
|
-
|
191
|
-
require 'benchmark'
|
192
|
-
require 'rbbt/sources/pubmed'
|
193
|
-
require 'rbbt/bow/bow'
|
194
|
-
require 'progress-meter'
|
195
|
-
|
196
|
-
max = 10000
|
197
|
-
|
198
|
-
pmids = PubMed.query("Homo Sapiens", max)
|
199
|
-
Progress.monitor "Get pimds"
|
200
|
-
docs = PubMed.get_article(pmids).values.collect{|article| BagOfWords.terms(article.text)}
|
201
|
-
|
202
|
-
dict = Dictionary::TF_IDF.new()
|
203
|
-
|
204
|
-
puts "Starting Benchmark"
|
205
|
-
puts Benchmark.measure{
|
206
|
-
docs.each{|doc|
|
207
|
-
dict.add doc
|
208
|
-
}
|
209
|
-
}
|
210
|
-
puts Benchmark.measure{
|
211
|
-
dict.weights
|
212
|
-
}
|
213
|
-
|
214
|
-
puts dict.terms.length
|
215
|
-
|
216
|
-
|
217
|
-
end
|
218
|
-
|
data/lib/rbbt/ner/rnorm.rb
CHANGED
@@ -12,12 +12,12 @@ module Biocreative
|
|
12
12
|
|
13
13
|
data = {}
|
14
14
|
|
15
|
-
Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).
|
15
|
+
Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).each_line{|l|
|
16
16
|
code, text = l.chomp.match(/(.*?) (.*)/).values_at(1,2)
|
17
17
|
data[code] ={ :text => text }
|
18
18
|
}
|
19
19
|
|
20
|
-
Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).
|
20
|
+
Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).each_line{|l|
|
21
21
|
code, pos, mention = l.chomp.split(/\|/)
|
22
22
|
data[code] ||= {}
|
23
23
|
data[code][:mentions] ||= []
|
data/lib/rbbt/sources/biomart.rb
CHANGED
data/lib/rbbt/sources/entrez.rb
CHANGED
@@ -1,4 +1,3 @@
|
|
1
|
-
|
2
1
|
require 'rbbt'
|
3
2
|
require 'rbbt/util/open'
|
4
3
|
require 'rbbt/util/tmpfile'
|
@@ -190,6 +189,7 @@ module Entrez
|
|
190
189
|
# found in Entrez Gene for that particular gene. The +gene+ may be a
|
191
190
|
# gene identifier or a Gene class instance.
|
192
191
|
def self.gene_text_similarity(gene, text)
|
192
|
+
|
193
193
|
case
|
194
194
|
when Entrez::Gene === gene
|
195
195
|
gene_text = gene.text
|
@@ -1,18 +1,23 @@
|
|
1
|
-
|
2
1
|
require 'rbbt'
|
3
|
-
require 'rbbt/ner/rnorm'
|
4
2
|
require 'rbbt/util/open'
|
3
|
+
require 'rbbt/util/index'
|
4
|
+
|
5
5
|
|
6
6
|
module Organism
|
7
7
|
|
8
8
|
class OrganismNotProcessedError < StandardError; end
|
9
9
|
|
10
|
-
def self.all
|
11
|
-
|
10
|
+
def self.all(installed = true)
|
11
|
+
if installed
|
12
|
+
Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*/identifiers').collect{|f| File.basename(File.dirname(f))}
|
13
|
+
else
|
14
|
+
Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*').select{|f| File.directory? f}.collect{|f| File.basename(f)}
|
15
|
+
end
|
12
16
|
end
|
13
17
|
|
14
18
|
|
15
19
|
def self.name(org)
|
20
|
+
raise OrganismNotProcessedError, "Missing 'name' file" if ! File.exists? File.join(Rbbt.datadir,"organisms/#{ org }/name")
|
16
21
|
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/name"))
|
17
22
|
end
|
18
23
|
|
@@ -30,7 +35,13 @@ module Organism
|
|
30
35
|
id_types = {}
|
31
36
|
formats = supported_ids(org)
|
32
37
|
|
33
|
-
|
38
|
+
text = Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"))
|
39
|
+
|
40
|
+
if text.respond_to? :collect
|
41
|
+
lines = text.collect
|
42
|
+
else
|
43
|
+
lines = text.lines
|
44
|
+
end
|
34
45
|
|
35
46
|
lines.each{|l|
|
36
47
|
ids_per_type = l.split(/\t/)
|
@@ -68,7 +79,10 @@ module Organism
|
|
68
79
|
format_count.select{|k,v| v > (query.length / 10)}.sort{|a,b| b[1] <=> a[1]}.first
|
69
80
|
end
|
70
81
|
|
71
|
-
|
82
|
+
# FIXME: The NER related stuff is harder to install, thats why we hide the
|
83
|
+
# requires next to where they are needed, next to options
|
84
|
+
|
85
|
+
def self.ner(org, type=:rner, options = {})
|
72
86
|
|
73
87
|
case type.to_sym
|
74
88
|
when :abner
|
@@ -90,6 +104,7 @@ module Organism
|
|
90
104
|
end
|
91
105
|
|
92
106
|
def self.norm(org, to_entrez = nil)
|
107
|
+
require 'rbbt/ner/rnorm'
|
93
108
|
if to_entrez.nil?
|
94
109
|
to_entrez = id_index(org, :native => 'Entrez Gene ID', :other => [supported_ids(org).first])
|
95
110
|
end
|
@@ -109,7 +124,7 @@ module Organism
|
|
109
124
|
|
110
125
|
def self.goterms(org)
|
111
126
|
goterms = {}
|
112
|
-
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).
|
127
|
+
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each_line{|l|
|
113
128
|
gene, go = l.chomp.split(/\t/)
|
114
129
|
goterms[gene.strip] ||= []
|
115
130
|
goterms[gene.strip] << go.strip
|
@@ -118,7 +133,7 @@ module Organism
|
|
118
133
|
end
|
119
134
|
|
120
135
|
def self.literature(org)
|
121
|
-
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).
|
136
|
+
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).scan(/\d+/)
|
122
137
|
end
|
123
138
|
|
124
139
|
def self.gene_literature(org)
|
@@ -133,7 +148,7 @@ module Organism
|
|
133
148
|
formats = []
|
134
149
|
examples = [] if options[:examples]
|
135
150
|
i= 0
|
136
|
-
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).
|
151
|
+
Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).each_line{|l|
|
137
152
|
if i == 0
|
138
153
|
i += 1
|
139
154
|
next unless l=~/^\s*#/
|
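Note on `Organism.all(installed = true)` above: an organism counts as installed when its directory holds an `identifiers` file, while any directory under `organisms` counts as known. A self-contained sketch against a temporary directory; `organisms` is a hypothetical function name, the directory layout is taken from the glob patterns in the diff, and results are sorted here only to make the output deterministic.

```ruby
require 'tmpdir'
require 'fileutils'

# Lists organism keywords under +datadir+/organisms.
# Hypothetical helper mirroring the two glob patterns in the diff.
def organisms(datadir, installed = true)
  base = File.join(datadir, 'organisms')
  if installed
    Dir.glob(File.join(base, '*', 'identifiers')).
      collect { |f| File.basename(File.dirname(f)) }.sort
  else
    Dir.glob(File.join(base, '*')).
      select  { |f| File.directory?(f) }.
      collect { |f| File.basename(f) }.sort
  end
end

all = installed = nil
Dir.mktmpdir do |datadir|
  %w(sgd human).each { |org| FileUtils.mkdir_p(File.join(datadir, 'organisms', org)) }
  FileUtils.touch(File.join(datadir, 'organisms', 'sgd', 'identifiers'))
  all       = organisms(datadir, false)
  installed = organisms(datadir)
end

all.each { |org| puts "#{org} #{installed.include?(org) ? '(installed)' : ''}" }
```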
data/lib/rbbt/util/index.rb
CHANGED
@@ -42,28 +42,3 @@ module Index
|
|
42
42
|
index
|
43
43
|
end
|
44
44
|
end
|
45
|
-
|
46
|
-
if __FILE__ == $0
|
47
|
-
|
48
|
-
require 'benchmark'
|
49
|
-
|
50
|
-
normal = nil
|
51
|
-
puts "Normal " + Benchmark.measure{
|
52
|
-
normal = Index.index('/home/miki/rbbt/data/organisms/human/identifiers',:trie => false, :case_sensitive => false)
|
53
|
-
}.to_s
|
54
|
-
|
55
|
-
|
56
|
-
ids = Open.read('/home/miki/git/MARQ/test/GDS1375_malignant_vs_normal_up.genes').collect{|l| l.chomp.strip.upcase}
|
57
|
-
|
58
|
-
new = nil
|
59
|
-
|
60
|
-
puts ids.inspect
|
61
|
-
puts "normal " + Benchmark.measure{
|
62
|
-
100.times{
|
63
|
-
new = ids.collect{|id| normal[id]}
|
64
|
-
}
|
65
|
-
}.to_s
|
66
|
-
|
67
|
-
puts new.inspect
|
68
|
-
|
69
|
-
end
|
data/lib/rbbt/util/misc.rb
CHANGED
@@ -1,8 +1,12 @@
|
|
1
1
|
require 'rbbt'
|
2
2
|
require 'rbbt/util/open'
|
3
3
|
|
4
|
-
$consonants = Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).collect{|l| l.chomp}.uniq
|
5
4
|
class String
|
5
|
+
CONSONANTS = []
|
6
|
+
if File.exists? File.join(Rbbt.datadir, 'wordlists/consonants')
|
7
|
+
Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).each_line{|l| CONSONANTS << l.chomp}
|
8
|
+
end
|
9
|
+
|
6
10
|
# Uses heuristics to checks if a string seems like a special word, like a gene name.
|
7
11
|
def is_special?
|
8
12
|
# Only consonants
|
@@ -22,7 +26,7 @@ class String
|
|
22
26
|
# Dashed word
|
23
27
|
return true if self =~ /(^\w-|-\w$)/
|
24
28
|
# To many consonants (very heuristic)
|
25
|
-
if self =~ /([^aeiouy]{3,})/i &&
|
29
|
+
if self =~ /([^aeiouy]{3,})/i && !CONSONANTS.include?($1.downcase)
|
26
30
|
return true
|
27
31
|
end
|
28
32
|
|
@@ -83,7 +87,8 @@ $greek = {
|
|
83
87
|
|
84
88
|
$inverse_greek = Hash.new
|
85
89
|
$greek.each{|l,s| $inverse_greek[s] = l }
|
86
|
-
|
90
|
+
|
91
|
+
$stopwords = Open.read(File.join(Rbbt.datadir, 'wordlists/stopwords')).scan(/\w+/) if File.exists? File.join(Rbbt.datadir, 'wordlists/stopwords')
|
87
92
|
|
88
93
|
class Array
|
89
94
|
|
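Note on the `CONSONANTS` change above: the heuristic flags a word when its first run of three or more non-vowel characters is not in a list of known clusters. A standalone sketch of that test follows. `KNOWN_CLUSTERS` is a made-up stand-in for the contents of `wordlists/consonants`, and `consonant_heavy?` is a hypothetical name.

```ruby
# Made-up sample of common consonant clusters; the real list lives in
# wordlists/consonants and is not shown in this diff.
KNOWN_CLUSTERS = %w(str thr ngl)

# True when the first run of 3+ non-vowel characters is not a known cluster.
# Digits count as non-vowels, as in the original regular expression.
def consonant_heavy?(word, known = KNOWN_CLUSTERS)
  m = word.match(/([^aeiouy]{3,})/i)
  !m.nil? && !known.include?(m[1].downcase)
end

%w(strong BRCA1 the).each do |w|
  puts "#{w}: #{consonant_heavy?(w)}"
end
```

With an empty list, as happens in the diff when the wordlists are not installed, every such run is flagged, so the check becomes stricter rather than failing.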
data/lib/rbbt/util/open.rb
CHANGED
data/lib/rbbt/util/simpleDSL.rb
CHANGED
@@ -64,7 +64,7 @@ class SimpleDSL
|
|
64
64
|
def initialize(method = nil, file = nil, &block)
|
65
65
|
@config = {}
|
66
66
|
if file
|
67
|
-
raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (rbbt_config
|
67
|
+
raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (use rbbt_config)." unless File.exists? file
|
68
68
|
parse(method, file)
|
69
69
|
end
|
70
70
|
|
data/tasks/install.rake
CHANGED
@@ -85,6 +85,7 @@ task 'organisms' do
|
|
85
85
|
end
|
86
86
|
FileUtils.cp f , File.join(directory, "#{ org }/Rakefile")
|
87
87
|
}
|
88
|
+
`cd #{directory}; rake names`
|
88
89
|
end
|
89
90
|
|
90
91
|
task 'ner' do
|
@@ -102,10 +103,10 @@ end
|
|
102
103
|
task 'norm' do
|
103
104
|
directory = "#{$datadir}/norm"
|
104
105
|
FileUtils.mkdir_p directory
|
105
|
-
%w(Rakefile config).each{|f|
|
106
|
+
%w(Rakefile config functions.sh).each{|f|
|
106
107
|
FileUtils.cp_r File.join($scriptdir, "norm/#{ f }"), directory
|
107
108
|
}
|
108
|
-
%w(results).each{|d|
|
109
|
+
%w(results models).each{|d|
|
109
110
|
FileUtils.mkdir_p File.join(directory, d)
|
110
111
|
}
|
111
112
|
end
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: rbbt
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0.
|
4
|
+
version: 1.0.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Miguel Vazquez
|
@@ -9,10 +9,59 @@ autorequire:
|
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
11
|
|
12
|
-
date: 2009-
|
12
|
+
date: 2009-11-02 00:00:00 +01:00
|
13
13
|
default_executable: rbbt_config
|
14
|
-
dependencies:
|
15
|
-
|
14
|
+
dependencies:
|
15
|
+
- !ruby/object:Gem::Dependency
|
16
|
+
name: rake
|
17
|
+
type: :runtime
|
18
|
+
version_requirement:
|
19
|
+
version_requirements: !ruby/object:Gem::Requirement
|
20
|
+
requirements:
|
21
|
+
- - ">="
|
22
|
+
- !ruby/object:Gem::Version
|
23
|
+
version: 0.8.4
|
24
|
+
version:
|
25
|
+
- !ruby/object:Gem::Dependency
|
26
|
+
name: simpleconsole
|
27
|
+
type: :runtime
|
28
|
+
version_requirement:
|
29
|
+
version_requirements: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - ">="
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: "0"
|
34
|
+
version:
|
35
|
+
- !ruby/object:Gem::Dependency
|
36
|
+
name: stemmer
|
37
|
+
type: :runtime
|
38
|
+
version_requirement:
|
39
|
+
version_requirements: !ruby/object:Gem::Requirement
|
40
|
+
requirements:
|
41
|
+
- - ">="
|
42
|
+
- !ruby/object:Gem::Version
|
43
|
+
version: "0"
|
44
|
+
version:
|
45
|
+
- !ruby/object:Gem::Dependency
|
46
|
+
name: progress-monitor
|
47
|
+
type: :runtime
|
48
|
+
version_requirement:
|
49
|
+
version_requirements: !ruby/object:Gem::Requirement
|
50
|
+
requirements:
|
51
|
+
- - ">="
|
52
|
+
- !ruby/object:Gem::Version
|
53
|
+
version: "0"
|
54
|
+
version:
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: simpleconsole
|
57
|
+
type: :runtime
|
58
|
+
version_requirement:
|
59
|
+
version_requirements: !ruby/object:Gem::Requirement
|
60
|
+
requirements:
|
61
|
+
- - ">="
|
62
|
+
- !ruby/object:Gem::Version
|
63
|
+
version: "0"
|
64
|
+
version:
|
16
65
|
description: |-
|
17
66
|
This toolbox includes modules for text-mining, like Named Entity Recognition and Normalization and document
|
18
67
|
classification, as well as data integration modules that interface with PubMed, Entrez Gene, BioMart.
|
@@ -78,7 +127,6 @@ files:
|
|
78
127
|
- lib/rbbt/util/open.rb
|
79
128
|
- lib/rbbt/util/simpleDSL.rb
|
80
129
|
- lib/rbbt/util/tmpfile.rb
|
81
|
-
- lib/rbbt/version.rb
|
82
130
|
- tasks/install.rake
|
83
131
|
- LICENSE
|
84
132
|
- README.rdoc
|