rbbt 1.0.0 → 1.0.2

Files changed (25)

data/README.rdoc +129 -6
data/bin/rbbt_config +87 -21
data/install_scripts/classifier/Rakefile +6 -1
data/install_scripts/ner/Rakefile +1 -1
data/install_scripts/norm/Rakefile +2 -1
data/install_scripts/norm/functions.sh +10 -8
data/install_scripts/organisms/Rakefile +18 -0
data/install_scripts/organisms/rake-include.rb +3 -3
data/install_scripts/organisms/worm.Rakefile +1 -1
data/lib/rbbt.rb +1 -1
data/lib/rbbt/bow/bow.rb +1 -0
data/lib/rbbt/bow/classifier.rb +0 -2
data/lib/rbbt/bow/dictionary.rb +0 -31
data/lib/rbbt/ner/rnorm.rb +1 -0
data/lib/rbbt/sources/biocreative.rb +2 -2
data/lib/rbbt/sources/biomart.rb +0 -1
data/lib/rbbt/sources/entrez.rb +1 -1
data/lib/rbbt/sources/organism.rb +24 -9
data/lib/rbbt/util/index.rb +0 -25
data/lib/rbbt/util/misc.rb +8 -3
data/lib/rbbt/util/open.rb +1 -1
data/lib/rbbt/util/simpleDSL.rb +1 -1
data/tasks/install.rake +3 -2
metadata +53 -5
data/lib/rbbt/version.rb +0 -10

data/README.rdoc CHANGED

@@ -1,17 +1,140 @@
 = rbbt
-Description goes here.
+Rbbt stands for Ruby Bio-Text, it started as an API for text mining developed
+for SENT[http://sent.dacya.ucm.es], but its functionality has been used
+for other applications as well, such as MARQ[http://marq.dacya.ucm.es].
-== Note on Patches/Pull Requests
+== Important Note
+Some unexpected gem dependencies may appear.
+Rbbt covers several functionalities, some will work right away, some require to
+install dependencies or download and process data from the internet. Since not
+all users are likely to need all the functionalities, this gems dependencies
+include only the very basic requirements. Dependencies may appear unexpectedly
+when using new parts of the API.
+== Functionality
+=== Data sources interface
+PubMed:: Making queries and retrieving articles.
+BioMart:: Making queries to BioMart programmatically. It can divide a large query into smaller ones and merge the results.
+Entrez:: Retrieving gene entries, associated articles, and gene synonyms and aliases.
+Biocreative:: Using the competition test and training data to train and evaluate Named Entity Extraction models and Gene Mention Normalization.
+=== Text mining tasks
+BagOfWords:: Bag-of-words representation of text. Chunk text into terms, which can be unigrams or bi-grams, remove stopwords, build a term thesaurus using a TF_IDF (term frequency inverse document frequency) or a KL (Kullback-Leibler divergence) Dictionary, and extract a bag-of-words representations suitable for the Classifier.
+Classifier:: Using R to build classification models and to use them to classify new entires. Currently the models are Support Vector Machines.
+NER:: Named Entity Extraction. Currently there are 3 alternatives to do this Abner, Banner, RegExpNER, and NER. The first two are third party Java systems that require the rjb[rjb.rubyforge.org/] (Ruby Java Bridge) gem to be installed. The third one, RegExpNER, is a simple regular-expression based system which can be used when there is not enough data to train a CRF based system, for example, to find Polysearch terms. The last one, the default, is a reimplementation of a CRF-based system, such as Abner and Banner, completely configurable using a simple DSL (domain specific language).
+Normalizer:: Resolve gene mentions to the actual genes they refer to. It compares the gene mention to all possible gene names and synonyms to find the best match. It is configurable using a DSL.
+=== Organisms support
+Using configuration files rbbt can support different organisms. The system is prepared to parse organism specific database files and merge them with Entrez and BioMart. Basically producing the following information
+Lexicon:: Listing the synonyms for each gene
+Identifiers:: Listing different identifiers for each gene like Entrez Gene Ids, Unigene, Affymetrix probe ids, etc. This is not the same as the lexicon which holds names, not identifiers.
+GO:: Listing associations of genes to GO terms.
+PubMed articles:: List articles associated to each gene, as listed in Entrez or listed to support of GO associations.
+With this information rbbt offers the following functionality via the Organism class
+NER and Normalization:: Loads custom models for Named Entity Extraction and Gene Mention Normalization
+Identifiers translation:: Translates gene identifiers between formats.
+Organisms in rbbt are identified using a keyword. This is the list of organisms currently supported with their associated keywords:
+Candida albicans:: cgd
+Mus musculus:: mgi
+Rattus norvegicus:: rgd
+Saccharomyces cerevisiae:: sgd
+Arabidopsis thaliana:: tair
+Caenorhabditis elegans:: worm
+Homo sapiens:: human
+Schizosaccharomyces pombe:: pombe
+=== Other
+Cache:: The system caches PubMed articles and Entrez gene entries, this is considered a persistent cache since these items are unlikely to change. Also caches any data downloaded from the internet, like BioMart queries for example, into a non-persistent cache that can be purged to perform updates to the system.
+Tab separated file helpers:: The data in rbbt is saved into tab separated files and is loaded into Hash. Modules like Open or ArrayHash help dealing with these files and data structures.
+= Installation
+Install the gem normally <tt>gem install rbbt</tt>. The gem includes a configuration tool rbbt_config. The first time you run it it will ask you to configure some paths. After that you may use it to process data for different tasks. Lets see some scenarios:
+=== Using rbbt to translate identifiers
+1. Do <tt>rbbt_config install identifiers</tt> to do deploy the configuration files and download entrez data, this needs to be done just once.
+3. Now you may do <tt>rbbt_config update organisms</tt> toprocess all the organisms, or <tt>rbbt_config update organisms -o sgd</tt> to process only yeast (sgd).
+4. You may now use a script like this to translate gene identifiers from yeast feed from the standard input
+  require 'rbbt/sources/organism'
+  index = Organism.id_index('sgd', :native => 'Entrez Gene Id')
+  STDIN.each_line{|l| puts "#{l.chomp} => #{index[l.chomp]}"}
+=== Using rbbt to find gene mentions in text
+First prepare the organisms as you did in the previous section. Next, if you want to use the default NER module:
+1. Install the Biocreative data used to train the model and compile the CRF++ plugin, <tt>rbbt_config install rner</tt>. You may need at this point to install ParseTree and ruby2ruby
+2. Build the module for a particular organism <tt>rbbt_config update ner -o sgd</tt>. You need to have the gems ParseTree and ruby2ruby for this to work. This process can take a long time.
+Or, if you wan to use Abner or Banner:
+1. Download and install the packages <tt>rbbt_config install java_ner</tt>
+You may now, for example, find mentions to genes in articles from a PubMed query using this script
+    require 'rbbt/sources/organism'
+    require 'rbbt/sources/pubmed'
+    # type = :abner
+    # type = :banner
+    type = :rner
+    ner = Organism.ner('sgd', type )
+    pmids = PubMed.query(ARGV[0], 500)
+    PubMed.get_article(pmids).each{|pmid,article|
+      mentions = ner.extract(article.text)
+      puts pmid
+      puts article.text
+      puts "Mentions: " << mentions.uniq.join(", ")
+      puts
+    }
+== More Installation Guidelines
+This is the complete list of gem requirements: <tt>ParseTree ruby2ruby simpleconsole rjb rsruby stemmer rand rake progress-monitor</tt>. Some of these gems to not work with ruby 1.9 at the time, or may be a bit more complicated to install, for that reason *they are not reported as dependencies and are only required when they are about to be used*. Note that some of these gems are in the gemcutter repository, you may need to install the <tt>gemcutter</tt> gem and do <tt>gem tumble</tt>
+Some of the API requires to have some data processed using rbbt_config. This command is used to install third party software, download data from the internet, or build models. The command <tt>rbbt_config install all</tt> will install and process everything, this will take a long time, specially building the NER models. So you might want to start with the basic install and include more things as they are needed.
+= Note on Patches/Pull Requests
 * Fork the project.
 * Make your feature addition or bug fix.
-* Add tests for it. This is important so I don't break it in a
-  future version unintentionally.
+* Add tests for it. This is important so I don't break it in a future version unintentionally.
 * Commit, do not mess with rakefile, version, or history.
-  (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+  (if you want to have your own version, that is fine, but bump version in a commit by itself that I can ignore when I pull)
 * Send me a pull request. Bonus points for topic branches.
-== Copyright
+= Copyright
 Copyright (c) 2009 Miguel Vazquez. See LICENSE for details.
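
The new README says the heavier gems are not declared as dependencies and are only required when a feature is about to use them. A minimal sketch of that pattern follows. `require_optional` is a hypothetical helper and not part of rbbt; the library name in the failing call is deliberately nonexistent.

```ruby
# Hypothetical helper (not part of rbbt): require an optional library only
# when the feature that needs it is first used, and fail with a hint.
def require_optional(lib, feature)
  require lib
  true
rescue LoadError
  raise LoadError, "#{feature} needs the '#{lib}' gem. Try: gem install #{lib}"
end

# 'set' ships with Ruby, so this call succeeds without installing anything.
require_optional('set', 'Example feature')

begin
  # Deliberately nonexistent library name, used only to show the error path.
  require_optional('rbbt_no_such_library', 'Java NER')
rescue LoadError => e
  puts e.message
end
```

The trade-off is that a missing gem only surfaces at first use, which is why the error message should name the gem to install.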

data/bin/rbbt_config CHANGED

@@ -12,29 +12,62 @@ rescue Rbbt::NoConfig
   $noconfig = true
 end
+TASKS= %w(organisms ner norm classifier biocreative entrez go wordlists polysearch abner banner crf++)
 $USAGE =<<EOT
 #{__FILE__} <action> [<subaction>] [--force] [--organism <org>]
   actions:
     * configure:   Set paths for data, cache, and tmp directories
     * install:
-      * basic:     Third party software
-      * databases: Entrez and Biocreative
-      * models:    Gene Mention and Classification
-      * organisms: Rules to gather data for organisms
-      * all:       3party wordlists entrez biocreative go ner norm classifier organisms polysearch
-    * update:
-      * organisms: Gather data for organisms
-      * ner:       Build Named Entity Recognition Models for Gene Mention
-      * classification:
-                   Build Function/Process Classifiers
+      Basic subactions:
+      * organisms:     Install processing scripts to process organisms
+      * ner:           Install processing scripts for Named Entity Recognition
+      * norm:          Install processing scripts for Gene Mention Normalization
+      * classifier:    Install processing scripts for Classification
+      * biocreative:   Download and train and test data from BioCreative
+      * entrez:        Download and install data from Entrez
+      * go:            Download and install data from The Gene Ontology
+      * wordlists:     Install word lists
+      * polysearch:    Download and install Polysearch dictionaries
+      * abner:         Download and install Abner NER system:      http://pages.cs.wisc.edu/~bsettles/abner/
+      * banner:        Download and install Banner NER system:     http://sourceforge.net/projects/banner/
+      * crf++:         Download and install CRF++ a CRF framework: http://crfpp.sourceforge.net/
+      Subactions grouped by task:
+      * identifiers:  entrez, organisms
+      * rner:         entrez, organisms, biocreative, ner, crf++
+      * java_ner:     entrez, organisms, abner, banner
+      * norm: entrez  organisms, biocreative, crf++, norm, polysearch
+      * bow:          organisms, wordlists
+      * classifier:   organisms, wordlists, classifier, go
+      * all:          #{TASKS.join(", ")}
+    * update:
+      * organisms:      Gather organisms data
+      * ner:            Build Named Entity Recognition Models. Mention Normalization needs no training.
+      * classification: Build Function/Process Classifiers
+      --force:          Rebuild models or reprocess organism data even if present. You may want to purge the cache
+                        to be up to date with the data in the internet.
+      --organism:       Gather data only for that particular organism. The organism must be specified by the
+                        keyword. Use '#{__FILE__} organisms' to see find the keywords.
     * purge_cache: Clean the non-persistent cache, which holds general things
         downloaded using Open.read, like organism identifiers downloaded from
         BioMart. The persistent cache, which hold pubmed articles or entrez gene
         descriptions, is not cleaned, as these are not likely to change
+    * organisms: Show a list of all organisms along with their identifier in the system
 EOT
@@ -44,6 +77,10 @@ class Controller < SimpleConsole::Controller
   params :bool => {:f => :force},
          :string => {:o => :organism}
+  def organisms
+  end
   def default
     render :action => :usage
   end
@@ -73,21 +110,39 @@ class Controller < SimpleConsole::Controller
   def install
     raise "Run #{__FILE__} configure first to configure rbbt" if $noconfig
     case params[:id]
-    when "basic"
-      @tasks = %w(3party wordlists polysearch)
-    when "databases"
-      @tasks = %w(entrez biocreative go)
-    when "models"
-      @tasks = %w(ner norm classifier)
-    when "organisms"
-      @tasks = %w(organisms)
+    when "identifiers"
+      require 'rbbt/sources/organism'
+      require 'rbbt/sources/entrez'
+      @tasks = %w(entrez organisms)
+    when "rner"
+      require 'rbbt/ner/rner'
+      require 'rbbt/sources/entrez'
+      @tasks = %w(entrez organisms biocreative ner crf++)
+    when "java_ner"
+      require 'rjb'
+      @tasks = %w(entrez organisms abner banner)
+    when "norm"
+      require 'rbbt/ner/rner'
+      require 'rbbt/ner/rnorm'
+      require 'rbbt/ner/regexpNER'
+      require 'rbbt/sources/entrez'
+      @tasks = %w(entrez organisms biocreative crf++ norm polysearch)
+    when "bow"
+      require 'rbbt/bow/bow'
+      require 'rbbt/bow/dictionary'
+      @tasks = %w(organisms wordlists)
+    when "classifier"
+      require 'rbbt/bow/bow'
+      require 'rbbt/bow/dictionary'
+      require 'rbbt/bow/classifier'
+      @tasks = %w(organisms wordlists classifier go)
     when "all"
-      @tasks = %w(3party wordlists entrez biocreative go ner norm classifier organisms polysearch)
+      @tasks = TASKS
     when nil
       redirect_to :action => :help, :id => :install
     else
+      redirect_to :action => :help, :id => :install if ! TASKS.include? params[:id]
       @tasks = [params[:id]]
     end
@@ -109,6 +164,17 @@ class View < SimpleConsole::View
     puts $USAGE
   end
+  def organisms
+      require 'rbbt/sources/organism'
+      all = Organism.all(false)
+      installed = Organism.all
+      all.each{|org|
+          puts "#{Organism.name(org)}: #{org} #{installed.include?(org) ? "(installed)" : ""}"
+      }
+  end
   def install
     load File.join(Rbbt.rootdir, 'tasks/install.rake')
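
The install action above maps a subaction name either to a group of tasks or to a single task. The sketch below restates that mapping as a lookup table, with group members copied from the usage text in this hunk. `GROUPS` and `resolve_tasks` are hypothetical names, and the sketch leaves out the `require` calls each branch performs. In the diff the `else` branch still assigns `@tasks` after calling `redirect_to`; whether SimpleConsole's `redirect_to` stops execution is not shown here, so the sketch returns `nil` for unknown names instead.

```ruby
TASKS = %w(organisms ner norm classifier biocreative entrez go wordlists polysearch abner banner crf++)

# Group names and members as listed in the 1.0.2 usage text. "norm" and
# "classifier" are both task names and group names; the group wins, as in
# the case statement.
GROUPS = {
  'identifiers' => %w(entrez organisms),
  'rner'        => %w(entrez organisms biocreative ner crf++),
  'java_ner'    => %w(entrez organisms abner banner),
  'norm'        => %w(entrez organisms biocreative crf++ norm polysearch),
  'bow'         => %w(organisms wordlists),
  'classifier'  => %w(organisms wordlists classifier go),
  'all'         => TASKS
}

# Hypothetical helper: returns the task list, or nil for an unknown name.
def resolve_tasks(id)
  return GROUPS[id] if GROUPS.key?(id)
  return [id] if TASKS.include?(id)
  nil
end

p resolve_tasks('identifiers')
p resolve_tasks('go')
p resolve_tasks('bogus')
```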

data/install_scripts/classifier/Rakefile CHANGED

@@ -85,7 +85,12 @@ rule (/results\/(.*)/) => lambda{|n| n.sub(/results/,'model')} do |t|
   ndocs    = 100
-  used = Open.read(features).collect{|l| l.chomp.split(/\t/).first}[1..-1]
+  used = []
+  if "".respond_to? :collect
+    used = Open.read(features).collect{|l| l.chomp.split(/\t/).first}[1..-1]
+  else
+    used = Open.read(features).lines.collect{|l| l.chomp.split(/\t/).first}[1..-1]
+  end
   classifier = Classifier.new(model)
   go  = Organism.gene_literature_go(org).collect{|gene, pmids| pmids}.flatten.uniq - used
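
This hunk branches on `"".respond_to? :collect` because `String` stopped being enumerable by line in Ruby 1.9. Several other hunks in this diff switch to `each_line` instead, and the same call could serve here without a branch. A minimal sketch, assuming `Open.read` returns a plain `String`; the sample text and the `first_column` helper are made up for illustration.

```ruby
# Stand-in for the features file: a header row, then one row per document.
text = "ID\tfeat1\tfeat2\n1001\t0.1\t0.2\n1002\t0.3\t0.4\n"

# Hypothetical helper: first column of every row after the header.
# String#each_line without a block returns an Enumerator on Ruby 1.9+,
# so the result can be mapped directly.
def first_column(text)
  text.each_line.map { |l| l.chomp.split(/\t/).first }.drop(1)
end

used = first_column(text)
p used
```

Unlike `[1..-1]`, `drop(1)` returns an empty array rather than `nil` when the input has no lines.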

data/install_scripts/ner/Rakefile CHANGED

@@ -37,7 +37,7 @@ def BC2GN_features(dataset, outfile)
     data[code] = {}
     data[code][:text] = Open.read(f)
   }
-  Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).each{|l|
+  Open.read(File.join(Rbbt.datadir,'biocreative','BC2GN',dataset,'genelist')).each_line{|l|
    code, gene, mention = l.chomp.split(/\t/)
    data[code][:mentions] ||= []
    data[code][:mentions] << mention

data/install_scripts/norm/Rakefile CHANGED

@@ -2,9 +2,10 @@ require 'rbbt'
 require 'rbbt/sources/organism'
 require 'rbbt/util/open'
 require 'rbbt/ner/rner'
+require 'rbbt/ner/rnorm'
-require 'progress-meter'
+require 'progress-monitor'
 $type = ENV['ner'] || :rner
 $debug = !ENV['debug'].nil?

data/install_scripts/norm/functions.sh CHANGED

@@ -1,21 +1,23 @@
 #!/bin/bash
 function norm(){
-    o=$1
+    organism=$1
     shift
-    s=$1
+    dataset=$1
     shift
-    n=$1
+    ner=$1
     shift
-    echo "rm results/${o}_$s; rake results/${o}_$s.eval ner=$n $@ > ${o}_$s.log_$n; tail results/${o}_$s.eval"
-    rm results/${o}_$s; rake results/${o}_$s.eval ner=$n $@ > ${o}_$s.log_$n; tail results/${o}_$s.eval
+    CMD="rm results/${organism}_$dataset; rake results/${organism}_$dataset.eval ner=$ner $@ > ${organism}_$dataset.log_$ner; tail results/${organism}_$dataset.eval"
+    echo $CMD
+    $CMD
 }
 function norm_2(){
-    n=$1
+    ner=$1
     shift
-    echo "rm results/bc2gn; rake results/bc2gn.eval ner=$n $@ > bc2gn.log_$n; tail results/bc2gn.eval"
-    rm results/bc2gn; rake results/bc2gn.eval ner=$n $@ > bc2gn.log_$n; tail results/bc2gn.eval
+    CMD="rm results/bc2gn; rake results/bc2gn.eval ner=$ner $@ > bc2gn.log_$ner; tail results/bc2gn.eval"
+    echo $CMD
+    $CMD
 }
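
One caveat on the rewritten functions: bash does not re-parse `;` or `>` when they arrive through an unquoted variable expansion, so `$CMD` would pass those characters to `rm` as literal arguments rather than running three commands with a redirect. The sketch below is a hypothetical Ruby counterpart of `norm()` that keeps the echo-then-run behaviour by building each step as its own string. `norm_steps` and `run_norm` are not part of rbbt, and the sketch assumes the names contain no shell metacharacters.

```ruby
# Hypothetical Ruby counterpart of norm(); arguments are assumed to be
# plain words without spaces or shell metacharacters.
def norm_steps(organism, dataset, ner, extra = [])
  result = "results/#{organism}_#{dataset}"
  rake   = (["rake #{result}.eval", "ner=#{ner}"] + extra).join(' ')
  [
    "rm #{result}",
    "#{rake} > #{organism}_#{dataset}.log_#{ner}",
    "tail #{result}.eval"
  ]
end

# Echo each step, then run it. Kernel#system hands a single string that
# contains shell metacharacters to the shell, so the redirect works.
# Like the ';' chain in the script, later steps run even if one fails.
def run_norm(*args)
  norm_steps(*args).each do |step|
    puts step
    system(step)
  end
end

puts norm_steps('sgd', 'bc2gn', 'rner')
```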

data/install_scripts/organisms/Rakefile CHANGED

@@ -1,5 +1,23 @@
 $org = [$org, ENV['organism'],nil].reject{|e| e.nil? }.first
+task 'names' do
+  orgs = Dir.glob('*').
+    select{|t|
+    File.directory?(t ) &&
+      File.exist?(t + '/Rakefile')
+  }
+  orgs.each{|org|
+    pid = Process.fork{
+      Dir.chdir(org)
+      load 'Rakefile'
+      Rake::Task['name'].invoke
+    }
+    Process.waitpid pid
+  }
+end
 task 'default' do
   if $org
     orgs = [$org]
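
The new `names` task forks once per organism directory, presumably so that each `chdir` and `load 'Rakefile'` stays isolated in its own child process. The sketch below shows the same fork-and-wait pattern without Rake. `in_each_subdir` is a hypothetical helper, the directory names are examples, and `Process.fork` is only available on Unix-like platforms.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: run the block once per subdirectory that contains
# `marker`, each time in a forked child, and wait for that child.
def in_each_subdir(root, marker = 'Rakefile')
  dirs = Dir.glob(File.join(root, '*')).select { |d|
    File.directory?(d) && File.exist?(File.join(d, marker))
  }.sort
  dirs.map { |dir|
    $stdout.flush # avoid duplicating buffered output in the child
    pid = Process.fork {
      Dir.chdir(dir) # only affects the child process
      yield File.basename(dir)
    }
    Process.waitpid(pid)
    [File.basename(dir), $?.success?]
  }
end

Dir.mktmpdir do |root|
  %w(sgd worm).each { |org|
    FileUtils.mkdir_p(File.join(root, org))
    FileUtils.touch(File.join(root, org, 'Rakefile'))
  }
  FileUtils.mkdir_p(File.join(root, 'no_rakefile'))
  p in_each_subdir(root) { |org| File.write('name', org) }
end
```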

data/install_scripts/organisms/rake-include.rb CHANGED

@@ -88,7 +88,7 @@ file 'lexicon' do
       "#{ code }\t" + name_lists.flatten.select{|n| n.to_s != ""}.uniq.join("\t")
     }.join("\n"))
-rescue Entrez::NoFile
+rescue Entrez::NoFileError
   puts "Lexicon not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
 end
 end
@@ -185,7 +185,7 @@ file 'identifiers' do
     }
     fout.close
-  rescue Entrez::NoFile
+  rescue Entrez::NoFileError
     puts "Identifiers not produced for #{$name}, install the entrez gene_info file (rbbt_config install entrez)."
   end
 end
@@ -237,7 +237,7 @@ file 'gene.pmid' do
       }.compact.join("\n")
     }.compact.join("\n")
               )
-  rescue Entrez::NoFile
+  rescue Entrez::NoFileError
     puts "Gene article associations from entrez not produced, install the gene2pumbed file (rbbt_config install entrez)."
   end

data/install_scripts/organisms/worm.Rakefile CHANGED

@@ -1,6 +1,6 @@
 require __FILE__.sub(/[^\/]*$/,'') + '../rake-include'
-$name = "Caenorhabditis elegans "
+$name = "Caenorhabditis elegans"
 $native_id = "WormBase ID"

data/lib/rbbt.rb CHANGED

@@ -59,7 +59,7 @@ module Rbbt
     # For some reason banner.jar must be loaded before abner.jar
     ENV['CLASSPATH'] ||= ""
-    ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party/#{pkg}/#{ pkg }.jar")}.join(":")
+    ENV['CLASSPATH'] += ":" + %w(banner abner).collect{|pkg| File.join(datadir, "third_party", pkg, "#{ pkg }.jar")}.join(":")
   end
   def self.rootdir
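
The change above passes each path component to `File.join` separately. A related sketch, under the assumption that a portable separator is wanted, builds the whole classpath with `File::PATH_SEPARATOR` and skips an empty `CLASSPATH`, which the `":" +` form turns into a leading empty entry. The `datadir` value is a made-up stand-in for `Rbbt.datadir`, and the sketch does not modify `ENV`.

```ruby
# Hypothetical stand-in for Rbbt.datadir.
datadir = '/tmp/rbbt/data'

# banner first, then abner, matching the load order the comment in the diff asks for.
jars = %w(banner abner).map { |pkg| File.join(datadir, 'third_party', pkg, "#{pkg}.jar") }

# Keep any existing CLASSPATH entries, dropping a nil or empty value.
classpath = [ENV['CLASSPATH'], *jars].compact.reject(&:empty?).join(File::PATH_SEPARATOR)
puts classpath
```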

data/lib/rbbt/bow/bow.rb CHANGED

@@ -17,6 +17,7 @@ module BagOfWords
   # 'rbbt/util/misc'.
   def self.words(text)
     return [] if text.nil?
+    raise "Stopword list not loaded. Have you installed the wordlists? (rbbt_config install wordlists)" if $stopwords.nil?
     text.scan(/\w+/).
       collect{|word| word.downcase.stem}.
       select{|word|
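
The guard added above turns a confusing `nil` failure into an explicit message when the stopword list was never loaded. The sketch below is a simplified, hypothetical version of `BagOfWords.words` that shows the same guard: it takes the stopword list as an argument instead of reading `$stopwords`, and it omits stemming because that needs the stemmer gem.

```ruby
# Hypothetical, simplified take on BagOfWords.words: no stemming, and the
# stopword list is passed in rather than read from a global.
def words(text, stopwords)
  return [] if text.nil?
  raise ArgumentError, "Stopword list not loaded" if stopwords.nil?
  text.scan(/\w+/).map(&:downcase).reject { |w| stopwords.include?(w) }
end

p words("The gene is expressed in yeast", %w(the is in))
```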

data/lib/rbbt/bow/classifier.rb CHANGED

@@ -113,6 +113,4 @@ class Classifier
   end
 end

data/lib/rbbt/bow/dictionary.rb CHANGED

@@ -185,34 +185,3 @@ class Dictionary::KL
 end
-if __FILE__ == $0
-  require 'benchmark'
-  require 'rbbt/sources/pubmed'
-  require 'rbbt/bow/bow'
-  require 'progress-meter'
-  max = 10000
-  pmids = PubMed.query("Homo Sapiens", max)
-  Progress.monitor "Get pimds"
-  docs = PubMed.get_article(pmids).values.collect{|article| BagOfWords.terms(article.text)}
-  dict = Dictionary::TF_IDF.new()
-  puts "Starting Benchmark"
-  puts Benchmark.measure{
-    docs.each{|doc|
-      dict.add doc
-    }
-  }
-  puts Benchmark.measure{
-    dict.weights
-  }
-  puts dict.terms.length
-end

data/lib/rbbt/ner/rnorm.rb CHANGED

@@ -4,6 +4,7 @@ require 'rbbt/ner/rnorm/tokens'
 require 'rbbt/util/index'
 require 'rbbt/util/open'
 require 'rbbt/sources/entrez'
+require 'rbbt/bow/bow.rb'
 class Normalizer

data/lib/rbbt/sources/biocreative.rb CHANGED

@@ -12,12 +12,12 @@ module Biocreative
     data = {}
-    Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).each{|l|
+    Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/#{dataset}.in")).each_line{|l|
       code, text = l.chomp.match(/(.*?) (.*)/).values_at(1,2)
       data[code] ={ :text => text }
     }
-    Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).each{|l|
+    Open.read(File.join(Rbbt.datadir,"biocreative/BC2GM/#{dataset}/GENE.eval")).each_line{|l|
       code, pos, mention = l.chomp.split(/\|/)
       data[code] ||= {}
       data[code][:mentions] ||= []

data/lib/rbbt/sources/biomart.rb CHANGED

@@ -1,4 +1,3 @@
 require 'rbbt/util/open'
 require 'rbbt'

data/lib/rbbt/sources/entrez.rb CHANGED

@@ -1,4 +1,3 @@
 require 'rbbt'
 require 'rbbt/util/open'
 require 'rbbt/util/tmpfile'
@@ -190,6 +189,7 @@ module Entrez
   # found in Entrez Gene for that particular gene. The +gene+ may be a
   # gene identifier or a Gene class instance.
   def self.gene_text_similarity(gene, text)
     case
     when Entrez::Gene === gene
       gene_text = gene.text

data/lib/rbbt/sources/organism.rb CHANGED

@@ -1,18 +1,23 @@
 require 'rbbt'
-require 'rbbt/ner/rnorm'
 require 'rbbt/util/open'
+require 'rbbt/util/index'
 module Organism
   class OrganismNotProcessedError < StandardError; end
-  def self.all
-    Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*/name').collect{|f| File.basename(File.dirname(f))}
+  def self.all(installed = true)
+    if installed
+      Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*/identifiers').collect{|f| File.basename(File.dirname(f))}
+    else
+      Dir.glob(File.join(Rbbt.datadir,'/organisms/') + '/*').select{|f| File.directory? f}.collect{|f| File.basename(f)}
+    end
   end
   def self.name(org)
+    raise OrganismNotProcessedError, "Missing 'name' file" if ! File.exists? File.join(Rbbt.datadir,"organisms/#{ org }/name")
     Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/name"))
   end
@@ -30,7 +35,13 @@ module Organism
     id_types = {}
     formats = supported_ids(org)
-    lines = Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).collect
+    text = Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers"))
+    if text.respond_to? :collect
+      lines = text.collect
+    else
+      lines = text.lines
+    end
     lines.each{|l|
       ids_per_type = l.split(/\t/)
@@ -68,7 +79,10 @@ module Organism
     format_count.select{|k,v| v > (query.length / 10)}.sort{|a,b| b[1] <=> a[1]}.first
   end
-  def self.ner(org, type=:abner, options = {})
+  # FIXME: The NER related stuff is harder to install, thats why we hide the
+  # requires next to where they are needed, next to options
+  def self.ner(org, type=:rner, options = {})
     case type.to_sym
     when :abner
@@ -90,6 +104,7 @@ module Organism
   end
   def self.norm(org, to_entrez = nil)
+    require 'rbbt/ner/rnorm'
     if to_entrez.nil?
       to_entrez = id_index(org, :native => 'Entrez Gene ID', :other => [supported_ids(org).first])
     end
@@ -109,7 +124,7 @@ module Organism
   def self.goterms(org)
     goterms = {}
-    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each{|l|
+    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/gene.go")).each_line{|l|
       gene, go = l.chomp.split(/\t/)
       goterms[gene.strip] ||= []
       goterms[gene.strip] << go.strip
@@ -118,7 +133,7 @@ module Organism
   end
   def self.literature(org)
-    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).collect{|l| l.chomp.scan(/\d+/)}.flatten
+    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/all.pmid")).scan(/\d+/)
   end
   def self.gene_literature(org)
@@ -133,7 +148,7 @@ module Organism
     formats  = []
     examples = [] if options[:examples]
     i= 0
-    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).each{|l|
+    Open.read(File.join(Rbbt.datadir,"organisms/#{ org }/identifiers")).each_line{|l|
       if i == 0
         i += 1
         next unless l=~/^\s*#/
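
`Organism.all` now distinguishes organisms that merely have a directory from those that have been processed, using the presence of an `identifiers` file as the marker. The sketch below reproduces that distinction against a temporary directory. `organisms` is a hypothetical free function, and `root` stands in for the `organisms` folder under `Rbbt.datadir`.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for Organism.all: with installed = true, only
# directories holding an 'identifiers' file count; otherwise every
# subdirectory does.
def organisms(root, installed = true)
  if installed
    Dir.glob(File.join(root, '*', 'identifiers')).map { |f| File.basename(File.dirname(f)) }.sort
  else
    Dir.glob(File.join(root, '*')).select { |f| File.directory?(f) }.map { |f| File.basename(f) }.sort
  end
end

Dir.mktmpdir do |root|
  %w(sgd human).each { |org| FileUtils.mkdir_p(File.join(root, org)) }
  FileUtils.touch(File.join(root, 'sgd', 'identifiers'))
  p organisms(root, false)
  p organisms(root)
end
```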

data/lib/rbbt/util/index.rb CHANGED

@@ -42,28 +42,3 @@ module Index
     index
   end
 end
-if __FILE__ == $0
-  require 'benchmark'
-  normal = nil
-  puts "Normal " + Benchmark.measure{
-    normal = Index.index('/home/miki/rbbt/data/organisms/human/identifiers',:trie => false, :case_sensitive => false)
-  }.to_s
-  ids = Open.read('/home/miki/git/MARQ/test/GDS1375_malignant_vs_normal_up.genes').collect{|l| l.chomp.strip.upcase}
-  new = nil
-  puts ids.inspect
-  puts "normal " + Benchmark.measure{
-    100.times{
-      new = ids.collect{|id| normal[id]}
-    }
-  }.to_s
-  puts new.inspect
-end

data/lib/rbbt/util/misc.rb CHANGED

@@ -1,8 +1,12 @@
 require 'rbbt'
 require 'rbbt/util/open'
-$consonants = Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).collect{|l| l.chomp}.uniq
 class String
+  CONSONANTS = []
+  if File.exists? File.join(Rbbt.datadir, 'wordlists/consonants')
+    Open.read(File.join(Rbbt.datadir, 'wordlists/consonants')).each_line{|l| CONSONANTS << l.chomp}
+  end
   # Uses heuristics to checks if a string seems like a special word, like a gene name.
   def is_special?
     # Only consonants
@@ -22,7 +26,7 @@ class String
     # Dashed word
     return true if self =~ /(^\w-|-\w$)/
     # To many consonants (very heuristic)
-    if self =~ /([^aeiouy]{3,})/i && !$consonants.include?($1.downcase)
+    if self =~ /([^aeiouy]{3,})/i && !CONSONANTS.include?($1.downcase)
       return true
     end
@@ -83,7 +87,8 @@ $greek = {
 $inverse_greek = Hash.new
 $greek.each{|l,s| $inverse_greek[s] = l }
-$stopwords = Open.read(File.join(Rbbt.datadir, 'wordlists/stopwords')).scan(/\w+/)
+$stopwords = Open.read(File.join(Rbbt.datadir, 'wordlists/stopwords')).scan(/\w+/) if File.exists? File.join(Rbbt.datadir, 'wordlists/stopwords')
 class Array
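
The changes in this file make the word lists optional at load time by checking for the file first. Note that `File.exists?`, used here and in a few other hunks, was deprecated and then removed in Ruby 3.2, so on current interpreters the same check needs `File.exist?`. A minimal sketch of the guarded load follows; `load_wordlist` is a hypothetical helper and the file is a temporary stand-in for `wordlists/stopwords`.

```ruby
require 'tempfile'

# Hypothetical helper: returns the words in the file, or nil when the
# word list has not been installed.
def load_wordlist(path)
  return nil unless File.exist?(path)
  File.read(path).scan(/\w+/)
end

Tempfile.create('stopwords') do |f|
  f.write("the\nof\nand\n")
  f.flush
  p load_wordlist(f.path)
end
p load_wordlist('/nonexistent/wordlists/stopwords')
```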

data/lib/rbbt/util/open.rb CHANGED

@@ -161,7 +161,7 @@ module Open
     extra = [extra] if extra && ! extra.is_a?( Array)
     data = {}
-    Open.read(filename).each{|l|
+    Open.read(filename).each_line{|l|
       l = fix.call(l) if fix
       next if exclude and exclude.call(l)

data/lib/rbbt/util/simpleDSL.rb CHANGED

@@ -64,7 +64,7 @@ class SimpleDSL
   def initialize(method = nil, file = nil, &block)
     @config = {}
     if file
-      raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (rbbt_config install norm)." unless File.exists? file
+      raise ConfigFileMissingError.new "File '#{ file }' is missing. Have you installed the config files? (use rbbt_config)." unless File.exists? file
       parse(method, file)
     end

data/tasks/install.rake CHANGED

@@ -85,6 +85,7 @@ task 'organisms' do
     end
     FileUtils.cp f , File.join(directory, "#{ org }/Rakefile")
   }
+  `cd #{directory}; rake names`
 end
 task 'ner' do
@@ -102,10 +103,10 @@ end
 task 'norm' do
   directory = "#{$datadir}/norm"
   FileUtils.mkdir_p directory
-  %w(Rakefile config).each{|f|
+  %w(Rakefile config functions.sh).each{|f|
     FileUtils.cp_r File.join($scriptdir, "norm/#{ f }"), directory
   }
- %w(results).each{|d|
+ %w(results models).each{|d|
   FileUtils.mkdir_p File.join(directory, d)
   }
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: rbbt
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.2
 platform: ruby
 authors:
 - Miguel Vazquez
@@ -9,10 +9,59 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2009-10-29 00:00:00 +01:00
+date: 2009-11-02 00:00:00 +01:00
 default_executable: rbbt_config
-dependencies: []
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.8.4
+    version:
+- !ruby/object:Gem::Dependency
+  name: simpleconsole
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: "0"
+    version:
+- !ruby/object:Gem::Dependency
+  name: stemmer
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: "0"
+    version:
+- !ruby/object:Gem::Dependency
+  name: progress-monitor
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: "0"
+    version:
+- !ruby/object:Gem::Dependency
+  name: simpleconsole
+  type: :runtime
+  version_requirement:
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: "0"
+    version:
 description: |-
   This toolbox includes modules for text-mining, like Named Entity Recognition and Normalization and document
       classification, as well as data integration modules that interface with PubMed, Entrez Gene, BioMart.
@@ -78,7 +127,6 @@ files:
 - lib/rbbt/util/open.rb
 - lib/rbbt/util/simpleDSL.rb
 - lib/rbbt/util/tmpfile.rb
-- lib/rbbt/version.rb
 - tasks/install.rake
 - LICENSE
 - README.rdoc
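
The generated metadata above lists `simpleconsole` twice among the runtime dependencies. For comparison, the sketch below declares the four distinct dependencies once each using the RubyGems API. It is a hypothetical fragment for illustration, not the project's actual gemspec or Rakefile, and it omits the summary, files, and other fields.

```ruby
require 'rubygems'

# Hypothetical gemspec fragment; only name, version, and dependencies are set.
spec = Gem::Specification.new do |s|
  s.name    = 'rbbt'
  s.version = '1.0.2'
  s.add_runtime_dependency 'rake', '>= 0.8.4'
  s.add_runtime_dependency 'simpleconsole'
  s.add_runtime_dependency 'stemmer'
  s.add_runtime_dependency 'progress-monitor'
end

puts spec.dependencies.map { |d| "#{d.name} (#{d.requirement})" }
```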

data/lib/rbbt/version.rb DELETED

@@ -1,10 +0,0 @@
-module Rbbt
-  module VERSION #:nodoc:
-    MAJOR = 1
-    MINOR = 0
-    TINY  = 0
-    STRING = [MAJOR, MINOR, TINY].join('.')
-    self
-  end
-end