nimbus 1.0.1 → 2.0.0

data/README.md ADDED
@@ -0,0 +1,149 @@
+ # Nimbus [![Build Status](https://secure.travis-ci.org/xuanxu/nimbus.png?branch=master)](http://travis-ci.org/xuanxu/nimbus)
+ Random Forest algorithm for genomic selection.
+
+ ## Random Forest
+
+ The [random forest algorithm](http://en.wikipedia.org/wiki/Random_forest) is a classifier consisting of many random decision trees. It is based on choosing random subsets of variables for each tree and using the most frequent, or the averaged, tree output as the overall classification. In machine learning terms it is an ensemble classifier, so it uses multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
+
+ The forest outputs the mean of the outputs of the individual trees (in regression problems) or the class voted by the majority of the individual trees (in classification problems).
+
+ ## Genomic selection context
+
+ Nimbus is a Ruby gem implementing Random Forest in a genomic selection context, meaning every input file is expected to contain genotype and/or phenotype data from a sample of individuals.
+
+ Other than the ids of the individuals, Nimbus handles the data as genotype values for [single-nucleotide polymorphisms](http://en.wikipedia.org/wiki/SNPs) (SNPs), so the variables in the classifier must have values of 0, 1 or 2, corresponding to the SNP classes AA, AB and BB.
+
+ Nimbus can be used to:
+
+ * Create a random forest using a training sample of individuals with phenotype data.
+ * Use an existing random forest to get predictions for a testing sample.
+
+ ## Learning algorithm
+
+ **Training**: Each tree in the forest is constructed using the following algorithm:
+
+ 1. Let the number of training cases be N, and the number of variables (SNPs) in the classifier be M.
+ 1. Let mtry be the number of input variables used to determine the decision at a node of the tree; mtry should be much less than M.
+ 1. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases (the Out Of Bag sample) to estimate the error of the tree, by predicting their classes.
+ 1. For each node of the tree, randomly choose mtry SNPs on which to base the decision at that node. Calculate the best split based on these mtry SNPs in the training set.
+ 1. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
+ 1. When no SNP split in a node minimizes the general loss function of the node, or the number of individuals in the node is less than the minimum node size, label the node with the average phenotype value of the individuals in the node.
+
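The bootstrap/OOB split in step 3 is compact enough to sketch directly. A minimal stand-alone Ruby sketch (the method and variable names are illustrative, not Nimbus's internal API):

```ruby
# Illustrative sketch of step 3: draw a bootstrap sample and keep the
# never-drawn cases as the Out Of Bag (OOB) sample.
# `bootstrap_sample` and `all_ids` are hypothetical names, not Nimbus API.
def bootstrap_sample(all_ids)
  in_bag = Array.new(all_ids.size) { all_ids.sample } # N draws with replacement
  oob = all_ids - in_bag                              # cases never drawn
  [in_bag, oob]
end

in_bag, oob = bootstrap_sample((1..100).to_a)
```

The OOB sample is then pushed down the finished tree to estimate its generalization error, as described above.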
+ **Testing**: For prediction, a sample is pushed down the tree and assigned the label of the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the average (for regression) or the majority vote (for classification) over all trees is reported as the random forest prediction.
+
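The prediction step is a simple recursive descent. A minimal sketch, assuming a node is either a final label or a one-key hash mapping a SNP index to its three branches (an illustrative layout, not necessarily Nimbus's exact YAML structure):

```ruby
# Push an individual's SNP values down one tree.
# A node is assumed to be either a final label (a number) or a hash
# {snp_index => [child_for_value_0, child_for_value_1, child_for_value_2]}.
def predict(node, snp_values)
  return node unless node.is_a?(Hash)        # reached a labeled terminal node
  snp, children = node.first                 # the SNP this node splits on
  predict(children[snp_values[snp]], snp_values)
end

tree = { 0 => [1.5, { 2 => [2.0, 2.5, 3.0] }, 4.0] }
predict(tree, [1, 0, 2])  # => 3.0 (SNP 0 has value 1, then SNP 2 has value 2)
```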
+ ## Regression and Classification
+
+ Nimbus can be used both with regression and classification problems.
+
+ **Regression**: the default mode.
+
+ * The split of nodes uses quadratic loss as loss function.
+ * Nodes are labeled by averaging the phenotype values of the individuals in the node.
+
+ **Classification**: activated by declaring `classes` in the configuration file.
+
+ * The split of nodes uses the Gini index as loss function.
+ * Nodes are labeled with the majority phenotype class of the individuals in the node.
+
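The two loss functions can be stated in a few lines. These are the standard textbook formulas, shown for illustration rather than copied from `Nimbus::LossFunctions`, and they assume a reasonably recent Ruby (`Enumerable#tally`, `#sum`):

```ruby
# Quadratic loss: mean squared deviation from the node's mean phenotype.
def quadratic_loss(values)
  mean = values.sum(0.0) / values.size
  values.sum { |v| (v - mean)**2 } / values.size
end

# Gini index: 1 minus the sum of squared class frequencies in the node.
def gini_index(labels)
  n = labels.size.to_f
  1.0 - labels.tally.values.sum { |count| (count / n)**2 }
end

quadratic_loss([1.0, 3.0]) # => 1.0
gini_index([0, 0, 1, 1])   # => 0.5
```

Both are minimized by pure nodes, which is why a split is only accepted when it lowers the node's loss.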
+ ## Install
+
+ You need to have Ruby (1.9.2 or higher) and Rubygems installed on your computer. Then install nimbus with:
+
+     $ gem install nimbus
+
+ ## Getting Started
+
+ Once you have nimbus installed on your system, you can run the gem using the `nimbus` executable:
+
+     $ nimbus
+
+ It will look for these files:
+
+ * `training.data`: If found it will be used to build a random forest.
+ * `testing.data`: If found it will be pushed down the forest to obtain predictions for every individual in the file.
+ * `random_forest.yml`: If found it will be the forest used for the testing.
+
+ So in order to train a forest, a training file is needed. And to do the testing you need two files: the testing file, plus either the training file OR the random_forest file, because nimbus needs a forest from which to obtain the predictions.
+
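That rule can be summarized as a pair of predicates (hypothetical helpers for illustration, not Nimbus code):

```ruby
# Training only needs training data; testing needs testing data plus a
# forest source (either training data to grow one, or a saved forest).
# `files` is simply the list of filenames present in the working directory.
def training_possible?(files)
  files.include?('training.data')
end

def testing_possible?(files)
  files.include?('testing.data') &&
    (files.include?('training.data') || files.include?('random_forest.yml'))
end

testing_possible?(['testing.data', 'random_forest.yml']) # => true
```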
+ ## Configuration (config.yml)
+
+ The values for the input data files and the forest can be specified in the `config.yml` file that should be located in the directory where you are running `nimbus`.
+
+ The `config.yml` has the following structure and parameters:
+
+     #Input files
+     input:
+       training: training_regression.data
+       testing: testing_regression.data
+       forest: my_forest.yml
+       classes: [0,1]
+
+     #Forest parameters
+     forest:
+       forest_size: 10 #how many trees
+       SNP_sample_size_mtry: 60 #mtry
+       SNP_total_count: 200
+       node_min_size: 5
+
+ Under the input section:
+
+ * `training`: specify the path to the training data file (optional, if specified `nimbus` will create a random forest).
+ * `testing`: specify the path to the testing data file (optional, if specified `nimbus` will traverse this data through a random forest).
+ * `forest`: specify the path to a file containing a random forest structure (optional, if there is also a testing file, this will be the forest used for the testing).
+ * `classes`: **optional (needed only for classification problems)**. Specify the list of classes in the input files as a comma-separated list between square brackets, e.g.: `[A,B]`.
+
+ Under the forest section:
+
+ * `forest_size`: number of trees for the forest.
+ * `SNP_sample_size_mtry`: size of the random sample of SNPs to be used in every tree node.
+ * `SNP_total_count`: total count of SNPs in the training and/or testing files.
+ * `node_min_size`: minimum number of individuals in a tree node to make a split.
+
+
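Since `config.yml` is plain YAML, it can be inspected from Ruby's standard library; the keys below mirror the example configuration shown above:

```ruby
# Load a config like the example above and read one of its parameters.
require 'yaml'

config = YAML.load(<<~YML)
  input:
    training: training_regression.data
    testing: testing_regression.data
  forest:
    forest_size: 10
    SNP_sample_size_mtry: 60
    SNP_total_count: 200
    node_min_size: 5
YML

config['forest']['SNP_sample_size_mtry'] # => 60
```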
+ ## Input files
+
+ The three input files you can use with Nimbus should have the proper format:
+
+ **The training file** has any number of rows, each representing data for one individual, with these columns:
+
+ 1. A column with the phenotype for the individual.
+ 1. A column with the ID of the individual.
+ 1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.
+
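A row following this layout can be parsed with plain Ruby. The sample line and variable names are hypothetical, and whitespace-separated columns are assumed:

```ruby
# Parse one training-file row: phenotype, ID, then the genotype columns.
line = "2.25 id001 0 1 2 1 0"
fenotype, id, *snps = line.split
fenotype = Float(fenotype)          # first column: phenotype value
snps = snps.map { |s| Integer(s) }  # genotype columns must be 0, 1 or 2
```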
+ **The testing file** has any number of rows, each representing data for one individual, similar to the training file but without the phenotype column:
+
+ 1. A column with the ID of the individual.
+ 1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.
+
+ **The forest file** contains the structure of a forest in YAML format. It is the output file of a nimbus training run.
+
+ ## Output files
+
+ Nimbus will generate the following output files:
+
+ After training:
+
+ * `random_forest.yml`: A file defining the structure of the computed Random Forest. It can be used as an input forest file.
+ * `generalization_errors.txt`: A file with the generalization error for every tree in the forest.
+ * `training_file_predictions.txt`: A file with predictions for every individual from the training file.
+ * `snp_importances.txt`: A file with the computed importance for every SNP.
+
+ After testing:
+
+ * `testing_file_predictions.txt`: A file with predictions for every individual from the testing file.
+
+ ## Resources
+
+ * [Nimbus website](http://www.nimbusgem.org)
+ * [Source code](http://github.com/xuanxu/nimbus) – Fork the code
+ * [Issues](http://github.com/xuanxu/nimbus/issues) – Bugs and feature requests
+ * [Online rdocs](http://rubydoc.info/gems/nimbus/frames)
+ * [Nimbus at rubygems.org](https://rubygems.org/gems/nimbus)
+ * [Random Forest at Wikipedia](http://en.wikipedia.org/wiki/Random_forest)
+ * [Leo Breiman's Random Forests page](http://www.stat.berkeley.edu/~breiman/RandomForests/)
+
+ ## Credits
+
+ Nimbus was developed by [Juanjo Bazán](http://twitter.com/xuanxu) in collaboration with Oscar González-Recio.
+
+ Copyright © 2011 - 2012 Juanjo Bazán, released under the MIT license.
data/lib/nimbus.rb CHANGED
@@ -1,24 +1,26 @@
  require 'yaml'
- require 'optparse'
  require 'nimbus/exceptions'
  require 'nimbus/training_set'
  require 'nimbus/configuration'
  require 'nimbus/loss_functions'
  require 'nimbus/individual'
  require 'nimbus/tree'
+ require 'nimbus/regression_tree'
+ require 'nimbus/classification_tree'
  require 'nimbus/forest'
  require 'nimbus/application'
+ require 'nimbus/version'
 
  #####################################################################
- # Nimbus module.
+ # Nimbus module.
  # Used as a namespace containing all the Nimbus code.
  # The module defines a Nimbus::Application and interacts with the user output console.
  #
  module Nimbus
-
+
  STDERR = $stderr
  STDOUT = $stdout
-
+
  # Nimbus module singleton methods.
  #
  class << self
@@ -26,30 +28,30 @@ module Nimbus
  def application
  @application ||= ::Nimbus::Application.new
  end
-
+
  # Set the current Nimbus application object.
  def application=(app)
  @application = app
  end
-
+
  # Stops the execution of the Nimbus application.
  def stop(msg = "Error: Nimbus finished.")
  self.error_message msg
  exit(false)
  end
-
+
  # Writes message to the standard output
  def message(msg)
  STDOUT.puts msg
  STDOUT.flush
  end
-
+
  # Writes message to the error output
  def error_message(msg)
  STDERR.puts msg
  STDERR.flush
  end
-
+
  # Writes to the standard output
  def write(str)
  STDOUT.write str
@@ -59,8 +61,10 @@ module Nimbus
  # Clear current console line
  def clear_line!
  self.write "\r"
+ self.write(" " * 50)
+ self.write "\r"
  end
-
+
  end
-
+
  end
@@ -1,14 +1,14 @@
  module Nimbus
-
+
  #####################################################################
- # Nimbus main application object.
- #
- # When invoking +nimbus+ from the command line,
+ # Nimbus main application object.
+ #
+ # When invoking +nimbus+ from the command line,
  # a Nimbus::Application object is created and run.
  #
  class Application
  attr_accessor :config
-
+
  # Initialize a Nimbus::Application object.
  # Check and load the configuration options.
  def initialize(c = nil)
@@ -18,7 +18,7 @@ module Nimbus
  @forest = nil
  end
  end
-
+
  # Run the Nimbus application. The run method performs the following
  # three steps:
  #
@@ -27,7 +27,7 @@ module Nimbus
  # * Write results to output files.
  def run
  nimbus_exception_handling do
-
+
  if @config.do_training && @config.load_training_data
  @forest = ::Nimbus::Forest.new @config
  @forest.grow
@@ -36,13 +36,13 @@ module Nimbus
  output_training_file_predictions(@forest)
  output_snp_importances_file(@forest)
  end
-
+
  if @config.do_testing
  @forest = @config.load_forest if @config.forest_file
  @forest.traverse
- output_testing_set_predictions(@forest)
+ output_testing_set_predictions(@forest)
  end
-
+
  end
  end
 
@@ -52,16 +52,13 @@ module Nimbus
  def config
  @config ||= ::Nimbus::Configuration.new
  end
-
+
  # Provides the default exception handling for the given block.
  def nimbus_exception_handling
  begin
  yield
  rescue SystemExit => ex
  raise
- rescue OptionParser::InvalidOption => ex
- display_error_message(Nimbus::InvalidOptionError ex.message)
- Nimbus.stop
  rescue Nimbus::Error => ex
  display_error_message(ex)
  Nimbus.stop
@@ -70,7 +67,7 @@ module Nimbus
  Nimbus.stop
  end
  end
-
+
  # Display an error message that caused a exception.
  def display_error_message(ex)
  Nimbus.error_message "* Nimbus encountered an error! The random forest was not generated *"
@@ -81,7 +78,7 @@ module Nimbus
  # Nimbus.error_message "(See full error trace by running Nimbus with --trace)"
  # end
  end
-
+
  protected
  def output_random_forest_file(forest)
  File.open(@config.output_forest_file , 'w') {|f| f.write(forest.to_yaml) }
@@ -89,9 +86,9 @@ module Nimbus
  Nimbus.message "* Output forest file: #{@config.output_forest_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_tree_errors_file(forest)
- File.open(@config.output_tree_errors_file , 'w') {|f|
+ File.open(@config.output_tree_errors_file , 'w') {|f|
  1.upto(forest.tree_errors.size) do |te|
  f.write("generalization error for tree #{te}: #{forest.tree_errors[te-1].round(5)}\n")
  end
@@ -100,7 +97,7 @@ module Nimbus
  Nimbus.message "* Output tree errors file: #{@config.output_tree_errors_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_training_file_predictions(forest)
  File.open(@config.output_training_file , 'w') {|f|
  forest.predictions.sort.each{|p|
@@ -111,7 +108,7 @@ module Nimbus
  Nimbus.message "* Output from training file: #{@config.output_training_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_testing_set_predictions(forest)
  File.open(@config.output_testing_file , 'w') {|f|
  forest.predictions.sort.each{|p|
@@ -122,7 +119,7 @@ module Nimbus
  Nimbus.message "* Output from testing file: #{@config.output_testing_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_snp_importances_file(forest)
  File.open(@config.output_snp_importances_file , 'w') {|f|
  forest.snp_importances.sort.each{|p|
@@ -133,7 +130,7 @@ module Nimbus
  Nimbus.message "* Output snp importance file: #{@config.output_snp_importances_file}"
  Nimbus.message "*" * 50
  end
-
+
  end
-
+
  end
@@ -0,0 +1,111 @@
+ module Nimbus
+
+   #####################################################################
+   # Tree object representing a random classification tree.
+   #
+   # A tree is generated following these steps:
+   #
+   # * 1: Calculate loss function for the individuals in the node (first node contains all the individuals).
+   # * 2: Take a random sample of the SNPs (size m << total count of SNPs)
+   # * 3: Compute the loss function (default: gini index) for the split of the sample based on value of every SNP.
+   # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in three nodes, based on value for that SNP [0, 1, or 2]
+   # * 5: Repeat from 1 for every node until:
+   #   - a) The individuals count in that node is < minimum size OR
+   #   - b) None of the SNP splits has a loss function smaller than the node loss function
+   # * 6: When a node stops, label the node with the majority class in the node.
+   #
+   class ClassificationTree < Tree
+     attr_accessor :classes
+
+     # Initialize Tree object with the configuration (as in Nimbus::Configuration.tree) options received.
+     def initialize(options)
+       @classes = options[:classes]
+       super
+     end
+
+     # Creates the structure of the tree, as a hash of SNP splits and values.
+     #
+     # It just initializes the needed variables and then defines the first node of the tree.
+     # The rest of the structure of the tree is computed recursively, building every node by calling +build_node+.
+     def seed(all_individuals, individuals_sample, ids_fenotypes)
+       super
+       @structure = build_node individuals_sample, Nimbus::LossFunctions.majority_class(individuals_sample, @id_to_fenotype, @classes)
+     end
+
+     # Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.
+     #
+     # * If SNP_min is the SNP with the smallest loss function and it is < the loss function of the node, it splits the individuals sample in three:
+     #   (those with value 0 for the SNP_min, those with value 1 for the SNP_min, and those with value 2 for the SNP_min) then it builds these 3 new nodes.
+     # * Otherwise every individual in the node gets labeled with the majority fenotype class of all of them.
+     def build_node(individuals_ids, y_hat)
+       # General loss function value for the node
+       individuals_count = individuals_ids.size
+       return label_node(y_hat, individuals_ids) if individuals_count < @node_min_size
+       node_loss_function = Nimbus::LossFunctions.gini_index individuals_ids, @id_to_fenotype, @classes
+
+       # Finding the SNP that minimizes loss function
+       snps = snps_random_sample
+       min_loss, min_SNP, split, ginis = node_loss_function, nil, nil, nil
+
+       snps.each do |snp|
+         individuals_split_by_snp_value = split_by_snp_value individuals_ids, snp
+         y_hat_0 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[0], @id_to_fenotype, @classes)
+         y_hat_1 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[1], @id_to_fenotype, @classes)
+         y_hat_2 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[2], @id_to_fenotype, @classes)
+
+         gini_0 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[0], @id_to_fenotype, @classes
+         gini_1 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[1], @id_to_fenotype, @classes
+         gini_2 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[2], @id_to_fenotype, @classes
+         loss_snp = (individuals_split_by_snp_value[0].size * gini_0 +
+                     individuals_split_by_snp_value[1].size * gini_1 +
+                     individuals_split_by_snp_value[2].size * gini_2) / individuals_count
+
+         min_loss, min_SNP, split, ginis = loss_snp, snp, individuals_split_by_snp_value, [y_hat_0, y_hat_1, y_hat_2] if loss_snp < min_loss
+       end
+       return build_branch(min_SNP, split, ginis, y_hat) if min_loss < node_loss_function
+       return label_node(y_hat, individuals_ids)
+     end
+
+     # Compute generalization error for the tree.
+     #
+     # Traversing the 'out of bag' (OOB) sample (those individuals of the training set not
+     # used in the building of this tree) through the tree, and comparing
+     # the prediction with the real fenotype class of the individual, it is possible
+     # to calculate the error frequency, an unbiased generalization error for the tree.
+     def generalization_error_from_oob(oob_ids)
+       return nil if (@structure.nil? || @individuals.nil? || @id_to_fenotype.nil?)
+       oob_errors = 0.0
+       oob_ids.each do |oobi|
+         oob_errors += 1 unless @id_to_fenotype[oobi] == Tree.traverse(@structure, individuals[oobi].snp_list)
+       end
+       @generalization_error = oob_errors / oob_ids.size
+     end
+
+     # Estimation of importance for every SNP.
+     #
+     # The importance of any SNP in the tree is calculated using the OOB sample.
+     # For every SNP, every individual in the sample is pushed down the tree but with the
+     # value of that SNP permuted with that of another individual in the sample.
+     #
+     # That way the difference between the generalization error and the error frequency with the SNP value modified can be estimated for any given SNP.
+     #
+     # This method computes importance estimations for every SNP used in the tree (for any other SNP it would be 0).
+     def estimate_importances(oob_ids)
+       return nil if (@generalization_error.nil? && generalization_error_from_oob(oob_ids).nil?)
+       oob_individuals_count = oob_ids.size
+       @importances = {}
+       @used_snps.uniq.each do |current_snp|
+         shuffled_ids = oob_ids.shuffle
+         permutated_snp_errors = 0.0
+         oob_ids.each_with_index {|oobi, index|
+           permutated_prediction = traverse_with_permutation @structure, individuals[oobi].snp_list, current_snp, individuals[shuffled_ids[index]].snp_list
+           permutated_snp_errors += 1 unless @id_to_fenotype[oobi] == permutated_prediction
+         }
+         @importances[current_snp] = ((permutated_snp_errors / oob_individuals_count) - @generalization_error).round(5)
+       end
+       @importances
+     end
+
+   end
+
+ end