nimbus 1.0.1 → 2.0.0

data/README.md ADDED
@@ -0,0 +1,149 @@
+ # Nimbus [![Build Status](https://secure.travis-ci.org/xuanxu/nimbus.png?branch=master)](http://travis-ci.org/xuanxu/nimbus)
+ Random Forest algorithm for genomic selection.
+
+ ## Random Forest
+
+ The [random forest algorithm](http://en.wikipedia.org/wiki/Random_forest) is a classifier consisting of many random decision trees. It is based on choosing random subsets of variables for each tree and using the most frequent, or the averaged, tree output as the overall classification. In machine learning terms it is an ensemble classifier, so it uses multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
+
+ The forest outputs the mean of the outputs of the individual trees (in regression problems) or the class voted by the majority of the individual trees (in classification problems).
+
+ ## Genomic selection context
+
+ Nimbus is a Ruby gem implementing Random Forest in a genomic selection context, meaning every input file is expected to contain genotype and/or phenotype data from a sample of individuals.
+
+ Other than the ids of the individuals, Nimbus handles the data as genotype values for [single-nucleotide polymorphisms](http://en.wikipedia.org/wiki/SNPs) (SNPs), so the variables in the classifier must have values of 0, 1 or 2, corresponding to the SNP classes AA, AB and BB.
+
+ Nimbus can be used to:
+
+ * Create a random forest using a training sample of individuals with phenotype data.
+ * Use an existing random forest to get predictions for a testing sample.
+
+ ## Learning algorithm
+
+ **Training**: Each tree in the forest is constructed using the following algorithm:
+
+ 1. Let the number of training cases be N, and the number of variables (SNPs) in the classifier be M.
+ 1. Let mtry be the number of input variables used to determine the decision at a node of the tree; mtry should be much less than M.
+ 1. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases (the Out Of Bag sample) to estimate the error of the tree, by predicting their classes.
+ 1. For each node of the tree, randomly choose mtry SNPs on which to base the decision at that node. Calculate the best split based on these mtry SNPs in the training set.
+ 1. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
+ 1. When no SNP split in a node minimizes the general loss function of the node, or the number of individuals in the node is less than the minimum node size, label the node with the average phenotype value of the individuals in the node.
+
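The bootstrap/OOB split in step 3 is compact enough to sketch directly. A minimal stand-alone Ruby sketch (the method and variable names are illustrative, not Nimbus's internal API):

```ruby
# Illustrative sketch of step 3: draw a bootstrap sample and keep the
# never-drawn cases as the Out Of Bag (OOB) sample.
# `bootstrap_sample` and `all_ids` are hypothetical names, not Nimbus API.
def bootstrap_sample(all_ids)
  in_bag = Array.new(all_ids.size) { all_ids.sample } # N draws with replacement
  oob = all_ids - in_bag                              # cases never drawn
  [in_bag, oob]
end

in_bag, oob = bootstrap_sample((1..100).to_a)
```

The OOB sample is then pushed down the finished tree to estimate its generalization error, as described above.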
+ **Testing**: For prediction, a sample is pushed down the tree and assigned the label of the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the average (for regression) or the majority vote (for classification) over all trees is reported as the random forest prediction.
+
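The prediction step is a simple recursive descent. A minimal sketch, assuming a node is either a final label or a one-key hash mapping a SNP index to its three branches (an illustrative layout, not necessarily Nimbus's exact YAML structure):

```ruby
# Push an individual's SNP values down one tree.
# A node is assumed to be either a final label (a number) or a hash
# {snp_index => [child_for_value_0, child_for_value_1, child_for_value_2]}.
def predict(node, snp_values)
  return node unless node.is_a?(Hash)        # reached a labeled terminal node
  snp, children = node.first                 # the SNP this node splits on
  predict(children[snp_values[snp]], snp_values)
end

tree = { 0 => [1.5, { 2 => [2.0, 2.5, 3.0] }, 4.0] }
predict(tree, [1, 0, 2])  # => 3.0 (SNP 0 has value 1, then SNP 2 has value 2)
```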
+ ## Regression and Classification
+
+ Nimbus can be used both with regression and classification problems.
+
+ **Regression**: the default mode.
+
+ * The split of nodes uses quadratic loss as loss function.
+ * Nodes are labeled by averaging the phenotype values of the individuals in the node.
+
+ **Classification**: activated by declaring `classes` in the configuration file.
+
+ * The split of nodes uses the Gini index as loss function.
+ * Nodes are labeled with the majority phenotype class of the individuals in the node.
+
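The two loss functions can be stated in a few lines. These are the standard textbook formulas, shown for illustration rather than copied from `Nimbus::LossFunctions`, and they assume a reasonably recent Ruby (`Enumerable#tally`, `#sum`):

```ruby
# Quadratic loss: mean squared deviation from the node's mean phenotype.
def quadratic_loss(values)
  mean = values.sum(0.0) / values.size
  values.sum { |v| (v - mean)**2 } / values.size
end

# Gini index: 1 minus the sum of squared class frequencies in the node.
def gini_index(labels)
  n = labels.size.to_f
  1.0 - labels.tally.values.sum { |count| (count / n)**2 }
end

quadratic_loss([1.0, 3.0]) # => 1.0
gini_index([0, 0, 1, 1])   # => 0.5
```

Both are minimized by pure nodes, which is why a split is only accepted when it lowers the node's loss.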
+ ## Install
+
+ You need to have Ruby (1.9.2 or higher) and Rubygems installed on your computer. Then install nimbus with:
+
+     $ gem install nimbus
+
+ ## Getting Started
+
+ Once you have nimbus installed on your system, you can run the gem using the `nimbus` executable:
+
+     $ nimbus
+
+ It will look for these files:
+
+ * `training.data`: If found it will be used to build a random forest.
+ * `testing.data`: If found it will be pushed down the forest to obtain predictions for every individual in the file.
+ * `random_forest.yml`: If found it will be the forest used for the testing.
+
+ So in order to train a forest, a training file is needed. And to do the testing you need two files: the testing file, plus either the training file OR the random_forest file, because nimbus needs a forest from which to obtain the predictions.
+
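That rule can be summarized as a pair of predicates (hypothetical helpers for illustration, not Nimbus code):

```ruby
# Training only needs training data; testing needs testing data plus a
# forest source (either training data to grow one, or a saved forest).
# `files` is simply the list of filenames present in the working directory.
def training_possible?(files)
  files.include?('training.data')
end

def testing_possible?(files)
  files.include?('testing.data') &&
    (files.include?('training.data') || files.include?('random_forest.yml'))
end

testing_possible?(['testing.data', 'random_forest.yml']) # => true
```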
+ ## Configuration (config.yml)
+
+ The values for the input data files and the forest can be specified in the `config.yml` file that should be located in the directory where you are running `nimbus`.
+
+ The `config.yml` has the following structure and parameters:
+
+     #Input files
+     input:
+       training: training_regression.data
+       testing: testing_regression.data
+       forest: my_forest.yml
+       classes: [0,1]
+
+     #Forest parameters
+     forest:
+       forest_size: 10 #how many trees
+       SNP_sample_size_mtry: 60 #mtry
+       SNP_total_count: 200
+       node_min_size: 5
+
+ Under the input section:
+
+ * `training`: specify the path to the training data file (optional, if specified `nimbus` will create a random forest).
+ * `testing`: specify the path to the testing data file (optional, if specified `nimbus` will traverse this data through a random forest).
+ * `forest`: specify the path to a file containing a random forest structure (optional, if there is also a testing file, this will be the forest used for the testing).
+ * `classes`: **optional (needed only for classification problems)**. Specify the list of classes in the input files as a comma-separated list between square brackets, e.g.: `[A,B]`.
+
+ Under the forest section:
+
+ * `forest_size`: number of trees for the forest.
+ * `SNP_sample_size_mtry`: size of the random sample of SNPs to be used in every tree node.
+ * `SNP_total_count`: total count of SNPs in the training and/or testing files.
+ * `node_min_size`: minimum number of individuals in a tree node to make a split.
+
+
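Since `config.yml` is plain YAML, it can be inspected from Ruby's standard library; the keys below mirror the example configuration shown above:

```ruby
# Load a config like the example above and read one of its parameters.
require 'yaml'

config = YAML.load(<<~YML)
  input:
    training: training_regression.data
    testing: testing_regression.data
  forest:
    forest_size: 10
    SNP_sample_size_mtry: 60
    SNP_total_count: 200
    node_min_size: 5
YML

config['forest']['SNP_sample_size_mtry'] # => 60
```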
+ ## Input files
+
+ The three input files you can use with Nimbus should have the proper format:
+
+ **The training file** has any number of rows, each representing data for one individual, with these columns:
+
+ 1. A column with the phenotype for the individual.
+ 1. A column with the ID of the individual.
+ 1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.
+
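A row following this layout can be parsed with plain Ruby. The sample line and variable names are hypothetical, and whitespace-separated columns are assumed:

```ruby
# Parse one training-file row: phenotype, ID, then the genotype columns.
line = "2.25 id001 0 1 2 1 0"
fenotype, id, *snps = line.split
fenotype = Float(fenotype)          # first column: phenotype value
snps = snps.map { |s| Integer(s) }  # genotype columns must be 0, 1 or 2
```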
+ **The testing file** has any number of rows, each representing data for one individual, similar to the training file but without the phenotype column:
+
+ 1. A column with the ID of the individual.
+ 1. M columns (where M = SNP_total_count in `config.yml`) with values 0, 1 or 2, representing the genotype of the individual.
+
+ **The forest file** contains the structure of a forest in YAML format. It is the output file of a nimbus training run.
+
+ ## Output files
+
+ Nimbus will generate the following output files:
+
+ After training:
+
+ * `random_forest.yml`: A file defining the structure of the computed Random Forest. It can be used as an input forest file.
+ * `generalization_errors.txt`: A file with the generalization error for every tree in the forest.
+ * `training_file_predictions.txt`: A file with predictions for every individual from the training file.
+ * `snp_importances.txt`: A file with the computed importance for every SNP.
+
+ After testing:
+
+ * `testing_file_predictions.txt`: A file with predictions for every individual from the testing file.
+
+ ## Resources
+
+ * [Nimbus website](http://www.nimbusgem.org)
+ * [Source code](http://github.com/xuanxu/nimbus) – Fork the code
+ * [Issues](http://github.com/xuanxu/nimbus/issues) – Bugs and feature requests
+ * [Online rdocs](http://rubydoc.info/gems/nimbus/frames)
+ * [Nimbus at rubygems.org](https://rubygems.org/gems/nimbus)
+ * [Random Forest at Wikipedia](http://en.wikipedia.org/wiki/Random_forest)
+ * [Leo Breiman's Random Forests page](http://www.stat.berkeley.edu/~breiman/RandomForests/)
+
+ ## Credits
+
+ Nimbus was developed by [Juanjo Bazán](http://twitter.com/xuanxu) in collaboration with Oscar González-Recio.
+
+ Copyright © 2011 - 2012 Juanjo Bazán, released under the MIT license.
data/lib/nimbus.rb CHANGED
@@ -1,24 +1,26 @@
  require 'yaml'
- require 'optparse'
  require 'nimbus/exceptions'
  require 'nimbus/training_set'
  require 'nimbus/configuration'
  require 'nimbus/loss_functions'
  require 'nimbus/individual'
  require 'nimbus/tree'
+ require 'nimbus/regression_tree'
+ require 'nimbus/classification_tree'
  require 'nimbus/forest'
  require 'nimbus/application'
+ require 'nimbus/version'
 
  #####################################################################
- # Nimbus module.
+ # Nimbus module.
  # Used as a namespace containing all the Nimbus code.
  # The module defines a Nimbus::Application and interacts with the user output console.
  #
  module Nimbus
-
+
  STDERR = $stderr
  STDOUT = $stdout
-
+
  # Nimbus module singleton methods.
  #
  class << self
@@ -26,30 +28,30 @@ module Nimbus
  def application
  @application ||= ::Nimbus::Application.new
  end
-
+
  # Set the current Nimbus application object.
  def application=(app)
  @application = app
  end
-
+
  # Stops the execution of the Nimbus application.
  def stop(msg = "Error: Nimbus finished.")
  self.error_message msg
  exit(false)
  end
-
+
  # Writes message to the standard output
  def message(msg)
  STDOUT.puts msg
  STDOUT.flush
  end
-
+
  # Writes message to the error output
  def error_message(msg)
  STDERR.puts msg
  STDERR.flush
  end
-
+
  # Writes to the standard output
  def write(str)
  STDOUT.write str
@@ -59,8 +61,10 @@ module Nimbus
  # Clear current console line
  def clear_line!
  self.write "\r"
+ self.write(" " * 50)
+ self.write "\r"
  end
-
+
  end
-
+
  end
@@ -1,14 +1,14 @@
  module Nimbus
-
+
  #####################################################################
- # Nimbus main application object.
- #
- # When invoking +nimbus+ from the command line,
+ # Nimbus main application object.
+ #
+ # When invoking +nimbus+ from the command line,
  # a Nimbus::Application object is created and run.
  #
  class Application
  attr_accessor :config
-
+
  # Initialize a Nimbus::Application object.
  # Check and load the configuration options.
  def initialize(c = nil)
@@ -18,7 +18,7 @@ module Nimbus
  @forest = nil
  end
  end
-
+
  # Run the Nimbus application. The run method performs the following
  # three steps:
  #
@@ -27,7 +27,7 @@ module Nimbus
  # * Write results to output files.
  def run
  nimbus_exception_handling do
-
+
  if @config.do_training && @config.load_training_data
  @forest = ::Nimbus::Forest.new @config
  @forest.grow
@@ -36,13 +36,13 @@ module Nimbus
  output_training_file_predictions(@forest)
  output_snp_importances_file(@forest)
  end
-
+
  if @config.do_testing
  @forest = @config.load_forest if @config.forest_file
  @forest.traverse
- output_testing_set_predictions(@forest)
+ output_testing_set_predictions(@forest)
  end
-
+
  end
  end
 
@@ -52,16 +52,13 @@ module Nimbus
  def config
  @config ||= ::Nimbus::Configuration.new
  end
-
+
  # Provides the default exception handling for the given block.
  def nimbus_exception_handling
  begin
  yield
  rescue SystemExit => ex
  raise
- rescue OptionParser::InvalidOption => ex
- display_error_message(Nimbus::InvalidOptionError ex.message)
- Nimbus.stop
  rescue Nimbus::Error => ex
  display_error_message(ex)
  Nimbus.stop
@@ -70,7 +67,7 @@ module Nimbus
  Nimbus.stop
  end
  end
-
+
  # Display an error message that caused a exception.
  def display_error_message(ex)
  Nimbus.error_message "* Nimbus encountered an error! The random forest was not generated *"
@@ -81,7 +78,7 @@ module Nimbus
  # Nimbus.error_message "(See full error trace by running Nimbus with --trace)"
  # end
  end
-
+
  protected
  def output_random_forest_file(forest)
  File.open(@config.output_forest_file , 'w') {|f| f.write(forest.to_yaml) }
@@ -89,9 +86,9 @@ module Nimbus
  Nimbus.message "* Output forest file: #{@config.output_forest_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_tree_errors_file(forest)
- File.open(@config.output_tree_errors_file , 'w') {|f|
+ File.open(@config.output_tree_errors_file , 'w') {|f|
  1.upto(forest.tree_errors.size) do |te|
  f.write("generalization error for tree #{te}: #{forest.tree_errors[te-1].round(5)}\n")
  end
@@ -100,7 +97,7 @@ module Nimbus
  Nimbus.message "* Output tree errors file: #{@config.output_tree_errors_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_training_file_predictions(forest)
  File.open(@config.output_training_file , 'w') {|f|
  forest.predictions.sort.each{|p|
@@ -111,7 +108,7 @@ module Nimbus
  Nimbus.message "* Output from training file: #{@config.output_training_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_testing_set_predictions(forest)
  File.open(@config.output_testing_file , 'w') {|f|
  forest.predictions.sort.each{|p|
@@ -122,7 +119,7 @@ module Nimbus
  Nimbus.message "* Output from testing file: #{@config.output_testing_file}"
  Nimbus.message "*" * 50
  end
-
+
  def output_snp_importances_file(forest)
  File.open(@config.output_snp_importances_file , 'w') {|f|
  forest.snp_importances.sort.each{|p|
@@ -133,7 +130,7 @@ module Nimbus
  Nimbus.message "* Output snp importance file: #{@config.output_snp_importances_file}"
  Nimbus.message "*" * 50
  end
-
+
  end
-
+
  end
@@ -0,0 +1,111 @@
+ module Nimbus
+
+   #####################################################################
+   # Tree object representing a random classification tree.
+   #
+   # A tree is generated following these steps:
+   #
+   # * 1: Calculate loss function for the individuals in the node (first node contains all the individuals).
+   # * 2: Take a random sample of the SNPs (size m << total count of SNPs)
+   # * 3: Compute the loss function (default: gini index) for the split of the sample based on value of every SNP.
+   # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in three nodes, based on value for that SNP [0, 1, or 2]
+   # * 5: Repeat from 1 for every node until:
+   #   - a) The individuals count in that node is < minimum size OR
+   #   - b) None of the SNP splits has a loss function smaller than the node loss function
+   # * 6: When a node stops, label the node with the majority class in the node.
+   #
+   class ClassificationTree < Tree
+     attr_accessor :classes
+
+     # Initialize Tree object with the configuration (as in Nimbus::Configuration.tree) options received.
+     def initialize(options)
+       @classes = options[:classes]
+       super
+     end
+
+     # Creates the structure of the tree, as a hash of SNP splits and values.
+     #
+     # It just initializes the needed variables and then defines the first node of the tree.
+     # The rest of the structure of the tree is computed recursively, building every node by calling +build_node+.
+     def seed(all_individuals, individuals_sample, ids_fenotypes)
+       super
+       @structure = build_node individuals_sample, Nimbus::LossFunctions.majority_class(individuals_sample, @id_to_fenotype, @classes)
+     end
+
+     # Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.
+     #
+     # * If SNP_min is the SNP with the smallest loss function and it is < the loss function of the node, it splits the individuals sample in three:
+     #   (those with value 0 for the SNP_min, those with value 1 for the SNP_min, and those with value 2 for the SNP_min) then it builds these 3 new nodes.
+     # * Otherwise every individual in the node gets labeled with the majority fenotype class of all of them.
+     def build_node(individuals_ids, y_hat)
+       # General loss function value for the node
+       individuals_count = individuals_ids.size
+       return label_node(y_hat, individuals_ids) if individuals_count < @node_min_size
+       node_loss_function = Nimbus::LossFunctions.gini_index individuals_ids, @id_to_fenotype, @classes
+
+       # Finding the SNP that minimizes loss function
+       snps = snps_random_sample
+       min_loss, min_SNP, split, ginis = node_loss_function, nil, nil, nil
+
+       snps.each do |snp|
+         individuals_split_by_snp_value = split_by_snp_value individuals_ids, snp
+         y_hat_0 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[0], @id_to_fenotype, @classes)
+         y_hat_1 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[1], @id_to_fenotype, @classes)
+         y_hat_2 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[2], @id_to_fenotype, @classes)
+
+         gini_0 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[0], @id_to_fenotype, @classes
+         gini_1 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[1], @id_to_fenotype, @classes
+         gini_2 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[2], @id_to_fenotype, @classes
+         loss_snp = (individuals_split_by_snp_value[0].size * gini_0 +
+                     individuals_split_by_snp_value[1].size * gini_1 +
+                     individuals_split_by_snp_value[2].size * gini_2) / individuals_count
+
+         min_loss, min_SNP, split, ginis = loss_snp, snp, individuals_split_by_snp_value, [y_hat_0, y_hat_1, y_hat_2] if loss_snp < min_loss
+       end
+       return build_branch(min_SNP, split, ginis, y_hat) if min_loss < node_loss_function
+       return label_node(y_hat, individuals_ids)
+     end
+
+     # Compute generalization error for the tree.
+     #
+     # Traversing the 'out of bag' (OOB) sample (those individuals of the training set not
+     # used in the building of this tree) through the tree, and comparing
+     # the prediction with the real fenotype class of the individual, it is possible
+     # to calculate the error frequency, an unbiased generalization error for the tree.
+     def generalization_error_from_oob(oob_ids)
+       return nil if (@structure.nil? || @individuals.nil? || @id_to_fenotype.nil?)
+       oob_errors = 0.0
+       oob_ids.each do |oobi|
+         oob_errors += 1 unless @id_to_fenotype[oobi] == Tree.traverse(@structure, individuals[oobi].snp_list)
+       end
+       @generalization_error = oob_errors / oob_ids.size
+     end
+
+     # Estimation of importance for every SNP.
+     #
+     # The importance of any SNP in the tree is calculated using the OOB sample.
+     # For every SNP, every individual in the sample is pushed down the tree but with the
+     # value of that SNP permuted with that of another individual in the sample.
+     #
+     # That way the difference between the generalization error and the error frequency with the SNP value modified can be estimated for any given SNP.
+     #
+     # This method computes importance estimations for every SNP used in the tree (for any other SNP it would be 0).
+     def estimate_importances(oob_ids)
+       return nil if (@generalization_error.nil? && generalization_error_from_oob(oob_ids).nil?)
+       oob_individuals_count = oob_ids.size
+       @importances = {}
+       @used_snps.uniq.each do |current_snp|
+         shuffled_ids = oob_ids.shuffle
+         permutated_snp_errors = 0.0
+         oob_ids.each_with_index {|oobi, index|
+           permutated_prediction = traverse_with_permutation @structure, individuals[oobi].snp_list, current_snp, individuals[shuffled_ids[index]].snp_list
+           permutated_snp_errors += 1 unless @id_to_fenotype[oobi] == permutated_prediction
+         }
+         @importances[current_snp] = ((permutated_snp_errors / oob_individuals_count) - @generalization_error).round(5)
+       end
+       @importances
+     end
+
+   end
+
+ end