nimbus 2.2.1 → 2.4.0

Files changed (34)
  1. checksums.yaml +7 -0
  2. data/CODE_OF_CONDUCT.md +7 -0
  3. data/CONTRIBUTING.md +46 -0
  4. data/MIT-LICENSE.txt +1 -1
  5. data/README.md +141 -22
  6. data/bin/nimbus +2 -2
  7. data/lib/nimbus/classification_tree.rb +9 -12
  8. data/lib/nimbus/configuration.rb +27 -27
  9. data/lib/nimbus/forest.rb +8 -8
  10. data/lib/nimbus/loss_functions.rb +11 -0
  11. data/lib/nimbus/regression_tree.rb +8 -10
  12. data/lib/nimbus/tree.rb +54 -12
  13. data/lib/nimbus/version.rb +1 -1
  14. data/lib/nimbus.rb +2 -6
  15. data/spec/classification_tree_spec.rb +47 -47
  16. data/spec/configuration_spec.rb +55 -55
  17. data/spec/fixtures/{classification_config.yml → classification/config.yml} +3 -3
  18. data/spec/fixtures/classification/random_forest.yml +1174 -0
  19. data/spec/fixtures/{regression_config.yml → regression/config.yml} +4 -4
  20. data/spec/fixtures/regression/random_forest.yml +2737 -0
  21. data/spec/forest_spec.rb +39 -39
  22. data/spec/individual_spec.rb +3 -3
  23. data/spec/loss_functions_spec.rb +31 -13
  24. data/spec/nimbus_spec.rb +2 -2
  25. data/spec/regression_tree_spec.rb +44 -44
  26. data/spec/training_set_spec.rb +3 -3
  27. data/spec/tree_spec.rb +4 -4
  28. metadata +42 -39
  29. data/spec/fixtures/classification_random_forest.yml +0 -922
  30. data/spec/fixtures/regression_random_forest.yml +0 -1741
  31. data/spec/fixtures/{classification_testing.data → classification/testing.data} +0 -0
  32. data/spec/fixtures/{classification_training.data → classification/training.data} +0 -0
  33. data/spec/fixtures/{regression_testing.data → regression/testing.data} +0 -0
  34. data/spec/fixtures/{regression_training.data → regression/training.data} +0 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: fafe3e89add124de9435a9328e8a99817394057ed2c19352b1c6979849459e35
+   data.tar.gz: 27f46862abdeea053a1c01c5751017632a92c0f9ca5ad634dff7e17a22ce7640
+ SHA512:
+   metadata.gz: 9b18a6cee7368fa15c908c65341b90a9e5bacf1f8a866219cdb6f65e8dc2274ff8d340d837ca2c62b38039a1b02f74680d978ddc2f7fb52e6ddcedd93418b9da
+   data.tar.gz: d6e32ca7be26e7edd3d43854056a9f9f41f98048947003bc0d20a05cf2e177ae20998883af5ea1d93acaf863d6b77c495789fb57ce80719a05fd4f8682f75aa7
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,7 @@
+ # Contributor Code of Conduct
+
+ The Nimbus team is committed to fostering a welcoming community.
+
+ We adopt an inclusive Code of Conduct adapted from the Contributor Covenant, version 1.4. You can read it here: [Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
+
+
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,46 @@
+ # How to Contribute to this Project
+
+ ## Report an issue
+
+ The preferred way to report any bug is [opening an issue in the project's GitHub repo](https://github.com/xuanxu/nimbus/issues/new).
+
+ For more informal communication, you can contact [@xuanxu via twitter](https://twitter.com/xuanxu).
+
+ ## Resolve an issue
+
+ Pull requests are welcome. If you want to contribute code to solve an issue:
+
+ * Add a comment to tell everyone you are working on the issue.
+ * If an issue has someone assigned, it means that person is already working on it.
+ * Fork the project.
+ * Create a topic branch based on master.
+ * Commit your code there to solve the issue.
+ * Make sure all tests are passing (and add specs to test any new feature if needed).
+ * Follow these [best practices](https://github.com/styleguide/ruby).
+ * Open a *pull request* to the main repository describing what issue you are addressing.
+
+ ## Cleaning up
+
+ In the rush of time, things sometimes get messy; you can help us clean things up by:
+
+ * implementing pending specs
+ * increasing code coverage
+ * improving code quality
+ * updating dependencies
+ * making code consistent
+
+ ## Other ways of contributing without coding
+
+ * If you think there's a feature missing, or find a bug, create an issue (make sure it has not already been reported).
+ * You can also help promote the project by talking about it on your social networks.
+
+ ## How to report an issue
+
+ * Try to use a descriptive and to-the-point title.
+ * It is a good idea to include some of these sections:
+   * Steps to reproduce the bug
+   * Expected behaviour/response
+   * Actual response
+ * Sometimes it is also helpful if you mention your operating system or shell.
+
+ Thanks! :heart: :heart: :heart:
data/MIT-LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2011 Juanjo Bazán & Oscar González Recio
+ Copyright (c) 2017-2019 Juanjo Bazán & Oscar González Recio
 
  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,9 +1,14 @@
- # Nimbus [![Build Status](https://secure.travis-ci.org/xuanxu/nimbus.png?branch=master)](http://travis-ci.org/xuanxu/nimbus)
+ # Nimbus
  Random Forest algorithm for genomic selection.
 
+ [![Build Status](https://github.com/xuanxu/nimbus/actions/workflows/tests.yml/badge.svg)](https://github.com/xuanxu/nimbus/actions/workflows/tests.yml)
+ [![Gem Version](https://badge.fury.io/rb/nimbus.png)](http://badge.fury.io/rb/nimbus)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/xuanxu/nimbus/blob/master/MIT-LICENSE.txt)
+ [![DOI](http://joss.theoj.org/papers/10.21105/joss.00351/status.svg)](https://doi.org/10.21105/joss.00351)
+
  ## Random Forest
 
- The [random forest algorithm](http://en.wikipedia.org/wiki/Random_forest) is an classifier consisting in many random decision trees. It is based on choosing random subsets of variables for each tree and using the most frequent, or the averaged tree output as the overall classification. In machine learning terms, it is an ensembler classifier, so it uses multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
+ The [random forest algorithm](http://en.wikipedia.org/wiki/Random_forest) is a classifier consisting of many random decision trees. It is based on choosing random subsets of variables for each tree and using the most frequent, or the averaged, tree output as the overall classification. In machine learning terms, it is an ensemble classifier, so it uses multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
 
  The forest outputs the class that is the mean or the mode (in regression problems) or the majority class (in classification problems) of the node's output by individual trees.
 
@@ -35,14 +40,14 @@ Nimbus can be used to:
 
  Nimbus can be used both with regression and classification problems.
 
- **Regression**: is the default mode.
+ **Regression**: the default mode.
 
- * The split of nodes uses quadratic loss as loss function.
+ * The split of nodes uses quadratic loss as loss function.
  * Labeling of nodes is made averaging the fenotype values of the individuals in the node.
 
- **Classification**: user-activated declaring `classes` in the configuration file.
+ **Classification**: user-activated by declaring `classes` in the configuration file.
 
- * The split of nodes uses the Gini index as loss function.
+ * The split of nodes uses the Gini index as loss function.
  * Labeling of nodes is made finding the majority fenotype class of the individuals in the node.
 
  ## Variable importances
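For reference, the Gini index used when splitting classification nodes is 1 − Σₖ pₖ², where pₖ is the fraction of individuals in the node belonging to class k. A minimal, self-contained sketch of that computation (illustrative only; Nimbus's own version is `Nimbus::LossFunctions.gini_index`):

````ruby
# Gini impurity of a node: 1 - sum over classes of p_k^2.
# Illustrative sketch, not Nimbus's actual implementation.
def gini_impurity(labels)
  n = labels.size.to_f
  counts = Hash.new(0)
  labels.each { |label| counts[label] += 1 }
  1.0 - counts.values.sum { |c| (c / n)**2 }
end

gini_impurity(%w[A A A B]) # => 0.375 (mixed node)
gini_impurity(%w[A A A A]) # => 0.0   (pure node, nothing left to split)
````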
@@ -51,29 +56,36 @@ By default Nimbus will estimate SNP importances everytime a training file is run
 
  You can disable this behaviour (and speed up the training process) by setting the parameter `var_importances: No` in the configuration file.
 
- ## Install
+ ## Installation
+
+ You need to have [Ruby](https://www.ruby-lang.org) (2.6 or higher) with RubyGems installed on your computer. Then install Nimbus with:
 
- You need to have Ruby (1.9.2 or higher) and Rubygems installed in your computer. Then install nimbus with:
+ ````shell
+ > gem install nimbus
+ ````
 
- $ gem install nimbus
+ There are no extra dependencies needed.
 
  ## Getting Started
 
  Once you have nimbus installed in your system, you can run the gem using the `nimbus` executable:
 
- $ nimbus
+ ````shell
+ > nimbus
+ ````
 
- It will look for these files:
+ It will look for these files in the directory where Nimbus is running:
 
- * `training.data`: If found it will be used to build a random forest
- * `testing.data` : If found it will be pushed down the forest to obtain predictions for every individual in the file
- * `random_forest.yml`: If found it will be the forest used for the testing.
+ * `training.data`: If found, it will be used to build a random forest.
+ * `testing.data` : If found, it will be pushed down the forest to obtain predictions for every individual in the file.
+ * `random_forest.yml`: If found, it will be the forest used for the testing instead of building one.
+ * `config.yml`: A file detailing random forest parameters and datasets. If not found, default values will be used.
 
- That way in order to train a forest a training file is needed. And to do the testing you need two files: the testing file and one of the other two: the training OR the random_forest file, because nimbus needs a forest from which obtain the predictions.
+ That way, in order to train a forest a training file is needed. And to do the testing you need two files: the testing file and one of the other two (the training OR the random_forest file), because Nimbus needs a forest from which to obtain the predictions.
 
  ## Configuration (config.yml)
 
- The values for the input data files and the forest can be specified in the `config.yml` file that shouldbe locate in the directory where you are running `nimbus`.
+ The names for the input data files and the forest parameters can be specified in the `config.yml` file that should be located in the directory where you are running `nimbus`.
 
  The `config.yml` has the following structure and parameters:
 
@@ -91,14 +103,14 @@ The `config.yml` has the following structure and parameters:
    SNP_total_count: 200
    node_min_size: 5
 
- Under the input chapter:
+ ### Under the input chapter:
 
  * `training`: specify the path to the training data file (optional, if specified `nimbus` will create a random forest).
- * `testing`: specify the path to the testing data file (optional, if specified `nimbus` will traverse this dat through a random forest).
+ * `testing`: specify the path to the testing data file (optional, if specified `nimbus` will traverse this data through a random forest).
  * `forest`: specify the path to a file containing a random forest structure (optional, if there is also testing file, this will be the forest used for the testing).
  * `classes`: **optional (needed only for classification problems)**. Specify the list of classes in the input files as a comma separated list between squared brackets, e.g.:`[A, B]`.
 
- Under the forest chapter:
+ ### Under the forest chapter:
 
  * `forest_size`: number of trees for the forest.
  * `SNP_sample_size_mtry`: size of the random sample of SNPs to be used in every tree node.
@@ -106,6 +118,19 @@ Under the forest chapter:
  * `node_min_size`: minimum amount of individuals in a tree node to make a split.
  * `var_importances`: **optional**. If set to `No` Nimbus will not calculate SNP importances.
 
+ ### Default values
+
+ If there is no config.yml file present, Nimbus will use these default values:
+
+ ````yaml
+ forest_size: 300
+ tree_SNP_sample_size: 60
+ tree_SNP_total_count: 200
+ tree_node_min_size: 5
+ training_file: 'training.data'
+ testing_file: 'testing.data'
+ forest_file: 'forest.yml'
+ ````
 
  ## Input files
 
@@ -137,7 +162,83 @@ After training:
 
  After testing:
 
- * `testing_file_predictions.txt`: A file defining the structure of the computed Random Forest.
+ * `testing_file_predictions.txt`: A file detailing the predicted results for the testing dataset.
+
+ ## Example usage
+
+ ### Sample files
+
+ Sample files are located in the `/spec/fixtures` directory, both for regression and classification problems. They can be used as a starting point to tweak your own configurations.
+
+ Depending on the kind of problem you want to test, different files are needed:
+
+ ### Regression
+
+ **Test with a Random Forest created from a training data set**
+
+ Download/copy the `config.yml`, `training.data` and `testing.data` files from the [regression folder](./tree/master/spec/fixtures/regression).
+
+ Then run nimbus:
+
+ ````shell
+ > nimbus
+ ````
+
+ It should output a `random_forest.yml` file with the nodes and structure of the resulting random forest, the `generalization_errors` and `snp_importances` files, and the predictions for both training and testing datasets (`training_file_predictions.txt` and `testing_file_predictions.txt` files).
+
+ **Test with a previously created Random Forest**
+
+ Download/copy the `config.yml`, `testing.data` and `random_forest.yml` files from the [regression folder](./tree/master/spec/fixtures/regression).
+
+ Edit the `config.yml` file to comment/remove the training entry.
+
+ Then use nimbus to run the testing:
+
+ ````shell
+ > nimbus
+ ````
+
+ It should output a `testing_file_predictions.txt` file with the resulting predictions for the testing dataset using the given random forest.
+
+ ### Classification
+
+ **Test with a Random Forest created from a training data set**
+
+ Download/copy the `config.yml`, `training.data` and `testing.data` files from the [classification folder](./tree/master/spec/fixtures/classification).
+
+ Then run nimbus:
+
+ ````shell
+ > nimbus
+ ````
+
+ It should output a `random_forest.yml` file with the nodes and structure of the resulting random forest, the `generalization_errors` file, and the predictions for both training and testing datasets (`training_file_predictions.txt` and `testing_file_predictions.txt` files).
+
+ **Test with a previously created Random Forest**
+
+ Download/copy the `config.yml`, `testing.data` and `random_forest.yml` files from the [classification folder](./tree/master/spec/fixtures/classification).
+
+ Edit the `config.yml` file to comment/remove the training entry.
+
+ Then use nimbus to run the testing:
+
+ ````shell
+ > nimbus
+ ````
+
+ It should output a `testing_file_predictions.txt` file with the resulting predictions for the testing dataset using the given random forest.
+
+
+ ## Test suite
+
+ Nimbus includes a test suite located in the `spec` directory. The current state of the build is [publicly tracked by Travis CI](https://travis-ci.org/xuanxu/nimbus). You can run the specs locally if you clone the code to your local machine and run the default rake task:
+
+ ````shell
+ > git clone git://github.com/xuanxu/nimbus.git
+ > cd nimbus
+ > bundle install
+ > rake
+ ````
 
  ## Resources
 
@@ -149,8 +250,26 @@ After testing:
  * [Random Forest at Wikipedia](http://en.wikipedia.org/wiki/Random_forest)
  * [RF Leo Breiman page](http://www.stat.berkeley.edu/~breiman/RandomForests/)
 
- ## Credits
+
+ ## Contributing
+
+ Contributions are welcome. We encourage you to contribute to the Nimbus codebase.
+
+ Please read the [CONTRIBUTING](CONTRIBUTING.md) file.
+
+
+ ## Credits and DOI
+
+ If you use Nimbus, please cite our [JOSS paper: http://dx.doi.org/10.21105/joss.00351](http://dx.doi.org/10.21105/joss.00351).
+
+ You can find the citation info in [BibTeX format here](CITATION.bib).
+
+ **Cite as:**
+ *Bazán et al., (2017), Nimbus: a Ruby gem to implement Random Forest algorithms in a genomic selection context, Journal of Open Source Software, 2(16), 351, doi:10.21105/joss.00351*
 
  Nimbus was developed by [Juanjo Bazán](http://twitter.com/xuanxu) in collaboration with Oscar González-Recio.
 
- Copyright © 2011 - 2012 Juanjo Bazán, released under the MIT license
+
+ ## LICENSE
+
+ Copyright © Juanjo Bazán, released under the [MIT license](MIT-LICENSE.txt)
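Putting the configuration options documented above together, a complete `config.yml` for a classification run might look like the following; the file names and parameter values here are illustrative examples, not shipped defaults:

````yaml
input:
  training: training.data
  testing: testing.data
  classes: [A, B]          # omit this line for regression problems

forest:
  forest_size: 300
  SNP_sample_size_mtry: 60
  SNP_total_count: 200
  node_min_size: 5
  var_importances: No      # skip SNP importances to speed up training
````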
data/bin/nimbus CHANGED
@@ -1,7 +1,7 @@
  #!/usr/bin/env ruby
 
  #--
- # Copyright (c) 2011-2013 Juanjo Bazan
+ # Copyright (c) 2011-2019 Juanjo Bazan
  #
  # Permission is hereby granted, free of charge, to any person obtaining a copy
  # of this software and associated documentation files (the "Software"), to
@@ -23,4 +23,4 @@
  #++
 
  require 'nimbus'
- Nimbus.application.run
+ Nimbus.application.run
data/lib/nimbus/classification_tree.rb CHANGED
@@ -8,7 +8,7 @@ module Nimbus
  # * 1: Calculate loss function for the individuals in the node (first node contains all the individuals).
  # * 2: Take a random sample of the SNPs (size m << total count of SNPs)
  # * 3: Compute the loss function (default: gini index) for the split of the sample based on value of every SNP.
- # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in three nodes, based on value for that SNP [0, 1, or 2]
+ # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in two nodes, based on average value for that SNP [0,1][2], or [0][1,2]
  # * 5: Repeat from 1 for every node until:
  #    - a) The individuals count in that node is < minimum size OR
  #    - b) None of the SNP splits has a loss function smaller than the node loss function
@@ -34,8 +34,8 @@ module Nimbus
 
  # Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.
  #
- # * If SNP_min is the SNP with smaller loss function and it is < the loss function of the node, it splits the individuals sample in three:
- #   (those with value 0 for the SNP_min, those with value 1 for the SNP_min, and those with value 2 for the SNP_min) then it builds these 3 new nodes.
+ # * If SNP_min is the SNP with smaller loss function and it is < the loss function of the node, it splits the individuals sample in two:
+ #   (the average of the 0,1,2 values for the SNP_min in the individuals is computed, and they are split in [<=avg], [>avg]) then it builds these 2 new nodes.
  # * Otherwise every individual in the node gets labeled with the average of the fenotype values of all of them.
  def build_node(individuals_ids, y_hat)
    # General loss function value for the node
@@ -45,24 +45,21 @@ module Nimbus
 
    # Finding the SNP that minimizes loss function
    snps = snps_random_sample
-   min_loss, min_SNP, split, ginis = node_loss_function, nil, nil, nil
+   min_loss, min_SNP, split, split_type, ginis = node_loss_function, nil, nil, nil, nil
 
    snps.each do |snp|
-     individuals_split_by_snp_value = split_by_snp_value individuals_ids, snp
+     individuals_split_by_snp_value, node_split_type = split_by_snp_avegare_value individuals_ids, snp
      y_hat_0 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[0], @id_to_fenotype, @classes)
      y_hat_1 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[1], @id_to_fenotype, @classes)
-     y_hat_2 = Nimbus::LossFunctions.majority_class(individuals_split_by_snp_value[2], @id_to_fenotype, @classes)
-
+
      gini_0 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[0], @id_to_fenotype, @classes
      gini_1 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[1], @id_to_fenotype, @classes
-     gini_2 = Nimbus::LossFunctions.gini_index individuals_split_by_snp_value[2], @id_to_fenotype, @classes
      loss_snp = (individuals_split_by_snp_value[0].size * gini_0 +
-                 individuals_split_by_snp_value[1].size * gini_1 +
-                 individuals_split_by_snp_value[2].size * gini_2) / individuals_count
+                 individuals_split_by_snp_value[1].size * gini_1) / individuals_count
 
-     min_loss, min_SNP, split, ginis = loss_snp, snp, individuals_split_by_snp_value, [y_hat_0, y_hat_1, y_hat_2] if loss_snp < min_loss
+     min_loss, min_SNP, split, split_type, ginis = loss_snp, snp, individuals_split_by_snp_value, node_split_type, [y_hat_0, y_hat_1] if loss_snp < min_loss
    end
-   return build_branch(min_SNP, split, ginis, y_hat) if min_loss < node_loss_function
+   return build_branch(min_SNP, split, split_type, ginis, y_hat) if min_loss < node_loss_function
    return label_node(y_hat, individuals_ids)
  end
 
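The `split_by_snp_avegare_value` helper called above (the misspelling is the method's real name) is defined in `tree.rb`, whose hunk is not shown in this diff. A plausible sketch of the two-way split it returns, assuming it compares each individual's SNP value (0, 1 or 2) against the node average; the `@individuals` hash and `snp_list` accessor are assumptions for illustration, not taken from this diff:

````ruby
# Hypothetical reconstruction of the shared Tree helper; attribute names
# (@individuals, snp_list) are assumed, not confirmed by this diff.
def split_by_snp_avegare_value(individuals_ids, snp)
  values = individuals_ids.map { |i| @individuals[i].snp_list[snp] }
  avg = values.sum / values.size.to_f
  split = [[], []]
  individuals_ids.each_with_index do |id, idx|
    (values[idx] <= avg ? split[0] : split[1]) << id
  end
  # Which genotypes land on each side: [0][1,2] when avg < 1, else [0,1][2]
  split_type = avg < 1.0 ? [[0], [1, 2]] : [[0, 1], [2]]
  [split, split_type]
end
````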
data/lib/nimbus/configuration.rb CHANGED
@@ -36,26 +36,26 @@ module Nimbus
  )
 
  DEFAULTS = {
-   :forest_size => 500,
-   :tree_SNP_sample_size => 60,
-   :tree_SNP_total_count => 200,
-   :tree_node_min_size => 5,
+   forest_size: 300,
+   tree_SNP_sample_size: 60,
+   tree_SNP_total_count: 200,
+   tree_node_min_size: 5,
 
-   :loss_function_discrete => 'majority_class',
-   :loss_function_continuous => 'average',
+   loss_function_discrete: 'majority_class',
+   loss_function_continuous: 'average',
 
-   :training_file => 'training.data',
-   :testing_file => 'testing.data',
-   :forest_file => 'forest.yml',
-   :config_file => 'config.yml',
+   training_file: 'training.data',
+   testing_file: 'testing.data',
+   forest_file: 'forest.yml',
+   config_file: 'config.yml',
 
-   :output_forest_file => 'random_forest.yml',
-   :output_training_file => 'training_file_predictions.txt',
-   :output_testing_file => 'testing_file_predictions.txt',
-   :output_tree_errors_file => 'generalization_errors.txt',
-   :output_snp_importances_file => 'snp_importances.txt',
+   output_forest_file: 'random_forest.yml',
+   output_training_file: 'training_file_predictions.txt',
+   output_testing_file: 'testing_file_predictions.txt',
+   output_tree_errors_file: 'generalization_errors.txt',
+   output_snp_importances_file: 'snp_importances.txt',
 
-   :silent => false
+   silent: false
  }
 
  # Initialize a Nimbus::Configuration object.
@@ -85,10 +85,10 @@ module Nimbus
  # Accessor method for the tree-related subset of options.
  def tree
    {
-     :snp_sample_size => @tree_SNP_sample_size,
-     :snp_total_count => @tree_SNP_total_count,
-     :tree_node_min_size => @tree_node_min_size,
-     :classes => @classes
+     snp_sample_size: @tree_SNP_sample_size,
+     snp_total_count: @tree_SNP_total_count,
+     tree_node_min_size: @tree_node_min_size,
+     classes: @classes
    }
  end
 
@@ -105,7 +105,7 @@ module Nimbus
  def load(config_file = DEFAULTS[:config_file])
    user_config_params = {}
    dirname = Dir.pwd
-   if File.exists?(File.expand_path(config_file, Dir.pwd))
+   if File.exist?(File.expand_path(config_file, Dir.pwd))
      begin
        config_file_path = File.expand_path config_file, Dir.pwd
        user_config_params = Psych.load(File.open(config_file_path))
@@ -121,13 +121,13 @@ module Nimbus
      @forest_file = File.expand_path(user_config_params['input']['forest'], dirname) if user_config_params['input']['forest']
      @classes = user_config_params['input']['classes'] if user_config_params['input']['classes']
    else
-     @training_file = File.expand_path(DEFAULTS[:training_file], Dir.pwd) if File.exists? File.expand_path(DEFAULTS[:training_file], Dir.pwd)
-     @testing_file  = File.expand_path(DEFAULTS[:testing_file ], Dir.pwd) if File.exists? File.expand_path(DEFAULTS[:testing_file ], Dir.pwd)
-     @forest_file   = File.expand_path(DEFAULTS[:forest_file  ], Dir.pwd) if File.exists? File.expand_path(DEFAULTS[:forest_file  ], Dir.pwd)
+     @training_file = File.expand_path(DEFAULTS[:training_file], Dir.pwd) if File.exist? File.expand_path(DEFAULTS[:training_file], Dir.pwd)
+     @testing_file  = File.expand_path(DEFAULTS[:testing_file ], Dir.pwd) if File.exist? File.expand_path(DEFAULTS[:testing_file ], Dir.pwd)
+     @forest_file   = File.expand_path(DEFAULTS[:forest_file  ], Dir.pwd) if File.exist? File.expand_path(DEFAULTS[:forest_file  ], Dir.pwd)
    end
 
-   @do_training = true if @training_file
-   @do_testing  = true if @testing_file
+   @do_training = true unless @training_file.nil?
+   @do_testing  = true unless @testing_file.nil?
    @classes = @classes.map{|c| c.to_s.strip} if @classes
 
    if @do_testing && !@do_training && !@forest_file
@@ -186,7 +186,7 @@ module Nimbus
  # The format of the input file should be the same as the forest output data of a Nimbus Application.
  def load_forest
    trees = []
-   if File.exists?(@forest_file)
+   if File.exist?(@forest_file)
      begin
        trees = Psych.load(File.open @forest_file)
      rescue ArgumentError => e
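A short usage sketch of the configuration API touched in these hunks; `load` and the `tree` accessor appear in the diff above, and the values shown are the new `DEFAULTS` (so this reflects a run with no `config.yml` present):

````ruby
require 'nimbus'

config = Nimbus::Configuration.new
config.load   # reads ./config.yml if present, otherwise falls back to DEFAULTS
config.tree
# => { snp_sample_size: 60, snp_total_count: 200,
#      tree_node_min_size: 5, classes: nil }  # classes is nil in regression mode
````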
data/lib/nimbus/forest.rb CHANGED
@@ -88,6 +88,14 @@ module Nimbus
    @trees.to_yaml
  end
 
+ def classification?
+   @options.tree[:classes]
+ end
+
+ def regression?
+   @options.tree[:classes].nil?
+ end
+
  private
 
  def individuals_random_sample
@@ -140,14 +148,6 @@ module Nimbus
    }
  end
 
- def classification?
-   @options.tree[:classes]
- end
-
- def regression?
-   @options.tree[:classes].nil?
- end
-
  end
 
  end
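With `classification?` and `regression?` moved above `private`, callers can now query a forest's mode directly. A hedged usage sketch, assuming `Nimbus::Forest.new` takes a configuration object (as the `@options` ivar suggests, though the constructor is not shown in this diff):

````ruby
forest = Nimbus::Forest.new(config)  # config: a loaded Nimbus::Configuration (assumed constructor)
if forest.classification?
  puts 'Nodes split on the Gini index'
elsif forest.regression?
  puts 'Nodes split on quadratic loss'
end
````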
data/lib/nimbus/loss_functions.rb CHANGED
@@ -35,6 +35,17 @@ module Nimbus
  def squared_difference(x,y)
    0.0 + (x-y)**2
  end
+
+ # Simplified Huber function
+ def pseudo_huber_error(ids, value_table, mean = nil)
+   mean ||= self.average ids, value_table
+   ids.inject(0.0){|sum, i| sum + (Math.log(Math.cosh(value_table[i] - mean))) }
+ end
+
+ # Simplified Huber loss function: PHE / n
+ def pseudo_huber_loss(ids, value_table, mean = nil)
+   self.pseudo_huber_error(ids, value_table, mean) / ids.size
+ end
 
  ## CLASSSIFICATION
 
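The new pseudo-Huber error added above is the log-cosh loss, Σᵢ log(cosh(yᵢ − ȳ)): approximately quadratic near the mean and linear far from it, so it is less sensitive to outliers than squared error. A quick check with round numbers (results are approximate):

````ruby
values = { 1 => 1.0, 2 => 2.0, 3 => 3.0 }
ids    = [1, 2, 3]  # average phenotype is 2.0

Nimbus::LossFunctions.pseudo_huber_error(ids, values)
# => log(cosh(-1)) + log(cosh(0)) + log(cosh(1)) ≈ 0.8676
Nimbus::LossFunctions.pseudo_huber_loss(ids, values)
# => the error divided by n = 3 ≈ 0.2892
````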
data/lib/nimbus/regression_tree.rb CHANGED
@@ -8,7 +8,7 @@ module Nimbus
  # * 1: Calculate loss function for the individuals in the node (first node contains all the individuals).
  # * 2: Take a random sample of the SNPs (size m << total count of SNPs)
  # * 3: Compute the loss function (quadratic loss) for the split of the sample based on value of every SNP.
- # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in three nodes, based on value for that SNP [0, 1, or 2]
+ # * 4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in two nodes, based on average value for that SNP [0,1][2], or [0][1,2]
  # * 5: Repeat from 1 for every node until:
  #    - a) The individuals count in that node is < minimum size OR
  #    - b) None of the SNP splits has a loss function smaller than the node loss function
@@ -27,8 +27,8 @@ module Nimbus
 
  # Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.
  #
- # * If SNP_min is the SNP with smaller loss function and it is < the loss function of the node, it splits the individuals sample in three:
- #   (those with value 0 for the SNP_min, those with value 1 for the SNP_min, and those with value 2 for the SNP_min) then it builds these 3 new nodes.
+ # * If SNP_min is the SNP with smaller loss function and it is < the loss function of the node, it splits the individuals sample in two:
+ #   (the average of the 0,1,2 values for the SNP_min in the individuals is computed, and they are split in [<=avg], [>avg]) then it builds these 2 new nodes.
  # * Otherwise every individual in the node gets labeled with the average of the fenotype values of all of them.
  def build_node(individuals_ids, y_hat)
    # General loss function value for the node
@@ -38,22 +38,20 @@ module Nimbus
 
    # Finding the SNP that minimizes loss function
    snps = snps_random_sample
-   min_loss, min_SNP, split, means = node_loss_function, nil, nil, nil
+   min_loss, min_SNP, split, split_type, means = node_loss_function, nil, nil, nil, nil
 
    snps.each do |snp|
-     individuals_split_by_snp_value = split_by_snp_value individuals_ids, snp
+     individuals_split_by_snp_value, node_split_type = split_by_snp_avegare_value individuals_ids, snp
      mean_0 = Nimbus::LossFunctions.average individuals_split_by_snp_value[0], @id_to_fenotype
      mean_1 = Nimbus::LossFunctions.average individuals_split_by_snp_value[1], @id_to_fenotype
-     mean_2 = Nimbus::LossFunctions.average individuals_split_by_snp_value[2], @id_to_fenotype
      loss_0 = Nimbus::LossFunctions.mean_squared_error individuals_split_by_snp_value[0], @id_to_fenotype, mean_0
      loss_1 = Nimbus::LossFunctions.mean_squared_error individuals_split_by_snp_value[1], @id_to_fenotype, mean_1
-     loss_2 = Nimbus::LossFunctions.mean_squared_error individuals_split_by_snp_value[2], @id_to_fenotype, mean_2
-     loss_snp = (loss_0 + loss_1 + loss_2) / individuals_count
+     loss_snp = (loss_0 + loss_1) / individuals_count
 
-     min_loss, min_SNP, split, means = loss_snp, snp, individuals_split_by_snp_value, [mean_0, mean_1, mean_2] if loss_snp < min_loss
+     min_loss, min_SNP, split, split_type, means = loss_snp, snp, individuals_split_by_snp_value, node_split_type, [mean_0, mean_1] if loss_snp < min_loss
    end
 
-   return build_branch(min_SNP, split, means, y_hat) if min_loss < node_loss_function
+   return build_branch(min_SNP, split, split_type, means, y_hat) if min_loss < node_loss_function
    return label_node(y_hat, individuals_ids)
  end
 
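In the regression split above, `loss_snp = (loss_0 + loss_1) / individuals_count` averages correctly only if `mean_squared_error` returns each branch's *sum* of squared deviations from its mean; under that assumption, the arithmetic for a toy five-individual split looks like this:

````ruby
# Toy phenotypes after a candidate two-way split of 5 individuals.
left  = [1.0, 2.0, 3.0]  # mean 2.0; squared deviations 1 + 0 + 1 = 2.0
right = [4.0, 6.0]       # mean 5.0; squared deviations 1 + 1     = 2.0

sum_sq   = ->(xs) { m = xs.sum / xs.size; xs.sum { |x| (x - m)**2 } }
loss_snp = (sum_sq.(left) + sum_sq.(right)) / 5.0
# => 0.8, compared against the unsplit node's loss before accepting the split
````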