greedyboruta 0.1.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- greedyboruta-0.1.4/LICENSE +27 -0
- greedyboruta-0.1.4/MANIFEST.in +7 -0
- greedyboruta-0.1.4/PKG-INFO +276 -0
- greedyboruta-0.1.4/README.md +249 -0
- greedyboruta-0.1.4/greedy_boruta/GreedyBoruta.py +578 -0
- greedyboruta-0.1.4/greedy_boruta/__init__.py +5 -0
- greedyboruta-0.1.4/greedy_boruta/examples/test_y.csv +1000 -0
- greedyboruta-0.1.4/greedyboruta.egg-info/PKG-INFO +276 -0
- greedyboruta-0.1.4/greedyboruta.egg-info/SOURCES.txt +15 -0
- greedyboruta-0.1.4/greedyboruta.egg-info/dependency_links.txt +1 -0
- greedyboruta-0.1.4/greedyboruta.egg-info/requires.txt +3 -0
- greedyboruta-0.1.4/greedyboruta.egg-info/top_level.txt +1 -0
- greedyboruta-0.1.4/pyproject.toml +3 -0
- greedyboruta-0.1.4/setup.cfg +4 -0
- greedyboruta-0.1.4/setup.py +21 -0
- greedyboruta-0.1.4/test/test_greedy_boruta.py +507 -0
- greedyboruta-0.1.4/test/test_greedy_boruta_smoke.py +195 -0
greedyboruta-0.1.4/LICENSE
@@ -0,0 +1,27 @@
Copyright (c) 2025, Nicolas Vana Santos
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of greedy_boruta nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,276 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: greedyboruta
|
|
3
|
+
Version: 0.1.4
|
|
4
|
+
Summary: Python Implementation of GreedyBoruta Feature Selection
|
|
5
|
+
Home-page: https://github.com/Nicolas-Vana/GreedyBorutaPy
|
|
6
|
+
Download-URL: https://github.com/Nicolas-Vana/GreedyBorutaPy/tarball/0.1.5
|
|
7
|
+
Author: Nicolas Vana Santos
|
|
8
|
+
Author-email: nicolas.vana@gmail.com
|
|
9
|
+
License: BSD 3 clause
|
|
10
|
+
Keywords: feature selection,machine learning,random forest
|
|
11
|
+
Description-Content-Type: text/markdown
|
|
12
|
+
License-File: LICENSE
|
|
13
|
+
Requires-Dist: numpy>=1.10.4
|
|
14
|
+
Requires-Dist: scikit-learn>=0.17.1
|
|
15
|
+
Requires-Dist: scipy>=0.17.0
|
|
16
|
+
Dynamic: author
|
|
17
|
+
Dynamic: author-email
|
|
18
|
+
Dynamic: description
|
|
19
|
+
Dynamic: description-content-type
|
|
20
|
+
Dynamic: download-url
|
|
21
|
+
Dynamic: home-page
|
|
22
|
+
Dynamic: keywords
|
|
23
|
+
Dynamic: license
|
|
24
|
+
Dynamic: license-file
|
|
25
|
+
Dynamic: requires-dist
|
|
26
|
+
Dynamic: summary
|
|
27
|
+
|
|
28
|
+
# GreedyBoruta
|
|
29
|
+
|
|
30
|
+
[](https://github.com/Nicolas-Vana/GreedyBorutaPy/blob/master/LICENSE)
|
|
31
|
+
[](https://badge.fury.io/py/GreedyBoruta)
|
|
32
|
+

|
|
33
|
+
|
|
34
|
+
A faster variant of the [Boruta all-relevant feature selection method](https://www.jstatsoft.org/article/view/v036i11) with **greedy feature confirmation** that achieves **5-40x speedups** through a confirmation criterion relaxation.
|
|
35
|
+
|
|
36
|
+
This implementation is a fork of [boruta_py](https://github.com/scikit-learn-contrib/boruta_py), with modifications focused on improving computational efficiency while maintaining statistical rigor.
|
|
37
|
+
|
|
38
|
+
**[Read the full article explaining the algorithm and experimental results](LINK_TO_BE_ADDED)**
|
|
39
|
+
|
|
40
|
+
## Greedy Confirmation
|
|
41
|
+
|
|
42
|
+
Unlike the original Boruta algorithm which requires features to achieve statistical significance through binomial testing before confirmation, **GreedyBoruta confirms any feature that beats the maximum shadow importance at least once**. This simple change leads to:
|
|
43
|
+
|
|
44
|
+
- **5-40x faster convergence** on tested datasets
|
|
45
|
+
- **Automatic determination of max_iter** based on alpha (no manual tuning needed)
|
|
46
|
+
- **Equal or higher recall** compared to standard Boruta (provably cannot miss relevant features that are identified by the vanilla algorithm)
|
|
47
|
+
- **Guaranteed convergence** in O(-log alpha) iterations
|
|
48
|
+
|
|
49
|
+
The algorithm automatically calculates the minimum iterations needed for a feature with zero hits to be rejected as log2(1/alpha), then runs until all features are confirmed or rejected (which occurs at or before this limit).
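The loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the package's API: `greedy_confirmation` and its signature are hypothetical, and the per-iteration importances are assumed to come from some external scorer such as a freshly fitted random forest.

```python
import math

def greedy_confirmation(importances_per_iter, n_features, alpha=0.05):
    """Illustrative greedy Boruta loop (hypothetical helper, not the package API).

    importances_per_iter yields one (real_importances, shadow_importances)
    pair per iteration, e.g. from a random forest refitted on X plus
    shuffled shadow copies of its columns.
    """
    # Smallest n with (1/2)**n < alpha: after this many zero-hit
    # iterations, a tentative feature can be rejected.
    max_iter = math.ceil(math.log2(1 / alpha))
    confirmed, tentative = set(), set(range(n_features))
    for iteration, (real, shadow) in enumerate(importances_per_iter):
        if iteration >= max_iter or not tentative:
            break
        threshold = max(shadow)  # perc=100: the maximum shadow importance
        hits = {f for f in tentative if real[f] > threshold}
        confirmed |= hits        # greedy rule: a single hit confirms immediately
        tentative -= hits
    return confirmed, tentative  # leftover tentative features are rejected
```

With alpha = 0.05 this runs at most ceil(log2(20)) = 5 iterations; a feature that beats the shadow threshold even once is confirmed on the spot, and everything still tentative at the end is rejected.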
## How to Install

Install with `pip`:

```shell
pip install greedyboruta
```

or with `conda`:

```shell
conda install -c conda-forge greedyboruta
```

## Dependencies

* numpy
* scipy
* scikit-learn

## How to Use

The interface is identical to scikit-learn and boruta_py:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from greedyboruta import GreedyBorutaPy

# load X and y
X = pd.read_csv('examples/test_X.csv', index_col=0).values
y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
y = y.ravel()

# define random forest classifier
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define the GreedyBoruta feature selection method
# max_iter is determined automatically from alpha
feat_selector = GreedyBorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features - typically 5-40x faster than standard Boruta
feat_selector.fit(X, y)

# check selected features
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# transform X to selected features
X_filtered = feat_selector.transform(X)
```

## Philosophy: All-Relevant vs Minimal-Optimal

GreedyBoruta, like vanilla Boruta, follows the **all-relevant** feature selection philosophy: it aims to find **every feature that carries useful information**, not just the smallest set that achieves good prediction.

**Why this matters:**
- When you want to **understand a phenomenon** (not just predict it), you need all contributing factors
- In **scientific discovery** and **causal inference**, missing a relevant feature can lead to incorrect conclusions
- **Redundant features** (correlated with informative ones) are intentionally retained - they carry signal even if not strictly necessary
- Downstream **minimal-optimal methods** (RFE, LASSO, mRMR) can further reduce the feature set if needed

This philosophy justifies the greedy confirmation criterion: in all-relevant selection, **false negatives (missing relevant features) are more costly than false positives (including a few extra features)**. The relaxed criterion prioritizes high recall, which aligns with the all-relevant goal.

## What's Different from Vanilla Boruta?

### Core Algorithm Change

**Greedy Confirmation Criterion**: Features are confirmed immediately upon achieving **at least one hit** (beating the maximum shadow importance in any iteration), rather than having to reach statistical significance in a binomial test. The rejection criterion remains unchanged. This change:

1. **Maintains or improves recall** - any feature confirmed by vanilla Boruta will also be confirmed by GreedyBoruta (since statistical significance requires at least one hit)
2. **Enables guaranteed convergence** - all tentative features have exactly zero hits, simplifying the rejection test
3. **Dramatically speeds up the process** - GreedyBoruta runs for at most K iterations, where K is the iteration at which vanilla Boruta confirms or rejects its "first batch" of features
4. **Trades slight specificity for speed** - the reduction in specificity caused by relaxing the confirmation criterion is small relative to the speed gains

### Automatic max_iter Calculation

Because all tentative features have exactly zero hits (confirmed features had at least one), the binomial test for rejection simplifies dramatically. The algorithm computes the minimum number of iterations needed for a feature with zero hits to be rejected at significance level alpha.

For a binomial test with p_0 = 0.5 and x = 0 hits:
```
p-value = (1/2)^n < alpha
```

Therefore: **max_iter = ceil(log2(1/alpha))** (multiple-testing corrections increase this slightly).

This means every feature is sorted into confirmed or rejected within max_iter iterations: at that point, all remaining tentative features (with zero hits) are rejected, while all features with at least one hit have already been confirmed. No statistical testing is required during intermediate iterations.

**With FDR correction applied (as in boruta_py), max_iter values are:**
- alpha = 0.10: ~6 iterations
- alpha = 0.01: ~10 iterations
- alpha = 0.001: ~14 iterations
- alpha = 0.0001: ~18 iterations
- alpha = 0.00001: ~22 iterations

No manual tuning of max_iter is required!
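The uncorrected bound is easy to compute directly. A minimal sketch (the table above shows larger values because boruta_py layers FDR/Bonferroni corrections on top of this raw bound; `uncorrected_max_iter` is an illustrative name, not a package function):

```python
import math

def uncorrected_max_iter(alpha):
    # Smallest n with (1/2)**n < alpha, i.e. ceil(log2(1/alpha)).
    return math.ceil(math.log2(1 / alpha))

for alpha in (0.1, 0.01, 0.001, 0.0001, 0.00001):
    print(f"alpha={alpha}: {uncorrected_max_iter(alpha)} iterations")
```

For example, alpha = 0.001 gives ceil(log2(1000)) = 10 iterations before any correction.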
## What's Inherited from boruta_py?

This implementation builds upon the excellent work in boruta_py and retains all its key improvements over the original R implementation:

* **Faster run times** thanks to scikit-learn
* **Scikit-learn interface** (fit, transform, fit_transform)
* **Compatible with any ensemble method** from scikit-learn
* **Automatic n_estimators selection**
* **Feature ranking**
* **Percentile threshold** (perc parameter) for more flexible shadow feature comparison
* **Two-step correction** (FDR + Bonferroni) for multiple testing

We highly recommend using pruned trees with a depth between 3 and 7, as suggested in the original boruta_py documentation.

## Parameters

**estimator** : object
> A supervised learning estimator with a 'fit' method that produces the
> feature_importances_ attribute. Important features must correspond to
> high absolute values in feature_importances_.

**n_estimators** : int or string, default = 1000
> If int, sets the number of estimators in the chosen ensemble method.
> If 'auto', this is determined automatically based on dataset size.

**perc** : int, default = 100
> Percentile of shadow feature importances to use as the threshold.
> The default (100) uses the maximum, equivalent to vanilla Boruta.
> Lower values (e.g., 90) are less stringent and may select more features.

**alpha** : float, default = 0.05
> Significance level for the corrected p-values in both correction steps.
> Also automatically determines max_iter via the formula log2(1/alpha).
> Lower alpha = more conservative selection + more iterations.

**two_step** : Boolean, default = True
> If True, uses FDR + Bonferroni correction. If False, uses only
> Bonferroni correction (original Boruta behavior with perc=100).

**random_state** : int, RandomState instance or None, default = None
> Random seed for reproducibility.

**verbose** : int, default = 0
> Controls verbosity of output:
> 0 = silent, 1 = iteration counter, 2 = detailed statistics per iteration

### Removed Parameters

Unlike vanilla Boruta implementations, GreedyBoruta does **not** require:
- **max_iter**: Automatically calculated from alpha
- **early_stopping**: Not needed due to guaranteed convergence
- **n_iter_no_change**: Not needed due to guaranteed convergence

This simplification improves usability and eliminates manual tuning of convergence-related parameters.

## Attributes

**n_features_** : int
> The number of selected features (confirmed only).

**support_** : array of shape [n_features]
> Boolean mask of selected features (confirmed features only).

**support_weak_** : array of shape [n_features]
> Boolean mask of tentative features that didn't gain enough support.

**ranking_** : array of shape [n_features]
> Feature ranking where confirmed features = 1, tentative features = 2,
> and rejected features have ranks ≥ 3 based on importance.

**importance_history_** : array of shape [n_iterations, n_features]
> Historical record of feature importances across all iterations.
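As an illustration of how these attributes relate to each other, here is a hand-made example with the same semantics (the arrays and feature names are made up for illustration, not the output of a fitted selector):

```python
# Made-up masks mirroring the attribute semantics documented above.
feature_names = ["age", "income", "noise_1", "noise_2"]
support_      = [True,  True,  False, False]   # confirmed features
support_weak_ = [False, False, True,  False]   # tentative features
ranking_      = [1,     1,     2,     3]       # 1=confirmed, 2=tentative, >=3 rejected

confirmed = [n for n, keep in zip(feature_names, support_) if keep]
tentative = [n for n, weak in zip(feature_names, support_weak_) if weak]
print("confirmed:", confirmed)   # confirmed: ['age', 'income']
print("tentative:", tentative)   # tentative: ['noise_1']
```

Given GreedyBoruta's guaranteed convergence, support_weak_ should end up all-False after a full run; the attribute is presumably retained for interface compatibility with boruta_py.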
## Performance Comparison

Based on synthetic experiments with known ground truth:

- **5-15x speedup** on challenging datasets when vanilla Boruta uses proper early stopping
- **Up to 40x speedup** when vanilla Boruta runs to full convergence without early stopping
- **Equal or higher recall** (never misses features that vanilla Boruta would find relevant)
- **Slightly lower specificity** (fewer than 10 extra features selected on the 500-feature datasets tested)
- **Guaranteed convergence** - all features are always classified (no tentative features remain)

## When to Use GreedyBoruta

**Use GreedyBoruta when:**
- You want **all-relevant feature selection** with high recall
- You're working with **high-dimensional data** on which vanilla Boruta takes too long to run
- **Computational efficiency matters** (exploratory analysis, rapid prototyping, iterative workflows)
- False positives can be filtered in **downstream pipelines** (regularization, cross-validation, minimal-optimal selection)
- You want to avoid manually tuning max_iter or early stopping parameters

**Consider standard Boruta when:**
- You need **maximum specificity** and false positives are very costly
- Your dataset is **small enough** that speed isn't a concern
- **Statistical conservatism** is paramount for your application

## References

1. Kursa M., Rudnicki W., "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010
2. Homola D., "BorutaPy: An all-relevant feature selection method", https://github.com/scikit-learn-contrib/boruta_py

## Credits

This implementation is built upon [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) by Daniel Homola, which is itself based on the original Boruta algorithm by Miron B. Kursa and Witold R. Rudnicki.

The greedy confirmation criterion and automatic convergence calculation are novel contributions of this fork, based on findings by Nicolas Vana Santos and Estevão Batista do Prado.

## Citation

If you use GreedyBoruta in your research, please cite both the original Boruta paper and the boruta_py implementation:

```
@article{kursa2010feature,
  title={Feature selection with the Boruta package},
  author={Kursa, Miron B and Rudnicki, Witold R},
  journal={Journal of Statistical Software},
  volume={36},
  number={11},
  pages={1--13},
  year={2010}
}
```

## License

This project maintains the same BSD-3-Clause license as boruta_py.
greedyboruta-0.1.4/README.md
@@ -0,0 +1,249 @@
(The added README.md is identical to the package description embedded in PKG-INFO above.)