greedyboruta 0.1.4__tar.gz

@@ -0,0 +1,27 @@
+ Copyright (c) 2025, Nicolas Vana Santos
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+ list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+ * Neither the name of greedy_boruta nor the names of its
+ contributors may be used to endorse or promote products derived from
+ this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,7 @@
+ # Include the license file
+ include LICENSE
+ include README.md
+
+ # Include the data files
+ include greedy_boruta/examples/test_X.csv
+ include greedy_boruta/examples/test_y.csv
@@ -0,0 +1,276 @@
+ Metadata-Version: 2.4
+ Name: greedyboruta
+ Version: 0.1.4
+ Summary: Python Implementation of GreedyBoruta Feature Selection
+ Home-page: https://github.com/Nicolas-Vana/GreedyBorutaPy
+ Download-URL: https://github.com/Nicolas-Vana/GreedyBorutaPy/tarball/0.1.4
+ Author: Nicolas Vana Santos
+ Author-email: nicolas.vana@gmail.com
+ License: BSD 3 clause
+ Keywords: feature selection,machine learning,random forest
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: numpy>=1.10.4
+ Requires-Dist: scikit-learn>=0.17.1
+ Requires-Dist: scipy>=0.17.0
+ Dynamic: author
+ Dynamic: author-email
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: download-url
+ Dynamic: home-page
+ Dynamic: keywords
+ Dynamic: license
+ Dynamic: license-file
+ Dynamic: requires-dist
+ Dynamic: summary
+
+ # GreedyBoruta
+
+ [![License](https://img.shields.io/github/license/Nicolas-Vana/GreedyBorutaPy)](https://github.com/Nicolas-Vana/GreedyBorutaPy/blob/master/LICENSE)
+ [![PyPI version](https://badge.fury.io/py/GreedyBoruta.svg)](https://badge.fury.io/py/GreedyBoruta)
+ ![Test Coverage](./coverage.svg)
+
+ A faster variant of the [Boruta all-relevant feature selection method](https://www.jstatsoft.org/article/view/v036i11) that uses **greedy feature confirmation**, achieving **5-40x speedups** by relaxing the confirmation criterion.
+
+ This implementation is a fork of [boruta_py](https://github.com/scikit-learn-contrib/boruta_py), with modifications focused on improving computational efficiency while maintaining statistical rigor.
+
+ **[Read the full article explaining the algorithm and experimental results](LINK_TO_BE_ADDED)**
+
+ ## Greedy Confirmation
+
+ Unlike the original Boruta algorithm, which requires features to reach statistical significance in a binomial test before confirmation, **GreedyBoruta confirms any feature that beats the maximum shadow importance at least once**. This simple change leads to:
+
+ - **5-40x faster convergence** on tested datasets
+ - **Automatic determination of max_iter** based on alpha (no manual tuning needed)
+ - **Equal or higher recall** compared to standard Boruta (provably cannot miss relevant features identified by the vanilla algorithm)
+ - **Guaranteed convergence** in O(log(1/alpha)) iterations
+
+ The algorithm automatically calculates the minimum number of iterations needed for a feature with zero hits to be rejected, log2(1/alpha), then runs until all features are confirmed or rejected (which happens at or before this limit).
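The hit-counting and greedy confirmation logic can be sketched in a few lines of NumPy. This is a toy illustration with made-up importance values, not the package's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up importances for 5 real and 5 shadow features over 4 iterations
# (rows = iterations); the values are purely illustrative.
real_imp = rng.random((4, 5))
shadow_imp = rng.random((4, 5))

# A "hit": a feature's importance beats the maximum shadow importance
# in a given iteration.
hits = (real_imp > shadow_imp.max(axis=1, keepdims=True)).sum(axis=0)

# Greedy rule: one hit is enough to confirm; features still at zero hits
# when the iteration budget runs out are rejected.
confirmed = hits >= 1
rejected = hits == 0
```

Every feature ends up in exactly one of the two sets, which is what guarantees convergence.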
+
+ ## How to Install
+
+ Install with `pip`:
+
+ ```shell
+ pip install greedyboruta
+ ```
+
+ or with `conda`:
+
+ ```shell
+ conda install -c conda-forge greedyboruta
+ ```
+
+ ## Dependencies
+
+ * numpy
+ * scipy
+ * scikit-learn
+
+ ## How to Use
+
+ The interface is identical to scikit-learn and boruta_py:
+
+ ```python
+ import pandas as pd
+ from sklearn.ensemble import RandomForestClassifier
+ from greedyboruta import GreedyBorutaPy
+
+ # load X and y
+ X = pd.read_csv('examples/test_X.csv', index_col=0).values
+ y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
+ y = y.ravel()
+
+ # define random forest classifier
+ rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
+
+ # define GreedyBoruta feature selection method
+ # max_iter is automatically determined based on alpha
+ feat_selector = GreedyBorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
+
+ # find all relevant features - typically 5-40x faster than standard Boruta
+ feat_selector.fit(X, y)
+
+ # check selected features
+ feat_selector.support_
+
+ # check ranking of features
+ feat_selector.ranking_
+
+ # transform X to selected features
+ X_filtered = feat_selector.transform(X)
+ ```
+
+ ## Philosophy: All-Relevant vs Minimal-Optimal
+
+ GreedyBoruta, like vanilla Boruta, follows the **all-relevant** feature selection philosophy. This means it aims to find **every feature that carries useful information**, not just the smallest set that achieves good prediction.
+
+ **Why this matters:**
+ - When you want to **understand a phenomenon** (not just predict it), you need all contributing factors
+ - In **scientific discovery** and **causal inference**, missing a relevant feature can lead to incorrect conclusions
+ - **Redundant features** (correlated with informative ones) are intentionally retained - they carry signal even if not strictly necessary
+ - Downstream **minimal-optimal methods** (RFE, LASSO, mRMR) can further reduce the feature set if needed
+
+ This philosophy justifies the greedy confirmation criterion: in all-relevant selection, **false negatives (missing relevant features) are more costly than false positives (including a few extra features)**. The relaxed criterion prioritizes high recall, which aligns perfectly with the all-relevant goal.
+
+ ## What's Different from Vanilla Boruta?
+
+ ### Core Algorithm Change
+
+ **Greedy Confirmation Criterion**: Features are confirmed immediately upon achieving **at least one hit** (beating the maximum shadow importance in any iteration) rather than after reaching statistical significance in a binomial test. The rejection criterion remains unchanged. This change:
+
+ 1. **Maintains or improves recall** - any feature confirmed by vanilla Boruta will also be confirmed by GreedyBoruta (since statistical significance requires at least one hit)
+ 2. **Enables guaranteed convergence** - all tentative features have exactly zero hits, simplifying the rejection test
+ 3. **Dramatically speeds up the process** - GreedyBoruta runs at most K iterations, where K is the iteration at which vanilla Boruta confirms or rejects its "first batch" of features
+ 4. **Trades a little specificity for speed** - the reduction in specificity caused by relaxing the confirmation criterion is small compared to the speed gains
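To make the contrast concrete, the two confirmation rules can be sketched side by side with scipy. The numbers are illustrative, and the sketch omits the multiple-testing corrections the real implementations apply:

```python
from scipy.stats import binomtest

alpha = 0.05
n_iter, k_hits = 8, 2  # illustrative: 2 hits in 8 iterations

# Vanilla Boruta: confirm only if the hit count is significantly above
# chance (one-sided binomial test against p0 = 0.5).
p_value = binomtest(k_hits, n_iter, p=0.5, alternative='greater').pvalue
vanilla_confirms = p_value < alpha   # 2/8 hits is not significant here

# GreedyBoruta: confirm as soon as the feature has at least one hit.
greedy_confirms = k_hits >= 1
```

With these numbers vanilla Boruta leaves the feature tentative while GreedyBoruta confirms it, which is exactly the recall-over-specificity trade described above.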
+
+ ### Automatic max_iter Calculation
+
+ Because all tentative features have exactly zero hits (confirmed features had at least one), the binomial test for rejection simplifies dramatically. The algorithm computes the minimum number of iterations needed for a feature with zero hits to be rejected at significance level alpha.
+
+ For a binomial test with p_0 = 0.5 and x = 0 hits:
+ ```
+ p-value = (1/2)^n < alpha
+ ```
+
+ Therefore: **max_iter = O(log2(1/alpha))**
+
+ This means that all features are sorted into confirmed or rejected within max_iter iterations - at that point, all remaining tentative features (with zero hits) are automatically rejected, and all features with hits > 0 are confirmed. No statistical testing is required during intermediate iterations.
+
+ **With FDR correction applied (as in boruta_py), max_iter values are:**
+ - alpha = 0.10: ~6 iterations
+ - alpha = 0.01: ~10 iterations
+ - alpha = 0.001: ~14 iterations
+ - alpha = 0.0001: ~18 iterations
+ - alpha = 0.00001: ~22 iterations
+
+ No manual tuning of max_iter is required!
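Ignoring the correction step, the uncorrected bound is easy to compute by hand (a sketch only; the corrected values in the table above are larger):

```python
import math

def uncorrected_max_iter(alpha: float) -> int:
    """Smallest n such that (1/2) ** n < alpha, i.e. n > log2(1/alpha)."""
    return math.floor(math.log2(1 / alpha)) + 1

# e.g. alpha = 0.05 gives 5 iterations, since (1/2) ** 5 = 0.03125 < 0.05
# while (1/2) ** 4 = 0.0625 is not.
```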
+
+ ## What's Inherited from boruta_py?
+
+ This implementation builds upon the excellent work in boruta_py and retains all its key improvements over the original R implementation:
+
+ * **Faster run times** thanks to scikit-learn
+ * **Scikit-learn interface** (fit, transform, fit_transform)
+ * **Compatible with any ensemble method** from scikit-learn
+ * **Automatic n_estimators selection**
+ * **Feature ranking**
+ * **Percentile threshold** (perc parameter) for more flexible shadow feature comparison
+ * **Two-step correction** (FDR + Bonferroni) for multiple testing
+
+ We highly recommend using pruned trees with a depth between 3 and 7, as suggested in the original boruta_py documentation.
+
+ ## Parameters
+
+ **estimator** : object
+ > A supervised learning estimator with a 'fit' method that provides the
+ > feature_importances_ attribute. Important features must correspond to
+ > high absolute values in the feature_importances_.
+
+ **n_estimators** : int or string, default = 1000
+ > If int, sets the number of estimators in the chosen ensemble method.
+ > If 'auto', this is determined automatically based on dataset size.
+
+ **perc** : int, default = 100
+ > Percentile of shadow feature importances to use as threshold.
+ > The default (100) uses the maximum, equivalent to vanilla Boruta.
+ > Lower values (e.g., 90) are less stringent and may select more features.
+
+ **alpha** : float, default = 0.05
+ > Significance level for the corrected p-values in both correction steps.
+ > Also automatically determines max_iter via the formula log2(1/alpha).
+ > A lower alpha means more conservative selection and more iterations.
+
+ **two_step** : Boolean, default = True
+ > If True, uses FDR + Bonferroni correction. If False, uses only
+ > Bonferroni correction (original Boruta behavior with perc=100).
+
+ **random_state** : int, RandomState instance or None, default = None
+ > Random seed for reproducibility.
+
+ **verbose** : int, default = 0
+ > Controls verbosity of output:
+ > 0 = silent, 1 = iteration counter, 2 = detailed statistics per iteration
+
+ ### Removed Parameters
+
+ Unlike vanilla Boruta implementations, GreedyBoruta does **not** require:
+ - **max_iter**: Automatically calculated from alpha
+ - **early_stopping**: Not needed due to guaranteed convergence
+ - **n_iter_no_change**: Not needed due to guaranteed convergence
+
+ This simplification improves usability and eliminates the need for manual tuning of convergence-related parameters.
+
+ ## Attributes
+
+ **n_features_** : int
+ > The number of selected features (confirmed only).
+
+ **support_** : array of shape [n_features]
+ > Boolean mask of selected features (confirmed features only).
+
+ **support_weak_** : array of shape [n_features]
+ > Boolean mask of tentative features that didn't gain enough support.
+
+ **ranking_** : array of shape [n_features]
+ > Feature ranking where confirmed features = 1, tentative features = 2,
+ > and rejected features have ranks ≥ 3 based on importance.
+
+ **importance_history_** : array of shape [n_iterations, n_features]
+ > Historical record of feature importances across all iterations.
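As a small illustration of the ranking semantics, the boolean masks can be recovered by hand from a ranking_ array (the array below is hypothetical, not package output):

```python
import numpy as np

# Hypothetical ranking_: 1 = confirmed, 2 = tentative, >= 3 = rejected.
ranking = np.array([1, 3, 1, 2, 5, 4, 1])

support = ranking == 1        # corresponds to support_
support_weak = ranking == 2   # corresponds to support_weak_
rejected = ranking >= 3       # rejected features, ranked by importance
n_features = int(support.sum())
```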
+
+ ## Performance Comparison
+
+ Based on synthetic experiments with known ground truth:
+
+ - **5-15x speedup** on challenging datasets when vanilla Boruta uses proper early stopping
+ - **Up to 40x speedup** when vanilla Boruta runs to full convergence without early stopping
+ - **Equal or higher recall** (never misses features that vanilla Boruta would find relevant)
+ - **Slightly lower specificity** (fewer than 10 extra features selected on the 500-feature datasets tested)
+ - **Guaranteed convergence** - all features are always classified (no tentative features remain)
+
+ ## When to Use GreedyBoruta
+
+ **Use GreedyBoruta when:**
+ - You want **all-relevant feature selection** with high recall
+ - You're working with **high-dimensional data** on which vanilla Boruta takes too long to run
+ - **Computational efficiency matters** (exploratory analysis, rapid prototyping, iterative workflows)
+ - False positives can be filtered in **downstream pipelines** (regularization, cross-validation, minimal-optimal selection)
+ - You want to avoid manually tuning max_iter or early stopping parameters
+
+ **Consider standard Boruta when:**
+ - You need **maximum specificity** and false positives are very costly
+ - Your dataset is **small enough** that speed isn't a concern
+ - **Statistical conservatism** is paramount for your application
+
+ ## References
+
+ 1. Kursa M., Rudnicki W., "Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010
+ 2. Homola D., "BorutaPy: An all-relevant feature selection method", https://github.com/scikit-learn-contrib/boruta_py
+
+ ## Credits
+
+ This implementation is built upon [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) by Daniel Homola, which itself is based on the original Boruta algorithm by Miron B. Kursa and Witold R. Rudnicki.
+
+ The greedy confirmation criterion and the automatic convergence calculation are novel contributions of this fork, based on findings by Nicolas Vana Santos and Estevão Batista do Prado.
+
+ ## Citation
+
+ If you use GreedyBoruta in your research, please cite both the original Boruta paper and the boruta_py implementation:
+
+ ```
+ @article{kursa2010feature,
+   title={Feature selection with the Boruta package},
+   author={Kursa, Miron B and Rudnicki, Witold R},
+   journal={Journal of Statistical Software},
+   volume={36},
+   number={11},
+   pages={1--13},
+   year={2010}
+ }
+ ```
+
+ ## License
+
+ This project maintains the same BSD-3-Clause license as boruta_py.