balancr 0.1.0__tar.gz
- balancr-0.1.0/LICENSE +21 -0
- balancr-0.1.0/MANIFEST.in +4 -0
- balancr-0.1.0/PKG-INFO +536 -0
- balancr-0.1.0/README.md +495 -0
- balancr-0.1.0/pyproject.toml +5 -0
- balancr-0.1.0/requirements.txt +15 -0
- balancr-0.1.0/setup.cfg +4 -0
- balancr-0.1.0/setup.py +72 -0
- balancr-0.1.0/src/balancr/__init__.py +13 -0
- balancr-0.1.0/src/balancr/base.py +14 -0
- balancr-0.1.0/src/balancr/classifier_registry.py +300 -0
- balancr-0.1.0/src/balancr/cli/__init__.py +0 -0
- balancr-0.1.0/src/balancr/cli/commands.py +1838 -0
- balancr-0.1.0/src/balancr/cli/config.py +165 -0
- balancr-0.1.0/src/balancr/cli/main.py +778 -0
- balancr-0.1.0/src/balancr/cli/utils.py +101 -0
- balancr-0.1.0/src/balancr/data/__init__.py +5 -0
- balancr-0.1.0/src/balancr/data/loader.py +59 -0
- balancr-0.1.0/src/balancr/data/preprocessor.py +556 -0
- balancr-0.1.0/src/balancr/evaluation/__init__.py +19 -0
- balancr-0.1.0/src/balancr/evaluation/metrics.py +442 -0
- balancr-0.1.0/src/balancr/evaluation/visualisation.py +660 -0
- balancr-0.1.0/src/balancr/imbalance_analyser.py +677 -0
- balancr-0.1.0/src/balancr/technique_registry.py +284 -0
- balancr-0.1.0/src/balancr/techniques/__init__.py +4 -0
- balancr-0.1.0/src/balancr/techniques/custom/__init__.py +0 -0
- balancr-0.1.0/src/balancr/techniques/custom/example_custom_technique.py +27 -0
- balancr-0.1.0/src/balancr.egg-info/PKG-INFO +536 -0
- balancr-0.1.0/src/balancr.egg-info/SOURCES.txt +36 -0
- balancr-0.1.0/src/balancr.egg-info/dependency_links.txt +1 -0
- balancr-0.1.0/src/balancr.egg-info/entry_points.txt +2 -0
- balancr-0.1.0/src/balancr.egg-info/not-zip-safe +1 -0
- balancr-0.1.0/src/balancr.egg-info/requires.txt +14 -0
- balancr-0.1.0/src/balancr.egg-info/top_level.txt +1 -0
- balancr-0.1.0/tests/test_base.py +50 -0
- balancr-0.1.0/tests/test_classifier_registry.py +650 -0
- balancr-0.1.0/tests/test_imbalance_analyser.py +548 -0
- balancr-0.1.0/tests/test_technique_registry.py +318 -0
balancr-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
MIT License
|
2
|
+
|
3
|
+
Copyright (c) 2025 Conor Doherty
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
7
|
+
in the Software without restriction, including without limitation the rights
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
10
|
+
furnished to do so, subject to the following conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
13
|
+
copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21
|
+
SOFTWARE.
|
balancr-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,536 @@
|
|
1
|
+
Metadata-Version: 2.1
|
2
|
+
Name: balancr
|
3
|
+
Version: 0.1.0
|
4
|
+
Summary: A unified framework for analysing and comparing techniques for handling imbalanced datasets
|
5
|
+
Home-page: https://github.com/Ruaskill/balancr
|
6
|
+
Author: Conor Doherty
|
7
|
+
Author-email: ruaskillz1@gmail.com
|
8
|
+
License: MIT
|
9
|
+
Project-URL: Documentation, https://github.com/Ruaskill/balancr/blob/main/README.md
|
10
|
+
Project-URL: Source, https://github.com/Ruaskill/balancr
|
11
|
+
Project-URL: Issues, https://github.com/Ruaskill/balancr/issues
|
12
|
+
Keywords: machine learning,imbalanced data,data balancing,classification,resampling,oversampling,undersampling,SMOTE,ADASYN,imbalanced learning
|
13
|
+
Classifier: Development Status :: 4 - Beta
|
14
|
+
Classifier: Intended Audience :: Science/Research
|
15
|
+
Classifier: Intended Audience :: Developers
|
16
|
+
Classifier: License :: OSI Approved :: MIT License
|
17
|
+
Classifier: Programming Language :: Python :: 3
|
18
|
+
Classifier: Programming Language :: Python :: 3.8
|
19
|
+
Classifier: Programming Language :: Python :: 3.9
|
20
|
+
Classifier: Programming Language :: Python :: 3.10
|
21
|
+
Classifier: Programming Language :: Python :: 3.11
|
22
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
23
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
24
|
+
Classifier: Operating System :: OS Independent
|
25
|
+
Requires-Python: >=3.8
|
26
|
+
Description-Content-Type: text/markdown
|
27
|
+
License-File: LICENSE
|
28
|
+
Requires-Dist: numpy>=1.21.0
|
29
|
+
Requires-Dist: pandas>=1.3.0
|
30
|
+
Requires-Dist: scikit-learn>=1.0.0
|
31
|
+
Requires-Dist: matplotlib>=3.4.0
|
32
|
+
Requires-Dist: seaborn>=0.11.0
|
33
|
+
Requires-Dist: imbalanced-learn>=0.8.0
|
34
|
+
Requires-Dist: openpyxl>=3.0.0
|
35
|
+
Requires-Dist: colorama>=0.4.4
|
36
|
+
Requires-Dist: plotly>=6.0.1
|
37
|
+
Provides-Extra: dev
|
38
|
+
Requires-Dist: pytest>=6.0.0; extra == "dev"
|
39
|
+
Requires-Dist: black>=21.0.0; extra == "dev"
|
40
|
+
Requires-Dist: flake8>=3.9.0; extra == "dev"
|
41
|
+
|
42
|
+
# Balancr: A Unified Framework for Analysing Data Balancing Techniques
|
43
|
+
|
44
|
+
A comprehensive framework and CLI tool for analysing and comparing different techniques for handling imbalanced datasets in machine learning. Balancr makes it easier to compare balancing algorithms against a wide range of classifiers.
|
45
|
+
|
46
|
+
## Overview
|
47
|
+
|
48
|
+
Imbalanced datasets are a significant challenge in machine learning, particularly in areas such as:
|
49
|
+
- Medical diagnosis
|
50
|
+
- Fraud detection
|
51
|
+
- Network intrusion detection
|
52
|
+
- Rare event prediction
|
53
|
+
|
54
|
+
Balancr allows you to:
|
55
|
+
- Compare different balancing techniques (e.g., SMOTE, ADASYN, random undersampling), and the same techniques with different configurations, against multiple classifiers
|
56
|
+
- Evaluate performance using relevant metrics
|
57
|
+
- Visualise results and class distributions
|
58
|
+
- Generate balanced datasets using various methods
|
59
|
+
- Customise the evaluation process with different classifiers
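As a rough illustration of what a balancing technique does, here is a minimal sketch of random undersampling in plain Python. It is not balancr's or imbalanced-learn's implementation; the function name `random_undersample` and the toy data are hypothetical and exist only for this example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Randomly drop rows until every class has as many rows as the rarest class."""
    rng = random.Random(seed)

    # Group the feature rows by their class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(rows) for rows in by_class.values())

    X_balanced, y_balanced = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            X_balanced.append(row)
            y_balanced.append(label)
    return X_balanced, y_balanced


# Toy dataset (hypothetical): 8 rows of class 0 and 2 rows of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_undersample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

After undersampling, both classes contain two rows each. Oversampling methods such as SMOTE go the other way and add synthetic minority rows instead of discarding majority rows.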
|
60
|
+
|
61
|
+
## Features
|
62
|
+
|
63
|
+
### Core Functionality:
|
64
|
+
- **CLI Interface**: Simple command-line interface for full workflow
|
65
|
+
- **Data Loading**: Supports CSV files and provides a data quality check
|
66
|
+
- **Preprocessing**: Configurable preprocessing functionality including handling data quality issues, scaling, and encoding categorical features
|
67
|
+
- **Dynamic Technique Discovery**: Automatic discovery of techniques from imbalanced-learn
|
68
|
+
- **Custom Technique Registration**: Register your own balancing techniques
|
69
|
+
- **Classifier Selection**: Compare performance across multiple classifiers.
|
70
|
+
Fine-tune parameters via the configuration file
|
71
|
+
- **Custom Classifier Registration**: Register your own classifier implementations
|
72
|
+
- **Comprehensive Metric Evaluation**: Get metrics specific to imbalanced learning
|
73
|
+
- **Visualisation Suite**: Plots for class distributions, metrics comparison, and learning curves (more to come)
|
74
|
+
- **Flexible Configuration**: Configure every aspect via CLI or configuration file
|
75
|
+
|
76
|
+
### Available Metrics
|
77
|
+
- Accuracy
|
78
|
+
- Precision
|
79
|
+
- Recall
|
80
|
+
- F1-score
|
81
|
+
- ROC AUC
|
82
|
+
- G-mean
|
83
|
+
- Specificity
|
84
|
+
- Cross-validation scores
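Several of the metrics listed above have standard binary-classification definitions in terms of confusion-matrix counts. The following is a minimal sketch of those definitions in plain Python. The function name `binary_metrics` is hypothetical, and balancr's own implementation in `evaluation/metrics.py` may differ (for example, in how it handles multi-class targets).

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute common imbalanced-learning metrics for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        # G-mean is the geometric mean of recall and specificity.
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy predictions (hypothetical): 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.8 while recall is only 0.5, which is why metrics such as G-mean and specificity are reported alongside accuracy for imbalanced data.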
|
85
|
+
|
86
|
+
### Visualisations
|
87
|
+
- Class distribution comparisons
|
88
|
+
- Performance metric comparisons
|
89
|
+
- Learning curves
|
90
|
+
- Results comparison plots
|
91
|
+
|
92
|
+
## Installation
|
93
|
+
|
94
|
+
```bash
|
95
|
+
# From PyPI (recommended)
|
96
|
+
pip install balancr
|
97
|
+
|
98
|
+
# From source
|
99
|
+
git clone https://gitlab.eeecs.qub.ac.uk/40353634/csc3002-balancing-techniques-framework.git
|
100
|
+
cd csc3002-balancing-techniques-framework
|
101
|
+
pip install -e .
|
102
|
+
```
|
103
|
+
|
104
|
+
## Command-Line Interface
|
105
|
+
|
106
|
+
Balancr provides a comprehensive CLI to help you analyse imbalanced datasets:
|
107
|
+
|
108
|
+
```
|
109
|
+
____ _
|
110
|
+
| __ ) __ _| | __ _ _ __ ___ _ __
|
111
|
+
| _ \ / _` | |/ _` | '_ \ / __| '__|
|
112
|
+
| |_) | (_| | | (_| | | | | (__| |
|
113
|
+
|____/ \__,_|_|\__,_|_| |_|\\___|_|
|
114
|
+
|
115
|
+
```
|
116
|
+
|
117
|
+
### Quick Start - CLI
|
118
|
+
|
119
|
+
Here's a complete workflow using the CLI:
|
120
|
+
|
121
|
+
```bash
|
122
|
+
# Load your dataset
|
123
|
+
balancr load-data dataset.csv -t target_column
|
124
|
+
|
125
|
+
# Configure preprocessing
|
126
|
+
balancr preprocess --scale standard --handle-missing mean --encode auto
|
127
|
+
|
128
|
+
# Select balancing techniques to compare
|
129
|
+
balancr select-techniques SMOTE RandomUnderSampler ADASYN
|
130
|
+
|
131
|
+
# Select classifiers for evaluation
|
132
|
+
balancr select-classifiers RandomForestClassifier LogisticRegression
|
133
|
+
|
134
|
+
# Configure metrics
|
135
|
+
balancr configure-metrics --metrics precision recall f1 roc_auc
|
136
|
+
|
137
|
+
# Configure visualisations
|
138
|
+
balancr configure-visualisations --types all --save-formats png
|
139
|
+
|
140
|
+
# Configure evaluation settings
|
141
|
+
balancr configure-evaluation --test-size 0.3 --cross-validation 5
|
142
|
+
|
143
|
+
# Run the comparison
|
144
|
+
balancr run --output-dir results/experiment1
|
145
|
+
```
|
146
|
+
|
147
|
+
### Available Commands
|
148
|
+
|
149
|
+
| Command | Description | Example |
|
150
|
+
|---------|-------------|---------|
|
151
|
+
| `load-data` | Load a dataset for analysis | `balancr load-data dataset.csv -t target` |
|
152
|
+
| `preprocess` | Configure preprocessing options | `balancr preprocess --scale standard --handle-missing mean` |
|
153
|
+
| `select-techniques` | Select balancing techniques | `balancr select-techniques SMOTE ADASYN` |
|
154
|
+
| `register-techniques` | Register custom techniques | `balancr register-techniques my_technique.py` |
|
155
|
+
| `select-classifiers` | Select classifiers for evaluation | `balancr select-classifiers RandomForestClassifier` |
|
156
|
+
| `register-classifiers` | Register custom classifiers | `balancr register-classifiers my_classifier.py` |
|
157
|
+
| `configure-metrics` | Configure evaluation metrics | `balancr configure-metrics --metrics precision recall f1` |
|
158
|
+
| `configure-visualisations` | Configure visualisation options | `balancr configure-visualisations --types all` |
|
159
|
+
| `configure-evaluation` | Configure model evaluation settings | `balancr configure-evaluation --test-size 0.3` |
|
160
|
+
| `run` | Run comparison of techniques | `balancr run --output-dir results` |
|
161
|
+
| `reset` | Reset configuration to defaults | `balancr reset` |
|
162
|
+
|
163
|
+
## Python API
|
164
|
+
|
165
|
+
Balancr can also be used as a Python library:
|
166
|
+
|
167
|
+
```python
from balancr.imbalance_analyser import BalancingFramework

# Initialize the framework
framework = BalancingFramework()

# Load your dataset
framework.load_data(
    file_path="path/to/your/data.csv",
    target_column="target",
    feature_columns=["feature1", "feature2", "feature3"]
)

# Preprocess the data
framework.preprocess_data(
    handle_missing="mean",
    scale="standard",
    encode="auto"
)

# Apply balancing techniques
balanced_datasets = framework.apply_balancing_techniques(
    technique_names=["SMOTE", "RandomUnderSampler", "ADASYN"],
    test_size=0.2
)

# Train and evaluate classifiers
results = framework.train_classifiers(
    classifier_configs={
        "RandomForestClassifier": {"n_estimators": 100, "random_state": 42},
        "LogisticRegression": {"C": 1.0, "random_state": 42}
    },
    enable_cv=True,
    cv_folds=5
)

# Generate visualisations
framework.compare_balanced_class_distributions(
    save_path="results/class_distributions.png"
)

# Generate learning curves
framework.generate_learning_curves(
    classifier_name="RandomForestClassifier",
    save_path="results/learning_curves.png"
)

# Save results
framework.save_classifier_results(
    "results/metrics_results.csv",
    classifier_name="RandomForestClassifier"
)
```
|
220
|
+
|
221
|
+
## Creating Custom Techniques
|
222
|
+
|
223
|
+
You can create and register your own balancing techniques:
|
224
|
+
|
225
|
+
```python
from typing import Tuple

import numpy as np

from balancr.base import BaseBalancer


class MyCustomBalancer(BaseBalancer):
    """A custom balancing technique that implements your logic"""

    def balance(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        # Implement your balancing logic here.
        # Placeholder: pass the data through unchanged until real logic is added.
        X_balanced, y_balanced = X, y

        # Return the balanced X and y
        return X_balanced, y_balanced
```
|
238
|
+
|
239
|
+
Register your technique using the CLI:
|
240
|
+
|
241
|
+
```bash
|
242
|
+
balancr register-techniques my_custom_technique.py
|
243
|
+
```
|
244
|
+
|
245
|
+
Or using the Python API:
|
246
|
+
|
247
|
+
```python
|
248
|
+
from balancr.technique_registry import TechniqueRegistry
|
249
|
+
from my_custom_technique import MyCustomBalancer
|
250
|
+
|
251
|
+
registry = TechniqueRegistry()
|
252
|
+
registry.register_custom_technique("MyCustomBalancer", MyCustomBalancer)
|
253
|
+
```
|
254
|
+
|
255
|
+
## Creating Custom Classifiers
|
256
|
+
|
257
|
+
You can create and register your own classifiers:
|
258
|
+
|
259
|
+
```python
from sklearn.base import BaseEstimator
import numpy as np


class MyCustomClassifier(BaseEstimator):
    def __init__(self, n_estimators, random_state):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        # Implement your training logic here
        # Return self (the fitted estimator)
        return self

    def predict(self, X):
        # Implement your prediction logic here
        # Placeholder: predict class 0 for every sample
        return np.zeros(len(X))
```
|
278
|
+
|
279
|
+
Register your classifier using the CLI:
|
280
|
+
|
281
|
+
```bash
|
282
|
+
balancr register-classifiers my_custom_classifier.py
|
283
|
+
```
|
284
|
+
|
285
|
+
Or using the Python API:
|
286
|
+
|
287
|
+
```python
|
288
|
+
from balancr.classifier_registry import ClassifierRegistry
|
289
|
+
from my_custom_classifier import MyCustomClassifier
|
290
|
+
|
291
|
+
registry = ClassifierRegistry()
|
292
|
+
registry.register_custom_classifier("MyCustomClassifier", MyCustomClassifier)
|
293
|
+
```
|
294
|
+
|
295
|
+
## Extra Configuration Tips
|
296
|
+
|
297
|
+
### Manual Configurations
|
298
|
+
For more control, all configurations are stored in balancr's config file (default location: ~/.balancr/config.json).
|
299
|
+
|
300
|
+
### Comparing Balancers/Classifiers Against Themselves
|
301
|
+
|
302
|
+
Comparing a balancing technique or classifier against itself with different parameters is a common need, but it requires some extra configuration.
|
303
|
+
|
304
|
+
### To compare a balancing technique against itself:
|
305
|
+
First, select the balancer you want to compare:
|
306
|
+
|
307
|
+
```bash
|
308
|
+
balancr select-techniques SMOTE
|
309
|
+
```
|
310
|
+
|
311
|
+
Then in balancr's config file (default location: ~/.balancr/config.json), you should see the config settings of your selected technique:
|
312
|
+
|
313
|
+
```json
|
314
|
+
"balancing_techniques": {
|
315
|
+
"SMOTE": {
|
316
|
+
"sampling_strategy": "auto",
|
317
|
+
"random_state": 42,
|
318
|
+
"k_neighbors": 3,
|
319
|
+
"n_jobs": null
|
320
|
+
}
|
321
|
+
},
|
322
|
+
```
|
323
|
+
|
324
|
+
Create a copy of this technique config, and make sure to change the name to contain a valid suffix (suffix needs to start with _ or -), e.g. SMOTE_v2, SMOTE-2, SMOTE_ChangedParams, etc.:
|
325
|
+
|
326
|
+
```json
|
327
|
+
"balancing_techniques": {
|
328
|
+
"SMOTE": {
|
329
|
+
"sampling_strategy": "auto",
|
330
|
+
"random_state": 42,
|
331
|
+
"k_neighbors": 3,
|
332
|
+
"n_jobs": null
|
333
|
+
},
|
334
|
+
"SMOTE_v2": {
|
335
|
+
"sampling_strategy": "auto",
|
336
|
+
"random_state": 42,
|
337
|
+
"k_neighbors": 3,
|
338
|
+
"n_jobs": null
|
339
|
+
}
|
340
|
+
},
|
341
|
+
```
|
342
|
+
|
343
|
+
You can then change the desired parameters. In this example, we change k_neighbors from 3 to 5:
|
344
|
+
```json
|
345
|
+
"balancing_techniques": {
|
346
|
+
"SMOTE": {
|
347
|
+
"sampling_strategy": "auto",
|
348
|
+
"random_state": 42,
|
349
|
+
"k_neighbors": 3,
|
350
|
+
"n_jobs": null
|
351
|
+
},
|
352
|
+
"SMOTE_v2": {
|
353
|
+
"sampling_strategy": "auto",
|
354
|
+
"random_state": 42,
|
355
|
+
"k_neighbors": 5,
|
356
|
+
"n_jobs": null
|
357
|
+
}
|
358
|
+
},
|
359
|
+
```
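The copy-rename-edit steps above can also be scripted. Below is a minimal sketch that works on an in-memory dictionary shaped like the `balancing_techniques` fragment shown earlier. The helper `add_variant` is hypothetical and not part of balancr; to apply it to a real config you would load and re-save the JSON file yourself, assuming the file is laid out as these fragments suggest.

```python
import copy
import json


def add_variant(config, section, base_name, suffix, **overrides):
    """Copy an existing entry under a suffixed name and override some parameters."""
    # The README requires the suffix to start with "_" or "-".
    if not suffix or suffix[0] not in "_-":
        raise ValueError("suffix must start with '_' or '-'")

    variant_name = base_name + suffix
    variant = copy.deepcopy(config[section][base_name])
    variant.update(overrides)
    config[section][variant_name] = variant
    return variant_name


# In-memory stand-in for the relevant part of the config file.
config = {
    "balancing_techniques": {
        "SMOTE": {
            "sampling_strategy": "auto",
            "random_state": 42,
            "k_neighbors": 3,
            "n_jobs": None,
        }
    }
}

add_variant(config, "balancing_techniques", "SMOTE", "_v2", k_neighbors=5)
print(json.dumps(config, indent=2))
```

The deep copy keeps the original `SMOTE` entry untouched, so both configurations stay available for the comparison.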
|
360
|
+
### To compare a classifier against itself:
|
361
|
+
First, select the classifier you want to compare:
|
362
|
+
|
363
|
+
```bash
|
364
|
+
balancr select-classifiers RandomForestClassifier
|
365
|
+
```
|
366
|
+
|
367
|
+
Then in balancr's config file (default location: ~/.balancr/config.json), you should see the config settings of your selected classifier:
|
368
|
+
|
369
|
+
```json
|
370
|
+
"classifiers": {
|
371
|
+
"RandomForestClassifier": {
|
372
|
+
"n_estimators": 100,
|
373
|
+
"criterion": "gini",
|
374
|
+
"max_depth": null,
|
375
|
+
"min_samples_split": 2,
|
376
|
+
"min_samples_leaf": 1,
|
377
|
+
"min_weight_fraction_leaf": 0.0,
|
378
|
+
"max_features": "sqrt",
|
379
|
+
"max_leaf_nodes": null,
|
380
|
+
"min_impurity_decrease": 0.0,
|
381
|
+
"bootstrap": true,
|
382
|
+
"oob_score": false,
|
383
|
+
"n_jobs": null,
|
384
|
+
"random_state": null,
|
385
|
+
"verbose": 0,
|
386
|
+
"warm_start": false,
|
387
|
+
"class_weight": null,
|
388
|
+
"ccp_alpha": 0.0,
|
389
|
+
"max_samples": null,
|
390
|
+
"monotonic_cst": null
|
391
|
+
}
|
392
|
+
},
|
393
|
+
```
|
394
|
+
|
395
|
+
Create a copy of this classifier config, and make sure to change the name to contain a valid suffix (suffix needs to start with _ or -), e.g. RandomForestClassifier_v2, RandomForestClassifier-2, RandomForestClassifier_ChangedParams, etc.:
|
396
|
+
|
397
|
+
```json
|
398
|
+
"classifiers": {
|
399
|
+
"RandomForestClassifier": {
|
400
|
+
"n_estimators": 100,
|
401
|
+
"criterion": "gini",
|
402
|
+
"max_depth": null,
|
403
|
+
"min_samples_split": 2,
|
404
|
+
"min_samples_leaf": 1,
|
405
|
+
"min_weight_fraction_leaf": 0.0,
|
406
|
+
"max_features": "sqrt",
|
407
|
+
"max_leaf_nodes": null,
|
408
|
+
"min_impurity_decrease": 0.0,
|
409
|
+
"bootstrap": true,
|
410
|
+
"oob_score": false,
|
411
|
+
"n_jobs": null,
|
412
|
+
"random_state": null,
|
413
|
+
"verbose": 0,
|
414
|
+
"warm_start": false,
|
415
|
+
"class_weight": null,
|
416
|
+
"ccp_alpha": 0.0,
|
417
|
+
"max_samples": null,
|
418
|
+
"monotonic_cst": null
|
419
|
+
},
|
420
|
+
"RandomForestClassifier_v2": {
|
421
|
+
"n_estimators": 100,
|
422
|
+
"criterion": "gini",
|
423
|
+
"max_depth": null,
|
424
|
+
"min_samples_split": 2,
|
425
|
+
"min_samples_leaf": 1,
|
426
|
+
"min_weight_fraction_leaf": 0.0,
|
427
|
+
"max_features": "sqrt",
|
428
|
+
"max_leaf_nodes": null,
|
429
|
+
"min_impurity_decrease": 0.0,
|
430
|
+
"bootstrap": true,
|
431
|
+
"oob_score": false,
|
432
|
+
"n_jobs": null,
|
433
|
+
"random_state": null,
|
434
|
+
"verbose": 0,
|
435
|
+
"warm_start": false,
|
436
|
+
"class_weight": null,
|
437
|
+
"ccp_alpha": 0.0,
|
438
|
+
"max_samples": null,
|
439
|
+
"monotonic_cst": null
|
440
|
+
}
|
441
|
+
},
|
442
|
+
```
|
443
|
+
|
444
|
+
You can then change the desired parameters. In this example, we change n_estimators from 100 to 200:
|
445
|
+
```json
|
446
|
+
"classifiers": {
|
447
|
+
"RandomForestClassifier": {
|
448
|
+
"n_estimators": 100,
|
449
|
+
"criterion": "gini",
|
450
|
+
"max_depth": null,
|
451
|
+
"min_samples_split": 2,
|
452
|
+
"min_samples_leaf": 1,
|
453
|
+
"min_weight_fraction_leaf": 0.0,
|
454
|
+
"max_features": "sqrt",
|
455
|
+
"max_leaf_nodes": null,
|
456
|
+
"min_impurity_decrease": 0.0,
|
457
|
+
"bootstrap": true,
|
458
|
+
"oob_score": false,
|
459
|
+
"n_jobs": null,
|
460
|
+
"random_state": null,
|
461
|
+
"verbose": 0,
|
462
|
+
"warm_start": false,
|
463
|
+
"class_weight": null,
|
464
|
+
"ccp_alpha": 0.0,
|
465
|
+
"max_samples": null,
|
466
|
+
"monotonic_cst": null
|
467
|
+
},
|
468
|
+
"RandomForestClassifier_v2": {
|
469
|
+
"n_estimators": 200,
|
470
|
+
"criterion": "gini",
|
471
|
+
"max_depth": null,
|
472
|
+
"min_samples_split": 2,
|
473
|
+
"min_samples_leaf": 1,
|
474
|
+
"min_weight_fraction_leaf": 0.0,
|
475
|
+
"max_features": "sqrt",
|
476
|
+
"max_leaf_nodes": null,
|
477
|
+
"min_impurity_decrease": 0.0,
|
478
|
+
"bootstrap": true,
|
479
|
+
"oob_score": false,
|
480
|
+
"n_jobs": null,
|
481
|
+
"random_state": null,
|
482
|
+
"verbose": 0,
|
483
|
+
"warm_start": false,
|
484
|
+
"class_weight": null,
|
485
|
+
"ccp_alpha": 0.0,
|
486
|
+
"max_samples": null,
|
487
|
+
"monotonic_cst": null
|
488
|
+
}
|
489
|
+
},
|
490
|
+
```
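The suffix rule (a registered name followed by a suffix that starts with `_` or `-`) implies that a configured name can be mapped back to the technique or classifier it is based on. The README does not show how balancr does this internally, so the following is only a sketch of one plausible approach; `resolve_base_name` and the `known` list are hypothetical.

```python
def resolve_base_name(configured_name, known_names):
    """Return the registered name that configured_name refers to, or None."""
    if configured_name in known_names:
        return configured_name

    # Try longer registered names first so they win over shorter prefixes.
    for base in sorted(known_names, key=len, reverse=True):
        rest = configured_name[len(base):]
        # A valid variant is the base name plus a suffix starting with "_" or "-".
        if configured_name.startswith(base) and rest[:1] in ("_", "-"):
            return base
    return None


known = ["RandomForestClassifier", "LogisticRegression"]
for name in ["RandomForestClassifier_v2", "RandomForestClassifier-2",
             "RandomForestClassifierV2", "LogisticRegression"]:
    print(name, "->", resolve_base_name(name, known))
```

A name such as `RandomForestClassifierV2` is rejected here because its suffix does not start with `_` or `-`.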
|
491
|
+
|
492
|
+
### Cross Validation
|
493
|
+
Cross validation is optional and, when enabled, is applied to the balanced training data only. This gives an estimate of how well a classifier can learn from the balanced data and generalise across different parts of that balanced dataset.
|
494
|
+
|
495
|
+
To enable cross validation on the balanced datasets, set the number of folds with:
|
496
|
+
```bash
|
497
|
+
balancr configure-evaluation --cross-validation 5
|
498
|
+
```
|
499
|
+
This performs 5-fold cross validation: each balanced dataset is split into 5 folds, and the selected classifiers are trained in 5 rounds, each round holding out a different fold for validation.
|
500
|
+
|
501
|
+
The results of these rounds are then averaged.
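To make the fold-and-average idea concrete, here is a minimal sketch of k-fold splitting in plain Python. It is an illustration under simplifying assumptions, not balancr's code: the folds are contiguous and unshuffled, whereas library implementations (for example scikit-learn's stratified splitters) typically shuffle and preserve class proportions. The names `k_fold_indices`, `cross_validate` and the toy scoring function are hypothetical.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, score_fold):
    """Score each round with score_fold(train_idx, val_idx) and average the results."""
    scores = [score_fold(train, val) for train, val in k_fold_indices(n_samples, k)]
    return scores, mean(scores)


# Toy scorer (hypothetical): the "score" is just the fraction of rows used for training.
scores, avg = cross_validate(10, 5, lambda train, val: len(train) / 10)
print(scores, avg)
```

In a real run, `score_fold` would fit a classifier on the training indices and evaluate it on the held-out fold.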
|
502
|
+
|
503
|
+
### Learning Curves
|
504
|
+
The same process mentioned in cross validation above is applied to generating learning curves.
|
505
|
+
|
506
|
+
Learning curves help us visualise each model's performance when being trained on increasing amounts of data.
|
507
|
+
|
508
|
+
These learning curves are generated using the balanced datasets chosen by the user.
|
509
|
+
|
510
|
+
To configure learning curves:
|
511
|
+
```bash
|
512
|
+
balancr configure-evaluation --learning-curve-folds 5 --learning-curve-points 10
|
513
|
+
```
|
514
|
+
This will set the number of cross validation folds and the number of points to plot on the learning curves.
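The number of points determines how many training-set sizes are evaluated along the curve. The README does not say how balancr spaces them, so the sketch below simply assumes evenly spaced fractions from 10% to 100% of the available training rows. The function `learning_curve_sizes` and its `min_fraction` parameter are hypothetical.

```python
def learning_curve_sizes(n_train, n_points, min_fraction=0.1):
    """Return n_points training-set sizes, evenly spaced from min_fraction to 100%."""
    if n_points < 2:
        return [n_train]
    step = (1.0 - min_fraction) / (n_points - 1)
    fractions = [min_fraction + i * step for i in range(n_points)]
    # Round to whole rows and never go below a single row.
    return [max(1, round(f * n_train)) for f in fractions]


# With 5 folds on 1000 balanced rows, each round trains on 800 rows.
sizes = learning_curve_sizes(n_train=800, n_points=10)
print(sizes)
```

Each size is then trained and scored on every fold, and the per-size averages form one curve per balancing technique.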
|
515
|
+
|
516
|
+
## Requirements
|
517
|
+
|
518
|
+
- Python >= 3.8
|
519
|
+
- NumPy >= 1.21.0
|
520
|
+
- pandas >= 1.3.0
|
521
|
+
- scikit-learn >= 1.0.0
|
522
|
+
- matplotlib >= 3.4.0
|
523
|
+
- seaborn >= 0.11.0
|
524
|
+
- imbalanced-learn >= 0.8.0
|
525
|
+
- openpyxl >= 3.0.0
|
526
|
+
- colorama >= 0.4.4
- plotly >= 6.0.1
|
527
|
+
|
528
|
+
## Future Plans
|
529
|
+
|
530
|
+
- More visualisation options
|
531
|
+
- Collecting balancer and classifier run times, rather than only displaying them in logs
|
532
|
+
- Saving results as runs go along, rather than retrieving all results at end of run
|
533
|
+
|
534
|
+
## Author
|
535
|
+
|
536
|
+
Conor Doherty, cdoherty135@qub.ac.uk
|