vfa-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vfa-0.1.0/LICENSE +21 -0
- vfa-0.1.0/MANIFEST.in +4 -0
- vfa-0.1.0/PKG-INFO +130 -0
- vfa-0.1.0/README.md +100 -0
- vfa-0.1.0/pyproject.toml +45 -0
- vfa-0.1.0/setup.cfg +4 -0
- vfa-0.1.0/tests/test_vfa.py +99 -0
- vfa-0.1.0/vfa/__init__.py +10 -0
- vfa-0.1.0/vfa/vfa.py +78 -0
- vfa-0.1.0/vfa.egg-info/PKG-INFO +130 -0
- vfa-0.1.0/vfa.egg-info/SOURCES.txt +12 -0
- vfa-0.1.0/vfa.egg-info/dependency_links.txt +1 -0
- vfa-0.1.0/vfa.egg-info/requires.txt +5 -0
- vfa-0.1.0/vfa.egg-info/top_level.txt +1 -0
vfa-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Your Name
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
vfa-0.1.0/MANIFEST.in
ADDED
vfa-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,130 @@
+Metadata-Version: 2.4
+Name: vfa
+Version: 0.1.0
+Summary: Variance Feature Analysis for binary classification feature selection
+Author-email: mohdadil <mohdadil@live.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/nqmn/vfa
+Project-URL: Repository, https://github.com/nqmn/vfa
+Project-URL: Bug Tracker, https://github.com/nqmn/vfa/issues
+Keywords: feature-selection,machine-learning,variance-analysis,classification
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy>=1.20.0
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+Dynamic: license-file
+
+# VFA - Variance Feature Analysis
+
+A Python package for binary classification feature selection using variance-based analysis.
+
+## Overview
+
+Variance Feature Analysis (VFA) implements a feature selection method based on the Class-Variance Ratio (CVR). It selects the most discriminative features for binary classification tasks by analyzing the ratio of between-class variance to total variance.
+
+## Installation
+
+```bash
+pip install vfa
+```
+
+## Features
+
+- Fast variance-based feature selection for binary classification
+- Automatic feature ranking using Class-Variance Ratio (CVR)
+- Weighted feature aggregation
+- Compatible with scikit-learn workflows
+- Lightweight with minimal dependencies (only NumPy)
+
+## Usage
+
+```python
+from vfa import variance_feature_analysis
+import numpy as np
+
+# Example data
+X = np.random.rand(100, 20)  # 100 samples, 20 features
+y = np.random.randint(0, 2, 100)  # Binary labels
+
+# Select top 5 features
+X_selected, f_aggregated, selected_indices, scores = variance_feature_analysis(X, y, k=5)
+
+print(f"Selected feature indices: {selected_indices}")
+print(f"Feature scores: {scores[selected_indices]}")
+```
+
+## Parameters
+
+- `X` (array-like): Training data features of shape (n_samples, n_features)
+- `y` (array-like): Target labels of shape (n_samples,) - must be binary
+- `k` (int, default=8): Number of top features to select
+- `epsilon` (float, default=1e-12): Small constant to prevent division by zero
+
+## Returns
+
+- `X_selected`: Selected feature subset
+- `f_aggregated`: Weighted aggregation of selected features
+- `selected_idx`: Indices of selected features
+- `scores`: CVR scores for all features
+
+## How It Works
+
+The algorithm:
+1. Computes within-class and between-class variance for each feature
+2. Calculates the Class-Variance Ratio (CVR) = B / (B + W)
+3. Selects the top-k features with highest CVR scores
+4. Returns selected features and their weighted aggregation
+
+## Requirements
+
+- Python >=3.8
+- NumPy >=1.20.0
+
+## Development
+
+```bash
+# Clone the repository
+git clone https://github.com/nqmn/vfa.git
+cd vfa
+
+# Install development dependencies
+pip install -e ".[dev]"
+
+# Run tests
+pytest
+```
+
+## License
+
+MIT License - see LICENSE file for details.
+
+## Contributing
+
+Contributions are welcome! Please feel free to submit a Pull Request.
+
+## Citation
+
+If you use this package in your research, please cite:
+
+```bibtex
+@software{vfa2024,
+  title={VFA: Variance Feature Analysis},
+  author={Mohd Adil Mokti},
+  year={2026},
+  url={https://github.com/nqmn/vfa}
+}
+```
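The "How It Works" steps in the description above can be checked by hand. The sketch below recomputes the CVR for a toy dataset where one feature perfectly separates the classes; it is a standalone recomputation following those steps, not the package's own code, and the variable names, seed, and data are illustrative.

```python
import numpy as np

# Toy data: 100 samples, 4 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 50 + [1] * 50)
X[:50, 0] = 0.0   # feature 0 is constant per class ...
X[50:, 0] = 1.0   # ... and differs between classes

X0, X1 = X[y == 0], X[y == 1]
p0, p1 = 0.5, 0.5  # class priors (balanced here)

# Step 1: within-class (W) and between-class (B) variance per feature
W = p0 * X0.var(axis=0) + p1 * X1.var(axis=0)
B = p0 * p1 * (X0.mean(axis=0) - X1.mean(axis=0)) ** 2

# Step 2: Class-Variance Ratio
cvr = B / (B + W + 1e-12)

# Step 3: top-k feature indices by score
k = 2
top_k = np.argsort(cvr)[-k:]
```

Feature 0 has zero within-class variance, so its CVR is close to 1, and it always lands in the top-k set; the purely random features score near 0.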
vfa-0.1.0/README.md
ADDED
@@ -0,0 +1,100 @@
(The added README.md is identical to the markdown body of vfa-0.1.0/PKG-INFO above, from "# VFA - Variance Feature Analysis" through the citation block.)
vfa-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,45 @@
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "vfa"
+version = "0.1.0"
+description = "Variance Feature Analysis for binary classification feature selection"
+readme = "README.md"
+authors = [
+    {name = "mohdadil", email = "mohdadil@live.com"}
+]
+license = "MIT"
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+]
+keywords = ["feature-selection", "machine-learning", "variance-analysis", "classification"]
+requires-python = ">=3.8"
+dependencies = [
+    "numpy>=1.20.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0.0",
+    "pytest-cov>=4.0.0",
+]
+
+[project.urls]
+Homepage = "https://github.com/nqmn/vfa"
+Repository = "https://github.com/nqmn/vfa"
+"Bug Tracker" = "https://github.com/nqmn/vfa/issues"
+
+[tool.setuptools]
+packages = ["vfa"]
vfa-0.1.0/setup.cfg
ADDED
vfa-0.1.0/tests/test_vfa.py
ADDED
@@ -0,0 +1,99 @@
+import numpy as np
+import pytest
+from vfa import variance_feature_analysis
+
+
+def test_basic_functionality():
+    """Test basic functionality with simple binary classification data."""
+    np.random.seed(42)
+    X = np.random.rand(100, 10)
+    y = np.random.randint(0, 2, 100)
+
+    X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=5)
+
+    assert X_selected.shape == (100, 5), "Selected features shape mismatch"
+    assert f_agg.shape == (100,), "Aggregated features shape mismatch"
+    assert len(indices) == 5, "Number of selected indices mismatch"
+    assert len(scores) == 10, "Scores length should match original feature count"
+
+
+def test_k_exceeds_features():
+    """Test that k is capped at number of features."""
+    np.random.seed(42)
+    X = np.random.rand(50, 5)
+    y = np.random.randint(0, 2, 50)
+
+    X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=10)
+
+    assert X_selected.shape[1] == 5, "Should cap k at number of features"
+    assert len(indices) == 5, "Indices should be capped at number of features"
+
+
+def test_feature_scores_range():
+    """Test that CVR scores are in valid range [0, 1]."""
+    np.random.seed(42)
+    X = np.random.rand(100, 10)
+    y = np.random.randint(0, 2, 100)
+
+    _, _, _, scores = variance_feature_analysis(X, y, k=5)
+
+    assert np.all(scores >= 0), "Scores should be non-negative"
+    assert np.all(scores <= 1), "Scores should not exceed 1"
+
+
+def test_discriminative_features():
+    """Test that discriminative features get higher scores."""
+    np.random.seed(42)
+
+    # Create data where first feature is highly discriminative
+    X = np.random.rand(100, 5)
+    y = np.array([0] * 50 + [1] * 50)
+
+    # Make first feature discriminative
+    X[:50, 0] = 0.0 + np.random.rand(50) * 0.1  # Class 0: low values
+    X[50:, 0] = 0.9 + np.random.rand(50) * 0.1  # Class 1: high values
+
+    _, _, indices, scores = variance_feature_analysis(X, y, k=5)
+
+    assert 0 in indices, "Discriminative feature should be selected"
+    assert scores[0] > scores[1:].mean(), "Discriminative feature should have high score"
+
+
+def test_epsilon_prevents_division_by_zero():
+    """Test that epsilon parameter prevents division by zero."""
+    np.random.seed(42)
+    X = np.zeros((100, 5))  # All zeros
+    y = np.random.randint(0, 2, 100)
+
+    # Should not raise division by zero error
+    X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=3)
+
+    assert not np.any(np.isnan(scores)), "Should not produce NaN scores"
+    assert not np.any(np.isinf(scores)), "Should not produce infinite scores"
+
+
+def test_output_types():
+    """Test that outputs have correct types."""
+    np.random.seed(42)
+    X = np.random.rand(100, 10)
+    y = np.random.randint(0, 2, 100)
+
+    X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=5)
+
+    assert isinstance(X_selected, np.ndarray), "X_selected should be ndarray"
+    assert isinstance(f_agg, np.ndarray), "f_agg should be ndarray"
+    assert isinstance(indices, np.ndarray), "indices should be ndarray"
+    assert isinstance(scores, np.ndarray), "scores should be ndarray"
+
+
+def test_aggregated_feature_weights():
+    """Test that aggregated features are properly weighted."""
+    np.random.seed(42)
+    X = np.random.rand(100, 10)
+    y = np.random.randint(0, 2, 100)
+
+    X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=5)
+
+    # Aggregated feature should be in reasonable range
+    assert f_agg.min() >= 0, "Aggregated features should be non-negative (weighted from positive scores)"
+    assert not np.any(np.isnan(f_agg)), "Aggregated features should not be NaN"
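The zero-variance case exercised by test_epsilon_prevents_division_by_zero can be reproduced directly from the CVR formula: with constant features both B and W are 0, so the score is 0 / (0 + 0 + epsilon) = 0 rather than NaN. The sketch below is a standalone recomputation under that scenario, not an import of the package.

```python
import numpy as np

epsilon = 1e-12
X = np.zeros((100, 5))             # all features constant
y = np.array([0] * 50 + [1] * 50)  # balanced binary labels

X0, X1 = X[y == 0], X[y == 1]
# Within-class and between-class variance are both exactly zero here
W = 0.5 * X0.var(axis=0) + 0.5 * X1.var(axis=0)
B = 0.25 * (X0.mean(axis=0) - X1.mean(axis=0)) ** 2

# epsilon keeps the denominator nonzero, so scores are 0, not NaN
scores = B / (B + W + epsilon)
```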
vfa-0.1.0/vfa/vfa.py
ADDED
@@ -0,0 +1,78 @@
+import numpy as np
+
+
+def variance_feature_analysis(X, y, k=8, epsilon=1e-12):
+    """
+    Perform variance-based feature selection for binary classification.
+
+    This function selects the top-k features based on the Class-Variance Ratio (CVR),
+    which measures the ratio of between-class variance to total variance for each feature.
+
+    Parameters
+    ----------
+    X : array-like of shape (n_samples, n_features)
+        Training data features.
+    y : array-like of shape (n_samples,)
+        Target labels (binary: two unique classes).
+    k : int, default=8
+        Number of top features to select. Will be capped at n_features if k > n_features.
+    epsilon : float, default=1e-12
+        Small constant to prevent division by zero.
+
+    Returns
+    -------
+    X_selected : ndarray of shape (n_samples, k)
+        Selected feature subset.
+    f_aggregated : ndarray of shape (n_samples,)
+        Weighted aggregation of selected features.
+    selected_idx : ndarray of shape (k,)
+        Indices of selected features.
+    scores : ndarray of shape (n_features,)
+        CVR scores for all features.
+
+    Examples
+    --------
+    >>> from vfa import variance_feature_analysis
+    >>> import numpy as np
+    >>> X = np.random.rand(100, 20)
+    >>> y = np.random.randint(0, 2, 100)
+    >>> X_selected, f_agg, indices, scores = variance_feature_analysis(X, y, k=5)
+    >>> print(f"Selected {len(indices)} features with indices: {indices}")
+    """
+    X = np.asarray(X, dtype=float)
+    y = np.asarray(y).astype(int)
+
+    classes = np.unique(y)
+
+    c0, c1 = classes
+    X0, X1 = X[y == c0], X[y == c1]
+
+    n0, n1 = X0.shape[0], X1.shape[0]
+    n = n0 + n1
+    p0, p1 = n0 / n, n1 / n
+
+    # Class means and variances
+    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
+    v0, v1 = X0.var(axis=0), X1.var(axis=0)
+
+    # Within-class variance
+    W = p0 * v0 + p1 * v1
+
+    # Between-class variance (binary closed form)
+    B = p0 * p1 * (mu0 - mu1) ** 2
+
+    # Class-Variance Ratio (CVR)
+    scores = B / (B + W + epsilon)
+
+    # Select top-k features
+    k = min(int(k), X.shape[1])
+    selected_idx = np.argsort(scores)[-k:]
+    X_selected = X[:, selected_idx]
+
+    # Aggregation weights
+    w_raw = scores[selected_idx].copy()
+    w = w_raw / (w_raw.sum() + epsilon)
+
+    f_aggregated = X_selected @ w
+
+    return X_selected, f_aggregated, selected_idx, scores
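The binary closed form used in vfa.py rests on the law of total variance: for two classes with priors p0 and p1, the pooled population variance of each feature decomposes exactly as Var = W + B, i.e. p0*v0 + p1*v1 + p0*p1*(mu0 - mu1)**2. The sketch below checks this numerically on toy data (seed and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 2, 200)

X0, X1 = X[y == 0], X[y == 1]
n0, n1 = len(X0), len(X1)
p0, p1 = n0 / 200, n1 / 200

# Within-class (W) and between-class (B) variance, as in vfa.py
W = p0 * X0.var(axis=0) + p1 * X1.var(axis=0)
B = p0 * p1 * (X0.mean(axis=0) - X1.mean(axis=0)) ** 2

# Law of total variance: the two parts sum to the overall variance
# (exact for population variances, i.e. NumPy's default ddof=0)
decomposition_holds = np.allclose(W + B, X.var(axis=0))
```

This is also why the CVR B / (B + W) lies in [0, 1]: it is the fraction of each feature's total variance explained by the class split.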
vfa-0.1.0/vfa.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,130 @@
(Identical to vfa-0.1.0/PKG-INFO above.)
vfa-0.1.0/vfa.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@
+
vfa-0.1.0/vfa.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
+vfa