benchmark-reliability 0.1.2__tar.gz → 0.1.4__tar.gz
- benchmark_reliability-0.1.4/PKG-INFO +105 -0
- benchmark_reliability-0.1.4/README.md +81 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/pyproject.toml +1 -1
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/setup.py +1 -1
- benchmark_reliability-0.1.4/src/benchmark_reliability.egg-info/PKG-INFO +105 -0
- benchmark_reliability-0.1.2/PKG-INFO +0 -121
- benchmark_reliability-0.1.2/README.md +0 -97
- benchmark_reliability-0.1.2/src/benchmark_reliability.egg-info/PKG-INFO +0 -121
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/LICENSE +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/setup.cfg +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/SOURCES.txt +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/dependency_links.txt +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/requires.txt +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/top_level.txt +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/__init__.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/analyzer.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/__init__.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/baseline_gap.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/instability.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/metadata.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/null_test.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/__init__.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/classifier.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/embedding.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/visualization.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/__init__.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/json_export.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/latex_export.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_analyzer.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_metrics.py +0 -0
- {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_phase.py +0 -0
@@ -0,0 +1,105 @@
+Metadata-Version: 2.1
+Name: benchmark-reliability
+Version: 0.1.4
+Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
+Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: numpy>=1.21
+Requires-Dist: scikit-learn>=1.0
+Requires-Dist: matplotlib>=3.5
+
+# benchmark-reliability
+
+A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+## Installation
+
+```bash
+pip install benchmark-reliability
+```
+
+Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+## Quick Start
+
+```python
+import numpy as np
+from brf import BRFAnalyzer
+
+# Your data
+X = np.random.randn(200, 10)
+y = np.random.randn(200)
+groups = np.random.choice(["A", "B", "C"], 200)
+
+# Run the audit
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+# Results
+print(analyzer.brf_vector)
+# {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+# 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+```
+
+## BRF Dimensions
+
+| Dimension | Name | Meaning |
+|-----------|------|---------|
+| B | Baseline Gain | Model improvement over mean predictor |
+| I | Instability | Sensitivity to train/test split choice |
+| N | Null Separability | Signal distinguishability from noise |
+| M | Metadata Sufficiency | Group structure completeness |
+
+The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+## Visualization
+
+```python
+from brf.phase import plot_phase_diagram
+
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+```
+
+## Export
+
+```python
+from brf.report import export_json, export_latex
+
+export_json(analyzer.brf_vector, "results.json")
+latex_table = export_latex(analyzer.brf_vector)
+```
+
+## Citation
+
+If you use this package, please cite the BehaviorAudit paper:
+
+```
+BehaviorAudit: a four-dimension pre-modeling audit protocol
+for educational prediction benchmarks. Scientific Reports (under review).
+```
+
+## License
+
+MIT
+
+## Links
+
+- GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+- PyPI: https://pypi.org/project/benchmark-reliability/
@@ -0,0 +1,81 @@
+# benchmark-reliability
+
+A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+## Installation
+
+```bash
+pip install benchmark-reliability
+```
+
+Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+## Quick Start
+
+```python
+import numpy as np
+from brf import BRFAnalyzer
+
+# Your data
+X = np.random.randn(200, 10)
+y = np.random.randn(200)
+groups = np.random.choice(["A", "B", "C"], 200)
+
+# Run the audit
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+# Results
+print(analyzer.brf_vector)
+# {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+# 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+```
+
+## BRF Dimensions
+
+| Dimension | Name | Meaning |
+|-----------|------|---------|
+| B | Baseline Gain | Model improvement over mean predictor |
+| I | Instability | Sensitivity to train/test split choice |
+| N | Null Separability | Signal distinguishability from noise |
+| M | Metadata Sufficiency | Group structure completeness |
+
+The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+## Visualization
+
+```python
+from brf.phase import plot_phase_diagram
+
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+```
+
+## Export
+
+```python
+from brf.report import export_json, export_latex
+
+export_json(analyzer.brf_vector, "results.json")
+latex_table = export_latex(analyzer.brf_vector)
+```
+
+## Citation
+
+If you use this package, please cite the BehaviorAudit paper:
+
+```
+BehaviorAudit: a four-dimension pre-modeling audit protocol
+for educational prediction benchmarks. Scientific Reports (under review).
+```
+
+## License
+
+MIT
+
+## Links
+
+- GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+- PyPI: https://pypi.org/project/benchmark-reliability/
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "benchmark-reliability"
-version = "0.1.
+version = "0.1.4"
 description = "Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks"
 readme = "README.md"
 license = { text = "MIT" }
@@ -0,0 +1,105 @@
+Metadata-Version: 2.1
+Name: benchmark-reliability
+Version: 0.1.4
+Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
+Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: numpy>=1.21
+Requires-Dist: scikit-learn>=1.0
+Requires-Dist: matplotlib>=3.5
+
+# benchmark-reliability
+
+A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+## Installation
+
+```bash
+pip install benchmark-reliability
+```
+
+Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+## Quick Start
+
+```python
+import numpy as np
+from brf import BRFAnalyzer
+
+# Your data
+X = np.random.randn(200, 10)
+y = np.random.randn(200)
+groups = np.random.choice(["A", "B", "C"], 200)
+
+# Run the audit
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+# Results
+print(analyzer.brf_vector)
+# {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+# 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+```
+
+## BRF Dimensions
+
+| Dimension | Name | Meaning |
+|-----------|------|---------|
+| B | Baseline Gain | Model improvement over mean predictor |
+| I | Instability | Sensitivity to train/test split choice |
+| N | Null Separability | Signal distinguishability from noise |
+| M | Metadata Sufficiency | Group structure completeness |
+
+The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+## Visualization
+
+```python
+from brf.phase import plot_phase_diagram
+
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+```
+
+## Export
+
+```python
+from brf.report import export_json, export_latex
+
+export_json(analyzer.brf_vector, "results.json")
+latex_table = export_latex(analyzer.brf_vector)
+```
+
+## Citation
+
+If you use this package, please cite the BehaviorAudit paper:
+
+```
+BehaviorAudit: a four-dimension pre-modeling audit protocol
+for educational prediction benchmarks. Scientific Reports (under review).
+```
+
+## License
+
+MIT
+
+## Links
+
+- GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+- PyPI: https://pypi.org/project/benchmark-reliability/
@@ -1,121 +0,0 @@
-Metadata-Version: 2.1
-Name: benchmark-reliability
-Version: 0.1.2
-Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
-Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
-License: MIT
-Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
-Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
-Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
-Classifier: Development Status :: 3 - Alpha
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3.9
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
-Requires-Python: >=3.8
-Description-Content-Type: text/markdown
-Requires-Dist: numpy>=1.21
-Requires-Dist: scikit-learn>=1.0
-Requires-Dist: matplotlib>=3.5
-
-# BenchmarkReliability - BRF Python Package
-
-## Target
-
-Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
-## Method
-
-The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
-```python
-from brf import BRFAnalyzer
-from brf.phase import plot_phase_diagram
-from brf.report import export_json
-
-analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
-print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
-# Visualization
-plot_phase_diagram(
-    [analyzer.S], [analyzer.E],
-    labels=[analyzer.class_],
-    classes=[analyzer.class_],
-)
-
-# Export
-export_json(analyzer.brf_vector, "results.json")
-```
-
-## Package Structure
-
-```
-brf/
-├── __init__.py
-├── analyzer.py ← BRFAnalyzer main class
-├── metrics/
-│   ├── baseline_gap.py ← B
-│   ├── instability.py ← I
-│   ├── null_test.py ← N (permutation test)
-│   └── metadata.py ← M
-├── phase/
-│   ├── embedding.py ← S = N - I, E = B + M
-│   ├── classifier.py ← Reliable / Fragile / Void
-│   └── visualization.py ← phase diagram, clustering plot
-├── report/
-│   ├── json_export.py
-│   └── latex_export.py
-```
-
-## Steps
-
-### Phase 1: Package skeleton (1-2 weeks)
-- [x] Initialize Python project with `pyproject.toml`
-- [x] Implement `BRFAnalyzer` main class with fit/predict interface
-- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
-- [x] Write unit tests for each metric
-
-### Phase 2: Phase embedding + classification (1 week)
-- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
-- [x] Build phase diagram visualization (matplotlib)
-- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
-### Phase 3: Documentation + distribution (1-2 weeks)
-- [x] Write README with quick-start tutorial and API docs
-- [ ] Publish to TestPyPI → PyPI
-- [ ] Set up ReadTheDocs for auto-generated documentation
-- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
-### Phase 4: HuggingFace Hub integration (optional, 1 week)
-- [ ] Add HF dataset loading wrapper
-- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
-## Dependencies
-
-- `numpy>=1.21`
-- `scikit-learn>=1.0`
-- `matplotlib>=3.5`
-- No deep learning dependencies required
-
-## Relationship to Sister Repos
-
-- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
-- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
-- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
-- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
-## Target Journal
-
-- Journal of Open Source Software (JOSS) - tool paper, lightweight submission
-- Followed by application papers in C&E / BJET
-
-## Timeline
-
-- Phase 1–2: 3 weeks
-- Phase 3: 2 weeks
-- Phase 4: optional
-- JOSS submission: after Phase 3
@@ -1,97 +0,0 @@
-# BenchmarkReliability - BRF Python Package
-
-## Target
-
-Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
-## Method
-
-The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
-```python
-from brf import BRFAnalyzer
-from brf.phase import plot_phase_diagram
-from brf.report import export_json
-
-analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
-print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
-# Visualization
-plot_phase_diagram(
-    [analyzer.S], [analyzer.E],
-    labels=[analyzer.class_],
-    classes=[analyzer.class_],
-)
-
-# Export
-export_json(analyzer.brf_vector, "results.json")
-```
-
-## Package Structure
-
-```
-brf/
-├── __init__.py
-├── analyzer.py ← BRFAnalyzer main class
-├── metrics/
-│   ├── baseline_gap.py ← B
-│   ├── instability.py ← I
-│   ├── null_test.py ← N (permutation test)
-│   └── metadata.py ← M
-├── phase/
-│   ├── embedding.py ← S = N - I, E = B + M
-│   ├── classifier.py ← Reliable / Fragile / Void
-│   └── visualization.py ← phase diagram, clustering plot
-├── report/
-│   ├── json_export.py
-│   └── latex_export.py
-```
-
-## Steps
-
-### Phase 1: Package skeleton (1-2 weeks)
-- [x] Initialize Python project with `pyproject.toml`
-- [x] Implement `BRFAnalyzer` main class with fit/predict interface
-- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
-- [x] Write unit tests for each metric
-
-### Phase 2: Phase embedding + classification (1 week)
-- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
-- [x] Build phase diagram visualization (matplotlib)
-- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
-### Phase 3: Documentation + distribution (1-2 weeks)
-- [x] Write README with quick-start tutorial and API docs
-- [ ] Publish to TestPyPI → PyPI
-- [ ] Set up ReadTheDocs for auto-generated documentation
-- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
-### Phase 4: HuggingFace Hub integration (optional, 1 week)
-- [ ] Add HF dataset loading wrapper
-- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
-## Dependencies
-
-- `numpy>=1.21`
-- `scikit-learn>=1.0`
-- `matplotlib>=3.5`
-- No deep learning dependencies required
-
-## Relationship to Sister Repos
-
-- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
-- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
-- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
-- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
-## Target Journal
-
-- Journal of Open Source Software (JOSS) - tool paper, lightweight submission
-- Followed by application papers in C&E / BJET
-
-## Timeline
-
-- Phase 1–2: 3 weeks
-- Phase 3: 2 weeks
-- Phase 4: optional
-- JOSS submission: after Phase 3
@@ -1,121 +0,0 @@
-Metadata-Version: 2.1
-Name: benchmark-reliability
-Version: 0.1.2
-Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
-Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
-License: MIT
-Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
-Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
-Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
-Classifier: Development Status :: 3 - Alpha
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3.9
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
-Requires-Python: >=3.8
-Description-Content-Type: text/markdown
-Requires-Dist: numpy>=1.21
-Requires-Dist: scikit-learn>=1.0
-Requires-Dist: matplotlib>=3.5
-
-# BenchmarkReliability - BRF Python Package
-
-## Target
-
-Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
-## Method
-
-The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
-```python
-from brf import BRFAnalyzer
-from brf.phase import plot_phase_diagram
-from brf.report import export_json
-
-analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
-print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
-# Visualization
-plot_phase_diagram(
-    [analyzer.S], [analyzer.E],
-    labels=[analyzer.class_],
-    classes=[analyzer.class_],
-)
-
-# Export
-export_json(analyzer.brf_vector, "results.json")
-```
-
-## Package Structure
-
-```
-brf/
-├── __init__.py
-├── analyzer.py ← BRFAnalyzer main class
-├── metrics/
-│   ├── baseline_gap.py ← B
-│   ├── instability.py ← I
-│   ├── null_test.py ← N (permutation test)
-│   └── metadata.py ← M
-├── phase/
-│   ├── embedding.py ← S = N - I, E = B + M
-│   ├── classifier.py ← Reliable / Fragile / Void
-│   └── visualization.py ← phase diagram, clustering plot
-├── report/
-│   ├── json_export.py
-│   └── latex_export.py
-```
-
-## Steps
-
-### Phase 1: Package skeleton (1-2 weeks)
-- [x] Initialize Python project with `pyproject.toml`
-- [x] Implement `BRFAnalyzer` main class with fit/predict interface
-- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
-- [x] Write unit tests for each metric
-
-### Phase 2: Phase embedding + classification (1 week)
-- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
-- [x] Build phase diagram visualization (matplotlib)
-- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
-### Phase 3: Documentation + distribution (1-2 weeks)
-- [x] Write README with quick-start tutorial and API docs
-- [ ] Publish to TestPyPI → PyPI
-- [ ] Set up ReadTheDocs for auto-generated documentation
-- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
-### Phase 4: HuggingFace Hub integration (optional, 1 week)
-- [ ] Add HF dataset loading wrapper
-- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
-## Dependencies
-
-- `numpy>=1.21`
-- `scikit-learn>=1.0`
-- `matplotlib>=3.5`
-- No deep learning dependencies required
-
-## Relationship to Sister Repos
-
-- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
-- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
-- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
-- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
-## Target Journal
-
-- Journal of Open Source Software (JOSS) - tool paper, lightweight submission
-- Followed by application papers in C&E / BJET
-
-## Timeline
-
-- Phase 1–2: 3 weeks
-- Phase 3: 2 weeks
-- Phase 4: optional
-- JOSS submission: after Phase 3
File without changes
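
The 0.1.4 README added in this diff states that the embedding coordinates are S = N - I and E = B + M, and its Quick Start prints an example vector. A minimal sketch, assuming only those two formulas, can be used to check that the printed example is internally consistent. The helper name `embedding_coordinates` is hypothetical and is not part of the `brf` package; the rule that maps (S, E) to Reliable, Fragile, or Void is not given in the README, so it is left out here.

```python
from typing import Dict, Tuple


def embedding_coordinates(scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (S, E), using S = N - I and E = B + M as stated in the README.

    Hypothetical helper for illustration; not the package's own implementation.
    """
    s = scores["N"] - scores["I"]  # Signal Identifiability
    e = scores["B"] + scores["M"]  # Epistemic Completeness
    return s, e


# Dimension values copied from the README's Quick Start output comment.
example = {"B": 0.123, "I": 0.045, "N": 0.97, "M": 0.82}
S, E = embedding_coordinates(example)
print(round(S, 3), round(E, 3))  # 0.925 0.943
```

The rounded values agree with the `'S': 0.925, 'E': 0.943` entries shown in the README example.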