benchmark-reliability 0.1.2__tar.gz → 0.1.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. benchmark_reliability-0.1.4/PKG-INFO +105 -0
  2. benchmark_reliability-0.1.4/README.md +81 -0
  3. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/pyproject.toml +1 -1
  4. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/setup.py +1 -1
  5. benchmark_reliability-0.1.4/src/benchmark_reliability.egg-info/PKG-INFO +105 -0
  6. benchmark_reliability-0.1.2/PKG-INFO +0 -121
  7. benchmark_reliability-0.1.2/README.md +0 -97
  8. benchmark_reliability-0.1.2/src/benchmark_reliability.egg-info/PKG-INFO +0 -121
  9. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/LICENSE +0 -0
  10. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/setup.cfg +0 -0
  11. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/SOURCES.txt +0 -0
  12. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/dependency_links.txt +0 -0
  13. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/requires.txt +0 -0
  14. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/benchmark_reliability.egg-info/top_level.txt +0 -0
  15. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/__init__.py +0 -0
  16. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/analyzer.py +0 -0
  17. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/__init__.py +0 -0
  18. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/baseline_gap.py +0 -0
  19. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/instability.py +0 -0
  20. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/metadata.py +0 -0
  21. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/metrics/null_test.py +0 -0
  22. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/__init__.py +0 -0
  23. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/classifier.py +0 -0
  24. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/embedding.py +0 -0
  25. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/phase/visualization.py +0 -0
  26. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/__init__.py +0 -0
  27. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/json_export.py +0 -0
  28. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/src/brf/report/latex_export.py +0 -0
  29. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_analyzer.py +0 -0
  30. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_metrics.py +0 -0
  31. {benchmark_reliability-0.1.2 → benchmark_reliability-0.1.4}/tests/test_phase.py +0 -0
@@ -0,0 +1,105 @@
+ Metadata-Version: 2.1
+ Name: benchmark-reliability
+ Version: 0.1.4
+ Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
+ Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+ Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+ Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: numpy>=1.21
+ Requires-Dist: scikit-learn>=1.0
+ Requires-Dist: matplotlib>=3.5
+
+ # benchmark-reliability
+
+ A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+ ## Installation
+
+ ```bash
+ pip install benchmark-reliability
+ ```
+
+ Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+ ## Quick Start
+
+ ```python
+ import numpy as np
+ from brf import BRFAnalyzer
+
+ # Your data
+ X = np.random.randn(200, 10)
+ y = np.random.randn(200)
+ groups = np.random.choice(["A", "B", "C"], 200)
+
+ # Run the audit
+ analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+ # Results
+ print(analyzer.brf_vector)
+ # {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+ # 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+ ```
+
+ ## BRF Dimensions
+
+ | Dimension | Name | Meaning |
+ |-----------|------|---------|
+ | B | Baseline Gain | Model improvement over mean predictor |
+ | I | Instability | Sensitivity to train/test split choice |
+ | N | Null Separability | Signal distinguishability from noise |
+ | M | Metadata Sufficiency | Group structure completeness |
+
+ The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+ ## Visualization
+
+ ```python
+ from brf.phase import plot_phase_diagram
+
+ plot_phase_diagram(
+     [analyzer.S], [analyzer.E],
+     labels=[analyzer.class_],
+     classes=[analyzer.class_],
+ )
+ ```
+
+ ## Export
+
+ ```python
+ from brf.report import export_json, export_latex
+
+ export_json(analyzer.brf_vector, "results.json")
+ latex_table = export_latex(analyzer.brf_vector)
+ ```
+
+ ## Citation
+
+ If you use this package, please cite the BehaviorAudit paper:
+
+ ```
+ BehaviorAudit: a four-dimension pre-modeling audit protocol
+ for educational prediction benchmarks. Scientific Reports (under review).
+ ```
+
+ ## License
+
+ MIT
+
+ ## Links
+
+ - GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+ - PyPI: https://pypi.org/project/benchmark-reliability/
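The PKG-INFO file in this hunk uses the RFC 822-style header layout of Python core metadata, so the standard library's `email` parser can read it. The following is a minimal sketch under that assumption; the `PKG_INFO` sample is a trimmed copy of the headers shown in the hunk, and `parse_metadata` is an illustrative helper name, not part of the package.

```python
from email.parser import Parser

# Trimmed copy of the headers from the hunk (illustrative sample, not the full file).
PKG_INFO = """\
Metadata-Version: 2.1
Name: benchmark-reliability
Version: 0.1.4
Requires-Python: >=3.8
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn>=1.0
Requires-Dist: matplotlib>=3.5

# benchmark-reliability
"""


def parse_metadata(text):
    """Return name, version, dependencies, and description from core-metadata text."""
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        # Requires-Dist may repeat, so collect every occurrence.
        "requires": msg.get_all("Requires-Dist") or [],
        # Everything after the first blank line is the long description.
        "description": msg.get_payload(),
    }


meta = parse_metadata(PKG_INFO)
print(meta["name"], meta["version"])
print(meta["requires"])
```

Reading the metadata this way makes it easy to confirm that a release such as 0.1.4 changed only the version and long description while the `Requires-Dist` entries stayed the same.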
@@ -0,0 +1,81 @@
+ # benchmark-reliability
+
+ A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+ ## Installation
+
+ ```bash
+ pip install benchmark-reliability
+ ```
+
+ Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+ ## Quick Start
+
+ ```python
+ import numpy as np
+ from brf import BRFAnalyzer
+
+ # Your data
+ X = np.random.randn(200, 10)
+ y = np.random.randn(200)
+ groups = np.random.choice(["A", "B", "C"], 200)
+
+ # Run the audit
+ analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+ # Results
+ print(analyzer.brf_vector)
+ # {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+ # 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+ ```
+
+ ## BRF Dimensions
+
+ | Dimension | Name | Meaning |
+ |-----------|------|---------|
+ | B | Baseline Gain | Model improvement over mean predictor |
+ | I | Instability | Sensitivity to train/test split choice |
+ | N | Null Separability | Signal distinguishability from noise |
+ | M | Metadata Sufficiency | Group structure completeness |
+
+ The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+ ## Visualization
+
+ ```python
+ from brf.phase import plot_phase_diagram
+
+ plot_phase_diagram(
+     [analyzer.S], [analyzer.E],
+     labels=[analyzer.class_],
+     classes=[analyzer.class_],
+ )
+ ```
+
+ ## Export
+
+ ```python
+ from brf.report import export_json, export_latex
+
+ export_json(analyzer.brf_vector, "results.json")
+ latex_table = export_latex(analyzer.brf_vector)
+ ```
+
+ ## Citation
+
+ If you use this package, please cite the BehaviorAudit paper:
+
+ ```
+ BehaviorAudit: a four-dimension pre-modeling audit protocol
+ for educational prediction benchmarks. Scientific Reports (under review).
+ ```
+
+ ## License
+
+ MIT
+
+ ## Links
+
+ - GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+ - PyPI: https://pypi.org/project/benchmark-reliability/
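The README added in this hunk defines the embedding as S = N - I and E = B + M. As a quick check of that arithmetic, here is a small sketch that recomputes both coordinates from the four dimensions; the `embed` function is a hypothetical helper written for this note, not the package's own implementation. The thresholds that separate Reliable, Fragile, and Void are not shown in this diff, so classification is left out.

```python
from typing import Dict


def embed(b: float, i: float, n: float, m: float) -> Dict[str, float]:
    """Map the four BRF dimensions to the (S, E) coordinates named in the README."""
    return {
        "S": n - i,  # Signal Identifiability
        "E": b + m,  # Epistemic Completeness
    }


# Values copied from the sample output in the Quick Start section.
coords = embed(b=0.123, i=0.045, n=0.97, m=0.82)
print({key: round(value, 3) for key, value in coords.items()})
```

With the Quick Start sample values, the rounded result agrees with the `'S': 0.925, 'E': 0.943` entries printed in the README's example output.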
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
  [project]
  name = "benchmark-reliability"
- version = "0.1.2"
+ version = "0.1.4"
  description = "Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks"
  readme = "README.md"
  license = { text = "MIT" }
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 
  setup(
      name="benchmark-reliability",
-     version="0.1.2",
+     version="0.1.4",
      packages=find_packages(where="src"),
      package_dir={"": "src"},
  )
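The version string is bumped in both pyproject.toml and setup.py, so the two files have to be kept in sync by hand. A rough sketch of a consistency check follows, assuming the version is written as a plain `version = "..."` or `version="..."` literal as in the two hunks; `PYPROJECT`, `SETUP_PY`, and `find_version` are illustrative names, and the sample strings are abbreviated copies of the post-change lines.

```python
import re

# Abbreviated post-change content of the two files (illustrative samples).
PYPROJECT = 'name = "benchmark-reliability"\nversion = "0.1.4"\n'
SETUP_PY = 'setup(\n    name="benchmark-reliability",\n    version="0.1.4",\n)\n'

# Matches a line that starts with: version = "..." (spaces around = optional).
VERSION_RE = re.compile(r'^\s*version\s*=\s*"([^"]+)"', re.MULTILINE)


def find_version(text):
    """Return the first quoted version literal in text, or None if absent."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


versions = {find_version(PYPROJECT), find_version(SETUP_PY)}
if len(versions) != 1:
    raise SystemExit(f"version mismatch: {sorted(map(str, versions))}")
print("version:", next(iter(versions)))
```

A regular expression is used instead of a TOML parser so the sketch also runs on Python 3.8, the minimum version the package declares.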
@@ -0,0 +1,105 @@
+ Metadata-Version: 2.1
+ Name: benchmark-reliability
+ Version: 0.1.4
+ Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
+ Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+ Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+ Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: numpy>=1.21
+ Requires-Dist: scikit-learn>=1.0
+ Requires-Dist: matplotlib>=3.5
+
+ # benchmark-reliability
+
+ A Python package for computing the **Benchmark Reliability Framework (BRF)**: a four-dimension audit protocol that evaluates whether a predictive dataset is structurally reliable before model development.
+
+ ## Installation
+
+ ```bash
+ pip install benchmark-reliability
+ ```
+
+ Requires Python 3.8+ with numpy, scikit-learn, and matplotlib.
+
+ ## Quick Start
+
+ ```python
+ import numpy as np
+ from brf import BRFAnalyzer
+
+ # Your data
+ X = np.random.randn(200, 10)
+ y = np.random.randn(200)
+ groups = np.random.choice(["A", "B", "C"], 200)
+
+ # Run the audit
+ analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+
+ # Results
+ print(analyzer.brf_vector)
+ # {'B': 0.123, 'I': 0.045, 'N': 0.97, 'M': 0.82,
+ # 'S': 0.925, 'E': 0.943, 'class': 'Reliable'}
+ ```
+
+ ## BRF Dimensions
+
+ | Dimension | Name | Meaning |
+ |-----------|------|---------|
+ | B | Baseline Gain | Model improvement over mean predictor |
+ | I | Instability | Sensitivity to train/test split choice |
+ | N | Null Separability | Signal distinguishability from noise |
+ | M | Metadata Sufficiency | Group structure completeness |
+
+ The embedding coordinates S = N - I (Signal Identifiability) and E = B + M (Epistemic Completeness) classify datasets into **Reliable**, **Fragile**, or **Void**.
+
+ ## Visualization
+
+ ```python
+ from brf.phase import plot_phase_diagram
+
+ plot_phase_diagram(
+     [analyzer.S], [analyzer.E],
+     labels=[analyzer.class_],
+     classes=[analyzer.class_],
+ )
+ ```
+
+ ## Export
+
+ ```python
+ from brf.report import export_json, export_latex
+
+ export_json(analyzer.brf_vector, "results.json")
+ latex_table = export_latex(analyzer.brf_vector)
+ ```
+
+ ## Citation
+
+ If you use this package, please cite the BehaviorAudit paper:
+
+ ```
+ BehaviorAudit: a four-dimension pre-modeling audit protocol
+ for educational prediction benchmarks. Scientific Reports (under review).
+ ```
+
+ ## License
+
+ MIT
+
+ ## Links
+
+ - GitHub: https://github.com/zhanglizhuo/BenchmarkReliability
+ - PyPI: https://pypi.org/project/benchmark-reliability/
@@ -1,121 +0,0 @@
- Metadata-Version: 2.1
- Name: benchmark-reliability
- Version: 0.1.2
- Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
- Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
- License: MIT
- Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
- Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
- Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
- Classifier: Development Status :: 3 - Alpha
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Python: >=3.8
- Description-Content-Type: text/markdown
- Requires-Dist: numpy>=1.21
- Requires-Dist: scikit-learn>=1.0
- Requires-Dist: matplotlib>=3.5
-
- # BenchmarkReliability - BRF Python Package
-
- ## Target
-
- Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
- ## Method
-
- The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
- ```python
- from brf import BRFAnalyzer
- from brf.phase import plot_phase_diagram
- from brf.report import export_json
-
- analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
- print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
- # Visualization
- plot_phase_diagram(
-     [analyzer.S], [analyzer.E],
-     labels=[analyzer.class_],
-     classes=[analyzer.class_],
- )
-
- # Export
- export_json(analyzer.brf_vector, "results.json")
- ```
-
- ## Package Structure
-
- ```
- brf/
- ├── __init__.py
- ├── analyzer.py ← BRFAnalyzer main class
- ├── metrics/
- │   ├── baseline_gap.py ← B
- │   ├── instability.py ← I
- │   ├── null_test.py ← N (permutation test)
- │   └── metadata.py ← M
- ├── phase/
- │   ├── embedding.py ← S = N - I, E = B + M
- │   ├── classifier.py ← Reliable / Fragile / Void
- │   └── visualization.py ← phase diagram, clustering plot
- ├── report/
- │   ├── json_export.py
- │   └── latex_export.py
- ```
-
- ## Steps
-
- ### Phase 1: Package skeleton (1-2 weeks)
- - [x] Initialize Python project with `pyproject.toml`
- - [x] Implement `BRFAnalyzer` main class with fit/predict interface
- - [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- - [x] Write unit tests for each metric
-
- ### Phase 2: Phase embedding + classification (1 week)
- - [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- - [x] Build phase diagram visualization (matplotlib)
- - [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
- ### Phase 3: Documentation + distribution (1-2 weeks)
- - [x] Write README with quick-start tutorial and API docs
- - [ ] Publish to TestPyPI → PyPI
- - [ ] Set up ReadTheDocs for auto-generated documentation
- - [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
- ### Phase 4: HuggingFace Hub integration (optional, 1 week)
- - [ ] Add HF dataset loading wrapper
- - [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
- ## Dependencies
-
- - `numpy>=1.21`
- - `scikit-learn>=1.0`
- - `matplotlib>=3.5`
- - No deep learning dependencies required
-
- ## Relationship to Sister Repos
-
- - `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
- - `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
- - `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
- - `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
- ## Target Journal
-
- - Journal of Open Source Software (JOSS) - tool paper, lightweight submission
- - Followed by application papers in C&E / BJET
-
- ## Timeline
-
- - Phase 1–2: 3 weeks
- - Phase 3: 2 weeks
- - Phase 4: optional
- - JOSS submission: after Phase 3
@@ -1,97 +0,0 @@
- # BenchmarkReliability - BRF Python Package
-
- ## Target
-
- Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
- ## Method
-
- The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
- ```python
- from brf import BRFAnalyzer
- from brf.phase import plot_phase_diagram
- from brf.report import export_json
-
- analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
- print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
- # Visualization
- plot_phase_diagram(
-     [analyzer.S], [analyzer.E],
-     labels=[analyzer.class_],
-     classes=[analyzer.class_],
- )
-
- # Export
- export_json(analyzer.brf_vector, "results.json")
- ```
-
- ## Package Structure
-
- ```
- brf/
- ├── __init__.py
- ├── analyzer.py ← BRFAnalyzer main class
- ├── metrics/
- │   ├── baseline_gap.py ← B
- │   ├── instability.py ← I
- │   ├── null_test.py ← N (permutation test)
- │   └── metadata.py ← M
- ├── phase/
- │   ├── embedding.py ← S = N - I, E = B + M
- │   ├── classifier.py ← Reliable / Fragile / Void
- │   └── visualization.py ← phase diagram, clustering plot
- ├── report/
- │   ├── json_export.py
- │   └── latex_export.py
- ```
-
- ## Steps
-
- ### Phase 1: Package skeleton (1-2 weeks)
- - [x] Initialize Python project with `pyproject.toml`
- - [x] Implement `BRFAnalyzer` main class with fit/predict interface
- - [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- - [x] Write unit tests for each metric
-
- ### Phase 2: Phase embedding + classification (1 week)
- - [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- - [x] Build phase diagram visualization (matplotlib)
- - [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
- ### Phase 3: Documentation + distribution (1-2 weeks)
- - [x] Write README with quick-start tutorial and API docs
- - [ ] Publish to TestPyPI → PyPI
- - [ ] Set up ReadTheDocs for auto-generated documentation
- - [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
- ### Phase 4: HuggingFace Hub integration (optional, 1 week)
- - [ ] Add HF dataset loading wrapper
- - [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
- ## Dependencies
-
- - `numpy>=1.21`
- - `scikit-learn>=1.0`
- - `matplotlib>=3.5`
- - No deep learning dependencies required
-
- ## Relationship to Sister Repos
-
- - `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
- - `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
- - `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
- - `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
- ## Target Journal
-
- - Journal of Open Source Software (JOSS) - tool paper, lightweight submission
- - Followed by application papers in C&E / BJET
-
- ## Timeline
-
- - Phase 1–2: 3 weeks
- - Phase 3: 2 weeks
- - Phase 4: optional
- - JOSS submission: after Phase 3
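The package structure in this removed README labels `null_test.py` as a permutation test for N. The statistic the package actually uses is not visible in this diff, so the following is only a generic permutation-test sketch. It assumes absolute Pearson correlation between a single feature and the target as the score; `abs_corr`, `permutation_p_value`, and the toy data are hypothetical names and values chosen for illustration.

```python
import random
import statistics


def abs_corr(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_permutations=200, seed=0):
    """Share of target shuffles that score at least as high as the real target."""
    rng = random.Random(seed)
    observed = abs_corr(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between x and the target
        if abs_corr(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_permutations + 1)


rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y_signal = [2.0 * v + rng.gauss(0, 0.5) for v in x]  # target depends on x
y_noise = [rng.gauss(0, 1) for _ in x]               # target independent of x

p_signal = permutation_p_value(x, y_signal)
p_noise = permutation_p_value(x, y_noise)
print(p_signal, p_noise)
```

A small p-value for `y_signal` indicates that the real target is distinguishable from shuffled ones, which is the intuition behind the "Signal distinguishability from noise" description of N; how the package converts such a result into the N score is not shown here.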
@@ -1,121 +0,0 @@
- Metadata-Version: 2.1
- Name: benchmark-reliability
- Version: 0.1.2
- Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
- Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
- License: MIT
- Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
- Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
- Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
- Classifier: Development Status :: 3 - Alpha
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Python: >=3.8
- Description-Content-Type: text/markdown
- Requires-Dist: numpy>=1.21
- Requires-Dist: scikit-learn>=1.0
- Requires-Dist: matplotlib>=3.5
-
- # BenchmarkReliability - BRF Python Package
-
- ## Target
-
- Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
-
- ## Method
-
- The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
-
- ```python
- from brf import BRFAnalyzer
- from brf.phase import plot_phase_diagram
- from brf.report import export_json
-
- analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
- print(analyzer.brf_vector) # (B, I, N, M) → (S, E) → class
-
- # Visualization
- plot_phase_diagram(
-     [analyzer.S], [analyzer.E],
-     labels=[analyzer.class_],
-     classes=[analyzer.class_],
- )
-
- # Export
- export_json(analyzer.brf_vector, "results.json")
- ```
-
- ## Package Structure
-
- ```
- brf/
- ├── __init__.py
- ├── analyzer.py ← BRFAnalyzer main class
- ├── metrics/
- │   ├── baseline_gap.py ← B
- │   ├── instability.py ← I
- │   ├── null_test.py ← N (permutation test)
- │   └── metadata.py ← M
- ├── phase/
- │   ├── embedding.py ← S = N - I, E = B + M
- │   ├── classifier.py ← Reliable / Fragile / Void
- │   └── visualization.py ← phase diagram, clustering plot
- ├── report/
- │   ├── json_export.py
- │   └── latex_export.py
- ```
-
- ## Steps
-
- ### Phase 1: Package skeleton (1-2 weeks)
- - [x] Initialize Python project with `pyproject.toml`
- - [x] Implement `BRFAnalyzer` main class with fit/predict interface
- - [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- - [x] Write unit tests for each metric
-
- ### Phase 2: Phase embedding + classification (1 week)
- - [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- - [x] Build phase diagram visualization (matplotlib)
- - [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
-
- ### Phase 3: Documentation + distribution (1-2 weeks)
- - [x] Write README with quick-start tutorial and API docs
- - [ ] Publish to TestPyPI → PyPI
- - [ ] Set up ReadTheDocs for auto-generated documentation
- - [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
-
- ### Phase 4: HuggingFace Hub integration (optional, 1 week)
- - [ ] Add HF dataset loading wrapper
- - [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
-
- ## Dependencies
-
- - `numpy>=1.21`
- - `scikit-learn>=1.0`
- - `matplotlib>=3.5`
- - No deep learning dependencies required
-
- ## Relationship to Sister Repos
-
- - `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
- - `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
- - `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
- - `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
-
- ## Target Journal
-
- - Journal of Open Source Software (JOSS) - tool paper, lightweight submission
- - Followed by application papers in C&E / BJET
-
- ## Timeline
-
- - Phase 1–2: 3 weeks
- - Phase 3: 2 weeks
- - Phase 4: optional
- - JOSS submission: after Phase 3