benchmark-reliability 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

benchmark_reliability-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 zhanglizhuo
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

benchmark_reliability-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,121 @@
+Metadata-Version: 2.1
+Name: benchmark-reliability
+Version: 0.1.0
+Summary: Benchmark Reliability Framework (BRF) — dataset-level reliability auditing for predictive benchmarks
+Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: numpy>=1.21
+Requires-Dist: scikit-learn>=1.0
+Requires-Dist: matplotlib>=3.5
+# BenchmarkReliability — BRF Python Package
+## Target
+Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
+## Method
+The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
+```python
+from brf import BRFAnalyzer
+from brf.phase import plot_phase_diagram
+from brf.report import export_json
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class
+# Visualization
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+# Export
+export_json(analyzer.brf_vector, "results.json")
+```
+## Package Structure
+```
+brf/
+├── __init__.py
+├── analyzer.py          ← BRFAnalyzer main class
+├── metrics/
+│   ├── baseline_gap.py  ← B
+│   ├── instability.py   ← I
+│   ├── null_test.py     ← N (permutation test)
+│   └── metadata.py      ← M
+├── phase/
+│   ├── embedding.py     ← S = N - I, E = B + M
+│   ├── classifier.py    ← Reliable / Fragile / Void
+│   └── visualization.py ← phase diagram, clustering plot
+└── report/
+    ├── json_export.py
+    └── latex_export.py
+```
+## Steps
+### Phase 1: Package skeleton (1-2 weeks)
+- [x] Initialize Python project with `pyproject.toml`
+- [x] Implement `BRFAnalyzer` main class with a sklearn-style `fit` interface
+- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
+- [x] Write unit tests for each metric
+### Phase 2: Phase embedding + classification (1 week)
+- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
+- [x] Build phase diagram visualization (matplotlib)
+- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
+### Phase 3: Documentation + distribution (1-2 weeks)
+- [x] Write README with quick-start tutorial and API docs
+- [ ] Publish to TestPyPI → PyPI
+- [ ] Set up ReadTheDocs for auto-generated documentation
+- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
+### Phase 4: HuggingFace Hub integration (optional, 1 week)
+- [ ] Add HF dataset loading wrapper
+- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
+## Dependencies
+- `numpy>=1.21`
+- `scikit-learn>=1.0`
+- `matplotlib>=3.5`
+- No deep learning dependencies required
+## Relationship to Sister Repos
+- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
+- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
+- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
+- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
+## Target Journal
+- Journal of Open Source Software (JOSS) — tool paper, lightweight submission
+- Followed by application papers in C&E / BJET
+## Timeline
+- Phase 1–2: 3 weeks
+- Phase 3: 2 weeks
+- Phase 4: optional
+- JOSS submission: after Phase 3

benchmark_reliability-0.1.0/README.md ADDED

@@ -0,0 +1,97 @@
+# BenchmarkReliability — BRF Python Package
+## Target
+Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
+## Method
+The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
+```python
+from brf import BRFAnalyzer
+from brf.phase import plot_phase_diagram
+from brf.report import export_json
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class
+# Visualization
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+# Export
+export_json(analyzer.brf_vector, "results.json")
+```
+## Package Structure
+```
+brf/
+├── __init__.py
+├── analyzer.py          ← BRFAnalyzer main class
+├── metrics/
+│   ├── baseline_gap.py  ← B
+│   ├── instability.py   ← I
+│   ├── null_test.py     ← N (permutation test)
+│   └── metadata.py      ← M
+├── phase/
+│   ├── embedding.py     ← S = N - I, E = B + M
+│   ├── classifier.py    ← Reliable / Fragile / Void
+│   └── visualization.py ← phase diagram, clustering plot
+└── report/
+    ├── json_export.py
+    └── latex_export.py
+```
+## Steps
+### Phase 1: Package skeleton (1-2 weeks)
+- [x] Initialize Python project with `pyproject.toml`
+- [x] Implement `BRFAnalyzer` main class with a sklearn-style `fit` interface
+- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
+- [x] Write unit tests for each metric
+### Phase 2: Phase embedding + classification (1 week)
+- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
+- [x] Build phase diagram visualization (matplotlib)
+- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
+### Phase 3: Documentation + distribution (1-2 weeks)
+- [x] Write README with quick-start tutorial and API docs
+- [ ] Publish to TestPyPI → PyPI
+- [ ] Set up ReadTheDocs for auto-generated documentation
+- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
+### Phase 4: HuggingFace Hub integration (optional, 1 week)
+- [ ] Add HF dataset loading wrapper
+- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
+## Dependencies
+- `numpy>=1.21`
+- `scikit-learn>=1.0`
+- `matplotlib>=3.5`
+- No deep learning dependencies required
+## Relationship to Sister Repos
+- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
+- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
+- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
+- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
+## Target Journal
+- Journal of Open Source Software (JOSS) — tool paper, lightweight submission
+- Followed by application papers in C&E / BJET
+## Timeline
+- Phase 1–2: 3 weeks
+- Phase 3: 2 weeks
+- Phase 4: optional
+- JOSS submission: after Phase 3

benchmark_reliability-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,41 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "benchmark-reliability"
+version = "0.1.0"
+description = "Benchmark Reliability Framework (BRF) — dataset-level reliability auditing for predictive benchmarks"
+readme = "README.md"
+license = { text = "MIT" }
+requires-python = ">=3.8"
+authors = [
+    { name = "zhanglizhuo", email = "zhanglizhuo@gmail.com" },
+]
+keywords = ["benchmark reliability", "dataset auditing", "educational AI", "machine learning"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+    "numpy>=1.21",
+    "scikit-learn>=1.0",
+    "matplotlib>=3.5",
+]
+[project.urls]
+Homepage = "https://github.com/zhanglizhuo/BenchmarkReliability"
+Repository = "https://github.com/zhanglizhuo/BenchmarkReliability"
+[tool.setuptools]
+license-files = []
+[tool.setuptools.packages.find]
+where = ["src"]

benchmark_reliability-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

benchmark_reliability-0.1.0/setup.py ADDED

@@ -0,0 +1,8 @@
+from setuptools import setup, find_packages
+setup(
+    name="benchmark-reliability",
+    version="0.1.0",
+    packages=find_packages(where="src"),
+    package_dir={"": "src"},
+)

benchmark_reliability-0.1.0/src/benchmark_reliability.egg-info/PKG-INFO ADDED

@@ -0,0 +1,121 @@
+Metadata-Version: 2.1
+Name: benchmark-reliability
+Version: 0.1.0
+Summary: Benchmark Reliability Framework (BRF) — dataset-level reliability auditing for predictive benchmarks
+Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
+License: MIT
+Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
+Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
+Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: numpy>=1.21
+Requires-Dist: scikit-learn>=1.0
+Requires-Dist: matplotlib>=3.5
+# BenchmarkReliability — BRF Python Package
+## Target
+Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
+## Method
+The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
+```python
+from brf import BRFAnalyzer
+from brf.phase import plot_phase_diagram
+from brf.report import export_json
+analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
+print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class
+# Visualization
+plot_phase_diagram(
+    [analyzer.S], [analyzer.E],
+    labels=[analyzer.class_],
+    classes=[analyzer.class_],
+)
+# Export
+export_json(analyzer.brf_vector, "results.json")
+```
+## Package Structure
+```
+brf/
+├── __init__.py
+├── analyzer.py          ← BRFAnalyzer main class
+├── metrics/
+│   ├── baseline_gap.py  ← B
+│   ├── instability.py   ← I
+│   ├── null_test.py     ← N (permutation test)
+│   └── metadata.py      ← M
+├── phase/
+│   ├── embedding.py     ← S = N - I, E = B + M
+│   ├── classifier.py    ← Reliable / Fragile / Void
+│   └── visualization.py ← phase diagram, clustering plot
+└── report/
+    ├── json_export.py
+    └── latex_export.py
+```
+## Steps
+### Phase 1: Package skeleton (1-2 weeks)
+- [x] Initialize Python project with `pyproject.toml`
+- [x] Implement `BRFAnalyzer` main class with a sklearn-style `fit` interface
+- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
+- [x] Write unit tests for each metric
+### Phase 2: Phase embedding + classification (1 week)
+- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
+- [x] Build phase diagram visualization (matplotlib)
+- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
+### Phase 3: Documentation + distribution (1-2 weeks)
+- [x] Write README with quick-start tutorial and API docs
+- [ ] Publish to TestPyPI → PyPI
+- [ ] Set up ReadTheDocs for auto-generated documentation
+- [ ] Add GitHub Actions CI (test on Python 3.9–3.12)
+### Phase 4: HuggingFace Hub integration (optional, 1 week)
+- [ ] Add HF dataset loading wrapper
+- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand
+## Dependencies
+- `numpy>=1.21`
+- `scikit-learn>=1.0`
+- `matplotlib>=3.5`
+- No deep learning dependencies required
+## Relationship to Sister Repos
+- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
+- `LLMScoringAudit/`: first applied use case (MM-TBA × multiple LLMs)
+- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
+- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings
+## Target Journal
+- Journal of Open Source Software (JOSS) — tool paper, lightweight submission
+- Followed by application papers in C&E / BJET
+## Timeline
+- Phase 1–2: 3 weeks
+- Phase 3: 2 weeks
+- Phase 4: optional
+- JOSS submission: after Phase 3

benchmark_reliability-0.1.0/src/benchmark_reliability.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,26 @@
+LICENSE
+README.md
+pyproject.toml
+setup.py
+src/benchmark_reliability.egg-info/PKG-INFO
+src/benchmark_reliability.egg-info/SOURCES.txt
+src/benchmark_reliability.egg-info/dependency_links.txt
+src/benchmark_reliability.egg-info/requires.txt
+src/benchmark_reliability.egg-info/top_level.txt
+src/brf/__init__.py
+src/brf/analyzer.py
+src/brf/metrics/__init__.py
+src/brf/metrics/baseline_gap.py
+src/brf/metrics/instability.py
+src/brf/metrics/metadata.py
+src/brf/metrics/null_test.py
+src/brf/phase/__init__.py
+src/brf/phase/classifier.py
+src/brf/phase/embedding.py
+src/brf/phase/visualization.py
+src/brf/report/__init__.py
+src/brf/report/json_export.py
+src/brf/report/latex_export.py
+tests/test_analyzer.py
+tests/test_metrics.py
+tests/test_phase.py

benchmark_reliability-0.1.0/src/benchmark_reliability.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

benchmark_reliability-0.1.0/src/benchmark_reliability.egg-info/requires.txt ADDED

@@ -0,0 +1,3 @@
+numpy>=1.21
+scikit-learn>=1.0
+matplotlib>=3.5

benchmark_reliability-0.1.0/src/benchmark_reliability.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@
+brf

benchmark_reliability-0.1.0/src/brf/__init__.py ADDED

@@ -0,0 +1,3 @@
+from .analyzer import BRFAnalyzer
+__all__ = ["BRFAnalyzer"]

benchmark_reliability-0.1.0/src/brf/analyzer.py ADDED

@@ -0,0 +1,133 @@
+import math
+import warnings
+from typing import Optional
+import numpy as np
+from sklearn.base import clone
+from sklearn.linear_model import Ridge
+from sklearn.metrics import r2_score
+from sklearn.preprocessing import StandardScaler
+from .metrics import compute_b, compute_i, compute_m
+from .phase import compute_phase_from_brf, classify_dataset
+class BRFAnalyzer:
+    def __init__(
+        self,
+        n_splits: int = 30,
+        n_permutations: int = 200,
+        model=None,
+        seed: int = 42,
+        scale: bool = True,
+    ):
+        if n_splits < 2:
+            raise ValueError("n_splits must be >= 2")
+        self.n_splits = n_splits
+        self.n_permutations = n_permutations
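+        # Compare against None: some estimators define __len__, so `model or ...` can misfire.
+        self.model = model if model is not None else Ridge(alpha=1.0)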
+        self.seed = seed
+        self.scale = scale
+        self._fitted = False
+        self.B: Optional[float] = None
+        self.I: Optional[float] = None
+        self.N: Optional[float] = None
+        self.M: Optional[float] = None
+        self.S: Optional[float] = None
+        self.E: Optional[float] = None
+        self.class_: Optional[str] = None
+    def _validate_inputs(self, X, y):
+        X = np.asarray(X, dtype=float)
+        y = np.asarray(y, dtype=float)
+        if X.ndim != 2:
+            raise ValueError(f"X must be 2D, got shape {X.shape}")
+        if y.ndim != 1:
+            raise ValueError(f"y must be 1D, got shape {y.shape}")
+        if len(X) != len(y):
+            raise ValueError(f"X and y length mismatch: {len(X)} vs {len(y)}")
+        if len(X) < 20:
+            raise ValueError(f"Need at least 20 samples, got {len(X)}")
+        if not np.all(np.isfinite(X)):
+            raise ValueError("X contains NaN or Inf values")
+        if not np.all(np.isfinite(y)):
+            raise ValueError("y contains NaN or Inf values")
+        unique_y = np.unique(y)
+        if len(unique_y) <= 12 and np.all(unique_y == unique_y.astype(int)):
+            warnings.warn(
+                "y appears to be integer classification labels "
+                f"({len(unique_y)} unique values). "
+                "BRF is designed for regression targets."
+            )
+        return X, y
+    def fit(self, X, y, groups=None):
+        X, y = self._validate_inputs(X, y)
+        n = len(y)
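+        if self.scale:
+            # Note: the scaler is fit on all rows before splitting, so test rows
+            # contribute to the scaling statistics used in every split.
+            scaler = StandardScaler()
+            X = scaler.fit_transform(X)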
+        rng_cv = np.random.default_rng(self.seed)
+        rng_perm = np.random.default_rng(self.seed + 10_007)
+        r2_scores = []
+        b_gains = []
+        n_per_fold = max(3, math.ceil(self.n_permutations / self.n_splits))
+        exceed_count = 0
+        for i in range(self.n_splits):
+            idx = rng_cv.permutation(n)
+            split = max(1, int(0.8 * n))
+            train_idx = idx[:split]
+            test_idx = idx[split:]
+            Xtr, Xte = X[train_idx], X[test_idx]
+            ytr, yte = y[train_idx], y[test_idx]
+            y_mean = np.full(len(yte), float(np.mean(ytr)))
+            m = clone(self.model)
+            m.fit(Xtr, ytr)
+            y_pred = m.predict(Xte)
+            r2_real = r2_score(yte, y_pred)
+            r2_scores.append(r2_real)
+            b_gains.append(compute_b(yte, y_pred, y_mean))
+            perm_r2s = []
+            for _ in range(n_per_fold):
+                y_perm = rng_perm.permutation(ytr)
+                m_perm = clone(self.model)
+                m_perm.fit(Xtr, y_perm)
+                y_pred_perm = m_perm.predict(Xte)
+                perm_r2s.append(r2_score(yte, y_pred_perm))
+            if r2_real > float(np.median(perm_r2s)):
+                exceed_count += 1
+        self.B = float(np.mean(b_gains))
+        self.I = compute_i(r2_scores)
+        self.N = exceed_count / self.n_splits
+        self.M = compute_m(groups)
+        self.S, self.E = compute_phase_from_brf(self.B, self.I, self.N, self.M)
+        self.class_ = classify_dataset(self.S, self.E)
+        self._fitted = True
+        return self
+    @property
+    def brf_vector(self) -> dict:
+        if not self._fitted:
+            raise RuntimeError("call fit() before accessing brf_vector")
+        return {
+            "B": self.B,
+            "I": self.I,
+            "N": self.N,
+            "M": self.M,
+            "S": self.S,
+            "E": self.E,
+            "class": self.class_,
+        }
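
Note on the permutation budget in `fit`: the `n_permutations` setting is spread across splits, and each split runs `max(3, ceil(n_permutations / n_splits))` label shuffles, so the total number of shuffled refits can exceed `n_permutations`. The sketch below reproduces only that arithmetic with the standard library. The helper name `permutation_budget` is hypothetical and is not part of the package.

```python
import math


def permutation_budget(n_permutations: int, n_splits: int) -> tuple:
    """Return (shuffles per split, total shuffled refits), mirroring fit()."""
    # Same expression as n_per_fold in BRFAnalyzer.fit: at least 3 per split.
    per_split = max(3, math.ceil(n_permutations / n_splits))
    return per_split, per_split * n_splits


# Package defaults: n_permutations=200, n_splits=30.
default_budget = permutation_budget(200, 30)
# A small configuration like the ones used in the test suite.
small_budget = permutation_budget(5, 2)
```

With the defaults this gives 7 shuffles per split and 210 shuffled refits in total, in addition to the 30 fits on the real labels.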

benchmark_reliability-0.1.0/src/brf/metrics/__init__.py ADDED

@@ -0,0 +1,6 @@
+from .baseline_gap import compute_b
+from .instability import compute_i
+from .null_test import compute_n
+from .metadata import compute_m
+__all__ = ["compute_b", "compute_i", "compute_n", "compute_m"]

benchmark_reliability-0.1.0/src/brf/metrics/baseline_gap.py ADDED

@@ -0,0 +1,12 @@
+import numpy as np
+from sklearn.metrics import r2_score
+def compute_b(
+    y_true: np.ndarray,
+    y_pred_model: np.ndarray,
+    y_pred_baseline: np.ndarray,
+) -> float:
+    r2_model = r2_score(y_true, y_pred_model)
+    r2_baseline = r2_score(y_true, y_pred_baseline)
+    return float(r2_model - r2_baseline)
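
Note on `compute_b`: the score is the difference of two R² values computed on the same test targets. In `BRFAnalyzer.fit` the baseline is the training-set mean, and a constant prediction can never score above 0 on R², so B is at least as large as the model's own R². The sketch below repeats the arithmetic with a hand-written R². The helper `r2` is illustrative and stands in for `sklearn.metrics.r2_score`, and all numbers are made up.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_test = [1.0, 2.0, 3.0, 4.0]
y_model = [1.0, 2.0, 3.0, 4.0]    # a perfect model, for illustration only
train_mean = 2.0                   # assumed mean of the training targets
y_baseline = [train_mean] * len(y_test)

r2_model = r2(y_test, y_model)         # 1.0
r2_baseline = r2(y_test, y_baseline)   # negative: train mean != test mean
b = r2_model - r2_baseline
```

Here the baseline scores about -0.2, so B comes out near 1.2 even though the model's R² is 1.0.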

benchmark_reliability-0.1.0/src/brf/metrics/instability.py ADDED

@@ -0,0 +1,11 @@
+from typing import Sequence
+import numpy as np
+def compute_i(r2_values: Sequence[float], eps: float = 1e-8) -> float:
+    r2_arr = np.array(r2_values)
+    mean_r2 = float(np.mean(r2_arr))
+    std_r2 = float(np.std(r2_arr, ddof=1))
+    denom = max(abs(mean_r2), 1e-4) + eps
+    return std_r2 / denom
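
Note on `compute_i`: the score is a coefficient of variation over the per-split R² values, that is, the sample standard deviation divided by the absolute mean, with the mean floored at `1e-4` so that a near-zero average R² yields a large but finite value. A standard-library sketch of the same formula follows. The function name `instability` is hypothetical and the inputs are made up.

```python
import statistics


def instability(r2_values, eps=1e-8):
    """Sample std of R² scores divided by a floored absolute mean."""
    mean_r2 = statistics.fmean(r2_values)
    std_r2 = statistics.stdev(r2_values)     # ddof=1, as in compute_i
    denom = max(abs(mean_r2), 1e-4) + eps    # floor avoids dividing by ~0
    return std_r2 / denom


stable = instability([0.50, 0.60, 0.70])     # mean 0.6, sample std 0.1
noisy = instability([-0.01, 0.00, 0.01])     # mean 0, so the floor applies
```

Because S = N - I and N is a fraction in [0, 1], a value like `noisy` (close to 100) pushes S far below the default `tau_s` of 0.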

benchmark_reliability-0.1.0/src/brf/metrics/metadata.py ADDED

@@ -0,0 +1,30 @@
+from typing import Optional
+import numpy as np
+def compute_m(groups: Optional[np.ndarray] = None) -> float:
+    if groups is None:
+        return 0.0
+    group_arr = np.asarray(groups)
+    if not np.issubdtype(group_arr.dtype, np.number):
+        _, group_arr = np.unique(group_arr, return_inverse=True)
+    if not np.all(np.isfinite(group_arr)):
+        raise ValueError("groups contains NaN or Inf values")
+    unique, counts = np.unique(group_arr, return_counts=True)
+    n_groups = len(unique)
+    if n_groups <= 1:
+        return 0.0
+    probs = counts / counts.sum()
+    entropy = -np.sum(probs * np.log(probs + 1e-10))
+    max_entropy = np.log(n_groups)
+    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0.0
+    group_balance = 1.0 - float(np.std(counts) / (np.mean(counts) + 1e-8))
+    group_balance = max(0.0, min(1.0, group_balance))
+    return float(0.5 * normalized_entropy + 0.5 * group_balance)
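
Note on `compute_m`: the score averages two balance measures over the group sizes, normalized Shannon entropy and one minus the coefficient of variation of the counts (clipped to [0, 1]). It depends only on how evenly samples are spread across groups. The sketch below repeats the calculation with the standard library and leaves out the small `1e-10` and `1e-8` stabilizers. The name `group_score` is hypothetical.

```python
import math
import statistics
from collections import Counter


def group_score(groups):
    """0.5 * normalized entropy + 0.5 * clipped (1 - CV) of group counts."""
    counts = list(Counter(groups).values())
    if len(counts) <= 1:
        return 0.0                       # a single group scores zero
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    normalized_entropy = entropy / math.log(len(counts))
    # Population std (ddof=0) matches NumPy's default for np.std.
    cv = statistics.pstdev(counts) / statistics.fmean(counts)
    balance = max(0.0, min(1.0, 1.0 - cv))
    return 0.5 * normalized_entropy + 0.5 * balance


balanced = group_score(["a"] * 50 + ["b"] * 50)   # even split
skewed = group_score(["a"] * 90 + ["b"] * 10)     # 90/10 split
```

The even split scores 1.0, while the 90/10 split scores roughly 0.33.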

benchmark_reliability-0.1.0/src/brf/metrics/null_test.py ADDED

@@ -0,0 +1,25 @@
+import numpy as np
+from sklearn.metrics import r2_score
+def compute_n(
+    y_true: np.ndarray,
+    y_pred_real: np.ndarray,
+    n_permutations: int = 500,
+    seed: int = 42,
+) -> float:
+    """Simple permutation test: shuffle y and compare R² against fixed predictions.
+    Does NOT retrain the model per permutation (see BRFAnalyzer for the
+    per-fold retrain version used in the full BRF protocol).
+    """
+    rng = np.random.default_rng(seed)
+    r2_real = r2_score(y_true, y_pred_real)
+    count_exceed = 0
+    for _ in range(n_permutations):
+        y_perm = rng.permutation(y_true)
+        r2_perm = r2_score(y_perm, y_pred_real)
+        if r2_real >= r2_perm:
+            count_exceed += 1
+    return count_exceed / n_permutations
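
Note on the standalone `compute_n`: it keeps the predictions fixed, shuffles the true targets, and reports the fraction of shuffles whose R² does not exceed the real R². A standard-library sketch of that procedure follows. The names `r2` and `null_fraction` are illustrative, and `random.Random` replaces NumPy's generator, so the individual shuffles will not match the package's.

```python
import random


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def null_fraction(y_true, y_pred, n_permutations=500, seed=42):
    """Fraction of target shuffles with R² <= the R² on the real targets."""
    rng = random.Random(seed)
    r2_real = r2(y_true, y_pred)
    hits = 0
    for _ in range(n_permutations):
        y_perm = list(y_true)
        rng.shuffle(y_perm)              # break the target/prediction pairing
        if r2_real >= r2(y_perm, y_pred):
            hits += 1
    return hits / n_permutations


y = [float(i) for i in range(20)]
fraction = null_fraction(y, y)           # perfect predictions
```

With perfect predictions the real R² is 1.0, which no shuffle can beat, so the fraction is 1.0.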

benchmark_reliability-0.1.0/src/brf/phase/__init__.py ADDED

@@ -0,0 +1,5 @@
+from .embedding import compute_phase_from_brf
+from .classifier import classify_dataset
+from .visualization import plot_phase_diagram
+__all__ = ["compute_phase_from_brf", "classify_dataset", "plot_phase_diagram"]

benchmark_reliability-0.1.0/src/brf/phase/classifier.py ADDED

@@ -0,0 +1,7 @@
+def classify_dataset(S: float, E: float, tau_s: float = 0.0, tau_e: float = 0.5) -> str:
+    if S <= tau_s:
+        return "Void"
+    elif E <= tau_e:
+        return "Fragile"
+    else:
+        return "Reliable"

benchmark_reliability-0.1.0/src/brf/phase/embedding.py ADDED

@@ -0,0 +1,12 @@
+from typing import Tuple
+def compute_phase_from_brf(
+    B: float,
+    I: float,
+    N: float,
+    M: float,
+) -> Tuple[float, float]:
+    S = N - I
+    E = B + M
+    return S, E
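
Note on the embedding and `classify_dataset` taken together: the four scores reduce to a label through two thresholds. A dataset is `Void` when S = N - I is at or below `tau_s`, otherwise `Fragile` when E = B + M is at or below `tau_e`, and `Reliable` only when both thresholds are exceeded. The sketch below inlines both steps with made-up scores. The function name `label` is hypothetical.

```python
def label(B, I, N, M, tau_s=0.0, tau_e=0.5):
    """Inline S = N - I, E = B + M, then apply the two-threshold rule."""
    S = N - I
    E = B + M
    if S <= tau_s:
        return "Void"        # signal not separable from noise, or too unstable
    if E <= tau_e:
        return "Fragile"     # signal present, but low gain plus metadata
    return "Reliable"


strong = label(B=0.80, I=0.05, N=1.0, M=0.0)   # S = 0.95, E = 0.80
weak = label(B=0.30, I=0.20, N=0.9, M=0.0)     # S = 0.70, E = 0.30
noise = label(B=0.01, I=4.00, N=0.5, M=1.0)    # S = -3.5, checked first
```

Because M is 0.0 when no `groups` are passed, a dataset without group labels needs B above 0.5 to be labeled `Reliable` under the default thresholds.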

benchmark_reliability-0.1.0/src/brf/phase/visualization.py ADDED

@@ -0,0 +1,52 @@
+from typing import List, Optional
+import matplotlib.pyplot as plt
+import numpy as np
+def plot_phase_diagram(
+    S_list: List[float],
+    E_list: List[float],
+    labels: Optional[List[str]] = None,
+    classes: Optional[List[str]] = None,
+    title: str = "BRF Phase Diagram",
+    save_path: Optional[str] = None,
+    tau_s: float = 0.0,
+    tau_e: float = 0.5,
+):
+    fig, ax = plt.subplots(figsize=(8, 6))
+    if classes is not None:
+        color_map = {"Reliable": "#2ecc71", "Fragile": "#f39c12", "Void": "#e74c3c"}
+        for cls in set(classes):
+            mask = [c == cls for c in classes]
+            ax.scatter(
+                np.array(S_list)[mask],
+                np.array(E_list)[mask],
+                c=color_map.get(cls, "#95a5a6"),
+                label=cls,
+                s=80,
+                edgecolors="black",
+                linewidths=0.5,
+                alpha=0.8,
+            )
+        ax.legend(fontsize=12)
+    else:
+        ax.scatter(S_list, E_list, c="#3498db", s=80, edgecolors="black", linewidths=0.5)
+    if labels:
+        for i, label in enumerate(labels):
+            ax.annotate(label, (S_list[i], E_list[i]), fontsize=8, alpha=0.8)
+    ax.axhline(y=tau_e, color="gray", linestyle="--", alpha=0.4, label=f"E = {tau_e} (Fragile boundary)")
+    ax.axvline(x=tau_s, color="gray", linestyle="--", alpha=0.4, label=f"S = {tau_s} (Void boundary)")
+    # Build the legend after the boundary lines so their labels are included.
+    ax.legend(fontsize=12)
+    ax.set_xlabel("Signal Identifiability (S = N - I)", fontsize=12)
+    ax.set_ylabel("Epistemic Completeness (E = B + M)", fontsize=12)
+    ax.set_title(title, fontsize=14)
+    ax.grid(True, alpha=0.3)
+    if save_path:
+        fig.savefig(save_path, dpi=300, bbox_inches="tight")
+    return fig

benchmark_reliability-0.1.0/src/brf/report/__init__.py ADDED

@@ -0,0 +1,4 @@
+from .json_export import export_json
+from .latex_export import export_latex
+__all__ = ["export_json", "export_latex"]

benchmark_reliability-0.1.0/src/brf/report/json_export.py ADDED

@@ -0,0 +1,8 @@
+import json
+def export_json(brf_vector: dict, filepath: str) -> None:
+    if any(v is None for v in brf_vector.values()):
+        raise ValueError("BRF vector contains None values; call fit() first")
+    with open(filepath, "w", encoding="utf-8") as f:
+        json.dump(brf_vector, f, indent=2, ensure_ascii=False)

benchmark_reliability-0.1.0/src/brf/report/latex_export.py ADDED

@@ -0,0 +1,23 @@
+def export_latex(brf_vector: dict) -> str:
+    """Export BRF vector as a LaTeX table (requires booktabs package)."""
+    for v in brf_vector.values():
+        if v is None:
+            raise ValueError("BRF vector contains None values; call fit() first")
+    lines = [
+        r"\begin{tabular}{lcc}",
+        r"\toprule",
+        r"Dimension & Value & Interpretation \\",
+        r"\midrule",
+        f"B (Baseline Gain) & {brf_vector['B']:.3f} & Model improvement over mean predictor \\\\",
+        f"I (Instability) & {brf_vector['I']:.3f} & Sensitivity to split choice \\\\",
+        f"N (Null Separability) & {brf_vector['N']:.3f} & Signal distinguishability from noise \\\\",
+        f"M (Metadata Sufficiency) & {brf_vector['M']:.3f} & Group structure completeness \\\\",
+        r"\midrule",
+        f"S (Signal Identifiability) & {brf_vector['S']:.3f} & N - I \\\\",
+        f"E (Epistemic Completeness) & {brf_vector['E']:.3f} & B + M \\\\",
+        r"\midrule",
+        f"Class & \\multicolumn{{2}}{{c}}{{{brf_vector['class']}}} \\\\",
+        r"\bottomrule",
+        r"\end{tabular}",
+    ]
+    return "\n".join(lines)

benchmark_reliability-0.1.0/tests/test_analyzer.py ADDED

@@ -0,0 +1,138 @@
+import numpy as np
+import pytest
+from brf import BRFAnalyzer
+class TestBRFAnalyzer:
+    def test_fit_returns_self(self):
+        X = np.random.default_rng(0).normal(size=(100, 5))
+        y = X[:, 0] + 0.1 * np.random.default_rng(0).normal(size=100)
+        analyzer = BRFAnalyzer(n_splits=5, n_permutations=20, seed=42)
+        result = analyzer.fit(X, y)
+        assert result is analyzer
+    def test_brf_vector_keys(self):
+        X = np.random.default_rng(1).normal(size=(100, 5))
+        y = X[:, 0] + 0.1 * np.random.default_rng(1).normal(size=100)
+        analyzer = BRFAnalyzer(n_splits=5, n_permutations=20, seed=42)
+        analyzer.fit(X, y)
+        v = analyzer.brf_vector
+        expected_keys = {"B", "I", "N", "M", "S", "E", "class"}
+        assert set(v.keys()) == expected_keys
+    def test_all_values_computed(self):
+        X = np.random.default_rng(2).normal(size=(100, 5))
+        y = X[:, 0] + 0.5 * X[:, 1] + np.random.default_rng(2).normal(scale=0.2, size=100)
+        analyzer = BRFAnalyzer(n_splits=10, n_permutations=30, seed=42)
+        analyzer.fit(X, y)
+        v = analyzer.brf_vector
+        for key in ["B", "I", "N", "M", "S", "E"]:
+            assert v[key] is not None, f"{key} should not be None"
+            assert np.isfinite(v[key]), f"{key} should be finite"
+        assert v["class"] in ["Reliable", "Fragile", "Void"]
+    def test_reliable_with_clean_signal(self):
+        rng = np.random.default_rng(42)
+        X = rng.normal(size=(200, 3))
+        y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
+        analyzer = BRFAnalyzer(n_splits=10, n_permutations=50, seed=42)
+        analyzer.fit(X, y)
+        assert analyzer.class_ == "Reliable"
+    def test_void_with_noise_only(self):
+        rng = np.random.default_rng(42)
+        X = rng.normal(size=(200, 3))
+        y = rng.normal(size=200)
+        analyzer = BRFAnalyzer(n_splits=10, n_permutations=50, seed=42)
+        analyzer.fit(X, y)
+        assert analyzer.class_ in ("Fragile", "Void")
+    def test_groups_affect_m_score(self):
+        rng = np.random.default_rng(42)
+        X = rng.normal(size=(200, 3))
+        y = X[:, 0] + 0.3 * rng.normal(size=200)
+        a1 = BRFAnalyzer(n_splits=5, n_permutations=20, seed=42)
+        a1.fit(X, y, groups=np.repeat([0, 1, 2, 3], 50))
+        m_with = a1.M
+        a2 = BRFAnalyzer(n_splits=5, n_permutations=20, seed=42)
+        a2.fit(X, y)
+        m_without = a2.M
+        assert m_with > 0.0
+        assert m_without == 0.0
+    def test_custom_model(self):
+        from sklearn.linear_model import LinearRegression
+        X = np.random.default_rng(3).normal(size=(100, 3))
+        y = X[:, 0] + 0.2 * np.random.default_rng(3).normal(size=100)
+        analyzer = BRFAnalyzer(n_splits=5, n_permutations=20, model=LinearRegression(), seed=42)
+        analyzer.fit(X, y)
+        assert analyzer.class_ is not None
+    def test_nan_input_raises(self):
+        rng = np.random.default_rng(0)
+        X = rng.normal(size=(20, 3))
+        X[0, 0] = np.nan
+        y = rng.normal(size=20)
+        analyzer = BRFAnalyzer(n_splits=2, n_permutations=5)
+        with pytest.raises(ValueError, match="NaN"):
+            analyzer.fit(X, y)
+    def test_inf_input_raises(self):
+        rng = np.random.default_rng(0)
+        X = rng.normal(size=(20, 3))
+        X[0, 0] = np.inf
+        y = rng.normal(size=20)
+        analyzer = BRFAnalyzer(n_splits=2, n_permutations=5)
+        with pytest.raises(ValueError, match="Inf"):
+            analyzer.fit(X, y)
+    def test_too_few_samples_raises(self):
+        X = np.random.default_rng(0).normal(size=(3, 2))
+        y = np.random.default_rng(0).normal(size=3)
+        analyzer = BRFAnalyzer(n_splits=2, n_permutations=5)
+        with pytest.raises(ValueError, match="20 samples"):
+            analyzer.fit(X, y)
+    def test_dimension_mismatch_raises(self):
+        X = np.random.default_rng(0).normal(size=(20, 2))
+        y = np.random.default_rng(0).normal(size=5)
+        analyzer = BRFAnalyzer(n_splits=2, n_permutations=5)
+        with pytest.raises(ValueError, match="length mismatch"):
+            analyzer.fit(X, y)
+    def test_1d_X_raises(self):
+        X = np.random.default_rng(0).normal(size=20)
+        y = np.random.default_rng(0).normal(size=20)
+        analyzer = BRFAnalyzer(n_splits=2, n_permutations=5)
+        with pytest.raises(ValueError, match="2D"):
+            analyzer.fit(X, y)
+    def test_brf_vector_before_fit_raises(self):
+        analyzer = BRFAnalyzer()
+        with pytest.raises(RuntimeError, match="fit"):
+            _ = analyzer.brf_vector
+    def test_n_splits_less_than_2_raises(self):
+        with pytest.raises(ValueError, match="n_splits"):
+            BRFAnalyzer(n_splits=1)
+    def test_scale_false_still_works(self):
+        rng = np.random.default_rng(42)
+        X = rng.normal(size=(100, 5))
+        y = X[:, 0] + 0.3 * rng.normal(size=100)
+        analyzer = BRFAnalyzer(n_splits=5, n_permutations=20, scale=False, seed=42)
+        analyzer.fit(X, y)
+        assert analyzer.class_ is not None
+    def test_classification_warning(self):
+        rng = np.random.default_rng(42)
+        X = rng.normal(size=(50, 2))
+        y = rng.integers(0, 2, size=50)
+        analyzer = BRFAnalyzer(n_splits=5, n_permutations=12, seed=42)
+        with pytest.warns(UserWarning, match="classification"):
+            analyzer.fit(X, y)
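
Note on the validation tests above: they pin down only the error messages that `fit` raises ("NaN", "Inf", "20 samples", "length mismatch", "2D"), not how the checks are implemented. A minimal sketch of pre-fit checks that would satisfy those message patterns follows. The function name `validate_inputs`, the constant `MIN_SAMPLES`, and the order of the checks are assumptions for illustration, not the package's actual code.

```python
import numpy as np

# Assumed minimum, inferred only from the "20 samples" message the tests match.
MIN_SAMPLES = 20


def validate_inputs(X, y):
    """Hypothetical pre-fit checks; not taken from the package source."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shape checks come first so later checks can rely on X being a matrix.
    if X.ndim != 2:
        raise ValueError("X must be a 2D array")
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y length mismatch")
    if X.shape[0] < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples")
    # Value checks: reject missing and infinite entries separately.
    if np.isnan(X).any() or np.isnan(y).any():
        raise ValueError("input contains NaN")
    if np.isinf(X).any() or np.isinf(y).any():
        raise ValueError("input contains Inf")
    return X, y
```

Running the shape checks before the value checks means a 1D `X` is reported as a shape problem rather than failing later inside a NaN scan.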

benchmark_reliability-0.1.0/tests/test_metrics.py ADDED

@@ -0,0 +1,125 @@
+import numpy as np
+import pytest
+from brf.metrics import compute_b, compute_i, compute_n, compute_m
+class TestComputeB:
+    def test_perfect_prediction(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0])
+        y_pred_model = np.array([1.0, 2.0, 3.0, 4.0])
+        y_pred_baseline = np.array([2.5, 2.5, 2.5, 2.5])
+        b = compute_b(y_true, y_pred_model, y_pred_baseline)
+        assert b == 1.0
+    def test_model_no_better_than_baseline(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0])
+        y_pred_model = np.array([2.5, 2.5, 2.5, 2.5])
+        y_pred_baseline = np.array([2.5, 2.5, 2.5, 2.5])
+        b = compute_b(y_true, y_pred_model, y_pred_baseline)
+        assert b == 0.0
+    def test_negative_b_gap(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0])
+        y_pred_model = np.array([4.0, 3.0, 2.0, 1.0])
+        y_pred_baseline = np.array([2.5, 2.5, 2.5, 2.5])
+        b = compute_b(y_true, y_pred_model, y_pred_baseline)
+        assert b < 0.0
+class TestComputeI:
+    def test_zero_instability(self):
+        values = [0.5, 0.5, 0.5, 0.5]
+        i = compute_i(values)
+        assert i == 0.0
+    def test_low_instability(self):
+        values = [0.5, 0.51, 0.49, 0.5]
+        i = compute_i(values)
+        assert 0.0 < i < 0.1
+    def test_high_instability(self):
+        values = [0.9, 0.1, 0.8, 0.2]
+        i = compute_i(values)
+        assert i > 0.5
+    def test_eps_avoid_division_by_zero(self):
+        values = [0.0, 0.0, 0.0, 0.0]
+        i = compute_i(values)
+        assert not np.isnan(i)
+        assert np.isfinite(i)
+    def test_near_zero_mean_does_not_explode(self):
+        values = [0.001, -0.002, 0.003, -0.001, 0.002]
+        i = compute_i(values)
+        assert np.isfinite(i)
+        assert i < 1e4
+class TestComputeN:
+    def test_perfect_predictions(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
+        y_pred = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
+        n = compute_n(y_true, y_pred, n_permutations=200)
+        assert n >= 0.99
+    def test_worse_than_random(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
+        y_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
+        n = compute_n(y_true, y_pred, n_permutations=200)
+        assert n < 0.5
+    def test_deterministic_seed(self):
+        y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
+        y_pred = np.array([1.2, 2.3, 2.8, 4.1, 4.9, 6.0, 7.1, 7.8, 9.2, 9.9])
+        n1 = compute_n(y_true, y_pred, n_permutations=100, seed=42)
+        n2 = compute_n(y_true, y_pred, n_permutations=100, seed=42)
+        assert n1 == n2
+    def test_range_bounded(self):
+        rng = np.random.default_rng(99)
+        y_true = rng.normal(size=50)
+        y_pred = y_true + rng.normal(scale=0.3, size=50)
+        n = compute_n(y_true, y_pred, n_permutations=100)
+        assert 0.0 <= n <= 1.0
+class TestComputeM:
+    def test_no_groups(self):
+        m = compute_m(groups=None)
+        assert m == 0.0
+    def test_single_group(self):
+        groups = np.zeros(100)
+        m = compute_m(groups=groups)
+        assert m == 0.0
+    def test_perfectly_balanced_multi_group(self):
+        groups = np.repeat([0, 1, 2, 3], 25)
+        m = compute_m(groups=groups)
+        assert m > 0.5
+    def test_imbalanced_groups_lower_score(self):
+        balanced = np.repeat([0, 1, 2, 3], 25)
+        imbalanced = np.concatenate([np.full(85, 0), np.full(5, 1), np.full(5, 2), np.full(5, 3)])
+        m_bal = compute_m(groups=balanced)
+        m_imb = compute_m(groups=imbalanced)
+        assert m_bal > m_imb
+    def test_range_bounded_01(self):
+        rng = np.random.default_rng(42)
+        for _ in range(10):
+            n = rng.integers(5, 20)
+            groups = rng.integers(0, n // 2, size=100)
+            m = compute_m(groups=groups)
+            assert 0.0 <= m <= 1.0
+    def test_nan_groups_raises(self):
+        groups = np.array([0, 1, np.nan, 1, 0])
+        with pytest.raises(ValueError, match="NaN"):
+            compute_m(groups=groups)
+    def test_non_numeric_groups_converted(self):
+        groups = np.array(["a", "b", "c", "a", "b"])
+        m = compute_m(groups=groups)
+        assert m > 0.0

benchmark_reliability-0.1.0/tests/test_phase.py ADDED

@@ -0,0 +1,96 @@
+import numpy as np
+import pytest
+from brf.phase import compute_phase_from_brf, classify_dataset, plot_phase_diagram
+from brf.report import export_json, export_latex
+class TestEmbedding:
+    def test_phase_coordinates(self):
+        S, E = compute_phase_from_brf(B=1.0, I=0.01, N=1.0, M=0.7)
+        assert S == pytest.approx(0.99)
+        assert E == pytest.approx(1.70)
+class TestClassifier:
+    def test_reliable(self):
+        assert classify_dataset(S=1.0, E=0.8) == "Reliable"
+    def test_fragile(self):
+        assert classify_dataset(S=0.3, E=0.2) == "Fragile"
+    def test_void_due_to_negative_s(self):
+        assert classify_dataset(S=-0.1, E=0.8) == "Void"
+    def test_void_due_to_zero_s(self):
+        assert classify_dataset(S=0.0, E=0.8) == "Void"
+    def test_custom_thresholds(self):
+        assert classify_dataset(S=0.1, E=0.6, tau_s=0.2) == "Void"
+        assert classify_dataset(S=0.5, E=0.4, tau_e=0.5) == "Fragile"
+    def test_edge_case_boundary(self):
+        assert classify_dataset(S=0.5, E=0.5) == "Fragile"
+class TestReport:
+    def test_export_json_none_raises(self):
+        with pytest.raises(ValueError, match="None"):
+            export_json({"B": None, "I": 0.5}, "dummy.json")
+    def test_export_latex_none_raises(self):
+        with pytest.raises(ValueError, match="None"):
+            export_latex({"B": None, "I": 0.5})
+    def test_export_json_roundtrip(self, tmp_path):
+        p = str(tmp_path / "test.json")
+        bv = {"B": 0.8, "I": 0.1, "N": 0.95, "M": 0.5, "S": 0.85, "E": 1.3, "class": "Reliable"}
+        export_json(bv, p)
+        import json
+        with open(p) as f:
+            loaded = json.load(f)
+        assert loaded == bv
+    def test_export_latex_output(self):
+        bv = {"B": 0.8, "I": 0.1, "N": 0.95, "M": 0.5, "S": 0.85, "E": 1.3, "class": "Reliable"}
+        out = export_latex(bv)
+        assert "tabular" in out
+        assert "Reliable" in out
+        assert "0.800" in out
+class TestVisualization:
+    def test_plot_returns_figure(self):
+        fig = plot_phase_diagram(
+            S_list=[0.9, 0.3, -0.5],
+            E_list=[1.5, 0.2, 0.8],
+            labels=["A", "B", "C"],
+        )
+        assert fig is not None
+    def test_plot_without_labels(self):
+        fig = plot_phase_diagram(
+            S_list=[0.5, 0.0],
+            E_list=[0.6, 1.0],
+        )
+        assert fig is not None
+    def test_plot_custom_thresholds(self):
+        fig = plot_phase_diagram(
+            S_list=[0.9, -0.1],
+            E_list=[0.6, 0.4],
+            tau_s=0.1,
+            tau_e=0.3,
+        )
+        assert fig is not None
+    def test_plot_save_path(self, tmp_path):
+        from pathlib import Path
+        save_path = str(tmp_path / "phase.png")
+        fig = plot_phase_diagram(
+            S_list=[0.5, 0.0],
+            E_list=[0.6, 1.0],
+            save_path=save_path,
+        )
+        assert Path(save_path).exists()
+        assert fig is not None
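
Note on `TestEmbedding` and `TestClassifier` above: the expected values are consistent with an embedding S = B - I, E = N + M and a two-threshold rule. The sketch below is one rule that reproduces every case in those tests. The function names, the default thresholds (`tau_s=0.0`, `tau_e=0.5`), and the strictness of each comparison are assumptions inferred from the test cases, not the package's implementation.

```python
def phase_coordinates_sketch(B, I, N, M):
    """Assumed embedding: S = B - I and E = N + M."""
    return B - I, N + M


def classify_sketch(S, E, tau_s=0.0, tau_e=0.5):
    """Hypothetical three-way rule; default thresholds are guesses."""
    # S at or below its threshold is treated as no usable signal.
    if S <= tau_s:
        return "Void"
    # E must strictly exceed its threshold, so the boundary case is Fragile.
    if E > tau_e:
        return "Reliable"
    return "Fragile"


# Same inputs as the classifier tests above.
print(classify_sketch(S=1.0, E=0.8))
print(classify_sketch(S=0.0, E=0.8))
print(classify_sketch(S=0.5, E=0.5))
```

Other rules would also fit these cases, for example one that additionally requires a minimum S for "Reliable", so the tests alone do not fix the decision boundary.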