PyPI - eval-toolkit - Versions diffs - 0.49.0__tar.gz → 0.50.0__tar.gz - Mend

eval-toolkit 0.49.0__tar.gz → 0.50.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (182)

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,62 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.50.0] — 2026-05-23 — SPEC 7 `rng` parameter adoption
+The SPEC 7 follow-up to v0.49.0. The `_rng.py` scaffold shipped at
+v0.49.0 (SeedLike + RNGLike type aliases per
+[Scientific Python SPEC 7](https://scientific-python.org/specs/spec-0007/))
+is now wired into every Tier-1 public function that consumes a NumPy RNG.
+### BREAKING
+**22 Tier-1 function signatures**: `seed: int = X` / `random_state: int | None` → `rng: RNGLike | SeedLike | None = X`. Pre-v1.0 SemVer-minor BREAKING (v0.34.0 precedent). Defaults preserved (still deterministic-by-default).
+Affected functions:
+- `bootstrap.py` (7 public + 1 private): `bootstrap_ci`, `paired_bootstrap_diff`, `paired_bootstrap_ece_diff`, `paired_bootstrap_op_point_diff`, `paired_mde`, `block_bootstrap_on_folds`, `cross_validate_metric`, `_bootstrap_t_ci`.
+- `metrics.py:1063`: `expected_calibration_error_debiased`.
+- `thresholds.py`: `selected_operating_point` + `_bootstrap_threshold_metric_cis`.
+- `analysis.py`: `bootstrap_metric_from_predictions`, `paired_diff_from_prediction_refs`.
+- `harness.py` (6 sites): `evaluate`, `evaluate_scorer_on_slice`, `_bootstrap_auc_ci`, `_evaluate_scores`, `_compute_paired_diffs`, `_score_all_slices`.
+- `scorecards.py`: `scorecard`, `_evaluate_spec`.
+- `stacking.py`: `LogisticStacker.random_state` → `LogisticStacker.rng` class-field rename (sklearn pass-through derives int at the boundary).
+**Body refactors**:
+- 4 SeedSequence.spawn() sites converted from `np.random.SeedSequence(seed).spawn(n)` to `rng.bit_generator.seed_seq.spawn(n)` (Option A — preserves existing worker SeedSequence signatures).
+- 2 sklearn-bridge sites in `cross_validate_metric` derive int from rng before passing to `StratifiedKFold`/`KFold(random_state=...)` (defensive across sklearn versions <1.4).
+- `LogisticStacker.fit` derives sklearn int from `self.rng` at the boundary.
+**Config schema** (Tier-2 additive): `evaluate()` config dict key `"seed"` → `"rng"`. Generator-typed input serializes as `repr(rng)`; int/None serialize as-is (backward-compatible for prior int-seed usage).
+### Added
+- **Docstrings**: NumPy-style parameter doc for every renamed function now references `rng : RNGLike | SeedLike | None` with explicit link to SPEC 7.
+- **STYLE.md §3a** + **ADR 0004 D4**: `rng` row flipped from "target convention; adopted in v0.50.0" → "**canonical** convention (adopted v0.50.0)".
+### Changed
+- **Test sweep** (~230+ test sites): `seed=X` → `rng=X` in test kwarg calls, EXCEPT in test files that test legitimate `seed`-as-int contexts (`test_adversarial.py` for Python `random.Random`, `test_seeds.py` for `set_global_seeds`, `test_splits*.py` for Splitter dataclass fields, `test_text_dedup*.py` for MinHashLSHStrategy class field).
+- **CHANGELOG header**: this release.
+### Exceptions to SPEC 7 (KEPT `seed:` — documented in STYLE.md §3a + ADR 0004 D4)
+- `seeds.set_global_seeds(seed: int)` — global-state setter, not per-function RNG.
+- `adversarial.py` dataclass fields + functional wrappers — use Python stdlib `random.Random(seed)`, not NumPy.
+- `splits.py` Splitter dataclass class-fields (`HoldoutSplitter.seed`, `StratifiedKFoldSplitter.seed`, etc.) — configuration storage, not user-facing RNG parameter.
+- `loaders.py:903` YAML config schema key — declarative; renaming would break consumer YAMLs.
+### Migration
+- Consumer (`prompt-injection-detection-submission`) lockstep: bump dep pin `>=0.49.0` → `>=0.50.0`; rename `seed=` → `rng=` on eval-toolkit-bound call sites (estimated 5-8 sites).
+- Bit-for-bit reproducibility preserved when migrating `seed=42` → `rng=42` (int seed is SeedLike; `np.random.default_rng(42)` is the canonical normalization).
+### Notes
+- Ships in parallel with Round 8 audit STOP-GATE (Decision Y.2); R8 briefing at commit `6f6839a`, awaiting Codex+Gemini reports.
+- Memory pattern captured at v0.49.0: pre-flight grep MUST cover `README.md`, `.doctest-modules`, and any config files (per `feedback_sybil_runs_readme.md`). Applied to v0.50.0 pre-flight.
 ## [0.49.0] — 2026-05-23 — Global naming-standards sweep + final cleanup before v1.0
 Final pre-v1.0 minor consolidating the naming-convention standardization

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.49.0
+Version: 0.50.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -233,12 +233,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/README.md RENAMED

@@ -150,12 +150,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/STYLE.md RENAMED

@@ -76,7 +76,7 @@ them; deviations need justification in the PR description.
 | `n_jobs` | Parallelism (joblib + sklearn convention) |
 | `ax` | Matplotlib axis (matplotlib convention) |
 | `metric` | Callable `(y_true, y_score) -> float` |
-| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — target convention; adopted in v0.50.0 |
+| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — **canonical** convention (adopted v0.50.0). Accepts `int`, `np.random.Generator`, `BitGenerator`, `SeedSequence`, or `None`. |
 The v0.50.0 SPEC 7 adoption preserves two `seed: int` exceptions:
 `set_global_seeds(seed: int)` (global-state setter, not per-function

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/src/eval_toolkit/_rng.py RENAMED

@@ -26,6 +26,7 @@ Exceptions to the SPEC 7 convention — documented in STYLE.md §3a:
 from __future__ import annotations
 from collections.abc import Sequence
+from typing import cast
 import numpy as np
@@ -44,3 +45,21 @@ type RNGLike = np.random.Generator | np.random.BitGenerator
 ``Generator`` inputs and lifts ``BitGenerator`` inputs into a
 ``Generator`` — both forms compose cleanly.
 """
+def spawn_seed_sequences(rng: RNGLike | SeedLike | None, n: int) -> list[np.random.SeedSequence]:
+    """Spawn ``n`` independent SeedSequences from any SPEC 7 ``rng`` input.
+    Normalizes the input to a ``Generator``, then extracts the underlying
+    ``SeedSequence`` via the bit-generator and spawns ``n`` children.
+    The cast satisfies mypy strict: the ``seed_seq`` attribute on a
+    concrete BitGenerator is a ``SeedSequence`` instance, but the type
+    stub on ``BitGenerator.seed_seq`` returns the abstract
+    ``ISeedSequence`` interface (which lacks ``spawn``).
+    Used by the bootstrap parallel workers (which take spawned
+    ``SeedSequence`` objects to seed their internal ``default_rng()`` calls).
+    """
+    gen = np.random.default_rng(rng)
+    seed_seq = cast(np.random.SeedSequence, gen.bit_generator.seed_seq)
+    return seed_seq.spawn(n)

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.49.0"
+__version__ = "0.50.0"

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/src/eval_toolkit/analysis.py RENAMED

@@ -11,6 +11,7 @@ from typing import Any
 import numpy as np
+from eval_toolkit._rng import RNGLike, SeedLike
 from eval_toolkit.bootstrap import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
 from eval_toolkit.protocols import PredictionReader
@@ -121,7 +122,7 @@ def bootstrap_metric_from_predictions(
     *,
     reader: PredictionReader | None = None,
     n_resamples: int = 1000,
-    seed: int = 42,
+    rng: RNGLike | SeedLike | None = 42,
 ) -> dict[str, object]:
     """Compute a PR-AUC bootstrap CI from one prediction ref."""
     arrays = load_prediction_arrays(ref, reader=reader)
@@ -130,7 +131,7 @@ def bootstrap_metric_from_predictions(
         arrays.scores,
         pr_auc,
         n_resamples=n_resamples,
-        seed=seed,
+        rng=rng,
     ).to_dict()
@@ -141,7 +142,7 @@ def paired_diff_from_prediction_refs(
     baseline_reader: PredictionReader | None = None,
     candidate_reader: PredictionReader | None = None,
     n_resamples: int = 1000,
-    seed: int = 42,
+    rng: RNGLike | SeedLike | None = 42,
 ) -> dict[str, object]:
     """Compute paired PR-AUC delta from two prediction refs.
@@ -172,7 +173,7 @@ def paired_diff_from_prediction_refs(
         candidate.scores,
         pr_auc,
         n_resamples=n_resamples,
-        seed=seed,
+        rng=rng,
     ).to_dict()

{eval_toolkit-0.49.0 → eval_toolkit-0.50.0}/src/eval_toolkit/bootstrap.py RENAMED

@@ -31,6 +31,7 @@ from scipy.stats import norm as _scipy_norm
 from scipy.stats import rankdata as _scipy_rankdata
 from eval_toolkit._parallel import parallel_map
+from eval_toolkit._rng import RNGLike, SeedLike, spawn_seed_sequences
 _logger = logging.getLogger(__name__)
@@ -236,7 +237,7 @@ def bootstrap_ci(
     n_resamples: int = DEFAULT_N_RESAMPLES,
     confidence: float = DEFAULT_CONFIDENCE,
     method: Literal["BCa", "percentile", "studentized"] = DEFAULT_METHOD,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
     n_jobs: int = 1,
 ) -> BootstrapCI:
     """Per-condition CI via :func:`scipy.stats.bootstrap`.
@@ -257,8 +258,9 @@ def bootstrap_ci(
         Two-sided confidence level (default 0.95).
     method : {"BCa", "percentile", "studentized"}, optional
         Default "BCa".
-    seed : int, optional
-        RNG seed for reproducibility.
+    rng : RNGLike | SeedLike | None, optional
+        RNG argument per `Scientific Python SPEC 7 <https://scientific-python.org/specs/spec-0007/>`_.
+        Int seed (default ``DEFAULT_SEED=42``), ``Generator``, or ``None`` (entropy).
     n_jobs : int, optional
         Parallel workers (default 1 — sequential). Only effective when
         ``method='studentized'`` (which has the only Python-level outer loop
@@ -284,7 +286,7 @@ def bootstrap_ci(
     >>> rng = np.random.default_rng(42)
     >>> y = rng.integers(0, 2, size=200)
     >>> s = y + rng.normal(0, 0.3, size=200)
-    >>> ci = bootstrap_ci(y, s, metric=pr_auc, n_resamples=200, seed=42)
+    >>> ci = bootstrap_ci(y, s, metric=pr_auc, n_resamples=200, rng=42)
     >>> ci.ci_low <= ci.point_estimate <= ci.ci_high
     True
@@ -319,13 +321,13 @@ def bootstrap_ci(
         )
     _logger.debug(
-        "bootstrap_ci: metric=%s n=%d n_resamples=%d method=%s confidence=%.3f seed=%d n_jobs=%d",
+        "bootstrap_ci: metric=%s n=%d n_resamples=%d method=%s confidence=%.3f rng=%r n_jobs=%d",
         getattr(metric, "__name__", repr(metric)),
         n,
         n_resamples,
         method,
         confidence,
-        seed,
+        rng,
         n_jobs,
     )
@@ -342,11 +344,11 @@ def bootstrap_ci(
             point,
             n_resamples=n_resamples,
             confidence=confidence,
-            seed=seed,
+            rng=rng,
             n_jobs=n_jobs,
         )
     else:
-        rng = np.random.default_rng(seed)
+        rng = np.random.default_rng(rng)
         res = _scipy_bootstrap(
             (y_true_arr, y_score_arr),
             statistic=_statistic,
@@ -423,7 +425,7 @@ def _bootstrap_t_ci(
     *,
     n_resamples: int,
     confidence: float,
-    seed: int,
+    rng: RNGLike | SeedLike | None,
     n_jobs: int = 1,
 ) -> tuple[float, float]:
     r"""Studentized bootstrap-t CI per Algeshiemer 2024 / Davison & Hinkley §5.2.
@@ -441,7 +443,7 @@ def _bootstrap_t_ci(
     Skips degenerate resamples (single-class draws causing the metric to
     raise); raises if > 5% of resamples are degenerate.
     """
-    seed_seqs = np.random.SeedSequence(seed).spawn(n_resamples)
+    seed_seqs = spawn_seed_sequences(rng, n_resamples)
     step = functools.partial(_bootstrap_t_step, y_true=y_true, y_score=y_score, metric=metric)
     raw_results = parallel_map(step, seed_seqs, n_jobs=n_jobs, description="bootstrap_t")
     valid_pairs = [r for r, _ in raw_results if r is not None]
@@ -505,7 +507,7 @@ def paired_bootstrap_diff(
     *,
     n_resamples: int = DEFAULT_N_RESAMPLES,
     confidence: float = DEFAULT_CONFIDENCE,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
     n_jobs: int = 1,
 ) -> PairedBootstrapCI:
     """Paired-bootstrap CI on ``metric(B) − metric(A)`` using the same resample indices.
@@ -518,7 +520,7 @@ def paired_bootstrap_diff(
         Scores from two scorers on the same rows.
     metric : callable ``(y_true, y_score) -> float``
         Must be picklable when ``n_jobs != 1`` (lambdas not supported).
-    n_resamples, confidence, seed : standard bootstrap params.
+    n_resamples, confidence, rng : standard bootstrap params (``rng`` per SPEC 7).
     n_jobs : int, optional
         Parallel workers (default 1 — sequential). ``n_jobs > 1`` uses
         joblib loky; ``n_jobs=-1`` uses all cores; ``n_jobs=0`` is rejected.
@@ -547,7 +549,7 @@ def paired_bootstrap_diff(
     >>> y = rng.integers(0, 2, size=200)
     >>> s_a = rng.normal(0, 1, size=200)                 # random scorer
     >>> s_b = y + rng.normal(0, 0.3, size=200)           # signal scorer
-    >>> diff = paired_bootstrap_diff(y, s_a, s_b, pr_auc, n_resamples=200, seed=42)
+    >>> diff = paired_bootstrap_diff(y, s_a, s_b, pr_auc, n_resamples=200, rng=42)
     >>> diff.delta > 0  # B beats A
     True
@@ -581,7 +583,7 @@ def paired_bootstrap_diff(
         raise ValueError(f"n={n} too small for paired bootstrap; need ≥ 10")
     delta_point = float(metric(y_true_arr, b)) - float(metric(y_true_arr, a))
-    seed_seqs = np.random.SeedSequence(seed).spawn(n_resamples)
+    seed_seqs = spawn_seed_sequences(rng, n_resamples)
     step = functools.partial(
         _paired_bootstrap_diff_step,
         y_true_arr=y_true_arr,
@@ -654,7 +656,7 @@ def paired_bootstrap_ece_diff(
     ece_fn: Callable[[np.ndarray, np.ndarray, int], float],
     n_resamples: int = DEFAULT_N_RESAMPLES,
     confidence: float = DEFAULT_CONFIDENCE,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
     n_bins: int = 10,
     n_jobs: int = 1,
 ) -> PairedBootstrapCI:
@@ -677,7 +679,7 @@ def paired_bootstrap_ece_diff(
         does not depend on calibration. Typical use:
         ``from eval_toolkit.metrics import expected_calibration_error``,
         then pass ``ece_fn=expected_calibration_error``.
-    n_resamples, confidence, seed : standard bootstrap params.
+    n_resamples, confidence, rng : standard bootstrap params (``rng`` per SPEC 7).
     n_bins : int, optional
         Number of ECE bins (passed through to ``ece_fn``).
     n_jobs : int, optional
@@ -715,7 +717,7 @@ def paired_bootstrap_ece_diff(
         raise ValueError(f"n={n} too small for paired bootstrap; need >= 10")
     delta_point = float(ece_fn(y_true_arr, b, n_bins)) - float(ece_fn(y_true_arr, a, n_bins))
-    seed_seqs = np.random.SeedSequence(seed).spawn(n_resamples)
+    seed_seqs = spawn_seed_sequences(rng, n_resamples)
     step = functools.partial(
         _paired_bootstrap_ece_diff_step,
         y_true_arr=y_true_arr,
@@ -798,7 +800,7 @@ def paired_bootstrap_op_point_diff(
     *,
     n_resamples: int = DEFAULT_N_RESAMPLES,
     confidence: float = DEFAULT_CONFIDENCE,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
     n_jobs: int = 1,
 ) -> PairedBootstrapCI:
     r"""Two-level paired bootstrap for operating-point lifts.
@@ -826,7 +828,7 @@ def paired_bootstrap_op_point_diff(
         ``lambda y, s: MaxF1Selector().select(y, s).threshold``).
     metric_fn : callable ``(y_true, y_score, threshold) -> float``
         Operating-point metric (e.g., F1, precision) at the given threshold.
-    n_resamples, confidence, seed : standard bootstrap params.
+    n_resamples, confidence, rng : standard bootstrap params (``rng`` per SPEC 7).
     n_jobs : int, optional
         Parallel workers (default 1 — sequential). See
         :ref:`methodology/parallelism`. Both ``threshold_fn`` and
@@ -913,7 +915,7 @@ def paired_bootstrap_op_point_diff(
         metric_fn(test_y_arr, test_a, thr_a_full)
     )
-    seed_seqs = np.random.SeedSequence(seed).spawn(n_resamples)
+    seed_seqs = spawn_seed_sequences(rng, n_resamples)
     step = functools.partial(
         _paired_bootstrap_op_point_diff_step,
         val_y_arr=val_y_arr,
@@ -1132,7 +1134,7 @@ def paired_mde(
     alpha: float = 0.05,
     power: float = 0.80,
     n_resamples: int = DEFAULT_N_RESAMPLES,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
     n_jobs: int = 1,
 ) -> MDEEstimate:
     r"""Minimum detectable paired Δ at (α, power).
@@ -1174,7 +1176,7 @@ def paired_mde(
         metric,
         n_resamples=n_resamples,
         confidence=0.95,
-        seed=seed,
+        rng=rng,
         n_jobs=n_jobs,
     )
     est = mde_from_ci(paired, alpha=alpha, power=power)
@@ -1306,7 +1308,7 @@ def block_bootstrap_on_folds(
     *,
     n_resamples: int = DEFAULT_N_RESAMPLES,
     confidence: float = DEFAULT_CONFIDENCE,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
 ) -> BootstrapCI:
     r"""Block bootstrap on folds: resample K folds with replacement; percentile CI on mean.
@@ -1341,8 +1343,9 @@ def block_bootstrap_on_folds(
         the cross-fold sensitivity-check use case (runs in O(seconds)).
     confidence : float, optional
         Two-sided confidence level (default 0.95).
-    seed : int, optional
-        RNG seed for reproducibility.
+    rng : RNGLike | SeedLike | None, optional
+        RNG argument per `Scientific Python SPEC 7 <https://scientific-python.org/specs/spec-0007/>`_.
+        Int seed (default ``DEFAULT_SEED=42``), ``Generator``, or ``None`` (entropy).
     Returns
     -------
@@ -1360,7 +1363,7 @@ def block_bootstrap_on_folds(
     --------
     >>> import numpy as np
     >>> folds = np.array([0.83, 0.81, 0.85, 0.79, 0.84])
-    >>> ci = block_bootstrap_on_folds(folds, n_resamples=2000, seed=42)
+    >>> ci = block_bootstrap_on_folds(folds, n_resamples=2000, rng=42)
     >>> ci.method
     'block_bootstrap'
     >>> bool(ci.ci_low <= ci.point_estimate <= ci.ci_high)
@@ -1389,7 +1392,7 @@ def block_bootstrap_on_folds(
     if not 0.0 < confidence < 1.0:
         raise ValueError(f"confidence must be in (0, 1); got {confidence}")
-    rng = np.random.default_rng(seed)
+    rng = np.random.default_rng(rng)
     # Vectorized: (n_resamples, K) index draws, gather, mean along axis 1.
     idx = rng.integers(0, K, size=(n_resamples, K))
     resample_means = arr[idx].mean(axis=1)
@@ -1412,7 +1415,7 @@ def cross_validate_metric(
     metric: MetricFn,
     k: int = 5,
     stratified: bool = True,
-    seed: int = DEFAULT_SEED,
+    rng: RNGLike | SeedLike | None = DEFAULT_SEED,
 ) -> np.ndarray:
     r"""K-fold cross-validation of a metric on caller-supplied scores.
@@ -1444,8 +1447,8 @@ def cross_validate_metric(
         If ``True`` (default), use ``StratifiedKFold`` so each fold
         preserves the class balance. Recommended for binary
         classification under class imbalance.
-    seed : int, optional
-        Shuffle seed for fold assignment.
+    rng : RNGLike | SeedLike | None, optional
+        RNG per SPEC 7 — derived to int at the sklearn ``KFold/StratifiedKFold`` boundary.
     Returns
     -------
@@ -1467,7 +1470,7 @@ def cross_validate_metric(
     >>> n = 200
     >>> y = rng.binomial(1, 0.3, size=n).astype(int)
     >>> s = np.clip(y * 0.6 + rng.normal(0, 0.3, n), 0, 1)
-    >>> folds = cross_validate_metric(y, s, metric=pr_auc, k=5, seed=42)
+    >>> folds = cross_validate_metric(y, s, metric=pr_auc, k=5, rng=42)
     >>> folds.shape
     (5,)
     >>> bool(np.all(0.0 <= folds[~np.isnan(folds)]))
@@ -1491,12 +1494,18 @@ def cross_validate_metric(
     if k > n:
         raise ValueError(f"k={k} exceeds n={n}")
+    # Derive an int seed for sklearn — sklearn KFold's random_state accepts
+    # int | None | RandomState (not Generator) across versions <1.4; safer to
+    # derive at the boundary than pin a higher sklearn minimum.
+    rng = np.random.default_rng(rng)
+    sklearn_seed = int(rng.integers(0, 2**31 - 1))
     splitter: KFold | StratifiedKFold
     if stratified:
-        splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
+        splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=sklearn_seed)
         fold_iter = splitter.split(np.zeros(n), y_arr)
     else:
-        splitter = KFold(n_splits=k, shuffle=True, random_state=seed)
+        splitter = KFold(n_splits=k, shuffle=True, random_state=sklearn_seed)
         fold_iter = splitter.split(np.zeros(n))
     fold_metrics = np.full(k, np.nan, dtype=np.float64)
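
The last hunk derives an int for sklearn's `random_state` from the normalized generator rather than forwarding the caller's value. A minimal sketch of that boundary step follows, using NumPy only. The function name `derive_sklearn_seed` is hypothetical and does not appear in the package; the bound `2**31 - 1` is taken from the hunk.

```python
import numpy as np


def derive_sklearn_seed(rng=None):
    """Hypothetical helper mirroring the two added lines in the hunk above."""
    gen = np.random.default_rng(rng)
    # Draw one int in [0, 2**31 - 1) to hand to KFold / StratifiedKFold.
    return int(gen.integers(0, 2**31 - 1))


# The same int seed always derives the same sklearn seed.
from_int_a = derive_sklearn_seed(42)
from_int_b = derive_sklearn_seed(42)

# A SeedSequence built from the same entropy normalizes to the same stream.
from_seed_seq = derive_sklearn_seed(np.random.SeedSequence(42))

print(from_int_a, from_int_b, from_seed_seq)
```

The derivation is deterministic for a given seed, but the int that reaches sklearn is a draw from the generator, not the seed itself. Fold assignments from `rng=42` may therefore differ from those produced by `seed=42` in v0.49.0, which passed the seed straight to `random_state`.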
