PyPI - eval-toolkit - Versions diffs - 0.34.0__tar.gz → 0.36.0__tar.gz - Mend



Files changed (157)

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,89 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.36.0] — 2026-05-18 — harness parallelization (#29, #30) + Node 24 actions
+Wires the v0.34.0 unified parallelism pattern into the harness evaluation
+loop. `evaluate()` and `evaluate_folded()` now accept an `n_jobs` kwarg
+(default `1` preserves bit-identical sequential behavior); under
+`n_jobs != 1`, the `(slice × scorer)` work-unit loop in
+`_score_all_slices` and the `(spec × scorer)` fit phase in
+`_attach_transferred_operating_points` dispatch through joblib loky via
+the existing `_parallel.parallel_map` helper.
+### Added
+- `evaluate(..., n_jobs: int = 1)` and `evaluate_folded(..., n_jobs: int = 1)`
+  — keyword-only kwarg per Principle #3 of `methodology/parallelism.md`.
+  `n_jobs=1` (default) runs the existing pure-Python sequential loop
+  (Principle #4 — bit-identical to v0.35). `n_jobs > 1` uses joblib loky;
+  `n_jobs=-1` uses all cores; `n_jobs=0` is rejected. Closes #29, #30.
+- Strict-pickle Scorer sniff at `evaluate()` entry when `n_jobs != 1`:
+  raises a clean `TypeError` referencing
+  `methodology/parallelism.md#scorer-picklability` with the underlying
+  pickle error attached. Reuses the v0.35 ADR contract; no new exception
+  class. Catches non-picklable scorers up front rather than relying on
+  joblib's more permissive cloudpickle path (which would silently absorb
+  closures and obscure the contract documented in v0.35).
+### Internal
+- New module-scope step functions `_score_one_pair` and
+  `_fit_one_op_point_pair` in `harness.py` (picklable; required by loky).
+- `_score_all_slices` and `_attach_transferred_operating_points`
+  refactored to use flat work-unit dispatch via `parallel_map`.
+### Tests
+- New `tests/test_harness_parallelism.py` (7 tests): bit-identical
+  reproducibility across `n_jobs=1` vs `n_jobs=2` for `evaluate`
+  (basic, paired-diffs, operating-points), `evaluate_folded`,
+  picklability rejection (closure scorer), `n_jobs=0` rejection,
+  `n_jobs=-1` smoke. All 66 harness tests pass (7 new + 59 existing).
+### Infrastructure
+- Bumped `actions/upload-artifact` and `actions/download-artifact` from
+  `@v5` → `@v6` across `publish.yml` / `nightly-mc.yml` /
+  `nightly-benchmarks.yml`. The v6 majors run on Node.js 24
+  (GitHub deprecates Node 20 actions from 2026-06-02). Other pinned
+  actions (`checkout@v6`, `setup-uv@v8.1.0`, `codeql-action@v3`,
+  `deploy-pages@v4`, `upload-pages-artifact@v3`) were not flagged in
+  the v0.35 publish annotation and are deferred to a separate audit.
+## [0.35.0] — 2026-05-18 — `fit_temperature_binary` + Scorer picklability ADR
+Small, additive release. Adds a binary-classification calibration helper
+that lets consumers drop the ~50 LOC scalar-proba adapter many were
+carrying, plus a design ADR that unblocks the v0.36 harness /
+operating-point parallelization work (#29, #30) without re-litigating
+picklability.
+### Added
+- `eval_toolkit.fit_temperature_binary(y_true, y_score)` — scalar-proba
+  adapter for the multi-class `fit_temperature` fitter. Converts `(n,)`
+  probabilities of class 1 to a 2-column logit array via clipped logit
+  (`[0, logit(p)]` so softmax row 1 reproduces `p`), delegates to the
+  deployment-quality fitter, and returns `(T_opt, apply)` where
+  `apply: (n,) -> (n,)` does scalar-in / scalar-out T-scaling. Unlike
+  `fit_temperature_oracle`, no warning — the contract assumes val / test
+  separation (deployment-quality calibration, not fit-on-test). Closes
+  #28.
+### Documentation
+- `docs/source/methodology/parallelism.md` — new `## Scorer picklability`
+  sub-section documenting the Scorer protocol's picklability contract
+  for `n_jobs > 1` usage. Includes worked picklable / broken-closure /
+  fix examples plus a list of common non-picklable patterns to watch for
+  in user-supplied Scorers (closures, lambdas on instances, local-scope
+  classes, attributes holding live sockets / file handles). Anchors on
+  the existing v0.34.0 `parallel_map` pickle sniff + `TypeError`
+  channel — no new exception class. Unblocks v0.36 implementation of
+  #29 and #30.
+- `eval_toolkit.protocols.Scorer` docstring — Notes block pointing at
+  the new methodology section.
 ## [0.34.0] — 2026-05-17 — Phase 4 stats unblockers + unified parallelism + cookbook (BREAKING)
 Closes all 7 open backlog issues in one consumer-closing release. Also
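
The `n_jobs` contract stated in the 0.36.0 entry (`1` is sequential, `> 1` is that many workers, `-1` is all cores, `0` is rejected) can be sketched roughly as below. `resolve_n_jobs` is a hypothetical helper written for illustration; it is not part of eval-toolkit. Raising `ValueError`, and rejecting negatives other than `-1`, are assumptions, since the changelog only says `n_jobs=0` "is rejected".

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Map an n_jobs request to a concrete worker count (illustrative only)."""
    if n_jobs == 0:
        # The changelog says n_jobs=0 is rejected; the exception type is assumed.
        raise ValueError("n_jobs=0 is not allowed; use 1 (sequential) or -1 (all cores)")
    if n_jobs == -1:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    if n_jobs < -1:
        # Assumption: this sketch does not model other negative values.
        raise ValueError(f"unsupported n_jobs={n_jobs}")
    return n_jobs
```

Under this reading, a caller who never passes `n_jobs` gets the default of `1`, which the changelog says stays on the existing sequential loop.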

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.34.0
+Version: 0.36.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/src/eval_toolkit/__init__.py RENAMED

@@ -87,6 +87,7 @@ _EXPORTS: dict[str, str] = {
     "fit_isotonic_calibrator": "eval_toolkit.calibration",
     "fit_platt_calibrator": "eval_toolkit.calibration",
     "fit_temperature": "eval_toolkit.calibration",
+    "fit_temperature_binary": "eval_toolkit.calibration",
     "fit_temperature_oracle": "eval_toolkit.calibration",
     "reliability_curve": "eval_toolkit.calibration",
     "reliability_diagram_data": "eval_toolkit.calibration",

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.34.0"
+__version__ = "0.36.0"

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/src/eval_toolkit/calibration.py RENAMED

@@ -57,6 +57,7 @@ __all__ = [
     "fit_isotonic_calibrator",
     "fit_platt_calibrator",
     "fit_temperature",
+    "fit_temperature_binary",
     "fit_temperature_oracle",
     "maximum_calibration_error",
     "reliability_curve",
@@ -1038,6 +1039,102 @@ def _negative_log_likelihood(t: float, logits: np.ndarray, labels: np.ndarray) -
     return float(-log_probs[np.arange(len(labels)), labels].mean())
+def fit_temperature_binary(
+    y_true: np.ndarray,
+    y_score: np.ndarray,
+    *,
+    bounds: tuple[float, float] = (0.05, 20.0),
+) -> tuple[float, Callable[[np.ndarray], np.ndarray]]:
+    r"""Binary-probability adapter for :func:`fit_temperature` (Guo et al. 2017 [#guo]_).
+    Fits a scalar T > 0 on *validation* probabilities of class 1 and returns
+    both T and a callable that applies the same T-scaling to test
+    probabilities. Internally:
+    1. Clips ``y_score`` to ``[1e-7, 1-1e-7]`` for finite logit inversion.
+    2. Builds a 2-column logit array ``[0, logit(p)]`` so softmax row 1
+       reproduces ``p`` exactly.
+    3. Delegates to :func:`fit_temperature` for the bounded NLL minimization.
+    4. Returns ``(T, apply)`` where ``apply(p_test) = sigmoid(logit(p_test)/T)``.
+    Unlike :func:`fit_temperature_oracle`, this does NOT emit a warning — the
+    contract is that ``y_true`` / ``y_score`` come from a held-out validation
+    set and ``apply`` is invoked on a separate test set (deployment-quality
+    calibration, not fit-on-test).
+    Parameters
+    ----------
+    y_true : np.ndarray, shape (n,)
+        Binary validation labels in {0, 1}.
+    y_score : np.ndarray, shape (n,)
+        Validation predicted probabilities of class 1, in [0, 1]. Values at
+        the extremes are clipped to ``[1e-7, 1 - 1e-7]``.
+    bounds : tuple of float, optional
+        ``(lo, hi)`` bracket for T. Default ``(0.05, 20.0)``, matches
+        :func:`fit_temperature`.
+    Returns
+    -------
+    tuple
+        ``(T_optimal, apply)`` where ``apply: (n,) -> (n,)`` maps any input
+        probability array through :math:`\sigma(\mathrm{logit}(p) / T)`.
+    Raises
+    ------
+    ValueError
+        On shape mismatch, empty input, non-finite scores, or single-class
+        ``y_true``.
+    RuntimeError
+        If the bounded scalar optimizer fails to converge.
+    Examples
+    --------
+    >>> import numpy as np
+    >>> rng = np.random.default_rng(0)
+    >>> n = 500
+    >>> y_val = rng.binomial(1, 0.3, size=n).astype(int)
+    >>> p_val = np.clip(y_val * 0.6 + rng.normal(0, 0.2, n), 0.01, 0.99)
+    >>> T, apply = fit_temperature_binary(y_val, p_val)
+    >>> T > 0
+    True
+    >>> p_test = np.array([0.1, 0.5, 0.9])
+    >>> apply(p_test).shape == (3,)
+    True
+    See Also
+    --------
+    fit_temperature : underlying multi-class fitter (operates on 2-col logits)
+    fit_temperature_oracle : diagnostic-only variant that fits T on the same
+        probabilities it scores
+    References
+    ----------
+    .. [#guo] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. "On
+       calibration of modern neural networks." ICML 2017. arXiv:1706.04599.
+    """
+    y_true_arr, y_score_arr = _validate_calibrator_inputs(y_true, y_score)
+    # Build 2-col logits [0, logit(p)] so softmax([0, logit(p)])[1] == p exactly.
+    s_clipped = np.clip(y_score_arr, _SCORE_CLIP_LO, _SCORE_CLIP_HI)
+    logit_pos = np.log(s_clipped / (1.0 - s_clipped))
+    val_logits_2col = np.column_stack([np.zeros_like(logit_pos), logit_pos])
+    result = fit_temperature(val_logits_2col, y_true_arr, bounds=bounds)
+    t_optimal = float(result["temperature"])
+    def apply(scores: np.ndarray) -> np.ndarray:
+        arr = np.asarray(scores, dtype=float).ravel()
+        if not np.isfinite(arr).all():
+            raise ValueError("scores contains NaN or inf")
+        clipped = np.clip(arr, _SCORE_CLIP_LO, _SCORE_CLIP_HI)
+        logit = np.log(clipped / (1.0 - clipped))
+        scaled = logit / t_optimal
+        out: np.ndarray = (1.0 / (1.0 + np.exp(-scaled))).astype(float)
+        return out
+    return t_optimal, apply
 def fit_temperature_oracle(
     y_true: np.ndarray, y_score: np.ndarray
 ) -> tuple[float, Callable[[np.ndarray], np.ndarray]]:
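
The arithmetic behind `fit_temperature_binary` can be checked with a small standard-library sketch. It is not the package's code (which uses NumPy and delegates fitting to `fit_temperature`). It only illustrates two claims from the docstring: that a softmax over `[0, logit(p)]` returns `p` at index 1, and that dividing the logit by `T` before the sigmoid gives the same result as dividing the two-column logits by `T` before the softmax. The helper names `logit`, `softmax2`, and `apply_temperature` are hypothetical.

```python
import math

CLIP_LO, CLIP_HI = 1e-7, 1.0 - 1e-7  # same clip range the docstring describes


def logit(p: float) -> float:
    """Clipped log-odds of a probability."""
    p = min(max(p, CLIP_LO), CLIP_HI)
    return math.log(p / (1.0 - p))


def softmax2(z0: float, z1: float) -> tuple[float, float]:
    """Numerically stable softmax over two logits."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


def apply_temperature(p: float, t: float) -> float:
    """sigmoid(logit(p) / t): scalar-in, scalar-out temperature scaling."""
    return 1.0 / (1.0 + math.exp(-logit(p) / t))


p = 0.9
# Index 1 of softmax([0, logit(p)]) recovers p (up to float rounding).
recovered = softmax2(0.0, logit(p))[1]
# T = 2 halves the log-odds: sigmoid(ln(9) / 2) = sigmoid(ln 3) = 0.75.
softened = apply_temperature(p, 2.0)
# The two-column softmax view and the scalar sigmoid view agree.
via_softmax = softmax2(0.0, logit(p) / 2.0)[1]
```

With `T > 1` the calibrated probability moves toward 0.5, and with `T = 1` the mapping is the identity apart from clipping and rounding.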

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/src/eval_toolkit/harness.py RENAMED

@@ -32,6 +32,7 @@ v0.7.0 additions:
 from __future__ import annotations
 import logging
+import pickle
 import time
 import traceback
 from collections.abc import Mapping, Sequence
@@ -41,6 +42,7 @@ from typing import TYPE_CHECKING, Final, Literal, cast
 import numpy as np
+from eval_toolkit._parallel import parallel_map
 from eval_toolkit.artifacts import (
     error_metric,
     sanitize_for_json,
@@ -62,7 +64,7 @@ from eval_toolkit.operating_points import (
     fit_operating_points,
 )
 from eval_toolkit.protocols import Scorer, SliceAwareScorer
-from eval_toolkit.thresholds import TargetFPRSelector
+from eval_toolkit.thresholds import TargetFPRSelector, ThresholdSelector
 if TYPE_CHECKING:
     import pandas as pd
@@ -278,6 +280,31 @@ def _object_to_dict(obj: object, *, what: str) -> dict[str, object]:
     raise TypeError(f"expected {what} mapping or object with to_dict(), got {type(obj).__name__}")
+def _assert_scorers_picklable(scorers: Mapping[str, Scorer]) -> None:
+    """Strict-pickle sniff for Scorer args when ``n_jobs != 1``.
+    joblib's loky backend uses cloudpickle (which absorbs closures + local
+    classes), but the v0.35 Scorer picklability ADR
+    (``methodology/parallelism.md#scorer-picklability``) is a *strict* pickle
+    contract — cloudpickle behavior is platform-dependent and the more
+    permissive failure modes are harder to debug. Fail fast at
+    :func:`evaluate` entry with the same ``TypeError`` style as
+    :func:`eval_toolkit._parallel.parallel_map`'s fn-sniff (no new exception
+    class — single channel for the picklability contract).
+    """
+    for sname, scorer in scorers.items():
+        try:
+            pickle.dumps(scorer)
+        except (pickle.PicklingError, AttributeError, TypeError) as exc:
+            raise TypeError(
+                f"evaluate(n_jobs != 1): scorer {sname!r} "
+                f"({type(scorer).__name__}) is not picklable. See "
+                f"methodology/parallelism.md#scorer-picklability for the "
+                f"contract and worked picklable / broken / fix examples. "
+                f"Underlying error: {exc}"
+            ) from exc
 def _should_score_slice(scorer: Scorer, slice_name: str) -> bool:
     """Honor optional slice-aware scorer hooks without widening the base Protocol."""
     should_score = getattr(scorer, "should_score_slice", None)
@@ -696,6 +723,36 @@ def _run_leakage_phase(
         )
+# Tuple shape for the flat `(slice × scorer)` work-unit dispatched to
+# parallel_map by `_score_all_slices`. Defined at module scope so workers
+# can pickle the function reference.
+_ScoreOnePairItem = tuple[EvalSlice, str, Scorer, int, int, Literal["raise", "record"]]
+_ScoreOnePairResult = tuple[str, str, dict[str, object], np.ndarray]
+def _score_one_pair(item: _ScoreOnePairItem) -> _ScoreOnePairResult:
+    """Picklable step function for ``(slice × scorer)`` parallel dispatch.
+    Module-scope so loky workers can serialize the reference (closures over
+    enclosing locals would fail :func:`parallel_map`'s pickle sniff). All
+    inputs flow through the ``item`` tuple — no captured state.
+    Returns ``(slice_name, scorer_name, result_dict, scores_array)`` so the
+    caller can reassemble ``by_slice`` + ``score_cache`` in the original
+    iteration order.
+    """
+    slice_, sname, scorer, n_resamples, seed, on_scorer_error = item
+    result = evaluate_scorer_on_slice(
+        scorer,
+        slice_,
+        n_resamples=n_resamples,
+        seed=seed,
+        on_scorer_error=on_scorer_error,
+    )
+    scores = np.asarray(result["scores"], dtype=np.float64)
+    return slice_.name, sname, result, scores
 def _score_all_slices(
     scorers: dict[str, Scorer],
     slices: Sequence[EvalSlice],
@@ -704,6 +761,7 @@ def _score_all_slices(
     seed: int,
     paired_diffs: list[tuple[str, str]] | None,
     on_scorer_error: Literal["raise", "record"],
+    n_jobs: int = 1,
 ) -> tuple[dict[str, dict[str, object]], dict[tuple[str, str], np.ndarray]]:
     """Score every ``(slice, scorer)`` pair; return ``(by_slice, score_cache)``.
@@ -714,10 +772,17 @@ def _score_all_slices(
     ``score_cache`` is keyed ``(slice.name, scorer.name)`` and carries the
     raw score arrays so :func:`_attach_transferred_operating_points` can
     re-use them without re-calling scorers.
-    """
-    by_slice: dict[str, dict[str, object]] = {}
-    score_cache: dict[tuple[str, str], np.ndarray] = {}
+    v0.36 added ``n_jobs``: a flat ``(slice × scorer)`` parallel dispatch
+    via :func:`eval_toolkit._parallel.parallel_map`. Default ``1`` preserves
+    bit-identical sequential behavior. ``n_jobs != 1`` requires picklable
+    scorers per the v0.35 ADR
+    (``docs/source/methodology/parallelism.md#scorer-picklability``).
+    """
+    # Pre-filter skipped pairs (allow-list miss) before dispatching parallel
+    # work-units. Logs the same skip messages as the pre-parallel version.
+    work_units: list[_ScoreOnePairItem] = []
+    skipped: dict[tuple[str, str], dict[str, object]] = {}
     for slice_ in slices:
         _logger.info(
             "[slice %s] n=%d, positives=%d",
@@ -725,32 +790,61 @@ def _score_all_slices(
             len(slice_.df),
             int(slice_.y_true.sum()),
         )
-        slice_data: dict[str, dict[str, object]] = {}
-        scores_by_scorer: dict[str, np.ndarray] = {}
         for sname, scorer in scorers.items():
             if not _should_score_slice(scorer, slice_.name):
                 reason = f"slice {slice_.name!r} not in scorer allow-list"
-                slice_data[sname] = _skipped_scorer_result(slice_, reason)
+                skipped[(slice_.name, sname)] = _skipped_scorer_result(slice_, reason)
                 _logger.info("    skipped %s: %s", sname, reason)
                 continue
-            t0 = time.time()
-            slice_data[sname] = evaluate_scorer_on_slice(
-                scorer,
-                slice_,
-                n_resamples=n_resamples,
-                seed=seed,
-                on_scorer_error=on_scorer_error,
-            )
-            # If the scorer raised under on_scorer_error="record", scores is [].
-            # Subsequent paired-diff machinery sees the empty array and will
-            # short-circuit on the same len-check it already does for skipped
-            # scorers; no special-case needed.
-            scores_by_scorer[sname] = np.asarray(slice_data[sname]["scores"], dtype=np.float64)
-            score_cache[(slice_.name, sname)] = scores_by_scorer[sname]
-            elapsed = time.time() - t0
-            pr = slice_data[sname].get("pr_auc")
+            work_units.append((slice_, sname, scorer, n_resamples, seed, on_scorer_error))
+    # Parallel scoring. parallel_map at n_jobs=1 is a pure-Python for-loop
+    # (Principle #4) — bit-identical to the pre-v0.36 sequential code.
+    if work_units:
+        t0_total = time.time()
+        results = parallel_map(
+            _score_one_pair,
+            work_units,
+            n_jobs=n_jobs,
+            description="harness _score_all_slices",
+        )
+        elapsed_total = time.time() - t0_total
+        _logger.info(
+            "  scored %d (slice, scorer) pairs in %.1fs (n_jobs=%d)",
+            len(work_units),
+            elapsed_total,
+            n_jobs,
+        )
+    else:
+        results = []
+    # Index results for O(1) lookup during reassembly.
+    results_by_key: dict[tuple[str, str], _ScoreOnePairResult] = {
+        (slice_name, sname): (slice_name, sname, result_dict, scores_arr)
+        for slice_name, sname, result_dict, scores_arr in results
+    }
+    # Reassemble in the original (slices × scorers.items()) iteration order.
+    by_slice: dict[str, dict[str, object]] = {}
+    score_cache: dict[tuple[str, str], np.ndarray] = {}
+    for slice_ in slices:
+        slice_data: dict[str, dict[str, object]] = {}
+        scores_by_scorer: dict[str, np.ndarray] = {}
+        for sname in scorers:
+            key = (slice_.name, sname)
+            if key in skipped:
+                slice_data[sname] = skipped[key]
+                continue
+            _, _, result_dict, scores_arr = results_by_key[key]
+            slice_data[sname] = result_dict
+            # If the scorer raised under on_scorer_error="record", scores_arr is [].
+            # Paired-diff machinery short-circuits on the same len-check it uses
+            # for skipped scorers; no special-case needed.
+            scores_by_scorer[sname] = scores_arr
+            score_cache[key] = scores_arr
+            pr = result_dict.get("pr_auc")
             pr_display = f"{pr:.4f}" if isinstance(pr, float) else "N/A"
-            _logger.info("    %s: PR-AUC=%s (%.1fs)", sname, pr_display, elapsed)
+            _logger.info("    %s: PR-AUC=%s", sname, pr_display)
         diffs = (
             _compute_paired_diffs(
@@ -789,6 +883,7 @@ def evaluate(
     on_leakage: Literal["raise", "record", "skip"] = "raise",
     on_scorer_error: Literal["raise", "record"] = "raise",
     operating_point_specs: Sequence[OperatingPointSpec] = (),
+    n_jobs: int = 1,
 ) -> RunResult:
     """Run every scorer on every slice; return a pure :class:`RunResult` (no IO).
@@ -830,6 +925,15 @@ def evaluate(
         Fit thresholds on one mixed-class slice and apply them to named target
         slices. Results are attached under each scorer's
         ``"transferred_operating_points"`` block. Default empty (skip).
+    n_jobs : int, optional
+        Parallel workers (default 1 — sequential). ``n_jobs > 1`` uses
+        joblib loky to parallelize the flat ``(slice × scorer)`` work-unit
+        loop in :func:`_score_all_slices` (and the operating-point fit
+        phase when ``operating_point_specs`` is non-empty). ``n_jobs=-1``
+        uses all cores; ``n_jobs=0`` is rejected. Scorers must be picklable
+        when ``n_jobs != 1`` — see
+        :doc:`methodology/parallelism` § Scorer picklability for the
+        contract + worked examples.
     Returns
     -------
@@ -850,6 +954,9 @@ def evaluate(
     if not slices:
         raise ValueError("at least one slice required")
+    if n_jobs != 1:
+        _assert_scorers_picklable(scorers)
     config: dict[str, object] = {
         "n_resamples": n_resamples,
         "seed": seed,
@@ -872,6 +979,7 @@ def evaluate(
         seed=seed,
         paired_diffs=paired_diffs,
         on_scorer_error=on_scorer_error,
+        n_jobs=n_jobs,
     )
     if operating_point_specs:
@@ -882,11 +990,45 @@ def evaluate(
             score_cache=score_cache,
             scorer_names=list(scorers.keys()),
             specs=operating_point_specs,
+            n_jobs=n_jobs,
         )
     return RunResult(run_id=run_id, git_sha=git_sha, config=config, by_slice=by_slice)
+_OpPointFitItem = tuple[
+    str,  # spec_name (for reassembly key)
+    str,  # fit_slice_name (passed through to fit_operating_points)
+    str,  # scorer_name
+    np.ndarray,  # fit_y_true
+    np.ndarray,  # fit_scores
+    Sequence[ThresholdSelector],  # spec.selectors (passed through to fit_operating_points)
+]
+_OpPointFitResult = tuple[str, str, object]  # (spec_name, scorer_name, fitted | error_dict)
+def _fit_one_op_point_pair(item: _OpPointFitItem) -> _OpPointFitResult:
+    """Picklable step function for ``(spec × scorer)`` operating-point fitting.
+    Module-scope so loky workers can serialize the reference. All inputs flow
+    through the ``item`` tuple. Returns ``(spec_name, scorer_name, fitted)``
+    where ``fitted`` is either the :func:`fit_operating_points` result or a
+    ``{"error": str}`` dict matching the sequential code path.
+    """
+    spec_name, fit_slice_name, scorer_name, y_true, fit_scores, selectors = item
+    try:
+        fitted = fit_operating_points(
+            y_true,
+            fit_scores,
+            selectors,
+            fitted_on_slice=fit_slice_name,
+            scorer_name=scorer_name,
+        )
+    except (ValueError, RuntimeError) as exc:
+        return spec_name, scorer_name, {"error": str(exc)}
+    return spec_name, scorer_name, fitted
 def _attach_transferred_operating_points(
     *,
     by_slice: dict[str, dict[str, object]],
@@ -894,34 +1036,73 @@ def _attach_transferred_operating_points(
     score_cache: Mapping[tuple[str, str], np.ndarray],
     scorer_names: Sequence[str],
     specs: Sequence[OperatingPointSpec],
+    n_jobs: int = 1,
 ) -> None:
-    """Mutate ``by_slice`` to attach opt-in cross-slice operating-point metrics."""
+    """Mutate ``by_slice`` to attach opt-in cross-slice operating-point metrics.
+    v0.36 added ``n_jobs``: parallelizes the ``(spec × scorer)`` fit phase
+    via :func:`eval_toolkit._parallel.parallel_map`. The apply phase
+    (writing into ``by_slice``) stays sequential — fitting dominates runtime.
+    Default ``n_jobs=1`` preserves bit-identical sequential behavior.
+    """
+    # Pre-flight: handle "fit slice not found" errors (these short-circuit the
+    # entire spec) + collect valid fit work-units. Tracks pre-conditions
+    # ("fit scorer skipped") as separate state so the parallel dispatch only
+    # carries actual work.
+    fit_work: list[_OpPointFitItem] = []
+    fit_skip_reasons: dict[tuple[str, str], dict[str, object]] = {}
+    specs_with_valid_fit: list[OperatingPointSpec] = []
+    names_per_spec: dict[str, list[str]] = {}
     for spec in specs:
         names = list(spec.scorer_names) if spec.scorer_names else list(scorer_names)
+        names_per_spec[spec.name] = names
         if spec.fit_slice not in slices_by_name:
             _record_spec_error(by_slice, spec, names, f"fit slice {spec.fit_slice!r} not found")
             continue
+        specs_with_valid_fit.append(spec)
         fit_slice = slices_by_name[spec.fit_slice]
-        fitted_by_scorer: dict[str, object] = {}
         for scorer_name in names:
             fit_scores = score_cache.get((spec.fit_slice, scorer_name))
             if fit_scores is None or len(fit_scores) != len(fit_slice.y_true):
-                fitted_by_scorer[scorer_name] = {
+                fit_skip_reasons[(spec.name, scorer_name)] = {
                     "error": "fit scorer skipped, errored, or produced no scores"
                 }
                 continue
-            try:
-                fitted_by_scorer[scorer_name] = fit_operating_points(
+            fit_work.append(
+                (
+                    spec.name,
+                    spec.fit_slice,
+                    scorer_name,
                     fit_slice.y_true,
                     fit_scores,
                     spec.selectors,
-                    fitted_on_slice=spec.fit_slice,
-                    scorer_name=scorer_name,
                 )
-            except (ValueError, RuntimeError) as exc:
-                fitted_by_scorer[scorer_name] = {"error": str(exc)}
+            )
+    # Parallel fit phase. parallel_map at n_jobs=1 is a pure-Python for-loop
+    # (Principle #4) — bit-identical to the pre-v0.36 sequential code.
+    fit_results: list[_OpPointFitResult] = (
+        parallel_map(
+            _fit_one_op_point_pair,
+            fit_work,
+            n_jobs=n_jobs,
+            description="harness _attach_transferred_operating_points (fit)",
+        )
+        if fit_work
+        else []
+    )
+    # Index by (spec_name, scorer_name) for O(1) lookup in the apply phase.
+    fitted_by_pair: dict[tuple[str, str], object] = {
+        (spec_name, scorer_name): fitted for spec_name, scorer_name, fitted in fit_results
+    }
+    fitted_by_pair.update(fit_skip_reasons)
+    # Sequential apply phase — preserves the original by_slice mutation order
+    # and the schema of error / skipped markers.
+    for spec in specs_with_valid_fit:
+        names = names_per_spec[spec.name]
         for target_name in spec.apply_slices:
             if target_name not in slices_by_name:
                 _record_spec_error(
@@ -939,7 +1120,7 @@ def _attach_transferred_operating_points(
                 spec_block: dict[str, object] = {}
                 transfer_block[spec.name] = spec_block
-                fitted = fitted_by_scorer.get(scorer_name)
+                fitted = fitted_by_pair.get((spec.name, scorer_name))
                 if not isinstance(fitted, dict) or "error" in fitted:
                     spec_block["error"] = (
                         str(fitted.get("error", "threshold fitting failed"))
@@ -1099,6 +1280,7 @@ def evaluate_folded(
     on_scorer_error: Literal["raise", "record"] = "raise",
     eval_split_names: Sequence[str] = ("test",),
     summary_metrics: Sequence[str] = ("pr_auc", "roc_auc"),
+    n_jobs: int = 1,
 ) -> RunResult:
     """Run a fold aggregator: ``Splitter × seeds → RunResult`` with CV-CI summary.
@@ -1128,6 +1310,15 @@ def evaluate_folded(
         RNG seeds for multi-seed × CV. Default ``(42,)`` (single seed).
     n_resamples, paired_diffs, leakage_checks, on_leakage, on_scorer_error :
         Forwarded to :func:`evaluate` per fold.
+    n_jobs : int, optional
+        Parallel workers (default 1 — sequential). Forwarded to
+        :func:`evaluate` per fold; parallelizes the inner
+        ``(slice × scorer)`` work-unit loop within each fold. Folds
+        themselves run sequentially to keep determinism + traceback
+        fidelity simple; for fold-level parallelism, consider an external
+        ``joblib.Parallel`` wrapper at the call site. See
+        :doc:`methodology/parallelism` § Scorer picklability for the
+        Scorer picklability contract when ``n_jobs != 1``.
     eval_split_names : sequence of str, optional
         Subset of each fold-dict's keys to actually evaluate. Default
         ``("test",)`` — train sets are skipped (eval-only K-fold). Pass
@@ -1183,6 +1374,7 @@ def evaluate_folded(
                 leakage_checks=leakage_checks,
                 on_leakage=on_leakage,
                 on_scorer_error=on_scorer_error,
+                n_jobs=n_jobs,
             )
             by_fold[fold_id] = fold_result
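
The refactor of `_score_all_slices` follows a three-step shape: pre-filter skipped pairs, dispatch a flat list of work units, then reassemble results in the original iteration order. The toy sketch below shows that shape with plain dictionaries. Everything in it is hypothetical: `toy_parallel_map` is a thread-based stand-in, not `eval_toolkit._parallel.parallel_map` (which the diff says dispatches through joblib loky), and the "scorers" are just numeric weights.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_parallel_map(fn, items, *, n_jobs=1):
    """Order-preserving map; a plain loop when n_jobs == 1 (stand-in only)."""
    if n_jobs == 1:
        return [fn(item) for item in items]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(fn, items))


def score_one_pair(item):
    """Module-scope step function: every input arrives through the tuple."""
    slice_name, scorer_name, weight, values = item
    return slice_name, scorer_name, sum(v * weight for v in values)


def score_all(slices, scorers, allow, *, n_jobs=1):
    # 1. Pre-filter: record skipped pairs, collect the rest as flat work units.
    work, skipped = [], {}
    for slice_name, values in slices.items():
        for scorer_name, weight in scorers.items():
            allowed = allow.get(scorer_name)  # None means "all slices allowed"
            if allowed is not None and slice_name not in allowed:
                skipped[(slice_name, scorer_name)] = "skipped"
                continue
            work.append((slice_name, scorer_name, weight, values))
    # 2. Dispatch the flat list of work units.
    results = toy_parallel_map(score_one_pair, work, n_jobs=n_jobs)
    # 3. Index by key, then rebuild in the original (slice, scorer) order.
    by_key = {(s, n): value for s, n, value in results}
    by_slice = {}
    for slice_name in slices:
        row = {}
        for scorer_name in scorers:
            key = (slice_name, scorer_name)
            row[scorer_name] = skipped[key] if key in skipped else by_key[key]
        by_slice[slice_name] = row
    return by_slice


slices = {"a": [1.0, 2.0], "b": [3.0]}
scorers = {"x": 1.0, "y": 2.0}
allow = {"y": {"a"}}  # scorer "y" only runs on slice "a"
sequential = score_all(slices, scorers, allow, n_jobs=1)
parallel = score_all(slices, scorers, allow, n_jobs=2)
```

Because reassembly walks `slices` and `scorers` rather than the result list, the output layout does not depend on how the work units were scheduled. With a process-based backend such as loky, the step function and its arguments would also need to be picklable, which threads do not require.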

{eval_toolkit-0.34.0 → eval_toolkit-0.36.0}/src/eval_toolkit/protocols.py RENAMED

@@ -31,6 +31,16 @@ class Scorer(Protocol):
     Accepts ``list[str]``, ``np.ndarray``, or ``pd.Series`` of features.
     Pandas is imported under ``TYPE_CHECKING`` only, so this Protocol
     has no runtime pandas dependency.
+    Notes
+    -----
+    When passed to a parallel-capable harness call (``n_jobs != 1``), Scorer
+    instances MUST be picklable — joblib's loky backend serializes the entire
+    delayed call (function plus bound arguments) before worker dispatch.
+    Closures, lambdas, local-scope classes, and attributes holding live
+    sockets / file handles break pickling. See
+    ``docs/source/methodology/parallelism.md#scorer-picklability`` for the
+    full contract and worked examples.
     """
     def predict_proba(  # pragma: no cover
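
The strict-pickle check that `_assert_scorers_picklable` performs can be tried in isolation with the standard library. The sketch below is a simplification under stated assumptions: it uses plain callables rather than objects implementing the `Scorer` protocol's `predict_proba`, and `strict_pickle_error` and `make_closure_scorer` are hypothetical names. The "fix" shown (binding state with `functools.partial` onto an importable function) is one possible approach; the project's own worked examples live in `methodology/parallelism.md` and are not reproduced here.

```python
import functools
import operator
import pickle


def strict_pickle_error(obj):
    """Return None if obj survives stdlib pickle, else the error message."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return str(exc)
    return None


def make_closure_scorer(threshold):
    # Broken pattern: a nested function that captures an enclosing local.
    def score(x):
        return x >= threshold

    return score


# One possible fix: bind the state onto an importable callable.
# operator.le(0.5, x) evaluates 0.5 <= x, i.e. x >= 0.5.
picklable_scorer = functools.partial(operator.le, 0.5)

closure_error = strict_pickle_error(make_closure_scorer(0.5))
lambda_error = strict_pickle_error(lambda x: x >= 0.5)
partial_error = strict_pickle_error(picklable_scorer)

# A picklable callable also round-trips with its bound state intact.
restored = pickle.loads(pickle.dumps(picklable_scorer))
```

Here `closure_error` and `lambda_error` hold error messages while `partial_error` is `None`, which mirrors the fail-fast behavior the harness applies before any worker is started.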
