PyPI - eval-toolkit - Versions diffs - 0.35.0__tar.gz → 0.38.0__tar.gz - Mend



Files changed (158)

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/.gitignore RENAMED

@@ -22,6 +22,7 @@ wheels/
 .coverage.*
 htmlcov/
 coverage.xml
+coverage.json
 .hypothesis/
 # Type-checker / linter caches

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,153 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.38.0] — 2026-05-18 — executable examples (myst-nb migration)
+Docs-only minor. Migrates the 14 walkthrough pages in
+`docs/source/examples/` from sybil-validated `` ```python `` blocks to
+myst-nb `{code-cell}` directives. Cells now execute during
+`sphinx-build` (`nb_execution_mode = "cache"`) rather than during
+`pytest` via sybil. Cell outputs (printed text, tables, figures)
+render inline in the published HTML, so the docs site reflects the
+actual library behavior rather than a snapshot from the last manual
+screenshot. Closes #31 (deferred from v0.34.1 and v0.35).
+No public API changes.
+### Changed
+- **14 example pages migrated** to myst-nb (`kernelspec` frontmatter +
+  `{code-cell}` directives in place of `` ```python ``). 73 code blocks
+  converted in total.
+- **Two pages skip execution at page level** (`mystnb.execution_mode:
+  'off'`) because they require optional deps kept out of `[dev]`:
+  - `pytorch_scorer_example.md` (needs `torch`)
+  - `callable_embedder_dedup.md` (needs `[embeddings]` /
+    `sentence-transformers`)
+  Both pages render their code statically.
+- **`docs/source/examples/index.md`** — "How these run" section
+  rewritten to reflect myst-nb instead of sybil; new "skip-execed
+  pages" callout.
+- **`conftest.py`** — dropped `docs/source/examples/*.md` from sybil
+  patterns. Sybil still covers `README`, `methodology/`, `migration/`,
+  `getting-started`, etc. (parts without executable-notebook value).
+### Why this matters
+myst-nb infrastructure has been wired since v0.31.0 (the Sphinx docs
+migration) but was underutilized — all example pages used static
+`` ```python `` blocks. This release closes that gap. API drift in
+the future will fail the docs build via runtime-output verification
+(in addition to sybil's existing Python-level error catch on the
+other doc trees).
+## [0.37.0] — 2026-05-18 — TokenizationLeakageCheck + per-module coverage floors
+Two-issue bundle (#35 + #37) plus housekeeping closure of stale items
+(PR #27, #38) that turned out to have been resolved in v0.33.x without
+being checked off. Roadmap refresh in `3d40796` (this minor's
+predecessor commit) replaced the version-keyed candidate list with
+issue-keyed tracking, so this class of stale-roadmap bug shouldn't
+recur.
+### Added
+- **`eval_toolkit.leakage.TokenizationLeakageCheck`** — new within-split
+  `LeakageCheck` that dedups on tokenizer output rather than raw text.
+  Catches encoding-obfuscated dupes that survive
+  `NormalizedFormLeakageCheck` but collapse to identical `input_ids`
+  under a transformer's BPE / SentencePiece / WordPiece tokenizer.
+  Accepts any `Callable[[str], Mapping[str, object]]` returning HF-style
+  output with an `"input_ids"` key — does **not** import `transformers`
+  itself; consumers pass an already-instantiated tokenizer. Default
+  severity `"error"` (mirrors `NormalizedFormLeakageCheck`). Closes #35.
+- New optional install extra **`[transformers]`** (`transformers>=4.0`).
+  Intentionally **not** in `[all]` / `[dev]` — mirrors the `[embeddings]`
+  precedent from v0.33.1 to keep contributor setup small (transformers
+  transitively pulls torch ~700MB).
+### Test
+- **Per-module coverage floors restored.** `scripts/check_module_floors.py`
+  enforces an 85 % per-file floor (coverage.py natively only ships
+  global `--fail-under`). Hooked into `make coverage` via a post-pytest
+  invocation. Closes #37.
+- **`# pragma: no cover` on optional-dep-active paths** in `seeds.py`
+  (torch) and `embeddings.py` (sentence-transformers). Reflects the
+  reality that these branches execute in user code, not CI. Both
+  modules now report 100 % coverage; previously sat at ~70 % which
+  obscured per-module floor enforcement.
+### Fixed
+- **`make coverage` Makefile parity with PR CI.** PR #27 (external
+  contributor @leno23, draft) proposed adding `-m "not monte_carlo and
+  not benchmark"` to the `coverage` target. Audit found the same fix
+  had landed in v0.33.0 commit `9e375a8` ahead of the PR being filed;
+  closed PR #27 as superseded with thanks. No change in this release.
+### Closed (already-resolved)
+- **#38 — CI doctests for `paths.py` / `provenance.py` / `seeds.py` /
+  `docs.py`.** All four modules were added to `.doctest-modules` in
+  `a26fd44` (2026-05-14, v0.32.x era); 7 doctests collected across the
+  named modules in current CI. Closed as already-resolved.
+### Test coverage
+Test count 1376 → 1387 (+11). Aggregate 95.65 % → 95.69 %. All 28
+modules ≥ 90 % individually post-pragma.
+## [0.36.0] — 2026-05-18 — harness parallelization (#29, #30) + Node 24 actions
+Wires the v0.34.0 unified parallelism pattern into the harness evaluation
+loop. `evaluate()` and `evaluate_folded()` now accept an `n_jobs` kwarg
+(default `1` preserves bit-identical sequential behavior); under
+`n_jobs != 1`, the `(slice × scorer)` work-unit loop in
+`_score_all_slices` and the `(spec × scorer)` fit phase in
+`_attach_transferred_operating_points` dispatch through joblib loky via
+the existing `_parallel.parallel_map` helper.
+### Added
+- `evaluate(..., n_jobs: int = 1)` and `evaluate_folded(..., n_jobs: int = 1)`
+  — keyword-only kwarg per Principle #3 of `methodology/parallelism.md`.
+  `n_jobs=1` (default) runs the existing pure-Python sequential loop
+  (Principle #4 — bit-identical to v0.35). `n_jobs > 1` uses joblib loky;
+  `n_jobs=-1` uses all cores; `n_jobs=0` is rejected. Closes #29, #30.
+- Strict-pickle Scorer sniff at `evaluate()` entry when `n_jobs != 1`:
+  raises a clean `TypeError` referencing
+  `methodology/parallelism.md#scorer-picklability` with the underlying
+  pickle error attached. Reuses the v0.35 ADR contract; no new exception
+  class. Catches non-picklable scorers up front rather than relying on
+  joblib's more permissive cloudpickle path (which would silently absorb
+  closures and obscure the contract documented in v0.35).
+### Internal
+- New module-scope step functions `_score_one_pair` and
+  `_fit_one_op_point_pair` in `harness.py` (picklable; required by loky).
+- `_score_all_slices` and `_attach_transferred_operating_points`
+  refactored to use flat work-unit dispatch via `parallel_map`.
+### Tests
+- New `tests/test_harness_parallelism.py` (7 tests): bit-identical
+  reproducibility across `n_jobs=1` vs `n_jobs=2` for `evaluate`
+  (basic, paired-diffs, operating-points), `evaluate_folded`,
+  picklability rejection (closure scorer), `n_jobs=0` rejection,
+  `n_jobs=-1` smoke. All 66 harness tests pass (7 new + 59 existing).
+### Infrastructure
+- Bumped `actions/upload-artifact` and `actions/download-artifact` from
+  `@v5` → `@v6` across `publish.yml` / `nightly-mc.yml` /
+  `nightly-benchmarks.yml`. The v6 majors run on Node.js 24
+  (GitHub deprecates Node 20 actions from 2026-06-02). Other pinned
+  actions (`checkout@v6`, `setup-uv@v8.1.0`, `codeql-action@v3`,
+  `deploy-pages@v4`, `upload-pages-artifact@v3`) were not flagged in
+  the v0.35 publish annotation and are deferred to a separate audit.
 ## [0.35.0] — 2026-05-18 — `fit_temperature_binary` + Scorer picklability ADR
 Small, additive release. Adds a binary-classification calibration helper
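
The 0.37.0 entry above describes a per-file coverage floor enforced by `scripts/check_module_floors.py`, but the script itself is not shown in this diff. The following is only a minimal sketch of how such a check could work, assuming it reads the `files -> <path> -> summary -> percent_covered` layout that coverage.py writes to `coverage.json`. The function name `modules_below_floor`, the file paths, and the sample numbers are all hypothetical.

```python
# Hypothetical sketch of a per-file coverage floor check; not the
# project's actual scripts/check_module_floors.py.


def modules_below_floor(report: dict, floor: float = 85.0) -> dict[str, float]:
    """Return {filename: percent_covered} for every file under ``floor``."""
    below: dict[str, float] = {}
    for filename, entry in report.get("files", {}).items():
        percent = float(entry["summary"]["percent_covered"])
        if percent < floor:
            below[filename] = percent
    return below


# In-memory stand-in for a parsed coverage.json report (made-up numbers).
sample_report = {
    "files": {
        "src/pkg/a.py": {"summary": {"percent_covered": 97.5}},
        "src/pkg/b.py": {"summary": {"percent_covered": 70.0}},
    }
}

failing = modules_below_floor(sample_report)
for name, percent in sorted(failing.items()):
    print(f"{name}: {percent:.2f}% is below the floor")
```

A real script would presumably load the report with `json.load` and exit non-zero when the returned mapping is non-empty, so that a `make` target fails.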

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.35.0
+Version: 0.38.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -69,6 +69,8 @@ Requires-Dist: matplotlib>=3.8; extra == 'plotting'
 Requires-Dist: pillow>=10.0; extra == 'plotting'
 Provides-Extra: property
 Requires-Dist: hypothesis>=6.100; extra == 'property'
+Provides-Extra: transformers
+Requires-Dist: transformers>=4.0; extra == 'transformers'
 Provides-Extra: validation
 Provides-Extra: yaml
 Requires-Dist: pyyaml>=6.0; extra == 'yaml'

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/pyproject.toml RENAMED

@@ -56,6 +56,13 @@ parquet = ["pyarrow>=15.0"]
 # setup small. The canonical semantic-dedup recipe (all-MiniLM-L6-v2 +
 # cosine@0.80) is what this factory pre-wires for callers.
 embeddings = ["sentence-transformers>=3.0"]
+# v0.37.0: TokenizationLeakageCheck — HF-tokenizer-aware dedup.
+# transformers transitively pulls torch + tokenizers (~700MB) so we
+# follow the [embeddings] precedent: opt-in only, NOT in [all] / [dev].
+# Consumers pass an already-instantiated tokenizer callable; the check
+# itself does not import transformers, so the optional install is
+# strictly for callers wanting AutoTokenizer.from_pretrained(...).
+transformers = ["transformers>=4.0"]
 # DEPRECATED (announced v0.30.1, removal v0.33.0).
 #
 # Retained as a transitive no-op so `pip install eval-toolkit[validation]`

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/src/eval_toolkit/__init__.py RENAMED

@@ -147,6 +147,7 @@ _EXPORTS: dict[str, str] = {
     "NearDuplicateCheck": "eval_toolkit.leakage",
     "NormalizedFormLeakageCheck": "eval_toolkit.leakage",
     "TemporalLeakageCheck": "eval_toolkit.leakage",
+    "TokenizationLeakageCheck": "eval_toolkit.leakage",
     "run_leakage_checks": "eval_toolkit.leakage",
     # --- loaders ---
     "DataFrameLoader": "eval_toolkit.loaders",
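
The export above only registers `TokenizationLeakageCheck`; its implementation is not part of the visible hunks. As a rough illustration of the idea stated in the changelog (dedup on tokenizer output rather than raw text), here is a sketch that groups texts by their `input_ids`. It assumes only the callable contract quoted in the changelog, `Callable[[str], Mapping[str, object]]` with an `"input_ids"` key. `toy_tokenizer` and `find_token_duplicates` are hypothetical names, and the toy tokenizer is a stand-in, not a real BPE, SentencePiece, or WordPiece tokenizer.

```python
import zlib
from collections import defaultdict
from collections.abc import Callable, Mapping, Sequence
from typing import cast


def toy_tokenizer(text: str) -> Mapping[str, object]:
    """Hypothetical tokenizer stand-in: lowercase, split on whitespace,
    and map each token to a stable integer id via CRC32."""
    tokens = text.lower().split()
    return {"input_ids": [zlib.crc32(tok.encode("utf-8")) for tok in tokens]}


def find_token_duplicates(
    texts: Sequence[str],
    tokenizer: Callable[[str], Mapping[str, object]],
) -> list[list[int]]:
    """Return groups of row indices whose texts tokenize to identical ids."""
    groups: dict[tuple[int, ...], list[int]] = defaultdict(list)
    for index, text in enumerate(texts):
        ids = cast(Sequence[int], tokenizer(text)["input_ids"])
        groups[tuple(ids)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


texts = ["Hello world", "hello   WORLD", "something else"]
duplicate_groups = find_token_duplicates(texts, toy_tokenizer)
print(duplicate_groups)
```

The first two strings differ as raw text but collapse to the same id sequence under the toy tokenizer, so they land in one group. With a real tokenizer, a caller would pass the already-instantiated tokenizer object in the same position.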

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.35.0"
+__version__ = "0.38.0"

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/src/eval_toolkit/embeddings.py RENAMED

@@ -88,15 +88,18 @@ def make_minilm_embedder(
             "Install via: pip install eval-toolkit[embeddings]"
         ) from e
-    _logger.debug(
+    # sentence-transformers-active path: excluded from CI coverage
+    # because [embeddings] is intentionally kept out of [dev]/[all]
+    # (transitive torch cost ~700MB per the v0.33.1 design note).
+    _logger.debug(  # pragma: no cover
         "loading SentenceTransformer model_id=%s device=%s batch_size=%d",
         model_id,
         device,
         batch_size,
     )
-    model = SentenceTransformer(model_id, device=device)
+    model = SentenceTransformer(model_id, device=device)  # pragma: no cover
-    def embedder(texts: Sequence[str]) -> np.ndarray:
+    def embedder(texts: Sequence[str]) -> np.ndarray:  # pragma: no cover
         result = model.encode(
             list(texts),
             convert_to_numpy=True,
@@ -105,4 +108,4 @@ def make_minilm_embedder(
         )
         return np.asarray(result, dtype=np.float64)
-    return embedder
+    return embedder  # pragma: no cover

{eval_toolkit-0.35.0 → eval_toolkit-0.38.0}/src/eval_toolkit/harness.py RENAMED

@@ -32,6 +32,7 @@ v0.7.0 additions:
 from __future__ import annotations
 import logging
+import pickle
 import time
 import traceback
 from collections.abc import Mapping, Sequence
@@ -41,6 +42,7 @@ from typing import TYPE_CHECKING, Final, Literal, cast
 import numpy as np
+from eval_toolkit._parallel import parallel_map
 from eval_toolkit.artifacts import (
     error_metric,
     sanitize_for_json,
@@ -62,7 +64,7 @@ from eval_toolkit.operating_points import (
     fit_operating_points,
 )
 from eval_toolkit.protocols import Scorer, SliceAwareScorer
-from eval_toolkit.thresholds import TargetFPRSelector
+from eval_toolkit.thresholds import TargetFPRSelector, ThresholdSelector
 if TYPE_CHECKING:
     import pandas as pd
@@ -278,6 +280,31 @@ def _object_to_dict(obj: object, *, what: str) -> dict[str, object]:
     raise TypeError(f"expected {what} mapping or object with to_dict(), got {type(obj).__name__}")
+def _assert_scorers_picklable(scorers: Mapping[str, Scorer]) -> None:
+    """Strict-pickle sniff for Scorer args when ``n_jobs != 1``.
+    joblib's loky backend uses cloudpickle (which absorbs closures + local
+    classes), but the v0.35 Scorer picklability ADR
+    (``methodology/parallelism.md#scorer-picklability``) is a *strict* pickle
+    contract — cloudpickle behavior is platform-dependent and the more
+    permissive failure modes are harder to debug. Fail fast at
+    :func:`evaluate` entry with the same ``TypeError`` style as
+    :func:`eval_toolkit._parallel.parallel_map`'s fn-sniff (no new exception
+    class — single channel for the picklability contract).
+    """
+    for sname, scorer in scorers.items():
+        try:
+            pickle.dumps(scorer)
+        except (pickle.PicklingError, AttributeError, TypeError) as exc:
+            raise TypeError(
+                f"evaluate(n_jobs != 1): scorer {sname!r} "
+                f"({type(scorer).__name__}) is not picklable. See "
+                f"methodology/parallelism.md#scorer-picklability for the "
+                f"contract and worked picklable / broken / fix examples. "
+                f"Underlying error: {exc}"
+            ) from exc
 def _should_score_slice(scorer: Scorer, slice_name: str) -> bool:
     """Honor optional slice-aware scorer hooks without widening the base Protocol."""
     should_score = getattr(scorer, "should_score_slice", None)
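
The `_assert_scorers_picklable` hunk above relies on the difference between objects that plain `pickle` can serialize by reference and closures that it cannot. The standalone sketch below shows that difference using the same exception tuple as the hunk. `ConstantScorer`, `make_closure_scorer`, and `is_strictly_picklable` are hypothetical names for illustration, and the `score` method is an assumption rather than the library's `Scorer` protocol.

```python
import pickle


class ConstantScorer:
    """Module-scope class: pickle stores a reference to the class by name."""

    def __init__(self, value: float) -> None:
        self.value = value

    def score(self, texts: list[str]) -> list[float]:
        return [self.value for _ in texts]


def make_closure_scorer(value: float):
    """Return a nested function; plain pickle cannot look it up by name."""

    def score(texts: list[str]) -> list[float]:
        return [value for _ in texts]

    return score


def is_strictly_picklable(obj: object) -> bool:
    """Mirror the try/except shape of the sniff: True if pickle.dumps succeeds."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# When this file is run as a script, the module-scope instance is expected
# to pass and the closure is expected to fail the strict sniff.
print("class instance:", is_strictly_picklable(ConstantScorer(0.5)))
print("closure:", is_strictly_picklable(make_closure_scorer(0.5)))
```

Moving the captured state into instance attributes of a module-scope class, as `ConstantScorer` does, is the usual way to turn a closure-style scorer into one that survives strict pickling.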
@@ -696,6 +723,36 @@ def _run_leakage_phase(
         )
+# Tuple shape for the flat `(slice × scorer)` work-unit dispatched to
+# parallel_map by `_score_all_slices`. Defined at module scope so workers
+# can pickle the function reference.
+_ScoreOnePairItem = tuple[EvalSlice, str, Scorer, int, int, Literal["raise", "record"]]
+_ScoreOnePairResult = tuple[str, str, dict[str, object], np.ndarray]
+def _score_one_pair(item: _ScoreOnePairItem) -> _ScoreOnePairResult:
+    """Picklable step function for ``(slice × scorer)`` parallel dispatch.
+    Module-scope so loky workers can serialize the reference (closures over
+    enclosing locals would fail :func:`parallel_map`'s pickle sniff). All
+    inputs flow through the ``item`` tuple — no captured state.
+    Returns ``(slice_name, scorer_name, result_dict, scores_array)`` so the
+    caller can reassemble ``by_slice`` + ``score_cache`` in the original
+    iteration order.
+    """
+    slice_, sname, scorer, n_resamples, seed, on_scorer_error = item
+    result = evaluate_scorer_on_slice(
+        scorer,
+        slice_,
+        n_resamples=n_resamples,
+        seed=seed,
+        on_scorer_error=on_scorer_error,
+    )
+    scores = np.asarray(result["scores"], dtype=np.float64)
+    return slice_.name, sname, result, scores
 def _score_all_slices(
     scorers: dict[str, Scorer],
     slices: Sequence[EvalSlice],
@@ -704,6 +761,7 @@ def _score_all_slices(
     seed: int,
     paired_diffs: list[tuple[str, str]] | None,
     on_scorer_error: Literal["raise", "record"],
+    n_jobs: int = 1,
 ) -> tuple[dict[str, dict[str, object]], dict[tuple[str, str], np.ndarray]]:
     """Score every ``(slice, scorer)`` pair; return ``(by_slice, score_cache)``.
@@ -714,10 +772,17 @@ def _score_all_slices(
     ``score_cache`` is keyed ``(slice.name, scorer.name)`` and carries the
     raw score arrays so :func:`_attach_transferred_operating_points` can
     re-use them without re-calling scorers.
-    """
-    by_slice: dict[str, dict[str, object]] = {}
-    score_cache: dict[tuple[str, str], np.ndarray] = {}
+    v0.36 added ``n_jobs``: a flat ``(slice × scorer)`` parallel dispatch
+    via :func:`eval_toolkit._parallel.parallel_map`. Default ``1`` preserves
+    bit-identical sequential behavior. ``n_jobs != 1`` requires picklable
+    scorers per the v0.35 ADR
+    (``docs/source/methodology/parallelism.md#scorer-picklability``).
+    """
+    # Pre-filter skipped pairs (allow-list miss) before dispatching parallel
+    # work-units. Logs the same skip messages as the pre-parallel version.
+    work_units: list[_ScoreOnePairItem] = []
+    skipped: dict[tuple[str, str], dict[str, object]] = {}
     for slice_ in slices:
         _logger.info(
             "[slice %s] n=%d, positives=%d",
@@ -725,32 +790,61 @@ def _score_all_slices(
             len(slice_.df),
             int(slice_.y_true.sum()),
         )
-        slice_data: dict[str, dict[str, object]] = {}
-        scores_by_scorer: dict[str, np.ndarray] = {}
         for sname, scorer in scorers.items():
             if not _should_score_slice(scorer, slice_.name):
                 reason = f"slice {slice_.name!r} not in scorer allow-list"
-                slice_data[sname] = _skipped_scorer_result(slice_, reason)
+                skipped[(slice_.name, sname)] = _skipped_scorer_result(slice_, reason)
                 _logger.info("    skipped %s: %s", sname, reason)
                 continue
-            t0 = time.time()
-            slice_data[sname] = evaluate_scorer_on_slice(
-                scorer,
-                slice_,
-                n_resamples=n_resamples,
-                seed=seed,
-                on_scorer_error=on_scorer_error,
-            )
-            # If the scorer raised under on_scorer_error="record", scores is [].
-            # Subsequent paired-diff machinery sees the empty array and will
-            # short-circuit on the same len-check it already does for skipped
-            # scorers; no special-case needed.
-            scores_by_scorer[sname] = np.asarray(slice_data[sname]["scores"], dtype=np.float64)
-            score_cache[(slice_.name, sname)] = scores_by_scorer[sname]
-            elapsed = time.time() - t0
-            pr = slice_data[sname].get("pr_auc")
+            work_units.append((slice_, sname, scorer, n_resamples, seed, on_scorer_error))
+    # Parallel scoring. parallel_map at n_jobs=1 is a pure-Python for-loop
+    # (Principle #4) — bit-identical to the pre-v0.36 sequential code.
+    if work_units:
+        t0_total = time.time()
+        results = parallel_map(
+            _score_one_pair,
+            work_units,
+            n_jobs=n_jobs,
+            description="harness _score_all_slices",
+        )
+        elapsed_total = time.time() - t0_total
+        _logger.info(
+            "  scored %d (slice, scorer) pairs in %.1fs (n_jobs=%d)",
+            len(work_units),
+            elapsed_total,
+            n_jobs,
+        )
+    else:
+        results = []
+    # Index results for O(1) lookup during reassembly.
+    results_by_key: dict[tuple[str, str], _ScoreOnePairResult] = {
+        (slice_name, sname): (slice_name, sname, result_dict, scores_arr)
+        for slice_name, sname, result_dict, scores_arr in results
+    }
+    # Reassemble in the original (slices × scorers.items()) iteration order.
+    by_slice: dict[str, dict[str, object]] = {}
+    score_cache: dict[tuple[str, str], np.ndarray] = {}
+    for slice_ in slices:
+        slice_data: dict[str, dict[str, object]] = {}
+        scores_by_scorer: dict[str, np.ndarray] = {}
+        for sname in scorers:
+            key = (slice_.name, sname)
+            if key in skipped:
+                slice_data[sname] = skipped[key]
+                continue
+            _, _, result_dict, scores_arr = results_by_key[key]
+            slice_data[sname] = result_dict
+            # If the scorer raised under on_scorer_error="record", scores_arr is [].
+            # Paired-diff machinery short-circuits on the same len-check it uses
+            # for skipped scorers; no special-case needed.
+            scores_by_scorer[sname] = scores_arr
+            score_cache[key] = scores_arr
+            pr = result_dict.get("pr_auc")
             pr_display = f"{pr:.4f}" if isinstance(pr, float) else "N/A"
-            _logger.info("    %s: PR-AUC=%s (%.1fs)", sname, pr_display, elapsed)
+            _logger.info("    %s: PR-AUC=%s", sname, pr_display)
         diffs = (
             _compute_paired_diffs(
@@ -789,6 +883,7 @@ def evaluate(
     on_leakage: Literal["raise", "record", "skip"] = "raise",
     on_scorer_error: Literal["raise", "record"] = "raise",
     operating_point_specs: Sequence[OperatingPointSpec] = (),
+    n_jobs: int = 1,
 ) -> RunResult:
     """Run every scorer on every slice; return a pure :class:`RunResult` (no IO).
@@ -830,6 +925,15 @@ def evaluate(
         Fit thresholds on one mixed-class slice and apply them to named target
         slices. Results are attached under each scorer's
         ``"transferred_operating_points"`` block. Default empty (skip).
+    n_jobs : int, optional
+        Parallel workers (default 1 — sequential). ``n_jobs > 1`` uses
+        joblib loky to parallelize the flat ``(slice × scorer)`` work-unit
+        loop in :func:`_score_all_slices` (and the operating-point fit
+        phase when ``operating_point_specs`` is non-empty). ``n_jobs=-1``
+        uses all cores; ``n_jobs=0`` is rejected. Scorers must be picklable
+        when ``n_jobs != 1`` — see
+        :doc:`methodology/parallelism` § Scorer picklability for the
+        contract + worked examples.
     Returns
     -------
@@ -850,6 +954,9 @@ def evaluate(
     if not slices:
         raise ValueError("at least one slice required")
+    if n_jobs != 1:
+        _assert_scorers_picklable(scorers)
     config: dict[str, object] = {
         "n_resamples": n_resamples,
         "seed": seed,
@@ -872,6 +979,7 @@ def evaluate(
         seed=seed,
         paired_diffs=paired_diffs,
         on_scorer_error=on_scorer_error,
+        n_jobs=n_jobs,
     )
     if operating_point_specs:
@@ -882,11 +990,45 @@ def evaluate(
             score_cache=score_cache,
             scorer_names=list(scorers.keys()),
             specs=operating_point_specs,
+            n_jobs=n_jobs,
         )
     return RunResult(run_id=run_id, git_sha=git_sha, config=config, by_slice=by_slice)
+_OpPointFitItem = tuple[
+    str,  # spec_name (for reassembly key)
+    str,  # fit_slice_name (passed through to fit_operating_points)
+    str,  # scorer_name
+    np.ndarray,  # fit_y_true
+    np.ndarray,  # fit_scores
+    Sequence[ThresholdSelector],  # spec.selectors (passed through to fit_operating_points)
+]
+_OpPointFitResult = tuple[str, str, object]  # (spec_name, scorer_name, fitted | error_dict)
+def _fit_one_op_point_pair(item: _OpPointFitItem) -> _OpPointFitResult:
+    """Picklable step function for ``(spec × scorer)`` operating-point fitting.
+    Module-scope so loky workers can serialize the reference. All inputs flow
+    through the ``item`` tuple. Returns ``(spec_name, scorer_name, fitted)``
+    where ``fitted`` is either the :func:`fit_operating_points` result or a
+    ``{"error": str}`` dict matching the sequential code path.
+    """
+    spec_name, fit_slice_name, scorer_name, y_true, fit_scores, selectors = item
+    try:
+        fitted = fit_operating_points(
+            y_true,
+            fit_scores,
+            selectors,
+            fitted_on_slice=fit_slice_name,
+            scorer_name=scorer_name,
+        )
+    except (ValueError, RuntimeError) as exc:
+        return spec_name, scorer_name, {"error": str(exc)}
+    return spec_name, scorer_name, fitted
 def _attach_transferred_operating_points(
     *,
     by_slice: dict[str, dict[str, object]],
@@ -894,34 +1036,73 @@ def _attach_transferred_operating_points(
     score_cache: Mapping[tuple[str, str], np.ndarray],
     scorer_names: Sequence[str],
     specs: Sequence[OperatingPointSpec],
+    n_jobs: int = 1,
 ) -> None:
-    """Mutate ``by_slice`` to attach opt-in cross-slice operating-point metrics."""
+    """Mutate ``by_slice`` to attach opt-in cross-slice operating-point metrics.
+    v0.36 added ``n_jobs``: parallelizes the ``(spec × scorer)`` fit phase
+    via :func:`eval_toolkit._parallel.parallel_map`. The apply phase
+    (writing into ``by_slice``) stays sequential — fitting dominates runtime.
+    Default ``n_jobs=1`` preserves bit-identical sequential behavior.
+    """
+    # Pre-flight: handle "fit slice not found" errors (these short-circuit the
+    # entire spec) + collect valid fit work-units. Tracks pre-conditions
+    # ("fit scorer skipped") as separate state so the parallel dispatch only
+    # carries actual work.
+    fit_work: list[_OpPointFitItem] = []
+    fit_skip_reasons: dict[tuple[str, str], dict[str, object]] = {}
+    specs_with_valid_fit: list[OperatingPointSpec] = []
+    names_per_spec: dict[str, list[str]] = {}
     for spec in specs:
         names = list(spec.scorer_names) if spec.scorer_names else list(scorer_names)
+        names_per_spec[spec.name] = names
         if spec.fit_slice not in slices_by_name:
             _record_spec_error(by_slice, spec, names, f"fit slice {spec.fit_slice!r} not found")
             continue
+        specs_with_valid_fit.append(spec)
         fit_slice = slices_by_name[spec.fit_slice]
-        fitted_by_scorer: dict[str, object] = {}
         for scorer_name in names:
             fit_scores = score_cache.get((spec.fit_slice, scorer_name))
             if fit_scores is None or len(fit_scores) != len(fit_slice.y_true):
-                fitted_by_scorer[scorer_name] = {
+                fit_skip_reasons[(spec.name, scorer_name)] = {
                     "error": "fit scorer skipped, errored, or produced no scores"
                 }
                 continue
-            try:
-                fitted_by_scorer[scorer_name] = fit_operating_points(
+            fit_work.append(
+                (
+                    spec.name,
+                    spec.fit_slice,
+                    scorer_name,
                     fit_slice.y_true,
                     fit_scores,
                     spec.selectors,
-                    fitted_on_slice=spec.fit_slice,
-                    scorer_name=scorer_name,
                 )
-            except (ValueError, RuntimeError) as exc:
-                fitted_by_scorer[scorer_name] = {"error": str(exc)}
+            )
+    # Parallel fit phase. parallel_map at n_jobs=1 is a pure-Python for-loop
+    # (Principle #4) — bit-identical to the pre-v0.36 sequential code.
+    fit_results: list[_OpPointFitResult] = (
+        parallel_map(
+            _fit_one_op_point_pair,
+            fit_work,
+            n_jobs=n_jobs,
+            description="harness _attach_transferred_operating_points (fit)",
+        )
+        if fit_work
+        else []
+    )
+    # Index by (spec_name, scorer_name) for O(1) lookup in the apply phase.
+    fitted_by_pair: dict[tuple[str, str], object] = {
+        (spec_name, scorer_name): fitted for spec_name, scorer_name, fitted in fit_results
+    }
+    fitted_by_pair.update(fit_skip_reasons)
+    # Sequential apply phase — preserves the original by_slice mutation order
+    # and the schema of error / skipped markers.
+    for spec in specs_with_valid_fit:
+        names = names_per_spec[spec.name]
         for target_name in spec.apply_slices:
             if target_name not in slices_by_name:
                 _record_spec_error(
@@ -939,7 +1120,7 @@ def _attach_transferred_operating_points(
                 spec_block: dict[str, object] = {}
                 transfer_block[spec.name] = spec_block
-                fitted = fitted_by_scorer.get(scorer_name)
+                fitted = fitted_by_pair.get((spec.name, scorer_name))
                 if not isinstance(fitted, dict) or "error" in fitted:
                     spec_block["error"] = (
                         str(fitted.get("error", "threshold fitting failed"))
@@ -1099,6 +1280,7 @@ def evaluate_folded(
     on_scorer_error: Literal["raise", "record"] = "raise",
     eval_split_names: Sequence[str] = ("test",),
     summary_metrics: Sequence[str] = ("pr_auc", "roc_auc"),
+    n_jobs: int = 1,
 ) -> RunResult:
     """Run a fold aggregator: ``Splitter × seeds → RunResult`` with CV-CI summary.
@@ -1128,6 +1310,15 @@ def evaluate_folded(
         RNG seeds for multi-seed × CV. Default ``(42,)`` (single seed).
     n_resamples, paired_diffs, leakage_checks, on_leakage, on_scorer_error :
         Forwarded to :func:`evaluate` per fold.
+    n_jobs : int, optional
+        Parallel workers (default 1 — sequential). Forwarded to
+        :func:`evaluate` per fold; parallelizes the inner
+        ``(slice × scorer)`` work-unit loop within each fold. Folds
+        themselves run sequentially to keep determinism + traceback
+        fidelity simple; for fold-level parallelism, consider an external
+        ``joblib.Parallel`` wrapper at the call site. See
+        :doc:`methodology/parallelism` § Scorer picklability for the
+        Scorer picklability contract when ``n_jobs != 1``.
     eval_split_names : sequence of str, optional
         Subset of each fold-dict's keys to actually evaluate. Default
         ``("test",)`` — train sets are skipped (eval-only K-fold). Pass
@@ -1183,6 +1374,7 @@ def evaluate_folded(
                 leakage_checks=leakage_checks,
                 on_leakage=on_leakage,
                 on_scorer_error=on_scorer_error,
+                n_jobs=n_jobs,
             )
             by_fold[fold_id] = fold_result
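
The `harness.py` hunks above share one pattern: pre-filter skipped pairs, flatten the remaining `(slice, scorer)` pairs into a work-unit list, map a module-scope step function over it, index the results by key, and reassemble in the original iteration order. The sketch below reproduces that shape with toy data. `sequential_map` is a hypothetical stand-in for `parallel_map` at `n_jobs=1`, and every other name here is illustrative rather than taken from the library.

```python
WorkUnit = tuple[str, str, int]    # (slice_name, scorer_name, payload)
WorkResult = tuple[str, str, int]  # (slice_name, scorer_name, value)


def score_one_pair(item: WorkUnit) -> WorkResult:
    """Module-scope step function: all inputs arrive through ``item``."""
    slice_name, scorer_name, payload = item
    return slice_name, scorer_name, payload * 2


def sequential_map(fn, items):
    """Hypothetical stand-in for a parallel map running sequentially."""
    return [fn(item) for item in items]


def run_flat(
    slices: dict[str, int],
    scorers: list[str],
    skip: set[tuple[str, str]],
) -> dict[str, dict[str, object]]:
    # Phase 1: pre-filter skipped pairs and collect the real work units.
    work_units: list[WorkUnit] = []
    skipped: dict[tuple[str, str], str] = {}
    for slice_name, payload in slices.items():
        for scorer_name in scorers:
            key = (slice_name, scorer_name)
            if key in skip:
                skipped[key] = "skipped"
                continue
            work_units.append((slice_name, scorer_name, payload))

    # Phase 2: dispatch the flat list and index results for O(1) lookup.
    results = sequential_map(score_one_pair, work_units)
    results_by_key = {(s, n): value for s, n, value in results}

    # Phase 3: reassemble in the original (slices x scorers) order.
    by_slice: dict[str, dict[str, object]] = {}
    for slice_name in slices:
        row: dict[str, object] = {}
        for scorer_name in scorers:
            key = (slice_name, scorer_name)
            row[scorer_name] = skipped[key] if key in skipped else results_by_key[key]
        by_slice[slice_name] = row
    return by_slice


output = run_flat({"a": 1, "b": 2}, ["s1", "s2"], {("b", "s1")})
print(output)
```

Because reassembly walks the inputs rather than the result list, the output ordering does not depend on the order in which workers finish, which is one way to keep sequential and parallel runs producing the same structure.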
