PyPI - eval-toolkit - Versions diffs - 0.46.0__tar.gz → 0.46.1__tar.gz - Mend

eval-toolkit 0.46.0tar.gz → 0.46.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (178)

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/.gitignore RENAMED

@@ -45,6 +45,16 @@ coverage.json
 # Mutation-testing output (mutmut / cargo-mutants — local run artifacts)
 mutants/
+# Local audit artifacts (Round 5+ Gate 3 LLM cross-review packets + reports).
+# The canonical prompt lives at ~/.claude/plans/gate3-audit-prompt.md and the
+# canonical findings ledger lives at docs/source/audit_findings.md; per-run
+# raw model outputs are author-local working copies.
+# Tracked: per-round briefing files (`gate3-audit-round-<N>.md`).
+# Untracked: prompt template, generic report, per-round report files.
+gate3-audit-prompt.md
+gate3-audit-report.md
+gate3-audit-round-*-report.md
 # Claude Code project settings (machine-local)
 .claude/

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/CHANGELOG.md RENAMED

@@ -5,6 +5,71 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.46.1] — 2026-05-21 — Round 6 hotfix: ECE strategy validation + deprecation warning content
+Hotfix release per **Decision Q** (data correctness regression + time-sensitive
+warning content) + **Decision R6-E** (scope: R6-F1 + R6-F2 only; R6-A docstring
+rolls forward to v0.47). All other Round 6 findings dispositioned to v0.47.0.
+See [`docs/source/audit_findings.md`](docs/source/audit_findings.md) Round 6 for
+the full disposition ledger.
+### Fixed
+- **`metric_specs.ece(strategy=<value>)` strategy validation** (Round 6 Codex
+  R6-F1). Prior to v0.46.1, an invalid strategy string (e.g.
+  `metric_specs.ece(strategy="typo")`) silently dispatched to quantile ECE and
+  returned a `scorecard()` cell with `status="ok"` under an invalid encoded key
+  (`"ece_n_bins_15_strategy_typo"`) — wrong-by-design data correctness path.
+  Verified by Codex via runtime probe. Now both the `ece()` factory and
+  `_EceSpec.compute()` raise:
+  ```
+  ValueError: ECE strategy must be 'uniform' or 'quantile'; got 'typo'
+  ```
+  Defence-in-depth: the factory validates eagerly (before LRU cache hit) AND
+  `compute()` validates at the compute boundary so direct construction of
+  `_EceSpec(strategy="typo")` (bypassing the factory) also raises.
+- **Deprecation warning content for all 5 ECE variants** (Round 6 Codex R6-F2 +
+  Gemini R6-F2, with Decisions R6-F + R6-G). The v0.46.0 `__getattr__`
+  deprecation shim's warning messages produced broken migration snippets:
+  - For `expected_calibration_error` + `expected_calibration_error_equal_mass`:
+    the suggested `Scorecard` lookup key was the factory-call expression
+    (`"ece(n_bins=10)"`) instead of the encoded spec name
+    (`"ece_n_bins_10_strategy_uniform"`). Now uses the correct encoded key.
+  - For `expected_calibration_error_debiased` / `_l2` / `_l2_debiased`: these
+    variants are not in the v0.46 `metric_specs` namespace (Decision R6-G;
+    research-completeness primitives, deferred to v1.x if user demand
+    surfaces). Their warnings now point at the submodule path
+    (`from eval_toolkit.metrics import expected_calibration_error_debiased`)
+    instead of an unconstructable scorecard snippet.
+  - Pre-v0.46 default verification: Gemini's report claimed
+    `expected_calibration_error` defaulted to `n_bins=15`; verified against
+    `metrics.py:730-734` that the actual default is `n_bins=10`. Per Decision
+    R6-F, warning snippets use `n_bins=10` to preserve bit-identical pre-v0.46
+    math + add a migration note explaining the new `metric_specs.ece()` factory
+    default of `n_bins=15` (matching Hines et al.).
+### Tests
+- `tests/test_scorecard.py`: 4 new tests for ECE strategy validation
+  (parametrized factory-rejection + compute-defence-in-depth).
+- `tests/test_deprecated_scalars_shim.py`: 4 new test classes — verify each
+  warning contains correct factory expression + encoded scorecard key, ECE
+  warnings carry the n_bins=10/15 migration note, submodule-only warnings cite
+  `eval_toolkit.metrics` path, and the snippet in each first-party warning is
+  EXECUTABLE (parses + runs against synthetic data + produces ok-status cell).
+### Rolled forward to v0.47 (Decision R6-E)
+- R6-A `seed=None` docstring fix (non-blocker per Decision Q).
+- R6-F3 duplicate `MetricSpec.name` rejection.
+- R6-F5 (Codex) Protocol method-shape drift guard.
+- R6-F3 (Gemini) `Scorecard.to_pandas()` schema expansion.
+- R6-F4 (Gemini) `make_spec_name()` helper.
+- R6-F5 (Gemini) narrow `_evaluate_spec()` exception catch.
+- R6-F6 (Codex) plan + roadmap state-drift refresh.
 ## [0.46.0] — 2026-05-21 — Scorecard: primary v1.0 metric surface (closes #36)
 Second minor of the staggered v0.45 → v0.46 → v0.47 → v0.48 → v1.0 sequence.
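
The key mismatch described under R6-F2 can be reproduced without the package. This is a minimal sketch, assuming only that the result container behaves like a mapping keyed by the encoded spec name (the `ece_n_bins_<n>_strategy_<s>` pattern quoted in the changelog). `encode_ece_name` and the plain `dict` are hypothetical stand-ins, not eval-toolkit API, and the stored value is made up.

```python
def encode_ece_name(n_bins: int, strategy: str) -> str:
    """Hypothetical helper mirroring the encoded-name pattern from the changelog."""
    return f"ece_n_bins_{n_bins}_strategy_{strategy}"


# Stand-in for a scorecard result: a mapping keyed by the encoded spec name.
results = {encode_ece_name(10, "uniform"): 0.042}  # 0.042 is an illustrative value

# The v0.46.0 warning suggested the factory-call expression as the lookup key.
broken_key = "ece(n_bins=10)"
# The v0.46.1 warning suggests the encoded spec name instead.
fixed_key = "ece_n_bins_10_strategy_uniform"

try:
    results[broken_key]
    lookup_failed = False
except KeyError:
    # A copied v0.46.0 snippet would fail here.
    lookup_failed = True

value = results[fixed_key]
```

The factory expression is what a user types to build the spec; the encoded name is what indexes the result, so a migration snippet needs both.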

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.46.0
+Version: 0.46.1
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/docs/source/adr/README.md RENAMED

@@ -73,4 +73,6 @@ What would have to change for this decision to be reopened?
 | # | Title | Status | Date |
 |---|---|---|---|
-| _none yet_ | | | |
+| [0001](0001-flat-module-layout.md) | Flat single-file modules through v1.x | Accepted | 2026-05-21 |
+| [0002](0002-scorecard-as-primary-metric-surface.md) | `scorecard()` as the primary v1.0 metric surface | Accepted | 2026-05-21 |
+| [0003](0003-stability-contract-and-gate3-methodology.md) | v1.0 stability contract + Gate 3 methodology | Accepted | 2026-05-21 |

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/src/eval_toolkit/__init__.py RENAMED

@@ -342,10 +342,7 @@ def __getattr__(name: str) -> Any:
         import warnings
         warnings.warn(
-            f"eval_toolkit.{name} is deprecated and will be removed in v0.47. "
-            f"Use `scorecard(y, s, metrics=[metric_specs.{_scorecard_spec_for(name)}])"
-            f'["{_scorecard_spec_for(name)}"].value` instead, or "import from the'
-            f" `eval_toolkit.metrics` submodule directly (internal API).",
+            _deprecation_warning_for(name),
             DeprecationWarning,
             stacklevel=2,
         )
@@ -366,23 +363,107 @@ def __getattr__(name: str) -> Any:
 # ── BEGIN TRANSITIONAL DEPRECATION HELPER (Decision L; REMOVE AT v0.47) ──
-def _scorecard_spec_for(deprecated_name: str) -> str:
-    """Map a deprecated-scalar name to its `metric_specs` replacement name.
+#
+# Per Round 6 audit (Codex R6-F2 + Gemini R6-F2; Decisions R6-F + R6-G):
+# - For deprecated scalars with a first-party `metric_specs` equivalent, the
+#   warning emits an EXECUTABLE scorecard snippet (factory expression + the
+#   correct encoded scorecard key, not the factory call string).
+# - For the 3 ECE variants without a `metric_specs` equivalent
+#   (expected_calibration_error_debiased / _l2 / _l2_debiased), the warning
+#   instead points at the submodule path per Decision R6-G — no first-party
+#   replacement is shipped at v0.47.
+# - ECE `n_bins=10` preserves the pre-v0.46 default (verified at
+#   `metrics.py:730-734`) — Decision R6-F. A migration note explains that
+#   the v0.46+ `metric_specs.ece()` factory defaults to `n_bins=15` (matching
+#   Hines et al.) and how to opt in.
+_FirstParty = tuple[str, str]  # (factory_expression, scorecard_key)
+"""Type alias for a deprecated-scalar that has a metric_specs replacement.
+The factory expression is what the user types after ``metric_specs.``; the
+scorecard key is the literal string that indexes ``Scorecard``.
+"""
+_FIRST_PARTY_REPLACEMENTS: dict[str, _FirstParty] = {
+    "pr_auc": ("pr_auc", "pr_auc"),
+    "roc_auc": ("roc_auc", "roc_auc"),
+    "brier_score": ("brier", "brier"),
+    # ECE variants: use n_bins=10 (pre-v0.46 default per Decision R6-F).
+    # The migration note in the warning text explains how to switch to
+    # n_bins=15 if the user wants the v0.46+ metric_specs.ece() default.
+    "expected_calibration_error": (
+        "ece(n_bins=10)",
+        "ece_n_bins_10_strategy_uniform",
+    ),
+    "expected_calibration_error_equal_mass": (
+        'ece(n_bins=10, strategy="quantile")',
+        "ece_n_bins_10_strategy_quantile",
+    ),
+}
+"""Names that have a first-party metric_specs replacement at v0.46.
+The 3 ECE variants NOT in this map (_debiased, _l2, _l2_debiased) get the
+submodule-path warning template instead (Decision R6-G).
+"""
-    Used only inside the v0.46 deprecation warning message. Returns the
-    closest equivalent first-party spec name where one exists; falls back
-    to the original name for ECE variants whose exact-match spec isn't in
-    the v0.46 first-party namespace (e.g., the L2 / debiased variants —
-    callers either implement a custom `MetricSpec` or stay on the
-    submodule path).
+def _deprecation_warning_for(name: str) -> str:
+    """Render the DeprecationWarning message for a deprecated scalar name.
+    Branches on whether ``name`` has a first-party `metric_specs` replacement
+    (Decision R6-G):
+    - First-party (5 names): scorecard snippet with the correct encoded key
+      (Decision R6-F).
+    - Submodule-only (3 ECE variants): point at the submodule path per
+      Decision R6-G.
+    The first-party variants for ECE include a migration note explaining the
+    new ``metric_specs.ece()`` factory default of ``n_bins=15`` so users can
+    opt in to the new convention; the snippet itself uses ``n_bins=10`` for
+    bit-identical pre-v0.46 math (Decision R6-F).
+    Parameters
+    ----------
+    name : str
+        A name in ``_DEPRECATED_SCALARS``.
+    Returns
+    -------
+    str
+        The warning message, ready to pass to ``warnings.warn``.
     """
-    return {
-        "pr_auc": "pr_auc",
-        "roc_auc": "roc_auc",
-        "brier_score": "brier",
-        "expected_calibration_error": "ece(n_bins=10)",
-        "expected_calibration_error_equal_mass": 'ece(n_bins=10, strategy="quantile")',
-    }.get(deprecated_name, deprecated_name)
+    first_party = _FIRST_PARTY_REPLACEMENTS.get(name)
+    if first_party is not None:
+        factory_expr, scorecard_key = first_party
+        msg = (
+            f"eval_toolkit.{name} is deprecated and will be removed in v0.47. "
+            f"For the same math, use:\n"
+            f"    scorecard(y, s, metrics=[metric_specs.{factory_expr}])"
+            f'["{scorecard_key}"].value\n'
+            f"Or import from the eval_toolkit.metrics submodule directly "
+            f"(internal API per ADR 0002 — stable across v1.x, subject to "
+            f"refactor in major versions)."
+        )
+        # ECE-specific migration note about the n_bins default change.
+        if name.startswith("expected_calibration_error"):
+            msg += (
+                "\nNote: the v0.46+ metric_specs.ece() factory defaults to "
+                "n_bins=15 (matching Hines et al.); the n_bins=10 in this "
+                "snippet preserves the pre-v0.46 math. Pass n_bins=15 to use "
+                "the new convention."
+            )
+        return msg
+    # Decision R6-G: 3 ECE variants without first-party replacements →
+    # submodule path only.
+    return (
+        f"eval_toolkit.{name} is deprecated and will be removed in v0.47. "
+        f"This variant is NOT in v0.46+ metric_specs. Use:\n"
+        f"    from eval_toolkit.metrics import {name}\n"
+        f"(internal API per ADR 0002 — stable across v1.x, subject to "
+        f"refactor in major versions). Or contribute the variant to "
+        f"metric_specs if you use it regularly."
+    )
 # ── END TRANSITIONAL DEPRECATION HELPER (Decision L; REMOVE AT v0.47) ──
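
The shim in this file relies on a module-level `__getattr__` (PEP 562), which Python calls only when normal attribute lookup on the module fails. The sketch below shows that mechanism in isolation, assuming a throwaway module built with `types.ModuleType`. `demo_pkg`, `_old_metric`, and `_DEPRECATED` are hypothetical names for illustration, not part of eval-toolkit.

```python
import types
import warnings


def _old_metric(y, s):
    """Placeholder standing in for a deprecated scalar metric."""
    return 0.5


_DEPRECATED = {"old_metric": _old_metric}

# Build a throwaway module so the sketch runs in a single file.
demo_pkg = types.ModuleType("demo_pkg")


def _module_getattr(name: str):
    """Module-level __getattr__: invoked only when normal lookup fails."""
    if name in _DEPRECATED:
        warnings.warn(
            f"demo_pkg.{name} is deprecated and will be removed in a later release.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not the shim
        )
        return _DEPRECATED[name]
    # Unknown names must still fail loudly.
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


demo_pkg.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fn = demo_pkg.old_metric  # triggers the shim

messages = [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]
```

Keeping the message rendering in a separate function, as the diff does with `_deprecation_warning_for`, makes the text testable without triggering the attribute lookup.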

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.46.0"
+__version__ = "0.46.1"

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/src/eval_toolkit/metric_specs.py RENAMED

@@ -118,6 +118,23 @@ brier: MetricSpec = _BrierSpec()
 # ─────────────────────────────────────────────────────────────────────────────
+# Valid strategy values for ECE specs. Locked at v0.46.1 to prevent the
+# Round 6 R6-F1 footgun where `ece(strategy="typo")` silently dispatched to
+# quantile ECE and returned a scorecard cell with status="ok" under an
+# invalid key. See `docs/source/audit_findings.md` Round 6.
+_ECE_VALID_STRATEGIES: frozenset[str] = frozenset({"uniform", "quantile"})
+def _validate_ece_strategy(strategy: str) -> None:
+    """Validate ECE strategy value; raise ValueError with context if invalid.
+    Shared between the factory (eager validation) and ``_EceSpec.compute`` (defence in
+    depth for direct construction paths that bypass the factory).
+    """
+    if strategy not in _ECE_VALID_STRATEGIES:
+        raise ValueError(f"ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}")
 @dataclass(frozen=True, slots=True)
 class _EceSpec:
     """Internal :class:`MetricSpec` for expected calibration error.
@@ -135,6 +152,10 @@ class _EceSpec:
         return f"ece_n_bins_{self.n_bins}_strategy_{self.strategy}"
     def compute(self, y_true: np.ndarray, y_score: np.ndarray) -> float:
+        # Defence-in-depth strategy validation — the factory validates first,
+        # but a caller bypassing the factory and constructing `_EceSpec` directly
+        # would otherwise produce a wrong-metric scorecard cell silently.
+        _validate_ece_strategy(self.strategy)
         if self.strategy == "uniform":
             return float(_ece_uniform(y_true, y_score, n_bins=self.n_bins))
         return float(_ece_equal_mass(y_true, y_score, n_bins=self.n_bins))
@@ -178,5 +199,19 @@ def ece(*, n_bins: int = 15, strategy: ECEStrategy = "uniform") -> MetricSpec:
     'ece_n_bins_15_strategy_uniform'
     >>> ece(n_bins=10, strategy="quantile").name
     'ece_n_bins_10_strategy_quantile'
+    Invalid strategies raise ``ValueError`` eagerly (v0.46.1+; Round 6 R6-F1
+    fix — prior to v0.46.1 this silently dispatched to quantile ECE):
+    >>> ece(strategy="typo")
+    Traceback (most recent call last):
+        ...
+    ValueError: ECE strategy must be 'uniform' or 'quantile'; got 'typo'
+    Raises
+    ------
+    ValueError
+        If ``strategy`` is not in ``{"uniform", "quantile"}``.
     """
+    _validate_ece_strategy(strategy)
     return _EceSpec(n_bins=n_bins, strategy=strategy)
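
The two validation points added in this hunk can be sketched in a standalone form. This is an illustration of the pattern only: `DemoSpec`, `make_spec`, and `dispatch` are hypothetical names, and the dispatch returns a label rather than computing ECE.

```python
from dataclasses import dataclass

_VALID_STRATEGIES = frozenset({"uniform", "quantile"})


def _validate(strategy: str) -> None:
    """Shared check used at both the factory and the compute boundary."""
    if strategy not in _VALID_STRATEGIES:
        raise ValueError(f"strategy must be 'uniform' or 'quantile'; got {strategy!r}")


@dataclass(frozen=True)
class DemoSpec:
    strategy: str = "uniform"

    def dispatch(self) -> str:
        # Second check: without it, any non-"uniform" value falls through
        # to the quantile branch below.
        _validate(self.strategy)
        if self.strategy == "uniform":
            return "uniform-path"
        return "quantile-path"


def make_spec(*, strategy: str = "uniform") -> DemoSpec:
    # First check: fail at construction time, before any data is touched.
    _validate(strategy)
    return DemoSpec(strategy=strategy)


errors = []
for attempt in (
    lambda: make_spec(strategy="typo"),           # rejected by the factory
    lambda: DemoSpec(strategy="typo").dispatch(),  # rejected at dispatch
):
    try:
        attempt()
    except ValueError as exc:
        errors.append(str(exc))
```

The final `return` in `dispatch` is an implicit else, which is why the unvalidated version treated every unrecognised string as quantile.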

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/tests/golden/public_api/snapshot.json RENAMED

@@ -1192,7 +1192,7 @@
       "doc_first_line": "str(object='') -> str",
       "kind": "value",
       "type": "str",
-      "value": "'0.46.0'"
+      "value": "'0.46.1'"
     },
     "apply_operating_points": {
       "doc_first_line": "Apply fitted thresholds to a mixed-class or single-class target slice.",

{eval_toolkit-0.46.0 → eval_toolkit-0.46.1}/tests/test_deprecated_scalars_shim.py RENAMED

@@ -58,7 +58,13 @@ def test_deprecated_names_not_in_exports(name: str) -> None:
 @pytest.mark.unit
 @pytest.mark.parametrize("name", sorted(DEPRECATED_SCALARS))
 def test_deprecated_name_emits_warning(name: str) -> None:
-    """Looking up a deprecated name at the top level emits DeprecationWarning."""
+    """Looking up a deprecated name at the top level emits DeprecationWarning.
+    Updated v0.46.1 per Decision R6-G: the 3 ECE variants without first-party
+    `metric_specs` equivalents point at the submodule path
+    (`from eval_toolkit.metrics import ...`) rather than a scorecard snippet.
+    The other 5 first-party-replaceable names use the scorecard snippet.
+    """
     with warnings.catch_warnings(record=True) as caught:
         warnings.simplefilter("always")
         _ = getattr(eval_toolkit, name)
@@ -66,9 +72,16 @@ def test_deprecated_name_emits_warning(name: str) -> None:
         assert (
             len(deprecations) >= 1
         ), f"expected DeprecationWarning for {name}; got {[w.category.__name__ for w in caught]}"
-        assert name in str(deprecations[0].message)
-        assert "v0.47" in str(deprecations[0].message)
-        assert "scorecard" in str(deprecations[0].message)
+        msg = str(deprecations[0].message)
+        # Universal assertions for ALL deprecated names:
+        assert name in msg
+        assert "v0.47" in msg
+        # Per-name-class assertions: scorecard for first-party, submodule for the rest.
+        if name in _EXPECTED_SUBMODULE_ONLY:
+            assert "eval_toolkit.metrics" in msg
+            assert "NOT in v0.46+ metric_specs" in msg
+        else:
+            assert "scorecard" in msg
 @pytest.mark.unit
@@ -76,7 +89,7 @@ def test_deprecated_pr_auc_still_functional() -> None:
     """The returned function still works — only the WAY it's imported is deprecated."""
     with warnings.catch_warnings():
         warnings.simplefilter("ignore", DeprecationWarning)
-        pr_auc = eval_toolkit.pr_auc  # type: ignore[attr-defined]
+        pr_auc = eval_toolkit.pr_auc
     y = np.array([0, 1, 0, 1, 1, 0, 1, 0])
     s = np.array([0.2, 0.8, 0.3, 0.7, 0.9, 0.1, 0.6, 0.4])
     assert 0.0 <= pr_auc(y, s) <= 1.0
@@ -86,7 +99,7 @@ def test_deprecated_pr_auc_still_functional() -> None:
 def test_deprecated_brier_score_still_functional() -> None:
     with warnings.catch_warnings():
         warnings.simplefilter("ignore", DeprecationWarning)
-        brier_score = eval_toolkit.brier_score  # type: ignore[attr-defined]
+        brier_score = eval_toolkit.brier_score
     y = np.array([0, 1, 0, 1])
     s = np.array([0.1, 0.9, 0.2, 0.8])
     assert 0.0 <= brier_score(y, s) <= 1.0
@@ -170,7 +183,7 @@ def test_full_all_resolves_without_attribute_error() -> None:
 def test_unknown_name_still_raises_attribute_error() -> None:
     """The deprecation branch must not swallow unknown-name errors."""
     with pytest.raises(AttributeError, match="no attribute"):
-        _ = eval_toolkit.nonexistent_symbol_xyz  # type: ignore[attr-defined]
+        _ = eval_toolkit.nonexistent_symbol_xyz
 # ─────────────────────────────────────────────────────────────────────────────
@@ -182,3 +195,140 @@ def test_unknown_name_still_raises_attribute_error() -> None:
 def test_deprecated_scalars_set_matches() -> None:
     """The internal `_DEPRECATED_SCALARS` set lines up with this test's expectations."""
     assert eval_toolkit._DEPRECATED_SCALARS == DEPRECATED_SCALARS
+# ─────────────────────────────────────────────────────────────────────────────
+# v0.46.1 — Round 6 R6-F2 + R6-F + R6-G: warning snippet content & executability
+# ─────────────────────────────────────────────────────────────────────────────
+# First-party replacements that should appear in warning snippets verbatim.
+# (factory_expression, scorecard_key) per deprecated name. Matches
+# `eval_toolkit._FIRST_PARTY_REPLACEMENTS`.
+_EXPECTED_FIRST_PARTY: dict[str, tuple[str, str]] = {
+    "pr_auc": ("pr_auc", "pr_auc"),
+    "roc_auc": ("roc_auc", "roc_auc"),
+    "brier_score": ("brier", "brier"),
+    "expected_calibration_error": ("ece(n_bins=10)", "ece_n_bins_10_strategy_uniform"),
+    "expected_calibration_error_equal_mass": (
+        'ece(n_bins=10, strategy="quantile")',
+        "ece_n_bins_10_strategy_quantile",
+    ),
+}
+# ECE variants without first-party metric_specs equivalents (Decision R6-G).
+_EXPECTED_SUBMODULE_ONLY: frozenset[str] = frozenset(
+    {
+        "expected_calibration_error_debiased",
+        "expected_calibration_error_l2",
+        "expected_calibration_error_l2_debiased",
+    }
+)
+def _capture_warning_message(name: str) -> str:
+    """Trigger the deprecation shim for `name` and return the rendered message."""
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always")
+        getattr(eval_toolkit, name)
+        deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
+        assert deprecations, f"no DeprecationWarning emitted for {name}"
+        return str(deprecations[0].message)
+@pytest.mark.unit
+@pytest.mark.parametrize("name", sorted(_EXPECTED_FIRST_PARTY))
+def test_first_party_warning_contains_correct_snippet(name: str) -> None:
+    """First-party replacements emit scorecard snippet with the encoded key.
+    Round 6 R6-F2: prior warnings used the factory-call expression
+    (e.g. ``"ece(n_bins=10)"``) as the scorecard lookup key. The shipped
+    Scorecard is a Mapping keyed by the encoded spec name
+    (e.g. ``"ece_n_bins_10_strategy_uniform"``). The v0.46.1 fix uses the
+    correct encoded key inline so blindly-copied snippets actually work.
+    """
+    factory_expr, scorecard_key = _EXPECTED_FIRST_PARTY[name]
+    msg = _capture_warning_message(name)
+    assert f"metric_specs.{factory_expr}" in msg
+    assert f'["{scorecard_key}"]' in msg
+@pytest.mark.unit
+def test_ece_first_party_warnings_carry_n_bins_10_migration_note() -> None:
+    """ECE first-party warnings preserve pre-v0.46 default (n_bins=10) + nudge.
+    Per Decision R6-F: pre-v0.46 `expected_calibration_error` defaulted to
+    n_bins=10; v0.46+ `metric_specs.ece()` defaults to n_bins=15. The
+    warning snippet uses n_bins=10 for bit-identical math; an appended note
+    explains the new convention.
+    """
+    for name in ("expected_calibration_error", "expected_calibration_error_equal_mass"):
+        msg = _capture_warning_message(name)
+        assert "n_bins=10" in msg
+        # Migration note about the new default:
+        assert "n_bins=15" in msg
+        assert "Hines" in msg or "new convention" in msg
+@pytest.mark.unit
+@pytest.mark.parametrize("name", sorted(_EXPECTED_SUBMODULE_ONLY))
+def test_submodule_only_warning_points_at_submodule_path(name: str) -> None:
+    """The 3 ECE variants without first-party specs route users to the submodule.
+    Per Decision R6-G: `expected_calibration_error_debiased` / `_l2` /
+    `_l2_debiased` are research-completeness primitives without
+    `metric_specs` equivalents at v0.46. Their warnings cite
+    `eval_toolkit.metrics.<name>` rather than a scorecard snippet.
+    """
+    msg = _capture_warning_message(name)
+    assert f"from eval_toolkit.metrics import {name}" in msg
+    assert "NOT in v0.46+ metric_specs" in msg
+@pytest.mark.unit
+@pytest.mark.parametrize("name", sorted(_EXPECTED_FIRST_PARTY))
+def test_first_party_warning_snippet_is_executable(name: str) -> None:
+    """The scorecard snippet in the warning produces a usable MetricResult.
+    Parses the snippet, executes it against a synthetic balanced slice, and
+    asserts that the resulting `MetricResult` has `status="ok"` and finite
+    `value`. This is the user-facing migration contract: copy the snippet,
+    run it, get a number.
+    """
+    from eval_toolkit import metric_specs as ms
+    from eval_toolkit import scorecard  # noqa: F401
+    msg = _capture_warning_message(name)
+    factory_expr, scorecard_key = _EXPECTED_FIRST_PARTY[name]
+    # Build the snippet that the warning instructs the user to use:
+    #   scorecard(y, s, metrics=[metric_specs.<factory_expr>])["<key>"].value
+    rng = np.random.default_rng(0)
+    y = rng.integers(0, 2, 200)
+    s = rng.random(200)
+    snippet = (
+        f"scorecard(y, s, metrics=[ms.{factory_expr}], bootstrap=False)" f'["{scorecard_key}"]'
+    )
+    # Confirm the warning actually contains the snippet shape it promises:
+    assert f"metric_specs.{factory_expr}" in msg
+    # Evaluate (safe — we constructed factory_expr from the known mapping):
+    cell = eval(snippet, {"scorecard": scorecard, "ms": ms, "y": y, "s": s})  # noqa: S307
+    assert cell.status == "ok", f"snippet for {name}: {cell.status} (reason: {cell.reason})"
+    assert cell.value is not None
+    assert isinstance(cell.value, float)
+@pytest.mark.unit
+@pytest.mark.parametrize("name", sorted(_EXPECTED_SUBMODULE_ONLY))
+def test_submodule_only_snippet_is_importable(name: str) -> None:
+    """The submodule-import snippet in the warning actually imports something callable."""
+    import importlib
+    metrics_mod = importlib.import_module("eval_toolkit.metrics")
+    assert hasattr(metrics_mod, name), (
+        f"warning for {name} promises `from eval_toolkit.metrics import {name}` "
+        f"but the symbol isn't present in the submodule"
+    )
+    assert callable(getattr(metrics_mod, name))
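
The executability test above rebuilds the snippet from the known mapping rather than parsing it out of the message. A complementary check, sketched here as an assumption and not something the package ships, is to pull the indented line from the warning text and confirm it parses as a Python expression. The `message` string imitates the format in `_deprecation_warning_for`; `extract_snippet` is a hypothetical helper, and the `node.slice` handling assumes Python 3.9 or later.

```python
import ast

# Imitation of a first-party warning message (format taken from the diff above).
message = (
    "pkg.old_name is deprecated and will be removed in v0.47. "
    "For the same math, use:\n"
    "    scorecard(y, s, metrics=[metric_specs.ece(n_bins=10)])"
    '["ece_n_bins_10_strategy_uniform"].value\n'
    "Or import from the submodule directly."
)


def extract_snippet(msg: str) -> str:
    """Return the first four-space-indented line of a warning message."""
    for line in msg.splitlines():
        if line.startswith("    "):
            return line.strip()
    raise ValueError("no indented snippet found")


snippet = extract_snippet(message)

# ast.parse raises SyntaxError if the snippet is not a valid expression.
tree = ast.parse(snippet, mode="eval")

# Collect string literals used as subscript keys (Python 3.9+ AST layout).
keys = [
    node.slice.value
    for node in ast.walk(tree)
    if isinstance(node, ast.Subscript)
    and isinstance(node.slice, ast.Constant)
    and isinstance(node.slice.value, str)
]
```

Parsing only checks that the snippet is well formed; running it against data, as the shipped test does, is what confirms that the key exists in the result.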
