PyPI - eval-toolkit - Versions diffs - 0.45.0__tar.gz → 0.46.1__tar.gz - Mend

eval-toolkit 0.45.0tar.gz → 0.46.1tar.gz

Files changed (178)

{eval_toolkit-0.45.0 → eval_toolkit-0.46.1}/.gitignore RENAMED

@@ -45,6 +45,16 @@ coverage.json
 # Mutation-testing output (mutmut / cargo-mutants — local run artifacts)
 mutants/
+# Local audit artifacts (Round 5+ Gate 3 LLM cross-review packets + reports).
+# The canonical prompt lives at ~/.claude/plans/gate3-audit-prompt.md and the
+# canonical findings ledger lives at docs/source/audit_findings.md; per-run
+# raw model outputs are author-local working copies.
+# Tracked: per-round briefing files (`gate3-audit-round-<N>.md`).
+# Untracked: prompt template, generic report, per-round report files.
+gate3-audit-prompt.md
+gate3-audit-report.md
+gate3-audit-round-*-report.md
 # Claude Code project settings (machine-local)
 .claude/

{eval_toolkit-0.45.0 → eval_toolkit-0.46.1}/CHANGELOG.md RENAMED

@@ -5,6 +5,146 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.46.1] — 2026-05-21 — Round 6 hotfix: ECE strategy validation + deprecation warning content
+Hotfix release per **Decision Q** (data correctness regression + time-sensitive
+warning content) + **Decision R6-E** (scope: R6-F1 + R6-F2 only; R6-A docstring
+rolls forward to v0.47). All other Round 6 findings dispositioned to v0.47.0.
+See [`docs/source/audit_findings.md`](docs/source/audit_findings.md) Round 6 for
+the full disposition ledger.
+### Fixed
+- **`metric_specs.ece(strategy=<value>)` strategy validation** (Round 6 Codex
+  R6-F1). Prior to v0.46.1, an invalid strategy string (e.g.
+  `metric_specs.ece(strategy="typo")`) silently dispatched to quantile ECE and
+  returned a `scorecard()` cell with `status="ok"` under an invalid encoded key
+  (`"ece_n_bins_15_strategy_typo"`) — wrong-by-design data correctness path.
+  Verified by Codex via runtime probe. Now both the `ece()` factory and
+  `_EceSpec.compute()` raise:
+  ```
+  ValueError: ECE strategy must be 'uniform' or 'quantile'; got 'typo'
+  ```
+  Defence-in-depth: the factory validates eagerly (before LRU cache hit) AND
+  `compute()` validates at the compute boundary so direct construction of
+  `_EceSpec(strategy="typo")` (bypassing the factory) also raises.
+- **Deprecation warning content for all 5 ECE variants** (Round 6 Codex R6-F2 +
+  Gemini R6-F2, with Decisions R6-F + R6-G). The v0.46.0 `__getattr__`
+  deprecation shim's warning messages produced broken migration snippets:
+  - For `expected_calibration_error` + `expected_calibration_error_equal_mass`:
+    the suggested `Scorecard` lookup key was the factory-call expression
+    (`"ece(n_bins=10)"`) instead of the encoded spec name
+    (`"ece_n_bins_10_strategy_uniform"`). Now uses the correct encoded key.
+  - For `expected_calibration_error_debiased` / `_l2` / `_l2_debiased`: these
+    variants are not in the v0.46 `metric_specs` namespace (Decision R6-G;
+    research-completeness primitives, deferred to v1.x if user demand
+    surfaces). Their warnings now point at the submodule path
+    (`from eval_toolkit.metrics import expected_calibration_error_debiased`)
+    instead of an unconstructable scorecard snippet.
+  - Pre-v0.46 default verification: Gemini's report claimed
+    `expected_calibration_error` defaulted to `n_bins=15`; verified against
+    `metrics.py:730-734` that the actual default is `n_bins=10`. Per Decision
+    R6-F, warning snippets use `n_bins=10` to preserve bit-identical pre-v0.46
+    math + add a migration note explaining the new `metric_specs.ece()` factory
+    default of `n_bins=15` (matching Hines et al.).
+### Tests
+- `tests/test_scorecard.py`: 4 new tests for ECE strategy validation
+  (parametrized factory-rejection + compute-defence-in-depth).
+- `tests/test_deprecated_scalars_shim.py`: 4 new test classes — verify each
+  warning contains correct factory expression + encoded scorecard key, ECE
+  warnings carry the n_bins=10/15 migration note, submodule-only warnings cite
+  `eval_toolkit.metrics` path, and the snippet in each first-party warning is
+  EXECUTABLE (parses + runs against synthetic data + produces ok-status cell).
+### Rolled forward to v0.47 (Decision R6-E)
+- R6-A `seed=None` docstring fix (non-blocker per Decision Q).
+- R6-F3 duplicate `MetricSpec.name` rejection.
+- R6-F5 (Codex) Protocol method-shape drift guard.
+- R6-F3 (Gemini) `Scorecard.to_pandas()` schema expansion.
+- R6-F4 (Gemini) `make_spec_name()` helper.
+- R6-F5 (Gemini) narrow `_evaluate_spec()` exception catch.
+- R6-F6 (Codex) plan + roadmap state-drift refresh.
+## [0.46.0] — 2026-05-21 — Scorecard: primary v1.0 metric surface (closes #36)
+Second minor of the staggered v0.45 → v0.46 → v0.47 → v0.48 → v1.0 sequence.
+**Soft-breaking** — existing top-level scalar metric imports still work but
+emit `DeprecationWarning` (hard-removed at v0.47).
+See `docs/source/migration/v0.46.md` for the full consumer migration guide and
+`docs/source/adr/0002-scorecard-as-primary-metric-surface.md` for the
+decision record.
+### Added
+- **`eval_toolkit.scorecard(y_true, y_score, metrics=[...], bootstrap=True)`**
+  — primary v1.0 metric surface. Single call computes multiple threshold-free
+  metrics + bootstrap CIs on one slice; returns a `Scorecard` (read-only
+  `Mapping[str, MetricResult]`). Type-safe dict-subscript access; status-aware
+  cells; per-cell error isolation.
+- **`MetricSpec` Protocol** — v1.0 Tier-2 contract; `name: str` +
+  `compute(y_true, y_score) -> float`. Custom user specs satisfy structurally.
+- **`MetricResult`** frozen dataclass — `value: float | None`, `status:
+  Literal["ok", "skipped", "error"]`, `reason: str`, `ci: BootstrapCI | None`.
+  Reuses the existing `MetricState` vocabulary from `artifacts.py:30-61`.
+- **`Scorecard`** read-only `Mapping[str, MetricResult]` — `to_dict()`
+  JSON-friendly, `to_pandas()` one-row DataFrame (lazy pandas import).
+- **`eval_toolkit.metric_specs`** namespace submodule with threshold-free
+  first-party specs:
+  - `pr_auc`, `roc_auc`, `brier` — module-level singletons (identity stable).
+  - `ece(n_bins, strategy)` — LRU-cached factory (identity stable per kwargs).
+- **`SINGLE_CLASS_INCOMPATIBLE_METRICS`** extended with `pr_auc` / `roc_auc`
+  aliases (alongside existing `auroc` / `auprc`) so the v0.46 scorecard
+  surface and the v0.39 harness paths both produce correct skipped-status
+  behavior. Non-breaking; doctest + unit tests added.
+- **`docs/source/adr/0002-scorecard-as-primary-metric-surface.md`** —
+  decision record covering single-surface rationale, threshold-free scope,
+  Tier-2 Protocol commitment, and v2.0 trigger conditions.
+- **`docs/source/migration/v0.46.md`** — consumer migration guide with
+  side-by-side recipes for every common pattern.
+### Deprecated
+The following 8 top-level scalar imports emit `DeprecationWarning` and will
+be hard-removed at v0.47.0. Use `scorecard()` + `metric_specs` or the
+`eval_toolkit.metrics` submodule path (internal API, no warning).
+- `pr_auc`, `roc_auc`, `brier_score`
+- `expected_calibration_error`
+- `expected_calibration_error_debiased`
+- `expected_calibration_error_equal_mass`
+- `expected_calibration_error_l2`
+- `expected_calibration_error_l2_debiased`
+### Audit findings integrated (Round 5)
+Per `docs/source/audit_findings.md`:
+- **F1** (scorecard threshold semantics) — addressed by Decision R: ship
+  threshold-free first-party specs only at v0.46. Threshold-dependent
+  metrics (F1, accuracy, precision, recall) deferred to v1.x with explicit
+  operating-point provenance.
+- **F2** (scorecard cell-state semantics) — addressed by Decision S: reuse
+  existing `MetricState` (`ok`/`skipped`/`error`) vocabulary.
+- **F4** (deprecation shim must extend the lazy resolver, not replace it) —
+  addressed: `__getattr__` deprecation branch sits between `__version__`
+  short-circuit and the base `_EXPORTS` lookup; tagged with BEGIN/END
+  TRANSITIONAL markers for clean v0.47 removal. Tests guard that every
+  remaining `_EXPORTS` symbol still resolves.
+- **X.2 precondition** — `is_metric_defined_for_slice` aliases shipped
+  ahead of v0.46 (PR #62).
+### Protocol stability
+Tier-2 streak continues: 7 of 7 consecutive minors (v0.40–v0.46) without
+method-shape edits to any existing Tier-2 Protocol. `MetricSpec` is a NEW
+Tier-2 Protocol added at v0.46; freezes at v1.0.
 ## [0.45.0] — 2026-05-21 — Stacking: MetaLearner Protocol + LogisticStacker (closes #52)
 First minor of the staggered v0.45 → v0.46 → v0.47 → v0.48 → v1.0 sequence
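
Note on the 0.46.1 "Fixed" entry above: the changelog describes validating the ECE strategy twice, once in the `ece()` factory before the cache is consulted and once in `compute()`. The sketch below illustrates that pattern only; it is not the package's implementation. `EceSpecSketch`, `_check_strategy`, and `_cached_ece` are hypothetical names, the binning is a simplified pure-Python ECE that assumes scores in [0, 1], and only the error message, the encoded key format, and the `n_bins=15` default are taken from the changelog text.

```python
from dataclasses import dataclass
from functools import lru_cache

_VALID_STRATEGIES = ("uniform", "quantile")


def _check_strategy(strategy: str) -> None:
    """Reject unknown strategies with the message quoted in the changelog."""
    if strategy not in _VALID_STRATEGIES:
        raise ValueError(
            f"ECE strategy must be 'uniform' or 'quantile'; got {strategy!r}"
        )


@dataclass(frozen=True)
class EceSpecSketch:
    """Hypothetical stand-in for a spec object with `name` and `compute`."""

    n_bins: int = 15
    strategy: str = "uniform"

    @property
    def name(self) -> str:
        # Encoded key format as shown in the changelog examples.
        return f"ece_n_bins_{self.n_bins}_strategy_{self.strategy}"

    def compute(self, y_true, y_score) -> float:
        # Second line of defence: direct construction bypasses the factory.
        _check_strategy(self.strategy)
        pairs = sorted(zip(y_score, y_true))
        n = len(pairs)
        if self.strategy == "uniform":
            # Equal-width bins on [0, 1]; a score of 1.0 goes in the last bin.
            bins = [[] for _ in range(self.n_bins)]
            for s, y in pairs:
                bins[min(int(s * self.n_bins), self.n_bins - 1)].append((s, y))
        else:
            # Equal-mass bins: split the score-sorted pairs into n_bins chunks.
            bins = [
                pairs[i * n // self.n_bins : (i + 1) * n // self.n_bins]
                for i in range(self.n_bins)
            ]
        # Weighted mean of |mean confidence - observed positive rate| per bin.
        return sum(
            len(b) / n
            * abs(sum(s for s, _ in b) / len(b) - sum(y for _, y in b) / len(b))
            for b in bins
            if b
        )


@lru_cache(maxsize=None)
def _cached_ece(n_bins: int, strategy: str) -> EceSpecSketch:
    return EceSpecSketch(n_bins=n_bins, strategy=strategy)


def ece(n_bins: int = 15, strategy: str = "uniform") -> EceSpecSketch:
    """Factory: validate eagerly, then return a cached (identity-stable) spec."""
    _check_strategy(strategy)  # runs before any cache lookup
    return _cached_ece(n_bins, strategy)
```

Splitting the public factory from the cached helper also normalizes positional and keyword calls to one cache key, so `ece(10)` and `ece(n_bins=10)` return the same object in this sketch.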

{eval_toolkit-0.45.0 → eval_toolkit-0.46.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.45.0
+Version: 0.46.1
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/

{eval_toolkit-0.45.0 → eval_toolkit-0.46.1}/docs/source/adr/README.md RENAMED

@@ -73,4 +73,6 @@ What would have to change for this decision to be reopened?
 | # | Title | Status | Date |
 |---|---|---|---|
-| _none yet_ | | | |
+| [0001](0001-flat-module-layout.md) | Flat single-file modules through v1.x | Accepted | 2026-05-21 |
+| [0002](0002-scorecard-as-primary-metric-surface.md) | `scorecard()` as the primary v1.0 metric surface | Accepted | 2026-05-21 |
+| [0003](0003-stability-contract-and-gate3-methodology.md) | v1.0 stability contract + Gate 3 methodology | Accepted | 2026-05-21 |

{eval_toolkit-0.45.0 → eval_toolkit-0.46.1}/src/eval_toolkit/__init__.py RENAMED

@@ -193,20 +193,18 @@ _EXPORTS: dict[str, str] = {
     "SINGLE_CLASS_INCOMPATIBLE_METRICS": "eval_toolkit.metrics",
     "ThresholdResult": "eval_toolkit.metrics",
     "brier_decomposition": "eval_toolkit.metrics",
-    "brier_score": "eval_toolkit.metrics",
-    "expected_calibration_error": "eval_toolkit.metrics",
-    "expected_calibration_error_debiased": "eval_toolkit.metrics",
-    "expected_calibration_error_equal_mass": "eval_toolkit.metrics",
-    "expected_calibration_error_l2": "eval_toolkit.metrics",
-    "expected_calibration_error_l2_debiased": "eval_toolkit.metrics",
+    # `brier_score`, `pr_auc`, `roc_auc`, and the 5 ECE variants removed from
+    # `_EXPORTS` at v0.46 (Decision L). They remain reachable at the top
+    # level via the `__getattr__` deprecation branch (emits
+    # `DeprecationWarning`; branch removed at v0.47) and via the metrics
+    # submodule (`from eval_toolkit.metrics import pr_auc` — internal API
+    # per ADR 0002, not part of the v1.0 stability contract).
     "headline_metrics": "eval_toolkit.metrics",
     "is_metric_defined_for_slice": "eval_toolkit.metrics",
     "metrics_at_threshold": "eval_toolkit.metrics",
-    "pr_auc": "eval_toolkit.metrics",
     "precision_at_prior": "eval_toolkit.metrics",
     "quantile_stratified_pr_auc": "eval_toolkit.metrics",
     "quantile_stratified_report": "eval_toolkit.metrics",
-    "roc_auc": "eval_toolkit.metrics",
     "score_distribution_summary": "eval_toolkit.metrics",
     "single_class_threshold_metrics": "eval_toolkit.metrics",
     "stratified_recall": "eval_toolkit.metrics",
@@ -296,15 +294,65 @@ _EXPORTS: dict[str, str] = {
     "wilson_interval": "eval_toolkit.thresholds",
     "LogisticStacker": "eval_toolkit.stacking",
     "MetaLearner": "eval_toolkit.stacking",
+    "MetricResult": "eval_toolkit._scorecard",
+    "MetricSpec": "eval_toolkit._scorecard",
+    "Scorecard": "eval_toolkit._scorecard",
+    "scorecard": "eval_toolkit._scorecard",
 }
 __all__ = ["__version__", *_EXPORTS.keys()]
+# ── BEGIN TRANSITIONAL DEPRECATION BRANCH (Decision L; REMOVE AT v0.47) ──
+# At v0.46 the scalar metric functions left the top-level `_EXPORTS` map (above)
+# in favor of the `scorecard()` surface (Decision A). To give the consumer one
+# release of overlap before the hard removal at v0.47, the names below remain
+# reachable via the package-level `__getattr__` (which delegates to the
+# `eval_toolkit.metrics` submodule) but emit a `DeprecationWarning` on each
+# lookup (the result is never cached) pointing at the new API.
+#
+# WHY THIS IS A BRANCH, NOT A REPLACEMENT (Audit F4 — Round 5):
+# `__getattr__` below is the load-bearing lazy export resolver for every name
+# in `_EXPORTS`. The deprecation branch is a discrete `if name in
+# _DEPRECATED_SCALARS` check ABOVE the resolver — the resolver's existing
+# behavior for non-deprecated names is unchanged. At v0.47 we delete this
+# transitional block and the resolver continues to work for every remaining
+# `_EXPORTS` entry.
+_DEPRECATED_SCALARS: frozenset[str] = frozenset(
+    {
+        "pr_auc",
+        "roc_auc",
+        "brier_score",
+        "expected_calibration_error",
+        "expected_calibration_error_debiased",
+        "expected_calibration_error_equal_mass",
+        "expected_calibration_error_l2",
+        "expected_calibration_error_l2_debiased",
+    }
+)
+# ── END TRANSITIONAL DEPRECATION (Decision L; REMOVE AT v0.47) ──
 def __getattr__(name: str) -> Any:
     """Resolve public symbols lazily."""
     if name == "__version__":
         return __version__
+    # ── BEGIN TRANSITIONAL DEPRECATION BRANCH (Decision L; REMOVE AT v0.47) ──
+    if name in _DEPRECATED_SCALARS:
+        import warnings
+        warnings.warn(
+            _deprecation_warning_for(name),
+            DeprecationWarning,
+            stacklevel=2,
+        )
+        module = import_module("eval_toolkit.metrics")
+        value = getattr(module, name)
+        # Do NOT cache in globals() — repeated lookups should keep re-warning
+        # (one warning per call site, modulo Python's default
+        # DeprecationWarning de-duplication).
+        return value
+    # ── END TRANSITIONAL DEPRECATION (Decision L; REMOVE AT v0.47) ──
     module_name = _EXPORTS.get(name)
     if module_name is None:
         raise AttributeError(f"module 'eval_toolkit' has no attribute {name!r}")
@@ -314,6 +362,113 @@ def __getattr__(name: str) -> Any:
     return value
+# ── BEGIN TRANSITIONAL DEPRECATION HELPER (Decision L; REMOVE AT v0.47) ──
+#
+# Per Round 6 audit (Codex R6-F2 + Gemini R6-F2; Decisions R6-F + R6-G):
+# - For deprecated scalars with a first-party `metric_specs` equivalent, the
+#   warning emits an EXECUTABLE scorecard snippet (factory expression + the
+#   correct encoded scorecard key, not the factory call string).
+# - For the 3 ECE variants without a `metric_specs` equivalent
+#   (expected_calibration_error_debiased / _l2 / _l2_debiased), the warning
+#   instead points at the submodule path per Decision R6-G — no first-party
+#   replacement is shipped at v0.47.
+# - ECE `n_bins=10` preserves the pre-v0.46 default (verified at
+#   `metrics.py:730-734`) — Decision R6-F. A migration note explains that
+#   the v0.46+ `metric_specs.ece()` factory defaults to `n_bins=15` (matching
+#   Hines et al.) and how to opt in.
+_FirstParty = tuple[str, str]  # (factory_expression, scorecard_key)
+"""Type alias for a deprecated-scalar that has a metric_specs replacement.
+The factory expression is what the user types after ``metric_specs.``; the
+scorecard key is the literal string that indexes ``Scorecard``.
+"""
+_FIRST_PARTY_REPLACEMENTS: dict[str, _FirstParty] = {
+    "pr_auc": ("pr_auc", "pr_auc"),
+    "roc_auc": ("roc_auc", "roc_auc"),
+    "brier_score": ("brier", "brier"),
+    # ECE variants: use n_bins=10 (pre-v0.46 default per Decision R6-F).
+    # The migration note in the warning text explains how to switch to
+    # n_bins=15 if the user wants the v0.46+ metric_specs.ece() default.
+    "expected_calibration_error": (
+        "ece(n_bins=10)",
+        "ece_n_bins_10_strategy_uniform",
+    ),
+    "expected_calibration_error_equal_mass": (
+        'ece(n_bins=10, strategy="quantile")',
+        "ece_n_bins_10_strategy_quantile",
+    ),
+}
+"""Names that have a first-party metric_specs replacement at v0.46.
+The 3 ECE variants NOT in this map (_debiased, _l2, _l2_debiased) get the
+submodule-path warning template instead (Decision R6-G).
+"""
+def _deprecation_warning_for(name: str) -> str:
+    """Render the DeprecationWarning message for a deprecated scalar name.
+    Branches on whether ``name`` has a first-party `metric_specs` replacement
+    (Decision R6-G):
+    - First-party (5 names): scorecard snippet with the correct encoded key
+      (Decision R6-F).
+    - Submodule-only (3 ECE variants): point at the submodule path per
+      Decision R6-G.
+    The first-party variants for ECE include a migration note explaining the
+    new ``metric_specs.ece()`` factory default of ``n_bins=15`` so users can
+    opt in to the new convention; the snippet itself uses ``n_bins=10`` for
+    bit-identical pre-v0.46 math (Decision R6-F).
+    Parameters
+    ----------
+    name : str
+        A name in ``_DEPRECATED_SCALARS``.
+    Returns
+    -------
+    str
+        The warning message, ready to pass to ``warnings.warn``.
+    """
+    first_party = _FIRST_PARTY_REPLACEMENTS.get(name)
+    if first_party is not None:
+        factory_expr, scorecard_key = first_party
+        msg = (
+            f"eval_toolkit.{name} is deprecated and will be removed in v0.47. "
+            f"For the same math, use:\n"
+            f"    scorecard(y, s, metrics=[metric_specs.{factory_expr}])"
+            f'["{scorecard_key}"].value\n'
+            f"Or import from the eval_toolkit.metrics submodule directly "
+            f"(internal API per ADR 0002 — stable across v1.x, subject to "
+            f"refactor in major versions)."
+        )
+        # ECE-specific migration note about the n_bins default change.
+        if name.startswith("expected_calibration_error"):
+            msg += (
+                "\nNote: the v0.46+ metric_specs.ece() factory defaults to "
+                "n_bins=15 (matching Hines et al.); the n_bins=10 in this "
+                "snippet preserves the pre-v0.46 math. Pass n_bins=15 to use "
+                "the new convention."
+            )
+        return msg
+    # Decision R6-G: 3 ECE variants without first-party replacements →
+    # submodule path only.
+    return (
+        f"eval_toolkit.{name} is deprecated and will be removed in v0.47. "
+        f"This variant is NOT in v0.46+ metric_specs. Use:\n"
+        f"    from eval_toolkit.metrics import {name}\n"
+        f"(internal API per ADR 0002 — stable across v1.x, subject to "
+        f"refactor in major versions). Or contribute the variant to "
+        f"metric_specs if you use it regularly."
+    )
+# ── END TRANSITIONAL DEPRECATION HELPER (Decision L; REMOVE AT v0.47) ──
 def __dir__() -> list[str]:
     """Expose lazy public symbols to introspection."""
     return sorted(__all__)
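
Note on the `__init__.py` hunks above: the deprecation branch relies on the module-level `__getattr__` hook (PEP 562), placed ahead of the lazy export resolver and deliberately left uncached. The sketch below reproduces that shape on a throwaway module object so it can run standalone. `demo_pkg`, `_demo_getattr`, and the two name maps are hypothetical, the deprecated and exported names are borrowed from the standard `math` module purely for illustration, and the caching step in the resolver is an assumption inferred from the "Do NOT cache in globals()" comment, since the diff elides that part of the resolver.

```python
import types
import warnings
from importlib import import_module

# Hypothetical stand-in package. Real code would define `__getattr__` at
# module level in `__init__.py` instead of attaching it to a ModuleType.
demo = types.ModuleType("demo_pkg")

_EXPORTS = {"sqrt": "math"}     # lazily resolved public names
_DEPRECATED = {"fsum": "math"}  # still reachable, but warn on every lookup


def _demo_getattr(name: str):
    # Deprecation branch: checked before the resolver, result never cached.
    if name in _DEPRECATED:
        warnings.warn(
            f"demo_pkg.{name} is deprecated; import it from "
            f"{_DEPRECATED[name]} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(import_module(_DEPRECATED[name]), name)
    # Base resolver: unchanged behaviour for every non-deprecated name.
    module_name = _EXPORTS.get(name)
    if module_name is None:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")
    value = getattr(import_module(module_name), name)
    setattr(demo, name, value)  # cached: the hook is not called again
    return value


# PEP 562: a `__getattr__` entry in the module's namespace is consulted only
# when normal attribute lookup fails.
demo.__getattr__ = _demo_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # disable de-duplication for the demo
    first = demo.fsum    # warns
    second = demo.fsum   # warns again, because nothing was cached
    resolved = demo.sqrt  # resolved lazily, then cached; no warning
```

Under Python's default filters, repeated warnings from the same call site are de-duplicated, which is why the demo switches to `"always"` before counting them.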
